100M-Row Challenge with PHP

(github.com)

93 points | by brentroose 5 hours ago

9 comments

  • brentroose 5 hours ago
    A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. With the help of many talented developers, I eventually got it to run in under 30 seconds. This optimization process was so much fun, and so many people pitched in with their ideas, that I eventually decided I wanted to do something more.

    That's why I built a performance challenge for the PHP community.

    The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are some prizes for the best entries (among the prizes is the very sought-after PhpStorm Elephpant, of which we only have a handful left).

    I hope people will have fun with it :)

    • Tade0 2 hours ago
      Pitch this to whoever is in charge of performance at Wordpress.

      A Wordpress instance will happily take over 20 seconds to fully load if you disable cache.

      • embedding-shape 1 hour ago
        Microbenchmarks are very different from optimizing performance in real applications in wide use, though; they could do great on this specific benchmark but still have no clue how to actually make something large like Wordpress perform OK out of the box.
      • monkey_monkey 1 hour ago
        That's often a skill issue.
    • gib444 2 hours ago
      > A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds

      That's a huge improvement! How much was low hanging fruit unrelated to the PHP interpreter itself, out of curiosity? (E.g. parallelism, faster SQL queries etc)

      • brentroose 2 hours ago
        Almost all, actually. I wrote about it here: https://stitcher.io/blog/11-million-rows-in-seconds

        A couple of things I did:

        - Cursor-based pagination
        - Combining insert statements
        - Using database transactions to prevent fsync calls
        - Moving calculations from the database to PHP
        - Avoiding serialization where possible
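        A minimal sketch of the "combining insert statements" idea, assuming a hypothetical `visits` table and naive value quoting (real code should use prepared statements):

```php
<?php
// Sketch: combine many single-row INSERTs into one multi-row statement.
// One statement with N value tuples avoids N round-trips (and, outside a
// transaction, N fsync calls). Table and column names are hypothetical.

function buildBatchInsert(string $table, array $columns, array $rows): string
{
    $cols = implode(', ', $columns);
    $tuples = array_map(
        fn(array $row) => '(' . implode(', ', array_map(
            fn($v) => is_int($v) ? (string) $v : "'" . addslashes((string) $v) . "'",
            $row
        )) . ')',
        $rows
    );
    return "INSERT INTO {$table} ({$cols}) VALUES " . implode(', ', $tuples);
}

$sql = buildBatchInsert('visits', ['url', 'count'], [
    ['/blog/php-enums', 1],
    ['/blog/11-million-rows-in-seconds', 2],
]);
echo $sql, PHP_EOL;
```

        Wrapping a few thousand of these batched statements in a single transaction is what avoids the per-statement fsync.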

        • tiffanyh 1 hour ago
          Aren’t these optimizations less about PHP, and more about optimizing how you're using the database?
          • hu3 1 hour ago
            It's still valid as an example to the language community of how to apply these optimizations.
          • swasheck 1 hour ago
            in all my years doing database tuning/admin/reliability/etc, performance problems have overwhelmingly been in the bad query/bad data pattern categories. the data platform is rarely the issue
      • Joel_Mckay 18 minutes ago
        In general, it is bad practice to touch transactional datasets in PHP script space. Like all foot-guns, it eventually leads to read-modify-write bugs.

        Depending on the SQL engine, there are many PHP Cursor optimizations that save moving around large chunks of data.

        Clean cached PHP can be fast for REST transactional data parsing, but it is also often used as a bodge language by amateurs. PHP is not slow by default, nor is it meant to run persistently (the low memory use is nice), but it still gets a lot of justified criticism.

        Erlang and Elixir are much better for clients/host budgets, but less intuitive than PHP =3

    • user3939382 2 hours ago
      exec('c program that does the parsing');

      Where do I get my prize? ;)

      • brentroose 2 hours ago
        The FAQ states that solutions like FFI are not allowed because the goal is to solve it with PHP :)
        • kpcyrd 1 hour ago
          What about using the filesystem as an optimized dict implementation?
          • olmo23 48 minutes ago
            this is never going to be faster because it requires syscalls
  • semiquaver 30 minutes ago
    Are they just confused about what characters require escaping in JSON string literals or is PHP weirder than I remember?

        {
            "\/blog\/11-million-rows-in-seconds": {
                "2025-01-24": 1,
                "2026-01-24": 2
            },
            "\/blog\/php-enums": {
                "2024-01-24": 1
            }
        }
    • CapitaineToinon 24 minutes ago
      That's the default output when using json_encode with the JSON_PRETTY_PRINT flag in php.
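      A quick way to see this, along with the flag that removes the escapes:

```php
<?php
// json_encode escapes forward slashes as "\/" by default,
// even with JSON_PRETTY_PRINT.
$data = ['/blog/php-enums' => ['2024-01-24' => 1]];

echo json_encode($data, JSON_PRETTY_PRINT), PHP_EOL;

// Adding JSON_UNESCAPED_SLASHES keeps "/" literal:
echo json_encode($data, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), PHP_EOL;
```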
    • poizan42 27 minutes ago
      > The output should be encoded as a pretty JSON string.

      So apparently that is what they consider "pretty JSON". I really don't want to see what they would consider "ugly JSON".

      (I think the term they may have been looking for is "pretty-printed JSON" which implies something about the formatting rather than being a completely subjective term)

  • Xeoncross 40 minutes ago
    This is why I jumped from PHP to Go, then why I jumped from Go to Rust.

    Go is the most batteries-included language I've ever used. Instant compile times mean I can run tests bound to ctrl/cmd+s every time I save the file. It's more performant (way less memory, similar CPU time) than C# or Java (and certainly all the scripting languages) and contains a massive stdlib for anything you could want to do. It's what scripting languages should have been. Anyone can read it, just like Python.

    Rust takes the last 20% I couldn't get in a GC language and removes it. Sure, its syntax doesn't make sense to an outsider and you end up with third-party packages for a lot of things, but you can't beat its performance and safety. It removes a whole lot of tests, as those situations just aren't possible.

    If Rust scares you, use Go. If Go scares you, use Rust.

    • codegeek 35 minutes ago
      I am not smart enough to use Rust, so take this with a grain of salt, but its syntax just makes me go crazy. Go/Golang, on the other hand, is a breath of fresh air. I think unless you really need that additional 20% improvement that Rust provides, Go should be the default for most projects between the two.
      • Xeoncross 4 minutes ago
        I hear you, advanced generics (for complex unions and such) with TypeScript and Rust are honestly unreadable. It's code you spend a day getting right and then no one touches it.

        I'm just glad modern languages stopped throwing and catching exceptions at random levels in their call chain. PHP, JavaScript and Java can (though not always) have unreadable error-handling paths, and they rarely augment the error with any useful information, so you're left relying on the stack trace to piece together what happened.

    • thinkingtoilet 3 minutes ago
      It's almost comical how often people bring up Rust. "Here's a fun PHP challenge!" "Let's talk about Rust..."
  • pxtail 2 hours ago
    Side note - I wasn't aware that there is an active collectors' scene for Elephpants, awesome!

    https://elephpant.me/

    • t1234s 2 hours ago
      Elephpants should be for second and third place. First place should be the double-clawed hammer.
    • thih9 1 hour ago
      Excellent project. My favorites: the joker, php storm, phplashy, Molly.
  • poizan42 36 minutes ago
    > The output should be encoded as a pretty JSON string.

    ...

    > Your parser should store the following output in $outputPath as a JSON file:

        {
            "\/blog\/11-million-rows-in-seconds": {
                "2025-01-24": 1,
                "2026-01-24": 2
            },
            "\/blog\/php-enums": {
                "2024-01-24": 1
            }
        }
    
    They don't define what exactly "pretty" means, but superfluous escapes are not very pretty in my opinion.
    • kijin 26 minutes ago
      They probably mean "Should look like the output of json_encode($data, JSON_PRETTY_PRINT)". Which most PHP devs would be familiar with.
      • poizan42 18 minutes ago
        It sounds plausible, but they really need to spell out exactly what the formatting requirements are, because it can make a huge difference in how efficiently you can write the JSON out.
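        For instance, assuming the challenge's fixed two-level shape (url => date => count), a hand-rolled writer can mimic PHP's JSON_PRETTY_PRINT output without a generic encoder. A sketch, not the official requirement:

```php
<?php
// Sketch: hand-roll PHP's JSON_PRETTY_PRINT format (4-space indent,
// "/" escaped as "\/") for a fixed url => date => count structure.
// Assumes keys contain no characters needing escaping other than "/".

function prettyJson(array $data): string
{
    $urls = [];
    foreach ($data as $url => $dates) {
        $lines = [];
        foreach ($dates as $date => $count) {
            $lines[] = "        \"{$date}\": {$count}";
        }
        // json_encode escapes "/" as "\/" by default; mimic that here.
        $escaped = str_replace('/', '\/', $url);
        $urls[] = "    \"{$escaped}\": {\n" . implode(",\n", $lines) . "\n    }";
    }
    return "{\n" . implode(",\n", $urls) . "\n}";
}

$data = ['/blog/php-enums' => ['2024-01-24' => 1]];
echo prettyJson($data), PHP_EOL;
```

        Because the output format is fixed, this avoids json_encode's generic type dispatch and escaping checks on every key and value.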
  • tveita 2 hours ago
    > Also, the generator will use a seeded randomizer so that, for local development, you work on the same dataset as others

    Except that the generator script generates dates relative to time() ?

  • Retr0id 2 hours ago
    How large is a sample 100M row file in bytes? (I tried to run the generator locally but my php is not bleeding-edge enough)
  • spiderfarmer 3 hours ago
    Awesome. I’ll be following this. I’ll probably learn a ton.
  • wangzhongwang 2 hours ago
    [dead]