haewyr

What does the rip in ripgrep really mean?

I'd heard ripgrep was fast, but I hadn't yet trained my fingers to switch to it because I rarely needed it. Then I had to throw 6275 patterns at grep to perform an inverse match against a 79MB CSV file with almost 350k records. The command took this form:

grep -f patterns.txt -v bigdataset.csv > myfiltereddataset.csv

I expected this might take a while, since it's an unreasonably heavy processing load, but I got bored of waiting, so I cancelled the operation, swapped out grep for rg and tried again.
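For anyone wanting to try the same swap, here is a minimal sketch on toy data. The file contents and the temporary directory are made up for illustration; only the -f and -v flags come from the command above. The rg line is guarded so the sketch still runs on a machine without ripgrep.

```shell
# Toy stand-ins for the real files (contents are invented for this example).
workdir=$(mktemp -d)
printf 'id,name\n1,alice\n2,bob\n3,carol\n' > "$workdir/bigdataset.csv"
printf 'bob\n' > "$workdir/patterns.txt"

# grep: -f reads one pattern per line, -v keeps only lines matching none of them.
grep -v -f "$workdir/patterns.txt" "$workdir/bigdataset.csv" > "$workdir/grep_out.csv"

# ripgrep accepts the same two flags; skipped if rg is not installed.
if command -v rg >/dev/null 2>&1; then
    rg -v -f "$workdir/patterns.txt" "$workdir/bigdataset.csv" > "$workdir/rg_out.csv"
fi

cat "$workdir/grep_out.csv"
```

One caveat: grep reads the pattern file as basic regular expressions by default, while rg uses its own regex syntax, so a pattern file containing regex metacharacters may not behave identically in both. If the patterns are meant as literal strings, both tools accept -F to treat them that way.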

It completed so fast I'm sure the Macbook Pro's SSD took longer to write the results to disk.

"Surely, this is too good to be true" I thought.

To be sure the output was correct I needed to compare against the grep output, so I tried with grep again. I went for a snack, ate the snack, made a brew, drank the brew, caught up on my email and my timeline, and watched as the output file slowly approached the same size as the output from ripgrep, like a train coming into a station.

<spongebob>A few moments later</spongebob>

Grep finished, and first I compared the hashes. They didn't match 🤨

"A-ha!" I shouted excitedly, "so Ripgrep lied to me! What a load of junk!" Proud of my earlier suspicion and commitment to the truth.

So I ran diff against the output files. One line was different: the top line containing the CSV header names. I ran head -1 on each file and eyeballed the result, column by column, character by character. The output on my screen looked identical, so I pasted both lines into a diff checker, which confirmed it: they were the same.

I assumed there must be an encoding difference, so I checked the encoding with file -I, and indeed grep's output was UTF-8 but ripgrep's was us-ascii, which is not my locale as I am not in the US, but whatever. I could now explain the hash difference, and for all intents and purposes the output files were identical.

Still in shocked disbelief, I used BurntSushi's other excellent utility, xsv, to count the records. Both came up 328062. That's ~20k fewer than the input data, so yes, both utilities did actually do what I asked. I'm sorry I doubted you, ripgrep, but with ridiculous performance like that, what did you expect?
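The checking steps can be sketched on toy data like this. Everything here is illustrative: the two files are invented, wc -l stands in for xsv count, and the byte order mark on the first file is only one hypothetical way two headers can look the same on screen yet differ in bytes. I never confirmed that was the cause in my case.

```shell
workdir=$(mktemp -d)
# a.csv starts with a UTF-8 byte order mark (octal 357 273 277); b.csv does not.
printf '\357\273\277id,name\n1,alice\n3,carol\n' > "$workdir/a.csv"
printf 'id,name\n1,alice\n3,carol\n' > "$workdir/b.csv"

# 1. Byte-for-byte comparison of the whole files (what a hash mismatch tells you).
if cmp -s "$workdir/a.csv" "$workdir/b.csv"; then same=yes; else same=no; fi

# 2. Dump the header bytes, since invisible characters won't show in a terminal.
head -n 1 "$workdir/a.csv" | od -c

# 3. Compare everything after the header line.
tail -n +2 "$workdir/a.csv" > "$workdir/a.body"
tail -n +2 "$workdir/b.csv" > "$workdir/b.body"
if cmp -s "$workdir/a.body" "$workdir/b.body"; then body_same=yes; else body_same=no; fi

# 4. Line count as a rough record count (includes the header line).
rows_a=$(wc -l < "$workdir/a.csv" | tr -d ' ')
rows_b=$(wc -l < "$workdir/b.csv" | tr -d ' ')

echo "whole files identical: $same"
echo "bodies identical:      $body_same"
echo "lines: $rows_a vs $rows_b"
```

Note that wc -l counts lines, not CSV records, so it would overcount if any quoted field contained a newline. That is why a CSV-aware counter is the safer choice on real data.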

Oh, and yes I did time both commands.

Grep: 1082.53s user 1.61s system 98% cpu 18:25.08 total

That's eighteen minutes and change.

Ripgrep: 0.20s user 0.04s system 91% cpu 0.256 total

A quarter of one second to complete the same job and output a 79MB CSV file.
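To put a number on the gap, here is the wall-clock ratio worked out from the two "total" figures above. It is plain arithmetic via awk, nothing more, and the variable names are just for this example.

```shell
# grep wall clock: 18 minutes 25.08 seconds; ripgrep wall clock: 0.256 seconds.
grep_seconds=$(awk 'BEGIN { printf "%.2f", 18 * 60 + 25.08 }')
speedup=$(awk 'BEGIN { printf "%.0f", (18 * 60 + 25.08) / 0.256 }')

echo "grep took ${grep_seconds}s, ripgrep took 0.256s"
echo "ripgrep was roughly ${speedup}x faster"
```

That works out to a bit over four thousand times faster on this one job, on this one machine, so treat it as an anecdote and not a benchmark.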

BurntSushi, I salute you. You appear to have found a way to rip a hole in spacetime and process data outside of the universe; it's the only explanation.