My sad story
Sigh.
I feel so dirty.
I finally did it.
Yes, I used AI (ChatGPT) to write code.
Still with me? Good, because I would really like to explain myself!
So I have this backup drive that I’ve just been dumping files to for years, and I wanted to re-purpose the drive, possibly moving some of the content into the cloud. Because the drive had become a dumping ground, it contained many duplicates, i.e., multiple copies of identical files, so my first task would have to be identifying and subsequently removing duplicate content. While I knew this task would be massive (there’s 2.4TB of data), I am a software developer and so it was natural for me to write scripts to simplify this task.
The first thing I needed to do was compare each file on the drive against every other file on the drive to find any exact duplicates. Directly comparing the files to each other would be very time-consuming, so I needed a more efficient approach. If only I could generate, for every file, some kind of fingerprint that could, for all intents and purposes, uniquely identify the content of that file.
Well, fortunately, there is a command-line tool, md5sum, that, given a file path, calculates a 128-bit value based on the content of the file. This 128-bit value, expressed as a 32-character hexadecimal string, behaves just like a fingerprint, uniquely identifying the file content: two identical files (clones of each other) will generate the same md5sum, irrespective of their filenames!
Now there are other ways to generate this fingerprint (the SHA family, for example), but I needed something that was extremely fast while still having a low probability of collision, i.e., two different files producing the same md5sum value. Well, md5sum is not only very fast, but the generated fingerprint has a near-zero probability of collision, roughly 1 in 2^128 (see more at the end of this post).
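For anyone who prefers to see it rather than take my word for it, here's a minimal Python sketch using the standard library's hashlib (not what I actually ran; I used md5sum directly) showing that the fingerprint depends only on the content, never the filename:

```python
import hashlib

# Identical content always yields an identical 32-character fingerprint,
# no matter what the two files happen to be called.
a = hashlib.md5(b"the same bytes").hexdigest()
b = hashlib.md5(b"the same bytes").hexdigest()
print(a)       # a 32-character hex string
print(a == b)  # True
```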
Despite being fast, md5sum does have to perform calculations over the entire contents of each file, and all 671,406 files comprising 2.4TB resided on a very slow external drive: a 5,400 RPM physical disk connected via USB 3. Generating a listing of the md5sum and filename for every file on the external drive took a long time… I started it around 3 p.m. one afternoon and it was still running when I went to bed, but it had finished by the next morning.
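What I actually ran was just md5sum over every file from the shell, but if you wanted that same pass as a single Python script, a rough sketch might look like this (the root and output paths below are placeholders, not my real ones):

```python
import hashlib
import os

# Hypothetical Python equivalent of the md5sum pass: walk the drive and write
# one "digest  path" line per file, mimicking md5sum's output format.
def write_listing(root: str, listing_path: str) -> None:
    with open(listing_path, "w") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                digest = hashlib.md5()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1024 * 1024), b""):  # 1 MiB chunks
                        digest.update(chunk)
                out.write(f"{digest.hexdigest()}  {path}\n")

# write_listing("/mnt/backup", "checksums.txt")  # both paths are placeholders
```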
Now I could take the results (a single file listing the fingerprint and pathname of each file on the external drive) and, for each fingerprint, check whether any other entries shared that fingerprint; if any did, record the names of the duplicated files. I was able to implement this as a bash script, complete with progress reporting (so I could watch it run), in about 15 minutes, including testing it against a subset of data manipulated to exercise all the execution cases. Armed with my new bash script, I ran it against my data, and while I was happy to see that it worked, I was sorely dismayed by its performance. It was taking nearly 3/4 of a second to process each file… at that rate it would take more than 7 days to finish!
Now I will admit my bash script could stand some improvement, but I knew that python could easily outperform bash task-for-task, so I decided I needed to rewrite my script in python. My relationship with python is interesting. Unlike other languages I have tried to master, python has remained something I feel I am still learning… when tasked with writing something in python, I take more of a “let’s get it to work” approach. This comes from the disposable nature of much of the python I have written… as with this exercise, I often write a python script to resolve a particular issue, use it, and then have no further need for it. Sure, I have written many “designed” or “engineered” python scripts, but more often than not, python is little more than a way to create a disposable tool.
And now I can get to the point of this post… here I was with a bash script that, while working fine, would take days to complete its work. I knew that even though it was disposable, it was still worth the effort to write an equivalent python script to get my work done in a more timely manner. So I fired up vim and started to write the equivalent python script, and then it hit me: why not ask ChatGPT to take my bash script and convert it to python? It was worth a try, right?
I opened ChatGPT, typed “Convert the following bash script into python”, pasted the bash script into the prompt, and pressed ENTER. In less than 15 seconds I had ChatGPT’s response… which I copied and pasted into a file on my local drive, naming it doit.py. I expected I would have to do some work to get it running, an expectation reinforced by the fact that ChatGPT didn’t include the #!/usr/bin/python3 hash-bang needed to make the file executable natively rather than having to invoke python on the script. I fired it up and…
It just worked! No debugging… no futzing with it… it just worked. And it was much faster than the bash equivalent!
Now I will admit that yes, while it was faster (about 10 times so, based on my wet-finger profiling), it could still benefit from some tweaking to make it even faster. But that was not the point of the exercise. The point was: could ChatGPT generate a python script from my bash variant that was, at the very least, a good start? Yes, in this case it certainly could. Does this mean I will always have such a trouble-free experience? Maybe not, especially as this test used a very simple bash script.
But I decided to see if ChatGPT could do better. I added “Can you rewrite it to make it faster” and it converted my script to use what’s known as a dictionary (something I was surprised to notice the original did not do). The new version finished in 0.68s. That’s quite an improvement over 7+ days!
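I won’t reproduce ChatGPT’s output here, but the shape of the dictionary-based approach is easy to sketch: read the listing once, group paths by fingerprint, and keep only the groups with more than one entry (the checksums.txt filename below is just a placeholder):

```python
from collections import defaultdict

# Sketch of the dictionary-based pass (not ChatGPT's actual output): read the
# listing once, group paths by fingerprint, and report any group bigger than one.
def find_duplicates(listing_path: str) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    with open(listing_path) as listing:
        for line in listing:
            # md5sum output format: "<32-character digest>  <path>"
            digest, path = line.rstrip("\n").split("  ", 1)
            groups[digest].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}

# for digest, paths in find_duplicates("checksums.txt").items():  # placeholder filename
#     print(digest, *paths, sep="\n  ")
```

The reason this is so much faster is that each fingerprint is looked up in constant time instead of rescanning the entire listing for every file.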
But was this experiment a success? Yes, it most certainly was, and it motivates me to use this technique again, at least for my own personal needs.
Would I use this in a work environment where I am being paid to develop code? At this point, probably not, but probably not for the reasons you’re thinking. The main reason I would be uncomfortable using this in a work environment comes down to intellectual property ownership. I’m not going to take the intellectual property of my employer and paste it into ChatGPT… but for use at home, this capability is promising. My reluctance to use this for commercial code would be mitigated if my employer had a locally running instance of some coding AI model, but that’s a topic for the future.
My experience with using ChatGPT to generate code, albeit driven by code I had already written and debugged, is very positive. I’d be interested to see how this is being used by others, so feel free to let me know if you’ve done similar things or if your company has embraced this idea.
Bonus – how reliable is md5sum at generating a unique fingerprint for a file?
How confident am I that the 32-character value md5sum generates will be sufficient to uniquely identify the contents of a file? Well, the probability of any two given files accidentally producing the same hash is roughly 1 in 2^128.
But what is 2^128? Well, it’s a lot… specifically, there are 340,282,366,920,938,463,463,374,607,431,768,211,456 possible MD5 values, so the odds of any two of my files accidentally sharing a checksum are vanishingly small!
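If you’d like to check that figure yourself, Python’s arbitrary-precision integers make it a one-liner:

```python
# Quick sanity check of the figure quoted above.
n = 2 ** 128
print(f"{n:,}")     # 340,282,366,920,938,463,463,374,607,431,768,211,456
print(len(str(n)))  # 39 digits
```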
For those wondering, 340,282,366,920,938,463,463,374,607,431,768,211,456 in words is:
- Three hundred and forty undecillion,
- two hundred and eighty-two decillion,
- three hundred and sixty-six nonillion,
- nine hundred and twenty octillion,
- nine hundred and thirty-eight septillion,
- four hundred and sixty-three sextillion,
- four hundred and sixty-three quintillion,
- three hundred and seventy-four quadrillion,
- six hundred and seven trillion,
- four hundred and thirty-one billion,
- seven hundred and sixty-eight million,
- two hundred and eleven thousand,
- four hundred and fifty-six.
Credit for the spoken form of 2^128 goes to https://www.techtarget.com/whatis/feature/IPv6-addresses-how-many-is-that-in-numbers