Methods for testing data-heavy applications – Part I
For the past two months, I have been building a piece of code for a very interesting company that is not on the Fortune 500 list yet, but if it goes public it may very well be a Fortune 1 company! This application is very heavy on data. While I am still developing it, I feed it 100-megabyte chunks of data; it goes through labor for a couple of minutes, crunches the numbers, and eventually comes back with a table of about 100 rows that I can run some statistical analysis on and plot. That big chunk of data gets reduced to bite-size, human-readable numbers and graphs. We are still testing it, and I honestly have no idea how it will perform once it gets connected to the company's data firehose, and that drives me insane!
I have been testing this software, and it works fine on all of my test datasets. I have also been scrolling through its algorithms, which span over a thousand lines of code, to make sure they work as I expect. I am going insane over this. What if we ship it next week and there is a bug in it? It is not like a Google+ application that you write, ship, and then notice that instead of the photo of your female friend it brings back the photo of your dad! Here, there is no way to see the bug once the product is out. Your users cannot report the bugs because they won't see any of them.
I have been looking into the scientific literature, and there are tons of methods in the field of software development with tons of different names, but none of them is about the type of software that I am working on. So I asked on Twitter how other people are testing their software, and here are some of the responses I received.
1- Take the current data files, manually perform your algorithm on the data set (using something like Excel), and see if the final results match.
2- Fabricate test data sets for which you know the final result, and run the code on those test files. For example, say you have written code for summation: you can feed it an array of ones and check that the result equals the length of the array. Test boundary cases, test data that produce intuitive outputs, and if you know your algorithm should have a specific behavior, test for that too (like when you expect symmetry in your output).
3- Hopefully you have broken your code into atomic blocks that each do only one thing (invert a matrix, compute an SVD, etc.). Write unit tests for those functions individually. Then combine those individual functions into larger functions and components, test those too, and so on.
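To make suggestions 2 and 3 a bit more concrete, here is a minimal sketch in Python. It is not the code from my project: `total`, `invert_2x2` and `matmul_2x2` are hypothetical stand-ins for the kind of atomic blocks you might have, and the test data is fabricated so that the right answer is known in advance.

```python
import math


def total(values):
    """Hypothetical unit under test: sum a sequence of numbers."""
    return math.fsum(values)


def invert_2x2(m):
    """Hypothetical atomic block: invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul_2x2(x, y):
    """Helper for the test: multiply two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def test_total_on_fabricated_data():
    # Known answer: an array of n ones must sum to n.
    assert total([1] * 1000) == 1000
    # Boundary case: empty input.
    assert total([]) == 0
    # Expected symmetry: every value paired with its negative cancels out.
    data = [0.5, 2.25, 7.0]
    assert total(data + [-v for v in data]) == 0


def test_invert_2x2():
    # Property check: a matrix times its inverse is the identity,
    # up to floating-point tolerance.
    m = [[4.0, 7.0], [2.0, 6.0]]
    product = matmul_2x2(m, invert_2x2(m))
    identity = [[1.0, 0.0], [0.0, 1.0]]
    for i in range(2):
        for j in range(2):
            assert math.isclose(product[i][j], identity[i][j], abs_tol=1e-12)
    # Boundary case: a singular matrix must be rejected, not silently inverted.
    try:
        invert_2x2([[1.0, 2.0], [2.0, 4.0]])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a singular matrix")


if __name__ == "__main__":
    test_total_on_fabricated_data()
    test_invert_2x2()
    print("all checks passed")
```

The same pattern should scale up: once the small blocks pass, you combine them into larger components and test those against fabricated inputs whose output you can work out by hand.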
I am even more interested in your thoughts. What are your suggestions for testing big data applications?