Have you been asked to generate data for a demo system? Here’s some advice.

One of the smartest decisions I saw was related to the data needed for a demo website. The website had to look as if a real customer had been using the system for a while and had accumulated a lot of data.

The salesperson would walk a customer through the demo. They would show some typical customer problems and demonstrate how the software could be used to fix them. The data in the system had to support these stories.

We couldn’t just take live customer data and put it into the demo. That would be a privacy violation. It wouldn’t support the story the demo presenter was telling. Worst of all, we’d have to rebuild the data from scratch every time we updated the demo.

To fix this, one of the developers created a system to generate the demo data from other data.

The system generated all the data used in the demo from various sources:

  • Take real customer data, clip out the section we’d use, and anonymize it.
  • Transform existing data with a function.
  • Generate fake data from scratch with a function.
  • Read hand-crafted data from a file.
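For the anonymization step, here is a small sketch of how it might look in Python. The column names and the hashing scheme are my assumptions, not the original code; any deterministic scrubbing would do:

```python
import hashlib

def anonymize_rows(rows, fields=("name", "email")):
    """Replace identifying fields with stable fake tokens.

    Hashing keeps the substitution consistent: the same real value
    always maps to the same fake token, so relationships between
    rows survive anonymization.
    """
    scrubbed = []
    for row in rows:
        row = dict(row)  # don't mutate the caller's data
        for field in fields:
            if field in row:
                digest = hashlib.sha256(row[field].encode()).hexdigest()[:8]
                row[field] = f"{field}-{digest}"
        scrubbed.append(row)
    return scrubbed
```

The determinism matters: re-running the generator produces the same anonymized values, so the demo stays stable between iterations.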

The data was generated with a “program” that looked like this:


   # Sales people need to be able to show "problem X".
   # We found this data in customer1's dataset, but we
   # only need the first 200 rows:
   AnonymizeAndInject("customer1.data", 200)

   # The next thing sales will demonstrate is what it
   # looks like when Problem X is fixed.
   # We've written function X that generates data that looks that way.
   # It bases this off customer1's broken data.
   GenerateAndInject(X, "customer1.data")

   # There is a requirement that at least one "problem Y"
   # will be seen in the data. We hand-created that data
   # and read it in from a file:
   InjectFromFile("problemY.data")

Because we generated the demo data this way, it was easy to regenerate and iterate. For example, the director would come to us and say “more cowbell!” and we could add a GenerateAndInject(cowbell) call. The next day we’d be told “the cowbell looks too blue. Can it be red instead?” and we would add code to turn it red. Re-run the code and we were ready to show the next iteration.

This was so much better than hand-editing the data.

This really paid off a few months later when it was time to update the demo. They wanted three changes: use newer customer data, add data to support a new storyboard, and adapt to the new schema of the product’s underlying database.

Oh, and it still needs to do all the things the old demo did.

If we had hand-crafted the demo data, these changes would have been nearly impossible. We would have had to reproduce each manual change and update. Who the heck could remember every little change we made?

Luckily we didn’t have to remember. The code told us every decision we had made. Heck, there was even a Makefile that embodied how the customer data was extracted and cleaned.

We were able to make the requested changes easily. The fact that the underlying database schema had changed wasn’t a big problem because the generator used the same library as the product. It “just worked”.

Yes, we did manually go over each storyboard and make sure we didn’t break any of the “stories” told during the demo. We probably could have written unit tests to make sure we didn’t break or lose them, but in this case manual testing was OK.
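Had we automated it, the checks could have been tiny: one assertion per storyboard over the generated rows. The field name and marker values below are hypothetical, just to show the shape:

```python
def check_stories(rows):
    """Fail loudly if the generated demo data no longer supports
    the demo storyboards; each assertion mirrors one 'story'."""
    # Story: at least one "problem Y" must be visible in the data.
    assert any(r.get("status") == "problem-Y" for r in rows), \
        "demo data no longer contains a problem-Y example"
    # Story: the data must also show problem X in its fixed state.
    assert any(r.get("status") == "fixed-X" for r in rows), \
        "demo data no longer shows problem X being fixed"
```

Run it as the last step of the generator and a broken story fails the build instead of failing in front of a customer.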

I highly recommend this kind of technique any time you need to make a synthetic data set. One is commonly needed for a demo disk, for developer test data, and in many other situations.

The tools for making this kind of system are much better today. We did this all in Perl, awk, and sed. Python’s string handling would make this a lot easier. I hear good things about R’s ability to tidy up data.

If you have suggestions for related tools (or related anecdotes!), please post in the comments!