Spreadsheets as a data exchange format

Jesse Luehrs (@doyster)
March 27th, 2014

When working with non-technical clients, their preferred means of exchanging structured data is often a spreadsheet. Using a custom tool is not always practical due to cost or training time constraints, and using a format that doesn’t have its own standard editor (such as XML or JSON) will generally result in having to deal with malformed files on a regular basis, since those files end up being edited by hand.

Excel is more or less the only program for managing structured data that is widely used by both technical and non-technical people, and being able to take advantage of that common ground can make the whole data exchange process much smoother, even if it is frustrating at times.

Tools

Luckily, CPAN has a good set of modules for reading and writing the various kinds of spreadsheet files. I use the following:

- Spreadsheet::ParseExcel, for reading the older binary .xls format
- Spreadsheet::ParseXLSX, for reading the newer XML-based .xlsx format
- Excel::Writer::XLSX, for writing .xlsx files
- Text::CSV, for reading and writing CSV files
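For example, reading every cell out of an .xlsx file with Spreadsheet::ParseXLSX looks something like this (the filename is a placeholder):

    use strict;
    use warnings;
    use Spreadsheet::ParseXLSX;

    # "clients.xlsx" is just an example name
    my $workbook = Spreadsheet::ParseXLSX->new->parse('clients.xlsx')
        or die "couldn't parse clients.xlsx";

    for my $worksheet ($workbook->worksheets) {
        my ($row_min, $row_max) = $worksheet->row_range;
        my ($col_min, $col_max) = $worksheet->col_range;
        for my $row ($row_min .. $row_max) {
            for my $col ($col_min .. $col_max) {
                # get_cell returns undef for cells that were never populated
                my $cell = $worksheet->get_cell($row, $col) or next;
                printf "%s (%d, %d): %s\n",
                    $worksheet->get_name, $row, $col, $cell->value;
            }
        }
    }

Spreadsheet::ParseExcel provides the same interface for .xls files, so the same traversal code can handle both formats by choosing a parser class based on the file extension.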

In addition, I have a couple of unreleased scripts (here and here) for turning an Excel file into a set of CSV files and back, since it is occasionally easier to convert to CSV, make a few edits by hand in a text editor, and convert back to Excel than to write a script to do it. CSV files are also diffable, which makes this a useful way to detect changes between different versions of an Excel file.
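The Excel-to-CSV direction is small enough to sketch here; this isn’t the actual script, just an illustration that writes one CSV file per worksheet, named after the sheet:

    use strict;
    use warnings;
    use Spreadsheet::ParseXLSX;
    use Text::CSV;

    my $file = shift @ARGV or die "usage: $0 <file.xlsx>\n";
    my $workbook = Spreadsheet::ParseXLSX->new->parse($file)
        or die "couldn't parse $file";
    my $csv = Text::CSV->new({ binary => 1, eol => "\n" });

    for my $worksheet ($workbook->worksheets) {
        my $out = $worksheet->get_name . '.csv';
        open my $fh, '>:encoding(UTF-8)', $out or die "couldn't open $out: $!";

        my ($row_min, $row_max) = $worksheet->row_range;
        my ($col_min, $col_max) = $worksheet->col_range;
        for my $row ($row_min .. $row_max) {
            # unpopulated cells become empty strings so the columns stay aligned
            my @fields = map {
                my $cell = $worksheet->get_cell($row, $_);
                defined $cell ? $cell->value : '';
            } $col_min .. $col_max;
            $csv->print($fh, \@fields);
        }
        close $fh or die "couldn't write $out: $!";
    }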

If you need to debug files that are being misparsed, or if you need to implement Excel support in a language that doesn’t yet have a decent implementation, the relevant file format specifications are documented quite copiously (and somewhat impenetrably) here (.xlsx, the ECMA-376 Office Open XML specification) and here (.xls, Microsoft’s [MS-XLS] binary format specification).

Best practices

Mitigating data errors

Don’t let the structure of the spreadsheet drive your actual data model (or vice versa). It’s likely that the spreadsheet will be created by someone who isn’t familiar with how the system works underneath, and these documents are often structured in a way that emphasizes readability by humans over readability by computers. It’s quite possible for the document to change in ways that make sense from their business perspective, but not from the perspective of how the system works. Keeping a separation between “how the data is delivered to us” and “how the system actually understands the data” (the latter preferably driven by actual requirements documents) will make things run more smoothly.
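As a concrete (and entirely hypothetical) illustration of that separation, the import layer can own an explicit mapping from spreadsheet headers to internal field names, so that a renamed or reordered column only ever touches the mapping, never the data model:

    # Hypothetical mapping, owned by the import layer rather than the data
    # model. If the client renames "Customer" to "Client Name", only this
    # table has to change.
    my %field_for_header = (
        'Customer'   => 'customer_name',
        'Item #'     => 'sku',
        'Qty'        => 'quantity',
        'Unit Price' => 'unit_price',
    );

    sub row_to_record {
        my ($headers, $values) = @_;    # parallel arrayrefs from one row
        my %record;
        for my $i (0 .. $#$headers) {
            my $field = $field_for_header{ $headers->[$i] }
                or die "unexpected column '$headers->[$i]'";
            $record{$field} = $values->[$i];
        }
        return \%record;
    }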

In addition, these kinds of changes make thorough error checking much more important. For these situations, error checking probably matters even more than testing, because it’s quite difficult to predict ahead of time the ways in which things can go wrong. Enforcing a very strict document structure and throwing an error as soon as something unexpected is seen is preferable to trying to guess and ending up with malformed data. If you can impress upon the client the need for a strict document structure, even better, but this isn’t always possible.
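For example, a header check that dies with a precise location as soon as the document deviates from what the importer expects (the expected column names here are made up):

    # Hypothetical canonical header row for this document
    my @expected_headers = ('Customer', 'Item #', 'Qty', 'Unit Price');

    sub check_headers {
        my ($worksheet) = @_;
        for my $col (0 .. $#expected_headers) {
            my $cell = $worksheet->get_cell(0, $col);
            my $got  = defined $cell ? $cell->value : '';
            die sprintf(
                "sheet '%s', header column %d: expected '%s', got '%s'\n",
                $worksheet->get_name, $col, $expected_headers[$col], $got,
            ) if $got ne $expected_headers[$col];
        }
    }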

If you aren’t able to enforce a structure, it’s often better to take a two-stage approach: first, write a strict importer that expects a single, well-defined document and does lots of error checking, and then do whatever you have to do to the data you actually receive to make it fit that document, if possible. It’s rarely worth the effort to modify the parser itself every time the structure of the document changes, since that may happen quite frequently, and every change to the parser is another opportunity to introduce bugs.
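Sketched out with the same made-up columns as above, the fixup stage is cheap and disposable, while the strict importer never has to change:

    # Stage 1: per-delivery fixups. Expected to be rewritten (or deleted)
    # whenever the client changes the document.
    sub normalize_rows {
        my ($rows) = @_;    # arrayref of hashrefs keyed by header name
        for my $row (@$rows) {
            # hypothetical fixups: a renamed column and a reformatted value
            $row->{'Qty'} = delete $row->{'Quantity'}
                if exists $row->{'Quantity'};
            $row->{'Unit Price'} =~ s/^\$//
                if defined $row->{'Unit Price'};
        }
        return $rows;
    }

    # Stage 2: the strict importer, which expects exactly the canonical
    # layout and dies loudly on anything else.
    sub import_rows {
        my ($rows) = @_;
        for my $row (@$rows) {
            for my $col ('Customer', 'Item #', 'Qty', 'Unit Price') {
                die "missing required column '$col'\n"
                    unless defined $row->{$col};
            }
            # ... hand %$row off to the real data model here ...
        }
    }

    # import_rows(normalize_rows($rows_from_spreadsheet));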

Other tricks and things to watch out for
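One classic example: Excel stores dates as serial numbers (days since an epoch) with a display format attached, so the raw contents of a cell are often not what appears on screen. The parsing modules above expose both views of a cell:

    my $cell = $worksheet->get_cell($row, $col);
    print $cell->value;          # formatted, e.g. "2014-03-27" for a date cell
    print $cell->unformatted;    # raw stored value, e.g. 41725 for that date

Whether you want the formatted or the raw value depends on whether the on-screen representation or the underlying number is what your system actually cares about.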

Conclusion

Data exchange is a very common task when dealing with clients, and clients themselves typically use Excel. Learning to work within their processes (which typically exist because they already work) can be less difficult than you might think, and may save both you and the client a lot of time and effort.

Tags: technology perl communication