Book Review—O’Reilly’s Bad Data Handbook (January 2013)

A disappointing hotchpotch from 19 authors covers mostly web data capture and analysis.

O’Reilly’s formula for success is a) identify a good subject, b) find a subject matter expert and c) perform a brain dump. The Bad Data Handbook* (BDH) succeeds with step a) but that’s about it. BDH is a hotchpotch of essays from nineteen authors. Some manage to stay on message. The chapters on the cloud, on social media and on caring for machine learning experts less so.

Kevin Fink provides an interesting peek (with code) at processing web log data. Paul Murrell offers advice on getting data out of ‘awkward’ formats like Excel (use XLConnect) and processing it with ‘R’.

We enjoyed Josh Levy’s chapter on ‘bad data in plain text’ with an authoritative account of character encodings and text processing in Python. Adam Laiacano’s chapter on scraping data from web pages does a good job of showing what an ugly task this can be. For one website using Flash, this meant running Matlab scripts to extract text from screen grabs! Jacob Perkins’ ‘detecting liars on the web’ describes how Python’s NLTK library for natural language processing is used to classify movie reviews. Interesting but again, somewhat off topic!
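To give a flavor of the plain-text encoding problem, here is a minimal sketch of our own. It is not taken from the book. It assumes the candidate encodings are known in advance; the helper name `decode_with_fallback` and the sample strings are hypothetical.

```python
# Illustrative only: a hypothetical helper, not code from the book.
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    # latin-1 maps every byte to a code point, so this last resort never
    # fails, although the resulting text may be wrong.
    return raw.decode("latin-1"), "latin-1"


# The same word stored by two different systems.
utf8_bytes = "café".encode("utf-8")
legacy_bytes = "café".encode("cp1252")

for sample in (utf8_bytes, legacy_bytes):
    text, used = decode_with_fallback(sample)
    print(repr(sample), "->", text, "(decoded as", used + ")")
```

Trying the strictest encoding first matters. UTF-8 rejects most byte sequences produced by legacy encodings, whereas latin-1 accepts anything and so can only serve as a last resort.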

A problem with BDH is that the subject means different things to different people. Phil Janert’s chapter covers defect reduction in manufacturing, analyzing call center data and making the most of data with statistics-based hypothesis testing. BDH is very much in the modern world of NoSQL, file databases and the web. The topics of database integrity and naming conventions are not covered, even though these are key routes to clean data.

Ethan McCallum makes a brave attempt to tie all this together but his is less of an editor’s role, more one of an applier of lipstick to the pig. Again, the problem with BDH is the subject and the fact that the book is mostly about making sense of data as it is found on the web. The issue of how to avoid creating bad data in the first place is not covered, which is a shame as this is arguably more important.

* by Ethan McCallum. O’Reilly 2013. ISBN 9781449321888.

Web use only - not for intranet/corporate use. Copyright © 2013 The Data Room - all rights reserved.