Book Review—Data analysis with open source tools

A new book from O’Reilly Media provides insight into statistics—targeting the ‘risk area’ of using statistical concepts with ‘a limited understanding of what they really mean.’

Philipp Janert’s new book¹, ‘Data analysis with open source tools’ (DA), is described as a ‘hands-on guide for programmers and data scientists,’ but it is more than that. Janert sets out to address two common data analysis ‘risk areas.’ One is the use of statistical concepts ‘with a limited understanding of what they really mean.’ The other is the deployment of ‘complicated, expensive black box solutions’ in place of a ‘simple, transparent approach.’

One scene-setting anecdote involves a client whose IT department recommended a cluster-based neural net to analyze product defect data. The solution Janert found was rather more economical: a one-line calculation, not even code! Janert offers some useful advice on educating the customer, observing that ‘few clients are in a good position to ask meaningful questions.’ This inevitably means that the statistician needs to be cognizant of the client’s business and terminology.

This book offers a deep approach to real-world tasks. There is much more text than code. Janert’s contention that ‘statistics is usually equated with a college class that made no sense at all’ will chime with many. His promise to ‘explain what statistics really is’ should excite. He also provides insight into computational issues. For instance, the computed values of the sine and cosine functions for very large x eventually become no better than random numbers, because the gap between adjacent floating point values grows until it is comparable to the functions’ period.
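
The floating point issue can be checked in a few lines of Python. What follows is a minimal illustrative sketch, not code from the book. The helper name float_spacing is hypothetical, and the code assumes IEEE 754 double precision (53 significant bits), which is what Python floats use on common platforms. It prints the gap between neighboring floats at several magnitudes, alongside that gap as a fraction of the 2π period of sine and cosine.

```python
import math

def float_spacing(x):
    """Gap between x and the next larger double.

    Assumes x is a positive, finite, normal IEEE 754 double.
    """
    # frexp returns (m, e) with x == m * 2**e and 0.5 <= m < 1
    _, exponent = math.frexp(x)
    # A double carries 53 significant bits, so the last bit is worth 2**(e - 53)
    return math.ldexp(1.0, exponent - 53)

if __name__ == "__main__":
    period = 2 * math.pi
    for x in (1.0, 1e8, 1e16, 1e22):
        gap = float_spacing(x)
        print(f"x = {x:.0e}  spacing = {gap:.3g}  spacing/period = {gap / period:.3g}")
```

Near x = 1e16 the spacing is already 2, about a third of a full period, so neighboring representable values of x land on quite different parts of the sine curve. By 1e22 a single step spans many thousands of periods.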

While the thrust of DA is solving business (rather than scientific) problems, Janert is a polymath who is interested in his subject. This comes across particularly in the chapter on classical statistics, which includes exposés on significance and design of experiments, and a fascinating section on the Bayesian and frequentist approaches. The latter is particularly apropos for the seismic imaging community, as we learned last month during the SEG’s Albert Tarantola memorial.

Also of interest to the oil and gas community is the section on financial analysis with clear explanations of net present value, risk analysis and opportunity costs.

Janert’s tools of choice are Python (with NumPy and SciPy) and Unix. On which topic, Janert curiously saves some unequivocal advice for page 494, ‘Work on Unix—I mean it. Unix was developed for precisely the kind of ad-hoc programming […] that encourages you to devise solutions.’

It is hard to fault this book, except perhaps that, at nearly 500 pages, it is too short! The couple of pages on Map/Reduce, for instance, fall well short of Janert’s pedagogical aims. But quibbles apart, DA gets a double thumbs up from this reviewer.

1 O’Reilly, ISBN 9780596802356.

This article originally appeared in Oil IT Journal 2010 Issue # 12.
