Book Review: Python Data Science Handbook: Essential Tools for Working with Data
One key factor in Python's immense growth over the past couple of years is the PyData ecosystem. There is a wealth of powerful and reliable tools available for Python, which researchers from many different fields use for working with data. Five of the most essential tools, IPython, NumPy, pandas, Matplotlib and Scikit-learn, are covered in depth in the Python Data Science Handbook written by Jake VanderPlas.
The author is a well-known member of the Python community. Jake has given numerous talks at PyCon and PyData conferences, writes the popular Pythonic Perambulations blog and contributes to many open source projects. One of particular interest is the declarative visualization library Altair.
The Python Data Science Handbook is divided into five main sections, each dedicated to one of the essential tools/libraries mentioned before. The book provides a solid introduction to the tools and a broad overview of their features. Along the way it teaches data science concepts and shows best practices for tasks such as handling missing values, joining datasets, grouping, aggregating and visualizing data.
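To give an idea of what these tasks look like in practice, here is a minimal pandas sketch of my own, not an example taken from the book. The `sales` and `stores` tables and all their values are made up for illustration. It fills a missing value, joins two datasets, and then groups and aggregates the result.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one revenue value is missing (NaN).
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "C"],
    "revenue": [100.0, np.nan, 80.0, 120.0, 50.0],
})
stores = pd.DataFrame({
    "store": ["A", "B", "C"],
    "region": ["North", "North", "South"],
})

# Handle the missing value by filling it with the mean of the observed revenues.
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].mean())

# Join the two datasets on their shared "store" column.
merged = sales.merge(stores, on="store", how="left")

# Group by region and aggregate the revenue per group.
summary = merged.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)
```

Filling with the mean is only one of several possible strategies. Depending on the data, dropping the incomplete rows with `dropna()` may be the better choice.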
For me, the most valuable part is the one about NumPy, which starts with an explanation of data types in Python and shows why and how you can drastically reduce the runtime with NumPy. I mostly use NumPy indirectly because it's the foundation of many other tools. Learning about what NumPy can do by itself can be very useful in other areas as well.
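The runtime difference is easy to try out yourself. The following is a small sketch of my own, not code from the book, that computes reciprocals once with an explicit Python loop and once with a single vectorized NumPy expression. The function name, array size and seed are arbitrary choices, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np


def reciprocals_loop(values):
    """Compute 1/x for every element with an explicit Python loop."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output


rng = np.random.default_rng(seed=0)
values = rng.integers(1, 10, size=100_000)  # integers from 1 to 9

# Both approaches produce the same numbers.
loop_result = reciprocals_loop(values)
vectorized_result = 1.0 / values  # one operation on the whole array

# Time three runs of each approach.
loop_time = timeit.timeit(lambda: reciprocals_loop(values), number=3)
vectorized_time = timeit.timeit(lambda: 1.0 / values, number=3)
print(f"loop: {loop_time:.4f} s, vectorized: {vectorized_time:.4f} s")
```

The vectorized version pushes the loop into compiled code that works on a typed array, so the interpreter does not have to check the type of every element on every iteration.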
With the exception of Scikit-learn, I use the tools covered in this book frequently and think it provides a lot of useful and applicable information. Obviously, any book with code examples in this field will sooner rather than later include dated information. Jake mentions this at some point in the book, and he frequently refers the reader to the respective online documentation.
The book was written as Jupyter notebooks, which are available on GitHub. You can run the example code if you follow the instructions to set up the environment correctly. When you want to use these tools in your own work, though, you should certainly install the latest stable releases. These projects have evolved a lot in the past few years; see, for example, the Git contributions graph for Matplotlib.
Criticisms that were valid for earlier versions of Matplotlib have been addressed. This includes better default colors and styles and support for working with labeled data such as pandas DataFrames. I think Matplotlib is a particularly striking example of how things change, which is good to be aware of when reading a book in a scientific field that seems to be exploding.
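As an illustration of the labeled data support, here is a short sketch of my own that assumes a reasonably recent Matplotlib release. The DataFrame and its column names are made up. Instead of passing arrays, you pass column names as strings together with the `data` keyword argument.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"month": [1, 2, 3, 4], "value": [10, 25, 15, 40]})

fig, ax = plt.subplots()

# Column names are looked up in the object passed via `data`.
lines = ax.plot("month", "value", data=df)
ax.set_xlabel("month")
ax.set_ylabel("value")
```

In a notebook you would normally skip the `matplotlib.use("Agg")` line and let the figure render inline.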
Overall, the Python Data Science Handbook is a very well written and readable book. It includes many code examples and, most importantly, teaches the concepts and skills you need to derive anything meaningful from your work with data. If you know Python and want to get into data science, this book is a very good starting point. From my own experience I can tell you that you will use NumPy, pandas and Matplotlib, either directly or through a library built on top of them. So this book provides you with a lot of essential knowledge.
I would like to end by thanking Jake and O'Reilly for providing a free-to-read version of this book, and the people who contribute to the projects covered, which are all free to use. This is easily taken for granted but truly worth being grateful for.
About this Book
- Author(s): Jake VanderPlas
- Released: November 2016
- Publisher: O'Reilly Media
- Language: English
- Format: Paperback, 541 pages
- ISBN-10: 1491912057
- My rating: 5 out of 5
This post was written by Ramiro Gómez (@yaph).
Tags: book review data science pandas python matplotlib