I'm happy to announce that I'll be writing the second edition of Data Science at the Command Line (O'Reilly, 2014). This issue explains why I think a second edition is needed, lists what changes I plan to make, and presents a tentative outline. Finally, I have a few words about the process and giving feedback.
Why a second edition?
While the command line as a technology and as a way of working is timeless, some of the tools discussed in the first edition have (1) been superseded by newer tools (e.g., csvkit has been replaced by xsv), (2) been abandoned by their developers (e.g., drake), or (3) turned out to be suboptimal choices (e.g., weka). Since the first edition was published in October 2014, I have learned a lot, both through my own experience and through the helpful feedback from its readers. Even though the book is quite niche, because it lies at the intersection of two subjects, there remains a steady interest from the data science community. I notice this from the many positive messages I receive almost every day. By updating the first edition, I hope to keep the book relevant for at least another five years.
Changes with respect to the first edition
These are the general changes I currently have in mind. Please note that this is subject to change.
- Throughout the book: replace `csvkit` with `xsv` as much as possible. `xsv` is a much faster alternative for working with CSV files (see the sketch after this list).
- Section 1.6: Replace the data set used with one that is accessible without an API key.
- Sections 2.2 and 3.2: Replace the VirtualBox image with a Docker image (this is already done on https://www.datascienceatthecommandline.com). Docker is a faster and more lightweight way of running an isolated environment than VirtualBox.
- Section 4.3: Split Python and R into separate sections. Furthermore, explain how to parse command-line options in those languages.
- Section 5.4: Split into two sections. Use `xmlstarlet` for working with XML.
- Section 5.5: Move these subsections beneath Section 5.3.
- Section 5.6: Use `pup` instead of `scrape` to work with HTML. `scrape` is a Python tool I created myself. `pup` is much faster, has more features, and is easier to install.
- Chapter 6: Replace Drake with Make. Drake is no longer maintained. Make is much more mature and is also very popular with developers.
- Sections 7.3.2 and 7.4.x: Replace `Rio` with `littler`. `Rio` is a Bash script I created myself. `littler` is a much more stable way of using R from the command line and is easier to install.
- Chapter 8: Add new sections that discuss how to get a list of running instances not only from AWS but also from two newer cloud providers: GCP and Azure.
- Chapter 9: Replace Weka, BigML, and SKLL with Vowpal Wabbit. Weka is old and the way it is used from the command line is clunky. BigML is a commercial API on which I no longer want to rely. SKLL is not truly a command-line tool. Vowpal Wabbit is a very mature machine learning tool, developed at Yahoo! and now at Microsoft. At some point there was supposed to be an entire book about Vowpal Wabbit (titled Sequential Learning), but unfortunately it was never finished. These three sections will give Vowpal Wabbit the exposure it deserves and give readers the speed and stability they deserve when applying machine learning at the command line.
- Chapter 10: New chapter about integrating the command line into existing workflows, including Python, R, Julia, and Spark. In the first edition I mention that the command line can easily be integrated with existing workflows, but I never elaborate on that. This chapter fixes that. My hope is that this chapter will make readers more inclined to pick up the book and learn about the advantages of the command line.
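To give a quick flavor of some of these new tools, here's a rough sketch of what they look like at the prompt. The file names, column names, and URL below (data.csv, city, state, train.vw, test.vw, example.com) are just placeholders, not examples from the book.

```bash
# Select two columns from a CSV file: csvkit (first edition) versus xsv (second edition)
csvcut -c city,state data.csv
xsv select city,state data.csv

# Extract all links from an HTML page with pup instead of scrape
curl -s https://example.com | pup 'a attr{href}'

# Evaluate an R expression from the command line with littler instead of Rio
r -e 'cat(mean(1:10), "\n")'

# List running instances on AWS, GCP, and Azure
aws ec2 describe-instances
gcloud compute instances list
az vm list

# Train a Vowpal Wabbit model and use it to make predictions
vw train.vw -f model.vw
vw -t test.vw -i model.vw -p predictions.txt
```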
Book outline
In the tentative outline below, π indicates added and β indicates removed chapters and sections with respect to the first edition.
- Preface
- What to Expect from This Book
- How to Read This Book
- Who This Book Is For
- Acknowledgments
- Dedication
- About the Author
- Chapter 1 Introduction
- 1.1 Overview
- 1.2 Data Science is OSEMN
- 1.2.1 Obtaining Data
- 1.2.2 Scrubbing Data
- 1.2.3 Exploring Data
- 1.2.4 Modeling Data
- 1.2.5 Interpreting Data
- 1.3 Intermezzo Chapters
- 1.4 What is the Command Line?
- 1.5 Why Data Science at the Command Line?
- 1.5.1 The Command Line is Agile
- 1.5.2 The Command Line is Augmenting
- 1.5.3 The Command Line is Scalable
- 1.5.4 The Command Line is Extensible
- 1.5.5 The Command Line is Ubiquitous
- 1.6 A Real-world Use Case
- 1.7 Further Reading
- Chapter 2 Getting Started
- 2.1 Overview
- 2.2 Setting Up Your Data Science Toolbox β
- 2.2 Installing the Docker Image π
- 2.3 Essential GNU/Linux Concepts
- 2.3.1 The Environment
- 2.3.2 Executing a Command-line Tool
- 2.3.3 Five Types of Command-line Tools
- 2.3.4 Combining Command-line Tools
- 2.3.5 Redirecting Input and Output
- 2.3.6 Working With Files
- 2.3.7 Help!
- 2.4 Further Reading
- Chapter 3 Obtaining Data
- 3.1 Overview
- 3.2 Copying Local Files to the Data Science Toolbox β
- 3.2.1 Local Version of Data Science Toolbox β
- 3.2.2 Remote Version of Data Science Toolbox β
- 3.2 Copying Local Files to the Docker Image π
- 3.3 Decompressing Files
- 3.4 Converting Microsoft Excel Spreadsheets
- 3.5 Querying Relational Databases
- 3.6 Downloading from the Internet
- 3.7 Calling a Web API
- 3.8 Further Reading
- Chapter 4 Creating Reusable Command-line Tools
- 4.1 Overview
- 4.2 Converting One-liners into Shell Scripts
- 4.2.1 Step 1: Copy and Paste
- 4.2.2 Step 2: Add Permission to Execute
- 4.2.3 Step 3: Define Shebang
- 4.2.4 Step 4: Remove Fixed Input
- 4.2.5 Step 5: Parametrize
- 4.2.6 Step 6: Extend Your PATH
- 4.3 Creating Command-line Tools with Python and R β
- 4.3.1 Porting The Shell Script β
- 4.3.2 Processing Streaming Data from Standard Input β
- 4.3 Creating Command-line Tools with Python π
- 4.3.1 Porting The Shell Script π
- 4.3.2 Processing Streaming Data from Standard Input π
- 4.3.3 Parsing Command-Line Options π
- 4.4 Creating Command-line Tools with R π
- 4.4.1 Porting The Shell Script π
- 4.4.2 Processing Streaming Data from Standard Input π
- 4.4.3 Parsing Command-Line Options π
- 4.5 Further Reading
- Chapter 5 Scrubbing Data
- 5.1 Overview
- 5.2 Common Scrub Operations for Plain Text
- 5.2.1 Filtering Lines
- 5.2.2 Extracting Values
- 5.2.3 Replacing and Deleting Values
- 5.3 Working with CSV
- 5.3.1 Bodies and Headers and Columns, Oh My!
- 5.3.2 Performing SQL Queries on CSV
- 5.3.3 Extracting and Reordering Columns π
- 5.3.4 Filtering Lines π
- 5.3.5 Merging Columns π
- 5.3.6 Combining Multiple CSV Files π
- 5.4 Working with XML/HTML and JSON β
- 5.5 Common Scrub Operations for CSV β
- 5.5.1 Extracting and Reordering Columns β
- 5.5.2 Filtering Lines β
- 5.5.3 Merging Columns β
- 5.5.4 Combining Multiple CSV Files β
- 5.4 Working with JSON π
- 5.4.1 Introducing jq π
- 5.4.2 Filtering elements π
- 5.4.3 Simplifying JSON π
- 5.4.4 Converting JSON to CSV π
- 5.5 Working with XML π
- 5.5.1 Introducing xmlstarlet π
- 5.5.2 Extracting fields using XPath π
- 5.5.3 Converting XML to CSV π
- 5.6 Working with HTML π
- 5.6.1 Introducing pup π
- 5.6.2 Extracting fields using CSS Selectors π
- 5.6.3 Converting HTML to CSV π
- 5.7 Further Reading
- Chapter 6 Managing Your Data Workflow
- 6.1 Overview
- 6.2 Introducing ~~Drake~~ Make π
- 6.3 Installing Drake β
- 6.3 One Script to Rule Them All π
- 6.4 Obtain Top E-books from Project Gutenberg
- 6.5 Every Workflow Starts with a Single Step
- 6.6 Well, That Depends
- 6.7 Rebuilding Certain Targets
- 6.8 Discussion
- 6.9 Further Reading
- Chapter 7 Exploring Data
- 7.1 Overview
- 7.2 Inspecting Data and its Properties
- 7.2.1 Header Or Not, Here I Come
- 7.2.2 Inspect All The Data
- 7.2.3 Feature Names and Data Types
- 7.2.4 Unique Identifiers, Continuous Variables, and Factors
- 7.3 Computing Descriptive Statistics
- 7.3.1 ~~Using csvstat~~ Using xsv stat π
- 7.3.2 Using R from the Command Line ~~using Rio~~
- 7.4 Creating Visualizations
- 7.4.1 Introducing Gnuplot and Feedgnuplot
- 7.4.2 Introducing ggplot2
- 7.4.3 Histograms
- 7.4.4 Bar Plots
- 7.4.5 Density Plots
- 7.4.6 Box Plots
- 7.4.7 Scatter Plots
- 7.4.8 Line Graphs
- 7.4.9 Summary
- 7.5 Further Reading
- Chapter 8 Parallel Pipelines
- 8.1 Overview
- 8.2 Serial Processing
- 8.2.1 Looping Over Numbers
- 8.2.2 Looping Over Lines
- 8.2.3 Looping Over Files
- 8.3 Parallel Processing
- 8.3.1 Introducing GNU Parallel
- 8.3.2 Specifying Input
- 8.3.3 Controlling the Number of Concurrent Jobs
- 8.3.4 Logging and Output
- 8.3.5 Creating Parallel Tools
- 8.4 Distributed Processing
- 8.4.1 Get List of Running AWS EC2 Instances β
- 8.4.1 Running Commands on Remote Machines
- 8.4.2 Distributing Local Data among Remote Machines
- 8.4.3 Processing Files on Remote Machines
- 8.4.4 Get List of Running EC2 Instances on AWS π
- 8.4.5 Get List of Running Compute Engine Instances on GCP π
- 8.4.6 Get List of Running Instances on Azure π
- 8.5 Discussion
- 8.6 Further Reading
- Chapter 9 Modeling Data
- 9.1 Overview
- 9.2 More Wine Please!
- 9.3 Dimensionality Reduction with Tapkee
- 9.3.1 Introducing Tapkee
- 9.3.2 Installing Tapkee
- 9.3.3 Linear and Non-linear Mappings
- 9.4 Clustering with Weka β
- 9.4.1 Introducing Weka β
- 9.4.2 Taming Weka on the Command Line β
- 9.4.3 Converting between CSV and ARFF Data Formats β
- 9.4.4 Comparing Three Cluster Algorithms β
- 9.4 Clustering with SciKit-Learn π
- 9.4.1 Using SciKit-Learn from the Command Line π
- 9.4.2 K-Means Clustering π
- 9.4.3 Hierarchical Clustering π
- 9.4.4 Pipelines π
- 9.5 Regression with SciKit-Learn Laboratory β
- 9.5.1 Preparing the Data β
- 9.5.2 Running the Experiment β
- 9.5.3 Parsing the Results β
- 9.6 Classification with BigML β
- 9.6.1 Creating Balanced Train and Test Data Sets β
- 9.6.2 Calling the API β
- 9.6.3 Inspecting the Results β
- 9.6.4 Conclusion β
- 9.5 Collaborative Filtering with Vowpal Wabbit π
- 9.5.1 Introducing Vowpal Wabbit π
- 9.5.2 Input Format π
- 9.5.3 Matrix Factorization π
- 9.5.4 Training a Model π
- 9.5.5 Making Predictions π
- 9.5.6 Measure Performance π
- 9.6 Regression with Vowpal Wabbit π
- 9.6.1 Feature Hashing π
- 9.6.2 Gradient Descent π
- 9.6.3 Hyper-parameter Optimization π
- 9.6.4 Inspecting Models π
- 9.7 Classification with Vowpal Wabbit π
- 9.7.1 Extended Input Format π
- 9.7.2 Multi-class Classification π
- 9.7.3 Online Learning π
- 9.8 Further Reading
- Chapter 10 Leverage the Unix Command Line Elsewhere π
- 10.1 Jupyter Notebook π
- 10.2 Python Scripts π
- 10.3 RStudio π
- 10.4 R Markdown π
- 10.5 R Scripts π
- 10.6 Julia Scripts π
- 10.7 Spark Pipes π
- Chapter ~~10~~ 11 Conclusion
- 11.1 Let's Recap
- 11.2 Three Pieces of Advice
- 11.2.1 Be Patient
- 11.2.2 Be Creative
- 11.2.3 Be Practical
- 11.3 Where To Go From Here?
- 11.3.1 APIs
- 11.3.2 Shell Programming
- 11.3.3 Python, R, and SQL
- 11.3.4 Interpreting Data
- 11.4 Getting in Touch
- References
In the past five years I have received a lot of valuable feedback in the form of emails, tweets, book reviews, errata submitted to O'Reilly, GitHub issues, and even pull requests. I love this. It has only made the book better.
O'Reilly has graciously given me permission to make the source of the second edition available on GitHub and an HTML version available on https://www.datascienceatthecommandline.com under a Creative Commons Attribution-NoDerivatives 4.0 International License from the start. That's fantastic because this way, I'll be able to receive feedback during the entire journey, which will make the book even better.
And feedback is, as always, very much appreciated. This can be anything ranging from a typo to a command-line tool or trick that might be of interest to others. If you have any ideas, suggestions, questions, criticism, or compliments, then I would love to hear from you. You may reply to this particular issue, create a new issue, tweet me at @jeroenhjanssens, or email me; use whichever medium you prefer.
Thank you.
Best wishes,
Jeroen