Second edition: rationale, changes, outline, and feedback

I'm happy to announce that I'll be writing the second edition of Data Science at the Command Line (O'Reilly, 2014). This issue explains why I think a second edition is needed, lists what changes I plan to make, and presents a tentative outline. Finally, I have a few words about the process and giving feedback.

Why a second edition?

While the command line as a technology and as a way of working is timeless, some of the tools discussed in the first edition have either (1) been superseded by newer tools (e.g., csvkit has been replaced by xsv), (2) been abandoned by their developers (e.g., drake), or (3) turned out to be suboptimal choices (e.g., weka). Since the first edition was published in October 2014, I have learned a lot, both from my own experience and from the useful feedback of its readers. Even though the book is quite niche, sitting as it does at the intersection of two subjects (data science and the command line), interest from the data science community remains steady. I notice this from the many positive messages I receive almost every day. By updating the first edition I hope to keep the book relevant for at least another five years.

Changes with respect to the first edition

These are the general changes I currently have in mind; to give you a flavour of the new tools, I have added a few quick command-line sketches after the list. Please note that all of this is subject to change.

  • Throughout the book: replace csvkit with xsv as much as possible. xsv is a much faster tool for working with CSV files.
  • Section 1.6: Replace the data set used there with one that is accessible without an API key.
  • Sections 2.2 and 3.2: Replace the VirtualBox image with a Docker image (this has already been done on https://www.datascienceatthecommandline.com). Docker is a faster and more lightweight way of running an isolated environment than VirtualBox.
  • Section 4.3: Split Python and R into separate sections. Furthermore, explain how to parse command-line options in those languages.
  • Section 5.4: Split into two sections. Use xmlstarlet for working with XML.
  • Section 5.5: Move these subsections beneath Section 5.3.
  • Section 5.6: Use pup instead of scrape to work with HTML. scrape is a Python tool I created myself. pup is much faster, has more features, and is easier to install.
  • Chapter 6: Replace Drake with Make. Drake is no longer maintained. Make is much more mature and is also very popular with developers.
  • Sections 7.3.2 and 7.4.x: Replace Rio with littler. Rio is a Bash script I created myself. littler is a much more stable way of using R from the command line and is easier to install.
  • Chapter 8: Add new sections that discuss how to get a list of running instances not only from AWS but also from two newer cloud providers: GCP and Azure.
  • Chapter 9: Replace Weka, BigML, and SKLL with Vowpal Wabbit. Weka is old and the way it is used from the command line is clunky. BigML is a commercial API on which I no longer want to rely. SKLL is not truly a command-line tool. Vowpal Wabbit is a very mature machine learning tool, developed at Yahoo! and now at Microsoft. At some point there was supposed to be an entire book about Vowpal Wabbit (titled Sequential Learning), but unfortunately it was never finished. These three sections will give Vowpal Wabbit the exposure it deserves and give readers a fast and stable way to apply machine learning at the command line.
  • Chapter 10: New chapter about integrating the command line into existing workflows, including Python, R, Julia, and Spark. In the first edition I mention that the command line can easily be integrated with existing workflows, but I never go into detail. This chapter fixes that. My hope is that this chapter will make people more inclined to pick up the book and learn about the advantages of the command line.
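
To give you a taste of these changes, here are a few quick sketches, starting with xsv. This is a minimal example, assuming a file tips.csv with the columns day, time, and tip (the file and column names are made up for illustration):

    # Show the column names of the CSV file
    xsv headers tips.csv

    # Keep two columns, compute summary statistics, and align the output
    xsv select day,tip tips.csv | xsv stats | xsv table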
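
Next, the Docker image. Getting started should come down to a single command, roughly along these lines (assuming Docker is installed; take the image name with a grain of salt, as it may still change):

    # Download the image and start an interactive container;
    # --rm cleans up the container once you exit the shell
    docker run --rm -it datascienceworkshops/data-science-at-the-command-line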
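
As for parsing command-line options in Python: in the book this will be a proper script, but even a one-liner with argparse shows the idea (the --count option is made up for illustration):

    # argparse picks up the options that follow the quoted program
    python3 -c 'import argparse; p = argparse.ArgumentParser(); p.add_argument("--count", type=int, default=3); print(p.parse_args().count)' --count 5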
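
For XML, xmlstarlet's sel subcommand does most of the heavy lifting. A sketch, assuming a file breakfast.xml in which every food element has name and price children (again, made up for illustration):

    # Turn every food element into a line of CSV with two fields
    xmlstarlet sel -t -m '//food' -v 'name' -o ',' -v 'price' -n breakfast.xml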
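
For HTML, pup lets you slice a page using CSS selectors. A sketch, assuming a downloaded page example.html (hypothetical):

    # Extract the text of every link and the URLs they point to
    pup 'a text{}' < example.html
    pup 'a attr{href}' < example.html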
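
For Chapter 6, a minimal two-step workflow in Make could look as follows (assuming GNU Make 3.82 or newer and the bc calculator; I set .RECIPEPREFIX so that this sketch does not rely on literal tabs):

    # Write a Makefile with two targets; '>' marks the recipe lines
    printf '%s\n' \
      '.RECIPEPREFIX := >' \
      'total.txt: numbers.txt' \
      '> paste -sd+ numbers.txt | bc > total.txt' \
      'numbers.txt:' \
      '> seq 100 > numbers.txt' > Makefile

    # Building total.txt first builds its dependency numbers.txt
    make total.txt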
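
littler ships an executable simply called r. A sketch of using R in a pipeline, assuming littler is installed:

    # Evaluate a quick R expression straight from the shell
    r -e 'cat(sqrt(2), "\n")'

    # Or summarise numbers that come in over a pipe
    seq 100 | r -e 'x <- as.integer(readLines("stdin")); cat(sum(x), "\n")'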
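
For Chapter 8, each cloud provider has its own CLI, and the new sections will boil down to variations of the following (assuming the CLIs are installed and authenticated; the exact flags in the book may differ):

    # AWS, GCP, and Azure, respectively
    aws ec2 describe-instances
    gcloud compute instances list
    az vm list --output table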
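
Finally, Vowpal Wabbit. A regression-flavoured sketch of its native input format and basic workflow, with made-up features (assuming vw is installed):

    # Two examples in VW's input format: a label followed by named features
    printf '%s\n' \
      '6 |features alcohol:10.5 acidity:0.29' \
      '5 |features alcohol:9.1 acidity:0.56' > wine.vw

    # Train a model, save it to disk, and then use it to make predictions
    vw -d wine.vw -f model.vw
    vw -d wine.vw -i model.vw -t -p predictions.txt
    cat predictions.txt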

Book outline

In the tentative outline below, 🆕 indicates added and ❌ indicates removed chapters and sections with respect to the first edition.

  • Preface
    • What to Expect from This Book
    • How to Read This Book
    • Who This Book Is For
    • Acknowledgments
    • Dedication
    • About the Author
  • Chapter 1 Introduction
    • 1.1 Overview
    • 1.2 Data Science is OSEMN
      • 1.2.1 Obtaining Data
      • 1.2.2 Scrubbing Data
      • 1.2.3 Exploring Data
      • 1.2.4 Modeling Data
      • 1.2.5 Interpreting Data
    • 1.3 Intermezzo Chapters
    • 1.4 What is the Command Line?
    • 1.5 Why Data Science at the Command Line?
      • 1.5.1 The Command Line is Agile
      • 1.5.2 The Command Line is Augmenting
      • 1.5.3 The Command Line is Scalable
      • 1.5.4 The Command Line is Extensible
      • 1.5.5 The Command Line is Ubiquitous
    • 1.6 A Real-world Use Case
    • 1.7 Further Reading
  • Chapter 2 Getting Started
    • 2.1 Overview
    • 2.2 Setting Up Your Data Science Toolbox ❌
    • 2.2 Installing the Docker Image 🆕
    • 2.3 Essential GNU/Linux Concepts
      • 2.3.1 The Environment
      • 2.3.2 Executing a Command-line Tool
      • 2.3.3 Five Types of Command-line Tools
      • 2.3.4 Combining Command-line Tools
      • 2.3.5 Redirecting Input and Output
      • 2.3.6 Working With Files
      • 2.3.7 Help!
    • 2.4 Further Reading
  • Chapter 3 Obtaining Data
    • 3.1 Overview
    • 3.2 Copying Local Files to the Data Science Toolbox ❌
      • 3.2.1 Local Version of Data Science Toolbox ❌
      • 3.2.2 Remote Version of Data Science Toolbox ❌
    • 3.2 Copying Local Files to the Docker Image 🆕
    • 3.3 Decompressing Files
    • 3.4 Converting Microsoft Excel Spreadsheets
    • 3.5 Querying Relational Databases
    • 3.6 Downloading from the Internet
    • 3.7 Calling a Web API
    • 3.8 Further Reading
  • Chapter 4 Creating Reusable Command-line Tools
    • 4.1 Overview
    • 4.2 Converting One-liners into Shell Scripts
      • 4.2.1 Step 1: Copy and Paste
      • 4.2.2 Step 2: Add Permission to Execute
      • 4.2.3 Step 3: Define Shebang
      • 4.2.4 Step 4: Remove Fixed Input
      • 4.2.5 Step 5: Parametrize
      • 4.2.6 Step 6: Extend Your PATH
    • 4.3 Creating Command-line Tools with Python and R ❌
      • 4.3.1 Porting The Shell Script ❌
      • 4.3.2 Processing Streaming Data from Standard Input ❌
    • 4.3 Creating Command-line Tools with Python 🆕
      • 4.3.1 Porting The Shell Script 🆕
      • 4.3.2 Processing Streaming Data from Standard Input 🆕
      • 4.3.3 Parsing Command-Line Options 🆕
    • 4.4 Creating Command-line Tools with R 🆕
      • 4.4.1 Porting The Shell Script 🆕
      • 4.4.2 Processing Streaming Data from Standard Input 🆕
      • 4.4.3 Parsing Command-Line Options 🆕
    • 4.5 Further Reading
  • Chapter 5 Scrubbing Data
    • 5.1 Overview
    • 5.2 Common Scrub Operations for Plain Text
      • 5.2.1 Filtering Lines
      • 5.2.2 Extracting Values
      • 5.2.3 Replacing and Deleting Values
    • 5.3 Working with CSV
      • 5.3.1 Bodies and Headers and Columns, Oh My!
      • 5.3.2 Performing SQL Queries on CSV
      • 5.3.3 Extracting and Reordering Columns 🆕
      • 5.3.4 Filtering Lines 🆕
      • 5.3.5 Merging Columns 🆕
      • 5.3.6 Combining Multiple CSV Files 🆕
    • 5.4 Working with XML/HTML and JSON ❌
    • 5.5 Common Scrub Operations for CSV ❌
      • 5.5.1 Extracting and Reordering Columns ❌
      • 5.5.2 Filtering Lines ❌
      • 5.5.3 Merging Columns ❌
      • 5.5.4 Combining Multiple CSV Files ❌
    • 5.4 Working with JSON 🆕
      • 5.4.1 Introducing jq 🆕
      • 5.4.2 Filtering elements 🆕
      • 5.4.3 Simplifying JSON 🆕
      • 5.4.4 Converting JSON to CSV 🆕
    • 5.5 Working with XML 🆕
      • 5.5.1 Introducing xmlstarlet 🆕
      • 5.5.2 Extracting fields using XPath 🆕
      • 5.5.3 Converting XML to CSV 🆕
    • 5.6 Working with HTML 🆕
      • 5.6.1 Introducing pup 🆕
      • 5.6.2 Extracting fields using CSS Selectors 🆕
      • 5.6.3 Converting HTML to CSV 🆕
    • 5.7 Further Reading
  • Chapter 6 Managing Your Data Workflow
    • 6.1 Overview
    • 6.2 Introducing Drake ❌
    • 6.2 Introducing Make 🆕
    • 6.3 Installing Drake ❌
    • 6.3 One Script to Rule Them All 🆕
    • 6.4 Obtain Top E-books from Project Gutenberg
    • 6.5 Every Workflow Starts with a Single Step
    • 6.6 Well, That Depends
    • 6.7 Rebuilding Certain Targets
    • 6.8 Discussion
    • 6.9 Further Reading
  • Chapter 7 Exploring Data
    • 7.1 Overview
    • 7.2 Inspecting Data and its Properties
      • 7.2.1 Header Or Not, Here I Come
      • 7.2.2 Inspect All The Data
      • 7.2.3 Feature Names and Data Types
      • 7.2.4 Unique Identifiers, Continuous Variables, and Factors
    • 7.3 Computing Descriptive Statistics
      • 7.3.1 Using csvstat ❌
      • 7.3.1 Using xsv stats 🆕
      • 7.3.2 Using R from the Command Line using Rio
    • 7.4 Creating Visualizations
      • 7.4.1 Introducing Gnuplot and Feedgnuplot
      • 7.4.2 Introducing ggplot2
      • 7.4.3 Histograms
      • 7.4.4 Bar Plots
      • 7.4.5 Density Plots
      • 7.4.6 Box Plots
      • 7.4.7 Scatter Plots
      • 7.4.8 Line Graphs
      • 7.4.9 Summary
    • 7.5 Further Reading
  • Chapter 8 Parallel Pipelines
    • 8.1 Overview
    • 8.2 Serial Processing
      • 8.2.1 Looping Over Numbers
      • 8.2.2 Looping Over Lines
      • 8.2.3 Looping Over Files
    • 8.3 Parallel Processing
      • 8.3.1 Introducing GNU Parallel
      • 8.3.2 Specifying Input
      • 8.3.3 Controlling the Number of Concurrent Jobs
      • 8.3.4 Logging and Output
      • 8.3.5 Creating Parallel Tools
    • 8.4 Distributed Processing
      • 8.4.1 Get List of Running AWS EC2 Instances ❌
      • 8.4.1 Running Commands on Remote Machines
      • 8.4.2 Distributing Local Data among Remote Machines
      • 8.4.3 Processing Files on Remote Machines
      • 8.4.4 Get List of Running EC2 Instances on AWS 🆕
      • 8.4.5 Get List of Running Compute Engine Instances on GCP 🆕
      • 8.4.6 Get List of Running Instances on Azure 🆕
    • 8.5 Discussion
    • 8.6 Further Reading
  • Chapter 9 Modeling Data
    • 9.1 Overview
    • 9.2 More Wine Please!
    • 9.3 Dimensionality Reduction with Tapkee
      • 9.3.1 Introducing Tapkee
      • 9.3.2 Installing Tapkee
      • 9.3.3 Linear and Non-linear Mappings
    • 9.4 Clustering with Weka ❌
      • 9.4.1 Introducing Weka ❌
      • 9.4.2 Taming Weka on the Command Line ❌
      • 9.4.3 Converting between CSV to ARFF Data Formats ❌
      • 9.4.4 Comparing Three Cluster Algorithms ❌
    • 9.4 Clustering with SciKit-Learn 🆕
      • 9.4.1 Using SciKit-Learn from the Command Line 🆕
      • 9.4.2 K-Means Clustering 🆕
      • 9.4.3 Hierarchical Clustering 🆕
      • 9.4.4 Pipelines 🆕
    • 9.5 Regression with SciKit-Learn Laboratory ❌
      • 9.5.1 Preparing the Data ❌
      • 9.5.2 Running the Experiment ❌
      • 9.5.3 Parsing the Results ❌
    • 9.6 Classification with BigML ❌
      • 9.6.1 Creating Balanced Train and Test Data Sets ❌
      • 9.6.2 Calling the API ❌
      • 9.6.3 Inspecting the Results ❌
      • 9.6.4 Conclusion ❌
    • 9.5 Collaborative Filtering with Vowpal Wabbit 🆕
      • 9.5.1 Introducing Vowpal Wabbit 🆕
      • 9.5.2 Input Format 🆕
      • 9.5.3 Matrix Factorization 🆕
      • 9.5.4 Training a Model 🆕
      • 9.5.5 Making Predictions 🆕
      • 9.5.6 Measure Performance 🆕
    • 9.6 Regression with Vowpal Wabbit 🆕
      • 9.6.1 Feature Hashing 🆕
      • 9.6.2 Gradient Descent 🆕
      • 9.6.3 Hyper-parameter Optimization 🆕
      • 9.6.4 Inspecting Models 🆕
    • 9.7 Classification with Vowpal Wabbit 🆕
      • 9.7.1 Extended Input Format 🆕
      • 9.7.2 Multi-class Classification 🆕
      • 9.7.3 Online Learning 🆕
    • 9.8 Further Reading
  • Chapter 10 Leverage the Unix Command Line Elsewhere 🆕
    • 10.1 Jupyter Notebook 🆕
    • 10.2 Python Scripts 🆕
    • 10.3 RStudio 🆕
    • 10.4 R Markdown 🆕
    • 10.5 R Scripts 🆕
    • 10.6 Julia Scripts 🆕
    • 10.7 Spark Pipes 🆕
  • Chapter 11 Conclusion
    • 11.1 Let's Recap
    • 11.2 Three Pieces of Advice
      • 11.2.1 Be Patient
      • 11.2.2 Be Creative
      • 11.2.3 Be Practical
    • 11.3 Where To Go From Here?
      • 11.3.1 APIs
      • 11.3.2 Shell Programming
      • 11.3.3 Python, R, and SQL
      • 11.3.4 Interpreting Data
    • 11.4 Getting in Touch
  • References

Process and feedback

In the past five years I have received a lot of valuable feedback in the form of emails, tweets, book reviews, errata submitted to O'Reilly, GitHub issues, and even pull requests. I love this. It has only made the book better.

O'Reilly has graciously given me permission to make the source of the second edition available on GitHub and an HTML version available on https://www.datascienceatthecommandline.com under a Creative Commons Attribution-NoDerivatives 4.0 International License from the start. That's fantastic because this way, I'll be able to receive feedback during the entire journey, which will make the book even better.

And feedback is, as always, very much appreciated. This can be anything ranging from a typo to a command-line tool or trick that might be of interest to others. If you have any ideas, suggestions, questions, criticism, or compliments, then I would love to hear from you. You may reply to this particular issue, create a new issue, tweet me at @jeroenhjanssens, or email me; use whichever medium you prefer.

Thank you.

Best wishes,

Jeroen