Deep Learning Machine
This repository contains an Ansible playbook and instructions to create and manage a single (or many) bare metal deep learning machines. For a description of why Ansible was chosen and what other alternatives were considered, please see ToolSelection.md
Quick Reference
If you've already installed Ansible, you can execute the entire playbook by running:
$ ansible-playbook tensorflow.yml
You can also execute only the pieces you need by passing tags on the command line:
- Install only the apt/pip pre-requisites to execute the other roles:
  $ ansible-playbook tensorflow.yml --tags "packages"
- Install Docker CE:
  $ ansible-playbook tensorflow.yml --tags "docker"
- Install the Nvidia CUDA GPU drivers:
  $ ansible-playbook tensorflow.yml --tags "cuda"
- Install the Nvidia Docker Runtime:
  $ ansible-playbook tensorflow.yml --tags "nvidia"
- Download the TensorFlow container and launch Jupyter Notebook:
  $ ansible-playbook tensorflow.yml --tags "jupyter"
What's Included
After running the Ansible playbook, your machines will be loaded with the following:
- Docker
- Nvidia CUDA GPU Drivers
- Nvidia Docker Runtime
- TensorFlow GPU Python3 Docker Container
- JupyterLab
Using This Repository to Configure Your Environment
Installation
Ansible runs on your local machine and sends commands to the remote (machine learning) machines, so you'll need Ansible installed locally (not on the machine learning boxes). For macOS users, the easiest way to install Ansible is via Homebrew:
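$ brew install ansible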
If that's not your cup of tea, install Ansible by following the directions for your machine here.
Configuration
Gather the following:
- SSH key or user credentials for the remote account

  Note: Ansible does not expose a channel to allow communication between the user and the ssh process to accept a password manually to decrypt an ssh key when using the ssh connection plugin (which is the default). The use of `ssh-agent` is highly recommended.

- List of servers you wish to manage:
  - hostnames/IP addresses
  - SSH port
  - usernames
Copy [hosts.example] to /etc/ansible/hosts (if it does not already exist). Populate the hosts file (no extension) with the information about the servers you gathered above.
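For example, a minimal inventory describing a `production` group might look like the sketch below; the hostnames, addresses, ports, and usernames are placeholders, and hosts.example is the authoritative reference for the expected format:

```
[production]
ml1.example.com ansible_port=22 ansible_user=ubuntu
ml2 ansible_host=192.0.2.10 ansible_port=22 ansible_user=ubuntu
```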
Confirm that you have populated your Ansible hosts file correctly:
$ ansible-inventory --list
Running
Once you're satisfied that you correctly populated your hosts file, update the `- hosts:` line of [tensorflow.yml] to reflect the hosts or groups you want to configure.
Examples:
- Apply against a single host defined as `ml2` in /etc/ansible/hosts
- Apply against a group of hosts defined as `production` in /etc/ansible/hosts
- Apply against all hosts defined in /etc/ansible/hosts
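The corresponding `- hosts:` values would look roughly like this (only one of these belongs in tensorflow.yml at a time, and `ml2`/`production` are whatever names you used in your inventory):

```yaml
# Apply against the single host ml2
- hosts: ml2

# Apply against the production group
- hosts: production

# Apply against every host in the inventory
- hosts: all
```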
Then, when you're ready, run the playbook:
$ ansible-playbook tensorflow.yml --ask-become-pass
Note: You must have sudo access to run the playbook!
Review the output:
- `[ok]` means no change (this task was already completed)
- `[changed]` means the task successfully ran and the change was applied
- `[unreachable]` means the host could not be reached
- `[failed]` means the task ran but failed to complete
`[ok]` and `[changed]` are successful outcomes. Any `[unreachable]` and `[failed]` outputs should be investigated and resolved.
Note: This Ansible playbook is idempotent; once a configuration has been successfully applied, if you apply it again, all actions will report [ok].
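If you later want to re-apply the configuration to a subset of your inventory, Ansible's --limit flag restricts the run; for example, using the `ml2` host from the earlier example:
$ ansible-playbook tensorflow.yml --limit ml2 --ask-become-pass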
Executing TensorFlow Jobs in Your New Environment
- Point your browser to http://<hostname>:8888 and log in with the password you provided.
- The `jupyter.volumes.source` folder will be mounted as the `notebooks` folder (see the sketch after this list).
- Edit and execute your Jupyter notebooks as normal!
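Going by that dotted key, the relevant portion of secrets.yml presumably looks something like the following; the path is a placeholder, and secrets.example.yml is the authoritative reference for the real structure:

```yaml
jupyter:
  volumes:
    source: /home/ubuntu/notebooks  # placeholder: host folder exposed as "notebooks"
```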
Command Line Access
If you need to drop into a GPU-powered TensorFlow environment, SSH into the remote machine and execute the following:
$ docker run --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu-py3 bash
Note: You must be a member of the docker group or have sudo access on the remote machine to execute docker commands.
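Once inside the container, a quick sanity check (not part of the playbook) confirms that TensorFlow can see the GPU:
$ python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"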
Additional Files
- ansible.cfg enables SSH credential forwarding (see the sketch after this list). This is a necessary step during data synchronization, as Ansible delegates those credentials to the master/writer host to push the data folder out to each of the mirrors.
- Dockerfile is used to build the Arricor TensorFlow image. See Docker.md for additional details.
- hosts.example is an example of the Ansible inventory hosts file saved in /etc/ansible/hosts
- secrets.example.yml is an example of the expected structure of the `secrets.yml` file
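As referenced above, SSH agent forwarding in an ansible.cfg is typically enabled with something like the following (the repository's actual file may differ):

```ini
[ssh_connection]
ssh_args = -o ForwardAgent=yes
```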