Intro
Because there is a pressing need to use containers in one specific way, this document begins with that specific context and then branches out from there.
Why Singularity and not Docker?
Docker is a fine container virtualization system; however, it requires administrator privileges to run. Further, once an image is running, Docker's default preference is to keep the input and output resources relevant to the container within the container itself. Singularity overcomes these two issues by running images as the user invoking them and by exposing the host's local filesystem from within the running container. More detailed information can be found here.
Build a container
For our particular need, we would like to be able to run TensorFlow. Singularity allows us to obtain an existing container from a Singularity Container Library, a Singularity Hub, or a Docker Hub. In the case of TensorFlow, we'll download one of NVidia's premade Docker containers, which Singularity will convert into its own format.
First, we need to obtain access to NVidia's library of Docker containers, located here:
Nvidia NGC
This is initially done by requesting access through the Create Account link. Once an account has been granted, an API key is needed.
Once the API key is obtained, make it usable in the shell (note the single quotes: the NGC username is the literal string $oauthtoken, not a shell variable):
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=<API key>
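If you expect to pull NGC images regularly, you may (optionally) want these variables set in every session. A minimal sketch, assuming a bash shell and that you keep your startup settings in ~/.bashrc, is to add the same two lines there (again keeping the single quotes so $oauthtoken is not expanded):
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=<API key>
Since the API key is a credential, make sure the file holding it is readable only by you.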
Now you may log into the Nvidia NGC and browse for containers. In this example, we'll download TensorFlow, which is listed there as:
nvcr.io/nvidia/tensorflow:19.09-py3
By default, singularity writes the entire container into a cache in your home directory, where it is specific to your account, hidden, and available only to you. To avoid that, we'll also write it out to a file, in Singularity Image Format (SIF).
So, for singularity, we will pull the image as such:
>singularity pull tf-19.09-py3.sif docker://nvcr.io/nvidia/tensorflow:19.09-py3
Singularity will then download each blob-layer in the
tensorflow:19.09-py3 tag, and cache it into your home directory.
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob sha256:35c102085707f703de2d9eaad8752d6fe1b8f02b5d2149f1d8357c9cc7fb7d0a
25.45 MiB / 25.45 MiB [====================================================] 1s
Copying blob sha256:251f5509d51d9e4119d4ffb70d4820f8e2d7dc72ad15df3ebd7cd755539e40fd
34.54 KiB / 34.54 KiB [====================================================] 0s
Copying blob sha256:8e829fe70a46e3ac4334823560e98b257234c23629f19f05460e21a453091e6d
848 B / 848 B [============================================================] 0s
Copying blob sha256:6001e1789921cf851f6fb2e5fe05be70f482fe9c2286f66892fe5a3bc404569c
162 B / 162 B [============================================================] 0s
Copying blob sha256:109c7cec1178b6d77b59e8715fe1eae904b528bb9c4868519ff5435bae50c44c
8.63 MiB / 8.63 MiB [======================================================] 0s
<sniped for brevity>
Copying blob sha256:df0616c98153f10db5b05626f314000ced06fb127a6cfaac37ff325d11488327
452 B / 452 B [============================================================] 0s
Copying blob sha256:7828d14d7927458867242e0274e706b1c902eb6b1207e5a28337a7fbc6400e17
209.76 KiB / 209.76 KiB [==================================================] 0s
Copying blob sha256:ae4e96299b2fca5dc08edbde4061cfdf25c313954a7d46a32dfc701af4e0edf6
9.64 KiB / 9.64 KiB [======================================================] 0s
Copying config sha256:ad94fc3cd170246d925f8c5aa8000b3e6cbe645d30ff185bf87b40cee8f41d32
33.55 KiB / 33.55 KiB [====================================================] 0s
Writing manifest to image destination
Storing signatures
INFO: Creating SIF file...
INFO: Build complete: /home/leblancd/.singularity/cache/oci-tmp/52706b896af93123eb7f894f9671bbd31a1cabd15943aa3334eb7af3a710d262/tensorflow_19.09-py3.sif
Now our container file is complete:
-rwxrwxr-x 1 leblancd supergroup 3.3G Oct 4 18:39 tf-19.09-py3.sif
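If you want a quick sanity check of what was pulled, singularity can print the metadata (labels) recorded in the SIF; the exact output varies by image, so it is omitted here:
>singularity inspect tf-19.09-py3.sif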
Cache
Before we move on, note that singularity also stored the SIF file in the user's cache in the home directory (there is now another 3.3G file there, plus the cached blobs).
You can view what is stored in the cache:
>singularity cache list
NAME DATE CREATED SIZE TYPE
tensorflow_19.09-py3.s 2019-10-04 18:39:01 3.45 GB oci
There 1 containers using: 3.45 GB, 68 oci blob file(s) using 3.70 GB of space.
Total space used: 7.15 GB
Clearing the cache is equally simple.
>singularity cache clean -a
Removes all images in the cache (with the -a parameter).
By default, Singularity caches pulled files in your home directory with the following structure:
$HOME/.singularity/cache/library
$HOME/.singularity/cache/oci
$HOME/.singularity/cache/oci-tmp
One can change the default location of the cache by setting
SINGULARITY_CACHEDIR
to your desired location, for example (in a bash shell):
echo $SINGULARITY_CACHEDIR
export SINGULARITY_CACHEDIR=/tmp/leblancd_sing_cache
echo $SINGULARITY_CACHEDIR
/tmp/leblancd_sing_cache
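To make the new cache location stick across sessions, you can create the directory and export the variable from your shell startup file. A minimal sketch, assuming a bash shell and re-using the example path above:
mkdir -p /tmp/leblancd_sing_cache
echo 'export SINGULARITY_CACHEDIR=/tmp/leblancd_sing_cache' >> ~/.bashrc
Keep in mind that /tmp is often cleaned periodically, so a path on more permanent storage may be a better choice for a long-lived cache.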
Build a container from scratch
Now let's say we built this image but discovered it is lacking software that we need.
We have a few options at this point, depending on how permanent we would like the software to be.
One solution would be to build a new image from the image we pulled and built locally (above).
Since creating an image from a definition requires root privilege, we cannot do this locally and need to employ the Remote Builder service.
Once we have created an account there, retrieved an API token (good for 30 days), and saved it to ~/.singularity/sylabs-token, we can proceed to build our new image with a definition file.
Using our image from above, let's install 'astropy' into the image.
We create a definition file, named
tensorflow-astropy.def
Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:19.09-py3
%post
apt-get -y update
pip install astropy
%environment
export LC_ALL=C
export PATH=/usr/local/bin:/usr/bin:/usr/sbin:$PATH
Without going too deep into technical details, our definition begins with Bootstrap, which every definition file requires and which names the bootstrap agent: docker for Docker-format images (whether from Docker Hub or another Docker registry, such as nvcr.io), and library for Container Library images. From: tells the Remote Builder where to find the foundational image to build upon.
Anything in the %post section is performed after the initial image has been pulled and built: apt-get -y update refreshes the package lists in the container, and pip install astropy installs the Python module "astropy" in the container.
In the %environment section we set the locale (LC_ALL=C) and the $PATH so that executables such as pip can be found when the container runs.
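For comparison, a definition that bootstraps from the Container Library rather than a Docker registry changes only those two header lines; the ubuntu base image below is purely illustrative and not something we use in this document:
Bootstrap: library
From: ubuntu:18.04
The %post and %environment sections work exactly the same way regardless of the bootstrap agent.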
For more information about building images from scratch and about definition files, see this documentation.
With our definition file, tensorflow-astropy.def, we can execute:
singularity build --remote tf-astropy.sif tensorflow-astropy.def
which will submit our definition file to the Remote Builder site and build our new image for us.
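Once the Remote Builder returns the finished image, it is worth verifying that our %post step actually took effect. One quick check is to import astropy from inside the new image (this assumes the build produced tf-astropy.sif in the current directory):
>singularity exec tf-astropy.sif python -c "import astropy; print(astropy.__version__)"
If the module was installed correctly, this prints an astropy version string rather than an ImportError.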
Interacting with the Image
exec
To run the container we just built, we need to decide how we would like to execute it. We can exec on the container, which runs a command inside it:
>singularity exec tf-19.09-py3.sif cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
This will execute any command from within the container, and then immediately exit.
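As another example of exec, we can ask the image which TensorFlow version it ships (the exact version string depends on the image tag):
>singularity exec tf-19.09-py3.sif python -c "import tensorflow as tf; print(tf.__version__)"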
shell
We can also obtain a shell within the container:
>singularity shell tf-19.09-py3.sif
Singularity tf-19.09-py3.sif:/tmp> ls -al
total 3366992
drwxrwxrwt 10 root root 4096 Oct 4 18:45 .
drwxr-xr-x 1 leblancd supergroup 80 Oct 4 18:51 ..
-rwxrwxr-x 1 leblancd supergroup 3447758848 Oct 4 18:39 tf-19.09-py3.sif
Notice that the current working directory is /tmp, which is not inside the container.
When Singularity executes, it attempts to make available all your existing filespaces, such as your home directory and common directories on the physical host (such as /tmp), by bind-mounting them into the container. Everything else you see is the read-only filesystem of the container image itself. This makes it rather easy to write any output from the container to the local filesystem, or to an NFS-mounted home directory.
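If you need a host directory that is not mounted automatically, you can bind it in explicitly with --bind (or -B). The /scratch/mydata path below is just a hypothetical host directory; /data is where it will appear inside the container:
>singularity shell --bind /scratch/mydata:/data tf-19.09-py3.sif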
Now that we're interactively running the container, we can attempt to execute Tensorflow.
Singularity tf-19.09-py3.sif:/tmp> python
Python 3.6.8 (default, Aug 20 2019, 17:12:48)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Could not open PYTHONSTARTUP
FileNotFoundError: [Errno 2] No such file or directory: '/etc/pythonstart'
>>> import tensorflow as tf
2019-10-04 18:57:01.232584: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
>>> hello = tf.constant('Testing TF')
>>> sess = tf.Session()
2019-10-04 18:57:40.770169: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2019-10-04 18:57:40.770224: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2019-10-04 18:57:40.773897: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: ml-login3
2019-10-04 18:57:40.773919: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: ml-login3
2019-10-04 18:57:40.773959: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2019-10-04 18:57:40.774016: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 435.21.0
2019-10-04 18:57:40.789291: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494100000 Hz
2019-10-04 18:57:40.794142: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4c48fa0 executing computations on platform Host. Devices:
2019-10-04 18:57:40.794172: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
>>>
Uh oh. The container couldn't locate the CUDA library necessary to execute the Tensorflow Python module in the container.
--nv
This is because when the singularity shell was launched, it was not made aware of any NVidia software and did not map any NVidia drivers/devices into the container. This is easily remedied with the --nv parameter.
>singularity shell --nv tf-19.09-py3.sif
Singularity tf-19.09-py3.sif:/tmp> python
Python 3.6.8 (default, Aug 20 2019, 17:12:48)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2019-10-04 19:00:12.417390: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
>>> hello = tf.constant('Testing TensorF')
>>> sess = tf.Session()
2019-10-04 19:00:38.980743: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-10-04 19:00:39.767134: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:84:00.0
2019-10-04 19:00:39.767193: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-10-04 19:00:39.825552: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-10-04 19:00:39.854719: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-10-04 19:00:39.868698: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-10-04 19:00:39.929108: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-10-04 19:00:39.944015: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-10-04 19:00:40.054598: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-04 19:00:40.057032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-04 19:00:40.072683: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494100000 Hz
2019-10-04 19:00:40.077704: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x50cd760 executing computations on platform Host. Devices:
2019-10-04 19:00:40.077723: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-10-04 19:00:40.172994: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x50d0e80 executing computations on platform CUDA. Devices:
2019-10-04 19:00:40.173041: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): TITAN RTX, Compute Capability 7.5
2019-10-04 19:00:40.174405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:84:00.0
2019-10-04 19:00:40.174453: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-10-04 19:00:40.174511: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-10-04 19:00:40.174548: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-10-04 19:00:40.174585: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-10-04 19:00:40.174619: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-10-04 19:00:40.174680: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-10-04 19:00:40.174719: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-04 19:00:40.177149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-04 19:00:40.180107: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-10-04 19:00:42.739483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-04 19:00:42.739536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-10-04 19:00:42.739545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-10-04 19:00:42.742016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device
(/job:localhost/replica:0/task:0/device:GPU:0 with 22756 MB memory) -> physical GPU
(device: 0, name: TITAN RTX, pci bus id: 0000:84:00.0, compute capability: 7.5)
>>> a = tf.constant(10)
>>> b = tf.constant(25)
>>> sess.run(a+b)
35
Container Reuse
Now that we have built a container from NVidia's image and demonstrated it working, can someone else use our container?
Yes, as long as they don't need to write anything into the container space, because the container is mounted read-only by default.
This is likely the behavior you want: a container that anyone can write files into quickly drifts away from its otherwise pristine state.
Our example TensorFlow container above remains perfectly usable read-only, since we never modify the binaries or scripts involved in running TensorFlow itself, and any input or output data can live safely, and be accessed, outside the container.
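You can convince yourself of the read-only behavior with a quick, deliberately failing test; attempting to create a file under a path that exists only inside the image should produce a "Read-only file system" error, while writing to your bind-mounted home directory works as usual (the file names here are just examples):
>singularity exec tf-19.09-py3.sif touch /usr/local/this-will-fail
>singularity exec tf-19.09-py3.sif touch $HOME/this-will-work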
Consider the following example:
We have a very simple (example) Python script:
>cat helloworld.py
import tensorflow as tf
hello = tf.constant('Hello, TF!')
sess = tf.Session()
print(sess.run(hello))
That is in the current directory with our Tensorflow container:
-rw-rw-r-- 1 leblancd supergroup 101 Oct 4 19:30 helloworld.py
-rwxrwxr-x 1 leblancd supergroup 3.3G Oct 4 18:39 tf-19.09-py3.sif
We can use our helloworld.py script like so:
>singularity exec --nv ./tf-19.09-py3.sif python ./helloworld.py
2019-10-09 17:34:30.078770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:84:00.0
...snip...
2019-10-09 17:34:30.373594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-09 17:34:30.388525: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494100000 Hz
2019-10-09 17:34:30.393509: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56378c0 executing computations on platform Host. Devices:
2019-10-09 17:34:30.393530: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-10-09 17:34:30.488286: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5637f30 executing computations on platform CUDA. Devices:
2019-10-09 17:34:30.488342: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): TITAN RTX, Compute Capability 7.5
2019-10-09 17:34:30.489712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:84:00.0
...snip...
2019-10-09 17:34:30.492309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-09 17:34:30.495275: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-10-09 17:34:33.013974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-09 17:34:33.014043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-10-09 17:34:33.014054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-10-09 17:34:33.016631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device
(/job:localhost/replica:0/task:0/device:GPU:0 with 22756 MB memory) -> physical GPU
(device: 0, name: TITAN RTX, pci bus id: 0000:84:00.0, compute capability: 7.5)
b'Hello, TF!'
Likewise, redirecting output as:
>singularity exec --nv ./tf-19.09-py3.sif python ./helloworld.py >> OUTPUT
...snip...
2019-10-09 17:59:52.089016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326]
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22756 MB memory)
-> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:84:00.0, compute capability: 7.5)
>
Then our OUTPUT file will be:
>ls -l
total 3367004
-rw-rw-r-- 1 leblancd supergroup 101 Oct 9 17:30 helloworld.py
-rw-rw-r-- 1 leblancd supergroup 14 Oct 9 17:59 OUTPUT
-rwxrwxr-x 1 leblancd supergroup 3447758848 Oct 9 17:09 tf-19.09-py3.sif*
>cat OUTPUT
b'Hello, TF!'
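Note that the TensorFlow log lines shown above go to stderr, which is why OUTPUT contains only the print() result. If you also want to keep the log, redirect stderr to its own file (the file names here are just examples):
>singularity exec --nv ./tf-19.09-py3.sif python ./helloworld.py > OUTPUT 2> TF_LOG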
Instances
So far, we have run our container with shell or exec, which run the container in the foreground, meaning the running container terminates when we exit, either by quitting the shell or when the exec command finishes.
We can, however, detach a container and run it as a "service". In Singularity terms, this is called an instance.
>singularity instance start --nv ./tf-19.09-py3.sif Tensorflow
INFO: instance started successfully
Now the instance is running, and we can launch as many as we like, though each instance needs a unique name.
>singularity instance start --nv ./tf-19.09-py3.sif tensorflow
INFO: instance started successfully
>singularity instance start --nv ./tf-19.09-py3.sif tensorflow2
INFO: instance started successfully
To review what instances we have running:
>singularity instance list
INSTANCE NAME PID IMAGE
Tensorflow 52226 /tmp/tf-19.09-py3.sif
tensorflow 52358 /tmp/tf-19.09-py3.sif
tensorflow2 52407 /tmp/tf-19.09-py3.sif
NOTE: we launched our instances with --nv to expose the NVidia devices/drivers to the instances (you cannot do this after the container instance is already running).
Now, to access (or "attach" to) the running instances, we can use exec or shell as before, but target the instance like so:
>singularity shell instance://tensorflow2
Singularity tf-19.09-py3.sif:/tmp> nvidia-smi
Wed Oct 9 18:24:58 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX Off | 00000000:84:00.0 Off | N/A |
| 21% 35C P0 N/A / N/A | 0MiB / 24220MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
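exec works against a running instance the same way. For example, back on the host (after exiting the shell above), we could run our earlier helloworld.py inside one of the instances; no --nv is needed here, since the instance was already started with it, and this assumes the script is in the current directory:
>singularity exec instance://tensorflow python ./helloworld.py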
Finally, to terminate our instances:
>singularity instance stop Tensorflow
Stopping Tensorflow instance of /tmp/tf-19.09-py3.sif (PID=52226)
Or:
>singularity instance stop -a
Stopping tensorflow2 instance of /tmp/tf-19.09-py3.sif (PID=52407)
Stopping tensorflow instance of /tmp/tf-19.09-py3.sif (PID=52762)
--
DavidLeBlanc - 2019-12-16