Intro

Because there is a pressing need to use containers in one specific way, this document begins within that specific context and then branches out from there.

Why Singularity and not Docker?

Docker is a fine container abstraction system; however, it requires administrator privileges to run. Further, once an image is running, Docker's default preference is to keep the input and output resources relevant to the container inside the container itself. Singularity overcomes these two issues by running images as the user invoking them and by exposing the host's local filesystem from within the running container. More detailed information can be found in the Singularity documentation.
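
As a quick illustration of both points, here is a minimal sketch (the image name is a placeholder; any SIF image, including the Tensorflow image we build below, behaves the same way). Commands run as the user who invoked them, and the host's home directory is visible from inside the container:
>singularity exec ./some-image.sif id -un
<your username>
>singularity exec ./some-image.sif ls $HOME
<the contents of your home directory on the host>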

Build a container

In this context, we would like to be able to run Tensorflow. Singularity allows us to obtain an existing container from a Singularity Container Library, Singularity Hub, or Docker Hub. In the case of Tensorflow, we'll download one of NVidia's premade Docker containers, which Singularity will convert into its own format. First, we need access to NVidia's library of Docker containers, located here: Nvidia NGC

This is initially done by requesting access through the Create Account link. Once the account is granted, you will need an API key.

Once the API key is obtained, make it available to the shell:
   export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
   export SINGULARITY_DOCKER_PASSWORD=<API key>
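
Note that nvcr.io expects the literal string $oauthtoken as the username, which is why it is single-quoted above so the shell does not expand it; replace <API key> with the key issued by NGC. A quick sanity check that both variables are set:
>env | grep SINGULARITY_DOCKER
SINGULARITY_DOCKER_USERNAME=$oauthtoken
SINGULARITY_DOCKER_PASSWORD=<your API key>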

Now you may log into the Nvidia NGC and browse for containers. In this example, we'll download Tensorflow, which NGC lists as:
nvcr.io/nvidia/tensorflow:19.09-py3

By default, singularity writes the entire pulled container into a cache in your home directory, where it is specific to your account, hidden, and available only to you. To avoid that, we'll also write it out to a file in Singularity Image Format (SIF). So, for singularity, we pull the image as follows:
>singularity pull tf-19.09-py3.sif docker://nvcr.io/nvidia/tensorflow:19.09-py3

Singularity will then download each blob layer of the tensorflow:19.09-py3 tag and cache it in your home directory.
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:35c102085707f703de2d9eaad8752d6fe1b8f02b5d2149f1d8357c9cc7fb7d0a
 25.45 MiB / 25.45 MiB [====================================================] 1s
Copying blob sha256:251f5509d51d9e4119d4ffb70d4820f8e2d7dc72ad15df3ebd7cd755539e40fd
 34.54 KiB / 34.54 KiB [====================================================] 0s
Copying blob sha256:8e829fe70a46e3ac4334823560e98b257234c23629f19f05460e21a453091e6d
 848 B / 848 B [============================================================] 0s
Copying blob sha256:6001e1789921cf851f6fb2e5fe05be70f482fe9c2286f66892fe5a3bc404569c
 162 B / 162 B [============================================================] 0s
Copying blob sha256:109c7cec1178b6d77b59e8715fe1eae904b528bb9c4868519ff5435bae50c44c
 8.63 MiB / 8.63 MiB [======================================================] 0s
<sniped for brevity>
Copying blob sha256:df0616c98153f10db5b05626f314000ced06fb127a6cfaac37ff325d11488327
 452 B / 452 B [============================================================] 0s
Copying blob sha256:7828d14d7927458867242e0274e706b1c902eb6b1207e5a28337a7fbc6400e17
 209.76 KiB / 209.76 KiB [==================================================] 0s
Copying blob sha256:ae4e96299b2fca5dc08edbde4061cfdf25c313954a7d46a32dfc701af4e0edf6
 9.64 KiB / 9.64 KiB [======================================================] 0s
Copying config sha256:ad94fc3cd170246d925f8c5aa8000b3e6cbe645d30ff185bf87b40cee8f41d32
 33.55 KiB / 33.55 KiB [====================================================] 0s
Writing manifest to image destination
Storing signatures
INFO:    Creating SIF file...
INFO:    Build complete: /home/leblancd/.singularity/cache/oci-tmp/52706b896af93123eb7f894f9671bbd31a1cabd15943aa3334eb7af3a710d262/tensorflow_19.09-py3.sif

Now our container file is complete:
-rwxrwxr-x  1 leblancd supergroup 3.3G Oct  4 18:39 tf-19.09-py3.sif

Cache

Before we move on, note that singularity also stored the SIF file in the user's cache in the home directory (there is now a 3.3G file there, plus some overhead). You can view what is stored in the cache:
>singularity cache list
NAME                     DATE CREATED           SIZE             TYPE
tensorflow_19.09-py3.s   2019-10-04 18:39:01    3.45 GB          oci

There 1 containers using: 3.45 GB, 68 oci blob file(s) using 3.70 GB of space.
Total space used: 7.15 GB

Clearing the cache is equally simple; the -a parameter removes all images in the cache:
>singularity cache clean -a

By default, Singularity caches pulled files in your home directory with the following structure:
$HOME/.singularity/cache/library
$HOME/.singularity/cache/oci
$HOME/.singularity/cache/oci-tmp
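
A quick way to see how much space each of these directories is using (a simple sketch with standard shell tools; the sizes shown are placeholders):
>du -sh $HOME/.singularity/cache/*
<size>   /home/<user>/.singularity/cache/library
<size>   /home/<user>/.singularity/cache/oci
<size>   /home/<user>/.singularity/cache/oci-tmp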

You can change the default location of the cache by setting SINGULARITY_CACHEDIR to your desired location, for example (in a bash shell):
echo $SINGULARITY_CACHEDIR

export SINGULARITY_CACHEDIR=/tmp/leblancd_sing_cache

echo $SINGULARITY_CACHEDIR
/tmp/leblancd_sing_cache
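
To make this setting persist across logins, one option (assuming your login shell is bash) is to append the export to your ~/.bashrc:
>echo 'export SINGULARITY_CACHEDIR=/tmp/leblancd_sing_cache' >> ~/.bashrc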

Build a container from scratch

Now let's say we built this image but discovered it is missing software that we need. We have a few options at this point, depending on how permanent we would like the addition to be. One solution is to build a new image from the image we pulled and built locally (above).

Since creating an image from a definition requires root privilege, we cannot do this locally and need to employ the Remote Builder service. There, once we have created an account, retrieved an API token (good for 30 days), and saved it to ~/.singularity/sylabs-token, we can proceed to build our new image from a definition file.
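
A minimal sketch of saving the token, assuming you have copied it from the Sylabs web interface (the destination path is the one Singularity expects, as noted above):
>mkdir -p ~/.singularity
>cat > ~/.singularity/sylabs-token
<paste the token, press Enter, then Ctrl-D>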

Using our image from above, let's install 'astropy' into the image. We create a definition file named tensorflow-astropy.def:

Bootstrap:  docker
From: nvcr.io/nvidia/tensorflow:19.09-py3

%post
  apt-get -y update
  pip install astropy

%environment
  export LC_ALL=C
  export PATH=/usr/local/bin:/usr/bin:/usr/sbin:$PATH

Without going too deep into technical details, our definition begins with Bootstrap, which every definition file requires; it names the bootstrap agent: docker for Docker-format images (whether from Docker Hub or another registry such as nvcr.io), and library for Container Library images.
From: tells the Remote Builder where to find the foundational image to build upon.
Anything in the %post section runs after the initial image has been pulled and built: apt-get -y update refreshes the container's package lists, and pip install astropy installs the Python module "astropy" into the container.
In the %environment section we set the locale (LC_ALL=C) and the $PATH so that the system knows where pip is located.

For more information about building images from scratch and about definition files, see the Sylabs documentation.

With our definition file, tensorflow-astropy.def, we can execute:

singularity build --remote tf-astropy.sif tensorflow-astropy.def

which will submit our definition file to the Remote Builder site and build our new image for us.
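
Once the build finishes and tf-astropy.sif has been written to the current directory, a quick sanity check (a sketch, assuming the remote build succeeded) confirms that astropy made it into the image:
>singularity exec tf-astropy.sif python -c "import astropy; print(astropy.__version__)"
<the installed astropy version>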

Interacting with the Image

exec

To run the container we just built, we need to decide how we would like to execute it. We can use exec on the container, which runs a command inside the container:
>singularity exec tf-19.09-py3.sif cat /etc/os-release
  NAME="Ubuntu"
  VERSION="18.04.3 LTS (Bionic Beaver)"
  ID=ubuntu
  ID_LIKE=debian
  PRETTY_NAME="Ubuntu 18.04.3 LTS"
  VERSION_ID="18.04"
  HOME_URL="https://www.ubuntu.com/"
  SUPPORT_URL="https://help.ubuntu.com/"
  BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
  PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
  VERSION_CODENAME=bionic
  UBUNTU_CODENAME=bionic

This will execute any command from within the container, and then immediately exit.
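
For example, a one-line check of the Tensorflow version shipped in the image (importing the module does not require GPU access; see the --nv discussion below for GPU use):
>singularity exec tf-19.09-py3.sif python -c "import tensorflow as tf; print(tf.__version__)"
<the Tensorflow version in the container>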

shell

We can also obtain a shell within the container:
>singularity shell tf-19.09-py3.sif
  Singularity tf-19.09-py3.sif:/tmp> ls -al
  total 3366992
  drwxrwxrwt 10 root     root             4096 Oct  4 18:45 .
  drwxr-xr-x  1 leblancd supergroup         80 Oct  4 18:51 ..
  -rwxrwxr-x  1 leblancd supergroup 3447758848 Oct  4 18:39 tf-19.09-py3.sif

Notice that the current working directory is /tmp, which is outside the container. When Singularity executes, it attempts to make your existing filespaces available inside the container, such as your home directory and common directories on the physical host (such as /tmp); everything else you see comes from the files and directories inside the container. This makes it rather easy to write any output from the container to the local filesystem, or to an NFS-mounted home directory.
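
If you need a host directory that is not made available by default, you can bind it explicitly with --bind (a sketch; /scratch/mydata is a placeholder path, mapped to /data inside the container):
>singularity shell --bind /scratch/mydata:/data tf-19.09-py3.sif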

Now that we're interactively running the container, we can attempt to execute Tensorflow.
Singularity tf-19.09-py3.sif:/tmp> python
Python 3.6.8 (default, Aug 20 2019, 17:12:48)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Could not open PYTHONSTARTUP
FileNotFoundError: [Errno 2] No such file or directory: '/etc/pythonstart'
>>> import tensorflow as tf
2019-10-04 18:57:01.232584: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
>>> hello = tf.constant('Testing TF')
>>> sess = tf.Session()
2019-10-04 18:57:40.770169: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2019-10-04 18:57:40.770224: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2019-10-04 18:57:40.773897: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: ml-login3
2019-10-04 18:57:40.773919: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: ml-login3
2019-10-04 18:57:40.773959: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2019-10-04 18:57:40.774016: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 435.21.0
2019-10-04 18:57:40.789291: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494100000 Hz
2019-10-04 18:57:40.794142: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4c48fa0 executing computations on platform Host. Devices:
2019-10-04 18:57:40.794172: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
>>>

Uh oh. The container couldn't locate the CUDA library necessary to execute the Tensorflow Python module in the container.

nv

This is because when the singularity shell was launched, it was not made aware of any NVidia software, and did not map any NVidia drivers/devices into the container. This is easily remedied with the --nv parameter.
>singularity shell --nv tf-19.09-py3.sif
Singularity tf-19.09-py3.sif:/tmp> python
Python 3.6.8 (default, Aug 20 2019, 17:12:48)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2019-10-04 19:00:12.417390: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
>>> hello = tf.constant('Testing TensorF')
>>> sess = tf.Session()
2019-10-04 19:00:38.980743: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-10-04 19:00:39.767134: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:84:00.0
2019-10-04 19:00:39.767193: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-10-04 19:00:39.825552: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-10-04 19:00:39.854719: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-10-04 19:00:39.868698: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-10-04 19:00:39.929108: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-10-04 19:00:39.944015: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-10-04 19:00:40.054598: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-04 19:00:40.057032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-04 19:00:40.072683: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494100000 Hz
2019-10-04 19:00:40.077704: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x50cd760 executing computations on platform Host. Devices:
2019-10-04 19:00:40.077723: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-10-04 19:00:40.172994: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x50d0e80 executing computations on platform CUDA. Devices:
2019-10-04 19:00:40.173041: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): TITAN RTX, Compute Capability 7.5
2019-10-04 19:00:40.174405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
   name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:84:00.0
2019-10-04 19:00:40.174453: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-10-04 19:00:40.174511: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-10-04 19:00:40.174548: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-10-04 19:00:40.174585: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-10-04 19:00:40.174619: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-10-04 19:00:40.174680: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-10-04 19:00:40.174719: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-04 19:00:40.177149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-04 19:00:40.180107: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-10-04 19:00:42.739483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-04 19:00:42.739536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-10-04 19:00:42.739545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-10-04 19:00:42.742016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device
   (/job:localhost/replica:0/task:0/device:GPU:0 with 22756 MB memory) -> physical GPU
   (device: 0, name: TITAN RTX, pci bus id: 0000:84:00.0, compute capability: 7.5)
>>> a = tf.constant(10)
>>> b = tf.constant(25)
>>> sess.run(a+b)
35

Container Reuse

Now that we have built a container from NVidia's image and demonstrated that it works, can someone else use our container?
Yes, as long as they don't need to write anything into the container space, because the container is mounted read-only by default. This is likely the expected behavior, as it would be undesirable for a container to allow anyone to write files into it, introducing unforeseen changes into the otherwise pristine image. (If a workflow really must write inside the container, see the note after the example below.)
Our example Tensorflow container above remains usable read-only, since we are not modifying the binaries/scripts involved in running Tensorflow itself, and any input or output data can live safely, and be accessed, outside the container.
Consider the following example:

We have a very simple (example) Python script:
>cat helloworld.py

import tensorflow as tf
hello = tf.constant('Hello, TF!')
sess = tf.Session()
print(sess.run(hello))

That is in the current directory with our Tensorflow container:
-rw-rw-r--  1 leblancd supergroup        101 Oct  4 19:30 helloworld.py
-rwxrwxr-x  1 leblancd supergroup       3.3G Oct  4 18:39 tf-19.09-py3.sif

We can use our helloworld.py script like so:
>singularity exec --nv ./tf-19.09-py3.sif python ./helloworld.py
2019-10-09 17:34:30.078770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
  name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:84:00.0
...snip...
2019-10-09 17:34:30.373594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-09 17:34:30.388525: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494100000 Hz
2019-10-09 17:34:30.393509: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56378c0 executing computations on platform Host. Devices:
2019-10-09 17:34:30.393530: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-10-09 17:34:30.488286: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5637f30 executing computations on platform CUDA. Devices:
2019-10-09 17:34:30.488342: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): TITAN RTX, Compute Capability 7.5
2019-10-09 17:34:30.489712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
  name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:84:00.0
...snip...
2019-10-09 17:34:30.492309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-09 17:34:30.495275: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-10-09 17:34:33.013974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-09 17:34:33.014043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-10-09 17:34:33.014054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-10-09 17:34:33.016631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device
   (/job:localhost/replica:0/task:0/device:GPU:0 with 22756 MB memory) -> physical GPU
   (device: 0, name: TITAN RTX, pci bus id: 0000:84:00.0, compute capability: 7.5)
b'Hello, TF!'

Likewise, we can redirect the output:
>singularity exec --nv ./tf-19.09-py3.sif python ./helloworld.py >> OUTPUT
...snip...
2019-10-09 17:59:52.089016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326]
  Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22756 MB memory)
  -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:84:00.0, compute capability: 7.5)
>

Then our OUTPUT file will be:
>ls -l
total 3367004
-rw-rw-r--  1 leblancd supergroup        101 Oct  9 17:30 helloworld.py
-rw-rw-r--  1 leblancd supergroup         14 Oct  9 17:59 OUTPUT
-rwxrwxr-x  1 leblancd supergroup 3447758848 Oct  9 17:09 tf-19.09-py3.sif*

>cat OUTPUT
b'Hello, TF!'
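
If a workflow really does insist on writing somewhere inside the container's filesystem, recent Singularity releases provide a --writable-tmpfs option that overlays a small, temporary writable layer for the duration of the run (a sketch; availability depends on the Singularity version installed, and anything written this way is discarded when the container exits):
>singularity exec --nv --writable-tmpfs ./tf-19.09-py3.sif python ./helloworld.py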

Instances

So far, we have run our container with shell or exec, which run the container in the foreground; the running container terminates when we exit, either by quitting the shell or when the command given to exec finishes.
We can, however, detach a container and run it as a "service". In Singularity terms, this is called an instance.

>singularity instance start --nv ./tf-19.09-py3.sif Tensorflow

INFO:    instance started successfully

Now the instance is running. We can launch as many instances as we like, though each one needs a unique name.
>singularity instance start --nv ./tf-19.09-py3.sif tensorflow
INFO:    instance started successfully
>singularity instance start --nv ./tf-19.09-py3.sif tensorflow2
INFO:    instance started successfully

To review what instances we have running:
>singularity instance list
INSTANCE NAME    PID      IMAGE
Tensorflow       52226    /tmp/tf-19.09-py3.sif
tensorflow       52358    /tmp/tf-19.09-py3.sif
tensorflow2      52407    /tmp/tf-19.09-py3.sif

NOTE: we launched our instances with --nv to expose the NVidia devices/drivers to them; you cannot add this after the container instance is already running.

Now, to access (or "attach" to) the running instances, we can use exec or shell as before, but target the instance like so:
>singularity shell instance://tensorflow2
Singularity tf-19.09-py3.sif:/tmp> nvidia-smi
Wed Oct  9 18:24:58 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN RTX           Off  | 00000000:84:00.0 Off |                  N/A |
| 21%   35C    P0    N/A /  N/A |      0MiB / 24220MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
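
exec works against a running instance in the same way; for example, to run our earlier script inside the instance we started with --nv (a sketch, assuming helloworld.py is still in the current directory):
>singularity exec instance://tensorflow python ./helloworld.py
...snip...
b'Hello, TF!'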

Finally, to terminate our instances:
>singularity instance stop Tensorflow
Stopping Tensorflow instance of /tmp/tf-19.09-py3.sif (PID=52226)

Or:
>singularity instance stop -a
Stopping tensorflow2 instance of /tmp/tf-19.09-py3.sif (PID=52407)
Stopping tensorflow instance of /tmp/tf-19.09-py3.sif (PID=52762)

-- DavidLeBlanc - 2019-12-16