Execute programs¶
By C.Du @snail123815
We manage our software using Conda-compatible environments, which have become a standard in small scale server configurations. By using environments, we can easily manage software dependencies and avoid conflicts between different software versions. More info.
On BLIS, we use a tool called micromamba, a lightweight and efficient alternative to conda, to manage these environments. A shared directory /vol/local/conda_envs dedicated to storing, sharing, and modifying environments is created on BLIS. By default, all users on BLIS should be in a group called condablis. This group grant users access to our shared environments.
All users should be in condablis group, you can use groups command to check:
[user@blis ~]$ groups
sgr condablis
Prepare micromamba¶
For new users, micromamba will be automatically setup at your first login to a server, if you see (base) in front of your prompt login and run micromamba info successfully, then you are good to go:
(base) [user@blis ~]$ micromamba info
libmamba version : 2.5.0
micromamba version : 2.5.0
curl version : libcurl/8.18.0 OpenSSL/3.5.4 zlib/1.3.1 zstd/1.5.7 libssh2/1.11.1 nghttp2/1.67.0 mit-krb5/1.21.3
libarchive version : libarchive 3.8.5 zlib/1.3.1 bz2lib/1.0.8 libzstd/1.5.7 libb2/bundled
envs directories : /vol/local/conda_envs
/home/<username>/.micromamba/envs
package cache : /vol/local/.conda_cache/<username>
environment : base (active)
env location : /home/<username>/.micromamba
user config files : /home/<username>/.mambarc
populated config files : /home/<username>/.mambarc
virtual packages : __unix=0=0
__linux=5.14.0=0
__glibc=2.34=0
__archspec=1=x86_64_v3
channels : https://conda.anaconda.org/bioconda/linux-64
https://conda.anaconda.org/bioconda/noarch
https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
base environment : /home/<username>/.micromamba
platform : linux-64
It means you have micromamba setup ready for use, you can execute an already installed program. Otherwise, you need to do some configuration before you can activate and create environments.
Note, you only need to do this once on one server.
Here is how (assuming you are using default shell bash):
[user@blis ~]$ micromamba shell init -s bash -r ~/.micromamba
Above command will create a “base” environment at your home directory. Then a script will be put into your ~/.bashrc file. Now you need to apply the setup by “sourcing” the ~/.bashrc file:
[user@blis ~]$ source ~/.bashrc
Next time when you start your bash shell (at login), the script will be automatically executed.
Now you should be able to run this command without error:
[user@blis ~]$ micromamba info
You should see the output similar to the one shown above. You may see less if the ~/.mambarc file is not fully populated as shown below. If you see errors, please try to restart your shell (restart ssh connection). Ask for help with screen shots if still no success.
New users should have ~/.mambarc file with the following content:
envs_dirs:
- /vol/local/conda_envs
pkgs_dirs:
- /vol/local/.conda_cache/<username>
channels:
- bioconda
- conda-forge
- defaults
auto_activate_base: true
The ~/.mambarc file is further explained in setting up config file section.
Execute an already installed program¶
In the base system, there is no bioinformatics related program. You cannot directly run programs like phylophlan when not in a corresponding environment. Activate the environment first:
(base) [user@blis ~]$ micromamba activate /vol/local/conda_envs/phylophlan
# Note your prompt changes indicating you changed environment
(/vol/local/conda_envs/phylophlan) [user@blis ~]$ phylophlan -v
PhyloPhlAn version 3.0.67 (24 August 2022)
You may notice the prefix of the environment /vol/local/conda_envs. We store all working environments here. You can check if the environment for target software already exists or not.
# `ls` - add forward slash in the end to list items in target directory,
# `|` - pipe output to the following `grep` which filter the input by `phylophlan`
(base) [user@blis ~]$ ls /vol/local/conda_envs/ | grep phylophlan
phylophlan
Many programs are dependencies of others. You may not need to create a dedicated environment for them. Using diamond as an example:
(base) [user@blis ~]$ find /vol/local/conda_envs/ -type f -name diamond -executable
/vol/local/conda_envs/bakta/bin/diamond
/vol/local/conda_envs/antismash8_cda1015/bin/diamond
/vol/local/conda_envs/cblaster/bin/diamond
/vol/local/conda_envs/phylophlan/bin/diamond
/vol/local/conda_envs/quasan/bin/diamond
/vol/local/conda_envs/decrippter/bin/diamond
Try activate a few (/vol/local/conda_envs/bakta, /vol/local/conda_envs/cblaster for example) and see if it fits your need.
Case-insensitive search, which uses -iname instead of -name; wild-card matching:
(base) [user@blis ~]$ find /vol/local/conda_envs -type f -iname macs* -executable
/vol/local/conda_envs/macs3/bin/macs3
If the target environment or program or specific version of a program is not found, you may need to create an environment and install the software yourself. For detailed instructions, please check program setup. You can install some frequently used small programs in your base environment. If it is activated by default, you can use them directly.
Do not analyse large dataset in your home directory¶
One IBL server may be powerful, but not when used by many colleagues, especially for data storage. Applying a quota system on the home directory ~/ is necessary for sustainable use. It means that you cannot use your ~/ freely. As a general rule when using Linux servers, you will find a quota system on almost all shared servers, including ALICE.
To check how much quota has left for you:
# Note you can only see the information when you are in ~ or any of its subdirectories.
[user@blis ~]$ quota -s
Disk quotas for user username (uid 148600000):
Filesystem space quota limit grace files quota limit grace
/dev/mapper/rl-home
16073M 20480M 25600M 154k 0 0
# Meaning you have used 16GB of the 20GB.
# You can store up to 25GB but after exceeding 20GB,
# you will be given a grace time of few days, then
# your home directory will be locked until you remove excess data
Plan and notify others before execute long program¶
Once you have done program installation, it is time to plan a real job. Please follow these steps:
Check the “jobs-on-servers” channel in our Slack group for pinned messages.
Log in to the system and use htop or top to check for heavily running processes. Notify the channel if you see any anomalies.
Run your job. If it will take a long time, run it in a
screensession and check its status.Inform everyone else in the “jobs-on-servers” channel using this format:
“{Job type} on {SERVER} is using {number} cores, (approximately, optional) {number} ram, until ~(approximately) {time}.”
For example, “phylophlan on BLIS is using 8 cores, 100 GB ram, until ~20 Nov 9:00.” Pin this message if the job list is a bit long.Once your job is done, edit the message to include a big “DONE” at the beginning and unpin it.
Rule of thumb
For any job, leave at least 2 cores free.
For short jobs, use as many cores as possible.
For long jobs, use maximum half of the avaliable cores. If more cores are needed, notify everyone at least half hour in advance. For example, “PLANNING phylophlan on BLIS, is using 18 cores, 160 GB ram, from 19 Nov 20:00 till ~20 Nov 9:00.”
File transfer¶
Most basic skill for file transfer is to identify the location of your data and the destination. Terminology may differ, here we use:
Local: the computer you are using to connect to the server, which can be your laptop or a desktop computer in the lab.
Server: the computer you are connecting to, which is IBL servers or ALICE in our case.
Cloud: the storage or computing resources provided by a third-party company, such as AWS, Google Cloud, Azure etc. Here it often refers to Research Drive.
To and from cloud¶
Since Research Drive is the recommended location for storing research data, it will be easy to keep research data just in the cloud and do analysis on the server. To transfer data between them, check how to upload data to Research Drive using Rclone.
Between local and server¶
To transfer your files from or to the server, you need to have already set up an SSH connection BLIS (blis). For MobaXterm users, you can drag and drop in the app’s file explorer. For MobaXterm, it is on the left.
SCP and rsync are recommended.
SCP¶
SCP for secure copy, or ssh cp, or safe cp…
When you have few files to copy, this is a good tool. It cannot resume transfer after transfer failure.
The syntax for scp is
scp [OPTIONS] [[user@]src_host:]file1 [[user@]dest_host:]file2
Same as command ssh, since you have already a config with user name and IP address for Host blis and Host hpc1, then the [[user@]dest_host:] part can be replaced with blis: or hpc1:.
# A file
scp path/to/file blis:/vol/local/username/
# Multiple files
scp path/to/fileA path/to/fileB blis:/vol/local/username/
# A directory, -r for recursive
scp -r path/to/dir blis:/vol/data/username
rsync¶
rsync is a more reliable tool for copying large datasets. It can continue the transfer from where it left off in case of interruption. (PowerShell does not have this program, use scp instead.)
The syntax for rsync is almost the same as scp:
rsync [OPTION...] SRC... [USER@]HOST:DEST
# Copy a directory to your own folder in the shared storage on BLIS
rsync -aP path/to/dir blis:/vol/data/username/
-a means archiving the directory, it will copy everything associated with the file, including its creation time, permissions etc. It implies -rlptgoD.
-P to show progress.
How to run long jobs¶
Your terminal connection to the server may be interrupted for various reasons, such as network issues, power failure, or simply closing the terminal by mistake. If you are running a long job in the terminal, it will be killed once the connection is lost. To avoid this, you can use tools like screen or tmux to run your job in a session that can be detached and re-attached later. This way, even if your connection is interrupted, your job will continue running in the background, and you can reconnect to it later to check its status or view the output.
Please try not use tmux or screen for long jobs on ALICE. All work should only run in a slurm queue.
Using tmux¶
Tmux is a similar program, allows you to run a shell in the background, it should be available on all servers, let administrators know if not.
tmux new -s <session_name>will create a session with the name you specifyCtrl + bis the magic key stroke in tmux,Ctrl + b, release the keys, and then pressdwill “detach” from the terminal and put it in the background.tmux a -t <session_name>will “re-attach” the session.
Using screen¶
To use screen, simply execute the command, and a new shell will open. From here, you can execute commands as usual.
screen -S <session_name>will create a session with the name you specify. You can have multiple sessions running at the same time, and this will help you to keep track of them.Ctrl + ais the majic key stroke in screen.Ctrl + a, release the keys, and then pressd. This will “detach” from the terminal and leave the program running in the background.To re-attach to the session, use
screen -r <session_name>.
Run a command in an environment without activation¶
Sometimes you may need to run a program in a script, which may require multiple different software environments. You can activate them in a script, but this is not possible in Slurm system, for example running the command in a sub-shell, then you need to know how to run a program in one command without activating the environment in your script (when it does not work).
There are two ways, given as examples:
Recommended:
For
micromamba:cmd="eval \"\$(micromamba shell hook --shell=posix)\" && micromamba activate /vol/local/conda_envs/bakta && echo \$BAKTA_DB" echo $cmd eval $cmd
For
conda:cmd="eval \"\$(conda shell.posix hook)\" && conda activate /vol/local/conda_envs/bakta && echo \$BAKTA_DB" echo $cmd eval $cmd
Not recommended:
It does not respect activation hooks (
activate.d).For
micromamba:cmd="micromamba run -p ~/genvs/bakta echo \$BAKTA_DB" echo $cmd eval $cmd
Same for
conda, just replacemicromamba run -pwithconda run -p.