# Doing BLAST on your local computer *By C.Du [@snail123815](https://github.com/snail123815)* ```{contents} --- depth: 3 --- ``` BLAST: Compare & identify sequences ## Basic sequence format - FASTA A fasta file is a **text file** that begins with a single-line description, followed by lines of sequence data. Introducing `.fasta`/`.fa` ``` >DNA_Sequence_ID_1 ATCGCCGGTGACGTCTGCGTC >DNA_Sequence_ID_2 TGGGCTAGCTAGTTATCTCCGCGCGGAAT AGCTCTATTGTGTGTATTATAT >DNA_Sequence_ID_3 TCGATCTCTTCTTATATATGCGCGGATCTAGGCTATATTCGATCGTAGCTA ``` ``` >Prot_Sequence_ID VAETLKKGSRVTGAARDKLAADLKKKYDAGASIRALAEETGRSYGFVHRMLSESGVTLRG RGGATRGKKATSA* ``` :::{NOTE} 1. Sequence data is always connected, neither 'space' nor 'line break' creates a gap. 2. One or more sequence can be in one file 3. ID can contain 'space', but not 'line break' 4. Protein may contain '*' representing stop codon, but not necessarily ::: Variants of file extensions: - `.fna` fasta for Nucleotide (DNA and RNA) - `.frn` fasta for RNA (RNA coding) - `.faa` fasta for protein (Amino acid) - `.ffn` fasta for protein coding DNA ## How BLAST works - BLAST comes with multiple programs for different tasks - blastn - Query: nucleic acid sequence(s) - Database: nucleic acid sequences - blastp - Query: amino acid sequence(s) - Database: amino acid sequences - blastx - Query: nucleic acid sequence(s) - Automatically translate **query** to amino acid sequences (all frames) - Database: amino acid sequences - tblastn - Query: amino acid sequence(s) - Database: nucleic acid sequences - Automatically translate **database** to amino acid sequences (all frames) - tblastx - Query: nucleic acid sequence(s) - Automatically translate **query** to amino acid sequences (all frames) - Database: nucleic acid sequences - Automatically translate **database** to amino acid sequences (all frames) - Query - A fasta file/text (one or multiple sequences) - Database (Target) - On NCBI website: RefSeq? Non-redudant? WGS? - Local - Custom genome(s) - Custom proteome(s) - Any combination of sequences of same type (Nucleotide or protein) - Parameters - Number of max targets - Expect (*e-value*) - Word size (length of intial exact match) - Output/Download format - etc. Check [Basic BLAST arguments](#basic-blast-arguments) :::{NOTE} You do NOT have a program called `blast`! Only `blastn`, `blastp` etc. ::: ## Setup command line BLAST tool set Like other programs, you have to install BLAST tool set in your local computer to do BLAST locally. ### Setup BLAST using Conda or Micromamba Check [how to install `conda` (`micromamba`)](micromamba.md). ```shell conda activate # this will activate base environment. You can also specify which environment of course. conda install blast -c bioconda ``` ### Setup Windows machine in bash environment (git for windows) 1. Install [https://gitforwindows.org/](https://gitforwindows.org/) first, you get access to `tar` command. 2. Download portable blast from NCBI: - [https://blast.ncbi.nlm.nih.gov/Blast.cgi](https://blast.ncbi.nlm.nih.gov/Blast.cgi) -> Download BLAST - ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ -> ncbi-blast-2.13.0+-x64-win64.tar.gz 3. Open 'Git Bash', change directory to the download files, do `tar xzf ncbi-blast-2.13.0+-x64-win64.tar.gz` - Equivalent to `tar -x -z -f ncbi-blast-2.13.0+-x64-win64.tar.gz` - `-x` To unpack the package (`.tar`) - `-z` Tells the program that this is a compressed package and the compression format is "gzip" (`.gz`) - `-f [file]` Specify the file to operate. 4. Copy everything from `ncbi-blast-2.13.0+/bin` to `/usr/bin` directory. ```shell $ cd /c/User/[name]/Downloads/ $ ls ncbi-blast-2.13.0+-x64-win64.tar.gz ncbi-blast-2.13.0+/ $ cp /c/User/[name]/Downloads/ncbi-blast-2.13.0+/bin/* /usr/bin/ $ blastn -h USAGE blastn [-h] [-help] [-import_search_strategy filename] [-export_search_strategy filename] [-task task_name] [-db database_name] ... ``` ### Setup BLAST on MacOS/Linux #### Solution 1 Download the program and put it in your environment variables. 1. Download portable blast from NCBI: - [https://blast.ncbi.nlm.nih.gov/Blast.cgi](https://blast.ncbi.nlm.nih.gov/Blast.cgi) -> Download BLAST - ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ - ncbi-blast-2.13.0+-x64-macosx.tar.gz - ncbi-blast-2.13.0+-x64-linux.tar.gz 2. Open terminal, change directory to download folder, do `tar xzf ncbi-blast-2.13.0+-x64-win64.tar.gz` - Equivalent to `tar -x -z -f ncbi-blast-2.13.0+-x64-win64.tar.gz` - `-x` To unpack the package (`.tar`) - `-z` Tells the program that this is a compressed package and the compression format is "gzip" (`.gz`) - `-f [file]` Specify the file to operate. 3. Put the directory `ncbi-blast-2.13.0+` in a location you can remember, eg. at your home directory `~`: ```shell # In this example, we will put blast directory # "ncbi-blast-2.13.0+" in your home directory "~" $ cp -r ~/Downloads/ncbi-blast-2.13.0+ ~/ $ ls ~/ ncbi-blast-2.13.0+/ $ ls -l ~/ncbi-blast-2.13.0+/ total 72 -rw-r--r--@ 1 bob staff 85 Feb 23 2022 ChangeLog -rw-r--r--@ 1 bob staff 1055 Feb 23 2022 LICENSE -rw-r--r--@ 1 bob staff 465 Feb 23 2022 README drwxr-xr-x@ 30 bob staff 960 Feb 23 2022 bin drwxr-xr-x@ 4 bob staff 128 Mar 7 2022 doc -rw-r--r--@ 1 bob staff 23845 Feb 23 2022 ncbi_package_info # Note: double check the spell! The special charaters are # important in this export command. # If you did not put ncbi-blast-2.13.0+ folder as the example # above, please change the full path! $ BLAST_BIN_FULL_PATH=~/ncbi-blast-2.13.0+/bin/ ``` 4. Now we need to tell bash that the `bin` directory is where we stored some program, so that when we want to use `blastn` etc, bash can find them. To do this, we need to add the full path to this directory to an environment variable called "PATH". ```shell # 1. Let's check current environment variable PATH $ echo $PATH /usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin # 2. Put full ncbi bin path to the front of it $ export PATH="$BLAST_BIN_FULL_PATH:$PATH" # 3. Check if your export command succeeded $ echo $PATH /usr/home/bob/ncbi-blast-2.13.0+/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin # Note the full path of ncbi-blast-2.13.0+/bin is now the # first item splited by colon ":" ``` :::{NOTE} You have to run the export command `export PATH="~/ncbi-blast-2.13.0+/bin/:$PATH"` after every restart of your terminal. If you want it to be permanent, you can put this command in a file that will be run every time when your shell starts: ```shell # FOR # MacOS default shell is zsh SHELL_CONFIG_FILE=~/.zshrc # All other systems using bash SHELL_CONFIG_FILE=~/.bashrc # Put the export command echo 'export PATH="~/ncbi-blast-2.13.0+/bin/:$PATH"' >> $SHELL_CONFIG_FILE ``` For MacOS only, the downloaded `blastn`, `blastp` ... programs are treated as dangerous programs which MacOS will prevent it to run. If you got this notification while running any blast program eg. `blastn -h`: ![warnning](../_static/images/Macos_warn_blast_1.png) Please click "Cancel", then go to your "System Settings" -> "Privacy & Security" -> Tab "Security". You will see somthing like: ![allow anyway](../_static/images/Macos_warn_blast_2.png) Then you can click "Allow Anyway" and follow instructions to free our blast program from jail. (Keep "System Settings" open, you will need it again for all blast programs) Now you may try the program again eg. `blastn -h`, you will see another warning message: ![open the program](../_static/images/Macos_warn_blast_3.png) ::: #### Solution 2 Use a package manager. ##### Linux Need **sudo** right. Ubuntu and other Debian based system: ```shell sudo apt install ncbi-blast+ ``` Centos or other RedHat based system: Download `ncbi-blast-2.13.0-1.x86_64.rpm` from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ ```shell sudo dnf install ~/Downloads/ncbi-blast-2.13.0-1.x86_64.rpm ``` ! not tested, but should work across platforms and do not need sudo right. ```shell pkcon install ncbi-blast+ ``` ##### MacOS Install [homebrew](https://brew.sh/) (need **sudo** right) ```shell /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" ``` Then install BLAST, you do not need sudo right anymore. ```shell brew install blast ``` ## Run your first BLAST job locally ### Make a database Obtain sequences you want to target. 1. Search an organism on NCBI, download its assembly - Genome / CDS / Proteins / RNA ... 2. Search a term on NCBI, download all matching sequences - Protein and Nucleotide sequences - Note "Gene" is information, "Nucleotide" is the sequence you want 3. Search a term on UniProt, download all matching sequences 4. "Write" a text fasta file by yourself - Make it up - Combine some of your downloaded sequences :::{TIP} "Protein" and "Translated CDS" are different. "Protein" has higher certainty as predicted that they can be translated into real proteins. "Gene" is information, "Nucleotide" is the sequence you want. ::: #### Example Download predicted proteome of *Streptomyces coelicolor* A3(2) from NCBI RefSeq database, make a BLAST database containing all possible proteins. 1. Go to [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov), search "Streptomyces coelicolor" in the search box. 2. Scroll down, in the "Genomes" box, click "Assembly". 3. In the left column, untick "Latest", tick "Latest RefSeq" and "Complete genome" 4. From the search results (middle column), you need to find "Organism: Streptomyces coelicolor A3(2)" as this is the model strain of this organism. Click on the accession link. 5. Upper right corner, you can find "Download Assembly" button, click on it, choose "RefSeq" as "Source database" and choose "Protein FASTA (.faa)" as "File type". The "Estimated size" should **not** show "0". Then click the "Download" button. 6. You have downloaded a file "genome_assemblies_prot_fasta.tar", which is a "tar ball". A "tar ball" is simply a package of files. 7. Go to your terminal, navigate to the "Downloads" folder containing this file, extract the package by running `tar` command: ```shell $ cd ~/Downloads $ ls genome_assemblies_prot_fasta.tar $ tar xf genome_assemblies_prot_fasta.tar $ ls genome_assemblies_prot_fasta.tar ncbi-genomes-2022-12-09/ $ cd ncbi-genome-2022-12-09 $ ls GCF_008931305.1_ASM893130v1_protein.faa.gz README.txt md5checksums.txt ``` 8. Note the file `GCF_008931305.1_ASM893130v1_protein.faa.gz`, it is your downloaded protein sequences (`.faa`) but it has an extra extension `.gz` meaning it is "**g** zipped". Before we can use it, we need to **unzip**: ```shell $ gzip -d GCF_008931305.1_ASM893130v1_protein.faa.gz # -d tells the program to "decompress", ie. unzip. $ ls GCF_008931305.1_ASM893130v1_protein.faa README.txt md5checksums.txt ``` Note that the `.gz` extension has gone. 9. Make database use `makeblastdb` command: ```shell $ makeblastdb -in GCF_008931305.1_ASM893130v1_protein.faa.gz -dbtype prot Building a new DB, current time: 13/09/2022 10:50:56 New DB name: /Users/bob/Downloads/ncbi-genomes-2022-12-09/GCF_008931305.1_ASM893130v1_protein.faa New DB title: GCF_008931305.1_ASM893130v1_protein.faa Sequence type: Protein Keep MBits: T Maximum file size: 3000000000B Adding sequences from FASTA; added 7511 sequences in 0.157292 seconds. $ ls GCF_008931305.1_ASM893130v1_protein.faa GCF_008931305.1_ASM893130v1_protein.faa.pjs GCF_008931305.1_ASM893130v1_protein.faa.pto GCF_008931305.1_ASM893130v1_protein.faa.pdb GCF_008931305.1_ASM893130v1_protein.faa.pot GCF_008931305.1_ASM893130v1_protein.faa.phr GCF_008931305.1_ASM893130v1_protein.faa.psq GCF_008931305.1_ASM893130v1_protein.faa.pin GCF_008931305.1_ASM893130v1_protein.faa.ptf README.txt md5checksums.txt ``` Note your database is made by generating many files with different extension, all of these files are needed for a complete blast database. :::{IMPORTANT} Which is "the database"? Usually when you are telling BLAST programs to do a search, you will need an argument specifying the path to your database, and you can only give **one**. But with this many files/paths, which one should we use? ! Your database is the **common** part of all generated files. In this example: `GCF_008931305.1_ASM893130v1_protein.faa`. ::: ### Make a query file Note you need to make a [FASTA file](#basic-sequence-format---fasta). Use a pure text editor (those you can not set format for your text) or switch your text editor to "plain text mode". You can copy and paste multiple sequences (with FASTA header line) into a single file. Or, if you remember `cat` command, you can generate a single query file containing multiple sequences by: ```shell # Assuming you have the sequences of two proteins downloaded as fasta file: cat protein_A.faa protein_B.faa > proteins.faa ``` To follow the example above, let's make a protein file directly from command line: ```shell $ echo ">protein_A VAETLKKGSRVTGAARDKL >protein_B MPQHTSGSDRAAIPPAARDGGSVRPPAPSTLDELWRSYKTTGDERLREQLILHYSPLVKY VAGRVSVGLPPNVEQADFVSSGVFGLIDAIEKFDVDREIKFETYAITRIRGAMIDELRAL DWIPRSVRQK" > proteins.faa ``` ### Run a BLAST job Basic syntex and output: ```shell blastn -query path/to/query.fasta -db path/to/database/file ``` To follow the example above: ```shell $ blastp -query protins.faa -db GCF_008931305.1_ASM893130v1_protein.faa BLASTP 2.13.0+ Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. ... Database: GCF_001445655.1_ASM144565v1_prot 7,197 sequences; 2,424,953 total letters Query= protein_A Length=19 Score E Sequences producing significant alignments: (Bits) Value WP_030366669.1 MULTISPECIES: helix-turn-helix domain-containing p... 35.4 5e-06 ... Query= protein_B Length=130 Score E Sequences producing significant alignments: (Bits) Value WP_058046485.1 MULTISPECIES: RNA polymerase sigma factor WhiG [St... 243 2e-83 ... Matrix: BLOSUM62 Gap Penalties: Existence: 11, Extension: 1 Neighboring words threshold: 11 Window for multiple hits: 40 ``` If you want to put the result in a file called `proteins_blastResult.txt`, the following two commands are equivalent: ```shell blastp -query protins.faa -db GCF_008931305.1_ASM893130v1_protein.faa > proteins_blastResult.txt blastp -query protins.faa -db GCF_008931305.1_ASM893130v1_protein.faa -out proteins_blastResult.txt ``` ### Basic BLAST arguments Short help `blastp -h`, full help can be obtained by `blastp -help`. In this example command: ```shell blastp -db GCF_008931305.1_ASM893130v1_protein.faa \ -query proteins.faa \ -outfmt 6 \ -max_target_seqs 2 \ -max_hsps 1 \ -evalue 1e-12 \ -word_size 7 \ -num_threads 4 \ > proteins_blastResult.txt ``` - `-outfmt 6` Tabular output - `-max_target_seqs 2` Report max **two** match per sequence (when `-outfmt` > 4) - If you set `-outfmt` ≤ 4, set both `-num_descriptions` and `-num_alignments` to 2 - `-max_hsps 1` For each match only show **one** aligned part - `-evalue 1e-12` Set expectation value threshold to < 1 × 10-12 - `-word_size 7` Set initial exact match to be a lenth of 7 (default 11) (for finding CRISPR off-targets) - `-num_threads 4` Use 4 threads for parallel computing For more output tips visit: ```{toctree} --- maxdepth: 1 --- tabular_output_guide