# HPC ALICE BLASTP

*By Belmin Bajramovic [@B-Bajramovic](https://github.com/B-Bajramovic)*

This guide shows common BLASTP use cases using built-in BLAST options.

## Preparation before running BLAST

Load the BLAST module:

```sh
module load BLAST+/2.16.0-gompi-2024a
```

Set `BLASTDB` so that BLAST uses the correct database path:

```sh
export BLASTDB=/zfsstore/databases/NCBI
```

To apply both settings at every login, add them to your `~/.bashrc`:

```sh
echo 'export BLASTDB=/zfsstore/databases/NCBI' >> ~/.bashrc
echo 'module load BLAST+/2.16.0-gompi-2024a' >> ~/.bashrc
source ~/.bashrc
```

## 1. Single protein vs entire nr database

Use this when you want to identify a protein or find homologs broadly.

```sh
blastp \
  -query protein.faa \
  -db /zfsstore/databases/NCBI/nr/nr \
  -evalue 1e-5 \
  -max_target_seqs 20 \
  -num_threads 8 \
  -out protein_vs_nr.tsv \
  -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
```

## 2. Single protein vs a specific taxonomic group

Restrict results to a clade, order, genus, or species using NCBI taxids.

NOTE: Taxonomy filtering limits biological scope, not database size.

```sh
blastp \
  -query protein.faa \
  -db /zfsstore/databases/NCBI/nr/nr \
  -taxids 28211 \
  -evalue 1e-5 \
  -max_target_seqs 20 \
  -num_threads 8 \
  -out protein_vs_taxid.tsv \
  -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
```

## 3. Searching only a specific region of the protein

Use this when only part of the protein is biologically relevant. Residue numbering is 1-based and inclusive.

```sh
blastp \
  -query protein.faa \
  -query_loc 120-240 \
  -db /zfsstore/databases/NCBI/nr/nr \
  -evalue 1e-5 \
  -max_target_seqs 20 \
  -num_threads 8 \
  -out protein_region_120_240.tsv \
  -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
```

## 4. Searching for divergent or remote homologs

Use this when close homologs are absent. This increases sensitivity at the cost of speed and specificity.
```sh
blastp \
  -query protein.faa \
  -db /zfsstore/databases/NCBI/nr/nr \
  -word_size 2 \
  -matrix BLOSUM45 \
  -evalue 1e-2 \
  -qcov_hsp_perc 30 \
  -max_target_seqs 200 \
  -num_threads 8 \
  -out divergent_homologs.tsv \
  -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
```

## 5. Region-specific search for divergent homologs

Combines region restriction with relaxed similarity thresholds.

```sh
blastp \
  -query protein.faa \
  -query_loc 120-240 \
  -db /zfsstore/databases/NCBI/nr/nr \
  -word_size 2 \
  -matrix BLOSUM45 \
  -evalue 1e-2 \
  -qcov_hsp_perc 30 \
  -max_target_seqs 200 \
  -num_threads 8 \
  -out region_divergent.tsv \
  -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
```

## 6. Performance and scaling guidelines

- Single protein: 1 job, 4–8 threads, <=15 minutes walltime
- Multiple proteins: use job arrays (one query per job)
- Set the number of CPUs in your Slurm job to match the `-num_threads` value passed to `blastp`
- Avoid running BLAST on login nodes

## Summary of key options

| Use case | Key options |
| --- | --- |
| Broad annotation | default `blastp` vs **nr** |
| Taxonomy-restricted | `-taxids` |
| Region-specific | `-query_loc` |
| Divergent homologs | `-word_size`, `-matrix`, `-evalue`, `-qcov_hsp_perc` |
| Throughput | job arrays + `-num_threads` |

## Large query searches

To run BLAST on many queries, do not use a single job. Instead, submit one job per query using my sbatch Python script. Download it from GitHub and run it with `-h` to see the usage.

```sh
git clone https://github.com/B-Bajramovic/BLAST_ALICE.git
cd BLAST_ALICE
python blast_sbatch.py -h
```
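For readers who want to see what the guidelines in section 6 look like in a batch script (one query per array task, Slurm CPUs equal to `-num_threads`), here is a minimal sketch. It is not the `blast_sbatch.py` script and makes no claim about how that script works. The file `queries.txt` (one FASTA path per line), the array range `1-10`, and the helper names `pick_query`, `out_name`, and `run_task` are hypothetical. Partition and memory settings are site-specific and left out.

```sh
#!/bin/bash
#SBATCH --job-name=blastp_array
# CPUs requested here are reused as the blastp thread count below.
#SBATCH --cpus-per-task=8
# Follows the <=15 minutes walltime guideline for a single protein.
#SBATCH --time=00:15:00
# Hypothetical range: one array task per line of queries.txt.
#SBATCH --array=1-10

# Hypothetical input file: one query FASTA path per line.
QUERY_LIST="queries.txt"

# Print line N (1-based) of a list file. Usage: pick_query FILE N
pick_query() {
    sed -n "${2}p" "$1"
}

# Derive an output name from a query path: dir/protA.faa -> protA_vs_nr.tsv
out_name() {
    local base
    base="$(basename "$1")"
    printf '%s_vs_nr.tsv\n' "${base%.*}"
}

run_task() {
    local query threads
    query="$(pick_query "$QUERY_LIST" "$SLURM_ARRAY_TASK_ID")"
    # Take the thread count from the Slurm allocation so the two cannot drift apart.
    threads="${SLURM_CPUS_PER_TASK:-1}"

    module load BLAST+/2.16.0-gompi-2024a
    export BLASTDB=/zfsstore/databases/NCBI

    blastp \
        -query "$query" \
        -db /zfsstore/databases/NCBI/nr/nr \
        -evalue 1e-5 \
        -max_target_seqs 20 \
        -num_threads "$threads" \
        -out "$(out_name "$query")" \
        -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
}

# Run only inside a Slurm array task; anywhere else the script does nothing.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    run_task
fi
```

Assuming the script is saved as `blastp_array.sh` (a hypothetical name), it would be submitted with `sbatch blastp_array.sh`. Reading `SLURM_CPUS_PER_TASK` instead of hard-coding `-num_threads 8` means that changing `--cpus-per-task` is enough to keep the two values in sync.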