# HPC ALICE BLASTP

*By Belmin Bajramovic [@B-Bajramovic](https://github.com/B-Bajramovic)*

This guide covers common BLASTP use cases on ALICE, relying only on built-in BLAST+ options.

## Preparation before running BLAST

Load the BLAST+ module:

```sh
module load BLAST+/2.16.0-gompi-2024a
```

Set `BLASTDB` so that BLAST can locate the databases:

```sh
export BLASTDB=/zfsstore/databases/NCBI
```

To apply both settings at every login, append them to your `~/.bashrc`:

```sh
echo 'export BLASTDB=/zfsstore/databases/NCBI' >> ~/.bashrc
echo 'module load BLAST+/2.16.0-gompi-2024a' >> ~/.bashrc
source ~/.bashrc
```
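Running the `echo ... >> ~/.bashrc` lines more than once appends duplicate entries. A minimal sketch of one way to avoid that is shown here; `append_once` is a hypothetical helper name chosen for illustration, not something provided by BLAST+ or ALICE.

```sh
# append_once FILE LINE: add LINE to FILE only if FILE does not already
# contain exactly that line. Hypothetical helper, for illustration only.
append_once() {
  file=$1
  line=$2
  touch "$file"
  # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

append_once ~/.bashrc 'export BLASTDB=/zfsstore/databases/NCBI'
append_once ~/.bashrc 'module load BLAST+/2.16.0-gompi-2024a'
```

Repeating these calls leaves `~/.bashrc` unchanged after the first run.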

## 1. Single protein vs entire nr database

Use this when you want to identify a protein or find homologs broadly.

```sh
blastp \
  -query protein.faa \
  -db /zfsstore/databases/NCBI/nr/nr \
  -evalue 1e-5 \
  -max_target_seqs 20 \
  -num_threads 8 \
  -out protein_vs_nr.tsv \
  -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
```
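The tabular output (`-outfmt 6`) has no header line, so the column order has to be remembered from the `-outfmt` string. As a small sketch, the snippet below prepends the eight column names used above and lists hits by descending bitscore. The helper name `with_header` is an assumption made for this example.

```sh
# Column names in the same order as the -outfmt string used above.
BLAST_COLS='qseqid sacc pident length qcovs evalue bitscore stitle'

# with_header: print a tab-separated header line, then copy stdin unchanged.
# Hypothetical helper, for illustration only.
with_header() {
  printf '%s\n' "$BLAST_COLS" | tr ' ' '\t'
  cat
}

# Show the ten best hits, sorted by bitscore (column 7), highest first.
if [ -s protein_vs_nr.tsv ]; then
  sort -t "$(printf '\t')" -k7,7gr protein_vs_nr.tsv | head -n 10 | with_header
fi
```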

## 2. Single protein vs a specific taxonomic group

Restrict results to a clade, order, genus, or species using NCBI taxids; the example uses 28211 (Alphaproteobacteria).  
NOTE: Taxonomy filtering limits biological scope, not database size.

```sh
blastp \
  -query protein.faa \
  -db /zfsstore/databases/NCBI/nr/nr \
  -taxids 28211 \
  -evalue 1e-5 \
  -max_target_seqs 20 \
  -num_threads 8 \
  -out protein_vs_taxid.tsv \
  -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
```

## 3. Searching only a specific region of the protein

Use this when only part of the protein is biologically relevant.  
Residue numbering is 1-based and inclusive.

```sh
blastp \
  -query protein.faa \
  -query_loc 120-240 \
  -db /zfsstore/databases/NCBI/nr/nr \
  -evalue 1e-5 \
  -max_target_seqs 20 \
  -num_threads 8 \
  -out protein_region_120_240.tsv \
  -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
```
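Before choosing a range for `-query_loc`, it can help to confirm that the range lies within the query. The following sketch counts the residues of the first sequence in a FASTA file; `fasta_length` is a hypothetical helper name, and the check assumes the 120-240 region from the command above.

```sh
# fasta_length FILE: print the residue count of the first sequence in FILE.
# Hypothetical helper, for illustration only.
fasta_length() {
  awk '
    /^>/ { if (seen) exit; seen = 1; next }   # stop at the second header
    { gsub(/[ \t\r]/, ""); n += length($0) }  # ignore whitespace and CR
    END { print n + 0 }
  ' "$1"
}

# Check that the end of the region (240) does not exceed the query length.
if [ -f protein.faa ]; then
  len=$(fasta_length protein.faa)
  if [ "$len" -ge 240 ]; then
    echo "ok: query has $len residues"
  else
    echo "query has only $len residues; adjust -query_loc" >&2
  fi
fi
```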

## 4. Searching for divergent or remote homologs

Use this when close homologs are absent.
A word size of 2, the BLOSUM45 matrix, and a relaxed E-value cutoff increase sensitivity at the cost of speed and specificity.

```sh
blastp \
  -query protein.faa \
  -db /zfsstore/databases/NCBI/nr/nr \
  -word_size 2 \
  -matrix BLOSUM45 \
  -evalue 1e-2 \
  -qcov_hsp_perc 30 \
  -max_target_seqs 200 \
  -num_threads 8 \
  -out divergent_homologs.tsv \
  -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
```

## 5. Region-specific search for divergent homologs

Combines region restriction with relaxed similarity thresholds.

```sh
blastp \
  -query protein.faa \
  -query_loc 120-240 \
  -db /zfsstore/databases/NCBI/nr/nr \
  -word_size 2 \
  -matrix BLOSUM45 \
  -evalue 1e-2 \
  -qcov_hsp_perc 30 \
  -max_target_seqs 200 \
  -num_threads 8 \
  -out region_divergent.tsv \
  -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
```

## 6. Performance and scaling guidelines

- Single protein: 1 job, 4–8 threads, <=15 minutes walltime
- Multiple proteins: use job arrays (one query per job)
- Request the same number of CPUs for your Slurm job (`--cpus-per-task`) as the number of threads you pass to `blastp` (`-num_threads`)
- Avoid running BLAST on login nodes
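The guidelines above can be sketched as a Slurm batch script for a single-protein search. This is a minimal example under stated assumptions: the job name, log file name, and script name `blastp_job.sh` are placeholders, and partition or memory settings are left out because they are site-specific. Reading the thread count from `SLURM_CPUS_PER_TASK` keeps `-num_threads` in step with `--cpus-per-task`.

```sh
# Write the job script to blastp_job.sh (placeholder name).
cat > blastp_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=blastp_nr
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=00:15:00
#SBATCH --output=blastp_%j.log

module load BLAST+/2.16.0-gompi-2024a
export BLASTDB=/zfsstore/databases/NCBI

# Slurm sets SLURM_CPUS_PER_TASK from --cpus-per-task, so the thread
# count follows the allocation; fall back to 1 if it is unset.
blastp \
  -query protein.faa \
  -db /zfsstore/databases/NCBI/nr/nr \
  -evalue 1e-5 \
  -max_target_seqs 20 \
  -num_threads "${SLURM_CPUS_PER_TASK:-1}" \
  -out protein_vs_nr.tsv \
  -outfmt "6 qseqid sacc pident length qcovs evalue bitscore stitle"
EOF

# Submit it from a login node with: sbatch blastp_job.sh
```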

## Summary of key options

| Use case | Key options |
| --- | --- |
| Broad annotation | default `blastp` vs **nr** |
| Taxonomy-restricted | `-taxids` |
| Region-specific | `-query_loc` |
| Divergent homologs | `-word_size`, `-matrix`, `-evalue`, `-qcov_hsp_perc` |
| Throughput | job arrays + `-num_threads` |

## Large query searches

To search with many query sequences, do not run them all in a single job. Instead, submit one job per query using my sbatch Python script, `blast_sbatch.py`. Clone it from GitHub and run it with `-h` to see the available options.

```sh
git clone https://github.com/B-Bajramovic/BLAST_ALICE.git
cd BLAST_ALICE
python blast_sbatch.py -h
```
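If your queries are stored in one multi-FASTA file, they first have to be separated into one file per query. The sketch below is independent of `blast_sbatch.py`, which may handle this differently; `split_fasta`, the `query_N.faa` naming, and the output directory are assumptions made for illustration.

```sh
# split_fasta FILE OUTDIR: write each record of FILE to OUTDIR/query_N.faa,
# numbering records from 1 in file order.
# Hypothetical helper, for illustration only.
split_fasta() {
  mkdir -p "$2"
  awk -v dir="$2" '
    /^>/ { if (out) close(out); n++; out = dir "/query_" n ".faa" }
    out  { print > out }   # lines before the first header are skipped
  ' "$1"
}

# Example: split_fasta proteins.faa queries
```

With files numbered this way, a Slurm job array submitted with `sbatch --array=1-N` could select its input as `queries/query_${SLURM_ARRAY_TASK_ID}.faa`.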