Frequently Asked Questions

How to calculate an Alien Index (AI) and a HGT index

Alienness takes as input the result of a BLASTp search of a whole set of predicted proteins of interest (e.g. from a whole genome or a transcriptome) against the NCBI non-redundant (nr) library or any protein library available at NCBI.
The BLAST result of each query is read from the best hit to the last significant hit. The program records a pair of values: the best hit assigned to the donor taxonomic group, called the best donor e-value/score, and the best hit assigned to the recipient taxonomic group, called the best recipient e-value/score.


Minc3s00019g01246       WP_028034051.1  50.6    338     163     3       1       338     1       334     1.4e-96 363.2
Minc3s00019g01246       WP_011579798.1  51.2    334     159     3       5       338     5       334     1.6e-95 359.8
Minc3s00019g01246       WP_159588070.1  49.5    333     164     3       6       338     6       334     2.5e-93 352.4
Minc3s00019g01246       KFB10895.1      50.0    332     162     3       8       339     8       335     2.8e-92 349.0
Minc3s00019g01246       WP_051913977.1  50.0    332     162     3       8       339     10      337     2.8e-92 349.0
...
Minc3s00019g01246       RCL30100.1      41.3    341     183     8       8       334     33      370     4.8e-60 241.9
Minc3s00019g01246       WP_088076312.1  42.2    344     178     7       1       330     1       337     4.8e-60 241.9
Minc3s00019g01246       XP_009065820.1  42.1    340     181     10      1       331     1       333     4.8e-60 241.9
Minc3s00019g01246       PYT47325.1      41.9    344     182     9       2       330     30      370     4.8e-60 241.9
Minc3s00019g01246       PCJ07871.1      39.6    338     197     6       1       338     1       331     4.8e-60 241.9
Minc3s00019g01246       WP_121651289.1  40.7    322     181     7       8       326     10      324     4.8e-60 241.9
...
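
The scan described above can be sketched as follows. This is a minimal illustration, not the Alienness implementation: it assumes the hits are in 12-column tabular format sorted from best to worst, and the `group_of` mapping from accession to group is a hypothetical stand-in for the taxonomy lookup (the two assignments below are for illustration only).

```python
def best_donor_and_recipient(hit_lines, group_of):
    """Return the first (best) donor hit and the first (best) recipient hit.

    hit_lines: tabular BLAST lines for one query, best hit first (assumed).
    group_of:  hypothetical dict mapping accession -> "donor" or "recipient";
               accessions missing from the dict are skipped.
    """
    best = {}
    for line in hit_lines:
        fields = line.split()
        if len(fields) < 12:
            continue  # skip "..." or malformed lines
        accession = fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        group = group_of.get(accession)
        if group in ("donor", "recipient") and group not in best:
            best[group] = (accession, evalue, bitscore)
        if len(best) == 2:
            break  # both values recorded, no need to read further
    return best.get("donor"), best.get("recipient")


hits = [
    "Minc3s00019g01246 WP_028034051.1 50.6 338 163 3 1 338 1 334 1.4e-96 363.2",
    "Minc3s00019g01246 WP_011579798.1 51.2 334 159 3 5 338 5 334 1.6e-95 359.8",
    "...",
    "Minc3s00019g01246 XP_009065820.1 42.1 340 181 10 1 331 1 333 4.8e-60 241.9",
]
# Toy group assignment, assumed for this example only.
groups = {"WP_028034051.1": "donor", "XP_009065820.1": "recipient"}
donor, recipient = best_donor_and_recipient(hits, groups)
```

A group with no hit comes back as `None`, which leaves the caller free to apply whatever penalty value is appropriate.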

The Alien Index is computed with the following formula:
AI = ln(best recipient e-value + 1e-200) - ln(best donor e-value + 1e-200)

Parameters | Description
AI | Alien Index, a metric used to characterize potential horizontal gene transfer
best recipient e-value | best BLAST e-value for the recipient taxon
best donor e-value | best BLAST e-value for the donor taxon
In our example, the pair of best e-values is (best donor e-value: 1.4e-96 / best recipient e-value: 4.8e-60), giving an AI of 84.13.
When no significant donor or recipient BLAST hit is found, a penalty e-value of 1 is automatically assigned as the best donor or recipient e-value, respectively.
Hence, e-values of the best recipient and donor hits vary between 0 and 1 and, consequently, AI scores vary between -460.5 and 460.5.
An AI > 0 indicates a better hit to a donor species than to a recipient species and a possible acquisition via HGT.
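
As a worked check of these numbers, here is a minimal sketch. It assumes the formula of Gladyshev et al. (2008), AI = ln(best recipient e-value + 1e-200) - ln(best donor e-value + 1e-200), which is consistent with the bounds of -460.5 and 460.5 quoted above. The function name and the use of `None` for a missing hit are choices made for this example.

```python
import math

EPSILON = 1e-200  # assumed constant; ln(1e-200) is about -460.5
PENALTY_EVALUE = 1.0  # penalty assigned when a group has no significant hit


def alien_index(best_recipient_evalue=None, best_donor_evalue=None):
    """Alien Index from the best recipient and best donor e-values."""
    recipient = PENALTY_EVALUE if best_recipient_evalue is None else best_recipient_evalue
    donor = PENALTY_EVALUE if best_donor_evalue is None else best_donor_evalue
    return math.log(recipient + EPSILON) - math.log(donor + EPSILON)


# Example from the text: the donor hit is the better one (1.4e-96).
example_ai = alien_index(best_recipient_evalue=4.8e-60, best_donor_evalue=1.4e-96)
print(f"{example_ai:.2f}")  # prints 84.13

# Upper bound: no recipient hit (penalty of 1) and a donor e-value of 0.
upper_bound = alien_index(None, 0.0)
print(f"{upper_bound:.1f}")  # prints 460.5
```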

For more details, this method is defined in:
Gladyshev, E. A.; Meselson, M.; Arkhipova, I. R. Massive horizontal gene transfer in bdelloid rotifers. Science 2008, 320, 1210–3.

The HGT index is computed with the following formula:
HGTindex = best donor bitscore - best recipient bitscore

Parameters | Description
HGT index | a metric used to characterize potential horizontal gene transfer
best recipient score | best BLAST bitscore for the recipient taxon
best donor score | best BLAST bitscore for the donor taxon
In our example, the pair of best scores is (best recipient score: 241.9 / best donor score: 363.2), giving an HGTindex of 121.30.
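
The same arithmetic as a short sketch (the function name is chosen for this example; how Alienness handles a missing donor or recipient bitscore is not stated here, so that case is not covered):

```python
def hgt_index(best_donor_bitscore, best_recipient_bitscore):
    """HGT index: difference between the best donor and best recipient bitscores."""
    return best_donor_bitscore - best_recipient_bitscore


example_hgt_index = hgt_index(best_donor_bitscore=363.2, best_recipient_bitscore=241.9)
print(f"{example_hgt_index:.2f}")  # prints 121.30
```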

What files and settings are expected

Input file

Alienness takes as input the BLAST or DIAMOND result of a proteome searched against a protein database.
The input file for the Alienness tool must be compressed in .zip or .gz format.
For example, the blastp program can be used; it is available from the NCBI FTP website as part of the BLAST+ package.
Expected options for the blastp command line:

-option value | Description
-outfmt X | X = 6: tabular; X = 7: tabular with comment lines
-db nr | BLAST database name. For better coverage of biodiversity, NCBI's nr library is recommended but not required. The protein library must have gi or accession numbers that exist in the NCBI database.
-seg no | The SEG program is used to mask or filter low-complexity regions in amino acid queries
-evalue 1e-3 | Expect value (E) for saving hits
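
A minimal sketch of how these options could be assembled and the result compressed before upload. The file names are hypothetical, and the helper functions are written for this example; only the blastp options listed above plus the standard `-query` and `-out` options are used.

```python
import gzip
import shutil
from pathlib import Path


def blastp_command(query_fasta, out_tsv, db="nr", outfmt=6):
    """Build the blastp argument list with the options expected by Alienness."""
    return [
        "blastp",
        "-query", str(query_fasta),
        "-db", db,
        "-outfmt", str(outfmt),   # 6 = tabular, 7 = tabular with comment lines
        "-seg", "no",
        "-evalue", "1e-3",
        "-out", str(out_tsv),
    ]


def gzip_file(path):
    """Compress a file to <name>.gz next to the original and return the new path."""
    path = Path(path)
    gz_path = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


# Hypothetical file names, for illustration only.
command = blastp_command("proteome.fasta", "proteome_vs_nr.tsv")
```

The command list can be passed to `subprocess.run(command, check=True)` on a machine where BLAST+ and the database are installed, and the resulting file compressed with `gzip_file("proteome_vs_nr.tsv")`.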

Alienness parameters

Taxonomic group of interest
Alienness requires the user to define two taxonomic groups: the group of donor species and the group of recipient species. The taxonomic node (NCBI TaxID) entered in the field "taxonomic group of interest" (TOI) is used to define these two sets; only one TaxID should be entered. One set groups all taxonomic nodes included in the provided node (recipient group); the other contains all other nodes (donor group).
For instance, if you are interested in HGT of non-metazoan origin to a metazoan species, please input 33208 (NCBI TaxID for Metazoa). If you are interested in HGT of non-green-plant origin to a green plant species, please input 33090 (NCBI TaxID for Viridiplantae).
This is valid for any TaxID. This information is necessary to retrieve the best e-value for the recipient group and the best e-value for the candidate donor group, which are used to calculate the Alien Index.

Taxonomic group(s) to exclude
Alienness expects NCBI TaxIDs (one or several) for the taxonomic groups you want to ignore in the calculation of the Alien Index. You must at least input the TaxID of the query species you used to produce the BLAST result. Everything included in an entered taxonomic node is excluded. Note that you can input several TaxIDs separated by commas in this field if you want to ignore several non-overlapping taxonomic groups. This is useful if there is no monophyletic group in the NCBI taxonomy corresponding to the set of species you want to ignore.

Taxonomic group(s) used to classify potential donors
By default, the taxonomic groups found are categorized as Archaea, Bacteria, Viruses, Eukaryota, Eukaryota@Fungi, Eukaryota@Metazoa, Eukaryota@Stramenopiles and Eukaryota@Viridiplantae. ‘Other’ and ‘Unclassified’ groups are ignored as they cannot be assigned to a species. If the field is left blank, the best hits are classified in these main categories. If you want to further classify the best hits in other categories, any additional NCBI TaxID can be entered (e.g. Chlorophyta, Streptophyta).

In summary ...
Alienness expects NCBI TaxIDs.

Fields | Description
Taxonomic group of interest | TaxID, only one
Taxonomic group(s) to exclude | TaxID, one or several separated by commas
Taxonomic group(s) used to classify potential donors | TaxID, optional, none to several separated by commas
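
How the first two fields partition the BLAST hits can be sketched as follows. This is an illustration under stated assumptions, not the Alienness code: each hit is represented by its lineage as a list of NCBI TaxIDs (abbreviated toy lineages below), and exclusion is assumed to take precedence over the taxonomic group of interest.

```python
def assign_group(lineage, toi_taxid, excluded_taxids=()):
    """Assign a hit to "excluded", "recipient" or "donor" from its lineage.

    lineage:         TaxIDs from the root down to the hit's species.
    toi_taxid:       the single TaxID of the taxonomic group of interest.
    excluded_taxids: TaxIDs of the groups to ignore (checked first, by assumption).
    """
    nodes = set(lineage)
    if nodes & set(excluded_taxids):
        return "excluded"
    if toi_taxid in nodes:
        return "recipient"
    return "donor"


METAZOA = 33208   # taxonomic group of interest in this example
NEMATODA = 6231   # example of an excluded group

# Abbreviated lineages, for illustration only.
c_elegans = [1, 131567, 2759, 33208, 6231, 6239]
human = [1, 131567, 2759, 33208, 9606]
e_coli = [1, 131567, 2, 562]

groups = [assign_group(lineage, METAZOA, [NEMATODA])
          for lineage in (c_elegans, human, e_coli)]
print(groups)  # prints ['excluded', 'recipient', 'donor']
```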

Results description

The result of the Alienness tool is a compressed directory available for download, named Alienness_alienness-job-number.zip.
The uncompressed directory contains:

 Alienness_2017080515355122254/
 ├── Minc3_Metazoa_egp_Tylenchomorpha_alienness_AI_CALCULATION.xls
 ├── Minc3_Metazoa_egp_Tylenchomorpha_alienness_FEATURES.xls
 ├── Minc3_Metazoa_egp_Tylenchomorpha_alienness_INDEX.html
 ├── Minc3_Metazoa_egp_Tylenchomorpha_alienness_KRONA.html
 ├── Minc3_Metazoa_egp_Tylenchomorpha_alienness_SUMMARY.xls
 ├── Minc3_Metazoa_egp_Tylenchomorpha_stat_main.csv
 ├── Minc3_Metazoa_egp_Tylenchomorpha_stat_queries_1_likely_hgt.xls
 ├── Minc3_Metazoa_egp_Tylenchomorpha_stat_queries_2_possible_hgt.xls
 ├── Minc3_Metazoa_egp_Tylenchomorpha_stat_queries_3_likely_contamination.xls
 ├── Minc3_Metazoa_egp_Tylenchomorpha_stat_taxonomy_0_all_hgt.xls
 ├── Minc3_Metazoa_egp_Tylenchomorpha_stat_taxonomy_1_likely_hgt.xls
 ├── Minc3_Metazoa_egp_Tylenchomorpha_stat_taxonomy_2_possible_hgt.xls
 ├── Minc3_Metazoa_egp_Tylenchomorpha_stat_taxonomy_3_likely_contamination.xls
 ├── html/
 └── src/
*NB* A project name is required in the form. This string is used to tag the result files: in the listing above, Minc3_Metazoa_egp_Tylenchomorpha is the Project_name.

The result files contain the following information:

Results files | Description
Project_name_alienness_AI_CALCULATION.xls (*1) | table presenting AI and HGT score values and taxonomic information for all the proteins that returned an AI value
Project_name_alienness_FEATURES.xls (*2) | AI features file necessary to use AvP
Project_name_alienness_INDEX.html | an HTML index file for visually exploring the BLAST results with a color code
Project_name_alienness_KRONA.html | Krona charts for exploring the best donors detected in your dataset
Project_name_alienness_SUMMARY.txt | log file providing information on execution time, parameters selected by the user, etc.
Project_name_stat_main.xls (*3) | number of queries classified in each category (likely hgt; possible hgt; likely contamination; not hgt)
Project_name_stat_queries_1_likely_hgt.xls (*4) | list of queries with an AI > 15
Project_name_stat_queries_2_possible_hgt.xls (*4) | list of queries with an AI between 0 and 15
Project_name_stat_queries_3_likely_contamination.xls (*4) | list of queries with an AI > 15 and a percentage of identity > 70
Project_name_stat_taxonomy_0_all_hgt.xls (*5) | statistics on the taxonomic distribution (species and kingdoms) of candidate donors for all categories
Project_name_stat_taxonomy_1_likely_hgt.xls (*6) | statistics on the taxonomic distribution (species and kingdoms) of candidate donors for the likely HGT category
Project_name_stat_taxonomy_2_possible_hgt.xls (*6) | statistics on the taxonomic distribution (species and kingdoms) of candidate donors for the possible HGT category
Project_name_stat_taxonomy_3_likely_contamination.xls (*6) | statistics on the taxonomic distribution (species and kingdoms) of candidate donors for the likely contamination category

We tested the accuracy of Alienness on the genomes of two plant-parasitic nematodes, for which phylogenetically supported HGT of a whole series of genes involved in plant parasitism had been previously identified [Danchin et al. Proc. Natl. Acad. Sci. USA 2010, 107, 17651–17656] [Haegeman et al. Mol. Plant Microbe Interact. 2011]. We found that all phylogenetically supported cases could be retrieved by Alienness with an AI > 9 and that this AI threshold corresponded to a low rate of putative false positives. To focus on candidates that are likely to produce phylogenetic trees supporting HGT, and minimizing the rate of false positives, we recommend an AI > 15.
Three categories are defined:
* likely_hgt : AI > 15 and <70% identity to putative donors
* possible_hgt : 0 < AI < 15
* likely_contamination : AI > 0 and >70% identity to putative donors
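
One way to read these rules is sketched below. Several points are assumptions made for this example rather than documented behaviour: the AI > 15 threshold given for the likely_hgt result file is used for likely_hgt, the contamination check is applied first, an identity of exactly 70% is not treated as contamination, an AI of exactly 15 falls in possible_hgt, and an AI of 0 or less gives not_hgt.

```python
def classify_query(ai, best_donor_pident, likely_ai=15.0, contamination_pident=70.0):
    """Classify one query from its Alien Index and identity to the best donor."""
    if ai <= 0:
        return "not_hgt"
    if best_donor_pident > contamination_pident:
        return "likely_contamination"   # assumed to take precedence
    if ai > likely_ai:
        return "likely_hgt"
    return "possible_hgt"


# Example query from this page: AI = 84.13, best donor identity = 50.6%.
print(classify_query(84.13, 50.6))  # prints likely_hgt
```

Keeping the two thresholds as parameters makes it easy to test a stricter or looser cut-off on a given dataset.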

Below is the generic description of the file marked (*1):

Column num | Column title | Description
1 | AI | Alien Index calculation
2 | HGTindex | HGT index calculation
3 | query name | the query description
4 | category | four ordered categories from most likely HGT to rejected HGT: 1:likely_hgt > 2:possible_hgt > 3:likely_contamination > 4:not_hgt
5 | query hits number | the number of hits returned by the protein in consideration
6 | acc_recipient | best accession number for the user-defined taxonomic group of interest (recipient group, e.g. Metazoa or Stramenopiles)
7 | evalue_recipient | best e-value for the user-defined taxonomic group of interest (recipient group, e.g. Metazoa or Stramenopiles)
8 | bitscore_recipient | best bitscore for the user-defined taxonomic group of interest (recipient group, e.g. Metazoa or Stramenopiles)
9 | acc_donor | best accession number for the potential donor
10 | evalue_donor | best e-value for the potential donor
11 | bitscore_donor | best bitscore for the potential donor
12 | best | closest to the best hit (donor hit or recipient hit)
13 | best_hit_acc | accession number of the hit closest to the best hit
14 | best hit prct ident | the percent identity between the query sequence and the best hit
15 | best hit org full name | full species name of the best hit
16 | best hit taxo group | abbreviated taxonomic classification of the best hit
17 | best hit taxid | NCBI TaxID of the best hit
18 | best hit lineage | full taxonomic lineage for the best hit

Description of the features file (*2)

Column num | Column title | Description
1 | query_name | the query description
2 | donor | donor information separated by ":", as info1:info2:info3:info4:info5 (*)
3 | recipient | recipient information separated by ":", as info1:info2:info3:info4:info5 (*)
4 | AI | Alien Index calculation
5 | HGTindex | HGT index calculation
6 | query hits number | the number of hits returned by the protein in consideration
(*) info1:info2:info3:info4:info5 <=> accession:accession_hit_position:identity_percent:e-value:bitscore
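
A donor or recipient field of this form can be split into named values as sketched below. The example value is hypothetical (built from the first BLAST hit shown on this page), the hit position is assumed to be an integer rank, and the case of a missing donor or recipient is not handled.

```python
from typing import NamedTuple


class HitInfo(NamedTuple):
    accession: str
    hit_position: int
    identity_percent: float
    evalue: float
    bitscore: float


def parse_hit_info(field):
    """Parse accession:accession_hit_position:identity_percent:e-value:bitscore."""
    # Split from the right so that only the last four ":" are used as separators.
    accession, position, pident, evalue, bitscore = field.rsplit(":", 4)
    return HitInfo(accession, int(position), float(pident), float(evalue), float(bitscore))


# Hypothetical field value, for illustration only.
donor_info = parse_hit_info("WP_028034051.1:1:50.6:1.4e-96:363.2")
print(donor_info.accession, donor_info.bitscore)  # prints WP_028034051.1 363.2
```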

In addition, the files whose names contain _stat provide summary statistics.
(*3) main statistics

Column num | Column title | Description
1 | HGT classification | classification into three categories: likely_hgt > possible_hgt > likely_contamination
2 | nb | number of queries in each category
(*4) These files are built on the same twelve-column template and provide basic statistics on the candidate donors (or contaminants)

Column num | Column title | Description
1 | ai | Alien Index calculation
2 | hgtindex | HGT index calculation
3 | query | the query description
4 | best_donor_acc / best_toi_acc | accessions of the donor/recipient pair
5 | best_donor_pident | the percent identity between the query sequence and the best hit donor
6 | best_donor_orgname | full species name of the best hit donor
7 | best_donor_taxonomy | abbreviated taxonomic classification of the best hit donor
8 | nb_hits_supporting_taxo | number of hits supporting the donor taxonomic group
9 | nb_hits_between_donor_and_possible_toi | number of hits found between the donor and the possible taxon of interest (recipient)
10 | nb_total_hits | total number of hits found
11 | nb_unknown_acc (dr) | number of unknown accessions found between the donor and the possible taxon of interest (recipient)
12 | nb_excluded_acc (dr) | number of excluded accessions found between the donor and the possible taxon of interest (recipient)
(*5) occurrence of the best donors by taxonomy for each HGT category

Column num | Column title | Description
1 | best donor by taxonomy | taxonomic group
2 | likely_hgt | occurrence
3 | possible_hgt | occurrence
4 | likely_contamination | occurrence
(*5) occurrence of the best donors by orgname for each HGT category

Column num | Column title | Description
1 | best donor by orgname | organism name sorted by taxonomic group
2 | likely_hgt | occurrence
3 | possible_hgt | occurrence
4 | likely_contamination | occurrence
(*6) These files are built on the same three-column template and provide basic statistics on the candidate donors (or contaminants), first by taxonomy:

Column num | Column title | Description
1 | HGT classification | classification into three categories: likely_hgt > possible_hgt > likely_contamination
2 | best donor by taxonomy | taxonomic group
3 | nb | occurrence
(*6) and then, on the same three-column template, by organism name:

Column num | Column title | Description
1 | HGT classification | classification into three categories: likely_hgt > possible_hgt > likely_contamination
2 | best donor by orgname | organism name
3 | nb | occurrence