HOW TO USE orthoFind
Quick Start
Running orthoFind is straightforward. The only mandatory input is one protein sequence in FASTA format, although more than one protein is also accepted. The sequence can be pasted or uploaded as a file. A sequence in FASTA format begins with a single description line, which includes its AC, gene name, organism, etc. The description is followed by one or more lines of sequence data. An example of the description line of a protein sequence in FASTA format is the following:

>sp|Q16637|SMN_HUMAN Survival motor neuron protein OS=Homo sapiens GN=SMN1 PE=1 SV=1

Once the sequence is submitted, orthoFind checks that it is valid (it must be a protein sequence). If so, the tool starts running and looks for homologs and orthologs of the initial sequence using the default parameters. The default parameters are:

Database: Swiss-Prot

Minimum identity required: 53%

Low complexity filter: OFF

Results will be available within 2-10 minutes from the beginning of the execution.
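
The FASTA layout and the protein check described above can be sketched in a few lines of Python. This is only an illustration: the function names, the accepted alphabet and the 90% nucleotide threshold are assumptions, and the validation that orthoFind actually performs may differ.

```python
import re

# Assumption: the 20 standard amino acids plus the codes B, Z, X, U, O and "*".
# The alphabet that orthoFind itself accepts is not documented here.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYBZXUO*")
NUCLEOTIDE_LETTERS = set("ACGTUN")


def parse_fasta(text):
    """Split FASTA text into (description, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_description(header):
    """Extract AC, organism and gene name from a UniProt-style description."""
    fields = {}
    match = re.match(r"^(\w+)\|([^|]+)\|(\S+)", header)
    if match:
        fields["db"], fields["ac"], fields["entry_name"] = match.groups()
    # OS= and GN= values run until the next two-letter "XX=" tag or line end.
    for tag, key in (("OS", "organism"), ("GN", "gene_name")):
        found = re.search(rf"\b{tag}=(.+?)(?=\s[A-Z]{{2}}=|$)", header)
        if found:
            fields[key] = found.group(1)
    return fields


def looks_like_protein(sequence, max_nucleotide_fraction=0.9):
    """Heuristic: valid letters only, and not almost entirely A/C/G/T/U/N."""
    seq = "".join(sequence.split()).upper()
    if not seq:
        return False
    if any(letter not in PROTEIN_LETTERS for letter in seq):
        return False
    nucleotide_like = sum(letter in NUCLEOTIDE_LETTERS for letter in seq)
    return nucleotide_like / len(seq) < max_nucleotide_fraction


header = ("sp|Q16637|SMN_HUMAN Survival motor neuron protein "
          "OS=Homo sapiens GN=SMN1 PE=1 SV=1")
print(parse_description(header))
```

A check like this before submitting can catch a nucleotide sequence pasted by mistake.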

Advanced Options
The initial search for homologs uses the Swiss-Prot database by default. Alternatively, it can be run against a set of completely sequenced proteomes from a single kingdom (Animalia, Archaea, Bacteria, Fungi, Plantae), or against all of them at the same time ("Reference_proteomes"). With the homologs collected in the first search, a second search can be performed against a selection of 75 proteomes:

Animalia: Aedes aegypti, Apis mellifera, Bombyx mori, Bos taurus, Branchiostoma floridae, Caenorhabditis elegans, Callithrix jacchus, Ciona intestinalis, Danio rerio, Daphnia pulex, Drosophila melanogaster, Equus caballus, Gallus gallus, Gasterosteus aculeatus, Homo sapiens, Ixodes scapularis, Latimeria chalumnae, Macaca mulatta, Mus musculus, Oryctolagus cuniculus, Oryzias latipes, Pan troglodytes, Pongo abelii, Rattus norvegicus, Tetraodon nigroviridis, Xenopus tropicalis.

Archaea: Cenarchaeum symbiosum, Halobacterium salinarum, Methanothermobacter thermautotrophicus, Pyrococcus furiosus, Sulfolobus solfataricus, Thermoplasma acidophilum.

Bacteria: Escherichia coli, Agrobacterium tumefaciens, Bacillus subtilis, Bifidobacterium longum, Clostridium botulinum, Corynebacterium glutamicum, Deinococcus radiodurans, Desulfovibrio vulgaris, Enterococcus faecalis, Flavobacterium psychrophilum, Haemophilus influenzae, Helicobacter pylori, Lactococcus lactis, Listeria monocytogenes, Mycobacterium tuberculosis, Pseudomonas aeruginosa, Salmonella typhimurium, Staphylococcus aureus, Streptococcus pneumoniae, Streptomyces coelicolor, Thermus thermophilus, Vibrio cholerae, Xanthomonas campestris, Yersinia pestis.

Fungi: Ajellomyces capsulata, Candida albicans, Coprinopsis cinerea, Cryptococcus neoformans, Emericella nidulans, Encephalitozoon cuniculi, Gibberella zeae, Neosartorya fumigata, Neurospora crassa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Ustilago maydis, Yarrowia lipolytica.

Plantae: Arabidopsis thaliana, Brachypodium distachyon, Chlamydomonas reinhardtii, Vitis vinifera, Oryza sativa subsp. japonica, Glycine max.

Alternatively, the search can be performed against a protein set uploaded by the user. This set will be assumed to represent the complete proteome of a given organism.

Finally, the user can provide a proteome as an EST dataset. This dataset must be a file with transcripts from one organism.

The uploaded files must be in FASTA format and smaller than 250 MB.
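
Before uploading a proteome or EST file, the two requirements above can be checked locally. The sketch below is a hypothetical helper, not part of orthoFind; it assumes the size limit is 250 megabytes (read here as 250 x 1024 x 1024 bytes) and only inspects the first non-empty line of the file.

```python
import os

# Assumption: the size limit is interpreted as 250 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 250 * 1024 * 1024


def check_upload(path, max_bytes=MAX_UPLOAD_BYTES):
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    # The file must be strictly smaller than the limit.
    if os.path.getsize(path) >= max_bytes:
        problems.append("file is not smaller than the size limit")
    # A FASTA file starts with a description line beginning with ">".
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        first_line = next((line for line in handle if line.strip()), "")
    if not first_line.startswith(">"):
        problems.append("first non-empty line is not a FASTA description line")
    return problems
```

For example, `check_upload("my_proteome.fasta")` returns an empty list when both checks pass (the file name is a placeholder).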

Some of the running parameters of the tool can be modified, such as:

  • The minimum identity required: the minimum identity required for a found protein to be considered a homolog. A high value usually yields fewer homologs (making the search more specific), whereas a lower value yields more of them.
  • The low complexity filter: it is off by default, but it can be turned on to avoid low complexity regions. Be careful if the initial sequence contains low complexity regions, since the search for homologs could be affected.
If a valid email address is provided, a link to the results will be sent to it when they are ready.
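
The effect of the minimum identity parameter can be illustrated with a small Python sketch. The `Hit` class, the accession labels and the identity values are invented for the example, and treating the threshold as inclusive is an assumption about how orthoFind applies it.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str           # placeholder label, not a real UniProt AC
    percent_identity: float  # identity to the query, in percent


def filter_by_identity(hits, min_identity=53.0):
    """Keep hits whose identity reaches the threshold (assumed inclusive)."""
    return [hit for hit in hits if hit.percent_identity >= min_identity]


hits = [Hit("HIT_A", 90.0), Hit("HIT_B", 53.0), Hit("HIT_C", 40.0)]

# Default threshold (53%): HIT_A and HIT_B are kept.
print([hit.accession for hit in filter_by_identity(hits)])

# Stricter threshold (80%): fewer hits, only HIT_A remains.
print([hit.accession for hit in filter_by_identity(hits, min_identity=80.0)])
```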

Non-Redundant Info
Results are summarized according to their pathway information, their Pfam domain organization, and GO terms. All of the found terms are shown along with their percentage of appearance. The color code is the following:
More than 99% of the homologs/orthologs found have it.

More than or equal to 75% of the homologs/orthologs found have it.

More than or equal to 50% of the homologs/orthologs found have it.

Less than 50% of the homologs/orthologs found have it.

The whole pathway rectangle represents 100% of pathway presence. The rectangle of each individual pathway is proportional to its percentage of appearance in the found proteins. The same representation is used for the Pfam domain organization. As for the GO terms, there are three main categories: biological process, cellular component and molecular function. Each term is colored according to the color code above. If you excluded GO terms not assigned by a curator, GO terms inferred from electronic annotation will not be taken into account.
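
The percentage of appearance and the four color tiers can be sketched as follows. This is a minimal illustration that assumes each protein contributes a set of terms and that a term counts once per protein; the term labels are placeholders and the way orthoFind computes these values internally is not documented here.

```python
from collections import Counter


def term_percentages(annotations):
    """Map each term to the percentage of proteins that carry it.

    `annotations` holds one collection of terms per protein.
    """
    total = len(annotations)
    if total == 0:
        return {}
    counts = Counter(term for terms in annotations for term in set(terms))
    return {term: 100.0 * count / total for term, count in counts.items()}


def tier(percentage):
    """Return the tier of the color code that a percentage falls into."""
    if percentage > 99:
        return "more than 99%"
    if percentage >= 75:
        return "75% or more"
    if percentage >= 50:
        return "50% or more"
    return "less than 50%"


# Four proteins annotated with placeholder terms.
annotations = [
    {"term_A", "term_B"},
    {"term_A"},
    {"term_A", "term_C"},
    {"term_A", "term_B"},
]
for term, pct in sorted(term_percentages(annotations).items()):
    print(term, pct, tier(pct))
```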

The homolog information considers the information from all of the found proteins (and also from the query and the initial proteins), whereas the ortholog information refers just to the proteins marked as orthologs.

N/A means that the data are not available.

Results Table
Results are summarized in a table. Each row is colored according to the status of the protein featured in the row. There are three possible statuses:

  • Query protein. The first of the proteins submitted by the user. It is used to present the BLAST and PSI-BLAST results, even though the search was done using a PSSM derived from all of the initial sequences.
  • Initial protein. If the user submitted more than one protein, the initial proteins are all of them except the query protein.
  • Found protein. A protein found by orthoFind in the selected database or proteome.
Orthologs are tagged with an "O" in an extra column to the left of the first column. Proteins without that tag are just homologs of the query sequence. The columns featured in the results table are:

AC: Accession Number. UniProt identifier of the protein.

GN: Gene Name. Name of the gene that codes for the protein.

Organism: Organism to which the protein belongs. The number in brackets is the taxonomic identifier of the organism, and it is linked to its taxonomic information.

Length: Protein's length. If it is greater than the maximum length allowed, the sequence is cut, and a file containing the homolog sequence and the non-homolog (cut) sequence is also provided. In that case, the table shows the length of the complete sequence and, in brackets, the length of the homolog part of the sequence.

%identity: Refers to the best local alignment between the query and the protein sequence (low complexity filter OFF). The BLAST report can also be downloaded.

Pathway: Pathway in which the protein is involved.

Domain: Protein's annotated domain or set of domains.

GO terms: Link to the complete GO annotation of the protein.

N/A means that the data are not available. For updated information about a homolog, go to its individual UniProt entry.
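
The three row statuses and the ortholog tag can be modeled with a small data structure. The sketch below is hypothetical: the class and function names are invented, the accession labels are placeholders, and it assumes that only found proteins carry the "O" tag, which the description above does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    QUERY = "query"      # first protein submitted by the user
    INITIAL = "initial"  # any other submitted protein
    FOUND = "found"      # protein found in the database or proteome


@dataclass
class Row:
    accession: str
    status: Status
    is_ortholog: bool = False  # True corresponds to the "O" tag


def build_rows(submitted, found, orthologs=frozenset()):
    """Assign a status to every protein, in submission order then found order."""
    rows = []
    for index, accession in enumerate(submitted):
        status = Status.QUERY if index == 0 else Status.INITIAL
        rows.append(Row(accession, status))
    for accession in found:
        rows.append(Row(accession, Status.FOUND, accession in orthologs))
    return rows


rows = build_rows(["IN_1", "IN_2"], ["F_1", "F_2"], orthologs={"F_2"})
for row in rows:
    tag = "O" if row.is_ortholog else " "
    print(tag, row.accession, row.status.value)
```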