INPARANOID: Eukaryotic Ortholog Groups
Release 7.0, June 2009

1. Introduction

InParanoid is a program for automatic identification of orthologs while differentiating between inparalogs and outparalogs. An InParanoid cluster is seeded by a reciprocally best-matching ortholog pair, around which inparalogs are gathered independently, while outparalogs are excluded.

The InParanoid database is a collection of pairwise ortholog groups aiming to include all 'completely sequenced' eukaryotic genomes. By 'completely sequenced' we mean more than 6X coverage and less than 1% X letters in the protein sequences.

2. Online access

The InParanoid eukaryotic ortholog database is available both for direct online access and for download. Online, the user can view all clusters between two species, search for clusters by gene ID or free text, or run a BLAST search with a sequence.

Online access is available at:
http://inparanoid.sbc.su.se

3. Downloadable content

The current database is available for download at:
http://inparanoid.sbc.su.se/download/current

Previous versions can be found at:
http://inparanoid.sbc.su.se/download/old_versions

Analysed sequences:
Both the original and the processed (i.e. non-redundant, keeping only the longest transcript) sequences used for analysis can be downloaded either as fasta files or in XML format. The XML directory also contains an XML schema file called inparanoid-input.xsd. To validate an XML sequence file, run for example xmllint with the following command:

    xmllint --noout --schema inparanoid-input.xsd fileToValidate.xml

Ortholog clusters:
The ortholog clusters for all pairwise species comparisons are available as orthoXML, SQL, HTML and raw text files (directory table_stats). For more information about orthoXML see: http://www.orthoXML.org (coming soon).
Included with the orthoXML files is a schema file called orthoXML.xsd that can be used to validate the orthoXML files using xmllint with the following command:

    xmllint --noout --schema orthoXML.xsd fileToValidate.orthoXML

4. Database statistics

Version  Date   Species  Species_pairs  Ortholog_groups  Proteins_processed  Orthologous_proteins
2.0      05/03        7             21            57611              165186                 86300
3.0      08/04       17            136           559269              303771                236979
4.0      04/05       26            325           463242
5.0      09/06       26            325          1501438              511758                368591
5.1      01/07       26            325          1501438              509483                405433
6.0      09/07       35            595          2642187              610047                501566
6.1      04/08       35            595          2642187              610047                501566
7.0      06/09      100           4950         42756512             1687023               1314592

5. Algorithmic differences compared to InParanoid 6

For release 7 we used program version 4.0. The differences compared to release 6 are:

- Tightened overlap criteria: all matches must now cover >50%.
- A new BLAST version (2.2.18) with compositional score matrix adjustment was used; a second pass without the low-complexity filter was used to get more reliable alignments.
- A lower score threshold (50 -> 40 bits) was used, as low complexity is less of an issue.

Effects of algorithmic differences:

- In a few cases, orthologs that were previously found by InParanoid may be missed in release 7 due to the stricter overlap criteria. On the other hand, low-scoring matches between 40 and 50 bits are now used, generating orthologs that were previously not detected.
- Far fewer low-complexity matches are included. This mostly affects the number of inparalogs in proteomes with strongly biased composition, and not so much the number of ortholog groups.

There are other reasons why previous orthologs may be missing in the new release. In particular, new, longer alternative transcripts can make the overlap too small.

6. Stand alone program

InParanoid version 4.0 will soon be published and made available upon request; email inparanoid@sbc.su.se to obtain a copy.
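
Note on the statistics in section 4:
Because ortholog clusters are computed for all pairwise species comparisons, the Species_pairs column should equal n(n-1)/2 for n species. The short Python sketch below checks this against the listed values. The function name species_pairs and the releases dictionary are illustrative only and are not part of InParanoid; the numbers are copied from the table.

```python
def species_pairs(n_species: int) -> int:
    """Number of unordered species pairs among n_species genomes."""
    return n_species * (n_species - 1) // 2


# version -> (Species, Species_pairs), copied from the statistics table
releases = {
    "2.0": (7, 21),
    "3.0": (17, 136),
    "5.0": (26, 325),
    "6.0": (35, 595),
    "7.0": (100, 4950),
}

for version, (n, listed) in releases.items():
    # Each listed pair count should match the all-against-all formula.
    assert species_pairs(n) == listed, f"mismatch in release {version}"
```

The same formula shows why the number of pairwise comparisons grows so sharply between releases 6 and 7: going from 35 to 100 species raises the pair count from 595 to 4950.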