SimSpliceEvol2: alternative splicing-aware simulation of biological sequence evolution and transcript phylogenies.
Full PDF documentation: Getting_started_with_SimSpliceEvol2
SimSpliceEvol is a tool designed to simulate the evolution of sets of alternative transcripts along the branches of an input gene tree. In addition to traditional sequence evolution events, the simulation also incorporates events related to the evolution of gene exon-intron structures and alternative splicing. These events modify the sets of transcripts produced from genes. Data generated using SimSpliceEvol is valuable for testing spliced RNA sequence analysis methods, including spliced alignment of cDNA and genomic sequences, multiple cDNA alignment, identification of orthologous exons, splicing orthology inference, and transcript phylogeny inference. These tests are essential for methods that require knowledge of the real evolutionary relationships between the sequences.
- Overview
- Operating System
- Requirements
- Graphical User Interface (GUI) and Webserver
- Getting Started
- Main Command - Execution
- Descriptions of Project Files
- Starting with SimSpliceEvolGUI
Python 3 (at least Python 3.6)
ETE toolkit (ete3)
PyQt5
pandas
NumPy
Unzip the file application.zip and access GUIsimspliceevolv2 in the application folder.
Warning: launching the program may take up to 15 seconds while the environment and the required modules are deployed. If any errors occur, feel free to contact us.
The webserver and the standalone software are available at https://simspliceevol.cobius.usherbrooke.ca/. The full documentation of the software can be found in this repository.
Command
usage: simspliceevolv2.py [-h] -i INPUT_TREE_FILE [-it ITERATIONS]
[-dir_name DIRECTORY_NAME] [-eic_el EIC_EL]
[-eic_ed EIC_ED] [-eic_eg EIC_EG] [-c_i C_I]
[-c_d C_D] [-k_nb_exons K_NB_EXONS] [-k_eic K_EIC]
[-k_indel K_INDEL] [-k_tc K_TC]
[-tc_rs RANDOM_SELECTION]
[-tc_a5 ALTERNATIVE_FIVE_PRIME]
[-tc_a3 ALTERNATIVE_THREE_PRIME]
[-tc_es EXON_SKIPPING] [-tc_me MUTUALLY_EXCLUSIVE]
[-tc_ir INTRON_RETENTION] [-tc_tl TRANSCRIPT_LOSS]
Usage example
(Note: this example keeps the default parameters and writes the simulation output to the directory ./execution/outputs.)
python3 simspliceevolv2.py -dir_name 'execution/outputs' -i 'execution/inputs/small_example.nw'
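When many simulations are launched with different parameters, it can help to assemble the command programmatically. The following is a minimal sketch, not part of SimSpliceEvol: the helper names `build_command` and `run_simulation` are hypothetical, and the sketch assumes that `simspliceevolv2.py` sits in the current working directory. It only uses flags listed in the usage message above.

```python
import subprocess
import sys


def build_command(tree_file, output_dir=None, **options):
    """Return the argument list for one simspliceevolv2.py run (hypothetical helper)."""
    cmd = [sys.executable, "simspliceevolv2.py", "-i", str(tree_file)]
    if output_dir is not None:
        cmd += ["-dir_name", str(output_dir)]
    # Each keyword becomes a flag, e.g. k_eic=25 -> "-k_eic 25".
    for name, value in options.items():
        cmd += [f"-{name}", str(value)]
    return cmd


def run_simulation(tree_file, output_dir=None, **options):
    """Launch the simulation; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(tree_file, output_dir, **options), check=True)


# Show the command that would be executed for the example above.
example = build_command(
    "execution/inputs/small_example.nw",
    output_dir="execution/outputs",
    k_eic=25,
)
print(" ".join(example))
```

Calling `run_simulation(...)` with the same arguments would actually start the simulation, provided the script and its requirements are installed.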
Expected output
REQUIRED FILE
The only required file is the Newick file [-i INPUT_TREE_FILE], which must contain branch lengths (NHX format is also accepted).
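Because branch lengths are mandatory, a quick pre-flight check on the input tree can save a failed run. The sketch below is an illustration only and is not part of SimSpliceEvol: the function name `has_all_branch_lengths` is hypothetical, it uses only the standard library, and it assumes unquoted node labels (no commas or parentheses inside names). NHX annotations in square brackets are stripped before checking.

```python
import re

# A branch length such as ":0.1", ":3" or ":1e-3" at the end of a node description.
_LENGTH = re.compile(r":\s*[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?\s*$")


def has_all_branch_lengths(newick: str) -> bool:
    """Return True if every non-root node carries a branch length."""
    # Drop bracketed comments, which is where NHX annotations live.
    text = re.sub(r"\[[^\]]*\]", "", newick)
    # Every child node description ends right before a "," or a ")".
    segments = re.findall(r"([^(),;]*)[,)]", text)
    # A tree without any child node (no parentheses) is reported as False.
    return bool(segments) and all(_LENGTH.search(s) for s in segments)


print(has_all_branch_lengths("((A:0.1,B:0.2):0.3,C:0.4);"))
print(has_all_branch_lengths("((A,B):0.3,C:0.4);"))
```

The root's own branch length, if present, is ignored by this check.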
OPTIONAL ARGUMENTS
The other arguments are optional, and we describe them below:
[-it ITERATIONS]
name of the simulation (default='1')
[-k_nb_exons K_NB_EXONS]
multiplicative constant for the number of exons in a gene (default=1.5)
[-k_eic K_EIC]
multiplicative constant for the exon-intron change (eic) rate (default=25)
[-k_indel K_INDEL]
multiplicative constant for the codon indel rate (default=5)
[-k_tc K_TC]
multiplicative constant for transcript change (tc) (default=10)
[-eic_el EIC_EL]
relative frequency of exon-intron structure change by exon loss (default=0.4)
[-eic_eg EIC_EG]
relative frequency of exon-intron structure change by exon gain (default=0.5)
[-eic_ed EIC_ED]
relative frequency of exon-intron structure change by exon duplication (default=0.1)
[-c_i C_I]
relative frequency of codon insertions (default=0.7)
[-c_d C_D]
relative frequency of codon deletions (default=0.3)
[-tc_rs RANDOM_SELECTION]
relative frequency of random selection (default=1.0)
[-tc_a5 ALTERNATIVE_FIVE_PRIME]
relative frequency of alternative five prime in tc (default=0.25)
[-tc_a3 ALTERNATIVE_THREE_PRIME]
relative frequency of alternative three prime in tc (default=0.25)
[-tc_es EXON_SKIPPING]
relative frequency of exon skipping in tc (default=0.35)
[-tc_me MUTUALLY_EXCLUSIVE]
relative frequency of mutually exclusive exons in tc (default=0.15)
[-tc_ir INTRON_RETENTION]
relative frequency of intron retention in tc (default=0.00)
[-tc_tl TRANSCRIPT_LOSS]
relative frequency of transcript loss in tc (default=0.3)
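To make the notion of a relative frequency concrete, the sketch below treats the five splicing-event defaults listed above as weights of a categorical draw. This is an assumption for illustration: the documentation does not state how SimSpliceEvol uses these values internally, and the names `TC_DEFAULTS`, `normalize` and `draw_event` are hypothetical.

```python
import random

# Default relative frequencies copied from the option list above.
TC_DEFAULTS = {"a5": 0.25, "a3": 0.25, "es": 0.35, "me": 0.15, "ir": 0.0}


def normalize(freqs):
    """Rescale relative frequencies so that they sum to 1."""
    total = sum(freqs.values())
    if total <= 0:
        raise ValueError("at least one relative frequency must be positive")
    return {event: value / total for event, value in freqs.items()}


def draw_event(freqs, rng=random):
    """Pick one event type with probability proportional to its frequency."""
    probs = normalize(freqs)
    events = list(probs)
    return rng.choices(events, weights=[probs[e] for e in events], k=1)[0]


rng = random.Random(0)  # fixed seed so the draws are reproducible
draws = [draw_event(TC_DEFAULTS, rng) for _ in range(1000)]
print({event: draws.count(event) for event in TC_DEFAULTS})
```

With the default value of 0.00 for intron retention, that event type is never drawn under this reading.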
Output files
SimSpliceEvol creates nine (9) folders.
[output_directory]/genes/[iteration#i]
- The file genes.fasta contains all the gene sequences in FASTA format.
[output_directory]/transcripts/[iteration#i]
- The file transcripts.fasta contains all the transcript sequences in FASTA format.
[output_directory]/transcripts_to_gene/[iteration#i]
- The file mappings.txt contains all the transcript IDs along with their corresponding genes.
[output_directory]/pairwise_alignments/[iteration#i]
- The file pairwise_alignments.fasta contains all the spliced alignments of transcripts with their corresponding gene sequences in FASTA format.
[output_directory]/multiple_alignments/[iteration#i]
- The file msa_transcripts.alg contains the multiple sequence alignment of transcripts in FASTA format.
- The file splicing_structure.csv describes the representation of exons in CSV format.
[output_directory]/exons_positions/[iteration#i]
- The file exons_positions.txt contains the positions (start and end) of exons in transcripts and genes.
[output_directory]/clusters/[iteration#i]
- The file ortholog_groups.clusters describes the clusters of orthologous transcripts (transcripts with the same structure). A cluster can induce recent paralogs or isoorthologs.
[output_directory]/phylogenies/[iteration#i]
- The SVG images and Newick files contained in the directory describe the evolutionary history of transcripts. (For further exploration, refer to the section below.)
- Nodes
  - leaves
    - gold: transcripts of existing genes.
    - gray: transcripts of ancestral genes.
  - internal nodes
    - red: Intron Retention (IR)
    - orange: Mutually Exclusive exons (ME)
    - violet: 5 prime Splice Site (5SS)
    - medium blue: 3 prime Splice Site (3SS)
    - lime green: Exon Skipping (ES)
    - white: Conservation (Speciation or Duplication event under the LCA-reconciliation), i.e., not a creation event.
Example:
- two ME nodes (orange internal nodes)
- one 5SS node (violet internal node)
- conservation nodes (white internal nodes)
- transcripts in existing genes (gold leaf nodes)
- ancient transcripts (gray leaf nodes)
Main Command (GUI)
- Open a terminal in Ubuntu/Windows and enter the following command. Any error logs will be shown in the terminal.
./simspliceevol2
- Or double-click the icon to open the application. You may first need to allow the file to run as an application (right-click > Properties > Permissions > Authorize the execution of the file as an application).
Main Command (Standalone Software)
- Open a terminal in Ubuntu/Windows and enter the following command to see the command-line help.
./simspliceevol2 -app no -h
- You can now run the software as the original Python script without installing prerequisites. For quick execution, use the provided tree file (small.nw) with this generic command. Of course, you can customize the command with the parameters described earlier.
./simspliceevol2 -app no -i ./small.nw
Interface
Execution
Outputs
phylogenies carousel
list output directories
.
├── cdna
│   └── cdna.fasta
├── clusters
│   └── ortholog_groups.clusters
├── exons_positions
│   └── exons_positions.txt
├── genes
│   └── genes.fasta
├── multiple_alignments
│   └── msa_transcripts.alg
├── pairwise_alignments
│   └── pairwise_alignments.fasta
├── phylogenies
│   ├── phylo_1_msa_.png
│   ├── phylo_1_msa_.svg
│   ├── phylo_1.nwk
│   ├── phylo_1_w_msa_.png
│   ├── phylo_1_w_msa_.svg
│   ├── phylo_2_msa_.png
│   ├── phylo_2_msa_.svg
│   ├── phylo_2.nwk
│   ├── phylo_2_w_msa_.png
│   ├── phylo_2_w_msa_.svg
│   ├── phylo_3_msa_.png
│   ├── phylo_3_msa_.svg
│   ├── phylo_3.nwk
│   ├── phylo_3_w_msa_.png
│   ├── phylo_3_w_msa_.svg
│   ├── phylo_4_msa_.png
│   ├── phylo_4_msa_.svg
│   ├── phylo_4.nwk
│   ├── phylo_4_w_msa_.png
│   ├── phylo_4_w_msa_.svg
│   ├── phylo_5_msa_.png
│   ├── phylo_5_msa_.svg
│   ├── phylo_5.nwk
│   ├── phylo_5_w_msa_.png
│   ├── phylo_5_w_msa_.svg
│   ├── phylo_6_msa_.png
│   ├── phylo_6_msa_.svg
│   ├── phylo_6.nwk
│   ├── phylo_6_w_msa_.png
│   └── phylo_6_w_msa_.svg
├── transcripts
│   └── transcripts.fasta
└── transcripts_to_gene
    └── mappings.txt
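Several of the files listed above (genes.fasta, transcripts.fasta, cdna.fasta, pairwise_alignments.fasta) are in FASTA format and can be loaded with a few lines of standard-library Python. The reader below is a minimal sketch, not a SimSpliceEvol function: the name `read_fasta` is hypothetical, and the path in the usage note is an assumption about where a run wrote its output. Records are returned as a list so that repeated identifiers, if any, are preserved.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) pairs."""
    records = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Start a new record; the header is stored without the ">".
            records.append([line[1:], []])
        elif records:
            # Sequence lines are appended to the most recent record.
            records[-1][1].append(line)
    return [(header, "".join(parts)) for header, parts in records]
```

For example, `read_fasta("execution/outputs/genes/genes.fasta")` would return one pair per simulated gene, assuming the run wrote its output to that location.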
MSA <-> phylogenies (phylo[n°]_msa.png)
MORE with SimSpliceEvol
The main function simspliceevol()
simspliceevol(SRC, ITERATION_NAME, TREE_INPUT, K_NB_EXONS, K_INDEL, C_I, C_D, EIC_ED, EIC_EG, EIC_EL, K_EIC, K_TC, TC_RS, TC_A3, TC_A5, TC_ME, TC_ES, TC_IR, TC_TL)
returns a set that contains:
- an ETE tree Python object, as presented in the ete3 library. Each node possesses attributes that provide additional details about the simulation.
- a pandas DataFrame containing exon sequences, indexed by transcript name, with one column per exon.
After a simulation, each tree node has two types of attributes: one describing the evolution of genes and the other describing the evolution of transcripts.
GENE EVOLUTION
METHOD | DESCRIPTION |
---|---|
TreeNode.gene_name | returns the name of the gene. |
TreeNode.gene_stucture | returns a description of the gene's structure. This is an ordered list showing the alternation of exons and introns. |
TreeNode.exons_dict | stores the exons of the gene and their sequences. The sequences reflect codon substitutions and indel evolution; indels are represented by *** in the sequence. |
TreeNode.introns_dict | stores the introns of the gene and their sequences. |
TRANSCRIPT EVOLUTION
METHOD | DESCRIPTION |
---|---|
TreeNode.transcripts_dict | stores transcripts of the gene node and the description of their structure. |
TreeNode.transcripts_sequences_dicts | stores the sequences of exons. |
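The attributes in the two tables above can be collected into a per-node summary after a simulation. The sketch below is illustrative only: `summarize_simulation` is a hypothetical helper, it assumes the tree object exposes `traverse()` as ete3 trees do, and it assumes the `*_dict` attributes are dictionary-like so that `len()` gives the number of entries. Nodes that lack an attribute are reported with `None` or a count of 0.

```python
def summarize_simulation(tree):
    """Return one summary row per tree node (hypothetical helper)."""
    rows = []
    for node in tree.traverse():
        rows.append({
            # Attribute names are taken from the tables above.
            "gene": getattr(node, "gene_name", None),
            "n_exons": len(getattr(node, "exons_dict", None) or {}),
            "n_introns": len(getattr(node, "introns_dict", None) or {}),
            "n_transcripts": len(getattr(node, "transcripts_dict", None) or {}),
        })
    return rows
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(rows)` for tabular inspection.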
Copyright © 2023 CoBIUS LAB