SimSpliceEvol: Alternative splicing-aware simulation of biological sequence evolution

UdeS-CoBIUS/SimSpliceEvol

💻 SimSpliceEvol2: alternative splicing-aware simulation of biological sequence evolution and transcript phylogenies.

Full PDF documentation: Getting_started_with_SimSpliceEvol2


Overview

SimSpliceEvol is a tool designed to simulate the evolution of sets of alternative transcripts along the branches of an input gene tree. In addition to traditional sequence evolution events, the simulation also incorporates events related to the evolution of gene exon-intron structures and alternative splicing. These events modify the sets of transcripts produced from genes. Data generated using SimSpliceEvol is valuable for testing spliced RNA sequence analysis methods, including spliced alignment of cDNA and genomic sequences, multiple cDNA alignment, identification of orthologous exons, splicing orthology inference, and transcript phylogeny inference. These tests are essential for methods that require knowledge of the real evolutionary relationships between the sequences.

📖 Table of Contents

  1. ➤ Overview
  2. ➤ Operating System
  3. ➤ Requirements
  4. ➤ Graphical User Interface (GUI) and Webserver
  5. ➤ Getting Started
  6. ➤ Main Command - Execution
  7. ➤ Descriptions of Project Files
    1. ➤ Description of Inputs
    2. ➤ Description of Outputs
  8. ➤ Starting with SimSpliceEvolGUI

-----------------------------------------------------

👨‍💻 Operating System

The program was developed and tested on Ubuntu 22.04.6 LTS.

-----------------------------------------------------

⚒️ Requirements

  • python3 (at least Python 3.6)
  • ETE toolkit (ete3)
  • PyQt5
  • pandas
  • NumPy
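
Before running the script, it can help to confirm that the listed packages are importable. The snippet below is a minimal sketch, not part of SimSpliceEvol; the import names (`ete3`, `PyQt5`, `pandas`, `numpy`) are the usual ones for these packages and are assumed here.

```python
import importlib.util
import sys

# Import names assumed for the requirements listed above.
REQUIRED_MODULES = ["ete3", "PyQt5", "pandas", "numpy"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 6):
        print("Python 3.6 or newer is required")
    missing = missing_modules(REQUIRED_MODULES)
    print("missing modules:", ", ".join(missing) if missing else "none")
```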

-----------------------------------------------------

📦 Standalone software/Graphical User Interface (GUI) and Webserver

Unzip the file application.zip and launch the GUI simspliceevolv2 located in the application folder.

⚠️ Launching the program may take some time (no more than 15 seconds) while the environment and the required modules are deployed. If any errors occur, feel free to contact us.

The webserver and the standalone software are available at https://simspliceevol.cobius.usherbrooke.ca/. The full documentation of the software can be found in this repository.

-----------------------------------------------------

🚀 Getting Started with the Python script

💻 Main Command

Command

 usage: simspliceevolv2.py [-h] -i INPUT_TREE_FILE [-it ITERATIONS]
                          [-dir_name DIRECTORY_NAME] [-eic_el EIC_EL]
                          [-eic_ed EIC_ED] [-eic_eg EIC_EG] [-c_i C_I]
                          [-c_d C_D] [-k_nb_exons K_NB_EXONS] [-k_eic K_EIC]
                          [-k_indel K_INDEL] [-k_tc K_TC]
                          [-tc_rs RANDOM_SELECTION]
                          [-tc_a5 ALTERNATIVE_FIVE_PRIME]
                          [-tc_a3 ALTERNATIVE_THREE_PRIME]
                          [-tc_es EXON_SKIPPING] [-tc_me MUTUALLY_EXCLUSIVE]
                          [-tc_ir INTRON_RETENTION] [-tc_tl TRANSCRIPT_LOSS]

Usage example

(Note: this example keeps the default settings/parameters and redirects the simulation output to the directory ./execution/outputs)

 python3 simspliceevolv2.py -dir_name 'execution/outputs' -i 'execution/inputs/small_example.nw' 

Expected output

[Image: expected output of the example run]
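
When several simulations are launched from a script, the command line above can be assembled programmatically. The following is a hedged sketch: `build_command` and `run_simulation` are hypothetical helpers, not part of SimSpliceEvol, and they only reuse the flags shown in the usage message.

```python
import subprocess
import sys


def build_command(tree_file, out_dir=None, script="simspliceevolv2.py", **options):
    """Assemble the argument list for one run.

    Keyword options become flags, e.g. k_indel=5 -> "-k_indel 5".
    """
    cmd = [sys.executable, script, "-i", str(tree_file)]
    if out_dir is not None:
        cmd += ["-dir_name", str(out_dir)]
    for flag, value in options.items():
        cmd += ["-" + flag, str(value)]
    return cmd


def run_simulation(tree_file, out_dir=None, **options):
    """Launch the script and raise CalledProcessError if it fails."""
    return subprocess.run(build_command(tree_file, out_dir, **options), check=True)


if __name__ == "__main__":
    # Print the command for the example above without executing it.
    print(" ".join(build_command("execution/inputs/small_example.nw",
                                 "execution/outputs")))
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.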

-----------------------------------------------------

Description of Project Files/Arguments

⌨️ Description of Inputs

REQUIRED FILE

The only required file is the Newick file [-i INPUT_TREE_FILE], which must contain branch lengths (NHX format is also accepted).
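
Since the input tree must carry branch lengths, a quick pre-check can catch malformed files early. The function below is a rough heuristic sketch written for this README, not the parser used by SimSpliceEvol: it strips bracketed annotations (such as NHX tags) and compares the number of non-root nodes with the number of `:length` fields. It assumes unquoted labels without commas, parentheses, or colons.

```python
import re

# A branch length looks like ":0.12", ":3" or ":1e-3".
_LENGTH = re.compile(r":\s*[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def has_all_branch_lengths(newick):
    """Heuristically check that every non-root node has a branch length."""
    # Drop bracketed comments, e.g. NHX annotations "[&&NHX:S=human]".
    text = re.sub(r"\[[^\]]*\]", "", newick)
    # In Newick, every non-root node is terminated by "," or ")".
    n_nodes = text.count(",") + text.count(")")
    n_lengths = len(_LENGTH.findall(text))
    return n_nodes > 0 and n_lengths >= n_nodes


if __name__ == "__main__":
    print(has_all_branch_lengths("((A:0.1,B:0.2):0.3,C:0.4);"))
    print(has_all_branch_lengths("((A,B):0.3,C:0.4);"))
```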

OPTIONAL ARGUMENTS

The other arguments are optional, and we describe them below:

[-it ITERATIONS] name of the simulation (default='1')

[-k_nb_exons K_NB_EXONS] multiplicative constant for the number of exons in a gene (default=1.5)

[-k_eic K_EIC] multiplicative constant for the exon-intron change (eic) rate (default=25)

[-k_indel K_INDEL] multiplicative constant for the codon indel rate (default=5)

[-k_tc K_TC] multiplicative constant for transcript change (tc) (default=10)

[-eic_el EIC_EL] relative frequency of exon-intron structure change by exon loss (default=0.4)

[-eic_eg EIC_EG] relative frequency of exon-intron structure change by exon gain (default=0.5)

[-eic_ed EIC_ED] relative frequency of exon-intron structure change by exon duplication (default=0.1)

[-c_i C_I] relative frequency of codon insertions (default=0.7)

[-c_d C_D] relative frequency of codon deletions (default=0.3)

[-tc_rs RANDOM_SELECTION] relative frequency of random selection (default=1.0)

[-tc_a5 ALTERNATIVE_FIVE_PRIME] relative frequency of alternative five prime in tc (default=0.25)

[-tc_a3 ALTERNATIVE_THREE_PRIME] relative frequency of alternative three prime in tc (default=0.25)

[-tc_es EXON_SKIPPING] relative frequency of exon skipping in tc (default=0.35)

[-tc_me MUTUALLY_EXCLUSIVE] relative frequency of mutually exclusive exons in tc (default=0.15)

[-tc_ir INTRON_RETENTION] relative frequency of intron retention in tc (default=0.00)

[-tc_tl TRANSCRIPT_LOSS] relative frequency of transcript loss in tc (default=0.3)
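
With the documented defaults, the relative frequencies of exon-intron changes (0.4 + 0.5 + 0.1), of codon indels (0.7 + 0.3), and of the five splicing event types (0.25 + 0.25 + 0.35 + 0.15 + 0.00) each add up to 1. The sketch below checks a custom parameter set against that pattern. The grouping, and the idea that each group should sum to 1, are assumptions drawn from the default values only; the tool itself may not enforce them.

```python
import math

# Default values copied from the argument list above.
DEFAULTS = {
    "eic_el": 0.4, "eic_eg": 0.5, "eic_ed": 0.1,
    "c_i": 0.7, "c_d": 0.3,
    "tc_a5": 0.25, "tc_a3": 0.25, "tc_es": 0.35, "tc_me": 0.15, "tc_ir": 0.0,
}

# Grouping assumed for this sketch (not taken from the source code).
GROUPS = {
    "exon-intron change": ["eic_el", "eic_eg", "eic_ed"],
    "codon indel": ["c_i", "c_d"],
    "transcript change": ["tc_a5", "tc_a3", "tc_es", "tc_me", "tc_ir"],
}


def group_sums(params):
    """Sum the relative frequencies inside each group."""
    return {group: sum(params[key] for key in keys)
            for group, keys in GROUPS.items()}


def unbalanced_groups(params, tol=1e-9):
    """Return the groups whose frequencies do not add up to 1."""
    return [group for group, total in group_sums(params).items()
            if not math.isclose(total, 1.0, abs_tol=tol)]


if __name__ == "__main__":
    print(unbalanced_groups(DEFAULTS))
```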

💽 Description of Outputs

Output files

SimSpliceEvol creates nine (9) folders.

[output_directory]/genes/[iteration#i]

  • The file genes.fasta contains all the gene sequences in FASTA format.

[output_directory]/transcripts/[iteration#i]

  • The file transcripts.fasta contains all the transcript sequences in FASTA format.

[output_directory]/transcripts_to_gene/[iteration#i]

  • The file mappings.txt contains all the transcript IDs along with their corresponding genes.

[output_directory]/pairwise_alignments/[iteration#i]

  • The file pairwise_alignments.fasta contains all the spliced alignments of transcripts with their corresponding gene sequences in FASTA format.

[output_directory]/multiple_alignments/[iteration#i]

  • The file msa_transcripts.alg contains the multiple sequence alignment of transcripts in FASTA format.

  • The file splicing_structure.csv describes the representation of exons in CSV format.

[output_directory]/exons_positions/[iteration#i]

  • The file exons_positions.txt contains the positions (start and end) of exons in transcripts and genes.

[output_directory]/clusters/[iteration#i]

  • The file ortholog_groups.clusters describes the clusters of orthologous transcripts (transcripts with the same structure). A cluster can include recent paralogs or isoorthologs.

[output_directory]/phylogenies/[iteration#i]

  • The SVG images and Newick files in this directory describe the evolutionary history of transcripts. (For further exploration, refer to the section below)

    • Nodes

      • leaves

        • gold : transcripts of existing genes.

        • gray : transcripts of ancestral genes.

      • internal nodes

        • red : Intron Retention (IR)

        • orange : Mutually Exclusive exons (ME)

        • violet : 5 prime Splice Site (5SS)

        • medium blue : 3 prime Splice Site (3SS)

        • lime green : Exon Skipping (ES)

        • white : Conservation (Speciation or Duplication event under the LCA-reconciliation), i.e., not a creation event.

Example

[Image: example transcript phylogeny]

  • two ME nodes (orange internal nodes)
  • one 5SS node (violet internal node)
  • conservation nodes (white internal nodes)
  • transcripts of existing genes (gold leaf nodes)
  • transcripts of ancestral genes (gray leaf nodes)
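
The FASTA outputs described above (for example genes.fasta and transcripts.fasta) can be loaded with a few lines of plain Python. This is a generic FASTA reader given as a sketch; it makes no assumption about how SimSpliceEvol formats its header lines and simply keeps the text after `>` as the record name.

```python
def read_fasta(path):
    """Return {header: sequence}; the header is the text after '>'."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]
                records[header] = []
            elif header is not None:
                # Sequences may span several lines; collect the pieces.
                records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

For instance, `read_fasta("execution/outputs/genes/1/genes.fasta")` would load the gene sequences, assuming the output directory of the usage example and an iteration folder named 1.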

-----------------------------------------------------

💻 Starting with the Standalone Software / GUI

Main Command (GUI)

  • Open a terminal in Ubuntu/Windows and enter the following command. Any error logs will be shown in the terminal.
 ./simspliceevol2 
  • Alternatively, double-click the icon to open the application. You may first need to allow the file to run as an application (right-click > Properties > Permissions > Authorize the execution of the file as an application).

Main Command (Standalone Software)

  • Open a terminal in Ubuntu/Windows and enter the following command to see the command-line help.
 ./simspliceevol2 -app no -h 
  • You can now run the software like the original Python script, without installing the prerequisites. For a quick run, use the provided tree file (small.nw) with the generic command below. You can customize the command with the parameters described earlier.
 ./simspliceevol2 -app no -i ./small.nw 

Interface

[Image: GUI interface]

Execution

Outputs

[Image: outputs view]

Phylogenies carousel

[Image: phylogenies carousel]

List of output directories


.
├── cdna
│   └── cdna.fasta
├── clusters
│   └── ortholog_groups.clusters
├── exons_positions
│   └── exons_positions.txt
├── genes
│   └── genes.fasta
├── multiple_alignments
│   └── msa_transcripts.alg
├── pairwise_alignments
│   └── pairwise_alignments.fasta
├── phylogenies
│   ├── phylo_1_msa_.png
│   ├── phylo_1_msa_.svg
│   ├── phylo_1.nwk
│   ├── phylo_1_w_msa_.png
│   ├── phylo_1_w_msa_.svg
│   ├── phylo_2_msa_.png
│   ├── phylo_2_msa_.svg
│   ├── phylo_2.nwk
│   ├── phylo_2_w_msa_.png
│   ├── phylo_2_w_msa_.svg
│   ├── phylo_3_msa_.png
│   ├── phylo_3_msa_.svg
│   ├── phylo_3.nwk
│   ├── phylo_3_w_msa_.png
│   ├── phylo_3_w_msa_.svg
│   ├── phylo_4_msa_.png
│   ├── phylo_4_msa_.svg
│   ├── phylo_4.nwk
│   ├── phylo_4_w_msa_.png
│   ├── phylo_4_w_msa_.svg
│   ├── phylo_5_msa_.png
│   ├── phylo_5_msa_.svg
│   ├── phylo_5.nwk
│   ├── phylo_5_w_msa_.png
│   ├── phylo_5_w_msa_.svg
│   ├── phylo_6_msa_.png
│   ├── phylo_6_msa_.svg
│   ├── phylo_6.nwk
│   ├── phylo_6_w_msa_.png
│   └── phylo_6_w_msa_.svg
├── transcripts
│   └── transcripts.fasta
└── transcripts_to_gene
    └── mappings.txt
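
A short script can confirm that a run produced the directories shown in this listing and collect the Newick files of the transcript phylogenies. This is a sketch under the assumption that the listing reflects the layout of a finished run; `missing_output_dirs` and `newick_files` are hypothetical helper names.

```python
from pathlib import Path

# Directory names taken from the listing above.
EXPECTED_DIRS = [
    "cdna", "clusters", "exons_positions", "genes", "multiple_alignments",
    "pairwise_alignments", "phylogenies", "transcripts", "transcripts_to_gene",
]


def missing_output_dirs(output_dir):
    """Return the expected sub-directories that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


def newick_files(output_dir):
    """Return the phylo_*.nwk files found under the phylogenies directory."""
    # rglob also finds files nested in per-iteration sub-folders.
    return sorted((Path(output_dir) / "phylogenies").rglob("phylo_*.nwk"))
```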

MSA <-> phylogenies (phylo_[n]_msa_.png)

[Image: phylogeny displayed alongside the multiple sequence alignment]

-----------------------------------------------------

MORE with SimSpliceEvol

The main function simspliceevol()

simspliceevol(SRC, ITERATION_NAME, TREE_INPUT, K_NB_EXONS, K_INDEL, C_I, C_D, EIC_ED, EIC_EG, EIC_EL, K_EIC, K_TC, TC_RS, TC_A3, TC_A5, TC_ME, TC_ES, TC_IR, TC_TL)

returns a set that contains:

  • an ETE tree Python object, as defined in the ete3 library. Each node carries attributes that provide additional details about the simulation.
  • a pandas DataFrame containing exon sequences, indexed by transcript names, with columns representing exons.

After a simulation, each tree node has two types of attributes: one describing the evolution of genes and the other describing the evolution of transcripts.

GENE EVOLUTION

| ATTRIBUTE | DESCRIPTION |
| --- | --- |
| TreeNode.gene_name | returns the name of the gene. |
| TreeNode.gene_stucture | returns a description of the gene's structure. This is an ordered list showing the alternation of exons and introns. |
| TreeNode.exons_dict | stores the exons of the gene and their sequences. The sequence depicts codon substitutions and indel evolution, represented by *** in the sequence. |
| TreeNode.introns_dict | stores the introns of the gene and their sequences. |
TRANSCRIPT EVOLUTION

| ATTRIBUTE | DESCRIPTION |
| --- | --- |
| TreeNode.transcripts_dict | stores transcripts of the gene node and the description of their structure. |
| TreeNode.transcripts_sequences_dicts | stores the sequences of exons. |
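
The node attributes listed above can be collected by walking the returned tree. The sketch below relies only on `traverse()`, which ete3 tree nodes provide, and on the attribute names from the tables. It assumes that `exons_dict` and `transcripts_dict` are dict-like, which the names suggest but this README does not state. `StandInNode` is a hypothetical class used only to make the example runnable without ete3.

```python
def collect_gene_summaries(tree):
    """Return {gene_name: (number of exons, number of transcripts)}."""
    summary = {}
    for node in tree.traverse():  # visits every node of the tree
        name = getattr(node, "gene_name", None)
        if name is None:
            continue  # node without simulation attributes
        exons = getattr(node, "exons_dict", None) or {}
        transcripts = getattr(node, "transcripts_dict", None) or {}
        summary[name] = (len(exons), len(transcripts))
    return summary


class StandInNode:
    """Hypothetical stand-in that mimics the parts of a tree node used above."""

    def __init__(self, children=(), **attrs):
        self.children = list(children)
        self.__dict__.update(attrs)

    def traverse(self):
        yield self
        for child in self.children:
            yield from child.traverse()


if __name__ == "__main__":
    leaf = StandInNode(gene_name="gene_A",
                       exons_dict={"exon_0": "ATG", "exon_1": "CCC"},
                       transcripts_dict={"t1": ["exon_0", "exon_1"]})
    root = StandInNode(children=[leaf], gene_name="gene_root",
                       exons_dict={"exon_0": "ATG"}, transcripts_dict={})
    print(collect_gene_summaries(root))
```

With a real simulation result, the tree returned by simspliceevol() would take the place of the stand-in root.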

Copyright © 2023 CoBIUS LAB