AltAnalyze Information and
Instructions Version 1.15
Section 1 - Introduction
1.1 Program Description
AltAnalyze (http://altanalyze.org) is a freely available, cross-platform application
that allows you to take raw (CEL file) or processed microarray data and assess
alternative splicing or alternative promoter usage and then view how these
changes may affect protein sequence, domain composition, and microRNA
targeting. This software requires
no advanced knowledge of bioinformatics programs or scripting. All you need are your microarray files
or a list of regulated probe sets along with some simple descriptions of the
conditions that you’re analyzing. In addition to splicing sensitive microarrays
(Affymetrix Exon 1.0, Gene 1.0 and AltMouse), AltAnalyze can analyze
conventional gene expression arrays from Affymetrix (3’ arrays), Agilent,
Illumina, Codelink and others. Step-by-step tutorials are available from our website.
AltAnalyze is composed of a set of modules designed to (A)
summarize, organize and filter transcript tiling data; (B) calculate scores for
alternative splicing (AS), alternative promoter selection (APS) or alternative
3’ end-processing; (C) annotate regulated
alternative exon events; and (D) assess downstream predicted functional
consequences at the level of protein domains and microRNA binding sites
(miR-BS) and biological pathways. The resulting data will be a series of text
files (results and over-representation analyses) that you can directly open in
a computer spreadsheet program for analysis and filtering (Figure 1.1). In
addition, export files are created for the Cytoscape
(1)
program DomainGraph, to graphically view domain and miR-BS probe
set-exon alignments and AltAnalyze statistics.
Alternative exon analysis is currently compatible with the Affymetrix Exon 1.0 ST array, the Affymetrix Gene 1.0 ST array and the custom exon-junction Affymetrix AltMouse A array. It may be adapted to support other platforms on a case-by-case basis (contact author) or by other developers. Analysis of conventional Affymetrix microarrays is supported for array normalization, calculation of array group statistics, dataset annotation and pathway over-representation analysis. For non-Affymetrix arrays, all of these steps are supported with the exception of array normalization.
1.2 Updates
Version 1.15 Updates
1) Addition of FIRMA alternative exon analysis (Section 3.2)
2) Alternative exon analysis support for the Affymetrix Gene 1.0 array (Section 2.3)
3) Universal import of 3rd party alternative exon results for analysis (Section 3.2)
4) Restricted analysis of a pre-determined set of probe sets (Section 3.2)
5) Bundled integration of Cytoscape and DomainGraph (Sections 2.2 and 8)
6) Improved command-line options (Section 2.3)
Version 1.14 Updates
1) Updated Ensembl version 54 exon array annotations for constitutive gene probe sets (Section 3.4 and Section 6.2)
2) Support for core probe set only gene expression calculation (Section 3.4)
3) Improved filtering methods for gene expression data (Section 3.4)
4) Addition of advanced gene expression analysis statistics (Section 3.2)
5) Support for multi-group (>2) alternative exon analyses (Section 2.5)
6) GO-Elite support for adjusted p-value filtering optional settings (Section 3.2)
7) Improved automatic array type identification
8) More streamlined graphical user interface options
1.3 Implementation
AltAnalyze is provided as a stand-alone application that can be run on Windows or Mac OS X operating systems without installation of any additional software. The software is composed of a set of distinct modules written in the programming language Python and distributed as stand-alone programs and source code. Python is a cross-platform language; therefore, AltAnalyze can be run on any operating system that has Python and Tkinter for Python installed. On many operating systems, including Linux and Mac OS X, the necessary Python components are included by default. On some systems, such as Ubuntu, Tkinter may need to be installed before using the graphical user interface. AltAnalyze can be run from either an intuitive graphical user interface or the command line. To run AltAnalyze from source code, rather than through the compiled executables, see Section 2.2 for more information.
1.4 Requirements
The basic installation of AltAnalyze requires a minimum of 500MB of hard-drive space for all required databases and components. Species databases, which are available in various versions, are downloaded separately by the user from within the program. A minimum of 1GB of RAM and a processor at least as fast as an Intel Pentium III are also required. At least an additional 1GB of free hard-drive space is recommended for building the required output files. Additional RAM and hard-drive space are recommended for large exon array studies.
1.5 Pre-Processing, External Files and Applications
If Affymetrix CEL files are processed directly by AltAnalyze (using APT and RMA), two files will be produced: an expression file (containing probe set and expression values for each array in your study) and a detection above background (DABG) p-value file (containing corresponding DABG p-values for each probe set). If FIRMA is selected as the alternative exon analysis algorithm, APT will first perform a separate RMA run to produce probe residuals for gene-level metaprobesets (Ensembl associated AltAnalyze core, extended or full probe sets). The results produced by AltAnalyze will be identical to those produced by APT or ExpressionConsole (http://www.affymetrix.com/products/software/specific/expression_console_software.affx). For some exon arrays, users can choose to exclude certain array probes based on genomic cross-hybridization (Section 2.3).
For
array summarization, all required components are either pre-installed or can be
downloaded by AltAnalyze automatically (Affymetrix library and annotation
files) for most array types. If the user is prompted for a species library file
that cannot be downloaded, the user will be asked to download the appropriate
file from the Affymetrix website. Offline analyses require the user to follow the instructions in Section 2.3.
1.6 Help with AltAnalyze
Additional documentation, tutorials, help, and user discussions are available at the AltAnalyze website http://www.altanalyze.org or at our Google Groups page http://groups.google.com/group/alt_predictions. Tutorial 1 applies to conventional microarray analysis and Tutorial 2 applies to exon array analysis. Downloads, tutorials and help for DomainGraph can be found at http://domaingraph.bioinf.mpi-inf.mpg.de.
Section 2 - Running AltAnalyze for Splicing Arrays
2.1 Where to Save Input Expression Files?
When performing analyses
in AltAnalyze, the user needs to store all of their Affymetrix CEL files
in one directory. This directory can be placed anywhere on your computer
and will be selected later in AltAnalyze. Example files can be downloaded
from: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE13297/GSE13297_RAW.tar.
Extract any downloaded TAR and/or GZIP compressed files prior to analysis. If you have already run normalization on your CEL files outside of AltAnalyze, or have downloaded already analyzed expression data from another source, you can save the expression file and (optionally) the DABG p-value file anywhere on your computer. These files should be tab-delimited text files that consist only of probe sets, expression values and headers for each column. Example files can be downloaded from http://AltAnalyze.org/normalized_hESC_differentiation.zip.
2.2 Running AltAnalyze from the Graphic User Interface
Windows and Mac Directions:
Once
you have saved your CEL files or normalized expression values to your computer,
open the AltAnalyze application folder and double-click on the executable file
named “AltAnalyze.exe” (Windows) or “AltAnalyze” (Mac). This will open a set of user interface
windows where you will be presented with a series of program options (see
following sections).
Unix/Linux and Source Code Directions:
On
Linux, download the Linux executable and python source code archive version of
AltAnalyze. To run the compiled version, double-click the executable file named
“AltAnalyze” or open this file from command-line (./AltAnalyze). If this file
is not compatible with your configuration, you can start AltAnalyze by opening
a terminal window and going to the AltAnalyze main folder (e.g. “cd
AltAnalyze_v1release” from the program parent directory). Once in this
directory, typing “python AltAnalyze.py” in the terminal window will begin to
run AltAnalyze (you should see the AltAnalyze main menu within a matter of
seconds). Note: If there is a problem running the
graphical user interface version of AltAnalyze, you can report this to our help
desk or use the command-line options (Section 2.3), which do not require TK.
AltAnalyze Graphical Interface Options:
There
are many options in AltAnalyze, which allow the user to customize their output,
the types of analyses they run and the stringency of those analyses. The
following sections show the sequential steps involved in running and navigating
AltAnalyze. Interactive tutorials for different analyses are provided from the
AltAnalyze website. Please note: if you will be using AltAnalyze on
a machine that does NOT have internet access, follow instructions 1-5 below on
an online machine and then copy the AltAnalyze program directory to an offline
machine.
1) Introduction Window - Upon opening
AltAnalyze, the user is presented with the AltAnalyze splash screen
and additional information. To directly open the AltAnalyze download
page, follow the hyperlink under "About AltAnalyze", otherwise select
"Begin Analysis".
2) Species Database Installation - The first
time AltAnalyze is used, the user will be prompted to download one or
more species databases (requires internet connection). Independent of
the array type you are analyzing, select a species and continue. The
user can select from different versions of Ensembl. If your species
is not present, select the button "Add New Species".
3) Select species and array - Next,
the user must select a species, array vendor and array type for analysis. After
selecting these options click "Continue". If the species of interest
is unavailable select the button "Add New Species Support". Here you
can add a two letter species code, full name and array vendor that will
be used for all current and future analyses (Figure 2.1 B).
4) Select analysis option - In this window, the user must select the type of data being analyzed. There are four main types of data: A) CEL files, B) Expression files, C) AltAnalyze filtered files and D) results from 3rd party applications (External Results). Processing of CEL files will produce two file types (expression and DABG). Processing of Expression files allows the user to select tab-delimited text files in which the data has already been processed (e.g., RMA), and will also produce AltAnalyze filtered files. AltAnalyze filtered files are written for any splicing array analysis (not for gene expression only arrays). These latter files allow the user to perform splicing analyses directly, without repeating the previous steps. The AltAnalyze filtered files are stored in the folder “AltExpression” under the appropriate array and species directories in the user output folder. Since CEL file normalization, array filtering and summarization can take a considerable amount of time (depending on the number of arrays), it is recommended that the user select “Expression files” or “AltAnalyze filtered files” when repeating an alternative exon analysis with different parameters, depending on which options the user wants to change. Users can also import lists of regulated probe sets with statistics obtained from a 3rd party application (e.g., JETTA).
5) Processing Affymetrix CEL Files - If you
selected the "CEL files" data type from "Main Dataset Parameters", you
will be presented with a new window for selecting the location of your
CEL files and desired output directory. Clicking the "select folder" icon will allow
you to browse your hard-drive to select the folder with your CEL files.
You can double-check the correct directory is selected by looking at
the adjacent text display. This window will be followed by an indicator
window that will automatically download the library and annotation files
for that array (human, rat and mouse only). If the array type is unrecognized
and you do not already have Affymetrix library files for your array
(e.g. PGF or CDF), you will need to download these files from the Affymetrix
website. To do so, select the link at the bottom left side of this window
named "Download Library Files". Select the array type being analyzed
from the web page and select the appropriate library files to download
and extract to your computer (requires an Affymetrix username and password)
(Figure 2.3 B).
6) Summarizing Gene Data and Filtering For Expression - After CEL file data summarization, a number of options are available for summarizing gene-level expression data and filtering out probe sets prior to alternative exon analysis. For splicing arrays, AltAnalyze calculates a "gene-expression" value for each gene based on the mean expression of all "core" (Affymetrix core probe sets and those aligning to known transcript exons) or "constitutive" (probe sets aligning to those exon regions most common among all transcripts) probe sets that have a mean DABG p-value less than, and a mean expression value greater than, the user-indicated thresholds. These values are used to report predicted gene expression changes (independent of alternative splicing) for all user-defined comparisons (see following section). In addition, fold changes and t-test p-values are calculated for each of these group comparisons. These statistics, along with several types of gene annotations, are exported to a file in the folder "ExpressionOutput" in the user-defined results directory. Along with this tab-delimited text file, a similar file with those values most appropriate for import into the pathway analysis program GenMAPP will also be produced (Figure 5.1). For splicing analyses, probe sets that meet the user-defined splicing cutoffs (expression and DABG p-values) will be retained for further analysis (see section 5.2 - ExpressionBuilder algorithm).
7) Select Alternative Exon Analysis Parameters - If using an exon-sensitive (e.g., Affymetrix Human Exon 1.0 ST array) or junction-sensitive microarray, the user will be presented with specific options for that microarray (Figure 2.5). These options include alternative exon analysis methods, statistical thresholds, and options for additional analyses (e.g., MiDAS); the default options are typically recommended. The options also allow restricting the analysis to a conservative set of probe sets (e.g., "core") and changing the threshold of splicing statistics. When complete, the user can select "Continue" in AltAnalyze to incorporate these statistics into the analysis.
8) Assigning Groups to Samples - When analyzing a dataset for the first time, the user will need to establish which samples correspond to which groups. Type in the name of the group adjacent to each sample name in your dataset (Figure 2.6 A).
9) Establishing Comparisons between Groups - Once sample-to-group relationships are added, the user can list which comparisons they wish to be performed (Figure 2.6 B). For splicing and non-splicing arrays, folds and p-values will be calculated for each comparison for the gene expression summary file. For splicing arrays, each comparison will be run in AltAnalyze to identify alternative exons. Thus, the more pair-wise comparisons are listed, the longer the analysis will take.
10) AltAnalyze Status Window - While the AltAnalyze program is running, several intermediate results files will be created, including probe set, gene and dataset level summaries (Section 2.4). The results window (Figure 2.8) will indicate the progress of each analysis as it is running. When finished, AltAnalyze will notify the user that the analysis is complete and a new “Continue” button will appear. A window then appears with a basic summary of the results of the analysis. This window contains buttons that will open the folder containing the results and suggestions for downstream interpretation and analysis. Selecting the button “Start DomainGraph in Cytoscape” will allow the user to directly open a bundled version of Cytoscape and DomainGraph (Section 8). In addition to being shown in the program report, this information is written to a log text file in the user-defined output directory.
11) If
the user selected “decide later” for the analysis option “Perform GO-Elite
Pathway Analysis”, the user will have the option to run GO-Elite after run
completion. A similar summary results window as above will also appear with the
GO-Elite pathway and Gene Ontology results.
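The gene-level summary described in step 6 above can be sketched as follows. This is only an illustration under stated assumptions, not the AltAnalyze implementation: the function name, the dictionary keys and the example values are hypothetical, the default thresholds are borrowed from the --dabgp and --rawexp defaults listed in Section 2.3, and the per-group handling of thresholds is simplified to a single set of arrays.

```python
from statistics import mean

def gene_expression(probesets, max_dabg_p=0.05, min_expression=70.0):
    """Average the expression of probe sets that pass both filters.

    probesets: list of dicts with hypothetical keys
      'expr' -> per-array expression values (non-log)
      'dabg' -> per-array DABG p-values
    Returns the mean of the retained probe set means, or None if none pass.
    """
    kept = []
    for ps in probesets:
        mean_expr = mean(ps["expr"])
        mean_p = mean(ps["dabg"])
        # Keep a probe set only if it is detected (low mean DABG p-value)
        # and expressed (mean expression above the threshold).
        if mean_p < max_dabg_p and mean_expr > min_expression:
            kept.append(mean_expr)
    return mean(kept) if kept else None

# Three hypothetical "core" probe sets for one gene, two arrays each.
core_probesets = [
    {"expr": [100.0, 120.0], "dabg": [0.01, 0.02]},    # passes both filters
    {"expr": [200.0, 220.0], "dabg": [0.001, 0.003]},  # passes both filters
    {"expr": [30.0, 40.0], "dabg": [0.20, 0.40]},      # fails both filters
]
gene_value = gene_expression(core_probesets)
```

In this sketch the third probe set is ignored, so the gene value is the average of the first two probe set means.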
2.3 Running AltAnalyze from Command-Line
In addition to using the default AltAnalyze graphical user interface (GUI), AltAnalyze can be run with command-line options by calling the Python source code in a terminal window or through other remote services. This option can be used to run AltAnalyze on a remote server, to batch script AltAnalyze services or to avoid having to select specific options in the GUI. To do this, the user or program passes specific flags to AltAnalyze to indicate where the files to analyze are, which options to use and where to save results. Note: the same alternative exon options exist for Gene and Exon 1.0 arrays.
Examples:
(Analyzing CEL files – Affymetrix 3’ array using default options and GO-Elite)
python AltAnalyze.py --species Mm --arraytype gene --celdir "C:/CELFiles" --output "C:/CELFiles" --expname "CancerCompendium"
(Analyzing CEL files – Exon 1.0 array using default options)
python AltAnalyze.py --species Hs --arraytype exon --celdir "C:/CELFiles" --output "C:/CELFiles" --expname "CancerCompendium"
python AltAnalyze.py --species Hs --arraytype exon --celdir "C:/CELFiles" --output "C:/CELFiles" --expname "CancerCompendium" --runGOElite no --dabgp 0.01 --rawexp 100 --avgallss yes --noxhyb yes --analyzeAllGroups "all groups" --GEcutoff 4 --probetype core --altp 0.001 --altmethod FIRMA --altscore 8 --exportnormexp yes --runMiDAS no --ASfilter yes --mirmethod "two or more" --calcNIp yes
(Analyzing Expression file – Gene 1.0 array using default options, without GO-Elite)
python AltAnalyze.py --species Mm --arraytype gene --expdir "C:/CELFiles/ExpressionInput/exp.CancerCompendium.txt" --statdir "C:/CELFiles/ExpressionInput/stats.CancerCompendium.txt" --output "C:/CELFiles"
(Analyzing Filtered Expression file – Exon 1.0 array using default options)
python AltAnalyze.py --species Hs --arraytype exon --filterdir "C:/CELFiles/Filtered/Hs_Exon_prostate_vs_lung.p5_average.txt" --output "C:/CELFiles"
(Annotate External Probe set results – Exon 1.0 array using default options)
python AltAnalyze.py --species Rn --arraytype exon --annotatedir "C:/JETTA_Results/Hs_tumor_progression.txt" --output "C:/JETTA_Results" --runGOElite yes
(Filter AltAnalyze results with predefined IDs – Exon 1.0 array using default options)
python AltAnalyze.py --species Mm --arraytype gene --celdir "C:/CELFiles" --output "C:/CELFiles" --expname "CancerCompendium" --returnAll yes
(Updating the database for a species)
python AltAnalyze.py --species Mm --update Official --version EnsMart54
(Operating System Example Folder Locations)
PC: "C:/CELFiles"
Mac OSX: "/root/user/admin/CELFiles"
Linux: "/hd3/home/admin/CELFiles"
Primary Analysis Variables
No default values are given for these variables; they must be supplied by the user when running an analysis. For example, if analyzing CEL files directly in AltAnalyze, you must include the flags --species, --arraytype, --celdir, --expname and --output, with corresponding values. Likewise, when analyzing an existing expression file you must include the flags --species, --arraytype, --expdir and --output. Most of the variable values are file or folder locations. These values will differ based on the directory path of your files and your operating system (e.g., Linux has a different path structure than Windows; see the examples above). The variable name used in the AltAnalyze source code for each flag is indicated below.
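For batch scripting, the required-flag rules above can be checked before a run is launched. The following is a minimal sketch, assuming AltAnalyze.py is in the current directory and python is on the path; the helper name build_command and the mode labels "cel" and "expression" are hypothetical and are not part of AltAnalyze.

```python
import subprocess

# Required flags per analysis type, as listed in this section.
# The dictionary keys are labels made up for this example.
REQUIRED_FLAGS = {
    "cel": ["--species", "--arraytype", "--celdir", "--expname", "--output"],
    "expression": ["--species", "--arraytype", "--expdir", "--output"],
}

def build_command(mode, options, script="AltAnalyze.py"):
    """Return an argument list for AltAnalyze after checking required flags."""
    missing = [flag for flag in REQUIRED_FLAGS[mode] if flag not in options]
    if missing:
        raise ValueError("missing required flags: " + ", ".join(missing))
    argv = ["python", script]
    for flag, value in options.items():
        argv.extend([flag, value])
    return argv

cmd = build_command("cel", {
    "--species": "Hs",
    "--arraytype": "exon",
    "--celdir": "C:/CELFiles",
    "--output": "C:/CELFiles",
    "--expname": "CancerCompendium",
})

# To launch the analysis from the AltAnalyze program folder, uncomment:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.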
Universally Required Variables
--arraytype:
long variable name “array_type”. No default value for this variable. Options are exon, gene, AltMouse and “3’array”. This variable indicates the general array type corresponding to the input CEL files or expression file. An example exon array is the Mouse Affymetrix Exon ST 1.0 array, an example gene array is the Mouse Affymetrix Gene ST 1.0 array and an example 3’array is the Affymetrix Mouse 430 version 2.0 array. See the Affymetrix website for array classifications.
--species:
long variable name “species”. No default value for this variable. Species
codes are provided for this variable (e.g., Hs, Mm, Rn). Additional species can
be added through the graphic user interface.
--output:
long variable name “output_dir”. No default value for this variable. Required
for all analyses. This designates the directory which results will be saved to.
Analysis Specific Required
Variables
--expname:
long variable name “exp_name”. No default value for this variable. Required when analyzing CEL files. This provides a name for your dataset. This name must match the names of any existing groups and comps files. The groups and comps files indicate which arrays correspond to which biological groups and which groups to compare. These files must exist in the designated output directory in the folder “ExpressionInput” with the names “groups.*expname*.txt” and “comps.*expname*.txt”, where expname is the value defined for this flag. Alternatively, the user can name their CEL files such that AltAnalyze can directly determine which group they belong to (e.g., wildtype-1.CEL, cancer-1.CEL, cancer-2.CEL). See http://www.altanalyze.org/manual_groups_comps_creation.htm and http://www.altanalyze.org/automatic_groups_comps_creation.htm for more information.
--celdir:
long variable name “cel_file_dir”. No default value for this variable. Required
when analyzing CEL files. This provides the path of the CEL files to analyze.
These must all be in a single folder.
--expdir:
long variable name “input_exp_file”. No default value for this variable.
Required when analyzing a processed expression file. This provides the path of
the expression file to analyze.
--statdir:
long variable name “input_stats_file”. No default value for this variable.
Optional when analyzing a processed expression file. This provides the path of
the DABG p-value file for the designated expression file to analyze (see --expdir).
--filterdir:
long variable name “input_filtered_dir”. No default value for this variable.
Required when analyzing an AltAnalyze filtered expression file. This provides
the path of the AltAnalyze filtered expression file to analyze.
--cdfdir:
long variable name “input_cdf_file”. No default value for this variable.
Required when directly processing some CEL file types. This variable corresponds to the location of the Affymetrix CDF or PGF annotation file for the analyzed array. If you are analyzing exon, gene, AltMouse or human, mouse or rat 3’arrays, AltAnalyze has default internet locations from which to download these files automatically. Otherwise, you must download the compressed CDF file from the Affymetrix website (support), decompress it (e.g., WinZip) and reference its location on your hard-drive using this flag. If you are unsure whether AltAnalyze can automatically download this file, you can try omitting this variable and checking whether annotations are included in your gene expression results file.
--csvdir:
long variable name “input_annotation_file”. No default value for this variable.
Required when analyzing some expression files or CEL file types. This variable corresponds to the location of the Affymetrix CSV annotation file for the analyzed array. If you are analyzing exon, gene, AltMouse or human, mouse or rat 3’arrays, AltAnalyze has default internet locations from which to download these files automatically. Otherwise, you must download the compressed CSV file from the Affymetrix website (support), decompress it (e.g., WinZip) and reference its location on your hard-drive using this flag. If you are unsure whether AltAnalyze can automatically download this file, you can try omitting this variable and checking whether annotations are included in your gene expression results file.
--annotatedir:
long variable name “external_annotation_dir”. No default value for this
variable. Required when annotating a list of regulated probe sets produced outside
of AltAnalyze. This variable corresponds to the location of the directory
containing one or more probe set files. These files can be in the standard
JETTA export format, or otherwise need to have probe set IDs in the first
column. Optionally, these files can have an associated fold change and p-value
(2nd and 3rd columns) which will be reported in the
results file.
--groupdir:
long variable name “groups_file”. No default value for this variable. Location
of an existing group file to be copied to the directory in which the expression
file is located or will be saved to.
--compdir:
long variable name “comps_file”. No default value for this variable. Location
of an existing comps file to be copied to the directory in which the expression
file is located or will be saved to.
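The file naming conventions described for --expname can be illustrated with a short sketch. It assumes CEL files are named "group-replicate.CEL", as in the examples given for that flag; the exact rule AltAnalyze applies to file names is not stated here, so the regular expression and both function names are hypothetical.

```python
import os
import re

def group_from_cel_name(filename):
    """Guess the group from a name like 'cancer-2.CEL' (text before the final '-N')."""
    stem, ext = os.path.splitext(os.path.basename(filename))
    if ext.lower() != ".cel":
        raise ValueError("not a CEL file: " + filename)
    match = re.match(r"^(?P<group>.+)-\d+$", stem)
    return match.group("group") if match else None

def input_file_names(output_dir, expname):
    """Paths where the groups and comps files are expected for a dataset name."""
    folder = output_dir.rstrip("/") + "/ExpressionInput"
    return (folder + "/groups." + expname + ".txt",
            folder + "/comps." + expname + ".txt")

cel_files = ["wildtype-1.CEL", "cancer-1.CEL", "cancer-2.CEL"]
groups = {name: group_from_cel_name(name) for name in cel_files}
groups_path, comps_path = input_file_names("C:/CELFiles", "CancerCompendium")
```

A file name that does not follow the assumed pattern yields None, which a batch script could use as a signal to fall back to manually created groups and comps files.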
Optional Analysis Variables
These
variables are set to default values when not selected. The default values
are provided in the configurations text file in the Config directory of
AltAnalyze (default-**.txt) and can be changed by editing in a spreadsheet
program.
GO-Elite Analysis Variables
AltAnalyze
can optionally subject differentially or alternatively expressed genes
(AltAnalyze and user determined) to an over-representation analysis (ORA) along
Gene Ontology (GO) and pathways (WikiPathways) using the program GO-Elite.
GO-Elite is seamlessly integrated with AltAnalyze and thus can be run using
default parameters from either the graphic user interface or the command line. To run
GO-Elite using default parameters in command line mode, include the first flag
below with the option yes.
--runGOElite:
long variable name “run_GOElite”, default value for this variable: no. Used to indicate whether to run
GO-Elite analysis following AltAnalyze. Indicating yes would prompt GO-Elite to run.
--mod:
long variable name “mod”, default value for this variable: Ensembl. Primary gene system for Gene Ontology (GO) and Pathway
analysis to link Affymetrix probe sets and other output IDs to. Alternative
values: EntrezGene.
--elitepermut:
long variable name “goelite_permutations”, default value for this variable: 2000. Number of permutations used by
GO-Elite to calculate an over-representation p-value.
--method:
long variable name “filter_method”, default value for this variable: z-score. Sorting method used by
GO-Elite to compare and select the top score of related GO terms. Alternative
values: “gene number” or combined.
--zscore:
long variable name “z_threshold”, default value for this variable: 1.96. Z-score threshold used following
over-representation analysis (ORA) for reported top scoring GO terms and
pathways.
--elitepval:
long variable name “p_val_threshold”, default value for this variable: 0.05. Permutation p-value threshold
used after ORA for reporting top scoring GO terms and pathways.
--dataToAnalyze:
long variable name “resources_to_analyze”, default value for this variable: both. Indicates whether to perform ORA
analysis on pathways, Gene Ontology terms or both. Alternative values: Pathways or Gene Ontology
--num:
long variable name “change_threshold”, default value for this variable: 3. The minimum number of genes regulated
in the input gene list for a GO term or pathway after ORA, required for
GO-Elite reporting.
--GEelitepval:
long variable name “ge_pvalue_cutoffs”, default value for this variable: 0.05. The minimum t-test p-value
threshold for differentially expressed genes required for analysis by GO-Elite.
--GEelitefold:
long variable name “ge_fold_cutoffs”, default value for this variable: 2. The minimum fold change threshold
for differentially expressed genes required for analysis by GO-Elite. Applied
to any group comparisons designated by the user.
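How the default GO-Elite thresholds above combine can be sketched as a simple filter. This is an illustration only: the dictionary keys, function names and example results are hypothetical, and whether GO-Elite treats each boundary as inclusive or exclusive is an assumption of this sketch.

```python
def passes_goelite_filters(term, z_threshold=1.96, p_val_threshold=0.05,
                           change_threshold=3):
    """True if an ORA result meets the default reporting thresholds.

    term: dict with hypothetical keys 'z', 'permute_p' and 'num_changed'.
    """
    return (term["z"] > z_threshold
            and term["permute_p"] < p_val_threshold
            and term["num_changed"] >= change_threshold)

def gene_is_regulated(abs_fold, p_value, ge_fold_cutoff=2.0,
                      ge_pvalue_cutoff=0.05):
    """True if a gene meets the fold (absolute, non-log) and t-test p-value cutoffs."""
    return abs_fold >= ge_fold_cutoff and p_value < ge_pvalue_cutoff

# Made-up ORA results for three terms.
ora_results = [
    {"name": "term A", "z": 3.2, "permute_p": 0.01, "num_changed": 5},
    {"name": "term B", "z": 1.5, "permute_p": 0.20, "num_changed": 8},
    {"name": "term C", "z": 2.4, "permute_p": 0.03, "num_changed": 2},
]
reported = [t["name"] for t in ora_results if passes_goelite_filters(t)]
```

Here only "term A" is reported: "term B" fails the z-score and p-value thresholds, and "term C" has too few regulated genes.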
AltAnalyze Expression Filtering
and Summarization
These
variables are used to determine the format of the expression data being read
into AltAnalyze, the output formats for the resulting gene expression data and
filtering thresholds for expression values prior to alternative exon analysis.
Since AltAnalyze can process both conventional (3’array) as well as splicing
arrays (exon, gene or AltMouse), different options are available based on the
array type.
Universal Array Analysis
Variables
--logexp:
long variable name “expression_data_format”, default value for this variable: log. This is the format of the input
expression data. If analyzing CEL files in AltAnalyze or in running RMA or
GCRMA from another application, the output format of the expression data is log
2 intensity values. If analyzing MAS5 expression data, this is non-log.
--inclraw:
long variable name “include_raw_data”, default value for this variable: yes. When the value of this variable is no, all columns that contain the
expression intensities for individual arrays are excluded from the results
file. The remaining columns are calculated statistics (groups and comparison)
and annotations.
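The distinction made by --logexp can be illustrated with a small conversion sketch. The function name is hypothetical, and the format labels "log" and "non-log" are taken from the wording above rather than from a verified list of accepted values.

```python
import math

def to_log2(values, data_format):
    """Return log2 values; data_format is 'log' or 'non-log'."""
    if data_format == "log":
        return list(values)          # already log2 (e.g., RMA or GCRMA output)
    if data_format == "non-log":
        return [math.log2(v) for v in values]   # e.g., MAS5 intensities
    raise ValueError("unknown format: " + data_format)

mas5_like = [64.0, 128.0, 1024.0]
log_values = to_log2(mas5_like, "non-log")   # [6.0, 7.0, 10.0]
rma_like = to_log2([6.0, 7.0, 10.0], "log")  # returned unchanged
```

Declaring the wrong format would either log-transform data twice or leave raw intensities untransformed, so fold changes computed downstream would be misleading.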
Exon or AltMouse Array Specific
Variables
--dabgp:
long variable name “dabg_p”, default value for this variable: 0.05. This p-value corresponds to the detection above background (DABG) value reported in the “stats.” file from AltAnalyze, generated along with RMA expression values. The mean DABG p-value of each probe set is calculated for each of the compared biological groups, and a group meets the threshold when this mean is less than the threshold. A non-constitutive probe set is excluded if both biological groups fail to meet the threshold; a constitutive probe set is excluded if either biological group fails to meet it.
--rawexp:
long variable name “expression_threshold”, default value for this variable: 70. This value is the non-log RMA
average intensity threshold for a biological group required for inclusion of a
probe set. The same rules as for --dabgp apply to this threshold, except that values below this threshold are excluded
when the above rules are not met.
--avgallss:
long variable name “avg_all_for_ss”, default value for this variable: no. Indicating yes will force AltAnalyze to use all core probe sets (Affymetrix annotated core in addition to any exon aligning probe sets), rather than probe sets that align to predicted constitutive exons, to determine the transcriptional activity of a gene. This option applies to both the gene expression export file and the alternative exon analyses.
--runalt:
long variable name “perform_alt_analysis”, default value for this variable: yes. Designating no for this variable will instruct AltAnalyze to only run the gene
expression analysis portion of the program, but not the alternative exon analysis
portion.
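The exclusion rules described for --dabgp and --rawexp can be sketched as follows. This is one reading of those rules, not the AltAnalyze implementation: the function names are hypothetical, the two thresholds are assumed to be applied independently, and the defaults are the values listed above.

```python
def keep_probeset(group_means, threshold, constitutive, lower_is_better):
    """Apply one threshold to the means of the two compared groups.

    group_means: (mean for group 1, mean for group 2)
    lower_is_better: True for DABG p-values, False for expression values.
    """
    if lower_is_better:
        passed = [m < threshold for m in group_means]
    else:
        passed = [m > threshold for m in group_means]
    # Constitutive probe sets: every group must pass.
    # Other probe sets: a single passing group is enough.
    return all(passed) if constitutive else any(passed)

def passes_filters(dabg_means, expr_means, constitutive,
                   dabg_p=0.05, rawexp=70.0):
    """True if a probe set is retained under both thresholds."""
    return (keep_probeset(dabg_means, dabg_p, constitutive, True)
            and keep_probeset(expr_means, rawexp, constitutive, False))

# A probe set detected and expressed in only one of the two groups.
one_group_only = passes_filters((0.01, 0.30), (150.0, 40.0), constitutive=False)
same_but_constitutive = passes_filters((0.01, 0.30), (150.0, 40.0), constitutive=True)
```

Under this reading, a non-constitutive probe set seen in only one group is retained, while the same measurements would exclude a constitutive probe set.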
AltAnalyze Expression Filtering
and Summarization
Universal Array Analysis
Variables
--altmethod:
long variable name “analysis_method”, default value for this variable: splicing-index (exon and gene) and ASPIRE (AltMouse). These are the default splicing algorithms for exon, gene and AltMouse arrays. FIRMA can also be selected for exon and gene arrays, and linearregres is also available for AltMouse arrays.
--altp:
long variable name “p_threshold”, default value for this variable: 0.05. This variable is the p-value
threshold for reporting alternative exons. This variable applies to both the
MiDAS and splicing-index p-values.
--probetype:
long variable name “filter_probeset_types”, default value for this variable: core. This is the class of probe sets
to be examined by the alternative exon analysis. Other options include extended and full.
--altscore:
long variable name “alt_exon_fold_variable”, default value for this variable: 2 (exon) and 0.2 (AltMouse). This is the corresponding threshold for the default
algorithms listed under --altmethod.
--GEcutoff:
long variable name “gene_expression_cutoff”, default value for this variable: 3. This value is the non-log gene
expression threshold applied to the change in gene expression (fold) between
the two compared biological groups. If a fold change for a gene is greater than
this threshold it is not reported among the results, since gene expression
regulation may interfere with detection of alternative splicing.
--analyzeAllGroups:
long variable name “analyze_all_conditions”, default value for this variable: pairwise. This variable indicates
whether to only perform pair-wise alternative exon analyses (between two
groups) or to analyze all groups, without specifying specific comparisons.
Other options are “all groups” and both.
--altpermutep:
long variable name “permute_p_threshold”, default value for this variable: 0.05. This is the permutation p-value
threshold applied to AltMouse array analyses when generating permutation based
alternative exon p-values. Alternative exon p-values can be applied to either
ASPIRE or linearregres analyses.
--altpermute:
long variable name “perform_permutation_analysis”, default value for this
variable: yes. This option directs
AltAnalyze to perform the alternative exon p-value analysis for the AltMouse
array (see --altpermutep).
--exportnormexp:
long variable name “export_splice_index_values”, default value for this
variable: no. This option directs
AltAnalyze to export the normalized intensity expression values (probe set
intensity/constitutive expression) for all analyzed probe sets rather than
perform the typical AltAnalyze analysis when its value is yes. This step can be useful for analysis of exon array data
outside of AltAnalyze and comparison of alternative exon profiles for many
biological groups (e.g., expression clustering).
--runMiDAS:
long variable name “run_MiDAS”, default value for this variable: yes. This option directs AltAnalyze to calculate
and filter alternative exon results based on the MiDAS p-value calculated using
the program Affymetrix Power Tools.
--calcNIp:
long variable name “calculate_normIntensity_p”, default value for this
variable: yes. This option directs
AltAnalyze to filter alternative exon results based on the t-test p-value
obtained by comparing either the normalized intensities for the array groups
examined (e.g., control and experimental) (splicing-index) or a t-test p-value
obtained by comparing the FIRMA scores for the arrays in the two compared groups.
--mirmethod:
long variable name “microRNA_prediction_method”, default value for this
variable: one. This option directs
AltAnalyze to return any microRNA binding site predictions (default) or those
that are substantiated by multiple databases (two or more).
--ASfilter:
long variable name “filter_for_AS”, default value for this variable: no. This option directs AltAnalyze to only
analyze probe sets for alternative expression that have an alternative-splicing
annotation (e.g., cassette-exon, alt-5’, alt-3’, intron-retention), when set
equal to yes.
--returnAll:
long variable name “return_all”, default value for this variable is no. When set to yes, returns all un-filtered alternative exon results by setting
all associated filtering parameters to the lowest stringency values. This is
equivalent to providing the following flags: --dabgp 1 --rawexp 1 --altp 1 --probetype full --altscore 1 --GEcutoff 10000. Since this option will
output all alternative exon scores for all Ensembl annotated probe sets, the
results file will be exceptionally large (>500,000 lines), unless the user
has saved previously run alternative exon results (e.g., MADS) to the directory
“AltDatabase/filtering” in the AltAnalyze program directory, with a name that
matches the analyzed comparison. For example, if the user has a list of 2,000 MADS
regulated probe sets for cortex versus cerebellum, then the MADS results should
be saved to “AltDatabase/filtering” with the name “Cortex_vs_Cerebellum.txt”
and in AltAnalyze the CEL file groups should be named Cortex and Cerebellum and
the comparison should be Cortex versus Cerebellum. When the filename for a file
in the “filtering” directory is contained within the comparison filename
(ignoring “.txt”), only these probe sets will be selected when exporting the
results. This analysis will produce a results file with all AltAnalyze
statistics (default or custom) for just the selected probe sets, independent of
the value of each statistic.
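As an illustration of the --returnAll equivalence and the filter-file naming rule described above, the following Python sketch expands the flag into the explicit settings listed in the text and checks whether a file in “AltDatabase/filtering” would apply to a comparison. The function names and the example comparison filename are hypothetical and are not part of AltAnalyze, and the simple substring match is only one reading of “contained within the comparison filename”.

```python
import os

# Flag values copied from the --returnAll description; nothing else is assumed.
RETURN_ALL_EQUIVALENT = [
    "--dabgp", "1",
    "--rawexp", "1",
    "--altp", "1",
    "--probetype", "full",
    "--altscore", "1",
    "--GEcutoff", "10000",
]


def expand_return_all(args):
    """Replace '--returnAll yes' with the explicit lowest-stringency flags."""
    expanded = []
    i = 0
    while i < len(args):
        if args[i] == "--returnAll" and i + 1 < len(args):
            if args[i + 1] == "yes":
                expanded.extend(RETURN_ALL_EQUIVALENT)
            i += 2  # drop the flag and its value in either case
        else:
            expanded.append(args[i])
            i += 1
    return expanded


def filter_file_applies(filter_filename, comparison_filename):
    """True if the filter file name (ignoring '.txt') occurs in the comparison name."""
    stem = os.path.basename(filter_filename)
    if stem.endswith(".txt"):
        stem = stem[:-4]
    return stem != "" and stem in comparison_filename


# Hypothetical comparison file name, used only for illustration.
comparison = "Hs_Exon_Cortex_vs_Cerebellum.txt"
print(expand_return_all(["--altmethod", "splicing-index", "--returnAll", "yes"]))
print(filter_file_applies("Cortex_vs_Cerebellum.txt", comparison))
```

Other flags required for a complete command-line run are deliberately left out, since only the filtering flags are described in this part of the document.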
AltAnalyze Database Updates
Universal Array Analysis Variables
--update:
long variable name “update_method”, default value for this variable: empty. Setting this flag equal to Official, without specifying a version,
will download the most up-to-date database for that species. Other options here
are used internally by AltAnalyze.org for building each new database. See the
method “commandLineRun” in AltAnalyze.py for more details.
--version:
long variable name “ensembl_version”, default value for this variable: current. Setting this flag equal to a
specific Ensembl version name (e.g. EnsMart49)
supported by AltAnalyze will download that specific version for the selected
species, while setting this to current will download the current version.
2.4 AltAnalyze Analysis Options
There are a number of analysis options provided
through the AltAnalyze interface. This section provides an overview of these
options for the different compatible array analyses (gene expression arrays,
exon arrays and junction arrays). For new users, we recommend first running the
program with the pre-set defaults and then modifying the options as necessary.
Selecting the Microarray Type and Species
When beginning AltAnalyze, the user can
select from a variety of species and array types. Only array manufacturers and
array types supported for each downloaded species will be displayed. When
multiple gene database versions are installed, a drop-down box at the top of
this screen will appear that allows the user to select different gene database
versions. These gene databases include all resources necessary for gene
annotation, alternative exon analysis (where applicable) and Gene Ontology and
pathway analysis. Microarray file normalization and summarization options are
only available for Affymetrix arrays, while annotation and statistical analyses
are supported for all Affymetrix, Agilent, Illumina and CodeLink arrays
supported by Ensembl. At the bottom of this interface is a check-box that the
user can select to download updated species gene databases, which will bring up
the database downloader window.
Selecting the Microarray Analysis Method
After the user has selected the species
of interest, they must choose what type of data they will analyze next.
Data can consist of: 1) Affymetrix CEL files, 2) an already processed
expression text file, 3) a properly formatted and filtered AltAnalyze expression
input text file or 4) a restricted list of probe sets to be directly analyzed. If
beginning with Affymetrix CEL files, all three of these file types are produced
in series (see following section) and automatically processed without any user
intervention. If the CEL files from your study have already been processed in
AltAnalyze or in another program, the user can load the resulting file by selecting the
option “expression file” and choosing this text file from your computer. This
file needs to contain data from arrays corresponding to at least two biological
groups. Users may wish to re-analyze these files to change their expression
filtering parameters to be more or less stringent. For the two or more
biological groups (see how to define in Figure 2.6), AltAnalyze will segregate
the raw data based on the user-defined pair-wise group comparisons and filter
the containing probe sets based on whether they match the user-defined
thresholds for inclusion and are associated with Ensembl genes (see the below
section: Expression Analysis
Parameters). These files will be saved to the folder “AltExpression” in the
user-defined output directory. These files can be later selected by choosing
the option “AltAnalyze filtered”, if the user wishes to re-run or use different
AltAnalyze alternative exon analysis options (see below section: Alternative
Exon Analysis Parameters).
CEL files are one of the file types
produced after scanning an Affymetrix microarray. The CEL file is produced
automatically from the DAT file (an image file, similar to a JPEG), by the
Affymetrix software by overlaying a grid over the microarray fluorescent image
and assigning a numeric value to each cell or probe. From this file, expression
values for each probe set can be calculated and normalized for all arrays in
the study using various algorithms.
When choosing to analyze
CEL files in AltAnalyze, the user will be prompted to identify the folder
containing the CEL files and the folder in which to save the results.
The user will also need to assign a name to the dataset. These CEL files
will be summarized with the RMA algorithm using the program Affymetrix Power
Tools (APT). The APT C++ module “apt-probeset-summarize” is directly called by
AltAnalyze when running AltAnalyze on a Mac, PC or Linux operating system.
Unlike some other applications, APT is packaged with AltAnalyze and thus does
not require separate installation. However, because it is a separate
application there may be unknown compatibility issues that exist, depending on
your specific system configuration and account privileges. For human and mouse
exon arrays, AltAnalyze also allows for the masking of probes with
cross-hybridization potential, prior to running RMA. This is performed through
an experimental APT function (--kill-list), masking probes that are indicated
in files produced for the MADS application (http://biogibbs.stanford.edu/~yxing/MADS/Annotation.html)
that cross-hybridize to an off-target transcript within 3 bp mismatches and have a
Pearson correlation coefficient > 0.55, as per the MADS recommendations. The probes with cross-hybridization
potential are indicated in the AltAnalyze directory
“AltDatabase/Hs/exon/Hs_probes_to_remove.txt”.
APT requires the presence
of a library file(s) specific for that array. AltAnalyze will automatically
determine the array type and can install these files if the user wishes
(currently most human, rat and mouse arrays supported). If AltAnalyze does not
recognize the specific array type or the user chooses to download these files
themselves, they will need to select the appropriate files when prompted in
AltAnalyze. For exon arrays, a PGF, CLF and antigenomic BGP file are required.
These files will be automatically downloaded and installed if the user selects
“Download” when prompted. For non-exon arrays, the appropriate CDF file will be
downloaded. In addition to these library files, a NetAffx CSV annotation file
will be downloaded that allows for addition of gene annotations (non-exon
arrays) and Gene Ontology pathway annotations (all arrays). Once installed,
AltAnalyze will recognize these files and automatically use them for all future
analyses. Once the user selects the appropriate directories and files, the user
will be prompted to select the remaining options in AltAnalyze, before APT is
run. Once run, a tab-delimited
text expression file will be produced for all probe sets on the array and a
detection-above background (DABG) p-value file (exon array only).
Loading a Processed Expression File
If CEL files are processed outside of AltAnalyze, the user must save the resulting expression text file in tab-delimited format. It is alright if the first rows in the file contain run information, as long as they are preceded by a pound sign (#).
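To make the expected file layout concrete, the following Python sketch reads a tab-delimited expression file and skips rows that begin with a pound sign. It is only an illustration: the function name, the example column names and the assumption that the first non-comment row is a header with the probe set identifier in the first column are not taken from AltAnalyze.

```python
import csv
import io


def read_expression_table(handle):
    """Parse a tab-delimited expression file, skipping rows that start with '#'.

    Returns (sample_names, {probeset_id: [float, ...]}).
    Assumes the first non-comment row is a header whose first column holds
    the probe set identifiers; that layout is an assumption of this sketch.
    """
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    values = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        values[row[0]] = [float(v) for v in row[1:]]
    return samples, values


# Small in-memory example standing in for a file on disk.
example = io.StringIO(
    "# run information written by the summarization program\n"
    "probeset_id\tcontrol1.CEL\tcontrol2.CEL\ttreated1.CEL\n"
    "2315101\t7.5\t7.25\t9.0\n"
    "2315102\t4.0\t4.5\t4.25\n"
)
samples, values = read_expression_table(example)
print(samples)
print(values["2315101"])
```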
Expression Analysis Parameters
The options presented in this interface
(Figure 2.4) allow the user to determine what fields are present in the gene
expression output file, what scale the data is in (e.g. logarithmic), which
probe sets to use when calculating gene expression and how to filter probe sets
for subsequent analyses.
1) Perform an alternative exon analysis - Selecting the option “just expression” will halt the analysis
after the gene expression result file has been written, such that no splicing
analysis is performed. This option is only available for splicing-sensitive
arrays.
2) Expression data format - Indicates
the format in which the CEL file summarized data has been written. When the CEL
files have been processed by AltAnalyze, ExpressionConsole, APT, RMAExpress or
through R, the file format will be logarithmic base 2 (log). If the user
designates “not-log”, then expression values will be log base 2 (log2) transformed
prior to analysis.
3) Determine gene expression levels using - For splicing sensitive microarrays, the user has the choice to
alter the way in which gene expression values are calculated and how to filter
their probe set expression files prior to alternative exon analysis. When “yes”
is selected for this option, all core probe sets (Affymetrix core annotated and
any exon aligning probe sets) linked to a unique gene will be used to calculate a
measure of gene expression by taking the mean expression of all probe set
values. When “no” is selected, only those probe sets that have been
annotated as constitutive or common to the most isoforms will be used for gene
expression calculation. In either
case, only probe sets with at least one array possessing a DABG p-value less
than the user threshold will be retained (if a DABG p-value file is present).
In order to exclude this threshold, set the minimum DABG p-value equal to 1.
4) Include replicate experimental values in the export - Instructs AltAnalyze whether to include the expression
values associated with each CEL file in the output file. If not selected only
the mean expression value of all CEL files for each biological group will be
written.
5) Remove probesets with a DABG p-value above – When a DABG file has been produced (default when
summarizing CEL files with AltAnalyze for exon-arrays), this option is applied.
The default DABG p-value cutoff is p<0.05. This will filter out any
non-constitutive probe set that has a mean DABG p>0.05 for both compared biological
groups. For probe sets used in determining gene expression levels, both
biological groups must have a DABG p < user-value. In order to exclude this
option, you can remove the DABG file (contains the prefix “stats.”), or set
this value equal to 1.
6) Remove probesets expressed below (non-log) – This statistic is treated the same as the DABG p-value
cutoff, except that probe sets with a mean expression value less than this
cutoff will be excluded. The same rules apply to this value as to the DABG
value, where both variables must be true for probe set inclusion (e.g.,
p<0.05 and mean expression > 70). To exclude this option, set the
value to 1 (the lowest-stringency setting, as used for --rawexp by
--returnAll).
7) Perform GO/pathway analysis (longer run-time) – Choosing “decide later” will allow the user to view the
GO-Elite pathway and Gene Ontology over-representation analysis options after
the main gene expression and/or alternative exon analysis is run. This will
prompt a separate status window and results summary window displaying
over-representation statistics for pathway analysis. If the option “run
immediately” is selected, GO-Elite will run right away without a separate
window. Please note, GO-Elite analysis can take up to an hour per criterion
when using the default parameters (e.g., 2000 permutations and analysis of both
Gene Ontology and Pathways). For more details on this analysis see: http://www.genmapp.org/go_elite/help_main.htm
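The DABG and expression filters in items 5 and 6 above can be sketched in Python as follows. This is one reading of the rules rather than the AltAnalyze implementation: the function names are hypothetical, a group is taken to pass only when both its mean DABG p-value and its mean non-log expression meet their cutoffs, and values exactly equal to a cutoff are treated as passing.

```python
from statistics import mean


def group_passes(dabg_p, expression, dabg_cutoff=0.05, expr_cutoff=70.0):
    """True if a biological group meets both cutoffs.

    dabg_p may be None when no DABG file is available, in which case
    only the expression cutoff is applied.
    """
    dabg_ok = True if dabg_p is None else mean(dabg_p) <= dabg_cutoff
    return dabg_ok and mean(expression) >= expr_cutoff


def keep_probeset(groups, constitutive, dabg_cutoff=0.05, expr_cutoff=70.0):
    """Apply the inclusion rule to the two compared groups.

    groups is a list of (dabg_pvalues, expression_values) pairs.
    Probe sets used for gene expression (constitutive) need every group to
    pass; other probe sets are removed only when no group passes.
    """
    passed = [group_passes(p, e, dabg_cutoff, expr_cutoff) for p, e in groups]
    return all(passed) if constitutive else any(passed)


# Invented example values: the control group is detected, the treated group is not.
control = ([0.01, 0.03], [150.0, 170.0])
treated = ([0.20, 0.40], [40.0, 60.0])

print(keep_probeset([control, treated], constitutive=False))
print(keep_probeset([control, treated], constitutive=True))
# Setting both cutoffs to 1 effectively disables the filters.
print(keep_probeset([control, treated], constitutive=True, dabg_cutoff=1, expr_cutoff=1))
```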
Alternative Exon Analysis Parameters
The options presented in this menu
(Figure 2.5) instruct AltAnalyze what statistical methods to use when
determining alternative exon expression, which probe sets to select for
analysis, what domain-level and miR-BS analyses to perform and what additional
values to export for analyses in other tools. Each analysis
algorithm is covered in detail in section 3.2.
1) Select the alternative exon algorithm – For exon arrays, the splicing-index and FIRMA methods are
provided, whereas for junction arrays, several methods are
available. These methods are used to calculate an alternative exon score,
relative to gene expression levels. The default value for splicing-index
analysis is 2, indicating that an adjusted expression difference greater than
two fold (up- or down-regulated) is required for the probe set to be reported.
Based on the algorithm, different values and scales will apply. For
junction arrays, the ASPIRE algorithm default cutoff is 0.2, whereas the
linear-regression default is 2. For linear-regression (linearregres), a
minimum value of 2 will select any linear-regression fold greater than 2
(result folds are reported in log2 scale, however), up- or down-regulated,
whereas ASPIRE scores range from -1 to 1. See algorithm descriptions for
more details (section 3.2).
2) Minimum alternative exon score – This value will vary based on the alternative exon analysis method
chosen (see above options).
3) Max MiDAS/normalized intensity p-value – This is the p-value cutoff applied to MiDAS and splicing-index
or FIRMA t-test p-values for exon array analyses. Currently, the user cannot set
different p-value thresholds for these two statistics. More on MiDAS can be
found below and in section 3.2.
4) Select probe sets to include – This option is used to increase or decrease the stringency of the
analysis. In particular, this option allows the user to restrict what type of
probe sets are to be used to calculate an alternative exon score. In the case
of junction arrays, this option includes the ability to merge the expression
values of probe sets that measure the same differential inclusion of an exon
(combined-junctions). For exon arrays, there are three options, “core”,
“extended” and “full”. Although these are the same probe set class names used
by Affymetrix to group probe sets, AltAnalyze uses a modification of these
annotations. Specifically, probe sets with the core annotation include all
Affymetrix core probe sets that specifically overlap with a single Ensembl gene (2)
(based on genomic position) along with
any probe set that overlaps with an Ensembl or UCSC exon (3). Likewise, extended and full probe sets
are those remaining probe sets that also align to a single Ensembl gene, with
the Affymetrix extended or full annotation.
5) Maximum absolute gene-expression change – This value indicates the maximum gene expression fold change
(non-log, up- or down-regulated) that is allowed for a gene to be reported as
alternatively regulated. The default is 3-fold, up- or down-regulated. This
filter is used with the assumption that alternative splicing is a less critical
factor when a gene is highly differentially expressed.
6) Perform permutation analysis - (junction arrays only) This analysis reports a p-value that represents the
likelihood of the observed alternative exon score occurring by chance, after
randomizing the expression values of all samples.
7) Maximum reciprocal-junction permutatep – (junction arrays only) This p-value
cutoff applies to the permutation based alternative exon score p-values when
performing ASPIRE or linearregres (see section 3.2).
8) Export all normalized intensities – This option can be used to compare alternative exon scores for
multiple biological comparisons, prior to filtering, outside of AltAnalyze. For
example, if comparing multiple tissues, the user may wish to export all
splicing-index or FIRMA scores for all tissue comparisons. The results will be
stored to the AltResults folder in the user-defined output-directory.
9) Calculate MiDAS p-values – This statistic is analogous to the t-test p-value calculated during the
splicing-index analysis (see section 3.2 for more details). If not selected,
then only the splicing-index or FIRMA (depending on the user selection) fold
and p-value will be used to filter alternative exon results.
10) Calculate normalized intensity p-values – Indicates whether to calculate
the splicing-index or FIRMA t-test p-values and filter using the above
threshold.
11) Filter results for predicted AS – This option instructs AltAnalyze
to only include regulated probe sets in the output that have been assigned a
valid splicing annotation (e.g., alternative-cassette exon) provided by
AltAnalyze. These annotations exclude probe sets with no annotations or those
with only an alternative N-terminal exon or alternative promoter annotation.
12) Align probesets to protein domains using – This option is used to restrict
the annotation source for domain/feature over-representation analysis. If
“direct-alignment” is chosen, only those probe sets that overlap with the
genomic coordinates of a protein domain/feature will be included in the
over-representation analysis, otherwise, the inferred method is used (see
section 3.2 for more details).
13) Number of algorithms required for miRNA binding site reporting – This option is used to filter out miR-BS predictions that only occur in
one of the four miR-BS databases examined. For more miR database information
see section 6.5.
14) Type of group comparison to perform – This option indicates whether to
only perform pair-wise alternative exon analyses (between two groups) or to
analyze all groups, without specifying specific comparisons.
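Taken together, the score, p-value and gene-expression filters in items 2, 3 and 5 above amount to a reporting rule that can be sketched in Python as follows. The function name is hypothetical, the inputs are assumed to be log2 values converted to non-log folds before comparison, and the handling of values exactly equal to a cutoff is an assumption, since the text does not state it.

```python
def report_probeset(alt_log2_fold, gene_log2_fold, midas_p=None, norm_intensity_p=None,
                    min_alt_score=2.0, max_p=0.05, max_gene_fold=3.0):
    """Decide whether an exon-array probe set would pass the default filters.

    alt_log2_fold  -- splicing-index (or similar) score in log2 scale
    gene_log2_fold -- gene expression change between the two groups, log2
    A p-value of None means that statistic was not calculated and is skipped.
    """
    alt_fold = 2 ** abs(alt_log2_fold)    # up- or down-regulated
    gene_fold = 2 ** abs(gene_log2_fold)
    if alt_fold <= min_alt_score:
        return False                      # alternative exon change too small
    if gene_fold > max_gene_fold:
        return False                      # gene itself is strongly regulated
    for p in (midas_p, norm_intensity_p):
        if p is not None and p >= max_p:
            return False                  # same cutoff for both p-values
    return True


# Invented example values.
print(report_probeset(1.5, 0.5, midas_p=0.01, norm_intensity_p=0.03))
print(report_probeset(1.5, 2.0, midas_p=0.01, norm_intensity_p=0.03))
print(report_probeset(0.5, 0.5, midas_p=0.01, norm_intensity_p=0.03))
```

Using one max_p for both statistics mirrors the note in item 3 that separate thresholds cannot currently be set.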
2.5 Overview of Analysis Results
AltAnalyze will output two main types of files:
1) Gene expression (GE) summary
2) Alternative exon summary
Gene Expression Summary Data
The GE summary files are two files that contain
all computed gene expression values from your dataset, saved
to the folder "ExpressionOutput" in the user-defined output directory. The first is a complete dataset summary
file with the prefix "DATASET" followed by the user-defined dataset
name, containing all array expression values (gene-level for exon arrays),
calculated group statistics (mean expression, folds, raw and adjusted t-test and f-test p-values) and gene annotations (e.g.,
gene symbol, description, Gene Ontology, pathway and some custom groups).
For exon arrays, the gene expression values are derived from either
probe sets that align to exons that are found in the largest number of
transcripts (constitutive), and thus are informative for transcription,
or all core probe sets for that gene (see "Select expression analysis
parameters", in Section 2.3). Constitutive probe sets are determined
by finding discrete exon regions that are common to the most mRNA transcripts
(Ensembl and UCSC) for all transcripts used in the AltAnalyze database
build (section 6). When one or more gene expression reporting probe
sets have DABG p-values with at least one biological group with a mean
value below the user defined threshold, these probe sets will be used
to calculate gene expression, otherwise, all gene expression reporting
probe sets will be used. The second file, called the
GenMAPP input file, contains a subset of columns from the dataset summary
file for import into GenMAPP [4] or
PathVisio [5] (http://www.pathvisio.org/PathVisio2).
This file has the prefix "GenMAPP" and excludes all gene annotations and
individual array expression values.
Alternative Exon Summary Data
These results are produced from all probe sets
that may suggest alternative splicing, alternative promoter regulation,
or any other variation relative to the gene expression for that gene
(derived from comparisons file). When the user chooses to either analyze
all groups rather than just pair-wise comparisons or both, the same
output files will be produced but report MiDAS p-values comparing all
conditions and the maximum possible splicing index fold between all
conditions (Section 3.2). Each set of results corresponds to a single
pair-wise comparison (e.g., cancer vs. normal) and will be named with
the group names you assigned. Five sets of results files are produced
in the end:
1) Probe set-level - Probe set-level statistics, exon annotations,
AS/APS annotations, and functional predictions (protein, domain and
miRNA binding site).
2) Gene-level - Gene-level summary of data in probe set-level
file.
3) Domain-level - Over-representation analysis of gene-level
domain changes due to alternative exon regulation.
4) miRNA binding sites - Over-representation analysis of gene-level,
predicted miRNA binding sites present in alternatively regulated exons.
5) Summary statistics file - Global statistics, reporting the number
of genes alternatively regulated, number differentially expressed and
summary protein association information (e.g., mean regulated protein
length).
Each file is a tab delimited text file
that can be opened, sorted and filtered in a spreadsheet program. These files are saved to the user-defined
output directory under "AltResults/AlternativeOutput", all with the same
prefix (pair-wise group comparisons). AltAnalyze will analyze all pair-wise
comparisons in succession and combine the probe set-level and gene-level
results into two additional separate files (named based on the splicing
algorithm chosen).
Probe set- and Gene-Level Alternative Exon Result Files
The probe
set-level file contains alternative exon data for either one probe set
(exon-array) or reciprocal probe sets (junction array). This includes:
The gene-level file contains a summary of the data at the gene level, with each
row representing a unique gene. This file also includes:
· Gene Ontology and pathway information for each gene
extracted from any Affymetrix CSV annotation files for that species present
in the directory "AltDatabase/Affymetrix/*species*".
Protein Feature and MicroRNA Binding Site Over Representation Files
Over-representation analyses (files 3 and 4) have the same structure:
Section 3 - Algorithms
Multiple algorithms are available in AltAnalyze
to identify individual probe sets (for exon arrays (EA)) or reciprocal
probe sets (exon-exon junction array (JA)) that are differentially regulated
relative to gene expression changes. These include the splicing index
method (EA), MiDAS (EA), ASPIRE (JA) and Linear Regression (JA). In addition
to these statistical methods, several novel methods are used to predict
which alternative proteins correspond to a regulated exon, which protein
domain/features differ between these and which RNA regulatory sequences
differ between the associated transcripts (e.g., miR-BS).
3.1 Default Methods
The default options are stored in
external text files in the folder "Config" as "defaults-expr.txt", "defaults-alt_exon.txt",
and "defaults-funct.txt".
These options correspond to those
found in the configuration file "options.txt". The user is welcome to modify
the defaults and theoretically even the options in the "options.txt" file;
however, care is required to ensure that these options are supported by
the program. Since AltAnalyze is an
open-source program, it is feasible for the user to add new species and array
support, either independently or with AltAnalyze support. The default algorithms for the
currently supported arrays are as follows:
3.2 Algorithm Descriptions
Gene Expression Analysis
For the simple gene expression output files saved to the ExpressionOutput directory, several basic expression statistics are calculated. These statistics are calculated for any user-specified pair-wise comparisons (e.g., cancer versus normal) and between all groups in the user dataset (e.g., time-points of differentiation). These statistics comprise the following: (1) rawp, (2) adjp, (3) log-fold, (4) fold change, (5) ANOVA rawp, (6) ANOVA adjp and (7) max log-fold. The rawp is a one-way analysis of variance (ANOVA) p-value calculated for each pair-wise comparison (two groups only). The adjp is the Benjamini-Hochberg (BH) 1995 adjusted value of the rawp. The log-fold is the log2 fold calculated by geometric subtraction of the experimental from the control groups for each pair-wise comparison. The fold change is the non-log2 transformed fold value. The ANOVA rawp is the same as the comparison rawp, but for all groups analyzed (note, this is reduced to the rawp when only two groups are analyzed). The ANOVA adjp is the BH-adjusted value of the ANOVA rawp. The max log-fold is the log2 fold value between the lowest group mean and the highest group expression mean for all conditions in the dataset. These statistics are intended for further data filtering and prioritization in order to assess putative transcription differences between genes.
Splicing Index Method
This algorithm is described in
detail in the following publications: [6,7]. In brief, the expression value of each probe set
for a condition is converted to log space (if necessary). For each probe
set examined, the mean gene expression value is subtracted
from its expression (log2) to create a gene expression corrected
log ratio (subtract instead of divide when these values are in log space).
This value is the normalized intensity. This normalized intensity is
calculated for each microarray sample, using only data from that sample.
To derive the splicing-index value, the group normalized intensity of
the control is subtracted from the experimental. This value is the change
in exon-inclusion (delta I or splicing index fold change).
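The calculation can be sketched in Python as follows, assuming that the normalized intensity is the probe set expression minus the gene expression in log2 space (the probe set intensity/constitutive expression ratio given for --exportnormexp) and that the group value is the mean of the per-sample normalized intensities. The function names and example values are hypothetical.

```python
from statistics import mean


def normalized_intensities(probeset_log2, gene_log2):
    """Per-sample normalized intensity: probe set minus gene expression (log2)."""
    return [p - g for p, g in zip(probeset_log2, gene_log2)]


def splicing_index(probeset_log2, gene_log2, groups, experimental, control):
    """Mean normalized intensity of the experimental group minus the control group."""
    by_group = {}
    for value, group in zip(normalized_intensities(probeset_log2, gene_log2), groups):
        by_group.setdefault(group, []).append(value)
    return mean(by_group[experimental]) - mean(by_group[control])


# Invented example: four samples, two per group.
groups = ["control", "control", "treated", "treated"]
probeset = [6.0, 6.0, 8.0, 8.0]   # log2 probe set expression per sample
gene = [9.0, 9.0, 10.0, 10.0]     # log2 gene expression per sample

print(normalized_intensities(probeset, gene))
print(splicing_index(probeset, gene, groups, "treated", "control"))
```

In this example the gene rises by one log2 unit while the probe set rises by two, so the splicing index is one log2 unit (two-fold) after correcting for gene expression.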
FIRMA Analysis
FIRMA, or Finding Isoforms using Robust Multichip Analysis (8), is an alternative to the splicing-index approach for
calculating alternative splicing statistics. Rather than using the probe set
expression values to determine differences in the relative expression of an
exon for two or more conditions, FIRMA uses the residual values produced by the
RMA algorithm for each probe, corresponding to a gene. The median of the
residuals for each probe set, for each array sample, is compared to the median
absolute deviation for all residuals and samples for the gene.
Although the core FIRMA methods are the same as the
original implementation in R, AltAnalyze FIRMA differs in several important
ways:
1) The probe composition of each gene is defined by the standard AltAnalyze
core, extended or full probe set definitions, rather than the Affymetrix
transcript cluster definitions. Thus, each probe must correspond to a single
Ensembl gene to be analyzed.
2) While FIRMA scores for each sample and probe set are calculated, only
summary statistics are reported in the standard AltAnalyze output files. This
statistic is the average FIRMA score for all samples in the experimental group
minus the average of the FIRMA scores in the designated baseline group. If no
comparisons are specified, the two groups with the largest difference in scores
are reported.
3) FIRMA scores are organized into groups for calculation of summary
statistics (FIRMA fold change and t-test p-values). However, scores for each
probe set and sample can be optionally exported.
Users can
define whether to use the AltAnalyze core, extended or full annotations in the
program interface or using the --probetype flag in the command-line interface.
A unique numerical ID corresponding to each Ensembl gene and all associated
gene probe sets for FIRMA analysis are stored in a metaprobeset file in the
array annotation directory (e.g., AltDatabase/EnsMart54/Hs/exon/
Hs_exon_core.mps). This file is used to define gene level probe sets by the
program APT.
As with splicing-index
analysis of exon-tiling data, FIRMA analysis includes expression and DABG filtering of probe set
expression, FIRMA score p-value filtering and gene expression
reporting/filtering based on either constitutive or all exon aligning probe
sets. To export FIRMA scores for each probe set and sample, select the option
“Export all normalized intensities” from the “Alternative Exon Analysis
Parameters” window, or use the flag --exportnormexp in the command-line interface.
MiDAS
The MiDAS statistic is described in detail in the white paper:
www.affymetrix.com/support/technical/whitepapers/exon_alt_transcript_analysis_whitepaper.pdf.
This analysis method is available from the computer program APT, mentioned
previously. APT uses a series of text files to examine the expression
values of each probe set compared to the expression of user supplied
gene expression reporting probe sets. Since AltAnalyze uses only probe
sets found to align to a single Ensembl gene, AltAnalyze creates its
own unique numerical gene identifiers (different from the Affymetrix
transcript clusters). When these identifiers are written, a conversion
file is also written that allows AltAnalyze to translate from this
arbitrary numerical ID back to an Ensembl gene ID. These relationships are
stored in the following files along with the probe set expression values:
When the user selects the option "Calculate MiDAS p-values", AltAnalyze first exports expression
data for selected probe sets (e.g., AltAnalyze "core" annotated - Figure
2.5) to these files for all pair-wise comparisons. Once exported, AltAnalyze
will communicate with the APT binary files packaged with AltAnalyze
to run the analysis remotely. MiDAS will create a folder with the pair-wise
comparison dataset name and a file with MiDAS p-values that will be
automatically read by AltAnalyze and used for statistical filtering
(stored in the AltAnalyze program directory under "AltResults/MiDAS").
These statistics will be clearly labeled in the results file for each
probe set and used for filtering based on the user-defined p-value thresholds
(Figure 2.5).
Note: Different versions of the APT MiDAS binary have been distributed.
AltAnalyze is distributed with two versions that report slightly different
p-values. The older version (1.4.0) tends to report larger p-values than the
most recently distributed version (1.10.1). Previous versions of AltAnalyze
used version 1.4.0, while AltAnalyze version 1.1 uses MiDAS 1.10.1, which
produces p-values that are typically equivalent to the splicing-index
p-values calculated by AltAnalyze (larger).
ASPIRE
For exon-exon junction microarray data (e.g., AltMouseA), the algorithm
analysis of splicing by isoform reciprocity, or ASPIRE, was adapted from the
original report [8] for inclusion into AltAnalyze. This
algorithm uses the expression of probe sets aligning to two competitive
exon-exon junctions, or one exon-exon junction and an exon aligning
probe set along with gene expression values calculated as described
with the splicing-index method. These probe set relationships were derived
using the Affymetrix exon or exon-exon junction names (e.g., E1-E3 and
E2-E3 or E1-E3 and E2), obtained by the Affymetrix AltMerge transcript
assembly program [9]. For exon-exon junctions
and exons aligning to the same gene, reciprocal probe set pairs were
extracted using the ExonAnnotate_module.py program in AltAnalyze
using the identifyPutativeSpliceEvents function. Such splicing events
are classified as mutually-exclusive (mx-mx) or exon-inclusion/exon-exclusion
(ei-ex). Mutually-exclusive splicing events represent an exchange of
one exon for another (e.g., when E2-E4 and E1-E3 are compared, the E2
and E3 are considered regulated exons). For ei-ex, the proximal exon
in the compared junctions is considered the regulated exon (e.g., when
E1-E2 and E1-E3 are compared, E2 is the regulated exon).
Similar to the splicing-index method, for each reciprocal probe set, a ratio
is calculated for expression of the probe set (non-log) divided by the mean of
all gene expression reporting probe sets (non-log), for the baseline and
experimental groups. The ASPIRE dI is then calculated for the inclusion
(ratio1) and exclusion (ratio2) probe sets, as such:
Rin = baseline_ratio1/experimental_ratio1
Rex = baseline_ratio2/experimental_ratio2
I1 = baseline_ratio1/(baseline_ratio1+baseline_ratio2)
I2 = experimental_ratio1/(experimental_ratio1+experimental_ratio2)
in1 = ((Rex-1.0)*Rin)/(Rex-Rin)
in2 = (Rex-1.0)/(Rex-Rin)
dI = ((in2-in1)+(I2-I1))/2.0
If ((Rin>1 and Rex<1) or (Rin<1 and Rex>1)) and the absolute dI score is
greater than the user supplied threshold (default is 0.2), then the dI is
retained for the next step in the analysis. If designated by the user, this
next step will be a permutation analysis of the raw input data to determine
the likelihood of each ASPIRE score occurring by chance alone. This
permutation p-value is calculated by first storing all possible combinations
of the two group comparisons. For example, if there are 4 samples (A-D)
corresponding to the control group and 4 samples (E-H) in the experimental
group, then all possible combinations of 4 and 4 samples would be stored
(e.g., [B, C, G, H] and [A, D, E, F]). For each permutation set, ASPIRE
scores are re-calculated and stored for all of these combinations.
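The dI calculation and the regrouping of samples described above can be sketched in Python as follows. This is an illustrative sketch and not the AltAnalyze implementation: the function names are hypothetical, I1 and I2 are taken to be the two inclusion fractions that appear in the dI expression, and all input ratios are assumed to be positive non-log values.

```python
from itertools import combinations

def aspire_di(baseline_ratio1, experimental_ratio1,
              baseline_ratio2, experimental_ratio2):
    """Return (dI, valid) for one reciprocal probe set pair."""
    rin = baseline_ratio1 / experimental_ratio1
    rex = baseline_ratio2 / experimental_ratio2
    # Only reciprocal changes (one ratio up, the other down) are valid.
    valid = (rin > 1 and rex < 1) or (rin < 1 and rex > 1)
    if not valid:
        return None, False
    i1 = baseline_ratio1 / (baseline_ratio1 + baseline_ratio2)
    i2 = experimental_ratio1 / (experimental_ratio1 + experimental_ratio2)
    in1 = ((rex - 1.0) * rin) / (rex - rin)
    in2 = (rex - 1.0) / (rex - rin)
    return ((in2 - in1) + (i2 - i1)) / 2.0, True

def group_splits(samples, n_first):
    """Yield every way to relabel the samples into two groups."""
    for first in combinations(samples, n_first):
        second = [s for s in samples if s not in first]
        yield list(first), second

# Hypothetical ratios: inclusion halves while exclusion doubles.
di, valid = aspire_di(2.0, 1.0, 1.0, 2.0)
retained = valid and abs(di) > 0.2  # 0.2 is the default threshold

# Eight samples split 4 and 4, as in the example in the text.
splits = list(group_splits(list("ABCDEFGH"), 4))
```

Because the valid condition requires Rin and Rex to lie on opposite sides of 1, the denominator (Rex-Rin) cannot be zero for any pair that reaches the in1 and in2 steps.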
The permutation p-value is the number of times that the absolute value
of a permutation ASPIRE score is greater than or equal to the absolute
value of the original ASPIRE score (value = x), divided by the number
of possible permutations that produced a valid ASPIRE score ((Rin>1
and Rex<1) or (Rin<1 and Rex>1)). If this p-value is less than the
user-defined threshold, or x<2 (since some datasets have a small number of
samples and thus little power for this analysis), the reciprocal probe sets
are reported.
Linear Regression
When
working with the same type of reciprocal probe set data as ASPIRE, a
linear regression based approach can be used with similar results. This
method is based on a previously described approach [10]. This algorithm uses
the same input as ASPIRE (junction comparisons, constitutive adjusted
expression ratios). For each of the two biological conditions (control and
experimental), the constitutive corrected expression values of all samples
for the two reciprocal junctions are plotted against each other and a slope
is calculated using the least squares method. In each case, the slope is
forced through the origin of the graph (model = y ~ x - 1 as opposed to
y ~ x). The final linear regression score is the log2
ratio of the slope of the experimental group divided by the baseline
group. This ratio is analogous to a fold change, where 1 is equivalent
to a 2-fold change. When establishing cut-offs, select 2 to designate
a minimum 2-fold change. The same permutation analysis used for ASPIRE
is also available for this algorithm.
Note: For previously published analyses ([10] and Salomonis et al. in
preparation), linear regression was implemented using the algorithm rlm,
which is part of the R MASS package. The Python interface to R, rpy, was used
to run these analyses (which requires installation of R). To use this option,
select "linearregres-rlm" under "select the alternative exon algorithm"
(Figure 2.5). If a related warning appears while running, you may need to
load AltAnalyze.py and associated source code directly (requires installation
of an R version-specific rpy - see Linux and Source Code AltAnalyze
instructions).
External Alternative Exon Analysis File Import
In addition to providing several alternative exon analysis algorithm options
in AltAnalyze (MiDAS, FIRMA, splicing-index), users can import results from
other programs for downstream functional annotation analyses. These analyses
consist of alternative exon annotation (alternative splicing and alternative
promoter selection), protein isoform, protein domain and microRNA binding
site disruption analysis and pathway over-representation. Two options for
external probe set analysis are now available:
1) Annotation of 3rd party alternative exon data in AltAnalyze
2) Restricted analysis of probe sets using a pre-defined list
Both options are conceptually similar: provide AltAnalyze with a list of
probe sets and, optionally, statistics, and AltAnalyze will return
alternative exon statistics (option 2) and/or functional annotations for the
input probe sets (options 1 and 2).
Option 1 – Import and annotation of 3rd party probe set results
For this option, no raw data is required (e.g., CEL files or expression
values), just a list of probe sets of interest. This option is available by
selecting the main analysis option “Annotate External Results” and selecting
a tab-delimited probe set list. The probe sets supplied in this list can be
produced by any alternative exon analysis program the user prefers, as long
as the output is exon-level Affymetrix probe sets. Users can provide direct
output files from JETTA or provide results in a more generic format,
consisting of regulated probe sets and result statistics (optional).
The generic file format for alternative exon results import is:
1) (Column 1) Probe set ID (required)
2) (Column 2) Alternative-exon fold (optional)
3) (Column 3) Alternative-exon p-value (optional)
4) (Column 4-100) Ignored data (optional)
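To make the column layout above concrete, the following Python sketch reads a file in this generic format. It is an illustration only and not the AltAnalyze importer: the function names and probe set IDs are hypothetical, and the file is assumed to have no header row.

```python
import csv
import io

def _to_float(text):
    """Convert a statistic to float, or None if it is blank or non-numeric."""
    try:
        return float(text)
    except ValueError:
        return None

def read_generic_results(handle):
    """Map probe set ID -> (fold, p-value); columns 4 and up are ignored."""
    results = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        probeset = row[0].strip()
        fold = _to_float(row[1]) if len(row) > 1 else None
        pval = _to_float(row[2]) if len(row) > 2 else None
        results[probeset] = (fold, pval)
    return results

# Hypothetical input: one fully populated row and one ID-only row.
example = "2598321\t1.8\t0.003\tsome ignored note\n2598330\n"
results = read_generic_results(io.StringIO(example))
```

Rows that supply only a probe set ID are kept, with both statistics recorded as missing, since only Column 1 is required.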
In addition to this generic format, results from the program JETTA can also
be directly imported. Since this method produces two p-values for the MADS
algorithm, the smallest of the two p-values is used. The results from this
analysis are typical AltAnalyze result files, including input for DomainGraph
and include any alternative exon statistics supplied with the probe set list.
Option 2 – Restricted probe set analysis (Advanced Feature)
This option can be performed with any of the main AltAnalyze analysis options
(e.g., Process CEL Files, Expression Files or AltAnalyze Filtered). It is
triggered by supplying a list of probe sets for each comparison of interest
in the directory “filtering” in the AltAnalyze program directory
(AltAnalyze_v1release/AltDatabase/filtering). The restricted filtering files
are tab-delimited text files containing the corresponding exon probe set ID
(Column 1) and optionally exon statistics and annotations (Columns 2-100).
While the statistics and annotations provided in the restricted filtering file
are not used for any computation, they are appended as extra columns in the
alternative exon results file. Multiple files, corresponding to specific
biological comparisons (e.g., Tumor_vs_wild-type), can exist in this folder
and only those files whose name matches the comparison name will be matched
and used for restricted analysis. For example, if biological sample groups A,
B and C are being compared, to filter C_vs_A, the restricted probe set file
must be named C_vs_A.txt. Note: If the user enters any filters (e.g., maximum
DABG p value threshold, minimum alternative exon score), these will further
restrict which “significant” probe sets are reported. To report all probe
sets found in the AltAnalyze database, make sure to set all thresholds to the
minimum or maximum value (e.g., --dabgp 1 --rawexp 1 --altp 1 --probetype full
--altscore 1 --GEcutoff 10000; see Section 2.3). Alternatively, users working
with the command-line interface can use the option --returnAll to
automatically enter these minimum values (see Section 2.3).
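The file naming rule for restricted analysis can be illustrated with a short Python sketch. This is not the AltAnalyze code: the function names, folder contents and probe set IDs are hypothetical, and the match is assumed to be an exact, case-sensitive comparison of the file name against the comparison name plus ".txt".

```python
import os
import tempfile

def find_restricted_file(filtering_dir, comparison):
    """Return the path of '<comparison>.txt' in filtering_dir, or None."""
    wanted = comparison + ".txt"
    for name in sorted(os.listdir(filtering_dir)):
        if name == wanted:
            return os.path.join(filtering_dir, name)
    return None

def read_restricted_probesets(path):
    """Map probe set ID (Column 1) -> list of extra columns to append."""
    probesets = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if fields[0]:
                probesets[fields[0]] = fields[1:]
    return probesets

# Demonstration in a temporary folder standing in for the filtering directory.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "C_vs_A.txt"), "w") as out:
        out.write("3001001\t0.8\tcassette-exon\n")
    with open(os.path.join(folder, "B_vs_A.txt"), "w") as out:
        out.write("3001002\n")
    match = find_restricted_file(folder, "C_vs_A")
    matched_name = os.path.basename(match)
    restricted = read_restricted_probesets(match)
    missing = find_restricted_file(folder, "C_vs_B")
```

A comparison without a matching file (C_vs_B here) simply has no restricted list in this sketch.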
Domain/miR-BS Over-Representation Analysis
A z-score is calculated to assess over-representation of specific protein
features (e.g., domains) and miR-BSs found to overlap with probe sets that are
alternatively regulated according to the AltAnalyze user analysis. This
z-score is calculated by subtracting the expected number of genes with a
protein feature or miR-BS meeting the criterion (alternatively regulated with
the user supplied thresholds) from the observed number of genes and dividing
by the standard deviation of the observed number of genes. This z-score is a
normal approximation to the hypergeometric distribution. This equation is
expressed as:
z = (r - n*(R/N)) / sqrt(n*(R/N)*(1 - R/N)*(1 - (n-1)/(N-1)))
n = All genes associated with a given element
r = Alternatively regulated genes associated with a given element
N = All genes examined
R = All alternatively regulated genes
Once z-scores have been calculated for all protein features and miR-BSs
linked to alternatively regulated probe sets, a permutation analysis is
performed to determine the likelihood of observing these z-scores by chance.
This is done by randomly selecting the same number of regulated probe sets
from all probe sets examined and recalculating z-scores for all terms 2000
times. The likelihood of a z-score occurring by chance is calculated as the
number of times a permutation z-score is greater than or equal to the
original z-score divided by 2000. A Benjamini-Hochberg correction is used to
adjust this p-value for multiple hypothesis testing.
Gene Ontology and Pathway Over-Representation Analysis
To perform advanced pathway and Gene Ontology (GO) over-representation
analysis, AltAnalyze includes core Python modules from the program GO-Elite
(version 1.2) (http://www.genmapp.org/go_eite).
GO-Elite performs Gene Ontology (GO) and WikiPathway over-representation
using the same algorithms listed for domain/miR-BS over-representation
analysis (ORA) (e.g., z-score, permutation p-value, BH p-value calculation).
After ORA, all GO terms and pathways are filtered using user-defined
statistical cut-offs (z-score, permutation p-value and number of genes
changed), with GO terms further pruned to identify a minimally redundant
set of reported GO terms, based on hierarchical relationships and ORA
scores (see GO-Elite documentation).
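The three ORA statistics referred to above (z-score, permutation p-value and BH adjusted p-value) can be sketched in Python as follows. This is an illustration and not code taken from AltAnalyze or GO-Elite: the function names are hypothetical, the z-score is written as the standard normal approximation to the hypergeometric distribution and is assumed to correspond to the one described for domain/miR-BS analysis, and the BH step is the usual step-up adjustment.

```python
import math

def ora_z_score(r, n, R, N):
    """z-score for r regulated genes among n genes linked to a term,
    given R regulated genes among N genes examined."""
    expected = n * R / N
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    if variance <= 0:
        return 0.0  # degenerate case, e.g. every examined gene is regulated
    return (r - expected) / math.sqrt(variance)

def permutation_p(original_z, permuted_z):
    """Fraction of permutation z-scores at or above the original z-score."""
    hits = sum(1 for z in permuted_z if z >= original_z)
    return hits / len(permuted_z)

def benjamini_hochberg(pvalues):
    """Return BH adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts: 10 of 20 term genes regulated, 100 of 1000 overall.
z = ora_z_score(r=10, n=20, R=100, N=1000)
p = permutation_p(2.0, [0.5, 2.0, 3.1, -1.0])
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In the analysis described in the text, the list passed to the permutation function would hold the 2000 z-scores recalculated from random selections; the four values here are placeholders.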
The gene database used for GO-Elite is specific to the version of Ensembl
in the downloaded AltAnalyze gene database (Plus versions only). All
result files normally produced by GO-Elite will be produced through
AltAnalyze. AltAnalyze generates two types of gene lists
for automated analysis in GO-Elite; differentially expressed genes and
alternatively regulated genes. Criteria for differentially expressed
genes are defined by the user in the GO-Elite parameters window (e.g.,
fold difference > 2 and t-test p < 0.05). The user can choose
to use the rawp (one-way ANOVA, two groups) or adjp (BH adjusted value
of the rawp) in the software. All genes associated with alternative
exons are used for GO-Elite analysis. Genes with alternative exons that
also have alternative splicing annotations can be further selected using
the "Filter results for predicted AS"
option. Results are exported to the "GO-Elite" directory in the user-defined
output directory, while input gene lists can be found in the folders
"GO-Elite/input" and "GO-Elite/denominator". The gene lists for differentially
expressed genes have the prefix "GE." while the alternative exon files
have the prefix "AS.". The appropriate denominator gene file for each
is selected by GO-Elite. Species and array specific databases for GO-Elite
analysis are downloaded automatically from AltAnalyze when the species
database is installed. Array types supported for each species include
any Affymetrix arrays supported at the time the Ensembl database was released
along with any arrays supported by Ensembl. If a particular array is not
supported by AltAnalyze but has a valid Affymetrix CSV annotation file,
AltAnalyze will build the necessary GO-Elite relational databases to perform
both GO and WikiPathway ORA. Probe set, gene and GO relationships will be
extracted automatically from the Affymetrix annotation CSV file and gene to
WikiPathway relationships from http://www.wikipathways.org
(if relationships are present).
3.3 Probe set Filtering
Prior to
alternative exon analysis, AltAnalyze can be used to remove probe sets
that are not deemed as sufficiently expressed. For the two conditions
that AltAnalyze compares (e.g., cancer versus normal), a probe set will
be removed if neither condition has a mean detection above background
(DABG) p value less than the user threshold (e.g., 0.05). Likewise,
if neither condition has a mean probe set intensity greater than the
user threshold (e.g., 70), then the probe set will be excluded from
the analysis. When comparing two conditions (pairwise comparison) for
probe sets used to determine gene transcription (e.g., constitutive
aligning), both conditions will be required to meet these expression
thresholds in order to ensure that the genes are expressed in both conditions
and thus reliable for detecting alternative exons as opposed to changes
in transcription. When comparing all biological groups in the user dataset,
however, these additional filters are not used.
3.4 Constitutive and gene expression calculation
Identifying probe sets for gene expression calculations
A reliable estimate of transcriptional activity for each gene is required to both report basic gene expression statistics and perform alternative exon analysis. Exon arrays probe most well annotated exon regions, many introns, untranslated regions and theoretically transcribed regions. Gene expression values can be calculated in one of two ways in AltAnalyze: (A) using constitutive probe sets or (B) using core probe sets. The constitutive probe sets are recommended for all analyses.
Constitutive probe sets in AltAnalyze are classified as probe sets that align to a pre-determined set of exon regions that are most common to all mRNA transcripts used when the AltAnalyze database is created. RNA transcript exon structures and genomic positions are extracted from analogous versions of Ensembl and the UCSC genome browser database. While AltAnalyze database versions EnsMart49 and EnsMart52 used mRNA annotations from the Affymetrix probeset annotation file, EnsMart54 exclusively derives constitutive probe set annotation from UCSC and Ensembl exon structures (Figure 3.1).
In addition to constitutive probe sets, users have the option to use all probe sets that have an Affymetrix core probe set annotation or align to an AltAnalyze defined exon region. Affymetrix defines probe sets according to three criteria: "core", "extended" and "full". The Affymetrix core annotations are recommended by Affymetrix for calculation of constitutive levels; however, these are often a mix of known alternative exons and AltAnalyze annotated constitutive exons. In addition to these Affymetrix core exons, the AltAnalyze core set consists of any probe set that aligns to an exon region, but excludes probe sets aligning to any non-exon features, including introns and untranslated regions (Figure 3.1).
Calculating gene expression values
Gene expression values are calculated twice during an exon array analysis: (1) to report gene expression fold changes and summary statistics and (2) to calculate gene expression normalized intensities for alternative exons. While both analyses use the same set of starting probe sets (constitutive or core), they differ in how filtering of these probe sets occurs. These differences are important when trying to eliminate false positive alternative exon predictions.
For gene expression summary reporting, expression values from all constitutive or core probe sets (user defined) are filtered and then averaged. For this analysis, the constitutive or core probe sets are first identified from the AltAnalyze database (Ensembl_probeset file) and the corresponding expression and detection probabilities are extracted from the APT RMA result files for the input array files. For each biological group (typically containing replicates), the mean expression and mean detection above background probabilities are calculated for each probe set. If the mean expression and the mean detection above background probability do not meet the user defined thresholds (by default p<0.05 and expression>70) in at least one group, then this probe set will not be used to calculate the gene expression value for the corresponding gene. The average expression of all remaining probe sets will be calculated for each array to obtain a mean gene expression value for each gene for each replicate. Thus, any differences between probe sets used to calculate gene expression will be averaged. For example, when all core probe sets are used, the expression of any alternative exons will be averaged with the non-alternative exons. If no constitutive or core probe sets for a gene meet these criteria (see below), then all will be used to calculate gene expression. Only one set of gene expression values is reported for each Ensembl gene.
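The filtering and averaging used for gene expression summary reporting can be sketched in Python as below. This is an illustrative sketch rather than the AltAnalyze implementation: the function name, data layout and values are hypothetical, expression values are assumed to be on the same scale as the expression threshold, and a probe set is assumed to be kept when at least one biological group meets both thresholds.

```python
from statistics import mean

def gene_expression(probesets, groups, max_dabg=0.05, min_expr=70):
    """Average the probe sets of one gene after expression/DABG filtering.

    probesets: {probeset_id: {"expr": {sample: value}, "dabg": {sample: p}}}
    groups:    {group_name: [sample, ...]}
    Returns ({sample: gene expression value}, [probe set IDs used]).
    """
    def passes(ps):
        # Keep the probe set if any group meets both thresholds on average.
        for samples in groups.values():
            if (mean(ps["expr"][s] for s in samples) > min_expr and
                    mean(ps["dabg"][s] for s in samples) < max_dabg):
                return True
        return False

    kept = [pid for pid, ps in probesets.items() if passes(ps)]
    if not kept:
        kept = list(probesets)  # fall back to all probe sets for the gene
    samples = [s for members in groups.values() for s in members]
    values = {s: mean(probesets[pid]["expr"][s] for pid in kept)
              for s in samples}
    return values, kept

groups = {"normal": ["n1", "n2"], "cancer": ["c1", "c2"]}
probesets = {
    "ps1": {"expr": {"n1": 100, "n2": 120, "c1": 200, "c2": 180},
            "dabg": {"n1": 0.01, "n2": 0.02, "c1": 0.01, "c2": 0.01}},
    "ps2": {"expr": {"n1": 20, "n2": 30, "c1": 25, "c2": 35},
            "dabg": {"n1": 0.4, "n2": 0.4, "c1": 0.4, "c2": 0.4}},
    "ps3": {"expr": {"n1": 60, "n2": 60, "c1": 90, "c2": 110},
            "dabg": {"n1": 0.2, "n2": 0.3, "c1": 0.01, "c2": 0.03}},
}
values, kept = gene_expression(probesets, groups)
```

In this example ps2 is dropped because no group passes, while ps3 is kept because the cancer group passes even though the normal group does not.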
This same strategy is used for calculating a gene expression value for the alternative exon analysis with the following differences: (1) a probe set used to calculate gene expression must demonstrate evidence of expression (expression and detection probability filtering) in both conditions (pairwise comparisons only) or it is removed and (2) if no probe sets remain after expression and detection probability filtering, then the probe sets for the gene will NOT be analyzed for alternative splicing. If all groups are being compared for the alternative exon analysis, rather than direct comparison of two groups, then only one biological group must meet the expression and detection probability thresholds for the gene expression (same as for gene expression summary reporting, above). These additional requirements ensure that the gene is expressed at sufficient levels to assess alternative splicing in both compared biological groups. This is important in eliminating false positive changes in probe set expression that can occur when the gene is expressed in only one condition.
3.5 Alternative Splicing Prediction
To predict whether or not a single probe set (EA) or reciprocal probe
set-pair (JA) associates with an alternative splicing or alternative
promoter sequence, AltAnalyze uses two strategies: 1) identifying alternative
exons/introns based on de novo isoform comparison and 2) incorporating
splicing predictions from UCSC's "known_alt.txt" file.
De Novo Splicing Prediction and Exon Annotation
In
order to identify exons with alternative splicing or promoters, AltAnalyze
compares all available gene transcripts from UCSC and Ensembl to look
for shared and different exons. To achieve this, all mRNA transcripts
from UCSC's species-specific "mRNA.txt" file that have genomic coordinates
aligning to a single Ensembl gene and all Ensembl transcripts from each
Ensembl build are extracted. Only UCSC transcripts that have a distinct
exon composition from Ensembl transcripts are used in this analysis,
excluding those that have a distinct genomic start or stop position
for the first and last exon respectively (typically differing 5' and
3' UTR agreement), but identical exon-structure. When
assessing alternative splicing, cases of intron retention are identified
first. These regions consist of a single exon that spans two adjacent
exons in at least one other transcript for that same gene. These retained
introns are stored for later analysis, but eliminated as annotated exons.
Remaining exons are clustered based on whether their genomic positions
overlap (e.g., alternate 5' or 3' start sites). Each exon cluster is
considered an exon block with one or more regions, where each block
and region is assigned a numerical ID based on genomic rank (e.g., E1.1,
E1.2, E2.1, E3.1). For each exon in a transcript, the exon is annotated
as corresponding to an exon block and region number (Figure 3.1). All possible pair-wise transcript comparisons
for each gene are then performed to identify exon pairs that show evidence
of alternative exon-cassettes, alternative 3' or 5' splice sites or
alternative-N or -C terminal exons (Figure 3.1). All transcript exon
pairs are considered except for those adjacent to a retained intron.
This analysis is performed by comparing the exon block ID and region
IDs of an exon and its neighboring exons to the exon blocks and regions
in the compared transcript. Ultimately, a custom heuristic assigns the
appropriate annotation based on these transcript comparisons.
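The exon block and region numbering described above (e.g., E1.1, E1.2, E2.1, E3.1) can be illustrated with a small Python sketch. This is a simplified assumption of how such IDs could be assigned and not the AltAnalyze heuristic: the function name is hypothetical, coordinates are treated as forward-strand, and each distinct exon within an overlapping cluster is simply counted as one region.

```python
def assign_exon_ids(exons):
    """Cluster overlapping exons into blocks and number them by genomic rank.

    exons: iterable of (start, stop) genomic coordinates for one gene.
    Returns {(start, stop): "E<block>.<region>"}.
    """
    labels = {}
    block = 0
    region = 0
    block_end = None
    for start, stop in sorted(set(exons)):
        if block_end is None or start > block_end:
            # No overlap with the current cluster: open a new exon block.
            block += 1
            region = 0
            block_end = stop
        else:
            # Overlaps the current cluster: extend the block if needed.
            block_end = max(block_end, stop)
        region += 1
        labels[(start, stop)] = "E%d.%d" % (block, region)
    return labels

# Hypothetical exons: the first two overlap (e.g., alternate splice sites).
exon_ids = assign_exon_ids([(100, 200), (150, 250), (300, 400), (500, 600)])
```

Retained introns are assumed to have been removed before this step, as described in the text, since a single exon spanning two blocks would otherwise merge them.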
Incorporating UCSC Splicing Predictions
In addition to all de novo splicing annotations, additional splicing
annotations are imported from the UCSC genome database and linked to existing
exon blocks and regions based on genomic coordinate overlap. This comparison
is performed by the alignToKnownAlt.py module of AltAnalyze (called
from EnsemblImport.py). De novo and
UCSC splicing annotations are stored along with probe set Ensembl gene
alignment data in the file <species>_Ensembl_probesets.txt. These annotations
are used by AltAnalyze and DomainGraph. In the AltAnalyze result file, the
major UCSC KnownAlt splicing annotation types are altFivePrime,
altThreePrime, cassetteExon, altPromoter, bleedingExon and retainedIntron. In
comparison, the major AltAnalyze determined splicing annotation types are
alt-5', alt-3', cassette-exon, alt-N-term, intron-retention,
exon-region-exclusion and alt-C-term. These splicing annotations are
determined by comparing relative exon-junction positions for all analyzed
transcripts for each gene (see preceding
section). The annotation exon-region-exclusion is the opposite of intron-retention
(region most commonly described as exon rather than intron), alt-N-term
is similar to altPromoter (distinct 5' distal-transcript exon with shared
3' exons) and alt-C-term is the opposite of alt-N-term. To better understand
these annotations, look at specific probe set examples through the NetAffx
website (www.affymetrix.com/analysis/index.affx)
using the UCSC browser option.
Filtering for Alternative Splicing
As mentioned in previous sections, AltAnalyze includes the option to restrict
alternative exon results to only those probe sets or reciprocal probe
set-pairs predicted to indicate alternative splicing. AltAnalyze considers
alternative splicing as any alternative exon annotation other than an
alternative N-terminal exon or alternative promoter annotation derived from
de novo or UCSC genome database annotations.
3.6 Protein/RNA Inference Analysis
Identifying Alternative Proteins
Protein Domains
Probe set sequences present on exon or junction
arrays are used to identify which proteins align to or are missing from
transcripts for that gene. To do this, all Ensembl and UCSC mRNA transcripts
are extracted for a gene that corresponds to a given probe set. For
each transcript, all exon genomic coordinates are stored. For exon arrays,
transcripts with exons that contain the probe set genomic coordinates
are considered transcript aligning, while all others are considered
non-aligning. For junction arrays, if two reciprocal junction probe
sets align to distinct isoforms, these relationships are stored rather
than the aligning and non-aligning isoforms, however, if both isoforms
do not align to distinct transcripts, then the one with an aligning and
non-aligning set of transcripts is stored for further exon comparisons.
If a set of aligning and non-aligning isoforms
(competitive) is identified for a probe set or reciprocal junction pair,
all possible aligning and non-aligning pair-wise combinations are identified
to find those pair-wise comparisons with the smallest difference in
exon composition. This is accomplished by determining the number of
different and common exons each transcript pair contains (based on genomic
start and stop of the exon). When comparing the different transcript
pairs, the most optimal pair is selected by first considering the combined
number of distinct exons in both transcripts and second the number of
common exons. Thus, if one transcript pair has 4 exons in common
and 2 exons not in common, while a second pair has 5 exons in common
and 3 exons not in common, the former will be selected as the optimal
since it contains fewer overall differences in exon composition (even
though it has fewer common exons than the other pair). A theoretical
example is illustrated in Figure 3.2.
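The pair selection rule described above can be sketched in Python as follows. This is an illustration and not the AltAnalyze source: the function name, transcript IDs and coordinates are hypothetical, and exons are compared purely by their genomic start and stop.

```python
from itertools import product

def best_isoform_pair(aligning, non_aligning):
    """Pick the aligning/non-aligning transcript pair that differs the least.

    aligning, non_aligning: {transcript_id: set of (start, stop) exons}
    Pairs are ranked first by the fewest exons not in common and then by
    the most exons in common.
    """
    def rank(pair):
        a, b = pair
        exons_a, exons_b = aligning[a], non_aligning[b]
        distinct = len(exons_a ^ exons_b)  # exons found in only one transcript
        common = len(exons_a & exons_b)
        return (distinct, -common)
    return min(product(aligning, non_aligning), key=rank)

# Hypothetical gene reproducing the example in the text.
shared = {(100, 200), (300, 400), (500, 600), (700, 800)}
non_aligning = {"N1": shared | {(900, 1000)}}
aligning = {
    # 4 exons in common with N1, 2 not in common
    "A1": shared | {(900, 1050)},
    # 5 exons in common with N1, 3 not in common
    "A2": shared | {(900, 1000), (1100, 1200), (1300, 1400), (1500, 1600)},
}
best = best_isoform_pair(aligning, non_aligning)
```

Here the A1/N1 pair is chosen over A2/N1 because it has fewer exons not in common, even though A2 shares more exons with N1.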
Once a single optimal isoform pair has been
identified, protein sequence is obtained for each by identifying protein
IDs that correspond to the mRNA (Ensembl or NCBI) and if not available,
a predicted protein sequence is derived based on in silico translation. Although such
a protein sequence may not be valid, given that translation of the protein
may not occur, these sequences provide AltAnalyze the basis for making
conservative predictions of changes in protein size, sequence
and domain composition. Domain/protein features are obtained directly
from UniProt's sequence annotation features or from Ensembl's InterPro
sequence annotations (alignment e-value <1) (see section 5 data extraction
protocols). Any InterPro sequences with a description field or any UniProt
sequence annotation feature that is not of the type "CHAIN", "CONFLICT",
"VARIANT", "VARSPLIC" or "VAR_SEQ" are examined by AltAnalyze. To compare
domain or motif sequence composition differences, the protein sequence
that corresponds to the amino-acid start and stop positions of a domain
for each transcript is searched for in each of the compared protein
isoforms. If the length of a motif sequence is less than 6 amino-acids,
flanking sequence is included. If a domain is present in one but not
another isoform, that domain is stored as differentially present. To
identify differences in protein sequence (e.g., alternative-N-terminus,
C-terminus, truncation, coding sequence and protein length), the two
protein sequences are directly compared for shared sequence in the first
and last five residues and by comparison of the entire sequences. If the
N-terminal sequence is common to both isoforms but there is a reduction
of more than 50% in sequence length, the comparison is annotated
as truncated. All of these annotations are stored for each probe set
for import into AltAnalyze, with each new database build.
Direct Domain/Motif Genomic Alignment
The above strategy allows AltAnalyze to identify
predicted protein domains and motifs that are found in one isoform but
not the other (aligning to a probe set and not aligning). In addition
to these "inferred" domain predictions, that include protein domains/motifs
that do not necessarily overlap with the regulated probe set, AltAnalyze
includes a distinct set of annotations that only corresponds to domain/motif
protein sequence that directly overlaps with a probe set. Alignment
of probe sets to InterPro IDs is achieved by comparing probe set genomic
coordinates to InterPro genomic coordinates. InterPro genomic
start and end positions are determined by first identifying the relative
amino-acid positions of an InterPro region in an Ensembl protein, finding
which exon and at what position the InterPro region begins and ends,
and finally storing the genomic position of these relative exon coordinates.
These probe set-InterPro overlaps can be of
two types: 1) probe sets whose sequence is present in the domain coding
RNA sequence and 2) probe sets whose sequence is not present in the
domain coding RNA sequence. The second type of overlap typically occurs
in the gene introns. While these associations are typically meaningful,
false "indirect" associations are possible. To reduce the occurrence
of these false positives, any probe set that aligns to the UTR of an
Ensembl gene or that occurs in the first or last exon of an mRNA transcript
is excluded. These heuristics were chosen after looking at specific
examples that the authors considered to be potential false positives.
For junction arrays, the alternative exon genomic position rather than the
probe set was used.
Identifying microRNA Binding Sites associated with Alternative Exons
MicroRNA binding site (miR-BS) sequence is obtained from four different
microRNA databases (see section 6.5) and compared to identify miR-BSs in
common to or distinct to
different databases. Probe set sequences or exon sequences (aligning
to two reciprocal junctions) are obtained from Affymetrix (see section
6.7). Each miR-BS sequence is searched for within probe set or exon
sequence to identify a match. These relationships are stored with each
new database build and are used by AltAnalyze and DomainGraph. Exhaustive Protein Domain/Motif Analysis Very similar to the competitive
protein domain/motif analysis is the exhaustive protein comparison analysis.
This feature is currently not available by default in AltAnalyze and requires
replacement database files from AltAnalyze support. The purpose of these
files is to obtain the most conservative possible domain-level prediction results
from the competitive analysis. This is done by storing all pair-wise aligning
and non-aligning isoform comparisons (Figure 3.2) and then obtaining protein sequence for each transcript,
as described in earlier sections and storing all domain/motifs differentially
found between all possible competitive isoforms. From these stored results,
all possible competitive isoforms are themselves compared to find isoforms
that ideally show only differences in central regions of the protein (no
N-terminal or C-terminal differences), next for those that contain as few
domain-level predictions as possible and finally for those with the smallest
overall differences in protein length. For detailed algorithm information see
the IdentifyAltIsoforms module and
the function "compareProteinFeaturesForPairwiseComps".
Section 4 - Using R with AltAnalyze
4.1 Configuring R Although installation of R is not
required for any of the AltAnalyze analyses, for users who wish to use more
advanced statistics, it will be necessary. Currently, the only statistic that
requires installation of R is the linear regression method rlm.
rlm is a robust regression method that is part of the R package MASS.
This method is preferred by some over the alternative linear regression
method provided by default in AltAnalyze. Both methods produce very similar
statistics, with only a few probe sets differing between threshold parameters
out of hundreds of results. However,
since the rlm
method was used in the published linear regression analyses, users
wishing to replicate these results may wish to use this algorithm. To run the option
"linearegress-rlm" from the "Alternative Exon Analysis Parameters" window, you will
need to install a compatible version of R (only version 2.1 has been extensively
tested). Along with R, the user will need to install the R statistical
package MASS. R is accessed from Python using the Python
program Rpy, which is packaged with the compiled versions of AltAnalyze, but
not the source code (http://rpy.sourceforge.net/download.html).
With some compiled installations of AltAnalyze, you will receive a warning
when running "linearegress-rlm". This occurs
due to an Rpy compiling error that is specific to the OS executable version
of AltAnalyze; the option should work fine when Rpy is installed for the associated
version of R by the user (each version of Rpy corresponds to several versions
of R). Whether dealing with a compiled or
source version of AltAnalyze, if Python reports that it cannot find the
current version of R, the user may need to update the computer's Environment
Variables setting Path (Windows only). This is accessed by opening
Control Panels>System Properties>Advanced>Environment Variables,
selecting the Variable "Path" and entering the path location of R (for
example, ";C:\Program Files\R\rw2010;" - No
spaces before or after ;). Contact AltAnalyze support if problems
persist.
Section 5 - Software Infrastructure
5.1 Overview AltAnalyze consists of more than
30 modules and over 10,000 lines of code. The core modules for AltAnalyze
consist of the programs ExpressionBuilder and AltAnalyze, which can be used
in tandem or separately through the AltAnalyze GUI. The user will likely
never need to deal directly with these module names when running
AltAnalyze, but these distinct core modules are used for different analysis
functions (Figure 5.1).
The ExpressionBuilder component
builds constitutive gene expression summary files as well as filters the
probe set expression data prior to alternative-exon analysis. The AltAnalyze
module performs all of the alternative-exon analysis and MiDAS p-value
calculations.
5.2 ExpressionBuilder Module The ExpressionBuilder
program is principally designed to perform the following tasks:
1) Import user expression data from tab-delimited files.
2) (Splicing Arrays Only) Exclude probe sets where no samples
have a DABG p<user-threshold (only applicable when DABG p-values exist).
3) Organize your data according to biological groups and
comparisons (specified by the user from custom text files).
4) (Exon Arrays Only) Calculate gene transcription levels for
all Ensembl genes.
5) Export calculated gene expression values along with folds, t-test and
f-test p-values and gene annotations for all genes and all user indicated
comparisons.
6) (Splicing Arrays Only) Export all gene linked probe set expression
and DABG data for all pairwise comparisons for further filtering (next
step).
7) (Splicing Arrays Only)
The above tasks are performed
in order by the ExpressionBuilder module. Detection probabilities are
assessed at two steps (2 and 7). In step 2, DABG p-values are imported
so that a transcription intensity value is calculated only
from those constitutive probe sets (present in all or most transcripts)
that show detection above background, since some probe sets will have
weaker expression/hybridization profiles than others. If no probe sets
have a DABG p-value less than the default or user supplied threshold
(for at least one sample in your dataset), all selected probe sets will
be used to calculate expression (constitutive aligning only if default
is selected).
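The step 2 behaviour described above, including the fallback to all selected probe sets, can be sketched as follows. This is an illustration only: the function names, the 0.05 default threshold and the use of a simple mean as the gene-level value are assumptions, not the ExpressionBuilder code.

```python
def select_detected_probesets(dabg, threshold=0.05):
    """Return probe sets detected above background in at least one sample.

    dabg: dict mapping probe set ID -> list of DABG p-values (one per sample).
    The threshold default is a placeholder, not the AltAnalyze default.
    """
    detected = [ps for ps, pvals in dabg.items()
                if any(p < threshold for p in pvals)]
    # Fallback described in the text: if nothing passes, use every probe set.
    return detected if detected else list(dabg)


def gene_expression(expr, dabg, threshold=0.05):
    """Per-sample gene value from the retained probe sets.

    expr: dict mapping probe set ID -> list of log2 expression values.
    Averaging is an assumption of this sketch.
    """
    keep = select_detected_probesets(dabg, threshold)
    n_samples = len(next(iter(expr.values())))
    return [sum(expr[ps][i] for ps in keep) / len(keep)
            for i in range(n_samples)]
```

For a gene with one detected and one undetected constitutive probe set, only the detected probe set contributes; if the threshold is so strict that neither passes, both are averaged.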
In step 7, the probe
set DABG p-values are examined to include or exclude probe sets for
alternative splicing analysis. This step is important in minimizing
false positive splicing calls. False positive splicing calls can occur
when a probe set is expressed below detectable limits and results in
a transcription-corrected expression value that artificially appears
to be alternatively regulated. For the expression and DABG p-value files
output from APT, probe sets are initially filtered
to remove any probe set whose expression and DABG p-values fail to meet
the user defined thresholds in all biological groups examined, and
all pairwise comparison group files (expression and DABG) are exported for further
filtering. For probe sets used to determine gene expression levels (constitutive
or all core), the module FilterDABG is used to remove probe sets that
are expressed below user defined thresholds (expression and DABG) in
the two comparison groups for all pairwise comparisons. If a probe set
is not used to determine gene expression levels, then the same criterion
(mean DABG p-value and expression) must be met in at least one
of the two biological groups. These filters ensure that: 1) the gene
is "expressed" in at least one condition. The probe sets and expression
values passing these user defined filters are exported to a new file
that is ready to use for splicing analyses. This file can also be directly selected in
future AltAnalyze runs as input for analysis ("AltAnalyze filtered file"
- Figure 2.2). Runtime of ExpressionBuilder is
dependent on the number of conditions and array type being analyzed (10
minutes or more for Affymetrix Exon 1.0 ST arrays). If multiple comparisons are
present in a single expression file, input files for AltAnalyze will all be
generated at once and thus runtimes will be longer.
Note 1: While this pipeline is mainly for use with exon or junction
arrays, it is also compatible with a standard 3' Affymetrix microarray
dataset to calculate folds, t-test p-values and assign annotations to this
data. This is useful when you have
many comparisons in your dataset and you don't wish to manually calculate
these values.
Note 2: You do not need to run ExpressionBuilder if you have an
alternative way of building AltAnalyze input files. To do so, your file
headers for each array must have the name "group:sample_name", where your
group names are different for each group and the denominator group is listed
first and the numerator is listed second. Below the header line should only
be probe set IDs and log2 expression values.
5.3 AltAnalyze Module The
AltAnalyze module is the primary software used for all alternative exon
analyses. This software imports the filtered expression data and performs all
downstream statistical and functional analyses. This program will analyze any number of
input comparison files that are in the "AltExpression" results directory for
that array type. The main analysis steps in this program are:
1) Import exon or junction annotations, to determine which probe sets
to analyze, which are predicted constitutive probe sets and which correspond
to known AS or APS events.
2) Import the user expression data for the pair-wise comparison.
3) Store probe set level data for all probe sets corresponding to either
a constitutive exon or a splicing event, while storing the group membership
for each value.
4) Calculate a constitutive expression value for each gene and each sample
(used for the splicing score later on). OPTIONAL: If the user selected
a cut-off for constitutive fold changes allowed to look at alternative
exon regulation, then remove genes from the analysis that have a gene-expression
difference between the two groups > cut-off (up or down).
5) (Exon array only) OPTIONAL: Export input for the Affymetrix
Power Tools (APT) program to calculate a MiDAS p-value for each probe
set. If using this option.
6) Calculate a splicing score and t-test p-value from the probe set and
constitutive expression values. This calculation requires that splicing
ratios are calculated for each sample (exon/constitutive expression)
and then compared between groups. For exon arrays, the splicing index
method (SI) is calculated for each probe set. For junction arrays, ASPIRE
or Linear Regression is used with the pre-determined reciprocal junctions,
or alternatively scores are calculated for individual probe sets using the
SI method.
7) (Junction array only) OPTIONAL: Perform a
permutation analysis of the sample ASPIRE input values or Linear Regression
values to calculate a likelihood p-value for all possible sample combinations.
8) Retain only probe sets meeting the scoring thresholds for these statistics
(splicing score, splicing t-test p, permutation p, MiDAS p - see Section
3.2).
9) Import probe set-protein, probe set-domain and probe set-miRNA associations
from pre-built local flat files (see Section 6).
10) For the remaining probe
sets, import all pre-built protein domain and miR-BS
associations (see Section 3.3 for details). Import all probe set-domain
and -miR-BS associations for all genes to calculate an over-representation
z-score for all domains and miR-BS's along with a non-adjusted and adjusted
p-value. Export the resulting statistics and annotations to tab-delimited
files in the "AltResults/AlternativeOutput" folder in the user-defined
output directory.
11) (Junction array only) Import splicing and exon annotations for regulated exons corresponding
to each set of reciprocal probe sets (e.g. for E1-E3 compared to E1-E2,
E2 is the regulated exon). These annotation files are the same as those
for exon arrays, except that the probe set is replaced by the exon predicted
to be regulated by the reciprocal probe set-pair (see Section 6.2).
12) For protein domain and
miR-BS annotations, reformat the direction/inclusion status of the annotation. For
example, if a kinase domain is only found in a protein that aligns to
a probe set, but was down-regulated, then the annotation is listed as
(-) kinase domain, but if up-regulated is listed as (+) kinase domain.
13) Export the results from
this analysis to the "AltResults/AlternativeOutput" folder in the user-defined
output directory.
14) Summarize the probe set
or reciprocal junction data at the level of genes and export these results
(along with Gene Ontology/Pathway annotations).
15) Export overall statistics
from this run (e.g. number of genes regulated, splicing events).
16) (Junction array only) Combine and export the exported probe set and gene files for each
comparison analyzed, to compare and contrast differences.
Result File Types When
finished, AltAnalyze will have generated five files:
1) name-scoringmethod-exon-inclusion-GENE-results.txt
2) name-scoringmethod-exon-inclusion-results.txt
3) name-scoringmethod-ft-domain-zscores.txt
4) name-scoringmethod-miRNA-zscores.txt
5) name-scoringmethod-DomainGraph.txt
Here, "name" indicates the
comparison file name from ExpressionBuilder, composed of the species +
array_type + comparison_name (e.g. Hs_Exon_H9-CS-d40_vs_H9-ESC-d0),
"scoringmethod" is the type of algorithm used (e.g. SI) and the suffix
indicates the type of file. The annotation files used by
AltAnalyze are pre-built using other modules with this application (see
Section 6). Although the user should
not need to re-build these files on their own, advanced users may wish to
modify these tables manually or with programs provided (see Section 6.7). For protein-level functional
annotations (e.g., domain changes), this software assumes that if an exon is
up-regulated in a certain condition, that the protein domain is also
up-regulated and indicates it as such. For example, for exon array data, if a
probe set is up-regulated (relative to gene constitutive expression) in an
experimental group and this domain is found in the protein aligning to this
probe set, in the results file this will be annotated as (+) domain. If the
probe set were down-regulated (and aligns as indicated), this would be
annotated as (-) domain.
Section 6 - Building AltAnalyze Annotation Files
6.1 Splicing Annotations and Protein Associations A number of
annotation files are built prior to running AltAnalyze that are necessary
for:
1) Organizing exons and introns from discrete transcripts into consistently
ordered sequence blocks (UCSCImport.py and EnsemblImport.py).
2) Identifying which exons and introns align to alternative annotations
(alignToKnownAlt.py and EnsemblImport.py).
3) Identifying probe sets with likely constitutive annotations (ExonArrayAffyRules.py).
4) Identifying which probe sets align to which exons and introns (ExonArrayEnsemblRules.py).
5) Extracting protein sequences with functional annotations (ExtractUniProtFunctAnnot.py and EnsemblSQL.py).
6) Identifying probe sets that overlap with microRNA binding sites (MatchMiRTargetPredictions.py and ExonSeqModule.py).
7) Matching probe set genomic coordinates to cDNA exon coordinates and
identifying the optimal matching and non-matching mRNA/protein for each
probe set (IdentifyAltIsoforms.py
and ExonAnalyze_module.py).
These annotation files are necessary for all exon and junction array analyses. Junction array analyses further require:
8) Matching reciprocal junction probe sets to annotated exons or introns
(JunctionArray.py, EnsemblImport.py and JunctionArrayEnsemblRules.py), creating a file analogous
to (4) above.
9) Matching reciprocal junction probe set sequence to microRNA binding
sites (JunctionSeqSearch.py), creating a file analogous
to (6) above.
10) Matching probe set sequence
to cDNA sequences (mRNASeqAlign.py), prior to identification
of optimal matching and non-matching mRNA/proteins (IdentifyAltIsoforms.py).
With the creation/update of these
files, the user is ready to perform alternative exon analyses for the selected
species and array type. Since many of
these analyses utilize genomic coordinate alignment as opposed to direct
sequence comparison, it is important to ensure that all files were derived from
the same genomic assembly.
Note: Although all
necessary files are available with the AltAnalyze program at installation and
some files can be updated automatically from the AltAnalyze server, users can
use these programs to adjust the content of these files, use the output for
alternative analyses, or create custom databases for currently unsupported
species.
6.2 Building Ensembl-Probe Set Associations
Exon and Gene Arrays
Affymetrix Exon 1.0 ST arrays are provided with probe
set sequence, transcript cluster and probe set genomic location from
Affymetrix. Each of these
annotations is used by AltAnalyze to provide gene, transcript and exon
associations. Although transcript clusters represent putative genes, the
AltAnalyze pipeline derives new gene associations to Ensembl genes, so that
each probe set aligns to a single gene from a single gene database. This
annotation schema further allows AltAnalyze to determine which probe sets align
to defined exon regions (with external exon annotations), introns, and
untranslated regions (UTR).
To begin this process, Ensembl exons (each with a unique ID) and their genomic
location and transcript associations are downloaded for the most recent genomic
assembly using the AltAnalyze EnsemblSQL.py module,
which parses various files on the Ensembl FTP SQL database server to assemble
the required fields. This file is
saved to the directory “AltDatabase/ensembl/*species*/” with the filename
“*species*_Ensembl_transcript-annotations.txt”. Since Ensembl transcript associations are typically
conservative, transcript associations are further augmented with
exon-transcript structure data from the UCSC genome database, from the file
“all_mrna.txt” (Downloads/*species*/Annotation database/all_mrna.txt.gz). This
file encodes genomic coordinates for exons in each transcript similar to
Ensembl. Transcript genomic
coordinates and genomic strand data from UCSC are matched to Ensembl gene
coordinates to identify transcripts that specifically overlap with Ensembl genes with
the Python program UCSCImport.py. Unique
transcripts, with distinct exon structures from Ensembl, are exported to the
folder “AltDatabase/ucsc/*species*” as “*species*_UCSC_transcript_structure_filtered_mrna.txt”,
with the same structure as the Ensembl_transcript-annotations file.
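The coordinate and strand matching described above can be illustrated with a short sketch. It is not the UCSCImport.py code: the function names and tuple layouts are hypothetical, and treating "specifically overlap" as "overlaps exactly one Ensembl gene on the same strand" is an assumption.

```python
def match_transcripts_to_genes(transcripts, genes):
    """Assign each UCSC transcript to the single Ensembl gene it overlaps.

    transcripts, genes: dicts mapping ID -> (chromosome, strand, start, end).
    Transcripts overlapping zero or several genes are left unassigned
    (an assumption of this sketch).
    """
    matches = {}
    for tid, (t_chr, t_strand, t_start, t_end) in transcripts.items():
        hits = [gid for gid, (g_chr, g_strand, g_start, g_end) in genes.items()
                if g_chr == t_chr and g_strand == t_strand
                and t_start <= g_end and t_end >= g_start]
        if len(hits) == 1:
            matches[tid] = hits[0]
    return matches


def novel_structures(ucsc_structures, ensembl_structures):
    """Keep UCSC transcripts whose exon structure is absent from Ensembl.

    Both arguments map transcript ID -> list of (exon_start, exon_end).
    """
    known = {tuple(exons) for exons in ensembl_structures.values()}
    return [tid for tid, exons in ucsc_structures.items()
            if tuple(exons) not in known]
```

A transcript on another chromosome or the opposite strand is simply not matched, and a UCSC transcript whose exon coordinates duplicate an Ensembl transcript is dropped before export.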
Once both transcript-structure files have been saved to the appropriate directory, ExonArrayAffyRules.py calls
the program EnsemblImport.py to perform the following steps:
1) Imports these
two files, stores exon-transcript associations and identifies exon regions to exclude
from further annotations. These are exons that signify intron-retention
(overlapping with two adjacent spliced exons) and thus are excluded as valid
exon IDs. These regions are also
flagged as intron-retention regions for later probe set annotations.
2) Assembles exons
from all transcripts for a gene into discrete exon clusters. If an exon cluster contains multiple
exons with distinct boundaries, the exon cluster is divided into regions that
represent putative alternative splice sites (Text S1 Figure 1). These splice
sites are explicitly annotated downstream. Each exon cluster is ordered and
numbered from the first to the last exon cluster (e.g., E1, E2, E3, E4, E5),
and is composed of one or more regions. These exon cluster/region coordinates and
annotations are stored in memory for downstream probe set alignment in the
module ExonArrayAffyRules.py.
3) Identifies
alternative splicing events (cassette-exon inclusion, alternative 3’ or 5’
splice sites, alternative N-terminal and C-terminal exons, and combinations
thereof) for all Ensembl and UCSC transcripts by comparing exon cluster and
region numbers for all pairs of exons in each transcript (see proceeding
sections for more information). Alternative exons/exon-regions and
corresponding exon-junctions are stored in memory for later probe set
annotation and exported to summary files for creation of databases for the
Cytoscape exon structure viewer, SubgeneViewer (currently in development).
4) (New to AltAnalyze 1.14) Constitutive exon
regions are defined by counting the number of unique Ensembl and UCSC mRNA
transcripts (based on structure) associated with each unique exon region. If
multiple exon regions have the same number of transcript associations, then
these are grouped. The grouped exon regions that contain the most transcripts
for the gene are defined as constitutive exon regions. If only one exon region
is considered constitutive then the grouped exon regions with the second
highest ranking (based on number of associated transcripts) are included as
constitutive (when there are at least 3 ranks) (Figure 3.1).
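The constitutive exon region rule in step 4 can be sketched as below. This is a minimal reading of the text rather than the EnsemblImport.py implementation; the function name and input layout are hypothetical, and transcripts are assumed to have already been de-duplicated by structure.

```python
from collections import defaultdict


def constitutive_regions(region_transcripts):
    """Pick constitutive exon regions by transcript-association count.

    region_transcripts: dict mapping exon region ID -> set of transcript IDs
    (one ID per unique transcript structure; an assumption of this sketch).
    """
    by_count = defaultdict(list)
    for region, transcripts in region_transcripts.items():
        by_count[len(transcripts)].append(region)   # group regions by count
    ranks = sorted(by_count, reverse=True)          # distinct counts, highest first
    if not ranks:
        return []
    chosen = list(by_count[ranks[0]])
    # A lone top-ranked region is extended with the second rank,
    # but only when at least three ranks exist.
    if len(chosen) == 1 and len(ranks) >= 3:
        chosen += by_count[ranks[1]]
    return sorted(chosen)
```

For example, if E1.1 is found in three transcripts, E2.1 and E3.1 in two each and E4.1 in one, the single top-ranked region E1.1 is extended with E2.1 and E3.1.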
Upon completion, ExonArrayAffyRules.py:
1) Imports Affymetrix Exon 1.0 ST probe set genomic locations and transcript cluster
annotations from the Affymetrix probe set.csv annotation file (e.g.,
HuEx-1_0-st-v2.na23.hg18.probeset.csv). Note: prior to AltAnalyze 1.14, mRNA
alignment count numbers were also downloaded from this file to deduce constitutive
probe sets. Although transcript clusters will be disregarded at the end of the
analysis, these are used initially to group probe sets.
2) Transcript cluster genomic locations are matched to Ensembl gene genomic locations (gene start and stop) to
identify single transcript clusters that align to only one Ensembl gene for the
respective genomic strand. For transcript clusters aligning to more than one
Ensembl gene, coordinates for each individual probe set are matched to aligning
Ensembl genes, to identify unique matches. If multiple transcript clusters
align to a single Ensembl gene, only probe sets with an Affymetrix
annotation corresponding to that Ensembl gene, from the probeset.csv file, are
stored as proper relationships. This ensures that if other genes, not annotated
by Ensembl, exist in the same genomic interval, they will not be
inaccurately combined with a nearby Ensembl gene. If multiple associations or other inconsistencies are found,
probe set coordinates are matched directly to the exon cluster locations
derived in EnsemblImport.py.
3) (The following method is deprecated) - For AltAnalyze version 1.13 and below,
once unique probe set-to-Ensembl gene associations have been defined,
constitutive probe sets are identified using the Affymetrix mRNA counts
provided in the program ExonArrayEnsemblRules.py. The mRNA counts are distributed based
on the types of mRNAs they align to (full-length, Ensembl, and EST), where the
probe sets with the largest number of high quality mRNA associations are chosen
as constitutive. Probe sets for a given gene are ranked based on the number
of associated A) Ensembl transcripts, B) full-length mRNAs and C) ESTs, in descending
order, where multiple associations are required for each annotation type. If
all probe sets have the same number of Ensembl and full-length transcript
associations, then the number of aligning ESTs is compared. If no difference in these mRNA
assignments exists, no constitutive probe sets are annotated.
4) Each probe set is then aligned to exon clusters, regions, retained
introns, constitutive exon regions and splicing annotations for that gene. In
addition to splicing annotations from EnsemblImport.py, splicing annotations
from the UCSC genome annotation file “knownAlt.txt” (found in the same server
directory at UCSC as “all_mrna.txt”), processed using the program alignToKnownAlt.py, are aligned to these
probe sets. If a probe set does not align to an EnsemblImport.py defined exon or intron
and is upstream of the first exon or downstream of the last exon, the probe set
is assigned a UTR annotation (e.g., U1.1). All aligning probe sets are
annotated based on the exon cluster number and the relative position of that
probe set in the exon cluster, based on relative 5’ genomic start (e.g., E2.1).
This can mean that probe set E2.1 actually aligns to the second exon cluster in
that gene in any of the exon regions (not necessarily the first exon region),
if it is the most 5’ aligning.
5) These probe set annotations are exported to the directory
“AltDatabase/*species*/exon” with the filename “*species*_Ensembl_probe sets.txt”.
Exon block and region annotations for each probe set are designated in the
AltAnalyze result file.
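The exon cluster numbering and probe set labelling described in the steps above can be sketched as follows. This is a simplified, plus-strand-only illustration, not the ExonArrayAffyRules.py code. The function names are hypothetical, and the intron labels ("I1") and the numbering of the downstream UTR block are assumptions; the text only gives E2.1 and U1.1 as examples.

```python
def cluster_exons(exons):
    """Merge overlapping (start, end) exons into ordered exon clusters."""
    clusters = []
    for start, end in sorted(exons):
        if clusters and start <= clusters[-1][1]:
            clusters[-1][1] = max(clusters[-1][1], end)   # extend the cluster
        else:
            clusters.append([start, end])                 # start a new cluster
    return [tuple(c) for c in clusters]


def block_for(start, end, clusters):
    """Name the block (E = exon cluster, I = intron, U = UTR) a probe set hits."""
    for number, (c_start, c_end) in enumerate(clusters, 1):
        if start <= c_end and end >= c_start:
            return "E%d" % number
    if end < clusters[0][0]:
        return "U1"                       # upstream of the first exon
    if start > clusters[-1][1]:
        return "U%d" % len(clusters)      # downstream numbering is assumed
    upstream = sum(1 for _, c_end in clusters if c_end < start)
    return "I%d" % upstream               # intron naming is assumed


def annotate_probesets(probesets, clusters):
    """probesets: dict ID -> (start, end). Returns labels such as 'E2.1'."""
    by_block = {}
    for ps_id, (start, end) in probesets.items():
        block = block_for(start, end, clusters)
        by_block.setdefault(block, []).append((start, ps_id))
    labels = {}
    for block, members in by_block.items():
        # Rank probe sets within a block by their 5' genomic start.
        for rank, (_, ps_id) in enumerate(sorted(members), 1):
            labels[ps_id] = "%s.%d" % (block, rank)
    return labels
```

Two probe sets falling in the first exon cluster are labelled E1.1 and E1.2 in order of genomic start, regardless of which exon region within the cluster they align to.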
For the exon-junction array AltMouseA, the same process is applied to the
highlighted exon(s) from all pre-determined reciprocal probe sets, exported
by the program ExonAnnotate_module.py. A highlighted exon is an exon that is considered
to be regulated as the result of two alternative junctions. For example,
if examining the exon-junctions E1-E2 and E1-E3, E2 would be the highlighted
exon. Alternatively, for the mutually-exclusive splicing event E2-E4
and E1-E3, E2 and E3 would be considered to be the highlighted exons.
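One way to reproduce the two examples above is to treat an exon as highlighted when it belongs to one junction and lies strictly inside the span skipped by the other. This rule is inferred from the examples only and is an assumption, not necessarily the logic used by ExonAnnotate_module.py; the function name is hypothetical.

```python
def highlighted_exons(junction_a, junction_b):
    """Return the highlighted exon(s) for a pair of reciprocal junctions.

    Each junction is a (5' exon number, 3' exon number) tuple,
    e.g. (1, 3) for the junction E1-E3.
    """
    found = set()
    for junction, other in ((junction_a, junction_b), (junction_b, junction_a)):
        low, high = other
        for exon in junction:
            # The exon is skipped by the other junction if it falls between
            # that junction's two exons.
            if low < exon < high:
                found.add(exon)
    return ["E%d" % n for n in sorted(found)]
```

Calling `highlighted_exons((1, 2), (1, 3))` gives `["E2"]`, and `highlighted_exons((2, 4), (1, 3))` gives `["E2", "E3"]`, matching the two cases in the text.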
To obtain the genomic locations of these exons, sequences for each are
obtained from a static build of the mouse AltMerge program (March 2002)
(ExonAnalyze_module.py) and searched for in FASTA formatted sequence obtained
from BioMart for all Ensembl genes with an additional 2 kb upstream
and downstream sequence (JunctionArray.py and EnsemblImport.py). This allows for the export of an exon-coordinate
file analogous to the exon probeset.csv file. The main difference in
this file is that AltMouseA gene to Ensembl ID associations are obtained
by comparing gene symbol names and external GenBank accession numbers
in common, as opposed to coordinate comparisons. The resulting highlighted exon file is
named "Mm_Ensembl_AltMouse_probe sets.txt" and is saved to "AltDatabase/Mm/AltMouse",
with the same structure as its exon array analogue.
6.3 Extracting UniProt Protein Domain Annotations Overview The UniProt protein database is a
highly curated protein database that provides annotations for whole proteins
as well as protein segments (protein features or domains). These protein
feature annotations correspond to specific amino acid (AA) sequences that are
annotated using a common vocabulary, including a class (feature key) and
detailed description field. An example
is the TCF7L1 protein (http://www.uniprot.org/uniprot/Q9HCS4),
which has five annotated feature regions, ranging in size from 7 to 210 AA.
One of these regions has the feature key annotation "DNA
binding" and the description "HMG box". To utilize
these annotations in AltAnalyze, these functional tags are extracted along
with full protein sequence, and external
annotations for each protein (e.g., Ensembl gene) from the "uniprot_sprot_*taxonomy*.dat"
file using the ExtractUniProtFunctAnnot.py program. FTP file locations for the UniProt database file
can be found in the file "Config/Default-file.csv"
for each supported species. To improve Ensembl-UniProt annotations, these
relationships are also downloaded from BioMart and stored in the folder
"AltDatabase/uniprot/*species*" as "*species*_Ensembl-UniProt.txt", which are
gathered at ExtractUniProtFunctAnnot.py runtime to include in the UniProt sequence annotation
file. These files are saved to "AltDatabase/uniprot/*species*" as "uniprot_feature_file.txt" and "uniprot_sequence.txt".
6.4 Extracting Ensembl Protein Domain Annotations Overview In addition to protein
domains/features extracted from UniProt, protein features associated with
specific Ensembl transcripts are extracted from the Ensembl database. One
advantage of these annotations over UniProt is that alternative exon changes that alter the sequence of a feature but not its
inclusion will be reported as a gain and loss of the same feature, as opposed
to just one with UniProt. This is
because protein feature annotations in UniProt typically exist for only one
isoform of a gene and thus, alteration of this
feature in any way will result in this feature being called regulated.
Although an Ensembl annotated feature with a reported gain and loss can be
considered not changed at all, functional differences can exist due to a minor
feature sequence change that would not be predicted
if the gain and loss of the feature were not reported. Three separate annotation
files are built to provide feature sequences and descriptions: the "Ensembl_Protein", "Protein", and "ProteinFeatures" files ("AltDatabase/ensembl/*species*"). The "ProteinFeatures" file contains relative AA positions for protein features for
all Ensembl protein IDs, genomic start and end locations along with InterPro
annotations and IDs. Only those
InterPro domains/features with an alignment e-score < 1 are stored for
alignment to regulated exons. The "Protein" file
contains AA sequences for each Ensembl protein. The "Ensembl_Protein" file provides Ensembl gene, transcript and protein ID
associations. Data for these files are downloaded and
extracted using the previously mentioned EnsemblSQL.py module. The feature annotation source in these
files is InterPro, which provides a description similar to UniProt. As an
example, see: http://ensembl.genomics.org.cn/Homo_sapiens/protview?db=core;peptide=ENSP00000282111,
which has similar feature descriptions to UniProt for the same gene, TCF7L1.
6.5 Extracting microRNA Binding Annotations Overview To examine the potential gain
or loss of microRNA binding sites as the direct result of exon-inclusion or
exclusion, AltAnalyze requires putative microRNA sequences from multiple
prediction algorithms. These binding site annotations are extracted from the following flat files:
Ensembl gene to microRNA name
and sequence are stored for all prediction algorithm flat files and directly
compared to find genes with one or more lines of microRNA binding site
evidence using the program MatchMiRTargetPredictions.py.
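The comparison across prediction algorithms can be pictured with a small sketch. It is not the MatchMiRTargetPredictions.py code: the function names, the record layout and the algorithm labels "algoA" and "algoB" are hypothetical placeholders.

```python
from collections import defaultdict


def combine_predictions(predictions):
    """Collect binding site evidence per (gene, microRNA) pair.

    predictions: dict mapping algorithm name -> iterable of
    (ensembl_gene, mirna_name, site_sequence) records.
    Returns {(gene, mirna): {algorithm: site_sequence}}.
    """
    evidence = defaultdict(dict)
    for algorithm, records in predictions.items():
        for gene, mirna, site_seq in records:
            evidence[(gene, mirna)][algorithm] = site_seq
    return dict(evidence)


def with_min_evidence(evidence, min_sources=1):
    """List (gene, mirna) pairs supported by at least min_sources algorithms."""
    return sorted(key for key, sources in evidence.items()
                  if len(sources) >= min_sources)
```

Raising `min_sources` from 1 to 2 narrows the list from pairs with any line of evidence to pairs supported by more than one algorithm.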
The flat file produced from this program ("combined_gene-target-sequences.txt") is used by the program ExonSeqSearch.py to search for these putative microRNA binding site
sequences among all probe sets from the "*species*_Ensembl_probe set.txt"
file built by ExonArrayEnsemblRules.py and probe set sequence from the Affymetrix 1.0 ST probe
set fasta sequence file (Affymetrix) or the reciprocal junction highlighted
exon sequence file (see section 6.2). Two resulting files, one with any
binding site predictions and another required to
have evidence from at least two algorithms, are saved to
"AltDatabase/*species*/*array_type*/" as "*species*_probe
set_microRNAs_any.txt" and "*species*_probe set_microRNAs_multiple.txt", respectively.
6.6 Inferring Protein-Probe Set Associations Overview To obtain associations between specific probe sets and proteins, the programs IdentifyAltIsoforms.py