EASE: Annotation Over-representation Analysis

Parameter Information


File Updates and Configuration

Select EASE File System

This button selects a local directory to be used as the source of annotation files for EASE analysis. Multiple file systems can be present to support a variety of array types. Selecting a file system directs file choosers to that area; however, files may still be selected from outside the selected base file system if appropriate. Note that the selected directory should be the directory that contains the "Data" directory. In MeV's default data directory this would be the 'ease' directory.

Update EASE File System

This button downloads EASE annotation file systems for a selected species and clone set. A selection dialog allows species selection from a variety of plant and animal species. A list of many commercially available arrays for the selected species is also presented for selection. After species and array selection, a dialog is presented to select the destination directory for the EASE file system. Zip files are then downloaded and automatically extracted into the destination directory. The new base directory is labeled with "ease_" and the selected array name. This new file system can then be selected as the default system (see the "Select EASE File System" option above).

Mode Selection

Cluster Analysis

This mode performs annotation analysis on a selected subset (sample list or cluster) of the full data set loaded in MeV. The output is a list of biological 'themes' represented in the cluster and a statistic reporting the probability that a particular theme is over-represented in the cluster relative to its representation in the entire data set. The resulting table will initially be sorted by this statistic.

Annotation Survey

The survey mode simply produces a list of biological themes that are represented in the data currently loaded in the viewer from which EASE is launched. Note that this could be a subset of the total slide data. To survey all annotation on the slide, use a viewer with all of the slide's data loaded. The initial ordering of the output table is based on the prevalence of a theme in the data set (hit count). This mode can be used to cluster genes based on biological themes. The clusters can then be stored and marked (colored) for tracking during cluster analysis.

Parameter Pages

Several parameter input pages are available:

Population and Cluster Selection

This section permits selection of a cluster for analysis and defines the population to which the cluster should be compared. The population selection panel, on top, allows the user to specify whether the population set of gene indices should be loaded from a file or taken as all indices loaded in the current Multiple Experiment Viewer. Note that if the current viewer does not contain all population indices, it is important to use the default option of a population file.

A population file lists the indices from which the cluster was segregated by statistical or other means. The file consists of a single column with one index per line. The population often contains an index for each element on the array; however, there are circumstances where one might wish to disregard particular spots, such as internal controls.
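As a rough illustration of the one-index-per-line format, the following Python sketch reads a population file into a set of indices. The function name and the sample indices are hypothetical and are not part of MeV; skipping blank lines is also an assumption.

```python
import io

def read_population(handle):
    """Return the set of indices found in a one-index-per-line file."""
    indices = set()
    for line in handle:
        index = line.strip()
        if index:  # assumption: blank lines are ignored
            indices.add(index)
    return indices

# Hypothetical file content, held in memory so the sketch is self-contained.
sample = io.StringIO("AA001\nAA002\n\nAA003\n")
population = read_population(sample)
print(sorted(population))
```

Spots to be disregarded, such as internal controls, would simply be left out of the file.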

The cluster panel, below the population panel, displays gene clusters currently stored in MeV's cluster repository. If no clusters have been saved, a blank browser page or empty table will be displayed and the Cluster Analysis mode option will be disabled. Selecting a row in the cluster table will display the cluster in the expression graph area of the browser. EASE cluster analysis will operate on the selected cluster.

Annotation Parameters Page

This page has three major parts described below.

MeV Annotation Key

This area contains a drop-down list of the available annotation types that can be used to identify genes. Generally it's best to use an index or accession which 'uniquely' identifies the spotted material.

Annotation Conversion File

This optional file provides the mapping from your annotation key (above) to the index used to map to biological themes (GO terms, KEGG pathways, etc.). If your annotation key type is the one used in the linking file (below) then this conversion (mapping) is not needed.

Gene Annotation / Gene Ontology Linking Files

This section allows one to specify one or more annotation files. These files contain gene indices paired with biological themes such as GO terms.

File Selection Scenario

One possible example of the file linking structure could be:
[GenBank#]-->[GenBank#]:[locus_link_id]-->[locus_link_id]:[go_term]
This shows the progression from 'Annotation Key', to conversion file (converting GenBank# to locus_link_id), to final linking with GO terms. Keep in mind that although shown with a single arrow, in general one gene index will map to many GO terms (or other biological theme or pathway categories).
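The chain above can be sketched in Python with two lookups. This is only an illustration: the accessions, identifiers, term assignments, and the function name are hypothetical, and falling back to the key itself when no conversion entry exists is an assumption standing in for the case where no conversion file is needed.

```python
# Hypothetical conversion file content: GenBank# -> locus_link_id.
conversion = {"AA001": "101", "AA002": "102"}

# Hypothetical linking file content: locus_link_id -> GO terms (one to many).
links = {
    "101": ["GO:0006915", "GO:0008283"],
    "102": ["GO:0006915"],
}

def themes_for(key, conversion, links):
    """Follow annotation key -> converted index -> biological themes."""
    # Assumption: a key with no conversion entry is already in linking-file form.
    index = conversion.get(key, key)
    return links.get(index, [])

print(themes_for("AA001", conversion, links))
```

Note that a single annotation key returns a list of themes, reflecting the one-to-many mapping described above.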


Statistical Parameters Page

Several sections on this page are used to specify the reported statistic and the result-trimming parameters.

Reported Statistic

Fisher's Exact Probability

The Fisher's Exact Probability reports the probability that a biological theme is over-represented in the cluster of interest relative to the representation of that theme in the total gene population. For example, suppose that one has a gene list of 50 genes from a population of 10,000 genes. Now suppose that 10 of the 50 genes were related to pathway "A" but only 13 genes in the total population were associated with pathway "A". This scenario would yield a low probability that the observed number of hits (occurrences of pathway "A") within the small sample could be due to chance alone. This statistic is based on the hypergeometric distribution and has benefits over chi-square in that it is appropriate for finite populations. The reference cited for EASE describes this statistic at length.
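The pathway "A" example can be worked through with a short Python sketch of a hypergeometric upper tail. This is a minimal illustration that assumes a one-sided test (the probability of seeing at least the observed number of hits); the function name is hypothetical and the code is not taken from the EASE implementation.

```python
from math import comb

def fisher_upper_tail(list_hits, list_size, pop_hits, pop_size):
    """P(X >= list_hits) when list_size genes are drawn without replacement."""
    max_hits = min(list_size, pop_hits)
    # Count the ways to draw k annotated and (list_size - k) other genes.
    ways = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return ways / comb(pop_size, list_size)

# 10 of 50 list genes hit pathway "A"; 13 of 10,000 population genes do.
p = fisher_upper_tail(10, 50, 13, 10000)
print(p)
```

For these counts the result is a very small probability, consistent with the statement that such an outcome is unlikely to be due to chance alone.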

EASE Score

The EASE Score is essentially a jackknifed Fisher's Exact Probability: it is calculated as the Fisher's Exact Probability after one occurrence (one list hit for the term) has been removed.
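A minimal Python sketch of this jackknife follows. It assumes that only the list hit count is reduced by one while the list size and population counts are left unchanged, and that the underlying probability is a one-sided hypergeometric tail; both points are assumptions, and the function names are hypothetical.

```python
from math import comb

def upper_tail(hits, list_size, pop_hits, pop_size):
    """P(X >= hits) for a hypergeometric draw of list_size genes."""
    ways = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
        for k in range(hits, min(list_size, pop_hits) + 1)
    )
    return ways / comb(pop_size, list_size)

def ease_score(list_hits, list_size, pop_hits, pop_size):
    """Upper tail recomputed with one list hit removed (assumed jackknife)."""
    return upper_tail(max(list_hits - 1, 0), list_size, pop_hits, pop_size)

print(upper_tail(10, 50, 13, 10000), ease_score(10, 50, 13, 10000))
print(ease_score(1, 50, 13, 10000))
```

Under these assumptions the score is always more conservative than the unmodified probability, and a term supported by a single list hit receives a score of 1.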

Multiplicity Corrections

Several p-value corrections can be applied to help correct for the chance of arriving at a significant result when performing multiple tests.

Bonferroni Correction

This correction simply multiplies the statistic by the number of results generated. This is the most stringent correction of the three options.
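As a small illustration, the correction can be sketched in Python as follows. Capping the corrected value at 1 is an assumption added here so that the result remains a valid probability; the function name is hypothetical.

```python
def bonferroni(p_values):
    """Multiply each value by the number of results, capped at 1.0."""
    n = len(p_values)
    return [min(p * n, 1.0) for p in p_values]  # the cap is an assumption

print(bonferroni([0.01, 0.2, 0.5]))
```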

Bonferroni Step Down Correction

This modified Bonferroni correction ranks the results by the statistic in ascending order. Each value is multiplied by (n-rank), where n is the number of results. In the case of a tie, where two results have the same probability, the rank is held constant until an element with a higher probability value occurs. The rank is then adjusted for the number of tied elements over which it was held constant.
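One possible reading of this procedure is sketched below in Python. It assumes that ranks start at 0, so that the smallest value is multiplied by n, and that corrected values are capped at 1; neither detail is stated above, and the function name is hypothetical.

```python
def bonferroni_step_down(p_values):
    """Multiply each value by (n - rank), with tied values sharing a rank."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # ascending order
    corrected = [0.0] * n
    rank = 0  # assumption: the smallest value has rank 0
    for position, i in enumerate(order):
        # Ties keep the earlier rank; a higher value jumps to its position,
        # which accounts for the tied elements that came before it.
        if position > 0 and p_values[i] > p_values[order[position - 1]]:
            rank = position
        corrected[i] = min(p_values[i] * (n - rank), 1.0)  # cap is assumed
    return corrected

print(bonferroni_step_down([0.02, 0.04, 0.01, 0.02]))
```

In this example the two tied values of 0.02 share rank 1, and the value 0.04 then takes rank 3 rather than rank 2.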

Sidak Method

This correction uses the following formula, where v' is the corrected value and k is the rank of the result in terms of the original statistic value. Ties in rank are handled as described for the step down Bonferroni correction.
v' = 1 - (1 - v)^k
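The formula can be evaluated with a one-line Python function. This sketch reads k as an exponent, as in the usual Sidak form, and takes k as an input rather than deriving it from a ranked result list; the function name is hypothetical.

```python
def sidak(v, k):
    """Return v' = 1 - (1 - v)^k for an original value v and exponent k."""
    return 1.0 - (1.0 - v) ** k

print(sidak(0.01, 5))
```

For v = 0.01 and k = 5 the corrected value is about 0.049, slightly below the 0.05 that multiplying by 5 would give.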

Resampling Probability Analysis

The resampling option performs a number of analysis iterations in which random gene lists of the original cluster size are selected from the population without replacement. The end result reported for a particular term is the probability of obtaining the determined significance level by chance.
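The idea can be illustrated with a simplified, single-term Python sketch. It assumes that, for one term and a fixed list size, reaching the observed significance is equivalent to drawing at least the observed number of hits; the function name, parameters, and fixed seed are hypothetical and not taken from the EASE implementation.

```python
import random

def resampling_probability(population, term_genes, observed_hits,
                           cluster_size, iterations=1000, seed=0):
    """Fraction of random lists with at least observed_hits hits for a term."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pool = sorted(population)  # random.sample requires a sequence
    at_least = 0
    for _ in range(iterations):
        draw = rng.sample(pool, cluster_size)  # selection without replacement
        hits = sum(1 for gene in draw if gene in term_genes)
        if hits >= observed_hits:
            at_least += 1
    return at_least / iterations

population = list(range(1000))
term_genes = set(range(13))  # hypothetical genes annotated with the term
print(resampling_probability(population, term_genes, 3, 50))
```

A full analysis would repeat this for every term, which is why the number of iterations has a direct effect on run time.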

Trim Parameters

The trim parameters can be applied to filter analysis results based on the number of hits or the fraction of genes in the cluster that are represented by an annotation term. Sometimes a term can be found significant but does not represent a large segment of the cluster of interest. These options can be applied to be certain that a minimum number of genes in the cluster fall under that particular annotation class. This feature should be used with caution so that biological themes represented by very few genes are not excluded.
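The two trimming criteria can be sketched in Python as a simple filter. The function name, parameter names, and the representation of results as (term, hits) pairs are hypothetical choices made for this illustration.

```python
def trim_results(results, cluster_size, min_hits=None, min_fraction=None):
    """Keep (term, hits) pairs that meet the optional hit thresholds."""
    kept = []
    for term, hits in results:
        if min_hits is not None and hits < min_hits:
            continue  # too few genes in the cluster carry this term
        if min_fraction is not None and hits / cluster_size < min_fraction:
            continue  # term covers too small a fraction of the cluster
        kept.append((term, hits))
    return kept

results = [("A", 10), ("B", 2), ("C", 5)]
print(trim_results(results, cluster_size=50, min_hits=3))
print(trim_results(results, cluster_size=50, min_fraction=0.15))
```

With a cluster of 50 genes, a minimum of 3 hits removes term "B", while a minimum fraction of 0.15 also removes term "C", which shows how quickly small themes can be excluded.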