Overview of the FGDP Core, Version 1.0

The FGDP -- the Functional Genomics Data Pipeline -- is a flexible system for gene expression analysis built from interchangeable analysis 'modules', each of which performs one specific role. Most modules have several parameters; setting those parameters yields a more specific 'method'.
The full pipeline has five sections and four module types, which always run in the same order. Because the modules are interchangeable, the pipeline has multiple entry points, and it can also stop after any step completes.

1: Quantitation. The process of taking an image of a microarray and condensing it into a sequence of intensity numbers, one for each gene and its background. Because the pipeline supports only fully automated analyses, it is somewhat limited here: we have only one fully automated quantifying system, WaveRead.

-- Most users we have encountered so far are using Imagene for Quantitation, and they enter the pipeline here.

2: Normalization. This includes background subtraction and accounting for dye effects, saturation effects, microarray effects, and in general any effect that concerns only one microarray at a time. A typical method (which all of our currently implemented methods use) is to fit a curve to an MA plot and subtract the fitted curve from the data so that the ratios are close to 1 across all intensity regimes.
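
As a rough illustration of the curve-fit approach described above, here is a minimal sketch. It assumes two-channel data (red and green intensities) and substitutes a straight-line least-squares fit for whatever curve the real modules use; the function names are hypothetical and are not part of the FGDP. On the log scale, a ratio close to 1 corresponds to M close to 0.

```python
import math

def ma_values(red, green):
    """Return (M, A): M = log2(R/G), A = average log2 intensity."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def fit_line(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def normalize_ma(red, green):
    """Subtract the fitted trend from M so log ratios centre on 0.

    Assumption: a straight line stands in for the curve fit.
    """
    m, a = ma_values(red, green)
    slope, intercept = fit_line(a, m)
    return [mi - (slope * ai + intercept) for mi, ai in zip(m, a)]
```

A constant dye bias (for example, the red channel reading twice the green at every intensity) is removed entirely by this fit; an intensity-dependent bias would need a more flexible curve.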

-- Users of Affymetrix have had the previous two steps taken care of for them already, so they enter here.

3: Statistics. Within each condition, condensing the many values for each gene into a single standard value for that gene in that condition. A typical method is to take the average, though far more sophisticated methods are available, such as SAM or VERA.
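
The averaging method mentioned above can be sketched as follows. The nested-dictionary layout and the function name are assumptions made for illustration, not the FGDP's actual data format.

```python
from statistics import mean

def summarize(replicates):
    """Collapse replicate values into one mean per gene per condition.

    replicates: {condition: {gene: [value, value, ...]}}  (assumed layout)
    returns:    {condition: {gene: mean_value}}
    """
    return {
        condition: {gene: mean(values) for gene, values in genes.items()}
        for condition, genes in replicates.items()
    }
```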

4: Filtering. This step is relatively amorphous: it covers the calculations that should be done on all data before trying to break them up into patterns. Mainly, it cuts out data that does not change or is of low quality. The data can also be re-scaled or altered in various ways.
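
One simple way to cut out data that does not change is to drop genes whose values span too small a range across conditions. The sketch below assumes that criterion; the threshold, its default value, and the function name are hypothetical.

```python
def filter_flat_genes(profiles, min_range=1.0):
    """Keep only genes whose expression varies by at least min_range.

    profiles: {gene: [value per condition]}  (assumed layout)
    """
    return {
        gene: values
        for gene, values in profiles.items()
        if max(values) - min(values) >= min_range
    }
```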

5: Pattern Recognition. This should derive several 'shapes' of gene expression profiles and assign the genes to them. It includes clustering algorithms such as Self-Organising Maps or K-Means, and can also include decomposition algorithms such as Principal Component Analysis.
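
To make the idea of deriving 'shapes' and assigning genes to them concrete, here is a bare-bones K-Means sketch. It is not the FGDP's implementation: the function names are hypothetical, and the initial centroids are simply the first k profiles, an assumption chosen to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def kmeans(profiles, k, iterations=20):
    """Cluster expression profiles; returns (centroids, labels).

    Assumption: the first k profiles seed the centroids.
    """
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assign each profile to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Move each centroid to the mean of its members, if it has any.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

Each centroid returned is one 'shape', and each label says which shape a gene was assigned to.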

You can simply specify one module of each desired type and run the pipeline, but one of the strengths of the FGDP is how easily it can branch out and execute multiple combinations of modules for comparison purposes.
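
The branching behaviour can be pictured as enumerating every combination of one module per stage. The sketch below only lists the combinations; how the FGDP actually configures and launches runs is not shown here, and the stage keys and normalizer names are placeholders.

```python
from itertools import product

def enumerate_runs(stages):
    """Return one {stage: module} mapping per combination of modules.

    stages: {stage_name: [module_name, ...]}, in pipeline order.
    """
    names = list(stages)
    return [
        dict(zip(names, combo))
        for combo in product(*(stages[name] for name in names))
    ]

# Placeholder module names, for illustration only.
runs = enumerate_runs({
    "normalization": ["normalizer_a", "normalizer_b"],
    "statistics": ["mean", "SAM"],
    "pattern_recognition": ["K-Means"],
})
```

Here `runs` holds four configurations, one for each pairing of a normalizer with a statistics module.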

The Modules currently supported by the pipeline:

Data Cleaners:



Pattern Recognizers:

This document was created by:

Luke Somers (la_somers@fccc.edu)
July 28, 2004

Bioinformatics Facility
Fox Chase Cancer Center
Philadelphia, PA