LSNMF: Least Squared Non-negative Matrix Factorization

LSNMF is a modified version of NMF (non-negative matrix factorization) aimed at analyzing gene expression patterns in microarray datasets. Its major improvement over NMF is that it incorporates uncertainty estimates into the update rules, which makes it more robust to noise in the data and more sensitive to real signals.
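
To make the role of the uncertainty estimates concrete, here is a minimal Python sketch of one common way to weight multiplicative NMF updates by 1/sigma^2. It is an illustration under stated assumptions, not the package's C++ code: the function names, the toy matrices, and the exact form of the update are all assumptions of this example.

```python
import random

def matmul(X, Y):
    """Plain-Python product of X (n x k) and Y (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def hadamard(X, Y):
    """Element-wise product of two equally shaped matrices."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, num, den, eps=1e-12):
    """Element-wise X * num / den; eps guards against division by zero."""
    return [[x * n / (d + eps) for x, n, d in zip(rx, rn, rd)]
            for rx, rn, rd in zip(X, num, den)]

def chi_square(D, S, A, P):
    """Sum over all elements of ((D - A*P) / sigma)^2."""
    M = matmul(A, P)
    return sum(((d - m) / s) ** 2
               for rd, rm, rs in zip(D, M, S)
               for d, m, s in zip(rd, rm, rs))

def weighted_update(D, S, A, P):
    """One pass of multiplicative updates with weights W = 1 / sigma^2 (assumed form)."""
    W = [[1.0 / (s * s) for s in row] for row in S]
    WD = hadamard(W, D)
    # P <- P * (A^T (W o D)) / (A^T (W o (A P)))
    At = transpose(A)
    P = scale(P, matmul(At, WD), matmul(At, hadamard(W, matmul(A, P))))
    # A <- A * ((W o D) P^T) / ((W o (A P)) P^T)
    Pt = transpose(P)
    A = scale(A, matmul(WD, Pt), matmul(hadamard(W, matmul(A, P)), Pt))
    return A, P

# Toy data (made up for illustration): 3 genes, 3 conditions, rank 2.
rng = random.Random(0)
D = [[1.0, 2.0, 3.0], [2.0, 4.1, 5.9], [0.5, 1.0, 1.6]]
S = [[0.1, 0.2, 0.3], [0.2, 0.4, 0.6], [0.1, 0.1, 0.2]]
A = [[rng.uniform(0.1, 1.0) for _ in range(2)] for _ in range(3)]
P = [[rng.uniform(0.1, 1.0) for _ in range(3)] for _ in range(2)]

chi2_before = chi_square(D, S, A, P)
for _ in range(50):
    A, P = weighted_update(D, S, A, P)
chi2_after = chi_square(D, S, A, P)
```

Elements with a small uncertainty get a large weight and dominate the fit, while noisy elements contribute little. This is the sense in which uncertainty-aware updates are less affected by noise.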

LSNMF is implemented in parallelized C++ based on LAM/MPI. It was designed to run on Beowulf clusters; a desktop version is also available. It is an open-source package; see LICENSE.txt for the end-user license.

I. Compilation
II. Usage
III. File Format
  1. data_file (rootname.txt)

    data_file contains the processed data from a microarray experiment. It is a tab-delimited file with (N+1) rows and (M+1) columns. The first row contains the M condition names, and the first column contains the N gene (or probe) names. Starting from the second row, each row holds the processed expression intensities of a single gene across all M conditions.

    NM_001001491	0.30372445	0.780793525	1.4116462	1.576337425	0.53325	0.975205035	1.025554875	1.3881536
    NM_001001602	4.9866785	6.74387545	7.783827125	6.18556855	0.17525	4.9866785	6.74387545	7.783827125
    NM_001001792	0.24535537	0.223226523	0.189508748	0.62436459	2.1075	0.24535537	0.223226523	0.189508748
    NM_001001880	0.757604305	0.57070907	0.341938623	0.402804123	2.7675	2.29897715	1.34643915	1.263142475
    NM_001001986	4.94571825	4.50987245	3.631820625	0.942712525	5.5475	6.331860225	4.1650696	0.943798188
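
    As a quick way to inspect such a file outside LSNMF, the following Python sketch reads the tab-delimited layout described above. The function name and the condition names in the sample are hypothetical. Because the description leaves open whether the header row carries a label above the gene column, the sketch takes the last M header cells as condition names.

```python
import csv
import io

def read_matrix_file(handle):
    """Read a tab-delimited matrix file: header row, then one gene per row."""
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    n_conditions = len(body[0]) - 1          # first cell of each row is the gene name
    conditions = header[-n_conditions:]      # works with or without a corner label
    genes = [r[0] for r in body]
    values = [[float(x) for x in r[1:]] for r in body]
    if any(len(v) != n_conditions for v in values):
        raise ValueError("rows have differing numbers of columns")
    return conditions, genes, values

# Two rows from the example above; the header names are made up.
sample = "\n".join([
    "\t".join(["gene"] + ["cond%d" % i for i in range(1, 9)]),
    "NM_001001491\t0.30372445\t0.780793525\t1.4116462\t1.576337425\t0.53325\t0.975205035\t1.025554875\t1.3881536",
    "NM_001001602\t4.9866785\t6.74387545\t7.783827125\t6.18556855\t0.17525\t4.9866785\t6.74387545\t7.783827125",
])
conditions, genes, values = read_matrix_file(io.StringIO(sample))
```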
  2. uncertainty_file (rootname.unc)

    uncertainty_file has the same format as data_file, except that each row (excluding the first row) holds the uncertainty estimates for a single gene's processed expression intensities across all conditions.

    NM_001001491	0.063257445	0.096807847	0.157888001	0.404579252	0.047675116	0.153415716	0.247063486	0.279377111
    NM_001001602	1.199352546	1.042125904	0.879334079	2.045560819	0.057575313	1.199352546	1.042125904	0.879334079
    NM_001001792	0.082980033	0.099640698	0.047767522	0.269819968	1.576417352	0.082980033	0.099640698	0.047767522
    NM_001001880	0.241294728	0.117321169	0.077849843	0.175336753	0.325512417	0.380948604	0.307905291	0.453841254
    NM_001001986	0.239042277	0.649848521	1.211522632	0.065456824	2.06638775	0.467441096	0.837032412	0.476070493
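
    Since the two files must line up element by element, a small pre-flight check can catch mismatches before a long run. The Python sketch below is one possible check, not part of LSNMF. The function name is hypothetical, and the requirement that every uncertainty be strictly positive is an assumption of this example.

```python
def check_uncertainty(data_genes, data_values, unc_genes, unc_values):
    """Return a list of problems found when pairing data with uncertainties."""
    problems = []
    if data_genes != unc_genes:
        problems.append("gene names differ between the two files")
    for gene, d_row, u_row in zip(data_genes, data_values, unc_values):
        if len(d_row) != len(u_row):
            problems.append("%s: %d data values but %d uncertainties"
                            % (gene, len(d_row), len(u_row)))
        # Assumption: an uncertainty of zero or less is not usable.
        if any(u <= 0 for u in u_row):
            problems.append("%s: non-positive uncertainty" % gene)
    return problems

# First three columns of two genes from the examples above.
genes = ["NM_001001491", "NM_001001602"]
data = [[0.30372445, 0.780793525, 1.4116462], [4.9866785, 6.74387545, 7.783827125]]
unc = [[0.063257445, 0.096807847, 0.157888001], [1.199352546, 1.042125904, 0.879334079]]

ok = check_uncertainty(genes, data, genes, unc)
bad = check_uncertainty(genes, data, genes, [unc[0], [1.199352546, 0.0]])
```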
  3. control_file (rootname.ctr)

    control_file defines all parameters needed for a specific simulation. The following is an example of control_file; users can edit this sample to fit their own task.

    5 8 1   # Ngenes (N), Nconditions (M), uncertainty_flag
    3 6 20 2000  # StartRank, EndRank, Nchains, SimulationIteration
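
    The sample above can be read back with a few lines of Python, for instance to confirm that N and M match the data file. This is a sketch only: the function name and dictionary keys are hypothetical, and treating everything after "#" as a comment is an assumption based on the sample.

```python
def parse_control(text):
    """Parse the two parameter lines of a control_file into a dictionary."""
    rows = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()   # drop the trailing comment
        if fields:
            rows.append([int(f) for f in fields])
    (n_genes, n_conditions, unc_flag), (start_rank, end_rank, n_chains, n_iter) = rows[:2]
    return {
        "n_genes": n_genes,
        "n_conditions": n_conditions,
        "uncertainty_flag": unc_flag,
        "start_rank": start_rank,
        "end_rank": end_rank,
        "n_chains": n_chains,
        "iterations": n_iter,
    }

sample = """5 8 1   # Ngenes (N), Nconditions (M), uncertainty_flag
3 6 20 2000  # StartRank, EndRank, Nchains, SimulationIteration
"""
params = parse_control(sample)
```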
  4. mock_file (rootname.out)

    rootname.out contains the simulated mock matrix data along with the original data matrix and uncertainty information for convenient comparison.

    The file is laid out as follows:

    Line 1: the experimental tag used for file names.
    Line 2: four float values: the chi-square error, the RMSD error, and two undefined parameters reserved for future use.
    Line 3: five integers. The first is always 1; the others are the maximum number of rounds set before the simulation, the number of rounds actually completed, the seed for randomizing the A matrix, and the seed for randomizing the P matrix.
    Line 4: three integers: the number of patterns, the number of genes, and the number of conditions.
    Lines 5-6: reserved for future use.
    Line 7 onward: each line contains three columns (two columns if uncertainty_flag was set to 0). The first column is the value from data_file, the second column is the corresponding mock matrix element, and the third column is the uncertainty estimate (this column is omitted if uncertainty_flag=2). If the difference between the mock matrix and the data matrix exceeds the tolerance of the uncertainty estimate, an "*" mark is placed at the end of the line.

    The data in mock_file are written row by row, following the usual matrix layout: the first line holds the data for the first gene in the first experiment, the second line the first gene in the second experiment, and so on. After the matrix element data, a summary of the simulation is given at the bottom of the file.
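
    The layout described above can be read with a short Python sketch like the one below. It assumes whitespace-separated fields and accepts the "*" mark either as its own token or attached to the last number; both are assumptions, as are the function name, the dictionary keys, and the made-up sample values.

```python
def parse_mock(lines):
    """Parse the header and the N*M element lines of a mock_file."""
    tag = lines[0].strip()
    chi_square, rmsd = [float(x) for x in lines[1].split()[:2]]
    _, max_rounds, done_rounds, seed_a, seed_p = [int(x) for x in lines[2].split()]
    n_patterns, n_genes, n_conditions = [int(x) for x in lines[3].split()]
    header = {"tag": tag, "chi_square": chi_square, "rmsd": rmsd,
              "max_rounds": max_rounds, "done_rounds": done_rounds,
              "seed_a": seed_a, "seed_p": seed_p,
              "n_patterns": n_patterns, "n_genes": n_genes,
              "n_conditions": n_conditions}
    elements = []
    # Element lines start at line 7 (index 6) and are stored row by row.
    for k, line in enumerate(lines[6:6 + n_genes * n_conditions]):
        fields = line.split()
        flagged = fields[-1].endswith("*")
        numbers = [float(f.rstrip("*")) for f in fields if f.rstrip("*")]
        elements.append({
            "gene": k // n_conditions,        # row index in the data matrix
            "condition": k % n_conditions,    # column index in the data matrix
            "data": numbers[0],
            "mock": numbers[1],
            "uncertainty": numbers[2] if len(numbers) > 2 else None,
            "flagged": flagged,
        })
    return header, elements

# Made-up example: 1 pattern, 2 genes, 2 conditions.
sample = [
    "demo", "3.2 0.41 0 0", "1 2000 2000 17 42", "1 2 2", "", "",
    "1.0 1.1 0.2", "2.0 2.9 0.3 *", "3.0 3.0 0.1", "4.0 3.9 0.5",
    "summary text at the bottom of the file",
]
header, elements = parse_mock(sample)
```

    Reading exactly N*M element lines keeps the trailing summary out of the parsed data.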

  5. A and P matrices

    The A matrix and the P matrix are saved in regular matrix format, Nrows by Ncolumns. For the A matrix, Nrows is the number of genes (N) and Ncolumns is the number of patterns (rank K). For the P matrix, Nrows is the number of patterns (K) and Ncolumns is the number of samples (M).
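
    The dimensions fit together so that A (N by K) times P (K by M) has the shape of the data matrix (N by M). The Python sketch below checks the shapes and computes one element of that product. It assumes, as in standard NMF, that the mock matrix is the product of A and P, and that the saved matrices are whitespace-separated; the function names and the small matrices are made up for illustration.

```python
def parse_matrix(text):
    """Parse a whitespace-separated matrix, one row per line."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]

def check_shapes(A, P):
    """Return (N, K, M) after confirming that A is N x K and P is K x M."""
    n, k, m = len(A), len(P), len(P[0])
    if any(len(row) != k for row in A):
        raise ValueError("A must have one column per pattern")
    if any(len(row) != m for row in P):
        raise ValueError("P rows must all have the same length")
    return n, k, m

def reconstruct_element(A, P, i, j):
    """Element (i, j) of the product A*P: sum over patterns of A[i][k] * P[k][j]."""
    return sum(A[i][k] * P[k][j] for k in range(len(P)))

# Made-up matrices: N = 3 genes, K = 2 patterns, M = 4 samples.
A = parse_matrix("1 0\n0 2\n1 1")
P = parse_matrix("1 2 3 4\n0.5 0.5 0.5 0.5")
shape = check_shapes(A, P)
value = reconstruct_element(A, P, 2, 3)   # third gene, fourth sample
```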

IV. Remarks and Comments