LSNMF is a modified version of NMF (non-negative matrix factorization) designed for analyzing gene expression patterns in microarray datasets. Its main improvement over NMF is that it incorporates uncertainty estimates into the update rules, which makes it more robust to noise in the data and more sensitive to real signals.
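To illustrate what "incorporating uncertainty estimates into the update rules" can look like, here is a minimal pure-Python sketch of a weighted multiplicative NMF update in which each data point is weighted by 1/sigma^2. This is the commonly used weighted form of the multiplicative update, written here as an assumption; it is not taken from the LSNMF source code, and the function names and the small data matrix are made up for illustration.

```python
def matmul(a, p):
    """Multiply an N x K matrix by a K x M matrix (lists of lists)."""
    k, m = len(p), len(p[0])
    return [[sum(row[t] * p[t][j] for t in range(k)) for j in range(m)] for row in a]

def chi_square(d, sigma, a, p):
    """Sum of squared residuals, each scaled by its uncertainty."""
    mock = matmul(a, p)
    return sum(((d[i][j] - mock[i][j]) / sigma[i][j]) ** 2
               for i in range(len(d)) for j in range(len(d[0])))

def weighted_update(d, sigma, a, p, eps=1e-12):
    """One weighted multiplicative update of P, then of A (assumed form)."""
    n, m, k = len(d), len(d[0]), len(p)
    w = [[1.0 / sigma[i][j] ** 2 for j in range(m)] for i in range(n)]
    ap = matmul(a, p)
    p = [[p[t][j]
          * sum(a[i][t] * w[i][j] * d[i][j] for i in range(n))
          / (sum(a[i][t] * w[i][j] * ap[i][j] for i in range(n)) + eps)
          for j in range(m)] for t in range(k)]
    ap = matmul(a, p)
    a = [[a[i][t]
          * sum(w[i][j] * d[i][j] * p[t][j] for j in range(m))
          / (sum(w[i][j] * ap[i][j] * p[t][j] for j in range(m)) + eps)
          for t in range(k)] for i in range(n)]
    return a, p

# Made-up 3 genes x 4 conditions data with per-point uncertainties.
d = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.1, 5.9, 8.2], [0.5, 0.4, 0.6, 0.5]]
sigma = [[0.1, 0.1, 0.5, 0.1], [0.2, 0.1, 0.1, 0.5], [0.1, 0.1, 0.1, 0.1]]
# Fixed positive starting values for a rank-2 factorization.
a = [[1.0, 0.5], [0.5, 1.0], [0.8, 0.3]]
p = [[1.0, 1.0, 1.0, 1.0], [0.5, 1.5, 1.0, 2.0]]

history = [chi_square(d, sigma, a, p)]
for _ in range(50):
    a, p = weighted_update(d, sigma, a, p)
    history.append(chi_square(d, sigma, a, p))
```

Points with a large sigma contribute little to both the numerator and the denominator of each update, so noisy measurements pull the factors less than precise ones. Setting every sigma to the same value recovers the unweighted NMF update.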
LSNMF is implemented in C++ and parallelized with LAM/MPI. It was designed to run on Beowulf clusters; a desktop version is also available. It is an open-source package. Please see LICENSE.txt for information concerning the end-user license.
To compile LSNMF, the LAM/MPI libraries and compiler must be installed. Then complete the following steps:
An executable named LSNMF will be generated in the current directory. To make it accessible from all cluster nodes, you may need to move or copy it to a directory that is on $PATH for all nodes.
LSNMF takes at least two input files: the data_file and the control_file. To take full advantage of LSNMF, a third file, the uncertainty_file, is also needed. data_file contains microarray expression data as ratio values, uncertainty_file contains the uncertainty estimates for all data points in data_file, and control_file defines all necessary parameters. All files must follow specific formats, which are described in the next section. data_file, uncertainty_file, and control_file must have the same root name. The naming convention is:
Output files include the amplitude matrix rootname_rank_round_A.dat, the pattern matrix rootname_rank_round_P.dat, and a mock matrix rootname_rank_round.out. "rank" and "round" are automatically generated numbers; please see the File Format section for details.
data_file contains the processed data from the microarray experiment. It is a tab-delimited file with (N+1) rows and (M+1) columns. The first row contains the M condition names, and the first column contains the N gene (or probe) names. Starting from the second row, each row holds the processed expression intensities for a single gene across all M conditions. Example:
MLP(BM)	CLP(BM)	FrA(BM)	FrB(BM)	MLP(FL)	CLP(FL)	FrA(FL)	FrB(FL)
NM_001001491	0.30372445	0.780793525	1.4116462	1.576337425	0.53325	0.975205035	1.025554875	1.3881536
NM_001001602	4.9866785	6.74387545	7.783827125	6.18556855	0.17525	4.9866785	6.74387545	7.783827125
NM_001001792	0.24535537	0.223226523	0.189508748	0.62436459	2.1075	0.24535537	0.223226523	0.189508748
NM_001001880	0.757604305	0.57070907	0.341938623	0.402804123	2.7675	2.29897715	1.34643915	1.263142475
NM_001001986	4.94571825	4.50987245	3.631820625	0.942712525	5.5475	6.331860225	4.1650696	0.943798188
uncertainty_file has the same format as data_file, except that each row (excluding the first row) holds the uncertainty estimates for a single gene's processed expression intensities across all conditions. Example:
MLP(BM)U	CLP(BM)U	FrA(BM)U	FrB(BM)U	MLP(FL)U	CLP(FL)U	FrA(FL)U	FrB(FL)U
NM_001001491	0.063257445	0.096807847	0.157888001	0.404579252	0.047675116	0.153415716	0.247063486	0.279377111
NM_001001602	1.199352546	1.042125904	0.879334079	2.045560819	0.057575313	1.199352546	1.042125904	0.879334079
NM_001001792	0.082980033	0.099640698	0.047767522	0.269819968	1.576417352	0.082980033	0.099640698	0.047767522
NM_001001880	0.241294728	0.117321169	0.077849843	0.175336753	0.325512417	0.380948604	0.307905291	0.453841254
NM_001001986	0.239042277	0.649848521	1.211522632	0.065456824	2.06638775	0.467441096	0.837032412	0.476070493
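Before launching a run, it can be useful to check that data_file and uncertainty_file are consistent. The following is a minimal Python sketch, not part of LSNMF itself, that reads the tab-delimited layout described above and verifies that both files list the same genes. The helper name read_matrix is hypothetical, and the sketch assumes the header row holds only condition names (a leading empty cell is tolerated).

```python
import io

def read_matrix(handle):
    """Read a tab-delimited matrix: a header row of condition names,
    then one row per gene (gene name followed by numeric values)."""
    lines = [ln.rstrip("\n") for ln in handle if ln.strip()]
    # Drop empty cells so a header with a leading tab is also accepted.
    conditions = [c for c in lines[0].split("\t") if c != ""]
    genes, rows = [], []
    for ln in lines[1:]:
        fields = ln.split("\t")
        values = [float(x) for x in fields[1:]]
        if len(values) != len(conditions):
            raise ValueError(
                f"{fields[0]}: expected {len(conditions)} values, got {len(values)}"
            )
        genes.append(fields[0])
        rows.append(values)
    return conditions, genes, rows

# Two genes and three conditions copied from the examples above.
data_text = (
    "MLP(BM)\tCLP(BM)\tFrA(BM)\n"
    "NM_001001491\t0.30372445\t0.780793525\t1.4116462\n"
    "NM_001001602\t4.9866785\t6.74387545\t7.783827125\n"
)
unc_text = (
    "MLP(BM)U\tCLP(BM)U\tFrA(BM)U\n"
    "NM_001001491\t0.063257445\t0.096807847\t0.157888001\n"
    "NM_001001602\t1.199352546\t1.042125904\t0.879334079\n"
)

conditions, genes, data = read_matrix(io.StringIO(data_text))
_, unc_genes, unc = read_matrix(io.StringIO(unc_text))

# Both files must describe the same genes in the same order.
if genes != unc_genes:
    raise ValueError("gene names differ between data_file and uncertainty_file")
```

With real files, pass open("rootname_file") handles instead of the io.StringIO objects; the actual file names depend on the naming convention of your installation.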
control_file defines all parameters needed for a specific simulation. The following is an example control_file; users can edit this sample file to fit their own task.
5 8 1 # Ngenes (N), Nconditions (M), uncertainty_flag
3 6 20 2000 # StartRank, EndRank, Nchains, SimulationIteration
The first line of control_file defines the number of genes and the number of conditions, which are fixed for a given microarray dataset. The third number on the first line is the uncertainty flag. If it is 1, uncertainty estimates are available and will be used in the simulation. If it is 0, the uncertainty information is either not available or will not be used, and LSNMF reduces to plain NMF.
The second line defines the main calculation parameters. The first number is the start rank and the second is the end rank, where the rank is the number of patterns the simulation should find. The third number defines how many repeats are performed for each rank (K), and the fourth number defines the maximum number of rounds for the LSNMF simulation in each repeat.
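As an illustration of the two-line layout, here is a small Python sketch that reads a control_file into named fields. It is not part of LSNMF; the function name parse_control and the dictionary keys are made up for this example, and treating EndRank as inclusive is an assumption.

```python
def parse_control(text):
    """Parse the two-line control_file layout described above."""
    rows = []
    for line in text.splitlines():
        # Everything after '#' is a comment, as in the sample file.
        line = line.split("#", 1)[0].strip()
        if line:
            rows.append([int(tok) for tok in line.split()])
    if len(rows) != 2 or len(rows[0]) != 3 or len(rows[1]) != 4:
        raise ValueError("expected a line of 3 integers and a line of 4 integers")
    n_genes, n_conditions, uncertainty_flag = rows[0]
    start_rank, end_rank, n_chains, max_rounds = rows[1]
    if uncertainty_flag not in (0, 1):
        raise ValueError("uncertainty_flag must be 0 or 1")
    return {
        "n_genes": n_genes,
        "n_conditions": n_conditions,
        "uncertainty_flag": uncertainty_flag,
        # Assumption: EndRank is included in the range of ranks to try.
        "ranks": list(range(start_rank, end_rank + 1)),
        "n_chains": n_chains,
        "max_rounds": max_rounds,
    }

sample = """5 8 1 # Ngenes (N), Nconditions (M), uncertainty_flag
3 6 20 2000 # StartRank, EndRank, Nchains, SimulationIteration
"""
cfg = parse_control(sample)
```

A check like this can catch a mismatch between Ngenes/Nconditions and the actual size of data_file before a long cluster job is submitted.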
rootname.out contains the simulated mock matrix along with the original data matrix and the uncertainty information, for convenient comparison.
The first line contains the experimental tag used for file names. The second line has four float values: the Chi-square error, the RMSD error, and two undefined parameters reserved for future use. The third line has five integers: the first is always 1, followed by the maximum number of rounds set before the simulation, the number of rounds actually completed, the seed for randomizing the A matrix, and the seed for randomizing the P matrix. The fourth line has three integers: the number of patterns, the number of genes, and the number of conditions. The fifth and sixth lines are reserved for future use. Starting with the seventh line, each line contains three columns (two columns if uncertainty_flag was set to 0): the first column is the value from data_file, the second column is the corresponding mock matrix element, and the third column is the uncertainty estimate (omitted when uncertainty_flag=0). If the difference between the mock matrix and the data matrix is beyond the tolerance of the uncertainty estimate, an "*" mark is placed at the end of the line.
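The two error values and the "*" mark can be reproduced from the three columns for a quick sanity check. The sketch below uses the conventional definitions (Chi-square as the sum of squared residuals scaled by the uncertainty, RMSD as the root of the mean squared residual). These definitions, the one-sigma threshold for the "*" mark, and the function names are assumptions rather than something confirmed from the LSNMF source, and the numbers are made up.

```python
import math

def fit_errors(data, mock, sigma):
    """Return (chi_square, rmsd) under the conventional definitions."""
    residuals = [d - m for d, m in zip(data, mock)]
    chi2 = sum((r / s) ** 2 for r, s in zip(residuals, sigma))
    rmsd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return chi2, rmsd

def flag_outliers(data, mock, sigma, n_sigma=1.0):
    """True where |data - mock| exceeds n_sigma times the uncertainty.
    The threshold LSNMF actually uses is not documented here."""
    return [abs(d - m) > n_sigma * s for d, m, s in zip(data, mock, sigma)]

# Made-up columns: data value, mock value, uncertainty.
data = [1.0, 2.0, 4.0]
mock = [1.0, 1.5, 2.0]
sigma = [0.5, 0.5, 1.0]

chi2, rmsd = fit_errors(data, mock, sigma)
flags = flag_outliers(data, mock, sigma)
```

Only the third element is flagged here, because its residual of 2.0 is larger than its uncertainty of 1.0.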
The data in mock_file are listed row by row in the usual matrix sense: the first data line is the first gene in the first experiment, the second data line is the first gene in the second experiment, and so on. After the matrix element data, a summary of the simulation is given at the bottom of the file.
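The row-by-row ordering means a file line number can be converted to a (gene, condition) pair with simple arithmetic. The sketch below assumes, as described above, that element data start on the seventh line of the file and that "experiment" means the same thing as "condition"; the function names are hypothetical and indices are zero-based.

```python
FIRST_DATA_LINE = 7  # 1-based line on which matrix elements start

def line_for(gene, condition, n_conditions):
    """1-based file line holding the element for (gene, condition)."""
    return FIRST_DATA_LINE + gene * n_conditions + condition

def element_for(line, n_conditions):
    """Inverse mapping: 1-based file line to (gene, condition)."""
    if line < FIRST_DATA_LINE:
        raise ValueError("line is in the header, not in the element data")
    return divmod(line - FIRST_DATA_LINE, n_conditions)

# With M = 8 conditions, the second gene starts 8 lines after the first.
first_of_second_gene = line_for(1, 0, 8)
gene, condition = element_for(22, 8)
```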
The A matrix and the P matrix are saved in regular matrix format, Nrows by Ncolumns. For the A matrix, Nrows is the number of genes (N) and Ncolumns is the number of patterns (rank K). For the P matrix, Nrows is the number of patterns (K) and Ncolumns is the number of conditions (M).
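The shapes above fit together as an (N x K) times (K x M) product. Assuming the mock matrix is the product of A and P, as in standard NMF (this is not stated explicitly here), it can be rebuilt from the two output matrices. The sketch below uses made-up values and a hypothetical helper name.

```python
def matmul(a, p):
    """Multiply an N x K matrix A by a K x M matrix P (lists of lists)."""
    k = len(p)
    if any(len(row) != k for row in a):
        raise ValueError("columns of A must equal rows of P (the rank K)")
    m = len(p[0])
    return [[sum(row[t] * p[t][j] for t in range(k)) for j in range(m)]
            for row in a]

# Made-up example: N = 3 genes, K = 2 patterns, M = 4 conditions.
A = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]
P = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]

mock = matmul(A, P)  # N x M, the same shape as the data matrix
```

Each row of A says how strongly a gene participates in each pattern, and each row of P says how a pattern behaves across conditions, so the third gene above is the sum of both patterns.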