Papers
Insilicos authors are noted in bold.
Collaborative Systems Biology: Open Source, Open Data, and Cloud Computing
Brian Pratt
Book chapter in M. A. Z. Hupcey, A. J. Williams (Eds.) Collaborative Computational Technologies for Biomedical Research (pp. 209-220) John Wiley and Sons (2011) ISBN: 978-0-470-63803-3
In disciplines such as Systems Biology which have largely arisen in the internet age, and in which code and data are the fundamental artifacts, there are new expectations as to what constitutes full disclosure.
It’s said that there is no such thing as a free lunch, but in the case of software it’s probably more apt to say that the price of freedom is vigilance.
MR-Tandem: Parallel X!Tandem using Hadoop MapReduce on Amazon Web Services

Brian Pratt, J. Jeffry Howbert, Natalie Tasman, Erik Nilsson.
Bioinformatics (2011) doi: 10.1093/bioinformatics/btr615
MR-Tandem adapts the popular X!Tandem peptide search engine to work with Hadoop MapReduce for reliable parallel execution of large searches.
MR-Tandem runs on any Hadoop cluster but offers special support for Amazon Web Services for creating inexpensive on-demand Hadoop clusters, enabling search volumes that might not otherwise be feasible with the compute resources a researcher has at hand. MR-Tandem is designed to drop in wherever X!Tandem is already in use and requires no modification to existing X!Tandem parameter files, and only minimal modification to X!Tandem-based workflows.
mzML - a Community Standard for Mass Spectrometry Data
Martens L, Chambers M, Sturm M, Kessner D, Levander F, Shofstahl J, Tang WH, Rompp A, Neumann S, Pizarro AD, Montecchi-Palazzi L, Tasman N, Coleman M, Reisinger F, Souda P, Hermjakob H, Binz PA, Deutsch EW.
Molecular and Cellular Proteomics. 2011 Jan;10(1):R110.000133
Rapid advances in mass spectrometry make it imperative to provide a standard format for mass spectrometry data that will facilitate data sharing and analysis. Vendors, researchers, and software developers convened under the banner of the HUPO PSI to develop a single standard. The new data format incorporated many of the desirable technical attributes from previous formats, while adding a number of improvements. The resulting standard data format, mzML, is a well tested open-source format for mass spectrometer output files that can be readily utilized by the community and easily adapted for incremental advances in mass spectrometry technology.
HDL in humans with cardiovascular disease exhibits a proteomic signature

Vaisar T, Mayer P, Nilsson E, Zhao XQ, Knopp R, Prazen BJ.
Clinica Chimica Acta 2010 Jul 4;411(13-14):972-9. Epub 2010 Mar 20
Alterations in protein composition and oxidative damage of high density lipoprotein (HDL) have been proposed to impair the cardioprotective properties of HDL. We tested whether relative levels of proteins in HDL2 could be used as biomarkers for coronary artery disease (CAD).
Conclusions
HDL2 of CAD subjects carries a distinct protein cargo, and protein oxidation helps generate dysfunctional HDL. Moreover, models based on selected identified peptides in MALDI-TOF mass spectra of the HDL may have diagnostic potential.
A guided tour of the Trans-Proteomic Pipeline
Deutsch EW, Mendoza L, Shteynberg D, Farrah T, Lam H, Tasman N, Sun Z, Nilsson E, Pratt B, Prazen B, Eng JK, Martin DB, Nesvizhskii A, Aebersold R.
Proteomics 10, 1150–1159 (15 March 2010)
The Trans-Proteomic Pipeline (TPP) is a suite of software tools for the analysis of MS/MS data sets. The tools encompass most of the steps in a proteomic data analysis workflow in a single, integrated software system. Specifically, the TPP supports all steps from spectrometer output file conversion to protein-level statistical validation, including quantification by stable isotope ratios. We describe here the full workflow of the TPP and the tools therein, along with an example on a sample data set, demonstrating that the setup and use of the tools are straightforward and well supported and do not require specialized informatic resources or knowledge.
Trans-Proteomic Pipeline supports and improves analysis of electron transfer dissociation datasets

E. W. Deutsch, D. Shteynberg, H. Lam, Z. Sun, J. K. Eng, C. Carapito, P. D. von Haller, N. Tasman, L. Mendoza, T. Farrah, R. Aebersold.
Proteomics 10, 1190-1195 (15 March 2010)
Electron transfer dissociation (ETD) is an alternative fragmentation technique to CID that has recently become commercially available. ETD has several advantages over CID. It is less prone to fragmenting amino acid side chains, especially those that are modified, thus yielding fragment ion spectra with more uniform peak intensities. Further, precursor ions of longer peptides and higher charge states can be fragmented and identified. However, analysis of ETD spectra has a few important differences that require the optimization of the software packages used for the analysis of CID data or the development of specialized tools. We have adapted the Trans-Proteomic Pipeline to process ETD data. Specifically, we have added support for fragment ion spectra from high-charge precursors, compatibility with charge-state estimation algorithms, provisions for the use of the Lys-C protease, capabilities for ETD spectrum library building, and updates to the data formats to differentiate CID and ETD spectra. We show the results of processing data sets from several different types of ETD instruments and demonstrate that application of the ETD-enhanced Trans-Proteomic Pipeline can increase the number of spectrum identifications at a fixed false discovery rate by as much as 100% over native output from a single sequence search engine.
The gputools package enables GPU computing in R

J. Buckner, J. Wilson, M. Seligman, B. Athey, S. Watson, F. Meng.
Bioinformatics (2010) 26 (1): 134-135. doi: 10.1093/bioinformatics/btp608
By default, the R statistical environment does not make use of parallelism. Researchers may resort to expensive solutions such as cluster hardware for large analysis tasks. Graphics processing units (GPUs) provide an inexpensive and computationally powerful alternative. Using R and the CUDA toolkit from Nvidia, we have implemented several functions commonly used in microarray gene expression analysis for GPU-equipped computers.
Results: R users can take advantage of the better performance provided by an Nvidia GPU.
Least angle regression and LASSO for large datasets

Fraley C, Hesterberg T.
Statistical Analysis and Data Mining 2009, 1:251-259
Least angle regression and LASSO (ℓ1-penalized regression) offer a number of advantages in variable selection applications over procedures such as stepwise or ridge regression, including prediction accuracy, stability, and interpretability. We discuss formulations of these algorithms that extend to datasets in which the number of observations could be so large that it would not be possible to access the matrix of predictors as a unit in computations. Our methods require a single pass through the data for orthogonal transformation, effectively reducing the dimension of the computations required to obtain the regression coefficients and residual sum of squares to the number of predictors, rather than the number of observations.
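The single-pass idea in this abstract can be illustrated for ordinary least squares. The sketch below is not the authors' algorithm: it shows only the reduction step, assuming a row-by-row Givens-rotation update of a triangular factor, and leaves out the LARS/LASSO steps. The function names (`update_qr`, `solve_upper`, `streaming_least_squares`) are hypothetical. Each observation is read once and then discarded, so memory use depends on the number of predictors p and not on the number of observations.

```python
import math

def update_qr(R, qty, rss, x, y):
    """Fold one observation (x, y) into R (p x p, upper triangular),
    qty (the first p entries of Q^T y) and the residual sum of squares."""
    p = len(x)
    x = list(x)  # work on a copy; the caller's row is left untouched
    for i in range(p):
        if x[i] == 0.0:
            continue  # nothing to eliminate in this column
        r = math.hypot(R[i][i], x[i])
        c, s = R[i][i] / r, x[i] / r
        # Givens rotation mixing row i of R with the incoming row,
        # chosen so that the incoming row's i-th entry becomes zero.
        for j in range(i, p):
            rij, xj = R[i][j], x[j]
            R[i][j] = c * rij + s * xj
            x[j] = -s * rij + c * xj
        qi = qty[i]
        qty[i] = c * qi + s * y
        y = -s * qi + c * y
    # Whatever is left of y is orthogonal to the predictors seen so far.
    return rss + y * y

def solve_upper(R, qty):
    """Back-substitution for R beta = qty."""
    p = len(qty)
    beta = [0.0] * p
    for i in reversed(range(p)):
        acc = qty[i] - sum(R[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = acc / R[i][i]
    return beta

def streaming_least_squares(rows, p):
    """Consume an iterable of (x, y) pairs once; return (beta, rss)."""
    R = [[0.0] * p for _ in range(p)]
    qty = [0.0] * p
    rss = 0.0
    for x, y in rows:
        rss = update_qr(R, qty, rss, x, y)
    return solve_upper(R, qty), rss

# Illustrative data: an intercept column plus one predictor.
rows = [([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0), ([1.0, 2.0], 1.0)]
beta, rss = streaming_least_squares(iter(rows), p=2)
```

After the pass, everything needed for the coefficients and the residual sum of squares lives in `R`, `qty` and `rss`, which is the sense in which the computation shrinks to the number of predictors. The sketch assumes the predictors are not collinear, so every diagonal entry of `R` ends up nonzero.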
Bayesian regularization for normal mixture estimation and model-based clustering
Chris Fraley, Adrian E. Raftery
Journal of Classification 2007, 24:155-181
Normal mixture models are widely used for statistical modeling of data, including cluster analysis. However, maximum likelihood estimation (MLE) for normal mixtures using the EM algorithm may fail as the result of singularities or degeneracies. To avoid this, we propose replacing the MLE by a maximum a posteriori (MAP) estimator, also found by the EM algorithm. For choosing the number of components and the model parameterization, we propose a modified version of BIC, where the likelihood is evaluated at the MAP instead of the MLE. We use a highly dispersed proper conjugate prior, containing a small fraction of one observation's worth of information. The resulting method avoids degeneracies and singularities, but when these are not present it gives similar results to the standard method using MLE, EM and BIC.
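A small sketch can show how a MAP estimate keeps EM away from the singularities the abstract mentions. It is a simplified illustration, not the paper's method: it assumes a univariate two-component mixture, a flat prior on the means, and an inverse-gamma prior on each variance with hypothetical hyperparameters `a` and `b`. The paper's actual prior and hyperparameter choices are not reproduced here. Under these assumptions the M-step variance becomes (S + 2b) / (n_k + 2a + 2), where S is the weighted sum of squares, so it cannot collapse to zero even when a component sits on identical points.

```python
import math

def map_em_1d(data, n_iter=50, a=1.0, b=0.01):
    """EM for a two-component 1-D normal mixture with a MAP variance update.

    a, b: assumed inverse-gamma hyperparameters (illustrative values only).
    Assumes the data are not all identical and that both components keep
    some weight throughout the iterations.
    """
    n = len(data)
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    mu = [min(data), max(data)]  # crude initialisation at the extremes
    var = [var_all, var_all]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for x in data:
            logp = [math.log(w[k]) - 0.5 * math.log(2 * math.pi * var[k])
                    - (x - mu[k]) ** 2 / (2 * var[k]) for k in range(2)]
            m = max(logp)
            p = [math.exp(lp - m) for lp in logp]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: the usual weight and mean updates, MAP update for variance.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            s = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data))
            var[k] = (s + 2 * b) / (nk + 2 * a + 2)  # bounded away from zero
    return w, mu, var

# Three identical points would drive the MLE variance of one component to zero.
data = [0.0, 0.0, 0.0, 4.9, 5.0, 5.1, 5.2]
weights, means, variances = map_em_1d(data)
```

With a plain MLE update (`var[k] = s / nk`) the component covering the three identical points would reach zero variance and the next E-step would fail. The prior term `2 * b` keeps the estimate strictly positive.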



