Abstract

In order to study the differential protein expression in complex biological samples, strategies for rapid, highly reproducible and accurate quantification are necessary. Isotope labeling and fluorescent labeling techniques have been widely used in quantitative proteomics research. However, researchers are increasingly turning to label-free shotgun proteomics techniques for faster, cleaner, and simpler results. Mass spectrometry-based label-free quantitative proteomics falls into two general categories. In the first are the measurements of changes in chromatographic ion intensity such as peptide peak areas or peak heights. The second is based on the spectral counting of identified proteins. In this paper, we will discuss the technologies of these label-free quantitative methods, statistics, available computational software, and their applications in complex proteomics studies.

1. Introduction

Mass spectrometry plays a central role in proteomics [1]. In addition to global profiling of the proteins present within a system at a given time, information on the level of protein expression is increasingly required in proteomics studies [1, 2]. Protein separation and comparison by two-dimensional polyacrylamide gel electrophoresis (2D-PAGE), followed by mass spectrometry (MS) or tandem mass spectrometry (MS/MS) identification is the classical method for quantitative analysis of protein mixtures [3]. In this method, the intensity of the protein stain is used to make a determination regarding the quantity of a particular protein. The development of 2D Fluorescence Difference Gel Electrophoresis (2D-DIGE) gives more accurate and reliable quantification information of protein abundance because the samples to be compared are run together on the same gel, eliminating potential gel-to-gel variation [4]. However, spots on a given 2D gel often contain more than one protein, making quantification ambiguous since it is not immediately apparent which protein in the spot has changed. In addition, any 2D gel approach is subject to the restrictions imposed by the gel method, which include limited dynamic range, difficulty handling hydrophobic proteins, and difficulty detecting proteins with extreme molecular weights and pI values.

The development of non-gel-based, “shotgun” proteomic techniques such as Multidimensional Protein Identification Technology (MudPIT) has provided powerful tools for studying large-scale protein expression and characterization in complex biological systems [5, 6]. Non-gel-based quantitative proteomics methods have, therefore, also developed significantly in recent years. Because the chemical and physical properties of isotope-labeled compounds are identical to those of their natural counterparts except in mass, isotope-labeled molecules were incorporated into mass spectrometry-based proteomics methods as internal standards or relative references. A number of stable isotope labeling approaches have been developed for “shotgun” quantitative proteomic analysis. These include Isotope-Coded Affinity Tag (ICAT), Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC), metabolic labeling, enzymatic labeling, Isotope Coded Protein Labeling (ICPL), Tandem Mass Tags (TMT), Isobaric Tags for Relative and Absolute Quantification (iTRAQ), and other chemical labeling [7, 8]. These stable isotope labeling methods have provided valuable flexibility when quantitative proteomic techniques are used to study protein changes in complex samples. However, most labeling-based quantification approaches have potential limitations. These include increased time and complexity of sample preparation, the requirement for higher sample concentration, the high cost of the reagents, incomplete labeling, and the requirement for specific quantification software. Moreover, so far only TMT and iTRAQ allow the comparison of multiple (up to 8) samples at the same time. The other labeling methods can only compare the relative quantity of a protein between two or three different samples. There has, therefore, been increased interest in label-free shotgun proteomics techniques in order to address some of the issues of labeling methods and achieve faster, cleaner, and simpler quantification results [7, 9–11].

2. Label-Free Quantitative Proteomics

All label-free quantitative proteomics methods include the following fundamental steps: (i) sample preparation including protein extraction, reduction, alkylation, and digestion; (ii) sample separation by liquid chromatography (LC or LC/LC) and analysis by MS/MS; (iii) data analysis including peptide/protein identification, quantification, and statistical analysis. In protein-labeling approaches, different protein samples are combined once labeling is finished and the pooled mixtures are then taken through the sample preparation step before being analyzed by a single LC-MS/MS or LC/LC-MS/MS experiment (Figure 1(a)). In contrast, with label-free quantitative methods, each sample is separately prepared, then subjected to individual LC-MS/MS or LC/LC-MS/MS runs (Figure 1(b)). Protein quantification is generally based on two categories of measurements. The first comprises measurements of ion intensity changes such as peptide peak areas or peak heights in chromatography. The second is based on the spectral counting of identified proteins after MS/MS analysis. Peptide peak intensity or spectral count is measured for individual LC-MS/MS or LC/LC-MS/MS runs and changes in protein abundance are calculated via a direct comparison between different analyses.

2.1. Relative Quantification by Peak Intensity of LC-MS

In LC-MS, an ion with a particular m/z is detected and recorded with a particular intensity, at a particular time. It has been observed that signal intensity from electrospray ionization (ESI) correlates with ion concentration [12]. The label-free quantification of peptide/protein via peak intensity in LC-MS was first studied by loading 10 fmol–100 pmol of myoglobin digests to nano-LC and analyzing by LC-MS/MS [13]. When the chromatographic peak areas of the identified peptides were extracted and calculated, the peak areas were found to increase with increased concentration of injected peptides. After the peak areas of all identified myoglobin peptides were combined and plotted against the protein amount, the peak area was found to correlate linearly with the concentration of protein. The strong correlation between chromatographic peak areas and the peptide/protein concentration remained when myoglobin was spiked into a complex mixture (human serum) and its digests were detected by LC-MS/MS. The results of quantitative profiling were further improved by normalizing the calculated peak areas [13, 14].

Although these early studies showed that the relative quantification of the peptides could be achieved via direct comparison of peak intensity of each peptide ion in multiple LC-MS datasets, applying this method for the analysis of changes in protein abundances in complex biological samples had some practical constraints. First, even the same sample can result in differences in the peak intensities of the peptides from run to run. These differences are caused by experimental variations such as differences in sample preparation and sample injection. Normalization is required to account for this kind of variation. Second, any experimental drifts in retention time and m/z will significantly complicate the direct, accurate comparison of multiple LC-MS datasets. Chromatographic shifts may occur as a result of multiple sample injections onto the same reverse-phase LC column. Unaligned peak comparison will result in large variability and inaccuracy in quantification. Thus, highly reproducible LC-MS and careful chromatographic peak alignment are critical in this comparative approach. Last, the large volume of data collected during LC-MS/MS analysis of complex protein mixtures requires the analysis of these spectra to be automated. Therefore, capable computer algorithms were developed in later studies in order to solve these issues and automatically compare the peak intensity data between LC-MS samples at a comprehensive scale. Several similar steps in data processing were performed in these label-free quantifications. Peptide peaks were first distinguished from background noise and from neighboring peaks (peak detection). Isotope patterns were assigned by deconvolution. LC-MS retention times were carefully adjusted in order to correctly match the corresponding mass peaks between multiple LC-MS runs (peak matching). Chromatographic peak intensity (either peak area or peak height) was calculated and normalized to enable more accurate matching and quantitation. Finally, statistical analysis such as Student’s t-test was performed to determine the significance of changes between multiple samples [15–17].
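
The last two steps of this workflow (normalization and significance testing) can be illustrated with a minimal Python sketch. It assumes that peak detection and alignment have already been done, so that each run is a mapping from peptide to peak area. The total-intensity normalization, the peptide names, and the peak areas are illustrative assumptions and are not taken from the cited studies.

```python
from math import sqrt
from statistics import mean, variance


def normalize_by_total(run):
    """Scale the peak areas of one LC-MS run so that they sum to 1."""
    total = sum(run.values())
    return {peptide: area / total for peptide, area in run.items()}


def student_t(group_a, group_b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


# Hypothetical aligned peak areas for two peptides in three replicate runs per condition.
control_runs = [
    {"PEP1": 100.0, "PEP2": 300.0},
    {"PEP1": 110.0, "PEP2": 290.0},
    {"PEP1": 90.0, "PEP2": 310.0},
]
treated_runs = [
    {"PEP1": 200.0, "PEP2": 300.0},
    {"PEP1": 210.0, "PEP2": 290.0},
    {"PEP1": 190.0, "PEP2": 310.0},
]

control = [normalize_by_total(run)["PEP1"] for run in control_runs]
treated = [normalize_by_total(run)["PEP1"] for run in treated_runs]
t_stat, dof = student_t(control, treated)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

The t statistic would then be converted to a P-value using a t distribution with the reported degrees of freedom, for example with a statistics package.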

Automatic comparison of peak intensity from multiple LC-MS datasets is well suited for clinical biomarker discovery, which normally requires high sample throughput. The following studies were all performed using this label-free quantitative approach. The comparison of control and radiated human colon cancer cells proved the reproducibility of this label-free approach [18]; the serum proteomic profiling of familial adenomatous polyposis patients revealed multiple novel celecoxib-modulated proteins [19]; proteins significantly associated with metastasis were identified by analyzing paraffin-embedded archival melanomas [20]; the analysis of 55 clinical serum samples from schizophrenia patients and healthy volunteers identified hundreds of differentially expressed serum proteins [21]; diagnostic markers and protein signatures were recognized from the serum of Gaucher patients [22] and the cerebrospinal fluid of schizophrenia patients [23].

2.2. Relative Quantification by Spectral Count

In the spectral counting approach, relative protein quantification is achieved by comparing the number of identified MS/MS spectra from the same protein in each of the multiple LC-MS/MS or LC/LC-MS/MS datasets. This is possible because an increase in protein abundance typically results in an increase in the number of its proteolytic peptides, and vice versa. This increased number of (tryptic) digests then usually results in an increase in protein sequence coverage, the number of identified unique peptides, and the number of identified total MS/MS spectra (spectral count) for each protein [24]. Liu et al. studied the correlation between relative protein abundance and sequence coverage, peptide number, and spectral count. It was demonstrated that among all the factors of identification, only spectral count showed a strong linear correlation with relative protein abundance, with a dynamic range over 2 orders of magnitude [25]. Therefore, spectral count can be used as a simple but reliable index for relative protein quantification. An intriguing study compared relative quantification of a complex protein mixture by a spectral counting-based method with quantification by an isotope labeling, ion chromatogram-based method [26]. The crude membrane proteins extracted from S. cerevisiae grown in rich and minimal media were analyzed by MudPIT and quantified using both approaches. It was found that the two quantitative methods showed a strong correlation when the peptides with high signal-to-noise ratio in the extracted ion chromatogram were used in the comparison. Moreover, spectral counting-based quantification proved more reproducible and had a larger dynamic range than the peptide ion chromatogram-based quantification [26].

In contrast to the chromatographic peak intensity approach, which requires sophisticated computer algorithms for automatic LC-MS peak alignment and comparison, no specific tools or algorithms have been developed for spectral counting, owing to its ease of implementation. However, normalization and statistical analysis of spectral counting datasets are necessary for accurate and reliable detection of protein changes in complex mixtures. A simple normalization method based on total spectral counts has been reported to account for the variation from run to run [27]. Since large proteins tend to contribute more peptides/spectra than small ones, a normalized spectral abundance factor (NSAF) was defined to account for the effect of protein length on spectral count [28, 29]. NSAF is calculated as the number of spectral counts (SpC) identifying a protein, divided by the protein’s length (L), divided by the sum of SpC/L for all proteins in the experiment. NSAF allows the comparison of abundance of individual proteins in multiple independent samples and has been applied to quantify the expression changes in various complexes [29, 30].
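
The NSAF definition above translates directly into a few lines of Python. The following is a minimal sketch; the protein names, spectral counts, and lengths are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def nsaf(spectral_counts, lengths):
    """Return NSAF per protein: (SpC / L) divided by the sum of SpC / L over all proteins."""
    saf = {protein: spectral_counts[protein] / lengths[protein] for protein in spectral_counts}
    total = sum(saf.values())
    return {protein: value / total for protein, value in saf.items()}


# Hypothetical spectral counts (SpC) and protein lengths (L, in residues) from one run.
example_counts = {"protein_A": 10, "protein_B": 10, "protein_C": 30}
example_lengths = {"protein_A": 100, "protein_B": 200, "protein_C": 300}

for protein, value in nsaf(example_counts, example_lengths).items():
    print(protein, round(value, 3))
```

In this example protein_A and protein_B have the same spectral count, but protein_A receives twice the NSAF because it is half as long. By construction the NSAF values of one run sum to 1, which is what makes values from independent runs comparable.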

Five different statistical tests have been compared by Zhang et al. to evaluate the significance of comparative quantification by spectral counts [31]. The Fisher’s exact test, goodness-of-fit test (G-test), AC test, Student’s t-test, and Local-Pooled-Error (LPE) test were performed on spectral count data collected by MudPIT analysis of yeast digests. The Student’s t-test was found to be the best when three or more replicates are available. The Fisher’s exact test, G-test, and AC test can be used when the number of replicates is limited (one or two); of these, the G-test has the advantage of computational simplicity.
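
To illustrate why the G-test is computationally simple, the sketch below applies one common form of the test to the spectral counts of a single protein in two runs. Expected counts are derived from the total spectral counts of each run, and the statistic is compared with a chi-square distribution with one degree of freedom. Whether this matches the exact variant used in [31] (for instance, with or without a continuity correction) is an assumption, and the counts are hypothetical.

```python
from math import erfc, log, sqrt


def g_test(count_1, count_2, total_1, total_2):
    """G statistic and P-value for one protein's spectral counts in two runs.

    count_1, count_2: spectral counts of the protein in run 1 and run 2.
    total_1, total_2: total spectral counts of run 1 and run 2, used to set
    the expected split of the pooled count under the null hypothesis.
    """
    pooled = count_1 + count_2
    expected_1 = pooled * total_1 / (total_1 + total_2)
    expected_2 = pooled * total_2 / (total_1 + total_2)
    g = 0.0
    for observed, expected in ((count_1, expected_1), (count_2, expected_2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g += 2.0 * observed * log(observed / expected)
    g = max(g, 0.0)  # guard against tiny negative values from rounding
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(g / 2.0))
    return g, p_value


# Hypothetical example: 30 versus 10 spectra in two runs of equal total size.
g_value, p = g_test(30, 10, 1000, 1000)
print(f"G = {g_value:.2f}, P = {p:.4f}")
```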

Relative quantification by spectral count has been widely applied to different complex biological samples, including analysis of urine samples from healthy donors and patients with acute inflammation [32], finding biomarkers in the human saliva proteome in type-2 diabetes [33], comparison of protein expression in yeast and mammalian cells under different culture conditions [11, 26, 29], distinguishing lung cancer from normal tissue [34], screening of phosphotyrosine-binding proteins in mammalian cells [35], and identifying differential plasma membrane proteins from terminally differentiated mouse cell lines [36].

2.3. Absolute Label-Free Quantification

In addition to relative quantification, label-free proteomics methods can also be used in the determination of absolute abundance of proteins. Protein abundance index (PAI), defined as the number of identified peptides divided by the number of theoretically observable tryptic peptides for each protein, was used to estimate protein abundance in the human spliceosome complex [37]. This index was later converted to the exponentially modified PAI (emPAI), defined as 10 raised to the power of PAI, minus one (emPAI = 10^PAI − 1) [38]. The emPAI approach was validated by estimating the absolute abundance of 46 proteins in a mouse whole-cell lysate whose abundances had been measured independently using synthetic peptides. The values of emPAI can be calculated easily with a simple script and do not require additional experimentation beyond the protein identification experiments. emPAI can therefore be used routinely for reporting approximate absolute protein abundance in a large-scale analysis.
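
As an example of such a simple script, the following Python sketch computes emPAI as 10^PAI − 1 and converts the values to molar percentages by dividing each emPAI by the sum over all proteins. The molar-percentage step, the function names, and the peptide numbers are illustrative assumptions rather than details given in this review.

```python
def empai(n_observed, n_observable):
    """emPAI = 10 ** PAI - 1, where PAI = observed peptides / observable peptides."""
    pai = n_observed / n_observable
    return 10 ** pai - 1


def molar_percent(empai_values):
    """Express each protein's emPAI as a percentage of the summed emPAI."""
    total = sum(empai_values.values())
    return {protein: 100.0 * value / total for protein, value in empai_values.items()}


# Hypothetical (observed, theoretically observable) peptide numbers per protein.
peptide_numbers = {"protein_A": (10, 10), "protein_B": (5, 10), "protein_C": (2, 20)}
empai_values = {
    protein: empai(observed, observable)
    for protein, (observed, observable) in peptide_numbers.items()
}

for protein, percent in molar_percent(empai_values).items():
    print(protein, round(empai_values[protein], 3), round(percent, 1))
```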

Recently, a modified spectral counting strategy termed absolute protein expression (APEX) profiling was developed to measure the absolute protein concentration per cell from the proportionality between the protein abundance and the number of peptides observed [39]. The key to APEX is the introduction of appropriate correction factors that make the fraction of expected number of peptides and the fraction of observed number of peptides proportional to one another. The protein’s absolute abundance is indicated by an APEX score, which is calculated from the fraction of observed peptide mass spectra associated with one protein, corrected by the prior estimate of the number of unique peptides expected from a given protein during a MudPIT experiment. The critical correction factor for each protein (called Oi value) is calculated by using a machine learning classification algorithm to predict the observed tryptic peptides from a given protein based upon peptide length and amino acid composition. APEX successfully determined the abundance of 10 proteins that were spiked in a yeast cell extract with known amounts. The absolute protein abundance of yeast and E. coli proteomes analyzed by APEX correlated well with the measurements from other absolute expression measurements such as high-throughput analysis of fusion proteins by western blotting or flow cytometry. The APEX technique has recently been developed as APEX Quantitative Proteomics Tool [40], a free open source Java implementation for the absolute quantification of proteins (http://pfgrc.jcvi.org/).
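
The structure of the APEX score can be sketched in Python as follows. This is a simplified illustration and not the APEX Quantitative Proteomics Tool itself. It assumes that the score takes the form of the corrected spectral count of one protein divided by the sum of corrected counts over all proteins, scaled by a total abundance. The Oi values, which APEX obtains from a trained classifier, are supplied here as hypothetical numbers, and the optional identification probabilities are likewise an assumption.

```python
def apex_scores(spectral_counts, expected_peptides, id_probabilities=None, total_abundance=1.0):
    """Sketch of an APEX-style score for each protein.

    spectral_counts: observed spectral count per protein.
    expected_peptides: correction factor (Oi) per protein; assumed given here.
    id_probabilities: optional identification probability per protein (defaults to 1).
    total_abundance: scaling constant, for example total protein molecules per cell.
    """
    if id_probabilities is None:
        id_probabilities = {protein: 1.0 for protein in spectral_counts}
    corrected = {
        protein: spectral_counts[protein] * id_probabilities[protein] / expected_peptides[protein]
        for protein in spectral_counts
    }
    total = sum(corrected.values())
    return {protein: total_abundance * value / total for protein, value in corrected.items()}


# Hypothetical input: protein_A yields twice as many spectra as protein_B,
# but is also expected to yield twice as many observable peptides.
scores = apex_scores(
    {"protein_A": 20, "protein_B": 10},
    {"protein_A": 2.0, "protein_B": 1.0},
    total_abundance=1000.0,
)
print(scores)
```

In this example the two proteins receive the same score, because the higher spectral count of protein_A is fully explained by its higher expected number of observable peptides.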

3. Commercially Available Software for Label-Free Quantitative Proteomics

There has recently been a rapid increase in the development of new bioinformatics tools that aid in automated label-free analysis for comparative LC-MS. The data processing pipelines generally include data normalization, time alignment, peak detection, peak quantification, peak matching, identification, and statistical analysis. Numerous open source and commercial software packages are currently available. The open source programs include MapQuant, MZmine, msInspect, OpenMS, MSight, SuperHirn, and PEPPeR [41, 42]. The commercially available software is listed in Table 1. DeCyder MS is based on DeCyder 2D Differential Analysis Software. It consists of two main analysis features: peptide detection with the PepDetect module and run-to-run matching with the PepMatch module. The PepDetect module provides background subtraction, isotope and charge-state deconvolution, and peak volume calculations using imaging algorithms. This module also provides the option of submitting all or a selected subset of peptides for protein identification by database searching. The PepMatch module aligns peptides from different LC-MS runs and detects small quantitative differences between peptides across multiple runs with statistical confidence. Various normalization techniques can be applied to further improve results [18].

SIEVE software employs an algorithm called ChromAlign for chromatographic alignment prior to finding differences that are statistically meaningful. The software can determine a P-value for the expression ratio of each differential peak, providing an extra measure of confidence. Peptides that show statistically significant differences can be searched against protein databases to determine peptide and protein identities. Its prefiltering function reduces the number of spectra that need to be searched, decreases the time spent on identification, and increases the throughput of complex biomarker discovery experiments.

The Rosetta Elucidator system is not only label-free quantification software, but also a data management platform to store and manage large volumes of MS data. It also supports labeling analysis such as SILAC and ICAT. Elucidator uses an algorithm called PeakTeller for peak detection, extraction, and quantitation from mass spectrometry data. It uses PeptideTeller and ProteinTeller for verifying correct peptide/protein assignments for all features. The system supports a wide range of MS instruments and database search algorithms, and provides comprehensive visualization and analysis tools [43]. It supports label-free quantification by spectral counting as well.

ProteinLynx Global Server supports label-free quantification by peak intensity. It is also a database searching engine for peptide/protein identification [21, 23].

4. Conclusions

The rapid development of label-free quantitative proteomic techniques has provided fast and low-cost measurement of protein expression levels in complex biological samples. Peak intensity-based comparative LC-MS and spectral count-based LC-MS/MS are the two most commonly used label-free quantification methods. Compared with isotope-labeling methods, label-free experiments need to be more carefully controlled, due to possible error caused by run-to-run variations in the performance of LC and MS. However, the development of highly reproducible nano-HPLC separation, high-resolution mass spectrometers, and sophisticated computational tools has greatly improved the reliability and accuracy of label-free, comparative LC-MS. Commercially available data processing software is able to automatically detect, match, and analyze peptides from hundreds of different LC-MS experiments simultaneously, which provides a high-throughput technique for disease-related biomarker discovery. The spectral count-based label-free method correlates positively with isotope-labeling quantification and allows both relative and absolute quantification of protein abundance. These label-free quantitative approaches have provided rigorous, powerful tools for analyzing protein changes in large-scale proteomics studies.

Acknowledgments

The authors thank Ms. Sheryl Harvey for critical review. This work is supported by NIH Grant no. RR020843.