EMBRYONIC STEM CELLS: CHARACTERIZATION SERIES
Transcriptome Research Center, National Institute of Radiological Sciences, Chiba, Japan
Key Words. Embryonic stem cell • Mouse • Gene expression profiling • Low-abundance transcripts
Correspondence: Masumi Abe, Ph.D., Transcriptome Research Center, National Institute of Radiological Sciences, Anagawa 4-9-1, Inage-ku, Chiba-shi, Chiba 263-8555, Japan. Telephone: 81-43-206-3219; Fax: 81-43-251-4593; e-mail: abemasum{at}nirs.go.jp
Received January 4, 2006;
accepted for publication June 15, 2006.
First published online in STEM CELLS EXPRESS July 6, 2006.
| ABSTRACT |
|---|
| INTRODUCTION |
|---|
Polymerase chain reaction (PCR)-based methods such as cDNA-amplified fragment length polymorphism (AFLP) and arbitrarily primed PCR can detect rarely expressed genes and novel transcripts, but their coverage is not very high. Several improved methods that overcome previous technical difficulties have been developed, but these still suffer from a substantial rate of false positives caused by misannealing during the PCR step [3, 4]. In addition, each peak does not necessarily correspond to a single transcript, making these methods unsuitable for genome-wide profiling.
High-coverage gene expression profiling (HiCEP) features substantial improvements over cDNA-AFLP, especially in the selective PCR step [5]. It can detect unknown transcripts and is sensitive enough to detect low-abundance transcripts. It features an extremely low rate of false positives (less than 4%), enabling the user to assign each peak to a specific gene unequivocally. In addition, it is noteworthy that highly expressed transcripts do not in general mask moderately or rarely expressed ones the way they do with other procedures such as EST collection and SAGE, since the DNA fragments (derived from transcripts) are usually different in molecular size and easily separated by capillary electrophoresis.
HiCEP can detect approximately 70%–80% of human and mouse transcripts. The dynamic range of the analysis is 10^3, from 1 transcript copy per cell to approximately 1,000 transcript copies per cell, and its resolution for expression difference is approximately 1.2-fold [5].
We used HiCEP to analyze embryonic stem (ES) cells. Many unknown or rarely expressed transcripts are thought to be expressed in stem cells; the implications of such transcripts for the pluripotency of ES cells have been discussed elsewhere [6, 7].
| MATERIALS AND METHODS |
|---|
HiCEP Reaction
The HiCEP reaction was performed according to a previous report [5]. An outline is shown in supplemental online Figure 1. Briefly, RNA is converted to cDNA by reverse transcriptase with biotinylated oligo(dT), and double-strand cDNA is prepared, digested by a specific restriction enzyme, and trapped by avidin bound to magnetic beads. After the digested fragments are washed away, leaving the 3'-most region bearing oligo(dT)-biotin on the beads, a synthetic adaptor is ligated, and the trapped templates are digested by another restriction enzyme. Most of the steps are the same as in standard AFLP except for the primers and annealing temperature in selective PCR. We found that annealing at 70–72°C with specific primers whose GC content is 55%–60% removes almost all false positives and thereby reduces the total number of peaks. This reduction enables us to use four-nucleotide instead of six-nucleotide restriction enzymes for high-coverage detection. HiCEP analysis consists of two reactions, one for detecting 5'-MspI-MseI-3' fragments and the other for 5'-MseI-MspI-3' fragments. Using the same two enzymes twice in opposite order tends to maximize coverage while minimizing the number of fragments that are detected twice.
Fractionation and Sequencing of HiCEP Peaks
One microliter of each selective PCR product, 2.7 µl of formamide, 0.3 µl of GeneScan 500 ROX (Applied BioSystems, Foster City, CA, http://www.appliedbiosystems.com) and 2.0 µl of 10x loading buffer were mixed, denatured by incubation at 95.0°C for 2 minutes, and loaded on a denaturing slab gel (20 cm x 40 cm) of 4%, 6%, or 10% polyacrylamide containing 7.0 M urea, followed by electrophoresis at 1,500 V for 4 hours. Fluorescence from the products was detected by Typhoon 9210 (GE Healthcare, Uppsala, Sweden, http://www.gehealthcare.com). The portions of the gel containing bands were cut out and suspended in 60 µl of TE (10 mM Tris-HCl, 1 mM EDTA, pH 8.0) buffer. After 30 minutes of incubation at 20°C, 2.0 µl was used for PCR, and the product was cloned using the pGEM-T Easy Vector System (Promega, Madison, WI, http://www.promega.com). To confirm the reliability of the cloning step, we compared the electrophoretic migration distance of each peak in the HiCEP analysis with that of the fluorescent product resynthesized using the corresponding clone as a template. The reliability of our criterion was also supported by competitive PCR using a gene-specific primer [11]. T7 primer and the BigDye Terminator v3.1 cycle sequencing kit (Applied BioSystems) were used for sequencing reactions.
Assigning HiCEP Peaks to Genes
To determine the UniGene cluster, we did a BLAST search of the mRNA sequences in mouse UniGene (Build 141) and then searched for ESTs in UniGene. We looked for sequences with at least 90% sequence matching and homology. We checked whether the hit sequence had MspI and MseI sites and the correct selective sequence to correspond to a pattern of selective PCR. To determine the genome region from which the transcripts detected by HiCEP are expressed, we searched mouse Genome (NCBI Mouse Build 33.1) with the BLAST Like Alignment Tool (BLAT) (http://genome.ucsc.edu/) followed by Washington University BLAST (http://blast.wustl.edu/).
Reverse Transcription-Polymerase Chain Reaction Analysis
We randomly selected 50 low-abundance transcripts with peak height less than 500 from category IIIb to validate the HiCEP analysis. The transcripts in category IIIb are strong candidates for unknown transcripts, because they are expressed from outside known gene loci. In our system, intensity 500 corresponds to approximately 10 transcript copies per cell [5]. Primers for the candidates were designed using the program Primer3 (http://frodo.wi.mit.edu/cgi-bin/primer3_www.cgi). Reverse transcription-polymerase chain reaction (RT-PCR) was performed using the QuantiTect SYBR Green RT-PCR kit (Qiagen). Each reaction mixture contained 20 ng of E14 total RNA, and the signals were monitored by PRISM 7700 (Applied BioSystems). The reaction conditions were 50°C for 30 minutes and 95°C for 15 minutes for cDNA synthesis, followed by 50 cycles of 94°C for 10 seconds, 55°C for 30 seconds, and 72°C for 40 seconds.
| RESULTS |
|---|
Content of Peaks
Our AFLP-based method divides huge numbers of cDNA molecules into 512 groups using two nucleotides at either end of the fragments generated by digestion with two restriction enzymes, MspI and MseI. What we call the "predicted region" of a transcript is the portion of the transcript that should be amplified in the HiCEP selective PCR step; that is, the 3'-most fragment that has the second enzyme recognition site on the 3' side and the first enzyme recognition site on the 5' side (if both recognition sites exist; otherwise the predicted region is undefined and the transcript is not covered by HiCEP analysis) (supplemental online Fig. 1).
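The grouping and the "predicted region" described above can be illustrated with a minimal sketch. It rests on two assumptions that the text does not state explicitly: that the 512 groups arise as 16 two-nucleotide combinations at each fragment end (256 primer pairs) in each of the two reactions, and that the predicted region runs from the 3'-most first-enzyme site to the nearest second-enzyme site downstream of it, as the bead-trapping step in Materials and Methods implies. The function name `predicted_region` and the toy cDNA are hypothetical; the recognition sequences of MspI (CCGG) and MseI (TTAA) are standard. Exact cut positions within the sites are ignored for simplicity.

```python
from itertools import product

MSPI = "CCGG"  # MspI recognition sequence
MSEI = "TTAA"  # MseI recognition sequence

# Assumption: two selective nucleotides per fragment end -> 4**2 = 16 choices per end.
selective = ["".join(p) for p in product("ACGT", repeat=2)]
primer_pairs_per_reaction = len(selective) ** 2  # 16 * 16 = 256
total_groups = 2 * primer_pairs_per_reaction     # two reactions, opposite enzyme order


def predicted_region(cdna, first_site, second_site):
    """Return the fragment from the 3'-most first-enzyme site to the nearest
    downstream second-enzyme site, or None if the transcript is not covered.

    Hypothetical helper: works on site positions only, not on exact cut points.
    """
    start = cdna.rfind(first_site)  # 3'-most occurrence of the first site
    if start == -1:
        return None
    end = cdna.find(second_site, start + len(first_site))
    if end == -1:
        return None
    return cdna[start:end + len(second_site)]


# Toy sense-strand cDNA (made up for illustration), 5' -> 3'.
toy = "AAA" + MSPI + "AAA" + MSEI + "GGG" + MSPI + "GAC" + MSEI + "CC" + "AAAA"

msp_mse = predicted_region(toy, MSPI, MSEI)  # 5'-MspI-MseI-3' reaction
mse_msp = predicted_region(toy, MSEI, MSPI)  # 5'-MseI-MspI-3' reaction
print(total_groups, msp_mse, mse_msp)
```

In this toy case the transcript is covered by the 5'-MspI-MseI-3' reaction but not by the reverse one, which illustrates why running both reactions tends to raise coverage.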
A total of 16,873 kinds of fragments were found in the 14,383 peaks, meaning that several peaks were composed of multiple fragments. An in silico study revealed that 14,332 (86%) of these fragments have a match in the UniGene mRNA or EST databases (Fig. 1B). We divided them into four categories: completely known transcripts (category I), partially known transcripts (category II), unknown transcripts (category III), and a few fragments too short or repetitive to assign to a specific region in the genome (category IV) (Fig. 1B).
We subsequently subdivided the categories further. Category Ia includes predicted regions of completely known transcripts. Ib includes parts of the predicted regions of completely known transcripts, and Ic includes portions outside the predicted regions of completely known transcripts. Categories Ib and Ic contain unknown forms of alternative transcripts and transcripts that exhibit poly(A) site heterogeneity. Note that first strand synthesis sometimes occurs in A-rich regions within the transcripts and that these artifacts can also be classified into these categories. Fragments in category II were found in the UniGene EST sequences database, but their full lengths are not available. Category IIIa includes transcripts from intronic regions, whereas category IIIb includes transcripts from fully novel gene loci. All this information is shown in supplemental online Tables 1–6.
Candidates for Novel Transcripts in the E14 Cells
A total of 698 (546 in category IIIb + 152 in IIIc) transcripts are derived from a genomic region in which no open reading frame has been identified or predicted (Fig. 1B). These can be assumed to be transcripts expressed from novel gene loci. An additional 243 in category Ib are unregistered transcripts expressed from known gene loci, making a total of 941 (698 + 243) unregistered transcripts.
Categories Ic and IIIa also can contain novel transcripts, but we did not count these as novel, because these categories can also contain transcripts that include new poly(A) sites in known transcripts, intronic regions from immature forms of known transcripts, or artifacts caused by misannealing as mentioned above.
Our analysis suggests that approximately 2,000 (941 x 33,136/14,383) novel transcripts remain in E14 cells and that these novel transcripts tended to produce low-intensity peaks. RT-PCR confirmed their expression, as mentioned below.
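The extrapolation above can be reproduced with a few lines of arithmetic. This is a sketch that assumes 33,136 is the total number of peaks detected and 14,383 the number of peaks characterized by cloning, so that the unregistered transcripts found among the cloned peaks are scaled up proportionally; the variable names are illustrative only.

```python
# Counts taken from the text.
novel_loci = 546 + 152            # categories IIIb + IIIc
unregistered_known_loci = 243     # category Ib
unregistered_cloned = novel_loci + unregistered_known_loci

peaks_characterized = 14_383      # peaks isolated and sequenced
peaks_total = 33_136              # assumption: all peaks detected in E14 cells

# Proportional scale-up from the characterized peaks to all peaks.
estimated_novel = unregistered_cloned * peaks_total / peaks_characterized
print(unregistered_cloned, round(estimated_novel))
```

The scale-up treats the cloned peaks as a representative sample of all peaks, which is the same assumption the Discussion flags as a possible source of bias.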
We confirmed that no peaks, except for a few mechanical artifacts, were detected in a HiCEP analysis with genomic DNA only as a template, and no difference was observed between the analyses using total RNA with and without DNaseI treatment (data not shown), indicating that no signals come from genomic DNA, which can contaminate the RNA fraction.
The novel transcripts detected by our analysis are of particular interest, because they may be ES-specific. Using randomly selected transcripts, we examined whether these novel transcripts are expressed in an ES-specific manner by testing for a decrease in expression after the removal of LIF from the culture medium (Fig. 2A). Two ES-specific transcripts were found in category IIIa, and these were assigned to intronic regions of known genes whose functions were unknown. Four transcripts whose expression decreased were identified in category IIIb and one in category IV. These transcripts are derived from regions of the genome where no open reading frame has been predicted.
It would be interesting to know whether the noncoding transcript candidates play roles in the mechanism underlying the pluripotency of ES cells. Among the candidates detected, we found two transcripts whose expression changes when LIF is removed from the culture medium, 3_1463 (supplemental online Table 5) and 1c2_6130 (supplemental online Table 3), corresponding to B930044J18 and 4833429C02, respectively, in the FANTOM noncoding transcript library (http://www.fantom.gsc.riken.go.jp). The former was suppressed, and the latter was induced (Fig. 2B). B930044J18 is located in the intronic region of A630034I12Rik, whose function is unknown, and 4833429C02 is located 1 kilobase pair (kbp) downstream from the 3' end of the paternally expressed 3 gene (Peg3). No functional information has been obtained yet, so additional studies on these transcripts are needed.
| DISCUSSION |
|---|
If misannealing occurs during the selective PCR step, the sequences at both ends of the product fragments reflect the sequence of primers used. Therefore, the sequences at both ends of all products amplified by selective PCR using a given set of primers are the same, regardless of whether or not misannealing occurs. So we can tell whether misannealing has occurred by comparing the HiCEP fragment with the sequence registered in the public database. Analyzing the 16,873 cloned fragments revealed that the false-positive rate is 3.8%. Incidentally, one particular primer set was responsible for most of the false positives (manuscript in preparation).
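The comparison described above can be sketched as follows. The sketch assumes that the selective nucleotides are the two bases immediately inside each restriction site and that both are written on the sense strand; the function name `is_false_positive` and the example sequences are hypothetical and are not the authors' pipeline.

```python
def is_false_positive(db_fragment, first_site, second_site, sel_5p, sel_3p):
    """Return True if the database sequence disagrees with the primers'
    selective bases, i.e. the product can only have arisen by misannealing.

    db_fragment: database (not cloned) sequence of the fragment, sense strand,
                 starting with first_site and ending with second_site.
    sel_5p, sel_3p: selective bases of the primer pair (assumed sense strand).
    """
    if not (db_fragment.startswith(first_site) and db_fragment.endswith(second_site)):
        raise ValueError("fragment must start and end with the enzyme sites")
    # Bases lying between the two recognition sites.
    inner = db_fragment[len(first_site):len(db_fragment) - len(second_site)]
    true_5p = inner[:len(sel_5p)]    # bases next to the 5' site in the database
    true_3p = inner[-len(sel_3p):]   # bases next to the 3' site in the database
    return (true_5p, true_3p) != (sel_5p, sel_3p)


# Made-up database fragment: MspI site, "CA", filler, "GG", MseI site.
db = "CCGG" + "CA" + "TTTGACT" + "GG" + "TTAA"

matched = is_false_positive(db, "CCGG", "TTAA", "CA", "GG")     # primers agree
mismatched = is_false_positive(db, "CCGG", "TTAA", "CT", "GG")  # 5' bases differ
print(matched, mismatched)
```

The cloned product itself cannot be used for this check, because its ends always carry the primer sequence; only the database sequence reveals the bases that were actually present in the template.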
Each peak corresponds roughly to one transcript, but this correspondence depends on the complexity of the transcriptome. Even if selective PCR works perfectly (i.e., without misannealing), the amplified fragments may overlap by chance. The incidence of overlapping depends on the complexity of the transcriptome. We estimated the transcripts-per-peak value through detailed analysis of the 14,383 peaks isolated and characterized as described above; the ratio is 1.17. Some peaks are still overlapping peaks corresponding to multiple transcripts in mice, and this value will only increase with more extensive cloning. The effect is more severe in species having a complex genome, although the use of additional restriction enzyme sets will overcome this difficulty. Coverage was estimated in silico to be 85.7% using information from 33,434 full-length cDNA sequences in the public database. Taken together, these data lead us to conclude that more than 40,000 kinds of transcripts are expressed in E14 cells.
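One way these figures could combine into the estimate of more than 40,000 transcripts is sketched below. It assumes that 33,136 (the figure used in the extrapolation in Results) is the total number of peaks, and that the total is obtained by multiplying by the transcripts-per-peak ratio and dividing by the in silico coverage; the text does not spell out this calculation, so it should be read as an illustration rather than the authors' exact procedure.

```python
fragments_cloned = 16_873     # kinds of fragments found by cloning
peaks_characterized = 14_383  # peaks in which they were found
peaks_total = 33_136          # assumption: all peaks detected in E14 cells
coverage = 0.857              # in silico coverage estimate

# Average number of distinct transcripts hidden in one peak.
transcripts_per_peak = fragments_cloned / peaks_characterized

# Transcripts detected by HiCEP, then corrected for those it cannot detect.
detected = peaks_total * transcripts_per_peak
estimated_total = detected / coverage

print(round(transcripts_per_peak, 2), round(detected), round(estimated_total))
```

Under these assumptions the estimate stays above 40,000, and any increase in the transcripts-per-peak ratio from further cloning would push it higher.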
We estimated the total number of transcripts using the information obtained by random cloning of more than 16,000 transcripts. However, we could not totally exclude the possibility that our sample is biased, because the expression level of the transcripts isolated by random cloning tends to be rather high. On this matter, additional studies will be needed. Nevertheless, we are confident that at least 40,000 kinds of transcripts are expressed, since the transcripts-per-peak ratio can only increase with further cloning: the 14,383 peaks almost certainly contain more than 16,873 transcripts (Fig. 1B).
On the other hand, annealing between the oligo(dT) primer and A-rich regions within mRNA sometimes occurs. This means that the predicted total number of transcripts could be an overestimate, since multiple tags can be generated from one transcript. We use oligo(dT) primer for cDNA synthesis, so this is an unavoidable problem. In addition, it is hard to discriminate between such artifacts and novel forms of transcripts whose polyadenylation occurs at unknown sites. We experimented with primer-annealing temperatures of 42°C and 50°C, and few differences were observed in the peak pattern, so we performed the reaction at 50°C to minimize this effect.
Finally, we want to emphasize that we detected approximately 4,000 signals with HiCEP analysis in Saccharomyces cerevisiae, which is the simplest eukaryote. This number is nearly equal to that estimated by another method under similar culture conditions [5]. In addition, the prediction of peak positions using public databases was quite efficient for most signals due to the presence of few exons and thus few alternative transcripts, almost all of which are already identified. These results mean that few artifacts generated by partial digestion by restriction enzymes or inefficient washing were observed.
Ability of HiCEP to Quantitatively Measure Rarely Expressed Transcripts
HiCEP analysis is sensitive enough to detect 1 transcript per cell [5]. This is a critical point for studies such as gene network analysis because substantial numbers of gene-expression regulation factor-coding transcripts are low-abundance transcripts.
Here, we add more information to indicate the potential of our method by comparing the results of HiCEP and those of quantitative PCR analysis. We statistically analyzed the expression of unknown transcripts and found that significant numbers of unknown transcripts exhibit low levels of expression (Fig. 3A). Next, we attempted to design primers for 50 unknown transcripts whose HiCEP peaks exhibited low intensity, and for 46 of the 50, our primers worked well. Real-time PCR detected clear bands for 43 of the 46 (Fig. 3B) and revealed low expression similar to that exhibited by HiCEP analysis (supplemental online Fig. 3B). We confirmed that these products were not amplified from genomic DNA contaminating the RNA preparation (supplemental online Fig. 3A). HiCEP technology has been validated by quantitative PCR in other studies as well [5, 13–15].
We examined another ES cell line, R1, to see whether the database for a given cell type is applicable to cells of a related type, and we found that the positions of most peaks were identical (Fig. 5). However, a significant number of peaks were different in height (supplemental online Fig. 4). We analyzed the R1 peaks to determine whether the peaks appearing in both R1 and E14 lines were derived from identical transcripts. We cloned 43 randomly selected signals using the R1 cell line with the MspI-CA and MseI-GG primer set, sequenced them, and found that the transcripts corresponding to these peaks were identical to those of the E14 peaks (data not shown). This implies that most of the shared peaks are derived from identical transcripts. These results show that the HiCEP peak database of E14 is applicable to R1 cell analysis and strongly suggest that the database is useful for most murine ES cell lines.
We compared the expression of some genes whose products play critical roles in the mechanism underlying pluripotency: Nanog, Oct3/4, and Sox2. We found that Oct3/4 and Sox2 are expressed clearly in E14 cells, but no expression could be observed in mouse embryonic fibroblasts (MEFs) (supplemental online Fig. 5). HiCEP cannot be used to analyze the Nanog transcript, because it contains no recognition sites for the restriction enzyme MspI, which HiCEP uses. These results were confirmed by quantitative PCR, and the ES-specific Nanog expression was also confirmed.
It is also noteworthy that we found many (1,858) genomic regions from which transcripts are expressed in both sense and antisense orientations. This suggests that there are more than 4,000 such transcripts in E14 cells. Our database will provide quantitative information on these transcripts. The next question is what their role is.
It has been suggested that ES cells are relatively unstable in in vitro culture and often partially differentiate under standard culture conditions [16]. Whether or not our database really represents the transcriptome of the immature stem cell is an important point. The following observations suggest that most ES cells used in the present study maintain their pluripotency. (a) The ES cells injected into blastocysts developed into intact mice. (b) Most of them were alkaline phosphatase-positive. (c) SSEA1 products were positive [17]. Nanog, Oct3/4, Sox2, Rex1, fibroblast growth factor 4 (Fgf4), and Utf1 were also positive [18–22]. (d) Three mouse ES cell lines (E14, R1, and TT2) share quite similar expression profiles (Fig. 5) (manuscript in preparation). (e) We identified a substantial number of transcripts whose expression decreases drastically upon removal of LIF in culture (Fig. 2). However, at the same time, slight expression of H19 and Fgf5 transcripts, which are known differentiation markers, was also observed (data not shown).
There have been several studies of the transcriptome of ES cells. SAGE suggested 44,569 unique tags, including 31,184 that were identified once, in mouse ES cell line R1 [23]. We attempted to compare their results with ours. However, direct comparison of tag sequences between SAGE and HiCEP was impossible, since the tags of these two types of analyses are derived from different regions of the transcript. Only tags assigned to genes whose full-length sequence or genome organization had already been determined could be used for the comparison. Even in this case, the comparison would not be accurate, since alternative transcripts in a gene locus could not be discriminated.
Focusing on known gene loci, we compared the gene symbols corresponding to the results of the two analyses and found that 5,895 tags were detected in both analyses, 4,712 were detected by SAGE only, and 1,727 were detected by HiCEP only. This comparison has several limitations. Alternative transcripts from the same gene locus were counted as one. The tags assigned to novel gene loci could not be included, because their full-length cDNAs and genomic structures have not yet been determined. Furthermore, most SAGE tags match more than two regions on the genome and could not be assigned to specific genes, resulting in overestimation of the number of gene loci.
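The three counts above fully determine the per-method totals and the degree of overlap, as the following short sketch shows. The derived quantities (per-method totals, the shared fraction, and the Jaccard index) are our own illustration and are not reported in the text; the variable names are hypothetical.

```python
# Counts from the gene-symbol comparison in the text.
both = 5_895
sage_only = 4_712
hicep_only = 1_727

sage_total = both + sage_only             # gene symbols seen by SAGE
hicep_total = both + hicep_only           # gene symbols seen by HiCEP
union = both + sage_only + hicep_only     # seen by at least one method

share_of_hicep_in_sage = both / hicep_total  # fraction of HiCEP hits also in SAGE
share_of_sage_in_hicep = both / sage_total   # fraction of SAGE hits also in HiCEP
jaccard = both / union                       # overlap relative to the union

print(sage_total, hicep_total, union)
print(f"{share_of_hicep_in_sage:.1%} {share_of_sage_in_hicep:.1%} {jaccard:.1%}")
```

Because ambiguous SAGE tags inflate the number of SAGE gene loci, the SAGE-only count, and with it these fractions, should be read with the limitations listed above in mind.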
In general, rarely expressed genes are detected as singletons in SAGE analysis, and this discourages further statistical study since the assignment of singletons is prone to sequencing errors and single nucleotide polymorphisms. Our results contain accurate gene expression information for over 70% of the transcripts and annotations for 45%. With this database and HiCEP analysis, quantitative gene expression profiling of mouse ES cells becomes possible. This system enables us to measure quantitatively a number of noncoding transcripts, as well as known and unknown protein-encoding transcripts, with high resolution. Currently, we are developing a system using computers to compare the expression change of every peak detected in HiCEP analysis automatically and comprehensively [24]. We will release updated information about the HiCEP database on our homepage (http://133.63.22.11/english/index.html).
"How many transcripts are expressed in mice?" is quite an important question, and answering it is one of our goals. Using the whole body to count the total number of transcripts is not realistic because analysis would suffer from the complexity of cell types, leading analysts to miss a substantial number of low-abundance transcripts. We must integrate results using every tissue in the body at every stage of differentiation. Although we report only on a database of embryonic stem cells here, we plan to set up a system in which HiCEP analysis results will be registered and available to the scientific community, enabling us to address this important issue.
We used oligo(dT) primer for the cDNA synthesis, so we were measuring only RNA molecules with poly(A) tails. Recently it has become clear that there are several types of RNA molecules without poly(A) tails, but our system does not detect such molecules [25].
| CONCLUSION |
|---|
| DISCLOSURES |
|---|
| ACKNOWLEDGMENTS |
|---|
| References |
|---|