Why do we use cRNA instead of initial extracted RNA in the microarray technique?

I am revising for an exam in transcriptomics and stumbled upon a question about microarrays. Our regular workflow is to

  1. extract RNA from cells
  2. generate cDNA from the initial RNA using reverse transcription
  3. transcribe cRNA from cDNA
  4. label cRNA with biotin
  5. fragment biotin labeled cRNA
  6. hybridize to a plate, scan and quantitate

So the question is: since we use RNA for the subsequent quantification steps, why should we perform reverse transcription followed by transcription to produce cRNA when we could just use the initial RNA?

Is cRNA more stable than RNA? Or do we make already-labeled cRNA by using labeled nucleotides?

After clarification via the comments on the OP's question:

1) The procedure you describe is not for standard RNA microarray experiments; it is for oligonucleotide microarrays. This type of chip is special because it has multiple probes for each gene, which allows detection of special features (e.g. mutations, alternative splicing, …).

2) The reason for generating cDNA, followed by in vitro transcription and biotinylation, is not explicitly stated in the OP's source, but there are some possible advantages:

  • Amplification of material: cDNA can easily be amplified, which makes it easier to analyse samples with low RNA yield or genes with low expression levels, or to analyse the same sample on multiple chips.

  • Sensitivity: oligonucleotide arrays may use very short probes (10-25 bases), for which a direct fluorescence readout may not be sensitive enough. Biotinylation of a given RNA base (by using pre-biotinylated NTPs during in vitro transcription) can put multiple 'marks' on each fragment, which can be read out with streptavidin-coupled signals. Additionally, this would allow the signal strength to change in response to certain SNPs/mutations.


My guess is that the final double-stranded cDNA obtained after RT-PCR (not just the first strand) is transcribed to make ssRNA, which can then be used as a probe. ssDNA would also work, but cDNA synthesis followed by PCR does not produce ssDNA. You could simply use the first strand after RT, but the concentration would be lower. With a cDNA library, you can produce a lot of RNA using in vitro transcription (IVT).

Reverse transcription polymerase chain reaction

Reverse transcription polymerase chain reaction (RT-PCR) is a laboratory technique combining reverse transcription of RNA into DNA (in this context called complementary DNA or cDNA) and amplification of specific DNA targets using polymerase chain reaction (PCR). [1] It is primarily used to measure the amount of a specific RNA. This is achieved by monitoring the amplification reaction using fluorescence, a technique called real-time PCR or quantitative PCR (qPCR). Combined RT-PCR and qPCR are routinely used for analysis of gene expression and quantification of viral RNA in research and clinical settings.

The close association between RT-PCR and qPCR has led to metonymic use of the term qPCR to mean RT-PCR. Such use may be confusing, [2] as RT-PCR can be used without qPCR, for example to enable molecular cloning, sequencing or simple detection of RNA. Conversely, qPCR may be used without RT-PCR, for example to quantify the copy number of a specific piece of DNA.

Results and discussion

Handling and initial processing of leukemic cell samples

Leukemic material is usually obtained from peripheral blood or bone marrow. A variable length of time may pass between drawing the sample and laboratory evaluation with further processing (freezing, purification or RNA extraction for downstream microarray analyses), especially within the framework of multi-center studies. 40 As shown recently, blood or bone marrow drawn in sterile tubes with appropriate anticoagulants should be processed within 24 h. 41 Material stored at room temperature for longer periods of time, particularly if sent in polystyrene boxes in the case of multi-center studies, can still yield useful data if used in large classification series. These patient samples and the time until processing should be annotated. Chelation of divalent cations during long storage in ethylenediaminetetraacetic acid (EDTA) could negatively influence the integrity of lymphocytes. 42, 43 However, there are no data on the role of different anticoagulants in gene expression studies. Therefore, no consensus on the use of a certain anticoagulant has been reached. Most importantly, the same anticoagulant should be used throughout the study.

In most studies, Ficoll density centrifugation is used to isolate the mononuclear cell fraction containing the leukemic cells. The main advantage of Ficoll density centrifugation is to isolate mononuclear cells while granulocytes, erythrocytes, reticulocytes and platelets are removed. In particular, the exclusion of granulocytes is an advantage compared to all other methods, with respect to the ‘purity’ of gene profiles and the quality of RNA, as it is notoriously difficult to obtain good RNA from granulocytes (U Lehmann, personal communication). The exclusion of reticulocytes reduces the contribution of globin mRNA to the expression profile. However, a reduction of globin mRNA can also be achieved using erythrocyte lysis. 41 An additional advantage of Ficoll is that it can readily be used on thawed material to remove dead cells, resulting in better RNA quality.

The use of the PAXgene blood RNA system (PreAnalytiX GmbH, Hombrechtikon, Switzerland) as a stabilization reagent offers an advantage over the shipping of non-stabilized samples by directly preserving RNA. However, for bone marrow samples, it was shown that the PAXgene reagent provided insufficient protection from pre-analytical alterations for a number of transcripts. 41 In addition, a disadvantage of this approach is contamination with, for example, large amounts of globin and other erythrocyte-specific mRNAs, which, owing to their abundance, significantly influence the whole gene expression profile. 44 Although Affymetrix, for instance, has developed globin reduction protocols, these are quite time consuming and difficult to implement in large-scale clinical studies. However, globin reduction protocols provide a huge advantage for solid tumors, where the application of Ficoll or other methods of purification (e.g. antibody-based purification by magnetic activated cell sorter (MACS)) is difficult.

If RNA is not directly prepared (see below), cells can then be frozen, preferably stepwise, at a density below 20 × 10^6 cells/ml to ensure good recovery upon thawing. In our experience, rapid thawing at 37°C produces the best results in terms of recovery of total RNA from the leukemic cells. This is because rapid thawing (until a small clump remains in the ampoule, followed by rapid dilution in medium devoid of dimethyl sulfoxide) leads to the best recovery of leukemic cells.
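The density limit above translates into a simple volume calculation. The following is a minimal sketch, not part of the original protocol: the function names and the 1 ml vial volume are assumptions chosen for illustration, and only the 20 × 10^6 cells/ml bound comes from the text.

```python
import math

# Upper bound on freezing density quoted in the text (cells per ml).
MAX_DENSITY_PER_ML = 20e6


def min_freezing_volume_ml(total_cells: float,
                           max_density: float = MAX_DENSITY_PER_ML) -> float:
    """Volume (ml) at which the suspension reaches exactly max_density.

    The text asks for a density *below* the bound, so in practice one would
    use somewhat more than this volume.
    """
    if total_cells < 0 or max_density <= 0:
        raise ValueError("total_cells must be >= 0 and max_density > 0")
    return total_cells / max_density


def vials_needed(total_cells: float,
                 vial_volume_ml: float = 1.0,  # hypothetical vial size
                 max_density: float = MAX_DENSITY_PER_ML) -> int:
    """Number of vials of vial_volume_ml needed to stay within max_density."""
    if vial_volume_ml <= 0:
        raise ValueError("vial_volume_ml must be > 0")
    return math.ceil(min_freezing_volume_ml(total_cells, max_density) / vial_volume_ml)


if __name__ == "__main__":
    cells = 50e6
    print(min_freezing_volume_ml(cells), vials_needed(cells))
```

For 50 × 10^6 cells, this gives a minimum volume of 2.5 ml, or three 1 ml vials.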

To purify or not to purify

Before proceeding to RNA extraction, it is important to consider whether to purify the leukemic cell population. Sometimes tumor loads approach 100% and purification obviously is not needed. In most cases, however, tumor loads vary between 30 and 95%. Whether purification is an absolute requirement depends on the question to be answered. To understand signaling pathways, the analysis of a pure cell population is often required. Thus, if tumor loads are below 90% and if one wishes to make statements on the gene expression profile of tumor cells with a statistical confidence of 95%, which is a standard cutoff in most studies, 45 purification is necessary. However, the influence of purification procedures on gene expression has not yet been investigated very extensively in experimental settings. Moreover, molecular features of non-malignant immune cells present in the tumor or leukemia sample at diagnosis may give important information. 46 If there is doubt as to the specificity of genes detected, for example, a T-cell-associated signature in B-precursor acute lymphoblastic leukemia (ALL), an alternative way is to compare the gene expression signatures of unpurified and purified cells in a subset of samples. For classification purposes, purification is often not obligatory, but would certainly improve the quality of the data. One needs to realize that even in classification studies the signature, that is, the list of genes that discriminates between different leukemic subcategories, is error-prone, and lower tumor loads lead to a larger proportion of falsely identified genes, particularly if the number of samples per class is too low. Proper use of biostatistical tools and an adequate validation are important in this respect. The minimum tumor load that will still lead to correct classification probably depends on several variables, such as the leukemic subtype in question, and this issue should be explored further by future experimentation. We strongly encourage such studies. For now, we recommend that, for classification studies, the tumor load of the samples should be reported.

For most microarray purposes, MACS purification based on markers expressed on the tumor cells, such as CD34 and CD19 (B-precursor ALL) or CD7 (T-ALL), is the most appropriate technique to purify leukemic cells. Alternatively, MACS purification via negative selection can be performed with a mixture of antibodies that can deplete all non-leukemic cells (e.g. in acute myeloid leukemia (AML)). It is sometimes difficult to reach the sufficiently high purity (>90%) required for array analyses. 45 As purification by positive selection may stimulate cell surface receptor expression and concurrently alter the gene expression profile, the purification procedure needs to be carried out in the cold to reduce such an effect.

RNA extraction

RNA extraction is very important, as microarray results are directly influenced by the quality of the RNA used. Most investigators currently use two major techniques: TRIzol ® (Invitrogen Corp., Carlsbad, CA, USA) and RNeasy (Qiagen, Hilden, Germany) and sometimes TRIzol followed by column purification from the RNeasy kit. Some investigators have used adapted TRIzol extraction methods, with improved quality compared to the original TRIzol methods. Others prefer RNeasy, as it is quick, reproducible and easy to use. The results of a multi-center quality round within the Genomics/Proteomics project of the German Competence Network ‘Acute and Chronic Leukemias’ demonstrated that both methods work for extracting high-quality RNA in sufficient amounts (Figure 1). However, as the methods are technically different, it is wise to use only one method within an experimental series. This is illustrated by a series of T-ALL samples from which RNA was extracted either by RNeasy or by TRIzol. Using hierarchical clustering, all TRIzol samples clustered together instead of with the RNeasy extracted samples of the same patients (see Supplementary data). This demonstrates that the RNA extraction method used can have significant influence on the gene expression profile generated. Which of the many methods to isolate RNA is best suited for microarray analyses, especially on large numbers of clinical studies, warrants future study.

Figure 1. Results of a multi-center quality round within the project Genomics/Proteomics of the German Competence Network ‘Acute and Chronic Leukemias’. Frozen primary AML mononuclear bone marrow cells (BM MNC) were sent to different laboratories. RNA extraction using TRIzol (a), RNeasy (b) or TRIzol followed by column purification from the RNeasy kit (c) leads to high-quality RNA as measured by the Agilent Bioanalyzer 2100 (Agilent Technologies). Degraded RNA shows additional peaks to the left of the 18S peak and should not be considered for analysis (d).

The amount of extracted RNA should be quantified by spectrophotometry. The NanoDrop spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) has become popular, as it is capable of measuring nucleic acids in very small volumes, allowing accurate quantification on small amounts of material. The ratio of spectra measured at 260 and 280 nm is very helpful in assessing the purity of RNA and often used as an indicator of RNA quality. However, it is important to consider that the absorption ratio is pH and ionic strength-dependent and should be measured under identical conditions. 47 The RNA 260/280 ratio should ideally be 2.0, but anything above 1.8 is also acceptable.
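The purity check described above can be sketched in a few lines. This is an illustrative example only: the function names are hypothetical, and the only values taken from the text are the ideal ratio of 2.0 and the acceptance threshold of 1.8. Because the ratio depends on pH and ionic strength, both absorbance readings are assumed to come from the same measurement conditions.

```python
def a260_a280_ratio(a260: float, a280: float) -> float:
    """Ratio of absorbance at 260 nm to absorbance at 280 nm."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return a260 / a280


def classify_rna_purity(a260: float, a280: float,
                        acceptable_above: float = 1.8) -> str:
    """Label a sample using the 'above 1.8 is acceptable' rule from the text."""
    ratio = a260_a280_ratio(a260, a280)
    return "acceptable" if ratio > acceptable_above else "below threshold"


if __name__ == "__main__":
    # Made-up absorbance readings for two samples.
    for a260, a280 in [(1.0, 0.5), (0.85, 0.5)]:
        print(round(a260_a280_ratio(a260, a280), 2), classify_rna_purity(a260, a280))
```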

For quality control (QC), the Agilent Bioanalyzer 2100 has established itself as an important piece of laboratory equipment. Using ‘Lab-on-a-chip’ technology, the quality of RNA can be estimated very precisely, again on small amounts of the precious material. Freely available software tools such as the Degradometer and RNA Integrity Number (RIN) systems produce user-independent and objective RNA quality metrics. 48, 49 However, this instrument should not be used for quantification purposes, as acceptable precision of quantification can only be obtained within a small range of concentrations.

In most studies, poor RNA samples are not being used for further microarray analysis. Our extensive study on RNA quality versus microarray outcome on the Affymetrix platform did not show a correlation between 260/280 ratio and outcome of the array in terms of glyceraldehyde-3-phosphate dehydrogenase (GAPDH) 3′/5′ ratio or percentage of genes present, largely because poor RNA samples were not further processed and therefore never yielded a GAPDH result from the array (Figure 2). The GAPDH 3′/5′ ratio is a commonly used quality parameter measuring the signal at the 3′ and 5′ regions of the GAPD mRNA, which, without degradation of the RNA, should ideally be 1.0. A commonly accepted GAPD ratio is <3.0 (Figure 2). A similar statement has to be made for 28S:18S rRNA ratios measured using the Agilent Bioanalyzer and the number of well-measured spots on spotted arrays. From our point of view, it is important to consider both the 28S and 18S rRNA peaks and signs of degradation or to use objective metrics such as Degradometer or RIN. Further studies are needed to address systematically the correlation of, for example, RIN numbers and microarray outcome. However, objective RNA quality metrics should be reported as part of the MIAME standards. 35
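As a rough illustration of the 3′/5′ ratio described above, the sketch below divides the 3′ probe signal by the 5′ probe signal and applies the <3.0 cutoff quoted in the text. The function names and the example signal values are assumptions; real values would come from the array analysis software.

```python
def three_to_five_ratio(signal_3p: float, signal_5p: float) -> float:
    """Ratio of the 3' probe-set signal to the 5' probe-set signal."""
    if signal_5p <= 0:
        raise ValueError("5' signal must be positive")
    return signal_3p / signal_5p


def passes_3p5p_qc(signal_3p: float, signal_5p: float,
                   max_ratio: float = 3.0) -> bool:
    """True if the ratio is below the commonly accepted cutoff (<3.0)."""
    return three_to_five_ratio(signal_3p, signal_5p) < max_ratio


if __name__ == "__main__":
    # Made-up signals: an intact sample and a degraded one.
    print(passes_3p5p_qc(1200.0, 1000.0))  # ratio 1.2
    print(passes_3p5p_qc(4000.0, 1000.0))  # ratio 4.0
```

A ratio near 1.0 indicates that the 3′ and 5′ regions are detected equally well, which is the ideal case without RNA degradation.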

Figure 2. Correlation between RNA quality and outcome of Affymetrix microarrays as measured by 3′/5′ GAPD ratio and percentage of genes present. Data from 421 different Affymetrix microarrays from four different centers were analyzed for RNA quality as measured by 260/280 ratio and outcome of the array as measured by GAPD 5′ and 3′ fluorescence signal ratio. Samples with low cRNA quality are, in general, not further processed by the investigators involved. There is, however, a weak correlation between 260/280 ratio and percentage of genes present, as well as an expected inverse correlation between GAPD 3′/5′ ratios and percentage of genes present.

RNA labeling and possible amplification

For most microarray systems, standard protocols exist for labeling the RNA or cDNA that is to be hybridized onto the array. The first method typically involves a cDNA reaction to yield a suitable template that can be used by RNA polymerases (most often T7 RNA polymerase) to incorporate labeled nucleotides. Most spotted arrays use the labels Cy3 and Cy5, whereas Affymetrix uses biotinylated nucleotides, which are subsequently stained by phycoerythrin-conjugated streptavidin after hybridization. In some applications, directly labeled RNA is used. This avoids any artifacts that can arise from the (typically fairly linear) amplification by T7 polymerase. The disadvantage is that at least five- to 10-fold more input RNA is required, which may mean that many clinical samples yield insufficient RNA for microarray analysis. Moreover, direct labeling can cause artifacts, regardless of whether or not amplification is being applied. Again, the same labeling method should be used throughout the study.

Most suppliers now have protocols that require 1–5 μg of total RNA as starting material. In our experience, starting with amounts much lower than 1 μg for Affymetrix arrays can give good results, but this is unpredictable. Even amounts of 1–2 μg do not always yield sufficient cRNA for hybridization onto an array.

When performing gene expression studies on leukemic samples, in most cases more than 2 μg RNA can be isolated, and amplification schedules, which also take longer and cost more, can readily be avoided, especially when using the Affymetrix system (keep in mind that the Affymetrix standard protocol already includes one step of amplification). Many spotted arrays require more input RNA than Affymetrix arrays; therefore, amplification is more often required. Various amplification protocols using an additional amplification step have been developed that yield sufficient RNA for hybridization. Using, for instance, commercially available amplification kits based on the linear amplification protocol developed by Van Gelder et al., 50 profiles of gene expression obtained with amplified RNA could be confirmed with high accuracy when the expression of these genes was re-analyzed in non-amplified RNA by quantitative reverse transcriptase-polymerase chain reaction (RT-PCR). 18 If amplification is applied, it should be performed in all samples using the same protocol and chemistry. Quantitative RT-PCR validation from non-amplified RNA should be performed to test for amplification bias.

Finally, a general recommendation needs to be stated here: microarrays primarily measure differential gene expression between samples rather than the absolute expression values of an individual sample. Samples within a study should therefore be treated as identically as possible, so that technical variation does not override the biological variation in differential expression. Technical variation should be kept to a minimum and documented in detail.
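Differential expression between two samples is commonly summarized as a log2 ratio of their signal intensities. The sketch below illustrates that idea; it is not a method described in the text, and the pseudocount (added to avoid division by zero for unexpressed genes) and the example intensities are assumptions.

```python
import math


def log2_fold_change(sample: float, reference: float,
                     pseudocount: float = 1.0) -> float:
    """log2 ratio of two intensities; positive means higher in `sample`.

    The pseudocount is an illustrative choice, not taken from the text.
    """
    if sample < 0 or reference < 0 or pseudocount <= 0:
        raise ValueError("intensities must be >= 0 and pseudocount > 0")
    return math.log2((sample + pseudocount) / (reference + pseudocount))


if __name__ == "__main__":
    # Made-up intensities for one gene in two samples.
    print(log2_fold_change(799.0, 199.0))  # (800 / 200) -> 4-fold -> log2 = 2
```

Because the measure is a ratio, any technical factor that shifts one sample but not the other feeds directly into the result, which is why identical sample handling matters.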

Technical aspects of hybridization, washing and scanning

Similar to labeling, hybridization is often carried out according to protocols supplied by the manufacturers of those arrays actually used (Agilent, Affymetrix, etc.). Because these protocols can change and therefore real standard protocols do not exist, it is important to document which protocol is used. For spotted arrays, comparative hybridization of two samples labeled with different fluorescent dyes is often performed. We recommend the use of a common RNA standard for all arrays in one study, for example, the Universal Human Reference RNA (Stratagene, Heidelberg, Germany), especially if an enlargement of sample size is intended. For Affymetrix microarrays, established procedures also exist for washing and scanning. For spotted arrays, washing can be performed manually or using an automated system. After scanning, one should check the array for scratches and bubbles and other artifacts.

Data extraction

Every supplier has its own software to translate the fluorescence signal into a value corresponding to the expression of the gene interrogated by the probe. In Affymetrix arrays, several quality parameters should be checked: the above-mentioned 3′/5′ ratio for housekeeping genes such as GAPD and ACTB, the noise, the background, the scaling factor and the percentage of genes detected as present. These parameters not only tell the investigator something about the sample in question but also whether it can be safely compared to other samples in a given series of experiments. For arrays to be comparable, their scaling factors should not differ by more than threefold. Also, in our experience, leukemic samples should have at least 25% of genes present on the latest Affymetrix HG-U133 Plus 2.0 arrays, as well as on the U133A arrays. On U95A arrays, present calls tend to be higher (35–50%). To assess the quality of spotted arrays, it is important to check the homogeneity of hybridization, and to ensure a uniformly low hybridization background. In addition, one has to carefully check and optimize the grid alignment and to define parameters for well-measured spots, which can be included in subsequent analyses. Examples are described in recently published studies. 8, 18, 50, 51 The QC procedure depends on the platform used. For spotted arrays, it is difficult to come up with strict guidelines. For Affymetrix arrays, besides the visual aspects, several parameters can be checked, as recently proposed by the ‘Tumor Analysis Best Practices Working Group’. 34
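Two of the Affymetrix criteria above (scaling factors within threefold of each other, and at least 25% present calls on U133-type arrays) lend themselves to an automated check. The following is a minimal sketch under those stated thresholds; the `ArrayQC` record, the function names and the example values are hypothetical and would need to be filled from the vendor's report files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArrayQC:
    """Hypothetical per-array QC record."""
    name: str
    scaling_factor: float
    percent_present: float


def scaling_factors_comparable(arrays: List[ArrayQC], max_fold: float = 3.0) -> bool:
    """True if the largest and smallest scaling factors differ by at most max_fold."""
    factors = [a.scaling_factor for a in arrays]
    if not factors or min(factors) <= 0:
        raise ValueError("need at least one array with a positive scaling factor")
    return max(factors) / min(factors) <= max_fold


def low_present_arrays(arrays: List[ArrayQC], min_percent: float = 25.0) -> List[str]:
    """Names of arrays whose percentage of present calls is below min_percent."""
    return [a.name for a in arrays if a.percent_present < min_percent]


if __name__ == "__main__":
    series = [ArrayQC("s1", 1.0, 40.0), ArrayQC("s2", 2.5, 22.0)]
    print(scaling_factors_comparable(series))  # 2.5-fold spread
    print(low_present_arrays(series))          # s2 falls below 25%
```

The 25% default applies to the U133 arrays mentioned in the text; for U95A arrays, where present calls tend to be higher, a different threshold would be appropriate.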

All microarray analyses should include some technical replicates, which provide a measure of technical reproducibility and allow the researchers to determine whether the individual measurements are reproducible enough. However, replicates of the whole series of samples, as performed in the early days of microarray analyses, are not generally necessary because of the high concordance of the data obtained by experienced researchers. Instead of increasing the number of technical replicates, one should increase the number of biological samples in, for example, the two groups that will be compared. By including several biological samples, the influences of both biological differences and technical variation are taken into account, although it will be unclear how much of the variation is owing to technical and how much to biological variation (see Figure 3). However, if one wishes to compare the expression profile of, for example, one sample of a patient with an unknown disease to a group of normal controls, technical duplicates or triplicates of the patient's sample should be made, ideally beginning with the processing of the sample. In this regard, the use of spike-in RNA controls (corresponding to known amounts of RNA) can also be helpful, as can tools generated by the External RNA Controls Consortium (ERCC), which provides external RNA controls. 52

Examples of biological and technical replicates and their influence on differential expression by microarray analyses. Experiments were performed using spotted cDNA microarrays as described recently. 18 The correlation of relative gene expression values between two microarray experiments is shown for (a) two biological replicates (mononuclear bone marrow cells (BM MNC) of two healthy controls), (b) two biological replicates of one cell line that was split, then cultured under the same conditions up to two different time points and subsequently processed in parallel, and (c) two technical replicates (the amplified RNA of one sample was split and subsequent steps were performed in parallel). The correlation coefficients R² are shown. Similar correlation coefficients were observed using the Affymetrix system. 57
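As an illustration of how a replicate correlation of this kind can be computed, the following Python sketch returns the squared Pearson correlation between two vectors of relative expression values. Taking R² to be the squared Pearson coefficient is an assumption, since the caption does not say how the coefficients were calculated, and the example values are invented.

```python
def r_squared(values_a, values_b):
    """Squared Pearson correlation between two equally long lists of
    expression values (e.g. log ratios from two replicate arrays)."""
    if len(values_a) != len(values_b) or len(values_a) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(values_a)
    mean_a = sum(values_a) / n
    mean_b = sum(values_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(values_a, values_b))
    var_a = sum((a - mean_a) ** 2 for a in values_a)
    var_b = sum((b - mean_b) ** 2 for b in values_b)
    return cov * cov / (var_a * var_b)


# Invented log ratios for five genes measured on two replicate arrays.
replicate_1 = [0.1, 1.2, -0.8, 2.0, 0.4]
replicate_2 = [0.2, 1.0, -0.9, 1.8, 0.5]
print(round(r_squared(replicate_1, replicate_2), 3))
```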

In general, and especially for pathway analyses, genes of interest have to be validated using an alternative technique, for example, quantitative RT-PCR.

Biostatistics for data analysis

Various software packages exist to help investigators mine the enormous amount of information generated by microarray experiments. Commonly used techniques are statistical significance analysis with parametric and non-parametric tests (e.g. t-test, analysis of variance (ANOVA), significance analysis of microarrays (SAM) 53, 54 ), with correction for multiple testing, a variety of further classification methods (e.g. principal component analysis) and different types of cluster analysis (e.g. hierarchical clustering, K-means). 55 Various visualization tools may help to obtain a better overview of the results of analysis, and mapping of genes to molecular and metabolic pathways will help to identify biologically relevant findings. It is beyond the scope of this article to discuss these bioinformatical tools in detail (an overview is given by Allison et al. 56 ). We have listed several packages, both commercial and freely available, in Table 1. We strongly recommend that clinical and biological researchers consult a biostatistician even before performing the array experiment, to formulate the specific questions of the study, to define requirements and to address data analysis strategies. It should be noted that the same raw microarray data analyzed by different bioinformatical methods can yield very different results, especially for genes expressed at low levels. As these methods differ in the statistical algorithms used to analyze the data, this is not surprising, but it is an important aspect to consider. It also goes without saying that the same analysis method should be used throughout a study to compare various data sets.

Biostatistics are essential in understanding the results of array experiments and should be used as transparently as possible, to avoid misleading readers who are not experts in this field. For instance, a list of differentially expressed genes that constitute a signature profile should be accompanied by a false discovery rate (FDR) 58 or a P-value corrected for multiple testing to give an idea of how significant the findings actually are. To illustrate this further, a signature profile with an FDR of, for example, 50% would mean that about half the genes in the signature are selected by chance, and it is not possible to know without independent validation studies which genes these are. This may in some cases still yield important information, but in most cases this is a very high error rate and FDR should be much lower (e.g. in the 10% range). Concerning the individual genes, the reader will realize that further experiments are necessary to identify the true positive ones.
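To make the FDR idea concrete, the sketch below applies the Benjamini-Hochberg step-up adjustment to a list of per-gene P-values and computes the expected number of chance findings in a signature. The choice of Benjamini-Hochberg is an assumption (the text does not name a specific FDR method), and the function names and P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def expected_false_discoveries(signature_size, fdr):
    """Expected number of genes in a signature selected by chance."""
    return signature_size * fdr


raw = [0.01, 0.04, 0.03, 0.005]          # invented per-gene P-values
print(benjamini_hochberg(raw))
print(expected_false_discoveries(200, 0.5))   # half of 200 genes
```

Genes whose adjusted value falls below the chosen FDR threshold (for instance 0.1, in line with the 10% range mentioned above) would form the reported signature.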

All relevant information necessary for proper interpretation of the experiment and bioinformatic analysis must be included in publications. Array data uploaded into public databases should be raw data as far as possible. Uploading of, for example, Affymetrix GeneChip ® Operating Software (GCOS) expression estimates should be accompanied by cell intensity (CEL) files to allow independent analysis of the raw data. As pointed out above, the data should be presented according to MIAME guidelines and include information about the scaling factor. Further, the description of the bioinformatic analysis should document the software packages and algorithms used, indicators of statistical significance such as P-values or FDR for gene lists, and the handling of outliers and missing values. If possible, the robustness of the results should be tested by means of bootstrapping or leave-one-out cross-validation tests.
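A leave-one-out cross-validation test of the kind mentioned above can be sketched as follows. The nearest-centroid classifier used here is a stand-in chosen for brevity, not a method taken from the text, and the sample data and function names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to the sample."""
    sums, counts = {}, {}
    for features, label in zip(train_x, train_y):
        totals = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            totals[j] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_distance = None, float("inf")
    for label, totals in sums.items():
        centroid = [value / counts[label] for value in totals]
        distance = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Invented expression values for two genes in four samples.
samples = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.2]]
labels = ["A", "A", "B", "B"]
print(leave_one_out_accuracy(samples, labels))
```

In a real analysis, any gene selection step would need to be repeated inside each held-out round; otherwise the estimated accuracy is optimistically biased.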


Non-protein-coding RNAs (ncRNAs) do not encode proteins but function directly at the level of the RNA in the cell. Over the last few years, the importance of this surprisingly diverse class of molecules has been widely recognized ( 1 – 5 ). NcRNAs have been identified in unexpectedly large numbers, with present estimates—based on bioinformatical approaches—in the range of thousands per eukaryal and hundreds per bacterial genome ( 6 – 9 ). They play key roles in a variety of fundamental processes in all three domains of life, i.e. Eukarya, Bacteria and Archaea. Their functions include DNA replication and chromosome maintenance, regulation of transcription, RNA processing (not only RNA cleavage and religation, but also RNA modification and editing), translation and stability of mRNAs, and even regulation of stability and translocation of proteins ( 4 , 5 , 10 – 13 ). Many of them have been discovered fortuitously, suggesting they merely represent the tip of the iceberg. Many known ncRNAs are small, i.e. typically <500 nt, and thus much shorter than the majority of mRNAs. However, eukaryotes also express a number of large ncRNAs, e.g. Xist or Air RNAs, which are several 1000 nt long ( 14 – 16 ). The highly specific roles of ncRNAs reflect in most cases their ability to selectively bind a small set of proteins as well as their potential to specifically recognize definite RNA targets via regions of sequence-complementarity.

In recent years, new bioinformatical and experimental strategies have been taken to identify a great number of novel ncRNA candidates in various model organisms from Escherichia coli to Homo sapiens ( 5 – 7 , 17 – 31 ). These findings demonstrated that the number of ncRNAs in genomes of model organisms is much higher than it had been anticipated.

In the following, we will review various experimental strategies that were employed to identify novel ncRNAs in genomes of model organisms. For these approaches, the term ‘Experimental RNomics’ has been coined ( 3 ). Four different methods will be presented and their advantages as well as their obstacles in the identification of novel ncRNA molecules will be discussed: (i) RNA sequencing (enzymatically or chemically) as the most traditional method to reveal novel ncRNA species (ii) the parallel cloning of many ncRNA by generating specialized cDNA libraries (iii) the use of microarrays to predict ncRNAs that are expressed under a given experimental condition (iv) ‘genomic SELEX’ and its potential application to select ncRNA candidates from the sequence space represented by the genome of an organism of interest.

Alternatively to biochemical methods, genetic and bioinformatical tools may also be employed to identify ncRNAs in model organisms. In fact, some of the first chromosomally encoded regulatory ncRNAs, e.g. MicF, DsrA and RprA of E.coli , were discovered in the course of a genetic screen ( 32 – 34 ). Similarly, genetics also discovered the founding member, lin-4 RNA, of the ever-growing class of eukaryotic miRNAs ( 35 ). Due to space constraints, however, we would like to refer the reader to ( 6 , 36 , 37 ) for a more detailed review of genetic and biocomputational routes to ncRNA discovery.

Identification of ncRNAs by chemical or enzymatic sequencing

In the very early days of ncRNA research, i.e. some 35–40 years ago, single ncRNA species (at the time ribosomal RNAs, tRNAs or viral RNAs) were selected by size-separation of total RNA on denaturing gels, followed by visualization and excision of specific bands, ideally representing single ncRNA species. Thus, for its identification, the ncRNA of interest must be present in high amounts, e.g. visible as a distinct band in an ethidium bromide-stained polyacrylamide gel exposed to ultraviolet (UV) light ( Figure 1A ).

Subsequently (and prior to their identification by sequence analysis), ncRNAs are labelled either at their 5′ end or at their 3′ end: (i) for labelling of RNAs at their 5′ end, the mono- or triphosphate group usually found at the 5′ end of ncRNAs is removed first. This is achieved by the addition of calf intestinal alkaline phosphatase at an elevated temperature; inactivation of the enzyme is performed by repeated extraction with phenol/chloroform or by gel purification ( 38 ). Labelling of the RNA is then performed by the addition of polynucleotide kinase in the presence of [γ-³²P]ATP ( 38 ). (ii) For labelling at their 3′ end, ncRNAs can be labelled by the procedure described by Bruce and Uhlenbeck using 5′-³²PpCp as a donor molecule in the presence of T4 RNA ligase ( 39 ). Subsequently, ncRNAs are gel-purified on denaturing polyacrylamide gels.

RNAs may also be labelled in vivo prior to extraction from an organism. In some early studies, E.coli total RNA was metabolically labelled with orthophosphate,

After extraction from a cell or organism, size separation by PAGE and elution from the gel, ncRNAs are identified by sequence analysis. This is either achieved by 2D RNA fingerprinting or by enzymatic or chemical sequencing of ncRNAs.

There are several versions of 2D RNA fingerprinting techniques to sequence small RNAs (or oligonucleotides) or prepare various RNase-digested oligonucleotide catalogs. The differences are use of uniformly or end-labelled RNAs, partial or complete digestion with various RNases, electrophoresis on cellulose acetate strips or in acrylamide gels for the first dimension, electrophoresis on DEAE-cellulose paper, or homochromatography on DEAE-cellulose plates, or gradient thin layer chromatography on DEAE-cellulose plates ( 44 – 47 ).

For enzymatic sequence analysis, labelled ncRNAs (at 5′ or 3′ ends) are subjected to partial digestion with base-specific ribonucleases at elevated temperatures (50–55°C) and in the presence of 7 M urea to avoid interference of the secondary/tertiary structure of the RNA with enzymatic hydrolysis steps. For base-specific cleavage, a plethora of RNases (RNase T1, T2, U2, PHY1, PHY M, CL3, A or M1) can be used which cleave preferentially 3′ to either G, C, U or A bases ( 48 – 50 ). To resolve obtained RNA fragments by size, 1D gel electrophoresis is carried out on denaturing polyacrylamide gels (see below).

For chemical sequence analysis of ncRNAs, four different base-specific chemical reactions provide a means of directly sequencing RNA that was terminally labelled with ³²P ( 51 ). After a partial specific modification of each kind of RNA base, an amine-catalysed strand scission generates labelled fragments whose lengths determine the position of each nucleotide in the sequence. Dimethyl sulfate modifies guanosine, diethyl pyrocarbonate attacks primarily adenosine, and hydrazine attacks uridine and cytidine, but salt suppresses the reaction with uridine. In all cases, aniline induces the subsequent strand scission ( 51 ).
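The logic of reading a sequence from the four reactions can be illustrated with a deliberately idealized Python sketch. It assumes each lane has already been reduced to a set of 1-based nucleotide positions at which cleavage was seen, and that the reactions are perfectly specific; real gels are noisier, as discussed further below. The function and parameter names are hypothetical.

```python
def read_chemical_sequence(length, g_lane, a_lane, cu_lane, c_lane):
    """Infer an RNA sequence from idealized cleavage positions.

    g_lane  : positions cleaved after dimethyl sulfate (guanosine)
    a_lane  : positions cleaved after diethyl pyrocarbonate (adenosine)
    cu_lane : positions cleaved after hydrazine (uridine and cytidine)
    c_lane  : positions cleaved after hydrazine plus salt (cytidine only)
    """
    bases = []
    for position in range(1, length + 1):
        if position in g_lane:
            bases.append("G")
        elif position in a_lane:
            bases.append("A")
        elif position in c_lane:
            bases.append("C")
        elif position in cu_lane:
            # Cleaved by hydrazine but suppressed by salt: uridine.
            bases.append("U")
        else:
            bases.append("N")  # no band in any lane: base left undetermined
    return "".join(bases)


print(read_chemical_sequence(5, {1, 4}, {2}, {3, 5}, {5}))  # GAUGC
```

Uridine is never read from a lane of its own here; it is inferred by subtracting the salt-containing hydrazine lane from the salt-free one.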

Subsequent to enzymatic or chemical sequencing, electrophoretic fractionation of the labelled fragments is achieved on denaturing polyacrylamide gels, followed by autoradiography, which allows determination of the RNA sequence of interest.

The earliest studies to identify RNA molecules by direct sequencing were performed on tRNAs as well as on ribosomal RNAs ( 48 , 52 – 54 ). In the case of 16S ribosomal RNA, exhibiting a size of ∼1500 nt, smaller fragments were first generated by RNase T1 cleavage and subsequently analysed by RNase fingerprinting techniques ( 54 ). Direct RNA sequencing for identification of novel RNA species is far from being outdated, as was shown in more recent studies: by labelling and direct RNA sequencing, a novel class of ncRNAs, designated as small nucleolar RNAs, involved in rRNA modification ( 55 ) could be identified in eukaryotes. Lately, this technique was also used to visualize and subsequently sequence abundant RNAs of gram-positive bacteria ( 56 , 57 ).

Obstacles and advantages of the method

Identification of novel ncRNA species by RNA sequencing encounters four main obstacles. First, for identification, ncRNAs have to be highly abundant to be visible as single bands in ethidium bromide-stained gels. To circumvent this problem, labelling of total RNA followed by size separation on a gel system (i.e. in the reverse order to that described above) allows identification of less abundant ncRNA species.

Second, no other ncRNAs in the same size range should be present in the total RNA population, since it would hamper isolation of a single RNA species and thus would result in ambiguous sequencing data. If a band or spot is found to contain multiple RNA species, these can be resolved by 2D gel electrophoresis, which allows separation of RNA species with similar or identical sizes.

Third, chemical or enzymatic sequencing of ncRNAs sometimes results in sequencing data that are difficult to interpret. The reason is that, for enzymatic sequencing, RNases are not strictly specific for a distinct base but possess residual cleavage activity for other bases. Similarly, chemical sequencing does not always result in unambiguous modification and cleavage of nucleotides, thus obscuring the readout of the obtained sequence data.

Finally, due to the sequencing methods and resolution capacity of polyacrylamide gels, sequencing is limited to RNAs of, at most, a couple of hundred nucleotides. Thus, ncRNA species that exceed this size range cannot be directly analysed by this method, but have to be cleaved into smaller pieces (e.g. by T1 nuclease digestion) prior to further analysis.

The advantage of direct RNA sequencing, as compared with sequencing cDNA clones generated from ncRNAs (see below), is the fact that ncRNAs do not have to be reverse transcribed for analysis. Thus, RNA secondary/tertiary structures that might impede reverse transcription into cDNA do not interfere with RNA identification by using direct RNA sequencing.

Identification of ncRNAs by specialized cDNA libraries

The second method for the identification of novel ncRNA species involves the generation of cDNA libraries, in analogy to expressed sequence tag libraries (EST libraries) for identification of mRNAs ( 58 , 59 ). The original mRNA cloning method is based on reverse transcription of mRNAs from an organism by an oligo(dT) primer and second strand synthesis, resulting in a cDNA library that ideally represents all protein-coding transcripts of a genome. Compared with these conventional EST libraries, the main difference for ncRNA library approaches is the source and treatment of the cloned RNA.

Since most mRNAs are >500 nt in length but many ncRNAs are considerably smaller, RNAs in the size range of ∼20–500 nt are isolated first. This fraction is usually depleted in EST libraries as it will not be present in poly(A)+ mRNA. The isolation of small-sized RNAs is achieved by size separation of total RNA (either from the entire organism at different developmental stages or from an individual organ) by denaturing PAGE ( Figure 1B ).

Alternatively, by employing an antibody against an RNA-binding protein of interest, entire groups of ncRNAs can be isolated by immunoprecipitation. Thereby, RNAs are not selected by their size but rather based on their function since they bind to a common RNA binding protein, e.g. a library generated by immunoprecipitation with an antibody against a common small nucleolar RNA–protein will help identify ncRNAs from the class of snoRNAs ( 26 ).

In many cases, these size- or antibody-selected RNAs will lack poly-adenylated tails. In general, there are three different methods to reverse transcribe ncRNAs into cDNA as a prerequisite for cloning and sequencing ( Figure 1B ).

First, to generate cDNA from this ncRNA fraction, addition of an oligo(C) or oligo(A) tail to the RNA is performed in the presence of poly(A) polymerase, which uses ATP, but also, to a lesser extent, CTP as a substrate ( 60 ). Subsequently, tailed RNAs are reverse transcribed employing an oligo(dG) or oligo(dT) primer, respectively. Following second strand synthesis by employing DNA polymerase I and limited amounts of RNase H and subsequent ligation of double-stranded DNA linkers, the obtained double-stranded cDNAs are cloned into a standard vector system (e.g. pSPORT1/GibcoBRL), thus generating a cDNA library [for a detailed method description see ( 61 )].

As a second approach subsequent to C-tailing at the 3′ end (see above), an oligonucleotide linker is ligated to the 5′ end of ncRNAs by T4 RNA ligase. The oligonucleotide can be made from RNA or almost entirely from DNA [for a more detailed method description see ( 62 )]. To avoid multimerization of linker sequences, the 5′-oligonucleotide carries a 5′-hydroxyl group. Since T4 RNA ligase uses RNA as a template, the last 3 nt at the 3′ end of the oligonucleotide should be ribonucleotides to increase efficiency of ligation. To add a linker to RNAs with modified 5′ ends, such as a cap structure or a tri-phosphate group, the RNA is first treated with tobacco acid pyrophosphatase (TAP) which cleaves between the α and β phosphate group, thus leaving 5′-monophosphates ( 62 ). For RT–PCR of RNAs, an oligo(dC) or d(T) primer is used in combination with a 5′-primer that is complementary to the ligated 5′ linker sequence.

In a third method, RNA oligonucleotide linkers are sequentially ligated to both the 3′ and the 5′ end by T4 RNA ligase. To avoid multimerization of linker sequences, the oligonucleotide at the 5′ end of the RNA lacks a phosphorylated 5′ end, while the oligonucleotide ligated to the 3′ end of the RNA contains a blocked 3′ end. Typically, the entire RNA pool is subjected to another round of gel extraction after the first linker ligation step to remove excess linker that would otherwise form dimers with the second adapter oligo. As described above, the terminal 3 nt of the 5′-oligonucleotide linker and the first three of the 3′-oligonucleotide linker might contain RNA bases to increase efficiency of ligation by T4 RNA ligase. RT–PCR of the ligated RNA fraction is achieved by DNA primers complementary to the respective 5′- or 3′-linker oligonucleotide [for a more detailed method description see ( 61 )].

Subsequent to cDNA synthesis, cDNA fragments are cloned into standard vector systems and sequenced by cycle sequencing. Dependent on the expected complexity of the library, up to 10 000 cDNA clones should be sequenced (for example in the case of large eukaryal genomes). Sequencing is usually followed by bioinformatical analyses, e.g. mapping of the ncRNA gene to a certain locus on the genome and identification of structure or sequence motifs, which might contribute to the identification of the function of the ncRNA species of interest.

In the recent past, numerous studies have been performed to identify ncRNAs in genomes of model organisms by constructing specialized cDNA libraries. The first study was initiated in the mouse Mus musculus , where, from a cDNA library derived from size-selected RNAs (50–500 nt), 201 candidates for ncRNAs were identified among ∼5000 cDNA clones analysed, about half of which belonged to the class of snoRNAs ( 19 ). This study was followed by similar approaches for the plant Arabidopsis thaliana ( 20 ), the fruit fly, Drosophila melanogaster ( 23 ), the two archaeal species Archaeoglobus fulgidus and Sulfolobus solfataricus ( 21 , 22 ) and the eubacteria E.coli ( 63 , 64 ) and Aquifex aeolicus ( 65 ).

Specialized cDNA library cloning was also applied to identify certain subclasses of ncRNAs, e.g. miRNAs, in different model organisms. Here, ncRNAs with a very narrow size range of about 18–25 nt, i.e. centering around the known sizes of miRNAs, were size-selected, cloned and sequenced ( 10 , 66 – 72 ).

Identification of small ncRNAs by generation of specialized cDNA libraries is now wide-spread and includes analysis of amoebozoa such as the slime mold Dictyostelium discoideum ( 62 ). For identification of specific classes of ncRNAs (such as snoRNAs) the method of cDNA library generation by immunoprecipitation with a snoRNA-binding protein like fibrillarin followed by cloning of ncRNAs has been employed successfully for C/D and H/ACA snoRNAs ( 26 , 73 ).

Obstacles and advantages of the method

The above method for cloning of ncRNAs has the drawback that it might not always be possible to reverse transcribe an ncRNA into cDNA because of its structure or modification (e.g. base or backbone modifications). Thus, a cDNA library is neither likely to reflect all ncRNAs in a cell, nor will it necessarily reflect, by the number of individual cDNA clones, the abundance of the respective ncRNA. The rationale behind this is that less structured/modified ncRNAs are more easily reverse transcribed than others and will be over-represented within a cDNA library. Similarly, smaller ncRNAs will be more abundant than longer ones, since they are more likely to be fully reverse transcribed.

For size-selected cDNA libraries, in general, it will not be possible to identify all ncRNAs of a cell type or organism, since the cut-off by size (e.g. 20–500 nt) will prohibit identification of longer ncRNAs (such as Xist and Air RNA, which exhibit sizes in the range of many kb). In addition, by the very nature of a cDNA expression library, only those ncRNA species that are actually transcribed from the genome will be detected. Transcription might depend, however, on a specific developmental state of the organism or on expression in a certain tissue. Thus, to be able to clone all expressed RNA sequences from an organism, ideally, all developmental stages, in all tissues and under all possible growth and nutrient conditions, would have to be analysed and total RNA extracted from these different states. This might not always be possible, and hence some ncRNA species that are expressed only under certain conditions will not be cloned.

As for cloning strategies, employing method I for conversion of ncRNAs into cDNAs, i.e. reverse transcription, second strand synthesis and addition of DNA linkers (see above), will in our experience not result in full-length cDNA clones but in clones with truncated 5′-termini lacking about 10–15 nt of the full-length RNA.

Conversely, the disadvantage of methods II and III involving a linker oligonucleotide (see above) is the rather inefficient ligation step of the linkers to the potential ncRNA of interest, and the failure of linker attachment to modified termini. The advantage of this method is, however, that often full-length cDNA clones can be obtained, as compared with method I.

In general, cDNA cloning will result in identification of highly abundant known ncRNA species, such as tRNAs or small ribosomal RNAs (e.g. 5S or 5.8S rRNAs). To circumvent repeated sequencing of these already known ncRNA genes, one can try to excise these ncRNA species from the gel after PAGE. However, this might result in the loss of ncRNA species exhibiting the same or similar sizes as these known RNA species. Alternatively, one can spot cDNAs on filters (as a dot blot) and hybridize the filters with radiolabelled oligonucleotides directed against the most abundant known ncRNA species. Subsequently, only those cDNA clones that show no hybridization signal on autoradiograms of the filters are sequenced.

Microarray analysis

Microarrays have become the preferred method to monitor the levels of many transcripts in parallel and often at the whole-genome level ( Figure 1C ). Microarrays, also known as DNA chips or expression arrays, are glass (or silicon) slides onto whose surface DNA probes have been printed in a grid-like arrangement. To date, single-stranded DNA oligonucleotides of 25–70 nt in length are the predominant type of DNA probe on commercial microarrays, though double-stranded PCR products may also serve as probes.

To analyse the entire level of cellular transcripts, samples are prepared from total RNA of an organism. The samples used for microarray hybridization can be the extracted RNA, the converted cDNA or cRNA; in any case, they will generally be labelled with fluorescent dyes, such as Cy3 or Cy5. For more details on the various labelling protocols that are currently being used, see references in ( 74 ) and the work cited below. The prepared sample is then mixed with hybridization buffer and applied to the glass slide so that it can hybridize to the complementary spots on the microarray.

The fluorescence of the spots to which the sample hybridized is read by a scanner and the results are displayed as a pattern of coloured (e.g. red or green) dots, with the colour intensity reflecting the amount of transcript that was present in the cell. If two samples labelled with different dyes were hybridized in parallel onto the same microarray, additional colours such as yellow or orange would indicate relative amounts of the individual transcripts in the two RNA pools.
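Numerically, the mixed colours of a two-dye experiment correspond to the ratio of the two channel intensities at each spot, which is commonly expressed as a log2 ratio. The Python sketch below shows that calculation; the function name, the pseudocount and the example intensities are assumptions for illustration and are not taken from any particular scanner software.

```python
import math


def log2_ratio(cy5_intensity, cy3_intensity, pseudocount=1.0):
    """Log2 ratio of the two channel intensities at one spot.

    A small pseudocount (an assumption of this sketch) avoids division
    by zero for spots with no signal in one channel.
    """
    return math.log2((cy5_intensity + pseudocount) / (cy3_intensity + pseudocount))


# Invented background-corrected intensities for three spots.
spots = {"geneA": (4000.0, 1000.0), "geneB": (800.0, 820.0), "geneC": (150.0, 1200.0)}
for name, (cy5, cy3) in spots.items():
    print(name, round(log2_ratio(cy5, cy3), 2))
```

A value near zero corresponds to similar amounts in both pools (the intermediate, yellow-like colours), while strongly positive or negative values correspond to spots dominated by one dye.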

Microarrays are mostly used for mRNA expression profiling but they could also be a means for studying ncRNA expression or even for ncRNA discovery ( Figure 2 ). The main caveat for their use with ncRNAs, however, was, and in many cases still is, the design of the commercially available microarrays. Being tailored for mRNA profiling, most of these arrays carry probes only for coding regions; thus transcripts from non-coding genome regions will not be detected. Nonetheless, the last few years have seen considerable improvement of this situation.

In bacteria, most of the functional ncRNAs are encoded in intergenic regions (IGRs). The first microarray to include IGRs in addition to coding regions was introduced for the model bacterium E.coli by ( 75 ). Their high-density array (tiling array) carries ∼300 000 strand-specific 25mer oligonucleotide probes for all mRNA, tRNA and rRNA regions at a 30 bp resolution as well as for all IGRs of >40 bp with 6 bp resolution.
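The idea of tiling a region with 25mer probes at a fixed resolution can be sketched as follows. This Python example interprets "resolution" as the spacing between consecutive probe start positions, which is an assumption, and it returns probes with the same sequence as the input strand; it ignores strand handling and probe quality filters. The function name and the toy sequence are hypothetical.

```python
def tile_probes(sequence, probe_length=25, step=6):
    """Return (start, probe_sequence) pairs tiled along a region.

    probe_length : length of each oligonucleotide probe (25 nt in the text)
    step         : spacing between probe starts (6 bp for IGRs, 30 bp for
                   coding regions in the array described above)
    """
    if probe_length <= 0 or step <= 0:
        raise ValueError("probe_length and step must be positive")
    last_start = len(sequence) - probe_length
    return [
        (start, sequence[start:start + probe_length])
        for start in range(0, last_start + 1, step)
    ]


region = "ACGT" * 15  # toy 60 nt intergenic region
dense = tile_probes(region, step=6)
sparse = tile_probes(region, step=30)
print(len(dense), len(sparse))
```

The denser tiling of intergenic regions is what gives such an array a chance of detecting short transcripts that would fall between probes at the coarser spacing.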

While this initial study primarily focused on technical issues of mRNA level profiling, Wassarman et al . (2001) subsequently used this microarray type to specifically analyse the transcriptional output from IGRs. They found that array hybridization with RNA extracted from three different growth conditions yielded signals for at least a third of the ncRNAs that were detected by parallel probing on northern blots.

These global analyses of the E.coli transcriptome were subsequently extended by ( 76 ). By including a much broader set of growth conditions, additional transcripts from IGRs that may be novel ncRNA candidates were detected. Notably, the extraordinarily high probe density here facilitated detection of 3′- or 5′- UTR RNA fragments that accumulate independently after the processing of mRNA transcripts.

In a third study with this microarray type, cellular RNAs that associate with E.coli Hfq protein were analysed ( 77 ). This bacterial Sm-like protein, over the previous years, has emerged as a key player in regulation by small regulatory ncRNAs ( 78 ) and was known to bind a number of bacterial ncRNAs (i.e. in addition to mRNAs). By the time of the study of Zhang et al . (2003), 46 ncRNAs were known in E.coli , of which ∼30% were detected by array hybridizations of RNA that co-immunoprecipitated with Hfq. For bacteria other than E.coli , microarrays have been applied to support biocomputational prediction of ncRNAs of Staphylococcus aureus ( 56 ). In contrast to the aforementioned E.coli oligonucleotide tiling array, selected S.aureus IGRs were PCR amplified to yield double-stranded DNA probes that were then spotted on glass slides.

Similar to bacteria, microarrays have been increasingly used to confirm global predictions of certain classes of eukaryotic ncRNAs as well as to study their expression profile in different tissues. One such class, the ∼22 nt microRNAs, is matured from 60 to 110 nt pre-miRNA hairpin transcripts thought to derive from longer pri-miRNA products. Microarrays with 40 or 60mer oligonucleotides to detect known microRNAs or their hairpin precursors were introduced recently ( 79 , 80 ). Barad et al . ( 80 ) evaluated several aspects of the methodology in order to standardize it and define the parameters needed to achieve efficient hybridization and reliable results, including mismatch analysis to determine the specificity of microRNA probes. It was observed that signal intensity correlates with the location of the microRNA sequence within the 60mer probes, showing that location at the 5′ region yields the highest signals, whereas the 3′ end location results in poor signals. These results were subsequently used to develop an integrative approach to the discovery of new microRNAs, in which potential microRNA precursor regions were predicted in the human genome and 5300 of these candidates tested in a high-throughput manner on the aforementioned microarrays ( 81 ).

Several groups have recently used microarrays to study ncRNAs of the yeast Saccharomyces cerevisiae . Following earlier work with microarrays that carry individual probes for a representative set of certain ncRNAs, e.g. snoRNAs ( 82 ), the Hughes laboratory designed a tiling microarray to cover all known and several predicted yeast ncRNAs ( 83 ). Here, each ncRNA transcript is covered by oligonucleotide probes at ∼5 nt intervals including 100 nt of flanking sequence on both the 5′ and 3′ ends. Thus far, however, these arrays have mainly been used to monitor the synthesis, processing and modification of known ncRNAs ( 84 , 85 ).

New yeast ncRNAs were identified by means of a truly whole-genome microarray that contains 6700 PCR fragments to cover all yeast open reading frames, annotated small RNAs and all intergenic regions ( 86 , 87 ). Here, Inada and Guthrie ( 87 ) sought to identify the RNA binding partners of the yeast La protein (Lhp1) at a global scale. La is a ubiquitous, nuclear RNA-binding protein that is conserved among eukaryotes. Aside from binding mRNAs, it is known to associate with the primary transcripts of RNA polymerase III, including all tRNAs and other small RNAs. To selectively identify La binding RNAs in yeast, a Myc-tagged Lhp1 protein was immunoprecipitated with its associated RNAs and an untagged strain was used as the reference sample in subsequent microarray hybridizations ( Figure 2 ). The La targets identified in this work included 20 annotated snoRNAs. Furthermore, at least three novel H/ACA snoRNAs that had not previously been annotated as such were discovered in intergenic regions. Additional highly enriched signals from other intergenic regions suggest that these also represent novel unannotated transcripts which may be unknown ncRNAs.

Customized tiling arrays have now also been applied to systematically search for functional ncRNAs in higher eukaryotes. For example, a biocomputational approach was taken to extract 3478 intergenic and intronic sequences that are conserved between the human, mouse and rat genomes, and that showed characteristics of ncRNAs by a number of other criteria ( 88 ). This information was then used to design tiling arrays that contained probes for this candidate set, and these arrays were probed with RNA isolated from 16 wild-type mouse tissues. Subsequently, 55 candidates for highly expressed novel ncRNAs were tested on northern blots, thus confirming eight of these as small, highly and ubiquitously expressed RNAs in mouse. Interestingly, only five of these ncRNAs could also be detected in rat tissues, but none in human tissues or cultured cells. The conserved expression of these five ncRNAs in mouse and rat may indicate these molecules to be functional in these two organisms albeit not in human.

Obstacles and advantages of the method

The aforementioned studies provided valuable clues as to the potential of this technique for ncRNA discovery as well as the problems associated with microarrays when assaying small and highly structured ncRNAs. Analyzing the hybridization signals from E.coli tiling arrays it was noted that often only a subset of the oligonucleotide probes within the range of a given ncRNA transcript region yielded a signal peak, even though the same sRNA locus gave a strong and distinct band on northern blots [cf. Figures 2 and 3 in ( 18 )]. Tjaden et al . ( 76 ) occasionally observed transcripts on the strand opposite an experimentally validated ncRNA, which may account either for unknown ncRNA antisense transcripts or simply for experimental noise.

Although so far techniques for the reliable microarray detection of bacterial small RNAs, which are usually highly structured, have not been thoroughly evaluated, sample preparation would seem a major issue. To date, most microarray approaches involve fluorescent labelling of the RNA to be used as sample. Frequently, the RNA is converted into cDNA in the presence of modified nucleotides that carry fluorescent dyes. Most bacterial ncRNAs cluster in a size range of 100–150 nt ( 8 , 9 ), and thus reverse transcription may not be efficient and could further be hampered by tight secondary structure.

Whether direct labelling approaches, e.g. chemical labelling of fragmented RNA as alternatively used in ( 18 ), would fully solve these problems is currently unknown. However, Zhang et al . (2003) drastically improved detection sensitivity by directly hybridizing RNA to oligonucleotide arrays without labelling or cDNA synthesis ( Figure 2 ). Instead, hybridization was assayed using an antibody that recognizes RNA:DNA hybrids. The improved sensitivity of this method is demonstrated by the detection of the oxidative stress-induced OxyS RNA, which is present in very low concentrations under the growth conditions used in this study.

Hence, microarrays bear a great potential to not only detect many RNAs in parallel but also to point to transcripts that are present at low levels. As a note of caution, the fact that the vast majority of the mouse ncRNA candidates suggested by microarray analysis failed in downstream northern analysis ( 88 ) clearly emphasizes the need for validation of microarray hybridization results by independent methods.

These authors also point out that hybridizing covalently labelled total RNA as applied in their study, as opposed to reverse transcribed RNA derived from poly-adenylated RNA, would be important in tiling array analyses, since any amplification or enrichment step is likely to skew the representation of the large noncoding regions of eukaryotic genomes and may thus make it difficult to distinguish such signals from global ‘transcriptional noise’. The application of stringent criteria when using microarrays for ncRNA discovery seems to be imperative as more data from whole genome tiling microarrays are becoming available ( 89 – 91 ) and these data will increasingly serve as input for biocomputational ncRNA predictions by others [e.g. ( 7 )].

Genomic SELEX

Many ncRNAs form ribonucleo-protein particles (RNPs) at various time points in their life cycle. Such RNA-binding proteins may help an ncRNA fold into its active conformation, shield it from nucleases prior to exerting its function or promote its annealing with target RNAs up to guiding a protein to its proper target. Other ncRNAs interact with proteins to directly regulate their activity.

The techniques discussed so far allow the identification of ncRNAs from the pool of expressed cellular RNAs after co-purification with proteins, i.e. by cloning, direct sequencing or microarray analysis. Given that many such proteins bind their RNA ligands in a nanomolar range, it should also be possible to select RNA ligands from the pool of ncRNAs that an organism can possibly express even without isolating their in vivo transcripts.

This approach, termed genomic SELEX ( 92 ), is based on the in vitro generation of RNA species that are derived from a library of an organism's entire genomic DNA ( Figure 1D ). The generated RNA pool will undergo successive rounds of association with a given RNA-binding protein, partitioning and re-amplification. As a result, RNA sequences that are stringently bound by the protein partner will be enriched. Once the sequence of the bound RNAs is determined, this information can be used to search for matches in the genome, and so predicted genomic regions could then be tested for the expression of unknown ncRNAs. Genomic SELEX has been successfully applied to select mRNA binding partners of proteins [e.g. ( 93 , 94 )], but to the best of our knowledge, studies that focused on ncRNAs have not been published for any organism, yet.
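The enrichment produced by repeated rounds of binding, partitioning and re-amplification can be illustrated with a toy model. This is only a sketch under simplifying assumptions that are not from the text: each RNA species is retained in proportion to a simple equilibrium bound fraction, [P] / ([P] + Kd), and re-amplification restores the pool without bias. The function name `selex_round`, the pool layout and the example Kd values are hypothetical.

```python
def selex_round(pool, protein_nM):
    """Simulate one selection round on a pool of RNA species.

    pool maps a species name to (fraction_of_pool, kd_nM).
    Assumption: the retained amount of each species is its pool
    fraction times the equilibrium bound fraction P / (P + Kd);
    re-amplification is modelled as renormalizing to a total of 1.
    """
    retained = {
        name: frac * protein_nM / (protein_nM + kd)
        for name, (frac, kd) in pool.items()
    }
    total = sum(retained.values())
    return {name: (retained[name] / total, pool[name][1]) for name in pool}


# Hypothetical starting pool: a rare tight binder among weak binders.
pool = {"tight": (0.01, 10.0), "weak": (0.99, 10000.0)}
for _ in range(3):
    pool = selex_round(pool, protein_nM=10.0)
```

In this toy setting the tight binder dominates the pool after very few rounds, which is the qualitative behaviour the selection scheme relies on; real selections are limited by nonspecific retention and amplification bias, which the sketch ignores.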

Currently, the Schroeder laboratory has taken this approach to identify new Hfq-binding RNAs from E.coli (C. Lorenz and R. Schroeder, personal communication). A representative library of the E.coli genome was constructed from random 50–500 bp genomic DNA fragments to which defined linkers, one of these containing a T7 RNA polymerase promoter, were attached in the course of the initial library generation step ( 92 ). These fragments were in vitro transcribed with T7 RNA polymerase, incubated with Hfq and selected for Hfq binding on filters. Taking the standard SELEX route ( 95 ), the retained RNA was converted to cDNA and subjected to additional (eight) re-amplification and selection rounds, which finally resulted in a pool of RNAs that bound Hfq with Kd values of 5–50 nM. Subsequently, specific Hfq interaction of the thus enriched RNAs was determined in vivo using a yeast three-hybrid screen ( 96 ). Preliminary results suggest that these experiments identified a number of novel Hfq-binding RNAs, including antisense RNAs and candidate ncRNAs from intergenic regions.

Obstacles and advantages of the method

Genomic SELEX would clearly have its strength in finding ncRNAs that are overlooked by methods that require an ncRNA gene to be expressed at a certain level. With their small genome sizes, prokaryotes should be particularly amenable to this type of approach. Since in bacteria functional ncRNAs are mostly encoded by intergenic regions, the original pool of DNA fragments could be loaded by specifically amplifying this portion of the genome, which in bacteria typically constitutes <10% of the entire genome. As a further advantage of genomic SELEX, the tight association of an ncRNA with a given protein that is a prerequisite for its successful selection could also point to a biological role of this ncRNA, e.g. its function as an antagonist or cofactor of the protein's activity.

At present, very few general RNA-binding proteins are known that specifically form complexes with ncRNAs. Two of the proteins discussed above, Hfq and La (Lhp1p), also associate with mRNAs [references in ( 78 , 87 )]. Thus, similar to cDNA cloning and microarray analysis, a genomic SELEX approach with such general RNA-binding proteins is expected to yield many additional RNA candidates that one would not readily consider as ncRNAs. What is more, this method only indicates that a certain genomic locus could have a function when transcribed into RNA. However, the exact condition under which such an RNA is expressed—if it is at all—will still have to be determined.

A major advantage of the genomic SELEX method, compared with the cDNA cloning strategy (see above), is, however, that the latter requires isolation of ncRNAs from an organism or cell under all possible developmental and growth conditions, which might not always be feasible. In contrast, genomic SELEX generates RNA species from all regions of a genome and thus is not dependent on isolating RNAs from all these different states.

Functional RNomics approaches: techniques following RNA identification

Identification of ncRNAs can only be regarded as a first step towards the elucidation of their functions. The term ‘candidate’ should be used as a suffix to the ncRNA as long as its function has not been elucidated. Only once the function is known should the RNA species be designated as a bona fide ncRNA.

To obtain hints towards the function of an ncRNA candidate, several approaches can be performed:

Since most functional ncRNAs are part of a ribonucleoprotein particle (RNP), the protein components of ncRNAs can be searched for. This is achieved, for example, by using the RNA as a ‘bait’ to fish for these RNA binding proteins in cell extracts. RNAs can be synthesized with an ‘affinity-tag’ such as biotin by T7 RNA polymerase in vitro transcription in the presence of biotin-UTP. The biotinylated RNAs are then coupled to a streptavidin column. Alternatively, an RNA sequence binding to a known protein can be cloned 5′- or 3′- to the ncRNA gene. By attaching the known RNA binding protein to a solid support, the ncRNP can be isolated by using the known RNA tag as a bait ( 97 ). Elucidation of the protein-components of an RNP can hint towards its functions, since the proteins might exhibit domains with known catalytic activity. For in vivo analysis, the yeast two-hybrid system has been expanded to a three-hybrid system, where the ncRNA is used as a bait in vivo to fish for proteins which bind to it ( 98 ).

Many of the ncRNAs hitherto found exhibit specific RNA targets, which they recognize by an antisense mechanism, e.g. Watson–Crick base pairing ( 99 ). Target RNAs include mRNAs or other ncRNAs such as ribosomal RNAs, snRNAs or tRNAs. For elucidation of ncRNA targets, either bioinformatical or experimental methods can be employed. For bioinformatical methods, a search for complementarity can be performed. This was successfully achieved, for example, in the case of miRNA targets ( 100 , 101 ). Experimentally, when the ncRNA of interest is fished out through an RNA-binding protein (see above), the target RNA, complementary to the ncRNA, could be co-isolated as well. This might require a cross-link prior to isolation of the RNA heteroduplex, depending on the stability of the RNA–RNA interaction. Alternatively, by expression/overexpression of an ncRNA of interest and subsequent microarray analysis, potential mRNA targets can be identified, if the ncRNA influences the abundance of its respective mRNA target(s) in the cell ( 102 ).
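A search for complementarity can be sketched in a few lines. The example below is a deliberately simplified illustration, not the method used in the cited studies: it assumes perfect Watson–Crick pairing of a short 'seed' window of the ncRNA (here nucleotides 2 to 8, a convention commonly used for miRNAs) and ignores G:U wobble pairs, bulges and pairing energy. The function names and the target sequence are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence (A/C/G/U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_seed_sites(ncrna, target, seed_start=1, seed_len=7):
    """Return 0-based positions in `target` that pair perfectly with the seed.

    The seed is ncrna[seed_start : seed_start + seed_len]; the defaults
    select nucleotides 2-8 (1-based), an assumption borrowed from miRNAs.
    """
    seed = ncrna[seed_start:seed_start + seed_len]
    site = reverse_complement(seed)
    hits = []
    pos = target.find(site)
    while pos != -1:
        hits.append(pos)
        pos = target.find(site, pos + 1)  # allow overlapping matches
    return hits


# let-7a sequence against a made-up target containing one seed match
sites = find_seed_sites("UGAGGUAGUAGGUUGUAUAGUU", "AAACUACCUCAAA")
```

Such a scan produces many false positives on real transcriptomes, so candidate sites would normally be filtered further, for example by conservation or by the experimental approaches described above.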

Analysis of expression patterns from an ncRNA of interest: for example, the cellular/subcellular localization of the RNA/RNP particle might shed some additional light on its function, e.g. localization in the nucleolus, nucleus or cytoplasm might hint towards an ncRNA involvement in functions exerted in these cellular compartments. To this end, fluorescent in situ hybridization techniques can be used to localize the RNA of interest ( 103 ). In addition to sub-cellular localization of ncRNAs, the tissue-specific or developmental expression of ncRNAs can be analysed by northern blotting, using total RNA from different tissues or developmental states. Thus, if an ncRNA is only expressed in the brain, at a certain developmental stage, for example, the function of the ncRNA can be searched for within this temporal and spatial expression window of the respective organism.

Ultimately, to address the function of ncRNAs, their genes have to be eliminated in the genomes of respective organisms. In other cases, overexpression of ncRNA genes has been useful to obtain a more prominent phenotype [see discussion of multicopy plasmid approaches in ( 36 )].

For certain model bacteria such as E.coli , gene deletions are usually accomplished in a few days ( 104 , 105 ). For most other organisms, only the conventional time-consuming knock-out technology is available for this purpose. Very recently, the more elegant knock-down strategies by RNA interference, so far applied only to protein-coding mRNAs, have been shown, in some cases, to be also suitable for rapid ncRNA depletion ( 106 , 107 ); however, the mechanism by which RNAi targets ncRNAs is completely unknown. In addition, a very elegant study has also very recently demonstrated the potential of chemically modified antisense miRNAs (so called ‘antagomirs’) for the knock-down of certain miRNA species ( 108 ).

Cilia, Part B

Lis Jakobsen , . Jens S. Andersen , in Methods in Enzymology , 2013

2.2 Metabolic labeling

For stable isotope labeling by amino acids in cell culture (SILAC) ( Ong et al., 2002 ), grow cells in RPMI custom culture medium without arginine and lysine supplemented with 10% dialyzed fetal bovine serum, 100 U penicillin/ml, 100 μg streptomycin/ml, and 2 mM L-glutamine in a humidified incubator at 37 °C, 5% CO2. Further supplement the medium with either normal L-lysine ¹H (Lys0) and L-arginine ¹²C₆, ¹⁴N₄ (Arg0) or stable isotope-labeled “medium” amino acids L-lysine ²H₄ (Lys4) and L-arginine ¹³C₆, ¹⁴N₄ (Arg6) for experiments based on two cell populations. Culture the cells for at least six cell divisions to fully incorporate the SILAC amino acids. The medium can also be supplemented with stable isotope-labeled “heavy” L-lysine ¹³C₆, ¹⁵N₂ (Lys8) and L-arginine ¹³C₆, ¹⁵N₄ (Arg10) for experiments based on three cell populations such as the double PCP-SILAC experiment ( Fig. 18.2 ). To reduce the amount of expensive isotope-labeled amino acids, arginine can be omitted or added at 1/3 the normal concentration in the RPMI medium. If arginine is omitted, digest proteins to peptides using only endoprotease Lys-C and not trypsin.
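As a rough illustration of what the labels do at the peptide level, the sketch below computes the expected mass shift of a peptide in the medium and heavy channels by counting its lysine and arginine residues. The monoisotopic shifts in the table are approximate values derived from the isotope compositions named above (about 4.025 Da for Lys4, 6.020 Da for Arg6, 8.014 Da for Lys8 and 10.008 Da for Arg10); the function name and the example peptides are hypothetical, and complete label incorporation is assumed.

```python
# Approximate monoisotopic mass shifts (Da) per labelled residue.
LABEL_SHIFTS = {
    "light": {"K": 0.0, "R": 0.0},          # Lys0 / Arg0
    "medium": {"K": 4.0251, "R": 6.0201},   # Lys4 / Arg6
    "heavy": {"K": 8.0142, "R": 10.0083},   # Lys8 / Arg10
}


def silac_mass_shift(peptide, channel):
    """Return the mass shift (Da) of `peptide` relative to the light form.

    `peptide` is a one-letter amino acid string; only K and R carry label.
    Assumes complete incorporation of the labelled amino acids.
    """
    shifts = LABEL_SHIFTS[channel]
    return sum(shifts.get(residue, 0.0) for residue in peptide)


# A peptide ending in lysine shifts by one Lys4 in the medium channel.
shift = silac_mass_shift("ELVISK", "medium")
```

A peptide without lysine or arginine shows no shift in any channel, which is one reason the choice of protease matters when arginine labelling is omitted.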

Figure 18.2 . Isolation of centrosomes and identification of centrosomal proteins by PCP-SILAC. (A) Schematic outline of the protocols for centrosome isolation and sample preparation for mass spectrometry-based proteomics. Centrosomes are detached from the cytoskeleton network, nuclei, and intermediate filaments by treatment of cells with nocodazole and cytochalasin-D and by cell lysis in a low ionic strength buffer. Chromatin is removed by aggregation and centrifugation. Centrosomes are sedimented onto a sucrose cushion and further enriched by sucrose gradient centrifugation. Proteins in each of the collected fractions are digested with trypsin, and the resulting peptides are analyzed by LC-MS. (B) Protein correlation profiling (PCP) of centrosomal proteins. Protein abundance profiles are obtained from the LC-MS data by integrating the peptide ion intensity signal for each protein in each fraction. The leftmost table displays the abundance profile of selected centrosomal marker proteins and the right table shows the relative values after normalization. The plot reveals centrosome-containing fractions and can be used to differentiate genuine centrosomal proteins from copurifying unspecific proteins by comparing the profiles for all proteins with profiles for known centrosomal proteins. (A, C) PCP of centrosomal proteins from SILAC-labeled cells (PCP-SILAC). Centrosomes isolated from different isotope-labeled cell populations can be used to profile proteins more accurately and thereby increase the confidence in the assignment of centrosomal proteins. The centrosome-containing fractions from light-labeled cells are combined to generate an internal standard, which is then distributed to the corresponding fractions prepared from medium- and heavy-labeled cells. Peptides are prepared from the mixed samples and analyzed by LC-MS. (C) Relative enrichment profiles for each protein in the medium- and heavy-labeled preparations are derived from LC-MS data by calculating the medium/light and heavy/light protein ratios for all proteins in each fraction. The profiles shown represent known and candidate centrosomal proteins and a single unspecific protein (RPL6).
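The profile comparison described in the caption can be sketched as follows. This is a minimal illustration under assumptions the caption does not specify: profiles are normalized to their maximum fraction, and similarity to a marker profile is scored by the sum of squared differences. The function names and the intensity values are hypothetical.

```python
def normalize_profile(intensities):
    """Scale per-fraction intensities so that the peak fraction equals 1."""
    peak = max(intensities)
    if peak == 0:
        return [0.0 for _ in intensities]
    return [value / peak for value in intensities]


def profile_distance(profile_a, profile_b):
    """Sum of squared differences between two normalized profiles.

    Smaller values mean the two proteins distribute more similarly
    across the gradient fractions.
    """
    a = normalize_profile(profile_a)
    b = normalize_profile(profile_b)
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical intensities across four gradient fractions.
marker = [2.0, 4.0, 8.0, 4.0]        # known centrosomal protein
candidate = [1.0, 2.0, 4.0, 2.0]     # same shape, lower abundance
contaminant = [8.0, 4.0, 2.0, 1.0]   # peaks in a different fraction

d_candidate = profile_distance(marker, candidate)
d_contaminant = profile_distance(marker, contaminant)
```

Because the profiles are normalized first, a low-abundance protein that co-fractionates with the marker scores as close as a high-abundance one; only the shape of the distribution matters.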

What Are mRNA Vaccines, and Could They Work Against COVID-19?

Earlier today, the drug maker Moderna announced the coronavirus vaccine it created was 94.5 percent effective in a major trial. The news came a week after Pfizer and bioNTech announced their coronavirus vaccine was more than 90 percent effective. The results from both companies, which exceeded expectations, came from large, continuing studies and were not published in peer-reviewed journals. Still, the results are a sign of hope (the companies may seek permission for emergency use in the United States within weeks), though experts caution the vaccines will not likely be widely available for several months.

Back in July, the U.S. government spurred the race to develop a vaccine when it agreed to pay $4 billion to six pharmaceutical companies in return for the promise of delivering 100 million doses of a new vaccine against the novel coronavirus by early 2021. This timetable is breathtakingly fast, as new vaccine development typically requires several years, but it demonstrated the urgency with which scientists around the world are trying to slow down Covid-19.

The sprint for a vaccine brings a new technique to the fore: using messenger RNA (mRNA). If successful, both Moderna and Pfizer’s/bioNTech’s creations would be the first-ever commercially available mRNA vaccines for any virus.

What is an mRNA vaccine?

Inside the human body, messenger RNA carries the information from DNA that cells use to make proteins, which regulate our cells and tissues. Viruses use RNA for a much more devilish purpose. They lack the cellular machinery to replicate themselves, so they invade healthy cells and propagate within them, sometimes causing sickness or death. For example, the mRNA in the novel coronavirus behind Covid-19 enables a “spike protein” that pierces cells throughout the body. This is particularly damaging whenever the virus invades the lungs, making the simple act of breathing difficult.

An mRNA vaccine contains a synthetic version of the RNA that a virus uses to form proteins. The vaccine doesn’t contain enough genetic information to produce viral proteins, just enough to trick the immune system into thinking a virus is present so that it will spring into action to make antibodies, which are proteins specifically designed to fight a virus.

Traditional vaccines, such as those for flu or measles, activate the immune system by injecting people with small amounts of a virus. Vaccines may include weaker “attenuated” forms of the virus, or a virus that scientists have killed but whose viral proteins can still stimulate immunity. Drew Weissman, an immunologist at the University of Pennsylvania and an expert on mRNA vaccines, says that in some very rare cases the virus is not dead despite best efforts to kill it, or the attenuated dose is so strong it makes some people sick. The mRNA vaccines eliminate that concern because they do not contain any virus.

"You can never make an infectious virus with mRNA," he says.

Another weakness of traditional vaccines, he says, is that they can take a long time to develop. To make a vaccine, scientists typically grow a weakened form of the virus in chicken eggs and test which parts of the virus successfully elicit antibodies. This can take four to six months in the case of the annual flu vaccine, even though scientists already know how to make these vaccines and which flu strains are likely to predominate any given year. With a brand-new virus, the vaccine-making process can stretch into years or even decades. Large-scale testing of a new vaccine, while necessary to assure safety, also takes time.

"Let's say you want to make a killed virus,” Weissman says. “First you have to figure out how to grow it, and how to grow it at large scale. Then you have to figure out to kill it, but not change it so it no longer makes an immune response that protects the host. Then after you do that, you have to show that, in fact, the virus is dead.”

With a pandemic going on, speed is of the essence, and so vaccine researchers are trying to accelerate that timetable. "The advantage of RNA is that it takes you literally days to make a new vaccine," Weissman says.

Once researchers determine the mRNA that results in the virus in question producing its proteins, scientists can make synthetic RNA that becomes the basis of a new vaccine. In an ideal scenario, scientists would use specially selected enzymes to stimulate the production of this synthetic mRNA, and then wrap the mRNA in protective wrapping to prevent it from degrading.

So where are our mRNA vaccines?

The possibility of mRNA vaccines has existed since 1990 when researchers first injected mRNA into mice and elicited antibody production. In these early years, mRNA delivery was dangerous; mice sometimes died due to excessive inflammation after receiving the RNA. These unfortunate mice had activated what is known as the innate immune response, an indiscriminate strategy that mammals use to resist anything that might be harmful. This was a serious hurdle, as researchers could not make a useable mRNA vaccine without figuring out how to suppress this response, Weissman says.

The story began to change in the mid-2000s when Weissman and his colleague Katalin Karikó discovered how to reduce or eliminate the risk of inflammation. The answer turned out to be adding substances such as carbon atoms to mRNA without changing its function. "When you change the structure of some of those RNA bases, you get rid of the inflammatory potential of the RNA," Weissman says.

These additions block sensors on cells from overreacting to the newly injected mRNA. This understanding has been incorporated into the vaccines Moderna and Pfizer/bioNTech are testing. (Karikó is the senior vice president of bioNTech; Weissman is an advisor to bioNTech.)

Back in July, both Moderna and Pfizer/bioNTech began studies of their mRNA vaccines in about 30,000 people apiece, hoping to show their vaccines are safe in large groups of people and effective at building some immunity to the coronavirus. With the November results, the world is one step closer to its first mRNA vaccine and a way to slow the Covid-19 pandemic.

Sara Suliman, an immunologist at Harvard, says the sheer scale of the COVID-19 pandemic means that multiple vaccine types will be necessary, mRNA and otherwise. “In the case of COVID we can’t put all our eggs in one basket,” Suliman says. “Ideally, you want to give the whole world this vaccine,” she adds, arguing that no single company can meet global vaccine demand.

In less extreme times, Suliman says, companies would not manufacture millions of vaccine doses without solid proof that a vaccine will enable long-lasting immunity. With COVID-19, though, companies may start to produce millions of doses based on less-solid evidence so they can be ready for distribution as soon as governmental groups like the FDA approve them.

Drew Weissman sees a big future for mRNA vaccines after the pandemic, too. Perhaps, he says, one day a single mRNA vaccine (sometimes supplemented with booster shots) could replace the 20 or so vaccines children receive today. Suliman, though, is more cautious, pointing out that the measles vaccine already works well as is and doesn’t need reconfiguring. She says we should save mRNA for new vaccines to face new threats—not reinvent the wheel.


Diffuse and Intestinal-Type Gastric Adenocarcinomas Exhibit Strong Differences in mRNA Expression Profiles

Gastric adenocarcinoma samples that could clearly be assigned to either the intestinal or diffuse type and that contained at least 75% tumorous tissue were used to form the two histological groups (diffuse n=19, intestinal n=24). Genes differentially expressed between these groups were identified by Welch-test. Significance in differential expression was accepted at multiple testing corrected P<0.05 and a meaningful difference in expression at a fold change (FC) ≥2. Each multiple testing method yielded a different number of genes with significant differences in expression. Applying the Bonferroni FWER, the most conservative and stringent multiple testing technique available, yielded 207 unique annotated transcripts and 44 unique non-annotated ones (322 probe sets). The more moderate technique of Benjamini and Hochberg FDR identified 1280 unique annotated transcripts and 253 unique non-annotated ones (2071 probe sets). In both test systems, the majority of differentially expressed genes (e.g. ~73% for FDR) were upregulated in diffuse-type tumors, whereas a smaller number of genes (e.g. ~27% for FDR) were upregulated in the intestinal type (Figure 1). Annotated lists of the 50 genes that showed the most significant upregulation in either histological type can be found in Supplementary Tables S1 and S2. Two-way hierarchical clustering using the obtained gene lists resulted in dendrograms comprising two major sample clusters representing the two histological types and two major gene clusters indicating the direction of regulation. However, the use of the gene list gained by FDR resulted in 95.3% (41 of 43) of the samples being grouped to the respective cluster (Figure 1a), whereas the more stringently selected list of FWER produced a sample dendrogram in which 97.7% (42 of 43) were grouped ‘properly’ (Figure 1b). The one misgrouped sample (Figure 1b) was a diffuse-type tumor clustered to the intestinal tumor type. It possessed a very dense tumor cell population with almost no stroma present, a rather uncommon feature of diffuse-type gastric adenocarcinomas.
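The difference between the two corrections can be made concrete with a short sketch. It takes raw P-values as given (the Welch test itself is not reproduced here), applies the Bonferroni and the Benjamini and Hochberg adjustments, and then filters by the thresholds stated above. The function names, the example P-values and the convention that fold changes are supplied as magnitudes of at least 1 are assumptions made for illustration only.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each P-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values (step-up FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value so that adjusted
    # values never increase as the raw P-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(adjusted, fold_changes, alpha=0.05, min_fc=2.0):
    """Indices passing both the corrected P-value and fold-change cut-offs."""
    return [
        i for i, (p, fc) in enumerate(zip(adjusted, fold_changes))
        if p < alpha and fc >= min_fc
    ]


raw_p = [0.001, 0.02, 0.03, 0.5]   # hypothetical raw P-values
fold = [3.0, 2.5, 1.5, 4.0]        # hypothetical fold-change magnitudes

strict = select_genes(bonferroni(raw_p), fold)
moderate = select_genes(benjamini_hochberg(raw_p), fold)
```

On this toy input the Bonferroni list is a subset of the Benjamini and Hochberg list, mirroring the relationship between the 322 and 2071 probe sets reported above.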

Two-way hierarchical clustering heatmaps of genes differentially expressed between human diffuse and intestinal-type gastric adenocarcinomas. Genes whose expression was significantly different between diffuse and intestinal-type samples were identified by Welch-test. Different multiple testing techniques were applied to correct for false-positives and resulted in different numbers of significant probe sets passing the test (corrected P<0.05, fold change ≥2). Different lists were subsequently used for two-way hierarchical clustering. Normalized expression intensities of probe sets are depicted as a heatmap. ‘Euclidean distance’ and ‘complete linkage’ were used as distance metric and linkage algorithm for all clusterings. (a) Hierarchical clustering heatmap obtained for the signature of 1533 genes (2071 probe sets) identified by application of Benjamini and Hochberg False Discovery Rate (FDR). (b) Hierarchical clustering heatmap obtained for the signature of 251 genes (322 probe sets) identified by application of Bonferroni Family Wise Error Rate (FWER). (c) Hierarchical clustering heatmap obtained when using the probe set for THBS4, the gene with the strongest significance in this test system, alone.
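For readers unfamiliar with the clustering settings named in the legend, the following is a minimal pure-Python sketch of agglomerative clustering with Euclidean distance and complete linkage. It is not the software used in the study; it clusters one dimension only (samples), stops at a requested number of clusters instead of building a full dendrogram, and uses made-up expression vectors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(points, n_clusters):
    """Merge clusters until `n_clusters` remain; return lists of indices.

    Complete linkage: the distance between two clusters is the largest
    distance between any member of one and any member of the other.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > max(1, n_clusters):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(cluster) for cluster in clusters]


# Four hypothetical samples measured on two probe sets.
samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
groups = complete_linkage(samples, 2)
```

Two-way clustering as in the figure would apply the same procedure a second time to the transposed matrix, so that probe sets are grouped as well as samples.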

Genes Upregulated in Diffuse and Intestinal-Type Gastric Adenocarcinomas Belong to Different Biological Processes

Very stringent multiple testing corrections, such as the Bonferroni FWER, lead to a trade-off in that they may yield high numbers of false-negatives. Hence, the significant genes identified by Benjamini and Hochberg FDR were used for biological interpretation. GO analysis uncovered that genes overexpressed in intestinal-type gastric adenocarcinomas are predominantly associated with proliferation and growth-connected processes, such as the cell cycle and mitosis (Table 1). In contrast, most of the genes upregulated in diffuse-type gastric adenocarcinomas encode for proteins of the extracellular matrix or for proteins that have important roles in adhesion or developmental processes (Table 2). None of the significantly enriched GO terms was shared between the two types.

Thrombospondin 4 – the Most Potent Marker for Diffuse-Type Gastric Adenocarcinoma in This Data Set

The probe set with the highest FC and lowest P-value in this test system represented the thrombospondin 4 (THBS4) transcript. It was 40.8-fold upregulated in diffuse tumors and had an FDR-corrected P-value of 1.65E−7. Two-way hierarchical clustering based on the THBS4 probe set alone yielded the same ‘correctness’ in sample grouping as when using the whole FDR generated gene signature (Figure 1c). Only two diffuse-type tumors were misgrouped to the intestinal tumor cluster. There were no common unusual histological features to these samples. One was a mucinous adenocarcinoma (G3) with low tumor cell density. The other, in contrast, was an adenocarcinoma (G3) with very dense tumor cells and almost no stroma.

Quantitative real-time PCR was conducted to validate the THBS4 microarray data (on all 43 samples). The analysis clearly confirmed the strong significance in differential mRNA expression (P<0.0001, Mann–Whitney U-test) and demonstrated that THBS4 mRNA is principally absent from the majority of intestinal-type gastric adenocarcinomas of this cohort, whereas increased amounts are present within the diffuse-type population (Figure 2). A pathological reevaluation revealed that the few intestinal-type tumors that showed some weak expression were all contaminated by small amounts of smooth muscle, mainly because the part of the tumor that had been analyzed was derived from a region where muscle layers had been infiltrated. The PCR analysis additionally showed that the two diffuse gastric adenocarcinoma samples that had been misgrouped to the intestinal population according to clustering (refer to Figure 1c) exhibit slightly higher amounts of THBS4 mRNA than the majority of intestinal samples, but lower amounts than found in the rest of diffuse-type gastric adenocarcinomas. The immunohistochemical detection of THBS4 protein within randomly selected tumor samples confirmed that transcriptional differences are reflected on the protein level. All the diffuse tumors that were examined (n=10) showed specific positivity for THBS4, whereas no notable positivity could be observed in the intestinal-type population (n=5) (Figure 3).

THBS4 mRNA expression in human diffuse and intestinal-type gastric adenocarcinomas. The mRNA abundance of THBS4 was examined by means of quantitative real-time PCR. Quantitation was done relative to the transcript of actin, β (ACTB). Significance in differential expression between groups was calculated using Mann–Whitney U-test. Expression value distribution within groups is displayed by box and whisker plots.
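The legend states only that quantitation was relative to ACTB and that groups were compared with the Mann–Whitney U-test. As a hedged illustration, the sketch below assumes the common 2^(−ΔCt) calculation for relative expression (an assumption, since the text does not give the formula) and computes the U statistic by direct pair counting without a P-value. All function names and Ct values are hypothetical.

```python
def relative_expression(ct_target, ct_reference):
    """Relative abundance as 2 ** -(Ct_target - Ct_reference).

    Assumes roughly 100% amplification efficiency for both assays.
    """
    return 2.0 ** -(ct_target - ct_reference)


def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: pairs where a > b, ties counted as 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical (THBS4 Ct, ACTB Ct) pairs for each tumor type.
diffuse = [relative_expression(t, r) for t, r in [(24.0, 20.0), (25.0, 20.0), (23.0, 19.0)]]
intestinal = [relative_expression(t, r) for t, r in [(32.0, 20.0), (31.0, 19.0), (33.0, 20.0)]]

u_stat = mann_whitney_u(diffuse, intestinal)
```

A U statistic equal to the product of the two group sizes means every value in the first group exceeds every value in the second; turning U into a P-value requires the null distribution, which is left to a statistics package.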

THBS4 expression in human diffuse and intestinal-type gastric adenocarcinomas and non-neoplastic gastric mucosa. Immunohistochemical detection of THBS4 (red) was performed on 10 μm thin cryosections. Cell nuclei were counterstained using hematoxylin (blue). In diffuse tumors, THBS4 expression was predominantly observed as fibrillar extracellular matrix structures of the tumor stroma (a–d). Occasionally, cytosolic positivity of cells resembling fibroblasts was detected (c, d). Expression was especially strong in tumor cell nests (a, c) and in regions of invasion into healthy epithelium (dotted line in b). Examples of signet ring cells are indicated with arrowheads. No specific THBS4 expression could be identified within intestinal-type tumors and epithelium and stroma of non-neoplastic gastric mucosa (region depicted was on the same specimen slide as diffuse-type sample a). Representative sections are shown, respectively. Negative controls were obtained by omission of primary antibody (data not shown). Pictures are × 400 magnified with scale bars representing 50 μm each.

Thrombospondin 4 Is a Highly Abundant Tumor Stroma Constituent of Diffuse-Type Gastric Adenocarcinomas

The localization of THBS4 within diffuse-type gastric adenocarcinomas was traced to the tumor stroma. All the specimens investigated were positive for THBS4 in extracellular fibrillar structures surrounding the tumor cells (Figure 3a and b). In some cases, additional intracellular positivity could be detected in the stroma (Figure 3c and d). Cells showing this cytosolic THBS4 expression were rather small in size and had a fusiform or spindle-like shape, sometimes accompanied by extended cell processes. All of these features hint at a potential fibroblast phenotype. THBS4 expression was particularly strong within regions of high tumor cell density, so-called tumor cell nests (Figure 3c), and at sites of infiltration into adjacent ‘healthy’ tissues (Figure 3b). Neither the epithelium nor stroma of matched non-neoplastic counterparts (n=5) exhibited any detectable THBS4 expression (Figure 3). The only aspects of the non-neoplastic gastric wall, which definitely showed THBS4 expression, were the smooth muscle layers (interstitial cells and perimysium) of muscularis mucosae and muscularis propria as well as vessel walls (Supplementary Figure S1).

In Diffuse-Type Gastric Adenocarcinomas, Thrombospondin 4 Is Expressed and Secreted by Cancer-Associated Fibroblasts

Immunohistochemical colocalization studies using markers for various cell entities were conducted to identify the cells featuring cytosolic THBS4 positivity, namely those that express and secrete the protein. Cytokeratins served as markers for cells of epithelial origin (in this case the carcinoma cells). Vimentin was employed to identify fibroblasts and mesenchymal cells in general. α-Smooth muscle actin was used to mark myofibroblasts, 35 which represent a subpopulation of cancer-associated fibroblasts in malignant transformation. None of the THBS4-positive cells showed any positivity for cytokeratins, whereas they all were positive for vimentin and α-smooth muscle actin (Figure 4). Thus, cancer-associated fibroblasts of the myofibroblast phenotype are the THBS4-expressing cells in diffuse-type gastric adenocarcinomas.

Colocalization of THBS4 and cytokeratin, vimentin and α-smooth muscle actin in human diffuse-type gastric adenocarcinomas. Simultaneous fluorescent immunohistochemical detection of THBS4 in red and pan-cytokeratin (KRT), vimentin (VIM) or α-smooth muscle actin (αSMA) in green was performed on 10 μm thin cryosections. Signals were scanned using a confocal laser scanning microscope. Representative pictures of single confocal sections are shown, respectively. Negative controls were obtained by omission of primary antibodies and scanned with identical settings (pinhole, excitation, frame average, etc.) to positive stainings, respectively (data not shown).

Additionally, in vitro cell line systems of diffuse-type gastric adenocarcinoma-derived cancer-associated fibroblasts and healthy counterpart-derived normal fibroblasts were examined for THBS4 mRNA expression. In the two cell line pairs that were examined, cancer-associated fibroblasts contained significantly higher expression levels (P<0.01, t-test) than normal fibroblasts with overexpression of ≈2 and ≈3.3 fold, respectively (Figure 5).

THBS4 mRNA expression in human diffuse-type gastric adenocarcinoma-derived cancer-associated fibroblasts and healthy counterpart-derived normal fibroblasts. Two cell lines of matched diffuse-type gastric adenocarcinoma-derived cancer-associated fibroblasts (CAF-32, CAF-33) and healthy counterpart-derived normal fibroblasts (NF-32, NF-33) were analyzed. The mRNA abundance was examined by means of quantitative real-time PCR. Quantitation was done relative to the transcript of actin, β (ACTB) and expression levels in NF-lines were set to 100%, respectively. Significance in differential expression was assessed using t-test ( ** P<0.01 *** P<0.001). Error bars represent integrated standard errors of the mean.
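The relative quantitation described above (target transcript normalized to a reference transcript, control line set to 100%) can be illustrated with a minimal sketch. It assumes the common 2^-ΔΔCt calculation with perfect amplification efficiency, which the text does not state explicitly; the function name and the Ct values are hypothetical and chosen only for illustration.

```python
# Sketch of relative qPCR quantitation, ASSUMING the 2^-ddCt method and
# 100% amplification efficiency (not stated in the source).
# All Ct values below are made up for illustration.

def relative_expression(ct_target, ct_reference, ct_target_ctrl, ct_reference_ctrl):
    """Return target expression as a percentage of the control sample (control = 100%)."""
    delta_sample = ct_target - ct_reference              # normalize sample to reference gene
    delta_control = ct_target_ctrl - ct_reference_ctrl   # normalize control to reference gene
    delta_delta = delta_sample - delta_control
    return 100.0 * 2.0 ** (-delta_delta)

# Hypothetical example: the target crosses threshold one cycle earlier in the
# CAF line than in the NF line, with identical reference-gene Ct values.
caf_percent = relative_expression(24.0, 16.0, 25.0, 16.0)
print(f"CAF expression relative to NF: {caf_percent:.0f}%")
```

Under these assumptions, a one-cycle difference corresponds to a two-fold change, so a value of 200% would be read as roughly two-fold overexpression relative to the control line.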

In order to obtain a more global view of THBS4-expressing cells, a panel of human cell lines from different entities was examined for THBS4 mRNA expression. This analysis confirmed that carcinoma cells, regardless of derivation, do not generally express THBS4. A very restricted expression pattern was found in cell lines from other malignancies. Only SH-SY5Y, Daudi and HL-60 exhibited measurable expression levels. The highest expression by far was discovered in HEK-293, a cell line derived from normal embryonic kidney. No relevant expression was found in the only fibroblast cell line in this study, derived from normal forehead skin and designated 142BR (Supplementary Figure S2).

Overexpression of Thrombospondin 4 in Gastric Fibroblasts Is Stimulated by Tumor Cells

To evaluate whether THBS4 expression in fibroblasts is triggered by tumor cells, an in vitro indirect coculture model was established. Two pairs of matching normal and cancer-associated fibroblasts were analyzed for changes in THBS4 mRNA expression when challenged with different tumor cell-conditioned media. The treatment with conditioned medium from OCUM-2M and OCUM-8, two cell lines derived from scirrhous gastric carcinomas (a subpopulation of diffuse-type gastric adenocarcinomas marked by excessive fibrosis), resulted in significantly elevated expression levels in both normal fibroblast lines (P<0.05, t-test). In cancer-associated fibroblasts, only OCUM-8-conditioned medium yielded significant increases (P<0.05, t-test), although OCUM-2M-conditioned medium showed a trend toward significance (P<0.1, t-test). The conditioned medium from the non-scirrhous, diffuse-type gastric adenocarcinoma-derived cell line MKN-45 did not significantly alter THBS4 expression in any of the fibroblast lines (Figure 6).

Changes in THBS4 mRNA expression in human diffuse-type gastric adenocarcinoma-derived cancer-associated fibroblasts and healthy counterpart-derived normal fibroblasts upon stimulation with tumor cell-conditioned media. Two cell lines of matched diffuse-type gastric adenocarcinoma-derived cancer-associated fibroblasts (CAF-32, CAF-33) and healthy counterpart-derived normal fibroblasts (NF-32, NF-33) were analyzed. Fibroblasts were incubated for 48 h with conditioned medium derived from human diffuse gastric cancer cell line OCUM-2M, OCUM-8 or MKN-45. Fresh tumor cell-specific medium was used as control medium, respectively. All experiments were run in triplicate with control experiments run on the same plate. THBS4 mRNA abundance was examined by means of quantitative real-time PCR. Quantitation was done relative to the transcript of GAPDH and expression levels in control medium-treated fibroblasts were set to 100%, respectively. Significance in differential expression was assessed using t-test (°P<0.1 * P<0.05 ** P<0.01). Error bars represent integrated standard errors of the mean. Results of representative experiments are shown.


Recent advances in single-cell technologies, such as droplet-based methods 2 , make it easy and inexpensive to collect hundreds of thousands of scRNA-seq profiles, allowing researchers to study very complex biological systems at high resolution. The resulting libraries are often sequenced at extremely low depth (only tens of thousands of reads per cell), making the corresponding read count data truly sparse. Hence, there is a growing need for reliable statistical methods that are scalable and that can account for zero inflation.

ZINB-WaVE is a general and flexible approach to extract low-dimensional signal from noisy, zero-inflated data, such as those from scRNA-seq experiments. We have shown with simulated and real data analyses that ZINB-WaVE leads to robust and unbiased estimators of the underlying biological signals. The better performance of ZINB-WaVE with respect to PCA comes at a computational cost, as we need to numerically optimize a non-convex likelihood function. However, we empirically found that the computing time was approximately linear in both the number of cells and the number of genes, and approximately quadratic in the number of latent factors (Supplementary Fig. 30). The algorithm benefits from parallelization on multicore machines and takes a few minutes on a modern laptop to converge for thousands of cells.

One major difference between ZINB-WaVE and previously proposed factor analysis models (such as PCA and ZIFA) is the ability to include sample-level and gene-level covariates. In particular, by including a column of ones in the gene-level covariate matrix, the corresponding cell-level intercept acts as a global-scaling normalization factor, allowing the modeling of raw count data, with no need for prior normalization.

However, there is no guarantee that the low-dimensional signal extracted by ZINB-WaVE is biologically relevant: If unwanted technical variation affects the data and is not accounted for in the model (or in prior normalization), the low-rank matrix W inferred by ZINB-WaVE will capture such confounding effects. It is therefore important to explore the correlation between the latent factors estimated by our procedure and known measures of quality control that can be computed for scRNA-seq libraries, using, for instance, the Bioconductor R package scater 43 (see Fig. 2f). If one observes high correlation between one or more latent factors and some QC measures, it may be beneficial to include these QC measures as covariates in the model.
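The check described above can be sketched in a few lines. This is only an illustration in plain Python, not the scater or zinbwave interface: the function names, the toy matrix `W`, the QC metric name and the 0.5 cutoff are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def factor_qc_correlations(W, qc):
    """Correlate each latent factor (column of W, cells x K) with each QC metric."""
    n_factors = len(W[0])
    columns = [[row[k] for row in W] for k in range(n_factors)]
    return {name: [pearson(col, values) for col in columns]
            for name, values in qc.items()}

def flag_confounded(correlations, threshold=0.5):
    """List (metric, factor index) pairs whose absolute correlation reaches the threshold."""
    return [(name, k)
            for name, rs in correlations.items()
            for k, r in enumerate(rs)
            if abs(r) >= threshold]

# Toy data: 4 cells, 2 latent factors, one hypothetical QC metric.
W = [[1.0, 0.2], [2.0, -0.1], [3.0, 0.3], [4.0, -0.2]]
qc = {"total_counts": [10.0, 20.0, 30.0, 40.0]}

correlations = factor_qc_correlations(W, qc)
flagged = flag_confounded(correlations)
print(flagged)
```

A flagged pair would suggest that the corresponding QC measure is a candidate sample-level covariate; the threshold is arbitrary and would need to be chosen for the data set at hand.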

Several authors have recognized that high-dimensional genomic data are affected by a variety of unwanted technical effects (e.g., batch effects) that can be confounded with the biological signal of interest, and have proposed methods to account for such effects in either a supervised 38 or unsupervised way 27,44 . Recently, Lin et al. 45 proposed a model that can extend PCA to adjust for confounding factors. This model, however, does not seem to be ideal for zero-inflated count data. In the scRNA-seq literature, MAST 23 uses the inferred cellular detection rate to adjust for the main source of confounding, in a differential expression setting, but is not designed to infer low-dimensional signal.

The removal of batch effects is an important example of how including additional covariates in the ZINB-WaVE model may lead to better low-dimensional representations of the data. However, ZINB-WaVE is not limited to including batch effects, as other sample-level (e.g., QC metrics) and/or gene-level (e.g., GC-content) covariates may be included in the model. Although we did not find any compelling examples in which adding a gene-level covariate leads to improved signal extraction, it is interesting to note the relationship between GC-content and batch effects 46 . With large collaborative efforts, such as the Human Cell Atlas 47 , on the horizon, we anticipate that the ability of our model to include gene-level covariates that can potentially help account for differences in protocols will prove important.

Although the low-dimensional signal inferred by ZINB-WaVE can be used to visually inspect hidden structure in the data, visualization is not the main point of our proposed method. The low-dimensional factors are intended to be the closest possible approximation to the true signal, which is assumed to be intrinsically low-dimensional. Such a low-dimensional representation can be used in downstream analyses, such as clustering or pseudotime ordering of the cells 15 .

Visualization of high-dimensional data sets is an equally important area of research and many algorithms are available, among which t-SNE 25 has become the most popular for scRNA-seq data. Recently, Wang et al. 48 have proposed a novel visualization algorithm that can account for zero inflation and showed improvement over t-SNE. As t-SNE takes as input a matrix of cell pairwise distances, which may be noisy in high dimensions, a typical pipeline involves computing such distances in PCA space, selecting, for example, the first 50 PCs. An alternative approach is to derive such distances from the low-dimensional space defined by the factors inferred by ZINB-WaVE. This strategy was used effectively in Fig. 3c to visualize the PBMC data set.

In this article, we have focused on an unsupervised setting, where the goal is to extract a low-dimensional signal from noisy zero-inflated data. However, our proposed ZINB model is more general and can be used, in principle, for supervised differential expression analysis, where the parameters of interest are regression coefficients β corresponding to known sample-level covariates in the matrix X (e.g., cell type, treatment/control status). Differentially expressed genes may be identified via likelihood ratio tests or Wald tests, with standard errors of estimators of β obtained from the Hessian matrix of the likelihood function. In addition, posterior dropout probabilities can be readily derived from the model and used as weights to unlock standard bulk RNA-seq methods 49 , such as edgeR 41 . We envision a future version of the zinbwave package with this added capability.
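To make the idea of posterior dropout probabilities concrete, the following sketch applies Bayes' rule to a zero-inflated negative binomial mixture. It assumes the usual parameterization (NB mean `mu`, dispersion `theta`, zero-inflation probability `pi`); the function names are hypothetical and do not correspond to the zinbwave package, and using one minus the dropout probability as an observation weight is only one plausible convention.

```python
def nb_zero_probability(mu, theta):
    """P(Y = 0) under a negative binomial with mean mu and dispersion theta."""
    return (theta / (theta + mu)) ** theta

def dropout_posterior(y, mu, theta, pi):
    """Posterior probability that an observed count y comes from the zero-inflation component."""
    if y > 0:
        return 0.0  # a positive count cannot come from the point mass at zero
    nb0 = nb_zero_probability(mu, theta)
    return pi / (pi + (1.0 - pi) * nb0)

def observation_weight(y, mu, theta, pi):
    """Weight for a downstream count model (assumed convention: 1 - dropout posterior)."""
    return 1.0 - dropout_posterior(y, mu, theta, pi)

# Toy parameters: mean 1, dispersion 1, 50% zero inflation.
p_zero = dropout_posterior(0, mu=1.0, theta=1.0, pi=0.5)
w_zero = observation_weight(0, mu=1.0, theta=1.0, pi=0.5)
w_pos = observation_weight(3, mu=1.0, theta=1.0, pi=0.5)
print(p_zero, w_zero, w_pos)
```

In this toy setting an observed zero is attributed to dropout with probability 2/3, so it would receive a weight of 1/3, while any positive count keeps full weight.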


An overview of using Xenopus laevis as a model system.

Using tractable organisms (model systems) to answer fundamental questions in medicine and biology has been common practice since antiquity. One of the model systems with important contributions is Xenopus laevis, the African clawed frog, a pseudotetraploid vertebrate which lives in fresh water. The reasons for its worldwide usage in research lie in the high degree of conservation of most essential cellular and molecular mechanisms; in addition, it is inexpensive and easily manipulated, and large amounts of material can be readily obtained for a variety of experimental procedures. This review discusses the material that is routinely obtained from Xenopus laevis, the most important experimental procedures, how the Xenopus system is used in research, and the types of questions that can be answered only by using Xenopus.

The breeding can be manipulated so that each frog can give eggs up to 3-4 times a year and the material used in research studies can vary from oocytes to cell-free extracts. The egg production is controlled by subcutaneous injections of 50-100 units of pregnant mare serum (PMS) in their dorsal lymph sacs several days before the second injection with 600-800 units of human chorionic gonadotropin (hCG) 12-16 hours before the desirable egg laying. Right after the second injection, they are placed in individual tanks and the next day the eggs are collected.

The oocytes are large cells of >1 mm in diameter. They contain a huge nucleus (or germinal vesicle), 100,000 times larger than a somatic cell nucleus, which occupies approximately one-third of the oocyte’s volume, and they develop rapidly. The vesicle is surrounded by a nuclear envelope with large pores that facilitate transport to and from the cytoplasm. Notably, the oocytes have a synchronous cell cycle and they arrest at the first meiotic prophase until activation by progesterone. Progesterone activation results in meiotic maturation of the oocytes. The maturation stage is controlled by the activity of the maturation-promoting factor (MPF) [2], which is also known as p34cdc2 kinase and growth-associated histone H1 kinase. Upon progesterone activation and an increase of MPF activity, the nuclear structure collapses, the spindle is formed and the first meiotic division is completed. The cycle is then arrested again at the metaphase stage of the second meiotic division by the action of the cytostatic factor (CSF), and this marks the mature oocyte awaiting fertilization [3, 4]. After fertilization, i.e., the merging of the sperm with the egg, the cytoplasmic pool of calcium is increased and this results in inactivation of both MPF and CSF, and release of the cell cycle (Figure 1) (see videos that show Xenopus development upon egg fertilization).

The material that Xenopus can offer in biology is great in terms of quality and quantity. For example, one frog ovary corresponds to 1000 mouse ovaries [5] and western blots can be performed with protein samples from just one oocyte (or 1 uL of extract – see below). Oocytes are very rich in all the RNA and proteins necessary for the first stages of development until early embryogenesis, i.e., the tadpole stage. They can contain up to 4 ug of ribosomal RNA; each oocyte is therefore capable of synthesizing up to 400 ng of protein per day. Single-cell experimentation in Xenopus laevis oocytes is routine for experimental biologists. In addition, they lack transcriptional activity until the mid-late blastula stage. It is estimated that the protein content is about 500 ug in each oocyte and approximately 10-50 mg/mL in the extract.

Xenopus laevis ovaries are large and transparent and may contain several hundreds of, also large, oocytes which are easily obtainable. Despite the difference in size, the processes of oogenesis and maturation are conserved in mammals. Therefore, they are extensively used by researchers studying the cell cycle.

The oocytes can be manually dissected from the ovaries using forceps, or they can be extracted by digestion with collagenase (enzymatic defolliculation) of the four extracellular connective tissue layers that surround them. The advantages and disadvantages of either method have been discussed in detail in [6]. The nucleus can then be manually removed from the oocyte. A pair of forceps with sharp tips is used to pierce a small opening in the cell, and the nucleus is extracted by a squeezing force applied with another pair of forceps with blunt tips [6, 7]. This results in oocytes without any nucleus (enucleated oocytes) and isolated nuclei. The experimental protocol for enucleation is selected according to the downstream application, which can be immunoblotting [8], injection of somatic nuclei and RNA synthesis for analysis of transcription [9], exogenous protein expression by injection of complementary RNA (cRNA) in the cytoplasm or of complementary DNA (cDNA) in the nucleus [10], injection of peptides or recombinant proteins [11] and others. Fully functional nuclei can be isolated and used to study all nuclear processes including gene expression, chromatin dynamics, nuclear import and export of fluorescently labeled proteins or function of the nuclear pore complexes [12, 13].

The volume of one Xenopus laevis oocyte is approximately 1 uL. They can tolerate an injection of up to 50 nL using automatic nanoinjectors, without any adverse effect in the function. Several oocytes are arrayed onto special slides and sequentially injected with a nano-injector, as shown in this video.
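The volumes quoted above allow a rough estimate of how far an injected solution is diluted inside the oocyte. The sketch below assumes, as a simplification not made in the text, that the injected material mixes uniformly with the whole 1 uL oocyte volume; the function name and the example stock concentration are hypothetical.

```python
OOCYTE_VOLUME_NL = 1000.0  # approximately 1 uL, as stated in the text
MAX_INJECTION_NL = 50.0    # largest injection volume the text reports as tolerated

def final_concentration(stock_conc, injection_nl, oocyte_nl=OOCYTE_VOLUME_NL):
    """Concentration after injection, ASSUMING uniform mixing in the whole oocyte.

    The result has the same units as stock_conc.
    """
    if not 0 < injection_nl <= MAX_INJECTION_NL:
        raise ValueError("injection volume must be in the range (0, 50] nL")
    return stock_conc * injection_nl / (oocyte_nl + injection_nl)

# Hypothetical example: 50 nL of a 1050 ng/uL stock.
conc = final_concentration(1050.0, 50.0)
dilution_factor = 1050.0 / conc
print(conc, dilution_factor)
```

With the maximum 50 nL injection, the stock is diluted about 21-fold under this assumption; real intracellular distribution is uneven, so the figure is only an order-of-magnitude guide.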

Injections can be used for either gain-of-function (GOF) or loss-of-function (LOF) experiments or experiments involving both. Gene function can be easily manipulated by injecting DNA, RNA, plasmid DNA [14], morpholino oligonucleotides [15] or protein molecules. Several examples can be found in the literature; for example, the first experiments that led to the cloning of interferon [16] and the development of morpholino technology [17] were performed in Xenopus oocytes. Morpholinos have been noted to generate a high rate of false positives [18]. Generation of transgenic Xenopus laevis embryos has also been described, and such embryos have been used to assess gene expression involved in pathological conditions [19, 20]. Chromatin assembly can be studied in vivo in oocytes after injection of DNA into the germinal vesicle. Complementary strand synthesis or replication can be monitored by co-injection of radiolabeled dNTPs, while nucleosome assembly can be assessed with a supercoiling assay and a micrococcal nuclease (MNase) assay. Virus replication is also possible after microinjection of viral RNA, which produces infectious viruses [21]. Apoptotic trigger waves were also identified through injection [22]. Li L et al injected morpholino oligos, TALEN mRNA, and full-length mRNAs to study the function of Etv6 [23]. Clairfeuille T et al injected cRNA for human and Blattella germanica Nav (synthesized through an mMessage mMachine kit from Thermo Fisher) into Xenopus laevis oocytes, sourced from either Xenopus one or Ecocyte, to study the inactivation of voltage-gated sodium channels [24].

As in other model systems used in developmental biology, transplantations can be easily performed in Xenopus oocytes. As explained above, due to the large oocytes’ size, enucleation is a routine procedure with Xenopus. An oocyte can be injected with nucleotides, such as morpholino or siRNA, which results in the loss of function of a target of interest. Then, the nucleus from this mutant oocyte or tadpole can be transferred into a ‘wild-type’ oocyte, whose nucleus has been removed or destroyed, and the healthy nucleus can be transplanted into a mutant oocyte and compared (Figure 2). These experimental procedures could replace strategies, such as the generation of transgenic clones, which are not very feasible in frogs yet. For more information read below the Gurdon experiment.

Furthermore, ion channels and membrane receptors from postmortem brains of humans with Alzheimer's disease have been microtransplanted into Xenopus oocytes, where they were functional, with applications in studying channel activity related to human diseases [25-27]. Similarly, Xenopus oocytes are a great system for the expression and study of neurotransmitter receptors from human brain or cultured cells and their ligands (drugs, pesticides, etc) after injection of mRNA, cDNAs or membrane vesicles from native tissues. The currents generated in such experiments are easily measured because of the large size of the oocyte; hence the Xenopus system has made significant contributions to the study of neurological disorders, such as epilepsy [28].

Chromatin assembly can be simply monitored after injection of exogenous proteins or after GOF or LOF experiments [29]. Similarly, proteins can be extracted for western blot analysis. Chromatin immunoprecipitation (ChIP) is a technique used widely with cell lines to identify the interplay between transcription factors or epigenetic markers and specific DNA elements, such as promoters. ChIP can also be applied in Xenopus oocytes using the same principles but with much less starting material (10-15 oocytes) [30, 31]. In some cases Xenopus itself is not used, but its chromatin components, such as histones, are examined [32].

Because of the large amount of material that Xenopus oocytes and extracts offer, they have been used in initial screens for small molecules that could be used in therapies of human diseases, targeting processes such as cancer metastasis and generation of new blood vessels [33], protein degradation and DNA damage and repair [34, 35].

The large size of the oocyte poses some problems in microscopy experiments (difficulty for antibodies to penetrate, background signal from the cortical pigment and the yolk platelets), making them less straightforward and requiring optimization. Despite the difficulties, several protocols have been developed to study the localization and organization of the cytoskeleton and the cytoskeletal proteins (e.g., actin, tubulin, keratin) by immunofluorescence and confocal microscopy (for a full discussion see Becker and Gard, 2006 [36]). Xenopus oocytes are also the preferred cells to study Ca2+ signaling.

The extract obtained from unfertilized Xenopus eggs is a cell-free system which complements and sometimes replaces other model systems. It recapitulates all biological functions of the oocyte, and has been shown to self-organize into cell-like structures and undergo multiple rounds of division under the right conditions [37]. The extract is a widely used system to study the cell cycle in vitro, especially the processes of DNA replication and nucleus formation. These are very important processes for genome stability and any defect can contribute to numerous pathological conditions, such as cancer. Most, if not all, of these characteristics are conserved across vertebrates, including humans. Therefore, Xenopus egg extracts provide a simple way to study vital cellular processes without the need to compensate for anything.

The first extracts were prepared by Lohka and Masui [38] and were widely used after Blow and Laskey and Hutchison et al in the mid-80s. These extracts could complete one cell cycle in vitro and could be stored frozen in liquid nitrogen and when fresh could exhibit several cell cycles [39, 40]. This cycle included the decondensation of the chromosomes, formation of the nuclear envelope, complete replication of DNA in a semi-conservative way as in humans, re-condensation of the chromosomes, break-down of the nuclear envelope, mitotic spindle formation and separation of the sister chromatids.

Since then, several variations of the original protocols for extract preparation exist from laboratory to laboratory according to available equipment and personal experience [1, 3, 41]. The differences are however small and all provide high-quality egg extracts that can be stored in liquid nitrogen for years.

Generally, the jelly coating of eggs is removed (dejellying) and the eggs are crushed by sequential centrifugations (Figure 3). When preferred, the unfertilized eggs are treated with calcium ionophore before crushing, which inactivates the cytostatic factor (CSF) and releases the cell cycle from the metaphase arrest into the next interphase in just a few minutes. The extracts obtained from these Ca2+-induced eggs are interphasic. Alternatively, the direct addition of calcium to the extract of uninduced eggs (CSF-arrested mitotic extract) has the same effect and results in the replication of sperm DNA. After the removal of debris, pigment granules, lipids and yolk proteins, several types of extracts can be obtained depending on the scientific question to be studied (Figure 3).

In more detail, low-speed extracts (LSE) contain the cytoplasm, including light membranes, ribosomes and nuclear envelope, and are prepared by medium-speed centrifugation. They are either interphasic (from Ca2+-induced eggs, as explained above) or mitotic (from non-induced eggs). Interphasic LSE support chromatin assembly, pre-replication complex (pre-RC) formation and loading on chromatin, origin firing, DNA replication and nuclear assembly, while mitotic LSE support chromosome condensation and mitotic spindle formation and do not replicate DNA unless activated by Ca2+.

High-speed extracts (HSE) are prepared from LSE after additional high-speed centrifugation (100,000g) which separates the cytoplasm from membranes and ribosomes. Thus, HSE cannot form nuclei, DNA is assembled into chromatin and pre-RC is loaded, but double-stranded DNA cannot be replicated. However, it will synthesize the complementary strand when a single-stranded DNA molecule is added.

Transfection of DNA molecules in eukaryotic cells does not result in sufficient replication. On the other hand, Xenopus egg extracts are a powerful system, together with yeast, to study all stages of DNA replication, especially licensing of origins of replication and initiation, as they reproduce events in a semiconservative mechanism, like cells in vivo [42, 43]. In principle, an interphasic extract of good quality will start licensing the origins of DNA replication during the first minutes after the addition of an ATP regenerating mix and template DNA (demembranated sperm or plasmid DNA). DNA synthesis will start after 20-30 minutes, and within 60-80 minutes at 23 °C the extract will complete one round of DNA replication, fully decondense chromatin and form fully functional nuclei (Figure 4C). The nuclei then enter mitosis, unless cycloheximide has been added to inhibit protein synthesis; in that case, cyclin B is not synthesized to drive mitosis and the extract is arrested in a G2-like state.

Generally, progress through the cell cycle can be monitored by: immunofluorescence after the addition of biotin-labeled dUTPs (Figure 4E-F) or DNA replication assay after the addition of radiolabeled dNTPs and TCA precipitation (Figure 4D) [1].

Since interphasic egg extracts do not have an assembled nucleus but contain all necessary components to construct a nucleus from scratch, they are an advantageous system to study nuclear assembly, nuclear pore and nuclear envelope formation. For instance, when a fluorescently labeled molecule is added to the extract, its transport to and from the nucleus can be easily studied. Likewise, the removal of a structural component of the nuclear envelope or nuclear pore could be used in such studies.

The great advantage of Xenopus extracts is the ease they provide in immunodepletion of particular components or addition of proteins, inhibitors or other molecules. These are standard procedures and there is no need for transfection or similar manipulations that are required, for example, with cultured cell lines. It is a simple addition to the tube. Besides, the cytosolic extract has no distinct cellular compartments and contains all proteins (e.g., cytosolic, nucleoplasmic, chromatin-bound) in one pool. Thus, the addition/depletion/inhibition of proteins is readily effective and the outcome of this manipulation is easily measured [44]. For example, addition of high concentrations of okadaic acid, an inhibitor of protein phosphatases and especially protein phosphatase 2A (PP2A), directs the extract to bypass replication and to go directly to mitosis (Figure 5, lower panel). Similarly, immunodepletion of cyclin-dependent kinase 2, the major kinase in all stages of DNA replication, results in impaired replication, as shown by a large reduction in the number of replication foci (Figure 5, middle panel).

Xenopus egg extracts (as well as oocytes) provide an excellent environment to study posttranslational modifications (PTMs) of recombinant proteins. The proteins and signaling pathways in Xenopus show a high degree of conservation with humans [45]. Therefore, a recombinant protein can find its natural partners in an egg extract and thus retain full functionality. This is not always the case in bacteria or yeast, which do not have all the posttranslational modifications found in higher mammals. Egg extracts are highly concentrated compared to extracts from cultured cell lines (i.e. little material is required). The greatest advantage is that they offer a somewhat physiological context in either interphasic or mitotic egg extracts. Hence, when adding a binding partner as “bait” or a substrate of an enzyme in Xenopus extract, the probability to find the physiologically relevant interactors and upstream or downstream effectors at the appropriate stage of the cell cycle is very high.

Epigenetics and histone modifications are considered important for embryonic development and the root of several pathological conditions, including cancer. Chromatin of interphasic or mitotic Xenopus egg extracts can be readily purified at various stages (e.g., pre-licensing of replication origins, DNA replication firing, late S-phase, etc) through a sucrose gradient and compared. Purified chromatin can be assessed by immunoblotting, immunofluorescence, etc. Xenopus offers the ability to identify novel PTMs that are difficult to find in more complex samples, such as cell lines. Several other resources for genetic, genomic and proteomic studies can be found here [46].

The Xenopus model system has numerous other applications, such as the understanding of regeneration [47], microtubule nucleation [48], phase separation [49], brain development [15] and many others. Other areas of research to which Xenopus embryos have made significant contributions are eye and vision research, heart development and immunology. Based on embryological analysis of the Xenopus eye after manipulation of gene expression specifically in the eye by microinjection of the blastomeres, several genes and signaling pathways crucial for eye development and retina fate have been uncovered (Xenopus White Paper 2009). In terms of gene expression, Mughal BB et al have identified reference genes such as clta.L, sub1.L etc. for quantitative real-time PCR studies in developing Xenopus laevis [50]. DeLay BD et al established the procedure for kidney-specific knockout of the lhx1 gene with CRISPR/Cas9 in the allotetraploid frog [51].

In conclusion, the main advantages of the Xenopus system are the large oocytes and cell-free extracts, which offer plenty of material (protein, DNA, RNA, etc) for biochemical studies, as well as ease of microinjection, cell cycle synchrony, and conserved molecular mechanisms. Given that embryonic development shares many characteristics with human cancers, Xenopus laevis and X. tropicalis are great tools not only to study tumorigenesis but also to answer general fundamental questions in biology and medicine [45, 52].

Xenopus laevis has become one of the most promising models to study the mechanisms of regeneration. Human tissue does not have the regenerative capacity of the amphibian skin and develops fibrosis after dermal injury. Therefore, the analysis of Xenopus regenerative blastemal cells and comparison with myofibroblast-mediated scarring in humans would help to understand regenerative processes in humans. Aztekin C et al identified a regeneration-organizing cell in the Xenopus tail using single-cell RNA-seq [53]. For example, a recent study used Xenopus cells labeled with infrared laser‐evoked gene operator (IR-LEGO) in the study of skin regeneration [54, 55]. In addition, the genome sequencing in Xenopus laevis has already been finalized and can be applied to the gene expression profiling in the amphibian regenerative cells [56]. Furthermore, another recent publication presented several assays to study cellular and molecular wound-healing mechanisms in Xenopus embryos [57]. The described methods included generation of knockdown and overexpression constructs and incorporation of mRNAs encoding fluorescent markers.

Regeneration of central and peripheral structures of the nervous system is another area of research in which Xenopus laevis has proven to be an effective study model. The use of transgenic Xenopus for studies of axon regeneration includes the introduction of one or two labeling molecules, such as dextran amines, to follow the regrowth of experimentally damaged axons [58]. Another recently published protocol uses Xenopus to investigate regeneration of the spinal cord [59]. The study is based on the ability of this amphibian species to regenerate the spinal cord at the larval stage of development. In addition, a recent study applied quantitative proteomics to analyze the variety of proteins expressed in the Xenopus spinal cord at regenerative and nonregenerative stages [60]. With regard to the visual sensory system, Xenopus has been used to study the regeneration of eye structures, such as the lens and retina [61, 62].

Xenbase, the Xenopus model organism database, is a vital resource for all researchers [63-65]. It contains genomic, mRNA, proteomic and functional annotations and is directly linked to several databases, such as NCBI and UniProt. The Xenopus laevis genome has only recently been fully sequenced. Initially, at the beginning of the 2000s, its tetraploid genome was expected to cause problems that would not arise if the diploid Xenopus tropicalis were chosen for sequencing. Indeed, the knowledge obtained from the X. tropicalis genome, together with recent advances in DNA sequencing, helped to get the Xenopus laevis genome fully sequenced and assembled. As a result, proteomics studies can now be performed routinely [66-70]. Genes can also be obtained from Source Biosciences [49].

Zoologist and environmental biologist Thomas H. Morgan (Nobel Prize in Physiology or Medicine in 1933) was one of the first to use Xenopus, before he moved on to study Drosophila. But it was endocrinologist Lancelot Hogben's work that established Xenopus as a model system. Until then, breeding could not be manipulated, and experiments could therefore be performed only from late winter to spring, when frogs breed naturally. His work led to the development of what remained the most reliable and rapid pregnancy test until the 1960s: he discovered that when urine from pregnant women was injected into female Xenopus laevis, the frog laid eggs, indicating the presence of human chorionic gonadotropin (hCG) in the urine [5, 71]. Therefore, after a simple injection of hCG, Xenopus laevis can lay a huge number of eggs every 3-4 months at any time of the year, which can be exploited by biochemists and by molecular or developmental biologists. The eggs can be used directly as single-cell oocytes or, after a simple experimental procedure, egg extracts can be prepared.

Later, Michaïl Fischberg and his colleagues found one frog in their colonies that produced diploid embryos with one nucleolus (instead of two). They generated homozygous frogs, which laid the foundation for nuclear transplantation, because this O-nu mutation could be used as a genetic marker of nuclear transplantation [72]. A few years later, experimental biologist John Gurdon (M. Fischberg’s student) discovered that the O-nu mutation was the result of a complete deletion of several ribosomal genes [73]. The concept that differentiated cells can be reprogrammed to become pluripotent was initially introduced in 1962 by Gurdon, who took the nucleus of a mature intestinal cell of a tadpole and placed it in an egg cell whose nucleus had been removed. He then saw that this egg developed into a healthy tadpole [74]. For his groundbreaking work on cell reprogramming, Gurdon was awarded the Nobel Prize in Physiology or Medicine in 2012, jointly with S. Yamanaka.

Since then (the 1960s), even though the natural environment of Xenopus is restricted to South Africa, it has become a very popular model system for a variety of studies and is now a common inhabitant of research laboratories worldwide. Xenopus is a remarkable, widely used model system for biology and medicine. In the laboratory environment, the frogs are usually housed in plastic tanks with tap water in groups of 6-15, with a plastic pipe inside the tank that provides a hiding place for the frogs. The temperature should be around 16-20°C and, although wild frogs live in somewhat dark environments, a cycle of 12-14 hours of light and 10-12 hours of dark is vital for good health (Figure 6).