FAQs & Help for Signature Analysis

Send additional questions via our comments page and we'll post the answers here.


Normalized abundance (TPM or other)
Normalized abundance is a measure of transcript abundance; it is the best estimate of the expression level of the mRNA transcript containing that signature, or of the small RNA that the signature represents. We previously used values such as Transcripts Per Million (TPM) or TP10M (transcripts per 10 million). Using those cases as examples, the normalized abundance is the raw abundance divided by the total number of signatures and multiplied by 1,000,000 (for TPM) or 10,000,000 (for TP10M). For the small RNA data, in most cases we now use "genome-matched reads", excluding t/r/sn/snoRNAs, to pick the normalization denominator (the "normalization basis" or n_base value).
For mRNA data, remember that some genes may utilize multiple signatures if there is alternative poly-adenylation or alternative splicing, so a measurement of one mRNA signature may not accurately reflect the total expression of the gene.
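
The TPM and TP10M calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name and the example counts are hypothetical and are not the code used to build this database.

```python
def normalized_abundance(raw_count, total_signatures, n_base=1_000_000):
    """Scale a raw count to 'per n_base' units (TPM when n_base is 1,000,000)."""
    if total_signatures <= 0:
        raise ValueError("total_signatures must be positive")
    # raw abundance / total number of signatures * normalization basis
    return raw_count * n_base / total_signatures


# Hypothetical example: a signature observed 42 times in a library
# whose total is 3,500,000 signatures.
tpm = normalized_abundance(42, 3_500_000)                       # per million
tp10m = normalized_abundance(42, 3_500_000, n_base=10_000_000)  # per 10 million
print(tpm, tp10m)
```

With these made-up numbers the signature comes out at 12 TPM, or equivalently 120 TP10M; only the scaling factor differs between the two.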

Raw abundance values
Raw abundance values are the number of times the sequence or read was observed in the library, out of the total number of reads in that library, before normalization.

Totals for raw abundance values
For SBS small RNA data, this column should represent the genome-matching signatures, excluding the t/r/sn/snoRNAs. For more information on the run, including totals before matching to the genome, please see the "Library Information" page within the website.

Average normalized abundance in all libraries
"Average normalized abundance in all libraries" is exactly what the name indicates. This value is not particularly informative, but provides a general indication of gene expression for a specific signature across multiple tissues.

Average normalized abundance in # libraries with abundance not < 5 TPM/TPQ
"Average normalized abundance in # libraries with abundance not < 5 TPM/TPQ" is exactly what the name indicates. For this calculation, we disregard the libraries with very low abundances and calculate the average only for libraries that have substantial data for this signature. This value is not particularly informative, but provides a general indication of gene expression for a specific signature only in tissues in which significant expression was observed.

"Abundance of this signature" in the table "Totals for ALL runs"
"Abundance of this signature" in the table "Totals for ALL runs" is a sum of all raw abundances in all runs in all libraries. If only a few observations were made in total, it could be an artifact, possibly resulting from sequencing errors. But if the signature is observed in multiple libraries, then it is much more likely to be real.

"Total of all signatures" in the table "Totals for ALL runs"
"Total of all signatures" in the table "Totals for ALL runs" is the sum of all abundances for all signatures in runs for all libraries. These values are used as the denominators in the normalized value.

Sum of abundance
On the Gene Analysis page, at the bottom of the table listing all of the signatures associated with a specific gene, we provide a pre-calculated sum of the normalized abundances.

Number of expressed distinct signatures
This row indicates the number of different signatures that matched to the gene or region in each of the libraries. Although the total number of signatures found in the list may be greater, not all signatures are found in each library. Therefore, if the abundance was zero, the signature was not counted in this row.
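
As a rough illustration of these two summary rows ("Sum of abundance" and "Number of expressed distinct signatures"), the sketch below computes both for each library column. The signature names, library names, and values are hypothetical, and the table layout is an assumption made for the example.

```python
def column_summaries(table):
    """Return per-library (sum of abundance, count of signatures with abundance > 0)."""
    libraries = sorted({lib for row in table.values() for lib in row})
    sums = {
        lib: sum(row.get(lib, 0) for row in table.values())
        for lib in libraries
    }
    expressed = {
        lib: sum(1 for row in table.values() if row.get(lib, 0) > 0)
        for lib in libraries
    }
    return sums, expressed


# Hypothetical abundances for three signatures matching one gene.
table = {
    "SIG_A": {"lib1": 10, "lib2": 0},
    "SIG_B": {"lib1": 4, "lib2": 7},
    "SIG_C": {"lib1": 0, "lib2": 0},
}
sums, expressed = column_summaries(table)
print(sums)       # {'lib1': 14, 'lib2': 7}
print(expressed)  # {'lib1': 2, 'lib2': 1}
```

Three signatures are listed for the gene, but a signature with zero abundance in a library is not counted for that library, so the counts are 2 and 1 rather than 3.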

Technologies used to extract signatures
The "T" column indicates the technology platform that was used to identify this signature. Some signatures were identified with multiple platforms and may therefore appear more than once in the list, once for each platform.

M = MPSS data. These are 17-nt signatures from the 5' end of the small RNA, and therefore they lack length information.

4 = 454 data. These are "full length" sequences that are identified after removing linker sequences and comparing them to the genome. The genome comparison reduces problems associated with homopolymers, which represent many of the errors found in 454 data. The length is variable and is biologically relevant.

S = SBS data. Like 454 data, these are "full length" sequences, and the length is variable and biologically relevant. However, the read length may be limited to at most ~25 nt (depending on the maximum number of cycles used on the machine), so longer molecules may be truncated. This limit should still capture the full length of the majority of small RNAs.

Signature abundances normalized to (varies by library depth)
The abundance of individual signatures in our database is normalized to a round number, often expressed (as in the case of TPM or TP10M) as transcripts per million or ten million, or some multiple thereof. The normalization calculation still uses the "Abundance of this signature" as the numerator and the "Total of all signatures" as the denominator, but we then scale the result to one of these round values. We pick the value based on the nearest "round" number (1, 5, 50, etc.) in millions; the selected value (the normalization basis or "n_base") is used for all related libraries so that all libraries within the same experiment are normalized to the same value.

It is very important to note that because library depths vary substantially, we may show abundances from DIFFERENT normalizations in the same table. It is therefore important to show the n_base value used to normalize each library whenever the individual signature abundances are shown for each library. For example, an abundance of 5 (out of 1,000,000) is substantially different from an abundance of 5 (out of 5,000,000): there is a five-fold difference in abundance between these, even though they both show an abundance of "5". This may be confusing, and we haven't yet developed an easier way to show this in our "Gene Analysis" tables. Our "Signature Analysis" tables, however, show the abundances for a given normalization denominator, which may be an easier way to compare across libraries.
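
To compare abundances that were normalized to different n_base values, one option is to rescale them to a common basis first. The sketch below reproduces the "5 out of 1,000,000" versus "5 out of 5,000,000" example; the function name is hypothetical and this is not a feature of the website.

```python
def rescale(abundance, from_n_base, to_n_base):
    """Convert an abundance normalized to from_n_base into units of to_n_base."""
    if from_n_base <= 0 or to_n_base <= 0:
        raise ValueError("n_base values must be positive")
    return abundance * to_n_base / from_n_base


common = 5_000_000                   # compare both values per 5 million
a = rescale(5, 1_000_000, common)    # "5" shown with n_base = 1,000,000
b = rescale(5, 5_000_000, common)    # "5" shown with n_base = 5,000,000
ratio = a / b
print(a, b, ratio)                   # 25.0 5.0 5.0
```

On the common basis the two values are 25 and 5, which makes the five-fold difference visible even though both tables display "5".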