EP4151728A1 - Method for data processing of RNA information - Google Patents

Publication number
EP4151728A1
EP4151728A1 EP21803729.9A EP21803729A EP4151728A1 EP 4151728 A1 EP4151728 A1 EP 4151728A1 EP 21803729 A EP21803729 A EP 21803729A EP 4151728 A1 EP4151728 A1 EP 4151728A1
Authority
EP
European Patent Office
Prior art keywords
rnas
rna
specimens
zero
proportion
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21803729.9A
Other languages
German (de)
French (fr)
Inventor
Yuya UEHARA
Kotomi YAJIMA
Takayoshi Inoue
Naoki Oya
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kao Corp
Original Assignee
Kao Corp

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1089 Design, preparation, screening or analysis of libraries using computer algorithms
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/158 Expression markers
    • C12Q2600/16 Primer sets for multiplex assays

Definitions

  • the present invention relates to a method for data processing of RNA information in a human-derived secretion.
  • nucleic acids such as DNAs and RNAs
  • the analysis using nucleic acids has advantages of being capable of obtaining abundant information by a single analysis because a comprehensive analysis method has been established and of being capable of easily performing functional association of analysis results based on many research reports on single nucleotide polymorphism, RNA function, and the like.
  • Nucleic acids of a biological origin can be extracted from, for example, body fluid such as blood, secretion, or tissue, and recently, it has been reported that RNAs included in skin surface lipids (SSL) are used as a specimen for biological analysis and that marker genes of epidermis, sweat glands, hair follicles, and sebaceous glands are detected in SSL (Patent Literature 1).
  • SSL skin surface lipids
  • RNA sequencing (RNA-Seq) analysis which directly quantitatively measures the RNA sequence expressed in cells, can detect a low-expression gene of which quantitative measurement in a microarray using a signal intensity ratio is difficult, and can obtain a highly accurate expression profile, and is therefore an analysis approach that is currently attracting attention.
  • in gene expression analysis, the concentration and/or relative or absolute amount of a specific RNA in a specimen is determined, and the specific RNA is quantified (quantitatively measured). In this case, a highly accurate and reproducible method is desired.
  • the quantities of a specific RNA are not always directly comparable. Accordingly, to compare properly the quantities of a specific RNA in biological specimens derived from two or more different individuals, the quantities of the RNA are normalized across specimens.
  • in RNA-seq analysis, the number of sequence reads mapped on a genome is used for quantitative measurement of the expression level of a gene. Accordingly, the normalization uses, for example, RPM (Reads Per Million reads mapped, Non Patent Literature 1) or RLE (Relative Log Expression, Non Patent Literature 2), which are correction methods using the total number of reads. Normalization by RLE is implemented in an analytical technique for a series of gene expression level analysis called DESeq2.
  • RNAs collected from secretion such as sebum and saliva
  • SSL RNAs collected from SSL
  • Patent Literature 1: WO 2018/008319
  • the present invention relates to the following 1) to 3).
  • Figure 1 is a box plot of Log2 (normalized count + 1) value in each subject.
  • the present invention relates to the provision of a data processing method for RNA information that enables effective normalization processing when secretion derived from a subject is used as a biological specimen and the RNA information obtained therefrom is analyzed.
  • the present inventors used the expression status of RNAs included in SSL as sequence information and examined the data used in normalization of expression values for various statistical approaches and, as a result, found that effective normalization processing is possible by setting a threshold as a selection criterion of data analysis target specimens and a threshold as a selection criterion of data analysis target genes within specific ranges and extracting RNA information.
  • RNA expression information includes many missing values and variations
  • effective normalization processing is possible, and statistical analysis with high accuracy and high reproducibility is possible based on the RNA information.
  • the "RNA" as an analysis target may be any RNA of a biological origin and may be any of total RNA, mRNA, rRNA, tRNA, and noncoding RNA but is preferably mRNA.
  • the biological specimen used in the method of the present invention is secretion derived from a subject, and specifically, examples thereof include a specimen including sebum, saliva, nasal discharge, tears, sweat, urine, semen, vaginal fluid, amniotic fluid, milk, and feces.
  • the method of the present invention is effectively applied to skin surface lipids (SSL), whose RNA information includes many missing values and is high in variation.
  • skin surface lipids (SSL) refers to a lipophilic fraction present on the surface of the skin and is also called sebum.
  • SSL mainly contains secretion secreted from exocrine glands, such as sebaceous glands, on the skin and is present on the skin surface in the form of a thin layer covering the skin surface.
  • SSL contains RNAs expressed in skin cells.
  • skin is a generic name of regions including stratum corneum, epidermis, dermis, hair follicles, and tissues such as sweat glands, sebaceous glands, and other glands, unless otherwise specified.
  • any means used for collection or removal of SSL from the skin can be adopted.
  • an SSL absorbent material, an SSL adhesive material, or a tool for scraping off SSL from the skin can be used.
  • the SSL absorbent material and the SSL adhesive material are not particularly limited as long as they have affinity to SSL, and examples thereof include polypropylene and pulp. More detailed examples of the procedure of collecting SSL from the skin include a method of absorbing SSL to a sheetlike material, such as oil-blotting paper and an oil-blotting film, a method of adhering SSL to a glass plate, tape, or the like, and a method of scraping off and collecting SSL with a spatula, scraper, or the like.
  • an SSL absorbent material impregnated with a solvent having high lipophilicity in advance may be used.
  • when the SSL absorbent material contains a solvent having high hydrophilicity, or moisture, adsorption of SSL is prevented. Accordingly, it is preferable that the content of a highly hydrophilic solvent or moisture be low. It is preferable to use the SSL absorbent material in a dry state.
  • the site of the skin where SSL is collected is not particularly limited, and examples of the skin include those at any site of the body such as the head, face, neck, trunk, and limbs, and a site where sebum is abundantly secreted, for example, the skin of the head or the face is preferable, and the skin of the face is more preferable.
  • RNA-containing SSL collected from a subject may be stored for a certain period of time.
  • the collected SSL is preferably stored under a low temperature condition as promptly as possible after collection in order to suppress the decomposition of RNAs contained therein as much as possible.
  • the temperature condition for storing the RNA-containing SSL may be 0°C or less and is preferably from -20°C ± 20°C to -80°C ± 20°C, more preferably from -20°C ± 10°C to -80°C ± 10°C, further more preferably from -20°C ± 20°C to -40°C ± 20°C, further more preferably from -20°C ± 10°C to -40°C ± 10°C, further more preferably -20°C ± 10°C, and further more preferably -20°C ± 5°C.
  • the period of time for storing the RNA-containing SSL at the low temperature condition is not particularly limited and is preferably 12 months or less, for example, 6 hours or more and 12 months or less, more preferably 6 months or less, for example, 1 day or more and 6 months or less, and further more preferably 3 months or less, for example, 3 days or more and 3 months or less.
  • the method of acquiring RNA expression information is not particularly limited, and examples thereof include acquisition by converting RNAs included in a specimen into cDNAs by reverse transcription and then measuring the cDNAs or an amplification product thereof.
  • the means of measuring an expression level include a DNA chip, a DNA microarray, and RNA-Seq, and the means is preferably RNA-Seq.
  • the RNA expression level is quantitatively measured by a signal intensity ratio when microarray analysis is used and by the number of sequence reads (read count value) mapped on a genome in RNA-seq analysis.
  • the method of the present invention includes a step of acquiring information on the RNA expression level and includes a step of obtaining the number of sequence reads (read count value) quantitatively measured as the RNA expression level as described above.
  • the data of the RNA expression level are stored in a server or a recording medium of a computer and input into a computer, and processing of the data of the present invention can be implemented by the program installed in the computer based on the input data.
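As a minimal sketch of the input step described above, the following Python reads a read-count table into memory. The tab-separated layout (one row per gene, one column per specimen) and the names `load_counts`, `GENE_A`, `s1`, and `s2` are assumptions for illustration only; the source does not specify a file format.

```python
import csv
import io


def load_counts(handle):
    """Read a tab-separated count table (hypothetical layout).

    The first column holds gene IDs and each remaining column holds the
    read counts of one specimen. Returns (gene_ids, counts), where counts
    maps specimen ID -> list of integer read counts in gene_ids order.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    specimen_ids = header[1:]
    gene_ids = []
    counts = {s: [] for s in specimen_ids}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene_ids.append(row[0])
        for specimen, value in zip(specimen_ids, row[1:]):
            counts[specimen].append(int(value))
    return gene_ids, counts


# Hypothetical table held in memory instead of a file on a server.
table = "gene\ts1\ts2\nGENE_A\t12\t0\nGENE_B\t3\t7\n"
gene_ids, counts = load_counts(io.StringIO(table))
print(gene_ids, counts)
```

Keeping the counts as one list per specimen, all in the same gene order, makes the per-specimen and per-gene proportions described later straightforward to compute.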
  • the expression information of analysis target RNA is extracted by setting a threshold as a selection criterion of data analysis target specimens and a threshold as a selection criterion of data analysis target genes, and normalization is performed.
  • RNA expression level data read count value by RNA-Seq
  • the selection criterion of specimens (subjects) as a data analysis target and the selection criterion of genes as a data analysis target were examined as follows.
  • the TD_j value determined for each specimen by the following equation is used as a selection index of specimens (j) as a data analysis target.
  • the TD value is Targets Detected and corresponds to the gene detection rate (%).
  • $$\mathrm{TD}_j = \frac{\text{number of detectable genes in specimen } j}{\text{total number of detection target genes}} \times 100$$
  • the total number of detection target genes is the total number of genes judged to be theoretically detectable in RNA expression analysis and may be appropriately determined based on the RNA expression analytical method to be used. In the case of the sequencing method (AmpliSeq) of Examples as described later, the total number is determined based on the number of primer pairs of Multiplex PCR.
  • the number of detectable genes can be calculated by subtracting the number of undetectable genes from the total number of detection target genes.
  • the number of undetectable genes means the number of genes of which the expression is zero or can be regarded as zero.
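The TD value defined above can be sketched in a few lines of Python. The function name `td_value` and the parameter `zero_cutoff` are hypothetical; the cutoff stands in for "zero or can be regarded as zero" and its default of 1 (only a count of 0 is undetectable) is an assumption, not a value prescribed by the source.

```python
def td_value(read_counts, zero_cutoff=1):
    """Gene detection rate (%) for one specimen.

    read_counts: one read count per detection target gene.
    A gene is treated as undetectable when its count is below zero_cutoff.
    """
    total = len(read_counts)
    if total == 0:
        raise ValueError("no detection target genes")
    undetectable = sum(1 for c in read_counts if c < zero_cutoff)
    detectable = total - undetectable  # as defined in the text
    return 100.0 * detectable / total


# Toy specimen with 10 target genes, 4 of which have at least one read.
example = [0, 12, 0, 3, 0, 0, 7, 0, 0, 1]
print(td_value(example))  # 40.0
```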
  • the SD value is Samples Detected and is the rate (detection specimen rate), determined for each gene in the RNA expression level data of the data analysis target specimens selected using the TD value, of specimens in which expression of the RNA derived from that gene was detected.
  • here, expression of an RNA is regarded as detected when its expression level is higher than zero or higher than a level that can be regarded as zero.
  • $$\mathrm{SD}_i = \frac{\text{number of specimens (subjects) showing detectable expression of RNA } i}{\text{total number of analysis target specimens (subjects)}} \times 100$$
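A corresponding sketch for the SD value follows. It assumes the counts are held as a mapping from specimen ID to a list of read counts in a shared gene order; `sd_values` and `zero_cutoff` are illustrative names, and the default cutoff of 1 (a read count above 0 counts as detected) is an assumption.

```python
def sd_values(counts, zero_cutoff=1):
    """Detection specimen rate (%) for each gene.

    counts: dict mapping specimen ID -> list of read counts, with every
    list in the same gene order. Only specimens already selected as
    analysis targets should be passed in.
    """
    rows = list(counts.values())
    if not rows:
        raise ValueError("no analysis target specimens")
    n_genes = len(rows[0])
    if any(len(row) != n_genes for row in rows):
        raise ValueError("all specimens must cover the same genes")
    n_specimens = len(rows)
    return [
        100.0 * sum(1 for row in rows if row[i] >= zero_cutoff) / n_specimens
        for i in range(n_genes)
    ]


# Toy data: 4 selected specimens, 3 genes.
selected = {"s1": [5, 0, 3], "s2": [4, 1, 0], "s3": [9, 2, 0], "s4": [2, 0, 0]}
print(sd_values(selected))  # [100.0, 50.0, 25.0]
```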
  • Specimens (subjects) having a TD_j value of 0%, or of less than 20% or less than 30%, were excluded, and the other specimens (subjects) were selected as data analysis target specimens (subjects).
  • genes having an SD_i value of less than 70%, 80%, 90%, or 100% were excluded, and the other genes were selected as data analysis target genes.
  • the RNA expression level data extracted for these genes were subjected to normalization by DESeq2 (Love MI, et al., Genome Biol., 2014) to verify the degree of approximation to a normal distribution.
  • the method comprises the following steps, after which effective normalization is said to be possible in the subsequent normalization processing:
    • step a: RNAs of which the expression level is zero or can be regarded as zero are judged as undetectable, the number of detectable RNAs is counted, and proportion 1 (TD value) of the number of detectable RNAs with respect to the total number of the detection target RNAs is determined for each specimen;
    • step b: specimens for which the proportion 1 is less than a threshold set in a range from 5% to 29% are excluded to select analysis target specimens;
    • step c: proportion 2 (SD value) of the number of specimens for which the RNA expression level is higher than zero or higher than an expression level that can be regarded as zero with respect to the total number of the analysis target specimens is determined for each detection target RNA in the selected specimens;
    • step d: RNAs for which the proportion 2 is less than a threshold set in a range from 81% to 99% are excluded, and the other RNAs are used as analysis targets to extract the expression information thereof.
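Steps a to d can be sketched end to end as follows. This is an illustration under stated assumptions, not the applicant's implementation: the function name `extract_analysis_targets`, the dictionary layout, and the single `zero_cutoff` shared by steps a and c are all hypothetical. The default thresholds of 20% and 90% are the values the text calls particularly preferable, and the range checks mirror the 5% to 29% and 81% to 99% ranges.

```python
def extract_analysis_targets(counts, td_threshold=20.0, sd_threshold=90.0,
                             zero_cutoff=1):
    """Select specimens and RNAs, then extract their expression values.

    counts: dict mapping specimen ID -> list of read counts (same RNA order).
    Returns (kept_specimens, kept_rna_indices, extracted_counts).
    """
    if not 5.0 <= td_threshold <= 29.0:
        raise ValueError("proportion 1 threshold must lie in 5% to 29%")
    if not 81.0 <= sd_threshold <= 99.0:
        raise ValueError("proportion 2 threshold must lie in 81% to 99%")
    n_rnas = len(next(iter(counts.values())))

    # Step a: proportion 1 (TD value) for each specimen.
    td = {
        s: 100.0 * sum(1 for c in row if c >= zero_cutoff) / n_rnas
        for s, row in counts.items()
    }
    # Step b: exclude specimens whose TD value is below the threshold.
    kept_specimens = [s for s in counts if td[s] >= td_threshold]
    if not kept_specimens:
        raise ValueError("no specimen passes the proportion 1 threshold")

    # Step c: proportion 2 (SD value) for each RNA over the kept specimens.
    sd = [
        100.0
        * sum(1 for s in kept_specimens if counts[s][i] >= zero_cutoff)
        / len(kept_specimens)
        for i in range(n_rnas)
    ]
    # Step d: exclude RNAs below the threshold and extract the rest.
    kept_rnas = [i for i in range(n_rnas) if sd[i] >= sd_threshold]
    extracted = {s: [counts[s][i] for i in kept_rnas] for s in kept_specimens}
    return kept_specimens, kept_rnas, extracted


# Toy data: 4 specimens, 5 RNAs; s3 has no detectable RNA at all.
toy = {
    "s1": [5, 0, 3, 2, 0],
    "s2": [4, 1, 0, 6, 0],
    "s3": [0, 0, 0, 0, 0],
    "s4": [9, 2, 0, 1, 0],
}
specimens, rnas, extracted = extract_analysis_targets(toy)
print(specimens, rnas, extracted)
```

In this toy run `s3` is dropped in step b, and only the RNAs at positions 0 and 3, which are detected in every remaining specimen, survive step d. The extracted table would then be passed to a normalization method.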
  • the RNAs of which the expression level is zero or can be regarded as zero can be appropriately determined by a measurement means.
  • RNAs are those having a read count value of less than 20, preferably less than 15, and more preferably less than 10.
  • the threshold of the proportion 1 of the number of detectable RNAs with respect to the total number of detection target RNAs is set to 5% or more from the viewpoint of effective normalization and is preferably 10% or more, more preferably 15% or more, and further more preferably 18% or more.
  • the threshold of the proportion 1 is set to 29% or less from the point of securing the number of analysis target specimens for analysis after the normalization and is preferably 27% or less, more preferably 25% or less, and further more preferably 23% or less.
  • the threshold of the proportion 1 is appropriately set within a range from 5% to 29%, preferably within a range from 10% to 27%, more preferably within a range from 15% to 25%, and further more preferably within a range from 18% to 23%.
  • the threshold of the proportion 1 is particularly preferably set to 20%.
  • the proportion 2 (SD value) of the number of specimens for which the expression level is higher than zero or higher than an expression level that can be regarded as zero with respect to the total number of the analysis target specimens is calculated.
  • the expression level that can be regarded as zero means that, for example, in RNA-seq analysis, the read count value is less than 5, preferably less than 3, and more preferably less than 1.
  • the proportion of the number of specimens for which the expression level is higher than zero (in RNA-seq analysis, the number of specimens having a read count value higher than 0) with respect to the total number of analysis target specimens is preferably used.
  • the threshold of the proportion 2 of the number of specimens for which the RNA expression level is higher than zero or higher than an expression level that can be regarded as zero with respect to the total number of the specimens is set to 81% or more from the viewpoint of effective normalization and is preferably 84% or more and more preferably 87% or more.
  • the threshold of the proportion 2 is set to 99% or less from the point of securing the number of analysis target genes for analysis after the normalization and is preferably 96% or less and more preferably 93% or less.
  • the threshold of the proportion 2 is appropriately set within a range from 81% to 99%, preferably within a range from 84% to 96% and more preferably within a range from 87% to 93%.
  • the threshold of the proportion 2 is particularly preferably set to 90%.
  • When the threshold of the proportion 1 in the step b is low, it is desirable that the threshold of the proportion 2 in the step d be set to be high for efficient normalization. When the threshold of the proportion 2 in the step d is low, it is desirable that the threshold of the proportion 1 in the step b be set to be high for efficient normalization.
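Because the two thresholds interact, and because each one also determines how many specimens and genes remain for analysis, it can help to tabulate what survives for several threshold pairs. The sketch below does this on toy data; `scan_thresholds`, the data, and the default `zero_cutoff` of 1 are assumptions for illustration.

```python
def scan_thresholds(counts, td_thresholds, sd_thresholds, zero_cutoff=1):
    """Count retained specimens and genes for each threshold pair.

    Returns a dict mapping (td_threshold, sd_threshold) ->
    (number of retained specimens, number of retained genes).
    """
    n_genes = len(next(iter(counts.values())))
    td = {
        s: 100.0 * sum(1 for c in row if c >= zero_cutoff) / n_genes
        for s, row in counts.items()
    }
    result = {}
    for td_thr in td_thresholds:
        kept = [s for s in counts if td[s] >= td_thr]
        for sd_thr in sd_thresholds:
            n_kept_genes = 0
            for i in range(n_genes):
                detected = sum(1 for s in kept if counts[s][i] >= zero_cutoff)
                if kept and 100.0 * detected / len(kept) >= sd_thr:
                    n_kept_genes += 1
            result[(td_thr, sd_thr)] = (len(kept), n_kept_genes)
    return result


# Toy data: 4 specimens, 10 genes; s3 detects only 2 of 10 genes.
toy = {
    "s1": [5, 1, 3, 2, 4, 0, 0, 0, 0, 0],
    "s2": [4, 2, 0, 6, 1, 0, 0, 0, 0, 0],
    "s3": [1, 0, 0, 2, 0, 0, 0, 0, 0, 0],
    "s4": [9, 3, 1, 1, 0, 0, 0, 0, 0, 0],
}
summary = scan_thresholds(toy, td_thresholds=(10, 25), sd_thresholds=(90,))
print(summary)  # {(10, 90): (4, 2), (25, 90): (3, 3)}
```

In this toy case, raising the proportion 1 threshold from 10% to 25% removes one low-detection specimen and lets one more gene pass the 90% proportion 2 threshold, which illustrates the trade-off between retained specimens and retained genes.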
  • the method for normalization used in this case is not particularly limited, and, for example, in addition to the above-described RPM method and RLE method, FPKM (fragments per kilobase of exon per million reads mapped) method, RPKM (reads per kilobase of exon per million reads mapped), TPM (transcripts per million) method, TMM (Trimmed mean of M values) method, or the like can be adopted, and the RLE method is suitably used.
  • the RLE method is implemented in an analytical method for performing a series of gene expression level analysis called DESeq2.
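For orientation, the median-of-ratios idea behind RLE normalization can be sketched in plain Python. This is a simplified illustration, not DESeq2 itself: the function names are hypothetical, and genes with a zero count in any specimen are simply skipped when estimating size factors because the logarithm is undefined for them.

```python
import math
import statistics


def rle_size_factors(counts):
    """Median-of-ratios size factor for each specimen.

    counts: dict mapping specimen ID -> list of read counts (same gene order).
    """
    specimens = list(counts)
    n_genes = len(counts[specimens[0]])
    # Only genes with a positive count in every specimen are usable.
    usable = [
        i for i in range(n_genes) if all(counts[s][i] > 0 for s in specimens)
    ]
    if not usable:
        raise ValueError("no gene is detected in every specimen")
    # Log of the per-gene geometric mean across specimens.
    log_geo_mean = {
        i: sum(math.log(counts[s][i]) for s in specimens) / len(specimens)
        for i in usable
    }
    return {
        s: math.exp(
            statistics.median(
                [math.log(counts[s][i]) - log_geo_mean[i] for i in usable]
            )
        )
        for s in specimens
    }


def rle_normalize(counts):
    """Divide each specimen's counts by its size factor."""
    factors = rle_size_factors(counts)
    return {s: [c / factors[s] for c in row] for s, row in counts.items()}


# Toy data: specimen B was sequenced twice as deeply as specimen A.
toy = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
factors = rle_size_factors(toy)
normalized = rle_normalize(toy)
print(factors)
print(normalized)
```

After normalization the first three genes have matching values in both specimens, since the only difference between them was sequencing depth. The last gene is zero in specimen A and therefore does not contribute to the size factors in this sketch.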
  • the data processing method and correction method for analysis of the RNA expression information can be performed using a computer (computing device). That is, the present invention can provide a computing device for implementing the above method, a program for implementing the method by the computer, and a computer-readable information recording medium on which the program is recorded. Furthermore, the present invention can provide a data set for RNA analysis obtained by the above data processing method. In addition, the present invention can perform data processing by inputting information, such as the proportion 1, the proportion 2, or the threshold, used for the above data processing or can also select the proper proportion 1, proportion 2, and thresholds by computation.
  • the computing device of the present invention includes a means for inputting RNA expression information obtained from a specimen collected from a subject and executes, according to the program for implementing the data processing method and correction method of the present invention, one or more steps selected from the group consisting of the above-described steps of selecting analysis target specimens, selecting analysis target genes, extracting RNA expression information of the analysis target genes, and normalizing the RNA expression information.
  • Examples of the computer-readable information recording medium that records the program for implementing the data processing method and correction method of the present invention include a magnetic disk, an optical disk, a magneto-optical disk, and a flash memory.
  • the term computer-readable includes the case of distribution via an electric communication line or the like.
  • Example 1 Normalization of RNA expression data extracted from SSL
  • Sebum was collected from the entire face of each of 42 healthy subjects (females aged 20 to 59) using an oil-blotting film; the film was then transferred to a vial and stored at -80°C for about one month until it was used for RNA extraction.
  • RNAs were extracted using QIAzol Lysis Reagent (Qiagen) in accordance with the attached protocol.
  • the extracted RNAs were reverse transcribed at 42°C for 90 minutes using a SuperScript VILO cDNA Synthesis kit (Life Technologies Japan Ltd.) to synthesize cDNAs.
  • as primers for the reverse transcription reaction, the random primers attached to the kit were used.
  • a library containing DNAs derived from 20,802 genes was prepared from the resulting cDNAs by multiplex PCR.
  • the multiplex PCR was performed using Ion AmpliSeq Transcriptome Human Gene Expression Kit (Life Technologies Japan Ltd.) under conditions of [99°C, 2 min → (99°C, 15 sec → 62°C, 16 min) × 20 cycles → 4°C, Hold].
  • the resulting PCR product was purified with Ampure XP (Beckman Coulter, Inc.) and was then subjected to buffer reconstruction, digestion of primer sequences, adapter ligation, purification, and amplification to prepare a library.
  • the prepared library was loaded on Ion 540 Chip and sequenced using Ion S5/XL system (Life Technologies Japan Ltd.).
  • RNA expression level data read count value
  • a selection criterion of data analysis target subjects and a selection criterion of data analysis target genes were examined.
  • as the selection criterion of data analysis target subjects, the value of Targets Detected (TD) calculated in Torrent Suite (Life Technologies Japan Ltd.) was used.
  • the threshold of TD_j calculated for each subject was set to 0%, 20%, or 30%; subjects with a TD_j below the threshold were excluded from the analysis target, and the other subjects were selected as data analysis target subjects.
  • the variance of the median value was calculated for the Log2 (normalized count + 1) value calculated in the above 3), and as a result, the variance of the median value was decreased to 0.1 or less with an increase in the threshold of the TD value or SD value (Table 1, boldface). A synergistic decrease in the variance of the median value with an increase in the threshold of the TD value or SD value was also confirmed. Accordingly, it was demonstrated that the median value of each subject after the normalization by DESeq2 can be adjusted by selection of data analysis target subjects and data analysis target genes using the TD value and the SD value.
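The evaluation metric described above, the variance across subjects of each subject's median Log2 (normalized count + 1) value, can be sketched as follows. The function names and toy values are hypothetical, and the use of the sample variance (`statistics.variance`) rather than the population variance is an assumption, since the source does not say which was used.

```python
import math
import statistics


def median_log2_per_subject(normalized):
    """Median of log2(normalized count + 1) for each subject."""
    return {
        subject: statistics.median([math.log2(x + 1) for x in values])
        for subject, values in normalized.items()
    }


def variance_of_medians(normalized):
    """Sample variance of the per-subject medians (assumed estimator)."""
    medians = median_log2_per_subject(normalized)
    return statistics.variance(list(medians.values()))


# Toy normalized counts for 3 subjects and 3 genes.
toy = {"s1": [0, 1, 3], "s2": [1, 3, 7], "s3": [3, 7, 15]}
print(median_log2_per_subject(toy))  # {'s1': 1.0, 's2': 2.0, 's3': 3.0}
print(variance_of_medians(toy))      # 1.0
```

A value close to zero means that the per-subject medians are well aligned after normalization, which is how the text reads the result of 0.1 or less.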

Abstract

Provided is a data processing method for RNA information that enables effective normalization when RNA information obtained from secretion derived from a subject is analyzed.
A data processing method for analysis of RNA expression information obtained from secretion collected from a plurality of subjects as biological specimens, the method comprising the following steps a) to d):
a) a step of counting the number of detectable RNAs in detection target RNAs by judging RNAs of which the expression level is zero or can be regarded as zero to be undetectable, and determining proportion 1 (TD value) of the number of detectable RNAs with respect to the total number of the detection target RNAs for each specimen;
b) a step of excluding specimens for which proportion 1 is less than a threshold set within a range from 5% to 29% from the specimens to select analysis target specimens;
c) a step of determining, for each detection target RNA, proportion 2 (SD value) of the number of specimens for which the expression level thereof is higher than zero or higher than an expression level that can be regarded as zero with respect to the total number of the analysis target specimens based on the RNA expression information of the analysis target specimens selected above; and
d) a step of excluding RNAs for which proportion 2 is less than a threshold set within a range from 81% to 99% from the detection target RNAs and extracting the expression information of RNAs other than the excluded RNAs as an analysis target.

Description

    Field of the Invention
  • The present invention relates to a method for data processing of RNA information in a human-derived secretion.
    Background of the Invention
  • In recent years, techniques have been developed for examining the current and, furthermore, future in vivo physiological states of humans by analyzing nucleic acids, such as DNAs and RNAs, in a biological specimen. Analysis using nucleic acids has two advantages: abundant information can be obtained from a single analysis because comprehensive analysis methods have been established, and analysis results can easily be associated with function on the basis of many research reports on single nucleotide polymorphism, RNA function, and the like. Nucleic acids of a biological origin can be extracted from, for example, body fluid such as blood, secretion, or tissue, and it has recently been reported that RNAs included in skin surface lipids (SSL) are used as a specimen for biological analysis and that marker genes of epidermis, sweat glands, hair follicles, and sebaceous glands are detected in SSL (Patent Literature 1).
  • RNA sequencing (RNA-Seq) analysis, which directly and quantitatively measures the RNA sequences expressed in cells, can detect low-expression genes that are difficult to quantify with a microarray using a signal intensity ratio and can provide a highly accurate expression profile; it is therefore an analysis approach that is currently attracting attention. In gene expression analysis, the concentration and/or relative or absolute amount of a specific RNA in a specimen is determined, and the specific RNA is quantified (quantitatively measured). In this case, a highly accurate and reproducible method is desired. However, in biological specimens collected from different individuals, since bias may occur in the expression level profile depending on the biological specimen or the analysis process, the quantities of a specific RNA are not always directly comparable. Accordingly, to compare properly the quantities of a specific RNA in biological specimens derived from two or more different individuals, the quantities of the RNA are normalized across specimens.
  • In RNA-seq analysis, the number of sequence reads mapped on a genome is used for quantitative measurement of the expression level of a gene. Accordingly, the normalization uses, for example, RPM (Reads Per Million reads mapped, Non Patent Literature 1) or RLE (Relative Log Expression, Non Patent Literature 2), which are correction methods using the total number of reads. Normalization by RLE is implemented in an analytical technique for a series of gene expression level analysis called DESeq2.
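As a small worked example of the RPM correction mentioned above, each read count is scaled by the specimen's total number of mapped reads and multiplied by one million. The function name `rpm` and the choice to default the total to the sum of the supplied counts are assumptions for illustration.

```python
def rpm(read_counts, total_mapped_reads=None):
    """Reads per million mapped reads for one specimen.

    read_counts: one read count per gene.
    total_mapped_reads: total mapped reads of the specimen; when omitted,
    the sum of read_counts is used (an assumption of this sketch).
    """
    if total_mapped_reads is None:
        total_mapped_reads = sum(read_counts)
    if total_mapped_reads <= 0:
        raise ValueError("total number of mapped reads must be positive")
    return [c * 1_000_000 / total_mapped_reads for c in read_counts]


# Toy specimen with 10,000 mapped reads spread over 3 genes.
print(rpm([500, 1500, 8000]))  # [50000.0, 150000.0, 800000.0]
```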
  • However, information of RNAs collected from secretion such as sebum and saliva, in particular, RNAs collected from SSL, includes many missing values and is high in variation. Accordingly, when the same data processing as for information of other RNAs is performed, even if subsequent statistical processing, such as machine learning, is performed, problems may arise in accuracy and reproducibility.
  • [Patent Literature 1] WO 2018/008319
  • [Non Patent Literature 1] IPSJ SIG Technical Reports, Vol. 2013-BIO-33, No. 9, pp. 1-3
  • [Non Patent Literature 2] Genome Biol., 2014, Vol. 15, No. 12, p. 550
    Summary of the Invention
  • The present invention relates to the following 1) to 3).
    1) A data processing method for analysis of RNA expression information obtained from secretion collected from a plurality of subjects as biological specimens, the method comprising the following steps a) to d):
      a) a step of counting the number of detectable RNAs in detection target RNAs by judging RNAs of which an expression level is zero or can be regarded as zero to be undetectable, and determining proportion 1 (TD value) of the number of detectable RNAs with respect to the total number of the detection target RNAs for each specimen;
      b) a step of excluding specimens for which the proportion 1 is less than a threshold set within a range from 5% to 29% from the specimens to select analysis target specimens;
      c) a step of determining, for each detection target RNA, proportion 2 (SD value) of the number of specimens in which the expression level thereof is higher than zero or higher than an expression level that can be regarded as zero with respect to the total number of the analysis target specimens based on the RNA expression information of the analysis target specimens selected above; and
      d) a step of excluding RNAs for which the proportion 2 is less than a threshold set within a range from 81% to 99% from the detection target RNAs and extracting expression information of RNAs other than the excluded RNAs as an analysis target.
    2) A method of correcting an RNA expression value, comprising normalizing the total RNA expression information extracted by the method of 1).
    3) A program for implementing the data processing method of 1) or the correction method of 2), an information recording medium recording the program, a computing device implementing the program, and an RNA analysis data set obtained by the data processing method or the correction method.
    Brief Description of Drawing
  • [Figure 1] Figure 1 is a box plot of Log2 (normalized count + 1) value in each subject.
    Detailed Description of the Invention
  • The present invention relates to a provision of a data processing method for RNA information in order to perform effective normalization processing in cases of using secretion derived from a subject as a biological specimen and analyzing the RNA information obtained therefrom.
  • The present inventors used the expression status of RNAs included in SSL as sequence information and examined the data used in normalization of expression values for various statistical approaches and, as a result, found that effective normalization processing is possible by setting a threshold as a selection criterion of data analysis target specimens and a threshold as a selection criterion of data analysis target genes within specific ranges and extracting RNA information.
  • According to the present invention, in biological specimens of which RNA expression information includes many missing values and variations, when RNA expression profiles derived from a plurality of samples are compared, effective normalization processing is possible, and statistical analysis with high accuracy and high reproducibility is possible based on the RNA information.
  • In the method of the present invention, the "RNA" as an analysis target may be any RNA of a biological origin and may be any of total RNA, mRNA, rRNA, tRNA, and noncoding RNA but is preferably mRNA.
  • The biological specimen used in the method of the present invention is a secretion derived from a subject, and specific examples thereof include specimens containing sebum, saliva, nasal discharge, tears, sweat, urine, semen, vaginal fluid, amniotic fluid, milk, or feces. Among these specimens, the method of the present invention is particularly effective when applied to skin surface lipids (SSL), whose RNA information includes many missing values and is high in variation.
  • The term "skin surface lipids (SSL)" refers to a lipophilic fraction present on the surface of the skin and is also called sebum. In general, SSL mainly contains secretion secreted from exocrine glands, such as sebaceous glands, on the skin and is present on the skin surface in the form of a thin layer covering the skin surface. SSL contains RNAs expressed in skin cells. Here, the term "skin" is a generic name of regions including stratum corneum, epidermis, dermis, hair follicles, and tissues such as sweat glands, sebaceous glands, and other glands, unless otherwise specified.
  • In collection of SSL from the skin of a subject, any means used for collection or removal of SSL from the skin can be adopted. Preferably, an SSL absorbent material, an SSL adhesive material, or a tool for scraping off SSL from the skin can be used. The SSL absorbent material and the SSL adhesive material are not particularly limited as long as they have affinity to SSL, and examples thereof include polypropylene and pulp. More detailed examples of the procedure of collecting SSL from the skin include a method of absorbing SSL to a sheetlike material, such as oil-blotting paper and an oil-blotting film, a method of adhering SSL to a glass plate, tape, or the like, and a method of scraping off and collecting SSL with a spatula, scraper, or the like. In order to improve the adsorptive property of SSL, an SSL absorbent material impregnated with a solvent having high lipophilicity in advance may be used. In contrast, if the SSL absorbent material contains a solvent having high hydrophilicity, or moisture, adsorption of SSL is prevented. Accordingly, it is preferable that the content of a highly hydrophilic solvent or moisture be low. It is preferable to use the SSL absorbent material in a dry state. The site of the skin where SSL is collected is not particularly limited, and examples of the skin include those at any site of the body such as the head, face, neck, trunk, and limbs, and a site where sebum is abundantly secreted, for example, the skin of the head or the face is preferable, and the skin of the face is more preferable.
  • RNA-containing SSL collected from a subject may be stored for a certain period of time. The collected SSL is preferably placed under a low temperature condition as promptly as possible after collection in order to suppress the decomposition of the RNAs contained therein as much as possible. The temperature condition for storing the RNA-containing SSL may be 0°C or less and is preferably from -20°C ± 20°C to -80°C ± 20°C, more preferably from -20°C ± 10°C to -80°C ± 10°C, further more preferably from -20°C ± 20°C to -40°C ± 20°C, further more preferably from -20°C ± 10°C to -40°C ± 10°C, further more preferably -20°C ± 10°C, and further more preferably -20°C ± 5°C. The period of time for storing the RNA-containing SSL at the low temperature condition is not particularly limited and is preferably 12 months or less, for example, 6 hours or more and 12 months or less, more preferably 6 months or less, for example, 1 day or more and 6 months or less, and further more preferably 3 months or less, for example, 3 days or more and 3 months or less.
  • In the method of the present invention, the method of acquiring RNA expression information is not particularly limited, and examples thereof include acquisition by converting RNAs included in a specimen into cDNAs by reverse transcription and then measuring the cDNAs or an amplification product thereof. Examples of the means of measuring an expression level include a DNA chip, a DNA microarray, and RNA-Seq, and the means is preferably RNA-Seq.
  • The RNA expression level is quantitatively measured by a signal intensity ratio when microarray analysis is used and by the number of sequence reads (read count value) mapped on a genome when RNA-seq analysis is used.
  • The method of the present invention includes a step of acquiring information on the RNA expression level and includes a step of obtaining the number of sequence reads (read count value) quantitatively measured as the RNA expression level as described above. After the step, the data of the RNA expression level are stored in a server or a recording medium of a computer and input into a computer, and processing of the data of the present invention can be implemented by the program installed in the computer based on the input data.
  • In the data processing method of RNA information of the present invention, the expression information of analysis target RNA is extracted by setting a threshold as a selection criterion of data analysis target specimens and a threshold as a selection criterion of data analysis target genes, and normalization is performed.
  • As shown in Examples as described later, regarding the RNA expression level data (read count value by RNA-Seq) in a specimen derived from a subject, the selection criterion of specimens (subjects) as a data analysis target and the selection criterion of genes as a data analysis target were examined as follows. As a selection index of specimens (j) as a data analysis target, the TDj value determined for each specimen by the following equation is used. The TD value is Targets Detected and corresponds to the gene detection rate (%).
    $$\mathrm{TD}_j = \frac{\text{number of detectable genes in specimen } j}{\text{total number of detection target genes}} \times 100$$
  • Here, the total number of detection target genes is the total number of genes judged to be theoretically detectable in RNA expression analysis and may be appropriately determined based on the RNA expression analytical method to be used. In the case of the sequencing method (AmpliSeq) of Examples as described later, the total number is determined based on the number of primer pairs of Multiplex PCR.
  • In addition, the number of detectable genes can be calculated by subtracting the number of undetectable genes from the total number of detection target genes. Here, the number of undetectable genes means the number of genes of which the expression is zero or can be regarded as zero.
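  • The TD calculation described above can be sketched in Python as follows. This is an illustrative sketch rather than the implementation of the invention: the function name `td_value`, the parameter `zero_like_below`, and the example counts are assumptions, and the cutoff of 10 reads mirrors the value given for RNA-seq data elsewhere in this description.

```python
def td_value(read_counts, zero_like_below=10):
    """Return TD (%) for one specimen: detectable genes / detection target genes x 100.

    read_counts holds one read count per detection target gene.
    zero_like_below is an assumed cutoff: counts below it are regarded as zero.
    """
    total = len(read_counts)
    if total == 0:
        raise ValueError("no detection target genes")
    # Detectable genes = total detection target genes minus undetectable genes.
    undetectable = sum(1 for c in read_counts if c < zero_like_below)
    detectable = total - undetectable
    return 100 * detectable / total


# Hypothetical specimen with 10 detection target genes; 4 have a count of 10 or more.
specimen_counts = [0, 3, 12, 250, 9, 40, 0, 0, 15, 7]
td_example = td_value(specimen_counts)
```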
  • In selection of genes (i) as a data analysis target, the SDi value determined for each gene by the following equation is used. The SD value (Samples Detected) is the detection specimen rate: for each gene in the RNA expression level data of the data analysis target specimens remaining after selection using the TD value, it is the percentage of specimens in which expression of the RNA derived from that gene was detected. Here, expression of an RNA is regarded as detected when its expression level is higher than zero or higher than a level that can be regarded as zero.
    $$\mathrm{SD}_i = \frac{\text{number of specimens (subjects) showing detectable expression of gene } i}{\text{total number of analysis target specimens (subjects)}} \times 100$$
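  • The SD calculation can be sketched in the same way. The function name `sd_values`, the parameter `zero_like_at_or_below`, and the toy matrix are illustrative assumptions; the default treats any read count higher than 0 as detected.

```python
def sd_values(count_matrix, zero_like_at_or_below=0):
    """Return SD (%) for each gene across the analysis target specimens.

    count_matrix is a list of specimens, each a list of per-gene read counts.
    A gene counts as detected in a specimen when its read count exceeds
    zero_like_at_or_below (0 by default, i.e. any read at all).
    """
    n_specimens = len(count_matrix)
    if n_specimens == 0:
        raise ValueError("no analysis target specimens")
    n_genes = len(count_matrix[0])
    return [
        100 * sum(1 for row in count_matrix if row[i] > zero_like_at_or_below) / n_specimens
        for i in range(n_genes)
    ]


# Hypothetical data: 4 selected specimens (rows) x 3 genes (columns).
selected_specimens = [
    [5, 0, 1],
    [2, 0, 0],
    [9, 3, 4],
    [1, 0, 7],
]
sd_example = sd_values(selected_specimens)
```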
  • Specimens (subjects) having a TDj value of 0% or less than 20% or 30% were excluded, and the other specimens (subjects) were selected as data analysis target specimens (subjects). Subsequently, genes having an SDi value of less than 70%, 80%, 90%, or 100% were excluded, and the other genes were selected as data analysis target genes. The RNA expression level data extracted for these genes were subjected to normalization by DESeq2 (Love MI, et al., Genome Biol., 2014) to verify the degree of approximation to a normal distribution. As a result, a possibility of better approximation to a normal distribution in normalization by DESeq2 was demonstrated by excluding specimens having a TD value of 0%, less than 20%, or less than 30% and excluding genes having an SD value of less than 80%, less than 90%, or less than 100%.
  • However, in this case, it was demonstrated that although about 80% of the number of analysis target specimens can be secured as analyzable specimens when specimens having a TD value of less than 20% are excluded, the number of analyzable specimens is decreased to about 60% when specimens having a TD value of less than 30% are excluded. It was also demonstrated that although the number of analyzable genes is less than 20% of the analysis target genes when genes having an SD value of less than 90% are excluded, the number is decreased to several percent when genes having an SD value of less than 100% are excluded.
  • Accordingly, in the present invention, RNAs of which the expression level is zero or can be regarded as zero are judged to be undetectable, the number of detectable RNAs is counted, and proportion 1 (TD value) of the number of detectable RNAs with respect to the total number of the detection target RNAs is determined for each specimen (step a). Specimens for which the proportion 1 is less than a threshold set within a range from 5% to 29% are excluded to select analysis target specimens (step b). For each detection target RNA, proportion 2 (SD value) of the number of specimens for which the RNA expression level is higher than zero or higher than an expression level that can be regarded as zero, with respect to the total number of the analysis target specimens, is determined in the selected specimens (step c). RNAs for which the proportion 2 is less than a threshold set within a range from 81% to 99% are excluded, and the expression information of the other RNAs is extracted as the analysis target (step d). As a result, effective normalization becomes possible in the subsequent normalization processing.
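  • Steps a to d can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the program of the invention: the function name `extract_analysis_targets`, its parameter names, and the toy read counts are hypothetical, and the defaults (TD threshold 20%, SD threshold 90%, counts below 10 regarded as zero in step a, counts higher than 0 regarded as detected in step c) follow the preferred values given in this description.

```python
def extract_analysis_targets(counts, td_threshold=20.0, sd_threshold=90.0,
                             zero_like_below=10):
    """counts: list of specimens, each a list of read counts per detection target RNA."""
    n_rnas = len(counts[0])

    # Step a: proportion 1 (TD, %) of detectable RNAs for each specimen.
    td = [100 * sum(1 for c in row if c >= zero_like_below) / n_rnas for row in counts]

    # Step b: exclude specimens whose TD is below the threshold.
    kept_specimens = [j for j, value in enumerate(td) if value >= td_threshold]
    if not kept_specimens:
        raise ValueError("no specimen passes the TD threshold")

    # Step c: proportion 2 (SD, %) of selected specimens with a read count above 0.
    sd = [
        100 * sum(1 for j in kept_specimens if counts[j][i] > 0) / len(kept_specimens)
        for i in range(n_rnas)
    ]

    # Step d: exclude RNAs whose SD is below the threshold and extract the rest.
    kept_rnas = [i for i, value in enumerate(sd) if value >= sd_threshold]
    extracted = [[counts[j][i] for i in kept_rnas] for j in kept_specimens]
    return kept_specimens, kept_rnas, extracted


# Hypothetical data: 3 specimens x 5 detection target RNAs.
toy_counts = [
    [50, 12, 0, 3, 0],  # TD = 40%
    [30, 0, 11, 1, 0],  # TD = 40%
    [2, 0, 0, 0, 1],    # TD = 0%, excluded in step b
]
kept_specimens, kept_rnas, extracted = extract_analysis_targets(toy_counts)
```

  • In this toy data set, only the RNAs at indices 0 and 3 have a read count above 0 in both remaining specimens, so only their expression information is passed on to normalization.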
  • In the step a, the RNAs of which the expression level is zero or can be regarded as zero can be appropriately determined by a measurement means. For example, in RNA-seq analysis, such RNAs are those having a read count value of less than 20, preferably less than 15, and more preferably less than 10.
  • In the selection of analysis target specimens in the step b, the threshold of the proportion 1 of the number of detectable RNAs with respect to the total number of detection target RNAs is set to 5% or more from the viewpoint of effective normalization and is preferably 10% or more, more preferably 15% or more, and further more preferably 18% or more. At the same time, the threshold of the proportion 1 is set to 29% or less from the point of securing the number of analysis target specimens for analysis after the normalization and is preferably 27% or less, more preferably 25% or less, and further more preferably 23% or less. In addition, the threshold of the proportion 1 is appropriately set within a range from 5% to 29%, preferably within a range from 10% to 27%, more preferably within a range from 15% to 25%, and further more preferably within a range from 18% to 23%. The threshold of the proportion 1 is particularly preferably set to 20%.
  • In the step c, for each detection target RNA, the proportion 2 (SD value) of the number of specimens for which the expression level is higher than zero or higher than an expression level that can be regarded as zero with respect to the total number of the analysis target specimens is calculated. Here, the expression level that can be regarded as zero means that, for example, in RNA-seq analysis, the read count value is less than 5, preferably less than 3, and more preferably less than 1. In the present invention, as the proportion 2 (SD value), proportion of the number of specimens for which the expression level is higher than zero (in RNA-seq analysis, the number of specimens having a read count value of higher than 0) with respect to the total number of analysis target specimens is preferably used.
  • In addition, in the selection of analysis target RNAs in the step d, the threshold of the proportion 2 of the number of specimens for which the RNA expression level is higher than zero or higher than an expression level that can be regarded as zero with respect to the total number of the specimens is set to 81% or more from the viewpoint of effective normalization and is preferably 84% or more and more preferably 87% or more. At the same time, the threshold of the proportion 2 is set to 99% or less from the point of securing the number of analysis target genes for analysis after the normalization and is preferably 96% or less and more preferably 93% or less. In addition, the threshold of the proportion 2 is appropriately set within a range from 81% to 99%, preferably within a range from 84% to 96% and more preferably within a range from 87% to 93%. The threshold of the proportion 2 is particularly preferably set to 90%.
  • When the threshold of the proportion 1 in the step b is low, it is desirable that the threshold of the proportion 2 in the step d be set to be high for efficient normalization. When the threshold of the proportion 2 in the step d is low, it is desirable that the threshold of the proportion 1 in the step b be set to be high for efficient normalization.
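  • When the thresholds are supplied as settings to a program, the ranges stated above can be checked before any data are processed. The following sketch assumes a hypothetical settings class `SelectionThresholds`; the class and field names are illustrative, and only the numeric ranges and default values come from this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionThresholds:
    """Thresholds (in %) for specimen selection (proportion 1) and RNA selection (proportion 2)."""
    td_percent: float = 20.0  # particularly preferred value for proportion 1
    sd_percent: float = 90.0  # particularly preferred value for proportion 2

    def __post_init__(self):
        # Reject values outside the ranges given for steps b and d.
        if not 5 <= self.td_percent <= 29:
            raise ValueError("TD threshold must lie within 5% to 29%")
        if not 81 <= self.sd_percent <= 99:
            raise ValueError("SD threshold must lie within 81% to 99%")


default_thresholds = SelectionThresholds()
```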
  • Thus, effective correction of RNA expression values, approximated to a normal distribution, is possible by normalizing the total extracted expression information of the analysis target RNAs.
  • The method for normalization used in this case is not particularly limited, and, for example, in addition to the above-described RPM method and RLE method, FPKM (fragments per kilobase of exon per million reads mapped) method, RPKM (reads per kilobase of exon per million reads mapped), TPM (transcripts per million) method, TMM (Trimmed mean of M values) method, or the like can be adopted, and the RLE method is suitably used. The RLE method is implemented in an analytical method for performing a series of gene expression level analysis called DESeq2.
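  • To illustrate the idea behind RLE normalization, the following is a simplified pure-Python sketch of a median-of-ratios size factor calculation. It is not the DESeq2 implementation; the function names `rle_size_factors` and `rle_normalize` and the toy counts are assumptions. In this sketch, only genes with a positive read count in every specimen contribute to the reference, which is one reason why removing genes with many zero counts beforehand matters.

```python
import math
from statistics import median


def rle_size_factors(count_matrix):
    """Median-of-ratios size factor per specimen (rows: specimens, columns: genes)."""
    n_specimens = len(count_matrix)
    n_genes = len(count_matrix[0])
    # Reference: log geometric mean per gene, using only genes positive in every specimen.
    reference = {}
    for i in range(n_genes):
        column = [row[i] for row in count_matrix]
        if all(c > 0 for c in column):
            reference[i] = sum(math.log(c) for c in column) / n_specimens
    if not reference:
        raise ValueError("no gene has a positive count in every specimen")
    factors = []
    for row in count_matrix:
        log_ratios = [math.log(row[i]) - log_mean for i, log_mean in reference.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


def rle_normalize(count_matrix):
    """Divide each specimen's counts by its size factor."""
    factors = rle_size_factors(count_matrix)
    return [[c / f for c in row] for row, f in zip(count_matrix, factors)]


# Hypothetical data: the second specimen was sequenced twice as deeply as the first.
toy_counts = [
    [10, 20, 40, 0],
    [20, 40, 80, 5],
]
toy_factors = rle_size_factors(toy_counts)
toy_normalized = rle_normalize(toy_counts)
```

  • After normalization, the first three genes have the same value in both toy specimens, while the fourth gene, which has a zero count in one specimen, is left out of the reference.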
  • The data processing method and correction method for analysis of the RNA expression information can be performed using a computer (computing device). That is, the present invention can provide a computing device for implementing the above method, a program for implementing the method by the computer, and a computer-readable information recording medium on which the program is recorded. Furthermore, the present invention can provide a data set for RNA analysis obtained by the above data processing method. In addition, the present invention can perform data processing by inputting information, such as the proportion 1, the proportion 2, or the threshold, used for the above data processing or can also select the proper proportion 1, proportion 2, and thresholds by computation.
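  • As one hypothetical way of selecting a threshold by computation, the sketch below picks the highest candidate TD threshold that still retains a required fraction of specimens. The selection rule, the function name `pick_td_threshold`, the parameter `min_retained`, and the TD values are all assumptions made for illustration; this description does not prescribe a particular selection rule.

```python
def pick_td_threshold(td_values, candidates, min_retained=0.8):
    """Return the highest candidate threshold retaining at least min_retained of specimens.

    Returns None when no candidate retains enough specimens.
    """
    n = len(td_values)
    best = None
    for threshold in sorted(candidates):
        retained = sum(1 for value in td_values if value >= threshold) / n
        if retained >= min_retained:
            best = threshold
    return best


# Hypothetical TD values (%) for 10 specimens and candidate thresholds within 5% to 29%.
hypothetical_td = [3, 12, 22, 25, 31, 40, 45, 50, 55, 60]
chosen_threshold = pick_td_threshold(hypothetical_td, candidates=[5, 10, 15, 20, 25, 29])
```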
  • The computing device of the present invention includes a means for inputting RNA expression information obtained from a specimen collected from a subject and performs, according to the program for implementing the data processing method and correction method of the present invention, one or more steps selected from the group consisting of the above-described steps of selecting analysis target specimens, selecting analysis target genes, extracting RNA expression information of the analysis target genes, and normalizing the RNA expression information.
  • Examples of the computer-readable information recording medium that records the program for implementing the data processing method and correction method of the present invention include a magnetic disk, an optical disk, a magneto-optical disk, and a flash memory. In the present invention, the term computer-readable includes the case of distribution via an electric communication line or the like.
  • Aspects and preferable embodiments of the present invention are shown below.
    • <1> A data processing method for analysis of RNA expression information obtained from secretion collected from a plurality of subjects as biological specimens, the method comprising the following steps a) to d):
      a) a step of counting the number of detectable RNAs in detection target RNAs by judging RNAs of which an expression level is zero or can be regarded as zero to be undetectable, and determining proportion 1 (TD value) of the number of detectable RNAs with respect to the total number of the detection target RNAs for each specimen;
      b) a step of excluding specimens for which the proportion 1 is less than a threshold set within a range from 5% to 29% from the specimens to select analysis target specimens;
      c) a step of determining, for each detection target RNA, proportion 2 (SD value) of the number of specimens for which the expression level thereof is higher than zero or higher than an expression level that can be regarded as zero with respect to the total number of the analysis target specimens based on the RNA expression information of the analysis target specimens selected above; and
      d) a step of excluding RNAs for which the proportion 2 is less than a threshold set within a range from 81% to 99% from the detection target RNAs and extracting expression information of RNAs other than the excluded RNAs as an analysis target.
    • <2> The method according to <1>, wherein the secretion is skin surface lipids.
    • <3> The method according to <1> or <2>, wherein the information on the RNA expression level in the step a) is a read count value by RNA-Seq.
    • <4> The method according to any one of <1> to <3>, wherein the RNAs of which the expression level is zero or can be regarded as zero in the step a) are RNAs of which the read count value by RNA-seq is less than 20, preferably less than 15, and more preferably less than 10.
    • <5> The method according to any one of <1> to <4>, wherein the threshold of the proportion 1 in the step b) is set to preferably 10% or more, more preferably 15% or more, and further more preferably 18% or more; and preferably 27% or less, more preferably 25% or less, and further more preferably 23% or less; or is set preferably within a range from 10% to 27%, more preferably within a range from 15% to 25%, and further more preferably within a range from 18% to 23%.
    • <6> The method according to any one of <1> to <4>, wherein the threshold of the proportion 1 in the step b) is set to 20%.
    • <7> The method according to any one of <1> to <6>, wherein the expression level that can be regarded as zero in the step c) is a read count value in RNA-seq of less than 5, preferably less than 3, and more preferably less than 1.
    • <8> The method according to any one of <1> to <6>, wherein the specimens for which the expression level is higher than zero or higher than an expression level that can be regarded as zero in the step c) are specimens for which the read count value in RNA-seq is higher than 0.
    • <9> The method according to any one of <1> to <8>, wherein the threshold of the proportion 2 in the step d) is set to preferably 84% or more and more preferably 87% or more; and preferably 96% or less and more preferably 93% or less; or is set preferably within a range from 84% to 96% and more preferably within a range from 87% to 93%.
    • <10> The method according to any one of <1> to <8>, wherein the threshold of the proportion 2 in the step d) is set to 90%.
    • <11> A method of correcting an RNA expression value, comprising normalizing the total RNA expression information extracted by the method according to any one of <1> to <10>.
    • <12> The method according to <11>, wherein the normalization is performed by an RLE method.
    • <13> A program for implementing the data processing method or correction method according to any one of <1> to <12> for analysis of RNA expression information.
    • <14> An information recording medium which records the program according to <13>.
    • <15> A computing device comprising one or more steps selected from the group consisting of a step of selecting analysis target specimens, a step of selecting analysis target genes, a step of extracting RNA expression information of the analysis target genes, and a step of calculating normalization of the RNA information of the analysis target genes that are implemented by the program according to <13>.
    • <16> An RNA analysis data set obtained by the data processing method or correction method according to any one of <1> to <12> for analysis of RNA expression information.
    Examples
  • The present invention will now be described in further detail based on Examples but is not limited thereto.
  • Example 1: Normalization of RNA expression data extracted from SSL 1) SSL collection
  • Sebum was collected from the entire face of each of 42 healthy subjects (females aged 20 to 59) using an oil-blotting film. Each oil-blotting film was then transferred to a vial and stored at -80°C for about one month until used for RNA extraction.
  • 2) RNA preparation and sequencing
  • The oil-blotting films of the above 1) were each cut into a suitable size, and RNAs were extracted using QIAzol Lysis Reagent (Qiagen) in accordance with the attached protocol. The extracted RNAs were reverse transcribed at 42°C for 90 minutes using a SuperScript VILO cDNA Synthesis kit (Life Technologies Japan Ltd.) to synthesize cDNAs. As the primers of the reverse transcription reaction, random primers attached to the kit were used. A library containing DNAs derived from 20,802 genes was prepared from the resulting cDNAs by multiplex PCR. The multiplex PCR was performed using Ion AmpliSeq Transcriptome Human Gene Expression Kit (Life Technologies Japan Ltd.) under conditions of [99°C, 2 min → (99°C, 15 sec → 62°C, 16 min) × 20 cycles → 4°C, Hold] . The resulting PCR product was purified with Ampure XP (Beckman Coulter, Inc.) and was then subjected to buffer reconstruction, digestion of primer sequences, adapter ligation, purification, and amplification to prepare a library. The prepared library was loaded on Ion 540 Chip and sequenced using Ion S5/XL system (Life Technologies Japan Ltd.).
  • 3) Data analysis
  • For the RNA expression level data (read count values) derived from the subjects and measured in the above 2), a selection criterion of data analysis target subjects and a selection criterion of data analysis target genes were examined. As the selection criterion of data analysis target subjects, the value of Targets Detected (TD) calculated in Torrent Suite (Life Technologies Japan Ltd.) was used. The threshold of TDj calculated for each subject was set to 0%, 20%, or 30%; subjects below the threshold were excluded from the analysis target, and the other subjects were selected as data analysis target subjects. As the extraction criterion of data analysis target genes, the percentage of subjects having a read count value of higher than 0 (Samples Detected, SD) was used for each gene of the RNA expression level data after the selection of data analysis target subjects using TD. The threshold of SDi calculated for each detection target gene was set to 70%, 80%, 90%, or 100%; genes below the threshold were excluded from the analysis target, and the other genes were selected as data analysis target genes. After the data analysis target subjects were selected and the expression information on the selected data analysis target genes was extracted, the read count values were normalized using the method of DESeq2, and the base-2 logarithm of the value obtained by adding 1 to each normalized read count value (Log2 (normalized count + 1) value) was calculated. Figure 1 shows a box plot of the Log2 (normalized count + 1) values in each subject.
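  • The Log2 (normalized count + 1) transformation and the per-subject median, which is the center line of each box in a box plot, can be sketched as follows. The function names and the normalized counts are illustrative assumptions; the normalized counts would in practice come from the normalization step.

```python
import math
from statistics import median


def log2_normalized_plus_one(normalized_counts):
    """Apply Log2(normalized count + 1) to one subject's normalized counts."""
    return [math.log2(c + 1) for c in normalized_counts]


def per_subject_medians(normalized_matrix):
    """Median of the Log2(normalized count + 1) values for each subject (row)."""
    return [median(log2_normalized_plus_one(row)) for row in normalized_matrix]


# Hypothetical normalized counts for 2 subjects x 3 genes.
hypothetical_normalized = [
    [0, 1, 3],
    [7, 15, 31],
]
medians_example = per_subject_medians(hypothetical_normalized)
```

  • Adding 1 before taking the logarithm keeps a normalized count of 0 defined, since it maps to Log2(1) = 0.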
  • Here, the values of TDj of subject j (j: an integer from 1 to n, n: the number of subjects) and SDi of gene i (i: an integer from 1 to m, m: the number of detection target genes) were calculated as follows.
    $$\mathrm{TD}_j = \frac{(\text{total number of detection target genes}) - (\text{number of genes having a read count value of less than 10 in subject } j)}{\text{total number of detection target genes}} \times 100$$
    $$\mathrm{SD}_i = \frac{\text{number of subjects having a read count value of higher than 0 for gene } i}{\text{total number of analysis target subjects}} \times 100$$
  • 4) Setting of optimum selection criterion
  • The variance of the median value was calculated for the Log2 (normalized count + 1) values calculated in the above 3). As a result, the variance of the median value decreased to 0.1 or less as the threshold of the TD value or the SD value was increased (Table 1, boldface). A synergistic decrease in the variance of the median value with increases in the thresholds of the TD value and the SD value was also confirmed. Accordingly, it was demonstrated that the median value of each subject after the normalization by DESeq2 can be adjusted by selection of data analysis target subjects and data analysis target genes using the TD value and the SD value. However, when subjects for which the TD value was less than 20% were excluded, the proportion of analyzable subjects was about 83%, whereas when subjects for which the TD value was less than 30% were excluded, the proportion of analyzable subjects decreased to about 64% (Table 2). Since it is necessary to secure the number of analysis target subjects in analysis after normalization, it was demonstrated that it is suitable to set the threshold for selection of data analysis target subjects to a TD value of 20% (Table 2, boldface). When genes for which the SD value was less than 90% were excluded, the proportion of analyzable genes was about 16%, whereas when genes for which the SD value was less than 100% were excluded, the proportion of analyzable genes decreased to 2% or 6% (Table 3). Since it is necessary to secure the number of analysis target genes in analysis after normalization, it was demonstrated that it is suitable to set the threshold for selection of data analysis target genes to an SD value of 90% (Table 3, boldface).
    [Table 1]
    | Median variance | SD 70% | SD 80% | SD 90% | SD 100% |
    |---|---|---|---|---|
    | TD 0% | 2.39 | 1.50 | 0.49 | 0.041 |
    | TD 20% | 0.66 | 0.18 | 0.041 | 0.033 |
    | TD 30% | 0.17 | 0.10 | 0.047 | 0.024 |
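  • The "variance of the median value" can be sketched as the variance, across subjects, of each subject's median Log2 (normalized count + 1) value. The source does not state whether a sample or population variance was used; the sketch below assumes the sample variance (`statistics.variance`), and the function name `median_variance` and the toy values are hypothetical.

```python
from statistics import median, variance


def median_variance(log_matrix):
    """Sample variance across subjects of the per-subject median log-expression value."""
    medians = [median(row) for row in log_matrix]
    return variance(medians)  # assumption: sample variance (n - 1 denominator)


# Hypothetical log-transformed values: rows are subjects, columns are genes.
aligned = [[1.0, 2.0, 3.0], [1.5, 2.0, 2.5], [0.0, 2.0, 9.0]]  # medians 2, 2, 2
shifted = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]  # medians 1, 3, 5
aligned_variance = median_variance(aligned)
shifted_variance = median_variance(shifted)
```

  • A value near zero, as for `aligned`, indicates that the box-plot centers of the subjects line up after normalization, whereas a larger value, as for `shifted`, indicates that they do not.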
    [Table 2]
    | Number of subjects (proportion) | SD 70% | SD 80% | SD 90% | SD 100% |
    |---|---|---|---|---|
    | TD 0% | (NA) | (NA) | (NA) | 42 (100%) |
    | TD 20% | (NA) | (NA) | 35 (83%) | 35 (83%) |
    | TD 30% | (NA) | 27 (64%) | 27 (64%) | 27 (64%) |
    NA: not applicable (out of the target)
    [Table 3]
    | Number of genes (proportion) | SD 70% | SD 80% | SD 90% | SD 100% |
    |---|---|---|---|---|
    | TD 0% | (NA) | (NA) | (NA) | 451 (2%) |
    | TD 20% | (NA) | (NA) | 3282 (16%) | 1151 (6%) |
    | TD 30% | (NA) | (NA) | (NA) | (NA) |
    NA: not applicable (out of the target)
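  • The proportions reported in Tables 2 and 3 can be reproduced with simple arithmetic. The sketch below assumes that the denominators are the 42 subjects of Example 1 and the 20,802 genes covered by the multiplex PCR library; the gene denominator is an inference from the sequencing description and is not stated in the tables themselves.

```python
TOTAL_SUBJECTS = 42          # subjects enrolled in Example 1
TOTAL_TARGET_GENES = 20_802  # assumed denominator: genes covered by the library


def percent(part, whole):
    """Proportion as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Counts taken from Table 2 (subjects) and Table 3 (genes).
subject_proportions = {n: percent(n, TOTAL_SUBJECTS) for n in (42, 35, 27)}
gene_proportions = {n: percent(n, TOTAL_TARGET_GENES) for n in (3282, 1151, 451)}
```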

Claims (12)

  1. A data processing method for analysis of RNA expression information obtained from secretion collected from a plurality of subjects as biological specimens, the method comprising the following steps a) to d):
    a) a step of counting the number of detectable RNAs in detection target RNAs by judging RNAs of which an expression level is zero or can be regarded as zero to be undetectable, and determining proportion 1 (TD value) of the number of detectable RNAs with respect to the total number of the detection target RNAs for each specimen;
    b) a step of excluding specimens for which proportion 1 is less than a threshold set within a range from 5% to 29% from the specimens to select analysis target specimens;
    c) a step of determining, for each detection target RNA, proportion 2 (SD value) of the number of specimens for which the expression level thereof is higher than zero or higher than an expression level that can be regarded as zero with respect to the total number of the analysis target specimens based on the RNA expression information of the analysis target specimens selected above; and
    d) a step of excluding RNAs for which the proportion 2 is less than a threshold set within a range from 81% to 99% from the detection target RNAs and extracting expression information of RNAs other than the excluded RNAs as an analysis target.
  2. The method according to Claim 1, wherein the secretion is skin surface lipids.
  3. The method according to Claim 1 or 2, wherein the information on the RNA expression level in the step a) is a read count value by RNA-Seq.
  4. The method according to any one of Claims 1 to 3, wherein the RNAs of which the expression level is zero or can be regarded as zero in the step a) are RNAs of which the read count value by RNA-seq is less than 10.
  5. The method according to any one of Claims 1 to 4, wherein the threshold of the proportion 1 in the step b) is set to 20%.
  6. The method according to any one of Claims 1 to 5, wherein the specimens for which the expression level is higher than zero or higher than an expression level that can be regarded as zero in the step c) are specimens for which the read count value in RNA-seq is higher than 0.
  7. The method according to any one of Claims 1 to 6, wherein the threshold of the proportion 2 in the step d) is set to 90%.
  8. A method of correcting an RNA expression value, comprising normalizing the total RNA expression information extracted by the method according to any one of Claims 1 to 7.
  9. A program for implementing the data processing method or correction method according to any one of Claims 1 to 8 for analysis of RNA expression information.
  10. An information recording medium which records the program according to Claim 9.
  11. A computing device comprising one or more steps selected from the group consisting of a step of selecting analysis target specimens, a step of selecting analysis target genes, a step of extracting RNA expression information of the analysis target genes, and a step of calculating normalization of the RNA information of the analysis target genes that are implemented by the program according to Claim 9.
  12. An RNA analysis data set obtained by the data processing method according to any one of Claims 1 to 8 for analysis of RNA expression information.
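
  For illustration only, the two-stage filtering described in Claims 4 to 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation. Steps a) to c) are not reproduced in this section, so the sketch assumes that proportion 1 is, for each specimen, the share of detection target RNAs whose read count is not regarded as zero, that specimens below the proportion 1 threshold are excluded, and that proportion 2 is, for each RNA, the share of the remaining specimens with a read count higher than 0. The function name `filter_expression`, its parameter names, and the nested-dictionary input layout are hypothetical.

```python
from typing import Dict, List, Tuple


def filter_expression(
    counts: Dict[str, Dict[str, int]],  # specimen -> {RNA -> read count}
    zero_like_below: int = 10,          # Claim 4: counts below 10 regarded as zero
    specimen_threshold: float = 0.20,   # Claim 5: threshold of proportion 1
    rna_threshold: float = 0.90,        # Claim 7: threshold of proportion 2
) -> Tuple[List[str], List[str]]:
    """Return (analysis target specimens, analysis target RNAs)."""
    # Detection target RNAs: every RNA that appears in any specimen.
    rnas = sorted({rna for row in counts.values() for rna in row})
    if not rnas:
        return [], []

    # Assumed steps a) and b): keep specimens whose proportion 1
    # reaches the threshold.
    kept_specimens = []
    for specimen, row in counts.items():
        detected = sum(1 for rna in rnas if row.get(rna, 0) >= zero_like_below)
        if detected / len(rnas) >= specimen_threshold:
            kept_specimens.append(specimen)
    if not kept_specimens:
        return [], []

    # Assumed step c) and step d): exclude RNAs whose proportion 2 is
    # less than the threshold; the rest form the analysis target.
    kept_rnas = []
    for rna in rnas:
        expressed = sum(1 for s in kept_specimens if counts[s].get(rna, 0) > 0)
        if expressed / len(kept_specimens) >= rna_threshold:
            kept_rnas.append(rna)
    return kept_specimens, kept_rnas
```

  The default values mirror the specific thresholds recited in Claims 4, 5 and 7; because step d) allows the proportion 2 threshold to be set anywhere from 81% to 99%, it is exposed as a parameter rather than fixed.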
EP21803729.9A 2020-05-14 2021-05-14 Method for data processing of rna information Pending EP4151728A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020085433 2020-05-14
PCT/JP2021/018512 WO2021230380A1 (en) 2020-05-14 2021-05-14 Method for data processing of rna information

Publications (1)

Publication Number Publication Date
EP4151728A1 (en) 2023-03-22

Family

ID=78525196

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21803729.9A Pending EP4151728A1 (en) 2020-05-14 2021-05-14 Method for data processing of rna information

Country Status (5)

Country Link
US (1) US20230197195A1 (en)
EP (1) EP4151728A1 (en)
JP (1) JP2021182386A (en)
CN (1) CN115605613A (en)
WO (1) WO2021230380A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3719138B1 (en) 2016-07-08 2022-09-21 Kao Corporation Method for preparing nucleic acid sample
SG11201912011YA (en) * 2017-06-13 2020-01-30 Oncologica Uk Ltd Method for determining the susceptibility of a patient suffering from proliferative disease to treatment using an agent which targets a component of the pd1/pd-l1 pathway
CN112955551A (en) * 2018-11-01 2021-06-11 花王株式会社 Method for producing nucleic acid derived from skin cell of subject

Also Published As

Publication number Publication date
WO2021230380A1 (en) 2021-11-18
JP2021182386A (en) 2021-11-25
US20230197195A1 (en) 2023-06-22
CN115605613A (en) 2023-01-13

Similar Documents

Publication Publication Date Title
JP7323671B2 (en) Nucleic acid sample preparation method
WO2020091044A1 (en) Method for preparing nucleic acid derived from skin cell of subject
Matsubara et al. DV200 index for assessing RNA integrity in next-generation sequencing
Benson et al. An analysis of select pathogenic messages in lesional and non-lesional psoriatic skin using non-invasive tape harvesting
EP3299473A1 (en) Method for diagnosing early onset of Alzheimer's disease or mild cognitive impairment
Ronald Moy et al. An adhesive patch-based skin biopsy device for molecular diagnostics and skin microbiome studies
EP3023504B1 (en) Method and device for detecting chromosomal aneuploidy
EP4151728A1 (en) Method for data processing of rna information
KR102216913B1 (en) Minimally invasive kit evaluating skin moisturization degree including microneedle patch and biomarker for evaluating skin moisturization degree
Kågedal et al. Failure of the PAXgene™ Blood RNA System to maintain mRNA stability in whole blood
WO2021215531A1 (en) Menstrual cycle marker
CN113481294B (en) Method for preparing nucleic acid sample
EP4141114A1 (en) Method for detecting severity of premenstrual syndrome
JP2023136284A (en) Data homogenization method for sebum RNA information
JP2023069413A (en) Method for evaluating skin moisturizing effect of coating layer
JP2022097301A (en) Stress marker, and method for detecting chronic stress level by using the same
JP2022097303A (en) Fatigue marker and method for detecting fatigue by using the same
WO2022255378A1 (en) Internal reference gene in skin surface lipid specimen
JP2022174645A (en) Biological age prediction method
JP2022097302A (en) Sleep state marker and method for detecting sleep state by using the same
JP2012024027A (en) Method for evaluating cutaneous blood vessel function
CN116622846A (en) Peripheral blood circRNA biomarker for diffuse large B cell lymphoma diagnosis and application thereof
JP2011115098A (en) Method for evaluating skin for predicting reduction of skin brightness by aging

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20221121

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)