WO2022131328A1 - Method for calculating reliability value of signal at polymorphic locus - Google Patents
Method for calculating reliability value of signal at polymorphic locus
- Publication number
- WO2022131328A1 (PCT/JP2021/046513, JP2021046513W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- nucleic acid
- allele
- polymorphic
- component signal
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6844—Nucleic acid amplification reactions
- C12Q1/6851—Quantitative amplification
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/10—Signal processing, e.g. from mass spectrometry [MS] or from PCR
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6844—Nucleic acid amplification reactions
- C12Q1/686—Polymerase chain reaction [PCR]
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
- C12Q1/6886—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2531/00—Reactions of nucleic acids characterised by
- C12Q2531/10—Reactions of nucleic acids characterised by the purpose being amplify/increase the copy number of target nucleic acid
- C12Q2531/113—PCR
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/156—Polymorphic or mutational markers
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/20—Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
Definitions
- The present invention relates to the processing of analysis data such as SNP data.
- A non-invasive prenatal paternity test (NIPPT) can be performed by analyzing cell-free fetal DNA (cffDNA), genetic material derived from the fetus that is mixed into the mother's blood circulation (for example, Patent Document 1).
- Cancer testing, including cancer screening and evaluation of the progress of anticancer treatment, can also be mentioned.
- When cancer cells are destroyed by the immune system or undergo cell death (apoptosis) on their own, or when circulating tumor cells (CTCs) in the blood are destroyed by some influence, the genomic DNA of the cancer cells leaks into the blood.
- The cfDNA derived from such cancer cells is sometimes specifically called ctDNA (circulating tumor DNA).
- ctDNA circulating tumor DNA
- Monitoring of the engraftment of transplanted organs can also be mentioned as an application of cfDNA analysis technology.
- Although the success rate has improved with better immunosuppressive drugs, rejection remains a major obstacle to long-term engraftment of transplanted organs.
- Genomic DNA leaks into the blood from the cells constituting the transplanted organ.
- This cfDNA derived from the transplanted organ (sometimes specifically called ddcfDNA) is expected to serve as a biomarker for damage to the transplanted organ.
- SNPs single nucleotide polymorphisms
- One method selects single nucleotide polymorphisms (SNPs) that can distinguish the donor from the recipient and quantifies the very small amount of ddcfDNA leaked into the recipient's blood using a next-generation sequencer or the like (for example, Patent Document 3).
- In Patent Document 3, as in the above-mentioned prenatal genetic test, most of the cfDNA is derived from the recipient's genomic DNA and the proportion of ddcfDNA is extremely small. It is therefore extremely difficult to determine whether a signal suggesting the presence of ddcfDNA, obtained by analysis of cfDNA, really originates from the genomic DNA of the transplanted organ or is noise.
- The problem to be solved by the present invention is to provide a novel technique for evaluating the reliability of a signal indicating the presence of a secondary nucleic acid in the analysis data of a mixed nucleic acid sample that contains a secondary nucleic acid such as cffDNA, ctDNA, or ddcfDNA in a minute proportion.
- The present invention that solves the above problem is as follows.
- A method for creating a model function for calculating a reliability value of a secondary component signal, comprising the following steps A-1, A-2, A-3-1, and A-4-1.
- [Step A-1] A step of preparing a data set obtained by measurement of a mixed nucleic acid sample comprising a major nucleic acid containing genetic information about a major contributor and a secondary nucleic acid containing genetic information about a secondary contributor, the data set containing signals indicating the presence of each allele at a plurality of polymorphic loci in the major nucleic acid and the secondary nucleic acid (provided that the authenticity of each signal is known).
- [Step A-2] A step of generating one or more synthetic variables by linearly combining a numerical group including at least the following (A1) and (A2), for those polymorphic loci, among the plurality of polymorphic loci in the data set, at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately.
- (A1) The secondary component signal intensity, which indicates the presence of an allele at a specific polymorphic locus derived from the secondary nucleic acid.
- (A2) The secondary component mixing ratio, which is the ratio of the secondary component signal intensity to the total signal intensity caused by the alleles at the specific polymorphic locus.
- [Step A-3-1] A step of dividing the synthetic variable generated in step A-2 into a plurality of categories and assigning, as the probability corresponding to the synthetic variables included in each category, the proportion of true signals among the secondary component signals corresponding to those synthetic variables.
- [Step A-4-1] A step of performing regression analysis on the synthetic variables included in each category and the probabilities corresponding to them, thereby obtaining a model function for calculating the reliability value, with the synthetic variable as the explanatory variable and the reliability value as the objective variable.
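The categorisation and regression of steps A-3-1 and A-4-1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the categories are taken to be equal-width bins, and the regression is taken to be a logistic curve fitted by least squares on the logit scale. The text does not fix either choice, and the names `bin_probabilities` and `fit_model_function` are hypothetical.

```python
import math

def bin_probabilities(values, is_true, n_bins):
    """Step A-3-1 sketch: split the synthetic variable into equal-width
    categories (an assumed binning rule) and return, for each non-empty
    category, (category centre, fraction of signals known to be true)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all values identical: everything falls into category 0
    counts = [[0, 0] for _ in range(n_bins)]  # [true signals, all signals]
    for value, truth in zip(values, is_true):
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index][0] += 1 if truth else 0
        counts[index][1] += 1
    return [(lo + (i + 0.5) * width, t / n)
            for i, (t, n) in enumerate(counts) if n > 0]

def fit_model_function(points, eps=1e-3):
    """Step A-4-1 sketch: regress logit(probability) on the synthetic variable
    by ordinary least squares and return the fitted logistic model function.
    The logistic form is an assumption, not an equation from the patent."""
    xs = [x for x, _ in points]
    # Clip probabilities away from 0 and 1 so that the logit stays finite.
    clipped = [min(max(p, eps), 1.0 - eps) for _, p in points]
    ys = [math.log(p / (1.0 - p)) for p in clipped]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two categories with different centres")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
```

The returned function maps a synthetic variable to a value between 0 and 1 and can then be applied to signals whose authenticity is unknown.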
- The synthetic variable used for creating the model function in steps A-3-1 and A-4-1 is the one with the highest contribution rate among the one or more synthetic variables generated in step A-2.
- Step A-2 is a step of performing principal component analysis on a numerical group including at least (A1) and (A2) above and generating one or more principal components as synthetic variables.
- The method according to any one of [1] to [3].
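A minimal Python sketch of generating a synthetic variable by principal component analysis and reading off its contribution rate is given below. It assumes power iteration on the unstandardised sample covariance matrix; whether the variables should be standardised first is not stated in the text. The function names are hypothetical.

```python
import math

def first_principal_component(rows, iterations=500):
    """Return (weights, contribution_rate) of the first principal component.
    Each row is a numerical group, e.g. (signal intensity A1, mixing ratio A2).
    Uses power iteration on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    weights = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        product = [sum(cov[i][j] * weights[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0.0:
            raise ValueError("data have no variance along the start vector")
        weights = [p / norm for p in product]
    # Rayleigh quotient of the unit-length weight vector gives the eigenvalue.
    eigenvalue = sum(weights[i] * cov[i][j] * weights[j]
                     for i in range(d) for j in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return weights, eigenvalue / total_variance

def synthetic_variable(weights, row):
    """Linear combination of the numerical group with the given weights."""
    return sum(w * x for w, x in zip(weights, row))
```

The contribution rate returned here is the share of total variance explained by the first component, which is the quantity one would compare when choosing among several synthetic variables.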
- Step A-2 concerns those polymorphic loci, among the plurality of polymorphic loci in the data set, at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately, and uses a numerical group that includes at least (A1) and (A2) above and one or more selected from the following (A3) to (A5).
- (A3) The major component signal intensity, which indicates the presence of one allele at a specific polymorphic locus derived from the major nucleic acid.
- Step A-2 concerns those polymorphic loci, among the plurality of polymorphic loci in the data set, at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately, and linearly combines a numerical group that includes at least (A1) and (A2) above and further includes the following (A3) to (A5).
- (A3) The major component signal intensity, which indicates the presence of one allele at a specific polymorphic locus derived from the major nucleic acid.
- The method described, wherein, in the first-order homogeneous polynomial representing the synthetic variable, the secondary component signal intensity or the secondary component mixing ratio carries the largest weight.
- In step A-2, two or more synthetic variables are generated.
- In step A-3-1, reliability values are assigned for each of the two or more synthetic variables.
- In step A-4-1, two or more independent model functions are created, each having one of the two or more synthetic variables as its explanatory variable.
- A method for creating a model function for calculating a reliability value of a secondary component signal, comprising the following steps A-1, A-3-2, and A-4-2.
- [Step A-1] A step of preparing a data set obtained by measurement of a mixed nucleic acid sample comprising a major nucleic acid containing genetic information about a major contributor and a secondary nucleic acid containing genetic information about a secondary contributor, the data set containing signals indicating the presence of each allele at a plurality of polymorphic loci in the major nucleic acid and the secondary nucleic acid (provided that the authenticity of each signal is known).
- [Step A-3-2] A step of, for those polymorphic loci among the plurality of polymorphic loci at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately, dividing the secondary component signal intensities, each indicating the presence of an allele at a specific polymorphic locus derived from the secondary nucleic acid, into a plurality of categories and assigning, as the probability corresponding to the secondary component signal intensities included in each category, the proportion of those signals that are true.
- [Step A-4-2] A step of performing regression analysis on the secondary component signal intensities included in each category and the probabilities corresponding to them, thereby obtaining a model function for calculating the reliability value, with the secondary component signal intensity as the explanatory variable and the reliability value as the objective variable.
- A method for creating a model function for calculating a reliability value of a secondary component signal, comprising the following steps A-1, A-3-3, and A-4-3.
- [Step A-1] A step of preparing a data set obtained by measurement of a mixed nucleic acid sample comprising a major nucleic acid containing genetic information about a major contributor and a secondary nucleic acid containing genetic information about a secondary contributor, the data set containing signals indicating the presence of each allele at a plurality of polymorphic loci in the major nucleic acid and the secondary nucleic acid (provided that the authenticity of each signal is known).
- [Step A-3-3] A step of, for those polymorphic loci among the plurality of polymorphic loci at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately, dividing the secondary component mixing ratios, each being the ratio of the secondary component signal intensity to the total signal intensity caused by the alleles at a specific polymorphic locus, into a plurality of categories and assigning, as the probability corresponding to the secondary component mixing ratios included in each category, the proportion of the corresponding secondary component signals that are true.
- [Step A-4-3] A step of performing regression analysis on the secondary component mixing ratios included in each category and the probabilities corresponding to them, thereby obtaining a model function for calculating the reliability value, with the secondary component mixing ratio as the explanatory variable and the reliability value as the objective variable.
- The data set is data acquired by base sequence analysis.
- The data set is data acquired by digital PCR.
- The data set is data acquired by microarray.
- The major contributor is a mother,
- the secondary contributor is a fetus in the womb of the mother,
- the mixed nucleic acid sample is a circulating cell-free nucleic acid sample taken from the mother, and
- steps A-1, A-2, A-3-1, and A-4-1 are steps A1-1, A1-2, A1-3-1, and A1-4-1, respectively.
- [Step A1-1] A step of preparing a data set obtained by measuring a circulating cell-free nucleic acid sample containing a major nucleic acid containing genetic information about the mother and a secondary nucleic acid containing genetic information about the fetus, the data set containing signals indicating the presence of each allele at a plurality of polymorphic loci in the major nucleic acid and the secondary nucleic acid.
- [Step A1-2] A step of generating one or more synthetic variables for those polymorphic loci, among the plurality of polymorphic loci in the data set, that are homozygous in the mother and homozygous in the father and at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately.
- [Step A1-3-1] A step of dividing the synthetic variables generated in step A1-2 into a plurality of categories and assigning, as the probability corresponding to the synthetic variables included in each category, the proportion of true signals among the secondary component signals corresponding to those synthetic variables. (However, for alleles that are homozygous in the mother, homozygous in the father, and different between the mother and the father, the secondary component signal is regarded as true if it is detected separately from the major component signal, and as false if it is not.)
- [Step A1-4-1] A step of performing regression analysis on the synthetic variables included in each category and the probabilities corresponding to them, thereby obtaining a model function for calculating the reliability value, with the synthetic variable as the explanatory variable and the reliability value as the objective variable.
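The locus selection and true/false labelling described for the mother and fetus case can be sketched in Python as below. The sketch assumes that genotypes are given as two-element tuples and that the set of alleles with a detected signal is already known for each locus. The function names and locus identifiers are hypothetical.

```python
def expected_fetal_allele(mother, father):
    """Return the paternal allele that the fetus must carry, or None when the
    locus does not meet the conditions (both parents homozygous, with
    different alleles)."""
    mother_homozygous = mother[0] == mother[1]
    father_homozygous = father[0] == father[1]
    if mother_homozygous and father_homozygous and mother[0] != father[0]:
        return father[0]
    return None

def label_training_loci(loci):
    """loci: iterable of (locus_id, mother_genotype, father_genotype,
    set of alleles whose signal was detected).
    Returns {locus_id: bool} for usable loci only: True when the paternal
    allele was detected separately from the maternal one, False otherwise."""
    labels = {}
    for locus_id, mother, father, detected in loci:
        allele = expected_fetal_allele(mother, father)
        if allele is not None:
            labels[locus_id] = allele in detected
    return labels
```

For example, a locus where the mother is A/A and the father is G/G is usable, and it is labelled true only if a G signal is observed alongside the maternal A signal.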
- The major contributor is a healthy person,
- the secondary contributor is a cancer cell, and
- steps A-1, A-2, A-3-1, and A-4-1 are steps A2-1, A2-2, A2-3-1, and A2-4-1, respectively;
- the method according to any one of [1] to [10].
- [Step A2-1] A step of preparing a data set obtained by measurement of a mixed nucleic acid sample in which a plurality of nucleic acid fragments, each containing the base sequence information of a cancer-associated polymorphic locus into which a cancer-related mutation has been introduced, are added to a nucleic acid sample collected from a healthy person and containing a major nucleic acid with genetic information about the healthy person.
- [Step A2-2] A step of generating one or more synthetic variables for those polymorphic loci, among the plurality of polymorphic loci in the data set, at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately.
- [Step A2-3-1] A step of dividing the synthetic variables generated in step A2-2 into a plurality of categories and assigning, as the probability corresponding to the synthetic variables included in each category, the proportion of true signals among the secondary component signals corresponding to those synthetic variables. (Where a nucleic acid fragment containing the base sequence information of the polymorphic locus into which the mutation has been introduced has been added to the mixed nucleic acid sample, the secondary component signal is regarded as true.)
- [Step A2-4-1] A step of performing regression analysis on the synthetic variables included in each category and the probabilities corresponding to them, thereby obtaining a model function for calculating the reliability value, with the synthetic variable as the explanatory variable and the reliability value as the objective variable.
- [Step A2'-1] A step of preparing a data set obtained by measurement of a mixed nucleic acid sample in which a nucleic acid fragment containing the base sequence information of a single cancer-associated polymorphic locus, into which a cancer-related mutation has been introduced, is added to a nucleic acid sample containing a major nucleic acid with genetic information about a healthy person.
- [Step A2'-2] A step of generating one or more synthetic variables, from the data contained in the data set, for the single polymorphic locus at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately, using a numerical group including at least the following (A1') and (A2').
- (A1') The secondary component signal intensity, which indicates the presence of an allele at the single polymorphic locus derived from the secondary nucleic acid.
- (A2') The secondary component mixing ratio, which is the ratio of the secondary component signal intensity to the total signal intensity caused by the alleles at the single polymorphic locus.
- Where the nucleic acid fragment containing the base sequence information of the polymorphic locus into which the mutation has been introduced is added to the mixed nucleic acid sample: if a secondary component signal is detected for the nucleic acid fragment, the secondary component signal is regarded as true; if no secondary component signal is detected for the nucleic acid fragment, the secondary component signal is regarded as false.
- Where the nucleic acid fragment containing the base sequence information of the polymorphic locus into which the mutation has been introduced is not added to the mixed nucleic acid sample: if a secondary component signal is detected for the nucleic acid fragment, the secondary component signal is regarded as false; if no secondary component signal is detected for the nucleic acid fragment, the secondary component signal is regarded as true.
- [Step A2-4-1] A step of performing regression analysis on the synthetic variables included in each category and the probabilities corresponding to them, thereby obtaining a model function for calculating the reliability value, with the synthetic variable as the explanatory variable and the reliability value as the objective variable.
- The major contributor is the recipient of the organ transplant,
- the secondary contributor is the transplanted organ, and
- steps A-1, A-2, A-3-1, and A-4-1 are steps A3-1, A3-2, A3-3-1, and A3-4-1, respectively; the method according to any one of [1] to [10].
- [Step A3-1] A step of preparing a data set obtained by measuring a mixed nucleic acid sample containing a major nucleic acid containing genetic information about the recipient and a secondary nucleic acid containing genetic information about the transplanted organ, the data set containing signals indicating the presence of each allele at a plurality of polymorphic loci in the major nucleic acid and the secondary nucleic acid (provided that the authenticity of each signal is known).
- [Step A3-2] A step of generating one or more synthetic variables for those polymorphic loci, among the plurality of polymorphic loci in the data set, at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately.
- [Step A3-3-1] A step of dividing the synthetic variables generated in step A3-2 into a plurality of categories and assigning, as the probability corresponding to the synthetic variables included in each category, the proportion of true signals among the secondary component signals corresponding to those synthetic variables. (However, for alleles that the recipient does not have and that the donor has as a homozygote or heterozygote, the secondary component signal is regarded as true if it is detected separately from the major component signal, and as false if it is not.)
- [Step A3-4-1] A step of performing regression analysis on the synthetic variables included in each category and the probabilities corresponding to them, thereby obtaining a model function for calculating the reliability value, with the synthetic variable as the explanatory variable and the reliability value as the objective variable.
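The labelling rule for the recipient and donor case works at the level of individual alleles and can be sketched as follows. The sketch assumes that the recipient and donor genotypes at a locus are known as tuples of alleles and that the set of alleles with a detected signal is given. The function names are hypothetical.

```python
def informative_alleles(recipient, donor):
    """Alleles that the donor carries (homozygous or heterozygous) and that
    the recipient does not have."""
    return set(donor) - set(recipient)

def label_signals(recipient, donor, detected):
    """Map each informative allele to True (its signal was detected separately
    from the recipient's signal) or False (it was not detected)."""
    return {allele: allele in detected
            for allele in sorted(informative_alleles(recipient, donor))}
```

A locus where recipient and donor share the same genotype yields an empty result and contributes no training labels.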
- A method of calculating a reliability value using a model function, wherein the model function is a model function obtained by the method according to any one of [1] to [26],
- a model function of any of the following equations 1 to 3, or a model function expressed as the product of two or more model functions selected from the group consisting of the model functions represented by the following equations 1 to 3, and
- the explanatory variables are one or more numerical values selected from the following (B1) and (B2), included in the data set prepared in step B-1 below, and the synthetic variables obtained in step B-2 below.
- [Step B-1] A step of preparing a data set obtained by measurement of a mixed nucleic acid sample that contains a major nucleic acid containing genetic information about a major contributor and contains or may contain a secondary nucleic acid containing genetic information about a secondary contributor, the data set containing signals indicating the presence of each allele at a plurality of polymorphic loci in the major nucleic acid and the secondary nucleic acid.
- [Step B-2] A step of generating one or more synthetic variables by linearly combining a numerical group including at least the following (B1) and (B2), for those polymorphic loci, among the plurality of polymorphic loci in the data set, at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately.
- (B1) The secondary component signal intensity, which indicates the presence of an allele at a specific polymorphic locus derived from the secondary nucleic acid.
- (B2) The secondary component mixing ratio, which is the ratio of the secondary component signal intensity to the total signal intensity caused by the alleles at the specific polymorphic locus.
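As a rough Python illustration of computing (B1) and (B2) at one locus and combining model functions by multiplication, consider the sketch below. It assumes that the total signal intensity is the sum of the major and secondary component signals at the locus. The two logistic functions are hypothetical placeholders with made-up coefficients and are not the patent's equations 1 to 3, which are not reproduced in this text.

```python
import math

def secondary_features(major_signal, secondary_signal):
    """Return (B1, B2): the secondary component signal intensity and the
    secondary component mixing ratio at one polymorphic locus.
    Assumes total intensity = major + secondary."""
    total = major_signal + secondary_signal
    if total <= 0:
        raise ValueError("total signal intensity must be positive")
    return secondary_signal, secondary_signal / total

def combined_reliability(model_functions, explanatory_values):
    """Multiply the outputs of several model functions, one per explanatory
    variable, to obtain a single reliability value."""
    reliability = 1.0
    for model, value in zip(model_functions, explanatory_values):
        reliability *= model(value)
    return reliability

# Hypothetical stand-ins for fitted model functions (coefficients are invented).
def intensity_model(b1):
    return 1.0 / (1.0 + math.exp(-(b1 - 20.0) / 5.0))

def ratio_model(b2):
    return 1.0 / (1.0 + math.exp(-(b2 - 0.01) / 0.005))
```

With sequencing data, the signal intensities could for instance be read counts per allele, so that 20 secondary reads against 980 major reads give a mixing ratio of 0.02.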
- The major contributor is the mother,
- the secondary contributor is the fetus in the womb of the mother,
- the mixed nucleic acid sample is a circulating cell-free nucleic acid sample taken from the mother, and steps B-1 and B-2 are steps B1-1 and B1-2, respectively.
- [Step B1-1] A step of preparing a data set obtained by measuring a circulating cell-free nucleic acid sample containing a major nucleic acid containing genetic information about the mother and a secondary nucleic acid containing genetic information about the fetus, the data set containing signals indicating the presence of each allele at a plurality of polymorphic loci in the major nucleic acid and the secondary nucleic acid.
- [Step B1-2] A step concerning those polymorphic loci, among the plurality of polymorphic loci in the data set, that are homozygous in the mother and at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately.
- The plurality of polymorphic loci are polymorphic loci used in human individual identification.
- The method of [28], which is a method of calculating a reliability value for non-invasive prenatal paternity testing.
- The major contributor is a test subject,
- the secondary contributor is a cancer cell,
- the mixed nucleic acid sample is a circulating cell-free nucleic acid sample collected from the test subject, and
- the method according to [27], wherein steps B-1 and B-2 are steps B2-1 and B2-2, respectively.
- [Step B2-1] A step of preparing a data set obtained by measurement of a circulating cell-free nucleic acid sample that contains a major nucleic acid containing genetic information about the test subject and may contain a secondary nucleic acid containing genetic information about cancer cells, the data set containing signals indicating the presence of each allele at a plurality of cancer-associated polymorphic loci in the major nucleic acid and the secondary nucleic acid.
- [Step B2-2] A step concerning those polymorphic loci, among the plurality of polymorphic loci in the data set, at which a signal indicating the presence of a normal allele and a signal indicating the presence of a mutant allele are detected separately.
- Data on polymorphic loci at which the test subject has the mutant allele as a homozygote or heterozygote are excluded, and, among the data remaining in the data set after the exclusion, the polymorphic loci at which a signal indicating the presence of a normal allele and a signal indicating the presence of a mutant allele are detected separately are used.
- For those polymorphic loci, a numerical group including at least (B1) and (B2) above is linearly combined to generate one or more synthetic variables.
- the major contributor is a recipient of an organ transplant
- the secondary contributor is a transplanted organ
- the mixed nucleic acid sample is a circulating cell-free nucleic acid sample taken from the recipient
- the method according to [27], wherein step B-1 and step B-2 are step B 3-1 and step B 3-2, respectively.
- Step B 3-1 A data set obtained by measurement of a circulating cell-free nucleic acid sample, which comprises a major nucleic acid carrying genetic information about a recipient and may contain a secondary nucleic acid carrying genetic information about a transplanted organ, in the major nucleic acid and the secondary nucleic acid.
- Step B 3-2 The step of preparing a data set containing signals indicating the presence of each allele at a plurality of polymorphic loci.
- Step B 3-2 Among the data contained in the data set, among the plurality of polymorphic loci, a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are included.
- the plurality of polymorphic loci are polymorphic loci used in human individual identification.
- the method according to [32] which is a method for calculating a reliability value for monitoring the engraftment of a transplanted organ.
- a method for setting exclusion conditions which comprises steps C-2-1, step C-3-1, and step C-4-1.
- [Step C-1-1] A step of preparing a data set obtained by measuring a mixed nucleic acid sample containing a major nucleic acid carrying genetic information about a major contributor and a secondary nucleic acid carrying genetic information about a secondary contributor, the data set containing a signal indicating the presence of each allele at a plurality of polymorphic loci in the major nucleic acid and the secondary nucleic acid (provided that the truth or falsity of each signal is known).
- the major contributor is the mother
- the secondary contributor is the fetus in the womb of the mother
- the mixed nucleic acid sample is a circulating cell-free nucleic acid sample taken from the mother; or
- the major contributor is the recipient
- the secondary contributor is the transplanted organ
- the mixed nucleic acid sample is a circulating cell-free nucleic acid sample taken from the recipient.
- (C1) A secondary component signal intensity indicating the presence of an allele at a specific polymorphic locus derived from the secondary nucleic acid.
- (C2) The secondary component mixing ratio, which is the ratio of the secondary component signal intensity to the total signal intensity arising from the alleles at the specific polymorphic locus.
- [Step C-3-1] A step of setting a threshold value for the value of the synthetic variable so as to exclude some or all of the outliers of the synthetic variable obtained by the linear combination in step C-2-1.
- [Step C-4-1] A step of setting, as the following exclusion condition C1, the condition for data to be excluded from the data set that is input to the model function for calculating the reliability value.
- (Exclusion condition C1) From a data set obtained by analysis of a mixed nucleic acid sample containing a major nucleic acid carrying genetic information about the mother or recipient and a secondary nucleic acid carrying genetic information about the fetus or transplanted organ,
- for polymorphic loci that are homozygous in the mother and homozygous in the alleged father with alleles that differ between the mother and the alleged father, or that are homozygous in the recipient and homozygous in the donor of the transplanted organ with alleles that differ between the recipient and the donor, the data for which the synthetic variable having the highest contribution rate, among the synthetic variables obtained by linearly combining a numerical group containing at least the above (C1), the above (C2) and the above (C3), is less than the threshold value set in step C-3-1 are removed.
- a method for setting exclusion conditions which comprises steps C-2-2, step C-3-2, and step C-4-2.
- [Step C-1-2] A data set obtained by measurement of a mixed nucleic acid sample, comprising a major nucleic acid carrying genetic information about a major contributor and a secondary nucleic acid carrying genetic information about a secondary contributor, said major nucleic acid and said secondary.
- the mixed nucleic acid sample is a circulating cell-free nucleic acid sample taken from the mother.
- the major contributor is the recipient
- the secondary contributor is the transplanted organ
- the mixed nucleic acid sample is a circulating cell-free nucleic acid sample taken from the recipient.
- (C1) A secondary component signal intensity indicating the presence of an allele at a specific polymorphic locus derived from the secondary nucleic acid.
- (C2) The secondary component mixing ratio, which is the ratio of the secondary component signal intensity to the total signal intensity arising from the alleles at the specific polymorphic locus.
- (C3) Noise obtained by subtracting the main component signal intensity and the secondary component signal intensity from the total signal intensity arising from the alleles at the specific polymorphic locus.
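The three quantities (C1) to (C3) follow directly from the per-locus signal intensities. The following is a minimal sketch, assuming the signals are read counts; the class name `LocusSignals`, the function name `explanatory_values`, and the example numbers are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class LocusSignals:
    """Signal intensities (e.g. read counts) observed at one polymorphic locus."""
    total: float      # total signal intensity from all alleles at the locus
    major: float      # main component signal intensity (major nucleic acid)
    secondary: float  # secondary component signal intensity (secondary nucleic acid)


def explanatory_values(locus: LocusSignals) -> dict:
    """Return (C1), (C2) and (C3) for one locus, following the definitions above."""
    if locus.total <= 0:
        raise ValueError("total signal intensity must be positive")
    c1 = locus.secondary                               # (C1) secondary component signal intensity
    c2 = locus.secondary / locus.total                 # (C2) secondary component mixing ratio
    c3 = locus.total - locus.major - locus.secondary   # (C3) noise
    return {"C1": c1, "C2": c2, "C3": c3}


# Hypothetical example: 1000 reads in total, 950 major, 40 secondary, 10 unexplained.
values = explanatory_values(LocusSignals(total=1000, major=950, secondary=40))
```

In this example the residual 10 reads that belong to neither allele are treated as noise, which is how (C3) is defined.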
- [Step C-3-2] A step of setting a threshold value for the value of the synthetic variable so as to exclude some or all of the outliers of the synthetic variable obtained by the linear combination in step C-2-2.
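One possible way to set such a threshold can be sketched as follows. This is an illustration only: it assumes that the outliers lie on the low side of the synthetic variable (consistent with the exclusion conditions, which remove data below the threshold) and that each data point has already been flagged as an outlier or not. The function name `set_threshold` and the example values are hypothetical.

```python
def set_threshold(values, is_outlier):
    """Pick the smallest observed value that lies above every flagged outlier.

    Data whose synthetic variable is less than the returned threshold
    would then be excluded.
    """
    outlier_values = [v for v, flagged in zip(values, is_outlier) if flagged]
    if not outlier_values:
        return float("-inf")  # nothing to exclude
    highest_outlier = max(outlier_values)
    above = [v for v in values if v > highest_outlier]
    if not above:
        raise ValueError("no data point lies above the outliers")
    return min(above)


# Hypothetical synthetic-variable values; the first two are flagged as outliers.
synthetic = [-2.1, -1.8, -0.3, 0.1, 0.4, 1.2]
flags = [True, True, False, False, False, False]

threshold = set_threshold(synthetic, flags)
kept = [v for v in synthetic if v >= threshold]  # values below the threshold are removed
```

A stricter or looser threshold could equally be chosen, since the step only requires that some or all of the outliers be excluded.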
- [Step C-4-2] A step of setting, as the following exclusion condition C2, the condition for data to be excluded from the data set that is input to the model function for calculating the reliability value.
- (Exclusion condition C2) From a data set obtained by analysis of a mixed nucleic acid sample containing a major nucleic acid carrying genetic information about the mother or recipient and a secondary nucleic acid carrying genetic information about the fetus or transplanted organ, for polymorphic loci that are homozygous in the mother and homozygous in the alleged father with the same allele in the mother and the alleged father, or that are homozygous in the recipient and homozygous in the donor of the transplanted organ with the same allele in the recipient and the donor, the data for which the synthetic variable having the first or second highest contribution rate, among the synthetic variables obtained by linearly combining a numerical group containing at least the above (C1), the above (C2) and the above (C3), is less than the threshold value set in step C-3-2 are removed.
- the outliers are obtained when the reliability value is calculated by the method according to any one of [27] to [33].
- the method according to any one of [34] to [37] which is characterized in that it is a numerical value relating to the allele in the case where the nucleic acid is lost.
- step B-1 It is characterized in that the data set remaining after removing the data set corresponding to the exclusion condition C1 specified by the method described in [34] and / or the exclusion condition C2 specified by the method described in [35] is prepared. , [32] or [33].
- the model function is a model function obtained by the method according to any one of [1] to [26].
- the model function is the model function of any one of the following Equations 1 to 3, or a model function expressed as the product of two or more model functions selected from the group consisting of the model functions represented by the following Equations 1 to 3, and the explanatory variable is one or more selected from the following (B1) and (B2) included in the data set prepared in the following step B 4-1 and the synthetic variables obtained in the following step B 4-2.
- a method for calculating a reliability value which is characterized by being a numerical value.
- Step B 4-1 A data set obtained by measurement of a circulating cell-free nucleic acid sample taken from the mother, the sample comprising a major nucleic acid carrying genetic information about the mother and a secondary nucleic acid carrying genetic information about the fetus in the womb of the mother.
- Step B 4-2 From the data contained in the data set, data on polymorphic loci at which the mother carries the mutant allele in the heterozygous state, among the plurality of polymorphic loci, are excluded.
- a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are shown at the plurality of polymorphic loci.
- (B1) Secondary component signal intensity indicating the presence of an allele at a specific polymorphic locus derived from the secondary nucleic acid.
- (B2) The secondary component mixing ratio, which is the ratio of the secondary component signal intensity to the total signal intensity arising from the alleles at the specific polymorphic locus.
- the model function of any one of the following Equations 1 to 3, or a model function expressed as the product of two or more model functions selected from the group consisting of the model functions represented by the following Equations 1 to 3, is obtained.
- a reliability value calculation system including a recorded storage unit and a processing unit that executes the method according to any one of [27] to [33] and [40] to [42].
- a model function of the present invention for calculating the reliability value of a secondary component signal in the analysis data of a mixed nucleic acid sample containing a secondary nucleic acid such as cffDNA, ctDNA, ddcfDNA in a minute proportion.
- the reliability value of the secondary component signal in the analysis data of a mixed nucleic acid sample containing a secondary nucleic acid such as cffDNA, ctDNA, or ddcfDNA in a minute proportion can be calculated.
- the exclusion condition setting method of the present invention in order to narrow down the data of the explanatory variables to be input to the model function, it is possible to set the exclusion condition for determining what should be excluded from the data set.
- a sigmoid curve showing the model function f1 (x1) is shown.
- the "probability” on the vertical axis is the reliability value
- the "main component 1" on the horizontal axis is the first principal component obtained by principal component analysis.
- the white data points in the figure indicate the reliability value and the first principal component used in the regression analysis.
- a sigmoid curve showing the model function f2 (x2) is shown.
- the "probability” on the vertical axis is the reliability value
- the "fetal minor count” on the horizontal axis is the absolute value of the secondary component signal intensity.
- the white data points in the figure indicate the reliability value and the absolute value of the secondary component signal intensity used in the regression analysis.
- a sigmoid curve showing the model function f3 (x3) is shown.
- the vertical axis "probability” is the reliability value
- the horizontal axis "fetal minor frequency" is the secondary component mixing rate.
- the white data points in the figure indicate the reliability value and the secondary component mixing rate used in the regression analysis.
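The figures above describe three sigmoid-shaped model functions, f1(x1), f2(x2) and f3(x3), and the claims allow a model function formed by multiplying two or more of them. The sketch below illustrates that structure only. Equations 1 to 3 are not reproduced in this passage, so a two-parameter logistic function is assumed here, and every coefficient in `PARAMS` is a made-up placeholder rather than a fitted value.

```python
import math


def sigmoid(x, slope, intercept):
    """Two-parameter logistic curve (an assumed form, not the specification's equation)."""
    return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))


# Placeholder (slope, intercept) pairs; real values would come from regression analysis.
PARAMS = {"f1": (1.5, 0.0), "f2": (0.2, -2.0), "f3": (120.0, -2.5)}


def reliability(x1, x2, x3, params=PARAMS):
    """Combine the three model functions by multiplication.

    x1: first principal component
    x2: secondary component signal intensity
    x3: secondary component mixing rate
    """
    f1 = sigmoid(x1, *params["f1"])
    f2 = sigmoid(x2, *params["f2"])
    f3 = sigmoid(x3, *params["f3"])
    return f1 * f2 * f3


value = reliability(x1=1.0, x2=30.0, x3=0.04)
```

Because each factor lies between 0 and 1, the product also lies between 0 and 1 and is small whenever any single factor is small.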
- It is a distribution map of the reliability value (Fidelity) calculated in Test Example 2.
- the left panel tabulates the reliability values for SNPs at which the two parents are homozygous for different alleles.
- the right panel tabulates the reliability values for SNPs at which the two parents are homozygous for the same allele.
- It is a scatter diagram in which each principal component obtained by the principal component analysis prepared for the examination of exclusion condition 1 is plotted on the y-axis and the reliability value on the x-axis.
- the scatter plots show the first principal component, the second principal component, the third principal component, the fourth principal component, and the fifth principal component on the y-axis.
- It is a scatter diagram in which each principal component obtained by the principal component analysis prepared for the examination of exclusion condition 2 is plotted on the y-axis and the reliability value on the x-axis.
- from the left, the scatter plots show the first principal component, the second principal component, the third principal component, the fourth principal component, and the fifth principal component on the y-axis.
- It is a distribution map of the reliability value (Fidelity) calculated in Test Example 4.
- the left panel tabulates the reliability values for SNPs at which the two parents are homozygous for different alleles.
- the right panel tabulates the reliability values for SNPs at which the two parents are homozygous for the same allele.
- It is a distribution map of the reliability value (Fidelity) calculated in Test Example 5.
- the left panel tabulates the reliability values for SNPs at which the two parents are homozygous for different alleles.
- the right panel shows the ratio of the reliability values calculated in Test Example 2 and Test Example 5, which are analyses using different NGS target panels.
- It is a graph tabulating the reliability value (Fidelity) for the SNP genotypes confirmed by analysis of the child born in Test Example 6.
- the distribution of reliability values (Fidelity) for SNPs that are homozygous in the mother was tabulated by count, without considering the genotype of the father, which indicates whether the secondary component signal is true.
- the left panel tabulates the reliability values for SNPs at which the two parents are homozygous for different alleles (the correct fetal genotype is heterozygous).
- the right panel shows the reliability values for SNPs at which the two parents are homozygous for the same allele.
- 6 is a distribution diagram of reliability values calculated in Test Example 6 and Test Example 9.
- the left is a compilation of reliability values for SNPs that the mother has in homozygotes and that the newborn has in heterozygotes.
- the right is the reliability value for SNPs that the mother has by homozygosity and the newborn has by homozygosity.
- the method for creating a model function of the present invention includes step A-1, step A-2, step A-3-1 and step A-4-1 as essential steps. Hereinafter, they will be described in order.
- Step A-1 is a step of preparing a data set obtained by measuring a mixed nucleic acid sample.
- a "mixed nucleic acid sample” is a sample containing genetic information about a plurality of contributors. This information includes genetic information encoded by DNA as well as genetic information encoded by RNA. Examples of the mixed nucleic acid sample include samples containing cfDNA and cfRNA, and specific examples thereof include whole blood, plasma, serum and urine, and more preferably whole blood, plasma and serum.
- the mixed nucleic acid sample contains a major nucleic acid containing genetic information on the major contributor and a secondary nucleic acid containing genetic information on the secondary contributor.
- the abundance ratio of the major nucleic acid and the secondary nucleic acid in the mixed nucleic acid sample may vary depending on the status of the major contributor and the sub-contributor.
- the "major contributor” as used herein is the mother in the case of prenatal genetic testing, the subject to be tested in the case of cancer testing, and the recipient in the monitoring of transplanted organs.
- the “major contributor” refers to an individual from which a mixed nucleic acid sample has been obtained.
- the “major nucleic acid” is a nucleic acid containing genetic information regarding the major contributor.
- the major nucleic acid is, in the case of prenatal genetic testing, maternal genomic DNA or a fragment thereof, or RNA that is a transcript of maternal genomic DNA (cfDNA or cfRNA derived from the mother); in the case of cancer testing, genomic DNA of the test subject or a fragment thereof, or RNA that is a transcript of the genomic DNA of the test subject (cfDNA or cfRNA derived from the test subject); and in the monitoring of a transplanted organ, genomic DNA of the recipient or a fragment thereof, or RNA that is a transcript of the genomic DNA of the recipient (cfDNA or cfRNA derived from the recipient).
- the "secondary contributor” corresponds to the fetus in the case of prenatal genetic testing, cancer cells in the case of cancer testing, and the transplanted organ in the monitoring of transplanted organs.
- the “secondary contributor” refers to an individual, tissue, or cell that exists in the body of the main contributor and has genetic information different from the original genetic information of the main contributor.
- the "secondary nucleic acid" is a nucleic acid containing genetic information regarding the secondary contributor. The secondary nucleic acid is, in the case of prenatal genetic testing, fetal genomic DNA or a fragment thereof, or RNA that is a transcript of fetal genomic DNA (cfDNA or cfRNA derived from the fetus); in the case of cancer testing, genomic DNA of cancer cells or a fragment thereof, or RNA that is a transcript of the genomic DNA of cancer cells; and in the monitoring of a transplanted organ, genomic DNA of the transplanted organ or a fragment thereof, or RNA that is a transcript of the genomic DNA of the donor (cfDNA or cfRNA derived from the transplanted organ).
- the mixed nucleic acid sample containing the main nucleic acid and the secondary nucleic acid may be artificial.
- a mixed nucleic acid sample may be prepared by spiking (adding) a nucleic acid imitating a secondary nucleic acid into blood containing a major nucleic acid.
- the data set prepared in step A-1 is a data set containing a signal indicating the presence of each allele at a plurality of polymorphic loci in the major nucleic acid and the secondary nucleic acid.
- the number of polymorphic loci included in the data set is not particularly limited, and is preferably 5 or more, more preferably 10 or more, still more preferably 15 or more, and even more preferably 18 or more.
- This data set is not particularly limited as long as it is obtained by an analytical means capable of distinguishing and detecting each allele at the polymorphic loci.
- the analytical means include analytical means capable of distinguishing and detecting single nucleotide substitutions (SNPs) in polymorphic loci.
- the analysis means include base sequence analysis used for detecting SNPs, digital PCR, microarray, real-time PCR, and the like.
- A next-generation sequencer (NGS) can be mentioned as a specific means for base sequence analysis.
- Next-generation sequencing is a sequencing approach that enables massively parallel sequencing of clonally amplified molecules and of single nucleic acid molecules.
- any NGS system may be adopted.
- pyrosequencing (GS Junior (Roche), etc.)
- sequencing by synthesis using a reversible dye terminator (MiSeq (Illumina), etc.)
- sequencing by ligation (SeqStudio Genetic Analyzer (Thermo), etc.)
- ion semiconductor sequencing (Ion Proton System (Thermo Fisher Scientific), etc.)
- CMOS (complementary metal-oxide-semiconductor) chip
- the sequence data read by the next-generation sequencer can be analyzed, and the number of reads of the allele having a specific sequence (specific SNPs) in the polymorphic locus can be interpreted as a signal indicating the existence of the allele.
- When a barcode sequence (unique molecular identifier (UMI) or unique molecular tag (UMT)) that enables individual nucleic acid molecules to be identified is ligated to the nucleic acid fragments to be analyzed, the number of UMTs counted for molecules identified as the allele having a specific sequence (specific SNP) at the polymorphic locus can be interpreted as a signal indicating the presence of that allele.
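Counting barcodes rather than raw reads can be sketched as follows. This is a simplified illustration that assumes each read has already been assigned a barcode and an allele call at one locus and that barcodes are error-free; the function name `umi_counts` and the example reads are hypothetical.

```python
from collections import defaultdict


def umi_counts(reads):
    """Count distinct barcodes per allele.

    reads: iterable of (barcode, allele) pairs for one polymorphic locus.
    Reads sharing a barcode are treated as copies of one original molecule.
    """
    barcodes_by_allele = defaultdict(set)
    for barcode, allele in reads:
        barcodes_by_allele[allele].add(barcode)
    return {allele: len(barcodes) for allele, barcodes in barcodes_by_allele.items()}


# Hypothetical reads: six reads, but only four original molecules.
reads = [
    ("AACG", "A"), ("AACG", "A"),  # PCR duplicates of one molecule
    ("TTGC", "A"),
    ("CCTA", "A"),
    ("GGAT", "G"), ("GGAT", "G"),  # PCR duplicates of one molecule
]
counts = umi_counts(reads)
```

Here the read count for allele A would be 4, whereas the barcode count is 3, which removes the contribution of amplification duplicates from the signal.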
- Digital PCR is a method in which a sample is distributed over a large number of wells so that each well contains either one nucleic acid molecule or none, and PCR is performed in each well individually. In wells containing the target sequence, PCR amplification proceeds and a fluorescence signal is detected, whereas in wells containing no target sequence, PCR amplification does not proceed and no fluorescence signal is detected. After PCR, each well is classified as signal-positive (+) or signal-negative (−), and the number of positive (+) wells is taken as the number of copies of the target.
- When a probe that can accurately discriminate mutations such as SNPs (a TaqMan® probe, a cycling probe, or the like) is used, fluorescence is observed only in wells in which an allele having a specific sequence (specific SNP) is amplified.
- By designing a fluorescently labeled probe having a different emission wavelength for each allele, different alleles present at one polymorphic locus can be distinguished by fluorescence color.
- the number of positive (+) wells for the fluorescent signal corresponding to a particular allele can be interpreted as a signal indicating the presence of that allele.
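The well-counting logic described for digital PCR can be sketched as follows. The channel names, the fluorescence readings, and the single fixed cutoff are all hypothetical; real instruments apply their own positive/negative calling.

```python
def positive_well_counts(wells, cutoff):
    """Count signal-positive wells per fluorescence channel.

    wells: list of dicts mapping a channel name (one channel per allele-specific
    probe) to the end-point fluorescence measured in that well.
    """
    counts = {}
    for well in wells:
        for channel, fluorescence in well.items():
            counts.setdefault(channel, 0)
            if fluorescence >= cutoff:
                counts[channel] += 1
    return counts


# Hypothetical end-point readings for four wells and two allele-specific probes.
wells = [
    {"allele_1": 5200, "allele_2": 150},
    {"allele_1": 4800, "allele_2": 4900},
    {"allele_1": 120, "allele_2": 90},
    {"allele_1": 5100, "allele_2": 110},
]
counts = positive_well_counts(wells, cutoff=1000)
```

The resulting per-channel counts play the role of the allele signals: three positive wells for the first allele and one for the second.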
- A microarray uses nucleic acids with known sequences, such as DNA, DNA fragments, cDNA, oligonucleotides, RNA or RNA fragments, as probes; hundreds to hundreds of thousands of these probes are arrayed and immobilized on a solid support, and hybridization of nucleic acids having sequences complementary to the probes is detected by means of a fluorescent label. Microarrays that perform SNP typing are specifically referred to as SNP arrays. When multiple alleles are expected at one locus, each allele can be distinguished and detected by immobilizing a probe for each allele separately. The fluorescence intensity at the spot where the probe for a particular allele is immobilized can be interpreted as a signal indicating the presence of that allele.
- Real-time PCR is a method of monitoring and analyzing, in real time with a spectrofluorometer, the fluorescence generated in proportion to the amount of nucleic acid amplified by PCR. It is preferable to combine real-time PCR with a probe (a TaqMan® probe, a cycling probe, or the like) capable of accurately discriminating mutations such as SNPs. By designing a fluorescently labeled probe having a different emission wavelength for each allele, different alleles present at one polymorphic locus can be distinguished by fluorescence color. When obtaining a data set by real-time PCR, it is preferable to adopt multiplex PCR from the viewpoint of improving measurement efficiency.
- Multiplex PCR is a method of amplifying a plurality of target sequences at one time in one reaction system using a plurality of sets of primers.
- the intensity of the fluorescent signal corresponding to a particular allele can be interpreted as a signal indicating the presence of that allele.
- Mass spectrometry is an analytical method that measures the mass of an ion or molecule by ionizing the molecule and measuring its mass-to-charge ratio (m/z). It is originally a method of measuring the mass of a molecule, but if the mass of nucleic acid molecules prepared under specific conditions (such as when PCR is performed using a specific primer or when a nucleic acid molecule is cleaved with a specific restriction enzyme) can be measured, the base sequence of the detected nucleic acid molecule can be identified by collating the mass against a database. For this reason, mass spectrometry is widely applied to genotyping. In mass spectrometry, the ion intensity at the m/z specific to a base sequence containing a specific allele can be interpreted as a signal indicating the presence of that allele.
- For the data set prepared in step A-1, the truth or falsity of each signal indicating the presence of the above-mentioned allele must be known. That is, when a signal indicating the presence of a specific allele is detected, it must be known whether or not a major nucleic acid or secondary nucleic acid containing the base sequence of that allele is actually contained in the mixed nucleic acid sample.
- Step A-1 is a step of preparing a data set. Therefore, the step of nucleic acid analysis for first-hand acquisition of a data set is not an essential element of the present invention.
- The present invention naturally includes an embodiment in which the practitioner of the present invention prepares the above data set by acquiring the data first-hand through nucleic acid analysis, but is not limited thereto.
- It also includes an embodiment in which the practitioner prepares the above data set by obtaining, second-hand, a data set that was originally acquired through nucleic acid analysis by a person other than the practitioner of the present invention.
- Step A-2 is a step of performing principal component analysis on the data contained in the above-mentioned data set. Specifically, for those polymorphic loci, among the plurality of polymorphic loci in the data set, at which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately, a numerical group containing the following (A1) and (A2) is linearly combined to generate one or more synthetic variables.
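The linear combination performed in step A-2 can be illustrated for the simplest case of two variables. The sketch below assumes that (A1) is the secondary component signal intensity and (A2) is the secondary component mixing rate (by analogy with (B1)/(B2) and (C1)/(C2)), standardizes both, and uses the closed-form eigen-decomposition of a 2×2 correlation matrix. The function names and the example data are hypothetical; with more variables a numerical library would normally be used instead.

```python
import math
from statistics import mean, pstdev


def standardize(xs):
    """Rescale to zero mean and unit (population) standard deviation."""
    m, s = mean(xs), pstdev(xs)
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(x - m) / s for x in xs]


def first_principal_component(a1, a2):
    """Return (weights, scores, contribution rate) of the first principal component."""
    z1, z2 = standardize(a1), standardize(a2)
    r = sum(p * q for p, q in zip(z1, z2)) / len(z1)  # correlation coefficient
    # For the matrix [[1, r], [r, 1]] the leading eigenvector is (1, ±1)/sqrt(2)
    # with eigenvalue 1 + |r|; the total variance is 2.
    sign = 1.0 if r >= 0 else -1.0
    weights = (1 / math.sqrt(2), sign / math.sqrt(2))
    scores = [weights[0] * p + weights[1] * q for p, q in zip(z1, z2)]
    contribution = (1 + abs(r)) / 2
    return weights, scores, contribution


# Hypothetical per-locus values of (A1) and (A2).
a1 = [12, 30, 45, 8, 60]
a2 = [0.010, 0.028, 0.041, 0.009, 0.055]
weights, scores, contribution = first_principal_component(a1, a2)
```

Each entry of `scores` is the synthetic variable for one locus, and `contribution` is the share of the total variance carried by the first principal component.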
- the secondary component signal intensity is the intensity of the signal indicating the presence of an allele at a specific polymorphic locus derived from the secondary nucleic acid. It is easy to determine whether each of the two allele signals detected separately in the analysis of the mixed nucleic acid sample is derived from the major nucleic acid or from the secondary nucleic acid. In most cases, the circulating cell-free nucleic acid sample contains more major nucleic acid than secondary nucleic acid, so the secondary component signal intensity is necessarily weaker than the main component signal intensity. In such a case, the weaker of the two signal intensities can be regarded as the secondary component signal intensity.
- However, the ratio of the mother-derived nucleic acid to the fetus-derived nucleic acid in the latter half of pregnancy, and the ratio of the patient-derived nucleic acid to the cancer-derived nucleic acid when the cancer is advanced, may be reversed from the normal case. That is, the amount of secondary nucleic acid in the circulating cell-free nucleic acid sample may be equal to or greater than the amount of major nucleic acid.
- the genotype of the major contributor may be identified in advance by genotyping and compared with the analysis result of the mixed nucleic acid sample. This makes it possible to determine whether the signal indicating the presence of the two types of alleles detected separately from each other by the analysis of the mixed nucleic acid sample is derived from the main nucleic acid or the secondary nucleic acid, respectively.
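The two ways of assigning the signals described above (by relative strength, or by the pre-determined genotype of the major contributor) can be sketched as follows. The sketch assumes exactly two alleles are detected at the locus and, when a genotype is supplied, that the major contributor is homozygous there; the function name `split_components` and the numbers are hypothetical.

```python
def split_components(allele_signals, major_genotype=None):
    """Label the two allele signals at one locus as major or secondary.

    allele_signals: dict mapping allele -> signal intensity (two alleles).
    major_genotype: optional pair of alleles of the major contributor,
    determined beforehand by genotyping (assumed homozygous).
    """
    if len(allele_signals) != 2:
        raise ValueError("expected signals for exactly two alleles")
    if major_genotype is not None:
        major_alleles = set(major_genotype)
        if len(major_alleles) != 1:
            raise ValueError("major contributor must be homozygous at this locus")
        major_allele = major_alleles.pop()
        if major_allele not in allele_signals:
            raise ValueError("genotyped allele was not detected in the mixture")
    else:
        # Without a genotype, fall back on the usual case: the stronger signal is major.
        major_allele = max(allele_signals, key=allele_signals.get)
    (secondary_allele,) = [a for a in allele_signals if a != major_allele]
    return {
        "major": (major_allele, allele_signals[major_allele]),
        "secondary": (secondary_allele, allele_signals[secondary_allele]),
    }


usual = split_components({"A": 9600, "G": 400})
# Reversed abundance: the genotype, not the strength, decides the assignment.
reversed_case = split_components({"A": 300, "G": 700}, major_genotype=("A", "A"))
```

In the second call the weaker signal is still assigned to the major nucleic acid, which is the situation that strength-based assignment alone would get wrong.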
- the term "secondary component signal intensity" covers all numerical values reflecting the intensity of the signal indicating the presence of an allele at a specific polymorphic locus derived from the secondary nucleic acid.
- that is, in addition to the numerical value that directly expresses the signal intensity,
- a numerical value obtained by multiplying that value by a constant, a power or root of that value, and any other numerical value that reflects the signal intensity are all "secondary component signal intensity".
- the standardized numerical value of the original data of the secondary component signal strength is also included in the wording "secondary component signal strength". Details of standardization will be described later.
- the numerical value obtained by processing the original data of the secondary component signal strength based on the other detected parameters is also included in the wording "secondary component signal strength".
- Noise is mentioned as an "other parameter" used for processing the original data of the secondary component signal strength. The definition of noise is as described below.
- a numerical value obtained by subtracting, from the original data of the secondary component signal intensity, the noise intensity or its average value over the plurality of polymorphic loci to be analyzed can also be treated as the "secondary component signal intensity".
- the denominator used to obtain the average value of the noise intensity may be the number of polymorphic loci at which noise is detected, or the number of all polymorphic loci analyzed.
- the average value of the noise intensity may be subtracted uniformly from the original data of the secondary component signal intensity, without distinguishing between polymorphic loci at which noise is detected and polymorphic loci at which noise is not detected.
- alternatively, the embodiment may be one in which the average value of the noise intensity is subtracted from the original data of the secondary component signal intensity only for the specific polymorphic loci at which noise is detected.
- alternatively, the embodiment may be one in which the noise intensity detected at a specific polymorphic locus is subtracted from the secondary component signal intensity of that same locus.
- a numerical value obtained by dividing the secondary component signal intensity indicating the presence of the allele at the specific polymorphic locus by the average value of the noise intensity over the plurality of polymorphic loci may also be treated as the "secondary component signal intensity". That is, it may be an embodiment that treats the numerical value represented by the following formula as the "secondary component signal intensity". (Secondary component signal intensity) / (Average value of noise intensity)
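The noise-based variants of the "secondary component signal intensity" listed above differ only in which noise value is used and where it is applied. The following sketch puts them side by side; it treats a noise value greater than zero as "noise detected", and the function name `mean_noise` and all numbers are hypothetical.

```python
def mean_noise(noise, over_all_loci=True):
    """Average noise intensity.

    The denominator is either the number of all analyzed loci or only
    the number of loci at which noise was detected.
    """
    detected = [n for n in noise if n > 0]
    denominator = len(noise) if over_all_loci else len(detected)
    return sum(detected) / denominator if denominator else 0.0


# Hypothetical per-locus secondary component signal and noise intensities.
secondary = [40.0, 25.0, 60.0, 10.0]
noise = [4.0, 0.0, 8.0, 0.0]

avg_all = mean_noise(noise)                             # averaged over all four loci
avg_detected = mean_noise(noise, over_all_loci=False)   # averaged over the two noisy loci

# Mean noise subtracted uniformly from every locus.
uniform = [s - avg_all for s in secondary]
# Mean noise subtracted only where noise was detected.
only_noisy = [s - avg_all if n > 0 else s for s, n in zip(secondary, noise)]
# Each locus corrected by its own detected noise.
per_locus = [s - n for s, n in zip(secondary, noise)]
# Signal divided by the mean noise intensity.
ratio = [s / avg_all for s in secondary]
```

Which variant is appropriate depends on the embodiment; the sketch only shows that all of them are simple transformations of the same raw values.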
- step A-2 may include only one kind of "secondary component signal intensity", or may include two or more kinds of "secondary component signal intensity".
- the standardized numerical value of the original data of the secondary component mixing rate is also included in the wording "secondary component mixing rate". Details of standardization will be described later.
- the numerical value obtained by processing the original data of the secondary component mixing rate based on the other detected parameters is also included in the wording "secondary component mixing rate".
- Noise is mentioned as an "other parameter" used for processing the original data of the secondary component mixing rate. The definition of noise is as described below.
- the numerical value obtained by subtracting the ratio of the noise intensity to the total signal intensity (noise mixing rate) or the average value thereof in the plurality of polymorphic sitting positions to be analyzed from the original data of the secondary component mixing rate is also "secondary component". It can be treated as "mixing rate".
- the parameter for obtaining the average value of the noise mixing rate may be the number of polymorphic lotuses in which noise is detected, or the number of all polymorphic lotus coitions analyzed.
- the embodiment may be in which the average value of the noise mixing rate is subtracted from the original data of the secondary component mixing rate. Further, it may be an embodiment in which the noise mixing rate of the noise intensity detected for the specific polymorphic lotus is individually subtracted from the secondary component mixing rate of the specific polymorphic lotus where noise is detected.
- the value obtained by dividing the secondary component mixing rate of the specific polymorphic lotus by the average value of the noise intensities in the plurality of polymorphic sitting positions is treated as the "secondary component mixing rate". That is, it may be an embodiment in which the numerical value represented by the following formula is treated as the “secondary component mixing ratio”. (Secondary component mixing rate) / (Average value of noise intensity)
- The numerical group to be linearly combined in step A-2 may include only one kind of "secondary component mixing rate", or two or more kinds.
- The numerical group to be linearly combined in step A-2 may include numerical values other than (A1) and (A2) above. That is, the linear combination may be performed on a numerical group that includes, in addition to (A1) and (A2), various measured or calculated values relating to the specific polymorphic locus.
- The numerical values (A3) to (A5) that may be included in the numerical group to be linearly combined are described below. The numerical group may include only one of (A3) to (A5), any two or more of them, or all of them.
- The major component signal intensity is the intensity of the signal indicating the presence of one allele of the specific polymorphic locus derived from the major nucleic acid.
- The circulating cell-free nucleic acid sample usually contains more major nucleic acid than secondary nucleic acid, so the main component signal intensity is inevitably stronger than the secondary component signal intensity described above. In such a case, the stronger of the two signals can be regarded as the main component signal strength.
- However, the ratio of mother-derived to fetus-derived nucleic acid in the latter half of pregnancy, or of patient-derived to cancer-derived nucleic acid in advanced cancer, may be reversed from the normal case. That is, the amount of secondary nucleic acid in the circulating cell-free nucleic acid sample may equal or exceed the amount of major nucleic acid.
- To handle such cases, the genotype of the major contributor may be identified in advance by genotyping and compared with the analysis result of the mixed nucleic acid sample. This makes it possible to determine, for each of the two allele signals detected separately in the analysis of the mixed nucleic acid sample, whether it is derived from the major nucleic acid or from the secondary nucleic acid.
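As a rough illustration of this assignment rule, the following Python sketch labels the two allele signals of one locus. The function name `assign_components` and the dictionary layout are assumptions made for the example. When the major contributor's genotype is supplied it decides the assignment; otherwise the stronger signal is taken as the main component.

```python
def assign_components(signals, major_alleles=None):
    """Decide which of two allele signals is the main component.

    signals: dict mapping exactly two allele names to signal strengths.
    major_alleles: optional set of alleles carried by the major contributor,
        known from prior genotyping.
    Returns (main_allele, secondary_allele).
    """
    a, b = signals  # unpack the two allele names
    if major_alleles is not None:
        carried = [x for x in (a, b) if x in major_alleles]
        if len(carried) != 1:
            # The major contributor carries both or neither allele, so the
            # two signals cannot be separated by genotype at this locus.
            raise ValueError("locus cannot be resolved from the major genotype")
        main = carried[0]
        return main, (b if main == a else a)
    # Without genotype information, fall back to the stronger signal.
    return (a, b) if signals[a] >= signals[b] else (b, a)
```

With the genotype given, the assignment stays correct even when the secondary nucleic acid is the more abundant one.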
- A value obtained by multiplying the main component signal strength by a constant, a power or root of it, and any other numerical value that reflects the signal strength are all included in the "main component signal strength".
- The numerical group to be linearly combined in step A-2 may include only one kind of "main component signal strength", or two or more kinds.
- (Main component mixing rate) = (main component signal strength) / (total signal strength)
- This signal is defined as "noise" in the present invention. That is, the noise is obtained by subtracting the main component signal strength and the secondary component signal strength from the total signal strength caused by the alleles of the specific polymorphic locus, and can be expressed by the formula: (noise) = (total signal strength) - [(main component signal strength) + (secondary component signal strength)]
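The per-locus quantities defined above reduce to simple arithmetic. The sketch below assumes a hypothetical helper `locus_metrics` that takes the total, main component, and secondary component signal strengths of one locus. The secondary component mixing rate is taken as secondary strength divided by total strength, and the noise mixing rate as noise divided by total strength.

```python
def locus_metrics(total, main, secondary):
    """Derive mixing rates and noise for a single polymorphic locus."""
    if total <= 0:
        raise ValueError("total signal strength must be positive")
    noise = total - (main + secondary)  # whatever is neither main nor secondary
    return {
        "main_mixing_rate": main / total,
        "secondary_mixing_rate": secondary / total,
        "noise": noise,
        "noise_mixing_rate": noise / total,
    }
```

For example, a locus with 1000 total reads, 940 main reads, and 50 secondary reads has a noise of 10 reads and a noise mixing rate of 0.01.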
- The data set prepared in step A-1 is a set of data relating to a plurality of polymorphic loci. The data set therefore contains a plurality of sets of data, each comprising (A1) and (A2) above and the other numerical data relating to one specific polymorphic locus.
- (Standardized data) = [(original data) - (mean value)] / (sample standard deviation)
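A minimal Python sketch of this standardization follows. The function name `standardize` is an assumption of the example. `statistics.stdev` uses the n - 1 denominator, which matches the sample standard deviation named in the formula.

```python
from statistics import mean, stdev

def standardize(values):
    """Return (value - mean) / sample standard deviation for each value."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("values have zero variance and cannot be standardized")
    return [(v - m) / s for v in values]
```

Each kind of numerical value in the data set (for example the secondary component signal strengths of all loci) would be standardized separately in this way.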
- A "polymorphic locus detected by distinguishing between a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid" is a polymorphic locus in which these two signals are not mixed.
- Regardless of the father's genotype, the signals of allele A and allele B derived from the mother's genomic DNA are always detected in the cfDNA.
- Either the allele A or the allele B signal should contain a signal derived from the fetal cffDNA, but it cannot be distinguished from the signal derived from the mother's genomic DNA. Including such data in the analysis reduces the accuracy of the model function.
- The mutation is always included in the ctDNA, so the signal derived from the test subject is mixed with the signal derived from the cancer cells. Including such data in the analysis reduces the accuracy of the model function.
- The polymorphic loci targeted for data analysis are therefore limited to "polymorphic loci in which a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately".
- In other words, the polymorphic loci to be analyzed in step A-2 are polymorphic loci in which there is no possibility that a signal indicating the presence of an allele derived from the secondary nucleic acid is mixed with a signal indicating the presence of an allele derived from the major nucleic acid.
- One or more synthetic variables are generated by linearly combining the above-mentioned numerical group.
- Principal component analysis is a preferred means of linear combination. Synthetic variables generated by other means may also be used, but even then they are preferably synthetic variables that could be generated by principal component analysis.
- The synthetic variables generated by the linear combination are represented by first-order homogeneous polynomials of the following form.
- Z1 = a11X1 + a12X2 + ... + a1nXn
- Here, n is an integer of 2 or more representing the number of kinds of numerical values, among those included in the data set, that belong to the numerical group to be linearly combined.
- X1 to Xn are the numerical values included in the numerical group to be linearly combined.
- a11 to a1n are coefficients that weight the numerical values being linearly combined.
- the secondary component signal intensity or secondary component mixing rate is maximally weighted.
- the number of synthetic variables that can be generated increases as the number of types of numerical values included in the numerical value group to be linearly combined increases.
- the number of synthetic variables generated in step A-2 is not particularly limited.
- In the above description the synthetic variable is generated by a linear combination of the numerical group including at least (A1) and (A2), but an embodiment in which the synthetic variable is generated by a non-linear combination of the numerical group may also be used.
- Here, a non-linear combination refers to a power of each numerical value, a product or quotient of the numerical values, a function having these numerical values as exponents, and the like.
- the synthetic variable obtained by the linear combination of step A-2 has a correlation with the reliability value.
- a model function is created using this correlation, and the present invention has the following steps A-3-1 and A-4-1 as specific steps thereof.
- Step A-3-1 is a step of assigning a reliability value to the synthetic variable generated by the linear combination.
- The synthetic variable used in step A-3-1 is not particularly limited, but is preferably the synthetic variable that best reflects the numerical group subjected to the linear combination.
- a synthetic variable showing the highest contribution rate to the numerical group targeted for the linear combination can be preferably exemplified. This corresponds to the first principal component in the principal component analysis.
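The first principal component can be sketched without external libraries as shown below. This is an illustrative stand-in, not the procedure used in the specification: it estimates the leading eigenvector of the covariance matrix by power iteration, and the function name, the row layout (one row per locus, one column per kind of numerical value), and the iteration count are all assumptions. The scores are computed on mean-centered columns, so they coincide with the homogeneous form Z1 = a11X1 + ... + a1nXn only when the inputs are already centered, for example after standardization.

```python
from statistics import mean

def first_principal_component(rows, iterations=200):
    """Return (weights, scores) of the first principal component.

    rows: list of equal-length numeric lists, one row per polymorphic locus.
    weights: coefficients a11..a1n (unit length).
    scores: synthetic variable Z1 for each row, computed on centered data.
    """
    n = len(rows[0])
    m = len(rows)
    means = [mean(r[j] for r in rows) for j in range(n)]
    centered = [[r[j] - means[j] for j in range(n)] for r in rows]
    # Sample covariance matrix of the columns.
    cov = [[sum(c[i] * c[j] for c in centered) / (m - 1) for j in range(n)]
           for i in range(n)]
    # Power iteration from a deliberately asymmetric start vector.
    w = [1.0 / (j + 1) for j in range(n)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            raise ValueError("zero variance or degenerate start vector")
        w = [x / norm for x in w]
    scores = [sum(w[j] * c[j] for j in range(n)) for c in centered]
    return w, scores
```

The sign of the returned weights is arbitrary, as with any principal component. A production analysis would more likely rely on an established statistics library.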
- In step A-3-1, the synthetic variables generated by the linear combination are first divided into a plurality of categories. That is, the values of the synthetic variable are divided into a plurality of categories according to their magnitude.
- The classification method is not particularly limited. The categories may be set at equal intervals according to the magnitude of the synthetic variable, but it is preferable to set them so that every category contains generated synthetic variables. More preferably, the categories are spaced exponentially rather than linearly with respect to the magnitude of the synthetic variable. This is because curve regression of the generated synthetic variables against the probabilities yields a sigmoid curve.
- The number of categories is not limited, but is preferably 3 or more, more preferably 5 or more, still more preferably 7 or more, still more preferably 10 or more, still more preferably 12 or more, still more preferably 15 or more, and still more preferably 18 or more.
- Next, the proportion of true secondary component signal intensities among those corresponding to the synthetic variables in each category is obtained. That is, among all the synthetic variables included in each category, the proportion of those corresponding to a true secondary component signal strength is obtained. In the present specification, this proportion is referred to as the "probability".
- The secondary component signal intensity suggests the presence of a specific allele at the polymorphic locus in the secondary nucleic acid. If, as suggested by this secondary component signal intensity, the specific allele is actually present in the secondary nucleic acid, the signal is regarded as "true".
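The categorization and the per-category "probability" can be sketched as follows. The function name, the use of inner boundaries to define the categories, and the boolean truth labels are assumptions of this example. How the boundaries are chosen is left open here, as it is in the text.

```python
from bisect import bisect_right

def bin_probabilities(values, is_true, edges):
    """Count values per category and compute the share labeled true.

    values: synthetic variable for each locus.
    is_true: matching booleans (True if the secondary signal is regarded as true).
    edges: ascending inner boundaries; k boundaries define k + 1 categories.
    Returns a list of (count, probability); probability is None for an empty category.
    """
    counts = [0] * (len(edges) + 1)
    trues = [0] * (len(edges) + 1)
    for v, t in zip(values, is_true):
        i = bisect_right(edges, v)  # index of the category containing v
        counts[i] += 1
        if t:
            trues[i] += 1
    return [(c, (tr / c if c else None)) for c, tr in zip(counts, trues)]
```

An empty category shows up as `None`, which makes it easy to check the stated preference that every category contains at least one value.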
- Step A-4-1: regression analysis is performed on the synthetic variables included in each of the above-mentioned categories and the probabilities corresponding to the synthetic variables included in each category. As a result, a model function for calculating the reliability value is obtained, with the synthetic variable as the explanatory variable and the reliability value as the objective variable.
- The "probability" and the "reliability value" correspond to each other: the parameter used to create the model function is called the "probability", and the parameter calculated by inputting the explanatory variable into the model function is called the "reliability value".
- the method of regression analysis in step A-4-1 is not particularly limited, but the least squares method can be preferably exemplified.
- the model function is a sigmoid function.
- the model function can be expressed by Equation 1 below.
- The model function is not limited to Equation 1 above; a model function for calculating the reliability value may take the form of any sigmoid function having two parameters.
- A1 and x01 are the parameters of Equation 1.
- A1 is preferably 15.4 to 15.6, more preferably 15.5.
- x01 is preferably -0.8 to -0.6, and more preferably -0.7.
- Values that correspond to the above numerical values when rounded at the second decimal place are included in the numerical ranges specified here.
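Equation 1 itself is not reproduced in this text, so the sketch below assumes a common two-parameter logistic form, f(x) = 1 / (1 + exp(-A (x - x0))), purely for illustration; the actual Equation 1 may differ. The fit minimizes the sum of squared residuals (a least-squares criterion) with a simple coarse-to-fine grid search. The function names, search ranges, and step counts are hypothetical choices for this example.

```python
import math

def sigmoid(x, a, x0):
    """Assumed two-parameter logistic form, evaluated without overflow."""
    z = a * (x - x0)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sse(points, a, x0):
    """Sum of squared residuals between probabilities and the sigmoid."""
    return sum((p - sigmoid(x, a, x0)) ** 2 for x, p in points)

def fit_sigmoid(points, a_range=(0.1, 30.0), x0_range=(-5.0, 5.0),
                steps=60, rounds=4):
    """Least-squares fit of (a, x0) by coarse-to-fine grid search.

    points: list of (representative value of a category, probability).
    """
    (a_lo, a_hi), (x_lo, x_hi) = a_range, x0_range
    best = None
    for _ in range(rounds):
        for i in range(steps + 1):
            a = a_lo + (a_hi - a_lo) * i / steps
            for j in range(steps + 1):
                x0 = x_lo + (x_hi - x_lo) * j / steps
                err = sse(points, a, x0)
                if best is None or err < best[0]:
                    best = (err, a, x0)
        # Shrink the search window to one grid step around the current best.
        _, a_b, x_b = best
        da, dx = (a_hi - a_lo) / steps, (x_hi - x_lo) / steps
        a_lo, a_hi = max(a_b - da, 1e-6), a_b + da
        x_lo, x_hi = x_b - dx, x_b + dx
    return best[1], best[2]
```

Under the assumed form, the reliability value for a new synthetic variable `z` would then be `sigmoid(z, a, x0)` with the fitted parameters. A gradient-based least-squares routine from a numerical library would be the more usual choice in practice.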
- The model function obtained by the above method is extremely versatile. It can also be applied to the analysis of a data set acquired separately under conditions different from the acquisition conditions of the data set prepared in step A-1. For example, the model function can be applied to the calculation of reliability values for a separately acquired data set even when the sample amount or concentration, the polymorphic loci analyzed, or the signal type (number of reads or UMT count) differs from those of the data set prepared in step A-1. That is, when reliability values are to be calculated for a data set acquired under other conditions, there is no need to create a new model function for those conditions. Once the model function has been created by the method of the present invention, it can be reused for the analysis of data sets acquired under different conditions.
- For example, model functions created from data sets relating to prenatal genetic testing can be reused for the analysis of data sets acquired in cancer testing or in monitoring the engraftment of transplanted organs.
- However, the kinds and number of numerical values included in the numerical group used for the linear combination when creating the model function are preferably the same as the kinds and number of numerical values included in the numerical group used for the linear combination that generates the synthetic variable input to the model function.
- The method of creating a model function based on the correlation between the synthetic variable and the reliability value has been described above, but the present invention is not limited to this; a model function that calculates the reliability value using another index as the explanatory variable can also be provided.
- the present invention also relates to a method for creating model functions f2 (x2) and f3 (x3), which will be described later. The method of creating each model function will be described in detail below.
- First, the method of creating the model function f2 (x2) is explained.
- This method comprises step A-1, step A-3-2 and step A-4-2.
- The contents of step A-1 are as described above.
- Steps A-3-2 and A-4-2 are described below.
- In step A-3-2, the (A1) secondary component signal strength described above is first divided into a plurality of categories according to the magnitude of its numerical value.
- The classification method is not particularly limited. The categories may be set at equal intervals according to the magnitude of the secondary component signal strength, but it is preferable to set them so that every category contains secondary component signal strengths. More preferably, the categories are spaced exponentially rather than linearly with respect to the magnitude of the secondary component signal strength. This is because curve regression of the secondary component signal strength against the probability yields a sigmoid curve.
- The number of categories is not limited, but is preferably 3 or more, more preferably 5 or more, still more preferably 7 or more, still more preferably 10 or more, still more preferably 12 or more, still more preferably 15 or more, and still more preferably 18 or more.
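One possible reading of "exponential" classification is a set of category boundaries with a constant ratio between neighbors (geometric spacing). The sketch below follows that reading, which is an interpretation and not something the text specifies. The function names are hypothetical, and positive signal strengths are assumed.

```python
from bisect import bisect_right

def geometric_edges(lo, hi, n_bins):
    """Return n_bins + 1 boundaries from lo to hi with a constant ratio."""
    if lo <= 0 or hi <= lo or n_bins < 1:
        raise ValueError("need 0 < lo < hi and at least one category")
    ratio = (hi / lo) ** (1.0 / n_bins)
    return [lo * ratio ** k for k in range(n_bins + 1)]

def all_bins_populated(values, edges):
    """Check that every category between edges[0] and edges[-1] holds a value."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v <= edges[-1]:
            # Values equal to the top boundary go into the last category.
            i = min(bisect_right(edges, v) - 1, len(counts) - 1)
            counts[i] += 1
    return all(c > 0 for c in counts)
```

`all_bins_populated` reflects the stated preference that each category should contain at least one secondary component signal strength. If it returns `False`, the number of categories or the range could be adjusted.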
- Next, the proportion of true secondary component signal strengths in each category is obtained. That is, among the numerical values of all the secondary component signal strengths included in each category, the proportion that are true is obtained. In the present specification, this proportion is referred to as the "probability".
- the secondary component signal intensity suggests the presence of a specific allele present at the polymorphic locus in the secondary nucleic acid. As suggested by this secondary component signal intensity, if the specific allele is actually present in the secondary nucleic acid, this is regarded as "true".
- After the probability of the secondary component signal strength in each category has been obtained, it is assigned as the probability corresponding to the secondary component signal strengths included in that category. Specifically, the probability of the category is assigned to one secondary component signal strength value that represents the category. This step makes it possible to create a scatter plot of secondary component signal strength against probability.
- Step A-4-2: regression analysis is performed on the secondary component signal strengths included in each of the above-mentioned categories and the probabilities corresponding to the secondary component signal strengths included in each category. As a result, a model function f2 (x2) for calculating the reliability value is obtained, with the secondary component signal strength as the explanatory variable x2 and the reliability value as the objective variable.
- the method of regression analysis in step A-4-2 is not particularly limited, but the least squares method can be preferably exemplified.
- the model function f2 (x2) is a sigmoid function and can be expressed by the following equation 2.
- The model function f2 (x2) obtained by the above method is extremely versatile: once created by the method of the present invention, it can be reused for the analysis of data sets acquired under different conditions. It can also be applied to the analysis of a data set obtained by a type of test different from that of the data set from which the model function f2 (x2) was created.
- In Equation 2, A2 is preferably 1.8 to 2.0, more preferably 1.9. Further, x02 is preferably 2.5 to 2.7, and more preferably 2.6. Values that correspond to the above numerical values when rounded at the second decimal place are included in the numerical ranges specified here.
- This method comprises the following step A-3-3 and step A-4-3.
- In step A-3-3, the (A2) secondary component mixing rate described above is first divided into a plurality of categories according to the magnitude of its numerical value.
- The classification method is not particularly limited. The categories may be set at equal intervals according to the magnitude of the secondary component mixing rate, but it is preferable to set them so that every category contains secondary component mixing rates. More preferably, the categories are spaced exponentially rather than linearly with respect to the magnitude of the secondary component mixing rate. This is because curve regression of the secondary component mixing rate against the probability yields a sigmoid curve.
- The number of categories is not limited, but is preferably 3 or more, more preferably 5 or more, still more preferably 7 or more, still more preferably 10 or more, still more preferably 12 or more, still more preferably 15 or more, and still more preferably 18 or more.
- The secondary component mixing rate is calculated from the secondary component signal intensity, which suggests the presence of a specific allele at the polymorphic locus in the secondary nucleic acid.
- If, as suggested by the secondary component signal intensity on which the secondary component mixing rate is based, the specific allele is actually present in the secondary nucleic acid, the signal is regarded as "true".
- The probability of the secondary component mixing rate in each category is assigned as the probability corresponding to each secondary component mixing rate included in that category.
- Specifically, the probability of the category is assigned to one secondary component mixing rate value that represents the category.
- Step A-4-3: regression analysis is performed on the secondary component mixing rates included in each of the above-mentioned categories and the probabilities corresponding to the secondary component mixing rates included in each category.
- As a result, a model function f3 (x3) for calculating the reliability value is obtained, with the secondary component mixing rate as the explanatory variable x3 and the reliability value as the objective variable.
- the method of regression analysis in step A-4-3 is not particularly limited, but the least squares method can be preferably exemplified.
- the model function f3 (x3) is a sigmoid function and can be expressed by the following equation 3.
- A3 is preferably 9.3 to 9.5, more preferably 9.4.
- x03 is preferably 0.5 to 0.7, and more preferably 0.6.
- Values that correspond to the above numerical values when rounded at the second decimal place are included in the numerical ranges specified here.
- Each of the model functions described above is useful on its own for evaluating the reliability of the secondary component signal strengths contained in a data set.
- A more useful model function can be created by multiplying a plurality of the created model functions together.
- For example, two or more synthetic variables may be generated in step A-2, and reliability values assigned to each of the two or more synthetic variables in step A-3-1.
- In step A-4-1, two or more independent model functions, each having one of the two or more synthetic variables as its explanatory variable, are then created. These two or more model functions may be multiplied together to create a model function expressed as a product.
- Two or more model functions selected from the following three model functions may be multiplied together to create a model function expressed as a product.
- All three of the following model functions may be multiplied together to create a model function expressed as a product.
- In Equation 4, a model function created by multiplying together the model functions f1 (x1), f2 (x2) and f3 (x3) described above is used.
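The product form f(x1, x2, x3) = f1(x1) · f2(x2) · f3(x3) can be sketched as below. Equations 1 to 4 are not reproduced in this text, so each factor is assumed here to be a two-parameter logistic function, which may differ from the actual equations. The parameter values are the "more preferable" values quoted for A1 to A3 and x02, x03. For x01 the sketch uses -0.7, the midpoint of the stated range with the garbled sign read as a minus, which is also an assumption.

```python
import math

def logistic(x, a, x0):
    """Assumed two-parameter sigmoid, evaluated without overflow."""
    z = a * (x - x0)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# (A, x0) pairs; illustrative values taken from the preferred ranges in the text.
PARAMS = {
    "f1": (15.5, -0.7),  # synthetic variable (x01 = -0.7 is an assumption)
    "f2": (1.9, 2.6),    # secondary component signal strength
    "f3": (9.4, 0.6),    # secondary component mixing rate
}

def combined_reliability(x1, x2, x3, params=PARAMS):
    """Product of the three model functions, giving a value between 0 and 1."""
    return (logistic(x1, *params["f1"])
            * logistic(x2, *params["f2"])
            * logistic(x3, *params["f3"]))
```

Because each factor lies between 0 and 1, the product is high only when all three explanatory variables indicate a reliable signal. The inputs are assumed to be on whatever scale (for example standardized) was used when the parameters were fitted.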
- In this embodiment, the primary contributor is the mother, the secondary contributor is the fetus in the mother's womb, and the mixed nucleic acid sample is a circulating cell-free nucleic acid sample collected from the mother.
- Step A1-1 is a step of preparing a data set obtained by measuring the circulating cell-free nucleic acid sample.
- Circulating cell-free nucleic acid samples contain a major nucleic acid containing genetic information about the mother and a secondary nucleic acid containing genetic information about the fetus.
- Circulating cell-free nucleic acid samples usually contain more primary nucleic acid than secondary nucleic acid. On the other hand, the content ratio may be reversed in the latter half of pregnancy.
- This dataset contains signals indicating the presence of each allele in multiple polymorphic loci in the primary and secondary nucleic acids.
- Loci having single nucleotide polymorphisms (SNPs) used in human identification (HID) are preferred as the polymorphic loci.
- The known SNPs used in HID are stored in a database, and polymorphic loci having these SNPs can be selected arbitrarily.
- Step A1-2 is a step of linearly combining at least the numerical values (A1) and (A2) for those polymorphic loci, among the plurality of polymorphic loci in the data set, that are homozygous in the mother and homozygous in the father and in which the signal indicating the presence of an allele derived from the major nucleic acid and the signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately. Because these polymorphic loci are homozygous in both the mother and the father, the signal from the maternal genomic DNA is unlikely to contribute to both the main component signal intensity and the secondary component signal intensity.
- Step A1-3-1 is a step of assigning a reliability value to the synthetic variable generated by the linear combination, and all of the explanations of step A-3-1 above apply.
- Whether a secondary component signal strength is true or false is determined as follows.
- When the father is homozygous for an allele different from the allele for which the mother is homozygous, the secondary component signal due to the father-derived allele should be detected separately from the mother's homozygous allele. Therefore, when a secondary component signal is detected for that allele separately from the main component signal, the secondary component signal is regarded as true. When no secondary component signal is detected for that allele separately from the main component signal, the secondary component signal is regarded as false. This means that the result that the secondary component signal was not detected is false.
- When the father is homozygous for the same allele as the mother, the father-derived allele cannot be detected separately from the mother's homozygous allele. Therefore, when a secondary component signal is detected separately from the main component signal, the secondary component signal is regarded as false. When no secondary component signal is detected separately from the main component signal, the secondary component signal is regarded as true. This means that the result that the secondary component signal was not detected is true.
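The two cases above reduce to comparing what was observed with what the parental genotypes predict. A minimal sketch follows, with a hypothetical function name and with each homozygous genotype represented by a single allele label.

```python
def label_secondary_signal(mother_allele, father_allele, detected):
    """Return True if the observation is regarded as 'true'.

    mother_allele / father_allele: the allele for which each parent is homozygous.
    detected: whether a secondary component signal was detected separately
        from the main component signal at this locus.
    """
    # A fetal signal distinct from the mother's is expected only when the
    # father's homozygous allele differs from the mother's.
    expected = father_allele != mother_allele
    return detected == expected
```

The resulting boolean labels are what the per-category "probability" counts.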
- Step A1-4-1 is a step of obtaining a model function, and all of the explanations of step A-4-1 above apply.
- A model function f2 (x2) in which the secondary component signal strength is the explanatory variable x2, and a model function f3 (x3) in which the secondary component mixing rate is the explanatory variable x3, may also be created. For these, the explanations of steps A-4-2 and A-4-3 above apply.
- a plurality of created model functions may be multiplied by each other to create a model function represented by multiplication. The specific embodiment is as described above.
- In this embodiment, the major contributor corresponds to a healthy person having the normal allele at a polymorphic locus where a cancer-related mutation is observed, and the secondary contributor corresponds to cancer cells.
- The mixed nucleic acid sample is artificially prepared by spiking (adding), into a nucleic acid sample that was collected from the healthy person and contains the major nucleic acid carrying the healthy person's genetic information, a secondary nucleic acid consisting of a plurality of nucleic acid fragments that contain the base sequence of the polymorphic locus into which the cancer-related mutation has been introduced. More specifically, a mixed nucleic acid sample artificially prepared by spiking nucleic acid fragments containing the sequence of a cancer-associated mutant allele into a circulating cell-free nucleic acid sample collected from a healthy person is preferred.
- The mixed nucleic acid sample may be prepared by spiking artificially synthesized nucleic acid fragments into a nucleic acid sample collected from a healthy person. Alternatively, a mixed nucleic acid sample may be prepared by spiking a cancer cell line, cancer tissue, or a nucleic acid extract thereof into a nucleic acid sample collected from a healthy person.
- The mixed nucleic acid sample mimics a circulating cell-free nucleic acid sample of a subject to be tested for cancer.
- the mixing ratio of the primary nucleic acid and the secondary nucleic acid in the mixed nucleic acid sample is not particularly limited, but it is preferable to adjust the mixed nucleic acid sample so that the primary nucleic acid is contained in a larger amount than the secondary nucleic acid. In other words, it is preferable to spike the secondary nucleic acid so that the signal resulting from a particular locus in the secondary nucleic acid is smaller than the signal resulting from the locus in the primary nucleic acid.
- The gene copy number of the spiked secondary nucleic acid relative to the major nucleic acid is preferably less than 50%, more preferably 40% or less, still more preferably 30% or less, still more preferably 20% or less, and still more preferably 10% or less.
- The length of the nucleic acid fragments to be spiked is not particularly limited as long as they contain a cancer-related mutation, but is preferably 50 to 500 bp, more preferably 100 to 300 bp, and still more preferably 120 to 200 bp.
- For the nucleic acid fragments to be spiked, any plurality of known cancer-related single nucleotide substitution mutations can be selected.
- Steps A-1, A-2, A-3-1 and A-4-1 described under "<1-1> Overview" correspond to steps A2-1, A2-2, A2-3-1 and A2-4-1 in the present embodiment, respectively. Each step is described below.
- Step A2-1 is a step of preparing a data set containing data obtained by measuring a mixed nucleic acid sample into which the above-mentioned secondary nucleic acid has been spiked.
- The data set prepared in step A2-1 may also include data obtained by measuring a nucleic acid sample that contains only the major nucleic acid, without any spiked secondary nucleic acid.
- the polymorphic loci preferably include loci with single nucleotide polymorphisms (SNPs) known to be associated with cancer. Cancer-related SNPs are stored in a database, and certain polymorphic loci with these SNPs can be arbitrarily selected.
- Step A2-2 is a step of linearly combining at least the numerical values (A1) and (A2) for those polymorphic loci, among the plurality of polymorphic loci in the data set, in which the signal indicating the presence of an allele derived from the major nucleic acid and the signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately.
- Step A2-3-1 is a step of assigning a reliability value to the synthetic variable obtained by the linear combination, and all of the explanations of step A-3-1 above apply.
- Whether a secondary component signal strength is true or false is determined as follows.
- When a nucleic acid fragment containing the mutation has been spiked, a secondary component signal should be detected for that nucleic acid fragment. Therefore, in this case, when a secondary component signal is detected for the nucleic acid fragment, the secondary component signal is regarded as true. When no secondary component signal is detected for the nucleic acid fragment, the secondary component signal is regarded as false. This means that the result that the secondary component signal was not detected is false.
- When no such nucleic acid fragment has been spiked, a secondary component signal should not be detected for the nucleic acid fragment. Therefore, in this case, when a secondary component signal is detected for the nucleic acid fragment, the secondary component signal is regarded as false. When no secondary component signal is detected for the nucleic acid fragment, the secondary component signal is regarded as true. This means that the result that the secondary component signal was not detected is true.
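For the spike-in design, the same comparison of expectation and observation applies, with the expectation given by whether a fragment carrying the mutation was spiked. The sketch below uses hypothetical function names and a list of (spiked, detected) pairs as input; the second helper shows how the labels would feed the "probability" of one category.

```python
def label_spike_in(records):
    """Label each measurement as true (True) or false (False).

    records: iterable of (spiked, detected) booleans, one pair per measurement
        of a mutation at a polymorphic locus.
    """
    return [spiked == detected for spiked, detected in records]

def true_fraction(labels):
    """Share of measurements regarded as true among the given labels."""
    labels = list(labels)
    if not labels:
        raise ValueError("no measurements in this category")
    return sum(labels) / len(labels)
```

Samples containing only the major nucleic acid supply the `spiked=False` records, which is why including them in the data set is useful.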
- Step A2-4-1 is a step of obtaining a model function, and all of the explanations of step A-4-1 above apply.
- A model function f2 (x2) in which the secondary component signal intensity is the explanatory variable x2, and a model function f3 (x3) in which the secondary component mixing rate is the explanatory variable x3, may also be created. For these, the explanations of steps A-4-2 and A-4-3 above apply.
- a plurality of created model functions may be multiplied by each other to create a model function represented by multiplication. The specific embodiment is as described above.
- a model function is created from a data set obtained from a cancer test.
- the feature of this embodiment is that a model function is created based on the data regarding a single polymorphic lotus coition. Specifically, it includes the following steps A 2'-1, step A 2'- 2 , and the above-mentioned steps A 2-3-1 and step A 2-4-1.
- Step A 2'-1 is a step of preparing a data set obtained by measuring a plurality of mixed nucleic acid samples in which the above-mentioned secondary nucleic acids are spiked at different content ratios.
- the difference from Step A 2-1 is that a plurality of mixed nucleic acid samples in which secondary nucleic acids are spiked at different content ratios are prepared.
- whereas the data set of the above-mentioned step A 2-1 contains data on a plurality of polymorphic loci, the data set of step A 2'-1 also differs in that it only needs to include signals indicating the presence of each allele at a single polymorphic locus in the primary nucleic acid and the secondary nucleic acid. That is, step A 2'-1 is characterized in that, while data on only a single polymorphic locus may suffice, data on a plurality of mixed nucleic acid samples having different content ratios of secondary nucleic acid are prepared.
- Step A 2'-2 is a step of linearly combining a numerical group including at least the following (A1') and (A2') for a single polymorphic locus at which, among the data contained in the data set, a signal indicating the presence of an allele derived from the major nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid were detected separately, to generate one or more synthetic variables.
- (A1') Secondary component signal intensity indicating the presence of the allele of the single polymorphic locus derived from the secondary nucleic acid.
- (A2') Secondary component mixing ratio, which is the ratio of the secondary component signal intensity to the total signal intensity caused by the alleles of the single polymorphic locus.
- (A1') and (A2') differ from (A1) and (A2) described above only in wording, because the data prepared in step A 2'-1 relate to a single polymorphic locus; in essence they are the same as (A1) and (A2).
- the embodiment including step A 2'-1, step A 2'-2 and the above-mentioned steps A 2-3-1 and A 2-4-1 is useful for creating model functions from data acquired by means for which there is no general method of creating a calibration curve, such as microarrays, digital PCR, and base sequence determination means (particularly next-generation sequencers).
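As an illustration of (A1') and (A2') for a spiked series, the sketch below computes the secondary component mixing ratio at one polymorphic locus for several mixed samples. The record layout (`spike_pct`, `secondary`, `total`) and the numbers are made-up assumptions for demonstration only.

```python
from typing import Dict, List


def mixing_ratios(samples: List[Dict[str, float]]) -> List[float]:
    """(A2') per sample: secondary signal intensity / total signal intensity."""
    ratios = []
    for sample in samples:
        if sample["total"] <= 0:
            raise ValueError("total signal intensity must be positive")
        ratios.append(sample["secondary"] / sample["total"])
    return ratios


# Hypothetical measurements of one locus in samples spiked at different ratios.
spiked_samples = [
    {"spike_pct": 1.0, "secondary": 10.0, "total": 1000.0},
    {"spike_pct": 5.0, "secondary": 48.0, "total": 1000.0},
    {"spike_pct": 10.0, "secondary": 103.0, "total": 1000.0},
]
ratios = mixing_ratios(spiked_samples)
```

Here `secondary` plays the role of (A1') and each element of `ratios` plays the role of (A2').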
- the main contributor corresponds to the recipient of organ transplantation
- the secondary contributor corresponds to the transplanted organ transplanted from the donor.
- the mixed nucleic acid sample in this embodiment contains a primary nucleic acid containing genetic information about the recipient and a secondary nucleic acid containing genetic information about the transplanted organ.
- the mixed nucleic acid sample contains more major nucleic acid than secondary nucleic acid.
- the genetic information about the transplanted organ is consistent with the genetic information about the donor.
- the mixed nucleic acid sample may be a sample obtained from the recipient after transplantation, specifically, a circulating acellular nucleic acid sample. Alternatively, it may be prepared by artificially mixing the main nucleic acid derived from the recipient obtained from the recipient and the secondary nucleic acid derived from the donor obtained from the donor or the transplanted organ.
- the number of copies of the secondary nucleic acid is preferably less than 50%, more preferably 40% or less, still more preferably 30% or less, still more preferably 20% or less, and still more preferably 10% or less of that of the primary nucleic acid, so that the signal caused by the primary nucleic acid is detected more strongly than the signal caused by the secondary nucleic acid.
- Steps A-1, A-2, A-3-1 and A-4-1 described in the item "<1-1> Overview" correspond to steps A 3-1, A 3-2, A 3-3-1 and A 3-4-1 in the present embodiment, respectively. Hereinafter, each step will be described.
- Step A 3-1 is a step of preparing a data set obtained by measuring the mixed nucleic acid sample described above. This dataset contains signals indicating the presence of each allele in multiple polymorphic loci in the primary and secondary nucleic acids.
- a locus having single nucleotide polymorphisms (SNPs) used in human identification (HID) is preferably mentioned.
- the known SNPs used in HID are stored in a database, and polymorphic loci having these SNPs can be arbitrarily selected.
- a recipe is used.
- a signal is obtained indicating the presence of an allele that the recipient does not have and that the donor has as a heterozygote or homozygote, this can be determined to be true.
- nucleic acid sample does not contain secondary nucleic acids derived from the donor, a signal indicating the presence of an allele that the recipient does not have but the donor has can be determined to be false.
- Step A 3-2 is a step of linearly combining at least the numerical values of (A1) and (A2) for those polymorphic loci, among the plurality of polymorphic loci in the data contained in the data set, at which a signal indicating the presence of an allele derived from the main nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately.
- the secondary component signal intensity indicating the presence of another allele other than the specific allele may be indicated. Signals due to the recipient's allele cannot be mixed. In this case, the signal indicating the presence of the allele derived from the main nucleic acid and the signal indicating the presence of the allele derived from the secondary nucleic acid are detected separately.
- Step A 3-3-1 is a step of assigning a reliability value to a synthetic variable generated by a linear combination, and all the explanations of step A-3-1 described above are valid.
- the truth of the secondary component signal strength is determined as follows.
- Alleles that the recipient does not have and that the donor has homozygous or heterozygous are distinguished from the alleles that the recipient has and are derived from the alleles that the donor has.
- the secondary component signal should be detected. Therefore, when the secondary component signal is detected separately from the main component signal for the allele, the secondary component signal is regarded as true. Further, when the secondary component signal is not detected separately from the main component signal for the allele, the secondary component signal is regarded as false. This means that the result that the secondary component signal was not detected is false.
- the secondary component signal should not be detected separately from the allele possessed by the recipient. Therefore, when the secondary component signal is detected separately from the main component signal for the allele, the secondary component signal is regarded as false. Further, when the secondary component signal is not detected separately from the main component signal for the allele, the secondary component signal is regarded as true. This means that the result that the secondary component signal was not detected is true.
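For the transplant case, whether a secondary component signal is expected for an allele follows from the recipient and donor genotypes. The sketch below is one possible encoding under that reading; the function name `expected_secondary` and the use of `None` for alleles that cannot be distinguished from the recipient's own signal are assumptions made here for illustration.

```python
from typing import Optional, Set


def expected_secondary(allele: str, recipient: Set[str], donor: Set[str]) -> Optional[bool]:
    """Is a secondary component signal expected for this allele?

    Returns None when the recipient carries the allele, because a donor-derived
    signal would then be mixed with the recipient-derived signal.
    """
    if allele in recipient:
        return None
    # The recipient lacks the allele: a signal is expected only if the donor has it.
    return allele in donor


def is_true_signal(allele: str, recipient: Set[str], donor: Set[str], detected: bool) -> Optional[bool]:
    """Label an observation as true/false, or None if the allele is not analyzable."""
    expected = expected_secondary(allele, recipient, donor)
    return None if expected is None else expected == detected
```

For example, with a recipient genotype {A} and a donor genotype {A, B}, a detected signal for allele B is labeled true, whereas with a donor genotype {A} the same detection is labeled false.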
- Step A 3-4-1 is a step of obtaining a model function, and all the above-mentioned explanations of step A-4-1 are valid.
- model function f2 in which the secondary component signal strength is the explanatory variable x2
- model function f3 in which the secondary component mixing ratio is the explanatory variable x3.
- the explanations of steps A-4-2 and A-4-3 apply.
- a plurality of created model functions may be multiplied by each other to create a model function represented by multiplication. The specific embodiment is as described above.
- the present invention also relates to a reliability calculation method.
- Hereinafter, specific embodiments of the reliability calculation method of the present invention will be described. Of the contents of the above description of the method for creating the model function, parts that also apply to the reliability calculation method of the present invention will be omitted as appropriate.
- the reliability calculation method of the present invention is a reliability value calculation method for calculating a reliability value by inputting an explanatory variable thereof into a model function.
- examples of the model function referred to here include a model function obtained by the above method, a model function of any of Equations 1 to 3, and a model function represented by multiplication, obtained by multiplying together two or more model functions selected from the group consisting of the model functions represented by Equations 1 to 3.
- the numerical values to be input to the model function are the explanatory variables of each model function. Specifically, one or more numerical values selected from the following (B1) and (B2), which are included in the data set prepared in the following step B-1, and the synthetic variables obtained in the following step B-2 are input to the model function as explanatory variables.
- the reliability calculation method of the present invention includes the following step B-1. If the numerical value to be input to the model function is a synthetic variable, the synthetic variable is generated by the following step B-2.
- Step B-1 is a step of preparing a data set obtained by measuring a mixed nucleic acid sample containing a major nucleic acid containing genetic information about a major contributor and a secondary nucleic acid containing genetic information about a secondary contributor.
- the mixed nucleic acid sample contains more major nucleic acid than secondary nucleic acid.
- the dataset then contains signals indicating the presence of each allele in multiple polymorphic loci in the primary nucleic acid and the secondary nucleic acid.
- the method for acquiring the data set is not particularly limited. It may be acquired primarily by using the analysis means described later, or it may be acquired secondarily by a third party.
- the data set is not particularly limited as long as it is obtained by an analytical means capable of distinguishing and detecting each allele at a polymorphic locus. Examples of the analytical means include analytical means capable of distinguishing and detecting single nucleotide polymorphisms (SNPs) at polymorphic loci.
- examples of the analysis means include next-generation sequencers used for detecting SNPs, digital PCR, microarrays, multiplex PCR, mass spectrometry, and the like. The specifics are as explained in the item "<1> Method of creating a model function".
- the type of mixed nucleic acid sample is not limited. Preferable examples include a circulating acellular nucleic acid sample (cfDNA, cfRNA) obtained from the blood of a pregnant woman for a prenatal genetic test, a circulating acellular nucleic acid sample (cfDNA, cfRNA) obtained from the blood of a test subject for a cancer test, and a circulating acellular nucleic acid sample (cfDNA, cfRNA) obtained from the blood of a recipient for monitoring the colonization of a transplanted organ.
- the data set in the reliability calculation method of the present invention includes signals indicating the presence of each allele at a plurality of polymorphic loci. This "plurality of polymorphic loci" does not have to be the same as the "plurality of polymorphic loci" used as the basis for creating the model function, and the degree of overlap is not limited.
- the degree of overlap may be preferably 80% or less, more preferably 70% or less, still more preferably 60% or less, and still more preferably 50% or less, based on the "plurality of polymorphic loci" used as the basis for creating the model function.
- the degree of overlap may be 0%, or may be preferably 10% or more, more preferably 20% or more, still more preferably 30% or more, and still more preferably 40% or more, based on the "plurality of polymorphic loci" used as the basis for creating the model function.
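The degree of overlap can be read as the share of the loci used to create the model function that also appear in the data set used for the reliability calculation. A minimal sketch under that reading follows; the function name `overlap_degree` and the locus identifiers are hypothetical.

```python
from typing import Iterable


def overlap_degree(model_loci: Iterable[str], calc_loci: Iterable[str]) -> float:
    """Overlap in percent, relative to the loci used to create the model function."""
    model = set(model_loci)
    if not model:
        raise ValueError("the model function must be based on at least one locus")
    shared = model & set(calc_loci)
    return 100.0 * len(shared) / len(model)


# Hypothetical locus identifiers.
model_loci = ["rs1", "rs2", "rs3", "rs4", "rs5"]
calc_loci = ["rs4", "rs5", "rs6"]
pct = overlap_degree(model_loci, calc_loci)  # two of five model loci are shared
```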
- Step B-2 is a step of linearly combining a numerical group including the following (B1) and (B2) for those polymorphic loci, among the plurality of polymorphic loci in the data included in the data set, at which a signal indicating the presence of an allele derived from a major nucleic acid and a signal indicating the presence of an allele derived from a secondary nucleic acid are detected separately, to generate one or more synthetic variables.
- the secondary component signal intensity is the intensity of the signal indicating the presence of an allele of a specific polymorphic locus derived from the secondary nucleic acid.
- the above description (A1) is valid as it is.
- the numerical value group to be linearly combined in step B-2 may include numerical values other than the above-mentioned (B1) and (B2). That is, a linear combination is performed on a numerical group including, in addition to (B1) and (B2), various measured or calculated values related to the specific polymorphic locus.
- the numerical values (B3) to (B5) that may be included in the numerical value group to be linearly combined will be described below. In addition, only one kind selected from the following (B3) to (B5) may be included in the numerical value group, or two or more kinds of numerical values arbitrarily selected may be included in the numerical value group. Further, all of (B3) to (B5) may be included in the numerical group.
- the major component signal strength is the strength of the signal indicating the presence of one allele of a specific polymorphic locus derived from the major nucleic acid.
- the above description (A3) is valid as it is.
- the data set prepared in step B-1 is a set of data related to a plurality of polymorphic loci. Therefore, needless to say, the data set prepared in step B-1 includes a plurality of sets of data, each including the above (B1) and (B2) and other numerical data relating to a specific polymorphic locus.
- Standardized data = [(original data) - (mean value)] / (sample standard deviation)
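The standardization formula can be sketched as follows. This is a minimal example that assumes "sample standard deviation" means the n - 1 form, which is what Python's `statistics.stdev` computes; the function name `standardize` is illustrative.

```python
import statistics
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Return [(x - mean) / sample standard deviation] for each x."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("values must not all be identical")
    return [(v - mean) / sd for v in values]


# Mean is 2.0 and the sample standard deviation is 1.0 for this input.
z = standardize([1.0, 2.0, 3.0])
```

Standardizing each kind of numerical value separately puts signal intensities and ratios on a comparable scale before they are linearly combined.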
- a polymorphic locus at which a signal indicating the presence of an allele derived from a major nucleic acid and a signal indicating the presence of an allele derived from a secondary nucleic acid are detected separately refers to a polymorphic locus at which the signal indicating the presence of an allele derived from the major nucleic acid and the signal indicating the presence of an allele derived from the secondary nucleic acid are not mixed.
- the signals of allele A and allele B derived from the genomic DNA of the mother are always detected in the cfDNA, regardless of the father's genotype.
- Either the allele A or allele B signal should contain a signal derived from the fetal cffDNA, but this cannot be distinguished from the signal derived from the mother's genomic DNA. Such data is excluded from the analysis of the present invention.
- the mutation is always included in ctDNA, so the signal derived from the test subject is mixed with the signal derived from cancer cells. Such data is excluded from the analysis of the present invention.
- the polymorphic loci targeted for data analysis are limited to "polymorphic loci at which a signal indicating the presence of an allele derived from a major nucleic acid and a signal indicating the presence of an allele derived from a secondary nucleic acid are detected separately".
- the polymorphic locus to be analyzed in step B-2 may be paraphrased as a polymorphic locus at which there is no possibility that a signal indicating the presence of an allele derived from a secondary nucleic acid is mixed with a signal indicating the presence of an allele derived from a major nucleic acid.
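For the prenatal example, the restriction above can be sketched as a filter that keeps only loci at which the mother is homozygous, since at heterozygous maternal loci a fetal signal cannot be distinguished from the maternal one. The function name `separable_loci`, the genotype encoding, and the locus identifiers are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def separable_loci(maternal_genotypes: Dict[str, Tuple[str, str]]) -> List[str]:
    """Keep loci where the mother is homozygous.

    At such loci, any signal for the other allele cannot come from the
    mother's own genomic DNA, so it is not mixed with the major signal.
    """
    return [
        locus
        for locus, (allele_1, allele_2) in maternal_genotypes.items()
        if allele_1 == allele_2
    ]


# Hypothetical maternal genotypes at three loci.
maternal = {"rs1": ("A", "A"), "rs2": ("A", "B"), "rs3": ("B", "B")}
kept = separable_loci(maternal)  # rs2 is dropped: the mother is heterozygous there
```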
- one or more synthetic variables are generated by linearly combining the above-mentioned numerical groups.
- Principal component analysis can be preferably exemplified as a means of linear combination. It may be a synthetic variable generated by another means. Even if it is a synthetic variable generated by another means, it is preferable that this is a synthetic variable that can be generated by principal component analysis.
- the number of synthetic variables that can be generated increases as the number of types of numerical values included in the numerical value group to be linearly combined increases.
- the number of synthetic variables generated in step B-2 is not particularly limited.
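As a sketch of generating a synthetic variable by principal component analysis, the example below handles only the two-variable case, (B1) and (B2), where the first principal component of the standardized variables has a closed form. Restricting to two variables, the function names, and the input values are assumptions made here to keep the example dependency-free; with more variables a general eigen-decomposition would be needed.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def _standardize(values: Sequence[float]) -> List[float]:
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]


def first_principal_component(
    b1: Sequence[float], b2: Sequence[float]
) -> Tuple[Tuple[float, float], List[float], float]:
    """Return (weights, scores, contribution rate) of the first principal component."""
    if len(b1) != len(b2):
        raise ValueError("b1 and b2 must have the same length")
    z1, z2 = _standardize(b1), _standardize(b2)
    # Pearson correlation of the two standardized variables.
    r = sum(a * b for a, b in zip(z1, z2)) / (len(z1) - 1)
    # A 2x2 correlation matrix has eigenvalues 1 + r and 1 - r with
    # eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2); pick the larger one.
    sign = 1.0 if r >= 0 else -1.0
    weights = (1 / math.sqrt(2), sign / math.sqrt(2))
    # The synthetic variable is the linear combination of the two inputs.
    scores = [weights[0] * a + weights[1] * b for a, b in zip(z1, z2)]
    contribution = (1 + abs(r)) / 2
    return weights, scores, contribution


# Hypothetical (B1) intensities and (B2) mixing ratios at four loci.
weights, scores, contribution = first_principal_component(
    [10.0, 20.0, 35.0, 50.0], [0.01, 0.02, 0.03, 0.06]
)
```

The same weights would then be reused when new data are projected, so that the input to the model function is built the same way as the variable the model was fitted on.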
- the steps for calculating the reliability value by inputting the numerical values obtained as described above into the model function are the following steps B-3-1 to B-3-4.
- Step B-3-1 is a step of calculating a reliability value by inputting the synthetic variable generated by the linear combination in step B-2 into the above-mentioned model function, in which the synthetic variable is the explanatory variable and the reliability value is the objective variable. It is preferable that the types and number of numerical values included in the numerical group used for the linear combination when creating the model function be the same as the types and number of numerical values included in the numerical group used for the linear combination that generates the synthetic variable to be input to the model function.
- the present invention also relates to a method for calculating a reliability value, which comprises the above-mentioned step B-1 and the following step B-3-2.
- Step B-3-2 is a step of inputting the secondary component signal strength of (B1) into the above-mentioned model function f2 (x2) and calculating a reliability value.
- the reliability value of the data can be easily calculated by inputting the secondary component signal strength, which is included in the data set as primary data, into the model function f2 (x2).
- Step B-3-3 is a step of inputting the secondary component mixing ratio of the above (B2) into the above-mentioned model function f3 (x3) and calculating a reliability value.
- the reliability value of the data can be easily calculated by inputting the secondary component mixing rate into the model function f3 (x3).
- the present invention also relates to a method for calculating a reliability value, which comprises the above-mentioned step B-1 and the following step B-3'.
- Step B-3' is a step of calculating a reliability value by inputting variables selected from the following three types of numerical values into a model function represented by multiplication, in which those variables are the explanatory variables and the reliability value is the objective variable.
- (I) The synthetic variable generated in the above step B-2.
- a polymorphic locus at which a signal indicating the presence of an allele derived from the main nucleic acid and a signal indicating the presence of an allele derived from the secondary nucleic acid are separately detected.
- the secondary component mixing ratio, which is the ratio of the secondary component signal intensity to the total signal intensity caused by the alleles of the specific polymorphic locus, for that locus.
- the model function represented by multiplication here is, as described above, a model function obtained by multiplying together two or more model functions selected from the following three model functions.
- Model function created by step A-1, step A-2, step A-3-1 and step A-4-1
- Model function created by step A-1, step A-3-2 and step A-4-2
- Model function created by step A-1, step A-3-3 and step A-4-3
- variables corresponding to the respective explanatory variables of f1 (x1), f2 (x2), and f3 (x3) are input to the model function represented by Equation 4, and the reliability value is calculated.
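A model function represented by multiplication can be sketched as the product of the individual model functions evaluated at their own explanatory variables. The sketch assumes that Equation 4 has this product form. The individual functions are passed in as callables because the forms of Equations 1 to 3 are not restated here; the logistic curves and their coefficients below are purely hypothetical placeholders.

```python
import math
from typing import Callable, Sequence


def product_model(functions: Sequence[Callable[[float], float]], inputs: Sequence[float]) -> float:
    """Multiply f_i(x_i) over all supplied model functions."""
    if len(functions) != len(inputs):
        raise ValueError("one input is required per model function")
    value = 1.0
    for f, x in zip(functions, inputs):
        value *= f(x)
    return value


def logistic(intercept: float, slope: float) -> Callable[[float], float]:
    """Hypothetical stand-in for a fitted model function (not from the source)."""
    return lambda x: 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


# Placeholder f1, f2, f3 with made-up coefficients.
f1, f2, f3 = logistic(0.0, 1.0), logistic(-2.0, 0.1), logistic(-1.0, 50.0)
reliability = product_model([f1, f2, f3], [1.5, 40.0, 0.05])
```

Because each factor lies between 0 and 1 in this sketch, the product is high only when every individual model function reports high reliability.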
- the major contributor corresponds to the mother
- the secondary contributor corresponds to the fetus in the womb of the mother
- the mixed nucleic acid sample corresponds to a circulating acellular nucleic acid sample collected from the mother.
- step B-1, step B-2 and step B-3-1 described above correspond to step B 1-1, step B 1-2 and step B 1-3-1 described below, respectively.
- Step B 1-1 is a step of preparing a data set obtained by measuring a circulating acellular nucleic acid sample containing a major nucleic acid containing genetic information about the mother and a secondary nucleic acid containing genetic information about the fetus.
- the dataset is a dataset containing signals indicating the presence of each allele in multiple polymorphic loci in the primary and secondary nucleic acids.
- the plurality of polymorphic loci referred to here are preferably polymorphic loci used in human personal identification (HID).
- Step B 1-2 is a step of linearly combining a numerical group containing at least the above (B1) and the above (B2) for those polymorphic loci, among the plurality of polymorphic loci in the data contained in the data set, at which a signal indicating the presence of an allele derived from the major nucleic acid, which is homozygous in the mother, and a signal indicating the presence of an allele derived from the secondary nucleic acid are detected separately, to generate one or more synthetic variables.
- the genotype of the polymorphic locus in the alleged father may be homozygous or heterozygous.
- Step B 1-3-1 is a step of inputting the synthetic variable generated in step B 1-2 into a model function using the synthetic variable as an explanatory variable and calculating a reliability value.
- the major contributor corresponds to the test subject
- the secondary contributor corresponds to the cancer cell
- the mixed nucleic acid sample corresponds to the circulating acellular nucleic acid sample collected from the test subject.
- steps B-1, B-2 and B-3-1 correspond to steps B 2-1, B 2-2 and B 2-3-1 described below, respectively.
- Step B 2-1 is a step of preparing a data set obtained by measuring a circulating acellular nucleic acid sample that comprises a major nucleic acid containing genetic information about the test subject and may contain a secondary nucleic acid containing genetic information about cancer cells. The data set contains signals indicating the presence of each allele at a plurality of cancer-related polymorphic loci in the primary nucleic acid and the secondary nucleic acid.
- “may contain secondary nucleic acid” means a situation in which the possibility that the secondary nucleic acid is contained in the circulating acellular nucleic acid sample cannot be completely ruled out.
- Step B 2-2 is a step of linearly combining a numerical group including at least the above (B1) and the above (B2) for those polymorphic loci, among the plurality of polymorphic loci in the data contained in the data set, at which a signal indicating the presence of a normal-type allele and a signal indicating the presence of a mutant-type allele are detected separately, to generate one or more synthetic variables.
- Normal-type alleles are alleles commonly found in healthy individuals who do not have cancer, and mutant-type alleles are alleles into which mutations that are considered to be related to cancer have been introduced.
- In step B 2-2, it is preferable to exclude, from the data contained in the data set, the data on those polymorphic loci, among the plurality of polymorphic loci, at which the test subject has the mutant allele as a homozygote or heterozygote. By excluding the data on polymorphic loci at which the test subject congenitally has the mutant allele in this way, data in which the secondary component signal is detected mixed with the main component signal derived from the test subject is excluded. This improves the accuracy of the calculated reliability value.
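The exclusion described for step B 2-2 can be sketched as a filter over the test subject's own (germline) genotypes. The function name, the "N"/"M" allele labels for normal and mutant types, and the locus names are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def loci_without_germline_mutant(
    germline: Dict[str, Tuple[str, str]],
    mutant_allele: Dict[str, str],
) -> List[str]:
    """Keep loci where the test subject does not carry the mutant allele.

    A locus is dropped when the subject is heterozygous or homozygous for the
    mutant allele, because a cancer-derived signal would then be mixed with
    the subject's own signal.
    """
    kept = []
    for locus, genotype in germline.items():
        if mutant_allele[locus] not in genotype:
            kept.append(locus)
    return kept


# Hypothetical genotypes: "N" = normal-type allele, "M" = mutant-type allele.
germline = {"locus1": ("N", "N"), "locus2": ("N", "M"), "locus3": ("M", "M")}
mutants = {"locus1": "M", "locus2": "M", "locus3": "M"}
analyzable = loci_without_germline_mutant(germline, mutants)
```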
- Step B 2-3-1 is a step of inputting the synthetic variable generated in step B 2-2 into a model function using the synthetic variable as an explanatory variable and calculating a reliability value.
- the major contributor corresponds to the recipient of the organ transplant
- the secondary contributor corresponds to the transplanted organ
- the mixed nucleic acid sample corresponds to a circulating acellular nucleic acid sample collected from the recipient.
- step B-1, step B-2 and step B-3-1 correspond to step B 3-1, step B 3-2 and step B 3-3-1 described below, respectively.
- Step B 3-1 is a step of preparing a dataset obtained by measuring a circulating acellular nucleic acid sample, which may contain a major nucleic acid containing genetic information about the recipient and a secondary nucleic acid containing genetic information about the transplanted organ.
- the dataset contains signals indicating the presence of each allele in multiple polymorphic loci in the primary and secondary nucleic acids.
- the plurality of polymorphic loci referred to here are preferably polymorphic loci used in human personal identification (HID).
- Step B 3-2 is a step of linearly combining a numerical group including at least the above-mentioned (B1) and the above-mentioned (B2) for those polymorphic loci, among the plurality of polymorphic loci in the data included in the data set, at which the signal indicating the presence of the allele derived from the main nucleic acid and the signal indicating the presence of the allele derived from the secondary nucleic acid are separately detected, to generate one or more synthetic variables.
- Step B 3-3-1 is a step of inputting the synthetic variable generated in step B 3-2 into a model function using the synthetic variable as an explanatory variable and calculating a reliability value.
- the major contributor corresponds to the mother
- the secondary contributor corresponds to the fetus in the womb of the mother
- the mixed nucleic acid sample corresponds to a circulating acellular nucleic acid sample collected from the mother.
- step B-1, step B-2 and step B-3-1 described above correspond to step B 4-1, step B 4-2 and step B 4-3-1 described below, respectively.
- Step B 4-1 is a step of preparing a data set obtained by measuring a circulating acellular nucleic acid sample taken from the mother, the sample including a major nucleic acid containing genetic information about the mother and a secondary nucleic acid containing genetic information about the fetus in the mother's womb. The dataset contains signals indicating the presence of each allele at multiple disease-related polymorphic loci in the primary and secondary nucleic acids.
- In step B 4-2, first, the data on those polymorphic loci, among the plurality of polymorphic loci, at which the mother has the mutant allele as a heterozygote are excluded from the data contained in the data set.
- Then, one or more synthetic variables are generated by linearly combining a numerical group containing at least the above (B1) and the above (B2) for those polymorphic loci, among the plurality of polymorphic loci, at which the signal indicating the presence of the allele derived from the main nucleic acid and the signal indicating the presence of the allele derived from the secondary nucleic acid are detected separately.
- Step B 4-3-1 is a step of inputting the synthetic variable generated in step B 4-2 into a model function using the synthetic variable as an explanatory variable and calculating a reliability value.
- the reliability of a signal indicating the presence of a specific allele at a specific polymorphic locus in a secondary nucleic acid contained in a data set can be evaluated.
- there are cases where the reliability value of the signal indicating the presence of the allele ends up being calculated as low.
- the method of setting the exclusion condition of the present invention relates to a method of setting an exclusion condition for determining what should be excluded from the data set in order to narrow down the data of the explanatory variables to be input to the model function.
- the method for setting exclusion conditions of the present invention particularly relates to prenatal genetic testing.
- it is preferable to set the exclusion condition so as to exclude data for which the reliability value of the secondary component signal intensity for the loci homozygous for each of the parents is preferably less than 0.8, more preferably less than 0.9, still more preferably less than 0.99, and still more preferably less than 0.999. Further, it is preferable to set the exclusion condition so as to exclude data for which the reliability value of the secondary component signal intensity for the loci at which the parents are homozygous for the same type is preferably 0.2 or more, more preferably 0.1 or more, still more preferably 0.01 or more, and still more preferably 0.001 or more.
- Exclusion condition setting method (Embodiment 1)
- One embodiment of the method for setting the exclusion condition of the present invention includes the following steps C-1-1, C-2-1, C-3-1 and C-4-1.
- the exclusion conditions set by the present embodiment can be applied to the method for calculating the reliability value for monitoring the colonization of the transplanted organ described above.
- Step C-1-1 is a step of preparing a data set obtained by measuring a mixed nucleic acid sample containing a major nucleic acid containing genetic information on a major contributor and a secondary nucleic acid containing genetic information on a secondary contributor.
- the data set contains signals indicating the presence of each allele at multiple polymorphic loci in the primary nucleic acid and the secondary nucleic acid. The authenticity of each signal is known.
- the single nucleotide polymorphic loci used in human personal identification (HID) can be preferably exemplified.
- the major contributor, sub-contributor, and mixed nucleic acid sample correspond to any of the following.
- the major contributor is the mother, the sub-contributor is the fetus in the womb of the mother, and the mixed nucleic acid sample is a circulating acellular nucleic acid sample taken from the mother.
- the major contributor is the recipient, the sub-contributor is the transplanted organ, and the mixed nucleic acid sample is a circulating acellular nucleic acid sample taken from the recipient.
- Step C-2-1 is a step of generating, among the synthetic variables obtained by linearly combining a numerical group including numerical values related to polymorphic loci that meet specific conditions in the data set prepared in step C-1-1, the synthetic variable with the highest contribution rate. The synthetic variable with the highest contribution rate corresponds to the first principal component when performing principal component analysis.
- in step C-2-1, the allele is homozygous in the mother, homozygous in the father, and of different types between the mother and the father, or homozygous in the recipient.
- (C1) is the secondary component signal strength.
- the secondary component signal intensity is the intensity of the signal indicating the presence of an allele of a specific polymorphic locus derived from the secondary nucleic acid.
- the above description (A1) is valid as it is.
- (C2) is the secondary component mixing rate.
- (C3) is noise. Noise is the numerical value obtained by subtracting the major component signal strength and the secondary component signal strength from the total signal strength attributable to the alleles of a specific polymorphic locus. As for the definition and specific embodiments, the description of (A5) above applies as is.
- The numerical group to be linearly combined in step C-2-1 may include numerical values other than (C1), (C2) and (C3) above. That is, the linear combination may be performed on a numerical group that includes, in addition to (C1), (C2) and (C3) for the specific polymorphic locus, various other measured or calculated values related to that locus.
- the numerical values (C4) to (C5) that may be included in the numerical value group to be linearly combined will be described below. In addition, only one kind selected from the following (C4) to (C5) may be included in the numerical value group, or two or more kinds of numerical values arbitrarily selected may be included in the numerical value group. Further, all of (C4) to (C5) may be included in the numerical group.
- (C4) is the major component signal strength. The major component signal strength is the strength of the signal indicating the presence of one allele of a specific polymorphic locus derived from the major nucleic acid.
- The description of (A3) above applies as is.
- (C5) is the major component mixing rate.
- Major component mixing rate = major component signal strength / total signal strength
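- The quantities (C1) to (C5) defined above can be illustrated with a minimal sketch. The names `LocusFeatures` and `locus_features` and the example numbers are hypothetical, and the secondary component mixing rate is assumed to be the secondary component signal strength divided by the total signal strength, by analogy with the major component mixing rate.

```python
from dataclasses import dataclass


@dataclass
class LocusFeatures:
    minor_signal: float  # (C1) secondary component signal strength
    minor_rate: float    # (C2) secondary component mixing rate
    noise: float         # (C3) noise
    major_signal: float  # (C4) major component signal strength
    major_rate: float    # (C5) major component mixing rate


def locus_features(total: float, major: float, minor: float) -> LocusFeatures:
    """Derive (C1) to (C5) for one polymorphic locus from three signal strengths."""
    if total <= 0:
        raise ValueError("total signal strength must be positive")
    return LocusFeatures(
        minor_signal=minor,
        minor_rate=minor / total,
        noise=total - major - minor,  # total minus major minus secondary
        major_signal=major,
        major_rate=major / total,
    )


# Hypothetical values for a single locus (not taken from the test examples).
example = locus_features(total=1000.0, major=940.0, minor=50.0)
```

- In this hypothetical example the noise is 10, the major component mixing rate is 0.94 and the secondary component mixing rate is 0.05.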
- The data set is a set of data relating to a plurality of polymorphic loci. Needless to say, the data set therefore includes a plurality of sets of data, each containing the numerical data of (C1-1) to (C5-1) above relating to a specific polymorphic locus. The numerical data included in the numerical group to be linearly combined are preferably standardized.
- The types and number of numerical values in the numerical group used for the linear combination when creating the model function are the same as those in the numerical group used for the linear combination that generates the synthetic variable in step C-2-1.
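- As a rough illustration of step C-2-1, the following sketch standardizes a numerical group and extracts the synthetic variable with the highest contribution rate (the first principal component). It is a simplified stand-in that uses power iteration in pure Python rather than any particular statistics package; the function names and the example rows are hypothetical, and every column is assumed to have non-zero variance.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit (population) variance."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def first_principal_component(rows, iterations=500):
    """Return (weights, scores, contribution rate) of the first principal component."""
    z = standardize(rows)
    n, p = len(z), len(z[0])
    # Correlation matrix of the standardized columns.
    corr = [[sum(r[a] * r[b] for r in z) / n for b in range(p)] for a in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):  # power iteration towards the dominant eigenvector
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(corr[a][b] * v[b] for b in range(p)) for a in range(p))
    contribution = eigenvalue / sum(corr[a][a] for a in range(p))
    scores = [sum(r[j] * v[j] for j in range(p)) for r in z]
    return v, scores, contribution


# Hypothetical rows: (C1) secondary signal, (C2) secondary mixing rate, (C3) noise.
rows = [
    [50, 0.050, 10],
    [40, 0.041, 12],
    [5, 0.006, 9],
    [60, 0.058, 11],
    [3, 0.004, 30],
    [45, 0.046, 8],
]
weights, scores, contribution = first_principal_component(rows)
```

- The sign of the weight vector is arbitrary, so any threshold placed on the scores should be set with the same orientation that was used when the model function was created.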
- Step C-3-1 is a step of setting a threshold value for the value of the synthetic variable so as to exclude a part or all of the outliers of the synthetic variable obtained by the linear combination in step C-2-1.
- the specific embodiment is not particularly limited.
- the outlier is a numerical value indicating an abnormal value when the reliability value is calculated by inputting to the model function created by the method of the present invention.
- A numerical value relating to an allele can be treated as an outlier when the reliability value of the signal indicating the presence of that allele is calculated as preferably less than 0.6, more preferably less than 0.7, and still more preferably less than 0.8.
- A numerical value relating to an allele can also be treated as an outlier when the reliability value of the signal indicating the presence of that allele is calculated as preferably 0.4 or more, more preferably 0.3 or more, and still more preferably 0.2 or more.
- A numerical value separated from the average value of the synthetic variable by preferably 2 times or more, more preferably 3 times or more, still more preferably 4 times or more, and still more preferably 5 times or more the standard deviation can also be treated as an outlier.
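- The standard-deviation criterion can be sketched as follows. The function name `sd_outliers` and the example values are hypothetical, and the population standard deviation is assumed because the text does not say which form is intended.

```python
import statistics


def sd_outliers(values, k=2.0):
    """Flag values lying k or more standard deviations away from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    if sd == 0:
        return [False] * len(values)
    return [abs(v - mean) >= k * sd for v in values]


# Hypothetical synthetic-variable values; only the last one is far from the rest.
values = [0.1, 0.2, 0.0, -0.1, 0.15, -0.2, 0.05, 6.0]
flags = sd_outliers(values, k=2.0)
```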
- Specific embodiments of step C-3-1 include the following method. First, a tentative threshold value is set for the synthetic variable, and the following tentative exclusion condition C1 is set.
- (Tentative exclusion condition C1) From a data set obtained by analysis of a mixed nucleic acid sample containing a major nucleic acid containing genetic information about the mother or recipient and a secondary nucleic acid containing genetic information about the fetus or transplanted organ, remove the data for which the synthetic variable with the highest contribution rate, obtained by linearly combining a numerical group containing at least (C1), (C2) and (C3) above for polymorphic loci at which an allele is present that is homozygous in the mother, homozygous in the pseudo-father, and of a different type in the mother and the pseudo-father, or that is homozygous in the recipient, homozygous in the donor of the transplanted organ, and of a different type in the recipient and the donor, is less than the tentative threshold value.
- This tentative exclusion condition C1 is applied to the data set to be analyzed, the above-described reliability value calculation method is applied to the data remaining after exclusion, and the reliability values are calculated. The calculated reliability values are then checked to see whether exceptional results have been excluded. If exceptional results have not been excluded, or if reliability values that accurately reflect the facts have been excluded excessively, the tentative exclusion condition is reset and the test is repeated in the same manner to identify the optimal condition.
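- The trial-and-error procedure above can be sketched as a simple search over candidate thresholds. This is only one possible reading: the function names, the cut-off `low_cut=0.6` used to call a result exceptional, the tolerated loss `max_good_loss`, and the example points are all assumptions made for illustration.

```python
def evaluate_threshold(points, threshold, low_cut=0.6):
    """points: (synthetic variable, reliability value) pairs for loci whose signal is true.

    Returns the number of exceptional results that survive the exclusion and the
    number of valid results that the exclusion removes.
    """
    kept = [(s, r) for s, r in points if s >= threshold]
    exceptions_left = sum(1 for _, r in kept if r < low_cut)
    good_total = sum(1 for _, r in points if r >= low_cut)
    good_removed = good_total - sum(1 for _, r in kept if r >= low_cut)
    return exceptions_left, good_removed


def search_threshold(points, candidates, max_good_loss=0.05, low_cut=0.6):
    """Try tentative thresholds in ascending order and keep the best acceptable one."""
    good_total = sum(1 for _, r in points if r >= low_cut)
    best = None
    for t in sorted(candidates):
        exceptions_left, good_removed = evaluate_threshold(points, t, low_cut)
        if good_total and good_removed / good_total > max_good_loss:
            break  # a higher threshold would remove even more valid results
        if best is None or exceptions_left < best[1]:
            best = (t, exceptions_left)
    return best


# Hypothetical points: two exceptional results with low synthetic-variable values.
points = [(-3.0, 0.10), (-2.5, 0.30), (-0.5, 0.95), (0.0, 0.97),
          (0.4, 0.99), (1.2, 0.98), (2.0, 0.99)]
result = search_threshold(points, candidates=[-4.0, -2.0, -1.0, 0.2])
```

- With these hypothetical points, a tentative threshold of -2.0 removes both exceptional results without discarding any valid one, so the search stops there.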
- Step C-3-1 may include steps C-3-1-1 and C-3-1-2 described below.
- Step C-3-1-1 is a step of calculating a reliability value by inputting, into the model function created by the method of the present invention described above, whichever of the following are required as explanatory variables: the synthetic variable generated by the linear combination in step C-2-1, (C1) the secondary component signal strength, (C2) the secondary component mixing rate, and (C3) the noise.
- The model function used for calculating the reliability value is not particularly limited as long as it is a model function described in the section "<1> Method of creating a model function".
- The explanatory variables are input to the model function represented by any of Equations 1 to 4 above, and the reliability value is calculated.
- Step C-3-1-2 is a step of creating a scatter diagram in which the synthetic variables generated by the linear combination in step C-2-1 are plotted against the reliability values calculated in step C-3-1-1.
- In a scatter diagram in which the synthetic variable is plotted on the vertical axis and the reliability value on the horizontal axis, a set of data points distributed in the horizontal direction (the direction in which the reliability values spread; in other words, a set in which the dispersion of the synthetic variable is small) is observed.
- The set of data points distributed in the direction in which the reliability value spreads (the set extending in the horizontal direction) is specified as the exclusion candidates.
- The set of data points distributed in the direction in which the synthetic variable spreads (the set extending in the vertical direction) is specified as the non-exclusion candidates. A threshold value is then set for the value of the synthetic variable so as to exclude a part or all of the exclusion candidates.
- The threshold value for the synthetic variable is set so that the proportion of excluded data points among all the data points of the exclusion candidates (including the portion overlapping the non-exclusion candidates) is preferably 50% or more, more preferably 60% or more, still more preferably 70% or more, still more preferably 80% or more, still more preferably 90% or more, and still more preferably 95% or more.
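- One way to turn a target exclusion percentage into a concrete threshold is sketched below, assuming that data are excluded when the synthetic variable is less than the threshold. The function name `threshold_for_percent` and the candidate values are hypothetical; `math.nextafter` requires Python 3.9 or later.

```python
import math


def threshold_for_percent(candidate_values, min_percent=90):
    """Lowest threshold t such that at least min_percent % of the values are < t."""
    if not candidate_values:
        raise ValueError("no exclusion candidates given")
    if not 0 < min_percent <= 100:
        raise ValueError("min_percent must be in (0, 100]")
    ordered = sorted(candidate_values)
    # Integer ceiling of min_percent / 100 * n, avoiding floating-point rounding.
    k = (min_percent * len(ordered) + 99) // 100
    # Smallest float strictly greater than the k-th smallest value.
    return math.nextafter(ordered[k - 1], math.inf)


# Hypothetical synthetic-variable values of the exclusion candidates.
candidates = [-3.1, -2.8, -2.7, -2.5, -2.2, -2.0, -1.9, -1.5, -1.1, 0.4]
threshold = threshold_for_percent(candidates, min_percent=90)
excluded = sum(1 for v in candidates if v < threshold)
```

- Here nine of the ten hypothetical candidates fall below the threshold. In practice the threshold would also be checked against the non-exclusion candidates so that valid data are not removed excessively.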
- Step C-4-1 is a step of setting a condition to be excluded from the data set input to the model function for calculating reliability as the following exclusion condition C1.
- (Exclusion condition C1) From a data set obtained by analysis of a mixed nucleic acid sample containing a major nucleic acid containing genetic information about the mother or recipient and a secondary nucleic acid containing genetic information about the fetus or transplanted organ, remove the data for which the synthetic variable with the highest contribution rate, obtained by linearly combining a numerical group containing at least (C1), (C2) and (C3) above for polymorphic loci at which an allele is present that is homozygous in the mother, homozygous in the pseudo-father, and of a different type in the mother and the pseudo-father, or that is homozygous in the recipient, homozygous in the donor of the transplanted organ, and of a different type in the recipient and the donor, is less than the threshold value set in step C-3-1.
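- Applying exclusion condition C1 amounts to a filter over the loci, as in the following sketch for the mother and father case. The `Locus` record, the two-letter genotype strings, the locus identifiers and the threshold are hypothetical; the recipient and donor case would be handled in the same way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    locus_id: str
    mother: str  # genotype written as two allele letters, e.g. "AA" or "AG"
    father: str
    pc1: float   # synthetic variable with the highest contribution rate


def is_homozygous(genotype: str) -> bool:
    return len(genotype) == 2 and genotype[0] == genotype[1]


def meets_exclusion_c1(locus: Locus, threshold: float) -> bool:
    """True when both parents are homozygous for different alleles and pc1 < threshold."""
    return (
        is_homozygous(locus.mother)
        and is_homozygous(locus.father)
        and locus.mother != locus.father
        and locus.pc1 < threshold
    )


def apply_exclusion_c1(loci, threshold):
    """Return the loci that remain after exclusion condition C1 is applied."""
    return [locus for locus in loci if not meets_exclusion_c1(locus, threshold)]


loci = [
    Locus("snp_a", "AA", "GG", -2.4),  # different homozygotes, below threshold
    Locus("snp_b", "AA", "GG", 0.8),   # different homozygotes, above threshold
    Locus("snp_c", "AA", "AA", -3.0),  # same homozygotes: C1 does not apply
    Locus("snp_d", "AG", "GG", -2.9),  # mother heterozygous: C1 does not apply
]
kept_ids = [locus.locus_id for locus in apply_exclusion_c1(loci, threshold=-2.0)]
```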
- Exclusion condition setting method (Embodiment 2)
- One embodiment of the method for setting the exclusion condition of the present invention includes the following steps C-1-2 and C-2-2, and steps C-3-2 and C-4-2.
- Step C-1-2 is a step of preparing a data set obtained by measuring a mixed nucleic acid sample containing a major nucleic acid containing genetic information about the major contributor and a secondary nucleic acid containing genetic information about the secondary contributor.
- The data set contains signals indicating the presence of each allele at a plurality of polymorphic loci in the major nucleic acid and the secondary nucleic acid. The authenticity of each signal is known.
- Preferred examples of the polymorphic loci include single nucleotide polymorphism loci used in human identification (HID).
- The major contributor, the secondary contributor, and the mixed nucleic acid sample correspond to either of the following.
- The major contributor is the mother, the secondary contributor is the fetus in the mother's womb, and the mixed nucleic acid sample is a circulating cell-free nucleic acid sample taken from the mother.
- The major contributor is the recipient, the secondary contributor is the transplanted organ, and the mixed nucleic acid sample is a circulating cell-free nucleic acid sample taken from the recipient.
- Step C-2-2 is a step of generating, from among the synthetic variables obtained by linearly combining a numerical group containing numerical values related to polymorphic loci that meet specific conditions in the data set prepared in step C-1-2, the synthetic variable with the highest or second-highest contribution rate.
- the synthetic variable with the highest contribution rate corresponds to the first principal component when performing principal component analysis.
- the synthetic variable with the second highest contribution rate corresponds to the second principal component when performing principal component analysis.
- In step C-2-2, the linear combination is performed on a numerical group containing at least (C1), (C2) and (C3) above relating to polymorphic loci at which an allele is present that is homozygous in the mother, homozygous in the father, and of the same type in the mother and the father, or that is homozygous in the recipient, homozygous in the donor of the transplanted organ, and of the same type in the recipient and the donor.
- the numerical group to be the target of the linear combination may include numerical values other than (C1), (C2) and (C3), and examples thereof include (C4) to (C5) described above.
- The description of step C-2-1 above applies to the specific embodiments of step C-2-2.
- The types and number of numerical values in the numerical group used for the linear combination when creating the model function are the same as those in the numerical group used for the linear combination that generates the synthetic variable in step C-2-2.
- Step C-3-2 is a step of setting a threshold value for the value of the synthetic variable so as to exclude a part or all of the outliers of the synthetic variable generated by the linear combination in step C-2-2.
- the specific embodiment is not particularly limited. Regarding the definition of outliers, the above-mentioned explanation in step C-3-1 is valid.
- Specific embodiments of step C-3-2 include the following method. First, a tentative threshold value is set for the synthetic variable, and the following tentative exclusion condition C2 is set.
- (Tentative exclusion condition C2) From a data set obtained by analysis of a mixed nucleic acid sample containing a major nucleic acid containing genetic information about the mother or recipient and a secondary nucleic acid containing genetic information about the fetus or transplanted organ, remove the data for which the synthetic variable with the highest or second-highest contribution rate, obtained by linearly combining a numerical group containing at least (C1), (C2) and (C3) above for polymorphic loci at which an allele is present that is homozygous in the mother, homozygous in the pseudo-father, and of the same type in the mother and the pseudo-father, or that is homozygous in the recipient, homozygous in the donor of the transplanted organ, and of the same type in the recipient and the donor, is less than the tentative threshold value.
- This tentative exclusion condition C2 is applied to the data set to be analyzed, the above-described reliability value calculation method is applied to the data remaining after exclusion, and the reliability values are calculated. The calculated reliability values are then checked to see whether exceptional results have been excluded. If exceptional results have not been excluded, or if reliability values that accurately reflect the facts have been excluded excessively, the tentative exclusion condition is reset and the test is repeated in the same manner to identify the optimal condition.
- Step C-3-2 may include steps C-3-2-1 and C-3-2-2 described below.
- Step C-3-2-1 is a step of calculating a reliability value by inputting, into the model function created by the method of the present invention described above, whichever of the following are required as explanatory variables: the synthetic variable generated by the linear combination in step C-2-2, (C1) the secondary component signal strength, (C2) the secondary component mixing rate, and (C3) the noise.
- The model function used for calculating the reliability value is not particularly limited as long as it is a model function described in the section "<1> Method of creating a model function".
- The explanatory variables are input to the model function represented by any of Equations 1 to 4 above, and the reliability value is calculated.
- Step C-3-2-2 is a step of creating a scatter diagram in which the synthetic variables generated by the linear combination in step C-2-2 are plotted against the reliability values calculated in step C-3-2-1.
- In a scatter diagram in which the synthetic variable is plotted on the vertical axis and the reliability value on the horizontal axis, two sets are observed: a set of data points distributed in the horizontal direction (the direction in which the reliability values spread; in other words, a set in which the dispersion of the synthetic variable is small) and a set of data points distributed in the vertical direction (the direction in which the synthetic variable spreads; in other words, a set in which the dispersion of the synthetic variable is large and the dispersion of the reliability values is small).
- The set of data points distributed in the direction in which the synthetic variable spreads (the set extending in the vertical direction) is specified as the exclusion candidates.
- The set of data points distributed in the direction in which the reliability value spreads (the set extending in the horizontal direction) is specified as the non-exclusion candidates.
- A threshold value is set for the value of the synthetic variable so as to exclude a part or all of the exclusion candidates.
- The threshold value for the synthetic variable is set so that the proportion of excluded data points among all the data points of the exclusion candidates (including the portion overlapping the non-exclusion candidates) is preferably 50% or more, more preferably 60% or more, still more preferably 70% or more, still more preferably 80% or more, still more preferably 90% or more, and still more preferably 95% or more.
- Step C-4-2 is a step of setting the condition to be excluded from the data set to be input to the model function for calculating the reliability as the following exclusion condition C2.
- (Exclusion condition C2) From a data set obtained by analysis of a mixed nucleic acid sample containing a major nucleic acid containing genetic information about the mother or recipient and a secondary nucleic acid containing genetic information about the fetus or transplanted organ, remove the data for which the synthetic variable with the highest or second-highest contribution rate, obtained by linearly combining a numerical group containing at least (C1), (C2) and (C3) above for polymorphic loci at which an allele is present that is homozygous in the mother, homozygous in the pseudo-father, and of the same type in the mother and the pseudo-father, or that is homozygous in the recipient, homozygous in the donor of the transplanted organ, and of the same type in the recipient and the donor, is less than the threshold value set in step C-3-2.
- Exclusion condition C1 and/or exclusion condition C2 set by the above-mentioned exclusion condition setting method can be applied to the above "<2-3> transplanted organ.
- the exclusion condition to be applied may be either one or both of the exclusion condition C1 and the exclusion condition C2.
- The number of types of numerical values included in the numerical group to be linearly combined in step B 1-2 or step B 3-2 is preferably 10 or more, more preferably 20 or more, and still more preferably 30 or more. In some cases, the reliability value can be calculated with very high accuracy simply by applying exclusion condition C1.
- the present invention also relates to a program for causing a computer to execute one or more methods selected from the above-mentioned method for creating a model function, a method for calculating a reliability value, and a method for setting an exclusion condition.
- When the processor in the computer operates according to the program of the present invention stored in a built-in storage device such as a hard disk device, the computer can be configured to execute one or more methods selected from the above-mentioned model function creation method, reliability value calculation method, and exclusion condition setting method.
- The present invention also relates to a storage medium in which the above-mentioned program is recorded.
- the present invention also relates to a storage medium in which a model function created by the above method is recorded.
- Examples of the storage medium include a storage medium that can be read by a computer, such as a semiconductor memory, a hard disk, a magnetic storage medium, and an optical storage medium, without limitation.
- The present invention also relates to a reliability value calculation system including a storage unit in which the above-mentioned model function is recorded and a processing unit for executing the above-mentioned reliability value calculation method.
- preferred embodiments of the reliability value calculation system of the present invention will be described.
- the processing unit is configured to process the data set to be appraised acquired by the analyzer.
- The processing unit may be a device (which may be referred to as a calculator) that realizes the data processing necessary for calculating the reliability value by reading and executing a program stored in the storage unit (a program that executes the above-mentioned reliability value calculation method).
- the processing unit has an aspect as an execution subject of data processing. Examples of the processing unit include a CPU (Central Processing Unit), an MPU (Micro Processing Unit), a DSP (Digital Signal Processor), and an FPGA (Field Programmable Gate Array).
- the processing unit may be a multi-core processor including two or more cores.
- the storage unit is a circuit configured to store and retain data and programs related to various data processing executed by the processing unit.
- the storage unit includes at least a non-volatile storage device and / or a volatile storage device.
- Examples of the storage device include RAM (Random Access Memory), ROM (Read Only Memory), SSD (Solid State Drive), and HDD (Hard Disk Drive).
- the storage unit is a general term for various storage devices such as a main storage device and an auxiliary storage device.
- the program may be stored in the storage unit in advance, or may be downloaded from a device (server or the like) connected via a communication circuit and stored in the storage unit.
- the reliability value calculation system of this embodiment includes an input unit for inputting the data set prepared in the above step B-1.
- the data set input to the input unit is provided to the processing unit.
- The processing unit reads out the program for executing the above-mentioned reliability value calculation method from the storage unit and, according to the program, inputs the explanatory variables contained in the data set, or generated from the data set, into the model function also stored in the storage unit to calculate the reliability value.
- Exclusion condition C1 and/or exclusion condition C2 created by the above-mentioned exclusion condition setting method is recorded in the storage unit.
- the reliability value calculation system of the present embodiment includes an input unit for inputting the data set prepared in the above step B-1.
- the data set input to the input unit is provided to the processing unit.
- the processing unit reads the above-mentioned exclusion condition C1 and / or exclusion condition C2 stored in the storage unit, applies the condition to the data set, and excludes data that is not suitable for calculating the reliability value.
- The processing unit reads out the program for executing the above-mentioned reliability value calculation method and, according to the program, inputs the explanatory variables contained in the data set remaining after the exclusion condition has been applied, or generated from that data set, into the model function also stored in the storage unit to calculate the reliability value.
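- The data flow of this embodiment (stored exclusion conditions applied first, then the stored model function) can be sketched as follows. The class name `ReliabilitySystem`, the record layout, and the toy model function are hypothetical and stand in for whatever the storage unit actually holds.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Record = Dict[str, float]


@dataclass
class ReliabilitySystem:
    # Contents of the storage unit: the model function and the exclusion conditions.
    model: Callable[[Record], float]
    exclusions: Sequence[Callable[[Record], bool]] = ()

    def run(self, dataset: List[Record]) -> List[float]:
        """Processing unit: drop excluded records, then apply the model function."""
        remaining = [r for r in dataset if not any(rule(r) for rule in self.exclusions)]
        return [self.model(r) for r in remaining]


def toy_model(record: Record) -> float:
    """Hypothetical sigmoid model function of the first principal component only."""
    return 1.0 / (1.0 + math.exp(-record["pc1"]))


system = ReliabilitySystem(
    model=toy_model,
    exclusions=[lambda r: r["pc1"] < -2.0],  # stands in for exclusion condition C1
)
reliabilities = system.run([{"pc1": -3.0}, {"pc1": 0.0}, {"pc1": 2.0}])
```

- Keeping the exclusion conditions as data separate from the model function mirrors the text, where either C1, C2, or both may be stored and applied.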
- NGS next-generation sequencer
- the first principal component is an index showing a high correlation with the reliability value.
- Each model function was created by the method described below. Creating a model function requires determining the authenticity of the secondary component signal strength; authenticity was determined from the correct answers set according to the following rules.
- If the parents' genotypes are homozygous and of the same type, the fetal genotype is homozygous (the secondary component signal strength is false).
- If the parents' genotypes are homozygous and of different types, the fetal genotype is heterozygous (the secondary component signal strength is true).
- Model function f1 (x1): The first principal component obtained by principal component analysis was divided into 20 categories according to its magnitude. Next, for each category, the proportion (probability) of true secondary component signal strengths among those corresponding to the first principal component values in that category was determined. That probability was then assigned to the representative value of the first principal component in the category. Regression analysis by the least squares method was performed on the first principal component values and the probabilities (reliability values) obtained in this way, yielding the model function f1 (x1) with the first principal component as the explanatory variable and the reliability value as the objective variable. The coefficient of determination (R²) of the regression analysis was 0.99 or more, which was extremely good.
- FIG. 1 shows a sigmoid curve showing the model function f1 (x1). Further, the equation of the model function f1 (x1) is shown in the equation 5 below.
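- The binning and regression procedure can be sketched as follows. Because Equation 5 is not reproduced in this text, the two-parameter logistic form, the equal-count binning, the use of the bin mean as the representative value, and the fit by ordinary least squares on logit-transformed probabilities are all assumptions made for illustration; the data are synthetic.

```python
import math


def sigmoid(x, slope, intercept):
    return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))


def bin_probabilities(xs, truths, n_bins=20):
    """Split the sorted values into n_bins groups; return bin means and true-rates."""
    pairs = sorted(zip(xs, truths))
    size = len(pairs) / n_bins
    reps, probs = [], []
    for i in range(n_bins):
        chunk = pairs[round(i * size):round((i + 1) * size)]
        if not chunk:
            continue
        reps.append(sum(x for x, _ in chunk) / len(chunk))          # representative value
        probs.append(sum(1 for _, t in chunk if t) / len(chunk))    # proportion true
    return reps, probs


def fit_sigmoid(reps, probs, eps=0.01):
    """Least-squares line through the logits of the (clipped) bin probabilities."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    logits = [math.log(p / (1.0 - p)) for p in clipped]
    n = len(reps)
    mean_x = sum(reps) / n
    mean_y = sum(logits) / n
    sxx = sum((x - mean_x) ** 2 for x in reps)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reps, logits))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic, deterministic data: signals are more often true at high x.
xs = [i / 50 - 4.0 for i in range(400)]
truths = [sigmoid(x, 2.0, 0.0) > ((37 * i) % 100 + 0.5) / 100 for i, x in enumerate(xs)]
reps, probs = bin_probabilities(xs, truths)
slope, intercept = fit_sigmoid(reps, probs)
```

- Calling `sigmoid(x, slope, intercept)` with the fitted parameters then plays the role of f1 (x1). A direct non-linear least-squares fit on the probabilities would be an equally plausible reading of the text.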
- Model function f2 (x2): The absolute value of the secondary component signal strength was divided into 20 categories according to its magnitude. Next, for each category, the proportion (probability) of true secondary component signal strengths among those whose absolute value fell in that category was determined. That probability was then assigned to the representative value of the absolute value of the secondary component signal strength in the category. Regression analysis by the least squares method was performed on the absolute values of the secondary component signal strength and the probabilities obtained in this way, yielding the model function f2 (x2) with the absolute value of the secondary component signal strength as the explanatory variable and the reliability value as the objective variable. The coefficient of determination (R²) of the regression analysis was 0.99 or more, which was extremely good.
- FIG. 2 shows a sigmoid curve showing the model function f2 (x2). Further, the equation of the model function f2 (x2) is shown in the equation 6 below.
- Model function f3 (x3): The secondary component mixing rate was divided into 20 categories according to its magnitude. Next, for each category, the proportion (probability) of true secondary component signal strengths among those corresponding to the secondary component mixing rates in that category was determined. That probability was then assigned to the representative value of the secondary component mixing rate in the category. Regression analysis by the least squares method was performed on the secondary component mixing rates and the probabilities obtained in this way, yielding the model function f3 (x3) with the secondary component mixing rate as the explanatory variable and the reliability value as the objective variable. The coefficient of determination (R²) of the regression analysis was 0.99 or more, which was extremely good.
- FIG. 3 shows a sigmoid curve showing the model function f3 (x3). Further, the equation of the model function f3 (x3) is shown in the equation 7 below.
- Model function f (x1, x2, x3): f1 (x1), f2 (x2) and f3 (x3) were multiplied together to create the model function f (x1, x2, x3) represented by Equation 4 below.
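- The multiplication of the three single-variable functions can be sketched as follows. The logistic form of each factor and the parameter values are assumptions, since the fitted equations are not reproduced in this text; `make_combined_model` is a hypothetical helper name.

```python
import math


def sigmoid(x, slope, intercept):
    return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))


def make_combined_model(params):
    """params: three (slope, intercept) pairs standing in for f1, f2 and f3."""
    def f(x1, x2, x3):
        # f(x1, x2, x3) = f1(x1) * f2(x2) * f3(x3)
        return math.prod(sigmoid(x, a, b) for x, (a, b) in zip((x1, x2, x3), params))
    return f


# Hypothetical parameters; real values would come from the regression analyses.
f = make_combined_model([(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)])
all_neutral = f(0.0, 0.0, 0.0)  # 0.5 * 0.5 * 0.5
one_weak = f(5.0, 5.0, -5.0)    # one weak factor pulls the product down
```

- Because the factors are multiplied, the combined reliability value is high only when all three explanatory variables support the signal.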
- <Test Example 2> Calculation of reliability values: Using the model function f (x1, x2, x3) of Equation 4, the reliability values of the 200 sets of data used to create the model function were calculated and the results were verified. That is, the first principal component, the absolute value of the secondary component signal strength, and the secondary component mixing rate for each polymorphic locus related to SNPs in the mixed nucleic acid sample were input to the model function f (x1, x2, x3), and the reliability value was calculated. Reliability values (Fidelity) were calculated for 8,148 SNPs, excluding those for which the total of (1) and (2) was less than 300.
- FIG. 4 shows a distribution map of the calculated reliability value.
- The left panel compiles the reliability values for SNPs for which the parents are homozygous for different alleles (the correct fetal genotype is heterozygous).
- The right panel compiles the reliability values for SNPs for which the parents are homozygous for the same allele (the correct fetal genotype is homozygous).
- the reliability of signals related to SNPs can be evaluated accurately.
- Exclusion condition 1: Principal component analysis was performed on the parameters (1) to (5) above. Separately, reliability values were calculated using the model function f (x1, x2, x3) above on the basis of the parameters (1) to (5) subjected to the principal component analysis. Next, a scatter plot was created with each principal component obtained by the principal component analysis on the y-axis and the reliability value on the x-axis (FIG. 5).
- Exclusion condition 2: We investigated whether exclusion conditions can be set appropriately for SNPs for which the parents are homozygous for the same allele.
- the reliability value was calculated by using the above model functions f (x1, x2, x3) based on the parameters (1) to (5) in which the principal component analysis was performed.
- a scatter plot was created in which each principal component obtained by principal component analysis was plotted on the y-axis and the reliability value was plotted on the x-axis (FIG. 6).
- <Test Example 4> Re-tabulation of reliability values: From the 200 test data sets, the data relating to SNPs that met exclusion conditions 1 and 2 set in Test Example 3 were excluded, and the reliability values were then calculated by the same procedure as in Test Example 1 (number of remaining SNPs: 8,081).
- The distribution of the calculated reliability values is shown in FIG. 7. The left panel compiles the reliability values for SNPs for which the parents are homozygous for different alleles (the correct fetal genotype is heterozygous).
- The right panel compiles the reliability values for SNPs for which the parents are homozygous for the same allele (the correct fetal genotype is homozygous).
- the left side of FIG. 7 is a distribution diagram of reliability values for data after applying exclusion condition 1.
- the right side of FIG. 7 is a distribution diagram of reliability values for data after applying the exclusion condition 2.
- The exceptional cases were markedly reduced by the exclusion, and validity was improved.
- <Test Example 5> Verification of validity for a different NGS target panel: To verify the validity of the present invention, the following study was conducted using 16 separately prepared data sets. These are analysis results for a target panel of 132 SNPs, different from the 184-SNP target panel shown in Test Example 1.
- Each data set consists of NGS gene sequence test data obtained by analyzing an oral mucosa sample from the mother, an oral mucosa sample from the father, a plasma sample from the mother, and an oral mucosa sample from the newborn.
- The NGS analysis is targeted sequencing of polymorphic loci with 132 known SNPs. That is, the prepared data sets contain data on 2,112 (16 sets × 132) SNPs.
- The 132 SNPs analyzed in this test example do not completely overlap with the 184 SNPs analyzed in Test Examples 1 to 3; 71 of them are SNPs different from those analyzed in Test Examples 1 to 3. From this data set, SNPs for which both parents are homozygous were extracted, and the reliability values of 531 SNPs were calculated.
- FIG. 8 shows a distribution map of reliability values calculated from the 16 test data sets.
- SNPs for which the parents are homozygous for different alleles (the correct fetal genotype is heterozygous)
- SNPs for which the parents are homozygous for the same allele (the correct fetal genotype is homozygous)
- 175 of the 176 SNPs showed a reliability value of 0.9 or more.
- <Test Example 6> Verification of validity for SNPs whose secondary component signal authenticity is unknown: Among the 16 sets of data used in Test Example 5, the fidelity distributions of the 951 SNPs for which the mother is homozygous were tabulated separately according to whether the newborn's genotype was heterozygous or homozygous, and are summarized in FIG. 9. All SNPs shown in FIG. 9 have a total of fetal Count Major and fetal Count minor of 300 or more.
- the estimated fetal genotype using the parental genotype was consistent with the genotype of the offspring confirmed after birth.
- 99.6% of the SNPs homozygous in the newborn (573 of 575 SNPs) showed a low fidelity of 0.2 or less, and 99.4% of the SNPs heterozygous in the newborn (374 of 376 SNPs) showed a high fidelity of 0.8 or more.
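- The stated percentages can be checked with a few lines of arithmetic. They appear to be truncated rather than rounded to one decimal place (573/575 is about 99.65%, and 374/376 is about 99.47%); the helper name `percent_truncated` is hypothetical.

```python
def percent_truncated(hits: int, total: int) -> float:
    """Percentage truncated (not rounded) to one decimal place, via integer arithmetic."""
    return (1000 * hits // total) / 10


homo_low = percent_truncated(573, 575)     # newborn homozygous, fidelity <= 0.2
hetero_high = percent_truncated(374, 376)  # newborn heterozygous, fidelity >= 0.8
total_snps = 575 + 376                     # matches the 951 SNPs tabulated in FIG. 9
```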
- Creation of model function (Part 2): From the same data set as that used in Test Example 1, only the data relating to polymorphic loci at which both the mother and the father are homozygous were extracted. Principal component analysis was performed on the 13 factors shown in Table 1 below that are included in this extracted data set. Table 1 shows the eigenvector of the first principal component obtained as a result of the principal component analysis.
- the contents of (1) to (5) are as described in Test Example 1.
- Data whose name includes "major" relate to the main component signal.
- Data whose name includes "minor" relate to the secondary component signal.
- Data whose name includes "count" relate to the signal intensity.
- Data whose name includes "freq" or "frequency" relate to the ratio of the signal intensity. That is, a numerical value whose variable name in Table 1 includes both "minor" and "count" corresponds to the "secondary component signal intensity" in the present invention. Likewise, a numerical value whose variable name in Table 1 includes both "minor" and "freq" or "frequency" corresponds to the "secondary component mixing ratio" in the present invention.
- (7) in Table 1 is a value obtained by dividing the secondary component signal intensity, which indicates the presence of the allele at the specific polymorphic locus, by the average value of noise at the plurality of polymorphic loci.
- (9) in Table 1 is a value obtained by dividing the secondary component mixing ratio, which is the ratio of the secondary component signal intensity to the total signal intensity arising from the alleles of the specific polymorphic locus, by the average value of noise at the plurality of polymorphic loci.
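The two noise-normalized factors (7) and (9) can be sketched as below. This is a minimal illustration under stated assumptions: the total signal intensity is taken to be the sum of the major and minor counts, and the noise values for the plurality of loci are assumed to be supplied by the caller, since how noise is measured is not described in this passage. All function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def secondary_mixing_ratio(major_count: float, minor_count: float) -> float:
    """Secondary component signal intensity as a share of the total intensity.

    Assumes the total is major + minor (an assumption of this sketch).
    """
    total = major_count + minor_count
    if total <= 0:
        raise ValueError("total signal intensity must be positive")
    return minor_count / total


def noise_normalized(value: float, noise_values: Sequence[float]) -> float:
    """Divide a value by the average noise observed across several loci."""
    average_noise = mean(noise_values)
    if average_noise <= 0:
        raise ValueError("average noise must be positive")
    return value / average_noise
```

With these helpers, factor (7) would be `noise_normalized(minor_count, noise_counts)` and factor (9) would be `noise_normalized(secondary_mixing_ratio(major_count, minor_count), noise_ratios)`, where `noise_counts` and `noise_ratios` are placeholder names for the per-locus noise measurements.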
- A model function f1(x1), having the first principal component x1 as the explanatory variable and the reliability value as the objective variable, was created by the same procedure as in Test Example 1.
- The coefficient of determination (R²) of the regression analysis was 0.99 or more, which was extremely good.
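The first principal component score used as the explanatory variable x1, and the coefficient of determination used to judge the regression, can be sketched with NumPy as follows. This is an illustration only: standardizing each factor before the decomposition is an assumption of the sketch (the passage does not state whether the factors were standardized), the functional form of the model function is not reproduced here, and the function names are hypothetical.

```python
import numpy as np


def first_principal_component(factors):
    """Return (eigenvector, scores) for the first principal component.

    `factors` is an (n_snps, n_factors) array. Columns are standardized
    first, so every column must have non-zero variance.
    """
    x = np.asarray(factors, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    eigenvector = vt[0]
    # The sign of an eigenvector is arbitrary; fix it so that the
    # largest-magnitude loading is positive, for reproducibility.
    if eigenvector[np.argmax(np.abs(eigenvector))] < 0:
        eigenvector = -eigenvector
    return eigenvector, z @ eigenvector


def r_squared(observed, predicted) -> float:
    """Coefficient of determination of predictions against observations."""
    y = np.asarray(observed, dtype=float)
    y_hat = np.asarray(predicted, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

The scores returned by `first_principal_component` would play the role of x1, and `r_squared` would be evaluated on the reliability values predicted by whichever model function is fitted to them.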
- Part 2: Principal component analysis was performed on the 13 factors shown in Table 1 that are contained in the same data set as that used in Test Example 1.
- The first principal component obtained by the principal component analysis, the absolute value of the secondary component signal intensity, and the secondary component mixing ratio were input to the model function f(x1, x2, x3) created in Test Example 7, and the reliability value was calculated.
- FIG. 10 shows a distribution map of reliability values calculated by performing principal component analysis on 5 factors or 13 factors. As shown in FIG. 10, even in this test example, extremely accurate results were obtained with almost no exceptional results. From this result, the validity and high accuracy of the model function created in Test Example 7 were proved.
- Part 2: The same data set as that prepared in Test Example 6 was prepared, and principal component analysis was performed on the 13 factors shown in Table 1 that are included in the data set.
- The first principal component obtained by the principal component analysis, the absolute value of the secondary component signal intensity, and the secondary component mixing ratio were input to the model function f(x1, x2, x3) created in Test Example 7, and the reliability value was calculated.
- FIG. 11 shows a distribution map of reliability values calculated by performing principal component analysis on 5 factors or 13 factors. As shown in FIG. 11, even when the genotype of the father indicating the truth or falsehood of the presence of the secondary component signal was not known in this test example, extremely accurate results were obtained with almost no exceptional results. From this result, the validity and high accuracy of the model function created in Test Example 7 were proved.
- The present invention can be applied to prenatal genetic testing, cancer screening, monitoring of transplanted organ engraftment, infectious disease testing, and forensic medicine.
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020227044153A KR20230012033A (ko) | 2020-12-16 | 2021-12-16 | 다형 좌위 신호의 신뢰성 값의 산출 방법 |
| EP21906688.3A EP4266315A4 (en) | 2020-12-16 | 2021-12-16 | METHOD FOR CALCULATING THE RELIABILITY VALUE OF A POLYMORPHISM LOCUS SIGNAL |
| JP2022521759A JP7121440B1 (ja) | 2020-12-16 | 2021-12-16 | 多型座位の信号の信頼性値の算出方法 |
| US18/001,544 US20230227897A1 (en) | 2020-12-16 | 2021-12-16 | Method for calculating the fidelity of the signal of polymorphic genetic loci |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2020-208554 | 2020-12-16 | | |
| JP2020208554 | 2020-12-16 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022131328A1 (ja) | 2022-06-23 |
Family
ID=82059580
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2021/046513 Ceased WO2022131328A1 (ja) | 2020-12-16 | 2021-12-16 | 多型座位の信号の信頼性値の算出方法 |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US20230227897A1 |
| EP (1) | EP4266315A4 |
| JP (1) | JP7121440B1 |
| KR (1) | KR20230012033A |
| WO (1) | WO2022131328A1 |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2025041864A1 (ja) * | 2023-08-22 | 2025-02-27 | 株式会社seeDNA | 非侵襲的出生前親子鑑定方法 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2014502845A (ja) | 2010-12-22 | 2014-02-06 | ナテラ, インコーポレイテッド | 非侵襲性出生前親子鑑定法 |
| JP2016034282A (ja) * | 2011-02-24 | 2016-03-17 | ザ チャイニーズ ユニバーシティー オブ ホンコンThe Chinese University Of Hongkong | 多胎妊娠の分子検査 |
| JP2016061514A (ja) * | 2014-09-19 | 2016-04-25 | 株式会社ケーヒン・サーマル・テクノロジー | エバポレータおよびこれを用いた車両用空調装置 |
| JP2017094805A (ja) | 2015-11-19 | 2017-06-01 | 株式会社デンソー | 車両制御装置 |
| JP2020529648A (ja) | 2017-06-20 | 2020-10-08 | イルミナ インコーポレイテッド | 既知又は未知の遺伝子型の複数のコントリビューターからのdna混合物の分解及び定量化のための方法並びにシステム |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| RS57837B1 (sr) * | 2011-04-12 | 2018-12-31 | Verinata Health Inc | Razrešavanje genomskih frakcija upotrebom broja kopija polimorfizama |
| WO2015164432A1 (en) * | 2014-04-21 | 2015-10-29 | Natera, Inc. | Detecting mutations and ploidy in chromosomal segments |
| KR101806663B1 (ko) | 2016-02-11 | 2017-12-11 | 주식회사 로브아이 | 레이더 및 비디오 카메라 일체형 교통정보 측정시스템 |
| CN114026646A (zh) * | 2019-05-20 | 2022-02-08 | 基金会医学公司 | 用于评估肿瘤分数的系统和方法 |
- 2021
- 2021-12-16 EP EP21906688.3A patent/EP4266315A4/en active Pending
- 2021-12-16 KR KR1020227044153A patent/KR20230012033A/ko active Pending
- 2021-12-16 US US18/001,544 patent/US20230227897A1/en active Pending
- 2021-12-16 WO PCT/JP2021/046513 patent/WO2022131328A1/ja not_active Ceased
- 2021-12-16 JP JP2022521759A patent/JP7121440B1/ja active Active
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2025041864A1 (ja) * | 2023-08-22 | 2025-02-27 | 株式会社seeDNA | 非侵襲的出生前親子鑑定方法 |
| JP2025031706A (ja) * | 2023-08-22 | 2025-03-07 | 株式会社seeDNA | 非侵襲的出生前親子鑑定方法 |
| JP2025029961A (ja) * | 2023-08-22 | 2025-03-07 | 株式会社seeDNA | 非侵襲的出生前親子鑑定方法 |
| JP7684727B2 (ja) | 2023-08-22 | 2025-05-28 | 株式会社seeDNA | 非侵襲的出生前親子鑑定方法 |
| JP7696663B2 (ja) | 2023-08-22 | 2025-06-23 | 株式会社seeDNA | 非侵襲的出生前親子鑑定方法 |
Also Published As
| Publication number | Publication date |
|---|---|
| US20230227897A1 (en) | 2023-07-20 |
| JP7121440B1 (ja) | 2022-08-18 |
| EP4266315A1 (en) | 2023-10-25 |
| JPWO2022131328A1 | 2022-06-23 |
| EP4266315A4 (en) | 2024-11-20 |
| KR20230012033A (ko) | 2023-01-25 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | ENP | Entry into the national phase | Ref document number: 2022521759; Country of ref document: JP; Kind code of ref document: A |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21906688; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 20227044153; Country of ref document: KR; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2021906688; Country of ref document: EP; Effective date: 20230717 |