CN114613436B - Blood sample Motif feature extraction method and cancer early screening model construction method

Publication number
CN114613436B
Authority
CN
China
Prior art keywords
sequence
blood sample
mers
features
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210506566.6A
Other languages
Chinese (zh)
Other versions
CN114613436A (en)
Inventor
李�根
李莹
侯光远
陈钊
何�轩
万冲
丁凤
王占东
许军普
付原
张峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lianhe Medical Laboratory Co ltd
Jiaxing Yakangbo Biotechnology Co ltd
Beijing ACCB Biotech Ltd
Yangtze Delta Region Institute of Tsinghua University Zhejiang
Original Assignee
Beijing Lianhe Medical Laboratory Co ltd
Jiaxing Yakangbo Biotechnology Co ltd
Beijing ACCB Biotech Ltd
Yangtze Delta Region Institute of Tsinghua University Zhejiang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lianhe Medical Laboratory Co ltd, Jiaxing Yakangbo Biotechnology Co ltd, Beijing ACCB Biotech Ltd, Yangtze Delta Region Institute of Tsinghua University Zhejiang
Priority to CN202210506566.6A
Publication of CN114613436A
Application granted
Publication of CN114613436B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20 Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a blood sample Motif feature extraction method and a cancer early screening model construction method, which can be used for obtaining various Motif features, increasing the diversity of the Motif features, improving the screening accuracy of a subsequent cancer early screening model constructed based on the Motif features, increasing the reliability of a screening result and further ensuring the timeliness of cancer diagnosis and treatment.

Description

Blood sample Motif feature extraction method and cancer early screening model construction method
Technical Field
The invention relates to the technical field of cancer early screening, in particular to a blood sample Motif feature extraction method and a cancer early screening model construction method.
Background
Liquid biopsy is the clinical application of analyzing cancer-derived components in blood for early screening, molecular typing, prognosis, medication guidance, and recurrence monitoring of cancer. As a new precision-medicine technology, liquid biopsy can qualitatively and quantitatively detect tumor cells and tumor-related deoxyribonucleic acid (DNA). It is non-invasive, convenient to sample, and suitable for real-time monitoring, and it plays an increasingly important role in tumor diagnosis and treatment.
Currently, the conventional approach to liquid biopsy and cancer early screening is to identify cell-free DNA (cfDNA) released by tumors through mutation detection of oncogenes or tumor suppressor genes. cfDNA consists of degraded DNA fragments released into plasma; it is present in various body fluids of the human body, and its concentration changes with tissue damage, cancer, inflammatory reactions, and the like.
A Motif is a specific DNA sequence pattern that binds regulatory proteins (such as transcription factors) and thereby anchors functional proteins for a short period of time. In the prior art, therefore, Motif features are usually extracted from cfDNA, and a cancer early screening model is constructed based on those features to carry out cancer early screening.
However, the Motif features extracted in the prior art are of a single type, which limits the screening accuracy of cancer early screening models constructed from them and reduces the reliability of the screening results.
Disclosure of Invention
The invention provides a blood sample Motif feature extraction method and a cancer early screening model construction method, which are used for overcoming the defects in the prior art.
The invention provides a blood sample Motif feature extraction method, which comprises the following steps:
obtaining a double-end sequencing sequence obtained by cfDNA extraction and sequencing of a blood sample, and determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome;
extracting the sequence features of the K-mers of the sequence to be extracted, counting the proportion of the sequence features of each category in all the sequence features, and determining the overall features of the K-mers of the sequence to be extracted based on the proportion and the number of the categories of the sequence features in all the sequence features.
According to the method for extracting the Motif characteristics of the blood sample, provided by the invention, the sequence to be extracted is determined based on the double-end sequencing sequence and the reference genome, and the method comprises the following steps:
determining an overlapping region of the double-ended sequencing sequences, and merging the double-ended sequencing sequences based on the overlapping region to obtain a merging result of the double-ended sequencing sequences;
and comparing the merged result with the reference genome to obtain a first comparison result, and obtaining the sequence to be extracted based on the first comparison result.
According to the method for extracting the Motif characteristics of the blood sample, which is provided by the invention, the sequence to be extracted is obtained based on the first comparison result, and the method comprises the following steps:
carrying out indel area re-comparison on the combined result based on the first comparison result to obtain a second comparison result;
and sequentially filtering and screening the combined result and correcting GC content based on the first comparison result and the second comparison result to obtain the sequence to be extracted.
According to the method for extracting the Motif characteristics of the blood sample, provided by the invention, the sequence to be extracted is determined based on the double-end sequencing sequence and the reference genome, and the method comprises the following steps:
filtering and screening the double-ended sequencing sequence based on sequencing quality information and base recognition results to obtain an alternative double-ended sequencing sequence;
and removing the primer sequences in the alternative double-ended sequencing sequence, together with the reads for which primer identification failed, to obtain the double-ended sequencing sequence.
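The quality-based filtering step above can be illustrated with a minimal Python sketch. The text does not specify the filtering criteria, so everything concrete here is an assumption: the function names, the Phred+33 quality encoding, the mean-quality cutoff `min_mean_q`, and the maximum fraction of uncalled bases `max_n_frac` are all hypothetical.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def keep_read(seq, quality, min_mean_q=20.0, max_n_frac=0.1):
    """Return True if a read passes both hypothetical filters.

    min_mean_q and max_n_frac are illustrative thresholds, not values
    taken from the patent text.
    """
    if not seq or len(seq) != len(quality):
        return False
    n_frac = seq.upper().count("N") / len(seq)  # fraction of uncalled bases
    return mean_phred(quality) >= min_mean_q and n_frac <= max_n_frac


def filter_pairs(pairs):
    """Keep a read pair only if both mates pass, so the pairing stays intact.

    Each pair is ((seq1, qual1), (seq2, qual2)).
    """
    return [(r1, r2) for r1, r2 in pairs if keep_read(*r1) and keep_read(*r2)]
```

Dropping both mates together is a design choice made in this sketch so that fastq1 and fastq2 remain in step for later merging; primer removal is not covered here.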
According to the method for extracting the Motif characteristics of the blood sample, provided by the invention, the overall characteristics of the K-mers of the sequence to be extracted are determined based on the proportion and the category number of the sequence characteristics in all the sequence characteristics, and the method comprises the following steps:
determining a category distribution parameter corresponding to the proportion based on the proportion and the category quantity;
determining a weighting parameter corresponding to the proportion based on the proportion and the category distribution parameter;
and summing the weighting parameters corresponding to the proportions of the sequence features of each category to obtain the overall feature.
The invention also provides a construction method of the cancer early screening model, which comprises the following steps:
based on the blood sample Motif feature extraction method, feature extraction is respectively carried out on a first blood sample carrying a positive label and a second blood sample carrying a negative label to obtain the sequence features of K-mers and the overall features of the K-mers of the blood samples;
and training an initial model based on the positive label, the negative label, the sequence characteristics of the K-mers of the blood samples and the overall characteristics of the K-mers to obtain a cancer early-screening model.
The invention also provides a blood sample Motif feature extraction device, which comprises:
the sequence acquisition module is used for acquiring a double-end sequencing sequence obtained by cfDNA extraction and sequencing of a blood sample and determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome;
the first feature extraction module is configured to extract sequence features of the K-mers of the sequence to be extracted, count a ratio of the sequence features of each category in all the sequence features, and determine an overall feature of the K-mers of the sequence to be extracted based on the ratio and the number of categories of the sequence features in all the sequence features.
The invention also provides a cancer early screening model construction device, which comprises:
the second characteristic extraction module is used for respectively extracting the characteristics of the first blood sample carrying the positive label and the second blood sample carrying the negative label based on the blood sample Motif characteristic extraction method to obtain the sequence characteristics of the K-mer and the overall characteristics of the K-mer of each blood sample;
and the training module is used for training the initial model based on the positive label, the negative label, the sequence characteristics of the K-mers of the various blood samples and the overall characteristics of the K-mers to obtain a cancer early screening model.
The present invention also provides a cancer early screening device, comprising:
the third feature extraction module is used for obtaining a blood sample to be screened, and performing feature extraction on the blood sample to be screened based on the blood sample Motif feature extraction method to obtain the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened;
the screening module is used for inputting the sequence characteristics of the K-mers and the overall characteristics of the K-mers of the blood sample to be screened into a cancer early screening model to obtain a screening result output by the cancer early screening model;
the cancer early-screening model is constructed based on the construction method of the cancer early-screening model.
The invention also provides an electronic device, which comprises a memory, a processor, and a computer program that is stored on the memory and can run on the processor, wherein the processor, when executing the program, implements the blood sample Motif feature extraction method as described in any one of the above; and/or implements the cancer early screening model construction method as described in any one of the above; and/or implements the cancer early screening method as described in any one of the above.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the blood sample Motif feature extraction method as described in any one of the above; and/or implements the cancer early screening model construction method as described in any one of the above; and/or implements the cancer early screening method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the blood sample Motif feature extraction method as described in any one of the above; and/or implements the cancer early screening model construction method as described in any one of the above; and/or implements the cancer early screening method as described in any one of the above.
The invention provides a blood sample Motif feature extraction method and a cancer early screening model construction method, which comprises the steps of firstly obtaining a double-end sequencing sequence obtained by cfDNA extraction and sequencing of a blood sample, and determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome; and then extracting the sequence characteristics of the K-mers of the sequence to be extracted, counting the proportion of the sequence characteristics of each category in all the sequence characteristics, and determining the overall characteristics of the K-mers of the sequence to be extracted based on the proportion and the number of the categories of the sequence characteristics in all the sequence characteristics. By the method, two types of Motif characteristics, namely the sequence characteristics of the K-mer and the overall characteristics of the K-mer, can be extracted, the sequence characteristics of the K-mer can represent the blood sample from the dimensions of different types of sequence characteristics, and the overall characteristics of the K-mer can consider the diversity of the sequence characteristic categories and represent the blood sample from the overall dimensions of the sequence characteristics. The method can obtain various Motif characteristics, increases the diversity of the Motif characteristics, improves the screening accuracy of a cancer early screening model constructed based on the Motif characteristics subsequently, increases the reliability of a screening result, and further can ensure the timeliness of cancer diagnosis and treatment.
Drawings
In order to illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flow chart of a blood sample Motif feature extraction method provided by the invention;
FIG. 2 is a schematic flow chart of a method for constructing a cancer early-screening model according to the present invention;
FIG. 3 is a second schematic flow chart of the method for constructing a cancer early-screening model according to the present invention;
FIG. 4 is a schematic flow diagram of a method for the early screening of cancer provided by the present invention;
FIG. 5 is a schematic structural diagram of a blood sample Motif feature extraction device provided by the invention;
FIG. 6 is a schematic structural diagram of a cancer early screening model construction device provided by the present invention;
FIG. 7 is a schematic structural diagram of a cancer early screening device provided by the present invention;
FIG. 8 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Among existing cancer early screening methods, traditional approaches have low general applicability, and most cancer types have no effective early screening means. Endoscopic screening can detect digestive tract or intestinal cancer early, but it is invasive, the examination is painful, and it places high demands on the patient's physical condition. Imaging screening (e.g., CT, MRI) is radiological and has a low recognition rate for early-stage cancer. Tissue biopsy is difficult to sample; tumor heterogeneity easily leads to incomplete sampling, which hinders diagnosis and typing, and both the false negative rate and the false positive rate are high.
Kit-based early screening methods detect only fixed cancer types and fixed loci. They cannot be updated or optimized unless the kit is redesigned, so they are strongly limited, depend heavily on already-established information, and cannot identify new predictive sites.
Early screening methods that extract features and construct a model mainly extract Motif features from cfDNA. This amounts to extracting specific sequences from the fragments produced under different diseases, and these sequences carry disease-specific characteristics, so a cancer early screening model built on Motif features can support cancer early screening.
However, the Motif features extracted in the prior art are single, so that the screening accuracy of the cancer early screening model constructed based on the Motif features is greatly reduced, the reliability of the screening result is reduced, and the diagnosis and treatment of the cancer are influenced.
Therefore, the embodiment of the invention provides a blood sample Motif feature extraction method.
Fig. 1 is a schematic flow chart of a blood sample Motif feature extraction method provided in an embodiment of the present invention, and as shown in fig. 1, the method includes:
S11, obtaining a double-end sequencing sequence obtained by cfDNA extraction and sequencing of a blood sample, and determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome;
S12, extracting the sequence features of the K-mers of the sequence to be extracted, counting the proportion of the sequence features of each category in all the sequence features, and determining the overall features of the K-mers of the sequence to be extracted based on the proportion and the number of the categories of the sequence features in all the sequence features.
Specifically, the blood sample Motif feature extraction method provided in the embodiment of the present invention is executed by a blood sample Motif feature extraction device, which may be deployed on a server. The server may be a local server or a cloud server, and the local server may be a computer; this is not specifically limited in the embodiment of the present invention.
First, step S11 is executed: a double-ended sequencing sequence obtained by cfDNA extraction and sequencing of a blood sample is obtained, and a sequence to be extracted is determined based on the double-ended sequencing sequence and a reference genome. The blood sample may come from a healthy person or from a cancer patient, which is not specifically limited here. It can be collected with a blood collection tube.
After the blood sample is obtained, cfDNA can be extracted from it, followed by library construction, sequencing, and similar operations. Here, cfDNA extraction and library construction may use conventional methods, and the double-ended sequencing sequence of the cfDNA, which may be denoted as a fastq sequence, may be obtained with a conventional sequencing technique such as 3X WGS sequencing.
Because paired-end sequencing is typically used when cfDNA is extracted from a blood sample and sequenced, the double-ended sequencing sequence includes a forward sequencing sequence, which can be denoted as fastq1, and a reverse sequencing sequence, which can be denoted as fastq2. Each of them comprises a plurality of reads, and each read carries base information.
After the double-ended sequencing sequence is obtained, the sequence to be extracted can be determined by combining the double-ended sequencing sequence and the reference genome. The reference genome refers to the human whole-genome sequence. Genomic DNA has a double-helix structure in which the strands are linked by hydrogen bonds between complementary bases. During normal aging and cancer progression, the pH of the environment around cells changes, so these hydrogen bonds are disrupted and breakage occurs. Because the base sequences at the breakpoints differ, the proportions of sequences carrying different breakpoint sequence information also differ.
The sequence to be extracted is a sequence on which feature extraction can be performed. It can be obtained by directly aligning the double-ended sequencing sequence with the reference genome, in which case the sequence formed by the reads of the double-ended sequencing sequence that match the reference genome is used directly as the sequence to be extracted.
Alternatively, the double-ended sequencing sequence can be preprocessed before it is aligned with the reference genome. The preprocessing result is then aligned with the reference genome, and the sequence formed by the reads of the preprocessing result that match the reference genome is used as the sequence to be extracted.
In addition, after the double-ended sequencing sequence or the preprocessing result has been aligned with the reference genome, the resulting comparison result can be used to post-process the double-ended sequencing sequence or the preprocessing result, and the post-processing result can be used as the sequence to be extracted.
Then, step S12 is executed to extract the sequence features of the K-mers of the sequence to be extracted. A K-mer is a target sequence of K bases selected from the sequence to be extracted in each iteration. The value of K may be set as needed; K may be less than or equal to 8, for example 4 or 8, and is not specifically limited here.
The sequence feature of a K-mer is the number of times that the target sequence selected in each iteration occurs in the sequence to be extracted. The sequence features of the K-mers are one type of Motif feature.
Because the target sequence selected in each iteration comprises K bases and there are only 4 types of bases, the number of categories of K-mer sequence features is $4^K$. If K = 4, the number of categories is 256, and if K = 8, the number of categories is 65536. Adopting the sequence features of the 4-mer can therefore greatly reduce the amount of computation required for feature extraction.
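As a quick check of the category counts quoted above, the following Python sketch enumerates every possible K-mer over the four bases. The function name `kmer_categories` is hypothetical and used only for illustration.

```python
from itertools import product

BASES = "ACGT"


def kmer_categories(k):
    """Enumerate all possible K-mers over the four bases, in lexicographic order."""
    return ["".join(p) for p in product(BASES, repeat=k)]


# The number of categories grows as 4**k, so k = 4 gives 256 categories.
n_categories_4mer = len(kmer_categories(4))
```

Enumerating the categories up front also gives a fixed feature order, which is convenient when the counts of different samples must line up as columns of one feature matrix.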
To determine the proportion of the sequence features of each category in all the sequence features, the sum of all sequence features may be computed first; the ratio of each category's sequence feature to this sum is then the proportion of that category.
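The counting and proportion steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the text does not say exactly how target sequences are selected, so a sliding window of step 1 over every read is assumed here, windows containing characters other than A, C, G, T are skipped, and the function names are hypothetical.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_kmers(reads, k=4):
    """Count K-mer occurrences over all reads (sliding window of step 1, assumed)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= VALID_BASES:  # skip windows with N or other symbols
                counts[kmer] += 1
    return counts


def kmer_proportions(counts):
    """Divide each category's count by the sum of all counts."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

For example, `count_kmers(["ACGTAC"], k=2)` counts `AC` twice and `CG`, `GT`, `TA` once each, so the proportion of `AC` is 2/5.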
Then, the overall feature of the K-mers of the sequence to be extracted is determined from the proportion of the sequence features of each category and the number of categories. The overall feature of the K-mers is another type of Motif feature. It can be determined by first computing a feature parameter for each category from that category's proportion and the number of categories, and then summing the feature parameters of all categories. This completes the blood sample Motif feature extraction method provided by the embodiment of the invention.
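The text at this point does not give the formula for the per-category feature parameter. One form that fits the description (a parameter computed from each proportion and the number of categories, then summed) is a Shannon entropy normalized by the number of categories. The Python sketch below uses that form purely as an assumption; it is not confirmed as the patented formula, and the function name is hypothetical.

```python
import math


def overall_feature(proportions, n_categories):
    """Normalized-entropy style overall feature (assumed form, for illustration).

    For each category with proportion p > 0:
      distribution parameter = log(p) / log(n_categories)
      weighting parameter    = -p * distribution parameter
    The overall feature is the sum of the weighting parameters.
    """
    if n_categories < 2:
        raise ValueError("n_categories must be at least 2")
    total = 0.0
    for p in proportions:
        if p <= 0:
            continue  # categories that never occur contribute nothing
        distribution_param = math.log(p) / math.log(n_categories)
        total += -p * distribution_param
    return total
```

Under this assumed form, the value lies between 0 and 1: it is 0 when a single K-mer category accounts for every occurrence and 1 when all categories are equally frequent, so it summarizes the diversity of the K-mer categories in one number.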
The method for extracting the Motif characteristics of the blood sample comprises the steps of firstly, obtaining a double-end sequencing sequence obtained by cfDNA extraction and sequencing of the blood sample, and determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome; and then extracting the sequence characteristics of the K-mers of the sequence to be extracted, counting the proportion of the sequence characteristics of each category in all the sequence characteristics, and determining the overall characteristics of the K-mers of the sequence to be extracted based on the proportion and the number of the categories of the sequence characteristics in all the sequence characteristics. By the method, two types of Motif characteristics, namely the sequence characteristics of the K-mer and the overall characteristics of the K-mer, can be extracted, the sequence characteristics of the K-mer can represent the blood sample from the dimensions of different types of sequence characteristics, and the overall characteristics of the K-mer can consider the diversity of the sequence characteristic categories and represent the blood sample from the overall dimensions of the sequence characteristics. The method can obtain various Motif characteristics, increases the diversity of the Motif characteristics, improves the screening accuracy of a cancer early screening model constructed based on the Motif characteristics subsequently, increases the reliability of a screening result, and further can ensure the timeliness of cancer diagnosis and treatment.
On the basis of the above embodiments, the method for extracting Motif characteristics of a blood sample provided in the embodiments of the present invention, wherein the determining a sequence to be extracted based on the paired end sequencing sequence and the reference genome, includes:
determining an overlapping region of the double-ended sequencing sequences, and merging the double-ended sequencing sequences based on the overlapping region to obtain a merging result of the double-ended sequencing sequences;
and comparing the merged result with the reference genome to obtain a first comparison result, and obtaining the sequence to be extracted based on the first comparison result.
Specifically, in the embodiment of the present invention, when determining the sequence to be extracted, the overlapping region of the double-ended sequencing sequence may be determined first. Because the double-ended sequencing sequence includes a forward sequencing sequence and a reverse sequencing sequence, the paired forward and reverse reads from the same position partially overlap at their tail ends.
Such overlap occurs when the target region is small and the reads are relatively long. That is, when fastq1 and fastq2 both cover the same region, the two reads sequence the same bases, producing an overlapping region. In practice, systematic base detection bias can occur at the head and tail of the reads, and this bias can strongly affect the subsequent alignment with the reference genome.
To avoid this problem and improve the accuracy of alignment with the reference genome, fastq1 and fastq2 are merged according to the content of their overlapping region, so that each read pair becomes a single sequencing sequence in which the overlapping region appears only once; this yields the merged result.
Finally, the merged result is aligned with the reference genome to obtain the position of each read of the merged result on the reference genome, which gives the first comparison result. The first comparison result may be a BAM file, which records the basic information and the aligned position of each read.
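The merging of an overlapping read pair can be sketched in Python as follows. This is a simplified illustration, not the procedure used in the patent: it assumes the overlap matches exactly (real merging tools tolerate mismatches and use base qualities), it assumes the fragment is at least as long as each read, and the function names and the `min_overlap` parameter are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge a forward read and its reverse mate into one sequence.

    The reverse read is reverse-complemented so both reads are on the same
    strand, then the longest exact suffix/prefix overlap of at least
    min_overlap bases is used to join them. Returns None if no such
    overlap exists.
    """
    fwd = read1.upper()
    rev = reverse_complement(read2)
    max_len = min(len(fwd), len(rev))
    for size in range(max_len, min_overlap - 1, -1):
        if fwd[-size:] == rev[:size]:
            return fwd + rev[size:]  # keep the overlapping region only once
    return None
```

For a 20-base fragment sequenced with two 14-base reads, the reads share an 8-base overlap, and `merge_pair(read1, read2, min_overlap=8)` reconstructs the original fragment.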
And obtaining the sequence to be extracted according to the first comparison result. Here, a sequence formed by the reads matching with the reference genome in the merged result may be directly used as a sequence to be extracted according to the first comparison result, or the merged result may be further processed according to the first comparison result to obtain a sequence to be extracted, which is not specifically limited herein.
In the embodiment of the invention, the overlapping regions need to be merged before the comparison with the reference genome, so that the comparison accuracy and the comparison efficiency can be improved.
On the basis of the foregoing embodiment, in the blood sample Motif feature extraction method provided in the embodiment of the present invention, obtaining the sequence to be extracted based on the first comparison result includes:
performing indel region realignment on the merged result based on the first comparison result to obtain a second comparison result;
and sequentially performing filtering and screening and GC content correction on the merged result based on the first comparison result and the second comparison result to obtain the sequence to be extracted.
Specifically, in the embodiments of the present invention, when determining the sequence to be extracted, if the double-ended sequencing sequence contains variations such as base insertions or deletions (indels), the accuracy of the sequence to be extracted is affected, and the detectability of variations in the surrounding region is directly affected. Therefore, the merged result needs to be further processed according to the first comparison result, that is, indel region realignment is performed on the merged result to obtain accurate positions of each read in the merged result on the reference genome. All regions in the merged result that need realignment can be found by using the existing human reference genome sequence and the indel site information published by the 1000 Genomes Project, forming an interval file; then, using the interval file, the indel variations in the merged result are realigned to obtain a second comparison result. The second comparison result can be understood as a correction of the first comparison result.
Then, the merged result is filtered and screened using the first comparison result and the second comparison result to obtain a filtering and screening result. The filtering and screening may include a quality control process, a filtering process, and a screening process.
The quality control process obtains the mapping quality score (MAPQ) of each read in the merged result according to the first comparison result and removes from the merged result the reads whose MAPQ is less than a preset threshold. The preset threshold may be set as needed, and may be set to 80% or more, for example.
The filtering process selects from the merged result, according to the second comparison result, all reads that match the reference genome.
The screening process refers to the removal of duplicate sequences from the merged result.
It is to be understood that the order of execution of the quality control process, the filtering process, and the screening process may be set as desired, and is not particularly limited herein.
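The three processes can be sketched in Python as a single pass over alignment records. This is only a sketch under stated assumptions: the AlignedRead record, its field names and the min_mapq default are hypothetical placeholders rather than values from the embodiment, and duplicates are identified here simply by identical chromosome, start and end coordinates.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AlignedRead:
    """Hypothetical minimal alignment record (not a real BAM library type)."""
    name: str
    chrom: str
    start: int
    end: int
    mapq: int
    mapped: bool


def filter_and_screen(reads: Iterable[AlignedRead], min_mapq: int = 30) -> List[AlignedRead]:
    """Apply filtering, quality control and duplicate screening in one pass."""
    seen = set()
    kept = []
    for read in reads:
        if not read.mapped:          # filtering: keep only reads matched to the reference
            continue
        if read.mapq < min_mapq:     # quality control: drop reads below the MAPQ threshold
            continue
        key = (read.chrom, read.start, read.end)
        if key in seen:              # screening: drop duplicates of an already kept read
            continue
        seen.add(key)
        kept.append(read)
    return kept
```

Because each check is independent of the others, reordering the three checks inside the loop gives the same set of retained reads, consistent with the statement that the order of execution may be set as desired.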
Thereafter, because sequencing involves polymerase chain reaction (PCR) amplification, PCR bias may skew the GC content of the sequencing result. Therefore, in the embodiment of the invention, the GC content bias of the filtering and screening result needs to be corrected to obtain the sequence to be extracted. Here, the correction may be implemented by using a LOESS model to correct the GC bias of the sample.
In the embodiment of the invention, alignment with the reference genome, indel region realignment, filtering and screening, GC content correction and other operations improve the accuracy of the sequence to be extracted, reduce the error in the K-mer sequence features subsequently extracted from that sequence, and thereby further improve the accuracy of feature extraction.
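The GC content correction can be illustrated with a small pure-Python LOESS (tricube-weighted local linear regression). This is a hedged sketch rather than the procedure of the embodiment: it assumes the data have been summarized as per-bin GC fractions and read counts, and the function names, the span parameter frac and the choice of rescaling to the median count are all hypothetical.

```python
from statistics import median
from typing import List, Sequence


def loess_fit(x: Sequence[float], y: Sequence[float], frac: float = 0.5) -> List[float]:
    """Evaluate a tricube-weighted local linear regression at every x[i]."""
    n = len(x)
    k = max(2, int(round(frac * n)))       # number of neighbours in each local window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12          # bandwidth: distance to the k-th nearest point
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def correct_gc_bias(gc: Sequence[float], counts: Sequence[float],
                    frac: float = 0.5) -> List[float]:
    """Rescale each bin count by (median count / LOESS-expected count at its GC)."""
    expected = loess_fit(gc, counts, frac)
    target = median(counts)
    return [c * target / e if e > 0 else 0.0 for c, e in zip(counts, expected)]
```

With this choice, a bin whose count equals the value expected for its GC fraction is mapped to the median count, so a purely GC-driven trend is flattened while deviations from the trend are preserved.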
On the basis of the above embodiments, in the blood sample Motif feature extraction method provided in the embodiments of the present invention, before determining the sequence to be extracted based on the double-ended sequencing sequence and the reference genome, the method further includes:
filtering and screening the double-ended sequencing sequence based on sequencing quality information and base recognition results to obtain a candidate double-ended sequencing sequence;
and removing the primer sequences in the candidate double-ended sequencing sequence, together with the reads for which primer identification failed, to obtain the double-ended sequencing sequence.
Specifically, in the embodiment of the present invention, after the double-ended sequencing sequence is obtained and before it is used, it may be filtered and screened according to the sequencing quality information and the base recognition result to obtain the candidate double-ended sequencing sequence. Here, filtering and screening means removing low-quality reads according to the sequencing quality information so that high-quality reads are retained, and removing reads containing unrecognized bases according to the base recognition result so that only reads whose bases are all recognized are retained. Through these two processes, the candidate double-ended sequencing sequence is obtained.
Due to technical requirements, primer sequences appear on each read during the sequencing of the library. The primer sequence can affect the recognition of the variation site in the subsequent reads and increase unnecessary data volume, so that the primer sequence of each read is removed according to the known primer information in the embodiment of the invention, thereby improving the subsequent analysis efficiency.
The basic principle of primer recognition is to use the specific sequence of each primer as a specific label for the corresponding primer. When the specific sequence of a pair of primers occurs multiple times in the first 30 bp of a read, the read can be considered to have been amplified by the corresponding primer. After the corresponding primer is identified, the primer sequence can be removed according to the primer length, and reads whose primer cannot be identified can be discarded.
In the embodiment of the invention, after the double-ended sequencing sequence is obtained, the double-ended sequencing sequence is preprocessed, so that the quality of the double-ended sequencing sequence can be ensured, the interference of a low-quality sequence in the sequencing process on subsequent processing is reduced, the accuracy of a subsequent processing result is improved, and a foundation is provided for improving the feature extraction efficiency.
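The preprocessing described above can be sketched as follows. This is a simplified illustration under stated assumptions: reads are given as (name, sequence, quality) tuples with Phred+33 quality strings, a read is treated as primer-identified when a known primer occurs once within its first 30 bp (a simplification of the recognition rule), and the function names and the mean-quality threshold are hypothetical.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

Record = Tuple[str, str, str]  # (read name, bases, Phred+33 quality string)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def passes_quality(seq: str, quality: str, min_mean_q: float = 20.0) -> bool:
    """Keep reads with no unrecognized base (N) and a sufficient mean quality."""
    return bool(seq) and "N" not in seq and mean_phred(quality) >= min_mean_q


def trim_primer(seq: str, quality: str, primers: Sequence[str],
                window: int = 30) -> Optional[Tuple[str, str]]:
    """Remove a known primer found within the first `window` bases; None if absent."""
    for primer in primers:
        idx = seq[:window].find(primer)
        if idx != -1:
            cut = idx + len(primer)
            return seq[cut:], quality[cut:]
    return None


def preprocess(records: Iterable[Record], primers: Sequence[str],
               min_mean_q: float = 20.0) -> List[Record]:
    """Quality-filter reads, trim primers, and drop reads with no identified primer."""
    kept = []
    for name, seq, qual in records:
        if not passes_quality(seq, qual, min_mean_q):
            continue
        trimmed = trim_primer(seq, qual, primers)
        if trimmed is None:
            continue
        kept.append((name, *trimmed))
    return kept
```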
On the basis of the foregoing embodiment, in the blood sample Motif feature extraction method provided in the embodiment of the present invention, determining the overall feature of the K-mers of the sequence to be extracted, based on the proportion and the number of categories of the sequence features in all the sequence features, includes:
determining a category distribution parameter corresponding to the proportion based on the proportion and the category number;
determining a weighting parameter corresponding to the proportion based on the proportion and the category distribution parameter;
and summing the weighting parameters corresponding to the proportions of the sequence features of each category to obtain the overall feature.
Specifically, in the embodiment of the present invention, when calculating the overall feature of the K-mers, the category distribution parameter corresponding to the proportion of the sequence features of each category may be determined according to that proportion and the number of categories. The category distribution parameter may be represented by a ratio of common logarithms; for example, the category distribution parameter corresponding to the proportion of the sequence features of the i-th category may be represented as:
$$\frac{\log_{10} P_i}{\log_{10} N}$$
wherein $P_i$ indicates the proportion of the sequence features of the i-th category and $N$ indicates the number of categories.
Then, according to the proportion of the sequence features of each category and the corresponding category distribution parameter, the weighting parameter corresponding to the proportion is determined, namely:
$$-P_i \cdot \frac{\log_{10} P_i}{\log_{10} N}$$
Finally, the weighting parameters corresponding to the proportions of the sequence features of each category can be summed to obtain the overall feature of the K-mers, which can be represented by the MDS (Motif Diversity Score), namely:
$$\mathrm{MDS} = \sum_{i=1}^{N} -P_i \cdot \frac{\log_{10} P_i}{\log_{10} N}$$
In the embodiment of the invention, the overall feature of the K-mers is determined by summation, combining the category distribution parameters and the weighting parameters, so that the overall feature can represent the overall diversity of the K-mer sequence features of the blood sample.
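The calculation can be sketched in Python as follows. The sketch assumes that the overall feature takes the normalized-entropy form in which each proportion is multiplied by the negative ratio of its common logarithm to the common logarithm of the number of categories, and that the K-mers have already been extracted as a list of strings; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def kmer_proportions(kmers: Iterable[str]) -> Dict[str, float]:
    """Proportion of each K-mer category among all observed K-mers."""
    counts = Counter(kmers)
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}


def motif_diversity_score(proportions: Dict[str, float], n_categories: int) -> float:
    """Sum of weighting parameters over all categories with a non-zero proportion.

    n_categories is the number of possible categories (for example 4**K for
    K-mers over A, C, G, T); categories with proportion 0 contribute nothing.
    """
    score = 0.0
    for p in proportions.values():
        if p > 0:
            # category distribution parameter: ratio of common logarithms
            distribution = math.log10(p) / math.log10(n_categories)
            # weighting parameter: proportion times the negated distribution parameter
            score += -p * distribution
    return score
```

Under this assumed form the score ranges from 0, when a single category accounts for all K-mers, to 1, when all categories are equally frequent.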
As shown in fig. 2, on the basis of the above embodiment, an embodiment of the present invention provides a method for constructing a cancer early-stage screening model, including:
S21, respectively performing feature extraction on a first blood sample carrying a positive label and a second blood sample carrying a negative label based on the blood sample Motif feature extraction methods provided in the embodiments to obtain the sequence features of the K-mers and the overall features of the K-mers of the blood samples;
S22, training an initial model based on the positive label, the negative label, the sequence features of the K-mers of the various blood samples and the overall features of the K-mers to obtain a cancer early-screening model.
Specifically, an execution subject of the cancer prescreening model construction method provided in the embodiment of the present invention is a cancer prescreening model construction device, which may be configured in a server, where the server may be a local server or a cloud server, and the local server may be a computer, which is not specifically limited in the embodiment of the present invention.
Firstly, step S21 is executed: the blood sample Motif feature extraction method provided in each embodiment is used to perform feature extraction on the first blood sample carrying the positive label and the second blood sample carrying the negative label respectively, so as to obtain the sequence features of the K-mers and the overall features of the K-mers of each blood sample. The sequence features and overall features of the K-mers of the first blood sample and those of the second blood sample are obtained separately.
It is understood that the first blood sample carrying a positive label is a blood sample from a cancer patient and the second blood sample carrying a negative label is a blood sample from a healthy person.
Table 1 shows the 4-mer sequence features of some of the blood samples.
Table 1. 4-mer sequence features of some of the blood samples
[Table 1 is provided as an image in the original publication.]
Here, Type is the category of the 4-mer sequence features, B1-B9 are the numbers of the 9 blood samples, and the data in Table 1 show the 4-mer sequence features of each category for the 9 blood samples.
Table 2. Overall 4-mer features of some of the blood samples
[Table 2 is provided as an image in the original publication.]
Here, the ID column gives the serial number of each blood sample, and MDS is the overall 4-mer feature of each blood sample.
Then, the initial model is trained by combining the positive label, the negative label, the sequence features of the K-mers of the various blood samples and the overall features of the K-mers to obtain the cancer early-screening model. The initial model may be a neural network model. The sequence features of the K-mers and the overall features of the K-mers of each blood sample are input into the initial model, the initial model outputs a prediction result for each blood sample, and the loss function is calculated from the prediction result and the positive or negative label carried by that blood sample. The model parameters of the initial model are adjusted based on the loss function, and the process is repeated until the loss function converges or a preset number of training iterations is reached, so as to obtain the cancer early-screening model. The cancer early-screening model can be used to analyze a blood sample to determine whether the blood sample is negative or positive.
It can be understood that the loss function adopted in the embodiment of the present invention may be set as needed, and a conventional loss function may be adopted; the preset number of training iterations may also be set as needed, which is not specifically limited herein.
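The training procedure can be illustrated with a deliberately small stand-in model. The sketch below trains a single-layer logistic model by gradient descent on a cross-entropy loss and stops when the loss converges or a preset number of iterations is reached; it is not the neural network of the embodiment, whose architecture and loss function are not specified, and all names, the learning rate and the stopping tolerance are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights: Sequence[float], bias: float, features: Sequence[float]) -> float:
    """Predicted probability that a sample is positive."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(samples: Sequence[Sequence[float]], labels: Sequence[int], lr: float = 0.5,
          max_epochs: int = 2000, tol: float = 1e-8) -> Tuple[List[float], float]:
    """Fit weights by gradient descent until the loss converges or max_epochs is reached."""
    n, n_features = len(samples), len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    prev_loss = float("inf")
    for _ in range(max_epochs):
        preds = [predict_proba(weights, bias, x) for x in samples]
        # cross-entropy loss between predictions and positive (1) / negative (0) labels
        loss = -sum(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
                    for y, p in zip(labels, preds)) / n
        if abs(prev_loss - loss) < tol:    # convergence of the loss function
            break
        prev_loss = loss
        for j in range(n_features):
            grad = sum((p - y) * x[j] for p, y, x in zip(preds, labels, samples)) / n
            weights[j] -= lr * grad
        bias -= lr * sum(p - y for p, y in zip(preds, labels)) / n
    return weights, bias
```

In practice each feature vector would hold the K-mer sequence features of a blood sample followed by its overall feature, with label 1 for the positive label and 0 for the negative label.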
In the cancer early-screening model construction method provided in the embodiment of the present invention, feature extraction is first performed on the first blood sample carrying the positive label and the second blood sample carrying the negative label respectively, based on the blood sample Motif feature extraction method provided in each embodiment, to obtain the sequence features of the K-mers and the overall features of the K-mers of each blood sample; the initial model is then trained based on the positive label, the negative label, the sequence features of the K-mers of the various blood samples and the overall features of the K-mers to obtain the cancer early-screening model. Because the construction method introduces both the sequence features of the K-mers and the overall features of the K-mers of the various blood samples, the trained cancer early-screening model can analyze blood samples from both local and global perspectives and thus obtain more accurate early-screening results.
Fig. 3 is a schematic flow chart of a method for constructing a cancer early-screening model provided in an embodiment of the present invention, where the method includes:
obtaining double-end sequencing sequences of various blood samples;
respectively preprocessing the double-end sequencing sequences of various blood samples to obtain each preprocessing result;
performing quality control on each pretreatment result, if the pretreatment result is qualified, performing primer identification on the pretreatment result, and otherwise, discarding the pretreatment result;
after primer identification is carried out on each pretreatment result, sequence combination is respectively carried out to obtain each combination result;
comparing each combined result with a reference genome respectively to obtain each first comparison result, correcting each first comparison result respectively, and filtering, screening and correcting GC content of each combined result in sequence to obtain a sequence to be extracted;
extracting sequence characteristics of a K-mer of a sequence to be extracted and overall characteristics of the K-mer;
and constructing a cancer early screening model.
As shown in fig. 4, on the basis of the above embodiments, the embodiment of the present invention provides a cancer prescreening method, including:
S41, obtaining a blood sample to be screened, and performing feature extraction on the blood sample to be screened based on the blood sample Motif feature extraction method provided in each embodiment to obtain the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened;
S42, inputting the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened into a cancer early-screening model to obtain a screening result output by the cancer early-screening model;
the cancer early-stage screening model is constructed based on the construction method of the cancer early-stage screening model provided in each embodiment.
Specifically, an execution subject of the cancer prescreening method provided in the embodiment of the present invention is a cancer prescreening device, which may be configured in a server, where the server may be a local server or a cloud server, and the local server may be a computer, which is not specifically limited in the embodiment of the present invention.
Step S41 is first executed to obtain a blood sample to be screened. The blood sample to be screened refers to a blood sample that needs to be determined as negative or positive. According to the blood sample Motif feature extraction method provided in each embodiment, feature extraction can be performed on the blood sample to be screened, so that the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened are obtained.
Then, step S42 is executed: the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened are input into the cancer early-screening model, which analyzes them and outputs a screening result. The screening result may be the probability that the blood sample to be screened is positive.
It is understood that the cancer prescreening model used in the embodiments of the present invention can be constructed by the construction method of the cancer prescreening model provided in the above embodiments.
According to the cancer early-screening method provided by the embodiment of the invention, a blood sample to be screened is first obtained, and feature extraction is performed on it based on the blood sample Motif feature extraction method provided by the embodiments, so that the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened are obtained; these features are then input into a cancer early-screening model to obtain a screening result output by the model. The method uses a cancer early-screening model trained on the sequence features of the K-mers and the overall features of the K-mers of various blood samples, which can make the screening result more accurate and more reliable and thereby help ensure timely diagnosis and treatment of cancer.
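The screening step can be sketched as assembling a fixed-order feature vector and passing it to a trained model. The sketch below assumes a simple logistic model with already trained weights as a stand-in for the cancer early-screening model; the function names, the ordering of the K-mer categories and the decision threshold are hypothetical.

```python
import math
from itertools import product
from typing import Dict, List, Sequence, Tuple


def feature_vector(proportions: Dict[str, float], mds: float, k: int) -> List[float]:
    """K-mer proportions in a fixed category order, followed by the overall feature."""
    categories = ["".join(bases) for bases in product("ACGT", repeat=k)]
    return [proportions.get(category, 0.0) for category in categories] + [mds]


def screen(features: Sequence[float], weights: Sequence[float], bias: float,
           threshold: float = 0.5) -> Tuple[float, bool]:
    """Return the probability that the sample is positive and the thresholded call."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    # numerically stable logistic function
    prob = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
    return prob, prob >= threshold
```

Keeping the category order fixed matters because the model weights are tied to feature positions; K-mer categories absent from a sample are filled with a proportion of 0.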
As shown in fig. 5, on the basis of the above embodiment, an embodiment of the present invention provides a blood sample Motif feature extraction device, including:
the sequence acquisition module 51 is used for acquiring a double-end sequencing sequence obtained by cfDNA extraction and sequencing of a blood sample, and determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome;
the first feature extraction module 52 is configured to extract sequence features of the K-mers of the sequence to be extracted, count a ratio of the sequence features of each category in all the sequence features, and determine an overall feature of the K-mers of the sequence to be extracted based on the ratio and the number of categories of the sequence features in all the sequence features.
On the basis of the foregoing embodiment, in the blood sample Motif feature extraction device provided in the embodiment of the present invention, the sequence acquisition module is configured to:
determining an overlapping region of the double-ended sequencing sequences, and merging the double-ended sequencing sequences based on the overlapping region to obtain a merging result of the double-ended sequencing sequences;
and comparing the merged result with the reference genome to obtain a first comparison result, and obtaining the sequence to be extracted based on the first comparison result.
On the basis of the foregoing embodiment, in the blood sample Motif feature extraction device provided in the embodiment of the present invention, the sequence acquisition module is configured to:
performing indel region realignment on the merged result based on the first comparison result to obtain a second comparison result;
and sequentially performing filtering and screening and GC content correction on the merged result based on the first comparison result and the second comparison result to obtain the sequence to be extracted.
On the basis of the above embodiment, the blood sample Motif feature extraction device provided in the embodiment of the present invention further includes a preprocessing module, configured to:
filtering and screening the double-ended sequencing sequence based on sequencing quality information and a base recognition result to obtain a candidate double-ended sequencing sequence;
and removing the primer sequences in the candidate double-ended sequencing sequence, together with the reads for which primer identification failed, to obtain the double-ended sequencing sequence.
On the basis of the foregoing embodiment, in the blood sample Motif feature extraction device provided in the embodiment of the present invention, the first feature extraction module is configured to:
determining a category distribution parameter corresponding to the proportion based on the proportion and the category number;
determining a weighting parameter corresponding to the proportion based on the proportion and the category distribution parameter;
and summing the weighting parameters corresponding to the proportions of the sequence features of each category to obtain the overall feature.
Specifically, the functions of the modules in the blood sample Motif feature extraction device provided in the embodiment of the present invention are in one-to-one correspondence with the operation flows of the steps in the method embodiments in which the blood sample Motif feature extraction device is used as the execution main body, and the implementation effects are also consistent.
As shown in fig. 6, on the basis of the above embodiment, an embodiment of the present invention provides a cancer prescreening model building apparatus, including:
a second feature extraction module 61, configured to perform feature extraction on the first type of blood sample carrying the positive tag and the second type of blood sample carrying the negative tag based on the blood sample Motif feature extraction methods provided in the foregoing embodiments, respectively to obtain sequence features of the K-mers and overall features of the K-mers of the various types of blood samples;
and the training module 62 is configured to train the initial model based on the positive label, the negative label, the sequence characteristics of the K-mers of the various blood samples, and the overall characteristics of the K-mers, so as to obtain a cancer early-screening model.
Specifically, the functions of the modules in the cancer prescreening model building apparatus provided in the embodiment of the present invention are in one-to-one correspondence with the operation flows of the steps in the method embodiments that use the cancer prescreening model building apparatus as an execution main body, and the implementation effects are also consistent.
As shown in fig. 7, on the basis of the above embodiment, an embodiment of the present invention provides a cancer prescreening device, including:
a third feature extraction module 71, configured to obtain a blood sample to be screened, and perform feature extraction on the blood sample to be screened based on the blood sample Motif feature extraction method provided in each of the above embodiments, to obtain a sequence feature of a K-mer of the blood sample to be screened and an overall feature of the K-mer;
the screening module 72 is configured to input the sequence characteristics of the K-mers and the overall characteristics of the K-mers of the blood sample to be screened into a cancer early-screening model, so as to obtain a screening result output by the cancer early-screening model;
the cancer early-stage screening model is constructed based on the construction method of the cancer early-stage screening model provided in each embodiment.
Specifically, the functions of the modules in the cancer prescreening device provided in the embodiment of the present invention are in one-to-one correspondence with the operation flows of the steps in the method embodiments that use the cancer prescreening device as an execution main body, and the achieved effects are also consistent.
Fig. 8 illustrates a physical structure diagram of an electronic device, and as shown in fig. 8, the electronic device may include: a Processor (Processor) 810, a communication Interface 820, a Memory 830 and a communication bus 840, wherein the Processor 810, the communication Interface 820 and the Memory 830 communicate with each other via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the blood sample Motif feature extraction method provided in the above embodiments, the method comprising: obtaining a double-end sequencing sequence obtained by cfDNA extraction and sequencing of a blood sample, and determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome; extracting the sequence features of the K-mers of the sequence to be extracted, counting the proportion of the sequence features of each category in all the sequence features, and determining the overall features of the K-mers of the sequence to be extracted based on the proportion and the number of the categories of the sequence features in all the sequence features. And/or, executing the cancer early screening model construction method provided in the above embodiments, the method comprising: based on the blood sample Motif feature extraction method provided in each embodiment, feature extraction is respectively performed on a first blood sample carrying a positive label and a second blood sample carrying a negative label, so that the sequence features of K-mers and the overall features of the K-mers of the various blood samples are obtained; and training an initial model based on the positive label, the negative label, the sequence characteristics of the K-mers of the blood samples and the overall characteristics of the K-mers to obtain a cancer early-screening model. 
And/or, performing the cancer prescreening method provided in the various embodiments above, the method comprising: obtaining a blood sample to be screened, and performing feature extraction on the blood sample to be screened based on the blood sample Motif feature extraction method provided in each embodiment to obtain the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened; inputting the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened into a cancer early-screening model to obtain a screening result output by the cancer early-screening model; wherein the cancer early-screening model is constructed based on the cancer early-screening model construction method provided in each embodiment.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention further provides a computer program product, the computer program product comprising a computer program, the computer program being stored on a non-transitory computer readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing the method for Motif feature extraction of a blood sample provided in the above embodiments, the method comprising: obtaining a double-end sequencing sequence obtained by cfDNA extraction and sequencing of a blood sample, and determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome; and extracting the sequence features of the K-mers of the sequence to be extracted, counting the proportion of the sequence features of each category in all the sequence features, and determining the overall features of the K-mers of the sequence to be extracted based on the proportion and the number of the categories of the sequence features in all the sequence features. And/or, executing the cancer early-screening model construction method provided in the above embodiments, the method comprising: based on the blood sample Motif feature extraction method provided in each embodiment, feature extraction is respectively performed on a first blood sample carrying a positive label and a second blood sample carrying a negative label, so that the sequence features of K-mers and the overall features of the K-mers of the various blood samples are obtained; and training an initial model based on the positive label, the negative label, the sequence characteristics of the K-mers of the blood samples and the overall characteristics of the K-mers to obtain a cancer early-screening model. 
And/or, performing the cancer prescreening method provided in the various embodiments above, the method comprising: obtaining a blood sample to be screened, and performing feature extraction on the blood sample to be screened based on the blood sample Motif feature extraction method provided in each embodiment to obtain the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened; inputting the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened into a cancer early-screening model to obtain a screening result output by the cancer early-screening model; wherein the cancer early-screening model is constructed based on the cancer early-screening model construction method provided in each embodiment.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the blood sample Motif feature extraction method provided in the above embodiments, the method including: obtaining a double-end sequencing sequence obtained by cfDNA extraction and sequencing of a blood sample, and determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome; extracting the sequence features of the K-mers of the sequence to be extracted, counting the proportion of the sequence features of each category in all the sequence features, and determining the overall features of the K-mers of the sequence to be extracted based on the proportion and the number of the categories of the sequence features in all the sequence features. And/or, executing the cancer early-screening model construction method provided in the above embodiments, the method comprising: based on the blood sample Motif feature extraction method provided in each embodiment, feature extraction is respectively performed on a first blood sample carrying a positive label and a second blood sample carrying a negative label, so that the sequence features of the K-mers and the overall features of the K-mers of the various blood samples are obtained; and training an initial model based on the positive label, the negative label, the sequence features of the K-mers of the blood samples and the overall features of the K-mers to obtain a cancer early-screening model.
And/or, performing the cancer early-screening method provided in the above embodiments, the method comprising: obtaining a blood sample to be screened, and performing feature extraction on the blood sample to be screened based on the blood sample Motif feature extraction method provided in the above embodiments, to obtain the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened; and inputting the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened into a cancer early-screening model to obtain a screening result output by the cancer early-screening model; wherein the cancer early-screening model is constructed based on the cancer early-screening model construction method provided in the above embodiments.
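As an illustration of the feature extraction recited above, the following Python sketch counts K-mer sequence features, converts the counts to per-category proportions, and sums per-category weighting parameters into one overall feature. Several points are assumptions rather than statements of the patented method: the K-mer is taken as the first K bases of each fragment, the category number is taken as 4^K over the alphabet A/C/G/T, and, because the formula images are not reproduced in this text, the category distribution parameter is assumed to be the logarithm of the proportion normalized by the logarithm of the category number (which makes the sum a normalized Shannon entropy). All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_sequence_features(fragments, k=4):
    """Count the K-mer at the start of each fragment.

    Assumption: the K-mer is the first k bases of the fragment; fragments
    that are too short or contain non-ACGT bases are skipped.
    """
    valid = set("ACGT")
    counts = Counter()
    for seq in fragments:
        motif = seq[:k].upper()
        if len(motif) == k and set(motif) <= valid:
            counts[motif] += 1
    return counts


def kmer_proportions(counts, k=4):
    """Proportion of each of the 4**k K-mer categories among all K-mers."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid K-mers found")
    categories = ["".join(p) for p in product("ACGT", repeat=k)]
    return {m: counts.get(m, 0) / total for m in categories}


def overall_feature(proportions):
    """Sum the per-category weighting parameters into one overall feature.

    Assumed forms (the source formula image is not available):
      category distribution parameter: log(p_i) / log(N)
      weighting parameter:             -p_i * log(p_i) / log(N)
    where N is the category number. Categories with p_i == 0 contribute 0.
    """
    n = len(proportions)
    score = 0.0
    for p in proportions.values():
        if p > 0:
            distribution_param = math.log(p) / math.log(n)
            score += -p * distribution_param
    return score
```

Under these assumptions the overall feature lies between 0 (all fragments share one K-mer) and 1 (all K-mer categories are equally frequent), so it summarizes how evenly the K-mer categories are distributed in a sample.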
The above-described apparatus embodiments are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A blood sample Motif feature extraction method is characterized by comprising the following steps:
obtaining a double-end sequencing sequence obtained by cfDNA extraction and sequencing of a blood sample, and determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome;
extracting the sequence features of the K-mers of the sequence to be extracted, counting the proportion of the sequence features of each category in all the sequence features, and determining the overall features of the K-mers of the sequence to be extracted based on the proportion and the number of the categories of the sequence features in all the sequence features;
wherein determining the overall features of the K-mers of the sequence to be extracted based on the proportion and the category number of the sequence features among all the sequence features comprises:
determining a category distribution parameter corresponding to the proportion based on the proportion and the category number;
determining a weighting parameter corresponding to the proportion based on the proportion and the category distribution parameter;
summing the weighting parameters corresponding to the proportion of the sequence features of each category to obtain the overall features;
the category distribution parameter corresponding to the proportion of the sequence features of the i-th category is expressed as:
[formula image not reproduced in this text]
wherein [symbol image not reproduced in this text] indicates the proportion of sequence features of the i-th category.
2. The blood sample Motif feature extraction method according to claim 1, wherein the determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome comprises:
determining an overlapping region of the double-end sequencing sequences, and merging the double-end sequencing sequences based on the overlapping region to obtain a merged result of the double-end sequencing sequences;
and aligning the merged result with the reference genome to obtain a first alignment result, and obtaining the sequence to be extracted based on the first alignment result.
3. The blood sample Motif feature extraction method according to claim 2, wherein the obtaining the sequence to be extracted based on the first alignment result comprises:
performing indel-region realignment on the merged result based on the first alignment result to obtain a second alignment result;
and sequentially performing filtering and screening and GC-content correction on the merged result based on the first alignment result and the second alignment result to obtain the sequence to be extracted.
4. The blood sample Motif feature extraction method according to claim 1, wherein, before the determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome, the method further comprises:
filtering and screening the double-end sequencing data based on sequencing quality information and base-calling results to obtain candidate double-end sequencing sequences;
and removing primer sequences from the candidate double-end sequencing sequences, and removing reads for which primer identification failed, to obtain the double-end sequencing sequence.
5. A construction method of a cancer early screening model is characterized by comprising the following steps:
based on the blood sample Motif feature extraction method of any one of claims 1-4, respectively performing feature extraction on a first blood sample carrying a positive label and a second blood sample carrying a negative label to obtain the sequence features of K-mers and the overall features of the K-mers of the blood samples;
and training an initial model based on the positive label, the negative label, the sequence features of the K-mers of the blood samples and the overall features of the K-mers to obtain a cancer early-screening model.
6. A blood sample Motif feature extraction device, comprising:
the sequence acquisition module is used for acquiring a double-end sequencing sequence obtained by cfDNA extraction and sequencing of a blood sample and determining a sequence to be extracted based on the double-end sequencing sequence and a reference genome;
the first feature extraction module is used for extracting the sequence features of the K-mers of the sequence to be extracted, counting the proportion of the sequence features of each category in all the sequence features, and determining the overall features of the K-mers of the sequence to be extracted based on the proportion and the number of the categories of the sequence features in all the sequence features;
the first feature extraction module is specifically configured to:
determining a category distribution parameter corresponding to the proportion based on the proportion and the category number;
determining a weighting parameter corresponding to the proportion based on the proportion and the category distribution parameter;
summing the weighting parameters corresponding to the proportion of the sequence features of each category to obtain the overall features;
the category distribution parameter corresponding to the proportion of the sequence features of the i-th category is expressed as:
[formula image not reproduced in this text]
wherein [symbol image not reproduced in this text] indicates the proportion of sequence features of the i-th category.
7. A cancer early-screening model construction device is characterized by comprising:
a second feature extraction module, configured to perform feature extraction on a first type of blood sample carrying a positive label and a second type of blood sample carrying a negative label, respectively, based on the blood sample Motif feature extraction method according to any one of claims 1 to 4, so as to obtain the sequence features of the K-mers and the overall features of the K-mers of each type of blood sample;
and a training module, configured to train an initial model based on the positive label, the negative label, the sequence features of the K-mers of each type of blood sample and the overall features of the K-mers to obtain a cancer early-screening model.
8. A cancer early-screening device, comprising:
a third feature extraction module, configured to obtain a blood sample to be screened, and perform feature extraction on the blood sample to be screened based on the blood sample Motif feature extraction method described in any one of claims 1 to 4, to obtain the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened;
a screening module, configured to input the sequence features of the K-mers and the overall features of the K-mers of the blood sample to be screened into a cancer early-screening model to obtain a screening result output by the cancer early-screening model;
wherein the cancer early-screening model is constructed based on the cancer early-screening model construction method of claim 5.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the blood sample Motif feature extraction method according to any one of claims 1 to 4 when executing the program, and/or implements the cancer early-screening model construction method according to claim 5.
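The model construction and screening steps recited in the claims can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the text here does not say what kind of model the "initial model" is, so a plain logistic regression trained by batch gradient descent is used as a stand-in, and each sample's feature vector is assumed to be its K-mer proportions followed by its overall feature. The function names, learning rate, and epoch count are hypothetical.

```python
import math


def build_feature_vector(proportions, overall):
    """Concatenate K-mer proportions (in sorted K-mer order) with the overall feature."""
    return [proportions[m] for m in sorted(proportions)] + [overall]


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_initial_model(features, labels, lr=0.5, epochs=2000):
    """Train a logistic regression (assumed stand-in for the initial model).

    features: list of equal-length feature vectors, one per blood sample
    labels:   1 for a positive label, 0 for a negative label
    Returns (weights, bias).
    """
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            pred = _sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def screen(model, x):
    """Return the model's estimated probability that sample x is positive."""
    weights, bias = model
    return _sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
```

In use, feature vectors for the labeled samples would be passed to `train_initial_model`, and the returned model together with the feature vector of a sample to be screened would be passed to `screen`; how the resulting probability is turned into a screening result (for example, a threshold) is not specified here and would have to be chosen separately.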
CN202210506566.6A 2022-05-11 2022-05-11 Blood sample Motif feature extraction method and cancer early screening model construction method Active CN114613436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210506566.6A CN114613436B (en) 2022-05-11 2022-05-11 Blood sample Motif feature extraction method and cancer early screening model construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210506566.6A CN114613436B (en) 2022-05-11 2022-05-11 Blood sample Motif feature extraction method and cancer early screening model construction method

Publications (2)

Publication Number Publication Date
CN114613436A CN114613436A (en) 2022-06-10
CN114613436B CN114613436B (en) 2022-08-02

Family

ID=81868558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210506566.6A Active CN114613436B (en) 2022-05-11 2022-05-11 Blood sample Motif feature extraction method and cancer early screening model construction method

Country Status (1)

Country Link
CN (1) CN114613436B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112204666A (en) * 2018-04-13 2021-01-08 格里尔公司 Multiple assay predictive model for cancer detection
WO2021072275A1 (en) * 2019-10-11 2021-04-15 Guardant Health, Inc. Use of cell free bacterial nucleic acids for detection of cancer
CN113160889A (en) * 2021-01-28 2021-07-23 清华大学 Cancer noninvasive early screening method based on cfDNA omics characteristics
CN113838533A (en) * 2021-08-17 2021-12-24 福建和瑞基因科技有限公司 Cancer detection model and construction method and kit thereof
WO2022061080A1 (en) * 2020-09-17 2022-03-24 The Regents Of The University Of Colorado, A Body Corporate Signatures in cell-free dna to detect disease, track treatment response, and inform treatment decisions

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080041727A1 (en) * 2006-08-18 2008-02-21 Semitool, Inc. Method and system for depositing alloy composition
WO2009055597A2 (en) * 2007-10-25 2009-04-30 Monsanto Technology Llc Methods for identifying genetic linkage
WO2018081382A1 (en) * 2016-10-26 2018-05-03 Brown University A method to measure myeloid suppressor cells for diagnosis and prognosis of cancer
GB201818159D0 (en) * 2018-11-07 2018-12-19 Cancer Research Tech Ltd Enhanced detection of target dna by fragment size analysis
US11581062B2 (en) * 2018-12-10 2023-02-14 Grail, Llc Systems and methods for classifying patients with respect to multiple cancer classes
CN112435714B (en) * 2020-11-03 2021-07-02 北京科技大学 Tumor immune subtype classification method and system
CN112784884A (en) * 2021-01-07 2021-05-11 重庆兆琨智医科技有限公司 Medical image classification method, system, medium and electronic terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
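Application value of targeted sequencing of plasma cell-free DNA methylation in colorectal cancer; Zhang Lijing et al.; Chinese Journal of Clinical Laboratory Science (临床检验杂志); 20220328; Vol. 40, No. 3; pp. 179-182 *
Genome-wide cell-free DNA fragmentation in patients with cancer; Stephen Cristiano et al.; Nature; 20191229; pp. 385-389 *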

Also Published As

Publication number Publication date
CN114613436A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
CN109767810B (en) High-throughput sequencing data analysis method and device
CN112086129B (en) Method and system for predicting cfDNA of tumor tissue
CN108256292B (en) Copy number variation detection device
CN113257350A (en) ctDNA mutation degree analysis method and device based on liquid biopsy and ctDNA performance analysis device
CN113035273B (en) Rapid and ultrahigh-sensitivity DNA fusion gene detection method
CN108664769B (en) Drug relocation method based on cancer genome and non-specific gene tag
CN112289376B (en) Method and device for detecting somatic cell mutation
CN113903401A (en) ctDNA length-based analysis method and system
CN111321209A (en) Method for double-end correction of circulating tumor DNA sequencing data
CN116064755B (en) Device for detecting MRD marker based on linkage gene mutation
CN113838533A (en) Cancer detection model and construction method and kit thereof
CN113862351B (en) Kit and method for identifying extracellular RNA biomarkers in body fluid sample
CN111180013B (en) Device for detecting blood disease fusion gene
CN116356001B (en) Dual background noise mutation removal method based on blood circulation tumor DNA
CN112687341B (en) Method for identifying chromosome structure variation by taking breakpoint as center
CN117275585A (en) Method for constructing lung cancer early-screening model based on LP-WGS and DNA methylation and electronic equipment
CN114613436B (en) Blood sample Motif feature extraction method and cancer early screening model construction method
CN115240764A (en) Tumor gene detection system and data processing method
CN114898803B (en) Mutation detection analysis method, device, readable medium and apparatus
CN110462056B (en) Sample source detection method, device and storage medium based on DNA sequencing data
CN116189904A (en) Gene methylation diagnosis model of differentiated thyroid cancer and construction method thereof
CN113355438B (en) Plasma microbial species diversity evaluation method and device and storage medium
CN108660213A (en) The application of three kinds of non-coding RNA reagents of detection and kit
JPWO2014175427A1 (en) Method, apparatus and program for evaluating DNA status
CN113159529A (en) Risk assessment model and related system for intestinal polyp

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant