WO2023236058A1

WO2023236058A1 - Construction method and apparatus for pulmonary nodule screening model, and pulmonary nodule screening method and apparatus

Info

Publication number: WO2023236058A1
Application number: PCT/CN2022/097450
Authority: WO
Inventors: 梁瀚; 周鑫兰; 李甫强; 乔斯坦; 赵鑫; 吴逵
Original assignee: 深圳华大生命科学研究院
Priority date: 2022-06-07
Filing date: 2022-06-07
Publication date: 2023-12-14

Abstract

The present invention relates to a construction method and apparatus for a pulmonary nodule screening model, and a pulmonary nodule screening method and apparatus. Specifically, the present invention relates to a construction method for a pulmonary nodule screening model, the method comprising the following steps: across the full range of the human reference genome, screening a training set to obtain a predetermined number of regions having the greatest difference in weighted fragment distribution difference (WFDD) values between benign-nodule population samples and malignant-nodule population samples, and taking the predetermined number of regions as feature data; and using the feature data to construct a pulmonary nodule screening model. The method of the present invention enables non-invasive testing of human nodule type, avoids invasive testing, and provides a judgment accuracy higher than that of existing CT scanning and close to that of invasive tissue biopsy. The method is developed on cfDNA sequencing data and requires only a blood draw or the collection of other bodily fluids, without the risk of exposure to radiation.

Description

Method and device for establishing pulmonary nodule screening model and method and device for pulmonary nodule screening

Technical field

The present invention relates to the technical field of bioinformatics, specifically, to a method and device for constructing a pulmonary nodule screening model and a pulmonary nodule screening method and device.

Background art

Cancer has become the leading cause of death in China, and its incidence is increasing year by year. According to the 2019 National Cancer Report released by the National Cancer Center, deaths from malignant tumors account for 23.91% of all causes of death among residents, and the annual medical expenses caused by malignant tumors exceed 220 billion. Lung cancer ranks first among malignant tumors in China by number of cases.

Pulmonary sarcoidosis is a multi-system, multi-organ granulomatous disease of unknown etiology. It often invades the lungs, bilateral hilar lymph nodes, eyes, skin and other organs, and its chest invasion rate is as high as 80% to 90%. The prognosis for pulmonary sarcoidosis is mostly good. Early lung cancer, by contrast, mostly presents as small nodules in the lungs. Distinguishing the type of nodule is therefore particularly important for early screening of lung cancer. At present, reliably determining the type of a pulmonary nodule essentially relies on tissue biopsy through invasive surgical sampling.

The World Health Organization points out that early detection and early treatment are the key to effective cancer treatment. It is therefore extremely important to develop early screening and early detection technology for cancer. Currently, the clinical non-invasive diagnosis of nodules (such as pulmonary nodules) mainly relies on low-dose spiral CT. Low-dose spiral CT uses the smallest scanning range, lowest dose, and smallest amount of X-rays to diagnose lesions. Compared with conventional CT examination, its radiation dose is lower and micronodules can be displayed more clearly. However, whether low-dose spiral CT increases the risk of cancer has long been controversial. Epidemiological studies show that even two or three CT scans deliver a radiation dose that can lead to a detectable increase in cancer risk, especially in children (Computed Tomography - An Increasing Source of Radiation Exposure. N Engl J Med 2007;357:2277-2284. DOI:10.1056/NEJMra072149).

Cell-free DNA (cfDNA), also known as circulating free DNA, is free extracellular DNA that exists in peripheral fluids such as blood and urine. cfDNA derived from tumors is also called circulating tumor DNA (ctDNA). The value of cfDNA as a new diagnostic marker has been confirmed in a variety of solid tumors. Existing technologies include early cancer screening methods based on whole-genome sequencing or methylation sequencing of cfDNA. The literature (Integrating Genomic Features for Non-Invasive Early Lung Cancer Detection. Nature, Volume 580, Pages 245-251 (2020); https://doi.org/10.1038/s41586-020-2140-0) has reported non-invasive screening of cancer patients based on cfDNA fragments or protein markers in peripheral fluid. However, almost all of these methods are used to screen healthy people, aiming to detect cancer patients among them; it is difficult to ensure that tumors at a given site can be accurately distinguished from non-cancer diseases at the same site, and the above methods may misidentify patients with inflammation as cancer patients.

Wenhua Liang et al. proposed a method for pulmonary nodule diagnosis based on cfDNA methylation sequencing data (Non-Invasive Diagnosis of Early-Stage Lung Cancer Using High-Throughput Targeted DNA Methylation Sequencing of Circulating Tumor DNA (ctDNA). Theranostics 2019;9(7):2056-2070. DOI:10.7150). In a validation set of 39 patients with malignant pulmonary nodules and 27 patients with benign pulmonary nodules, the area under the receiver operating characteristic (ROC) curve (AUC) of this method was 0.816. That study compared malignant lung lesions with benign lesions to characterize tissue DNA methylation and establish a diagnostic model for benign/malignant nodules. Applying this model to the identification of tumor-specific ctDNA in the plasma of patients with pulmonary nodules gives a certain sensitivity and specificity for early lung cancer. However, the accuracy of this non-invasive diagnostic method for pulmonary nodules is low and does not yet meet clinical requirements.

Copy number variation (CNV) is caused by genome rearrangements. It generally refers to an increase or decrease in the copy number of a large genome segment more than 1 kb in length, mainly manifesting as submicroscopic deletions and duplications. CNV is an important component of genomic structural variation (SV). The mutation rate of CNV sites is much higher than that of single nucleotide polymorphisms (SNPs), and CNV is one of the important causative factors of human disease. Existing literature (Maternal Malignancies Detected With Noninvasive Prenatal Testing Reply. JAMA: the Journal of the American Medical Association; 2015 Nov 24;314(20):2192-3 [DOI:10.1001/jama.2015.12922]) has reported detecting copy number variations from cfDNA sequencing data of the blood of cancer patients, meaning that in the region where a CNV occurs, the distribution of DNA fragments in cancer patients differs from a baseline representative of healthy people.

Therefore, there is an urgent need for products that meet clinical requirements and detect human nodule types with high accuracy, for example by using non-invasive methods to determine whether a patient's pulmonary nodules are benign or malignant, thereby avoiding invasive examinations.

Summary of the invention

The main purpose of the present invention is to provide a method for establishing a pulmonary nodule screening model, a screening model, a screening method and a screening device that can distinguish malignant tumors at a given site from other, non-cancer diseases at that site, and in particular can distinguish the type of pulmonary nodule.

In order to achieve the above objects, according to the first aspect of the present invention, the present invention proposes a method for constructing a pulmonary nodule screening model, which method includes the following steps:

Within the entire range of the human reference genome, selecting, as feature data, a predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples in the training set; and using the feature data to construct the pulmonary nodule screening model.

Further, the step of selecting, as feature data, a predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples in the training set includes:

Connecting the base sequences of all autosomes in the reference genome together, and dividing the connected base sequence into a series of windows of a fixed length, each window corresponding to a stretch of base sequence; calculating the window reference depth and the weight of each window, where the window reference depth is the average of the depth values of the training-set samples in the window, the weight is the square of the variance of the depth values of the training-set samples in the window, and the depth value of a window is the number of base sequence fragments in a sample's sequencing data that align to the base sequence corresponding to the window; calculating the window sample depth of a specified sample in the training set in each window; calculating the difference between the window sample depth and the window reference depth; multiplying the difference by the weight to obtain the weighted difference of the window; combining a variable number of windows to form different regions, and summing the weighted differences of all windows in a specified region to obtain the weighted difference sum; applying a numerical transformation to the weighted difference sum to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region; and calculating, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples in the training set, and selecting a predetermined number of regions with the largest differences as feature data.
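The per-window and per-region arithmetic described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the toy depth table, the use of population variance, and the identity function standing in for the unspecified numerical transformation are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def window_stats(train_depths):
    """Return (reference_depths, weights) from a samples-by-windows depth table."""
    per_window = list(zip(*train_depths))  # transpose: one tuple per window
    reference = [mean(col) for col in per_window]
    # Weight = square of the variance, as the text states.
    # Population variance is an assumption; the text does not say which kind.
    weights = [pvariance(col) ** 2 for col in per_window]
    return reference, weights


def wfdd(sample_depths, reference, weights, region, transform=lambda s: s):
    """WFDD of one sample in one region (region = 0-based window indices)."""
    weighted_sum = sum(
        (sample_depths[w] - reference[w]) * weights[w] for w in region
    )
    # The numerical transformation is not specified in the text;
    # the identity function is only a placeholder.
    return transform(weighted_sum)


# Toy depth table (hypothetical values): 2 training samples, 2 windows.
train = [[1.0, 5.0], [3.0, 5.0]]
reference, weights = window_stats(train)
value = wfdd([4.0, 9.0], reference, weights, region=[0, 1])
```

In this toy table the second window has zero variance across the training samples, so its weight is zero and it contributes nothing to the sum, whatever the depth of the tested sample in that window.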

Further, before calculating the difference between the window sample depth and the window reference depth, a step of normalizing the window reference depth and the window sample depth is also included.

Further, after the normalization process, the window reference depths have a mean of 0 and a standard deviation of 1.

Further, screening a predetermined number of areas with the largest differences as feature data includes:

For a specific region, calculating the WFDD value of each sample in the benign nodule population samples and the malignant nodule population samples; calculating the average WFDD value of the benign nodule population samples and the average WFDD value of the malignant nodule population samples respectively; and calculating the discrimination value of the specific region according to the following formula, and selecting a predetermined number of regions with the largest discrimination values as feature data:

where t is the discrimination value of the specific region; the two group means are the average WFDD values of the benign nodule population samples and the malignant nodule population samples, respectively; n₁ and n₂ are the numbers of values from the benign nodule population samples and the malignant nodule population samples, respectively; and S₁ and S₂ are the standard deviations of the values from the benign nodule population samples and the malignant nodule population samples, respectively; with the condition that when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2, or when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
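The formula itself is not reproduced in this text. One two-sample statistic that uses exactly the listed quantities (two group means, n₁, n₂, S₁, S₂) is Welch's t, t = (mean₁ - mean₂) / sqrt(S₁²/n₁ + S₂²/n₂). The sketch below assumes that form; the function names, the sample standard deviation, and ranking by absolute value are likewise assumptions, not statements about the patented method.

```python
from math import sqrt
from statistics import mean, stdev


def discrimination_value(group1, group2):
    """Welch-type t statistic between two lists of WFDD values (assumed form)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    s1, s2 = stdev(group1), stdev(group2)  # sample standard deviations (assumption)
    # Raises ZeroDivisionError if both groups have zero variance.
    return (m1 - m2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def top_regions(wfdd_by_region, k):
    """Return the k region ids with the largest |t|.

    wfdd_by_region maps a region id to (benign_values, malignant_values).
    Ranking by absolute value is an assumption, made because the text lets
    either group take subscript 1.
    """
    scored = {
        region: abs(discrimination_value(benign, malignant))
        for region, (benign, malignant) in wfdd_by_region.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Hypothetical WFDD values for two candidate regions.
candidates = {
    "A": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "B": ([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]),
}
best = top_regions(candidates, 1)
```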

a.) Generate initial area:

-Divide the genome into a series of windows;

-Number a series of windows in the order of a series of windows;

-Use a series of window numbers to number the area obtained by combining a series of windows;

- Combine the n consecutive windows starting at window i with another n consecutive windows starting 2^j·n windows downstream to form an initial area:

x_i = {i, i+1, i+2, ..., i+n-1, i+2^j·n, i+2^j·n+1, i+2^j·n+2, ..., i+2^j·n+n-1}

i = 1, n, 2n, ..., N; j = 1, 2, ..., 8; i + 2^(j+1)·n ≤ N

where n is the given number of consecutive windows, and N is the total number of divided windows;
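As an illustration of how such initial areas could be enumerated, here is a small Python sketch. The function name is hypothetical, and the bound is applied as a direct check that every window number in the area is at most N, which is an assumption standing in for the inequality stated above.

```python
def initial_regions(total_windows, n, max_j=8):
    """Enumerate initial areas: n windows at i plus n windows 2**j * n downstream.

    Windows are numbered from 1 to total_windows, as in the text.
    """
    # Start positions i = 1, n, 2n, ... (the set removes the duplicate when n == 1).
    starts = sorted({1, *range(n, total_windows + 1, n)})
    regions = []
    for i in starts:
        for j in range(1, max_j + 1):
            offset = (2 ** j) * n
            last_window = i + offset + n - 1
            if last_window > total_windows:
                break  # larger j only moves the second block further downstream
            first_block = list(range(i, i + n))
            second_block = list(range(i + offset, i + offset + n))
            regions.append(first_block + second_block)
    return regions


# Hypothetical sizes: 100 windows, blocks of n = 5 consecutive windows.
regions = initial_regions(total_windows=100, n=5)
```

With these toy sizes the first area pairs windows 1 to 5 with windows 11 to 15, since 2^1·5 = 10 windows separate the two block starts.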

b.) Combination and splitting of regions:

Using a genetic algorithm, two parents are combined to exchange information and produce offspring; all initial regions are placed in a region pool, from which parents are randomly selected to generate offspring; where

-The probability that region i is selected as one of the parents is:

where N is the total number of divided windows, and t_i is the t value of the i-th window;

-When region x is selected as the first parent, the probability that another region i is selected as the second parent is:

where N is the total number of divided windows, and m_i is the average of the window numbers included in region i; and

-After selecting the parents, take the union of the windows included in the parents and randomly delete some of the windows to obtain the offspring, the windows to delete being chosen by sampling with replacement;

-After obtaining the offspring, put the offspring into the region pool for the next round of selection; the parents are not deleted in this operation. The offspring generated by combining region P₁ and region P₂ is:

child(P ₁ , P ₂ )=P ₁ ∪P ₂ -S(p, P ₁ ∪P ₂ )

where S(p,s) is a subset obtained by drawing a proportion p of the elements of the set s with replacement;

- Repeat the process of producing offspring;

- Calculate the discrimination value of the generated offspring according to the following formula, and select a predetermined number of regions with the largest discrimination values as feature data:

where t is the discrimination value of a specific region; the two group means are the average WFDD values of the benign nodule population samples and the malignant nodule population samples, respectively; n₁ and n₂ are the numbers of values from the benign nodule population samples and the malignant nodule population samples, respectively; and S₁ and S₂ are the standard deviations of the values from the benign nodule population samples and the malignant nodule population samples, respectively; with the condition that when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2, or when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
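The combination step child(P₁, P₂) = P₁ ∪ P₂ - S(p, P₁ ∪ P₂) can be sketched in Python as follows. The parent-selection probabilities are not reproduced in this text, so the sketch picks parents uniformly at random; that choice, the function names, and the fixed seed are assumptions made for illustration only.

```python
import random


def make_child(parent1, parent2, p, rng):
    """Union of the parents' windows minus a sample drawn with replacement."""
    union = sorted(set(parent1) | set(parent2))
    draws = round(p * len(union))
    # Sampling with replacement: repeated draws collapse into one deletion,
    # so at most `draws` distinct windows are removed.
    deleted = {rng.choice(union) for _ in range(draws)}
    return [w for w in union if w not in deleted]


def evolve(pool, rounds, p, seed=0):
    """Grow the region pool; parents stay in the pool after producing a child."""
    rng = random.Random(seed)
    pool = [list(region) for region in pool]
    for _ in range(rounds):
        # Uniform parent choice is an assumption; the text defines its own
        # selection probabilities, which are not reproduced here.
        parent1, parent2 = rng.sample(pool, 2)
        child = make_child(parent1, parent2, p, rng)
        if child:  # skip the rare case where every window was deleted
            pool.append(child)
    return pool


# Hypothetical starting pool of three small regions (window numbers).
pool = evolve([[1, 2], [3, 4], [5, 6]], rounds=5, p=0.2)
```

Each resulting region would then be scored with the discrimination value, and the highest-scoring regions kept as feature data.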

Further, the number n of continuous windows is 1-100, preferably 5-50, and more preferably 5.

Further, after selecting the parents, the union of the windows included in the parents is taken and 1%-99%, more preferably 5%-50%, further preferably 20% of the windows are randomly deleted, by sampling with replacement, to obtain the offspring.

Further, the process of generating offspring is repeated 1 to 1 million times, more preferably 100 to 100,000 times, and further preferably 300,000 times.

Further, the connected base sequence is divided into a series of windows according to the length of 100bp-100kbp, preferably 10kbp-50kbp, and more preferably 30kbp.

Further, the windows that are combined to form different areas are continuous or discontinuous.

Further, the predetermined number of areas is 1-500, more preferably 10-100, even more preferably 50.

According to the second aspect of the present invention, the present invention proposes a pulmonary nodule screening method, which includes the following steps: calculating the sum of the weighted fragment distribution difference values of the sample to be tested in a selected predetermined number of regions to obtain the total WFDD value; inputting the total WFDD value of the sample to be tested into the pulmonary nodule screening model established according to the method described in the first aspect of the present invention; and outputting the screening result of the sample to be tested; wherein the selected predetermined number of regions are the same as the predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples.

Further, the total WFDD value of the sample to be tested is input into the pulmonary nodule screening model, and the pulmonary nodule screening model determines the type of pulmonary nodule of the sample to be tested based on the comparison between the total WFDD value of the sample to be tested and a predetermined threshold.

Further, the predetermined threshold is obtained by the following method: calculating the total WFDD value of each sample in the training set in the predetermined number of regions with the largest difference in WFDD values; and calculating the optimal segmentation point based on the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal segmentation point being the predetermined threshold.

Further, the optimal segmentation point is calculated using the roc function of the pROC package for the R language.
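For readers without R, the idea of an optimal segmentation point can be sketched in plain Python. The text does not state which optimality criterion was used, so this sketch assumes Youden's J (sensitivity + specificity - 1); it also assumes that larger total WFDD values indicate malignancy. The function names and data are hypothetical, and the code does not reproduce pROC's behaviour.

```python
def best_cutoff(benign_totals, malignant_totals):
    """Return (threshold, youden_j) maximising sensitivity + specificity - 1.

    Assumes a sample is called malignant when its total WFDD exceeds the
    threshold; the real direction depends on the trained model.
    """
    values = sorted(set(benign_totals) | set(malignant_totals))
    # Candidate thresholds: midpoints between consecutive observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_j = None, -1.0
    for c in candidates:
        sensitivity = sum(v > c for v in malignant_totals) / len(malignant_totals)
        specificity = sum(v <= c for v in benign_totals) / len(benign_totals)
        j = sensitivity + specificity - 1
        if j > best_j:  # ties keep the lowest threshold
            best_threshold, best_j = c, j
    return best_threshold, best_j


def classify(total_wfdd, threshold):
    """Compare a sample's total WFDD with the predetermined threshold."""
    return "malignant" if total_wfdd > threshold else "benign"


# Hypothetical training totals.
threshold, j = best_cutoff([1.0, 2.0, 3.0, 4.0], [3.5, 5.0, 6.0, 7.0])
label = classify(5.2, threshold)
```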

According to the third aspect of the present invention, the present invention proposes a device for constructing a pulmonary nodule screening model, including: a feature data screening module configured to select, within the entire range of the human reference genome, a predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples in the training set as feature data; and a building module configured to use the feature data to build a pulmonary nodule screening model.

Further, the feature data screening module includes: a window division module, configured to connect the base sequences of all autosomes in the human reference genome together and divide the connected base sequence into a series of windows of a fixed length, each window corresponding to a stretch of base sequence; a first calculation module, configured to calculate the window reference depth and the weight of each window, where the window reference depth is the average of the depth values of the training-set samples in the window, the weight is the square of the variance of the depth values of the training-set samples in the window, and the depth value of a window is the number of base sequence fragments in a sample's sequencing data that align to the base sequence corresponding to the window; a second calculation module, configured to calculate the window sample depth of a specified sample in the training set in each window; a third calculation module, configured to calculate the difference between the window sample depth and the window reference depth; a fourth calculation module, configured to multiply the difference by the weight to obtain the weighted difference of the window; a fifth calculation module, configured to combine a variable number of windows to form different regions and sum the weighted differences of all windows in a specified region to obtain the weighted difference sum; a numerical transformation module, configured to apply a numerical transformation to the weighted difference sum to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region; a sixth calculation module, configured to calculate, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples in the training set; and a feature data screening sub-module, configured to select a predetermined number of regions with the largest differences as feature data.

Further, the feature data screening module also includes a normalization processing module.

Further, the feature data screening sub-module includes: a first calculation unit, configured to calculate, for a specific region, the WFDD value of each sample in the benign nodule population samples and the malignant nodule population samples; a second calculation unit, configured to calculate the average WFDD value of the benign nodule population samples and the average WFDD value of the malignant nodule population samples respectively; a third calculation unit, configured to calculate the discrimination value of the specific region according to the following formula; and a selection unit, configured to select a predetermined number of regions with the largest discrimination values as feature data:

where t is the discrimination value of a specific region, and the remaining symbols are as defined for the construction method of the first aspect.

Further, the feature data screening sub-module includes an initial region generation unit and a region combination and splitting unit, wherein the initial region generation unit includes: a window division element, configured to divide the genome into a series of windows; a window numbering element, configured to number the windows in order; a region numbering element, configured to number each region obtained by combining windows using the numbers of its windows; and a window combination element, configured to combine the n consecutive windows starting at window i with another n consecutive windows starting 2^j·n windows downstream to form an initial region:

x_i = {i, i+1, i+2, ..., i+n-1, i+2^j·n, i+2^j·n+1, i+2^j·n+2, ..., i+2^j·n+n-1},

i = 1, n, 2n, ..., N; j = 1, 2, ..., 8; i + 2^(j+1)·n ≤ N

The region combination and splitting unit includes: a first offspring selection component, configured to use a genetic algorithm in which two parents are combined to exchange information and generate offspring, all initial regions being put into a region pool from which parents are randomly selected to generate offspring, wherein:

-The probability that region i is selected as one of the parents is:

where N is the total number of divided windows, and m_i is the average of the window numbers included in region i;

a second offspring selection component, configured to take, after the parents are selected, the union of the windows included in the parents and randomly delete several of the windows to obtain the offspring, the windows to delete being chosen by sampling with replacement; and

a third offspring selection component, configured to put the offspring into the region pool for the next round of selection after the offspring are obtained, the parents not being deleted from the region pool in this operation, wherein the offspring generated by combining region P₁ and region P₂ is:

child(P₁, P₂) = P₁ ∪ P₂ - S(p, P₁ ∪ P₂),

where S(p,s) is a subset obtained by drawing a proportion p of the elements of the set s with replacement; and

an offspring repeat-generation component, configured to repeat the process of generating offspring;

where t is the discrimination value of a specific region, and the remaining symbols are as defined for the construction method of the first aspect.

Further, the predetermined number of areas is 1-500, more preferably 10-100, even more preferably 50.

According to a fourth aspect of the present invention, the present invention proposes a pulmonary nodule screening device, including: a first calculation module configured to calculate the sum of the weighted fragment distribution difference values of the sample to be tested in a selected predetermined number of regions to obtain the total WFDD value; an input module configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model constructed by the construction device of the third aspect of the present invention; and an output module configured to output the screening result of the sample to be tested; wherein the selected predetermined number of regions are the same as the predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples.

Further, the input module includes: an input unit configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model; and a determination unit configured to compare the total WFDD value of the sample to be tested with a predetermined threshold to determine the type of pulmonary nodule of the sample to be tested.

Further, the screening device further includes a predetermined threshold acquisition module, which includes: a first calculation unit configured to calculate the total WFDD value of each sample in the training set in the predetermined number of regions with the largest difference in WFDD values; and a second calculation unit configured to calculate the optimal segmentation point based on the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal segmentation point being the predetermined threshold.

According to a fifth aspect of the present invention, the present invention proposes a computer-readable storage medium. The storage medium includes a stored program which, when run, executes the construction method according to the first aspect of the present invention or the pulmonary nodule screening method according to the second aspect of the present invention.

According to the sixth aspect of the present invention, the present invention proposes a processor. The processor is configured to run a program which, when running, executes the method for constructing a pulmonary nodule screening model according to the first aspect of the present invention or the pulmonary nodule screening method according to the second aspect of the present invention.

The technical solution of the present invention provides a method for distinguishing malignant tumors at a given site from other, non-cancer diseases at that site (such as nodules). The results show that the method of the present invention performs significantly better than existing CT scanning and close to invasive tissue biopsy. The method yields a product that can non-invasively detect the type (benign/malignant) of human nodules; for example, a blood test can determine whether a patient's lung nodules are malignant, thereby avoiding invasive examinations.

Brief description of the drawings

The drawings, which constitute a part of this application, are used to provide a further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:

Figure 1 shows a flow chart of a method for constructing a pulmonary nodule screening model according to the present invention.

Figure 2 shows the calculation method of the weighted fragment distribution difference (WFDD) according to the present invention, in which: A shows the calculation of the weighted difference of a sample in a specified window; B shows the calculation of the accumulated value of the i-th window; and C shows the calculation of the WFDD of the sample in a specified area.

Figure 3 shows an example of a benign correlation distribution pattern obtained by modeling according to an embodiment of the present invention.

Figure 4 shows an example of a malignant correlation distribution pattern obtained by modeling according to an embodiment of the present invention.

Figure 5 shows experimental results obtained by modeling and predicting validation set samples according to an embodiment of the present invention.

Figure 6 shows a flow chart of the pulmonary nodule screening method according to the present invention.

Figure 7 shows a device for constructing a pulmonary nodule screening model according to the present invention.

Figure 8 shows the feature data screening module in the device for constructing a pulmonary nodule screening model according to the present invention.

Figure 9 shows the feature data screening sub-module in the feature data screening module in the device for building a pulmonary nodule screening model according to the present invention.

Figure 10 shows the feature data screening sub-module in the feature data screening module in the device for building a pulmonary nodule screening model according to the present invention.

Figure 11 shows a pulmonary nodule screening device according to the present invention.

Figure 12 shows the input module of the pulmonary nodule screening device according to the present invention.

Detailed description

The following description is presented to enable a person of ordinary skill in the art to make and use the various embodiments. Descriptions of specific devices, techniques, and applications are provided as examples only. Various modifications to the examples described herein will be apparent to those of ordinary skill in the art, and the general principles defined herein may be applied to other examples and applications without departing from the scope of the various embodiments. Accordingly, the various implementations are not intended to be limited to the examples described and illustrated herein but are to be consistent with the scope of the claims. It should be noted that, as long as there is no conflict, the embodiments and features in the embodiments of this application can be combined with each other. The present invention will be described in detail below with reference to the accompanying drawings and embodiments.

As described in the background section, the value of cell-free DNA as a new diagnostic marker has been confirmed in a variety of solid tumors, and an increasing number of studies have developed early cancer screening methods based on whole-genome sequencing or methylation sequencing of cell-free DNA. Although these methods have proliferated, many problems remain. First, existing methods can do little more than detect cancer patients among healthy people; they have difficulty accurately distinguishing tumors at a given site from non-cancer diseases at that site, and in particular may misidentify patients with inflammation as cancer patients. Second, the accuracy of existing non-invasive diagnosis methods for pulmonary nodules based on tumor-specific ctDNA in patient plasma is low and does not meet clinical requirements.

Due to the above problems, traditional early screening programs need to be modified to be able to distinguish a certain subset of malignant tumors from other non-cancer types of diseases.

FIG. 1 shows a flow chart of a method for constructing a pulmonary nodule screening model according to the present invention. The method according to an embodiment of the present invention constructs a model based on the distribution characteristics, on a reference genome, of DNA fragment sequence reads in second-generation sequencing data, thereby distinguishing nodules of different types (benign/malignant). Existing literature (Maternal Malignancies Detected With Noninvasive Prenatal Testing Reply. JAMA: the Journal of the American Medical Association; 2015 Nov 24;314(20):2192-3 [DOI:10.1001/jama.2015.12922]) has reported detecting copy number variations from cfDNA sequencing data of the blood of cancer patients, meaning that in the region where a CNV occurs, the distribution of DNA fragments in cancer patients differs from a baseline representative of healthy people. We hypothesized that the distribution of DNA fragments in certain regions differs between patients with benign nodules and patients with malignant nodules. Therefore, to better describe the characteristics of cfDNA fragment distribution in a region, we proposed the concept of the weighted fragment distribution difference (WFDD). CNV focuses only on the difference between the total number of fragments of a sample in a region and the baseline, whereas WFDD focuses on the details of how the fragment distribution differs across the region. We developed a pulmonary nodule screening model to describe this detail. Specifically, the construction method of the pulmonary nodule screening model includes:

In step 1 of Figure 1, the base sequences of all autosomes in the reference genome are first concatenated and divided into a series of windows of a fixed length. Here, the fixed length is in the range of 100 bp-100 kbp, preferably 10 kbp-50 kbp, and more preferably 30 kbp. Because the division is performed on the concatenated autosomes, some windows may span two chromosomes. To exclude possible interference from sex-related factors, sex chromosome data were not used.
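The window division described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function name `make_windows`, the toy chromosome names and lengths, and the choice to drop a trailing window shorter than the fixed length are all assumptions made for the example.

```python
from bisect import bisect_right
from itertools import accumulate


def make_windows(chrom_lengths, window_size):
    """Split the concatenation of the given chromosomes into fixed-length windows.

    chrom_lengths: ordered list of (name, length) pairs (toy values below).
    Returns a list of windows; each window is a list of (name, start, end)
    pieces in 0-based half-open chromosome coordinates. A window that crosses
    a chromosome boundary has two or more pieces.
    """
    names = [name for name, _ in chrom_lengths]
    ends = list(accumulate(length for _, length in chrom_lengths))  # cumulative ends
    total = ends[-1]
    windows = []
    # Assumption: only full-length windows are kept; a trailing partial window is dropped.
    for start in range(0, total - window_size + 1, window_size):
        stop = start + window_size
        pieces = []
        pos = start
        while pos < stop:
            k = bisect_right(ends, pos)  # index of the chromosome containing pos
            chrom_start = ends[k - 1] if k else 0
            piece_end = min(stop, ends[k])
            pieces.append((names[k], pos - chrom_start, piece_end - chrom_start))
            pos = piece_end
        windows.append(pieces)
    return windows


# Hypothetical toy genome: two "chromosomes" of 70 and 50 bases, 30-base windows.
toy = [("chrA", 70), ("chrB", 50)]
windows = make_windows(toy, 30)
for number, pieces in enumerate(windows, start=1):
    print(number, pieces)
```

In this toy case the third window is made of the last 10 bases of `chrA` and the first 20 bases of `chrB`, which is the chromosome-spanning situation mentioned in the text.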

Steps 2 to 6 of Figure 1 show a method of calculating the weighted fragment distribution difference (WFDD) according to an embodiment of the present invention. In step 2 of Figure 1, the window reference depth and the weight are calculated for each window: the average of the depth values of the training set samples in the window (each sample provides one depth value) is used as the window reference depth, and the square of the variance of these depth values serves as the weight of the window. Thereafter, the window sample depth in each window is calculated for a specified sample in the training set. In a preferred embodiment, the window reference depths of the windows in a specified region are normalized so that their average value is 0 and their standard deviation is 1, giving the normalized window reference depths. The reason is that WFDD considers only depth differences at the window level; to eliminate the effect on WFDD of differences in total depth between samples in the region, the reference depths of the windows contained in a region must be normalized before WFDD is calculated. The depths of the specified sample in these windows undergo the same normalization to give the normalized window sample depths. In step 3 of Figure 1, the difference between the window sample depth and the window reference depth is calculated; in the preferred embodiment, this is the difference between the normalized window sample depth and the normalized window reference depth. In step 4 of Figure 1, the difference is multiplied by the weight to obtain the weighted difference of the window. Windows whose depth fluctuates more between samples therefore carry greater weight, and multiplying the difference between the sample depth and the reference depth in such a window by its weight amplifies the difference, that is, amplifies the distribution difference signal. Subsequently, in step 5 of Figure 1, varying numbers of windows are combined to form different regions, and the weighted differences of all windows in a specified region are summed; a numerical transformation is applied to the sum of the weighted differences (i.e., the last cumulative value of the summation) to obtain the WFDD of the specified sample in the specified region. In step 6 of Figure 1, the difference in WFDD value between the benign nodule population samples and the malignant nodule population samples of the training set is calculated for each region, and a predetermined number of regions with the largest differences are selected as feature data. In one embodiment, the number of selected regions is 1-500, preferably 10-100, and more preferably 50. The feature data are then used to build the pulmonary nodule screening model.
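Steps 2 to 5 can be sketched in a few lines of Python. This is a hedged illustration under several assumptions: the function names are hypothetical, population variance and population standard deviation are used (the text does not say which estimator applies), and the numerical transformation f is left as an identity placeholder because its form is given only in Figures 3 and 4.

```python
from statistics import mean, pstdev, pvariance


def zscore(values):
    """Scale values to mean 0 and standard deviation 1 (population SD assumed).

    Raises ZeroDivisionError if all values are identical.
    """
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def weighted_sum(train_depths, sample_depths, region):
    """Sum of weighted differences of one sample over one region.

    train_depths: list of per-sample depth lists (training set), one value per window.
    sample_depths: depth list of the sample being scored.
    region: 0-based window indices that make up the region.
    """
    ref, weight = [], []
    for w in region:
        column = [depths[w] for depths in train_depths]
        ref.append(mean(column))               # window reference depth
        weight.append(pvariance(column) ** 2)  # weight: square of the variance
    ref_norm = zscore(ref)  # normalized using the windows of this region only
    sample_norm = zscore([sample_depths[w] for w in region])
    return sum(wt * (s - r) for wt, s, r in zip(weight, sample_norm, ref_norm))


def wfdd(x, malignant_related, f=lambda v: v):
    """Apply the numerical transformation; f is an identity placeholder (assumption)."""
    return f(x) if malignant_related else -f(x)


# Hypothetical toy data: two training samples, three windows.
train = [[10, 20, 30], [12, 18, 36]]
x_same_shape = weighted_sum(train, [22, 38, 66], [0, 1, 2])
x_shifted = weighted_sum(train, [11, 19, 60], [0, 1, 2])
print(x_same_shape, x_shifted)
```

The first toy sample has exactly twice the reference depth in every window, so after normalization its weighted sum is zero up to rounding. This illustrates why normalization within the region removes the effect of total depth and leaves only the shape of the distribution.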

Referring now to FIG. 2, an example of a method of calculating the weighted fragment distribution difference according to one embodiment of the present invention is shown. A of Figure 2 shows an example of calculating the weighted difference of a sample in a specified window. In this example, the average of the depth values of the training set samples in the window (each sample provides one depth value) is used as the window reference depth, and the square of the variance of these depth values is used as the weight of the window. The window sample depth in each window is computed for the specified sample in the training set. The window reference depths of the windows in the specified region are normalized so that, within the same region, their average value is 0 and their standard deviation is 1. The depths of the specified sample in these windows undergo the same normalization to give the normalized window sample depths. The difference between the normalized window sample depth and the normalized window reference depth is calculated and multiplied by the weight to give the weighted difference of the window. B of Figure 2 shows an example of calculating the accumulated value at the i-th window. Here, the windows combined to form a region may be non-contiguous (non-adjacent). Note that when the WFDD of a sample in a region is calculated, the sample depth and the reference depth are normalized using only the windows of that region; that is, the reference depths of the windows in the same region are required to have an average of 0 and a standard deviation of 1, rather than those of all windows in the genome. C of Figure 2 shows the formula for the numerical transformation of the last accumulated value of the summation: WFDD = f(x) or WFDD = -1*f(x); the forms of f(x) in the two cases are shown in Figures 3 and 4, respectively.

More simply (without elaborating from the perspective of cumulative values), in one implementation, the empirical formula for calculating the WFDD of a sample in a specified area is:

When the region is a benign-related region: WFDD = -1*f(x)

When the region is a malignant-related region: WFDD = f(x)

where x is the sum of the weighted differences over the windows of the region:

$x = \sum_i \sigma_i^2 \, (x'_i - \bar{x}'_i)$

$x'_i$ is the normalized window sample depth of the specified sample in the i-th window,

$\bar{x}'_i$ is the normalized reference depth of the i-th window, and

$\sigma_i$ is the variance of the depth values of the training set samples in the i-th window.

Referring now to Figure 3, an example of a benign-related distribution pattern obtained by modeling in accordance with an embodiment of the present invention is shown. In a given region, if the WFDD fluctuation of samples from the benign pulmonary nodule population is greater than that of the malignant pulmonary nodule population, the fragment distribution pattern in this region is called a benign-related pattern. In Figure 3, the region in which the pattern is located includes 53 windows, and each polyline represents a sample. The left panel shows the weighted differences in these windows for cfDNA samples of 20 randomly selected patients each with benign and malignant pulmonary nodules (40 in total); the middle panel shows the cumulative weighted difference at each window; the right panel shows the function used in the benign-related pattern to transform the last cumulative value of each sample into WFDD, together with the results (box plot).

Referring now to Figure 4, an example of a malignant-related distribution pattern obtained by modeling according to an embodiment of the present invention is shown. If the WFDD fluctuation of samples from the malignant pulmonary nodule population is greater than that of the benign pulmonary nodule population, the fragment distribution pattern in this region is called a malignant-related pattern. In Figure 4, the region in which the pattern is located includes 231 windows. The meaning of each part of the figure is the same as in Figure 3, but the samples are not entirely the same. The left panel shows the weighted differences in these windows for cfDNA samples of 20 randomly selected patients each with benign and malignant pulmonary nodules (40 in total); the middle panel shows the cumulative weighted difference at each window; the right panel shows the function used in the malignant-related pattern to transform the last cumulative value of each sample into WFDD, together with the results (box plot).

In one implementation, the following formula can be used to normalize a set of values (such as the depth values of a sample in multiple windows):

$x' = \dfrac{x - \bar{x}}{S}$

where S is the standard deviation of this set of values, and $\bar{x}$ is its average value.

As described above, the regions in which the WFDD values of the two types of samples differ most are selected to build the model. To evaluate how much the WFDD of the two types of samples differs in a specific region, a discrimination value of the region is calculated from the WFDD values of the two types of samples. Specifically: for a specific region, calculate the WFDD value of each sample in the benign nodule population samples and the malignant nodule population samples; calculate the average of the WFDD values of the benign nodule population samples and the average of the WFDD values of the malignant nodule population samples, respectively; calculate the discrimination value of the specific region; and select a predetermined number of regions with the largest discrimination values as feature data. In one embodiment, the formula for calculating the discrimination value of a specific region is:

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$

where:

t is the discrimination value of the specific region;

$\bar{x}_1$ and $\bar{x}_2$ are the average WFDD values of the benign nodule population samples or the malignant nodule population samples, respectively; n₁ and n₂ are the numbers of values from the benign nodule population samples or the malignant nodule population samples, respectively; and S₁ and S₂ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples, respectively; provided that when the benign nodule population corresponds to subscript 1, the malignant nodule population corresponds to subscript 2, and when the benign nodule population corresponds to subscript 2, the malignant nodule population corresponds to subscript 1. A higher t value indicates that the WFDD values of the two groups of samples differ more in this region, that is, the region has a higher degree of discrimination.
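As an illustration of how such a discrimination value could be computed and used to rank regions, the following sketch assumes that t takes the form of a two-sample t statistic with unpooled variances, built from the group means, standard deviations and sample counts named above. That form, the use of the absolute value (so that the assignment of subscripts 1 and 2 does not matter), the sample standard deviation, and the function names are all assumptions, not details confirmed by the text.

```python
from math import sqrt
from statistics import mean, stdev


def discrimination(benign_wfdd, malignant_wfdd):
    """Two-sample t-type statistic on the WFDD values of one region (assumed form)."""
    x1, x2 = mean(benign_wfdd), mean(malignant_wfdd)
    s1, s2 = stdev(benign_wfdd), stdev(malignant_wfdd)  # sample SD (assumption)
    n1, n2 = len(benign_wfdd), len(malignant_wfdd)
    return abs(x1 - x2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def top_regions(scores, k):
    """Return the k region identifiers with the largest discrimination values."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical WFDD values of three benign and three malignant samples in one region.
t_value = discrimination([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t_value, 4))

# Hypothetical discrimination values of three regions.
print(top_regions({"r1": 1.0, "r2": 3.0, "r3": 2.0}, 2))
```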

Because regions are formed by combining subsets of the divided windows, the number of possible regions theoretically exceeds what existing computers can handle. To find potential high-discrimination regions more efficiently, in one embodiment, an improved genetic algorithm can be used for the search. The improved genetic algorithm randomly merges and splits a series of regions (initial regions) obtained by a simple strategy, and guides the generation of regions with larger t values. Specifically, the steps to search for potential high-discrimination regions include:

a.) Generate initial regions:

Divide the genome into a series of windows. Suppose the genome has been divided into N windows; each window is represented by its sequence number (1, 2, ..., N), and a region obtained by combining windows is represented by the set of their numbers (such as {1, 2, 3, 10, 11, 12}). The n consecutive windows starting at window i are combined with another n consecutive windows starting $2^j n$ windows downstream, so that a total of 2n windows form one initial region. That is:

$x_i = \{i, i+1, \ldots, i+n-1, \; i+2^j n, i+2^j n+1, \ldots, i+2^j n+n-1\}$

$i = 1, n, 2n, \ldots, N; \quad j = 1, 2, \ldots, 8; \quad i + (2^j + 1)n \le N$

where n is the given number of consecutive windows, with a value in the range 1-100, preferably 5-50, and more preferably 5, and N is the total number of divided windows. The two runs of consecutive windows are required to be separated by a certain distance so that a region can span long distances.
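The generation of initial regions can be sketched as follows. This is an illustrative reading of the step, with two labeled assumptions: start windows are taken at 1, 1+n, 1+2n, and so on, and a region is kept only if its last window number does not exceed N. The function name `initial_regions` is hypothetical.

```python
def initial_regions(N, n, max_j=8):
    """Enumerate initial regions as lists of 1-based window numbers.

    Each region joins n consecutive windows starting at window i with another
    n consecutive windows starting (2 ** j) * n windows downstream.
    """
    regions = []
    for i in range(1, N + 1, n):  # assumption: start windows spaced n apart
        for j in range(1, max_j + 1):
            offset = (2 ** j) * n
            last = i + offset + n - 1  # number of the last window in the region
            if last > N:               # assumption: the region must fit within N windows
                break                  # a larger j only moves the second block further out
            first_block = list(range(i, i + n))
            second_block = list(range(i + offset, i + offset + n))
            regions.append(first_block + second_block)
    return regions


# Hypothetical toy setting: 20 windows, runs of 2 consecutive windows.
regions = initial_regions(N=20, n=2)
print(regions[:3])
```

For the start window 1 in this toy setting, the second block begins 4, 8 and 16 windows downstream, which shows how increasing j lets a region span progressively longer distances.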

b.) Combination and splitting of regions:

Using a genetic algorithm, two parents combine to exchange information and produce offspring; all initial regions are put into a region pool, from which parents are randomly selected to generate offspring. Here, the probability of region i being selected as one of the parents is:

Among them, N is the total number of divided windows.

When region x is selected as the first parent, the probability that another region i is selected as the second parent is:

Among them, N is the total number of divided windows, and m _i is the average of the window numbers included in area i. Here, _the | _m

After the parents are selected, the union of the windows included in the parents is taken, and a number of these windows are randomly deleted to give the offspring. In one embodiment, the proportion of windows randomly deleted is 1%-99%, preferably 5%-50%, and more preferably 20%, drawn by sampling with replacement. The offspring is then put into the region pool for the next round of selection; the parents are not deleted during this operation. The offspring produced by combining region P₁ and region P₂ is:

child(P₁, P₂) = P₁ ∪ P₂ - S(p, P₁ ∪ P₂)

where S(p, s) is the subset obtained by drawing a proportion p of the elements of the set s with replacement.
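The offspring formula can be sketched as below. This is a minimal illustration in which the function name `make_child`, the rounding of the number of draws, and the optional random generator argument are assumptions. Because the draws are made with replacement, the same window may be drawn more than once, so fewer than a proportion p of the distinct windows may actually be removed.

```python
import random


def make_child(p1, p2, p, rng=random):
    """Return P1 ∪ P2 minus S(p, P1 ∪ P2), with S drawn with replacement."""
    union = sorted(set(p1) | set(p2))
    k = round(p * len(union))             # number of draws (rounding is an assumption)
    drawn = set(rng.choices(union, k=k))  # sampling with replacement
    return set(union) - drawn


# Hypothetical parents, given as sets of window numbers.
rng = random.Random(0)
child = make_child({1, 2, 3}, {3, 4, 5}, p=0.4, rng=rng)
print(sorted(child))
```

Keeping the parents in the pool, as the text specifies, means that a deletion which happens to lower the discrimination value does not lose the better parent region.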

The process of producing offspring is repeated many times. In one embodiment, the number of repetitions is 1 to 1 million; in a preferred embodiment, the number of repetitions is 100 to 100,000; and in a more preferred embodiment, the number of repetitions is 300,000. Finally, a number of regions with the greatest discrimination are selected as features to build the model and make predictions.

Referring now to Figure 6, a flow chart of a pulmonary nodule screening method according to the present invention is shown. The pulmonary nodule screening method includes: selecting a certain number of regions; to determine the type of a specific sample based on these regions, calculating the WFDD values of the sample in these regions and summing them to obtain a total WFDD value; and comparing the total WFDD value with a predetermined threshold, the type of the sample being determined according to whether the total WFDD value is greater than or less than the threshold.

As used herein, the threshold is calculated from the samples of the training set. First, the total WFDD value of each training set sample over these regions is calculated; the optimal split point is then calculated from the total WFDD values of the benign nodule population and the malignant nodule population in the training set. This split point is the required threshold. The optimal split point can be calculated using the roc function (pROC package for the R language).
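The text names the roc function of the R package pROC for this step. As a language-neutral illustration, the following Python sketch picks a split point by maximizing sensitivity + specificity - 1 (Youden's J), which is one common criterion; the text does not state which criterion was used, so this choice is an assumption. The function names, the use of midpoints between observed totals as candidate split points, and the rule that totals above the threshold are called malignant are also assumptions.

```python
def best_threshold(benign_totals, malignant_totals):
    """Split point maximizing sensitivity + specificity - 1 (Youden's J)."""
    values = sorted(set(benign_totals) | set(malignant_totals))
    # Candidate split points: midpoints between consecutive observed totals.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_cut, best_j = None, -1.0
    for cut in candidates:
        sens = sum(v > cut for v in malignant_totals) / len(malignant_totals)
        spec = sum(v <= cut for v in benign_totals) / len(benign_totals)
        j = sens + spec - 1
        if j > best_j:  # on ties the first (lowest) candidate is kept
            best_cut, best_j = cut, j
    return best_cut


def classify(region_wfdds, threshold):
    """Sum the WFDD values over the selected regions and compare with the threshold."""
    total = sum(region_wfdds)
    return "malignant" if total > threshold else "benign"  # direction is an assumption


# Hypothetical total WFDD values of training samples.
threshold = best_threshold([-3.0, -2.0, -1.0], [1.0, 2.0, 3.0])
print(threshold, classify([0.5, 0.7], threshold), classify([-0.4, 0.1], threshold))
```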

Referring now to FIG. 7, a device for constructing a pulmonary nodule screening model according to the present invention is shown. Specifically, the construction device of the present invention includes: a feature data screening module, which is configured to screen, within the entire range of the human reference genome, a predetermined number of regions in which the weighted fragment distribution difference (WFDD) values of the benign nodule population samples and the malignant nodule population samples in the training set differ most, as feature data; and a building module, which is configured to use the feature data to build a pulmonary nodule screening model.

Referring now to FIG. 8, a feature data screening module in a device for constructing a pulmonary nodule screening model according to the present invention is shown. Specifically, the feature data screening module of the present invention includes: a window division module, which is configured to concatenate the base sequences of all autosomes in the human reference genome and divide the concatenated base sequence into a series of windows of a fixed length, each window corresponding to one base sequence; a first calculation module, which is configured to calculate the window reference depth and the weight of each window, where the window reference depth is the average of the depth values of the training set samples in the window, the weight is the square of the variance of the depth values of the training set samples in the window, and the depth value of a window is the number of base sequence fragments in the sequencing data of a sample that can be aligned to the base sequence corresponding to the window; a second calculation module, which is configured to calculate the window sample depth of a specified sample in the training set in each window; a third calculation module, which is configured to calculate the difference between the window sample depth and the window reference depth; a fourth calculation module, which is configured to multiply the difference by the weight to obtain the weighted difference of the window; a fifth calculation module, which is configured to combine varying numbers of windows to form different regions and to sum the weighted differences of all windows in a specified region to obtain the sum of the weighted differences; a numerical transformation module, which is configured to apply a numerical transformation to the sum of the weighted differences to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region; a sixth calculation module, which is configured to calculate, for each region, the difference in WFDD value between the benign nodule population samples and the malignant nodule population samples in the training set; and a feature data screening sub-module, which is configured to select a predetermined number of regions with the largest differences as feature data. Optionally, as marked with a dotted line in Figure 8, the feature data screening module also includes a normalization module, which is configured to normalize the window reference depth and the window sample depth. Optionally, after normalization, the window reference depths of the windows have an average value of 0 and a standard deviation of 1.

Referring now to FIG. 9, a feature data screening sub-module in the feature data screening module of the device for constructing a pulmonary nodule screening model according to the present invention is shown. Specifically, the feature data screening sub-module of the present invention includes: a first calculation unit, which is configured to calculate, for a specific region, the WFDD value of each sample in the benign nodule population samples and the malignant nodule population samples; a second calculation unit, which is configured to calculate the average WFDD value of the benign nodule population samples and the average WFDD value of the malignant nodule population samples, respectively; a third calculation unit, which is configured to calculate the discrimination value t of the specific region according to the same formula as described above for the construction method, with the same definitions of the group averages, n₁, n₂, S₁ and S₂; and a selection unit, which is configured to select a predetermined number of regions with the largest discrimination values as feature data.

Referring now to Figure 10, a feature data screening sub-module in the feature data screening module of the device for constructing a pulmonary nodule screening model according to the present invention is shown. The feature data screening sub-module includes an initial region generation unit and a region combination and splitting unit.

The initial region generation unit includes: a window division element, which is configured to divide the genome into a series of windows; a window numbering element, which is configured to number the windows in their order; a region numbering element, which is configured to represent a region obtained by combining windows by the series of their window numbers; and a window combination element, which is configured to combine the n consecutive windows starting at window i with another n consecutive windows starting $2^j n$ windows downstream to form an initial region: $x_i = \{i, i+1, i+2, \ldots, i+n-1, \; i+2^j n, i+2^j n+1, i+2^j n+2, \ldots, i+2^j n+n-1\}$; $i = 1, n, 2n, \ldots, N$; $j = 1, 2, \ldots, 8$; $i + (2^j + 1)n \le N$; where n is the given number of consecutive windows, and N is the total number of divided windows.

The region combination and splitting unit includes: a first offspring selection element, which is configured to use a genetic algorithm in which two parents combine to exchange information and generate offspring, all initial regions being put into a region pool and randomly selected to generate offspring, where: - the probability that region i is selected as one of the parents is:

where N is the total number of divided windows and t _i is the t value of the i-th window; - when region x is selected as the first parent, the probability that another region i is selected as the second parent is:

where N is the total number of divided windows and m _i is the average of the window numbers included in region i; a second offspring selection element, which is configured to, after the parents are selected, take the union of the windows included in the parents and randomly delete a number of these windows to give the offspring, the random selection being sampling with replacement; a third offspring selection element, which is configured to put the offspring into the region pool for the next round of selection, without deleting the parents from the region pool, where the offspring produced by combining region P₁ and region P₂ is: child(P₁, P₂) = P₁ ∪ P₂ - S(p, P₁ ∪ P₂), where S(p, s) is the subset obtained by drawing a proportion p of the elements of the set s with replacement; and an offspring repetition element, which is configured to repeat the process of generating offspring; - the discrimination value of each generated offspring is calculated by the same formula as described above for the construction method, in which t is the discrimination value of a specific region, and a predetermined number of regions with the largest discrimination values are selected as feature data.

Referring now to Figure 11, a pulmonary nodule screening device according to the present invention is shown. Specifically, the screening device includes: a first calculation module, which is configured to calculate the sum of the weighted fragment distribution difference values of the sample to be tested in a selected predetermined number of regions to obtain a total WFDD value; an input module, which is configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model of the present invention; and an output module, which is configured to output the screening result of the sample to be tested; wherein the selected predetermined number of regions are the same as the predetermined number of regions in which the weighted fragment distribution difference (WFDD) values of the benign nodule population samples and the malignant nodule population samples differ most.

Referring now to Figure 12, an input module of the pulmonary nodule screening device according to the present invention is shown. Specifically, the input module includes: an input unit, which is configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model; and a determination unit, which is configured to compare the total WFDD value of the sample to be tested with a predetermined threshold to determine the type of pulmonary nodule in the sample to be tested. As shown in the dotted box in Figure 12, in some embodiments, the screening device further includes a predetermined threshold acquisition module, which includes: a first calculation unit, which is configured to calculate, for each sample in the training set, the total WFDD value over the predetermined number of regions with the largest WFDD value difference; and a second calculation unit, which is configured to calculate the optimal split point based on the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal split point being the predetermined threshold; more preferably, the optimal split point is calculated using the roc function of the pROC package for the R language analysis platform.

Embodiments of the present invention provide a computer-readable storage medium on which a program is stored. When run, the program executes the method for constructing a pulmonary nodule screening model of the present invention or the pulmonary nodule screening method of the present invention.

Embodiments of the present invention provide a processor, and the processor is configured to run a program. When the program is run, the method for establishing a pulmonary nodule screening model of the present invention or the pulmonary nodule screening method of the present invention is executed. The present invention will be described in further detail below with reference to specific examples. These examples shall not be construed as limiting the scope of protection claimed by the present invention.

Example 1: Establishing a pulmonary nodule screening model

Blood samples from 639 patients with untreated pulmonary nodules were collected, including 484 patients with malignant nodules (85% of which were stage I) and 155 patients with benign nodules. Patients with malignant pulmonary nodules only included patients with non-small cell lung cancer. All patients have been anonymized and have given consent for their samples to be used in clinical research.

Use EDTA tubes to collect whole blood and process it immediately. If it cannot be processed immediately, store it at 4°C for no more than 1 day. Centrifuge at 1600g for 10 minutes at 4°C to separate plasma from cellular components. Centrifuge the plasma again at 16000g for 10 minutes at 4°C to remove possible cellular residues, and store it at -80°C until use.

Use the MagPure Circulating DNA KF Kit (Magen) to extract cfDNA from 200 µL of plasma, use the MGIEasy Cell-free DNA Library Prep Set (MGI) to construct a standard second-generation sequencing library from the extracted cfDNA, and sequence the library on the MGISEQ-2000 platform, finally obtaining whole-genome sequencing data at a sequencing depth of approximately 0.5-1.0x for each sample.

Use Sentieon software to process the sequencing data (including alignment, sorting and deduplication), and use the readCounter software to count, for each sample, the number of reads aligned to each 1 kbp region of the autosomes, that is, the reads-per-kb depth value; then sum every 30 consecutive depth values to obtain the depth of a range 30 kbp long. In this embodiment, 30 kbp is the length of a window. readCounter is not run directly in units of 30 kbp, so that the depth of as many sites as possible is counted and the information lost by discarding windows of insufficient length is reduced. The same operation is performed for each sample, resulting in a depth value matrix with a length of 95833 (number of windows) and a width of 639 (number of samples).
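The aggregation of 1 kbp depth values into 30 kbp window depths can be sketched as follows. The function name `aggregate_depths` is hypothetical, the input is assumed to be the per-1-kbp read counts of the autosomes already concatenated in order, and dropping a trailing group of fewer than 30 values is an assumption.

```python
def aggregate_depths(depth_1kb, bins_per_window=30):
    """Sum consecutive groups of 1 kbp depth values into window depths."""
    n_windows = len(depth_1kb) // bins_per_window  # a trailing partial window is dropped
    return [
        sum(depth_1kb[k * bins_per_window:(k + 1) * bins_per_window])
        for k in range(n_windows)
    ]


# Hypothetical toy input: seven 1 kbp counts grouped three at a time.
print(aggregate_depths([0, 1, 2, 3, 4, 5, 6], bins_per_window=3))
```

Applying the function to every sample and stacking the results gives the windows-by-samples depth matrix described in the text.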

31 samples and 30 samples were randomly selected from the malignant pulmonary nodule sample set and the benign nodule sample set, respectively, as the validation set, and the remaining samples were used as the training set to extract the relevant feature regions. In this embodiment, the 10 benign-related features and the 10 malignant-related features with the highest discrimination were extracted to build a model, which was used to predict the validation set samples. The prediction results are shown in Figure 5.

The results show that when the model distinguishes the two types of samples in the validation set, the AUC is approximately 0.954 (95% CI: 0.908-1.000), indicating that the model has excellent performance in distinguishing the type of pulmonary nodules in patients based on their blood cfDNA.

In addition, specificity and sensitivity were calculated according to the following formulas:

Specificity = number of true negatives/(number of true negatives + number of false positives)*100% (rate of correctly identifying non-patients);

Sensitivity = number of true positives/(number of true positives + number of false negatives)*100% (ratio of correctly identified patients);

The specificity and sensitivity of the method of the present invention were both about 0.8, and the result of specificity × sensitivity was about 0.64.

In addition, CT scans were also performed on the above 639 patients with untreated pulmonary nodules. The specificity of the CT scan-based method was calculated to be about 0.3, which means that about 70% of benign patients were judged malignant or could not be judged; the sensitivity of the CT scan-based method was about 0.93; and the result of specificity × sensitivity was about 0.28.
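The two formulas and the specificity × sensitivity products quoted above can be checked with a short sketch. The confusion counts in the example are purely hypothetical and chosen only to give round numbers; they are not the counts of this study.

```python
def specificity(tn, fp):
    """True negatives / (true negatives + false positives)."""
    return tn / (tn + fp)


def sensitivity(tp, fn):
    """True positives / (true positives + false negatives)."""
    return tp / (tp + fn)


# Hypothetical counts: 8 of 10 benign and 8 of 10 malignant samples called correctly.
spec = specificity(tn=8, fp=2)
sens = sensitivity(tp=8, fn=2)
print(spec, sens)

# Products of the approximate values reported in the text.
product_model = round(0.8 * 0.8, 2)  # method of the invention
product_ct = round(0.3 * 0.93, 2)    # CT scan-based method
print(product_model, product_ct)
```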

It can be seen that, compared with the prior-art CT scanning method, the method of the present invention has significantly higher specificity and comparable sensitivity, and thus a significantly higher specificity × sensitivity result, indicating that the model performs excellently in determining the type of a patient's pulmonary nodule from the patient's blood cfDNA.

Example 2: Obtaining a pulmonary nodule screening method

This embodiment proposes a pulmonary nodule screening method, which includes the following steps:

Calculate the sum of the weighted fragment distribution difference values of the sample to be tested in the selected predetermined number of regions to obtain the total WFDD value; input the total WFDD value of the sample to be tested into the pulmonary nodule screening model constructed according to the construction method of the first aspect of the present invention, or into the pulmonary nodule screening model according to the second aspect of the present invention, for screening; and output the screening result of the sample to be tested; wherein the selected predetermined number of regions are the same as the predetermined number of regions in which the weighted fragment distribution difference (WFDD) values of the benign nodule population samples and the malignant nodule population samples differ most.

Optionally, the pulmonary nodule screening method also includes: inputting the total WFDD value of the sample to be tested into the pulmonary nodule screening model, the pulmonary nodule screening model determining the type of pulmonary nodule in the sample to be tested by comparing the total WFDD value of the sample to be tested with a predetermined threshold.

Optionally, the predetermined threshold is obtained as follows: calculate, for each sample in the training set, the total WFDD value over the predetermined number of regions with the largest WFDD value difference; and calculate the optimal split point based on the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal split point being the predetermined threshold.

Optionally, calculate the optimal split point using the roc function from the pROC package of the R language analysis platform.

Example 3: Obtaining a device for constructing a pulmonary nodule screening model

The present invention proposes a device for constructing a pulmonary nodule screening model. The device includes: a feature data screening module, which is configured to screen, within the entire range of the human reference genome, a predetermined number of regions in which the weighted fragment distribution difference (WFDD) values of the benign nodule population samples and the malignant nodule population samples in the training set differ most, as feature data; and a building module, which is configured to use the feature data to build a pulmonary nodule screening model.

Optionally, the feature data screening module includes: a window division module, configured to concatenate the base sequences of all autosomes in the human reference genome and divide the concatenated base sequence into a series of windows of fixed length, each window corresponding to a base sequence; a first calculation module, configured to calculate the window reference depth and the weight of each window, where the window reference depth is the average depth value of the samples in the training set in that window, the weight is the square of the variance of the depth values of the samples in the training set in that window, and the depth value of a window is the number of base sequence fragments in the sequencing data of a sample that can be aligned to the base sequence corresponding to the window; a second calculation module, configured to calculate the window sample depth of a specified sample in the training set in each window; a third calculation module, configured to calculate the difference between the window sample depth and the window reference depth; a fourth calculation module, configured to multiply the difference by the weight to obtain the weighted difference of the window; a fifth calculation module, configured to combine a variable number of windows into different regions and to sum the weighted differences of all windows in a specified region to obtain the sum of weighted differences; a numerical transformation module, configured to apply a numerical transformation to the sum of weighted differences to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region; a sixth calculation module, configured to calculate, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples in the training set; and a feature data screening sub-module, configured to screen the predetermined number of regions with the largest differences as the feature data.
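The chain of calculation modules can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the text does not specify the numerical transformation, so an identity function is used as a placeholder, and the population variance is assumed where the text says only "variance". The names `window_reference` and `wfdd` and the depth values are hypothetical.

```python
from statistics import mean, pvariance


def window_reference(training_depths):
    """training_depths: one list per training sample, one depth value per window.

    Returns (reference depth, weight) lists, one entry per window.
    """
    n_windows = len(training_depths[0])
    reference, weight = [], []
    for w in range(n_windows):
        column = [sample[w] for sample in training_depths]
        reference.append(mean(column))
        # Weight follows the wording of the text: the square of the variance.
        # Population variance is an assumption; the text does not say which.
        weight.append(pvariance(column) ** 2)
    return reference, weight


def wfdd(sample_depths, reference, weight, region, transform=lambda x: x):
    """WFDD of one sample in one region (a list of window indices).

    `transform` stands in for the unspecified numerical transformation.
    """
    total = sum((sample_depths[w] - reference[w]) * weight[w] for w in region)
    return transform(total)


# Made-up depths: three training samples, three windows.
training = [[10, 20, 30], [12, 20, 34], [14, 20, 38]]
ref, weight = window_reference(training)
score = wfdd([14, 20, 38], ref, weight, region=[0, 2])
```

A region is represented simply as a list of window indices, which accommodates both continuous and discontinuous window combinations.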

Optionally, the feature data screening module also includes a normalization processing module.

Optionally, the average value of the window reference depth of each window after the normalization process is 0, and the standard deviation is 1.
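A normalization with mean 0 and standard deviation 1 corresponds to a z-score transform. The following Python sketch assumes that the depth values are standardized across the windows of one depth profile; the text does not state the axis of normalization, and the function name `zscore` is hypothetical.

```python
from statistics import mean, pstdev


def zscore(depths):
    """Rescale a list of window depths to mean 0 and standard deviation 1."""
    mu = mean(depths)
    sigma = pstdev(depths)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("constant depths cannot be standardized")
    return [(d - mu) / sigma for d in depths]


normalized = zscore([10, 20, 30, 40])
```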

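Optionally, the feature data screening sub-module includes: a first calculation unit, configured to calculate, for a specific region, the WFDD value of each sample in the benign nodule population samples and the malignant nodule population samples; a second calculation unit, configured to calculate the average WFDD value of the benign nodule population samples and the average WFDD value of the malignant nodule population samples respectively; a third calculation unit, configured to calculate the discrimination value of the specific region according to the following formula; and a selection unit, configured to select the predetermined number of regions with the largest discrimination values as the feature data:

Among them: t is the discrimination value of the specific region; $\bar{x}_1$ and $\bar{x}_2$ are respectively the average WFDD values of the benign nodule population samples or the malignant nodule population samples; $n_1$ and $n_2$ are respectively the numbers of values from the benign nodule population samples or the malignant nodule population samples; and $S_1$ and $S_2$ are respectively the standard deviations of those values, where either population may correspond to subscript 1 and the other to subscript 2.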
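The formula for the discrimination value is not reproduced in this text. Because it is defined in terms of two group means, two group sizes, and two standard deviations, the following Python sketch assumes a two-sample t statistic in the unequal-variance (Welch) form; this is an assumption, and a pooled-variance form would fit the same symbols. Ranking by absolute value is also an assumption, motivated by the statement that either population may take subscript 1. The names `discrimination` and `top_regions` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def discrimination(group1, group2):
    """Assumed form: t = (mean1 - mean2) / sqrt(S1^2/n1 + S2^2/n2)."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)
    return (mean(group1) - mean(group2)) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def top_regions(wfdd_by_region, k):
    """wfdd_by_region maps region id -> (benign WFDD values, malignant WFDD values)."""
    scores = {
        region: abs(discrimination(benign, malignant))
        for region, (benign, malignant) in wfdd_by_region.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up WFDD values for two regions.
data = {
    "A": ([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]),
    "B": ([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]),
}
t_a = discrimination(*data["A"])
selected = top_regions(data, k=1)
```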

Optionally, the feature data screening sub-module includes an initial region generation unit and a region combination and splitting unit. The initial region generation unit includes: a window division element, configured to divide the genome into a series of windows; a window encoding element, configured to number the windows in their order; a region numbering element, configured to use the numbers of the combined windows as the number of the region obtained by combining them; and a window combination element, configured to combine the n consecutive windows at window i with another n consecutive windows $2^{j}n$ windows downstream to form an initial region: $x_i = \{i, i+1, i+2, \ldots, i+n-1, i+2^{j}n, i+2^{j}n+1, i+2^{j}n+2, \ldots, i+2^{j}n+n-1\}$; $i = 1, n, 2n, \ldots, N$; $j = 1, 2, \ldots, 8$; $i + (2^{i+1})n \le N$; where n is the given number of consecutive windows and N is the total number of windows. The region combination and splitting unit includes: a first offspring selection element, configured to use a genetic algorithm in which two parents combine to exchange information and generate offspring, all initial regions being placed in a region pool from which parents are randomly selected to generate offspring; where:

-The probability that region i is selected as one of the parents is:

The second offspring selection element is configured to, after the parents are selected, take the union of the windows included in the parents and randomly delete some of those windows to form the offspring, the random selection being sampling with replacement. The third offspring selection element is configured to, after the offspring is obtained, put the offspring into the region pool for the next round of selection; the parents are not deleted from the region pool in this operation. The offspring produced by combining region $P_1$ and region $P_2$ is: $\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p, P_1 \cup P_2)$, where $S(p, s)$ is a subset obtained by drawing a proportion p of the elements of the set s with replacement. The offspring repetition element is configured to repeat the process of producing offspring.

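The discrimination value of each generated offspring is then calculated, and the predetermined number of regions with the largest discrimination values are selected as the feature data. Among them: t is the discrimination value of a specific region; $\bar{x}_1$ and $\bar{x}_2$ are respectively the average WFDD values of the benign nodule population samples or the malignant nodule population samples; $n_1$ and $n_2$ are respectively the numbers of values from the two populations; and $S_1$ and $S_2$ are respectively the standard deviations of those values, where either population may correspond to subscript 1 and the other to subscript 2.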
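The generation of initial regions can be illustrated with a short Python sketch. Two points are assumptions: the start positions follow the sequence i = 1, n, 2n, ... literally as written, and a region is kept only if its last window index, $i + 2^{j}n + n - 1$, does not exceed N. The latter replaces the bound as printed in the text, which appears to have been garbled in extraction. The function name `initial_regions` is hypothetical.

```python
def initial_regions(total_windows, n, max_j=8):
    """Pair n consecutive windows at i with n consecutive windows 2**j * n downstream."""
    # Start positions 1, n, 2n, ... as written in the text (duplicates removed).
    starts = sorted({1, *range(n, total_windows + 1, n)})
    regions = []
    for i in starts:
        for j in range(1, max_j + 1):
            offset = (2 ** j) * n
            last = i + offset + n - 1
            if last > total_windows:  # assumed bound: stay inside the genome
                break
            first_block = list(range(i, i + n))
            second_block = list(range(i + offset, i + offset + n))
            regions.append(first_block + second_block)
    return regions


regions = initial_regions(total_windows=40, n=5)
```

Each initial region thus joins two blocks of n windows separated by a gap that doubles with j, so that both nearby and distant window pairs enter the region pool.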

In the construction device of the present invention, the number n of consecutive windows is 1-100, preferably 5-50, more preferably 5. Preferably, after the parents are selected, the union of the windows included in the parents is taken and 1%-99%, more preferably 5%-50%, further preferably 20% of the windows are randomly deleted to form the offspring, using sampling with replacement. Preferably, the process of generating offspring is repeated 1 to 1 million times, more preferably 100 to 100,000 times, and further preferably 300,000 times. The concatenated base sequence is divided into a series of windows with a length of 100bp-100kbp, preferably 10kbp-50kbp, and more preferably 30kbp. Preferably, the windows combined to form different regions are continuous or discontinuous. Preferably, the predetermined number of regions is 1-500, more preferably 10-100, and further preferably 50.
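The offspring operation $\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p, P_1 \cup P_2)$ with the preferred deletion proportion of 20% can be sketched in Python as follows. The sketch takes the two parents as given, because the parent selection probabilities are not reproduced in this text; rounding the number of draws to the nearest integer is an assumption, and the function name `make_child` is hypothetical.

```python
import random


def make_child(parent1, parent2, p=0.2, rng=random):
    """Union of the parents' windows minus a subset drawn with replacement."""
    union = sorted(set(parent1) | set(parent2))
    draws = round(p * len(union))  # assumed rounding rule
    # Sampling with replacement: a window may be drawn more than once,
    # so the number of distinct windows removed can be smaller than `draws`.
    removed = set(rng.choices(union, k=draws))
    return [w for w in union if w not in removed]


p1 = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
p2 = [5, 6, 7, 8, 9, 15, 16, 17, 18, 19]
child = make_child(p1, p2, p=0.2, rng=random.Random(0))
```

In a full loop the child would be appended to the region pool while both parents remain in it, as the text describes.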

Example 4: Obtaining a pulmonary nodule screening device

The present invention proposes a pulmonary nodule screening device, which includes: a first calculation module, configured to calculate the sum of the weighted fragment distribution difference (WFDD) values of the sample to be tested over the selected predetermined number of regions to obtain a total WFDD value; an input module, configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model constructed by the construction device of the present invention; and an output module, configured to output the screening result of the sample to be tested. The selected predetermined number of regions are the same as the predetermined number of regions with the largest difference in WFDD values between the benign nodule population samples and the malignant nodule population samples.

Optionally, the input module includes: an input unit, configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model; and a determination unit, configured to compare the total WFDD value of the sample to be tested with a predetermined threshold to determine the type of pulmonary nodule of the sample to be tested.
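The calculation and determination steps of the screening device can be sketched in Python as follows. The text does not state on which side of the threshold a sample is judged malignant, so the direction is exposed as a parameter and its default is an assumption. The names `total_wfdd` and `classify`, the region identifiers, and the values are hypothetical.

```python
def total_wfdd(region_wfdd, selected_regions):
    """Sum the per-region WFDD values of one sample over the selected regions."""
    return sum(region_wfdd[region] for region in selected_regions)


def classify(total, threshold, malignant_if_above=True):
    """Compare the total WFDD value with the predetermined threshold.

    The direction of the comparison is an assumption, not stated in the text.
    """
    above = total > threshold
    return "malignant" if above == malignant_if_above else "benign"


# Made-up per-region WFDD values for one sample to be tested.
sample = {"r1": 0.5, "r2": 1.25, "r3": -0.25}
total = total_wfdd(sample, ["r1", "r2"])
result = classify(total, threshold=1.5)
```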

Optionally, the screening device further includes a predetermined threshold acquisition module, which includes: a first calculation unit, configured to calculate the total WFDD value of each sample in the training set over the predetermined number of regions with the largest difference in WFDD values; and a second calculation unit, configured to calculate the optimal split point from the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal split point being the predetermined threshold.

In addition, it should be noted that, with simple modifications, the construction method, screening model, screening method, and screening device of the present invention can also be applied to methylation sequencing data, RNA sequencing data, and proteomics-related data.

Those skilled in the art will understand that embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each process and/or block in the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory that causes a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

Memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include both persistent and non-persistent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "includes," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element qualified by the statement "comprises a..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Industrial applicability

From the above description, it can be seen that the present invention provides a product for detecting human nodule types with high accuracy based on low-depth whole-genome sequencing data of cell-free DNA (cfDNA). Specifically, the present invention proposes a method and device for constructing a pulmonary nodule screening model and a method and device for screening pulmonary nodules, which:

1) Non-invasively detect human nodule types (benign/malignant) and avoid invasive examinations;

2) Provides higher judgment accuracy than existing CT scans, and its accuracy is close to that of invasive tissue biopsy;

3) Are developed on cfDNA sequencing data and require only a blood draw or the collection of other body fluids, without the risk of radiation exposure.

Claims

A method for constructing a pulmonary nodule screening model, characterized in that the construction method includes the following steps:

Within the entire range of the human reference genome, select a predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population sample and the malignant nodule population sample in the training set as feature data; and

The feature data is used to construct the pulmonary nodule screening model.
The construction method according to claim 1, characterized in that the step of screening, as feature data, the predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples in the training set includes:

Connect the base sequences of all autosomal chromosomes in the human reference genome together, and divide the connected base sequences into a series of windows according to a fixed length, with each window corresponding to a base sequence;

Calculate the window reference depth and the weight of each window, where the window reference depth is the average depth value of the samples in the training set in the window, the weight is the square of the variance of the depth values of the samples in the training set in the window, and the depth value of the window is the number of base sequence fragments in the sequencing data of a sample that can be aligned to the base sequence corresponding to the window;

Calculate the window sample depth in each window of the specified sample in the training set;

Calculate the difference between the window sample depth and the window reference depth;

Multiply the difference by the weight to obtain the weighted difference of the window;

Combine a variable number of windows to form different regions, and sum the weighted differences of all windows in the specified region to obtain the sum of weighted differences;

Perform a numerical transformation on the sum of weighted differences to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region; and

Calculate, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples in the training set, and select the predetermined number of regions with the largest differences as the feature data;

Preferably, before calculating the difference between the window sample depth and the window reference depth, it further includes the step of normalizing the window reference depth and the window sample depth;

More preferably, the average value of the window reference depth of each window after the normalization process is 0, and the standard deviation is 1.
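The window division and depth counting steps of this claim can be illustrated with a minimal Python sketch. It assumes that each aligned fragment is represented by a chromosome name and a 0-based start coordinate and is counted in the window that contains its start position on the concatenated sequence; the text does not state how a fragment spanning two windows is assigned. The function names, chromosome lengths, and fragments below are hypothetical toy values.

```python
from collections import Counter


def build_offsets(chrom_lengths):
    """Offset of each chromosome on the concatenated sequence, plus total length."""
    offsets, running = {}, 0
    for name, length in chrom_lengths.items():
        offsets[name] = running
        running += length
    return offsets, running


def window_depths(fragments, chrom_lengths, window_size):
    """Count fragments per fixed-length window of the concatenated autosomes."""
    offsets, genome_length = build_offsets(chrom_lengths)
    n_windows = -(-genome_length // window_size)  # ceiling division
    counts = Counter(
        (offsets[chrom] + start) // window_size for chrom, start in fragments
    )
    return [counts.get(w, 0) for w in range(n_windows)]


# Toy example: two short "chromosomes" and a window length of 30.
lengths = {"chr1": 100, "chr2": 50}
frags = [("chr1", 5), ("chr1", 29), ("chr1", 95), ("chr2", 0), ("chr2", 49)]
depths = window_depths(frags, lengths, window_size=30)
```

Because the autosomes are concatenated before division, a window may straddle a chromosome boundary, as window 3 does in the toy example.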
The construction method according to claim 2, wherein the screening of the predetermined number of regions with the largest differences as the feature data includes:

- For a specific region, calculate the WFDD value of each sample in the benign nodule population samples and the malignant nodule population samples;

- Calculate the average WFDD value of the benign nodule population samples and the average WFDD value of the malignant nodule population samples respectively;

- Calculate the discrimination value of the specific region according to the following formula; and

- Select the predetermined number of regions with the largest discrimination values as the feature data:

wherein:

t is the discrimination value of the specific region;

$\bar{x}_1$ and $\bar{x}_2$ are respectively the average WFDD values of the benign nodule population samples or the malignant nodule population samples;

$n_1$ and $n_2$ are respectively the numbers of values from the benign nodule population samples or the malignant nodule population samples; and

$S_1$ and $S_2$ are respectively the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples; the conditions being:

When the benign nodule population sample corresponds to subscript 1, the malignant nodule population sample corresponds to subscript 2; or

When the benign nodule population sample corresponds to subscript 2, the malignant nodule population sample corresponds to subscript 1.
The construction method according to claim 2, wherein the screening of the predetermined number of regions with the largest differences as the feature data includes:

a.) Generate initial area:

- Divide the genome into a series of windows;

- Number the series of windows in their order;

- Use the numbers of the combined windows as the number of the region obtained by combining them;

- Combine the n consecutive windows at window i with another n consecutive windows $2^{j}n$ windows downstream to form an initial region:

$x_i = \{i, i+1, i+2, \ldots, i+n-1, i+2^{j}n, i+2^{j}n+1, i+2^{j}n+2, \ldots, i+2^{j}n+n-1\}$

$i = 1, n, 2n, \ldots, N$; $j = 1, 2, \ldots, 8$; $i + (2^{i+1})n \le N$

where n is the given number of consecutive windows, and N is the total number of windows;

b.) Combination and splitting of regions:

Using a genetic algorithm, two parents combine to exchange information and generate offspring; all initial regions are placed in a region pool, from which parents are randomly selected to generate offspring;

wherein:

- The probability that region i is selected as one of the parents is:

where N is the total number of windows, and $t_i$ is the t value of the i-th window;

- When region x is selected as the first parent, the probability that another region i is selected as the second parent is:

where N is the total number of windows, and $m_i$ is the average of the window numbers included in region i; and

- After the parents are selected, take the union of the windows included in the parents and randomly delete some of those windows to form the offspring, the random selection being sampling with replacement;

- After the offspring is obtained, put the offspring into the region pool for the next round of selection; the parents are not deleted from the region pool in this operation. The offspring generated by combining region $P_1$ and region $P_2$ is:

$\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p, P_1 \cup P_2)$

where $S(p, s)$ is a subset obtained by drawing a proportion p of the elements of the set s with replacement;

- Repeat the process of producing offspring;

- Calculate the discrimination value of each generated offspring according to the following formula, and select the predetermined number of regions with the largest discrimination values as the feature data:

wherein:

t is the discrimination value of the specific region;

$\bar{x}_1$ and $\bar{x}_2$ are respectively the average WFDD values of the benign nodule population samples or the malignant nodule population samples;

$n_1$ and $n_2$ are respectively the numbers of values from the benign nodule population samples or the malignant nodule population samples; and

$S_1$ and $S_2$ are respectively the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples; the conditions being:

When the benign nodule population sample corresponds to subscript 1, the malignant nodule population sample corresponds to subscript 2; or

When the benign nodule population sample corresponds to subscript 2, the malignant nodule population sample corresponds to subscript 1.
The construction method according to claim 4, characterized in that:

The number n of the continuous windows is 1-100, preferably 5-50, and more preferably 5;

Preferably, after the parents are selected, the union of the windows included in the parents is taken and 1%-99%, more preferably 5%-50%, and even more preferably 20% of the windows are randomly deleted to form the offspring, using sampling with replacement;

Preferably, the process of generating offspring is repeated 1 to 1 million times, more preferably 100 to 100,000 times, and further preferably 300,000 times.
The construction method according to claim 2, characterized in that:

The concatenated base sequence is divided into a series of windows with a length of 100bp-100kbp, preferably 10kbp-50kbp, and more preferably 30kbp;

Preferably, the windows combined to form different regions are continuous or discontinuous;

Preferably, the predetermined number of regions is 1-500, more preferably 10-100, and further preferably 50.
A pulmonary nodule screening method, characterized by including the following steps:

Calculate the sum of the weighted fragment distribution difference values of the sample to be tested in the selected predetermined number of areas to obtain the total WFDD value;

Input the total WFDD value of the sample to be tested into the pulmonary nodule screening model constructed by the construction method according to any one of claims 1 to 6;

Output the screening results of the sample to be tested;

Wherein, the selected predetermined number of regions are the same as the predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples.
The screening method according to claim 7, characterized in that: the total WFDD value of the sample to be tested is input into the pulmonary nodule screening model, and the pulmonary nodule screening model compares the total WFDD value of the sample to be tested with a predetermined threshold to determine the pulmonary nodule type of the sample to be tested;

Preferably, the predetermined threshold is obtained by the following method:

- Calculate the total WFDD value of each sample in the training set in a predetermined number of regions where the WFDD value difference is the largest;

- Calculate the optimal segmentation point based on the total WFDD value of the benign nodule population and the malignant nodule population in the training set, and the optimal segmentation point is the predetermined threshold;

More preferably, the optimal split point is calculated using the roc function from the pROC package of the R language analysis platform.
A device for constructing a pulmonary nodule screening model, characterized in that the device includes:

A feature data screening module, configured to screen, within the entire range of the human reference genome, a predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples in the training set, as feature data; and

A building module, configured to build the pulmonary nodule screening model using the feature data.
The construction device according to claim 9, characterized in that the feature data screening module includes:

The window division module is set to connect the base sequences of all autosomal chromosomes in the human reference genome together, and divide the connected base sequences into a series of windows according to a fixed length, with each window corresponding to a base sequence;

The first calculation module is configured to calculate the window reference depth and the weight of each window, where the window reference depth is the average depth value of the samples in the training set in the window, the weight is the square of the variance of the depth values of the samples in the training set in the window, and the depth value of the window is the number of base sequence fragments in the sequencing data of a sample that can be aligned to the base sequence corresponding to the window;

The second calculation module is configured to calculate the window sample depth in each window of the specified sample in the training set;

A third calculation module configured to calculate the difference between the window sample depth and the window reference depth;

A fourth calculation module is configured to multiply the difference by the weight to obtain the weighted difference of the window;

The fifth calculation module is configured to combine a variable number of windows to form different regions, and to sum the weighted differences of all windows in the specified region to obtain the sum of weighted differences;

A numerical transformation module, configured to perform a numerical transformation on the sum of weighted differences to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region; and

A sixth calculation module, configured to calculate, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples in the training set; and

The feature data screening sub-module is configured to screen the predetermined number of regions with the largest differences as the feature data;

Preferably, the feature data screening module also includes a normalization processing module;

More preferably, the average value of the window reference depth of each window after the normalization process is 0, and the standard deviation is 1.
The construction device according to claim 10, characterized in that the feature data screening sub-module includes:

A first calculation unit configured to calculate, for a specific region, the WFDD value of each sample in the benign nodule population sample and the malignant nodule population sample;

The second calculation unit is configured to respectively calculate the average value of the WFDD value of the benign nodule population sample and the average value of the WFDD value of the malignant nodule population sample;

The third calculation unit is configured to calculate the discrimination value of the specific region according to the following formula; and

A selection unit, configured to select the predetermined number of regions with the largest discrimination values as the feature data:

wherein:

t is the discrimination value of the specific region;

$\bar{x}_1$ and $\bar{x}_2$ are respectively the average WFDD values of the benign nodule population samples or the malignant nodule population samples;

$n_1$ and $n_2$ are respectively the numbers of values from the benign nodule population samples or the malignant nodule population samples; and

$S_1$ and $S_2$ are respectively the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples; the conditions being:

When the benign nodule population sample corresponds to subscript 1, the malignant nodule population sample corresponds to subscript 2; or

When the benign nodule population sample corresponds to subscript 2, the malignant nodule population sample corresponds to subscript 1.
The construction device according to claim 10, characterized in that the feature data screening sub-module includes an initial region generation unit and a region combination and splitting unit, wherein:

The initial area generation unit includes:

A window division element, configured to divide the genome into a series of windows;

A window encoding element, configured to number the series of windows in their order;

A region numbering element, configured to use the numbers of the combined windows as the number of the region obtained by combining them;

A window combination element, configured to combine the n consecutive windows at window i with another n consecutive windows $2^{j}n$ windows downstream to form an initial region:

$x_i = \{i, i+1, i+2, \ldots, i+n-1, i+2^{j}n, i+2^{j}n+1, i+2^{j}n+2, \ldots, i+2^{j}n+n-1\}$,

$i = 1, n, 2n, \ldots, N$; $j = 1, 2, \ldots, 8$; $i + (2^{i+1})n \le N$

where n is the given number of consecutive windows, and N is the total number of windows;

The region combination and splitting unit includes:

A first offspring selection element, configured to use a genetic algorithm in which two parents combine to exchange information and generate offspring; all initial regions are placed in a region pool, from which parents are randomly selected to generate offspring;

wherein:

- The probability that region i is selected as one of the parents is:

where N is the total number of windows, and $t_i$ is the t value of the i-th window;

- When region x is selected as the first parent, the probability that another region i is selected as the second parent is:

where N is the total number of windows, and $m_i$ is the average of the window numbers included in region i;

A second offspring selection element, configured to, after the parents are selected, take the union of the windows included in the parents and randomly delete some of those windows to form the offspring, the random selection being sampling with replacement; and

A third offspring selection element, configured to, after the offspring is obtained, put the offspring into the region pool for the next round of selection, the parents not being deleted from the region pool in this operation, wherein the offspring generated by combining region $P_1$ and region $P_2$ is:

$\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p, P_1 \cup P_2)$,

where $S(p, s)$ is a subset obtained by drawing a proportion p of the elements of the set s with replacement; and

An offspring repetition element, configured to repeat the process of generating offspring;

- Calculate the distinction value of the generated offspring according to the following formula, and select the predetermined number of areas with the largest distinction value as the feature data:

in:

t is the discrimination value of the specific area;

and
are respectively the average value of WFDD values from the benign nodule population sample or the malignant nodule population sample;

n_1 and n_2 are respectively the numbers of values from the benign nodule population sample or the malignant nodule population sample; and

S_1 and S_2 are respectively the standard deviations of the values from the benign nodule population sample or the malignant nodule population sample; with the proviso that:

When the benign nodule population sample corresponds to subscript 1, the malignant nodule population sample corresponds to subscript 2; or

When the benign nodule population sample corresponds to subscript 2, the malignant nodule population sample corresponds to subscript 1.
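
The offspring-generation and ranking steps of this claim can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the claimed implementation: regions are represented as sets of window indices; parents are drawn uniformly as a placeholder, because the two selection-probability formulas are not reproduced in this text; rounding p·|P_1 ∪ P_2| to a whole number of draws is an assumption; and `t_value` assumes the unequal-variance two-sample form (mean_1 − mean_2) / sqrt(S_1²/n_1 + S_2²/n_2), which may differ from the claimed formula. All function names are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def make_child(p1, p2, p=0.2, rng=random):
    """Return P1 ∪ P2 − S(p, P1 ∪ P2) for two regions given as sets of window indices."""
    union = sorted(set(p1) | set(p2))
    draws = round(p * len(union))  # assumption: "proportion p" means round(p * |union|) draws
    # Sampling with replacement: repeated draws collapse, so at most `draws` windows are removed.
    removed = {rng.choice(union) for _ in range(draws)}
    return frozenset(union) - removed


def grow_pool(pool, rounds, p=0.2, rng=random):
    """Repeat offspring generation; parents stay in the pool and each child is added to it."""
    pool = [frozenset(region) for region in pool]
    if len(pool) < 2:
        raise ValueError("at least two initial regions are required")
    for _ in range(rounds):
        first = rng.choice(pool)  # placeholder: uniform choice, not the claimed probabilities
        second = rng.choice([region for region in pool if region is not first])
        pool.append(make_child(first, second, p, rng))
    return pool


def t_value(group1, group2):
    """Two-sample t statistic in the unequal-variance form (an assumed form)."""
    se = math.sqrt(stdev(group1) ** 2 / len(group1) + stdev(group2) ** 2 / len(group2))
    return (mean(group1) - mean(group2)) / se


def top_regions(scores, k=50):
    """Keep the k regions with the largest discrimination value."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Here `scores` is assumed to map each region to its discrimination value; how a region's WFDD value is obtained for each sample is left to the caller.
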
The assembly device according to claim 12, characterized in that:

The number n of consecutive windows is 1-100, preferably 5-50, and more preferably 5;

Preferably, after the parents are selected, the union of the windows contained in the parents is taken and 1%-99%, more preferably 5%-50%, and even more preferably 20%, of the windows are randomly deleted to form the offspring, the random selection being sampling with replacement;

Preferably, the process of generating offspring is repeated 1 to 1,000,000 times, more preferably 100 to 100,000 times, and further preferably 300,000 times.
The assembly device according to claim 10, characterized in that:

The connected base sequence is divided into a series of windows with a length of 100 bp to 100 kbp, preferably 10 kbp to 50 kbp, and more preferably 30 kbp;

Preferably, the windows combined to form different regions are consecutive or non-consecutive;

Preferably, the predetermined number of regions is 1-500, more preferably 10-100, and further preferably 50.
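
The window division described in this claim can be illustrated with a short sketch. It assumes half-open (start, end) coordinates on the connected sequence and a shorter final window when the total length is not a multiple of the window length; neither detail is stated in the source, and the function name is hypothetical.

```python
def split_into_windows(total_length, window_size=30_000):
    """Return half-open (start, end) coordinates of consecutive windows.

    The default of 30 kbp is the "more preferably" window length.
    Assumption: a final partial window is kept rather than dropped.
    """
    if total_length < 0 or window_size <= 0:
        raise ValueError("total_length must be >= 0 and window_size must be > 0")
    return [
        (start, min(start + window_size, total_length))
        for start in range(0, total_length, window_size)
    ]
```

For example, a 70 kbp sequence yields two full 30 kbp windows followed by one 10 kbp window.
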
A pulmonary nodule screening device, characterized in that the screening device includes:

A first calculation module configured to calculate the sum of the weighted fragment distribution difference (WFDD) values of the sample to be tested in the selected predetermined number of regions to obtain the total WFDD value;

An input module configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model assembled by the assembly device according to any one of claims 9 to 14; and

An output module configured to output the screening results of the sample to be tested;

Wherein the selected predetermined number of regions are the same as the predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population sample and the malignant nodule population sample.
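
The first calculation module amounts to a sum over the selected regions. A minimal sketch, assuming the per-region WFDD values of a sample are already available as a mapping from a region identifier to a value (the data layout and the function name are hypothetical):

```python
def total_wfdd(region_wfdd, selected_regions):
    """Sum a sample's WFDD values over the selected feature regions.

    region_wfdd: mapping of region identifier -> WFDD value for one sample (assumed layout).
    selected_regions: the predetermined regions chosen during model construction.
    A region missing from region_wfdd raises KeyError rather than being silently skipped.
    """
    return sum(region_wfdd[region] for region in selected_regions)
```
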
The screening device according to claim 15, wherein the input module includes:

An input unit configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model; and

A determination unit configured so that the pulmonary nodule screening model determines the pulmonary nodule type of the sample to be tested based on the comparison between the total WFDD value of the sample to be tested and a predetermined threshold;

Preferably, the screening device further includes a predetermined threshold acquisition module, and the predetermined threshold acquisition module includes:

A first calculation unit configured to calculate the total WFDD value of each sample in the training set in a predetermined number of regions where the WFDD value difference is the largest; and

A second calculation unit configured to calculate the optimal split point based on the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal split point being the predetermined threshold;

More preferably, the optimal split point is calculated using the roc function from the pROC package of the R language analysis platform.
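
The threshold step can be illustrated without R. The sketch below swaps in a plain-Python search for the cut-off that maximizes the Youden index (sensitivity + specificity − 1); it is not the pROC implementation, and the source does not state which optimality criterion is used. It also assumes that malignant samples have the higher total WFDD values, and the function names are hypothetical.

```python
def youden_threshold(benign_totals, malignant_totals):
    """Return the cut-off maximizing sensitivity + specificity - 1.

    Assumption: a sample is called malignant when its total WFDD value is >= the cut-off.
    Candidate cut-offs are midpoints between adjacent observed values.
    Returns None when fewer than two distinct values are observed.
    """
    values = sorted(set(benign_totals) | set(malignant_totals))
    candidates = [(low + high) / 2 for low, high in zip(values, values[1:])]
    best_cut, best_j = None, -1.0
    for cut in candidates:
        sensitivity = sum(v >= cut for v in malignant_totals) / len(malignant_totals)
        specificity = sum(v < cut for v in benign_totals) / len(benign_totals)
        j = sensitivity + specificity - 1
        if j > best_j:  # strict comparison keeps the lowest cut-off on ties
            best_cut, best_j = cut, j
    return best_cut


def classify(total, threshold):
    """Compare a sample's total WFDD value with the predetermined threshold."""
    return "malignant" if total >= threshold else "benign"
```

With training totals of 1, 2, 3 for benign samples and 4, 5, 6 for malignant samples, the two groups separate perfectly and the selected cut-off is the midpoint 3.5.
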
A computer-readable storage medium, characterized in that the storage medium includes a stored program, wherein, when the program is run, it executes the method for constructing a pulmonary nodule screening model according to any one of claims 1 to 6 or the pulmonary nodule screening method according to claim 7 or 8.
A processor, characterized in that the processor is used to run a program, wherein, when the program is run, it executes the method for constructing a pulmonary nodule screening model according to any one of claims 1 to 6 or the pulmonary nodule screening method according to claim 7 or 8.