WO2023236058A1 - Construction method and apparatus for pulmonary nodule screening model, and pulmonary nodule screening method and apparatus - Google Patents

Construction method and apparatus for pulmonary nodule screening model, and pulmonary nodule screening method and apparatus

Info

Publication number
WO2023236058A1
Authority
WO
WIPO (PCT)
Prior art keywords
nodule
sample
window
value
windows
Prior art date
Application number
PCT/CN2022/097450
Other languages
French (fr)
Chinese (zh)
Inventor
梁瀚
周鑫兰
李甫强
乔斯坦
赵鑫
吴逵
Original Assignee
深圳华大生命科学研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大生命科学研究院
Priority to PCT/CN2022/097450
Publication of WO2023236058A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Definitions

  • The present invention relates to the technical field of bioinformatics, and specifically to a method and device for constructing a pulmonary nodule screening model and a pulmonary nodule screening method and device.
  • Cancer has become the leading cause of death in China, and its incidence is increasing year by year. According to the latest 2019 National Cancer Report released by the National Cancer Center, deaths from malignant tumors account for 23.91% of all causes of death among residents, and the annual medical expenses caused by malignant tumors exceed 220 billion. Among malignant tumors in China, lung cancer ranks first by number of cases.
  • Pulmonary sarcoidosis is a multi-system, multi-organ granulomatous disease of unknown etiology. It often invades the lungs, bilateral hilar lymph nodes, eyes, skin and other organs, with a chest invasion rate as high as 80% to 90%.
  • The prognosis for pulmonary sarcoidosis is mostly good.
  • The early forms of lung cancer are mostly small nodules in the lungs. Therefore, distinguishing the type of nodule is particularly important for early screening of lung cancer.
  • At present, reliably determining the type of a pulmonary nodule basically relies on tissue biopsy obtained through invasive surgical sampling.
  • Cell-free DNA is also known as circulating free DNA (cfDNA); circulating tumor DNA is abbreviated as ctDNA.
  • This article compares malignant lung lesions with benign lesions, characterizes tissue DNA methylation, and establishes a diagnostic model for benign/malignant nodules. Applying this model to identify tumor-specific ctDNA in the plasma of patients with pulmonary nodules achieves a certain sensitivity and specificity for early lung cancer. However, the accuracy of this non-invasive diagnostic method for pulmonary nodules is low and cannot yet meet clinical requirements.
  • Copy Number Variation (CNV) is caused by genome rearrangement. It generally refers to an increase or decrease in the copy number of large genome segments more than 1 kb in length, mainly manifesting as submicroscopic deletions and duplications. CNV is an important component of genome structural variation (Structural Variation, SV). The mutation rate of CNV sites is much higher than that of single nucleotide polymorphisms (SNPs), and CNV is one of the important causative factors of human diseases.
  • The main purpose of the present invention is to provide a method for establishing a pulmonary nodule screening model, a screening model, a screening method and a screening device that can distinguish a certain part of malignant tumors from other non-cancer types of diseases, and in particular can distinguish the type of a pulmonary nodule.
  • the present invention proposes a method for constructing a pulmonary nodule screening model, which method includes the following steps:
  • the step of screening, as feature data, a predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples in the training set includes:
  • each window corresponds to a base sequence;
  • calculating the window reference depth and the weight of each window, where the window reference depth is the average of the depth values of the training-set samples in the window, the weight is the square of the variance of the depth values of the training-set samples in the window, and the depth value of a window is the number of base sequence fragments in the sequencing data of the sample that can be aligned to the base sequence corresponding to the window;
  • calculating the window sample depth of a specified sample in the training set in each window;
  • calculating the difference between the window sample depth and the window reference depth;
  • multiplying the difference by the weight to obtain the weighted difference of the window;
  • combining an indefinite number of windows to form different areas, and summing the weighted differences of all windows in a specified area to obtain the weighted difference sum;
  • performing a numerical transformation on the weighted difference sum to obtain the weighted fragment distribution difference value (WFDD) of the specified sample in the specified area.
  • a step of normalizing the window reference depth and the window sample depth is also included.
  • the average value of the window reference depth of each window is 0, and the standard deviation is 1.
  • screening a predetermined number of areas with the largest differences as feature data includes:
  • for a specific area, calculate the WFDD value of each sample in the benign nodule population sample and the malignant nodule population sample; calculate the average WFDD value of the benign nodule population sample and the average WFDD value of the malignant nodule population sample respectively; calculate the discrimination value of the specific area according to the following formula, and select a predetermined number of areas with the largest discrimination value as feature data:
  • t is the discrimination value of a specific area; $\bar{X}_1$ and $\bar{X}_2$ are the average WFDD values of the benign nodule population samples or the malignant nodule population samples respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples or the malignant nodule population samples respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples respectively; the condition is: when the benign nodule population sample corresponds to subscript 1, the malignant nodule population sample corresponds to subscript 2; or when the benign nodule population sample corresponds to subscript 2, the malignant nodule population sample corresponds to subscript 1.
  • screening a predetermined number of areas with the largest differences as feature data includes:
  • $x_i = \{i, i+1, i+2, \ldots, i+n-1, i+2^j n, i+2^j n+1, i+2^j n+2, \ldots, i+2^j n+n-1\}$
  • n is the number of given continuous windows, and N is the total number of divided windows;
  • two parents combine to exchange information and produce offspring; where all initial regions are placed in the region pool and randomly selected to generate offspring; where
  • N is the total number of divided windows
  • $t_i$ is the t value of the i-th window
  • N is the total number of divided windows
  • $m_i$ is the average of the window numbers included in area i
  • the offspring generated by the combination of area $P_1$ and area $P_2$ are:
  • S(p,s) is a subset obtained by extracting elements with proportion p from the set s with replacement;
  • t is the discrimination value of a specific area; $\bar{X}_1$ and $\bar{X}_2$ are the average WFDD values of the benign nodule population samples or the malignant nodule population samples respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples or the malignant nodule population samples respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples respectively; the condition is: when the benign nodule population sample corresponds to subscript 1, the malignant nodule population sample corresponds to subscript 2; or when the benign nodule population sample corresponds to subscript 2, the malignant nodule population sample corresponds to subscript 1.
  • the number n of continuous windows is 1-100, preferably 5-50, and more preferably 5.
  • the process of generating offspring is repeated 1 to 1 million times, more preferably 100 to 100,000 times, and further preferably 300,000 times.
  • the connected base sequence is divided into a series of windows according to the length of 100bp-100kbp, preferably 10kbp-50kbp, and more preferably 30kbp.
  • the windows that are combined to form different areas are continuous or discontinuous.
  • the predetermined number of areas is 1-500, more preferably 10-100, even more preferably 50.
  • the present invention proposes a pulmonary nodule screening method, which includes the following steps: calculating the sum of the weighted fragment distribution difference (WFDD) values of the sample to be tested in a selected predetermined number of regions to obtain the total WFDD value; inputting the total WFDD value of the sample to be tested into the pulmonary nodule screening model established according to the method described in the first aspect of the present invention; and outputting the screening result of the sample to be tested; wherein the selected predetermined number of regions are the same as the predetermined number of regions with the largest difference in WFDD values between the benign nodule population sample and the malignant nodule population sample.
  • the total WFDD value of the sample to be tested is input into the pulmonary nodule screening model, and the pulmonary nodule screening model determines the type of pulmonary nodule of the sample to be tested based on the comparison between the total WFDD value of the sample to be tested and a predetermined threshold.
  • the predetermined threshold is obtained by the following method: calculating the total WFDD value of each sample in the training set in the predetermined number of areas with the largest difference in WFDD value; and calculating the optimal segmentation point based on the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal segmentation point being the predetermined threshold.
  • the optimal segmentation point was calculated using the roc function of the pROC package from the R language analysis platform.
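
The text names the `roc` function of the pROC package but does not state which criterion defines the "optimal" segmentation point. As a minimal sketch only, the Python below picks the cut point that maximizes Youden's J (sensitivity + specificity - 1). The choice of Youden's J, the assumption that malignant samples have the larger total WFDD, and the function name `best_cut_point` are all illustrative assumptions, not the patent's stated procedure.

```python
def best_cut_point(benign_totals, malignant_totals):
    """Return (threshold, J) maximizing Youden's J over candidate cut points.

    Assumption: a total WFDD above the threshold is called malignant.
    Candidates are midpoints between consecutive distinct observed values.
    """
    values = sorted(set(benign_totals) | set(malignant_totals))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best, best_j = None, -1.0
    for cut in candidates:
        # Sensitivity: malignant samples correctly above the cut.
        sensitivity = sum(v > cut for v in malignant_totals) / len(malignant_totals)
        # Specificity: benign samples correctly at or below the cut.
        specificity = sum(v <= cut for v in benign_totals) / len(benign_totals)
        j = sensitivity + specificity - 1
        if j > best_j:
            best, best_j = cut, j
    return best, best_j


# Toy totals, invented purely for illustration.
threshold, j_value = best_cut_point([1, 2, 3], [4, 5, 6])
```

If the benign population tends to have the larger total WFDD instead, the two comparisons would be reversed.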
  • the present invention proposes a device for constructing a pulmonary nodule screening model, including: a feature data screening module, configured to screen, within the entire range of the human reference genome, a predetermined number of areas with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population sample and the malignant nodule population sample in the training set as feature data; and a building module, configured to use the feature data to build a pulmonary nodule screening model.
  • the feature data screening module includes: a window division module, which is set to connect the base sequences of all autosomes in the human reference genome together and divide the connected base sequence into a series of windows of a fixed length, each window corresponding to a base sequence;
  • the first calculation module is set to calculate the window reference depth and weight of each window, where the window reference depth is the average of the depth values of the training-set samples in the window, and the weight is the square of the variance of the depth values of the training-set samples in the window;
  • the depth value of the window is the number of base sequence fragments in the sequencing data of the sample that can be aligned to the base sequence corresponding to the window;
  • the second calculation module is set to calculate the window sample depth of the specified sample in the training set in each window;
  • the third calculation module is set to calculate the difference between the window sample depth and the window reference depth;
  • the fourth calculation module is set to multiply the difference by the weight to obtain the weighted difference of the window;
  • the fifth calculation module is set to combine an indefinite number of windows to form different areas, and sum the weighted differences of all windows in the specified area to obtain the total weighted difference;
  • the numerical transformation module is set to perform numerical transformation on the sum of weighted differences to obtain the weighted fragment distribution difference value (WFDD) of the specified sample in the specified area;
  • the sixth calculation module is set to calculate the difference between the benign nodule population sample and the malignant nodule population sample in the training set with respect to the weighted fragment distribution difference value in each area; and the feature data screening sub-module is set to screen a predetermined number of areas with the largest differences as feature data.
  • the feature data screening module also includes a homogenization processing module.
  • the average value of the window reference depth of each window is 0, and the standard deviation is 1.
  • the feature data screening sub-module includes: a first calculation unit, configured to calculate, for a specific area, the WFDD value of each sample in the benign nodule population sample and the malignant nodule population sample; a second calculation unit, configured to calculate the average WFDD value of the benign nodule population sample and the average WFDD value of the malignant nodule population sample respectively; a third calculation unit, configured to calculate the discrimination value of the specific area according to the following formula; and a selection unit, configured to select a predetermined number of areas with the largest discrimination values as feature data:
  • t is the discrimination value of a specific area; $\bar{X}_1$ and $\bar{X}_2$ are the average WFDD values of the benign nodule population samples or the malignant nodule population samples respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples or the malignant nodule population samples respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples respectively; the condition is: when the benign nodule population sample corresponds to subscript 1, the malignant nodule population sample corresponds to subscript 2; or when the benign nodule population sample corresponds to subscript 2, the malignant nodule population sample corresponds to subscript 1.
  • the feature data screening sub-module includes an initial region generation unit and a region combination splitting unit, wherein the initial region generation unit includes: a window division element, which is set to divide the genome into a series of windows; a window encoding element, which is set to number the series of windows in order; an area numbering element, which is set to number an area obtained by combining windows using the numbers of those windows; and a window combination element, which is set to combine n consecutive windows starting at window i with another n consecutive windows located $2^j n$ windows downstream to form an initial area:
  • $x_i = \{i, i+1, i+2, \ldots, i+n-1, i+2^j n, i+2^j n+1, i+2^j n+2, \ldots, i+2^j n+n-1\}$,
  • n is the number of given continuous windows, and N is the total number of divided windows;
  • the regional combination splitting unit includes: the first child selection component, which is set to use a genetic algorithm in which two parents combine to exchange information and generate offspring, where all initial areas are put into the regional pool and randomly selected to generate offspring, in which:
  • N is the total number of divided windows
  • $t_i$ is the t value of the i-th window
  • N is the total number of divided windows
  • $m_i$ is the average of the window numbers included in area i;
  • the second child selection component is set to take, after the parents are selected, the union of the windows included in the parents and randomly delete several of the windows to obtain the child.
  • the random selection method is sampling with replacement;
  • the third child selection component is set to put the child into the regional pool for the next round of selection after the child is obtained; in this operation, the parents are not deleted from the regional pool, and the offspring resulting from the combination of area $P_1$ and area $P_2$ are:
  • $S(p,s)$ is a subset obtained by extracting elements with proportion p from the set s with replacement.
  • the offspring repeated generation component is set to repeat the process of generating offspring.
  • t is the discrimination value of a specific area; $\bar{X}_1$ and $\bar{X}_2$ are the average WFDD values of the benign nodule population samples or the malignant nodule population samples respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples or the malignant nodule population samples respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples respectively; the condition is: when the benign nodule population sample corresponds to subscript 1, the malignant nodule population sample corresponds to subscript 2; or when the benign nodule population sample corresponds to subscript 2, the malignant nodule population sample corresponds to subscript 1.
  • the number n of continuous windows is 1-100, preferably 5-50, and more preferably 5.
  • the process of generating offspring is repeated 1 to 1 million times, more preferably 100 to 100,000 times, and further preferably 300,000 times.
  • the connected base sequence is divided into a series of windows according to the length of 100bp-100kbp, preferably 10kbp-50kbp, and more preferably 30kbp.
  • the windows that are combined to form different areas are continuous or discontinuous.
  • the predetermined number of areas is 1-500, more preferably 10-100, even more preferably 50.
  • the present invention proposes a pulmonary nodule screening device, including: a first calculation module, configured to calculate the weighted fragment distribution difference value of the sample to be tested in a selected predetermined number of regions;
  • an input module, configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model constructed by the construction device of the third aspect of the present invention; and
  • an output module, configured to output the screening results of the sample to be tested; wherein the selected predetermined number of regions are the same as the predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population sample and the malignant nodule population sample.
  • the input module includes: an input unit, which is configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model; and a determination unit, which is configured to compare the total WFDD value of the sample to be tested with a predetermined threshold to determine the type of pulmonary nodule in the sample to be tested.
  • the screening device further includes a predetermined threshold acquisition module.
  • the predetermined threshold acquisition module includes: a first calculation unit, configured to calculate the total WFDD value of each sample in the training set in the predetermined number of regions where the WFDD value difference is the largest; and a second calculation unit, configured to calculate the optimal segmentation point based on the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal segmentation point being the predetermined threshold.
  • the optimal segmentation point was calculated using the roc function of the pROC package from the R language analysis platform.
  • the present invention proposes a computer-readable storage medium.
  • the storage medium includes a stored program.
  • when the program runs, it executes the construction method according to the first aspect of the present invention or the pulmonary nodule screening method according to the second aspect of the present invention.
  • the present invention proposes a processor.
  • the processor is configured to run a program.
  • when the program is running, the method for constructing a pulmonary nodule screening model according to the first aspect of the present invention or the pulmonary nodule screening method according to the second aspect of the present invention is executed.
  • the technical solution of the present invention is applied to develop a method aimed at distinguishing a certain part of malignant tumors from other non-cancer types of diseases (such as nodules, etc.).
  • the results show that the effect of the method of the present invention is significantly better than that of existing CT scans.
  • This method produces a product that can non-invasively detect the type of human nodules (benign/malignant). For example, through blood testing, it can determine whether the patient's lung nodules are malignant, thereby avoiding invasive examinations.
  • Figure 1 shows a flow chart of a method for constructing a pulmonary nodule screening model according to the present invention.
  • Figure 2 shows the calculation method of the mode weighted fragment distribution difference (Weighted Fragment Distribution Difference, WFDD) according to the present invention, in which: A shows the calculation method of the weighted difference of a sample in the specified window; B shows the calculation method of the accumulated value of the i-th window; and C shows the calculation method of the WFDD of the sample in the specified area.
  • Figure 3 shows an example of a benign correlation distribution pattern obtained by modeling according to an embodiment of the present invention.
  • Figure 4 shows an example of a malignant correlation distribution pattern obtained by modeling according to an embodiment of the present invention.
  • Figure 5 shows experimental results obtained by modeling and predicting validation set samples according to an embodiment of the present invention.
  • Figure 6 shows a flow chart of the pulmonary nodule screening method according to the present invention.
  • Figure 7 shows a device for constructing a pulmonary nodule screening model according to the present invention.
  • Figure 8 shows the feature data screening module in the device for constructing a pulmonary nodule screening model according to the present invention.
  • Figure 9 shows the feature data screening sub-module in the feature data screening module in the device for building a pulmonary nodule screening model according to the present invention.
  • Figure 10 shows the feature data screening sub-module in the feature data screening module in the device for building a pulmonary nodule screening model according to the present invention.
  • Figure 11 shows a pulmonary nodule screening device according to the present invention.
  • Figure 12 shows the input module of the pulmonary nodule screening device according to the present invention.
  • FIG. 1 shows a flow chart of a method for constructing a pulmonary nodule screening model according to the present invention.
  • the method for constructing a pulmonary nodule screening model according to an embodiment of the present invention constructs a model based on the distribution characteristics of DNA fragment sequence reads in second-generation sequencing data on a reference genome, thereby distinguishing different types (benign/malignant) of nodules.
  • the construction method of the pulmonary nodule screening model includes:
  • the sequencing data of all autosomes in the reference genome are first concatenated and divided into a series of windows by a fixed length.
  • the fixed length range is 100bp-100kbp, preferably 10kbp-50kbp, and more preferably 30kbp. Since the division is performed by concatenating the sequencing data of autosomes, there may be windows spanning chromosomes.
  • In order to exclude possible interference caused by gender factors, sex chromosome data are not used.
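
The window division described above can be sketched as follows. This is an illustrative sketch only: the chromosome lengths are invented toy values, and the function name `divide_into_windows` and the half-open `(start, end)` coordinate convention on the concatenated sequence are assumptions rather than anything specified in the text.

```python
from itertools import accumulate


def divide_into_windows(chrom_lengths, window_size=30_000):
    """Cut the concatenated autosome sequence into fixed-length windows.

    chrom_lengths maps chromosome name to length, in concatenation order.
    Returns half-open (start, end) pairs on the concatenated coordinate;
    the last window may be shorter than window_size.
    """
    total = sum(chrom_lengths.values())
    return [(start, min(start + window_size, total))
            for start in range(0, total, window_size)]


# Toy lengths, NOT real chromosome sizes; sex chromosomes are simply omitted.
toy_autosomes = {"chr1": 70_000, "chr2": 50_000}
windows = divide_into_windows(toy_autosomes)

# Internal chromosome boundaries on the concatenated coordinate.
boundaries = list(accumulate(toy_autosomes.values()))[:-1]
# Windows that straddle a boundary, i.e. windows spanning chromosomes.
spanning = [w for w in windows if any(w[0] < b < w[1] for b in boundaries)]
```

With these toy lengths the third window covers the end of `chr1` and the start of `chr2`, which is the chromosome-spanning case the text mentions.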
  • Referring to Figure 2, a calculation method of the mode weighted fragment distribution difference (Weighted Fragment Distribution Difference, WFDD) according to an embodiment of the present invention is shown.
  • the window reference depth and weight are calculated for each window: the average of the depth values of the training set samples in the window (each sample provides one depth value) is used as the window reference depth of the window, and the square of the variance of the depth values of these samples serves as the weight of the window.
  • the window sample depth in each window is calculated for the specified sample in the training set.
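
A window's depth value is defined elsewhere in this document as the number of sequenced fragments that align to the window's base sequence. A minimal sketch of that count is given below, assuming each aligned fragment is represented by a single start position on the concatenated autosome coordinate and is assigned to the window containing that position. Both the representation and the name `window_depths` are illustrative assumptions.

```python
def window_depths(read_starts, n_windows, window_size=30_000):
    """Count aligned fragments per window.

    read_starts: 0-based start positions on the concatenated sequence.
    A fragment is credited to the window that contains its start position
    (an assumption; the text does not say how boundary cases are handled).
    Positions beyond the last window raise IndexError.
    """
    depths = [0] * n_windows
    for pos in read_starts:
        depths[pos // window_size] += 1
    return depths


# Four toy fragments over four 30 kbp windows.
sample_depth = window_depths([0, 29_999, 30_000, 95_000], n_windows=4)
```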
  • the window reference depth of the window in the designated area is normalized to obtain the normalized window reference depth, so that the average value of each window reference depth is 0 and the standard deviation is 1.
  • WFDD only focuses on the depth difference at the window level.
  • the depth of the specified sample in these windows is also subjected to the same normalization operation to obtain the normalized window sample depth.
  • the difference between the window sample depth and the window reference depth is calculated.
  • the difference between the normalized window sample depth and the normalized window reference depth is calculated.
  • the difference is multiplied by the weight to obtain the weighted difference of the window.
  • the difference between the normalized window sample depth and the normalized window reference depth is multiplied by the weight.
  • the weighted differences of all windows in the specified area are summed, and a numerical transformation is performed on the sum of the weighted differences (i.e., the last cumulative value of the summation) to obtain the weighted fragment distribution difference (WFDD) of the specified sample in the specified area.
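
The steps up to the weighted difference sum can be sketched as below. This is a sketch under stated assumptions, not the patent's implementation: the population (not sample) variance and standard deviation are used, normalization is a z-score taken across the windows of the region, and the final numerical transformation is not reproduced here. All function names are hypothetical.

```python
from itertools import accumulate
from statistics import fmean, pstdev, pvariance


def window_reference_and_weight(train_depths):
    """train_depths: one list of per-window depths for each training sample.

    Reference depth is the per-window mean; the weight is the square of the
    per-window variance, as the text states.
    """
    columns = list(zip(*train_depths))
    reference = [fmean(col) for col in columns]
    weight = [pvariance(col) ** 2 for col in columns]
    return reference, weight


def zscore(values):
    """Normalize to mean 0 and standard deviation 1 (undefined if constant)."""
    mu, sd = fmean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def weighted_difference_sum(sample_depths, reference, weight, region):
    """Return per-window weighted differences and their running sum.

    region is a list of window indices. The last running-sum value is the
    quantity that would then be transformed into the WFDD.
    """
    sample_z = zscore([sample_depths[w] for w in region])
    reference_z = zscore([reference[w] for w in region])
    weighted = [(s - r) * weight[w]
                for s, r, w in zip(sample_z, reference_z, region)]
    return weighted, list(accumulate(weighted))


# Toy training set: two samples, three windows.
train = [[10.0, 20.0, 30.0], [14.0, 20.0, 26.0]]
reference, weight = window_reference_and_weight(train)
weighted, cumulative = weighted_difference_sum(
    [28.0, 20.0, 12.0], reference, weight, region=[0, 1, 2])
```

The running sum corresponds to the cumulative weighted difference curve described for the figures.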
  • the difference between the benign nodule population sample and the malignant nodule population sample in the training set with respect to the weighted fragment distribution difference value in each region is calculated, and a predetermined number of regions with the largest differences are screened as feature data.
  • the number of selected regions is 1-500, preferably 10-100, and more preferably 50; then, the characteristic data is used to build a pulmonary nodule screening model.
  • Referring to Figure 2, an example of a calculation method of the mode weighted fragment distribution difference according to one embodiment of the present invention is shown.
  • A of Figure 2 shows an example of a method for calculating the weighted difference of a sample in a specified window.
  • the average of the depth values of the training set samples in the window (each sample provides one depth value) is used as the window reference depth of the window, and the square of the variance of the depth values of these samples is used as the weight of the window. The window sample depth in each window is then computed for the specified sample in the training set.
  • the empirical formula for calculating the WFDD of a sample in a specified area is:
  • $x'_i$ is the normalized window sample depth of the specified sample in the i-th window of the training set
  • $\sigma_i$ is the depth value variance of the specified sample in the training set in the i-th window.
  • Referring to Figure 3, an example of a benign correlation distribution pattern obtained by modeling in accordance with an embodiment of the present invention is shown.
  • the fragment distribution pattern in this region is called a benign correlation pattern.
  • the area where the pattern is located includes 53 windows, and each polyline represents a sample.
  • the left picture shows the weighted differences on these windows for cfDNA samples from 20 randomly selected patients each with benign and malignant pulmonary nodules (40 in total); the middle picture shows the cumulative weighted difference at each window; the right picture shows the function used to convert the last cumulative value of each sample in the benign correlation mode into the WFDD, together with the results (box plot).
  • Referring to Figure 4, an example of a malignant correlation distribution pattern obtained by modeling according to an embodiment of the present invention is shown. If the WFDD fluctuation of samples from the malignant pulmonary nodule population is greater than that of the benign pulmonary nodule population, the fragment distribution pattern in this region is called a malignant-related pattern.
  • the pattern area includes 231 windows. The meaning of each part in the figure is the same as (B), but the samples are not completely consistent.
  • the left picture shows the weighted differences on these windows for cfDNA samples from 20 randomly selected patients each with benign and malignant pulmonary nodules (40 in total); the middle picture shows the cumulative weighted difference at each window; the right picture shows the function used for the numerical transformation of the last cumulative value of each sample in the benign correlation mode into the WFDD, together with the results (box plot).
  • the following formula can be used to perform a normalization operation on a set of values (such as the depth values of a sample in multiple windows):
  • $x' = \dfrac{x - \bar{x}}{S}$
  • where $S$ is the standard deviation of this set of values, and $\bar{x}$ is its average value.
  • based on the average WFDD value of the benign nodule population samples and the average WFDD value of the malignant nodule population samples, the discrimination value of a specific region is calculated, and a predetermined number of regions with the largest discrimination values are selected as feature data.
  • the formula for calculating the discrimination value of a specific area is:
  • t is the discrimination value of a specific area; $\bar{X}_1$ and $\bar{X}_2$ are the average WFDD values of the benign nodule population samples or the malignant nodule population samples respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples or the malignant nodule population samples respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples respectively; the condition is: when the benign nodule population corresponds to subscript 1, the malignant nodule population corresponds to subscript 2; or when the benign nodule population corresponds to subscript 2, the malignant nodule population corresponds to subscript 1.
  • a higher t value indicates that the WFDD values of the two groups of samples have a greater difference in this area, that is, this area has a higher degree of discrimination.
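The formula for t is not reproduced in this text. The symbols it defines (two group averages, two group sizes, two standard deviations) match an unequal-variance two-sample t statistic, so the sketch below assumes that form; it is an illustration, not necessarily the formula used in the invention. The function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def discrimination_value(group1, group2):
    """Assumed form: t = (mean1 - mean2) / sqrt(S1^2/n1 + S2^2/n2)."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)
    return (mean(group1) - mean(group2)) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Hypothetical WFDD values of two populations in one region.
benign = [1.0, 2.0, 3.0]
malignant = [4.0, 5.0, 6.0]
t = discrimination_value(benign, malignant)
```

Swapping the two groups only changes the sign of t under this assumed form, so regions can be ranked by the magnitude of t regardless of which population is assigned subscript 1.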
  • an improved genetic algorithm can be used to search.
  • the improved genetic algorithm randomly merges and splits a series of regions (initial regions) obtained by simple strategies, and guides the generation of regions with larger t values.
  • the steps to search for potential high-discrimination windows include:
  • $x_i = \{\, i,\ i+1,\ i+2,\ \ldots,\ i+n-1,\ i+2^{j}n,\ i+2^{j}n+1,\ i+2^{j}n+2,\ \ldots,\ i+2^{j}n+n-1 \,\}$
  • n is the number of given continuous windows, and its value range is 1-100, preferably 5-50, and more preferably 5, and N is the total number of divided windows.
  • the two runs of consecutive windows are required to be separated by a certain distance so that a region can span long distances.
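The construction of an initial region can be sketched as follows: n consecutive windows starting at window i, plus another n consecutive windows starting 2^j·n windows downstream. The function name is hypothetical, and the permitted range of j is not stated in this text, so any non-negative integer is accepted here as an assumption.

```python
def initial_region(i, n, j):
    """Window numbers of an initial region x_i.

    i: number of the first window
    n: number of consecutive windows in each run
    j: exponent controlling the gap (assumed to be a non-negative integer)
    """
    offset = (2 ** j) * n
    first_run = list(range(i, i + n))
    second_run = list(range(i + offset, i + offset + n))
    return first_run + second_run

region = initial_region(10, 5, 1)
```

With i = 10, n = 5 and j = 1, the region covers windows 10 to 14 and 20 to 24, leaving a gap of five windows between the two runs.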
  • N is the total number of divided windows.
  • region x is selected as the first parent
  • the probability that another region i is selected as the second parent is:
  • N is the total number of divided windows
  • $m_i$ is the average of the window numbers included in region i.
  • the offspring produced by the combination of region $P_1$ and region $P_2$ are:
  • where S(p, s) is a subset obtained by drawing a proportion p of the elements of the set s with replacement.
  • the process of producing offspring is repeated many times.
  • the number of repetitions ranges from 1 to 1 million; in a preferred embodiment, from 100 to 100,000; and in a more preferred embodiment, it is 300,000. Finally, the regions with the greatest discrimination are selected as features to build the model and make predictions.
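The offspring step can be sketched in Python under stated assumptions. The exact offspring formula and the parent selection probabilities are not reproduced in this text, so the sketch only shows the part that is described: take the union of the parents' windows and keep a subset drawn with replacement. The function names and the rounding rule are hypothetical; the 20% deletion fraction is the value the text names as further preferred.

```python
import random

def sample_subset(p, s, rng):
    """S(p, s): draw round(p * |s|) elements of s with replacement, keep the distinct ones."""
    items = sorted(s)
    k = max(1, round(p * len(items)))  # assumption: at least one window survives
    return {rng.choice(items) for _ in range(k)}

def make_offspring(parent1, parent2, delete_fraction=0.2, rng=None):
    """Combine two parent regions (sets of window numbers) into one offspring region."""
    rng = rng or random.Random()
    union = set(parent1) | set(parent2)
    return sample_subset(1.0 - delete_fraction, union, rng)

parent_a = {10, 11, 12, 13, 14, 20, 21, 22, 23, 24}
parent_b = {12, 13, 14, 15, 16, 32, 33, 34, 35, 36}
child = make_offspring(parent_a, parent_b, rng=random.Random(0))
```

Because the draw is with replacement, the offspring may lose more windows than the nominal deletion fraction suggests, which lets regions both merge and split over many repetitions.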
  • the pulmonary nodule screening method includes: selecting a certain number of regions. To determine the type of a specific sample based on these regions, the WFDD values of the sample in these regions are calculated and summed to obtain the total WFDD value. The total WFDD value is then compared with a predetermined threshold, and the sample type is determined by whether the total is greater or less than the threshold.
  • the threshold is calculated based on samples from the training set. First, calculate the total WFDD value of each sample in the training set for these areas, and then calculate the optimal segmentation point based on the total WFDD value of the benign nodule population and malignant nodule population in the training set. This segmentation point is the required threshold.
  • the optimal split point can be calculated using the roc function (pROC package from R language).
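The text computes the optimal split point with the roc function of the R package pROC. As an illustration in Python, the sketch below picks the cut-off that maximizes sensitivity + specificity - 1 (Youden's J), which is one common criterion; the text does not state which criterion is used, and the direction of the comparison (larger totals called malignant) is also an assumption. All values are made up.

```python
def best_threshold(benign_totals, malignant_totals):
    """Return (cut-off, J) maximizing Youden's J over the observed total WFDD values."""
    candidates = sorted(set(benign_totals) | set(malignant_totals))
    best_cut, best_j = None, float("-inf")
    for cut in candidates:
        # Assumption: a total WFDD value >= cut is called malignant.
        sens = sum(v >= cut for v in malignant_totals) / len(malignant_totals)
        spec = sum(v < cut for v in benign_totals) / len(benign_totals)
        j = sens + spec - 1
        if j > best_j:  # ties keep the smaller cut-off
            best_cut, best_j = cut, j
    return best_cut, best_j

def classify(total_wfdd, threshold):
    """Compare a sample's total WFDD value with the threshold."""
    return "malignant" if total_wfdd >= threshold else "benign"

# Hypothetical total WFDD values from a training set.
benign_totals = [0.1, 0.2, 0.3, 0.4]
malignant_totals = [0.35, 0.6, 0.7, 0.9]
cut, j = best_threshold(benign_totals, malignant_totals)
label = classify(0.5, cut)
```

The threshold is learned once from the training set and then reused unchanged for every sample to be tested.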
  • the construction device of the present invention includes: a feature data screening module, configured to screen, within the entire range of the human reference genome, a predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples in the training set, as feature data; and a building module, configured to use the feature data to build a pulmonary nodule screening model.
  • the feature data screening module of the present invention includes: a window division module, configured to join together the base sequences of all autosomes in the human reference genome and divide the joined base sequence into a series of windows of fixed length, each window corresponding to a base sequence; a first calculation module, configured to calculate the window reference depth and the weight of each window, where the window reference depth is the average depth value of the training-set samples in the window, the weight is the square of the variance of the depth values of the training-set samples in the window, and the depth value of a window is the number of base sequence fragments in a sample's sequencing data that can be aligned to the base sequence corresponding to the window; a second calculation module, configured to calculate the window sample depth of a specified training-set sample in each window; a third calculation module, configured to calculate the difference between the window sample depth and the window reference depth; a fourth calculation module, configured to multiply the difference by the weight to obtain the weighted difference of the window; a fifth calculation module, configured to combine a variable number of windows into different regions and sum the weighted differences of all windows in a specified region to obtain the total weighted difference; and a numerical transformation module, configured to perform a numerical transformation on the total weighted difference to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region.
  • the feature data screening module also includes a normalization processing module, which is configured to normalize the window reference depth and the window sample depth.
  • the average value of the window reference depth of each window after the normalization process is 0, and the standard deviation is 1.
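The window-level calculations performed by these modules can be sketched in Python. The sketch follows the text literally (weight = square of the variance) and uses the population variance as an assumption, since the text does not say which form is meant. The final numerical transformation that turns the total weighted difference into a WFDD value is not specified in this text and is therefore left out. Function names and data are hypothetical.

```python
from statistics import mean, pvariance

def window_reference(training_depths):
    """training_depths: one list per sample, holding one depth value per window."""
    n_windows = len(training_depths[0])
    columns = [[sample[w] for sample in training_depths] for w in range(n_windows)]
    reference = [mean(col) for col in columns]          # window reference depth
    weights = [pvariance(col) ** 2 for col in columns]  # square of the variance, as stated
    return reference, weights

def total_weighted_difference(sample_depths, reference, weights, region):
    """Sum of (sample depth - reference depth) * weight over the windows of a region."""
    return sum((sample_depths[w] - reference[w]) * weights[w] for w in region)

# Two hypothetical training samples with two windows each.
training = [[1, 2], [3, 2]]
reference, weights = window_reference(training)
total = total_weighted_difference([4, 5], reference, weights, region=[0, 1])
```

A region is simply a list of window numbers, so continuous and discontinuous regions are handled the same way.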
  • the characteristic data screening sub-module of the present invention includes: a first calculation unit, which is configured to calculate the WFDD value of each sample in the benign nodule population sample and the malignant nodule population sample for a specific area; a second calculation unit , is set to respectively calculate the average WFDD value of the benign nodule population sample and the average WFDD value of the malignant nodule population sample; the third calculation unit is set to calculate the discrimination value of a specific area according to the following formula, and The selection unit is set to select a predetermined number of regions with the largest discrimination values as feature data: Among them: t is the discrimination value of a specific area; and are the average WFDD values from benign nodule population samples or malignant nodule population samples respectively; n 1 and n 2 are the number of values from benign nodu
  • the feature data screening sub-module includes: an initial region generation unit and a region combination splitting unit.
  • the initial region generation unit includes: a window division element, configured to divide the genome into a series of windows; a window encoding element, configured to number the windows according to their order; a region numbering element, configured to use a series of window numbers as the number of the region obtained by combining those windows; and a window combination element, configured to combine n consecutive windows starting at window i with another n consecutive windows located $2^{j}n$ windows downstream to form an initial region:
  • the region combination and splitting unit includes: a first offspring selection element, configured to use a genetic algorithm in which two parents combine to exchange information and generate offspring, where all initial regions are put into a region pool and randomly selected to generate offspring; the probability that region i is selected as one of the parents is given by a formula in which N is the total number of divided windows and $t_i$ is the t value of the i-th window; and, when region x has been selected as the first parent, the probability that another region i is selected as the second parent is given by a formula in which N is the total number of divided windows and $m_i$ is the average of the window numbers included in region i; a second offspring selection element, configured to, after the parents are selected, take the union of the windows included in the parents and randomly delete several windows to form the offspring, the random selection being sampling with replacement; and a third offspring selection element, configured to put the offspring into the region pool for the next round of selection once they are obtained; the parents are not deleted from the region pool in this operation.
  • the screening device includes: a first calculation module, configured to calculate the sum of the weighted fragment distribution difference values of the sample to be tested in a selected predetermined number of regions to obtain a total WFDD value; an input module, configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model of the present invention; and an output module, configured to output the screening result of the sample to be tested; wherein the selected predetermined number of regions are the same as the predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples.
  • the input module includes: an input unit, configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model; and a determination unit, configured to compare the total WFDD value of the sample to be tested with a predetermined threshold to determine the type of pulmonary nodule in the sample to be tested.
  • the screening device further includes a predetermined threshold acquisition module.
  • the predetermined threshold acquisition module includes: a first calculation unit, configured to calculate the total WFDD value of each training-set sample over the predetermined number of regions with the largest difference in WFDD values; and a second calculation unit, configured to calculate the optimal segmentation point from the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal segmentation point being the predetermined threshold; more preferably, the optimal segmentation point is calculated using the roc function of the pROC package on the R language analysis platform.
  • Embodiments of the present invention provide a computer-readable storage medium on which a program is stored. When the program is run, it executes the method for establishing a pulmonary nodule screening model of the present invention or the pulmonary nodule screening method of the present invention.
  • Embodiments of the present invention provide a processor configured to run a program. When the program is run, the method for establishing a pulmonary nodule screening model of the present invention or the pulmonary nodule screening method of the present invention is executed.
  • the present invention will be described in further detail below with reference to specific examples. These examples shall not be construed as limiting the scope of protection claimed by the present invention.
  • Example 1 Establishing a pulmonary nodule screening model
  • Blood samples from 639 patients with untreated pulmonary nodules were collected, including 484 patients with malignant nodules (85% of which were stage I) and 155 patients with benign nodules. Patients with malignant pulmonary nodules only included patients with non-small cell lung cancer. All patients have been anonymized and have given consent for their samples to be used in clinical research.
  • Use EDTA tubes to collect whole blood and process it immediately. If it cannot be processed immediately, store it at 4°C for no more than 1 day. Centrifuge at 1600g for 10 minutes at 4°C to separate the plasma from the cellular components. Centrifuge the plasma again at 16000g for 10 minutes at 4°C to remove any residual cells, and store it at -80°C until use.
  • Use the MagPure Circulating DNA KF Kit (Magen) to extract cfDNA from 200 µL of plasma, use the MGIEasy Cell-free DNA Library Prep Set (MGI) to construct a standard next-generation sequencing library from the extracted cfDNA, and sequence it on the MGISEQ-2000 platform, finally obtaining whole-genome sequencing data at approximately 0.5-1.0x depth for each sample.
  • Use Sentieon software to process the sequencing data (including alignment, sorting and deduplication), and use the readCounter software to count, for each sample, the number of reads aligned to each 1 kbp region of the autosomes (the reads-per-kb depth value); then sum every 30 consecutive depth values to obtain the depth of a 30 kbp region.
  • 30 kbp is the length of one window.
  • readCounter does not directly support counting in units of 30 kbp. The same operation is performed for each sample, resulting in a depth value matrix of length 95833 (number of windows) and width 639 (number of samples).
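The aggregation of 1 kbp depth values into 30 kbp windows can be sketched in Python. The function name is hypothetical, and dropping a trailing incomplete window is an assumption; the text does not say how the remainder at the end of the joined sequence is handled.

```python
def aggregate_bins(depths_1kb, bins_per_window=30):
    """Sum every `bins_per_window` consecutive 1 kbp depth values into one window depth."""
    n_windows = len(depths_1kb) // bins_per_window  # assumption: incomplete tail is dropped
    return [
        sum(depths_1kb[k * bins_per_window:(k + 1) * bins_per_window])
        for k in range(n_windows)
    ]

# 65 hypothetical 1 kbp bins with one read each give two complete 30 kbp windows.
window_depths = aggregate_bins([1] * 65)
```

Applying this to every sample and stacking the results gives the windows-by-samples depth matrix described above.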
  • Sensitivity = number of true positives / (number of true positives + number of false negatives) × 100% (the proportion of patients correctly identified);
  • the specificity and sensitivity of the method of the present invention were both about 0.8, and the product specificity × sensitivity was about 0.64.
  • CT scans were also performed on the above 639 patients with untreated pulmonary nodules. The specificity of the CT-scan-based method was calculated to be about 0.3, meaning that about 70% of benign patients were judged malignant or could not be judged; the sensitivity of the CT-scan-based method was about 0.93; and the product specificity × sensitivity was about 0.28.
  • the method of the present invention has significantly higher specificity and comparable sensitivity, and therefore a significantly higher specificity × sensitivity product, indicating that the model performs well in determining a patient's pulmonary nodule type from the patient's blood cfDNA.
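The comparison above can be checked with a short Python sketch. The specificity definition used here (true negatives divided by true negatives plus false positives) is the standard one; the counts passed to the functions are made up for illustration, and only the rounded rates (0.8, 0.3, 0.93) come from the text.

```python
def sensitivity(true_pos, false_neg):
    """Proportion of malignant patients correctly identified."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of benign patients correctly identified."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts giving a sensitivity and a specificity of 0.8.
example_sens = sensitivity(80, 20)
example_spec = specificity(40, 10)

# Products of the rounded rates reported in the text.
wfdd_product = 0.8 * 0.8   # method of the invention
ct_product = 0.3 * 0.93    # CT-scan-based method
```

The products reproduce the reported values of about 0.64 and about 0.28.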
  • Example 2 Obtaining a pulmonary nodule screening method
  • This embodiment proposes a pulmonary nodule screening method, which includes the following steps:
  • the pulmonary nodule screening method also includes: inputting the total WFDD value of the sample to be tested into the pulmonary nodule screening model, which determines the type of pulmonary nodule in the sample by comparing the total WFDD value with a predetermined threshold.
  • the predetermined threshold is obtained as follows: calculate the total WFDD value of each training-set sample over the predetermined number of regions with the largest difference in WFDD values; then calculate the optimal segmentation point from the total WFDD values of the benign nodule population and the malignant nodule population in the training set. The optimal segmentation point is the predetermined threshold.
  • Example 3 Obtaining a device for constructing a pulmonary nodule screening model
  • the present invention proposes a device for constructing a pulmonary nodule screening model.
  • the device includes: a feature data screening module, configured to screen, within the entire range of the human reference genome, a predetermined number of regions with the largest difference in weighted fragment distribution difference (WFDD) values between the benign nodule population samples and the malignant nodule population samples in the training set, as feature data; and a building module, configured to use the feature data to build a pulmonary nodule screening model.
  • the feature data screening module includes: a window division module, configured to join together the base sequences of all autosomes in the human reference genome and divide the joined base sequence into a series of windows of fixed length, each window corresponding to a base sequence; a first calculation module, configured to calculate the window reference depth and the weight of each window, where the window reference depth is the average depth value of the training-set samples in the window, the weight is the square of the variance of the depth values of the training-set samples in the window, and the depth value of a window is the number of base sequence fragments in a sample's sequencing data that can be aligned to the base sequence corresponding to the window; a second calculation module, configured to calculate the window sample depth of a specified training-set sample in each window; a third calculation module, configured to calculate the difference between the window sample depth and the window reference depth; a fourth calculation module, configured to multiply the difference by the weight to obtain the weighted difference of the window; a fifth calculation module, configured to combine a variable number of windows into different regions and sum the weighted differences of all windows in a specified region to obtain the total weighted difference; a numerical transformation module, configured to perform a numerical transformation on the total weighted difference to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region; and a sixth calculation module, configured to calculate, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples in the training set;
  • the feature data filtering module also includes a normalization processing module.
  • the average value of the window reference depth of each window after the normalization process is 0, and the standard deviation is 1.
  • the feature data screening sub-module includes: a first calculation unit, configured to calculate the WFDD value of each sample in the benign nodule population sample and the malignant nodule population sample for a specific area; a second calculation unit, It is set to respectively calculate the average WFDD value of the benign nodule population sample and the average WFDD value of the malignant nodule population sample; the third calculation unit is set to calculate the discrimination value of a specific area according to the following formula, and select the unit , is set to select a predetermined number of regions with the largest discrimination values as feature data:
  • where t is the discrimination value of a specific region; $\bar{x}_1$ and $\bar{x}_2$ are the average WFDD values of the benign nodule population samples or the malignant nodule population samples, respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples or the malignant nodule population samples, respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples, respectively; provided that when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2, and when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
  • N is the total number of divided windows
  • t i is the t value of the i-th window
  • N is the total number of divided windows
  • m i is the average of the window numbers included in area i;
  • the second offspring selection element is configured to, after the parents are selected, take the union of the windows included in the parents and randomly delete several of the windows to form the offspring, the random selection being sampling with replacement; and the third offspring selection element is configured to put the offspring into the region pool for the next round of selection once they are obtained; the parents are not deleted from the region pool in this operation.
  • the number n of consecutive windows is 1-100, preferably 5-50, more preferably 5; preferably, after the parents are selected, the union of the windows included in the parents is taken and 1%-99%, more preferably 5%-50%, further preferably 20% of the windows are randomly deleted to form the offspring, using sampling with replacement.
  • the process of generating progeny is repeated 1 to 1 million times, more preferably 100 to 100,000 times, and further preferably 300,000 times.
  • the connected base sequence is divided into a series of windows according to a length of 100bp-100kbp, preferably 10kbp-50kbp, and more preferably 30kbp.
  • the windows combined to form different areas are continuous or discontinuous.
  • the predetermined number of areas is 1-500, more preferably 10-100, and further preferably 50.
  • Example 4 Obtaining a pulmonary nodule screening device
  • the present invention proposes a pulmonary nodule screening device, which includes: a first calculation module, configured to calculate the sum of the weighted fragment distribution difference values of the sample to be tested in a selected predetermined number of regions to obtain a total WFDD value; an input module, configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model constructed by the construction device of the present invention; and an output module, configured to output the screening result of the sample to be tested.
  • the selected predetermined number of regions are the same as the predetermined number of regions with the largest difference in weighted fragment distribution difference values (WFDD) between the benign nodule population sample and the malignant nodule population sample.
  • the input module includes: an input unit, configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model; and a determination unit, configured to compare the total WFDD value of the sample to be tested with a predetermined threshold to determine the type of pulmonary nodule of the sample to be tested.
  • the screening device further includes a predetermined threshold acquisition module.
  • the predetermined threshold acquisition module includes: a first calculation unit, configured to calculate the total WFDD value of each training-set sample over the predetermined number of regions with the largest difference in WFDD values; and a second calculation unit, configured to calculate the optimal segmentation point from the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal segmentation point being the predetermined threshold.
  • the construction method, screening model, screening method and screening device of the pulmonary nodule screening model of the present invention can also be applied to methylation sequencing data, RNA Sequencing data and proteomics-related data.
  • embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory that causes a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-persistent (volatile) memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include both persistent and non-persistent, removable and non-removable media, which can store information by any method or technology.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD), magnetic tape cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
  • the present invention provides a product for determining human nodule types with high accuracy based on low-depth whole-genome sequencing data of cell-free DNA. Specifically, the present invention proposes a method and device for establishing a pulmonary nodule screening model and a method and device for screening pulmonary nodules, which:
  • This method is developed based on cfDNA sequencing data and requires only a blood draw or the collection of other body fluids, without the risk of radiation exposure.

Abstract

The present invention relates to a construction method and apparatus for a pulmonary nodule screening model, and a pulmonary nodule screening method and apparatus. Specifically, the present invention relates to a construction method for a pulmonary nodule screening model, the method comprising the following steps: within the full range of human reference genomes, screening a training set so as to obtain a predetermined number of regions having the greatest difference between weighted fragment distribution difference (WFDD) values of benign-nodule crowd samples and malignant-nodule crowd samples, and taking the predetermined number of regions as feature data; and using the feature data to construct a pulmonary nodule screening model. The method in the present invention realizes non-invasive human nodule type testing, avoids invasive testing, and provides a judgment accuracy higher than that of existing CT scanning and close to that of invasive tissue biopsies. The method is based on sequencing data development of cf DNA, and only requires operations such as drawing blood or collecting other bodily fluids, without the risk of exposure to radiation.

Description

Method and device for establishing a pulmonary nodule screening model, and method and device for pulmonary nodule screening
Technical field
The present invention relates to the technical field of bioinformatics, and specifically to a method and device for establishing a pulmonary nodule screening model and a method and device for pulmonary nodule screening.
Background
Cancer has become the leading cause of death in China, and its incidence is rising year by year. According to the 2019 National Cancer Report released by the National Cancer Center, deaths from malignant tumors account for 23.91% of all deaths among residents, and the annual medical expenses caused by malignant tumors exceed 220 billion. Ranked by number of cases, lung cancer has the highest incidence among malignant tumors in China.
Pulmonary sarcoidosis is a multi-system, multi-organ granulomatous disease of unknown etiology. It often invades the lungs, bilateral hilar lymph nodes, eyes, skin and other organs, and its rate of chest involvement is as high as 80% to 90%. The prognosis of pulmonary sarcoidosis is mostly good. Early lung cancer, in contrast, mostly presents as small nodules in the lungs. Distinguishing the type of nodule is therefore particularly important for early screening of lung cancer. At present, the more reliable methods for determining the type of pulmonary nodule essentially rely on tissue biopsy obtained through invasive surgical sampling.
The World Health Organization points out that early detection and early treatment are the key to effective cancer treatment, so developing early screening and early detection technology for cancer is extremely important. Currently, clinical non-invasive diagnosis of nodules (such as pulmonary nodules) relies mainly on low-dose spiral CT, which uses the smallest scanning range, the lowest dose and the smallest amount of X-rays to diagnose lesions. Compared with conventional CT examination, it delivers less radiation and shows tiny nodules more clearly. However, there has long been controversy over whether low-dose spiral CT increases the risk of cancer. Epidemiological studies show that even two or three CT scans deliver a radiation dose that leads to a detectable increase in cancer risk, especially in children (Computed Tomography-An Increasing Source of Radiation Exposure. N Engl J Med 2007;357:2277-2284 DOI:10.1056/NEJMra072149).
Cell-free DNA (cfDNA), also known as circulating free DNA, is extracellular DNA present in peripheral fluids such as blood and urine. cfDNA derived from tumors is also called circulating tumor DNA (ctDNA). The value of cfDNA as a new diagnostic marker has been confirmed in a variety of solid tumors, and the prior art has developed early cancer screening methods based on whole-genome sequencing or methylation sequencing of cfDNA. The literature (Integrating Genomic Features for Non-Invasive Early Lung Cancer Detection. Nature; Volume 580, Pages 245-251 (2020); https://doi.org/10.1038/s41586-020-2140-0) has reported non-invasive screening of cancer patients based on cfDNA fragments or protein markers in peripheral fluid. However, almost all of these methods are designed to screen healthy people, that is, to pick out cancer patients from a healthy population. They can hardly guarantee that a tumor at a given site is accurately distinguished from other non-cancerous diseases at the same site, and they may misclassify patients with inflammation as cancer patients.
Wenhua Liang et al. proposed a method for diagnosing pulmonary nodules based on cfDNA methylation sequencing data (Non-Invasive Diagnosis of Early-Stage Lung Cancer Using High-Throughput Targeted DNA Methylation Sequencing of Circulating Tumor DNA (ctDNA). Theranostics; 2019;9(7):2056-2070. DOI:10.7150). In a validation set of 39 patients with malignant pulmonary nodules and 27 patients with benign pulmonary nodules, the area under the receiver operating characteristic (ROC) curve (Area Under Curve, AUC) of that method was 0.816. By comparing malignant and benign lung lesions, that article characterized tissue DNA methylation features and constructed a diagnostic model for benign/malignant nodules. Applied to the identification of tumor-specific ctDNA in the plasma of patients with pulmonary nodules, the model showed a certain sensitivity and specificity for early lung cancer. However, the accuracy of this non-invasive diagnostic method for pulmonary nodules is low and cannot yet meet clinical requirements.
Copy number variation (CNV) is caused by genome rearrangement. It generally refers to an increase or decrease in the copy number of large genomic segments more than 1 kb in length, and mainly manifests as submicroscopic deletions and duplications. CNV is an important component of genomic structural variation (SV). The mutation rate of CNV sites is much higher than that of single nucleotide polymorphisms (SNP), and CNV is one of the important causative factors of human disease. The literature (Maternal Malignancies Detected With Noninvasive Prenatal Testing Reply. Jama the Journal of the American Medical Association; 2015 Nov 24;314(20):2192-3 [DOI:10.1001/jama.2015.12922]) has reported the detection of copy number variations in blood cfDNA sequencing data from cancer patients. This means that, in the regions where CNVs occur, the distribution of DNA fragments in cancer patients differs from a baseline representing healthy people.
Therefore, there is an urgent need for products that meet the requirements of clinical application and detect the type of human nodules with high accuracy, for example by determining non-invasively whether a patient's pulmonary nodule is benign or malignant, thereby avoiding invasive examination.
Summary of the Invention
The main purpose of the present invention is to provide a method for constructing a pulmonary nodule screening model, a screening model, a screening method and a screening device that can distinguish malignant tumors at a given site from other non-cancerous diseases, and in particular can distinguish the type of a pulmonary nodule.
To achieve the above purpose, according to a first aspect of the present invention, a method for constructing a pulmonary nodule screening model is provided, the method comprising the following steps:
within the entire human reference genome, screening, as feature data, a predetermined number of regions in which the weighted fragment distribution difference (WFDD) values differ most between the benign nodule population samples and the malignant nodule population samples of a training set; and constructing the pulmonary nodule screening model using the feature data.
Further, the step of screening, as feature data, the predetermined number of regions in which the weighted fragment distribution difference (WFDD) values differ most between the benign nodule population samples and the malignant nodule population samples of the training set comprises:
concatenating the base sequences of all autosomes in the reference genome and dividing the concatenated base sequence into a series of windows of fixed length, each window corresponding to a segment of base sequence; calculating a window reference depth and a weight for each window, wherein the window reference depth is the mean of the depth values of the training-set samples in the window, the weight is the square of the variance of the depth values of the training-set samples in the window, and the depth value of a window is the number of base sequence fragments in a sample's sequencing data that can be aligned to the base sequence corresponding to that window; calculating the window sample depth of a specified training-set sample in each window; calculating the difference between the window sample depth and the window reference depth; multiplying the difference by the weight to obtain the weighted difference of the window; combining variable numbers of windows to form different regions, and summing the weighted differences of all windows in a specified region to obtain a weighted difference sum; applying a numerical transformation to the weighted difference sum to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region; and calculating, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples of the training set, and screening the predetermined number of regions with the largest differences as the feature data.
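The per-window and per-region arithmetic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the population variance is assumed (the text does not say whether the population or sample variance is meant), and the numerical transformation is left as a pluggable `transform` parameter that defaults to the identity because the text does not specify it here.

```python
from statistics import mean, pvariance


def window_baseline_and_weight(train_depths):
    """train_depths: one list of per-window depth values for each training sample."""
    n_windows = len(train_depths[0])
    baseline, weight = [], []
    for w in range(n_windows):
        column = [sample[w] for sample in train_depths]
        # Window reference depth: mean depth of the training samples in this window.
        baseline.append(mean(column))
        # Weight: square of the variance, as stated in the text
        # (population variance is an assumption of this sketch).
        weight.append(pvariance(column) ** 2)
    return baseline, weight


def weighted_differences(sample_depths, baseline, weight):
    """(window sample depth - window reference depth) * weight, per window."""
    return [(d - b) * w for d, b, w in zip(sample_depths, baseline, weight)]


def wfdd(sample_depths, baseline, weight, region, transform=lambda x: x):
    """Sum the weighted differences over the windows of a region, then transform.

    region is a list of 0-based window indices; transform is a hypothetical
    placeholder for the unspecified numerical transformation.
    """
    diffs = weighted_differences(sample_depths, baseline, weight)
    return transform(sum(diffs[w] for w in region))


baseline, weight = window_baseline_and_weight([[10, 20, 30], [14, 20, 26]])
print(wfdd([13, 25, 27], baseline, weight, region=[0, 1]))
```

Because a region is just a list of window indices, contiguous and non-contiguous regions are handled identically.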
Further, before the difference between the window sample depth and the window reference depth is calculated, the method further comprises a step of normalizing the window reference depth and the window sample depth.
Further, after normalization, the window reference depths across the windows have a mean of 0 and a standard deviation of 1.
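One way to obtain a mean of 0 and a standard deviation of 1 is a z-score across windows. The sketch below assumes that form and assumes that each depth profile (reference or sample) is standardized independently; the text does not state either point, and the function name `standardize` is hypothetical.

```python
from statistics import mean, pstdev


def standardize(depths):
    """Z-score a list of per-window depths so that mean = 0 and std = 1."""
    mu = mean(depths)
    sigma = pstdev(depths)
    if sigma == 0:
        raise ValueError("all windows have the same depth; cannot standardize")
    return [(d - mu) / sigma for d in depths]


z = standardize([2, 4, 6])
print(z)
```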
Further, screening the predetermined number of regions with the largest differences as the feature data comprises:
for a specific region, calculating the WFDD value of each sample among the benign nodule population samples and the malignant nodule population samples; calculating the mean WFDD value of the benign nodule population samples and the mean WFDD value of the malignant nodule population samples separately; calculating the discrimination value of the specific region according to the following formula; and selecting the predetermined number of regions with the largest discrimination values as the feature data:
Figure PCTCN2022097450-appb-000001
where: t is the discrimination value of the specific region; the symbols shown in Figure PCTCN2022097450-appb-000002 and Figure PCTCN2022097450-appb-000003 are the mean WFDD values from the benign nodule population samples or the malignant nodule population samples, respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples or the malignant nodule population samples, respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples, respectively; provided that when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2, or when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
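The ranking step can be sketched as follows. The exact formula is given only in the figure, so this sketch assumes the unequal-variance (Welch) form of the two-sample t statistic, which uses exactly the quantities defined above (two means, two sample sizes, two standard deviations). Ranking by absolute value is a further assumption, made because the text allows either group to carry subscript 1. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def discrimination_value(group1, group2):
    """Two-sample t statistic, assumed here to be the Welch form."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)
    return (mean(group1) - mean(group2)) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def top_regions(wfdd_by_region, k):
    """wfdd_by_region maps a region id to (benign WFDD values, malignant WFDD values).

    Returns the k region ids with the largest |t| (absolute value is an assumption).
    """
    scored = {
        region: abs(discrimination_value(benign, malignant))
        for region, (benign, malignant) in wfdd_by_region.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]


regions = {
    "a": ([1, 2, 3], [4, 5, 6]),
    "b": ([1, 2, 3], [1, 2, 4]),
}
print(top_regions(regions, 1))
```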
Further, screening the predetermined number of regions with the largest differences as the feature data comprises:
a.) generating initial regions:
- dividing the genome into a series of windows;
- numbering the windows according to their order;
- numbering each region obtained by combining windows with the numbers of the windows it contains;
- combining the n consecutive windows starting at window i with another n consecutive windows located $2^{j}n$ windows downstream, to form an initial region:
$x_i = \{i,\ i+1,\ i+2,\ \ldots,\ i+n-1,\ i+2^{j}n,\ i+2^{j}n+1,\ i+2^{j}n+2,\ \ldots,\ i+2^{j}n+n-1\}$
$i = 1, n, 2n, \ldots, N;\quad j = 1, 2, \ldots, 8;\quad i + (2^{j}+1)n \le N$
where n is the given number of consecutive windows and N is the total number of windows obtained by the division;
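The initial-region construction in step a.) can be sketched as below. The sketch reads the boundary condition as $i + (2^{j}+1)n \le N$, so that both blocks of n windows fit inside the N windows; that reading, the function name and the `max_j` parameter are assumptions of this illustration.

```python
def initial_regions(total_windows, n, max_j=8):
    """Pair n consecutive windows at i with n consecutive windows 2**j * n downstream.

    Window numbers are 1-based, as in the text.
    """
    # Start positions i = 1, n, 2n, ... (deduplicated for the n = 1 case).
    starts = sorted({1, *range(n, total_windows + 1, n)})
    regions = []
    for i in starts:
        for j in range(1, max_j + 1):
            offset = (2 ** j) * n
            # Assumed boundary condition: i + (2**j + 1) * n <= N.
            if i + offset + n > total_windows:
                break
            first_block = list(range(i, i + n))
            second_block = list(range(i + offset, i + offset + n))
            regions.append(first_block + second_block)
    return regions


regions = initial_regions(total_windows=20, n=2, max_j=2)
print(regions[:2])
```

Doubling the gap with j lets one region pair windows that are near each other as well as windows that are far apart, without enumerating every possible pair.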
b.) combining and splitting regions:
using a genetic algorithm, in which two parents are combined to exchange information and produce offspring; wherein all initial regions are placed in a region pool and selected at random to produce offspring; wherein
- the probability that region i is selected as one of the parents is:
Figure PCTCN2022097450-appb-000004
where N is the total number of windows obtained by the division and $t_i$ is the t value of the i-th window;
- after region x has been selected as the first parent, the probability that another region i is selected as the second parent is:
Figure PCTCN2022097450-appb-000005
where N is the total number of windows obtained by the division and $m_i$ is the mean of the window numbers contained in region i; and
- after the parents have been selected, taking the union of the windows contained in the two parents and randomly deleting some of those windows to obtain the offspring, the random selection being sampling with replacement;
- after an offspring is obtained, placing the offspring in the region pool for the next round of selection, the parents not being deleted in this operation, wherein the offspring produced by combining region $P_1$ and region $P_2$ is:
$\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p,\ P_1 \cup P_2)$
where S(p, s) is the subset obtained by drawing a proportion p of the elements of the set s with replacement;
- repeating the process of producing offspring;
- calculating the discrimination value of each offspring produced according to the following formula, and selecting the predetermined number of regions with the largest discrimination values as the feature data:
Figure PCTCN2022097450-appb-000006
where: t is the discrimination value of the specific region; the symbols shown in Figure PCTCN2022097450-appb-000007 and Figure PCTCN2022097450-appb-000008 are the mean WFDD values from the benign nodule population samples or the malignant nodule population samples, respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples or the malignant nodule population samples, respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples, respectively; provided that:
when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2; or when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
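The offspring rule $\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p,\ P_1 \cup P_2)$ can be sketched as below. Only the combination step is shown: the two parent-selection probabilities are defined by the figures and are not reproduced, so the parents are simply passed in. The function name, the rounding of the number of draws and the `rng` parameter are assumptions of this illustration.

```python
import random


def make_child(parent1, parent2, p=0.2, rng=random):
    """Union of the parents' windows minus a subset drawn with replacement.

    p is the proportion of the union that is drawn. Because the draws are made
    with replacement, duplicates can occur, so the number of windows actually
    removed may be smaller than round(p * len(union)).
    """
    union = sorted(set(parent1) | set(parent2))
    draws = round(p * len(union))
    removed = set(rng.choices(union, k=draws)) if draws else set()
    return [w for w in union if w not in removed]


child = make_child([1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9, 10], p=0.2, rng=random.Random(0))
print(child)
```

In a full run the child would be appended to the region pool while both parents stay in it, and the loop would repeat for the chosen number of generations before the regions are ranked by discrimination value.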
Further, the number n of consecutive windows is 1-100, preferably 5-50, and more preferably 5.
Further, after the parents have been selected, the union of the windows contained in the two parents is taken and 1%-99%, more preferably 5%-50%, and further preferably 20% of the windows are randomly deleted, by sampling with replacement, to obtain the offspring.
Further, the process of producing offspring is repeated 1 to 1 million times, more preferably 100 to 100,000 times, and further preferably 300,000 times.
Further, the concatenated base sequence is divided into a series of windows with a length of 100 bp-100 kbp, preferably 10 kbp-50 kbp, and more preferably 30 kbp.
Further, the windows combined to form the different regions are contiguous or non-contiguous.
Further, the predetermined number of regions is 1-500, more preferably 10-100, and further preferably 50.
According to a second aspect of the present invention, a pulmonary nodule screening method is provided, comprising the following steps: calculating the sum of the weighted fragment distribution difference values of a sample to be tested over a selected predetermined number of regions to obtain a total WFDD value; inputting the total WFDD value of the sample to be tested into a pulmonary nodule screening model constructed by the method described in the first aspect of the present invention; and outputting the screening result for the sample to be tested; wherein the selected predetermined number of regions are the same as the predetermined number of regions in which the weighted fragment distribution difference (WFDD) values differ most between the benign nodule population samples and the malignant nodule population samples.
Further, the total WFDD value of the sample to be tested is input into the pulmonary nodule screening model, and the pulmonary nodule screening model determines the pulmonary nodule type of the sample to be tested by comparing the total WFDD value of the sample with a predetermined threshold.
Further, the predetermined threshold is obtained as follows: calculating, for each sample in the training set, the total WFDD value over the predetermined number of regions with the largest WFDD differences; and calculating an optimal cut-off point from the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal cut-off point being the predetermined threshold.
Further, the optimal cut-off point is calculated using the roc function of the pROC package for the R language analysis platform.
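The thresholding logic can be illustrated in Python as below. This is a stand-in for the R workflow, not a port of pROC: it assumes that the optimal cut-off point is the one maximizing Youden's J (sensitivity + specificity - 1), and that malignant samples tend to have the higher total WFDD value. Neither assumption is stated in the text, and both function names are hypothetical.

```python
def best_cutoff(benign_scores, malignant_scores):
    """Return the cut-off maximizing Youden's J = sensitivity + specificity - 1."""
    values = sorted(set(benign_scores) | set(malignant_scores))
    # Candidate cut-offs: midpoints between consecutive observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best, best_j = None, -1.0
    for cutoff in candidates:
        # Assumed direction: a score above the cut-off is called malignant.
        sensitivity = sum(s > cutoff for s in malignant_scores) / len(malignant_scores)
        specificity = sum(s <= cutoff for s in benign_scores) / len(benign_scores)
        j = sensitivity + specificity - 1
        if j > best_j:
            best, best_j = cutoff, j
    return best


def classify(total_wfdd, cutoff):
    """Compare a sample's total WFDD value with the predetermined threshold."""
    return "malignant" if total_wfdd > cutoff else "benign"


cutoff = best_cutoff([1, 2, 3], [4, 5, 6, 2.5])
print(cutoff, classify(4, cutoff))
```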
According to a third aspect of the present invention, a device for constructing a pulmonary nodule screening model is provided, comprising: a feature data screening module, configured to screen, within the entire human reference genome and as feature data, a predetermined number of regions in which the weighted fragment distribution difference (WFDD) values differ most between the benign nodule population samples and the malignant nodule population samples of a training set; and a construction module, configured to construct the pulmonary nodule screening model using the feature data.
Further, the feature data screening module comprises: a window division module, configured to concatenate the base sequences of all autosomes in the human reference genome and divide the concatenated base sequence into a series of windows of fixed length, each window corresponding to a segment of base sequence; a first calculation module, configured to calculate a window reference depth and a weight for each window, wherein the window reference depth is the mean of the depth values of the training-set samples in the window, the weight is the square of the variance of the depth values of the training-set samples in the window, and the depth value of a window is the number of base sequence fragments in a sample's sequencing data that can be aligned to the base sequence corresponding to that window; a second calculation module, configured to calculate the window sample depth of a specified training-set sample in each window; a third calculation module, configured to calculate the difference between the window sample depth and the window reference depth; a fourth calculation module, configured to multiply the difference by the weight to obtain the weighted difference of the window; a fifth calculation module, configured to combine variable numbers of windows to form different regions and to sum the weighted differences of all windows in a specified region to obtain a weighted difference sum; a numerical transformation module, configured to apply a numerical transformation to the weighted difference sum to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region; a sixth calculation module, configured to calculate, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples of the training set; and a feature data screening submodule, configured to screen the predetermined number of regions with the largest differences as the feature data.
Further, the feature data screening module further comprises a normalization module.
Further, after normalization, the window reference depths across the windows have a mean of 0 and a standard deviation of 1.
Further, the feature data screening submodule comprises: a first calculation unit, configured to calculate, for a specific region, the WFDD value of each sample among the benign nodule population samples and the malignant nodule population samples; a second calculation unit, configured to calculate the mean WFDD value of the benign nodule population samples and the mean WFDD value of the malignant nodule population samples separately; a third calculation unit, configured to calculate the discrimination value of the specific region according to the following formula; and a selection unit, configured to select the predetermined number of regions with the largest discrimination values as the feature data:
Figure PCTCN2022097450-appb-000009
where: t is the discrimination value of the specific region; the symbols shown in Figure PCTCN2022097450-appb-000010 and Figure PCTCN2022097450-appb-000011 are the mean WFDD values from the benign nodule population samples or the malignant nodule population samples, respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples or the malignant nodule population samples, respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples, respectively; provided that when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2, or when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
Further, the feature data screening submodule comprises an initial region generation unit and a region combination and splitting unit, wherein the initial region generation unit comprises: a window division element, configured to divide the genome into a series of windows; a window numbering element, configured to number the windows according to their order; a region numbering element, configured to number each region obtained by combining windows with the numbers of the windows it contains; and a window combination element, configured to combine the n consecutive windows starting at window i with another n consecutive windows located $2^{j}n$ windows downstream, to form an initial region:
$x_i = \{i,\ i+1,\ i+2,\ \ldots,\ i+n-1,\ i+2^{j}n,\ i+2^{j}n+1,\ i+2^{j}n+2,\ \ldots,\ i+2^{j}n+n-1\}$,
$i = 1, n, 2n, \ldots, N;\quad j = 1, 2, \ldots, 8;\quad i + (2^{j}+1)n \le N$
where n is the given number of consecutive windows and N is the total number of windows obtained by the division;
the region combination and splitting unit comprises: a first offspring selection element, configured to use a genetic algorithm in which two parents are combined to exchange information and produce offspring; wherein all initial regions are placed in a region pool and selected at random to produce offspring; wherein:
- the probability that region i is selected as one of the parents is:
Figure PCTCN2022097450-appb-000012
where N is the total number of windows obtained by the division and $t_i$ is the t value of the i-th window;
- after region x has been selected as the first parent, the probability that another region i is selected as the second parent is:
Figure PCTCN2022097450-appb-000013
where N is the total number of windows obtained by the division and $m_i$ is the mean of the window numbers contained in region i;
a second offspring selection element, configured to, after the parents have been selected, take the union of the windows contained in the two parents and randomly delete some of those windows to obtain the offspring, the random selection being sampling with replacement; and
a third offspring selection element, configured to, after an offspring is obtained, place the offspring in the region pool for the next round of selection, the parents not being deleted from the region pool in this operation, wherein the offspring produced by combining region $P_1$ and region $P_2$ is:
$\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p,\ P_1 \cup P_2)$,
where S(p, s) is the subset obtained by drawing a proportion p of the elements of the set s with replacement; and
an offspring repetition element, configured to repeat the process of producing offspring;
- calculating the discrimination value of each offspring produced according to the following formula, and selecting the predetermined number of regions with the largest discrimination values as the feature data:
Figure PCTCN2022097450-appb-000014
where: t is the discrimination value of the specific region; the symbols shown in Figure PCTCN2022097450-appb-000015 and Figure PCTCN2022097450-appb-000016 are the mean WFDD values from the benign nodule population samples or the malignant nodule population samples, respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples or the malignant nodule population samples, respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples, respectively; provided that when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2, or when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
Further, the number n of consecutive windows is 1-100, preferably 5-50, and more preferably 5.
Further, after the parents have been selected, the union of the windows contained in the two parents is taken and 1%-99%, more preferably 5%-50%, and further preferably 20% of the windows are randomly deleted, by sampling with replacement, to obtain the offspring.
Further, the process of producing offspring is repeated 1 to 1 million times, more preferably 100 to 100,000 times, and further preferably 300,000 times.
Further, the concatenated base sequence is divided into a series of windows with a length of 100 bp-100 kbp, preferably 10 kbp-50 kbp, and more preferably 30 kbp.
Further, the windows combined to form the different regions are contiguous or non-contiguous.
Further, the predetermined number of regions is 1-500, more preferably 10-100, and further preferably 50.
According to a fourth aspect of the present invention, a pulmonary nodule screening device is provided, comprising: a first calculation module, configured to calculate the sum of the weighted fragment distribution difference values of a sample to be tested over a selected predetermined number of regions to obtain a total WFDD value; an input module, configured to input the total WFDD value of the sample to be tested into a pulmonary nodule screening model constructed by the construction device according to the third aspect of the present invention; and an output module, configured to output the screening result for the sample to be tested; wherein the selected predetermined number of regions are the same as the predetermined number of regions in which the weighted fragment distribution difference (WFDD) values differ most between the benign nodule population samples and the malignant nodule population samples.
Further, the input module comprises: an input unit configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model; and a determination unit configured so that the pulmonary nodule screening model determines the pulmonary nodule type of the sample to be tested by comparing its total WFDD value with a predetermined threshold.
Further, the screening apparatus also comprises a predetermined-threshold acquisition module, which comprises: a first calculation unit configured to calculate, for each sample in the training set, the total WFDD value over the predetermined number of regions with the largest WFDD differences; and a second calculation unit configured to calculate an optimal cut-off point from the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal cut-off point being the predetermined threshold.
Further, the optimal cut-off point is calculated using the roc function of the pROC package for the R language.
According to a fifth aspect of the present invention, a computer-readable storage medium is provided. The storage medium includes a stored program which, when run, executes the construction method according to the first aspect of the present invention or the pulmonary nodule screening method according to the second aspect of the present invention.
According to a sixth aspect of the present invention, a processor is provided. The processor is configured to run a program which, when run, executes the method for constructing a pulmonary nodule screening model according to the first aspect of the present invention or the pulmonary nodule screening method according to the second aspect of the present invention.
By applying the technical solution of the present invention, a method was developed for distinguishing malignant tumors at a given site from other, non-cancerous diseases (such as nodules). The results show that the method performs significantly better than existing CT scanning and approaches the performance of invasive tissue biopsy. The method yields a product that can non-invasively determine the type (benign/malignant) of nodules in the human body, for example by using a blood test to determine whether a patient's pulmonary nodule is malignant, thereby avoiding invasive examination.
Description of the Drawings
The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Figure 1 shows a flow chart of the method for constructing a pulmonary nodule screening model according to the present invention.
Figure 2 shows the method for calculating the pattern weighted fragment distribution difference (Weighted Fragment Distribution Difference, WFDD) according to the present invention, in which: A shows how the weighted difference of a sample in a specified window is calculated; B shows how the cumulative value at the i-th window is calculated; and C shows how the WFDD of a sample in a specified region is calculated.
Figure 3 shows an example of a benign-related distribution pattern obtained by modeling according to an embodiment of the present invention.
Figure 4 shows an example of a malignant-related distribution pattern obtained by modeling according to an embodiment of the present invention.
Figure 5 shows experimental results obtained by modeling and predicting validation-set samples according to an embodiment of the present invention.
Figure 6 shows a flow chart of the pulmonary nodule screening method according to the present invention.
Figure 7 shows the apparatus for constructing a pulmonary nodule screening model according to the present invention.
Figure 8 shows the feature data screening module in the apparatus for constructing a pulmonary nodule screening model according to the present invention.
Figure 9 shows the feature data screening sub-module of the feature data screening module in the apparatus for constructing a pulmonary nodule screening model according to the present invention.
Figure 10 shows the feature data screening sub-module of the feature data screening module in the apparatus for constructing a pulmonary nodule screening model according to the present invention.
Figure 11 shows the pulmonary nodule screening apparatus according to the present invention.
Figure 12 shows the input module of the pulmonary nodule screening apparatus according to the present invention.
Detailed Description
The following description is presented to enable a person of ordinary skill in the art to make and use the various embodiments. Descriptions of specific apparatus, techniques, and applications are provided only as examples. Various modifications to the examples described herein will be apparent to those of ordinary skill in the art, and the general principles defined herein may be applied to other examples and applications without departing from the scope of the various embodiments. Accordingly, the various embodiments are not intended to be limited to the examples described and shown herein, but are to be accorded the scope consistent with the claims. It should be noted that, provided there is no conflict, the embodiments of this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
As described in the Background section, the value of cell-free DNA as a new diagnostic marker has been confirmed in a variety of solid tumors. Consequently, a growing number of studies have developed early cancer screening methods based on whole-genome sequencing or methylation sequencing of cell-free DNA. Although such methods have proliferated, significant problems remain. First, prior-art methods can do little more than distinguish cancer patients from healthy individuals; they cannot reliably distinguish a tumor at a given site from other, non-cancerous diseases at the same site. In particular, patients with inflammation may be misclassified as cancer patients. Second, existing non-invasive methods for diagnosing pulmonary nodules based on tumor-specific ctDNA in patient plasma have low accuracy and cannot yet meet clinical requirements.
In view of these problems, conventional early-screening approaches need to be adapted so that malignant tumors at a given site can be distinguished from other, non-cancerous diseases.
Referring now to Figure 1, a flow chart of the method for constructing a pulmonary nodule screening model according to the present invention is shown. The method according to an embodiment of the present invention builds a model from the distribution of DNA fragment sequence reads from next-generation sequencing data on the reference genome, in order to distinguish nodules of different types (benign/malignant). The literature (Maternal Malignancies Detected With Noninvasive Prenatal Testing: Reply. JAMA. 2015 Nov 24;314(20):2192-3. DOI: 10.1001/jama.2015.12922) has reported the detection of copy number variations (CNVs) in sequencing data of blood cfDNA from cancer patients, which means that in the regions where CNVs occur, the DNA fragment distribution of cancer patients differs from a baseline representing healthy individuals. We hypothesize that patients with benign nodules and patients with malignant nodules differ in DNA fragment distribution in certain regions. To better characterize the distribution of cfDNA fragments within a region, we propose the concept of the weighted fragment distribution difference (Weighted Fragment Distribution Difference, WFDD). CNV analysis considers only the difference between a sample's total fragment count in a region and the baseline, whereas WFDD captures the details of how the fragment distribution differs throughout the region. We constructed a pulmonary nodule screening model to describe these details. Specifically, the method for constructing the pulmonary nodule screening model includes:
In step 1 of Figure 1, the sequence data of all autosomes in the reference genome are first concatenated and then divided into a series of windows of a fixed length. The fixed length is in the range of 100 bp-100 kbp, preferably 10 kbp-50 kbp, and more preferably 30 kbp. Because the autosomes are concatenated before being divided, some windows may span two chromosomes. To exclude possible interference from sex, sex-chromosome data are not used.
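The windowing step can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `make_windows`, the dictionary of chromosome lengths, and the choice to keep a shorter trailing window are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# One window is a list of (chromosome, start, end) pieces with half-open
# coordinates; a window that spans a chromosome boundary has several pieces.
Window = List[Tuple[str, int, int]]

def make_windows(chrom_lengths: Dict[str, int], window_size: int = 30_000) -> List[Window]:
    """Cut the concatenated chromosomes into fixed-length windows."""
    windows: List[Window] = []
    current: Window = []
    room = window_size                      # bases still missing in the current window
    for chrom, length in chrom_lengths.items():
        pos = 0
        while pos < length:
            take = min(room, length - pos)  # fill the window or exhaust the chromosome
            current.append((chrom, pos, pos + take))
            pos += take
            room -= take
            if room == 0:                   # window complete, start a new one
                windows.append(current)
                current, room = [], window_size
    if current:                             # assumption: keep the shorter last window
        windows.append(current)
    return windows

# Toy lengths (hypothetical) so the boundary-spanning window is easy to see.
toy = make_windows({"chr1": 70, "chr2": 50}, window_size=30)
```

In the toy call, the third window consists of the last 10 bases of `chr1` and the first 20 bases of `chr2`, which is the cross-chromosome case mentioned above.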
Steps 2 to 6 of Figure 1 show the method for calculating the pattern weighted fragment distribution difference (Weighted Fragment Distribution Difference, WFDD) according to an embodiment of the present invention. In step 2 of Figure 1, a window reference depth and a weight are calculated for each window: the mean of the depth values of the training-set samples in the window (each sample contributes one depth value) is taken as the window reference depth of that window, and the square of the variance of these depth values is taken as the weight of that window. Next, the window sample depth in each window is calculated for a specified sample in the training set. In a preferred embodiment, the window reference depths of the windows in a specified region are normalized so that their mean is 0 and their standard deviation is 1, giving the normalized window reference depths. The reason is that WFDD is concerned only with depth differences at the window level; to eliminate the effect on WFDD of differences between samples in total depth within the region, the reference depths of the windows contained in a region must be normalized before WFDD is calculated. The depths of the specified sample in these windows are normalized in the same way, giving the normalized window sample depths.
In step 3 of Figure 1, the difference between the window sample depth and the window reference depth is calculated. In the preferred embodiment, the difference between the normalized window sample depth and the normalized window reference depth is calculated. In step 4 of Figure 1, the difference is multiplied by the weight to obtain the weighted difference of the window. In the preferred embodiment with normalization, it is the difference between the normalized window sample depth and the normalized window reference depth that is multiplied by the weight. Clearly, windows whose depth fluctuates more across samples have larger weights. Multiplying the difference between the sample depth in a window and the reference depth by the weight amplifies the difference, that is, it amplifies the distribution-difference signal.
In step 5 of Figure 1, varying numbers of windows are combined to form different regions, and the weighted differences of all windows in a specified region are summed; the sum of the weighted differences (that is, the final cumulative value of the summation) is then numerically transformed to give the weighted fragment distribution difference (WFDD) of the specified sample in the specified region. In step 6 of Figure 1, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples in the training set is calculated for each region, and a predetermined number of regions with the largest differences are selected as feature data. In one embodiment, the number of selected regions is 1-500, preferably 10-100, and more preferably 50; the feature data are then used to construct the pulmonary nodule screening model.
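Steps 2 to 5 up to the summed weighted difference can be sketched as below. The function names, the toy depth matrix, and the use of population (rather than sample) variance and standard deviation are assumptions for illustration; the final numerical transformation f(x) is given only as a figure in the source and is therefore not implemented here.

```python
from statistics import mean, pstdev, pvariance

def zscore(values):
    """Scale values to mean 0 and standard deviation 1 (population SD assumed)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]   # fails if all values are equal (s == 0)

def window_reference_and_weight(training_depths):
    """training_depths: one list of per-window depths for each training sample."""
    columns = list(zip(*training_depths))          # depths of all samples, per window
    reference = [mean(c) for c in columns]         # window reference depth
    weight = [pvariance(c) ** 2 for c in columns]  # square of the variance, as in the text
    return reference, weight

def weighted_differences(sample_depths, reference, weight, region):
    """Weighted difference of one sample in each window of a region (0-based indices)."""
    xs = zscore([sample_depths[i] for i in region])  # normalized within the region only
    rs = zscore([reference[i] for i in region])
    return [(x - r) * weight[i] for x, r, i in zip(xs, rs, region)]

# Hypothetical depths: 3 training samples, 4 windows.
training = [[10, 20, 30, 40], [12, 18, 33, 41], [8, 22, 27, 39]]
reference, weight = window_reference_and_weight(training)
region = [0, 1, 3]                                   # windows need not be adjacent
scaled = weighted_differences([20, 40, 60, 80], reference, weight, region)
shifted = weighted_differences([10, 30, 30, 40], reference, weight, region)
```

The `scaled` sample has exactly twice the reference depth in every window, so its weighted differences are all zero after normalization; this is the "total depth" effect the normalization is meant to remove. The `shifted` sample has a different shape across the region and gives a non-zero sum.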
Referring now to Figure 2, an example of the method for calculating the pattern weighted fragment distribution difference according to one embodiment of the present invention is shown. Panel A of Figure 2 shows an example of how the weighted difference of a sample in a specified window is calculated. In this example, the mean of the depth values of the training-set samples in the window (each sample contributes one depth value) is taken as the window reference depth of that window, and the square of the variance of these depth values is taken as the weight of that window. The window sample depth in each window is calculated for a specified sample in the training set. The window reference depths of the windows in a specified region are normalized so that, within the same region, their mean is 0 and their standard deviation is 1, giving the normalized window reference depths. The depths of the specified sample in these windows are normalized in the same way, giving the normalized window sample depths. The difference between the normalized window sample depth and the normalized window reference depth is calculated, and this difference is multiplied by the weight to obtain the weighted difference of the window. Panel B of Figure 2 shows an example of how the cumulative value at the i-th window is calculated. Here, the varying number of windows combined to form a region may be non-contiguous (non-adjacent). Note that when calculating the WFDD of a sample in a region, the sample depths and reference depths are normalized using only the windows of that region; that is, the window reference depths are required to have a mean of 0 and a standard deviation of 1 within the same region, not across all windows of the genome. Panel C of Figure 2 shows the formula for numerically transforming the final cumulative value of the summation: WFDD = f(x) or WFDD = -1*f(x); the specific forms of f(x) in the different cases are shown in Figures 3 and 4, respectively.
More simply (without describing it in terms of cumulative values), in one embodiment, the empirical formula for calculating the WFDD of a sample in a specified region is:
Figure PCTCN2022097450-appb-000017
When the region is a benign-related region: WFDD = -1*f(x)
When the region is a malignant-related region: WFDD = f(x)
where:
$x'_i$ is the normalized window sample depth of the specified training-set sample in the i-th window,
Figure PCTCN2022097450-appb-000018
is the normalized window reference depth of the i-th window, and
$\sigma_i$ is the variance of the depth values of the specified training-set samples in the i-th window.
Referring now to Figure 3, an example of a benign-related distribution pattern obtained by modeling according to an embodiment of the present invention is shown. For a given region, if the WFDD of samples from the benign pulmonary nodule population fluctuates more than that of samples from the malignant pulmonary nodule population, the fragment distribution pattern of that region is called a benign-related pattern. In Figure 3, the region containing the pattern comprises 53 windows, and each line represents one sample. The left panel shows the weighted differences in these windows for cfDNA samples from 20 randomly selected patients each with benign and malignant pulmonary nodules (40 samples in total); the middle panel shows the cumulative weighted difference at each window; the right panel shows the function used to transform the final cumulative value of each sample into the WFDD under the benign-related pattern, together with the results (box plot).
Referring now to Figure 4, an example of a malignant-related distribution pattern obtained by modeling according to an embodiment of the present invention is shown. If the WFDD of samples from the malignant pulmonary nodule population fluctuates more than that of samples from the benign pulmonary nodule population, the fragment distribution pattern of that region is called a malignant-related pattern. In Figure 4, the region containing the pattern comprises 231 windows; the parts of the figure have the same meaning as in (B), but the samples are not exactly the same. The left panel shows the weighted differences in these windows for cfDNA samples from 20 randomly selected patients each with benign and malignant pulmonary nodules (40 samples in total); the middle panel shows the cumulative weighted difference at each window; the right panel shows the function used to numerically transform the final cumulative value of each sample into the WFDD under the benign-related pattern, together with the results (box plot).
In one embodiment, a set of values (for example, the depth values of one sample in multiple windows) can be normalized using the following formula:
$$x' = \frac{x - \bar{x}}{S}$$
where $S$ is the standard deviation of the set of values and $\bar{x}$ is its mean.
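As a small worked example of this normalization, the sketch below rescales a set of hypothetical window depths and checks that the result has mean 0 and standard deviation 1. Whether the population or the sample standard deviation is intended is not stated in the text; the population form is assumed here, and the function name `normalize` is illustrative.

```python
from statistics import mean, pstdev

def normalize(values):
    """Return (value - mean) / standard deviation for every value."""
    m = mean(values)
    s = pstdev(values)          # assumption: population SD; stdev() gives the sample SD
    if s == 0:
        raise ValueError("all values are equal; normalization is undefined")
    return [(v - m) / s for v in values]

depths = [120, 95, 130, 110, 145]   # hypothetical depths of one sample in five windows
z = normalize(depths)
```

With the population standard deviation, the normalized values have a standard deviation of exactly 1 (up to rounding); with the sample standard deviation they would be slightly smaller in magnitude.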
As described above, the model is constructed from a number of regions in which the WFDD values of the two classes of samples differ most. To assess how much the WFDD of the two classes differs in a specific region, a discrimination value is calculated for the region from the WFDD values of the two classes of samples, as a measure of the region's ability to distinguish them. Specifically: for a specific region, the WFDD value of each sample in the benign nodule population samples and in the malignant nodule population samples is calculated; the mean WFDD value of the benign nodule population samples and the mean WFDD value of the malignant nodule population samples are calculated separately; and the discrimination value of the specific region is calculated, after which the predetermined number of regions with the largest discrimination values are selected as feature data. In one embodiment, the formula for calculating the discrimination value of a specific region is:
Figure PCTCN2022097450-appb-000021
where:
t is the discrimination value of the specific region;
Figure PCTCN2022097450-appb-000022
and
Figure PCTCN2022097450-appb-000023
are the mean WFDD values of the benign nodule population samples or the malignant nodule population samples, respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples or the malignant nodule population samples, respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples, respectively; provided that when the benign nodule population corresponds to subscript 1, the malignant nodule population corresponds to subscript 2, or when the benign nodule population corresponds to subscript 2, the malignant nodule population corresponds to subscript 1. A higher t value indicates that the WFDD values of the two groups of samples differ more in this region, that is, the region has higher discriminating power.
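The exact formula for t survives only as a figure. Since it is built from two group means, two standard deviations, and two sample counts, one plausible reading is a two-sample t statistic. The sketch below uses the Welch (unequal-variance) form purely as an assumption, ranks regions by the absolute value of t (also an assumption), and uses hypothetical function names and data.

```python
from math import sqrt
from statistics import mean, stdev

def discrimination_value(group1, group2):
    """Welch-style t statistic (assumed form, not confirmed by the source)."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)          # sample standard deviations
    return (mean(group1) - mean(group2)) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def top_regions(wfdd_benign, wfdd_malignant, k):
    """Return the k region ids whose |t| is largest.

    Both arguments map a region id to the list of WFDD values of that group.
    """
    scores = {
        region: abs(discrimination_value(wfdd_benign[region], wfdd_malignant[region]))
        for region in wfdd_benign
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical WFDD values: region A separates the groups far better than region B.
benign = {"A": [1.0, 1.1, 0.9, 1.0], "B": [1.0, 2.0, 3.0, 4.0]}
malignant = {"A": [2.0, 2.1, 1.9, 2.0], "B": [1.5, 2.5, 3.5, 4.5]}
selected = top_regions(benign, malignant, k=1)
```

Taking the absolute value makes the ranking independent of which group is assigned subscript 1, which matches the proviso that either assignment is allowed.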
When the divided windows are combined in part into different regions, the number of regions that can in theory be formed exceeds what existing computers can process. To find potential high-discrimination windows more efficiently, in one embodiment an improved genetic algorithm can be used for the search. The improved genetic algorithm randomly merges and splits a series of regions obtained by a simple strategy (the initial regions) and steers the search toward regions with larger t values. Specifically, the steps for searching for potential high-discrimination windows include:
a.) Generating the initial regions:
Divide the genome into a series of windows: suppose the genome has been divided into N windows, each denoted by its ordinal number (e.g., 1, 2, …, N); a region formed by combining windows is denoted by a set of window numbers (e.g., {1, 2, 3, 10, 11, 12}). The n consecutive windows starting at window i are combined with another n consecutive windows located $2^{j}n$ windows downstream, so that 2n windows in total form one initial region. That is:
$$x_i = \{\, i,\ i+1,\ i+2,\ \ldots,\ i+n-1,\ i+2^{j}n,\ i+2^{j}n+1,\ i+2^{j}n+2,\ \ldots,\ i+2^{j}n+n-1 \,\}$$
$$i = 1, n, 2n, \ldots, N;\quad j = 1, 2, \ldots, 8;\quad i + (2^{i+1})\,n \le N$$
where n is the given number of consecutive windows, with a value in the range of 1-100, preferably 5-50, and more preferably 5, and N is the total number of windows. The two runs of consecutive windows are required to be separated by a certain distance so that a region can span long distances.
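The construction of the initial regions can be sketched as follows. The function name is hypothetical, and the boundary rule used here (the last window of the second run must not exceed N) is an assumption chosen for the example rather than the constraint printed above.

```python
def initial_regions(total_windows, run_length=5, max_j=8):
    """Build initial regions from two runs of `run_length` consecutive windows.

    Windows are numbered 1..total_windows. For each start i and each j, the
    second run begins 2**j * run_length windows downstream of i.
    """
    starts = sorted({1, *range(run_length, total_windows + 1, run_length)})  # i = 1, n, 2n, ...
    regions = []
    for i in starts:
        for j in range(1, max_j + 1):
            offset = (2 ** j) * run_length
            last = i + offset + run_length - 1
            if last > total_windows:        # assumed boundary rule: stay inside 1..N
                break
            first_run = list(range(i, i + run_length))
            second_run = list(range(i + offset, i + offset + run_length))
            regions.append(first_run + second_run)
    return regions

regions = initial_regions(total_windows=100, run_length=5)
```

Because the offset is at least twice the run length, the two runs never overlap, so every initial region contains exactly 2n distinct windows.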
b.) Combining and splitting regions:
Using the genetic algorithm, two parents are combined to exchange information and produce offspring. All initial regions are placed in a region pool, from which parents are randomly selected to produce offspring. Here, the probability that region i is selected as one of the parents is:
Figure PCTCN2022097450-appb-000024
where N is the total number of windows.
Once region x has been selected as the first parent, the probability that another region i is selected as the second parent is:
Figure PCTCN2022097450-appb-000025
where N is the total number of windows and $m_i$ is the mean of the window numbers contained in region i. The $|m_x - m_i|$ term is included so that regions close to the first parent are preferentially selected as the other parent.
After the parents have been selected, the union of the windows contained in the two parents is taken and a number of windows are randomly deleted from it to form the offspring. In one embodiment, the proportion of windows randomly deleted is 1%-99%, preferably 5%-50%, and more preferably 20%, drawn by sampling with replacement. The offspring is then placed in the region pool for the next round of selection; the parents are not deleted in this operation. The offspring produced by combining region $P_1$ and region $P_2$ is:
$$\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p,\ P_1 \cup P_2)$$
where $S(p, s)$ is the subset obtained by drawing a proportion p of the elements of the set s with replacement.
The process of producing offspring is repeated. In one embodiment, the number of repetitions is in the range of 1 to 1 million; in a preferred embodiment, 100 to 100,000; and in a more preferred embodiment, 300,000. Finally, a number of regions with the highest discriminating power are selected as features for constructing the model and making predictions.
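The offspring operation child(P1, P2) can be sketched as below. The parent-selection probabilities are given only as figures in the source and are not reproduced; the function name, the rounding of the number of draws, and the use of a seeded random generator are assumptions for the example.

```python
import random

def make_child(parent1, parent2, drop_fraction=0.2, rng=random):
    """Union of two parent regions minus a random subset drawn with replacement."""
    union = sorted(set(parent1) | set(parent2))
    draws = round(drop_fraction * len(union))       # number of draws (rounding assumed)
    # Drawing with replacement can pick the same window twice, so the number of
    # distinct windows removed may be smaller than the number of draws.
    dropped = set(rng.choices(union, k=draws)) if draws > 0 else set()
    return [w for w in union if w not in dropped]

rng = random.Random(0)                              # seeded only to make the demo repeatable
child = make_child([1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9, 10], drop_fraction=0.2, rng=rng)
unchanged = make_child([1, 2], [2, 3], drop_fraction=0.0)
```

In a full search loop, each child would be added to the region pool without removing its parents, and the pool would finally be ranked by discrimination value.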
Referring now to Figure 6, a flow chart of the pulmonary nodule screening method according to the present invention is shown. The pulmonary nodule screening method includes: selecting a certain number of regions; to determine the type of a specific sample from these regions, calculating the WFDD value of the sample in each of these regions and summing them to obtain a total WFDD value; and then comparing the total WFDD value with a predetermined threshold and determining the type of the sample according to whether the value is greater than or less than the threshold.
As used herein, the threshold is calculated from the samples of the training set. The total WFDD value of each training-set sample over these regions is calculated first, and an optimal cut-off point is then calculated from the total WFDD values of the benign nodule population and the malignant nodule population in the training set; this cut-off point is the required threshold. The optimal cut-off point can be calculated using the roc function (from the pROC package for the R language).
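The screening step and the threshold search can be sketched in Python as follows. The source computes the cut-off with pROC in R and does not state the optimality criterion; the sketch assumes the Youden index (sensitivity + specificity - 1) and assumes that a total WFDD above the threshold indicates a malignant nodule. All names and values are hypothetical.

```python
def total_wfdd(wfdd_by_region, selected_regions):
    """Sum a sample's WFDD values over the selected regions."""
    return sum(wfdd_by_region[r] for r in selected_regions)

def best_cutoff(benign_totals, malignant_totals):
    """Cut-off maximizing the Youden index (criterion assumed, see lead-in)."""
    values = sorted(set(benign_totals) | set(malignant_totals))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]  # midpoints
    best_threshold, best_j = None, -1.0
    for t in candidates:
        sensitivity = sum(v > t for v in malignant_totals) / len(malignant_totals)
        specificity = sum(v <= t for v in benign_totals) / len(benign_totals)
        j = sensitivity + specificity - 1
        if j > best_j:                       # on ties the first (lowest) cut-off is kept
            best_threshold, best_j = t, j
    return best_threshold

def classify(total, threshold):
    """Assumed direction: larger totals indicate a malignant nodule."""
    return "malignant" if total > threshold else "benign"

# Hypothetical training totals and one test sample.
threshold = best_cutoff([1, 4, 5, 9], [8, 12, 15, 20])
sample_total = total_wfdd({"r1": 6.0, "r2": -1.5, "r3": 4.0}, ["r1", "r3"])
label = classify(sample_total, threshold)
```

If the direction were reversed for a given feature set, only the comparisons in `best_cutoff` and `classify` would need to be flipped.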
Referring now to Figure 7, the apparatus for constructing a pulmonary nodule screening model according to the present invention is shown. Specifically, the construction apparatus of the present invention comprises: a feature data screening module configured to screen, across the whole human reference genome, a predetermined number of regions in which the weighted fragment distribution difference (WFDD) values differ most between the benign nodule population samples and the malignant nodule population samples in the training set, as feature data; and a construction module configured to construct the pulmonary nodule screening model using the feature data.
Referring now to FIG. 8, the feature data screening module of the device for constructing a pulmonary nodule screening model according to the present invention is shown. Specifically, the feature data screening module includes: a window division module, configured to concatenate the base sequences of all autosomes in the human reference genome and divide the concatenated sequence into a series of windows of fixed length, each window corresponding to one segment of base sequence; a first calculation module, configured to calculate a window reference depth and a weight for each window, wherein the window reference depth is the mean of the depth values of the training-set samples in that window, the weight is the square of the variance of the depth values of the training-set samples in that window, and the depth value of a window is the number of sequence fragments in a sample's sequencing data that align to the base sequence corresponding to that window; a second calculation module, configured to calculate the window sample depth of a specified training-set sample in each window; a third calculation module, configured to calculate the difference between the window sample depth and the window reference depth; a fourth calculation module, configured to multiply the difference by the weight to obtain the weighted difference of the window; a fifth calculation module, configured to combine a variable number of windows into different regions and to sum the weighted differences of all windows in a specified region to obtain a total weighted difference; a numerical transformation module, configured to apply a numerical transformation to the total weighted difference to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region; a sixth calculation module, configured to calculate, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples in the training set; and a feature data screening sub-module, configured to select a predetermined number of regions with the largest differences as the feature data. Optionally, as indicated by the dashed line in FIG. 8, the feature data screening module further includes a normalization module, configured to normalize the window reference depth and the window sample depth. Optionally, after normalization, the window reference depths of the windows have a mean of 0 and a standard deviation of 1.
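The chain of calculation modules in FIG. 8 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, the population variance is an assumed choice, and because the text does not specify the numerical transformation, it is passed in as a parameter that defaults to the identity.

```python
from statistics import mean, pvariance


def window_baseline_and_weight(depths_by_sample):
    """Return (reference depth, weight) per window.

    depths_by_sample: list of per-sample lists, one depth value per window.
    """
    n_windows = len(depths_by_sample[0])
    baseline, weight = [], []
    for w in range(n_windows):
        column = [sample[w] for sample in depths_by_sample]
        baseline.append(mean(column))
        # The text defines the weight as the square of the variance.
        # Population variance is an assumption of this sketch.
        weight.append(pvariance(column) ** 2)
    return baseline, weight


def wfdd(sample_depths, baseline, weight, region, transform=lambda x: x):
    """Sum the weighted differences over the windows of `region`, then transform.

    `transform` stands in for the unspecified numerical transformation.
    """
    total = sum((sample_depths[w] - baseline[w]) * weight[w] for w in region)
    return transform(total)
```

A region is represented here simply as a collection of window indices, which allows it to be contiguous or non-contiguous.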
Referring now to FIG. 9, the feature data screening sub-module of the feature data screening module in the device for constructing a pulmonary nodule screening model according to the present invention is shown. Specifically, the feature data screening sub-module includes: a first calculation unit, configured to calculate, for a specific region, the WFDD value of each sample among the benign nodule population samples and the malignant nodule population samples; a second calculation unit, configured to calculate separately the mean WFDD value of the benign nodule population samples and the mean WFDD value of the malignant nodule population samples; a third calculation unit, configured to calculate the discrimination value of the specific region according to the following formula; and a selection unit, configured to select a predetermined number of regions with the largest discrimination values as the feature data:
Figure PCTCN2022097450-appb-000026
where: $t$ is the discrimination value of the specific region;
Figure PCTCN2022097450-appb-000027
and
Figure PCTCN2022097450-appb-000028
are the mean WFDD values of the benign nodule population samples and of the malignant nodule population samples, respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples and from the malignant nodule population samples, respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples and from the malignant nodule population samples, respectively; provided that when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2, and when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
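The formula itself is available here only as an image placeholder, so its exact form cannot be confirmed from this text. The quantities it is defined over (two group means, two group sizes, two standard deviations) are those of a two-sample t statistic; the sketch below assumes the unequal-variance (Welch) form, which is an assumption of this example and not a statement of the patent's formula. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def discrimination_value(group1, group2):
    """Two-sample t statistic under the assumed Welch form.

    group1, group2: WFDD values of one region for the two nodule populations.
    """
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)  # sample standard deviations
    return (mean(group1) - mean(group2)) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
```

Swapping the two groups only flips the sign of the result, which matches the proviso that either population may carry subscript 1.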
Referring now to FIG. 10, the feature data screening sub-module of the feature data screening module in the device for constructing a pulmonary nodule screening model according to the present invention is shown. The feature data screening sub-module includes an initial region generation unit and a region combination and splitting unit. The initial region generation unit includes: a window division element, configured to divide the genome into a series of windows; a window numbering element, configured to number the windows according to their order; a region numbering element, configured to identify each region by the numbers of the windows combined to form it; and a window combination element, configured to combine the $n$ consecutive windows starting at window $i$ with another $n$ consecutive windows located $2^{j}n$ windows downstream, forming an initial region: $x_i = \{i, i+1, i+2, \ldots, i+n-1, i+2^{j}n, i+2^{j}n+1, i+2^{j}n+2, \ldots, i+2^{j}n+n-1\}$; $i = 1, n, 2n, \ldots, N$; $j = 1, 2, \ldots, 8$; $i + (2^{j}+1)n \le N$; where $n$ is the given number of consecutive windows and $N$ is the total number of windows. The region combination and splitting unit includes: a first offspring selection element, configured to use a genetic algorithm in which two parents are combined to exchange information and produce offspring, wherein all initial regions are placed in a region pool and parents are picked from it at random to produce offspring, wherein:
- the probability that region $i$ is selected as one of the parents is:
Figure PCTCN2022097450-appb-000029
where $N$ is the total number of windows and $t_i$ is the $t$ value of the $i$-th window;
- once region $x$ has been selected as the first parent, the probability that another region $i$ is selected as the second parent is:
Figure PCTCN2022097450-appb-000030
where $N$ is the total number of windows and $m_i$ is the mean of the window numbers contained in region $i$; a second offspring selection element, configured to, after the parents are selected, take the union of the windows of the two parents and randomly delete some of those windows to form the offspring, the random selection being sampling with replacement; a third offspring selection element, configured to, once the offspring is obtained, put it into the region pool for the next round of selection, the parents not being removed from the region pool in this operation, wherein the offspring produced by combining region $P_1$ and region $P_2$ is: $\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p, P_1 \cup P_2)$, where $S(p, s)$ is the subset obtained by drawing a proportion $p$ of the elements of set $s$ with replacement; and an offspring repetition element, configured to repeat the offspring generation process;
- the discrimination value of each generated offspring is calculated according to the following formula, and a predetermined number of regions with the largest discrimination values are selected as the feature data:
Figure PCTCN2022097450-appb-000031
where: $t$ is the discrimination value of the specific region;
Figure PCTCN2022097450-appb-000032
and
Figure PCTCN2022097450-appb-000033
are the mean WFDD values of the benign nodule population samples and of the malignant nodule population samples, respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples and from the malignant nodule population samples, respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples and from the malignant nodule population samples, respectively; provided that when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2, and when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
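Two steps of the region search are specified fully enough to sketch: building an initial region from window number i, and producing an offspring from two parents. The Python below is an illustrative sketch with hypothetical function names. The parent selection probabilities are omitted because their formulas are available only as image placeholders, and rounding p times the size of the union to an integer number of draws is an assumption.

```python
import random


def initial_region(i, j, n):
    """Window numbers of the initial region x_i.

    n consecutive windows starting at i, plus n consecutive windows
    starting 2**j * n windows downstream.
    """
    offset = i + (2 ** j) * n
    return set(range(i, i + n)) | set(range(offset, offset + n))


def make_child(p1, p2, p, rng=random):
    """child(P1, P2) = P1 ∪ P2 - S(p, P1 ∪ P2).

    S is built by drawing round(p * |P1 ∪ P2|) elements with replacement,
    so it can contain fewer distinct windows than the number of draws.
    """
    union = sorted(p1 | p2)
    draws = round(p * len(union))  # rounding rule is an assumption
    removed = {rng.choice(union) for _ in range(draws)}
    return set(union) - removed
```

Because the draws are made with replacement, the offspring loses at most, and possibly fewer than, the nominal proportion p of its windows.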
Referring now to FIG. 11, a pulmonary nodule screening device according to the present invention is shown. Specifically, the screening device includes: a first calculation module, configured to calculate the sum of the WFDD values of a sample to be tested over a selected predetermined number of regions to obtain a total WFDD value; an input module, configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model of the present invention; and an output module, configured to output the screening result for the sample to be tested; wherein the selected predetermined number of regions are the same as the predetermined number of regions in which the WFDD values differ most between the benign nodule population samples and the malignant nodule population samples.
Referring now to FIG. 12, the input module of the pulmonary nodule screening device according to the present invention is shown. Specifically, the input module includes: an input unit, configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model; and a determination unit, configured such that the pulmonary nodule screening model determines the pulmonary nodule type of the sample to be tested by comparing its total WFDD value with a predetermined threshold. As shown by the dashed box in FIG. 12, in some embodiments the screening device further includes a predetermined threshold acquisition module, which includes: a first calculation unit, configured to calculate, for each sample in the training set, the total WFDD value over the predetermined number of regions with the largest WFDD differences; and a second calculation unit, configured to calculate an optimal cut-off point from the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal cut-off point being the predetermined threshold; more preferably, the optimal cut-off point is calculated using the roc function of the pROC package for the R language.
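The text names pROC for the optimal cut-off point but does not state the criterion used. As a rough illustration only, the Python sketch below picks the threshold that maximizes Youden's J (sensitivity + specificity - 1), and assumes that a higher total WFDD value indicates a malignant nodule; both choices, and the function name, are assumptions of this example.

```python
def best_cutoff(benign_scores, malignant_scores):
    """Return (threshold, J) maximizing Youden's J over candidate thresholds."""
    values = sorted(set(benign_scores) | set(malignant_scores))
    # Candidate thresholds: midpoints between consecutive observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best, best_j = None, float("-inf")
    for c in candidates:
        # Assumption: scores above the threshold are called malignant.
        sensitivity = sum(s > c for s in malignant_scores) / len(malignant_scores)
        specificity = sum(s <= c for s in benign_scores) / len(benign_scores)
        j = sensitivity + specificity - 1
        if j > best_j:
            best, best_j = c, j
    return best, best_j
```

If all scores are identical there are no candidate thresholds and the function returns `None` for the threshold.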
An embodiment of the present invention provides a computer-readable storage medium on which a program is stored; when run, the program executes the method for constructing a pulmonary nodule screening model of the present invention or the pulmonary nodule screening method of the present invention.
An embodiment of the present invention provides a processor configured to run a program, wherein the program, when run, executes the method for constructing a pulmonary nodule screening model of the present invention or the pulmonary nodule screening method of the present invention. The present invention is described in further detail below with reference to specific examples, which shall not be construed as limiting the scope of protection claimed.
Example 1: Constructing a pulmonary nodule screening model
Blood samples were collected from 639 untreated patients with pulmonary nodules, comprising 484 patients with malignant nodules (85% of them stage I) and 155 patients with benign nodules. The malignant group included only patients with non-small cell lung cancer. All patients were anonymized, and all consented to the use of their samples in clinical research.
Whole blood was collected in EDTA tubes and processed immediately; if immediate processing was not possible, it was stored at 4°C for no more than 1 day. Plasma was separated from cellular components by centrifugation at 1,600 g for 10 minutes at 4°C. The plasma was then centrifuged at 16,000 g for 10 minutes at 4°C to remove any residual cellular material and stored at -80°C until use.
cfDNA was extracted from 200 µL of plasma using the MagPure Circulating DNA KF Kit (Magen). Standard next-generation sequencing libraries were prepared from the cfDNA using the MGIEasy Cell-free DNA Library Prep Set (MGI) and sequenced on the MGISEQ-2000 platform, yielding whole-genome sequencing data at approximately 0.5-1.0x depth per sample.
The sequencing data were processed with Sentieon software (alignment, sorting and deduplication). The readCounter software was then used to count, for each sample, the number of reads aligned to each 1 kbp region of the autosomes, i.e. the read depth per kb, and every 30 consecutive depth values were summed to give the depth over a 30 kbp range. In this example, 30 kbp is the length of one window. readCounter was not run directly in 30 kbp units, so that depth could be computed at as many sites as possible and less information would be lost by discarding windows of insufficient length. The same procedure was applied to every sample, yielding a depth matrix of 95,833 rows (windows) by 639 columns (samples).
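The aggregation from 1 kbp counts to 30 kbp windows can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' script: the function name is hypothetical, the input is assumed to be one list of per-kb read counts per autosome in chromosome order, and keeping the final incomplete window is an assumption.

```python
def window_depths(per_kb_counts_by_chrom, bins_per_window=30):
    """Sum consecutive per-kb counts into fixed-size windows.

    The autosomes are concatenated first, so a window may span the
    boundary between two chromosomes instead of being discarded.
    """
    concatenated = [c for chrom in per_kb_counts_by_chrom for c in chrom]
    return [
        sum(concatenated[start:start + bins_per_window])
        for start in range(0, len(concatenated), bins_per_window)
    ]
```

Applying the function to each sample and stacking the results column by column gives a windows-by-samples depth matrix of the kind described above.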
For the validation set, 31 samples and 30 samples were randomly drawn from the malignant nodule sample set and the benign nodule sample set, respectively; the remaining samples were used as the training set for extracting the relevant feature regions. In this example, the 10 benign-associated features and the 10 malignant-associated features with the highest discrimination were extracted to construct the model, which was then used to predict the validation-set samples. The prediction results are shown in FIG. 5.
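A per-class random split of this kind can be sketched as below. The helper name, the integer sample identifiers and the fixed seed are illustrative assumptions; the text does not say how the random draw was performed.

```python
import random


def split_validation(sample_ids, n_validation, seed=0):
    """Randomly hold out n_validation samples; the rest form the training set."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    validation = set(rng.sample(sample_ids, n_validation))
    training = [s for s in sample_ids if s not in validation]
    return training, sorted(validation)


# Class sizes from the text: 484 malignant and 155 benign samples.
malignant_train, malignant_val = split_validation(list(range(484)), 31)
benign_train, benign_val = split_validation(list(range(155)), 30)
```

With the class sizes given in the text, this leaves 453 malignant and 125 benign samples for training.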
The results show that the model distinguishes the two classes of validation-set samples with an AUC of approximately 0.954 (95% CI: 0.908-1.000), indicating excellent performance in determining a patient's pulmonary nodule type from the patient's blood cfDNA.
In addition, using the following formulas:
Specificity = true negatives / (true negatives + false positives) × 100% (the proportion of non-patients correctly identified);
Sensitivity = true positives / (true positives + false negatives) × 100% (the proportion of patients correctly identified);
the specificity and the sensitivity of the method of the present invention were both found to be approximately 0.8, giving a specificity × sensitivity product of approximately 0.64.
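The two ratios and their product can be checked with a few lines of Python. The confusion-matrix counts in the usage lines are hypothetical values chosen only to give a ratio of 0.8; they are not the study's actual counts, and the ratios are returned as fractions rather than percentages.

```python
def specificity(true_negatives, false_positives):
    """Proportion of non-patients correctly identified."""
    return true_negatives / (true_negatives + false_positives)


def sensitivity(true_positives, false_negatives):
    """Proportion of patients correctly identified."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts for illustration only.
spec = specificity(true_negatives=8, false_positives=2)
sens = sensitivity(true_positives=8, false_negatives=2)
product = round(spec * sens, 2)
```

With both ratios at 0.8, the product rounds to 0.64, the figure stated for the method.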
CT scans were also performed on the same 639 untreated patients with pulmonary nodules. The specificity of the CT-based method was calculated to be approximately 0.3, meaning that about 70% of benign patients were classified as malignant or could not be classified; the sensitivity of the CT-based method was approximately 0.93; and the specificity × sensitivity product was approximately 0.28.
Thus, compared with the prior-art CT scanning method, the method of the present invention has markedly higher specificity (approximately 0.8 versus 0.3) and comparable sensitivity (approximately 0.8 versus 0.93), and therefore a markedly higher specificity × sensitivity product (approximately 0.64 versus 0.28), indicating that the model performs excellently in determining a patient's pulmonary nodule type from the patient's blood cfDNA.
Example 2: Obtaining a pulmonary nodule screening method
This example provides a pulmonary nodule screening method comprising the following steps:
calculating the sum of the WFDD values of a sample to be tested over a selected predetermined number of regions to obtain a total WFDD value; inputting the total WFDD value of the sample to be tested into the pulmonary nodule screening model constructed by the construction method of the first aspect of the present invention, or into the pulmonary nodule screening model of the second aspect of the present invention, for screening; and outputting the screening result for the sample to be tested; wherein the selected predetermined number of regions are the same as the predetermined number of regions in which the WFDD values differ most between the benign nodule population samples and the malignant nodule population samples.
Optionally, the pulmonary nodule screening method further includes: inputting the total WFDD value of the sample to be tested into the pulmonary nodule screening model, the model determining the pulmonary nodule type of the sample by comparing its total WFDD value with a predetermined threshold.
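The screening steps reduce to a sum followed by a threshold comparison, as in the minimal Python sketch below. The function names are hypothetical, and the direction of the comparison (totals above the threshold reported as malignant) is an assumption, since the text does not state which side of the threshold corresponds to which nodule type.

```python
def total_wfdd(region_wfdd_values):
    """Sum a sample's WFDD values over the selected regions."""
    return sum(region_wfdd_values)


def screen(sample_total_wfdd, threshold):
    """Compare the total WFDD value with the predetermined threshold."""
    # Assumption: totals above the threshold are reported as malignant.
    return "malignant" if sample_total_wfdd > threshold else "benign"
```

For example, `screen(total_wfdd([0.5, 1.0]), threshold=1.0)` returns `"malignant"` under this assumed convention.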
Optionally, the predetermined threshold is obtained as follows: calculating, for each sample in the training set, the total WFDD value over the predetermined number of regions with the largest WFDD differences; and calculating an optimal cut-off point from the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal cut-off point being the predetermined threshold.
Optionally, the optimal cut-off point is calculated using the roc function of the pROC package for the R language.
Example 3: Obtaining a device for constructing a pulmonary nodule screening model
The present invention provides a device for constructing a pulmonary nodule screening model, the device including: a feature data screening module, configured to select, across the whole human reference genome, a predetermined number of regions in which the weighted fragment distribution difference (WFDD) values differ most between the benign nodule population samples and the malignant nodule population samples in the training set, as the feature data; and a construction module, configured to construct the pulmonary nodule screening model from the feature data.
Optionally, the feature data screening module includes: a window division module, configured to concatenate the base sequences of all autosomes in the human reference genome and divide the concatenated sequence into a series of windows of fixed length, each window corresponding to one segment of base sequence; a first calculation module, configured to calculate a window reference depth and a weight for each window, wherein the window reference depth is the mean of the depth values of the training-set samples in that window, the weight is the square of the variance of the depth values of the training-set samples in that window, and the depth value of a window is the number of sequence fragments in a sample's sequencing data that align to the base sequence corresponding to that window; a second calculation module, configured to calculate the window sample depth of a specified training-set sample in each window; a third calculation module, configured to calculate the difference between the window sample depth and the window reference depth; a fourth calculation module, configured to multiply the difference by the weight to obtain the weighted difference of the window; a fifth calculation module, configured to combine a variable number of windows into different regions and to sum the weighted differences of all windows in a specified region to obtain a total weighted difference; a numerical transformation module, configured to apply a numerical transformation to the total weighted difference to obtain the weighted fragment distribution difference (WFDD) value of the specified sample in the specified region; a sixth calculation module, configured to calculate, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples in the training set; and a feature data screening sub-module, configured to select a predetermined number of regions with the largest differences as the feature data.
Optionally, the feature data screening module further includes a normalization module.
Optionally, after normalization, the window reference depths of the windows have a mean of 0 and a standard deviation of 1.
Optionally, the feature data screening sub-module includes: a first calculation unit, configured to calculate, for a specific region, the WFDD value of each sample among the benign nodule population samples and the malignant nodule population samples; a second calculation unit, configured to calculate separately the mean WFDD value of the benign nodule population samples and the mean WFDD value of the malignant nodule population samples; a third calculation unit, configured to calculate the discrimination value of the specific region according to the following formula; and a selection unit, configured to select a predetermined number of regions with the largest discrimination values as the feature data:
Figure PCTCN2022097450-appb-000034
where: $t$ is the discrimination value of the specific region;
Figure PCTCN2022097450-appb-000035
and
Figure PCTCN2022097450-appb-000036
are the mean WFDD values of the benign nodule population samples and of the malignant nodule population samples, respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples and from the malignant nodule population samples, respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples and from the malignant nodule population samples, respectively; provided that when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2, and when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
Optionally, the feature data screening sub-module includes an initial region generation unit and a region combination and splitting unit. The initial region generation unit includes: a window division element, configured to divide the genome into a series of windows; a window numbering element, configured to number the windows according to their order; a region numbering element, configured to identify each region by the numbers of the windows combined to form it; and a window combination element, configured to combine the $n$ consecutive windows starting at window $i$ with another $n$ consecutive windows located $2^{j}n$ windows downstream, forming an initial region: $x_i = \{i, i+1, i+2, \ldots, i+n-1, i+2^{j}n, i+2^{j}n+1, i+2^{j}n+2, \ldots, i+2^{j}n+n-1\}$; $i = 1, n, 2n, \ldots, N$; $j = 1, 2, \ldots, 8$; $i + (2^{j}+1)n \le N$; where $n$ is the given number of consecutive windows and $N$ is the total number of windows. The region combination and splitting unit includes: a first offspring selection element, configured to use a genetic algorithm in which two parents are combined to exchange information and produce offspring, wherein all initial regions are placed in a region pool and parents are picked from it at random to produce offspring, wherein:
- the probability that region $i$ is selected as one of the parents is:
Figure PCTCN2022097450-appb-000037
where $N$ is the total number of windows and $t_i$ is the $t$ value of the $i$-th window;
- once region $x$ has been selected as the first parent, the probability that another region $i$ is selected as the second parent is:
Figure PCTCN2022097450-appb-000038
where $N$ is the total number of windows and $m_i$ is the mean of the window numbers contained in region $i$;
a second offspring selection element, configured to, after the parents are selected, take the union of the windows of the two parents and randomly delete some of those windows to form the offspring, the random selection being sampling with replacement; a third offspring selection element, configured to, once the offspring is obtained, put it into the region pool for the next round of selection, the parents not being removed from the region pool in this operation, wherein the offspring produced by combining region $P_1$ and region $P_2$ is: $\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p, P_1 \cup P_2)$, where $S(p, s)$ is the subset obtained by drawing a proportion $p$ of the elements of set $s$ with replacement; and an offspring repetition element, configured to repeat the offspring generation process;
- the discrimination value of each generated offspring is calculated according to the following formula, and a predetermined number of regions with the largest discrimination values are selected as the feature data:
Figure PCTCN2022097450-appb-000039
where: $t$ is the discrimination value of the specific region;
Figure PCTCN2022097450-appb-000040
and
Figure PCTCN2022097450-appb-000041
are the mean WFDD values of the benign nodule population samples and of the malignant nodule population samples, respectively; $n_1$ and $n_2$ are the numbers of values from the benign nodule population samples and from the malignant nodule population samples, respectively; and $S_1$ and $S_2$ are the standard deviations of the values from the benign nodule population samples and from the malignant nodule population samples, respectively; provided that when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2, and when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
In the construction device of the present invention, the number $n$ of consecutive windows is 1-100, preferably 5-50, more preferably 5. Preferably, after the parents are selected, the union of their windows is taken and 1%-99%, more preferably 5%-50%, further preferably 20% of the windows are randomly deleted, by sampling with replacement, to form the offspring. Preferably, the offspring generation process is repeated 1 to 1,000,000 times, more preferably 100 to 100,000 times, further preferably 300,000 times. The concatenated base sequence is divided into a series of windows with a length of 100 bp-100 kbp, preferably 10 kbp-50 kbp, more preferably 30 kbp. Preferably, the windows combined to form a region are contiguous or non-contiguous. Preferably, the predetermined number of regions is 1-500, more preferably 10-100, further preferably 50.
Example 4: Obtaining a pulmonary nodule screening device
The present invention provides a pulmonary nodule screening device comprising: a first calculation module configured to calculate the sum of the weighted fragment distribution difference (WFDD) values of a sample to be tested over a selected predetermined number of regions, to obtain a total WFDD value; an input module configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model constructed by the construction device of the present invention; and an output module configured to output the screening result for the sample to be tested. The selected predetermined number of regions are the same as the predetermined number of regions in which the WFDD values differ most between the benign nodule population samples and the malignant nodule population samples.
Optionally, the input module comprises: an input unit configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model; and a determination unit configured so that the pulmonary nodule screening model determines the pulmonary nodule type of the sample to be tested by comparing its total WFDD value with a predetermined threshold.
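The screening decision described above reduces to summing the per-region WFDD values and comparing the total with the threshold. The sketch below illustrates this in Python. The function name `screen_sample`, the region labels, the numeric values, and in particular the direction of the comparison (a total above the threshold is called malignant) are assumptions; the text states only that the total is compared with a predetermined threshold.

```python
def screen_sample(region_wfdd, threshold, malignant_if_above=True):
    """Sum per-region WFDD values and compare the total with a threshold."""
    total = sum(region_wfdd.values())
    # The direction of the comparison is an assumption, exposed as a parameter.
    is_malignant = total > threshold if malignant_if_above else total < threshold
    return total, ("malignant" if is_malignant else "benign")

# Hypothetical WFDD values of one sample in three selected regions.
sample = {"region_1": 0.5, "region_2": 1.25, "region_3": -0.25}
total, label = screen_sample(sample, threshold=1.0)
```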
Optionally, the screening device further comprises a predetermined threshold acquisition module, which comprises: a first calculation unit configured to calculate, for each sample in the training set, the total WFDD value over the predetermined number of regions in which the WFDD values differ most; and a second calculation unit configured to calculate an optimal cut-off point from the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal cut-off point being the predetermined threshold.
Optionally, the optimal cut-off point is calculated using the roc function of the pROC package for the R language.
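The text names only the roc function and does not state which criterion defines the "optimal" cut-off point. The Python sketch below assumes the Youden index (sensitivity + specificity - 1), a common choice for ROC-based thresholds, and assumes that higher total WFDD values indicate malignancy. The function name `best_cutoff` and the scores are hypothetical.

```python
def best_cutoff(benign_scores, malignant_scores):
    """Return (threshold, youden_j) maximizing sensitivity + specificity - 1."""
    values = sorted(set(benign_scores) | set(malignant_scores))
    # Candidate thresholds: midpoints between adjacent distinct scores.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = (None, -1.0)
    for thr in candidates:
        # A score above the threshold is called malignant (assumption).
        sens = sum(s > thr for s in malignant_scores) / len(malignant_scores)
        spec = sum(s <= thr for s in benign_scores) / len(benign_scores)
        j = sens + spec - 1
        if j > best[1]:  # on ties, the lowest threshold is kept
            best = (thr, j)
    return best

# Hypothetical total WFDD values for the two training populations.
benign = [0.1, 0.4, 0.6, 0.9]
malignant = [0.8, 1.2, 1.5, 2.0]
threshold, j = best_cutoff(benign, malignant)
```

With these toy scores the maximum Youden index is reached at two thresholds; the sketch keeps the lower one, which favors sensitivity.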
In addition, it should be noted that, with simple modifications, the method for constructing the pulmonary nodule screening model, the screening model, the screening method and the screening device of the present invention can also be applied to methylation sequencing data, RNA sequencing data and proteomics-related data.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.
Memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include" and any variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may be modified and varied in many ways. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Industrial applicability
From the above description it can be seen that the present invention, based on low-depth whole-genome sequencing data of cell-free DNA, provides a product that detects the type of nodule in the human body with high accuracy. Specifically, the present invention provides a method and device for constructing a pulmonary nodule screening model, and a pulmonary nodule screening method and device, which:
1) detect the nodule type (benign/malignant) non-invasively, avoiding invasive examination;
2) provide higher accuracy than existing CT scans, approaching that of invasive tissue biopsy;
3) are developed on the basis of cfDNA sequencing data, require only a blood draw or the collection of another body fluid, and carry no risk of radiation exposure.

Claims (18)

  1. A method for constructing a pulmonary nodule screening model, characterized in that the construction method comprises the following steps:
    screening, across the whole human reference genome, a predetermined number of regions in which the weighted fragment distribution difference (WFDD) values differ most between the benign nodule population samples and the malignant nodule population samples of a training set, as feature data; and
    constructing the pulmonary nodule screening model using the feature data.
  2. The construction method according to claim 1, characterized in that the step of screening the predetermined number of regions in which the WFDD values differ most between the benign nodule population samples and the malignant nodule population samples of the training set as feature data comprises:
    concatenating the base sequences of all autosomes of the human reference genome and dividing the concatenated base sequence into a series of windows of fixed length, each window corresponding to a stretch of base sequence;
    calculating a window reference depth and a weight for each window, wherein the window reference depth is the mean of the depth values of the training-set samples in the window, the weight is the square of the variance of the depth values of the training-set samples in the window, and the depth value of a window is the number of sequence fragments in a sample's sequencing data that align to the base sequence corresponding to the window;
    calculating the window sample depth of a designated sample of the training set in each window;
    calculating the difference between the window sample depth and the window reference depth;
    multiplying the difference by the weight to obtain the weighted difference of the window;
    combining variable numbers of windows to form different regions, and summing the weighted differences of all windows in a designated region to obtain a weighted difference sum;
    applying a numerical transformation to the weighted difference sum to obtain the weighted fragment distribution difference (WFDD) value of the designated sample in the designated region; and
    calculating, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples of the training set, and selecting the predetermined number of regions with the largest differences as the feature data;
    preferably, before calculating the difference between the window sample depth and the window reference depth, the method further comprises a step of normalizing the window reference depth and the window sample depth;
    more preferably, after normalization the window reference depths of the windows have a mean of 0 and a standard deviation of 1.
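The steps of claim 2 can be sketched in Python as follows. The sketch follows the wording of the claim (weight equal to the square of the variance, difference times weight, sum over the windows of a region) but fills several unspecified details with assumptions: population variance is used, normalization is a z-score across windows applied to both the reference depths and the sample depths, and the "numerical transformation" is left as an identity placeholder because the claim does not define it. All function names and depth values are hypothetical.

```python
from statistics import mean, pstdev, pvariance

def zscore(values):
    """Normalize a depth vector to mean 0 and standard deviation 1 (assumed scheme)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def window_baseline_and_weights(train_depths):
    """train_depths: one list of per-window depths for each training sample."""
    per_window = list(zip(*train_depths))              # transpose to window-major
    baseline = [mean(w) for w in per_window]           # window reference depth
    weights = [pvariance(w) ** 2 for w in per_window]  # square of the variance
    return baseline, weights

def region_wfdd(sample_depths, baseline, weights, region, transform=lambda x: x):
    """WFDD of one sample in one region; region is a list of 0-based window indices."""
    norm_sample = zscore(sample_depths)
    norm_base = zscore(baseline)
    weighted = [(norm_sample[i] - norm_base[i]) * weights[i] for i in region]
    # transform stands in for the unspecified numerical transformation.
    return transform(sum(weighted))

# Hypothetical fragment counts: 3 training samples x 4 windows.
train = [[10, 20, 30, 40], [12, 18, 33, 37], [8, 22, 27, 43]]
baseline, weights = window_baseline_and_weights(train)
wfdd_same = region_wfdd([10, 20, 30, 40], baseline, weights, region=[2, 3])
wfdd_shifted = region_wfdd([10, 20, 30, 80], baseline, weights, region=[3])
```

A sample whose depth profile matches the reference has a WFDD of zero in every region, while a sample with excess fragments in a window yields a positive value for regions containing that window.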
  3. The construction method according to claim 2, characterized in that selecting the predetermined number of regions with the largest differences as the feature data comprises:
    - for a specific region, calculating the WFDD value of each sample among the benign nodule population samples and the malignant nodule population samples;
    - calculating the mean WFDD value of the benign nodule population samples and the mean WFDD value of the malignant nodule population samples, respectively;
    - calculating the discrimination value of the specific region according to the following formula; and
    - selecting the predetermined number of regions with the largest discrimination values as the feature data:
    Figure PCTCN2022097450-appb-100001
    where:
    t is the discrimination value of the specific region;
    Figure PCTCN2022097450-appb-100002 and Figure PCTCN2022097450-appb-100003 are, respectively, the mean WFDD values of the benign nodule population samples or the malignant nodule population samples;
    $n_1$ and $n_2$ are, respectively, the numbers of values from the benign nodule population samples or the malignant nodule population samples; and
    $S_1$ and $S_2$ are, respectively, the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples; with the proviso that:
    when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2; or
    when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
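The formula for t survives in this text only as an image reference, so its exact form cannot be confirmed here. The quantities it is defined from (two group means, two group sizes, two standard deviations) match a two-sample t statistic, and the Python sketch below assumes the unequal-variance (Welch) form $t = (\bar{x}_1 - \bar{x}_2) / \sqrt{S_1^2/n_1 + S_2^2/n_2}$. Taking the absolute value, so that the subscript assignment does not matter, is also an assumption, as are the function names and the WFDD values.

```python
from math import sqrt
from statistics import mean, stdev

def discrimination(benign, malignant):
    """Assumed Welch-style t statistic between two groups of WFDD values."""
    n1, n2 = len(benign), len(malignant)
    s1, s2 = stdev(benign), stdev(malignant)
    # Absolute value: the ranking should not depend on which group is subscript 1.
    return abs(mean(benign) - mean(malignant)) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def top_regions(wfdd_by_region, k):
    """wfdd_by_region maps a region id to (benign_values, malignant_values)."""
    scored = {r: discrimination(b, m) for r, (b, m) in wfdd_by_region.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Hypothetical per-sample WFDD values for three candidate regions.
regions = {
    "A": ([1, 2, 3], [1.5, 2.5, 3.5]),
    "B": ([1, 2, 3], [7, 8, 9]),
    "C": ([1, 2, 3], [4, 5, 6]),
}
selected = top_regions(regions, k=2)
```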
  4. The construction method according to claim 2, characterized in that selecting the predetermined number of regions with the largest differences as the feature data comprises:
    a.) generating initial regions:
    - dividing the genome into a series of windows;
    - numbering the series of windows according to their order;
    - using a series of window numbers as the number of the region obtained by combining those windows;
    - combining the n consecutive windows starting at window i with another n consecutive windows located $2^{j}n$ windows downstream, to form an initial region:
    $x_i = \{\, i,\ i+1,\ i+2,\ \dots,\ i+n-1,\ i+2^{j}n,\ i+2^{j}n+1,\ i+2^{j}n+2,\ \dots,\ i+2^{j}n+n-1 \,\}$
    $i = 1, n, 2n, \dots, N;\quad j = 1, 2, \dots, 8;\quad i + (2^{j}+1)n \le N$
    where n is the given number of consecutive windows and N is the total number of windows;
    b.) combining and splitting regions:
    using a genetic algorithm in which two parents combine to exchange information and produce offspring, wherein all initial regions are placed in a region pool and randomly selected to produce offspring;
    wherein:
    - the probability that region i is selected as one of the parents is:
    Figure PCTCN2022097450-appb-100004
    where N is the total number of windows and $t_i$ is the t value of the i-th window;
    - once region x has been selected as the first parent, the probability that another region i is selected as the second parent is:
    Figure PCTCN2022097450-appb-100005
    where N is the total number of windows and $m_i$ is the mean of the window numbers contained in region i; and
    - after the parents are selected, the union of the windows contained in the two parents is taken and some of those windows are randomly deleted to form the offspring, the random selection being sampling with replacement;
    - once the offspring is obtained, it is placed in the region pool for the next round of selection; the parents are not deleted from the region pool in this operation, and the offspring produced by combining region $P_1$ and region $P_2$ is:
    $\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p,\ P_1 \cup P_2)$
    where $S(p, s)$ is the subset obtained by drawing a proportion p of the elements of set s with replacement;
    - repeating the process of generating offspring;
    - calculating the discrimination values of the offspring produced according to the following formula, and selecting the predetermined number of regions with the largest discrimination values as the feature data:
    Figure PCTCN2022097450-appb-100006
    where:
    t is the discrimination value of the specific region;
    Figure PCTCN2022097450-appb-100007 and Figure PCTCN2022097450-appb-100008 are, respectively, the mean WFDD values of the benign nodule population samples or the malignant nodule population samples;
    $n_1$ and $n_2$ are, respectively, the numbers of values from the benign nodule population samples or the malignant nodule population samples; and
    $S_1$ and $S_2$ are, respectively, the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples; with the proviso that:
    when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2; or
    when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
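Step a.) of claim 4 can be sketched in Python as follows. The sketch enumerates, for each start window i and each j from 1 to 8, the block of n windows at i together with the block of n windows located $2^{j}n$ windows downstream, keeping only pairs that fit within the N windows. The bound is read here as $i + (2^{j}+1)n \le N$, which is what the last element of the region requires; the function name `initial_regions` and the toy value N = 30 are hypothetical. The parent selection probabilities are not sketched because their formulas survive only as image references.

```python
def initial_regions(total_windows, n=5, max_j=8):
    """Enumerate initial regions as lists of 1-based window numbers."""
    regions = []
    # Start positions i = 1, n, 2n, ... as listed in the claim (deduplicated for n = 1).
    starts = sorted({1, *range(n, total_windows + 1, n)})
    for i in starts:
        for j in range(1, max_j + 1):
            offset = (2 ** j) * n
            if i + offset + n > total_windows:  # enforce i + (2^j + 1) n <= N
                break
            first_block = list(range(i, i + n))
            second_block = list(range(i + offset, i + offset + n))
            regions.append(first_block + second_block)
    return regions

regions = initial_regions(30, n=5)
```

Note that the start positions 1 and n given in the claim produce first blocks that overlap by one window (windows 1 to 5 and 5 to 9 for n = 5); the sketch reproduces that literally rather than correcting it.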
  5. The construction method according to claim 4, characterized in that:
    the number n of consecutive windows is 1-100, preferably 5-50, more preferably 5;
    preferably, after the parents are selected, the union of the windows contained in the two parents is taken and 1%-99%, more preferably 5%-50%, further preferably 20%, of those windows are randomly deleted to form the offspring, by sampling with replacement;
    preferably, the process of generating offspring is repeated 1 to 1,000,000 times, more preferably 100 to 100,000 times, further preferably 300,000 times.
  6. The construction method according to claim 2, characterized in that:
    the concatenated base sequence is divided into a series of windows with a length of 100 bp-100 kbp, preferably 10 kbp-50 kbp, more preferably 30 kbp;
    preferably, the windows combined to form a region are contiguous or non-contiguous;
    preferably, the predetermined number of regions is 1-500, more preferably 10-100, further preferably 50.
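The fixed-length windowing of the concatenated autosomes can be sketched in Python as follows. Because the sequences are concatenated before being divided, a window may span the boundary between two chromosomes; the helper `locate` maps a concatenated coordinate back to a chromosome. The chromosome names and lengths are toy values, coordinates are 0-based and half-open, and keeping a shorter final window is an assumption; both function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def make_windows(chrom_lengths, window_bp=30_000):
    """Return (window_number, start, end) in concatenated 0-based half-open coordinates."""
    total = sum(chrom_lengths.values())
    return [(k + 1, start, min(start + window_bp, total))
            for k, start in enumerate(range(0, total, window_bp))]

def locate(position, chrom_lengths):
    """Map a concatenated coordinate to (chromosome, offset within that chromosome)."""
    names = list(chrom_lengths)
    ends = list(accumulate(chrom_lengths.values()))  # cumulative end of each chromosome
    idx = bisect_right(ends, position)
    chrom_start = ends[idx - 1] if idx else 0
    return names[idx], position - chrom_start

# Toy lengths, not real autosome sizes.
toy = {"chrA": 70_000, "chrB": 50_000}
windows = make_windows(toy)
```

In this toy case the third window covers concatenated positions 60,000 to 90,000 and therefore contains the end of chrA and the start of chrB.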
  7. A pulmonary nodule screening method, characterized in that it comprises the following steps:
    calculating the sum of the weighted fragment distribution difference values of a sample to be tested over a selected predetermined number of regions, to obtain a total WFDD value;
    inputting the total WFDD value of the sample to be tested into the pulmonary nodule screening model constructed by the construction method according to any one of claims 1 to 6;
    outputting the screening result for the sample to be tested;
    wherein the selected predetermined number of regions are the same as the predetermined number of regions in which the weighted fragment distribution difference (WFDD) values differ most between the benign nodule population samples and the malignant nodule population samples.
  8. The screening method according to claim 7, characterized in that the total WFDD value of the sample to be tested is input into the pulmonary nodule screening model, and the pulmonary nodule screening model determines the pulmonary nodule type of the sample to be tested by comparing its total WFDD value with a predetermined threshold;
    preferably, the predetermined threshold is obtained as follows:
    - calculating, for each sample in the training set, the total WFDD value over the predetermined number of regions in which the WFDD values differ most;
    - calculating an optimal cut-off point from the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal cut-off point being the predetermined threshold;
    more preferably, the optimal cut-off point is calculated using the roc function of the pROC package for the R language.
  9. A device for constructing a pulmonary nodule screening model, characterized in that the construction device comprises:
    a feature data screening module configured to screen, across the whole human reference genome, a predetermined number of regions in which the weighted fragment distribution difference (WFDD) values differ most between the benign nodule population samples and the malignant nodule population samples of a training set, as feature data; and
    a construction module configured to construct the pulmonary nodule screening model using the feature data.
  10. The construction device according to claim 9, characterized in that the feature data screening module comprises:
    a window division module configured to concatenate the base sequences of all autosomes of the human reference genome and divide the concatenated base sequence into a series of windows of fixed length, each window corresponding to a stretch of base sequence;
    a first calculation module configured to calculate a window reference depth and a weight for each window, wherein the window reference depth is the mean of the depth values of the training-set samples in the window, the weight is the square of the variance of the depth values of the training-set samples in the window, and the depth value of a window is the number of sequence fragments in a sample's sequencing data that align to the base sequence corresponding to the window;
    a second calculation module configured to calculate the window sample depth of a designated sample of the training set in each window;
    a third calculation module configured to calculate the difference between the window sample depth and the window reference depth;
    a fourth calculation module configured to multiply the difference by the weight to obtain the weighted difference of the window;
    a fifth calculation module configured to combine variable numbers of windows to form different regions, and to sum the weighted differences of all windows in a designated region to obtain a weighted difference sum;
    a numerical transformation module configured to apply a numerical transformation to the weighted difference sum to obtain the weighted fragment distribution difference (WFDD) value of the designated sample in the designated region; and
    a sixth calculation module configured to calculate, for each region, the difference in WFDD values between the benign nodule population samples and the malignant nodule population samples of the training set; and
    a feature data screening submodule configured to select the predetermined number of regions with the largest differences as the feature data;
    preferably, the feature data screening module further comprises a normalization module;
    more preferably, after normalization the window reference depths of the windows have a mean of 0 and a standard deviation of 1.
  11. The construction device according to claim 10, characterized in that the feature data screening submodule comprises:
    a first calculation unit configured to calculate, for a specific region, the WFDD value of each sample among the benign nodule population samples and the malignant nodule population samples;
    a second calculation unit configured to calculate the mean WFDD value of the benign nodule population samples and the mean WFDD value of the malignant nodule population samples, respectively;
    a third calculation unit configured to calculate the discrimination value of the specific region according to the following formula; and
    a selection unit configured to select the predetermined number of regions with the largest discrimination values as the feature data:
    Figure PCTCN2022097450-appb-100009
    where:
    t is the discrimination value of the specific region;
    Figure PCTCN2022097450-appb-100010 and Figure PCTCN2022097450-appb-100011 are, respectively, the mean WFDD values of the benign nodule population samples or the malignant nodule population samples;
    $n_1$ and $n_2$ are, respectively, the numbers of values from the benign nodule population samples or the malignant nodule population samples; and
    $S_1$ and $S_2$ are, respectively, the standard deviations of the values from the benign nodule population samples or the malignant nodule population samples; with the proviso that:
    when the benign nodule population samples correspond to subscript 1, the malignant nodule population samples correspond to subscript 2; or
    when the benign nodule population samples correspond to subscript 2, the malignant nodule population samples correspond to subscript 1.
  12. 根据权利要求10所述的组建装置,其特征在于,所述特征数据筛选子模块包括:初始区域生成单元和区域组合分拆单元,其中,The assembly device according to claim 10, characterized in that the characteristic data screening sub-module includes: an initial region generation unit and a region combination and splitting unit, wherein,
    所述初始区域生成单元包括:The initial area generation unit includes:
    窗口划分元件,被设置为将基因组划分为一系列窗口;The window division element is configured to divide the genome into a series of windows;
    窗口编码元件,被设置为用所述一系列窗口的次序对所述一系列窗口编号;a window encoding element configured to number the series of windows in the order of the series of windows;
    区域编号元件,被设置为用一系列窗口编号为由所述一系列窗口组合得到的区域编号;The area number component is configured to use a series of window numbers as area numbers obtained by combining the series of windows;
    窗口组合元件,被设置为将窗口i处连续的n个窗口和其下游2 jn个窗口处另外n个连续窗口组合,形成一个初始区域: The window combination component is set to combine n consecutive windows at window i and another n consecutive windows at 2 j n windows downstream of it to form an initial area:
    x i={i,i+1,i+2,...,i+n-1,i+2 jn,i+2 jn+1,i+2 jn+2,...,i+2 jn+n-1}, x i = {i, i+1, i+2,..., i+n-1, i+2 j n, i+2 j n+1, i+2 j n+2,..., i+2 j n+n-1},
    i=1,n,2n,...,N;j=1,2,...,8;i+(2 i+1)n≤N i=1,n,2n,...,N;j=1,2,...,8;i+(2 i+1 )n≤N
    其中,n为给定的连续窗口的个数,N为划分好的窗口总数;Among them, n is the number of given continuous windows, and N is the total number of divided windows;
    the region combination and splitting unit comprises:
    a first offspring selection element, configured to use a genetic algorithm in which two parents combine to exchange information and produce offspring; wherein all initial regions are placed in a region pool and are selected at random to produce offspring;
    wherein:
    - the probability that region i is selected as one of the parents is:
    Figure PCTCN2022097450-appb-100012
    where N is the total number of divided windows and $t_i$ is the t value of the i-th window;
    - after region x has been selected as the first parent, the probability that another region i is selected as the second parent is:
    Figure PCTCN2022097450-appb-100013
    where N is the total number of divided windows and $m_i$ is the mean of the window numbers contained in region i;
    a second offspring selection element, configured to, after the parents are selected, take the union of the windows included in the parents and randomly delete some of those windows to form the offspring, the random selection being sampling with replacement; and
    a third offspring selection element, configured to, after the offspring is obtained, place the offspring into the region pool for the next round of selection; the parents are not deleted from the region pool in this operation, wherein the offspring produced by combining region $P_1$ and region $P_2$ is:
    $\mathrm{child}(P_1, P_2) = P_1 \cup P_2 - S(p,\ P_1 \cup P_2)$,
    where S(p, s) is the subset obtained by drawing a proportion p of the elements of the set s with replacement; and
    an offspring repetition element, configured to repeat the process of producing offspring;
    - the discrimination value of each offspring produced is calculated according to the following formula, and the predetermined number of regions with the largest discrimination values are selected as the feature data:
    Figure PCTCN2022097450-appb-100014
    wherein:
    t is the discrimination value of the specific region;
    Figure PCTCN2022097450-appb-100015
    and
    Figure PCTCN2022097450-appb-100016
    are, respectively, the mean of the WFDD values from the benign nodule population sample or the malignant nodule population sample;
    $n_1$ and $n_2$ are, respectively, the number of values from the benign nodule population sample or the malignant nodule population sample; and
    $S_1$ and $S_2$ are, respectively, the standard deviation of the values from the benign nodule population sample or the malignant nodule population sample; provided that:
    when the benign nodule population sample corresponds to subscript 1, the malignant nodule population sample corresponds to subscript 2; or
    when the benign nodule population sample corresponds to subscript 2, the malignant nodule population sample corresponds to subscript 1.
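
The discrimination value of claim 12 is given only as a formula image, so its exact form is not reproduced in the text. The following is a minimal sketch, assuming the formula is the unequal-variance (Welch) two-sample t statistic built from the quantities defined above (the two means, n_1, n_2, S_1 and S_2). Ranking by the absolute value of t is also an assumption, motivated by the interchangeable subscripts. The function names and the input layout are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def discrimination_value(group1, group2):
    """Assumed Welch-style t statistic between two lists of WFDD values."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)  # sample standard deviations S_1, S_2
    return (mean(group1) - mean(group2)) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def top_regions(wfdd_by_region, k):
    """Return the k region ids with the largest |t| (assumed ranking rule).

    wfdd_by_region maps a region id to a pair
    (benign_wfdd_values, malignant_wfdd_values).
    """
    scored = {
        region: abs(discrimination_value(benign, malignant))
        for region, (benign, malignant) in wfdd_by_region.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]
```

Each group needs at least two values with non-zero spread; otherwise the standard deviation is undefined or the denominator is zero.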
  13. The construction apparatus according to claim 12, characterized in that:
    the number n of consecutive windows is 1-100, preferably 5-50, more preferably 5;
    preferably, after the parents are selected, the union of the windows included in the parents is taken and 1%-99%, more preferably 5%-50%, further preferably 20% of the windows are randomly deleted to form the offspring, using sampling with replacement;
    preferably, the process of producing offspring is repeated 1 to 1 million times, more preferably 100 to 100,000 times, further preferably 300,000 times.
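
The initial-region construction and the offspring step of claim 12 can be sketched with the preferred values of claim 13 (n = 5, 20% deletion). This is an illustrative sketch only: the function names are hypothetical, "a proportion p drawn with replacement" is read as round(p * |union|) draws whose duplicates collapse, and the parent-selection probabilities are omitted because they appear only as formula images.

```python
import random


def initial_regions(N, n=5, max_j=8):
    """Initial regions: n windows starting at i plus n windows 2**j * n downstream."""
    regions = []
    starts = sorted({1, *range(n, N + 1, n)})  # i = 1, n, 2n, ...
    for i in starts:
        for j in range(1, max_j + 1):
            offset = (2 ** j) * n
            if i + offset + n > N:  # bound from the claim: i + (2^j + 1) n <= N
                break
            first = frozenset(range(i, i + n))
            second = frozenset(range(i + offset, i + offset + n))
            regions.append(first | second)
    return regions


def child(p1, p2, p=0.2, rng=random):
    """child(P1, P2) = P1 U P2 - S(p, P1 U P2), under the reading stated above."""
    union = sorted(p1 | p2)
    k = round(p * len(union))
    drawn = set(rng.choices(union, k=k))  # with replacement; duplicates collapse
    return frozenset(union) - drawn
```

Because the draw is with replacement, the number of windows actually deleted can be smaller than round(p * |union|). Passing a seeded `random.Random` instance as `rng` makes a run reproducible.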
  14. The construction apparatus according to claim 10, characterized in that:
    the concatenated base sequence is divided into a series of windows with a length of 100 bp-100 kbp, preferably 10 kbp-50 kbp, more preferably 30 kbp;
    preferably, the windows combined to form different regions are contiguous or non-contiguous;
    preferably, the predetermined number of regions is 1-500, more preferably 10-100, further preferably 50.
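
The window division of claim 14 can be illustrated with the preferred 30 kbp window length. The sketch below assumes 1-based coordinates on the concatenated sequence and lets the last window be shorter than the others; both conventions, and the function names, are assumptions rather than details taken from the claims.

```python
def window_index(position, window_size=30_000):
    """1-based window number containing a 1-based base position."""
    return (position - 1) // window_size + 1


def window_bounds(total_length, window_size=30_000):
    """(start, end) pairs, 1-based and inclusive, covering the whole sequence."""
    return [
        (start, min(start + window_size - 1, total_length))
        for start in range(1, total_length + 1, window_size)
    ]
```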
  15. A pulmonary nodule screening apparatus, characterized in that the screening apparatus comprises:
    a first calculation module, configured to calculate the sum of the weighted fragment distribution difference (WFDD) values of a sample to be tested over the selected predetermined number of regions, to obtain a total WFDD value;
    an input module, configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model constructed by the construction apparatus according to any one of claims 9 to 14; and
    an output module, configured to output the screening result of the sample to be tested;
    wherein the selected predetermined number of regions are the same as the predetermined number of regions having the largest difference in WFDD values between the benign nodule population sample and the malignant nodule population sample.
  16. The screening apparatus according to claim 15, characterized in that the input module comprises:
    an input unit, configured to input the total WFDD value of the sample to be tested into the pulmonary nodule screening model; and
    a determination unit, configured such that the pulmonary nodule screening model determines the pulmonary nodule type of the sample to be tested by comparing the total WFDD value of the sample with a predetermined threshold;
    preferably, the screening apparatus further comprises a predetermined threshold acquisition module, the predetermined threshold acquisition module comprising:
    a first calculation unit, configured to calculate, for each sample in the training set, the total WFDD value over the predetermined number of regions with the largest difference in WFDD values; and
    a second calculation unit, configured to calculate an optimal cut-off point from the total WFDD values of the benign nodule population and the malignant nodule population in the training set, the optimal cut-off point being the predetermined threshold;
    more preferably, the optimal cut-off point is calculated using the roc function of the pROC package for the R language analysis platform.
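
The total WFDD value, the threshold and the decision of claims 15 and 16 can be sketched as follows. The claims do not define "optimal cut-off point"; this sketch assumes it is the threshold that maximises Youden's index (sensitivity + specificity - 1), and that a higher total WFDD value indicates a malignant nodule. It is a plain-Python illustration, not a reproduction of the pROC computation, and all names are hypothetical.

```python
def total_wfdd(wfdd_by_region, selected_regions):
    """Sum of one sample's WFDD values over the selected regions."""
    return sum(wfdd_by_region[region] for region in selected_regions)


def best_cutoff(benign_totals, malignant_totals):
    """Threshold maximising Youden's J (assumed optimality criterion).

    Candidates are midpoints between consecutive distinct training values.
    Returns None when all training values are identical.
    """
    values = sorted(set(benign_totals) | set(malignant_totals))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best, best_j = None, -1.0
    for cutoff in candidates:
        sensitivity = sum(v > cutoff for v in malignant_totals) / len(malignant_totals)
        specificity = sum(v <= cutoff for v in benign_totals) / len(benign_totals)
        j = sensitivity + specificity - 1
        if j > best_j:
            best, best_j = cutoff, j
    return best


def classify(total, threshold):
    """Assumed decision direction: above the threshold means malignant."""
    return "malignant" if total > threshold else "benign"
```

If lower totals indicate malignancy in the actual data, the comparisons in `best_cutoff` and `classify` would need to be reversed.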
  17. A computer-readable storage medium, characterized in that the storage medium comprises a stored program which, when run, executes the method for constructing a pulmonary nodule screening model according to any one of claims 1 to 6 or the pulmonary nodule screening method according to claim 7 or 8.
  18. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, executes the method for constructing a pulmonary nodule screening model according to any one of claims 1 to 6 or the pulmonary nodule screening method according to claim 7 or 8.
PCT/CN2022/097450 2022-06-07 2022-06-07 Construction method and apparatus for pulmonary nodule screening model, and pulmonary nodule screening method and apparatus WO2023236058A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/097450 WO2023236058A1 (en) 2022-06-07 2022-06-07 Construction method and apparatus for pulmonary nodule screening model, and pulmonary nodule screening method and apparatus

Publications (1)

Publication Number Publication Date
WO2023236058A1 true WO2023236058A1 (en) 2023-12-14

Family

ID=89117378

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/097450 WO2023236058A1 (en) 2022-06-07 2022-06-07 Construction method and apparatus for pulmonary nodule screening model, and pulmonary nodule screening method and apparatus

Country Status (1)

Country Link
WO (1) WO2023236058A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060018524A1 (en) * 2004-07-15 2006-01-26 Uc Tech Computerized scheme for distinction between benign and malignant nodules in thoracic low-dose CT
WO2018209704A1 (en) * 2017-05-19 2018-11-22 深圳华大基因研究院 Sample source detection method, device, and storage medium based on dna sequencing data
CN113160883A (en) * 2021-05-26 2021-07-23 深圳泰莱生物科技有限公司 Multi-group detection system for lung cancer
CN113421608A (en) * 2021-07-03 2021-09-21 南京世和基因生物技术股份有限公司 Construction method, detection device and computer readable medium of liver cancer early screening model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22945205

Country of ref document: EP

Kind code of ref document: A1