CN117316286B

CN117316286B - Data processing method, device and storage medium for tumor tracing

Info

Publication number: CN117316286B
Application number: CN202311606964.6A
Authority: CN
Inventors: 侯婷; 张周; 倪帅
Original assignee: Ranshi Biotechnology Shanghai Co ltd; Guangzhou Burning Rock Dx Co ltd
Current assignee: Ranshi Biotechnology Shanghai Co ltd; Guangzhou Burning Rock Dx Co ltd
Priority date: 2023-11-29
Filing date: 2023-11-29
Publication date: 2024-02-27
Anticipated expiration: 2043-11-29
Also published as: CN117316286A

Abstract

The application relates to a data processing method, a data processing device and a storage medium for tumor tracing. The method comprises the following steps: acquiring sequencing data to be processed and a corresponding feature vector thereof; the sequencing data to be processed is obtained by carrying out gene sequencing on a target sample by adopting a preset sequencing mode, and the target sample is obtained by sampling a target object; inputting the sequencing data to be processed and the corresponding feature vectors thereof into a first classification model in the pre-trained traceability prediction model to obtain candidate prediction results, and inputting the sequencing data to be processed and the corresponding feature vectors thereof into a second classification model in the pre-trained traceability prediction model to obtain reference prediction results; correcting the candidate prediction results according to the reference prediction results to obtain tracing prediction results; the tracing prediction result is used for representing a primary part corresponding to the target sample; the primary site is a site in the target object that has an origin relationship with the target sample. By adopting the method, the tumor tracing prediction accuracy can be improved, and the prediction cost can be reduced.

Description

Data processing method, device and storage medium for tumor tracing

Technical Field

The present invention relates to the field of computer technology for biological information processing, and in particular, to a data processing method and model training method for tumor tracing, and an apparatus, a computer device, a storage medium, and a computer program product.

Background

For cancer of unknown primary (CUP), that is, metastatic cancer whose primary focus is unknown, the primary site of the tumor cannot be determined by morphological observation alone. Compared with other metastatic cancers, CUP is characterized by early and aggressive metastasis. Because CUP metastasizes aggressively and has no identifiable site of origin, physicians have difficulty selecting a treatment regimen. Precise treatment of CUP therefore remains a challenge in clinical oncology.

At present, tracing a tumor through the molecular characteristics of its genome is feasible to some extent, but the following problems remain. First, machine learning models have limited ability to recognize rare cancer samples, so recognition accuracy for rare cancers is low. Second, introducing more molecular characteristics to raise the theoretical accuracy of the algorithm complicates the detection workflow and lowers its clinical value, so the approach performs poorly on the CUP tumor tracing problem in actual clinical practice.

Disclosure of Invention

Based on the above, in order to solve the above technical problems, the present application provides a data processing method, a model training method, a device, a computer device, a storage medium, and a computer program product, which can effectively improve the tracing accuracy and reduce the prediction cost.

In a first aspect, the present application provides a data processing method, the method comprising:

acquiring sequencing data to be processed and a corresponding feature vector thereof; the sequencing data to be processed is obtained by carrying out gene sequencing on a target sample by adopting a preset sequencing mode, and the target sample is obtained by sampling a target object;

inputting the sequencing data to be processed and the corresponding feature vectors thereof into a first classification model in a pre-trained traceability prediction model to obtain candidate prediction results, and inputting the sequencing data to be processed and the corresponding feature vectors thereof into a second classification model in the pre-trained traceability prediction model to obtain reference prediction results;

correcting the candidate prediction result according to the reference prediction result to obtain a tracing prediction result; the tracing prediction result is used for representing a primary part corresponding to the target sample; the primary site is a site in the target object that has an origin relationship with the target sample.

In one embodiment, the inputting the sequencing data to be processed and the feature vector corresponding thereto into the first classification model in the pre-trained traceable prediction model to obtain the candidate prediction result, and inputting the sequencing data to be processed and the feature vector corresponding thereto into the second classification model in the pre-trained traceable prediction model to obtain the reference prediction result includes:

inputting the sequencing data to be processed and the corresponding feature vectors thereof into the first classification model to obtain first prediction probabilities corresponding to a plurality of candidate primary parts, and taking the first prediction probabilities as candidate prediction results;

inputting the sequencing data to be processed and the corresponding feature vectors thereof into the second classification model to obtain a second prediction probability corresponding to the reference primary part, and taking the second prediction probability as the reference prediction result; the reference primary site is any primary site or designated primary site, and the plurality of candidate primary sites includes the reference primary site.

In one embodiment, the correcting the candidate prediction result according to the reference prediction result to obtain a traceable prediction result includes:

and correcting the first prediction probability corresponding to the reference primary part in the plurality of candidate primary parts according to the second prediction probability and a preset reference threshold value to obtain a corrected candidate prediction result, and obtaining the tracing prediction result based on the corrected candidate prediction result.

In one embodiment, the obtaining the traceable prediction result based on the modified candidate prediction result includes:

sorting all the first prediction probabilities in the corrected candidate prediction results to obtain a prediction probability sorting result;

and determining the tracing prediction result according to the candidate primary part corresponding to the maximum prediction probability in the prediction probability sequencing result.

In one embodiment, the determining the tracing prediction result according to the candidate primary location corresponding to the maximum prediction probability in the prediction probability ranking result includes:

when the maximum prediction probability is greater than or equal to a preset probability threshold, taking the candidate primary part corresponding to the maximum prediction probability as the tracing prediction result;

or when the maximum prediction probability is smaller than a preset probability threshold value, the candidate primary parts corresponding to the first two prediction probabilities in the prediction probability sequencing result are used as the tracing prediction result; the prediction probability sequencing results are arranged in descending order according to the probability value.

In one embodiment, the primary site corresponding to the target sample involves at least the following cancers:

Lung cancer, colorectal cancer, gastroesophageal cancer, ovarian cancer, breast cancer, pancreatic cancer, endometrial cancer, soft tissue sarcoma, bile duct cancer, liver cancer, kidney cancer, prostate cancer, head and neck cancer, cervical cancer, bladder cancer, melanoma, urothelial tumor, gastrointestinal stromal tumor, thyroid cancer.

In one embodiment, the feature vector corresponding to the sequencing data to be processed at least includes one or more of the following traceable key prediction features:

homologous recombination deficiency (HRD), large-scale state transition (LST), telomeric allelic imbalance (TAI), and loss of heterozygosity (LOH).

In a second aspect, the present application provides a model training method, the method comprising:

acquiring training sample data; the training sample data comprises a plurality of sample sequencing data and corresponding feature vectors thereof; the sample sequencing data are obtained by carrying out gene sequencing on a training sample by adopting the preset sequencing mode, and the training sample is obtained by sampling a sample object;

constructing a first classification model to be trained and a second classification model to be trained based on a preset gradient model structure to obtain a traceability prediction model to be trained; preferably, the preset gradient model structure comprises an extreme gradient boosting tree (XGBoost);

Respectively inputting the training sample data into the first classification model to be trained and the second classification model to be trained to obtain a sample candidate result and a sample reference result;

combining the model characteristics of the preset gradient model structure, the sample candidate results and the sample reference results, and adjusting model parameters of the traceability prediction model to be trained until the model training ending conditions are met, so as to obtain a pre-trained traceability prediction model;

the feature vectors corresponding to the sample sequencing data at least comprise one or more of the following traceability key prediction features: homologous recombination deficiency (HRD), large-scale state transition (LST), telomeric allelic imbalance (TAI), and loss of heterozygosity (LOH).
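As an illustration only, the sketch below shows one way the same training samples could be labelled for the two classifiers. It assumes the second classification model is a binary judgment model for a single designated site, as the detailed description characterises it; the helper name `build_labels`, the integer encoding, and the site names are hypothetical and not taken from the application.

```python
def build_labels(site_labels, reference_site):
    """Derive labels for both classifiers from one list of primary-site labels.

    Assumption (not stated in the source): the multi-class model is trained on
    integer class indices, and the binary model on 1/0 flags that mark whether
    a sample's primary site equals the designated reference site.
    """
    classes = sorted(set(site_labels))              # stable class ordering
    index = {site: i for i, site in enumerate(classes)}
    multi = [index[site] for site in site_labels]   # targets for the first model
    binary = [1 if site == reference_site else 0    # targets for the second model
              for site in site_labels]
    return classes, multi, binary


sites = ["lung", "breast", "lung", "liver"]
classes, multi_labels, binary_labels = build_labels(sites, reference_site="lung")
```

Both label lists stay aligned with the same feature vectors, so the two models can be fitted on identical inputs.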

In one embodiment, the plurality of sample sequencing data includes different types of sample sequencing data, and the inputting the training sample data into the first classification model to be trained and the second classification model to be trained includes:

aiming at each type in the training sample data, taking the sample sequencing data of the type and the corresponding feature vector thereof as a data set to be processed;

Performing oversampling processing according to the data set to be processed to obtain an oversampled data set corresponding to the type; the number of training samples corresponding to the data set to be processed is smaller than or equal to the number of training samples corresponding to the oversampled data set;

and taking the oversampling data set corresponding to each type as input data, and respectively inputting the oversampling data set into the first classification model to be trained and the second classification model to be trained.

In one embodiment, the performing the oversampling according to the to-be-processed data set to obtain an oversampled data set corresponding to the type includes:

acquiring sampling parameter information and acquiring adjacent sample information; the sampling parameter information is used for determining the number of new samples generated based on the over-sampling process, and the adjacent sample information is used for representing the number of adjacent samples associated with any new sample when the over-sampling process generates the any new sample;

and carrying out oversampling processing on the data set to be processed according to the sampling parameter information and the adjacent sample information to obtain a feature vector corresponding to the newly added sample data, and taking the feature vector corresponding to the data set to be processed and the newly added sample data as an oversampled data set.

In one embodiment, the performing oversampling on the to-be-processed data set according to the sampling parameter information and the adjacent sample information to obtain a feature vector corresponding to the newly added sample data includes:

when the sampling parameter information is detected to meet a preset oversampling condition, any existing sample point in a feature space corresponding to the data set to be processed is taken as a target sample point, and candidate sample points corresponding to the target sample point are determined according to the adjacent sample information; there are a plurality of candidate sample points;

in the feature space, generating a new synthesized sample point according to the sample difference between any candidate sample point and the target sample point, and taking a feature vector corresponding to the new synthesized sample point as a feature vector corresponding to the newly added sample data; the new composite sample point is located on a line connecting any of the candidate sample points and the target sample point.
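The interpolation described above resembles the synthetic minority over-sampling technique (SMOTE), although the application does not name a specific algorithm. The following is a minimal sketch under that assumption: `n_new` stands in for the sampling parameter information, `k` for the adjacent sample information, and the function name `smote_like`, the Euclidean distance, and the uniform random position on the segment are illustrative choices.

```python
import math
import random


def smote_like(samples, n_new, k, seed=0):
    """Generate n_new synthetic feature vectors by neighbour interpolation."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    k = min(k, len(samples) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        target = samples[i]                      # target sample point
        others = [s for j, s in enumerate(samples) if j != i]
        # Candidate sample points: the k nearest neighbours of the target.
        neighbours = sorted(others, key=lambda s: math.dist(s, target))[:k]
        candidate = rng.choice(neighbours)
        gap = rng.random()                       # position on the segment, in [0, 1)
        # New point = target + gap * (candidate - target), i.e. on the line
        # connecting the candidate sample point and the target sample point.
        synthetic.append([t + gap * (c - t) for t, c in zip(target, candidate)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=6, k=2)
oversampled = minority + new_points
```

The oversampled set is the original set plus the synthetic points, so it is never smaller than the data set to be processed.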

In one embodiment, the method further comprises:

acquiring a tracing feature set based on the tracing prediction model after training; the traceability characteristic set comprises traceability key prediction characteristics corresponding to each prediction type in a sample prediction result and characteristic gene information with traceability prediction performance, and the sample prediction result is obtained by outputting a traceability prediction model after training is finished.

In a third aspect, the present application further provides a data processing apparatus for tumor tracing, the apparatus comprising:

the sequencing data acquisition module is used for acquiring sequencing data to be processed and corresponding feature vectors thereof; the sequencing data to be processed is obtained by carrying out gene sequencing on a target sample by adopting a preset sequencing mode, and the target sample is obtained by sampling a target object;

the tracing prediction module is used for inputting the sequencing data to be processed and the corresponding feature vectors thereof into a first classification model in a pre-trained tracing prediction model to obtain candidate prediction results, and inputting the sequencing data to be processed and the corresponding feature vectors thereof into a second classification model in the pre-trained tracing prediction model to obtain reference prediction results;

the tracing result obtaining module is used for correcting the candidate prediction result according to the reference prediction result to obtain a tracing prediction result; the tracing prediction result is used for representing a primary part corresponding to the target sample; the primary site is a site in the target object that has an origin relationship with the target sample.

In a fourth aspect, the present application further provides a model training apparatus for tumor tracing, the apparatus comprising:

The training data acquisition module is used for acquiring training sample data; the training sample data comprises a plurality of sample sequencing data and corresponding feature vectors thereof; the sample sequencing data are obtained by carrying out gene sequencing on a training sample by adopting the preset sequencing mode, and the training sample is obtained by sampling a sample object;

the model construction module is used for constructing a first classification model to be trained and a second classification model to be trained based on a preset gradient model structure to obtain a traceability prediction model to be trained; preferably, the preset gradient model structure comprises an extreme gradient boosting tree (XGBoost);

the sample result obtaining module is used for respectively inputting the training sample data into the first classification model to be trained and the second classification model to be trained to obtain a sample candidate result and a sample reference result;

the model training module is used for adjusting model parameters of the traceability prediction model to be trained by combining the model characteristics of the preset gradient model structure, the sample candidate results and the sample reference results until the model training ending condition is met, so as to obtain a pre-trained traceability prediction model.

In a fifth aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program which, when executed by a processor, implements the steps of the data processing method according to the first aspect and/or the steps of the model training method according to the second aspect.

In a sixth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the data processing method as described in the first aspect and/or the steps of the model training method as described in the second aspect.

In a seventh aspect, the present application also provides a computer program product. The computer program product comprising a computer program which, when executed by a processor, implements the steps of the data processing method as described in the first aspect and/or the steps of the model training method as described in the second aspect.

With the above data processing method, model training method, apparatus, computer device, storage medium and computer program product, sequencing data to be processed and its corresponding feature vector are acquired, where the sequencing data to be processed is obtained by performing gene sequencing on a target sample in a preset sequencing mode and the target sample is obtained by sampling a target object. The sequencing data to be processed and its corresponding feature vector are input into a first classification model in a pre-trained traceability prediction model to obtain a candidate prediction result, and into a second classification model in the pre-trained traceability prediction model to obtain a reference prediction result. The candidate prediction result is then corrected according to the reference prediction result to obtain a tracing prediction result, which represents the primary site corresponding to the target sample, the primary site being a site in the target object that has an origin relationship with the target sample. Tumor tracing for CUP is thereby optimized: processing the key tumor-tracing prediction features contained in the feature vector with the pre-trained traceability prediction model improves tracing prediction accuracy, the overall prediction accuracy reaches 84% after the second-ranked prediction result is combined, and the accuracy of tracing prediction for CUP tumor samples is improved.

Drawings

FIG. 1 is a flow chart of a data processing method according to an embodiment;

FIG. 2 is a schematic diagram of a structure of a traceability prediction model according to an embodiment;

FIG. 3 is a flow chart of a model training method in one embodiment;

FIG. 4a is a schematic diagram of a data preprocessing and model training process in one embodiment;

FIG. 4b is a schematic diagram of an oversampled data comparison in one embodiment;

FIG. 4c is a schematic diagram of an over-sampling process for interpolation of data in one embodiment;

FIG. 5 is a flow chart of another data processing method according to an embodiment;

FIG. 6 is a block diagram of a data processing apparatus for tumor tracing in one embodiment;

FIG. 7 is a block diagram of a model training apparatus for tumor tracing in one embodiment;

FIG. 8 is an internal block diagram of a computer device in one embodiment;

FIG. 9 is an internal block diagram of another computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for presentation, analyzed data, etc.) related in the present application are both information and data authorized by the user or sufficiently authorized by each party; correspondingly, the application also provides a corresponding user authorization entry for the user to select authorization or select rejection.

In one embodiment, as shown in FIG. 1, a data processing method is provided. This embodiment is described with the method applied to a terminal by way of example. It is to be understood that the method may also be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between the two. For example, the terminal may communicate with the server through a network; a data storage system may store the data that the server needs to process, and may be integrated on the server or placed on a cloud or another network server. The terminal may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, an internet of things device, or a portable wearable device, and the server may be implemented as a stand-alone server or as a server cluster formed by a plurality of servers. In this embodiment, the method includes the following steps:

Step 101, acquiring sequencing data to be processed and corresponding feature vectors thereof;

The sequencing data to be processed may be obtained by performing gene sequencing on the target sample in a preset sequencing mode. For example, the target sample may be tested with a specific kit, namely a targeted sequencing panel kit covering 9000 SNP loci and 520 genes.

As an example, the target sample may be a DNA fragment obtained by sampling a target object, and the source of the target sample may include, but is not limited to, cells, tumor tissue, healthy tissue, blood, and the like.

In practical application, a patient to be tested may be taken as the target object, and a test sample obtained from the target object by sampling may be taken as the target sample. Feature extraction may then be performed on the sequencing result of the target sample (i.e., the sequencing data to be processed) to obtain the corresponding feature vector. For example, a feature vector of a certain length may be generated as the model input information, representing all the information in the sequencing result of the patient to be tested.

In one example, based on the biological mechanism of tumor clonal evolution, the metastatic and primary sites of a tumor have closely related molecular characteristics, and the molecular characteristics of a metastatic tumor tend to resemble those of its primary site, so tracing a tumor through the molecular characteristics of its genome is feasible and has potential for clinical application. In this embodiment, a plurality of feature vectors may be extracted from the sequencing result, and the set of these feature vectors may be used as the model input information. The feature vector set contains key prediction features for tumor tracing (i.e., traceability key prediction features), which may include homologous recombination deficiency (HRD), large-scale state transition (LST), telomeric allelic imbalance (TAI), and loss of heterozygosity (LOH).

For example, the feature vector corresponding to the sequencing data to be processed may also include any one or more of the following traceability key prediction features: sex, HRD, LOH, TAI, LST, APC gene, CDH1 gene, EGFR gene, KIT gene, KMT2D gene, KRAS gene, NRAS gene, PTEN gene, RB1 gene, RNF43 gene, TP53 gene, VHL gene, APC_hotspot (APC gene hotspot variation), EGFR_hotspot (EGFR gene hotspot variation), FGFR3_hotspot (FGFR3 gene hotspot variation), FOXA1_hotspot (FOXA1 gene hotspot variation), KIT_hotspot (KIT gene hotspot variation), KRAS_hotspot (KRAS gene hotspot variation), PIK3CA_hotspot (PIK3CA gene hotspot variation), SPOP_hotspot (SPOP gene hotspot variation), VHL_hotspot (VHL gene hotspot variation), BRAF_V600E (BRAF gene V600E variation), HRAS_Q61R (HRAS gene Q61R variation), KIT_L576P (KIT gene L576P variation), KRAS_G12D (KRAS gene G12D variation), APC_trunk (APC gene truncation variation), CDKN1A_trunk (CDKN1A gene truncation variation), GATA3_trunk (GATA3 gene truncation variation), KDM6A_trunk (KDM6A gene truncation variation), KMT2D_trunk (KMT2D gene truncation variation), RB1_trunk (RB1 gene truncation variation), VHL_trunk (VHL gene truncation variation), EML4_ALK_fusion (EML4-ALK gene fusion variation), TMPRSS2_ERG_fusion (TMPRSS2-ERG gene fusion variation), RB1_LGR (RB1 gene large fragment rearrangement variation), ERBB2_amp (ERBB2 gene amplification), cn_burden (copy number variation burden), logv_mb (logarithm of the number of single nucleotide variants, SNVs, detected per Mb), logindel_mb (logarithm of the number of insertions or deletions, INDELs, detected per Mb), and TMB (tumor mutation burden). Other traceability key prediction features may also be included, and are not particularly limited in this embodiment.
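To make the idea of a fixed-length feature vector concrete, the sketch below maps a per-sample record of named features onto an ordered list of numbers. Everything about the encoding is an assumption made for illustration: the short `FEATURE_ORDER` subset, the key names, the 0/1 coding of sex and gene status, and the default of 0.0 for absent features are hypothetical and are not specified in the application.

```python
# Hypothetical subset and ordering of features; the real model may use many more.
FEATURE_ORDER = ["sex", "HRD", "LOH", "TAI", "LST", "TP53", "KRAS_hotspot", "TMB"]


def to_feature_vector(record, feature_order=FEATURE_ORDER, default=0.0):
    """Turn a dict of named feature values into a fixed-length numeric vector.

    Features missing from the record are filled with `default` (an assumption),
    so every sample yields a vector of the same length and column order.
    """
    return [float(record.get(name, default)) for name in feature_order]


# Illustrative record: 0/1 flags for sex and gene status are assumed encodings.
record = {"sex": 1, "HRD": 42.0, "TP53": 1, "TMB": 8.5}
vector = to_feature_vector(record)
```

Keeping the column order in one shared list helps ensure that training and prediction see features in the same positions.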

Step 102, inputting the sequencing data to be processed and the corresponding feature vectors thereof into a first classification model in a pre-trained traceability prediction model to obtain candidate prediction results, and inputting the sequencing data to be processed and the corresponding feature vectors thereof into a second classification model in the pre-trained traceability prediction model to obtain reference prediction results;

The pre-trained traceability prediction model may be a model obtained by fusing a binary classification model onto a multi-class classification model. That is, the first classification model may be a multi-class classification model, and the second classification model may be a judgment model for a designated site, used to determine whether the primary site corresponding to the target sample is the designated site. The model structure corresponding to the pre-trained traceability prediction model may be as shown in FIG. 2.

For example, taking a lung cancer binary classification model as the second classification model, the model may output a probability value indicating whether the primary site of the sample (i.e., the primary site corresponding to the target sample) is lung cancer, that is, the reference prediction result.

In a specific implementation, the sequencing data to be processed and the feature vectors corresponding to the sequencing data to be processed may be input to a first classification model to obtain a first prediction probability corresponding to each of a plurality of candidate primary sites, as a candidate prediction result, and the sequencing data to be processed and the feature vectors corresponding to the sequencing data to be processed may be input to a second classification model to obtain a second prediction probability corresponding to a reference primary site, as a reference prediction result, where the plurality of candidate primary sites may include the reference primary site.

In an example, as shown in FIG. 2, inputting the feature vector obtained by feature extraction from the sequencing result into the multi-class classification model (i.e., the first classification model in the traceability prediction model) yields a plurality of probability values between 0 and 1 (i.e., the candidate prediction results). Each probability value represents the probability, as predicted by the multi-class classification model for the target sample, that the primary site corresponds to a particular cancer type.

In yet another example, taking a lung cancer binary classification model as the second classification model, as shown in FIG. 2, inputting the feature vector obtained by feature extraction from the sequencing result into the lung cancer binary classification model (i.e., the second classification model in the traceability prediction model) yields a probability value between 0 and 1 (i.e., the reference prediction result). This value is the probability p, as predicted by the lung cancer binary classification model for the target sample, that the primary site is lung cancer.

Step 103, correcting the candidate prediction results according to the reference prediction results to obtain tracing prediction results.

The tracing prediction result may be used to represent the primary site corresponding to the target sample, and the primary site may be a site in the target object that has an origin relationship with the target sample. For example, for a metastasis whose primary site is unknown, a sample of the metastasis may be taken for tumor tracing so as to determine the site of origin corresponding to the metastasis.

After the reference prediction result and the candidate prediction result are obtained, the first prediction probability corresponding to the reference primary part in the plurality of candidate primary parts can be corrected according to the second prediction probability and the preset reference threshold value, so that a corrected candidate prediction result is obtained, and further, a traceable prediction result can be obtained based on the corrected candidate prediction result.

Specifically, taking a lung cancer binary classification model as the second classification model, the prediction probability corresponding to the lung cancer primary site in the multi-class prediction result (i.e., the candidate prediction result) may be corrected according to the judgment probability value p (i.e., the reference prediction result) output by the lung cancer binary classification model. For example, if p is smaller than the preset reference threshold x, the prediction probability corresponding to the lung cancer primary site in the multi-class prediction result may be corrected to 0.
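The correction rule in this example can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `correct_candidates`, the dictionary representation, the site names, and the threshold value 0.50 are all assumptions, since the application does not disclose a value for x.

```python
def correct_candidates(candidates, reference_site, p_ref, threshold):
    """Return a corrected copy of the multi-class probabilities.

    candidates:     dict mapping candidate primary site -> first prediction probability
    reference_site: the site judged by the second (binary) classification model
    p_ref:          second prediction probability p output by the binary model
    threshold:      preset reference threshold x (value here is hypothetical)
    """
    if reference_site not in candidates:
        raise KeyError(f"{reference_site!r} is not a candidate primary site")
    corrected = dict(candidates)  # leave the caller's dict untouched
    if p_ref < threshold:
        # The binary model rejects the reference site: set its probability to 0.
        corrected[reference_site] = 0.0
    return corrected


candidates = {"lung": 0.50, "breast": 0.30, "colorectal": 0.20}
rejected = correct_candidates(candidates, "lung", p_ref=0.10, threshold=0.50)
kept = correct_candidates(candidates, "lung", p_ref=0.90, threshold=0.50)
```

The sketch does not renormalise the remaining probabilities, because the text describes only setting the reference site's probability to 0.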

In an alternative embodiment, the corrected multi-class prediction results may be ranked, and the cancer information corresponding to the top-ranked (highest) probability value may be output as the tracing prediction result. When the highest probability value is smaller than a preset probability threshold (for example, 0.95), the cancer information corresponding to the second-ranked probability value may also be output. The output cancer information serves as the final prediction result (i.e., the tracing prediction result), as shown in FIG. 2, which completes the prediction.
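The ranking and output rule can be sketched as below. The 0.95 default follows the example threshold given in the text; the function name `trace_result`, the dictionary input, and the site names are illustrative assumptions.

```python
def trace_result(corrected, prob_threshold=0.95):
    """Pick the tracing prediction result from corrected probabilities.

    Returns one site when the highest probability reaches the threshold,
    otherwise the two highest-ranked sites in descending order of probability.
    """
    if not corrected:
        raise ValueError("no candidate primary sites to rank")
    # Rank candidate sites by probability, highest first.
    ranked = sorted(corrected.items(), key=lambda item: item[1], reverse=True)
    top_site, top_prob = ranked[0]
    if top_prob >= prob_threshold:
        return [top_site]
    return [site for site, _ in ranked[:2]]


confident = trace_result({"lung": 0.97, "breast": 0.02, "colorectal": 0.01})
uncertain = trace_result({"lung": 0.0, "breast": 0.6, "colorectal": 0.4})
```

Returning a list in both cases lets a caller handle the single-site and two-site outputs uniformly.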

In one example, the cancer involved in the primary site corresponding to the target sample may be one of lung cancer (including non-small cell lung cancer and small cell lung cancer), colorectal cancer, gastroesophageal cancer, ovarian cancer, breast cancer, pancreatic cancer, endometrial cancer, soft tissue sarcoma, cholangiocarcinoma, liver cancer, renal cancer, prostate cancer, head and neck tumor, cervical cancer, bladder cancer, melanoma, urothelial tumor, gastrointestinal stromal tumor, thyroid cancer; other cancer types may also be included, and are not particularly limited in this embodiment.

According to the above data processing method, the sequencing data to be processed and its corresponding feature vector are obtained; they are input into the first classification model in the pre-trained traceability prediction model to obtain the candidate prediction result, and into the second classification model in the pre-trained traceability prediction model to obtain the reference prediction result; the candidate prediction result is then corrected according to the reference prediction result to obtain the tracing prediction result. This optimizes tracing processing for CUP tumors: based on the pre-trained traceability prediction model and the key tumor tracing prediction features contained in the feature vector, tracing prediction accuracy can be improved, with the overall prediction accuracy reaching more than 84.7% after the secondary prediction results are combined, so that the accuracy of tracing prediction for CUP tumor samples is improved while cost remains controllable.

In one embodiment, inputting the sequencing data to be processed and the feature vector corresponding thereto into the first classification model in the pre-trained traceability prediction model to obtain the candidate prediction result, and inputting the sequencing data to be processed and the feature vector corresponding thereto into the second classification model in the pre-trained traceability prediction model to obtain the reference prediction result, may include the following steps:

inputting the sequencing data to be processed and the corresponding feature vectors thereof into the first classification model to obtain first prediction probabilities corresponding to a plurality of candidate primary parts, and taking the first prediction probabilities as candidate prediction results; and inputting the sequencing data to be processed and the corresponding feature vectors thereof into the second classification model to obtain a second prediction probability corresponding to the reference primary part, and taking the second prediction probability as the reference prediction result.

The plurality of candidate primary sites may include the reference primary site. The reference primary site may be any primary site or a designated primary site, e.g., the designated site corresponding to the second classification model.

In practical application, as shown in fig. 2, by inputting the feature vector obtained by feature extraction from the sequencing result into the multi-classification model (i.e., the first classification model), a plurality of probability values, each between 0 and 1, can be obtained (i.e., the candidate prediction result). These probability values sum to 1, and each represents the probability, predicted by the multi-classification model for the target sample, that the primary site corresponds to a particular cancer type (i.e., the first prediction probability corresponding to each of the plurality of candidate primary sites).

Taking the case in which the second classification model is a lung cancer binary classification model as an example, as shown in fig. 2, by inputting the feature vector obtained by feature extraction from the sequencing result into the lung cancer binary classification model (i.e., the second classification model), a probability value between 0 and 1 can be obtained (i.e., the reference prediction result). This probability value represents the probability p, predicted by the lung cancer binary classification model for the target sample, that the primary site is lung cancer (i.e., the second prediction probability corresponding to the reference primary site).

In this embodiment, the first prediction probabilities corresponding to the plurality of candidate primary sites are obtained by inputting the sequencing data to be processed and the feature vectors corresponding to the sequencing data to be processed into the first classification model, and the first prediction probabilities corresponding to the plurality of candidate primary sites are used as candidate prediction results, and the second prediction probabilities corresponding to the reference primary sites are obtained by inputting the sequencing data to be processed and the feature vectors corresponding to the sequencing data to be processed into the second classification model, and the second prediction probabilities corresponding to the reference primary sites are used as reference prediction results, so that data support is provided for further correcting the prediction results.

In one embodiment, correcting the candidate prediction result according to the reference prediction result to obtain the tracing prediction result may include the following steps:

In an example, taking the case in which the second classification model is the lung cancer binary classification model, the judgment probability value p (i.e., the second prediction probability) output by the lung cancer binary classification model is compared with the preset reference threshold x; if p is smaller than x, the prediction probability (i.e., the first prediction probability) corresponding to the lung cancer primary site in the multi-classification prediction result can be corrected to 0.

In one embodiment, the obtaining the traceable prediction result based on the modified candidate prediction result may include the following steps:

sorting all the first prediction probabilities in the corrected candidate prediction results to obtain a prediction probability sorting result; and determining the tracing prediction result according to the candidate primary part corresponding to the maximum prediction probability in the prediction probability sequencing result.

As an example, the first prediction probabilities in the prediction probability ranking result may be arranged in descending order of probability values.

In a specific implementation, the corrected multi-classification prediction results (i.e., the first prediction probabilities) are sorted, and the tracing prediction result can be obtained from the cancer type information corresponding to the first-ranked probability value (i.e., the maximum prediction probability in the prediction probability ranking result).

In the embodiment, the first prediction probabilities in the corrected candidate prediction results are sequenced to obtain the prediction probability sequencing result, and then the tracing prediction result is determined according to the candidate primary part corresponding to the maximum prediction probability in the prediction probability sequencing result, so that the tracing prediction accuracy and the prediction efficiency are improved.

In one embodiment, the determining the source-tracing prediction result according to the candidate primary location corresponding to the maximum prediction probability in the prediction probability ranking result may include the following steps:

when the maximum prediction probability is greater than or equal to a preset probability threshold, taking the candidate primary part corresponding to the maximum prediction probability as the tracing prediction result; or when the maximum prediction probability is smaller than a preset probability threshold value, the candidate primary parts corresponding to the first two prediction probabilities in the prediction probability sequencing result are used as the tracing prediction result; the prediction probability sequencing results are arranged in descending order according to the probability value.

In an example, the corrected multi-classification prediction results are sorted, and the cancer type information corresponding to the first-ranked probability value (i.e., the maximum prediction probability) can be output as the tracing prediction result; or, if the maximum probability value is smaller than the preset probability value (e.g., 0.95), the cancer type information corresponding to the second-ranked probability value may also be output, and the output cancer type information (i.e., the candidate primary sites corresponding to the first two prediction probabilities) serves as the final prediction result.
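The output rule of this embodiment can be sketched as below. It is an illustrative sketch rather than the implementation of the embodiment: the function name `select_tracing_result` and the example site labels are hypothetical, and the default threshold of 0.95 is the example value given in the text.

```python
from typing import Dict, List


def select_tracing_result(corrected: Dict[str, float],
                          prob_threshold: float = 0.95) -> List[str]:
    """Pick the candidate primary site(s) to report.

    The corrected first prediction probabilities are sorted in descending
    order. If the maximum reaches the preset probability threshold, only
    the top site is reported; otherwise the top two sites are reported.
    """
    ranked = sorted(corrected.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    if ranked[0][1] >= prob_threshold:
        return [ranked[0][0]]
    return [site for site, _ in ranked[:2]]


if __name__ == "__main__":
    # Hypothetical probabilities, for illustration only.
    print(select_tracing_result({"lung": 0.97, "breast": 0.02, "ovarian": 0.01}))
    print(select_tracing_result({"lung": 0.0, "breast": 0.55, "ovarian": 0.45}))
```

Because Python's sort is stable, ties between equal probabilities are resolved by insertion order in this sketch; the original text does not specify a tie-breaking rule.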

In this embodiment, when the maximum prediction probability is greater than or equal to the preset probability threshold, the candidate primary part corresponding to the maximum prediction probability is used as the tracing prediction result, or when the maximum prediction probability is less than the preset probability threshold, the candidate primary parts corresponding to the first two prediction probabilities in the prediction probability sequencing result are used as the tracing prediction result, so that the tracing prediction result can be flexibly output.

In one embodiment, as shown in fig. 3, a model training method is provided. The method is described here, by way of illustration, as applied to a terminal; it is understood that the method may also be applied to a server, or to a system including the terminal and the server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:

step 301, obtaining training sample data; the training sample data comprises a plurality of sample sequencing data and corresponding feature vectors thereof;

the sample sequencing data can be obtained by carrying out gene sequencing on a training sample in a preset sequencing mode, for example, the training sample can be detected by adopting a specific kit, and the specific kit is a panel targeted sequencing kit containing 9000 SNP loci and 520 genes.

The feature vectors corresponding to the plurality of sample sequencing data may include at least one or more of the following key tracing prediction features: homologous recombination deficiency (HRD), large-scale state transitions (LST), telomeric allelic imbalance (TAI), and genomic loss of heterozygosity (LOH).

As an example, a training sample may be a DNA fragment obtained by sampling a sample object, and sources of the training sample may include, but are not limited to, cells, tumor tissue, healthy tissue, blood, and the like.

In practical applications, multiple tumor samples (i.e., training samples) covering different cancer types may be obtained in advance, for example multiple common cancers: lung cancer (including non-small cell lung cancer and small cell lung cancer), colorectal cancer, gastroesophageal cancer, ovarian cancer, breast cancer, pancreatic cancer, endometrial cancer, soft tissue sarcoma, cholangiocarcinoma, liver cancer, kidney cancer, prostate cancer, head and neck tumor, cervical cancer, bladder cancer, melanoma, urothelial tumor, gastrointestinal stromal tumor, thyroid cancer, and the like. By detecting with a specific kit (panel targeted sequencing containing 9000 SNP sites and 520 genes), a plurality of molecular features in the sample sequencing data corresponding to each tumor sample can be collected to obtain the corresponding feature vectors. For example, the four genome-wide instability prediction indicators HRD, LST, TAI and LOH may be newly added to the collected feature set; other features may also be included, which is not particularly limited in this embodiment.

Step 302, constructing a first classification model to be trained and a second classification model to be trained based on a preset gradient model structure to obtain a traceability prediction model to be trained;

the preset gradient model structure may include an extreme gradient boosting tree (XGBoost).

In a specific implementation, taking the tracing of lung cancer as an example, a binary classification model M for lung cancer samples, which form the majority class, can be fused onto the multi-classification model; the binary classification model M assists the multi-classification model in making the tracing judgment, and the fused prediction model structure may be as shown in fig. 2. Preferably, the extreme gradient boosting tree (XGBoost) may be selected as the basic model structure (i.e., the preset gradient model structure), which may be applied in the multi-classification model (i.e., the first classification model to be trained) and in the lung cancer binary classification correction model (i.e., the second classification model to be trained), respectively.

In an example, since the binary classification model M is a completely independent model, a higher degree of freedom is available in selecting its hyperparameters, and the threshold x of model M can be flexibly selected to meet the sensitivity or specificity requirements of the overall prediction model. Because lung cancer patients account for a relatively high proportion of cancer patients, fusing the lung cancer binary classification correction model M onto the multi-classification model can effectively improve overall recognition accuracy by improving the recognition specificity for lung cancer samples.

Step 303, inputting the training sample data into the first classification model to be trained and the second classification model to be trained respectively, so as to obtain a sample candidate result and a sample reference result;

and step 304, adjusting model parameters of the traceability prediction model to be trained by combining model characteristics of the preset gradient model structure, the sample candidate result and the sample reference result until model training ending conditions are met, so as to obtain a pre-trained traceability prediction model.

In practical application, the traceability prediction model to be trained can be trained using training sample data that has undergone SMOTE oversampling during data preprocessing. Hyperparameters can be screened according to the characteristics of the extreme gradient boosting tree (XGBoost) model and of the data, so as to determine an optimal prediction model (i.e., the pre-trained traceability prediction model). The data preprocessing and model training flow may be as shown in fig. 4a.

In an example, the input of the model may be the plurality of sample sequencing data processed by the SMOTE algorithm and their corresponding feature vectors, and the output of the model may be the prediction probabilities corresponding to different cancer types. For each training sample, the cancer type with the highest prediction probability can be selected as the predicted cancer type; if the prediction probability of the model for that cancer type is lower than a preset probability value (such as 95%), the model can also output the second most probable prediction result.

In yet another example, experimental analysis shows that, at the same sensitivity, the recognition specificity obtained when the binary classification model is fused is higher than that obtained when the multi-classification model alone is used to classify lung cancer samples; that is, fusing the binary classification model onto the multi-classification model can improve recognition specificity. For the threshold x of the binary classification model, the threshold at which the theoretical sensitivity on the training set reaches a specific requirement (such as 99.9%) can be selected. After the threshold x is determined, samples can be divided into two classes by the threshold x: for samples whose prediction probability is lower than x, the possibility of predicting them as lung cancer is no longer considered, and the prediction result can then be determined according to the maximum of the model's prediction probabilities for the individual cancer types.
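One way to pick the threshold x from a required training-set sensitivity is sketched below. The text only states that the threshold corresponding to the required sensitivity is selected; the concrete procedure here (taking the largest threshold that still keeps the required fraction of true lung cancer samples at or above x) is an assumption, as are the function names `choose_reference_threshold` and `sensitivity_at`.

```python
import math
from typing import Sequence


def choose_reference_threshold(positive_scores: Sequence[float],
                               target_sensitivity: float) -> float:
    """Largest threshold x keeping sensitivity >= target on the training set.

    positive_scores    : binary-model probabilities p for training samples
                         whose true primary site is lung cancer
    target_sensitivity : required fraction of those samples with p >= x
    """
    if not positive_scores:
        raise ValueError("need at least one lung cancer training score")
    ranked = sorted(positive_scores)
    # Number of true lung cancer samples allowed to fall below x.
    allowed_misses = math.floor(len(ranked) * (1.0 - target_sensitivity))
    allowed_misses = min(allowed_misses, len(ranked) - 1)
    return ranked[allowed_misses]


def sensitivity_at(positive_scores: Sequence[float], x: float) -> float:
    """Fraction of true lung cancer samples that are not ruled out by x."""
    kept = sum(1 for p in positive_scores if p >= x)
    return kept / len(positive_scores)


if __name__ == "__main__":
    scores = [0.2, 0.9, 0.6, 0.8]  # hypothetical training-set scores
    x = choose_reference_threshold(scores, target_sensitivity=0.75)
    print(x, sensitivity_at(scores, x))
```

A samples-below-x-are-ruled-out convention is used so that the chosen x plugs directly into the "p smaller than x" correction rule described earlier.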

In traditional methods, the numbers of samples of the various cancer types in the tumor training samples are imbalanced. Because of the small sample sizes, the ability of the machine learning model to recognize rare cancer samples is limited, and without specific guidance the algorithm may sacrifice recognition accuracy for rare cancers in pursuit of global accuracy, so that the model performs poorly on rare cancers. A further issue is the balance between selecting effective molecular features and controlling cost: adding more molecular features can raise the theoretical upper limit of the algorithm's accuracy, but also means higher detection cost, a more complicated detection workflow, and lower clinical practicality.

According to the technical solution of this embodiment, the data preprocessing and the structure of the prediction model are optimized, and the range of measured features is expanded by adding HRD, LST, TAI and LOH, four genome-wide indicators describing genomic instability. This effectively addresses the low accuracy of prediction results caused by the imbalanced numbers of samples of the various cancer types in the training samples, achieves the purpose of improving the accuracy of the tracing model, and establishes the four genome-wide instability indicators HRD, LST, TAI and LOH as key factors for tracing.

According to the model training method, training sample data are obtained, a first classification model to be trained and a second classification model to be trained are constructed based on a preset gradient model structure, a traceability prediction model to be trained is obtained, then the training sample data are respectively input into the first classification model to be trained and the second classification model to be trained, a sample candidate result and a sample reference result are obtained, and model parameters of the traceability prediction model to be trained are adjusted by combining model characteristics of the preset gradient model structure, the sample candidate result and the sample reference result until model training end conditions are met, a pre-trained traceability prediction model is obtained, CUP tumor traceability processing optimization is achieved, traceability prediction accuracy is improved, prediction cost is reduced, and the aim of traceability prediction of CUP tumor samples on the premise of controllable cost is achieved.

In one embodiment, the plurality of sample sequencing data includes different types of sample sequencing data, and the inputting the training sample data into the first classification model to be trained and the second classification model to be trained respectively may include the steps of:

aiming at each type in the training sample data, taking the sample sequencing data of the type and the corresponding feature vector thereof as a data set to be processed; performing oversampling processing according to the data set to be processed to obtain an oversampled data set corresponding to the type; and taking the oversampling data set corresponding to each type as input data, and respectively inputting the oversampling data set into the first classification model to be trained and the second classification model to be trained.

As an example, the number of training samples corresponding to the data set to be processed is smaller than or equal to the number of training samples corresponding to the oversampled data set, for example, for cancer types with a smaller number of training samples, additional training samples can be generated by oversampling in the data preprocessing stage, so that the overall training sample size is improved.

In practical application, the sample number distribution of the cancer types in the training sample data is extremely imbalanced (see table 1 below: among the 10228 samples, non-small cell lung cancer, the most numerous type, accounts for 54.9% of the total, whereas thyroid cancer samples account for only 0.26%). To address this, as shown in fig. 4a, the sample data can be processed using the SMOTE oversampling algorithm before being input into the machine learning model for training, which avoids the reduction in recognition accuracy for specific cancer types caused by the imbalance in sample numbers among different cancer types.

Table 1. Distribution of sample numbers by cancer type in the training set

In an example, fig. 4b shows the imbalance among tumor categories before SMOTE oversampling (left side of fig. 4b) and after SMOTE oversampling (right side of fig. 4b). After the SMOTE oversampling treatment, the imbalance among different cancer types is significantly reduced, and the total training sample size is increased.

In this embodiment, for each type in the training sample data, the sample sequencing data of the type and the feature vector corresponding to the sample sequencing data are used as the data set to be processed, and then the oversampling processing is performed according to the data set to be processed to obtain the oversampled data set corresponding to the type, and then the oversampled data sets corresponding to the types are used as input data to be respectively input into the first classification model to be trained and the second classification model to be trained, so that the problem of unbalanced sample number among different cancer types in the training sample data can be solved.

In an embodiment, the performing an oversampling process according to the to-be-processed data set to obtain an oversampled data set corresponding to the type may include the following steps:

acquiring sampling parameter information and acquiring adjacent sample information; and carrying out oversampling processing on the data set to be processed according to the sampling parameter information and the adjacent sample information to obtain a feature vector corresponding to the newly added sample data, and taking the feature vector corresponding to the data set to be processed and the newly added sample data as an oversampled data set.

Wherein the sampling parameter information may be used to determine the number of newly added samples generated based on the over-sampling process, such as sampling ratio C; the contiguous sample information may be used to characterize the number of contiguous samples, such as the number K of contiguous samples, associated with any new sample when the oversampling process generates any new sample.

In a specific implementation, based on the SMOTE algorithm, additional minority cancer type samples may be generated by interpolating in the feature space of the minority cancer type samples, based on the K most similar samples and a random offset (fig. 4c illustrates the SMOTE algorithm interpolating based on the 6 most similar samples and a random offset in the feature space). A dynamic sampling ratio C (i.e., the sampling parameter information) may be used, so that the degree of oversampling of a minority class is determined by its number of samples; e.g., the dynamic sampling ratio C may be defined as inversely proportional to the number of minority class samples, so that the fewer samples a minority class has, the higher its degree of oversampling. In this way, all effective training information of the majority class samples is retained, and the overfitting caused by simply resampling the minority class samples repeatedly is avoided.
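A dynamic sampling ratio that is inversely proportional to class size can be sketched as follows. The text does not give the proportionality constant, so the parameter `reference_count`, the floor of 1 (no oversampling for classes at or above the reference size), the function name, and the example class counts are all assumptions made for illustration.

```python
from typing import Dict


def dynamic_sampling_ratio(class_counts: Dict[str, int],
                           reference_count: int) -> Dict[str, float]:
    """Per-class sampling ratio C, inversely proportional to class size.

    C = reference_count / n for a class with n samples, floored at 1.0 so
    that classes with at least reference_count samples are not oversampled.
    """
    ratios = {}
    for label, n in class_counts.items():
        if n <= 0:
            raise ValueError(f"class {label!r} has no samples")
        ratios[label] = max(1.0, reference_count / n)
    return ratios


if __name__ == "__main__":
    # Hypothetical class sizes, not the figures of table 1.
    counts = {"nsclc": 5000, "ovarian": 200, "thyroid": 25}
    print(dynamic_sampling_ratio(counts, reference_count=1000))
```

With this choice, the smaller a minority class is, the larger its ratio C, which matches the behaviour described in the text.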

In an example, the SMOTE algorithm is a method for generating new sample features from the features of similar samples of the same class. SMOTE oversampling can be implemented using the SmoteClassif function in the UBL package for the R language; for example, SmoteClassif can process an original data set to generate a new data set that contains the original data set, thereby addressing the problem of class imbalance.

For example, input 1 of the SmoteClassif function may be the original data set (i.e., the data set to be processed), which may contain the feature vectors of all samples in the original data set; for example, it may be an n×m matrix, where n is the number of samples and m is the number of features of each sample.

Input 2 of the SmoteClassif function may be the sampling ratio C (i.e., the sampling parameter information), which may be used to determine the number of new samples to generate on the basis of the original data set. If the sampling ratio C is greater than 1, oversampling may be performed; e.g., when the sampling ratio C is 2, the number of newly generated samples may be 1 times the number of samples already in the data set. If the sampling ratio C is equal to 1, the data set may not be oversampled.
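The relationship between the sampling ratio C and the number of newly generated samples can be illustrated with a small calculation. This sketch assumes that C = 2 adds as many new samples as already exist, i.e., that the number of new samples is (C - 1) times the existing number; the function name `new_sample_count` is hypothetical, and ratios below 1 are not discussed in the text and are simply treated here as "no oversampling".

```python
def new_sample_count(n_existing: int, c: float) -> int:
    """Number of synthetic samples to generate for one class.

    n_existing : number of samples already in the class
    c          : sampling ratio C; C == 1 means no oversampling,
                 C == 2 means one new sample per existing sample
    """
    if c <= 1.0:
        return 0  # assumption: ratios at or below 1 add nothing
    return round((c - 1.0) * n_existing)


if __name__ == "__main__":
    print(new_sample_count(27, 2.0))  # as many new samples as existing ones
    print(new_sample_count(27, 1.0))  # no oversampling
```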

The SmoteClassif function input 3 may be a number of contiguous samples K (i.e. contiguous sample information) which may be used to determine the number of other neighbouring samples from which an oversampled new sample is generated.

The SmoteClassif function output may be a feature vector corresponding to the new sample point (i.e., a feature vector corresponding to the newly added sample data).

In this embodiment, by acquiring the sampling parameter information and acquiring the adjacent sample information, and further performing oversampling processing on the data set to be processed according to the sampling parameter information and the adjacent sample information, a feature vector corresponding to the newly added sample data is obtained, and the feature vector corresponding to the data set to be processed and the newly added sample data is used as an oversampled data set, so that the situation that the number of samples in different cancer types in the training sample data is unbalanced can be effectively avoided.

In one embodiment, the performing oversampling on the to-be-processed data set according to the sampling parameter information and the adjacent sample information to obtain a feature vector corresponding to the newly added sample data may include the following steps:

when the sampling parameter information is detected to meet a preset oversampling condition, any existing sample point in a feature space corresponding to the data set to be processed is taken as a target sample point, and candidate sample points corresponding to the target sample point are determined according to the adjacent sample information; the candidate sample points include a plurality of; in the feature space, generating a new synthesized sample point according to the sample difference between any candidate sample point and the target sample point, and taking a feature vector corresponding to the new synthesized sample point as a feature vector corresponding to the newly added sample data; the new composite sample point is located on a line connecting any of the candidate sample points and the target sample point.

In an example, when the sampling ratio C is determined to be greater than 1, i.e., when it is detected that the sampling parameter information meets the preset oversampling condition, each new sample may be generated by oversampling as shown in fig. 4c: based on SmoteClassif, an existing sample point A (i.e., the target sample point) is randomly selected in the feature space; the K samples closest to A in the feature space (i.e., the candidate sample points) are determined according to Euclidean distance; one neighboring sample B (i.e., any candidate sample point) is randomly selected from the K neighboring samples; and a new synthesized sample point D is generated in the feature space according to the sample difference between the two sample points A and B, the new synthesized sample point D being located on the line connecting A and B. The feature vector corresponding to the new synthesized sample point D may be used as the feature vector of the new sample (i.e., the feature vector corresponding to the newly added sample data).

In yet another example, the SMOTE oversampling method may generate the composite samples by interpolating a minority class of samples based on the feature space of the samples. Therefore, the feature space of the minority class sample can be expanded, the model can be facilitated to better explore and learn the features of the minority class, and the performance of the model can be improved.

For example, SMOTE oversampling may be performed using the following steps:

1. For each minority sample, the sample can be taken as the current sample, and its K nearest neighbors are found by calculating the distance between the current sample and all other minority samples.

2. One neighbor sample can be randomly selected from the K nearest neighbors, and the difference between the neighbor sample and the current sample is calculated.

3. A new synthesized sample may be generated from a proportion of the calculated difference, the new synthesized sample being located on the line between the neighbor sample and the current sample.

4. By repeating the above steps, a specified number of synthesized samples can be generated.
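The steps above can be sketched in plain Python as follows. This is a minimal illustration and not the SmoteClassif implementation from the UBL package: the function name `smote_oversample`, its parameters, and the choice of picking the current sample at random on each iteration are assumptions. The interpolation factor is drawn uniformly from [0, 1), which is the usual SMOTE convention and is assumed here.

```python
import math
import random
from typing import List, Optional, Sequence


def smote_oversample(samples: Sequence[Sequence[float]], n_new: int, k: int,
                     rng: Optional[random.Random] = None) -> List[List[float]]:
    """Generate n_new synthetic feature vectors for one minority class."""
    rng = rng or random.Random()
    if len(samples) < 2 or n_new <= 0:
        return []
    k = min(k, len(samples) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_new):
        # Step 1: pick a current sample and find its K nearest neighbours
        # among the other minority samples (Euclidean distance).
        i = rng.randrange(len(samples))
        current = samples[i]
        others = [s for j, s in enumerate(samples) if j != i]
        others.sort(key=lambda s: math.dist(current, s))
        # Step 2: randomly choose one of the K neighbours.
        neighbour = rng.choice(others[:k])
        # Step 3: move a random fraction of the way from the current
        # sample towards the neighbour; the result lies on the segment
        # joining the two points.
        gap = rng.random()
        synthetic.append([c + gap * (n - c) for c, n in zip(current, neighbour)])
    # Step 4 is the loop itself: repeat until n_new samples exist.
    return synthetic


if __name__ == "__main__":
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy data
    for point in smote_oversample(minority, n_new=3, k=2, rng=random.Random(0)):
        print(point)
```

The oversampled data set would then be the original feature vectors together with the returned synthetic ones.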

In this embodiment, when it is detected that the sampling parameter information meets the preset oversampling condition, any existing sample point in the feature space corresponding to the data set to be processed is taken as the target sample point, and the candidate sample points corresponding to the target sample point are determined according to the adjacent sample information. A new synthesized sample point is then generated in the feature space according to the sample difference between any candidate sample point and the target sample point, and the feature vector corresponding to the new synthesized sample point is taken as the feature vector corresponding to the newly added sample data. In this way, new synthesized samples can be generated from the sample differences between existing samples, so that additional minority-class cancer samples are effectively generated through interpolation.

In one embodiment, the method may further comprise the steps of:

In practical application, to address the balance between molecular feature selection and cost control, detection is performed with a specific kit (panel targeted sequencing comprising 9000 SNP loci and 520 genes). By combining cancer-type-specific key hotspot mutations with sparse whole-genome coverage, the genome-wide features that play a key role in tumor tracing can be obtained at a lower detection cost. The 520-gene panel targeted sequencing technology not only achieves complete coverage of the key gene mutations specific to each cancer type, but also achieves uniform coverage of the whole genome by selecting more than 10,000 human SNP loci at the whole-genome level, and on this basis provides HRD, LST, TAI and LOH, four genome-wide indicators describing genomic instability.

In one example, by counting the key predictors in each cancer type, 45 predictive features that play a key role in tumor tracing were found, as shown in table 2; the four genome-wide instability indicators play a key role in the tumor tracing of multiple cancer types.

Table 2. 45 key predictive features for tumor tracing, including HRD, LST, TAI and LOH

Table 2-1. Basic information:

Table 2-2. Gene variation information:

Table 2-3. General indices:

Table 2-4. Custom indices:


In yet another example, from a single-gene perspective, the trained traceability prediction model also identified 11 characteristic gene mutations with tracing prediction performance in different cancer types, as shown in table 3. After training is completed, the model may output, for each cancer type, the most important features among all the tracing features, as shown in fig. 4a.

Table 3. Characteristic gene variations with tracing prediction performance in different cancer types

In this embodiment, the tracing feature set is obtained based on the tracing prediction model after training is completed, so that the feature with the highest importance in all tracing features can be obtained for each cancer after training is completed, and the tracing prediction accuracy is improved.
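As an illustration of selecting the most important features per cancer type, the sketch below ranks features by an importance score. It assumes the trained model can export such scores as a nested mapping; the function name `top_features` and the scores used in the usage note are hypothetical and are not taken from Table 2 or Table 3.

```python
def top_features(importances, n=3):
    """Return the n highest-scoring feature names for each cancer type.

    importances: {cancer_type: {feature_name: importance_score}} (assumed export format).
    """
    return {
        cancer: [
            name
            for name, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
        ]
        for cancer, scores in importances.items()
    }
```

For example, with made-up scores `{"type_A": {"HRD": 0.5, "LST": 0.2, "TAI": 0.3}}` and `n=2`, the function returns `{"type_A": ["HRD", "TAI"]}`.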

In one embodiment, the performance of the pre-trained traceability prediction model is verified using several independent application cases. The results show that the optimized model improves traceability prediction accuracy in the independent validation set, and that all application cases achieve good traceability accuracy using DNA targeted sequencing information alone, which to some extent addresses the clinical problem of tracing CUP tumor samples at a controllable cost.

In terms of accuracy, if only the first prediction result is considered, 4 cancer species have a prediction specificity exceeding 0.8, while for soft tissue sarcoma and head and neck tumor the recognition specificity is lower than 0.5, which may be related to the clinically fuzzy classification of these two tumor types. After the secondary prediction results are combined, the recognition sensitivity and specificity of all cancer species improve, and the number of cancer species whose prediction specificity exceeds 0.8 increases to 7 (see Table 4). As comparative data published in the industry, a model trained on 7791 patients by a traditional method (Penson A, Camacho N, et al. Development of Genome-Derived Tumor Type Prediction to Inform Clinical Cancer Care. JAMA Oncology. 2020 Jan 1;6(1):84-91. doi: 10.1001/jamaoncol.2019.3985. PMID: 31725847; PMCID: PMC6865333) reaches an overall prediction accuracy of 74.1% in a completely independent validation set. The technical scheme of this embodiment reaches 77.5%, and 84.7% after the secondary prediction results are combined, which shows a clear improvement in overall prediction accuracy over the advanced level in the industry.
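The metrics discussed above can be computed as in the following sketch. It uses the standard one-versus-rest definitions of sensitivity and specificity, and it assumes that "combining the secondary prediction results" means counting a sample as correct when the true site is among the first two ranked predictions; both are assumptions, since this section does not define the metrics. All function names are illustrative.

```python
def per_class_metrics(y_true, y_pred):
    """One-versus-rest sensitivity and specificity for every label."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        tn = len(y_true) - tp - fn - fp
        metrics[label] = {
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        }
    return metrics


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose first prediction is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_with_secondary(y_true, ranked_predictions, top=2):
    """Fraction of samples whose true site is among the first `top` ranked sites."""
    hits = sum(t in ranked[:top] for t, ranked in zip(y_true, ranked_predictions))
    return hits / len(y_true)
```

By construction, `accuracy_with_secondary` can never be lower than `overall_accuracy` computed on the first-ranked predictions of the same samples.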

TABLE 4 prediction performance index of individual cancer species for model in completely independent validation set

In yet another example, the model is used to identify metastatic and primary relationships in 38 CUP patients, each of whom has cancerous tissue samples taken at the same time from at least 2 different sites. The following rules are set: if the model gives the same tracing prediction result for two or more different sampling sites of the same CUP patient, and the predicted site is one of the patient's cancer sites, the primary site of the patient is considered to be the predicted site; if the model predicts every sample of the CUP patient to be its actual sampling site, the patient is considered to have two or more primary tumors. Under these rules, the model identified metastasis and determined the primary site in the samples of 17 patients in the data set, determined multiple primary sites in another 20 patients, and could not determine the primary site in 1 patient.
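The two rules above can be expressed as a small decision function. This is a sketch under assumptions: each sample is represented as a `(sampled_site, predicted_site)` pair, the label strings are illustrative, and a patient with more than one shared prediction is treated as undetermined, a tie-breaking case this section does not address.

```python
def classify_patient(samples):
    """Apply the primary-site rules to one patient.

    samples: list of (sampled_site, predicted_site) pairs, one per tissue sample.
    Returns a (status, sites) tuple.
    """
    sampled_sites = {sampled for sampled, _ in samples}
    # Group the distinct sampling sites by the site the model predicted for them.
    sites_by_prediction = {}
    for sampled, predicted in samples:
        sites_by_prediction.setdefault(predicted, set()).add(sampled)
    # Rule 1: the same prediction for >= 2 sampling sites, and that prediction
    # is itself one of the patient's cancer sites.
    shared = [
        predicted
        for predicted, sites in sites_by_prediction.items()
        if len(sites) >= 2 and predicted in sampled_sites
    ]
    if len(shared) == 1:
        return ("single_primary", shared[0])
    # Rule 2: every sample is predicted to originate from its own sampling site.
    if all(sampled == predicted for sampled, predicted in samples):
        return ("multiple_primary", sorted(sampled_sites))
    return ("undetermined", None)
```

For instance, a patient sampled at the lung and the liver whose samples are both predicted as lung is classified as having a single lung primary, with the liver lesion interpreted as a metastasis.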

Of the 17 patients whose primary sites were determined, the primary site was determined from the first prediction result in 5 patients and from the second prediction result in 12 patients. Combining the first and second prediction results, the model determined the association between multiple metastases in all patients (38/38, 100%) and gave a clear primary relationship between the sites in 37 patients (37/38, 97.4%); in only 1 patient (1/38, 2.6%) could the primary site not be determined although metastasis could be determined (Table 5). Taken together, these results indicate that the technical scheme of the present application has good clinical application value for CUP tumor traceability prediction.

TABLE 5 comparison of the sampled and predicted sites for 38 CUP patients

In one embodiment, as shown in FIG. 5, a flow diagram of another data processing method is provided. In this embodiment, the method includes the steps of:

In step 501, training sample data is obtained, and a first classification model to be trained and a second classification model to be trained are constructed based on a preset gradient model structure, so as to obtain a traceability prediction model to be trained.

In step 502, the training sample data is input to the first classification model to be trained and the second classification model to be trained respectively to obtain a sample candidate result and a sample reference result, and model parameters of the traceability prediction model to be trained are adjusted by combining the model characteristics of the preset gradient model structure, the sample candidate result and the sample reference result until a model training end condition is met, so as to obtain a pre-trained traceability prediction model.

In step 503, the sequencing data to be processed and the corresponding feature vectors are obtained and input into the first classification model in the pre-trained traceability prediction model, so as to obtain first prediction probabilities corresponding to a plurality of candidate primary parts, and the first prediction probabilities are used as candidate prediction results.

In step 504, the sequencing data to be processed and the corresponding feature vectors are input to the second classification model in the pre-trained traceability prediction model, so as to obtain a second prediction probability corresponding to the reference primary part, and the second prediction probability is used as a reference prediction result.

In step 505, the first prediction probability corresponding to the reference primary part among the plurality of candidate primary parts is corrected according to the second prediction probability and a preset reference threshold value, so as to obtain a corrected candidate prediction result.

In step 506, the first prediction probabilities in the corrected candidate prediction result are ranked, so as to obtain a prediction probability ranking result.

In step 507, when the maximum prediction probability is greater than or equal to a preset probability threshold, the candidate primary part corresponding to the maximum prediction probability is used as the tracing prediction result; or, when the maximum prediction probability is smaller than the preset probability threshold, the candidate primary parts corresponding to the first two prediction probabilities in the prediction probability ranking result are used as the tracing prediction result.

It should be noted that, for the specific limitations of the above steps, reference may be made to the specific limitations of the data processing method and the model training method described above, which are not repeated here.
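Steps 505 to 507 can be sketched as follows. This section does not state the exact correction formula, so the rule used here (raising the reference site's first probability to the second probability when the latter reaches the reference threshold) is an assumption for illustration only; the function name and both default threshold values are likewise hypothetical.

```python
def trace_prediction(first_probs, reference_site, second_prob,
                     reference_threshold=0.5, probability_threshold=0.6):
    """Return one or two candidate primary sites for a sample.

    first_probs:    {candidate_site: first prediction probability} from the first model.
    reference_site: site scored by the second model (one of the candidate sites).
    second_prob:    second prediction probability for reference_site.
    """
    corrected = dict(first_probs)
    # Step 505 (assumed rule): trust the second model when it is confident enough.
    if second_prob >= reference_threshold:
        corrected[reference_site] = max(corrected.get(reference_site, 0.0), second_prob)
    # Step 506: rank the corrected probabilities in descending order.
    ranking = sorted(corrected.items(), key=lambda kv: kv[1], reverse=True)
    # Step 507: report the top site if it is confident, otherwise the top two.
    top_site, top_prob = ranking[0]
    if top_prob >= probability_threshold:
        return [top_site]
    return [site for site, _ in ranking[:2]]
```

With `first_probs = {"lung": 0.40, "liver": 0.35, "breast": 0.25}`, a confident second model for the liver (`second_prob=0.9`) yields `["liver"]`, whereas an unconfident one (`second_prob=0.2`) leaves the ranking unchanged and yields the two leading candidates `["lung", "liver"]`.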

It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of the steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides a data processing device for realizing the above related data processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the data processing device for tumor tracing provided below may refer to the limitation of the data processing method hereinabove, and will not be repeated herein.

In one embodiment, as shown in fig. 6, there is provided a data processing apparatus for tumor tracing, including:

the sequencing data acquisition module 601 is configured to acquire sequencing data to be processed and a feature vector corresponding to the sequencing data; the sequencing data to be processed is obtained by carrying out gene sequencing on a target sample by adopting a preset sequencing mode, and the target sample is obtained by sampling a target object;

the traceability prediction module 602 is configured to input the sequencing data to be processed and the feature vector corresponding thereto into a first classification model in a pre-trained traceability prediction model to obtain a candidate prediction result, and input the sequencing data to be processed and the feature vector corresponding thereto into a second classification model in the pre-trained traceability prediction model to obtain a reference prediction result;

the tracing result obtaining module 603 is configured to correct the candidate prediction result according to the reference prediction result, so as to obtain a tracing prediction result; the tracing prediction result is used for representing a primary part corresponding to the target sample; the primary site is a site in the target object that has an origin relationship with the target sample.

In one embodiment, the traceability prediction module 602 includes:

The candidate prediction result obtaining submodule is used for inputting the sequencing data to be processed and the feature vectors corresponding to the sequencing data to the first classification model to obtain first prediction probabilities corresponding to a plurality of candidate primary parts as the candidate prediction results;

the reference prediction result obtaining submodule is used for inputting the sequencing data to be processed and the corresponding feature vectors of the sequencing data to the second classification model to obtain a second prediction probability corresponding to a reference primary part, and the second prediction probability is used as the reference prediction result; the reference primary site is any primary site or a designated primary site, and the plurality of candidate primary sites includes the reference primary site.

in one embodiment, the tracing result obtaining module 603 includes:

and the correction sub-module is used for correcting the first prediction probability corresponding to the reference primary part in the plurality of candidate primary parts according to the second prediction probability and a preset reference threshold value to obtain a corrected candidate prediction result, and obtaining the tracing prediction result based on the corrected candidate prediction result.

In one embodiment, the correction submodule includes:

the sorting unit is used for sorting the first prediction probabilities in the corrected candidate prediction results to obtain a prediction probability sorting result;

And the tracing prediction result determining unit is used for determining the tracing prediction result according to the candidate primary part corresponding to the maximum prediction probability in the prediction probability sequencing result.

In one embodiment, the traceability prediction result determining unit includes:

the prediction result obtaining subunit is configured to use, as the tracing prediction result, a candidate primary part corresponding to the maximum prediction probability when the maximum prediction probability is greater than or equal to a preset probability threshold; or when the maximum prediction probability is smaller than a preset probability threshold value, the candidate primary parts corresponding to the first two prediction probabilities in the prediction probability sequencing result are used as the tracing prediction result; the prediction probability sequencing results are arranged in descending order according to the probability value.

The above-mentioned various modules in the data processing apparatus for tumor tracing may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

Based on the same inventive concept, the embodiment of the application also provides a model training device for realizing the model training method. The implementation scheme of the solution provided by the device is similar to the implementation scheme described in the above method, so the specific limitation in one or more embodiments of the model training device for tumor tracing provided below can be referred to the limitation of the model training method hereinabove, and will not be repeated here.

In one embodiment, as shown in fig. 7, there is provided a model training apparatus for tumor tracing, including:

A training data acquisition module 701, configured to acquire training sample data; the training sample data comprises a plurality of sample sequencing data and corresponding feature vectors thereof; the sample sequencing data are obtained by carrying out gene sequencing on a training sample by adopting the preset sequencing mode, and the training sample is obtained by sampling a sample object;

the model construction module 702 is configured to construct a first classification model to be trained and a second classification model to be trained based on a preset gradient model structure, so as to obtain a traceability prediction model to be trained;

a sample result obtaining module 703, configured to input the training sample data to the first classification model to be trained and the second classification model to be trained, respectively, to obtain a sample candidate result and a sample reference result;

the model training module 704 is configured to combine the model characteristics of the preset gradient model structure, the sample candidate result, and the sample reference result to adjust model parameters of the traceability prediction model to be trained until a model training end condition is satisfied, thereby obtaining a pre-trained traceability prediction model.

In one embodiment, the plurality of sample sequencing data comprises different types of sample sequencing data, and the sample result obtaining module 703 comprises:

the data set to be processed determining submodule is used for regarding sample sequencing data of each type and corresponding feature vectors of the type as a data set to be processed;

the oversampling data set determining submodule is used for carrying out oversampling processing according to the data set to be processed to obtain an oversampling data set corresponding to the type; the number of training samples corresponding to the data set to be processed is smaller than or equal to the number of training samples corresponding to the oversampled data set;

the input data obtaining submodule is used for taking the oversampling data set corresponding to each type as input data and respectively inputting the oversampling data set to the first classification model to be trained and the second classification model to be trained.
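The per-type handling performed by these submodules can be sketched as below. The sketch assumes that the goal is to bring every type up to the size of the largest type, which this section does not state explicitly; `balance_by_type` and `interpolate_pairs` are hypothetical names, and `interpolate_pairs` is a deliberately simple stand-in for the oversampling process (it interpolates between random pairs and does not use adjacent sample information).

```python
import random
from collections import defaultdict


def interpolate_pairs(group, n_new, seed=0):
    """Stand-in oversampler: interpolate between random pairs of one type."""
    if n_new <= 0 or len(group) < 2:
        return []
    rng = random.Random(seed)
    new_vectors = []
    for _ in range(n_new):
        a, b = rng.sample(group, 2)
        t = rng.random()
        new_vectors.append([x + t * (y - x) for x, y in zip(a, b)])
    return new_vectors


def balance_by_type(vectors, labels, oversample=interpolate_pairs):
    """Build one oversampled data set per type (label)."""
    # One data set to be processed per type.
    groups = defaultdict(list)
    for vector, label in zip(vectors, labels):
        groups[label].append(vector)
    # Assumed target: the size of the largest type.
    target = max(len(group) for group in groups.values())
    balanced = {}
    for label, group in groups.items():
        n_new = target - len(group)  # number of new samples to generate
        # Original vectors plus newly added ones, so the size never shrinks.
        balanced[label] = group + oversample(group, n_new)
    return balanced
```

Each returned data set contains the original vectors of its type plus any newly added ones, so its size is never smaller than that of the data set to be processed.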

In one embodiment, the oversampled data set determination submodule includes:

an information acquisition unit for acquiring sampling parameter information and acquiring adjacent sample information; the sampling parameter information is used for determining the number of new samples generated based on the over-sampling process, and the adjacent sample information is used for representing the number of adjacent samples associated with any new sample when the over-sampling process generates the any new sample;

And the oversampling processing unit is used for carrying out oversampling processing on the data set to be processed according to the sampling parameter information and the adjacent sample information to obtain a feature vector corresponding to the newly added sample data, and taking the feature vector corresponding to the data set to be processed and the newly added sample data as an oversampled data set.

In one embodiment, the oversampling processing unit includes:

a candidate sample point obtaining subunit, configured to determine, when the sampling parameter information is detected to meet a preset oversampling condition, a candidate sample point corresponding to a target sample point according to the adjacent sample information by using any existing sample point in a feature space corresponding to the to-be-processed data set as the target sample point; the candidate sample points include a plurality of;

a new sample data determining subunit, configured to generate a new synthesized sample point according to a sample difference between any one of the candidate sample points and the target sample point in the feature space, and use a feature vector corresponding to the new synthesized sample point as a feature vector corresponding to the new sample data; the new composite sample point is located on a line connecting any of the candidate sample points and the target sample point.

In one embodiment, the apparatus further comprises:

the traceability feature set obtaining module is used for obtaining the traceability feature set based on the traceability prediction model after training is finished; the traceability characteristic set comprises traceability key prediction characteristics corresponding to each prediction type in a sample prediction result and characteristic gene information with traceability prediction performance, and the sample prediction result is obtained by outputting a traceability prediction model after training is finished.

The above-mentioned various modules in the model training apparatus for tumor tracing may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 8. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a data processing method.

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 9. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a model training method.

It will be appreciated by those skilled in the art that the structures shown in fig. 8 and 9 are block diagrams of only some of the structures associated with the present application and are not intended to limit the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:

In an embodiment, the processor, when executing the computer program, also implements the steps of the data processing method in the other embodiments described above.

constructing a first classification model to be trained and a second classification model to be trained based on a preset gradient model structure to obtain a traceability prediction model to be trained; respectively inputting the training sample data into the first classification model to be trained and the second classification model to be trained to obtain a sample candidate result and a sample reference result;

In one embodiment, the processor, when executing the computer program, also implements the steps of the model training method in the other embodiments described above.

In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:

In an embodiment, the computer program, when executed by a processor, also implements the steps of the data processing method in the other embodiments described above.

constructing a first classification model to be trained and a second classification model to be trained based on a preset gradient model structure to obtain a traceability prediction model to be trained;

In one embodiment, the computer program when executed by the processor also implements the steps of the model training method in the other embodiments described above.

In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.

The above examples represent only a few embodiments of the present application. They are described specifically and in detail, but are not therefore to be construed as limiting the scope of the present application. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which would fall within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. A method of data processing, the method comprising:

acquiring sequencing data to be processed and a corresponding feature vector thereof; the sequencing data to be processed is obtained by carrying out gene sequencing on a target sample by adopting a preset sequencing mode, and the target sample is obtained by sampling a target object; the preset sequencing mode is to detect by adopting a kit containing SNP loci and gene panel targeted sequencing;

correcting the candidate prediction result according to the reference prediction result to obtain a corrected candidate prediction result, and obtaining a tracing prediction result; the tracing prediction result is used for representing a primary part corresponding to the target sample; the primary site is a site in the target object having an origin relationship with the target sample; the corrected candidate prediction result is obtained by correcting the first prediction probability corresponding to the candidate primary part in the candidate prediction result according to the second prediction probability corresponding to the reference primary part in the reference prediction result and a preset reference threshold;

the feature vector corresponding to the sequencing data to be processed at least comprises one or more of the following traceable key prediction features:

2. The method of claim 1, wherein the inputting the sequencing data to be processed and the corresponding feature vectors thereof into a first classification model in a pre-trained traceable prediction model to obtain candidate prediction results, and inputting the sequencing data to be processed and the corresponding feature vectors thereof into a second classification model in the pre-trained traceable prediction model to obtain reference prediction results, comprises:

3. The method according to claim 2, wherein the correcting the candidate prediction result according to the reference prediction result to obtain a traceable prediction result includes:

4. A method according to claim 3, wherein said deriving said traceable prediction result based on said modified candidate prediction result comprises:

5. The method of claim 4, wherein the determining the trace-source prediction result according to the candidate primary location corresponding to the maximum prediction probability in the prediction probability ranking result comprises:

6. The method of any one of claims 1 to 5, wherein the primary site corresponding to the target sample is at least involved in the following cancers:

7. A method of model training, the method comprising:

acquiring training sample data; the training sample data comprises a plurality of sample sequencing data and corresponding feature vectors thereof; the sample sequencing data are obtained by carrying out gene sequencing on a training sample by adopting a preset sequencing mode, and the training sample is obtained by sampling a sample object; the preset sequencing mode is to detect by adopting a kit containing SNP loci and gene panel targeted sequencing;

Combining the model characteristics of the preset gradient model structure, the sample candidate results and the sample reference results, and adjusting model parameters of the traceability prediction model to be trained until the model training ending conditions are met, so as to obtain a pre-trained traceability prediction model; the sample reference result is used for correcting the sample candidate result by combining a preset reference threshold value in the model training process;

8. The method of claim 7, wherein the plurality of sample sequencing data comprises different types of sample sequencing data, the inputting the training sample data into the first classification model to be trained and the second classification model to be trained, respectively, comprises:

9. The method according to claim 8, wherein the performing the oversampling process according to the data set to be processed to obtain the oversampled data set corresponding to the type includes:

10. The method according to claim 9, wherein the performing oversampling on the data set to be processed according to the sampling parameter information and the adjacent sample information to obtain feature vectors corresponding to newly added sample data includes:

11. The method according to any one of claims 7 to 10, further comprising:

12. A data processing apparatus for tumor tracing, the apparatus comprising:

the sequencing data acquisition module is used for acquiring sequencing data to be processed and corresponding feature vectors thereof; the sequencing data to be processed is obtained by carrying out gene sequencing on a target sample by adopting a preset sequencing mode, and the target sample is obtained by sampling a target object; the preset sequencing mode is to detect by adopting a kit containing SNP loci and gene panel targeted sequencing;

the tracing result obtaining module is used for obtaining a tracing prediction result according to the corrected candidate prediction result obtained by correcting the candidate prediction result according to the reference prediction result; the tracing prediction result is used for representing a primary part corresponding to the target sample; the primary site is a site in the target object having an origin relationship with the target sample; the corrected candidate prediction result is obtained by correcting the first prediction probability corresponding to the candidate primary part in the candidate prediction result according to the second prediction probability corresponding to the reference primary part in the reference prediction result and a preset reference threshold;

13. A model training device for tumor tracing, the device comprising:

the training data acquisition module is used for acquiring training sample data; the training sample data comprises a plurality of sample sequencing data and corresponding feature vectors thereof; the sample sequencing data are obtained by carrying out gene sequencing on a training sample by adopting a preset sequencing mode, and the training sample is obtained by sampling a sample object; the preset sequencing mode is to detect by adopting a kit containing SNP loci and gene panel targeted sequencing;

the model construction module is used for constructing a first classification model to be trained and a second classification model to be trained based on a preset gradient model structure to obtain a traceability prediction model to be trained;

The model training module is used for adjusting model parameters of the traceability prediction model to be trained by combining the model characteristics of the preset gradient model structure, the sample candidate results and the sample reference results until the model training ending condition is met, so as to obtain a pre-trained traceability prediction model; the sample reference result is used for correcting the sample candidate result by combining a preset reference threshold value in the model training process;

14. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the data processing method according to any one of claims 1 to 6 and/or the steps of the model training method according to any one of claims 7 to 11.