CN111340135A - Renal mass classification method based on random projection - Google Patents


Info

Publication number
CN111340135A
CN111340135A
Authority
CN
China
Prior art keywords
classifier
data
matrix
projection
prediction
Prior art date
Legal status
Granted
Application number
CN202010171801.XA
Other languages
Chinese (zh)
Other versions
CN111340135B (en)
Inventor
甄鑫
莫天澜
王琳婧
何强
Original Assignee
Guangzhou Lingtuo Medical Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Lingtuo Medical Technology Co ltd filed Critical Guangzhou Lingtuo Medical Technology Co ltd
Priority to CN202010171801.XA priority Critical patent/CN111340135B/en
Publication of CN111340135A publication Critical patent/CN111340135A/en
Application granted granted Critical
Publication of CN111340135B publication Critical patent/CN111340135B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24155 Bayesian classification
    • G06N20/00 Machine learning
    • G06T7/0012 Biomedical image inspection
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06T2207/10081 Computed x-ray tomography [CT]
    • G06T2207/20081 Training; Learning
    • G06T2207/20104 Interactive definition of region of interest [ROI]
    • G06T2207/30084 Kidney; Renal


Abstract

The application relates to a renal mass classification method based on random projection, which comprises the following steps: acquiring N target object data describing renal small masses; performing target region delineation on each plain-scan CT image according to the corresponding mask image to obtain a region of interest for each plain-scan CT image, and extracting radiologic characteristic data from each region of interest to obtain N pieces of radiologic characteristic data; projecting the N pieces of radiologic characteristic data through L random projection matrices to obtain L sets of projection characteristic data; training multiple classifiers on the L sets of projection characteristic data respectively to obtain each trained classifier and the prediction matrix of each classifier, and determining the weight of each classifier; and fusing the outputs of the trained classifiers on the data to be classified according to the corresponding weights to determine the corresponding class. The method improves the robustness of class identification for the data to be classified and thereby the reliability of the classification result.

Description

Renal mass classification method based on random projection
Technical Field
The application relates to the technical field of machine learning, in particular to a method for classifying renal small masses based on random projection.
Background
In recent years, multi-classifier systems have been widely used in machine learning to obtain more reliable and accurate predictions than a single classifier in supervised and unsupervised learning tasks. They have been successfully applied in many fields, including bioinformatics, remote sensing, network security, astrophysics, clinical applications and chemoinformatics. Most current research on multi-classifier systems falls into two categories: non-generative and generative. Non-generative multi-classifier systems focus on selecting classifiers or on the fusion of multi-classifier outputs to optimize the system structure and thereby improve predictive ability, while generative multi-classifier systems focus on generating the base classifiers so as to increase the diversity and difference within the system and thereby improve prediction accuracy. Much previous research has concentrated on building new ensemble architectures or on finding ways to improve classifier diversity. How to construct diverse base classifiers and integrate them into a logical fusion architecture is the key to building a successful multi-classifier system. Exploring a reasonable balance between ensemble architecture and ensemble diversity has been a hotspot of many studies in recent years and remains an open problem.
An effective multi-classifier system is urgently needed for medical decision-making in clinical tasks such as diagnosis, prognosis and prediction of treatment response, using information acquired from radiological images (computed tomography, positron emission tomography, magnetic resonance imaging and the like) and from clinical treatment. Since clinical information is diverse, such as image information of different modalities, treatment parameters, dose parameters and other clinical features, a fusion system is needed to integrate the various sources of information and support clinical judgement or the determination of a treatment plan. Meanwhile, medical decision problems are usually problem-specific: different classifiers may perform differently for different diseases and clinical endpoints, and even for the same clinical task different classifiers rarely achieve consistent results. For example, Yang R et al. compared 224 classification models and found significant differences in classification accuracy between models built with different combinations of classifiers and feature selection methods. Furthermore, the wide variety of clinical information combined with different classifiers yields even more candidate models, and a truly optimal solution is unlikely to be approached by traversing all available models by trial and error. An efficient multi-classifier framework is therefore always desirable in a clinical environment to fully exploit diverse medical data.
Renal angiomyolipoma (renal hamartoma) is the most common benign renal tumor, accounting for about 3% of renal tumors, and can usually be diagnosed reliably and accurately by imaging through detection of typical macroscopic intratumoral fat. However, the amount of fat is variable, and some renal angiomyolipomas contain little or no fat; these so-called atypical, fat-poor angiomyolipomas, or angiomyolipomas without visible fat (AMLwvf), behave similarly to renal cell carcinoma (RCC) on CT, are prone to misdiagnosis and can lead to unnecessary surgery. Recent advances in radiomics and its successful application in previous studies have helped improve the accuracy of tumor prediction and classification. Based on machine learning, many researchers have attempted to distinguish AMLwvf from RCC using CT texture analysis. However, these studies are limited in that texture features are typically extracted from a single CT phase, or randomly selected classifiers are built for classification modeling, and no comprehensive survey has shown which phase and classifier, or which combination of them, has higher discriminative power. Yang R et al. compared 224 classification models and found that image features extracted from non-enhanced CT images had higher discriminatory power than those from the other three phases (corticomedullary, nephrographic and excretory phases). However, a truly optimal solution is unlikely to be approached by traversing all available models by trial and error, so conventional procedures for classifying small renal masses often lack robustness, which easily results in low reliability of the corresponding classification results.
Disclosure of Invention
In view of the above, there is a need to provide a method for classifying renal small masses based on random projection, which can improve the robustness of the renal small mass classification process.
A method for classifying renal masses based on random projection, the method comprising:
S10, acquiring N target object data describing renal small masses; the target object data include a plain-scan CT image, a mask image and label data of the corresponding renal small mass; the label data characterize the corresponding renal small mass as benign or malignant;
S20, performing target region delineation on each plain-scan CT image according to the corresponding mask image to obtain a region of interest for each plain-scan CT image, and extracting radiologic characteristic data from each region of interest to obtain N pieces of radiologic characteristic data;
S30, projecting the N pieces of radiologic characteristic data through L random projection matrices to obtain L sets of projection characteristic data;
S40, training multiple classifiers on the L sets of projection characteristic data respectively to obtain each trained classifier and the prediction matrix of each classifier, and setting the weight of each classifier according to its prediction matrix;
and S50, fusing the outputs of the trained classifiers on the data to be classified according to the corresponding weights to determine the class of the data to be classified.
In the above method, the radiologic characteristic data extracted from the N target object data describing renal small masses are randomly projected L times to obtain L sets of projection characteristic data; the projection characteristic data are input into different classifiers for training to obtain each trained classifier and its prediction matrix, from which the weight of each classifier is determined; the trained classifiers are then used to fuse the data to be classified according to the corresponding weights. The classifiers thus form a hierarchical structure that integrates both the diversity and the structural advantages of the classifiers, improving the robustness of class identification for the data to be classified and thereby the reliability of the identification result.
Drawings
FIG. 1 is a flow chart of a method for classifying renal masses based on stochastic projection according to an embodiment;
FIG. 2 is a schematic representation of target region delineation on a plain-scan CT image in one embodiment;
FIG. 3 is a schematic structural diagram of a small kidney mass classifying device based on random projection according to an embodiment;
FIG. 4 is a schematic diagram of a computer device of an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The random projection-based renal mass classification method of the present application can be applied to an intelligent terminal device, such as an identification terminal or a classification terminal, for identifying whether images representing renal small masses are benign or malignant. The intelligent terminal device acquires N target object data describing renal small masses; delineates the target region on each plain-scan CT image according to the mask image in the target object data to obtain a region of interest for each image; extracts radiologic characteristic data from each region of interest to obtain N pieces of radiologic characteristic data; projects the N pieces of radiologic characteristic data through L random projection matrices to obtain L sets of projection characteristic data; trains multiple classifiers on the L sets of projection characteristic data to obtain each trained classifier and its prediction matrix; sets the weight of each classifier according to its prediction matrix; and fuses the outputs of the trained classifiers on the data to be classified according to the corresponding weights to determine the class of the renal small mass to be examined. Based on an ensemble framework built on random projection, the intelligent terminal device maps the original data set of target object data to low-dimensional spaces through a large number of random projection matrices to generate diversified training data sets, inputs the projected training data sets into multiple classifiers, and uses a two-level hierarchical fusion scheme that logically integrates all outputs into a final classification, so as to classify renal small masses as benign or malignant more reliably.
In one embodiment, as shown in fig. 1, a method for classifying renal masses based on random projection is provided, which is exemplified by an intelligent terminal device such as an identification terminal or a classification terminal for identifying benign or malignant images representing renal masses, and includes the following steps:
s10, acquiring N target object data describing the kidney small tumor, wherein the target object data comprises a CT flat scan image, a mask image and label data of the corresponding kidney small tumor; the label data characterizes the corresponding renal mass as benign or malignant; wherein N is a positive integer.
Specifically, the target object data may be derived from a renal small mass of a patient with a pathologically confirmed renal small mass. Each target object data may describe a small kidney mass, and a target object data specifically describes the corresponding small kidney mass through a CT flat scan image, a mask image and tag data, where for a small kidney mass, the CT flat scan image is an image obtained by CT scanning the small kidney mass, the mask image is an image mask corresponding to the small kidney mass, and the tag data is data indicating that the small kidney mass is benign or malignant, for example, when the small kidney mass is pathologically confirmed to be fatty renal vascular smooth muscle lipoma, the tag data of the small kidney mass is benign, and when the small kidney mass is pathologically confirmed to be renal cell carcinoma, the corresponding tag data is malignant.
Specifically, the CT panned image is a non-contrast enhanced CT scanned image, so that the corresponding renal small tumor can be more accurately represented by the CT panned image.
And S20, performing target region delineation on each plain-scan CT image according to the corresponding mask image to obtain a region of interest for each plain-scan CT image, and extracting radiologic characteristic data from each region of interest to obtain N pieces of radiologic characteristic data.
Specifically, the region of interest is the region of the plain-scan CT image that contains valid information; it can be obtained by experienced users, such as radiological diagnosis experts, delineating the target region on the corresponding plain-scan CT image according to the mask image. In one example, a schematic illustration of delineating the target region on a plain-scan CT image to obtain the region of interest is shown in fig. 2.
Further, each piece of radiologic characteristic data comprises a plurality of radiologic features; the radiologic features may include shape features, first-order statistical features and texture features. The first-order statistical features may include statistics obtained by histogram analysis, and the texture features may include second-order statistics such as the image gray-level distribution.
In one example, one piece of radiologic characteristic data may include 103 radiologic features, extracted from a region of interest (ROI) obtained by two experienced radiological diagnosticians delineating the corresponding non-enhanced CT image; specifically, these may include 12 shape features, 17 first-order statistical features and 74 texture features, so that the subsequent training of the multiple classifiers has a sufficiently complete set of radiologic features as its basis, ensuring the accuracy of the training result.
Preferably, after the N pieces of radiologic characteristic data are obtained, data standardization may be applied to them, so that features whose values differ greatly across the N pieces of radiologic characteristic data fall within a set range (e.g. [0, 1]), thereby eliminating the adverse effect of singular feature values that are excessively large or small. In one example, feature selection may then be applied to the standardized data with the f_score feature selection method to avoid over-fitting, and the SMOTE algorithm may be used to balance the classes and overcome the negative influence of class imbalance, so that the values, feature distribution and class distribution of the N pieces of radiologic characteristic data submitted to matrix projection are kept in a balanced state, ensuring the validity of the L sets of projection characteristic data obtained subsequently.
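As an illustrative sketch of this preprocessing pipeline (not the patent's exact implementation), the normalization, f_score-based feature selection and SMOTE class balancing can be chained with scikit-learn and the imbalanced-learn package; the feature matrix X, label vector y and the number of retained features k are assumed placeholders:

```python
# Hedged sketch of the preprocessing described above: scale features to [0, 1],
# select features by F-score, then balance classes with SMOTE.
# X (N x p radiologic features) and y (benign/malignant labels) are assumed inputs.
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from imblearn.over_sampling import SMOTE  # assumes the imbalanced-learn package

def preprocess(X, y, k=20, random_state=0):
    # 1) bring every feature into the [0, 1] range to suppress singular values
    X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
    # 2) F-score (ANOVA F-test) feature selection to reduce over-fitting
    selector = SelectKBest(score_func=f_classif, k=min(k, X.shape[1]))
    X_selected = selector.fit_transform(X_scaled, y)
    # 3) SMOTE over-sampling of the minority class to balance the data set
    X_balanced, y_balanced = SMOTE(random_state=random_state).fit_resample(X_selected, y)
    return X_balanced, y_balanced
```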
And S30, projecting the N sets of radiologic characteristic data through L random projection matrixes to obtain L sets of projection characteristic data.
The number L of the random projections may be set according to the classification accuracy of the renal mass, for example, L may be set to be 10.
Specifically, the foregoing step may create L random projection (RP) matrices according to the Johnson-Lindenstrauss (J-L) theorem and use them to randomly project the N pieces of radiologic characteristic data, thereby obtaining the L sets of projection characteristic data.
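A minimal sketch of generating the L projected data sets follows; scikit-learn's SparseRandomProjection implements an Achlioptas-style sparse projection consistent with the J-L lemma and can stand in for the hand-built projection matrices described later, but it is not the patent's own construction. D (the N x p feature matrix), L and the target dimension q are assumed inputs:

```python
# Hedged sketch: produce L randomly projected copies of the feature matrix D.
# SparseRandomProjection draws Achlioptas-type sparse matrices; this is one
# possible realisation of the random projection step, not the patent's exact code.
from sklearn.random_projection import SparseRandomProjection

def project_l_times(D, L=10, q=32):
    projected_sets, projectors = [], []
    for l in range(L):
        rp = SparseRandomProjection(n_components=q, density=1/3, random_state=l)
        projected_sets.append(rp.fit_transform(D))   # N x q representation in domain l
        projectors.append(rp)                        # kept to project test data later
    return projected_sets, projectors
```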
And S40, respectively carrying out multiple classifier training on the L sets of projection characteristic data to obtain a prediction matrix of each classifier and each trained classifier, and setting the weight of each classifier according to the prediction matrix of each classifier.
The number and types of the classifiers can be determined according to the required classification precision for renal small masses. Specifically, the L sets of projection characteristic data are respectively input into each classifier branch, and five-fold cross training can be performed in each classifier to obtain the prediction matrix of each classifier with respect to each target object data.
In one example, the L sets of projection characteristics data may be input into 7 classifiers for training, respectively, where the 7 classifiers may include a bayesian classifier (e.g., a gaussian bayesian classifier), a logistic regression classifier, a quadratic discriminant analysis classifier, a K-nearest neighbor classifier, a decision tree classifier, a random forest classifier, and an XGBoost classifier.
Specifically, the training of multiple classifiers on the L sets of projection characteristic data respectively to obtain the prediction matrix of each classifier and each trained classifier includes:
constructing each classifier model with the scikit-learn machine learning package in a Python programming language environment, inputting the L sets of projection characteristic data and the corresponding label data into each classifier model respectively, and calling the fit function for each classifier model to train it, so as to obtain the prediction matrix of each classifier.
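The seven classifier models mentioned below can be instantiated and trained per projection domain roughly as follows; this is a hedged sketch with default hyper-parameters (XGBClassifier comes from the separate xgboost package), not the exact configuration of the patent:

```python
# Hedged sketch: build the seven base classifiers and fit one copy of each
# on every projected training set. Default hyper-parameters are assumptions.
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier  # assumes the xgboost package is installed

def make_classifiers():
    return {
        "bayes": GaussianNB(),
        "logistic": LogisticRegression(max_iter=1000),
        "qda": QuadraticDiscriminantAnalysis(),
        "knn": KNeighborsClassifier(),
        "tree": DecisionTreeClassifier(),
        "forest": RandomForestClassifier(),
        "xgboost": XGBClassifier(eval_metric="logloss"),
    }

def train_per_projection(projected_sets, y):
    # trained[name][l] is classifier `name` fitted on the l-th projected data set
    trained = {name: [] for name in make_classifiers()}
    for D_l in projected_sets:
        for name, clf in make_classifiers().items():
            trained[name].append(clf.fit(D_l, y))
    return trained
```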
And S50, fusing the data to be classified by adopting the trained classifiers according to the corresponding weights to determine the category of the data to be classified.
The data to be classified may include object data describing the renal masses to be tested during the test, object data describing the renal masses of the category to be identified during the actual classification, and so on.
Specifically, in the above steps, each classifier may be used to perform classification prediction on the data to be classified, and the weight of each classifier is used to perform fusion processing on the classification prediction result of the corresponding classifier, so as to identify the category of the renal mass represented by the data to be classified, thereby improving the reliability of the determined category.
In this embodiment, the radiologic characteristic data extracted from the N target object data describing renal small masses are randomly projected L times to obtain L sets of projection characteristic data; the projection characteristic data are input into different classifiers for training to obtain each trained classifier and its prediction matrix, from which the weight of each classifier is determined; the trained classifiers are then used to fuse the data to be classified according to the corresponding weights. The classifiers thus form a hierarchical structure that integrates both the diversity and the structural advantages of the classifiers, which improves the robustness of class identification for the data to be classified and thereby the reliability of the identification result.
In one embodiment, the setting the weight of each classifier according to the prediction matrix of each classifier includes:
calculating a first average prediction matrix corresponding to the benign renal small mass and a second average prediction matrix corresponding to the malignant renal small mass according to the prediction matrixes of the classifiers;
calculating Euclidean distances from the prediction matrix of each classifier to the first average prediction matrix and the second average prediction matrix respectively;
determining the prediction labels of the N target object data on each classifier according to the Euclidean distance from the prediction matrix of each classifier to the first average prediction matrix and the second average prediction matrix respectively;
and calculating the prediction accuracy parameter of each classifier according to the prediction label of the N target object data on each classifier and the label data respectively included by the N target object data, and determining the weight of each classifier according to the prediction accuracy parameter of each classifier.
The prediction matrix of each classifier may be the matrix obtained when the classifier performs classification training on each target object data. In the prediction matrix of a classifier, each row corresponds to the predicted posterior probabilities of the current classifier in one projection domain: the first column contains the posterior probabilities for the first class under the L projections, and the second column contains the posterior probabilities for the second class under the L projections. In the corresponding distance matrix, each row corresponds to the distance result of one classifier: the first column is the distance from the classifier's prediction matrix to the first average prediction matrix, and the second column is the distance to the second average prediction matrix.
Specifically, the first or second average prediction matrix is determined as

$$\bar{Q}_m^{\,g} = \frac{\sum_{i=1}^{N} \mathbb{1}(y_i = y_g)\, Q_m(x_i)}{\sum_{i=1}^{N} \mathbb{1}(y_i = y_g)}, \qquad g = 1, 2$$

where $\bar{Q}_m^{\,g}$ denotes the g-th average prediction matrix of the m-th classifier, $x_i$ denotes the i-th target object data, N denotes the number of target object data, $Q_m(x_i)$ denotes the prediction matrix of the m-th classifier with respect to the i-th target object data, $y_g$ denotes the class label g, and $\mathbb{1}(\cdot)$ is the indicator function selecting the target objects whose label $y_i$ equals $y_g$. When the class label g takes the value 1, the renal small mass described by the corresponding target object data is benign; when g takes the value 2, it is malignant.
Specifically, the Euclidean distance is calculated as

$$d_{i,m}^{\,g} = \left\| Q_m(x_i) - \bar{Q}_m^{\,g} \right\|_2 = \sqrt{\sum_{l=1}^{L} \sum_{k=1}^{G} \left( Q_m(x_i)_{lk} - \bar{Q}_{m,lk}^{\,g} \right)^2 }$$

where $d_{i,m}^{\,g}$ denotes the Euclidean distance from the prediction matrix of the m-th classifier with respect to the i-th target object data to the g-th average prediction matrix, L denotes the number of random projection matrices, and G denotes the number of classes; since the classes of renal small masses comprise benign and malignant, G takes the value 2.
Specifically, the prediction label of a target object datum on a classifier is determined as

$$x_i \in y_s \;\Leftarrow\; d_{i,m}^{\,s} = \min_{g \in \{1, \dots, G\}} d_{i,m}^{\,g}$$

where $x_i$ denotes the target object data (e.g. the i-th target object data) being classified by the classifier, $y_s$ denotes the label data predicted by the classifier for this target object data (i.e. its prediction label), the symbol $\Leftarrow$ means that the class membership on its left holds when the equation on its right is satisfied, min denotes the minimum value, $d_{i,m}^{\,g}$ denotes the Euclidean distance from the prediction matrix of the m-th classifier with respect to the i-th target object data to the g-th average prediction matrix, and the subscript s denotes the class index with the minimum distance.
Specifically, the weight of each classifier is determined from its prediction accuracy parameter as

$$\omega_m = \frac{acc_m - acc_{\min}}{acc_{\max} - acc_{\min}}$$

where $\omega_m$ denotes the weight of the m-th classifier, $acc_m$ denotes the prediction accuracy parameter of the m-th classifier, $acc_{\min}$ denotes the minimum and $acc_{\max}$ the maximum of the prediction accuracy parameters over all classifiers, m = 1, 2, ..., M, and $\omega_m \in [0, 1]$.
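To make the weighting concrete, the following hedged sketch stacks each classifier's per-projection posterior probabilities into an L x 2 prediction matrix, derives class-wise average prediction matrices, assigns labels by minimum Euclidean distance, and min-max normalizes the resulting accuracies into weights; all function and variable names are illustrative, not taken from the patent:

```python
# Hedged sketch of the weight-setting scheme described above:
# per-sample L x 2 prediction matrices -> class-averaged matrices ->
# distance-based labels -> accuracy -> min-max normalized weights.
import numpy as np

def prediction_matrix(models_per_projection, x_projected):
    # rows: L projection domains, columns: posterior probability of each class
    return np.vstack([m.predict_proba(xl.reshape(1, -1))[0]
                      for m, xl in zip(models_per_projection, x_projected)])

def classifier_weights(Q, y, classes=(0, 1)):
    # Q[m][i] is the L x 2 prediction matrix of classifier m for training sample i
    M, N = len(Q), len(y)
    acc = np.zeros(M)
    for m in range(M):
        # class-wise average prediction matrices
        Q_bar = {g: np.mean([Q[m][i] for i in range(N) if y[i] == g], axis=0)
                 for g in classes}
        # label each sample by the nearest average prediction matrix
        pred = [min(classes, key=lambda g: np.linalg.norm(Q[m][i] - Q_bar[g]))
                for i in range(N)]
        acc[m] = np.mean(np.array(pred) == np.array(y))
    # min-max normalization of the accuracies (epsilon guards against equal accuracies)
    return (acc - acc.min()) / (acc.max() - acc.min() + 1e-12)
```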
In this embodiment, the weight of each classifier is calculated from the prediction accuracy of that classifier on the target object data, which ensures the validity of the calculated weights and hence the reliability of the subsequent fusion processing of the data to be classified according to these weights.
Further, performing fusion processing on the data to be classified by adopting each trained classifier according to the corresponding weight to determine the category of the data to be classified comprises:
splicing Euclidean distances from each prediction matrix to a first average prediction matrix and a second average prediction matrix according to rows to obtain prediction distance matrices, weighting Euclidean distances corresponding to corresponding classifiers in the prediction distance matrices by adopting the weights of the classifiers to obtain first weighted distance matrices, and grouping and averaging the first weighted distance matrices according to label data of target object data to obtain a first average distance matrix and a second average distance matrix;
projecting the data to be classified through L random projection matrixes to obtain L sets of classified projection data, inputting the L sets of classified projection data into each trained classifier for prediction respectively, and obtaining a classified prediction matrix obtained by each classifier according to the prediction of the data to be classified;
calculating Euclidean distances from each classified prediction matrix to the first average prediction matrix and the second average prediction matrix respectively;
splicing Euclidean distances from each classified prediction matrix to the first average prediction matrix and the second average prediction matrix according to rows to obtain a classified distance matrix, and weighting Euclidean distances corresponding to corresponding classifiers in the classified distance matrix by adopting the weight of each classifier to obtain a second weighted distance matrix;
substituting the second weighted distance matrix, the first average distance matrix and the second average distance matrix into a classification formula to determine the class of the data to be classified; the classification formula comprises:

$$d_{test}^{\,g} = \left\| D^{w}(x_{test}) - \bar{D}^{\,g} \right\|_2$$

$$x_{test} \in y_s \;\Leftarrow\; d_{test}^{\,s} = \min_{g \in \{1, \dots, G\}} d_{test}^{\,g}$$

where $x_{test}$ denotes the data to be classified (such as test data or data describing a renal small mass whose class is to be determined), $D^{w}(x_{test})$ denotes the second weighted distance matrix, $\bar{D}^{\,g}$ denotes the g-th average distance matrix, G denotes the number of classes of the data to be classified, $y_s$ denotes the label data predicted by the classifiers for the data to be classified, and the subscript s denotes the class index with the minimum distance.
Preferably, the first or second average distance matrix is determined as

$$\bar{D}^{\,g} = \frac{\sum_{i=1}^{N} \mathbb{1}(y_i = y_g)\, D^{w}(x_i)}{\sum_{i=1}^{N} \mathbb{1}(y_i = y_g)}, \qquad g = 1, 2$$

where $D^{w}(x_i)$ denotes the first weighted distance matrix of the i-th target object data.
In one example, the computation of the average prediction matrices (the first and second average prediction matrices) from the prediction matrices of each classifier over the N target object data follows the same idea as the computation of the classification prediction matrices for the data to be classified. Correspondingly, in each case the Euclidean distances are computed and assembled into the corresponding weighted distance matrix (the first or the second weighted distance matrix) in the same way. The computation of the various distance matrices is described below:

S501, for the i-th target object data $x_i$, the Euclidean distances from its prediction matrices to the first and second average prediction matrices are spliced by rows across all classifiers to obtain the distance matrix $D(x_i)$ (the prediction distance matrix), as shown in the following equation:

$$D(x_i) = \begin{bmatrix} d_{i,1}^{\,1} & d_{i,1}^{\,2} \\ \vdots & \vdots \\ d_{i,M}^{\,1} & d_{i,M}^{\,2} \end{bmatrix}$$

S502, the distance matrix is weighted with the weight $\omega_m$ of each classifier to obtain the weighted distance matrix $D^{w}(x_i)$ (the first weighted distance matrix), as shown in the following equation:

$$D^{w}(x_i) = \operatorname{diag}(\omega_1, \dots, \omega_M)\, D(x_i)$$

S503, the weighted distance matrices obtained in step S502 are grouped by the label data of the target objects and averaged according to the following equation to obtain the average distance matrices $\bar{D}^{\,g}$ (the first and second average distance matrices):

$$\bar{D}^{\,g} = \frac{\sum_{i=1}^{N} \mathbb{1}(y_i = y_g)\, D^{w}(x_i)}{\sum_{i=1}^{N} \mathbb{1}(y_i = y_g)}, \qquad g = 1, 2$$

S511, if the data to be classified is a given test target object $x_{test}$, $x_{test}$ is projected through the L random projection matrices to obtain L sets of classification projection data $\hat{x}_{test}$;

S512, the prediction matrices $Q_m(x_{test})$ (the classification prediction matrices) obtained by each classifier on the classification projection data are computed;

S513, the Euclidean distances from each classification prediction matrix to the first and second average prediction matrices are calculated;

S514, the weighted distance matrix $D^{w}(x_{test})$ (the second weighted distance matrix) is obtained according to the computations shown in steps S501 to S502;

S515, the Euclidean distance between $D^{w}(x_{test})$ and the average distance matrices $\bar{D}^{\,g}$ of step S503 is calculated according to the first of the classification formulas, and the final class of the test target object $x_{test}$ is obtained according to the second of the classification formulas;

the first of the classification formulas is:

$$d_{test}^{\,g} = \left\| D^{w}(x_{test}) - \bar{D}^{\,g} \right\|_2$$

the second of the classification formulas is:

$$x_{test} \in y_s \;\Leftarrow\; d_{test}^{\,s} = \min_{g \in \{1, \dots, G\}} d_{test}^{\,g}$$
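Tying steps S501 to S515 together, the following hedged sketch shows one way the weighted distance matrices and the final nearest-class decision can be computed; Q_train[m][i] and Q_test_sample[m] are the L x 2 prediction matrices produced by the classifiers, Q_bar[m][g] the class-average prediction matrices, and all names are illustrative rather than taken from the patent:

```python
# Hedged sketch of the two-level fusion: build weighted distance matrices for the
# training objects, average them per class, then classify a test object by the
# nearest class-average weighted distance matrix.
import numpy as np

def distance_matrix(Q_sample, Q_bar, classes=(0, 1)):
    # Q_sample[m]: L x 2 prediction matrix of classifier m for one object;
    # Q_bar[m][g]: class-g average prediction matrix of classifier m.
    # Result: M x G matrix of Euclidean distances (steps S501 / S513).
    return np.array([[np.linalg.norm(Q_sample[m] - Q_bar[m][g]) for g in classes]
                     for m in range(len(Q_sample))])

def class_average_weighted_distances(Q_train, Q_bar, y_train, weights, classes=(0, 1)):
    # First weighted distance matrices of the training objects, averaged per class
    # (steps S501 to S503).
    W = np.diag(weights)
    Dw = [W @ distance_matrix([Q_train[m][i] for m in range(len(Q_train))], Q_bar, classes)
          for i in range(len(y_train))]
    return {g: np.mean([Dw[i] for i in range(len(y_train)) if y_train[i] == g], axis=0)
            for g in classes}

def classify(Q_test_sample, Q_bar, D_bar, weights, classes=(0, 1)):
    # Second weighted distance matrix of the test object (S511 to S514), then the
    # class whose average distance matrix is closest (S515).
    W = np.diag(weights)
    Dw_test = W @ distance_matrix(Q_test_sample, Q_bar, classes)
    return min(classes, key=lambda g: np.linalg.norm(Dw_test - D_bar[g]))
```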
according to the embodiment, the data to be classified is fused according to the corresponding weights through a plurality of different classifiers, a hierarchical structure is provided, the diversity and the structural advantages of the classifiers are integrated, the data to be classified is classified, the robustness of the classification process can be improved, and the reliability of the classification result is further improved.
In an embodiment, the projection of the N pieces of radiologic characteristic data through the L random projection matrices to obtain the L sets of projection characteristic data is performed as

$$\hat{D}_l = D\, P_l, \qquad l = 1, \dots, L$$

where $\hat{D}_l$ denotes the l-th set of projection characteristic data, which may also be written as $\hat{D}_l \in \mathbb{R}^{N \times q}$; D denotes the N pieces of radiologic characteristic data, i.e. the set comprising the N pieces of radiologic characteristic data; $P_l$ denotes the l-th random projection matrix; q denotes the data dimension of the projection domain corresponding to the random projection; and l ranges from 1 to L. In addition, the result of the L random projections may be denoted by $\hat{D} = \{\hat{D}_1, \dots, \hat{D}_L\}$, which is the representation of the N pieces of radiologic characteristic data in the new projection domains; the superscript ∧ denotes the projection domain.
Specifically, each random projection matrix used in the projection can be determined as

$$P = (r_{ij})_{p \times q}$$

where P denotes a random projection matrix, each element $r_{ij}$ is drawn at random from a set of values, the subscript i denotes the row index of P and the subscript j denotes the column index of P. The set of values may be chosen according to the desired projection properties; for example, the set may be $\{+\sqrt{3}, 0, -\sqrt{3}\}$, with $r_{ij}$ drawn according to the probabilities $pro(r_{ij} = +\sqrt{3}) = pro(r_{ij} = -\sqrt{3}) = 1/6$ and $pro(r_{ij} = 0) = 2/3$.
Further, the data dimension q is determined as follows: when $p > q_0$, $q = q_0$; when $p \le q_0$, $q = p/2$; where $q_0 = [2 \times \ln(N) / \varepsilon^2]$, $\varepsilon$ is taken as 0.25, N is the number of target object data, and p denotes the dimension of the radiologic characteristic data before projection.
In this embodiment, the above formula is used to project the N pieces of radiologic characteristic data D a total of L times: as l takes values in turn over the interval [1, L], the corresponding projection matrix $P_l$ is applied to the N pieces of radiologic characteristic data D, yielding the L sets of projection characteristic data and ensuring the validity of the L sets of projection characteristic data obtained.
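Under the assumptions above (sparse entries with probabilities 1/6, 2/3, 1/6 and the stated rule for q), a from-scratch sketch of building the random projection matrices and projecting the feature matrix might look as follows; the sqrt(3) scaling follows the common Achlioptas construction and is an assumption where the original rendering is ambiguous:

```python
# Hedged sketch of the projection formulas: draw sparse p x q matrices and
# right-multiply the feature matrix. The sqrt(3) scaling and the rule for q
# follow the description above; exact constants are assumptions.
import numpy as np

def projection_dim(p, n, eps=0.25):
    q0 = int(np.ceil(2.0 * np.log(n) / eps ** 2))
    return q0 if p > q0 else p // 2

def random_projection_matrix(p, q, rng):
    values = np.sqrt(3) * np.array([1.0, 0.0, -1.0])
    probs = [1 / 6, 2 / 3, 1 / 6]
    return rng.choice(values, size=(p, q), p=probs)

def project(D, L=10, eps=0.25, seed=0):
    rng = np.random.default_rng(seed)
    n, p = D.shape
    q = projection_dim(p, n, eps)
    mats = [random_projection_matrix(p, q, rng) for _ in range(L)]
    return [D @ P for P in mats], mats   # L sets of N x q projected data
```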
In one embodiment, the classifiers specifically include: a Bayesian classifier, a logistic regression classifier, a quadratic discriminant analysis classifier, a K-nearest neighbor classifier, a decision tree classifier, a random forest classifier and an XGBoost classifier. Each classifier model is constructed with the scikit-learn machine learning package in a Python programming language environment, the L sets of projection characteristic data and the corresponding label data are input into each classifier model, and the fit function is called for each classifier model for training to obtain the prediction matrix of each classifier. The training process of each classifier is specifically as follows:
Training process of the Bayesian classifier: construct a Bayesian model with the scikit-learn machine learning package in a Python programming language environment, input the L sets of projection characteristic data and the corresponding label data into the Bayesian model, and call the fit function for training to obtain the prediction matrix of the Bayesian classifier.
Training process of the logistic regression classifier: construct a logistic regression model with the scikit-learn machine learning package in a Python programming language environment, input the L sets of projection characteristic data and the corresponding label data into the logistic regression model, and call the fit function for training to obtain the prediction matrix of the logistic regression classifier.
Training process of the quadratic discriminant analysis classifier: construct a quadratic discriminant analysis model with the scikit-learn machine learning package in a Python programming language environment, input the L sets of projection characteristic data and the corresponding label data into the quadratic discriminant analysis model, and call the fit function for training to obtain the prediction matrix of the quadratic discriminant analysis classifier.
Training process of the K-nearest neighbor classifier: construct a K-nearest neighbor model with the scikit-learn machine learning package in a Python programming language environment, input the L sets of projection characteristic data and the corresponding label data into the K-nearest neighbor model, and call the fit function for training to obtain the prediction matrix of the K-nearest neighbor classifier.
Training process of the decision tree classifier: construct a decision tree model with the scikit-learn machine learning package in a Python programming language environment, input the L sets of projection characteristic data and the corresponding label data into the decision tree model, and call the fit function for training to obtain the prediction matrix of the decision tree classifier.
Training process of the random forest classifier: construct a random forest model with the scikit-learn machine learning package in a Python programming language environment, input the L sets of projection characteristic data and the corresponding label data into the random forest model, and call the fit function for training to obtain the prediction matrix of the random forest classifier.
Training process of the XGBoost classifier: construct an XGBoost model with the scikit-learn-compatible interface in a Python programming language environment, input the L sets of projection characteristic data and the corresponding label data into the XGBoost model, and call the fit function for training to obtain the prediction matrix of the XGBoost classifier.
Further, in the training process of the 7 classifiers, the process of obtaining the prediction matrix includes:
the Bayes classifier calls a predict _ proba function to predict to obtain the predicted posterior probability of each set of projection characteristic data, and then the predicted posterior probabilities are spliced into a prediction matrix Q related to the Bayes classifier according to rows1
The logistic regression classifier calls a predict _ proba function to predict to obtain the predicted posterior probability of each set of projection characteristic data, and the predicted posterior probabilities are spliced into a prediction matrix Q related to the quadratic discriminant logistic regression classifier according to rows2
The secondary discriminant analysis classifier calls a predict _ proba function to predict to obtain the predicted posterior probability of each set of projection characteristic data, and then the predicted posterior probabilities are spliced into a prediction matrix Q related to the secondary discriminant analysis classifier according to rows3
The K neighbor classifier calls a predict _ proba function to predict to obtain the predicted posterior probability of each set of projection characteristic data, and the predicted posterior probabilities are spliced into a prediction matrix Q related to the K neighbor classifier according to rows4
The decision tree classifier calls a predict _ proba function to predict to obtain the predicted posterior probability of each set of projection characteristic data, and the predicted posterior probabilities are spliced into a prediction matrix Q related to the decision tree classifier according to rows5
The random forest classifier calls a predict _ proba function to predict to obtain the predicted posterior probability of each set of characteristic data, and the predicted posterior probabilities are spliced into a prediction matrix Q related to the random forest classifier according to rows6
The XGboost classifier calls a predict _ proba function to predict to obtain the predicted posterior probability of each set of characteristic data, and then the predicted posterior probabilities are spliced into a prediction matrix Q related to the XGboost classifier according to rows7
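Continuing the hedged sketches above, the per-classifier prediction matrices can be assembled by calling predict_proba on each trained copy (one per projection domain) and splicing the resulting posterior probabilities by rows; trained and projected_sets follow the naming of the earlier training sketch and are assumptions, not the patent's code:

```python
# Hedged sketch: for every classifier and training object i, stack the
# predict_proba outputs from the L projection domains into an L x 2 matrix.
import numpy as np

def build_prediction_matrices(trained, projected_sets):
    # trained[name][l]: classifier `name` fitted on the l-th projected set;
    # projected_sets[l]: N x q projected training data in domain l.
    n_samples = projected_sets[0].shape[0]
    Q = {}
    for name, models in trained.items():
        # proba[l] has shape (N, 2): posterior probabilities in projection domain l
        proba = [models[l].predict_proba(projected_sets[l]) for l in range(len(models))]
        # splice by rows: Q[name][i] is the L x 2 prediction matrix of object i
        Q[name] = [np.vstack([proba[l][i] for l in range(len(models))])
                   for i in range(n_samples)]
    return Q
```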
This embodiment specifically adopts a Bayesian classifier, a logistic regression classifier, a quadratic discriminant analysis classifier, a K-nearest neighbor classifier, a decision tree classifier, a random forest classifier and an XGBoost classifier, classifiers of stable performance, to train on the L sets of projection characteristic data respectively, which improves the stability of the training process and thus ensures the reliability of the training result.
In one embodiment, the process by which the random projection-based renal mass classification method obtains the classifier weights and the average distance matrices $\bar{D}^{\,g}$ from target object data collected from 130 pathologically confirmed renal small mass patients is explained.
The target object data collected from the 130 pathologically confirmed renal small mass patients include the non-enhanced CT images of the patients' renal small masses, the corresponding mask images and the label data for the type of each renal small mass. The clinical information of these patients is given in Table 1; in Table 1, AMLwvf denotes fat-poor renal angiomyolipoma, RCC denotes renal cell carcinoma, and the P value is used to test the null hypothesis (here, that the characteristic in the table does not differ between the two groups of renal small mass patients).
TABLE 1 clinical information on patients with renal small tumors
The data for these 130 pathologically confirmed renal small mass patients comprised 94 patients with renal cell carcinoma and 36 patients with fat-poor renal angiomyolipoma. Based on the collected target object data of the 130 pathologically confirmed renal small mass patients, the above random projection-based renal mass classification method may specifically include:
step one, data input:
respectively inputting target object data from 130 patients with renal small masses, wherein the target object data comprises target object non-enhanced CT images, corresponding mask images and target object label data, and thus obtaining 130 non-enhanced CT images, 130 corresponding mask image data and 130 target object renal small mass type label data;
step two, outputting data characteristics:
Step 2.1, feature data extraction. The radiologic characteristic data are acquired using the open-source Python package radiomics (pyradiomics); extraction is performed on the ROI (region of interest) delineated on the non-enhanced CT (computed tomography) image of each target object to obtain the imaging characteristic data of that target object, which are output as the characteristic data (radiologic characteristic data).
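As a hedged illustration of this extraction step, the pyradiomics package (imported as radiomics) can compute shape, first-order and texture features from an image/mask pair roughly as follows; the file paths and the enabled feature classes are placeholders, and the exact extractor settings of the patent are not known:

```python
# Hedged sketch: extract shape, first-order and texture features from one
# non-enhanced CT image and its ROI mask with pyradiomics. Paths and settings
# are placeholders, not the configuration used in the patent.
from radiomics import featureextractor

def extract_features(image_path, mask_path):
    extractor = featureextractor.RadiomicsFeatureExtractor()
    extractor.disableAllFeatures()
    for feature_class in ("shape", "firstorder", "glcm", "glrlm", "glszm"):
        extractor.enableFeatureClassByName(feature_class)
    result = extractor.execute(image_path, mask_path)
    # keep only the numeric feature values, dropping diagnostic metadata
    return {k: float(v) for k, v in result.items() if not k.startswith("diagnostics")}

# usage (hypothetical file names):
# features = extract_features("patient001_ct.nrrd", "patient001_mask.nrrd")
```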
Step 2.2, 103 radiologic features can be obtained from step 2.1, as shown in Table 2. The radiologic features fall into three categories: 1) shape features; 2) first-order statistical features (histogram analysis); 3) texture features (image gray-level distribution). Since the target object data may be class-imbalanced, this embodiment employs the synthetic minority over-sampling technique (SMOTE), which over-samples the minority class of target objects with fat-poor renal angiomyolipoma by introducing synthetic feature samples.
Table 2 The 103 radiologic features
Step 2.3, in addition to the class balancing of the target objects in step 2.2, the target data are further processed with the f_score feature selection method to perform feature selection and reduce the dimension of the feature space, so as to avoid over-fitting.
Step three, data processing
Step 3.1, create the random projection matrices P according to formula (I), following the theory of random projection (RP) and the J-L theorem:

$$P = (r_{ij})_{p \times q} \qquad \text{(I)}$$

where q is the data dimension in the new projection domain, and each element $r_{ij}$ of the random matrix is drawn at random, for example from $\{+\sqrt{3}, 0, -\sqrt{3}\}$, according to the probabilities $pro(r_{ij} = +\sqrt{3}) = pro(r_{ij} = -\sqrt{3}) = 1/6$ and $pro(r_{ij} = 0) = 2/3$.
Step 3.2, project the 130 pieces of characteristic data obtained in step two into the new feature space through the L random projection matrices obtained in step 3.1 according to formula (II), obtaining L sets of new characteristic data (the projection characteristic data):

$$\hat{D}_l = D\, P_l, \qquad l = 1, \dots, L \qquad \text{(II)}$$

where $D \in \mathbb{R}^{130 \times p}$ is the set of the 130 target objects obtained in step two, p is the original data dimension of the target objects, $P_l \in \mathbb{R}^{p \times q}$ is the l-th random projection matrix, and $\hat{D} = \{\hat{D}_1, \dots, \hat{D}_L\}$, $\hat{D}_l \in \mathbb{R}^{130 \times q}$, is the representation of the 130 target objects in the new projection domains; the symbol ∧ denotes the projection domain.
Step four, constructing a multi-classifier system model for classifying the benign and malignant renal masses based on random projection:
step 4.1, training a classifier,
Specifically, the Bayesian training consists of constructing a Bayesian model with the scikit-learn machine learning package in a Python programming language environment, inputting the L sets of new characteristic data and the label data into the Bayesian model, calling the fit function for training, and saving and outputting the result;
specifically, the logistic regression training consists of constructing a logistic regression model with the scikit-learn machine learning package in a Python programming language environment, inputting the L sets of new characteristic data and the label data into the logistic regression model, calling the fit function for training, and saving and outputting the result;
the quadratic discriminant analysis training specifically consists of constructing a quadratic discriminant analysis model with the scikit-learn machine learning package in a Python programming language environment, inputting the L sets of new characteristic data and the label data into the quadratic discriminant analysis model, calling the fit function for training, and saving and outputting the result;
the K-nearest neighbor training specifically consists of constructing a K-nearest neighbor model with the scikit-learn machine learning package in a Python programming language environment, inputting the L sets of new characteristic data and the label data into the K-nearest neighbor model, calling the fit function for training, and saving and outputting the result;
specifically, the decision tree training consists of constructing a decision tree model with the scikit-learn machine learning package in a Python programming language environment, inputting the L sets of new characteristic data and the label data into the decision tree model, calling the fit function for training, and saving and outputting the result;
specifically, the random forest training consists of constructing a random forest model with the scikit-learn machine learning package in a Python programming language environment, inputting the L sets of new characteristic data and the label data into the random forest model, calling the fit function for training, and saving and outputting the result;
the XGBoost training specifically consists of constructing an XGBoost model in a Python programming language environment, inputting the L sets of new characteristic data and the label data into the XGBoost model, calling the fit function for training, and saving and outputting the result;
step 4.2, determining the weight of each classifier,
Step 4.2.1, input the L sets of new characteristic data obtained in step three into the Bayesian classifier, the logistic regression classifier, the quadratic discriminant analysis classifier, the K-nearest neighbor classifier, the decision tree classifier, the random forest classifier and the XGBoost classifier respectively; in each classifier, call the predict_proba function to obtain the predicted posterior probabilities for each set of characteristic data, and splice these predicted posterior probabilities by rows into the corresponding prediction matrix of each classifier, comprising: the prediction matrix Q1 of the Bayesian classifier, the prediction matrix Q2 of the logistic regression classifier, the prediction matrix Q3 of the quadratic discriminant analysis classifier, the prediction matrix Q4 of the K-nearest neighbor classifier, the prediction matrix Q5 of the decision tree classifier, the prediction matrix Q6 of the random forest classifier and the prediction matrix Q7 of the XGBoost classifier.
Step 4.2.2, from the prediction matrices obtained in step 4.2.1, compute the average prediction matrices $\bar{Q}_m^{\,g}$ of the prediction matrix of each classifier, grouped by the label data of the target objects, according to formula (III):

$$\bar{Q}_m^{\,g} = \frac{\sum_{i=1}^{N} \mathbb{1}(y_i = y_g)\, Q_m(x_i)}{\sum_{i=1}^{N} \mathbb{1}(y_i = y_g)} \qquad \text{(III)}$$

where g = 1, 2.
Step 4.2.3, from the average prediction matrices $\bar{Q}_m^{\,g}$ obtained in step 4.2.2, calculate the Euclidean distance between the prediction matrix of each classifier obtained in step 4.2.1 and the corresponding average prediction matrix according to formula (VI):

$$d_{i,m}^{\,g} = \left\| Q_m(x_i) - \bar{Q}_m^{\,g} \right\|_2 \qquad \text{(VI)}$$

and then obtain the prediction labels of the 130 target objects on each classifier according to formula (VII):

$$x_i \in y_s \;\Leftarrow\; d_{i,m}^{\,s} = \min_{g \in \{1, \dots, G\}} d_{i,m}^{\,g} \qquad \text{(VII)}$$

where the subscript s denotes the class index with the minimum distance.
Step 4.2.4, respectively calculating the prediction accuracy parameter acc_m of each classifier according to the predicted labels of the 130 target objects on each classifier obtained in step 4.2.3 and the label data obtained in step one, and proceeding to step 4.2.5;
step 4.2.5, according to the prediction accuracy parameter acc_m of each classifier obtained in step 4.2.4, calculating the weight of each classifier according to formula (VIII):

ω_m = (acc_m − acc_min) / (acc_max − acc_min)        (VIII)

where acc_min and acc_max denote the minimum and maximum of the prediction accuracy parameters over all classifiers.
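A minimal sketch of the weight computation. The original formula image is not reproduced in the text; formula (VIII) is read here as a min-max normalisation of the per-classifier accuracies, which is an assumption consistent with the symbols acc_m, acc_min and acc_max listed in claim 3. Names are illustrative.

import numpy as np

def classifier_weights(acc):
    """Min-max normalisation of the per-classifier accuracies acc_m (assumed
    reading of formula VIII); degenerate case of equal accuracies returns 1s."""
    acc = np.asarray(acc, dtype=float)
    span = acc.max() - acc.min()
    return np.ones_like(acc) if span == 0 else (acc - acc.min()) / span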
Step 4.3, obtaining the average distance matrices,
step 4.3.1, for each target object x_i, the Euclidean distances calculated in step 4.2 between the prediction matrix of each classifier and each corresponding average prediction matrix are spliced by rows over all classifiers to obtain the distance matrix D^(i), as shown in formula (IX):

D^(i) = [ d_m^(i,g) ]_{M×G}        (IX)

wherein G is 2 and M is 7.
Step 4.3.2, weighting the row of the distance matrix obtained in step 4.3.1 that corresponds to each classifier by the weight ω_m of that classifier obtained in step 4.2, to obtain the weighted distance matrix D_w^(i), as shown in formula (X):

D_w^(i) = [ ω_m · d_m^(i,g) ]_{M×G}        (X)
Step 4.3.3, averaging the weighted distance matrices D_w^(i) obtained in step 4.3.2, grouped by the label data of the target objects, according to formula (XI), to obtain the average distance matrices D̄^(g):

D̄^(g) = (1 / N_g) · Σ_{i: y_i = y_g} D_w^(i)        (XI)
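A minimal sketch of formulas (IX) to (XI) as read above, assuming dists_per_classifier maps each classifier name to the (n_samples × G) distance array returned by the earlier predicted_labels sketch, and weights is the ω_m vector in the same classifier order. Names are illustrative.

import numpy as np

def weighted_distance_matrices(dists_per_classifier, weights):
    """Formulas (IX)/(X): stack the per-classifier distance rows of each sample
    into an M x G matrix and multiply row m by the weight of classifier m."""
    names = list(dists_per_classifier)                                    # M classifiers
    stacked = np.stack([dists_per_classifier[m] for m in names], axis=1)  # (n, M, G)
    return stacked * np.asarray(weights)[None, :, None]

def average_distance_matrices(D_w, y):
    """Formula (XI): group the weighted distance matrices by label and average."""
    y = np.asarray(y)
    return {g: D_w[y == g].mean(axis=0) for g in np.unique(y)}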
Further, pre-operative data of 33 patients with pathologically confirmed renal small masses were collected as test data, and the classifier weights ω_m and the average distance matrices D̄^(g) obtained in this example were applied to the test data. The specific steps are as follows:
step one, data input:
respectively inputting the non-enhanced CT image, the corresponding mask image and the label data of the target object for each of the 33 patients with pathologically confirmed renal small masses, so as to obtain 33 non-enhanced CT images, 33 corresponding mask image data and 33 label data of the renal small mass type of the target objects;
step two, outputting data characteristics: performing characteristic output on the non-enhanced CT image and the corresponding mask image data of each target object according to steps 2.1 to 2.3 of the data characteristic output part for a single pathologically confirmed renal small mass patient;
step three, data processing: randomly projecting the 33 target objects according to steps 3.1 to 3.2 of the data processing part to obtain new characteristic data in different projection domains.
Step four, processing by the multiple classifiers: inputting the characteristic data of the third step into each classifier for corresponding fusion processing.
Step five, classifying by the multiple classifiers: inputting the step-three characteristic data of a single pathologically confirmed renal small mass patient into the Bayes, logistic regression, quadratic discriminant analysis, K-nearest neighbor, decision tree, random forest and XGBoost classifiers constructed in step four, and obtaining the weighted distance matrix of the single patient target object according to steps 4.2.2 to 4.3.2.
Step six, classifying the benign or malignant renal masses of a single pathologically confirmed renal mass patient subject according to step 4.4.
Seventhly, repeating the sixth step until all 33 patients with the pathologically confirmed renal small masses are classified into benign and malignant renal small masses.
Step eight, calculating the classification accuracy, AUC, sensitivity and specificity of the whole system according to the benign/malignant classification results of the 33 patients with pathologically confirmed renal small masses, and comparing the performance of the random projection-based renal small mass classification method provided by the present invention with that of each base classifier. The comparison results are shown in Table 3, where the marked entries indicate a statistically significant difference between the results obtained in this example and the corresponding single classifier.
TABLE 3
[Table 3: accuracy, AUC, sensitivity and specificity of the proposed fusion method compared with each single base classifier; table image not reproduced]
As can be seen from Table 3, after fusing the multiple classifiers, this embodiment is generally superior to every single classifier in terms of accuracy and AUC. In addition, the Wilcoxon signed-rank test at a significance level of 0.05 was used to test whether the prediction results of this embodiment differ significantly from those of each single classifier; the marks in Table 3 denote significant differences, and the p-values were generally below 0.05, indicating that the multi-classifier fusion scheme provided by this embodiment differs significantly from the single classifiers.
In this embodiment, the imaging characteristics extracted from the non-enhanced CT image of the target object data and the corresponding mask image are preprocessed and then randomly projected multiple times, so that multiple new data sets are generated from the original data. These data sets are passed through different classifiers, the results obtained on the new data sets are fused within each classifier, and the results of the different classifiers are further fused, yielding a hierarchical structure that takes both the diversity and the structure of the multiple classifiers into account and improves the robustness of the integrated multi-classifier application. The classification scheme implemented above for renal small masses can be applied in individualized disease diagnosis to assist in guiding clinical decisions.
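A minimal sketch of the random projection step referred to throughout (steps 3.1 to 3.2 and claims 6 and 7). The dimension rule q0 = 2·ln(n)/ε² with ε = 0.25 is taken from claim 7, and n is assumed to be the number of samples; the preset value set for the entries r_ij is not given in the text, so a standard Gaussian draw is used purely as a stand-in. Names are illustrative.

import numpy as np

def projection_dim(p, n, eps=0.25):
    """Dimension rule quoted in claim 7: q0 = 2*ln(n)/eps**2; use q0 when p > q0,
    otherwise p // 2."""
    q0 = int(2 * np.log(n) / eps ** 2)
    return q0 if p > q0 else max(1, p // 2)

def random_projections(D, L, rng=None):
    """Generate L random p x q projection matrices and project the feature
    matrix D (n x p); Gaussian entries are an assumption, not the claimed set."""
    rng = np.random.default_rng(rng)
    n, p = D.shape
    q = projection_dim(p, n)                 # assumes n in the claim-7 rule is the sample count
    mats = [rng.standard_normal((p, q)) for _ in range(L)]
    return [D @ P for P in mats], mats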
In one embodiment, as shown in fig. 3, there is provided a random projection-based renal mass classification apparatus including:
an obtaining module 10, configured to obtain N target object data describing a renal mass; the target object data includes a CT scout image, a mask image and label data of the corresponding kidney small tumor; the label data characterizes the respective renal mass as benign or malignant;
the extraction module 20 is configured to perform target region delineation on each CT scout image according to each mask image to obtain the region of interest of each CT scout image, and to perform radiologic characteristic data extraction on each region of interest to obtain N sets of radiologic characteristic data;
the projection module 30 is configured to project the N sets of radiologic characteristic data through L random projection matrices to obtain L sets of projection characteristic data;
the setting module 40 is configured to perform multiple classifier training on the L sets of projection characteristic data, to obtain a prediction matrix of each classifier and each trained classifier, and set a weight of each classifier according to the prediction matrix of each classifier;
and the determining module 50 is configured to perform fusion processing on the data to be classified according to the corresponding weights by using the trained classifiers, so as to determine the category of the data to be classified.
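A minimal sketch of the fusion decision carried out by the determining module 50, assuming the weighted distance matrix of the data to be classified and the class-wise average distance matrices are available, and following the nearest-average-distance rule spelled out in claim 4. Names are illustrative, not the patented implementation.

import numpy as np

def classify_subject(D_w_subject, D_bar):
    """Assign the test subject to the class g whose average distance matrix is
    nearest (in Euclidean norm) to the subject's weighted distance matrix."""
    classes = sorted(D_bar)
    scores = [np.linalg.norm(D_w_subject - D_bar[g]) for g in classes]
    return classes[int(np.argmin(scores))]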
For the specific definition of the renal small mass classification device based on random projection, reference may be made to the above definition of the renal small mass classification method based on random projection, and details are not repeated here. The modules of the above device can be implemented in whole or in part by software, hardware, or a combination thereof. Each module can be embedded in or independent of a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a method for classifying renal small masses based on random projection. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device can be a touch layer covering the display screen, a key, a track ball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse, and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
Based on the above examples, in one embodiment, an intelligent terminal device is further provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement any one of the above methods for classifying renal small masses based on random projection.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium, and executed by at least one processor of a computer system according to the embodiments of the present invention, to implement the processes of the embodiments including the above random projection-based renal mass classification method. The storage medium may be a magnetic disk, an optical disk, a Read-only Memory (ROM), a Random Access Memory (RAM), or the like.
Accordingly, in an embodiment, there is also provided a computer storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements any one of the above-described methods for classifying renal small masses based on random projection.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
Reference herein to "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application.

Claims (10)

1. A method for classifying renal masses based on random projection, the method comprising:
S10, acquiring N target object data describing the kidney small tumor; the target object data includes a CT scout image, a mask image and label data of the corresponding kidney small tumor; the label data characterizes the respective renal mass as benign or malignant;
S20, performing target region delineation on each CT scout image according to each mask image to obtain a region of interest of each CT scout image, and performing radiologic characteristic data extraction on each region of interest to obtain N sets of radiologic characteristic data;
S30, projecting the N sets of radiologic characteristic data through L random projection matrices to obtain L sets of projection characteristic data;
S40, respectively carrying out multiple classifier training on the L sets of projection characteristic data to obtain a prediction matrix of each classifier and each trained classifier, and setting the weight of each classifier according to the prediction matrix of each classifier;
and S50, fusing the data to be classified by adopting the trained classifiers according to the corresponding weights to determine the category of the data to be classified.
2. The method of claim 1, wherein setting the weight of each classifier according to the prediction matrix of each classifier comprises:
calculating a first average prediction matrix corresponding to the benign renal small mass and a second average prediction matrix corresponding to the malignant renal small mass according to the prediction matrixes of the classifiers;
calculating Euclidean distances from the prediction matrix of each classifier to the first average prediction matrix and the second average prediction matrix respectively;
determining the prediction labels of the N target object data on each classifier according to the Euclidean distance from the prediction matrix of each classifier to the first average prediction matrix and the second average prediction matrix respectively;
and calculating the prediction accuracy parameter of each classifier according to the prediction label of the N target object data on each classifier and the label data respectively included by the N target object data, and determining the weight of each classifier according to the prediction accuracy parameter of each classifier.
3. The method of claim 2, wherein the determining of the first average prediction matrix or the second average prediction matrix comprises:
Q̄_m^(g) = (1 / N_g) · Σ_{x_i: y_i = y_g} Q_m(x_i),   N_g = Σ_{i=1}^{N} I(y_i = y_g)
in the formula, Q̄_m^(g) represents the g-th average prediction matrix, the value of g is 1 or 2, x_i indicates the i-th target object data, N indicates the number of target object data, Q_m(x_i) represents the prediction matrix of the m-th classifier with respect to the i-th target object data, y_i represents the label data of x_i, and y_g represents the category label g;
the calculation process of the Euclidean distance comprises:
d_m^(i,g) = ‖ Q_m(x_i) − Q̄_m^(g) ‖₂ = sqrt( Σ_{l=1}^{L} Σ_{k=1}^{G} ( Q_m(x_i)[l,k] − Q̄_m^(g)[l,k] )² )
in the formula, d_m^(i,g) represents the Euclidean distance from the prediction matrix of the m-th classifier for the i-th target object data to the g-th average prediction matrix, L represents the number of random projection matrices, and G represents the number of categories;
the process of determining a predicted label for target object data on a classifier comprises:
x_i ∈ y_s  ⇐  s = argmin_g d_m^(i,g)
in the formula, x_i indicates the target object data being classified by the classifier, y_s represents the label data predicted by the classifier for the target object data, ŷ(x_i) represents the predicted label of x_i, and the symbol "⇐" means that when the equation after the symbol holds, the category membership before the symbol is obtained;
the process of determining the weight of each classifier according to the prediction accuracy parameter of each classifier comprises:
ω_m = (acc_m − acc_min) / (acc_max − acc_min)
in the formula, ω_m represents the weight of the m-th classifier, acc_m represents the prediction accuracy parameter of the m-th classifier, acc_min represents the minimum of the prediction accuracy parameters, and acc_max represents the maximum of the prediction accuracy parameters.
4. The method according to claim 3, wherein the using of the trained classifiers to perform fusion processing on the data to be classified according to the corresponding weights to determine the category of the data to be classified comprises:
splicing Euclidean distances from each prediction matrix to a first average prediction matrix and a second average prediction matrix according to rows to obtain prediction distance matrices, weighting Euclidean distances corresponding to corresponding classifiers in the prediction distance matrices by adopting the weights of the classifiers to obtain first weighted distance matrices, and grouping and averaging the first weighted distance matrices according to label data of target object data to obtain a first average distance matrix and a second average distance matrix;
projecting the data to be classified through L random projection matrixes to obtain L sets of classified projection data, inputting the L sets of classified projection data into each trained classifier for prediction respectively, and obtaining a classified prediction matrix obtained by each classifier according to the prediction of the data to be classified;
calculating Euclidean distances from each classified prediction matrix to the first average prediction matrix and the second average prediction matrix respectively;
splicing Euclidean distances from each classified prediction matrix to the first average prediction matrix and the second average prediction matrix according to rows to obtain a classified distance matrix, and weighting Euclidean distances corresponding to corresponding classifiers in the classified distance matrix by adopting the weight of each classifier to obtain a second weighted distance matrix;
substituting the second weighted distance matrix, the first average distance matrix and the second average distance matrix into a classification formula to determine the category of the data to be classified; the classification formula comprises:
x ∈ y_s  ⇐  s = argmin_{g = 1, ..., G} ‖ D_w(x) − D̄^(g) ‖₂
in the formula, x represents the data to be classified, D_w(x) represents the second weighted distance matrix, D̄^(g) denotes the g-th average distance matrix, G denotes the number of classes of the data to be classified, y_s represents the label data predicted by the classifiers for the data to be classified, and ŷ(x) represents the predicted label of x.
5. The method of claim 4, wherein the formula for determining the first or second average distance matrix comprises:
D̄^(g) = (1 / N_g) · Σ_{x_i: y_i = y_g} D_w(x_i)
in the formula, D_w(x_i) represents the first weighted distance matrix of the target object data x_i, and N_g is the number of target object data whose label is y_g.
6. The method of claim 1, wherein the projecting the N sets of radiologic characteristic data through L random projection matrices to obtain L sets of projection characteristic data comprises:
D_l = D · P_l
in the formula, D_l represents the l-th set of projection characteristic data, D represents the N sets of radiologic characteristic data, P_l represents the l-th random projection matrix, and q represents the data dimension of the projection domain corresponding to the random projection matrix.
7. The method of claim 6, wherein the random projection matrix is determined by:
P = (r_ij)_{p×q}
wherein P represents a random projection matrix, r_ij randomly takes its value from a preset set, the subscript i denotes the row number of P, and the subscript j denotes the column number of P;
and/or the data dimension q is determined as follows: when p > q_0, q = q_0; when p ≤ q_0, q = p/2; wherein q_0 = [2 × ln(n) / ε²], ε = 0.25, and p represents the dimension of the radiologic characteristic data before projection.
8. The method according to any one of claims 1 to 7, wherein the training of a plurality of classifiers is performed on each of the L sets of projection characteristic data, and obtaining the prediction matrix of each classifier comprises:
constructing each classifier model by adopting the scikit-learn machine learning software package in a Python programming language environment, respectively inputting the L sets of projection characteristic data and the corresponding label data into each classifier model, and respectively calling the fit function of each classifier model for training, so as to obtain the prediction matrix of each classifier.
9. The method of any of claims 1 to 7, wherein the classifier comprises: a Bayes classifier, a logistic regression classifier, a quadratic discriminant analysis classifier, a K-nearest neighbor classifier, a decision tree classifier, a random forest classifier and an XGBoost classifier.
10. The method of any one of claims 1 to 7, wherein the CT scout image is a non-contrast-enhanced CT scan image;
and/or each of the radiologic characteristic data comprises a plurality of radiologic features; the radiologic features include shape features, first order statistical features, and texture features.
CN202010171801.XA 2020-03-12 2020-03-12 Renal mass classification method based on random projection Active CN111340135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010171801.XA CN111340135B (en) 2020-03-12 2020-03-12 Renal mass classification method based on random projection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010171801.XA CN111340135B (en) 2020-03-12 2020-03-12 Renal mass classification method based on random projection

Publications (2)

Publication Number Publication Date
CN111340135A true CN111340135A (en) 2020-06-26
CN111340135B CN111340135B (en) 2021-07-23

Family

ID=71182399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010171801.XA Active CN111340135B (en) 2020-03-12 2020-03-12 Renal mass classification method based on random projection

Country Status (1)

Country Link
CN (1) CN111340135B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101965588A (en) * 2008-01-31 2011-02-02 伊利诺伊大学评议会 Recognition via high-dimensional data classification
CN103164710A (en) * 2013-02-19 2013-06-19 华南农业大学 Selection integrated face identifying method based on compressed sensing
CN104966100A (en) * 2015-06-17 2015-10-07 北京交通大学 A benign and malignant image lump classification method based on texture primitives
CN107403201A (en) * 2017-08-11 2017-11-28 强深智能医疗科技(昆山)有限公司 Tumour radiotherapy target area and jeopardize that organ is intelligent, automation delineation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RUIMENG YANG 等: "Radiomics of small renal masses on multiphasic CT: accuracy of machine learning–based classification models for the differentiation of renal cell carcinoma and angiomyolipoma without visible fat", 《SPRINGER》 *
TIEN THANH NGUYEN 等: "A weighted multiple classifier framework based on random projection", 《ELSEVIER》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011462A (en) * 2021-02-22 2021-06-22 广州领拓医疗科技有限公司 Classification and device of tumor cell images
CN113011462B (en) * 2021-02-22 2021-10-22 广州领拓医疗科技有限公司 Classification and device of tumor cell images
CN113902724A (en) * 2021-10-18 2022-01-07 广州医科大学附属肿瘤医院 Method, device, equipment and storage medium for classifying tumor cell images
CN114897796A (en) * 2022-04-22 2022-08-12 深圳市铱硙医疗科技有限公司 Method, device, equipment and medium for judging stability of atherosclerotic plaque
CN114897796B (en) * 2022-04-22 2023-06-30 深圳市铱硙医疗科技有限公司 Method, device, equipment and medium for judging stability of atherosclerosis plaque
CN116611025A (en) * 2023-05-19 2023-08-18 贵州师范大学 Multi-mode feature fusion method for pulsar candidate signals
CN116611025B (en) * 2023-05-19 2024-01-26 贵州师范大学 Multi-mode feature fusion method for pulsar candidate signals
CN116805536A (en) * 2023-08-22 2023-09-26 乐陵市人民医院 Data processing method and system based on tumor case follow-up

Also Published As

Publication number Publication date
CN111340135B (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN111340135B (en) Renal mass classification method based on random projection
Hu et al. Unsupervised learning for cell-level visual representation in histopathology images with generative adversarial networks
US20220367053A1 (en) Multimodal fusion for diagnosis, prognosis, and therapeutic response prediction
WO2017151759A1 (en) Category discovery and image auto-annotation via looped pseudo-task optimization
US20220051060A1 (en) Methods for creating privacy-protecting synthetic data leveraging a constrained generative ensemble model
Kumar et al. Future of machine learning (ML) and deep learning (DL) in healthcare monitoring system
Li et al. A hybrid approach for approximating the ideal observer for joint signal detection and estimation tasks by use of supervised learning and markov-chain monte carlo methods
Tian et al. Radiomics and its clinical application: artificial intelligence and medical big data
Gangadharan et al. Comparative analysis of deep learning-based brain tumor prediction models using MRI scan
Jung et al. Weakly supervised thoracic disease localization via disease masks
Ann et al. Multi-scale conditional generative adversarial network for small-sized lung nodules using class activation region influence maximization
Qiu et al. Spiculation sign recognition in a pulmonary nodule based on spiking neural p systems
CN113011462B (en) Classification and device of tumor cell images
Prasad et al. Lung cancer detection and classification using deep neural network based on hybrid metaheuristic algorithm
Doraiswami et al. Jaya‐tunicate swarm algorithm based generative adversarial network for COVID‐19 prediction with chest computed tomography images
Goel et al. Improving YOLOv6 using advanced PSO optimizer for weight selection in lung cancer detection and classification
JP2024500470A (en) Lesion analysis methods in medical images
Branikas et al. Instance selection techniques for multiple instance classification
Wang et al. A comprehensive survey on deep active learning in medical image analysis
Rocha et al. Confident-CAM: Improving Heat Map Interpretation in Chest X-Ray Image Classification
CN115830020B (en) Lung nodule feature extraction method, classification method, device and medium
Gaikwad Deepsampling: Image sampling technique for cost-effective deep learning
Malla et al. Artificial intelligence in breast cancer: An opportunity for early diagnosis
Thanataveerat Clustering algorithm for zero-inflated data
Owais et al. Volumetric Model Genesis in Medical Domain for the Analysis of Multimodality 2-D/3-D Data Based on the Aggregation of Multilevel Features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210208

Address after: Room 144, 30th floor, Xihu commercial building, 12 Xihu Road, Yuexiu District, Guangzhou, Guangdong 510000

Applicant after: Zhen Xin

Address before: 510030 room 144, 30th floor, Xihu commercial building, 12 Xihu Road, Yuexiu District, Guangzhou City, Guangdong Province

Applicant before: Guangzhou lingtuo Medical Technology Co.,Ltd.

GR01 Patent grant