CN109711469B

CN109711469B - Breast cancer diagnosis system based on semi-supervised neighborhood discrimination index

Info

Publication number: CN109711469B
Application number: CN201811615503.4A
Authority: CN
Inventors: 张莉; 庞晴晴; 王邦军; 周伟达
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2023-06-20
Anticipated expiration: 2038-12-27
Also published as: CN109711469A

Abstract

The invention discloses a breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index. The system comprises a data acquisition module, a feature extraction module, a feature screening module and a classification module. It acquires labeled and unlabeled breast cell data samples, extracts a plurality of features from them, calculates the semi-supervised neighborhood discrimination index of each feature, screens out the features whose index meets a preset condition, and finally diagnoses the breast cell data sample to be diagnosed according to the screened features to obtain a diagnosis result. Because the system is based on semi-supervised learning, the features most strongly associated with breast cancer are screened out without requiring a label for every sample, and only the data for those features is extracted from the sample to be diagnosed. This avoids labeling a large amount of data and greatly reduces cost.

Description

Breast cancer diagnosis system based on semi-supervised neighborhood discrimination index

Technical Field

The invention relates to the field of computers, in particular to a breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index.

Background

The global incidence of breast cancer has been rising since the late 1970s. Statistics indicate that 1 in every 8 women in the United States develops breast cancer. In recent years, the growth in the incidence of breast cancer in China has reached 1%-2%. The key to preventing and treating breast cancer is early detection: early-stage breast cancer, especially stage 0 breast cancer, can be treated radically by breast-conserving surgery at low cost, with a 5-year survival rate above 90 percent.

Wang, Hu et al., in "Feature Selection Based on Neighborhood Discrimination Index", proposed using a neighborhood discrimination index for feature selection. They introduced the basic ideas of Shannon information theory into the context of neighborhood relations and proposed a discrimination index that measures the discriminating capability of a feature subset. However, that method is fully supervised and therefore requires a large number of data labels. In practice, obtaining labels often takes a great deal of manual effort and is therefore expensive.

Disclosure of Invention

The invention aims to provide a breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index, in order to solve the problem that existing breast cancer diagnosis methods based on fully supervised learning require a large number of data labels and are therefore costly.

In order to solve the technical problems, the invention provides a breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index, comprising:

a data acquisition module: used to obtain labeled breast cell data samples and unlabeled breast cell data samples;

a feature extraction module: used to extract a plurality of features of the breast cell data samples from the labeled and unlabeled breast cell data samples;

a feature screening module: used to screen out, from the plurality of features, the features whose semi-supervised neighborhood discrimination index meets a preset condition;

a classification module: used to diagnose a breast cell data sample to be diagnosed according to the features meeting the preset condition, to obtain a diagnosis result.

Optionally, the feature extraction module includes:

a normalization unit: used to normalize a labeled breast cell data matrix and an unlabeled breast cell data matrix, wherein each breast cell data matrix comprises the data of a plurality of breast cell data samples;

a splicing unit: used to splice the normalized labeled breast cell data matrix and the normalized unlabeled breast cell data matrix to obtain a target data matrix;

a feature extraction unit: used to extract each column of data in the target data matrix to obtain the plurality of features of the breast cell data samples.

Optionally, the feature screening module is specifically configured to screen out, from the plurality of features, a preset number of features with the largest semi-supervised neighborhood discrimination index.

Optionally, the feature screening module includes:

a judging unit: used to judge whether the number of screened features is 0;

a first calculation unit: used to calculate, when the number of screened features is 0, the semi-supervised neighborhood discrimination index of each feature in a feature set to be screened according to a first formula, wherein the feature set to be screened comprises the plurality of features at initialization and is updated as features are screened;

a second calculation unit: used to calculate, when the number of screened features is not 0, the semi-supervised neighborhood discrimination index of each feature in the feature set to be screened according to a second formula;

a screening unit: used to screen out the feature with the largest semi-supervised neighborhood discrimination index from the feature set to be screened and to update the feature set to be screened, until the number of screened features reaches the preset number.

Optionally, the classification module includes:

an extraction unit: used to extract target feature data from the breast cell data sample to be diagnosed according to the features meeting the preset condition;

a classification unit: used to input the target feature data into a pre-trained KNN classifier to obtain a diagnosis result.

Optionally, the breast cell data sample comprises breast cell data extracted from a fine needle aspiration digital image, and the breast cell data comprises any one or any combination of the following: nucleus radius, texture, smoothness, perimeter, concavity.

The breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index provided by the invention comprises a data acquisition module, a feature extraction module, a feature screening module and a classification module. The system acquires labeled and unlabeled breast cell data samples, extracts a plurality of features of the breast cell data samples, calculates the semi-supervised neighborhood discrimination index of each feature, screens out the features whose index meets a preset condition, and finally diagnoses the breast cell data sample to be diagnosed according to the screened features to obtain a diagnosis result. Because the system is based on semi-supervised learning, the features most strongly associated with breast cancer are screened out without requiring a label for every sample, and only the data for those features is extracted from the sample to be diagnosed. This avoids labeling a large amount of data and greatly reduces cost.

Drawings

In order to describe the embodiments of the invention or the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of a breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index according to an embodiment of the present invention;

FIG. 2 is a flowchart of a first embodiment of a breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index according to the present invention;

FIG. 3 is a schematic structural diagram of a second embodiment of a breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index according to the present invention;

FIG. 4 is a flowchart of a second embodiment of a breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index according to the present invention.

Detailed Description

The invention provides a breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index that performs diagnosis on the basis of semi-supervised learning. Features are screened according to their semi-supervised neighborhood discrimination index, which reduces computational cost and avoids having to label a large amount of data.

In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The following describes a first embodiment of the breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index. Referring to FIG. 1, the embodiment includes:

Data acquisition module 100: used to obtain labeled breast cell data samples and unlabeled breast cell data samples.

The breast cell data refers to feature data of the cell nuclei in a digitized image of a fine needle aspiration of a breast tumor, such as radius, texture, perimeter, smoothness, compactness, concavity and concave points. In fine needle aspiration (FNA), a fine needle is used to draw cells and other components from the lesion, which are prepared as a smear so that the morphological and interstitial changes of tumor and non-tumor cells can be observed. Compared with the unlabeled breast cell data samples, the labeled breast cell data samples additionally include a diagnosis result, which is either benign or malignant. In this embodiment, the number of unlabeled breast cell data samples is greater than the number of labeled breast cell data samples.

Feature extraction module 200: for extracting a plurality of features of the breast cell data sample from the labeled breast cell data sample and the unlabeled breast cell data sample.

Specifically, suppose the data acquisition module 100 has acquired a labeled breast cell data matrix and an unlabeled breast cell data matrix. The labeled matrix is an l×n matrix, where l is the number of labeled breast cell data samples and n is the dimension of each sample; the unlabeled matrix is a u×n matrix, where u is the number of unlabeled breast cell data samples. In the feature extraction module 200, the two matrices are normalized separately so that the feature values of each sample sum to 1, and the normalized matrices are then spliced into an (l+u)×n matrix, referred to as the target data matrix for convenience. Each column of the target data matrix is one feature of the breast cell data samples, so n features are finally obtained.

Feature screening module 300: used to screen out, from the plurality of features, the features whose semi-supervised neighborhood discrimination index meets a preset condition.

The purpose of the feature screening module 300 is to screen out, from the n features extracted by the feature extraction module 200, the features most strongly associated with the label; in other words, to find the relationship between features and labels so that, for unknown data that has features but no label, the label can be inferred from that relationship. In this embodiment, the association between a feature and the label is measured by the semi-supervised neighborhood discrimination index, a parameter that measures the discriminating capability of a feature subset: the larger the index, the better the discriminating capability. The index of each of the n features is calculated first, and the features satisfying the preset condition are then screened out according to it. As an optional implementation, the preset number of features with the largest index are screened out; the preset number can be adjusted as required and is not specifically limited in this embodiment.

Classification module 400: used to diagnose the breast cell data to be diagnosed according to the features meeting the preset condition, to obtain a diagnosis result.

Specifically, the feature data for the screened features, which are highly associated with the label, is extracted from the breast cell data to be diagnosed and used as the input of a pre-trained classification model. The output of the model is the diagnosis result, which is either benign or malignant.

Specifically, the overall implementation flow of the above system embodiment is shown in fig. 2, and includes the following steps:

Step S101: obtain labeled breast cell data samples and unlabeled breast cell data samples.

Step S102: normalize the labeled and unlabeled breast cell data samples separately to obtain a plurality of features of the breast cell data samples.

Step S103: screen out, from the plurality of features, the features whose semi-supervised neighborhood discrimination index meets the preset condition.

Step S104: extract the feature data for the features meeting the preset condition from the breast cell data sample to be diagnosed, and input the feature data into a pre-trained classifier to obtain a diagnosis result.
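Steps S103 and S104 can be illustrated with a minimal sketch, assuming the semi-supervised neighborhood discrimination index of every feature has already been computed. The function names and the example scores are hypothetical; the sketch only shows "keep the preset number of features with the largest index, then extract those columns from the sample to be diagnosed".

```python
def select_top_features(scores, preset_number):
    """Return the indices of the `preset_number` features with the largest score.

    `scores[k]` stands in for the index of feature k (hypothetical values).
    """
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return ranked[:preset_number]


def extract_feature_data(sample, selected):
    """Keep only the values of the selected features from one sample."""
    return [sample[k] for k in selected]


# Hypothetical index values for three features.
example_scores = [0.2, 0.9, 0.5]
selected_features = select_top_features(example_scores, preset_number=2)

# A sample to be diagnosed, reduced to the selected features.
reduced_sample = extract_feature_data([10.0, 20.0, 30.0], selected_features)
```

The reduced sample is what would be handed to the pre-trained classifier in step S104.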

It can be seen that the breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index provided in this embodiment includes a data acquisition module 100, a feature extraction module 200, a feature screening module 300 and a classification module 400. The system acquires labeled and unlabeled breast cell data samples, extracts a plurality of features of the breast cell data samples, calculates the semi-supervised neighborhood discrimination index of each feature, screens out the features whose index meets the preset condition, and finally diagnoses the breast cell data sample to be diagnosed according to the screened features to obtain a diagnosis result. Because the system is based on semi-supervised learning, the features most strongly associated with breast cancer are screened out without requiring a label for every sample, and only the data for those features is extracted from the sample to be diagnosed. This avoids labeling a large amount of data and greatly reduces cost.

The second embodiment of the breast cancer diagnosis system based on the semi-supervised neighborhood discrimination index provided by the invention is realized based on the first embodiment, and is expanded to a certain extent based on the first embodiment.

As shown in fig. 3, the breast cancer diagnosis system based on the semi-supervised neighborhood discrimination index is divided into four modules, namely, a data acquisition module 100, a feature extraction module 200, a feature screening module 300 and a classification module 400, and the implementation procedures of the four modules are described below:

In this embodiment, two data matrices are input to the data acquisition module 100: a labeled breast cell data matrix $X_l = [x_1, \ldots, x_l]^T \in \mathbb{R}^{l \times n}$ and an unlabeled breast cell data matrix $X_u = [x_{l+1}, \ldots, x_{l+u}]^T \in \mathbb{R}^{u \times n}$, where $x_i$ is an n-dimensional sample of breast cell data, l is the number of labeled data samples, u is the number of unlabeled data samples, and n is the total number of features of the data. The labeled breast cell data samples come with a label vector $Y = [y_1, \ldots, y_l]^T$, where $y_i \in \{-1, 1\}$ for i = 1, 2, ..., l. Specifically, $y_i = -1$ indicates that the diagnosis result of the sample is malignant, and $y_i = 1$ indicates that it is benign. As an optional implementation, this embodiment sets l to 172, u to 341 and n to 30.

As shown in FIG. 3, the feature extraction module 200 is divided into three units, namely a normalization unit 201, a splicing unit 202 and a feature extraction unit 203, whose functions are as follows:

Normalization unit 201: used to normalize the labeled and unlabeled breast cell data matrices separately.

The purpose of normalization is to make the feature values of each sample sum to 1, which simplifies the subsequent calculations.

Splicing unit 202: used to splice the normalized labeled breast cell data matrix and the normalized unlabeled breast cell data matrix to obtain a target data matrix.

Writing $X_l$ and $X_u$ for the normalized labeled and unlabeled matrices, the splicing result is the target data matrix

$$X = \begin{bmatrix} X_l \\ X_u \end{bmatrix} \in \mathbb{R}^{m \times n},$$

where m = l + u is the total number of samples in the data matrix.

Feature extraction unit 203: used to extract each column of data in the target data matrix to obtain a plurality of features of the breast cell data samples.

That is, the target data matrix is divided by columns, and each column is one feature of the breast cell data samples:

$$X = [f_1, f_2, \ldots, f_n],$$

where $f_k = [f_{1k}, \ldots, f_{mk}]^T$ is the k-th feature of the breast cell data samples, k = 1, 2, ..., n.
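The three units of the feature extraction module can be sketched in a few lines. This is a minimal illustration using plain Python lists; the function names and the tiny example matrices are hypothetical, and the only behavior taken from the text is: normalize each sample so its feature values sum to 1, stack the labeled rows above the unlabeled rows, and read each column as one feature.

```python
def normalize_rows(matrix):
    """Scale each sample (row) so that its feature values sum to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            raise ValueError("cannot normalize a sample whose features sum to 0")
        normalized.append([value / total for value in row])
    return normalized


def splice(labeled, unlabeled):
    """Stack the labeled rows above the unlabeled rows (target data matrix)."""
    return [list(row) for row in labeled] + [list(row) for row in unlabeled]


def split_columns(matrix):
    """Return the columns f_1, ..., f_n of the target data matrix."""
    return [list(column) for column in zip(*matrix)]


# Hypothetical data: l = 2 labeled samples, u = 1 unlabeled sample, n = 2 features.
labeled_matrix = [[1.0, 3.0], [2.0, 2.0]]
unlabeled_matrix = [[1.0, 1.0]]

target_matrix = splice(normalize_rows(labeled_matrix), normalize_rows(unlabeled_matrix))
features = split_columns(target_matrix)  # features[k] has m = l + u entries
```

Normalizing the two matrices before splicing follows the order given in the text; because the normalization is per sample, splicing first would give the same result.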

In this embodiment, as shown in fig. 3, the feature screening module 300 includes four units, namely a judging unit 301, a first calculating unit 302, a second calculating unit 303, and a screening unit 304, where the functions of the units are as follows:

Judging unit 301: used to determine whether the number of screened features is 0.

As an optional implementation, an initialization operation may be performed before entering the judging unit 301. It sets up the target feature subset G and the feature subset to be screened B and initializes them as $G = \emptyset$ and $B = \{f_1, f_2, \ldots, f_n\}$. The target feature subset stores the features that the feature screening module 300 has screened out as highly associated with the label. The feature subset to be screened initially stores the plurality of features extracted by the preceding module, the feature extraction module, and is updated as features are screened: whenever a feature is screened out, it is moved from the feature subset to be screened to the target feature subset.

First calculation unit 302: used to calculate the semi-supervised neighborhood discrimination index of each feature in the feature set to be screened according to a first formula when the number of screened features is 0.

Specifically, when the target feature subset $G = \emptyset$, the semi-supervised neighborhood discrimination index of each feature in the set B is calculated according to formula (3):

[Formula (3), the semi-supervised neighborhood discrimination index, is not reproduced in this text.]

where $k \in B$, $|\cdot|$ denotes the number of non-zero elements of a matrix, and δ is an adjustable constant parameter. Formula (3) also involves two neighborhood similarity relation matrices under the distance function, one of them for a feature vector; their symbols and calculation formulas are not reproduced in this text. In those formulas, one symbol denotes the k-th feature vector of the labeled data and another denotes the k-th feature of the i-th labeled sample; ε is the neighborhood radius, with 0 ≤ ε ≤ 1. Furthermore, in formula (3), $\mathbf{1} = [1, \ldots, 1]^T$, $f_k$ is the k-th feature vector of all data, and L is the Laplacian matrix $L = D - S$, where D is the diagonal matrix with $D_{ii} = \sum_j S_{ij}$. The formula that $S_{ij}$ satisfies is likewise not reproduced in this text; in it, t > 0 is an adjustable constant and $KNN(f_i)$ denotes the set of K nearest neighbors of $f_i$.

As an optional implementation, the parameters in this embodiment may be set to δ = 0.7, K = 3, t = 100 and ε = 0.2.
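The Laplacian matrix L = D - S with D_ii = Σ_j S_ij can be illustrated with a short sketch. The weight formula for S_ij is not available in this text, so the code below assumes a heat-kernel weight exp(-d²/t) between K-nearest neighbors, which is a common construction for graph Laplacians and is consistent with the parameters t and K mentioned above, but it is an assumption rather than the patent's formula. Building the graph over sample vectors with Euclidean distance, and all function names, are likewise hypothetical choices.

```python
import math


def knn_indices(samples, i, k):
    """Indices of the k nearest neighbors of sample i (sample i itself excluded)."""
    others = [j for j in range(len(samples)) if j != i]
    others.sort(key=lambda j: math.dist(samples[i], samples[j]))
    return set(others[:k])


def similarity_matrix(samples, k=3, t=100.0):
    """ASSUMED weights: exp(-d^2 / t) if i and j are K-nearest neighbors, else 0."""
    m = len(samples)
    neighbors = [knn_indices(samples, i, k) for i in range(m)]
    S = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j and (j in neighbors[i] or i in neighbors[j]):
                S[i][j] = math.exp(-math.dist(samples[i], samples[j]) ** 2 / t)
    return S


def laplacian(S):
    """L = D - S, where D is diagonal with D_ii = sum_j S_ij (as stated in the text)."""
    m = len(S)
    L = [[-S[i][j] for j in range(m)] for i in range(m)]
    for i in range(m):
        L[i][i] += sum(S[i])
    return L


# Three hypothetical one-dimensional samples, K = 1 for readability.
example_S = similarity_matrix([[0.0], [1.0], [5.0]], k=1, t=100.0)
example_L = laplacian(example_S)
```

Whatever weight formula is used, every row of L sums to zero, which is a quick sanity check on an implementation.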

Second calculation unit 303: used to calculate the semi-supervised neighborhood discrimination index of each feature in the feature set to be screened according to a second formula when the number of screened features is not 0.

Specifically, when the target feature subset G is not empty, the semi-supervised neighborhood discrimination index of each feature in the set B is calculated according to formula (7):

[Formula (7) is not reproduced in this text.]

where $k \in B$. Formula (7) involves the neighborhood similarity matrices, under the distance function, of two data matrices: the labeled data restricted to the feature subset G and the labeled data restricted to the feature subset $G \cup \{k\}$. The symbols for these matrices and their calculation formulas are not reproduced in this text.
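To make the idea of a neighborhood similarity matrix on a feature subset concrete, here is a minimal sketch. Since the calculation formulas are not available in this text, the code assumes the usual neighborhood-relation construction: entry (i, j) is 1 when the distance between samples i and j, measured on the chosen features only, is at most the neighborhood radius ε, and 0 otherwise. The Euclidean distance, the function names and the sample values are assumptions for illustration.

```python
import math


def neighborhood_relation(samples, feature_subset, epsilon=0.2):
    """ASSUMED relation: R[i][j] = 1 if samples i and j lie within epsilon
    of each other on the features in `feature_subset`, else 0."""
    m = len(samples)
    R = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            distance = math.sqrt(
                sum((samples[i][k] - samples[j][k]) ** 2 for k in feature_subset)
            )
            if distance <= epsilon:
                R[i][j] = 1
    return R


def count_nonzero(matrix):
    """The |.| operator from the text: number of non-zero elements."""
    return sum(1 for row in matrix for value in row if value != 0)


# Three hypothetical labeled samples with two features each.
labeled_samples = [[0.10, 0.50], [0.20, 0.90], [0.60, 0.55]]

relation_G = neighborhood_relation(labeled_samples, feature_subset=[0])
relation_G_plus_k = neighborhood_relation(labeled_samples, feature_subset=[0, 1])

size_G = count_nonzero(relation_G)
size_G_plus_k = count_nonzero(relation_G_plus_k)
```

Under this assumed construction, adding a feature can only shrink the relation (fewer pairs stay within ε), which is why comparing G with G ∪ {k} tells how much feature k helps separate samples.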

Screening unit 304: used to screen out the feature with the largest semi-supervised neighborhood discrimination index from the feature set to be screened and to update the feature set to be screened.

Specifically, after the semi-supervised neighborhood discrimination index $SNDI(f_k, F_G, Y)$ (k = 1, ..., n) has been calculated, the feature with the largest index is selected and added to the target feature subset G, and the feature subset to be screened is updated as $B = B - G$.

It should be noted that this screening is an iterative process, repeated until the number of screened features reaches the preset number. Alternatively, the iteration may be stopped under another condition [the condition is not reproduced in this text], after which the first N features with the largest semi-supervised neighborhood discrimination index among the target features are taken as the finally screened features.
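The interplay of the judging, calculation and screening units amounts to a greedy forward selection loop, sketched below. Because formulas (3) and (7) are not available in this text, the two scoring functions are passed in as parameters; `score_first` and `score_next` are hypothetical stand-ins, and the toy scores in the example are made up purely to exercise the loop.

```python
def greedy_screen(candidates, preset_number, score_first, score_next):
    """Move the best-scoring feature from B to G until G has `preset_number` features.

    score_first(k)   -> index of feature k when G is empty (stand-in for the first formula)
    score_next(k, G) -> index of feature k given the selected subset G (stand-in for the second)
    """
    G = []              # target feature subset, initially empty
    B = list(candidates)  # feature subset to be screened
    while B and len(G) < preset_number:
        if not G:       # judging unit: no feature screened yet
            scores = {k: score_first(k) for k in B}
        else:
            scores = {k: score_next(k, G) for k in B}
        best = max(B, key=lambda k: scores[k])
        G.append(best)  # screening unit: move the best feature from B to G
        B.remove(best)
    return G


# Toy scores for four features (hypothetical values, not real index values).
toy_scores = {0: 0.1, 1: 0.9, 2: 0.4, 3: 0.7}
screened = greedy_screen(
    candidates=range(4),
    preset_number=2,
    score_first=lambda k: toy_scores[k],
    score_next=lambda k, G: toy_scores[k],
)
```

Passing the scoring functions in keeps the loop independent of how the index is actually computed, so a faithful implementation of the two formulas could be dropped in without changing the control flow.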

As shown in fig. 3, the classification module 400 in this embodiment includes two units, namely an extraction unit 401 and a classification unit 402, which function as:

Extraction unit 401: used to extract target feature data from the breast cell data sample to be diagnosed according to the features meeting the preset condition.

Classification unit 402: used to input the target feature data into a pre-trained KNN classifier to obtain a diagnosis result.

As an optional implementation, a KNN classifier is used for classification, giving the result that the sample to be tested is benign or malignant.
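A KNN classifier of the kind used by the classification unit can be sketched as follows. The text does not state the classifier's distance function or its number of neighbors, so the Euclidean distance and k = 3 below are assumptions, as are the function name and the toy training data. Labels follow the encoding given earlier: -1 for malignant, 1 for benign.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training samples closest to `query`.

    "Pre-training" a KNN classifier amounts to storing the labeled samples.
    """
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy labeled samples, already reduced to two screened features (hypothetical values).
train_X = [[0.10, 0.10], [0.20, 0.10], [0.90, 0.80], [0.80, 0.90], [0.85, 0.85]]
train_y = [1, 1, -1, -1, -1]

benign_case = knn_predict(train_X, train_y, [0.15, 0.12])
malignant_case = knn_predict(train_X, train_y, [0.80, 0.80])
```

With two classes, an odd k avoids tied votes.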

It can be seen that the breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index provided in this embodiment includes a data acquisition module 100, a feature extraction module 200, a feature screening module 300 and a classification module 400. After the original data is acquired, it is normalized. During feature screening, the features most strongly associated with the label are screened out iteratively. Finally, feature data is extracted from the breast cell data sample to be diagnosed and diagnosed with a KNN classifier to obtain the final diagnosis result. More features are screened, and the accuracy of the diagnosis result is improved.

In summary, the overall implementation flow of the breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index provided in this embodiment is shown in FIG. 4 and includes the following steps:

Step S201: acquire a labeled breast cell data matrix and an unlabeled breast cell data matrix.

Step S202: normalize the labeled and unlabeled breast cell data matrices separately.

Step S203: splice the normalized labeled breast cell data matrix and the normalized unlabeled breast cell data matrix to obtain a target data matrix, and extract each column of the target data matrix to obtain a plurality of features of the breast cell data samples.

Step S204: initialize the target feature subset G to the empty set and the feature subset to be screened to the plurality of features.

Step S205: determine whether the target feature subset G is empty; if so, proceed to step S206, otherwise proceed to step S207.

Step S206: calculate the semi-supervised neighborhood discrimination index of each feature in the feature set to be screened according to formula (3).

Step S207: calculate the semi-supervised neighborhood discrimination index of each feature in the feature set to be screened according to formula (7).

Step S208: screen out the feature with the largest semi-supervised neighborhood discrimination index from the feature set to be screened and move it from the feature subset to be screened to the target feature subset. As noted above, the screening is repeated until the number of screened features reaches the preset number.

Step S209: extract target feature data from the breast cell data sample to be diagnosed according to the features meeting the preset condition, and input the target feature data into a pre-trained KNN classifier to obtain a diagnosis result.

To verify the effect of this embodiment, a comparative experiment is also provided. The embodiment was tested on the UCI data set WDBC (Breast Cancer Wisconsin (Diagnostic) Data Set), which contains 569 data samples in total, each with 31 attributes including the sample label. The features are calculated from digitized images of fine needle aspirates (FNA) of breast tumors and characterize the cell nuclei present in the images. Table 1 compares the recognition results of this embodiment with those of the fully supervised neighborhood discrimination index (HANDI) method, giving the average number of features and the average recognition rate over a ten-fold cross-validation experiment. The embodiment screens more features and achieves a higher diagnostic accuracy.

TABLE 1
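For readers unfamiliar with ten-fold cross-validation, the sketch below shows how 569 samples can be divided into ten folds, each fold serving once as the test set. The text does not say how the folds were drawn, so the simple interleaved split here (no shuffling, no stratification) and the function name are assumptions for illustration only.

```python
def ten_fold_splits(num_samples, folds=10):
    """Yield (train_indices, test_indices) pairs, one per fold.

    ASSUMED scheme: sample i goes to fold i % folds; a real experiment
    would typically shuffle the samples first.
    """
    parts = [list(range(i, num_samples, folds)) for i in range(folds)]
    for i, test in enumerate(parts):
        train = [idx for j, part in enumerate(parts) if j != i for idx in part]
        yield train, test


splits = list(ten_fold_splits(569))
fold_sizes = [len(test) for _, test in splits]
train_sizes = [len(train) for train, _ in splits]
```

With 569 samples, nine folds hold 57 samples and one holds 56, so each training portion has 512 or 513 samples; the latter matches l + u = 172 + 341 = 513 from the second embodiment, although the text does not state that this is how those values were obtained.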

In this specification, the embodiments are described progressively: each embodiment focuses on its differences from the others, and identical or similar parts can be found by cross-reference between embodiments. Because the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for relevant details, refer to the description of the method.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The breast cancer diagnosis system based on the semi-supervised neighborhood discrimination index provided by the invention is described in detail above. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.

Claims

1. A breast cancer diagnosis system based on a semi-supervised neighborhood discrimination index, comprising:

a classification module: used to diagnose the breast cell data sample to be diagnosed according to the features meeting the preset condition, to obtain a diagnosis result;

the feature extraction module includes:

a feature extraction unit: used to extract each column of data in the target data matrix to obtain a plurality of features of the breast cell data samples; the target data matrix X is divided by columns, each column being one feature of the breast cell data samples, according to the following formula:

$$X = [f_1, f_2, \ldots, f_n],$$

wherein $f_k = [f_{1k}, \ldots, f_{mk}]^T$ is a feature of the breast cell data samples, k = 1, 2, ..., n;

the feature screening module is specifically used to screen out, from the plurality of features, a preset number of features with the largest semi-supervised neighborhood discrimination index.

2. The system of claim 1, wherein the feature screening module comprises:

a judging unit: judging whether the number of the screened features is 0;

3. The system of claim 1 or 2, wherein the classification module comprises:

4. The system of claim 3, wherein the breast cell data sample comprises breast cell data extracted from a fine needle aspiration digital image, the breast cell data comprising any one or any combination of: nucleus radius, texture, smoothness, perimeter, concavity.