CN114512231A - Down syndrome screening system based on cascade characteristic selection algorithm

Info

Publication number: CN114512231A
Application number: CN202210140822.4A
Authority: CN
Inventors: 李玲; 宋柬霏; 荆瑞航; 黄玉兰; 张海蓉
Original assignee: Yancheng Jiyan Intelligent Technology Co ltd
Current assignee: Yancheng Jiyan Intelligent Technology Co ltd
Priority date: 2022-02-16
Filing date: 2022-02-16
Publication date: 2022-05-17

Abstract

The invention belongs to the technical field of medical screening methods, and particularly relates to a Down syndrome screening system based on a cascade feature selection algorithm. The system is built on a correlation-based feature selection algorithm (CFS), a bee swarm optimization algorithm (BSO) and a support vector machine (SVM) machine learning model, and comprises a data preprocessing module, a feature primary screening module, an optimal feature subset screening module and a model prediction module. It learns from and classifies prenatal screening data sets, thereby improving diagnostic accuracy and identifying the variables most strongly related to the outcome.

Description

Down syndrome screening system based on cascade characteristic selection algorithm

Technical Field

The invention belongs to the technical field of medical screening methods, and particularly relates to a Down syndrome screening system based on a cascade characteristic selection algorithm.

Background

Down syndrome, also known as trisomy 21, is a disease caused by a chromosomal abnormality. In China, about 14.7 of every 10,000 live births are affected. About 60% of affected fetuses are miscarried in early pregnancy, and survivors show marked intellectual disability, characteristic facial features, growth and developmental disorders, and multiple malformations. No effective treatment for Down syndrome currently exists, so prenatal screening is an effective measure for preventing the birth of affected infants. At present, Down syndrome screening is mainly carried out by measuring the levels of chorionic gonadotropin (HCG), alpha-fetoprotein (AFP) and free estriol (FE3) in the serum of pregnant women, combined with karyotype analysis of the pregnant woman's peripheral blood cells and chromosome examination of amniotic fluid cells.

Before 2012, prenatal screening programs in China usually relied on amniocentesis or chorionic villus sampling (CVS), which is regarded as the 'gold standard' for detecting chromosomal abnormalities. However, these methods are invasive and carry a certain risk of infection. In recent years, noninvasive prenatal testing (NIPT) has attracted attention in this field. NIPT is a newer genetic test used to screen for birth defects and genetic diseases, and it is usually offered for further screening to pregnant women found to be at high risk in serum screening. Although its results are accurate, it is time-consuming and costly, and therefore cannot be widely adopted.

In recent years, with the development of machine learning techniques, machine learning methods have been widely used for cancer diagnosis and for the prediction of other common diseases. Accurate computer-aided diagnosis helps speed up the diagnosis of diseases, reduces the workload of doctors, improves working efficiency, and yields more accurate and efficient diagnostic results.

Prenatal screening data is a relatively special kind of medical data: it is high-dimensional and its features are correlated with one another. For these reasons, the use of machine learning in Down syndrome screening has rarely been reported. The existing literature considers only a small number of features and fails to fully account for important features related to the screening result. Classifying high-dimensional data with correlated features is much more difficult than classifying low-dimensional data. Traditional machine learning models classify such data poorly and are therefore difficult to apply to Down syndrome screening.

The fused feature selection method is a feature selection approach suited to high-dimensional data sets with correlated features. In essence, it combines two different feature selection methods, chosen so that the strengths of one offset the weaknesses of the other, in order to select the optimal feature subset. Because the two methods complement each other, the combined algorithm improves considerably in subset evaluation capability and classification accuracy. At present, the method is mostly applied in industry and has not been applied to Down syndrome screening.

Disclosure of Invention

In order to overcome the above problems, the invention provides a Down syndrome screening system based on a cascade feature selection algorithm. The system is built on a correlation-based feature selection algorithm (CFS), a bee swarm optimization algorithm (BSO) and a support vector machine (SVM) machine learning model, and comprises a data preprocessing module, a feature primary screening module, an optimal feature subset screening module and a model prediction module. The system learns from and classifies a prenatal screening data set, thereby improving diagnostic accuracy and identifying the variables most strongly related to the outcome.

A Down syndrome screening system based on a cascade feature selection algorithm comprises a data preprocessing module, a feature primary screening module, an optimal feature subset screening module and a model prediction module, wherein the data preprocessing module is used for receiving text data of Down syndrome screening results, standardizing the data and filling in missing values in the data;

the feature primary screening module uses a correlation-based feature selection algorithm to select, from the text data output by the data preprocessing module, the features related to the Down syndrome screening result;

the optimal feature subset screening module further screens the features selected by the feature primary screening module using a bee swarm optimization algorithm, and extracts the optimal features most strongly correlated with the Down syndrome screening result;

and the model prediction module uses a support vector machine (SVM) model to make a screening prediction from the optimal features extracted by the optimal feature subset screening module, and outputs the prediction result.

The text data of Down syndrome screening results received by the data preprocessing module refers to the text data of Down syndrome screening results of pregnant women during pregnancy; the text data of each result is regarded as one Down syndrome sample, and each Down syndrome sample comprises feature samples of 58 dimensions. The data is standardized by applying Z-Score standardization to the feature sample of each dimension, and the Z-Score formula is as follows:

$$x_j = \frac{x_i - \mu}{\sigma}$$

wherein: $x_j$ represents the standardized feature sample, $x_i$ represents the original feature sample, $\mu$ is the mean of all data in the feature sample of that dimension, and $\sigma$ is the standard deviation of all data in the feature sample of that dimension;

if the feature sample contains missing data, the missing values are first filled with a specific value and Z-Score standardization is then applied to the filled data, wherein continuous data are filled with the median and discrete data are filled with the mode.
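The preprocessing just described can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names `fill_missing` and `z_score`, the use of `None` to mark a missing value, the choice of the population standard deviation, and the sample values are all assumptions made for the example.

```python
from statistics import mean, median, mode, pstdev

def fill_missing(column, kind):
    """Fill None entries with the median (continuous) or the mode (discrete)."""
    observed = [v for v in column if v is not None]
    fill = median(observed) if kind == "continuous" else mode(observed)
    return [fill if v is None else v for v in column]

def z_score(column):
    """Standardize one feature column as (x - mu) / sigma."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

# Hypothetical continuous feature with one missing entry.
ages = [28, None, 35, 31]
filled = fill_missing(ages, "continuous")
standardized = z_score(filled)
```

Filling happens before standardization, as in the text, so the filled value contributes to the mean and standard deviation of its column.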

The feature primary screening module selects the features related to the Down syndrome screening result using the correlation-based feature selection algorithm, and the specific process is as follows:

step one, from the standardized Down syndrome samples output by the data preprocessing module, calculating the correlation between the feature sample of each dimension and the feature samples of the other dimensions, and the correlation between the feature sample of each dimension and the Down syndrome prediction category, thereby obtaining two correlation matrices;

wherein the correlation between the feature sample of each dimension and the feature samples of the other dimensions is calculated according to the following formula:

$$\rho(X_1, X_2) = \frac{E(X_1 X_2) - E(X_1)\,E(X_2)}{\sqrt{D(X_1)}\,\sqrt{D(X_2)}}$$

wherein: $X_1$ represents all data under the feature sample of one dimension, $E(X_1)$ represents the mathematical expectation of all data under that feature sample, and $D(X_1)$ is the variance of all data under that feature sample; $X_2$ represents all data under the feature sample of another dimension, $E(X_2)$ is the mathematical expectation of all data under that feature sample, and $D(X_2)$ is the variance of all data under that feature sample;

the correlation between the feature sample of each dimension and the Down syndrome prediction category is calculated according to the following formula:

$$\rho(X, Y) = \frac{E(XY) - E(X)\,E(Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}}$$

wherein: $X$ represents all data under the feature sample of one dimension, $E(X)$ represents the mathematical expectation of all data under that feature sample, and $D(X)$ is the variance of all data under that feature sample; $Y$ represents the diagnostic outcome corresponding to each entry of the feature sample, with 1 denoting Down syndrome and 0 denoting non-Down syndrome; $E(Y)$ represents the mathematical expectation of all data in the column of diagnostic outcomes, and $D(Y)$ represents the variance of all data in the column of diagnostic outcomes;

and step two, searching for the feature subset using best-first search, the specific content being as follows:

firstly, an empty set M is given, and the estimated value (merit) of the feature sample of each dimension is calculated; the feature sample with the largest merit is placed into M; the feature sample with the second largest merit is then placed into M, forming a combined feature sample in M, and the merit of the combined feature sample is calculated; if the merit of the combined feature sample is smaller than the merit of the feature sample with the largest merit alone, the feature sample with the second largest merit is removed from M; otherwise, it is retained in M;

the feature sample with the third largest merit is then placed into M, forming a combined feature sample with the feature samples retained in M at that time, and the merit of the combined feature sample is calculated; if this merit is smaller than the merit of the feature samples in M before the new feature sample was added, the newly added feature sample is removed from M; otherwise, it is retained in M; the process continues in this way until the feature samples of all dimensions have been processed, yielding the feature sample combination with the largest merit; wherein the merit is calculated according to the following formula:

$$\mathrm{Merit} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$

wherein $k$ represents the number of feature samples in the feature sample combination;

$\overline{r_{cf}}$ represents the average correlation between the feature samples in the combination and the Down syndrome prediction category;

and $\overline{r_{ff}}$ represents the average correlation between each feature sample in the combination and the other feature samples.
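As an illustration of the merit-guided search described above, the following Python sketch ranks features by their individual merit and keeps each newly added feature only if the merit of the combination does not decrease. It assumes the standard CFS merit, k·r_cf / sqrt(k + k(k-1)·r_ff), and the use of absolute correlation values; the function names `merit` and `greedy_cfs` and the toy correlation values are hypothetical.

```python
from math import sqrt

def merit(subset, fc, ff):
    """CFS merit of a feature subset (assumed standard form).

    fc[i]    : correlation of feature i with the outcome
    ff[i][j] : correlation between features i and j
    """
    k = len(subset)
    r_cf = sum(abs(fc[i]) for i in subset) / k
    pairs = [(i, j) for a, i in enumerate(subset) for j in subset[a + 1:]]
    r_ff = sum(abs(ff[i][j]) for i, j in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(fc, ff):
    """Add features in order of individual merit; keep one only if merit does not drop."""
    order = sorted(range(len(fc)), key=lambda i: abs(fc[i]), reverse=True)
    selected = [order[0]]
    best = merit(selected, fc, ff)
    for i in order[1:]:
        candidate = merit(selected + [i], fc, ff)
        if candidate >= best:  # "not smaller than" keeps the new feature
            selected.append(i)
            best = candidate
    return selected, best

# Toy values: features 0 and 1 are nearly redundant, feature 2 is complementary.
fc = [0.8, 0.7, 0.6]
ff = [[1.0, 0.9, 0.1],
      [0.9, 1.0, 0.1],
      [0.1, 0.1, 1.0]]
selected, best = greedy_cfs(fc, ff)
```

In this toy case feature 1 is rejected because it is highly correlated with feature 0, while feature 2 is kept because it adds outcome-related information with little redundancy.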

The optimal feature subset screening module extracts the optimal features most strongly correlated with the Down syndrome screening result using the bee swarm optimization algorithm, and the specific content is as follows:

firstly, a part of the feature samples in the feature sample combination with the largest merit output by the feature primary screening module is randomly designated for searching, and fitness is used to judge the quality of each search result; finally, by traversing all feature samples in that combination, the feature sample subset with the largest fitness is obtained; wherein the fitness is calculated according to the following formula:

wherein TP represents the samples whose predicted result is positive and whose actual Down syndrome result is positive; FN represents the samples whose predicted result is negative but whose actual Down syndrome result is positive; FP represents the samples whose predicted result is positive but whose actual Down syndrome result is negative; TN represents the samples whose predicted result is negative and whose actual Down syndrome result is negative.
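The four counts defined above can be obtained from paired actual and predicted labels as in the Python sketch below. The fitness formula itself does not survive in this text, so the `fitness` function here assumes overall accuracy, (TP + TN) / (TP + FN + FP + TN), purely for illustration; the function names and the example labels are likewise hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FN, FP, TN) for binary labels, 1 = Down syndrome, 0 = non-Down syndrome."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fn, fp, tn

def fitness(tp, fn, fp, tn):
    """Assumed fitness: overall accuracy. The source formula may differ."""
    total = tp + fn + fp + tn
    return (tp + tn) / total if total else 0.0

# Hypothetical labels for five samples.
actual = [1, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0]
counts = confusion_counts(actual, predicted)
score = fitness(*counts)
```

Because Down syndrome cases are rare, a fitness that weights TP and FN more heavily than plain accuracy may be preferable in practice; that choice is outside what the text specifies.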

The model prediction module adopts a support vector machine (SVM) model, whose structure is as follows:

$$\min_{\omega,\,b,\,\varepsilon}\ \frac{1}{2}\lVert\omega\rVert^{2} + C\sum_{i=1}^{m}\varepsilon_i$$

$$\text{s.t.}\quad y_k\left(\omega^{T}x_k + b\right) \ge 1 - \varepsilon_i$$

wherein $m$ represents the number of division planes; $\omega$ is the normal vector of the classification plane; $C$ is the penalty factor, taken as 1; $\varepsilon_i$ is the slack variable, with value range $[0, 1]$; $x_k$ is the kth Down syndrome sample; $y_k$ is the predicted category of the kth Down syndrome sample; s.t. denotes the constraint, $T$ denotes transpose, and $b$ is the displacement term.
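To make the roles of ω, b, C and the slack variables concrete, the following Python sketch evaluates a soft-margin objective for a fixed classification plane. It assumes the usual form, one half of the squared norm of ω plus C times the sum of slacks, with each slack taken as the smallest value satisfying the constraint, and it assumes labels coded as +1 and -1 rather than 1 and 0. The function name and the sample values are hypothetical.

```python
def soft_margin_objective(w, b, samples, labels, C=1.0):
    """Return (objective, slacks) for a given plane w.x + b = 0.

    Each slack is max(0, 1 - y * (w.x + b)), the smallest value that
    satisfies y * (w.x + b) >= 1 - slack. Labels are assumed to be +1 / -1.
    """
    slacks = []
    for x, y in zip(samples, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks

# Two samples outside the margin and one inside it.
w, b = [1.0, 0.0], 0.0
samples = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
labels = [1, -1, 1]
objective, slacks = soft_margin_objective(w, b, samples, labels, C=1.0)
```

A slack between 0 and 1 corresponds to a sample that is inside the margin but still on the correct side of the plane; a computed value above 1 would indicate a misclassified sample.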

The specific process of training the SVM model is as follows:

step one: inputting the selected Down syndrome samples into the data preprocessing module, and taking the Down syndrome samples processed by the data preprocessing module as a training set;

step two: manually labeling each Down syndrome sample in the training set as non-Down syndrome or Down syndrome, to obtain a labeled training set;

step three: drawing 43627 Down syndrome samples with replacement from the labeled training set of step two, and dividing the 43627 Down syndrome samples into Down syndrome and non-Down syndrome by a classification plane; the classification plane is the separating plane of the SVM model, determined by the normal vector ω and the displacement term b;

step four: repeating step three 10 times, so that the classification plane produces 10 classification results for each Down syndrome sample after the 10 divisions; the 10 classification results of each Down syndrome sample are then put to a vote, and the category with the most votes is designated as the final output for that sample;

when the accuracy of the SVM model in classifying the data in the labeled training set reaches 90%, the trained SVM model is obtained; the accuracy refers to the number of Down syndrome samples in the labeled training set that the model classifies correctly, divided by the total number of Down syndrome samples in the manually labeled training set, multiplied by 100%.
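Steps three and four amount to sampling with replacement, training one classifier per resample, and taking a majority vote. The Python sketch below shows that control flow only. The classifier is a deliberately simple one-dimensional threshold rule standing in for SVM training, and the names `bootstrap`, `majority_vote`, `bagged_predict`, `train_threshold` and `accuracy`, as well as the toy data, are assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap(samples, labels, rng):
    """Draw len(samples) items with replacement."""
    idx = [rng.randrange(len(samples)) for _ in samples]
    return [samples[i] for i in idx], [labels[i] for i in idx]

def majority_vote(votes):
    """Return the most common label (ties go to the label seen first)."""
    return Counter(votes).most_common(1)[0][0]

def bagged_predict(train_fn, samples, labels, queries, rounds=10, seed=0):
    """Train one model per bootstrap resample and vote on each query."""
    rng = random.Random(seed)
    models = [train_fn(*bootstrap(samples, labels, rng)) for _ in range(rounds)]
    return [majority_vote([m(q) for m in models]) for q in queries]

def train_threshold(samples, labels):
    """Stand-in for SVM training: cut halfway between the two class means."""
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    if not pos or not neg:  # degenerate resample: only one class was drawn
        only = 1 if pos else 0
        return lambda q: only
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda q: 1 if q > cut else 0

def accuracy(actual, predicted):
    """Correctly classified samples / all samples x 100%."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)

X = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
y = [0, 0, 0, 1, 1, 1]
votes = bagged_predict(train_threshold, X, y, queries=[0.0, 1.2])
```

With an even number of rounds such as 10, a 5 to 5 tie is possible; how ties are resolved is not stated in the text, so the rule used here is only one option.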

Compared with the prior art, the invention has the beneficial effects that:

1. The Down syndrome screening method based on the cascade feature selection algorithm adopts a fused feature selection method that combines the correlation-based feature selection algorithm with the bee swarm optimization algorithm to select the optimal feature subset. The fused algorithm combines the low time complexity of the correlation-based feature selection algorithm with the bee swarm optimization algorithm's ability to fully account for the correlation among features, and greatly improves subset evaluation capability and classification accuracy. The fused algorithm is successfully applied to Down syndrome screening.

2. After the optimal feature subset has been selected by the fused feature selection method, the Down syndrome screening method based on the cascade feature selection algorithm applies a support vector machine (SVM) classification model to the screening prediction of Down syndrome. The predicted detection rate is 81.0% higher than that obtained with the prenatal screening risk assessment software currently used in hospitals, and the false detection rate is 9.8% lower than that obtained with the same software, so the detection rate is improved while the false detection rate is reduced.

Detailed Description

The Down syndrome screening method applies feature selection algorithms to the screening prediction of Down syndrome. In view of the high dimensionality and feature correlation of the data, the correlation-based feature selection algorithm and the bee swarm optimization algorithm are selected and combined through the fused feature selection method in order to screen out the optimal feature subset. Finally, combined with a support vector machine (SVM) model, the Down syndrome screening method achieves higher prediction accuracy and can identify the predictors most strongly correlated with the diagnosis of Down syndrome.

Example 1

the feature primary screening module uses the correlation-based feature selection algorithm (CFS) to select, from the text data output by the data preprocessing module, the features related to the Down syndrome screening result;

the purpose of the CFS algorithm is to first filter out features irrelevant to the outcome: the BSO algorithm used by the next module has an extremely long run time, and filtering out in advance the features that are obviously irrelevant to the outcome reduces the execution time of the BSO algorithm, so this step is motivated mainly by time savings;

the optimal feature subset screening module further screens the features selected by the feature primary screening module using the bee swarm optimization algorithm (BSO), and extracts the optimal features most strongly correlated with the Down syndrome screening result;

The text data of Down syndrome screening results received by the data preprocessing module refers to the text data of Down syndrome screening results of pregnant women during pregnancy; the text data of each result is regarded as one Down syndrome sample, and each Down syndrome sample comprises feature samples of 58 dimensions. Standardizing the data eliminates the influence of differences in dimension (units) and distribution among the features by scaling the data to a specific range, so that the machine learning model treats all features equally. Z-Score standardization is applied to the feature sample of each dimension, and the formula is as follows:

$$x_j = \frac{x_i - \mu}{\sigma}$$

wherein: $x_j$ represents the standardized feature sample of one dimension, $x_i$ represents the original feature sample of one dimension, $\mu$ is the mean of all data in the feature sample of that dimension, and $\sigma$ is the standard deviation of all data in the feature sample of that dimension; the feature sample of one dimension contains one column of data, the standard deviation is computed per column, and there are as many standard deviations as there are columns (features) of data.

Some feature values are missing because information was entered carelessly or because the patient did not undergo the examination. Missing value filling is adopted to solve this problem, that is, the missing data are filled in with a specific value.

If missing data exist in the feature sample of one dimension, the missing values are filled with a specific value, and Z-Score standardization is applied after filling. Continuous data are filled with the median (the non-missing data in the feature sample of that dimension are sorted, their median is found, and the median is used as the missing value); discrete data are filled with the mode (the value that occurs most often in the feature sample of that dimension is used as the missing value).

The feature primary screening module selects the features related to the Down syndrome screening result using the correlation-based feature selection algorithm (CFS), and the specific process is as follows:

step one, from the standardized Down syndrome samples output by the data preprocessing module, calculating the correlation between the feature sample of each dimension and the feature samples of the other dimensions, and the correlation between the feature sample of each dimension and the Down syndrome prediction category (that is, whether the feature is a typical symptom of Down syndrome), thereby obtaining two correlation matrices;

$$\rho(X_1, X_2) = \frac{E(X_1 X_2) - E(X_1)\,E(X_2)}{\sqrt{D(X_1)}\,\sqrt{D(X_2)}}$$

wherein: $X_1$ represents all data under the feature sample of one dimension, $E(X_1)$ represents the mathematical expectation of all data under that feature sample, and $D(X_1)$ is the variance of all data under that feature sample; $X_2$ represents all data under the feature sample of another dimension, $E(X_2)$ is the mathematical expectation of all data under that feature sample, and $D(X_2)$ is the variance of all data under that feature sample;

the features of the Down syndrome text data are correlated with the feature of each dimension to some degree, and the correlation matrix holds the values of this degree of correlation.

Step two, searching for the feature subset using best-first search, the specific content being as follows:

The merit here can be understood as the diagnostic accuracy obtained using a given feature: if the diagnostic accuracy using features A and B together is lower than the diagnostic accuracy using feature A alone, feature B is removed, and the other features are then considered and compared in turn.

Proceeding in sequence means the following: among single features, the feature with the highest diagnostic accuracy is selected and retained; the feature with the second highest diagnostic accuracy is then added; if the diagnostic accuracy of the two features together is higher than that of the first feature alone, the second feature is kept, and the feature with the third highest diagnostic accuracy is then added for comparison; if the diagnostic accuracy of the two features together is lower than that of the first feature alone, the second feature is not kept, and the feature with the third highest diagnostic accuracy is then added, and so on.

The optimal feature subset screening module adopts the bee swarm optimization algorithm (BSO). The algorithm treats each feature, by analogy, as a nectar source; bees search the features and return each search result, fitness is used to judge the quality of each result, and after repeated updating and iteration the feature subset with the largest fitness is returned. The optimal features most strongly correlated with the Down syndrome screening result are extracted as follows:

firstly, a part of the feature samples in the feature sample combination with the largest merit output by the feature primary screening module is randomly designated for searching, and fitness is used to judge the quality of each search result; finally, by traversing all feature samples in that combination, the feature sample subset with the largest fitness is obtained; wherein the fitness is calculated according to the following formula:

The model prediction module adopts a support vector machine (SVM) model, whose structure is as follows:

$$\min_{\omega,\,b,\,\varepsilon}\ \frac{1}{2}\lVert\omega\rVert^{2} + C\sum_{i=1}^{m}\varepsilon_i$$

$$\text{s.t.}\quad y_k\left(\omega^{T}x_k + b\right) \ge 1 - \varepsilon_i$$

wherein $m$ represents the number of division planes, here 10; $\omega$ is the normal vector of the classification plane; $C$ is the penalty factor: the larger $C$ is, the heavier the penalty on misclassified samples, so the accuracy on the training samples is higher but the generalization ability decreases, that is, the classification accuracy on test data drops. Conversely, decreasing $C$ tolerates some misclassified samples in the training set. In the invention, $C$ is taken as 1; $\varepsilon_i$ is the slack variable, a manually set parameter with value range $[0, 1]$; $x_k$ is the kth Down syndrome sample; $y_k$ is the category of the kth Down syndrome sample predicted by the model; s.t. denotes the constraint, $T$ denotes transpose, and $b$ is the displacement term.

The specific process of training the SVM model is as follows:

step two: manually labeling each Down syndrome sample in the training set as non-Down syndrome or Down syndrome, to obtain a labeled training set;

step three: drawing 43627 Down syndrome samples with replacement from the labeled training set of step two, and dividing the 43627 Down syndrome samples into Down syndrome and non-Down syndrome by a classification plane; the classification plane is a part of the support vector machine (SVM) model, and its formula is as follows:

wherein $x_k$ represents the kth Down syndrome sample; $y_k$ represents the category corresponding to the kth Down syndrome sample; $\omega$ is the normal vector of the classification plane and determines its direction; $b$ is the displacement term, which determines the distance between the classification plane and the origin;

when the accuracy of the SVM model in classifying the data in the labeled training set reaches 90%, the trained SVM model is obtained; the accuracy refers to the number of Down syndrome samples in the labeled training set that the model classifies correctly, divided by the total number of Down syndrome samples in the manually labeled training set, multiplied by 100%.

Example 2

A Down syndrome screening system based on a cascade feature selection algorithm specifically comprises a data preprocessing module, a feature primary screening module, an optimal feature subset screening module and a model prediction module, wherein:

the data preprocessing module is used for cleaning the Down syndrome text data, specifically by missing value filling and data standardization. After the Down syndrome text data were processed, 43627 pieces of Down syndrome text data and 58 dimensions of features potentially related to Down syndrome were obtained.

The feature primary screening module uses the correlation-based feature selection method (CFS), and the specific process of the algorithm is as follows: first, the correlation matrix between each feature and the category and the correlation matrix between each pair of features are calculated from the Down syndrome text data; the feature subset is then searched using best-first search. The quality of each feature subset is evaluated by its merit, and the feature subset with the highest merit is finally selected.

The optimal feature subset screening module adopts the bee swarm optimization algorithm (BSO). The algorithm treats each feature, by analogy, as a nectar source; bees search the features and return each search result, fitness is used to judge the quality of each result, and after repeated updating and iteration the feature subset with the largest fitness is returned.

The model prediction module adopts a support vector machine (SVM) model; the Down syndrome text data processed by the data preprocessing module are fed into the trained SVM model for prediction to obtain the final prediction result.

The data preprocessing module performs missing value filling and standardization on the Down syndrome text data. Missing value filling means filling in the missing data with a specific value: continuous data are filled with the median, and discrete data are filled with the mode. Data standardization eliminates the influence of differences in dimension (units) and distribution among the features by scaling the data to a specific range, so that the machine learning model treats all features equally. The Down syndrome text data are standardized using the Z-Score method, whose formula is as follows:

$$x_j = \frac{x_i - \mu}{\sigma}$$

wherein: $x_j$ represents the standardized feature sample of one dimension, $x_i$ represents the original feature sample of one dimension, $\mu$ is the mean of all data in the feature sample of that dimension, and $\sigma$ is the standard deviation of all data in the feature sample of that dimension; the feature sample of one dimension contains one column of data, the standard deviation is computed per column, and there are as many standard deviations as there are columns (features) of data.

The feature primary screening module adopts the correlation-based feature selection algorithm (CFS), and the specific process of the algorithm is as follows: first, the correlation matrix between each feature and the category and the correlation matrix between each pair of features are calculated from the Down syndrome text data; the feature subset is then searched using best-first search. In best-first search, an empty set M is given and the estimated value (merit) of each feature is calculated; the feature with the largest merit is placed into M, and the feature with the second largest merit is then placed into M; if the merit of the two features together is smaller than the original merit, the newly added feature is removed; the next feature is then tried, and the search proceeds in this way until the feature combination that maximizes the merit is found. The merit of a feature set is defined by the following formula:

$$\mathrm{Merit} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$

wherein $k$ represents the number of features in the current feature set;

$\overline{r_{cf}}$ represents the average correlation between each feature in the feature set and the Down syndrome prediction category;

$\overline{r_{ff}}$ represents the average correlation between the features in the feature set.
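The two correlation matrices that feed the merit calculation can be computed as in the Python sketch below. It assumes the Pearson correlation coefficient written in terms of mathematical expectation and variance, which matches the quantities E(·) and D(·) named in the text; the function names and the two toy feature columns are hypothetical.

```python
from math import sqrt

def expectation(values):
    """Mathematical expectation E(X) of a column of data."""
    return sum(values) / len(values)

def variance(values):
    """Variance D(X) of a column of data."""
    mu = expectation(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

def pearson(x1, x2):
    """(E(X1*X2) - E(X1)E(X2)) / (sqrt(D(X1)) * sqrt(D(X2)))."""
    cov = expectation([a * b for a, b in zip(x1, x2)]) - expectation(x1) * expectation(x2)
    denom = sqrt(variance(x1)) * sqrt(variance(x2))
    return cov / denom if denom > 0 else 0.0  # constant column: treat as uncorrelated

def correlation_matrices(columns, outcome):
    """Return (feature-feature matrix, feature-outcome vector)."""
    ff = [[pearson(a, b) for b in columns] for a in columns]
    fc = [pearson(col, outcome) for col in columns]
    return ff, fc

# Two hypothetical feature columns and a 1/0 outcome column.
columns = [[0.1, 0.4, 0.35, 0.8], [1, 0, 1, 0]]
outcome = [0, 0, 0, 1]
ff, fc = correlation_matrices(columns, outcome)
```

For 58 features the feature-feature matrix is 58 by 58 and symmetric, so only the upper triangle needs to be computed in a production implementation.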

The optimal feature subset screening module adopts the bee swarm optimization algorithm (BSO), which treats each feature, by analogy, as a nectar source; bees search the features and return each search result. First, the features to be searched are designated at random, and fitness is used to judge the quality of each search result; finally, after a traversal search over all features, the feature subset with the largest fitness is returned.

The fitness formula is defined as follows:

wherein TP represents the samples whose predicted result is positive and whose actual Down syndrome result is positive; FN represents the samples whose predicted result is negative but whose actual Down syndrome result is positive; FP represents the samples whose predicted result is positive but whose actual Down syndrome result is negative; TN represents the samples whose predicted result is negative and whose actual Down syndrome result is negative.
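The search loop, starting from a random subset, evaluating candidate subsets by fitness and keeping the best one found, can be sketched as follows. This is a simplified swarm-style local search written for illustration, not the BSO algorithm of the invention: each "bee" toggles one feature of the current best subset. The function name `swarm_search`, its parameters, and the toy fitness function are all assumptions.

```python
import random

def swarm_search(n_features, fitness_fn, bees=5, iterations=20, seed=0):
    """Simplified stand-in for a swarm search over feature subsets."""
    rng = random.Random(seed)

    def random_subset():
        subset = frozenset(i for i in range(n_features) if rng.random() < 0.5)
        # Never start from an empty subset.
        return subset or frozenset([rng.randrange(n_features)])

    best = random_subset()
    best_fit = fitness_fn(best)
    for _ in range(iterations):
        for _ in range(bees):
            flip = rng.randrange(n_features)
            neighbor = best ^ frozenset([flip])  # toggle one feature in or out
            if not neighbor:
                continue  # skip the empty subset
            fit = fitness_fn(neighbor)
            if fit > best_fit:
                best, best_fit = neighbor, fit
    return best, best_fit

def toy_fitness(subset):
    """Hypothetical fitness: features 0 and 2 help, every other feature costs 0.1."""
    useful = frozenset({0, 2})
    return len(subset & useful) / 2 - 0.1 * len(subset - useful)

best, best_fit = swarm_search(4, toy_fitness)
```

In the screening system the fitness function would instead train and evaluate a classifier on the candidate subset, which is why reducing the number of candidate features beforehand matters for run time.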

The model prediction module adopts a Support Vector Machine (SVM) model, and the specific process of model training is as follows:

step one: inputting the selected Down syndrome text data into the data preprocessing module, and taking the Down syndrome text data processed by the data preprocessing module as a training set;

step two: manually labeling each item of Down syndrome text data in the training set as normal (non-Down syndrome) or abnormal (Down syndrome) to obtain a labeled training set;

step three: drawing 43627 items of Down syndrome text data with replacement from the labeled training set of step two, and dividing the 43627 items into the two classes of Down syndrome and non-Down syndrome by a classification plane to form a trained SVM model;

step four: repeating step three 10 times, so that the classification planes produce 10 classification results for the Down syndrome text data, voting over the 10 classification results, and taking the category with the most votes as the final output result;

a trained SVM model is obtained when the accuracy of the SVM model in classifying the data in the labeled training set reaches 90%; the accuracy refers to the number of items of Down syndrome text data in the labeled training set that are correctly classified by the model, divided by the total number of manually labeled items in the labeled training set, multiplied by 100%.
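Steps three and four amount to bootstrap aggregation with majority voting, which can be sketched as follows. This is an illustration under assumptions: the base learner `fit_threshold` is a one-dimensional stand-in and is not an SVM, the sample size per round equals the size of the training set, and all function names are hypothetical. Any function that takes samples and labels and returns a predictor could be passed as `fit`.

```python
import random
from collections import Counter

def fit_threshold(xs, ys):
    """Stand-in base learner (NOT an SVM): threshold midway between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        # The bootstrap sample contained a single class: always predict it.
        only = 1 if pos else 0
        return lambda x: only
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    cut = (mean_pos + mean_neg) / 2
    pos_above = mean_pos > cut
    return lambda x: int((x > cut) == pos_above)

def bagged_models(xs, ys, fit, n_rounds=10, seed=0):
    """Train one model per round on a sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def vote(models, x):
    """Return the category predicted by the most models."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

def accuracy_percent(models, xs, ys):
    """Correctly classified items / all labeled items x 100."""
    correct = sum(vote(models, x) == y for x, y in zip(xs, ys))
    return 100.0 * correct / len(xs)
```

A training run would then be accepted once `accuracy_percent(models, xs, ys) >= 90.0` on the labeled training set.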

The classification plane used in step three to classify the Down syndrome text data is given by the following formula:

wherein $x_k$ represents the k-th item of Down syndrome text data; $y_k$ represents the category of that item, with 1 representing Down syndrome and 0 representing non-Down syndrome; $\omega$ is the normal vector of the classification plane and determines the direction of the classification plane; $b$ is the displacement term and determines the distance between the classification plane and the origin.

The values of the parameters $\omega$ and $b$ obtained from the training in step three are fed into the support vector machine (SVM) model, as shown in the following formula:

$$\text{s.t.}\quad y_k(\omega^T x_k + b) \ge 1 - \varepsilon_i$$

where $m$ denotes the number of dividing planes, here 10; $\omega$ is the normal vector of the classification plane; $b$ is the displacement term; $C$ is the penalty factor. The larger $C$ is, the more heavily misclassified samples are penalized, so the training samples are classified more accurately but the generalization ability decreases, that is, the classification accuracy on the test data drops. Conversely, decreasing $C$ allows some of the training samples to be misclassified. In the invention, $C$ is taken as 1. $\varepsilon_i$ is a manually set parameter with value range $[0, 1]$; $x_k$ is the k-th item of Down syndrome text data in the training set; $y_k$ is the category predicted by the model for that item; s.t. denotes the constraint and $T$ denotes the transpose.
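The role of $C$ and of the slack term can be illustrated with a small linear soft-margin SVM trained by batch subgradient descent. The sketch assumes the textbook objective $\frac{1}{2}\lVert\omega\rVert^2 + C\sum_k \varepsilon_k$ with $\varepsilon_k = \max(0,\, 1 - y_k(\omega^T x_k + b))$, which is not reproduced in the text above, and it internally maps category 0 to $-1$, as is conventional for SVMs. The function names, the learning rate `lr` and the number of `epochs` are hypothetical choices.

```python
def train_linear_svm(samples, labels, C=1.0, lr=0.01, epochs=200):
    """Batch subgradient descent on 0.5*||w||^2 + C * sum(max(0, 1 - y*(w.x + b)))."""
    ys = [1 if y == 1 else -1 for y in labels]   # map category 0 to -1
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0            # gradient of 0.5*||w||^2 is w
        for x, y in zip(samples, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                       # constraint violated: slack > 0
                grad_w = [g - C * y * xi for g, xi in zip(grad_w, x)]
                grad_b -= C * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def slack(w, b, x, label):
    """Slack of one sample: how far it falls short of the unit margin."""
    y = 1 if label == 1 else -1
    return max(0.0, 1 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))

def predict(w, b, x):
    """Return 1 (Down syndrome) or 0 (non-Down syndrome) from the side of the plane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0
```

A larger `C` makes each margin violation cost more relative to the $\lVert\omega\rVert^2$ term, which pushes the plane to fit the training samples more tightly.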

The invention has been verified on a data set obtained from clinical cases, which supports the reliability of the method's generalization and extension ability. By classifying prenatal screening data, the cascade feature selection algorithm can assist prenatal screening for Down syndrome.

The predicted detection rate is 81.0% higher than that obtained by using the prenatal screening risk assessment software in the current hospital, and meanwhile, the false detection rate is 9.8% lower than that obtained by using the prenatal screening risk assessment software in the hospital, so that the detection rate is improved, and the false detection rate is also reduced.

Claims

1. A Down syndrome screening system based on a cascade feature selection algorithm is characterized by comprising a data preprocessing module, a feature primary screening module, an optimal feature subset screening module and a model prediction module, wherein the data preprocessing module is used for receiving text data of a Down syndrome screening result, standardizing the data and filling missing texts in the data;

2. The system of claim 1, wherein the text data of the Down syndrome screening results received by the data preprocessing module refers to the text data of the Down syndrome screening results of pregnant women during pregnancy, each text data is regarded as a Down syndrome sample, and each Down syndrome sample comprises 58-dimensional feature samples; the data is normalized by adopting a Z-Score normalization method to normalize each dimension of the feature sample, and the formula of the Z-Score normalization is as follows:

3. The Down syndrome screening system based on cascade feature selection algorithm as claimed in claim 2, wherein the feature primary screening module selects the features related to the Down syndrome screening result by using the feature selection algorithm based on the correlation, the specific process is as follows:

step one, calculating the correlation between the feature sample of each dimension and the feature samples of other dimensions and the correlation between the feature sample of each dimension and the prediction category of Down syndrome from the normalized Down syndrome sample output by the data preprocessing module, and further obtaining two correlation matrixes;

wherein: x₁Represents all data under one dimensional feature sample, E (X)₁) Mathematical expectation, D (X), representing all data under this dimensional feature sample₁) Corresponding to the variance, X, of all data under the dimensional feature sample₂Represents all data under another dimensional feature sample, E (X)₂) Corresponding to the mathematical expectation of all data under this dimensional feature sample, D (X)₂) Corresponding to the variance of all data under the dimensional characteristic sample;

4. The Down syndrome screening system based on cascade feature selection algorithm as claimed in claim 3, wherein the optimal feature subset screening module adopts a bee colony optimization algorithm to extract the optimal features most strongly related to the Down syndrome screening result, the specific content being as follows:

5. The Down syndrome screening system based on cascade feature selection algorithm as claimed in claim 4, wherein the model prediction module employs Support Vector Machine (SVM) model, and the structure is as follows:

$$\text{s.t.}\quad y_k(\omega^T x_k + b) \ge 1 - \varepsilon_i$$

where $m$ denotes the number of dividing planes; $\omega$ is the normal vector of the classification plane; $C$ is the penalty factor, taken as 1; $\varepsilon_i$ is the slack variable, with value range $[0, 1]$; $x_k$ is the k-th Down syndrome sample; $y_k$ is the predicted category of the k-th Down syndrome sample; s.t. denotes the constraint, $T$ denotes the transpose, and $b$ is the displacement term.

6. The Down syndrome screening system based on cascade feature selection algorithm as claimed in claim 5, wherein the specific process of training SVM model is as follows: