CN114512231A - Down syndrome screening system based on cascade characteristic selection algorithm

Info

Publication number
CN114512231A
Authority
CN
China
Prior art keywords
sample
feature
down syndrome
data
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210140822.4A
Other languages
Chinese (zh)
Inventor
李玲
宋柬霏
荆瑞航
黄玉兰
张海蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yancheng Jiyan Intelligent Technology Co ltd
Original Assignee
Yancheng Jiyan Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yancheng Jiyan Intelligent Technology Co ltd filed Critical Yancheng Jiyan Intelligent Technology Co ltd
Priority to CN202210140822.4A priority Critical patent/CN114512231A/en
Publication of CN114512231A publication Critical patent/CN114512231A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of medical screening methods and in particular relates to a Down syndrome screening system based on a cascade feature selection algorithm. The system combines a correlation-based feature selection algorithm (CFS), a bee colony optimization algorithm (BSO) and a support vector machine (SVM) machine learning model, and comprises a data preprocessing module, a feature primary screening module, an optimal feature subset screening module and a model prediction module. It learns from and classifies prenatal screening data sets, thereby improving diagnostic accuracy and identifying the variables most strongly related to the outcome.

Description

Down syndrome screening system based on cascade characteristic selection algorithm
Technical Field
The invention belongs to the technical field of medical screening methods, and particularly relates to a Down syndrome screening system based on a cascade characteristic selection algorithm.
Background
Down syndrome, also known as trisomy 21, is a disease caused by a chromosomal abnormality. In China, about 14.7 of every 10000 live-born infants are affected. About 60% of affected fetuses are miscarried in early pregnancy, and survivors show marked intellectual disability, characteristic facial features, growth and developmental disorders and multiple malformations. No effective treatment for Down syndrome currently exists, so prenatal screening is an effective measure for preventing the birth of affected infants. At present, Down syndrome screening mainly measures the levels of chorionic gonadotropin (HCG), alpha-fetoprotein (AFP) and free estriol (FE3) in maternal serum, combined with karyotype analysis of the pregnant woman's peripheral blood cells and chromosome examination of amniotic fluid cells.
Before 2012, prenatal screening programs in China usually relied on amniocentesis or chorionic villus sampling (CVS), which are regarded as the 'gold standard' for detecting chromosomal abnormalities. However, these methods are invasive and carry a certain risk of infection. In recent years, non-invasive prenatal DNA testing (NIPT) has attracted attention in this field. NIPT is a newer genetic test for screening birth defects and genetic diseases, and is usually offered as a further screen to pregnant women identified as high risk by serum screening. Although its results are accurate, it is time-consuming and costly, so it cannot yet be offered to the general population.
In recent years, with the development of machine learning techniques, machine learning methods have been widely used for cancer diagnosis and for predicting other common diseases. Accurate computer-aided diagnosis helps speed up the diagnosis of disease, reduces doctors' workload, improves working efficiency and yields more accurate and efficient diagnostic results.
Prenatal screening data are a relatively special kind of medical data: they are high-dimensional and their features are correlated. For these reasons, the use of machine learning in Down syndrome screening has rarely been reported. The related literature considers only a small number of feature dimensions and fails to take full account of important features related to the screening result. Classification on high-dimensional data with correlated features is much more difficult than classification on low-dimensional data. Traditional machine learning models classify such data poorly and are difficult to apply to Down syndrome screening.
The fused feature selection method is a feature selection algorithm suited to high-dimensional data sets with correlated features. In essence, it weighs the strengths and weaknesses of different feature selection models and combines two different feature selection methods to select the optimal feature subset. The two methods complement each other, and the combined algorithm shows greatly improved subset evaluation capability and classification accuracy. At present the method is mostly applied in industry and has not been applied to Down syndrome screening.
Disclosure of Invention
To overcome the above problems, the invention provides a Down syndrome screening system based on a cascade feature selection algorithm. The system is built on a correlation-based feature selection algorithm (CFS), a bee colony optimization algorithm (BSO) and a support vector machine (SVM) machine learning model, and comprises a data preprocessing module, a feature primary screening module, an optimal feature subset screening module and a model prediction module. It learns from and classifies a prenatal screening data set, thereby improving diagnostic accuracy and identifying the variables most strongly related to the outcome.
A Down syndrome screening system based on a cascade feature selection algorithm comprises a data preprocessing module, a feature primary screening module, an optimal feature subset screening module and a model prediction module, wherein the data preprocessing module receives the text data of Down syndrome screening results, standardizes the data and fills in the missing values in the data;
the feature primary screening module uses a correlation-based feature selection algorithm to select, from the text data processed by the data preprocessing module, the features related to the Down syndrome screening result;
the optimal feature subset screening module uses a bee colony optimization algorithm to further screen the features selected by the feature primary screening module, and extracts the optimal features most strongly correlated with the Down syndrome screening result;
and the model prediction module uses a support vector machine (SVM) model to perform screening prediction on the optimal features extracted by the optimal feature subset screening module, and outputs a prediction result.
The text data received by the data preprocessing module are the Down syndrome screening results of pregnant women during gestation. The text data of each result is treated as one Down syndrome sample, and each Down syndrome sample contains feature samples in 58 dimensions. The data are standardized by applying Z-Score normalization to the feature sample of each dimension, using the formula:
$$x_j = \frac{x_i - \mu}{\sigma}$$
where $x_j$ is the normalized feature sample, $x_i$ is the original feature sample, $\mu$ is the mean of all data in the feature sample of that dimension, and $\sigma$ is the standard deviation of all data in the feature sample of that dimension;
if a feature sample contains missing data, the missing feature data are filled with a specific value before Z-Score normalization is applied: continuous data are filled with the median, and discrete data are filled with the mode.
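The preprocessing described above (median or mode filling followed by Z-Score normalization) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of `None` to mark a missing entry, the toy columns and the choice of the population standard deviation are all assumptions.

```python
from statistics import mean, median, mode, pstdev

def fill_missing(column, continuous):
    """Replace None entries with the median (continuous) or the mode (discrete)."""
    observed = [v for v in column if v is not None]
    fill = median(observed) if continuous else mode(observed)
    return [fill if v is None else v for v in column]

def z_score(column):
    """Standardize one feature column: (x - mean) / standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mu) / sigma for x in column]

# Hypothetical toy columns; None marks a missing entry.
afp = fill_missing([1.0, None, 3.0, 2.0], continuous=True)    # filled with the median
discrete_flag = fill_missing([0, 1, None, 0], continuous=False)  # filled with the mode
afp_std = z_score(afp)
```

Filling happens before standardization, as in the text, so the filled value contributes to the mean and standard deviation of its column.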
The feature primary screening module selects the features related to the Down syndrome screening result using a correlation-based feature selection algorithm. The specific process is as follows:
step one: from the normalized Down syndrome samples output by the data preprocessing module, calculate the correlation between the feature sample of each dimension and the feature samples of the other dimensions, and the correlation between the feature sample of each dimension and the Down syndrome prediction category, thereby obtaining two correlation matrices;
the correlation between the feature sample of each dimension and the feature samples of the other dimensions is calculated as follows:
$$r_{X_1 X_2} = \frac{E(X_1 X_2) - E(X_1)\,E(X_2)}{\sqrt{D(X_1)}\,\sqrt{D(X_2)}}$$
where $X_1$ is all data under the feature sample of one dimension, $E(X_1)$ is the mathematical expectation of all data under that feature sample, and $D(X_1)$ is the variance of all data under that feature sample; $X_2$ is all data under the feature sample of another dimension, $E(X_2)$ is the mathematical expectation of all data under that feature sample, and $D(X_2)$ is the variance of all data under that feature sample;
the correlation between the feature sample of each dimension and the Down syndrome prediction category is calculated as follows:
$$r_{XY} = \frac{E(XY) - E(X)\,E(Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}}$$
where $X$ is all data under the feature sample of one dimension, $E(X)$ is the mathematical expectation of all data under that feature sample, and $D(X)$ is the variance of all data under that feature sample; $Y$ is the diagnostic outcome corresponding to each entry of that feature sample (1 for Down syndrome, 0 for non-Down syndrome), $E(Y)$ is the mathematical expectation of all data in the column of diagnostic outcomes, and $D(Y)$ is the variance of all data in the column of diagnostic outcomes;
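As an illustration of step one, the sketch below computes both correlation matrices, assuming the correlation is the Pearson coefficient written in terms of the expectation E(·) and the variance D(·). The function names, the dictionary layout and the handling of constant columns are hypothetical choices, not taken from the patent.

```python
from math import sqrt

def expectation(values):
    """E(X): arithmetic mean of a column."""
    return sum(values) / len(values)

def variance(values):
    """D(X): population variance of a column."""
    mu = expectation(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

def correlation(x, y):
    """Pearson correlation: (E(XY) - E(X)E(Y)) / (sqrt(D(X)) * sqrt(D(Y)))."""
    denom = sqrt(variance(x)) * sqrt(variance(y))
    if denom == 0:
        return 0.0  # constant column treated as uncorrelated (an assumption)
    e_xy = expectation([a * b for a, b in zip(x, y)])
    return (e_xy - expectation(x) * expectation(y)) / denom

def correlation_matrices(features, outcome):
    """features: dict of name -> column; outcome: list of 0/1 diagnostic labels."""
    names = list(features)
    feature_feature = {(a, b): correlation(features[a], features[b])
                       for a in names for b in names}
    feature_class = {a: correlation(features[a], outcome) for a in names}
    return feature_feature, feature_class

# Hypothetical toy data: two features and a binary outcome.
ff, fc = correlation_matrices({"f1": [1, 2, 3, 4], "f2": [4, 3, 2, 1]}, [0, 0, 1, 1])
```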
step two: search for the feature subset using best-first search, as follows:
first, an empty set M is given and the estimated value (merit) of the feature sample of each dimension is calculated. The feature sample with the largest estimated value is placed in M. The feature sample with the second-largest estimated value is then placed in M, forming a combined feature sample, and the estimated value of the combination is calculated. If it is smaller than the estimated value of the feature sample with the largest estimated value alone, the feature sample with the second-largest estimated value is removed from M; otherwise it is kept in M;
next, the feature sample with the third-largest estimated value is placed in M, forming a combined feature sample with the feature samples retained in M at that point, and the estimated value of the combination is calculated. If it is smaller than the estimated value of the combination that existed before the new feature sample was added, the newly added feature sample is removed from M; otherwise it is kept in M. The process continues in this way until the feature samples of all dimensions have been processed, yielding the feature sample combination with the largest estimated value. The estimated value merit is calculated as follows:
$$\mathrm{merit} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$
where $k$ is the number of feature samples in the feature sample combination; $\overline{r_{cf}}$ is the average correlation between the feature samples in the combination and the Down syndrome prediction category; and $\overline{r_{ff}}$ is the average correlation between the feature samples in the combination and one another.
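The search in step two can be sketched as a greedy forward pass over features ranked by their individual merit. The sketch assumes the standard CFS merit, k·mean(r_cf) / sqrt(k + k(k−1)·mean(r_ff)), and uses absolute correlation values, which is a common convention but an assumption here. The names `merit`, `forward_select`, the `frozenset` keys and the toy correlations are hypothetical.

```python
from math import sqrt

def merit(subset, r_cf, r_ff):
    """CFS merit: k * mean(r_cf) / sqrt(k + k*(k-1) * mean(r_ff))."""
    k = len(subset)
    mean_cf = sum(abs(r_cf[f]) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    mean_ff = sum(abs(r_ff[frozenset(p)]) for p in pairs) / len(pairs) if pairs else 0.0
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)

def forward_select(r_cf, r_ff):
    """Add features in descending order of individual merit; keep each one
    only if the merit of the combination does not decrease."""
    ranked = sorted(r_cf, key=lambda f: merit([f], r_cf, r_ff), reverse=True)
    selected = [ranked[0]]
    best = merit(selected, r_cf, r_ff)
    for f in ranked[1:]:
        candidate = merit(selected + [f], r_cf, r_ff)
        if candidate >= best:
            selected.append(f)
            best = candidate
    return selected, best

# Hypothetical correlations: "a" and "b" are informative and weakly related,
# "c" carries little information about the outcome.
r_cf = {"a": 0.8, "b": 0.7, "c": 0.1}
r_ff = {frozenset(("a", "b")): 0.1,
        frozenset(("a", "c")): 0.0,
        frozenset(("b", "c")): 0.0}
selected, best = forward_select(r_cf, r_ff)
```

In this toy case "b" is kept because adding it raises the merit, while "c" is dropped because it dilutes the average feature-to-outcome correlation.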
The optimal feature subset screening module extracts the optimal features most strongly correlated with the Down syndrome screening result using a bee colony optimization algorithm, as follows:
first, a portion of the feature samples in the feature sample combination with the largest estimated value output by the feature primary screening module is randomly designated for searching, and the quality of each search result is judged by its fitness. By traversing all feature samples in that combination, the feature sample subset with the largest fitness is finally obtained. The fitness is calculated as follows:
[Formula image BDA0003506829730000051 not reproduced in this text: the fitness, expressed in terms of TP, FN, FP and TN]
where TP denotes the samples predicted positive whose actual Down syndrome result is positive; FN denotes the samples predicted negative whose actual Down syndrome result is positive; FP denotes the samples predicted positive whose actual Down syndrome result is negative; and TN denotes the samples predicted negative whose actual Down syndrome result is negative.
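The four quantities TP, FN, FP and TN can be counted as in the sketch below. Because the fitness formula itself is not reproduced in this text, the `accuracy_fitness` function uses plain accuracy, (TP + TN) / (TP + FN + FP + TN), purely as a stand-in assumption; the function names and toy labels are likewise hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for labels where 1 = Down syndrome, 0 = non-Down syndrome."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fn, fp, tn

def accuracy_fitness(actual, predicted):
    """Stand-in fitness (an assumption): fraction of samples classified correctly."""
    tp, fn, fp, tn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + fn + fp + tn)

# Hypothetical labels for five samples.
actual = [1, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0]
counts = confusion_counts(actual, predicted)
fitness = accuracy_fitness(actual, predicted)
```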
The model prediction module adopts a support vector machine (SVM) model with the following structure:
$$\min_{\omega,\,b,\,\varepsilon}\ \frac{1}{2}\lVert\omega\rVert^{2} + C\sum_{i=1}^{m}\varepsilon_i$$
$$\text{s.t.}\quad y_k\left(\omega^{T}x_k + b\right) \ge 1 - \varepsilon_i$$
where $m$ is the number of division planes; $\omega$ is the normal vector of the classification plane; $C$ is the penalty factor, set to 1; $\varepsilon_i$ is the slack variable, with value range [0, 1]; $x_k$ is the k-th Down syndrome sample; $y_k$ is the predicted category of the k-th Down syndrome sample; s.t. denotes the constraint, $T$ denotes transpose, and $b$ is the shift term.
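To make the roles of the normal vector, the shift term, the penalty factor C and the slack variables concrete, the sketch below evaluates a soft-margin objective for a fixed classification plane. It assumes the standard soft-margin form, ½‖ω‖² + C·Σε, with the slack of each sample taken as max(0, 1 − y(ωᵀx + b)) and with labels coded as +1 and −1; summing the slack over samples rather than over division planes is also an assumption. The function name and the toy data are hypothetical.

```python
def soft_margin_objective(w, b, samples, labels, C=1.0):
    """Return (0.5*||w||^2 + C*sum(slack), slacks) for a fixed plane (w, b).

    The slack of a sample is the smallest value satisfying
    y * (w.x + b) >= 1 - slack, i.e. max(0, 1 - y * (w.x + b)).
    """
    norm_sq = sum(wi * wi for wi in w)
    slacks = []
    for x, y in zip(samples, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    return 0.5 * norm_sq + C * sum(slacks), slacks

# Hypothetical 2-D samples with labels coded as +1 / -1.
samples = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
labels = [1, 1, -1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, samples, labels, C=1.0)
```

Only the second sample lies inside the margin, so it alone contributes slack; a larger C would weight that violation more heavily in the objective.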
The SVM model is trained as follows:
step one: input the selected Down syndrome samples into the data preprocessing module, and take the samples processed by the data preprocessing module as the training set;
step two: manually label each Down syndrome sample in the training set as non-Down syndrome or Down syndrome, obtaining a labeled training set;
step three: draw 43627 Down syndrome samples with replacement from the labeled training set of step two, and divide these 43627 samples into Down syndrome and non-Down syndrome with a classification plane; the classification plane refers to the following part of the SVM model:
[Formula image BDA0003506829730000061 not reproduced in this text: the classification-plane term of the SVM model]
step four: repeat step three 10 times. After these 10 divisions, the classification planes yield 10 classification results for each Down syndrome sample; the 10 results for each sample are put to a vote, and the class with the most votes is designated the final output for that sample;
when the accuracy of the SVM model in classifying the data in the labeled training set reaches 90%, the trained SVM model is obtained. The accuracy is (number of Down syndrome samples in the labeled training set that the model classifies correctly / number of all Down syndrome samples in the manually labeled training set) × 100%.
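Steps three and four describe sampling with replacement followed by a 10-round majority vote. The sketch below illustrates that flow only: `fit_threshold` is a hypothetical stand-in for training one SVM classification plane, and the one-dimensional toy data, the function names and the fixed random seed are assumptions.

```python
import random
from collections import Counter

def bootstrap(samples, n, rng):
    """Draw n labeled samples with replacement."""
    return [rng.choice(samples) for _ in range(n)]

def fit_threshold(train):
    """Hypothetical stand-in for SVM training: cut at the mean feature value."""
    cut = sum(x for x, _ in train) / len(train)
    return lambda x: 1 if x >= cut else 0

def vote_predict(train, x, rounds=10, n_draw=None, seed=0):
    """Train one model per bootstrap draw and return the majority class for x."""
    rng = random.Random(seed)
    n_draw = n_draw or len(train)
    votes = [fit_threshold(bootstrap(train, n_draw, rng))(x) for _ in range(rounds)]
    return Counter(votes).most_common(1)[0][0]

def accuracy_percent(train, **kwargs):
    """Share of labeled samples whose voted class matches the manual label, in %."""
    correct = sum(1 for x, y in train if vote_predict(train, x, **kwargs) == y)
    return 100.0 * correct / len(train)

# Hypothetical labeled training set: (feature value, label).
labeled = [(0.0, 0), (0.2, 0), (0.9, 1), (1.1, 1)]
high = vote_predict(labeled, 2.0)
low = vote_predict(labeled, -1.0)
trained = accuracy_percent(labeled) >= 90.0  # the 90% acceptance check from the text
```

With an even number of rounds a 5-to-5 tie is possible; this sketch then returns whichever class was voted first, a detail the text does not specify.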
Compared with the prior art, the invention has the beneficial effects that:
1. The Down syndrome screening method based on the cascade feature selection algorithm adopts a fused feature selection method that combines a correlation-based feature selection algorithm with a bee colony optimization algorithm to select the optimal feature subset. The fused algorithm combines the low time complexity of the correlation-based feature selection algorithm with the bee colony optimization algorithm's full consideration of the correlations among features, greatly improving subset evaluation capability and classification accuracy. The fused algorithm is also successfully applied to Down syndrome screening.
2. After the optimal feature subset is selected by the fused feature selection method, the Down syndrome screening method based on the cascade feature selection algorithm applies a support vector machine (SVM) classification model to the screening and prediction of Down syndrome. The predicted detection rate is 81.0% higher than that obtained by using prenatal screening risk assessment software in the existing hospital, while the false detection rate is 9.8% lower than that obtained with the hospital's prenatal screening risk assessment software; the detection rate is thus improved and the false detection rate reduced.
Detailed Description
The Down syndrome screening method applies feature selection algorithms to the screening and prediction of Down syndrome. Given the high dimensionality and feature correlation of the data, it selects a correlation-based feature selection algorithm and a bee colony optimization algorithm and combines them through the fused feature selection method to screen out the optimal feature subset. Finally, combined with a support vector machine (SVM) model, the Down syndrome screening method achieves higher prediction accuracy and can identify the predictors most strongly correlated with the diagnosis of Down syndrome.
Example 1
A Down syndrome screening system based on a cascade feature selection algorithm comprises a data preprocessing module, a feature primary screening module, an optimal feature subset screening module and a model prediction module, wherein the data preprocessing module receives the text data of Down syndrome screening results, standardizes the data and fills in the missing values in the data;
the feature primary screening module uses a correlation-based feature selection algorithm (CFS) to select, from the text data processed by the data preprocessing module, the features related to the Down syndrome screening result;
the CFS algorithm is used first to filter out features irrelevant to the outcome, because the BSO algorithm used by the next module has an extremely long run time; filtering out features that are clearly unrelated to the outcome in advance reduces the execution time of the BSO algorithm. This choice is made mainly to save time;
the optimal feature subset screening module uses a bee colony optimization algorithm (BSO) to further screen the features selected by the feature primary screening module, and extracts the optimal features most strongly correlated with the Down syndrome screening result;
and the model prediction module uses a support vector machine (SVM) model to perform screening prediction on the optimal features extracted by the optimal feature subset screening module, and outputs a prediction result.
The text data received by the data preprocessing module are the Down syndrome screening results of pregnant women during gestation. The text data of each result is treated as one Down syndrome sample, and each Down syndrome sample contains feature samples in 58 dimensions. Standardizing the data scales them into a specific interval, which eliminates the influence of differences in units and distribution among the features so that the machine learning model treats all features equally. Z-Score normalization is applied to the feature sample of each dimension, using the formula:
$$x_j = \frac{x_i - \mu}{\sigma}$$
where $x_j$ is the normalized feature sample of one dimension, $x_i$ is the original feature sample of that dimension, $\mu$ is the mean of all data in the feature sample of that dimension, and $\sigma$ is the standard deviation of all data in the feature sample of that dimension. The feature sample of one dimension is one column of data; the standard deviation is computed per column, so there are as many standard deviations as there are columns (features) of data.
Some feature values are missing because information was entered carelessly or because the patient did not undergo the examination. Missing-value filling is used to solve this problem, that is, the missing data are filled with a specific value.
If the feature sample of one dimension contains missing data, the missing feature data are filled with a specific value before Z-Score normalization is applied. Continuous data are filled with the median (the non-missing data in the feature sample of that dimension are sorted, the median is found, and the median is used as the missing value). Discrete data are filled with the mode (the value that occurs most often in the feature sample of that dimension is used as the missing value).
The feature primary screening module selects the features related to the Down syndrome screening result using a correlation-based feature selection algorithm (CFS). The specific process is as follows:
step one: from the normalized Down syndrome samples output by the data preprocessing module, calculate the correlation between the feature sample of each dimension and the feature samples of the other dimensions, and the correlation between the feature sample of each dimension and the Down syndrome prediction category (that is, whether the feature is a typical symptom of Down syndrome), thereby obtaining two correlation matrices;
the correlation between the feature sample of each dimension and the feature samples of the other dimensions is calculated as follows:
$$r_{X_1 X_2} = \frac{E(X_1 X_2) - E(X_1)\,E(X_2)}{\sqrt{D(X_1)}\,\sqrt{D(X_2)}}$$
where $X_1$ is all data under the feature sample of one dimension, $E(X_1)$ is the mathematical expectation of all data under that feature sample, and $D(X_1)$ is the variance of all data under that feature sample; $X_2$ is all data under the feature sample of another dimension, $E(X_2)$ is the mathematical expectation of all data under that feature sample, and $D(X_2)$ is the variance of all data under that feature sample;
the correlation between the feature sample of each dimension and the Down syndrome prediction category is calculated as follows:
$$r_{XY} = \frac{E(XY) - E(X)\,E(Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}}$$
where $X$ is all data under the feature sample of one dimension, $E(X)$ is the mathematical expectation of all data under that feature sample, and $D(X)$ is the variance of all data under that feature sample; $Y$ is the diagnostic outcome corresponding to each entry of that feature sample (1 for Down syndrome, 0 for non-Down syndrome), $E(Y)$ is the mathematical expectation of all data in the column of diagnostic outcomes, and $D(Y)$ is the variance of all data in the column of diagnostic outcomes;
the features of the Down syndrome text data are correlated with the feature of each dimension to some degree, and the correlation matrix records the value of that degree of correlation.
Step two: search for the feature subset using best-first search, as follows:
first, an empty set M is given and the estimated value (merit) of the feature sample of each dimension is calculated. The feature sample with the largest estimated value is placed in M. The feature sample with the second-largest estimated value is then placed in M, forming a combined feature sample, and the estimated value of the combination is calculated. If it is smaller than the estimated value of the feature sample with the largest estimated value alone, the feature sample with the second-largest estimated value is removed from M; otherwise it is kept in M;
next, the feature sample with the third-largest estimated value is placed in M, forming a combined feature sample with the feature samples retained in M at that point, and the estimated value of the combination is calculated. If it is smaller than the estimated value of the combination that existed before the new feature sample was added, the newly added feature sample is removed from M; otherwise it is kept in M. The process continues in this way until the feature samples of all dimensions have been processed, yielding the feature sample combination with the largest estimated value. The estimated value merit is calculated as follows:
$$\mathrm{merit} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$
where $k$ is the number of feature samples in the feature sample combination; $\overline{r_{cf}}$ is the average correlation between the feature samples in the combination and the Down syndrome prediction category; and $\overline{r_{ff}}$ is the average correlation between the feature samples in the combination and one another.
The estimated value can be understood as the diagnostic accuracy obtained with the features in question. If the diagnostic accuracy obtained with features A and B together is lower than that obtained with feature A alone, feature B is removed, and the remaining features are considered and compared in turn.
Proceeding in sequence means the following. Among single features, the one with the highest diagnostic accuracy is selected and retained. The feature with the second-highest diagnostic accuracy is then added: if the two features together give higher diagnostic accuracy than the first feature alone, the second feature is kept, and the feature with the third-highest diagnostic accuracy is added for comparison; if the two features together give lower diagnostic accuracy than the first feature alone, the second feature is not kept, and the feature with the third-highest diagnostic accuracy is added next. The remaining features are handled in the same way.
The optimal feature subset screening module uses a bee swarm optimization (BSO) algorithm. The algorithm treats each feature as a nectar source: the bees search the features and return each search result, fitness is used to judge the quality of each result, and after repeated updating and iteration the feature subset with the largest fitness is returned. The optimal features most strongly correlated with the Down syndrome screening result are extracted as follows:
First, a portion of the feature samples in the feature sample combination with the largest estimated value (output by the feature preliminary screening module) is randomly designated for searching, and the quality of each search result is judged by its fitness. Finally, by traversing all feature samples in that combination, the feature sample subset with the largest fitness is obtained. The fitness is calculated according to the following formula:
Figure BDA0003506829730000101
where TP denotes the samples whose predicted result and actual result for Down syndrome are both positive; FN denotes the samples whose predicted result is negative but whose actual result is positive; FP denotes the samples whose predicted result is positive but whose actual result is negative; and TN denotes the samples whose predicted result and actual result are both negative.
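The four counts defined above can be obtained from paired actual and predicted labels as in the following sketch. The fitness formula itself appears only as a figure and is not reproduced in this text, so the `fitness` function below uses plain accuracy, (TP + TN) / (TP + FN + FP + TN), purely as an assumption; the function names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for 0/1 labels (1 = Down syndrome, 0 = non-Down syndrome)."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1   # predicted positive, actually positive
        elif a == 1 and p == 0:
            fn += 1   # predicted negative, actually positive
        elif a == 0 and p == 1:
            fp += 1   # predicted positive, actually negative
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fn, fp, tn

def fitness(tp, fn, fp, tn):
    """ASSUMPTION: accuracy stands in for the fitness formula shown in the figure."""
    total = tp + fn + fp + tn
    return (tp + tn) / total if total else 0.0
```

Any other score built from the same four counts could be substituted in `fitness` without changing how the counts are gathered.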
The model prediction module adopts a Support Vector Machine (SVM) model, and the structure is as follows:
Figure BDA0003506829730000102
s.t. $y_k(\omega^T x_k + b) \geq 1 - \varepsilon_i$
where m denotes the number of dividing planes, here 10; $\omega$ is the normal vector of the classification plane; and C is the penalty factor. A larger C penalizes misclassified samples more heavily, which raises accuracy on the training samples but reduces generalization ability, that is, classification accuracy on the test data falls. Conversely, a smaller C tolerates some misclassified samples in the training set. In the present invention, C is set to 1. $\varepsilon_i$ is a slack variable, a manually set parameter with value range [0, 1]; $x_k$ is the kth Down syndrome sample; $y_k$ is the category of the kth Down syndrome sample predicted by the model; s.t. denotes the constraint; T denotes transpose; and b is the displacement term.
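The objective function in the figure above is not reproduced in this text. For orientation only, the textbook soft-margin SVM primal problem whose constraint has the same shape is given below. It is an assumption that the patent's objective takes this form; the textbook version indexes one slack variable per training sample (n samples), requires each slack variable to be non-negative, and uses labels $y_k \in \{-1, +1\}$.

```latex
\min_{\omega,\, b,\, \varepsilon} \quad \frac{1}{2}\lVert \omega \rVert^{2} + C \sum_{k=1}^{n} \varepsilon_k
\qquad \text{s.t.} \quad y_k\left(\omega^{T} x_k + b\right) \ge 1 - \varepsilon_k, \quad \varepsilon_k \ge 0, \quad k = 1, \dots, n
```

The first term widens the margin and the second term, weighted by C, charges for margin violations, which is the trade-off between training accuracy and generalization described above.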
The specific process of training the SVM model is as follows:
Step one: the selected Down syndrome samples are input into the data preprocessing module, and the samples processed by the data preprocessing module are taken as the training set;
Step two: each Down syndrome sample in the training set is manually labeled as non-Down syndrome or Down syndrome to obtain a labeled training set;
Step three: 43627 Down syndrome samples are drawn with replacement from the labeled training set of step two, and a classification plane divides these 43627 samples into Down syndrome and non-Down syndrome; the classification plane is a part of the support vector machine (SVM) model, and its formula is as follows:
Figure BDA0003506829730000111
where $x_k$ denotes the kth Down syndrome sample; $y_k$ denotes the category corresponding to the kth Down syndrome sample; $\omega$ is the normal vector of the classification plane and determines its direction; and b is the displacement term, which determines the distance between the classification plane and the origin;
Step four: step three is repeated 10 times, so that the classification plane produces 10 classification results for each Down syndrome sample; the 10 results for each sample are then put to a vote, and the category with the most votes is designated as the final output for that sample.
When the accuracy of the SVM model in classifying the data in the labeled training set reaches 90%, the trained SVM model is obtained. Here the accuracy is the number of Down syndrome samples in the labeled training set that the model classifies correctly, divided by the total number of Down syndrome samples in the manually labeled training set, multiplied by 100%.
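Steps three and four amount to training on repeated samples drawn with replacement and taking a majority vote. The sketch below illustrates that control flow only. The base learner `train_threshold` is a deliberately trivial stand-in for a one-dimensional feature and is not an SVM; `bootstrap_vote`, `train_fn`, and the default seed are hypothetical names and values.

```python
import random
from collections import Counter

def bootstrap_vote(train_x, train_y, test_x, train_fn, rounds=10, seed=0):
    """Train `rounds` models on samples drawn with replacement, then majority-vote."""
    rng = random.Random(seed)
    n = len(train_x)
    models = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        models.append(train_fn([train_x[i] for i in idx],
                               [train_y[i] for i in idx]))
    results = []
    for x in test_x:
        votes = Counter(model(x) for model in models)
        # most_common breaks ties by first-seen order; 10 rounds can tie 5-5
        results.append(votes.most_common(1)[0][0])
    return results

def train_threshold(xs, ys):
    """Stand-in base learner (NOT an SVM): cut halfway between the class means.
    Assumes positives have larger feature values than negatives."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:  # degenerate resample containing a single class
        majority = 1 if len(pos) >= len(neg) else 0
        return lambda x: majority
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0
```

In practice `train_fn` would fit an SVM on the resampled data and return its prediction function; the voting logic stays the same.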
Example 2
A Down syndrome screening system based on a cascade feature selection algorithm specifically comprises a data preprocessing module, a feature primary screening module, an optimal feature subset screening module and a model prediction module, wherein:
The data preprocessing module cleans the Down syndrome text data; specifically, it fills missing values and standardizes the data. After processing, 43627 pieces of Down syndrome text data and 58 feature dimensions that may be related to Down syndrome were obtained.
The feature preliminary screening module uses correlation-based feature selection (CFS). The algorithm proceeds as follows: first, the feature-to-category correlation matrix and the feature-to-feature correlation matrix are calculated from the Down syndrome text data; the feature subsets are then searched using best-first search. The quality of each feature subset is evaluated with the estimated value (merit), and the feature subset with the highest estimated value is finally selected.
The optimal feature subset screening module uses a bee swarm optimization (BSO) algorithm. The algorithm treats each feature as a nectar source: the bees search the features and return each search result, fitness is used to judge the quality of each result, and after repeated updating and iteration the feature subset with the largest fitness is returned.
The model prediction module adopts a Support Vector Machine (SVM) model, and text data of the Down syndrome processed by the data preprocessing module are sent to the trained SVM model for model prediction to obtain a final prediction result.
The data preprocessing module performs missing value filling and standardization of the Down syndrome text data. Missing value filling replaces missing data with a specific value: continuous data are filled with the median, and discrete data are filled with the mode. Data standardization scales the data into a specific range to eliminate the influence of differences in dimension and distribution between features, so that the machine learning model treats all features equally. The Down syndrome text data are standardized with the Z-Score method, whose formula is as follows:
Figure BDA0003506829730000121
where $x_j$ denotes the normalized one-dimensional feature sample, $x_i$ denotes the original one-dimensional feature sample, $\mu$ is the mean of all data in that feature dimension, and $\sigma$ is the standard deviation of all data in that feature dimension. A one-dimensional feature sample is one column of data; the standard deviation is computed per column, so there are as many standard deviations as there are columns (features).
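The preprocessing just described can be sketched for a single column as follows. This is an illustrative sketch only: missing entries are represented by `None`, the population standard deviation is used for σ (an assumption, since the text does not say whether the population or sample form is meant), and the function names are hypothetical.

```python
import statistics

def fill_missing(column, discrete=False):
    """Replace None with the mode (discrete data) or the median (continuous data)."""
    observed = [v for v in column if v is not None]
    fill = statistics.mode(observed) if discrete else statistics.median(observed)
    return [fill if v is None else v for v in column]

def z_score(column):
    """Z-Score standardization of one column: (x - mu) / sigma."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)  # ASSUMPTION: population standard deviation
    if sigma == 0:
        return [0.0 for _ in column]   # constant column carries no information
    return [(v - mu) / sigma for v in column]
```

Filling is applied before standardization, so that the mean and standard deviation are computed on complete columns, one column (feature) at a time.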
The feature preliminary screening module uses the correlation-based feature selection (CFS) algorithm, which proceeds as follows: first, the feature-to-category correlation matrix and the feature-to-feature correlation matrix are calculated from the Down syndrome text data, and the feature subsets are then searched using best-first search. In best-first search, an empty set M is given, each feature is placed into M in turn, and the estimated value (merit) of each feature is calculated. The feature with the largest estimated value is selected to enter M, and then the feature with the second-largest estimated value is selected to enter M. If the estimated value of the two features together is smaller than the original estimated value, the feature with the second-largest estimated value is removed, and the search moves on to the next feature. The search progresses in this way until the feature combination that maximizes the estimated value is found. The estimated value (merit) of a feature set is defined by the following formula:
Figure BDA0003506829730000131
where k is the number of features in the current feature set; the quantity shown in Figure BDA0003506829730000132 is the average correlation between each feature in the set and the Down syndrome prediction category; and the quantity shown in Figure BDA0003506829730000133 is the average correlation between the features in the set.
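The merit formula itself appears only as a figure. Assuming it follows the standard CFS definition, with $\overline{r_{cf}}$ the average feature-to-category correlation and $\overline{r_{ff}}$ the average feature-to-feature correlation (both symbol names are assumptions), it can be written and evaluated for illustrative numbers as follows.

```latex
\mathrm{merit} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}

% Worked example with illustrative values only:
% k = 3, \overline{r_{cf}} = 0.5, \overline{r_{ff}} = 0.2
\mathrm{merit} = \frac{3 \times 0.5}{\sqrt{3 + 3 \times 2 \times 0.2}} = \frac{1.5}{\sqrt{4.2}} \approx 0.732
```

Under this form, the merit rises when features correlate strongly with the category and falls when they are redundant with one another, which is why a newly added feature can lower the estimated value of the combination.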
The optimal feature subset screening module uses a bee swarm optimization (BSO) algorithm, which treats each feature as a nectar source; the bees search the features and return each search result. First, the features to be searched are randomly designated and the quality of each search result is judged by its fitness; finally, after all features have been traversed, the feature subset with the largest fitness is returned.
The fitness (fitness) formula is defined as follows:
Figure BDA0003506829730000134
wherein, TP represents a sample with positive prediction result and positive actual result of Down syndrome; FN represents samples with negative prediction results but positive actual results of Down syndrome; FP represents a sample with positive prediction result but negative actual result of Down syndrome; TN represents the sample with negative prediction result and negative actual result of Down syndrome.
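A search of this kind can be sketched as a simplified bee-colony-style loop over feature masks with employed, onlooker, and scout phases. This is a generic illustration and not the patent's exact BSO procedure: the function name, every parameter default, and the single-bit neighbourhood move are assumptions, and `fitness_fn` is a caller-supplied function assumed to return a non-negative score for a mask.

```python
import random

def bee_colony_select(n_features, fitness_fn, n_sources=6, n_iter=30, limit=5, seed=0):
    """Return (best_mask, best_score); a mask is a list of booleans, one per feature."""
    rng = random.Random(seed)

    def random_mask():
        mask = [rng.random() < 0.5 for _ in range(n_features)]
        if not any(mask):
            mask[rng.randrange(n_features)] = True  # never evaluate an empty subset
        return mask

    def neighbour(mask):
        new = mask[:]
        j = rng.randrange(n_features)
        new[j] = not new[j]                         # move one feature in or out
        if not any(new):
            new[j] = True
        return new

    sources = [random_mask() for _ in range(n_sources)]
    scores = [fitness_fn(m) for m in sources]
    trials = [0] * n_sources
    best_idx = max(range(n_sources), key=scores.__getitem__)
    best_mask, best_score = sources[best_idx][:], scores[best_idx]

    def try_improve(i):
        nonlocal best_mask, best_score
        cand = neighbour(sources[i])
        s = fitness_fn(cand)
        if s > scores[i]:                           # greedy replacement of the source
            sources[i], scores[i], trials[i] = cand, s, 0
            if s > best_score:
                best_mask, best_score = cand[:], s
        else:
            trials[i] += 1

    for _ in range(n_iter):
        for i in range(n_sources):                  # employed bees: one try per source
            try_improve(i)
        weights = [s + 1e-9 for s in scores]        # onlooker bees favour fitter sources
        for i in rng.choices(range(n_sources), weights=weights, k=n_sources):
            try_improve(i)
        for i in range(n_sources):                  # scout bees: abandon stale sources
            if trials[i] > limit:
                sources[i] = random_mask()
                scores[i] = fitness_fn(sources[i])
                trials[i] = 0
                if scores[i] > best_score:
                    best_mask, best_score = sources[i][:], scores[i]
    return best_mask, best_score
```

In the setting described here, `fitness_fn` would train and evaluate a classifier on the features selected by the mask and return the fitness computed from TP, FN, FP, and TN.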
The model prediction module adopts a Support Vector Machine (SVM) model, and the specific process of model training is as follows:
Step one: the selected Down syndrome text data are input into the data preprocessing module, and the text data processed by the data preprocessing module are taken as the training set;
Step two: each piece of Down syndrome text data in the training set is manually labeled as normal (non-Down syndrome) or abnormal (Down syndrome) to obtain a labeled training set;
Step three: 43627 pieces of Down syndrome text data are drawn with replacement from the labeled training set of step two, and a classification plane divides these 43627 pieces into the two categories Down syndrome and non-Down syndrome, forming a trained SVM model;
Step four: step three is repeated 10 times, so that the classification plane produces 10 classification results for the Down syndrome text data; the 10 results are put to a vote, and the category with the most votes is designated as the final output.
When the accuracy of the SVM model in classifying the data in the labeled training set reaches 90%, the trained SVM model is obtained. Here the accuracy is the number of pieces of Down syndrome text data in the labeled training set that the model classifies correctly, divided by the total number of pieces of Down syndrome text data in the manually labeled training set, multiplied by 100%.
The formula by which the classification plane classifies the Down syndrome text data in step three is as follows:
Figure BDA0003506829730000141
where $x_k$ denotes the kth piece of Down syndrome text data; $y_k$ denotes the category corresponding to that text data, with 1 representing Down syndrome and 0 representing non-Down syndrome; $\omega$ is the normal vector of the classification plane and determines its direction; and b is the displacement term, which determines the distance between the classification plane and the origin.
And obtaining values of parameters omega and b after training in the third step, and sending the values into a Support Vector Machine (SVM) model, as shown in the following formula:
Figure BDA0003506829730000142
s.t. $y_k(\omega^T x_k + b) \geq 1 - \varepsilon_i$
where m denotes the number of dividing planes, here 10; $\omega$ is the normal vector of the classification plane; b is the displacement term; and C is the penalty factor. A larger C penalizes misclassified samples more heavily, which raises accuracy on the training samples but reduces generalization ability, that is, classification accuracy on the test data falls. Conversely, a smaller C tolerates some misclassified samples in the training set. In the present invention, C is set to 1. $\varepsilon_i$ is a manually set parameter with value range [0, 1]; $x_k$ is the kth piece of Down syndrome text data in the training set; $y_k$ is the category predicted by the model for that text data; s.t. denotes the constraint; and T denotes transpose.
The invention was verified on a data set obtained from clinical cases, which supports the reliability of the method's generalization ability and its suitability for wider use. By classifying cases, the cascade feature selection algorithm can assist prenatal screening for Down syndrome.
The predicted detection rate is 81.0% higher than that obtained by using the prenatal screening risk assessment software in the current hospital, and meanwhile, the false detection rate is 9.8% lower than that obtained by using the prenatal screening risk assessment software in the hospital, so that the detection rate is improved, and the false detection rate is also reduced.

Claims (6)

1. A Down syndrome screening system based on a cascade feature selection algorithm is characterized by comprising a data preprocessing module, a feature primary screening module, an optimal feature subset screening module and a model prediction module, wherein the data preprocessing module is used for receiving text data of a Down syndrome screening result, standardizing the data and filling missing texts in the data;
the characteristic preliminary screening module selects the relevant characteristics of the screening result of Down syndrome by using a characteristic selection algorithm based on the relevance from the text data after passing through the data preprocessing module;
the optimal feature subset screening module further screens the features selected by the feature primary screening module by using a bee colony optimization algorithm, and extracts the optimal features with the strongest correlation with the screening result of Down syndrome;
and the model prediction module screens and predicts the optimal features extracted by the optimal feature subset screening module by using a Support Vector Machine (SVM) model and outputs a prediction result.
2. The system of claim 1, wherein the text data of the Down syndrome screening results received by the data preprocessing module refers to the text data of the Down syndrome screening results of pregnant women during pregnancy, each text data is regarded as a Down syndrome sample, and each Down syndrome sample comprises 58-dimensional feature samples; the data is normalized by adopting a Z-Score normalization method to normalize each dimension of the feature sample, and the formula of the Z-Score normalization is as follows:
Figure FDA0003506829720000011
wherein: $x_j$ represents the normalized feature sample, $x_i$ represents the original feature sample, $\mu$ is the average value of all data in the dimension feature sample, and $\sigma$ is the standard deviation of all data in the dimension feature sample;
if the missing data exists in the characteristic sample, filling the missing characteristic data by using a specific value, and then carrying out standardization processing by adopting a Z-Score standardization method after filling, wherein for continuous data, filling by adopting a median filling mode; and for discrete data, filling in a mode of mode filling.
3. The Down syndrome screening system based on cascade feature selection algorithm as claimed in claim 2, wherein the feature primary screening module selects the features related to the Down syndrome screening result by using the feature selection algorithm based on the correlation, the specific process is as follows:
step one, calculating the correlation between the feature sample of each dimension and the feature samples of other dimensions and the correlation between the feature sample of each dimension and the prediction category of Down syndrome from the normalized Down syndrome sample output by the data preprocessing module, and further obtaining two correlation matrixes;
wherein the correlation between the feature sample of each dimension and the feature samples of other dimensions is calculated according to the following formula:
Figure FDA0003506829720000021
wherein: x1Represents all data under one dimensional feature sample, E (X)1) Mathematical expectation, D (X), representing all data under this dimensional feature sample1) Corresponding to the variance, X, of all data under the dimensional feature sample2Represents all data under another dimensional feature sample, E (X)2) Corresponding to the mathematical expectation of all data under this dimensional feature sample, D (X)2) Corresponding to the variance of all data under the dimensional characteristic sample;
the correlation of the feature samples of each dimension to the down syndrome prediction category is calculated as follows:
Figure FDA0003506829720000022
wherein X represents all data under a feature sample of one dimension, $E(X)$ represents the mathematical expectation of all data under that feature sample, and $D(X)$ is the variance of all data under that feature sample; Y represents the diagnosis outcome for each entry of that feature sample, with 1 denoting Down syndrome and 0 denoting non-Down syndrome; $E(Y)$ represents the mathematical expectation of all data in the column of diagnosis outcomes, and $D(Y)$ represents the variance of all data in the column of diagnosis outcomes;
and step two, searching the feature subset by adopting the optimal priority, wherein the specific contents are as follows:
firstly, giving an empty set M, putting each dimension characteristic sample in the empty set M, calculating an estimated value merit of each dimension characteristic sample, selecting the characteristic sample with the largest estimated value to enter the empty set M, then selecting the one-dimensional characteristic sample with the second largest estimated value to enter the empty set M, forming a combined characteristic sample in the empty set M, calculating the estimated value of the combined characteristic sample, removing the characteristic sample with the second largest estimated value if the estimated value of the combined characteristic sample is smaller than the original estimated value of the characteristic sample with the largest estimated value in the empty set M, and keeping the characteristic sample with the second largest estimated value in the empty set M if the estimated value of the combined characteristic sample is not smaller than the original estimated value of the characteristic sample with the largest estimated value in the empty set M;
continuously entering the one-dimensional feature sample with the third largest estimated value into M, forming a combined feature sample by the feature sample with the third largest estimated value and other feature samples retained in M at the moment, calculating the estimated value of the combined feature sample, removing the feature sample newly added into M if the estimated value of the combined feature sample is smaller than the estimated value of the combined feature sample existing when the feature sample is not placed in M, retaining the feature sample newly added into M in M if the estimated value of the combined feature sample is not smaller than the estimated value of the combined feature sample existing when the feature sample is not placed in M, and sequentially progressing until the feature samples of each dimension are processed, so as to obtain the feature sample combination with the largest estimated value; wherein the estimated value merit is calculated according to the following formula:
Figure FDA0003506829720000031
wherein k represents the number of feature samples in the feature sample combination with the largest estimated value; the quantity shown in Figure FDA0003506829720000032 is the average correlation between the feature samples in the combination and the Down syndrome prediction category; and the quantity shown in Figure FDA0003506829720000033 is the average correlation between each feature sample in the combination and the other feature samples.
4. The Down syndrome screening system based on cascade feature selection algorithm as claimed in claim 3, wherein said screening optimal feature subset module adopts bee colony optimization algorithm to extract the optimal feature with strongest relevance to the Down syndrome screening result, the specific content is as follows:
firstly, randomly appointing a part of feature samples to be searched in a feature sample combination with the maximum estimation value output by a feature primary screening module, judging the quality of each search result by using fitness, and finally obtaining a feature sample subset with the maximum fitness by performing traversal search on all feature samples in the feature sample combination with the maximum estimation value; wherein the fitness is calculated according to the following formula:
Figure FDA0003506829720000041
wherein, TP represents a characteristic sample with positive prediction result and positive actual result of Down syndrome; FN represents the characteristic sample with negative prediction result but positive actual result of Down syndrome; FP represents a characteristic sample with positive prediction result but negative actual result of Down syndrome; TN represents the characteristic sample with negative prediction result and negative actual result of Down syndrome.
5. The Down syndrome screening system based on cascade feature selection algorithm as claimed in claim 4, wherein the model prediction module employs Support Vector Machine (SVM) model, and the structure is as follows:
Figure FDA0003506829720000042
s.t. $y_k(\omega^T x_k + b) \geq 1 - \varepsilon_i$
where m denotes the number of dividing planes and $\omega$ is the normal vector of the classification plane; C is a penalty factor, taken as 1; $\varepsilon_i$ is a slack variable with value range [0, 1]; $x_k$ is the kth Down syndrome sample; $y_k$ is the predicted category of the kth Down syndrome sample; s.t. denotes the constraint, T denotes transpose, and b is the displacement term.
6. The Down syndrome screening system based on cascade feature selection algorithm as claimed in claim 5, wherein the specific process of training SVM model is as follows:
step one: inputting the selected Down syndrome samples into the data preprocessing module, and taking the Down syndrome samples processed by the data preprocessing module as a training set;
step two: manually labeling each Down syndrome sample in the training set as non-Down syndrome or Down syndrome to obtain a labeled training set;
step three: drawing 43627 Down syndrome samples with replacement from the labeled training set of step two, and dividing the 43627 Down syndrome samples into Down syndrome and non-Down syndrome by a classification plane; the classification plane refers to the following part of the SVM model:
Figure FDA0003506829720000051
step four: repeating the third step for 10 times, wherein the classification plane generates 10 classification results after 10 times of segmentation on each Down syndrome sample, then votes on the 10 classification results of each Down syndrome sample respectively, and designates the class with the most votes as the final output result of the Down syndrome sample;
when the accuracy of the SVM model in classifying the data in the labeled training set reaches 90%, obtaining a trained SVM model; the accuracy of the model in classifying the data in the labeled training set refers to the number of Down syndrome samples in the labeled training set that the model classifies correctly, divided by the total number of Down syndrome samples in the manually labeled training set, multiplied by 100%.
CN202210140822.4A 2022-02-16 2022-02-16 Down syndrome screening system based on cascade characteristic selection algorithm Pending CN114512231A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210140822.4A CN114512231A (en) 2022-02-16 2022-02-16 Down syndrome screening system based on cascade characteristic selection algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210140822.4A CN114512231A (en) 2022-02-16 2022-02-16 Down syndrome screening system based on cascade characteristic selection algorithm

Publications (1)

Publication Number Publication Date
CN114512231A true CN114512231A (en) 2022-05-17

Family

ID=81551287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210140822.4A Pending CN114512231A (en) 2022-02-16 2022-02-16 Down syndrome screening system based on cascade characteristic selection algorithm

Country Status (1)

Country Link
CN (1) CN114512231A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114881181A (en) * 2022-07-12 2022-08-09 南昌大学第一附属医院 Feature weighting selection method, system, medium and computer based on big data


Similar Documents

Publication Publication Date Title
JP3480940B2 (en) How to select medical and biochemical diagnostic tests using neural network related applications
Liu et al. Machine learning algorithms to predict early pregnancy loss after in vitro fertilization-embryo transfer with fetal heart rate as a strong predictor
EP3491561A1 (en) Methods for non-invasive assessment of genomic instability
WO2020168511A1 (en) Chromosome abnormality detection model, chromosome abnormality detection system, and chromosome abnormality detection method
JP7467504B2 (en) Methods and devices for determining chromosomal aneuploidy and for building classification models - Patents.com
Fulcher et al. Highly comparative fetal heart rate analysis
CN114512231A (en) Down syndrome screening system based on cascade characteristic selection algorithm
CN115331803A (en) Construction method and system for predicting ovarian hyporesponsiveness and deploying individualized ovarian stimulation strategy model
CN113456064B (en) Intelligent interpretation method for prenatal fetal heart monitoring signals
Jamshidnezhad et al. An intelligent prenatal screening system for the prediction of Trisomy-21
Yang et al. Chromosome classification via deep learning and its application to patients with structural abnormalities of chromosomes
Wolcott et al. Automated classification of estrous stage in rodents using deep learning
Zhang et al. Application of intelligent algorithms in Down syndrome screening during second trimester pregnancy
CN110191964B (en) Method and device for determining proportion of free nucleic acid of predetermined source in biological sample
WO2023154851A1 (en) Integrated framework for human embryo ploidy prediction using artificial intelligence
CN116130105A (en) Health risk prediction method based on neural network
CN112522387B (en) Noninvasive prenatal chromosome abnormality detection device
Wang et al. Down Syndrome detection with Swin Transformer architecture
CN114512232A (en) Edward syndrome screening system based on cascade machine learning model
CN113593629B (en) Method for reducing non-invasive prenatal detection false positive and false negative based on semiconductor sequencing
Aljameel et al. An Automated System for Early Prediction of Miscarriage in the First Trimester Using Machine Learning
TWI810915B (en) Method for detecting mutations and related non-transitory computer storage medium
CN111370131B (en) Method and system for screening biomarkers via disease trajectories
US20230005569A1 (en) Chromosomal and Sub-Chromosomal Copy Number Variation Detection
Selfiana et al. Comparison of K-means and DBSCAN for prediction determination of Down syndrome using prenatal test data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination