CN111508603A - Birth defect prediction and risk assessment method and system based on machine learning and electronic equipment


Publication number
CN111508603A
Authority
CN
China
Prior art keywords
birth defect
data
sample
gene
birth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911174613.6A
Other languages
Chinese (zh)
Inventor
陈晓禾
李凤美
郭宇
陈雨行
洪凯程
吴皓
杨涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Institute of Biomedical Engineering and Technology of CAS
Original Assignee
Suzhou Institute of Biomedical Engineering and Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Institute of Biomedical Engineering and Technology of CAS
Priority to CN201911174613.6A
Publication of CN111508603A

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50: Mutagenesis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Primary Health Care (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application relates to a machine-learning-based birth defect prediction and risk assessment method and system, and an electronic device. The method comprises the following steps: step a: collecting sample data of a subject; step b: applying a data mining algorithm to screen main features from the sample data, wherein the screening result comprises medically verified hotspot loci of deafness-causing genes and age; step c: constructing a birth defect prediction model from the screened main features, the model predicting the birth defect outcome from gene mutation information; step d: evaluating, by the birth defect prediction model using a decision tree algorithm, the birth defect risk ratio under the subject's main features, and assessing the subject's birth defect risk in combination with real pathogenicity rate data. The application can predict birth defects in advance, help reduce their incidence, and save society and families substantial medical and living expenses.

Description

Birth defect prediction and risk assessment method and system based on machine learning and electronic equipment
Technical Field
The present application relates to birth defect prediction technologies, and in particular, to a method, a system, and an electronic device for birth defect prediction and risk assessment based on machine learning.
Background
The overall incidence of birth defects in China is as high as 5.6%. Congenital deafness alone has an incidence of 0.2%, ranking third among birth defects, and is an important congenital hereditary disease. Congenital deafness and other birth defects place a huge medical and economic burden on families and society, and primary prevention is the most economical and effective way to improve the health of the newborn population. However, conventional testing for environmental and genetic risk indicators mostly depends on specialized testing equipment and technical staff at large medical institutions. This greatly restricts the reach of primary prevention of birth defects and makes it difficult to use internet technology to integrate medical resources effectively and to coordinate follow-up consultation, guidance and intervention.
Machine learning algorithms are widely used in disease diagnosis and prediction. Common applications include qualitative diagnosis of cancer [Liuyi, research on a machine learning-based cancer diagnosis method [D]] and prediction of cancer recurrence [Qiliang, Shenjie, TCGA database gene mutation information combined with machine learning software RapidMiner to construct a recurrence model of hepatocellular carcinoma patients [J]. China journal of liver disease (electronic version), 2018, 10(03): 19-25].
Related studies have used machine learning to predict the binary classification of hearing outcome in sudden sensorineural hearing loss, in which the Deep Belief Network (DBN) algorithm achieved an average test-set accuracy of 77.58% [Bing D, Ying J, Miao J, et al. Predicting the hearing outcome in sudden sensorineural hearing loss via machine learning models [J]. Clinical Otolaryngology, 2018]. Machine learning has also been applied to predict the time required for hearing improvement in sudden sensorineural hearing loss (best ROC value 0.807) or the rate of hearing improvement (highest predicted probability 70%) [vorvionic L, Deep D, ProbsR, Clinical sound, diagnostic Hearing loss for 0.2011)].
In addition, the prior art generally predicts the pathogenicity of gene sequence variants with software such as PROVEAN, SIFT, PolyPhen-2 and MutationTaster. However, such software can only predict the pathogenicity of a mutation at a single gene locus; it cannot predict pathogenicity arising from mutations at multiple hotspot loci or from multiple factors.
Disclosure of Invention
The application provides a birth defect prediction and risk assessment method, system and electronic equipment based on machine learning, and aims to solve at least one of the technical problems in the prior art to a certain extent.
In order to solve the above problems, the present application provides the following technical solutions:
A birth defect prediction and risk assessment method based on machine learning, comprising the following steps:
step a: collecting sample data of a subject, wherein the sample data comprises a gene mutation detection result, sex, age and a defect diagnosis result; the gene mutation detection result, sex and age are features, and the defect diagnosis result is the classification label;
step b: applying a data mining algorithm to screen main features from the sample data, wherein the screening result comprises medically verified hotspot loci of deafness-causing genes and age;
step c: constructing a birth defect prediction model from the screened main features, the birth defect prediction model predicting the birth defect outcome from gene mutation information;
step d: evaluating, by the birth defect prediction model using a decision tree algorithm, the birth defect risk ratio under the subject's main features, and assessing the subject's birth defect risk in combination with real pathogenicity rate data.
The technical solution adopted in the embodiments of the application further comprises: step a further comprises preprocessing the sample data; the preprocessing specifically comprises:
step a1: classifying the defect diagnosis result;
step a2: splitting the gene mutation detection result by locus, so that the gene mutation information is split into the mutation state of each locus;
step a3: filling in missing values in the sample data using a missing value imputation method;
step a4: deleting illegal data and ambiguous points from the sample data;
step a5: balancing the defect samples and normal samples in the sample data and increasing the inter-class difference.
The technical solution adopted in the embodiments of the application further comprises: in step b, the data mining tools comprise the Python machine learning library sklearn, WEKA and RapidMiner; the WEKA feature screening functions are as follows:
1)BestFirst&CfsSubsetEval
2)GreedyStepwise&CfsSubsetEval
3)Ranker&InfoGainAttributeEval
4)BestFirst&WrapperSubsetEval
wherein the first part of each name is the search method and the second part is the feature evaluation strategy;
the feature screening methods of sklearn in Python comprise:
5) univariate feature selection (SelectKBest);
6) recursive feature elimination;
7) Feature Importance;
8) principal component analysis.
The technical solution adopted in the embodiments of the application further comprises: in step b, the screened gene loci comprise the mutation hotspots c.235delC, c.176del16bp, c.299_300delAT, c.507_510insAACG and V37I of the GJB2 gene; the mutation hotspots 919-2A>G (IVS7-2A>G), H723R (2168A>G) and R409H (1226G>A) of the SLC26A4 gene; the mitochondrial mutation hotspots A1555G and C1494T; and the polymorphic sites V27I, E114G and I203T; the main features further comprise age.
The technical solution adopted in the embodiments of the application further comprises: in step c, constructing a birth defect prediction model from the screened main features specifically comprises:
step c1: evaluating the classification performance of the model by ten-fold cross-validation on the sample data, and optimizing the model;
step c2: training models with the Random Forest, J48, Logistic Regression, SVM, MLP, KStar, Decision Table and CNN algorithms respectively, comparing the classification performance of the algorithms and how each algorithm's performance varies across the full feature set and feature subsets of different sizes, and selecting J48 as the birth defect prediction algorithm.
Another technical solution adopted in the embodiments of the application is: a birth defect prediction and risk assessment system based on machine learning, comprising:
a sample collection module, configured to collect sample data of a subject, wherein the sample data comprises a gene mutation detection result, sex, age and a defect diagnosis result; the gene mutation detection result, sex and age are features, and the defect diagnosis result is the classification label;
a feature screening module, configured to apply a data mining algorithm to screen gene loci of defect-causing genes from the sample data as main features;
a model construction module, configured to construct a birth defect prediction model from the screened main features, the model predicting the birth defect outcome from gene mutation information;
a risk assessment module, configured to evaluate the birth defect risk ratio under the subject's main features through a decision tree algorithm and to assess the subject's birth defect risk in combination with real pathogenicity rate data.
The technical solution adopted in the embodiments of the application further comprises a sample preprocessing module configured to preprocess the sample data; the sample preprocessing module specifically comprises:
a result classification unit, configured to classify the defect diagnosis result;
a locus splitting unit, configured to split the gene mutation detection result by locus, so that the gene mutation information is split into the mutation state of each locus;
a missing value completion unit, configured to fill in missing values in the sample data using a missing value imputation method;
an abnormal data deletion unit, configured to delete illegal data and ambiguous points from the sample data;
a sample balancing unit, configured to balance the defect samples and normal samples in the sample data and increase the inter-class difference.
The technical solution adopted in the embodiments of the application further comprises: the data mining tools comprise the Python machine learning library sklearn, WEKA and RapidMiner; the WEKA feature screening functions are as follows:
1)BestFirst&CfsSubsetEval
2)GreedyStepwise&CfsSubsetEval
3)Ranker&InfoGainAttributeEval
4)BestFirst&WrapperSubsetEval
wherein the first part of each name is the search method and the second part is the feature evaluation strategy;
the feature screening methods of sklearn in Python comprise:
5) univariate feature selection (SelectKBest);
6) recursive feature elimination;
7) Feature Importance;
8) principal component analysis.
The technical solution adopted in the embodiments of the application further comprises: the screened gene loci comprise the mutation hotspots c.235delC, c.176del16bp, c.299_300delAT, c.507_510insAACG and V37I of the GJB2 gene; the mutation hotspots 919-2A>G (IVS7-2A>G), H723R (2168A>G) and R409H (1226G>A) of the SLC26A4 gene; the mitochondrial mutation hotspots A1555G and C1494T; and the polymorphic sites V27I, E114G and I203T; the main features further comprise age.
The technical solution adopted in the embodiments of the application further comprises: the model construction module comprises:
a model optimization unit, configured to evaluate the classification performance of the model by ten-fold cross-validation on the sample data and to optimize the model;
an algorithm selection unit, configured to train models with the Random Forest, J48, Logistic Regression, SVM, MLP, KStar, Decision Table and CNN algorithms respectively, to compare the classification performance of the algorithms and how each algorithm's performance varies across the full feature set and feature subsets of different sizes, and to select J48 as the birth defect prediction algorithm.
Yet another technical solution adopted in the embodiments of the application is: an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, the instructions enabling the at least one processor to perform the following operations of the machine-learning-based birth defect prediction and risk assessment method described above:
step a: collecting sample data of a subject, wherein the sample data comprises a gene mutation detection result, sex, age and a defect diagnosis result; the gene mutation detection result, sex and age are features, and the defect diagnosis result is the classification label;
step b: applying a data mining algorithm to screen main features from the sample data, wherein the screening result comprises medically verified hotspot loci of deafness-causing genes and age;
step c: constructing a birth defect prediction model from the screened main features, the birth defect prediction model predicting the birth defect outcome from gene mutation information;
step d: evaluating, by the birth defect prediction model using a decision tree algorithm, the birth defect risk ratio under the subject's main features, and assessing the subject's birth defect risk in combination with real pathogenicity rate data.
Compared with the prior art, the embodiments of the application have the following advantages: by establishing a birth defect prediction model, the machine-learning-based birth defect prediction and risk assessment method, system and electronic device screen defect-causing genes with machine learning algorithms, predict birth defects in advance and assess risk according to the mutation status of the subject's defect-related genes, helping to reduce the incidence of birth defects; they also provide result feedback and automated genetic counseling services to subjects, assist clinicians by providing reference genetic counseling suggestions, and save society and families substantial medical and living expenses.
Drawings
FIG. 1 is a schematic structural diagram of the birth defect cloud platform system;
FIG. 2 is a schematic diagram of a birth defect prediction model according to an embodiment of the present application;
FIG. 3 is a flowchart of a birth defect prediction and risk assessment method based on machine learning according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the distribution of deafness samples and normal samples at different ages;
FIG. 5 is a comparison of the classification accuracy of deafness diagnosis models trained by the machine learning algorithms on the full sample feature set and on different feature subsets;
FIG. 6 is a schematic diagram of the decision tree model for predicting deafness birth defects, wherein analyzing the impure branches of the decision tree makes it possible to evaluate the hearing loss risk of a sample under the main features;
FIG. 7 is a schematic structural diagram of a birth defect prediction and risk assessment system based on machine learning according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a hardware device for the birth defect prediction and risk assessment method based on machine learning according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The birth defect cloud platform system is an advanced, innovative intelligent device and information platform for primary prevention of birth defects. It enables automatic collection, analysis and feedback of birth defect screening test data, assists clinicians and genetic counselors in providing medical consultation, guidance and intervention to high-risk groups, and establishes an effective networked system for primary prevention of birth defects in pregnancy. Specifically, FIG. 1 shows a structural diagram of the birth defect cloud platform system. The system comprises an information entry interface, a sample receiving interface, a sample inspection interface and a diagnosis suggestion interface, and its business process specifically comprises the following steps:
step 100: information entry interface; after seeing a subject, the clinician reads the subject's identity card information through the information entry interface of the cloud platform, creates a personal file for the subject, enters the ordered test items, and stores the personal file and the ordered items on the cloud platform, which provides the subject with services such as self-service inquiry and online consultation;
step 110: sample receiving interface; after the subject's sample is collected, the laboratory physician scans the sample's barcode and confirms receipt through the sample receiving interface of the cloud platform, and the sample is marked as received or not received in the cloud platform;
step 120: sample inspection interface; the laboratory physician tests the sample on the corresponding analytical instrument; when testing is finished, the sample inspection interface automatically sends the test result to the cloud platform, which automatically generates the corresponding result report;
step 130: diagnosis suggestion interface; the clinician gives a diagnostic recommendation based on the result report.
Based on the birth defect cloud platform system, a birth defect prediction model based on gene mutation is established. The model learns from large-scale medical data through machine learning algorithms, predicts birth defect outcomes and assesses defect risk from gene mutation information, and gives genetic counseling suggestions as a reference for clinicians. Classification accuracy is improved by progressively optimizing the data and applying different feature screening methods and machine learning algorithms; result feedback and automated genetic counseling services are provided to subjects, helping to reduce the incidence of birth defects and saving society and families substantial medical and living expenses. The present application is applicable to the prediction of various birth defects, such as congenital deafness, hereditary hypothyroidism, thalassemia, spinal muscular atrophy and craniomaxillofacial defects. For ease of illustration and understanding, the following embodiments are described in detail using only the prediction of hearing birth defects (congenital deafness) as an example.
Specifically, refer to FIG. 2 and FIG. 3 together. FIG. 2 is a schematic diagram of a birth defect prediction model according to an embodiment of the present application, and FIG. 3 is a flowchart of a birth defect prediction and risk assessment method based on machine learning according to an embodiment of the present application. The birth defect prediction and risk assessment method based on machine learning comprises the following steps:
Step 200: collecting sample data of subjects for hearing assessment;
In step 200, two data sets are analyzed: subject samples from a disabled persons' association and subject samples from a consultation clinic. The collected sample data comprises gene mutation detection results for more than 90 loci of three deafness-causing genes (GJB2, MT and SLC26A4), sex, age and hearing diagnosis results; the hearing diagnosis result is the classification label.
Step 210: preprocessing sample data;
In step 210, the data preprocessing specifically includes:
Step 211: classifying the hearing diagnosis result. The specific classification method is as follows: the hearing thresholds of each ear are averaged separately, and the ear with the milder hearing loss is classified according to a deafness grading table. The hearing diagnosis results can be classified into N (normal) and D (deafness), or, by degree of deafness, into N (normal), M (mild deafness) and H (severe deafness).
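The better-ear rule in step 211 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `classify_hearing` and the 25 dB cutoff are assumptions made here, since the deafness grading table itself is not reproduced in the text.

```python
from statistics import mean

def classify_hearing(left_thresholds, right_thresholds, cutoff_db=25.0):
    """Return 'N' (normal) or 'D' (deafness) from per-ear thresholds in dB.

    cutoff_db is a hypothetical stand-in for the deafness grading table.
    """
    left_avg = mean(left_thresholds)
    right_avg = mean(right_thresholds)
    # The ear with the milder loss has the lower average threshold.
    better_ear = min(left_avg, right_avg)
    return "N" if better_ear <= cutoff_db else "D"

# Left ear averages 25 dB, right ear 65 dB; the left (better) ear decides.
print(classify_hearing([20, 25, 30], [60, 65, 70]))
```

A three-class variant (N, M, H) would only need additional cutoffs taken from whichever grading table is actually used.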
Step 212: splitting the gene mutation detection result by locus. The gene mutation information is split into the mutation state of each locus, and each locus is used as an independent feature: no mutation found is coded as 0, a heterozygous mutation as 1, and a homozygous mutation as 2.
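A minimal sketch of the 0/1/2 locus encoding described in step 212 is shown below. The input format (`"site:zygosity"` pairs separated by semicolons), the `het`/`hom` keywords and the short site list are assumptions for illustration; the patent does not specify how the raw detection result is stored.

```python
# Hypothetical subset of loci; the patent reports more than 90.
SITES = ["c.235delC", "c.299_300delAT", "A1555G"]
# Coding from step 212: heterozygous -> 1, homozygous -> 2, not found -> 0.
CODES = {"het": 1, "hom": 2}

def split_sites(result, sites=SITES):
    """Turn 'site:zygosity;...' into one 0/1/2 feature per locus."""
    features = {site: 0 for site in sites}
    for item in filter(None, result.split(";")):
        site, zygosity = (part.strip() for part in item.split(":"))
        if site in features:  # ignore loci outside the feature list
            features[site] = CODES[zygosity]
    return features

row = split_sites("c.235delC:het;A1555G:hom")
print(row)
```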
Step 213: filling in missing values in the sample data. Some gene detection data in the raw data are missing, and they are filled in with missing value imputation methods such as mean/mode imputation, KNN (K-nearest neighbors), SoftImpute and IterativeSVD. Mean/mode imputation replaces missing values of numerical features with the mean and missing values of nominal features with the mode; KNN measures the distance between samples by mean squared error and fills a missing value with the average of that variable over the K nearest neighbors; SoftImpute fills the matrix by iterative soft thresholding of the SVD; IterativeSVD fills the matrix by iterative low-rank SVD.
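The mean/mode and KNN strategies from step 213 could look like the sketch below, using scikit-learn's `SimpleImputer` and `KNNImputer`. The toy matrix (two locus columns plus an age column) is invented for illustration, and scikit-learn's KNN imputer uses a NaN-aware Euclidean distance, which may differ in detail from the distance described in the patent. SoftImpute and IterativeSVD are not part of scikit-learn and are not shown.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data: columns 0-1 are nominal 0/1/2 locus codes, column 2 is age.
X = np.array([
    [0.0, 1.0, 30.0],
    [0.0, np.nan, 42.0],
    [2.0, 1.0, np.nan],
    [0.0, 0.0, 36.0],
])

# Mode imputation for the nominal locus columns.
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X[:, :2])
# Mean imputation for the numerical age column.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X[:, 2:])
# KNN imputation: average of the variable over the 2 nearest neighbors.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(mode_filled[1, 1], mean_filled[2, 0])
print(knn_filled)
```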
Step 214: adjusting the number of categories of the hearing diagnosis result. The samples in the data set can be classified by degree of deafness into three classes, normal (N), mild (M) and severe (H), or into two classes, normal (N) and deafness (D). Comparing the accuracy of the two schemes shows that binary classification is significantly more accurate than multi-class classification.
Step 215: deleting illegal data and ambiguous points. The sample data contains samples with no mutation at the tested deafness gene loci that are nevertheless clinically diagnosed as deaf; these are illegal data. In such cases the subject's deafness may be caused by acquired environmental factors or by mutations in related genes other than those tested. In addition, some samples have identical gene mutation information and demographic information but different hearing diagnosis results; these are ambiguous points. Both types of data interfere with the determination and classification of the deafness diagnosis and are therefore deleted.
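The two deletion rules in step 215 can be sketched in plain Python as below. The record layout (`sites`, `sex`, `age`, `label`) is hypothetical, and the order of the two passes (illegal data first, then ambiguous points) is an assumption; the patent does not state which is applied first.

```python
from collections import defaultdict

def remove_illegal_and_ambiguous(samples):
    """Drop deaf samples with no mutation, then profiles with conflicting labels."""
    # "Illegal" data: diagnosed deaf (D) but no mutation at any tested locus.
    kept = [s for s in samples if not (s["label"] == "D" and not any(s["sites"]))]
    # Collect the labels seen for each identical (sites, sex, age) profile.
    labels = defaultdict(set)
    for s in kept:
        labels[(s["sites"], s["sex"], s["age"])].add(s["label"])
    # "Ambiguous" points: identical profile but more than one diagnosis.
    return [s for s in kept if len(labels[(s["sites"], s["sex"], s["age"])]) == 1]

samples = [
    {"sites": (0, 0), "sex": "F", "age": 30, "label": "D"},  # illegal
    {"sites": (1, 0), "sex": "M", "age": 25, "label": "D"},  # ambiguous
    {"sites": (1, 0), "sex": "M", "age": 25, "label": "N"},  # ambiguous
    {"sites": (2, 0), "sex": "F", "age": 40, "label": "D"},  # kept
    {"sites": (0, 0), "sex": "M", "age": 33, "label": "N"},  # kept
]
cleaned = remove_illegal_and_ambiguous(samples)
print(len(cleaned))
```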
Step 216: balancing the deafness samples and normal samples and increasing the inter-class difference. Before balancing, deafness samples far outnumber normal samples, and this imbalance strongly affects classification. The application undersamples the deafness samples, drawing only from severe and extremely severe deafness samples, so that the ratio of deafness samples to normal samples is close to 1:1. This both balances the data and increases the difference between classes, thereby improving classification accuracy.
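The undersampling in step 216 might be sketched as follows. The `grade` field and its values (`"severe"`, `"profound"` for extremely severe) are hypothetical names, and random sampling with a fixed seed is one reasonable choice; the patent does not say how samples are drawn.

```python
import random

def undersample_severe(samples, seed=0):
    """Keep all normal samples and draw an equal number of severe deafness cases."""
    normal = [s for s in samples if s["label"] == "N"]
    # Only severe and extremely severe (profound) cases are eligible.
    severe = [s for s in samples
              if s["label"] == "D" and s["grade"] in ("severe", "profound")]
    rng = random.Random(seed)
    sampled = rng.sample(severe, min(len(normal), len(severe)))
    return normal + sampled

samples = (
    [{"label": "N", "grade": "normal"}] * 3
    + [{"label": "D", "grade": "mild"}] * 5
    + [{"label": "D", "grade": "severe"}] * 6
    + [{"label": "D", "grade": "profound"}] * 4
)
balanced = undersample_severe(samples)
print(len(balanced))  # 3 normal + 3 sampled deafness cases
```

Excluding mild cases is what widens the gap between the two classes; the 1:1 ratio comes from capping the draw at the number of normal samples.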
Step 220: respectively applying data mining algorithms such as a machine learning library Sklearn, WEKA, RapidMiner and the like of Python to screen out main characteristics most relevant to deafness diagnosis results from the preprocessed sample data;
in step 220, the main characteristics comprise medically confirmed deafness pathogenic gene hot spots, the screening results are basically consistent with the existing literature results, the screened deafness pathogenic gene hot spot sites comprise mutation hot spots c.235delC, c.176del169p, c.299_300delAT, c.507_510insAACG and V37I of GJB2 gene, mutation hot spots 919-2A > G (IVS7-2A > G) in S L C26A4 gene, H723R (2168A > G) and R409H (1226G > A), mutation hot spots A1555G and C1494T in mitochondria and polymorphism sites V27I, E114G and I203T, in addition, the main relevant characteristics of deafness are screened, the proportion of case groups is increased along with the increase of age, the difference between case groups and control groups is found to have statistical significance by significance through significance analysis, the hearing loss and the hearing loss of the case groups are proved to have the specific age loss and the normal hearing loss of the hearing loss sample is shown in a graph under the condition that the age is different from the normal hearing loss of the case group.
In the embodiment of the application, the feature screening methods of Sklearn in Python are as follows:
1) univariate feature selection (SelectKBest);
2) recursive feature elimination;
3) Feature Importance;
4) principal component analysis (PCA).
This application uses four feature screening functions of WEKA:
1)BestFirst&CfsSubsetEval
2)GreedyStepwise&CfsSubsetEval
3)Ranker&InfoGainAttributeEval
4)BestFirst&WrapperSubsetEval
In each pair, the first half is the search method and the second half is the feature evaluation strategy. Evaluation strategies can be divided into Wrapper and Filter methods. CfsSubsetEval and WrapperSubsetEval both evaluate feature subsets: WrapperSubsetEval is a Wrapper method that scores a subset by training a classifier on it, whereas CfsSubsetEval scores a subset by correlation and is usually classed as a Filter method. InfoGainAttributeEval is a Filter method that evaluates a single feature, scoring it by its information gain with respect to the class.
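The single-feature evaluation by information gain described above can be sketched in plain Python. This is not WEKA's InfoGainAttributeEval implementation, only an illustration of the same quantity; the function names and the example feature names are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the class minus its expected entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(columns, labels):
    """Rank features (dict of name -> list of values) by information gain, best first."""
    scores = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A site whose mutation state separates deafness samples from normal samples perfectly receives the full class entropy as its gain, while a site unrelated to the diagnosis receives a gain near zero.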
Step 230: constructing a birth defect prediction model according to the screened main features, wherein the birth defect prediction model predicts the hearing result according to the gene mutation information;
In step 230, the construction of the birth defect prediction model includes model optimization, algorithm selection and the like, and specifically includes the following steps:
step 2301: evaluating the classification performance of the model by ten-fold cross-validation on the sample data, and optimizing the model;
step 2302: selecting eight algorithms, namely Random Forest, J48, Logistic Regression, SVM, MLP, KStar, Decision Table and CNN, to train the model respectively; comparing the classification performance of the algorithms, and the trend of each algorithm's classification performance on the full feature set and on the feature subsets obtained by the different feature screening methods; and selecting the optimal hearing result prediction algorithm;
In step 2302, fig. 5 is a schematic diagram comparing the classification accuracy of the deafness diagnosis models obtained by the machine learning algorithms on the full feature set and on different feature subsets. Comparing the classification performance of the algorithms, and its trend across the full feature set and feature subsets of different sizes, shows that the decision tree algorithm J48 performs outstandingly, with a prediction accuracy above 90%, and is easy to interpret. The decision tree algorithm J48 is therefore selected as the hearing result prediction algorithm.
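The ten-fold cross-validation of step 2301 can be sketched as follows. The sketch is classifier-agnostic: `fit` and `predict` are hypothetical callables supplied by the caller, and the function names are assumptions for illustration rather than the interface of WEKA or Sklearn.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size.

    Assumes n_samples >= k, so that no fold is empty.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out once for testing."""
    accuracies = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

Running `cross_val_accuracy` once per candidate algorithm and per feature subset yields the kind of accuracy comparison that step 2302 describes.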
Step 240: the birth defect prediction model evaluates the hearing loss risk ratio of the sample data under the main features through the impure branches of the decision tree algorithm J48, and performs deafness risk assessment for the examinee in combination with pathogenic rate statistics from real hospital data;
In step 240, apart from homozygous mutations and compound heterozygous mutations of pathogenic genes, which have absolute pathogenicity, the pathogenicity of many mutation types (such as a heterozygous mutation of a single pathogenic gene, or a polymorphic gene mutation) is difficult to predict: samples with these mutation types may have hearing loss or may be normal. The examinee's risk of hearing loss therefore needs to be assessed, which provides a good reference for disease diagnosis and genetic counseling. Specifically, fig. 6 is a schematic diagram of the decision tree model for deafness birth defect prediction; analysis of the impure branches of the decision tree allows the hearing loss risk of a sample under the main features to be assessed.
Analysis based on the impure branches of decision tree J48: the decision tree classifies on the relevant features to obtain classification results. Some branches in the result are impure, that is, they contain both normal samples and deafness samples, each with some probability. Analyzing these impure branches gives the hearing loss risk ratio of the sample data under the main features.
Pathogenic rate statistics based on real hospital data: the pathogenic rate of every gene mutation type in the sample data is counted, namely the ratio of the number of deafness samples carrying the mutation to the total number of samples carrying the mutation. Pathogenic rates of gene mutation types have so far been counted on more than 2000 hospital samples.
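The pathogenic rate defined above is a simple ratio per mutation type and can be sketched as follows. The record layout and the mutation-type strings in the usage note are illustrative assumptions, not data from the application.

```python
from collections import defaultdict


def pathogenic_rates(records):
    """Return {mutation_type: deafness samples / all samples} for each mutation type.

    records: iterable of (mutation_type, diagnosis) pairs,
             where diagnosis is "D" (deafness) or "N" (normal).
    """
    deaf = defaultdict(int)
    total = defaultdict(int)
    for mutation_type, diagnosis in records:
        total[mutation_type] += 1
        if diagnosis == "D":
            deaf[mutation_type] += 1
    return {m: deaf[m] / total[m] for m in total}
```

For example, if a heterozygous mutation type appears in four samples of which one is diagnosed as deaf, its pathogenic rate is 0.25. The same counting applied to the samples that reach one impure leaf of the decision tree gives that leaf's hearing loss risk ratio.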
The application also comprises a birth defect APP, which comprises three parts: a doctor end, a patient end and a public account. The APP can synchronously receive the examinee's hearing prediction result and risk assessment report. The examinee can create and log in to a personal APP account, view his or her own hearing prediction result and risk assessment report, and interact with a doctor in real time through the birth defect APP; the interaction modes include voice, video and the like.
Please refer to fig. 7, which is a schematic structural diagram of a birth defect prediction and risk assessment system based on machine learning according to an embodiment of the present application. The birth defect prediction and risk assessment system based on machine learning comprises a sample collection module, a sample preprocessing module, a feature screening module, a model construction module, a risk assessment module and an interaction module.
A sample collection module: used for collecting sample data of hearing examinees. Two data sets are analyzed: examinee samples from the disabled persons' federation and examinee samples from the consultation clinic. The collected sample data comprise the gene mutation detection results for more than 90 sites of the three deafness pathogenic genes GJB2, MT and SLC26A4, together with gender, age and the hearing diagnosis result; the hearing diagnosis result is the class feature.
A sample preprocessing module: used for preprocessing the sample data; specifically, the sample preprocessing module comprises:
A result classification unit: used for classifying the hearing diagnosis results. The specific classification method is as follows: the hearing thresholds of each ear are averaged separately, the ear with the milder degree of deafness is taken, and the result is classified according to the deafness grading table. The hearing diagnosis results can be classified into N (normal) and D (deafness), or, according to the degree of deafness, into N (normal), M (mild deafness) and H (severe deafness).
A site splitting unit: used for splitting the gene mutation detection result by site, so that the gene mutation information becomes the mutation state of each site. Each site is an independent feature: no mutation found is recorded as 0, a heterozygous mutation as 1, and a homozygous mutation as 2.
A missing value completion unit: used for completing missing values in the sample data by missing value filling. Some gene detection data in the original data are missing, and are filled by methods such as mean/mode filling, KNN (K nearest neighbors), SoftImpute and IterativeSVD. Mean/mode filling replaces a missing value with the mean for numerical features and with the mode for nominal features. KNN measures the distance between samples by mean square error and fills a missing value with the mean of that variable over the K nearest neighbors. SoftImpute fills the matrix by iterative soft thresholding of the SVD decomposition. IterativeSVD fills the matrix by iterative low-rank SVD decomposition.
A category adjustment unit: used for adjusting the number of categories of the hearing diagnosis result. The samples in the data set can be classified into three classes according to the degree of deafness, namely normal (N), mild (M) and severe (H), or into two classes, normal (N) and deafness (D). Comparing the accuracy of the two schemes shows that the accuracy of binary classification is significantly better than that of multi-class classification.
An abnormal data deleting unit: used for deleting illegal data and ambiguous points. Some samples show no mutation at the deafness gene sites but are clinically diagnosed as deaf; these are illegal data. In such cases the examinee's deafness may be caused by acquired environmental factors, or by mutations in related genes other than the detected genes. In addition, some samples have identical gene mutation information and demographic information but different hearing diagnosis results; these are ambiguous points. Both types of data interfere with the determination and classification of the deafness diagnosis and are therefore deleted.
A sample balancing unit: used for performing sample balancing and inter-class difference enhancement on the deafness samples and the normal samples. Before balancing, the number of deafness samples is far larger than the number of normal samples, and this imbalance strongly affects classification. The present application undersamples the deafness samples, drawing only from the severe and profound deafness samples, so that the ratio of deafness samples to normal samples is close to 1:1. This both balances the data and increases the difference between the sample classes, thereby improving classification accuracy.
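Two of the preprocessing units above, site splitting and deletion of ambiguous points, can be sketched as follows. The 0/1/2 coding follows the description; the state names, the function names and the choice to treat a site absent from a report as "no mutation found" are assumptions made for the example.

```python
from collections import defaultdict

# Coding from the description: not found = 0, heterozygous = 1, homozygous = 2.
STATE_CODES = {"none": 0, "heterozygous": 1, "homozygous": 2}


def split_sites(mutations, sites):
    """Turn a {site: state} report into one 0/1/2 feature per site.

    Sites missing from the report are coded 0 here (an assumption of this sketch).
    """
    return [STATE_CODES[mutations.get(site, "none")] for site in sites]


def drop_ambiguous(rows):
    """Remove every row whose feature tuple occurs with more than one label.

    rows: list of (feature_tuple, label) pairs.
    """
    labels_seen = defaultdict(set)
    for features, label in rows:
        labels_seen[features].add(label)
    return [(f, y) for f, y in rows if len(labels_seen[f]) == 1]
```

All copies of an ambiguous feature vector are dropped, not just the minority label, because identical inputs with different diagnoses cannot be separated by any classifier.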
A feature screening module: used for applying the data mining tools Sklearn (the machine learning library of Python), WEKA and RapidMiner, respectively, to screen out from the preprocessed sample data the main features most relevant to the deafness diagnosis result. The main features comprise medically confirmed deafness pathogenic gene hot spots, and the screening results are basically consistent with results in the existing literature.
In the embodiment of the application, the feature screening methods of Sklearn in Python are as follows:
1) univariate feature selection (SelectKBest);
2) recursive feature elimination;
3) Feature Importance;
4) principal component analysis (PCA).
This application uses four feature screening functions of WEKA:
1)BestFirst&CfsSubsetEval
2)GreedyStepwise&CfsSubsetEval
3)Ranker&InfoGainAttributeEval
4)BestFirst&WrapperSubsetEval
In each pair, the first half is the search method and the second half is the feature evaluation strategy. Evaluation strategies can be divided into Wrapper and Filter methods. CfsSubsetEval and WrapperSubsetEval both evaluate feature subsets: WrapperSubsetEval is a Wrapper method that scores a subset by training a classifier on it, whereas CfsSubsetEval scores a subset by correlation and is usually classed as a Filter method. InfoGainAttributeEval is a Filter method that evaluates a single feature, scoring it by its information gain with respect to the class.
A model construction module: used for constructing a birth defect prediction model according to the screened main features, wherein the birth defect prediction model predicts the hearing result according to the gene mutation information; specifically, the model construction module comprises:
A model optimization unit: used for evaluating the classification performance of the model by ten-fold cross-validation on the sample data, and optimizing the model;
An algorithm selection unit: used for selecting eight algorithms, namely Random Forest, J48, Logistic Regression, SVM, MLP, KStar, Decision Table and CNN, to train models respectively; comparing the classification performance of the algorithms, and the trend of each algorithm's classification performance on the full feature set and on feature subsets of different sizes; and selecting the optimal hearing result prediction algorithm. The comparison shows that the decision tree algorithm J48 performs outstandingly and is easy to interpret, so J48 is selected as the optimal hearing result prediction algorithm.
A risk assessment module: the birth defect prediction model evaluates the hearing loss risk ratio of the sample data under the main features through the impure branches of the decision tree algorithm J48, and performs deafness risk assessment for the examinee in combination with pathogenic rate statistics from real hospital data. Apart from homozygous mutations and compound heterozygous mutations of pathogenic genes, which have absolute pathogenicity, the pathogenicity of many mutation types (such as a heterozygous mutation of a single pathogenic gene, or a polymorphic gene mutation) is difficult to predict: samples with these mutation types may have hearing loss or may be normal. The examinee's risk of hearing loss therefore needs to be assessed, which provides a good reference for disease diagnosis and genetic counseling.
Analysis based on the impure branches of decision tree J48: the decision tree classifies on the relevant features to obtain classification results. Some branches in the result are impure, that is, they contain both normal samples and deafness samples, each with some probability. Analyzing these impure branches gives the hearing loss risk ratio of the sample data under the main features.
Pathogenic rate statistics based on real hospital data: the pathogenic rate of every gene mutation type in the sample data is counted, namely the ratio of the number of deafness samples carrying the mutation to the total number of samples carrying the mutation. Pathogenic rates of gene mutation types have so far been counted on more than 2000 hospital samples.
An interaction module: a birth defect APP is installed in the interaction module. The birth defect APP comprises three parts, a doctor end, a patient end and a public account, and can synchronously receive the examinee's hearing prediction result and risk assessment report. The examinee can create and log in to a personal APP account, view his or her own hearing prediction result and risk assessment report, and interact with a doctor in real time through the birth defect APP; the interaction modes include voice, video and the like.
The birth defect prediction model provided by the application performs risk assessment on the examinee's gene results, and can also give genetic counseling suggestions in combination with the gene results of the examinee's partner. It can predict the hearing result and assess the risk of deafness from the gene mutation information of a fetus, assisting the prenatal diagnosis of hearing birth defects. It can also help doctors quickly screen pathogenic gene sites from clinical big data and screen for new, unknown pathogenic sites.
In tests, the best prediction accuracy reaches over 90% after data optimization, feature screening and algorithm selection. The application assesses the examinee's risk of hearing loss and counts the pathogenic rates of gene mutation types on real hospital data, which provides a good reference for disease diagnosis and genetic counseling, helps reduce the incidence of birth defects, and saves substantial medical and living expenses for society and families.
Fig. 8 is a schematic structural diagram of a hardware device for the birth defect prediction and risk assessment method based on machine learning according to the embodiment of the present application. As shown in fig. 8, the device includes one or more processors and a memory; one processor is taken as an example. The device may further include an input system and an output system.
The processor, memory, input system, and output system may be connected by a bus or other means, as exemplified by the bus connection in fig. 8.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules. The processor executes various functional applications and data processing of the electronic device, i.e., implements the processing method of the above-described method embodiment, by executing the non-transitory software program, instructions and modules stored in the memory.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processing system over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input system may receive input numeric or character information and generate a signal input. The output system may include a display device such as a display screen.
The one or more modules are stored in the memory and, when executed by the one or more processors, perform the following for any of the above method embodiments:
step a: collecting sample data of examinees, wherein the sample data comprises a gene mutation detection result, gender, age and a defect diagnosis result; the gene mutation detection result, the gender and the age are features, and the defect diagnosis result is the classification label;
step b: applying a data mining algorithm to screen main features from the sample data, wherein the screening result comprises medically verified deafness pathogenic gene hot spots and age;
step c: constructing a birth defect prediction model according to the screened main features, wherein the birth defect prediction model predicts the birth defect result according to the gene mutation information;
step d: evaluating, by the birth defect prediction model through a decision tree algorithm, the birth defect risk ratio of the examinee under the main features, and performing risk assessment of the examinee's birth defect in combination with real pathogenic rate data.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
Embodiments of the present application provide a non-transitory (non-volatile) computer storage medium having stored thereon computer-executable instructions that may perform the following operations:
step a: collecting sample data of examinees, wherein the sample data comprises a gene mutation detection result, gender, age and a defect diagnosis result; the gene mutation detection result, the gender and the age are features, and the defect diagnosis result is the classification label;
step b: applying a data mining algorithm to screen main features from the sample data, wherein the screening result comprises medically verified deafness pathogenic gene hot spots and age;
step c: constructing a birth defect prediction model according to the screened main features, wherein the birth defect prediction model predicts the birth defect result according to the gene mutation information;
step d: evaluating, by the birth defect prediction model through a decision tree algorithm, the birth defect risk ratio of the examinee under the main features, and performing risk assessment of the examinee's birth defect in combination with real pathogenic rate data.
Embodiments of the present application provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the following:
step a: collecting sample data of examinees, wherein the sample data comprises a gene mutation detection result, gender, age and a defect diagnosis result; the gene mutation detection result, the gender and the age are features, and the defect diagnosis result is the classification label;
step b: applying a data mining algorithm to screen main features from the sample data, wherein the screening result comprises medically verified deafness pathogenic gene hot spots and age;
step c: constructing a birth defect prediction model according to the screened main features, wherein the birth defect prediction model predicts the birth defect result according to the gene mutation information;
step d: evaluating, by the birth defect prediction model through a decision tree algorithm, the birth defect risk ratio of the examinee under the main features, and performing risk assessment of the examinee's birth defect in combination with real pathogenic rate data.
The birth defect prediction and risk assessment method, system and electronic equipment based on machine learning of the embodiment of the application screen the defect pathogenic genes through the machine learning algorithm by establishing a birth defect prediction model, predict birth defects in advance and assess risks according to mutation conditions of defect-related genes of a detected person, and reduce the incidence rate of birth defects; meanwhile, result feedback and automatic genetic consultation service are provided for the examinees, the work of clinicians is assisted, genetic consultation suggestion reference is provided for the clinicians, and huge medical treatment and living expenses are saved for the society and the family.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A birth defect prediction and risk assessment method based on machine learning is characterized by comprising the following steps:
step a: collecting sample data of examinees, wherein the sample data comprises a gene mutation detection result, gender, age and a defect diagnosis result; the gene mutation detection result, the gender and the age are features, and the defect diagnosis result is the classification label;
step b: applying a data mining algorithm to screen main features from the sample data, wherein the screening result comprises medically verified deafness pathogenic gene hot spots and age;
step c: constructing a birth defect prediction model according to the screened main features, wherein the birth defect prediction model predicts the birth defect result according to the gene mutation information;
step d: evaluating, by the birth defect prediction model through a decision tree algorithm, the birth defect risk ratio of the examinee under the main features, and performing risk assessment of the examinee's birth defect in combination with real pathogenic rate data.
2. The method for birth defect prediction and risk assessment based on machine learning according to claim 1, wherein said step a further comprises: preprocessing the sample data; the preprocessing specifically comprises:
step a1: classifying the defect diagnosis result;
step a2: splitting the gene mutation detection result by site, so that the gene mutation information becomes the mutation state of each site;
step a3: completing missing values in the sample data by missing value filling;
step a4: deleting illegal data and ambiguous points in the sample data;
step a5: performing sample balancing and inter-class difference enhancement on the defect samples and the normal samples in the sample data.
3. The method for birth defect prediction and risk assessment based on machine learning according to claim 1 or 2, wherein, in said step b, said data mining algorithm comprises Sklearn (the machine learning library of Python), WEKA and RapidMiner; the WEKA feature screening functions are as follows:
1)BestFirst&CfsSubsetEval
2)GreedyStepwise&CfsSubsetEval
3)Ranker&InfoGainAttributeEval
4)BestFirst&WrapperSubsetEval
wherein, in each pair, the first half is the search method and the second half is the feature evaluation strategy;
the feature screening methods of Sklearn in Python are as follows:
5) univariate feature selection (SelectKBest);
6) recursive feature elimination;
7) Feature Importance;
8) principal component analysis.
4. The method for birth defect prediction and risk assessment based on machine learning according to claim 3, wherein, in said step b, the screened gene sites comprise the mutation hot spots c.235delC, c.176del16bp, c.299_300delAT, c.507_510insAACG and V37I of the GJB2 gene, the mutation hot spots 919-2A>G (IVS7-2A>G), H723R (2168A>G) and R409H (1226G>A) of the SLC26A4 gene, the mitochondrial mutation hot spots A1555G and C1494T, and the polymorphic sites V27I, E114G and I203T; the main features further comprise age.
5. The method for birth defect prediction and risk assessment based on machine learning according to claim 4, wherein, in said step c, said constructing a birth defect prediction model according to the screened main features specifically comprises:
step c1: evaluating the classification performance of the model by ten-fold cross-validation on the sample data, and optimizing the model;
step c2: selecting the Random Forest, J48, Logistic Regression, SVM, MLP, KStar, Decision Table and CNN algorithms to train models respectively; comparing the classification performance of the algorithms, and the trend of each algorithm's classification performance on the full feature set and on feature subsets of different sizes; and selecting J48 as the birth defect prediction algorithm.
6. A birth defect prediction and risk assessment system based on machine learning is characterized by comprising:
a sample collection module: used for collecting sample data of examinees, wherein the sample data comprises a gene mutation detection result, gender, age and a defect diagnosis result; the gene mutation detection result, the gender and the age are features, and the defect diagnosis result is the classification label;
a feature screening module: used for applying a data mining algorithm to screen main features from the sample data, wherein the screening result comprises medically verified deafness pathogenic gene hot spots and age;
a model construction module: used for constructing a birth defect prediction model according to the screened main features, wherein the birth defect prediction model predicts the birth defect result according to the gene mutation information;
a risk assessment module: used for evaluating, through a decision tree algorithm, the birth defect risk ratio of the examinee under the main features, and performing risk assessment of the examinee's birth defect in combination with real pathogenic rate data.
7. The machine-learning based birth defect prediction and risk assessment system according to claim 6, further comprising a sample preprocessing module for preprocessing said sample data; the sample preprocessing module specifically comprises:
a result classification unit: used for classifying the defect diagnosis result;
a site splitting unit: used for splitting the gene mutation detection result by site, so that the gene mutation information becomes the mutation state of each site;
a missing value completion unit: used for completing missing values in the sample data by missing value filling;
an abnormal data deleting unit: used for deleting illegal data and ambiguous points in the sample data;
a sample balancing unit: used for performing sample balancing and inter-class difference enhancement on the defect samples and the normal samples in the sample data.
8. The system of claim 7, wherein the data mining algorithm comprises Sklearn (the machine learning library of Python), WEKA and RapidMiner; the WEKA feature screening functions are as follows:
1)BestFirst&CfsSubsetEval
2)GreedyStepwise&CfsSubsetEval
3)Ranker&InfoGainAttributeEval
4)BestFirst&WrapperSubsetEval
wherein, in each pair, the first half is the search method and the second half is the feature evaluation strategy;
the feature screening methods of Sklearn in Python are as follows:
5) univariate feature selection (SelectKBest);
6) recursive feature elimination;
7) Feature Importance;
8) principal component analysis.
9. The system for birth defect prediction and risk assessment based on machine learning according to claim 8, wherein the screened gene sites comprise the mutation hot spots c.235delC, c.176del16bp, c.299_300delAT, c.507_510insAACG and V37I of the GJB2 gene, the mutation hot spots 919-2A>G (IVS7-2A>G), H723R (2168A>G) and R409H (1226G>A) of the SLC26A4 gene, the mitochondrial mutation hot spots A1555G and C1494T, and the polymorphic sites V27I, E114G and I203T; the main features further comprise age.
10. The machine-learning based birth defect prediction and risk assessment system according to claim 9, wherein said model building module comprises:
a model optimization unit: used for evaluating the classification performance of the model by ten-fold cross-validation on the sample data, and optimizing the model;
an algorithm selection unit: used for selecting the Random Forest, J48, Logistic Regression, SVM, MLP, KStar, Decision Table and CNN algorithms to train models respectively; comparing the classification performance of the algorithms, and the trend of each algorithm's classification performance on the full feature set and on feature subsets of different sizes; and selecting J48 as the birth defect prediction algorithm.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the following operations of the method for predicting birth defects and assessing risk based on machine learning according to any one of the above 1-5:
step a: collecting sample data of a subject, wherein the sample data comprises a gene mutation detection result, gender, age and a defect diagnosis result, the gene mutation detection result, gender and age being the features and the defect diagnosis result being the classification label;
step b: applying a data mining algorithm to screen the main features of the sample data, wherein the screening result comprises medically verified pathogenic hot spots of deafness genes and age;
step c: constructing a birth defect prediction model from the screened main features, and using the birth defect prediction model to predict the birth defect outcome from the gene mutation information;
step d: evaluating, by the birth defect prediction model through a decision tree algorithm, the birth defect risk ratio under the main features of the subject, and assessing the subject's birth defect risk in combination with actual pathogenicity rate data.
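One way steps c and d could be sketched, assuming scikit-learn: a decision tree is fitted on mutation flags plus age, and the fraction of defect-positive training samples in the leaf a subject reaches is read off as a risk estimate. The data, the three flags, and the tree settings are hypothetical. How the estimate is combined with actual pathogenicity rate data is not specified in this claim, so that part is left out.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in samples: three hypothetical mutation flags plus age.
rng = np.random.default_rng(0)
n = 400
mutations = rng.integers(0, 2, size=(n, 3))
age = rng.integers(20, 45, size=(n, 1))
X = np.hstack([mutations, age]).astype(float)

# Synthetic label: a defect is far more likely when flag 0 is set (assumed rates).
defect_rate = np.where(mutations[:, 0] == 1, 0.6, 0.05)
y = (rng.random(n) < defect_rate).astype(int)

# Shallow tree with large leaves so leaf proportions are not based on a few samples.
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=30,
                              random_state=0).fit(X, y)


def estimated_risk(sample):
    """Fraction of defect-positive training samples in the leaf reached by `sample`."""
    row = np.asarray(sample, dtype=float).reshape(1, -1)
    proba = tree.predict_proba(row)[0]
    return float(proba[list(tree.classes_).index(1)])


carrier = estimated_risk([1, 0, 0, 30])      # flag 0 set
non_carrier = estimated_risk([0, 0, 0, 30])  # no flags set
print(f"carrier: {carrier:.2f}, non-carrier: {non_carrier:.2f}")
```

Dividing the two estimates would give one possible reading of a "risk ratio" between feature profiles, though that interpretation is an assumption.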
CN201911174613.6A 2019-11-26 2019-11-26 Birth defect prediction and risk assessment method and system based on machine learning and electronic equipment Pending CN111508603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911174613.6A CN111508603A (en) 2019-11-26 2019-11-26 Birth defect prediction and risk assessment method and system based on machine learning and electronic equipment

Publications (1)

Publication Number Publication Date
CN111508603A true CN111508603A (en) 2020-08-07

Family

ID=71863808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911174613.6A Pending CN111508603A (en) 2019-11-26 2019-11-26 Birth defect prediction and risk assessment method and system based on machine learning and electronic equipment

Country Status (1)

Country Link
CN (1) CN111508603A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033860A (en) * 2019-02-27 2019-07-19 杭州贝安云科技有限公司 A kind of Inherited Metabolic Disorders recall rate method for improving based on machine learning
CN111933288A (en) * 2020-08-21 2020-11-13 上海交通大学医学院附属第九人民医院 Congenital deafness disease prediction method, system and terminal based on CNN
CN112086130A (en) * 2020-08-13 2020-12-15 东南大学 Obesity risk prediction device based on sequencing and data analysis and prediction method thereof
CN112102878A (en) * 2020-09-16 2020-12-18 张云鹏 LncRNA learning system
CN112530590A (en) * 2020-12-02 2021-03-19 中国福利会国际和平妇幼保健院 Birth defect assessment method and device based on 5G and electronic equipment
CN113344299A (en) * 2021-07-01 2021-09-03 贵州电网有限责任公司 Primary equipment defect prediction model prediction method based on data mining
CN113379313A (en) * 2021-07-02 2021-09-10 贵州电网有限责任公司 Intelligent preventive test operation management and control system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065171A (en) * 2018-11-05 2018-12-21 苏州贝斯派生物科技有限公司 The construction method and system of Kawasaki disease risk evaluation model based on integrated study
CN110033860A (en) * 2019-02-27 2019-07-19 杭州贝安云科技有限公司 A kind of Inherited Metabolic Disorders recall rate method for improving based on machine learning
CN110246577A (en) * 2019-05-31 2019-09-17 深圳江行联加智能科技有限公司 A method of based on artificial intelligence auxiliary gestational diabetes genetic risk prediction

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
李朔等: "孕期遗传性耳聋基因突变携带者筛查以降低耳聋患儿出生缺陷的可行性研究", 《中国优生与遗传杂志》, no. 04, 25 April 2015 (2015-04-25) *
熊怡等: "耳聋基因芯片在非综合征性耳聋出生缺陷防控中的应用研究", 《中国优生与遗传杂志》, no. 08, 25 August 2018 (2018-08-25) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033860A (en) * 2019-02-27 2019-07-19 杭州贝安云科技有限公司 A kind of Inherited Metabolic Disorders recall rate method for improving based on machine learning
CN110033860B (en) * 2019-02-27 2021-02-26 杭州贝安云科技有限公司 Method for improving detection rate of genetic metabolic diseases based on machine learning
CN112086130A (en) * 2020-08-13 2020-12-15 东南大学 Obesity risk prediction device based on sequencing and data analysis and prediction method thereof
CN111933288A (en) * 2020-08-21 2020-11-13 上海交通大学医学院附属第九人民医院 Congenital deafness disease prediction method, system and terminal based on CNN
CN112102878A (en) * 2020-09-16 2020-12-18 张云鹏 LncRNA learning system
CN112102878B (en) * 2020-09-16 2024-01-26 张云鹏 LncRNA learning system
CN112530590A (en) * 2020-12-02 2021-03-19 中国福利会国际和平妇幼保健院 Birth defect assessment method and device based on 5G and electronic equipment
CN113344299A (en) * 2021-07-01 2021-09-03 贵州电网有限责任公司 Primary equipment defect prediction model prediction method based on data mining
CN113379313A (en) * 2021-07-02 2021-09-10 贵州电网有限责任公司 Intelligent preventive test operation management and control system
CN113379313B (en) * 2021-07-02 2023-06-20 贵州电网有限责任公司 Intelligent preventive test operation management and control system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination