CN115565690A

CN115565690A - Early-stage identification method for autism spectrum disorder based on machine learning

Info

Publication number: CN115565690A
Application number: CN202211324088.3A
Authority: CN
Inventors: 韦秋宏; 程茜; 徐铣明; 徐雪丽
Original assignee: Childrens Hospital of Chongqing Medical University
Current assignee: Childrens Hospital of Chongqing Medical University
Priority date: 2022-10-27
Filing date: 2022-10-27
Publication date: 2023-01-03

Abstract

The invention provides a machine learning-based method for the early identification of autism spectrum disorder, belonging to the technical field of medical diagnosis and comprising the following steps: selecting behavior assessment scales of children with developmental disorders to construct a data set, wherein the data set comprises developmental language disorder (DLD) samples, global developmental delay (GDD) samples and autism spectrum disorder (ASD) samples; preprocessing the samples in the data set, and handling class imbalance by setting weights; constructing a two-stage decision model (TS-DM) based on the XGBoost classification model; and inputting a sample to be tested into the TS-DM model, which identifies DLD, ASD and GDD respectively. The invention improves the decision-support capability of developmental scales, uses an interpretable model to offer clinicians an approach to differential diagnosis, and facilitates the early identification of developmental disorders such as ASD.

Description

Early-stage identification method for autism spectrum disorder based on machine learning

Technical Field

The invention belongs to the technical field of medical diagnosis, and particularly relates to an early identification method of autism spectrum disorder based on machine learning.

Background

Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by impaired social communication, repetitive stereotyped behaviors and restricted interests. At a young age, most children with ASD first present with "language problems", while developmental language disorder (DLD) and global developmental delay (GDD) also manifest mainly as language delay. The overlapping symptoms of the three disorders and the heterogeneity of developmental levels make them difficult for doctors to recognize, and distinguishing ASD from DLD and GDD is harder still.

Because none of the three disorders currently has a clear objective biomarker, diagnosis depends mainly on a comprehensive judgment based on medical history, behavioral observation and scale assessment. The doctor's professional background, clinical experience, consultation time and similar factors all affect diagnostic accuracy. Child psychiatrists and developmental pediatricians are in severely short supply in China and other developing countries, and even in developed regions. In China, primary hospitals rarely use high-confidence assessment tools for autism diagnosis and instead rely on parent-reported screening scales to assist diagnosis. Early identification is very important for developmental disorders such as autism, because early identification and intervention can improve the developmental trajectory and prognosis.

How to provide decision support for identifying ASD at an early stage in clinical practice, using a method that is accurate, economical, convenient and fast, has become an important problem. Machine learning has gradually proven useful in assisting doctors' diagnoses, reducing missed diagnoses and misdiagnoses, improving diagnostic efficiency and narrowing the gap between resource supply and demand. Many researchers have conducted ASD-related machine learning studies using different screening and diagnostic tools, including the Autism Diagnostic Observation Schedule (ADOS), the Social Responsiveness Scale (SRS), the Modified Checklist for Autism in Toddlers (M-CHAT) and the Autism Behavior Checklist (ABC). However, most established machine learning models are designed to distinguish children with ASD from typically developing children, and models that distinguish ASD from other neurodevelopmental disorders are lacking. In addition, most models currently used in ASD machine learning, such as the support vector machine (SVM) and the artificial neural network (ANN), are black-box models whose classification process is opaque, so their clinical interpretability is poor and the clinical knowledge they offer is limited.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a method for early identifying autism spectrum disorder based on machine learning.

In order to achieve the above purpose, the invention provides the following technical scheme:

An early identification method for autism spectrum disorder based on machine learning comprises the following steps:

selecting behavior assessment scales and demographic information of children with developmental disorders to construct a data set, wherein the data set comprises developmental language disorder (DLD) samples, global developmental delay (GDD) samples and autism spectrum disorder (ASD) samples;

preprocessing the samples in the data set, and handling class imbalance by setting weights;

constructing a two-stage decision model (TS-DM) based on the XGBoost classification model;

inputting a sample to be tested into the TS-DM model, and identifying DLD, ASD and GDD respectively through the TS-DM model;

wherein constructing the two-stage decision model TS-DM based on the XGBoost classification model comprises:

inputting the preprocessed samples into an XGBoost model and classifying DLD samples against the combined GDD and ASD samples with the XGBoost algorithm; ranking the features by SHAP (SHapley Additive exPlanations) value and selecting the features at which the model's area under the receiver operating characteristic curve (ROC-AUC) no longer rises; and building a single-tree XGBoost model on those features to form the first-stage decision model;

and repeating the above steps to build a single-tree XGBoost model that classifies GDD and ASD, forming the second-stage decision model.

Preferably, the behavior assessment scales comprise the Gesell Developmental Scale (GDS), the Early Language Milestone Scale (ELMS), the Modified Checklist for Autism in Toddlers (M-CHAT) and the Autism Behavior Checklist (ABC), together with demographic information.

Preferably, the unbalanced data are processed by setting weights, wherein:

Preferably, the preprocessed ASD samples and GDD samples are combined into one class and the DLD samples form a separate class, giving two major classes of samples; from these two classes, 20 samples of each class are randomly selected as a validation set, and the rest form the training set.

Preferably, training a two-stage decision model TS-DM based on the training set comprises:

determining the optimal parameters of the XGBoost model by grid search with ten-fold cross-validation, wherein the parameters to be optimized comprise the number of tree estimators (n_estimators), the maximum depth of a single tree (max_depth), the row subsampling ratio (subsample), the column subsampling ratio (colsample_bytree) and the learning rate (learning_rate), and all parameters not mentioned use their default values;

classifying DLD samples against the combined GDD and ASD samples with the XGBoost algorithm; ranking all features in the XGBoost model by SHAP value; adding the features, in order of importance, one at a time to a single-tree XGBoost model whose parameters are reset to n_estimators = 1, subsample = 1 and colsample_bytree = 1, and training until the model's ROC-AUC no longer rises; selecting the features included at that point as the features of TS-DM; and training the first-stage decision model;

and repeating the process to classify the ASD and GDD samples within the combined group, training the second-stage decision model.

Preferably, the XGBoost classification model and the two-stage decision model TS-DM are evaluated by accuracy (ACC) and recall (R);

where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively, with a default classification threshold of 0.5.
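The formulas for ACC and R are not reproduced in this text. As a minimal sketch, assuming the standard definitions ACC = (TP + TN) / (TP + FP + TN + FN) and R = TP / (TP + FN), the two metrics could be computed from predicted probabilities as follows. The function names are illustrative and do not come from the source, and treating a probability exactly equal to the threshold as positive is also an assumption.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        # Assumption: a probability at or above the threshold counts as positive.
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """ACC = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no positive cases."""
    return tp / (tp + fn) if (tp + fn) else 0.0
```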

The method for identifying the autism spectrum disorder at the early stage based on the machine learning has the following beneficial effects:

The method integrates scales that are commonly used and easy to obtain in clinical practice and constructs a two-stage decision model (TS-DM) based on the XGBoost classification model. TS-DM distinguishes ASD from DLD and GDD, improves the decision-support capability of the scales, and uses feature importance and an interpretable model to offer clinicians an approach to differential diagnosis, thereby facilitating the early identification of developmental disorders such as ASD.

Drawings

In order to more clearly illustrate the embodiments of the present invention and the design thereof, the drawings required for the embodiments will be briefly described below. The drawings in the following description are only some embodiments of the invention and it will be clear to a person skilled in the art that other drawings can be derived from them without inventive effort.

FIG. 1 is a flowchart of the machine learning-based early identification method for autism spectrum disorder according to embodiment 1 of the present invention;

FIG. 2 is a flow chart of the two-stage decision model TS-DM;

FIG. 3 is a flow chart of classifying the 60 external test set samples with the two-stage decision model TS-DM.

Detailed Description

In order that those skilled in the art will better understand the technical solutions of the present invention and can practice the same, the present invention will be described in detail with reference to the accompanying drawings and specific examples. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

The invention provides a machine learning-based early identification method for autism spectrum disorder, which is specifically shown in figure 1 and comprises the following steps:

Step 1: Select behavior assessment scales and demographic data of children with developmental disorders to construct a data set comprising developmental language disorder (DLD) samples, global developmental delay (GDD) samples and autism spectrum disorder (ASD) samples.

Specifically, in this embodiment, the Gesell Developmental Scale (GDS), the Early Language Milestone Scale (ELMS), the Modified Checklist for Autism in Toddlers (M-CHAT), the Autism Behavior Checklist (ABC) and demographic data (gender, age) are used as training features. The data set contains 2004 samples, each with 14 training features and the corresponding diagnosis y, where y = 1 denotes autism spectrum disorder (ASD), y = 2 denotes developmental language disorder (DLD) and y = 3 denotes global developmental delay (GDD).

Specifically, 2004 children who attended the child and adolescent growth and development center of the Children's Hospital of Chongqing Medical University from January to September 2020 and received a first diagnosis of ASD, DLD or GDD were selected as the training and validation sets. Case data of 20 patients each with ASD, DLD and GDD seen from October to November 2020 were selected as an external test data set for analysis.

Inclusion criteria: all cases were diagnosed by specialist physicians of senior grade or above at the Children's Hospital of Chongqing Medical University; the diagnoses of ASD and GDD met the criteria of the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), and the diagnosis of DLD met the criteria of the International Classification of Diseases, 11th Revision (ICD-11); and each child had completed the Gesell Developmental Scale (GDS), the Autism Diagnostic Observation Schedule (ADOS), the Early Language Milestone Scale (ELMS), the Modified Checklist for Autism in Toddlers (M-CHAT) and the Autism Behavior Checklist (ABC).

Exclusion criteria: missing or duplicated data, hearing impairment, or underlying neurological disease such as epilepsy, Duchenne muscular dystrophy, hepatolenticular degeneration or autoimmune encephalitis.

Cognitive development is assessed with the Gesell Developmental Scale, which covers five functional domains: gross motor, fine motor, language, personal-social and adaptive behavior (adaptability). Results are expressed as developmental quotients.

Language development is assessed with the Early Language Milestone Scale as adapted by Liu Xiao et al. This version was compiled at the Shanghai Children's Medical Center from the original foreign Early Language Milestone Scale and language development norms for children in the Shanghai area; it has good reliability and is one of the language assessment scales widely used clinically in China. It has three parts: speech and language expression (hereinafter "expressive"), auditory perception and comprehension (hereinafter "receptive"), and vision-related comprehension and expression (hereinafter "visual"). A language developmental quotient is calculated for each part and for the whole scale as developmental age in months / chronological age in months × 100.

Autism-like behaviors are assessed with the M-CHAT and ABC scales. The M-CHAT is completed by parents according to the child's current behaviors and skills; it has 23 items, 6 of which are high-risk items, and the result is expressed as the number of failed high-risk items and the number of failed items overall. The ABC is also completed by parents and comprises 57 items describing abnormal behaviors of autistic children in areas such as sensation, behavior, emotion and language; each item is scored 1, 2, 3 or 4, and the total score over all items is calculated.
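The language developmental quotient defined above is a simple ratio. A minimal sketch, with a hypothetical function name and argument names chosen only for illustration:

```python
def developmental_quotient(developmental_age_months, chronological_age_months):
    """Quotient = developmental age (months) / chronological age (months) * 100."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return developmental_age_months / chronological_age_months * 100


# A child aged 24 months whose language level matches 18 months scores 75.
print(developmental_quotient(18, 24))  # 75.0
```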

Step 2: Preprocess the samples in the data set and handle class imbalance by setting weights.

In the invention, the ASD sample size differs considerably from the GDD and DLD sample sizes, so the class imbalance must be handled; this is done mainly by setting weights.
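The text does not state how the weights are set. One common choice, shown here purely as an assumption, is inverse-frequency ("balanced") weighting, where class c receives the weight N / (K × n_c) for N samples, K classes and n_c samples in class c. The function names below are illustrative; the resulting per-sample weights could then be supplied to a classifier that accepts sample weights during fitting.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weight per class: N / (K * n_c). Assumed scheme, not from the source."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def sample_weights(labels):
    """Expand the class weights into one weight per sample, in input order."""
    weights = balanced_class_weights(labels)
    return [weights[label] for label in labels]
```

With this scheme every class contributes the same total weight, so a rare class is not swamped by a frequent one.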

Step 3: Construct a two-stage decision model TS-DM based on the XGBoost model, as shown in FIG. 2, specifically:

First, the preprocessed samples are input into the XGBoost classification model, and the DLD samples in the data set are classified against the combined GDD and ASD samples with the XGBoost algorithm. Features are ranked by SHAP value, the important features at which the model's ROC-AUC no longer rises are selected, and a single-tree XGBoost model is built on them to form the first-stage decision model.

Second, the ASD and GDD samples within the combined group are classified; the above feature selection is repeated and the second-stage decision model is built.

Step 4: Before the two-stage decision model TS-DM is used for identification, it must be trained on the preprocessed data set, specifically:

Step 4.1: Data set partitioning

The preprocessed autism spectrum disorder (ASD) samples and global developmental delay (GDD) samples are combined into one class, and the developmental language disorder (DLD) samples form a separate class, giving two major classes of samples. From these two classes, 20 samples of each class are randomly selected as a validation set, and the rest form the training set.
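The partitioning step can be sketched as follows. The label coding (1 = ASD, 2 = DLD, 3 = GDD) follows the embodiment, and "GSD" is the document's own abbreviation for the combined ASD and GDD group; the function names, the fixed random seed and the index-based return values are assumptions made for illustration.

```python
import random


def merge_for_stage_one(labels):
    """Map diagnosis codes to the two stage-one classes: DLD (y = 2) vs combined ASD/GDD."""
    return ["DLD" if y == 2 else "GSD" for y in labels]


def holdout_split(labels, n_per_class=20, seed=0):
    """Randomly hold out n_per_class samples of each class; return (train_idx, val_idx)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    val_idx = []
    for label, indices in by_class.items():
        if len(indices) <= n_per_class:
            raise ValueError(f"class {label!r} has too few samples to hold out")
        val_idx.extend(rng.sample(indices, n_per_class))
    held_out = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    return train_idx, sorted(val_idx)
```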

Step 4.2: Hyperparameter tuning

The best hyperparameters of the model are determined with a cross-validated grid search implemented in Python 3.6.7. The XGBoost parameters to be optimized are the number of tree estimators (n_estimators), the maximum depth of a single tree (max_depth), the row subsampling ratio (subsample), the column subsampling ratio (colsample_bytree) and the learning rate (learning_rate); all parameters not mentioned use their default values.
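The search procedure can be sketched in a model-agnostic way. The sketch below enumerates every parameter combination and averages a score over ten folds. It is not the authors' code: `fit_and_score` is a hypothetical callback that would train XGBoost with the given parameters on the training indices and return a score (for example ROC-AUC) on the test indices, the candidate values in `PARAM_GRID` are invented because the source does not list the grid it searched, and the folds are plain contiguous blocks without shuffling or stratification. In practice a library routine such as scikit-learn's `GridSearchCV` would normally do this work.

```python
from itertools import product


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous, non-overlapping folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def grid_search_cv(param_grid, fit_and_score, n_samples, k=10):
    """Return (best_params, best_mean_score) over all combinations in param_grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [fit_and_score(params, tr, te) for tr, te in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:  # ties keep the first combination found
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical candidate values; the source does not list the grid it searched.
PARAM_GRID = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "learning_rate": [0.05, 0.1],
}
```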

Specifically, in this embodiment, training the two-stage decision model TS-DM on the training set includes:

determining the optimal parameters of the XGBoost model by grid search with ten-fold cross-validation, wherein the parameters to be optimized comprise the number of tree estimators (n_estimators), the maximum depth of a single tree (max_depth), the row subsampling ratio (subsample), the column subsampling ratio (colsample_bytree) and the learning rate (learning_rate), and all parameters not mentioned use their default values.

The DLD samples are classified against the combined GDD and ASD samples with the XGBoost algorithm, and all features in the XGBoost model are ranked by SHAP value. The features are then added, in order of importance, one at a time to a single-tree XGBoost model whose parameters are reset to n_estimators = 1, subsample = 1 and colsample_bytree = 1, and training continues until the model's ROC-AUC no longer rises. The features included at that point are selected as the features of TS-DM, and the first-stage decision model is trained.

The process is repeated to classify the ASD and GDD samples within the combined group, training the second-stage decision model.
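The feature selection loop described above can be sketched as a forward pass over the SHAP-ranked features. In this sketch `evaluate_auc` is a hypothetical callback that would train the single-tree model (n_estimators = 1, subsample = 1, colsample_bytree = 1) on the candidate feature subset and return its ROC-AUC. Stopping at the first feature that brings no gain is one reading of "until the ROC-AUC no longer rises", and the `min_gain` tolerance is an added assumption.

```python
def select_features(ranked_features, evaluate_auc, min_gain=0.0):
    """Add features in importance order until ROC-AUC stops rising.

    ranked_features: feature names sorted from most to least important.
    evaluate_auc:    callable taking a list of feature names and returning ROC-AUC.
    """
    selected = []
    best_auc = float("-inf")
    for feature in ranked_features:
        candidate = selected + [feature]
        auc = evaluate_auc(candidate)
        if auc - best_auc <= min_gain:
            break  # no further improvement: keep the features chosen so far
        selected, best_auc = candidate, auc
    return selected, best_auc
```

Keeping the stopping rule separate from the model training makes it easy to reuse the same loop for both stages with different feature rankings.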

Step 4: Evaluate the XGBoost classification model and the two-stage decision model TS-DM by accuracy (ACC) and recall (R).

Step 5: Input the sample to be tested into the two-stage decision model TS-DM, which identifies developmental language disorder (DLD), autism spectrum disorder (ASD) and global developmental delay (GDD) respectively, thereby distinguishing cases of ASD from DLD and GDD.
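Two-stage inference amounts to a cascade of two binary decisions. The sketch below is deliberately simplified: stage one is reduced to a single split on the adaptive-behavior quotient at about 75 points, and stage two uses the M-CHAT high-risk rule, both taken from this document's description of FIG. 3. The actual first-stage tree also uses the personal-social, fine motor and gross motor scores. The direction of the stage-one split (a lower score points to the combined ASD and GDD group, abbreviated "GSD") is an assumption, as are all function and parameter names.

```python
def stage_one(adaptive_dq, threshold=75.0):
    """Simplified first stage: DLD vs the combined ASD/GDD group ("GSD")."""
    # Assumption: lower adaptive scores point to the combined ASD/GDD group.
    return "GSD" if adaptive_dq < threshold else "DLD"


def stage_two(mchat_high_risk_score, threshold=3):
    """Second stage: a high-risk score below the threshold is classified as GDD, otherwise ASD."""
    return "GDD" if mchat_high_risk_score < threshold else "ASD"


def ts_dm_predict(adaptive_dq, mchat_high_risk_score):
    """Run stage one, and run stage two only for cases sent to the combined group."""
    if stage_one(adaptive_dq) == "DLD":
        return "DLD"
    return stage_two(mchat_high_risk_score)
```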

The feature variables are ranked by importance according to the SHAP values of the XGBoost model. In the binary model separating the combined ASD and GDD group from DLD, the ranking is shown in Table 1; once the number of variables reaches 4, the model's accuracy stabilizes and shows no further substantial improvement (Table 2), so adaptive behavior, personal-social, fine motor and gross motor are selected as the feature variables for the single-tree XGBoost model. Similarly, the feature variables in the binary ASD versus GDD model are ranked by importance (Table 3), and the M-CHAT high-risk items are selected as the feature variable (Table 2) for the single-tree XGBoost model. The overall accuracy of the first-stage model is 90.67 ± 3.18%, with a classification accuracy of 85.50 ± 9.50% for DLD and 93.24 ± 4.19% for combined ASD and GDD; the overall accuracy of the second-stage model is 75.88 ± 5.44%, with a classification accuracy of 75.00 ± 9.35% for ASD and 76.75 ± 7.63% for GDD.

TABLE 1 sorting of TS-DM first stage model important variables based on SHAP values

TABLE 2 selection of TS-DM model characteristics based on ROC-AUC

TABLE 3 sorting of TS-DM second stage model important variables based on SHAP values

A human-machine comparison was carried out between the established two-stage decision model TS-DM and senior and junior physicians: the diagnostic accuracy of the model and of the doctors was compared to evaluate the model's potential to assist clinicians in diagnosis.

In this embodiment, 12 clinicians at the Children's Hospital of Chongqing Medical University were selected to analyze and diagnose cases: 6 junior physicians with less than 3 years of clinical experience and 6 senior physicians with more than 10 years of experience and a professional title of intermediate grade or above. Clinical data of the 60 patients in the external test set were randomly distributed among the 12 doctors as test questions, 5 cases per doctor. The questions in the second part add, on top of the first part, the medical history of the same children, including chief complaint, history of present illness, past history, feeding history, growth and development history, observations in the consulting room and physical examination; the doctors then diagnose again on the basis of the medical history and the scale results. Finally, the diagnostic performance of the senior physicians, the junior physicians and the machine is compared and analyzed.

TABLE 4. Pediatric doctor and machine learning model diagnostic performance on test set (%)

h: additional medical history provided to the doctor

The results of the human-machine comparison are as follows. The XGBoost model and the more interpretable two-stage decision model TS-DM were each used to classify the external test set, the doctors judged the cases in the same set, and the accuracies of the machine and the doctors were compared (Table 4). When the doctors and the machine had the same information, the machine's accuracy was comparable to that of the senior physicians and better than that of the junior physicians. After the doctors were given the additional medical history, the senior physicians were more accurate than the machine, while the machine remained more accurate than the junior physicians.

To understand how the machine learning model classifies patients, the classification process of the interpretable model is visualized in FIG. 3. In the first stage of TS-DM, the 60 patients are first divided into 39 GSD cases, i.e. the combined ASD and GDD group (37 correct, 2 incorrect), and 21 DLD cases (18 correct, 3 incorrect); the threshold selected at each node is about 75 points, and adaptive behavior (adaptability) is the root node. The 39 GSD cases then enter the second stage of TS-DM: a case whose M-CHAT high-risk item score is less than 3 is classified as GDD (21 cases, 15 correct, 6 incorrect), and otherwise as ASD (18 cases, 12 correct, 6 incorrect).
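The counts reported for FIG. 3 can be turned into per-stage and end-to-end accuracies with simple arithmetic. The figures computed below are derived here from those counts and are not values stated in the source; the variable and function names are illustrative.

```python
# (correct, incorrect) counts per predicted class, as reported for FIG. 3.
stage_one_counts = {"GSD": (37, 2), "DLD": (18, 3)}
stage_two_counts = {"GDD": (15, 6), "ASD": (12, 6)}


def accuracy_from_counts(counts):
    """Return (correct, total, accuracy) for a dict of (correct, incorrect) pairs."""
    correct = sum(c for c, _ in counts.values())
    total = sum(c + w for c, w in counts.values())
    return correct, total, correct / total


# End-to-end: a final label is right if it is a correct DLD, GDD or ASD call.
final_correct = stage_one_counts["DLD"][0] + sum(c for c, _ in stage_two_counts.values())
final_total = sum(c + w for c, w in stage_one_counts.values())

print(accuracy_from_counts(stage_one_counts))  # first stage
print(accuracy_from_counts(stage_two_counts))  # second stage, within the 39 GSD cases
print(final_correct, final_total, final_correct / final_total)
```

With these counts the first stage is correct for 55 of 60 cases, the second stage for 27 of the 39 cases it receives, and the full cascade for 45 of 60 cases.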

For ASD, DLD and GDD, three disorders that are hard to distinguish clinically and have a high misdiagnosis rate, models such as XGBoost and the two-stage decision tree were constructed; the models achieve high accuracy and can assist junior physicians in the early identification of these disorders. The interpretable model shows that adaptive behavior plays an important role in distinguishing DLD from GSD, and that children with autism-like manifestations and a poor developmental level are not necessarily children with ASD. Machine learning can be used to select among scales, and combining scales improves screening efficiency. To promote clinical application of these results, an online application based on TS-DM has been established to assist clinical decision-making more conveniently and effectively.

The above-mentioned embodiments are only preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, and any simple modifications or equivalent substitutions of the technical solutions that can be obviously obtained by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. An early identification method of autism spectrum disorder based on machine learning is characterized by comprising the following steps:

2. The method for early identification of autism spectrum disorder based on machine learning of claim 1, wherein the behavior assessment scales comprise the Gesell Developmental Scale (GDS), the Early Language Milestone Scale (ELMS), the Modified Checklist for Autism in Toddlers (M-CHAT) and the Autism Behavior Checklist (ABC) as training features.

3. The method for early recognition of autism spectrum disorder based on machine learning of claim 2, wherein the unbalanced data is processed by setting weights, wherein:

4. The method for early identification of autism spectrum disorder based on machine learning as claimed in claim 3, wherein the preprocessed autism spectrum disorder (ASD) samples and global developmental delay (GDD) samples are combined into one class, and the developmental language disorder (DLD) samples form a separate class, giving two major classes of samples; from these two classes, 20 samples of each class are randomly selected as a validation set, and the rest form the training set.

5. The machine learning-based early identification method of autism spectrum disorder according to claim 4, wherein training a two-stage decision model TS-DM based on the training set comprises:

determining the optimal parameters of the XGBoost model by grid search with ten-fold cross-validation, wherein the parameters to be optimized comprise the number of tree estimators (n_estimators), the maximum depth of a single tree (max_depth), the row subsampling ratio (subsample), the column subsampling ratio (colsample_bytree) and the learning rate (learning_rate), and all parameters not mentioned use their default values;

classifying DLD samples against the combined GDD and ASD samples with the XGBoost algorithm; ranking all features in the XGBoost model by SHAP value; adding the features, in order of importance, one at a time to a single-tree XGBoost model whose parameters are reset to n_estimators = 1, subsample = 1 and colsample_bytree = 1, and training until the model's ROC-AUC no longer rises; selecting the features included at that point as the features of TS-DM; and training the first-stage decision model;

6. The machine learning-based early identification method of autism spectrum disorder according to claim 5, wherein the XGBoost classification model and the two-stage decision model TS-DM are evaluated by accuracy (ACC) and recall (R);