CN113782128A - Missing data fitting method and device and computer equipment - Google Patents
- Publication number
- CN113782128A (application number CN202110908971.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- missing
- measured
- simulation
- filling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Abstract
The invention discloses a missing data fitting method, a missing data fitting device and computer equipment. The method comprises the following steps: acquiring measured data of a measured object; analyzing the measured data to obtain the missing features of the measured data; establishing a measured data set based on the missing features; establishing a simulation data set based on the measured data set; filling the missing data in the simulation data set using different data filling methods to obtain a plurality of simulation filling results; determining an optimal data filling method based on the plurality of simulation filling results; and performing data fitting on the measured data based on the optimal data filling method to obtain a fitting data set. By constructing a data set of the measured data and establishing simulation data sets according to the different data types in the measured data, the optimal missing data filling method is obtained, so that the filled data is closer to the real data.
Description
Technical Field
The invention relates to the technical field of missing data fitting, in particular to a missing data fitting method, a missing data fitting device and computer equipment.
Background
Medical information resources are very valuable for the diagnosis and treatment of diseases and for medical research. However, data are often lost during treatment, for example because researchers record or enter information inaccurately, or for reasons on the patient's side. Incomplete data make it difficult to guarantee the accuracy of statistical results and to improve the power of statistical tests.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defect of inaccurate missing data fitting in the prior art. To this end, a missing data fitting method, a missing data fitting device and computer equipment are provided.
According to a first aspect, an embodiment of the invention discloses a missing data fitting method, comprising: acquiring measured data of a measured object; analyzing the measured data to obtain the missing features of the measured data; establishing a measured data set based on the missing features; establishing a simulation data set based on the measured data set; filling the missing data in the simulation data set using different data filling methods to obtain a plurality of simulation filling results; determining an optimal data filling method based on the plurality of simulation filling results; and performing data fitting on the measured data based on the optimal data filling method to obtain a fitting data set.
Optionally, acquiring the measured data of the measured object includes: extracting, from the measured data, variables that can take any value within a certain interval, as continuous variables; extracting, from the measured data, data classified based on morphological characteristics of the measured object, as categorical variables; and extracting, from the measured data, data collected at successive acquisition time points, as longitudinal variables.
Optionally, analyzing the measured data to obtain the missing features of the measured data includes: obtaining the missing proportion of the continuous variables based on the number of missing data in the measured object, and obtaining the missing mechanism of the continuous variables based on the missing pattern of the continuous variables; obtaining the missing proportion of the categorical variables based on the number of missing data in the measured object, obtaining the missing pattern based on the distribution of the missing data of the categorical variables, and obtaining the missing mechanism of the categorical variables based on the cause of the missing data; and evaluating the measured object to obtain an evaluation result, obtaining the missing proportion of the longitudinal variables based on the number of missing data in the evaluation result, obtaining the missing pattern based on the distribution of the missing data of the longitudinal variables, and obtaining the missing mechanism of the longitudinal variables based on the cause of the missing data of the longitudinal variables.
Optionally, establishing a measured data set based on the missing features includes: acquiring missing characteristics in the measured data, wherein the missing characteristics comprise a missing proportion, a missing mechanism and a missing mode of the data; and establishing the measured data set based on the missing proportion, the missing mechanism and the missing mode of each measured data.
Optionally, establishing a simulation data set based on the measured data set comprises: constructing a complete continuous variable training set based on the data in the measured data in which the continuous variables are not missing; constructing, based on the continuous variables, missing proportion simulation sets with different missing proportions under different missing mechanisms; and constructing a continuous variable simulation set based on the continuous variable training set and the missing proportion simulation sets.
Optionally, establishing a simulation data set based on the measured data set comprises: constructing a complete categorical variable training set based on the data in the measured data in which the categorical variables are not missing; constructing a linear regression model based on the categorical variable training set; establishing, based on the first data, categorical variable missing proportion simulation sets with different missing proportions under different missing mechanisms; and constructing a categorical variable simulation set based on the categorical variable training set and the categorical variable missing proportion simulation sets.
Optionally, establishing a simulation data set based on the measured data set comprises: constructing a longitudinal data training set based on the data in the measured data in which the first data and the second data are not missing; establishing, based on the longitudinal data, longitudinal data missing proportion simulation sets with different missing proportions under different missing mechanisms; and obtaining a longitudinal data simulation set based on the longitudinal data training set and the longitudinal data missing proportion simulation sets.
Optionally, determining an optimal data filling method based on the plurality of simulation filling results includes: calculating a statistical result for each data filling method based on the plurality of filling results; and obtaining the optimal data filling method for the missing data based on the statistical results.
According to a second aspect, an embodiment of the present invention further discloses a missing data fitting apparatus, including: an acquisition module, configured to acquire the measured data of the measured object; an analysis module, configured to analyze the measured data to obtain the missing features of the measured data; a first establishing module, configured to establish a measured data set based on the missing features; a second establishing module, configured to establish a simulation data set based on the measured data set; a first filling module, configured to fill the missing data in the simulation data set using different data filling methods to obtain a plurality of simulation filling results; a fitting module, configured to determine an optimal data filling method based on the plurality of simulation filling results; and a second filling module, configured to perform data fitting on the measured data based on the optimal data filling method to obtain a fitting data set.
According to a third aspect, an embodiment of the present invention further discloses a computer device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the missing data fitting method of the first aspect or any one of the alternative embodiments of the first aspect.
According to a fourth aspect, the present invention further discloses a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the missing data fitting method according to the first aspect or any one of the alternative embodiments of the first aspect.
The technical scheme of the invention has the following advantages:
The missing data fitting method, the missing data fitting device and the computer equipment provided by the invention comprise the following steps: acquiring measured data of a measured object; analyzing the measured data to obtain the missing features of the measured data; establishing a measured data set based on the missing features; establishing a simulation data set based on the measured data set; filling the missing data in the simulation data set using different data filling methods to obtain a plurality of simulation filling results; determining an optimal data filling method based on the plurality of simulation filling results; and performing data fitting on the measured data based on the optimal data filling method to obtain a fitting data set. By constructing a data set of the measured data and establishing simulation data sets according to the different data types in the measured data, an optimal missing data filling method is obtained, so that the filled data is closer to the real data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a specific example of a missing data fitting method in an embodiment of the present invention;
FIG. 2 is a schematic block diagram of a specific example of missing data fitting apparatus in an embodiment of the present invention;
FIG. 3 is a diagram of an embodiment of a computer device.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; the two elements may be directly connected or indirectly connected through an intermediate medium, or may be communicated with each other inside the two elements, or may be wirelessly connected or wired connected. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The embodiment of the invention discloses a missing data fitting method, which comprises the following steps:
Step 101, acquiring measured data of a measured object.
The measured data may be physical status data, physiological parameters and the like monitored during the treatment of patients. The embodiment of the present invention does not limit the source of the data, and those skilled in the art can determine it according to actual needs. For example, data were collected from patients with stroke sequelae at three time points: the baseline period (week -1 to 0), the first visit (month 1 ± 7 days) and the second visit (month 3 ± 7 days). Data from 1792 stroke patients were collected; after excluding 135 records without baseline information (gender, age), the information of 1657 patients remained. The data of these 1657 patients include patient ID, name, age, gender, group, vital signs (respiratory frequency, pulse, blood pressure, body temperature) and the NIHSS (National Institutes of Health Stroke Scale) scores of the 3 visits. Of these, 507 were patients with depression and side-weight after ischemic stroke, for whom SDS (Self-Rating Depression Scale) data were additionally collected.
Step 102, analyzing the measured data to obtain the missing features of the measured data.
Illustratively, the missing features include the missing proportion and the missing mechanism of the measured data. The missing proportion is the percentage of missing data in the total data. The missing mechanism describes the form in which the data are missing and may be, for example, one of three forms: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).
Step 103, establishing a measured data set based on the missing features. Illustratively, the measured data set is a real data set created from the measured data.
Step 104, establishing a simulation data set based on the measured data set. Illustratively, the simulation data set is a data set with different missing mechanisms and different missing proportions established from the measured data.
Step 105, filling the missing data in the simulation data set using different data filling methods to obtain a plurality of simulation filling results. Illustratively, the data filling method may be a deletion method, a weighting method, single-value imputation, expectation maximization, multiple imputation or predictive mean matching. The embodiment of the present invention does not limit the selection of the data filling method, and those skilled in the art can determine it according to actual needs.
Step 106, determining an optimal data filling method based on the plurality of simulation filling results. Illustratively, the missing data in the simulation data set are filled using different data filling methods, each of which yields its own filling result; the data filling method corresponding to the best filling result is then selected from the plurality of filling results.
Step 107, performing data fitting on the measured data based on the optimal data filling method to obtain a fitting data set. Illustratively, the optimal data filling method is the data filling method with the best filling effect under the missing mechanism and missing proportion of each type of measured data. The optimal data filling methods for the missing mechanisms and missing proportions of the various data types are integrated to obtain the best data fitting method, and the missing data are filled with this method to obtain the fitting data set.
In summary, the missing data fitting method provided by the invention comprises: acquiring measured data of a measured object; analyzing the measured data to obtain its missing features and establishing a measured data set; establishing a simulation data set based on the measured data set; filling the missing data in the simulation data set using different data filling methods to obtain a plurality of simulation filling results; determining an optimal data filling method based on the plurality of simulation filling results; and performing data fitting on the measured data based on the optimal data filling method to obtain a fitting data set. By constructing a data set of the measured data and establishing simulation data sets according to the different data types in the measured data, an optimal missing data filling method is obtained, so that the filled data is closer to the real data.
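Illustratively, the selection of an optimal filling method on a simulation data set can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the two candidate methods (mean and median filling), the use of root-mean-square error as the statistical result, and all function names are hypothetical choices made only for illustration.

```python
import math
import random
import statistics


def mean_fill(observed):
    """Hypothetical candidate method: fill with the mean of the observed values."""
    return statistics.mean(observed)


def median_fill(observed):
    """Hypothetical candidate method: fill with the median of the observed values."""
    return statistics.median(observed)


def rmse_of_method(complete, mask, fill_fn):
    """Fill the masked entries of a complete series and compare with the truth.

    `complete` holds the true values of the training set; `mask[i]` is True
    where a value was deleted to build the simulation set.
    """
    observed = [v for v, m in zip(complete, mask) if not m]
    fill_value = fill_fn(observed)
    squared = [(fill_value - v) ** 2 for v, m in zip(complete, mask) if m]
    return math.sqrt(sum(squared) / len(squared))


def select_best_method(complete, mask, methods):
    """Return the method name with the lowest RMSE, plus all scores."""
    scores = {name: rmse_of_method(complete, mask, fn) for name, fn in methods.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic data standing in for a complete training set (assumption).
rng = random.Random(0)
complete = [rng.gauss(45.0, 10.0) for _ in range(200)]
deleted = set(rng.sample(range(200), 40))  # 20% deleted completely at random
mask = [i in deleted for i in range(200)]

methods = {"mean": mean_fill, "median": median_fill}
best, scores = select_best_method(complete, mask, methods)
print(best, scores)
```

In practice the same comparison would be repeated for every combination of missing mechanism and missing proportion, and the candidate set would include the methods named in step 105.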
As an optional implementation manner of the present invention, in step 101, the process of acquiring measured data of the measured object specifically includes: extracting, from the measured data, variables that can take any value within a certain interval, as continuous variables; extracting, from the measured data, data classified based on morphological characteristics of the measured object, as categorical variables; and extracting, from the measured data, data collected at successive acquisition time points, as longitudinal variables.
For example, the continuous variables may be vital sign data, such as respiratory frequency, pulse, body temperature and pulse pressure difference (systolic pressure minus diastolic pressure). The embodiment of the present invention does not limit the type and number of the continuous variables, and those skilled in the art can determine them according to actual needs. The categorical variable may be a variable reflecting the subjective feelings of patients with depression among the stroke patients and their change during treatment, such as Self-Rating Depression Scale (SDS) data. The SDS is a 4-level rating scale containing 20 items. After the rating is finished, the scores of the 20 items are added to obtain a total raw score, which is multiplied by 1.25 and truncated to its integer part to obtain the standard score. According to the Chinese norm, the cut-off value of the SDS standard score is 53: 53-62 indicates mild depression, 63-72 moderate depression, and 73 or more severe depression.
The longitudinal variable may be a variable for assessing the neurological deficit of a patient, where a higher score indicates a greater degree of neurological deficit. It comprises 11 assessment items: consciousness level (including consciousness level questions and consciousness level instructions), gaze, visual field, facial paralysis, upper and lower limb motor function, limb ataxia, sensation, language, dysarthria and neglect. The scoring method is as follows: each item is scored in 3 to 5 grades, and the total score ranges from 0 to 42 points. A score of 0-1 indicates normal or nearly normal, 1-4 mild stroke, 5-15 moderate stroke, 15-20 moderate to severe stroke, and more than 20 severe stroke. The embodiment of the present invention does not limit the type and number of the longitudinal variables, and those skilled in the art can determine them according to actual needs.
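Illustratively, the SDS standard score described above can be computed as in the following sketch. It assumes that each of the 20 items has already been scored on levels 1 to 4 (with any reverse-scored items already converted); the function names and the label used for scores below the cut-off are hypothetical.

```python
def sds_standard_score(item_scores):
    """Sum the 20 item scores, multiply by 1.25 and keep the integer part."""
    if len(item_scores) != 20:
        raise ValueError("the SDS has 20 items")
    if any(score not in (1, 2, 3, 4) for score in item_scores):
        raise ValueError("each item is assumed to be scored 1 to 4")
    raw_total = sum(item_scores)
    return int(raw_total * 1.25)


def sds_category(standard_score):
    """Map a standard score to the bands given in the text (cut-off 53)."""
    if standard_score < 53:
        return "below cut-off"  # hypothetical label, not from the source
    if standard_score <= 62:
        return "mild depression"
    if standard_score <= 72:
        return "moderate depression"
    return "severe depression"


example = [2] * 17 + [3] * 3           # raw total 43
score = sds_standard_score(example)    # int(43 * 1.25) = int(53.75)
print(score, sds_category(score))
```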
As an optional implementation manner of the present invention, in step 102, the process of analyzing the measured data to obtain the missing features of the measured data specifically includes: obtaining the missing proportion of the continuous variables based on the number of missing data in the measured object, obtaining the missing pattern based on the distribution of the missing data of the continuous variables, and obtaining the missing mechanism of the continuous variables based on the cause of the missing data; obtaining the missing proportion of the categorical variables based on the number of missing data in the measured object, obtaining the missing pattern based on the distribution of the missing data of the categorical variables, and obtaining the missing mechanism of the categorical variables based on the cause of the missing data; and evaluating the measured object to obtain an evaluation result, obtaining the missing proportion of the longitudinal variables based on the number of missing data in the evaluation result, obtaining the missing pattern based on the distribution of the missing data of the longitudinal variables, and obtaining the missing mechanism of the longitudinal variables based on the distribution of the missing data.
Illustratively, the missing proportion is calculated as the percentage of missing data in the total data. The missing pattern may be a monotone missing pattern or an arbitrary missing pattern; it mainly shows the distribution of the missing data and reflects the relation between the missing data and the variables. Different missing mechanisms are judged in different ways: the MCAR mechanism can be judged from the distribution characteristics, by comparing whether the means and variances are consistent; the MAR mechanism is judged by describing the distribution of the missing data with a logit model and estimating the significance of its parameters; the MNAR mechanism is identified by analyzing whether the data follow a monotone or an arbitrary missing pattern, together with the cause of the missing data. Thus, exploring the possible relationships between missing data and variables and identifying the factors that influence missingness helps to determine the missing mechanism.
Vital signs were measured for the 1657 patients in the continuous data, and 272 patients had missing values to varying degrees: the missing rate was 15.33% for respiratory frequency, 15.45% for pulse, 15.27% for body temperature and 16.17% for pulse pressure difference, and 253 patients had all four variables missing simultaneously. Logistic regression analysis of the vital sign example data set showed that group is an important factor influencing the missingness, and that the missing mechanism of the continuous data is MAR.
For the categorical variable, a total of 508 patients underwent SDS scale measurement; 1 patient lacked basic information (age), so 507 patients were finally included in the analysis. The SDS scale data of these patients contain no missing values, so the missing mechanism cannot be analyzed from the actual data.
For the longitudinal variable, a total of 1657 patients were assessed 3 times with the NIHSS scale, with 155 patients missing data at 3 visits and 63 patients missing data at the latter two visits. The missing proportion was 10.20% for visit 1, 11.16% for visit 2 and 12.07% for visit 3. Logistic regression analysis of the example data set showed that group and gender are important factors influencing the missingness, and that the missing mechanism of the longitudinal variable (NIHSS scale) is MAR.
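Illustratively, the missing proportion and the distinction between a monotone and an arbitrary missing pattern can be sketched as follows. The sketch assumes that missing values are represented as `None` and that each row lists one patient's visits in time order; the function names are hypothetical.

```python
def missing_proportion(values):
    """Share of entries that are missing (None) in one variable."""
    return sum(v is None for v in values) / len(values)


def is_monotone(rows):
    """True if, in every row, a missing visit is never followed by an observed one.

    This is the usual meaning of a monotone pattern for longitudinal data:
    once a patient drops out, all later visits are missing as well.
    """
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False
    return True


# Made-up scores for three visits per patient (illustration only).
dropout_rows = [[3, 2, None], [5, None, None], [4, 4, 4]]
arbitrary_rows = [[3, None, 2], [5, 5, 5]]

print(missing_proportion([row[2] for row in dropout_rows]))
print(is_monotone(dropout_rows), is_monotone(arbitrary_rows))
```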
As an optional implementation manner of the present invention, the step 103, based on the missing feature, includes: acquiring missing characteristics in the measured data, wherein the missing characteristics comprise a missing proportion, a missing mechanism and a missing mode of the data; and establishing the measured data set based on the missing proportion, the missing mechanism and the missing mode of each measured data.
Illustratively, the continuous variable, the categorical variable and the longitudinal variable contained in the collected measured data are used to establish the measured data set with deletion ratios of 10%, 20% and 30% under the deletion mechanisms of MCAR, MAR and MNAR, respectively.
As an optional implementation manner of the present invention, the step 104, the process of establishing a simulation data set based on the measured data set specifically includes: constructing a finished continuous variable training set based on the data without missing continuous variables in the measured data; constructing a deletion ratio simulation set of different deletion ratios under different deletion mechanisms based on the continuous variables; and constructing a continuous variable simulation set based on the continuous variable training set and the missing proportion simulation set.
For example, data in which no continuous variable is missing are selected from the measured data; for instance, the 1381 observation objects with no missing group, gender, age or pulse pressure difference are used to construct the continuous variable training set. The embodiment of the present invention does not limit the type of continuous data selected, which can be determined by those skilled in the art according to actual needs. On the basis of the complete training set, missing proportion simulation sets with missing proportions of 10%, 20% and 30% are established under the three mechanisms MCAR, MAR and MNAR. MAR missingness depends on age: data of the 511 patients younger than 65 are randomly deleted at a proportion of 10%, missingness is constructed proportionally for the remaining 869 patients, and the two parts are finally combined to give overall missing proportions of 10%, 20% and 30%. MNAR missingness depends on the pulse pressure difference: considering that the data of patients with a pulse pressure difference greater than 40 mmHg may be more easily lost, data of the 257 patients whose pulse pressure difference in the vital signs is less than 40 mmHg are randomly deleted at a proportion of 10%, and missingness is constructed proportionally for the remaining 1124 patients. The continuous variable training set and the missing proportion simulation set are then combined to obtain the continuous variable simulation set.
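One way to read the two-stratum construction is that one stratum is deleted at a fixed 10% and the remaining stratum is deleted at whatever rate brings the overall proportion to the target. The sketch below implements that reading; it is an assumption about the procedure rather than the method as disclosed, and the helper names `high_stratum_rate`, `mcar_mask` and `stratified_mask` are hypothetical. Under this reading, with 257 patients in the 10% stratum and 1124 in the other, an overall target of 20% requires a rate of about 22.3% in the second stratum.

```python
import random
from typing import List, Sequence


def high_stratum_rate(n_low: int, n_high: int, target: float, low_rate: float = 0.10) -> float:
    """Deletion rate for the remaining stratum so the overall proportion equals `target`."""
    total = n_low + n_high
    return (target * total - low_rate * n_low) / n_high


def mcar_mask(n: int, target: float, seed: int = 0) -> List[bool]:
    """MCAR: every record has the same chance of being deleted."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(n), round(target * n)))
    return [i in chosen for i in range(n)]


def stratified_mask(in_low_stratum: Sequence[bool], target: float,
                    low_rate: float = 0.10, seed: int = 0) -> List[bool]:
    """MAR or MNAR: delete `low_rate` of the flagged stratum and enough of the rest
    to reach `target` overall. The flag comes from an observed covariate for MAR
    (for example age < 65) or from the variable itself for MNAR."""
    rng = random.Random(seed)
    low = [i for i, flag in enumerate(in_low_stratum) if flag]
    high = [i for i, flag in enumerate(in_low_stratum) if not flag]
    p_high = high_stratum_rate(len(low), len(high), target, low_rate)
    chosen = set(rng.sample(low, round(low_rate * len(low))))
    chosen |= set(rng.sample(high, round(p_high * len(high))))
    return [i in chosen for i in range(len(in_low_stratum))]


# Stratum sizes taken from the MNAR example in the text: 257 and 1124 patients.
flags = [True] * 257 + [False] * 1124
mnar_mask = stratified_mask(flags, target=0.20)
```

A `True` entry in the returned mask marks a record whose value would be set to missing in the simulation set; the complete training set keeps the true values for later comparison.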
As an optional implementation manner of the present invention, in step 104, the process of establishing a simulation data set based on the measured data set specifically includes: constructing a complete classification variable training set based on the data in the measured data in which no classification variable is missing; constructing a linear regression model based on the classification variable training set; establishing classification variable missing proportion simulation sets with different missing proportions under different missing mechanisms based on the first data; and constructing a simulation set of the classification variables based on the classification variable training set and the classification variable missing proportion simulation set.
Illustratively, whether a stroke patient has depression (according to the SDS scale standard, a score of 53 or more indicates depression and a score below 53 indicates no depression) can be used as the first data, and age, group and gender can be used as the second data. Complete data in which neither the first data nor the second data are missing are selected to construct the classification variable training set, and a linear regression model is established from the classification variable training set: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n + e$, where the dependent variable $y$ is whether the patient has depression, a dichotomous variable (0 = no, 1 = yes), and the independent variables $x_1$ to $x_{20}$ are the 20 items of the SDS. On the basis of the regression model, simulation sets are constructed in which the 20 items of the SDS scale have arbitrary missingness at proportions of 10%, 20% and 30% under the three mechanisms MCAR, MAR and MNAR. MAR missingness depends on gender: data of the 194 female patients are randomly deleted at a proportion of 10%, missingness is constructed proportionally for the remaining 313 male patients, and the two parts are finally combined to give overall missing proportions of 10%, 20% and 30% for the SDS scale. MNAR missingness depends on the SDS score: data of the 108 subjects scoring 1 or 2 on item 2 are randomly deleted at a proportion of 10%, and missingness is constructed proportionally for the 399 subjects scoring 3 or 4. The classification variable training set and the classification variable missing proportion simulation set are then combined into the classification variable simulation set.
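The dichotomous outcome and the arbitrary (non-monotone) missing pattern across the 20 SDS items can be sketched as follows. This is a minimal illustration for the MCAR case only, and it assumes the missing proportion is measured over all subject-by-item cells, which the text does not state explicitly. The names `depression_label` and `mask_items_arbitrary` and the toy item matrix are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def depression_label(sds_score: float, cutoff: float = 53.0) -> int:
    """Dichotomize the SDS score as in the text: 1 = depression (score >= 53), 0 = none."""
    return 1 if sds_score >= cutoff else 0


def mask_items_arbitrary(items: Sequence[Sequence[int]], proportion: float,
                         seed: int = 0) -> List[List[Optional[int]]]:
    """Blank out a random `proportion` of all item cells (arbitrary pattern, MCAR)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(items), len(items[0])
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    holes = set(rng.sample(cells, round(proportion * len(cells))))
    return [
        [None if (r, c) in holes else items[r][c] for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Toy item matrix (illustrative only): 10 subjects x 20 items, each scored 1 to 4.
toy_rng = random.Random(1)
toy_items = [[toy_rng.randint(1, 4) for _ in range(20)] for _ in range(10)]
masked_items = mask_items_arbitrary(toy_items, proportion=0.10)
n_missing = sum(value is None for row in masked_items for value in row)
```

Because cells are chosen independently of subject and item, any item of any subject may be missing, which is what distinguishes an arbitrary pattern from the monotone pattern used for longitudinal data.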
As an optional implementation manner of the present invention, in step 104, the process of establishing a simulation data set based on the measured data set specifically includes: constructing a training set of longitudinal data based on the data in the measured data in which neither the first data nor the second data are missing; establishing simulation sets of the longitudinal data missing proportion under different missing mechanisms and different missing proportions on the basis of the longitudinal data; and obtaining a simulation set of the longitudinal data based on the training set of the longitudinal data and the simulation set of the longitudinal data missing proportion.
Illustratively, the first data can be the group, age and gender of the measured objects, and the second data can be the total scores of the three NIHSS scale assessments. The first data and the second data together form the training set of longitudinal data, and simulation sets are constructed in which the monotone missing proportion of the first data is 10%, 20% and 30% under the three mechanisms MCAR, MAR and MNAR. Each measured object is assumed to have the first measurement after baseline, so missingness is constructed starting from the 2nd visit, and if the 2nd visit is missing, the data of the 3rd visit are also missing. MAR missingness depends on gender: data of the 451 female subjects are randomly deleted at a proportion of 10%, missingness is constructed proportionally for the remaining 988 male subjects, and the two parts are finally combined to give overall missing proportions of 10%, 20% and 30%. MNAR missingness depends on the score reduction rate over the previous visits, where the reduction rate is (total score at the first visit - total score at the second visit) / total score at the first visit × 100%. Treatment is regarded as ineffective for patients whose NIHSS score reduction rate is less than or equal to 18%, that is, the data of patients whose reduction rate over the first 2 visits is less than or equal to 18% are more easily lost. Therefore, the 2nd-visit data of the 564 patients whose reduction rate over the first two visits is greater than 18% are randomly deleted at a proportion of 10%, and missingness is constructed proportionally for the remaining 875 patients. The training set of longitudinal data and the simulation set of the longitudinal data missing proportion are then merged into the simulation set of the longitudinal data.
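The score reduction rate and the monotone missing pattern described above can be sketched in a few lines. The function names `reduction_rate` and `apply_monotone_dropout` are hypothetical, and visits are indexed from 0 in the code, so index 1 is the 2nd visit.

```python
from typing import List, Optional, Sequence


def reduction_rate(first_total: float, second_total: float) -> float:
    """NIHSS score reduction rate in percent, as defined in the text."""
    return (first_total - second_total) / first_total * 100.0


def apply_monotone_dropout(visits: Sequence[float], drop_from: int) -> List[Optional[float]]:
    """Monotone pattern: once a visit is missing, every later visit is missing too."""
    return [value if index < drop_from else None for index, value in enumerate(visits)]


# Toy subject (illustrative only): NIHSS totals of 8, 6 and 5 over three visits.
subject = [8.0, 6.0, 5.0]
rate = reduction_rate(subject[0], subject[1])         # (8 - 6) / 8 * 100
dropped = apply_monotone_dropout(subject, drop_from=1)  # 2nd and 3rd visits removed
```

Under the MNAR rule in the text, a subject with a reduction rate of 18% or less would fall into the stratum whose data are more easily lost, while a subject such as this one falls into the stratum deleted at 10%.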
The embodiment of the invention also discloses a missing data fitting device, as shown in fig. 2, the device comprises:
An acquiring module 201, configured to acquire measured data of the measured object. For details, refer to the description of step 101 above, which is not repeated here.
An analysis module 202, configured to analyze the measured data to obtain missing features of the measured data. For details, refer to the description of step 102 above, which is not repeated here.
A first establishing module 203, configured to establish a measured data set based on the missing features. For details, refer to the description of step 103 above, which is not repeated here.
A second establishing module 204, configured to establish a simulation data set based on the measured data set. For details, refer to the description of step 104 above, which is not repeated here.
A first filling module 205, configured to fill the missing data in the simulation data set using different data filling methods to obtain a plurality of simulation filling results. For details, refer to the description of step 105 above, which is not repeated here.
A fitting module 206, configured to determine an optimal data filling method based on the plurality of simulation filling results. For details, refer to the description of step 106 above, which is not repeated here.
A second filling module 207, configured to perform data fitting on the measured data based on the optimal data filling method to obtain a fitted data set. For details, refer to the description of step 107 above, which is not repeated here.
The missing data fitting device provided by the invention comprises: an acquiring module 201, configured to acquire measured data of a measured object; an analysis module 202, configured to analyze the measured data to obtain missing features of the measured data; a first establishing module 203, configured to establish a measured data set based on the missing features; a second establishing module 204, configured to establish a simulation data set based on the measured data set; a first filling module 205, configured to fill the missing data in the simulation data set using different data filling methods to obtain a plurality of simulation filling results; a fitting module 206, configured to determine an optimal data filling method based on the plurality of simulation filling results; and a second filling module 207, configured to perform data fitting on the measured data based on the optimal data filling method to obtain a fitted data set. By constructing a data set from the measured data and establishing simulation data sets according to the different data types in the measured data, an optimal missing data filling method is obtained, so that the filled data are closer to the real data.
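The idea behind the first filling module and the fitting module can be illustrated with a small sketch: delete known values from complete data, fill them with each candidate method, and keep the method whose filled values are closest to the truth. The candidate methods (mean and median filling) and the error measure (root mean squared error on the deleted cells) are assumptions chosen for illustration; this section does not specify which methods or statistics are used. The names `fill`, `rmse_on_masked` and `best_method` are hypothetical.

```python
import math
import statistics
from typing import Callable, Dict, List, Sequence


def fill(values: Sequence[float], mask: Sequence[bool],
         stat: Callable[[List[float]], float]) -> List[float]:
    """Replace masked entries with a statistic computed from the unmasked entries."""
    observed = [v for v, m in zip(values, mask) if not m]
    fill_value = stat(observed)
    return [fill_value if m else v for v, m in zip(values, mask)]


def rmse_on_masked(truth: Sequence[float], filled: Sequence[float],
                   mask: Sequence[bool]) -> float:
    """Root mean squared error between true and filled values on the masked entries."""
    errors = [(t - f) ** 2 for t, f, m in zip(truth, filled, mask) if m]
    return math.sqrt(sum(errors) / len(errors))


def best_method(truth: Sequence[float], mask: Sequence[bool],
                methods: Dict[str, Callable[[List[float]], float]]) -> str:
    """Return the name of the candidate method with the smallest error."""
    scores = {
        name: rmse_on_masked(truth, fill(truth, mask, stat), mask)
        for name, stat in methods.items()
    }
    return min(scores, key=scores.get)


# Toy complete data with one outlier; the third value is deleted for the simulation.
truth = [1.0, 2.0, 3.0, 4.0, 100.0]
mask = [False, False, True, False, False]
methods = {"mean": statistics.mean, "median": statistics.median}
chosen = best_method(truth, mask, methods)
```

In this toy case the median of the observed values is 3.0, exactly the deleted value, so median filling wins; the selected method would then be applied to the real measured data, which is the role of the second filling module.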
An embodiment of the present invention further provides a computer device. As shown in fig. 3, the computer device may include a processor 301 and a memory 302, where the processor 301 and the memory 302 may be connected by a bus or in another manner; fig. 3 takes connection by a bus as an example.
The memory 302, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the missing data fitting method in the embodiments of the present invention. The processor 301 executes various functional applications and data processing by running the non-transitory software programs, instructions and modules stored in the memory 302, that is, it implements the missing data fitting method in the above method embodiments.
The memory 302 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 301, and the like. Further, the memory 302 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 302 may optionally include memory located remotely from the processor 301, which may be connected to the processor 301 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 302 and, when executed by the processor 301, perform the missing data fitting method of the embodiment shown in fig. 1.
The details of the computer device can be understood with reference to the corresponding related descriptions and effects in the embodiment shown in fig. 1, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.
Claims (11)
1. A missing data fitting method, comprising:
acquiring measured data of a measured object;
analyzing the measured data to obtain the missing features of the measured data;
establishing a measured data set based on the missing features;
establishing a simulated dataset based on the measured dataset;
filling by using different data filling methods based on the missing data in the simulation data set to obtain a plurality of simulation filling results;
determining an optimal data padding method based on the plurality of simulated padding results;
and performing data fitting on the measured data based on the optimal data filling method to obtain a fitting data set.
2. The method of claim 1, wherein the acquiring measured data of the measured object comprises:
extracting, from the measured data, variables that can take any value within a certain interval to serve as continuous variables;
extracting data classified based on morphological characteristics of the measured object from the measured data to serve as classification variables;
and extracting data collected at successive acquisition time points from the measured data to serve as longitudinal variables.
3. The method of claim 2, wherein analyzing the measured data to obtain the missing features of the measured data comprises:
obtaining the missing proportion of the continuous variable based on the amount of missing data for the measured object, obtaining a missing mode based on the missing data distribution of the continuous variable, and obtaining a missing mechanism of the continuous variable based on the reason for the missing data;
obtaining the missing proportion of the classification variable based on the amount of missing data for the measured object, obtaining a missing mode based on the missing data distribution of the classification variable, and obtaining a missing mechanism of the classification variable based on the reason for the missing data;
and evaluating the measured object to obtain an evaluation result, obtaining the missing proportion of the longitudinal variable based on the amount of missing data in the evaluation result, obtaining a missing mode based on the missing data distribution of the longitudinal variable, and obtaining the missing mechanism of the longitudinal variable based on the missing data distribution.
4. The method of claim 3, wherein creating a measured data set based on the missing features comprises:
acquiring missing characteristics in the measured data, wherein the missing characteristics comprise a missing proportion, a missing mechanism and a missing mode of the data;
and establishing the measured data set based on the missing proportion, the missing mechanism and the missing mode of each measured data.
5. The method of claim 4, wherein creating a simulated dataset based on the measured dataset comprises:
constructing a complete continuous variable training set based on data without missing continuous variables in the measured data;
constructing missing proportion simulation sets with different missing proportions under different missing mechanisms based on the continuous variables;
and constructing a continuous variable simulation set based on the continuous variable training set and the missing proportion simulation set.
6. The method of claim 4, wherein creating a simulated dataset based on the measured dataset comprises:
constructing a complete classification variable training set based on data of classification variables in the measured data, wherein the data of the classification variables are not lost;
constructing a linear regression model based on the classification variable training set;
establishing a classification variable missing proportion simulation set with different missing proportions under different missing mechanisms based on the first data;
and constructing a simulation set of the classification variables based on the classification variable training set and the classification variable missing proportion simulation set.
7. The method of claim 4, wherein creating a simulated dataset based on the measured dataset comprises:
constructing a training set of longitudinal data based on data of which the first data and the second data are not missing in the measured data;
establishing a simulation set of the missing proportion of the longitudinal data under different missing mechanisms and different missing proportions on the basis of the longitudinal data;
and obtaining a simulation set of the longitudinal data based on the training set of the longitudinal data and the simulation set of the longitudinal data missing proportion.
8. The method of any of claims 1-7, wherein determining an optimal data padding method based on the plurality of simulated padding results comprises:
calculating to obtain a statistical result of each data filling method based on the plurality of filling results;
and obtaining the optimal data filling method of the missing data based on the statistical result.
9. A missing data fitting apparatus, comprising:
the acquisition module is used for acquiring the measured data of the measured object;
the analysis module is used for analyzing and obtaining the missing characteristics of the measured data based on the measured data;
a first establishing module, configured to establish a measured data set based on the missing feature;
a second establishing module for establishing a simulation data set based on the measured data set;
the first filling module is used for filling by using different data filling methods based on missing data in the simulation data set to obtain a plurality of simulation filling results;
a fitting module for determining an optimal data padding method based on the plurality of simulated padding results;
and the second filling module is used for performing data fitting on the measured data based on the optimal data filling method to obtain a fitting data set.
10. A computer device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the missing data fitting method of any one of claims 1-8.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the missing data fitting method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110908971.6A CN113782128A (en) | 2021-08-09 | 2021-08-09 | Missing data fitting method and device and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113782128A (en) | 2021-12-10 |
Family
ID=78837073
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110908971.6A Pending CN113782128A (en) | 2021-08-09 | 2021-08-09 | Missing data fitting method and device and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113782128A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117275220A (en) * | 2023-08-31 | 2023-12-22 | 云南云岭高速公路交通科技有限公司 | Mountain expressway real-time accident risk prediction method based on incomplete data |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108492560A (en) * | 2018-04-04 | 2018-09-04 | 东南大学 | A kind of Road Detection device missing data complementing method and device |
Non-Patent Citations (3)
Title |
---|
栾荣生主编: "《流行病学研究原理与方法 第2版》", 31 August 2014, 四川科学技术出版社, pages: 265 - 271 * |
董学思: "多组学缺失数据联合填补方法评价及其应用", 《中国优秀硕士学位论文全文数据库 医药卫生科技辑》, pages 3 * |
陈丽嫦;衡明莉;王骏;陈平雁;: "定量纵向数据缺失值处理方法的模拟比较研究", 中国卫生统计, no. 03 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Rhodes et al. | Predictive distributions were developed for the extent of heterogeneity in meta-analyses of continuous outcome data | |
CN113133752B (en) | Psychological assessment method, system, device and medium based on heart rate variability analysis | |
JP2002534144A (en) | Method and apparatus for performing a visual field test and computer program for processing the result | |
CN114360728A (en) | Prediction model for mild cognitive dysfunction of diabetes and construction method of nomogram | |
Nath et al. | Machine learning-based anxiety detection in older adults using wristband sensors and context feature | |
CN115714022A (en) | Neonatal jaundice health management system based on artificial intelligence | |
CN115101204A (en) | Model, equipment and storage medium for quantitatively evaluating depression risk based on blood biochemical indexes | |
Crossland et al. | Comparing body image dissatisfaction between pregnant women and non-pregnant women: a systematic review and meta-analysis | |
CN113782128A (en) | Missing data fitting method and device and computer equipment | |
CN113052205B (en) | Lying-in woman data classification method, device, equipment and medium based on machine learning | |
CN106777872A (en) | A kind of method that life of elderly person quality evaluation is carried out based on intelligent wearing technology | |
CN116861252A (en) | Method for constructing fall evaluation model based on balance function abnormality | |
WO2012141240A1 (en) | Campimeter | |
CN116052877A (en) | Diabetes patient depression risk assessment method and assessment system construction method | |
Harper et al. | Classification trees: A possible method for maternity risk grouping | |
US20210241116A1 (en) | Quantification and estimation based on digital twin output | |
DE112018005890T5 (en) | INFORMATION PROCESSING DEVICE, PROCEDURE AND PROGRAM | |
Batista‐Foguet et al. | Using structural equation models to evaluate the magnitude of measurement error in blood pressure | |
Shibayama et al. | Changes in standing body sway of pregnant women after long-term bed rest | |
CN113157640A (en) | Family doctor auxiliary inquiry device, terminal and inquiry system | |
CN110634541A (en) | Oral health data acquisition and analysis method | |
DE112019002926T5 (en) | Device for supporting behavior change, terminal and server | |
Colacce et al. | How accurately do mothers recall prenatal visits and gestational age? A validation of Uruguayan survey data | |
Al Nasiria et al. | Health-related quality of life in children with sickle cell disease: a concept analysis. | |
Nader | Connectivity analysis of the EHG during pregnancy and Labor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||