
CN115510970A - Characteristic transformation and extraction system based on public health data acquisition

Info

Publication number: CN115510970A
Application number: CN202211165935.6A
Authority: CN
Inventors: 夏天; 夏寒; 付晨; 张�诚; 毛丹; 道理; 刘星航; 林维晓
Original assignee: Shanghai Municipal Center For Disease Control & Prevention
Current assignee: Shanghai Municipal Center For Disease Control & Prevention
Priority date: 2022-09-23
Filing date: 2022-09-23
Publication date: 2022-12-23

Abstract

The invention discloses a feature transformation and extraction system based on public health data acquisition. The system comprises a data preparation stage, a feature engineering stage, and a model evaluation and assessment stage, wherein the feature engineering stage comprises the following steps: S1, preprocessing the classified data in the diabetes follow-up scenario data; S2, first identifying the features that have a large number of missing values in part of the data, and eliminating those features provided that they do not affect the reliability judgment result; S3, because some machine learning algorithms are sensitive to missing values in the input data, handling the missing values in a manner suited to the machine learning algorithm to be used. The system reduces the dimensionality of the data, simplifies the data model, improves the interpretability of the model, shortens the time required for model training, reduces the risk of overfitting, and avoids the curse of dimensionality.

Description

Characteristic transformation and extraction system based on public health data acquisition

Technical Field

The invention relates to the technical field of data analysis and processing, in particular to a feature transformation and extraction system based on public health data acquisition.

Background

With the development of information technology and the arrival of the big data era, more and more scientific research relies on existing data, and interdisciplinary research spanning multiple subjects and fields is steadily increasing. Such research needs data from multiple disciplines and sources. Before the research begins, the reliability of the data to be included must be evaluated, and corresponding measures must be taken according to the evaluation results, in order to improve the authenticity and accuracy of the research results. Data reliability refers to the integrity, consistency, accuracy, credibility, and dependability of the data, and the degree to which these properties are maintained throughout the data life cycle. Bias in the data reduces its reliability. Common data biases include selection bias, information bias, and confounding bias; in some cases data are even fabricated or falsified. If data of low reliability are used in scientific research, the results deviate from the real situation and their value is greatly reduced. Evaluating the reliability of data before they are incorporated into research is therefore important for preserving the value of scientific research in the big data era.

In the prior art in the field of public health, data reliability evaluation methods fall into three main types: rule-based evaluation methods, content-based evaluation methods, and statistics-based evaluation methods.

The rule-based evaluation method evaluates the reliability of data by setting up a rule base, verifying the data against the rules in it, and judging reliability from the verification result.
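
As a minimal sketch of the rule-based approach described above, the following Python example checks follow-up records against a small rule base. The field names (`systolic_bp`, `fasting_glucose`, `visit_date`), the plausible ranges, the reference date, and the helper `validate_record` are hypothetical and chosen only for illustration; none of them is taken from the prior art being described.

```python
from datetime import date

# Hypothetical reference date used by the "not in the future" rule.
REFERENCE_DATE = date(2022, 9, 23)

# Hypothetical rule base: each rule is a (name, predicate) pair.
# Field names and thresholds are illustrative assumptions only.
RULES = [
    ("systolic_bp in plausible range", lambda r: 60 <= r["systolic_bp"] <= 260),
    ("fasting_glucose in plausible range", lambda r: 1.0 <= r["fasting_glucose"] <= 40.0),
    ("visit_date not in the future", lambda r: r["visit_date"] <= REFERENCE_DATE),
]


def validate_record(record, rules=RULES):
    """Return the names of all rules that the record violates."""
    return [name for name, check in rules if not check(record)]


# Illustrative records (not real data).
records = [
    {"systolic_bp": 128, "fasting_glucose": 6.4, "visit_date": date(2022, 3, 1)},
    {"systolic_bp": 400, "fasting_glucose": 6.1, "visit_date": date(2022, 5, 9)},
]
violations = [validate_record(r) for r in records]
```

A record with an empty violation list passes every rule; a non-empty list names the rules that failed, which can then feed the reliability judgment for that record.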

The content-based evaluation method evaluates the reliability of data by cross-validating the content to be evaluated against data from other sources. Commonly used other-source data include data collected through telephone or home follow-up visits and data obtained by consulting original medical records. The content-based method can provide an in-depth evaluation, is effective at detecting information bias, and can also provide some clues to selection bias and confounding bias. However, acquiring data from other sources usually requires considerable time, effort, and money, and sometimes such data are not accessible at all. The method is therefore difficult to apply to an entire data set and is generally combined with sampling, so the resulting reliability estimate may itself be biased.

The statistics-based evaluation method evaluates the reliability of data as a whole by calculating and analyzing the overall statistics and distribution of the data to be evaluated, for example whether the last digit of blood pressure values is randomly distributed, or whether the ratio of men to women in the data differs significantly from that of the whole population. The statistics-based method is effective at detecting selection bias and information bias and can be applied to the full data set. However, it requires a certain volume of data, and it yields a reliability result only for the data as a whole; it cannot produce an independent evaluation result for each record.
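
The last-digit check mentioned above can be sketched as a chi-square goodness-of-fit test against a uniform distribution over the digits 0 to 9. The function name `terminal_digit_chi2` and both sample data sets are illustrative assumptions; the critical value 16.919 is the standard chi-square value for 9 degrees of freedom at the 0.05 significance level.

```python
from collections import Counter

# Critical value of the chi-square distribution with 9 degrees of freedom
# at the 0.05 significance level.
CHI2_CRIT_DF9_ALPHA05 = 16.919


def terminal_digit_chi2(values):
    """Chi-square statistic for the last digits of integer readings,
    compared with a uniform distribution over the digits 0-9."""
    counts = Counter(abs(int(v)) % 10 for v in values)
    expected = len(values) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


# Illustrative readings (not real data): every value ends in 0 or 5,
# a pattern that suggests rounding by the person recording the values.
rounded = [120, 125, 130, 135, 140, 120, 130, 140, 125, 135] * 10
# Illustrative readings whose last digits are spread evenly over 0-9.
even = [120 + d for d in range(10)] * 10

stat_rounded = terminal_digit_chi2(rounded)
stat_even = terminal_digit_chi2(even)
# A statistic above the critical value flags the data set as a whole.
suspicious = stat_rounded > CHI2_CRIT_DF9_ALPHA05
```

Consistent with the limitation noted above, the statistic flags the data set as a whole and says nothing about which individual records are unreliable.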

Therefore, each of the existing public health data reliability evaluation methods has shortcomings, and none fully meets the big data era's need to evaluate the reliability of massive data. A new method needs to be explored that overcomes the shortcomings of the existing methods, evaluates the reliability of public health data more accurately, and uncovers more clues to data reliability problems. In addition, the processing of public health data lags behind, so other machine learning methods need to be explored that reduce the amount of label data required.

Disclosure of Invention

The invention aims to solve the defects in the prior art and provides a feature transformation and extraction system based on public health data acquisition.

In order to achieve the purpose, the invention adopts the following technical scheme:

A feature transformation and extraction system based on public health data acquisition comprises the following stages: a data preparation stage, a feature engineering stage, and a model evaluation and assessment stage, wherein the feature engineering stage comprises the following steps:

S1, preprocessing the classified data in the diabetes follow-up scenario data;

S2, first identifying the features that have a large number of missing values in part of the data, and eliminating those features provided that they do not affect the reliability judgment result;

S3, because some machine learning algorithms are sensitive to missing values in the input data, handling the missing values in a manner suited to the machine learning algorithm to be used;

S4, standardizing the format of the data: converting numeric types (integer, floating point, and number of decimal places retained), adjusting the units of features, and unifying the formats of dates and times;

S5, judging from the amount of label data in each category whether the categories need to be balanced. If the amount of label data is extremely unbalanced across categories, subsequent model training is affected; in that case the SMOTE data synthesis method is used to artificially synthesize data for the under-represented categories, which increases their data volume and balances the categories. This avoids undersampling the categories with more data and so reduces the discarding of labels;

S6, dividing the label data into a training set, a validation set, and a test set, wherein the training set is used to train the model, the validation set is used to validate the model and tune hyperparameters, and the test set is used to test, finalize, and evaluate the performance of the final model;

S7, making the continuous features dimensionless, which eliminates the influence of feature units and converts the features to a common scale. This prevents some features in certain models from receiving weights of a very different magnitude from those of other features, and also improves the efficiency of machine learning. Depending on whether a feature follows a normal distribution, the feature is scaled and shifted either by normalization (using the feature's maximum and minimum values to compress its range to the interval [0,1]) or by standardization (converting the feature to a standard-normal scale with a mean of 0 and a standard deviation of 1), while the shape of its distribution is preserved;

S8, reducing information redundancy: quantitative data for which only the qualitative result matters (for example, whether an examination is passed) are discretized (a continuous feature is divided into several discrete categories according to a judgment standard) or binarized (a continuous feature is divided into a single discrete feature with two states according to a judgment standard). In addition, depending on the machine learning model to be adopted, applying a function transformation to some features can also improve model training;

S9, digitizing the discrete features by converting the specific category names into codes. If a feature has more than two categories, dummy features are generated for the different categories and marked with one-hot codes, so that the magnitude of the codes themselves (for example, 1, 2) is not learned as a feature during machine learning, and so that distances between features are easier to calculate. Furthermore, for timestamp features, some information (such as the year or milliseconds, where these have no influence on the data reliability result) can be discarded as appropriate, which simplifies model construction;

S9, after feature conversion the data are suitable for machine learning, but the number of features (the dimensionality) may be large. To simplify the final model and improve its interpretability, and also to shorten training time, reduce the risk of overfitting, and avoid the curse of dimensionality, features are selected and extracted, and new features are constructed to replace the original ones where necessary, thereby reducing the feature dimensionality;

S10, to retain important information while reducing the dimensionality of the data as far as possible, feature extraction uses linear methods (principal component analysis (PCA) and linear discriminant analysis (LDA)) and nonlinear methods (locally linear embedding (LLE), Laplacian eigenmaps (LE), stochastic neighbor embedding (SNE), and t-distributed stochastic neighbor embedding (t-SNE)) to achieve dimensionality reduction; the algorithm is selected according to the actual data.
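
Steps S2, S3, and S7 above can be illustrated with a short Python sketch using only the standard library. The column names, the 50% missing-value threshold, and the choice of median imputation are assumptions made for illustration; the source specifies neither a threshold nor a particular imputation method, and the judgment in S2 of whether a feature affects the reliability result is a domain decision that the sketch does not model.

```python
import statistics

# Illustrative follow-up records (not real data); None marks a missing value.
# The column names are hypothetical.
rows = [
    {"age": 61, "bmi": 24.0, "hba1c": 7.1, "waist_cm": None},
    {"age": 55, "bmi": None, "hba1c": 6.8, "waist_cm": None},
    {"age": 70, "bmi": 28.0, "hba1c": 8.4, "waist_cm": 92.0},
    {"age": 48, "bmi": 22.0, "hba1c": 6.5, "waist_cm": None},
]


def drop_sparse_features(rows, max_missing_ratio=0.5):
    """S2: remove features whose share of missing values exceeds the threshold."""
    keep = [
        col for col in rows[0]
        if sum(r[col] is None for r in rows) / len(rows) <= max_missing_ratio
    ]
    return [{col: r[col] for col in keep} for r in rows]


def impute_median(rows):
    """S3: fill remaining missing values with the column median (one option of several)."""
    medians = {
        col: statistics.median(r[col] for r in rows if r[col] is not None)
        for col in rows[0]
    }
    return [{c: (medians[c] if v is None else v) for c, v in r.items()} for r in rows]


def min_max_normalize(values):
    """S7: compress values to [0, 1] using the minimum and maximum.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """S7: rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


cleaned = impute_median(drop_sparse_features(rows))
age_normalized = min_max_normalize([r["age"] for r in cleaned])
age_standardized = standardize([r["age"] for r in cleaned])
```

Both scaling functions are linear, so they shift and compress a feature without changing the shape of its distribution, which matches the requirement stated in S7.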

The invention has the following beneficial effects:

1. In the invention, label data for categories with a small data volume are artificially synthesized by the SMOTE method to balance the amount of label data across categories, which minimizes the influence of the imbalance problem on model construction;

2. In the invention, feature transformation and extraction of the processed data retain the main features of the data while reducing the number of features, and new features can be constructed to replace the original ones where necessary. This reduces the dimensionality of the data, simplifies the data model, improves the interpretability of the model, shortens the time required for model training, reduces the risk of overfitting, and avoids the curse of dimensionality.
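
The SMOTE synthesis referred to above can be sketched as follows: each synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified, standard-library illustration of the general technique rather than the implementation used in the invention; the function signature, the value of `k`, and the sample data are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating between a sample
    and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Illustrative two-feature data (not real data): 6 majority vs. 3 minority samples.
majority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.3, 1.0)]
minority = [(3.0, 3.0), (3.2, 2.8), (2.9, 3.3)]

new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
```

Because every synthetic point is an interpolation of two existing minority samples, the minority class grows to the size of the majority class without discarding any labelled majority records, which is the advantage over undersampling noted in S5.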

Detailed Description

The technical solution of the present invention will be clearly and completely described with reference to the following examples.

Example one

The invention provides a feature transformation and extraction system based on public health data acquisition, which comprises the following stages: a data preparation stage, a feature engineering stage, and a model evaluation and assessment stage, wherein the feature engineering stage comprises the following steps:

S1, preprocessing the classified data in the diabetes follow-up scenario data;

S2, first identifying the features that have a large number of missing values in part of the data, and eliminating those features provided that they do not affect the reliability judgment result;

S5, judging from the amount of label data in each category whether the categories need to be balanced. If the amount of label data is extremely unbalanced across categories, subsequent model training is affected; in that case the SMOTE data synthesis method is used to artificially synthesize data for the under-represented categories, which increases their data volume and balances the categories. This avoids undersampling the categories with more data and so reduces the discarding of labels;

S6, dividing the label data into a training set, a validation set, and a test set, wherein the training set is used to train the model, the validation set is used to validate the model and tune hyperparameters, and the test set is used to test, finalize, and evaluate the performance of the final model;

S8, reducing information redundancy: quantitative data for which only the qualitative result matters (for example, whether an examination is passed) are discretized (a continuous feature is divided into several discrete categories according to a judgment standard) or binarized (a continuous feature is divided into a single discrete feature with two states according to a judgment standard). In addition, depending on the machine learning model to be adopted, applying a function transformation to some features can also improve model training;

S9, digitizing the discrete features by converting the specific category names into codes. If a feature has more than two categories, dummy features are generated for the different categories and marked with one-hot codes, so that the magnitude of the codes themselves (for example, 1, 2) is not learned as a feature during machine learning, and so that distances between features are easier to calculate. Furthermore, for timestamp features, some information (such as the year or milliseconds, where these have no influence on the data reliability result) can be discarded as appropriate, which simplifies model construction;

S9, after feature conversion the data are suitable for machine learning, but the number of features (the dimensionality) may be large. To simplify the final model and improve its interpretability, and also to shorten training time, reduce the risk of overfitting, and avoid the curse of dimensionality, features are selected and extracted, and new features are constructed to replace the original ones where necessary, thereby reducing the feature dimensionality;
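
The data split in S6, the discretization and binarization in S8, and the one-hot coding in S9 of this example can be sketched in Python as follows. The 60/20/20 split ratio, the thresholds, the band boundaries, and the category names are hypothetical values chosen for illustration; the source does not specify any of them.

```python
import random


def split_dataset(records, train=0.6, validation=0.2, seed=0):
    """S6: shuffle and split into training, validation, and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def binarize(value, threshold):
    """S8: map a continuous value to a single two-state feature."""
    return 1 if value >= threshold else 0


def discretize(value, boundaries):
    """S8: map a continuous value to the index of the interval it falls in.
    `boundaries` must be sorted in ascending order."""
    return sum(value >= b for b in boundaries)


def one_hot(value, categories):
    """S9: encode a category as a one-hot vector of dummy features."""
    return [1 if value == c else 0 for c in categories]


# Illustrative thresholds and categories; assumptions, not values from the source.
glucose_flag = binarize(7.4, threshold=7.0)
bmi_band = discretize(26.5, boundaries=[18.5, 24.0, 28.0])
visit_mode = one_hot("phone", categories=["clinic", "home", "phone"])

train_set, validation_set, test_set = split_dataset(list(range(10)))
```

With one-hot coding, the three visit modes are equidistant from one another, whereas integer codes such as 1, 2, 3 would imply an ordering that the categories do not have.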

The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A feature transformation and extraction system based on public health data acquisition, characterized by comprising the following stages: a data preparation stage, a feature engineering stage, and a model evaluation and assessment stage, wherein the feature engineering stage comprises the following steps:

S1, preprocessing the classified data in the diabetes follow-up scenario data;

S7, making the continuous features dimensionless, eliminating the influence of feature units, and converting the features to a common scale, so that some features in certain models do not receive weights of a very different magnitude from those of other features and the efficiency of machine learning is improved; and, depending on whether a feature follows a normal distribution, scaling and shifting the feature either by normalization (using the feature's maximum and minimum values to compress its range to the interval [0,1]) or by standardization (converting the feature to a standard-normal scale with a mean of 0 and a standard deviation of 1), while preserving the shape of its distribution;