CN115510970A - Characteristic transformation and extraction system based on public health data acquisition - Google Patents

Characteristic transformation and extraction system based on public health data acquisition

Publication number
CN115510970A
Authority
CN
China
Prior art keywords
data
features
model
machine learning
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211165935.6A
Other languages
Chinese (zh)
Inventor
夏天
夏寒
付晨
张�诚
毛丹
道理
刘星航
林维晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Municipal Center For Disease Control & Prevention
Original Assignee
Shanghai Municipal Center For Disease Control & Prevention
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Municipal Center For Disease Control & Prevention
Priority to CN202211165935.6A
Publication of CN115510970A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/26: Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Tourism & Hospitality (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a feature transformation and extraction system based on public health data acquisition. The system comprises a data preparation stage, a feature engineering stage, and a model evaluation stage. The feature engineering stage comprises the following steps: S1, preprocessing the data that has already been classified within the diabetes follow-up scenario data; S2, identifying features that have a large number of missing values in part of the data and removing them, provided that removing them does not affect the reliability judgment result; S3, because some machine learning algorithms are sensitive to missing values in their input data, handling the missing values in a way suited to the machine learning algorithm to be used. The invention reduces the dimensionality of the data, simplifies the data model, improves the interpretability of the model, shortens the time required for model training, reduces the risk of overfitting, and avoids the curse of dimensionality.

Description

Characteristic transformation and extraction system based on public health data acquisition
Technical Field
The invention relates to the technical field of data analysis and processing, in particular to a feature transformation and extraction system based on public health data acquisition.
Background
With the development of information technology and the arrival of the big data era, more and more scientific research relies on existing data. At the same time, interdisciplinary research spanning multiple subjects and fields is increasing, and such research needs data from multiple subject areas and sources. Before the research begins, the reliability of the data to be included should be evaluated, and measures should be taken according to the evaluation results to improve the authenticity and accuracy of the research results.
Data reliability refers to the integrity, consistency, accuracy, and trustworthiness of the data, and the degree to which these characteristics are maintained throughout the data life cycle. Bias in the data reduces its reliability. Common data biases include selection bias, information bias, and confounding bias; in some cases the data may even be fabricated or falsified. If data of low reliability is used in scientific research, the research results deviate from the real situation and their value is greatly reduced. In the big data era, assessing the reliability of data before it is incorporated into research is therefore important to the value of the research results.
In the field of public health, prior-art data reliability evaluation methods fall into three main types: rule-based, content-based, and statistics-based evaluation methods.
The rule-based evaluation method evaluates data reliability by setting up a rule base, validating the data against the rules in the rule base, and judging reliability from the validation results.
The content-based evaluation method cross-validates the content of the data to be evaluated against data from other sources. Commonly used other sources include data collected through telephone or in-person follow-up visits and data obtained by consulting original medical records. The content-based method can provide an in-depth evaluation, works well for detecting information bias, and can to some extent also provide clues to selection bias and confounding bias. However, acquiring data from other sources usually costs considerable time, effort, and money, and sometimes such data is not accessible at all. It is therefore difficult to evaluate the reliability of the whole data set with this method. It is generally combined with sampling, so the estimated reliability may itself be biased.
The statistics-based evaluation method evaluates the reliability of the data as a whole by calculating and analyzing overall statistics and distributions of the data to be evaluated, for example, whether the last digit of blood pressure values is randomly distributed, or whether the proportion of men and women in the data differs significantly from that of the whole population. The statistics-based method works well for selection bias and information bias and can be applied to the full data set. However, it requires a certain minimum volume of data, and it yields only a reliability result for the data set as a whole; it cannot produce an independent evaluation result for each individual record.
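The last-digit check mentioned above can be illustrated with a short sketch. It is not part of the patent; it assumes blood pressure readings are recorded as whole numbers and uses a chi-square goodness-of-fit test against a uniform distribution of last digits. The function names and the 0.05 significance level are illustrative assumptions.

```python
from collections import Counter

# Critical value of the chi-square distribution with 9 degrees of freedom
# at the 0.05 significance level (standard statistical table value).
CHI2_CRIT_DF9_P05 = 16.919


def last_digit_chi_square(values):
    """Chi-square statistic testing whether last digits 0-9 are uniform."""
    if not values:
        raise ValueError("values must not be empty")
    digits = [abs(int(v)) % 10 for v in values]
    counts = Counter(digits)
    expected = len(digits) / 10  # uniform expectation per digit
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


def shows_digit_preference(values):
    """True if the last digits deviate from uniform at the 0.05 level."""
    return last_digit_chi_square(values) > CHI2_CRIT_DF9_P05


# Hypothetical systolic readings that were all rounded to the nearest ten.
rounded_readings = [120] * 50 + [130] * 50
print(shows_digit_preference(rounded_readings))
```

As the text notes, a check of this kind only describes the data set as a whole and needs a reasonable number of records; it says nothing about which individual record is unreliable.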
Each of the existing methods for assessing the reliability of public health data therefore has its own shortcomings, and none fully meets the need to assess massive data sets in the big data era. A new method is needed that overcomes these shortcomings, assesses the reliability of public health data more accurately, and uncovers more clues to data reliability problems. In addition, the processing of public health data lags behind, so other machine learning methods need to be explored that reduce the amount of labeled data required.
Disclosure of Invention
The invention aims to solve the defects in the prior art and provides a feature transformation and extraction system based on public health data acquisition.
In order to achieve the purpose, the invention adopts the following technical scheme:
A feature transformation and extraction system based on public health data acquisition comprises a data preparation stage, a feature engineering stage, and a model evaluation stage, wherein the feature engineering stage comprises the following steps:
S1: Preprocess the data that has already been classified within the diabetes follow-up scenario data;
S2: First identify features that have a large number of missing values in part of the data, and remove those features, provided that removing them does not affect the reliability judgment result;
S3: Because some machine learning algorithms are sensitive to missing values in the input data, handle the missing values in a way suited to the machine learning algorithm to be used;
S4: Standardize the data formats: convert numeric types (integer, floating point, and number of decimal places retained), adjust the units of features, and unify date and time formats;
S5: Based on the amount of labeled data in each category, decide whether the categories need to be balanced. If the amount of labeled data is extremely imbalanced across categories, subsequent model training will be affected. In that case, use the SMOTE data synthesis method to artificially synthesize data for the under-represented categories, increasing their data volume until the categories are balanced. This avoids undersampling the categories that have more data and so reduces the number of labels discarded;
S6: Divide the labeled data into a training set, a validation set, and a test set. The training set is used to train the model, the validation set to validate the model and tune hyperparameters, and the test set to test, finalize, and evaluate the performance of the final model;
S7: Make continuous features dimensionless to eliminate the influence of feature units and bring the features to a common scale. This prevents some features in certain models from receiving weights of very different magnitude from other features, and also improves the efficiency of machine learning. Depending on whether a feature follows a normal distribution, it can be scaled and shifted either by normalization (using the maximum and minimum values of the feature to compress its range to the interval [0, 1]) or by standardization (transforming the feature to have a mean of 0 and a standard deviation of 1), while preserving the shape of its distribution;
S8: Reduce information redundancy. For quantitative data where only the qualitative result matters (for example, whether an examination was passed), apply discretization (dividing a continuous feature into several discrete categories according to a judgment standard) or binarization (converting a continuous feature into a single discrete feature with two states according to a judgment standard). In addition, depending on the machine learning model to be used, applying a function transformation to some features can also improve model training;
S9: Convert discrete features to numeric form by encoding the specific category names as codes. If a feature has more than two categories, generate dummy features for the categories and mark them with one-hot encoding, so that the magnitude of the codes (e.g., 1, 2) is not itself learned as a feature, and so that distances between features are easier to compute during machine learning. For timestamp features, some information (such as the year or milliseconds, when these have no influence on the data reliability result) can be discarded as appropriate to simplify model construction;
S10: After feature transformation the data is suitable for machine learning, but the number of features (the dimensionality) may be large. To simplify the final model and improve its interpretability, and to shorten training time, reduce the risk of overfitting, and avoid the curse of dimensionality, select and extract features, constructing new features to replace the original ones when necessary, so as to reduce the feature dimensionality;
S11: To retain important information while reducing the dimensionality of the data as far as possible, feature extraction uses linear methods (principal component analysis (PCA) and linear discriminant analysis (LDA)) and nonlinear methods (locally linear embedding (LLE), Laplacian eigenmaps (LE), stochastic neighbor embedding (SNE), and t-distributed stochastic neighbor embedding (t-SNE)). The algorithm must be chosen according to the actual data.
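The scaling and encoding operations described in steps S7 and S9 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation; the function names and the sample treatment categories are assumptions made for the example.

```python
from statistics import mean, pstdev


def min_max_normalize(xs):
    """Compress values to [0, 1] using the feature's minimum and maximum."""
    lo, hi = min(xs), max(xs)
    if hi == lo:  # constant feature: no spread to rescale
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(xs), pstdev(xs)
    if sigma == 0:
        return [0.0 for _ in xs]
    return [(x - mu) / sigma for x in xs]


def one_hot(labels):
    """Return the sorted category list and one 0/1 indicator row per label."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


# Hypothetical follow-up values, for illustration only.
print(min_max_normalize([5.0, 7.5, 10.0]))
print(one_hot(["oral", "insulin", "diet", "oral"]))
```

Both scalers are linear maps, so they change location and scale but leave the shape of the distribution unchanged, which matches the requirement in S7 that the distribution state of the feature be kept.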
The invention has the following beneficial effects:
1. In the invention, labeled data for categories with a small data volume is artificially synthesized by the SMOTE method to balance the amount of labeled data across categories, thereby minimizing the impact of class imbalance on model construction;
2. In the invention, feature transformation and extraction are applied to the processed data, so that the main features of the data are retained, the number of features is reduced, and new features can be constructed to replace the original ones when necessary. This reduces the dimensionality of the data, simplifies the data model, improves the interpretability of the model, shortens the time required for model training, reduces the risk of overfitting, and avoids the curse of dimensionality.
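The idea behind SMOTE-style synthesis can be sketched as follows: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbors. This is a simplified illustration of the technique, not the patent's implementation and not a substitute for a tested library; the function name, the parameters `k` and `seed`, and the sample points are assumptions.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Synthesize n_new points by interpolating between minority samples
    and one of their k nearest minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: _sq_dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority = [[1.0, 1.0], [2.0, 2.0], [3.0, 1.0], [2.0, 0.0]]
print(len(smote_like_oversample(minority, n_new=6)))
```

Because every synthetic point is an interpolation between two real minority samples, no majority-class labels have to be discarded, which is the advantage over undersampling that the text describes.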
Detailed Description
The technical solution of the present invention will be clearly and completely described with reference to the following examples.
Example one
The invention provides a feature transformation and extraction system based on public health data acquisition, comprising a data preparation stage, a feature engineering stage, and a model evaluation stage, wherein the feature engineering stage comprises the following steps:
S1: Preprocess the data that has already been classified within the diabetes follow-up scenario data;
S2: First identify features that have a large number of missing values in part of the data, and remove those features, provided that removing them does not affect the reliability judgment result;
S3: Because some machine learning algorithms are sensitive to missing values in the input data, handle the missing values in a way suited to the machine learning algorithm to be used;
S4: Standardize the data formats: convert numeric types (integer, floating point, and number of decimal places retained), adjust the units of features, and unify date and time formats;
S5: Based on the amount of labeled data in each category, decide whether the categories need to be balanced. If the amount of labeled data is extremely imbalanced across categories, subsequent model training will be affected. In that case, use the SMOTE data synthesis method to artificially synthesize data for the under-represented categories, increasing their data volume until the categories are balanced. This avoids undersampling the categories that have more data and so reduces the number of labels discarded;
S6: Divide the labeled data into a training set, a validation set, and a test set. The training set is used to train the model, the validation set to validate the model and tune hyperparameters, and the test set to test, finalize, and evaluate the performance of the final model;
S7: Make continuous features dimensionless to eliminate the influence of feature units and bring the features to a common scale. This prevents some features in certain models from receiving weights of very different magnitude from other features, and also improves the efficiency of machine learning. Depending on whether a feature follows a normal distribution, it can be scaled and shifted either by normalization (using the maximum and minimum values of the feature to compress its range to the interval [0, 1]) or by standardization (transforming the feature to have a mean of 0 and a standard deviation of 1), while preserving the shape of its distribution;
S8: Reduce information redundancy. For quantitative data where only the qualitative result matters (for example, whether an examination was passed), apply discretization (dividing a continuous feature into several discrete categories according to a judgment standard) or binarization (converting a continuous feature into a single discrete feature with two states according to a judgment standard). In addition, depending on the machine learning model to be used, applying a function transformation to some features can also improve model training;
S9: Convert discrete features to numeric form by encoding the specific category names as codes. If a feature has more than two categories, generate dummy features for the categories and mark them with one-hot encoding, so that the magnitude of the codes (e.g., 1, 2) is not itself learned as a feature, and so that distances between features are easier to compute during machine learning. For timestamp features, some information (such as the year or milliseconds, when these have no influence on the data reliability result) can be discarded as appropriate to simplify model construction;
S10: After feature transformation the data is suitable for machine learning, but the number of features (the dimensionality) may be large. To simplify the final model and improve its interpretability, and to shorten training time, reduce the risk of overfitting, and avoid the curse of dimensionality, select and extract features, constructing new features to replace the original ones when necessary, so as to reduce the feature dimensionality;
S11: To retain important information while reducing the dimensionality of the data as far as possible, feature extraction uses linear methods (principal component analysis (PCA) and linear discriminant analysis (LDA)) and nonlinear methods (locally linear embedding (LLE), Laplacian eigenmaps (LE), stochastic neighbor embedding (SNE), and t-distributed stochastic neighbor embedding (t-SNE)). The algorithm must be chosen according to the actual data.
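Two of the simpler steps in this embodiment, removing features with many missing values (S2) and dividing the labeled data into three sets (S6), can be sketched as follows. The patent does not specify a missing-value threshold or split proportions, so the 50% threshold, the 70/15/15 split, the function names, and the field names `hba1c` and `note` are all assumptions made for illustration.

```python
import random


def drop_sparse_features(rows, max_missing_ratio=0.5):
    """Drop features whose share of missing values (None) exceeds the threshold.

    rows is a list of dicts; returns (kept feature names, filtered rows).
    """
    if not rows:
        raise ValueError("rows must not be empty")
    features = sorted({name for row in rows for name in row})
    kept = []
    for name in features:
        missing = sum(1 for row in rows if row.get(name) is None)
        if missing / len(rows) <= max_missing_ratio:
            kept.append(name)
    return kept, [{name: row.get(name) for name in kept} for row in rows]


def split_dataset(items, train=0.7, val=0.15, seed=0):
    """Shuffle and split into training, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Hypothetical follow-up records; None marks a missing value.
records = [
    {"hba1c": 7.1, "note": None},
    {"hba1c": None, "note": None},
    {"hba1c": 6.5, "note": None},
    {"hba1c": 8.0, "note": "x"},
]
kept, cleaned = drop_sparse_features(records)
print(kept)
```

In practice the threshold would be chosen together with the check required by S2 that removing the feature does not change the reliability judgment, and a stratified split may be preferable when the categories are imbalanced.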
The above is only a preferred embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the invention.

Claims (1)

1. A feature transformation and extraction system based on public health data acquisition, characterized by comprising the following stages: a data preparation stage, a feature engineering stage, and a model evaluation stage, wherein the feature engineering stage comprises the following steps:
S1: Preprocess the data that has already been classified within the diabetes follow-up scenario data;
S2: First identify features that have a large number of missing values in part of the data, and remove those features, provided that removing them does not affect the reliability judgment result;
S3: Because some machine learning algorithms are sensitive to missing values in the input data, handle the missing values in a way suited to the machine learning algorithm to be used;
S4: Standardize the data formats: convert numeric types (integer, floating point, and number of decimal places retained), adjust the units of features, and unify date and time formats;
S5: Based on the amount of labeled data in each category, decide whether the categories need to be balanced. If the amount of labeled data is extremely imbalanced across categories, subsequent model training will be affected. In that case, use the SMOTE data synthesis method to artificially synthesize data for the under-represented categories, increasing their data volume until the categories are balanced. This avoids undersampling the categories that have more data and so reduces the number of labels discarded;
S6: Divide the labeled data into a training set, a validation set, and a test set. The training set is used to train the model, the validation set to validate the model and tune hyperparameters, and the test set to test, finalize, and evaluate the performance of the final model;
S7: Make continuous features dimensionless to eliminate the influence of feature units and bring the features to a common scale. This prevents some features in certain models from receiving weights of very different magnitude from other features, and also improves the efficiency of machine learning. Depending on whether a feature follows a normal distribution, it can be scaled and shifted either by normalization (using the maximum and minimum values of the feature to compress its range to the interval [0, 1]) or by standardization (transforming the feature to have a mean of 0 and a standard deviation of 1), while preserving the shape of its distribution;
S8: Reduce information redundancy. For quantitative data where only the qualitative result matters (for example, whether an examination was passed), apply discretization (dividing a continuous feature into several discrete categories according to a judgment standard) or binarization (converting a continuous feature into a single discrete feature with two states according to a judgment standard). In addition, depending on the machine learning model to be used, applying a function transformation to some features can also improve model training;
S9: Convert discrete features to numeric form by encoding the specific category names as codes. If a feature has more than two categories, generate dummy features for the categories and mark them with one-hot encoding, so that the magnitude of the codes (e.g., 1, 2) is not itself learned as a feature, and so that distances between features are easier to compute during machine learning. For timestamp features, some information (such as the year or milliseconds, when these have no influence on the data reliability result) can be discarded as appropriate to simplify model construction;
S10: After feature transformation the data is suitable for machine learning, but the number of features (the dimensionality) may be large. To simplify the final model and improve its interpretability, and to shorten training time, reduce the risk of overfitting, and avoid the curse of dimensionality, select and extract features, constructing new features to replace the original ones when necessary, so as to reduce the feature dimensionality;
S11: To retain important information while reducing the dimensionality of the data as far as possible, feature extraction uses linear methods (principal component analysis (PCA) and linear discriminant analysis (LDA)) and nonlinear methods (locally linear embedding (LLE), Laplacian eigenmaps (LE), stochastic neighbor embedding (SNE), and t-distributed stochastic neighbor embedding (t-SNE)). The algorithm must be chosen according to the actual data.
CN202211165935.6A 2022-09-23 2022-09-23 Characteristic transformation and extraction system based on public health data acquisition Pending CN115510970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211165935.6A CN115510970A (en) 2022-09-23 2022-09-23 Characteristic transformation and extraction system based on public health data acquisition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211165935.6A CN115510970A (en) 2022-09-23 2022-09-23 Characteristic transformation and extraction system based on public health data acquisition

Publications (1)

Publication Number Publication Date
CN115510970A (en) 2022-12-23

Family

ID=84506917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211165935.6A Pending CN115510970A (en) 2022-09-23 2022-09-23 Characteristic transformation and extraction system based on public health data acquisition

Country Status (1)

Country Link
CN (1) CN115510970A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116627946A (en) * 2023-06-01 2023-08-22 中山市人民医院 Method and system for establishing diabetic foot data model
CN116627946B (en) * 2023-06-01 2024-02-06 中山市人民医院 Method and system for establishing diabetic foot data model

Similar Documents

Publication Publication Date Title
US7437266B2 (en) Time-series data analyzing apparatus
CN110141219A (en) Myocardial infarction automatic testing method based on lead fusion deep neural network
CN110141220A (en) Myocardial infarction automatic testing method based on multi-modal fusion neural network
CN112700325A (en) Method for predicting online credit return customers based on Stacking ensemble learning
CN112435756B (en) Intestinal flora associated disease risk prediction system based on multi-dataset difference interaction
CN109165153A (en) A kind of performance test methods of high emulation securities business transaction class system
CN114999629A (en) AD early prediction method, system and device based on multi-feature fusion
CN115510970A (en) Characteristic transformation and extraction system based on public health data acquisition
CN115185936A (en) Medical clinical data quality analysis system based on big data
CN107480419A (en) Fetal Birth Defect Intelligence Diagnosis system
US20230386665A1 (en) Method and device for constructing autism spectrum disorder (asd) risk prediction model
CN116350203B (en) Physical testing data processing method and system
CN111261298B (en) Medical data quality prejudging method and device, readable medium and electronic equipment
CN115545790B (en) Price data prediction method, price data prediction device, electronic equipment and storage medium
CN116504392A (en) Intelligent auxiliary diagnosis prompt system based on data analysis
CN116662186A (en) Log playback assertion method and device based on logistic regression and electronic equipment
CN113066549B (en) Clinical effectiveness evaluation method and system of medical instrument based on artificial intelligence
CN109886288A (en) A kind of method for evaluating state and device for power transformer
CN114139408A (en) Power transformer health state assessment method
CN114550865A (en) Multidimensional data analysis method and device influencing student physical measurement
CN115511683A (en) Public health data acquisition and processing system
CN113822564A (en) Flight plan minimum sample size confirmation method and device for airspace simulation analysis
CN116864062B (en) Health physical examination report data analysis management system based on Internet
CN114927186A (en) Method and device for automatically generating crowd health problem diagnosis and report, electronic equipment and storage medium
CN115579128B (en) Multi-model characteristic enhanced disease screening system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination