CN115579127B

CN115579127B - Method, system, device and storage medium for constructing a chronic obstructive pulmonary disease (COPD) prediction model

Info

Publication number: CN115579127B
Application number: CN202211221832.7A
Authority: CN
Inventors: 于永福; 黄伟红; 黄佳; 吴瑞文; 刘冠宇; 李靖; 高武强
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2022-10-08
Filing date: 2022-10-08
Publication date: 2023-11-21
Anticipated expiration: 2042-10-08
Also published as: CN115579127A

Abstract

The invention discloses a method, system, device and storage medium for constructing a chronic obstructive pulmonary disease (COPD) prediction model. The method computes the missing values in bronchodilation test report data and fills them to obtain complete bronchodilation report data; matches the diagnostic labels corresponding to the complete bronchodilation report data; performs sparse feature screening on the complete bronchodilation report data to obtain bronchodilation sparse features; acquires a plurality of preselected base models and tunes the parameters of each to obtain a plurality of tuned base models; takes the bronchodilation sparse features and the diagnostic labels as a first data set and adds Laplace noise to each data item in the first data set; selects a preset number of top-ranked tuned base models; and constructs a COPD prediction model based on the preset number of top-ranked tuned base models. The invention can improve both the robustness of the model and the accuracy of its predictions.

Description

Method, system, device and storage medium for constructing a chronic obstructive pulmonary disease (COPD) prediction model

Technical Field

The present invention relates to the field of computer technology, and in particular to a method, system, device and storage medium for constructing a chronic obstructive pulmonary disease (COPD) prediction model.

Background

At present, the clinical examination of chronic obstructive pulmonary disease (COPD) is based on pulmonary function testing, and the clinical diagnosis of COPD relies mainly on the bronchodilation test report. When FEV1%FVC in the bronchodilation report is below 70%, the clinician must still manually rule out confounding factors that affect the COPD judgment, such as lung damage and pulmonary tuberculosis, before the patient can be diagnosed with COPD. Manual judgment, however, suffers from clinicians' insufficient experience and the uneven distribution of medical resources. If confounding factors such as lung damage and pulmonary tuberculosis cannot be accurately excluded, COPD is easily misdiagnosed, so that treatment of the patient is delayed or the wrong treatment is given.

In the prior art, different machine learning models focus on different characteristics and therefore have different strengths. Because the relationship between pulmonary function features and COPD is very complex, the accuracy of a single base machine learning model cannot meet the requirements.

Disclosure of Invention

The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, the invention provides a method, system, device and storage medium for constructing a chronic obstructive pulmonary disease (COPD) prediction model, which can improve the robustness of the model and the accuracy of model prediction.

In a first aspect, an embodiment of the present invention provides a method for constructing a COPD prediction model, the method including:

constructing a PDF image text recognition model based on the FOTS model;

acquiring a bronchodilation test PDF report, and recognizing the text data in the bronchodilation test PDF report with the PDF image text recognition model to obtain bronchodilation report data;

computing the missing values in the bronchodilation report data from the bronchodilation report data, and filling the missing values to obtain complete bronchodilation report data;

obtaining patient ID information corresponding to the complete bronchodilation report data by a fuzzy matching method, and matching the diagnostic label corresponding to the complete bronchodilation report data according to the patient ID information;

performing sparse feature screening on the complete bronchodilation report data to obtain bronchodilation sparse features;

acquiring a plurality of preselected base models, and tuning the parameters of each base model to obtain a plurality of tuned base models;

taking the bronchodilation sparse features and the diagnostic labels as a first data set, and adding Laplace noise to each data item in the first data set to obtain a second data set;

training each tuned base model on the second data set to obtain the accuracy of each tuned base model, and selecting a preset number of top-ranked tuned base models according to the accuracy of each tuned base model;

and constructing a chronic obstructive pulmonary disease (COPD) prediction model based on the preset number of top-ranked tuned base models.

Compared with the prior art, the first aspect of the invention has the following beneficial effects:

To improve the accuracy of image text recognition, the method constructs a PDF image text recognition model based on the FOTS model, acquires a bronchodilation test PDF report, and recognizes the text data in the report with that model to obtain bronchodilation report data. The missing values in the bronchodilation report data are then computed and filled to obtain complete bronchodilation report data, the patient ID information corresponding to the complete data is obtained by fuzzy matching, and the corresponding diagnostic label is matched according to the patient ID information. To improve the accuracy of model prediction, sparse feature screening is performed on the complete bronchodilation report data to obtain bronchodilation sparse features, and a plurality of preselected base models are acquired and tuned to obtain a plurality of tuned base models. To improve both robustness and prediction accuracy, the bronchodilation sparse features and the diagnostic labels are taken as a first data set, and Laplace noise is added to each data item in the first data set to obtain a second data set; each tuned base model is trained on the second data set to obtain its accuracy, a preset number of top-ranked tuned base models are selected by accuracy, and a chronic obstructive pulmonary disease (COPD) prediction model is constructed from them.
In summary, constructing the PDF image text recognition model improves the accuracy of image text recognition. Sparse feature screening identifies the key informative features in high-dimensional data, and selecting these key features significantly improves learning accuracy and hence prediction accuracy. Adding Laplace noise to the training data improves the robustness of the model, and building the COPD prediction model from the preset number of top-ranked tuned base models improves prediction accuracy. Predicting COPD with this model can assist physicians in diagnosing COPD and thereby reduce misdiagnosis.
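
The noise-injection step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the noise is zero-mean Laplace noise added to the numeric feature values only, with labels left unchanged, and the names `laplace_sample`, `add_laplace_noise` and the `scale` value are hypothetical.

```python
import math
import random


def laplace_sample(rng, scale):
    """Draw one Laplace(0, scale) sample by inverse-CDF sampling."""
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    # Guard against log(0) in the edge case u == -0.5.
    magnitude = -scale * math.log(max(1.0 - 2.0 * abs(u), 1e-300))
    return math.copysign(magnitude, u)


def add_laplace_noise(features, labels, scale=0.1, seed=0):
    """Build the 'second data set' from the 'first data set'.

    Assumption: only feature values are perturbed; labels are copied as-is.
    """
    rng = random.Random(seed)
    noisy = [[v + laplace_sample(rng, scale) for v in row] for row in features]
    return noisy, list(labels)


# Two toy samples with made-up feature values.
first_features = [[70.0, 1.65], [55.0, 1.72]]
first_labels = [0, 1]
second_features, second_labels = add_laplace_noise(first_features, first_labels)
```

The noise scale governs how strongly the training data are perturbed; the source does not state a value, so it would have to be chosen empirically.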

According to some embodiments of the invention, constructing the PDF image text recognition model based on the FOTS model includes:

taking the FOTS model as the basic framework, wherein the shared convolution module of the FOTS model adopts a RepLKNet network;

and constructing the PDF image text recognition model from the FOTS model that adopts the RepLKNet network.

According to some embodiments of the invention, computing the missing values in the bronchodilation report data from the bronchodilation report data includes:

traversing the bronchodilation report data in a loop and processing it with regular expressions to obtain the data of each row in the bronchodilation report;

and counting the number of spaces between every two adjacent data items in each row, and if the number of spaces is greater than a preset value, considering that a missing value exists between the two adjacent data items, thereby obtaining the missing values in the bronchodilation report data.

According to some embodiments of the invention, filling the missing values in the bronchodilation report data to obtain complete bronchodilation report data includes:

filling the missing values in the bronchodilation report data by a two-stage missing-value filling method, wherein:

in the first stage, the formula 90.6043-0.0414×heigh×100 is used to fill the missing predicted value Pred corresponding to the measured value FEV1%FVC in the bronchodilation report data, where heigh denotes the patient's height;

after the missing predicted values Pred are filled, the remaining unfilled missing values are traversed, and if the two columns associated with an unfilled missing value are both non-empty, a result is computed from them by division;

the computed result is written to the position of the corresponding unfilled missing value;

and in the second stage, the missing values remaining after the first stage are filled by the MissForest method to obtain complete bronchodilation report data.

According to some embodiments of the invention, performing sparse feature screening on the complete bronchodilation report data to obtain bronchodilation sparse features includes:

performing individual sparse feature selection on the complete bronchodilation report data to obtain individual sparse features;

performing group sparse feature selection on the complete bronchodilation report data to obtain group sparse features;

counting the number of occurrences of each feature across the individual sparse features and the group sparse features;

and presetting a screening threshold, and keeping the features whose number of occurrences is greater than or equal to the screening threshold, thereby obtaining the bronchodilation sparse features.

According to some embodiments of the present invention, training each tuned base model on the second data set to obtain its accuracy, and selecting a preset number of top-ranked tuned base models according to accuracy, includes:

dividing the second data set into a training set and a test set, and training each tuned base model with five-fold cross-validation to obtain the accuracy of each tuned base model;

presetting an accuracy threshold, comparing the accuracy of each tuned base model with the accuracy threshold, and adding the tuned base model to a model screening list if its accuracy is greater than the accuracy threshold;

and selecting the preset number of top-ranked tuned base models from the model screening list.
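
The selection logic above (threshold filter followed by top-ranked selection) can be sketched as below. The sketch assumes the five per-fold accuracies have already been computed for each tuned base model and that the model's accuracy is their mean; the function name, model names and numeric values are hypothetical.

```python
from statistics import mean


def select_top_models(fold_accuracies, accuracy_threshold, preset_number):
    """Return the top `preset_number` (name, accuracy) pairs above the threshold.

    fold_accuracies maps a model name to its per-fold accuracies
    (five values for five-fold cross-validation).
    """
    mean_accuracy = {name: mean(accs) for name, accs in fold_accuracies.items()}
    # Model screening list: only models strictly above the threshold.
    screening_list = [
        (name, acc) for name, acc in mean_accuracy.items() if acc > accuracy_threshold
    ]
    # Rank by accuracy, best first, and keep the preset number of models.
    screening_list.sort(key=lambda item: item[1], reverse=True)
    return screening_list[:preset_number]


folds = {
    "model_a": [0.90, 0.92, 0.91, 0.89, 0.93],
    "model_b": [0.80, 0.80, 0.80, 0.80, 0.80],
    "model_c": [0.95, 0.94, 0.96, 0.95, 0.95],
    "model_d": [0.86, 0.88, 0.87, 0.85, 0.89],
}
selected = select_top_models(folds, accuracy_threshold=0.85, preset_number=2)
```

Here `model_b` is dropped by the threshold, `model_d` passes the threshold but is ranked third, and the two best models are kept.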

According to some embodiments of the invention, constructing the COPD prediction model based on the preset number of top-ranked tuned base models includes:

constructing a weight allocation method based on a greedy strategy, wherein the weight allocation method assigns a different weight to each of the selected preset number of tuned base models;

and constructing the COPD prediction model from the preset number of tuned base models with their assigned weights.

In a second aspect, an embodiment of the present invention further provides a system for constructing a chronic obstructive pulmonary disease (COPD) prediction model, the system including:

a recognition model construction unit, configured to construct a PDF image text recognition model based on the FOTS model;

a data acquisition unit, configured to acquire a bronchodilation test PDF report and recognize the text data in the bronchodilation test PDF report with the PDF image text recognition model to obtain bronchodilation report data;

a missing value filling unit, configured to compute the missing values in the bronchodilation report data from the bronchodilation report data, and fill the missing values to obtain complete bronchodilation report data;

a label matching unit, configured to obtain patient ID information corresponding to the complete bronchodilation report data by a fuzzy matching method, and match the diagnostic label corresponding to the complete bronchodilation report data according to the patient ID information;

a feature acquisition unit, configured to perform sparse feature screening on the complete bronchodilation report data to obtain bronchodilation sparse features;

a parameter tuning unit, configured to acquire a plurality of preselected base models and tune the parameters of each base model to obtain a plurality of tuned base models;

a data set acquisition unit, configured to take the bronchodilation sparse features and the diagnostic labels as a first data set, and add Laplace noise to each data item in the first data set to obtain a second data set;

a model selection unit, configured to train each tuned base model on the second data set to obtain the accuracy of each tuned base model, and select a preset number of top-ranked tuned base models according to the accuracy of each tuned base model;

and a prediction model construction unit, configured to construct a COPD prediction model based on the preset number of top-ranked tuned base models.

In a third aspect, an embodiment of the present invention further provides a device for constructing a chronic obstructive pulmonary disease (COPD) prediction model, including at least one control processor and a memory communicatively coupled to the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the method for constructing a COPD prediction model described above.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method for constructing a COPD prediction model described above.

It is to be understood that the advantages of the second to fourth aspects compared with the related art are the same as those of the first aspect compared with the related art, and reference may be made to the related description in the first aspect, which is not repeated herein.

Drawings

The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:

FIG. 1 is a flow chart of a method for constructing a chronic obstructive pulmonary disease (COPD) prediction model according to an embodiment of the present invention;

FIG. 2 is a block diagram of a system for constructing a COPD prediction model according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.

In the description of the present invention, the description of first, second, etc. is for the purpose of distinguishing between technical features only and should not be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated or implicitly indicating the precedence of the technical features indicated.

In the description of the present invention, it should be understood that the direction or positional relationship indicated with respect to the description of the orientation, such as up, down, etc., is based on the direction or positional relationship shown in the drawings, is merely for convenience of describing the present invention and simplifying the description, and does not indicate or imply that the apparatus or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.

In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be determined reasonably by a person skilled in the art in combination with the specific content of the technical solution.

To solve the above problems, and to improve the accuracy of image text recognition, the invention constructs a PDF image text recognition model based on the FOTS model, acquires a bronchodilation test PDF report, and recognizes the text data in the report with that model to obtain bronchodilation report data. The missing values in the bronchodilation report data are computed and filled to obtain complete bronchodilation report data; the patient ID information corresponding to the complete data is obtained by fuzzy matching, and the corresponding diagnostic label is matched according to the patient ID information. To improve the accuracy of model prediction, sparse feature screening is performed on the complete bronchodilation report data to obtain bronchodilation sparse features, and a plurality of preselected base models are acquired and tuned to obtain a plurality of tuned base models. To improve both robustness and prediction accuracy, the bronchodilation sparse features and the diagnostic labels are taken as a first data set, and Laplace noise is added to each data item in the first data set to obtain a second data set; each tuned base model is trained on the second data set to obtain its accuracy, a preset number of top-ranked tuned base models are selected by accuracy, and a chronic obstructive pulmonary disease (COPD) prediction model is constructed from them.
Accordingly, constructing the PDF image text recognition model improves the accuracy of image text recognition. Sparse feature screening identifies the key informative features in high-dimensional data, and selecting these key features significantly improves learning accuracy and hence prediction accuracy. Adding Laplace noise to the training data improves the robustness of the model, and building the COPD prediction model from the preset number of top-ranked tuned base models improves prediction accuracy. Predicting COPD with this model can assist physicians in diagnosing COPD and thereby reduce misdiagnosis.

Referring to FIG. 1, an embodiment of the present invention provides a method for constructing a chronic obstructive pulmonary disease (COPD) prediction model, the method including:

Step S100: constructing a PDF image text recognition model based on the FOTS model.

Specifically, the FOTS model is used as the basic framework, and the shared convolution module of the FOTS model adopts a RepLKNet network; the PDF image text recognition model is constructed from the FOTS model that adopts the RepLKNet network. Wherein:

The FOTS (Fast Oriented Text Spotting with a Unified Network) model is a framework that integrates text detection and recognition, and is characterized by a small model size, high speed and high accuracy. Its overall structure consists of four parts: shared convolutions, a text detection branch, the RoIRotate operation and a text recognition branch. In the FOTS model, the backbone of the shared convolution module is a ResNet-50 network; in this embodiment, ResNet-50 is replaced by a RepLKNet network. Compared with ResNet-50, RepLKNet uses a small number of large convolution kernels in place of a large number of small convolution kernels to cover a long-range spatial extent, i.e. a larger receptive field. Related experiments show that, compared with small kernels, a small number of large kernels effectively enlarges the receptive field. This suits the densely packed indicators of bronchodilation test PDF reports, since the larger receptive field can cover more complete indicators.

It should be noted that the RepLKNet network in this embodiment is prior art, and reference may be made to the literature in "https:// arxiv.

In this embodiment, the PDF image text recognition model constructed with the RepLKNet network has a better splitting effect on bronchodilation test PDF reports.

Step S200: acquiring a bronchodilation test PDF report, and recognizing the text data in the bronchodilation test PDF report with the PDF image text recognition model to obtain bronchodilation report data.

Specifically, a bronchodilation test PDF report is acquired, and its text data are recognized with the PDF image text recognition model constructed in step S100 to obtain bronchodilation report data.

Step S300: computing the missing values in the bronchodilation report data from the bronchodilation report data, and filling the missing values to obtain complete bronchodilation report data.

Specifically, the bronchodilation report data are traversed in a loop and processed with regular expressions to obtain the data of each row in the bronchodilation report; the number of spaces between every two adjacent data items in each row is then counted, and if it is greater than a preset value, a missing value is considered to exist between the two items, which yields the missing values in the bronchodilation report data. For example:

A regular expression is a rule defined as a character string; it is matched against the text to find the conforming substrings. Six functions commonly used in the re library are match, compile, sub, split, search and findall. This embodiment uses python re.search and python re.findall to obtain the data of each row in the bronchodilation report, as follows:

Because the first column of a bronchodilation report may or may not be empty, the starting index of the reported values in each row must be determined. The starting index is first initialized to 39 (inspection of a large number of bronchodilation reports shows that in most pulmonary function reports the reported values begin at the 39th character), but this value is not exact: 39 is taken as the starting index when the first value of the row is empty, but when all numbers of the row are found using the python re.

After the starting position of each row is determined, the ending position of each row is preset to 200 (based on inspection of a large number of bronchodilation reports). The numbers nums are then obtained with python re.findall and mapped to the columns of the bronchodilation report, yielding the data of each row in the report.

After the data of each row are obtained, the number of spaces between every two adjacent data items in the row is counted. If the number of spaces is greater than the preset value, a missing value is considered to exist between the two items and a blank value is stored between them. Traversing every row of the bronchodilation report in this way yields the missing values in the bronchodilation report data.

It should be noted that the preset value in this embodiment may be changed according to actual needs; this embodiment imposes no specific limitation.
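
The gap-based missing-value detection described above can be sketched as follows. This is an illustrative sketch only: it uses `re.finditer` (a sibling of the `re.findall` call named in the text) so that the position of each number is available, it inserts one blank value per oversized gap, and the number pattern, the `max_gap` value and the function name `parse_row` are assumptions.

```python
import re

# Assumed pattern for a reported value: optional sign, digits, optional decimals.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_row(line, start_index=39, end_index=200, max_gap=8):
    """Extract the values of one report row, marking missing values as None.

    start_index and end_index follow the presets in the text (39 and 200);
    max_gap is the hypothetical 'preset value' for the number of spaces.
    """
    segment = line[start_index:end_index]
    values = []
    previous_end = None
    for match in NUMBER.finditer(segment):
        if previous_end is not None:
            gap = segment[previous_end:match.start()]
            # Too many spaces between neighbours: a value is missing here.
            if gap.count(" ") > max_gap:
                values.append(None)
        values.append(float(match.group()))
        previous_end = match.end()
    return values


# A made-up row: label padded to column 39, then three values with one wide gap.
row = "FEV1".ljust(39) + "2.10" + " " * 4 + "1.50" + " " * 14 + "71.4"
parsed = parse_row(row)
```

A real report would likely need one threshold per column layout, since a gap spanning several empty columns hides more than one missing value.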

The missing values in the bronchodilation report data are then filled by a two-stage missing-value filling method:

in the first stage, the formula 90.6043-0.0414×heigh×100 is used to fill the missing predicted value Pred corresponding to the measured value FEV1%FVC in the bronchodilation report data, where heigh denotes the patient's height; after the missing predicted values Pred are filled, the remaining unfilled missing values are traversed, and if the two columns associated with an unfilled missing value are both non-empty, the value is computed from them by division and written to the corresponding position;

in the second stage, the missing values that remain after the first stage are filled by the MissForest method, yielding complete bronchodilation report data.

It should be noted that this embodiment uses division for the calculation, but another suitable operation may be adopted according to actual needs, and this embodiment imposes no specific limitation. The MissForest method is prior art and is not described in detail here.
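
The first filling stage can be sketched as follows. Several points are assumptions made for illustration and are not stated in the source: the height is taken to be in metres, the row is assumed to have the columns `Pred`, `Act` (measured value) and `%Pred`, and the division is assumed to be `%Pred = Act / Pred × 100`. The second stage (MissForest) is not shown.

```python
def first_stage_fill(row, heigh):
    """Fill the FEV1%FVC row of one report; None marks a missing value.

    row is a dict with the hypothetical keys "Pred", "Act" and "%Pred".
    heigh is the patient's height (assumed to be in metres).
    """
    filled = dict(row)
    # Step 1: fill the missing predicted value with the formula from the text.
    if filled.get("Pred") is None:
        filled["Pred"] = 90.6043 - 0.0414 * heigh * 100
    # Step 2: if both associated columns are present, derive the third by division.
    if (
        filled.get("%Pred") is None
        and filled.get("Act") is not None
        and filled.get("Pred")
    ):
        filled["%Pred"] = filled["Act"] / filled["Pred"] * 100
    return filled


example = first_stage_fill({"Pred": None, "Act": 70.0, "%Pred": None}, heigh=1.70)
```

Any value still equal to `None` after this stage would be left for the second-stage imputation.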

Step S400: obtaining patient ID information corresponding to the complete bronchodilation report data by a fuzzy matching method, and matching the diagnostic label corresponding to the complete bronchodilation report data according to the patient ID information.

Specifically, the patient ID information corresponding to the complete bronchodilation report data is obtained by a fuzzy matching method, and the diagnostic label corresponding to the complete bronchodilation report data is matched according to the patient ID information. The specific process is as follows:

Because the patient ID is missing from some bronchodilation reports, the corresponding diagnosis information in the hospital HIS system cannot be retrieved when the chronic obstructive pulmonary disease (COPD) auxiliary-diagnosis training set is constructed. In this embodiment, the patient ID information corresponding to the complete bronchodilation report data is therefore obtained by fuzzy matching, specifically:

The hospital's payment records for the patient's pulmonary function report are fuzzy-matched against the information extracted from the complete bronchodilation report to obtain the patient ID of the report. First, the payment information for the patient's bronchodilation report is extracted from the hospital records; it comprises the name, year and month of birth, sex, report name and payment time. The fuzzy matching method relaxes some of the matching items: the birth year may differ by one year in either direction, and the birth month is not required to match; likewise, the payment time and the examination time on the report may differ by up to one week. All other items must correspond exactly. When these requirements are met, the match is considered successful and the patient ID of the complete bronchodilation report is obtained.
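
The relaxed matching rules above can be sketched as follows. The sketch reads "one week" as "at most seven days apart" and returns the first payment record that satisfies all rules; the `Record` fields, function names and sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Record:
    name: str
    birth_year: int
    sex: str
    report_name: str
    when: date            # examination time (report) or payment time (payment)
    patient_id: str = ""  # known only for payment records


def is_match(report, payment):
    """Exact match on name, sex and report name; relaxed on birth year and time."""
    return (
        report.name == payment.name
        and report.sex == payment.sex
        and report.report_name == payment.report_name
        and abs(report.birth_year - payment.birth_year) <= 1
        and abs(report.when - payment.when) <= timedelta(days=7)
    )


def find_patient_id(report, payments):
    """Return the patient ID of the first matching payment record, else None."""
    for payment in payments:
        if is_match(report, payment):
            return payment.patient_id
    return None


report = Record("Patient A", 1960, "M", "bronchodilation test", date(2022, 3, 10))
payments = [
    Record("Patient A", 1958, "M", "bronchodilation test", date(2022, 3, 9), "P001"),
    Record("Patient A", 1961, "M", "bronchodilation test", date(2022, 3, 5), "P002"),
]
matched_id = find_patient_id(report, payments)
```

The first payment record is rejected because the birth year differs by two years; the second satisfies every rule.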

Using the patient ID information obtained in the previous step, the diagnosis information corresponding to the complete bronchodilation report is matched in the hospital HIS system. The rule adopted in this embodiment is as follows:

if no COPD-related diagnosis is matched in the hospital HIS system within two weeks before or after the examination time of the bronchodilation report, and FEV1%FVC > 70% is satisfied, the diagnostic label corresponding to the bronchodilation report is a normal patient;

if a COPD-related diagnosis is matched in the hospital HIS system within two weeks before or after the examination time of the bronchodilation report, and FEV1%FVC < 70% is satisfied, the diagnostic label corresponding to the bronchodilation report is a COPD patient.
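
The two labeling rules can be sketched as a small function. Reports that satisfy neither rule are returned as `None` here, which is an assumption, since the text does not say how such reports are handled; the function name and label strings are hypothetical.

```python
from datetime import date, timedelta


def diagnosis_label(exam_date, fev1_fvc_percent, copd_diagnosis_dates):
    """Label one report from its FEV1%FVC value and HIS diagnosis dates."""
    window = timedelta(days=14)  # two weeks before or after the examination
    has_diagnosis = any(abs(d - exam_date) <= window for d in copd_diagnosis_dates)
    if not has_diagnosis and fev1_fvc_percent > 70:
        return "normal"
    if has_diagnosis and fev1_fvc_percent < 70:
        return "COPD"
    return None  # neither rule applies (assumed to be left unlabeled)


label_a = diagnosis_label(date(2022, 3, 10), 65.0, [date(2022, 3, 20)])
label_b = diagnosis_label(date(2022, 3, 10), 82.0, [])
```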

Step S500: performing sparse feature screening on the complete bronchodilation report data to obtain bronchodilation sparse features.

Specifically, individual sparse feature selection is performed on the complete bronchodilation report data to obtain individual sparse features; group sparse feature selection is performed on the complete bronchodilation report data to obtain group sparse features; the number of occurrences of each feature is counted across the individual sparse features and the group sparse features; and a screening threshold is preset, and the features whose number of occurrences is greater than or equal to the screening threshold are kept as the bronchodilation sparse features. The specific process is as follows:

The basic feature screening method is as follows. Given a data set $(X, y) = \{(x_i, y_i) \mid i = 1, \ldots, n\}$, where $x_i = (x_{i1}, \ldots, x_{ip})^T \in \mathbb{R}^p$ is the input vector, X denotes the feature data of the complete bronchodilation reports, and y denotes the diagnostic labels matched to the complete bronchodilation report data according to the patient ID information, each label being either COPD patient or normal patient; n and p denote the number of bronchodilation report samples and the number of features per sample, respectively. Feature selection with a sparse learning model is an optimization problem that minimizes the empirical error penalized by a regularization term, for example:

$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \; L(\beta; X, y) + \lambda R(\beta)$$

where $L(\beta; X, y)$ denotes the loss term, $R(\beta)$ denotes the regularization term, and $\beta \in \mathbb{R}^p$ is the coefficient vector. The sparse learning model selects features based on the estimated coefficient vector $\hat{\beta}$, i.e. it selects the features whose estimated coefficients in $\hat{\beta}$ are non-zero; the number of non-zero estimated coefficients in $\hat{\beta}$ is the number of selected features. The regularization parameter $\lambda$ controls the trade-off between the loss and the penalty, and some sparse learning models employ several regularization parameters to balance multiple penalty terms.

At present, feature screening methods fall into two main types. One type is individual sparse feature selection, which screens sparse features without considering the correlation among features and considers only the importance of a single lung function feature. The other type is group sparse feature selection, which takes the interaction among features into account. Individual sparse feature selection methods are further divided into linear and nonlinear methods. Group sparse feature selection methods are further divided into automatic grouping and structural grouping methods.

In this embodiment, the individual sparse feature linear screening method Lasso, the individual sparse feature nonlinear screening method HSIC Lasso, the group sparse feature automatic grouping screening method Elastic Net, and the group sparse feature structural grouping screening method Group Lasso are selected as the backbone of the feature screening model. The complete bronchorelaxation report data are input into Lasso, HSIC Lasso, Elastic Net, and Group Lasso, and a hash table is then used to count how many of the four methods select each feature. A screening threshold is preset in the feature screening model of this embodiment, Threshold = 3, and a feature whose count in the hash table is greater than or equal to Threshold is included in a feature selection list. A feature selection number C may then be set in the feature screening model, and the first C features in the feature selection list are taken as the final bronchodilation sparse features.
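For reference, the penalty terms usually associated with three of the four named methods can be written out as follows. This is a sketch under stated assumptions: the squared-error loss, the group index g, the group sizes p_g and the symbols λ_1, λ_2 are not given in this description and are introduced here only for illustration.

```latex
% Assumed general form with a squared-error loss (the loss is not specified in the text).
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p}
  \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \text{penalty}(\beta)

% Lasso: one regularization parameter, L1 penalty.
\text{penalty}(\beta) = \lambda \lVert \beta \rVert_1 = \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

% Elastic Net: two regularization parameters, L1 plus squared L2 penalty.
\text{penalty}(\beta) = \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2

% Group Lasso: features partitioned into G predefined groups of sizes p_g.
\text{penalty}(\beta) = \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \lVert \beta_{(g)} \rVert_2
```

In each case the features kept are those whose estimated coefficients are non-zero; Group Lasso tends to keep or drop all coefficients of a group together.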

It should be noted that, in this embodiment, the screening Threshold and the number of feature choices C may be changed according to actual needs, and this embodiment is not limited specifically.
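The hash-table voting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `vote_features`, the feature names, and the rule of ranking the feature selection list by vote count are assumptions introduced here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def vote_features(
    selections: Dict[str, Iterable[str]],
    threshold: int = 3,
    top_c: Optional[int] = None,
) -> List[str]:
    """Keep features chosen by at least `threshold` selectors.

    `selections` maps a selector name to the features it kept (non-zero
    coefficients). Ranking by vote count is an assumption of this sketch.
    """
    counts: Counter = Counter()
    for chosen in selections.values():
        # set() guards against a selector listing the same feature twice.
        counts.update(set(chosen))
    kept = [f for f, c in counts.items() if c >= threshold]
    # Most votes first; ties broken alphabetically so the result is deterministic.
    kept.sort(key=lambda f: (-counts[f], f))
    return kept if top_c is None else kept[:top_c]


# Hypothetical selector outputs; the feature names are placeholders.
example = {
    "lasso": ["FEV1", "FVC", "PEF"],
    "hsic_lasso": ["FEV1", "FVC", "MMEF"],
    "elastic_net": ["FEV1", "FVC", "PEF", "MMEF"],
    "group_lasso": ["FEV1", "PEF"],
}
selected = vote_features(example, threshold=3, top_c=2)
```

With Threshold = 3, a feature must be selected by at least three of the four methods, so features that only one kind of sparse model favors are dropped.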

Step S600, obtaining a plurality of preselected basic models, and performing parameter tuning on each basic model to obtain a plurality of tuning basic models.

Specifically, six basic models of logistic regression, decision tree, K-nearest neighbor, support vector machine, random forest and XGBoost are pre-selected, and parameter tuning is performed on each basic model by using a grid search algorithm (GridSearch) to obtain a plurality of tuning basic models. The specific process of parameter tuning by the grid search algorithm comprises the following steps:

setting different parameter value ranges for different parameters in each basic model, traversing all parameter selections by a grid search algorithm, returning a group of parameters with the best model effect, and taking the group of parameters with the best model effect as parameters of the basic model to obtain tuning basic models corresponding to six basic models, namely logistic regression, decision trees, K-nearest neighbor, support vector machines, random forests and XGBoost.

It should be noted that, in this embodiment, the pre-selected plurality of basic models may be changed according to actual needs, and this embodiment is not limited specifically.
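The grid search described in step S600 can be sketched in a few lines. This is an illustrative sketch only: `grid_search`, `toy_score` and the parameter names are hypothetical, and the toy scoring function stands in for training a basic model and measuring its effect. In practice a library routine such as scikit-learn's `GridSearchCV` is commonly used for this purpose.

```python
from itertools import product
from typing import Any, Callable, Dict, Sequence, Tuple


def grid_search(
    param_grid: Dict[str, Sequence[Any]],
    score_fn: Callable[[Dict[str, Any]], float],
) -> Tuple[Dict[str, Any], float]:
    """Try every combination in `param_grid` and return the best one."""
    names = list(param_grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # strict comparison: the first of equal scores is kept
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params: Dict[str, Any]) -> float:
    """Toy stand-in for 'train the model and measure validation accuracy'."""
    return (
        1.0
        - abs(params["max_depth"] - 5) * 0.1
        - abs(params["min_samples_leaf"] - 2) * 0.05
    )


best, score = grid_search(
    {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 2, 4]}, toy_score
)
```

Because every combination is evaluated, the cost grows with the product of the value-range sizes, which is why each parameter's range is kept small.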

Step S700, taking the bronchodilation sparse features and the diagnostic tag as a first data set, and adding Laplace noise to each data item in the first data set to obtain a second data set.

Specifically, the bronchodilation sparse features and the diagnostic tag are used as the first data set, and Laplace noise is added to each data item in the first data set to obtain the second data set.

In this embodiment, adding Laplace noise to each data item in the first data set can improve the robustness of the model.
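A minimal sketch of the noise step follows. Several points are assumptions, since the description does not specify them: the noise is zero-mean, it is applied to feature values only (labels are left unchanged), and the scale parameter `scale` and the sample values are placeholders. The function name `add_laplace_noise` is hypothetical.

```python
import random
from typing import List, Optional, Sequence


def add_laplace_noise(
    rows: Sequence[Sequence[float]],
    scale: float = 0.1,
    seed: Optional[int] = None,
) -> List[List[float]]:
    """Return a copy of `rows` with zero-mean Laplace noise on every value.

    `scale` is the Laplace scale parameter b; 0.1 is a placeholder, the
    description does not give a value.
    """
    rng = random.Random(seed)

    def laplace() -> float:
        # The difference of two independent Exp(1) draws follows Laplace(0, 1);
        # multiplying by `scale` gives Laplace(0, scale).
        return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

    return [[value + laplace() for value in row] for row in rows]


first_features = [[0.62, 2.10], [0.81, 3.40]]  # made-up feature rows
second_features = add_laplace_noise(first_features, scale=0.05, seed=0)
```

Training on the perturbed copy exposes the basic models to small measurement-like variations, which is the robustness effect the embodiment refers to.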

Step S800, training each tuning basic model according to the second data set to obtain the accuracy of each tuning basic model, and selecting a preset number of tuning basic models with top ranking according to the accuracy of each tuning basic model.

Specifically, the second data set is divided into a training set and a testing set, and a five-fold cross validation method is adopted to train each tuning basic model, so that the accuracy of each tuning basic model is obtained; presetting an accuracy threshold, comparing the accuracy of each tuning basic model with the accuracy threshold, and if the accuracy of the tuning basic model is larger than the accuracy threshold, incorporating the tuning basic model into a model screening list; and selecting a preset number of tuning basic models which are ranked at the top in the model screening list. For example:

The six tuning basic models obtained in step S600 are trained with the second data set. An accuracy threshold is then given, and any tuning basic model whose accuracy is greater than the threshold is included in the model screening list. A preset model selection number N is also provided; in this embodiment N = 4, and the first N tuning basic models in the model screening list are selected as the skeleton models for constructing the slow-resistance lung prediction model. For example:

And taking the second data set as the input of each tuning basic model, taking the tuning basic models of the logistic regression, the decision tree, the K-nearest neighbor, the support vector machine, the random forest and the XGBoost as basic frameworks, training by utilizing the second data set, and finding four models with the best effects of the six tuning basic models, wherein the four finally selected tuning basic models in the embodiment are the decision tree, the K-nearest neighbor, the random forest and the XGBoost.
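The screening in step S800 can be sketched as below. This is an illustration under assumptions: `k_fold_indices` and `select_top_models` are hypothetical helper names, the folds are contiguous and unshuffled, and the accuracy values are placeholders rather than results from this embodiment.

```python
from typing import Dict, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k (train, test) index pairs."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional sample.
        stop = start + base + (1 if i < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train, test))
        start = stop
    return folds


def select_top_models(
    accuracies: Dict[str, float], threshold: float, n_models: int = 4
) -> List[str]:
    """Keep models above `threshold`, then return the `n_models` most accurate."""
    screened = [name for name, acc in accuracies.items() if acc > threshold]
    screened.sort(key=lambda name: accuracies[name], reverse=True)
    return screened[:n_models]


# Placeholder mean cross-validation accuracies, NOT results from the patent.
cv_accuracy = {
    "model_a": 0.71, "model_b": 0.84, "model_c": 0.66,
    "model_d": 0.90, "model_e": 0.78, "model_f": 0.88,
}
folds = k_fold_indices(12, k=5)
chosen = select_top_models(cv_accuracy, threshold=0.70, n_models=4)
```

Each tuning basic model would be trained on the `train` indices and scored on the `test` indices of every fold, and the mean of the five scores would serve as its accuracy in `cv_accuracy`.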

Step S900, constructing a slow-resistance lung prediction model based on the top-ranked preset number of tuning basic models.

In the prior art, due to different focused characteristics of different machine learning models, different models have different advantages, and due to the very complex relationship between the lung function characteristics and the slow-resistance lung, the accuracy of a single basic machine learning model cannot meet the requirements. Therefore, the embodiment adopts a weight distribution method to distribute proper weights to the screened multiple machine learning models, fuses the multiple basic machine learning models, and constructs the slow-resistance lung prediction model in the embodiment.

Constructing a weight distribution method according to greedy ideas, and distributing different weights to each tuning basic model in the selected preset number of tuning basic models by the weight distribution method; according to preset quantity of tuning basic models distributed with different weights, a slow resistance lung prediction model is constructed, and the specific construction process is as follows:

According to the greedy idea, a weight distribution method is established, and for the four tuning basic models selected in step S800, namely decision tree (decision_tree), K-nearest neighbor (knn), random forest (random_forest) and XGBoost (xgboost), different weights are distributed by the weight distribution method. The specific distribution process is as follows:

Suppose the initial accuracies, ranked from smallest to largest, are: decision tree < K-nearest neighbor < random forest < XGBoost. The sum of the weights of the decision tree and K-nearest neighbor is initialized to 0.1, denoted L = α + β = 0.1, where α represents the weight of the decision tree and β represents the weight of K-nearest neighbor; correspondingly, the sum of the weights of the random forest and XGBoost is initialized to 0.9, denoted R = γ + λ = 0.9, where γ represents the weight of the random forest and λ represents the weight of XGBoost.

The outer loop is as follows: according to the greedy idea, L varies from 0.1 to 0.5 in steps of 0.1 and, correspondingly, R varies from 0.9 to 0.5 in steps of 0.1, so that L + R = 1 always holds.

The inner loop is as follows: taking L = α + β = 0.1 as an example, the inner loop finds the combination of α and β with the highest accuracy. Likewise, according to the greedy idea, a model with a higher initial accuracy is assigned a larger weight, so α varies from 0.01 to 0.05 in steps of 0.01 and β varies from 0.09 to 0.05 in steps of 0.01, with α + β = 0.1 kept unchanged throughout.

Note that the best combination of γ and λ in R = γ + λ = 0.9 is found in the same manner as for α and β.

After both the outer and inner loops have completed, the final combination of α, β, γ and λ is recorded. In this embodiment, the final result is α = 0.2, β = 0.2, γ = 0.2, λ = 0.4, and the slow-resistance lung prediction model is therefore constructed as: 0.2 × decision_tree + 0.2 × knn + 0.2 × random_forest + 0.4 × xgboost.
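The outer and inner loops can be sketched as a small enumeration. The following points are assumptions of this sketch rather than statements of the description: a step of 0.01 is used in the inner loop for every value of L and R, the (α, β) and (γ, λ) pairs are searched jointly, and the fused prediction is a weighted sum of per-model positive-class probabilities thresholded at 0.5. The names `greedy_weight_search` and `ensemble_accuracy` and the sample data are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Weights = Tuple[float, float, float, float]  # (alpha, beta, gamma, lam)


def greedy_weight_search(evaluate: Callable[[Weights], float]) -> Tuple[Weights, float]:
    """Enumerate the weight combinations and keep the most accurate one.

    Models are assumed sorted by ascending initial accuracy, so alpha <= beta
    and gamma <= lam. Weights are handled in integer hundredths to avoid
    floating-point drift.
    """
    best_weights: Weights = (0.0, 0.0, 0.0, 0.0)
    best_score = float("-inf")
    for low_sum in range(10, 51, 10):  # outer loop: L = 0.1 ... 0.5
        high_sum = 100 - low_sum  # R = 1 - L
        for alpha in range(1, low_sum // 2 + 1):  # inner loop over alpha
            beta = low_sum - alpha
            for gamma in range(1, high_sum // 2 + 1):  # same search for gamma
                lam = high_sum - gamma
                weights = (alpha / 100, beta / 100, gamma / 100, lam / 100)
                score = evaluate(weights)
                if score > best_score:
                    best_weights, best_score = weights, score
    return best_weights, best_score


def ensemble_accuracy(
    weights: Sequence[float],
    probabilities: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> float:
    """Accuracy of the weighted sum of per-model positive-class probabilities."""
    correct = 0
    for per_model, label in zip(probabilities, labels):
        fused = sum(w * p for w, p in zip(weights, per_model))
        correct += int((fused >= 0.5) == bool(label))
    return correct / len(labels)


# Made-up validation outputs: one row per sample, one column per model
# in the order (decision_tree, knn, random_forest, xgboost).
probs = [[0.9, 0.8, 0.7, 0.9], [0.2, 0.4, 0.1, 0.3], [0.6, 0.3, 0.8, 0.7]]
labels = [1, 0, 1]
weights, acc = greedy_weight_search(lambda w: ensemble_accuracy(w, probs, labels))
```

Restricting α ≤ β and γ ≤ λ, and keeping L ≤ R, encodes the greedy rule that a model with higher initial accuracy never receives a smaller weight than a less accurate one, which also keeps the number of evaluated combinations small.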

In this embodiment, in order to improve accuracy of image text recognition, a PDF image text recognition model is constructed based on the FOTS model; acquiring a bronchorelaxation test PDF report, and carrying out text data identification on the bronchorelaxation test PDF report through a PDF image text identification model to acquire bronchorelaxation report data; calculating and obtaining a missing value in the broncho-diastole report data according to the broncho-diastole report data; filling the missing values in the bronchorelaxation report data to obtain complete bronchorelaxation report data; and acquiring patient ID information corresponding to the complete bronchorelaxation report data by a fuzzy matching method, and matching a diagnosis tag corresponding to the complete bronchorelaxation report data according to the patient ID information. In order to improve the accuracy of model prediction, sparse feature screening is carried out on complete bronchorelaxation report data to obtain bronchorelaxation sparse features; and acquiring a plurality of preselected basic models, and performing parameter tuning on each basic model to acquire a plurality of tuning basic models. In order to improve the robustness of the model and the accuracy of model prediction, the bronchodilating sparse feature and the diagnosis tag are used as a first data set, and Laplacian noise is added to each data in the first data set to obtain a second data set; training each tuning basic model according to the second data set to obtain the accuracy of each tuning basic model, and selecting a preset number of tuning basic models with top ranking according to the accuracy of each tuning basic model; and constructing a slow resistance lung prediction model based on the top-ranked preset number of tuning basic models. 
Therefore, the embodiment can improve the accuracy of image text recognition, can recognize key information features from high-dimensional data, and can remarkably improve learning accuracy by selecting the key features, so that the accuracy of model prediction is improved; the method can improve the robustness of the model, improve the accuracy of model prediction, and can assist doctors in judging the slow-resistance lung by predicting the slow-resistance lung through the slow-resistance lung prediction model, so that misdiagnosis is reduced.

For a better illustration, the present invention was analyzed by the following experiments:

The second data set is divided into a training set and a validation set, where the training set accounts for 70% and the validation set accounts for 30%. The slow-resistance lung prediction model 0.2 × decision_tree + 0.2 × knn + 0.2 × random_forest + 0.4 × xgboost is trained on the training set and validated on the validation set.
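A 70%/30% division of this kind can be sketched as follows. Whether the samples are shuffled before the division is not stated in this description; the shuffle, the fixed seed and the function name `train_validation_split` are assumptions of this sketch.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_validation_split(
    samples: Sequence[T], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle `samples` and split them into training and validation parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    cut = round(len(samples) * train_fraction)
    train = [samples[i] for i in indices[:cut]]
    validation = [samples[i] for i in indices[cut:]]
    return train, validation


# Ten placeholder samples stand in for rows of the second data set.
train, validation = train_validation_split(list(range(10)))
```

Every sample lands in exactly one of the two parts, so the validation accuracy is measured on data the fused model has not been trained on.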

Specifically, each model parameter used in the slow-resistance lung prediction model is set as follows:

For the specific parameter settings of the decision tree model, refer to Table 1.

TABLE 1

For the specific parameter settings of the K-nearest neighbor model, refer to Table 2.

TABLE 2

For the specific parameter settings of the random forest model, refer to Table 3.

TABLE 3

For the specific parameter settings of the XGBoost model, refer to Table 4.

TABLE 4

For the evaluation indexes of each model used in the slow-resistance lung prediction model of this embodiment, refer to Table 5.

TABLE 5

The evaluation indexes in Table 5 include: accuracy, precision, specificity, sensitivity, negative predictive value (NPV), and the AUC value, which is a model evaluation index. Specificity is also called the true negative rate: the higher the specificity, the lower the probability of misdiagnosis and the higher the probability of a confirmed diagnosis. Sensitivity is also called the true positive rate and indicates how sensitive a diagnostic method is to a disease once it is present: the higher the sensitivity, the lower the probability of missed diagnosis.

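Let TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. The accuracy is calculated as:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The precision is calculated as:

$$\text{precision} = \frac{TP}{TP + FP}$$

The specificity is calculated as:

$$\text{specificity} = \frac{TN}{TN + FP}$$

The sensitivity is calculated as:

$$\text{sensitivity} = \frac{TP}{TP + FN}$$

The negative predictive value is calculated as:

$$\text{NPV} = \frac{TN}{TN + FN}$$

Therefore, the slow-resistance lung prediction model 0.2 × decision_tree + 0.2 × knn + 0.2 × random_forest + 0.4 × xgboost of the present embodiment has an accuracy of 92%.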
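The five count-based indexes follow directly from the confusion-matrix counts, using the standard definitions of accuracy, precision, specificity, sensitivity and negative predictive value. The sketch below is illustrative: the function name `confusion_metrics` is hypothetical and the counts are made up, not taken from Table 5.

```python
from typing import Dict


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the five count-based indexes from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),  # true negative rate
        "sensitivity": tp / (tp + fn),  # true positive rate
        "npv": tn / (tn + fn),
    }


# Illustrative counts only; they are not taken from Table 5.
metrics = confusion_metrics(tp=40, tn=50, fp=5, fn=5)
```

The sketch does not guard against empty denominators, for example a validation set with no predicted positives, so a caller would need to handle that case separately.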

Referring to fig. 2, the embodiment of the present invention further provides a system for constructing a slow-blocking lung prediction model, where the system for constructing a slow-blocking lung prediction model includes an identification model constructing unit 100, a data acquiring unit 200, a missing value filling unit 300, a tag matching unit 400, a feature acquiring unit 500, a parameter tuning unit 600, a data set acquiring unit 700, a model selecting unit 800, and a prediction model constructing unit 900, where:

the recognition model construction unit 100 is used for constructing a PDF image text recognition model based on the FOTS model;

the data acquisition unit 200 is used for acquiring a bronchorelaxation test PDF report, and performing text data identification on the bronchorelaxation test PDF report through a PDF image text identification model to acquire bronchorelaxation report data;

a missing value filling unit 300, configured to calculate and obtain a missing value in the broncho-diastole report data according to the broncho-diastole report data, and fill the missing value in the broncho-diastole report data to obtain complete broncho-diastole report data;

A tag matching unit 400, configured to obtain patient ID information corresponding to the complete bronchorelaxation report data by using a fuzzy matching method, and match a diagnostic tag corresponding to the complete bronchorelaxation report data according to the patient ID information;

the feature acquisition unit 500 is configured to perform sparse feature screening on the complete bronchorelaxation report data to obtain bronchorelaxation sparse features;

a parameter tuning unit 600, configured to obtain a plurality of pre-selected base models, and perform parameter tuning on each base model to obtain a plurality of tuning base models;

a data set obtaining unit 700 configured to obtain a second data set by using the bronchodilatory sparse feature and the diagnostic tag as a first data set and adding laplace noise to each data in the first data set;

the model selecting unit 800 is configured to train each tuning base model according to the second data set, obtain an accuracy of each tuning base model, and select a preset number of tuning base models with top ranks according to the accuracy of each tuning base model;

the prediction model construction unit 900 is configured to construct a slow-resistance lung prediction model based on a preset number of tuning base models that are ranked at the top.

It should be noted that, since a system for constructing a slow-resistance lung prediction model in the present embodiment and the above-mentioned method for constructing a slow-resistance lung prediction model are based on the same inventive concept, the corresponding content in the method embodiment is also applicable to the system embodiment, and will not be described in detail herein.

The embodiment of the invention also provides equipment for constructing the slow-resistance lung prediction model, which comprises: at least one control processor and a memory communicatively connected to the at least one control processor.

The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The non-transitory software programs and instructions required to implement the method of constructing a slow-resistance lung prediction model of the above embodiments are stored in the memory and, when executed by the processor, perform the method of constructing a slow-resistance lung prediction model of the above embodiments, for example, method steps S100 to S900 in FIG. 1 described above.

The system embodiments described above are merely illustrative, in that the units illustrated as separate components may or may not be physically separate, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions that are executed by one or more control processors to cause the one or more control processors to perform a method of constructing a slow-blocking lung prediction model in one of the above method embodiments, for example, to perform the functions of the method steps S100 to S900 in fig. 1 described above.

Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.

While the preferred embodiments of the present application have been described in detail, the embodiments of the present application are not limited to the above-described embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the embodiments of the present application, and these equivalent modifications or substitutions are included in the scope of the embodiments of the present application as defined in the appended claims.

Claims

1. A method of constructing a slow-blocking lung prediction model, the method comprising:

based on the FOTS model, constructing a PDF image text recognition model;

acquiring a plurality of preselected basic models, and performing parameter tuning on each basic model to acquire a plurality of tuning basic models; the method comprises the steps that a plurality of preselected basic models comprise logistic regression, decision trees, K-nearest neighbors, support vector machines, random forests and XGBoost, and parameter tuning is carried out on each basic model by adopting a grid search algorithm to obtain a plurality of tuning basic models;

2. The method for constructing a slow-blocking lung prediction model according to claim 1, wherein constructing a PDF image text recognition model based on the FOTS model comprises:

3. The method of constructing a slow-blocking lung prediction model according to claim 1, wherein calculating missing values in the broncho-diastolic report data from the broncho-diastolic report data comprises:

4. The method of constructing a slow blocking lung prediction model according to claim 1, wherein the filling the missing values in the broncho-diastolic report data to obtain complete broncho-diastolic report data comprises:

5. The method for constructing a slow-blocking lung prediction model according to claim 1, wherein the performing sparse feature screening on the complete broncho-diastole report data to obtain broncho-diastole sparse features comprises:

6. The method for constructing a slow-blocking lung prediction model according to claim 1, wherein training each tuning base model according to the second data set to obtain an accuracy of each tuning base model, and selecting a preset number of tuning base models ranked at the front according to the accuracy of each tuning base model, comprises:

7. The method of constructing a slow-blocking lung prediction model according to claim 1, wherein the constructing a slow-blocking lung prediction model based on the top-ranked preset number of tuning base models comprises:

8. A system for constructing a slow-blocking lung prediction model, the system comprising:

the parameter tuning unit is used for acquiring a plurality of preselected basic models, and performing parameter tuning on each basic model to acquire a plurality of tuning basic models; the method comprises the steps that a plurality of preselected basic models comprise logistic regression, decision trees, K-nearest neighbors, support vector machines, random forests and XGBoost, and parameter tuning is carried out on each basic model by adopting a grid search algorithm to obtain a plurality of tuning basic models;

9. An apparatus for constructing a slow-blocking lung prediction model, comprising at least one control processor and a memory for communication with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the method of constructing a slow resistance lung prediction model according to any one of claims 1 to 7.

10. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method of constructing a slow resistance lung prediction model according to any one of claims 1 to 7.