CN115601779A - Model iteration method and device - Google Patents

Info

Publication number
CN115601779A
Authority
CN
China
Prior art keywords
data set
data
initial
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211399013.1A
Other languages
Chinese (zh)
Inventor
黄东振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pacific Insurance Technology Co Ltd
Original Assignee
Pacific Insurance Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pacific Insurance Technology Co Ltd filed Critical Pacific Insurance Technology Co Ltd
Priority to CN202211399013.1A priority Critical patent/CN115601779A/en
Publication of CN115601779A publication Critical patent/CN115601779A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content

Abstract

The application discloses a model iteration method and device. The method comprises: acquiring manually labeled newly added data as a first data set; training an initial model based on the first data set, and computing the aggregate outlier indicator of every item of data in the first data set; deleting from the first data set the data whose aggregate outlier indicator is smaller than a first preset value, thereby obtaining a second data set; obtaining a third data set based on the initial data set and the second data set; and retraining the initial model based on the third data set, thereby completing one iteration of the model. In this way, after the newly added data is obtained, data with quality problems is actively deleted from the data set based on the aggregate outlier indicators computed during training, and the model is trained again. This iterates the model, addresses its domain deviation problem, and improves its performance, so that medical bill information can be extracted accurately.

Description

Model iteration method and device
Technical Field
The application relates to the technical field of model training, in particular to a model iteration method and device.
Background
Medical bill information extraction is widely used in fields such as insurance claim settlement. Medical bills come in many formats, because formats differ between regions and between hospitals. Mainstream medical bill information extraction methods comprise at least two steps: Optical Character Recognition (OCR) and Named Entity Recognition (NER).
Currently, the most common training mode for the NER task is supervised learning; that is, the NER task is performed by a model obtained through supervised training. Supervised learning is a type of machine learning in which a group of labeled samples is used as a training set and a model is trained on that set to obtain a target model.
When such a model is applied to a medical bill information extraction task with complex scenes and varied bill formats, the model obtained by supervised training suffers from domain deviation, so the medical bill information cannot be extracted accurately.
Disclosure of Invention
In view of the above problems, the application provides a model iteration method and device that can address the domain deviation of a model in medical bill information extraction tasks with complex scenes and varied bill formats, so that medical bill information can be extracted accurately.
The embodiments of the application disclose the following technical solutions:
In a first aspect, the present application discloses a model iteration method, the method including:
acquiring manually labeled newly added data as a first data set;
training an initial model based on the first data set, and computing the aggregate outlier indicator of every item of data in the first data set; the aggregate outlier indicator is the product of the mean of the confidence levels of the data and the standard deviation of the confidence levels of the data;
deleting, based on the aggregate outlier indicator, the data whose aggregate outlier indicator is smaller than a first preset value from the first data set, thereby obtaining a second data set;
obtaining a third data set based on the initial data set and the second data set; the third data set comprises the data contained in the initial data set and the data contained in the second data set;
and retraining the initial model based on the third data set, thereby completing one iteration of the model.
Optionally, after obtaining the third data set based on the initial data set and the second data set, the method further includes:
judging whether the number of data items in the third data set is larger than a preset number;
if yes, down-sampling all data in the third data set, and retraining the initial model based on the down-sampled third data set;
if not, retraining the initial model directly based on the third data set.
Optionally, down-sampling all the data in the third data set includes:
computing the aggregate outlier indicator of every item of data in the third data set;
deleting, based on these aggregate outlier indicators, the data whose aggregate outlier indicator is smaller than a second preset value from the third data set, thereby realizing the down-sampling; the second preset value is greater than the first preset value.
Optionally, before acquiring the manually labeled newly added data as the first data set, the method further includes:
acquiring a medical bill;
and screening out the newly added data of the medical bill based on the medical bill.
Optionally, screening out the newly added data of the medical bill based on the medical bill includes:
screening out, through an active learning technique, the data in the medical bill whose index exceeds a preset threshold; the data whose index exceeds the preset threshold is the newly added data of the medical bill.
Optionally, the method further includes:
acquiring an initial data set in advance;
selecting a training model based on the initial data set;
and training the training model to obtain an initial model based on the initial data set.
Optionally, selecting a training model based on the initial data set includes:
based on the initial data set, performing word segmentation on the data in the initial data set, and counting the sequence length of each data after word segmentation;
analyzing a distribution of the sequence lengths based on the sequence lengths;
and selecting a corresponding information extraction model as a training model based on the distribution of the sequence length.
Optionally, acquiring the initial data set in advance includes:
acquiring an initial medical bill that has been manually labeled;
and performing optical character recognition on the target entities in the manually labeled initial medical bill, thereby obtaining the initial data set.
In a second aspect, the present application provides a model iteration apparatus, comprising:
an acquisition module, configured to acquire manually labeled newly added data as a first data set;
a statistics module, configured to train an initial model based on the first data set and to compute the aggregate outlier indicator of every item of data in the first data set; the aggregate outlier indicator is the product of the mean of the confidence levels of the data and the standard deviation of the confidence levels of the data;
a screening module, configured to delete, based on the aggregate outlier indicator, the data whose aggregate outlier indicator is smaller than a first preset value from the first data set, thereby obtaining a second data set;
a merging module, configured to obtain a third data set based on the initial data set and the second data set; the third data set comprises the data contained in the initial data set and the data contained in the second data set;
and an iterative training module, configured to retrain the initial model based on the third data set, thereby completing one iteration of the model.
Optionally, the apparatus further comprises:
the judging module is used for judging whether the number of the data in the third data set is larger than a preset number or not;
if so, the down-sampling module is used for down-sampling all the data in the third data set; the iterative training module is specifically configured to retrain the initial model based on the down-sampled third data set;
and if not, the iterative training module is specifically configured to retrain the initial model directly based on the third data set.
Optionally, the down-sampling module is specifically configured to: compute the aggregate outlier indicator of every item of data in the third data set;
and delete, based on these aggregate outlier indicators, the data whose aggregate outlier indicator is smaller than a second preset value from the third data set, thereby realizing the down-sampling; the second preset value is greater than the first preset value.
Optionally, the apparatus further comprises:
the bill acquiring module is used for acquiring medical bills;
and the newly added data screening module is used for screening the newly added data in the medical bills based on the medical bills.
Optionally, the newly added data screening module is specifically configured to: screening out data with the index exceeding a preset threshold value in the medical bill through an active learning technology based on the medical bill; and the data with the index exceeding the preset threshold value is the newly added data of the medical bill.
Optionally, the apparatus further comprises:
the initial data acquisition module is used for acquiring an initial data set in advance;
the training model selection module is used for selecting a training model based on the initial data set;
and the initial model training module is used for training the training model based on the initial data set to obtain an initial model.
Optionally, the training model selecting module is specifically configured to: based on the initial data set, performing word segmentation on the data in the initial data set, and counting the sequence length of each data after word segmentation; analyzing a distribution of the sequence lengths based on the sequence lengths; and selecting a corresponding information extraction model as a training model based on the distribution of the sequence lengths.
Optionally, the initial data acquisition module is specifically configured to:
acquire an initial medical bill that has been manually labeled; and perform optical character recognition on the target entities in the manually labeled initial medical bill, thereby obtaining the initial data set.
Compared with the prior art, the application has the following beneficial effects: manually labeled newly added data is received, the initial model is trained on it, data with quality problems is deleted from the data set during training, and the initial model is retrained on the newly added data with erroneous data removed together with the initial data, thereby realizing the iteration of the model.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a model iteration method provided in an embodiment of the present application;
Fig. 2 is a flowchart of a method for obtaining an initial data set provided in an embodiment of the present application;
Fig. 3 is a block diagram of a model iteration apparatus provided in an embodiment of the present application.
Detailed Description
As described above, in the prior art the medical bill information extraction task is performed by a model obtained through supervised training. Supervised learning trains on a limited set of known data as training samples (e.g., known inputs and corresponding outputs) to obtain a target model, which is then used to extract medical bill information. Because medical bills come in many formats (formats differ between regions and between hospitals) and these formats may be updated, a model trained on limited data often suffers from domain deviation when used for medical bill information extraction, and its generalization performance is poor. For example, a model A trained on the medical bill data of hospital A cannot accurately extract medical bill information when applied to the medical bills of hospital B.
To solve the above problem, the present application provides a model iteration method, including: acquiring manually labeled newly added data as a first data set; training an initial model based on the first data set, and computing the aggregate outlier indicator of every item of data in the first data set; the aggregate outlier indicator is the product of the mean of the confidence levels of the data and the standard deviation of the confidence levels of the data; deleting, based on the aggregate outlier indicator, the data whose aggregate outlier indicator is smaller than a first preset value from the first data set, thereby obtaining a second data set; obtaining a third data set based on the initial data set and the second data set; the third data set comprises the data contained in the initial data set and the data contained in the second data set; and retraining the initial model based on the third data set, thereby completing one iteration of the model. In this way, manually labeled newly added data is acquired, the initial model is trained on it, data with quality problems is deleted from the data set during training, and the initial model is retrained on the newly added data with erroneous data removed together with the initial data. This iterates the model and addresses its domain deviation; it also improves the quality of the newly added data by actively screening out data with quality problems, which improves model performance so that medical bill information can be extracted accurately.
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a flowchart of a model iteration method provided in an embodiment of the present application. The method includes:
S101, acquiring an initial data set.
The initial data set is a set of labeled data used to train the model.
As shown in the schematic flowchart of Fig. 2, acquiring the initial data set specifically includes:
S201, acquiring an initial medical bill that has been manually labeled.
The manual labeling is mainly performed on the target entities in the initial medical bill.
A target entity generally refers to an entity with a specific meaning or strong reference, and generally includes a person's name, a place name, an organization name, a date and time, a proper noun, and the like. Depending on business needs, target entities may also include further categories such as product name, model, and price. Taking a medical bill as an example, its target entities generally include the person's name, the hospital name, the names of medical items, the fees paid, and so on.
Based on the initial medical bill, the position of each target entity in the bill and the category of the target entity are labeled manually.
S202, based on the manually labeled initial medical bill, performing OCR detection and recognition on the target entities in the initial medical bill, thereby obtaining the initial data set.
OCR is a process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using character recognition methods. For printed characters, the text in a paper document is optically converted into a black-and-white dot-matrix image file, and recognition software converts the characters in the image into a text format for further editing and processing by word-processing software.
In the embodiment of the present application, OCR recognition is performed on the characters of the target entities in the initial medical bill; that is, the characters corresponding to each target entity are converted into computer text (for example, a text format).
Alternatively, in S202 the text may be entered into the computer manually based on the initial medical bill. This is not specifically limited in the present application.
Compared with manual entry, converting the characters of the target entities in the initial medical bill into computer text through OCR wastes less manpower and is faster.
The position and category of each target entity labeled in S201 correspond one-to-one with the computer text of that target entity obtained in S202, yielding a set of known inputs and corresponding outputs for training the model. For example, the position of the target entity (seat), the category of the target entity (key), and the computer text of the target entity (value) form a key-seat-value record, which is one item of data with its corresponding output used to train the model.
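A key-seat-value record of this kind can be pictured as a small data structure. The following is a minimal sketch only: the class name `LabeledEntity`, the helper `build_records`, the bounding-box convention for `seat`, and the sample values are all illustrative assumptions that do not appear in the application.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LabeledEntity:
    """One key-seat-value record (class and field layout are hypothetical)."""
    key: str                          # category of the target entity
    seat: Tuple[int, int, int, int]   # assumed box: (left, top, right, bottom) in pixels
    value: str                        # computer text from OCR or manual entry


def build_records(annotations, texts) -> List[LabeledEntity]:
    """Pair each manually labeled (key, seat) with its recognized text, one-to-one."""
    if len(annotations) != len(texts):
        raise ValueError("annotations and texts must correspond one-to-one")
    return [LabeledEntity(key, tuple(seat), text)
            for (key, seat), text in zip(annotations, texts)]


# Made-up example values, for illustration only.
records = build_records(
    [("hospital_name", (40, 20, 320, 50)), ("payment_fee", (300, 400, 380, 430))],
    ["Example Hospital", "128.50"],
)
```

Keeping the manual labels (key, seat) separate from the recognized text (value) until they are paired mirrors the split between S201 and S202.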
S102, selecting a training model based on the initial data set.
Specifically, a word segmenter (tokenizer) segments the data in the initial data set, the sequence length of each segmented item of data is counted, the distribution of these sequence lengths is analyzed, and an information extraction model suited to that distribution is selected as the training model. For example, if after word segmentation 95% of the sequences in an initial data set have lengths within 512 and 5% have lengths within 1024, an information extraction model suited to a sequence length of 512 is selected as the training model.
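The selection rule in the example above can be sketched as follows. This is an illustration under stated assumptions, not the application's procedure: the function name `choose_max_length`, the candidate lengths, and the 95% coverage requirement are hypothetical parameters, and the input is assumed to be already segmented into tokens.

```python
def choose_max_length(token_sequences, candidate_lengths=(512, 1024), coverage=0.95):
    """Return the smallest candidate length that covers at least `coverage`
    of the segmented sequences (candidates and coverage are assumptions)."""
    lengths = [len(seq) for seq in token_sequences]
    if not lengths:
        raise ValueError("empty data set")
    for limit in sorted(candidate_lengths):
        share = sum(1 for n in lengths if n <= limit) / len(lengths)
        if share >= coverage:
            return limit
    # No candidate reaches the required coverage: fall back to the largest one.
    return max(candidate_lengths)


# 95 short sequences and 5 long ones, matching the proportions in the text.
sequences = [["tok"] * 300] * 95 + [["tok"] * 800] * 5
chosen = choose_max_length(sequences)
```

With these proportions the sketch picks 512, so a model suited to that sequence length would be chosen and the few longer sequences would need truncation or splitting.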
S103, training the training model based on the initial data set, thereby obtaining the initial model.
Supervised training is performed on the training model based on the initial data set; the training model after training is the initial model.
Further, when the training model is trained on the initial data set, a loss function (loss) is calculated. The loss function estimates the difference between the output of the model and the true value, and gives the model a direction for optimization.
Furthermore, an R-Drop module is added to the loss calculation: exploiting the randomness of the dropout layers in the Transformer structure, the K-L divergence between two inferences is added to the original loss.
The K-L divergence (Kullback-Leibler divergence), also called relative entropy, describes the difference between two probability distributions P and Q. In machine learning, it is used to measure how similar or close two distributions are.
Dropout (random inactivation) means that some neurons in each layer are randomly discarded during training of a deep learning network, i.e., some nodes are randomly disabled during model training, which helps avoid overfitting.
R-Drop is a consistency training strategy for regularizing Dropout. It acts on the output layer of the model to compensate for the inconsistency of Dropout between training and testing: in one training step, each item of data is inferred twice through the same model with Dropout, and Dropout randomly discards different units in the two inferences; R-Drop then constrains the two outputs using the K-L divergence, improving the consistency of the model output.
This further reduces the difference between the output of model training and the true value, so that the trained initial model is more accurate.
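The R-Drop loss described above can be written out for a single sample. The sketch below uses plain Python on two already-computed output distributions rather than a deep learning framework; the function names, the weight `alpha`, and its default value are assumptions, and both distributions are assumed to be strictly positive (as softmax outputs are).

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under one forward pass."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """K-L divergence D(P || Q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def r_drop_loss(p1, p2, label, alpha=1.0):
    """Original loss on both dropout passes plus a symmetric K-L term.

    p1 and p2 are the output distributions of two forward passes of the same
    model on the same input; alpha is an assumed weighting hyperparameter.
    """
    nll = cross_entropy(p1, label) + cross_entropy(p2, label)
    consistency = 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
    return nll + alpha * consistency


same = r_drop_loss([0.5, 0.5], [0.5, 0.5], label=0)
different = r_drop_loss([0.7, 0.3], [0.4, 0.6], label=0)
```

When the two passes agree exactly, the K-L term vanishes and only the original loss remains; any disagreement between the passes adds a positive penalty.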
S104, acquiring manually labeled newly added data as a first data set.
Specifically, when a medical bill is acquired, newly added data in the medical bill is selected based on an active learning technique, and the newly added data, once manually labeled, is acquired as the first data set.
Active learning falls mainly into two types: streaming active learning (Sequential Active Learning) and offline batch active learning (Pool-based Active Learning). Different schemes can be selected for different scenarios. The embodiment of the present application adopts offline batch active learning, in which active learning decides which samples in an unlabeled sample pool need to be labeled. Common selection methods for offline batch active learning include least confidence (Least Confidence), variance reduction (Variance Reduction), margin sampling (Margin Sampling), the entropy method (Entropy), and density-weighted methods (Density-Weighted Methods); any one of these methods, or any combination of them, may be used in the embodiments of the present application. When a medical bill is acquired, active learning techniques can determine whether it is a medical bill in a new format (i.e., a format different from that of the initial medical bills). Specifically, because the data of the initial medical bills is labeled data (i.e., the initial data set), a model trained on the initial data set has higher confidence on data it has already seen; the active learning methods design the above indexes (least confidence, variance reduction, margin sampling, entropy, density weighting, and so on) based on this characteristic and its derived characteristics. With these indexes, new types of data (such as data from new-format medical bills) can be screened out of the unlabeled sample pool.
If the medical bill is in a new format, then, based on the active learning technique, the data whose index exceeds a preset threshold is obtained from the unlabeled data of the medical bill and taken as the newly added data. The preset threshold is determined according to actual conditions and is not specifically limited in the present application.
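As one illustration of threshold-based screening, the sketch below uses the least-confidence index, which is only one of the selection methods the text permits. The function names, the sample ids, the probability values, and the threshold of 0.4 are all made-up assumptions.

```python
def least_confidence(probs):
    """Uncertainty index: 1 minus the highest predicted class probability."""
    return 1.0 - max(probs)


def select_for_labeling(pool, threshold=0.4):
    """Return ids of unlabeled samples whose index exceeds the preset threshold.

    `pool` maps a sample id to the class probabilities predicted by the
    initial model; the threshold value here is an arbitrary assumption.
    """
    return [sid for sid, probs in pool.items()
            if least_confidence(probs) > threshold]


pool = {
    "bill_001": [0.97, 0.02, 0.01],  # familiar format, confident prediction
    "bill_002": [0.40, 0.35, 0.25],  # possibly a new format, uncertain prediction
}
selected = select_for_labeling(pool)
```

Only the uncertain sample is passed on for manual labeling, which is the sense in which the index separates new-format data from data resembling the initial data set.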
Specifically, acquiring the manually labeled newly added data includes manually labeling the position and category of the newly added data in the medical bill and then recognizing its characters as computer text through OCR (alternatively, the characters of the newly added data may be entered into the computer manually).
S105, training the initial model based on the first data set, and computing, during training, the aggregate outlier indicator of every item of data in the first data set.
During training, two statistics are collected as the basis for screening aggregate outliers: the mean, across training periods (Epochs), of the confidence assigned to the real label (label) of each labeled item of newly added data (Confidence), and the standard deviation of that confidence across Epochs (Variability).
An Epoch is one complete pass of model training over all the labeled newly added data (i.e., the first data set), also called one round of training. The real label is the data label of the labeled newly added data. The confidence of the real label refers to how reliably the label predicted during training matches the real label.
Confidence and Variability are combined into the aggregate outlier indicator, which is the product of the mean of the confidence of the data (Confidence) and the standard deviation of the confidence of the data (Variability); this indicator is the final criterion for screening aggregate outliers. Data whose aggregate outlier indicator is smaller than a first preset value is treated as an aggregate outlier, and such data may have quality problems introduced during labeling.
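The indicator can be computed directly from the per-Epoch confidences of one item of data. This is a minimal sketch: the function name is hypothetical, the confidence values are made up, and the use of the population standard deviation (`pstdev`) rather than the sample standard deviation is an assumption, since the text does not say which is meant.

```python
from statistics import mean, pstdev


def aggregate_outlier_indicator(confidences_per_epoch):
    """Product of the mean (Confidence) and the standard deviation
    (Variability) of the true-label confidence across Epochs."""
    return mean(confidences_per_epoch) * pstdev(confidences_per_epoch)


# Made-up confidences recorded over three Epochs for two items of data.
flat_low = aggregate_outlier_indicator([0.1, 0.1, 0.1])
rising = aggregate_outlier_indicator([0.2, 0.5, 0.8])
```

An item whose confidence stays low and flat across Epochs gets an indicator near zero and would fall below a small first preset value, while an item whose confidence grows during training scores higher.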
S106, deleting, based on the aggregate outlier indicator, the data whose aggregate outlier indicator is smaller than the first preset value from the first data set, thereby obtaining a second data set.
In this way the model actively screens out data that has quality problems introduced during labeling, and this data is removed from the newly added labeled data. This avoids the situation in which labeling errors cause a model trained on the data to perform poorly, or even fail to outperform random sampling.
S107, obtaining a third data set based on the initial data set and the second data set.
The third data set includes all data of the initial data set and all data of the second data set.
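Steps S106 and S107 amount to a filter followed by a merge. In the sketch below, each item of data is assumed to be a dictionary with an `id` field, and `indicators` is assumed to map each id to a precomputed aggregate outlier indicator; these representations and the threshold value are illustrative assumptions.

```python
def filter_and_merge(initial_data, first_data, indicators, first_threshold):
    """Return (second data set, third data set).

    Items of the first data set whose indicator is smaller than the first
    preset value are deleted (S106); the remainder is merged with the
    initial data set (S107).
    """
    second = [item for item in first_data
              if indicators[item["id"]] >= first_threshold]
    third = list(initial_data) + second
    return second, third


initial_data = [{"id": "a"}]
first_data = [{"id": "b"}, {"id": "c"}]
indicators = {"b": 0.20, "c": 0.01}  # made-up values
second, third = filter_and_merge(initial_data, first_data, indicators, 0.05)
```

Here item "c" is removed as a likely labeling error, and the third data set contains the initial item plus item "b".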
S108, judging whether the number of data items in the third data set is larger than a preset number.
If the number is larger than the preset number, S109 is performed; otherwise, the process goes directly to S110.
S109, down-sampling all data in the third data set.
Specifically, the data as a whole is down-sampled. A simple random down-sampling strategy may be adopted; alternatively, the product of the two statistics Confidence and Variability used in S106 (i.e., the aggregate outlier indicator) may serve as the screening basis to select higher-quality data, in which case the preset value used as the screening criterion in S109 (the second preset value) should be greater than the preset value set in S106 (the first preset value); a combination of the two approaches may also be used. The present application does not specifically limit this.
When the aggregate outlier indicator is used, the aggregate outlier indicator of every item of data in the third data set is computed, and the data whose aggregate outlier indicator is smaller than the second preset value is deleted from the third data set. The second preset value is greater than the first preset value.
This reduces repeated training on the same data and shortens each training run, while the trained model performs better.
S108 and S109 are optional steps.
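The size check in S108 and the down-sampling in S109 can be sketched together. This example combines the two strategies the text allows (indicator-based screening, then random sampling); capping the result at the preset number with a random draw, the fixed seed, and the dictionary representation of the data are assumptions made for illustration.

```python
import random


def downsample(third_data, indicators, preset_number, second_threshold, seed=0):
    """Down-sample the third data set only when it exceeds the preset number."""
    if len(third_data) <= preset_number:
        return list(third_data)  # S108 answers "no": train on everything
    # Indicator-based screening with the second (stricter) preset value.
    kept = [item for item in third_data
            if indicators[item["id"]] >= second_threshold]
    # Assumed combination step: randomly cap the size if still too large.
    if len(kept) > preset_number:
        kept = random.Random(seed).sample(kept, preset_number)
    return kept


third_data = [{"id": i} for i in range(10)]
indicators = {i: i / 10 for i in range(10)}  # made-up indicator values
untouched = downsample(third_data, indicators, preset_number=20, second_threshold=0.5)
screened = downsample(third_data, indicators, preset_number=5, second_threshold=0.5)
capped = downsample(third_data, indicators, preset_number=3, second_threshold=0.5)
```

A fixed seed keeps the random draw reproducible between training runs, which makes it easier to compare successive iterations of the model.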
S110, retraining the initial model based on the third data set, thereby completing the iteration.
Specifically, each time a medical bill is acquired, continuous iteration of the model can be achieved by repeating S104-S110.
To sum up, the model iteration method provided in the embodiment of the present application includes: acquiring manually labeled newly added data as a first data set; training an initial model based on the first data set, and counting the aggregate outlier indicators of all data in the first data set, where the aggregate outlier indicator is the product of the mean of the confidence levels of the data and the standard deviation of the confidence levels of the data; deleting, based on the aggregate outlier indicators, the data whose aggregate outlier indicator is smaller than a first preset value from the first data set, so as to obtain a second data set; obtaining a third data set based on the initial data set and the second data set, where the third data set comprises the data contained in the initial data set and the data contained in the second data set; and retraining the initial model based on the third data set, thereby completing one iteration of the model. In this way, manually labeled newly added data is obtained, the initial model is trained on it, data with quality problems is actively deleted from the data set based on the aggregate outlier indicators counted during training, and the initial model is retrained on the cleaned newly added data together with the initial data. This iteration addresses the domain shift of the model, improves the quality of the newly added data by actively screening out problematic data, improves the performance of the model, and allows medical bill information to be extracted accurately.
Furthermore, based on the active learning technology, medical bills are detected, screened, and fed back into the training data, which simplifies data acquisition, improves the effectiveness of the newly added data, suppresses redundancy among data of the same kind, and reduces the resources and time wasted in data collection, manual labeling, and model training.
Further, down-sampling all data in the third data set reduces training on repeated data, and thus reduces the resources and time wasted in model training.
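The screening and merging steps summarized above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the confidences are assumed to be recorded once per training epoch, the population standard deviation is assumed, and the sample identifiers, the threshold 0.05, and all function names are hypothetical rather than taken from the application.

```python
from statistics import mean, pstdev


def aggregate_outlier_indicator(confidences):
    """Product of the mean and the standard deviation of one sample's confidences."""
    # Assumption: population standard deviation; the text does not say which variant is used.
    return mean(confidences) * pstdev(confidences)


def screen_first_data_set(confidence_history, first_preset_value):
    """Delete samples whose indicator is smaller than the first preset value."""
    return [
        sample_id
        for sample_id, confidences in confidence_history.items()
        if aggregate_outlier_indicator(confidences) >= first_preset_value
    ]


def merge_data_sets(initial_data_set, second_data_set):
    """Third data set: the initial data followed by the screened new data."""
    return list(initial_data_set) + list(second_data_set)


# Hypothetical confidences recorded while training on the first data set.
history = {
    "bill_001": [0.20, 0.50, 0.80],
    "bill_002": [0.10, 0.10, 0.10],  # constant confidence, indicator is 0
    "bill_003": [0.90, 0.60, 0.30],
}
second_data_set = screen_first_data_set(history, first_preset_value=0.05)
third_data_set = merge_data_sets(["init_001", "init_002"], second_data_set)
```

With these made-up values, only the sample whose confidence never changes falls below the threshold, so it is the one removed before merging.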
Fig. 3 is a structural block diagram of a model iteration apparatus according to an embodiment of the present application. As described in detail below with reference to Fig. 3, the apparatus includes:
an obtaining module 301, configured to obtain manually labeled newly added data as a first data set;
a statistic module 302, configured to train the initial model based on the first data set and to count the aggregate outlier indicators of all data in the first data set during the training process;
a screening module 303, configured to delete, based on the aggregate outlier indicators, data whose aggregate outlier indicator is smaller than a first preset value from the first data set, so as to obtain a second data set;
a merging module 304, configured to obtain a third data set based on the initial data set and the second data set; and
an iterative training module 305, configured to retrain the initial model based on the third data set, so as to complete one iteration of the model.
Further, the apparatus further comprises:
a judging module, configured to judge whether the number of data items in the third data set is larger than a preset number;
a down-sampling module, configured to down-sample all data in the third data set if so;
the iterative training module 305 being further configured, in that case, to retrain the initial model based on the down-sampled third data set;
if not, the iterative training module 305 retrains the initial model directly based on the third data set.
The down-sampling module is specifically configured to count the aggregate outlier indicators of all data in the third data set and, based on those indicators, to delete from the third data set the data whose aggregate outlier indicator is smaller than a second preset value, thereby realizing the down-sampling. The first preset value is less than the second preset value.
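The judging and down-sampling behaviour described above might look like the sketch below. The function names, the preset number, the second preset value 0.10, and the per-sample confidence lists are all hypothetical, and the population standard deviation is again an assumption.

```python
from statistics import mean, pstdev


def indicator(confidences):
    # Aggregate outlier indicator: mean times standard deviation of the confidences.
    return mean(confidences) * pstdev(confidences)


def maybe_downsample(third_history, preset_number, second_preset_value):
    """Down-sample only when the third data set holds more than preset_number items."""
    if len(third_history) <= preset_number:
        return dict(third_history)  # small enough: train on it directly
    return {
        sample_id: confidences
        for sample_id, confidences in third_history.items()
        if indicator(confidences) >= second_preset_value
    }


# Hypothetical confidences for every sample in the third data set.
third_history = {
    "a": [0.20, 0.50, 0.80],
    "b": [0.90, 0.60, 0.30],
    "c": [0.50, 0.60, 0.70],
    "d": [0.95, 0.95, 0.95],
}
downsampled = maybe_downsample(third_history, preset_number=3, second_preset_value=0.10)
untouched = maybe_downsample(third_history, preset_number=10, second_preset_value=0.10)
```

Because the second preset value is larger than the first, this pass removes more samples than the initial screening does, which is how it shrinks an oversized third data set.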
In addition, the apparatus further comprises:
a bill acquiring module, configured to acquire medical bills; and
a newly added data screening module, configured to screen out the newly added data from the medical bills.
Further, the newly added data screening module is specifically configured to screen out, through an active learning technology, the data in the medical bills whose index exceeds a preset threshold value; the data whose index exceeds the preset threshold value is the newly added data of the medical bills.
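As a rough illustration of threshold-based screening, the sketch below uses one common active-learning heuristic. This passage does not define the index, so the uncertainty score used here (one minus the least confident field prediction), the threshold 0.3, and all names are assumptions made purely for the example.

```python
def uncertainty(field_confidences):
    # Assumed index: 1 minus the confidence of the least certain extracted field.
    return 1.0 - min(field_confidences)


def screen_new_data(bills, preset_threshold):
    """Return the bills whose index exceeds the preset threshold."""
    return [
        bill_id
        for bill_id, field_confidences in bills.items()
        if uncertainty(field_confidences) > preset_threshold
    ]


# Hypothetical per-field confidences produced by the current model.
bills = {
    "bill_101": [0.99, 0.97],
    "bill_102": [0.95, 0.40],
    "bill_103": [0.55, 0.90],
}
new_data = screen_new_data(bills, preset_threshold=0.3)
```

Bills the model already handles confidently are skipped, so only the uncertain ones would be sent on for manual labeling.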
In addition, the apparatus further comprises:
an initial data acquisition module, configured to acquire an initial data set in advance;
a training model selection module, configured to select a training model based on the initial data set; and
an initial model training module, configured to train the training model based on the initial data set to obtain an initial model.
Further, the training model selection module is specifically configured to perform word segmentation on the data in the initial data set and count the sequence length of each data item after word segmentation; to analyze the distribution of the sequence lengths; and to select a corresponding information extraction model as the training model based on that distribution.
Further, the initial data acquisition module is specifically configured to acquire an initial medical bill that has been manually labeled, and to perform optical character recognition on a target entity in the manually labeled initial medical bill, so as to obtain the initial data set.
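The model selection step described above could be sketched as follows. Whitespace splitting stands in for a real word segmenter, and the 95th-percentile rule, the cut-off of 128 tokens, and the two model names are hypothetical placeholders, not choices stated in the application.

```python
from statistics import quantiles


def sequence_lengths(texts, tokenize=str.split):
    """Segment each text and return its token count."""
    # Assumption: whitespace tokenization; real bills would need a proper segmenter.
    return [len(tokenize(text)) for text in texts]


def select_training_model(lengths, max_short_length=128):
    """Pick an extraction model from the distribution of sequence lengths."""
    # 95th percentile of the lengths (last of the 19 cut points for n=20).
    p95 = quantiles(lengths, n=20, method="inclusive")[-1]
    if p95 <= max_short_length:
        return "short_sequence_extractor"  # hypothetical model name
    return "long_sequence_extractor"       # hypothetical model name


texts = ["name Zhang San", "amount 120.50 yuan paid", "dept cardiology"]
lengths = sequence_lengths(texts)
chosen = select_training_model(lengths)
```

Using a high percentile instead of the maximum keeps a single unusually long record from forcing the choice of a long-sequence model.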
In summary, an embodiment of the present application provides a model iteration apparatus, including: an obtaining module 301, configured to obtain manually labeled newly added data as a first data set; a statistic module 302, configured to train the initial model based on the first data set and to count the aggregate outlier indicators of all data in the first data set during the training process; a screening module 303, configured to delete, based on the aggregate outlier indicators, data whose aggregate outlier indicator is smaller than a first preset value from the first data set, so as to obtain a second data set; a merging module 304, configured to obtain a third data set based on the initial data set and the second data set; and an iterative training module 305, configured to retrain the initial model based on the third data set, so as to complete one iteration of the model. In this way, manually labeled newly added data is obtained, the initial model is trained on it, data with quality problems is deleted from the data set during training, and the initial model is retrained on the cleaned newly added data together with the initial data. This iteration addresses the domain shift of the model, improves the quality of the newly added data by actively screening out problematic data, improves the performance of the model, and allows medical bill information to be extracted accurately.
Furthermore, based on the active learning technology, medical bills are detected, screened, and fed back into the training data, which simplifies data acquisition, improves the effectiveness of the newly added data, suppresses redundancy among data of the same kind, and reduces the resources and time wasted in data collection, manual labeling, and model training.
Further, down-sampling all data in the third data set reduces training on repeated data, and thus reduces the resources and time wasted in model training.
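One way to picture how modules 301 to 305 fit together is the sketch below. It is not the applicant's implementation: the class name, the injected train_fn callable and its return shape, and the stand-in trainer with canned confidences are all hypothetical and exist only to make the example runnable.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable


@dataclass
class ModelIterationApparatus:
    # train_fn(model, sample_ids) -> (trained_model, {sample_id: [confidences]})
    train_fn: Callable
    first_preset_value: float

    def obtain(self, labelled_new_ids):              # obtaining module 301
        return list(labelled_new_ids)

    def count_indicators(self, model, first):        # statistic module 302
        _, history = self.train_fn(model, first)
        return {sid: mean(c) * pstdev(c) for sid, c in history.items()}

    def screen(self, first, indicators):             # screening module 303
        return [sid for sid in first if indicators[sid] >= self.first_preset_value]

    def merge(self, initial, second):                # merging module 304
        return list(initial) + list(second)

    def retrain(self, model, third):                 # iterative training module 305
        trained_model, _ = self.train_fn(model, third)
        return trained_model

    def iterate(self, model, initial, labelled_new_ids):
        first = self.obtain(labelled_new_ids)
        second = self.screen(first, self.count_indicators(model, first))
        third = self.merge(initial, second)
        return self.retrain(model, third), third


def fake_train(model, sample_ids):
    # Stand-in trainer: bumps a version counter and returns canned confidences.
    canned = {"new_1": [0.2, 0.5, 0.8], "new_2": [0.1, 0.1, 0.1]}
    return model + 1, {sid: canned.get(sid, [0.5, 0.6, 0.7]) for sid in sample_ids}


apparatus = ModelIterationApparatus(train_fn=fake_train, first_preset_value=0.05)
model, third = apparatus.iterate(model=0, initial=["init_1"], labelled_new_ids=["new_1", "new_2"])
```

Calling iterate again with the next batch of labeled data would correspond to repeating the iteration whenever new medical bills arrive.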
The embodiment of the application also provides a corresponding device and a computer-readable storage medium for implementing the scheme provided by the embodiment of the application.
The device comprises a memory for storing instructions or code and a processor for executing the instructions or code, so that the device performs the model iteration method described in any embodiment of the present application.
In practice, the computer-readable storage medium may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present embodiment, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts indicated as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of model iteration, the method comprising:
acquiring manually labeled newly added data as a first data set;
training an initial model based on the first data set, and counting the aggregate outlier indicators of all data in the first data set; the aggregate outlier indicator is the product of the mean of the confidence levels of the data and the standard deviation of the confidence levels of the data;
deleting, based on the aggregate outlier indicators, the data whose aggregate outlier indicator is smaller than a first preset value from the first data set, so as to obtain a second data set;
obtaining a third data set based on the initial data set and the second data set; the third data set comprises data contained in the initial data set and data contained in the second data set;
and retraining the initial model based on the third data set, thereby completing one iteration of the model.
2. The method of claim 1, wherein after the obtaining a third data set based on the initial data set and the second data set, the method further comprises:
judging whether the number of data in the third data set is larger than a preset number;
if yes, down-sampling all data in the third data set, and retraining the initial model based on the down-sampled third data set;
and if not, retraining the initial model directly based on the third data set.
3. The method of claim 2, wherein the down-sampling all data in the third data set comprises:
counting the aggregate outlier indicators of all data in the third data set;
deleting, based on the aggregate outlier indicators of all the data, the data whose aggregate outlier indicator is smaller than a second preset value from the third data set, so as to realize the down-sampling; the second preset value is greater than the first preset value.
4. The method of claim 1, wherein prior to the acquiring manually labeled newly added data as a first data set, the method further comprises:
acquiring a medical bill;
and screening out the newly added data of the medical bill based on the medical bill.
5. The method of claim 4, wherein the screening out the newly added data of the medical bill based on the medical bill comprises:
screening out, through an active learning technology, the data in the medical bill whose index exceeds a preset threshold value; the data whose index exceeds the preset threshold value is the newly added data of the medical bill.
6. The method of claim 1, further comprising:
acquiring an initial data set in advance;
selecting a training model based on the initial data set;
and training the training model based on the initial data set to obtain an initial model.
7. The method of claim 6, wherein selecting a training model based on the initial data set comprises:
based on the initial data set, performing word segmentation on the data in the initial data set, and counting the sequence length of each data subjected to word segmentation;
analyzing a distribution of the sequence lengths based on the sequence lengths;
and selecting a corresponding information extraction model as a training model based on the distribution of the sequence length.
8. The method of claim 6, wherein the acquiring an initial data set in advance comprises:
acquiring an initial medical bill that has been manually labeled;
and performing optical character recognition on a target entity in the manually labeled initial medical bill, so as to obtain the initial data set.
9. An apparatus for model iteration, the apparatus comprising:
an obtaining module, configured to obtain manually labeled newly added data as a first data set;
a statistic module, configured to train an initial model based on the first data set and to count the aggregate outlier indicators of all data in the first data set; the aggregate outlier indicator is the product of the mean of the confidence levels of the data and the standard deviation of the confidence levels of the data;
a screening module, configured to delete, based on the aggregate outlier indicators, the data whose aggregate outlier indicator is smaller than a first preset value from the first data set, so as to obtain a second data set;
a merging module, configured to obtain a third data set based on the initial data set and the second data set; the third data set comprises data contained in the initial data set and data contained in the second data set;
and an iterative training module, configured to retrain the initial model based on the third data set, so as to complete one iteration of the model.
10. The apparatus of claim 9, further comprising:
the judging module is used for judging whether the number of the data in the third data set is larger than a preset number or not;
if so, the down-sampling module is used for down-sampling all the data in the third data set; the iterative training module is specifically configured to retrain the initial model based on the down-sampled third data set;
and if not, the iterative training module is specifically configured to retrain the initial model directly based on the third data set.
CN202211399013.1A 2022-11-09 2022-11-09 Model iteration method and device Pending CN115601779A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211399013.1A CN115601779A (en) 2022-11-09 2022-11-09 Model iteration method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211399013.1A CN115601779A (en) 2022-11-09 2022-11-09 Model iteration method and device

Publications (1)

Publication Number Publication Date
CN115601779A (en) 2023-01-13

Family

ID=84853385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211399013.1A Pending CN115601779A (en) 2022-11-09 2022-11-09 Model iteration method and device

Country Status (1)

Country Link
CN (1) CN115601779A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503695A (en) * 2023-06-29 2023-07-28 天津所托瑞安汽车科技有限公司 Training method of target detection model, target detection method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503695A (en) * 2023-06-29 2023-07-28 天津所托瑞安汽车科技有限公司 Training method of target detection model, target detection method and device
CN116503695B (en) * 2023-06-29 2023-10-03 天津所托瑞安汽车科技有限公司 Training method of target detection model, target detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination