CN113158652B - Data enhancement method, device, equipment and medium based on deep learning model - Google Patents


Info

Publication number
CN113158652B
CN113158652B (application CN202110420110.3A)
Authority
CN
China
Prior art keywords
data
enhancement
original
parameter list
synonym
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110420110.3A
Other languages
Chinese (zh)
Other versions
CN113158652A (en)
Inventor
李鹏宇
李剑锋
陈又新
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110420110.3A priority Critical patent/CN113158652B/en
Priority to PCT/CN2021/096475 priority patent/WO2022222224A1/en
Publication of CN113158652A publication Critical patent/CN113158652A/en
Application granted granted Critical
Publication of CN113158652B publication Critical patent/CN113158652B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of artificial intelligence and to the field of blockchain, and discloses a data enhancement method, device, equipment and medium based on a deep learning model. The method comprises the following steps: randomly initializing an original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists; converting original training data with the optimized parameter lists to obtain corresponding artificially constructed data; mixing the original training data with the corresponding artificially constructed data to obtain a plurality of training sets; training a plurality of recognition models on these training sets; determining whether any of the recognition models satisfies a convergence condition; and, if so, outputting a target data enhancement parameter list with which the original training data is enhanced to obtain a training set for a named entity recognition model. The invention takes the artificial fish swarm algorithm as a framework and incorporates the model's recognition effect as the optimization target when formulating the data enhancement strategy, thereby improving the data enhancement effect on the data.

Description

Data enhancement method, device, equipment and medium based on deep learning model
Technical Field
The invention relates to the field of artificial intelligence, in particular to a data enhancement method, device, equipment and medium based on a deep learning model.
Background
With the development of intelligent technology, demand for named entity recognition (Named Entity Recognition, NER) tasks is growing in natural language processing applications such as question-answering systems and machine translation systems, so executing named entity recognition tasks with a named entity recognition model trained on entity data has become an increasingly common recognition approach. To improve the rate at which the named entity recognition model recognizes entities in the text to be recognized, its accuracy is usually enhanced from one of two angles: enhancing the training data or enhancing the model algorithm.
In the prior art, a data enhancement model for a named entity recognition model mainly replaces entity words in the training data using different data enhancement methods and the parameters corresponding to those methods; for example, entity words in the training data undergo synonym replacement, random insertion, random position exchange, random deletion and other operations with certain probabilities, increasing the scale and diversity of the training data. The enhancement effect of the data enhancement model on the training data is inseparable from its model parameters, but the model parameters of existing data enhancement models are determined by experience or by a grid-search parameter optimization method, so interaction with the named entity recognition model is low and the enhancement effect on the training data is poor.
Disclosure of Invention
The invention provides a data enhancement method, device, equipment and medium based on a deep learning model, to solve the prior-art problem that the model parameters of a data enhancement model are determined by experience or by a grid-search parameter optimization method, resulting in a poor data enhancement effect.
A data enhancement method based on a deep learning model, comprising:
the method comprises the steps of obtaining original training data and original test data which are marked manually, and obtaining an original parameter list, wherein the original parameter list is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method;
randomly initializing the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists;
converting the original training data by utilizing each optimized parameter list to obtain corresponding artificial construction data, and mixing the original training data with the corresponding artificial construction data to obtain a plurality of training sets;
respectively training to obtain a plurality of recognition models by using the plurality of training sets, and testing the plurality of recognition models by taking the original test data as a test set to determine whether a model meeting convergence conditions exists in the plurality of recognition models;
if a model satisfying the convergence condition exists among the plurality of recognition models, outputting the optimized parameter list corresponding to that model as a target data enhancement parameter list;
and carrying out data enhancement on the original training data by utilizing the target data enhancement parameter list so as to obtain a training set of the named entity recognition model.
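Read as an algorithm, the steps above amount to an outer search loop over candidate enhancement-parameter lists. The following is a minimal sketch under stated assumptions: `train_fn` and `score_fn` are stand-ins for NER training and evaluation, and the perturbation rule, names, and loop structure are illustrative rather than taken from the patent.

```python
import random

def search_enhancement_parameters(original_list, train_data, test_data,
                                  train_fn, score_fn, n_fish=5,
                                  epsilon=0.01, max_rounds=10, rng=None):
    """Sketch of the claimed outer loop: randomly initialize the enhancement
    parameters several times (the 'artificial fish'), build a model per
    candidate list, score each on the test data, and stop when the best
    score's relative improvement falls below epsilon."""
    rng = rng or random.Random()
    best_list, best_score = None, None
    for _ in range(max_rounds):
        # randomly initialize the enhancement parameters of the original list
        candidates = [{k: v * rng.uniform(0.5, 1.5)
                       for k, v in original_list.items()}
                      for _ in range(n_fish)]
        prev_best = best_score
        for params in candidates:
            model = train_fn(train_data, params)   # train on the mixed set
            score = score_fn(model, test_data)     # test on original test data
            if best_score is None or score > best_score:
                best_list, best_score = params, score
        # convergence: relative gain of this round over the previous one
        if prev_best and (best_score - prev_best) / prev_best < epsilon:
            break
    return best_list
```

A usage example with trivial stub functions: `search_enhancement_parameters({"p_syn": 0.5}, None, None, lambda d, p: p, lambda m, t: m["p_syn"])` returns the candidate list whose (here, synthetic) score was highest.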
Further, after the determining whether a model satisfying a convergence condition exists in the plurality of recognition models, the method further includes:
if no model satisfying the convergence condition exists among the plurality of recognition models, randomly initializing the enhancement parameters in the original parameter list according to the artificial fish swarm algorithm again to obtain a plurality of randomly re-initialized optimized parameter lists, and counting the initializations;
determining whether the number of random initialization times of the enhancement parameters in the original parameter list is smaller than a preset number of times;
if the number of times of random initialization of the enhancement parameters in the original parameter list is not less than the preset number of times, stopping the random initialization of the enhancement parameters in the original parameter list;
if the number of times of random initialization of the enhancement parameters in the original parameter list is smaller than the preset number of times, training according to the randomly re-initialized optimized parameter lists to obtain a plurality of new recognition models, testing the new recognition models to obtain the target data enhancement parameter list, and obtaining a training set of the named entity recognition model by utilizing the target data enhancement parameter list.
Further, the data enhancement method includes a synonym replacement method, and the converting the original training data by using each optimized parameter list includes:
determining enhancement parameters corresponding to the synonym substitution method in the optimization parameter list, wherein the enhancement parameters corresponding to the synonym substitution method comprise entity word class substitution probability and entity word substitution class;
acquiring a preset synonym dictionary which is pre-constructed by a user according to requirements, wherein entity words which are not prohibited from synonym relations in the same entity category are used as synonyms of each other in the preset synonym dictionary;
and carrying out synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
Further, the performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category includes:
determining whether the category of each entity word in the original training data belongs to the entity word replacement category;
if the category of the entity word in the original training data belongs to the entity word replacement category, searching the synonym of the entity word in the preset synonym dictionary;
determining whether a synonym relationship is forbidden between the entity word and the synonym of the entity word;
and if the synonym relation between the entity word and the synonym of the entity word is not forbidden, selecting the synonym from the preset synonym dictionary as a replacement word according to the entity word category replacement probability so as to replace the entity word with the replacement word.
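The synonym-replacement branch above can be sketched as follows. The data layout (parallel token/tag lists), the `forbidden` pair set, and all names are illustrative assumptions, not the patent's data structures.

```python
import random

def synonym_replace(tokens, entity_tags, syn_dict, replace_classes,
                    p_replace, forbidden=frozenset(), rng=None):
    """Replace entity words with same-class synonyms (illustrative sketch).

    tokens / entity_tags: parallel lists, e.g. ["Beijing", "is"], ["LOC", "O"].
    syn_dict: entity class -> words treated as mutual synonyms (the preset
              synonym dictionary).
    replace_classes: entity classes eligible for replacement.
    p_replace: per-word replacement probability for eligible classes.
    forbidden: (word, synonym) pairs whose synonym relation is prohibited.
    """
    rng = rng or random.Random()
    out = []
    for word, tag in zip(tokens, entity_tags):
        if tag in replace_classes and rng.random() < p_replace:
            # look up same-class synonyms, skipping forbidden relations
            candidates = [s for s in syn_dict.get(tag, [])
                          if s != word and (word, s) not in forbidden]
            if candidates:
                word = rng.choice(candidates)
        out.append(word)
    return out
```

Note the design choice that a word whose only synonyms are forbidden is simply kept, matching the claim's "if the synonym relation is not forbidden" guard.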
Further, the data enhancement method further includes a random replacement method, a random deletion method, a random exchange method, and a long sentence construction method, and after the synonym replacement is performed on the entity words in the original training data, the method further includes:
in the optimized parameter list, determining the random replacement probability of the random replacement method and determining the random deletion probability of the random deletion method;
determining the random exchange probability of the random exchange method, and determining the sentence length set by the long sentence construction method;
performing entity word replacement on each sentence in the original training data according to the random replacement probability, and performing same-sentence entity word position exchange on each sentence in the original training data according to the random exchange probability;
performing entity word deletion on each sentence in the original training data according to the random deletion probability to obtain processed data;
and performing splicing processing on the sentences in the processed data so that the length of each processed sentence equals the set sentence length.
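The randomized steps listed above (same-sentence exchange, random deletion, and long-sentence construction by splicing) can be sketched together in one pass. The splice rule, probability handling, and all names are illustrative assumptions:

```python
import random

def augment_sentences(sentences, p_swap, p_delete, target_len, rng=None):
    """Apply same-sentence position exchange, random deletion, and
    long-sentence construction (illustrative sketch of the listed steps).

    sentences: list of token lists.
    p_swap: probability of exchanging two random positions in a sentence.
    p_delete: per-token deletion probability (never empties a sentence).
    target_len: token length up to which sentences are spliced together.
    """
    rng = rng or random.Random()
    processed = []
    for sent in sentences:
        sent = list(sent)
        # random same-sentence position exchange
        if len(sent) > 1 and rng.random() < p_swap:
            i, j = rng.sample(range(len(sent)), 2)
            sent[i], sent[j] = sent[j], sent[i]
        # random deletion, keeping at least one token per sentence
        kept = [w for w in sent if rng.random() >= p_delete]
        processed.append(kept or [rng.choice(sent)])
    # long-sentence construction: splice consecutive sentences up to target_len
    spliced, buf = [], []
    for sent in processed:
        buf.extend(sent)
        if len(buf) >= target_len:
            spliced.append(buf)
            buf = []
    if buf:
        spliced.append(buf)
    return spliced
```

With the probabilities set to zero the function reduces to pure splicing, which makes the long-sentence step easy to inspect in isolation.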
Further, the determining whether a convergence model exists in the plurality of recognition models includes:
determining the highest recognition score of the plurality of recognition models for recognizing each word in the test set;
determining whether the highest recognition score satisfies the convergence condition;
if the highest recognition score meets the convergence condition, determining that a convergence model meeting the convergence condition exists in the plurality of recognition models, wherein the recognition model corresponding to the highest recognition score is the convergence model;
and if the highest recognition score does not meet the convergence condition, determining that no convergence model meeting the convergence condition exists in the plurality of recognition models.
Further, the determining whether the highest recognition score satisfies the convergence condition includes:
determining a convergence parameter configured by a user;
determining a first highest recognition score of the plurality of recognition models that recognizes a t-th word in the test set;
determining a second highest recognition score of the plurality of recognition models that recognizes a t-1 st word in the test set;
subtracting the second highest recognition score from the first highest recognition score to obtain a highest recognition score difference;
determining whether a ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter;
if the ratio of the highest recognition score difference to the second highest recognition score is smaller than the convergence parameter, determining that the highest recognition score meets the convergence condition;
and if the ratio of the highest recognition score difference to the second highest recognition score is not smaller than the convergence parameter, determining that the highest recognition score does not meet the convergence condition.
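The convergence condition spelled out above is a relative-improvement test: the difference between the current and previous highest recognition scores, divided by the previous score, must fall below the user-configured convergence parameter. A sketch follows (the patent phrases this per word t of the test set; writing it over two successive best scores, with illustrative names, is an assumption):

```python
def satisfies_convergence(prev_best, curr_best, epsilon):
    """True when the relative improvement of the current highest
    recognition score over the previous one is below epsilon."""
    return (curr_best - prev_best) / prev_best < epsilon
```

For example, a rise from 0.80 to 0.804 is a 0.5% relative gain, so with a convergence parameter of 0.01 the condition is met; a rise to 0.90 (12.5%) is not yet converged.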
A deep learning model-based data enhancement device, comprising:
the acquisition module is used for acquiring the original training data and the original test data which are marked manually and acquiring an original parameter list, wherein the original parameter list is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method;
the initialization module is used for randomly initializing the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm so as to obtain a plurality of optimized parameter lists;
the conversion module is used for converting the original training data by utilizing each optimized parameter list to obtain corresponding artificial construction data, and mixing the original training data with the corresponding artificial construction data to obtain a plurality of training sets;
the test module is used for respectively training to obtain a plurality of recognition models by utilizing the plurality of training sets, and testing the plurality of recognition models by taking the original test data as a test set so as to determine whether a model meeting convergence conditions exists in the plurality of recognition models;
the output module is used for outputting, if a model satisfying the convergence condition exists among the plurality of recognition models, the optimized parameter list corresponding to that model as a target data enhancement parameter list;
and the enhancement module is used for carrying out data enhancement on the original training data by utilizing the target data enhancement parameter list so as to obtain a training set of the named entity recognition model.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the deep learning model based data enhancement method described above when the computer program is executed.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the deep learning model-based data enhancement method described above.
In one scheme provided by the data enhancement method, device, equipment and medium based on a deep learning model, manually labeled original training data and original test data are obtained together with an original parameter list composed of data enhancement methods and their corresponding enhancement parameters; the enhancement parameters in the original parameter list are randomly initialized according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists; each optimized parameter list is used to convert the original training data into corresponding artificially constructed data; the original training data is mixed with the corresponding artificially constructed data to obtain a plurality of training sets; a plurality of recognition models are trained on these training sets and tested with the original test data as the test set to determine whether any of them satisfies a convergence condition; and if such a model exists, the optimized parameter list corresponding to it is output as the target data enhancement parameter list, with which the original training data is enhanced to obtain a training set for entity recognition. In the invention, the enhancement parameters in the original parameter list are randomly initialized by an artificial fish swarm algorithm suited to the coexistence of discrete and continuous values, and the recognition effect of the recognition model is incorporated as the optimization target when formulating the data enhancement strategy, so a data enhancement list with a better effect is obtained at lower cost and the data enhancement effect of the list on the data is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an application environment of a data enhancement method based on a deep learning model according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data enhancement method based on a deep learning model according to an embodiment of the invention;
FIG. 3 is a schematic flow chart of a data enhancement method based on a deep learning model according to an embodiment of the present invention;
FIG. 4 is a flowchart showing an implementation of step S30 in FIG. 2;
FIG. 5 is a flowchart showing an implementation of step S33 in FIG. 4;
FIG. 6 is a flowchart of another implementation of step S30 in FIG. 2;
FIG. 7 is a flowchart showing an implementation of step S50 in FIG. 2;
FIG. 8 is a flowchart showing an implementation of step S52 in FIG. 7;
FIG. 9 is a schematic diagram of a data enhancement device based on a deep learning model according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The data enhancement method based on the deep learning model provided by the embodiment of the invention can be applied to the application environment shown in fig. 1, in which a terminal device communicates with a server through a network. The server obtains the manually labeled original training data and original test data sent by the user through the terminal device, along with an original parameter list composed of data enhancement methods and their corresponding enhancement parameters. The enhancement parameters in the original parameter list are randomly initialized according to an artificial fish swarm algorithm, which suits the coexistence of discrete and continuous values, to obtain a plurality of optimized parameter lists; each optimized parameter list is used to convert the original training data into corresponding artificially constructed data; the original training data is mixed with the corresponding artificially constructed data to obtain a plurality of training sets; a plurality of recognition models are trained on these training sets and tested with the original test data as the test set to determine whether any of them satisfies a convergence condition. If such a model exists, its optimized parameter list is output as the target data enhancement parameter list, and the original training data is enhanced with it to obtain the training set of the named entity recognition model, thereby realizing artificial-intelligence-based training data enhancement and named entity recognition.
The data used or produced by the above data enhancement method based on the deep learning model is stored in a database of the server; in this embodiment the database is deployed in a blockchain network and stores the data used and generated by the method, such as the original training data, original test data, original parameter list, artificially constructed data, optimized parameter lists, the plurality of recognition models, and the like. The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked by cryptographic means, each block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like. Deploying the database in a blockchain can improve the security of data storage.
The terminal device may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a data enhancement method based on a deep learning model is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
s10: the method comprises the steps of obtaining original training data and original test data which are marked manually, and obtaining an original parameter list, wherein the original parameter list is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method.
It can be understood that the original parameter list in this embodiment is a data enhancement model, where the data enhancement model is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method, and the enhancement performance of the data enhancement model on data depends on the enhancement parameters corresponding to the data enhancement method and the data enhancement method in the model, so before the data enhancement model is utilized, parameter optimization needs to be performed on the existing data enhancement model to improve the enhancement performance of the data enhancement model on training data, so as to ensure the recognition accuracy of the named entity recognition model obtained by subsequent training data training.
To perform parameter optimization on an existing data enhancement model, the existing data enhancement model, that is, the original parameter list of the model, is acquired, together with the manually labeled original training data and original test data.
S20: and randomly initializing the enhancement parameters in the original parameter list according to the artificial fish swarm algorithm to obtain a plurality of optimized parameter lists.
After the original training data and the original testing data which are marked manually are obtained and an original parameter list is obtained, the enhancement parameters in the original parameter list are randomly initialized by taking an artificial fish swarm algorithm which has high convergence speed and is suitable for the coexistence situation of discrete values and continuous values as a framework, so that a plurality of optimized parameter lists are obtained.
S30: and converting the original training data by utilizing each optimized parameter list to obtain corresponding artificial construction data, and mixing the original training data with the corresponding artificial construction data to obtain a plurality of training sets.
After the plurality of optimized parameter lists is obtained, each optimized parameter list is used to convert the original training data into corresponding artificially constructed data, and the original training data and the corresponding artificially constructed data are randomly scrambled and mixed to obtain a plurality of training sets.
For example, after L optimized parameter lists are obtained, each is used to convert the original training data, yielding L copies of artificially constructed data, one copy per optimized parameter list; each copy is then mixed with the original training data, giving L training sets.
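The L-in, L-out mixing just described can be sketched as follows (the shuffle-based mixing and all names are illustrative assumptions):

```python
import random

def build_training_sets(original_data, constructed_copies, rng=None):
    """Mix the original training data with each artificially constructed
    copy and randomly scramble the result, yielding one training set per
    optimized parameter list (L copies in, L training sets out)."""
    rng = rng or random.Random()
    training_sets = []
    for constructed in constructed_copies:
        mixed = list(original_data) + list(constructed)
        rng.shuffle(mixed)  # random scrambling, as described above
        training_sets.append(mixed)
    return training_sets
```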
S40: and respectively training by utilizing the plurality of training sets to obtain a plurality of recognition models, and testing the plurality of recognition models by taking the original test data as a test set.
After a plurality of training sets are obtained, a plurality of recognition models are respectively obtained by training the plurality of training sets, the original test data are used as test sets, the plurality of recognition models are tested by the test sets, and the recognition effect (recognition score) of each recognition model on each entity word in the test sets is obtained and is used as a test result.
S50: and determining whether a model meeting the convergence condition exists in the plurality of identification models according to the test result.
After testing a plurality of recognition models by taking the original test data as a test set, determining whether a model meeting convergence conditions exists in the plurality of recognition models according to the recognition effect of each recognition model on each entity word in the test set, namely a test result. Wherein the plurality of recognition models may be conventional entity recognition models.
S60: and if a model satisfying the convergence condition exists among the plurality of recognition models, outputting the optimized parameter list corresponding to that model as a target data enhancement parameter list.
After determining whether a model satisfying the convergence condition exists among the plurality of recognition models: if such a model exists, the recognition effect of that model meets the user's requirement, so the training set used by it also meets the requirement; the optimized parameter list corresponding to that training set is therefore determined to be a data enhancement list satisfying the data enhancement requirement, and it is output as the target data enhancement parameter list.
S70: and carrying out data enhancement on the original training data by utilizing the target data enhancement parameter list so as to obtain a training set of the named entity recognition model.
After the corresponding optimized parameter list is output as the target data enhancement parameter list, data enhancement is performed on the original training data with the target data enhancement parameter list. The enhanced data and the original training data are then mixed and randomly shuffled to obtain the training set of the named entity recognition model, so that a more accurate named entity recognition model can be trained and its recognition precision ensured.
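As a concrete illustration of this mixing step, the sketch below combines enhanced data with the original training data and shuffles the result; the function name and toy sentences are hypothetical, not from the patent.

```python
import random

def build_training_set(original, enhanced, rng):
    # Mix the enhanced data with the original training data,
    # then shuffle the combined list randomly.
    mixed = list(original) + list(enhanced)
    rng.shuffle(mixed)
    return mixed

rng = random.Random(0)
train = build_training_set(["s1", "s2"], ["s1'", "s2'"], rng)
print(sorted(train))  # the shuffled set still contains all four items
```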
It should be understood that the artificial fish swarm algorithm is a particle swarm optimization algorithm that treats each particle as a fish trying to reach the position of highest food concentration in a water area, thereby improving its own living state. In this embodiment, the particles (artificial fish) are the randomly initialized enhancement parameters in the original parameter list, the food concentration is the cost function or loss function of the recognition model, and the swimming of the artificial fish during the run of the algorithm is the process by which the enhancement parameters in the original parameter list gradually approach the optimal position, where the value of the cost function or loss function approaches its minimum.
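A minimal sketch of this analogy, with a toy objective standing in for the recognition model's loss (all names and the objective are assumptions for illustration): each artificial fish is a candidate parameter list, the food concentration is the negated loss, and the school drifts toward the best-scoring position.

```python
import random

def food_concentration(params):
    # Stand-in for the negative loss of a recognition model trained
    # with these enhancement parameters (hypothetical objective:
    # the optimum is at 0.5 for every parameter).
    return -sum((p - 0.5) ** 2 for p in params)

def init_fish():
    # One artificial fish = one randomly initialized parameter list
    # (here: five continuous enhancement parameters in [0, 1]).
    return [random.random() for _ in range(5)]

def swim(fish, best, step=0.1):
    # Each fish takes a small step toward the best-scoring position,
    # plus a little random wandering.
    return [f + step * (b - f) + random.uniform(-0.02, 0.02)
            for f, b in zip(fish, best)]

random.seed(0)
school = [init_fish() for _ in range(10)]
for _ in range(50):
    best = max(school, key=food_concentration)
    school = [swim(fish, best) for fish in school]

best = max(school, key=food_concentration)
print([round(p, 2) for p in best])
```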
The original parameter list formed by the data enhancement methods and the enhancement parameters corresponding to each data enhancement method can be as shown in Table 1:
TABLE 1
In this embodiment, as shown in Table 1, with the artificial fish swarm algorithm as the framework, the enhancement parameters β1 to β5 and the probability distribution p_syn are combined to form an original parameter list mixing continuous values and discrete values, where the original parameter list comprises the data enhancement methods and their corresponding enhancement parameters. The artificial fish swarm algorithm iteratively optimizes the enhancement parameters of the original parameter list to obtain an optimized parameter list; the original training data is processed with the optimized parameter list to obtain artificial construction data, and the artificial construction data and the original training data are mixed, so that a high-quality training set is obtained at low cost and the recognition accuracy of the named entity recognition model is ensured.
In this embodiment, manually annotated original training data and original test data are obtained, together with an original parameter list composed of data enhancement methods and their corresponding enhancement parameters. The enhancement parameters in the original parameter list are randomly initialized according to the artificial fish swarm algorithm to obtain a plurality of optimized parameter lists. Each optimized parameter list is used to convert the original training data into corresponding artificial construction data, and the original training data and the corresponding artificial construction data are mixed to obtain a plurality of training sets. A plurality of recognition models are trained on these training sets respectively and tested with the original test data as the test set, and whether a model satisfying the convergence condition exists among them is determined. If such a model exists, its corresponding optimized parameter list is output as the target data enhancement parameter list, and data enhancement is performed on the original training data with the target data enhancement parameter list to obtain the training set of the named entity recognition model. In the invention, the enhancement parameters in the original parameter list are randomly initialized by an artificial fish swarm algorithm suited to the coexistence of discrete and continuous values, and the recognition effect of the recognition model is fused into the formulation of the data enhancement strategy as the optimization target, so that an effective data enhancement list is obtained at low cost. The data diversity of the named entity recognition model's training set is thereby ensured, the scale of the training set is enlarged, and the recognition accuracy of the named entity recognition model is further improved.
In addition, because the enhancement parameters of each data enhancement method in the target data enhancement parameter list are obtained by automatic optimization, the method supports extension with new data enhancement methods, and different data enhancement lists can be obtained according to user requirements, so that more model training data can be constructed and the accuracy of the model further ensured.
In an embodiment, the data enhancement method includes a synonym replacement method, as shown in fig. 3, after step S50, that is, after determining, according to a test result, whether there is a model satisfying a convergence condition in the plurality of recognition models, the method further specifically includes the following steps:
s80: if the multiple recognition models do not have models meeting the convergence condition, randomly initializing the enhancement parameters in the original parameter list according to the artificial fish swarm algorithm again to obtain a plurality of optimized parameter lists after random initialization, and counting.
After determining whether a model satisfying the convergence condition exists among the plurality of recognition models: if no such model exists, the recognition effect of the plurality of recognition models does not meet the user requirement, and the enhancement effect of the randomly optimized parameter lists on the original training data is insufficient. At this time, the enhancement parameters in the original parameter list are randomly initialized again according to the artificial fish swarm algorithm, a plurality of recognition models are trained according to the randomly initialized optimized parameter lists, and the plurality of recognition models are tested, until a target data enhancement parameter list is obtained. Meanwhile, each time the enhancement parameters in the original parameter list are randomly initialized again according to the artificial fish swarm algorithm, the number of times the enhancement parameters have been repeatedly randomly initialized is recorded and counted.
S90: and determining whether the number of times of random initialization of the enhancement parameters of the original parameter list is smaller than a preset number of times.
S100: and if the number of times of random initialization of the enhancement parameters of the original parameter list is not less than the preset number of times, stopping random initialization of the enhancement parameters in the original parameter list.
After determining whether the number of random initializations of the enhancement parameters of the original parameter list is smaller than the preset number: if it is not smaller than the preset number, the number of iterations is excessive. To reduce the computational load, the random initialization of the enhancement parameters in the original parameter list is stopped. At this point, the optimized parameter list corresponding to the model closest to the convergence condition may be output as the target data enhancement parameter list, and data enhancement is then performed on the original training data with the target data enhancement parameter list to obtain the training set of the named entity recognition model.
S110: if the number of random initializations of the enhancement parameters of the original parameter list is less than the preset number, the steps S30-S70 are repeatedly performed.
After determining whether the number of random initializations of the enhancement parameters of the original parameter list is smaller than the preset number: if it is smaller than the preset number and the target data enhancement parameter list has not yet been determined, steps S30-S70 need to be repeated. That is, a plurality of new recognition models are retrained according to the randomly initialized optimized parameter lists and tested, until the target data enhancement parameter list is obtained and used to obtain the training set of the named entity recognition model.
In this embodiment, after determining whether a model satisfying the convergence condition exists among the plurality of recognition models, if no such model exists, the enhancement parameters in the original parameter list are randomly initialized again according to the artificial fish swarm algorithm to obtain a plurality of randomly initialized optimized parameter lists, and the count is updated. Whether the number of random initializations of the enhancement parameters of the original parameter list is smaller than the preset number is then determined, and if so, steps S30-S70 are repeated. This defines the operation to be executed when no recognition model converges: the artificial fish swarm algorithm optimizes the parameters of the original parameter list over multiple rounds, with the recognition effect of the plurality of recognition models as the target, until an optimized parameter list satisfactory to the user is obtained. The parameter performance of the optimized parameter list, and hence the enhancement effect on the data, is thereby ensured.
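The retry logic of steps S80 to S110 can be sketched as the following loop, in which `train_and_test` is a hypothetical stand-in for training a recognition model on data enhanced with a candidate parameter list and scoring it on the test set; all names and the toy objective are illustrative assumptions.

```python
import random

def random_init_params():
    # Randomly initialize the enhancement parameters of the original
    # parameter list (hypothetical: four probabilities in [0, 1]).
    return [random.random() for _ in range(4)]

def train_and_test(params):
    # Stand-in for training a recognition model on data enhanced
    # with `params` and returning its test score (toy objective).
    return 1.0 - sum(abs(p - 0.5) for p in params) / 4

def search_target_list(threshold=0.95, max_inits=100, swarm_size=8):
    best_params, best_score = None, float("-inf")
    for _ in range(max_inits):                   # preset number of times
        candidates = [random_init_params() for _ in range(swarm_size)]
        for params in candidates:
            score = train_and_test(params)
            if score > best_score:
                best_params, best_score = params, score
        if best_score >= threshold:              # a model converged
            return best_params, best_score
    # Iteration budget exhausted: output the parameter list
    # closest to the convergence condition instead.
    return best_params, best_score

random.seed(1)
params, score = search_target_list()
print(round(score, 3))
```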
In one embodiment, the data enhancement method includes a synonym replacement method, as shown in fig. 4, in step S30, that is, converting the original training data by using each optimized parameter list, specifically including the following steps:
S31: and determining enhancement parameters corresponding to the synonym replacement method in the optimization parameter list, wherein the enhancement parameters corresponding to the synonym replacement method comprise entity word class replacement probability and entity word replacement class.
In this embodiment, the data enhancement method in the optimized parameter list includes a synonym replacement method, and enhancement parameters corresponding to the synonym replacement method are determined in the optimized parameter list, where the enhancement parameters corresponding to the synonym replacement method include an entity word class replacement probability and an entity word replacement class.
S32: acquiring a preset synonym dictionary pre-constructed by the user according to requirements, wherein entity words of the same entity category whose synonym relationship is not prohibited are used as synonyms of each other in the preset synonym dictionary.
Before the original training data is converted with the data enhancement methods and corresponding enhancement parameters in each optimized parameter list, a preset synonym dictionary must be obtained as the source of entity words for the converted original training data. The preset synonym dictionary is a dictionary pre-constructed by the user according to requirements and comprises entity words of different entity categories. In the preset synonym dictionary, entity words of the same entity category are treated as synonyms of each other, the synonym relationship between specific entity words can be prohibited, and entity words whose synonym relationship is prohibited cannot be used as synonyms of each other.
In this embodiment, the entity word scale of the preset synonym dictionary is increased by loosening the judgment condition for synonyms: entity words of the same entity category are treated as synonyms. That is, if the meaning and grammar of the new sentence obtained by replacing word A in a sentence with word B are still reasonable, then word B and word A belong to the same entity category and word B is a synonym of word A; the sets of entity words of the same category then form the preset synonym dictionary. For example, in the sentence "Sun Wukong is pinned under the Five Elements Mountain", "Sun Wukong" may be replaced by names such as "the Buddha" or "the Bull Demon King", so "Sun Wukong", "the Buddha" and "the Bull Demon King" are synonyms of each other.
In this embodiment, the quality of the preset synonym dictionary is improved by prohibiting the synonym relationship between specific words. In everyday usage, some entity words belong to the same entity category, yet substituting one for the other as a synonym makes the sentence ungrammatical; the synonym relationship between the two is then prohibited, i.e., the two entity words are not synonyms. For example, in "Sun Wukong is pinned under the Five Elements Mountain", if "the Five Elements Mountain" is replaced by "the Yellow River", the resulting "Sun Wukong is pinned under the Yellow River" is an ill-formed sentence. The synonym relationship between "the Five Elements Mountain" and "the Yellow River" is therefore prohibited in the preset synonym dictionary, and the two cannot be substituted for each other during synonym replacement.
In this embodiment, the description above uses "Sun Wukong is pinned under the Five Elements Mountain" as the sentence and "the Buddha", "the Bull Demon King" and "the Yellow River" as entity words to explain synonyms; this is merely exemplary, and in other embodiments other sentences and entity words may be used as examples.
The synonyms in the preset synonym dictionary may be stored in the form of Table 2. Table 2 has four columns: the first column is a serial number; the second and third columns are different words (word A and word B); the fourth column is the relation between word A and word B, where if word B can replace word A, the two are synonyms of each other, and if word B cannot replace word A, the two are not synonyms of each other. The content of the preset synonym dictionary is shown in Table 2 below:
TABLE 2
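Although the table contents are not reproduced here, such a dictionary might be held in memory as per-category word lists plus an explicit set of prohibited pairs; the sketch below is a hypothetical representation using the patent's Journey-to-the-West examples, not the actual dictionary format.

```python
# Hypothetical in-memory form of the preset synonym dictionary:
# entity words grouped by category, plus explicitly prohibited pairs.
CATEGORY_WORDS = {
    "person": ["Sun Wukong", "the Buddha", "Bull Demon King"],
    "place": ["Five Elements Mountain", "the Yellow River"],
}
# Pairs whose synonym relationship is prohibited (order-insensitive).
PROHIBITED = {frozenset(["Five Elements Mountain", "the Yellow River"])}

def synonyms(word):
    """Return allowed synonyms: same-category words whose relation
    to `word` is not prohibited."""
    for words in CATEGORY_WORDS.values():
        if word in words:
            return [w for w in words
                    if w != word
                    and frozenset([word, w]) not in PROHIBITED]
    return []

print(synonyms("Sun Wukong"))           # ['the Buddha', 'Bull Demon King']
print(synonyms("Five Elements Mountain"))  # []
```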
S33: and carrying out synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
After the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category are obtained, synonym replacement is performed on the entity words in the original training data according to them, yielding data after synonym replacement; this data is then processed by the other data enhancement methods and corresponding enhancement parameters in the optimized parameter list to obtain the artificial construction data. The entity word category replacement probability is the replacement probability for each entity word replacement category: in the optimized parameter list, the probability distribution over the entity word categories to be replaced is p_syn = [p_syn,1, p_syn,2, ..., p_syn,K], and entity words of the k-th category in the original training data are replaced with synonyms from the preset synonym dictionary with probability p_syn,k.
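A hedged sketch of this sampling step, with a toy dictionary and categories (all names are illustrative assumptions): each entity word of category k is replaced with a random synonym with probability p_syn,k.

```python
import random

# Hypothetical category probabilities p_syn: one replacement
# probability per entity-word category, as in the optimized list.
P_SYN = {"person": 0.30, "place": 0.60, "org": 0.10}
SYNONYMS = {"Alice": ["Bob"], "Paris": ["London"]}  # toy dictionary
CATEGORY = {"Alice": "person", "Paris": "place"}

def replace_entities(tokens, rng):
    out = []
    for tok in tokens:
        cat = CATEGORY.get(tok)
        if cat in P_SYN and SYNONYMS.get(tok) and rng.random() < P_SYN[cat]:
            out.append(rng.choice(SYNONYMS[tok]))  # replace with prob p_syn,k
        else:
            out.append(tok)                        # keep the original token
    return out

rng = random.Random(42)
print(replace_entities(["Alice", "visited", "Paris"], rng))
```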
In this embodiment, the enhancement parameters corresponding to the synonym replacement method, namely the entity word category replacement probability and the entity word replacement category, are determined in the optimized parameter list; a preset synonym dictionary pre-constructed by the user according to requirements is obtained, in which entity words of the same entity category whose synonym relationship is not prohibited are synonyms of each other; and synonym replacement is performed on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category. The step of converting the original training data with each optimized parameter list is thus refined. Loosening the judgment condition for entity word synonyms enlarges the preset synonym dictionary and improves the diversity of the artificial construction data, while the mechanism of prohibiting synonym relationships continuously improves the quality of the preset synonym dictionary, thereby ensuring the quality of the artificial construction data.
In one embodiment, as shown in fig. 5, in step S33, the synonym replacement is performed on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category, which specifically includes the following steps:
S331: and determining whether the category of each entity word in the original training data belongs to the entity word replacement category.
After the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category are obtained, the category of each entity word in the original training data needs to be determined, in order to decide whether each entity word in the original training data belongs to the entity word replacement category.
S332: if the category of the entity word in the original training data belongs to the entity word replacement category, searching the synonym of the entity word in a preset synonym dictionary.
After determining whether each entity word in the original training data belongs to the entity word replacement category: if the category of an entity word belongs to the entity word replacement category, synonym replacement is required for that entity word, so all synonyms of the entity word in the preset synonym dictionary need to be found for the subsequent replacement.
S333: it is determined whether a synonym relationship is prohibited between entity words and synonyms of entity words.
After the synonyms of the entity word are found in the preset synonym dictionary, whether the synonym relationship between the entity word and each synonym is prohibited is determined.
S334: and if the synonym relation between the entity words and the synonyms of the entity words is not forbidden, selecting a synonym from a preset synonym dictionary as a replacement word according to the entity word category replacement probability so as to replace the entity words with the replacement word.
After determining whether the synonym relationship is prohibited: if the synonym relationship between the entity word and a synonym is not prohibited, the entity word is replaced with the corresponding synonym according to the entity word category replacement probability.
S335: and if the synonym relationship between the entity words and the synonym of the entity words is forbidden, the synonym is not used as a replacement word of the entity words.
After determining whether the synonym relationship between the entity word and the corresponding synonym is prohibited: if it is prohibited, the synonym is skipped, that is, it is not used as a replacement word for the entity word.
For example, suppose the entity word replacement categories are the three classes person name, place name and organization name, and the entity word category replacement probability is p_syn = [0.30, 0.60, 0.10]; that is, under the synonym replacement method, a person name in the original training data is replaced with probability 0.30, a place name with probability 0.60, and an organization name with probability 0.10. If no synonym of a person name in the preset synonym dictionary has a prohibited synonym relationship, each person name in a sentence of the original training data has a 30% probability of being replaced with one of its synonyms from the preset synonym dictionary; if the synonym relationship with a particular synonym is prohibited, that synonym is skipped and the person name is replaced with one of its other synonyms.
In this embodiment, the entity word replacement categories being person name, place name and organization name and the entity word category replacement probability being p_syn = [0.30, 0.60, 0.10] are merely exemplary; in other embodiments, the entity word replacement categories and the entity word category replacement probabilities may be otherwise.
S336: and if the category of each entity word in the original training data does not belong to the entity word replacement category, not performing synonym replacement.
After determining whether each entity word in the original training data belongs to the entity word replacement category: if the category of an entity word does not belong to the entity word replacement category, synonym replacement is not required for that entity word, and the other data enhancement methods in the optimized parameter list can be executed.
In this embodiment, whether the category of each entity word in the original training data belongs to the entity word replacement category is determined; if so, the synonyms of the entity word are looked up in the preset synonym dictionary, and whether the synonym relationship between the entity word and each synonym is prohibited is determined; if the relationship is not prohibited, the entity word is replaced with the synonym according to the entity word category replacement probability. The step of performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category is thus clarified, providing a basis for obtaining the artificial construction data.
In an embodiment, the data enhancement method further includes a random replacement method, a random deletion method, a random exchange method, and a long sentence construction method, as shown in fig. 6, after step S33, that is, after performing synonym replacement on entity words in the original training data, the method further specifically includes the following steps:
s34: and in the optimized parameter list, determining the random replacement probability of the random replacement method, and determining the random deletion probability of the random deletion method.
In this embodiment, the data enhancement method further includes a random replacement method and a random deletion method, where a random replacement probability of the random replacement method needs to be determined in the optimized parameter list, and a random deletion probability of the random deletion method needs to be determined, so that conversion processing is performed on the original training data according to the random replacement probability and the random deletion probability.
S35: and determining the random exchange probability of the random exchange method and determining the sentence length set by the long sentence construction method.
In this embodiment, the data enhancement method further includes a random exchange method and a long sentence constructing method, and in the optimized parameter list, it is further required to determine a random exchange probability of the random exchange method, and determine a sentence length set by the long sentence constructing method, so as to perform conversion processing on the original training data according to the random exchange probability and the sentence length set by the long sentence constructing method.
S36: performing entity word replacement on each sentence in the original training data according to the random replacement probability, and performing same-sentence entity word exchange on each sentence in the original training data according to the random exchange probability.
After the random replacement probability of the random replacement method and the random exchange probability of the random exchange method are determined, entity word replacement is performed on each sentence in the original training data according to the random replacement probability, and same-sentence entity word exchange is performed on each sentence in the original training data according to the random exchange probability.
For example, in the optimized parameter list, the random replacement probability of the random replacement method is β2 and the random exchange probability of the random exchange method is β3. Each token (entity word) of each sentence in the original training data is replaced, with probability β2, by any other token in a dictionary (which may be the preset synonym dictionary), where the rule for selecting the token from the dictionary is: obey a uniform random distribution, and exclude the other tokens to be randomly replaced in the original training data. Meanwhile, in each sentence of the original training data, the i-th token and the j-th token exchange positions with probability β3.
S37: and deleting entity words of each sentence in the original training data according to the random deletion probability to obtain the processing data.
After entity word replacement is performed on each sentence in the original training data according to the random replacement probability and same-sentence entity word exchange is performed according to the random exchange probability, entity word deletion is performed on each sentence in the original training data according to the random deletion probability, so as to obtain the processing data.
For example, in the original training data, each token of each sentence is first replaced with any other token in the dictionary with probability β2; then, in each sentence, the i-th token and the j-th token exchange positions with probability β3; finally, each token of each sentence is deleted with probability β4, so as to obtain the processing data.
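The three token-level operations can be sketched as follows, with hypothetical β values and a toy vocabulary; this is an illustrative reading of the steps, not the patent's exact procedure.

```python
import random

def augment_sentence(tokens, vocab, beta2, beta3, beta4, rng):
    # Random replacement: each token becomes a random vocab token
    # with probability beta2.
    tokens = [rng.choice(vocab) if rng.random() < beta2 else t
              for t in tokens]
    # Random exchange: each position i swaps with a random position j
    # with probability beta3.
    for i in range(len(tokens)):
        if rng.random() < beta3:
            j = rng.randrange(len(tokens))
            tokens[i], tokens[j] = tokens[j], tokens[i]
    # Random deletion: each token is dropped with probability beta4.
    return [t for t in tokens if rng.random() >= beta4]

rng = random.Random(7)
sent = ["the", "cat", "sat", "on", "the", "mat"]
out = augment_sentence(sent, vocab=["dog", "ran"], beta2=0.1,
                       beta3=0.1, beta4=0.1, rng=rng)
print(out)
```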
S38: performing splicing processing on the sentences in the processing data so that the length of each processed sentence equals the set sentence length.
After the processing data is obtained, the sentences in the processing data are spliced so that the length of each processed sentence equals the set sentence length.
For example, suppose the sentence length set by the long sentence construction method is 100. The sentence length of each sentence in the processing data is counted to obtain the 90th percentile of the sentence lengths; sentences whose length is less than or equal to the 90th percentile are paired in twos and spliced into longer spliced sentences (the order of the two sentences being random); and the part of any spliced sentence whose length exceeds 100 is deleted, so that the length of each sentence in the processing data is 100.
In this embodiment, the sentence length of 100 set by the long sentence construction method and the pairwise pairing and splicing of sentences whose length is less than or equal to the 90th percentile are merely exemplary; in other embodiments, the sentence length set by the long sentence construction method may take other values, and sentences at other percentiles of the sentence length may be paired and spliced, which is not repeated here.
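A possible sketch of the splicing step under these assumptions (90th percentile cutoff, random pair order, truncation to the set length); the sentence contents are toy data and the helper name is hypothetical.

```python
import random

def build_long_sentences(sentences, target_len, rng):
    # Pair sentences no longer than the 90th percentile of lengths,
    # splice each pair in random order, and truncate to target_len.
    lengths = sorted(len(s) for s in sentences)
    p90 = lengths[min(len(lengths) - 1, int(0.9 * len(lengths)))]
    short = [s for s in sentences if len(s) <= p90]
    rng.shuffle(short)
    spliced = []
    for a, b in zip(short[::2], short[1::2]):
        pair = [a, b] if rng.random() < 0.5 else [b, a]
        spliced.append((pair[0] + pair[1])[:target_len])
    return spliced

rng = random.Random(3)
sents = [list(range(n)) for n in (5, 8, 12, 20, 60)]
out = build_long_sentences(sents, target_len=30, rng=rng)
print([len(s) for s in out])
```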
In this embodiment, after synonym replacement is performed on the entity words in the original training data, the random replacement probability of the random replacement method, the random deletion probability of the random deletion method, the random exchange probability of the random exchange method and the sentence length set by the long sentence construction method are determined in the optimized parameter list. Entity word replacement is performed on each sentence in the original training data according to the random replacement probability, same-sentence entity word exchange is performed according to the random exchange probability, and entity word deletion is performed according to the random deletion probability, so as to obtain the processing data; each sentence in the processing data is then spliced so that the processed sentence length equals the set sentence length. The step of converting the original training data with each optimized parameter list is thus further refined: converting the original training data with a plurality of data enhancement methods further increases the diversity of the artificial construction data and ensures the accuracy of the recognition model training set.
In one embodiment, as shown in fig. 7, step S50, that is, determining whether a convergence model exists among the plurality of recognition models according to the test result, includes the following steps:
s51: the highest recognition score of the plurality of recognition models is determined for recognizing each word in the test set.
After testing the plurality of recognition models with the original test data as the test set, determining the highest recognition score for recognizing each word in the test set in the plurality of recognition models.
The score of the recognition model for recognizing each word in the test set is determined by the following formula:

score_t = (2 × precision × recall) / (precision + recall)

wherein score_t is the score of the recognition model for recognizing the t-th word in the test set, recall is the recall rate of the entity word, and precision is the precision with which the recognition model recalls the entity word.
For example, suppose the number of recognition models is 3. After the three recognition models A, B and C are tested with the original test data as the test set, their recognition scores for the t-th word in the test set are 0.6, 0.8 and 0.9 respectively, so the highest recognition score for recognizing the t-th word among the three models is 0.9.
In this embodiment, the number of recognition models being 3 and the recognition scores for the t-th word being 0.6, 0.8 and 0.9 are only illustrative; in other embodiments, the number of recognition models and the recognition scores may take other values, which are not limited herein.
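The per-word scoring and element-wise maximum of step S51 can be sketched as follows. Combining recall and precision as an F1-style value is an assumption here, and the function names are illustrative:

```python
def word_score(recall, precision):
    """F1-style score combining entity-word recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def highest_scores(per_model_scores):
    """per_model_scores: one list of per-word scores per recognition model.
    Returns, for each word in the test set, the best score any model achieved."""
    return [max(word_scores) for word_scores in zip(*per_model_scores)]
```

With three models scoring the t-th word 0.6, 0.8 and 0.9 as in the example above, `highest_scores` returns 0.9 for that word.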
S52: it is determined whether the highest recognition score satisfies a convergence condition.
After determining the highest recognition score of the plurality of recognition models for recognizing each word in the test set, determining whether the highest recognition score of the plurality of recognition models for recognizing each word in the test set meets a convergence condition.
S53: if the highest recognition score meets the convergence condition, determining that a convergence model meeting the convergence condition exists in the plurality of recognition models, wherein the recognition model corresponding to the highest recognition score is the convergence model.
After determining whether the highest recognition score meets the convergence condition, if the highest recognition score meets the convergence condition, which means that the recognition effect of the existing recognition models meets the requirement, determining that a convergence model meeting the convergence condition exists in the plurality of recognition models, wherein the recognition model corresponding to the highest recognition score is the convergence model, and the optimization parameter list corresponding to the convergence model can be used as the target data enhancement parameter list.
S54: and if the highest recognition score does not meet the convergence condition, determining that a convergence model meeting the convergence condition does not exist in the plurality of recognition models.
After determining whether the highest recognition score meets the convergence condition, if the highest recognition score does not meet the convergence condition, which means that the recognition effect of the recognition models does not yet meet the requirement, it is determined that no convergence model satisfying the convergence condition exists in the plurality of recognition models, and iterative optimization is performed again by using the artificial fish swarm algorithm.
In this embodiment, the highest recognition score for recognizing each word in the test set is determined among the plurality of recognition models, and it is determined whether the highest recognition score satisfies the convergence condition. If the highest recognition score satisfies the convergence condition, it is determined that a convergence model satisfying the convergence condition exists in the plurality of recognition models; if not, it is determined that no such convergence model exists. This refines the judgment process for determining whether a convergence model exists in the plurality of recognition models: the recognition effect of the recognition models on the test set serves as the food concentration (fitness) of the artificial fish swarm algorithm, that is, as the objective of the data enhancement parameter optimization, so that a data enhancement strategy with a better effect is obtained at a smaller cost.
In one embodiment, as shown in fig. 8, step S52, that is, determining whether the highest recognition score meets the convergence condition, specifically includes the following steps:
S521: A convergence parameter configured by a user is determined.
S522: A first highest recognition score for recognizing the t-th word in the test set is determined among the plurality of recognition models.
S523: A second highest recognition score for recognizing the (t-1)-th word in the test set is determined among the plurality of recognition models.
S524: The second highest recognition score is subtracted from the first highest recognition score to obtain a highest recognition score difference.
S525: It is determined whether the ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter.
S526: If the ratio of the highest recognition score difference to the second highest recognition score is smaller than the convergence parameter, it is determined that the highest recognition score meets the convergence condition.
S527: If the ratio of the highest recognition score difference to the second highest recognition score is not smaller than the convergence parameter, it is determined that the highest recognition score does not meet the convergence condition.
After determining the highest recognition score of the plurality of recognition models for recognizing each word in the test set, determining whether the highest recognition score of the plurality of recognition models satisfies a convergence condition by the following formula:
(maxscore_t − maxscore_(t-1)) / maxscore_(t-1) < α

wherein maxscore_t is the highest recognition score of the plurality of recognition models for the t-th word in the test set, i.e. the first highest recognition score; maxscore_(t-1) is the highest recognition score of the plurality of recognition models for the (t-1)-th word in the test set, i.e. the second highest recognition score; and α is the convergence parameter configured by the user (which may be, for example, 0.01).
In the above formula, the second highest recognition score maxscore_(t-1) is subtracted from the first highest recognition score maxscore_t to obtain the highest recognition score difference maxscore_t − maxscore_(t-1), which is then divided by the second highest recognition score. If the resulting ratio is smaller than the convergence parameter α, it is determined that the highest recognition score meets the convergence condition; if the ratio is not smaller than α, it is determined that the highest recognition score does not meet the convergence condition.
In this embodiment, the convergence parameter configured by the user is determined; a first highest recognition score for recognizing the t-th word in the test set and a second highest recognition score for recognizing the (t-1)-th word in the test set are determined among the plurality of recognition models; the second highest recognition score is subtracted from the first highest recognition score to obtain the highest recognition score difference; and it is determined whether the ratio of the highest recognition score difference to the second highest recognition score is smaller than the convergence parameter. If the ratio is smaller than the convergence parameter, it is determined that the highest recognition score meets the convergence condition; if not, it is determined that the highest recognition score does not meet the convergence condition. Determining whether the highest recognition score meets the convergence condition in this way provides a judgment basis for determining whether a model has converged.
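Steps S521 to S527 reduce to a single relative-improvement check, sketched below; the function name and the default value of alpha are illustrative:

```python
def meets_convergence(first_highest, second_highest, alpha=0.01):
    """True when the relative improvement of the highest recognition score
    falls below the user-configured convergence parameter alpha."""
    diff = first_highest - second_highest  # highest recognition score difference
    return diff / second_highest < alpha
```

For instance, an improvement from 0.899 to 0.900 is a relative gain of about 0.11%, below the 1% threshold, so the score is considered converged.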
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
In an embodiment, a data enhancement device based on a deep learning model is provided, where the data enhancement device based on the deep learning model corresponds to the data enhancement method based on the deep learning model in the foregoing embodiment one by one. As shown in fig. 9, the data enhancement device based on the deep learning model includes an acquisition module 901, an initialization module 902, a conversion module 903, a test module 904, an output module 905, and an enhancement module 906. The functional modules are described in detail as follows:
the acquisition module 901 is used for acquiring original training data and original test data which are marked manually, and acquiring an original parameter list, wherein the original parameter list is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method;
an initialization module 902, configured to randomly initialize the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm, so as to obtain a plurality of optimized parameter lists;
The conversion module 903 is configured to convert the original training data by using each of the optimized parameter lists to obtain corresponding artificial construction data, and mix the original training data with the corresponding artificial construction data to obtain a plurality of training sets;
the test module 904 is configured to respectively train to obtain a plurality of recognition models by using the plurality of training sets, and test the plurality of recognition models by using the original test data as a test set to determine whether a model satisfying a convergence condition exists in the plurality of recognition models;
the output module 905 is configured to output, if the plurality of recognition models includes a model that satisfies the convergence condition, an optimization parameter list corresponding to the model that satisfies the convergence condition as a target data enhancement parameter list;
and the enhancement module 906 is configured to perform data enhancement on the original training data by using the target data enhancement parameter list to obtain a training set of the named entity recognition model.
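The random initialization performed by the initialization module 902 can be sketched as follows, with each candidate optimized parameter list playing the role of one artificial fish. The parameter names and value ranges are assumptions for illustration; the patent does not fix them:

```python
import random

# Assumed bounds for the enhancement parameters; not specified by the patent.
PARAM_BOUNDS = {
    "entity_word_replace_prob": (0.0, 0.5),  # synonym replacement
    "random_replace_prob": (0.0, 0.3),
    "random_delete_prob": (0.0, 0.3),
    "random_exchange_prob": (0.0, 0.3),
    "target_sentence_len": (20, 120),        # long sentence construction
}

def init_parameter_lists(n_fish, bounds=PARAM_BOUNDS, seed=None):
    """Randomly initialize n_fish candidate optimized parameter lists."""
    rng = random.Random(seed)
    swarm = []
    for _ in range(n_fish):
        params = {}
        for name, (low, high) in bounds.items():
            if isinstance(low, int) and isinstance(high, int):
                params[name] = rng.randint(low, high)    # integer parameter
            else:
                params[name] = rng.uniform(low, high)    # probability parameter
        swarm.append(params)
    return swarm
```

Each parameter list in the swarm is then used to construct one training set and train one recognition model.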
Further, the data enhancing apparatus based on the deep learning model further includes a loop module 907, where after determining whether a model satisfying a convergence condition exists in the plurality of recognition models, the loop module 907 is specifically configured to:
If the models meeting the convergence condition do not exist in the plurality of identification models, randomly initializing the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm again to obtain a plurality of optimized parameter lists after random initialization, and counting;
determining whether the number of random initialization times of the enhancement parameters in the original parameter list is smaller than a preset number of times;
if the number of times of random initialization of the enhancement parameters in the original parameter list is not less than the preset number of times, stopping the random initialization of the enhancement parameters in the original parameter list;
if the number of times of random initialization of the enhancement parameters in the original parameter list is smaller than the preset number of times, training according to the optimized parameter list after random initialization to obtain a plurality of new identification models, testing the new identification models to obtain the target data enhancement parameter list, and obtaining a training set of the named entity identification models by utilizing the target data enhancement parameter list.
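The loop module's retry-with-a-cap behaviour can be sketched as follows, with `evaluate` standing in for the train-and-test step and `reinitialize` for the fish swarm re-initialization; both callables and the cap value are hypothetical placeholders:

```python
def search_with_retry_cap(evaluate, reinitialize, max_inits=10):
    """Re-initialize and re-evaluate candidate parameter lists until one
    converges or the preset number of random initializations is reached.
    evaluate(candidates) returns the converged parameter list, or None."""
    init_count = 0
    while init_count < max_inits:
        candidates = reinitialize()  # new randomly initialized parameter lists
        init_count += 1              # count this random initialization
        winner = evaluate(candidates)
        if winner is not None:
            return winner            # target data enhancement parameter list
    return None  # preset count reached: stop re-initializing
```

The counter guarantees termination even when no recognition model ever satisfies the convergence condition.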
Further, the data enhancement method includes a synonym replacement method, and the conversion module 903 is specifically configured to:
Determining enhancement parameters corresponding to the synonym substitution method in the optimization parameter list, wherein the enhancement parameters corresponding to the synonym substitution method comprise entity word class substitution probability and entity word substitution class;
acquiring a preset synonym dictionary which is pre-constructed by a user according to requirements, wherein entity words which are not prohibited from synonym relations in the same entity category are used as synonyms of each other in the preset synonym dictionary;
and carrying out synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
Further, the conversion module 903 is specifically further configured to:
determining whether the category of each entity word in the original training data belongs to the entity word replacement category;
if the category of the entity word in the original training data belongs to the entity word replacement category, searching the synonym of the entity word in the preset synonym dictionary;
determining whether a synonym relationship is forbidden between the entity word and the synonym of the entity word;
and if the synonym relation between the entity word and the synonym of the entity word is not forbidden, selecting the synonym from the preset synonym dictionary as a replacement word according to the entity word category replacement probability so as to replace the entity word with the replacement word.
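A minimal sketch of the lookup-and-replace flow above, assuming a toy synonym dictionary grouped by entity category and a set of forbidden word pairs; both structures are invented for illustration and stand in for the user-constructed preset synonym dictionary:

```python
import random

# Toy stand-ins for the preset synonym dictionary and forbidden relations.
SYNONYM_DICT = {
    "disease": {"flu": ["influenza", "grippe"], "influenza": ["flu"]},
}
FORBIDDEN_PAIRS = {("flu", "grippe")}  # synonym relation prohibited

def replace_entity_word(word, category, replace_categories, replace_prob,
                        rng=None):
    """Replace an entity word with a permitted synonym of the same category."""
    rng = rng or random.Random()
    if category not in replace_categories:
        return word  # category not among the entity word replacement classes
    synonyms = [s for s in SYNONYM_DICT.get(category, {}).get(word, [])
                if (word, s) not in FORBIDDEN_PAIRS]
    if synonyms and rng.random() < replace_prob:
        return rng.choice(synonyms)
    return word
```

Here "flu" in the "disease" category may be replaced by "influenza", while the forbidden pair ("flu", "grippe") is filtered out before sampling.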
Further, the data enhancement method further includes a random replacement method, a random deletion method, a random exchange method, and a long sentence construction method, and the conversion module 903 is specifically further configured to:
in the optimized parameter list, determining the random replacement probability of the random replacement method and determining the random deletion probability of the random deletion method;
determining the random exchange probability of the random exchange method, and determining the sentence length set by the long sentence construction method;
performing entity word replacement on each sentence in the original training data according to the random replacement probability, and performing same-sentence entity word exchange on each sentence in the original training data according to the random exchange probability;
performing entity word deletion on each sentence in the original training data according to the random deletion probability to obtain processing data;
and performing splicing processing on each sentence in the processing data so that the length of the processed sentence is the sentence length.
Further, the test module 904 is specifically configured to:
determining the highest recognition score of the plurality of recognition models for recognizing each word in the test set;
determining whether the highest recognition score satisfies the convergence condition;
If the highest recognition score meets the convergence condition, determining that a convergence model meeting the convergence condition exists in the plurality of recognition models, wherein the recognition model corresponding to the highest recognition score is the convergence model;
and if the highest recognition score does not meet the convergence condition, determining that no convergence model meeting the convergence condition exists in the plurality of recognition models.
Further, the test module 904 is specifically further configured to:
determining a convergence parameter configured by a user;
determining a first highest recognition score of the plurality of recognition models that recognizes a t-th word in the test set;
determining a second highest recognition score of the plurality of recognition models that recognizes a t-1 st word in the test set;
subtracting the second highest recognition score from the first highest recognition score to obtain a highest recognition score difference;
determining whether a ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter;
if the ratio of the highest recognition score difference to the second highest recognition score is smaller than the convergence parameter, determining that the highest recognition score meets the convergence condition;
and if the ratio of the highest recognition score difference to the second highest recognition score is not smaller than the convergence parameter, determining that the highest recognition score does not meet the convergence condition.
For specific limitations on the data enhancement device based on the deep learning model, reference may be made to the above limitation on the data enhancement method based on the deep learning model, and no further description is given here. The respective modules in the data enhancement device based on the deep learning model may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer equipment is used for storing related data used or produced by data enhancement methods such as original training data, original test data, an original parameter list, artificial construction data, an optimized parameter list, a plurality of identification models and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a data enhancement method based on a deep learning model.
In one embodiment, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the deep learning model-based data enhancement method described above when the computer program is executed by the processor.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, implements the steps of the deep learning model-based data enhancement method described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (9)

1. A data enhancement method based on a deep learning model, comprising:
the method comprises the steps of obtaining original training data and original test data which are marked manually, and obtaining an original parameter list, wherein the original parameter list is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method;
Randomly initializing the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists;
converting the original training data by utilizing each optimized parameter list to obtain corresponding artificial construction data, and mixing the original training data with the corresponding artificial construction data to obtain a plurality of training sets;
respectively training to obtain a plurality of recognition models by using the plurality of training sets, and testing the plurality of recognition models by taking the original test data as a test set to determine whether a model meeting convergence conditions exists in the plurality of recognition models;
if the models meeting the convergence condition exist in the plurality of identification models, outputting an optimized parameter list corresponding to the models meeting the convergence condition as a target data enhancement parameter list;
performing data enhancement on the original training data by using the target data enhancement parameter list, and mixing the enhancement data subjected to data enhancement with the original training data to obtain a training set of a named entity recognition model;
the data enhancement method includes a synonym replacement method, and the converting the original training data by using each optimized parameter list includes:
Determining enhancement parameters corresponding to the synonym substitution method in the optimization parameter list, wherein the enhancement parameters corresponding to the synonym substitution method comprise entity word class substitution probability and entity word substitution class;
acquiring a preset synonym dictionary which is pre-constructed by a user according to requirements, wherein entity words which are not prohibited from synonym relations in the same entity category are used as synonyms of each other in the preset synonym dictionary;
and carrying out synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
2. The deep learning model-based data enhancement method of claim 1, wherein after the determining whether a model satisfying a convergence condition exists among the plurality of recognition models, the method further comprises:
if the models meeting the convergence condition do not exist in the plurality of identification models, randomly initializing the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm again to obtain a plurality of optimized parameter lists after random initialization, and counting;
determining whether the number of random initialization times of the enhancement parameters in the original parameter list is smaller than a preset number of times;
If the number of times of random initialization of the enhancement parameters in the original parameter list is not less than the preset number of times, stopping the random initialization of the enhancement parameters in the original parameter list;
and if the number of times of random initialization of the enhancement parameters in the original parameter list is smaller than the preset number of times, training according to the optimized parameter list after random initialization to obtain a plurality of new identification models, testing the plurality of new identification models to obtain the target data enhancement parameter list, and obtaining a training set of the named entity identification models by utilizing the target data enhancement parameter list.
3. The method for enhancing data based on a deep learning model according to claim 1, wherein the performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category comprises:
determining whether the category of each entity word in the original training data belongs to the entity word replacement category;
if the category of the entity word in the original training data belongs to the entity word replacement category, searching the synonym of the entity word in the preset synonym dictionary;
Determining whether a synonym relationship is forbidden between the entity word and the synonym of the entity word;
and if the synonym relation between the entity word and the synonym of the entity word is not forbidden, selecting the synonym from the preset synonym dictionary as a replacement word according to the entity word category replacement probability so as to replace the entity word with the replacement word.
4. The data enhancement method based on a deep learning model according to claim 3, wherein the data enhancement method further comprises a random substitution method, a random deletion method, a random exchange method and a long sentence construction method, and the method further comprises, after the synonym substitution is performed on the entity words in the original training data:
in the optimized parameter list, determining the random replacement probability of the random replacement method and determining the random deletion probability of the random deletion method;
determining the random exchange probability of the random exchange method, and determining the sentence length set by the long sentence construction method;
performing entity word replacement on each sentence in the original training data according to the random replacement probability, and performing same-sentence entity word exchange on each sentence in the original training data according to the random exchange probability;
Performing entity word deletion on each sentence in the original training data according to the random deletion probability to obtain processing data;
and performing splicing processing on each sentence in the processing data so that the length of the processed sentence is the sentence length.
5. The deep learning model based data enhancement method of any of claims 1-4, wherein the determining whether a convergence model exists in the plurality of recognition models comprises:
determining the highest recognition score of the plurality of recognition models for recognizing each word in the test set;
determining whether the highest recognition score satisfies the convergence condition;
if the highest recognition score meets the convergence condition, determining that a convergence model meeting the convergence condition exists in the plurality of recognition models, wherein the recognition model corresponding to the highest recognition score is the convergence model;
and if the highest recognition score does not meet the convergence condition, determining that no convergence model meeting the convergence condition exists in the plurality of recognition models.
6. The deep learning model based data enhancement method of claim 5, wherein the determining whether the highest recognition score meets the convergence condition comprises:
Determining a convergence parameter configured by a user;
determining a first highest recognition score of the plurality of recognition models that recognizes a t-th word in the test set;
determining a second highest recognition score of the plurality of recognition models that recognizes a t-1 st word in the test set;
subtracting the second highest recognition score from the first highest recognition score to obtain a highest recognition score difference;
determining whether a ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter;
if the ratio of the highest recognition score difference to the second highest recognition score is smaller than the convergence parameter, determining that the highest recognition score meets the convergence condition;
and if the ratio of the highest recognition score difference to the second highest recognition score is not smaller than the convergence parameter, determining that the highest recognition score does not meet the convergence condition.
7. A data enhancement device based on a deep learning model, comprising:
the acquisition module is used for acquiring the original training data and the original test data which are marked manually and acquiring an original parameter list, wherein the original parameter list is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method;
The initialization module is used for randomly initializing the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm so as to obtain a plurality of optimized parameter lists;
the conversion module is used for converting the original training data by utilizing each optimized parameter list to obtain corresponding artificial construction data, and mixing the original training data with the corresponding artificial construction data to obtain a plurality of training sets;
the test module is used for respectively training to obtain a plurality of recognition models by utilizing the plurality of training sets, and testing the plurality of recognition models by taking the original test data as a test set so as to determine whether a model meeting convergence conditions exists in the plurality of recognition models;
the output module is used for outputting an optimized parameter list corresponding to the model meeting the convergence condition as a target data enhancement parameter list if the model meeting the convergence condition exists in the plurality of identification models;
the enhancement module is used for carrying out data enhancement on the original training data by utilizing the target data enhancement parameter list, and mixing the enhancement data subjected to data enhancement with the original training data so as to obtain a training set of a named entity recognition model;
wherein the data enhancement method comprises a synonym replacement method, and the conversion module is further configured to:
determine the enhancement parameters corresponding to the synonym replacement method in the optimized parameter list, wherein the enhancement parameters corresponding to the synonym replacement method comprise an entity-word-class replacement probability and an entity-word replacement class;
acquire a preset synonym dictionary constructed in advance by a user according to requirements, wherein, in the preset synonym dictionary, entity words of the same entity category whose synonym relation is not prohibited are treated as synonyms of one another;
and perform synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity-word-class replacement probability and the entity-word replacement class.
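The synonym replacement step above can be illustrated with a short sketch. The dictionary contents, entity-class names, and label scheme below are made-up examples, not taken from the patent; the point is only that words of the same entity category serve as each other's synonyms and are swapped with a per-class probability.

```python
import random

# Illustrative per-entity-class synonym dictionary (hypothetical entries).
SYNONYM_DICT = {
    "CITY": ["Beijing", "Shanghai", "Shenzhen"],
    "ORG": ["Ping An Technology", "Example Corp"],
}

def replace_synonyms(tokens, labels, replace_prob, rng=None):
    """tokens/labels are parallel lists; labels are entity-class names or 'O'.
    replace_prob maps an entity class to its replacement probability."""
    rng = rng or random.Random()
    out = []
    for token, label in zip(tokens, labels):
        pool = SYNONYM_DICT.get(label, [])
        if label != "O" and pool and rng.random() < replace_prob.get(label, 0.0):
            # Pick a different word from the same entity class.
            choices = [w for w in pool if w != token] or [token]
            out.append(rng.choice(choices))
        else:
            out.append(token)
    return out

tokens = ["I", "flew", "to", "Beijing"]
labels = ["O", "O", "O", "CITY"]
augmented = replace_synonyms(tokens, labels, {"CITY": 1.0}, random.Random(1))
print(augmented)
```

Because the replacement keeps the entity label unchanged, each augmented sentence remains a validly annotated training example for the named entity recognition model.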
8. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the deep learning model based data enhancement method according to any one of claims 1 to 6.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the deep learning model based data enhancement method according to any one of claims 1 to 6.
CN202110420110.3A 2021-04-19 2021-04-19 Data enhancement method, device, equipment and medium based on deep learning model Active CN113158652B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110420110.3A CN113158652B (en) 2021-04-19 2021-04-19 Data enhancement method, device, equipment and medium based on deep learning model
PCT/CN2021/096475 WO2022222224A1 (en) 2021-04-19 2021-05-27 Deep learning model-based data augmentation method and apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110420110.3A CN113158652B (en) 2021-04-19 2021-04-19 Data enhancement method, device, equipment and medium based on deep learning model

Publications (2)

Publication Number Publication Date
CN113158652A (en) 2021-07-23
CN113158652B (en) 2024-03-19

Family

ID=76868692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110420110.3A Active CN113158652B (en) 2021-04-19 2021-04-19 Data enhancement method, device, equipment and medium based on deep learning model

Country Status (2)

Country Link
CN (1) CN113158652B (en)
WO (1) WO2022222224A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116244445B (en) * 2022-12-29 2023-12-12 中国航空综合技术研究所 Aviation text data labeling method and labeling system thereof
CN116451690A (en) * 2023-03-21 2023-07-18 麦博(上海)健康科技有限公司 Medical field named entity identification method
CN116501979A (en) * 2023-06-30 2023-07-28 北京水滴科技集团有限公司 Information recommendation method, information recommendation device, computer equipment and computer readable storage medium
CN116911305A (en) * 2023-09-13 2023-10-20 中博信息技术研究院有限公司 Chinese address recognition method based on fusion model

Citations (3)

Publication number Priority date Publication date Assignee Title
CN109145965A (en) * 2018-08-02 2019-01-04 深圳辉煌耀强科技有限公司 Cell recognition method and device based on random forest disaggregated model
CN110516835A (en) * 2019-07-05 2019-11-29 电子科技大学 A kind of Multi-variable Grey Model optimization method based on artificial fish-swarm algorithm
CN111967604A (en) * 2019-05-20 2020-11-20 国际商业机器公司 Data enhancement for text-based AI applications

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US11093707B2 (en) * 2019-01-15 2021-08-17 International Business Machines Corporation Adversarial training data augmentation data for text classifiers
CN110543906B (en) * 2019-08-29 2023-06-16 彭礼烨 Automatic skin recognition method based on Mask R-CNN model
CN111738004B (en) * 2020-06-16 2023-10-27 中国科学院计算技术研究所 Named entity recognition model training method and named entity recognition method
CN111832294B (en) * 2020-06-24 2022-08-16 平安科技(深圳)有限公司 Method and device for selecting marking data, computer equipment and storage medium
CN111738007B (en) * 2020-07-03 2021-04-13 北京邮电大学 Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN112257441B (en) * 2020-09-15 2024-04-05 浙江大学 Named entity recognition enhancement method based on counterfactual generation


Also Published As

Publication number Publication date
WO2022222224A1 (en) 2022-10-27
CN113158652A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN113158652B (en) Data enhancement method, device, equipment and medium based on deep learning model
US10394956B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
WO2022048363A1 (en) Website classification method and apparatus, computer device, and storage medium
Nabati et al. Video captioning using boosted and parallel Long Short-Term Memory networks
CN112380837B (en) Similar sentence matching method, device, equipment and medium based on translation model
CN113536795B (en) Method, system, electronic device and storage medium for entity relation extraction
CN110795938A (en) Text sequence word segmentation method, device and storage medium
CN111666775A (en) Text processing method, device, equipment and storage medium
CN115309910B (en) Language-text element and element relation joint extraction method and knowledge graph construction method
Cheng et al. A hierarchical multimodal attention-based neural network for image captioning
CN112052329A (en) Text abstract generation method and device, computer equipment and readable storage medium
CN112861543A (en) Deep semantic matching method and system for matching research and development supply and demand description texts
US20090234852A1 (en) Sub-linear approximate string match
CN116050397A (en) Method, system, equipment and storage medium for generating long text abstract
CN110991193A (en) Translation matrix model selection system based on OpenKiwi
Li et al. Midtd: A simple and effective distillation framework for distantly supervised relation extraction
CN113627159B (en) Training data determining method, device, medium and product of error correction model
CN116610795B (en) Text retrieval method and device
Fakeri-Tabrizi et al. Multiview self-learning
CN112529743B (en) Contract element extraction method, device, electronic equipment and medium
Hu et al. Corpus of Carbonate Platforms with Lexical Annotations for Named Entity Recognition.
CN115512374A (en) Deep learning feature extraction and classification method and device for table text
Xia et al. Generating Questions Based on Semi-Automated and End-to-End Neural Network.
Liang et al. Design of computer aided translation system for English communication language based on grey clustering evaluation
Wu et al. A Text Emotion Analysis Method Using the Dual‐Channel Convolution Neural Network in Social Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant