CN114611513A - Sample generation method, model training method, entity identification method and related device - Google Patents

Info

Publication number
CN114611513A
Authority
CN
China
Prior art keywords
entity
recognition result
model
sample
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210061372.XA
Other languages
Chinese (zh)
Inventor
陈贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloudminds Shanghai Robotics Co Ltd
Original Assignee
Cloudminds Shanghai Robotics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloudminds Shanghai Robotics Co Ltd
Priority to CN202210061372.XA
Publication of CN114611513A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/237 - Lexical tools
    • G06F 40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

One or more embodiments of the present specification disclose a sample generation method, a model training method, an entity recognition method, and a related apparatus. The method includes: pre-training a preset language recognition model on a labeled first sample to obtain an initial model; performing prediction scoring on the entity data in an entity dictionary based on the initial model and outputting recognition results; and, if a predicted entity type in a recognition result does not coincide with the real entity type of the entity data in the entity dictionary, performing entity type correction on the recognition result so that the corrected recognition result contains the real entity type, and summarizing all corrected recognition results into a second sample. With this scheme, a large number of weakly labeled second samples can be generated by weakly supervised labeling using only the labeled first sample, which alleviates, to a certain extent, the difficulty of obtaining labeled samples and improves the efficiency of labeled-sample generation, thereby improving model training and recognition performance.

Description

Sample generation method, model training method, entity identification method and related device
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a sample generation method, a model training method, an entity identification method, and a related apparatus.
Background
Named Entity Recognition (NER) is a basic key task in Natural Language Processing (NLP). It is the foundation of many NLP tasks such as relation extraction, event extraction, knowledge graphs, information extraction, question-answering systems, syntactic analysis, and machine translation, is widely applied in the natural language processing field, and plays an important role in putting natural language processing technology into practical use. NER refers to recognizing entities of predefined types (person names, place names, organization names, dates and times, proper nouns, etc.) in a piece of text.
In the current field of artificial intelligence, whether for traditional tasks such as classification, matching, sequence labeling, and text generation, or for newly derived cross-modal tasks such as image understanding and audio emotion analysis, wherever deep learning is adopted there is a heavy dependence on labeled data. NER likewise relies on a large amount of labeled data when deployed in practice.
Because the corpus data used by NER is domain-specific and cannot be applied across all domains, obtaining a well-performing model often requires a large number of annotators to label data manually; the labor cost is high, and labeled samples are difficult to generate and obtain.
Disclosure of Invention
One or more embodiments of the present disclosure provide a sample generation method, a model training method, an entity recognition method, and a related apparatus, so as to alleviate the difficulty of obtaining labeled samples by means of weakly supervised labeling, improve the efficiency of labeled-sample generation, and further improve model training and recognition performance.
To solve the above technical problem, one or more embodiments of the present specification are implemented as follows:
in a first aspect, a sample generation method is provided, including:
training a preset language recognition model based on the labeled first sample to obtain an initial model;
acquiring an entity dictionary, respectively recognizing each entity data in the entity dictionary based on the initial model, and outputting a recognition result corresponding to each entity data, wherein the recognition result comprises: at least one predicted entity and a predicted entity type corresponding to each predicted entity;
judging, for each entity data, whether the predicted entity type in its recognition result coincides with the real entity type of the entity data in the entity dictionary;
for recognition results whose entity types do not coincide, performing entity type correction on the recognition result based on the real entity type of the corresponding entity data, and summarizing the corrected recognition results into a second sample;
wherein each corrected recognition result contains the real entity type of the corresponding entity data, and the amount of sample data in the second sample is far greater than that in the first sample.
In a second aspect, a model training method is provided, including:
determining a corpus sample, wherein the corpus sample is generated by the sample generation method of the first aspect;
and training the language recognition model to be trained based on the corpus sample to obtain an entity recognition model.
In a third aspect, a method for entity identification is provided, including:
determining corpus data to be recognized;
performing entity recognition on the corpus data to be recognized based on the trained entity recognition model; wherein, the entity recognition model is trained by the model training method of the second aspect.
In a fourth aspect, a sample generation apparatus is provided, comprising:
the training module is used for training a preset language recognition model based on the labeled first sample to obtain an initial model;
the recognition module is used for acquiring an entity dictionary, recognizing each entity data in the entity dictionary based on the initial model, and outputting a recognition result corresponding to each entity data, wherein the recognition result comprises: at least one predicted entity and a predicted entity type corresponding to each predicted entity;
the judging module is used for judging, for each entity data, whether the predicted entity type in its recognition result coincides with the real entity type of the entity data in the entity dictionary;
the correction module is used for, for recognition results whose entity types do not coincide, performing entity type correction on the recognition result based on the real entity type of the corresponding entity data, and summarizing the corrected recognition results into a second sample;
wherein each corrected recognition result contains the real entity type of the corresponding entity data, and the amount of sample data in the second sample is far greater than that in the first sample.
In a fifth aspect, a model training apparatus is provided, including:
a determining module, configured to determine a corpus sample, where the corpus sample is generated by using the sample generation method according to the first aspect;
and the training module is used for training the language recognition model to be trained based on the corpus sample to obtain the entity recognition model.
In a sixth aspect, an entity identification apparatus is provided, including:
the determining module is used for determining corpus data to be recognized;
and the recognition module is used for carrying out entity recognition on the corpus data to be recognized based on the trained entity recognition model, and the entity recognition model is trained by adopting the model training method of the second aspect.
In a seventh aspect, an electronic device is provided, including:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to perform at least one of the methods of the first, second and third aspects.
In an eighth aspect, a computer-readable storage medium is presented, which stores one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform at least one of the method of the first aspect, the method of the second aspect and the method of the third aspect above.
According to the technical solutions provided by one or more embodiments of this specification, an initial model is obtained by pre-training a preset language recognition model on the labeled first sample; prediction scoring is then performed on the entity data in the entity dictionary based on the initial model, and recognition results are output; and if a predicted entity type in a recognition result does not coincide with the real entity type of the entity data in the entity dictionary, entity type correction is performed on the recognition result so that the corrected recognition result contains the real entity type, and all corrected recognition results are summarized into a second sample. With this scheme, using only the labeled first sample, a large number of weakly labeled second samples can be generated by weakly supervised labeling, which alleviates, to a certain extent, the difficulty of obtaining labeled samples and improves the efficiency of labeled-sample generation. Furthermore, training the preset language recognition model with labeled samples generated by weakly supervised labeling can improve model training performance, and an entity recognition model trained on such samples achieves improved entity recognition performance.
Drawings
To illustrate the technical solutions of one or more embodiments of this specification or of the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show only some of the embodiments described in this specification, and that other drawings may be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a schematic step diagram of a sample generation method according to an embodiment of the present application.
Fig. 2 is a schematic step diagram of a model training method according to an embodiment of the present application.
Fig. 3 is a schematic step diagram of an entity identification method according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a sample generation device according to an embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an entity identification apparatus according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in one or more embodiments of the present specification will be clearly and completely described below with reference to the drawings in one or more embodiments of the present specification, and it is obvious that the one or more embodiments described are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments that can be derived by a person skilled in the art from one or more of the embodiments described herein without making any inventive step shall fall within the scope of protection of this document.
In view of the difficulty of obtaining labeled samples in the prior art, the embodiments of the present application provide a scheme for generating samples by weakly supervised labeling. First, a preset language recognition model is pre-trained with a small batch of labeled samples to obtain an initial model; then, prediction scoring is performed on the entity data in an entity dictionary based on the initial model, and recognition results are output; and if a predicted entity type in a recognition result does not coincide with the real entity type of the entity data in the entity dictionary, entity type correction is performed on the recognition result so that the corrected recognition result contains the real entity type, and all corrected recognition results are summarized into the required labeled corpus samples. With this scheme, a large number of labeled samples can be generated under weak supervision using only a small batch of labeled samples, which alleviates, to a certain extent, the difficulty of obtaining labeled samples and improves the efficiency of labeled-sample generation. Furthermore, training the preset language recognition model with labeled samples generated by weakly supervised labeling can improve model training performance, and an entity recognition model trained on such samples achieves improved entity recognition performance.
Example one
Referring to fig. 1, a schematic step diagram of a sample generation method provided in an embodiment of the present application is shown, where an execution subject of the sample generation method may be a sample generation apparatus, and the sample generation apparatus may be a hardware device or a software device with a computer processing function. The sample generation method in the application can comprise the following steps:
step 101: and training a preset language recognition model based on the marked first sample to obtain an initial model.
In the embodiment of the present application, the first sample may be a small batch of corpus data labeled manually or by other means; it should be understood that "small batch" refers to a data volume insufficient to support accurate training of the model, for example, 20,000 pieces of sample data.
Optionally, the preset language identification model in the embodiment of the present application includes any one of the following models:
a BERT-family model such as the ALBERT model, the RoBERTa model, or the BERT-wwm model; the XLNet model; the ERNIE model; and the like.
According to the embodiment of the present application, pre-training the preset language recognition model with the first sample yields an initial model with low prediction accuracy. It should be understood that operations required during training, such as sample feature extraction, can be performed with existing feature extraction methods and are not described again here.
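As a minimal sketch of this pre-training step (assuming the HuggingFace transformers library, the albert-base-v2 checkpoint, and a BIO-style label set, all of which are illustrative choices rather than anything the patent prescribes), step 101 might look like the following; a real run would loop over the full first sample rather than one toy sentence:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative label set; the patent does not fix one.
LABELS = ["O", "B-song", "I-song", "B-person", "I-person", "B-place", "I-place"]

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModelForTokenClassification.from_pretrained(
    "albert-base-v2", num_labels=len(LABELS))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One toy sentence standing in for the ~20,000-item labeled first sample.
encoding = tokenizer("The Shepherd of Koktokay", return_tensors="pt")
labels = torch.zeros_like(encoding["input_ids"])  # all "O", for illustration only

model.train()
loss = model(**encoding, labels=labels).loss  # token-classification cross-entropy
loss.backward()
optimizer.step()  # one pre-training step; real training iterates over the sample
model.save_pretrained("initial_model")
tokenizer.save_pretrained("initial_model")
```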
Step 102: acquiring an entity dictionary, respectively recognizing each entity data in the entity dictionary based on the initial model, and outputting a recognition result corresponding to each entity data, wherein the recognition result comprises: at least one predicted entity and a predicted entity type corresponding to each predicted entity.
In this embodiment, the entity dictionary may include one or more existing entity dictionaries, and the data format of each entry may be: entity text, entity type. For example: Wexxx, singer; Zhangxx, singer; Zhou x, actor; and so on. The entity types contained in the one or more entity dictionaries may include, but are not limited to: singers, songs, ancient poems, jokes, place names, person names, etc.
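For illustration only, such a dictionary might be held in memory as a list of (entity text, entity type) pairs; the concrete entries below are invented stand-ins for the examples above:

```python
# Hypothetical in-memory entity dictionary: (entity text, true entity type).
entity_dictionary = [
    ("Wexxx", "singer"),
    ("Zhangxx", "singer"),
    ("Zhou x", "actor"),
    ("The Shepherd of Koktokay", "song title"),
]
```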
In step 102, each entity data in the entity dictionary is treated as entity data to be recognized. After feature extraction, it is input into the initial model for prediction scoring, which can yield one or more predicted entities together with, for each predicted entity, a probability score for hitting each entity type in a preset entity type library; the entity type with the highest probability score is selected as the predicted entity type of that predicted entity. Finally, the one or more predicted entities and the predicted entity type of each predicted entity are determined as the recognition result of that entity data based on the initial model.
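A minimal sketch of this prediction scoring follows, with an assumed data layout and a hypothetical predict_spans helper standing in for span decoding from the initial model; the hard-coded scores mirror the worked example later in this embodiment:

```python
from typing import Dict, Iterable, List, Tuple

TYPE_LIBRARY = ["singer", "song title", "place name", "person name"]  # illustrative

def predict_spans(text: str) -> Iterable[Tuple[str, Dict[str, float]]]:
    """Hypothetical stand-in for the initial model's span decoder: yields each
    predicted entity with its probability score for every type in the preset
    entity type library (scores hard-coded here for illustration)."""
    yield "Koktokay", {"song title": 0.5, "place name": 0.8}
    yield "shepherd", {"song title": 0.6, "person name": 0.7}

def recognize(entity_text: str) -> List[Tuple[str, str]]:
    """Step 102: keep, for each predicted entity, the highest-scoring type."""
    return [(span, max(scores, key=scores.get))
            for span, scores in predict_spans(entity_text)]

print(recognize("The Shepherd of Koktokay"))
# -> [('Koktokay', 'place name'), ('shepherd', 'person name')]
```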
Step 103: judging, for each entity data, whether the predicted entity type in its recognition result coincides with the real entity type of the entity data in the entity dictionary.
Considering that the initial model is trained on only a small batch of labeled samples, its recognition accuracy may not be particularly high. In other words, for some entity data in the entity dictionary, the predicted entity type output by the initial model may differ from the real entity type of that entity data in the entity dictionary. Therefore, after recognizing the entity data in the entity dictionary with the initial model, it is necessary to roughly evaluate the recognition effect of the initial model by comparing whether the predicted entity type coincides with the real entity type. If there is coincidence, some or all of the real entity types were recognized and the model's recognition effect is good; if there is no coincidence, the recognized predicted entity types are not real and the recognition effect is poor. In practice, the recognition accuracy of the initial model is not high, so in most comparisons the predicted entity type does not coincide with the real entity type.
In the embodiment of the present application, when judging whether the predicted entity types in the recognition result of each entity data coincide with the real entity types of that entity data in the entity dictionary, it can be judged whether any of the one or more real entity types of the entity data appears among the predicted entity types contained in the recognition result; if no real entity type appears in the recognition result, it is determined that the predicted entity types in the recognition result do not coincide with the real entity types of the entity data in the entity dictionary.
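A short sketch of this judgment, under the same assumed layout as the earlier sketch: the recognition result coincides with the dictionary only if at least one predicted type matches a real type of the entity data:

```python
def coincides(recognition_result, true_types):
    """Step 103: does any predicted entity type match a real entity type?"""
    predicted = {ptype for _, ptype in recognition_result}
    return bool(predicted & set(true_types))

result = [("Koktokay", "place name"), ("shepherd", "person name")]
print(coincides(result, {"song title"}))  # -> False: correction is needed
```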
Step 104: for recognition results whose entity types do not coincide, performing entity type correction on the recognition result based on the real entity type of the corresponding entity data, and summarizing the corrected recognition results into a second sample; wherein each corrected recognition result contains the real entity type of the corresponding entity data.
In the embodiment of the present application, entity type correction based on the real entity type of the corresponding entity data may specifically be: replacing the predicted entity type in the recognition result with the real entity type of the corresponding entity data to obtain a candidate recognition result, and determining the corrected recognition result based on the candidate recognition result.
In a preferred implementation of the embodiment of the present application, the amount of sample data in the second sample is much larger than that in the first sample, so that a large amount of weakly labeled sample data, enough to support model training, can be generated from a small amount of annotated sample data with very little manual effort.
In fact, entity type correction of the recognition result need not be limited to the above replacement manner and may include other correction manners (e.g., addition), as long as the corrected recognition result contains the real entity type of the corresponding entity data.
It should be understood that recognition results whose entity types coincide (partially or completely) have already been recognized correctly by the initial model; the coincidence indicates that the recognition effect is already good, so such results are no longer suitable as effective labeled samples for further accurate training of the initial model.
According to the above technical solution, an initial model is obtained by pre-training the preset language recognition model on a small batch of labeled first samples; prediction scoring is then performed on the entity data in the entity dictionary based on the initial model, and recognition results are output; and if a predicted entity type in a recognition result does not coincide with the real entity type of the entity data in the entity dictionary, entity type correction is performed on the recognition result so that the corrected recognition result contains the real entity type, and all corrected recognition results are summarized into a second sample. With this scheme, using only a small batch of labeled first samples, a large number of weakly labeled second samples can be generated by weakly supervised labeling, which alleviates, to a certain extent, the difficulty of obtaining labeled samples and improves the efficiency of labeled-sample generation.
When replacing the predicted entity type with the real entity type, the embodiment of the present application performs different operations according to the number of predicted entities in the recognition result:
1. If the recognition result contains only one predicted entity, the predicted entity type in the recognition result is replaced with the real entity type of the corresponding entity data to obtain a candidate recognition result, and the candidate recognition result is determined as the corrected recognition result.
For example, suppose the entity data and its real entity type in the entity dictionary are "The Shepherd of Koktokay / song title", while the predicted entity and predicted entity type in the recognition result are "The Shepherd of Koktokay / person name". The predicted entity type does not coincide with the real entity type, so the real entity type "song title" replaces "person name", yielding the candidate recognition result "The Shepherd of Koktokay / song title", which is the corrected recognition result.
2. If the recognition result contains at least two predicted entities, each predicted entity type in the recognition result is replaced in turn with the real entity type of the corresponding entity data, yielding a plurality of candidate recognition results; the prediction probability of each candidate recognition result is then calculated, and the candidate with the largest prediction probability is selected as the corrected recognition result.
Further, the preset entity type library used by the initial model for entity prediction scoring contains the real entity type of each entity data in the entity dictionary. In operation 2 above, calculating the prediction probability of each candidate recognition result may specifically include the following steps:
Step 1: determining the entity types contained in each candidate recognition result, together with the probability scores those types received during the initial model's entity prediction scoring;
Step 2: combining the probability scores of all entity types in each candidate recognition result by weighted summation or weighted product, and determining the calculation result as the prediction probability of that candidate recognition result.
For example, suppose the entity data and its real entity type in the entity dictionary are "The Shepherd of Koktokay / song title", while the recognition result contains two predicted entities with their predicted entity types: "Koktokay / place name" and "shepherd / person name". Neither predicted entity type coincides with the real entity type, so the real entity type "song title" replaces "place name" and "person name" in turn, yielding two candidate recognition results: "Koktokay / song title, shepherd / person name" and "Koktokay / place name, shepherd / song title". The candidate with the highest probability score is selected as the corrected recognition result. Suppose that, during entity prediction for "Koktokay", the probability score of hitting the entity type "song title" is 0.5 and of hitting "place name" is 0.8; and that, for "shepherd", the probability score of hitting "song title" is 0.6 and of hitting "person name" is 0.7. Then the probability score of "Koktokay / song title, shepherd / person name" is 0.5 + 0.7 = 1.2, and that of "Koktokay / place name, shepherd / song title" is 0.8 + 0.6 = 1.4; the weights default to 1 and can be set and modified as needed in other embodiments. Since "Koktokay / place name, shepherd / song title" has the higher probability score, that candidate recognition result is determined as the corrected recognition result, i.e., as weakly labeled sample data.
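A sketch of operation 2's correction and scoring, reusing the assumed (span, type, scores) layout from the earlier sketch; the single-entity case of operation 1 is the degenerate version with one candidate, and the weights default to 1 as in the example above:

```python
def correct(recognition, true_type, weights=None):
    """Operation 2: substitute the true type for each predicted type in turn,
    score each candidate by a weighted sum of its per-type probability scores,
    and keep the candidate with the largest prediction probability."""
    weights = weights or {}
    candidates = []
    for i in range(len(recognition)):
        candidate = [(span, true_type if j == i else ptype, scores)
                     for j, (span, ptype, scores) in enumerate(recognition)]
        score = sum(weights.get(t, 1.0) * sc[t] for _, t, sc in candidate)
        candidates.append((score, candidate))
    return max(candidates, key=lambda c: c[0])[1]

recognition = [("Koktokay", "place name", {"song title": 0.5, "place name": 0.8}),
               ("shepherd", "person name", {"song title": 0.6, "person name": 0.7})]
for span, ptype, _ in correct(recognition, "song title"):
    print(span, "/", ptype)
# Koktokay / place name    (0.8 + 0.6 = 1.4 beats 0.5 + 0.7 = 1.2)
# shepherd / song title
```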
Through this correction approach, there is no need to manually label all sample data directly; the required second sample can be generated automatically, entirely through machine learning combined with the processing algorithm, while guaranteeing that the sample data in the second sample is weakly labeled corpus data. This saves labor cost and improves sample generation efficiency.
Example two
Referring to fig. 2, a schematic step diagram of a model training method according to an embodiment of the present disclosure is provided, where an execution subject of the model training method may be a model training apparatus, and the model training apparatus may be a hardware device or a software device with a computer processing function. The model training method in the present application may include the steps of:
step 201: and determining a corpus sample, wherein the corpus sample is generated by adopting the sample generation method from the step 101 to the step 104.
The specific implementation of step 201 can refer to the related steps in the first embodiment, which are not described herein.
Step 202: and training the language identification model to be trained based on the corpus sample to obtain an entity identification model.
Optionally, in the second embodiment, the language recognition model to be trained may be the initial model obtained in step 101; in that case, training on the corpus sample amounts to fine-tuning the initial model on the corpus sample. In this way, the manually labeled sample data and the weakly labeled sample data are used in separate training phases, avoiding mutual interference between the two kinds of data.
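A minimal sketch of this fine-tuning, under the same transformers assumption as the earlier pre-training sketch; the weakly labeled dataset wrapper weak_corpus_dataset is an assumed object prepared elsewhere, and the hyperparameters are illustrative:

```python
from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

# Reload the initial model saved after step 101 and fine-tune it on the
# weakly labeled second sample (step 202 under the optional design above).
model = AutoModelForTokenClassification.from_pretrained("initial_model")
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="entity_recognition_model",
                           num_train_epochs=2, learning_rate=2e-5),
    train_dataset=weak_corpus_dataset,  # assumed: tokenized second sample
)
trainer.train()
trainer.save_model("entity_recognition_model")
```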
A recognition model trained with this model training method can recognize corpus data accurately; experimental verification shows that model recognition performance improves by at least 2%, particularly on named entity tasks. Moreover, given the improved convenience of obtaining corpus samples, the training efficiency of the whole model training stage also improves.
Example three
Referring to fig. 3, a schematic step diagram of an entity identification method according to an embodiment of the present application is provided, where an execution subject of the entity identification method may be an entity identification device, and the entity identification device may be a hardware device or a software device with a computer processing function. The entity identification method in the application can comprise the following steps:
step 301: and determining corpus data to be identified.
Step 302: performing entity recognition on the corpus data to be recognized based on the trained entity recognition model; the entity recognition model is trained by adopting a model training method from step 201 to step 202.
The specific implementation of step 302 can refer to the related steps in the first and second embodiments, which are not described herein.
With this entity recognition method, corpus data can be recognized accurately; experimental verification shows that model recognition performance improves by at least 2%, particularly on named entity tasks.
Example four
Referring to fig. 4, a schematic structural diagram of a sample generation apparatus provided in an embodiment of the present disclosure is shown, where the sample generation apparatus may include:
the training module 401 is configured to train a preset language recognition model based on the labeled first sample to obtain an initial model;
a recognition module 402, configured to obtain an entity dictionary, respectively recognize each entity data in the entity dictionary based on the initial model, and output a recognition result corresponding to each entity data, where the recognition result includes: at least one predicted entity and a predicted entity type corresponding to each predicted entity;
a judging module 403, configured to judge, for each entity data, whether the predicted entity type in its recognition result coincides with the real entity type of the entity data in the entity dictionary;
a correction module 404, configured to, for recognition results whose entity types do not coincide, perform entity type correction on the recognition result based on the real entity type of the corresponding entity data, and summarize the corrected recognition results into a second sample;
wherein each corrected recognition result contains the real entity type of the corresponding entity data.
Optionally, the correction module 404 is specifically configured to replace the predicted entity type in the recognition result with the real entity type of the corresponding entity data to obtain a candidate recognition result, and to determine the corrected recognition result based on the candidate recognition result.
Optionally, if the recognition result contains only one predicted entity, the correction module 404 is specifically configured to replace the predicted entity type in the recognition result with the real entity type of the corresponding entity data to obtain a candidate recognition result, and to determine the candidate recognition result as the corrected recognition result.
Optionally, if the recognition result contains at least two predicted entities, the correction module 404 is specifically configured to replace some or all of the predicted entity types in the recognition result in turn with the real entity type of the corresponding entity data, obtaining a plurality of candidate recognition results; and to calculate the prediction probability of each candidate recognition result and select the candidate with the largest prediction probability as the corrected recognition result.
Optionally, the preset entity type library used by the initial model for entity prediction scoring contains the real entity type of each entity data in the entity dictionary; when calculating the prediction probability of each candidate recognition result, the correction module 404 is specifically configured to:
determine the entity types contained in each candidate recognition result, together with the probability scores those types received during the initial model's entity prediction scoring;
and combine the probability scores of all entity types in each candidate recognition result by weighted summation or weighted product, determining the calculation result as the prediction probability of that candidate recognition result.
According to the above technical solution, an initial model is obtained by pre-training the preset language recognition model on a small batch of labeled first samples; prediction scoring is then performed on the entity data in the entity dictionary based on the initial model, and recognition results are output; and if a predicted entity type in a recognition result does not coincide with the real entity type of the entity data in the entity dictionary, entity type correction is performed on the recognition result so that the corrected recognition result contains the real entity type, and all corrected recognition results are summarized into a second sample. With this scheme, using only a small batch of labeled first samples, a large number of weakly labeled second samples can be generated by weakly supervised labeling, which alleviates, to a certain extent, the difficulty of obtaining labeled samples and improves the efficiency of labeled-sample generation.
Example five
Referring to fig. 5, a schematic structural diagram of a model training apparatus provided in an embodiment of the present disclosure is shown, where the model training apparatus may include:
a determining module 501, configured to determine a corpus sample, where the corpus sample is generated by using the sample generation method in steps 101 to 104;
a training module 502, configured to train the language recognition model to be trained based on the corpus sample to obtain an entity recognition model.
A recognition model trained with this model training apparatus can recognize corpus data accurately; experimental verification shows that model recognition performance improves by at least 2%, particularly on named entity tasks. Moreover, given the improved convenience of obtaining corpus samples, the training efficiency of the whole model training stage also improves.
Example six
Referring to fig. 6, a schematic structural diagram of an entity identification apparatus provided in an embodiment of the present disclosure is shown, where the entity identification apparatus may include:
a determining module 601, configured to determine corpus data to be identified;
and the recognition module 602 is configured to perform entity recognition on the corpus data to be recognized based on a trained entity recognition model, where the entity recognition model is trained by using the model training method from step 201 to step 202.
With this entity recognition apparatus, corpus data can be recognized accurately; experimental verification shows that model recognition performance improves by at least 2%, particularly on named entity tasks.
Example seven
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present specification. Referring to fig. 7, at the hardware level, the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 7, but this does not indicate only one bus or one type of bus.
The memory is used for storing programs; in particular, a program may include program code comprising computer operating instructions. The memory may include both internal memory and non-volatile storage, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming, at the logical level, a processing apparatus that may be one or more of a sample generation apparatus, a model training apparatus, and an entity recognition apparatus. The processor executes the program stored in the memory.
The methods performed by the apparatuses disclosed in the embodiments of Figs. 1 to 3 of this specification can be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above methods may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; or a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in one or more embodiments of this specification may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with one or more embodiments of this specification may be embodied directly as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
The electronic device may also execute the method in fig. 1 to fig. 3, and implement the functions of the corresponding apparatus in the embodiments shown in fig. 1 to fig. 3, which are not described herein again in this specification.
Of course, besides the software implementation, the electronic device of the embodiments of the present disclosure does not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or a logic device.
Example eight
Embodiments of the present specification also provide a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiments shown in fig. 1-3.
In short, the above description is only a preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present specification shall be included in the protection scope of the present specification.
The system, apparatus, module or unit illustrated in one or more of the above embodiments may be implemented by a computer chip or an entity, or by an article of manufacture with a certain functionality. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Claims (15)

1. A method of generating a sample, comprising:
training a preset language recognition model based on the labeled first sample to obtain an initial model;
acquiring an entity dictionary, respectively recognizing each entity data in the entity dictionary based on the initial model, and outputting a recognition result corresponding to each entity data, wherein the recognition result comprises: at least one predicted entity and a predicted entity type corresponding to each predicted entity;
judging, for each entity data, whether the predicted entity type in its recognition result coincides with the real entity type of the entity data in the entity dictionary;
for recognition results whose entity types do not coincide, performing entity type correction on the recognition result based on the real entity type of the entity data corresponding to the recognition result, and summarizing the corrected recognition results into a second sample;
wherein each corrected recognition result contains the real entity type of the corresponding entity data.
2. The sample generation method according to claim 1, wherein performing entity type correction on the recognition result based on the real entity type of the entity data corresponding to the recognition result specifically comprises:
replacing the predicted entity type in the recognition result with the real entity type of the entity data corresponding to the recognition result to obtain a candidate recognition result, and determining the corrected recognition result based on the candidate recognition result.
3. The sample generation method according to claim 2, wherein, if the recognition result contains only one predicted entity, replacing the predicted entity type in the recognition result with the real entity type of the entity data corresponding to the recognition result to obtain a candidate recognition result, and determining the corrected recognition result based on the candidate recognition result, specifically comprises:
replacing the predicted entity type in the recognition result with the real entity type of the entity data corresponding to the recognition result to obtain a candidate recognition result, and determining the candidate recognition result as the corrected recognition result.
4. The sample generation method according to claim 2, wherein, if the recognition result contains at least two predicted entities, replacing the predicted entity type in the recognition result with the real entity type of the entity data corresponding to the recognition result to obtain a candidate recognition result, and determining the corrected recognition result based on the candidate recognition result, specifically comprises:
sequentially replacing some or all of the predicted entity types in the recognition result with the real entity type of the entity data corresponding to the recognition result to obtain a plurality of candidate recognition results; and respectively calculating the prediction probability of each candidate recognition result, and selecting the candidate recognition result with the largest prediction probability as the corrected recognition result.
5. The sample generation method according to claim 4, wherein the preset entity type library used by the initial model in entity prediction scoring contains the real entity type of each entity data in the entity dictionary;
respectively calculating the prediction probability of each candidate recognition result, specifically comprising:
determining the entity types contained in each candidate recognition result, together with the probability scores those types received during the initial model's entity prediction scoring;
and combining the probability scores of all entity types in each candidate recognition result by weighted summation or weighted product, and determining the calculation result as the prediction probability of each candidate recognition result.
6. The method of sample generation according to any of claims 1 to 5, wherein the number of sample data in the second sample is much larger than the number of sample data in the first sample.
7. The sample generation method as claimed in any one of claims 1 to 5, wherein the predetermined language identification model comprises any one of:
a BERT-family model such as the ALBERT model, the RoBERTa model, or the BERT-wwm model; the XLNet model; the ERNIE model.
8. A method of model training, comprising:
determining a corpus sample, the corpus sample being generated using the sample generation method of any one of claims 1-7;
and training the language identification model to be trained based on the corpus sample to obtain an entity identification model.
9. The model training method according to claim 8, wherein training a language recognition model to be trained based on the corpus samples specifically comprises:
fine-tuning the language recognition model to be trained based on the corpus sample.
10. An entity identification method, comprising:
determining corpus data to be recognized;
performing entity recognition on the corpus data to be recognized based on the trained entity recognition model; wherein the entity recognition model is trained by the model training method of claim 8 or 9.
11. A sample generation device, comprising:
the training module is used for training a preset language recognition model based on the labeled first sample to obtain an initial model;
the recognition module is used for acquiring an entity dictionary, recognizing each entity data in the entity dictionary based on the initial model, and outputting a recognition result corresponding to each entity data, wherein the recognition result comprises: at least one predicted entity and a predicted entity type corresponding to each predicted entity;
the judging module is used for judging, for each entity data, whether the predicted entity type in its recognition result coincides with the real entity type of the entity data in the entity dictionary;
the correction module is used for, for recognition results whose entity types do not coincide, performing entity type correction on the recognition result based on the real entity type of the entity data corresponding to the recognition result, and summarizing the corrected recognition results into a second sample;
wherein each corrected recognition result contains the real entity type of the corresponding entity data.
12. A model training apparatus, comprising:
a determining module, configured to determine a corpus sample, where the corpus sample is generated by using the sample generation method according to any one of claims 1 to 7;
and the training module is used for training the language recognition model to be trained based on the corpus sample to obtain the entity recognition model.
13. An entity identification apparatus, comprising:
the determining module is used for determining corpus data to be recognized;
and the recognition module is used for carrying out entity recognition on the corpus data to be recognized based on a trained entity recognition model, and the entity recognition model is trained by adopting the model training method of claim 8 or 9.
14. An electronic device, comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform at least one of the sample generation method of any one of claims 1 to 7, the model training method of claim 8 or 9, and the entity recognition method of claim 10.
15. A computer-readable storage medium storing one or more programs which, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform at least one of the sample generation method of any one of claims 1 to 7, the model training method of claim 8 or 9, and the entity recognition method of claim 10.
CN202210061372.XA 2022-01-19 2022-01-19 Sample generation method, model training method, entity identification method and related device Pending CN114611513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210061372.XA CN114611513A (en) 2022-01-19 2022-01-19 Sample generation method, model training method, entity identification method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210061372.XA CN114611513A (en) 2022-01-19 2022-01-19 Sample generation method, model training method, entity identification method and related device

Publications (1)

Publication Number Publication Date
CN114611513A (en)

Family

ID=81857834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210061372.XA Pending CN114611513A (en) 2022-01-19 2022-01-19 Sample generation method, model training method, entity identification method and related device

Country Status (1)

Country Link
CN (1) CN114611513A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341556A * 2023-05-29 2023-06-27 Zhejiang University of Technology Small sample rehabilitation medical named entity identification method and device based on data enhancement

Similar Documents

Publication Publication Date Title
Xu et al. MULAPI: Improving API method recommendation with API usage location
CN112966712A (en) Language model training method and device, electronic equipment and computer readable medium
CN111898643B (en) Semantic matching method and device
CN109344406B (en) Part-of-speech tagging method and device and electronic equipment
US11651014B2 (en) Source code retrieval
CN116860949B (en) Question-answering processing method, device, system, computing equipment and computer storage medium
CN109299276B (en) Method and device for converting text into word embedding and text classification
CN110263127A (en) Text search method and device is carried out based on user query word
CN114596845A (en) Training method of voice recognition model, voice recognition method and device
CN111079379A (en) Shape and proximity character acquisition method and device, electronic equipment and storage medium
CN114611513A (en) Sample generation method, model training method, entity identification method and related device
CN112818126B (en) Training method, application method and device for network security corpus construction model
CN113743618A (en) Time series data processing method and device, readable medium and electronic equipment
CN108255891B (en) Method and device for judging webpage type
CN113468323A (en) Dispute focus category and similarity judgment method, dispute focus category and similarity judgment system, dispute focus category and similarity judgment device and dispute focus category and similarity judgment recommendation method
CN116385230A (en) Child reading ability evaluation method and system
CN109522920B (en) Training method and device of synonymy discriminant model based on combination of semantic features
CN114254588B (en) Data tag processing method and device
CN110569429A (en) method, device and equipment for generating content selection model
CN116028626A (en) Text matching method and device, storage medium and electronic equipment
CN114065762A (en) Text information processing method, device, medium and equipment
CN114282586A (en) Data annotation method, system and electronic equipment
CN113836297A (en) Training method and device for text emotion analysis model
CN110717029A (en) Information processing method and system
CN111461346A (en) Network node characterization method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination