CN112818689B - Entity identification method, model training method and device - Google Patents


Info

Publication number
CN112818689B
Authority
CN (China)
Prior art keywords
corpus, entity, entity recognition, cluster, scene
Application number
CN201911118481.5A
Other languages
Chinese (zh)
Other versions
CN112818689A
Inventors
黄磊, 杨春勇, 靳丁南, 权圣
Current Assignee
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Events
Application filed by Mashang Xiaofei Finance Co Ltd; priority to CN201911118481.5A; publication of CN112818689A; application granted; publication of CN112818689B
Legal status
Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/332: Query formulation
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis


Abstract

The invention provides an entity recognition method, a model training method, and corresponding apparatus. The entity recognition method comprises: acquiring a target text to be recognized; determining a first scene corresponding to the target text; obtaining, from at least two pre-trained entity recognition models, the entity recognition model corresponding to the first scene, wherein different models among the at least two entity recognition models are trained on corpora of different scenes; and inputting the target text into the entity recognition model corresponding to the first scene to obtain an entity recognition result of the target text. The entity recognition method provided by the invention can improve the accuracy of entity recognition.

Description

Entity identification method, model training method and device
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to an entity identification method, a model training method, and an apparatus.
Background
Entity recognition (also referred to as Named Entity Recognition, or NER) refers to identifying entities with specific meaning in text, e.g., person names, place names, organization names, proper nouns, etc. As intelligent systems such as question-and-answer systems (e.g., customer service question-and-answer systems) have come into ever wider use, entity recognition has become increasingly important as a part of natural language processing.
However, in the prior art, a single entity recognition model is usually trained on all collected training samples and applied to all texts input by users, which results in low entity recognition accuracy.
Disclosure of Invention
The embodiments of the present invention provide an entity identification method, a model training method, and corresponding apparatus, to solve the problem of low entity recognition accuracy in the prior art.
To solve the above technical problem, the present invention is implemented as follows:
in a first aspect, an embodiment of the present invention provides an entity identification method. The method comprises the following steps:
acquiring a target text to be identified;
determining a first scene corresponding to the target text;
obtaining the entity recognition model corresponding to the first scene from at least two pre-trained entity recognition models, wherein different models among the at least two entity recognition models are trained on corpora of different scenes;
and inputting the target text into an entity recognition model corresponding to the first scene to obtain an entity recognition result of the target text.
In a second aspect, an embodiment of the present invention provides a model training method. The method comprises the following steps:
clustering the N acquired corpora to obtain at least two clusters, wherein N is an integer greater than 1 and different clusters among the at least two clusters correspond to different scenes;
obtaining target labeling information for the corpora of each cluster in the at least two clusters respectively;
performing model training respectively according to the corpora of each cluster in the at least two clusters and their target labeling information, to obtain at least two entity recognition models.
In a third aspect, an embodiment of the present invention further provides an entity identification device. The device comprises:
the first acquisition module is used for acquiring a target text to be identified;
the determining module is used for determining a first scene corresponding to the target text;
the second acquisition module is used for obtaining the entity recognition model corresponding to the first scene from at least two pre-trained entity recognition models, wherein different models among the at least two entity recognition models are trained on corpora of different scenes;
and the input module is used for inputting the target text into an entity recognition model corresponding to the first scene to obtain an entity recognition result of the target text.
In a fourth aspect, an embodiment of the present invention further provides a model training apparatus. The apparatus comprises:
the clustering module is used for clustering the N acquired corpora, before the entity recognition model corresponding to the first scene is obtained from the at least two pre-trained entity recognition models, to obtain at least two clusters, wherein N is an integer greater than 1 and different clusters among the at least two clusters correspond to different scenes;
the acquisition module is used for respectively acquiring target annotation information of the corpus of each cluster in the at least two clusters;
and the training module is used for carrying out model training according to the corpus of each cluster and the target labeling information thereof to obtain at least two entity recognition models.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including a processor, a memory, and a computer program stored in the memory and capable of running on the processor, where the computer program when executed by the processor implements the steps of the entity identification method or implements the steps of the model training method described above.
In a sixth aspect, embodiments of the present invention further provide a computer readable storage medium, where a computer program is stored, where the computer program when executed by a processor implements the steps of the entity identification method or implements the steps of the model training method described above.
In the embodiment of the invention, a target text to be recognized is acquired; a first scene corresponding to the target text is determined; the entity recognition model corresponding to the first scene is obtained from at least two pre-trained entity recognition models, where different models among them are trained on corpora of different scenes; and the target text is input into the model corresponding to the first scene to obtain its entity recognition result. Because texts of different scenes undergo entity recognition with models trained on corpora of those scenes, the accuracy of entity recognition can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is a flow chart of an entity identification method provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a customer service response method provided by an embodiment of the present invention;
FIG. 3 is a flow chart of a model training method provided by an embodiment of the present invention;
FIG. 4 is a block diagram of an entity identification device according to an embodiment of the present invention;
FIG. 5 is a block diagram of a model training apparatus provided by an embodiment of the present invention;
FIG. 6 is a block diagram of an entity recognition apparatus according to still another embodiment of the present invention;
fig. 7 is a block diagram of a model training apparatus according to still another embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention provides an entity identification method. Referring to fig. 1, fig. 1 is a flowchart of an entity identification method provided in an embodiment of the present invention, as shown in fig. 1, including the following steps:
and 101, acquiring a target text to be identified.
In this embodiment, the target text may be text input by the user, or may be text converted based on voice input by the user.
Step 102, determining a first scene corresponding to the target text.
In this embodiment, the scene corresponding to the target text (i.e., the first scene) may be determined by analyzing the target text. For example, if the target text is identified as containing address information and installation cost information, a delivery and installation scene may be determined; if it is identified as containing commodity model information and price information, a commodity consultation scene may be determined.
Step 103, obtaining entity recognition models corresponding to the first scene from at least two entity recognition models trained in advance, wherein different entity recognition models in the at least two entity recognition models are obtained based on corpus training of different scenes.
In this embodiment, different entity recognition models in the at least two entity recognition models are obtained based on corpus training of different scenes. For example, the entity recognition model corresponding to the commodity consultation scene may be obtained based on corpus training related to commodity consultation, and the entity recognition model corresponding to the distribution installation scene may be obtained based on corpus training related to distribution installation.
It should be noted that the entity recognition models for different scenes may also include models for several finer-grained scenes obtained by subdividing a broader scene. For example, the entity recognition model for the distribution and installation scene may be subdivided into a model for a first installation and distribution scene (such as an installation and distribution scene corresponding to the province/district) and a model for a second installation and distribution scene (i.e., an installation and distribution scene corresponding to the detailed address).
The entity recognition model may be a model trained based on one or more of a convolutional neural network, a Long Short-Term Memory (LSTM), a conditional random field (Conditional Random Field, CRF), and the like, which is not limited in this embodiment.
If the entity recognition model corresponding to the first scene exists in the at least two entity recognition models, the entity recognition model corresponding to the first scene may be obtained from the at least two entity recognition models; if not, the process may be ended, or the entity identification may be performed in other manners, for example, manually.
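The model lookup with a fallback described above can be sketched as follows. This is an illustrative sketch only, not the patent's implementation; the scene keys, the function name `select_model`, and the stand-in model strings are all hypothetical:

```python
def select_model(scene, models, fallback=None):
    """Return the entity recognition model registered for `scene`,
    or `fallback` when the scene has no dedicated model (e.g. a
    general model, or None to signal manual handling)."""
    return models.get(scene, fallback)

# Stand-ins for pre-trained per-scene models.
models = {
    "delivery_installation": "model_A",
    "commodity_consultation": "model_B",
}

assert select_model("commodity_consultation", models) == "model_B"
assert select_model("after_sales", models) is None  # no model: end or go manual
```

In this sketch a `None` result signals that no dedicated model exists, so the caller can end the process or hand the text to a human agent, matching the two fallback behaviors described above.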
Optionally, in this embodiment, a target entity recognition model obtained based on corpus training of multiple scenes may be stored in advance, so that if an entity recognition model corresponding to the first scene does not exist in the at least two entity recognition models, entity recognition may be performed on the target text based on the target entity recognition model.
And 104, inputting the target text into an entity recognition model corresponding to the first scene to obtain an entity recognition result of the target text.
In this embodiment, a target text to be identified is input into an entity identification model corresponding to a scene thereof to perform entity identification, so as to obtain an entity identification result of the target text.
An entity recognition model trained on corpora of a specific scene usually recognizes entities in texts of that scene more accurately than a model trained on corpora of all scenes. Therefore, in this embodiment, performing entity recognition on texts of different scenes with entity recognition models trained on corpora of the corresponding scenes can improve the accuracy of entity recognition.
Optionally, the step 102, that is, the determining the first scene corresponding to the target text may include:
performing intention recognition on the target text to obtain at least two intents;
and taking the intention of which the probability value meets a preset condition in the at least two intentions as a first scene corresponding to the target text.
In this embodiment, intention recognition is performed on the target text to obtain at least two intentions together with a probability value for each intention, and an intention whose probability value meets a preset condition is selected from them as the scene (i.e., the first scene) corresponding to the target text. For example, an intention whose probability value is greater than or equal to a preset probability value may be selected as a scene corresponding to the target text, where the preset probability value may be set reasonably according to actual requirements, e.g., 20% or 25%. Alternatively, the top K intentions when the probability values are sorted in descending order may be selected as the scenes corresponding to the target text, where K may likewise be set reasonably according to actual requirements, e.g., 2 or 3.
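Both selection rules described above (a probability threshold, or the top-K intentions by probability) can be sketched as follows; the function names, scene labels, and probability values are illustrative assumptions, not values from the patent:

```python
def scenes_by_threshold(intents, min_prob=0.2):
    """Keep every intent whose probability meets the preset threshold."""
    return [name for name, p in intents.items() if p >= min_prob]

def scenes_by_top_k(intents, k=2):
    """Keep the K intents with the highest probabilities."""
    ranked = sorted(intents.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical output of an intent recognition model.
intents = {"delivery_installation": 0.55,
           "commodity_consultation": 0.30,
           "size_consultation": 0.15}

assert scenes_by_threshold(intents) == ["delivery_installation",
                                        "commodity_consultation"]
assert scenes_by_top_k(intents, k=1) == ["delivery_installation"]
```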
Alternatively, the intent recognition may be performed on the target text using a pre-trained intent recognition model.
In this embodiment, if there is exactly one scene (i.e., first scene) corresponding to the target text, entity recognition may be performed directly with the entity recognition model corresponding to that scene to obtain the entity recognition result of the target text. If there are multiple corresponding scenes, the entity recognition result of the target text may be determined by combining the results produced by the entity recognition models of those scenes, or the result of one of those models may be selected as the entity recognition result of the target text.
In this embodiment, the scene corresponding to the target text is determined through intention recognition of the target text, which can improve the accuracy of scene determination.
Optionally, the obtaining an entity recognition model corresponding to the first scene from at least two entity recognition models trained in advance includes:
if the number of the first scenes is at least two, respectively acquiring entity identification models corresponding to each first scene from the at least two entity identification models;
the inputting the target text into the entity recognition model corresponding to the first scene to obtain an entity recognition result of the target text comprises:
respectively inputting the target text into entity recognition models corresponding to each first scene to obtain entity recognition results corresponding to each target entity recognition model, wherein the target entity recognition models are entity recognition models corresponding to the first scene;
and determining the entity recognition result of the target text according to the entity recognition result corresponding to each target entity recognition model.
In this embodiment, when at least two scenes corresponding to the target text are identified, entity recognition may be performed on the target text with the entity recognition model of each such scene (i.e., each target entity recognition model), yielding one entity recognition result per model. The entity recognition result of the target text is then determined from these per-model results; for example, the union of the results of all target entity recognition models may be taken as the entity recognition result of the target text.
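Taking the union of the per-model results, as described above, might look like the following sketch. The models here are plain functions returning sets of (type, value) pairs, standing in for trained recognizers; all names are illustrative:

```python
def recognize(text, scene_models):
    """Run every matched scene's model on `text` and union the entities."""
    results = set()
    for model in scene_models:
        results |= model(text)  # each stand-in model returns a set of entities
    return results

# Hypothetical per-scene recognizers (region-level vs. detailed address).
region_model = lambda text: {("province", "XX province"), ("city", "XX city")}
address_model = lambda text: {("street", "XX street")}

merged = recognize("...", [region_model, address_model])
assert ("street", "XX street") in merged and len(merged) == 3
```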
In practical applications, if a text has many entity features, entity recognition may be performed on it with the entity recognition models of several finer-grained scenes subdivided from a broader scene, yielding multiple entity recognition results from which the entity recognition result of the text is determined; this can improve the accuracy of entity recognition.
For example, if it is determined that the target text corresponds to both a first installation and distribution scene (such as an installation and distribution scene corresponding to a city district) and a second installation and distribution scene (such as an installation and distribution scene corresponding to a detailed address), entity recognition may be performed on the target text with the entity recognition model of each of the two scenes, and the union of the two entity recognition results may be taken as the entity recognition result of the target text.
In this embodiment, entity recognition can be performed on the same text with entity recognition models corresponding to different scenes, and the entity recognition result of the text determined by combining their results, which can improve the accuracy of entity recognition.
Optionally, in the case that the entity identification method of the embodiment of the present invention is applied to a customer service answering system, the method may further include: and under the condition that the entity recognition model corresponding to the first scene does not exist in the at least two entity recognition models, switching to a manual customer service mode, namely, replying to the problem input by the user manually.
Optionally, in this embodiment, corresponding questions and answers may be configured for different scenes in advance, so that after the entity recognition result of the target text is obtained, the user may be asked for whatever conditions are still needed to query the answer, based on the question-and-answer configuration of the scene corresponding to the target text. For example, for a size scene, if the question input by the user is "174, what size is recommended", the follow-up question "May I ask your weight?" may be output; for an installation and distribution scene, if the question input by the user carries incomplete address information, the follow-up question "Please supplement the complete address information." may be output. In this way, user questions can be answered accurately, and the user's experience of the question-and-answer process can be improved.
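The per-scene question configuration described above could be sketched roughly as follows; the scene keys, slot names, and prompt strings are all hypothetical placeholders, not the patent's actual configuration:

```python
# Each scene maps to a rule that returns the next follow-up question,
# or None when all conditions needed to answer are already known.
FOLLOW_UPS = {
    "size": lambda slots: None if "weight" in slots
            else "May I ask your weight?",
    "installation": lambda slots: None if "full_address" in slots
            else "Please supplement the complete address information.",
}

def next_question(scene, slots):
    ask = FOLLOW_UPS.get(scene)
    return ask(slots) if ask else None

assert next_question("size", {"height": 174}) == "May I ask your weight?"
assert next_question("size", {"height": 174, "weight": 70}) is None
```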
The following description will be given by taking the application of the customer service answering system as an example:
referring to fig. 2, the customer service answering method provided in this embodiment includes the following steps:
step 201, receiving text input by a user.
And 202, carrying out intention recognition on the text input by the user.
And 203, if an entity recognition model corresponding to the scene determined by the intention recognition exists, performing entity recognition on the text input by the user by using the entity recognition model, otherwise, entering a manual customer service mode.
In this step, after entity recognition is performed on the text input by the user using the entity recognition model, a subsequent question-answering process may be performed based on the entity recognition result.
In this embodiment, corresponding questions and answers may be configured for different scenes in advance, so that during the question-and-answer process, after entity recognition is performed on the user's question (i.e., the text input by the user) using the entity recognition model, the conditions required to query the answer can be obtained by asking the user follow-up questions. For example, for a size scene, if the question input by the user is "174, what size is recommended", the follow-up question "May I ask your weight?" may be output. If the user replies with a weight, a size recommendation can be made from the height and weight, completing the flow; if no weight is received, a human customer service agent may be alerted to assist with the user's question, and the entity information extracted from the user's question may be displayed on a panel of the agent's chat interface to help the agent complete the flow.
Alternatively, the entity recognition model may be a model based on LSTM and CRF training.
According to the embodiment, the entity recognition model is trained based on the LSTM and the CRF, so that the accuracy of entity recognition of the entity recognition model can be improved.
Optionally, the method may further include:
training the at least two entity recognition models.
For example, the obtained N corpora may be clustered to obtain at least two clusters, where N is an integer greater than 1, and different clusters in the at least two clusters correspond to different scenes; respectively obtaining target labeling information of corpus of each cluster in the at least two clusters; model training is carried out according to the corpus of each cluster and the target labeling information of each cluster in the at least two clusters respectively, so that at least two entity recognition models are obtained.
The embodiment of the invention also provides a model training method, and the at least two entity recognition models can be obtained by training based on the model training method provided by the embodiment of the invention. Referring to fig. 3, fig. 3 is a flowchart of a model training method provided by an embodiment of the present invention, as shown in fig. 3, including the following steps:
step 301, clustering the obtained N corpora to obtain at least two clusters, where N is an integer greater than 1, and different clusters in the at least two clusters correspond to different scenes.
Step 302, obtaining target labeling information of the corpus of each cluster in the at least two clusters respectively.
And 303, respectively performing model training according to the corpus of each cluster and the target labeling information of each cluster in the at least two clusters to obtain at least two entity recognition models.
In this embodiment, the collected N corpora may be clustered to obtain clusters corresponding to different scenes. Taking the application of the entity recognition method to a customer service robot as an example, the N corpora may be collected dialogue data between customer service agents and customers in a specific industry (such as the home appliance industry or the clothing industry).
Alternatively, the N collected corpora may be clustered with a clustering algorithm such as K-Means, a Gaussian mixture model, hierarchical clustering, or mean-shift clustering. Taking K-Means as an example, the number of clusters K may be set according to the number of corpora used for training (i.e., N); optionally, the larger the number of training corpora, the larger K may be set.
For example, a larger K may be set when the number of training corpora is large, and a smaller K when it is small. After clustering, the texts of, say, the top 10 clusters may be examined when K is large, or the top 5 clusters when K is small, and the texts of each cluster (i.e., category) analyzed to determine the scene corresponding to that cluster. For example, among home appliance industry clusters, a corpus such as "XX street, XX district, XX province has not been delivered; how much is the installation fee" may be determined to belong to a distribution and installation scene, while "XX television, XX size, XX model, how much does it cost" may be determined to belong to a commodity consultation scene, and so on.
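A minimal pure-Python K-Means sketch of the clustering step might look like this. Real corpora would first be vectorized (e.g., with TF-IDF features); the toy 2-D feature vectors, the value of K, and the seed below are assumptions for illustration only:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Cluster `points` (tuples of floats) into k clusters by Lloyd's algorithm."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

# Two well-separated groups of "corpus vectors" should land in two clusters,
# which would then be inspected to name their scenes.
pts = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
clusters = kmeans(pts, k=2)
assert sorted(len(c) for c in clusters) == [2, 2]
```

In practice a library implementation (e.g., scikit-learn's KMeans) would be used instead of this sketch; the point is only to make the step concrete.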
After obtaining at least two clusters, the target labeling information of each corpus in each cluster can be obtained, and an entity recognition model can then be trained from the corpora of each cluster and their target labeling information, yielding one entity recognition model per cluster, i.e., one per scene.
The target labeling information of each corpus may be labeling information received from a user for that corpus, or labeling information generated automatically from the content of the corpus. The labeling format of the target labeling information may be set reasonably according to the actual situation. For example, for address-related corpora of the installation and distribution scene, the labeling format may be: province, city, district; for model-related corpora of commodity consultation scenes, the labeling format may be: commodity name, brand, model; and so on.
In this embodiment, the N acquired corpora are clustered into at least two clusters, and the entity recognition model for each cluster is trained from that cluster's corpora and their target labeling information. This can reduce the complexity of model training and improve the accuracy with which each trained model recognizes entities in texts of its corresponding scene.
Optionally, the obtaining target labeling information of the corpus of each cluster in the at least two clusters includes:
respectively matching entities of each corpus in a first cluster through a preset regular expression, wherein the first cluster is any cluster in the at least two clusters;
generating initial labeling information of each corpus in the first cluster according to the entity of each corpus in the first cluster;
and respectively checking the initial labeling information of each corpus in the first cluster to obtain target labeling information of each corpus in the first cluster, wherein the target labeling information of the corpus is the labeling information after the initial labeling information of the corpus is checked.
In this embodiment, the entities of each corpus in each cluster may be matched by a preset regular expression. Optionally, different regular expressions may be configured for the clusters of different scenes, improving the accuracy of the regular matching.
For example, taking Python-format regular expressions as an example, for training corpora related to the distribution and installation scene, address entities may be extracted with a regular expression such as r'[\u4e00-\u9fa5]{2,5}' (matching runs of 2 to 5 Chinese characters, e.g., Hunan, Changsha, Yuelu); for training corpora related to the commodity consultation scene, height and weight entities may be extracted with a regular expression such as r'[1-9]\d{0,3}' (matching, e.g., 180 and 80).
Specifically, after the entities of each corpus are obtained by matching, initial labeling information for the corpus can be generated in a preset labeling format based on those entities. In practice, automatically generated initial labeling information is prone to errors, so the initial labeling information of each corpus can be verified, for example in combination with user operations: if the initial labeling information of a corpus is unqualified, a user's modification operation on it can be received and the information modified accordingly to obtain the target labeling information of the corpus; if the initial labeling information of a corpus is qualified, it can be used directly as the target labeling information of the corpus.
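A minimal sketch of the regex matching and initial-label generation described above, using illustrative English-language patterns in place of the per-scene Chinese-character expressions (the scene names and patterns are assumptions, not the patent's exact expressions):

```python
import re

# Hypothetical per-scene regular expressions; stand-ins for the
# per-cluster patterns described in the text.
SCENE_PATTERNS = {
    # address-like entities: a capitalized word followed by a level marker
    "delivery_install": re.compile(r"[A-Z][a-z]+ (?:Province|City|District)"),
    # height/weight-like entities: 1- to 4-digit numbers
    "product_consult": re.compile(r"[1-9]\d{0,3}"),
}

def initial_labels(scene, corpus):
    """Match entities with the scene's regex and emit initial labeling info,
    which would then be verified (possibly with user edits) to become the
    target labeling information."""
    pattern = SCENE_PATTERNS[scene]
    entities = [m.group(0) for m in pattern.finditer(corpus)]
    return {"corpus": corpus, "entities": entities}

label = initial_labels("delivery_install", "Ship to Hunan Province Changsha City")
```

The verification step is not shown: in the flow above, `label` would either be confirmed as-is or corrected by the user before becoming training data.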
According to this method, the initial labeling information of each corpus is generated automatically and then verified to obtain the target labeling information, which ensures labeling accuracy while improving labeling efficiency and saving labor cost.
Optionally, the generating initial labeling information of each corpus in the first cluster according to the entity of each corpus in the first cluster may include:
if the number M of entities of the first corpus exceeds a preset value, dividing the M entities of the first corpus into at least two entity groups according to a preset grouping rule, wherein the first corpus is any corpus in the first cluster, and the number of entities in each of the at least two entity groups is less than or equal to R;
generating initial labeling information according to the entity of each entity group in the at least two entity groups respectively, and obtaining at least two initial labeling information of the first corpus.
In this embodiment, the preset value may be set reasonably according to actual requirements; for example, it may be 5, 6, or the like.
The preset grouping rule can likewise be set reasonably according to actual demands, and optionally, corpora of different scenes may correspond to different preset grouping rules. For example, address-related corpora may be grouped by address level: a piece of address information can be labeled at two levels, the province-city-district level (namely, the XX district of XX city, XX province) and the detailed-address level (namely, No. XX, Section XX, Area XX, XX Street), yielding two pieces of initial labeling information for the corpus.
Optionally, the multiple pieces of labeling information of a corpus can be used to train entity recognition models for different sub-scenes of the scene corresponding to the corpus. For example, for the delivery and installation scene, the entity recognition model corresponding to a first installation-delivery sub-scene (e.g., the sub-scene corresponding to the province-city-district level) can be trained from the province-city-district labeling information of the corpora, and the entity recognition model corresponding to a second installation-delivery sub-scene (e.g., the sub-scene corresponding to the detailed address) can be trained from the detailed-address labeling information.
In practical applications, the more entity features a corpus has, the lower the labeling efficiency. Moreover, more entity features require a larger amount of training corpus, and the prediction accuracy of the entity recognition model drops accordingly. This embodiment therefore further subdivides corpora with many entity features by scene: when the number M of entities of the first corpus exceeds the preset value, the M entities are grouped according to the preset grouping rule so that the number of entities labeled per corpus does not exceed the preset value. This improves labeling efficiency, reduces the training complexity of the entity recognition model, and improves the entity recognition accuracy of the trained model.
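The grouping step can be sketched as follows. The patent's grouping rule is scene-specific (e.g., grouping address entities by address level), so the fixed-size chunking below is a simplified, hypothetical stand-in in which R equals the chunk size:

```python
def group_entities(entities, preset=5):
    """Split an entity list into groups of at most `preset` entities.
    Each group would yield one piece of initial labeling information.
    Fixed-size chunking is an illustrative stand-in for the patent's
    scene-specific preset grouping rule."""
    if len(entities) <= preset:
        return [entities]
    return [entities[i:i + preset] for i in range(0, len(entities), preset)]

groups = group_entities(["a", "b", "c", "d", "e", "f", "g"], preset=3)
```

With seven entities and a preset value of 3, the corpus is labeled three times, once per group, so no single labeling record exceeds the preset number of entities.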
Optionally, the training of the model is performed according to the corpus of each cluster and the target labeling information thereof to obtain at least two entity recognition models, which may include:
training the LSTM and the CRF according to the corpus and the target labeling information of each cluster in the at least two clusters to obtain at least two entity recognition models.
According to the embodiment, the entity recognition model is trained based on the LSTM and the CRF, so that the accuracy of entity recognition of the entity recognition model can be improved.
Referring to fig. 4, fig. 4 is a block diagram of an entity recognition apparatus according to an embodiment of the present invention. As shown in fig. 4, the entity recognition apparatus 400 includes:
a first obtaining module 401, configured to obtain a target text to be identified;
a determining module 402, configured to determine a first scene corresponding to the target text;
a second obtaining module 403, configured to obtain entity recognition models corresponding to the first scene from at least two entity recognition models trained in advance, where different entity recognition models in the at least two entity recognition models are obtained based on corpus training of different scenes;
and an input module 404, configured to input the target text into an entity recognition model corresponding to the first scene, to obtain an entity recognition result of the target text.
Optionally, the determining module is specifically configured to:
performing intention recognition on the target text to obtain at least two intentions;
and taking, among the at least two intentions, an intention whose probability value meets a preset condition as the first scene corresponding to the target text.
Optionally, the second obtaining module is specifically configured to:
if the number of the first scenes is at least two, respectively acquiring entity identification models corresponding to each first scene from the at least two entity identification models;
the input module is specifically used for:
respectively inputting the target text into entity recognition models corresponding to each first scene to obtain entity recognition results corresponding to each target entity recognition model, wherein the target entity recognition models are entity recognition models corresponding to the first scene;
and determining the entity recognition result of the target text according to the entity recognition result corresponding to each target entity recognition model.
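The flow above (select every scene whose intent probability qualifies, run each scene's entity recognition model, merge the results) can be sketched as follows; the toy intent scores, threshold, and stand-in recognizers are illustrative assumptions, not the patent's trained models:

```python
def recognize(text, intent_scores, models, threshold=0.3):
    """Run the entity recognition model of every scene whose intent
    probability clears the threshold, and return the union of results."""
    scenes = [s for s, p in intent_scores.items() if p >= threshold]
    result = set()
    for scene in scenes:
        result |= set(models[scene](text))
    return result

# Stand-in per-scene "models": trivial token filters for illustration.
models = {
    "delivery_install": lambda t: [w for w in t.split() if w.istitle()],
    "product_consult": lambda t: [w for w in t.split() if w.isdigit()],
}
scores = {"delivery_install": 0.6, "product_consult": 0.4}
entities = recognize("Deliver model 180 to Hunan", scores, models)
```

Taking the union means an entity found by any qualifying scene's model appears in the final result, which matches the claim language on combining the per-model recognition results.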
Optionally, the entity identification model is a model obtained based on long-short term memory network LSTM and conditional random field CRF training.
Optionally, the apparatus further includes:
and the training module is used for training the at least two entity recognition models.
The entity recognition device 400 provided in the embodiment of the present invention can implement each process in the above embodiment of the entity recognition method, and in order to avoid repetition, a description thereof is omitted here.
The entity recognition device 400 of the embodiment of the present invention, a first obtaining module 401, configured to obtain a target text to be recognized; a determining module 402, configured to determine a first scene corresponding to the target text; a second obtaining module 403, configured to obtain entity recognition models corresponding to the first scene from at least two entity recognition models trained in advance, where different entity recognition models in the at least two entity recognition models are obtained based on corpus training of different scenes; and an input module 404, configured to input the target text into an entity recognition model corresponding to the first scene, to obtain an entity recognition result of the target text. The texts of different scenes are subjected to entity recognition through the entity recognition model obtained through corpus training of different scenes, so that the accuracy of entity recognition can be improved.
Referring to fig. 5, fig. 5 is a block diagram of a model training apparatus according to an embodiment of the present invention. As shown in fig. 5, the model training apparatus 500 includes:
The clustering module 501 is configured to cluster the obtained N corpora to obtain at least two clusters before obtaining the entity recognition model corresponding to the first scene from the at least two entity recognition models trained in advance, where N is an integer greater than 1, and different clusters in the at least two clusters correspond to different scenes;
an obtaining module 502, configured to obtain target labeling information of corpus of each cluster in the at least two clusters respectively;
and the training module 503 is configured to perform model training according to the corpus of each cluster and the target labeling information thereof, so as to obtain at least two entity recognition models.
Optionally, the acquiring module includes:
the matching unit is used for respectively matching the entity of each corpus in the first cluster through a preset regular expression, wherein the first cluster is any cluster in the at least two clusters;
the generating unit is used for generating initial labeling information of each corpus in the first cluster according to the entity of each corpus in the first cluster;
the verification unit is used for verifying the initial labeling information of each corpus in the first cluster to obtain target labeling information of each corpus in the first cluster, wherein the target labeling information of the corpus is the labeling information after the initial labeling information of the corpus is verified.
Optionally, the generating unit is specifically configured to:
if the number M of entities of the first corpus exceeds a preset value, dividing the M entities of the first corpus into at least two entity groups according to a preset grouping rule, wherein the first corpus is any corpus in the first cluster, and the number of entities in each of the at least two entity groups is less than or equal to R;
generating initial labeling information according to the entity of each entity group in the at least two entity groups respectively, and obtaining at least two initial labeling information of the first corpus.
Optionally, the training module is specifically configured to:
training the LSTM and the CRF according to the corpus and the target labeling information of each cluster in the at least two clusters to obtain at least two entity recognition models.
The model training apparatus 500 provided in the embodiment of the present invention can implement each process in the embodiment of the model training method, and in order to avoid repetition, a detailed description is omitted here.
The model training device 500 of the embodiment of the present invention, a clustering module 501, configured to cluster N corpora obtained before obtaining an entity recognition model corresponding to the first scene from at least two entity recognition models trained in advance, to obtain at least two clusters, where N is an integer greater than 1, and different clusters in the at least two clusters correspond to different scenes; an obtaining module 502, configured to obtain target labeling information of corpus of each cluster in the at least two clusters respectively; and the training module 503 is configured to perform model training according to the corpus of each cluster and the target labeling information thereof, so as to obtain at least two entity recognition models. The complexity of model training can be reduced, and the accuracy of the training-obtained entity recognition model in recognizing the text entity of the corresponding scene can be improved.
Referring to fig. 6, fig. 6 is a block diagram of an entity recognition apparatus according to still another embodiment of the present invention, and as shown in fig. 6, an entity recognition apparatus 600 includes: a processor 601, a memory 602 and a computer program stored on the memory 602 and executable on the processor, the individual components of the entity identification device 600 being coupled together by a bus interface 603, the computer program when executed by the processor 601 performing the steps of:
acquiring a target text to be identified;
determining a first scene corresponding to the target text;
obtaining entity recognition models corresponding to the first scene from at least two entity recognition models trained in advance, wherein different entity recognition models in the at least two entity recognition models are obtained based on corpus training of different scenes;
and inputting the target text into an entity recognition model corresponding to the first scene to obtain an entity recognition result of the target text.
Optionally, the computer program when executed by the processor 601 is further configured to:
performing intention recognition on the target text to obtain at least two intents;
and taking the intention of which the probability value meets a preset condition in the at least two intentions as a first scene corresponding to the target text.
Optionally, the computer program when executed by the processor 601 is further configured to:
if the number of the first scenes is at least two, respectively acquiring entity identification models corresponding to each first scene from the at least two entity identification models;
respectively inputting the target text into entity recognition models corresponding to each first scene to obtain entity recognition results corresponding to each target entity recognition model, wherein the target entity recognition models are entity recognition models corresponding to the first scene;
and determining the entity recognition result of the target text according to the entity recognition result corresponding to each target entity recognition model.
Optionally, the entity identification model is a model obtained based on long-short term memory network LSTM and conditional random field CRF training.
Referring to fig. 7, fig. 7 is a block diagram of a model training apparatus according to still another embodiment of the present invention, and as shown in fig. 7, a model training apparatus 700 includes: a processor 701, a memory 702, and a computer program stored on the memory 702 and executable on the processor, the components of the model training apparatus 700 being coupled together by a bus interface 703, the computer program when executed by the processor 701 performing the steps of:
Clustering the N obtained corpus to obtain at least two clusters, wherein N is an integer greater than 1, and different clusters in the at least two clusters correspond to different scenes;
respectively obtaining target labeling information of corpus of each cluster in the at least two clusters;
model training is carried out according to the corpus of each cluster and the target labeling information of each cluster in the at least two clusters respectively, so that at least two entity recognition models are obtained.
Optionally, the computer program is further configured to, when executed by the processor 701:
respectively matching entities of each corpus in a first cluster through a preset regular expression, wherein the first cluster is any cluster in the at least two clusters;
generating initial labeling information of each corpus in the first cluster according to the entity of each corpus in the first cluster;
and respectively checking the initial labeling information of each corpus in the first cluster to obtain target labeling information of each corpus in the first cluster, wherein the target labeling information of the corpus is the labeling information after the initial labeling information of the corpus is checked.
Optionally, the computer program is further configured to, when executed by the processor 701:
if the number M of the entities of the first corpus exceeds a preset value, dividing the M entities of the first corpus into at least two entity groups according to a preset grouping rule, wherein the first corpus is any corpus in the first cluster, and the number of the entities of each entity group in the at least two entity groups is smaller than or equal to R;
generating initial labeling information according to the entity of each entity group in the at least two entity groups respectively, and obtaining at least two initial labeling information of the first corpus.
Optionally, the computer program is further configured to, when executed by the processor 701:
training the LSTM and the CRF according to the corpus and the target labeling information of each cluster in the at least two clusters to obtain at least two entity recognition models.
The embodiment of the invention also provides an electronic device, which comprises a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program realizes all the processes of the embodiment of the entity identification method or all the processes of the embodiment of the model training method when being executed by the processor, and can achieve the same technical effects, and the repetition is avoided, so that the description is omitted.
The embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the above entity recognition method embodiment or of the above model training method embodiment, and can achieve the same technical effects; to avoid repetition, details are not repeated here. The computer-readable storage medium may be, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present invention and the scope of the claims, which are to be protected by the present invention.

Claims (8)

1. A method of model training, comprising:
clustering the N obtained corpus to obtain at least two clusters, wherein N is an integer greater than 1, and different clusters in the at least two clusters correspond to different scenes;
respectively obtaining target labeling information of corpus of each cluster in the at least two clusters;
model training is carried out according to the corpus of each cluster and the target labeling information of each cluster in the at least two clusters respectively to obtain at least two entity recognition models;
the obtaining target labeling information of the corpus of each cluster in the at least two clusters respectively includes:
respectively matching entities of each corpus in a first cluster through a preset regular expression, wherein the first cluster is any cluster in the at least two clusters;
generating initial labeling information of each corpus in the first cluster according to the entity of each corpus in the first cluster;
respectively checking the initial labeling information of each corpus in the first cluster to obtain target labeling information of each corpus in the first cluster, wherein the target labeling information of the corpus is the labeling information after the initial labeling information of the corpus is checked;
The generating initial labeling information of each corpus in the first cluster according to the entity of each corpus in the first cluster includes:
if the number M of the entities of the first corpus exceeds a preset value, dividing the M entities of the first corpus into at least two entity groups according to a preset grouping rule, wherein the first corpus is any corpus in the first cluster, and the number of the entities of each entity group in the at least two entity groups is smaller than or equal to R;
generating initial annotation information according to the entity of each entity group in the at least two entity groups respectively to obtain at least two initial annotation information of the first corpus, wherein each initial annotation information in the at least two initial annotation information of the first corpus is used for training entity recognition models of different scenes subdivided by the scene corresponding to the first corpus respectively.
2. The method according to claim 1, wherein the performing model training according to the corpus of each cluster and the target labeling information thereof to obtain at least two entity recognition models includes:
training the LSTM and the CRF according to the corpus and the target labeling information of each cluster in the at least two clusters to obtain at least two entity recognition models.
3. A method of entity identification, comprising:
acquiring a target text to be identified;
determining a first scene corresponding to the target text;
obtaining entity recognition models corresponding to the first scene from at least two entity recognition models trained in advance, wherein the at least two entity recognition models are trained based on the model training method according to any one of claims 1 to 2;
inputting the target text into an entity recognition model corresponding to the first scene to obtain an entity recognition result of the target text;
the determining the first scene corresponding to the target text comprises the following steps:
performing intention recognition on the target text to obtain at least two intents;
taking the intention of which the probability value meets a preset condition in the at least two intentions as a first scene corresponding to the target text;
the obtaining the entity recognition model corresponding to the first scene from at least two entity recognition models trained in advance comprises the following steps:
if the number of the first scenes is at least two, respectively acquiring entity identification models corresponding to each first scene from the at least two entity identification models;
Inputting the target text into an entity recognition model corresponding to the first scene to obtain an entity recognition result of the target text, wherein the entity recognition result comprises:
respectively inputting the target text into entity recognition models corresponding to each first scene to obtain entity recognition results corresponding to each target entity recognition model, wherein the target entity recognition models are entity recognition models corresponding to the first scene;
and determining the union of the entity recognition results corresponding to all the target entity recognition models as the entity recognition result corresponding to the target text.
4. A method according to claim 3, wherein the entity recognition model is a model trained based on a long and short term memory network LSTM and a conditional random field CRF.
5. A model training device, comprising:
the clustering module is used for clustering the acquired N corpus to obtain at least two clusters, wherein N is an integer greater than 1, and different clusters in the at least two clusters correspond to different scenes;
the acquisition module is used for respectively acquiring target annotation information of the corpus of each cluster in the at least two clusters;
The training module is used for carrying out model training according to the corpus of each cluster and the target labeling information of each cluster in the at least two clusters respectively to obtain at least two entity identification models;
wherein, the acquisition module includes:
the matching unit is used for respectively matching the entity of each corpus in the first cluster through a preset regular expression, wherein the first cluster is any cluster in the at least two clusters;
the generating unit is used for generating initial labeling information of each corpus in the first cluster according to the entity of each corpus in the first cluster;
the verification unit is used for verifying the initial labeling information of each corpus in the first cluster to obtain target labeling information of each corpus in the first cluster, wherein the target labeling information of the corpus is the labeling information after the initial labeling information of the corpus is verified;
the generating unit is specifically configured to:
if the number M of the entities of the first corpus exceeds a preset value, dividing the M entities of the first corpus into at least two entity groups according to a preset grouping rule, wherein the first corpus is any corpus in the first cluster, and the number of the entities of each entity group in the at least two entity groups is smaller than or equal to R;
Generating initial annotation information according to the entity of each entity group in the at least two entity groups respectively to obtain at least two initial annotation information of the first corpus, wherein each initial annotation information in the at least two initial annotation information of the first corpus is used for training entity recognition models of different scenes subdivided by the scene corresponding to the first corpus respectively.
6. An entity identification device, comprising:
the first acquisition module is used for acquiring a target text to be identified;
the determining module is used for determining a first scene corresponding to the target text;
a second obtaining module, configured to obtain an entity recognition model corresponding to the first scene from at least two entity recognition models trained in advance, where the at least two entity recognition models are obtained by training based on the model training method according to any one of claims 1 to 2;
the input module is used for inputting the target text into an entity recognition model corresponding to the first scene to obtain an entity recognition result of the target text;
the determining module is specifically configured to:
performing intention recognition on the target text to obtain at least two intents;
Taking the intention of which the probability value meets a preset condition in the at least two intentions as a first scene corresponding to the target text;
the second obtaining module is specifically configured to:
if the number of the first scenes is at least two, respectively acquiring entity identification models corresponding to each first scene from the at least two entity identification models;
the input module is specifically used for:
respectively inputting the target text into entity recognition models corresponding to each first scene to obtain entity recognition results corresponding to each target entity recognition model, wherein the target entity recognition models are entity recognition models corresponding to the first scene;
and determining the union of the entity recognition results corresponding to all the target entity recognition models as the entity recognition result corresponding to the target text.
7. An electronic device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the model training method of any one of claims 1 to 2 or the steps of the entity identification method of any one of claims 3 to 4.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the model training method according to any of claims 1 to 2 or the steps of the entity identification method according to any of claims 3 to 4.
CN201911118481.5A 2019-11-15 2019-11-15 Entity identification method, model training method and device Active CN112818689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911118481.5A CN112818689B (en) 2019-11-15 2019-11-15 Entity identification method, model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911118481.5A CN112818689B (en) 2019-11-15 2019-11-15 Entity identification method, model training method and device

Publications (2)

Publication Number Publication Date
CN112818689A CN112818689A (en) 2021-05-18
CN112818689B true CN112818689B (en) 2023-07-21

Family

ID=75851678

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911118481.5A Active CN112818689B (en) 2019-11-15 2019-11-15 Entity identification method, model training method and device

Country Status (1)

Country Link
CN (1) CN112818689B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition
CN114677691B (en) * 2022-04-06 2023-10-03 北京百度网讯科技有限公司 Text recognition method, device, electronic equipment and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
US6052682A (en) * 1997-05-02 2000-04-18 Bbn Corporation Method of and apparatus for recognizing and labeling instances of name classes in textual environments
CN105488044A (en) * 2014-09-16 2016-04-13 华为技术有限公司 Data processing method and device
CN109992763A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 Language marks processing method, system, electronic equipment and computer-readable medium

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
US20160335544A1 (en) * 2015-05-12 2016-11-17 Claudia Bretschneider Method and Apparatus for Generating a Knowledge Data Model
CN108304372B (en) * 2017-09-29 2021-08-03 腾讯科技(深圳)有限公司 Entity extraction method and device, computer equipment and storage medium
CN108255816A (en) * 2018-03-12 2018-07-06 北京神州泰岳软件股份有限公司 A kind of name entity recognition method, apparatus and system
CN109344401B (en) * 2018-09-18 2023-04-28 深圳市元征科技股份有限公司 Named entity recognition model training method, named entity recognition method and named entity recognition device
CN109522393A (en) * 2018-10-11 2019-03-26 平安科技(深圳)有限公司 Intelligent answer method, apparatus, computer equipment and storage medium
CN109493166B (en) * 2018-10-23 2021-12-28 深圳智能思创科技有限公司 Construction method for task type dialogue system aiming at e-commerce shopping guide scene
CN110287283B (en) * 2019-05-22 2023-08-01 中国平安财产保险股份有限公司 Intention model training method, intention recognition method, device, equipment and medium
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium

Non-Patent Citations (2)

Title
Microblog Text Summarization Based on Entity Relation Networks; Xue Zhujun; Yang Shuqiang; Shu Yangxue; Computer Science (No. 09); 77-81 *
Domain Entity Disambiguation Combining Word Embeddings and Topic Models; Ma Xiaojun; Guo Jianyi; Wang Hongbin; Zhang Zhikun; Xian Yantuan; Yu Zhengtao; Pattern Recognition and Artificial Intelligence (No. 12); 1130-1137 *

Also Published As

Publication number Publication date
CN112818689A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN110502608B (en) Man-machine conversation method and man-machine conversation device based on knowledge graph
CN109165291B (en) Text matching method and electronic equipment
CN112163153B (en) Industry label determining method, device, equipment and storage medium
CN112527972A (en) Intelligent customer service chat robot implementation method and system based on deep learning
CN107688651B (en) News emotion direction judgment method, electronic device and computer readable storage medium
CN112818689B (en) Entity identification method, model training method and device
CN112417158A (en) Training method, classification method, device and equipment of text data classification model
CN111177307A (en) Test scheme and system based on semantic understanding similarity threshold configuration
CN113032520A (en) Information analysis method and device, electronic equipment and computer readable storage medium
CN112507095A (en) Information identification method based on weak supervised learning and related equipment
CN113127621A (en) Dialogue module pushing method, device, equipment and storage medium
CN113705250B (en) Session content identification method, device, equipment and computer readable medium
CN111402864A (en) Voice processing method and electronic equipment
CN114969352A (en) Text processing method, system, storage medium and electronic equipment
CN114817507A (en) Reply recommendation method, device, equipment and storage medium based on intention recognition
CN117911039A (en) Control method, equipment and storage medium for after-sales service system
CN117370512A (en) Method, device, equipment and storage medium for replying to dialogue
CN111651554A (en) Insurance question-answer method and device based on natural language understanding and processing
CN113177061B (en) Searching method and device and electronic equipment
CN112328871B (en) Reply generation method, device, equipment and storage medium based on RPA module
CN112836036B (en) Interactive training method and device for intelligent agent, terminal and storage medium
CN114186048A (en) Question-answer replying method and device based on artificial intelligence, computer equipment and medium
CN113449506A (en) Data detection method, device and equipment and readable storage medium
CN113010664A (en) Data processing method and device and computer equipment
CN114548103B (en) Named entity recognition model training method and named entity recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant