CN113255355A - Entity identification method and device in text information, electronic equipment and storage medium - Google Patents

Entity identification method and device in text information, electronic equipment and storage medium

Info

Publication number
CN113255355A
CN113255355A
Authority
CN
China
Prior art keywords
information
training
text
model
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110634748.7A
Other languages
Chinese (zh)
Inventor
王博
薛小娜
张文剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN202110634748.7A priority Critical patent/CN113255355A/en
Publication of CN113255355A publication Critical patent/CN113255355A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and an apparatus for identifying an entity in text information, an electronic device, and a storage medium, and belongs to the technical field of entity identification. The method comprises the following steps: clustering a plurality of pieces of text information by a clustering scheme to obtain a plurality of clusters; dividing the text information in each cluster into unlabeled information and labeled information by a preset dividing scheme; clustering and dividing the unlabeled information of the plurality of clusters by the clustering scheme and the preset dividing scheme until no new labeled information can be obtained; continuing training a pre-training model on the basis of the text information to obtain an initial training model; and performing model training on all the text information based on the initial training model, so that the entity in the text information can be identified by the trained model. The method and the apparatus improve labeling efficiency, and the continued-training approach improves the generalization ability and training precision of the model.

Description

Entity identification method and device in text information, electronic equipment and storage medium
Technical Field
The present application relates to the field of entity identification technologies, and in particular, to a method and an apparatus for identifying an entity in text information, an electronic device, and a storage medium.
Background
With the continuous development of Internet technology, identifying entities in text information has found wide application. For example, instant messaging tools provide a group chat function through which users can communicate, and the group chat interface includes a group chat name. For an enterprise, the group chat name usually contains a client name, and extracting the client name from the group chat name has practical application value for the enterprise.
Two methods are mainly used to identify entities in text information: rule-based extraction and machine learning. The rule-based extraction method requires manually summarizing the patterns contained in text information, predefining rules according to those patterns, and identifying an entity by combining a dictionary, a word segmentation tool, and the rules. When the entity is a client name, the client name in a group chat name usually appears in colloquial form, as an abbreviation, as pinyin initials, and the like, so the quality of the group chat name is poor, the word segmentation tool is no longer applicable, and the dictionary needs to be rebuilt; the implementation of the rule-based method is therefore cumbersome.
In the machine learning approach, a training set is constructed from a large amount of labeled information, a suitable machine learning model is selected and trained, and the trained model is then used to identify and extract entities from text information. At present, the labeled information is annotated manually; producing a large amount of labeled information consumes manpower and is inefficient.
In addition, machine learning in the prior art depends on feature construction and typically adopts a pre-training model trained on a large-scale general corpus, but such a pre-training model lacks sensitivity to in-domain data.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method and an apparatus for identifying an entity in text information, an electronic device, and a storage medium, so as to solve the problems of complex processes and labor consumption in entity identification. The specific technical solutions are as follows:
in a first aspect, a method for identifying an entity in text information is provided, where the method includes:
clustering a plurality of pieces of text information by a clustering scheme to obtain a plurality of clusters, wherein each cluster comprises at least one piece of text information;
dividing the text information in each cluster into unlabeled information and labeled information by a preset dividing scheme, wherein the preset dividing scheme is associated with the same entity in the text information;
clustering and dividing the unlabeled information of the plurality of clusters by the clustering scheme and the preset dividing scheme until no new labeled information can be obtained;
continuing training a pre-training model on the basis of the text information to obtain an initial training model, wherein the pre-training model is a model generated by training on a large-scale general corpus;
and performing model training on all the text information based on the initial training model, so as to identify the entity in the text information by the trained model.
Optionally, dividing the text information in each cluster into labeled information by the preset dividing scheme includes:
determining the target entity that occurs most frequently in the text information of the cluster;
determining the number of occurrences of the target entity and the number of pieces of text information in the cluster;
determining all the text information in the cluster as candidate labeled information when the number of occurrences and the number of pieces of text satisfy a preset ratio;
and taking the text information carrying the target entity in the candidate labeled information as the labeled information.
Optionally, dividing the text information in each cluster into unlabeled information by the preset dividing scheme includes:
determining all the text information in the cluster as unlabeled information when the number of occurrences and the number of pieces of text do not satisfy the preset ratio;
and taking the text information that does not carry the target entity in the candidate labeled information as unlabeled information.
Optionally, after the text information carrying the target entity in the candidate labeled information is taken as the labeled information, the method further includes:
taking the target entity as the label of the labeled information;
and annotating the labeled information in a preset annotation format to obtain annotation information.
Optionally, continuing training the pre-training model on the basis of the text information comprises:
inputting the plurality of pieces of text information into the pre-training model, so that the pre-training model continues training on the basis of the text information, thereby obtaining a continued-training pre-training model, wherein the pre-training model is used for acquiring prior information hidden in the text information, and the prior information comprises structural information and format information in the text information.
Optionally, performing model training on all the labeled information based on the initial training model comprises:
inputting the text information into the continued-training pre-training model to obtain a target vector of the text information output by the continued-training pre-training model, wherein the target vector is associated with the word granularity of the text information;
inputting the target vector into a recognition model to obtain prediction information, corresponding to the text information, output by the recognition model;
and updating the parameters of a loss function according to the prediction information and the annotation information until the difference between the prediction information and the annotation information is minimized, wherein the weights of the continued-training pre-training model are updated during the parameter update of the loss function.
Optionally, clustering the pieces of text information by the clustering scheme to obtain the plurality of clusters comprises:
inputting initial text information into a vector model to obtain vector information of the initial text information output by the vector model;
clustering a plurality of pieces of vector information by the clustering scheme to obtain the plurality of clusters, wherein each cluster comprises at least one piece of vector information;
and mapping the vector information in each cluster to text information according to a mapping relationship, wherein the mapping relationship is the mapping between the vector information and the text information.
In a second aspect, an apparatus for entity recognition in text information is provided, the apparatus comprising:
the clustering module is used for clustering a plurality of pieces of text information by a clustering scheme to obtain a plurality of clusters, wherein each cluster comprises at least one piece of text information;
the dividing module is used for dividing the text information in each cluster into unlabeled information and labeled information by a preset dividing scheme, wherein the preset dividing scheme is associated with the same entity in the text information;
the clustering and dividing module is used for clustering and dividing the unlabeled information of the plurality of clusters by the clustering scheme and the preset dividing scheme until no new labeled information can be obtained;
the continued training module is used for continuing training a pre-training model on the basis of the text information to obtain an initial training model, wherein the pre-training model is a model generated by training on a large-scale general corpus;
and the training module is used for performing model training on all the text information based on the initial training model, so as to identify the entity in the text information by the trained model.
In a third aspect, an electronic device is provided, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
and a processor for implementing the steps of any of the above methods for entity identification in text information when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, in which a computer program is stored; when executed by a processor, the computer program carries out the steps of any of the above methods for entity identification in text information.
The embodiment of the application has the following beneficial effects:
the application is applied to natural language processing in the technical field of deep learning, and the embodiment of the application provides an entity recognition method in text information. In the application, a plurality of clusters are obtained through clustering at first, then the text information in each cluster is divided into the label-free information and the labeled information, and due to the fact that a plurality of clusters are arranged, the label-free information of the clusters can be comprehensively gathered, clustering and division are carried out again, so that the labeled information can be found to the greatest extent, the number of the labeled information is increased, the labeling mode is simple, rules do not need to be set, automatic server labeling is also achieved, manual data labeling is not needed, and the labeling efficiency is improved. According to the method and the device, the hidden prior information in the text information is obtained through continuous training, the accuracy of entity identification in the text information is improved, and the generalization capability of the model is improved.
Of course, not all of the above advantages need be achieved in the practice of any one product or method of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a method for entity identification in text information according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of model training provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of an entity identification apparatus in text information according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for convenience of description of the present application and have no specific meaning in themselves; thus, "module" and "component" may be used interchangeably.
In order to solve the problems mentioned in the background, according to an aspect of embodiments of the present application, an embodiment of a method for identifying an entity in text information is provided.
The embodiment of the application provides an entity identification method in text information, which can be applied to a server and used for identifying an entity from the text information.
The following describes in detail an entity identification method in text information provided in an embodiment of the present application with reference to a specific embodiment, and as shown in fig. 1, the specific steps are as follows:
step 101: and clustering the plurality of text messages by a clustering scheme to obtain a plurality of clusters.
Wherein each class cluster comprises at least one text message.
In the embodiment of the application, the server needs to automatically label the text messages, and therefore, the text messages need to be clustered by a clustering scheme to obtain a plurality of clusters, and each cluster contains at least one text message. Each text message may or may not include at least an entity. The clustering scheme can also be a K-Means clustering algorithm, and the clustering scheme is not particularly limited in the application.
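As an illustration only, this clustering step can be sketched with K-Means from scikit-learn; the character-n-gram TF-IDF vectorization, the cluster count, and the example group names below are assumptions for demonstration, since the embodiment leaves the clustering scheme open:

# A minimal sketch of step 101, assuming scikit-learn K-Means and a
# TF-IDF vectorization; the patent fixes neither choice.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["xx sports group", "xx sports squad", "xx sports team", "xx star group"]

vectors = TfidfVectorizer(analyzer="char", ngram_range=(1, 2)).fit_transform(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Gather the original texts back into clusters by cluster index.
clusters: dict[int, list[str]] = {}
for text, label in zip(texts, labels):
    clusters.setdefault(label, []).append(text)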
Step 102: and dividing the text information in each class cluster into non-tag information and tag information by a preset dividing scheme.
Wherein the preset partitioning scheme is associated with the same entity in the text information.
In the embodiment of the application, for a plurality of text messages in each class cluster, a server determines the same entity in the plurality of text messages, determines a preset dividing scheme according to the occurrence times of the same entity and the number of the text messages, and then divides the text messages into non-tag information and tagged information, wherein the non-tag information does not carry tags, the tagged information carries tags, and the tags can be determined by the same entity.
Step 103: and clustering and dividing the label-free information of the plurality of clusters through a clustering scheme and a preset dividing scheme until new labeled information cannot be obtained.
In the embodiment of the application, after the server obtains the non-tag information, because each cluster can obtain the non-tag information, the server can comprehensively summarize a plurality of non-tag information, and for the plurality of non-tag information after the comprehensive summarization, clustering is performed according to a clustering scheme, then division is performed according to a preset division scheme, and new tagged information and new non-tag information are obtained again. And then repeating the steps for the obtained non-tag information until new tagged information cannot be obtained, so that the tagged information can be mined to the maximum extent through multiple iterations. For the remaining unlabeled information, the server may set its label as a corpus negative example.
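For illustration, the iterative mining of step 103 can be sketched as the following loop; the helper functions cluster_texts and split_cluster are hypothetical stand-ins for the clustering scheme and the preset dividing scheme described above:

def mine_labeled_information(texts, cluster_texts, split_cluster):
    # Repeat cluster -> divide until an iteration yields no new labeled
    # information; the leftovers can be kept as corpus negative examples.
    labeled, unlabeled = [], list(texts)
    while unlabeled:
        new_labeled, remaining = [], []
        for cluster in cluster_texts(unlabeled):
            tagged, untagged = split_cluster(cluster)
            new_labeled.extend(tagged)
            remaining.extend(untagged)
        if not new_labeled:
            break
        labeled.extend(new_labeled)
        unlabeled = remaining
    return labeled, unlabeled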
Step 104: and continuously training on the basis of the text information based on the pre-training model to obtain an initial training model.
The pre-training model is a model generated based on large-scale general corpus training.
In the embodiment of the application, the pre-training model is a model based on Chinese large-scale general corpus training, the application carries out continuous training on the basis of the pre-training model on the basis of the text information to obtain the initial training model, the prior information hidden in the text information can be obtained, the accuracy of entity recognition in the text information is improved, and the generalization capability of the initial training model is improved.
Step 105: model training is carried out on all information with labels based on the initial training model, and the entity in the text information is identified through the trained model.
In the embodiment of the application, the pre-training model belongs to one sub-model in the initial training model, and after the accuracy of entity identification in the text information is improved through the pre-training model, model training is carried out on all labeled information based on the initial training model, so that the trained model can identify the entity more accurately.
In the method, a plurality of clusters are obtained through clustering, text information in each cluster is divided into non-tag information and tagged information, and due to the fact that the plurality of clusters are arranged, the non-tag information of the plurality of clusters can be comprehensively gathered and clustered and divided again, so that the tagged information can be found to the maximum extent, and the number of the tagged information is increased. In the process of identifying the entity, word segmentation and a dictionary are not used, rules do not need to be set, the identification mode is simplified, meanwhile, automatic server marking is achieved, manual marking of data is not needed, manpower is saved, and marking efficiency is improved. The method and the device have the advantages that the hidden prior information in the text information is obtained through continuous training, the accuracy of entity identification in the text information is improved, the generalization capability of the model is improved, and the subsequent model identification entity is more accurate.
As an optional implementation, dividing the text information in each cluster into labeled information by the preset dividing scheme includes: determining the target entity that occurs most frequently in the text information of the cluster; determining the number of occurrences of the target entity and the number of pieces of text information in the cluster; determining all the text information in the cluster as candidate labeled information when the number of occurrences and the number of pieces of text satisfy a preset ratio; and taking the text information carrying the target entity in the candidate labeled information as labeled information.
In the embodiment of the application, after obtaining the pieces of text information in each cluster, the server determines the target entity with the largest number of occurrences in the text information, determines the number of pieces of text information, and then computes the quotient of the number of occurrences and the number of pieces of text.
If the server determines that the quotient is greater than a preset threshold, indicating that the target entity occurs many times and most of the text information in the cluster contains it, the target entity can be used as the label of the cluster, and all the text information in the cluster becomes candidate labeled information. However, the candidate labeled information may contain first information that does not contain the target entity, so the server traverses the candidate labeled information, removes that first information, and takes the candidate labeled information containing the target entity as the labeled information, so that every piece of labeled information contains the corresponding target entity and the accuracy of the label is improved.
The preset threshold may be set to 0.5: if the quotient of the number of occurrences of the target entity and the number of pieces of text information is greater than 0.5, more than half of the pieces of text information contain the target entity, and the target entity can then serve as the representative label of the cluster, representing most of the text information.
Illustratively, the text information consists of group chat names in an instant messaging tool, and the text information in a cluster comprises "xx sports group", "xx sports squad", "xx sports team", "xx sports club", and "xx star group". The most frequent target entity is "sports", which occurs 4 times across the 5 pieces of text information; since 4/5 is greater than 0.5, "sports" is used as the label of the cluster, and all the text information in the cluster is candidate labeled information. Since "xx star group" does not contain the target entity "sports", it is removed from the cluster, and the remaining four names, all of which contain "sports", can be used as labeled information.
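A sketch of this dividing scheme for one cluster follows; how candidate entities are enumerated is an assumption (here they are supplied as a list and counted as substrings), since the embodiment only specifies the most frequent entity and the 0.5 threshold:

from collections import Counter

def split_cluster(cluster_texts, candidate_entities, threshold=0.5):
    # Count, for each candidate entity, how many texts contain it.
    counts = Counter({e: sum(e in t for t in cluster_texts)
                      for e in candidate_entities})
    target, occurrences = counts.most_common(1)[0]
    if occurrences / len(cluster_texts) <= threshold:
        return [], list(cluster_texts)        # whole cluster stays unlabeled
    tagged = [(t, target) for t in cluster_texts if target in t]
    untagged = [t for t in cluster_texts if target not in t]
    return tagged, untagged

On the five group names above with candidates ["sports", "star"], the quotient is 4/5 > 0.5, so the four "sports" names come back as labeled information and "xx star group" comes back as unlabeled information.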
As an optional implementation, dividing the text information in each cluster into unlabeled information by the preset dividing scheme includes: determining all the text information in the cluster as unlabeled information when the number of occurrences and the number of pieces of text do not satisfy the preset ratio; and taking the text information that does not carry the target entity in the candidate labeled information as unlabeled information.
In the embodiment of the application, the server computes the quotient of the number of occurrences of the target entity and the number of pieces of text information; if the quotient is not greater than the preset threshold, indicating that the target entity occurs in too few pieces of text information for a label of the cluster to be determined, all the text information in the cluster is treated as unlabeled information.
In addition, as described above, the server traverses the candidate labeled information and removes the first information that does not contain the target entity; that first information is also treated as unlabeled information.
Illustratively, the text information in a cluster comprises "xx sports group", "xx eating squad", "xx basketball group", and "xx star group". Each candidate entity occurs once, the number of pieces of text information is 4, and the preset threshold is 0.5; since the quotient of the number of occurrences and the number of pieces of text is not greater than 0.5, no label can be determined for the cluster, and all four pieces of text information in the cluster are unlabeled information.
Illustratively, the text information in a cluster comprises "xx sports group", "xx sports squad", "xx sports team", and "xx star group"; "sports" serves as the label of the cluster, and "xx star group" is removed from the cluster and treated as unlabeled information.
Optionally, after the server obtains the pieces of unlabeled information, it can cluster and divide all of them again: a piece of unlabeled information is either all the text information of a cluster whose occurrence count and text count did not satisfy the preset ratio (texts with different labels merely happened to fall into one cluster), or part of the text information of a cluster (the first information) that merely failed to match the label of its current cluster, so re-clustering may yield new labeled information and increase the amount of labeled information. The server repeats the clustering and dividing steps on the unlabeled information obtained each time until no new labeled information can be obtained, so that labeled information is mined out to the maximum extent through multiple iterations.
As an optional implementation, after the text information carrying the target entity in the candidate labeled information is taken as labeled information, the method further includes: taking the target entity as the label of the labeled information; and annotating the labeled information in a preset annotation format to obtain annotation information.
In the embodiment of the application, the server computes the quotient of the number of occurrences of the target entity and the number of pieces of text; if the quotient is greater than the preset threshold, most of the text information in the cluster contains the target entity, and after the server removes from the cluster the first information that does not contain the target entity, all the resulting labeled information contains the target entity, so the target entity can be used as the label of the labeled information. The server then annotates the labeled information, with the target entity as the label, in a preset annotation format to obtain annotation information. The preset annotation format may be the BIO (B-begin, I-inside, O-outside) format. The unlabeled information can also be annotated with BIO and input to the initial training model during training as negative samples, which serves to augment the data.
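As an illustration of the BIO format, the following sketch tags each character of a text with B/I/O relative to one target entity; the single generic entity type "ENT" is an assumption:

def bio_tag(text: str, entity: str):
    # O for every character, then mark each occurrence of the entity
    # with B- at its first character and I- on the rest.
    tags = ["O"] * len(text)
    start = text.find(entity)
    while start != -1:
        tags[start] = "B-ENT"
        for i in range(start + 1, start + len(entity)):
            tags[i] = "I-ENT"
        start = text.find(entity, start + len(entity))
    return list(zip(text, tags))

# bio_tag("xx sports group", "sports") tags the characters of "sports"
# as B-ENT, I-ENT, ..., and every other character as O.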
As an optional implementation, continuing training the pre-training model on the basis of the text information includes: inputting the plurality of pieces of text information into the pre-training model, so that the pre-training model continues training on the basis of the text information, to obtain a continued-training pre-training model. The pre-training model is used for acquiring the prior information hidden in the text information, and the prior information comprises structural information and format information in the text information.
In the embodiment of the application, after the server obtains all the labeled information, it first inputs the pieces of text information into the pre-training model to obtain the continued-training pre-training model. By continuing training on the basis of the text information, the pre-training model acquires the prior information hidden in the text information, which comprises its structural information and format information. Continuing training on this prior information lets the target vectors output by the pre-training model fit the annotation information in subsequent training, improves the accuracy of entity identification in the text information when the model is used, and improves the generalization ability of the model.
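A minimal sketch of such continued training with masked language modeling is shown below, using the Hugging Face transformers library; the library choice, the bert-base-chinese checkpoint, and all hyperparameters are assumptions, as the embodiment only requires continuing to train the pre-training model on the in-domain text:

from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# In-domain text information, e.g. group chat names.
texts = ["xx sports group", "xx sports squad", "xx star group"]
dataset = [tokenizer(t, truncation=True, max_length=32) for t in texts]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-continued", num_train_epochs=3),
    train_dataset=dataset,
    # Random masking lets the model absorb the structural and format
    # information hidden in the text.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()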
As an optional implementation, performing model training on all the labeled information based on the initial training model includes: inputting the text information into the continued-training pre-training model to obtain a target vector of the text information output by it, wherein the target vector is associated with the word granularity of the text information; inputting the target vector into the recognition model to obtain prediction information, corresponding to the text information, output by the recognition model; and updating the parameters of the loss function according to the prediction information and the annotation information until the difference between the prediction information and the annotation information is minimized, wherein the weights of the continued-training pre-training model are updated during the parameter update of the loss function.
In the embodiment of the application, after the server continues training the pre-training model, the initial model still needs to be trained. Specifically, the server inputs the text information into the continued-training pre-training model and obtains the target vector it outputs for the text information. The target vector is associated with the word granularity of the text information, so the output of the pre-training model can serve as a vector representation of the text information at word granularity; vectors of different granularities carry slightly different information, and in most cases an integration of vectors of different granularities outperforms a single vector representation. Because the generalization ability of the pre-training model is improved during continued training, the continued-training pre-training model can improve the accuracy of entity identification. The pre-training model may be a BERT (Bidirectional Encoder Representations from Transformers) model or ERNIE (Enhanced Representation through kNowledge IntEgration); the present application does not limit the specific implementation of the pre-training model.
The server then inputs the target vector into the recognition model to obtain the prediction information, corresponding to the text information, output by the recognition model. The server constructs a loss function and updates its parameters according to the prediction information and the annotation information until the difference between the prediction information and the annotation information is minimized, at which point training of the loss function is complete. The parameters of the loss function include the weights of the continued-training pre-training model, and those weights can be updated during the parameter update of the loss function, so that the continued-training pre-training model identifies entities more accurately. The recognition model may be a BiLSTM model plus a CRF (Conditional Random Field) model.
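The following PyTorch sketch shows one way to assemble such a BERT + BiLSTM + CRF recognition model; the pytorch-crf package and the layer dimensions are assumptions, as the embodiment names the components without fixing an implementation:

import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags: int, bert_name: str = "bert-base-chinese"):
        super().__init__()
        # Load the continued-training BERT weights here in practice.
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.lstm = nn.LSTM(hidden, hidden // 2, batch_first=True,
                            bidirectional=True)
        self.emit = nn.Linear(hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        states = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(self.lstm(states)[0])
        if tags is not None:
            # Training: negative CRF log-likelihood as the loss; updating
            # it also updates the BERT weights, as described above.
            return -self.crf(emissions, tags, mask=attention_mask.bool())
        return self.crf.decode(emissions, mask=attention_mask.bool())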
In the application, a BERT + BiLSTM + CRF model is adopted, which takes into account both the word granularity of the text information and the updating of the model weights, improving the precision of model training.
Fig. 2 is a schematic diagram of model training. As shown in fig. 2, the server continues training the BERT model on the text information, and then sequentially inputs the text information and the label information into the continued-training BERT model, the BiLSTM model, and the CRF model to complete the model training process.
As an optional implementation, clustering the pieces of text information by the clustering scheme to obtain the plurality of clusters includes: inputting initial text information into a vector model to obtain vector information of the initial text information output by the vector model; clustering the pieces of vector information by the clustering scheme to obtain the plurality of clusters, wherein each cluster comprises at least one piece of vector information; and mapping the vector information in each cluster to text information according to a mapping relationship, wherein the mapping relationship is the mapping between the vector information and the text information.
In the embodiment of the application, the server inputs the initial text information into the vector model to obtain the vector information output by the vector model, and then clusters the pieces of vector information by the clustering scheme to obtain a plurality of clusters, each comprising at least one piece of vector information. Because the subsequent determination of unlabeled and labeled information is based on text information, the vector information needs to be converted back into text information: the server stores the mapping relationship between the vector information and the text information, and maps the vector information in each cluster to text information according to that mapping relationship. The vector model may be a bag-of-words model or Word2Vec; the present application does not specifically limit the vector model.
Before the server inputs the initial text information into the vector model, it first performs stop-word removal on the original text information through a preset stop-word scheme to obtain the initial text information. Specifically, stop words are removed from the original text information, eliminating prepositions, particles, and other words without practical meaning, which can greatly improve the efficiency of subsequent text processing.
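Putting the preprocessing together, a sketch of stop-word removal followed by Word2Vec vectorization (via gensim) is given below; the toy stop-word list, character-level tokens, and mean-pooled sentence vectors are assumptions, and a bag-of-words model could be used instead as noted above:

import numpy as np
from gensim.models import Word2Vec

STOP_WORDS = {"的", "了", "和"}  # toy stop-word list, an assumption

def preprocess(text: str) -> list[str]:
    # Stop-word removal at character granularity.
    return [ch for ch in text if ch not in STOP_WORDS and not ch.isspace()]

raw_texts = ["xx运动群", "xx运动小队", "xx明星群"]
tokenized = [preprocess(t) for t in raw_texts]
w2v = Word2Vec(tokenized, vector_size=50, min_count=1, epochs=20)

# One vector per text: the mean of its character vectors.
vectors = np.stack([np.mean([w2v.wv[ch] for ch in toks], axis=0)
                    for toks in tokenized])
# Keep the vector-index -> text mapping so clustered vectors can be
# mapped back to text information afterwards.
index_to_text = dict(enumerate(raw_texts))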
Optionally, an embodiment of the present application further provides a processing flow for entity identification, with the following specific steps.
Step 1: perform stop-word removal on the original text information to obtain initial text information, and vectorize the initial text information to obtain vector information.
Step 2: cluster the vector information by the clustering scheme to obtain K clusters, and determine the text information in each cluster, where the text information is group chat name information (group name information).
Step 3: traverse the Mk pieces of group name information in cluster k (k = 1, 2, ..., K), and find the target entity S with the largest number of occurrences among the Mk pieces of group name information.
Step 4: let the number of occurrences be N; if N >= 0.5Mk, set the target entity S as the label of cluster k, remove the group name information in cluster k that does not contain the target entity S, and use the remaining group name information in the cluster as labeled information.
Step 5: if N < 0.5Mk, determine that all the group name information in the cluster is unlabeled information, and also treat the group name information removed in step 4 as unlabeled information.
Step 6: repeat steps 2 to 5 until no new labeled information can be obtained.
Step 7: continue training a BERT model on the initial text information, and then train on the initial text information with the continued-training BERT + BiLSTM + CRF model.
Step 8: identify the entity in the group name information with the BERT + BiLSTM + CRF model.
Based on the same technical concept, an embodiment of the present application further provides an entity identification apparatus for text information, as shown in fig. 3. The apparatus comprises:
the clustering module 301, configured to cluster a plurality of pieces of text information by a clustering scheme to obtain a plurality of clusters, where each cluster comprises at least one piece of text information;
the dividing module 302, configured to divide the text information in each cluster into unlabeled information and labeled information according to a preset dividing scheme, where the preset dividing scheme is associated with the same entity in the text information;
the clustering and dividing module 303, configured to cluster and divide the unlabeled information of the plurality of clusters according to the clustering scheme and the preset dividing scheme until no new labeled information can be obtained;
the continued training module 304, configured to continue training a pre-training model on the basis of the text information to obtain an initial training model, where the pre-training model is a model generated by training on a large-scale general corpus;
and the training module 305, configured to perform model training on all the labeled information based on the initial training model, so as to identify the entity in the text information by the trained model.
Optionally, the dividing module 302 comprises:
a first determining unit, configured to determine the target entity that occurs most frequently in the text information of the cluster;
a second determining unit, configured to determine the number of occurrences of the target entity and the number of pieces of text information in the cluster;
a third determining unit, configured to determine all the text information in the cluster as candidate labeled information when the number of occurrences and the number of pieces of text satisfy a preset ratio;
and a first selection unit, configured to take the text information carrying the target entity in the candidate labeled information as labeled information.
Optionally, the dividing module 302 further comprises:
a fourth determining unit, configured to determine all the text information in the cluster as unlabeled information when the number of occurrences and the number of pieces of text do not satisfy the preset ratio;
and a second selection unit, configured to take the text information that does not carry the target entity in the candidate labeled information as unlabeled information.
Optionally, the apparatus further comprises:
a label module, configured to take the target entity as the label of the labeled information;
and an annotation module, configured to annotate the labeled information in a preset annotation format to obtain annotation information.
Optionally, the continued training module 304 comprises:
an input unit, configured to input the plurality of pieces of text information into the pre-training model, so that the pre-training model continues training on the basis of the text information, where the pre-training model is used for acquiring the prior information hidden in the text information, and the prior information comprises structural information and format information in the text information;
and an output unit, configured to obtain the training vector output by the pre-training model, where the training vector is associated with the word granularity of the text information.
Optionally, the training module 305 comprises:
a first input/output unit, configured to input the text information into the continued-training pre-training model to obtain a target vector of the text information output by the continued-training pre-training model, where the target vector is associated with the word granularity of the text information;
a second input/output unit, configured to input the target vector into the recognition model to obtain prediction information, corresponding to the text information, output by the recognition model;
and an updating unit, configured to update the parameters of the loss function according to the prediction information and the annotation information until the difference between the prediction information and the annotation information is minimized, where the weights of the continued-training pre-training model are updated during the parameter update of the loss function.
Optionally, the clustering module 301 comprises:
a fourth input/output unit, configured to input initial text information into the vector model to obtain vector information of the initial text information output by the vector model;
a clustering unit, configured to cluster a plurality of pieces of vector information by a clustering scheme to obtain a plurality of clusters, where each cluster comprises at least one piece of vector information;
and a mapping unit, configured to map the vector information in each cluster to text information according to a mapping relationship, where the mapping relationship is the mapping between the vector information and the text information.
According to another aspect of the embodiments of the present application, there is provided an electronic device, as shown in fig. 4, including a memory 403, a processor 401, a communication interface 402, and a communication bus 404, where the memory 403 stores a computer program that is executable on the processor 401, the memory 403 and the processor 401 communicate through the communication interface 402 and the communication bus 404, and the processor 401 implements the steps of the method when executing the computer program.
The memory and the processor in the electronic device communicate with the communication interface through the communication bus. The communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
There is also provided, in accordance with yet another aspect of an embodiment of the present application, a computer-readable medium having non-volatile program code executable by a processor.
Optionally, in an embodiment of the present application, a computer readable medium is configured to store program code for the processor to execute the above method.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
When the embodiments of the present application are specifically implemented, reference may be made to the above embodiments, and corresponding technical effects are achieved.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may essentially, or in the part contributing to the prior art, be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for entity identification in textual information, the method comprising:
clustering a plurality of pieces of text information by a clustering scheme to obtain a plurality of clusters, wherein each cluster comprises at least one piece of text information;
dividing the text information in each cluster into unlabeled information and labeled information by a preset dividing scheme, wherein the preset dividing scheme is associated with the same entity in the text information;
clustering and dividing the unlabeled information of the plurality of clusters by the clustering scheme and the preset dividing scheme until no new labeled information can be obtained;
continuing training a pre-training model on the basis of the text information to obtain an initial training model, wherein the pre-training model is a model generated by training on a large-scale general corpus;
and performing model training on all the text information based on the initial training model, so as to identify the entity in the text information by the trained model.
2. The method of claim 1, wherein dividing the text information in each cluster into labeled information by the preset dividing scheme comprises:
determining the target entity that occurs most frequently in the text information of the cluster;
determining the number of occurrences of the target entity and the number of pieces of text information in the cluster;
determining all the text information in the cluster as candidate labeled information when the number of occurrences and the number of pieces of text satisfy a preset ratio;
and taking the text information carrying the target entity in the candidate labeled information as the labeled information.
3. The method of claim 2, wherein dividing the text information in each cluster into unlabeled information by the preset dividing scheme comprises:
determining all the text information in the cluster as unlabeled information when the number of occurrences and the number of pieces of text do not satisfy the preset ratio;
and taking the text information that does not carry the target entity in the candidate labeled information as unlabeled information.
4. The method of claim 2, wherein after the text information carrying the target entity in the candidate labeled information is taken as the labeled information, the method further comprises:
taking the target entity as the label of the labeled information;
and annotating the labeled information in a preset annotation format to obtain annotation information.
5. The method of claim 1, wherein continuing training the pre-training model on the basis of the text information comprises:
inputting the plurality of pieces of text information into the pre-training model, so that the pre-training model continues training on the basis of the text information, thereby obtaining a continued-training pre-training model, wherein the pre-training model is used for acquiring prior information hidden in the text information, and the prior information comprises structural information and format information in the text information.
6. The method of claim 4, wherein performing model training on all the labeled information based on the initial training model comprises:
inputting the text information into the continued-training pre-training model to obtain a target vector of the text information output by the continued-training pre-training model, wherein the target vector is associated with the word granularity of the text information;
inputting the target vector into a recognition model to obtain prediction information, corresponding to the text information, output by the recognition model;
and updating the parameters of a loss function according to the prediction information and the annotation information until the difference between the prediction information and the annotation information is minimized, wherein the weights of the continued-training pre-training model are updated during the parameter update of the loss function.
7. The method of claim 1, wherein clustering the plurality of pieces of text information by the clustering scheme to obtain the plurality of clusters comprises:
inputting initial text information into a vector model to obtain vector information of the initial text information output by the vector model;
clustering a plurality of pieces of vector information by the clustering scheme to obtain the plurality of clusters, wherein each cluster comprises at least one piece of vector information;
and mapping the vector information in each cluster to text information according to a mapping relationship, wherein the mapping relationship is the mapping between the vector information and the text information.
8. An apparatus for entity recognition in textual information, the apparatus comprising:
the clustering module is used for clustering a plurality of pieces of text information by a clustering scheme to obtain a plurality of clusters, wherein each cluster comprises at least one piece of text information;
the dividing module is used for dividing the text information in each cluster into unlabeled information and labeled information by a preset dividing scheme, wherein the preset dividing scheme is associated with the same entity in the text information;
the clustering and dividing module is used for clustering and dividing the unlabeled information of the plurality of clusters by the clustering scheme and the preset dividing scheme until no new labeled information can be obtained;
the continued training module is used for continuing training a pre-training model on the basis of the text information to obtain an initial training model, wherein the pre-training model is a model generated by training on a large-scale general corpus;
and the training module is used for performing model training on all the text information based on the initial training model, so as to identify the entity in the text information by the trained model.
9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
CN202110634748.7A 2021-06-08 2021-06-08 Entity identification method and device in text information, electronic equipment and storage medium Pending CN113255355A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110634748.7A CN113255355A (en) 2021-06-08 2021-06-08 Entity identification method and device in text information, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110634748.7A CN113255355A (en) 2021-06-08 2021-06-08 Entity identification method and device in text information, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113255355A true CN113255355A (en) 2021-08-13

Family

ID=77186946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110634748.7A Pending CN113255355A (en) 2021-06-08 2021-06-08 Entity identification method and device in text information, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113255355A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563655A (en) * 2017-12-28 2018-09-21 北京百度网讯科技有限公司 Text based event recognition method and device
CN111143571A (en) * 2018-11-06 2020-05-12 马上消费金融股份有限公司 Entity labeling model training method, entity labeling method and device
WO2021068329A1 (en) * 2019-10-10 2021-04-15 平安科技(深圳)有限公司 Chinese named-entity recognition method, device, and computer-readable storage medium
CN112148952A (en) * 2020-09-28 2020-12-29 腾讯科技(深圳)有限公司 Task execution method, device and equipment and computer readable storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861670A (en) * 2022-07-07 2022-08-05 浙江一山智慧医疗研究有限公司 Entity identification method, device and application for learning unknown label based on known label


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination