CN112257421A - Nested entity data identification method and device and electronic equipment


Info

Publication number: CN112257421A (application CN202011522097.4A)
Authority: CN (China)
Prior art keywords: entity, text, short text, model, training
Legal status: Granted; currently Active
Other languages: Chinese (zh)
Other versions: CN112257421B
Inventors: 于淼, 刘炎, 覃建策, 陈邦忠
Assignee (current and original): Perfect World Beijing Software Technology Development Co Ltd
Priority: CN202011522097.4A
Publications: CN112257421A (application), CN112257421B (grant)

Classifications

    • G06F40/295 Named entity recognition (natural language analysis; recognition of textual entities)
    • G06F16/316 Indexing structures (information retrieval of unstructured textual data)
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; classification (of unstructured textual data)
    • G06F40/242 Dictionaries (lexical tools)
    • G06N3/045 Combinations of networks (neural network architectures)
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods (neural networks)

Abstract

The application discloses a method and a device for identifying nested entity data, and an electronic device, relating to the technical field of data identification. The method comprises the following steps: permuting and combining seed entity vocabularies of different entity categories to generate a short text data set; defining at least one entity category label for each short text in the short text data set, where each entity category label corresponds to the start and end index information of a sub-text within the short text; training a deep learning recognition model using the labeled short text data set as a training set; and identifying nested entity data using the recognition model once training reaches the required standard. Because entity labeling information is defined for each sentence in the form of start and end indexes plus an entity category label, labeling multi-nested entity content becomes simpler to implement, the process and workload of labeling for nested entity recognition are optimized, time and labor costs are saved, and the identification efficiency and accuracy of nested entity data can be improved.

Description

Nested entity data identification method and device and electronic equipment
Technical Field
The present application relates to the field of data identification technologies, and in particular, to a method and an apparatus for identifying nested entity data, and an electronic device.
Background
Named Entity Recognition (NER) is an important research direction in the field of natural language processing. It refers to the recognition of entities with specific meanings in text, mainly including names of people, places, organizations, proper nouns, and the like. With the development of deep learning techniques and the needs of practical production applications, the requirements on named entity recognition keep increasing. When entities are used to support search, fine-grained and nested entity information is needed to ensure the accuracy and coverage of the search; at the present stage, the technology mainly applied to named entity recognition is deep learning.
When deep learning is used for entity recognition, a large amount of labeled data is required; features are extracted automatically by a neural network and the data distribution is learned, so that prediction of entity content is completed. Nested entity recognition mainly adopts a multi-level neural network that recognizes entities from coarse granularity to fine granularity, thereby completing nested fine-grained entity recognition. Currently, entity labeling from coarse to fine granularity mostly adopts the BIO scheme, where B (Begin) marks the beginning of an entity vocabulary item, I (In) marks a position inside an entity word, and O (Out) marks content outside any entity vocabulary. Labeling is usually done by appending "-label" to B and I, while O stands alone.
However, with this entity data labeling method it is difficult to divide entities into coarse and fine granularity, multi-level BIO labeling requires a great deal of time and labor cost, and it is difficult to produce a large amount of effective labeled data in a short time, which affects the identification efficiency and accuracy of nested entity data.
Disclosure of Invention
In view of this, the present application provides a method and an apparatus for identifying nested entity data, and an electronic device, and mainly aims to solve the technical problem that the identification efficiency and accuracy of the nested entity data are affected in the prior art.
According to an aspect of the present application, there is provided a method for identifying nested entity data, the method including:
arranging and combining the seed entity vocabularies of different entity categories to generate a short text data set;
defining at least one entity category label for the short text in the short text data set, wherein each entity category label corresponds to the starting index information and the ending index information of the sub text in the short text;
training a deep learning recognition model by using the defined short text data set as a training set;
identifying nested entity data using the recognition model trained to meet a standard.
According to another aspect of the present application, there is provided an apparatus for identifying nested entity data, the apparatus including:
the generating module is used for arranging and combining the seed entity vocabularies of different entity categories to generate a short text data set;
the defining module is used for defining at least one entity class label for the short text in the short text data set, and each entity class label corresponds to the starting index information and the ending index information of the sub-text in the short text;
the training module is used for training a deep learning identification model by using the defined short text data set as a training set;
and the recognition module is used for recognizing the nested entity data by using the recognition model which reaches the standard after training.
According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method of identifying nested entity data.
According to still another aspect of the present application, there is provided an electronic device, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the method for identifying the nested entity data when executing the computer program.
By means of the above technical solution, compared with the prior art, the method and apparatus for identifying nested entity data and the electronic device provided by the present application abandon the BIO entity data labeling mode and optimize the labeling of entity data. Specifically, after seed entity vocabularies of different entity categories are permuted and combined to generate a short text data set, at least one entity category label is defined for each short text in the data set, and each entity category label corresponds to the start and end index information of a sub-text within the short text. Because entity labeling information is defined for each sentence in the form of start and end indexes plus an entity category label, labeling multi-nested entity content becomes simpler to implement and convenient to process, the process and workload of labeling for nested entity recognition are optimized, and time and labor costs are saved. The short text data set obtained by this labeling mode is then used as a training set to train a deep learning recognition model, and the recognition model that reaches the standard after training is used to identify nested entity data, which can improve the identification efficiency and accuracy of nested entity data.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flowchart illustrating an identification method of nested entity data according to an embodiment of the present application;
fig. 2 is a schematic flowchart illustrating another method for identifying nested entity data according to an embodiment of the present application;
FIG. 3 illustrates an example diagram of entity tagging provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a recognition model provided by an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an example sorting flow of nested entity data identification provided by an embodiment of the present application;
FIG. 6 illustrates an example schematic diagram of entity data generation and annotation provided by an embodiment of the present application;
fig. 7 shows a schematic structural diagram of an apparatus for identifying nested entity data according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
To improve on the existing BIO entity data labeling mode, and to solve the technical problems that entities are difficult to divide into coarse and fine granularity, that multi-level BIO labeling requires a great deal of time and labor cost, and that a large amount of effective labeled data is difficult to produce in a short time, all of which affect the identification efficiency and accuracy of nested entity data, this embodiment provides a method for identifying nested entity data. As shown in fig. 1, the method includes:
101. and arranging and combining the seed entity vocabularies of different entity categories to generate a short text data set.
First, entity categories are determined. The present embodiment may involve eight types of entities: position (JOB), company (ORG), brand (BRD), industry (IND), skill (SKL), location (LOC), welfare (WEL), and salary (SAL); for each entity category, the collected seed entity vocabulary is sorted and summarized. For example, for the location category, the corresponding seed entity vocabulary includes Beijing, Shanghai, Beijing Wudao, etc.; for the industry category, the corresponding seed entity vocabulary includes human resource services, automobile manufacturing, labor dispatching, and the like.
Then, the collected seed entity vocabularies of the different entity categories are permuted and combined to generate a short text data set. For example, following common search sentence patterns, the seed entity words of the eight entity categories are permuted and combined according to elements such as entity order, number of selections, and entity position, and stop words are added as non-entity content (equivalent to the "O" label in BIO labeling) to form short texts constituting short text data set A. There are 81 combination forms in total, such as "company" + "position", "position" + "salary", "industry" + "welfare", "industry" + "salary", "position" + "welfare", and "location" + "company" + "welfare" + "salary"; a generation sketch is given below.
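As a minimal sketch of this permutation-and-combination step, the following Python snippet is illustrative only: the seed vocabularies, stop words, sentence patterns, and function names are assumptions, not taken from the patent.

```python
import itertools
import random

# Assumed seed dictionary: entity category -> seed entity vocabulary.
SEEDS = {
    "LOC": ["北京", "上海", "广州"],
    "ORG": ["XX劳务派遣有限公司", "XX科技股份有限公司"],
    "JOB": ["销售经理", "数据分析师"],
    "SAL": ["月薪一万", "年薪三十万"],
}
STOP_WORDS = ["的", "在", "找"]  # non-entity filler, playing the role of the "O" content

def generate_short_texts(patterns, per_pattern=3, rng=random.Random(0)):
    """Permute and combine seed words for each category pattern, occasionally
    inserting a stop word as non-entity content."""
    texts = []
    for pattern in patterns:
        pools = [SEEDS[cat] for cat in pattern]
        for combo in itertools.islice(itertools.product(*pools), per_pattern):
            parts = list(combo)
            if rng.random() < 0.5:
                parts.insert(rng.randrange(len(parts) + 1), rng.choice(STOP_WORDS))
            texts.append("".join(parts))
    return texts

# Two of the 81 combination forms, e.g. company+position and position+salary:
print(generate_short_texts([("ORG", "JOB"), ("JOB", "SAL")]))
```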
The execution subject of this embodiment may be a device or apparatus for identifying nested entity data, which can be configured on the client side or the server side and used for intelligent recognition of nested fine-grained entity data.
102. At least one entity category label is defined for the short text in the generated short text data set, and each entity category label corresponds to the start index information and the end index information of the sub text in the short text.
Because the identification of nested entity data is meant to support certain service searches (such as recruitment searches), the text to be identified comes from user search queries, labeling from coarse to fine granularity is difficult, and the structure cannot be controlled manually. The entity data labeling in this embodiment therefore abandons the BIO labeling mode and defines entity labeling information for each sentence by start/end indexes and entity labels, so that labeling multi-nested entity content is simple to implement, convenient to process, and saves time and labor costs.
For example, for the short text "Guangzhou XX Labor Dispatch Co., Ltd." (12 characters in the original Chinese), entity data labeling can be performed in the form "0,2 LOC | 0,12 ORG". Here "LOC" is the entity category label for location, and "0,2" is the left-closed, right-open index of the beginning and end of the location tag in the short text, i.e., characters 0 up to (but not including) 2, the text "Guangzhou", correspond to the location label. The symbol "|" separates the next group of label data: "ORG" is the entity category label for company, and "0,12" is the index of the beginning and end of the company label in the short text, i.e., characters 0 up to 12 (the whole text, "Guangzhou XX Labor Dispatch Co., Ltd.") correspond to the company label.
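A small sketch of reading this label format back into spans; the function name is illustrative, and the interval convention is the left-closed, right-open one described above.

```python
def parse_tags(tag_line):
    """Parse a label string such as "0,2 LOC|0,12 ORG" into (start, end, label)
    triples, where [start, end) indexes characters of the short text."""
    triples = []
    for group in tag_line.split("|"):
        span, label = group.strip().split(" ")
        start, end = (int(v) for v in span.split(","))
        triples.append((start, end, label))
    return triples

text = "广州XX劳务派遣有限公司"  # 12 characters
for start, end, label in parse_tags("0,2 LOC|0,12 ORG"):
    print(label, text[start:end])  # LOC 广州, then ORG covering the whole text
```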
103. And training a deep learning recognition model by using the defined short text data set as a training set.
In this embodiment, model training may be performed based on a plurality of optional deep learning algorithm models to obtain a recognition model for recognizing nested entity data. For example, after the model training is completed, the test set can be used for testing, and after the result output by the model meets the index requirement, the training can be determined to reach the standard, so that the recognition model with the training reaching the standard is obtained. And if the test output result of the recognition model obtained by training does not meet the index requirement, continuing the model training until the model training reaches the standard.
104. And identifying the nested entity data by using the identification model which reaches the standard after training.
For example, text data to be recognized is input into a recognition model which is trained to reach the standard, text data features are extracted by the model, entity category labels corresponding to the features and index information corresponding to the entity category labels are obtained, tag data are obtained, and recognition results of nested entity data are output according to the tag data.
Compared with the prior art, the method for identifying nested entity data provided by this embodiment abandons the BIO entity data labeling mode and optimizes the labeling of entity data. Specifically, after seed entity vocabularies of different entity categories are permuted and combined to generate a short text data set, at least one entity category label is defined for each short text in the data set, and each entity category label corresponds to the start and end index information of a sub-text within the short text. Because entity labeling information is defined for each sentence in the form of start and end indexes plus an entity category label, labeling multi-nested entity content becomes simpler to implement and convenient to process, the process and workload of labeling for nested entity recognition are optimized, and time and labor costs are saved. The short text data set obtained by this labeling mode is then used as a training set to train a deep learning recognition model, and the recognition model that reaches the standard after training is used to identify nested entity data, which can improve the identification efficiency and accuracy of nested entity data.
Further, as a refinement and an extension of the specific implementation of the foregoing embodiment, in order to fully describe the implementation of this embodiment, this embodiment further provides another identification method of nested entity data, as shown in fig. 2, where the method includes:
201. an entity category dictionary is created.
The dictionary comprises different entity categories and seed entity vocabularies which are respectively collected correspondingly to the entity categories.
This embodiment involves eight classes of entities: position (JOB), company (ORG), brand (BRD), industry (IND), skill (SKL), location (LOC), welfare (WEL), and salary (SAL). The collected entity vocabulary of each entity category is sorted and summarized, each entity category is stored in a separate text file, and each file stores the text content of that category's entities line by line, forming an external entity category dictionary, i.e., the entity category dictionary.
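A sketch of loading such a dictionary, assuming one UTF-8 text file per category named after its label; the file naming and function name are assumptions beyond "one entity per line, one file per category".

```python
from pathlib import Path

CATEGORIES = ["JOB", "ORG", "BRD", "IND", "SKL", "LOC", "WEL", "SAL"]

def load_entity_dictionary(dict_dir):
    """Read the external entity category dictionary: one file per category,
    one seed entity vocabulary item per line."""
    dictionary = {}
    for category in CATEGORIES:
        path = Path(dict_dir) / f"{category}.txt"
        with open(path, encoding="utf-8") as f:
            dictionary[category] = [line.strip() for line in f if line.strip()]
    return dictionary
```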
202. And arranging and combining the seed entity vocabularies in the entity category dictionary according to the search sentence pattern with the use frequency of the user being greater than the preset frequency threshold, and adding stop words as non-entity content to generate a short text data set.
The preset frequency threshold value can be used for judging the search sentence patterns commonly used by the user. By the method, a basic data set with rich sample elements can be generated, the quality of model training data can be improved in the early stage, and the efficiency and accuracy of model training can be improved subsequently.
203. And dividing the short texts in the short text data set into text identification blocks according to symbols and different languages.
The division into text identification blocks may be regarded as token division. Optionally, step 203 may specifically include: uniformly replacing the spaces in the short text with a preset character; processing foreign-language words as whole text identification blocks (tokens), processing Chinese text and other characters character by character, and separating the tokens by a first preset symbol. This alternative makes it convenient to label the entity category labels and simplifies the workload of entity data labeling.
For example, spaces are uniformly replaced with the English underscore "_", and English is uniformly converted to lowercase. Since some entity categories involve a large number of English words, to avoid the UNK problem, regular-expression matching is used: English words are not split but are processed as whole tokens, Chinese text and other characters are processed character by character, and the tokens can be separated by spaces.
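A minimal sketch of this token division; the regular expression and function name are assumptions.

```python
import re

def divide_tokens(text):
    """Replace spaces with "_", lower-case English, keep each English word as
    one whole token, and split Chinese and other characters one per token."""
    text = text.replace(" ", "_").lower()
    return re.findall(r"[a-z]+|.", text)

print(divide_tokens("Python工程师 北京"))
# ['python', '工', '程', '师', '_', '北', '京']
```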
204. And adding entity category labels for the texts divided by the text identification blocks according to the seed entity vocabularies and storing the entity category labels in a list.
In this way, using the token-divided text in the list and the entity category labels corresponding to that text, short text entities can be searched with the seed entity vocabulary and indexes and labels added correspondingly, which improves the efficiency and accuracy of entity data labeling; the process shown in step 205 can then be executed.
205. At least one entity category label is defined for the short text in the short text data set by using the text in the list and the entity category labels corresponding to the text, and each entity category label corresponds to the index information of the beginning and the end of the sub-text in the short text.
Optionally, step 205 may specifically include: the short text occupies one line; on the next line, the entity labeling content of the short text is added. The entity labeling content includes the entity category label and the start and end subscript positions defined within the short text. Index values are given as left-closed, right-open intervals, the start index and end index are separated by a second preset symbol, a third preset symbol is added after the end index followed by the corresponding entity category label, the labeling content of each entity is separated by a fourth preset symbol, and each group of data and labels is separated from the next group by an empty line.
For example, as shown in fig. 3, the composed short text occupies one line. Adding entity labeling contents of a text in a line feed manner, searching and matching by utilizing an entity external dictionary, defining the entity category and the subscript positions of start and end in a short text, giving an index value by adopting a left closed and right open interval, separating the start index and the end index by commas, adding a space and then adding a label after the index is ended, completing the labeling of one entity, separating the labeling contents of each entity by a symbol "|" and separating each group of data and label from the next group of data and label by an empty line.
206. And training a deep learning recognition model by using the defined short text data set as a training set.
In order to ensure the accuracy of the recognition model training, optionally, step 206 may specifically include: disordering the text data in the defined short text data set; then, dividing the short text data set after the sequence is disturbed into a training set and a testing set according to a preset proportion; then adding each seed entity vocabulary in the entity category dictionary and entity category labels, start and end indexes corresponding to each seed entity vocabulary into a training set; and training the recognition model by using the added training set, and testing the recognition model obtained by training by using the test set.
For example, based on the example form shown in fig. 3, text data set A is generated and then shuffled, and the data is divided into a training set and a test set at a ratio of 9:1; the original seed entities of the eight entity classes are added to the training set content to ensure the comprehensiveness of basic data in the training data. The training set and the test set are saved as separate files.
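A sketch of the shuffle-and-split step under the stated 9:1 ratio; the sample representation and function names are assumptions.

```python
import random

def split_dataset(samples, seed_samples, ratio=0.9, rng_seed=42):
    """Shuffle the labeled short texts, split them 9:1 into train/test sets,
    and append every seed entity (with its single-entity label) to the
    training set so basic seed data is always covered."""
    rng = random.Random(rng_seed)
    samples = samples[:]
    rng.shuffle(samples)
    cut = int(len(samples) * ratio)
    train, test = samples[:cut], samples[cut:]
    train.extend(seed_samples)
    return train, test

# e.g. samples of the form (text, tag_line):
train, test = split_dataset([("广州XX劳务派遣有限公司", "0,2 LOC|0,12 ORG")] * 10,
                            [("北京", "0,2 LOC")])
```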
In the prior art, the definition of entity granularity division can produce ambiguity: since nested NER (an entity name containing other entity names) is generally processed from sentences divided by coarse-grained entities down to sentences divided by fine-grained entities, it is difficult for labeling to define the structure of text content from coarse to fine granularity. To solve this problem, the recognition model used in this embodiment may be a BERT_BiLSTM_Multi_CRF model, which provides a decoding mode based on the BERT_BiLSTM_CRF model output layer together with a CRF transformation and decoding method, resolving the structural ambiguity. Here, BERT stands for Bidirectional Encoder Representations from Transformers, i.e., the encoder of a bidirectional Transformer. LSTM stands for Long Short-Term Memory, a kind of RNN (Recurrent Neural Network); due to its design, LSTM is well suited to modeling time-series data such as text. BiLSTM (Bi-directional Long Short-Term Memory) combines a forward LSTM and a backward LSTM; both are often used to model context information in natural language processing tasks. CRF stands for Conditional Random Field, a discriminative probabilistic model and a kind of random field commonly used to label or analyze sequence data such as natural language text or biological sequences.
Correspondingly, step 206 may specifically include: first, constructing the BERT_BiLSTM_Multi_CRF model so that a separate CRF is prepared for each entity type; acquiring the feature vector corresponding to each short text in the defined short text data set, where the feature vector includes a text feature, a label feature, a first-token feature, and a last-token feature; combining the obtained feature vectors to generate tuple data; and then training the BERT_BiLSTM_Multi_CRF model using the tuple data.
For example, obtaining the feature vector corresponding to each short text in the defined short text data set may specifically include: defining a triple classification list comprising a tokens list, a chars list, and an entities list, where the tokens list contains the tokens of the short text divided by the first preset symbol, the chars list contains the characters of the short text, and the entities list contains the start index, end index, and entity category label of each entity in the short text; processing sentence by sentence with the classification list: adding a first label at the beginning of the short text and converting it to an ID value via the dictionary table of the BERT Chinese model after tokenization, then tokenizing the single-sentence text and converting its tokens to ID values, and finally adding a second label at the end of the short text and converting it to an ID value in the same way, obtaining the text input ID values; converting the entity category labels into numerical types according to the vocabulary mapping, storing the start index, end index, and label ID as triples, and generating a first-token list and a last-token list for each single-sentence text; converting all texts and entity category labels of the triple class formed from the training set, generating summary dictionaries for the texts, labels, first tokens, and last tokens respectively, and, after sorting, forming the four kinds of feature vectors, namely text features, label features, first-token features, and last-token features, at intervals of the batch size.
For example, a triple class SentList is defined, consisting of elements of the three list types tokens, chars, and entities. The data of the training set and test set files are read respectively; for the short text content, tokens are divided by spaces and chars by characters; for the labels, each group of labels is divided by "|", index and label are divided by a space, and the start index and end index are divided by an English comma; the start index, end index, and label are stored as a triple, and the label triples of all entities are stored in the entities list. The labels included in the training set and test set are processed and stored as a list, and a mapping is established for the generated vocabulary.
According to the generated SentList, processing proceeds sentence by sentence: first a [CLS] label is added at the beginning of the text, tokenized, and converted to an ID value via the dictionary table of the BERT Chinese model; next, the single-sentence text is tokenized and its tokens converted to IDs; after the text is processed, the final [SEP] label is converted in the same way; finally, the labels related to the text are converted into numerical types according to the vocabulary mapping, the start index, end index, and label ID are stored as triples, and a first-token list and a last-token list are generated for each single sentence to record the start and end positions of each token within the sentence. All texts and labels of the triple class formed from the training data set and test data set are converted, summary dictionaries are generated for the texts, labels, first tokens, and last tokens respectively, and after sorting, the four kinds of feature vectors are formed at intervals of the batch size.
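The single-sentence conversion can be sketched as below. The Hugging Face `transformers` tokenizer is used here as a stand-in for the patent's BERT Chinese dictionary table; the helper name and exact bookkeeping are assumptions.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def encode_sentence(tokens, entities, label2id):
    """Wrap the token list with [CLS]/[SEP], convert tokens to vocabulary IDs,
    record first/last sub-token positions per token, and map each
    (start, end, label) triple to (start, end, label_id)."""
    input_ids = tokenizer.convert_tokens_to_ids(["[CLS]"])
    first, last = [], []
    for tok in tokens:
        pieces = tokenizer.tokenize(tok) or ["[UNK]"]
        first.append(len(input_ids))              # position of the first sub-token
        input_ids += tokenizer.convert_tokens_to_ids(pieces)
        last.append(len(input_ids) - 1)           # position of the last sub-token
    input_ids += tokenizer.convert_tokens_to_ids(["[SEP]"])
    triples = [(s, e, label2id[lab]) for s, e, lab in entities]
    return input_ids, first, last, triples
```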
The generating of the tuple data by combining according to the feature vectors may specifically include: acquiring a first maximum length of an element in a text feature and a second maximum length of the element in a first token feature; supplementing 0 to the text input ID value at the tail according to the first maximum length to form a new text input ID value, and forming an input ID characteristic; generating an input mask ID value according to the length of the new text input ID value to form an input mask characteristic, wherein the value is 1 from the beginning to the original text input ID length part, and the rest positions are supplemented with 0; generating a mask list according to the first token feature, and generating a mask feature by using the mask list, wherein the part value from the starting position to the second maximum length is True, and the rest positions are filled with False; and adding the input ID characteristic, the input mask characteristic, the first token characteristic, the last token characteristic, the label characteristic and the mask characteristic into the list, and combining to obtain the tuple data.
For example, the feature vectors of the texts and the first tokens are measured, where the maximum element lengths are A and B respectively; since the text has the starting [CLS] and ending [SEP] flag bits that the first-token list lacks, A - B = 2. The text input IDs are padded with 0 at the end up to length A to form the new text input ID feature; an input mask is generated according to the length of the new text input IDs, with the value 1 from the start through the original text input ID length and 0 padded in the remaining positions; and a mask list is generated to form the mask feature, with the value True from the start position through the per-sentence first-token length and False filled up to length B. The input ID feature, input mask feature, first-token feature, last-token feature, label feature, and mask feature are added to a list to form a tuple. The training data file and test data file are packaged and written into Pickle files by the same processing flow, as model training data and test data.
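A sketch of the padding and mask construction just described (names assumed); A is the longest text-ID length and B the longest first-token list length in the batch.

```python
import pickle

def pad_batch(batch_input_ids, batch_first, pad_id=0):
    """Pad text input IDs to length A with 0, build the 1/0 input mask over the
    unpadded length, and build the True/False token mask over length B."""
    A = max(len(ids) for ids in batch_input_ids)
    B = max(len(first) for first in batch_first)
    input_ids, input_mask, token_mask = [], [], []
    for ids, first in zip(batch_input_ids, batch_first):
        input_ids.append(ids + [pad_id] * (A - len(ids)))
        input_mask.append([1] * len(ids) + [0] * (A - len(ids)))
        token_mask.append([True] * len(first) + [False] * (B - len(first)))
    return input_ids, input_mask, token_mask

def save_pickle(batches, path):
    with open(path, "wb") as f:
        pickle.dump(batches, f)   # training/test batches written as a Pickle file
```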
For example, building the BERT_BiLSTM_Multi_CRF model so that a separate CRF is prepared for each entity category may specifically include: first, using the Chinese model form of the BERT model as a context word-embedding generator; stacking the BiLSTM layers on the BERT model, where the BERT_BiLSTM_Multi_CRF model is provided with at least two BiLSTM hidden layers; and, in the CRF layer, replacing the trained objective function with the standard objective function of a linear-chain CRF and using multiple CRFs to consider each entity class separately when decoding at the output layer, where each element of each CRF transition matrix has a fixed value, so that by recursively narrowing the range of Viterbi decoding, the decoding range is dynamically determined according to the previous Viterbi decoding result, and Viterbi decoding is then applied only to the text segments identified as entities.
For example, the recognition model in this embodiment is mainly a BERT_BiLSTM_Multi_CRF model, whose structure is shown in fig. 4. The Chinese model form of the BERT model is used as a context word-embedding generator without fine-tuning, and the BiLSTM layers are stacked on the BERT model. The CRF layer first replaces the trained objective function with the standard objective function of a linear-chain CRF and establishes one CRF for each entity category. BERT employs Masked LM and Next Sentence Prediction in pre-training to capture word-level and sentence-level representations, respectively. Masked LM means that when training the bidirectional language model, a small number of words are replaced with Mask or, with a small probability, with another random word, which forces the model to rely more on context. Next Sentence Prediction refers to adding a loss that predicts the next sentence. The invention adopts the "BERT-Base, Chinese" pre-training model, which supports simplified and traditional Chinese and comprises 12 layers, 768 hidden units, 12 attention heads, and 110M parameters.
BiLSTM is formed by combining a forward LSTM and a backward LSTM. The LSTM model realizes long- and short-term memory through its cell and gate mechanisms and is suitable for processing serialized data such as text, but a unidirectional LSTM cannot encode information from back to front, so it is difficult to obtain good results on fine-grained classification; hence the bidirectional combination. When the method of this embodiment is applied, 2 BiLSTM hidden layers are set, the dimension of each hidden unit is 256, the Dropout value is set to 0.5, and the Batch size is set to 32.
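A minimal skeleton of this structure, assuming the Hugging Face `transformers` BERT and the `pytorch-crf` package's standard linear-chain CRF as a stand-in for the patent's modified fixed-transition CRF; all names and shapes not stated above are illustrative.

```python
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pytorch-crf package, a stand-in for the modified CRF

class BertBiLSTMMultiCRF(nn.Module):
    """Sketch of BERT_BiLSTM_Multi_CRF: frozen BERT embeddings, a 2-layer
    BiLSTM (hidden size 256, dropout 0.5), and one CRF head per entity category."""
    def __init__(self, categories, num_tags, hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        for p in self.bert.parameters():   # embedding generator, no fine-tuning
            p.requires_grad = False
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden, num_layers=2,
                            bidirectional=True, dropout=0.5, batch_first=True)
        self.heads = nn.ModuleDict(
            {c: nn.Linear(2 * hidden, num_tags) for c in categories})
        self.crfs = nn.ModuleDict(
            {c: CRF(num_tags, batch_first=True) for c in categories})

    def forward(self, input_ids, attention_mask, tags=None, mask=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(h)
        if tags is not None:  # training: sum of per-category CRF losses
            return -sum(self.crfs[c](self.heads[c](h), tags[c], mask=mask,
                                     reduction="mean") for c in self.crfs)
        return {c: self.crfs[c].decode(self.heads[c](h), mask=mask)
                for c in self.crfs}
```

Keeping one CRF head per category is what allows the same text span to carry several entity types, matching the Multi_CRF idea described above.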
The output layer employs a modified CRF. In the output layer of this embodiment, the conventional single CRF is abandoned and a separate CRF is prepared for each entity type, so that the model can handle the case where multiple entity types are assigned to the same mention range. In addition, each element of each CRF transition matrix has a fixed value depending on whether the transition is legal (e.g., from a B tag to an I tag in sequence) or illegal (e.g., from an O tag to an I tag), which helps make the score of a tag sequence containing the outer entity higher than the score of a tag sequence containing the inner entity. Formally, this embodiment uses $\mathbf{h} = (\mathbf{h}_1, \ldots, \mathbf{h}_n)$ to represent the sequence output from the last hidden layer of the neural model, where $\mathbf{h}_i$ is the vector of the $i$-th word and $n$ is the number of tags, and $\mathbf{y}^k = (y_1^k, \ldots, y_n^k)$ to represent the sequence of IOBES tags for entity type $k$. The scoring function is defined as:

$$\mathrm{score}(\mathbf{h}, \mathbf{y}^k) = \sum_{i=1}^{n} \left( A_{y_{i-1}^k,\, y_i^k} + P_{i,\, y_i^k} \right), \qquad \mathbf{P} = \mathbf{W}\mathbf{h} + \mathbf{b} \quad (\text{formula } 1)$$

where $\mathbf{W}$ and $\mathbf{b}$ respectively represent the weight matrix and the bias vector, $A_{y_{i-1}^k,\, y_i^k}$ represents the transition score from the previous tag to the current tag, and $P_{i,\, y_i^k}$ represents the bias score of tag $y_i^k$ obtained from $\mathbf{h}_i$.
Using multiple CRFs to consider each entity type separately when decoding at the output layer allows handling of cases where multiple entity types are contained within the same text range. The decoder searches for nested entities in an outside-in manner, achieving efficient processing by eliminating non-entity ranges at an early stage. That is, by recursively narrowing the range of Viterbi decoding, the decoding range is dynamically determined based on the previous Viterbi decoding result, and only the text segments identified as entities are Viterbi-decoded. Formula 1 is used to compute the scores of both external and internal entities, so no re-encoding is needed and processing becomes faster and more effective. Take the text "Beijing Dongtong Technology Co., Ltd. / Regional Sales Manager" as an example: at the outermost layer, "Beijing Dongtong Technology Co., Ltd." of category ORG is found as entity A and "Regional Sales Manager" of category JOB as entity B, and A and B each become a downward search path. Taking A as an example, the next layer can find the entity "Beijing" of location category LOC as entity C and an ORG-category sub-span of the company name as entity D under path A. No entity is found below entity C, so branch C stops searching; entity D is then taken as the next search path, and under path D the entity "Beijing Dongtong" of brand category BRD and "Technology" of industry category IND are found, below which no further entities can be found, so the search stops. Entity B performs searching and score calculation in the same way until no entity can be further divided; searching then stops and all found results are given as the final nested entity recognition result.
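The outside-in search can be sketched as follows; `decode_spans` is a hypothetical callable wrapping Viterbi decoding over a token range, and it is assumed to return only spans strictly inside the queried range (e.g., via second-best decoding) so that the recursion terminates.

```python
def decode_nested(decode_spans, start, end):
    """Recursively narrow the Viterbi decoding range: every entity span found
    in [start, end) becomes a new, smaller search path; a branch stops as soon
    as a range yields no further entities."""
    results = []
    for (s, e, category, score) in decode_spans(start, end):
        results.append((s, e, category, score))
        if e - s > 1:                  # single tokens cannot nest anything
            results.extend(decode_nested(decode_spans, s, e))
    return results

# all_entities = decode_nested(decode_spans, 0, sentence_length)
```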
207. And training the recognition model by using the added training set.
Optionally, step 207 may specifically include: setting an initial learning rate, a first preset number (such as 20) and a second preset number (such as 50) of epochs, and model indexes, where the model indexes may include accuracy, and/or recall, and/or the F1 value; training the recognition model, and reducing the learning rate, specifically according to a preset function rule, when the epoch-over-epoch improvement of the model indexes is smaller than a first preset threshold (used to determine that the model indexes have stopped improving) and the model's epoch count is smaller than the first preset number; then continuing to train the recognition model at the reduced learning rate, and stopping model training when the epoch-over-epoch improvement of the model indexes is smaller than a second preset threshold (used to determine that the model indexes have stopped improving; additionally, whether the change in learning rate is smaller than a certain threshold may be checked to judge whether the learning rate has essentially stopped changing) and the model's epoch count is smaller than the second preset number.
For example, model training is carried out on a 64-bit Linux operating system (system version Red Hat 4.8.5, graphics driver version 440.82, CUDA version 9.2, cuDNN version 7.6.5, graphics card GeForce RTX 2070), and the model is built with the PyTorch framework, version 1.3.0. In the training process, AdaBound can be used as the optimizer, learning with a gradually decaying learning rate, the initial learning rate set to 0.1. Because the data volume is large and rich, the training Epoch is set to 50 for full training and the Batch size to 32; meanwhile, an early-stopping method is added to avoid overfitting of model training, the accuracy, recall, and F1 value are used as evaluation indexes, the F1 value on the test set is used as the reference index, and the optimal model is saved iteratively. When the model index stops improving and does not recover within the patience period of 20 epochs, the learning rate is reduced. Before 50 epochs are reached, training stops if the learning rate essentially no longer changes. Model verification uses a manually curated and labeled independent data set together with a data set constructed by the model as test sets, with precision, accuracy, and recall as evaluation indexes. According to the model verification results, bad cases are saved to a file, those bad cases and similar data are supplemented and added to the training data, and the model is iterated.
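A sketch of this schedule; the `adabound` package provides the AdaBound optimizer, while the thresholds, decay factor, batch format, and helper names are assumptions.

```python
import torch
import adabound

def train(model, loader, evaluate_f1, epochs=50, patience=20,
          lr=0.1, decay=0.1, min_delta=1e-4):
    """Train with AdaBound, reduce the learning rate when the F1 index stops
    improving within the patience window, and stop early once the learning
    rate has essentially stopped changing before the epoch budget runs out."""
    optimizer = adabound.AdaBound(model.parameters(), lr=lr)
    best_f1, stale = 0.0, 0
    for epoch in range(epochs):
        model.train()
        for batch in loader:               # batch assumed to match model.forward
            optimizer.zero_grad()
            loss = model(**batch)
            loss.backward()
            optimizer.step()
        f1 = evaluate_f1(model)
        if f1 - best_f1 < min_delta:       # index stopped improving
            stale += 1
            if stale >= patience:
                if lr < 1e-6:              # learning rate no longer changes
                    break                  # early stop
                lr *= decay
                for group in optimizer.param_groups:
                    group["lr"] = lr
                stale = 0
        else:
            best_f1, stale = f1, 0
            torch.save(model.state_dict(), "best_model.pt")  # keep the best model
```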
208. And identifying the nested entity data by using the identification model which reaches the standard after training.
For this embodiment, model prediction uses a service built with Flask to deploy the model and complete prediction on data.
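A minimal sketch of such a Flask service; the route, payload shape, and inference call are assumptions.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
# model = load_trained_model()  # hypothetical loader for the trained recognition model

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    # entities = model.recognize(text)  # placeholder for the real inference call
    entities = []
    return jsonify({"text": text, "entities": entities})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```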
In order to fully describe the overall specific implementation process of the method of this embodiment, an overall flow of the solution is given, and as shown in fig. 5, for example, the flow includes:
a. and relevant hyper-parameter setting of model training is carried out, and the hyper-parameter setting is packaged into a file, so that modification and maintenance are facilitated. If the configuration class is set, the required hyper-parameters are packaged and managed in a unified way, so that the setting and adjustment of the hyper-parameters are facilitated. The hyper-parameters of the invention in application comprise: data path, model path, Batch size, whether data needs to be shuffled, word size, label size, Bert model, number of layers, hidden layer size, Dropout, Epoch, learning rate, optimizer, regularization, whether GPU is used, etc.
b. Generate short text data using the seed entities and complete the data annotation. As shown in FIG. 6, the process may specifically include: first determining the entity categories and collecting the seed entity vocabularies; generating short text sentences according to common search combinations; processing Chinese text, English text, and symbols into their respective tokens; then defining a list and defining labels for the token-divided text; then searching the short texts for entities using the seed entities, and adding indexes and labels; and finally dividing the training set and test set and adding the seed data to the training set.
c. And performing data reading and processing to generate a Pickle file required by training.
d. Train the BERT_BiLSTM_Multi_CRF model using the learning-rate adjustment and early-stopping method, modify the output layer with the decoding scheme and multiple CRFs, and save the optimal model using F1 as the index.
e. And performing model verification by using the precision, the accuracy and the recall rate as indexes.
f. Build the service with Flask, deploy the model, and complete calling and prediction.
The nested fine-grained entity identification method provided by this embodiment offers a data labeling mode based on entity index subscripts, which makes labeling nested entities convenient, optimizes the labeling process and workload of nested entity recognition, and allows adjustment through subsequent data processing, thereby simplifying the data labeling process and saving time and cost. A decoding mode is provided based on the BERT_BiLSTM_Multi_CRF model output layer, together with a CRF-based reconstruction and decoding method, which can solve the structural ambiguity problem of traditional nested NER, facilitates search and prediction from coarse-grained to fine-grained entities, and increases processing speed. Nested fine-grained identification of entities in the recruitment service is realized, entity division of user search content becomes more accurate and comprehensive, and a better entity basis is provided for subsequent search.
Further, as a specific implementation of the method shown in fig. 1 and fig. 2, this embodiment provides an apparatus for identifying nested entity data, as shown in fig. 7, the apparatus includes: a generating module 31, a defining module 32, a training module 33, and a recognition module 34.
The generating module 31 is configured to arrange and combine the seed entity vocabularies of different entity categories to generate a short text data set;
a defining module 32, configured to define at least one entity category label for a short text in the short text data set, where each entity category label corresponds to start index information and end index information of a sub-text in the short text;
the training module 33 is configured to train a deep learning recognition model by using the defined short text data set as a training set;
and the recognition module 34 is used for recognizing the nested entity data by using the recognition model which reaches the standard after training.
In a specific application scenario, the definition module 32 is specifically configured to perform token division on short texts in the short text data set according to symbols and different languages; adding entity category labels for the text divided by the token according to the seed entity vocabulary and storing the entity category labels in a list; and defining at least one entity category label for the short text in the short text data set by using the text in the list and the entity category labels corresponding to the text, wherein each entity category label corresponds to the index information of the beginning and the end of the sub-text in the short text.
In a specific application scenario, the definition module 32 is specifically further configured to uniformly replace the spaces in the short text with preset characters and unify the letter case format; process foreign-language words as whole tokens, process Chinese text and other characters character by character, and separate the tokens by a first preset symbol;
the defining module 32 is further configured to occupy a line of the short text, and add entity tagging contents of the short text in a line-changing manner, where the entity tagging contents include an entity category label in the defined short text and subscript positions of a start and an end, and an index value is given by using a left-closed and right-open interval, where the start index and the end index are separated by a second preset symbol, a third preset symbol is added after the index is ended, then a corresponding entity category label is added, a fourth preset symbol interval is used between tagging contents of each entity, and each group of data and labels is separated from a next group of data and labels by an empty line.
In a specific application scenario, optionally, the identification model is a BERT_BiLSTM_Multi_CRF model; correspondingly, the training module 33 is specifically configured to construct the BERT_BiLSTM_Multi_CRF model so that a separate CRF is prepared for each entity category; obtain the feature vector corresponding to each short text in the defined short text data set, where the feature vector includes a text feature, a label feature, a first-token feature, and a last-token feature; combine the feature vectors to generate tuple data; and train the BERT_BiLSTM_Multi_CRF model using the tuple data.
In a specific application scenario, the training module 33 is further configured to use the Chinese model form of the BERT model as a context word-embedding generator; stack the BiLSTM layers on the BERT model, where the BERT_BiLSTM_Multi_CRF model is provided with at least two BiLSTM hidden layers; and, in the CRF layer, replace the trained objective function with the standard objective function of a linear-chain CRF and use multiple CRFs to consider each entity class separately when decoding at the output layer, where each element of each CRF transition matrix has a fixed value, so that the range of Viterbi decoding is recursively narrowed and dynamically determined according to the previous Viterbi decoding result, and Viterbi decoding is then applied only to the text segments identified as entities.
In a specific application scenario, the training module 33 is further configured to define a triple classification list comprising a tokens list, a chars list, and an entities list, where the tokens list contains the tokens of the short text divided by the first preset symbol, the chars list contains the characters of the short text, and the entities list contains the start index, end index, and entity category label of each entity in the short text; process sentence by sentence with the classification list: add a first label at the beginning of the short text and convert it to an ID value via the dictionary table of the BERT Chinese model after tokenization, then tokenize the single-sentence text and convert its tokens to ID values; add a second label at the end of the short text and convert it to an ID value in the same way, obtaining the text input ID values; convert the entity category labels into numerical types according to the vocabulary mapping, store the start index, end index, and label ID as triples, and generate a first-token list and a last-token list for each single-sentence text; convert all texts and entity category labels of the triple class formed from the training set, generate summary dictionaries for the texts, labels, first tokens, and last tokens respectively, and, after sorting, form the text features, label features, first-token features, and last-token features at intervals of the batch size.
In a specific application scenario, the training module 33 is further configured to obtain a first maximum length of an element in the text feature and a second maximum length of an element in the top token feature; supplementing 0 to the text input ID value at the tail according to the first maximum length to form a new text input ID value, and obtaining input ID characteristics; generating an input mask ID value according to the length of the new text input ID value to form an input mask characteristic, wherein the value is 1 from the beginning to the original text input ID length part, and the rest positions are supplemented with 0; generating a mask list according to the first token feature, and generating a mask feature by using the mask list, wherein the part value from the starting position to the second maximum length is True, and the rest positions are filled with False; and adding the input ID characteristic, the input mask characteristic, the first token characteristic, the last token characteristic, the label characteristic and the mask characteristic into a list, and combining to obtain tuple data.
In a specific application scenario, the generating module 31 is specifically configured to create an entity category dictionary, where the dictionary includes different entity categories and seed entity vocabularies collected corresponding to the entity categories; and arranging and combining the seed entity words in the dictionary according to the search sentence pattern with the use frequency of the user being greater than the preset frequency threshold, and adding stop words as non-entity content to generate the short text data set.
In a specific application scenario, the training module 33 is further configured to shuffle text data in the defined short text data set; dividing the short text data set after the sequence is disturbed into a training set and a testing set according to a preset proportion; adding each seed entity vocabulary in the entity category dictionary and entity category labels, start indexes and end indexes corresponding to each seed entity vocabulary to a training set; and training the recognition model by using the added training set, and testing the trained recognition model by using the test set.
In a specific application scenario, the training module 33 is further configured to set an initial learning rate, a first preset number and a second preset number of epochs, and model indexes, where the model indexes include accuracy, and/or recall, and/or the F1 value; train the recognition model, and reduce the learning rate when the epoch-over-epoch improvement of the model indexes is smaller than a first preset threshold and the model's epoch count is smaller than the first preset number; and continue to train the recognition model at the reduced learning rate, stopping model training when the epoch-over-epoch improvement of the model indexes is smaller than a second preset threshold and the model's epoch count is smaller than the second preset number.
It should be noted that other corresponding descriptions of the functional units related to the apparatus for identifying nested entity data provided in this embodiment may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not described herein again.
Based on the above methods shown in fig. 1 and fig. 2, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the above method for identifying nested entity data shown in fig. 1 and fig. 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash disk, a removable hard disk, etc.) and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method of the embodiments of the present application.
Based on the method shown in fig. 1 and fig. 2 and the virtual device embodiment shown in fig. 7, in order to achieve the above object, an embodiment of the present application further provides an electronic device, which may be a personal computer, a notebook computer, a smart phone, a server, or other network devices, and the device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the above method for identifying nested entity data as shown in fig. 1 and 2.
Optionally, the entity device may further include a user interface, a network interface, a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a WI-FI module, and the like. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), etc.
It will be understood by those skilled in the art that the physical device structure provided in this embodiment does not constitute a limitation on the physical device, which may include more or fewer components, combine certain components, or adopt a different arrangement of components.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the above-described physical devices, and supports the operation of the information processing program as well as other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and communication with other hardware and software in the information processing entity device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, or by hardware alone. Compared with the prior art, the scheme of this embodiment abandons the BIO entity data annotation mode and optimizes how entity data is annotated. Specifically, after the seed entity vocabularies of different entity categories are arranged and combined to generate a short text data set, at least one entity category label is defined for each short text in the data set, and each entity category label corresponds to the start and end index information of a sub-text within the short text. Because entity annotation information is defined for each statement as start and end indexes plus an entity category label, annotating multiply nested entity content becomes simpler to implement and more convenient to process; the process and workload of annotating nested entities for recognition are reduced, saving time and labor cost. The short text data set obtained in this annotation mode is then used as a training set to train a deep learning recognition model, so that nested entity data can be recognized with the recognition model that reaches the standard after training, improving the efficiency and accuracy of nested entity data recognition.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims (13)

1. A method for identifying nested entity data is characterized by comprising the following steps:
arranging and combining the seed entity vocabularies of different entity categories to generate a short text data set;
defining at least one entity category label for the short text in the short text data set, wherein each entity category label corresponds to the starting index information and the ending index information of the sub text in the short text;
training a deep learning recognition model by using the defined short text data set as a training set;
identifying nested entity data using the recognition model trained to meet a standard.
2. The method according to claim 1, wherein the defining at least one entity category label for the short text in the short text data set, with each entity category label corresponding to the start index information and the end index information of the sub-text in the short text, specifically comprises:
dividing short texts in the short text data set into text identification blocks according to symbols and different languages;
adding entity category labels, according to the seed entity vocabulary, to the text divided into text identification blocks, and storing the labels in a list;
and defining at least one entity category label for the short text in the short text data set by using the text in the list and the entity category labels corresponding to the text, wherein each entity category label corresponds to the index information of the beginning and the end of the sub-text in the short text.
3. The method according to claim 2, wherein the dividing of the short text in the short text data set into text identification blocks according to symbols and different languages specifically comprises:
uniformly replacing the blank spaces in the short text with preset characters;
the foreign-language words are each treated as a whole text identification block, the Chinese text and other characters are treated as single characters, and the text identification blocks are separated by a first preset symbol;
the defining at least one entity category label for the short text in the short text data set by using the text in the list and the entity category labels corresponding to the text, with each entity category label corresponding to the start and end index information of the sub-text in the short text, specifically comprises:
each short text occupies one line, and the entity annotation content of the short text is added on a new line; the entity annotation content comprises the entity category label and the start and end subscript positions in the defined short text, where index values are given as a left-closed, right-open interval; the start index and the end index are separated by a second preset symbol, a third preset symbol is added after the end index, followed by the corresponding entity category label; the annotation contents of different entities are separated by a fourth preset symbol, and each group of data and labels is separated from the next group by an empty line.
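For illustration only, this layout can be sketched as follows; the concrete choices of the second, third and fourth preset symbols (comma, space and semicolon) and the category names are hypothetical, since the claim leaves them open.

```python
# Hypothetical rendering of one annotated group in the claim-3 layout: the
# short text on one line, the entity annotations on the next, then an empty
# line. Second/third/fourth preset symbols assumed to be ",", " " and ";".
def write_group(f, text, entities):
    """entities: (start, end, label) with a left-closed, right-open index."""
    f.write(text + "\n")
    f.write(";".join(f"{s},{e} {label}" for s, e, label in entities) + "\n")
    f.write("\n")  # empty line separates this group from the next

# For a nested example "完美世界手游" (game) containing "完美世界" (company),
# the file would contain:
#   完美世界手游
#   0,6 game;0,4 company
```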
4. The method of claim 3, wherein the recognition model is a BERT_BiLSTM_Multi_CRF model;
the training of the deep learning recognition model by using the defined short text data set as a training set specifically includes:
constructing the BERT_BiLSTM_Multi_CRF model so that a separate CRF is prepared for each entity category;
acquiring a feature vector corresponding to each short text in the defined short text data set, wherein the feature vector comprises a text feature, a label feature, a first text identification block feature and a last text identification block feature;
combining the feature vectors to generate tuple data;
training the BERT_BiLSTM_Multi_CRF model using the tuple data.
5. The method of claim 4, wherein the constructing of the BERT_BiLSTM_Multi_CRF model so that a separate CRF is prepared for each entity category specifically comprises:
using the Chinese model form of the BERT model as a contextual word embedding generator;
stacking a BiLSTM layer on the BERT model, wherein the BERT_BiLSTM_Multi_CRF model is provided with at least two BiLSTM hidden layers;
and using the standard objective function of a linear-chain CRF as the training objective in the CRF layer, and decoding the output layer with a plurality of CRFs that each consider one entity category, wherein each element of each CRF transition matrix has a fixed value, so that the range of Viterbi decoding is determined dynamically from the previous Viterbi decoding result, recursively narrowing the decoding range, after which Viterbi decoding is applied to the text segments identified as entities.
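Purely as an illustration of claims 4 and 5, the skeleton below stacks a BiLSTM on BERT and attaches a separate linear + CRF head per entity category, using the third-party pytorch-crf package; the per-category tag set size, hidden size, and the claim's fixed-transition and recursive Viterbi-narrowing details are simplified assumptions and are not reproduced here.

```python
# Sketch of a BERT_BiLSTM_Multi_CRF-style model: BERT embeddings, a stacked
# BiLSTM, and one linear + CRF head per entity category.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

class BertBiLSTMMultiCRF(nn.Module):
    def __init__(self, num_categories, hidden=256, num_tags=3):
        # num_tags: per-category tag set size, assumed 3 (e.g. B/I/O) here.
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        # At least two BiLSTM hidden layers, per claim 5.
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                              num_layers=2, bidirectional=True,
                              batch_first=True)
        # One emission projection and one linear-chain CRF per entity category.
        self.heads = nn.ModuleList(nn.Linear(2 * hidden, num_tags)
                                   for _ in range(num_categories))
        self.crfs = nn.ModuleList(CRF(num_tags, batch_first=True)
                                  for _ in range(num_categories))

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.bilstm(h)
        # Each entity category is decoded by its own CRF.
        return [crf.decode(head(h), mask=attention_mask.bool())
                for head, crf in zip(self.heads, self.crfs)]
```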
6. The method according to claim 4, wherein the obtaining the feature vector corresponding to each short text in the defined short text data set specifically includes:
defining a classification list of triples, wherein the classification list comprises a tokens list, a chars list and an entities list; the tokens list comprises the text identification block tokens obtained by dividing the short text according to a first preset symbol, the chars list comprises the chars obtained by dividing the short text into individual characters, and the entities list comprises the start index, the end index and the entity category label of each entity in the short text;
processing sentence by sentence with the classification list: adding a first label at the beginning of the short text and converting it into an ID value through the dictionary table corresponding to the BERT Chinese model after tokenize processing; carrying out tokenize processing on the single-sentence text and then converting its text identification blocks into ID values; adding a second label at the end of the short text and converting it into an ID value through the dictionary table corresponding to the BERT Chinese model after tokenize processing, to obtain the text input ID values;
converting the entity category labels into numerical value types according to the mapping of a word list, storing a start index, an end index and a label ID as triples, and generating a first text identification block list and a last text identification block list for each single sentence text;
and converting all texts and entity category labels of the triple class formed from the training set, respectively generating dictionaries summarizing the texts, the labels, the first text identification blocks and the last text identification blocks, and, after sorting, respectively forming the text features, label features, first text identification block features and last text identification block features at intervals of the batch size.
7. The method according to claim 6, wherein the combining of the feature vectors to generate tuple data specifically comprises:
acquiring a first maximum length of the elements in the text feature and a second maximum length of the elements in the first text identification block feature;
padding the text input ID values with 0 at the end according to the first maximum length to form new text input ID values, and obtaining the input ID feature;
generating input mask ID values according to the length of the new text input ID values to form the input mask feature, wherein the value is 1 from the beginning up to the original text input ID length and the remaining positions are padded with 0;
generating a mask list according to the first text identification block feature, and generating the mask feature by using the mask list, wherein the value is True from the starting position up to the second maximum length and the remaining positions are filled with False;
and adding the input ID feature, the input mask feature, the first text identification block feature, the last text identification block feature, the label feature and the mask feature to a list, and combining them to obtain the tuple data.
8. The method of claim 1, wherein the arranging and combining the seed entity vocabularies of different entity categories to generate the short text data set specifically comprises:
creating an entity category dictionary, wherein the dictionary comprises different entity categories and the seed entity vocabularies collected for each entity category;
and arranging and combining the seed entity words in the dictionary according to the search sentence pattern with the use frequency of the user being greater than the preset frequency threshold, and adding stop words as non-entity content to generate the short text data set.
9. The method according to claim 8, wherein the training of the deep-learning recognition model by using the defined short text data set as a training set specifically comprises:
shuffling the text data in the defined short text data set;
dividing the short text data set after the sequence is disturbed into a training set and a testing set according to a preset proportion;
adding each seed entity vocabulary in the entity category dictionary and entity category labels, start indexes and end indexes corresponding to each seed entity vocabulary to a training set;
and training the recognition model by using the added training set, and testing the trained recognition model by using the test set.
10. The method according to claim 9, wherein the training the recognition model with the added training set specifically comprises:
setting an initial learning rate, a first preset number and a second preset number for the Epoch parameter, and model indexes, wherein the model indexes comprise the accuracy rate, and/or the recall rate, and/or the F1 value;
training the recognition model, and reducing the learning rate when the epoch-over-epoch change of the model indexes is smaller than a first preset threshold and the Epoch parameter of the model is smaller than the first preset number;
and continuing to train the recognition model based on the reduced learning rate, and stopping model training when the epoch-over-epoch change of the model indexes is smaller than a second preset threshold and the Epoch parameter of the model is smaller than the second preset number.
11. An apparatus for identifying nested entity data, comprising:
the generating module is used for arranging and combining the seed entity vocabularies of different entity categories to generate a short text data set;
the defining module is used for defining at least one entity class label for the short text in the short text data set, and each entity class label corresponds to the starting index information and the ending index information of the sub-text in the short text;
the training module is used for training a deep learning identification model by using the defined short text data set as a training set;
and the recognition module is used for recognizing the nested entity data by using the recognition model which reaches the standard after training.
12. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any of claims 1 to 10.
13. An electronic device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the method of any one of claims 1 to 10 when executing the computer program.
CN202011522097.4A 2020-12-21 2020-12-21 Nested entity data identification method and device and electronic equipment Active CN112257421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011522097.4A CN112257421B (en) 2020-12-21 2020-12-21 Nested entity data identification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112257421A true CN112257421A (en) 2021-01-22
CN112257421B CN112257421B (en) 2021-04-23

Family

ID=74225326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011522097.4A Active CN112257421B (en) 2020-12-21 2020-12-21 Nested entity data identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112257421B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200065374A1 (en) * 2018-08-23 2020-02-27 Shenzhen Keya Medical Technology Corporation Method and system for joint named entity recognition and relation extraction using convolutional neural network
CN110032737A (en) * 2019-04-10 2019-07-19 贵州大学 A kind of boundary combinations name entity recognition method neural network based
CN111444719A (en) * 2020-03-17 2020-07-24 车智互联(北京)科技有限公司 Entity identification method and device and computing equipment
CN111753545A (en) * 2020-06-19 2020-10-09 科大讯飞(苏州)科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
CN112036184A (en) * 2020-08-31 2020-12-04 湖南星汉数智科技有限公司 Entity identification method, device, computer device and storage medium based on BiLSTM network model and CRF model

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800768A (en) * 2021-02-03 2021-05-14 北京金山数字娱乐科技有限公司 Training method and device for nested named entity recognition model
CN112861539B (en) * 2021-03-16 2023-12-15 云知声智能科技股份有限公司 Nested named entity recognition method, apparatus, electronic device and storage medium
CN112861539A (en) * 2021-03-16 2021-05-28 云知声智能科技股份有限公司 Nested named entity recognition method and device, electronic equipment and storage medium
CN113033204A (en) * 2021-03-24 2021-06-25 广州万孚生物技术股份有限公司 Information entity extraction method and device, electronic equipment and storage medium
CN113255342A (en) * 2021-06-11 2021-08-13 云南大学 Method and system for identifying product name of 5G mobile service
CN113486178A (en) * 2021-07-12 2021-10-08 恒安嘉新(北京)科技股份公司 Text recognition model training method, text recognition device and medium
CN113486178B (en) * 2021-07-12 2023-12-01 恒安嘉新(北京)科技股份公司 Text recognition model training method, text recognition method, device and medium
CN113656544A (en) * 2021-08-11 2021-11-16 云知声智能科技股份有限公司 Training method, device, equipment and medium for nested named entity recognition model
CN113656544B (en) * 2021-08-11 2024-03-15 云知声智能科技股份有限公司 Training method, device, equipment and medium for nested named entity recognition model
CN113743120A (en) * 2021-09-07 2021-12-03 湖北亿咖通科技有限公司 Statement processing method and device
CN113743120B (en) * 2021-09-07 2023-07-11 亿咖通(湖北)技术有限公司 Statement processing method and device
CN113961669A (en) * 2021-10-26 2022-01-21 杭州中软安人网络通信股份有限公司 Training method of pre-training language model, storage medium and server
CN114742060A (en) * 2022-04-21 2022-07-12 平安科技(深圳)有限公司 Entity identification method and device, electronic equipment and storage medium
CN114742060B (en) * 2022-04-21 2023-05-02 平安科技(深圳)有限公司 Entity identification method, entity identification device, electronic equipment and storage medium
CN117316372A (en) * 2023-11-30 2023-12-29 天津大学 Ear disease electronic medical record analysis method based on deep learning
CN117316372B (en) * 2023-11-30 2024-04-09 天津大学 Ear disease electronic medical record analysis method based on deep learning

Also Published As

Publication number Publication date
CN112257421B (en) 2021-04-23

Similar Documents

Publication Publication Date Title
CN112257421B (en) Nested entity data identification method and device and electronic equipment
CN107229610B (en) A kind of analysis method and device of affection data
CN109766277B (en) Software fault diagnosis method based on transfer learning and DNN
KR102055656B1 (en) Methods, apparatus and products for semantic processing of text
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
CN110851596A (en) Text classification method and device and computer readable storage medium
CN112084381A (en) Event extraction method, system, storage medium and equipment
US20210173829A1 (en) Natural Language Processing Engine For Translating Questions Into Executable Database Queries
CN101246478B (en) Information storage and retrieval method
CN109117470B (en) Evaluation relation extraction method and device for evaluating text information
US11645447B2 (en) Encoding textual information for text analysis
CN112084746A (en) Entity identification method, system, storage medium and equipment
CN114036950A (en) Medical text named entity recognition method and system
CN116070632A (en) Informal text entity tag identification method and device
CN115994535A (en) Text processing method and device
CN110245353B (en) Natural language expression method, device, equipment and storage medium
CN116737922A (en) Tourist online comment fine granularity emotion analysis method and system
Szűcs et al. Seq2seq deep learning method for summary generation by lstm with two-way encoder and beam search decoder
CN113159187A (en) Classification model training method and device, and target text determining method and device
CN110941958A (en) Text category labeling method and device, electronic equipment and storage medium
CN114860942B (en) Text intention classification method, device, equipment and storage medium
CN115101042A (en) Text processing method, device and equipment
CN107622126A (en) The method and apparatus sorted out to the solid data in data acquisition system
CN114332476A (en) Method, device, electronic equipment, storage medium and product for identifying dimensional language
Jiang et al. Construction of segmentation and part of speech annotation model in ancient chinese

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant