CN108304372B

CN108304372B - Entity extraction method and device, computer equipment and storage medium

Info

Publication number: CN108304372B
Application number: CN201710909581.4A
Authority: CN
Inventors: 崔建苓
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2017-09-29
Filing date: 2017-09-29
Publication date: 2021-08-03
Anticipated expiration: 2037-09-29
Also published as: CN108304372A

Abstract

The invention provides an entity extraction method and device, computer equipment and a storage medium, wherein the entity extraction method comprises the following steps: acquiring input information and scene information to be extracted; preprocessing input information to obtain text characteristics of the input information; determining a corresponding predefined entity regular expression according to the scene information; identifying each entity of the input information according to the text characteristics to obtain an entity list; combining all entities in the entity list to obtain statement combinations according to the corresponding entity regular expressions, and calculating the confidence coefficient of each statement combination; and performing regular matching based on the regular expression of the sentence combination with the confidence coefficient larger than the preset threshold value to obtain a regular entity extraction result. The entity extraction can be carried out by means of the pre-determined regular expression, the regular entity identification capability is provided for the third party, the cost that the third party spends a large amount of manpower to enumerate all entities or label a large amount of entities in the cold start stage can be saved, and the expression entities with specific grammars and sentences can be identified conveniently.

Description

Entity extraction method and device, computer equipment and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to an entity extraction method and apparatus, a computer device, and a storage medium.

Background

Entity extraction, also called named entity recognition, is a fundamental problem in natural language processing. In natural language processing, entities mainly include entity names such as place names, organization names, person names, numbers, and domain-specific proper nouns, as well as expressions such as numerical expressions (formulas, monetary values, scores), time expressions, and character string expressions. Named entity recognition is applied in information extraction, information retrieval, machine translation, question-answering systems, and the like.

Classical named entity recognition algorithms include statistical methods such as maximum entropy, hidden Markov models, maximum entropy-hidden Markov models, conditional random fields, and deep learning combined with conditional random fields, as well as rule-based methods such as regular expressions, dictionary matching, and fuzzy dictionary matching.

Traditional entity recognition algorithms rely on large amounts of statistical data. In specific applications, such as the early stage of deploying a vertical question-answering system, labeled data are scarce and must be produced by a large number of annotators, and manual labeling consumes a great deal of time and labor.

Disclosure of Invention

Based on this, it is necessary to provide an entity extraction method and apparatus, a computer device, and a storage medium for solving the problem that manual labeling requires a large amount of time cost and labor cost.

In order to achieve the above purpose, one embodiment adopts the following technical scheme:

an entity extraction method, comprising:

acquiring input information and scene information to be extracted;

preprocessing the input information to obtain text characteristics of the input information;

determining a corresponding predefined entity regular expression according to the scene information;

identifying each entity of the input information according to the text characteristics to obtain an entity list;

combining all entities in the entity list to obtain statement combinations according to the corresponding entity regular expressions, and calculating the confidence coefficient of each statement combination;

and performing regular matching based on the regular expression of the sentence combination with the confidence coefficient larger than the preset threshold value to obtain a regular entity extraction result.

An entity extraction apparatus comprising: the device comprises an information acquisition module, a preprocessing module, a searching module, an identification module, a combination module and a matching module;

the information acquisition module is used for acquiring input information and scene information to be extracted;

the preprocessing module is used for preprocessing the input information to obtain text characteristics of the input information;

the searching module is used for determining a corresponding predefined entity regular expression according to the scene information;

the identification module is used for identifying each entity of the input information according to the text characteristics to obtain an entity list;

the combination module is used for combining the entities in the entity list according to the corresponding entity regular expressions to obtain statement combinations, and calculating the confidence coefficient of each statement combination;

and the matching module is used for performing regular matching on the basis of the regular expression of the sentence combination with the confidence coefficient larger than the preset threshold value to obtain a regular entity extraction result.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the entity extraction method described above when executing the program.

A storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the entity extraction method described above.

According to the entity extraction method and device, after the entities of the input information are recognized from the text features, the entities in the entity list are combined according to the predefined entity regular expressions corresponding to the scene information to obtain sentence combinations, the sentence combinations whose confidence is greater than a preset threshold are selected for regular matching, and a regular entity extraction result is extracted. Entity extraction can thus be carried out with predetermined regular expressions, which gives the third party a regular entity recognition capability, saves the third party the cost of spending a large amount of manpower to enumerate all entities or label a large amount of data in the cold start stage (the initial stage of system use), and makes it convenient to recognize expression entities with specific grammar and sentence patterns.

Drawings

FIG. 1 is a diagram of an embodiment of an application environment of an entity extraction method;

FIG. 2 is a schematic flow chart diagram illustrating a method for entity extraction in one embodiment;

FIG. 3 is a flowchart illustrating the steps of training an entity extraction model in one embodiment;

FIG. 4 is a schematic flow chart of a method for entity extraction in another embodiment;

FIG. 5 is a diagram illustrating a process for extracting tokens hierarchically using regular expressions in an embodiment;

FIG. 6 is a schematic flow chart diagram illustrating a method for entity extraction in yet another embodiment;

FIG. 7 is a block diagram showing the structure of an entity extraction apparatus according to an embodiment;

FIG. 8 is a block diagram showing the construction of an entity extraction apparatus according to another embodiment;

FIG. 9 is a diagram showing an internal configuration of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to illustrate the invention and are not intended to limit its scope.

FIG. 1 is a diagram of the application environment of an entity extraction method in one embodiment. As shown in FIG. 1, an entity extraction platform 101 provides a semantic understanding function and implements the entity extraction method. The user terminal 105 is communicatively connected to the third-party server 103 and transmits the user's input information to it. The third-party server 103 provides services such as information retrieval, machine translation, and question answering. The third-party server 103 accesses the entity extraction platform 101 and sends the input information to it; the entity extraction platform 101 performs semantic understanding, extracts entities, and feeds them back to the third-party server 103. The third-party server 103 then provides the corresponding service, such as translation or retrieval, according to the extracted entities.

FIG. 2 is a flowchart of an entity extraction method in one embodiment. This embodiment is described mainly by taking the application of the method to the entity extraction platform 101 in FIG. 1 as an example. Referring to FIG. 2, the method specifically includes the following steps:

S202: Acquire input information and scene information to be extracted.

The input information refers to information to be processed, which is acquired through a third-party server and input by a user at a user terminal, and the input information comprises character information, picture information or voice information. For example, the input information may be a daily dialog to be answered, an instruction to be executed, a sentence to be translated, or a text to be retrieved, etc.

The scene information is information related to the application scene of the input information. The scene information is generally related to the service provided by a third party and includes an identifier of the service provided by the third-party server. For example, if the third-party server provides a translation function, the scene information is an identifier indicating a translation scene.

When the third-party server acquires input information to be extracted, which is sent by the user terminal, the input information and scene information corresponding to the third-party server are sent to the entity extraction platform 101.

S204: Preprocess the input information to obtain the text features of the input information.

Preprocessing refers to the processing applied to the input information before entity extraction in order to obtain text features. Preprocessing specifically includes: converting upper case to lower case, converting half-width characters to full-width characters, converting Traditional Chinese to Simplified Chinese, removing leading and trailing stop words and emoticon words, splitting long sentences into short sentences, normalizing text data, extracting semantic features, extracting grammatical features, extracting statistical features, and the like.

Semantic feature extraction includes N-gram model analysis, intent analysis, topic analysis, sentiment analysis, and the like. Grammatical feature extraction includes word segmentation, part-of-speech tagging, shallow syntactic analysis, sentence pattern analysis, dependency parsing, and the like. Statistical feature extraction includes word weight calculation, frequency calculation, and feature extraction using a dictionary.

Normalization here means data standardization, a basic task in data mining. Different evaluation indexes often have different scales and units, which affects the results of data analysis. To eliminate the influence of scale among indexes, the data are standardized so that the indexes become comparable. After standardization, all indexes are on the same order of magnitude and are suitable for comprehensive comparison and evaluation. Common normalization methods include min-max normalization and vector norm normalization.
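The two normalization methods named above can be sketched as follows. This is a minimal illustration only; the function names are hypothetical and the patent does not specify how the platform implements normalization.

```python
import math

def min_max_normalize(values):
    """Rescale values linearly into [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def vector_norm_normalize(vector):
    """Divide a vector by its Euclidean length (vector norm normalization)."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        return list(vector)
    return [v / norm for v in vector]

scaled = min_max_normalize([2, 4, 6])
unit = vector_norm_normalize([3, 4])
```

After either transformation, features measured in different units fall in a comparable range.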

Splitting long sentences into short sentences means that when the number of characters in a sentence of the input text exceeds a set value, the long sentence is divided into several short sentences.
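A minimal sketch of this splitting step is shown below. The limit of 20 characters and the choice to split at punctuation marks are assumptions made for illustration; the source only says that a set value is used.

```python
import re

def split_long_sentence(sentence, max_chars=20):
    """Split a sentence at punctuation when it exceeds max_chars characters.

    max_chars and the punctuation set are illustrative assumptions. Pieces
    that are still longer than max_chars are left as they are.
    """
    if len(sentence) <= max_chars:
        return [sentence]
    # Split after a punctuation mark and drop any whitespace that follows it.
    parts = re.split(r"(?<=[,;.!?，；。！？])\s*", sentence)
    return [p for p in parts if p]

short = split_long_sentence("short one.")
pieces = split_long_sentence("I want to listen to a song, then set an alarm for 7 am.")
```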

If the input information is picture information, the preprocessing also comprises character recognition of the picture and extraction of character information in the picture. If the input information is voice information, the preprocessing further includes performing voice recognition on the voice information and converting the voice information into corresponding characters.

S206: Determine the corresponding predefined entity regular expression according to the scene information.

Specifically, a third party submits regular expression entity declarations to the entity extraction platform 101 in advance, based on the recognition requirements of its service content, and each submitted declaration carries the scene information corresponding to that service content. A regular expression entity declaration includes a regular expression entity name, an English name, and a regular expression. After the entity extraction platform obtains the declaration, it establishes the correspondence between the scene information and the regular expression. The regular expression contains the entity part to be extracted: the string matched by the first group is the entity part to be extracted. The syntax of the entity part is fully compatible with the syntax of the regular expression engine.

When the platform receives the input information and the scene information, it uses the pre-established correspondence between regular expressions and scene information to determine, from the received scene information, the entity regular expressions that the corresponding third party declared in advance.
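The declaration and lookup described above might be sketched as follows. All names here (RegexEntityDeclaration, REGISTRY, the scene identifier "qa_demo") are hypothetical, and Python's re module stands in for the platform's own regular expression engine.

```python
import re
from dataclasses import dataclass

@dataclass
class RegexEntityDeclaration:
    name: str          # regular expression entity name
    english_name: str  # English name
    pattern: str       # regular expression; group 1 is the entity part

# Correspondence between scene information and declared regular expressions.
REGISTRY = {}

def register(scene_id, declaration):
    """Store a declaration under the scene information it was submitted with."""
    re.compile(declaration.pattern)  # raises re.error if the syntax is invalid
    REGISTRY.setdefault(scene_id, []).append(declaration)

def lookup(scene_id):
    """Return the declarations registered for the given scene information."""
    return REGISTRY.get(scene_id, [])

def extract_entity_part(declaration, text):
    """Return the string matched by the first group, or None if no match."""
    m = re.search(declaration.pattern, text)
    return m.group(1) if m else None

register("qa_demo", RegexEntityDeclaration("number", "player_number", r"I am (\d+)"))
found = [extract_entity_part(d, "I am 42") for d in lookup("qa_demo")]
```

Keeping the mapping keyed by scene information means that only the expressions of the relevant third party are consulted for a given request.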

The syntax supported by the entity extraction platform is as follows:

1) Common regular expressions customized by the third party. The syntax supported by the regular expression engine conforms to the ECMAScript syntax format. Examples include "I am (\d+ sign)" and "((1|2|3|one|two|three)(quarter hour))". In one embodiment, the entity extraction platform provides a regular expression editing page on which a third party can customize common regular expressions. The list of common regular expressions customized by a third party is shown in Table 1. For the regular expressions in the list, the user can perform operations such as editing, uploading, downloading, and deleting.

TABLE 1 third-party custom generic regular expression List

2) Regular expression entities of the entity extraction platform that the third party declares for use. The platform provides 40 types of entities, such as person names, place names, numbers, and times, each with its own symbolic expression, which greatly expands the natural language that the computer can understand. Examples include "I am (@ner_number)" and "(@ner_number)(quarter hour)". A third party may declare some or all of these entities for use so as to directly utilize the entity regular expressions provided by the platform. In one embodiment, the entity extraction platform provides a usage declaration editing page for entity regular expressions, on which a third party can select the platform entity regular expressions it needs. The list of platform entities declared for use by a third party is shown in Table 2. For the regular expressions in the list, the user can perform operations such as uploading, downloading, and deleting.

Table 2 entities of entity extraction platform used by third party declaration

A regular expression, often called a pattern, is a string of characters used to describe or match a set of strings that follow a syntactic rule. For example, different spellings of one name, such as Handel and Haendel, can be described by a single pattern. Most regular expressions take the following forms:

Alternation: the vertical bar "|" represents a choice. For example, "gray|grey" matches gray or grey.

Quantification: a quantifier placed after a character limits how many times that character may occur. The most common quantifiers are "+", "?", and "*" (a character with no quantifier occurs exactly once). The plus sign "+" means that the preceding character must occur at least once (one or more times). For example, "goo+gle" matches google, gooogle, and so on. The question mark "?" means that the preceding character may occur at most once (zero times or once). For example, "colou?r" matches color or colour. The asterisk "*" means that the preceding character may be absent or may occur one or more times (zero, one, or more times). For example, "0*42" matches 42, 042, 0042, 00042, and so on.

Grouping: parentheses define the scope and precedence of operators. For example, "gr(a|e)y" is equivalent to "gray|grey", and "(grand)?father" matches father and grandfather.
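The operators described above can be checked with a short script. Python's re module is used here for convenience; the platform itself is stated to follow ECMAScript syntax, and for these basic operators the two behave the same way.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

alternation = full_matches(r"gray|grey", ["gray", "grey", "groy"])
plus = full_matches(r"goo+gle", ["gogle", "google", "gooogle"])
optional = full_matches(r"colou?r", ["color", "colour", "colouur"])
star = full_matches(r"0*42", ["42", "042", "0042", "142"])
grouping = full_matches(r"(grand)?father", ["father", "grandfather", "grandgrandfather"])
```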

The entity extraction platform also runs an extraction check on each third party's entity regular expressions. On the one hand, this prevents code injection attacks; on the other hand, it helps the third party correct regular expression syntax errors or expressions that fail to extract anything. Moving these regular expression risks ahead of deployment improves the robustness of the online system.
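One way such a check could look is sketched below. The function name, the rule that a missing capture group counts as a problem, and the use of sample sentences are assumptions for illustration; the source does not describe how the platform performs the verification.

```python
import re

def validate_declaration(pattern, samples):
    """Return a list of problems found in a submitted pattern (empty if none)."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # Syntax errors are reported back so the third party can correct them.
        return [f"invalid syntax: {exc}"]
    problems = []
    if compiled.groups < 1:
        problems.append("no capture group: nothing would be extracted")
    for sample in samples:
        m = compiled.search(sample)
        if m is None or (compiled.groups >= 1 and not m.group(1)):
            problems.append(f"sample not extracted: {sample!r}")
    return problems

ok = validate_declaration(r"I am (\d+)", ["I am 7"])
broken = validate_declaration(r"I am (\d+", [])
no_group = validate_declaration(r"I am \d+", ["I am 7"])
```

Running the check at submission time, before the expression is used online, is what moves the risk ahead of deployment.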

S208: Identify each entity of the input information according to the text features to obtain an entity list.

Specifically, based on the semantic, grammatical, and statistical text features obtained by preprocessing, a pre-trained entity recognition model identifies each entity in the input information, such as person names, place names, numbers, email addresses, and telephone numbers, to obtain the entity list. The input information is thereby divided into segments, each of which represents an entity together with its confidence. In this embodiment, the entity recognition model is a conditional random field model.

Taking the input information "I want to listen to a song of singer A" as an example, the entity recognition model identifies three entities: 1) <text: listen, label: action, weight: 0.83>; 2) <text: singer A, label: singer, weight: 0.90>; 3) <text: song of singer A, label: album, weight: 0.70>. Here label is the symbolic entity name and weight is the confidence of the entity.

S210: Combine the entities in the entity list according to the corresponding entity regular expressions to obtain sentence combinations, and calculate the confidence of each sentence combination.

Specifically, the user sentence is encoded into a group of sentence templates, the templates are sorted by confidence, and, for matching efficiency, at most 3 to 5 sentence combinations whose confidence exceeds a certain threshold are selected. The problem can be stated abstractly as follows: for given input information q, an entity list nes = {n_1, n_2, ...} is identified, where n_i = label<s_i, e_i>, label is the symbolic entity name, s_i is the start position, and e_i is the end position. For example, for q = "I want to listen to singer A's song" and nes = {action<0,2>, singer<3,5>, album<3,7>}, the expressions can be abstracted as "@ner_action @ner_singer's song" (confidence = 1), "@ner_action @ner_album" (confidence = 0.8), and so on. The problem can be decomposed into two sub-problems: 1. finding all expressions of q; 2. sorting them by confidence.

Specifically, step S210 includes the following steps S1 to S2:

S1: Determine the entities in the entity list that correspond to the predefined entity regular expressions.

S2: Combine the determined entities in the entity list according to the corresponding entity regular expressions to obtain sentence combinations, and calculate the confidence of each sentence combination.

First, for sub-problem 1: every entity in the entity list and every ordinary word segmentation unit can be regarded as a node of the same kind, and an expression of the sentence is an arrangement of these nodes, so all expressions of the sentence can be obtained by enumerating the combinations. To improve efficiency, the combination paths are pruned to reduce the search space: using the third party's regular entity definitions, the entity list keeps only the entities that appear in those definitions and discards entities outside them, which also reflects the specific nature of the third party's service. In addition, by design, each request from an end user carries scene information together with the user's sentence, which keeps the system efficient.
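The enumeration and pruning for sub-problem 1 can be sketched as below. The tokenization of the example sentence is an assumption chosen so that the positions action<0,2>, singer<3,5>, and album<3,7> from the example index whole tokens; the function names are hypothetical, and exhaustive enumeration is used only because the example is tiny.

```python
from itertools import combinations

def overlaps(a, b):
    """True if two (label, start, end) spans share at least one position."""
    return a[1] <= b[2] and b[1] <= a[2]

def enumerate_templates(tokens, entities, declared_labels):
    """List every sentence template built from non-overlapping entities."""
    # Pruning: keep only entities that appear in the third party's definitions.
    kept = [e for e in entities if e[0] in declared_labels]
    templates = []
    for r in range(len(kept) + 1):
        for combo in combinations(kept, r):
            if any(overlaps(a, b) for a, b in combinations(combo, 2)):
                continue
            starts = {s: (label, e) for label, s, e in combo}
            out, i = [], 0
            while i < len(tokens):
                if i in starts:
                    label, end = starts[i]
                    out.append("@ner_" + label)  # replace the span by its symbol
                    i = end + 1
                else:
                    out.append(tokens[i])
                    i += 1
            templates.append(" ".join(out))
    return templates

# Illustrative tokenization (assumption): positions are token indexes.
tokens = ["I", "want to", "listen to", "singer", "named", "A", "'s", "song"]
entities = [("action", 0, 2), ("singer", 3, 5), ("album", 3, 7)]
all_templates = enumerate_templates(tokens, entities, {"action", "singer", "album"})
pruned_templates = enumerate_templates(tokens, entities, {"action", "singer"})
```

With all three labels declared there are six templates; dropping the undeclared album label leaves four, which shows how pruning by the third party's definitions shrinks the search space.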

For sub-problem 2: the confidence of each sentence combination is scored. Scoring is a regression problem. It could be solved by manually annotating all entities in the sentences of a corpus, but annotating every entity is very difficult. In one embodiment, after big-data mining, the entities in all sentences of the corpus are replaced with their symbolic entity names; if a sentence has several sentence templates, several paths are output, so that the whole corpus is converted into a corpus of symbolic expression strings corresponding to the sentence templates. Language models are then trained with RNNLM and n-gram, the probability score P(W) of a symbolic expression string W is calculated, and a logarithmic transformation yields the approximate score S(W). Formula (1) gives the probability score that a sentence is a normal expression, where W is the whole sentence, w_i is the i-th word in the sentence, N is the number of words in the sentence, and n, the order of the n-gram model, is set to 3. Because the probability score measures how well a probability distribution or probability model predicts a sample, a higher value indicates a more fluent sentence and a more probable symbolic expression string, so the score can be used here to evaluate confidence.
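Formula (1) itself is not reproduced in this text. As an assumption only, one form consistent with the symbols defined above (the sentence W, its i-th word, the sentence length N, and an n-gram order of 3) is the usual n-gram log-probability. The original formula may differ, for example by a length normalization.

```latex
S(W) = \log P(W) \approx \sum_{i=1}^{N} \log P\left(w_i \mid w_{i-n+1}, \ldots, w_{i-1}\right), \qquad n = 3
```

Under this reading, each word is scored given the two words before it, and the per-word log-probabilities are summed over the sentence.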

Finally, the trained RNNLM and n-gram models each compute a log score for the sentence, and the scores are combined by weighted linear interpolation as in formula (2), with the weights tuned on held-out data to find the best parameters. This gives the final score of each symbolic expression string, and the strings are sorted by final score in descending order. Here k denotes the k-th language model and α_k denotes the weight assigned to that language model.
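The combination and ranking step can be sketched as follows. The requirement that the weights sum to 1, the threshold, and the scores in the example are all assumptions made for illustration; real scores would come from the trained RNNLM and n-gram models.

```python
def interpolate(scores, weights):
    """Weighted linear combination of per-model log scores for one string."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return sum(a * s for a, s in zip(weights, scores))

def rank_templates(template_scores, weights, threshold, top_k=5):
    """Sort by final score, keep those above the threshold, at most top_k."""
    finals = [(t, interpolate(s, weights)) for t, s in template_scores.items()]
    finals.sort(key=lambda pair: pair[1], reverse=True)
    return [(t, f) for t, f in finals if f > threshold][:top_k]

# Made-up (RNNLM, n-gram) log scores, not outputs of real models.
template_scores = {
    "@ner_action @ner_album": (-4.0, -5.0),
    "I want to listen to singer named A 's song": (-9.0, -11.0),
    "@ner_action @ner_singer 's song": (-2.0, -3.0),
}
ranked = rank_templates(template_scores, weights=(0.5, 0.5), threshold=-6.0)
```

Only the templates that survive the threshold are passed on to regular matching, which keeps the number of match attempts small.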

S212: Perform regular matching based on the regular expressions of the sentence combinations whose confidence is greater than the preset threshold to obtain a regular entity extraction result.

Specifically, the sentence combinations are sorted by confidence and, for matching efficiency, at most 3 to 5 sentence combinations whose confidence exceeds the threshold are selected. Regular matching is performed on the symbolic expression strings of the sentence to identify the regular entity in symbolic form; the symbolic regular entity is then mapped back to its original position through position mapping to find the original text, thereby extracting an entity extraction result that conforms to the semantics.
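Matching on the symbolic string and mapping the result back to the original text can be sketched as below. The token list, the pattern, and the helper names are illustrative assumptions; the source does not describe the data structures used for position mapping.

```python
import re

def build_symbolic(tokens, chosen):
    """Replace chosen (label, start, end) spans with @ner_label symbols.

    Returns the symbolic string and, for each symbol, its character span in
    that string together with the original token span it stands for.
    """
    starts = {s: (label, e) for label, s, e in chosen}
    pieces, spans, pos, i = [], [], 0, 0
    while i < len(tokens):
        if i in starts:
            label, end = starts[i]
            text, tok_span = "@ner_" + label, (i, end)
            i = end + 1
        else:
            text, tok_span = tokens[i], (i, i)
            i += 1
        spans.append((pos, pos + len(text), tok_span))
        pieces.append(text)
        pos += len(text) + 1  # one extra character for the joining space
    return " ".join(pieces), spans

def extract(tokens, chosen, pattern):
    """Match the pattern on the symbolic string and map group 1 back."""
    symbolic, spans = build_symbolic(tokens, chosen)
    m = re.search(pattern, symbolic)
    if m is None:
        return None
    g_start, g_end = m.span(1)
    covered = [tok for cs, ce, tok in spans if cs < g_end and ce > g_start]
    if not covered:
        return None
    first, last = covered[0][0], covered[-1][1]
    return " ".join(tokens[first:last + 1]), (first, last)

tokens = ["I", "want to", "listen to", "singer", "named", "A", "'s", "song"]
chosen = [("action", 0, 2), ("singer", 3, 5)]
result = extract(tokens, chosen, r"@ner_action (@ner_singer) 's song")
```

The regular expression sees only the symbol, while the recorded spans recover the original wording and its position.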

According to the entity extraction method, after the entities of the input information are recognized from the text features, the entities in the entity list are combined according to the predefined entity regular expressions corresponding to the scene information to obtain sentence combinations, the sentence combinations whose confidence is greater than a preset threshold are selected for regular matching, and a regular entity extraction result is extracted. Entity extraction can thus be carried out with predetermined regular expressions, which gives the third party a regular entity recognition capability, saves the third party the cost of spending a large amount of manpower to enumerate all entities or label a large amount of data in the cold start stage (the initial stage of system use), and makes it convenient to recognize expression entities with specific grammar and sentence patterns.

In another embodiment, the data accumulated from regular expression extraction are used over the long term for a small amount of manual annotation and a large amount of automatic annotation, so that annotated data are obtained continuously and an entity extraction model is trained continuously. Specifically, as shown in FIG. 3, the step of training the entity extraction model includes steps S302 to S304:

S302: Obtain feedback on the regular entity extraction results extracted by the regular expressions, and obtain annotation data based on the feedback, the input information, and the regular entity extraction results.

The annotation data in this embodiment includes manual annotation data and automatic annotation data.

The manual annotation data include feedback on the regular entity extraction results. Feedback is the evaluation, by the third party or the user, of whether a regular entity extraction result extracted by a regular expression is correct. For example, the third party may mark extraction results as correct or incorrect on a background management page, which yields feedback on the regular entity extraction results. A feedback interface may also be provided at the user terminal so that the user can rate the question-answering or translation result, and feedback is derived from the user's rating. For example, in a vertical question-answering system, an incorrectly extracted entity leads the reply in the wrong direction, so a wrong dialog reply indicates an entity extraction error. The feedback yields high-quality manual annotation data, and long-term manual annotation can guide the extraction algorithm steadily toward the optimum.

A large amount of automatically annotated data can be obtained by using the input information and the regular entity extraction results extracted by the regular expressions directly as annotated corpus.

S304: Train the entity extraction model according to the annotation data.

The entity extraction model refers to the model parameters obtained by training on the annotation data, which can be used for entity extraction. For an entity extraction model in a vertical question-answering domain, extraction performance drops when a model trained in the vertical domain is used in the open domain. To prevent this, before domain entity recognition an intent recognition classifier judges whether the input belongs to the vertical domain, and domain entity recognition is performed only afterwards. If the domain is not a vertical domain, only the performance of the entity extraction model itself needs to be considered.

In this embodiment, training corpus is collected from a large amount of automatically annotated data, obtained from the input information and the regular entity results extracted by the regular expressions, and a small amount of manually annotated data, obtained from feedback on the extraction results. Training on this corpus yields a high-performance entity extraction model and can improve the accuracy and recall of entity recognition.

Specifically, the process from cold start to model maturity is divided into four stages: cold start, low model performance, medium model performance, and high model performance. In the cold start and low-performance stages, only regular entity extraction is used and the entity extraction model is not used. In the mature stage, entities can be extracted without regular expressions: only the entity extraction model is applied in the online service, completely replacing the original regular expressions.

Specifically, the flowchart of the entity extraction method according to another embodiment is shown in fig. 4, and after step S204, the method further includes the steps of:

S205: Judge the performance of the entity extraction model.

The performance of the entity extraction model includes accuracy and recall.

When the performance of the entity extraction model reaches the first preset value, step S213 is executed: and inputting the text features into the entity extraction model to obtain extraction model output. When the performance of the entity extraction model reaches a first preset value, only the entity extraction model is used for extraction, and the regular expression is not used for entity extraction.

When the performance of the entity extraction model is smaller than a second preset value, step S206 is executed to determine a corresponding predefined entity regular expression according to the scene information, wherein the first preset value is greater than the second preset value. When the performance of the entity extraction model is smaller than a second preset value (in a cold start stage or a low-performance stage of the model), only the regular expression is used for extraction, and the entity extraction model is not used for extraction.

It can be understood that the first preset value and the second preset value respectively include a numerical value corresponding to the accuracy and a numerical value corresponding to the recall rate.

The training of the entity extraction model progresses from the cold start stage to the mature stage, which is reached when the model attains high performance. The high-performance standard generally requires accuracy and recall above 90%, that is, the first preset value is 90%.

When the performance of the entity extraction model is between the first preset value and the second preset value, step S206 and step S213 are both performed, that is, extraction by regular expression and extraction by the entity extraction model are used at the same time.
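The stage selection described above can be sketched as a small routing function. The first preset value of 0.90 follows the text; the second preset value of 0.60 and the use of the lower of accuracy and recall as a single performance figure are assumptions for illustration.

```python
def choose_extractors(accuracy, recall, first_preset=0.90, second_preset=0.60):
    """Decide which extractors to run from the model's measured performance.

    second_preset is a made-up value; the source only requires that it be
    smaller than first_preset.
    """
    performance = min(accuracy, recall)
    if performance >= first_preset:
        return ["model"]           # mature stage: step S213 only
    if performance < second_preset:
        return ["regex"]           # cold start or low performance: step S206 only
    return ["regex", "model"]      # medium performance: both steps

mature = choose_extractors(0.95, 0.92)
cold = choose_extractors(0.30, 0.20)
medium = choose_extractors(0.80, 0.70)
```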

When the entity extraction model produces an output and the regular expression extraction does not, the confidence of the model's prediction is penalized; when the regular expression extraction produces an output and the entity extraction model does not, the regular entity weight is penalized, so as to optimize the performance of the entity extraction model.

Specifically, in the medium-performance stage of the model the two methods are combined, and there are four cases. Case one: both hit the same entity at the same time; the entity is retained and the entity weight is increased. Case two: the regular expression recognizes an entity and the model does not; a certain penalty is applied to the default regular entity weight, that is, it is multiplied by a penalty factor between 0 and 1. Case three: the model recognizes an entity and the regular expression does not; the confidence of the model prediction result is penalized. Case four: neither recognizes an entity, and the entity is regarded as absent.
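The four cases can be sketched as follows. The function name `reconcile`, the dictionary representation of the two result sets, and the `boost` and `penalty` values are assumptions made for illustration; the text only states that the weight is increased in case one and that the penalty factor lies between 0 and 1.

```python
def reconcile(regex_hits, model_hits, boost=1.2, penalty=0.5):
    """Merge regular-expression and model results.

    regex_hits: {entity: default regular entity weight}
    model_hits: {entity: model prediction confidence}
    Returns {entity: adjusted score}.
    """
    merged = {}
    for entity in regex_hits.keys() | model_hits.keys():
        in_regex = entity in regex_hits
        in_model = entity in model_hits
        if in_regex and in_model:
            # Case one: both hit the same entity; keep it and raise the weight.
            merged[entity] = min(1.0, regex_hits[entity] * boost)
        elif in_regex:
            # Case two: regular expression only; penalize the regular weight.
            merged[entity] = regex_hits[entity] * penalty
        else:
            # Case three: model only; penalize the model confidence.
            merged[entity] = model_hits[entity] * penalty
    # Case four: entities found by neither never enter the loop (absent).
    return merged


result = reconcile(
    {"twenty-seven": 0.8, "wangli": 0.8},
    {"twenty-seven": 0.9, "tomorrow": 0.6},
)
```

Capping the boosted weight at 1.0 is a design choice of this sketch, made so that scores stay comparable with confidences.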

The following description takes as an example the application of the entity extraction method to a vertical question-answering system, with reference to a specific embodiment. When a large number of labeled corpora exist, a vertical question-answering system can train an entity recognition model with a mature machine learning method and use it for prediction; when the number of entities is fixed and a database of a certain scale exists, recognition can be carried out by dictionary matching.

In vertical question answering, intelligent entity recognition is an important function and an important embodiment of the system's intelligence. A vertical question-answering platform must solve the problem of extracting structured data from unstructured text, and must also support rapid application and deployment. Entity recognition has important applications in data mining, search, recommendation, and automatic question-answering systems. In particular, in different scenarios the vertical question-answering system is required to give accurate and complete entity types and entity expressions. For example, when the user sends the robot the request "Please remind me that there is a teleconference at 3 pm tomorrow.", the robot needs to accurately extract a comprehensive entity list "<time: 3 pm tomorrow, event: there is a teleconference>". In academia, machine learning algorithms generally learn a model from a large supervised corpus and then use it for recognition; in practice, however, users often face a cold start in which only a few to a dozen or so pieces of labeled data are available, so a machine learning model cannot practically be used. For this reason, in the present embodiment, regular entity extraction is used in the cold start stage.

Specifically, on the vertical question-answering system platform, the custom entities of a third-party developer include enumerable vertical-domain entity word lists (for example, "i want to listen to [a singer] {beijing welcome you}") and non-enumerable expression entities such as "translate (i come from china)" and "calculate what (3 plus 4) equals". As shown in fig. 5, for character strings with a characteristic meaning that cannot be enumerated, for which corpus data is scarce and collection and labeling costs are high, entity recognition proceeds layer by layer:

the first layer recognizes on the entity dictionary and the entity model;

the second layer utilizes the characteristics of the previous layer, rewrites sentences into entity labels, and combines the matching of the regular expression on the rewritten sentences, so as to increase the recall of the sparse regular entity.
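The two layers can be illustrated with a short sketch. It assumes a hypothetical first-layer dictionary `ENTITY_DICT`, a bracketed label notation such as `[date]`, and an example sentence and pattern invented for this illustration; none of these appear in the original text.

```python
import re

# Hypothetical first-layer entity dictionary: surface form -> entity label.
ENTITY_DICT = {"tomorrow": "date", "3 pm": "time"}


def rewrite_with_labels(sentence, entity_dict):
    """First layer: replace dictionary hits with bracketed entity labels."""
    # Longest surface forms first, so overlapping entries resolve predictably.
    for surface in sorted(entity_dict, key=len, reverse=True):
        sentence = sentence.replace(surface, f"[{entity_dict[surface]}]")
    return sentence


# Second layer: match a regular expression on the rewritten sentence.
rewritten = rewrite_with_labels("remind me tomorrow at 3 pm", ENTITY_DICT)
match = re.search(r"\[date\] at \[time\]", rewritten)
```

Because the pattern is written over labels rather than literal words, one expression covers every date and time the first layer can recognize, which is how the second layer raises recall for sparse regular entities.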

Once a batch of entity corpora has been accumulated and its size reaches the thousands or more, an entity recognition model is trained through a CRF (conditional random field) or an LSTM-CRF (long short-term memory network with a conditional random field) to improve accuracy.

Specifically, as shown in fig. 6, the method includes the following steps:

S602: acquiring the regular expression input by the third party.

A regular expression, usually called a pattern, is a string of characters used to describe or match text that conforms to a syntactic rule. Specifically, based on the recognition needs of its service content, the third party submits a regular expression entity declaration to the entity extraction platform in advance. The regular expression entity declaration includes: the regular expression entity name, the English name, and the regular expression. The entity extraction platform supports the following syntax: 1) ordinary regular expressions customized by the third party; 2) regular expression entities of the entity extraction platform that the third party declares for use.

S604: verifying the regular expression input by the third party, and if the regular expression passes the verification, executing step S606.

S606: storing the regular expression defined by the third party.

The above steps constitute the process by which a third party defines a regular expression.
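Steps S602 to S606 can be sketched as below. The class `RegexEntityDeclaration`, the in-memory `REGISTRY`, the keying of stored declarations by scene, and the use of compilation as the only verification check are all assumptions of this sketch; the text does not say how verification or storage is implemented.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexEntityDeclaration:
    name: str          # regular expression entity name
    english_name: str  # English name
    pattern: str       # the regular expression itself


# Hypothetical store: scene identifier -> declarations for that scene.
REGISTRY = {}


def register(declaration, scene):
    """S604: verify the expression; S606: store it if verification passes."""
    try:
        re.compile(declaration.pattern)  # assumed check: the pattern compiles
    except re.error:
        return False
    REGISTRY.setdefault(scene, []).append(declaration)
    return True


ok = register(RegexEntityDeclaration("number", "number", r"\d+"), "demo")
bad = register(RegexEntityDeclaration("broken", "broken", "("), "demo")
```

Keying the store by scene lets a later lookup return only the expressions declared for the scene of the incoming request.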

The process of receiving input information from a user and performing regular extraction includes the following steps:

S608: acquiring input information and scene information to be extracted.

S610: preprocessing the input information to obtain the text features of the input information.

Preprocessing refers to processing the input information before entity extraction in order to obtain its text features. As shown in fig. 6, the input information is "wangli is twenty-seven", and the words obtained by preprocessing it include "wangli", "is", "twenty-seven", and "number".

S612: judging the performance of the entity extraction model.

The entity extraction model is used for carrying out entity extraction on the input information.

When the performance of the entity extraction model reaches the first preset value, step S626 is executed; that is, only the entity extraction model is used for extraction, and the regular expression is not used for entity extraction. When the performance of the entity extraction model is smaller than the second preset value (in the cold start stage or low-performance stage of the model), step S614 is executed; that is, only the regular expression is used for extraction, and the entity extraction model is not used. When the performance of the entity extraction model is between the first preset value and the second preset value, step S614 and step S626 are performed simultaneously, that is, extraction by the regular expression and extraction by the entity extraction model are used at the same time.

S614: determining a corresponding predefined entity regular expression according to the scene information.

Specifically, the entity regular expression that the corresponding third party declared in advance is determined according to the scene information.

S616: identifying each entity of the input information according to the text features to obtain an entity list.

Specifically, based on the semantic, grammatical, statistical and other text features obtained after preprocessing, each entity of the input information is identified through a pre-trained entity recognition model to obtain the entity list. As shown in fig. 6, entity recognition accuracy is improved by accumulating a batch of entity corpora and training an entity recognition model through a CRF (conditional random field) or an LSTM-CRF (long short-term memory network with a conditional random field).

S618: combining the entities in the entity list according to the corresponding entity regular expressions to obtain sentence combinations, and calculating the confidence of each sentence combination.

Specifically, the user sentence is encoded into a group of sentence templates, the sentence templates are sorted by confidence, and, in consideration of matching efficiency, at most 3 to 5 sentence combinations whose confidence exceeds a certain threshold are selected.
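A minimal sketch of generating and selecting sentence combinations follows. The text does not specify how confidence is calculated, so the scoring used here (multiply an entity's confidence when its label is used, and one minus that confidence when the literal word is kept) is purely an assumption, as are the function name, the bracketed label notation, and the example values.

```python
from itertools import product


def sentence_combinations(tokens, entities, threshold=0.1, top_k=5):
    """Build sentence templates and keep the most confident ones.

    tokens: list of words of the user sentence
    entities: {token index: (entity label, recognition confidence)}
    Returns at most top_k (template, confidence) pairs above the threshold.
    """
    indices = sorted(entities)
    combos = []
    # Each recognized entity is either replaced by its label or left literal.
    for choice in product([False, True], repeat=len(indices)):
        words = list(tokens)
        confidence = 1.0
        for use_label, i in zip(choice, indices):
            label, conf = entities[i]
            if use_label:
                words[i] = f"[{label}]"
                confidence *= conf          # assumed scoring rule
            else:
                confidence *= 1.0 - conf    # assumed scoring rule
        combos.append((" ".join(words), confidence))
    combos.sort(key=lambda c: c[1], reverse=True)
    return [c for c in combos if c[1] > threshold][:top_k]


selected = sentence_combinations(
    ["wangli", "is", "twenty-seven"],
    {0: ("person", 0.9), 2: ("number", 0.8)},
)
```

Limiting the result to `top_k` templates bounds the number of regular matches per request, which is the efficiency concern the text mentions.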

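Specifically, step S618 includes the following steps S1 to S2:

S1: determining the entities in the entity list that correspond to the predefined entity regular expression.

S2: combining the entities determined in the entity list according to the corresponding entity regular expression to obtain sentence combinations, and calculating the confidence of each sentence combination.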

S620: performing regular matching based on the regular expression on the sentence combinations whose confidence is greater than the preset threshold value, to obtain a regular entity extraction result.

Specifically, the sentence combinations are sorted by confidence, and, in consideration of matching efficiency, at most 3 to 5 sentence combinations whose confidence exceeds a certain threshold are selected. Regular matching is performed on the symbolic expression list of the sentence to identify the regular entities that contain symbols; each regular entity is then mapped back to its original position through position mapping to find the original character expression, thereby extracting an entity extraction result that conforms to the semantics. As shown in fig. 6, in this step the entities extracted from the input information through nested regular expressions include "wangli" and "twenty-seven".
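The position mapping can be sketched as follows. The functions `symbolize` and `original_span`, the segment-table representation, and the character spans in the example are assumptions of this sketch; the text only states that a match on the symbolic sentence is mapped back to the original position.

```python
import re


def symbolize(text, spans):
    """Replace entity spans with symbols and record a position mapping.

    spans: sorted, non-overlapping (start, end, label) character spans.
    Returns the symbolic sentence and a list of segments
    (sym_start, sym_end, src_start, src_end, is_symbol).
    """
    pieces, mapping, cursor, out_len = [], [], 0, 0
    for start, end, label in list(spans) + [(len(text), len(text), None)]:
        literal = text[cursor:start]
        if literal:  # text between entities is copied unchanged
            pieces.append(literal)
            mapping.append((out_len, out_len + len(literal), cursor, start, False))
            out_len += len(literal)
        if label is not None:
            symbol = f"[{label}]"
            pieces.append(symbol)
            mapping.append((out_len, out_len + len(symbol), start, end, True))
            out_len += len(symbol)
        cursor = end
    return "".join(pieces), mapping


def original_span(mapping, m_start, m_end):
    """Map a non-empty match [m_start, m_end) back to the source text."""
    start = end = None
    for sym_start, sym_end, src_start, src_end, is_symbol in mapping:
        if sym_end <= m_start or sym_start >= m_end:
            continue  # segment lies outside the match
        if is_symbol:  # a symbol maps to the whole original entity
            seg_start, seg_end = src_start, src_end
        else:          # literal text maps character by character
            seg_start = src_start + max(m_start, sym_start) - sym_start
            seg_end = src_start + min(m_end, sym_end) - sym_start
        start = seg_start if start is None else start
        end = seg_end
    return start, end


text = "wangli is twenty-seven"
symbolic, mapping = symbolize(text, [(0, 6, "person"), (10, 22, "number")])
m = re.search(r"is \[number\]", symbolic)
s, e = original_span(mapping, m.start(), m.end())
```

Here `text[s:e]` recovers the original wording behind the symbolic match, including the literal word that precedes the symbol.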

S622: obtaining feedback on the regular entity extraction result extracted by the regular expression, and obtaining annotation data based on the feedback, the input information and the regular entity extraction result.

The annotation data in this embodiment includes manual annotation data and automatic annotation data. Specifically, the input information together with the regular entity extraction results extracted by the regular expression can be used directly as annotation corpus, yielding a large amount of automatic annotation data. In addition, the third party can judge on the background management page whether each extraction result is right or wrong; the feedback on correct entity extraction results yields the manual annotation data.
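The assembly of annotation data from feedback can be sketched as below. The `Annotation` class, the three-valued feedback encoding (`True`, `False`, or `None` for no review), and the decision to discard results judged wrong are assumptions of this sketch and are not stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    text: str       # the input information
    entities: list  # the regular entity extraction result
    source: str     # "auto" or "manual"


def build_annotations(records):
    """records: iterable of (text, regex_entities, feedback).

    feedback is True (judged correct), False (judged wrong),
    or None (not reviewed by the third party).
    """
    data = []
    for text, entities, feedback in records:
        if feedback is None:
            # Unreviewed regular results are used directly as auto annotations.
            data.append(Annotation(text, entities, "auto"))
        elif feedback:
            # Results confirmed as correct become manual annotations.
            data.append(Annotation(text, entities, "manual"))
        # Assumption: results judged wrong are left out of the corpus.
    return data


corpus = build_annotations([
    ("wangli is twenty-seven", ["wangli", "twenty-seven"], True),
    ("wangli is twenty-eight", ["wangli", "twenty-eight"], None),
    ("wangli is here", ["here"], False),
])
```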

S624: training an entity extraction model according to the annotation data.

S626: inputting the text features into the entity extraction model to obtain the extraction model output.

Specifically, the process from cold start to model maturity is divided into four stages: cold start, low model performance, medium model performance, and high model performance. In the cold start and low-performance stages, only regular entity extraction is used, and the model extraction method is not used. At the model maturity stage, the regular expression is no longer used to extract entities; only the entity extraction model is applied to the online service, completely replacing the original regular expression.

When the model is in the medium-performance stage, the regular extraction method and the model extraction method are used for extraction simultaneously, and then step S628 is performed: comparing the extraction model output of the entity extraction model with the entity extraction result extracted by the regular expression, and performing corresponding processing according to the comparison result.

Specifically, when the entity extraction model has extraction model output and the regular extraction has no entity extraction result, the confidence of the prediction result of the entity extraction model is penalized; when the regular extraction has an entity extraction result and the entity extraction model has no extraction model output, the regular entity weight is penalized, so as to optimize the performance of the entity extraction model.

Specifically, there are four cases. Case one: both hit the same entity at the same time; the entity is retained and the entity weight is increased. Case two: the regular expression recognizes an entity and the model does not; a certain penalty is applied to the default regular entity weight, that is, it is multiplied by a penalty factor between 0 and 1. Case three: the model recognizes an entity and the regular expression does not; the confidence of the model prediction result is penalized. Case four: neither recognizes an entity, and the entity is regarded as absent.

The entity extraction method provides a third party of the vertical question-answering system with a custom regular entity recognition capability by means of existing regular expressions and system entities. The service spares the user the cost of spending a large amount of manpower to enumerate all entities or to label large amounts of data in the cold start stage. The user can conveniently recognize expression entities with specific grammars and sentence patterns as a supplement to traditional proper-name recognition, thereby improving the accuracy and the recall rate on the original basis. As an underlying natural language understanding component, the method may also be used in other natural language computing applications, such as information retrieval and recommendation systems.

In one embodiment, there is provided an entity extraction apparatus, as shown in fig. 7, including: an information acquisition module 702, a pre-processing module 704, a lookup module 706, an identification module 708, a combination module 710, and a matching module 712.

An information obtaining module 702, configured to obtain input information and scene information to be extracted.

The preprocessing module 704 is configured to preprocess the input information to obtain a text feature of the input information.

And the searching module 706 is configured to determine a corresponding predefined entity regular expression according to the scene information.

The identifying module 708 is configured to identify each entity of the input information according to the text features to obtain an entity list.

And the combining module 710 is configured to combine the entities in the entity list according to the corresponding entity regular expressions to obtain sentence combinations, and calculate confidence degrees of the sentence combinations.

Specifically, the combining module 710 is configured to determine an entity corresponding to a predefined entity regular expression in the entity list, combine each entity determined in the entity list according to the corresponding entity regular expression to obtain a sentence combination, and calculate a confidence of each sentence combination.

And the matching module 712 is configured to perform regular matching based on the regular expression of the sentence combination with the confidence coefficient greater than the preset threshold value, so as to obtain a regular entity extraction result.

After each entity of the input information is obtained by the entity extraction device according to the text characteristic identification, each entity in the entity list is combined according to a predefined entity regular expression corresponding to the scene information to obtain a sentence combination, the sentence combination with the confidence coefficient larger than a preset threshold value is selected for regular matching, and a regular entity extraction result is obtained by extraction. The entity extraction can be carried out by means of the predetermined regular expression, the regular entity identification capability is provided for the third party, the cost that the third party spends a large amount of manpower to enumerate all entities or label a large amount in the cold starting stage (the initial stage of system use) can be saved, and the expression entities with specific grammar and sentence patterns can be identified conveniently.

In another embodiment, as shown in fig. 8, the entity extracting apparatus further includes: an annotation data acquisition module 714 and a training module 716.

And the labeled data acquisition module 714 is used for acquiring feedback of the regular entity extraction result extracted by the regular expression and obtaining labeled data based on the feedback, the input information and the regular entity extraction result.

And a training module 716, configured to train the entity extraction model according to the annotation data.

Referring to fig. 8, the entity extracting apparatus further includes: a prediction module 718.

And the prediction module 718 is configured to, when the performance of the entity extraction model reaches a first preset value, input the text features into the entity extraction model to obtain an extraction model output.

The searching module 706 is configured to determine a corresponding predefined entity regular expression according to the scene information when the performance of the entity extraction model is smaller than a second preset value; wherein the first preset value is greater than the second preset value.

Referring to fig. 8, the entity extraction apparatus further includes an optimization module 720.

The prediction module 718 is further configured to, when the performance of the entity extraction model is between the first preset value and the second preset value, input the text feature into the entity extraction model to obtain an extraction model output.

The searching module 706 is further configured to determine a corresponding predefined entity regular expression according to the scene information when the performance of the entity extraction model is between the first preset value and the second preset value.

The optimization module 720 is used for penalizing the confidence of the prediction result of the entity extraction model when the entity extraction model has extraction model output and there is no entity extraction result extracted by the regular expression; and for penalizing the regular entity weight when there is an entity extraction result extracted by the regular expression and the entity extraction model has no extraction model output.

The entity extraction device provides a third party of the vertical question-answering system with a custom regular entity recognition capability by means of existing regular expressions and system entities. The service spares the user the cost of spending a large amount of manpower to enumerate all entities or to label large amounts of data in the cold start stage. The user can conveniently recognize expression entities with specific grammars and sentence patterns as a supplement to traditional proper-name recognition, thereby improving the accuracy and the recall rate on the original basis. As an underlying natural language understanding component, the device may also be used in other natural language computing applications, such as information retrieval and recommendation systems.

Based on the foregoing embodiments, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the steps of the entity extraction method implemented in the foregoing embodiments are implemented.

FIG. 9 is a diagram showing an internal configuration of a computer device according to an embodiment. The computer device may be an entity extraction platform. Referring to fig. 9, the computer device includes a processor, a non-volatile storage medium, an internal memory, and a network interface connected through a system bus. The non-volatile storage medium of the computer device may store an operating system and a computer program that, when executed, may cause the processor to perform an entity extraction method. The processor of the computer device provides calculation and control capability and supports the operation of the whole computer device. The internal memory may store a computer program that, when executed by the processor, causes the processor to perform an entity extraction method. The network interface of the computer device is used for network communication.

Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

Based on the above embodiments, the present invention further provides a storage medium having a computer program stored thereon, wherein the computer program is configured to implement the steps of the entity extraction method according to the above embodiments when executed by a processor.

It will be understood by those skilled in the art that all or part of the processes in the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer readable storage medium; in the embodiments of the present invention, the program may be stored in a storage medium of a computer system and executed by at least one processor in the computer system, so as to implement the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. An entity extraction method, comprising:

acquiring input information and scene information to be extracted;

performing regular matching based on the regular expression of the sentence combination with the confidence coefficient larger than the preset threshold value to obtain a regular entity extraction result;

obtaining feedback of a regular entity extraction result extracted by using a regular expression, and obtaining labeling data based on the feedback, the input information and the regular entity extraction result; the feedback is the evaluation of the third party or the user on the correctness or the mistake of the regular entity extraction result;

training an entity extraction model according to the labeled data;

when the performance of the entity extraction model is between a first preset value and a second preset value, inputting text features of new input information to be extracted into the entity extraction model to obtain extraction model output, and executing the step of determining a corresponding predefined entity regular expression according to the scene information so as to extract an entity by using the regular expression; wherein the first preset value is greater than the second preset value;

and when the entity extraction model has extraction model output and extraction using the regular expression yields no entity extraction result, punishing the confidence coefficient of the prediction result of the entity extraction model.

2. The method of claim 1, further comprising, after the step of preprocessing the input information to obtain text features of the input information:

when the performance of the entity extraction model reaches a first preset value, inputting the text features into the entity extraction model to obtain extraction model output;

and when the performance of the entity extraction model is smaller than a second preset value, executing the step of determining a corresponding predefined entity regular expression according to the scene information so as to extract the entity by using the regular expression.

3. The method of claim 1, wherein the regular entity weight is penalized when there is an entity extraction result extracted using the regular expression and the entity extraction model has no extraction model output.

4. The method according to claim 1, wherein the step of combining the entities in the entity list according to the corresponding entity regular expressions to obtain sentence combinations and calculating the confidence degrees of the sentence combinations comprises:

determining an entity corresponding to a predefined entity regular expression in the entity list;

and combining the entities determined in the entity list to obtain sentence combinations according to the corresponding entity regular expressions, and calculating the confidence coefficient of each sentence combination.

5. An entity extraction apparatus, comprising: the device comprises an information acquisition module, a preprocessing module, a searching module, an identification module, a combination module and a matching module;

the matching module is used for performing regular matching on the basis of the regular expression of the sentence combination with the confidence coefficient larger than the preset threshold value to obtain a regular entity extraction result;

the device further comprises: the system comprises a labeling data acquisition module, a training module, a prediction module and an optimization module;

the annotation data acquisition module is used for acquiring feedback of a regular entity extraction result extracted by a regular expression and acquiring annotation data based on the feedback, the input information and the regular entity extraction result; the feedback is the evaluation of the third party or the user on the correctness or the mistake of the regular entity extraction result;

the training module is used for training an entity extraction model according to the labeling data;

the prediction module is used for inputting the text characteristics of the new input information to be extracted into the entity extraction model to obtain the extraction model output when the performance of the entity extraction model is between a first preset value and a second preset value;

the searching module is further configured to determine a corresponding predefined regular expression of the entity according to the scene information when the performance of the entity extraction model is between the first preset value and the second preset value, so as to extract the entity by using the regular expression; wherein the first preset value is greater than the second preset value;

and the optimization module is used for punishing the confidence coefficient of the prediction result of the entity extraction model when the entity extraction model has the extraction model output and does not have the entity extraction result by using the regular expression.

6. The apparatus of claim 5,

the prediction module is further used for inputting the text features into the entity extraction model to obtain extraction model output when the performance of the entity extraction model reaches a first preset value;

and the searching module is further used for determining a corresponding predefined entity regular expression according to the scene information when the performance of the entity extraction model is smaller than a second preset value, so as to extract an entity by using the regular expression.

7. The apparatus of claim 5,

the optimization module is further used for punishing regular entity weight when the regular expression is used for extracting the entity extraction result and the entity extraction model does not have extraction model output.

8. The apparatus according to claim 5, wherein the combining module is further configured to determine an entity corresponding to a predefined entity regular expression in the entity list, combine each entity determined in the entity list according to the corresponding entity regular expression to obtain a sentence combination, and calculate a confidence of each sentence combination.

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the entity extraction method of any one of claims 1 to 4 are implemented by the processor when executing the program.

10. A storage medium having stored thereon a computer program, characterized in that the program, when being executed by a processor, carries out the steps of the entity extraction method of any one of claims 1 to 4.