CN108304372A

CN108304372A - Entity extraction method and apparatus, computer equipment and storage medium

Info

Publication number: CN108304372A
Application number: CN201710909581.4A
Authority: CN
Inventors: 崔建苓
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2017-09-29
Filing date: 2017-09-29
Publication date: 2018-07-20
Anticipated expiration: 2037-09-29
Also published as: CN108304372B

Abstract

The present invention provides an entity extraction method and apparatus, a computer device, and a storage medium. The entity extraction method includes: obtaining input information to be extracted and scene information; preprocessing the input information to obtain text features of the input information; determining the corresponding predefined entity regular expressions according to the scene information; recognizing each entity of the input information according to the text features to obtain an entity list; combining the entities in the entity list according to the corresponding entity regular expressions to obtain sentence combinations, and calculating the confidence of each sentence combination; and performing regular-expression matching based on the regular expressions of the sentence combinations whose confidence exceeds a preset threshold, to obtain a regex entity extraction result. Because entities can be extracted with regular expressions defined in advance, the method gives third parties a regex-based entity recognition capability, saves them the cost of having many people enumerate all entities or annotate large amounts of data during the cold-start phase, and makes it easy to recognize expression entities with a specific grammar and sentence pattern.

Description

Entity extraction method and apparatus, computer equipment and storage medium

Technical field

The present invention relates to the technical field of data processing, and in particular to an entity extraction method and apparatus, a computer device, and a storage medium.

Background art

Entity extraction, also called named entity recognition, is a fundamental problem in natural language processing. In natural language processing, entities mainly include entity names, such as place names, organization names, person names, numbers, and domain-specific proper nouns, as well as certain expressions, such as numerical expressions (formulas, currency values, fractions), time expressions, and string expressions. Applications of entity recognition include information extraction, information retrieval, machine translation, and question answering systems.

Classical named entity recognition algorithms include statistical methods such as maximum entropy, hidden Markov models, maximum entropy-hidden Markov models, conditional random fields, and deep learning combined with conditional random fields, as well as rule-based methods such as regular expressions, dictionary matching, and fuzzy dictionary matching.

Conventional entity recognition algorithms rely on large amounts of statistical data. In a specific application, such as the early stage of a vertical question answering system, little labeled data is available, and a large amount of data has to be annotated manually. Manual annotation consumes considerable time and labor.

Summary of the invention

In view of this, and to address the problem that manual annotation consumes considerable time and labor, it is necessary to provide an entity extraction method and apparatus, a computer device, and a storage medium.

To achieve the above objective, one embodiment adopts the following technical solution:

An entity extraction method, including:

obtaining input information to be extracted and scene information;

preprocessing the input information to obtain text features of the input information;

determining the corresponding predefined entity regular expressions according to the scene information;

recognizing each entity of the input information according to the text features to obtain an entity list;

combining the entities in the entity list according to the corresponding entity regular expressions to obtain sentence combinations, and calculating the confidence of each sentence combination; and

performing regular-expression matching based on the regular expressions of the sentence combinations whose confidence exceeds a preset threshold, to obtain a regex entity extraction result.

An entity extraction apparatus, including an information obtaining module, a preprocessing module, a lookup module, a recognition module, a combination module, and a matching module;

the information obtaining module is configured to obtain input information to be extracted and scene information;

the preprocessing module is configured to preprocess the input information to obtain text features of the input information;

the lookup module is configured to determine the corresponding predefined entity regular expressions according to the scene information;

the recognition module is configured to recognize each entity of the input information according to the text features to obtain an entity list;

the combination module is configured to combine the entities in the entity list according to the corresponding entity regular expressions to obtain sentence combinations, and to calculate the confidence of each sentence combination;

the matching module is configured to perform regular-expression matching based on the regular expressions of the sentence combinations whose confidence exceeds a preset threshold, to obtain a regex entity extraction result.

A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above entity extraction method when executing the program.

A storage medium storing a computer program which, when executed by a processor, implements the steps of the above entity extraction method.

In the above entity extraction method and apparatus, after each entity of the input information is recognized according to the text features, the entities in the entity list are combined according to the predefined entity regular expressions corresponding to the scene information to obtain sentence combinations, and the sentence combinations whose confidence exceeds a preset threshold are selected for regular-expression matching to obtain the regex entity extraction result. Because entities can be extracted with regular expressions defined in advance, third parties are given a regex-based entity recognition capability. This saves them the cost of having many people enumerate all entities or annotate large amounts of data during the cold-start phase (the early stage of system use), and makes it easy to recognize expression entities with a specific grammar and sentence pattern.

Description of the drawings

Fig. 1 is a schematic diagram of the application environment of the entity extraction method in one embodiment;

Fig. 2 is a flowchart of the entity extraction method in one embodiment;

Fig. 3 is a flowchart of the steps for training the entity extraction model in one embodiment;

Fig. 4 is a flowchart of the entity extraction method in another embodiment;

Fig. 5 is a schematic diagram of hierarchical entity extraction using regular expressions in one embodiment;

Fig. 6 is a flowchart of the entity extraction method in another embodiment;

Fig. 7 is a structural block diagram of the entity extraction apparatus in one embodiment;

Fig. 8 is a structural block diagram of the entity extraction apparatus in another embodiment;

Fig. 9 is a schematic diagram of the internal structure of the computer device in one embodiment.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and do not limit its scope of protection.

Fig. 1 is a diagram of the application environment of the entity extraction method in one embodiment. As shown in Fig. 1, the entity extraction platform 101 provides a semantic understanding function and implements an entity extraction method. The user terminal 105 is communicatively connected to the third-party server 103 and sends the user's input information to the third-party server. The third-party server provides services such as information retrieval, machine translation, and question answering. The third-party server 103 accesses the entity extraction platform and sends the input information to the entity extraction platform 101, which performs semantic understanding, extracts the entities, and returns them to the third-party server 103. The third-party server 103 then provides the corresponding service, such as translation or retrieval, according to the extracted entities.

Fig. 2 is a flowchart of the entity extraction method in one embodiment. This embodiment is described mainly with the method applied to the entity extraction platform 101 in Fig. 1. Referring to Fig. 1, the method specifically includes the following steps:

S202: Obtain input information to be extracted and scene information.

Here, the input information is the to-be-processed information that the user enters on the user terminal and that is obtained through the third-party server. The input information includes text, picture, or voice information. For example, it may be an everyday conversation awaiting a reply, an instruction to be executed, a sentence to be translated, or a passage to be searched.

The scene information is information about the application scenario of the input information. It usually relates to the service provided by the third party and includes an identification tag of the service provided by the third-party server. For example, if the third-party server provides a translation function, the scene information is an identification tag indicating the translation scenario.

When the third-party server receives the input information to be extracted from the user terminal, it sends the input information, together with the scene information corresponding to the third-party server, to the entity extraction platform 101.

S204: Preprocess the input information to obtain the text features of the input information.

Here, preprocessing refers to the processing applied to the input information before entity extraction in order to obtain the text features. Preprocessing specifically includes: converting uppercase to lowercase, converting half-width characters to full-width, converting Traditional Chinese to Simplified Chinese, removing leading and trailing stop words and emoticon words, splitting long sentences into short sentences, text data normalization, semantic feature extraction, grammatical feature extraction, and statistical feature extraction.

Semantic feature extraction includes N-gram model analysis, intent analysis, topic analysis, and sentiment analysis. Grammatical feature extraction includes word segmentation, part-of-speech tagging, shallow parsing, sentence pattern analysis, and dependency parsing. Statistical feature extraction includes word weight calculation, word frequency calculation, and dictionary-based feature extraction.

Normalization, that is, data standardization, is a basic task in data mining. Different evaluation indicators often have different dimensions and units, which affects the results of data analysis. To eliminate the effect of dimension between indicators, the data must be standardized so that the indicators become comparable. After standardization, all indicators are on the same order of magnitude and are suitable for comprehensive comparative evaluation. Common normalization methods include min-max normalization and vector-norm (modulus) normalization.
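
The two normalization methods named above can be sketched as follows. This is a minimal illustration, not the platform's implementation; the function names and the handling of constant or all-zero input are assumptions made here.

```python
import math


def min_max_normalize(values):
    """Scale values linearly into [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def vector_norm_normalize(values):
    """Divide each component by the Euclidean norm (modulus) of the vector."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0:
        # Assumption: leave an all-zero vector unchanged.
        return list(values)
    return [v / norm for v in values]


print(min_max_normalize([2, 4, 6]))
print(vector_norm_normalize([3, 4]))
```

After either transformation, features that were measured on different scales fall in a comparable range.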

Splitting long sentences into short sentences means that when the number of words in a sentence of the input text exceeds a set value, the long sentence is divided into several short sentences.
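
A minimal sketch of this splitting step is shown below. The text does not say where a long sentence is cut, so splitting after punctuation marks is an assumption, as are the name `split_long_sentence`, the default `max_len`, and the use of character count as a stand-in for word count.

```python
import re

# Assumption: sentence-internal punctuation (ASCII and full-width) marks cut points.
_CUT_AFTER = re.compile(r"(?<=[,;.!?，；。！？])")


def split_long_sentence(text, max_len=20):
    """Return [text] if it is short enough, else the pieces between punctuation."""
    if len(text) <= max_len:
        return [text]
    pieces = _CUT_AFTER.split(text)
    return [p.strip() for p in pieces if p.strip()]


print(split_long_sentence("play a song, then set an alarm", max_len=10))
```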

If the input information is a picture, preprocessing should also include performing text recognition on the picture to extract the text it contains. If the input information is voice, preprocessing should also include performing speech recognition on the voice information and converting it into the corresponding text.

S206: Determine the corresponding predefined entity regular expressions according to the scene information.

Specifically, based on the recognition needs of its service content, the third party submits regular-expression entity declarations to the entity extraction platform 101 in advance. Each submitted declaration carries the scene information corresponding to the third party's service content and includes the regular-expression entity name, its English name, and the regular expression. After obtaining a regular-expression entity declaration, the entity extraction platform establishes the correspondence between the scene information and the regular expression. In the regular expression, the entity part to be extracted is enclosed in parentheses, and the first group is the entity part to be extracted. The syntax of the entity part is fully compatible with the syntax of the regular-expression engine.

When the platform receives the input information and scene information, it uses the pre-established correspondence between regular expressions and scene information to determine, from the received scene information, the entity regular expressions that the corresponding third party declared in advance.
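
The correspondence between scene information and declared regular expressions can be pictured as a simple lookup table. The sketch below is illustrative only: the class `RegexEntityDeclaration`, the functions `register` and `patterns_for_scene`, and the scene tag `"translation"` are hypothetical names, not part of the platform described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexEntityDeclaration:
    entity_name: str   # regular-expression entity name
    english_name: str  # English name of the entity
    pattern: str       # regular expression; first group marks the entity part


# Hypothetical in-memory registry: scene tag -> declarations for that scene.
_REGISTRY = {}


def register(scene, declaration):
    """Store a third party's declaration under its scene tag."""
    _REGISTRY.setdefault(scene, []).append(declaration)


def patterns_for_scene(scene):
    """Return the declarations registered for a scene (empty if none)."""
    return list(_REGISTRY.get(scene, []))


register("translation",
         RegexEntityDeclaration("text to translate", "translate_text", r"translate (.+)"))
print([d.english_name for d in patterns_for_scene("translation")])
```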

The syntax supported by the entity extraction platform includes:

1) Common regular expressions customized by the third party. The syntax supported by the regular-expression engine follows the ECMAScript syntax format, for example "I am number (\d+)" and "((1|2|3|one|two|two|three)(quarter-hour|quarter))". In one embodiment, the entity extraction platform provides a regular-expression editing page on which the third party can define its own common regular expressions. The list of common regular expressions customized by the third party is shown in Table 1. The user can edit, upload, download, and delete the regular expressions in the list.

Table 1: List of common regular expressions customized by the third party

2) Regular-expression entities of the entity extraction platform that the third party declares it will use. The platform provides 40 classes of entities, including person names, place names, numbers, and times, each with its own symbolic expression, which greatly extends the computer's ability to understand natural language. Examples are "I am (@ner_number)" and "((@ner_number)(quarter-hour|quarter))". The third party can declare that it uses some or all of these entities, so as to use the entity regular expressions provided by the platform directly. In one embodiment, the platform provides a declaration editing page on which the third party can select the platform entity regular expressions it needs. The list of platform regular expressions that the third party declares it uses is shown in Table 2. The user can upload, download, and delete the regular expressions in the list.

Table 2: Entities of the entity extraction platform that the third party declares it uses

A regular expression is usually called a pattern and is used to describe or match a set of strings that conform to some syntactic rule. For example, the three strings Handel, Händel, and Haendel can all be described by a single pattern. Most regular-expression formalisms provide the following constructs:

Alternation: the vertical bar "|" denotes a choice. For example, "gray|grey" matches gray or grey.

Quantification: a quantifier after a character limits how many times that character may occur. The most common quantifiers are "+", "?", and "*" (with no quantifier, the character occurs exactly once). The plus sign "+" means the preceding character must occur at least once (one or more times); for example, "goo+gle" matches google, gooogle, goooogle, and so on. The question mark "?" means the preceding character may occur at most once (zero or one time); for example, "colou?r" matches color or colour. The asterisk "*" means the preceding character may be absent or may occur one or more times (zero or more times); for example, "0*42" matches 42, 042, 0042, 00042, and so on.

Grouping: parentheses define the scope and precedence of operators. For example, "gr(a|e)y" is equivalent to "gray|grey", and "(grand)?father" matches father and grandfather.
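
The constructs above can be checked with a few lines of code. The sketch uses Python's `re` module as a stand-in; the platform is stated to follow ECMAScript syntax, which behaves the same way for these simple patterns. The helper `full_matches` and the pattern `H(a|ä|ae)ndel` for the three Handel spellings are illustrative assumptions.

```python
import re


def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


# Alternation
print(full_matches(r"gray|grey", ["gray", "grey", "gry"]))
# Quantifiers: +, ?, *
print(full_matches(r"goo+gle", ["gogle", "google", "gooogle"]))
print(full_matches(r"colou?r", ["color", "colour", "colouur"]))
print(full_matches(r"0*42", ["42", "042", "0042", "402"]))
# Grouping
print(full_matches(r"gr(a|e)y", ["gray", "grey", "groy"]))
print(full_matches(r"(grand)?father", ["father", "grandfather", "grandgrandfather"]))
# One assumed pattern that covers the three spellings mentioned above
print(full_matches(r"H(a|ä|ae)ndel", ["Handel", "Händel", "Haendel", "Hendel"]))
```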

The entity extraction platform also validates the third party's entity regular expressions when they are submitted. On the one hand, this prevents code-injection attacks; on the other hand, it helps the third party correct regular-expression syntax errors or expressions that raise exceptions. Handling regular-expression risk up front improves the robustness of the online system.
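
A submission-time check of this kind might look like the sketch below. It is only an illustration under stated assumptions: the function `validate_pattern`, the length limit, and the requirement of at least one group are choices made here, and Python's `re` compiler stands in for the ECMAScript-style engine the platform is said to use.

```python
import re


def validate_pattern(pattern, max_length=200):
    """Return (ok, message) for a submitted regular expression."""
    if len(pattern) > max_length:
        # Assumed guard against oversized or abusive input.
        return False, "pattern too long"
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # Syntax errors are reported back so the third party can fix them.
        return False, f"invalid syntax: {exc}"
    if compiled.groups < 1:
        # The entity part must be enclosed in parentheses (first group).
        return False, "no group marks the entity part"
    return True, "ok"


print(validate_pattern(r"I am number (\d+)"))
print(validate_pattern(r"I am number (\d+"))
```

Rejecting a bad expression at submission time means it never reaches the matching step that serves live requests.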

S208: Recognize each entity of the input information according to the text features to obtain an entity list.

Specifically, based on the semantic, grammatical, and statistical text features obtained by preprocessing, a pre-trained entity recognition model recognizes each entity in the input information, such as person names, place names, numbers, email addresses, and phone numbers, to obtain the entity list. At this point the input information is divided into segments, each of which represents an entity together with its confidence. In this embodiment, the entity recognition model is a conditional random field model.

Taking the input information "I want to listen to A singer's songs" as an example, the entity recognition model recognizes three entities: 1) <text=I want to listen, label=action, weight=0.83>; 2) <text=A singer, label=singer, weight=0.90>; 3) <text=A singer's songs, label=album, weight=0.70>. Here, label denotes the entity name symbol and weight denotes the confidence of the entity.

S210: According to the corresponding entity regular expressions, combine the entities in the recognized entity list to obtain sentence combinations, and calculate the confidence of each sentence combination.

Specifically, the user's sentence is encoded as a set of sentence templates, which are ranked by confidence. For matching efficiency, at most 3 to 5 sentence combinations whose confidence exceeds a certain threshold are taken. The problem can be stated abstractly as follows: for a given input q, the recognized entity list is nes = {n_1, n_2, ...}, where n_i = label<s_i, e_i>, label is the entity name symbol, s is the start position, and e is the end position. For example, for q = "I want to listen to A singer's songs", nes = {action<0,2>, singer<3,5>, album<3,7>} (the positions are character offsets in the original-language sentence), and the expression patterns can be abstracted as "@ner_action@ner_singer's songs" (confidence = 1), "@ner_action@ner_album" (confidence = 0.8), and so on. The problem decomposes into two subproblems: 1. find all expression patterns of q; 2. rank them by confidence.

Specifically, step S210 includes the following steps S1 to S2:

S1: Determine the entities in the entity list that correspond to the predefined entity regular expressions.

S2: According to the corresponding entity regular expressions, combine the determined entities in the entity list to obtain sentence combinations, and calculate the confidence of each sentence combination.

First, for subproblem 1: every entity in the entity list and every ordinary word-segmentation unit can be regarded as a node of the same kind, and a sentence expression is a permutation and combination of nodes, so all permutations and combinations of the sentence can be obtained. To further improve efficiency, the combination paths need to be pruned to reduce the search space. The third party's regex entity definitions are used for this: the entity list keeps only the entities that appear in the regex entity definitions and discards entities outside the defined range, which reflects the specifics of the third party's business. In addition, one feature of the system design is that the end user's request carries scene information in addition to the user's sentence, which further ensures the efficiency of the system.
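
Subproblem 1, with its pruning step, can be sketched as follows. This is an illustrative reading, not the patented algorithm: the names `Entity`, `prune`, and `expression_patterns` are hypothetical, end positions are treated as inclusive, and the Chinese sentence used as `q` is an assumed reconstruction chosen so that the spans action<0,2>, singer<3,5>, album<3,7> line up with characters.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Entity:
    label: str
    start: int  # inclusive character offset
    end: int    # inclusive character offset


def prune(entities, declared_labels):
    """Keep only entities whose label appears in the third party's declarations."""
    return [e for e in entities if e.label in declared_labels]


def _non_overlapping(subset):
    ordered = sorted(subset, key=lambda e: e.start)
    return all(a.end < b.start for a, b in zip(ordered, ordered[1:]))


def _to_pattern(q, subset):
    """Replace each chosen entity span in q with its @ner_<label> symbol."""
    out, pos = [], 0
    for e in sorted(subset, key=lambda e: e.start):
        out.append(q[pos:e.start])
        out.append("@ner_" + e.label)
        pos = e.end + 1
    out.append(q[pos:])
    return "".join(out)


def expression_patterns(q, entities, declared_labels):
    """Enumerate symbolic patterns from every non-overlapping entity subset."""
    kept = prune(entities, declared_labels)
    patterns = set()
    for r in range(1, len(kept) + 1):
        for subset in combinations(kept, r):
            if _non_overlapping(subset):
                patterns.add(_to_pattern(q, subset))
    return patterns


q = "我要听A歌手的歌"  # assumed original of "I want to listen to A singer's songs"
nes = [Entity("action", 0, 2), Entity("singer", 3, 5), Entity("album", 3, 7)]
print(sorted(expression_patterns(q, nes, {"action", "singer", "album"})))
```

Passing a smaller `declared_labels` set shrinks the result, which is the effect of pruning by the third party's declarations.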

For subproblem 2: score the confidence of each combined sentence. Scoring is a regression problem. One could manually annotate a corpus, labeling every entity in every sentence, but labeling all entities is very difficult. In one embodiment, big-data mining is performed first: the entity names in all sentences of the corpus are replaced by symbols, and if a sentence has several sentence-pattern templates, several paths are output, so that the whole corpus is converted into a corpus of symbolic expression strings corresponding to the sentence templates. Language models are then trained with RNNLM and N-gram, and the probability score P(W) of a sentence's symbolic expression string W is calculated. After a logarithmic transformation it is approximated by the score S(W), as in formula (1), which represents the probability score that a sentence is a normal expression, where W denotes the whole sentence, w_i is the i-th word in the sentence, N is the number of words in the sentence, and n is taken as 3. Because the probability score measures how well a probability distribution or probability model predicts a sample, a higher value means a more fluent sentence and a higher probability for the symbolic expression string, so it can be used here to estimate confidence.
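
Formula (1) itself is not reproduced in this text. One form consistent with the symbols described, assuming the standard n-gram factorization of the sentence probability, would be:

```latex
S(W) = \log P(W) \approx \sum_{i=1}^{N} \log P\left(w_i \mid w_{i-n+1}, \ldots, w_{i-1}\right), \qquad n = 3
```

This should be read as an assumed reconstruction rather than the formula as filed; in particular, whether the sum is further normalized by N is not stated here.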

Finally, the log sentence scores are calculated separately with the trained RNNLM and N-gram models, and the two are combined by weighted linear interpolation (as in formula (2); the weights are tuned on held-out data to find the optimal parameters), giving the final score of the symbolic expression string. The final scores are then sorted in descending order. Here, k denotes the k-th language model and α_k denotes the weight of that language model's score.
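
The scoring and ranking just described can be sketched as below. Formula (2) is not reproduced in this text, so the weighted sum over language models is an assumption based on the description; the toy probability table, the floor value for unseen n-grams, and all function names are likewise hypothetical.

```python
import math


def ngram_log_score(tokens, cond_prob, n=3, floor=1e-6):
    """Sum of log P(w_i | previous n-1 tokens) using a lookup table.

    cond_prob maps (context_tuple, token) -> probability; unseen entries
    fall back to `floor` (an assumed smoothing choice).
    """
    score = 0.0
    for i, tok in enumerate(tokens):
        context = tuple(tokens[max(0, i - n + 1):i])
        score += math.log(cond_prob.get((context, tok), floor))
    return score


def interpolate(scores, weights):
    """Weighted linear interpolation: sum over k of alpha_k * S_k(W)."""
    if len(scores) != len(weights):
        raise ValueError("one weight per language model is required")
    return sum(a * s for a, s in zip(weights, scores))


def rank(candidates):
    """Sort (pattern, final_score) pairs from highest to lowest score."""
    return sorted(candidates, key=lambda item: item[1], reverse=True)


# Two hypothetical model scores per pattern, combined with equal weights.
final = [
    ("@ner_action@ner_singer's songs", interpolate([-2.0, -4.0], [0.5, 0.5])),
    ("@ner_action@ner_album", interpolate([-5.0, -7.0], [0.5, 0.5])),
]
print(rank(final))
```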

S212: Based on the regular expressions of the sentence combinations whose confidence exceeds the preset threshold, perform regular-expression matching to obtain the regex entity extraction result.

Specifically, the sentence combinations are ranked by confidence and, for matching efficiency, at most 3 to 5 whose confidence exceeds a certain threshold are taken. Regular-expression matching is performed on the list of symbolic expressions of the sentence to identify the regex entities that contain symbols. The matches are then mapped back to their original positions through position mapping to recover the original character expression, thereby extracting an entity extraction result that conforms to the semantics.
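
The matching and position-mapping step can be illustrated with the sketch below. It assumes non-overlapping entities with inclusive end offsets, a pattern whose first group marks the entity part, and an assumed Chinese reconstruction of the example sentence; the names `Entity`, `rewrite_with_map`, and `extract` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str
    start: int  # inclusive character offset in the original sentence
    end: int    # inclusive character offset in the original sentence


def rewrite_with_map(q, entities):
    """Build the symbolic sentence plus, for every symbolic character,
    the (start, end_exclusive) span it came from in the original sentence."""
    chars, spans, pos = [], [], 0
    for e in sorted(entities, key=lambda e: e.start):
        for i in range(pos, e.start):          # copy plain text unchanged
            chars.append(q[i])
            spans.append((i, i + 1))
        for ch in "@ner_" + e.label:           # symbol maps to the whole entity
            chars.append(ch)
            spans.append((e.start, e.end + 1))
        pos = e.end + 1
    for i in range(pos, len(q)):
        chars.append(q[i])
        spans.append((i, i + 1))
    return "".join(chars), spans


def extract(q, entities, pattern):
    """Match on the symbolic sentence, then map group 1 back to q."""
    symbolic, spans = rewrite_with_map(q, entities)
    m = re.search(pattern, symbolic)
    if m is None or m.start(1) == m.end(1):
        return None
    start = spans[m.start(1)][0]
    end = spans[m.end(1) - 1][1]
    return q[start:end]


q = "我要听A歌手的歌"  # assumed original of "I want to listen to A singer's songs"
chosen = [Entity("action", 0, 2), Entity("singer", 3, 5)]
print(extract(q, chosen, r"@ner_action(@ner_singer的歌)"))
```

The regular expression only ever sees the symbolic sentence; the span table is what lets the result be reported in the user's original wording.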

In the above entity extraction method, after each entity of the input information is recognized according to the text features, the entities in the entity list are combined according to the predefined entity regular expressions corresponding to the scene information to obtain sentence combinations, and the sentence combinations whose confidence exceeds the preset threshold are selected for regular-expression matching to obtain the regex entity extraction result. Because entities can be extracted with regular expressions defined in advance, third parties are given a regex-based entity recognition capability. This saves them the cost of having many people enumerate all entities or annotate large amounts of data during the cold-start phase (the early stage of system use), and makes it easy to recognize expression entities with a specific grammar and sentence pattern.

In another embodiment, the data extracted by regular expressions is accumulated and subjected to long-term, small-scale manual annotation and large-scale automatic annotation, so that labeled data is obtained continuously and the entity extraction model is trained continuously. Specifically, as shown in Fig. 3, training the entity extraction model includes steps S302 to S304:

S302: Obtain feedback on the regex entity extraction results extracted with regular expressions, and obtain labeled data based on the feedback, the input information, and the regex entity extraction results.

The labeled data in this embodiment includes manually labeled data and automatically labeled data.

Manually labeled data includes the feedback on regex entity extraction results. Feedback is a third party's or user's evaluation of whether a regex entity extraction result is correct or incorrect. For example, the third party can judge extraction results as correct or incorrect on a back-end management page, which yields feedback on the regex entity extraction results. The user terminal can also provide a feedback interface so that the user can evaluate question-answering or translation results, and feedback is derived from the user's evaluation. For example, when the third party is a vertical question answering system, an entity extraction error steers the reply in the wrong direction; that is, a wrong reply in the dialogue indicates an entity extraction error. The feedback yields high-quality manually labeled data, and long-term manual annotation guides the extraction algorithm steadily toward the optimum.

Large-scale automatic annotation can directly use the input information and the regex entity extraction results extracted with regular expressions as the annotated corpus.
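
One way to turn an input sentence and its regex extraction results into training examples for a sequence model is per-character BIO tagging. The text does not specify a label format, so BIO is an assumption here, as are the function name `to_bio`, the inclusive end offsets, and the assumed Chinese reconstruction of the example sentence.

```python
def to_bio(text, spans):
    """Convert (start, end_inclusive, label) spans into per-character BIO tags."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + label
    return list(zip(text, tags))


# Assumed example: the singer entity extracted from the sentence used earlier.
for ch, tag in to_bio("我要听A歌手的歌", [(3, 5, "singer")]):
    print(ch, tag)
```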

S304: Train the entity extraction model on the labeled data.

The entity extraction model refers to the model parameters trained on the labeled data, which can be used for entity extraction. For an entity extraction model in a vertical question-answering domain, to prevent a model trained on the vertical domain from being used in the open domain, which would degrade extraction performance, an intent recognition classifier first determines the domain of the entity before domain entity recognition is performed. If the domain is not a vertical one, only the performance of the entity extraction model is considered.

In this embodiment, a large amount of automatically labeled data is obtained from the input information and the correct entity results extracted with regular expressions, and a small amount of manually labeled data is obtained from the feedback on extraction results. The training corpus collected in this way is used for training, yielding a high-performance entity extraction model that improves the accuracy and recall of entity recognition.

Specifically, the process from cold start to model maturity is divided into four stages: cold start, low model performance, medium model performance, and high model performance. In the cold-start and low-performance stages, only regex entity extraction is used and the model-based extraction method is not used. Once the model is mature, entities can be extracted without regular expressions: only the entity extraction model is used for the online service, completely replacing the original regular expressions.

Specifically, the flowchart of the entity extraction method of another embodiment is shown in Fig. 4. After step S204, the method further includes:

S205: Determine the performance of the entity extraction model.

The performance of the entity extraction model includes accuracy and recall.

When the performance of the entity extraction model reaches the first preset value, step S213 is executed: the text features are input into the entity extraction model to obtain the extraction model output. That is, when the model's performance reaches the first preset value, only the entity extraction model is used for extraction, and regular expressions are not used.

When the performance of the entity extraction model is below the second preset value, step S206 is executed: the corresponding predefined entity regular expressions are determined according to the scene information. The first preset value is greater than the second preset value. That is, when the model's performance is below the second preset value (in the cold-start stage or the low-performance stage), only regular expressions are used for extraction, and the entity extraction model is not used.

It can be understood that the first preset value and the second preset value each include a value for accuracy and a value for recall.

Training of the entity extraction model proceeds from the cold-start stage to the mature, high-performance stage. The high-performance standard generally requires accuracy and recall above 90%; that is, the first preset value is 90%.

When the performance of the entity extraction model is between the first preset value and the second preset value, steps S206 and S213 are executed simultaneously; that is, extraction uses both regular expressions and the entity extraction model.
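
The stage-dependent choice of extractor can be summarized in a few lines. The first preset value of 0.90 comes from the text; the second preset value of 0.60, the use of the lower of accuracy and recall as the performance figure, and the function name `choose_extractors` are assumptions for illustration.

```python
def choose_extractors(accuracy, recall, first=0.90, second=0.60):
    """Return which extractors to run for the current model performance."""
    if first <= second:
        raise ValueError("the first preset value must exceed the second")
    performance = min(accuracy, recall)  # assumed: both metrics must qualify
    if performance >= first:
        return {"model"}            # mature model: step S213 only
    if performance < second:
        return {"regex"}            # cold start / low performance: step S206 only
    return {"regex", "model"}       # in between: run both


print(choose_extractors(0.95, 0.92))
print(choose_extractors(0.30, 0.90))
print(choose_extractors(0.80, 0.70))
```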

When the entity extraction model produces an output but regex extraction yields no entity extraction result, the confidence of the model's prediction is penalized; when regex extraction yields a result but the entity extraction model produces no output, the regex entity weight is penalized. This optimizes the performance of the entity extraction model.

Specifically, in the stage where both are used, there are four cases. Case 1: both hit the same entity; the entity is kept and its weight is increased. Case 2: the regex recognizes an entity but the model does not; the default regex entity weight is penalized, that is, multiplied by a penalty factor between 0 and 1. Case 3: the regex does not recognize an entity but the model does; the confidence of the model's prediction is penalized. Case 4: neither recognizes an entity; there is no entity.
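
The four cases can be written as a small merge function over the two weights for one candidate entity. The boost factor, the penalty factor of 0.8, the cap at 1.0, and the name `merged_weight` are illustrative assumptions; the text only says that the weight is increased in case 1 and multiplied by a factor between 0 and 1 in case 2.

```python
def merged_weight(regex_weight, model_weight, boost=1.1, penalty=0.8):
    """Combine regex and model results for one candidate entity.

    Each argument is the weight from that extractor, or None if it found nothing.
    Returns the merged weight, or None when there is no entity.
    """
    if regex_weight is not None and model_weight is not None:
        # Case 1: both hit the same entity; keep it and raise its weight.
        return min(1.0, max(regex_weight, model_weight) * boost)
    if regex_weight is not None:
        # Case 2: regex only; penalize the default regex entity weight.
        return regex_weight * penalty
    if model_weight is not None:
        # Case 3: model only; penalize the model's prediction confidence.
        return model_weight * penalty
    # Case 4: neither recognized anything.
    return None


print(merged_weight(0.9, 0.7), merged_weight(1.0, None),
      merged_weight(None, 0.5), merged_weight(None, None))
```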

For being applied to vertical question answering system to entity extraction method with reference to specific embodiment, illustrate.It hangs down Straight question answering system may be used ripe machine learning model and carry out Entity recognition mould in the case where there is a large amount of mark language materials The training and prediction of type, fix physical quantities, there is certain database size, and the mode of dictionary matching may be used, and carry out Identification.

In vertical question and answer, intelligent entity identification is very important function, and the intelligent important body of vertical question and answer It is existing, in vertical answer platform, it is desirable that can either solve the problems, such as it is unstructured be extracted as structural data, and can quickly apply Deployment.There is important application in data mining, search, recommendation and automatically request-answering system.Specifically, under different scenes, It is required that vertical question answering system can provide accurate, complete entity type and entity expression.For example, user conveys to robot " 3 points of tomorrow afternoon please remind there are one videoconference." request when, robot needs accurate to extract comprehensive entity row Table, "<Time：3 points of tomorrow afternoon, event：There are one videoconferences>”.It is usually academicly to be utilized using machine learning algorithm A large amount of supervision language material learning models, and then be identified, however user often faces cold start-up in putting into practice, labeled data only has Several to more than ten, machine learning model there is no method practical.For this purpose, in the present embodiment, in cold-start phase, using canonical reality Body extracts.

Specifically, on a vertical question answering platform, the custom entities of third-party developers include enumerable vertical-domain entity vocabularies (for example, "I want to listen to {Beijing Welcomes You} by [singer A]") and non-enumerable expression entities, such as "translate (I am from China)" or "calculate what (3 plus 4) equals". As shown in Fig. 5, for character strings with non-enumerable characteristic meanings, where corpus data is scarce and the cost of collection and labeling is high, entities are recognized layer by layer:

In the first layer, recognition is performed with the entity dictionary and the entity model.

In the second layer, the features from the previous layer are used to rewrite the sentence with entity labels, and regular expression matching is performed on the rewritten sentence, which increases the recall of sparse regex entities.

Once a batch of corpus for such entities has been accumulated and its size reaches a thousand examples or more, an entity recognition model is trained with a CRF (conditional random field) or LSTM-CRF (deep learning combined with a conditional random field) to improve accuracy.
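
For illustration only, non-enumerable expression entities such as the translation and calculation examples above could be captured with ordinary regular expressions that use named groups. The following Python sketch is not part of the embodiment; the pattern names, the English phrasings and the function `extract_expression_entities` are hypothetical.

```python
import re

# Hypothetical patterns for two non-enumerable expression entities.
# The captured "value" group is the expression entity itself.
PATTERNS = {
    "translate_content": re.compile(r"^translate\s+(?P<value>.+)$", re.IGNORECASE),
    "math_expression": re.compile(r"^how much is\s+(?P<value>.+?)\??$", re.IGNORECASE),
}


def extract_expression_entities(utterance: str) -> dict:
    """Return {entity_name: captured_text} for every pattern that matches."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.match(utterance.strip())
        if match:
            found[name] = match.group("value")
    return found
```

Because the captured content is open-ended, no dictionary of values has to be enumerated in advance; only the surrounding sentence pattern is fixed.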

Specifically, as shown in fig. 6, including the following steps：

S602：Obtain the regular expression of third party's input.

A regular expression is commonly called a pattern and is used to describe or match character strings that follow a certain syntactic rule. Specifically, based on the recognition needs of its service content, the third party submits regular-expression entity declarations on the entity extraction platform in advance. A regular-expression entity declaration includes the regex entity name, the English name and the regular expression. The entity extraction platform supports the following syntax: 1) ordinary regular expressions customized by the third party; 2) regular-expression entities of the entity extraction platform that the third party declares it will use.

S604: Verify the regular expression input by the third party; if verification passes, execute step S606.

S606：The regular expression that storage third party defines.

The above steps are the process by which the third party defines regular expressions.
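
As a minimal sketch of steps S602 to S606, a declaration could be checked by confirming that its fields are filled in and that its pattern compiles, and then stored. The embodiment does not specify how verification is performed, so the check shown here, the class `RegexEntityDeclaration`, the function names and the choice to key the store by scene are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexEntityDeclaration:
    """One third-party regex entity declaration (hypothetical layout)."""
    entity_name: str    # regex entity name
    english_name: str   # English name of the entity
    pattern: str        # the regular expression itself


def validate_declaration(decl: RegexEntityDeclaration) -> bool:
    """Assumed check for S604: all fields present and the pattern compiles."""
    if not decl.entity_name or not decl.english_name or not decl.pattern:
        return False
    try:
        re.compile(decl.pattern)
    except re.error:
        return False
    return True


def store_if_valid(store: dict, scene: str, decl: RegexEntityDeclaration) -> bool:
    """S604/S606: verify the declaration and store it under its scene if it passes."""
    if not validate_declaration(decl):
        return False
    store.setdefault(scene, []).append(decl)
    return True
```

Keying the store by scene is one way to make the later lookup in step S614 a simple dictionary access.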

The following steps are the process of receiving the user's input information and performing regex extraction, including:

S608：Obtain input information to be extracted and scene information.

S610：Input information is pre-processed, the text feature of input information is obtained.

Preprocessing refers to the processing performed on the input information before entity extraction in order to obtain text features. In Fig. 6, taking the input information "Wang Li is number 27" as an example, the words obtained after preprocessing the input information include "Wang Li", "is", "27" and "number".

S612：The performance of entity extraction model is judged.

Entity extraction model is used to carry out entity extraction to input information.

When the performance of the entity extraction model reaches the first preset value, step S626 is executed; that is, only the entity extraction model is used for extraction, and regular expressions are not used. When the performance of the entity extraction model is lower than the second preset value (in the cold-start stage or the low-performance stage of the model), step S614 is executed; that is, only regular expressions are used for extraction, and the entity extraction model is not used. When the performance of the entity extraction model is between the first preset value and the second preset value, steps S614 and S626 are executed at the same time; that is, regular-expression extraction and the entity extraction model are used together.
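
The branching in step S612 can be sketched as a small routing function. The embodiment does not state which metric measures model performance or how values exactly equal to a preset value are handled, so the numeric score, the boundary handling and the names `Route` and `choose_route` below are assumptions.

```python
from enum import Enum


class Route(Enum):
    REGEX_ONLY = "regex_only"   # proceed to S614
    MODEL_ONLY = "model_only"   # proceed to S626
    BOTH = "both"               # run S614 and S626 together


def choose_route(performance: float, first_preset: float, second_preset: float) -> Route:
    """Pick the extraction path from the model's performance score."""
    if first_preset <= second_preset:
        # The first preset value is greater than the second preset value.
        raise ValueError("first_preset must be greater than second_preset")
    if performance >= first_preset:
        return Route.MODEL_ONLY
    if performance < second_preset:
        return Route.REGEX_ONLY
    return Route.BOTH
```

With, say, a first preset value of 0.9 and a second preset value of 0.6, a score of 0.75 would select both paths, which is the case handled later by the comparison in step S628.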

S614：Corresponding pre-defined entity regular expression is determined according to scene information.

Specifically, the entity regular expressions that the corresponding third party has declared in advance for use are determined according to the scene information.

S616: Perform recognition according to the text features to obtain each entity of the input information, yielding an entity list.

Specifically, based on the text features obtained after preprocessing, such as semantic, grammatical and statistical features, each entity of the input information is recognized by a pre-trained entity recognition model, yielding an entity list. As shown in Fig. 6, by accumulating a batch of entity corpus, the entity recognition model is trained with a CRF (conditional random field) or LSTM-CRF (deep learning combined with a conditional random field), which improves the accuracy of entity recognition.

S618: According to the corresponding entity regular expressions, combine the entities in the entity list to obtain sentence combinations, and calculate the confidence of each sentence combination.

Specifically, the user's utterance is encoded as a group of sentence templates, which are sorted by confidence. For the sake of matching efficiency, at most 3 to 5 sentence combinations whose confidence exceeds a certain threshold are taken.
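
The selection described above amounts to a threshold filter followed by a top-k cut. The sketch below assumes that each sentence combination already carries a confidence score; how that score is computed is not shown, and the function name `select_combinations` and the default limit of 5 are illustrative.

```python
from typing import List, Tuple


def select_combinations(
    scored: List[Tuple[str, float]],
    threshold: float,
    max_keep: int = 5,
) -> List[Tuple[str, float]]:
    """Keep at most max_keep combinations whose confidence exceeds threshold,
    ordered from highest to lowest confidence."""
    above = [item for item in scored if item[1] > threshold]
    above.sort(key=lambda item: item[1], reverse=True)
    return above[:max_keep]
```

Capping the number of candidates bounds the number of regex matching passes per utterance, which is the efficiency concern the paragraph mentions.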

Specifically, step S618 includes the following steps S1 to S2：

S620: Perform regex matching based on the regular expressions for the sentence combinations whose confidence exceeds the preset threshold, to obtain the regex entity extraction result.

Specifically, the combinations are sorted by confidence and, for the sake of matching efficiency, at most 3 to 5 sentence combinations whose confidence exceeds a certain threshold are taken. Regex matching is performed on the list of symbolic expressions of the sentence to recognize regex entities that contain symbols. Through position mapping, the matches are mapped back to their original positions to find the original character expression, thereby extracting entity extraction results that fit the semantics. As shown in Fig. 6, the regular expression is a nested regular expression, and the entities extracted from the input information include "Wang Li" and "number 27".
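
One possible way to realize the symbolic rewriting and position mapping is sketched below. It assumes that the first layer yields non-overlapping character spans with labels, that each span is replaced by a bracketed label such as `[number]`, and that the original text contains no such bracketed labels itself. The example sentence and the names `rewrite_with_labels` and `match_and_map_back` are hypothetical and are not taken from Fig. 6.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label) in the original text


def rewrite_with_labels(text: str, spans: List[Span]):
    """Replace each recognized span by "[label]" and record an offset map.

    Returns the rewritten sentence and a list `origin`, where origin[i] is the
    (start, end) range of the original text that rewritten character i came from.
    """
    pieces, origin, cursor = [], [], 0
    for start, end, label in sorted(spans):
        for i in range(cursor, start):      # copy untouched characters one to one
            pieces.append(text[i])
            origin.append((i, i + 1))
        tag = f"[{label}]"
        pieces.append(tag)
        origin.extend([(start, end)] * len(tag))  # every tag character maps to the span
        cursor = end
    for i in range(cursor, len(text)):
        pieces.append(text[i])
        origin.append((i, i + 1))
    return "".join(pieces), origin


def match_and_map_back(text: str, spans: List[Span], pattern: str) -> List[str]:
    """Match a nested pattern on the rewritten sentence and return original substrings."""
    rewritten, origin = rewrite_with_labels(text, spans)
    results = []
    for m in re.finditer(pattern, rewritten):
        if m.start() == m.end():
            continue  # ignore empty matches
        start = origin[m.start()][0]
        end = origin[m.end() - 1][1]
        results.append(text[start:end])
    return results
```

A pattern such as `\[number\]th` is nested in the sense that it refers to an entity label produced by the earlier layer rather than to literal digits, so it matches however the number was originally written.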

S622: Obtain feedback on the regex entity extraction result extracted with the regular expression, and obtain labeled data based on the feedback, the input information and the regex entity extraction result.

The labeled data in this embodiment includes manually labeled data and automatically labeled data. Specifically, large-scale automatic labeling can use the input information and the regex entity extraction result extracted with the regular expression directly as labeled corpus. In addition, the third party can judge on the back-end management page whether an extraction result is correct; this gives feedback on the regex entity extraction result, from which manually labeled data is obtained.
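
The two sources of labeled data can be sketched as follows. The embodiment does not say what happens to results that the third party judges to be wrong; discarding them, as done here, is an assumption, as are the names `LabeledExample` and `build_labeled_example` and the encoding of feedback as `None`, `True` or `False`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabeledExample:
    text: str             # the input information
    entities: List[str]   # the regex entity extraction result
    source: str           # "automatic" or "manual"


def build_labeled_example(
    text: str,
    regex_entities: List[str],
    feedback: Optional[bool],
) -> Optional[LabeledExample]:
    """feedback: None = not reviewed, True = judged correct, False = judged wrong."""
    if feedback is None:
        # Automatic labeling: use the regex result directly as labeled corpus.
        return LabeledExample(text, list(regex_entities), "automatic")
    if feedback:
        # Manual labeling: the third party confirmed the result.
        return LabeledExample(text, list(regex_entities), "manual")
    # Assumption: results judged wrong are not used as training examples.
    return None
```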

S624: Train the entity extraction model according to the labeled data.

S626: Input the text features into the entity extraction model to obtain the extraction model output.

Specifically, the process from cold start to model maturity is divided into four stages: cold start, low model performance, medium model performance and high model performance. In the cold-start and low-performance stages, only regex entity extraction is used, and the entity extraction model is not used. In the mature stage of the model, regular expressions need not be used for entity extraction at all; only the entity extraction model is applied to the online service, completely replacing the original regular expressions.

When the model is in the medium-performance stage, the regex extraction method and the model extraction method are used for extraction at the same time, and step S628 is then executed: compare the extraction model output of the entity extraction model with the entity extraction result extracted with the regular expression, and perform the corresponding processing according to the comparison result.

Specifically, when the entity extraction model produces an extraction model output but regex extraction produces no entity extraction result, the confidence of the prediction result of the entity extraction model is penalized; when regex extraction produces an entity extraction result but the entity extraction model produces no extraction model output, the regex entity weight is penalized, so as to optimize the performance of the entity extraction model.

Specifically, there are four cases. Case 1: both hit the same entity; the entity is retained and its weight is increased. Case 2: the regular expression recognizes an entity but the model does not; the default weight of the regex entity is penalized, that is, multiplied by a penalty factor between 0 and 1. Case 3: the regular expression recognizes nothing but the model recognizes an entity; the confidence of the model's prediction result is penalized. Case 4: neither recognizes anything, so there is no entity.
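
The four cases can be sketched for a single candidate entity as follows. The embodiment specifies only that the regex weight is multiplied by a penalty factor between 0 and 1. Applying the same multiplicative penalty to the model confidence, the way the weight is increased in Case 1 (a boost factor capped at 1.0) and the name `reconcile` are assumptions of this sketch.

```python
from typing import Optional


def reconcile(
    regex_weight: Optional[float],
    model_confidence: Optional[float],
    penalty: float = 0.8,
    boost: float = 1.2,
) -> Optional[float]:
    """Return the final score of one candidate entity, or None if there is no entity.

    regex_weight is None when the regular expression did not recognize the entity;
    model_confidence is None when the model did not recognize it.
    """
    if not 0.0 < penalty < 1.0:
        raise ValueError("penalty must lie strictly between 0 and 1")
    if regex_weight is not None and model_confidence is not None:
        # Case 1: both hit the same entity; keep it and increase its weight.
        return min(1.0, max(regex_weight, model_confidence) * boost)
    if regex_weight is not None:
        # Case 2: regex only; penalize the default regex entity weight.
        return regex_weight * penalty
    if model_confidence is not None:
        # Case 3: model only; penalize the model's confidence.
        return model_confidence * penalty
    # Case 4: neither recognized anything.
    return None
```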

The above entity extraction method uses existing regular expressions, combined with system entities, to provide third parties of the vertical question answering system with a custom regex entity recognition capability. The service spares users in the cold-start stage the cost of spending a large amount of manpower enumerating all entities or doing extensive labeling, and readily recognizes expression entities with specific grammar and sentence patterns. As a supplement to traditional named entity recognition, it can improve accuracy and recall on the existing basis. As a basic natural language understanding component, it can also be used in other natural language processing applications, such as information retrieval and recommendation systems.

In one embodiment, an entity extraction device is provided. As shown in Fig. 7, it includes: a data obtaining module 702, a preprocessing module 704, a searching module 706, an identification module 708, a composite module 710 and a matching module 712.

Data obtaining module 702, for obtaining input information to be extracted and scene information.

Preprocessing module 704 obtains the text feature of input information for being pre-processed to input information.

Searching module 706, for determining corresponding pre-defined entity regular expression according to scene information.

Identification module 708, for performing recognition according to the text features to obtain each entity of the input information, yielding an entity list.

Composite module 710, for combining the entities in the entity list according to the corresponding entity regular expressions to obtain sentence combinations, and calculating the confidence of each sentence combination.

Specifically, composite module 710 is used for determining the entities in the entity list that correspond to the pre-defined entity regular expressions, combining the determined entities in the entity list according to the corresponding entity regular expressions to obtain sentence combinations, and calculating the confidence of each sentence combination.

Matching module 712, for performing regex matching based on the regular expressions for the sentence combinations whose confidence exceeds the preset threshold, to obtain the regex entity extraction result.

After performing recognition according to the text features to obtain each entity of the input information, the above entity extraction device combines the entities in the entity list according to the pre-defined entity regular expressions corresponding to the scene information to obtain sentence combinations, selects the sentence combinations whose confidence exceeds the preset threshold for regex matching, and extracts the regex entity extraction result. Because entity extraction can be performed with regular expressions defined in advance, a regex entity recognition capability is provided to third parties. This spares third parties the cost of spending a large amount of manpower enumerating all entities or doing extensive labeling in the cold-start stage (the initial period of system use), and readily recognizes expression entities with specific grammar and sentence patterns.

In another embodiment, as shown in Fig. 8, the entity extraction device further includes: a labeled data acquisition module 714 and a training module 716.

Labeled data acquisition module 714, for obtaining feedback on the regex entity extraction result extracted with the regular expression, and obtaining labeled data based on the feedback, the input information and the regex entity extraction result.

Training module 716, for training the entity extraction model according to the labeled data.

Please continue to refer to Fig. 8, entity extraction device further includes：Prediction module 718.

Prediction module 718, for when the performance of entity extraction model reaches the first preset value, text feature to be input to Entity extraction model obtains extraction model output.

Searching module 706, for when the performance of entity extraction model is less than the second preset value, being determined according to scene information Corresponding pre-defined entity regular expression；Wherein the first preset value is more than the second preset value.

Please continue to refer to Fig. 8, entity extraction device further includes optimization module 720.

Prediction module 718 is additionally operable to when the performance of entity extraction model is between the first preset value and the second preset value, Text feature is input to entity extraction model and obtains extraction model output.

Searching module 706 is additionally operable to when the performance of entity extraction model is between the first preset value and the second preset value, Corresponding pre-defined entity regular expression is determined according to scene information.

Optimization module 720, for penalizing the confidence of the prediction result of the entity extraction model when the entity extraction model produces an extraction model output but extraction with the regular expression produces no entity extraction result, and for penalizing the regex entity weight when extraction with the regular expression produces an entity extraction result but the entity extraction model produces no extraction model output.

The above entity extraction device uses existing regular expressions, combined with system entities, to provide third parties of the vertical question answering system with a custom regex entity recognition capability. The service spares users in the cold-start stage the cost of spending a large amount of manpower enumerating all entities or doing extensive labeling, and readily recognizes expression entities with specific grammar and sentence patterns. As a supplement to traditional named entity recognition, it can improve accuracy and recall on the existing basis. As a basic natural language understanding component, it can also be used in other natural language processing applications, such as information retrieval and recommendation systems.

Based on the above embodiments, the present invention also provides a computer equipment, including a memory, a processor and a computer program stored on the memory and executable on the processor; when executing the program, the processor implements the steps of the entity extraction method of each of the above embodiments.

Fig. 9 is a schematic diagram of the internal structure of the computer equipment in one embodiment. The computer equipment may be an entity extraction platform. Referring to Fig. 9, the computer equipment includes a processor, a non-volatile storage medium, an internal memory and a network interface connected through a system bus. The non-volatile storage medium of the computer equipment can store an operating system and a computer program; when executed, the computer program can cause the processor to perform an entity extraction method. The processor of the computer equipment provides computing and control capability and supports the operation of the whole computer equipment. The internal memory can store a computer program which, when executed by the processor, can cause the processor to perform an entity extraction method. The network interface of the computer equipment is used for network communication.

Those skilled in the art will understand that the structure shown in Fig. 9 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer equipment to which the solution is applied; a specific computer equipment may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

Based on the above embodiments, the present invention also provides a storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the entity extraction method of each of the above embodiments.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium; in the embodiments of the present invention, the program can be stored in the storage medium of a computer system and executed by at least one processor in the computer system to implement processes including those of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.

The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims

1. An entity extraction method, characterized by comprising:

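Obtaining input information to be extracted and scene information;

Preprocessing the input information to obtain text features of the input information;

Determining a corresponding pre-defined entity regular expression according to the scene information;

Performing recognition according to the text features to obtain each entity of the input information, yielding an entity list;

According to the corresponding entity regular expression, combining the entities in the entity list to obtain sentence combinations, and calculating the confidence of each sentence combination;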

Performing regex matching based on the regular expressions for the sentence combinations whose confidence exceeds a preset threshold, to obtain a regex entity extraction result.

2. The method according to claim 1, characterized in that, after the step of performing regex matching based on the regular expressions for the sentence combinations whose confidence exceeds the preset threshold to obtain the regex entity extraction result, the method further comprises:

Obtaining feedback on the regex entity extraction result extracted with the regular expression, and obtaining labeled data based on the feedback, the input information and the regex entity extraction result;

Training an entity extraction model according to the labeled data.

3. The method according to claim 2, characterized in that, after the step of preprocessing the input information to obtain the text features of the input information, the method further comprises:

When the performance of the entity extraction model reaches a first preset value, inputting the text features into the entity extraction model to obtain an extraction model output;

When the performance of the entity extraction model is lower than a second preset value, executing the step of determining the corresponding pre-defined entity regular expression according to the scene information; wherein the first preset value is greater than the second preset value.

4. The method according to claim 3, characterized in that, when the performance of the entity extraction model is between the first preset value and the second preset value, the step of inputting the text features into the entity extraction model to obtain the extraction model output and the step of determining the corresponding pre-defined entity regular expression according to the scene information are both executed;

When the entity extraction model produces an extraction model output but extraction with the regular expression produces no entity extraction result, the confidence of the prediction result of the entity extraction model is penalized;

When extraction with the regular expression produces an entity extraction result but the entity extraction model produces no extraction model output, the regex entity weight is penalized.

5. The method according to claim 1, characterized in that the step of combining the entities in the entity list according to the corresponding entity regular expression to obtain sentence combinations and calculating the confidence of each sentence combination comprises:

Determining the entities in the entity list that correspond to the pre-defined entity regular expression;

According to the corresponding entity regular expression, combining the determined entities in the entity list to obtain sentence combinations, and calculating the confidence of each sentence combination.

6. An entity extraction device, characterized by comprising: a data obtaining module, a preprocessing module, a searching module, an identification module, a composite module and a matching module;

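The data obtaining module, for obtaining input information to be extracted and scene information;

The preprocessing module, for preprocessing the input information to obtain text features of the input information;

The searching module, for determining a corresponding pre-defined entity regular expression according to the scene information;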

The identification module, for performing recognition according to the text features to obtain each entity of the input information, yielding an entity list;

The composite module, for combining the entities in the entity list according to the corresponding entity regular expression to obtain sentence combinations, and calculating the confidence of each sentence combination;

The matching module, for performing regex matching based on the regular expressions for the sentence combinations whose confidence exceeds a preset threshold, to obtain a regex entity extraction result.

7. The device according to claim 6, characterized in that the device further comprises: a labeled data acquisition module and a training module;

The labeled data acquisition module, for obtaining feedback on the regex entity extraction result extracted with the regular expression, and obtaining labeled data based on the feedback, the input information and the regex entity extraction result;

The training module, for training an entity extraction model according to the labeled data.

8. device according to claim 7, which is characterized in that described device further includes：Prediction module；

The prediction module, for when the performance of the entity extraction model reaches the first preset value, by the text feature It is input to the entity extraction model and obtains extraction model output；

The searching module, for when the performance of the entity extraction model is less than the second preset value, being believed according to the scene Breath determines corresponding pre-defined entity regular expression；Wherein described first preset value is more than second preset value.

9. device according to claim 8, which is characterized in that described device further includes optimization module；

The prediction module is additionally operable to preset in first preset value and described second when the performance of the entity extraction model When between value, the text feature is input to the entity extraction model and obtains extraction model output；

The searching module is additionally operable to preset in first preset value and described second when the performance of the entity extraction model When between value, corresponding pre-defined entity regular expression is determined according to the scene information；

The optimization module, for penalizing the confidence of the prediction result of the entity extraction model when the entity extraction model produces an extraction model output but extraction with the regular expression produces no entity extraction result, and for penalizing the regex entity weight when extraction with the regular expression produces an entity extraction result but the entity extraction model produces no extraction model output.

10. The device according to claim 6, characterized in that the composite module is used for determining the entities in the entity list that correspond to the pre-defined entity regular expression, combining the determined entities in the entity list according to the corresponding entity regular expression to obtain sentence combinations, and calculating the confidence of each sentence combination.

11. A computer equipment, including a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the entity extraction method according to any one of claims 1 to 5.

12. a kind of storage medium, is stored thereon with computer program, which is characterized in that when the program is executed by processor, realize The step of entity extraction method described in any one of claim 1 to 5.