CN109740159A - Processing method and processing device for named entity recognition - Google Patents

Processing method and processing device for named entity recognition

Info

Publication number
CN109740159A
Authority
CN
China
Prior art keywords
entity
text
regular expression
template
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811644812.4A
Other languages
Chinese (zh)
Other versions
CN109740159B (en
Inventor
申化泽
刘丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Teddy Bear Mobile Technology Co ltd
Beijing Teddy Future Technology Co ltd
Original Assignee
Beijing Teddy Bear Mobile Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Teddy Bear Mobile Technology Co Ltd filed Critical Beijing Teddy Bear Mobile Technology Co Ltd
Priority to CN201811644812.4A priority Critical patent/CN109740159B/en
Publication of CN109740159A publication Critical patent/CN109740159A/en
Application granted granted Critical
Publication of CN109740159B publication Critical patent/CN109740159B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Character Input (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a processing method and processing device for named entity recognition. The method includes: training templates that can be used to obtain, from a character string of named entities, substrings that satisfy a preset condition; obtaining the named entities' own regular expressions and constructing a comprehensive regular expression with the templates; and recognizing named entities according to the comprehensive regular expression. The application addresses the technical problem of lacking rules usable for named entity recognition. By combining manually written regular expressions for the entities themselves with templates generated automatically by computer, the method produces comprehensive regular expressions for named entity recognition, which maintains recognition quality and efficiency while greatly reducing the company's labor and time costs. In addition, the method is suitable for mobile devices such as mobile phones.

Description

Processing method and processing device for named entity recognition
Technical field
This application relates to the field of natural language processing, and in particular to a processing method and device for named entity recognition.
Background
Named entity recognition (NER) refers to identifying entities with specific meaning in text, mainly including person names, place names, organization names, proper nouns, and any words with specific meaning that are of interest in a given business scenario. Rule-based named entity recognition is implemented mainly with regular expressions.
The inventors found that as business scenarios grow richer, the number of entity categories to be extracted also grows, as does the number of regular expressions that must be written and maintained by hand, so that writing and maintaining them consumes a large amount of labor and time.
No effective solution has yet been proposed for the problem in the related art of lacking rules usable for named entity recognition.
Summary of the invention
The main purpose of this application is to provide a processing method and device for named entity recognition, so as to solve the problem of lacking rules usable for named entity recognition.
To achieve the above purpose, according to one aspect of the application, a processing method for named entity recognition is provided, for generating rules for named entity recognition.
The processing method for named entity recognition according to the application includes: training templates that can be used to obtain, from a character string of named entities, substrings that satisfy a preset condition; obtaining the named entities' own regular expressions and constructing a comprehensive regular expression with the templates; and recognizing named entities according to the comprehensive regular expression.
Further, training the templates includes: determining a business scenario; collecting text data according to the business scenario; and defining the entities to be extracted, annotating the text data, and storing it in a standard data format.
Further, training the templates also includes filtering out complex texts in any one or more of the following ways: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are adjacent; filtering out texts in which some entity category has two or more entity values; and filtering out texts that contain no entity.
Further, training the templates includes: obtaining the annotated corpus of the text and treating each entity value annotated in the corpus as a slot; and replacing each slot with the corresponding entity category to generate new text.
Further, training the templates includes a step of taking the frame that remains after entity values are generalized into entity categories as a template. The step specifically includes: replacing the entity categories in the template with the corresponding regular expressions to generate a batch of regular expressions; testing on the training set whether the batch of regular expressions satisfies the preset condition; and, if it does, saving the templates corresponding to those regular expressions.
To achieve the above purpose, according to another aspect of the application, a processing device for named entity recognition is provided.
The processing device for named entity recognition according to the application includes: a training module, configured to train templates that can be used to obtain, from a character string of named entities, substrings that satisfy a preset condition; a construction module, configured to obtain the named entities' own regular expressions and construct a comprehensive regular expression with the templates; and a recognition module, configured to recognize named entities according to the comprehensive regular expression.
Further, the training module includes a collection and annotation module, which includes: a determining unit, configured to determine a business scenario; a collecting unit, configured to collect text data according to the business scenario; and an annotation processing unit, configured to define the entities to be extracted, annotate the text data, and store it in a standard data format.
Further, the training module includes a filtering module, configured to filter out complex texts in any one or more of the following ways: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are adjacent; filtering out texts in which some entity category has two or more entity values; and filtering out texts that contain no entity.
Further, the training module includes an entity replacement module, which includes: a processing unit, configured to obtain the annotated corpus of the text and treat each entity value annotated in the corpus as a slot; and a replacement unit, configured to replace each slot with the corresponding entity category to generate new text.
Further, the training module includes a template generation module, which includes: a generation unit, configured to replace the entity categories in the template with the corresponding regular expressions to generate a batch of regular expressions; a test unit, configured to test on the training set whether the batch of regular expressions satisfies the preset condition; and a storage unit, configured to save the templates corresponding to the regular expressions when the batch of regular expressions satisfies the preset condition on the training set.
In the embodiments of this application, templates are trained that can be used to obtain, from a character string of named entities, substrings that satisfy a preset condition; the named entities' own regular expressions are obtained and combined with the templates into a comprehensive regular expression; and named entities are recognized according to that comprehensive regular expression. This maintains the quality and efficiency of named entity recognition while reducing the company's labor and time costs, and thereby solves the technical problem of lacking rules usable for named entity recognition.
Brief description of the drawings
The drawings, which form a part of this application, are provided for further understanding of the application, so that its other features, objects, and advantages become more apparent. The drawings of the illustrative embodiments and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the processing method for named entity recognition according to the first embodiment of the application;
Fig. 2 is a schematic diagram of the processing method for named entity recognition according to the second embodiment of the application;
Fig. 3 is a schematic diagram of the processing method for named entity recognition according to the third embodiment of the application;
Fig. 4 is a schematic diagram of the processing method for named entity recognition according to the fourth embodiment of the application;
Fig. 5 is a schematic diagram of the processing method for named entity recognition according to the fifth embodiment of the application;
Fig. 6 is a schematic diagram of the processing device for named entity recognition according to the first embodiment of the application;
Fig. 7 is a schematic diagram of the processing device for named entity recognition according to the second embodiment of the application;
Fig. 8 is a schematic diagram of the processing device for named entity recognition according to the third embodiment of the application;
Fig. 9 is a schematic diagram of the processing device for named entity recognition according to the fourth embodiment of the application;
Fig. 10 is a schematic diagram of the processing device for named entity recognition according to the fifth embodiment of the application;
Fig. 11 is a schematic diagram of the implementation principle of the application;
Figs. 12 to 17 are schematic diagrams of processing results in the application.
Specific embodiment
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of this application are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
In the embodiments of this application, a named entity (entity for short) is a word with specific meaning in a text, mainly including person names, place names, organization names, proper nouns, and words with specific meaning that are of interest in any business scenario (such as restaurant names, hotel names, and movie titles).
In the embodiments of this application, named entity recognition refers to the task or technique of identifying named entities in text.
In the embodiments of this application, a regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace a matched substring, or to extract from a string the substrings that satisfy some condition.
Recognizing named entities with regular expressions offers high precision and recall, strong scalability, low memory usage, and fast recognition. However, practical business scenarios often require many regular expressions to be written and maintained, and a single regular expression may be long and complex, so writing and maintaining them entirely by hand is very time-consuming and laborious. Against this background, this application proposes a method for semi-automatically generating regular expressions: manually written regular expressions for the entities themselves are combined with templates generated automatically by computer to produce comprehensive regular expressions for named entity recognition, which maintains recognition quality and efficiency while greatly reducing the company's labor and time costs.
It should be noted that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
As shown in Fig. 1, the method includes the following steps S102 to S106:
Step S102: train templates that can be used to obtain, from a character string of named entities, substrings that satisfy a preset condition.
Training such templates means training templates that can carry regular expressions.
A template in this application is the frame that remains after entity values are generalized into entity categories. Templates serve as the basis for generating named entity recognition rules.
Step S104: obtain the named entities' own regular expressions and construct a comprehensive regular expression with the templates.
The comprehensive regular expression is formed from the obtained regular expressions of the named entities themselves and the templates obtained in the preceding step. The regular expressions of the named entities themselves can be written manually, whereas the templates can be generated directly by computer.
Step S106: recognize named entities according to the comprehensive regular expression.
The comprehensive regular expression so constructed can be used to recognize named entities.
From the above description, it can be seen that the application achieves the following technical effects:
In the embodiments of this application, templates are trained that can be used to obtain, from a character string of named entities, substrings that satisfy a preset condition; the named entities' own regular expressions are obtained and combined with the templates into a comprehensive regular expression; and named entities are recognized according to that comprehensive regular expression. This maintains the quality and efficiency of named entity recognition while reducing the company's labor and time costs, and thereby solves the technical problem of lacking rules usable for named entity recognition.
According to an embodiment of the application, and as a preferred option in this embodiment, as shown in Fig. 2, training the templates includes:
Step S202: determine a business scenario.
Once the business scenario is determined, it is clear which text data needs to be collected.
Step S204: collect text data according to the business scenario.
For the business scenario in question, real text data must first be collected.
Step S206: define the entities to be extracted, annotate the text data, and store it in a standard data format.
The entities to be extracted are defined and then annotated manually.
For example, suppose the business scenario is entity extraction from meeting-related text, and the entity categories to be extracted include the meeting topic, the meeting time, and the meeting place. Once the entity categories are defined, the collected meeting-related texts can be annotated one by one: each text is searched for the entities of interest, and any that are found are marked. Suppose one text reads "All employees, please go to the second-floor meeting room at 15:00 this afternoon to attend the company annual commendation meeting. Please be on time." It can then be annotated as follows:
Meeting time: "15:00 this afternoon",
Meeting place: "second-floor meeting room",
Meeting topic: "company annual commendation meeting".
After annotation is complete, the annotated data can be used to train the model.
It should be noted that the above step also includes storing the data in a standard data format. Specifically, during data preprocessing the annotated data may be in different raw formats, and these need to be converted into one standard data format for uniform processing later.
According to an embodiment of the application, and as a preferred option in this embodiment, as shown in Fig. 3, training the templates also includes filtering out complex texts in any one or more of the following ways: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are adjacent; filtering out texts in which some entity category has two or more entity values; and filtering out texts that contain no entity. Specifically, texts that are hard to handle are filtered out first so that processing can start with simpler texts. Once simple texts are handled well, complex texts can be attempted.
According to an embodiment of the application, and as a preferred option in this embodiment, as shown in Fig. 4, training the templates includes:
Step S402: obtain the annotated corpus of the text, and treat each entity value annotated in the corpus as a slot.
According to the annotated corpus, each annotated entity value is cut out of the original text, leaving a slot.
Step S404: replace each slot with the corresponding entity category to generate new text.
Filling each slot with the corresponding entity category produces the new text.
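Steps S402 and S404 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `replace_entities`, the English category names `TIME` and `TOPIC`, and the English rendering of the example sentence are assumptions made for the sketch.

```python
def replace_entities(text, annotations):
    """Replace each annotated entity value in `text` with its category name.

    `annotations` maps an entity category (hypothetical English names) to the
    entity value annotated in this text.
    """
    new_text = text
    # Replace longer values first so that a short value contained in a
    # longer one does not break the longer replacement.
    for category, value in sorted(annotations.items(), key=lambda kv: -len(kv[1])):
        new_text = new_text.replace(value, category)
    return new_text


text = ("All employees, please go to the second-floor meeting room at "
        "15:00 this afternoon to attend the company annual commendation meeting.")
annotations = {
    "TIME": "15:00 this afternoon",
    "LOCATION": "second-floor meeting room",
    "TOPIC": "company annual commendation meeting",
}
new_text = replace_entities(text, annotations)
print(new_text)
```

The generated text keeps the surrounding words and carries only category names in the slots, which is the raw material from which templates are later cut.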
According to an embodiment of the application, and as a preferred option in this embodiment, as shown in Fig. 5, training the templates includes a step of taking the frame that remains after entity values are generalized into entity categories as a template. The step specifically includes:
Step S502: replace the entity categories in the template with the corresponding regular expressions to generate a batch of regular expressions.
Step S504: test on the training set whether the batch of regular expressions satisfies the preset condition.
Step S506: if the batch of regular expressions satisfies the preset condition on the training set, save the templates corresponding to those regular expressions.
Specifically, taking the frame that remains after entity values are generalized into entity categories as a template may include: starting from the entity under consideration as the center, gradually extending toward both sides one word at a time, with each extension producing a new template. Each newly extended template is tested on the entire training set. If it meets the requirements, the template is saved and the extension stops; processing then moves to the next entity and continues generating templates in the same way.
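The expansion loop described above can be sketched as follows. The source does not define the preset condition, so this sketch assumes one: a candidate is accepted if it matches at least one training text and, wherever it matches, extracts exactly the annotated value. The entity regular expressions, function names, and training texts are likewise hypothetical.

```python
import re

# Hypothetical hand-written regular expressions for each entity category.
ENTITY_REGEX = {"TIME": r"\d{1,2}:\d{2}", "LOCATION": r"room \d+"}


def template_to_regex(template, target):
    """Step S502: replace entity categories in a template with their regexes."""
    parts = []
    for token in template:
        if token == target:
            parts.append("(?P<%s>%s)" % (target, ENTITY_REGEX[token]))
        elif token in ENTITY_REGEX:
            parts.append("(?:%s)" % ENTITY_REGEX[token])
        else:
            parts.append(re.escape(token))
    return " ".join(parts)


def meets_condition(regex, target, training_set):
    """Step S504 under the assumed condition: no wrong extraction, one hit or more."""
    hits = 0
    for text, gold in training_set:
        match = re.search(regex, text)
        if match:
            if gold.get(target) != match.group(target):
                return False
            hits += 1
    return hits >= 1


def generate_template(tokens, index, training_set):
    """Grow a window around tokens[index] one word at a time (step S506 on success)."""
    target = tokens[index]
    left = right = index
    while left > 0 or right < len(tokens) - 1:
        # Extend by one word, alternating sides while both are available.
        if left > 0 and (index - left <= right - index or right == len(tokens) - 1):
            left -= 1
        else:
            right += 1
        candidate = tokens[left:right + 1]
        if meets_condition(template_to_regex(candidate, target), target, training_set):
            return candidate  # save the template and stop extending
    return None


training_set = [
    ("Meet in room 12 at 15:00 today", {"LOCATION": "room 12", "TIME": "15:00"}),
    ("Call me at 16:30 tomorrow", {"TIME": "16:30"}),
    ("Odds stood at 10:45 yesterday", {}),  # "10:45" is not annotated as a TIME
]
tokens = ["Meet", "in", "LOCATION", "at", "TIME", "today"]
template = generate_template(tokens, tokens.index("TIME"), training_set)
print(template)
```

In this toy data the first candidate, "at TIME", is rejected because it would wrongly extract "10:45" from the third text, so the window grows by one more word before a template is saved.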
It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
According to an embodiment of the application, a device for implementing the above processing method for named entity recognition is also provided, for generating rules for named entity recognition. As shown in Fig. 6, the device includes: a training module 10, configured to train templates that can be used to obtain, from a character string of named entities, substrings that satisfy a preset condition; a construction module 20, configured to obtain the named entities' own regular expressions and construct a comprehensive regular expression with the templates; and a recognition module 30, configured to recognize named entities according to the comprehensive regular expression.
In the training module 10 of the embodiment, training templates that can be used to obtain substrings satisfying a preset condition from a character string of named entities means training templates that can carry regular expressions.
A template in this application is the frame that remains after entity values are generalized into entity categories. Templates serve as the basis for generating named entity recognition rules.
In the construction module 20 of the embodiment, the comprehensive regular expression is formed from the obtained regular expressions of the named entities themselves and the templates obtained by the training module. The regular expressions of the named entities themselves can be written manually, whereas the templates can be generated directly by computer.
In the recognition module 30 of the embodiment, the comprehensive regular expression so constructed can be used to recognize named entities.
According to an embodiment of the application, and as a preferred option in this embodiment, as shown in Fig. 7, the training module includes a collection and annotation module 101, which includes: a determining unit 1011, configured to determine a business scenario; a collecting unit 1012, configured to collect text data according to the business scenario; and an annotation processing unit 1013, configured to define the entities to be extracted, annotate the text data, and store it in a standard data format.
In the determining unit 1011 of the embodiment, once the business scenario is determined, it is clear which text data needs to be collected.
In the collecting unit 1012 of the embodiment, real text data must first be collected for the business scenario in question.
In the annotation processing unit 1013 of the embodiment, the entities to be extracted are defined and then annotated manually.
For example, suppose the business scenario is entity extraction from meeting-related text, and the entity categories to be extracted include the meeting topic, the meeting time, and the meeting place. Once the entity categories are defined, the collected meeting-related texts can be annotated one by one: each text is searched for the entities of interest, and any that are found are marked. Suppose one text reads "All employees, please go to the second-floor meeting room at 15:00 this afternoon to attend the company annual commendation meeting. Please be on time." It can then be annotated as follows:
Meeting time: "15:00 this afternoon",
Meeting place: "second-floor meeting room",
Meeting topic: "company annual commendation meeting".
After annotation is complete, the annotated data can be used to train the model.
It should be noted that the above processing also includes storing the data in a standard data format. Specifically, during data preprocessing the annotated data may be in different raw formats, and these need to be converted into one standard data format for uniform processing later.
According to an embodiment of the application, and as a preferred option in this embodiment, as shown in Fig. 8, the training module includes a filtering module 104, configured to filter out complex texts in any one or more of the following ways: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are adjacent; filtering out texts in which some entity category has two or more entity values; and filtering out texts that contain no entity.
It should be noted that implementations of the embodiments of this application are not limited to the above ways of filtering out complex texts.
According to an embodiment of the application, and as a preferred option in this embodiment, as shown in Fig. 9, the training module 10 includes an entity replacement module 102, which includes: a processing unit 1021, configured to obtain the annotated corpus of the text and treat each entity value annotated in the corpus as a slot; and a replacement unit 1022, configured to replace each slot with the corresponding entity category to generate new text.
In the processing unit 1021 of the embodiment, each annotated entity value is cut out of the original text according to the annotated corpus, leaving a slot.
In the replacement unit 1022 of the embodiment, filling each slot with the corresponding entity category produces the new text.
According to an embodiment of the application, and as a preferred option in this embodiment, as shown in Fig. 10, the training module 10 includes a template generation module 103, which includes: a generation unit 1031, configured to replace the entity categories in the template with the corresponding regular expressions to generate a batch of regular expressions; a test unit 1032, configured to test on the training set whether the batch of regular expressions satisfies the preset condition; and a storage unit 1033, configured to save the templates corresponding to the regular expressions when the batch of regular expressions satisfies the preset condition on the training set.
Specifically, in the generation unit 1031, the test unit 1032, and the storage unit 1033 of the embodiment, taking the frame that remains after entity values are generalized into entity categories as a template may include: starting from the entity under consideration as the center, gradually extending toward both sides one word at a time, with each extension producing a new template. Each newly extended template is tested on the entire training set. If it meets the requirements, the template is saved and the extension stops; processing then moves to the next entity and continues generating templates in the same way.
Fig. 11 is a schematic diagram of the implementation principle of the application.
(1) Collecting and annotating data
As shown in Fig. 11, for the business scenario of interest, real text data is collected first; next, the entities of interest to be extracted are defined; finally, the data is annotated manually. Specifically, suppose entities are to be extracted from meeting-related text, and the entity categories to be extracted include the meeting topic, the meeting time, and the meeting place. Once the entity categories are defined, the collected meeting-related texts can be annotated one by one: each text is searched manually for the entities of interest, and any that are found are marked.
Suppose one text reads "All employees, please go to the second-floor meeting room at 15:00 this afternoon to attend the company annual commendation meeting. Please be on time." The meeting time can then be annotated as "15:00 this afternoon", the meeting place as "second-floor meeting room", and the meeting topic as "company annual commendation meeting". After annotation is complete, the annotated data can be used to train the model.
(2) Data preprocessing
As shown in Fig. 11, the annotated data may be in different raw formats, and these need to be converted into one standard data format for uniform processing later. The standard data format defined in the embodiments of this application is shown in Fig. 12.
As shown in Fig. 13, if an entity category has multiple entity values, the values are separated by "@@@".
Based on the above definitions, the format can be summarized as follows:
The original text and the entity categories, and the different entity categories themselves, are separated by tab characters;
An entity category and its entity value are separated by "###";
Multiple entity values are separated by "@@@".
In addition, a corresponding English name must be defined for the entity categories of each domain, and the entity categories of each domain are defined in an enum class. Entity categories in the standard data format are written in English.
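A record in this format can be written and read back as sketched below. Figs. 12 and 13 are not reproduced here, so the exact layout is inferred from the three separator rules above; the enum `MeetingEntity`, its member names other than `LOCATION`, and the sample text are assumptions.

```python
from enum import Enum


class MeetingEntity(Enum):
    """Entity categories of a hypothetical "meeting" domain, named in English."""
    TOPIC = "meeting topic"
    TIME = "meeting time"
    LOCATION = "meeting place"


FIELD_SEP = "\t"    # between the original text and each entity category
VALUE_SEP = "###"   # between an entity category and its value(s)
MULTI_SEP = "@@@"   # between multiple values of one category


def format_record(text, entities):
    """Serialize a text and its {category: [values]} annotations to one line."""
    fields = [text]
    for category, values in entities.items():
        fields.append(category.name + VALUE_SEP + MULTI_SEP.join(values))
    return FIELD_SEP.join(fields)


def parse_record(line):
    """Parse one line back into (text, {category: [values]})."""
    text, *fields = line.rstrip("\n").split(FIELD_SEP)
    entities = {}
    for field in fields:
        name, _, values = field.partition(VALUE_SEP)
        entities[MeetingEntity[name]] = values.split(MULTI_SEP)
    return text, entities


annotations = {
    MeetingEntity.LOCATION: ["room 12"],
    MeetingEntity.TIME: ["15:00", "16:30"],
}
record = format_record("Meet in room 12 at 15:00 or 16:30", annotations)
print(repr(record))
print(parse_record(record))
```

Looking categories up through the enum means a misspelled category name in a data file raises a `KeyError` during parsing instead of passing silently into training.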
(3) Filtering out complex text
As shown in Figure 11, to simplify the problem, texts that are hard to handle are filtered out first, and the simpler texts are processed first. The following kinds of text are mainly filtered out:
Text that starts with an entity;
Text that ends with an entity;
Text in which two entities are immediately adjacent;
Text in which some entity category contains more than two entity values;
Text that contains no entity.
Taking the meeting recognition above as an example, of 4000 original annotated texts, about 3400 remain after this filtering. Once simple texts can be handled well, complex texts can be attempted.
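A minimal Python sketch of the five filter rules follows. It assumes each sample is a text plus a dictionary mapping category names to annotated values, and that every annotated value occurs in the text. The function name `is_simple` is hypothetical, and the threshold "more than two values" follows the wording of the list above literally.

```python
def is_simple(text, entities):
    """Return True if the sample survives all five filter rules."""
    values = [v for vals in entities.values() for v in vals]
    if not values:
        return False  # text contains no entity
    if any(len(vals) > 2 for vals in entities.values()):
        return False  # a category has more than two values
    if any(text.startswith(v) or text.endswith(v) for v in values):
        return False  # text starts or ends with an entity
    spans = sorted((text.find(v), text.find(v) + len(v)) for v in values)
    if spans[0][0] < 0:
        return False  # an annotated value is missing from the text
    # Two entities are adjacent when one ends exactly where the next begins.
    return all(end < start for (_, end), (start, _) in zip(spans, spans[1:]))


sample = {"TIME": ["15:00"], "LOCATION": ["room A"]}
kept = is_simple("Please come at 15:00 to room A today", sample)
```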
(4) Entity replacement
As shown in Figure 11, according to the annotated corpus, each annotated entity value is removed from the original text, leaving a slot, and the slot is then filled with the corresponding entity category to generate a new text. Suppose the annotated corpus is as shown in Figure 14; the newly generated text after replacement is then as shown in Figure 15.
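The replacement step can be sketched in a few lines of Python. This is an illustrative sketch rather than the contents of Figures 14 and 15: the function name `replace_entities` and the sample sentence are hypothetical, and it assumes category names such as TIME do not themselves occur inside the annotated values.

```python
def replace_entities(text, entities):
    """Swap every annotated entity value for its category name."""
    for category, values in entities.items():
        for value in values:
            # The value leaves a slot, which is filled with the category.
            text = text.replace(value, category)
    return text


new_text = replace_entities(
    "Please come at 15:00 to room A today",
    {"TIME": ["15:00"], "LOCATION": ["room A"]},
)
```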
(5) Word segmentation
Further processing includes segmenting the newly generated text into words; the result is shown in Figure 16.
(6) Manually writing the regular expressions of the entities themselves
The regular expression of an entity itself is a regular expression that exactly matches values of a given entity category. For example, the set of regular expressions for the time entity includes "[0-9]{4}年(?:0[1-9]|1[0-2])月(?:[0-2][1-9]|3[0-1]|)日", which exactly matches a time such as "2018年01月03日" (January 3, 2018).
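The time pattern can be tried out with Python's `re` module. The sketch below assumes that the words "year", "moon", and "day" in the machine-translated pattern stand for the Chinese characters 年, 月, and 日; the names `TIME_PATTERN` and `is_time` are hypothetical. `fullmatch` is used so that the pattern must cover the whole candidate value.

```python
import re

# Entity-own regular expression for a date-style TIME value (assumed reading).
TIME_PATTERN = re.compile(r"[0-9]{4}年(?:0[1-9]|1[0-2])月(?:[0-2][1-9]|3[0-1]|)日")


def is_time(value):
    """Return True if the whole value is matched by the TIME pattern."""
    return TIME_PATTERN.fullmatch(value) is not None


matches_example = is_time("2018年01月03日")
rejects_bad_month = not is_time("2018年13月03日")
```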
It should be noted that a manually written regular expression of an entity itself only needs to exactly match one entity category. By comparison, the workload and time required are relatively small.
Once the regular expressions of all entities themselves have been written, the complete regular expressions can be generated automatically from the annotated data: comprehensive regular expressions that include both the words surrounding an entity and the regular expression of the entity itself.
(7) Generating comprehensive regular expressions
A template is the frame that remains after an entity value is generalized to its entity category. For example, if the original text is "have a meeting at the Great Hall of the People", the entity category is the meeting location (English name LOCATION), and the corresponding entity value is "the Great Hall of the People", then generalizing the entity yields "have a meeting at LOCATION", which is a template.
As another example, in the regular expression "go to (.* hoof flower soup) to eat", "(.* hoof flower soup)" is called the regular expression of the entity itself, "go to (.* hoof flower soup) to eat" is called the comprehensive regular expression, and "go to RESTAURANT to eat" is called the template.
After a template has been obtained, replacing LOCATION with the entity's own regular expression yields the final comprehensive regular expression. For example, if the regular expression of LOCATION itself is "(.+ meeting room)", the comprehensive regular expression generated from the template is "have a meeting at (.+ meeting room)". This comprehensive regular expression can then be matched against a new meeting text; if it matches, the meeting location entity in that text can be extracted.
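Combining a template with an entity's own regular expression can be sketched as follows, using an English rendering of the LOCATION example. The function names `build_comprehensive_regex` and `extract` are hypothetical, and the sketch assumes the entity's own regular expression contains exactly one capturing group. The literal parts of the template are escaped so that only the entity part is interpreted as a pattern.

```python
import re


def build_comprehensive_regex(template, category, entity_regex):
    """Replace the category placeholder in a template with the entity regex."""
    literal_parts = (re.escape(part) for part in template.split(category))
    return entity_regex.join(literal_parts)


def extract(pattern, text):
    """Return the captured entity value, or None if the text does not match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


pattern = build_comprehensive_regex(
    "have a meeting at LOCATION", "LOCATION", "(.+ meeting room)"
)
location = extract(pattern, "Let's have a meeting at the second floor meeting room")
```

Here `location` holds the text captured by the entity's own group, while the words around it come from the template and only serve to anchor the match.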
The method for automatically generating templates is as follows:
Starting from the entity under consideration as the center, the context is expanded step by step toward both sides. Each expansion by one word generates a new template, and each newly generated template is tested on the entire training set. If it meets the requirement, for example a precision of P = 100%, the template is saved. The loop then breaks and moves on to the next entity, for which templates are generated in the same way.
Taking the above example, suppose the entity under consideration is TIME. Expanding step by step toward both sides with TIME as the center generates the candidate templates shown in Figure 17 (the imbalance between the number of words on the left and right of the entity, and the context size, are adjustable; here the imbalance is 1 and the context is 3).
When a template is tested on the training set, the entity category is replaced with its corresponding batch of entity regular expressions, which generates a batch of comprehensive regular expressions that are then tested. As long as one regular expression meets the requirement, the corresponding template is saved, the loop breaks, and processing moves on to the next entity to continue generating templates, until the whole training set has been traversed. This finally yields a set of templates. Combined with the regular expressions of the entities themselves, these give the final desired comprehensive regular expressions.
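The expand-and-test loop can be sketched in Python as below. This is a simplified sketch under stated assumptions, not the procedure of Figure 17: the order in which windows are tried, the use of `\s*` between words, the requirement that each entity regular expression has one capturing group, the restriction to one entity slot per window, and all function names and sample sentences are assumptions made for illustration.

```python
import re
from itertools import product


def candidate_windows(context=3, imbalance=1):
    """Return (left, right) context sizes, smallest windows first."""
    pairs = [
        (left, right)
        for left, right in product(range(context + 1), repeat=2)
        if left + right > 0 and abs(left - right) <= imbalance
    ]
    return sorted(pairs, key=lambda p: (p[0] + p[1], p))


def precision(pattern, category, training_set):
    """Share of regex matches whose captured value is an annotated entity."""
    regex = re.compile(pattern)
    hits = correct = 0
    for text, entities in training_set:
        for match in regex.finditer(text):
            hits += 1
            if match.group(1) in entities.get(category, []):
                correct += 1
    return correct / hits if hits else 0.0


def learn_template(tokens, index, entity_regexes, training_set,
                   context=3, imbalance=1, required_precision=1.0):
    """Grow a window around tokens[index]; return the first passing template."""
    category = tokens[index]
    for left, right in candidate_windows(context, imbalance):
        if index - left < 0 or index + right >= len(tokens):
            continue  # window would run past the text
        window = tokens[index - left:index + right + 1]
        for entity_regex in entity_regexes[category]:
            # Comprehensive regex: literal context words around the entity regex.
            pattern = r"\s*".join(
                entity_regex if i == left else re.escape(tok)
                for i, tok in enumerate(window)
            )
            if precision(pattern, category, training_set) >= required_precision:
                return " ".join(window)  # save the template and stop
    return None


TRAINING_SET = [
    ("please arrive at 15:00 in room A", {"TIME": ["15:00"]}),
    ("the meeting starts at 09:30 sharp", {"TIME": ["09:30"]}),
    ("build 12:34 in progress", {}),
]
template = learn_template(
    ["please", "arrive", "at", "TIME", "in", "room", "A"], 3,
    {"TIME": [r"(\d{2}:\d{2})"]}, TRAINING_SET,
)
```

In this toy data the first candidate, "TIME in", also matches the unannotated "12:34 in" and is rejected, whereas "at TIME" reaches a precision of 100% and is saved.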
It should be noted that the regular expressions of the entities themselves must be written manually here. That is, the finally generated rule consists of two parts: the template and the regular expression of the entity itself. Combining the two parts gives a comprehensive regular expression that can actually be used. The regular expressions of the entities themselves must be written manually, whereas the templates can be generated automatically by the computer as provided in the embodiments of this application.
Obviously, those skilled in the art should understand that the modules or steps of this application described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be fabricated as individual integrated circuit modules, or multiple modules or steps among them can be fabricated as a single integrated circuit module. Thus, this application is not limited to any specific combination of hardware and software.
The above are merely preferred embodiments of this application and are not intended to limit it. For those skilled in the art, this application may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of this application shall fall within its scope of protection.

Claims (10)

1. A processing method for named entity recognition, characterized in that the method is used to generate rules for named entity recognition and comprises:
training a template that can be used to obtain, from a character string of a named entity, a substring satisfying a preset condition;
obtaining the regular expression of the named entity itself, and constructing a comprehensive regular expression from it and the template;
recognizing named entities according to the comprehensive regular expression.
2. The processing method according to claim 1, characterized in that training the template that can be used to obtain, from a character string of a named entity, a substring satisfying a preset condition comprises:
determining a business scenario;
collecting text data according to the business scenario;
defining the entities to be extracted, annotating the text data, and storing it in a standard data format.
3. The processing method according to claim 1, characterized in that training the template that can be used to obtain, from a character string of a named entity, a substring satisfying a preset condition further comprises any one or more of the following ways of filtering out complex text:
filtering out text that starts with an entity;
filtering out text that ends with an entity;
filtering out text in which two entities are immediately adjacent;
filtering out text in which some entity category contains more than two entity values;
filtering out text that contains no entity.
4. The processing method according to claim 1, characterized in that training the template that can be used to obtain, from a character string of a named entity, a substring satisfying a preset condition comprises:
obtaining an annotated corpus of the text, and taking the entity values annotated in the corpus as slots;
replacing each slot with the corresponding entity category to generate a new text.
5. The processing method according to claim 1, characterized in that training the template that can be used to obtain, from a character string of a named entity, a substring satisfying a preset condition comprises a step of taking, as the template, the frame that remains after an entity value is generalized to its entity category, the step specifically comprising:
replacing the entity category in the template with the corresponding regular expressions to generate a batch of regular expressions;
testing on the training set whether the batch of regular expressions satisfies the preset condition;
if the test on the training set shows that the batch of regular expressions satisfies the preset condition, saving the template corresponding to the regular expression.
6. A processing device for named entity recognition, characterized in that the device is used to generate rules for named entity recognition and comprises:
a training module, configured to train a template that can be used to obtain, from a character string of a named entity, a substring satisfying a preset condition;
a construction module, configured to obtain the regular expression of the named entity itself and to construct a comprehensive regular expression from it and the template;
a recognition module, configured to recognize named entities according to the comprehensive regular expression.
7. The processing device according to claim 6, characterized in that the training module comprises a collection and annotation module, the collection and annotation module comprising:
a determination unit, configured to determine a business scenario;
a collection unit, configured to collect text data according to the business scenario;
an annotation processing unit, configured to define the entities to be extracted, annotate the text data, and store it in a standard data format.
8. The processing device according to claim 6, characterized in that the training module comprises a filtering module, the filtering module being configured to perform any one or more of the following ways of filtering out complex text:
filtering out text that starts with an entity;
filtering out text that ends with an entity;
filtering out text in which two entities are immediately adjacent;
filtering out text in which some entity category contains more than two entity values;
filtering out text that contains no entity.
9. The processing device according to claim 6, characterized in that the training module comprises an entity replacement module, the entity replacement module comprising:
a processing unit, configured to obtain an annotated corpus of the text and take the entity values annotated in the corpus as slots;
a replacement unit, configured to replace each slot with the corresponding entity category to generate a new text.
10. The processing device according to claim 6, characterized in that the training module comprises a template generation module, the template generation module comprising:
a generation unit, configured to replace the entity category in the template with the corresponding regular expressions to generate a batch of regular expressions;
a test unit, configured to test on the training set whether the batch of regular expressions satisfies the preset condition;
a storage unit, configured to save the template corresponding to the regular expression when the test on the training set shows that the batch of regular expressions satisfies the preset condition.
CN201811644812.4A 2018-12-29 2018-12-29 Processing method and device for named entity recognition Active CN109740159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811644812.4A CN109740159B (en) 2018-12-29 2018-12-29 Processing method and device for named entity recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811644812.4A CN109740159B (en) 2018-12-29 2018-12-29 Processing method and device for named entity recognition

Publications (2)

Publication Number Publication Date
CN109740159A true CN109740159A (en) 2019-05-10
CN109740159B CN109740159B (en) 2022-04-26

Family

ID=66362654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811644812.4A Active CN109740159B (en) 2018-12-29 2018-12-29 Processing method and device for named entity recognition

Country Status (1)

Country Link
CN (1) CN109740159B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909160A (en) * 2019-10-11 2020-03-24 平安科技(深圳)有限公司 Regular expression generation method, server and computer readable storage medium
CN110990540A (en) * 2019-12-26 2020-04-10 厦门快商通科技股份有限公司 Synonym extraction method and device based on regular expression
CN111079436A (en) * 2019-12-20 2020-04-28 中南大学 Geological named entity extraction method and device
CN112578742A (en) * 2019-09-27 2021-03-30 罗克韦尔自动化技术公司 System and method for customer specific naming conventions for industrial automation devices
CN113378561A (en) * 2021-08-16 2021-09-10 北京泰迪熊移动科技有限公司 Word prediction template generation method and device
CN114078470A (en) * 2020-08-17 2022-02-22 阿里巴巴集团控股有限公司 Model processing method and device, and voice recognition method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103268339A (en) * 2013-05-17 2013-08-28 中国科学院计算技术研究所 Recognition method and system of named entities in microblog messages
CN103631948A (en) * 2013-12-11 2014-03-12 北京京东尚科信息技术有限公司 Identifying method of named entities
CN106095745A (en) * 2016-05-27 2016-11-09 厦门市美亚柏科信息股份有限公司 Transaction record extracting method based on log and system thereof
CN107527073A (en) * 2017-09-05 2017-12-29 中南大学 The recognition methods of entity is named in electronic health record
CN107608949A (en) * 2017-10-16 2018-01-19 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device based on semantic model
US20180253663A1 (en) * 2017-03-06 2018-09-06 Wipro Limited Method and system for extracting relevant entities from a text corpus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
马宁等: "面向互联网的藏文实体关系模板获取技术研究", 《中央民族大学学报(自然科学版)》 *

Also Published As

Publication number Publication date
CN109740159B (en) 2022-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: East of 1st floor, No.36 Haidian Street, Haidian District, Beijing, 100080

Patentee after: Beijing Teddy Future Technology Co.,Ltd.

Address before: East of 1st floor, No.36 Haidian Street, Haidian District, Beijing, 100080

Patentee before: Beijing Teddy Bear Mobile Technology Co.,Ltd.

CP03 Change of name, title or address

Address after: East of 1st floor, No.36 Haidian Street, Haidian District, Beijing, 100080

Patentee after: Beijing Teddy Bear Mobile Technology Co.,Ltd.

Address before: 100085 07a36, block D, 7 / F, No.28, information road, Haidian District, Beijing

Patentee before: BEIJING TEDDY BEAR MOBILE TECHNOLOGY Co.,Ltd.