Detailed Description of the Embodiments
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present application are used to distinguish similar objects, and are not used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented. In addition, the terms "include" and "have" and any variants thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
In the embodiments of the present application, a named entity (entity for short) refers to a word in a text that carries a specific meaning, mainly including person names, place names, organization names, proper nouns, and any words of interest that carry a specific meaning in a given business scenario (for example, restaurant names, hotel names, and movie titles).
In the embodiments of the present application, named entity recognition refers to the task or technique of identifying named entities in text.
In the embodiments of the present application, a regular expression describes a pattern for string matching. It can be used to check whether a string contains a certain substring, to replace a matched substring, or to extract from a string the substrings that satisfy a certain condition.
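The three uses of a regular expression named above (checking, replacing, and extracting) can be illustrated with a minimal Python sketch. The sample sentence and the simple HH:MM time pattern are assumptions chosen for illustration only and do not come from the application.

```python
import re

# Hypothetical sample text and pattern (a plain HH:MM time), for illustration only.
text = "Please come to the second-floor meeting room at 15:00."
pattern = re.compile(r"[0-9]{2}:[0-9]{2}")

# 1. Check whether the string contains a matching substring.
has_time = pattern.search(text) is not None

# 2. Replace the matched substring.
masked = pattern.sub("TIME", text)

# 3. Extract the substrings that satisfy the pattern.
times = pattern.findall(text)
```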
Identifying named entities with regular expressions offers advantages such as high precision and recall, strong extensibility, low memory usage, and fast recognition. However, because of the demands of real business scenarios, a large number of regular expressions often need to be written and maintained, and a single regular expression may be very long and complex, so writing and maintaining all of them by hand is time-consuming and laborious. Against this background, the present application proposes a method for semi-automatically generating regular expressions: manually written regular expressions for the entities themselves are combined with templates generated automatically by a computer to produce comprehensive regular expressions for named entity recognition. This preserves recognition quality and efficiency while greatly reducing labor and time costs.
It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with one another. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
As shown in Fig. 1, the method includes the following steps S102 to S106:
Step S102: train a template that can be used to obtain, from the character string of a named entity, a substring that satisfies a preset condition;
Training a template that can be used to obtain, from the character string of a named entity, a substring that satisfies a preset condition means training a template that can hold a regular expression.
A template in the present application is the frame that remains after an entity value is generalized to its entity category. The template can serve as the basis for generating named entity recognition rules.
Step S104: obtain the named entity's own regular expression, and construct a comprehensive regular expression from it and the template;
The comprehensive regular expression is composed of the obtained regular expression of the named entity itself and the template obtained in the preceding step. The regular expression of the named entity itself may be written manually, whereas the template may be generated directly by a computer.
Step S106: identify named entities according to the comprehensive regular expression.
The comprehensive regular expression so constructed can be used to identify named entities.
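As a minimal sketch of step S106, assume the comprehensive regular expressions are stored as (category, pattern) pairs whose first capture group holds the entity. The function name `recognize`, the storage layout, and both sample rules are hypothetical and are not taken from the application.

```python
import re

def recognize(text, rules):
    """Apply every comprehensive regular expression and collect captured entities.

    `rules` is a list of (category, pattern) pairs; group 1 of each pattern
    is assumed to capture the entity value.
    """
    found = {}
    for category, pattern in rules:
        for match in re.finditer(pattern, text):
            found.setdefault(category, []).append(match.group(1))
    return found

# Hypothetical rules: surrounding words plus the entity's own regular expression.
rules = [
    ("LOCATION", r"have a meeting at (.+ meeting room)"),
    ("TIME", r"at ([0-9]{2}:[0-9]{2})"),
]
result = recognize(
    "We will have a meeting at the second-floor meeting room at 15:00", rules
)
```

Keeping the category next to each pattern lets one pass over the rules return entities grouped by category.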
From the above description, it can be seen that the present application achieves the following technical effects:
In the embodiments of the present application, a template is trained that can be used to obtain, from the character string of a named entity, a substring that satisfies a preset condition; the named entity's own regular expression is obtained and a comprehensive regular expression is constructed from it and the template; and named entities are then identified according to the comprehensive regular expression. This preserves the quality and efficiency of named entity recognition while reducing labor and time costs, and thereby solves the technical problem of lacking rules usable for named entity recognition.
According to an embodiment of the present application, preferably, as shown in Fig. 2, training the template that can be used to obtain, from the character string of a named entity, a substring that satisfies a preset condition includes:
Step S202: determine the business scenario;
Once the business scenario is determined, it is determined which text data need to be collected.
Step S204: collect text data according to the business scenario;
For the relevant business scenario, real text data must first be collected.
Step S206: define the entities to be extracted, annotate the text data, and store it in a standard data format.
The entities to be extracted are defined and annotated manually.
Specifically, suppose for example that the business scenario is entity extraction from meeting-related text, and that the entity categories to be extracted include the meeting topic, the meeting time, and the meeting place. Once the entity categories to be extracted have been defined, the collected meeting-related texts can be annotated one by one. Each text is searched for the entities of interest described above, and any that are found are annotated. For example, suppose a text reads "All employees: please go to the second-floor meeting room at 15:00 this afternoon to attend the company annual commendation meeting. Please attend on time." The following can then be annotated:
Meeting time: "15:00 this afternoon",
Meeting place: "second-floor meeting room",
Meeting topic: "company annual commendation meeting".
After annotation is complete, the annotated data can be used to train the model.
It should be noted that the above steps further include storing the data in a standard data format. Specifically, in data preprocessing, the annotated data may come in different original formats, which need to be uniformly converted into one standard data format to facilitate uniform processing afterwards.
According to an embodiment of the present application, preferably, as shown in Fig. 3, training the template that can be used to obtain, from the character string of a named entity, a substring that satisfies a preset condition further includes filtering out complex text in any one or more of the following ways: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are directly adjacent; filtering out texts in which some entity category contains more than two entity values; and filtering out texts that contain no entity. Specifically, texts that are difficult to handle may be filtered out first, so that processing starts with relatively simple texts. Once simple texts can be handled well, handling complex texts can then be attempted.
According to an embodiment of the present application, preferably, as shown in Fig. 4, training the template that can be used to obtain, from the character string of a named entity, a substring that satisfies a preset condition includes:
Step S402: obtain the annotated corpus of the text, and treat each entity value annotated in the corpus as a slot;
According to the annotated corpus, each annotated entity value is cut out of the original text, leaving a slot.
Step S404: fill each slot with the corresponding entity category to generate a new text.
Filling in the corresponding entity category produces the new text.
According to an embodiment of the present application, preferably, as shown in Fig. 5, training the template that can be used to obtain, from the character string of a named entity, a substring that satisfies a preset condition includes a step of taking, as the template, the frame that remains after an entity value is generalized to its entity category. This step specifically includes:
Step S502: replace the entity category in the template with the corresponding regular expressions to generate a batch of regular expressions;
Step S504: test on the training set whether the batch of regular expressions satisfies a preset condition;
Step S506: if the test on the training set shows that the batch of regular expressions satisfies the preset condition, save the template corresponding to the regular expression.
Specifically, taking as the template the frame that remains after an entity value is generalized to its entity category may include the following: starting from the entity under consideration as the center, the window is expanded gradually toward both sides, one word at a time, and each expansion generates a new template. Every template so generated is tested on the entire training set. If it meets the requirement, the template is saved and the expansion is interrupted; processing then jumps to the next entity and continues generating templates in the same way.
It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
According to an embodiment of the present application, a device for implementing the above processing method for named entity recognition is further provided, for generating named entity recognition rules. As shown in Fig. 6, the device includes: a training module 10, configured to train a template that can be used to obtain, from the character string of a named entity, a substring that satisfies a preset condition; a construction module 20, configured to obtain the named entity's own regular expression and construct a comprehensive regular expression from it and the template; and an identification module 30, configured to identify named entities according to the comprehensive regular expression.
In the training module 10 of the embodiment of the present application, training a template that can be used to obtain, from the character string of a named entity, a substring that satisfies a preset condition means training a template that can hold a regular expression.
A template in the present application is the frame that remains after an entity value is generalized to its entity category. The template can serve as the basis for generating named entity recognition rules.
In the construction module 20 of the embodiment of the present application, the comprehensive regular expression is composed of the obtained regular expression of the named entity itself and the template obtained in the preceding step. The regular expression of the named entity itself may be written manually, whereas the template may be generated directly by a computer.
In the identification module 30 of the embodiment of the present application, the comprehensive regular expression so constructed can be used to identify named entities.
According to an embodiment of the present application, preferably, as shown in Fig. 7, the training module includes a collection and annotation module 101, which includes: a determination unit 1011, configured to determine the business scenario; a collection unit 1012, configured to collect text data according to the business scenario; and an annotation processing unit 1013, configured to define the entities to be extracted, annotate the text data, and store it in a standard data format.
In the determination unit 1011 of the embodiment of the present application, once the business scenario is determined, it is determined which text data need to be collected.
In the collection unit 1012 of the embodiment of the present application, for the relevant business scenario, real text data must first be collected.
In the annotation processing unit 1013 of the embodiment of the present application, the entities to be extracted are defined and annotated manually.
Specifically, suppose for example that the business scenario is entity extraction from meeting-related text, and that the entity categories to be extracted include the meeting topic, the meeting time, and the meeting place. Once the entity categories to be extracted have been defined, the collected meeting-related texts can be annotated one by one. Each text is searched for the entities of interest described above, and any that are found are annotated. For example, suppose a text reads "All employees: please go to the second-floor meeting room at 15:00 this afternoon to attend the company annual commendation meeting. Please attend on time." The following can then be annotated:
Meeting time: "15:00 this afternoon",
Meeting place: "second-floor meeting room",
Meeting topic: "company annual commendation meeting".
After annotation is complete, the annotated data can be used to train the model.
It should be noted that the above steps further include storing the data in a standard data format. Specifically, in data preprocessing, the annotated data may come in different original formats, which need to be uniformly converted into one standard data format to facilitate uniform processing afterwards.
According to an embodiment of the present application, preferably, as shown in Fig. 8, the training module includes a filtering module 104, configured to filter out complex text in any one or more of the following ways: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are directly adjacent; filtering out texts in which some entity category contains more than two entity values; and filtering out texts that contain no entity.
It should be noted that implementations of the embodiments of the present application are not limited to the above ways of filtering out complex text.
According to an embodiment of the present application, preferably, as shown in Fig. 9, the training module 10 includes an entity replacement module 102, which includes: a processing unit 1021, configured to obtain the annotated corpus of the text and treat each entity value annotated in the corpus as a slot; and a replacement unit 1022, configured to fill each slot with the corresponding entity category to generate a new text.
In the processing unit 1021 of the embodiment of the present application, according to the annotated corpus, each annotated entity value is cut out of the original text, leaving a slot.
In the replacement unit 1022 of the embodiment of the present application, filling in the corresponding entity category produces the new text.
According to an embodiment of the present application, preferably, as shown in Fig. 10, the training module 10 includes a template generation module 103, which includes: a generation unit 1031, configured to replace the entity category in the template with the corresponding regular expressions to generate a batch of regular expressions; a test unit 1032, configured to test on the training set whether the batch of regular expressions satisfies a preset condition; and a storage unit 1033, configured to save the template corresponding to the regular expression when the test on the training set shows that the batch of regular expressions satisfies the preset condition.
Specifically, in the generation unit 1031, the test unit 1032, and the storage unit 1033 of the embodiment of the present application, taking as the template the frame that remains after an entity value is generalized to its entity category may include the following: starting from the entity under consideration as the center, the window is expanded gradually toward both sides, one word at a time, and each expansion generates a new template. Every template so generated is tested on the entire training set. If it meets the requirement, the template is saved and the expansion is interrupted; processing then jumps to the next entity and continues generating templates in the same way.
Fig. 11 is a schematic diagram of the implementation principle of the present application.
(1) Collecting and annotating data
As shown in Fig. 11, for the business scenario of interest, real text data are collected first; next, the entities that are of interest and are to be extracted are defined; finally, the data are annotated manually. Specifically, suppose that entity extraction is to be performed on meeting-related text, and that the entity categories to be extracted are the meeting topic, the meeting time, and the meeting place. Once the entity categories to be extracted have been defined, the collected meeting-related texts can be annotated one by one: each text is searched manually for the entities of interest described above, and any that are found are annotated. Suppose a text reads "All employees: please go to the second-floor meeting room at 15:00 this afternoon to attend the company annual commendation meeting. Please attend on time." The meeting time can then be annotated as "15:00 this afternoon", the meeting place as "second-floor meeting room", and the meeting topic as "company annual commendation meeting". After annotation is complete, the annotated data can be used to train the model.
(2) Data preprocessing
As shown in Fig. 11, the annotated data may come in different original formats, which need to be uniformly converted into one standard data format to facilitate uniform processing afterwards. Specifically, the standard data format defined in the embodiment of the present application is shown in Fig. 12.
As shown in Fig. 13, if an entity category has multiple entity values, the values are separated by "@@@".
Taking the above definition as an example, the format can be summarized as follows:
The original text and the entity categories, and the different entity categories themselves, are separated by tab characters;
An entity category and its entity value are separated by "###";
Multiple entity values are separated by "@@@".
In addition, a corresponding English name needs to be defined for the entity categories of each domain, and the entity categories of each domain are defined in an enum class. In the standard data format, entity categories are written in English.
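The separator rules summarized above can be sketched in Python as follows. The enum `MeetingEntity`, its member names, and the helper functions `parse_record` and `format_record` are hypothetical; the exact layout shown in Figs. 12 and 13 is not reproduced here, so the sketch relies only on the three separators stated in the text.

```python
from enum import Enum

class MeetingEntity(Enum):
    """Hypothetical English names for the meeting-domain entity categories."""
    TOPIC = "TOPIC"
    TIME = "TIME"
    LOCATION = "LOCATION"

def parse_record(line):
    """Split one standard-format line into (text, {category: [values]})."""
    fields = line.rstrip("\n").split("\t")      # tab between text and categories
    text, entities = fields[0], {}
    for field in fields[1:]:
        category, _, values = field.partition("###")   # category ### value(s)
        entities[MeetingEntity(category)] = values.split("@@@")  # value @@@ value
    return text, entities

def format_record(text, entities):
    """Serialize (text, entities) back into one standard-format line."""
    parts = [text]
    for category, values in entities.items():
        parts.append(category.value + "###" + "@@@".join(values))
    return "\t".join(parts)

line = "Meet at 15:00 or 16:00 in Room A\tTIME###15:00@@@16:00\tLOCATION###Room A"
text, entities = parse_record(line)
```

Using an enum makes an unknown category name fail loudly (`MeetingEntity("FOO")` raises `ValueError`) instead of silently entering the training data.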
(3) Filtering out complex text
As shown in Fig. 11, to simplify the problem, texts that are difficult to handle are filtered out first, and relatively simple texts are processed first. The following kinds of text are mainly filtered out:
Texts that begin with an entity;
Texts that end with an entity;
Texts in which two entities are directly adjacent;
Texts in which some entity category contains more than two entity values;
Texts that contain no entity.
Taking the above meeting recognition as an example, of the 4000 original annotated records, about 3400 remained after the above filtering. Once simple texts can be handled well, handling complex texts can then be attempted.
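The five filters listed above can be sketched as a single predicate. This is only an illustration under stated assumptions: the function name `is_simple` is hypothetical, each entity value is located by its first occurrence in the text, two entities count as adjacent when nothing but whitespace lies between them, and the per-category limit is exposed as the parameter `max_values`.

```python
def is_simple(text, entities, max_values=2):
    """Return True if the annotated text survives all five filters.

    `entities` maps a category name to the list of annotated values.
    """
    values = [v for vals in entities.values() for v in vals]
    if not values:                                   # text contains no entity
        return False
    if any(len(vals) > max_values for vals in entities.values()):
        return False                                 # too many values in a category
    spans = []
    for value in values:
        start = text.find(value)                     # first occurrence (assumption)
        if start < 0:
            return False                             # annotation not found in text
        spans.append((start, start + len(value)))
    spans.sort()
    if spans[0][0] == 0:                             # text begins with an entity
        return False
    if max(end for _, end in spans) == len(text):    # text ends with an entity
        return False
    # Two entities are adjacent if only whitespace (or nothing) separates them.
    return all(
        text[prev_end:next_start].strip()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    )
```

A corpus can then be reduced with a comprehension such as `[r for r in records if is_simple(*r)]`, where each record is a `(text, entities)` pair.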
(4) Entity replacement
As shown in Fig. 11, according to the annotated corpus, each annotated entity value is cut out of the original text, leaving a slot, and the slot is then filled with the corresponding entity category to generate a new text. Assuming the annotated corpus is as shown in Fig. 14, the newly generated text after replacement is as shown in Fig. 15.
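The slot-and-fill step can be sketched as follows. The function name `replace_entities` and the English sample sentence are illustrative assumptions; Figs. 14 and 15 are not reproduced here. Longer values are replaced first so that a short value cannot split a longer one, which is a design choice of this sketch rather than something stated in the application.

```python
def replace_entities(text, entities):
    """Replace each annotated entity value in `text` with its category name."""
    pairs = [(value, category)
             for category, values in entities.items()
             for value in values]
    # Longest values first, so overlapping shorter values do not break them.
    for value, category in sorted(pairs, key=lambda p: -len(p[0])):
        text = text.replace(value, category, 1)   # fill the slot once
    return text

new_text = replace_entities(
    "Please come to the second-floor meeting room at 15:00 this afternoon "
    "to attend the company annual commendation meeting.",
    {
        "TIME": ["15:00 this afternoon"],
        "LOCATION": ["second-floor meeting room"],
        "TOPIC": ["company annual commendation meeting"],
    },
)
```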
(5) Word segmentation
As a further optimization, the newly generated text needs to be segmented into words; the result is shown in Fig. 16.
(6) Manually writing the regular expressions of the entities themselves
The regular expression of an entity itself is a regular expression that exactly matches a certain entity category. For example, one regular expression in the entity's own set is "[0-9]{4}年(?:0[1-9]|1[0-2])月(?:[0-2][1-9]|3[0-1]|)日", which exactly matches a time such as "2018年01月03日" (January 3, 2018).
It should be noted that a manually written regular expression of an entity itself only needs to exactly match a certain entity category. By comparison, the workload and time required are small.
After the regular expressions of all entities themselves have been written, the complete regular expressions can be generated automatically in combination with the annotated data, namely comprehensive regular expressions that include both the words surrounding the entity and the entity's own regular expression.
(7) Generating comprehensive regular expressions
A template is the frame that remains after an entity value is generalized to its entity category. For example, if the original text is "have a meeting at the Great Hall of the People", the entity category is the meeting place (LOCATION in English) and the corresponding entity value is "the Great Hall of the People"; after the entity is generalized, "have a meeting at LOCATION" is obtained, which is a template.
As another example, in the regular expression "go to (.*trotter soup) for a meal", "(.*trotter soup)" is called the regular expression of the entity itself, "go to (.*trotter soup) for a meal" is called the comprehensive regular expression, and "go to RESTAURANT for a meal" is called the template.
After the template is obtained, replacing LOCATION with the entity's own regular expression yields the final comprehensive regular expression. For example, assuming the regular expression of LOCATION itself is "(.+ meeting room)", the comprehensive regular expression generated from the template is "have a meeting at (.+ meeting room)". This comprehensive regular expression can then be matched against a new meeting text; if it matches, the meeting-place entity in that text can be extracted.
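Combining a template with an entity's own regular expression can be sketched as below. The helper names `build_comprehensive` and `extract` are hypothetical, and escaping the template's literal words with `re.escape` is an assumption of this sketch; the application does not say how literal words are treated.

```python
import re

def build_comprehensive(template, category, entity_regex):
    """Substitute the entity's own regex for the category placeholder.

    The surrounding words are escaped so they match literally; the entity
    regex is wrapped in a capture group so the entity can be extracted.
    """
    if category not in template:
        raise ValueError("template does not contain " + category)
    left, _, right = template.partition(category)
    return re.escape(left) + "(" + entity_regex + ")" + re.escape(right)

def extract(pattern, text):
    """Return the captured entity if the comprehensive regex matches, else None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

pattern = build_comprehensive(
    "have a meeting at LOCATION", "LOCATION", r".+ meeting room"
)
place = extract(pattern, "We will have a meeting at the second-floor meeting room")
```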
The method for automatically generating templates is as follows:
Starting from the entity under consideration as the center, the window is expanded gradually toward both sides, one word at a time, and each expansion generates a new template. Every template so generated is tested on the entire training set; if it meets the requirement, for example a precision P = 100%, the template is saved. The expansion is then interrupted (break), and processing jumps to the next entity and continues generating templates in the same way.
Taking the above example, assume the entity under consideration is TIME. Centered on TIME and expanding gradually toward both sides, the candidate templates shown in Fig. 17 can be generated. The imbalance between the number of words on the left and right of the entity, and the context size, are both adjustable; here the imbalance is 1 and the context is 3.
When each template is tested on the training set, the entity category is replaced with the corresponding batch of the entity's own regular expressions to generate a batch of comprehensive regular expressions, which are then tested. As long as one regular expression meets the requirement, the corresponding template is saved, the expansion is interrupted (break), and processing jumps to the next entity to continue generating templates, until the whole training set has been traversed. A batch of templates is finally obtained. Combined with the regular expressions of the entities themselves, these yield the desired comprehensive regular expressions.
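The expand-test-save loop described above can be sketched as follows. Several points are assumptions, not taken from the application: words are split on whitespace, `context` is read as the maximum number of words on each side, `imbalance` as the maximum difference between the left and right word counts, precision is the share of matches whose captured value is an annotated value, and all function names are hypothetical.

```python
import re

def windows(tokens, idx, context=3, imbalance=1):
    """Yield (left_words, right_words) around tokens[idx], smallest windows first."""
    for total in range(1, 2 * context + 1):
        for left in range(total + 1):
            right = total - left
            if max(left, right) > context or abs(left - right) > imbalance:
                continue
            if left > idx or idx + right >= len(tokens):
                continue                      # window would run past the text
            yield tokens[idx - left:idx], tokens[idx + 1:idx + right + 1]

def build_pattern(left, right, entity_regex):
    """Comprehensive regex: literal context words around the entity's own regex."""
    prefix = re.escape(" ".join(left) + " ") if left else ""
    suffix = re.escape(" " + " ".join(right)) if right else ""
    return prefix + "(" + entity_regex + ")" + suffix

def precision(pattern, category, training_set):
    """Share of matches whose captured value is an annotated value of `category`."""
    hits = correct = 0
    for text, entities in training_set:
        for match in re.finditer(pattern, text):
            hits += 1
            correct += match.group(1) in entities.get(category, [])
    return correct / hits if hits else 0.0

def learn_templates(replaced_texts, training_set, entity_regexes, required=1.0):
    """Return the templates whose comprehensive regex reaches the required precision."""
    saved = []
    for replaced in replaced_texts:
        tokens = replaced.split()
        for idx, category in enumerate(tokens):
            if category not in entity_regexes:
                continue                      # not an entity placeholder
            for left, right in windows(tokens, idx):
                patterns = [build_pattern(left, right, rx)
                            for rx in entity_regexes[category]]
                if any(precision(p, category, training_set) >= required
                       for p in patterns):
                    template = " ".join(left + [category] + right)
                    if template not in saved:
                        saved.append(template)
                    break                     # stop expanding; go to next entity
    return saved

training_set = [
    ("please have a meeting at the big meeting room today",
     {"LOCATION": ["the big meeting room"]}),
    ("we will have a meeting at the small meeting room tomorrow",
     {"LOCATION": ["the small meeting room"]}),
]
replaced_texts = [
    "please have a meeting at LOCATION today",
    "we will have a meeting at LOCATION tomorrow",
]
templates = learn_templates(
    replaced_texts, training_set, {"LOCATION": [r".+ meeting room"]}
)
```

In this toy data the right-hand context alone ("LOCATION today") is rejected because the greedy entity regex captures too much text, while the left-hand context "at LOCATION" reaches full precision and is saved.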
It should be noted that the regular expressions of the entities themselves need to be written manually here. That is, a finally generated rule consists of two parts: the template, and the regular expression of the entity itself. Only the combination of the two parts is a comprehensive regular expression that can actually be used. The regular expressions of the entities themselves must be written manually, whereas the templates can be generated automatically by a computer as provided in the embodiments of the present application.
Evidently, those skilled in the art should understand that the modules or steps of the present application described above may be implemented by a general-purpose computing device. They may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present application and is not intended to limit the present application. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within the scope of protection of the present application.