Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the embodiments of the present application, a named entity (entity for short) refers to a word or phrase with a specific meaning in a text. Named entities mainly include person names, place names, organization names, proper nouns, and other words with specific meanings that are of interest in a given business scenario (such as restaurant names and movie names).
Named entity recognition in the embodiments of the present application refers to the task or technique of recognizing named entities in text.
In the embodiments of the present application, a regular expression refers to a pattern that describes string matching. It may be used to check whether a string contains a certain substring, to replace a matched substring, or to extract from a string a substring that meets a certain condition.
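The three uses of a regular expression named above (checking, replacing, extracting) can be illustrated with a minimal sketch using Python's standard `re` module. The sample sentence and the clock-time pattern are assumptions chosen for illustration only and are not taken from the application.

```python
import re

text = "The meeting is in the second floor conference room at 15:00."

# Hypothetical pattern: a 24-hour clock time such as "15:00".
time_pattern = re.compile(r"\b([01]?\d|2[0-3]):[0-5]\d\b")

# 1. Check whether the string contains a matching substring.
contains_time = time_pattern.search(text) is not None

# 2. Replace the matched substring.
masked = time_pattern.sub("TIME", text)

# 3. Extract the substring that meets the condition.
match = time_pattern.search(text)
extracted = match.group(0) if match else None
```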
Using regular expressions to recognize named entities has the advantages of high accuracy and recall, strong extensibility, low memory usage, and high recognition speed. However, actual business scenarios require many regular expressions to be written and maintained, and a single regular expression may be long and complex, so writing and maintaining them entirely by hand is very time-consuming and labor-intensive. Against this background, the present application provides a semi-automatic method for generating regular expressions: a comprehensive regular expression is generated by combining a manually written regular expression of the entity itself with a template automatically generated by a computer, and is then used for named entity recognition. This preserves recognition quality and efficiency while greatly reducing a company's labor and time costs.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in fig. 1, the method includes steps S102 to S106 as follows:
Step S102, training a template that can be used to obtain, from the character string of a named entity, a substring that meets a preset condition.
Training such a template means training a template into which a regular expression can be inserted.
Templates in this application refer to the framework that remains after entity values are generalized into entity categories. The template may serve as a basis for generating rules for named entity recognition.
Step S104, obtaining the regular expression of the named entity itself, and constructing a comprehensive regular expression from it and the template.
The comprehensive regular expression is formed from the obtained regular expression of the named entity itself and the template obtained in the preceding step. The regular expression of the named entity itself may be written manually, whereas the template may be generated directly by a computer.
Step S106, recognizing named entities according to the comprehensive regular expression.
Named entities can be recognized using the comprehensive regular expression constructed above.
From the above description, it can be seen that the following technical effects are achieved by the present application:
In the embodiments of the present application, a template is trained that can be used to obtain a substring meeting a preset condition from the character string of a named entity; the regular expression of the named entity itself is obtained and combined with the template to construct a comprehensive regular expression; and named entities are recognized according to the comprehensive regular expression. This ensures the effectiveness and efficiency of named entity recognition while reducing a company's labor and time costs, and thereby solves the technical problem of lacking rules for named entity recognition.
According to the embodiment of the present application, as a preferred embodiment in the present application, as shown in fig. 2, training a template that can be used to obtain a substring that meets a preset condition from a string of a named entity includes:
Step S202, determining a service scenario.
Once the service scenario is determined, it can be determined which text data needs to be collected.
Step S204, collecting text data according to the service scenario.
For the relevant service scenario, real text data needs to be collected first.
Step S206, defining the entities to be extracted, annotating the text data, and storing it in a standard data format.
The entities to be extracted are defined and then annotated manually.
Specifically, suppose the service scenario is entity extraction from conference-related texts. After the entity categories to be extracted, such as conference topic, conference time, and conference place, are defined, the collected conference-related texts can be annotated one by one. Each text is searched for entities of the categories of concern, and any that are found are annotated. For example, given the text "All employees, please come to the second floor conference room at 15:00 in the afternoon to attend the company annual meeting, and please be on time.", the annotation can be:
the conference time is: "15:00 in the afternoon",
the conference place is: "second floor conference room",
the conference topic is: "company annual meeting".
After the annotation is complete, the model can be trained with the annotated data.
It should be noted that the above steps further include storing the data in a standard data format. Specifically, the annotated data may come in different original formats, which need to be converted uniformly into a standard data format during data preprocessing to facilitate uniform processing.
According to the embodiment of the present application, as a preferred embodiment in the present application, as shown in fig. 3, training the template that can be used to obtain the substring that meets the preset condition from the string of the named entity further includes any one or more of the following ways of filtering out complex texts: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are adjacent; filtering out texts in which a certain entity category contains more than two entity values; and filtering out texts that do not contain any entity. Specifically, texts that are difficult to process may be filtered out first so that simpler texts are handled first. Once simple texts are handled well, processing of complex texts can be attempted.
According to the embodiment of the present application, as a preferred embodiment in the present application, as shown in fig. 4, training a template that can be used to obtain a substring that meets a preset condition from a string of a named entity includes:
Step S402, obtaining an annotated corpus of texts, and taking the entity values annotated in the corpus as slots.
According to the annotated corpus, the entity values annotated in the original text are extracted, leaving slots.
Step S404, replacing each slot with the corresponding entity category to generate a new text.
A new text is generated by filling each slot with the corresponding entity category.
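Steps S402 and S404 can be sketched as follows. This is only an illustrative sketch: the function name `replace_entities`, the dictionary layout of the annotations, and the category name `TOPIC` are assumptions, and the sample text is a simplified version of the conference example.

```python
def replace_entities(text, annotations):
    """Replace each annotated entity value in `text` with its category name."""
    # Replace longer values first so a short value never clobbers
    # part of a longer one.
    pairs = sorted(
        ((value, category)
         for category, values in annotations.items()
         for value in values),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    for value, category in pairs:
        text = text.replace(value, category)
    return text


text = ("please come to the second floor conference room at 15:00 "
        "for the company annual meeting")
annotations = {
    "TIME": ["15:00"],
    "LOCATION": ["second floor conference room"],
    "TOPIC": ["company annual meeting"],  # hypothetical category name
}
new_text = replace_entities(text, annotations)
```

With these inputs, each entity value becomes a slot filled by its category name, which is the form from which templates are later cut.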
According to the embodiment of the present application, as a preferred embodiment in the present application, as shown in fig. 5, training a template that can be used to obtain a substring that meets a preset condition from a string of a named entity includes a step of taking, as the template, the framework left after entity values are generalized into entity categories, the step specifically including:
step S502, replacing the entity category in the template with the regular expressions corresponding to the entity category to generate a batch of regular expressions;
step S504, testing on a training set whether the batch of regular expressions meets a preset condition;
step S506, if a regular expression in the batch meets the preset condition when tested on the training set, storing the template corresponding to that regular expression.
Specifically, taking as the template the framework left after generalizing entity values into entity categories may include the following: the entity under consideration is taken as the center and the context is expanded step by step toward both sides; each expansion by one word generates a new template, and each new template is tested on the entire training set. If the requirement is met, the template is saved, the expansion is stopped, and the process moves on to the next entity to generate templates in the same way.
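As a minimal sketch of steps S502 to S506, the following assumes that the preset condition is a threshold on the share of correct extractions, that every entity regular expression contains exactly one capturing group, and that the training set is a list of (text, annotations) pairs. The function names and the sample data are hypothetical.

```python
import re


def build_regexes(template, category, entity_regexes):
    """S502: substitute each entity regex for the category placeholder."""
    prefix, _, suffix = template.partition(category)
    return [re.escape(prefix) + entity_regex + re.escape(suffix)
            for entity_regex in entity_regexes]


def precision(regex, training_set, category):
    """Share of extractions that equal an annotated value of `category`."""
    hits = correct = 0
    for text, annotations in training_set:
        match = re.search(regex, text)
        if match:
            hits += 1
            if match.group(1) in annotations.get(category, []):
                correct += 1
    return correct / hits if hits else 0.0


def keep_template(template, category, entity_regexes, training_set,
                  threshold=1.0):
    """S504/S506: keep the template if any generated regex passes."""
    return any(
        precision(regex, training_set, category) >= threshold
        for regex in build_regexes(template, category, entity_regexes)
    )


training_set = [
    ("meeting in the second floor conference room today",
     {"LOCATION": ["the second floor conference room"]}),
    ("meeting in the main conference room at noon",
     {"LOCATION": ["the main conference room"]}),
]
location_regexes = [r"(.+ conference room)"]
```

Here `keep_template("meeting in LOCATION", "LOCATION", location_regexes, training_set)` accepts the template, whereas the bare template `"LOCATION"` is rejected because the unanchored pattern also captures the words before the place name.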
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
According to an embodiment of the present application, there is also provided an apparatus for implementing the above processing method for named entity identification, configured to generate a rule for named entity identification, as shown in fig. 6, where the apparatus includes: a training module 10, configured to train a template that is used to obtain a sub-string that meets a preset condition from a string of a named entity; the construction module 20 is configured to obtain a regular expression of the named entity itself, and construct a comprehensive regular expression with the template; and the identifying module 30 is used for identifying the named entity according to the comprehensive regular expression.
In the training module 10 of the embodiment of the present application, training the template that can be used to obtain the substring meeting the preset condition from the string of the named entity means training a template into which a regular expression can be inserted.
Templates in this application refer to the framework that remains after entity values are generalized into entity categories. The template may serve as a basis for generating rules for named entity recognition.
In the construction module 20 of the embodiment of the application, a comprehensive regular expression is formed from the obtained regular expression of the named entity itself and the template obtained by the training module. The regular expression of the named entity itself may be written manually, whereas the template may be generated directly by a computer.
The identification module 30 of the embodiment of the present application may be configured to recognize named entities using the comprehensive regular expression constructed above.
According to the embodiment of the present application, as shown in fig. 7, the training module preferably includes: a collection annotation module 101, the collection annotation module comprising: a determining unit 1011 configured to determine a service scenario; a collecting unit 1012, configured to collect text data according to the service scenario; and a label processing unit 1013, configured to define an entity that needs to be extracted, label the text data, and store the text data according to a standard data format.
In the determining unit 1011 of the embodiment of the present application, after determining the service scenario, it is determined which text data needs to be collected.
For the relevant service scenario in the collecting unit 1012 of the embodiment of the present application, it is first required to collect and obtain real text data.
In the annotation processing unit 1013 of the embodiment of the present application, entities that need to be extracted are defined and manually annotated.
Specifically, suppose the service scenario is entity extraction from conference-related texts. After the entity categories to be extracted, such as conference topic, conference time, and conference place, are defined, the collected conference-related texts can be annotated one by one. Each text is searched for entities of the categories of concern, and any that are found are annotated. For example, given the text "All employees, please come to the second floor conference room at 15:00 in the afternoon to attend the company annual meeting, and please be on time.", the annotation can be:
the conference time is: "15:00 in the afternoon",
the conference place is: "second floor conference room",
the conference topic is: "company annual meeting".
After the annotation is complete, the model can be trained with the annotated data.
It should be noted that the above processing further includes storing the data in a standard data format. Specifically, the annotated data may come in different original formats, which need to be converted uniformly into a standard data format during data preprocessing to facilitate uniform processing.
According to the embodiment of the present application, as shown in fig. 8, the training module preferably includes: a filtering module 104, configured to perform any one or more of the following ways of filtering out complex texts: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are adjacent; filtering out texts in which a certain entity category contains more than two entity values; and filtering out texts that do not contain any entity.
It should be noted that the implementation manner of the embodiment of the present application is not limited to the above processing manner of filtering out the complex text.
According to the embodiment of the present application, as shown in fig. 9, the training module 10 preferably includes: an entity replacement module 102, the entity replacement module 102 comprising: a processing unit 1021, configured to obtain a labeled corpus of a text, and use an entity value labeled in the labeled corpus as a slot; a replacing unit 1022, configured to replace the slot with a corresponding entity category, and generate a new text.
In the processing unit 1021 of the embodiment of the present application, the entity values annotated in the original text are extracted according to the annotated corpus, leaving slots.
In the replacing unit 1022 of the embodiment of the present application, a new text is generated by filling each slot with the corresponding entity category.
According to the embodiment of the present application, as shown in fig. 10, the training module 10 preferably includes: a template generation module 103, wherein the template generation module 103 comprises: a generating unit 1031, configured to replace the entity category in the template with a regular expression corresponding thereto, and generate a batch of regular expressions; a testing unit 1032, configured to test whether the batch of regular expressions meets a preset condition on a training set; a storing unit 1033, configured to store the template corresponding to the regular expression when the batch of the regular expressions tested on the training set meets a preset condition.
Specifically, in the generating unit 1031, the testing unit 1032, and the storing unit 1033 of the embodiment of the present application, taking as the template the framework left after generalizing entity values into entity categories may include the following: the entity under consideration is taken as the center and the context is expanded step by step toward both sides; each expansion by one word generates a new template, and each new template is tested on the entire training set. If the requirement is met, the template is saved, the expansion is stopped, and the process moves on to the next entity to generate templates in the same way.
Fig. 11 is a schematic diagram illustrating an implementation principle of the present application.
(1) Collecting and annotating data
As shown in fig. 11, for the business scenario of concern, real text data is collected first, the entities of concern that are to be extracted are then defined, and manual annotation is finally performed. Specifically, suppose that entity extraction is to be performed on conference-related texts and that entity categories such as conference topic, conference time, and conference place are defined. The collected conference-related texts can then be annotated one by one: each text is searched manually for entities of concern, and any that are found are annotated.
Suppose there is a text "All employees, please come to the second floor conference room at 15:00 in the afternoon to attend the company annual meeting, and please be on time." Then the conference time is "15:00 in the afternoon", the conference place is "second floor conference room", and the conference topic is "company annual meeting". After the annotation is complete, the model can be trained with the annotated data.
(2) Data pre-processing
As shown in fig. 11, the annotated data may have different original formats, which need to be converted uniformly into a standard data format to facilitate subsequent uniform processing. Specifically, the standard data format defined in the embodiment of the present application is shown in fig. 12.
As shown in FIG. 13, if an entity category has multiple entity values, the entity values are separated by "@@@".
Taking the above definition as an example, the specific format is summarized as follows:
the original text and the entity categories, as well as different entity categories, are separated by a tab character;
an entity category and its entity values are separated by "###";
multiple entity values are separated by "@@@".
In addition, corresponding English names are defined for the entity categories of each domain, and the entity categories of each domain are defined by an enum class. The entity categories in the standard data format are all represented in English.
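A record in this format could be parsed along the following lines. Because figs. 12 and 13 are not reproduced here, the exact field layout is an assumption inferred from the separator rules above, with the value separator taken to be "@@@" written without spaces; the enum members, the function name `parse_record`, and the sample line are likewise hypothetical.

```python
from enum import Enum


class MeetingEntity(Enum):
    # Hypothetical English category names for the meeting domain.
    TIME = "TIME"
    LOCATION = "LOCATION"
    TOPIC = "TOPIC"


def parse_record(line):
    """Split one standard-format line into (text, {category: [values]})."""
    text, *fields = line.rstrip("\n").split("\t")
    annotations = {}
    for field in fields:
        name, _, joined_values = field.partition("###")
        category = MeetingEntity(name)  # ValueError on an unknown category
        annotations[category] = joined_values.split("@@@")
    return text, annotations


line = ("meet at 15:00 or 16:00 in room A"
        "\tTIME###15:00@@@16:00"
        "\tLOCATION###room A")
text, annotations = parse_record(line)
```

Looking the category name up in the enum rejects misspelled or undefined categories at load time rather than during template generation.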
(3) Filtering complex text
To simplify the problem, as shown in fig. 11, texts that are difficult to process are filtered out first so that simpler texts are processed first. The following texts are mainly filtered out:
- the text starts with an entity;
- the text ends with an entity;
- two entities in the text are adjacent;
- an entity category in the text contains more than two entity values;
- the text contains no entity.
Taking conference recognition as an example, about 3,400 of the 4,000 original annotated texts remain after filtering. Once simple texts are handled well, processing of complex texts can be attempted.
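The five filtering conditions can be sketched as a single predicate. This sketch assumes that annotations are held as a dictionary from category name to a list of entity values, that entity positions are found by substring search, and that "adjacent" means only whitespace lies between two entities; the function names are hypothetical.

```python
def entity_spans(text, annotations):
    """Return sorted (start, end) spans of annotated values found in text."""
    spans = []
    for values in annotations.values():
        for value in values:
            start = text.find(value)
            if start != -1:
                spans.append((start, start + len(value)))
    return sorted(spans)


def is_complex(text, annotations, max_values=2):
    """True if the text meets any of the five filtering conditions."""
    spans = entity_spans(text, annotations)
    if not spans:                                    # no entity at all
        return True
    if spans[0][0] == 0:                             # starts with an entity
        return True
    if max(end for _, end in spans) == len(text):    # ends with an entity
        return True
    # Two entities are adjacent: only whitespace (or nothing) between them.
    for (_, left_end), (right_start, _) in zip(spans, spans[1:]):
        if not text[left_end:right_start].strip():
            return True
    # A category holds more than `max_values` entity values.
    return any(len(values) > max_values for values in annotations.values())
```

A corpus would then be reduced with something like `[(t, a) for t, a in corpus if not is_complex(t, a)]`.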
(4) Entity replacement
As shown in fig. 11, according to the annotated corpus, the entity values annotated in the original text are extracted to leave slots, and the corresponding entity categories are then filled in to generate a new text. Assuming the annotated corpus is as shown in fig. 14, the newly generated text after replacement is as shown in fig. 15.
(5) Word segmentation
As a further optimization, word segmentation is performed on the newly generated text; the result is shown in fig. 16.
(6) Manually writing regular expressions of entities themselves
The regular expression of the entity itself refers to a regular expression that exactly matches a certain entity category. For example, for the meeting time, one of its regular expressions is "[ 0-9] {4} year (.
It should be noted that the manually written regular expression of the entity itself is a regular expression that exactly matches a certain entity category. By comparison, the workload and time required to write it are small.
After the regular expressions of all entities have been written, they can be combined with the annotated data to automatically generate a complete regular expression, that is, a comprehensive regular expression containing the context words around the entity and the regular expression of the entity itself.
(7) Generating comprehensive regular expressions
Templates refer to the framework left after entity values are generalized into entity categories. For example, if the original text is "meeting in People's Hall", the corresponding entity category is the conference place (LOCATION in English), and the corresponding entity value is "People's Hall", then generalizing the entity yields "meeting in LOCATION", which is a template.
For another example, "(hoof flower soup)" is called the regular expression of the entity itself, "go (hoof flower soup)" is called the comprehensive regular expression, and "go (soup)" is called the template.
After the template is obtained, LOCATION is replaced with the regular expression of LOCATION itself to obtain the final comprehensive regular expression. For example, if the regular expression of LOCATION itself is "(.+ conference room)", the comprehensive regular expression generated by combining it with the template is "meeting in (.+ conference room)". A new conference text can be matched against this comprehensive regular expression, and if it matches, the conference place entity in the text can be extracted.
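The combination of a template with the regular expression of the entity itself can be sketched as follows. The function name `build_comprehensive_regex` and the sample sentence are assumptions, and the LOCATION pattern is written without stray spaces as `(.+ conference room)`.

```python
import re


def build_comprehensive_regex(template, entity_regexes):
    """Replace each category name in the template with its own regex."""
    # Split on the category names (keeping them) so that the literal
    # context words can be escaped separately.
    names = "|".join(re.escape(name) for name in entity_regexes)
    parts = re.split(f"({names})", template)
    return "".join(
        entity_regexes[part] if part in entity_regexes else re.escape(part)
        for part in parts
    )


entity_regexes = {"LOCATION": r"(.+ conference room)"}
regex = build_comprehensive_regex("meeting in LOCATION", entity_regexes)

new_text = "We have a meeting in the second floor conference room"
match = re.search(regex, new_text)
location = match.group(1) if match else None
```

The context words "meeting in" anchor the match, so the capturing group contributed by the entity's own regular expression yields only the place name.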
The method for automatically generating the template specifically comprises the following steps:
First, the entity under consideration is taken as the center and the context is expanded step by step toward both sides; each expansion by one word generates a new template, and each new template is tested on the entire training set. If the requirement is met, for example an accuracy rate P of 100%, the template is saved. The expansion then stops (break), and the process jumps to the next entity to continue generating templates in the same way.
Taking the above as an example, assuming that the entity under consideration is TIME, the context is expanded step by step toward both sides with TIME as the center, and the candidate templates shown in fig. 17 can be generated. The degree of imbalance between the numbers of words on the left and right sides of the entity and the context range are both adjustable; here the degree of imbalance is set to 1 and the context range is set to 3.
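Candidate generation by expansion can be sketched as below. Since fig. 17 is not reproduced here, the interpretation is an assumption: the context range is taken as the maximum number of words on each side, and the degree of imbalance as the maximum allowed difference between the left and right word counts. The function name and the token sequence are hypothetical.

```python
def candidate_templates(tokens, entity_index, context_range=3, imbalance=1):
    """Yield templates around tokens[entity_index], narrowest context first."""
    max_left = min(context_range, entity_index)
    max_right = min(context_range, len(tokens) - entity_index - 1)
    windows = [
        (left, right)
        for left in range(max_left + 1)
        for right in range(max_right + 1)
        if left + right > 0 and abs(left - right) <= imbalance
    ]
    # Fewer context words first, so each step widens the window by one word.
    for left, right in sorted(windows, key=lambda w: (w[0] + w[1], w)):
        yield " ".join(tokens[entity_index - left:entity_index + right + 1])


# Hypothetical segmented text after entity replacement.
tokens = ["please", "attend", "at", "TIME", "in", "LOCATION"]
templates = list(candidate_templates(tokens, tokens.index("TIME")))
```

Yielding the narrowest windows first matches the described procedure of stopping at the first template that meets the requirement, which favors short, general templates.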
For each template, when testing on the training set, the entity category is replaced with the corresponding batch of regular expressions of the entity itself to generate a batch of comprehensive regular expressions, which are then tested. As long as one regular expression meets the requirement, the corresponding template is saved, the loop is interrupted (break), and the process jumps to the next entity to continue generating templates, until the entire training set has been traversed. A batch of templates is finally obtained. Combining these templates with the regular expressions of the entities themselves then yields the final desired comprehensive regular expressions.
It should be noted that here the regular expression of the entity itself needs to be written manually. That is to say, the finally generated rule includes two parts: one part is a template and the other part is the regular expression of the entity itself, and the two parts are combined into a comprehensive regular expression that can actually be used. The regular expressions of the entities themselves need to be written manually, whereas the templates are generated automatically by a computer using the method provided in the embodiment of the application.
It will be apparent to those skilled in the art that the modules or steps of the present application described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.