CN109740159B - Processing method and device for named entity recognition - Google Patents

Processing method and device for named entity recognition

Info

Publication number
CN109740159B
Authority
CN
China
Prior art keywords
entity
template
text
training
regular expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811644812.4A
Other languages
Chinese (zh)
Other versions
CN109740159A (en
Inventor
申化泽
刘丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Teddy Bear Mobile Technology Co ltd
Beijing Teddy Future Technology Co ltd
Original Assignee
Beijing Teddy Bear Mobile Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Teddy Bear Mobile Technology Co ltd
Priority to CN201811644812.4A
Publication of CN109740159A
Application granted
Publication of CN109740159B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Input (AREA)

Abstract

The application discloses a processing method and device for named entity recognition. The method comprises: training a template that can be used to obtain a substring meeting a preset condition from a character string of a named entity; acquiring a regular expression of the named entity itself, and constructing a comprehensive regular expression together with the template; and identifying the named entity according to the comprehensive regular expression. The method and the device solve the technical problem of a lack of named entity recognition rules. The method combines manually written regular expressions of the entities with templates automatically generated by a computer to produce a comprehensive regular expression for named entity recognition, which preserves recognition quality and efficiency while greatly reducing a company's labor and time costs. In addition, the method can be applied to mobile devices such as mobile phones.

Description

Processing method and device for named entity recognition
Technical Field
The present application relates to the field of natural language processing, and in particular, to a processing method and apparatus for named entity recognition.
Background
Named Entity Recognition (NER) refers to recognizing entities with specific meanings in text, mainly including names of people, places, and organizations, proper nouns, and words with specific meanings that are of interest in a given service scenario. Rule-based named entity recognition is mainly implemented with regular expressions.
The inventors found that as service scenarios become richer, more and more entity types need to be extracted and more regular expressions need to be written and maintained manually, which consumes a great deal of labor and time.
No effective solution has yet been proposed for the problem of lacking available named entity recognition rules in the related art.
Disclosure of Invention
The present application mainly aims to provide a processing method and apparatus for named entity identification, so as to solve the problem of lacking available rules for named entity identification.
To achieve the above object, according to one aspect of the present application, there is provided a processing method for named entity recognition for generating rules for named entity recognition.
The processing method for named entity recognition according to the application comprises the following steps: training a template which can be used for acquiring a substring meeting preset conditions from the character string of the named entity; acquiring a regular expression of the named entity, and constructing a comprehensive regular expression with the template; and identifying the named entity according to the comprehensive regular expression.
Further, training a template that can be used to obtain a substring that meets a preset condition from a string of named entities includes: determining a service scene; collecting text data according to the service scene; and defining an entity to be extracted, marking the text data and storing according to a standard data format.
Further, training a template that can be used to obtain a substring meeting a preset condition from a character string of a named entity further comprises filtering out complex texts in any one or more of the following ways: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are adjacent; filtering out texts in which an entity category contains more than two entity values; and filtering out texts that contain no entity.
Further, training a template that can be used to obtain a substring meeting a preset condition from a character string of a named entity comprises: acquiring an annotated corpus of a text, and taking the entity values annotated in the corpus as slots; and replacing each slot with the corresponding entity category to generate a new text.
Further, training a template that can be used to obtain a substring meeting a preset condition from a character string of a named entity comprises: a step of taking, as the template, the framework that remains after entity values are generalized into entity categories, the step specifically comprising: replacing the entity categories in the template with the regular expressions corresponding to those entity categories to generate a batch of regular expressions; testing on a training set whether the batch of regular expressions meets a preset condition; and, if the batch of regular expressions tested on the training set meets the preset condition, storing the template corresponding to the regular expressions.
To achieve the above object, according to another aspect of the present application, there is provided a processing apparatus for named entity recognition.
The processing device for named entity recognition according to the application comprises: the training module is used for training a template which can be used for acquiring a substring meeting preset conditions from the character string of the named entity; the construction module is used for acquiring a regular expression of the named entity and constructing a comprehensive regular expression with the template; and the identification module is used for identifying the named entity according to the comprehensive regular expression.
Further, the training module comprises: a collection and annotation module, the collection and annotation module comprising: a determining unit, configured to determine a service scenario; a collecting unit, configured to collect text data according to the service scenario; and an annotation processing unit, configured to define the entities to be extracted, annotate the text data, and store the text data in a standard data format.
Further, the training module comprises: a filtering module, configured to filter out complex texts in any one or more of the following ways: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are adjacent; filtering out texts in which an entity category contains more than two entity values; and filtering out texts that contain no entity.
Further, the training module comprises: an entity replacement module, the entity replacement module comprising: a processing unit, configured to acquire an annotated corpus of a text and take the entity values annotated in the corpus as slots; and a replacing unit, configured to replace each slot with the corresponding entity category to generate a new text.
Further, the training module comprises: a template generation module, the template generation module comprising: a generating unit, configured to replace the entity categories in the template with the corresponding regular expressions to generate a batch of regular expressions; a testing unit, configured to test on a training set whether the batch of regular expressions meets a preset condition; and a storing unit, configured to store the template corresponding to the regular expressions when the batch of regular expressions tested on the training set meets the preset condition.
In the embodiments of the application, a template is trained that can be used to obtain a substring meeting a preset condition from a character string of a named entity, a regular expression of the named entity itself is acquired, and a comprehensive regular expression is constructed together with the template, so that the named entity is identified according to the comprehensive regular expression. This preserves the quality and efficiency of named entity recognition while reducing a company's labor and time costs, and thereby solves the technical problem of a lack of named entity recognition rules.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is a schematic diagram of a processing method for named entity identification according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a processing method for named entity identification according to a second embodiment of the present application;
FIG. 3 is a schematic diagram of a processing method for named entity identification according to a third embodiment of the present application;
FIG. 4 is a schematic diagram of a processing method for named entity identification according to a fourth embodiment of the present application;
FIG. 5 is a schematic diagram of a processing method for named entity identification according to a fifth embodiment of the present application;
FIG. 6 is a schematic diagram of a processing device for named entity recognition according to a first embodiment of the present application;
FIG. 7 is a schematic diagram of a processing device for named entity recognition according to a second embodiment of the present application;
FIG. 8 is a schematic diagram of a processing device for named entity recognition according to a third embodiment of the present application;
FIG. 9 is a schematic diagram of a processing device for named entity recognition according to a fourth embodiment of the present application;
FIG. 10 is a schematic diagram of a processing device for named entity recognition according to a fifth embodiment of the present application;
FIG. 11 is a schematic diagram of an implementation principle of the present application;
FIGS. 12 to 17 are schematic diagrams of the processing effects in the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the embodiments of the application, a named entity (entity for short) refers to words with specific meanings in a text, mainly including person names, place names, organization names, proper nouns, and words with specific meanings (such as restaurant names, movie names, and the like) that are of interest in a given service scenario.
Named entity recognition in the embodiments of the present application refers to the task or technique of recognizing named entities in text.
In the embodiments of the present application, a regular expression refers to a pattern describing character string matching, which can be used to check whether a string contains a certain substring, to replace a matched substring, or to extract from a string a substring that meets a certain condition.
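The three uses of a regular expression named above can be illustrated with a minimal sketch using Python's standard `re` module. The sample sentence and the time pattern are illustrative assumptions and do not come from the application.

```python
import re

# Illustrative text and pattern (assumptions, not taken from the application).
text = "Meeting at 15:00 in Room 201"
time_pattern = re.compile(r"\d{1,2}:\d{2}")

# 1. Check whether the string contains a matching substring.
contains_time = time_pattern.search(text) is not None

# 2. Replace the matched substring.
masked = time_pattern.sub("<TIME>", text)

# 3. Extract the substrings that satisfy the pattern.
times = time_pattern.findall(text)
```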
Using regular expressions to identify named entities offers high precision and recall, strong extensibility, low memory usage, and fast recognition. However, the requirements of actual service scenarios mean that many regular expressions must be written and maintained, and a single regular expression may be long and complex, so writing and maintaining them entirely by hand is very time-consuming and labor-intensive. Against this background, the application provides a semi-automatic method for generating regular expressions: a manually written regular expression of the entity itself is combined with a template generated automatically by a computer to produce a comprehensive regular expression for named entity recognition, which preserves recognition quality and efficiency while greatly reducing a company's labor and time costs.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in fig. 1, the method includes steps S102 to S106 as follows:
Step S102, training a template that can be used to obtain a substring meeting a preset condition from a character string of a named entity.
Training such a template means training a template that can be combined with regular expressions.
Templates in this application refer to the framework that remains after entity values are generalized into entity categories. The template may serve as a basis for generating rules for named entity recognition.
Step S104, acquiring a regular expression of the named entity itself, and constructing a comprehensive regular expression together with the template.
The acquired regular expression of the named entity itself and the template obtained in the preceding step together form a comprehensive regular expression. The regular expression of the named entity itself can be written manually, while the template can be generated directly by the computer.
Step S106, identifying the named entity according to the comprehensive regular expression.
The named entity can then be identified using the comprehensive regular expression thus constructed.
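Steps S104 and S106 can be sketched as follows, under several assumptions that the application does not specify: slots in a template are written as `<CATEGORY>`, the entity regular expressions in `ENTITY_REGEX` are hypothetical hand-written examples, and the helper names `build_comprehensive_regex` and `recognize` are invented for illustration.

```python
import re

# Hypothetical hand-written regular expressions, one per entity category.
ENTITY_REGEX = {
    "MEETING_TIME": r"\d{1,2}:\d{2}",
    "MEETING_PLACE": r"[\w-]+(?: [\w-]+)* room",
}

# Assumed slot syntax: an upper-case category name in angle brackets.
PLACEHOLDER = re.compile(r"<([A-Z_]+)>")


def build_comprehensive_regex(template: str) -> re.Pattern:
    """Escape the context words and turn each <CATEGORY> slot into a named group."""
    parts, last = [], 0
    for m in PLACEHOLDER.finditer(template):
        parts.append(re.escape(template[last:m.start()]))
        category = m.group(1)
        parts.append(f"(?P<{category}>{ENTITY_REGEX[category]})")
        last = m.end()
    parts.append(re.escape(template[last:]))
    return re.compile("".join(parts))


def recognize(pattern: re.Pattern, text: str) -> dict:
    """Return the entity values captured by the comprehensive regular expression."""
    m = pattern.search(text)
    return m.groupdict() if m else {}


pattern = build_comprehensive_regex(
    "come to the <MEETING_PLACE> at <MEETING_TIME> to attend"
)
entities = recognize(
    pattern,
    "Please come to the second-floor conference room at 15:00 "
    "to attend the annual meeting.",
)
```

In this sketch each category may appear at most once per template, because Python does not allow two named groups with the same name in one pattern.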
From the above description, it can be seen that the following technical effects are achieved by the present application:
In the embodiments of the application, a template is trained that can be used to obtain a substring meeting a preset condition from a character string of a named entity, a regular expression of the named entity itself is acquired, and a comprehensive regular expression is constructed together with the template, so that the named entity is identified according to the comprehensive regular expression. This preserves the quality and efficiency of named entity recognition while reducing a company's labor and time costs, and thereby solves the technical problem of a lack of named entity recognition rules.
According to the embodiment of the present application, as a preferred embodiment in the present application, as shown in fig. 2, training a template that can be used to obtain a substring that meets a preset condition from a string of a named entity includes:
Step S202, determining a service scenario.
Once the service scenario is determined, it is clear which text data needs to be collected.
Step S204, collecting text data according to the service scenario.
For the relevant service scenario, real text data must be collected first.
Step S206, defining the entities to be extracted, annotating the text data, and storing the text data in a standard data format.
The entities to be extracted are defined and then annotated manually.
Specifically, suppose the service scenario is entity extraction from conference-related texts, and the entity categories to be extracted are defined as conference topic, conference time, conference place, and so on. The collected conference-related texts can then be annotated one by one: each text is checked for entities of the categories of interest, and any that are found are annotated. For example, suppose there is a text "Dear employees, please come to the second-floor conference room at 15:00 pm to attend the company annual meeting, and please be on time." It can then be annotated as follows:
the conference time is: "15:00 pm",
the conference place is: "second-floor conference room",
the conference topic is: "company annual meeting".
After the annotation is complete, the model can be trained with the annotated data.
It should be noted that the above steps further include storing the data in a standard data format. Specifically, annotated data may arrive in different original formats, and data preprocessing converts these uniformly into the standard data format to facilitate uniform processing.
According to the embodiment of the present application, as a preferred embodiment in the present application, as shown in fig. 3, training the template that can be used to obtain a substring meeting a preset condition from a character string of a named entity further includes filtering out complex texts in any one or more of the following ways: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are adjacent; filtering out texts in which an entity category contains more than two entity values; and filtering out texts that contain no entity. Specifically, texts that are difficult to process can be filtered out first, so that simpler texts are handled first. Once simple texts are processed well, processing of complex texts can be attempted.
According to the embodiment of the present application, as a preferred embodiment in the present application, as shown in fig. 4, training a template that can be used to obtain a substring that meets a preset condition from a string of a named entity includes:
Step S402, acquiring an annotated corpus of a text, and taking the entity values annotated in the corpus as slots.
According to the annotated corpus, the entity values annotated in the original text are extracted, leaving slots.
Step S404, replacing each slot with the corresponding entity category to generate a new text.
Filling each slot with the corresponding entity category generates the new text.
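A minimal sketch of steps S402 and S404 follows. The `<CATEGORY>` slot notation, the category names, and the helper `replace_entities` are assumptions made for illustration; the application does not fix a concrete notation here.

```python
def replace_entities(text: str, annotations: dict) -> str:
    """Replace each annotated entity value with a slot named after its category.

    `annotations` maps a category name to the list of entity values
    annotated for that category in `text`.
    """
    pairs = [(value, category)
             for category, values in annotations.items()
             for value in values]
    # Replace longer values first so that a value contained in a longer
    # value does not break the longer one apart.
    pairs.sort(key=lambda pair: len(pair[0]), reverse=True)
    for value, category in pairs:
        text = text.replace(value, f"<{category}>")
    return text


original = ("Please come to the second-floor conference room at 15:00 "
            "to attend the company annual meeting.")
annotations = {
    "MEETING_TIME": ["15:00"],
    "MEETING_PLACE": ["second-floor conference room"],
    "MEETING_TOPIC": ["company annual meeting"],
}
new_text = replace_entities(original, annotations)
```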
According to the embodiment of the present application, as a preferred embodiment in the present application, as shown in fig. 5, training a template that can be used to obtain a substring meeting a preset condition from a character string of a named entity includes: a step of taking, as the template, the framework that remains after entity values are generalized into entity categories, the step specifically including:
Step S502, replacing the entity categories in the template with the regular expressions corresponding to those entity categories to generate a batch of regular expressions;
Step S504, testing on a training set whether the batch of regular expressions meets a preset condition;
Step S506, if the batch of regular expressions tested on the training set meets the preset condition, storing the template corresponding to the regular expressions.
Specifically, taking as the template the framework that remains after entity values are generalized into entity categories may proceed as follows: starting from the entity under consideration as the center, the template is expanded step by step toward both sides; each expansion by one word generates a new template, and each new template is tested on the entire training set. If the requirement is met, the template is saved and the expansion stops, and the process then moves on to the next entity and generates templates in the same way.
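The expansion procedure can be sketched as below. Several details are assumptions, since the application leaves them open: the preset condition is modeled as a minimum precision on the training set, each step adds one word on each side of the entity, and the names `template_to_regex`, `meets_condition`, `generate_template`, and `ENTITY_REGEX` are hypothetical.

```python
import re

# Hypothetical hand-written regular expression for one entity category.
ENTITY_REGEX = {"MEETING_TIME": r"\d{1,2}:\d{2}"}


def template_to_regex(tokens, center):
    """Step S502: substitute entity regexes for the category slots."""
    parts = []
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            body = ENTITY_REGEX[tok[1:-1]]
            # Only the entity under consideration is captured.
            parts.append(f"(?P<value>{body})" if i == center else f"(?:{body})")
        else:
            parts.append(re.escape(tok))
    return re.compile(r"\s+".join(parts))


def meets_condition(pattern, training_set, min_precision=0.9):
    """Step S504: assumed preset condition, precision over (text, gold value) pairs."""
    matched = correct = 0
    for text, gold in training_set:
        m = pattern.search(text)
        if m:
            matched += 1
            correct += m.group("value") == gold
    return matched > 0 and correct / matched >= min_precision


def generate_template(tokens, center, training_set):
    """Expand around the entity at index `center`; save the first passing template."""
    left = right = center
    while left > 0 or right < len(tokens) - 1:
        left = max(left - 1, 0)
        right = min(right + 1, len(tokens) - 1)
        candidate = tokens[left:right + 1]
        if meets_condition(template_to_regex(candidate, center - left), training_set):
            return candidate  # step S506: store the template and stop expanding
    return None


TRAINING_SET = [
    ("The meeting at 15:00 in room 3", "15:00"),
    ("Lunch at 12:30 in the cafe, meeting at 16:00 in room 5", "16:00"),
]
tokens = ["meeting", "at", "<MEETING_TIME>", "in", "room"]
template = generate_template(tokens, 2, TRAINING_SET)
```

In this toy data the narrow template `at <MEETING_TIME> in` also captures the lunch time, so it fails the assumed precision threshold and the expansion continues by one more word on each side.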
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
According to an embodiment of the present application, there is also provided an apparatus for implementing the above processing method for named entity identification, configured to generate a rule for named entity identification, as shown in fig. 6, where the apparatus includes: a training module 10, configured to train a template that is used to obtain a sub-string that meets a preset condition from a string of a named entity; the construction module 20 is configured to obtain a regular expression of the named entity itself, and construct a comprehensive regular expression with the template; and the identifying module 30 is used for identifying the named entity according to the comprehensive regular expression.
In the training module 10 of the embodiment of the present application, training the template that can be used to obtain a substring meeting a preset condition from a character string of a named entity means training a template that can be combined with regular expressions.
Templates in this application refer to the framework that remains after entity values are generalized into entity categories. The template may serve as a basis for generating rules for named entity recognition.
In the construction module 20 of the embodiment of the application, the acquired regular expression of the named entity itself and the template obtained by the training module together form a comprehensive regular expression. The regular expression of the named entity itself can be written manually, while the template can be generated directly by the computer.
The identification module 30 of the embodiment of the present application may be configured to identify the named entity using the comprehensive regular expression thus constructed.
According to the embodiment of the present application, as shown in fig. 7, the training module preferably includes: a collection annotation module 101, the collection annotation module comprising: a determining unit 1011 configured to determine a service scenario; a collecting unit 1012, configured to collect text data according to the service scenario; and a label processing unit 1013, configured to define an entity that needs to be extracted, label the text data, and store the text data according to a standard data format.
In the determining unit 1011 of the embodiment of the present application, after determining the service scenario, it is determined which text data needs to be collected.
For the relevant service scenario in the collecting unit 1012 of the embodiment of the present application, it is first required to collect and obtain real text data.
In the annotation processing unit 1013 of the embodiment of the present application, entities that need to be extracted are defined and manually annotated.
Specifically, suppose the service scenario is entity extraction from conference-related texts, and the entity categories to be extracted are defined as conference topic, conference time, conference place, and so on. The collected conference-related texts can then be annotated one by one: each text is checked for entities of the categories of interest, and any that are found are annotated. For example, suppose there is a text "Dear employees, please come to the second-floor conference room at 15:00 pm to attend the company annual meeting, and please be on time." It can then be annotated as follows:
the conference time is: "15:00 pm",
the conference place is: "second-floor conference room",
the conference topic is: "company annual meeting".
After the annotation is complete, the model can be trained with the annotated data.
It should be noted that the above processing further includes storing the data in a standard data format. Specifically, annotated data may arrive in different original formats, and data preprocessing converts these uniformly into the standard data format to facilitate uniform processing.
According to the embodiment of the present application, as shown in fig. 8, the training module preferably includes: a filtering module 104, configured to filter out complex texts in any one or more of the following ways: filtering out texts that begin with an entity; filtering out texts that end with an entity; filtering out texts in which two entities are adjacent; filtering out texts in which an entity category contains more than two entity values; and filtering out texts that contain no entity.
It should be noted that the implementation manner of the embodiment of the present application is not limited to the above processing manner of filtering out the complex text.
According to the embodiment of the present application, as shown in fig. 9, the training module 10 preferably includes: an entity replacement module 102, the entity replacement module 102 comprising: a processing unit 1021, configured to obtain a labeled corpus of a text, and use an entity value labeled in the labeled corpus as a slot; a replacing unit 1022, configured to replace the slot with a corresponding entity category, and generate a new text.
In the processing unit 1021 in the embodiment of the present application, entity values marked in an original text are extracted according to a marked corpus to form a slot.
In the replacing unit 1022 in the embodiment of the present application, a new text may be generated by filling the corresponding entity category.
According to the embodiment of the present application, as shown in fig. 10, the training module 10 preferably includes: a template generation module 103, wherein the template generation module 103 comprises: a generating unit 1031, configured to replace the entity category in the template with a regular expression corresponding thereto, and generate a batch of regular expressions; a testing unit 1032, configured to test whether the batch of regular expressions meets a preset condition on a training set; a storing unit 1033, configured to store the template corresponding to the regular expression when the batch of the regular expressions tested on the training set meets a preset condition.
Specifically, in the generating unit 1031, the testing unit 1032, and the storing unit 1033 in the embodiment of the present application, taking as the template the framework that remains after entity values are generalized into entity categories may proceed as follows: starting from the entity under consideration as the center, the template is expanded step by step toward both sides; each expansion by one word generates a new template, and each new template is tested on the entire training set. If the requirement is met, the template is saved and the expansion stops, and the process then moves on to the next entity and generates templates in the same way.
Fig. 11 is a schematic diagram illustrating an implementation principle of the present application.
(1) Collecting and annotating data
As shown in fig. 11, for the service scenario of interest, real text data is first collected, the entities of interest to be extracted are then defined, and finally the data is annotated manually. Specifically, suppose entity extraction is to be performed on conference-related texts, and the entity categories to be extracted, such as conference topic, conference time, and conference place, have been defined. The collected conference-related texts can then be annotated one by one: each text is searched manually for entities of interest, and any that are found are annotated.
Suppose there is a text "Dear employees, please come to the second-floor conference room at 15:00 pm to attend the company annual meeting, and please be on time." Then the conference time is "15:00 pm", the conference place is "second-floor conference room", and the conference topic is "company annual meeting". After the annotation is complete, the model can be trained with the annotated data.
(2) Data pre-processing
The data labeled as shown in fig. 11 may have different original formats, and we need to convert these different original formats into a standard data format uniformly, so as to facilitate uniform processing later. Specifically, the standard data format defined in the embodiment of the present application is shown in fig. 12.
As shown in FIG. 13, if there are multiple entity values in an entity category, the multiple entity values are separated by "@ @ @".
Taking the above definition as an example, the specific format is summarized as follows:
- the original text is separated from the entity categories, and different entity categories are separated from each other, by a Tab character;
- each entity category is separated from its entity values by "###";
- multiple entity values are separated by "@ @ @".
In addition, corresponding english names are defined for the entity classes of the respective domains, and the entity class of each domain is defined by an enum class. The entity classes in the standard data format are all represented in english.
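A record in this format could be parsed roughly as follows. This is a sketch under stated assumptions: figs. 12 and 13 are not reproduced here, so the exact layout is inferred from the separator rules above; the value separator is assumed to be the literal string `@@@` (the spaces in "@ @ @" are taken to be a rendering artifact); and the enum members and the function `parse_record` are hypothetical.

```python
from enum import Enum


class MeetingEntity(Enum):
    """Hypothetical English names for the entity categories of the conference domain."""
    MEETING_TIME = "meeting_time"
    MEETING_PLACE = "meeting_place"
    MEETING_TOPIC = "meeting_topic"


FIELD_SEP = "\t"       # between the text and each entity category
CATEGORY_SEP = "###"   # between a category name and its values
VALUE_SEP = "@@@"      # between multiple values (assumed literal form)


def parse_record(line: str):
    """Split one standard-format line into the text and its annotations."""
    text, *fields = line.rstrip("\n").split(FIELD_SEP)
    annotations = {}
    for field in fields:
        name, _, values = field.partition(CATEGORY_SEP)
        # Looking up the enum by name rejects undefined categories with a KeyError.
        annotations[MeetingEntity[name]] = values.split(VALUE_SEP)
    return text, annotations


record = ("Meet at 15:00 or 16:00 in room 3"
          "\tMEETING_TIME###15:00@@@16:00"
          "\tMEETING_PLACE###room 3")
text, annotations = parse_record(record)
```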
(3) Filtering complex text
To simplify the problem, texts that are difficult to handle are filtered out first so that simpler texts can be processed first, as shown in fig. 11. The following texts are mainly filtered out:
-the text starts with an entity;
-the text ends with an entity;
-two entities in the text are close together;
-an entity class in the text contains more than two entity values;
-no entity is contained in the text.
Taking conference recognition as an example, about 3400 of the 4000 originally annotated texts remain after filtering. Once simple texts are handled well, processing of the complex texts can be attempted.
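The five filtering conditions can be sketched as a single predicate. This is an illustrative sketch under stated assumptions, not the patent's implementation: the function name `is_complex` and its parameters are hypothetical, "close together" is interpreted as two entity values separated by nothing but whitespace, each value is located by its first occurrence in the text, and the threshold of two values per category is taken literally from the list above.

```python
def is_complex(text, entities, max_values=2):
    """Return True if the annotated text should be filtered out.

    `entities` maps an entity category name to its list of entity values.
    """
    values = [v for vs in entities.values() for v in vs]
    if not values:
        return True  # no entity is contained in the text
    if any(len(vs) > max_values for vs in entities.values()):
        return True  # a category contains more than `max_values` values
    stripped = text.strip()
    if any(stripped.startswith(v) or stripped.endswith(v) for v in values):
        return True  # the text starts or ends with an entity
    # Character spans of each value (first occurrence only), sorted by start.
    spans = sorted((text.find(v), text.find(v) + len(v)) for v in values if v in text)
    for (_, end), (start, _) in zip(spans, spans[1:]):
        if text[end:start].strip() == "":
            return True  # two entities are adjacent (or overlap)
    return False
```

Texts for which `is_complex` returns True would be set aside, and the remaining texts passed on to entity replacement.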
(4) Entity replacement
As shown in fig. 11, according to the labeled corpus, each entity value labeled in the original text is taken out to leave a slot, and the slot is then filled with the corresponding entity category to generate a new text. For example, for the labeled corpus shown in fig. 14, the newly generated text after replacement is shown in fig. 15.
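A minimal sketch of the replacement step follows. Since figs. 14 and 15 are not reproduced here, the example sentence is invented for illustration, and the function name `replace_entities` is hypothetical. Longer values are replaced first so that a value contained in another value does not break the longer one.

```python
def replace_entities(text, entities):
    """Replace every labeled entity value in `text` with its category name.

    `entities` maps an entity category name to its list of entity values.
    """
    pairs = [(value, category) for category, values in entities.items() for value in values]
    # Replace longer values first to avoid partial replacements.
    for value, category in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(value, category)
    return text
```

Under these assumptions, "please meet at 3 pm in room A today" labeled with TIME "3 pm" and LOCATION "room A" becomes "please meet at TIME in LOCATION today".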
(5) Word segmentation
As a further optimization, word segmentation is performed on the newly generated text; the result is shown in fig. 16.
(6) Manually writing regular expressions of entities themselves
The regular expression of the entity itself is a regular expression that exactly matches a given entity category. For example, for the meeting time, one regular expression in its set is "[ 0-9] {4} year (.
It should be noted that a manually written regular expression of the entity itself only needs to exactly match a single entity category. By comparison, the workload and the time required are both small.
After the regular expressions of all entities have been written, they can be combined with the labeled data to automatically generate a complete regular expression, i.e. a comprehensive regular expression that contains both the context words around the entity and the regular expression of the entity itself.
(7) Generating comprehensive regular expressions
A template is the framework left after entity values are generalized to entity categories. For example, if the original text is "meeting in people's hall", the corresponding entity category is the conference place (LOCATION in English), and the corresponding entity value is "people's hall", then generalizing the entity yields "meeting in LOCATION", which is a template.
For another example, "(hoof flower soup)" is called the regular expression of the entity itself, "go (hoof flower soup)" is called the comprehensive regular expression, and "go" followed by the entity category is called the template.
After the template is obtained, LOCATION is replaced with the regular expression of LOCATION itself to obtain the final comprehensive regular expression. For example, if the regular expression of LOCATION itself is "(.+conference room)", the comprehensive regular expression generated by combining it with the template is "meeting in (.+conference room)". A new conference text can be matched against this comprehensive regular expression, and if it matches, the conference place entity in the text can be extracted.
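The combination of a template with the entity's own regular expression can be sketched as follows. This is an illustration only: the function name `build_comprehensive_regex` is hypothetical, the template is assumed to be whitespace-tokenized, each category is assumed to occur at most once per template, and wrapping the entity regex in a named group is a convenience added here so that the extracted value can be read back by category name.

```python
import re


def build_comprehensive_regex(template, entity_regexes):
    """Turn a template such as "meeting in LOCATION" into a compiled pattern.

    `entity_regexes` maps a category name to the regular expression of the
    entity itself. Context words are escaped and matched literally.
    """
    parts = []
    for token in template.split():
        if token in entity_regexes:
            # Named group, so the value can be retrieved by category name.
            parts.append(f"(?P<{token}>{entity_regexes[token]})")
        else:
            parts.append(re.escape(token))
    # Allow optional whitespace between consecutive template words.
    return re.compile(r"\s*".join(parts))


pattern = build_comprehensive_regex("meeting in LOCATION", {"LOCATION": "(.+conference room)"})
match = pattern.search("The meeting in the second-floor conference room starts soon")
location = match.group("LOCATION") if match else None
```

With the example values from the paragraph above, `location` holds the text captured as the conference place; a text that does not contain the context words "meeting in" yields no match.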
The method for automatically generating the template is as follows:
First, taking the entity under consideration as the center, the template is expanded step by step towards both sides, one word at a time, each expansion producing a new candidate template. Each candidate template is tested on the entire training set as soon as it is produced, and if it meets the requirement, for example a precision P of 100%, it is saved. The expansion then stops (break) and the procedure moves on to the next entity, generating templates in the same way.
Continuing the example above, assume the entity under consideration is TIME. Expanding step by step towards both sides with TIME as the center generates the candidate templates shown in fig. 17. The degree of imbalance between the word counts on the left and right sides of the entity and the context range are both adjustable; here the degree of imbalance is 1 and the context range is 3.
When a template is tested on the training set, the entity category in it is replaced with the corresponding batch of regular expressions of the entity itself to generate a batch of comprehensive regular expressions, which are then tested. As soon as one of them meets the requirement, the corresponding template is saved, the expansion is interrupted (break), and the procedure moves on to the next entity. This continues until the entire training set has been traversed, which finally yields a batch of templates. Combining these templates with the regular expressions of the entities themselves then gives the desired comprehensive regular expressions.
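The expand-and-test loop can be sketched as below. This is a simplified illustration under several assumptions that are not in the patent text: all function names are hypothetical, words are split on whitespace, candidates are enumerated smallest window first (the exact order of fig. 17 is not available), precision is computed as correct extractions over all extractions, and a candidate with no match at all is rejected. A window that contains the placeholder of another category is treated as literal text and therefore never matches.

```python
import re


def candidate_templates(tokens, idx, context=3, imbalance=1):
    """Yield word windows around tokens[idx], smallest first."""
    for total in range(1, 2 * context + 1):
        for left in range(total + 1):
            right = total - left
            if left > context or right > context or abs(left - right) > imbalance:
                continue
            if idx - left < 0 or idx + right >= len(tokens):
                continue
            yield tokens[idx - left: idx + right + 1]


def compile_template(window, category, entity_regex):
    """Substitute the entity's own regex for its category placeholder."""
    parts = [f"(?P<value>{entity_regex})" if t == category else re.escape(t) for t in window]
    return re.compile(r"\s*".join(parts))


def precision(pattern, training_set, category):
    """Share of extractions that equal a labeled value of `category`."""
    hits = correct = 0
    for text, entities in training_set:
        m = pattern.search(text)
        if m:
            hits += 1
            correct += m.group("value") in entities.get(category, [])
    return correct / hits if hits else 0.0


def generate_templates(training_set, entity_regexes, context=3, imbalance=1):
    """training_set: list of (text, {category: [values]}); returns saved templates."""
    saved = []
    for text, entities in training_set:
        replaced = text
        for category, values in entities.items():
            for value in values:
                replaced = replaced.replace(value, category)
        tokens = replaced.split()
        for idx, token in enumerate(tokens):
            if token not in entity_regexes:
                continue
            for window in candidate_templates(tokens, idx, context, imbalance):
                # One regex of the batch reaching P = 100% is enough.
                if any(precision(compile_template(window, token, rx), training_set, token) == 1.0
                       for rx in entity_regexes[token]):
                    template = " ".join(window)
                    if template not in saved:
                        saved.append(template)
                    break  # stop expanding, move on to the next entity
    return saved
```

The `context` and `imbalance` parameters correspond to the context range (3) and degree of imbalance (1) mentioned above; raising the precision requirement or lowering it below 100% would only change the comparison inside `generate_templates`.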
It should be noted that the regular expression of the entity itself needs to be written manually. That is to say, the finally generated rule includes two parts: one part is the template and the other part is the regular expression of the entity itself, and the two parts are combined into a comprehensive regular expression that can actually be used. The regular expressions of the entities themselves need to be written manually, whereas the templates are automatically generated by the computer using the method provided in the embodiment of the application.
It will be apparent to those skilled in the art that the modules or steps of the present application described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (6)

1. A processing method for named entity recognition, characterized in that, the method comprises:
training a template which can be used for acquiring a substring meeting preset conditions from the character string of the named entity;
acquiring a regular expression of the named entity, and constructing a comprehensive regular expression with the template;
identifying a named entity according to the comprehensive regular expression;
the automatic generation process of the template comprises the following steps: gradually expanding towards two sides by taking the considered entity as a center, generating a new template by expanding a word each time, and testing on all training sets when a template is expanded; if the requirements are met, the template is saved, the expansion is then interrupted, and the process jumps to the next entity to continue generating templates in the same way;
training a template that can be used to obtain substrings that meet a preset condition from strings of named entities includes:
determining a service scene;
collecting text data according to the service scene;
defining entities needing to be extracted, marking the text data and storing the text data according to a standard data format;
training a template that can be used to obtain substrings that meet a preset condition from strings of named entities includes:
acquiring a labeled corpus of a text, and taking an entity value labeled in the labeled corpus as a slot;
and replacing the slot with the corresponding entity category to generate a new text.
2. The processing method of claim 1, wherein training a template that can be used to obtain substrings that meet a preset condition from strings of named entities further comprises any one or more of the following processing modes for filtering out complex texts:
filtering out text that begins with an entity;
filtering out text that ends with an entity;
filtering out text in which two entities are adjacent;
filtering out text in which a certain entity category contains more than two entity values;
filtering out text that does not contain any entity.
3. The processing method of claim 1, wherein training a template that can be used to obtain substrings that meet a preset condition from strings of named entities comprises: a step of taking the frame left after generalizing the entity values into entity categories as a template, the step specifically including:
replacing the entity category in the template with a regular expression corresponding to the entity category to generate a batch of regular expressions;
testing whether the batch of regular expressions meet preset conditions on a training set;
and if the batch of regular expressions tested on the training set meets the preset conditions, storing the template corresponding to the regular expressions.
4. A processing apparatus for named entity recognition, used for generating rules for named entity recognition, the apparatus comprising:
the training module is used for training a template which can be used for acquiring a substring meeting preset conditions from the character string of the named entity;
the construction module is used for acquiring a regular expression of the named entity and constructing a comprehensive regular expression with the template;
the identification module is used for identifying the named entity according to the comprehensive regular expression;
the automatic generation process of the template comprises the following steps: gradually expanding towards two sides by taking the considered entity as a center, generating a new template by expanding a word each time, and testing on all training sets when a template is expanded; if the requirements are met, the template is saved, the expansion is then interrupted, and the process jumps to the next entity to continue generating templates in the same way;
the training module comprises: a collection and labeling module, the collection and labeling module comprising:
a determining unit, configured to determine a service scenario;
the collecting unit is used for collecting text data according to the service scene;
the label processing unit is used for defining an entity to be extracted, labeling the text data and storing the text data according to a standard data format;
the training module comprises: an entity replacement module, the entity replacement module comprising:
the processing unit is used for acquiring the labeled corpus of the text and taking the entity value labeled in the labeled corpus as a slot;
and the replacing unit is used for replacing the slot with the corresponding entity category to generate a new text.
5. The processing apparatus as in claim 4, wherein the training module comprises: a filtering module for performing any one or more of the following processes of filtering out complex text:
filtering out text that begins with an entity;
filtering out text that ends with an entity;
filtering out text in which two entities are adjacent;
filtering out text in which a certain entity category contains more than two entity values;
filtering out text that does not contain any entity.
6. The processing apparatus as in claim 4, wherein the training module comprises: a template generation module, the template generation module comprising:
the generating unit is used for replacing the entity category in the template with the corresponding regular expression to generate a batch of regular expressions;
the test unit is used for testing whether the batch of regular expressions meet preset conditions on a training set;
and the storage unit is used for storing the template corresponding to the regular expressions when the batch of regular expressions tested on the training set meets the preset condition.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811644812.4A CN109740159B (en) 2018-12-29 2018-12-29 Processing method and device for named entity recognition

Publications (2)

Publication Number Publication Date
CN109740159A (en) 2019-05-10
CN109740159B (en) 2022-04-26

Family

ID=66362654

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder
Address after: East of 1st floor, No.36 Haidian Street, Haidian District, Beijing, 100080
Patentee after: Beijing Teddy Future Technology Co.,Ltd.
Address before: East of 1st floor, No.36 Haidian Street, Haidian District, Beijing, 100080
Patentee before: Beijing Teddy Bear Mobile Technology Co.,Ltd.
CP03 Change of name, title or address
Address after: East of 1st floor, No.36 Haidian Street, Haidian District, Beijing, 100080
Patentee after: Beijing Teddy Bear Mobile Technology Co.,Ltd.
Address before: 100085 07a36, block D, 7 / F, No.28, information road, Haidian District, Beijing
Patentee before: BEIJING TEDDY BEAR MOBILE TECHNOLOGY Co.,Ltd.