CN109543153B - Sequence labeling system and method - Google Patents
- Publication number
- CN109543153B (application CN201811344499.2A)
- Authority
- CN
- China
- Prior art keywords
- strategy
- module
- labeling
- model
- elements
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
Abstract
The application relates to a sequence labeling system comprising a model labeling module, an adjustment module and a strategy library, wherein the output end of the model labeling module is connected to the input end of the adjustment module. The model labeling module performs sequence labeling on the input text data. The strategy library stores one or more strategies, and the adjustment module retrieves strategies from the strategy library and adjusts the labeling results output by the model labeling module according to those strategies and the input text data. The system and method of the application can label text sequences while improving the accuracy and applicability of the original model's labels.
Description
Technical Field
The application relates to the technical field of natural language processing, in particular to a sequence labeling system and a sequence labeling method.
Background
Most of the knowledge and information of human society is recorded in the form of text. Because this knowledge is described in human language, a machine cannot use it directly. Natural language processing is the family of algorithmic techniques for processing human natural-language text, in which word segmentation (Word Segmentation), part-of-speech tagging (POS Tagging) and named entity recognition (Named Entity Recognition) are fundamental tasks. Word segmentation divides a sentence, given as a character sequence, into a word sequence; part-of-speech tagging assigns a part of speech to each word, such as noun, verb or adjective; named entity recognition extracts nouns of particular types from the text, such as "小明" ("Xiaoming", type: person name) or "今天早上" ("this morning", type: time). Word segmentation, part-of-speech tagging and named entity recognition can all be cast as sequence labeling (Sequence Labeling) problems.
As shown in fig. 1, the sequence labeling problem is mostly solved with a "model + CRF" architecture: a model first produces the sequence labels, and a CRF probability model then corrects them. For example, the Chinese patent application No. 201710828497.X, entitled "Text sequence labeling system and method based on Bi-LSTM and CRF", labels sequences with a Bi-LSTM model followed by a CRF model. This prior art is a supervised machine learning approach: the model is trained on a large labeled corpus, and the trained model then performs the sequence labeling task on new (unlabeled) data. However, the new data may differ substantially from the training data: proper nouns may appear that the training data does not contain (for example an unusual person name), or the training data may be incomplete or unevenly distributed. As a result the trained model may mishandle some text, and relabeling the data to retrain it is time-consuming and laborious.
Disclosure of Invention
The application aims to overcome the defects in the prior art and provide a sequence labeling system and a sequence labeling method so as to improve the accuracy of sequence labeling.
In order to achieve the above object, the embodiment of the present application provides the following technical solutions:
the sequence labeling system comprises a model labeling module, an adjusting module and a strategy library, wherein the output end of the model labeling module is connected with the input end of the adjusting module;
the model labeling module is used for carrying out sequence labeling on the input text data;
the strategy library stores one or more strategies, and the adjustment module is used for retrieving strategies from the strategy library and adjusting the labeling results output by the model labeling module according to the strategies and the input text data.
According to the embodiment of the application, each strategy comprises three elements of words, boundaries and scores, and the adjustment module is specifically used for:
retrieving strategies from the strategy library one at a time, invoking the next strategy after the current one has been executed, until all strategies have been traversed;
for the current strategy, matching its word element against the input text data, and invoking the next strategy if the match fails; if the match succeeds, obtaining the sequence items and scores to be adjusted from the boundary and score elements, and adjusting the scores of the corresponding sequence items in the labeling results output by the model labeling module.
On the other hand, the embodiment of the application also provides a sequence labeling method, which comprises the following steps:
step 1, performing preliminary sequence labeling on input text data;
step 2, invoking strategies from a strategy library, and adjusting the preliminary labeling result according to the strategies and the input text data.
According to an embodiment of the present application, the step 2 specifically includes the following steps:
step 21, a strategy is called from a strategy library;
step 22, matching the word element of the current strategy against the input text data; if the match fails, returning to step 21; if the match succeeds, proceeding to step 23;
step 23, obtaining sequence items and scores to be adjusted according to boundary elements and score elements in the current strategy, and adjusting scores of corresponding sequence items in the labeling results output by the model labeling module;
step 24, determining whether all the strategies in the strategy library have been executed; if not, returning to step 21; if so, ending.
In yet another aspect, embodiments of the present application also provide a computer-readable storage medium comprising computer-readable instructions that, when executed, cause a processor to perform operations in the methods described in embodiments of the present application.
In still another aspect, an embodiment of the present application also provides an electronic device, including: a memory storing program instructions; and the processor is connected with the memory and executes program instructions in the memory to realize the steps in the method in the embodiment of the application.
Compared with the prior art, adding the adjustment module to the sequence labeling system solves the problem that certain texts could not previously be identified correctly, without harming the original recognition capability. Even in a domain whose text differs considerably from the training corpus, adding adjustment strategies can resolve most cases of unrecognized special entities, which greatly increases the reuse rate of the original model and effectively improves production efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a prior art sequence labeling method.
FIG. 2 is a flow chart of one example of a process employing the method shown in FIG. 1.
Fig. 3 is a schematic block diagram of a sequence labeling system described in an embodiment.
FIG. 4 is a flowchart illustrating the operation of the adjustment module according to an embodiment.
FIG. 5 is a flow chart of one example of a process employing the system shown in FIG. 3 in an embodiment.
Fig. 6 is a block diagram of an electronic device according to an embodiment.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
In order to facilitate understanding of the sequence marking system of the present application, a brief description of a sequence marking method in the prior art will be given here.
As shown in fig. 1, sequence labeling achieves chunking and classification by assigning a label to each unit (character or word). Take the text "小明今天早上迟到了。" ("Xiaoming was late this morning."). Tagged character by character it becomes "小(B) 明(I) 今(B) 天(I) 早(I) 上(I) 迟(B) 到(I) 了(B) 。(B)", where B marks the beginning (Begin) of a word and I marks the inside (Inside); B thus acts as a boundary, and by locating the B and I tags the words can be extracted: "小明", "今天早上", "迟到", "了", "。". For part-of-speech tagging the sequence becomes "小(B-NR) 明(I-NR) 今(B-T) 天(I-T) 早(I-T) 上(I-T) 迟(B-VI) 到(I-VI) 了(B-Y) 。(B-WJ)", where the B in B-NR marks a boundary and NR gives the type, here a person name; this both delimits words and identifies their types, i.e. parts of speech. For entity recognition the sequence becomes "小(B-Person) 明(I-Person) 今(B-Time) 天(I-Time) 早(I-Time) 上(I-Time) 迟(O) 到(O) 了(O) 。(O)", similar to the part-of-speech notation except for the extra label O (Outside), which marks categories the task does not care about. By processing the tags, the entities can be extracted: "小明" (type: Person), "今天早上" (type: Time).
It should be noted that marking boundaries with B, I and O is not the only scheme; many others exist. For example, under the BIESO scheme, with B (Begin), I (Inside), E (End), S (Single) and O (Outside), the same sentence "小明今天早上迟到了。" can be labeled "小(B-Person) 明(E-Person) 今(B-Time) 天(I-Time) 早(I-Time) 上(E-Time) 迟(O) 到(O) 了(O) 。(O)".
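The extraction of words and entities from boundary tags described above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the patent; the function name and the representation of tags as strings are assumptions.

```python
# Recover (text, type) entity spans from a BIO-tagged character sequence.
def extract_entities(chars, tags):
    entities, current, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B"):          # a boundary starts a new span
            if current:
                entities.append(("".join(current), etype))
            current = [ch]
            etype = tag[2:] if "-" in tag else None
        elif tag.startswith("I") and current:
            current.append(ch)           # continue the current span
        else:                            # "O": outside any entity
            if current:
                entities.append(("".join(current), etype))
            current, etype = [], None
    if current:
        entities.append(("".join(current), etype))
    return entities

chars = list("小明今天早上迟到了。")
tags = ["B-Person", "I-Person", "B-Time", "I-Time", "I-Time",
        "I-Time", "O", "O", "O", "O"]
print(extract_entities(chars, tags))
# [('小明', 'Person'), ('今天早上', 'Time')]
```

The same loop works for plain word segmentation by using bare "B"/"I" tags, in which case the type field stays `None`.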
The most popular algorithm for the sequence labeling task is "model+crf":
1) The model part. Bi-LSTM (bidirectional LSTM), as in the Bi-LSTM-CRF model, is a deep learning model whose task is to assign each unit (here, each character) a score for every category, which amounts to performing a classification task for each character. As shown in fig. 2, the higher the score, the higher the probability that the character belongs to that category. Bi-LSTM may be replaced with other models such as Bi-GRU, multilayer CNN, multilayer Bi-LSTM, etc.
2) The CRF part (optional). Specifically a linear-chain CRF (Conditional Random Field), a probability model whose main purpose is to optimize the relations among labels and find the label sequence with the maximum probability (generally decoded with the Viterbi algorithm). For example, a B-Person tag cannot be followed by an I-Time tag, while the probability of B-Person being followed by I-Person is higher. Optimization by the CRF layer improves the sequence labeling precision, as shown in fig. 2.
Referring to fig. 3, the sequence labeling method or system provided in this embodiment adds an adjustment module or step between the model and the CRF based on the existing model+crf method. Specifically, the sequence labeling system provided in this embodiment includes a model labeling module, an adjusting module, a policy repository, and a CRF module, where an output end of the model labeling module is connected to an input end of the adjusting module, and an output end of the adjusting module is connected to an input end of the CRF module.
The policy library stores a plurality of policies, each policy consisting of three or four elements. For example, in this embodiment, each policy is composed of four elements, where the four elements are:
1) regex, used to match text. The simplest form is a word list; for example, "小明" will match every occurrence of "小明" in the text;
2) pattern, for designating the label of the score to be adjusted, e.g. "Person" indicates adjusting the score of the label associated with Person; if the task does not require category information (e.g., word segmentation), the element may be omitted;
3) bounds, which expresses the left and right boundaries (2 marks). As an example, the characters "!", "+" and "?" can be used, where "!" marks a definite boundary, "+" marks a non-boundary, and "?" means it is undetermined whether there is a boundary. For example "+!" means the left side is not a boundary and the right side is a boundary;
4) confidence, which specifies the size of the score to be adjusted.
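The four elements above can be pictured as a small data structure. This is a sketch under the assumption that each policy is stored as one record; the class and field names are hypothetical, not the patent's code.

```python
# One policy record with the four elements described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Policy:
    regex: str               # text or regular expression to match
    pattern: Optional[str]   # tag type to adjust, e.g. "Person"; None if unused
    bounds: str              # two marks from "!", "+", "?": left and right boundary
    confidence: float        # score delta to apply; may be negative

# "小明 Person !! 5": raise Person scores for "小明", both sides boundaries.
p = Policy(regex="小明", pattern="Person", bounds="!!", confidence=5.0)
print(p)
```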
For example, the policy "小明 Person !! 5" means: adjust the scores of the Person type for the text "小明", adding 5 to each; which tags are adjusted follows the bounds conventions described above.
For flexibility, the above strategy can be extended in many ways, for example:
1) regex: regular expressions can be used, which allows more flexible text matching. For example, "张三丰?" will match "张三" or "张三丰", preferring the longer string;
2) Pattern, tags may be combined, such as by "person|company" for name or Company type; special symbols may also be used to represent special types, such as by "+" for all entity types;
3) bounds: the left and right boundaries can each be represented by their own element; for example, "BI" on the left means adjusting the scores of both the "B" and "I" labels;
4) confidence: negative numbers can be supported, which subtract from the scores. The larger the value, the more likely the term is recognized; the smaller (more negative) the value, the more likely the term is not recognized.
The content of a policy depends on its purpose. For example, to raise the probability that a sequence ending in "有限公司" ("limited company") is identified as a Company, the policy can be written "有限公司 Company +! 5". To raise the probability that "张三" is identified as a Person, the policy can be written "张三 Person !! 10". The policies in the strategy library can be modified and dynamically adjusted for different application scenarios.
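Assuming policies are written as whitespace-separated strings of the four elements, a hypothetical parser might look as follows. The on-disk format is not specified by the patent, so both the format and the function are assumptions for illustration.

```python
# Parse "<regex> <pattern> <bounds> <confidence>" into a policy record.
def parse_policy(line):
    regex, pattern, bounds, confidence = line.split()
    return {"regex": regex, "pattern": pattern,
            "bounds": bounds, "confidence": float(confidence)}

print(parse_policy("小明 Person !! 5"))
# {'regex': '小明', 'pattern': 'Person', 'bounds': '!!', 'confidence': 5.0}
print(parse_policy("有限公司 Company +! 5"))
```

A real strategy library would read one such line per policy from a file or database and hand the records to the adjustment module.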
The adjustment module can retrieve each policy from the strategy library. Referring to fig. 3, for each piece of text data input to the adjustment module, the module retrieves one policy at a time from the library, executing it and then retrieving the next until all policies in the library have been traversed. For the current policy, its regex (word element) is matched against the input text; if the match fails, the next policy is retrieved; if it succeeds, the data items and scores to be adjusted are derived from the current policy, and the scores of the corresponding data items in the output of the model labeling module are adjusted. The adjusted scores are then output to the CRF module.
For example, suppose the input text is "小明今天早上迟到了。" and the strategy library contains two policies, "小明 Person !! 5" and "有限公司 Company +! 5". The adjustment module proceeds as follows:
(1) retrieve the first policy, "小明 Person !! 5";
(2) match its regex "小明" against the text; the match succeeds;
(3) derive the data items and scores to be adjusted from the current policy, specifically:
a. the pattern element is Person, indicating that the Person-related scores are to be adjusted;
b. the bounds element is "!!", indicating that both sides are boundaries: the left boundary corresponds to B (for "小") and the right boundary to E (for "明");
c. the confidence element is "5", indicating that each score is increased by 5;
combining a, b and c, the score adjustments obtained from this policy are: "小": B-Person +5, "明": E-Person +5;
(4) adjust the output of the model labeling module with the items and scores obtained in step (3), as shown in fig. 5;
(5) retrieve the next policy, "有限公司 Company +! 5", and match its regex "有限公司" against the text; "有限公司" does not occur, so the match fails;
(6) since all policies in the strategy library have now been traversed, the adjusted result is output to the CRF module, as shown in fig. 5.
The adjustment module traverses all policies in the strategy library; a policy is executed whenever its regex matches the input text, so one or more policies may be executed for the same input text.
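The traversal-and-adjustment behavior described above can be sketched as follows, assuming one {tag: score} dictionary per character and the B/I/E boundary convention of the worked example. This is an illustrative sketch, not the patent's code; in particular the score layout and the handling of bounds are assumptions.

```python
# For each policy whose regex matches the text, bump the matched
# characters' tag scores by the policy's confidence.
import re

def apply_policies(text, scores, policies):
    """scores: list of {tag: score} dicts, one per character (mutated)."""
    for p in policies:
        for m in re.finditer(p["regex"], text):
            start, end = m.start(), m.end() - 1
            for i in range(start, end + 1):
                if i == start and p["bounds"][0] == "!":
                    tag = "B-" + p["pattern"]   # definite left boundary
                elif i == end and p["bounds"][1] == "!":
                    tag = "E-" + p["pattern"]   # definite right boundary
                else:
                    tag = "I-" + p["pattern"]   # interior of the span
                scores[i][tag] = scores[i].get(tag, 0.0) + p["confidence"]
    return scores

text = "小明今天早上迟到了。"
scores = [{} for _ in text]
policies = [
    {"regex": "小明", "pattern": "Person", "bounds": "!!", "confidence": 5.0},
    {"regex": "有限公司", "pattern": "Company", "bounds": "+!", "confidence": 5.0},
]
apply_policies(text, scores, policies)
print(scores[0], scores[1])  # {'B-Person': 5.0} {'E-Person': 5.0}
```

Here the first policy matches and adds 5 to "小"'s B-Person score and "明"'s E-Person score, while the second policy finds no match and leaves the remaining characters untouched, mirroring the worked example.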
The adjustment module implements a gentle, probability-based adjustment, whose advantage is that the sequence as a whole is not disrupted. For example, adjusting the scores for "张三" as "张(B-Person +2) 三(I-Person +2)" does not prevent "张三丰" from being recognized as a person name when it is encountered. The scores adjusted by the module can be decoded by the CRF layer to find the most probable label sequence, or used directly as output without passing through the CRF module.
After training and testing on labeled data, the F1 value on the test set reaches 95%. Adding the adjustment module to the sequence labeling system solves the problem that certain texts could not previously be identified correctly, without harming the original recognition capability. Even in a domain whose text differs considerably from the corpus, adding adjustment strategies can resolve most cases of unrecognized special entities, greatly increasing the reuse rate of the model and effectively improving production efficiency. The model labeling module performs sequence labeling on the input text data; the model may be Bi-LSTM, Bi-GRU, multilayer CNN, multilayer Bi-LSTM, etc. The strategy library stores one or more strategies, and the adjustment module retrieves strategies from the library and adjusts the labeling results output by the model labeling module according to the strategies and the input text data.
As shown in fig. 6, the present embodiment also provides an electronic device that may include a processor 51 and a memory 52, wherein the memory 52 is coupled to the processor 51. It is noted that the figure is exemplary and that other types of structures may be used in addition to or in place of the structure to implement data extraction, report generation, communication, or other functions.
As shown in fig. 6, the electronic device may further include: an input unit 53, a display unit 54, and a power supply 55. It is noted that the electronic device need not necessarily include all of the components shown in fig. 6. In addition, the electronic device may further comprise components not shown in fig. 6, to which reference is made to the prior art.
The processor 51, sometimes also referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, which processor 51 receives inputs and controls the operation of the various components of the electronic device.
The memory 52 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a nonvolatile memory, or other suitable devices, and may store information such as configuration information of the processor 51, instructions executed by the processor 51, and recorded table data. The processor 51 may execute programs stored in the memory 52 to realize information storage or processing, and the like. In one embodiment, a buffer memory, i.e., a buffer, is also included in memory 52 to store intermediate information.
The input unit 53 is for example used for providing the processor 51 with text data to be annotated. The display unit 54 is used for displaying various results in the processing, such as input text data, output results of the adjustment module, output results of the CRF module, etc., and may be, for example, an LCD display, but the present application is not limited thereto. The power supply 55 is used to provide power to the electronic device.
Embodiments of the present application also provide computer readable instructions which, when executed in an electronic device, cause the electronic device to perform the operational steps of the method of the present application.
Embodiments of the present application also provide a storage medium storing computer-readable instructions that cause an electronic device to perform the operational steps involved in the methods of the present application.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Those of ordinary skill in the art will appreciate that the modules of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the constituent modules and steps of the examples have been described generally in terms of functionality in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed system may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (6)
1. The sequence labeling system comprises a model labeling module and is characterized by further comprising an adjusting module and a strategy library, wherein the output end of the model labeling module is connected with the input end of the adjusting module;
the model labeling module is used for carrying out sequence labeling on the input text data;
one or more strategies are stored in the strategy library, each strategy comprising three elements: words, boundaries and scores; the adjustment module is specifically used for: retrieving strategies from the strategy library one at a time, invoking the next strategy after the current one has been executed, until all strategies have been traversed; for the current strategy, matching its word element against the input text data, and invoking the next strategy if the match fails; if the match succeeds, obtaining the data items and scores to be adjusted from the boundary and score elements, and adjusting the scores of the corresponding data items in the labeling results output by the model labeling module.
2. The system of claim 1, further comprising a CRF module, wherein an output of the adjustment module is coupled to an input of the CRF module for optimizing an output of the adjustment module.
3. A method for sequence annotation, comprising the steps of:
step 1, performing preliminary sequence labeling on input text data;
step 2, a strategy is called from a strategy library, and the preliminary labeling result is adjusted according to the strategy and the input text data;
one or more strategies are stored in the strategy library, each strategy comprises three elements of words, boundaries and scores, and the step 2 specifically comprises the following steps:
step 21, a strategy is called from a strategy library;
step 22, matching word elements in the current strategy with the input text data, and returning to the step 21 if the matching is unsuccessful; if the matching is successful, the step 23 is entered;
step 23, obtaining the data items and the scores to be adjusted according to the boundary elements and the score elements in the current strategy, and adjusting the scores of the corresponding data items in the labeling results output by the model labeling module;
step 24, judging whether all strategies in the strategy library are executed, if not, returning to the step 21, and if so, ending.
4. A method according to claim 3, further comprising:
and 3, optimizing the result output in the step 2 through a CRF model.
5. A computer readable storage medium comprising computer readable instructions which, when executed, cause a processor to perform the operations of the method of any of claims 3-4.
6. An electronic device, said device comprising:
a memory storing program instructions;
a processor, coupled to the memory, for executing program instructions in the memory, for implementing the steps of the method of any of claims 3-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811344499.2A CN109543153B (en) | 2018-11-13 | 2018-11-13 | Sequence labeling system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811344499.2A CN109543153B (en) | 2018-11-13 | 2018-11-13 | Sequence labeling system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109543153A CN109543153A (en) | 2019-03-29 |
CN109543153B true CN109543153B (en) | 2023-08-18 |
Family
ID=65846846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811344499.2A Active CN109543153B (en) | 2018-11-13 | 2018-11-13 | Sequence labeling system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109543153B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110489514B (en) * | 2019-07-23 | 2023-05-23 | 成都数联铭品科技有限公司 | System and method for improving event extraction labeling efficiency, event extraction method and system |
CN112765967A (en) * | 2019-11-05 | 2021-05-07 | 北京字节跳动网络技术有限公司 | Text regularization processing method and device, electronic equipment and storage medium |
CN111339250B (en) | 2020-02-20 | 2023-08-18 | 北京百度网讯科技有限公司 | Mining method for new category labels, electronic equipment and computer readable medium |
CN111985583B (en) * | 2020-09-27 | 2021-04-30 | 上海松鼠课堂人工智能科技有限公司 | Deep learning sample labeling method based on learning data |
CN113177124B (en) * | 2021-05-11 | 2023-05-02 | 北京邮电大学 | Method and system for constructing knowledge graph in vertical field |
CN113761044A (en) * | 2021-08-30 | 2021-12-07 | 上海快确信息科技有限公司 | Labeling system method for labeling text into table |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103593373A (en) * | 2012-08-16 | 2014-02-19 | 北京百度网讯科技有限公司 | Search result sorting method and search result sorting device |
CN105678600A (en) * | 2015-12-31 | 2016-06-15 | 百度在线网络技术(北京)有限公司 | Credit data acquisition method and apparatus |
CN106156286A (en) * | 2016-06-24 | 2016-11-23 | 广东工业大学 | Type extraction system and method for technical-literature knowledge entities |
CN106372060A (en) * | 2016-08-31 | 2017-02-01 | 北京百度网讯科技有限公司 | Search text labeling method and device |
CN106844530A (en) * | 2016-12-29 | 2017-06-13 | 北京奇虎科技有限公司 | Training method and device for a question-answer pair classification model |
CN107330011A (en) * | 2017-06-14 | 2017-11-07 | 北京神州泰岳软件股份有限公司 | Named entity recognition method and device based on multi-strategy fusion |
CN107622050A (en) * | 2017-09-14 | 2018-01-23 | 武汉烽火普天信息技术有限公司 | Text sequence labeling system and method based on Bi-LSTM and CRF |
CN107622333A (en) * | 2017-11-02 | 2018-01-23 | 北京百分点信息科技有限公司 | Event prediction method, apparatus and system |
CN108228557A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | Sequence labeling method and device |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7346519B2 (en) * | 2001-04-10 | 2008-03-18 | Metropolitan Regional Information Systems, Inc | Method and system for MRIS platinum database |
US10402453B2 (en) * | 2014-06-27 | 2019-09-03 | Nuance Communications, Inc. | Utilizing large-scale knowledge graphs to support inference at scale and explanation generation |
US9824385B2 (en) * | 2014-12-29 | 2017-11-21 | Ebay Inc. | Method for performing sequence labelling on queries |
- 2018-11-13 CN CN201811344499.2A patent/CN109543153B/en active Active
Non-Patent Citations (1)
Title |
---|
Insurance name entity recognition based on CRF and Bi-LSTM; Chen Yanyu et al.; Intelligent Computer and Applications; 2018-06-26 (No. 03); pp. 111-114 * |
Also Published As
Publication number | Publication date |
---|---|
CN109543153A (en) | 2019-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543153B (en) | Sequence labeling system and method | |
CN109145153B (en) | Intention category identification method and device | |
US10380236B1 (en) | Machine learning system for annotating unstructured text | |
WO2021051560A1 (en) | Text classification method and apparatus, electronic device, and computer non-volatile readable storage medium | |
CN107544726B (en) | Speech recognition result error correction method and device based on artificial intelligence and storage medium | |
US11860684B2 (en) | Few-shot named-entity recognition | |
US9645988B1 (en) | System and method for identifying passages in electronic documents | |
US11657232B2 (en) | Source code compiler using natural language input | |
CN110532563A (en) | The detection method and device of crucial paragraph in text | |
US20190317986A1 (en) | Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method | |
US11934781B2 (en) | Systems and methods for controllable text summarization | |
CN112115721A (en) | Named entity identification method and device | |
US11593557B2 (en) | Domain-specific grammar correction system, server and method for academic text | |
US11657151B2 (en) | System and method for detecting source code anomalies | |
CN109753976B (en) | Corpus labeling device and method | |
US11210473B1 (en) | Domain knowledge learning techniques for natural language generation | |
CN114970529A (en) | Weakly supervised and interpretable training of machine learning based Named Entity Recognition (NER) mechanisms | |
US20220335335A1 (en) | Method and system for identifying mislabeled data samples using adversarial attacks | |
JP7287699B2 (en) | Information provision method and device using learning model through machine learning | |
CN107609006B (en) | Search optimization method based on local log research | |
US11853696B2 (en) | Automated text amendment based on additional domain text and control text | |
CN115618054A (en) | Video recommendation method and device | |
CN110457683B (en) | Model optimization method and device, computer equipment and storage medium | |
US20210350088A1 (en) | Systems and methods for digital document generation using natural language interaction | |
CN113158678A (en) | Identification method and device applied to electric power text named entity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract | ||
Application publication date: 2019-03-29 | Assignee: Shansikaiwu Technology (Chengdu) Co.,Ltd. | Assignor: CHENGDU BUSINESS BIG DATA TECHNOLOGY Co.,Ltd. | Contract record no.: X2023510000034 | Denomination of invention: A sequence annotation system and method | Granted publication date: 2023-08-18 | License type: Common License | Record date: 2023-12-19 |