CN110210035B - Sequence labeling method and device and training method of sequence labeling model - Google Patents

Sequence labeling method and device and training method of sequence labeling model

Info

Publication number
CN110210035B
Authority
CN
China
Prior art keywords
sequence
label
labeling
model
binding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910481021.2A
Other languages
Chinese (zh)
Other versions
CN110210035A (en)
Inventor
李正华
黄德朋
张民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201910481021.2A priority Critical patent/CN110210035B/en
Publication of CN110210035A publication Critical patent/CN110210035A/en
Application granted granted Critical
Publication of CN110210035B publication Critical patent/CN110210035B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The application discloses a sequence labeling method and device, a training method and device for a sequence labeling model, and a computer-readable storage medium. The scoring layers of the sequence labeling model in this scheme comprise second scoring layers that correspond one-to-one to the labeling specifications and a first scoring layer that corresponds to all labeling specifications. In addition, the output of the model is a binding tag sequence, which is equivalent to directly obtaining the tag sequences under the various labeling specifications and makes it convenient to convert text between different labeling specifications.

Description

Sequence labeling method and device and training method of sequence labeling model
Technical Field
The present application relates to the field of natural language processing, and in particular, to a sequence labeling method and apparatus, a training method and device for a sequence labeling model, and a computer-readable storage medium.
Background
In natural language processing tasks, annotated data is often needed as training samples for a natural language processing model, and the scale of the annotated data significantly affects the performance of the model. Because manually annotated data is very expensive to construct, some researchers have proposed schemes that enlarge the data scale by exploiting heterogeneous data resources. However, since heterogeneous data follow different labeling specifications, they cannot be mixed directly. Therefore, how to effectively utilize heterogeneous data to improve model performance has become a research problem.
One current scheme for improving model performance with heterogeneous data works in a way similar to stacked (accumulative) learning: one data resource is used to generate additional features on another data resource. Taking CTB and PKU as an example, model parameters are first trained independently on the CTB corpus, features produced by that model are then added to the PKU corpus, and training continues on the PKU corpus. However, because the two corpora serve different research directions and follow different part-of-speech tagging specifications, this introduces noise into the model, and the purpose of improving performance cannot be achieved.
Therefore, the existing schemes that train a model with data of different labeling specifications suffer from the introduction of noise and cannot achieve the purpose of improving the labeling performance of the model.
Disclosure of Invention
The purpose of the application is to provide a sequence labeling method and apparatus, a training method and device for a sequence labeling model, and a computer-readable storage medium, so as to solve the problem that existing schemes which train a model with data of different labeling specifications cannot improve the labeling performance of the model. The specific scheme is as follows:
in a first aspect, the present application provides a sequence annotation apparatus, including a sequence annotation model, where the sequence annotation model includes:
an input layer: used for obtaining a text to be labeled;
a presentation layer: used for determining the vector representation of each word of the text to be labeled and sending the vector representation to a first scoring layer and a plurality of second scoring layers respectively;
the first scoring layer: used for determining the original score of each binding label in a binding label set according to the vector representation;
the second scoring layers: used for determining, according to the vector representation, the score of each independent label in the label set of the corresponding labeling specification, wherein the second scoring layers correspond one-to-one to the labeling specifications, and a binding label is a label combination comprising one independent label from each labeling specification;
a prediction layer: used for determining the final score of each binding label according to the original score of the binding label and the scores of the independent labels corresponding to the binding label, and determining the target binding label of each word according to the final scores of the binding labels;
an output layer: used for outputting a target label sequence of the text to be labeled, the target label sequence comprising the target binding labels of all words of the text to be labeled.
Preferably, the presentation layer includes:
a first encoding unit: used for determining a first vector of each word of the text to be labeled;
a second encoding unit: used for encoding each character of the word with a first bidirectional recurrent neural network to obtain a second vector of the word;
a representation unit: for determining a vector representation of the word from the first vector and the second vector and sending the vector representation to a first scoring layer and a plurality of second scoring layers, respectively.
Preferably, the representation unit is specifically configured to:
determining a vector representation of the word from the first vector and the second vector; encoding the vector representation of the word by using a second bidirectional cyclic neural network to obtain global information; and sending the global information to a first scoring layer and a plurality of second scoring layers respectively.
Preferably, the prediction layer is specifically configured to:
determining a final score of the binding tag according to the original score of the binding tag and the score of the independent tag corresponding to the binding tag; determining the probability of each binding tag according to the final score of each binding tag by utilizing a softmax function; determining a target binding tag for the word according to the probability of each binding tag.
Preferably, the sequence labeling model further comprises:
a loss layer: used for determining a loss value during training according to a target loss function, a predicted labeling result, and an actual labeling result, and adjusting model parameters to realize the training, wherein the target loss function is:
Loss = -log( Σ_{j=1}^{k} y_j )
where k denotes the number of correct labels and y_j denotes the probability of the j-th correct label output by the prediction layer.
In a second aspect, the present application provides a training method for a sequence annotation model, which is applied to the sequence annotation model of the sequence annotation apparatus described above, and includes:
acquiring training samples with various labeling specifications, wherein the training samples comprise training texts and actual label sequences of the training texts;
inputting the training samples into a sequence labeling model to obtain a predicted tag sequence output by the sequence labeling model;
and adjusting parameters of the sequence labeling model according to the predicted label sequence and the actual label sequence until a preset termination condition is reached so as to realize the training of the sequence labeling model.
In a third aspect, the present application provides a training apparatus for a sequence annotation model, which is applied to the sequence annotation model of the sequence annotation apparatus described above, and includes:
a memory: for storing a computer program;
a processor: for executing the computer program to implement the steps of:
acquiring training samples with various labeling specifications, wherein the training samples comprise training texts and actual label sequences of the training texts; inputting the training samples into a sequence labeling model to obtain a predicted tag sequence output by the sequence labeling model; and adjusting parameters of the sequence labeling model according to the predicted label sequence and the actual label sequence until a preset termination condition is reached so as to realize the training of the sequence labeling model.
In a fourth aspect, the present application provides a computer-readable storage medium for use in a sequence annotation model of a sequence annotation apparatus as described above, the computer-readable storage medium having stored thereon a computer program for implementing, when executed by a processor, the steps of:
acquiring training samples with various labeling specifications, wherein the training samples comprise training texts and actual label sequences of the training texts; inputting the training samples into a sequence labeling model to obtain a predicted tag sequence output by the sequence labeling model; and adjusting parameters of the sequence labeling model according to the predicted label sequence and the actual label sequence until a preset termination condition is reached so as to realize the training of the sequence labeling model.
In a fifth aspect, the present application provides a sequence annotation method, including:
acquiring a text to be marked;
determining vector representation of each word of the text to be labeled;
according to the vector representation, determining the score of each independent label in a label set with various labeling specifications, and determining the original score of each binding label in a binding label set, wherein the binding label is a group of label combinations comprising single independent labels with various labeling specifications;
determining a final score of the binding tag according to the original score of the binding tag and the score of the independent tag corresponding to the binding tag;
and determining the target binding label of the word according to the final score of the binding label to obtain a target label sequence of the text to be labeled.
The application provides a sequence labeling method, a sequence labeling device, a training method and equipment of a sequence labeling model and a computer readable storage medium, wherein the scheme can be used for acquiring a text to be labeled and determining the vector representation of each word of the text to be labeled; according to the vector representation, determining the score of each independent label in a label set with various labeling specifications, and determining the original score of each binding label in a binding label set; then determining the final score of the binding label according to the original score of the binding label and the score of the independent label corresponding to the binding label; and finally, determining a target binding tag of the word according to the final score of the binding tag so as to obtain a target tag sequence of the text to be labeled.
It can be seen that, the scoring layers of the sequence labeling model in the scheme include scoring layers corresponding to the labeling standards one by one, and also include scoring layers corresponding to all the labeling standards, due to the unique design of the scoring layers in the model, heterogeneous data of various labeling standards can be used as a training set of the model, the scale of the training corpus is expanded, and the model can learn commonalities among corpora of different labeling standards, so that the labeling performance of the model under a single labeling standard is improved. In addition, the output result of the model is a binding tag sequence, which is equivalent to directly obtaining tag sequences under various labeling specifications, and the conversion of texts between different labeling specifications is facilitated.
Drawings
In order to clearly illustrate the embodiments or technical solutions of the present application, the drawings used in the embodiments or technical solutions of the present application will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a functional block diagram of a sequence labeling apparatus according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a binding tag in an embodiment of a sequence tagging apparatus provided in the present application;
FIG. 3 is a schematic diagram illustrating vector representations of words in an embodiment of a sequence labeling apparatus provided in the present application;
FIG. 4 is a flowchart illustrating an implementation of an embodiment of a training method for a sequence annotation model provided in the present application;
FIG. 5 is a schematic structural diagram of an embodiment of a training apparatus for a sequence annotation model provided in the present application;
fig. 6 is a flowchart illustrating a sequence tagging method according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings. It should be apparent that the described embodiments are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, in order to improve the performance of a model, the scale of labeled data is often enlarged by using heterogeneous data, however, noise is introduced into the model by using the existing scheme of using heterogeneous data, and the purpose of improving the performance of the model cannot be achieved. In order to solve the problems, the application provides a sequence labeling method, a sequence labeling device, a sequence labeling model training method, sequence labeling model training equipment and a computer readable storage medium.
A first embodiment of a sequence labeling apparatus provided in the present application is described below; the first embodiment includes a sequence labeling model. It should be noted that this embodiment uses a deep neural network as the sequence labeling model, so as to avoid the disadvantages of conventional models based on feature engineering, for example, a complicated feature extraction process and the difficulty of ensuring the rationality of feature templates. As a specific implementation, this embodiment selects a BiLSTM (Bidirectional Long Short-Term Memory) network as the basic model.
Referring to fig. 1, the sequence labeling model specifically includes:
Input layer 101: used for obtaining the text to be labeled;
Presentation layer 102: used for determining the vector representation of each word of the text to be labeled and sending the vector representations to the first scoring layer 103 and the plurality of second scoring layers 104 respectively;
The sequence tagging model in this embodiment may specifically be used to implement processing such as named entity recognition, word segmentation, and part-of-speech tagging of the text to be labeled. Since the focus of this embodiment is the process of assigning labels to each word in the text to be labeled, operations such as named entity recognition and word segmentation are not described in detail here. Before the text to be labeled enters this layer, each word in the text to be labeled needs to be converted into a vector representation. As a specific implementation, in this embodiment the word embedding vector is obtained by pre-training, that is, word vectors trained by other models are loaded directly at initialization to represent the current word; for an unknown word that is not found in the pre-training vocabulary, the word embedding vector may be randomly generated.
First scoring layer 103: used for determining the original score of each binding label in the binding label set according to the vector representation;
Second scoring layers 104: used for determining, according to the vector representation, the score of each independent label in the label set of the corresponding labeling specification, wherein the second scoring layers correspond one-to-one to the labeling specifications, and a binding label is a label combination comprising one independent label from each labeling specification;
the marking specification refers to rules and bases for marking each word of a text sentence, specifically, the words are marked in a label form, and currently known marking specifications include CTB, PKU, MSR, and the like. Taking two marking specifications of CTB and PKU as an example, the method is particularly suitable for the economic uplink of China for the same text. "the sequence labeling result obtained according to the CTB labeling specification is shown in table 1, and the sequence labeling result obtained according to the PKU labeling specification is shown in table 2, so that different sequence labeling results, that is, different tag sequences, can be obtained for the same text according to different labeling specifications. The purpose of this embodiment is to improve the labeling performance of the model by using heterogeneous data, so this embodiment is implemented based on multiple labeling specifications, and more specifically, this embodiment is implemented based on two or more labeling specifications, which labeling specification is specifically selected may be determined according to actual requirements, which is not specifically limited in this embodiment.
TABLE 1
Words in the text (English glosses):  especially / I / country / economy / rise / 。
Labels under the CTB specification:   AD / PN / NN / NN / VV / PU
TABLE 2
Words in the text (English glosses):  especially / is / China / economy / rise / 。
Labels under the PKU specification:   d / v / n / n / v / w
It should be noted that, in this embodiment, labels of multiple labeling specifications are bundled to obtain label combinations that each contain one label from every labeling specification. For convenience of description, this embodiment refers to such a label combination as a bundled (binding) label and to the labels within each labeling specification as independent labels; the process of constructing the bundled labels is shown in Fig. 2. This embodiment models on the enlarged set of bundled labels, mapping the individual independent labels into the set of bundled labels by considering all possible bundled-label combinations.
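Purely as an illustration (the function name and the toy tag subsets below are assumptions, not the patent's data), the bundled label set can be built as the Cartesian product of the independent label sets of the labeling specifications:

from itertools import product

def build_bundled_tag_set(tag_sets):
    """Cartesian product of the independent tag sets of all labeling
    specifications: every combination becomes one bundled (binding) tag."""
    return [tuple(tags) for tags in product(*tag_sets)]

ctb_tags = ["AD", "PN", "NN", "VV", "PU"]   # toy subset of CTB part-of-speech tags
pku_tags = ["d", "v", "n", "w"]             # toy subset of PKU part-of-speech tags
bundled = build_bundled_tag_set([ctb_tags, pku_tags])
print(len(bundled))    # 5 * 4 = 20 bundled tags
print(bundled[:3])     # [('AD', 'd'), ('AD', 'v'), ('AD', 'n')]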
The terms first scoring layer 103 and second scoring layer 104 are only used to distinguish the two kinds of scoring layers and do not indicate number or order. As shown in Fig. 1, the sequence annotation model in this embodiment has N+1 scoring layers in total, including one first scoring layer and a plurality of second scoring layers, where the first scoring layer is used to determine the original score of each binding label in the binding label set according to the vector representation of a word, the second scoring layers correspond one-to-one to the labeling specifications, and each second scoring layer is used to determine the score of each independent label in the label set of its labeling specification according to the vector representation of the word. As a specific embodiment, the first scoring layer and the second scoring layers may be different MLP (Multilayer Perceptron) layers.
Prediction layer 105: determining a final score of the bundled label according to the original score of the bundled label and the scores of the individual labels corresponding to the bundled label; determining a target binding tag of the word according to the final score of each binding tag;
as a specific implementation manner, the embodiment sums the original scores of the binding tags and the scores of the individual tags corresponding to the binding tags, uses the sum result as the final score of the binding tags, and finally determines the target binding tags of the words according to the size relationship of the final scores of the binding tags.
Output layer 106: used for outputting the target label sequence of the text to be labeled, the target label sequence comprising the target binding labels of all words of the text to be labeled.
It should be noted that, due to the unique design of the scoring layer in the sequence labeling model of this embodiment, corpora of various labeling specifications can be selected as the training set of the model, the data scale is extended, and the model can learn commonalities between corpora of different labeling specifications, thereby improving the labeling performance of the model under a single labeling specification. That is to say, the model can implement the labeling mode corresponding to any one of the multiple labeling specifications, and the labeling performance under any one of the multiple labeling specifications is improved. Specifically, in the test process, the sequence labeling model can output a binding tag sequence of a text to be labeled, the binding tag sequence includes tag sequences of the multiple labeling specifications, and the tag sequence under any one of the multiple labeling specifications can be obtained by simply dividing the binding tag sequence.
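A minimal sketch of that division, assuming the bundled labels are stored as tuples with one independent label per specification (function and variable names are illustrative):

def project_bundled_sequence(bundled_sequence, spec_index):
    """Recover the tag sequence of one labeling specification from a
    bundled tag sequence by keeping only the spec_index-th component."""
    return [tag[spec_index] for tag in bundled_sequence]

bundled_sequence = [("AD", "d"), ("PN", "v"), ("NN", "n"), ("VV", "v"), ("PU", "w")]
print(project_bundled_sequence(bundled_sequence, 0))  # tags under the first specification
print(project_bundled_sequence(bundled_sequence, 1))  # tags under the second specification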
This embodiment provides a sequence labeling apparatus including the above sequence labeling model. The scoring layers of the model include scoring layers that correspond one-to-one to the labeling specifications as well as a scoring layer that corresponds to all labeling specifications. Because of this unique design of the scoring layers, heterogeneous data of multiple labeling specifications can be used as the training set of the model, which expands the scale of the training corpus; moreover, the model can learn the commonalities between corpora of different labeling specifications, thereby improving the labeling performance of the model under a single labeling specification. In addition, the output of the model is a binding tag sequence, which is equivalent to directly obtaining the tag sequences under the various labeling specifications and makes it convenient to convert text between different labeling specifications.
The second embodiment of the sequence labeling apparatus provided by the present application is described in detail below, and the second embodiment is implemented based on the first embodiment and is expanded to a certain extent based on the first embodiment.
Specifically, the sequence labeling apparatus provided in the second embodiment includes a sequence labeling model, where the sequence labeling model includes: an input layer, a presentation layer, an encoding layer, a first MLP layer, a plurality of second MLP layers, a prediction layer, an output layer, and a loss layer, each of which is described below:
an input layer: used for obtaining the text to be labeled;
a presentation layer: used for determining the vector representation of each word of the text to be labeled;
In conventional schemes, when a word is converted into a vector representation, the embedding vector of the word is usually used directly as its vector representation. In order to make the vector representation of the word express the text information more fully, as a preferred embodiment and as shown in Fig. 3, this embodiment uses a first vector and a second vector together to obtain the vector representation of the word. Specifically, in the presentation layer, this embodiment first determines the first vector and the second vector respectively. The first vector, namely the word embedding vector, can be obtained by pre-training, and unknown words can be handled by random initialization. For the second vector, the character vector of each character of the word is obtained by random initialization; then, as shown in Fig. 3, all character vectors are input into a one-layer BiLSTM, the last output of each of the two directions is taken, and the outputs of the two directions are concatenated to obtain the second vector. Since the output at the last character has already learned the information of the other characters, using it as the second vector represents the text information more fully. Finally, this embodiment concatenates the first vector and the second vector and uses the concatenated vector as the vector representation of the word.
Specifically, given a text to be labeled S = {w_1, w_2, ..., w_n}, w_i denotes the i-th word in the text and n denotes the number of words in the text to be labeled; for each word, w_i = {c_i_1, c_i_2, ..., c_i_m}, where c_i_j denotes the j-th character of w_i and m denotes the number of characters in the word. All characters of w_i are input into the BiLSTM, the last outputs h_lm and h_rm of the two directions are concatenated behind the word vector corresponding to w_i, and the vector representation X_i of w_i is obtained, which can be expressed as:
X_i = [ emb(w_i) ; h_lm ; h_rm ]    (1)
where emb(w_i) denotes the word embedding of w_i and [ ; ] denotes vector concatenation.
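The following PyTorch sketch illustrates one way to realize formula (1); the module names, dimensions, and the choice of PyTorch are assumptions made for illustration, not the patent's reference implementation. The first vector is a word-embedding lookup and the second vector is the concatenation of the last forward and backward outputs of a character-level BiLSTM:

import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """X_i = [word embedding ; h_lm ; h_rm], cf. formula (1)."""
    def __init__(self, n_words, n_chars, word_dim=100, char_dim=50, char_hidden=50):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)   # loaded from pre-trained vectors in practice
        self.char_emb = nn.Embedding(n_chars, char_dim)   # randomly initialized
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)

    def forward(self, word_id, char_ids):
        char_vecs = self.char_emb(char_ids)               # (1, m, char_dim), m characters in the word
        _, (h_n, _) = self.char_lstm(char_vecs)           # h_n: (2, 1, char_hidden)
        h_lm, h_rm = h_n[0], h_n[1]                       # last forward / last backward output
        w_vec = self.word_emb(word_id)                    # (1, word_dim)
        return torch.cat([w_vec, h_lm, h_rm], dim=-1)     # (1, word_dim + 2 * char_hidden)

# Toy usage: a word with index 3 made of two characters with indices 7 and 11.
repr_layer = WordRepresentation(n_words=1000, n_chars=5000)
x_i = repr_layer(torch.tensor([3]), torch.tensor([[7, 11]]))
print(x_i.shape)                                          # torch.Size([1, 200])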
in summary, the representation layers in this embodiment specifically include:
a first encoding unit: used for determining the first vector of each word of the text to be labeled;
a second encoding unit: used for encoding each character of the word with the first bidirectional recurrent neural network to obtain the second vector of the word;
a representation unit: for determining a vector representation of the word from the first vector and the second vector.
Encoding layer: used for encoding the vector representations of the words with a second bidirectional recurrent neural network to obtain global information, and sending the global information to the first MLP layer and the plurality of second MLP layers respectively;
Specifically, the encoding layer uses a BiLSTM to encode the sentence information. This embodiment takes the output X_i of the presentation layer as the input of the LSTM, and the global information h_i of the word w_i is obtained by encoding the entire sentence sequence with the LSTM. The formulas involved include:
i_i = σ(W_in · [h_{i-1}, x_i] + b_in)    (2)
f_i = σ(W_fg · [h_{i-1}, x_i] + b_fg)    (3)
o_i = σ(W_out · [h_{i-1}, x_i] + b_out)    (4)
c_i = f_i · c_{i-1} + i_i · tanh(W_c · [h_{i-1}, x_i] + b_c)    (5)
h_i = o_i · tanh(c_i)    (6)
where i_i, f_i, o_i and c_i denote the input gate, the forget gate, the output gate and the cell state corresponding to the i-th word respectively, and x_i and h_i denote the input and the hidden-layer output corresponding to the i-th word; σ denotes the sigmoid activation function, and W and b are the weight and bias of the corresponding gate, respectively.
The hidden state of a unidirectional LSTM only contains information obtained from the past and never considers the future. In order to encode the sentence information in both directions, this embodiment concatenates the hidden-layer outputs of the forward and backward LSTMs to obtain the BiLSTM hidden state h_i of the word w_i:
h_i = [ →h_i ; ←h_i ]
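A short PyTorch sketch of this encoding layer (sizes illustrative): a bidirectional LSTM over the word representations X_1, ..., X_n produces, for each word, the concatenation of forward and backward hidden states h_i described above:

import torch
import torch.nn as nn

input_dim, hidden_dim, n_words = 200, 150, 6          # illustrative sizes
sentence = torch.randn(1, n_words, input_dim)         # X_1 ... X_n from the representation layer

encoder = nn.LSTM(input_dim, hidden_dim, bidirectional=True, batch_first=True)
h, _ = encoder(sentence)                              # (1, n_words, 2 * hidden_dim)
# h[0, i] is h_i: the forward and backward hidden states of word w_i, concatenated.
print(h.shape)                                        # torch.Size([1, 6, 300])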
The first MLP layer: used for determining the original score of each binding label in the binding label set according to the vector representation;
The second MLP layers: used for determining the score of each independent label in the label set of the corresponding labeling specification according to the vector representation;
Specifically, in this embodiment the scoring layers use MLPs to calculate the score of each label, and the sequence labeling model has N+1 MLP layers in total, namely: N second MLP layers used to determine the scores of the independent labels of the N labeling specifications respectively, and one first MLP layer used to determine the score of each binding label. Specifically, the BiLSTM output h_i is taken as the input of the MLP to obtain the score P_i of each label for each word in the sentence:
P_i = W_mlp · h_i + b_mlp
where W_mlp and b_mlp denote the weight and bias of the MLP layer, respectively.
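The N+1 scoring layers can be sketched as one linear (MLP) layer over the binding label set plus one linear layer per labeling specification, all reading the same hidden state h_i; the tag-set sizes and names below are placeholders, not figures from the patent:

import torch
import torch.nn as nn

hidden_dim = 300                       # size of the BiLSTM hidden state h_i
spec_sizes = [33, 26]                  # |T_1|, |T_2|: illustrative independent tag-set sizes
n_bundled = spec_sizes[0] * spec_sizes[1]

joint_mlp = nn.Linear(hidden_dim, n_bundled)                            # first MLP layer (binding labels)
sep_mlps = nn.ModuleList([nn.Linear(hidden_dim, k) for k in spec_sizes])  # second MLP layers

h_i = torch.randn(1, hidden_dim)
joint_scores = joint_mlp(h_i)                                           # (1, 858)
sep_scores = [mlp(h_i) for mlp in sep_mlps]                             # [(1, 33), (1, 26)]
print(joint_scores.shape, [s.shape for s in sep_scores])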
Prediction layer: determining a final score of the binding tag according to the original score of the binding tag and the scores of the independent tags corresponding to the binding tag; determining a target binding tag of the word according to the final score of each binding tag;
Specifically, according to the coupling mapping relationship between a binding tag and its independent tags, the original score of the binding tag and the scores of the N independent tags corresponding to the binding tag are added to obtain the final score of the binding tag. Taking N = 2 as an example, the score of labeling the i-th word of a sentence S with the binding tag [t_a, t_b] is:
Score(S, i, [t_a, t_b]) = Score_joint(S, i, [t_a, t_b]) + Score_sep_a(S, i, [t_a, t_b]) + Score_sep_b(S, i, [t_a, t_b])
wherein, score joint (s,i,[t a ,t b ]) Indicating that the ith word in the sentence S is labeled as a joint tag [ t ] a ,t b ]Raw Score of, score sep_a (s,i,[t a ,t b ]) The independent tag t in the tag set indicating that the ith word in the sentence S is marked as the first marking specification a Score of (1), score sep_b (s,i,[t a ,t b ]) Independent tags t in the set of tags representing the second annotation specification b Is scored.
After the final score of the binding tag is obtained, as a specific implementation manner, in this embodiment, a Softmax function is used to normalize the scores of all the binding tags obtained through calculation, so as to obtain the probability of each binding tag, and predict the target binding tag of each word according to the probability:
p_i = exp(Score_i) / Σ_{j=1}^{n} exp(Score_j)
where p_i is the normalized probability of the i-th binding tag in the binding tag set, Score_i is the final score of the i-th binding tag, and n is the number of binding tags in the binding tag set.
In summary, the prediction layer in this embodiment is specifically configured to: determining a final score of the binding tag according to the original score of the binding tag and the scores of the independent tags corresponding to the binding tag; determining the probability of each binding tag according to the final score of each binding tag by utilizing a softmax function; and determining the target binding label of the word according to the probability of each binding label.
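A minimal sketch of this normalization step, using torch.softmax as the Softmax function (the scores are toy values):

import torch

final_scores = torch.tensor([2.0, 0.5, -1.0, 0.0])   # toy final scores of four binding tags
probs = torch.softmax(final_scores, dim=0)           # p_i = exp(Score_i) / sum_j exp(Score_j)
target = int(probs.argmax())                         # index of the target binding tag for this word
print(probs.tolist(), target)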
An output layer: used for outputting the target label sequence of the text to be labeled, the target label sequence comprising the target binding labels of all words of the text to be labeled.
Loss layer: used for determining a loss value according to the target loss function, the predicted labeling result and the actual labeling result during training, so that the model parameters can be adjusted to realize the training.
In the art, a model generally adopts the cross-entropy function as the objective function for parameter estimation, and the model is solved and evaluated by minimizing this objective function. The cross-entropy function is:
loss = - Σ_i y_i · log(ŷ_i)
where y_i is the probability distribution of the correct label and ŷ_i is the probability distribution predicted by the model;
loss is the loss between the gold result of the sample and the prediction result of the model, and it is propagated back for parameter estimation so that the model is trained, the purpose of model training being to minimize this loss.
On this basis, this embodiment takes into account that, because of the multiple labeling specifications, each word has more than one correct label. Assume the number of labels in labeling specification 1 is |T_1|, the number of labels in labeling specification 2 is |T_2|, ..., and the number of labels in labeling specification N is |T_N|. Then the number of correct answers for each word under labeling specification 1 is |T_2| * ... * |T_N|, the number of correct answers for each word under labeling specification 2 is |T_1| * |T_3| * ... * |T_N|, and so on, and the number of correct answers for each word under labeling specification N is |T_1| * ... * |T_{N-1}|. Therefore, as a preferred implementation, this embodiment proposes an improved objective function, specifically:
Loss = -log( Σ_{j=1}^{k} y_j )
where k denotes the number of correct labels and y_j is the probability of the j-th correct label after Softmax normalization of its score.
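Under the reading of the objective reconstructed above (the loss for one word is the negative log of the total probability assigned to its k correct binding labels), a sketch of the loss might look like the following; the exact functional form is an assumption inferred from the surrounding description, not quoted from the patent:

import torch

def bundled_label_loss(final_scores, correct_ids):
    """Loss = -log( sum over the k correct binding labels of their Softmax probability )."""
    probs = torch.softmax(final_scores, dim=-1)
    return -torch.log(probs[correct_ids].sum())

# Toy example: 6 binding tags; tags 1 and 4 are both correct answers for this word.
scores = torch.randn(6, requires_grad=True)
loss = bundled_label_loss(scores, torch.tensor([1, 4]))
loss.backward()                                      # gradients are propagated back for parameter estimation
print(float(loss))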
To prove the performance improvement effect of the sequence labeling model of this embodiment, the following description is made by comparing the sequence labeling model of this embodiment with the existing model:
it is assumed that an existing practical application scenario aims to improve the labeling performance of the model under the CTB labeling specification, and it is assumed that the number of the labeling specifications in this embodiment is 2, and the two labeling specifications are CTB and PKU, respectively. Then, in terms of experimental data set-up, the data set setup for the existing model is shown in table 3, and the input and output of the model is shown in table 4. It can be seen that the existing model can only use a single corpus of labeling specifications as a training set, and the input of the model only considers the vector of a word and can only output a single label sequence of labeling specifications. Referring to table 5, table 6 and table 7, the sequence annotation model of the embodiment can use corpora of various annotation specifications as a training set, so that the data scale is enlarged; the word vector and the word vector are comprehensively considered in the input of the model, the learning capability of the model to the text can be improved through better vector representation, and the performance of the model is improved; the model can output the binding tag sequence, namely the tag sequence under various labeling specifications is directly obtained, the text is convenient to convert under different labeling specifications, and the method is simple and efficient.
TABLE 3 (data set setup of the existing model; provided as an image in the original publication)
TABLE 4
Input of the existing model    Output label of the existing model
especially                     AD
I                              PN
country                        NN
economy                        NN
rise                           VV
。                             PU
TABLE 5 (provided as an image in the original publication)
TABLE 6 (provided as an image in the original publication)
TABLE 7 (provided as an image in the original publication)
In summary, the sequence annotation apparatus provided in this embodiment improves the scoring layers of the sequence annotation model and thereby achieves the purpose of improving sequence annotation performance by exploiting heterogeneous data of multiple labeling specifications. In addition, the model directly outputs the label sequences of the various labeling specifications, which both improves the accuracy of labeling under a single labeling specification and facilitates the conversion of text between different labeling specifications.
The following introduces a training method of a sequence annotation model provided in an embodiment of the present application, and the training method of the sequence annotation model described below is applied to the sequence annotation model of the sequence annotation apparatus described above.
As shown in fig. 4, the training method of the sequence labeling model includes:
step S401: acquiring training samples with various labeling specifications, wherein the training samples comprise training texts and actual label sequences of the training texts;
step S402: inputting the training samples into a sequence labeling model to obtain a predicted tag sequence output by the sequence labeling model;
step S403: and adjusting parameters of the sequence labeling model according to the predicted label sequence and the actual label sequence until a preset termination condition is reached so as to realize the training of the sequence labeling model.
Specifically, the adjustment process of the model parameters may be an automatic process. The preset termination condition for completing the model training may be that the iteration number reaches a preset maximum iteration number, or that the model training is determined to be completed when the performance of the model does not reach an expected improvement after a certain number of iterations, which is specifically determined according to actual requirements, and this embodiment is not specifically limited.
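A schematic training loop for steps S401 to S403, assuming a model object that maps a training text to per-word final scores over the binding label set and the per-word loss sketched earlier; the optimizer, learning rate, and termination condition below are illustrative assumptions:

import torch

def train(model, samples, word_loss, max_epochs=20, lr=1e-3):
    """samples: list of (text, gold) pairs, where gold[i] holds the indices of the
    correct binding labels of the i-th word (step S401)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(max_epochs):                  # preset termination: maximum iteration count
        total = 0.0
        for text, gold in samples:
            optimizer.zero_grad()
            scores = model(text)                     # (n_words, n_binding_tags) predicted scores, step S402
            loss = sum(word_loss(scores[i], gold[i]) for i in range(len(gold)))
            loss.backward()                          # adjust parameters from predicted vs. actual labels, step S403
            optimizer.step()
            total += float(loss)
        print(f"epoch {epoch}: total loss {total:.4f}")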
The following introduces a training apparatus of a sequence labeling model provided in an embodiment of the present application, and the training apparatus of the sequence labeling model described below is applied to the sequence labeling model of the sequence labeling apparatus.
As shown in fig. 5, the training apparatus for the sequence annotation model includes:
the memory 501: for storing a computer program;
the processor 502: for executing the computer program to implement the steps of:
acquiring training samples with various labeling specifications, wherein the training samples comprise training texts and actual label sequences of the training texts; inputting the training sample into a sequence labeling model to obtain a predicted tag sequence output by the sequence labeling model; and adjusting parameters of the sequence labeling model according to the predicted label sequence and the actual label sequence until a preset termination condition is reached so as to realize the training of the sequence labeling model.
The following describes a computer-readable storage medium provided by an embodiment of the present application, and the computer-readable storage medium described below is applied to the sequence annotation model of the sequence annotation device described above.
In particular, the computer readable storage medium has stored thereon a computer program which, when executed by a processor, is adapted to carry out the steps of:
acquiring training samples with various labeling specifications, wherein the training samples comprise training texts and actual label sequences of the training texts; inputting the training sample into a sequence labeling model to obtain a predicted tag sequence output by the sequence labeling model; and adjusting parameters of the sequence labeling model according to the predicted label sequence and the actual label sequence until a preset termination condition is reached so as to realize the training of the sequence labeling model.
As shown in fig. 6, the sequence annotation method provided in the embodiment of the present application is introduced as follows, and the sequence annotation method includes:
step S601: acquiring a text to be marked;
step S602: determining vector representation of each word of the text to be labeled;
step S603: according to the vector representation, determining the score of each independent label in a label set with various labeling specifications, and determining the original score of each binding label in a binding label set, wherein the binding label is a group of label combinations comprising single independent labels with various labeling specifications;
step S604: determining a final score of the binding tag according to the original score of the binding tag and the score of the independent tag corresponding to the binding tag;
step S605: and determining the target binding label of the word according to the final score of the binding label to obtain a target label sequence of the text to be labeled.
In the present specification, the embodiments are described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same or similar parts between the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above detailed descriptions of the solutions provided in the present application, and the specific examples applied herein are set forth to explain the principles and implementations of the present application, and the above descriptions of the examples are only used to help understand the method and its core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (9)

1. A sequence annotation apparatus, comprising a sequence annotation model, wherein the sequence annotation model comprises:
an input layer: used for obtaining a text to be labeled;
a presentation layer: used for determining the vector representation of each word of the text to be labeled and sending the vector representation to a first scoring layer and a plurality of second scoring layers respectively;
the first scoring layer: used for determining the original score of each binding label in a binding label set according to the vector representation;
the second scoring layers: used for determining, according to the vector representation, the score of each independent label in the label set of the corresponding labeling specification, wherein the second scoring layers correspond one-to-one to the labeling specifications, and a binding label is a label combination comprising one independent label from each labeling specification;
a prediction layer: used for determining the final score of each binding tag according to the original score of the binding tag and the scores of the independent tags corresponding to the binding tag, and determining the target binding tag of each word according to the final scores of the binding tags;
an output layer: used for outputting a target label sequence of the text to be labeled, the target label sequence comprising the target binding labels of all words of the text to be labeled.
2. The sequence annotation apparatus of claim 1, wherein the presentation layer comprises:
a first encoding unit: used for determining a first vector of each word of the text to be labeled;
a second encoding unit: used for encoding each character of the word with a first bidirectional recurrent neural network to obtain a second vector of the word;
a representation unit: for determining a vector representation of the word from the first vector and the second vector and sending the vector representation to a first scoring level and a plurality of second scoring levels, respectively.
3. The sequence labeling apparatus of claim 2, wherein the representation unit is specifically configured to:
determining a vector representation of the word from the first vector and the second vector; encoding the vector representation of the word by using a second bidirectional cyclic neural network to obtain global information; and sending the global information to a first scoring layer and a plurality of second scoring layers respectively.
4. The sequence labeling apparatus of claim 1, wherein the prediction layer is specifically configured to:
determining a final score of the binding tag according to the original score of the binding tag and the scores of the independent tags corresponding to the binding tag; determining the probability of each binding tag according to the final score of each binding tag by utilizing a softmax function; determining a target binding tag for the word according to the probability of each binding tag.
5. The sequence annotation apparatus of claim 4, further comprising:
a loss layer: used for determining a loss value during training according to a target loss function, a predicted labeling result and an actual labeling result, and realizing training by adjusting model parameters, wherein the target loss function is:
Loss = -log( Σ_{j=1}^{k} y_j )
where k denotes the number of correct labels and y_j is the probability of the j-th correct label output by the prediction layer.
6. A method for training a sequence annotation model, which is applied to the sequence annotation apparatus of any one of claims 1 to 5, and comprises:
acquiring training samples with various labeling specifications, wherein the training samples comprise training texts and actual label sequences of the training texts;
inputting the training samples into a sequence labeling model to obtain a predicted tag sequence output by the sequence labeling model;
and adjusting parameters of the sequence labeling model according to the predicted label sequence and the actual label sequence until a preset termination condition is reached so as to realize the training of the sequence labeling model.
7. A training device of a sequence annotation model, which is applied to the sequence annotation device of any one of claims 1 to 5, and comprises:
a memory: for storing a computer program;
a processor: for executing the computer program to implement the steps of:
acquiring training samples with various labeling specifications, wherein the training samples comprise training texts and actual label sequences of the training texts; inputting the training sample into a sequence labeling model to obtain a predicted tag sequence output by the sequence labeling model; and adjusting parameters of the sequence labeling model according to the predicted label sequence and the actual label sequence until a preset termination condition is reached so as to realize the training of the sequence labeling model.
8. A computer-readable storage medium, applied to the sequence annotation model of the sequence annotation apparatus of any one of claims 1 to 5, the computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of:
acquiring training samples with various labeling specifications, wherein the training samples comprise training texts and actual label sequences of the training texts; inputting the training sample into a sequence labeling model to obtain a predicted tag sequence output by the sequence labeling model; and adjusting parameters of the sequence labeling model according to the predicted label sequence and the actual label sequence until a preset termination condition is reached so as to realize the training of the sequence labeling model.
9. A method for sequence annotation, comprising:
acquiring a text to be marked;
determining vector representation of each word of the text to be labeled;
according to the vector representation, determining the score of each independent label in a label set with various labeling specifications, and determining the original score of each binding label in a binding label set, wherein the binding label is a group of label combinations comprising single independent labels with various labeling specifications;
determining a final score of the binding tag according to the original score of the binding tag and the score of the independent tag corresponding to the binding tag;
and determining the target binding label of the word according to the final score of the binding label to obtain a target label sequence of the text to be labeled.
CN201910481021.2A 2019-06-04 2019-06-04 Sequence labeling method and device and training method of sequence labeling model Active CN110210035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910481021.2A CN110210035B (en) 2019-06-04 2019-06-04 Sequence labeling method and device and training method of sequence labeling model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910481021.2A CN110210035B (en) 2019-06-04 2019-06-04 Sequence labeling method and device and training method of sequence labeling model

Publications (2)

Publication Number Publication Date
CN110210035A CN110210035A (en) 2019-09-06
CN110210035B true CN110210035B (en) 2023-01-24

Family

ID=67790556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910481021.2A Active CN110210035B (en) 2019-06-04 2019-06-04 Sequence labeling method and device and training method of sequence labeling model

Country Status (1)

Country Link
CN (1) CN110210035B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879524A (en) * 2021-09-27 2023-03-31 华为技术有限公司 Model training method and related equipment thereof
CN115391608B (en) * 2022-08-23 2023-05-23 哈尔滨工业大学 Automatic labeling conversion method for graph-to-graph structure

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729312A (en) * 2017-09-05 2018-02-23 苏州大学 More granularity segmenting methods and system based on sequence labelling modeling
CN109800298A (en) * 2019-01-29 2019-05-24 苏州大学 A kind of training method of Chinese word segmentation model neural network based

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729312A (en) * 2017-09-05 2018-02-23 苏州大学 More granularity segmenting methods and system based on sequence labelling modeling
CN109800298A (en) * 2019-01-29 2019-05-24 苏州大学 A kind of training method of Chinese word segmentation model neural network based

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ambiguity-aware Ensemble Training for Semi-supervised Dependency Parsing; Zhenghua Li et al.; Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics; 2014-06-25; full text *
A GRU+CRF Method for Entity-Attribute Extraction (实体-属性抽取的GRU+CRF方法); Wang Renwu et al.; Modern Information (现代情报); 2018-10-31; full text *

Also Published As

Publication number Publication date
CN110210035A (en) 2019-09-06

Similar Documents

Publication Publication Date Title
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
CN110457675B (en) Predictive model training method and device, storage medium and computer equipment
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN111460807B (en) Sequence labeling method, device, computer equipment and storage medium
CN109800298B (en) Training method of Chinese word segmentation model based on neural network
CN110377916B (en) Word prediction method, word prediction device, computer equipment and storage medium
CN110807332A (en) Training method of semantic understanding model, semantic processing method, semantic processing device and storage medium
CN110275939B (en) Method and device for determining conversation generation model, storage medium and electronic equipment
CN110704576B (en) Text-based entity relationship extraction method and device
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN108932226A (en) A kind of pair of method without punctuate text addition punctuation mark
CN111666427A (en) Entity relationship joint extraction method, device, equipment and medium
CN110795945A (en) Semantic understanding model training method, semantic understanding device and storage medium
CN112699686B (en) Semantic understanding method, device, equipment and medium based on task type dialogue system
CN111062217A (en) Language information processing method and device, storage medium and electronic equipment
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN110807333A (en) Semantic processing method and device of semantic understanding model and storage medium
CN113743099B (en) System, method, medium and terminal for extracting terms based on self-attention mechanism
CN109933792A (en) Viewpoint type problem based on multi-layer biaxially oriented LSTM and verifying model reads understanding method
US11775769B2 (en) Sentence type recognition method and apparatus, electronic device, and storage medium
CN113821616B (en) Domain-adaptive slot filling method, device, equipment and storage medium
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
CN116484879A (en) Prompt message generation method and device, electronic equipment and storage medium
CN114067786A (en) Voice recognition method and device, electronic equipment and storage medium
CN110210035B (en) Sequence labeling method and device and training method of sequence labeling model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant