CN111950278A - Sequence labeling method and device and computer readable storage medium


Info

Publication number
CN111950278A
Authority
CN
China
Prior art keywords
hidden state
word
training sentence
label
words
Prior art date
Legal status
Pending
Application number
CN201910399055.7A
Other languages
Chinese (zh)
Inventor
孟茜
童毅轩
张永伟
姜珊珊
董滨
Current Assignee
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to CN201910399055.7A
Publication of CN111950278A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

The invention provides a sequence labeling method, a sequence labeling device, and a computer readable storage medium. The sequence labeling method introduces part-of-speech and/or syntactic features into the sequence labeling process; by exploiting this richer part-of-speech and syntactic information, a better sequence labeling effect can be obtained and the accuracy of sequence labeling is improved.

Description

Sequence labeling method and device and computer readable storage medium
Technical Field
The invention relates to the technical field of Natural Language Processing (NLP), in particular to a sequence labeling method and device and a computer readable storage medium.
Background
In the field of artificial intelligence, information extraction is an indispensable technology. Current information extraction techniques mainly comprise three kinds of algorithms. The first is knowledge-graph-based extraction. Such algorithms require knowledge-graph data and rule support; building a knowledge graph consumes a large amount of human effort, and the data volume finally obtained is often not ideal. The second is extraction based on traditional statistical machine learning, which can use manually labeled training data and apply different learning models to different scenarios; its disadvantages are high labor cost and poor generalization, so it encounters a bottleneck in wide application. The last kind, which has prevailed in recent years, uses neural network models. Compared with traditional machine learning algorithms, neural-network-based models trained on large-scale data sets show excellent performance on natural language processing tasks.
Sequence labeling is one of the basic tasks of natural language processing. It refers to marking or tagging the elements of a given sequence. Typical sequence labeling tasks include Named Entity Recognition (NER), Chinese word segmentation, and classification problems (e.g., relationship recognition, sentiment analysis, and intent analysis).
For example, Named Entity Recognition (NER) is a common task in natural language processing. Named entities generally refer to entities in text that have specific significance or strong referential value, typically including names of people, places, and organizations, times, proper nouns, and the like. Because named entities serve as basic units of semantic representation in many applications, and their range of use is very wide, named entity recognition technology plays an important role. The sequence labeling problem typically requires labeled data for model training, for which neural network models based on deep learning can be used.
Therefore, a high-precision sequence labeling method is of great significance for developing high-performance systems for translation, dialogue, public opinion monitoring, topic tracking, semantic understanding, and the like.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a sequence labeling method and device, so as to improve the accuracy of sequence labeling.
According to an aspect of the embodiments of the present invention, there is provided a sequence labeling method, including:
generating first labels of words in a training sentence, the first labels comprising part-of-speech labels and/or syntax labels;
constructing, for the training sentence, a first feature vector based on the first labels, and generating a first hidden state of the first feature vector through a neural network model;
generating, for the training sentence, a second feature vector containing dictionary features of a preset dictionary, and generating a second hidden state of the second feature vector through the neural network model, wherein the preset dictionary comprises a plurality of reference labeling results;
combining the first hidden state and the second hidden state to obtain a third hidden state;
and performing sequence labeling according to the third hidden state to obtain a sequence labeling result of the training sentence.
Further in accordance with at least one embodiment of the present invention, the step of constructing a first feature vector based on the first label for the training sentence comprises:
replacing each word of the training sentence with the probability corresponding to the first label to which the word belongs, so as to obtain the first feature vector, wherein the probability corresponding to each word's first label is positively correlated with the proportion of first-class words among second-class words, the second-class words being the words in the training sentence under that first label, and the first-class words being those second-class words that belong to the reference labeling result.
Furthermore, according to at least one embodiment of the present invention, the step of generating a second feature vector containing dictionary features of a preset dictionary for the training sentence includes:
obtaining word embedding vectors of all words in the training sentence;
generating, for each word, a one-hot code according to whether a word context containing the word in the training sentence exists in the preset dictionary, and obtaining a one-hot vector corresponding to the training sentence;
and combining the word embedding vectors of the words in the training sentence with the one-hot vector corresponding to the training sentence to obtain a second feature vector containing the dictionary features of the preset dictionary.
Further in accordance with at least one embodiment of the present invention, the step of merging the first hidden state and the second hidden state includes:
and performing a vector concatenation operation or a vector addition operation on the first hidden state and the second hidden state to obtain the third hidden state.
Furthermore, according to at least one embodiment of the present invention, the step of performing sequence labeling according to the third hidden state of the training sentence includes:
and generating segment sequences of the training sentence based on the third hidden state, inputting the segment sequences to the softmax output layer of the neural network model, training the neural network model, and obtaining, for each segment sequence of the training sentence output by the softmax layer, the label of the category to which it belongs and its probability.
Further, in accordance with at least one embodiment of the present invention, after training the neural network model, the method further comprises:
and performing sequence labeling on a sentence to be processed by using the trained neural network model.
According to another aspect of the embodiments of the present invention, there is provided a sequence labeling device, including:
a label generating unit, configured to generate a first label of a word in a training sentence, where the first label includes a part-of-speech label and/or a syntax label;
a first hidden state generating unit, configured to construct, for the training sentence, a first feature vector based on the first label, and generate a first hidden state of the first feature vector through a neural network model;
a second hidden state generating unit, configured to generate, for the training sentence, a second feature vector including dictionary features of a preset dictionary, and generate a second hidden state of the second feature vector through the neural network model, where the preset dictionary includes multiple reference labeling results;
a state merging unit, configured to merge the first hidden state and the second hidden state to obtain a third hidden state;
and a first labeling processing unit, configured to perform sequence labeling according to the third hidden state to obtain a sequence labeling result of the training sentence.
In addition, according to at least one embodiment of the present invention, the first hidden state generating unit is further configured to replace each word of the training sentence with the probability corresponding to the first label to which the word belongs, so as to obtain the first feature vector, wherein the probability corresponding to each word's first label is positively correlated with the proportion of first-class words among second-class words, the second-class words being the words in the training sentence under that first label, and the first-class words being those second-class words that belong to the reference labeling result.
Furthermore, according to at least one embodiment of the present invention, the second hidden state generating unit is further configured to obtain a word embedding vector of each word in the training sentence; generate, for each word, a one-hot code according to whether a word context containing the word in the training sentence exists in the preset dictionary, and obtain a one-hot vector corresponding to the training sentence; and combine the word embedding vectors of the words in the training sentence with the one-hot vector corresponding to the training sentence to obtain a second feature vector containing the dictionary features of the preset dictionary.
Furthermore, according to at least one embodiment of the present invention, the state merging unit is further configured to perform a vector concatenation operation or a vector addition operation on the first hidden state and the second hidden state to obtain the third hidden state.
Furthermore, according to at least one embodiment of the present invention, the first labeling processing unit is further configured to generate segment sequences of the training sentence based on the third hidden state, input the segment sequences to the softmax output layer of the neural network model, train the neural network model, and obtain, for each segment sequence of the training sentence output by the softmax layer, the label of the category to which it belongs and its probability.
Furthermore, in accordance with at least one embodiment of the present invention, the sequence labeling apparatus further includes:
a second labeling processing unit, configured to perform sequence labeling on a sentence to be processed by using the trained neural network model.
The embodiment of the present invention further provides a sequence labeling apparatus, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the sequence tagging method as described above.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the sequence labeling method described above.
Compared with the prior art, the sequence labeling method and device and the computer readable storage medium provided by the embodiments of the present invention introduce part-of-speech and/or syntactic features into the sequence labeling process; because richer part-of-speech and syntactic information is utilized, the embodiments of the present invention can achieve a better sequence labeling effect and improve the accuracy of sequence labeling.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained based on these drawings without inventive effort.
FIG. 1 is a flowchart illustrating a sequence tagging method according to an embodiment of the present invention;
FIG. 2 is an exemplary diagram of a syntactic analysis according to an embodiment of the present invention;
FIG. 3 is an exemplary diagram of constructing part-of-speech and syntactic label based feature vectors in an embodiment of the present invention;
FIG. 4 is an exemplary diagram of training a Bi-LSTM model based on hidden states in an embodiment of the present invention;
FIG. 5 is an exemplary diagram of a join operation performed on a word-embedded vector and a one-hot vector according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a sequence labeling result obtained by a neural network model according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a sequence labeling apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a sequence labeling apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments. In the following description, specific details such as specific configurations and components are provided only to help the full understanding of the embodiments of the present invention. Thus, it will be apparent to those skilled in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In various embodiments of the present invention, it should be understood that the sequence numbers of the following processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Referring to FIG. 1, a flow diagram of a sequence labeling method provided by an embodiment of the present invention is shown. The sequence labeling method can be applied to tasks including named entity recognition, Chinese word segmentation, and classification problems, and can improve the accuracy of sequence labeling. As shown in FIG. 1, the sequence labeling method provided in the embodiment of the present invention includes:
step 11, generating a first label of a word in a training sentence, wherein the first label comprises a part-of-speech label and/or a syntax label.
Here, the training sentence is a sentence in training data collected in advance, the training data including a plurality of training sentences. Specifically, a preset dictionary may be constructed in advance, where the dictionary includes a plurality of reference labeling results for the sequence labeling task; the training sentence is a sentence that has undergone data preprocessing and in which the reference labeling results have been pre-labeled manually, for training the subsequent model. For example, for sequence labeling in named entity recognition, the reference labeling results in the dictionary may be various reference named entity sequences obtained in advance, and each named entity sequence appearing in a training sentence may be labeled manually. In addition, the data preprocessing may include document slicing, text word segmentation, deletion of stop words and other noise (including punctuation, numbers, single characters, and other meaningless tokens), and the like.
Natural language has rich semantic features and strong structural regularities. Typically, each word has a corresponding part of speech (e.g., noun, verb, adjective, or another predefined part of speech), and specific syntactic relationships hold between words (e.g., subject-predicate structures, modifier-modified structures, etc.). Thus, embodiments of the present invention introduce part of speech and/or syntax into the semantic understanding performed in sequence labeling. In general, parts of speech and syntactic relations can each be grouped into a limited number of categories depending on the language to be analyzed, and part-of-speech tags and syntax tags can be attached to each word in a sentence by existing tools.
In step 11, part-of-speech tags and/or syntax tags may be generated for the training data by an NLP parser, yielding the part-of-speech tags and syntax tags of the words in each training sentence.
Taking the training sentence "London is the capital and most populous city" as an example, FIG. 2 shows an example of a syntax tree formed by syntactic analysis. Each box in FIG. 2 represents a word in the sentence, and the corresponding tag is shown above each word. In FIG. 2, S denotes the root of the tree structure; NP, VP, ADJP, and the like are part-of-speech tags; and NNP, VBZ, DT, NN, CC, RBS, JJ, and the like denote syntax tags. For example, NP represents a noun phrase, VP a verb phrase, NNP a place name, and VBZ a linking verb, and so on. In addition, the type and number of part-of-speech/syntax tags generated by different tools may differ, and the embodiment of the present invention does not limit the specific tool adopted.
In addition, to simplify processing, only the part-of-speech tags or only the syntax tags may be generated in step 11. Of course, the embodiment of the present invention may also generate both kinds of tags, namely part-of-speech tags and syntax tags. By introducing the part-of-speech and/or syntactic features described above into sequence labeling, the accuracy of sequence labeling (e.g., named entity recognition) can be improved.
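By way of illustration only (this sketch is not part of the original disclosure), the following minimal Python example shows how an off-the-shelf parser can produce such per-word tags. It assumes spaCy and its small English model; a dependency parse stands in here for the constituency tree of FIG. 2, since the embodiment does not limit the specific tool adopted.

```python
# Hedged sketch: per-word tag generation for step 11 using spaCy
# (an assumed tool choice; the patent does not prescribe a parser).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("London is the capital and most populous city")

for token in doc:
    # token.tag_ : fine-grained tag, e.g., NNP, VBZ, DT, JJ
    # token.pos_ : coarse part of speech, e.g., PROPN, VERB, DET
    # token.dep_ : syntactic (dependency) relation label
    print(token.text, token.tag_, token.pos_, token.dep_)
```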
Step 12, constructing, for the training sentence, a first feature vector based on the first labels, and generating a first hidden state of the first feature vector through a neural network model.
Here, when constructing the first feature vector from the first labels, each word of the training sentence may be replaced, according to its position in the sentence, with the probability corresponding to the first label to which it belongs, so as to obtain the first feature vector. The probability corresponding to each word's first label is positively correlated with the proportion of first-class words among second-class words, where the second-class words are the words in the training sentence under that first label, and the first-class words are those second-class words that belong to a reference labeling result in the preset dictionary. Taking sequence labeling for named entity recognition as an example, a first-class word belongs to a reference labeling result: it may be identical to a reference named entity sequence in the preset dictionary, or it may be part of one.
According to at least one embodiment of the present invention, the probability corresponding to the first label to which each word belongs is positively correlated with the proportion of the first class of words in the second class of words. That is, the larger the ratio, the larger the corresponding probability. A simpler implementation is that the probability corresponding to the first label to which the word belongs is equal to the proportion of the first class of words in the second class of words. Of course, the probability may also be determined according to the ratio in a more complex manner, and this is not specifically limited in the embodiment of the present invention.
Still taking the named entity recognition as an example, assuming that the first tag is a part-of-speech tag, and a certain training sentence includes 10 words, in the above step 11, part-of-speech tags of the words in the training sentence are generated, and it is assumed that the 10 words in the training sentence include 5 nouns, 1 verb, 2 adjectives and 2 conjunctions. It is assumed that 2 nouns (i.e. the first type words) in the above 5 nouns (i.e. the second type words) belong to the reference named entity sequence in the dictionary, specifically, the two nouns may both belong to the same reference named entity sequence, or may belong to different reference named entity sequences, which is not limited in the embodiment of the present invention. Under the label of the noun, the proportion of the first class word in the second class word is 2/5, so that the probability corresponding to the label of the noun can be determined to be 2/5, and further the noun in the training sentence is replaced by 2/5. Assuming that none of the 2 conjunctions belongs to the reference named entity sequence in the dictionary, the ratio of the first class word in the second class word is 0 under the label of the conjunctions, so that the probability corresponding to the label of the conjunctions can be determined to be 0, and then the conjunctions in the training sentence are replaced by 0 or a constant close to 0. Through the above manner, each word in the training sentence can be replaced by the probability corresponding to the label to which the word belongs, so that a first feature vector is obtained, and the part-of-speech feature is introduced into the first feature vector.
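The replacement just described can be sketched as follows (an illustrative sketch, not the patent's implementation; the dictionary-membership test `in_dictionary` is a hypothetical helper standing in for the lookup against the reference labeling results):

```python
from collections import defaultdict

def first_feature_vector(words, tags, in_dictionary):
    total = defaultdict(int)    # second-class counts: words under each label
    matched = defaultdict(int)  # first-class counts: those also in the dictionary
    for word, tag in zip(words, tags):
        total[tag] += 1
        if in_dictionary(word):
            matched[tag] += 1
    # Replace each word by its label's probability, here taken as equal to
    # the proportion of first-class words among second-class words.
    return [matched[tag] / total[tag] for tag in tags]

words = ["London", "is", "the", "capital"]
tags = ["NNP", "VBZ", "DT", "NN"]
dictionary = {"London"}
print(first_feature_vector(words, tags, dictionary.__contains__))
# -> [1.0, 0.0, 0.0, 0.0]; only the NNP word appears in the dictionary
```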
In addition, considering that the number of words in different training sentences may be different, the number of words of all training sentences may be normalized for the convenience of calculation processing. Specifically, a normalized number may be set according to the number of words of most training sentences, for example, when the number of words of 95% of the training sentences is 20 or less, the normalized number may be set to 20. Then, for training sentences that exceed the normalized number, one or more words at the end of the sentence may be deleted until the normalized number is satisfied. For training sentences smaller than the normalized number, after the first feature vector of the first label is generated in step 12, a padding operation (padding) may be performed on the first feature vector, such as padding 0, to make the dimension of the first feature vector reach the normalized number.
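The length normalization described above admits a very small sketch (the normalized length of 20 is the example value from the text; padding with 0 is the stated filling choice):

```python
def normalize_length(vec, target_len=20, pad_value=0.0):
    if len(vec) > target_len:
        return vec[:target_len]  # delete words at the end of the sentence
    return vec + [pad_value] * (target_len - len(vec))  # pad to target length
```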
When the part-of-speech tags and the syntax tags are considered at the same time, in step 12 the embodiment of the present invention may construct, for the training sentence, a feature vector based on the part-of-speech tags and a feature vector based on the syntax tags. FIG. 3 shows an example of constructing feature vectors based on the part-of-speech tags and the syntax tags, where $f_{pos}$ represents the feature vector based on the part-of-speech tags and $f_{syn}$ represents the feature vector based on the syntax tags. The two feature vectors are then combined to obtain a first feature vector based on both the part-of-speech features and the syntactic features. Specifically, the vectors may be combined by concatenation or by addition, where concatenation joins the two vectors end to end to produce a higher-dimensional vector; for example, when both feature vectors are 20-dimensional, a 40-dimensional vector is obtained after concatenation.
After obtaining the first feature vector of the first labels, the embodiment of the present invention may input the first feature vector into a neural network model; through transformation and calculation, a hidden-layer representation can be obtained, that is, the first hidden state (usually a vector) of the first feature vector, which can serve as a deep-learning representation of the part of speech and/or the syntactic structure.
According to at least one embodiment of the present invention, a Bi-directional Long Short-Term Memory network (Bi-LSTM) may be utilized to obtain the hidden state. The Bi-LSTM model is trained with the first feature vector of the first label as input, such that the first hidden state can be obtained by Equation 1 below. As shown in FIG. 4, the first feature vector $f_i$ generated in step 12 is input into the Bi-LSTM network, and the hidden state at the corresponding time is obtained through the forward-propagation and backward-propagation computations. As for the specific structure of the Bi-LSTM model, reference can be made to the description of the prior art, which is not repeated herein for brevity.

$$\overleftrightarrow{h_i^f} = \text{Bi-LSTM}(f_i) = \left[\overrightarrow{h_i^f}; \overleftarrow{h_i^f}\right] \qquad (1)$$

In the above formula, $\text{Bi-LSTM}(\cdot)$ represents the Bi-LSTM model; $\overleftrightarrow{h_i^f}$ is the hidden-layer state vector (i.e., the first hidden state) that stores all useful information at time $i$; $f_i$ is the first feature vector based on the first label at time $i$; and $\overrightarrow{h_i^f}$ and $\overleftarrow{h_i^f}$ respectively represent the forward-propagating and backward-propagating hidden-layer feature vectors at time $i$.
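For concreteness, a minimal PyTorch sketch of obtaining such a bidirectional hidden state is given below (the dimensions are illustrative assumptions, and this is not the patent's code):

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim, seq_len, batch = 40, 64, 20, 1

bilstm = nn.LSTM(input_size=feature_dim, hidden_size=hidden_dim,
                 bidirectional=True, batch_first=True)

f = torch.randn(batch, seq_len, feature_dim)  # first feature vectors f_i
h_first, _ = bilstm(f)                        # (batch, seq_len, 2*hidden_dim)
# h_first[:, i, :] concatenates the forward and backward hidden states at
# time i, i.e., the first hidden state of Equation 1.
```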
Step 13, generating, for the training sentence, a second feature vector containing dictionary features of a preset dictionary, and generating a second hidden state of the second feature vector through the neural network model, wherein the preset dictionary comprises a plurality of reference labeling results.
Here, word embedding vectors are also generated based on the training sentences, and in order to improve the accuracy of sequence labeling, dictionary features are introduced into the word embedding vectors in the embodiment of the present invention. Specifically, step 13 may include:
1) Word embedding vectors of each word in the training sentence are obtained; pre-trained word embedding vectors can be used to improve the efficiency of model training. The pre-trained word embedding vectors may be produced by different word vector generation models (e.g., Word2Vec), such as the Continuous Bag-of-Words (CBOW) model, the Skip-gram model, or the C&W model.
2) For each word in the training sentence, a one-hot code corresponding to the word is generated according to whether a word context containing the word in the training sentence exists in the preset dictionary, and the one-hot vector corresponding to the training sentence is obtained, thereby converting dictionary features into a vector representation.
Here, it may be checked whether each word in the training sentence exists in the dictionary; if so, the code corresponding to the word is 1, and otherwise 0, so that a set of codes ordered by the words in the training sentence, that is, the one-hot vector corresponding to the training sentence, may be generated.
In order to improve the accuracy of sequence labeling, a window of word context may be set when generating the above one-hot vector, the window size being N words, which together constitute a word context. N is typically greater than 1, e.g., N = 3. The window slides through the training sentence with a step size of one word; after each slide, it is judged whether the word context in the current window exists in the dictionary, and the one-hot code of the word at a preset position in the window (for example, the first or the last word in the window) is generated, finally yielding the one-hot vector of the training sentence.
It should be noted that, to further improve the accuracy of sequence labeling, the embodiment of the present invention may also perform the above operations with several windows of different sizes, generating the one-hot vectors of the training sentence under the different windows separately. The one-hot vectors under the different windows are then merged, e.g., by vector addition or concatenation, and the merged vector is taken as the one-hot vector of the training sentence.
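A sketch of this windowed dictionary feature follows (assumptions: the preset dictionary is represented as a set of word-context tuples, the coded position is the first word of each window, and merging across windows uses element-wise addition; all of these are illustrative choices, not fixed by the patent):

```python
def dictionary_onehot(words, dictionary, window=3):
    codes = []
    for i in range(len(words)):
        context = tuple(words[i:i + window])  # slide by one word per step
        # Code the word at the preset position (here, the first word of the
        # window) as 1 if its context exists in the preset dictionary.
        codes.append(1 if context in dictionary else 0)
    return codes

def multi_window_onehot(words, dictionary, windows=(2, 3, 4)):
    # Generate the vectors under several window sizes and merge them by
    # element-wise addition (concatenation would also work, per the text).
    merged = [0] * len(words)
    for n in windows:
        for i, c in enumerate(dictionary_onehot(words, dictionary, n)):
            merged[i] += c
    return merged
```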
3) The word embedding vectors of the words in the training sentence and the one-hot vector corresponding to the training sentence are combined to obtain a second feature vector containing the dictionary features of the preset dictionary.
Here, the word embedding vectors of the words in the training sentence and the one-hot vector corresponding to the training sentence may be combined by vector addition or vector concatenation, which is not specifically limited in the embodiment of the present invention. FIG. 5 gives an example of a concatenation operation on a word embedding vector and a one-hot vector, where $e_w$ represents the word embedding vector and $f_{dic}$ represents the one-hot vector (only 3 bits of data are schematically shown in FIG. 5). The word embedding vector and the one-hot vector are connected end to end, e.g., the head of the one-hot vector is joined to the tail of the word embedding vector (of course, the order of the two vectors may be interchanged), yielding a new vector whose dimension is the sum of the dimensions of the two vectors.
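The concatenation of FIG. 5 can be sketched in a few lines (the dimensions are illustrative assumptions):

```python
import numpy as np

e_w = np.random.randn(100)          # pre-trained word embedding e_w
f_dic = np.array([1.0, 0.0, 1.0])   # dictionary one-hot features f_dic
second_feature = np.concatenate([e_w, f_dic])  # 103-dimensional result
```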
After the second feature vector is obtained, a hidden state (i.e., the second hidden state) of the second feature vector may be generated using the Bi-LSTM model. The Bi-LSTM model is trained with the second feature vector as input, so that the second hidden state can be obtained by Equation 2 below; the second hidden state contains the semantic features of the word embeddings enhanced by the dictionary features.

$$\overleftrightarrow{h_i^e} = \text{Bi-LSTM}(e_i) = \left[\overrightarrow{h_i^e}; \overleftarrow{h_i^e}\right] \qquad (2)$$

In the above formula, $\text{Bi-LSTM}(\cdot)$ represents the Bi-LSTM model; $\overleftrightarrow{h_i^e}$ is the hidden-layer state vector (i.e., the second hidden state) that stores all useful information at time $i$; $e_i$ is the second feature vector at time $i$; and $\overrightarrow{h_i^e}$ and $\overleftarrow{h_i^e}$ respectively represent the forward-propagating and backward-propagating hidden-layer feature vectors at time $i$.
Step 14, combining the first hidden state and the second hidden state to obtain a third hidden state.
Here, the third hidden state may be obtained by performing a vector concatenation operation or a vector addition operation on the first hidden state and the second hidden state. For example, the vector addition operation may be expressed in the form of Equation 3 below, resulting in the third hidden state $h_i$:

$$h_i = \overleftrightarrow{h_i^f} + \overleftrightarrow{h_i^e} \qquad (3)$$
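A minimal sketch of this merging step follows (shapes are illustrative assumptions; addition corresponds to Equation 3, and concatenation is the alternative named in the text):

```python
import torch

h_first = torch.randn(1, 20, 128)   # first hidden state (batch, time, dim)
h_second = torch.randn(1, 20, 128)  # second hidden state
h_third_add = h_first + h_second                      # vector addition
h_third_cat = torch.cat([h_first, h_second], dim=-1)  # vector concatenation
```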
Step 15, performing sequence labeling according to the third hidden state to obtain a sequence labeling result of the training sentence.
Here, based on the third hidden state, short segment sequences of the training sentence may be generated and input to the output layer (a softmax layer) of the neural network model; the neural network model is trained, and the label of the category to which each segment sequence belongs, together with its probability, is obtained from the softmax layer. FIG. 6 gives an example of inputting the third hidden state into the neural network model to obtain the output of the softmax layer. For example, for named entity recognition, the softmax layer may output information such as the category and probability of each named entity present in the training sentence. As for the structure of the neural network model, reference may be made to the description of the prior art, which is not repeated herein for brevity.
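By way of illustration, the output layer of step 15 can be sketched as follows (the number of labels and the linear projection ahead of the softmax are assumptions; the patent does not fix these details):

```python
import torch
import torch.nn as nn

num_labels, hidden_dim = 5, 128           # e.g., BIO-style entity labels
projection = nn.Linear(hidden_dim, num_labels)

h_third = torch.randn(1, 20, hidden_dim)  # merged hidden states from step 14
logits = projection(h_third)
probs = torch.softmax(logits, dim=-1)     # per-position label probabilities
labels = probs.argmax(dim=-1)             # most probable label per position
```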
In the above manner, part-of-speech and/or syntactic features are introduced into the sequence labeling process; because richer part-of-speech and syntactic information is utilized, the accuracy of sequence labeling can be improved.
After step 15, the trained neural network model may be applied to a specific sequence labeling task, for example, to identify and label named entities in a sentence to be processed. Because the embodiment of the present invention introduces rich part-of-speech and syntactic information when training the neural network model, the trained model achieves a better sequence labeling effect, and labeling accuracy can be improved when it is applied to a sequence labeling task.
Based on the above method, an embodiment of the present invention further provides a device for implementing the method. Referring to FIG. 7, the sequence labeling device 70 provided in the embodiment of the present invention can be applied to various sequence labeling scenarios and can improve the accuracy of sequence labeling. As shown in FIG. 7, the sequence labeling device 70 specifically includes:
a label generating unit 71, configured to generate a first label of a word in a training sentence, where the first label includes a part-of-speech label and/or a syntax label;
a first hidden state generating unit 72, configured to construct, for the training sentence, a first feature vector based on the first label, and generate a first hidden state of the first feature vector through a neural network model;
a second hidden state generating unit 73, configured to generate, for the training sentence, a second feature vector including dictionary features of a preset dictionary, and generate a second hidden state of the second feature vector through the neural network model, where the preset dictionary includes multiple reference labeling results;
a state merging unit 74, configured to merge the first hidden state and the second hidden state to obtain a third hidden state;
and a first labeling processing unit 75, configured to perform sequence labeling according to the third hidden state, and obtain a sequence labeling result of the training sentence.
In addition, according to at least one embodiment of the present invention, the first hidden state generating unit 72 is further configured to replace each word of the training sentence with the probability corresponding to the first label to which the word belongs, so as to obtain the first feature vector, wherein the probability corresponding to each word's first label is positively correlated with the proportion of first-class words among second-class words, the second-class words being the words in the training sentence under that first label, and the first-class words being those second-class words that belong to the reference labeling result.
Furthermore, according to at least one embodiment of the present invention, the second hidden state generating unit 73 is further configured to obtain a word embedding vector of each word in the training sentence; generate, for each word, a one-hot code according to whether a word context containing the word in the training sentence exists in the preset dictionary, and obtain a one-hot vector corresponding to the training sentence; and combine the word embedding vectors of the words in the training sentence with the one-hot vector corresponding to the training sentence to obtain a second feature vector containing the dictionary features of the preset dictionary.
Furthermore, according to at least one embodiment of the present invention, the state merging unit 74 is further configured to perform a vector concatenation operation or a vector addition operation on the first hidden state and the second hidden state to obtain the third hidden state.
Furthermore, according to at least one embodiment of the present invention, the first labeling processing unit 75 is further configured to generate segment sequences of the training sentence based on the third hidden state, input the segment sequences to the softmax output layer of the neural network model, train the neural network model, and obtain, for each segment sequence of the training sentence output by the softmax layer, the label of the class to which it belongs and its probability.
Furthermore, according to at least one embodiment of the present invention, the sequence labeling apparatus may further include the following units (not shown in fig. 7):
a second labeling processing unit, configured to perform sequence labeling on a sentence to be processed by using the trained neural network model.
Through the above units, the sequence labeling device provided by the embodiment of the present invention introduces part-of-speech information and syntactic information into sequence labeling, thereby improving the accuracy of sequence labeling.
Referring to FIG. 8, an embodiment of the present invention further provides a sequence labeling device, whose hardware structure is shown in the block diagram of FIG. 8. As shown in FIG. 8, the sequence labeling device 800 includes:
a processor 802; and
a memory 804, in which memory 804 computer program instructions are stored,
wherein the computer program instructions, when executed by the processor, cause the processor 802 to perform the steps of:
generating first labels of words in a training sentence, the first labels comprising part-of-speech labels and/or syntax labels;
constructing a first feature vector based on the first label aiming at the training statement, and generating a first hidden state of the first feature vector through a neural network model;
generating a second feature vector containing dictionary features of a preset dictionary aiming at the training sentences, and generating a second hidden state of the second feature vector through the neural network model, wherein the preset dictionary comprises a plurality of reference labeling results;
combining the first hidden state and the second hidden state to obtain a third hidden state;
and carrying out sequence labeling according to the third hidden state to obtain a sequence labeling result of the training sentence.
Further, as shown in fig. 8, the sequence labeling apparatus 800 may further include a network interface 801, an input device 803, a hard disk 805, and a display device 806.
The various interfaces and devices described above may be interconnected by a bus architecture. The bus architecture may be any architecture that includes any number of interconnected buses and bridges. Various circuits of one or more Central Processing Units (CPUs), represented in particular by processor 802, and one or more memories, represented by memory 804, are coupled together. The bus architecture may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like. It will be appreciated that a bus architecture is used to enable communications among the components. The bus architecture includes a power bus, a control bus, and a status signal bus, in addition to a data bus, all of which are well known in the art and therefore will not be described in detail herein.
The network interface 801 may be connected to a network (e.g., the internet, a local area network, etc.), receive data (e.g., training sentences) from the network, and store the received data in the hard disk 805.
The input device 803 may receive various commands input by an operator and send the commands to the processor 802 for execution. The input device 803 may include a keyboard or a pointing device (e.g., a mouse, trackball, touch pad, touch screen, or the like).
The display device 806 may display results obtained by the processor 802 executing instructions, for example, the sequence labeling result.
The memory 804 is used for storing programs and data necessary for operating the operating system, and data such as intermediate results in the calculation process of the processor 802.
It is to be understood that the memory 804 in embodiments of the present invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. The memory 804 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 804 stores the following elements, executable modules or data structures, or a subset thereof, or an expanded set thereof: an operating system 8041 and application programs 8042.
The operating system 8041 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program 8042 includes various application programs such as a Browser (Browser) and the like for implementing various application services. A program implementing a method according to an embodiment of the present invention may be included in application program 8042.
The sequence labeling method disclosed in the above embodiments of the present invention can be applied to the processor 802, or implemented by the processor 802. The processor 802 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above sequence labeling method may be implemented by hardware integrated logic circuits in the processor 802 or instructions in the form of software. The processor 802 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, configured to implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 804, and the processor 802 reads the information in the memory 804 and performs the steps of the above method in combination with the hardware thereof.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
In particular, the computer program, when executed by the processor 802, may further implement the steps of:
replacing each word of the training sentence with the probability corresponding to the first label to which the word belongs, so as to obtain the first feature vector, wherein the probability corresponding to each word's first label is positively correlated with the proportion of first-class words among second-class words, the second-class words being the words in the training sentence under that first label, and the first-class words being those second-class words that belong to the reference labeling result.
In particular, the computer program, when executed by the processor 802, may further implement the steps of: obtaining word embedding vectors of all words in the training sentence; generating, for each word, a one-hot code according to whether a word context containing the word in the training sentence exists in the preset dictionary, and obtaining a one-hot vector corresponding to the training sentence; and combining the word embedding vectors of the words in the training sentence with the one-hot vector corresponding to the training sentence to obtain a second feature vector containing the dictionary features of the preset dictionary.
In particular, the computer program, when executed by the processor 802, may further implement the steps of: performing a vector concatenation operation or a vector addition operation on the first hidden state and the second hidden state to obtain the third hidden state.
In particular, the computer program, when executed by the processor 802, may further implement the steps of: generating segment sequences of the training sentence based on the third hidden state, inputting the segment sequences to the softmax output layer of the neural network model, training the neural network model, and obtaining, for each segment sequence of the training sentence output by the softmax layer, the label of the category to which it belongs and its probability.
In particular, the computer program, when executed by the processor 802, may further implement the steps of: performing sequence labeling on a sentence to be processed by using the trained neural network model.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention or a part thereof, which essentially contributes to the prior art, can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the sequence labeling method described in the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A method for labeling a sequence, comprising:
generating first labels of words in a training sentence, the first labels comprising part-of-speech labels and/or syntax labels;
constructing, for the training sentence, a first feature vector based on the first labels, and generating a first hidden state of the first feature vector through a neural network model;
generating, for the training sentence, a second feature vector containing dictionary features of a preset dictionary, and generating a second hidden state of the second feature vector through the neural network model, wherein the preset dictionary comprises a plurality of reference labeling results;
combining the first hidden state and the second hidden state to obtain a third hidden state;
and performing sequence labeling according to the third hidden state to obtain a sequence labeling result of the training sentence.
2. The method of claim 1, wherein the step of constructing, for the training sentence, a first feature vector based on the first label comprises:
replacing each word of the training sentence with the probability corresponding to the first label to which the word belongs, so as to obtain the first feature vector, wherein the probability corresponding to each word's first label is positively correlated with the proportion of first-class words among second-class words, the second-class words being the words in the training sentence under that first label, and the first-class words being those second-class words that belong to the reference labeling result.
3. The method of claim 1, wherein the step of generating a second feature vector containing dictionary features of a predetermined dictionary for the training sentence comprises:
obtaining word embedding vectors of all words in the training sentence;
generating, for each word, a one-hot code according to whether a word context containing the word in the training sentence exists in the preset dictionary, and obtaining a one-hot vector corresponding to the training sentence;
and combining the word embedding vectors of the words in the training sentence with the one-hot vector corresponding to the training sentence to obtain a second feature vector containing the dictionary features of the preset dictionary.
4. The method of claim 1, wherein the step of merging the first hidden state with the second hidden state comprises:
and performing a vector concatenation operation or a vector addition operation on the first hidden state and the second hidden state to obtain the third hidden state.
5. The method of claim 1, wherein the step of performing sequence labeling according to the third hidden state of the training sentence comprises:
and generating segment sequences of the training sentence based on the third hidden state, inputting the segment sequences to the softmax output layer of the neural network model, training the neural network model, and obtaining, for each segment sequence of the training sentence output by the softmax layer, the label of the class to which it belongs and its probability.
6. The method of claim 5, wherein after training the neural network model, the method further comprises:
and performing sequence labeling on a sentence to be processed by using the trained neural network model.
7. A sequence annotation apparatus, comprising:
a label generating unit, configured to generate a first label of a word in a training sentence, where the first label includes a part-of-speech label and/or a syntax label;
a first hidden state generating unit, configured to construct, for the training sentence, a first feature vector based on the first label, and generate a first hidden state of the first feature vector through a neural network model;
a second hidden state generating unit, configured to generate, for the training sentence, a second feature vector including dictionary features of a preset dictionary, and generate a second hidden state of the second feature vector through the neural network model, where the preset dictionary includes multiple reference labeling results;
a state merging unit, configured to merge the first hidden state and the second hidden state to obtain a third hidden state;
a first labeling processing unit, configured to perform sequence labeling according to the third hidden state to obtain a sequence labeling result of the training sentence.
8. The sequence annotation apparatus of claim 7,
the first hidden state generating unit is further configured to replace each word of the training sentence with the probability corresponding to the first label to which the word belongs, so as to obtain the first feature vector; wherein the probability corresponding to the first label to which each word belongs is positively correlated with the proportion of first-class words among second-class words, the second-class words being the words in the training sentence under the first label to which the word belongs, and the first-class words being the second-class words that belong to a reference labeling result.
9. The sequence annotation apparatus of claim 7,
the second hidden state generating unit is further configured to obtain a word embedding vector for each word in the training sentence; generate a one-hot code for each word according to whether a word context containing the word in the training sentence exists in the preset dictionary, and obtain a one-hot vector corresponding to the training sentence; and combine the word embedding vectors of the words in the training sentence with the one-hot vector corresponding to the training sentence to obtain the second feature vector containing the dictionary features of the preset dictionary.
10. The sequence annotation apparatus of claim 7,
the state merging unit is further configured to perform a vector concatenation operation or a vector addition operation on the first hidden state and the second hidden state to obtain the third hidden state.
11. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the sequence labeling method of any one of claims 1 to 6.
CN201910399055.7A 2019-05-14 2019-05-14 Sequence labeling method and device and computer readable storage medium Pending CN111950278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910399055.7A CN111950278A (en) 2019-05-14 2019-05-14 Sequence labeling method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111950278A 2020-11-17

Family

ID=73335620

Country Status (1)

Country Link
CN (1) CN111950278A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544242A (en) * 2013-09-29 2014-01-29 广东工业大学 Microblog-oriented emotion entity searching system
CN105260361A (en) * 2015-10-28 2016-01-20 南京邮电大学 Trigger word tagging system and method for biomedical events
US9311299B1 (en) * 2013-07-31 2016-04-12 Google Inc. Weakly supervised part-of-speech tagging with coupled token and type constraints
CN106776711A (en) * 2016-11-14 2017-05-31 浙江大学 A kind of Chinese medical knowledge mapping construction method based on deep learning
CN107908671A (en) * 2017-10-25 2018-04-13 南京擎盾信息科技有限公司 Knowledge mapping construction method and system based on law data
CN108280064A (en) * 2018-02-28 2018-07-13 北京理工大学 Participle, part-of-speech tagging, Entity recognition and the combination treatment method of syntactic analysis
CN108460013A (en) * 2018-01-30 2018-08-28 大连理工大学 A kind of sequence labelling model based on fine granularity vocabulary representation model
CN108628824A (en) * 2018-04-08 2018-10-09 上海熙业信息科技有限公司 A kind of entity recognition method based on Chinese electronic health record
CN109408812A (en) * 2018-09-30 2019-03-01 北京工业大学 A method of the sequence labelling joint based on attention mechanism extracts entity relationship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination