CN111353295A - Sequence labeling method and device, storage medium and computer equipment - Google Patents


Info

Publication number
CN111353295A
Authority
CN
China
Prior art keywords
word
sequence
labeled
speech
context information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010123296.1A
Other languages
Chinese (zh)
Inventor
周玥
胡盼盼
佟博
赵茜
张超
黄仲强
张坚琳
廖凤玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Bozhilin Robot Co Ltd
Original Assignee
Guangdong Bozhilin Robot Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Bozhilin Robot Co Ltd filed Critical Guangdong Bozhilin Robot Co Ltd
Priority to CN202010123296.1A priority Critical patent/CN111353295A/en
Publication of CN111353295A publication Critical patent/CN111353295A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a sequence labeling method, a device, a storage medium and computer equipment. The method includes: acquiring a sequence, wherein the sequence comprises words to be labeled and labeled words, and the sequence is used for generating a text; identifying context information of a word to be labeled in the sequence; and determining a second part of speech of the word to be labeled according to the context information, in combination with a first part of speech of a labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled. Because the part of speech of the word to be labeled is labeled according to the context information of the word to be labeled in the sequence, the accuracy of part-of-speech labeling can be improved, the sequence labeling effect is improved, and text generation is effectively assisted.

Description

Sequence labeling method and device, storage medium and computer equipment
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a sequence labeling method and device, a storage medium, and computer equipment.
Background
In the field of natural language processing, a sequence is labeled (i.e., the part of speech of each word in the sequence is labeled), and then a text can be generated by using the labeled sequence.
In the related art, general sequence labeling methods include, for example: LSTM (Long Short-Term Memory), HMM (Hidden Markov Model), BERT (Bidirectional Encoder Representations from Transformers), CRF (Conditional Random Field), and the like.
In these ways, only local features of the sequence are concerned when the sequence is labeled, and context information of the sequence is ignored, so that the labeling accuracy is low, the labeling effect is poor, and the text generation quality is reduced.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the invention aims to provide a sequence tagging method, a sequence tagging device, a storage medium and computer equipment, which can improve the accuracy of part-of-speech tagging and the sequence tagging effect, thereby effectively assisting in generating texts.
In the sequence labeling method provided by the embodiment of the first aspect of the present invention, the sequence includes words to be labeled and labeled words, and the sequence is used for generating a text. The method includes: acquiring a sequence; identifying context information of a word to be labeled in the sequence; and determining a second part of speech of the word to be labeled according to the context information, in combination with a first part of speech of a labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
According to the sequence labeling method provided by the embodiment of the first aspect of the invention, the sequence is acquired, the context information of the word to be labeled in the sequence is identified, and the second part of speech of the word to be labeled is determined according to the context information and the first part of speech of the labeled word adjacent to the word to be labeled, the second part of speech being used for labeling the word to be labeled. Because the part of speech of the word to be labeled is labeled according to the context information of the word to be labeled in the sequence, the accuracy of part-of-speech labeling can be improved, the sequence labeling effect is improved, and text generation is effectively assisted.
The sequence labeling device provided by the embodiment of the second aspect of the present invention, where the sequence includes words to be labeled and labeled words, and the sequence is used to generate a text, includes: an obtaining module, configured to obtain a sequence; the identification module is used for identifying the context information of the words to be labeled in the sequence; and the labeling module is used for determining a second part of speech of the word to be labeled according to the context information and by combining with the first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
The sequence labeling device provided by the embodiment of the second aspect of the invention acquires the sequence, identifies the context information of the word to be labeled in the sequence, and determines the second part of speech of the word to be labeled according to the context information and the first part of speech of the labeled word adjacent to the word to be labeled, the second part of speech being used for labeling the word to be labeled. Because the part of speech of the word to be labeled is labeled according to the context information of the word to be labeled in the sequence, the accuracy of part-of-speech labeling and the sequence labeling effect can be improved, thereby effectively assisting in generating the text.
The computer-readable storage medium according to the embodiment of the third aspect of the present invention has a computer program stored thereon, where the computer program is used to implement the sequence annotation method according to the embodiment of the first aspect of the present invention when the computer program is executed by a processor.
The computer-readable storage medium provided by the embodiment of the third aspect of the present invention obtains the sequence, identifies the context information of the word to be tagged in the sequence, determines the second part of speech of the word to be tagged according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, where the second part of speech is used for tagging the word to be tagged, and tags the part of speech of the word to be tagged according to the context information of the word to be tagged in the sequence, so that the accuracy of part of speech tagging can be improved, the tagging effect of the sequence is improved, and thus the generation of a text is effectively assisted.
The computer equipment provided by the embodiment of the fourth aspect of the invention comprises a shell, a processor, a memory, a circuit board and a power supply circuit, wherein the circuit board is arranged inside a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the computer equipment; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing: acquiring a sequence, wherein the sequence comprises words to be labeled and labeled words, and the sequence is used for generating a text; identifying context information of words to be labeled in the sequence; and determining a second part of speech of the word to be labeled according to the context information and by combining with a first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
The computer device provided by the embodiment of the fourth aspect of the present invention obtains the sequence, identifies the context information of the word to be tagged in the sequence, determines the second part of speech of the word to be tagged according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, where the second part of speech is used for tagging the word to be tagged, and tags the part of speech of the word to be tagged according to the context information of the word to be tagged in the sequence, so that the accuracy of part of speech tagging can be improved, the tagging effect of the sequence is improved, and thus the generation of a text is effectively assisted.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a sequence tagging method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a sequence annotation method according to another embodiment of the present invention;
FIG. 3 is a flowchart illustrating a sequence annotation method according to another embodiment of the present invention;
FIG. 4 is a schematic flow chart of encoding using XLNET model in the embodiment of the present invention;
FIG. 5 is a flowchart illustrating a sequence annotation method according to another embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a sequence labeling apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a sequence labeling apparatus according to another embodiment of the present invention;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
Fig. 1 is a flowchart illustrating a sequence tagging method according to an embodiment of the present invention.
This embodiment is exemplified by the case where the sequence labeling method is configured in a sequence labeling apparatus.
The sequence labeling apparatus may be set in a server, or may be set in an electronic device, which is not limited in the embodiment of the present invention.
This embodiment takes as an example the case where the sequence labeling apparatus is set in an electronic device.
The electronic device may be, for example, a smart phone, a tablet computer, a personal digital assistant, an e-book reader, or another hardware device having an operating system.
It should be noted that the execution subject in the embodiment of the present invention may be, in terms of hardware, for example, a Central Processing Unit (CPU) in the electronic device, and, in terms of software, for example, a sequence labeling service in the electronic device, which is not limited thereto.
Referring to fig. 1, the method includes:
s101: a sequence is obtained.
The sequence includes a plurality of words, each of which may be a word to be labeled or a labeled word: a word that has not yet been labeled may be called a word to be labeled, and a word that has already been labeled may be called a labeled word. The sequence is specifically used for generating a text.
The sequence X is, for example, x1, x2, x3, x4, where x1, x2, x3, x4 may be referred to as words; x1, x2, x3, x4 are arranged and combined in a certain order to form the sequence X, and each word in the sequence X has its corresponding actual content, position, part of speech, and so on.
In the embodiment of the present invention, the words x1, x2, x3 and x4 in the sequence may each be labeled with a part of speech, for example, adjective or verb, or with another part of speech capable of describing the features of the word, which is not limited in this respect.
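As a minimal illustration of such a sequence (the words, tag names, and data structure below are hypothetical examples, not taken from the embodiment), a sequence containing labeled words and a word still to be labeled might be represented as:

```python
# A sequence is a list of (word, tag) pairs; tag is None for words still to be labeled.
# The words and the tag set are hypothetical.
sequence = [
    ("the", "DET"),    # labeled word: first part of speech known
    ("quick", "ADJ"),  # labeled word
    ("fox", None),     # word to be labeled: second part of speech not yet determined
    ("runs", "VERB"),  # labeled word
]

def words_to_label(seq):
    """Return the positions of words whose part of speech is not yet labeled."""
    return [i for i, (_, tag) in enumerate(seq) if tag is None]

print(words_to_label(sequence))  # → [2]
```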
S102: and identifying the context information of the words to be labeled in the sequence.
The context information of the word to be labeled consists of the above information and the below information.
As an example, take the sequence X of x1, x2, x3, x4 and assume that the current word to be labeled is x3. The contextual association information between x3 and the preceding words x1 and x2 may be referred to as the above information, and the contextual association information between x3 and the following word x4 may be referred to as the below information.
According to the invention, the context information of the word to be labeled in the sequence is identified, so that the identified context information can be adopted to assist in labeling the word to be labeled, and the labeling accuracy is improved.
In a specific implementation process, an XLNet model based on permutation with a two-stream self-attention mechanism may be used to identify the context information of the word to be labeled in the sequence.
As an example, the sequence X includes the words x1, x2, x3, x4 arranged in order. Assume the current word to be labeled is x3, so that the above information of x3 comes from x1 and x2, and the below information of x3 comes from x4. To let the XLNet model see both at once, the words in X may be rearranged: in the rearranged sequence X', the words appear in the order x4, x1, x2, x3. Each word keeps its actual content and its actual position in the original sequence X (x4 is still encoded as the fourth word, x1 as the first word, and so on). When the XLNet model processes X', the words preceding x3 in X' are x4, x1 and x2, so the above information of x3 in X' covers both the above information (x1, x2) and the below information (x4) of x3 in the original sequence X; the below information of x3 in the original sequence X can thus be derived in reverse from the above information of x3 in X', which is not limited thereto.
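The rearrangement in this example can be sketched as follows. This is a simplified illustration of XLNet-style permutation (the function name is hypothetical); the real model realizes the same effect through attention masks rather than by physically reordering the words:

```python
# Original sequence positions: x1..x4 sit at positions 0..3.
# The permutation [3, 0, 1, 2] processes the words in the order x4, x1, x2, x3,
# while each word keeps its actual position in the original sequence.
def visible_positions(permutation, target):
    """Positions (in the original sequence) visible when predicting `target`,
    i.e. the positions that precede `target` in the permutation order."""
    idx = permutation.index(target)
    return sorted(permutation[:idx])

perm = [3, 0, 1, 2]                # X' = x4, x1, x2, x3
print(visible_positions(perm, 2))  # predicting x3: sees x1, x2 (above) and x4 (below) → [0, 1, 3]
```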
S103: and determining a second part of speech of the word to be labeled according to the context information and by combining with the first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
The part of speech of the tagged word may be referred to as a first part of speech, and the part of speech of the word to be tagged may be referred to as a second part of speech.
In a specific execution process, a CRF (Conditional Random Field) can be combined with the XLNet model, so that the word to be labeled is labeled according to the context information and the part of speech of the labeled word. In this way, the labeling process considers not only the context information of the word to be labeled but also the parts of speech of the labeled words in the sequence, so that the labeling effect can be guaranteed in an all-round manner.
In the embodiment, the sequence is obtained, the context information of the word to be tagged in the sequence is identified, the second part of speech of the word to be tagged is determined according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, the second part of speech is used for tagging the word to be tagged, and the part of speech of the word to be tagged is tagged according to the context information of the word to be tagged in the sequence, so that the part of speech tagging accuracy can be improved, the tagging effect of the sequence is improved, and the generation of the text is effectively assisted.
Fig. 2 is a flowchart illustrating a sequence annotation method according to another embodiment of the invention.
Referring to fig. 2, the method includes:
s201: a sequence is obtained.
The sequence includes a plurality of words, each of which may be a word to be labeled or a labeled word: a word that has not yet been labeled may be called a word to be labeled, and a word that has already been labeled may be called a labeled word. The sequence is specifically used for generating a text.
The sequence X is, for example, x1, x2, x3, x4, where x1, x2, x3, x4 may be referred to as words; x1, x2, x3, x4 are arranged and combined in a certain order to form the sequence X, and each word in the sequence X has its corresponding actual content, position, part of speech, and so on.
In the embodiment of the present invention, the words x1, x2, x3 and x4 in the sequence may each be labeled with a part of speech, for example, adjective or verb, or with another part of speech capable of describing the features of the word, which is not limited in this respect.
S202: and sequencing context information of the words to be marked in the XLNET model identification sequence by adopting a double-current self-attention-based mechanism.
In a specific implementation process, the content stream in the hidden state and the query stream in the hidden state of the two-stream self-attention mechanism in the permutation-based XLNet model may specifically be used to identify the context information of the word to be labeled in the sequence, which is not limited thereto.
Specifically, the content stream in the hidden state can determine the context information of the labeled words and the actual content of the word to be labeled in the updated sequence (the updated sequence is obtained by correspondingly adjusting the actual positions of the words in the initial sequence, and is combined with the initial sequence to determine the context information of the word to be labeled). The query stream in the hidden state can determine the actual position of the word to be labeled in the initial sequence and its above information. The actual content of the word to be labeled refers to the text represented by that word. From the context information of the labeled words and the actual content of the word to be labeled, together with the actual position and above information of the word to be labeled in the initial sequence, the context information of the word to be labeled can be determined. The autoregressive model in the XLNet model then collects and records this context information, so that the identified context information can assist the subsequent part-of-speech labeling.
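The division of labor between the two streams can be sketched as attention masks (a simplified, hypothetical illustration rather than the exact computation of the embodiment): at each position, the content stream may attend to the position itself, while the query stream may not, since it must predict the word there without seeing its own content:

```python
def two_stream_masks(permutation):
    """Build content-stream and query-stream attention masks for a permutation.
    mask[i][j] is True if position i may attend to position j."""
    n = len(permutation)
    order = {pos: k for k, pos in enumerate(permutation)}
    # Content stream: sees every position up to and including itself in permutation order.
    content = [[order[j] <= order[i] for j in range(n)] for i in range(n)]
    # Query stream: sees strictly earlier positions only (not its own content).
    query = [[order[j] < order[i] for j in range(n)] for i in range(n)]
    return content, query

content, query = two_stream_masks([3, 0, 1, 2])
print(content[2])  # content stream at x3: sees x1, x2, x3 itself, and x4
print(query[2])    # query stream at x3: sees x1, x2, x4 but not x3 itself
```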
In the embodiment of the invention, the context information of the word to be labeled is identified based on the content flow in the hidden state and the query flow in the hidden state of the XLNET model obtained by initialization, so that a reasonable identification effect can be obtained, the identified context information is matched with the context content of the sequence, and the part-of-speech labeling of the word to be labeled subsequently is assisted.
S203: and determining a second part of speech of the word to be labeled according to the context information by combining the first part of speech and a random condition field CRF model.
The conditional random field (CRF) model may have learned the association between context information and the matching part-of-speech rules. Therefore, after the context information is identified, the corresponding part-of-speech rule may be determined from the context information in combination with the CRF model, and the second part of speech of the word to be labeled may then be determined from the labeled first part of speech in combination with that rule, which is not limited thereto.
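As a minimal sketch of how a CRF-style decoder can combine per-word scores (standing in here for the identified context information) with learned transition scores between parts of speech — all scores, tag names, and function names below are hypothetical:

```python
def viterbi(emissions, transitions, tags):
    """emissions: per-position {tag: score}; transitions: {(prev_tag, tag): score}.
    Returns the highest-scoring tag sequence (Viterbi decoding)."""
    best = {t: (emissions[0][t], [t]) for t in tags}
    for em in emissions[1:]:
        new_best = {}
        for t in tags:
            # Best previous tag given the transition into t.
            prev, (score, path) = max(
                ((p, best[p]) for p in tags),
                key=lambda kv: kv[1][0] + transitions[(kv[0], t)],
            )
            new_best[t] = (score + transitions[(prev, t)] + em[t], path + [t])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]

tags = ["ADJ", "NOUN"]
emissions = [{"ADJ": 2.0, "NOUN": 0.5}, {"ADJ": 0.2, "NOUN": 1.5}]
transitions = {("ADJ", "NOUN"): 1.0, ("ADJ", "ADJ"): -0.5,
               ("NOUN", "ADJ"): -0.5, ("NOUN", "NOUN"): 0.0}
print(viterbi(emissions, transitions, tags))  # → ['ADJ', 'NOUN']
```

The transition scores play the role of the learned part-of-speech rules: an adjective followed by a noun scores higher than two adjectives in a row, so the decoder picks the tag sequence that is best overall, not just best per word.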
In this embodiment, the sequence is acquired, the context information of the word to be labeled in the sequence is identified by the permutation-based two-stream self-attention XLNet model, and the second part of speech of the word to be labeled is determined according to the context information in combination with the first part of speech and the conditional random field (CRF) model. Because the part of speech of the word to be labeled is determined by combining the permutation-based XLNet model with the CRF model, the labeling process is not limited by the actual position of the word to be labeled, the part of speech is determined more flexibly, and a more appropriate labeling effect is ensured. The context information of the word to be labeled is identified based on the content stream and the query stream in the hidden state of the initialized XLNet model, so that a reasonable identification effect can be obtained: the identified context information matches the contextual content of the sequence and assists the subsequent part-of-speech labeling of the word to be labeled.
Fig. 3 is a flowchart illustrating a sequence annotation method according to another embodiment of the invention.
S301: a sequence is obtained.
The sequence includes a plurality of words, each of which may be a word to be labeled or a labeled word: a word that has not yet been labeled may be called a word to be labeled, and a word that has already been labeled may be called a labeled word. The sequence is specifically used for generating a text.
The sequence X is, for example, x1, x2, x3, x4, where x1, x2, x3, x4 may be referred to as words; x1, x2, x3, x4 are arranged and combined in a certain order to form the sequence X, and each word in the sequence X has its corresponding actual content, position, part of speech, and so on.
In the embodiment of the present invention, the words x1, x2, x3 and x4 in the sequence may each be labeled with a part of speech, for example, adjective or verb, or with another part of speech capable of describing the features of the word, which is not limited in this respect.
S302: a word embedding vector for the sequence is obtained.
The word embedding vector is a vector in which words or phrases from the vocabulary are mapped to real numbers, i.e., the words or phrases are represented in vector form.
Referring to fig. 4, fig. 4 is a schematic flow chart of encoding with the XLNet model in the embodiment of the present invention. The word embedding vector of the sequence X is e(x), and the word embedding vectors of the words x1, x2, x3 and x4 in the sequence are e(x1), e(x2), e(x3) and e(x4), respectively.
In a specific implementation process, the word embedding vector e (x) includes multidimensional information, and the multidimensional information comprehensively represents a numerical value corresponding to actual content that the word can represent.
As an example, the word embedding vector of the word "happy" is e(x) = {0.789, 0.674, 0.506, -0.705, -0.305, 0.587, ...}, where the values in the braces are the vector values corresponding to "happy". A word embedding vector contains a plurality of values that together represent the actual content the word can express.
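A word embedding lookup can be sketched as a table from words to real-valued vectors (the vocabulary, dimension, and values below are hypothetical; real embedding vectors typically have hundreds of dimensions and are learned rather than fixed):

```python
# Hypothetical embedding table: each word maps to a fixed-length real-valued vector.
embedding_table = {
    "happy": [0.789, 0.674, 0.506, -0.705, -0.305, 0.587],
    "city":  [0.120, -0.330, 0.910, 0.044, -0.512, 0.233],
}

def embed(sequence, table, dim=6):
    """Map each word of the sequence to its embedding vector;
    unknown words fall back to a zero vector."""
    return [table.get(word, [0.0] * dim) for word in sequence]

vectors = embed(["happy", "city", "unknown-word"], embedding_table)
print(len(vectors), len(vectors[0]))  # → 3 6 (three words, six dimensions each)
```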
By acquiring the word embedding vector of the sequence, the data dimensionality in the sequence labeling process can be effectively reduced, which makes identifying the actual content during labeling more convenient, reduces the error rate of directly identifying the character content of words, effectively improves the identification effect of the actual content, and guarantees the sequence labeling precision from multiple angles.
S303: and initializing the zero-level processing logic of the content flow of the hidden state of the XLNET model by adopting the word embedding vector, and initializing the inquiry flow of the hidden state of the XLNET model into a variable.
Referring to FIG. 4, h_i^(m) represents the content stream in the hidden state, where m takes the values 1-2 and denotes the coding layer of the two-stream self-attention mechanism, and i indexes the corresponding word in the sequence; g_i^(n) represents the query stream in the hidden state, where n takes the values 1-2 and likewise denotes the coding layer of the two-stream self-attention mechanism. The zeroth layer of the content stream h_i^(m) of the XLNet model is initialized as the word embedding vector e(x), and the query stream g_i^(n) is initialized as the variable w.
By initializing the zeroth-layer processing logic of the content stream in the hidden state of the XLNet model with the word embedding vector, and initializing the query stream in the hidden state of the XLNet model as a variable, the context information of the word to be labeled can be taken into account, and the logical conflicts that would otherwise arise when predicting a plurality of words to be labeled in the sequence are avoided.
S304: and taking the text as the input of the XLNET model obtained by initialization, and determining the code corresponding to the text by using the XLNET model obtained by initialization.
For example, if the sequence X is x1, x2, x3, x4, where each word corresponds to an actual content (the actual content is the sub-text described by the word; if the word is "happy", the sub-text it describes is the text "happy"), the sub-texts corresponding to the words are concatenated to obtain the text.
The initialized XLNet model has learned in advance the correspondence between sample words to be labeled and the sample codes calibrated for those sample words. The text resulting from the above concatenation can therefore be used as the input of the initialized XLNet model, so that the initialized XLNet model determines the code corresponding to each word in the text, and the combination of the codes corresponding to the words is taken as the code corresponding to the text. The code corresponding to the text carries the context information among the words in the text. By taking the text as the input of the initialized XLNet model and using it to determine the code corresponding to the text, the amount of computation in the sequence labeling process can be effectively reduced; and because the text is encoded in order to determine the context information among its words, a better context recognition effect can be obtained.
S305: and identifying the context information of the word to be marked by combining the code corresponding to the word to be marked according to the code corresponding to the text.
As an example, in a piece of text such as "Beijing is a beautiful city, I love it", there is a pronoun "it"; from the context information carried in its code, it can be recognized that "it" refers to the city "Beijing".
The text is used as the input of the initialized XLNet model to determine the code corresponding to the text, and the code corresponding to the text carries the context information among the words in the text. Therefore, the context information of the word to be labeled can be identified from the code corresponding to the text in combination with the code corresponding to the word to be labeled.
S306: and inputting the context information of the words to be labeled into the full connection layer.
The initialized XLNet model in the embodiment of the present invention includes a fully connected layer, which may be described in detail below with reference to fig. 5. The fully connected layer may specifically be a fully connected layer of a CNN (Convolutional Neural Network) structure: each neuron in the fully connected layer is connected to all neurons in the previous layer, so the fully connected layer can integrate the class-discriminative local information in the convolutional or pooling layers. To improve the performance of the CNN network, a ReLU (Rectified Linear Unit) function is generally used as the activation function of each neuron of the fully connected layer. In the embodiment of the invention, the fully connected layer of the neural network model is integrated with the XLNet model, the zeroth-layer processing logic of the content stream in the hidden state of the XLNet model is initialized with the word embedding vector, and the query stream in the hidden state is initialized as a variable, so that the initialized XLNet model can accurately identify the context information of the word to be labeled.
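The fully connected layer with a ReLU activation described above can be sketched as follows (the layer sizes, weights, and biases are hypothetical; in a real network they are learned):

```python
def relu(x):
    """ReLU activation: zero out negative values, pass positive values through."""
    return [max(0.0, v) for v in x]

def fully_connected(inputs, weights, biases):
    """Each output neuron is connected to every input: out_j = sum_i in_i * w[j][i] + b[j]."""
    return [sum(i * wt for i, wt in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]

# Hypothetical 3-input, 2-output layer.
w = [[0.5, -1.0, 0.25],
     [1.0,  0.5, -0.5]]
b = [0.1, 2.0]
out = relu(fully_connected([1.0, 2.0, 4.0], w, b))
print(out)  # → [0.0, 2.0]: first neuron's negative pre-activation is clipped to zero
```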
S307: and determining the probability distribution condition of a plurality of candidate parts of speech corresponding to the words to be labeled by combining the result output by the full connection layer with the first part of speech and the CRF model.
The result output by the fully connected layer of the initialized XLNet model accurately describes the context information of the word to be labeled.
The candidate parts of speech are the possible parts of speech determined for the word to be labeled. There may be one or more candidate parts of speech; when there are multiple candidates, the most suitable one may be determined as the second part of speech according to the probability distribution over the candidate parts of speech.
S308: and determining a second part of speech of the word to be labeled according to the probability distribution condition of the candidate parts of speech.
As an example, referring to fig. 5, fig. 5 is a schematic flowchart of a sequence labeling method according to another embodiment of the present invention. The XLNet model 51 obtained by initialization in fig. 5 includes a fully connected layer. A text formed by splicing sequences X including words to be labeled is input into the XLNet model; the content stream h_i^(m) in the hidden state of the XLNet model is initialized with the word embedding vector e(x) of the text, and the query stream g_i^(n) in the hidden state is initialized with a variable w. The text sequence X is then encoded with the dual-stream self-attention mechanism, and the obtained encoding result is input into the fully connected layer. The context information of the word to be labeled is determined according to the output of the fully connected layer; the output of the fully connected layer is then labeled in combination with the first part of speech and the CRF model, giving the probability distribution of a plurality of candidate parts of speech, which can be expressed as a 1×n matrix. Substituting this matrix into the normalized exponential (softmax) function yields a 1×n matrix of sequence labels, and the candidate part of speech with the highest probability is selected as the labeling result of the word to be labeled.
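The final normalization step of the flow above can be sketched as follows. The three-tag candidate set and the scores are invented for illustration; the point is only that the normalized exponential (softmax) function turns the 1×n score matrix into a 1×n probability matrix, from which the highest-probability candidate is picked:

```python
import numpy as np

def softmax(scores):
    # Normalized exponential function: turns the 1 x n score matrix into
    # a 1 x n probability distribution over candidate parts of speech.
    e = np.exp(scores - scores.max())  # shift by max for numerical stability
    return e / e.sum()

candidate_tags = ["noun", "verb", "adjective"]   # hypothetical tag set
scores = np.array([2.1, 0.3, -1.0])              # hypothetical FC-layer + CRF scores
probs = softmax(scores)
# The candidate part of speech with the highest probability is selected
# as the labeling result of the word to be labeled.
second_pos = candidate_tags[int(np.argmax(probs))]
```

With these made-up scores the distribution concentrates on the first tag, so `second_pos` is `"noun"`.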
The probability distribution of the plurality of candidate parts of speech corresponding to the word to be labeled is determined from the result output by the fully connected layer in combination with the first part of speech and the CRF model, and the second part of speech of the word to be labeled is determined according to that probability distribution. Because the part of speech on which the probability concentrates can be selected according to the probability distribution of the candidate parts of speech, the accuracy of selecting the part of speech of the word to be labeled is effectively improved.
In this embodiment, acquiring the word embedding vector of the sequence can effectively reduce the data dimension in the sequence labeling process, making it easier to identify the actual content during labeling, reducing the error rate of directly identifying the character content of words, effectively improving the identification of actual content, and guaranteeing sequence labeling precision from multiple angles. By initializing the zeroth-layer processing logic of the content stream in the hidden state of the XLNet model with the word embedding vector, and initializing the query stream in the hidden state of the XLNet model as a variable, both the above and below context of the word to be labeled can be taken into account, and the logical conflict that arises when predicting a plurality of words to be labeled in the sequence is avoided. Using the text as the input of the XLNet model obtained by initialization, so that this model determines the code corresponding to the text, can effectively reduce the amount of computation in the sequence labeling process; encoding the text to determine the context information among the words in the text yields a better context recognition effect. The context information of the word to be labeled can be identified by combining the code corresponding to the text with the code corresponding to the word to be labeled. Representing the context information of each word in coded form makes it convenient for the electronic device to accurately identify the context information of the word to be labeled and to correctly understand its meaning in context.
By integrating the fully connected layer in the neural network model with the XLNet model, initializing the zeroth-layer processing logic of the content stream in the hidden state of the XLNet model with the word embedding vector, and initializing the query stream in the hidden state of the XLNet model as a variable, the XLNet model obtained by initialization can accurately identify the context information of the word to be labeled. The probability distribution of the plurality of candidate parts of speech corresponding to the word to be labeled is determined from the result output by the fully connected layer in combination with the first part of speech and the CRF model, and the second part of speech of the word to be labeled is determined according to that probability distribution; selecting the part of speech on which the probability concentrates effectively improves the accuracy of part-of-speech selection for the word to be labeled.
Fig. 6 is a schematic structural diagram of a sequence labeling apparatus according to an embodiment of the present invention.
Referring to fig. 6, the apparatus 600 includes:
an obtaining module 601, configured to obtain a sequence;
the identification module 602 is configured to identify context information of a word to be labeled in a sequence;
the tagging module 603 is configured to determine, according to the context information, a second part of speech of the word to be tagged in combination with the first part of speech of the tagged word adjacent to the word to be tagged, where the second part of speech is used for tagging the word to be tagged.
Optionally, in some embodiments, the identifying module 602 is specifically configured to:
and identifying the context information of the word to be labeled in the sequence by using an XLNet permutation language model based on a dual-stream self-attention mechanism.
Optionally, in some embodiments, the labeling module 603 is specifically configured to:
and determining the second part of speech of the word to be labeled according to the context information in combination with the first part of speech and a conditional random field (CRF) model.
Optionally, in some embodiments, referring to fig. 7, the identifying module 602 includes:
an obtaining submodule 6021 for obtaining a word embedding vector of the sequence;
the initialization submodule 6022 is configured to initialize a zero-level processing logic of the content stream in the hidden state of the XLNet model by using the word embedding vector, and initialize the query stream in the hidden state of the XLNet model as a variable;
the identifier module 6023 is configured to identify context information of the word to be labeled by using the XLNet model obtained through initialization.
Optionally, in some embodiments, referring to fig. 7, the XLNet model obtained by initialization has learned the correspondence between sample words to be labeled and sample codes calibrated in advance for those sample words, and the identifying submodule 6023 is specifically configured to:
the text is used as the input of an XLNET model obtained by initialization, so that the XLNET model obtained by initialization is adopted to determine the code corresponding to the text;
and identifying the context information of the word to be marked by combining the code corresponding to the word to be marked according to the code corresponding to the text.
Optionally, in some embodiments, referring to fig. 7, the XLNet model obtained by initialization includes: a content stream in a hidden state and a query stream in the hidden state, and the identifying submodule 6023 includes:
a first determining unit 60231, configured to determine, through the content stream in the hidden state, context information of a tagged word in the updated sequence and actual content of a word to be tagged; the updated sequence is obtained by correspondingly adjusting the actual positions of all the words in the initial sequence, and the initial sequence is the obtained sequence;
a second determining unit 60232, configured to determine, through the query stream in the hidden state, the actual position and the above information of the word to be annotated in the initial sequence;
a third determining unit 60233, configured to determine the context information of the word to be annotated according to the context information of the word to be annotated and the actual content of the word to be annotated, as well as the actual position and the context information of the word to be annotated in the initial sequence;
and the above information and the below information of the word to be labeled are used together as the context information.
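The content-stream/query-stream split handled by these units can be sketched roughly as follows. The dimensions and tokens are toy values and a random generator stands in for trained embeddings; the sketch only shows the initialization contrast: the zeroth layer of the content stream is filled with the word embedding vectors e(x) (so it sees each word's actual content), while the zeroth layer of the query stream repeats a single shared variable w (so it carries position but not content):

```python
import numpy as np

d = 8                                          # hypothetical hidden size
tokens = ["sequence", "[MASK]", "labeling"]    # word to be labeled at index 1

rng = np.random.default_rng(1)
embed = {t: rng.normal(size=d) for t in tokens}  # word embedding vectors e(x)

# Zeroth layer of the content stream: initialized with the word embeddings,
# so it exposes the actual content of every word in the sequence.
h0 = np.stack([embed[t] for t in tokens])

# Zeroth layer of the query stream: initialized with one shared trainable
# variable w, identical at every position, so no content leaks through it.
w = rng.normal(size=d)
g0 = np.tile(w, (len(tokens), 1))
```

In a real XLNet both streams are then updated layer by layer through dual-stream self-attention; this sketch stops at layer zero.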
Optionally, in some embodiments, referring to fig. 7, the XLNet model obtained by initialization includes a fully connected layer, and the labeling module 603 includes:
an input submodule 6031 for inputting context information of a word to be annotated into the full connection layer;
a first determining submodule 6032, configured to determine, by combining the first part of speech and the CRF model, probability distribution conditions of multiple candidate parts of speech corresponding to the word to be tagged from the result output by the full connection layer;
and the second determining submodule 6033 is configured to determine a second part of speech of the word to be tagged according to the probability distribution of the multiple candidate parts of speech.
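One toy way to picture how the first determining submodule might combine the fully-connected-layer output with the first part of speech is a CRF-style score sum over candidate tags. The tag set, scores, and transition matrix below are all invented for illustration and are not taken from the patent:

```python
import numpy as np

tags = ["noun", "verb", "adjective"]       # hypothetical candidate tag set
emission = np.array([1.2, 0.4, -0.3])      # hypothetical FC-layer scores for the word

# Hypothetical CRF transition scores: row = first part of speech of the
# adjacent labeled word, column = candidate second part of speech.
transition = np.array([
    [0.1, 0.9, 0.2],   # from "noun"
    [0.8, 0.1, 0.3],   # from "verb"
    [0.7, 0.2, 0.1],   # from "adjective"
])

first_pos = "verb"     # first part of speech of the adjacent labeled word
scores = emission + transition[tags.index(first_pos)]
second_pos = tags[int(np.argmax(scores))]   # highest combined score wins
```

A full CRF would decode the whole sequence jointly (e.g. with Viterbi); this single-word sum only shows how the adjacent word's tag biases the choice.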
It should be noted that the explanation of the sequence labeling method in the foregoing embodiments of fig. 1 to 5 also applies to the sequence labeling apparatus 600 of this embodiment; the implementation principle is similar, and thus the detailed description is omitted here.
In the embodiment, the sequence is obtained, the context information of the word to be tagged in the sequence is identified, the second part of speech of the word to be tagged is determined according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, the second part of speech is used for tagging the word to be tagged, and the part of speech of the word to be tagged is tagged according to the context information of the word to be tagged in the sequence, so that the part of speech tagging accuracy can be improved, the tagging effect of the sequence is improved, and the generation of the text is effectively assisted.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Referring to fig. 8, the computer device 80 of the present embodiment includes: the device comprises a shell 801, a processor 802, a memory 803, a circuit board 804 and a power supply circuit 805, wherein the circuit board 804 is arranged in a space enclosed by the shell 801, and the processor 802 and the memory 803 are arranged on the circuit board 804; a power supply circuit 805 for supplying power to each circuit or device of the computer apparatus 80; the memory 803 is used to store executable program code; wherein, the processor 802 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 803, for executing:
acquiring a sequence, wherein the sequence comprises words to be labeled and labeled words and is used for generating a text;
identifying context information of a word to be marked in the sequence;
and determining a second part of speech of the word to be labeled according to the context information and by combining with the first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
It should be noted that the foregoing explanations on the embodiments of the sequence labeling method in fig. 1 to fig. 5 also apply to the computer device 80 of the embodiment, and the implementation principles thereof are similar and will not be described herein again.
In the embodiment, the sequence is obtained, the context information of the word to be tagged in the sequence is identified, the second part of speech of the word to be tagged is determined according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, the second part of speech is used for tagging the word to be tagged, and the part of speech of the word to be tagged is tagged according to the context information of the word to be tagged in the sequence, so that the part of speech tagging accuracy can be improved, the tagging effect of the sequence is improved, and the generation of the text is effectively assisted.
In order to implement the foregoing embodiments, the present application provides a computer-readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the sequence tagging method of the foregoing method embodiments.
In the embodiment, the sequence is obtained, the context information of the word to be tagged in the sequence is identified, the second part of speech of the word to be tagged is determined according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, the second part of speech is used for tagging the word to be tagged, and the part of speech of the word to be tagged is tagged according to the context information of the word to be tagged in the sequence, so that the part of speech tagging accuracy can be improved, the tagging effect of the sequence is improved, and the generation of the text is effectively assisted.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (16)

1. A sequence labeling method is characterized in that the sequence comprises words to be labeled and labeled words, the sequence is used for generating texts, and the method comprises the following steps:
acquiring a sequence;
identifying context information of words to be labeled in the sequence;
and determining a second part of speech of the word to be labeled according to the context information and by combining with a first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
2. The sequence annotation method of claim 1, wherein the identifying context information of the words to be annotated in the sequence comprises:
identifying the context information of the word to be labeled in the sequence by using an XLNet permutation language model based on a dual-stream self-attention mechanism.
3. The sequence tagging method of claim 2, wherein said determining a second part-of-speech of said word to be tagged in combination with a first part-of-speech of a tagged word adjacent to said word to be tagged based on said context information comprises:
determining the second part of speech of the word to be labeled according to the context information in combination with the first part of speech and a conditional random field CRF model.
4. The sequence annotation method of claim 2, wherein the identifying context information of the word to be annotated using a dual-stream auto-attention-based ranking XLNet model comprises:
obtaining a word embedding vector of the sequence;
initializing a zero-level processing logic of the content flow of the XLNET model in the hidden state by adopting the word embedding vector, and initializing the query flow of the XLNET model in the hidden state as a variable;
and identifying the context information of the word to be marked by adopting the XLNET model obtained by initialization.
5. The sequence labeling method of claim 4, wherein the XLNet model obtained by initialization has learned the correspondence between sample words to be labeled and sample codes calibrated in advance for those sample words, and the identifying the context information of the word to be labeled by using the XLNet model obtained by initialization comprises:
taking the text as the input of the XLNET model obtained by the initialization, so as to determine the code corresponding to the text by the XLNET model obtained by the initialization;
and identifying the context information of the word to be marked by combining the code corresponding to the word to be marked according to the code corresponding to the text.
6. The sequence labeling method of claim 4, wherein the XLNet model obtained by initialization comprises: a content stream in a hidden state and a query stream in the hidden state, and the identifying the context information of the word to be labeled by using the XLNet model obtained by initialization comprises:
determining context information of the marked words and actual contents of the words to be marked in the updated sequence through the content stream in the hidden state; the updated sequence is obtained by correspondingly adjusting the actual positions of all the words in the initial sequence, and the initial sequence is the obtained sequence;
determining the actual position and the above information of the word to be marked in the initial sequence through the query stream in the hidden state;
determining the context information of the word to be marked according to the context information of the marked word and the actual content of the word to be marked, as well as the actual position and the context information of the word to be marked in the initial sequence;
and taking the above information and the below information of the word to be marked together as the context information.
7. The sequence labeling method of claim 2, wherein the XLNet model obtained by initialization comprises: a fully connected layer, and the determining the second part of speech of the word to be labeled according to the context information in combination with the first part of speech and a conditional random field CRF model comprises:
inputting the context information of the word to be marked into the full connection layer;
determining the probability distribution condition of a plurality of candidate parts of speech corresponding to the word to be labeled by combining the result output by the full connection layer with the first part of speech and the CRF model;
and determining a second part of speech of the word to be labeled according to the probability distribution condition of the candidate parts of speech.
8. A sequence annotation device, wherein the sequence comprises words to be annotated and words that have been annotated, and wherein the sequence is used to generate text, the device comprising:
an obtaining module, configured to obtain a sequence;
the identification module is used for identifying the context information of the words to be labeled in the sequence;
and the labeling module is used for determining a second part of speech of the word to be labeled according to the context information and by combining with the first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
9. The sequence tagging device of claim 8, wherein said identification module is specifically configured to:
identifying the context information of the word to be labeled in the sequence by using an XLNet permutation language model based on a dual-stream self-attention mechanism.
10. The sequence labeling apparatus of claim 9, wherein the labeling module is specifically configured to:
determining the second part of speech of the word to be labeled according to the context information in combination with the first part of speech and a conditional random field CRF model.
11. The sequence annotation apparatus of claim 9, wherein the identification module comprises:
an obtaining submodule for obtaining word embedding vectors of the sequence;
the initialization submodule is used for initializing the zero-level processing logic of the content flow of the XLNET model in the hidden state by adopting the word embedding vector and initializing the query flow of the XLNET model in the hidden state as a variable;
and the identification submodule is used for identifying the context information of the word to be labeled by adopting the XLNET model obtained by initialization.
12. The sequence labeling apparatus of claim 11, wherein the XLNet model obtained by initialization has learned the correspondence between sample words to be labeled and sample codes calibrated in advance for those sample words, and the identification submodule is specifically configured to:
taking the text as the input of the XLNET model obtained by the initialization, so as to determine the code corresponding to the text by the XLNET model obtained by the initialization;
and identifying the context information of the word to be marked by combining the code corresponding to the word to be marked according to the code corresponding to the text.
13. The sequence labeling apparatus of claim 11, wherein the XLNet model obtained by initialization comprises: a content stream in a hidden state and a query stream in the hidden state, and the identification submodule comprises:
a first determining unit, configured to determine, through the content stream in the hidden state, context information of the tagged word in the updated sequence and actual content of the word to be tagged; the updated sequence is obtained by correspondingly adjusting the actual positions of all the words in the initial sequence, and the initial sequence is the obtained sequence;
a second determining unit, configured to determine, through the query stream in the hidden state, the actual position and the above information of the word to be labeled in the initial sequence;
a third determining unit, configured to determine context information of the word to be labeled according to the context information of the labeled word and the actual content of the word to be labeled, as well as the actual position and the context information of the word to be labeled in the initial sequence;
and taking the above information and the below information of the word to be marked together as the context information.
14. The sequence labeling apparatus of claim 9, wherein the XLNet model obtained by initialization comprises: a fully connected layer, and the labeling module comprises:
the input submodule is used for inputting the context information of the words to be labeled into the full connection layer;
the first determining submodule is used for determining the probability distribution condition of a plurality of candidate parts of speech corresponding to the word to be labeled by combining the first part of speech and the CRF model according to the result output by the full connection layer;
and the second determining submodule is used for determining the second part of speech of the word to be labeled according to the probability distribution condition of the candidate parts of speech.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the sequence labeling method of any one of claims 1 to 7.
16. A computer device comprising a housing, a processor, a memory, a circuit board, and a power circuit, wherein the circuit board is disposed inside a space enclosed by the housing, the processor and the memory being disposed on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the computer equipment; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing:
acquiring a sequence, wherein the sequence comprises words to be labeled and labeled words, and the sequence is used for generating a text;
identifying context information of words to be labeled in the sequence;
and determining a second part of speech of the word to be labeled according to the context information and by combining with a first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
CN202010123296.1A 2020-02-27 2020-02-27 Sequence labeling method and device, storage medium and computer equipment Pending CN111353295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010123296.1A CN111353295A (en) 2020-02-27 2020-02-27 Sequence labeling method and device, storage medium and computer equipment


Publications (1)

Publication Number Publication Date
CN111353295A true CN111353295A (en) 2020-06-30

Family

ID=71194065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010123296.1A Pending CN111353295A (en) 2020-02-27 2020-02-27 Sequence labeling method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN111353295A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112987940A (en) * 2021-04-27 2021-06-18 广州智品网络科技有限公司 Input method and device based on sample probability quantization and electronic equipment
CN113255343A (en) * 2021-06-21 2021-08-13 中国平安人寿保险股份有限公司 Semantic identification method and device for label data, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE68923981D1 (en) * 1988-02-05 1995-10-05 At & T Corp Process for determining parts of text and use.
CN103077164A (en) * 2012-12-27 2013-05-01 新浪网技术(中国)有限公司 Text analysis method and text analyzer
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN110196967A (en) * 2019-06-05 2019-09-03 腾讯科技(深圳)有限公司 Sequence labelling method and apparatus based on depth converting structure




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200630