CN111353295A - Sequence labeling method and device, storage medium and computer equipment - Google Patents


Info

Publication number
CN111353295A
Authority
CN
China
Prior art keywords
word
sequence
labeled
speech
context information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010123296.1A
Other languages
Chinese (zh)
Inventor
周玥
胡盼盼
佟博
赵茜
张超
黄仲强
张坚琳
廖凤玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Bozhilin Robot Co Ltd
Original Assignee
Guangdong Bozhilin Robot Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Bozhilin Robot Co Ltd filed Critical Guangdong Bozhilin Robot Co Ltd
Priority to CN202010123296.1A priority Critical patent/CN111353295A/en
Publication of CN111353295A publication Critical patent/CN111353295A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a sequence labeling method, a device, a storage medium and computer equipment. The method includes: acquiring a sequence, wherein the sequence comprises words to be labeled and labeled words, and the sequence is used for generating a text; identifying context information of a word to be labeled in the sequence; and determining a second part of speech of the word to be labeled according to the context information, in combination with a first part of speech of a labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled. Because the part of speech of the word to be labeled is labeled according to the context information of the word to be labeled in the sequence, the accuracy of part-of-speech labeling can be improved, the sequence labeling effect is improved, and text generation is effectively assisted.

Description

Sequence labeling method and device, storage medium and computer equipment
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a sequence labeling method and device, a storage medium, and computer equipment.
Background
In the field of natural language processing, a sequence is labeled (i.e., the part of speech of each word in the sequence is labeled), and then a text can be generated by using the labeled sequence.
In the related art, general sequence labeling methods include, for example: LSTM (Long Short-Term Memory), HMM (Hidden Markov Model), BERT (Bidirectional Encoder Representations from Transformers), CRF (Conditional Random Field), and the like.
In these ways, only local features of the sequence are concerned when the sequence is labeled, and context information of the sequence is ignored, so that the labeling accuracy is low, the labeling effect is poor, and the text generation quality is reduced.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the invention aims to provide a sequence tagging method, a sequence tagging device, a storage medium and computer equipment, which can improve the accuracy of part-of-speech tagging and the sequence tagging effect, thereby effectively assisting in generating texts.
In the sequence labeling method provided by the embodiment of the first aspect of the present invention, the sequence includes words to be labeled and labeled words, and the sequence is used for generating a text. The method includes: acquiring a sequence; identifying context information of a word to be labeled in the sequence; and determining a second part of speech of the word to be labeled according to the context information, in combination with a first part of speech of a labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
According to the sequence labeling method provided by the embodiment of the first aspect of the invention, the sequence is acquired, the context information of the word to be labeled in the sequence is identified, and the second part of speech of the word to be labeled is determined according to the context information and the first part of speech of the labeled word adjacent to the word to be labeled, the second part of speech being used for labeling the word to be labeled. Because the part of speech of the word to be labeled is labeled according to the context information of the word to be labeled in the sequence, the accuracy of part-of-speech labeling can be improved, the sequence labeling effect is improved, and text generation is effectively assisted.
The sequence labeling device provided by the embodiment of the second aspect of the present invention, where the sequence includes words to be labeled and labeled words, and the sequence is used to generate a text, includes: an obtaining module, configured to obtain a sequence; the identification module is used for identifying the context information of the words to be labeled in the sequence; and the labeling module is used for determining a second part of speech of the word to be labeled according to the context information and by combining with the first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
The sequence labeling device provided by the embodiment of the second aspect of the invention acquires the sequence, identifies the context information of the word to be labeled in the sequence, and determines the second part of speech of the word to be labeled according to the context information and the first part of speech of the labeled word adjacent to the word to be labeled, the second part of speech being used for labeling the word to be labeled. Because the part of speech of the word to be labeled is labeled according to the context information of the word to be labeled in the sequence, the accuracy of part-of-speech labeling and the sequence labeling effect can be improved, thereby effectively assisting in generating the text.
The computer-readable storage medium according to the embodiment of the third aspect of the present invention has a computer program stored thereon, where the computer program is used to implement the sequence annotation method according to the embodiment of the first aspect of the present invention when the computer program is executed by a processor.
The computer-readable storage medium provided by the embodiment of the third aspect of the present invention obtains the sequence, identifies the context information of the word to be tagged in the sequence, determines the second part of speech of the word to be tagged according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, where the second part of speech is used for tagging the word to be tagged, and tags the part of speech of the word to be tagged according to the context information of the word to be tagged in the sequence, so that the accuracy of part of speech tagging can be improved, the tagging effect of the sequence is improved, and thus the generation of a text is effectively assisted.
The computer equipment provided by the embodiment of the fourth aspect of the invention comprises a shell, a processor, a memory, a circuit board and a power supply circuit, wherein the circuit board is arranged inside a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the computer equipment; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing: acquiring a sequence, wherein the sequence comprises words to be labeled and labeled words, and the sequence is used for generating a text; identifying context information of words to be labeled in the sequence; and determining a second part of speech of the word to be labeled according to the context information and by combining with a first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
The computer device provided by the embodiment of the fourth aspect of the present invention obtains the sequence, identifies the context information of the word to be tagged in the sequence, determines the second part of speech of the word to be tagged according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, where the second part of speech is used for tagging the word to be tagged, and tags the part of speech of the word to be tagged according to the context information of the word to be tagged in the sequence, so that the accuracy of part of speech tagging can be improved, the tagging effect of the sequence is improved, and thus the generation of a text is effectively assisted.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a sequence tagging method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a sequence annotation method according to another embodiment of the present invention;
FIG. 3 is a flowchart illustrating a sequence annotation method according to another embodiment of the present invention;
FIG. 4 is a schematic flow chart of encoding using XLNET model in the embodiment of the present invention;
FIG. 5 is a flowchart illustrating a sequence annotation method according to another embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a sequence labeling apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a sequence labeling apparatus according to another embodiment of the present invention;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
Fig. 1 is a flowchart illustrating a sequence tagging method according to an embodiment of the present invention.
This embodiment is exemplified by the case where the sequence labeling method is configured in a sequence labeling apparatus.
The sequence labeling apparatus may be set in a server, or may be set in an electronic device, which is not limited in the embodiment of the present invention.
This embodiment takes as an example the case where the sequence labeling apparatus is set in an electronic device.
The electronic device may be, for example, a smart phone, a tablet computer, a personal digital assistant, an e-book reader, or another hardware device having an operating system.
It should be noted that the execution subject in the embodiment of the present invention may be, in terms of hardware, for example, a Central Processing Unit (CPU) in the electronic device, and, in terms of software, for example, a sequence labeling service in the electronic device, which is not limited thereto.
Referring to fig. 1, the method includes:
s101: a sequence is obtained.
The sequence includes a plurality of words, each of which may be a word to be labeled or a labeled word: a word that has not yet been labeled may be called a word to be labeled, and a word that has already been labeled may be called a labeled word. The sequence is specifically used for generating a text.
The sequence X is, for example, x1, x2, x3, x4, where x1, x2, x3, x4 may be referred to as words; x1, x2, x3, x4 are arranged and combined in a certain order to form the sequence X, and each word in the sequence X has its corresponding actual content, position, part of speech, and so on.
In the embodiment of the present invention, the words x1, x2, x3 and x4 in the sequence may each be labeled with a part of speech, for example, adjective or verb, or with another part of speech capable of describing the features of the word, which is not limited in this respect.
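As a minimal illustration of such a sequence (the words, tag names, and data structure below are hypothetical examples, not taken from the embodiment), a sequence containing labeled words and a word still to be labeled might be represented as:

```python
# A sequence is a list of (word, tag) pairs; tag is None for words still to be labeled.
# The words and the tag set are hypothetical.
sequence = [
    ("the", "DET"),    # labeled word: first part of speech known
    ("quick", "ADJ"),  # labeled word
    ("fox", None),     # word to be labeled: second part of speech not yet determined
    ("runs", "VERB"),  # labeled word
]

def words_to_label(seq):
    """Return the positions of words whose part of speech is not yet labeled."""
    return [i for i, (_, tag) in enumerate(seq) if tag is None]

print(words_to_label(sequence))  # → [2]
```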
S102: and identifying the context information of the words to be labeled in the sequence.
The context information of the word to be labeled consists of the above information and the below information.
As an example, take the sequence X of x1, x2, x3, x4 and assume that the current word to be labeled is x3. The contextual association information between x3 and the preceding words x1 and x2 may be referred to as the above information, and the contextual association information between x3 and the following word x4 may be referred to as the below information.
According to the invention, the context information of the word to be labeled in the sequence is identified, so that the identified context information can be adopted to assist in labeling the word to be labeled, and the labeling accuracy is improved.
In a specific implementation process, an XLNet model based on permutation with a two-stream self-attention mechanism may be used to identify the context information of the word to be labeled in the sequence.
As an example, the sequence X includes the words x1, x2, x3, x4 arranged in order. Assume the current word to be labeled is x3, so that the above information of x3 comes from x1 and x2, and the below information of x3 comes from x4. To let the XLNet model see both at once, the words in X may be rearranged: in the rearranged sequence X', the words appear in the order x4, x1, x2, x3. Each word keeps its actual content and its actual position in the original sequence X (x4 is still encoded as the fourth word, x1 as the first word, and so on). When the XLNet model processes X', the words preceding x3 in X' are x4, x1 and x2, so the above information of x3 in X' covers both the above information (x1, x2) and the below information (x4) of x3 in the original sequence X; the below information of x3 in the original sequence X can thus be derived in reverse from the above information of x3 in X', which is not limited thereto.
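The rearrangement in this example can be sketched as follows. This is a simplified illustration of XLNet-style permutation (the function name is hypothetical); the real model realizes the same effect through attention masks rather than by physically reordering the words:

```python
# Original sequence positions: x1..x4 sit at positions 0..3.
# The permutation [3, 0, 1, 2] processes the words in the order x4, x1, x2, x3,
# while each word keeps its actual position in the original sequence.
def visible_positions(permutation, target):
    """Positions (in the original sequence) visible when predicting `target`,
    i.e. the positions that precede `target` in the permutation order."""
    idx = permutation.index(target)
    return sorted(permutation[:idx])

perm = [3, 0, 1, 2]                # X' = x4, x1, x2, x3
print(visible_positions(perm, 2))  # predicting x3: sees x1, x2 (above) and x4 (below) → [0, 1, 3]
```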
S103: and determining a second part of speech of the word to be labeled according to the context information and by combining with the first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
The part of speech of the tagged word may be referred to as a first part of speech, and the part of speech of the word to be tagged may be referred to as a second part of speech.
In a specific execution process, a CRF (Conditional Random Field) can be combined with the XLNet model, so that the word to be labeled is labeled according to the context information and the part of speech of the labeled word. In this way, the labeling process considers not only the context information of the word to be labeled but also the parts of speech of the labeled words in the sequence, so that the labeling effect can be guaranteed in an all-round manner.
In the embodiment, the sequence is obtained, the context information of the word to be tagged in the sequence is identified, the second part of speech of the word to be tagged is determined according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, the second part of speech is used for tagging the word to be tagged, and the part of speech of the word to be tagged is tagged according to the context information of the word to be tagged in the sequence, so that the part of speech tagging accuracy can be improved, the tagging effect of the sequence is improved, and the generation of the text is effectively assisted.
Fig. 2 is a flowchart illustrating a sequence annotation method according to another embodiment of the invention.
Referring to fig. 2, the method includes:
s201: a sequence is obtained.
The sequence includes a plurality of words, each of which may be a word to be labeled or a labeled word: a word that has not yet been labeled may be called a word to be labeled, and a word that has already been labeled may be called a labeled word. The sequence is specifically used for generating a text.
The sequence X is, for example, x1, x2, x3, x4, where x1, x2, x3, x4 may be referred to as words; x1, x2, x3, x4 are arranged and combined in a certain order to form the sequence X, and each word in the sequence X has its corresponding actual content, position, part of speech, and so on.
In the embodiment of the present invention, the words x1, x2, x3 and x4 in the sequence may each be labeled with a part of speech, for example, adjective or verb, or with another part of speech capable of describing the features of the word, which is not limited in this respect.
S202: and sequencing context information of the words to be marked in the XLNET model identification sequence by adopting a double-current self-attention-based mechanism.
In a specific implementation process, the content stream in the hidden state and the query stream in the hidden state of the two-stream self-attention mechanism in the permutation-based XLNet model may specifically be used to identify the context information of the word to be labeled in the sequence, which is not limited thereto.
Specifically, the content stream in the hidden state can determine the context information of the labeled words and the actual content of the word to be labeled in the updated sequence (the updated sequence is obtained by correspondingly adjusting the actual positions of the words in the initial sequence, and is combined with the initial sequence to determine the context information of the word to be labeled). The query stream in the hidden state can determine the actual position of the word to be labeled in the initial sequence and its above information. The actual content of the word to be labeled refers to the text represented by that word. From the context information of the labeled words and the actual content of the word to be labeled, together with the actual position and above information of the word to be labeled in the initial sequence, the context information of the word to be labeled can be determined. The autoregressive model in the XLNet model then collects and records this context information, so that the identified context information can assist the subsequent part-of-speech labeling.
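The division of labor between the two streams can be sketched as attention masks (a simplified, hypothetical illustration rather than the exact computation of the embodiment): at each position, the content stream may attend to the position itself, while the query stream may not, since it must predict the word there without seeing its own content:

```python
def two_stream_masks(permutation):
    """Build content-stream and query-stream attention masks for a permutation.
    mask[i][j] is True if position i may attend to position j."""
    n = len(permutation)
    order = {pos: k for k, pos in enumerate(permutation)}
    # Content stream: sees every position up to and including itself in permutation order.
    content = [[order[j] <= order[i] for j in range(n)] for i in range(n)]
    # Query stream: sees strictly earlier positions only (not its own content).
    query = [[order[j] < order[i] for j in range(n)] for i in range(n)]
    return content, query

content, query = two_stream_masks([3, 0, 1, 2])
print(content[2])  # content stream at x3: sees x1, x2, x3 itself, and x4
print(query[2])    # query stream at x3: sees x1, x2, x4 but not x3 itself
```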
In the embodiment of the invention, the context information of the word to be labeled is identified based on the content flow in the hidden state and the query flow in the hidden state of the XLNET model obtained by initialization, so that a reasonable identification effect can be obtained, the identified context information is matched with the context content of the sequence, and the part-of-speech labeling of the word to be labeled subsequently is assisted.
S203: and determining a second part of speech of the word to be labeled according to the context information by combining the first part of speech and a random condition field CRF model.
The conditional random field (CRF) model may have learned the association between context information and the matching part-of-speech rules. Therefore, after the context information is identified, the corresponding part-of-speech rule may be determined from the context information in combination with the CRF model, and the second part of speech of the word to be labeled may then be determined from the labeled first part of speech in combination with that rule, which is not limited thereto.
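As a minimal sketch of how a CRF-style decoder can combine per-word scores (standing in here for the identified context information) with learned transition scores between parts of speech — all scores, tag names, and function names below are hypothetical:

```python
def viterbi(emissions, transitions, tags):
    """emissions: per-position {tag: score}; transitions: {(prev_tag, tag): score}.
    Returns the highest-scoring tag sequence (Viterbi decoding)."""
    best = {t: (emissions[0][t], [t]) for t in tags}
    for em in emissions[1:]:
        new_best = {}
        for t in tags:
            # Best previous tag given the transition into t.
            prev, (score, path) = max(
                ((p, best[p]) for p in tags),
                key=lambda kv: kv[1][0] + transitions[(kv[0], t)],
            )
            new_best[t] = (score + transitions[(prev, t)] + em[t], path + [t])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]

tags = ["ADJ", "NOUN"]
emissions = [{"ADJ": 2.0, "NOUN": 0.5}, {"ADJ": 0.2, "NOUN": 1.5}]
transitions = {("ADJ", "NOUN"): 1.0, ("ADJ", "ADJ"): -0.5,
               ("NOUN", "ADJ"): -0.5, ("NOUN", "NOUN"): 0.0}
print(viterbi(emissions, transitions, tags))  # → ['ADJ', 'NOUN']
```

The transition scores play the role of the learned part-of-speech rules: an adjective followed by a noun scores higher than two adjectives in a row, so the decoder picks the tag sequence that is best overall, not just best per word.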
In this embodiment, the sequence is acquired, the context information of the word to be labeled in the sequence is identified by the permutation-based two-stream self-attention XLNet model, and the second part of speech of the word to be labeled is determined according to the context information in combination with the first part of speech and the conditional random field (CRF) model. Because the part of speech of the word to be labeled is determined by combining the permutation-based XLNet model with the CRF model, the labeling process is not limited by the actual position of the word to be labeled, the part of speech is determined more flexibly, and a more appropriate labeling effect is ensured. The context information of the word to be labeled is identified based on the content stream and the query stream in the hidden state of the initialized XLNet model, so that a reasonable identification effect can be obtained: the identified context information matches the contextual content of the sequence and assists the subsequent part-of-speech labeling of the word to be labeled.
Fig. 3 is a flowchart illustrating a sequence annotation method according to another embodiment of the invention.
S301: a sequence is obtained.
The sequence includes a plurality of words, each of which may be a word to be labeled or a labeled word: a word that has not yet been labeled may be called a word to be labeled, and a word that has already been labeled may be called a labeled word. The sequence is specifically used for generating a text.
The sequence X is, for example, x1, x2, x3, x4, where x1, x2, x3, x4 may be referred to as words; x1, x2, x3, x4 are arranged and combined in a certain order to form the sequence X, and each word in the sequence X has its corresponding actual content, position, part of speech, and so on.
In the embodiment of the present invention, the words x1, x2, x3 and x4 in the sequence may each be labeled with a part of speech, for example, adjective or verb, or with another part of speech capable of describing the features of the word, which is not limited in this respect.
S302: a word embedding vector for the sequence is obtained.
The word embedding vector is a vector in which words or phrases from the vocabulary are mapped to real numbers, i.e., the words or phrases are represented in vector form.
Referring to fig. 4, fig. 4 is a schematic flow chart of encoding with the XLNet model in the embodiment of the present invention. The word embedding vector of the sequence X is e(x), and the word embedding vectors of the words x1, x2, x3 and x4 in the sequence are e(x1), e(x2), e(x3) and e(x4), respectively.
In a specific implementation process, the word embedding vector e (x) includes multidimensional information, and the multidimensional information comprehensively represents a numerical value corresponding to actual content that the word can represent.
As an example, the word embedding vector of the word "happy" is e(x) = {0.789, 0.674, 0.506, -0.705, -0.305, 0.587, ...}, where the values in the braces are the vector values corresponding to "happy". A word embedding vector contains a plurality of values that together represent the actual content the word can express.
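A word embedding lookup can be sketched as a table from words to real-valued vectors (the vocabulary, dimension, and values below are hypothetical; real embedding vectors typically have hundreds of dimensions and are learned rather than fixed):

```python
# Hypothetical embedding table: each word maps to a fixed-length real-valued vector.
embedding_table = {
    "happy": [0.789, 0.674, 0.506, -0.705, -0.305, 0.587],
    "city":  [0.120, -0.330, 0.910, 0.044, -0.512, 0.233],
}

def embed(sequence, table, dim=6):
    """Map each word of the sequence to its embedding vector;
    unknown words fall back to a zero vector."""
    return [table.get(word, [0.0] * dim) for word in sequence]

vectors = embed(["happy", "city", "unknown-word"], embedding_table)
print(len(vectors), len(vectors[0]))  # → 3 6 (three words, six dimensions each)
```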
By acquiring the word embedding vector of the sequence, the data dimensionality in the sequence labeling process can be effectively reduced, which makes identifying the actual content during labeling more convenient, reduces the error rate of directly identifying the character content of words, effectively improves the identification effect of the actual content, and guarantees the sequence labeling precision from multiple angles.
S303: and initializing the zero-level processing logic of the content flow of the hidden state of the XLNET model by adopting the word embedding vector, and initializing the inquiry flow of the hidden state of the XLNET model into a variable.
Referring to FIG. 4, h_i^(m) represents the content stream in the hidden state, where m takes the values 1-2 and denotes the coding layer of the two-stream self-attention mechanism, and i indexes the corresponding word in the sequence; g_i^(n) represents the query stream in the hidden state, where n takes the values 1-2 and likewise denotes the coding layer of the two-stream self-attention mechanism. The zeroth layer of the content stream h_i^(m) of the XLNet model is initialized as the word embedding vector e(x), and the query stream g_i^(n) is initialized as the variable w.
By initializing the zeroth-layer processing logic of the content stream in the hidden state of the XLNet model with the word embedding vector, and initializing the query stream in the hidden state of the XLNet model as a variable, the context information of the word to be labeled can be taken into account, and the logical conflicts that would otherwise arise when predicting a plurality of words to be labeled in the sequence are avoided.
S304: and taking the text as the input of the XLNET model obtained by initialization, and determining the code corresponding to the text by using the XLNET model obtained by initialization.
For example, if the sequence X is x1, x2, x3, x4, where each word corresponds to an actual content (the actual content is the sub-text described by the word; if the word is "happy", the sub-text it describes is the text "happy"), the sub-texts corresponding to the words are concatenated to obtain the text.
The initialized XLNet model has learned in advance the correspondence between sample words to be labeled and the sample codes calibrated for those sample words. The text resulting from the above concatenation can therefore be used as the input of the initialized XLNet model, so that the initialized XLNet model determines the code corresponding to each word in the text, and the combination of the codes corresponding to the words is taken as the code corresponding to the text. The code corresponding to the text carries the context information among the words in the text. By taking the text as the input of the initialized XLNet model and using it to determine the code corresponding to the text, the amount of computation in the sequence labeling process can be effectively reduced; and because the text is encoded in order to determine the context information among its words, a better context recognition effect can be obtained.
S305: and identifying the context information of the word to be marked by combining the code corresponding to the word to be marked according to the code corresponding to the text.
As an example, in a piece of text such as "Beijing is a beautiful city, I love it", there is a pronoun "it"; from the context information carried in its code, it can be recognized that "it" refers to the city "Beijing".
The text is used as the input of the initialized XLNet model to determine the code corresponding to the text, and the code corresponding to the text carries the context information among the words in the text. Therefore, the context information of the word to be labeled can be identified from the code corresponding to the text in combination with the code corresponding to the word to be labeled.
S306: and inputting the context information of the words to be labeled into the full connection layer.
The initialized XLNet model in the embodiment of the present invention includes a fully connected layer, which may be described in detail below with reference to fig. 5. The fully connected layer may specifically be a fully connected layer of a CNN (Convolutional Neural Network) structure: each neuron in the fully connected layer is connected to all neurons in the previous layer, so the fully connected layer can integrate the class-discriminative local information in the convolutional or pooling layers. To improve the performance of the CNN network, a ReLU (Rectified Linear Unit) function is generally used as the activation function of each neuron of the fully connected layer. In the embodiment of the invention, the fully connected layer of the neural network model is integrated with the XLNet model, the zeroth-layer processing logic of the content stream in the hidden state of the XLNet model is initialized with the word embedding vector, and the query stream in the hidden state is initialized as a variable, so that the initialized XLNet model can accurately identify the context information of the word to be labeled.
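The fully connected layer with a ReLU activation described above can be sketched as follows (the layer sizes, weights, and biases are hypothetical; in a real network they are learned):

```python
def relu(x):
    """ReLU activation: zero out negative values, pass positive values through."""
    return [max(0.0, v) for v in x]

def fully_connected(inputs, weights, biases):
    """Each output neuron is connected to every input: out_j = sum_i in_i * w[j][i] + b[j]."""
    return [sum(i * wt for i, wt in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]

# Hypothetical 3-input, 2-output layer.
w = [[0.5, -1.0, 0.25],
     [1.0,  0.5, -0.5]]
b = [0.1, 2.0]
out = relu(fully_connected([1.0, 2.0, 4.0], w, b))
print(out)  # → [0.0, 2.0]: first neuron's negative pre-activation is clipped to zero
```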
S307: and determining the probability distribution condition of a plurality of candidate parts of speech corresponding to the words to be labeled by combining the result output by the full connection layer with the first part of speech and the CRF model.
The result output by the fully connected layer of the initialized XLNet model accurately describes the context information of the word to be labeled.
The candidate parts of speech are the possible parts of speech determined for the word to be labeled. There may be one or more candidate parts of speech; when there are multiple candidates, the most suitable one may be determined as the second part of speech according to the probability distribution over the candidate parts of speech.
S308: and determining a second part of speech of the word to be labeled according to the probability distribution condition of the candidate parts of speech.
As an example, referring to fig. 5, fig. 5 is a schematic flowchart of a sequence labeling method according to another embodiment of the present invention. The XLNet model 51 obtained by initialization in fig. 5 includes a fully connected layer. A text formed by splicing sequences X including words to be labeled is input into the XLNet model; the content stream h_i^(m) in the hidden state of the XLNet model is initialized with the word embedding vector e(x) of the text, and the query stream g_i^(n) in the hidden state is initialized with a variable w. The text sequence X is then encoded with the dual-stream self-attention mechanism, and the obtained encoding result is input into the fully connected layer. The context information of the word to be labeled is determined according to the output of the fully connected layer; the output of the fully connected layer is then labeled in combination with the first part of speech and the CRF model, giving the probability distribution of a plurality of candidate parts of speech, which can be expressed as a 1×n matrix. Substituting this matrix into the normalized exponential (softmax) function yields a 1×n matrix of sequence labels, and the candidate part of speech with the highest probability is selected as the labeling result of the word to be labeled.
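The final normalization step of the flow above can be sketched as follows. The three-tag candidate set and the scores are invented for illustration; the point is only that the normalized exponential (softmax) function turns the 1×n score matrix into a 1×n probability matrix, from which the highest-probability candidate is picked:

```python
import numpy as np

def softmax(scores):
    # Normalized exponential function: turns the 1 x n score matrix into
    # a 1 x n probability distribution over candidate parts of speech.
    e = np.exp(scores - scores.max())  # shift by max for numerical stability
    return e / e.sum()

candidate_tags = ["noun", "verb", "adjective"]   # hypothetical tag set
scores = np.array([2.1, 0.3, -1.0])              # hypothetical FC-layer + CRF scores
probs = softmax(scores)
# The candidate part of speech with the highest probability is selected
# as the labeling result of the word to be labeled.
second_pos = candidate_tags[int(np.argmax(probs))]
```

With these made-up scores the distribution concentrates on the first tag, so `second_pos` is `"noun"`.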
The probability distribution of the plurality of candidate parts of speech corresponding to the word to be labeled is determined from the result output by the fully connected layer in combination with the first part of speech and the CRF model, and the second part of speech of the word to be labeled is determined according to that probability distribution. Because the part of speech on which the probability concentrates can be selected according to the probability distribution of the candidate parts of speech, the accuracy of selecting the part of speech of the word to be labeled is effectively improved.
In this embodiment, acquiring the word embedding vector of the sequence can effectively reduce the data dimension in the sequence labeling process, making it easier to identify the actual content during labeling, reducing the error rate of directly identifying the character content of words, effectively improving the identification of actual content, and guaranteeing sequence labeling precision from multiple angles. By initializing the zeroth-layer processing logic of the content stream in the hidden state of the XLNet model with the word embedding vector, and initializing the query stream in the hidden state of the XLNet model as a variable, both the above and below context of the word to be labeled can be taken into account, and the logical conflict that arises when predicting a plurality of words to be labeled in the sequence is avoided. Using the text as the input of the XLNet model obtained by initialization, so that this model determines the code corresponding to the text, can effectively reduce the amount of computation in the sequence labeling process; encoding the text to determine the context information among the words in the text yields a better context recognition effect. The context information of the word to be labeled can be identified by combining the code corresponding to the text with the code corresponding to the word to be labeled. Representing the context information of each word in coded form makes it convenient for the electronic device to accurately identify the context information of the word to be labeled and to correctly understand its meaning in context.
By integrating the fully connected layer in the neural network model with the XLNet model, initializing the zeroth-layer processing logic of the content stream in the hidden state of the XLNet model with the word embedding vector, and initializing the query stream in the hidden state of the XLNet model as a variable, the XLNet model obtained by initialization can accurately identify the context information of the word to be labeled. The probability distribution of the plurality of candidate parts of speech corresponding to the word to be labeled is determined from the result output by the fully connected layer in combination with the first part of speech and the CRF model, and the second part of speech of the word to be labeled is determined according to that probability distribution; selecting the part of speech on which the probability concentrates effectively improves the accuracy of part-of-speech selection for the word to be labeled.
Fig. 6 is a schematic structural diagram of a sequence labeling apparatus according to an embodiment of the present invention.
Referring to fig. 6, the apparatus 600 includes:
an obtaining module 601, configured to obtain a sequence;
the identification module 602 is configured to identify context information of a word to be labeled in a sequence;
the tagging module 603 is configured to determine, according to the context information, a second part of speech of the word to be tagged in combination with the first part of speech of the tagged word adjacent to the word to be tagged, where the second part of speech is used for tagging the word to be tagged.
Optionally, in some embodiments, the identifying module 602 is specifically configured to:
and identifying the context information of the word to be labeled in the sequence by using an XLNet permutation language model based on a dual-stream self-attention mechanism.
Optionally, in some embodiments, the labeling module 603 is specifically configured to:
and determining the second part of speech of the word to be labeled according to the context information in combination with the first part of speech and a conditional random field (CRF) model.
Optionally, in some embodiments, referring to fig. 7, the identifying module 602 includes:
an obtaining submodule 6021 for obtaining a word embedding vector of the sequence;
the initialization submodule 6022 is configured to initialize a zero-level processing logic of the content stream in the hidden state of the XLNet model by using the word embedding vector, and initialize the query stream in the hidden state of the XLNet model as a variable;
the identifier module 6023 is configured to identify context information of the word to be labeled by using the XLNet model obtained through initialization.
Optionally, in some embodiments, referring to fig. 7, the XLNet model obtained by initialization has learned the correspondence between sample words to be labeled and sample codes calibrated in advance for those sample words, and the identifying submodule 6023 is specifically configured to:
the text is used as the input of an XLNET model obtained by initialization, so that the XLNET model obtained by initialization is adopted to determine the code corresponding to the text;
and identifying the context information of the word to be marked by combining the code corresponding to the word to be marked according to the code corresponding to the text.
Optionally, in some embodiments, referring to fig. 7, the XLNet model obtained by initialization includes: a content stream in a hidden state and a query stream in the hidden state, and the identifying submodule 6023 includes:
a first determining unit 60231, configured to determine, through the content stream in the hidden state, context information of a tagged word in the updated sequence and actual content of a word to be tagged; the updated sequence is obtained by correspondingly adjusting the actual positions of all the words in the initial sequence, and the initial sequence is the obtained sequence;
a second determining unit 60232, configured to determine, through the query stream in the hidden state, the actual position and the above information of the word to be annotated in the initial sequence;
a third determining unit 60233, configured to determine the context information of the word to be annotated according to the context information of the word to be annotated and the actual content of the word to be annotated, as well as the actual position and the context information of the word to be annotated in the initial sequence;
and the above information and the below information of the word to be labeled are used together as the context information.
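The content-stream/query-stream split handled by these units can be sketched roughly as follows. The dimensions and tokens are toy values and a random generator stands in for trained embeddings; the sketch only shows the initialization contrast: the zeroth layer of the content stream is filled with the word embedding vectors e(x) (so it sees each word's actual content), while the zeroth layer of the query stream repeats a single shared variable w (so it carries position but not content):

```python
import numpy as np

d = 8                                          # hypothetical hidden size
tokens = ["sequence", "[MASK]", "labeling"]    # word to be labeled at index 1

rng = np.random.default_rng(1)
embed = {t: rng.normal(size=d) for t in tokens}  # word embedding vectors e(x)

# Zeroth layer of the content stream: initialized with the word embeddings,
# so it exposes the actual content of every word in the sequence.
h0 = np.stack([embed[t] for t in tokens])

# Zeroth layer of the query stream: initialized with one shared trainable
# variable w, identical at every position, so no content leaks through it.
w = rng.normal(size=d)
g0 = np.tile(w, (len(tokens), 1))
```

In a real XLNet both streams are then updated layer by layer through dual-stream self-attention; this sketch stops at layer zero.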
Optionally, in some embodiments, referring to fig. 7, the XLNet model obtained by initialization includes a fully connected layer, and the labeling module 603 includes:
an input submodule 6031 for inputting context information of a word to be annotated into the full connection layer;
a first determining submodule 6032, configured to determine, by combining the first part of speech and the CRF model, probability distribution conditions of multiple candidate parts of speech corresponding to the word to be tagged from the result output by the full connection layer;
and the second determining submodule 6033 is configured to determine a second part of speech of the word to be tagged according to the probability distribution of the multiple candidate parts of speech.
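One toy way to picture how the first determining submodule might combine the fully-connected-layer output with the first part of speech is a CRF-style score sum over candidate tags. The tag set, scores, and transition matrix below are all invented for illustration and are not taken from the patent:

```python
import numpy as np

tags = ["noun", "verb", "adjective"]       # hypothetical candidate tag set
emission = np.array([1.2, 0.4, -0.3])      # hypothetical FC-layer scores for the word

# Hypothetical CRF transition scores: row = first part of speech of the
# adjacent labeled word, column = candidate second part of speech.
transition = np.array([
    [0.1, 0.9, 0.2],   # from "noun"
    [0.8, 0.1, 0.3],   # from "verb"
    [0.7, 0.2, 0.1],   # from "adjective"
])

first_pos = "verb"     # first part of speech of the adjacent labeled word
scores = emission + transition[tags.index(first_pos)]
second_pos = tags[int(np.argmax(scores))]   # highest combined score wins
```

A full CRF would decode the whole sequence jointly (e.g. with Viterbi); this single-word sum only shows how the adjacent word's tag biases the choice.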
It should be noted that the explanation of the sequence labeling method in the foregoing embodiments of fig. 1 to 5 also applies to the sequence labeling apparatus 600 of this embodiment; the implementation principle is similar, and thus the detailed description is omitted here.
In the embodiment, the sequence is obtained, the context information of the word to be tagged in the sequence is identified, the second part of speech of the word to be tagged is determined according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, the second part of speech is used for tagging the word to be tagged, and the part of speech of the word to be tagged is tagged according to the context information of the word to be tagged in the sequence, so that the part of speech tagging accuracy can be improved, the tagging effect of the sequence is improved, and the generation of the text is effectively assisted.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Referring to fig. 8, the computer device 80 of the present embodiment includes: the device comprises a shell 801, a processor 802, a memory 803, a circuit board 804 and a power supply circuit 805, wherein the circuit board 804 is arranged in a space enclosed by the shell 801, and the processor 802 and the memory 803 are arranged on the circuit board 804; a power supply circuit 805 for supplying power to each circuit or device of the computer apparatus 80; the memory 803 is used to store executable program code; wherein, the processor 802 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 803, for executing:
acquiring a sequence, wherein the sequence comprises words to be labeled and labeled words and is used for generating a text;
identifying context information of a word to be marked in the sequence;
and determining a second part of speech of the word to be labeled according to the context information and by combining with the first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
It should be noted that the foregoing explanations on the embodiments of the sequence labeling method in fig. 1 to fig. 5 also apply to the computer device 80 of the embodiment, and the implementation principles thereof are similar and will not be described herein again.
In the embodiment, the sequence is obtained, the context information of the word to be tagged in the sequence is identified, the second part of speech of the word to be tagged is determined according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, the second part of speech is used for tagging the word to be tagged, and the part of speech of the word to be tagged is tagged according to the context information of the word to be tagged in the sequence, so that the part of speech tagging accuracy can be improved, the tagging effect of the sequence is improved, and the generation of the text is effectively assisted.
In order to implement the foregoing embodiments, the present application provides a computer-readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the sequence tagging method of the foregoing method embodiments.
In the embodiment, the sequence is obtained, the context information of the word to be tagged in the sequence is identified, the second part of speech of the word to be tagged is determined according to the context information and the first part of speech of the tagged word adjacent to the word to be tagged, the second part of speech is used for tagging the word to be tagged, and the part of speech of the word to be tagged is tagged according to the context information of the word to be tagged in the sequence, so that the part of speech tagging accuracy can be improved, the tagging effect of the sequence is improved, and the generation of the text is effectively assisted.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (16)

1. A sequence labeling method is characterized in that the sequence comprises words to be labeled and labeled words, the sequence is used for generating texts, and the method comprises the following steps:
acquiring a sequence;
identifying context information of words to be labeled in the sequence;
and determining a second part of speech of the word to be labeled according to the context information and by combining with a first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
2. The sequence annotation method of claim 1, wherein the identifying context information of the words to be annotated in the sequence comprises:
identifying the context information of the word to be labeled in the sequence by using an XLNet permutation language model based on a dual-stream self-attention mechanism.
3. The sequence tagging method of claim 2, wherein said determining a second part-of-speech of said word to be tagged in combination with a first part-of-speech of a tagged word adjacent to said word to be tagged based on said context information comprises:
determining the second part of speech of the word to be labeled according to the context information in combination with the first part of speech and a conditional random field CRF model.
4. The sequence annotation method of claim 2, wherein the identifying context information of the word to be annotated using a dual-stream auto-attention-based ranking XLNet model comprises:
obtaining a word embedding vector of the sequence;
initializing a zero-level processing logic of the content flow of the XLNET model in the hidden state by adopting the word embedding vector, and initializing the query flow of the XLNET model in the hidden state as a variable;
and identifying the context information of the word to be marked by adopting the XLNET model obtained by initialization.
5. The sequence labeling method of claim 4, wherein the XLNet model obtained by initialization has learned the correspondence between sample words to be labeled and sample codes calibrated in advance for those sample words, and the identifying the context information of the word to be labeled by using the XLNet model obtained by initialization comprises:
taking the text as the input of the XLNET model obtained by the initialization, so as to determine the code corresponding to the text by the XLNET model obtained by the initialization;
and identifying the context information of the word to be marked by combining the code corresponding to the word to be marked according to the code corresponding to the text.
6. The sequence labeling method of claim 4, wherein the XLNet model obtained by initialization comprises: a content stream in a hidden state and a query stream in the hidden state, and the identifying the context information of the word to be labeled by using the XLNet model obtained by initialization comprises:
determining context information of the marked words and actual contents of the words to be marked in the updated sequence through the content stream in the hidden state; the updated sequence is obtained by correspondingly adjusting the actual positions of all the words in the initial sequence, and the initial sequence is the obtained sequence;
determining the actual position and the above information of the word to be marked in the initial sequence through the query stream in the hidden state;
determining the context information of the word to be marked according to the context information of the marked word and the actual content of the word to be marked, as well as the actual position and the context information of the word to be marked in the initial sequence;
and taking the above information and the below information of the word to be marked together as the context information.
7. The sequence labeling method of claim 2, wherein the XLNet model obtained by initialization comprises: a fully connected layer, and the determining the second part of speech of the word to be labeled according to the context information in combination with the first part of speech and a conditional random field CRF model comprises:
inputting the context information of the word to be marked into the full connection layer;
determining the probability distribution condition of a plurality of candidate parts of speech corresponding to the word to be labeled by combining the result output by the full connection layer with the first part of speech and the CRF model;
and determining a second part of speech of the word to be labeled according to the probability distribution condition of the candidate parts of speech.
8. A sequence annotation device, wherein the sequence comprises words to be annotated and words that have been annotated, and wherein the sequence is used to generate text, the device comprising:
an obtaining module, configured to obtain a sequence;
the identification module is used for identifying the context information of the words to be labeled in the sequence;
and the labeling module is used for determining a second part of speech of the word to be labeled according to the context information and by combining with the first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
9. The sequence tagging device of claim 8, wherein said identification module is specifically configured to:
identifying the context information of the word to be labeled in the sequence by using an XLNet permutation language model based on a dual-stream self-attention mechanism.
10. The sequence labeling apparatus of claim 9, wherein the labeling module is specifically configured to:
determining the second part of speech of the word to be labeled according to the context information in combination with the first part of speech and a conditional random field CRF model.
11. The sequence annotation apparatus of claim 9, wherein the identification module comprises:
an obtaining submodule for obtaining word embedding vectors of the sequence;
the initialization submodule is used for initializing the zero-level processing logic of the content flow of the XLNET model in the hidden state by adopting the word embedding vector and initializing the query flow of the XLNET model in the hidden state as a variable;
and the identification submodule is used for identifying the context information of the word to be labeled by adopting the XLNET model obtained by initialization.
12. The sequence labeling apparatus of claim 11, wherein the XLNet model obtained by initialization has learned the correspondence between sample words to be labeled and sample codes calibrated in advance for those sample words, and the identification submodule is specifically configured to:
taking the text as the input of the XLNET model obtained by the initialization, so as to determine the code corresponding to the text by the XLNET model obtained by the initialization;
and identifying the context information of the word to be marked by combining the code corresponding to the word to be marked according to the code corresponding to the text.
13. The sequence labeling apparatus of claim 11, wherein the XLNet model obtained by initialization comprises: a content stream in a hidden state and a query stream in the hidden state, and the identification submodule comprises:
a first determining unit, configured to determine, through the content stream in the hidden state, context information of the tagged word in the updated sequence and actual content of the word to be tagged; the updated sequence is obtained by correspondingly adjusting the actual positions of all the words in the initial sequence, and the initial sequence is the obtained sequence;
a second determining unit, configured to determine, through the query stream in the hidden state, the actual position and the above information of the word to be labeled in the initial sequence;
a third determining unit, configured to determine context information of the word to be labeled according to the context information of the labeled word and the actual content of the word to be labeled, as well as the actual position and the context information of the word to be labeled in the initial sequence;
and taking the above information and the below information of the word to be marked together as the context information.
14. The sequence labeling apparatus of claim 9, wherein the XLNet model obtained by initialization comprises: a fully connected layer, and the labeling module comprises:
the input submodule is used for inputting the context information of the words to be labeled into the full connection layer;
the first determining submodule is used for determining the probability distribution condition of a plurality of candidate parts of speech corresponding to the word to be labeled by combining the first part of speech and the CRF model according to the result output by the full connection layer;
and the second determining submodule is used for determining the second part of speech of the word to be labeled according to the probability distribution condition of the candidate parts of speech.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the sequence labeling method of any one of claims 1 to 7.
16. A computer device comprising a housing, a processor, a memory, a circuit board, and a power circuit, wherein the circuit board is disposed inside a space enclosed by the housing, the processor and the memory being disposed on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the computer equipment; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing:
acquiring a sequence, wherein the sequence comprises words to be labeled and labeled words, and the sequence is used for generating a text;
identifying context information of words to be labeled in the sequence;
and determining a second part of speech of the word to be labeled according to the context information and by combining with a first part of speech of the labeled word adjacent to the word to be labeled, wherein the second part of speech is used for labeling the word to be labeled.
CN202010123296.1A 2020-02-27 2020-02-27 Sequence labeling method and device, storage medium and computer equipment Pending CN111353295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010123296.1A CN111353295A (en) 2020-02-27 2020-02-27 Sequence labeling method and device, storage medium and computer equipment


Publications (1)

Publication Number Publication Date
CN111353295A true CN111353295A (en) 2020-06-30

Family

ID=71194065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010123296.1A Pending CN111353295A (en) 2020-02-27 2020-02-27 Sequence labeling method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN111353295A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112987940A (en) * 2021-04-27 2021-06-18 广州智品网络科技有限公司 Input method and device based on sample probability quantization and electronic equipment
CN113255343A (en) * 2021-06-21 2021-08-13 中国平安人寿保险股份有限公司 Semantic identification method and device for label data, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE68923981D1 (en) * 1988-02-05 1995-10-05 At & T Corp Process for determining parts of text and use.
CN103077164A (en) * 2012-12-27 2013-05-01 新浪网技术(中国)有限公司 Text analysis method and text analyzer
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN110196967A (en) * 2019-06-05 2019-09-03 腾讯科技(深圳)有限公司 Sequence labelling method and apparatus based on depth converting structure




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200630