CN113282719A - Construction method of labeled data set, intelligent terminal and storage medium - Google Patents

Construction method of labeled data set, intelligent terminal and storage medium

Info

Publication number
CN113282719A
CN113282719A
Authority
CN
China
Prior art keywords
data set
article
answer
question
constructing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010100949.4A
Other languages
Chinese (zh)
Inventor
张高升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan TCL Group Industrial Research Institute Co Ltd
Original Assignee
Wuhan TCL Group Industrial Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan TCL Group Industrial Research Institute Co Ltd filed Critical Wuhan TCL Group Industrial Research Institute Co Ltd
Priority to CN202010100949.4A priority Critical patent/CN113282719A/en
Publication of CN113282719A publication Critical patent/CN113282719A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/38 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/382 - Retrieval using citations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/951 - Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a construction method of a labeled data set, an intelligent terminal and a storage medium, wherein the method comprises the following steps: obtaining an article data set, wherein the article data set comprises a plurality of articles; generating answers for the articles of the article data set by adopting a sequence labeling model; and generating a question corresponding to each answer by adopting a deep learning generative model, constructing combinations of corresponding articles, questions and answers, and generating the labeled data set. By constructing combinations of corresponding articles, questions and answers, the method automatically constructs the labeled data set required in deep learning, saving both time cost and economic cost.

Description

Construction method of labeled data set, intelligent terminal and storage medium
Technical Field
The invention relates to the technical field of computer data processing, in particular to a construction method of a labeled data set, an intelligent terminal and a storage medium.
Background
Machine reading comprehension means that, given a context passage and a corresponding query, a machine reads the context and returns an answer to the query. An assumption is made here that the answer to the query must be a span (a contiguous sequence of words) that can be found in the context text, i.e., the goal of the final model's prediction is to output two indices, corresponding to the start and end positions of the answer within the context text. The loss function of the final model is the multi-class softmax cross-entropy, since the problem is essentially equivalent to multi-class classification: the number of classes equals the number of words in the context, i.e., each word may be the start (or end) of the answer.
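As a concrete illustration of this span-prediction loss (a minimal sketch, not code from the patent; all tensor shapes and names are assumptions), the start and end indices are each treated as a classification over the context tokens:

```python
import torch
import torch.nn.functional as F

# Hypothetical model outputs: one logit per context token for the start
# index, and one per token for the end index (batch of 2, context of 8 tokens).
start_logits = torch.randn(2, 8)
end_logits = torch.randn(2, 8)

# Gold answer spans: token positions where each answer starts and ends.
gold_start = torch.tensor([1, 4])
gold_end = torch.tensor([3, 5])

# Each index prediction is a multi-class problem over the context tokens,
# so the loss is the mean of two softmax cross-entropies.
loss = (F.cross_entropy(start_logits, gold_start)
        + F.cross_entropy(end_logits, gold_end)) / 2
print(loss.item())
```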
Machine Reading Comprehension (MRC) is a task that requires a machine to answer questions about a given context, in order to test the extent to which the machine understands natural language. Constructing an MRC model based on a deep neural network requires a large amount of annotation data. Formally, a piece of annotation data is a combination comprising an article, a question and an answer, and an annotation data set is a collection containing many such pieces. Existing annotation data are constructed by manual labeling, so the time cost and the economic cost are very high.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
The invention mainly aims to provide a construction method of a labeled data set, an intelligent terminal and a storage medium, so as to solve the problem that in the prior art labeled data are constructed by manual labeling, which incurs high time and economic costs.
In order to achieve the above object, the present invention provides a method for constructing an annotated data set, which comprises the following steps:
obtaining an article data set, wherein the article data set comprises a plurality of articles;
generating answers for the articles of the article data set by adopting a sequence labeling model; and
generating a question corresponding to each answer by adopting a deep learning generative model, constructing combinations of corresponding articles, questions and answers, and generating the labeled data set.
Optionally, in the method for constructing the labeled data set, the manner of obtaining the article data set comprises at least one of the following: obtaining the article data set from a network by means of a web crawler; obtaining the article data set by querying a service database according to a specific rule; obtaining the article data set from a data set published and licensed on the network; and obtaining the article data set through authorization acquired from a third party.
Optionally, in the method for constructing the labeled data set, the step of generating answers for the articles of the article data set by adopting the sequence labeling model includes:
defining a label of the sequence label, wherein the label is used for representing name information in the article;
selecting information of a person name, a place name and an organization name in the article as the answer according to the label;
and inputting each article into the sequence labeling model, the sequence labeling model outputting the candidate answer set corresponding to each article.
Optionally, in the method for constructing the labeled data set, the label includes: at least one of a beginning part of a person name, a middle part of a person name, a beginning part of a place name, a middle part of a place name, a beginning part of an organization, a middle part of an organization, and non-entity information.
Optionally, in the method for constructing the labeled data set, the step of generating the question corresponding to the answer by adopting the deep learning generative model includes:
in the training stage of the deep learning generative model, feeding the word embeddings of the input article and the answer in sequence to the encoder part, and taking the word embeddings of the question as the output part of the decoder;
the loss function is defined as:
$P(q_1, \ldots, q_n \mid p_1 \ldots p_n, a_{start}, a_{end})$;
representing the probability that the question, as a piece of text, appears in a language model given the article and the answer;
wherein $\{q_1, \ldots, q_n\}$ is the text sequence of the question, $a_{start}$ and $a_{end}$ denote the start and end positions of the answer, and $p = p_1 \ldots p_n$ is the text sequence of the article;
modeling as a language model, wherein the language model is used for calculating the probability of a sentence, and the probability of a section of text is represented by the product of the probabilities of each word in the text;
the formula is described as:
$P(q_1, \ldots, q_n \mid p_1 \ldots p_n, a_{start}, a_{end}) = \prod_{i=1}^{n} P(q_i \mid q_1, \ldots, q_{i-1}, p_1 \ldots p_n, a_{start}, a_{end})$
in the prediction stage of the deep learning generative model, taking the output part of the decoder as the generated question.
Optionally, in the method for constructing the labeled data set, the step of constructing combinations of corresponding articles, questions and answers and generating the labeled data set includes:
constructing the set of articles, questions and answers as S = {(p1, q1, a1), (p2, q2, a2), (p3, q3, a3), ...};
wherein S represents the labeled data set, each element is a tuple (p, q, a), p represents an article, q represents a question, a represents an answer, and the article, the question and the answer in one tuple correspond to each other.
Optionally, in the method for constructing a labeled data set, a correspondence between the answer and the article is determined according to input and output of the sequence labeling model;
and the corresponding relation between the question and the answer is determined according to the input and the output of the deep learning generation model.
Optionally, in the method for constructing the labeled data set, the deep learning generative model includes an encoder and a decoder.
In addition, to achieve the above object, the present invention further provides an intelligent terminal, wherein the intelligent terminal includes: a memory, a processor and a construction program of an annotation data set stored on the memory and executable on the processor, the construction program of the annotation data set implementing the steps of the construction method of the annotation data set as described above when executed by the processor.
In order to achieve the above object, the present invention further provides a storage medium, wherein the storage medium stores a construction program of an annotation data set, and the construction program of the annotation data set realizes the steps of the construction method of the annotation data set as described above when executed by a processor.
The method thus obtains an article data set comprising a plurality of articles; generates answers for the articles of the article data set by adopting a sequence labeling model; and generates a question corresponding to each answer by adopting a deep learning generative model, constructs combinations of corresponding articles, questions and answers, and generates the labeled data set. By constructing combinations of corresponding articles, questions and answers, the method automatically constructs the labeled data set required in deep learning, saving both time cost and economic cost.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the construction method of the annotation data set of the present invention;
FIG. 2 is a schematic diagram of a sequence annotation model in a preferred embodiment of the construction method of the annotation data set of the present invention;
FIG. 3 is a schematic diagram of a sequence annotation model for prediction according to a preferred embodiment of the method for constructing an annotated data set of the present invention;
FIG. 4 is a schematic diagram of a deep learning generative model in a preferred embodiment of the construction method of an annotation data set of the present invention;
FIG. 5 is a schematic diagram of the operating environment of an intelligent terminal according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
As shown in fig. 1, the method for constructing an annotated data set according to the preferred embodiment of the present invention includes the following steps:
and step S10, acquiring an article data set, wherein the article data set comprises a plurality of articles.
Specifically, an article data set is obtained, where an article generally refers to a passage of text, such as a reading-comprehension passage. The article data can be acquired with different techniques according to specific service requirements. For example, a web crawler (also known as a web spider or web robot: a program or script that automatically captures web information according to certain rules) can be used to obtain information from the web; the data can be obtained by querying a business database according to a certain rule; a data set that is published and licensed on the network can be used; or the data can be obtained through authorization acquired from a third party. A minimal crawler sketch is given below.
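As an illustration only (the patent does not prescribe a particular crawler; the URL and the CSS selector below are hypothetical), a crawler for collecting articles could look like this:

```python
import requests
from bs4 import BeautifulSoup

def crawl_articles(index_url):
    """Fetch an index page and return the body text of each linked article."""
    articles = []
    resp = requests.get(index_url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Hypothetical selector: assumes each article link carries the class
    # "article-link" and an absolute URL in its href attribute.
    for link in soup.select("a.article-link"):
        page = requests.get(link["href"], timeout=10)
        page_soup = BeautifulSoup(page.text, "html.parser")
        # Keep only paragraph text as the article body.
        body = " ".join(p.get_text(strip=True) for p in page_soup.find_all("p"))
        if body:
            articles.append(body)
    return articles

# P = {p1, p2, p3, ...} in the notation used below.
P = crawl_articles("https://example.com/articles")
```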
Further, the obtained article data set is P = {p1, p2, p3, ...}, where P denotes the article data set, and p1, p2, p3, etc. each represent an article.
Step S20, generating answers for the articles of the article data set by adopting a sequence labeling model.
Specifically, all articles are sequentially taken out of the article data set P, and each article is processed in the same way: a sequence labeling model (sequence labeling refers to a type of artificial-intelligence task, such as judging the part of speech of each word in a sentence) is used to produce the candidate answer set of an article; the input is an article p, and the output is a candidate answer set A = {a1, a2, a3, ...}, where a1, a2, a3, etc. each represent a candidate answer.
Information about person names, place names and organization names in the article is selected as the answers; the sequence labeling model adopted is shown in fig. 2.
First, the labels of the sequence labeling task are defined, where a label is used to represent name information (for example, person-name, place-name and organization-name information, so that such information in the article can be selected as answers according to the labels). According to the business requirements, the following labels are defined:
B-Person (beginning part of Person name);
I-Person (middle part of the name of a Person);
B-Place (beginning of Place name);
I-Place (middle part of Place name);
B-Organization (beginning part of the Organization);
I-Organization (middle part of the Organization);
O (non-entity information);
namely, the tag includes: at least one of a beginning part of a person name, a middle part of a person name, a beginning part of a place name, a middle part of a place name, a beginning part of an organization, a middle part of an organization, and non-entity information.
As shown in fig. 2, the input of the BiLSTM-CRF model is a sequence of word embedding vectors (a word embedding maps a word to a vector in a high-dimensional space), such as w1, w2 and w3 in fig. 2, and the output is the predicted label for each word. (BiLSTM, Bi-directional Long Short-Term Memory network, combines a forward LSTM and a backward LSTM; it composes word representations into a sentence representation and captures longer-distance dependencies better than a unidirectional model. CRF denotes a conditional random field. Word embedding vectors correspond to words; the input in fig. 2 is preprocessed to convert each word into its embedding vector.) Word embeddings are usually trained in advance; otherwise they are initialized randomly, and all embeddings are adjusted as training iterates. A model sketch follows this paragraph.
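A minimal BiLSTM-CRF tagger sketch in PyTorch, assuming the third-party pytorch-crf package for the CRF layer (the seven-label scheme follows the description above; the dimensions and all other names are illustrative assumptions, not the patent's reference implementation):

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party package: pip install pytorch-crf

LABELS = ["B-Person", "I-Person", "B-Place", "I-Place",
          "B-Organization", "I-Organization", "O"]

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # tuned during training
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              bidirectional=True, batch_first=True)
        self.emit = nn.Linear(hidden_dim, len(LABELS))    # per-word label scores
        self.crf = CRF(len(LABELS), batch_first=True)     # learns label transitions

    def loss(self, tokens, tags):
        emissions = self.emit(self.bilstm(self.embed(tokens))[0])
        return -self.crf(emissions, tags)                 # negative log-likelihood

    def predict(self, tokens):
        emissions = self.emit(self.bilstm(self.embed(tokens))[0])
        return self.crf.decode(emissions)                 # best-scoring label sequence

model = BiLSTMCRF(vocab_size=5000)
tokens = torch.randint(0, 5000, (1, 6))  # one toy sentence of six word ids
tags = torch.randint(0, len(LABELS), (1, 6))
print(model.loss(tokens, tags).item())
print(model.predict(tokens))
```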
The prediction process is shown in fig. 3: the BiLSTM layer outputs, for each word, a score for each label category. For example, for the word w0 the BiLSTM node might output 1.5 (B-Person), 0.9 (I-Person), 0.1 (B-Organization), 0.08 (I-Organization) and 0.05 (O). All the scores output by the BiLSTM layer are used as the input of the CRF layer, and the label sequence with the highest score is the final predicted result.
The CRF layer can add constraints to ensure that the final prediction result is valid. These constraints are learned automatically by the CRF layer from the training data. Possible constraints are: the label at the beginning of a sentence should be "B-" or "O", not "I-".
"B-label 1I-label 2I-label 3 …", in this mode, classes 1, 2, 3 should be the same entity class. For example, "B-Person I-Person" is correct, while "B-Person I-Organization" is incorrect. "O I-label" is erroneous, and the beginning of the named entity should be "B-" rather than "I-". With these useful constraints, the erroneous prediction sequences will be greatly reduced.
The CRF loss function consists of two parts: the score of the true path and the total score of all paths, where the score of the true path should be the highest among all paths. At prediction time, the output is the path with the highest score among all paths, i.e., the label sequence with the highest score is the final predicted result.
Finally, person-name, place-name and organization-name information is extracted from the article according to the prediction result of the sequence labeling model; this information forms the candidate answer set, as the sketch below shows.
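A sketch of this extraction step, converting a predicted label sequence into candidate answer spans (the function and example are illustrative assumptions):

```python
def extract_candidate_answers(words, labels):
    """Collect the text spans labeled as Person/Place/Organization entities."""
    answers, span = [], []
    for word, label in zip(words, labels):
        if label.startswith("B-"):      # a new entity starts; flush the old one
            if span:
                answers.append(" ".join(span))
            span = [word]
        elif label.startswith("I-") and span:
            span.append(word)           # continue the current entity
        else:
            if span:
                answers.append(" ".join(span))
            span = []
    if span:
        answers.append(" ".join(span))
    return answers

words = ["Marie", "Curie", "worked", "in", "Paris"]
labels = ["B-Person", "I-Person", "O", "O", "B-Place"]
print(extract_candidate_answers(words, labels))  # ['Marie Curie', 'Paris']
```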
Step S30, generating a question corresponding to each answer by adopting a deep learning generative model, constructing combinations of corresponding articles, questions and answers, and generating the labeled data set.
Specifically, a deep learning generative model is adopted to generate the question corresponding to each answer. The generative model is modeled as an encoder-decoder process, as shown in fig. 4, and an attention model is introduced.
In the training stage of the deep learning generative model, the word embeddings of the input article and the answer are fed in sequence to the encoder part, and the word embeddings of the question are taken as the output part of the decoder. The loss function is defined as:
$P(q_1, \ldots, q_n \mid p_1 \ldots p_n, a_{start}, a_{end})$;
representing the probability that the question, as a piece of text, appears in a language model given the article and the answer. Wherein:
$\{q_1, \ldots, q_n\}$ is the text sequence of the question;
$a_{start}$ and $a_{end}$ are the start position and end position of the answer;
$p = p_1 \ldots p_n$ is the text sequence of the article.
In order to make the loss function computable, it is modeled as a language model (i.e., a model for calculating the probability of a sentence, in other words for judging how likely a sentence is to be natural human language), and the probability of a piece of text is expressed as the product of the probabilities of each word in the text.
The formula is described as:
$P(q_1, \ldots, q_n \mid p_1 \ldots p_n, a_{start}, a_{end}) = \prod_{i=1}^{n} P(q_i \mid q_1, \ldots, q_{i-1}, p_1 \ldots p_n, a_{start}, a_{end})$
In the prediction stage of the deep learning generative model, the output part of the decoder is taken as the generated question. A sketch of this training setup follows.
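A compact encoder-decoder sketch of this objective in PyTorch (a GRU stands in for the unspecified encoder and decoder cells, and the attention module is omitted for brevity; all dimensions and names are illustrative assumptions). Teacher-forced cross-entropy over the question tokens is exactly the negative logarithm of the product formula above:

```python
import torch
import torch.nn as nn

V = 5000  # toy vocabulary shared by article, answer and question tokens

class QuestionGenerator(nn.Module):
    def __init__(self, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(V, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, V)

    def forward(self, article_and_answer, question_in):
        # Encode the article tokens together with the answer-span tokens.
        _, state = self.encoder(self.embed(article_and_answer))
        # Teacher forcing: the decoder sees the gold question shifted right.
        dec_out, _ = self.decoder(self.embed(question_in), state)
        return self.out(dec_out)  # logits for each next question token

model = QuestionGenerator()
loss_fn = nn.CrossEntropyLoss()

article_and_answer = torch.randint(0, V, (1, 20))
question = torch.randint(0, V, (1, 8))
logits = model(article_and_answer, question[:, :-1])
# Summing -log P(q_i | q_1..q_{i-1}, p, a_start, a_end) over i gives the
# negative log of the product formula above.
loss = loss_fn(logits.reshape(-1, V), question[:, 1:].reshape(-1))
loss.backward()
print(loss.item())
```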
The set of articles, questions and answers is constructed as S = {(p1, q1, a1), (p2, q2, a2), (p3, q3, a3), ...};
where S represents the set (i.e., the labeled data set), and each element is a tuple (p, q, a): p represents an article, q represents a question, and a represents an answer. The article, question and answer within one tuple correspond to one another.
The correspondence between answers and articles is determined according to the input and output of the sequence labeling model; the correspondence between questions and answers is determined according to the input and output of the deep learning generative model, i.e., each question corresponds to an (article, answer) pair through that model's input and output. A sketch assembling the final data set follows.
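Putting the pieces together, an end-to-end assembly sketch (the two model stand-ins and the JSON layout are hypothetical; the patent does not mandate a serialization format):

```python
import json

def build_labeled_dataset(articles, answer_model, question_model):
    """Assemble S = {(p, q, a), ...} from the two trained models."""
    S = []
    for p in articles:
        for a in answer_model(p):      # candidate answers from sequence labeling
            q = question_model(p, a)   # generated question from the encoder-decoder
            S.append({"article": p, "question": q, "answer": a})
    return S

# Toy stand-ins for the two trained models.
answer_model = lambda p: ["Marie Curie"] if "Marie Curie" in p else []
question_model = lambda p, a: f"Who worked in Paris? (expected answer: {a})"

S = build_labeled_dataset(["Marie Curie worked in Paris."],
                          answer_model, question_model)
with open("labeled_dataset.json", "w", encoding="utf-8") as f:
    json.dump(S, f, ensure_ascii=False, indent=2)
print(S)
```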
Furthermore, the construction method of the labeled data set can expand the labeled data sets used for machine reading comprehension, and can automatically generate reading-comprehension questions and answers for teaching, bringing great convenience to users.
Further, as shown in fig. 5, based on the above construction method of the annotation data set, the present invention also provides an intelligent terminal, which includes a processor 10, a memory 20 and a display 30. Fig. 5 shows only some of the components of the smart terminal, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 20 may be an internal storage unit of the intelligent terminal in some embodiments, such as a hard disk or a memory of the intelligent terminal. The memory 20 may also be an external storage device of the intelligent terminal in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash memory card (Flash Card) provided on the intelligent terminal. Further, the memory 20 may include both an internal storage unit and an external storage device of the intelligent terminal. The memory 20 is used for storing application software installed in the intelligent terminal and various data, such as the program codes installed on the intelligent terminal, and may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores a construction program 40 of the labeled data set, and the construction program 40 can be executed by the processor 10, so as to implement the construction method of the labeled data set in the present application.
The processor 10 may be, in some embodiments, a Central Processing Unit (CPU), a microprocessor or another data processing chip, used for running the program codes stored in the memory 20 or processing data, for example executing the construction method of the labeled data set.
The display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like in some embodiments. The display 30 is used for displaying information at the intelligent terminal and for displaying a visual user interface. The components 10-30 of the intelligent terminal communicate with each other via a system bus.
In one embodiment, the following steps are implemented when the processor 10 executes the construction program 40 for labeling data sets in the memory 20:
obtaining an article data set, wherein the article data set comprises a plurality of articles;
generating answers for the articles of the article data set by adopting a sequence labeling model; and
generating a question corresponding to each answer by adopting a deep learning generative model, constructing combinations of corresponding articles, questions and answers, and generating the labeled data set.
The manner of obtaining the article data set comprises at least one of the following: obtaining the article data set from a network by means of a web crawler; obtaining the article data set by querying a service database according to a specific rule; obtaining the article data set from a data set published and licensed on the network; and obtaining the article data set through authorization acquired from a third party.
The step of generating answers for the articles of the article data set by adopting the sequence labeling model includes:
defining a label of the sequence label, wherein the label is used for representing name information in the article;
selecting information of a person name, a place name and an organization name in the article as the answer according to the label;
and inputting each article into the sequence labeling model, the sequence labeling model outputting the candidate answer set corresponding to each article.
The label includes: a beginning part of a person name, a middle part of a person name, a beginning part of a place name, a middle part of a place name, a beginning part of an organization, a middle part of an organization, and non-entity information.
The step of generating the question corresponding to the answer by using the deep learning generative model comprises the following steps:
in the training stage of the deep learning generative model, feeding the word embeddings of the input article and the answer in sequence to the encoder part, and taking the word embeddings of the question as the output part of the decoder;
the loss function is defined as:
$P(q_1, \ldots, q_n \mid p_1 \ldots p_n, a_{start}, a_{end})$;
representing the probability that the question, as a piece of text, appears in a language model given the article and the answer;
wherein $\{q_1, \ldots, q_n\}$ is the text sequence of the question, $a_{start}$ and $a_{end}$ denote the start and end positions of the answer, and $p = p_1 \ldots p_n$ is the text sequence of the article;
modeling as a language model, wherein the language model is used for calculating the probability of a sentence, and the probability of a section of text is represented by the product of the probabilities of each word in the text;
the formula is described as:
$P(q_1, \ldots, q_n \mid p_1 \ldots p_n, a_{start}, a_{end}) = \prod_{i=1}^{n} P(q_i \mid q_1, \ldots, q_{i-1}, p_1 \ldots p_n, a_{start}, a_{end})$
in the prediction stage of the deep learning generative model, taking the output part of the decoder as the generated question.
The step of constructing combinations of corresponding articles, questions and answers and generating the labeled data set specifically includes:
constructing the set of articles, questions and answers as S = {(p1, q1, a1), (p2, q2, a2), (p3, q3, a3), ...};
wherein S represents the labeled data set, each element is a tuple (p, q, a), p represents an article, q represents a question, a represents an answer, and the article, the question and the answer in one tuple correspond to each other.
The corresponding relation between the answers and the articles is determined according to the input and the output of the sequence labeling model;
and the corresponding relation between the question and the answer is determined according to the input and the output of the deep learning generation model.
The generative model of deep learning includes an encoder and a decoder.
The present invention also provides a storage medium, wherein the storage medium stores a construction program of an annotated data set, and the construction program of the annotated data set realizes the steps of the construction method of the annotated data set as described above when executed by a processor.
In summary, the present invention provides a construction method of a labeled data set, an intelligent terminal and a storage medium, wherein the method comprises: obtaining an article data set, wherein the article data set comprises a plurality of articles; generating answers for the articles of the article data set by adopting a sequence labeling model; and generating a question corresponding to each answer by adopting a deep learning generative model, constructing combinations of corresponding articles, questions and answers, and generating the labeled data set. By constructing combinations of corresponding articles, questions and answers, the method automatically constructs the labeled data set required in deep learning, saving both time cost and economic cost.
Of course, it will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware (such as a processor, a controller, etc.), and the program may be stored in a computer readable storage medium, and when executed, the program may include the processes of the above method embodiments. The storage medium may be a memory, a magnetic disk, an optical disk, etc.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (10)

1. A construction method of an annotated data set is characterized by comprising the following steps:
obtaining an article data set, wherein the article data set comprises a plurality of articles;
generating answers for the articles of the article data set by adopting a sequence labeling model; and
generating a question corresponding to each answer by adopting a deep learning generative model, constructing combinations of corresponding articles, questions and answers, and generating the annotated data set.
2. The method for constructing an annotation data set according to claim 1, wherein the manner of obtaining the article data set comprises at least one of the following: obtaining the article data set from a network by means of a web crawler; obtaining the article data set by querying a service database according to a specific rule; obtaining the article data set from a data set published and licensed on the network; and obtaining the article data set through authorization acquired from a third party.
3. The method for constructing a labeled data set according to claim 1, wherein the step of generating answers to articles of the article data set by using a sequence labeling model comprises:
defining a label of the sequence label, wherein the label is used for representing name information in the article;
selecting information of a person name, a place name and an organization name in the article as the answer according to the label;
and inputting each article into the sequence labeling model, the sequence labeling model outputting the candidate answer set corresponding to each article.
4. The construction method of an annotation data set according to claim 3, wherein the tag comprises: at least one of a beginning part of a person name, a middle part of a person name, a beginning part of a place name, a middle part of a place name, a beginning part of an organization, a middle part of an organization, and non-entity information.
5. The method for constructing a labeled data set according to claim 4, wherein the step of generating the question corresponding to the answer by using the deep learning generative model comprises:
in the training stage of the deep learning generative model, feeding the word embeddings of the input article and the answer in sequence to the encoder part, and taking the word embeddings of the question as the output part of the decoder;
the loss function is defined as:
$P(q_1, \ldots, q_n \mid p_1 \ldots p_n, a_{start}, a_{end})$;
representing the probability that the question, as a piece of text, appears in a language model given the article and the answer;
wherein $\{q_1, \ldots, q_n\}$ is the text sequence of the question, $a_{start}$ and $a_{end}$ denote the start and end positions of the answer, and $p = p_1 \ldots p_n$ is the text sequence of the article;
modeling as a language model, wherein the language model is used for calculating the probability of a sentence, and the probability of a section of text is represented by the product of the probabilities of each word in the text;
the formula is described as:
$P(q_1, \ldots, q_n \mid p_1 \ldots p_n, a_{start}, a_{end}) = \prod_{i=1}^{n} P(q_i \mid q_1, \ldots, q_{i-1}, p_1 \ldots p_n, a_{start}, a_{end})$
in the prediction stage of the deep learning generative model, taking the output part of the decoder as the generated question.
6. The method for constructing a labeled data set according to claim 5, wherein the step of constructing a corresponding combination of articles, questions and answers and generating the labeled data set comprises:
constructing the set of articles, questions and answers as S = {(p1, q1, a1), (p2, q2, a2), (p3, q3, a3), ...};
wherein S represents the labeled data set, each element is a tuple (p, q, a), p represents an article, q represents a question, a represents an answer, and the article, the question and the answer in one tuple correspond to each other.
7. The method for constructing a labeled data set according to claim 1 or 6, wherein the correspondence between the answers and the articles is determined according to the input and output of the sequence labeling model;
and the corresponding relation between the question and the answer is determined according to the input and the output of the deep learning generation model.
8. The method of constructing an annotation data set according to claim 1 or 5, wherein the generative model of deep learning comprises an encoder and a decoder.
9. An intelligent terminal, characterized in that, intelligent terminal includes: memory, processor and a construction program of an annotation data set stored on the memory and executable on the processor, which when executed by the processor implements the steps of the construction method of an annotation data set according to any one of claims 1 to 8.
10. A storage medium characterized by storing a construction program of an annotation data set, which when executed by a processor implements the steps of the construction method of an annotation data set according to any one of claims 1 to 8.
CN202010100949.4A 2020-02-19 2020-02-19 Construction method of labeled data set, intelligent terminal and storage medium Pending CN113282719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010100949.4A CN113282719A (en) 2020-02-19 2020-02-19 Construction method of labeled data set, intelligent terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010100949.4A CN113282719A (en) 2020-02-19 2020-02-19 Construction method of labeled data set, intelligent terminal and storage medium

Publications (1)

Publication Number Publication Date
CN113282719A (en) 2021-08-20

Family

ID=77274886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010100949.4A Pending CN113282719A (en) 2020-02-19 2020-02-19 Construction method of labeled data set, intelligent terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113282719A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160232441A1 (en) * 2015-02-05 2016-08-11 International Business Machines Corporation Scoring type coercion for question answering
CN109657041A (en) * 2018-12-04 2019-04-19 南京理工大学 The problem of based on deep learning automatic generation method
CN110334184A (en) * 2019-07-04 2019-10-15 河海大学常州校区 The intelligent Answer System understood is read based on machine

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAVID GOLUB: "Two-Stage Synthesis Networks for Transfer Learning in Machine Comprehension", arXiv *
孙孙 (Sun Sun): "The most intuitive introduction to the CRF layer in the BiLSTM-CRF model", Zhihu *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210820