CN109739956B - Corpus cleaning method, apparatus, device and medium - Google Patents

Corpus cleaning method, apparatus, device and medium

Info

Publication number
CN109739956B
Authority
CN (China)
Prior art keywords
question, corpus, answer, sentence, sentences
Legal status
Active
Application number
CN201811326771.4A
Other languages
Chinese (zh)
Other versions
CN109739956A
Inventors
王靖淞
邢少敏
Assignee
4Paradigm Beijing Technology Co Ltd
Application filed by 4Paradigm Beijing Technology Co Ltd
Priority to CN201811326771.4A
Publication of application CN109739956A; application granted; publication of granted patent CN109739956B

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a corpus cleaning method, apparatus, device and medium. The corpus cleaning method comprises the following steps: obtaining a sentence vector extraction model structure, wherein the sentence vector extraction model structure is taken from a part of a question-answer pair model trained in advance for evaluating the matching condition of question sentences and answer sentences, the part being used for extracting the sentence vector of an input question sentence or answer sentence; extracting at least a part of the corpora from all corpora serving as question-answer pairs to be cleaned; obtaining labeling results for the at least a part of the corpora; training a classification model based on a training set formed from the at least a part of the corpora and their labeling results, wherein the classification model evaluates whether a corpus is suitable for use as a question-answer pair based on the sentence vectors extracted by the sentence vector extraction model structure from the question sentence and the answer sentence of the input corpus; and using the trained classification model to screen corpora suitable for use as question-answer pairs out of the unlabeled corpora among all the corpora. In this way, a large amount of high-quality corpora can be obtained with only a small amount of manual labeling.

Description

Corpus cleaning method, apparatus, device and medium
Technical Field
The present invention relates generally to the field of data science and technology, and more particularly, to a corpus cleaning method, apparatus, device and medium.
Background
In the field of artificial intelligence interaction, interaction realized through conversation still occupies an important position. Implementing dialog-based artificial intelligence interaction relies on the construction of high-quality question-answer pairs. Screening corpora suitable for use as question-answer pairs out of a large body of raw corpora, that is, corpus cleaning, is key to constructing high-quality question-answer pairs.
Existing corpus cleaning schemes fall mainly into two types: one cleans the corpus directly through manual labeling; the other obtains a large-scale labeled corpus through manual labeling and then trains a model on it, so that the trained model performs the cleaning. Either way, a large amount of manual labeling is consumed, and the labor cost is high.
Disclosure of Invention
An exemplary embodiment of the present invention is to provide a corpus cleaning method and apparatus to solve the above problems in the prior art.
According to a first aspect of the present invention, a corpus cleaning method is provided, including: obtaining a sentence vector extraction model structure, wherein the sentence vector extraction model structure is taken from a part of a question-answer pair model trained in advance for evaluating the matching condition of question sentences and answer sentences, and is used for extracting the sentence vector of an input question sentence or answer sentence; extracting at least a part of the corpora from all corpora serving as question-answer pairs to be cleaned; obtaining labeling results for the at least a part of the corpora, wherein a labeling result indicates whether a corpus is suitable for use as a question-answer pair; training a classification model based on a training set formed from the at least a part of the corpora and their labeling results, wherein the classification model evaluates whether a corpus is suitable for use as a question-answer pair based on the sentence vectors extracted by the sentence vector extraction model structure from the question sentence and the answer sentence of the input corpus; and screening corpora suitable for use as question-answer pairs out of the unlabeled corpora among all the corpora by using the trained classification model.
Optionally, the sentence vector extraction model structure includes an embedding layer and an association layer, the embedding layer is configured to obtain word vectors of words in the input question sentence or answer sentence, and the association layer is configured to obtain a sentence vector further carrying word order information based on the word vectors of the words.
Optionally, the association layer is a long-short term memory neural network layer, a recurrent neural network layer and/or a convolutional neural network layer.
Optionally, the step of obtaining a sentence vector extraction model structure includes: receiving a sentence vector extraction model structure from the outside; or, the step of obtaining the sentence vector extraction model structure includes: training a question-answer pair model by using pre-constructed matching question-answer pairs and non-matching question-answer pairs, wherein the question-answer pair model includes the sentence vector extraction model structure and a matching operation structure, and the matching operation structure is used for obtaining the matching degree between the question sentence and the answer sentence in a question-answer pair based on the extracted sentence vectors; and taking the sentence vector extraction model structure out of the trained question-answer pair model.
Optionally, the question-answer pair model is trained with the goal of separating matching question-answer pairs from non-matching question-answer pairs as far as possible, wherein the matching operation structure is a similarity calculation unit or a neural network structure.
Optionally, the classification model includes: a splicing layer for splicing the sentence vectors extracted by the sentence vector extraction model structure from the question sentence and the answer sentence in the question-answer pair of the input corpus, so as to obtain a spliced vector corresponding to the input corpus; and an additional layer for outputting, based on the spliced vector, a classification result as to whether the corresponding input corpus is suitable for use as a question-answer pair.
Optionally, the additional layer is at least one fully connected layer.
Optionally, the step of using the trained classification model to screen corpora suitable for use as question-answer pairs out of the unlabeled corpora among all the corpora comprises iteratively performing the following steps: evaluating at least a part of the unlabeled corpora among all the corpora by using the trained classification model; screening out corpora whose evaluation results indicate they are relatively suitable for use as question-answer pairs, and obtaining labeling results for at least a part of the screened corpora; updating the training set by adding corpora whose labeling results are inconsistent with their evaluation results; and updating or retraining the classification model based on the updated training set.
Optionally, the corpora screened out in each round as being relatively suitable for use as question-answer pairs serve as the cleaned corpora.
Optionally, the method further comprises: stopping the iterative training of the classification model when no corpora relatively suitable for use as question-answer pairs can be screened out based on the evaluation results.
Optionally, the method further comprises: filtering the screened corpora based on a sensitive word list, so as to remove question-answer pairs containing sensitive words.
Optionally, the method further comprises: receiving newly added sensitive words in each iteration to add to the sensitive word list.
According to a second aspect of the present invention, a corpus cleaning device is provided, comprising: a first obtaining unit, configured to obtain a sentence vector extraction model structure, wherein the sentence vector extraction model structure is taken from a part of a question-answer pair model trained in advance for evaluating the matching condition of question sentences and answer sentences, and is used for extracting the sentence vector of an input question sentence or answer sentence; an extraction unit, configured to extract at least a part of the corpora from all corpora serving as question-answer pairs to be cleaned; a second obtaining unit, configured to obtain labeling results for the at least a part of the corpora, wherein a labeling result indicates whether a corpus is suitable for use as a question-answer pair; a training unit, configured to train a classification model based on a training set formed from the at least a part of the corpora and their labeling results, wherein the classification model evaluates whether a corpus is suitable for use as a question-answer pair based on the sentence vectors extracted by the sentence vector extraction model structure from the question sentence and the answer sentence in the question-answer pair of the input corpus; and a screening unit, configured to screen corpora suitable for use as question-answer pairs out of the unlabeled corpora among all the corpora by using the trained classification model.
Optionally, the sentence vector extraction model structure includes an embedding layer and an association layer, the embedding layer is configured to obtain word vectors of words in the input question sentence or answer sentence, and the association layer is configured to obtain a sentence vector further carrying word order information based on the word vectors of the words.
Optionally, the association layer is a long-short term memory neural network layer, a recurrent neural network layer and/or a convolutional neural network layer.
Optionally, the first obtaining unit receives a sentence vector extraction model structure from the outside; or the first obtaining unit trains a question-answer pair model by using a matching question-answer pair and a non-matching question-answer pair which are constructed in advance, and takes out a sentence vector extraction model structure from the trained question-answer pair model, wherein the question-answer pair model comprises the sentence vector extraction model structure and a matching operation structure, and the matching operation structure is used for obtaining the matching degree between a question and an answer sentence in the question-answer pair based on the extracted sentence vector.
Optionally, the question-answer pair model is trained with the goal of separating matching question-answer pairs from non-matching question-answer pairs as far as possible, wherein the matching operation structure is a similarity calculation unit or a neural network structure.
Optionally, the classification model includes: a splicing layer for splicing the sentence vectors extracted by the sentence vector extraction model structure from the question sentence and the answer sentence in the question-answer pair of the input corpus, so as to obtain a spliced vector corresponding to the input corpus; and an additional layer for outputting, based on the spliced vector, a classification result as to whether the corresponding input corpus is suitable for use as a question-answer pair.
Optionally, the additional layer is at least one fully connected layer.
Optionally, the screening unit includes an evaluation unit, a screening subunit, a training set updating unit, and an updating or retraining unit, and screens corpora suitable for use as question-answer pairs out of the unlabeled corpora among all the corpora by iteratively performing the following operations: the evaluation unit evaluates at least a part of the unlabeled corpora among all the corpora by using the trained classification model; the screening subunit screens out corpora whose evaluation results indicate they are relatively suitable for use as question-answer pairs, and obtains labeling results for at least a part of the screened corpora; the training set updating unit updates the training set by adding corpora whose labeling results are inconsistent with their evaluation results; and the updating or retraining unit updates or retrains the classification model based on the updated training set.
Optionally, the corpora screened out in each round as being relatively suitable for use as question-answer pairs serve as the cleaned corpora.
Optionally, when the screening unit cannot screen out corpora relatively suitable for use as question-answer pairs based on the evaluation results, the corpus cleaning device stops the iterative training of the classification model.
Optionally, the corpus cleaning device further comprises: a filtering unit, configured to filter the screened corpora based on a sensitive word list, so as to remove question-answer pairs containing sensitive words.
Optionally, the corpus cleaning device further comprises: a sensitive word updating unit, configured to receive newly added sensitive words in each iteration to add to the sensitive word list.
According to a third aspect of the present invention, there is also presented a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method according to the first aspect of the invention.
According to a fourth aspect of the present invention, there is also proposed a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as set forth in the first aspect of the present invention.
According to the corpus cleaning method and device provided by the exemplary embodiments of the present invention, on the basis of accurately obtaining the sentence vectors of the question sentences and answer sentences in question-answer pairs, a large amount of high-quality corpora can be obtained with only a small amount of manual labeling.
Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.
Drawings
The above and other objects and features of exemplary embodiments of the present invention will become more apparent from the following description taken in conjunction with the accompanying drawings which illustrate exemplary embodiments, wherein:
FIG. 1 illustrates a flowchart of a corpus cleaning method according to an exemplary embodiment of the present invention;
FIG. 2 shows a schematic structural diagram of a question-answer pair model according to an exemplary embodiment of the present invention;
FIG. 3 illustrates a flowchart of an implementation of using a classification model to screen out corpora suitable for use as question-answer pairs according to an exemplary embodiment of the present invention;
FIG. 4 illustrates a block diagram of a corpus cleaning device according to an exemplary embodiment of the present invention;
FIG. 5 shows a block diagram of functional units that a screening unit may have according to an exemplary embodiment of the present invention;
FIG. 6 illustrates a block diagram of a computing device according to an exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present invention by referring to the figures.
Fig. 1 illustrates a flowchart of a corpus cleaning method according to an exemplary embodiment of the present invention. The method shown in FIG. 1 may be implemented by a computer program or by a dedicated corpus cleaning device.
In step S110, a sentence vector extraction model structure is acquired.
The sentence vector extraction model structure is used for extracting the sentence vector of an input question sentence or answer sentence. In the present invention, the sentence vector extraction model structure may be taken from a part of a question-answer pair model trained in advance for evaluating the matching condition of question sentences and answer sentences, namely, the part of the question-answer pair model that extracts the sentence vector of an input question sentence or answer sentence. Because the question-answer pair model is trained to evaluate how well question sentences and answer sentences match, a sentence vector extraction model structure taken out of it can be used to obtain the sentence vectors of the question sentences or answer sentences in the question-answer pairs to be cleaned, and the sentence vectors obtained in this way reflect, to a certain extent, the internal association (namely, the matching relationship) between sentences.
As an example, the sentence vector extraction model structure may include an embedding layer and an association layer, where the embedding layer is configured to obtain word vectors of words in the input question sentence or answer sentence, and the association layer is configured to obtain a sentence vector further carrying word order information based on the word vectors of the words. The association layer may be any one of a long-short term memory neural network layer (LSTM layer), a recurrent neural network layer (RNN layer), and a convolutional neural network layer (CNN layer), or a combination of any two of them, or a combination of these three structures. Therefore, the sentence vector of the question or answer in the question-answer pair to be cleaned, which is extracted based on the sentence vector extraction model structure, not only contains word information in the sentence, but also contains word sequence information among words.
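To make this structure concrete, here is a minimal sketch in PyTorch, assuming a Bi-LSTM as the association layer; the class name, layer sizes, and the choice of final hidden states as the sentence vector are illustrative assumptions, not details fixed by the patent:

```python
import torch
import torch.nn as nn

class SentenceVectorExtractor(nn.Module):
    """Embedding layer + association layer (here a Bi-LSTM) mapping a
    tokenized sentence to one sentence vector that carries word order."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) word indices produced by word segmentation
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)           # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)
```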
In the present invention, the sentence vector extraction model structure may be received from the outside; for example, a question-answer pair model may be trained externally, and the part of it used for extracting the sentence vector of an input question sentence or answer sentence may be taken out to obtain the sentence vector extraction model structure. Alternatively, a question-answer pair model may be trained using pre-constructed matching question-answer pairs and non-matching question-answer pairs, and the sentence vector extraction model structure may then be taken out of the trained question-answer pair model. Here, the pre-constructed matching and non-matching question-answer pairs may come from the corpora to be cleaned, or may be other question-answer pairs similar to, or even unrelated to, the corpora to be cleaned. The question-answer pair model includes the sentence vector extraction model structure and a matching operation structure, and the matching operation structure is used for obtaining the matching degree between the question sentence and the answer sentence in a question-answer pair (a matching question-answer pair or a non-matching question-answer pair) based on the extracted sentence vectors. The matching operation structure may be a neural network structure, or a similarity calculation unit (for example, a cosine similarity calculation unit). When training the question-answer pair model, training may be performed with the goal of separating matching question-answer pairs from non-matching question-answer pairs as far as possible.
The training process of the question-answer pair model is exemplified below.
As shown in fig. 2, the question-answer pair model may include an Embedding layer, a long-short term memory neural network layer (LSTM layer), and a matching operation structure, which are connected in sequence. The LSTM layer may be a bidirectional long-short term memory neural network layer (Bi-LSTM layer), or may be replaced by a recurrent neural network layer (RNN layer) or a convolutional neural network layer (CNN layer). Optionally, any two or all three of the LSTM, RNN, and CNN structures may also be combined. The matching operation structure may be a neural network structure, or another similarity calculation unit, such as a cosine similarity calculation unit.
First, a training set may be constructed on the basis of existing corpora. Specifically, for each question sentence in the corpora, the question sentence and its corresponding answer sentence may be taken as a positive example (i.e., the above-mentioned matching question-answer pair), and one sentence may be randomly selected from the other question sentences or answer sentences and paired with the question sentence to form a negative example (i.e., the above-mentioned non-matching question-answer pair). For one question sentence, one or more negative examples may be constructed, so that a training set consisting of a number of positive and negative examples is obtained. Optionally, all positive and negative examples may be randomly shuffled as the training data for training the question-answer pair model.
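A minimal sketch of this negative-sampling construction follows (plain Python; the data layout, function name, and the one-negative-per-question default are assumptions for illustration):

```python
import random

def build_qa_training_set(qa_pairs, negatives_per_question=1, seed=42):
    """qa_pairs: list of (question, answer) strings from the existing corpora.
    Returns a shuffled list of (question, sentence, label) with label 1 for
    matching pairs (positive examples) and 0 for randomly paired sentences
    (negative examples)."""
    rng = random.Random(seed)
    # Pool of candidate sentences for negatives: all other questions and answers.
    pool = [s for q, a in qa_pairs for s in (q, a)]
    examples = []
    for q, a in qa_pairs:
        examples.append((q, a, 1))            # positive example
        for _ in range(negatives_per_question):
            neg = rng.choice(pool)
            while neg in (q, a):              # avoid pairing q with itself or its answer
                neg = rng.choice(pool)
            examples.append((q, neg, 0))      # negative example
    rng.shuffle(examples)                     # randomly shuffle, per the text
    return examples
```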
As shown in fig. 2, for each question sentence and answer sentence in a positive or negative example, word segmentation may first be performed to split the sentence into a sequence of independent words. The word sequence passes through the Embedding layer to obtain a distributed representation of each word in the sequence; the resulting output can be fed into the LSTM layer to obtain the sentence vector representation of the question sentence (namely, the question vector) and the sentence vector representation of the answer sentence (namely, the answer vector). The question vector and the answer vector can then be input into the matching operation structure, which calculates a score from them; for example, the matching operation structure may compute the score through a neural network, or directly using cosine similarity (namely, the dot product of the two normalized vectors). The score calculated for a positive example composed of a question sentence and a matching answer sentence may be called the positive-example score, and the score calculated for a negative example composed of a question sentence and a non-matching sentence may be called the negative-example score. The objective of training the question-answer pair model can be to push the positive-example score as close to 1 as possible and the negative-example score as close to 0 as possible, while enlarging the gap between positive-example and negative-example scores as much as possible, thereby obtaining the trained question-answer pair model.
The Embedding layer and the LSTM layer shown in fig. 2 are common to both question sentences and answer sentences in positive examples and negative examples, that is, for each sentence, the same Embedding layer and LSTM layer can be used to obtain a sentence vector representation of the sentence.
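Continuing the sketch above, the full question-answer pair model with the shared extractor and a cosine-similarity matching operation structure could look as follows; rescaling the similarity to [0, 1] and training with binary cross-entropy are assumptions, since the text does not fix the exact loss:

```python
import torch.nn.functional as F

class QAPairModel(nn.Module):
    """Shared SentenceVectorExtractor for question and answer sentences, plus
    a matching operation structure (here cosine similarity) scoring the pair."""

    def __init__(self, extractor: SentenceVectorExtractor):
        super().__init__()
        self.extractor = extractor  # the same Embedding + LSTM for both sentences

    def forward(self, question_ids: torch.Tensor, sentence_ids: torch.Tensor) -> torch.Tensor:
        q_vec = self.extractor(question_ids)
        s_vec = self.extractor(sentence_ids)
        score = F.cosine_similarity(q_vec, s_vec, dim=-1)  # in [-1, 1]
        return (score + 1) / 2                             # rescale to [0, 1]

def qa_pair_loss(model: QAPairModel, question_ids, sentence_ids, labels):
    # Push positive-example scores toward 1 and negative-example scores toward 0.
    scores = model(question_ids, sentence_ids)
    return F.binary_cross_entropy(scores, labels.float())
```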
In step S120, at least a part of the corpus is extracted from all the corpora as question-answer pairs to be cleaned. Here, a part of the corpus may be randomly extracted from the whole corpus.
In step S130, a labeling result of at least a part of the corpus is obtained, wherein the labeling result indicates whether the corpus is suitable for being used as a question-answer pair.
The corpora extracted from all the corpora are labeled manually to obtain the labeling results. The standard for manual labeling can be established uniformly by the labeling personnel. By way of example, the criteria may relate to, but are not limited to, whether the question-answer pair is semantically coherent, whether it spreads negative emotions, whether it involves illicit speech, and so forth. Optionally, during the labeling process, the labeling personnel may collect sensitive words to obtain a sensitive word list; the collected sensitive words may include, but are not limited to, person names, place names, advertisements, pornography, politics, insults, religion, and other words that are difficult for a trained model to distinguish.
In step S140, a classification model is trained based on a training set formed by at least a part of the corpus and the labeling result thereof.
The classification model may evaluate whether a corpus is suitable for use as a question-answer pair based on the sentence vectors extracted by the sentence vector extraction model structure from the question sentence and the answer sentence in the question-answer pair of the input corpus. The classification model can be viewed as a part built on top of the sentence vector extraction model structure. As an example, the classification model may include a splicing layer and an additional layer. The splicing layer is used for splicing the sentence vectors extracted by the sentence vector extraction model structure from the question sentence and the answer sentence in the question-answer pair of the input corpus, so as to obtain a spliced vector corresponding to the input corpus. The additional layer is used for outputting, based on the spliced vector, a classification result as to whether the corresponding input corpus is suitable for use as a question-answer pair. Optionally, the additional layer is at least one fully connected layer.
As an example, the classification model may reuse the parameters of the sentence vector extraction model structure; that is, the classification model may include a sentence vector extraction model structure for extracting sentence vectors from the question sentence and the answer sentence, respectively, in the question-answer pair of the input corpus. When the classification model is trained, the sentence vector extraction model structure no longer participates in the training. That is, the classification model may include the sentence vector extraction model structure, the splicing layer, and the additional layer connected in sequence, with the parameters of the sentence vector extraction model structure fixed; for the sentence vector extraction model structure, reference may be made to the description above, which is not repeated here.
When training the classification model, a question-answer pair whose labeling result indicates it is not suitable for use as a question-answer pair may be given a labeling score of 0, and a question-answer pair whose labeling result indicates it is suitable may be given a labeling score of 1. For each question-answer pair in the training set, the spliced vector obtained by the splicing layer from the sentence vectors extracted from the question sentence and the answer sentence by the sentence vector extraction model structure is used as the sample feature, and the labeling score of the question-answer pair is used as the sample label, to train the classification model (for example, the parameters of the additional layer). As an example, the classification model may use a sigmoid function as the activation function during training, so that the classification model can score the question-answer pairs to be cleaned within the interval (0, 1) to evaluate whether they are suitable for use as question-answer pairs. Optionally, in the process of training the classification model, a part of the data in the training set may be split off as a validation set; the validation set can be used to verify the evaluation effect of the trained classification model, and the classification model can be retrained or updated according to that effect. Alternatively, training may end when the evaluation effect measured on the validation set no longer improves, so as to obtain the trained classification model.
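Continuing the PyTorch sketch, a minimal version of this classification model: the extractor is frozen, the question and answer vectors are spliced, and fully connected layers ending in a sigmoid output a suitability score in (0, 1); the hidden size of 128 is an illustrative assumption, and sent_dim matches the 2 * hidden_dim output of the extractor sketch:

```python
class QAPairClassifier(nn.Module):
    """Frozen sentence vector extraction model structure + splicing layer +
    additional fully connected layers ending in a sigmoid."""

    def __init__(self, extractor: SentenceVectorExtractor, sent_dim: int = 256):
        super().__init__()
        self.extractor = extractor
        for p in self.extractor.parameters():
            p.requires_grad = False        # the extractor no longer participates in training
        self.additional = nn.Sequential(
            nn.Linear(2 * sent_dim, 128),  # input: spliced question and answer vectors
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid(),                  # score within the interval (0, 1)
        )

    def forward(self, question_ids: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
        q_vec = self.extractor(question_ids)
        a_vec = self.extractor(answer_ids)
        spliced = torch.cat([q_vec, a_vec], dim=-1)  # splicing layer
        return self.additional(spliced).squeeze(-1)
```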
In step S150, the trained classification model is used to screen corpora suitable for use as question-answer pairs out of the unlabeled corpora among all the corpora.
On the basis of accurately obtaining the sentence vectors of the question sentences and answer sentences in question-answer pairs through the sentence vector extraction model structure, the present invention trains the classification model with only a small amount of manual labeling, and then performs the corpus cleaning operation based on the trained classification model. The quality of the classification model can thus be improved while the workload of manual labeling is reduced.
By way of example, the trained classification model may be used to screen corpora suitable for use as question-answer pairs out of the unlabeled corpora among all the corpora by iteratively performing the steps shown in fig. 3.
Referring to fig. 3, in step S1510, at least a part of the unlabeled corpora among all the corpora is evaluated by using the trained classification model.
As described in steps S120 and S130, the classification model trained in step S140 is trained based on at least a part of the corpora extracted from all the corpora and their labeling results. Here, after the classification model is obtained, at least a part of the unlabeled corpora among all the corpora may be evaluated using it. The at least a part of the unlabeled corpora may be randomly selected from the unlabeled corpora among all the corpora, or may be all of the unlabeled corpora among all the corpora.
In step S1520, the corpora whose evaluation results indicate they are relatively suitable for use as question-answer pairs are screened out, and labeling results are obtained for at least a part of the screened corpora.
The classification model classifies whether corpora are suitable for use as question-answer pairs, so the corpora with higher classification scores can be screened out as those whose evaluation results indicate they are relatively suitable for use as question-answer pairs. For example, after at least a part of the unlabeled corpora among all the corpora is evaluated by the classification model to obtain a score between 0 and 1 for each unlabeled corpus, the corpora with scores higher than a predetermined threshold (e.g., 0.8) may be taken as the corpora screened out in this round.
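A sketch of this screening step, reusing the classifier sketched above and the 0.8 threshold from the example; the encode helper that turns a sentence into a tensor of token ids is assumed to be given:

```python
def screen_round(classifier: QAPairClassifier, unlabeled, encode, threshold=0.8):
    """unlabeled: list of (question, answer) string pairs.
    Returns the pairs scoring above the threshold this round, with scores."""
    selected = []
    classifier.eval()
    with torch.no_grad():
        for q, a in unlabeled:
            score = classifier(encode(q), encode(a)).item()
            if score > threshold:
                selected.append((q, a, score))
    return selected
```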
The screened corpora can be sampled, extracting at least a part of them for labeling, so as to obtain the labeling results of at least a part of the screened corpora. The labeling may be performed manually by the labeling personnel; for the labeling standard, reference may be made to the related description above, which is not repeated here. Optionally, during manual labeling, the labeling personnel may also collect new sensitive words to create or update the sensitive word list.
In step S1530, the training set is updated by adding corpora whose labeling result is inconsistent with the evaluation result.
The main point here is to use, as new training samples, the corpora that are manually labeled as not suitable for use as question-answer pairs but whose evaluation results indicated they were relatively suitable, and to update the training set with them. Optionally, if the labeling results indicate that the evaluation quality of the classification model does not meet the requirement (e.g., too many corpora whose labeling results disagree with the evaluation results), retraining the classification model may be considered. In addition, if the evaluation quality is considered good (e.g., few or no corpora whose labeling results disagree with the evaluation results) and the remaining corpora do not yield enough training samples for update training, this round of update training may be skipped and the process may return to step S1510 to start the next iteration.
In step S1540, the classification model is updated or retrained based on the updated training set.
As an example, when updating the classification model, incremental learning may be performed on the classification model based on the new corpus and the labeling result thereof in the updated training set, for example, parameters of the classification model may be adjusted, so that the adjusted classification model is more suitable for the new corpus. When the classification model is retrained, the classification model can be retrained based on all corpora and labeling results in the updated training set.
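Putting steps S1510 to S1540 together, one possible shape of the outer loop is sketched below; the sampling size, the disagreement rule, and the handling of the unlabeled pool are assumptions the text leaves to the implementer:

```python
def iterative_cleaning(classifier, train_fn, unlabeled, encode, label_fn,
                       sample_size=200, threshold=0.8):
    """train_fn(classifier, training_set): updates or retrains the classifier.
    label_fn(pairs): obtains manual labels (1 = suitable, 0 = not suitable)."""
    cleaned, training_set = [], []
    while True:
        selected = screen_round(classifier, unlabeled, encode, threshold)  # S1510/S1520
        if not selected:
            break  # nothing relatively suitable can be screened out: stop iterating
        sample = selected[:sample_size]  # sample a part of this round for labeling
        labels = label_fn([(q, a) for q, a, _ in sample])
        # S1530: add disagreements (model said suitable, labeler said not) to the set.
        training_set += [(q, a, y) for (q, a, _), y in zip(sample, labels) if y == 0]
        cleaned += selected  # this round's screened corpora count as cleaned corpora
        screened = {(q, a) for q, a, _ in selected}
        unlabeled = [p for p in unlabeled if p not in screened]
        train_fn(classifier, training_set)  # S1540: update or retrain
    return cleaned
```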
In the present invention, while the steps shown in fig. 3 are performed iteratively, the corpora screened out in each round as being relatively suitable for use as question-answer pairs can serve as the cleaned corpora, and the corpora screened out over multiple rounds of iterative training can be merged to obtain the final cleaned corpus set. In addition, among the labeled corpora obtained while training the classification model in steps S110 to S140, those labeled as suitable for use as question-answer pairs can also serve as cleaned corpora and be merged with the corpora screened out by the subsequent rounds of iterative training to obtain the final cleaned corpus set.
As an example, in the process of executing step S1520, when no corpora relatively suitable for use as question-answer pairs can be screened out based on the evaluation results, the iterative training of the classification model may be stopped. That is, the iterative training may be stopped when the evaluation results indicate that the remaining unlabeled corpora no longer yield question-answer pairs suitable for use. In addition, the iterative training of the classification model may also be stopped when enough corpora have been screened out.
Optionally, the corpora obtained after cleaning may be filtered based on the sensitive word list to remove question-answer pairs containing sensitive words. The sensitive word list may be obtained from the outside, such as a pre-compiled sensitive word list. In addition, the sensitive word list may also be updated continuously across iterations; for example, while the steps shown in fig. 3 are performed iteratively, newly added sensitive words may be received in each iteration and added to the sensitive word list. The new sensitive words may be those continuously collected by the labeling personnel when, in each iteration, they label the corpora sampled from the screened corpora.
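A minimal sketch of this filtering step; substring matching is an assumption, as the text does not fix the matching rule:

```python
def filter_sensitive(qa_pairs, sensitive_words):
    """Drop any question-answer pair in which either sentence contains a
    word from the sensitive word list."""
    return [(q, a) for q, a in qa_pairs
            if not any(w in q or w in a for w in sensitive_words)]

# The list can grow across iterations as labeling personnel report new words:
sensitive_words = set()
sensitive_words.update(["hypothetical-new-sensitive-word"])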
Manual labeling is thus reasonably inserted into the corpus screening process: model training can start from a small number of labeled samples, and a high-quality target corpus can be obtained with only a small amount of manual intervention through the iteration of model evaluation (step S1510), manual sampling and labeling (step S1520), training-set enrichment (step S1530), and model updating (step S1540). In addition, when screening the corpora, the stopping condition does not depend on a single index: during manual sampling and labeling, the quality of the corpora screened in the current round is judged to decide whether screening should end, and as long as usable corpora remain, screening can continue.
It should be noted that the corpora mentioned in the present invention may come from various sources; for example, the sources include but are not limited to user chat data from social platforms such as Weibo, Tieba, and Douban. The corpora may contain sensitive information unsuitable for presentation in a chat session, including but not limited to ID numbers, cell phone numbers, and addresses, so a desensitization step may optionally be performed to remove dialogs carrying such sensitive information from the corpora. The corpora can then be arranged into question-answer form, removing sentences that cannot form question-answer sentence pairs; optionally, sentences whose length exceeds a certain threshold can also be filtered out, yielding a candidate corpus set. For example, if the original dialog is: {A: I'm in a great mood today. B: Why is that? A: The weather is great today. B: Since the weather is so good, let's go out and play!}, it can be split into: {Q: I'm in a great mood today. A: Why is that?}, {Q: Why is that? A: The weather is great today.}, {Q: The weather is great today. A: Since the weather is so good, let's go out and play!}.
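A sketch of this preprocessing step, splitting one multi-turn dialog into adjacent question-answer candidate pairs with simple length filtering; the length threshold of 50 characters is an assumption:

```python
def dialog_to_qa_pairs(turns, max_len=50):
    """turns: the ordered utterances of one dialog; every pair of adjacent
    turns becomes one candidate (Q, A) pair, overlong sentences dropped."""
    return [(q, a) for q, a in zip(turns, turns[1:])
            if len(q) <= max_len and len(a) <= max_len]

# The dialog from the example above:
turns = ["I'm in a great mood today.", "Why is that?",
         "The weather is great today.", "Since the weather is so good, let's go out and play!"]
print(dialog_to_qa_pairs(turns))
```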
As described above, the sentence vector extraction model structure is taken from a part of a question-answer pair model trained in advance for evaluating the matching condition of question sentences and answer sentences, and is used to extract the sentence vector of an input question sentence or answer sentence. When the question-answer pair model is trained, the corpora used may be the corpora to be cleaned or other corpora; that is, the question-answer pair model may be trained with either, and the part of it used for extracting the sentence vector of an input question sentence or answer sentence is taken out to obtain the sentence vector extraction model structure.
The corpus cleaning method of the present invention can also be implemented as a corpus cleaning device. Fig. 4 illustrates a block diagram of a corpus cleaning device according to an exemplary embodiment of the present invention. The functional units of the corpus cleaning device may be implemented by hardware, software, or a combination of hardware and software implementing the principles of the present invention. It will be appreciated by those skilled in the art that the functional units described in fig. 4 may be combined or divided into subunits to implement the principles of the invention described above. Thus, the description herein may support any possible combination, division, or further definition of the functional units described herein.
In the following, a brief description is given of functional units that the corpus cleaning device may have and operations that each functional unit may perform, and for details related thereto, reference may be made to the above description, which is not repeated herein.
Referring to fig. 4, the corpus cleaning device 400 includes a first obtaining unit 410, an extracting unit 420, a second obtaining unit 430, a training unit 440, and a screening unit 450.
The first obtaining unit 410 is configured to obtain a sentence vector extraction model structure, wherein the sentence vector extraction model structure is taken from a part of a question-answer pair model trained in advance for evaluating the matching condition of question sentences and answer sentences, and is used to extract the sentence vector of an input question sentence or answer sentence. As an example, the sentence vector extraction model structure may include an embedding layer and an association layer, where the embedding layer is configured to obtain word vectors of the words in the input question sentence or answer sentence, and the association layer is configured to obtain, based on the word vectors of the words, a sentence vector further carrying word order information. The association layer can be a long-short term memory neural network layer, a recurrent neural network layer, and/or a convolutional neural network layer.
The first obtaining unit 410 may receive a sentence vector extraction model structure from the outside. In addition, the first obtaining unit 410 may also train a question-answer pair model by using a matching question-answer pair and a non-matching question-answer pair that are constructed in advance, where the question-answer pair model includes a sentence vector extraction model structure and a matching operation structure, where the matching operation structure is configured to obtain a matching degree between a question and an answer in the question-answer pair based on an extracted sentence vector, and extract the sentence vector extraction model structure from the trained question-answer pair model. The question-answer pair model is trained by taking the matched question-answer pair and the unmatched question-answer pair as targets to be separated as far as possible, wherein the matching operation structure is a similarity calculation unit or a neural network structure.
The extracting unit 420 is configured to extract at least a part of the corpus from the total corpus as the question-answer pairs to be cleaned. The second obtaining unit 430 is configured to obtain a labeling result of at least a portion of the corpus, where the labeling result indicates whether the corpus is suitable for being used as a question-answer pair. The training unit 440 is configured to train a classification model based on a training set formed by at least a part of the corpus and the labeling result thereof, wherein the classification model evaluates whether the corpus is suitable for use as a question-answer pair based on sentence vectors respectively extracted from question sentences and answer sentences in a question-answer pair of the input corpus by a sentence vector extraction model structure.
As an example, the classification model may include: the splicing layer is used for splicing sentence vectors respectively extracted from question sentences and answer sentences in question-answer pairs of the input corpus by the sentence vector extraction model structure so as to obtain spliced vectors corresponding to the input corpus; and an additional layer for outputting a classification result as to whether the corresponding input corpus is suitable for use as a challenge-answer pair based on the concatenation vector. Wherein the additional layer may be at least one fully connected layer.
The screening unit 450 is configured to screen out corpora suitable for being used as question-answer pairs from the unmarked corpora in all corpora by using the trained classification model.
Fig. 5 illustrates a block diagram of functional units that a screening unit 450 according to an exemplary embodiment of the present invention may have. As shown in fig. 5, screening unit 450 may include an evaluation unit 451, a screening sub-unit 453, a training set update unit 455, and an update or retrain unit 457.
By way of example, corpora suitable for use as question-answer pairs may be screened out of the unlabeled corpora among all the corpora using the trained classification model by iteratively performing the following operations: the evaluation unit 451 evaluates at least a part of the unlabeled corpora among all the corpora using the trained classification model; the screening subunit 453 screens out the corpora whose evaluation results indicate they are relatively suitable for use as question-answer pairs, and obtains labeling results for at least a part of the screened corpora; the training set updating unit 455 updates the training set by adding corpora whose labeling results are inconsistent with their evaluation results; and the updating or retraining unit 457 updates or retrains the classification model based on the updated training set. The corpora screened out in each round as being relatively suitable for use as question-answer pairs can serve as the cleaned corpora.
Optionally, the screening unit may stop the iterative training of the classification model when no corpora relatively suitable for use as question-answer pairs can be screened out based on the evaluation results.
Optionally, the corpus cleaning device 400 may further include a filtering unit (not shown in the figure), configured to filter the screened corpora based on the sensitive word list so as to remove question-answer pairs containing sensitive words.
Optionally, the corpus cleaning device 400 may further include a sensitive word updating unit (not shown in the figure), configured to receive newly added sensitive words in each iteration to add to the sensitive word list.
It should be understood that the specific implementation of the corpus cleaning device according to the exemplary embodiment of the present invention may be implemented with reference to the related specific implementation described in conjunction with fig. 1 to 3, and will not be described herein again.
Fig. 6 shows a schematic structural diagram of a computing device that can be used to implement the above method according to an exemplary embodiment of the present invention.
Referring to fig. 6, computing device 600 includes memory 610 and processor 620.
The processor 620 may be a multi-core processor or may include a plurality of processors. In some embodiments, processor 620 may include a general-purpose host processor and one or more special coprocessors such as a Graphics Processor (GPU), a Digital Signal Processor (DSP), or the like. In some embodiments, processor 620 may be implemented using custom circuits, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 610 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor 620 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be a non-volatile device that does not lose the stored instructions and data even after the computer is powered off. In some embodiments, the permanent storage is a mass storage device (e.g., a magnetic or optical disk, or flash memory); in other embodiments, it may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable volatile memory device, such as dynamic random access memory, and may store the instructions and data that some or all of the processors require at runtime. In addition, the memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 610 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., an SD card, a mini SD card, a Micro-SD card, etc.), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 610 has stored thereon executable code that, when executed by the processor 620, causes the processor 620 to perform the corpus cleansing methods described above.
The corpus cleaning method, the corpus cleaning apparatus, and the computing device according to the present invention have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (26)

1. A corpus cleaning method comprises the following steps:
obtaining a sentence vector extraction model structure, wherein the sentence vector extraction model structure is taken from a part of a question-answer pair model trained in advance for evaluating the matching condition of question sentences and answer sentences, the part being used for extracting the sentence vector of an input question sentence or answer sentence;
extracting at least a part of the corpora from all corpora serving as question-answer pairs to be cleaned;
obtaining labeling results for the at least a part of the corpora, wherein a labeling result indicates whether a corpus is suitable for use as a question-answer pair;
training a classification model based on a training set formed from the at least a part of the corpora and their labeling results, wherein the classification model evaluates whether a corpus is suitable for use as a question-answer pair based on the sentence vectors extracted by the sentence vector extraction model structure from the question sentence and the answer sentence in the question-answer pair of the input corpus; and
screening corpora suitable for use as question-answer pairs out of the unlabeled corpora among all the corpora by using the trained classification model.
2. The corpus cleaning method according to claim 1, wherein the sentence vector extraction model structure includes an embedding layer and an association layer, the embedding layer is configured to obtain word vectors of respective words in an input question or answer sentence, and the association layer is configured to obtain, based on the word vectors of the respective words, a sentence vector further carrying word order information.
3. The corpus cleaning method according to claim 2, wherein the association layer is a long-short term memory neural network layer, a recurrent neural network layer and/or a convolutional neural network layer.
4. The corpus cleaning method according to claim 1, wherein the step of obtaining a sentence vector extraction model structure comprises: receiving a sentence vector extraction model structure from the outside; or,
the step of obtaining the sentence vector extraction model structure comprises: training a question-answer pair model by using pre-constructed matching question-answer pairs and non-matching question-answer pairs, wherein the question-answer pair model comprises the sentence vector extraction model structure and a matching operation structure, and the matching operation structure is used for obtaining the matching degree between the question sentence and the answer sentence in a question-answer pair based on the extracted sentence vectors; and taking the sentence vector extraction model structure out of the trained question-answer pair model.
5. The corpus cleaning method according to claim 4, wherein the question-answer pair model is trained with the goal of separating matching question-answer pairs from non-matching question-answer pairs as far as possible, wherein the matching operation structure is a similarity calculation unit or a neural network structure.
6. The corpus cleaning method of claim 1, wherein the classification model comprises:
a splicing layer for splicing the sentence vectors extracted by the sentence vector extraction model structure from the question sentence and the answer sentence in the question-answer pair of the input corpus, so as to obtain a spliced vector corresponding to the input corpus; and
an additional layer for outputting, based on the spliced vector, a classification result as to whether the corresponding input corpus is suitable for use as a question-answer pair.
7. The corpus cleaning method of claim 6, wherein the additional layer is at least one fully connected layer.
8. The corpus cleaning method according to claim 1, wherein the step of screening out, by using the trained classification model, corpora suitable for use as question-answer pairs from the unlabeled corpora among all the corpora comprises iteratively performing the following steps:
evaluating at least a part of the unlabeled corpora among all the corpora by using the trained classification model;
screening out the corpora whose evaluation results indicate that they are relatively suitable for use as question-answer pairs, and acquiring labeling results of at least a part of the screened corpora;
updating the training set by adding the corpora whose labeling results are inconsistent with their evaluation results; and
updating or retraining the classification model based on the updated training set.
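The screening of claims 8-10 amounts to an active-learning-style loop. The following schematic sketch uses hypothetical stand-ins (predict, request_labels, retrain) for components the claims leave abstract; the 0.5 threshold and the labeling of every screened pair are illustrative simplifications.

```python
# Schematic sketch of the iterative screening loop of claims 8-10.
# Corpora are assumed to be hashable (question, answer) tuples.
def iterative_screening(classifier, unlabeled, training_set,
                        predict, request_labels, retrain, threshold=0.5):
    cleaned = []
    while unlabeled:
        # 1. Evaluate the unlabeled corpora with the trained classifier.
        scores = {pair: predict(classifier, pair) for pair in unlabeled}
        # 2. Screen out corpora evaluated as relatively suitable as QA pairs.
        screened = {p for p, s in scores.items() if s >= threshold}
        if not screened:
            break  # claim 10: stop iterating once nothing more can be screened out
        # Obtain labeling results for the screened corpora (claim 8 allows
        # labeling only a part of them; labeling all is a simplification).
        labels = request_labels(screened)  # pair -> 1 (suitable) or 0 (not)
        # 3. Add only corpora whose label disagrees with the evaluation.
        training_set.extend((p, labels[p]) for p in screened if labels[p] != 1)
        # 4. Update or retrain the classification model on the updated set.
        classifier = retrain(classifier, training_set)
        # Claim 9: each round's confirmed corpora join the cleaned corpus.
        cleaned.extend(p for p in screened if labels[p] == 1)
        unlabeled = [p for p in unlabeled if p not in screened]
    return cleaned, classifier
```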
9. The corpus cleaning method according to claim 8, wherein the corpora screened out in each round as suitable for use as question-answer pairs serve as the cleaned corpora.
10. The corpus cleaning method according to claim 9, further comprising:
stopping the iterative training of the classification model in a case where no corpora relatively suitable for use as question-answer pairs can be screened out based on the evaluation results.
11. The corpus cleaning method according to claim 8, further comprising:
filtering the screened corpora based on a sensitive word list, so as to remove question-answer pairs containing sensitive words.
12. The corpus cleaning method according to claim 11, further comprising:
receiving newly added sensitive words in each iteration, and adding them into the sensitive word list.
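The filtering of claims 11-12 reduces to a membership test against a growable word list. In this minimal sketch the helper names and placeholder entries are illustrative; a substring check stands in for whatever matching the implementation actually uses.

```python
# Sketch of claims 11-12: drop question-answer pairs containing
# sensitive words, with a word list that can grow between iterations.
sensitive_words = {"forbidden_word"}  # placeholder entries

def add_sensitive_words(new_words):
    """Claim 12: receive newly added sensitive words in each iteration."""
    sensitive_words.update(new_words)

def filter_sensitive(pairs):
    """Claim 11: remove question-answer pairs that hit the word list."""
    return [
        (q, a) for q, a in pairs
        if not any(w in q or w in a for w in sensitive_words)
    ]

kept = filter_sensitive([("How are you?", "Fine, thanks."),
                         ("What is forbidden_word?", "No comment.")])
# kept == [("How are you?", "Fine, thanks.")]
```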
13. A corpus cleaning device, comprising:
a first obtaining unit, configured to obtain a sentence vector extraction model structure, where the sentence vector extraction model structure is obtained from a part of a question-answer pair model trained in advance for evaluating matching conditions of question sentences and answer sentences, the part being used for extracting sentence vectors of input question sentences or answer sentences;
an extraction unit, configured to extract at least a part of the corpora from all the corpora serving as question-answer pairs to be cleaned;
a second obtaining unit, configured to obtain labeling results of the at least a part of the corpora, wherein a labeling result indicates whether a corpus is suitable for use as a question-answer pair;
a training unit, configured to train a classification model based on a training set formed by the at least a part of the corpora and the labeling results thereof, wherein the classification model evaluates whether a corpus is suitable for use as a question-answer pair based on sentence vectors respectively extracted by the sentence vector extraction model structure from the question sentence and the answer sentence in the question-answer pair of an input corpus; and
a screening unit, configured to screen out, by using the trained classification model, corpora suitable for use as question-answer pairs from the unlabeled corpora among all the corpora.
14. The corpus cleaning device according to claim 13, wherein the sentence vector extraction model structure includes an embedding layer and an association layer, the embedding layer is configured to obtain word vectors of respective words in an input question or answer sentence, and the association layer is configured to obtain, based on the word vectors of the respective words, a sentence vector further carrying word order information.
15. The corpus cleaning device of claim 14, wherein the association layer is a long short-term memory neural network layer, a recurrent neural network layer, and/or a convolutional neural network layer.
16. The corpus cleaning device according to claim 13, wherein the first obtaining unit receives the sentence vector extraction model structure from outside; or,
the first obtaining unit trains a question-answer pair model by using pre-constructed matching question-answer pairs and non-matching question-answer pairs, and takes out the sentence vector extraction model structure from the trained question-answer pair model, wherein the question-answer pair model comprises the sentence vector extraction model structure and a matching operation structure, and the matching operation structure is used for acquiring the matching degree between the question sentence and the answer sentence in a question-answer pair based on the extracted sentence vectors.
17. The corpus cleaning device of claim 16, wherein the question-answer pair model is trained with the goal of differentiating matching question-answer pairs from non-matching question-answer pairs as much as possible, wherein the matching operation structure is a similarity calculation unit or a neural network structure.
18. The corpus cleaning device of claim 13, wherein the classification model comprises:
a splicing layer, configured to splice the sentence vectors respectively extracted by the sentence vector extraction model structure from the question sentence and the answer sentence in the question-answer pair of an input corpus, so as to obtain a spliced vector corresponding to the input corpus; and
an additional layer, configured to output, based on the spliced vector, a classification result as to whether the corresponding input corpus is suitable for use as a question-answer pair.
19. The corpus cleaning device of claim 18, wherein the additional layer is at least one fully connected layer.
20. The corpus cleaning device according to claim 13, wherein the screening unit comprises an evaluation unit, a screening subunit, a training set updating unit, and an updating or retraining unit, and screens out the corpora suitable for use as question-answer pairs from the unlabeled corpora among all the corpora by iteratively performing:
evaluating, by the evaluation unit, at least a part of the unlabeled corpora among all the corpora by using the trained classification model;
screening out, by the screening subunit, the corpora whose evaluation results indicate that they are relatively suitable for use as question-answer pairs, and acquiring labeling results of at least a part of the screened corpora;
updating, by the training set updating unit, the training set by adding the corpora whose labeling results are inconsistent with their evaluation results; and
updating or retraining, by the updating or retraining unit, the classification model based on the updated training set.
21. The corpus cleaning device of claim 20, wherein the corpora screened out in each round as suitable for use as question-answer pairs serve as the cleaned corpora.
22. The corpus cleaning device of claim 21, wherein,
in a case where the screening unit cannot screen out corpora relatively suitable for use as question-answer pairs based on the evaluation results, the corpus cleaning device stops the iterative training of the classification model.
23. The corpus cleaning device of claim 20, further comprising:
a filtering unit, configured to filter the screened corpora based on a sensitive word list, so as to remove question-answer pairs containing sensitive words.
24. The corpus cleaning device of claim 23, further comprising:
a sensitive word updating unit, configured to receive newly added sensitive words in each iteration, so as to add them into the sensitive word list.
25. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-12.
26. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-12.
CN201811326771.4A 2018-11-08 2018-11-08 Corpus cleaning method, apparatus, device and medium Active CN109739956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811326771.4A CN109739956B (en) 2018-11-08 2018-11-08 Corpus cleaning method, apparatus, device and medium

Publications (2)

Publication Number Publication Date
CN109739956A CN109739956A (en) 2019-05-10
CN109739956B true CN109739956B (en) 2020-04-10

Family

ID=66355588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811326771.4A Active CN109739956B (en) 2018-11-08 2018-11-08 Corpus cleaning method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN109739956B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209898A (en) * 2019-05-31 2019-09-06 苏州狗尾草智能科技有限公司 Data cleaning method, answering method, device and medium based on human-computer interaction
CN110222153A (en) * 2019-06-05 2019-09-10 西安电子科技大学 A kind of Chinese Name data desensitization method based on sort permutation
CN110362659A (en) * 2019-07-16 2019-10-22 北京洛必德科技有限公司 The abnormal statement filter method and system of the open corpus of robot
CN110427622A (en) * 2019-07-23 2019-11-08 腾讯科技(深圳)有限公司 Appraisal procedure, device and the storage medium of corpus labeling
CN110491394B (en) * 2019-09-12 2022-06-17 北京百度网讯科技有限公司 Awakening corpus obtaining method and device
CN111178091B (en) * 2019-12-20 2023-05-09 沈阳雅译网络技术有限公司 Multi-dimensional Chinese-English bilingual data cleaning method
CN112417127B (en) * 2020-12-02 2023-08-22 网易(杭州)网络有限公司 Dialogue model training and dialogue generation methods, devices, equipment and media
CN114691815A (en) * 2020-12-25 2022-07-01 科沃斯商用机器人有限公司 Model training method and device, electronic equipment and storage medium
CN112818101A (en) * 2021-02-01 2021-05-18 杭州微洱网络科技有限公司 Question and answer corpus learning method with reinforcement learning function
CN113010679A (en) * 2021-03-18 2021-06-22 平安科技(深圳)有限公司 Question and answer pair generation method, device and equipment and computer readable storage medium

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US7610191B2 (en) * 2004-10-06 2009-10-27 Nuance Communications, Inc. Method for fast semi-automatic semantic annotation
JP5825676B2 (en) * 2012-02-23 2015-12-02 国立研究開発法人情報通信研究機構 Non-factoid question answering system and computer program
CN107247868B (en) * 2017-05-18 2020-05-12 深思考人工智能机器人科技(北京)有限公司 Artificial intelligence auxiliary inquiry system
CN108170853B (en) * 2018-01-19 2020-06-19 广东惠禾科技发展有限公司 Chat corpus self-cleaning method and device and user terminal
CN108509415B (en) * 2018-03-16 2021-09-24 南京云问网络技术有限公司 Sentence similarity calculation method based on word order weighting

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
CN107291822A (en) * 2017-05-24 2017-10-24 北京邮电大学 The problem of based on deep learning disaggregated model training method, sorting technique and device
CN108614855A (en) * 2018-03-19 2018-10-02 众安信息技术服务有限公司 A kind of rumour recognition methods
CN108595602A (en) * 2018-04-20 2018-09-28 昆明理工大学 The question sentence file classification method combined with depth model based on shallow Model

Similar Documents

Publication Publication Date Title
CN109739956B (en) Corpus cleaning method, apparatus, device and medium
CN110717339B (en) Semantic representation model processing method and device, electronic equipment and storage medium
CN109766872B (en) Image recognition method and device
CN112711948B (en) Named entity recognition method and device for Chinese sentences
CN109165385A (en) Multi-triple extraction method based on entity relationship joint extraction model
CN109495766A (en) A kind of method, apparatus, equipment and the storage medium of video audit
CN108268539A (en) Video matching system based on text analyzing
US10157619B2 (en) Method and device for searching according to speech based on artificial intelligence
KR102034346B1 (en) Method and Device for Detecting Slang Based on Learning
CN106980652B (en) Intelligent question and answer method and system
CN110705573A (en) Automatic modeling method and device of target detection model
CN111914550B (en) Knowledge graph updating method and system oriented to limited field
CN111177328B (en) Question-answer matching system and method, question-answer processing device and medium
CN110110325B (en) Repeated case searching method and device and computer readable storage medium
CN109165564B (en) Electronic photo album, generation method, system, storage medium and computer equipment
CN111144097A (en) Modeling method and device for emotion tendency classification model of dialog text
CN110413997B (en) New word discovery method, system and readable storage medium for power industry
CN115221864A (en) Multi-mode false news detection method and system
CN106095758B (en) A kind of literary works guess method of word-based vector model
CN110610006B (en) Morphological double-channel Chinese word embedding method based on strokes and fonts
CN109933784B (en) Text recognition method and device
CN113378826B (en) Data processing method, device, equipment and storage medium
CN112035670B (en) Multi-modal rumor detection method based on image emotional tendency
CN113837167A (en) Text image recognition method, device, equipment and storage medium
CN113763934A (en) Training method and device of audio recognition model, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant