CN112231458B - Capacity expansion method, device, equipment and storage medium for dialogue corpus - Google Patents


Info

Publication number
CN112231458B
CN112231458B (application CN202011146220.7A)
Authority
CN
China
Prior art keywords
corpus, dialogue, dialog, reply, input text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011146220.7A
Other languages
Chinese (zh)
Other versions
CN112231458A (en)
Inventor
王栋
张伟男
王士进
刘挺
刘权
陈志刚
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Xunfei Institute Of Artificial Intelligence
iFlytek Co Ltd
Original Assignee
Hebei Xunfei Institute Of Artificial Intelligence
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Xunfei Institute Of Artificial Intelligence, iFlytek Co Ltd filed Critical Hebei Xunfei Institute Of Artificial Intelligence
Priority to CN202011146220.7A priority Critical patent/CN112231458B/en
Publication of CN112231458A publication Critical patent/CN112231458A/en
Application granted granted Critical
Publication of CN112231458B publication Critical patent/CN112231458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G06F16/31 - Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method, an apparatus, a device, and a storage medium for expanding a dialogue corpus. The method comprises: acquiring an input text collection; filtering out, from the collection, the input texts that already have matching reply texts in the current dialogue corpus, the set formed by the remaining input texts serving as the target input text set; generating, with a pre-established generative dialogue generation model, a reply text corresponding to each input text in the target input text set to obtain a first dialogue corpus set; and adding the dialogue corpora in the first dialogue corpus set to the current dialogue corpus. Because the reply texts are generated automatically by the generative model rather than written manually, dialogue corpora are obtained with higher efficiency and lower labor cost.

Description

Capacity expansion method, device, equipment and storage medium for dialogue corpus
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for expanding a dialog corpus.
Background
Dialogue systems are now used in many scenarios: they acquire a user's input and return a corresponding reply.
Most current dialogue systems are built on a retrieval-based dialogue generation model. To generate a reply to a user input, the model searches a dialogue corpus for replies related to that input and takes the best of the retrieved replies as the reply.
It is therefore clear that the quality of a reply produced by a retrieval-based model depends on the dialogue corpus: higher-quality replies require a sufficiently large corpus. At present, however, the dialogue corpora in such a corpus are written manually, which is time-consuming and labor-intensive, so the corpus struggles to meet application requirements.
Disclosure of Invention
In view of this, the present application provides a method, an apparatus, a device, and a storage medium for expanding a dialogue corpus, so that the expanded corpus can meet application requirements. The technical scheme is as follows:
a method for expanding a dialogue corpus comprises the following steps:
acquiring an input text collection, wherein the input text collection comprises at least one input text;
filtering out input texts with matching reply texts in the current dialogue corpus from the input text total set, wherein a set formed by the remaining input texts is used as a target input text set;
generating, with a pre-established generative dialogue generation model, a reply text corresponding to each input text in the target input text set to obtain a first dialogue corpus set, wherein each dialogue corpus in the first dialogue corpus set consists of one input text in the target input text set and the reply text corresponding to that input text;
and adding the dialogue corpora in the first dialogue corpus set into the current dialogue corpus.
Optionally, the filtering out, from the total input text set, the input text having matching reply texts in the current corpus of dialogues, includes:
determining, with a pre-established retrieval-based dialogue generation model and the current dialogue corpus, a reply text corresponding to each input text in the input text total set and the confidence of each such reply text;
and filtering the input texts corresponding to the reply texts with the confidence degrees larger than or equal to the confidence degree threshold value in the input text total set.
Optionally, the filtering out, from the total input text set, the input text having matching reply texts in the current corpus of dialogues, further includes:
acquiring a quality label of the reply text with the confidence coefficient smaller than the confidence coefficient threshold value, wherein the quality label of the reply text can indicate whether the reply text is qualified or not;
and filtering the input texts corresponding to the unqualified reply texts in the input text total set according to the quality labels corresponding to the reply texts with the confidence degrees smaller than the confidence degree threshold value.
Optionally, the process of determining the confidence threshold includes:
sorting reply texts respectively corresponding to the input texts in the input text total set according to the sequence of the confidence degrees from high to low;
grouping the sorted reply texts, and acquiring the corresponding qualification rate of each group of reply texts;
and determining the confidence corresponding to the last reply text in the reply text group with the first qualification rate smaller than the preset qualification rate threshold as the confidence threshold.
Optionally, the adding the dialog corpus in the first dialog corpus set to the current dialog corpus includes:
evaluating each dialogue corpus in the first dialogue corpus set to obtain the confidence corresponding to each dialogue corpus;
and selecting, from the first dialogue corpus set, the dialogue corpora whose confidence is greater than or equal to a preset confidence threshold and adding them to the current dialogue corpus.
Optionally, the adding the dialog corpus in the first dialog corpus set to the current dialog corpus further includes:
obtaining a quality label corresponding to each dialogue corpus of which the confidence coefficient is smaller than the confidence coefficient threshold value, wherein the quality label corresponding to one dialogue corpus is a quality label of a reply text in the dialogue corpus and can represent whether the reply text in the dialogue corpus is qualified or not;
and selecting the dialog corpus with qualified reply texts from the dialog corpuses with the confidence degrees smaller than the confidence degree threshold value to add into the current dialog corpus according to the quality label corresponding to each dialog corpus with the confidence degree smaller than the confidence degree threshold value.
Optionally, the method for expanding the dialog corpus further includes:
acquiring a training dialogue corpus set and a quality label corresponding to each dialogue corpus in the training dialogue corpus set, wherein the training dialogue corpus set comprises all dialogue corpora in the first dialogue corpus set, or comprises the dialogue corpora in the first dialogue corpus set whose confidence is smaller than the confidence threshold;
and training the retrieval type dialogue generating model by utilizing the dialogue corpora in the training dialogue corpus set and the quality labels corresponding to the dialogue corpora in the training dialogue corpus set so as to optimize the performance of the retrieval type dialogue generating model.
Optionally, the training the retrieval type dialog generation model by using the dialog corpus in the training dialog corpus set and the quality label corresponding to the dialog corpus in the training dialog corpus set includes:
determining the sample difficulty corresponding to each dialogue corpus in the training dialogue corpus set according to the quality label and the confidence coefficient corresponding to each dialogue corpus in the training dialogue corpus set;
determining sampling probability distribution according to the sample difficulty corresponding to each dialogue corpus in the training dialogue corpus set;
sampling dialogue corpora from the training dialogue corpus set according to the sampling probability distribution;
and training the retrieval type dialogue generating model by utilizing the sampled dialogue corpus and the quality label corresponding to the sampled dialogue corpus.
Optionally, the training the retrieval-type dialog generation model by using the dialog corpus in the training dialog corpus set and the quality label corresponding to the dialog corpus in the training dialog corpus set further includes:
when training the retrieval type dialogue generation model with dialogue corpora sampled according to the sampling probability distribution has brought the model's prediction loss to a stable level, adjusting the sampling probability distribution to raise the sampling probability of high-difficulty samples;
and sampling dialogue corpora from the training dialogue corpus according to the adjusted sampling probability distribution, and training the retrieval type dialogue generating model by using the sampled dialogue corpora and the corresponding quality labels.
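The difficulty-aware sampling in the steps above can be sketched as follows. The exact difficulty formula and the `boost_hard` adjustment are assumptions, since the claims leave both unspecified:

```python
import random

def sample_difficulty(confidence, qualified):
    # Hypothetical difficulty score (the patent fixes no formula): a pair
    # is "hard" when the retrieval model was confident but the reply is
    # unqualified, or unconfident while the reply is actually qualified.
    return confidence if not qualified else 1.0 - confidence

def sampling_distribution(corpora, boost_hard=1.0):
    # corpora: list of (pair_id, confidence, qualified).  Raising
    # boost_hard above 1 increases the probability of hard samples,
    # mirroring the adjustment made once the prediction loss plateaus.
    weights = [sample_difficulty(c, q) ** boost_hard + 1e-6
               for _, c, q in corpora]
    total = sum(weights)
    return [w / total for w in weights]

corpora = [("pair_a", 0.9, True), ("pair_b", 0.9, False), ("pair_c", 0.2, True)]
probs = sampling_distribution(corpora)
batch = random.choices(corpora, weights=probs, k=4)  # sampled training batch
```

Here `pair_b` (confidently retrieved yet unqualified) receives the largest sampling probability, and boosting makes that skew stronger.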
Optionally, the method for expanding a dialog corpus further includes:
for each dialogue corpus in the first dialogue corpus set, determining the prediction loss of the generative dialogue generating model corresponding to the dialogue corpus according to the confidence coefficient corresponding to the dialogue corpus and the quality characterization value corresponding to the dialogue corpus, and updating the parameters of the generative dialogue generating model according to the prediction loss of the generative dialogue generating model corresponding to the dialogue corpus;
wherein the quality characterization value corresponding to a dialogue corpus can characterize whether the reply text in that dialogue corpus is qualified, and is determined from the confidence corresponding to the dialogue corpus, or from that confidence together with the quality label.
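As a rough illustration of this loss update, the sketch below assumes a simple multiplicative weighting; `quality_value` and `generative_loss` are hypothetical names, and the patent does not fix either formula:

```python
def quality_value(confidence, label=None, threshold=0.7):
    # Quality characterization value: taken from the human quality label
    # when one exists, otherwise derived from the confidence alone.
    if label is not None:
        return 1.0 if label == "qualified" else 0.0
    return 1.0 if confidence >= threshold else 0.0

def generative_loss(base_loss, confidence, quality):
    # Assumed weighting: shrink the loss for pairs judged qualified with
    # high confidence, leave it intact for unqualified pairs, so parameter
    # updates push the generative model toward replies that pass review.
    return base_loss * (1.0 - confidence * quality)
```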
An expansion apparatus for a corpus of dialogues, comprising: the system comprises an input text acquisition module, an input text filtering module, a reply text generation module and a dialogue corpus expansion module;
the input text acquisition module is used for acquiring an input text collection, wherein the input text collection comprises at least one input text;
the input text filtering module is used for filtering the input texts with matching reply texts in the current dialogue corpus from the input text total set, and a set formed by the remaining input texts is used as a target input text set;
the reply text generation module is used for generating, with a pre-established generative dialogue generation model, a reply text corresponding to each input text in the target input text set to obtain a first dialogue corpus set, wherein each dialogue corpus in the first dialogue corpus set consists of one input text in the target input text set and the reply text corresponding to that input text;
and the dialogue corpus capacity expansion module is used for adding the dialogue corpus in the first dialogue corpus set into the current dialogue corpus.
Optionally, the dialog corpus capacity expansion module includes: a confidence evaluation module and a dialogue corpus adding module;
the confidence evaluation module is used for evaluating each dialogue corpus in the first dialogue corpus set to obtain the confidence corresponding to each dialogue corpus;
and the dialogue corpus adding module is used for selecting dialogue corpora with the confidence coefficient larger than or equal to a preset confidence coefficient threshold value from the first dialogue corpus and adding the dialogue corpora into the current dialogue corpus.
Optionally, the expansion device of the dialog corpus further includes: a search-type dialogue generation model optimization module;
the retrieval type dialogue generation model optimization module is used for acquiring a training dialogue corpus and a quality label corresponding to each dialogue corpus in the training dialogue corpus, and training the retrieval type dialogue generation model by using the dialogue corpuses in the training dialogue corpus and the quality labels corresponding to the dialogue corpuses in the training dialogue corpus so as to optimize the performance of the retrieval type dialogue generation model;
wherein the training corpus of dialogues includes all corpus of dialogues in the first corpus of dialogues, or includes corpus of dialogues in the first corpus of dialogues whose confidence level is less than the confidence level threshold.
Optionally, the expansion device of the dialog corpus further includes: a generative dialogue generative model optimization module;
the generative dialogue generative model optimization module is used for determining the prediction loss of the generative dialogue generative model corresponding to each dialogue corpus in the first dialogue corpus set according to the confidence coefficient corresponding to the dialogue corpus and the quality characterization value corresponding to the dialogue corpus, and updating the parameters of the generative dialogue generative model according to the prediction loss of the generative dialogue generative model corresponding to the dialogue corpus;
wherein the quality characterization value corresponding to a dialogue corpus can characterize whether the reply text in that dialogue corpus is qualified, and is determined from the confidence corresponding to the dialogue corpus, or from that confidence together with the quality label.
A capacity expansion apparatus for a corpus of dialogues, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement each step of the method for expanding a dialog corpus.
A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for expanding a dialog corpus according to any one of the above descriptions.
The method, apparatus, device, and storage medium for expanding a dialogue corpus provided by the application first acquire an input text collection, then filter out the input texts that already have matching reply texts in the current dialogue corpus, the remaining input texts forming the target input text set. A pre-established generative dialogue generation model then generates a reply text for each input text in the target input text set to obtain a first dialogue corpus set, and finally the dialogue corpora in the first dialogue corpus set are added to the current dialogue corpus. Because the added dialogue corpora are generated automatically rather than written manually, they are obtained with high efficiency and low labor cost; and because the dialogue corpus is expanded, its scale can be enlarged to meet application requirements.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for expanding a dialog corpus according to an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating an implementation manner of filtering an input text having matching reply texts in a current corpus of dialogues from an input text corpus according to an embodiment of the present application;
fig. 3 is a flowchart illustrating another implementation manner of filtering out an input text having matching reply texts in a current corpus of dialogues from an input text corpus according to an embodiment of the present application;
fig. 4 is a schematic flowchart of determining a confidence threshold according to an embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating optimization of a retrieved dialogue generating model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a sampling probability distribution provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a capacity expansion device for a dialog corpus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a capacity expansion device for a dialog corpus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In order to realize the capacity expansion of the dialogue corpus, the inventor of the present application researches and discovers that:
the model capable of generating a reply according to the user input at present is a generative dialog generation model besides a search-type dialog generation model, which develops rapidly in recent years, and has the advantages that the reply generation is not limited by a dialog corpus, and the controllability of reply content is poor. Because the chat robot/conversation system in the industry has high requirements on the safety and controllability of the reply, the generative conversation generation model has limited application in online scenes in the industry, and the retrieval conversation generation model is frequently used online.
Based on this, the inventor of the present application thinks that since the online scene cannot use the generative dialogue generating model, it can use the generated user to input the corresponding reply to obtain the dialogue corpus, so as to add the dialogue corpus into the dialogue corpus depended by the retrieval dialogue generating model to improve the dialogue generating effect of the retrieval dialogue generating model.
Based on the above idea, the inventor provides the method for expanding the dialogue corpus provided by the present application, and the method can automatically generate the dialogue corpus, thereby greatly improving the efficiency of obtaining the dialogue corpus and reducing the cost of obtaining the dialogue corpus compared with the existing method for manually compiling the dialogue corpus. The method for expanding the dialogue corpus provided by the application can be applied to a terminal with data processing capacity and can also be applied to a single server or a server cluster formed by a plurality of servers. Next, a method for expanding a corpus of dialogues provided by the present application is described by the following embodiments.
First embodiment
Referring to fig. 1, a flow diagram of a method for expanding a corpus of dialogues according to an embodiment of the present application is shown, where the method includes:
step S101: and acquiring an input text aggregate.
The input text collection includes at least one input text, and in general, the input text collection includes a plurality of input texts.
It should be noted that, the dialog system may obtain the input of the user, and then generate a reply corresponding to the input of the user, and the input of the user may be a voice or a text.
Optionally, in this embodiment, the input text in the input text aggregate may be obtained from a history log of the user. Of course, the input text in the input text total set may also be obtained by other methods, and the obtaining method of the input text is not limited in this embodiment.
Step S102: and filtering out the input texts with matching reply texts in the current dialog corpus from the input text total set, wherein the set formed by the residual input texts is used as a target input text set.
The dialogue corpus comprises a plurality of dialogue corpora, each dialogue corpus consists of an input text and a reply text corresponding to the input text, and each reply text in the dialogue corpus is a reply text matched with the corresponding input text. Wherein the reply text matching the input text refers to a reply text that is qualified for the input text.
For an input text in the input text aggregate, if there is a matching reply text in the current dialog corpus, it indicates that there is a reply text suitable for the input text in the current dialog corpus, that is, there is a dialog corpus that is the same as or similar to a dialog corpus composed of the input text and the matching reply text of the input text in the current dialog corpus, and such a dialog corpus does not need to be added to the current dialog corpus.
Step S103: generating, with a pre-established generative dialogue generation model, a reply text corresponding to each input text in the target input text set to obtain a first dialogue corpus set.
Each dialogue corpus in the first dialogue corpus set consists of an input text in the target input text set and the reply text corresponding to that input text.
Specifically, each input text in the target input text set is fed into the pre-established generative dialogue generation model to obtain its corresponding reply text. Each input text and its reply text then form one dialogue corpus, and the resulting dialogue corpora form the first dialogue corpus set.
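A minimal sketch of this generation loop, with a toy callable standing in for the pre-established generative model:

```python
def build_first_corpus(target_inputs, generate_reply):
    # generate_reply stands in for the generative dialogue generation
    # model; each (input, reply) pair is one dialogue corpus.
    return [(text, generate_reply(text)) for text in target_inputs]

# Toy stand-in model, for illustration only.
toy_model = lambda text: "reply to: " + text
first_corpus_set = build_first_corpus(["how is the weather"], toy_model)
```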
It should be noted that the generative dialogue generation model is trained on an initial dialogue corpus. In one possible implementation, the dialogue corpora in the initial corpus are written manually; considering that manual writing is inefficient, in another possible implementation the dialogue corpora may be crawled from open online communities such as Weibo or Tieba and then cleaned.
Step S104: and adding the dialogue corpora in the first dialogue corpus into the current dialogue corpus.
In this embodiment, all the dialogue corpora in the first dialogue corpus set may be added to the current dialogue corpus. Alternatively, to ensure the quality of the dialogue corpus, only the better-quality (qualified) dialogue corpora may be selected from the first dialogue corpus set and added.
The method for expanding the dialogue corpus provided by this embodiment first acquires an input text collection, then filters out the input texts that already have matching reply texts in the current dialogue corpus, the remaining input texts forming the target input text set. A pre-established generative dialogue generation model then generates a reply text for each input text in the target input text set to obtain a first dialogue corpus set, and finally the dialogue corpora in the first dialogue corpus set are added to the current dialogue corpus. Because the reply texts are generated automatically rather than written manually, dialogue corpora are obtained with high efficiency and low labor cost.
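Steps S101 to S104 can be sketched end to end as follows; `retrieve` and `generate` are stand-ins for the retrieval-based and generative models, and the tuple shapes are illustrative assumptions:

```python
def expand_corpus(input_texts, corpus, retrieve, generate, threshold):
    # S101-S104 end to end: keep only the inputs the current corpus
    # cannot answer confidently, generate replies for them, and append
    # the new (input, reply) pairs to the corpus.
    target = [t for t in input_texts if retrieve(t)[1] < threshold]
    new_pairs = [(t, generate(t)) for t in target]
    corpus.extend(new_pairs)
    return new_pairs

# Stubs standing in for the two models.
retrieve = lambda t: ("stock reply", 0.9) if t == "hello" else ("", 0.1)
generate = lambda t: "generated reply for " + t

corpus = [("hello", "stock reply")]
added = expand_corpus(["hello", "rare question"], corpus,
                      retrieve, generate, threshold=0.5)
```

"hello" is filtered out because the corpus already answers it confidently; only "rare question" receives a generated reply and is appended.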
Second embodiment
This embodiment describes a process of "filtering out the input text matching the reply text in the current corpus of dialogues" from the total input text set in step S102 of the first embodiment.
There are various implementations of filtering out the input text having matching reply texts in the current corpus of dialogues from the input text collection, please refer to fig. 2, which shows a flowchart of an implementation, which may include:
step S201: and determining a reply text corresponding to each input text in the input text total set and the confidence coefficient of the reply text corresponding to each input text by utilizing a pre-established search type dialog generation model and the current dialog corpus.
The retrieval-type dialog generation model is trained based on the dialog corpus in the "initial dialog corpus" mentioned in the above embodiment.
For each input text in the input text total set, the process of determining its reply text with the pre-established retrieval-based dialogue generation model and the current dialogue corpus may include: inputting the text into the retrieval-based model; having the model search the current dialogue corpus for replies related to the text, which yields at least one related reply text and the confidence corresponding to each; and taking the reply with the highest confidence as the reply text corresponding to the input text. This produces, for every input text in the total set, a corresponding reply text and its confidence.
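As an illustration of this retrieve-and-score step, the sketch below substitutes a bag-of-words cosine similarity for the actual retrieval model, with the similarity score playing the role of the confidence:

```python
import math
from collections import Counter

def cosine(a, b):
    # Bag-of-words cosine similarity: a crude stand-in scorer for the
    # retrieval-based dialogue generation model.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_reply(input_text, corpus):
    # corpus: list of (stored_input, stored_reply) dialogue corpora.
    # Return the reply whose stored input best matches the query, with
    # the highest similarity playing the role of the confidence.
    conf, reply = max((cosine(input_text, inp), rep) for inp, rep in corpus)
    return reply, conf

corpus = [("hello there", "hi"), ("what time is it", "it is noon")]
reply, conf = retrieve_reply("what time now", corpus)
```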
Step S202: and filtering the input texts corresponding to the reply texts with the confidence degrees larger than or equal to the confidence degree threshold value in the input text total set.
The input text corresponding to the reply text with the confidence coefficient greater than or equal to the confidence coefficient threshold is the input text with the matching reply text in the current dialog corpus.
Some reply texts whose confidence falls below the threshold may nevertheless be qualified, and the input texts corresponding to them may also need to be filtered out. Based on this, this embodiment provides another implementation of "filtering out, from the input text total set, the input texts that have matching reply texts in the current dialogue corpus"; referring to fig. 3, which shows a flowchart of this implementation, it may include:
step S301: and determining a reply text corresponding to each input text in the input text total set and the confidence coefficient of the reply text corresponding to each input text by utilizing a pre-established search type dialog generation model and the current dialog corpus.
Step S302: and filtering the input texts corresponding to the reply texts with the confidence degrees larger than or equal to the confidence degree threshold value in the input text total set.
The implementation processes of step S301 and step S302 are the same as the implementation processes of step S201 and step S202, and are not described herein again in this embodiment.
Step S303: and acquiring the quality label of the reply text with the confidence coefficient smaller than the confidence coefficient threshold value.
Wherein, the quality label of a reply text is 'qualified' or 'unqualified'.
In this embodiment, the reply texts whose confidence is below the confidence threshold may be labeled manually as qualified or unqualified, and the manual labeling result for each reply text serves as its quality label.
Step S304: and filtering the input texts corresponding to the unqualified reply texts in the input text total set according to the quality labels corresponding to the reply texts with the confidence degrees smaller than the confidence degree threshold value.
Through steps S301 and S302, the input texts corresponding to reply texts with confidence greater than or equal to the confidence threshold are filtered from the input text total set; through steps S303 and S304, the input texts corresponding to reply texts whose quality labels are "qualified" are filtered from the remaining input texts. The texts finally remaining in the input text total set are the input texts corresponding to reply texts whose quality labels are "unqualified", and these input texts form the target input text set.
In both implementations, a "confidence threshold" is used, and the process of determining the confidence threshold is described below.
Referring to fig. 4, a schematic flow chart of determining a confidence threshold is shown, which may include:
step S401: and sorting reply texts corresponding to the input texts in the input text total set according to the sequence from high confidence level to low confidence level.
Step S402: and grouping the sequenced reply texts, and acquiring the corresponding qualification rate of each group of reply texts.
Illustratively, suppose the input text total set includes 100 input texts, each corresponding to one reply text, giving 100 reply texts. The 100 reply texts are sorted in order of confidence from high to low; assume the sorted reply texts are text1 (highest confidence), text2, ..., text100 (lowest confidence). The sorted reply texts are then grouped, say into 5 groups of 20 reply texts each: the first group contains text1 to text20, the second group contains text21 to text40, ..., and the fifth group contains text81 to text100. For each group, whether each reply text is qualified relative to its corresponding input text is checked manually, yielding the number of qualified reply texts in each group and thus the qualification rate corresponding to each group of reply texts (the qualification rate corresponding to a group of reply texts is the ratio of the number of qualified reply texts in the group to the total number of reply texts in the group).
Step S403: and determining the confidence corresponding to the last reply text in the reply text group with the first qualification rate smaller than the preset qualification rate threshold as a confidence threshold.
For the above example, assuming that the third group of reply texts is the first group whose qualification rate is below the preset qualification rate threshold, the confidence corresponding to the last reply text in the third group (i.e., text60) is determined as the confidence threshold.
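The grouping procedure of steps S401 to S403 can be sketched as follows; the representation of a reply text as a (confidence, qualified) pair is an assumption made for the sketch.

```python
def confidence_threshold(replies, group_size, pass_rate_threshold):
    """replies: list of (confidence, qualified) pairs, one per reply text.
    Sort by confidence descending (S401), group and compute each group's
    qualification rate (S402), and return the confidence of the last reply
    in the first group whose rate falls below the preset threshold (S403)."""
    ordered = sorted(replies, key=lambda r: r[0], reverse=True)
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        pass_rate = sum(1 for _, ok in group if ok) / len(group)
        if pass_rate < pass_rate_threshold:
            return group[-1][0]  # confidence of the last reply in this group
    return None  # every group met the qualification-rate threshold
```

For instance, with groups of two replies and a qualification-rate threshold of 0.5, the first group below the threshold determines the returned confidence.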
Third embodiment
This embodiment introduces "step S104: adding the dialogue corpora in the first dialogue corpus set into the current dialogue corpus" in the first embodiment.
There are various implementations of adding the dialogue corpora in the first dialogue corpus set into the current dialogue corpus, and this embodiment provides the following three alternative implementations:
The first implementation: adding all the dialogue corpora in the first dialogue corpus set into the current dialogue corpus.
Because the reply text in each dialog corpus in the first dialog corpus set is generated by the generative dialog generation model, and not every reply text it generates is of good quality, some poor-quality dialog corpora may exist in the first dialog corpus set. If these poor-quality corpora are added into the dialog corpus, the subsequent generation effect of the retrieval type dialog generation model on reply texts will be affected. To avoid this situation, this embodiment provides a second implementation:
step a1, evaluating the confidence coefficient of each dialogue corpus in the first dialogue corpus set to obtain the corresponding confidence coefficient of each dialogue corpus in the first dialogue corpus set.
Optionally, for each dialog corpus in the first dialog corpus set, the dialog corpus may be input into a pre-established confidence evaluation model to obtain the confidence corresponding to the dialog corpus, so as to obtain the confidence corresponding to each dialog corpus in the first dialog corpus set.
Step a2, selecting a dialogue corpus with the confidence coefficient larger than or equal to a preset confidence coefficient threshold value from the first dialogue corpus, and adding the dialogue corpus into the current dialogue corpus.
The dialogue corpora with confidence greater than or equal to the confidence threshold that are added into the current dialogue corpus are dialogue corpora of good quality, i.e., qualified dialogue corpora.
Considering that some qualified dialogue corpora may exist among the dialogue corpora whose confidence is below the preset confidence threshold, in order to add this part of the corpora to the dialogue corpus, this embodiment provides a third implementation:
and b1, evaluating the confidence degree corresponding to each dialogue corpus in the first dialogue corpus set to obtain the confidence degree corresponding to each dialogue corpus in the first dialogue corpus set.
And b2, selecting the dialogue corpus with the confidence coefficient larger than or equal to a preset confidence coefficient threshold value from the first dialogue corpus, and adding the dialogue corpus into the current dialogue corpus.
And b3, obtaining a quality label corresponding to each dialogue corpus of which the confidence coefficient is smaller than the confidence coefficient threshold value.
Wherein, the quality label corresponding to one dialogue corpus is the quality label of the reply text in the dialogue corpus, and the quality label is qualified or unqualified.
And manually labeling each dialogue corpus of which the confidence coefficient is smaller than the confidence coefficient threshold value to mark whether the reply text in the dialogue corpus is qualified or unqualified relative to the input text in the dialogue corpus so as to obtain a quality label corresponding to each dialogue corpus.
And b4, selecting the dialogue corpus with qualified reply texts from the dialogue corpora with the confidence degrees smaller than the confidence degree threshold value to add into the current dialogue corpus according to the quality label corresponding to each dialogue corpus with the confidence degree smaller than the confidence degree threshold value.
That is, among the dialog corpora whose confidence is below the confidence threshold, those whose reply texts are labeled qualified are added into the current dialog corpus.
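The third implementation (steps b1 to b4) can be sketched as follows; the representation of a candidate as an (item, confidence) pair and the label dictionary are assumptions made for the sketch.

```python
def expand_corpus(current_corpus, new_corpora, conf_threshold, quality_labels):
    """new_corpora: list of (corpus_item, confidence) pairs (step b1 output).
    quality_labels: manual 'qualified'/'unqualified' labels for the
    low-confidence items (step b3). Sketch of steps b2 and b4."""
    for item, conf in new_corpora:
        if conf >= conf_threshold:
            current_corpus.append(item)          # step b2: confident enough
        elif quality_labels.get(item) == "qualified":
            current_corpus.append(item)          # step b4: manually qualified
    return current_corpus

expanded = expand_corpus([], [("a", 0.9), ("b", 0.3), ("c", 0.2)],
                         0.5, {"b": "qualified", "c": "unqualified"})
```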
Fourth embodiment
The present embodiment provides another method for expanding a dialog corpus, which is different from the method for expanding a dialog corpus provided in the first embodiment in that, in addition to steps S101 to S104 in the first embodiment, the method may further include: and optimizing the retrieval type dialogue generating model.
The process of "optimizing the retrieved dialog generation model" is described next.
Referring to fig. 5, a schematic flow chart illustrating optimization of the retrieved dialog generation model is shown, which may include:
step S501: and acquiring a quality label corresponding to each dialogue corpus in the training dialogue corpus set.
The training dialogue corpus set includes all the dialogue corpora in the first dialogue corpus set, or includes the dialogue corpora in the first dialogue corpus set whose confidence is below a preset confidence threshold. The quality label corresponding to a dialogue corpus is the quality label of the reply text in that corpus; it may be labeled manually as "qualified" or "unqualified".
The confidence threshold in this embodiment is the confidence threshold referred to in steps b1 to b4 of the above embodiment.
Step S502: and training the retrieval type dialogue generating model by utilizing the dialogue corpora in the training dialogue corpus set and the quality labels corresponding to the dialogue corpora in the training dialogue corpus set so as to optimize the performance of the retrieval type dialogue generating model.
Specifically, the process of training the retrieval type dialogue generating model by using the dialogue corpora in the training dialogue corpus set and their corresponding quality labels may include:
step S5021, determining sample difficulty corresponding to each dialogue corpus in the training dialogue corpus set according to the quality label and the confidence degree corresponding to each dialogue corpus in the training dialogue corpus set.
Specifically, the sample difficulty corresponding to each dialogue corpus in the training dialogue corpus set can be determined according to the following formula:

P_i = S_i, if the i-th dialogue corpus is unqualified; P_i = 1 - S_i, if the i-th dialogue corpus is qualified (1)

wherein P_i is the sample difficulty corresponding to the i-th dialogue corpus in the training dialogue corpus set, and S_i is the confidence corresponding to the i-th dialogue corpus.
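A one-line sketch of the difficulty rule, under the assumption that S_i denotes the confidence corresponding to the i-th dialogue corpus:

```python
def sample_difficulty(confidence, qualified):
    """Sample difficulty per formula (1): an unqualified corpus the model was
    confident about is hard (P_i = S_i); a qualified corpus the model was
    unsure about is hard (P_i = 1 - S_i). `confidence` plays the role of S_i."""
    return 1.0 - confidence if qualified else confidence
```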
Step S5022, determining sampling probability distribution according to the sample difficulty corresponding to each dialogue corpus in the training dialogue corpus set.
The sampling probability distribution can represent the probability of each dialogue corpus being sampled in the training dialogue corpus set.
And S5023, sampling dialogue corpora from the training dialogue corpora in a centralized mode according to the sampling probability distribution.
And S5024, training a retrieval type dialogue generating model by using the sampled dialogue corpus and the quality labels corresponding to the sampled dialogue corpus.
In this embodiment, the retrieval-type dialog generating model is trained by using the dialog corpus sampled according to the sampling probability distribution until the prediction loss of the retrieval-type dialog generating model is stable.
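Steps S5022 and S5023 above can be sketched as follows; drawing with replacement via `random.choices` is an assumption of this sketch, not a detail stated in the text.

```python
import random

def sample_corpora(corpora, difficulties, k, seed=0):
    """Normalize per-sample difficulties into a sampling probability
    distribution (step S5022) and draw k dialogue corpora from it
    (step S5023). Sampling with replacement is a sketch assumption."""
    total = sum(difficulties)
    probs = [d / total for d in difficulties]
    rng = random.Random(seed)
    return rng.choices(corpora, weights=probs, k=k)

sampled = sample_corpora(["c1", "c2", "c3"], [0.9, 0.05, 0.05], k=5)
```

High-difficulty corpora receive proportionally higher sampling probability, which is what step S5025 later amplifies.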
Preferably, the process of optimizing the retrieved dialogue generating model may further include:
and S5025, when the dialogue corpus sampled based on the sampling probability distribution is used for training the retrieval type dialogue generating model to enable the prediction loss of the dialogue generating model to be stable, the sampling probability distribution is adjusted to improve the sampling probability of the high-difficulty samples.
Step S5026, dialogue corpora are sampled from the training dialogue corpus set according to the adjusted sampling probability distribution, and the sampled dialogue corpora and the corresponding quality labels are used for training the retrieval type dialogue generating model.
It should be noted that, when the retrieval type dialogue generating model is optimized, training may be performed in multiple stages, each stage sampling dialogue corpora from a different probability distribution. For example, training may be performed in three stages: the first stage samples dialogue corpora according to the distribution illustrated in (a) of fig. 6, the second stage according to the distribution illustrated in (b) of fig. 6, and the third stage according to the distribution illustrated in (c) of fig. 6.
The retrieval type dialogue generating model obtained through the training mode has better effect.
Fifth embodiment
The present embodiment provides another method for expanding a dialog corpus, which is different from the method for expanding a dialog corpus provided in the foregoing embodiment in that, in addition to the steps S101 to S104 and the optimization process of the search-type dialog generation model in the first embodiment, the method may further include: the generative dialog generative model is optimized.
The process of "optimizing the generative dialog generative model" is described next.
The process of optimizing the generative dialog generative model comprises:
for each dialog corpus in the first set of dialog corpuses, performing:
step c1, determining the prediction loss of the generative dialogue generating model corresponding to the dialogue corpus according to the confidence coefficient corresponding to the dialogue corpus and the quality characterization value corresponding to the dialogue corpus.
The quality characterization value corresponding to one dialogue corpus can characterize whether the reply text in the dialogue corpus is qualified, and the quality characterization value corresponding to one dialogue corpus is determined according to the confidence coefficient corresponding to the dialogue corpus or the confidence coefficient corresponding to the dialogue corpus and the quality label.
Specifically, a quality characterization value corresponding to one dialog corpus is "0" or "1", and if a confidence degree corresponding to the dialog corpus is greater than or equal to a preset confidence degree threshold value, the quality characterization value corresponding to the dialog corpus is "1"; if the confidence corresponding to the dialog corpus is smaller than a preset confidence threshold and the quality label corresponding to the dialog corpus is qualified, the quality characterization value corresponding to the dialog corpus is 1; if the confidence corresponding to the dialog corpus is smaller than the preset confidence threshold value and the quality label corresponding to the dialog corpus is unqualified, the quality characterization value corresponding to the dialog corpus is 0.
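The rule for the quality characterization value just described can be sketched as:

```python
def quality_value(confidence, conf_threshold, quality_label=None):
    """Quality characterization value as described above: 1 when the
    confidence reaches the threshold; otherwise fall back to the manual
    quality label ('qualified' -> 1, 'unqualified' -> 0)."""
    if confidence >= conf_threshold:
        return 1
    return 1 if quality_label == "qualified" else 0
```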
In this embodiment, the prediction loss of the generative dialogue generating model corresponding to one dialogue corpus (assumed to be the i-th dialogue corpus in the first dialogue corpus set) can be determined as follows:

(Equation (2), rendered only as an image in the original publication: loss_i is computed from h_i, r_i and the weighting weight α)

wherein h_i is the quality characterization value corresponding to the i-th dialogue corpus, r_i is the confidence corresponding to the i-th dialogue corpus, α is the weighting weight, and loss_i is the prediction loss of the generative dialogue generating model corresponding to the i-th dialogue corpus.
The present application considers that the higher the confidence corresponding to a dialog corpus is, the larger the corresponding weighting weight should be, based on this, the present embodiment determines the weighting weight α by using the following formula:
α = CORR([h_0, h_1, ...], [r_0, r_1, ...]) (3)

wherein [h_0, h_1, ...] is the set of quality characterization values corresponding to the quality-labeled dialogue corpora, and [r_0, r_1, ...] is the set of confidences corresponding to the quality-labeled dialogue corpora; h_0 and r_0 correspond to the same dialogue corpus, h_1 and r_1 correspond to the same dialogue corpus, and so on. CORR denotes computing the correlation between the two sets; optionally, the correlation can be computed by methods such as the Pearson correlation coefficient or AUC.
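As a sketch of one option named above (the Pearson correlation coefficient), with toy h and r values assumed purely for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy values: quality characterization values h and confidences r for four
# quality-labeled dialogue corpora (illustration only, not from the source).
h = [1, 1, 0, 0]
r = [0.9, 0.8, 0.3, 0.2]
alpha = pearson(h, r)  # strongly correlated, so alpha is close to 1
```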
And c2, updating the parameters of the generative dialogue generating model according to the prediction loss of the generative dialogue generating model corresponding to the dialogue corpus.
And after obtaining the predicted loss, updating parameters of the generative dialogue generating model through a back propagation algorithm according to the predicted loss.
After steps S101 to S104, the optimization of the retrieval type dialog generation model, and the optimization of the generative dialog generation model have been performed, the process may return to step S101 to expand the dialog corpus and optimize the two models again, until a condition for ending the expansion is reached, for example, the number of dialog corpora in the dialog corpus reaches a preset number.
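A minimal sketch of this outer loop (steps S101 to S104 iterated until the end condition), with `generate` and `retrieve_conf` standing in for the generative and retrieval type dialog generation models respectively; both names are assumptions of the sketch.

```python
def expand_until(corpus, input_pool, target_size, generate, retrieve_conf,
                 conf_threshold):
    """Repeat the expansion cycle: keep only inputs not yet answered by the
    current corpus (S102), generate replies for them (S103), add the new
    dialogue corpora (S104), until the corpus reaches a preset size."""
    while len(corpus) < target_size and input_pool:
        remaining = [t for t in input_pool
                     if retrieve_conf(t, corpus) < conf_threshold]
        if not remaining:
            break  # every input already has a matching reply
        for text in remaining:
            corpus.append((text, generate(text)))
        input_pool = [t for t in input_pool if t not in remaining]
    return corpus

grown = expand_until([], ["a", "b"], 2,
                     generate=lambda t: t.upper(),
                     retrieve_conf=lambda t, c: 0.0,
                     conf_threshold=0.5)
```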
Sixth embodiment
The following describes the expansion device for a dialogue corpus provided in this embodiment; the expansion device described below and the expansion method for a dialogue corpus described above may be referred to in correspondence with each other.
Referring to fig. 7, a schematic structural diagram of a capacity expansion device for a dialogue corpus according to this embodiment is shown, where the capacity expansion device may include: an input text acquisition module 701, an input text filtering module 702, a reply text generation module 703 and a dialogue corpus expansion module 704.
An input text obtaining module 701, configured to obtain an input text collection.
Wherein the input text collection comprises at least one input text.
An input text filtering module 702, configured to filter, from the total input text set, input texts with matching reply texts in the current dialog corpus, where a set of remaining input texts is used as a target input text set.
The reply text generation module 703 is configured to generate, by using a pre-established generative dialogue generating model, the reply texts corresponding to the input texts in the target input text set, so as to obtain a first dialogue corpus set.
Each dialogue corpus in the first dialogue corpus set consists of an input text in the target input text set and the reply text corresponding to the input text.
A dialog corpus expansion module 704, configured to add the dialog corpus in the first dialog corpus to the current dialog corpus.
Optionally, the input text filtering module 702 includes: a reply text and confidence coefficient determining module and a first input text filtering module.
And the reply text and confidence coefficient determining module is used for determining the reply text corresponding to each input text in the input text total set and the confidence coefficient of the reply text corresponding to each input text by utilizing a pre-established retrieval type dialogue generating model and the current dialogue corpus.
And the first input text filtering module is used for filtering the input texts corresponding to the reply texts with the confidence degrees larger than or equal to the confidence degree threshold value in the input text total set.
Optionally, the input text filtering module 702 may further include: the quality label acquisition module and the second input text filtering module.
And the quality label obtaining module is used for obtaining a quality label of the reply text with the confidence coefficient smaller than the confidence coefficient threshold value, wherein the quality label of the reply text can indicate whether the reply text is qualified or not.
And the second input text filtering module is used for filtering the input text corresponding to the unqualified reply text in the input text total set according to the quality label corresponding to the reply text with the confidence coefficient smaller than the confidence coefficient threshold value.
Optionally, the capacity expansion device for a dialog corpus provided in this embodiment may further include: a confidence threshold determination module.
And the confidence threshold determining module is used for sequencing the reply texts respectively corresponding to the input texts in the input text total set from high confidence to low confidence, grouping the sequenced reply texts, acquiring the qualified rate corresponding to each group of reply texts, and determining the confidence corresponding to the last reply text in the reply text group with the first qualified rate smaller than the preset qualified rate threshold as the confidence threshold.
Optionally, the dialogue corpus expansion module may include: a confidence evaluation module and a first dialogue corpus adding module.

And the confidence evaluation module is used for evaluating the confidence corresponding to each dialogue corpus in the first dialogue corpus set to obtain the confidence corresponding to each dialogue corpus in the first dialogue corpus set.

And the first dialogue corpus adding module is used for selecting, from the first dialogue corpus set, the dialogue corpora with confidence greater than or equal to a preset confidence threshold and adding them into the current dialogue corpus.

Optionally, the dialogue corpus expansion module may further include: a second dialogue corpus adding module.

And the second dialogue corpus adding module is used for acquiring the quality label corresponding to each dialogue corpus whose confidence is below the confidence threshold, and, according to these quality labels, selecting the dialogue corpora with qualified reply texts from the dialogue corpora whose confidence is below the confidence threshold and adding them into the current dialogue corpus.

The quality label corresponding to one dialogue corpus is the quality label of the reply text in the dialogue corpus, and can represent whether the reply text in the dialogue corpus is qualified or not.
Optionally, the capacity expansion device for a dialog corpus provided in this embodiment may further include: and the retrieval type dialogue generation model optimization module.
And the retrieval type dialogue generation model optimization module is used for acquiring the training dialogue corpus set and the quality label corresponding to each dialogue corpus in the training dialogue corpus set, and training the retrieval type dialogue generation model by using the dialogue corpora in the training dialogue corpus set and their corresponding quality labels, so as to optimize the performance of the retrieval type dialogue generation model.
The training dialogue corpus set comprises all the dialogue corpora in the first dialogue corpus set, or comprises the dialogue corpora in the first dialogue corpus set whose confidence is below the confidence threshold.
Optionally, the retrieval type dialog generation model optimization module includes: the system comprises a sample difficulty determining module, a sampling probability distribution determining module, a dialogue corpus sampling module and a model training module.
And the sample difficulty determining module is used for determining the sample difficulty corresponding to each dialogue corpus in the training dialogue corpus set according to the quality label and the confidence corresponding to each dialogue corpus in the training dialogue corpus set.
And the sampling probability distribution determining module is used for determining the sampling probability distribution according to the sample difficulty corresponding to each dialogue corpus in the training dialogue corpus set.
And the dialogue corpus sampling module is used for sampling dialogue corpora from the training dialogue corpus in a centralized manner according to the sampling probability distribution.
And the model training module is used for training the retrieval type dialogue generating model by utilizing the sampled dialogue corpus and the quality label corresponding to the sampled dialogue corpus.
Optionally, the searching dialog generation model optimizing module may further include: and a sampling probability distribution adjusting module.
And the sampling probability distribution adjusting module is used for adjusting the sampling probability distribution when the retrieval type dialogue generating model is trained by using dialogue corpora sampled based on the sampling probability distribution to ensure that the prediction loss of the retrieval type dialogue generating model is stable, so as to improve the sampling probability of high-difficulty samples.
And the model training module is also used for sampling dialogue corpora from the training dialogue corpus according to the adjusted sampling probability distribution, and training the retrieval type dialogue generating model by using the sampled dialogue corpora and the corresponding quality labels.
Optionally, the capacity expansion device for a dialog corpus provided in this embodiment may further include: and the generative dialogue generative model optimization module.
And the generative dialogue generative model optimization module is used for determining the prediction loss of the generative dialogue generative model corresponding to each dialogue corpus in the first dialogue corpus according to the confidence coefficient corresponding to the dialogue corpus and the quality characterization value corresponding to the dialogue corpus, and updating the parameters of the generative dialogue generative model according to the prediction loss of the generative dialogue generative model corresponding to the dialogue corpus.
The quality characterization value corresponding to a dialogue corpus can characterize whether the reply text in the dialogue corpus is qualified, and is determined according to the confidence corresponding to the dialogue corpus, or according to the confidence and the quality label corresponding to the dialogue corpus.
The capacity expansion device for a dialogue corpus provided by the embodiment of the application first obtains an input text total set, then filters out, from the input text total set, the input texts that have matching reply texts in the current dialogue corpus, with the set of remaining input texts serving as the target input text set; it then generates the reply texts corresponding to the input texts in the target input text set by using a pre-established generative dialogue generating model to obtain a first dialogue corpus set, and finally adds the dialogue corpora in the first dialogue corpus set into the current dialogue corpus. In this dialogue corpus expansion scheme, the reply text corresponding to an input text can be generated automatically by the generative dialogue generating model to obtain a dialogue corpus. Because the dialogue corpora added into the dialogue corpus are generated automatically rather than written manually, the efficiency of obtaining dialogue corpora is higher and the labor cost is lower; in addition, because the dialogue corpora added into the dialogue corpus are of good quality, the quality of the dialogue corpus can be ensured.
Seventh embodiment
An embodiment of the present application further provides a capacity expansion device for a dialog corpus, please refer to fig. 8, which shows a schematic structural diagram of the capacity expansion device for the dialog corpus, where the capacity expansion device for the dialog corpus may include: at least one processor 801, at least one communication interface 802, at least one memory 803, and at least one communication bus 804;
in the embodiment of the present application, the number of the processor 801, the communication interface 802, the memory 803, and the communication bus 804 is at least one, and the processor 801, the communication interface 802, and the memory 803 complete communication with each other through the communication bus 804;
the processor 801 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, etc.;
the memory 803 may include a high-speed RAM memory, and may further include a non-volatile memory or the like, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring an input text collection, wherein the input text collection comprises at least one input text;
filtering out input texts with matching reply texts in the current dialogue corpus from the input text total set, wherein a set formed by the remaining input texts is used as a target input text set;
generating a reply text corresponding to the input text in the target input text set by using a pre-established generative dialogue generating model to obtain a first dialogue corpus set, wherein each dialogue corpus in the first dialogue corpus set consists of one input text in the target input text set and the reply text corresponding to the input text;
and adding the dialogue corpora in the first dialogue corpus into the current dialogue corpus.
Alternatively, the detailed function and the extended function of the program may be as described above.
Eighth embodiment
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
acquiring an input text collection, wherein the input text collection comprises at least one input text;
filtering out input texts with matching reply texts in the current dialog corpus from the input text total set, wherein a set formed by the remaining input texts is used as a target input text set;
generating a reply text corresponding to the input text in the target input text set by using a pre-established generative dialogue generating model to obtain a first dialogue corpus set, wherein each dialogue corpus in the first dialogue corpus set consists of one input text in the target input text set and the reply text corresponding to the input text;
and adding the dialogue corpora in the first dialogue corpus into the current dialogue corpus.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for expanding a dialogue corpus, comprising:
acquiring an input text total set, wherein the input text total set comprises at least one input text;
filtering out, from the input text total set, input texts for which matching reply texts exist in the current dialogue corpus, wherein a set formed by the remaining input texts serves as a target input text set;
generating, by using a pre-established generative dialogue generation model, a reply text corresponding to each input text in the target input text set to obtain a first dialogue corpus set, wherein each dialogue corpus in the first dialogue corpus set consists of one input text in the target input text set and the reply text corresponding to that input text;
adding the dialogue corpora in the first dialogue corpus set into the current dialogue corpus;
wherein the filtering out, from the input text total set, of input texts for which matching reply texts exist in the current dialogue corpus comprises:
determining, by using a pre-established retrieval-type dialogue generation model and the current dialogue corpus, a reply text corresponding to each input text in the input text total set and a confidence of the reply text corresponding to each input text;
acquiring a quality label for each reply text whose confidence is smaller than a confidence threshold, wherein the quality label of a reply text indicates whether the reply text is qualified;
and filtering, from the input text total set, the input texts corresponding to unqualified reply texts according to the quality labels corresponding to the reply texts whose confidence is smaller than the confidence threshold.
2. The method for expanding a dialogue corpus according to claim 1, wherein the filtering out, from the input text total set, of input texts for which matching reply texts exist in the current dialogue corpus further comprises:
filtering, from the input text total set, the input texts corresponding to reply texts whose confidence is greater than or equal to the confidence threshold.
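The two filters of claims 1 and 2 can be sketched as follows. The code follows the claim wording literally; the retrieval model is replaced by a fixed lookup of (reply, confidence) pairs, and the confidence scores, threshold, and quality labels are all illustrative assumptions.

```python
def target_input_text_set(inputs, retrieved, quality_label, threshold):
    """Apply the claim-1 and claim-2 filters to the input text total set."""
    targets = []
    for text in inputs:
        reply, confidence = retrieved[text]
        if confidence >= threshold:
            continue  # claim 2: filter inputs with high-confidence replies
        if not quality_label(reply):
            continue  # claim 1: filter inputs whose low-confidence reply is unqualified
        targets.append(text)
    return targets

# Illustrative retrieval results: input -> (reply text, confidence).
retrieved = {
    "q1": ("r1", 0.9),   # filtered by the claim-2 confidence test
    "q2": ("r2", 0.4),   # low confidence, qualified reply -> remains
    "q3": ("r3", 0.3),   # low confidence, unqualified reply -> filtered by claim 1
}
qualified = {"r1": True, "r2": True, "r3": False}
remaining = target_input_text_set(["q1", "q2", "q3"], retrieved,
                                  qualified.get, threshold=0.7)
```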
3. The method for expanding a dialogue corpus according to claim 2, wherein the process of determining the confidence threshold comprises:
sorting the reply texts respectively corresponding to the input texts in the input text total set in descending order of confidence;
grouping the sorted reply texts, and acquiring a qualification rate corresponding to each group of reply texts;
and determining, as the confidence threshold, the confidence corresponding to the last reply text in the first reply text group whose qualification rate is smaller than a preset qualification rate threshold.
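The threshold determination of claim 3 can be sketched as a runnable example. The group size, the preset qualification rate threshold, and the (confidence, qualified) data below are illustrative choices, not values from the disclosure.

```python
def determine_confidence_threshold(replies, group_size, rate_threshold):
    """replies: list of (confidence, qualified) pairs for the reply texts."""
    # Sort the reply texts by confidence, from high to low.
    ordered = sorted(replies, key=lambda r: r[0], reverse=True)
    # Group the sorted replies and compute each group's qualification rate.
    for i in range(0, len(ordered), group_size):
        group = ordered[i:i + group_size]
        rate = sum(1 for _, ok in group if ok) / len(group)
        # The threshold is the confidence of the last reply text in the
        # first group whose qualification rate falls below the preset rate.
        if rate < rate_threshold:
            return group[-1][0]
    return None  # no group fell below the preset qualification rate

replies = [(0.9, True), (0.8, True), (0.7, True), (0.6, False),
           (0.5, True), (0.4, False), (0.3, False), (0.2, False)]
threshold = determine_confidence_threshold(replies, group_size=2,
                                           rate_threshold=0.6)
# The group [(0.7, True), (0.6, False)] is the first with a rate (0.5)
# below 0.6, so the threshold is the last confidence in it.
```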
4. The method for expanding a dialogue corpus according to claim 1, wherein the adding the dialogue corpora in the first dialogue corpus set into the current dialogue corpus comprises:
evaluating each dialogue corpus in the first dialogue corpus set to obtain a confidence corresponding to each dialogue corpus in the first dialogue corpus set;
and selecting, from the first dialogue corpus set, dialogue corpora whose confidence is greater than or equal to a preset confidence threshold, and adding the selected dialogue corpora into the current dialogue corpus.
5. The method for expanding a dialogue corpus according to claim 4, wherein the adding the dialogue corpora in the first dialogue corpus set into the current dialogue corpus further comprises:
acquiring a quality label corresponding to each dialogue corpus whose confidence is smaller than the confidence threshold, wherein the quality label corresponding to a dialogue corpus is the quality label of the reply text in that dialogue corpus and indicates whether that reply text is qualified;
and selecting, according to the quality label corresponding to each dialogue corpus whose confidence is smaller than the confidence threshold, the dialogue corpora with qualified reply texts from among those dialogue corpora, and adding the selected dialogue corpora into the current dialogue corpus.
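The selection logic of claims 4 and 5 can be sketched together. The confidences and quality labels are supplied directly here; in the disclosure they come from a confidence evaluation step and a labelling step respectively, and the threshold is an illustrative value.

```python
def select_corpora(first_set, threshold):
    """first_set: list of (dialogue_corpus, confidence, qualified_label)."""
    selected = []
    for pair, confidence, qualified in first_set:
        if confidence >= threshold:
            selected.append(pair)   # claim 4: high-confidence corpora are added
        elif qualified:
            selected.append(pair)   # claim 5: low-confidence but qualified corpora
    return selected

first_set = [
    (("q1", "r1"), 0.9, True),   # added via the confidence test
    (("q2", "r2"), 0.4, True),   # added via its quality label
    (("q3", "r3"), 0.3, False),  # dropped: low confidence and unqualified
]
kept = select_corpora(first_set, threshold=0.7)
```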
6. The method for expanding a dialogue corpus according to claim 4, further comprising:
acquiring a training dialogue corpus set and a quality label corresponding to each dialogue corpus in the training dialogue corpus set, wherein the training dialogue corpus set comprises all dialogue corpora in the first dialogue corpus set, or comprises the dialogue corpora in the first dialogue corpus set whose confidence is smaller than the confidence threshold;
and training the retrieval-type dialogue generation model by using the dialogue corpora in the training dialogue corpus set and their corresponding quality labels, so as to optimize the performance of the retrieval-type dialogue generation model.
7. The method for expanding a dialogue corpus according to claim 6, wherein the training the retrieval-type dialogue generation model by using the dialogue corpora in the training dialogue corpus set and their corresponding quality labels comprises:
determining a sample difficulty corresponding to each dialogue corpus in the training dialogue corpus set according to the quality label and the confidence corresponding to that dialogue corpus;
determining a sampling probability distribution according to the sample difficulty corresponding to each dialogue corpus in the training dialogue corpus set;
sampling dialogue corpora from the training dialogue corpus set according to the sampling probability distribution;
and training the retrieval-type dialogue generation model by using the sampled dialogue corpora and their corresponding quality labels.
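One plausible instantiation of claim 7's difficulty-driven sampling is sketched below. The specific difficulty definition (high when the model's confidence disagrees with the quality label) and the normalization into a distribution are assumptions for illustration, not taken from the disclosure.

```python
import random

def sample_difficulty(confidence, qualified):
    # A qualified reply the model scored low, or an unqualified reply it
    # scored high, is treated as a difficult sample.
    return 1.0 - confidence if qualified else confidence

def sampling_distribution(training_set):
    """training_set: list of (dialogue_corpus, confidence, qualified)."""
    difficulties = [sample_difficulty(c, q) for _, c, q in training_set]
    total = sum(difficulties) or 1.0
    return [d / total for d in difficulties]

training_set = [
    (("q1", "r1"), 0.9, True),   # easy: confident and qualified
    (("q2", "r2"), 0.2, True),   # hard: qualified but low confidence
    (("q3", "r3"), 0.8, False),  # hard: confident but unqualified
]
probs = sampling_distribution(training_set)
# Sample a training batch according to the distribution.
rng = random.Random(0)
batch = rng.choices([pair for pair, _, _ in training_set], weights=probs, k=4)
```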
8. The method for expanding a dialogue corpus according to claim 7, wherein the training the retrieval-type dialogue generation model by using the dialogue corpora in the training dialogue corpus set and their corresponding quality labels further comprises:
when training the retrieval-type dialogue generation model with dialogue corpora sampled based on the sampling probability distribution causes the prediction loss of the retrieval-type dialogue generation model to stabilize, adjusting the sampling probability distribution to increase the sampling probability of high-difficulty samples;
and sampling dialogue corpora from the training dialogue corpus set according to the adjusted sampling probability distribution, and training the retrieval-type dialogue generation model by using the sampled dialogue corpora and their corresponding quality labels.
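Claim 8's adjustment step can be sketched as follows. Both the plateau detector (a small window over recent losses) and the power re-weighting toward high-difficulty samples (exponent `gamma`) are illustrative choices, not details from the disclosure.

```python
def adjust_distribution(probs, difficulties, gamma=2.0):
    # Re-weight the sampling distribution toward high-difficulty samples
    # by raising each difficulty to a power, then renormalizing.
    weighted = [p * (d ** gamma) for p, d in zip(probs, difficulties)]
    total = sum(weighted) or 1.0
    return [w / total for w in weighted]

def loss_stabilized(losses, window=3, tol=1e-3):
    # Treat the prediction loss as stable when recent values barely change.
    recent = losses[-window:]
    return len(recent) == window and max(recent) - min(recent) < tol

probs = [0.5, 0.3, 0.2]
difficulties = [0.1, 0.6, 0.9]
if loss_stabilized([0.4120, 0.4115, 0.4112]):
    probs = adjust_distribution(probs, difficulties)
# After the adjustment, higher-difficulty samples get higher probability.
```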
9. The method for expanding a dialogue corpus according to claim 4, further comprising:
for each dialogue corpus in the first dialogue corpus set, determining a prediction loss of the generative dialogue generation model corresponding to the dialogue corpus according to the confidence corresponding to the dialogue corpus and a quality characterization value corresponding to the dialogue corpus, and updating parameters of the generative dialogue generation model according to that prediction loss;
wherein the quality characterization value corresponding to a dialogue corpus characterizes whether the reply text in the dialogue corpus is qualified, and is determined according to the confidence corresponding to the dialogue corpus, or according to the confidence and the quality label corresponding to the dialogue corpus.
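A sketch of the loss shaping described in claim 9. The concrete quality characterization value (derived from the label when available, otherwise from a confidence cutoff of 0.5) and the multiplicative weighting of the negative log-likelihood are illustrative assumptions.

```python
import math

def quality_value(confidence, label=None):
    # Determined from the quality label when one exists, otherwise from
    # the confidence alone (the 0.5 cutoff is an illustrative choice).
    if label is not None:
        return 1.0 if label else 0.0
    return 1.0 if confidence >= 0.5 else 0.0

def prediction_loss(token_probs, confidence, label=None):
    # Negative log-likelihood of the reply, scaled by the quality value,
    # so unqualified samples contribute no gradient signal.
    nll = -sum(math.log(p) for p in token_probs)
    return quality_value(confidence, label) * nll

loss_good = prediction_loss([0.5, 0.25], confidence=0.9)  # full NLL
loss_bad = prediction_loss([0.5, 0.25], confidence=0.2)   # zeroed out
```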
10. A capacity expansion apparatus for a dialogue corpus, comprising: an input text acquisition module, an input text filtering module, a reply text generation module, and a dialogue corpus expansion module;
wherein the input text acquisition module is configured to acquire an input text total set, wherein the input text total set comprises at least one input text;
the input text filtering module is configured to filter out, from the input text total set, input texts for which matching reply texts exist in the current dialogue corpus, wherein a set formed by the remaining input texts serves as a target input text set;
the reply text generation module is configured to generate, by using a pre-established generative dialogue generation model, a reply text corresponding to each input text in the target input text set to obtain a first dialogue corpus set, wherein each dialogue corpus in the first dialogue corpus set consists of one input text in the target input text set and the reply text corresponding to that input text;
and the dialogue corpus expansion module is configured to add the dialogue corpora in the first dialogue corpus set into the current dialogue corpus;
wherein the filtering out, from the input text total set, of input texts for which matching reply texts exist in the current dialogue corpus comprises:
determining, by using a pre-established retrieval-type dialogue generation model and the current dialogue corpus, a reply text corresponding to each input text in the input text total set and a confidence of the reply text corresponding to each input text;
acquiring a quality label for each reply text whose confidence is smaller than a confidence threshold, wherein the quality label of a reply text indicates whether the reply text is qualified;
and filtering, from the input text total set, the input texts corresponding to unqualified reply texts according to the quality labels corresponding to the reply texts whose confidence is smaller than the confidence threshold.
11. An expansion device for a dialogue corpus, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the method for expanding a dialogue corpus according to any one of claims 1 to 9.
12. A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for expanding a dialogue corpus according to any one of claims 1 to 9.
CN202011146220.7A 2020-10-23 2020-10-23 Capacity expansion method, device, equipment and storage medium for dialogue corpus Active CN112231458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011146220.7A CN112231458B (en) 2020-10-23 2020-10-23 Capacity expansion method, device, equipment and storage medium for dialogue corpus


Publications (2)

Publication Number Publication Date
CN112231458A CN112231458A (en) 2021-01-15
CN112231458B true CN112231458B (en) 2023-03-21

Family

ID=74109273


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896385A (en) * 2022-07-15 2022-08-12 北京聆心智能科技有限公司 Training of conversation generation model and conversation generation method, device and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105812473A (en) * 2016-03-29 2016-07-27 成都小多科技有限公司 Data processing method and device
CN110362659A (en) * 2019-07-16 2019-10-22 北京洛必德科技有限公司 The abnormal statement filter method and system of the open corpus of robot
CN110457459A (en) * 2019-08-16 2019-11-15 深圳前海达闼云端智能科技有限公司 Dialog generation method, device, equipment and storage medium based on artificial intelligence
CN110532348A (en) * 2019-09-04 2019-12-03 网易(杭州)网络有限公司 Question and answer are to the generation method of data, device and electronic equipment
CN110990546A (en) * 2019-11-29 2020-04-10 中国银行股份有限公司 Intelligent question and answer corpus updating method and device
CN111026884A (en) * 2019-12-12 2020-04-17 南昌众荟智盈信息技术有限公司 Dialog corpus generation method for improving quality and diversity of human-computer interaction dialog corpus


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"A Corpus-based Study of Chunk Features of Professional Interpreters in Chinese-English Consecutive Interpreting"; Shao Xian; Technology Enhanced Foreign Language Education; 2018-10-20; entire document *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant