CN114490976B - Method, device, equipment and storage medium for generating dialogue abstract training data


Info

Publication number: CN114490976B
Authority: CN (China)
Prior art keywords: candidate, text, abstract, dialogue, texts
Legal status: Active (granted)
Application number: CN202111680397.XA
Other languages: Chinese (zh)
Other versions: CN114490976A (application publication)
Inventors: 陈默也, 李伟, 刘家辰, 肖欣延
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111680397.XA
Publication of application CN114490976A
Application granted; publication of grant CN114490976B

Classifications

    • G06F16/3329: Natural language query formulation or dialogue systems (G Physics; G06F Electric digital data processing; G06F16/00 Information retrieval of unstructured textual data; G06F16/33 Querying; G06F16/332 Query formulation)
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles (G06F16/33 Querying)
    • G06F16/345: Summarisation for human users (G06F16/34 Browsing; Visualisation therefor)
    • G06N3/08: Learning methods (G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks)

Abstract

The disclosure provides a method, an apparatus, an electronic device and a storage medium for generating dialogue summary training data, relating to the field of computer technology, and in particular to artificial intelligence technologies such as deep learning and natural language processing. The scheme is implemented as follows: acquiring a plurality of reference texts and a reference summary corresponding to each reference text; generating a dialogue sentence set based on the sentences contained in each reference text; randomly arranging the sentences in the dialogue sentence set to generate a dialogue text; and randomly arranging the plurality of reference summaries to generate a target summary corresponding to the dialogue text. By constructing dialogue texts and dialogue summaries from texts in non-dialogue form and their corresponding summaries, the disclosure makes it possible to obtain a large amount of dialogue summary training data, providing support for training a dialogue summarization model.

Description

Method, device, equipment and storage medium for generating dialogue abstract training data
Technical Field
The present disclosure relates to the field of computer technology, in particular to artificial intelligence technologies such as natural language processing and deep learning, and more specifically to a method, a device, equipment and a storage medium for generating dialogue summary training data.
Background
Currently, dialogue data of various forms is growing, such as meetings, chats, customer service dialogues, doctor-patient dialogues and the like. A dialogue summary extracts the key information from complex dialogue data, reducing the difficulty of understanding it. Building a dialogue summarization model is one way of obtaining dialogue summaries from dialogue data. However, manually annotating dialogue summaries is time-consuming and laborious, so studying how to generate a large amount of dialogue summary training data is of great importance.
Disclosure of Invention
The disclosure provides a method, a device, equipment and a storage medium for generating dialogue abstract training data.
According to a first aspect of the present disclosure, there is provided a method for generating dialogue summary training data, including:
acquiring a plurality of reference texts and a reference abstract corresponding to each reference text;
generating a dialogue sentence set based on sentences contained in each reference text;
randomly arranging the sentences in the dialogue sentence set to generate a dialogue text;
and randomly arranging the plurality of reference summaries to generate a target summary corresponding to the dialogue text.
According to a second aspect of the present disclosure, there is provided a device for generating dialogue summary training data, including:
an acquisition module, configured to acquire a plurality of reference texts and the reference summary corresponding to each reference text;
a first generation module, configured to generate a dialogue sentence set based on the sentences contained in each reference text;
a second generation module, configured to randomly arrange the sentences in the dialogue sentence set to generate a dialogue text;
and a third generation module, configured to randomly arrange the plurality of reference summaries to generate a target summary corresponding to the dialogue text.
Embodiments of a third aspect of the present disclosure provide a computer device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method set forth in the embodiments of the first aspect of the present disclosure when executing the program.
An embodiment of a fourth aspect of the present disclosure proposes a non-transitory computer readable storage medium storing a computer program which, when executed by a processor, implements a method as proposed by an embodiment of the first aspect of the present disclosure.
Embodiments of a fifth aspect of the present disclosure propose a computer program product whose instructions, when executed by a processor, perform the method proposed by the embodiments of the first aspect of the present disclosure.
The method, the device, the equipment and the storage medium for generating the dialogue abstract training data have at least the following beneficial effects:
firstly, a plurality of reference texts and the reference summary corresponding to each reference text are acquired; a dialogue sentence set is then generated based on the sentences contained in each reference text; the sentences in the dialogue sentence set are randomly arranged to generate a dialogue text; and finally, the plurality of reference summaries are randomly arranged to generate a target summary corresponding to the dialogue text. In this way, dialogue texts and dialogue summaries are constructed from texts in non-dialogue form and their corresponding summaries, so that a large amount of dialogue summary training data can be obtained, providing support for training a dialogue summarization model.
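To make the four-step scheme concrete, the following is a minimal end-to-end sketch of building one training pair. It assumes the reference texts and summaries are plain strings and uses simple punctuation-based sentence splitting; both choices are illustrative, not prescribed by the disclosure.

```python
import random
import re

def make_training_pair(references: list[tuple[str, str]]) -> tuple[str, str]:
    """Build one (dialogue text, target summary) pair from several
    (reference text, reference summary) pairs."""
    # Split every reference text into sentences and pool them together.
    pool = []
    for text, _summary in references:
        pool += [s.strip() for s in re.split(r"[。！？!?.]", text) if s.strip()]
    # Randomly arrange the pooled sentences into a dialogue text.
    random.shuffle(pool)
    dialogue_text = "\n".join(pool)
    # Randomly arrange the whole reference summaries into the target summary.
    summaries = [summary for _text, summary in references]
    random.shuffle(summaries)
    return dialogue_text, " ".join(summaries)
```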
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method for generating dialogue summary training data according to an embodiment of the disclosure;
FIG. 2 is a flow chart of a method for generating dialogue summary training data according to another embodiment of the disclosure;
FIG. 3 is a flow chart of a method for generating dialogue summary training data according to another embodiment of the disclosure;
FIG. 4 is a flow chart of a method for generating dialogue summary training data according to another embodiment of the disclosure;
FIG. 5 is a schematic diagram of a device for generating dialogue summary training data according to an embodiment of the disclosure;
FIG. 6 is a block diagram of an electronic device for implementing a method of generating dialogue summary training data according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to facilitate understanding of the present disclosure, the technical field to which the present disclosure relates is first briefly explained in the following.
Artificial intelligence is the discipline of making a computer mimic certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking and planning), and covers technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning, deep learning, big data processing, knowledge graph technology and the like.
Deep learning learns the inherent regularities and representation hierarchies of sample data, and the information obtained during such learning helps interpret data such as text, images and sounds. Its ultimate goal is to give machines analytical learning abilities like those of a person, so that they can recognize text, image and sound data. Deep learning is a complex machine learning approach whose results in speech and image recognition far exceed those of earlier techniques.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective natural-language communication between humans and computers. Summarization is a classical task in natural language processing whose core purpose is to select the key information from an input and turn it into a short summary. For example, given a piece of news, a short summary can be generated through summarization technology; the summary saves readers time and helps them quickly grasp the important content of the original text.
The method, apparatus, computer device and storage medium for generating dialogue summary training data provided in the present disclosure are described in detail below with reference to the accompanying drawings.
The method for generating dialogue summary training data provided by the present disclosure may be executed by the device for generating dialogue summary training data provided by the present disclosure, or by the electronic device provided by the present disclosure. The electronic device may include, but is not limited to, a terminal device such as a mobile phone, a desktop computer or a tablet computer, or may be a server. Without limiting the present disclosure, the following description takes the case where the method is executed by the device for generating dialogue summary training data, hereinafter referred to simply as the "device".
Fig. 1 is a flow chart illustrating a method for generating dialogue summary training data according to an embodiment of the present disclosure.
As shown in fig. 1, the method for generating the dialogue summary training data may include the following steps:
step S101, a plurality of reference texts and a reference abstract corresponding to each reference text are obtained.
It should be noted that, according to the type of data being summarized, summaries can be classified into news summaries, paper abstracts, dialogue summaries and so on.
A dialogue summary is a summary derived from dialogue data, which may take different forms such as meetings, chats, emails and other conversations.
It will be appreciated that a dialogue summary contains the key information in the dialogue data, so the core content of a dialogue can be quickly understood through its summary.
For example, a conversation between two speakers is:
A: Let's discuss dialogue summarization. B: That is a very hot topic. A: It is, but it is still a difficult problem.
A summary of the above dialogue may be: although popular, dialogue summarization remains a difficult problem.
In the embodiment of the disclosure, in order to generate the dialogue abstract training data, the dialogue abstract training data may be constructed by using the reference text and the reference abstract corresponding to the reference text in other fields.
Wherein the reference text may be any type of text having a corresponding abstract. For example, the reference text may be news, and the news digest is the corresponding reference digest. Alternatively, the reference text may be a microblog, and the microblog title is a corresponding reference abstract.
Compared with other types of data, dialogue data is characterized by multiple speakers speaking in turn. Therefore, in the embodiments of the present disclosure, two or more reference texts and the reference summary corresponding to each reference text may be obtained when constructing dialogue summary training data.
Since the number of participants in a real conversation is not fixed, the number of selected reference texts is not limited in the embodiments of the present disclosure.
For example, two reference texts and corresponding two reference summaries may be used to construct a dialog summary training data. Alternatively, three reference texts and corresponding three reference summaries may be used to construct a dialog summary training data.
It should be noted that the above examples are illustrative only and should not be taken as limiting the reference text in the embodiments of the present disclosure.
Step S102, a dialogue sentence set is generated based on sentences contained in each reference text.
It should be noted that each reference text contains a plurality of sentences arranged in a certain order. To construct dialogue data in the form of a dialogue from reference texts, each reference text may be divided into a plurality of sentences to generate a set of dialogue sentences.
The specific division manner of each reference text can be implemented in any possible manner.
For example, the reference text may be divided into a plurality of sentences according to a particular type of punctuation. Alternatively, the reference text may be divided into a plurality of sentences in any type of punctuation.
For example, only punctuation marks representing long pauses, such as periods, question marks and exclamation marks, may be used as the basis for dividing sentences. Alternatively, periods, question marks, exclamation marks, commas, enumeration commas and colons may be used as the basis for dividing sentences. Alternatively, all punctuation marks may be used as the basis for dividing sentences, which is not limited by the present disclosure.
It should be noted that, after the reference text is divided into sentences, all the sentences of the respective reference texts may be formed into a dialogue sentence set. Alternatively, a set of dialogue sentences may be formed by selecting a partial sentence of each reference text.
For example, sentences having a number of words exceeding a set threshold may be selected to form a dialogue sentence set. Alternatively, sentences containing key information may be selected to form a dialogue sentence set, which is not limited by the present disclosure.
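As an illustration of step S102, the following is a minimal sketch of splitting reference texts into sentences by punctuation and collecting them into a dialogue sentence set; the delimiter set and the word-count threshold are illustrative assumptions, since the disclosure leaves both open.

```python
import re

# Punctuation treated as sentence boundaries (an illustrative choice; the
# disclosure allows any subset of punctuation marks to be used).
SENTENCE_DELIMITERS = r"[。！？!?.]"

def split_sentences(text: str) -> list[str]:
    """Divide a reference text into sentences at the delimiter punctuation."""
    return [p.strip() for p in re.split(SENTENCE_DELIMITERS, text) if p.strip()]

def build_dialogue_sentence_set(reference_texts: list[str],
                                min_chars: int = 5) -> list[list[str]]:
    """Keep, per reference text, only sentences whose length exceeds a set
    threshold, one way the text suggests of selecting partial sentences."""
    return [[s for s in split_sentences(t) if len(s) >= min_chars]
            for t in reference_texts]
```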
Step S103, each sentence in the dialogue sentence set is arranged randomly to generate dialogue text.
In order to embody the characteristic that a plurality of speakers speak alternately in the dialogue data, the sentence sequence of each reference text in the dialogue sentence set can be disordered and randomly arranged to form the dialogue text.
For example, suppose the dialogue sentence set includes sentences A1, A2, A3, A4 of reference text A and sentences B1, B2, B3, B4 of reference text B. The dialogue text generated after randomly arranging the sentences in the dialogue sentence set may be A1B1A2B2A3B3A4B4 or A4A1B1A2B4A3B2B3.
Alternatively, suppose the dialogue sentence set includes sentences A1, A2, A3 of reference text A, sentences B1, B2, B3 of reference text B, and sentences C1, C2, C3 of reference text C. The dialogue text generated after randomly arranging the sentences in the dialogue sentence set may be A1B1C1A2B2C2A3B3C3 or C1B3B2A2C3A3B1C2A1.
It should be noted that the above examples are only illustrative, and should not be taken as limiting the text of the dialog in the embodiments of the disclosure.
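A minimal sketch of step S103 follows: the sentences of all reference texts are pooled and shuffled, each reference text playing the role of one speaker. The speaker labels are added purely for illustration; the disclosure does not require them.

```python
import random

def build_dialogue_text(sentence_sets: list[list[str]]) -> str:
    """Pool the sentences of all reference texts, shuffle them, and join them
    into a dialogue text, e.g. A1B1A2B2... or A4A1B1A2... in the examples."""
    pool = [(speaker, s)
            for speaker, sentences in enumerate(sentence_sets)
            for s in sentences]
    random.shuffle(pool)
    return "\n".join(f"Speaker {chr(ord('A') + spk)}: {s}" for spk, s in pool)
```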
Step S104, a plurality of reference summaries are arranged randomly to generate a target summary corresponding to the dialogue text.
It will be appreciated that the dialog digest training data includes two parts, namely a dialog part and a digest part. Wherein the dialogue part can be used as input data, and the abstract part can be used as target data.
In the embodiment of the disclosure, after the dialogue text is generated by using the plurality of reference texts, the reference summaries corresponding to the plurality of reference texts may be randomly combined to generate the target summary corresponding to the dialogue text.
It should be noted that, the reference abstract corresponding to each reference text may include a plurality of abstract sentences. When a plurality of reference digests are arranged randomly, each reference digest is arranged as a whole, and the order of digest statements contained in each reference digest is unchanged.
For example, a dialog text is generated by using the reference text a and the reference text B, the reference abstract corresponding to the reference text a is A0, the reference abstract corresponding to the reference text B is B0, and the target abstract corresponding to the dialog text may be A0B0 or B0A0.
Or, the reference text a, the reference text B and the reference text C are utilized to generate the dialogue text, the reference abstract corresponding to the reference text a is A0, the reference abstract corresponding to the reference text B is B0, the reference abstract corresponding to the reference text C is C0, and the target abstract corresponding to the dialogue text may be C0A0B0, or B0C0A0, or the like.
It should be noted that the above examples are illustrative only and should not be taken as limiting the object abstract of the disclosure.
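Step S104 can be sketched in the same spirit: each reference summary is kept intact as one unit and only the order of the units is shuffled, so the result is A0B0 or B0A0 in the two-text example above.

```python
import random

def build_target_summary(reference_summaries: list[str]) -> str:
    """Randomly arrange whole reference summaries; the summary sentences
    inside each reference summary keep their original order."""
    order = reference_summaries[:]   # copy so the input is not mutated
    random.shuffle(order)
    return " ".join(order)
```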
In the embodiments of the present disclosure, a plurality of reference texts and the reference summary corresponding to each reference text are first acquired; a dialogue sentence set is then generated based on the sentences contained in each reference text; the sentences in the dialogue sentence set are randomly arranged to generate a dialogue text; and finally, the plurality of reference summaries are randomly arranged to generate a target summary corresponding to the dialogue text. In this way, dialogue texts and dialogue summaries are constructed from texts in non-dialogue form and their corresponding summaries, so that a large amount of dialogue summary training data can be obtained, which provides support for training a dialogue summarization model and saves the time and effort of manual annotation.
Fig. 2 is a flow chart of a method of generating dialog digest training data according to another embodiment of the disclosure.
As shown in fig. 2, the method for generating the dialogue summary training data may include the following steps:
step S201, determining candidate words.
It should be noted that a conversation among multiple speakers generally revolves around a certain topic, such as sports, games or food.
Thus, in order to make the constructed dialogue summary training data more realistic, the relevant reference texts may be filtered based on the dialogue topic.
Specifically, keywords may be determined according to the dialog topic, and the keywords may be used as candidate words. Wherein the candidate terms may include one or more, which is not limited by this disclosure.
For example, when constructing dialogue data for a sports topic, the candidate word may be determined to be "sports". Alternatively, more specific words such as "basketball" or "football" may be used.
It should be noted that the foregoing is merely illustrative and is not intended to limit the terms used as candidates in the embodiments of the present disclosure.
Step S202, determining the number of candidate summaries containing the candidate words in the text data set.
It should be noted that the reference text may be obtained from a specific text data set. Such as news summary data sets, paper summary data sets, and the like.
Since constructing the dialog text requires multiple reference texts, when screening the reference texts based on candidate words, the number of individual candidate summaries in the text data set that contain candidate words may be first determined to further determine whether multiple reference texts can be obtained.
In step S203, in response to the number of candidate summaries being greater than the first threshold, a reference summary is selected from the candidate summaries, and the candidate text corresponding to the candidate summary is the reference text.
It should be noted that when the number of candidate summaries containing candidate words is 0 or 1, it indicates that the text data set contains no or only 1 text of the relevant subject matter. The candidate term may be replaced or adjusted at this point for selection.
When the number of candidate abstracts containing the candidate words is greater than a first threshold value, a plurality of candidate abstracts can be selected as reference abstracts from the candidate abstracts, and candidate texts corresponding to the candidate abstracts are used as reference texts.
The magnitude of the first threshold may be set as needed. For example, when the dialog text is constructed with at least two reference texts, the first threshold may be 1. Alternatively, the first threshold may be 2 when the dialog text is constructed with at least three reference texts.
In the embodiment of the disclosure, candidate abstracts and corresponding candidate texts are screened from the text data set based on candidate words, and the candidate abstracts and the corresponding candidate texts are used as reference abstracts and reference texts, so that the reference texts and the reference abstracts are obtained based on the relevance of text contents, and the similarity of the generated dialogue abstract training data and the real data is improved.
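A minimal sketch of steps S201 to S203 follows, under the assumption that the text data set is a list of (candidate text, candidate summary) pairs; the data structure and the threshold value are illustrative.

```python
def select_candidates(dataset: list[tuple[str, str]],
                      candidate_word: str,
                      first_threshold: int = 1) -> list[tuple[str, str]]:
    """Return the (candidate text, candidate summary) pairs whose summary
    contains the candidate word, provided enough of them exist."""
    candidates = [(text, summary) for text, summary in dataset
                  if candidate_word in summary]
    if len(candidates) <= first_threshold:
        # Too few texts on this topic: the candidate word should be
        # replaced or adjusted, as the text describes.
        return []
    return candidates
```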
It will be appreciated that, depending on the number of texts in the text data set and their content, the number of candidate summaries obtained when screening by candidate words is not fixed; it may be a few, dozens or hundreds.
In some embodiments of the present disclosure, the second threshold may be determined based on the number of reference text required to construct the dialog text. And then, determining a mode of selecting the reference abstract according to the size relation between the number of the candidate abstracts containing the candidate words and the second threshold value.
For example, when more than two reference texts are needed to construct dialog text, the second threshold may be 2. Alternatively, the second threshold may be 3 when more than three reference texts are required to construct dialog text.
Further, when the number of candidate digests containing candidate words is less than or equal to the second threshold, each candidate digest may be determined to be a reference digest. When the number of candidate digests containing candidate words is greater than the second threshold, any number of the candidate digests in each candidate digest may be determined to be a set of reference digests.
For example, if the second threshold is 2 and the number of candidate digests including the candidate words is 2, the two candidate digests may be used as reference digests, and the candidate texts corresponding to the two candidate digests may be used as reference texts.
Alternatively, if the second threshold is 2 and the number of candidate summaries containing the candidate words is 30, any number of candidate summaries may be randomly selected from them as a group of reference summaries, with the candidate texts corresponding to those candidate summaries as a group of reference texts.
For example, two candidate summaries may be randomly selected from candidate summaries including candidate words as a set of reference summaries, and candidate texts corresponding to the two candidate summaries are used as reference texts.
Alternatively, three candidate summaries may be randomly selected from the candidate summaries containing the candidate words as a group of reference summaries, and the candidate texts corresponding to the three candidate summaries may be used as reference texts.
It should be noted that the above examples are only illustrative, and should not be taken as limiting the determination of the reference abstract and the reference text in the embodiments of the disclosure.
In the embodiment of the disclosure, when the reference abstract is selected from the candidate abstracts, the mode of selecting the reference abstract is determined according to the number of the candidate abstracts containing the candidate words so as to simulate the dialogue data of different speaking persons, thereby ensuring the diversity of constructed dialogue abstract training data.
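The selection rule just described can be sketched as follows; drawing the group size at random is one illustrative reading of "any number of the candidate summaries".

```python
import random

def pick_reference_group(candidates: list[tuple[str, str]],
                         second_threshold: int = 2) -> list[tuple[str, str]]:
    """If few candidates exist, use them all as the reference group;
    otherwise randomly sample a group of them."""
    if len(candidates) <= second_threshold:
        return candidates
    group_size = random.randint(2, min(len(candidates), 4))  # illustrative cap
    return random.sample(candidates, group_size)
```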
It should be noted that, when a plurality of candidate summaries are randomly selected as the reference summaries and the corresponding candidate texts are used as the reference texts, the lengths of the selected reference texts may differ greatly.
For example, suppose two candidate texts are selected as reference text A and reference text B, where reference text A contains 100 sentences and reference text B contains 10 sentences. A dialogue text generated from reference text A and reference text B may then have low similarity to a real dialogue.
In some embodiments of the present disclosure, the reference abstract may be selected from the candidate abstracts according to a difference in the number of characters between the candidate texts corresponding to the candidate abstracts. Alternatively, the reference abstract may be selected from the candidate abstracts according to a difference between the number of sentences contained in the candidate text corresponding to the candidate abstracts.
It can be understood that each candidate abstract has a one-to-one correspondence with the candidate text. Therefore, according to the difference value of the number of characters between the candidate texts or the difference value between the number of sentences contained in the candidate texts, the reference text can be selected, and then the reference abstract is determined according to the reference text.
When the reference text is selected according to the difference in the number of characters between the candidate texts, the number of characters of each candidate text may be determined separately. For example, the number of characters of each candidate text is sequentially arranged from small to large as X1, X2, X3, X4, … … Xn. Where n is the number of candidate texts.
Then, the candidate text with the smallest difference in the number of characters may be selected as the reference text. For example, the candidate texts corresponding to X1 and X2 may be used as a set of reference texts, the candidate texts corresponding to X3 and X4 may be used as a set of reference texts, and so on.
Alternatively, a threshold value may be set for the difference in the number of characters between the respective candidate texts. When the difference in the number of characters between any two candidate texts is smaller than the threshold value, the two candidate texts may be used as a set of reference texts.
The characters in the candidate text may include one or more of Chinese characters, English words, symbols and numbers, which is not limited by the present disclosure.
Similarly, when the reference text is selected according to the difference between the sentence numbers contained in the candidate texts, the sentence numbers of the respective candidate texts can be determined respectively. For example, the number of sentences of the respective candidate texts is sequentially arranged from small to large as Y1, Y2, Y3, Y4, … … Yn. Where n is the number of candidate texts.
Then, the candidate text with the smallest sentence number difference value can be selected as the reference text. For example, the candidate texts corresponding to Y1 and Y2 may be used as a set of reference texts, the candidate texts corresponding to Y3 and Y4 may be used as a set of reference texts, and so on.
Alternatively, a threshold value may be set for the sentence number difference value of each candidate text. When the sentence number difference of any two candidate texts is smaller than the threshold value, the two candidate texts may be used as a set of reference texts.
The number of sentences contained in the candidate text can be determined according to different dividing modes. For example, the division may be by a particular type of punctuation mark. Alternatively, the partitioning may be by any type of punctuation.
In the embodiment of the disclosure, the reference text is selected from the candidate text based on the difference value of the number of characters or the difference value of the number of sentences among the candidate text, so that the balance of the number of dialogue sentences representing different speaking parties in the generated dialogue text is ensured, and the similarity of the generated dialogue abstract training data and the real data is further improved.
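A sketch of this length-balancing strategy: sort the candidate texts by size (X1 <= X2 <= ... <= Xn in the example) and pair neighbours, so that each group of reference texts has a similar spread. Character count is used here; sentence count works the same way.

```python
def pair_by_length(candidate_texts: list[str]) -> list[tuple[str, str]]:
    """Sort candidate texts by character count and pair adjacent ones, so
    each pair has a small length difference (an odd leftover is unused)."""
    ordered = sorted(candidate_texts, key=len)
    return [(ordered[i], ordered[i + 1])
            for i in range(0, len(ordered) - 1, 2)]
```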
Step S204, according to each reference abstract, the target sentences are screened from the sentences contained in each reference text.
It should be noted that the reference texts in some text data sets may be long and contain a great deal of background information, such as the papers in a paper abstract data set.
Thus, in order to reduce the length of the generated dialog text, the target sentence may be screened from the plurality of sentences contained in the reference text.
It is to be understood that a summary is a brief condensation of its reference text, so the reference summary contains the key information of the reference text. Since the reference summary should still be derivable from the target sentences screened out of the reference text, the target sentences should also contain the key information of the reference text.
In the embodiment of the disclosure, the target sentence can be screened from a plurality of sentences contained in the reference text according to the reference abstract.
For example, a sentence containing a plurality of keywords in the reference abstract may be selected from the reference text as the target sentence.
Alternatively, a sentence containing the keyword may be selected from the reference text according to a plurality of keywords in the reference abstract, and the sentence and a sentence contained in a context in which the sentence is located may be used as the target sentence.
It should be noted that the foregoing examples are merely illustrative, and are not meant to limit the screening target sentence in the embodiments of the present disclosure.
Step S205, a dialogue sentence set is generated based on the target sentences contained in each reference text.
Wherein the target sentences contained in each reference text can be combined to generate a dialogue sentence set.
Step S206, each sentence in the dialogue sentence set is arranged randomly to generate dialogue text.
Step S207, randomly arranging the plurality of reference summaries to generate a target summary corresponding to the dialogue text.
It should be noted that, the specific implementation process of steps S206 and S207 may refer to the detailed descriptions of other embodiments of the disclosure, and will not be described herein.
In the embodiment of the disclosure, according to each reference abstract, target sentences are screened from each reference text, and a dialogue sentence set is generated based on the target sentences of each reference text, so that the extraction of key information of the reference text is realized. Therefore, the dialogue text generated according to the dialogue sentence set is more concise, and dialogue abstract training data are further optimized.
Fig. 3 is a flow chart illustrating a method of generating dialog digest training data according to another embodiment of the present disclosure.
As shown in fig. 3, the method for generating the dialogue summary training data may include the following steps:
Step S301, a plurality of reference texts and a reference abstract corresponding to each reference text are obtained.
It should be noted that, the specific implementation process of step S301 may refer to the detailed descriptions of other embodiments of the present disclosure, which are not described herein.
In step S302, the reference abstract corresponding to the reference text is divided into a plurality of abstract sentences.
It should be noted that the reference summary is a brief, polished condensation of the reference text. A reference summary may contain a plurality of summary sentences, each of which may carry key information from the reference text.
In the embodiment of the disclosure, the reference abstract corresponding to each reference text can be divided into a plurality of abstract sentences, and the target sentences are screened from the plurality of sentences contained in the reference text according to each abstract sentence.
Where the reference summary is divided into a plurality of summary statements, any possible division may be employed.
For example, a reference abstract may be divided into multiple abstract statements by a particular type of punctuation. Alternatively, the reference abstract may be divided into multiple abstract statements according to any type of punctuation.
For example, only periods, question marks, mark marks representing long pauses may be used as the basis for dividing sentences. Alternatively, periods, question marks, sighs, commas, stop marks, and colon marks representing long pauses may be used as the basis for dividing sentences.
Step S303, determining the association degree between each sentence and each abstract sentence in the reference text.
The association degree between a sentence in the reference text and an abstract sentence characterizes how well the sentence covers the information in that abstract sentence.
When determining the degree of association between a sentence in the reference text and the abstract sentence, any possible implementation may be employed.
For example, it is possible to count the co-occurring words of the sentence in the reference text and the abstract sentence, and determine the association degree of the two based on the co-occurring words.
Step S304, determining the sentences with the association degree larger than the second threshold value as target sentences.
The second threshold may be any value set in advance, which is not limited in this disclosure.
For example, suppose the second threshold is 0.3. If the association degree between a sentence in the reference text and an abstract sentence is 0.5, which is greater than the second threshold, the sentence is determined as a target sentence. If instead the association degree is 0.1, which is smaller than the second threshold, the sentence is not a target sentence.
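One possible reading of steps S302 to S304 uses word overlap as the association degree, as the text suggests counting co-occurring words. The whitespace tokenizer and the 0.3 threshold are illustrative assumptions; Chinese text would need a proper tokenizer.

```python
def association_degree(sentence: str, summary_sentence: str) -> float:
    """Fraction of the abstract sentence's words that co-occur in the
    sentence, i.e. how well the sentence covers the abstract sentence."""
    sent_words = set(sentence.split())   # swap in a Chinese tokenizer as needed
    summ_words = set(summary_sentence.split())
    if not summ_words:
        return 0.0
    return len(sent_words & summ_words) / len(summ_words)

def screen_target_sentences(sentences: list[str],
                            summary_sentences: list[str],
                            threshold: float = 0.3) -> list[str]:
    """Keep sentences whose association with any abstract sentence exceeds
    the threshold (0.5 > 0.3 is kept, 0.1 < 0.3 is dropped)."""
    return [s for s in sentences
            if any(association_degree(s, a) > threshold
                   for a in summary_sentences)]
```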
Step S305, a dialogue sentence set is generated based on the target sentences contained in each reference text.
Step S306, each sentence in the dialogue sentence set is arranged randomly to generate dialogue text.
In step S307, the plurality of reference summaries are randomly arranged to generate a target summary corresponding to the dialog text.
It should be noted that, the specific implementation process of steps S305, S306, S307 may refer to the detailed description of other embodiments of the disclosure, and will not be described herein.
In the embodiment of the disclosure, the target sentence is determined by dividing the reference abstract into a plurality of abstract sentences and according to the association degree between the sentence in the reference text and each abstract sentence. Therefore, the key information of the reference abstract is guaranteed to be contained in the target sentence, and the accuracy of dialogue abstract training data is improved.
Fig. 4 is a flow chart of a method of generating dialog digest training data according to another embodiment of the disclosure.
As shown in fig. 4, the method for generating the dialogue summary training data may include the following steps:
step S401, determining a candidate word set according to each word contained in the abstract corresponding to each text in the text data set.
It is understood that the summary corresponding to each text may include a plurality of keywords. In order to obtain a plurality of reference texts whose contents are related, the keywords contained in the summaries can be combined into a set to be used as the candidate word set.
For example, a candidate word set may include keyword a, keyword B, keyword C, and so on.
The keywords in a summary may be determined in any feasible way. For example, keywords may be selected from the summary according to part of speech, or according to word sense.
For example, the nouns and verbs in the summary may be used as keywords. Alternatively, words that carry substantive meaning on their own may be used as keywords.
Step S402, each word in the candidate word set is sequentially determined to be a candidate word.
After the candidate word set is determined, each word in the set can be sequentially used as a candidate word, and a candidate abstract containing the candidate word and a corresponding candidate text are selected from the text data set.
For example, the candidate set includes keyword a, keyword B, and keyword C. Then firstly, selecting candidate abstracts and candidate texts according to the keywords A, then selecting candidate abstracts and candidate texts according to the keywords B, and finally selecting candidate abstracts and candidate texts according to the keywords C.
Step S403 determines the number of respective candidate summaries in the text data set containing candidate words.
In step S404, in response to the number of candidate summaries being greater than the first threshold, a reference summary is selected from the candidate summaries, where the candidate text corresponding to the candidate summary is the reference text.
The specific implementation process of steps S403 and S404 may refer to the detailed descriptions of other embodiments of the present disclosure, which are not described herein.
In the embodiment of the disclosure, a candidate word set is formed according to the keywords in the abstract, and then each word in the candidate word set is utilized to select the reference text, so that the number of the acquired reference texts is ensured, and the relevance of the reference text on the content is realized.
Step S405, determining a keyword set corresponding to each reference abstract.
For a specific implementation manner of the keyword set, reference may be made to a detailed description about the candidate word set in the embodiments of the present disclosure, which is not described herein.
It will be appreciated that each reference text and corresponding reference abstract is screened from the text dataset. Therefore, after the reference text and the corresponding reference abstract are determined, the candidate word set corresponding to the text and the abstract can be used as a keyword set.
In step S406, the number of keywords in the keyword set included in each sentence is determined.
The number of keywords from the keyword set contained in each sentence of the reference text may be determined in turn. This indicates how well each sentence covers the key information.
For example, statement A1 in the reference text contains 2 keywords, statement A2 in the reference text contains 0 keywords, which may indicate that statement A1 contains key information, and statement A2 may not contain key information.
Step S407, selecting target sentences from the sentences according to the number of keywords contained in each sentence.
The number of keywords contained in a sentence of the reference text serves as the screening condition for target sentences.
For example, a set threshold may be specified; when the number of keywords contained in a sentence of the reference text is greater than or equal to the set threshold, the sentence is taken as a target sentence. Conversely, when the number of keywords contained in the sentence is smaller than the set threshold, the sentence is not taken as a target sentence.
The threshold value can be set according to actual needs. For example, the set threshold may be 1. Alternatively, the set threshold may be 2, which is not limited by the present disclosure.
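Steps S405 to S407 can be sketched as follows; the default set threshold of 1 is one of the illustrative values the text mentions.

```python
def screen_by_keyword_count(sentences: list[str],
                            keyword_set: set[str],
                            set_threshold: int = 1) -> list[str]:
    """Keep a sentence as a target sentence when it contains at least the
    set threshold of keywords from the reference summary's keyword set
    (e.g. A1 with 2 keywords is kept, A2 with 0 keywords is dropped)."""
    kept = []
    for sentence in sentences:
        hits = sum(1 for kw in keyword_set if kw in sentence)
        if hits >= set_threshold:
            kept.append(sentence)
    return kept
```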
Step S408, a dialogue sentence set is generated based on the target sentences contained in each reference text.
Step S409, each sentence in the dialogue sentence set is arranged randomly to generate dialogue text.
In step S410, the plurality of reference summaries are randomly arranged to generate a target summary corresponding to the dialog text.
It should be noted that, the specific implementation process of steps S408, S409, and S410 may refer to the detailed description of other embodiments of the disclosure, and will not be described herein.
In the embodiment of the disclosure, the keywords in the reference abstract are formed into the keyword set, so that the target sentences are screened according to the number of the keywords contained in the sentences of the reference text, the keyword information of the reference abstract contained in the target sentences is ensured, and the accuracy of dialogue abstract training data is improved.
According to the embodiment of the disclosure, the disclosure further provides a device for generating the dialogue abstract training data.
Fig. 5 is a schematic structural diagram of a dialog digest training data generation device according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus 500 for generating dialogue digest training data may include: the acquisition module 510, the first generation module 520, the second generation module 530, and the third generation module 540.
The obtaining module 510 is configured to obtain a plurality of reference texts and a reference abstract corresponding to each reference text;
A first generation module 520, configured to generate a dialogue sentence set based on sentences contained in each reference text;
a second generation module 530, configured to randomly arrange each sentence in the dialogue sentence set to generate dialogue text;
the third generating module 540 is configured to randomly arrange the plurality of reference summaries to generate a target summary corresponding to the dialog text.
It should be noted that the explanation of the embodiment of the method for generating the dialogue summary training data is also applicable to the device for generating the dialogue summary training data of this embodiment, and the implementation principle is similar, and will not be repeated here.
In the embodiments of the present disclosure, a plurality of reference texts and the reference summary corresponding to each reference text are first acquired; a dialogue sentence set is then generated based on the sentences contained in each reference text; the sentences in the dialogue sentence set are randomly arranged to generate a dialogue text; and finally, the plurality of reference summaries are randomly arranged to generate a target summary corresponding to the dialogue text. In this way, dialogue texts and dialogue summaries are constructed from texts in non-dialogue form and their corresponding summaries, so that a large amount of dialogue summary training data can be obtained, which provides support for training a dialogue summarization model and saves the time and effort of manual annotation.
In one possible implementation of the embodiment of the present disclosure, the obtaining module 510 may include:
a first determining unit configured to determine a candidate word;
a second determining unit for determining the number of candidate abstracts containing candidate words in the text data set;
and the third determining unit is used for responding to the fact that the number of the candidate summaries is larger than a first threshold value, and selecting a reference summary from the candidate summaries, wherein the candidate texts corresponding to the candidate summaries are the reference texts.
In one possible implementation manner of the embodiment of the present disclosure, the third determining unit is configured to:
determining each candidate summary as a reference summary in response to the number of candidate summaries being less than or equal to a second threshold;
or,
determining any number of the candidate summaries as a group of reference summaries in response to the number of candidate summaries being greater than the second threshold.
In one possible implementation manner of the embodiment of the present disclosure, the third determining unit is configured to:
selecting a reference summary from the candidate summaries according to the difference in character counts between the candidate texts corresponding to the candidate summaries; or,
selecting a reference summary from the candidate summaries according to the difference in the number of sentences contained in the candidate texts corresponding to the candidate summaries.
In one possible implementation manner of the embodiment of the present disclosure, the first determining unit is configured to:
determining a candidate word set according to each word contained in the abstract corresponding to each text in the text data set;
and sequentially determining each word in the candidate word set as a candidate word.
In one possible implementation of the embodiments of the present disclosure, the first generating module 520 may include:
a screening unit, configured to screen, according to each reference abstract, a target sentence from a plurality of sentences contained in each reference text;
and the generating unit is used for generating a dialogue statement set based on the target statement contained in each reference text.
In one possible implementation manner of the embodiment of the disclosure, the screening unit is configured to:
dividing a reference abstract corresponding to the reference text into a plurality of abstract sentences;
determining the association degree between each sentence and each abstract sentence in the reference text;
and determining the statement with the association degree larger than the second threshold value as a target statement.
In one possible implementation manner of the embodiment of the disclosure, the screening unit is configured to:
determining a keyword set corresponding to each reference abstract;
determining the number of keywords in the keyword set contained in each sentence;
And screening target sentences from the multiple sentences according to the number of the keywords contained in each sentence.
It should be noted that the explanation of the embodiment of the method for generating the dialogue summary training data is also applicable to the device for generating the dialogue summary training data of this embodiment, and the implementation principle is similar, and will not be repeated here.
In the embodiment of the disclosure, candidate abstracts and corresponding candidate texts are screened from the text data set based on candidate words, and the candidate abstracts and the corresponding candidate texts are used as reference abstracts and reference texts, so that the reference texts and the reference abstracts are obtained based on the relevance of text contents, and the similarity of the generated dialogue abstract training data and the real data is improved. In addition, according to each reference abstract, target sentences are screened from each reference text, and a dialogue sentence set is generated based on the target sentences of each reference text, so that the extraction of key information of the reference text is realized. Therefore, the dialogue text generated according to the dialogue sentence set is more concise, and dialogue abstract training data are further optimized.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, the ROM 602 and the RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, for example, a method of generating dialogue digest training data. For example, in some embodiments, the method of generating the dialog digest training data may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the above-described dialog digest training data generation method may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the method of generating the dialog digest training data in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host; it is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
In the embodiments of the present disclosure, a plurality of reference texts and a reference abstract corresponding to each reference text are first obtained; a dialogue sentence set is then generated based on the sentences contained in each reference text; the sentences in the dialogue sentence set are randomly arranged to generate a dialogue text; and finally, the plurality of reference abstracts are randomly arranged to generate a target abstract corresponding to the dialogue text. In this way, non-dialogue texts and their corresponding abstracts are used to construct dialogue texts and dialogue abstracts, so that a large amount of dialogue abstract training data can be obtained, providing support for training a dialogue abstract algorithm model and saving the time and effort of manually labeling data.
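By way of illustration, the overall construction can be sketched in a few lines of Python. This is a minimal sketch of the scheme described above, not the patented implementation; the function and argument names are invented for the example, and the reference texts are assumed to have already been split into sentences.

```python
import random

def build_dialogue_sample(reference_texts, reference_abstracts):
    """Construct one (dialogue text, target abstract) training pair from
    non-dialogue reference texts and their corresponding abstracts."""
    # Generate the dialogue sentence set from the sentences of every reference text.
    dialogue_sentence_set = [s for text in reference_texts for s in text]

    # Randomly arrange the sentences in the set to obtain the dialogue text.
    random.shuffle(dialogue_sentence_set)
    dialogue_text = "\n".join(dialogue_sentence_set)

    # Randomly arrange the reference abstracts to obtain the target abstract.
    abstracts = list(reference_abstracts)
    random.shuffle(abstracts)
    target_abstract = " ".join(abstracts)

    return dialogue_text, target_abstract
```

Applied repeatedly to different groups of reference texts, such a routine yields arbitrarily many (dialogue text, target abstract) pairs without manual labeling.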
It should be appreciated that steps may be reordered, added, or deleted within the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A method for generating dialogue abstract training data, comprising the following steps:
acquiring a plurality of reference texts and a reference abstract corresponding to each reference text, wherein the reference texts are any type of texts with corresponding abstracts;
generating a dialogue sentence set based on sentences contained in each reference text;
randomly arranging all sentences in the dialogue sentence set to generate a dialogue text;
randomly arranging a plurality of reference abstracts to generate a target abstract corresponding to the dialogue text;
wherein the generating a dialogue sentence set based on sentences contained in each of the reference texts includes:
screening target sentences from a plurality of sentences contained in each reference text according to each reference abstract;
generating the dialogue sentence set based on the target sentences contained in each reference text;
the obtaining a plurality of reference texts and reference abstracts corresponding to each reference text includes:
determining a candidate word set according to each word contained in the abstract corresponding to each text in the text data set;
sequentially determining each word in the candidate word set as a candidate word;
determining the number of candidate abstracts in the text data set that contain the candidate word, wherein the candidate texts corresponding to the candidate abstracts are the reference texts;
and in response to the number of candidate abstracts being greater than a first threshold, selecting the reference abstracts from the candidate abstracts according to the difference in the number of characters between the candidate texts corresponding to the candidate abstracts, or selecting the reference abstracts from the candidate abstracts according to the difference in the number of sentences contained in the candidate texts corresponding to the candidate abstracts.
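By way of illustration only (not part of the claim), the acquisition step recited above might be sketched in Python as follows. The grouping strategy, the use of the first candidate text as the anchor for the character-count comparison, the threshold parameters, and all names are assumptions made for this sketch; claim 1 equally allows selection by sentence-count differences.

```python
def acquire_reference_abstracts(dataset, first_threshold, max_char_diff):
    """dataset is a list of (text, abstract) pairs drawn from the text data set."""
    # Determine the candidate word set from every word of every abstract.
    candidate_words = {w for _, abstract in dataset for w in abstract.split()}

    selections = {}
    for word in candidate_words:  # take each word in turn as the candidate word
        # Candidate abstracts are those containing the candidate word; their
        # corresponding candidate texts are potential reference texts.
        candidates = [(text, abstract) for text, abstract in dataset
                      if word in abstract.split()]
        if len(candidates) > first_threshold:
            # Keep candidates whose texts are close in character count to an
            # anchor text (here, the first candidate's text).
            anchor_len = len(candidates[0][0])
            candidates = [(t, a) for t, a in candidates
                          if abs(len(t) - anchor_len) <= max_char_diff]
        selections[word] = candidates
    return selections
```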
2. The method of claim 1, wherein said selecting the reference abstract from the respective candidate abstracts comprises:
determining each candidate abstract as the reference abstract in response to the number of candidate abstracts being less than or equal to a second threshold;
or,
in response to the number of candidate abstracts being greater than the second threshold, determining any number of the candidate abstracts as a set of reference abstracts.
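A compact sketch of the selection rule of claim 2 might look as follows; `k` stands in for the "any number" of candidates that the claim leaves open, and both names are illustrative.

```python
import random

def choose_reference_abstracts(candidate_abstracts, second_threshold, k):
    # Few candidates: every candidate abstract becomes a reference abstract.
    if len(candidate_abstracts) <= second_threshold:
        return list(candidate_abstracts)
    # Many candidates: any k of them form one set of reference abstracts.
    return random.sample(candidate_abstracts, min(k, len(candidate_abstracts)))
```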
3. The method of claim 1, wherein said screening target sentences from the plurality of sentences contained in each of said reference texts according to each of said reference abstracts comprises:
dividing the reference abstract corresponding to the reference text into a plurality of abstract sentences;
determining the degree of association between each sentence in the reference text and each abstract sentence;
and determining a sentence whose degree of association is greater than the second threshold as a target sentence.
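The claim does not fix how the degree of association is computed; the sketch below uses Jaccard word overlap purely as an illustrative stand-in, and the function names are invented for the example.

```python
def jaccard(sentence_a, sentence_b):
    """Illustrative association measure: word overlap between two sentences."""
    wa, wb = set(sentence_a.split()), set(sentence_b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def screen_target_sentences(text_sentences, abstract_sentences, threshold):
    """Keep sentences whose association with some abstract sentence exceeds the threshold."""
    return [s for s in text_sentences
            if max((jaccard(s, a) for a in abstract_sentences), default=0.0) > threshold]
```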
4. The method of claim 1, wherein said screening target sentences from the plurality of sentences contained in each of said reference texts according to each of said reference abstracts comprises:
determining a keyword set corresponding to each reference abstract;
determining the number of keywords from the keyword set contained in each sentence;
and screening target sentences from the plurality of sentences according to the number of keywords contained in each sentence.
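A keyword-count variant of the same screening, corresponding to claim 4, could be sketched as follows. The top-k selection rule is an assumption; the claim only requires that screening be performed "according to the number of keywords".

```python
def screen_by_keyword_count(text_sentences, keywords, top_k):
    # Score each sentence by how many abstract keywords appear in it.
    scored = [(sum(kw in sentence for kw in keywords), sentence)
              for sentence in text_sentences]
    # Keep the top_k highest-scoring sentences that contain at least one keyword.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for score, sentence in scored[:top_k] if score > 0]
```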
5. A dialogue abstract training data generation apparatus, comprising:
an acquisition module, configured to acquire a plurality of reference texts and a reference abstract corresponding to each reference text, wherein the reference texts are any type of texts with corresponding abstracts;
a first generation module, configured to generate a dialogue sentence set based on the sentences contained in each reference text;
a second generation module, configured to randomly arrange the sentences in the dialogue sentence set to generate a dialogue text;
a third generation module, configured to randomly arrange a plurality of the reference abstracts to generate a target abstract corresponding to the dialogue text;
a screening unit, configured to screen, according to each of the reference abstracts, target sentences from the plurality of sentences contained in each of the reference texts;
a generating unit, configured to generate the dialogue sentence set based on the target sentences contained in each of the reference texts;
wherein, the acquisition module includes:
a first determining unit, configured to determine a candidate word set according to the words contained in the abstract corresponding to each text in the text data set, and to sequentially determine each word in the candidate word set as a candidate word;
a second determining unit, configured to determine the number of candidate abstracts in the text data set that contain the candidate word, wherein the candidate texts corresponding to the candidate abstracts are the reference texts;
a third determining unit, configured to select, in response to the number of candidate abstracts being greater than a first threshold, the reference abstracts from the candidate abstracts according to the difference in the number of characters between the candidate texts corresponding to the candidate abstracts; or to select, in response to the number of candidate abstracts being greater than the first threshold, the reference abstracts from the candidate abstracts according to the difference in the number of sentences contained in the candidate texts corresponding to the candidate abstracts.
6. The apparatus of claim 5, wherein the third determining unit is further configured to:
determine each candidate abstract as the reference abstract in response to the number of candidate abstracts being less than or equal to a second threshold;
or,
determine, in response to the number of candidate abstracts being greater than the second threshold, any number of the candidate abstracts as a set of reference abstracts.
7. The apparatus of claim 5, wherein the screening unit is configured to:
divide the reference abstract corresponding to the reference text into a plurality of abstract sentences;
determine the degree of association between each sentence in the reference text and each abstract sentence;
and determine a sentence whose degree of association is greater than the second threshold as a target sentence.
8. The apparatus of claim 5, wherein the screening unit is configured to:
determine a keyword set corresponding to each reference abstract;
determine the number of keywords from the keyword set contained in each sentence;
and screen target sentences from the plurality of sentences according to the number of keywords contained in each sentence.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-4.
CN202111680397.XA 2021-12-30 2021-12-30 Method, device, equipment and storage medium for generating dialogue abstract training data Active CN114490976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111680397.XA CN114490976B (en) 2021-12-30 2021-12-30 Method, device, equipment and storage medium for generating dialogue abstract training data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111680397.XA CN114490976B (en) 2021-12-30 2021-12-30 Method, device, equipment and storage medium for generating dialogue abstract training data

Publications (2)

Publication Number Publication Date
CN114490976A CN114490976A (en) 2022-05-13
CN114490976B true CN114490976B (en) 2023-04-25

Family

ID=81509975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111680397.XA Active CN114490976B (en) 2021-12-30 2021-12-30 Method, device, equipment and storage medium for generating dialogue abstract training data

Country Status (1)

Country Link
CN (1) CN114490976B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765345A (en) * 2021-01-22 2021-05-07 重庆邮电大学 Text abstract automatic generation method and system fusing pre-training model
CN113656575A (en) * 2021-07-13 2021-11-16 北京搜狗科技发展有限公司 Training data generation method and device, electronic equipment and readable medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368066B (en) * 2018-12-06 2024-02-09 北京京东尚科信息技术有限公司 Method, apparatus and computer readable storage medium for obtaining dialogue abstract
US20200272693A1 (en) * 2019-02-21 2020-08-27 Microsoft Technology Licensing, Llc Topic based summarizer for meetings and presentations using hierarchical agglomerative clustering
CN111651996B (en) * 2019-03-04 2023-12-08 北京嘀嘀无限科技发展有限公司 Digest generation method, digest generation device, electronic equipment and storage medium
CN109947924B (en) * 2019-03-21 2021-08-31 百度在线网络技术(北京)有限公司 Dialogue system training data construction method and device, electronic equipment and storage medium
US11676735B2 (en) * 2019-09-13 2023-06-13 International Business Machines Corporation Generation of medical records based on doctor-patient dialogue
CN112860881A (en) * 2019-11-27 2021-05-28 北大方正集团有限公司 Abstract generation method and device, electronic equipment and storage medium
CN112329907A (en) * 2020-12-24 2021-02-05 北京百度网讯科技有限公司 Dialogue processing method and device, electronic equipment and storage medium
CN113158665B (en) * 2021-04-02 2022-12-09 西安交通大学 Method for improving dialog text generation based on text abstract generation and bidirectional corpus generation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765345A (en) * 2021-01-22 2021-05-07 重庆邮电大学 Text abstract automatic generation method and system fusing pre-training model
CN113656575A (en) * 2021-07-13 2021-11-16 北京搜狗科技发展有限公司 Training data generation method and device, electronic equipment and readable medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lv Rui et al., "TSPT: A three-stage composite text summarization model based on pre-training," Application Research of Computers, 2019, full text. *

Also Published As

Publication number Publication date
CN114490976A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
JP7108675B2 (en) Semantic matching method, device, electronic device, storage medium and computer program
JP5379138B2 (en) Creating an area dictionary
CN106774975B (en) Input method and device
US20220092252A1 (en) Method for generating summary, electronic device and storage medium thereof
TWI509432B (en) Electronic device and language analysis method thereof
CN110427627A (en) Task processing method and device based on semantic expressiveness model
CN112507735A (en) Training method and device of machine translation model and electronic equipment
CN111488740A (en) Causal relationship judging method and device, electronic equipment and storage medium
CN112580339B (en) Model training method and device, electronic equipment and storage medium
CN111522944A (en) Method, apparatus, device and storage medium for outputting information
CN108664142A (en) Input method with self-learning function between document
CN111858905A (en) Model training method, information identification method, device, electronic equipment and storage medium
CN112560461A (en) News clue generation method and device, electronic equipment and storage medium
CN112380847A (en) Interest point processing method and device, electronic equipment and storage medium
CN115062718A (en) Language model training method and device, electronic equipment and storage medium
CN108664141A (en) Input method with document context self-learning function
CN114490976B (en) Method, device, equipment and storage medium for generating dialogue abstract training data
US20220335070A1 (en) Method and apparatus for querying writing material, and storage medium
CN114417862A (en) Text matching method, and training method and device of text matching model
CN115729360A (en) Input method word stock updating method, device, equipment and server
CN112948584A (en) Short text classification method, device, equipment and storage medium
CN112541346A (en) Abstract generation method and device, electronic equipment and readable storage medium
CN114417808B (en) Article generation method and device, electronic equipment and storage medium
US11900918B2 (en) Method for training a linguistic model and electronic device
CN114048726B (en) Computer graphic interface interaction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant