CN114490976A - Method, device and equipment for generating dialogue abstract training data and storage medium - Google Patents

Method, device and equipment for generating dialogue abstract training data and storage medium

Info

Publication number
CN114490976A
CN114490976A
Authority
CN
China
Prior art keywords: candidate, text, abstract, sentences, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111680397.XA
Other languages
Chinese (zh)
Other versions
CN114490976B (en)
Inventor
陈默也
李伟
刘家辰
肖欣延
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111680397.XA
Publication of CN114490976A
Application granted
Publication of CN114490976B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a method and an apparatus for generating dialogue summary training data, an electronic device, and a storage medium, relating to the field of computer technology, and in particular to artificial intelligence technologies such as deep learning and natural language processing. The specific implementation scheme is as follows: acquiring a plurality of reference texts and a reference abstract corresponding to each reference text; generating a dialogue sentence set based on the sentences contained in each reference text; randomly arranging the sentences in the dialogue sentence set to generate a dialogue text; and randomly arranging the plurality of reference abstracts to generate a target abstract corresponding to the dialogue text. By constructing dialogue texts and dialogue abstracts from texts in non-dialogue form and their corresponding abstracts, the method makes it possible to obtain a large amount of dialogue abstract training data and provides support for training a dialogue summarization model.

Description

Method, device and equipment for generating dialogue abstract training data and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating dialogue summary training data.
Background
Currently, dialogue data in various forms, such as conference records, chats, customer service sessions, and doctor-patient conversations, is growing rapidly. A dialogue summary extracts key information from complex dialogue data and thus makes the dialogue data easier to understand. Building a dialogue summarization model is one way to obtain dialogue summaries from dialogue data. However, manually annotating dialogue summaries is time-consuming and labor-intensive, so it is of real significance to study how to generate a large amount of dialogue summary training data.
Disclosure of Invention
The present disclosure provides a method, an apparatus, a device, and a storage medium for generating dialogue summary training data.
According to a first aspect of the present disclosure, there is provided a method for generating dialogue summary training data, including:
acquiring a plurality of reference texts and a reference abstract corresponding to each reference text;
generating a dialogue sentence set based on sentences contained in each reference text;
randomly arranging the sentences in the dialogue sentence set to generate a dialogue text;
and randomly arranging a plurality of reference abstracts to generate a target abstract corresponding to the dialog text.
According to a second aspect of the present disclosure, there is provided a generation apparatus of dialog summary training data, including:
the acquisition module is used for acquiring a plurality of reference texts and a reference abstract corresponding to each reference text;
a first generation module, configured to generate a dialog sentence set based on sentences included in each of the reference texts;
the second generation module is used for randomly arranging all sentences in the conversation sentence set to generate a conversation text;
and the third generation module is used for randomly arranging the reference abstracts to generate a target abstract corresponding to the dialog text.
An embodiment of a third aspect of the present disclosure provides a computer device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the method set forth in the embodiment of the first aspect of the present disclosure is performed.
A fourth aspect of the present disclosure is directed to a non-transitory computer-readable storage medium storing a computer program, which when executed by a processor implements the method as set forth in the first aspect of the present disclosure.
A fifth aspect of the present disclosure provides a computer program product; when the instructions in the computer program product are executed by a processor, the method provided in the first aspect of the present disclosure is performed.
The method, the device, the equipment and the storage medium for generating the dialogue summary training data have the following beneficial effects:
The method first acquires a plurality of reference texts and a reference abstract corresponding to each reference text; then generates a dialogue sentence set based on the sentences contained in each reference text; then randomly arranges the sentences in the dialogue sentence set to generate a dialogue text; and finally randomly arranges the plurality of reference abstracts to generate a target abstract corresponding to the dialogue text. In this way, dialogue texts and dialogue abstracts are constructed from texts in non-dialogue form and their corresponding abstracts, so that a large amount of dialogue abstract training data can be obtained and support is provided for training a dialogue summarization model.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart diagram illustrating a method for generating dialogue summary training data according to an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating a method for generating dialogue summary training data according to another embodiment of the disclosure;
FIG. 3 is a flowchart illustrating a method for generating dialogue summary training data according to another embodiment of the disclosure;
FIG. 4 is a flowchart illustrating a method for generating dialogue summary training data according to another embodiment of the disclosure;
fig. 5 is a schematic structural diagram of an apparatus for generating dialogue summary training data according to an embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device for implementing a method of generating dialog summary training data according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to facilitate understanding of the present disclosure, the following description is first briefly made to the technical field to which the present disclosure relates.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it covers technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning, deep learning, big data processing, knowledge graph technologies, and the like.
Deep learning learns the intrinsic patterns and representation levels of sample data, and the information obtained during learning is of great help in interpreting data such as text, images, and sound. Its ultimate goal is to give machines the ability to analyze and learn like humans and to recognize data such as text, images, and sound. Deep learning is a complex machine learning approach whose results in speech and image recognition far exceed those of earlier related techniques.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Summarization is a classic task in natural language processing; its core purpose is to select key information from the input and condense it into a short summary. For example, given a news article, summarization can produce a brief summary that saves the reader time and helps the reader quickly grasp the important content of the original text.
The following describes in detail a method, an apparatus, a computer device, and a storage medium for generating dialog summary training data provided by the present disclosure with reference to the accompanying drawings.
The method for generating dialogue summary training data provided by the present disclosure may be executed by the apparatus for generating dialogue summary training data provided by the present disclosure, or by the electronic device provided by the present disclosure. The electronic device may include, but is not limited to, a terminal device such as a mobile phone, a desktop computer, or a tablet computer, or may be a server. Without limiting the present disclosure, the following description takes the case where the method is executed by the apparatus for generating dialogue summary training data, hereinafter simply referred to as the "device".
Fig. 1 is a flowchart illustrating a method for generating dialogue summary training data according to an embodiment of the disclosure.
As shown in fig. 1, the method for generating the dialogue summary training data may include the following steps:
step S101, a plurality of reference texts and a reference abstract corresponding to each reference text are obtained.
It should be noted that, according to the type of source data, summaries can be classified into news summaries, paper summaries, dialogue summaries, and so on.
A dialogue summary is one such type of summary and is derived from dialogue data. Dialogue data may take different forms, such as meetings, chats, e-mails, and debates.
It is understood that the dialog summary contains key information in the dialog data. The dialogue core content can be quickly understood through the dialogue abstract.
For example, the conversation between two speakers is:
A: Let's discuss dialogue summarization. B: OK, it is a very hot topic. A: Yes, but it is still a difficult problem.
The dialogue summary of this conversation may be: dialogue summarization is popular but remains a difficult problem.
In the embodiment of the present disclosure, in order to generate dialogue summary training data, the training data may be constructed using reference texts from other fields and the reference abstracts corresponding to those reference texts.
The reference text may be any type of text that has a corresponding abstract. For example, the reference text may be a news article, and the news summary is the corresponding reference abstract. Or the reference text may be a microblog post, and the microblog title is the corresponding reference abstract.
It should be noted that, compared with other types of data, dialogue data has the characteristic that multiple speakers speak alternately. Therefore, in the embodiment of the present disclosure, when constructing the dialogue summary training data, two or more reference texts and the reference abstract corresponding to each reference text may be obtained.
Wherein, the number of people participating in the conversation is uncertain in the real conversation process. Therefore, in the embodiment of the present disclosure, the number of the selected reference texts is not limited.
For example, two reference texts and two corresponding reference summaries may be used to construct a dialog summary training data. Alternatively, three reference texts and corresponding three reference summaries may be used to construct a dialog summary training data.
It should be noted that the above examples are only illustrative and should not be taken as limiting the reference texts in the embodiments of the present disclosure.
Step S102, a dialogue sentence set is generated based on the sentences included in each reference text.
It should be noted that each reference text contains a plurality of sentences arranged in a certain order. To construct dialog data in the form of a dialog from the reference text, each reference text may be divided into a plurality of sentences to generate a dialog sentence set.
The specific division mode of each reference text can be realized in any possible mode.
For example, the reference text may be divided into a plurality of sentences according to a certain type of punctuation marks. Alternatively, the reference text may be divided into a plurality of sentences according to any type of punctuation marks.
For example, periods, question marks, exclamation marks representing long pauses may be the only basis for dividing sentences. Alternatively, a period, question mark, exclamation mark representing a long pause and a comma, pause mark, semicolon, and colon mark representing a short pause may be the basis for dividing a sentence. Alternatively, all punctuations can be used as a basis for dividing the sentence, which is not limited in this disclosure.
It should be noted that after the reference text is divided into sentences, all the sentences of each reference text may form a dialog sentence set. Or, partial sentences of each reference text can be selected to form a dialogue sentence set.
For example, sentences with words exceeding a set threshold may be selected to form a set of conversational sentences. Alternatively, sentences containing key information may be selected to form a set of conversational sentences, which is not limited by this disclosure.
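As a non-limiting illustration of step S102, the following Python sketch splits each reference text on long-pause punctuation and keeps sentences above a minimum length; the function names, punctuation set, and length threshold are illustrative assumptions rather than requirements of the present disclosure.

```python
import re

# Long-pause punctuation used as the sentence boundary (an illustrative choice).
LONG_PAUSE = r"[。！？.!?]"

def split_into_sentences(text):
    """Split a reference text into sentences on long-pause punctuation."""
    return [p.strip() for p in re.split(LONG_PAUSE, text) if p.strip()]

def build_dialogue_sentence_set(reference_texts, min_chars=5):
    """Collect sentences from every reference text, dropping very short ones."""
    sentences = []
    for text in reference_texts:
        sentences.extend(s for s in split_into_sentences(text) if len(s) >= min_chars)
    return sentences
```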
Step S103, randomly arranging each sentence in the conversation sentence set to generate a conversation text.
In order to reflect the characteristic that multiple speakers speak alternately in dialogue data, the sentences of the reference texts in the dialogue sentence set may be shuffled and randomly arranged to form a dialogue text.
For example, the dialogue sentence set includes sentences A1, A2, A3, A4 of reference text A and sentences B1, B2, B3, B4 of reference text B. After the sentences in the dialogue sentence set are randomly arranged, the generated dialogue text may be, for example, A1B1A2B2A3B3A4B4 or A4A1B2B3A3B1A2B4.
Alternatively, the dialogue sentence set includes sentences A1, A2, A3 of reference text A, sentences B1, B2, B3 of reference text B, and sentences C1, C2, C3 of reference text C. After the sentences in the dialogue sentence set are randomly arranged, the generated dialogue text may be, for example, A1B1C1A2B2C2A3B3C3 or C1B3A1C2B1A3B2C3A2.
It should be noted that the above examples are only illustrative and should not be taken as limiting the dialog text in the embodiments of the present disclosure.
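A minimal sketch of step S103, assuming each sentence is tagged with the index of its source reference text so that the simulated speaker of each sentence is retained; the tagging is an illustrative choice, since the disclosure only requires a random arrangement of the sentences.

```python
import random

def build_dialogue_text(sentence_sets, seed=None):
    """Randomly arrange all sentences from the dialogue sentence set.

    Each sentence is tagged with the index of its source reference text so
    that it can stand for a distinct speaker in the generated dialogue.
    """
    rng = random.Random(seed)
    pool = [(speaker, sent)
            for speaker, sentences in enumerate(sentence_sets)
            for sent in sentences]
    rng.shuffle(pool)
    return pool

# Example: sentences A1..A4 of reference text A and B1..B4 of reference text B.
dialogue = build_dialogue_text([["A1", "A2", "A3", "A4"],
                                ["B1", "B2", "B3", "B4"]], seed=0)
```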
And step S104, randomly arranging the plurality of reference abstracts to generate a target abstract corresponding to the dialog text.
It is understood that the dialog summary training data includes two parts, namely a dialog part and a summary part. The dialogue part can be used as input data, and the summary part can be used as target data.
In the embodiment of the disclosure, after the dialog text is generated by using a plurality of reference texts, the reference abstracts corresponding to the plurality of reference texts may be randomly combined to generate the target abstract corresponding to the dialog text.
It should be noted that the reference abstract corresponding to each reference text may include a plurality of abstract sentences. When a plurality of reference digests are randomly arranged, each reference digest is arranged as a whole, and the order of digest sentences contained in each reference digest is not changed.
For example, a dialog text is generated by using the reference text a and the reference text B, where the reference abstract corresponding to the reference text a is A0, and the reference abstract corresponding to the reference text B is B0, and then the target abstract corresponding to the dialog text may be A0B0 or B0 A0.
Or, the dialog text is generated by using the reference text a, the reference text B and the reference text C, where the reference abstract corresponding to the reference text a is A0, the reference abstract corresponding to the reference text B is B0, and the reference abstract corresponding to the reference text C is C0, and then the target abstract corresponding to the dialog text may be C0A0B0, or B0C0 A0.
It should be noted that the above examples are only illustrative and should not be taken as limiting the target abstract in the embodiments of the present disclosure.
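A minimal sketch of step S104; each reference abstract is treated as an indivisible unit, so only the order of whole abstracts is randomized and the abstract sentences inside each abstract keep their order. The function name and the concatenation into a single string are illustrative assumptions.

```python
import random

def build_target_summary(reference_summaries, seed=None):
    """Randomly order the reference abstracts; each abstract is kept intact,
    so the order of the abstract sentences inside it does not change."""
    rng = random.Random(seed)
    ordered = list(reference_summaries)
    rng.shuffle(ordered)
    return "".join(ordered)

# Example: abstracts A0 and B0 yield either "A0B0" or "B0A0".
target_summary = build_target_summary(["A0", "B0"], seed=0)
```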
In the embodiment of the present disclosure, a plurality of reference texts and a reference abstract corresponding to each reference text are obtained first; a dialogue sentence set is then generated based on the sentences contained in each reference text; the sentences in the dialogue sentence set are randomly arranged to generate a dialogue text; and finally the plurality of reference abstracts are randomly arranged to generate a target abstract corresponding to the dialogue text. In this way, dialogue texts and dialogue abstracts are constructed from texts in non-dialogue form and their corresponding abstracts, so that a large amount of dialogue abstract training data can be obtained, support is provided for training a dialogue summarization model, and the time and effort of manual annotation are saved.
Fig. 2 is a flowchart illustrating a method for generating dialogue summary training data according to another embodiment of the disclosure.
As shown in fig. 2, the method for generating the dialogue summary training data may include the following steps:
step S201, determining candidate words.
It should be noted that a conversation among multiple speakers usually revolves around a certain topic, such as sports, games, or food.
Therefore, in order to make the constructed dialogue summary training data closer to the real dialogue data, the relevant reference texts can be screened according to the dialogue topics.
Specifically, keywords may be determined according to the topic of the conversation, and the keywords may be used as candidate words. The candidate words may include one or more, which is not limited in this disclosure.
For example, when constructing dialogue data on a sports topic, the candidate word may be determined to be "sports", or it may be more specific, such as "basketball" or "football".
It should be noted that the above description is only an example, and should not be taken as a limitation on candidate words in the embodiments of the present disclosure.
Step S202, determining the number of each candidate abstract containing candidate words in the text data set.
It should be noted that the reference text may be obtained from a specific text data set. Such as a news digest dataset, a paper digest dataset, etc.
Since a plurality of reference texts are required for constructing the dialog text, when the reference texts are screened according to the candidate words, the number of each candidate abstract containing the candidate words in the text data set may be determined first to further determine whether the plurality of reference texts can be acquired.
Step S203, responding to the number of the candidate abstracts larger than the first threshold value, selecting a reference abstract from each candidate abstract, wherein the candidate text corresponding to the candidate abstract is the reference text.
It should be noted that, when the number of candidate abstracts containing the candidate word is 0 or 1, it indicates that the text data set contains no text, or only one text, on the related topic. In this case, the candidate word can be replaced or adjusted and the selection performed again.
When the number of candidate abstracts containing the candidate words is larger than a first threshold value, a plurality of candidate abstracts can be selected from the candidate abstracts to be used as reference abstracts, and candidate texts corresponding to the candidate abstracts are used as reference texts.
The size of the first threshold value can be set as required. For example, when the dialog text is constructed with at least two reference texts, the first threshold may be 1. Alternatively, when the dialog text is constructed with at least three reference texts, the first threshold may be 2.
In the embodiment of the disclosure, the candidate abstract and the corresponding candidate text are screened from the text data set based on the candidate words and are used as the reference abstract and the reference text, so that the reference text and the reference abstract are obtained based on the correlation of text contents, and the similarity between the generated dialogue abstract training data and the real data is improved.
It will be appreciated that depending on the amount of text and the content of the text in the text dataset, the number of candidate summaries obtained when screening candidate summaries from the text dataset according to the candidate words cannot be determined. For example, it may be several, tens or hundreds.
In some embodiments of the present disclosure, the second threshold may be determined according to the number of reference texts required to construct the dialog text. And then, determining a mode of selecting the reference abstract according to the size relation between the number of the candidate abstracts containing the candidate words and a second threshold value.
For example, when it is necessary to construct a dialog text with two or more reference texts, the second threshold may be 2. Alternatively, when it is necessary to construct a dialog text with three or more reference texts, the second threshold may be 3.
Further, when the number of candidate digests containing candidate words is less than or equal to a second threshold, each candidate digest may be determined to be a reference digest. When the number of candidate digests containing candidate words is greater than the second threshold, any number of candidate digests in the respective candidate digests may be determined as a set of reference digests.
For example, if the second threshold is 2 and the number of candidate abstracts including candidate words is 2, the two candidate abstracts may be used as reference abstracts, and candidate texts corresponding to the two candidate abstracts may be used as reference texts.
Or, the second threshold is 2, the number of candidate abstracts including the candidate words is 30, and at this time, any number of candidate abstracts can be randomly selected from the candidate abstracts to serve as a group of reference abstracts, and candidate texts corresponding to the candidate abstracts are taken as a group of reference texts.
For example, two candidate abstracts may be randomly selected from the candidate abstracts containing the candidate words as a group of reference abstracts, and the candidate texts corresponding to the two candidate abstracts are used as the reference texts.
Alternatively, three candidate abstracts can be randomly selected from the candidate abstracts containing the candidate words to serve as a group of reference abstracts, and the candidate texts corresponding to the three candidate abstracts are taken as the reference texts.
It should be noted that the above examples are only illustrative and should not be taken as limitations to determining the abstract and the reference text in the embodiments of the present disclosure.
In the embodiment of the present disclosure, when reference abstracts are selected from the candidate abstracts, the selection mode is determined according to the number of candidate abstracts containing the candidate word, so as to simulate dialogue data with different numbers of speakers and thereby ensure the diversity of the constructed dialogue summary training data.
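The selection logic of steps S202 to S203 together with the first- and second-threshold branches described above may be sketched as follows, assuming the text data set is a list of (candidate text, candidate abstract) pairs; the threshold values and the group size are illustrative assumptions.

```python
import random

def select_reference_pairs(dataset, candidate_word,
                           first_threshold=1, second_threshold=2,
                           group_size=2, seed=None):
    """Select reference texts and abstracts whose abstract contains the candidate word.

    dataset is a list of (candidate_text, candidate_abstract) pairs. An empty
    list is returned when too few candidate abstracts contain the word, in
    which case the candidate word can be replaced or adjusted.
    """
    rng = random.Random(seed)
    matches = [(text, abstract) for text, abstract in dataset
               if candidate_word in abstract]
    if len(matches) <= first_threshold:
        return []                            # not enough material on this topic
    if len(matches) <= second_threshold:
        return matches                       # use every matching candidate
    return rng.sample(matches, group_size)   # otherwise pick an arbitrary group
```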
It should be noted that, when a plurality of candidate abstracts are randomly selected as the reference abstracts and the candidate texts corresponding to those candidate abstracts are used as the reference texts, the lengths of the reference texts may differ greatly.
For example, two candidate texts are selected and denoted as reference text A and reference text B, where reference text A contains 100 sentences and reference text B contains 10 sentences. A dialogue text generated from reference text A and reference text B may then have low similarity to a real dialogue.
In some embodiments of the present disclosure, the reference abstract may be selected from each candidate abstract according to a difference between the number of characters in the candidate text corresponding to each candidate abstract. Alternatively, the reference abstract may be selected from each candidate abstract according to a difference between the number of sentences included in the candidate text corresponding to each candidate abstract.
It is understood that each candidate abstract has a one-to-one correspondence with the candidate text. Therefore, the reference text can be selected according to the difference between the number of characters in the candidate text or the difference between the number of sentences contained in the candidate text, and the reference abstract is determined according to the reference text.
When the reference text is selected according to the character number difference between the candidate texts, the number of characters of each candidate text can be determined respectively. For example, the character counts of the candidate texts are arranged in ascending order as X1, X2, X3, X4, ..., Xn, where n is the number of candidate texts.
Then, the candidate text with the smallest difference in the number of characters can be selected as the reference text. For example, candidate texts corresponding to X1 and X2 may be used as a group of reference texts, candidate texts corresponding to X3 and X4 may be used as a group of reference texts, and so on.
Alternatively, a threshold may be set for the difference in the number of characters between the respective candidate texts. When the difference of the number of characters between any two candidate texts is smaller than the threshold, the two candidate texts can be used as a group of reference texts.
The characters in the candidate text may include one or more of chinese characters, english words, symbols, and numbers, which is not limited in this disclosure.
Similarly, when the reference text is selected according to the difference between the numbers of sentences contained in the candidate texts, the sentence count of each candidate text can be determined respectively. For example, the sentence counts of the candidate texts are arranged in ascending order as Y1, Y2, Y3, Y4, ..., Yn, where n is the number of candidate texts.
Then, the candidate text with the smallest sentence number difference can be selected as the reference text. For example, candidate texts corresponding to Y1 and Y2 may be used as a group of reference texts, candidate texts corresponding to Y3 and Y4 may be used as a group of reference texts, and so on.
Alternatively, a threshold may be set for the sentence number difference of each candidate text. When the difference value of the sentence quantity of any two candidate texts is smaller than the threshold value, the two candidate texts can be used as a group of reference texts.
The number of sentences contained in the candidate text can be determined according to different dividing modes. For example, the division may be by a particular type of punctuation mark. Alternatively, the division may be by any type of punctuation mark.
In the embodiment of the present disclosure, the reference texts are selected from the candidate texts based on the difference in the number of characters or the number of sentences between candidate texts, so that the numbers of dialogue sentences representing different speakers in the generated dialogue text are balanced, which further improves the similarity between the generated dialogue summary training data and real data.
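A possible sketch of the length-based selection described above, assuming candidates are (candidate text, candidate abstract) pairs; measuring length by characters or by sentences, the adjacent-pairing strategy, and the max_diff threshold are illustrative assumptions.

```python
import re

def length_of(text, by="chars"):
    """Length of a candidate text, measured in characters or in sentences."""
    if by == "chars":
        return len(text)
    return len([s for s in re.split(r"[。！？.!?]", text) if s.strip()])

def group_by_length(candidates, by="chars", max_diff=50):
    """Pair candidate (text, abstract) tuples whose lengths are close.

    Candidates are sorted by length and adjacent ones are paired only when
    their length difference stays below max_diff.
    """
    ordered = sorted(candidates, key=lambda pair: length_of(pair[0], by))
    groups = []
    for left, right in zip(ordered[0::2], ordered[1::2]):
        if abs(length_of(left[0], by) - length_of(right[0], by)) <= max_diff:
            groups.append((left, right))
    return groups
```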
And step S204, according to each reference abstract, screening target sentences from a plurality of sentences contained in each reference text.
It should be noted that the reference texts in some text data sets may be long and contain a large amount of background information, such as the papers in a paper abstract data set.
Therefore, in order to reduce the length of the generated dialogue text, a target sentence may be screened from among a plurality of sentences contained in the reference text.
It is to be understood that the reference abstract is a brief summary obtained by summarizing and refining the reference text, and it contains the key information of the reference text. So that the reference abstract can still be derived from the target sentences screened out of the reference text, the target sentences should also contain the key information of the reference text.
In the embodiment of the disclosure, the target sentence can be screened from the plurality of sentences contained in the reference text according to the reference abstract.
For example, a sentence containing the keyword may be selected from the reference text as the target sentence according to the plurality of keywords in the reference abstract.
Or, a sentence containing the keyword may be selected from the reference text according to a plurality of keywords in the reference abstract, and the sentence and a sentence contained in the context where the sentence is located may be used as the target sentence.
It should be noted that the foregoing examples are merely illustrative and are not intended to limit the manner of screening target sentences in the embodiments of the present disclosure.
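One possible sketch of the screening described above: sentences containing any keyword of the reference abstract are kept, optionally together with their neighbouring sentences; the context_window parameter is an illustrative assumption.

```python
def screen_with_context(sentences, keywords, context_window=0):
    """Keep sentences that contain any keyword of the reference abstract and,
    optionally, the neighbouring sentences around each match."""
    keep = set()
    for i, sentence in enumerate(sentences):
        if any(keyword in sentence for keyword in keywords):
            lo = max(0, i - context_window)
            hi = min(len(sentences), i + context_window + 1)
            keep.update(range(lo, hi))
    return [sentences[i] for i in sorted(keep)]
```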
In step S205, a dialog sentence set is generated based on the target sentence included in each reference text.
The target sentences contained in each reference text can be combined to generate a dialog sentence set.
Step S206, randomly arranging each sentence in the dialogue sentence set to generate a dialogue text.
Step S207, selecting a reference text from the candidate texts according to the difference between the sentence numbers contained in the candidate texts.
It should be noted that, for the specific implementation processes of steps S206 and S207, reference may be made to detailed descriptions of other embodiments of the present disclosure, and details are not described herein again.
In the embodiment of the disclosure, the target sentences are screened from each reference text according to each reference abstract, and the dialogue sentence set is generated based on the target sentences of each reference text, so that the extraction of the key information of the reference text is realized. Therefore, the dialog text generated according to the dialog statement set is more concise, and the dialog abstract training data is further optimized.
Fig. 3 is a flowchart illustrating a method for generating dialogue summary training data according to another embodiment of the disclosure.
As shown in fig. 3, the method for generating the dialog summary training data may include the following steps:
step S301, a plurality of reference texts and a reference abstract corresponding to each reference text are obtained.
It should be noted that, the specific implementation process of step S301 may refer to the detailed description of other embodiments of the present disclosure, and is not described herein again.
Step S302, dividing the reference abstract corresponding to the reference text into a plurality of abstract sentences.
It should be noted that the reference abstract is a short summary obtained by summarizing and refining the reference text. The reference abstract may contain a plurality of abstract sentences, each of which potentially contains key information from the reference text.
In the embodiment of the disclosure, the reference abstract corresponding to each reference text may be divided into a plurality of abstract sentences, and a target sentence may be screened from the plurality of sentences contained in the reference text according to each abstract sentence.
When the reference abstract is divided into a plurality of abstract sentences, any possible division mode can be adopted.
For example, the reference summary may be divided into a plurality of summary statements according to a specific type of punctuation mark. Alternatively, the reference summary may be divided into a plurality of summary statements according to any type of punctuation.
For example, periods, question marks, exclamation marks representing long pauses may be the only basis for dividing sentences. Alternatively, a period, question mark, exclamation mark representing a long pause and a comma, pause mark, semicolon, and colon mark representing a short pause may be the basis for dividing a sentence.
Step S303, determining the association degree between each statement and each abstract statement in the reference text.
The degree of association between a sentence in the reference text and an abstract sentence indicates how well the sentence covers the information in that abstract sentence.
Any possible implementation may be employed when determining the degree of association between the sentences in the reference text and the abstract sentences.
For example, words that occur together in the sentence in the reference text and the abstract sentence may be counted, and the degree of association between the two may be determined based on the words that occur together.
Step S304, determining the statement with the relevance larger than the second threshold as the target statement.
The second threshold may be any value set in advance, which is not limited in this disclosure.
For example, the second threshold is 0.3, the association degree between the sentence in the reference text and the abstract sentence is 0.5, and if the association degree is greater than the second threshold, the sentence may be determined to be the target sentence. Or, the association degree between the statement in the reference text and the abstract statement is 0.1, and if the association degree is smaller than a second threshold, the statement is not the target statement.
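One possible way to compute the association degree of steps S303 and S304 is the word-overlap ratio sketched below; the jieba segmenter and the overlap measure are assumptions for illustration, and any segmenter or association measure may be substituted.

```python
import jieba  # an off-the-shelf Chinese word segmenter, assumed here for illustration

def association_degree(sentence, abstract_sentence):
    """Share of the abstract sentence's words that also occur in the sentence."""
    sentence_words = set(jieba.lcut(sentence))
    abstract_words = set(jieba.lcut(abstract_sentence))
    if not abstract_words:
        return 0.0
    return len(sentence_words & abstract_words) / len(abstract_words)

def screen_by_association(sentences, abstract_sentences, threshold=0.3):
    """Keep sentences whose association with any abstract sentence exceeds the threshold."""
    return [s for s in sentences
            if any(association_degree(s, a) > threshold for a in abstract_sentences)]
```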
In step S305, a dialog sentence set is generated based on the target sentence included in each reference text.
Step S306, randomly arranging each sentence in the dialogue sentence set to generate a dialogue text.
In step S307, the plurality of reference abstracts are randomly arranged to generate a target abstract corresponding to the dialog text.
It should be noted that, for specific implementation processes of steps S305, S306, and S307, reference may be made to detailed descriptions of other embodiments of the present disclosure, and details are not described herein again.
In the embodiment of the present disclosure, the reference abstract is divided into a plurality of abstract sentences, and the target sentences are determined according to the degree of association between the sentences in the reference text and each abstract sentence. This ensures that the target sentences contain the key information of the reference abstract and improves the accuracy of the dialogue summary training data.
Fig. 4 is a flowchart illustrating a method for generating dialogue summary training data according to another embodiment of the disclosure.
As shown in fig. 4, the method for generating the dialogue summary training data may include the following steps:
step S401, determining a candidate word set according to each word contained in the abstract corresponding to each text in the text data set.
It is understood that a plurality of keywords may be included in the abstract corresponding to each text. In order to obtain a plurality of reference texts with relevant contents, a plurality of keywords contained in the abstract can be combined into a set to be used as a candidate word set.
For example, a set of candidate words may include keyword a, keyword B, keyword C, and so on.
The determination of the keywords in the abstract can be implemented in any possible manner. For example, the keywords may be selected from the summary according to part of speech. Alternatively, the keywords may be selected from the summary according to word senses.
For example, nouns, verbs, etc. in the abstract may be used as keywords. Words that have practical meaning when present independently may also be used as keywords.
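A possible sketch of step S401 and the part-of-speech selection described above, assuming the jieba toolkit's POS tagger; the chosen tag set and the minimum word length are illustrative assumptions.

```python
import jieba.posseg as pseg  # POS tagging from the jieba toolkit, assumed for illustration

CONTENT_POS = {"n", "nr", "ns", "nt", "nz", "v", "vn"}  # noun and verb tags

def extract_keywords(abstract):
    """Collect content words (nouns and verbs) of one abstract as keywords."""
    return {token.word for token in pseg.cut(abstract)
            if token.flag in CONTENT_POS and len(token.word) > 1}

def build_candidate_word_set(abstracts):
    """Union of the keywords of every abstract in the text data set."""
    candidate_words = set()
    for abstract in abstracts:
        candidate_words |= extract_keywords(abstract)
    return candidate_words
```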
And step S402, sequentially determining each word in the candidate word set as a candidate word.
After the candidate word set is determined, each word in the set can be sequentially used as a candidate word, and a candidate abstract containing the candidate word and a corresponding candidate text are selected from the text data set.
For example, the candidate word set includes a keyword a, a keyword B, and a keyword C. The candidate abstract and the candidate text can be selected according to the keyword a, then the candidate abstract and the candidate text can be selected according to the keyword B, and finally the candidate abstract and the candidate text can be selected according to the keyword C.
Step S403, determining the number of each candidate abstract containing candidate words in the text data set.
Step S404, responding to the number of the candidate abstracts larger than a first threshold value, selecting a reference abstract from each candidate abstract, wherein the candidate text corresponding to the candidate abstract is the reference text.
The specific implementation processes of steps S403 and S404 may refer to detailed descriptions of other embodiments of the present disclosure, which are not described herein again.
In the embodiment of the disclosure, the candidate word set is formed according to the keywords in the abstract, and then the reference text is selected by using each word in the candidate word set, so that the number of the obtained reference texts is ensured, and the relevance of the reference text in content is realized.
Step S405, determining a keyword set corresponding to each reference abstract.
The detailed description of the candidate word set in the embodiments of the present disclosure may be referred to for a specific implementation manner of the keyword set, and is not repeated herein.
It will be appreciated that each reference text and corresponding reference digest is screened from the text dataset. Therefore, after the reference text and the corresponding reference abstract are determined, the candidate word set corresponding to the text and the abstract can be used as the keyword set.
Step S406, determining the number of keywords in the keyword set contained in each sentence.
Wherein the number of keywords in the keyword set contained in each sentence in the reference text can be determined in turn. In turn, the coverage of key information by each sentence in the reference text can be determined.
For example, sentence a1 in the reference text contains 2 keywords, and sentence a2 in the reference text contains 0 keywords, which may indicate that sentence a1 contains key information, and sentence a2 may not contain key information.
Step S407 is to screen a target sentence from the plurality of sentences according to the number of keywords included in each sentence.
The number of the sentences in the reference text containing the keywords can be used as the screening condition of the target sentence.
For example, a threshold may be set; when the number of keywords contained in a sentence of the reference text is greater than or equal to the set threshold, the sentence is taken as a target sentence. Conversely, when the number of keywords contained in the sentence is less than the set threshold, the sentence is not taken as a target sentence.
The threshold value can be set according to actual needs. For example, the set threshold may be 1. Alternatively, the set threshold may be 2, which is not limited by the present disclosure.
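A minimal sketch of steps S406 and S407, assuming the keyword set has already been determined; the set_threshold default is illustrative.

```python
def screen_by_keyword_count(sentences, keyword_set, set_threshold=1):
    """Keep sentences containing at least set_threshold keywords from the keyword set."""
    def keyword_count(sentence):
        return sum(1 for keyword in keyword_set if keyword in sentence)
    return [s for s in sentences if keyword_count(s) >= set_threshold]
```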
Step S408, a dialog sentence set is generated based on the target sentence included in each reference text.
Step S409, randomly arranging each sentence in the dialog sentence set to generate a dialog text.
In step S410, the plurality of reference abstracts are randomly arranged to generate a target abstract corresponding to the dialog text.
It should be noted that, for specific implementation processes of steps S408, S409, and S410, reference may be made to detailed descriptions of other embodiments of the present disclosure, and details are not described herein.
In the embodiment of the present disclosure, the keywords of the reference abstracts form a keyword set, and target sentences are then screened according to the number of keywords contained in each sentence of the reference text. This ensures that the target sentences contain the key information of the reference abstracts and improves the accuracy of the dialogue summary training data.
According to the embodiment of the disclosure, the disclosure further provides a device for generating the dialogue summary training data.
Fig. 5 is a schematic structural diagram of an apparatus for generating dialogue summary training data according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus 500 for generating dialog summary training data may include: an acquisition module 510, a first generation module 520, a second generation module 530, and a third generation module 540.
The obtaining module 510 is configured to obtain a plurality of reference texts and a reference abstract corresponding to each reference text;
a first generating module 520, configured to generate a dialog sentence set based on the sentences contained in each reference text;
a second generating module 530, configured to randomly arrange each sentence in the dialog sentence set to generate a dialog text;
and a third generating module 540, configured to randomly arrange the plurality of reference abstracts to generate a target abstract corresponding to the dialog text.
It should be noted that the foregoing explanation of the embodiment of the method for generating dialog summary training data is also applicable to the apparatus for generating dialog summary training data of this embodiment, and the implementation principle thereof is similar, and is not repeated here.
In the embodiment of the present disclosure, a plurality of reference texts and a reference abstract corresponding to each reference text are obtained first; a dialogue sentence set is then generated based on the sentences contained in each reference text; the sentences in the dialogue sentence set are randomly arranged to generate a dialogue text; and finally the plurality of reference abstracts are randomly arranged to generate a target abstract corresponding to the dialogue text. In this way, dialogue texts and dialogue abstracts are constructed from texts in non-dialogue form and their corresponding abstracts, so that a large amount of dialogue abstract training data can be obtained, support is provided for training a dialogue summarization model, and the time and effort of manual annotation are saved.
In a possible implementation manner of the embodiment of the present disclosure, the obtaining module 510 may include:
the first determining unit is used for determining candidate words;
the second determining unit is used for determining the number of each candidate abstract containing the candidate words in the text data set;
and the third determining unit is used for selecting a reference abstract from each candidate abstract in response to the number of the candidate abstract being larger than the first threshold, wherein the candidate text corresponding to the candidate abstract is the reference text.
In one possible implementation manner of the embodiment of the present disclosure, the third determining unit is configured to:
determining each candidate summary as a reference summary in response to the number of candidate summaries being less than or equal to a second threshold;
or,
in response to the number of candidate digests being greater than a second threshold, any number of candidate digests in the respective candidate digests are determined as a set of reference digests.
In one possible implementation manner of the embodiment of the present disclosure, the third determining unit is configured to:
selecting a reference abstract from each candidate abstract according to the character number difference between the candidate texts corresponding to each candidate abstract; or,
and selecting a reference abstract from each candidate abstract according to the difference between the sentence numbers contained in the candidate texts corresponding to each candidate abstract.
In one possible implementation manner of the embodiment of the present disclosure, the first determining unit is configured to:
determining a candidate word set according to each word contained in the abstract corresponding to each text in the text data set;
and determining each word in the candidate word set as a candidate word in sequence.
In a possible implementation manner of the embodiment of the present disclosure, the first generating module 520 may include:
the screening unit is used for screening target sentences from a plurality of sentences contained in each reference text according to each reference abstract;
and the generating unit is used for generating a dialogue sentence set based on the target sentences contained in each reference text.
In one possible implementation manner of the embodiment of the present disclosure, the screening unit is configured to:
dividing a reference abstract corresponding to a reference text into a plurality of abstract sentences;
determining the association degree between each statement and each abstract statement in the reference text;
and determining the statement with the relevance larger than a second threshold as the target statement.
In one possible implementation manner of the embodiment of the present disclosure, the screening unit is configured to:
determining a keyword set corresponding to each reference abstract;
determining the number of keywords in the keyword set contained in each statement;
and screening target sentences from the plurality of sentences according to the number of the keywords contained in each sentence.
It should be noted that the foregoing explanation of the embodiment of the method for generating dialog summary training data is also applicable to the apparatus for generating dialog summary training data of this embodiment, and the implementation principle thereof is similar, and is not repeated here.
In the embodiment of the disclosure, the candidate abstract and the corresponding candidate text are screened from the text data set based on the candidate words and are used as the reference abstract and the reference text, so that the reference text and the reference abstract are obtained based on the correlation of text contents, and the similarity between the generated dialogue abstract training data and the real data is improved. In addition, according to each reference abstract, target sentences are screened from each reference text, and a conversation sentence set is generated based on the target sentences of each reference text, so that the extraction of key information of the reference text is realized. Therefore, the dialog text generated according to the dialog statement set is more concise, and the dialog abstract training data is further optimized.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 comprises a computing unit 601, which may perform various suitable actions and processes according to a computer program stored in a read-only memory (ROM) 602 or loaded from a storage unit 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 601 performs the respective methods and processes described above, such as the generation method of the dialogue summary training data. For example, in some embodiments, the method of generating dialog summary training data may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When loaded into RAM603 and executed by computing unit 601, a computer program may perform one or more steps of the method of generating dialog summary training data described above. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the generation method of dialog summary training data.
Various implementations of the systems and techniques described herein may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
In the embodiments of the present disclosure, a plurality of reference texts and a reference abstract corresponding to each reference text are first obtained; a dialogue sentence set is then generated based on the sentences contained in each reference text; all sentences in the dialogue sentence set are then randomly arranged to generate a dialogue text; and finally the plurality of reference abstracts are randomly arranged to generate a target abstract corresponding to the dialogue text. In this way, dialogue texts and dialogue summaries are constructed from non-dialogue texts and their corresponding abstracts, so that a large amount of dialogue summary training data can be obtained, which supports the training of dialogue summarization models and saves the time and effort of manual data annotation.
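As a rough, non-authoritative illustration of this pipeline, the Python sketch below assembles one (dialogue text, target abstract) training pair; the representation of each reference text as a list of sentences, the newline and space joins, and the helper name build_training_sample are assumptions made for illustration rather than the disclosed implementation.

```python
import random
from typing import List, Tuple

def build_training_sample(reference_texts: List[List[str]],
                          reference_abstracts: List[str],
                          rng: random.Random) -> Tuple[str, str]:
    """Construct one (dialogue_text, target_abstract) pair from
    non-dialogue reference texts and their reference abstracts."""
    # Pool the sentences of all reference texts into one dialogue sentence set.
    dialogue_sentences = [s for text in reference_texts for s in text]
    # Randomly arrange the sentences to imitate the turns of a conversation.
    rng.shuffle(dialogue_sentences)
    dialogue_text = "\n".join(dialogue_sentences)
    # Randomly arrange the reference abstracts to form the target abstract.
    abstracts = list(reference_abstracts)
    rng.shuffle(abstracts)
    target_abstract = " ".join(abstracts)
    return dialogue_text, target_abstract

# Toy usage: two short reference texts and their abstracts.
if __name__ == "__main__":
    texts = [["The metro line opens in May.", "It has 12 stations."],
             ["Ticket prices were announced.", "A single ride costs 3 yuan."]]
    abstracts = ["A new metro line with 12 stations opens in May.",
                 "Single-ride tickets cost 3 yuan."]
    dialogue, summary = build_training_sample(texts, abstracts, random.Random(0))
    print(dialogue)
    print(summary)
```

Repeating this construction over many groups of topically related reference texts yields the large-scale training corpus described above.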
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (19)

1. A method for generating dialogue summary training data, comprising the following steps:
acquiring a plurality of reference texts and a reference abstract corresponding to each reference text;
generating a dialogue sentence set based on sentences contained in each reference text;
randomly arranging all sentences in the dialogue sentence set to generate a dialogue text;
and randomly arranging the plurality of reference abstracts to generate a target abstract corresponding to the dialogue text.
2. The method of claim 1, wherein the obtaining a plurality of reference texts and a reference abstract corresponding to each reference text comprises:
determining candidate words;
determining the number of each candidate abstract containing the candidate words in the text data set;
and in response to the number of the candidate abstracts being greater than a first threshold, selecting the reference abstracts from the candidate abstracts, wherein the candidate texts corresponding to the candidate abstracts serve as the reference texts.
3. The method of claim 2, wherein said selecting said reference summary from said respective candidate summaries comprises:
determining the respective candidate abstracts as the reference abstracts in response to the number of candidate abstracts being less than or equal to a second threshold;
or,
determining any number of candidate abstracts among the respective candidate abstracts as the set of reference abstracts in response to the number of candidate abstracts being greater than the second threshold.
4. The method of claim 2, wherein said selecting said reference summary from said respective candidate summaries comprises:
selecting the reference abstract from the candidate abstracts according to the difference in the number of characters between the candidate texts corresponding to the candidate abstracts; or,
and selecting the reference abstract from the candidate abstracts according to the difference in the number of sentences contained in the candidate texts corresponding to the candidate abstracts.
5. The method of claim 2, wherein the determining candidate words comprises:
determining a candidate word set according to each word contained in the abstract corresponding to each text in the text data set;
and sequentially determining each word in the candidate word set as the candidate word.
6. The method of any one of claims 1-5, wherein said generating a dialogue sentence set based on the sentences contained in each of said reference texts comprises:
according to each reference abstract, screening target sentences from a plurality of sentences contained in each reference text;
and generating the dialogue sentence set based on the target sentences contained in each reference text.
7. The method of claim 6, wherein the step of screening the target sentence from the plurality of sentences contained in each of the reference texts according to each of the reference abstracts comprises:
dividing the reference abstract corresponding to the reference text into a plurality of abstract sentences;
determining the degree of association between each sentence in the reference text and each abstract sentence;
and determining the sentences whose degree of association is greater than a second threshold as the target sentences.
8. The method of claim 6, wherein the step of screening the target sentence from the plurality of sentences contained in each of the reference texts according to each of the reference abstracts comprises:
determining a keyword set corresponding to each reference abstract;
determining the number of keywords in the keyword set contained in each sentence;
and screening target sentences from the plurality of sentences according to the number of the keywords contained in each sentence.
9. An apparatus for generating dialogue summary training data, comprising:
the acquisition module is used for acquiring a plurality of reference texts and a reference abstract corresponding to each reference text;
a first generation module, configured to generate a dialogue sentence set based on sentences included in each of the reference texts;
the second generation module is used for randomly arranging all sentences in the dialogue sentence set to generate a dialogue text;
and the third generation module is used for randomly arranging the plurality of reference abstracts to generate a target abstract corresponding to the dialogue text.
10. The apparatus of claim 9, wherein the acquisition module comprises:
the first determining unit is used for determining candidate words;
the second determining unit is used for determining the number of each candidate abstract containing the candidate words in the text data set;
and the third determining unit is used for selecting the reference abstract from the candidate abstracts in response to the number of the candidate abstracts being larger than a first threshold, wherein the candidate text corresponding to the candidate abstract is the reference text.
11. The apparatus of claim 10, wherein the third determining unit is to:
determining the respective candidate abstracts as the reference abstracts in response to the number of candidate abstracts being less than or equal to a second threshold;
or,
determining any number of candidate abstracts among the respective candidate abstracts as the set of reference abstracts in response to the number of candidate abstracts being greater than the second threshold.
12. The apparatus of claim 10, wherein the third determining unit is to:
selecting the reference abstract from the candidate abstracts according to the difference in the number of characters between the candidate texts corresponding to the candidate abstracts; or,
and selecting the reference abstract from the candidate abstracts according to the difference in the number of sentences contained in the candidate texts corresponding to the candidate abstracts.
13. The apparatus of claim 10, wherein the first determining unit is to:
determining a candidate word set according to each word contained in the abstract corresponding to each text in the text data set;
and sequentially determining each word in the candidate word set as the candidate word.
14. The apparatus of any one of claims 9-13, wherein the first generation module comprises:
the screening unit is used for screening target sentences from a plurality of sentences contained in each reference text according to each reference abstract;
and the generating unit is used for generating the dialogue sentence set based on the target sentences contained in each reference text.
15. The apparatus of claim 14, wherein the screening unit is to:
dividing the reference abstract corresponding to the reference text into a plurality of abstract sentences;
determining the degree of association between each sentence in the reference text and each abstract sentence;
and determining the sentences whose degree of association is greater than a second threshold as the target sentences.
16. The apparatus of claim 14, wherein the screening unit is to:
determining a keyword set corresponding to each reference abstract;
determining the number of keywords in the keyword set contained in each sentence;
and screening target sentences from the plurality of sentences according to the number of the keywords contained in each sentence.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202111680397.XA 2021-12-30 2021-12-30 Method, device, equipment and storage medium for generating dialogue abstract training data Active CN114490976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111680397.XA CN114490976B (en) 2021-12-30 2021-12-30 Method, device, equipment and storage medium for generating dialogue abstract training data

Publications (2)

Publication Number Publication Date
CN114490976A true CN114490976A (en) 2022-05-13
CN114490976B CN114490976B (en) 2023-04-25

Family

ID=81509975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111680397.XA Active CN114490976B (en) 2021-12-30 2021-12-30 Method, device, equipment and storage medium for generating dialogue abstract training data

Country Status (1)

Country Link
CN (1) CN114490976B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368066A (en) * 2018-12-06 2020-07-03 北京京东尚科信息技术有限公司 Method, device and computer readable storage medium for acquiring dialogue abstract
US20200272693A1 (en) * 2019-02-21 2020-08-27 Microsoft Technology Licensing, Llc Topic based summarizer for meetings and presentations using hierarchical agglomerative clustering
CN111651996A (en) * 2019-03-04 2020-09-11 北京嘀嘀无限科技发展有限公司 Abstract generation method and device, electronic equipment and storage medium
CN109947924A (en) * 2019-03-21 2019-06-28 百度在线网络技术(北京)有限公司 Conversational system training data construction method, device, electronic equipment and storage medium
US20210082585A1 (en) * 2019-09-13 2021-03-18 International Business Machines Corporation Generation of medical records based on doctor-patient dialogue
CN112860881A (en) * 2019-11-27 2021-05-28 北大方正集团有限公司 Abstract generation method and device, electronic equipment and storage medium
US20210383797A1 (en) * 2020-12-24 2021-12-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Method for dialogue processing, electronic device and storage medium
CN112765345A (en) * 2021-01-22 2021-05-07 重庆邮电大学 Text abstract automatic generation method and system fusing pre-training model
CN113158665A (en) * 2021-04-02 2021-07-23 西安交通大学 Method for generating text abstract and generating bidirectional corpus-based improved dialog text
CN113656575A (en) * 2021-07-13 2021-11-16 北京搜狗科技发展有限公司 Training data generation method and device, electronic equipment and readable medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHU et al.: "A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining", arXiv:2004.02016v3 *
ZHU et al.: "MEDIASUM: A Large-scale Media Interview Dataset for Dialogue Summarization", arXiv:2103.06410v2 *
LYU Rui et al.: "TSPT: A Three-Stage Composite Text Summarization Model Based on Pre-training" *

Similar Documents

Publication Publication Date Title
JP7127106B2 (en) Question answering process, language model training method, apparatus, equipment and storage medium
US11693894B2 (en) Conversation oriented machine-user interaction
US10891322B2 (en) Automatic conversation creator for news
JP7108675B2 (en) Semantic matching method, device, electronic device, storage medium and computer program
CN112487173B (en) Man-machine conversation method, device and storage medium
CN111488740A (en) Causal relationship judging method and device, electronic equipment and storage medium
CN112580339B (en) Model training method and device, electronic equipment and storage medium
CN114861889B (en) Deep learning model training method, target object detection method and device
CN112507735A (en) Training method and device of machine translation model and electronic equipment
CN114416943B (en) Training method and device for dialogue model, electronic equipment and storage medium
CN115309877A (en) Dialog generation method, dialog model training method and device
CN113407850A (en) Method and device for determining and acquiring virtual image and electronic equipment
CN111522944A (en) Method, apparatus, device and storage medium for outputting information
CN111858905A (en) Model training method, information identification method, device, electronic equipment and storage medium
CN112560461A (en) News clue generation method and device, electronic equipment and storage medium
CN115062718A (en) Language model training method and device, electronic equipment and storage medium
CN108268443B (en) Method and device for determining topic point transfer and acquiring reply text
CN112329429A (en) Text similarity learning method, device, equipment and storage medium
US20230075339A1 (en) Method of training information generation model, method of generating information, and device
US11900918B2 (en) Method for training a linguistic model and electronic device
CN112784599B (en) Method and device for generating poem, electronic equipment and storage medium
CN114490976B (en) Method, device, equipment and storage medium for generating dialogue abstract training data
CN114490967A (en) Training method of dialogue model, dialogue method and device of dialogue robot and electronic equipment
CN112784600A (en) Information sorting method and device, electronic equipment and storage medium
CN114925185B (en) Interaction method, model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant