CN116229943A - Conversational data set generation method and device

Info

Publication number
CN116229943A
Authority
CN
China
Prior art keywords
dialogue
segment
audio
data
segments
Prior art date
Legal status
Granted
Application number
CN202310505189.9A
Other languages
Chinese (zh)
Other versions
CN116229943B (en)
Inventor
刘杰辰
Current Assignee
Beijing Qingshu Intelligent Technology Co ltd
Original Assignee
Beijing Aishu Wisdom Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Aishu Wisdom Technology Co ltd
Priority to CN202310505189.9A
Publication of CN116229943A
Application granted
Publication of CN116229943B
Legal status: Active

Classifications

    • G06F40/30 Semantic analysis (handling natural language data)
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G10L15/04 Segmentation; Word boundary detection (speech recognition)
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification techniques
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method and a device for generating a conversational dataset, wherein the method comprises the following steps: acquiring dialogue data from subtitle data corresponding to a multimedia file, wherein the dialogue data comprise dialogue text and corresponding timestamps; segmenting the dialogue data into a plurality of dialogue segments, and segmenting the audio file corresponding to the multimedia file based on each dialogue segment and its starting timestamp to obtain a plurality of audio segments corresponding to the plurality of dialogue segments; and performing speaker recognition on the plurality of audio segments, labeling the speaker of each sentence in the dialogue segment corresponding to each audio segment according to the recognition result, and taking the labeled dialogue segments as the conversational dataset. By obtaining dialogue data from subtitle data and then segmenting and labeling it to generate a conversational dataset, the method and the device can reduce the generation cost of the dataset and improve its generation speed, generation efficiency and diversity.

Description

Conversational data set generation method and device
Technical Field
The application belongs to the field of artificial intelligence, and particularly relates to a method and a device for generating a conversational data set.
Background
Natural language processing is a jewel in the crown of artificial intelligence, and human-machine dialogue is the final link in the field of natural language processing. Implementing human-machine dialogue typically requires large conversational datasets.
At present, conversational datasets are still generally produced manually, which is costly and slow; moreover, the same group of annotators tends to repeat topics and phrasing, so manual production cannot keep up with the ever-increasing demand for data.
Disclosure of Invention
The embodiments of the application aim to provide a method and a device for generating a conversational dataset, so as to overcome the inability of the prior art to meet the demand for data.
In order to solve the technical problems, the application is realized as follows:
in a first aspect, a method for generating a conversational dataset is provided, including the steps of:
acquiring dialogue data from subtitle data corresponding to a multimedia file, wherein the dialogue data comprise dialogue text and corresponding timestamps;
dividing the dialogue data into a plurality of dialogue segments, and dividing the audio file corresponding to the multimedia file based on each dialogue segment and the corresponding starting time stamp thereof to obtain a plurality of audio segments corresponding to the plurality of dialogue segments;
and carrying out speaker recognition on the plurality of audio segments, marking the speaker of each sentence in the dialogue segment corresponding to each audio segment according to the recognition result, and taking the marked dialogue segment as a dialogue data set.
In a second aspect, there is provided a device for generating a conversational dataset, comprising:
the acquisition module is used for acquiring dialogue data from subtitle data corresponding to the multimedia file, wherein the dialogue data comprise dialogue text and corresponding timestamps;
the segmentation module is used for segmenting the dialogue data into a plurality of dialogue segments, and segmenting the audio file corresponding to the multimedia file based on each dialogue segment and the corresponding starting time stamp thereof to obtain a plurality of audio segments corresponding to the dialogue segments;
and the labeling module is used for identifying the speakers of the plurality of audio segments, labeling the speaker of each sentence in the conversation segment corresponding to each audio segment according to the identification result, and taking the labeled conversation segment as a conversation type data set.
By obtaining dialogue data from subtitle data and then segmenting and labeling it to generate a conversational dataset, the method and the device of the embodiments can reduce the generation cost of the dataset and improve its generation speed, generation efficiency and diversity.
Drawings
FIG. 1 is a flowchart of a method for generating a conversational dataset according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a device for generating a conversational dataset according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Since purely manual generation of conversational datasets cannot meet the ever-increasing demand for data, it is necessary to supplement them by other means. Existing film and television-series subtitles are an ideal data source, but problems remain: the subtitles are difficult to clean, dialogue segments and speakers cannot be distinguished, and the subtitles cannot be classified by topic.
The embodiments of the application provide a method and a device for generating conversational datasets automatically and in batches. The method uses context comparison to distinguish dialogue segments, combines multi-modal speaker recognition to distinguish speakers, and classifies topics by combining keyword extraction with BERT word vectors.
The following describes in detail a method for generating a conversational dataset according to the embodiments of the present application through specific embodiments and application scenarios thereof with reference to the accompanying drawings.
As shown in fig. 1, a flowchart of a method for generating a conversational dataset according to an embodiment of the present application includes the following steps:
step 101, dialogue data is obtained from caption data corresponding to the multimedia file, wherein the dialogue data comprises dialogue characters and corresponding time stamps.
Step 102, segmenting the dialogue data into a plurality of dialogue segments, and segmenting the audio file corresponding to the multimedia file based on each dialogue segment and the corresponding starting time stamp thereof to obtain a plurality of audio segments corresponding to the plurality of dialogue segments.
Specifically, the dialogue data may be segmented into a plurality of dialogue segments based on the minimum and maximum dialogue turns imposed by a sliding-window limit.
In this embodiment, after the audio file corresponding to the multimedia file is segmented based on each dialogue segment and its starting timestamp to obtain a plurality of audio segments corresponding to the dialogue segments, speech recognition may further be performed on each audio segment; the recognized text is matched and aligned with the text content of the dialogue segment corresponding to that audio segment, and any audio segment that cannot be matched is discarded.
And 103, identifying the speakers of the plurality of audio segments, marking the speaker of each sentence in the conversation segment corresponding to each audio segment according to the identification result, and taking the marked conversation segment as a conversation type data set.
In this embodiment, after the labeled dialog segments are used as the dialog data set, keywords may also be extracted from each dialog segment, and the keywords of each dialog segment and the corresponding weight values thereof may be recorded; according to the keywords of each dialogue segment and the corresponding weight values thereof, the matching degree of each dialogue segment and the existing topic classification is calculated respectively; based on the degree of matching, each dialog segment is classified into an existing topic classification.
Specifically, word vectors can be calculated for the topic word of an existing topic classification and for each keyword of a dialogue segment; the similarity between the word vector of the topic word and the word vector of each keyword is calculated using cosine similarity, and the similarities are weighted-averaged by the keyword weights to obtain the matching degree between the dialogue segment and the existing topic classification. Correspondingly, under each existing topic classification, the dialogue segments are sorted according to their matching degree with that classification.
By obtaining dialogue data from subtitle data and then segmenting and labeling it to generate a conversational dataset, the method and the device of the embodiments can reduce the generation cost of the dataset and improve its generation speed, generation efficiency and diversity.
In the embodiments of the application, films, television-drama videos and the corresponding subtitle data can be collected first, and the subtitle data can then be cleaned. Because subtitle formats differ slightly, the cleaning strategy can be adjusted to the format at hand, for example by using markers to distinguish narration, lyrics, scene descriptions and the like, so that as far as possible only dialogue text and the corresponding timestamps are retained.
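As an illustration only, the following Python sketch shows what this cleaning step could look like for SRT-style subtitles; the SRT format and the symbol-based filtering heuristics (brackets for narration and scene descriptions, a music note for lyrics) are assumptions, not requirements of the application.

import re
from datetime import datetime

TIME_RE = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})")
NON_DIALOGUE_RE = re.compile(r"^[\[(（【].*[\])）】]$|♪")  # assumed markers for narration/lyrics

def _to_seconds(hms, ms):
    t = datetime.strptime(hms, "%H:%M:%S")
    return t.hour * 3600 + t.minute * 60 + t.second + int(ms) / 1000.0

def parse_srt(path):
    """Yield (start_seconds, end_seconds, text) for dialogue lines only."""
    with open(path, encoding="utf-8") as f:
        blocks = f.read().split("\n\n")
    for block in blocks:
        lines = [l.strip() for l in block.splitlines() if l.strip()]
        if len(lines) < 3:                 # need index, time line and at least one text line
            continue
        m = TIME_RE.search(lines[1])
        if not m:
            continue
        text = " ".join(lines[2:])
        if NON_DIALOGUE_RE.search(text):   # drop narration, lyrics, scene descriptions
            continue
        yield _to_seconds(m.group(1), m.group(2)), _to_seconds(m.group(3), m.group(4)), text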
Then, the subtitle data is cut at the correct points through context comparison. Specifically, a scene-segmentation method can be introduced to split the whole dialogue data into different dialogue segments, as follows. First, a judgment model is trained on an existing dialogue dataset to decide whether a span of n sentences should be truncated after its last sentence. Then, the size of a sliding window is limited so as to bound the minimum and maximum dialogue turns of a segment (for example, the dataset requires 8 to 12 sentences per segment, i.e., no fewer than 4 and no more than 6 dialogue turns).
In use, sliding windows conforming to the size limits are evaluated sequentially from beginning to end. When the judgment fails (the last sentence cannot serve as a cut point), the next turn of sentences is added to the window and evaluation continues; when the judgment succeeds (the last sentence can serve as a cut point), all sentences in the window form one dialogue segment, and the process is repeated starting from the sentence after the cut. If the window grows beyond 12 sentences, it is truncated there and the process continues from the following sentence. After recognition is finished, dialogue segments with fewer than 8 sentences are screened out.
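The segmentation loop described above can be sketched in Python as follows; should_cut stands in for the trained judgment model and is a hypothetical callable, and for simplicity the window grows one sentence at a time rather than one dialogue turn at a time.

MIN_SENTENCES, MAX_SENTENCES = 8, 12   # bounds from the example above

def split_dialogue(sentences, should_cut):
    """sentences: subtitle lines in order; should_cut(window) -> True if the
    window should be cut after its last sentence (the judgment model)."""
    segments, window = [], []
    for sentence in sentences:
        window.append(sentence)
        if len(window) < MIN_SENTENCES:
            continue                        # window still too small to cut
        if should_cut(window) or len(window) >= MAX_SENTENCES:
            segments.append(window)         # cut here (forced at the upper bound)
            window = []                     # restart from the next sentence
    if window:
        segments.append(window)             # whatever remains at the end
    # screen out dialogue segments with fewer than the minimum number of sentences
    return [seg for seg in segments if len(seg) >= MIN_SENTENCES]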
After the dialogue data is divided into a plurality of dialogue segments, the corresponding audio can be cut out according to the timestamps, and a speaker-diarization tool is used to identify the speakers and assign them to the individual sentences (data with too many speakers is filtered out). Conversational datasets require each sentence to be labeled with its speaker, and subtitle data usually does not carry such information, so the speakers must be distinguished. First, the selected dialogue segments and their starting timestamps are used to cut the corresponding film or television-drama audio files (keeping roughly 2 seconds before and after each segment to tolerate inaccurate timestamps). The cut audio is then transcribed with ASR, the recognized text is matched and aligned with the subtitle text (if they cannot be matched, i.e., the similarity is below a threshold, the segment is discarded), and the aligned audio files are kept together with their corresponding subtitle files. Finally, the speaker-recognition tool labels the speaker of each sentence in the retained audio; since common conversational datasets require only 2 interlocutors, only labeled subtitle files with exactly 2 speakers are selected.
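A minimal sketch of the audio-slicing and alignment step is shown below. The pydub library, the 2-second padding and the 0.7 similarity threshold are assumptions; transcribe stands in for whatever ASR tool is used, and the speaker-diarization step is not shown.

from difflib import SequenceMatcher
from pydub import AudioSegment   # assumed audio library

PAD_MS = 2000          # keep roughly 2 s of context before and after each segment
SIM_THRESHOLD = 0.7    # illustrative value; the application only requires "below a threshold"

def cut_segment(audio_path, start_s, end_s, out_path):
    """Cut one dialogue segment out of the full audio track, with padding."""
    audio = AudioSegment.from_file(audio_path)
    start = max(0, int(start_s * 1000) - PAD_MS)
    end = min(len(audio), int(end_s * 1000) + PAD_MS)
    audio[start:end].export(out_path, format="wav")
    return out_path

def keep_if_aligned(audio_clip_path, subtitle_text, transcribe):
    """transcribe(path) -> str is a placeholder for the ASR tool; keep the clip
    only if the recognized text is similar enough to the subtitle text."""
    asr_text = transcribe(audio_clip_path)
    similarity = SequenceMatcher(None, asr_text, subtitle_text).ratio()
    return similarity >= SIM_THRESHOLD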
After labeling is completed, the labeled subtitle dialogue segments can be classified: keywords are extracted from each dialogue segment to summarize its content, which makes classification and relevance calculation convenient. Specifically, tf-idf may be used to extract keywords from all dialogue segments, keeping the top 10 keywords of each segment (in order) and recording the keywords together with their corresponding weight values.
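As one possible realization of this step, the sketch below uses jieba's tf-idf keyword extractor; jieba is an assumed tool, since the application only specifies tf-idf and a top-10 cutoff.

import jieba.analyse   # assumed Chinese tokenizer and keyword extractor

def segment_keywords(dialogue_segment, top_k=10):
    """Return the top-k tf-idf keywords of one dialogue segment with their weights.
    dialogue_segment: list of sentence strings."""
    text = " ".join(dialogue_segment)
    # extract_tags returns [(keyword, weight), ...] ranked by tf-idf weight
    return jieba.analyse.extract_tags(text, topK=top_k, withWeight=True)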
Afterwards, the keywords can be grouped into topic classes and matching coefficients (used for sorting) recorded; that is, the matching degree between a given dialogue segment and a given topic is calculated. This can be used to classify new dialogue-segment data into the existing topic classification, and when a topic requirement outside the existing topic classification arrives, matching-degree sorting can be performed quickly to return results that meet the requirement.
The matching degree is calculated as follows. Suppose the topic word used for the calculation is A, and the 5 keywords of the dialogue segment currently being scored, together with their weight values, are B1-w1, B2-w2, B3-w3, B4-w4 and B5-w5. Word vectors of A and of each Bn are computed with a trained general-purpose Chinese BERT model, giving AV and BVn; the similarity Sn between AV and each BVn is computed with cosine similarity, and the Sn are weighted-averaged by wn, i.e. S = (Σn wn·Sn) / (Σn wn), to obtain the final similarity S.
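The calculation can be sketched in Python as follows; the transformers library and the bert-base-chinese checkpoint are assumptions standing in for the "trained Chinese universal BERT model", and mean-pooling the last hidden state is just one reasonable way to obtain a word vector.

import numpy as np
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese")

def embed(word):
    """Mean-pooled last-hidden-state vector for a single word."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def matching_degree(topic_word, keywords_with_weights):
    """keywords_with_weights: [(keyword, weight), ...] of one dialogue segment."""
    av = embed(topic_word)
    sims, weights = [], []
    for kw, w in keywords_with_weights:
        bv = embed(kw)
        sims.append(float(np.dot(av, bv) / (np.linalg.norm(av) * np.linalg.norm(bv))))
        weights.append(w)
    # weighted average of the cosine similarities: S = sum(w_n * S_n) / sum(w_n)
    return float(np.average(sims, weights=weights))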
All labeled dialogue segments are classified with this calculation method and, within each existing topic class, sorted according to matching degree so that they can be conveniently retrieved later. When a new topic demand is received, for example a topic outside the existing topic system, the same matching-degree calculation can be performed and the dialogue segments with the highest matching degrees selected as required.
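Reusing the matching_degree sketch above, ranking segments for a new topic demand could look like the following; the data layout is an illustrative assumption.

def rank_segments_for_topic(topic_word, segments_with_keywords, top_n=None):
    """segments_with_keywords: [(segment_id, [(keyword, weight), ...]), ...];
    returns segment ids sorted by matching degree, highest first."""
    scored = [(seg_id, matching_degree(topic_word, kws))
              for seg_id, kws in segments_with_keywords]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n] if top_n else scored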
According to the embodiments of the application, after the relevant subtitles and audio data are crawled, the conversational dataset is generated automatically, together with keywords (convenient for later personalized classification and use) and a topic class for each dialogue segment. This greatly improves the efficiency and diversity of dataset generation: compared with reprocessing the subtitle data manually, the speed is improved by more than 80%, and the cost is only one tenth of that of the manual method.
As shown in fig. 2, a schematic structural diagram of a device for generating a conversational dataset according to an embodiment of the present application includes:
the obtaining module 210 is configured to obtain dialogue data from subtitle data corresponding to the multimedia file, where the dialogue data includes dialogue text and a corresponding timestamp.
The segmentation module 220 is configured to segment the dialogue data into a plurality of dialogue segments, and to segment the audio file corresponding to the multimedia file based on each dialogue segment and its starting timestamp, so as to obtain a plurality of audio segments corresponding to the dialogue segments.
Specifically, the segmentation module 220 is configured to segment the dialogue data into a plurality of dialogue segments based on the minimum and maximum dialogue turns limited by the sliding window.
The labeling module 230 is configured to identify a speaker from the plurality of audio segments, label a speaker for each sentence in the conversation segment corresponding to each audio segment according to the identification result, and use the labeled conversation segment as a conversation data set.
In this embodiment, the apparatus further includes:
and the extraction module is used for respectively extracting the keywords from each dialogue segment and recording the keywords of each dialogue segment and the corresponding weight values thereof.
And the calculation module is used for calculating the matching degree of each dialogue segment and the existing topic classification according to the keywords of each dialogue segment and the corresponding weight values of the keywords.
The computing module is specifically configured to compute word vectors of the topic word of an existing topic classification and of each keyword of the same dialogue segment, compute the similarity between the word vector of the topic word and the word vector of each keyword using cosine similarity, and weighted-average the similarities by the weight of each keyword to obtain the matching degree between the dialogue segment and the existing topic classification.
And the classifying module is used for classifying each dialogue segment into the existing topic classification based on the matching degree.
The classification module is specifically configured to, under an existing topic classification, order each dialog segment according to a matching degree between each dialog segment and the existing topic classification.
In this embodiment, the apparatus further includes:
the recognition module is used for respectively recognizing each audio segment, matching and aligning the recognized characters with the text content of the dialogue segment corresponding to the audio segment, and discarding the audio segment if the audio segment which cannot be matched exists.
By obtaining dialogue data from subtitle data and then segmenting and labeling it to generate a conversational dataset, the method and the device of the embodiments can reduce the generation cost of the dataset and improve its generation speed, generation efficiency and diversity.
The embodiments of the application further provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements each process of the above embodiments of the conversational-dataset generation method and achieves the same technical effects, which are not repeated here. The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims (10)

1. A method of generating a conversational dataset, comprising the steps of:
acquiring dialogue data from subtitle data corresponding to a multimedia file, wherein the dialogue data comprise dialogue text and corresponding timestamps;
dividing the dialogue data into a plurality of dialogue segments, and dividing the audio file corresponding to the multimedia file based on each dialogue segment and the corresponding starting time stamp thereof to obtain a plurality of audio segments corresponding to the plurality of dialogue segments;
and carrying out speaker recognition on the plurality of audio segments, marking the speaker of each sentence in the dialogue segment corresponding to each audio segment according to the recognition result, and taking the marked dialogue segment as a dialogue data set.
2. The method according to claim 1, wherein after taking the labeled dialogue segments as the conversational dataset, the method further comprises:
extracting keywords from each dialogue segment respectively, and recording the keywords of each dialogue segment and the corresponding weight values thereof;
according to the keywords of each dialogue segment and the corresponding weight values thereof, the matching degree of each dialogue segment and the existing topic classification is calculated respectively;
based on the degree of matching, each dialog segment is classified into an existing topic classification.
3. The method according to claim 2, wherein the calculating the matching degree between each dialogue segment and the existing topic classification according to the keyword of each dialogue segment and the corresponding weight value thereof comprises:
respectively calculating word vectors of the topic word of an existing topic classification and of each keyword of the same dialogue segment, calculating the similarity between the word vector of the topic word and the word vector of each keyword using cosine similarity, and weighted-averaging the similarities by the weight of each keyword to obtain the matching degree between the dialogue segment and the existing topic classification;
based on the matching degree, classifying each dialogue segment into an existing topic classification specifically comprises the following steps:
and under the existing topic classification, sequencing each dialog segment according to the matching degree of each dialog segment and the existing topic classification.
4. The method according to claim 1, wherein segmenting the dialogue data into a plurality of dialogue segments specifically comprises:
segmenting the dialogue data into a plurality of dialogue segments based on the minimum and maximum dialogue turns limited by a sliding window.
5. The method according to claim 1, wherein, after segmenting the audio file corresponding to the multimedia file based on each dialogue segment and its starting timestamp to obtain a plurality of audio segments corresponding to the plurality of dialogue segments, the method further comprises:
and respectively identifying each audio segment, matching and aligning the identified characters with the text content of the dialogue segment corresponding to the audio segment, and discarding the audio segment if the audio segment which cannot be matched exists.
6. A device for generating a conversational dataset, comprising:
the acquisition module is used for acquiring dialogue data from subtitle data corresponding to the multimedia file, wherein the dialogue data comprise dialogue text and corresponding timestamps;
the segmentation module is used for segmenting the dialogue data into a plurality of dialogue segments, and segmenting the audio file corresponding to the multimedia file based on each dialogue segment and the corresponding starting time stamp thereof to obtain a plurality of audio segments corresponding to the dialogue segments;
and the labeling module is used for identifying the speakers of the plurality of audio segments, labeling the speaker of each sentence in the conversation segment corresponding to each audio segment according to the identification result, and taking the labeled conversation segment as a conversation type data set.
7. The apparatus as recited in claim 6, further comprising:
the extraction module is used for extracting keywords from each dialogue segment respectively and recording the keywords of each dialogue segment and the corresponding weight values thereof;
the computing module is used for respectively computing the matching degree of each dialogue segment and the existing topic classification according to the keywords of each dialogue segment and the corresponding weight values thereof;
and the classifying module is used for classifying each dialogue segment into the existing topic classification based on the matching degree.
8. The apparatus according to claim 7, wherein:
the computing module is specifically configured to compute word vectors of the topic word of an existing topic classification and of each keyword of the same dialogue segment, compute the similarity between the word vector of the topic word and the word vector of each keyword using cosine similarity, and weighted-average the similarities by the weight of each keyword to obtain the matching degree between the dialogue segment and the existing topic classification;
the classifying module is specifically configured to, under an existing topic classification, order each dialog segment according to a matching degree between each dialog segment and the existing topic classification.
9. The apparatus according to claim 6, wherein:
the segmentation module is specifically configured to segment the dialogue data into a plurality of dialogue segments based on the minimum and maximum dialogue turns limited by a sliding window.
10. The apparatus as recited in claim 6, further comprising:
the recognition module is used for respectively recognizing each audio segment, matching and aligning the recognized characters with the text content of the dialogue segment corresponding to the audio segment, and discarding the audio segment if the audio segment which cannot be matched exists.
CN202310505189.9A 2023-05-08 2023-05-08 Conversational data set generation method and device Active CN116229943B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310505189.9A CN116229943B (en) 2023-05-08 2023-05-08 Conversational data set generation method and device

Publications (2)

Publication Number Publication Date
CN116229943A 2023-06-06
CN116229943B 2023-08-15

Family

ID=86584638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310505189.9A Active CN116229943B (en) 2023-05-08 2023-05-08 Conversational data set generation method and device

Country Status (1)

Country Link
CN (1) CN116229943B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108806668A (en) * 2018-06-08 2018-11-13 国家计算机网络与信息安全管理中心 A kind of audio and video various dimensions mark and model optimization method
US20190318725A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers
KR102041621B1 (en) * 2019-02-25 2019-11-06 (주)미디어코퍼스 System for providing artificial intelligence based dialogue type corpus analyze service, and building method therefor
CN110717031A (en) * 2019-10-15 2020-01-21 南京摄星智能科技有限公司 Intelligent conference summary generation method and system
CN112818680A (en) * 2020-07-10 2021-05-18 腾讯科技(深圳)有限公司 Corpus processing method and device, electronic equipment and computer-readable storage medium
CN114996506A (en) * 2022-05-24 2022-09-02 腾讯科技(深圳)有限公司 Corpus generation method and device, electronic equipment and computer-readable storage medium
CN115269884A (en) * 2021-04-29 2022-11-01 华为云计算技术有限公司 Method, device and related equipment for generating video corpus

Also Published As

Publication number Publication date
CN116229943B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
CN112818906B (en) Intelligent cataloging method of all-media news based on multi-mode information fusion understanding
US10304458B1 (en) Systems and methods for transcribing videos using speaker identification
CN107305541B (en) Method and device for segmenting speech recognition text
US6925455B2 (en) Creating audio-centric, image-centric, and integrated audio-visual summaries
US8775174B2 (en) Method for indexing multimedia information
CN106878632B (en) Video data processing method and device
CN107491435B (en) Method and device for automatically identifying user emotion based on computer
CN110705254B (en) Text sentence-breaking method and device, electronic equipment and storage medium
CN113766314B (en) Video segmentation method, device, equipment, system and storage medium
CN112668559A (en) Multi-mode information fusion short video emotion judgment device and method
CN111797820B (en) Video data processing method and device, electronic equipment and storage medium
US7349477B2 (en) Audio-assisted video segmentation and summarization
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
CN114598933B (en) Video content processing method, system, terminal and storage medium
CN114996506A (en) Corpus generation method and device, electronic equipment and computer-readable storage medium
CN114051154A (en) News video strip splitting method and system
CN116229943B (en) Conversational data set generation method and device
CN116017088A (en) Video subtitle processing method, device, electronic equipment and storage medium
WO2011039773A2 (en) Tv news analysis system for multilingual broadcast channels
CN114064968A (en) News subtitle abstract generating method and system
Bechet et al. Detecting person presence in tv shows with linguistic and structural features
JP2006251553A (en) Method, device, and program for topic division processing
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN114510585B (en) Information characterization model construction method and information characterization method
JP2006135387A (en) Moving image subject dividing method

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address
Address after: 411, 4th floor, building 4, No.44, Middle North Third Ring Road, Haidian District, Beijing 100088
Patentee after: Beijing Qingshu Intelligent Technology Co.,Ltd.
Address before: Building G, 4th Floor, Cultural and Educational Industrial Park, No. 44 North Third Ring Middle Road, Haidian District, Beijing, 100000
Patentee before: BEIJING AISHU WISDOM TECHNOLOGY CO.,LTD.