CN116229943A - Conversational data set generation method and device

Info

Publication number
CN116229943A
Authority
CN
China
Prior art keywords
dialogue
segment
audio
data
segments
Prior art date
Legal status
Granted
Application number
CN202310505189.9A
Other languages
Chinese (zh)
Other versions
CN116229943B (en)
Inventor
刘杰辰
Current Assignee
Beijing Qingshu Intelligent Technology Co ltd
Original Assignee
Beijing Aishu Wisdom Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Aishu Wisdom Technology Co ltd
Priority to CN202310505189.9A
Publication of CN116229943A
Application granted
Publication of CN116229943B
Legal status: Active

Classifications

    • G06F40/30 Semantic analysis (handling natural language data)
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G10L15/04 Segmentation; Word boundary detection (speech recognition)
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification techniques
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method and a device for generating a conversational dataset, wherein the method comprises the following steps: acquiring dialogue data from subtitle data corresponding to a multimedia file, wherein the dialogue data comprise dialogue text and corresponding timestamps; segmenting the dialogue data into a plurality of dialogue segments, and segmenting the audio file corresponding to the multimedia file based on each dialogue segment and its starting timestamp to obtain a plurality of audio segments corresponding to the plurality of dialogue segments; and performing speaker recognition on the plurality of audio segments, labeling the speaker of each sentence in the dialogue segment corresponding to each audio segment according to the recognition result, and taking the labeled dialogue segments as the conversational dataset. By obtaining dialogue data from subtitle data and then segmenting and labeling it to generate a conversational dataset, the method and the device can reduce the generation cost of the dataset and improve its generation speed, generation efficiency and diversity.

Description

Conversational data set generation method and device
Technical Field
The application belongs to the field of artificial intelligence, and particularly relates to a method and a device for generating a conversational data set.
Background
Natural language processing is a jewel in the crown of artificial intelligence, and human-machine dialogue is the final link in the field of natural language processing. Implementing human-machine dialogue typically requires large conversational datasets.
At present, conversational datasets are still generally produced manually, which is costly and slow; moreover, the same group of annotators tends to repeat topics and phrasing, so manual production cannot keep up with the ever-increasing demand for data.
Disclosure of Invention
The embodiments of the application aim to provide a method and a device for generating a conversational dataset, so as to overcome the inability of the prior art to meet the demand for data.
In order to solve the technical problems, the application is realized as follows:
in a first aspect, a method for generating a conversational dataset is provided, including the steps of:
acquiring dialogue data from subtitle data corresponding to a multimedia file, wherein the dialogue data comprise dialogue text and corresponding timestamps;
dividing the dialogue data into a plurality of dialogue segments, and dividing the audio file corresponding to the multimedia file based on each dialogue segment and the corresponding starting time stamp thereof to obtain a plurality of audio segments corresponding to the plurality of dialogue segments;
and carrying out speaker recognition on the plurality of audio segments, marking the speaker of each sentence in the dialogue segment corresponding to each audio segment according to the recognition result, and taking the marked dialogue segment as a dialogue data set.
In a second aspect, there is provided a device for generating a conversational dataset, comprising:
the acquisition module is used for acquiring dialogue data from subtitle data corresponding to the multimedia file, wherein the dialogue data comprise dialogue text and corresponding timestamps;
the segmentation module is used for segmenting the dialogue data into a plurality of dialogue segments, and segmenting the audio file corresponding to the multimedia file based on each dialogue segment and the corresponding starting time stamp thereof to obtain a plurality of audio segments corresponding to the dialogue segments;
and the labeling module is used for identifying the speakers of the plurality of audio segments, labeling the speaker of each sentence in the conversation segment corresponding to each audio segment according to the identification result, and taking the labeled conversation segment as a conversation type data set.
By obtaining dialogue data from subtitle data and then segmenting and labeling it to generate a conversational dataset, the method and the device of the embodiments can reduce the generation cost of the dataset and improve its generation speed, generation efficiency and diversity.
Drawings
FIG. 1 is a flowchart of a method for generating a conversational dataset according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a device for generating a conversational dataset according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Since purely manual generation of conversational datasets cannot meet the ever-increasing demand for data, it is necessary to supplement them by other means. Existing film and television-series subtitles are an ideal data source, but problems remain: the subtitles are difficult to clean, dialogue segments and speakers cannot be distinguished, and the subtitles cannot be classified by topic.
The embodiments of the application provide a method and a device for generating conversational datasets automatically and in batches. The method uses context comparison to distinguish dialogue segments, combines multi-modal speaker recognition to distinguish speakers, and classifies topics by combining keyword extraction with BERT word vectors.
The following describes in detail a method for generating a conversational dataset according to the embodiments of the present application through specific embodiments and application scenarios thereof with reference to the accompanying drawings.
As shown in fig. 1, a flowchart of a method for generating a conversational dataset according to an embodiment of the present application includes the following steps:
step 101, dialogue data is obtained from caption data corresponding to the multimedia file, wherein the dialogue data comprises dialogue characters and corresponding time stamps.
Step 102, segmenting the dialogue data into a plurality of dialogue segments, and segmenting the audio file corresponding to the multimedia file based on each dialogue segment and the corresponding starting time stamp thereof to obtain a plurality of audio segments corresponding to the plurality of dialogue segments.
Specifically, the dialogue data may be segmented into a plurality of dialogue segments based on the minimum and maximum dialogue turns imposed by a sliding-window limit.
In this embodiment, after the audio file corresponding to the multimedia file is segmented based on each dialogue segment and its starting timestamp to obtain a plurality of audio segments corresponding to the dialogue segments, speech recognition may further be performed on each audio segment; the recognized text is matched and aligned with the text content of the dialogue segment corresponding to that audio segment, and any audio segment that cannot be matched is discarded.
And 103, identifying the speakers of the plurality of audio segments, marking the speaker of each sentence in the conversation segment corresponding to each audio segment according to the identification result, and taking the marked conversation segment as a conversation type data set.
In this embodiment, after the labeled dialog segments are used as the dialog data set, keywords may also be extracted from each dialog segment, and the keywords of each dialog segment and the corresponding weight values thereof may be recorded; according to the keywords of each dialogue segment and the corresponding weight values thereof, the matching degree of each dialogue segment and the existing topic classification is calculated respectively; based on the degree of matching, each dialog segment is classified into an existing topic classification.
Specifically, word vectors can be calculated for the topic word of an existing topic classification and for each keyword of a dialogue segment; the similarity between the word vector of the topic word and the word vector of each keyword is calculated using cosine similarity, and the similarities are weighted-averaged by the keyword weights to obtain the matching degree between the dialogue segment and the existing topic classification. Correspondingly, under each existing topic classification, the dialogue segments are sorted according to their matching degree with that classification.
By obtaining dialogue data from subtitle data and then segmenting and labeling it to generate a conversational dataset, the method and the device of the embodiments can reduce the generation cost of the dataset and improve its generation speed, generation efficiency and diversity.
In the embodiments of the application, films, television-drama videos and the corresponding subtitle data can be collected first, and the subtitle data can then be cleaned. Because subtitle formats differ slightly, the cleaning strategy can be adjusted to the format at hand, for example by using markers to distinguish narration, lyrics, scene descriptions and the like, so that as far as possible only dialogue text and the corresponding timestamps are retained.
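As an illustration only, the following Python sketch shows what this cleaning step could look like for SRT-style subtitles; the SRT format and the symbol-based filtering heuristics (brackets for narration and scene descriptions, a music note for lyrics) are assumptions, not requirements of the application.

import re
from datetime import datetime

TIME_RE = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})")
NON_DIALOGUE_RE = re.compile(r"^[\[(（【].*[\])）】]$|♪")  # assumed markers for narration/lyrics

def _to_seconds(hms, ms):
    t = datetime.strptime(hms, "%H:%M:%S")
    return t.hour * 3600 + t.minute * 60 + t.second + int(ms) / 1000.0

def parse_srt(path):
    """Yield (start_seconds, end_seconds, text) for dialogue lines only."""
    with open(path, encoding="utf-8") as f:
        blocks = f.read().split("\n\n")
    for block in blocks:
        lines = [l.strip() for l in block.splitlines() if l.strip()]
        if len(lines) < 3:                 # need index, time line and at least one text line
            continue
        m = TIME_RE.search(lines[1])
        if not m:
            continue
        text = " ".join(lines[2:])
        if NON_DIALOGUE_RE.search(text):   # drop narration, lyrics, scene descriptions
            continue
        yield _to_seconds(m.group(1), m.group(2)), _to_seconds(m.group(3), m.group(4)), text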
Then, the subtitle data is cut at the correct points through context comparison. Specifically, a scene-segmentation method can be introduced to split the whole dialogue data into different dialogue segments, as follows. First, a judgment model is trained on an existing dialogue dataset to decide whether a span of n sentences should be truncated after its last sentence. Then, the size of a sliding window is limited so as to bound the minimum and maximum dialogue turns of a segment (for example, the dataset requires 8 to 12 sentences per segment, i.e., no fewer than 4 and no more than 6 dialogue turns).
In use, sliding windows conforming to the size limits are evaluated sequentially from beginning to end. When the judgment fails (the last sentence cannot serve as a cut point), the next turn of sentences is added to the window and evaluation continues; when the judgment succeeds (the last sentence can serve as a cut point), all sentences in the window form one dialogue segment, and the process is repeated starting from the sentence after the cut. If the window grows beyond 12 sentences, it is truncated there and the process continues from the following sentence. After recognition is finished, dialogue segments with fewer than 8 sentences are screened out.
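The segmentation loop described above can be sketched in Python as follows; should_cut stands in for the trained judgment model and is a hypothetical callable, and for simplicity the window grows one sentence at a time rather than one dialogue turn at a time.

MIN_SENTENCES, MAX_SENTENCES = 8, 12   # bounds from the example above

def split_dialogue(sentences, should_cut):
    """sentences: subtitle lines in order; should_cut(window) -> True if the
    window should be cut after its last sentence (the judgment model)."""
    segments, window = [], []
    for sentence in sentences:
        window.append(sentence)
        if len(window) < MIN_SENTENCES:
            continue                        # window still too small to cut
        if should_cut(window) or len(window) >= MAX_SENTENCES:
            segments.append(window)         # cut here (forced at the upper bound)
            window = []                     # restart from the next sentence
    if window:
        segments.append(window)             # whatever remains at the end
    # screen out dialogue segments with fewer than the minimum number of sentences
    return [seg for seg in segments if len(seg) >= MIN_SENTENCES]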
After the dialogue data is divided into a plurality of dialogue segments, the corresponding audio can be cut out according to the timestamps, and a speaker-diarization tool is used to identify the speakers and assign them to the individual sentences (data with too many speakers is filtered out). Conversational datasets require each sentence to be labeled with its speaker, and subtitle data usually does not carry such information, so the speakers must be distinguished. First, the selected dialogue segments and their starting timestamps are used to cut the corresponding film or television-drama audio files (keeping roughly 2 seconds before and after each segment to tolerate inaccurate timestamps). The cut audio is then transcribed with ASR, the recognized text is matched and aligned with the subtitle text (if they cannot be matched, i.e., the similarity is below a threshold, the segment is discarded), and the aligned audio files are kept together with their corresponding subtitle files. Finally, the speaker-recognition tool labels the speaker of each sentence in the retained audio; since common conversational datasets require only 2 interlocutors, only labeled subtitle files with exactly 2 speakers are selected.
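A minimal sketch of the audio-slicing and alignment step is shown below. The pydub library, the 2-second padding and the 0.7 similarity threshold are assumptions; transcribe stands in for whatever ASR tool is used, and the speaker-diarization step is not shown.

from difflib import SequenceMatcher
from pydub import AudioSegment   # assumed audio library

PAD_MS = 2000          # keep roughly 2 s of context before and after each segment
SIM_THRESHOLD = 0.7    # illustrative value; the application only requires "below a threshold"

def cut_segment(audio_path, start_s, end_s, out_path):
    """Cut one dialogue segment out of the full audio track, with padding."""
    audio = AudioSegment.from_file(audio_path)
    start = max(0, int(start_s * 1000) - PAD_MS)
    end = min(len(audio), int(end_s * 1000) + PAD_MS)
    audio[start:end].export(out_path, format="wav")
    return out_path

def keep_if_aligned(audio_clip_path, subtitle_text, transcribe):
    """transcribe(path) -> str is a placeholder for the ASR tool; keep the clip
    only if the recognized text is similar enough to the subtitle text."""
    asr_text = transcribe(audio_clip_path)
    similarity = SequenceMatcher(None, asr_text, subtitle_text).ratio()
    return similarity >= SIM_THRESHOLD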
After labeling is completed, the labeled subtitle dialogue segments can be classified: keywords are extracted from each dialogue segment to summarize its content, which makes classification and relevance calculation convenient. Specifically, tf-idf may be used to extract keywords from all dialogue segments, keeping the top 10 keywords of each segment (in order) and recording the keywords together with their corresponding weight values.
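As one possible realization of this step, the sketch below uses jieba's tf-idf keyword extractor; jieba is an assumed tool, since the application only specifies tf-idf and a top-10 cutoff.

import jieba.analyse   # assumed Chinese tokenizer and keyword extractor

def segment_keywords(dialogue_segment, top_k=10):
    """Return the top-k tf-idf keywords of one dialogue segment with their weights.
    dialogue_segment: list of sentence strings."""
    text = " ".join(dialogue_segment)
    # extract_tags returns [(keyword, weight), ...] ranked by tf-idf weight
    return jieba.analyse.extract_tags(text, topK=top_k, withWeight=True)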
Afterwards, the keywords can be grouped into topic classes and matching coefficients (used for sorting) recorded; that is, the matching degree between a given dialogue segment and a given topic is calculated. This can be used to classify new dialogue-segment data into the existing topic classification, and when a topic requirement outside the existing topic classification arrives, matching-degree sorting can be performed quickly to return results that meet the requirement.
The matching degree is calculated as follows. Suppose the topic word used for the calculation is A, and the 5 keywords of the dialogue segment currently being scored, together with their weight values, are B1-w1, B2-w2, B3-w3, B4-w4 and B5-w5. Word vectors of A and of each Bn are computed with a trained general-purpose Chinese BERT model, giving AV and BVn; the similarity Sn between AV and each BVn is computed with cosine similarity, and the Sn are weighted-averaged by wn, i.e. S = (Σn wn·Sn) / (Σn wn), to obtain the final similarity S.
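The calculation can be sketched in Python as follows; the transformers library and the bert-base-chinese checkpoint are assumptions standing in for the "trained Chinese universal BERT model", and mean-pooling the last hidden state is just one reasonable way to obtain a word vector.

import numpy as np
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese")

def embed(word):
    """Mean-pooled last-hidden-state vector for a single word."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def matching_degree(topic_word, keywords_with_weights):
    """keywords_with_weights: [(keyword, weight), ...] of one dialogue segment."""
    av = embed(topic_word)
    sims, weights = [], []
    for kw, w in keywords_with_weights:
        bv = embed(kw)
        sims.append(float(np.dot(av, bv) / (np.linalg.norm(av) * np.linalg.norm(bv))))
        weights.append(w)
    # weighted average of the cosine similarities: S = sum(w_n * S_n) / sum(w_n)
    return float(np.average(sims, weights=weights))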
All labeled dialogue segments are classified with this calculation method and, within each existing topic class, sorted according to matching degree so that they can be conveniently retrieved later. When a new topic demand is received, for example a topic outside the existing topic system, the same matching-degree calculation can be performed and the dialogue segments with the highest matching degrees selected as required.
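Reusing the matching_degree sketch above, ranking segments for a new topic demand could look like the following; the data layout is an illustrative assumption.

def rank_segments_for_topic(topic_word, segments_with_keywords, top_n=None):
    """segments_with_keywords: [(segment_id, [(keyword, weight), ...]), ...];
    returns segment ids sorted by matching degree, highest first."""
    scored = [(seg_id, matching_degree(topic_word, kws))
              for seg_id, kws in segments_with_keywords]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n] if top_n else scored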
According to the embodiments of the application, after the relevant subtitles and audio data are crawled, the conversational dataset is generated automatically, together with keywords (convenient for later personalized classification and use) and a topic class for each dialogue segment. This greatly improves the efficiency and diversity of dataset generation: compared with reprocessing the subtitle data manually, the speed is improved by more than 80%, and the cost is only one tenth of that of the manual method.
As shown in fig. 2, a schematic structural diagram of a device for generating a conversational dataset according to an embodiment of the present application includes:
the obtaining module 210 is configured to obtain dialogue data from subtitle data corresponding to the multimedia file, where the dialogue data includes dialogue text and a corresponding timestamp.
The segmentation module 220 is configured to segment the dialogue data into a plurality of dialogue segments, and to segment the audio file corresponding to the multimedia file based on each dialogue segment and its starting timestamp, so as to obtain a plurality of audio segments corresponding to the dialogue segments.
Specifically, the segmentation module 220 is configured to segment the dialogue data into a plurality of dialogue segments based on the minimum and maximum dialogue turns limited by the sliding window.
The labeling module 230 is configured to identify a speaker from the plurality of audio segments, label a speaker for each sentence in the conversation segment corresponding to each audio segment according to the identification result, and use the labeled conversation segment as a conversation data set.
In this embodiment, the apparatus further includes:
and the extraction module is used for respectively extracting the keywords from each dialogue segment and recording the keywords of each dialogue segment and the corresponding weight values thereof.
And the calculation module is used for calculating the matching degree of each dialogue segment and the existing topic classification according to the keywords of each dialogue segment and the corresponding weight values of the keywords.
The computing module is specifically configured to compute word vectors of the topic word of an existing topic classification and of each keyword of the same dialogue segment, compute the similarity between the word vector of the topic word and the word vector of each keyword using cosine similarity, and weighted-average the similarities by the weight of each keyword to obtain the matching degree between the dialogue segment and the existing topic classification.
And the classifying module is used for classifying each dialogue segment into the existing topic classification based on the matching degree.
The classification module is specifically configured to, under an existing topic classification, order each dialog segment according to a matching degree between each dialog segment and the existing topic classification.
In this embodiment, the apparatus further includes:
the recognition module is used for respectively recognizing each audio segment, matching and aligning the recognized characters with the text content of the dialogue segment corresponding to the audio segment, and discarding the audio segment if the audio segment which cannot be matched exists.
By obtaining dialogue data from subtitle data and then segmenting and labeling it to generate a conversational dataset, the method and the device of the embodiments can reduce the generation cost of the dataset and improve its generation speed, generation efficiency and diversity.
The embodiments of the application further provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements each process of the above embodiments of the conversational-dataset generation method and achieves the same technical effects, which are not repeated here. The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims (10)

1. A method of generating a conversational dataset, comprising the steps of:
acquiring dialogue data from subtitle data corresponding to a multimedia file, wherein the dialogue data comprise dialogue text and corresponding timestamps;
dividing the dialogue data into a plurality of dialogue segments, and dividing the audio file corresponding to the multimedia file based on each dialogue segment and the corresponding starting time stamp thereof to obtain a plurality of audio segments corresponding to the plurality of dialogue segments;
and carrying out speaker recognition on the plurality of audio segments, marking the speaker of each sentence in the dialogue segment corresponding to each audio segment according to the recognition result, and taking the marked dialogue segment as a dialogue data set.
2. The method according to claim 1, wherein after taking the labeled dialogue segments as the conversational dataset, the method further comprises:
extracting keywords from each dialogue segment respectively, and recording the keywords of each dialogue segment and the corresponding weight values thereof;
according to the keywords of each dialogue segment and the corresponding weight values thereof, the matching degree of each dialogue segment and the existing topic classification is calculated respectively;
based on the degree of matching, each dialog segment is classified into an existing topic classification.
3. The method according to claim 2, wherein the calculating the matching degree between each dialogue segment and the existing topic classification according to the keyword of each dialogue segment and the corresponding weight value thereof comprises:
respectively calculating word vectors of the topic word of an existing topic classification and of each keyword of the same dialogue segment, calculating the similarity between the word vector of the topic word and the word vector of each keyword using cosine similarity, and weighted-averaging the similarities by the weight of each keyword to obtain the matching degree between the dialogue segment and the existing topic classification;
based on the matching degree, classifying each dialogue segment into an existing topic classification specifically comprises the following steps:
and under the existing topic classification, sequencing each dialog segment according to the matching degree of each dialog segment and the existing topic classification.
4. The method according to claim 1, wherein segmenting the dialogue data into a plurality of dialogue segments specifically comprises:
segmenting the dialogue data into a plurality of dialogue segments based on the minimum and maximum dialogue turns limited by a sliding window.
5. The method according to claim 1, wherein, after segmenting the audio file corresponding to the multimedia file based on each dialogue segment and its starting timestamp to obtain a plurality of audio segments corresponding to the plurality of dialogue segments, the method further comprises:
and respectively identifying each audio segment, matching and aligning the identified characters with the text content of the dialogue segment corresponding to the audio segment, and discarding the audio segment if the audio segment which cannot be matched exists.
6. A device for generating a conversational dataset, comprising:
the acquisition module is used for acquiring dialogue data from subtitle data corresponding to the multimedia file, wherein the dialogue data comprise dialogue text and corresponding timestamps;
the segmentation module is used for segmenting the dialogue data into a plurality of dialogue segments, and segmenting the audio file corresponding to the multimedia file based on each dialogue segment and the corresponding starting time stamp thereof to obtain a plurality of audio segments corresponding to the dialogue segments;
and the labeling module is used for identifying the speakers of the plurality of audio segments, labeling the speaker of each sentence in the conversation segment corresponding to each audio segment according to the identification result, and taking the labeled conversation segment as a conversation type data set.
7. The apparatus as recited in claim 6, further comprising:
the extraction module is used for extracting keywords from each dialogue segment respectively and recording the keywords of each dialogue segment and the corresponding weight values thereof;
the computing module is used for respectively computing the matching degree of each dialogue segment and the existing topic classification according to the keywords of each dialogue segment and the corresponding weight values thereof;
and the classifying module is used for classifying each dialogue segment into the existing topic classification based on the matching degree.
8. The apparatus according to claim 7, wherein:
the computing module is specifically configured to compute word vectors of the topic word of an existing topic classification and of each keyword of the same dialogue segment, compute the similarity between the word vector of the topic word and the word vector of each keyword using cosine similarity, and weighted-average the similarities by the weight of each keyword to obtain the matching degree between the dialogue segment and the existing topic classification;
the classifying module is specifically configured to, under an existing topic classification, order each dialog segment according to a matching degree between each dialog segment and the existing topic classification.
9. The apparatus according to claim 6, wherein:
the segmentation module is specifically configured to segment the dialogue data into a plurality of dialogue segments based on the minimum and maximum dialogue turns limited by a sliding window.
10. The apparatus as recited in claim 6, further comprising:
the recognition module is used for respectively recognizing each audio segment, matching and aligning the recognized characters with the text content of the dialogue segment corresponding to the audio segment, and discarding the audio segment if the audio segment which cannot be matched exists.
CN202310505189.9A 2023-05-08 2023-05-08 Conversational data set generation method and device Active CN116229943B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310505189.9A CN116229943B (en) 2023-05-08 2023-05-08 Conversational data set generation method and device

Publications (2)

Publication Number Publication Date
CN116229943A 2023-06-06
CN116229943B 2023-08-15

Family

ID=86584638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310505189.9A Active CN116229943B (en) 2023-05-08 2023-05-08 Conversational data set generation method and device

Country Status (1)

Country Link
CN (1) CN116229943B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108806668A (en) * 2018-06-08 2018-11-13 国家计算机网络与信息安全管理中心 A kind of audio and video various dimensions mark and model optimization method
US20190318725A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers
KR102041621B1 (en) * 2019-02-25 2019-11-06 (주)미디어코퍼스 System for providing artificial intelligence based dialogue type corpus analyze service, and building method therefor
CN110717031A (en) * 2019-10-15 2020-01-21 南京摄星智能科技有限公司 Intelligent conference summary generation method and system
CN112818680A (en) * 2020-07-10 2021-05-18 腾讯科技(深圳)有限公司 Corpus processing method and device, electronic equipment and computer-readable storage medium
CN114996506A (en) * 2022-05-24 2022-09-02 腾讯科技(深圳)有限公司 Corpus generation method and device, electronic equipment and computer-readable storage medium
CN115269884A (en) * 2021-04-29 2022-11-01 华为云计算技术有限公司 Method, device and related equipment for generating video corpus

Also Published As

Publication number Publication date
CN116229943B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
CN112818906B (en) Intelligent cataloging method of all-media news based on multi-mode information fusion understanding
US10304458B1 (en) Systems and methods for transcribing videos using speaker identification
CN107305541B (en) Method and device for segmenting speech recognition text
US6925455B2 (en) Creating audio-centric, image-centric, and integrated audio-visual summaries
US8775174B2 (en) Method for indexing multimedia information
CN106878632B (en) Video data processing method and device
CN107491435B (en) Method and device for automatically identifying user emotion based on computer
CN110705254B (en) Text sentence-breaking method and device, electronic equipment and storage medium
CN113766314B (en) Video segmentation method, device, equipment, system and storage medium
CN112668559A (en) Multi-mode information fusion short video emotion judgment device and method
CN111797820B (en) Video data processing method and device, electronic equipment and storage medium
US7349477B2 (en) Audio-assisted video segmentation and summarization
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
CN114598933B (en) Video content processing method, system, terminal and storage medium
CN114996506A (en) Corpus generation method and device, electronic equipment and computer-readable storage medium
CN114051154A (en) News video strip splitting method and system
CN116229943B (en) Conversational data set generation method and device
CN116017088A (en) Video subtitle processing method, device, electronic equipment and storage medium
WO2011039773A2 (en) Tv news analysis system for multilingual broadcast channels
CN114064968A (en) News subtitle abstract generating method and system
Bechet et al. Detecting person presence in tv shows with linguistic and structural features
JP2006251553A (en) Method, device, and program for topic division processing
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN114510585B (en) Information characterization model construction method and information characterization method
JP2006135387A (en) Moving image subject dividing method

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address
Address after: 411, 4th floor, building 4, No.44, Middle North Third Ring Road, Haidian District, Beijing 100088
Patentee after: Beijing Qingshu Intelligent Technology Co.,Ltd.
Address before: Building G, 4th Floor, Cultural and Educational Industrial Park, No. 44 North Third Ring Middle Road, Haidian District, Beijing, 100000
Patentee before: BEIJING AISHU WISDOM TECHNOLOGY CO.,LTD.