WO2022228235A1 - Method, apparatus and related equipment for generating video corpus - Google Patents

Method, apparatus and related equipment for generating video corpus

Info

Publication number
WO2022228235A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
target
corpus
processed
text
Prior art date
Application number
PCT/CN2022/087908
Other languages
English (en)
French (fr)
Inventor
李太松
李明磊
吴益灵
Original Assignee
华为云计算技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Priority to EP22794706.6A (EP4322029A1)
Publication of WO2022228235A1
Priority to US18/496,250 (US20240064383A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/488 Data services, e.g. news ticker
    • H04N 21/4884 Data services, e.g. news ticker for displaying subtitles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/43 Querying
    • G06F 16/432 Query formulation
    • G06F 16/433 Query formulation using audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/42 Data-driven translation
    • G06F 40/44 Statistical methods, e.g. probability models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition

Definitions

  • the present application relates to the technical field of video processing, and in particular, to a method, apparatus and related equipment for generating video corpus.
  • Video, a common media category, has many applications in artificial intelligence scenarios such as sentiment analysis and speaker detection.
  • Machine learning algorithms can be used to conduct supervised learning on a large number of video corpora with text annotations, so as to meet the requirements of a variety of application scenarios.
  • Currently, the generation of video corpus is usually performed by an annotator who watches the entire video and manually selects the start and end points of each video segment that needs to be labeled during the viewing process, so that a device divides the video based on the manually selected start and end points; then, text annotation is performed on the content of each segmented video segment to obtain at least one video corpus.
  • This method of generating video corpus through manual annotation not only incurs high labor costs, but the subjective cognitive errors of the annotator also usually lead to low segmentation accuracy of the video clips, so the quality of the generated video corpus is low.
  • the present application provides a method for generating video corpus, which improves the efficiency of generating video corpus and improves the quality of the generated video corpus.
  • the present application also provides a video corpus generation apparatus, computer equipment, computer-readable storage medium, and computer program product.
  • the present application provides a method for generating video corpus, and the method is applied to a video corpus generating apparatus.
  • The video corpus generation device obtains a video to be processed, where the video to be processed corresponds to voice content, that is, the audio in the video to be processed includes the vocabulary content of human speech, and some video images of the video to be processed include subtitles corresponding to the voice content.
  • The video corpus generation device obtains a target video segment from the video to be processed according to the voice content, and uses the subtitles included in the video images of the target video segment as the annotation text of the target video segment, so as to generate a video corpus that includes video images, audio and annotation text.
  • In this way, the video corpus generation device can automatically segment the to-be-processed video according to its corresponding voice content and use the subtitles in the video images to automatically annotate the video with text, which not only avoids the impact on segmentation accuracy caused by the annotator's subjective cognitive errors during manual annotation, but also makes the efficiency of generating video corpus usually high.
  • Moreover, the to-be-processed video is segmented according to its corresponding voice content, which avoids the problem of incomplete playback of voice content in the segmented target video clips and can therefore improve the quality of the generated video corpus.
  • In addition, the subtitles in the target video clip are used as the annotation text of the target video clip; since subtitles are usually accurate text added manually in advance by the video editor according to the speech in the video, the accuracy of the annotation text of the video corpus is higher than when text obtained by speech recognition of the voice content is used as the annotation text of the target video segment.
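  • Purely for illustration, the minimal Python sketch below shows how a corpus entry of this kind could be assembled once the speech spans and subtitle texts have been extracted (for example, by the ASR and OCR steps described later); the names VideoCorpusItem and assemble_corpus are illustrative and not part of this disclosure:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VideoCorpusItem:
    video_path: str
    start: float   # target speech start point, in seconds
    end: float     # target speech end point, in seconds
    text: str      # annotation text taken from the on-screen subtitle

def assemble_corpus(video_path: str,
                    speech_spans: List[Tuple[float, float]],
                    subtitles: List[str]) -> List[VideoCorpusItem]:
    """Pair each speech span with the subtitle shown during it."""
    return [VideoCorpusItem(video_path, s, e, txt)
            for (s, e), txt in zip(speech_spans, subtitles)]

# Example using the timestamps from the Figure 3 example in the description:
items = assemble_corpus(
    "movie.mp4",
    [(25.0, 27.0), (28.0, 31.0)],
    ["Why didn't you go to the movie you mentioned last time?",
     "Oh, when we bought the tickets, we found that the tickets for that movie were sold out"])
```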
  • In a possible implementation, when the video corpus generation device obtains the target video segment from the video to be processed, it may specifically recognize the target speech start and end points of the speech content, for example by means of ASR technology.
  • the target speech start and end points include the target speech start point and the target speech end point corresponding to the target speech start point.
  • the target speech start point may be, for example, the start point of a speech in the audio of the video to be processed, and the target speech end point is the end point of the speech in the audio.
  • In this way, the video corpus generation device can obtain the target video segment from the video to be processed according to the target speech start and end points.
  • For example, the video corpus generation device can segment the to-be-processed video according to the target speech start and end points to obtain the target video segment. Segmenting the video to be processed according to the target speech start and end points in this way can avoid the problem of incomplete playback of voice content in the segmented video segments, thereby improving the quality of the generated video corpus.
  • In a possible implementation, the video corpus generation device may specifically identify the target subtitle start and end points of the subtitles corresponding to the speech content, for example by identifying the target subtitle start and end points through OCR technology.
  • the target subtitle start and end points include the target subtitle start point and the target subtitle end point.
  • Then, the video corpus generation device can obtain a candidate video segment from the video to be processed according to the target subtitle start and end points, and, when the target speech start and end points are inconsistent with the target subtitle start and end points, adjust the candidate video segment according to the target speech start and end points to obtain the target video segment.
  • In this way, first dividing the video to be processed according to the target subtitle start and end points can prevent the target video segment from being overly fragmented; for example, it can avoid consecutive frames of video images with the same subtitle being divided into multiple video segments.
  • In a possible implementation, when recognizing the target subtitle start and end points of the subtitles corresponding to the voice content, the video corpus generation apparatus may specifically determine the target subtitle start and end points according to the subtitle display area of the subtitles in the video to be processed.
  • In specific implementation, the video corpus generation device can sample multiple frames of video images in the video to be processed to obtain sampled video images, and then determine the subtitle display area in the to-be-processed video according to the display area of the subtitles on the sampled video images.
  • the subtitle display area of the video to be processed can be determined through an automated sampling and identification process, so that the subtitle start and end points can be subsequently determined according to the subtitles in the subtitle display area.
  • the audio and annotated text in the video corpus can be used to complete the training of the speech recognition model.
  • the text information corresponding to the speech can be determined through the speech recognition model obtained by training.
  • the audio and annotated text in the video corpus can be used to complete the training of the speech generation model.
  • the corresponding voice can be obtained based on the text output by using the voice generation model.
  • Since the quality of the generated video corpus is high, the accuracy of the output results of a speech recognition model or speech generation model generated based on the high-quality video corpus is usually also high.
  • In a possible implementation, the annotation text of the generated video corpus may include texts in multiple languages. Taking annotation text that includes text in a first language (such as Chinese) and text in a second language (such as English) as an example, the training of a machine translation model can be completed by using the text in the first language and the text in the second language, so that the machine translation model can subsequently be used to translate to-be-processed text in the first language (or the second language) input by the user into corresponding translated text in the second language (or the first language).
  • Since the quality of the generated video corpus is high, the accuracy of the translation results output by the machine translation model generated based on the high-quality video corpus is usually also high.
  • In a possible implementation, the face information in the video images of the video corpus can be obtained, and a digital virtual human can be generated according to the face information, the audio included in the video corpus, and the annotation text of the video corpus.
  • In this way, when the digital virtual human has a dialogue with the user, if the dialogue content has the same semantics as the annotation text, the facial expressions and dialogue audio of the digital virtual human during the dialogue with the user can be fitted according to the face information in the video images of the video corpus, so as to achieve more intelligent human-computer interaction.
  • the video corpus generating apparatus may further present a task configuration interface to the user, where prompt information for prompting the user to specify a training task may be presented in the task configuration interface.
  • the apparatus for generating video corpus can acquire the training task of the user for the video corpus in the task configuration interface, so as to train the model belonging to the training task based on the generated video corpus.
  • the present application provides a video corpus generating apparatus, the video corpus generating apparatus including each module for implementing the method for generating a video corpus in the first aspect.
  • The present application further provides a computer device, where the computer device includes a processor and a memory; the memory is used to store instructions, and when the computer device runs, the processor executes the instructions stored in the memory, so that the computer device executes the method for generating a video corpus in the first aspect or any possible implementation manner of the first aspect.
  • the memory may be integrated in the processor, or may be independent of the processor.
  • In a possible implementation, the computer device may also include a bus, through which the processor is connected to the memory.
  • the memory may include readable memory and random access memory.
  • The present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium; when the instructions are run on a computer device, the computer device is caused to execute the method for generating a video corpus in the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a computer program product comprising instructions, which, when executed on a computer device, cause the computer device to execute the method for generating a video corpus described in the first aspect above.
  • On the basis of the implementation manners provided by the above aspects, the present application may further combine them to provide more implementation manners.
  • FIG. 1 is a schematic diagram of a system architecture for generating video corpus
  • FIG. 2 is a schematic flowchart of a method for generating a video corpus provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of voice content included in an exemplary audio provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a video clip obtained by segmenting a video to be processed according to an embodiment of the present application
  • FIG. 5 is a schematic diagram of a task configuration interface provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a device for generating video corpus provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computer device 700 according to an embodiment of the present application.
  • the system 100 includes a video capture device 101 , a video capture device 102 , a video corpus generation device 103 and a client 104 , and different devices can be connected through a communication network.
  • the video capture device 101 can capture existing videos from the network, such as movies, etc.; the video capture device 102 can capture videos on-site, such as live video through cameras, microphones, and other devices.
  • Both the video capture device 101 and the video capture device 102 can send the videos collected through different channels to the video corpus generation device 103.
  • If the annotator manually annotates the videos transmitted to the video corpus generation device 103 through the client 104, not only will the efficiency of generating the video corpus be low, but the subjective cognitive errors of the annotator will usually also make the video segmentation inaccurate, thus affecting the quality of the generated video corpus.
  • the video corpus generation device 103 can automatically segment the video and mark the text.
  • the video corpus generating apparatus 103 may include a video acquisition module 1031 , a segmentation module 1032 , a labeling module 1033 and an identification module 1034 .
  • In specific implementation, the video acquisition module 1031 may receive the videos transmitted by the video capture device 101 and the video capture device 102, and provide the videos to the segmentation module 1032.
  • the segmentation module 1032 obtains the target video segment from the video according to the voice content corresponding to the video.
  • the labeling module 1033 uses the subtitles included in the video image in the target video segment as the labeling text of the target video segment, so as to obtain a video corpus including the labeling text, audio and images.
  • the subtitles in the video image can be obtained by recognizing the video image by the recognition module 1034 .
  • In this way, the video corpus generation device 103 can automatically segment the video according to its corresponding voice content and use the subtitles in the video images to automatically annotate the video with text, which not only avoids the impact on segmentation accuracy caused by subjective cognitive errors in the manual annotation process, but also makes the efficiency of generating video corpus usually high.
  • Moreover, the video to be processed is segmented according to the voice content corresponding to the video, which avoids incomplete playback of the voice content in the segmented video clips, so that the quality of the generated video corpus can be improved.
  • In addition, the subtitles in the video clips are used as the annotation text of the video clips; since subtitles are usually accurate text added manually in advance by the video editor according to the speech in the video, the accuracy of the annotation text of the video corpus is higher than when text obtained by speech recognition of the voice content is used as the annotation text of the video clips.
  • the video corpus generating apparatus 103 may be implemented by software, for example, may be a computer program or the like running on any device (such as a server, etc.) in the system 100 .
  • Alternatively, the video corpus generation apparatus 103 may also be implemented by hardware; for example, the video corpus generation apparatus 103 may be a server or a terminal device in the system 100, or it may be a device implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), etc.
  • The above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • The video corpus generation device 103 shown in FIG. 1 can be deployed in the cloud, for example in a public cloud, an edge cloud, or a distributed cloud, so as to generate the desired video corpus for one or more users.
  • Alternatively, the video corpus generation apparatus 103 may also be deployed locally, for example, videos may be collected locally and the video corpus generated locally.
  • the specific deployment mode of the video corpus generating apparatus 103 is not limited.
  • the video corpus generating apparatus 103 shown in FIG. 1 is only used as an exemplary description, and is not used to limit the specific implementation of the apparatus.
  • For example, the segmentation module 1032 and the recognition module 1034 may be integrated into one functional module (for example, the segmentation module 1032 may have both segmentation and recognition functions), or the video corpus generation device 103 may also include other functional modules to support more functions; alternatively, the video corpus generation device 103 may also acquire video data in other ways, such as a user providing videos to the video corpus generation device 103.
  • FIG. 2 is a schematic flowchart of a method for generating video corpus provided by an embodiment of the present application.
  • the method for generating video corpus shown in FIG. 2 may be applied to the video corpus generating apparatus 103 shown in FIG. 1 , or may be applied to other applicable video corpus generating apparatuses.
  • the application to the video corpus generating apparatus 103 shown in FIG. 1 is taken as an example for illustrative description.
  • the method for generating a video corpus shown in FIG. 2 may specifically include:
  • the video acquisition module 1031 acquires a video to be processed, where the video to be processed corresponds to voice content, and some video images of the video to be processed include subtitles corresponding to the voice content.
  • the acquired video may be a video including continuous multiple frames of video images and audio, and the multiple continuous frames of video images include subtitles.
  • the acquired video is referred to as the video to be processed below.
  • The subtitles in the video to be processed may be compiled and added by the video editor, in the process of generating the video, according to the voice content included in the audio of the video to be processed, and after the video is rendered, the added subtitles can be integrated into the multiple frames of video images of the video.
  • the voice content in the audio may specifically be the vocabulary content in the human voice uttered by the character in the video to be processed.
  • For example, the voice content may be the content of a conversation between character A and character B in the video to be processed, or it may be introductory content expressed through "narration" in the video to be processed.
  • the semantics expressed by the speech content in the audio are consistent with the semantics of the subtitles in the video.
  • Of course, the video clips before the characters' conversation starts and the video clips after the conversation ends may include audio that does not contain voice content (in this case, the video images of such video segments may not include dialogue subtitles).
  • the video acquisition module 1031 may receive the to-be-processed video sent by other devices.
  • the video acquisition module 1031 can establish a communication connection with the video capture device 101 and the video capture device 102 in FIG. 1 , and receive the video to be processed sent by the video capture device 101 and the video capture device 102 based on the communication connection.
  • the video acquisition module 1031 can read the video to be processed locally, or the like.
  • the specific implementation manner of acquiring the video to be processed is not limited.
  • Then, the video acquisition module 1031 can provide the to-be-processed video to the segmentation module 1032 for subsequent processing.
  • the to-be-processed video acquired by the video acquisition module 1031 may, for example, be a video with a playback duration greater than a preset duration threshold, so that multiple video corpora can be generated based on the to-be-processed video subsequently.
  • the segmentation module 1032 obtains the target video segment from the video to be processed according to the voice content.
  • Considering that the audio in the video to be processed may include a large amount of speech content (for example, multiple sentences of speech) and that the audio also contains segments without speech content, the segmentation module 1032 may split the video to be processed; specifically, it may split the video to be processed according to the voice content to obtain a plurality of video clips, so that the audio of each obtained video clip contains part of the voice content.
  • Moreover, the subtitles corresponding to the voice content of each video clip are integrated into the video images included in that video clip.
  • Of course, the segmentation module 1032 may also obtain a single video segment by segmentation according to the voice content of the audio.
  • the following takes an example of generating a video corpus from a segmented video segment for illustrative description, and the video segment is hereinafter referred to as a target video segment.
  • the segmenting module 1032 may call the identifying module 1034 to obtain the starting and ending points of speech corresponding to the audio included in the video to be processed.
  • In specific implementation, the recognition module 1034 can recognize each sentence of speech in the audio, for example by using an automatic speech recognition (ASR) algorithm; the end of each sentence is the end point of that sentence of speech, the beginning of the sentence is its start point, and there can be a gap between two adjacent sentences.
  • the starting point and the ending point of a sentence of speech can be identified by the timestamp in the audio.
  • Further, the recognition module 1034 may combine voice activity detection (VAD) technology to improve the accuracy of determining the speech start and end points. For example, suppose the voice content in the audio includes, as shown in Figure 3, "Why didn't you go to the movie you mentioned last time?" and "Oh, when we bought the tickets, we found that the tickets for that movie had already been sold out".
  • In this example, the recognition module 1034 can recognize, based on the ASR algorithm, that the first sentence of speech, "Why didn't you go to the movie you mentioned last time?", starts at "00:00:25" and ends at "00:00:27", and that the second sentence of speech, "Oh, when we bought the tickets, we found that the tickets for that movie were sold out", starts at "00:00:28" and ends at "00:00:31". Therefore, the recognition module 1034 can use "00:00:25" and "00:00:27" as the start point and end point of the first sentence of speech, and "00:00:28" and "00:00:31" as the start point and end point of the second sentence of speech.
  • the identification module 1034 may provide the segmentation module 1032 with multiple voice start and end points of the voice content included in the to-be-processed video identified through the ASR algorithm.
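  • As a rough illustration of how coarse speech start and end points could be derived, the sketch below uses the open-source webrtcvad package for voice activity detection on 16 kHz, 16-bit mono PCM audio; the 30 ms frame length and 0.3 s gap threshold are illustrative assumptions, and a practical system would combine this with ASR sentence boundaries as described above:

```python
import wave
import webrtcvad  # third-party VAD package; assumed to be installed

def speech_spans(wav_path, frame_ms=30, gap_s=0.3, mode=2):
    """Return coarse (start, end) speech spans in seconds from a 16-bit mono WAV."""
    vad = webrtcvad.Vad(mode)                       # 0 (least) .. 3 (most aggressive)
    wf = wave.open(wav_path, "rb")
    rate = wf.getframerate()                        # webrtcvad supports 8/16/32/48 kHz
    samples_per_frame = int(rate * frame_ms / 1000)
    frame_bytes = samples_per_frame * 2             # 16-bit mono samples
    spans, start, last_voiced, t = [], None, None, 0.0
    while True:
        frame = wf.readframes(samples_per_frame)
        if len(frame) < frame_bytes:
            break
        if vad.is_speech(frame, rate):
            start = t if start is None else start
            last_voiced = t + frame_ms / 1000
        elif start is not None and t - last_voiced > gap_s:
            spans.append((start, last_voiced))      # close the current speech span
            start = None
        t += frame_ms / 1000
    if start is not None:
        spans.append((start, last_voiced))
    wf.close()
    return spans
```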
  • In this way, the segmentation module 1032 can segment the video to be processed into multiple video segments according to the multiple speech start and end points. That is, when the target video clip (any one of the multiple video clips) is obtained from the video to be processed, the recognition module 1034 can first identify the target speech start and end points, which include the target speech start point and the target speech end point corresponding to the target speech start point, so that the segmentation module 1032 obtains the target video segment from the video to be processed according to the target speech start and end points.
  • the starting point and the ending point of the target video clip are the target voice starting point and the target voice ending point corresponding to the voice content in the video clip.
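  • For illustration, once the start and end points of a target video segment are known, the segment can be cut out of the video to be processed; the sketch below simply shells out to the ffmpeg command line (assumed to be installed) and re-encodes the clip so that the cut is frame-accurate:

```python
import subprocess

def cut_segment(src, start, end, dst):
    """Cut [start, end] ('HH:MM:SS.xx' strings or seconds) from src into dst."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ss", str(start), "-to", str(end), dst],
        check=True)

# e.g. the segment from "00:00:28" to "00:00:31" in the example above:
cut_segment("movie.mp4", "00:00:28", "00:00:31", "clip_0001.mp4")
```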
  • However, if the to-be-processed video is segmented only according to the speech start and end points in the audio, multiple frames of video images with the same subtitle may be segmented into two video segments.
  • For example, suppose the subtitle is "Oh, when we bought the tickets, we found that the tickets for that movie were sold out".
  • The corresponding voice content may be recognized as two sentences of speech, namely the speech between time "00:00:28" and time "00:00:29" and the speech between time "00:00:30" and time "00:00:31" shown in Figure 4.
  • As a result, when the video to be processed is segmented based on the speech start and end points, two video clips are obtained, and both of them have the same subtitle "Oh, when we bought the tickets, we found that the tickets for that movie were sold out".
  • Based on this, in this embodiment, when the segmentation module 1032 acquires the target video segment from the video to be processed, it may combine the subtitle start and end points with the speech start and end points of the voice content to segment the video to be processed.
  • the identification module 1034 can not only identify the target voice starting and ending points, but also identify the target subtitle starting and ending points of the subtitles corresponding to the voice content.
  • The target subtitle start and end points include the target subtitle start point and the target subtitle end point corresponding to the target subtitle start point.
  • The target subtitle start point may specifically be the time point at which the subtitle appears in the video to be processed, and the target subtitle end point may specifically be the time point at which the subtitle stops being displayed in the to-be-processed video.
  • the segmentation module 1032 may firstly segment the video to be processed according to the start and end points of the target subtitle to obtain candidate video segments.
  • Then, the segmentation module 1032 can use the target speech start and end points to perform a consistency check on the target subtitle start and end points. When the target speech start and end points are consistent with the target subtitle start and end points, the segmentation module 1032 can use the candidate video segment as the final segmented target video segment; when the target speech start and end points are inconsistent with the target subtitle start and end points, the segmentation module 1032 can adjust the candidate video segment according to the target speech start and end points to obtain the final target video segment.
  • In this embodiment, the situations in which the target speech start and end points are inconsistent with the target subtitle start and end points may include the following:
  • Case 1: one or more groups of target speech start and end points fall within one group of target subtitle start and end points.
  • For example, the target speech start and end points include "00:00:28", "00:00:29", "00:00:30" and "00:00:31", while the target subtitle start and end points include "00:00:28" and "00:00:31".
  • In this case, the segmentation module 1032 can use the candidate video segment (that is, the video segment from "00:00:28" to "00:00:31") as the final segmented target video segment.
  • Case 2: the target speech start and end points are not aligned with the target subtitle start and end points, for example the speech leads the subtitle or the subtitle leads the speech.
  • In this case, the segmentation module 1032 can first segment the video to be processed according to the target subtitle start and end points to obtain a candidate video segment, and then adjust the segmented candidate video segment according to the target speech start and end points to obtain the required target video segment. In this way, the problem of incomplete voice content corresponding to the subtitles in the segmented target video segment can be avoided.
  • Otherwise, the voice content in the audio of the candidate video segment obtained by segmentation according to the target subtitle start and end points may correspond to only part of the subtitles in the candidate video segment, that is, the audio of the candidate video segment may be missing part of the voice content; if such a candidate video segment were used as the target video segment, the voice content in the finally generated video corpus would be incomplete, which affects the quality of the video corpus.
  • Specifically, when the speech leads the subtitle, the segmentation module 1032 may determine, according to the target speech start point (or the target speech end point), the lead duration of the target speech start point relative to the target subtitle start point (or of the target speech end point relative to the target subtitle end point). Since there are usually video images without subtitles between two adjacent subtitles in the video to be processed, the segmentation module 1032 may, according to the lead duration, select consecutive frames of video images without subtitles preceding the candidate video segment, where the playback duration of the selected video images equals the lead duration, and merge the video segment corresponding to these video images into the candidate video segment.
  • In other words, the start point of the candidate video segment is moved forward, so that the obtained new candidate video segment includes the video segment corresponding to the selected consecutive frames of video images together with the candidate video segment obtained by the previous segmentation, and the new candidate video segment is used as the final segmented target video segment.
  • At this time, the start point of the new candidate video segment is the target speech start point corresponding to the audio in the candidate video segment, and the end point of the new candidate video segment is the target subtitle end point corresponding to the subtitles in the candidate video segment.
  • For example, for the candidate video segment whose subtitle, as shown in Figure 3, is "Oh, when we bought the tickets, we found that the tickets for that movie were sold out", the corresponding start point is "00:00:28" and the end point is "00:00:31".
  • If the target speech start point is 0.5 seconds earlier than the target subtitle start point, the segmentation module 1032 can move the start point of the candidate video segment forward by 0.5 seconds, so that the start point of the new candidate video segment is "00:00:27.50" and the end point is "00:00:31", and the voice content in the audio of the new candidate video segment is consistent with the subtitles.
  • Correspondingly, when the target speech start point is later than the target subtitle start point, the segmentation module 1032 can re-determine the start point of the candidate video segment obtained by segmentation according to the target subtitle start and end points, for example by using the target speech start point as the start point of the candidate video segment, while the end point of the candidate video segment remains the target subtitle end point.
  • When the target speech end point is later than the target subtitle end point, the segmentation module 1032 may re-determine the end point of the candidate video segment according to the target speech end point.
  • Specifically, the lag duration of the speech can be determined first, and then multiple consecutive frames of video images can be selected starting from the end point of the candidate video segment, where the playback duration of the selected video images equals the lag duration, so as to obtain a new candidate video segment, which is used as the final segmented target video segment.
  • At this time, the start point of the new candidate video segment is the target speech start point corresponding to the audio in the candidate video segment, and the end point of the new candidate video segment is the target speech end point corresponding to the audio. In this way, the subtitles and the voice content in the video segment can be aligned.
  • For example, if the target speech start point is 0.5 seconds later than the target subtitle start point, the segmentation module 1032 may move the start point of the candidate video segment back by 0.5 seconds, so that the start point of the new candidate video segment is "00:00:28.50" and the end point is "00:00:31"; and if the speech end point is later than the subtitle end point, assuming that the speech end point is "00:00:31.30", the start point of the new candidate video segment is "00:00:28.50" and the end point is "00:00:31.30".
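  • The boundary adjustment described above can be summarized as a small helper function; the pure-Python sketch below assumes one candidate segment is being reconciled with one group of speech start and end points: the speech start point becomes the new start point, and the end point is only extended when the speech ends later than the subtitle:

```python
def adjust_candidate(subtitle_start, subtitle_end, speech_start, speech_end):
    """Return the (start, end) of the new candidate segment, in seconds."""
    new_start = speech_start                   # whether the speech leads or lags the subtitle
    new_end = max(subtitle_end, speech_end)    # extend only if the speech ends later
    return new_start, new_end

# Examples from the description (subtitle spans 28.0 s .. 31.0 s):
print(adjust_candidate(28.0, 31.0, 27.5, 31.0))   # -> (27.5, 31.0)
print(adjust_candidate(28.0, 31.0, 28.5, 31.3))   # -> (28.5, 31.3)
```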
  • In this embodiment, when the identification module 1034 identifies the target subtitle start and end points, it may determine them according to the differences between video images. In specific implementation, the identification module 1034 may first determine the subtitle display area on the video images in the video to be processed. Under normal circumstances, the area on the video images in which the subtitles of the video to be processed are displayed (hereinafter referred to as the subtitle display area) is usually fixed, for example located at the bottom of the video images. Then, the identification module 1034 can sequentially compare the differences between the subtitle display areas of adjacent video frames of the video to be processed, thereby determining the multiple subtitle start and end points of the video to be processed.
  • Specifically, for two adjacent frames of video images, the identification module 1034 may crop out the subtitle display areas in the two frames and compare the difference between them. If the difference between the two subtitle display areas is small, for example the degree of difference is smaller than a preset threshold, the identification module 1034 can determine that the subtitles displayed in the two frames of video images have not changed, that is, the subtitles displayed on the two frames are the same (of course, there may also be no subtitles, which can be further determined by image detection or other methods).
  • If the difference between the two subtitle display areas is large, for example the degree of difference is greater than the preset threshold, the identification module 1034 can determine that the subtitle displayed in the two frames has changed, and accordingly one of the two frames can be used as the corresponding subtitle start point or subtitle end point. Of course, in practical applications, the identification module 1034 may also determine the subtitle start and end points of the video to be processed in other ways, which is not limited in this embodiment.
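  • The comparison of subtitle display areas between adjacent frames can be sketched with OpenCV as below; the subtitle display area is given as a (y0, y1, x0, x1) crop, and the difference threshold of 10 (mean absolute gray-level difference) is an illustrative assumption only:

```python
import cv2

def subtitle_change_times(video_path, area, diff_threshold=10.0):
    """Yield timestamps (seconds) at which the subtitle display area changes."""
    y0, y1, x0, x1 = area
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    prev, idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        crop = cv2.cvtColor(frame[y0:y1, x0:x1], cv2.COLOR_BGR2GRAY)
        if prev is not None and cv2.absdiff(crop, prev).mean() > diff_threshold:
            yield idx / fps                   # candidate subtitle start or end point
        prev, idx = crop, idx + 1
    cap.release()
```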
  • In a possible implementation, the identification module 1034 may determine the subtitle display area by means of automatic detection. For example, the identification module 1034 can randomly sample n frames of video images (n is a positive integer smaller than the total number of frames of video images) from the multiple frames included in the video to be processed to obtain sampled video images, then recognize the subtitles in the n frames of sampled video images through optical character recognition (OCR) technology and count the approximate area occupied by the subtitles in each frame of the sampled video images, so as to obtain the subtitle display area on the sampled video images; for example, the largest area obtained by the statistics may be used as the subtitle display area.
  • When more than one candidate area is found, the identification module 1034 may use both areas as subtitle display areas, or the identification module 1034 may count the area in which subtitles are displayed most frequently in the n frames of sampled video images and use that area as the subtitle display area. In practical applications, the identification module 1034 may also determine the subtitle display area in other ways, which is not limited in this embodiment.
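  • One possible way to estimate the subtitle display area from randomly sampled frames is sketched below with OpenCV and pytesseract (the Tesseract engine is assumed to be installed); each recognized word votes for the horizontal band it sits in, and the band with the most votes is returned — the 40-pixel band size is an illustrative assumption:

```python
import random
from collections import Counter

import cv2
import pytesseract
from pytesseract import Output

def estimate_subtitle_band(video_path, n=20, band_px=40):
    """Estimate the vertical band (y0, y1) where subtitles are shown most often."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    votes = Counter()
    for frame_idx in random.sample(range(total), min(n, total)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
        ok, frame = cap.read()
        if not ok:
            continue
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        data = pytesseract.image_to_data(rgb, output_type=Output.DICT)
        for text, top, h in zip(data["text"], data["top"], data["height"]):
            if text.strip():                            # a recognized word
                votes[(top + h // 2) // band_px] += 1   # vote for its row band
    cap.release()
    if not votes:
        return None
    band = votes.most_common(1)[0][0]
    return band * band_px, (band + 1) * band_px
```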
  • the identifying module 1034 may also identify the starting and ending points of subtitles by sequentially comparing the differences between the entire video images of two adjacent frames in the video to be processed.
  • the specific implementation process of how the identifying module 1034 identifies the starting and ending points of subtitles is not limited.
  • the recognition accuracy of the recognition module 1034 for the start and end points of the target subtitle and the start and end points of the target speech may be respectively affected by the video image picture and audio content in the video to be processed.
  • For example, when the background color of the subtitle display area in the video images is similar to the subtitle color, it may be difficult for the recognition module 1034 to recognize the subtitles on the video images, so that the recognition module 1034 cannot recognize the subtitle start and end points corresponding to those subtitles.
  • Similarly, when the audio content includes both the characters' voices and noise, the presence of the noise may make it difficult for the recognition module 1034 to recognize the characters' voices, and thus make it difficult for the recognition module 1034 to recognize the speech start and end points corresponding to the characters' voices.
  • Therefore, in a further possible implementation, when the segmentation module 1032 determines that the coincidence rate between the speech start and end points and the subtitle start and end points reaches a preset coincidence rate threshold (such as 90%), the video to be processed can be segmented by combining the subtitle start and end points with the speech start and end points of the voice content.
  • When the segmentation module 1032 determines that the coincidence rate between the speech start and end points and the subtitle start and end points does not reach the threshold (for example, 90%), the video to be processed may be segmented only according to the speech start and end points of the voice content.
  • the above-mentioned coincidence rate threshold may also be set by the user.
  • the video corpus generating apparatus 103 can present a parameter setting interface to the user, so that the user can set the threshold of the coincidence rate between the starting and ending points of speech and the starting and ending points of subtitles in the parameter setting interface.
  • In specific implementation, the user can determine the specific value of the coincidence rate threshold according to the video type to which the video to be processed belongs. For example, for a music-type video to be processed, the music in the audio usually interferes with the speech content and thereby affects the accuracy of the recognition module 1034 in recognizing the speech start and end points; in this case, the user can reduce the value of the coincidence rate threshold, for example setting it to 85%.
  • For a video to be processed of the pure human voice type, the audio usually includes less interfering sound, which has little influence on the accuracy of the recognition module 1034 in recognizing the speech start and end points; therefore, the user can increase the value of the coincidence rate threshold, for example setting it to 95%.
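  • One simple way to compute such a coincidence rate is to count how many speech start and end points have a subtitle start or end point within a small tolerance, as in the pure-Python sketch below (the 0.2 s tolerance is an illustrative assumption):

```python
def coincidence_rate(speech_points, subtitle_points, tolerance=0.2):
    """Fraction of speech start/end points with a subtitle point within `tolerance` seconds."""
    if not speech_points:
        return 0.0
    matched = sum(
        1 for p in speech_points
        if any(abs(p - q) <= tolerance for q in subtitle_points))
    return matched / len(speech_points)

# Example: every speech boundary is matched within the tolerance -> 1.0
print(coincidence_rate([25.0, 27.0, 28.0, 31.0], [25.1, 27.0, 28.0, 31.0]))
```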
  • In addition, the above-mentioned process of obtaining the target video segment from the video to be processed can also be accelerated by corresponding hardware; for example, it can be processed by a graphics processing unit (GPU) with high image-processing performance, although a CPU with relatively lower performance can of course also be used.
  • In specific implementation, the video corpus generation apparatus 103 may present, in the interactive interface with the user, prompt information asking whether to perform hardware acceleration, so that the user can choose on the interactive interface whether to use hardware acceleration to speed up the process of obtaining target video segments from the video to be processed, thereby speeding up the generation of the video corpus.
  • the labeling module 1033 uses the subtitles included in the video images in the target video segment as the labeling text of the target video segment to obtain a video corpus.
  • the labeling module 1033 can automatically add labeling text to the target video segment.
  • the labeling text added by the labeling module 1033 to the target video clip is the subtitle displayed on the video image of the video clip.
  • the annotation module 1033 may call the identifying module 1034 to identify the subtitles on the video images in the target video segment.
  • the identification module 1034 can identify the subtitles on the video image of the target video segment through the OCR technology, obtain the corresponding subtitle text, and feed it back to the labeling module 1033 .
  • In this way, the labeling module 1033 can label the target video segment with the received subtitle text as its annotation text, so as to generate a video corpus that includes annotation text, audio and video images. Since the subtitles in the target video clip were manually added to the video by the video editor according to the voice content in the process of making the video, the subtitles are highly consistent with the voice content; therefore, using the subtitles in the target video clip as the annotation text can improve the accuracy of the annotation text of the target video segment.
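  • As an illustration of this labeling step, the sketch below grabs the middle frame of a target video segment, crops the subtitle display area estimated earlier and runs OCR on it, returning the result as the annotation text; pytesseract and the Tesseract engine with a suitable language pack (for example chi_sim for Chinese subtitles) are assumed to be installed:

```python
import cv2
import pytesseract

def annotation_text_for_clip(clip_path, subtitle_band, lang="chi_sim+eng"):
    """OCR the subtitle shown in the middle frame of a clip; return it as annotation text."""
    y0, y1 = subtitle_band
    cap = cv2.VideoCapture(clip_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, total // 2)      # middle frame of the segment
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return ""
    crop = cv2.cvtColor(frame[y0:y1, :], cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(crop, lang=lang).strip()
```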
  • In the above description, the example in which the recognition module 1034 recognizes the subtitles in the target video segment after the target video segment has been obtained by segmentation is used for illustration.
  • In other possible implementations, the recognition module 1034 may also first recognize the subtitles in the video to be processed to obtain the subtitle text of the entire video to be processed, where the subtitle text may record the display time points corresponding to the different subtitles; the segmentation module 1032 then completes the segmentation of the video to be processed.
  • In this way, when the labeling module 1033 needs to obtain the subtitle text corresponding to the target video clip, it can, according to the playback time period of the target video clip in the video to be processed, look up the subtitles displayed within that playback time period in the subtitle text, so as to obtain the subtitle text corresponding to the target video segment.
  • In this embodiment, there is no limitation on the execution order between the recognition module 1034 recognizing subtitles and the segmentation module 1032 segmenting the video to be processed.
  • The above embodiment describes the video corpus generation device 103 generating a video corpus based on a to-be-processed video that includes subtitles; in other possible implementations, when the to-be-processed video does not include subtitles, the video corpus generation device 103 can also generate a video corpus with annotation text based on the to-be-processed video.
  • In specific implementation, the segmentation module 1032 may segment the to-be-processed video according to the voice content in the audio included in the to-be-processed video, and obtain one or more video clips containing voice content.
  • For the specific implementation of the segmentation module 1032 segmenting the video to be processed according to the audio, reference may be made to the relevant descriptions above.
  • Then, the labeling module 1033 can call the recognition module 1034 to perform speech recognition on the audio in each video clip and use sentence boundary detection technology to determine each sentence in the speech content, so that the speech recognition text corresponding to the speech content of each video segment can be obtained; the labeling module 1033 can then use the speech recognition text corresponding to each video segment as the annotation text of that video segment to generate the video corpus.
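  • A minimal sketch of this fallback is shown below, assuming the open-source whisper package (openai-whisper) is available as the speech recognition component; any ASR engine with sentence-level output would serve the same purpose:

```python
import whisper  # open-source ASR package; assumed to be installed

def asr_annotation(clip_audio_path, model_name="base"):
    """Transcribe a clip's audio and return the text to use as its annotation."""
    model = whisper.load_model(model_name)
    result = model.transcribe(clip_audio_path)
    # result["segments"] also carries per-sentence start/end times if needed
    return result["text"].strip()
```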
  • the video corpus generating device 103 may also present the video corpus to the user, so that the user can manually verify the video images, audio and marked text in the video corpus. In this way, when there is a small amount of low-quality video corpus in the generated video corpus, the user can manually correct the part of the video corpus to further improve the quality of the generated video corpus.
  • In this embodiment, the video corpus generation device 103 generates one or more video corpora based on the video to be processed, which can be used in scenarios such as speech recognition, speech generation, machine translation, digital virtual human construction, and sentiment analysis.
  • For example, the video corpus generation device 103 may present a task configuration interface to the user, in which the user may be prompted to input a training task for the video corpus, for example through a prompt message as shown in FIG. 5.
  • the task configuration interface can also present candidates for multiple training tasks, such as speech recognition, speech generation, machine translation, digital virtual robot construction, and sentiment analysis as shown in Figure 5.
  • the video corpus generating apparatus 103 can acquire the training task selected by the user and execute the training task based on the generated video corpus.
  • the user may also manually input the name of the training task directly on the task configuration interface.
  • the manner in which the video corpus generating apparatus 103 obtains the training task specified by the user is not limited.
  • the video corpus with annotated text, audio and video images generated by the video corpus generating device 103 can be used to train a pre-built speech recognition model.
  • Specifically, the audio in the video corpus can be used as the input of the speech recognition model, and the annotation text of the video corpus can be used as the expected output of the speech recognition model, so as to train the speech recognition model.
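  • In practice this usually amounts to exporting the video corpus as (audio, transcript) pairs in whatever manifest format the chosen ASR toolkit expects; the JSON-lines layout below is only an illustrative example, not a prescribed format:

```python
import json

def write_asr_manifest(corpus_items, manifest_path):
    """corpus_items: iterable of dicts with 'audio' (file path) and 'text' (annotation text)."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for item in corpus_items:
            f.write(json.dumps(
                {"audio_filepath": item["audio"], "text": item["text"]},
                ensure_ascii=False) + "\n")

write_asr_manifest(
    [{"audio": "clip_0001.wav",
      "text": "Oh, when we bought the tickets, we found that the tickets for that movie were sold out"}],
    "train_manifest.jsonl")
```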
  • For example, a video corpus with a regional accent can be generated by the video corpus generation device 103, that is, the audio in the video corpus includes speech based on regional pronunciation. After such a video corpus is used to train the speech recognition model, the speech recognition model can recognize the corresponding speech text for audio with that regional pronunciation, or for videos that include such audio (such as dialect dramas or local news videos), so as to realize speech recognition.
  • In a speech generation scenario, the video corpus generated by the video corpus generation device 103 may be used to train a pre-built speech generation model.
  • automatic speech generation can be understood as the reverse process of speech recognition, that is, to generate corresponding speech based on specific text.
  • the labeled text in the video corpus can be used as the input of the speech generation model, and the audio in the video corpus can be used as the output of the speech generation model, so as to complete the training of the speech generation model.
  • the speech generation model obtained by training can output the speech corresponding to the text according to the input text in the fields of audio novels, digital virtual humans, voice assistants, and smart speakers.
  • Moreover, when training the speech generation model, a video corpus that includes the voice of a specific character can be used, so that the trained speech generation model can generate speech with that character's voice; for example, the speech generation model can be used to generate the character's voice to broadcast a navigation route.
  • In a machine translation scenario, the annotation text in the video corpus may include texts in multiple languages with the same meaning (for example, the subtitles in the video corpus are Chinese-English bilingual subtitles); the following takes annotation text that includes text in a first language and text in a second language as an example.
  • In this case, the texts in the multiple languages can be separated from the annotation text of the video corpus. Since the texts in the multiple languages usually have the same semantics, the machine translation model can be trained by using the annotation text, so as to improve the accuracy with which the machine translation model translates content in one language into content in another language.
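  • For bilingual (for example, Chinese-English) subtitles, the two languages first have to be separated from each annotation text before they can form a translation pair; the character-range heuristic below is a simple illustrative assumption, and real subtitles may need cleaner rules such as splitting on line breaks:

```python
def split_zh_en(annotation):
    """Split a mixed Chinese/English annotation text into a (chinese, english) pair."""
    zh = "".join(c for c in annotation if "\u4e00" <= c <= "\u9fff" or c in "，。！？、")
    en = "".join(c for c in annotation if c.isascii())
    return zh.strip(), en.strip()

print(split_zh_en("我们发现那部电影的票已经卖完了 The tickets for that movie were sold out"))
```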
  • In a digital virtual human construction scenario, multimodal speaker detection technology can be used to locate the speaking character in the video images of the video corpus and to detect, from the video images, the character's face information during speech, such as facial expressions and facial movements, so that a digital virtual human can be generated according to the face information, the audio included in the video corpus, and the annotation text.
  • In this way, when the digital virtual human has a dialogue with the user, if the dialogue content has the same semantics as the annotation text, the facial expressions and dialogue audio of the digital virtual human during the dialogue with the user can be fitted according to the face information in the video images of the video corpus, so as to achieve more intelligent human-computer interaction.
  • of course, the above scenarios are only examples; the video corpus can also be used in other scenarios, such as multimodal sentiment analysis and multimodal video classification based on the video corpus, which are not limited in this embodiment.
  • referring to the schematic structural diagram of the video corpus generation apparatus shown in FIG. 6, the apparatus 600 includes:
  • a video acquisition module 601, configured to acquire a to-be-processed video, where the to-be-processed video corresponds to voice content, and some video images of the to-be-processed video include subtitles corresponding to the voice content;
  • a segmentation module 602, configured to obtain a target video segment from the to-be-processed video according to the voice content;
  • a labeling module 603, configured to use the subtitles included in the video images of the target video segment as the labeled text of the target video segment to obtain a video corpus. A structural sketch of these modules follows this item.
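The sketch below mirrors this three-module structure as plain Python, with placeholder callables standing in for the actual acquisition, segmentation, and OCR-based labeling logic. The names, data shapes, and toy timestamps are assumptions for illustration only.

```python
# Minimal structural sketch of apparatus 600 and its three modules.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[float, float]                     # (start_second, end_second)

@dataclass
class VideoCorpusItem:
    video_path: str
    span: Span
    labeled_text: str

@dataclass
class VideoCorpusApparatus:
    acquire_video: Callable[[], str]                       # video acquisition module 601
    segment_video: Callable[[str], List[Span]]             # segmentation module 602
    read_subtitle: Callable[[str, Span], str]              # labeling module 603 (via OCR)

    def generate(self) -> List[VideoCorpusItem]:
        video = self.acquire_video()
        return [VideoCorpusItem(video, span, self.read_subtitle(video, span))
                for span in self.segment_video(video)]

# Toy stand-ins so the sketch runs end to end.
apparatus = VideoCorpusApparatus(
    acquire_video=lambda: "to_be_processed.mp4",
    segment_video=lambda _: [(25.0, 27.0), (28.0, 31.0)],
    read_subtitle=lambda _, span: f"subtitle for {span}",
)
print(apparatus.generate())
```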
  • the functions performed by the video acquisition module 601 in this embodiment are similar to the functions performed by the video acquisition module 1031 in the foregoing embodiments; for details, refer to the descriptions of the relevant parts of the foregoing embodiments, which are not repeated here.
  • similarly, for the specific functions performed by the segmentation module 602 and the labeling module 603 in this embodiment, reference may be made to the segmentation module 1032 and the labeling module 1033 in the foregoing embodiments.
  • in a possible implementation, the segmentation module 602 is specifically configured to: identify target speech start and end points of the voice content, where the target speech start and end points include a target speech start point and a target speech end point corresponding to the target speech start point; and obtain the target video segment from the to-be-processed video according to the target speech start and end points.
  • in a possible implementation, the segmentation module 602 is further specifically configured to: identify target subtitle start and end points of the subtitles corresponding to the voice content, where the target subtitle start and end points include a target subtitle start point and a target subtitle end point corresponding to the target subtitle start point; obtain candidate video segments from the to-be-processed video according to the target subtitle start and end points; and, when the target speech start and end points are inconsistent with the target subtitle start and end points, adjust the candidate video segments according to the target speech start and end points to obtain the target video segment. An alignment sketch follows this item.
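The sketch below illustrates the adjustment step for the two inconsistency cases described in this document (speech leading or lagging the subtitle). The helper function and the 0.5 s examples are an illustrative reading of that description, not the exact algorithm of this application.

```python
# Minimal sketch of adjusting a candidate segment (cut on subtitle start/end points)
# using the speech start/end points, so that the segment contains the complete speech.
from typing import Tuple

Span = Tuple[float, float]  # (start_second, end_second)

def adjust_candidate(subtitle_span: Span, speech_span: Span) -> Span:
    sub_start, sub_end = subtitle_span
    sp_start, sp_end = speech_span
    if sp_start < sub_start:
        # Speech leads the subtitle: move the segment start back to the speech start
        # and keep the subtitle end point as the segment end.
        return (sp_start, sub_end)
    if sp_start > sub_start:
        # Speech lags the subtitle: start from the speech start, and extend the end
        # only if the speech ends later than the subtitle does.
        return (sp_start, max(sub_end, sp_end))
    return subtitle_span  # already consistent

# Subtitle shown from 00:00:28 to 00:00:31, speech leading by 0.5 s.
print(adjust_candidate((28.0, 31.0), (27.5, 31.0)))   # -> (27.5, 31.0)
# Speech lagging by 0.5 s and ending at 00:00:31.30.
print(adjust_candidate((28.0, 31.0), (28.5, 31.3)))   # -> (28.5, 31.3)
```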
  • in a possible implementation, the segmentation module 602 is specifically configured to determine the target subtitle start and end points according to the subtitle display area of the subtitles; a boundary-detection sketch based on the display area follows this item.
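The following sketch shows the underlying idea of locating subtitle boundaries from the display area: adjacent frames whose subtitle display area differs strongly mark a subtitle start or end point. The rectangle coordinates, the difference threshold, and the toy frames are assumptions; a real pipeline would first derive the display area by sampling frames and running OCR, as described elsewhere in this document.

```python
# Minimal sketch of finding subtitle start/end points by comparing the subtitle
# display area of adjacent frames.
import numpy as np

def subtitle_boundaries(frames: np.ndarray, area: tuple, threshold: float = 12.0):
    """frames: (N, H, W) grayscale frames; area: (top, bottom, left, right)."""
    top, bottom, left, right = area
    crops = frames[:, top:bottom, left:right].astype(np.float32)
    boundaries = []
    for i in range(1, len(crops)):
        diff = float(np.mean(np.abs(crops[i] - crops[i - 1])))
        if diff > threshold:                   # display area changed -> boundary frame
            boundaries.append(i)
    return boundaries

# Toy frames: the subtitle region is dark, then bright for frames 3..6, then dark again.
frames = np.zeros((10, 120, 160), dtype=np.uint8)
frames[3:7, 100:118, 20:140] = 255
print(subtitle_boundaries(frames, area=(100, 118, 20, 140)))   # -> [3, 7]
```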
  • the apparatus 600 further includes:
  • the video corpus application module 604 is configured to use the audio and the labeled text in the video corpus to complete the training of the speech recognition model; or, use the audio and the labeled text in the video corpus to complete the training of the speech generation model.
  • in a possible implementation, the labeled text of the video corpus includes text in a first language and text in a second language, and the apparatus 600 further includes:
  • the video corpus application module 604 is configured to use the text in the first language and the text in the second language to complete the training of the machine translation model.
  • the apparatus 600 further includes:
  • an information acquisition module 605, configured to acquire the face information in the video images of the video corpus;
  • the video corpus application module 604 is configured to generate a digital virtual human according to the face information, the audio included in the video corpus, and the labeled text of the video corpus.
  • the apparatus 600 further includes:
  • a presentation module 606, configured to present a task configuration interface;
  • the information acquisition module 605 is configured to acquire the training task that the user specifies for the video corpus in the task configuration interface. A task-dispatch sketch follows this item.
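As a rough illustration, the sketch below maps a training task selected in the task configuration interface to a training routine run over the generated corpus. The task names and trainer functions are placeholders, not part of this application.

```python
# Minimal sketch of dispatching the user-selected training task to a trainer.
from typing import Callable, Dict, List

def train_speech_recognition(corpus: List[dict]) -> str:
    return f"trained speech recognition on {len(corpus)} corpus items"

def train_speech_generation(corpus: List[dict]) -> str:
    return f"trained speech generation on {len(corpus)} corpus items"

TASKS: Dict[str, Callable[[List[dict]], str]] = {
    "speech recognition": train_speech_recognition,
    "speech generation": train_speech_generation,
}

def run_configured_task(task_name: str, corpus: List[dict]) -> str:
    if task_name not in TASKS:
        raise ValueError(f"unknown training task: {task_name}")
    return TASKS[task_name](corpus)

print(run_configured_task("speech recognition", [{"audio": "a.wav", "text": "..."}]))
```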
  • the video corpus generation apparatus 600 according to this embodiment of the present application may correspondingly perform the methods described in the embodiments of the present application, and the above and other operations and/or functions of the modules of the video corpus generation apparatus 600 are respectively intended to implement the corresponding procedures of the methods performed by the video corpus generating device 103 in FIG. 2, which are not repeated here for brevity.
  • in the above embodiments, the process of generating the video corpus can also be implemented by a separate hardware device. The computing device that implements the process of generating the video corpus is introduced in detail below.
  • FIG. 7 provides a schematic structural diagram of a computing device.
  • the computing device 700 shown in FIG. 7 can specifically be used to implement the functions of the video corpus generating apparatus 103 in the embodiment shown in FIG. 2 or the functions of the video corpus generating apparatus 600 in the embodiment shown in FIG. 6 .
  • Computing device 700 includes bus 701 , processor 702 , communication interface 703 , and memory 704 .
  • the processor 702 , the memory 704 and the communication interface 703 communicate through the bus 701 .
  • the bus 701 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 7, but this does not mean that there is only one bus or only one type of bus.
  • the communication interface 703 is used to communicate with the outside, such as receiving a target service request sent by a functional network element on the software developer plane.
  • the processor 702 may be a central processing unit (central processing unit, CPU).
  • Memory 704 may include volatile memory, such as random access memory (RAM).
  • Memory 704 may also include non-volatile memory, such as read-only memory (ROM), flash memory, HDD, or SSD.
  • Executable code is stored in the memory 704, and the processor 702 executes the executable code to execute the aforementioned method performed by the video corpus generating apparatus 103 or the video corpus generating apparatus 600.
  • the computing device 700 interacts with other devices through the communication interface 703; for example, the computing device 700 obtains multiple pieces of to-be-processed data from the data source through the communication interface 703.
  • the processor is configured to execute the instructions in the memory 704 to implement the method executed by the video corpus generating apparatus 600 .
  • in addition, an embodiment of the present application further provides a computer-readable storage medium storing instructions that, when run on a computer device, cause the computer device to execute the method performed by the video corpus generating device 103 in the above embodiments.
  • an embodiment of the present application further provides a computer program product; when the computer program product is executed by a computer, the computer executes any one of the foregoing data providing methods.
  • the computer program product can be a software installation package, which can be downloaded and executed on a computer if any one of the aforementioned data providing methods needs to be used.
  • the device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium that a computer can store, or a data storage device, such as a training device or a data center, that integrates one or more available media.
  • the available media may be magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, DVDs), or semiconductor media (for example, solid state disks (SSD)), and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Studio Circuits (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A method for generating a video corpus. Specifically, a to-be-processed video is acquired, where the to-be-processed video corresponds to voice content and some video images of the to-be-processed video include subtitles corresponding to the voice content. Then, a target video segment is obtained from the to-be-processed video according to the voice content, and the subtitles included in the video images of the target video segment are used as the labeled text of the target video segment, so as to obtain a video corpus. In this way, video corpora can be generated automatically, which not only avoids the impact of subjective cognitive errors on segmentation accuracy during manual labeling, but also generally makes corpus generation more efficient. Moreover, the problem of incompletely played voice content in the generated video corpus can be avoided, and the labeled text of the video corpus is more accurate. In addition, a video corpus generation apparatus and a related device are further provided.

Description

生成视频语料的方法、装置及相关设备 技术领域
本申请涉及视频处理技术领域,尤其涉及一种生成视频语料的方法、装置及相关设备。
背景技术
视频,是一种常见的媒体类别,在情感分析、说话人物检测等人工智能场景中存在众多应用,具体可以是利用机器学习算法,基于大量带有文本标注的视频语料进行有监督学习,以满足在多种应用场景的需求。
目前,视频语料的生成,通常是由标注人员观看整段视频,并在观看过程中手动选择每个需要标注的视频片段的起止点,从而设备基于人工选择的起止点对该视频进行切分,然后再对切分得到的各个视频片段的内容进行文字标注,得到至少一个视频语料。这种通过人工标注生成视频语料的方式,不仅耗费较高的人工成本,而且标注人员的主观认知误差通常会导致对于视频片段的切分准确性较低,所生成的视频语料的质量较低。
发明内容
本申请提供了一种生成视频语料的方法,提高生成视频语料的效率以及提高所生成的视频语料的质量。此外,本申请还提供了一种视频语料生成装置、计算机设备、计算机可读存储介质以及计算机程序产品。
第一方面,本申请提供了一种生成视频语料的方法,该方法应用于视频语料生成装置。具体的,视频语料生成装置获取待处理视频,该待处理视频对应语音内容,即该待处理视频中的音频包括人类语音中的词汇内容,并且该待处理视频的部分视频图像包括语音内容对应的字幕。然后,视频语料生成装置根据该语音内容,从待处理视频中获取目标视频片段,并将该目标视频片段中的视频图像包括的字幕作为该目标视频片段的标注文本,以此生成得到包括视频图像、音频以及标注文本的视频语料。
如此,在生成视频语料的过程中,视频语料生成装置能够根据待处理视频对应的语音内容自动对待处理视频进行切分,并利用视频图像中的字幕自动为视频标注文本,从而不仅可以避免人工标注过程中因为主观认知误差而导致对于切分精度的影响,而且生成视频语料的效率通常也较高。
并且,当待处理视频中存在字幕与音频不一致的情况时(如字幕超前或语音超前),根据待处理视频对应的语音内容对待处理视频进行切分,可以避免切分得到的目标视频片段中出现语音内容播放不完整的问题,从而可以提高生成的视频语料的质量。另外,由于是将目标视频片段中的字幕作为该目标视频片段的标注文本,而字幕通常是预先由视频编辑者根据视频语音进行人工添加的准确文本,这相比于将对语音内容进行语音识别所得到的文本作为目标视频片段的标注文本的方式而言,视频语料的标注 文本的准确性更高。
在一种可能的实施方式中,视频语料生成装置在从待处理视频中获取目标视频片段时,具体可以是先识别语音内容的目标语音起止点,例如可以是根据ASR技术识别目标语音起止点等,该目标语音起止点包括目标语音起始点以及该目标语音起始点对应的目标语音终止点。示例性地,该目标语音起始点例如可以是待处理视频的音频中的一句语音的起始点,而目标语音终止点为这句语音在音频中的终止点。然后,视频语料生成装置可以根据该目标语音起止点,从待处理视频中目标获取目标视频片段,例如,视频语料生成装置可以根据该目标语音起止点,对待处理视频进行切分,得到目标视频片段等。如此,根据目标语音起止点对待处理视频进行分割,可以避免切分得到的视频片段中出现语音内容播放不完整的问题,从而可以提高生成的视频语料的质量。
在一种可能的实施方式中,视频语料生成装置在根据目标语音起止点从待处理视频中获取目标视频片段时,具体可以先识别语音内容对应的字幕的目标字幕起止点,例如可以是通过OCR技术识别目标字幕起止点等,该目标字幕起止点包括目标字幕起始点以及目标字幕终止点。然后,视频语料生成装置可以根据该目标字幕起止点,从待处理视频中获取候选视频片段,并且,当目标语音起止点与目标字幕起止点不一致时,根据目标语音起止点对候选视频片段进行调整,以得到目标视频片段。如此,可以实现目标视频片段中字幕与语音内容的对齐,避免切分得到的目标视频片段中出现字幕对应的语音内容不完整的问题。。并且,先根据目标字幕起止点对待处理视频进行切分,可以避免目标视频片段为过于碎片化,如可以避免具有相同字幕的连续多帧视频图像被切分成多个视频片段等。
在一种可能的实施方式中,视频语料生成装置在识别语音内容对应的字幕的目标字幕起止点时,具体可以是根据待处理视频中字幕的字幕显示区域,确定目标字幕起止点。例如,视频语料生成装置可以对待处理视频中的多帧视频图像进行采样,得到采样视频图像,然后,视频语料生成装置可以根据字幕在采样视频图像上的显示区域,确定待处理视频中的字幕显示区域。如此,可以通过自动化的采样以及识别过程,确定出待处理视频的字幕显示区域,以便后续根据该字幕显示区域中的字幕确定字幕起止点。
在一种可能的实施方式中,在生成视频语料后,可以利用该视频语料中的音频以及标注文本,完成对语音识别模型的训练。这样,对于文本信息未知的语音,可以通过训练得到的语音识别模型确定该语音对应的文本信息,如针对地域性口音的语音,可以通过该语音识别模型精确识别该语音对应的文本。或者,在生成视频语料后,可以利用该视频语料中的音频以及标注文本,完成对语音生成模型的训练。这样,对于一份特定的文本,可以利用语音生成模型基于该文本输出得到对应的语音。并且,由于生成的视频语料的质量较高,从而基于质量较高的视频语料所生成的语音识别模型或者语音生成模型,其输出结果的准确性通常也较高。
在一种可能的实施方式中,所生成的视频语料的标注文本中,可以包括多个语种的文本。以包括第一语种(如中文)的文本以及第二语种(如英语)的文本为例,可以利用第一语种的文本以及第二语种的文本,完成对机器翻译模型的训练,这样,后 续可以利用该机器翻译模型,根据用户输入的第一语种(或第二语种)的待处理文本,翻译得到对应的第二语种(或第一语种)的翻译文本。并且,由于生成的视频语料的质量较高,从而基于质量较高的视频语料所生成的语音识别模型或者语音生成模型,其输出的翻译结果的准确性通常也较高。
在一种可能的实施方式中,在生成视频语料后,可以获取该视频语料的视频图像中的人脸信息,并根据该人脸信息、该视频语料包括的音频以及视频语料的标注文本,生成数字虚拟人。这样,在数字虚拟人与用户进行对话时,若其对话内容与该标注文本的语义相同,则可以根据该视频语料的视频图像中的人脸信息,拟合出数字虚拟人与用户进行对话的面部表情以及对话音频,从而实现更加智能化的人机交互。
在一种可能的实施方式中,视频语料生成装置还可以向用户呈现任务配置界面,该任务配置界面中可以呈现有提示用户指定训练任务的提示信息。这样,视频语料生成装置可以获取用户在该任务配置界面针对所述视频语料的训练任务,以便基于生成的视频语料对该属于该训练任务的模型进行训练。
第二方面,本申请提供一种视频语料生成装置,所述视频语料生成装置包括用于实现第一方面中的生成视频语料的方法的各个模块。
第三方面,本申请提供一种计算机设备,所述计算机设备包括处理器和存储器;该存储器用于存储指令,当该计算机设备运行时,该处理器执行该存储器存储的该指令,以使该计算机设备执行上述第一方面或第一方面任一种可能实现方式中的生成视频语料的方法。需要说明的是,该存储器可以集成于处理器中,也可以是独立于处理器之外。计算机设备还可以包括总线。其中,处理器通过总线连接存储器。其中,存储器可以包括可读存储器以及随机存取存储器。
第四方面,本申请提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机设备上运行时,使得计算机设备执行上述第一方面或第一方面的任一种实现方式所述生成视频语料的方法。
第五方面,本申请提供了一种包含指令的计算机程序产品,当其在计算机设备上运行时,使得计算机设备执行上述第一方面所述生成视频语料的方法。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
附图说明
图1为一种生成视频语料的系统架构示意图;
图2为本申请实施例提供的一种生成视频语料的方法流程示意图;
图3为本申请实施例提供的一示例性音频包括的语音内容示意图;
图4为本申请实施例提供的对待处理视频进行切分所得到的视频片段示意图;
图5为本申请实施例提供的一种任务配置界面的示意图;
图6为本申请实施例提供的一种视频语料生成装置的示意图;
图7为本申请实施例提供的一种计算机设备700的结构示意图。
具体实施方式
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解,这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。
参见图1,为一种生成视频语料的系统架构示意图。如图1所示,该系统100包括视频采集装置101、视频采集装置102、视频语料生成装置103以及客户端104,不同装置之间可以通过通信网络进行连接。其中,视频采集装置101可以从网络中采集已有的视频,如电影等;视频采集装置102可以现场采集得到视频,如通过摄像头、麦克风等装置采集现场直播视频等。视频采集装置101以及视频采集装置102可以将通过不同途径采集到的视频发送给视频语料生成装置103。此时,若由标注人员通过客户端104对传输至视频语料生成装置103的视频进行人工标注,则不仅会导致生成视频语料的效率较低,而且标注人员的主观认知误差通常会导致视频切分不准确,从而影响所生成的视频语料的质量。
为此,本实施例中,可以由视频语料生成装置103自动对视频进行切分以及标注文本。具体实现时,视频语料生成装置103可以包括视频获取模块1031、切分模块1032、标注模块1033以及识别模块1034。其中,视频获取模块1031可以接收视频采集装置101以视频采集装置102传输的视频,并将该视频提供给切分模块1032。切分模块1032根据该视频对应的语音内容,从视频中获取目标视频片段。然后,标注模块1033将该目标视频片段中的视频图像包括的字幕作为目标视频片段的标注文本,以此得到包括标注文本、音频以及图像的视频语料。其中,视频图像中的字幕可以由识别模块1034对视频图像进行识别得到。如此,在生成视频语料的过程中,视频语料生成装置103能够根据视频对应的语音内容自动对视频进行切分,并利用视频图像中的字幕自动为视频标注文本,从而不仅可以避免人工标注过程中因为主观认知误差而导致对于切分精度的影响,而且生成视频语料的效率通常也较高。
并且,当视频中存在字幕与音频不一致的情况时(如字幕超前或语音超前),根据视频对应的语音内容对待处理视频进行切分,可以避免切分得到的视频片段中出现语音内容播放不完整的问题,从而可以提高生成的视频语料的质量。另外,由于是将视频片段中的字幕作为视频片段的标注文本,而字幕通常是预先由视频编辑者根据视频语音进行人工添加的准确文本,这相比于将对语音内容进行语音识别所得到的文本作为视频片段的标注文本的方式而言,视频语料的标注文本的准确性更高。
示例性地,视频语料生成装置103可以是由软件实现,例如可以是运行在系统100中的任意设备(如服务器等)上的计算机程序等。或者,视频语料生成装置103也可以是由硬件实现,如视频语料生成装置103可以是系统100中的服务器或者终端设备等;或者,视频语料生成装置103可以是利用专用集成电路(application-specific  integrated circuit,ASIC)实现、或可编程逻辑器件(programmable logic device,PLD)实现的设备等。其中,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD)、现场可编程门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合实现。
实际应用中,图1所示的视频语料生成装置103可以部署于云端,例如可以是部署于公有云、边缘云或者分布式云等,此时,视频语料生成装置103可以作为云服务,在云端为一个或者多个用户生成其所需的视频语料。或者,视频语料生成装置103可以也可以部署于本地,如可以在本地采集视频并在本地生成视频语料等。本实施例中,对于视频语料生成装置103的具体部署方式景并不进行限定。
需要说明的是,图1所示的视频语料生成装置103仅作为一种示例性说明,并不用于限定该装置的具体实现。例如,在其它可能的实施方式中,切分模块1032与识别模块1034可以集成为一个功能模块,如切分模块1032可以通过具有切分以及识别的功能等,或者,视频语料生成装置103也可以包括具有其它功能模块以支持视频语料生成装置具有更多其它的功能;或者,视频语料生成装置103也可以是通过其它方式获取视频数据,如由用户向该视频语料生成装置103提供视频等。
为便于理解,下面结合附图,对本申请的实施例进行描述。
参见图2,图2为本申请实施例提供的一种生成视频语料的方法流程示意图。其中,图2所示的生成视频语料的方法可以应用于图1所示的视频语料生成装置103,或者应用于其它可适用的视频语料生成装置中。为便于说明,本实施例中以应用于图1所示的视频语料生成装置103为例进行示例性说明。
基于图1所示的视频语料生成装置103,图2所示的生成视频语料的方法具体可以包括:
S201:视频获取模块1031获取待处理视频,该待处理视频对应语音内容,并且该待处理视频的部分视频图像包括该语音内容对应的字幕。
本实施例中,所获取的视频可以是包括连续的多帧视频图像以及音频的视频,并且,该连续多帧的视频图像中包括字幕。为便于描述,以下将获取的视频称之为待处理视频。实际应用场景中,待处理视频中的字幕,可以是由视频编辑者在生成视频的过程中,根据该待处理视频的音频中所包括的语音内容,为该视频编撰以及添加相应的字幕,并且在完成对视频的渲染后,所添加的字幕可以集成在视频的多帧视频图像中。
其中,音频中的语音内容,具体可以是该待处理视频中的角色发出的人类语音中的词汇内容,如语音内容可以是待处理视频中的人物A以及人物B之间的对话内容,或者可以是待处理视频中“旁白”表达的介绍性内容等。通常情况下,音频中的语音内容所表达的语义与视频中的字幕的语义保持一致。并且,待处理视频的音频中,除了包括语音内容,还存在不包括语音内容的音频片段,如在一段人物对话视频中,人 物对话之前的视频片段以及人物对话结束的视频片段包括的音频中,可以是不包括语音内容的音频(此时,该视频片段的视频图像中可以不包括人物对话字幕)。
作为一种获取待处理视频的实现示例,视频获取模块1031可以接收其它装置发送的待处理视频。例如,视频获取模块1031可以与图1中的视频采集装置101以及视频采集装置102建立通信连接,并接收视频采集装置101以及视频采集装置102基于该通信连接所发送的待处理视频。或者,视频获取模块1031可以从本地读取待处理视频等。本实施例中,对于获取待处理视频的具体实现方式并不进行限定。然后,视频获取模块301可以将该待处理视频提供给切分模块1032进行后续处理。
实际应用时,视频获取模块1031所获取的待处理视频,例如可以是播放时长大于预设时长阈值的视频,从而后续可以基于该待处理视频生成多个视频语料。
S202:切分模块1032根据语音内容,从待处理视频中获取目标视频片段。
在一些场景中,待处理视频中的音频包括的语音内容可能较多(如可能包括多句语音内容),并且音频中存在不包括语音内容的音频片段,因此,切分模块1032可以对该待处理视频进行切分,具体可以是根据语音内容,对待处理视频进行切分,得到多个视频片段,所得到的每个视频片段的音频均包含部分语音内容,相应的,每个视频片段包括的语音内容所对应的字幕集成在该视频片段包括的视频图像中。当然,若待处理视频的音频中包括的语音内容较少,如仅包括一句话的语音内容等,则切分模块1032根据音频的语音内容也可以切分得到一个视频片段。为便于描述,下面以根据切分得到的一个视频片段生成一个视频语料为例进行示例性说明,并且,以下将该视频片段称之为目标视频片段。
作为一种实现示例,在切分待处理视频的过程中,切分模块1032可以调用识别模块1034,以获取待处理视频包括的音频对应的语音起止点。具体的,识别模块1034可以识别音频中的每句话,如利用自动语音识别(automatic speech recognition,ASR)算法进行识别等,则每一句语音的结束,即为该句语音的终止点,而这句语音的开始即为这句话的起始点,并且,相邻两句话之间可以存在间隔。其中,一句语音的起始点以及终止点可以通过该音频中的时间戳进行标识。实际应用时,识别模块1034可以结合语音活性检测(voice activity detection,VAD)技术提高确定语音起止点的精度。举例来说,假设音频中的语音内容包括如图3所示的“上次你说的去看电影怎么没去看啊”、“哦,我们在买票的时候发现那部电影的票已经卖完了”,则识别模块1034基于ASR算法可以识别出第一句语音“上次你说的去看电影怎么没去看啊”的起始时刻为“00:00:25”、终止时刻为“00:00:27”,识别出第二语句语音“哦,我们在买票的时候发现那部电影的票已经卖完了”的起始时刻为“00:00:28”、终止时刻为“00:00:31”,从而识别模块1034可以将“00:00:25”以及“00:00:27”分别作为第一句语音的起始点以及终止点,将“00:00:28”以及“00:00:31”分别作为第二句语音的起始点以及终止点。
然后,识别模块1034可以将通过ASR算法识别得到待处理视频包括的语音内容的多个语音起止点提供给切分模块1032。这样,切分模块1032可以根据该多个语音起 止点将待处理视频切分成多个视频片段。即,在从待处理视频中获取目标视频片段时(该目标视频片段为多个视频片段中的任意视频片段),可以先由识别模块1034识别得到目标语音起止点,该目标语音起止点包括目标语音起始点以及该目标语音起始点对应的目标语音终止点,从而切分模块1032根据该目标语音起止点从待处理视频中通过切分得到该目标视频片段。其中,目标视频片段的起始点以及终止点即为该视频片段中语音内容对应的目标语音起始点以及目标语音终止点。
实际应用时,根据音频中的语音起止点对待处理视频进行切分,可能会出现具有相同字幕的多帧视频图像被切分成两个视频片段。比如,仍以图3所示的音频为例,如图4所示,对于“00:00:28”时刻至“00:00:31”时刻的视频片段,其字幕为“哦,我们在买票的时候发现那部电影的票已经卖完了”,但是,利用ASR算法对音频中的语音内容的语音起止点进行识别时,也可能会将语音内容“哦,我们在买票的时候发现那部电影的票已经卖完了”识别为两句语音,如图4所示的“00:00:28”时刻至“00:00:29”时刻之间的语音,以及“00:00:30”时刻至“00:00:31”时刻之间的语音,这使得在基于语音起止点对待处理视频进行切分时,可能会切分得到语音内容为“哦”以及语音内容为“我们在买票的时候发现那部电影的票已经卖完了”的两个视频片段中,均具有相同字幕“哦,我们在买票的时候发现那部电影的票已经卖完了”。
因此,在进一步可能的实施方式中,切分模块1032在从待处理视频中获取目标视频片段时,可以结合字幕起止点以及语音内容的语音起止点切分待处理视频。具体的,在从待处理视频中获取目标视频片段的过程中,识别模块1034不仅可以识别得到目标语音起止点,还可以识别语音内容所对应的字幕的目标字幕起止点,该目标字幕起止点包括目标字幕起始点以及该目标字幕起始点对应的目标字幕终止点。示例性地,目标字幕起始点具体可以是该条字幕在待处理视频中出现的时间点,目标字幕终止点具体可以是该条字幕在待处理视频中结束的时间点。这样,切分模块1032可以先根据该目标字幕起止点对待处理视频进行切分,得到候选视频片段。然后,切分模块1032可以利用目标语音起止点对目标字幕起止点进行一致性校验,并且当目标语音起止点与目标字幕起止点一致时,切分模块1032可以将该候选视频片段作为最终切分得到的目标视频片段;而当目标语音起止点与目标字幕起止点不一致时,切分模块1032可以根据目标语音起止点对候选视频片段进行调整,以得到最终的目标视频片段。
其中,目标语音起止点与目标字幕起止点不一致,可以包括如下情况:
情况一:目标语音起止点包括一组或者多组目标字幕起止点。如图4所示,目标语音起止点包括“00:00:28”、“00:00:29”、“00:00:30”、“00:00:31”,而目标字幕起止点可以包括“00:00:28”、“00:00:31”。此时,切分模块1032可以将该候选视频片段(也即“00:00:28”至“00:00:31”的视频片段)作为最终切分得到的目标视频片段。
情况二:目标语音起止点与目标字幕起止点并不对齐,如语音超前或者字幕超前等,此时,切分模块1032可以先根据目标字幕起止点对待处理视频进行切分,得到候选视频片段,然后,再根据目标语音起止点对切分得到的候选视频片段进行调整,得到所需的目标视频片段。如此,可以避免切分得到的目标视频片段中出现字幕对应的 语音内容不完整的问题。
其中,当语音超前时,即目标语音起始点对应的时刻早于目标字幕起始点对应的时刻,此时,基于目标字幕起止点所切分得到的候选视频片段的音频中的语音内容对应于该候选视频片段中的部分字幕,也即候选视频片段的音频中存在部分语音内容缺失,这使得若将该候选视频片段作为目标视频片段,则会导致最终生成的视频语料中的语音内容不完整,从而影响视频语料的质量。基于此,针对于候选视频片段,切分模块1032还可以根据目标语音起始点(或目标语音终止点),确定目标语音起始点相对于目标字幕起始点的超前时长(或目标语音终止点相对于目标字幕终止点的超前时长),并且,由于待处理视频中的相邻两条字幕之间通常可以存在不具有字幕的视频图像,因此,切分模块1032可以根据该超前时长在候选视频片段之前选取连续多帧不具有字幕的视频图像,并将该多帧视频图像对应的视频片段划入候选视频片段中,所选取的视频图像的播放时长为该超前时长。如此,可以使得候选视频片段的起始点前移,从而所得到的新的候选视频片段包括选取的连续多帧的视频图像对应的视频片段与之前分割得到的候选视频片段,并将该新的候选视频片段作为最终切片得到的目标视频片段。具体的,该新的候选视频片段的起始点为该候选视频片段中的音频对应的目标语音起始点,该新的候选视频片段的终止点为该候选视频片段中的字幕对应的目标字幕终止点。
举例来说,假设根据目标字幕起止点对待处理视频进行分割,可以得到如图3所示的字幕为“哦,我们在买票的时候发现那部电影的票已经卖完了”的候选视频片段,其对应的起始点为“00:00:28”、终止点为“00:00:31”。若候选视频片段中字幕对应的音频超前0.5秒,则切分模块1032可以将候选视频片段的起始点前移0.5秒,从而得到起始点为“00:00:27.50”、终止点为“00:00:31”的新的候选视频片段,以使得该新的候选视频片段的音频中的语音内容与字幕保持一致。
当语音滞后时,即目标语音起始点对应的时刻晚于目标字幕起始点对应的时刻,此时,切分模块1032针对基于目标字幕起止点分割得到的候选视频片段,可以根据目标语音起始点重新确定该候选视频片段的起始点,如将该目标语音起始点作为该候选视频片段的起始点。并且,当目标语音终止点不晚于目标字幕终止点时,该候选视频片段的终止点仍为目标字幕终止点;而当目标语音终止点晚于目标字幕终止点时,则切分模块1032可以根据目标语音终止点重新确定该候选视频片段的终止点,如可以先确定语音滞后的时长,从而可以在候选视频片段的终止点开始,连续选取多帧视频图像,并且所选取的多帧视频图像的播放时长为该语音滞后的时长,以此得到新的候选视频片段,并将该新的候选视频片段作为最终切片得到的目标视频片段。其中,该新的候选视频片段的起始点为该候选视频片段中的音频对应的目标语音起始点,该新的候选视频片段的终止点为该音频对应的目标语音终止点。如此,可以实现视频片段中字幕与语音内容的对齐。
仍以切分模块1032切分得到图3所示的字幕为“哦,我们在买票的时候发现那部电影的票已经卖完了”的候选视频片段为例,假设候选视频片段中字幕对应的音频滞后0.5秒,则切分模块1032可以将候选视频片段的起始点后移0.5秒。此时,若语音 终止点不晚于字幕终止点,则新的候选视频片段的起始点为“00:00:28.50”、终止点为“00:00:31”;而若语音终止点晚于字幕终止点,假设语义终止点为“00:00:31.30”,则新的候选视频片段的起始点为“00:00:28.50”、终止点为“00:00:31.30”。
本实施例中,识别模块1034在识别目标字幕起止点时,可以是根据视频图像之间的差异进行确定。具体实现时,识别模块1034可以先确定待处理视频中的视频图像上的字幕显示区域。通常情况下,待处理视频中的字幕在视频图像上的显示区域(以下简称为字幕显示区域)通常固定,如位于视频图像上的下方等。然后,识别模块1034可以通过依次比较待处理视频的多帧视频图像中相邻两帧视频图像的字幕显示区域之间的差异,以此确定出该待处理视频的多个字幕起止点。示例性地,针对相邻两帧视频图像,识别模块1034可以截取这两帧视频图像中的字幕显示区域,并比对这两个字幕显示区域之间的差异。若这两个字幕显示区域之间的差异较小,如差异程度小于预设阈值,则识别模块1034可以确定这两帧视频图像中显示的字幕没有发生变化,即两帧视频图像上所显示的字幕相同(当然也可能均都不存在字幕,可以进一步通过图像检测等方式确定这两帧视频图像上是否存在字幕);而若这两个字幕显示区域之间的差异较大,如差异程度大于预设阈值,则识别模块1034可以确定这两帧视频图像中显示的字幕发生变化,相应的,这两帧视频图像中的其中一帧图像即可作为相应的字幕起始点或者字幕终止点。当然,实际应用时,识别模块1034也可以是基于其它方式确定出待处理视频的字幕起止点,本实施例对此并不进行限定。
进一步的,识别模块1034在确定视频图像上的字幕显示区域时,可以通过自动化检测的方式,确定字幕显示区域。例如,识别模块1034可以从待处理视频包括的多帧视频中随机采样n帧视频图像(n为正整数并且取值小于视频图像总帧数),得到采样视频图像,然后,识别模块1034可以通过光学字符识别(optical character recognition,OCR)技术识别该n帧采样视频图像中的字幕,并统计该字幕在各帧采样视频图像中的大致区域,从而得到采样视频图像上的字幕显示区域,如可以是将统计得到的最大区域作为字幕显示区域等。进一步的,当不同帧采样视频图像中的字幕显示区域不同时,比如对于影视类视频,其字幕在视频图像上的显示位置可能位于视频图像的下方,也可能位于视频图像的右上方等,此时,识别模块1034可以将这两个区域均作为字幕显示区域,或者识别模块1034可以统计n帧采样视频图像中显示字幕最多的区域,并将该区域作为字幕显示区域。实际应用时,识别模块1034也可以是采用其它方式确定字幕显示区域,本实施例对此并不进行限定。
或者,在其他识别字幕起止点的实施方式中,识别模块1034也可以是通过依次比较待处理视频中相邻两帧的整个视频图像之间的差异,识别字幕起止点。本实施例中对于识别模块1034如何识别字幕起止点的具体实现过程并不进行限定。
实际应用时,识别模块1034对于目标字幕起止点以及目标语音起止点的识别精度,可能分别受待处理视频中的视频图像画面以及音频内容影响。比如,当视频图像中的字幕显示区域的背景颜色与字幕颜色相似时,可能会导致识别模块1034难以识别出该视频图像上的字幕,从而导致识别模块1034无法识别出这条字幕对应的字幕起止点。又比如,当音频内容中同时包括人物说话声音以及噪音等,则噪音的存在可能导致识 别模块1034难以识别出人物说话声音,从而导致识别模块1034难以识别出人物说话声音对应的语音起止点。为此,本实施例中,当切分模块1032确定语音起止点与字幕起止点之间的重合率达到预设的重合率阈值(如90%等)时,可以按照结合字幕起止点以及语音内容的语音起止点切分待处理视频。而当切分模块1032确定语音起止点与字幕起止点之间的重合率达到一定阈值(如90%等)时,可以仅按照语音内容的语音起止点切分待处理视频等。
进一步的,上述重合率阈值还可以由用户进行设定。比如,视频语料生成装置103可以向用户呈现参数设置界面,从而用户可以在该参数设置界面中对语音起止点与字幕起止点之间的重合率阈值进行设置。实际应用场景中,用户可以根据待处理视频所属的视频类型,决定该重合率阈值的具体取值。比如,对于音乐类型的待处理视频,其包括的音频中的音乐声音通常会对语音内容产生干扰,从而影响识别模块1034识别语音起止点的准确度,此时,用户可以降低重合率阈值的取值,如设定重合率阈值为85%等。而对于纯人声类型的待处理视频,其包括的音频中的干扰声音通常较少,对于识别模块1034识别语音起止点的准确度影响较小,因此,用户可以增大重合率阈值的取值,如设定重合率阈值为95%等。
并且,上述从待处理视频中获取得到目标视频片段的过程,还可以通过相应的硬件进行加速,如可以通过在图像处理方面具有较高性能的图形处理器(graphics processing unit,GPU)进行处理,当然,也可以是采用性能相对较低的CPU进行处理等。为此,在一些可能的实施方式中,视频语料生成装置103可以在与用户的交互界面中呈现是否进行硬件加速的提示信息,以便由用户在该交互界面上选择是否采用硬件加速的方式来加快从待处理视频中获取目标视频片段的过程,从而加快生成视频语料的过程。
S203:标注模块1033将目标视频片段中的视频图像包括的字幕作为目标视频片段的标注文本,得到视频语料。
在切分模块1032从待处理视频中获取目标视频片段后,标注模块1033可以为该目标视频片段自动添加标注文本。本实施例中,标注模块1033为目标视频片段添加的标注文本为该视频片段的视频图像上所显示的字幕。作为一种实现示例,在为目标视频片段添加标注文本时,标注模块1033可以调用识别模块1034来识别目标视频片段中的视频图像上的字幕。识别模块1034可以通过OCR技术,对目标视频片段的视频图像上的字幕进行识别,得到相应的字幕文本,并将其反馈给标注模块1033。标注模块1033可以将接收到的字幕文本作为标注文本对视频片段进行标注,以此生成包括标注语音、音频以及视频图像的视频语料。由于目标视频片段中的字幕,是预先由视频编辑者在制作视频的过程中根据语音内容人工添加至视频中,因此,该字幕与语音内容的一致性较高,从而标注模块1033将目标视频片段中的字幕作为标注文本,可以提高目标视频片段的标注文本的准确性。
值得注意的是,本实施例中是以识别模块1034在切分得到的目标视频片段后,再对目标视频片段上的字幕进行识别为例进行说明,在其它可能的实现方式中,识别模块1034也可以是先识别待处理视频上的字幕,得到整个待处理视频的字幕文本,并且 该字幕文本可以记录有不同字幕各自对应的显示时间点。然后,再由切分模块1032完成对待处理视频的切分。这样,当标注模块1033需要获取目标视频片段对应的字幕文本时,可以根据该目标视频片段在待处理视频中的播放时间段,查找字幕文本在该播放时间段内所显示的字幕,以此得到该目标视频片段对应的字幕文本。本实施例中,对于识别模块1034识别字幕以及切分模块1032切分待处理视频的执行顺序,并不进行限定。
上述实施方式中,视频语料生成装置103基于包括字幕的待处理视频生成视频语料,而在其它可能的实施方式中,当待处理视频不包括字幕时,视频语料生成装置103基于该待处理视频也可以生成带有标注文本的视频语料。作为一种实现示例,在视频获取模块1031获取到待处理视频后,切分模块1032可以根据待处理视频包括的音频中的语音内容,对待处理视频进行切分,得到一个或者多个具有语音内容的视频片段。其中,切分模块1032根据音频切分待处理视频的具体实现方式可参见前述相关之处描述。然后,标注模块1033在为每个视频片段添加标注文本时,可以调用识别模块1034针对每个视频片段中的音频进行语音识别,并利用句子边界检测技术确定语音内容中的各句话,以此可以得到各个视频片段的语音内容对应的语音识别文本,从而标注模块1033可以将各个视频片段对应的语音识别文本作为该视频判断的标注文本,生成视频语料。
进一步地,视频语料生成装置103在生成视频语料后,还可以将该视频语料呈现给用户,以便由用户对该视频语料中的视频图像、音频以及标注文本进行人工校验。这样,当生成的视频语料中存在少量质量较低的视频语料时,可以由用户对该部分视频语料进行人工校正,以进一步提高生成的视频语料的质量。
实际应用时,视频语料生成装置103基于待处理视频所生成的一个或者多个视频语料,可以用于语音识别、语音生成、机器翻译、数字虚拟机人构建以及情感分析等场景中。示例性地,视频语料生成装置103可以向用户呈现任务配置界面,该任务配置界面中可以提示用户输入针对视频语料的训练任务,如图5所示,可以在该任务配置上呈现“请输入训练任务”的提示信息。并且,为了方便用户对于训练任务的输入,任务配置界面中还可以呈现有多个训练任务的候选项,如图5所示的语音识别、语音生成、机器翻译、数字虚拟机人构建以及情感分析等训练任务等,从而用户可以在该任务配置界面中对呈现的训练任务进行选择,以便视频语料生成装置103可以获取用户所选择的训练任务,并基于生成的视频语料执行该训练任务。或者,在其他实施方式中,用户也可以是直接在任务配置界面上手动输入训练任务的名称。本实施例中,对于视频语料生成装置103获取用户指定的训练任务的实现方式并不进行限定。
作为一种应用示例,在语音识别场景中,视频语料生成装置103所生成的带有标注文本、音频以及视频图像的视频语料,可以用于对预先构建的语音识别模型进行训练。具体实现时,可以将该视频语料中的音频作为语音识别模型的输入,将视频语料的标注文本作为语音识别模型的输出,以此对语音识别模型进行训练。可选地,在实现利用语音识别模型识别地域性发音(如通常所说的方言等)的音频时,可以通过视频语料生成装置103生成具有地域性口音的视频语料,即该视频语料中音频包括的语 音为基于地域性发音的语音,从而利用该视频语料对语音识别模型进行训练后,该语音识别模型能够针对该地域性发音的音频或包括该音频的视频(如方言剧或者地方新闻视频等),识别出相应的语音文本,以此实现语音识别。
作为又一种应用实例,在语音生成场景中视频语料生成装置103所生成的视频语料,可以用于对预先构建的语音生成模型进行训练。其中,自动化的语音生成可以理解为语音识别的逆向过程,即基于特定的文本生成对应的语音。具体实现时,可以将该视频语料中的标注文本作为语音生成模型的输入,将该视频语料中的音频作为语音生成模型的输出,以此完成对语音生成模型的训练。实际应用时,训练得到的语音生成模型,可以在有声小说、数字虚拟人、语音助手、智能音响等领域中,根据输入的文本输出该文本所对应的语音。可选地,在训练语音生成模型时,可以利用包括特定人物角色的语音的视频语料对该语音生成模型进行训练,从而后续基于训练得到的语音生成模型可以生成包括该人物角色的多条语音,如利用该语音生成模型生成该人物角色播报导航路线的语音等。
作为再一种应用实例,在机器翻译场景中,视频语料中的标注文本可以包括具有相同含义的基于多个语种的文本(如视频语料中的字幕为中英双语字幕等),以包括第一语种的文本以及第二语种的文本为例。此时,可以从该视频语料的标注文本中分离出多个语种的文本。由于该多个语种的文本通常具有相同的语义,因此,可以利用该标注文本对机器翻译模型进行训练,以提高机器翻译模型在基于一个语种的语音翻译得到另一个语种的语音的准确性。
作为再一种应用实例,在构建数字虚拟人的场景中,可以利用多模态话者检测技术,从视频语料的视频图像中定位出该发音的人物角色,并从该视频图像中检测得到该人物角色在发音时的人脸信息,如脸部表情、脸部动作等信息,从而可以根据该人脸信息、该视频语料包括的音频以及标注文本,生成数字虚拟人。这样,在数字虚拟人与用户进行对话时,若其对话内容与该标注文本的语义相同,则可以根据该视频语料的视频图像中的人脸信息,拟合出数字虚拟人与用户进行对话的面部表情以及对话音频,从而实现更加智能化的人机交互。
当然,上述场景实例仅作为本实施例提供的一些示例性说明,实际应用时,视频语料还可以用于更多其它可使用的场景中,如基于该视频语料进行多模态情感分析、多模态视频分类等,本实施例对此并不进行限定。
上文结合图1至图5对本申请实施例提供的生成视频语料的方法进行了详细介绍,下面将结合附图从功能单元的角度对本申请实施例提供的视频语料生成装置进行介绍。
参见图6所示的视频语料生成装置的结构示意图,该装置600包括:
视频获取模块601,用于获取待处理视频,所述待处理视频对应语音内容,所述待处理视频的部分视频图像包括所述语音内容对应的字幕;
切分模块602,用于根据所述语音内容,从所述待处理视频中获取目标视频片段;
标注模块603,用于将所述目标视频片段中的视频图像包括的字幕作为所述目标 视频片段的标注文本,得到视频语料。
示例性地,本实施例中的视频获取模块601所执行的功能与前述实施例中视频获取模块1031所执行的功能类似,具体可参见前述实施例的相关之处描述,在此不做赘述。类似的,本本实施中的切分模块602以及标注模块603所执行的具体功能,可参见前述实施例切分模块1032以及标注模块1033。
在一种可能的实施方式中,所述切分模块602,具体用于:
识别所述语音内容的目标语音起止点,所述目标语音起止点包括目标语音起始点和所述目标语音起始点对应的目标语音终止点;
根据所述目标语音起止点,从所述待处理视频中获取所述目标视频片段。
在一种可能的实施方式中,所述切分模块602,具体用于:
识别所述语音内容对应的字幕的目标字幕起止点,所述目标字幕起止点包括目标字幕起始点和所述目标字幕起始点对应的目标字幕终止点;
根据目标字幕起止点,从所述待处理视频中获取候选视频片段;
当所述目标语音起止点与所述目标字幕起止点不一致时,根据所述目标语音起止点,对所述候选视频片段进行调整,得到所述目标视频片段。
在一种可能的实施方式中,所述切分模块602,具体用于根据所述字幕的字幕显示区域,确定所述目标字幕起止点。
在一种可能的实施方式中,所述装置600还包括:
视频语料应用模块604,用于利用所述视频语料中音频以及标注文本,完成语音识别模型的训练;或者,利用所述视频语料中音频以及标注文本,完成语音生成模型的训练。
在一种可能的实施方式中,所述视频语料的标注文本包括第一语种的文本以及第二语种的文本,所述装置600还包括:
视频语料应用模块604,用于利用所述第一语种的文本以及所述第二语种的文本,完成机器翻译模型的训练。
在一种可能的实施方式中,所述装置600还包括:
信息获取模块605,用于获取所述视频语料的视频图像中的人脸信息;
视频语料应用模块604,用于根据所述人脸信息、所述视频语料包括的音频以及所述视频语料的标注文本,生成数字虚拟人。
在一种可能的实施方式中,所述装置600还包括:
呈现模块606,用于呈现任务配置界面;
信息获取模块605,用于获取用户在所述任务配置界面针对所述视频语料的训练任务。
根据本申请实施例的视频语料生成装置600可对应于执行本申请实施例中描述的方法,并且视频语料生成装置600的各个模块的上述和其它操作和/或功能分别为了实现图2中视频语料生成装置103所执行的各个方法的相应流程,为了简洁,在此不再赘述。
上述各实施例中,生成视频语料的过程也可以以单独的硬件设备实现。下面,对实现生成视频语料的过程的计算设备进行详细介绍。
图7提供了一种计算设备的结构示意图。图7所示的计算设备700具体可以用于实现上述图2所示实施例中视频语料生成装置103的功能,或图6所示实施例中视频语料生成装置600的功能。
计算设备700包括总线701、处理器702、通信接口703和存储器704。处理器702、存储器704和通信接口703之间通过总线701通信。总线701可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图7中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。通信接口703用于与外部通信,例如接收软件开发者面功能网元发送的目标业务请求等。
其中,处理器702可以为中央处理器(central processing unit,CPU)。存储器704可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)。存储器704还可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM),快闪存储器,HDD或SSD。
存储器704中存储有可执行代码,处理器702执行该可执行代码以执行前述视频语料生成装置103或者视频语料生成装置600所执行的方法。
具体地,在实现图2所示实施例的情况下,执行图2中的视频语料生成装置103的功能所需的软件或程序代码存储在存储器704中,计算设备700与其它设备的交互通过通信接口703实现,如计算设备700通过通信接口703获取数据源中的多条待处理数据等。处理器用于执行存储器704中的指令,实现视频语料生成装置600所执行的方法。
此外,本申请实施例还提供了一种计算机可读存储介质,该计算机可读存储介质中存储有指令,当其在计算机设备上运行时,使得计算机设备执行上述实施例视频语料生成装置103所执行的方法。
此外,本申请实施例还提供了一种计算机程序产品,所述计算机程序产品被计算机执行时,所述计算机执行前述数据提供方法的任一方法。该计算机程序产品可以为一个软件安装包,在需要使用前述数据提供方法的任一方法的情况下,可以下载该计算机程序产品并在计算机上执行该计算机程序产品。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或 者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (19)

  1. 一种生成视频语料的方法,其特征在于,所述方法包括:
    获取待处理视频,所述待处理视频对应语音内容,所述待处理视频的部分视频图像包括所述语音内容对应的字幕;
    根据所述语音内容,从所述待处理视频中获取目标视频片段;
    将所述目标视频片段中的视频图像包括的字幕作为所述目标视频片段的标注文本,得到视频语料。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述语音内容,从所述待处理视频中获取目标视频片段,包括:
    识别所述语音内容的目标语音起止点,所述目标语音起止点包括目标语音起始点和所述目标语音起始点对应的目标语音终止点;
    根据所述目标语音起止点,从所述待处理视频中获取所述目标视频片段。
  3. 根据权利要求2所述的方法,其特征在于,所述根据所述目标语音起止点,从所述待处理视频中获取所述目标视频片段,包括:
    识别所述语音内容对应的字幕的目标字幕起止点,所述目标字幕起止点包括目标字幕起始点和所述目标字幕起始点对应的目标字幕终止点;
    根据目标字幕起止点,从所述待处理视频中获取候选视频片段;
    当所述目标语音起止点与所述目标字幕起止点不一致时,根据所述目标语音起止点,对所述候选视频片段进行调整,得到所述目标视频片段。
  4. 根据权利要求3所述的方法,其特征在于,所述识别所述语音内容对应的字幕的目标字幕起止点,包括:
    根据所述字幕的字幕显示区域,确定所述目标字幕起止点。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述方法还包括:
    利用所述视频语料中音频以及标注文本,完成语音识别模型的训练;或者,
    利用所述视频语料中音频以及标注文本,完成语音生成模型的训练。
  6. 根据权利要求1至5任一项所述的方法,其特征在于,所述视频语料的标注文本包括第一语种的文本以及第二语种的文本,所述方法还包括:
    利用所述第一语种的文本以及所述第二语种的文本,完成机器翻译模型的训练。
  7. 根据权利要求1至6任一项所述的方法,其特征在于,所述方法还包括:
    获取所述视频语料的视频图像中的人脸信息;
    根据所述人脸信息、所述视频语料包括的音频以及所述视频语料的标注文本,生成数字虚拟人。
  8. 根据权利要求1至7任一项所述的方法,其特征在于,所述方法还包括:
    呈现任务配置界面;
    获取用户在所述任务配置界面针对所述视频语料的训练任务。
  9. 一种生成视频语料的装置,其特征在于,所述装置包括:
    视频获取模块,用于获取待处理视频,所述待处理视频对应语音内容,所述待处理视频的部分视频图像包括所述语音内容对应的字幕;
    切分模块,用于根据所述语音内容,从所述待处理视频中获取目标视频片段;
    标注模块,用于将所述目标视频片段中的视频图像包括的字幕作为所述目标视频片段的标注文本,得到视频语料。
  10. 根据权利要求9所述的装置,其特征在于,所述切分模块,具体用于:
    识别所述语音内容的目标语音起止点,所述目标语音起止点包括目标语音起始点和所述目标语音起始点对应的目标语音终止点;
    根据所述目标语音起止点,从所述待处理视频中获取所述目标视频片段。
  11. 根据权利要求10所述的装置,其特征在于,所述切分模块,具体用于:
    识别所述语音内容对应的字幕的目标字幕起止点,所述目标字幕起止点包括目标字幕起始点和所述目标字幕起始点对应的目标字幕终止点;
    根据目标字幕起止点,从所述待处理视频中获取候选视频片段;
    当所述目标语音起止点与所述目标字幕起止点不一致时,根据所述目标语音起止点,对所述候选视频片段进行调整,得到所述目标视频片段。
  12. 根据权利要求11所述的装置,其特征在于,所述切分模块,具体用于根据所述字幕的字幕显示区域,确定所述目标字幕起止点。
  13. 根据权利要求9至12任一项所述的装置,其特征在于,所述装置还包括:
    视频语料应用模块,用于利用所述视频语料中音频以及标注文本,完成语音识别模型的训练;或者,利用所述视频语料中音频以及标注文本,完成语音生成模型的训练。
  14. 根据权利要求9至13任一项所述的装置,其特征在于,所述视频语料的标注文本包括第一语种的文本以及第二语种的文本,所述装置还包括:
    视频语料应用模块,用于利用所述第一语种的文本以及所述第二语种的文本,完成机器翻译模型的训练。
  15. 根据权利要求9至14任一项所述的装置,其特征在于,所述装置还包括:
    信息获取模块,用于获取所述视频语料的视频图像中的人脸信息;
    视频语料应用模块,用于根据所述人脸信息、所述视频语料包括的音频以及所述视频语料的标注文本,生成数字虚拟人。
  16. 根据权利要求9至15任一项所述的装置,其特征在于,所述装置还包括:
    呈现模块,用于呈现任务配置界面;
    信息获取模块,用于获取用户在所述任务配置界面针对所述视频语料的训练任务。
  17. 一种计算机设备,其特征在于,所述计算机设备包括处理器和存储器;
    所述处理器用于执行所述存储器中存储的指令,以使得所述计算机设备执行权利要求1至8中任一项所述的方法。
  18. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有指令,当其在计算设备上运行时,使得所述计算设备执行如权利要求1至8任一项所述的方法。
  19. 一种包含指令的计算机程序产品,当其在计算设备上运行时,使得所述计算设备执行如权利要求1至8任一项所述的方法。
PCT/CN2022/087908 2021-04-29 2022-04-20 生成视频语料的方法、装置及相关设备 WO2022228235A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22794706.6A EP4322029A1 (en) 2021-04-29 2022-04-20 Method and apparatus for generating video corpus, and related device
US18/496,250 US20240064383A1 (en) 2021-04-29 2023-10-27 Method and Apparatus for Generating Video Corpus, and Related Device

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110471260.7 2021-04-29
CN202110471260 2021-04-29
CN202110905684.XA CN115269884A (zh) 2021-04-29 2021-08-06 生成视频语料的方法、装置及相关设备
CN202110905684.X 2021-08-06

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/496,250 Continuation US20240064383A1 (en) 2021-04-29 2023-10-27 Method and Apparatus for Generating Video Corpus, and Related Device

Publications (1)

Publication Number Publication Date
WO2022228235A1 true WO2022228235A1 (zh) 2022-11-03

Family

ID=83758632

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/087908 WO2022228235A1 (zh) 2021-04-29 2022-04-20 生成视频语料的方法、装置及相关设备

Country Status (4)

Country Link
US (1) US20240064383A1 (zh)
EP (1) EP4322029A1 (zh)
CN (1) CN115269884A (zh)
WO (1) WO2022228235A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117240983A (zh) * 2023-11-16 2023-12-15 湖南快乐阳光互动娱乐传媒有限公司 一种自动生成有声剧的方法及装置

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468054B (zh) * 2023-04-26 2023-11-07 中央民族大学 基于ocr技术辅助构建藏汉音译数据集的方法及系统
CN116229943B (zh) * 2023-05-08 2023-08-15 北京爱数智慧科技有限公司 一种对话式数据集的生成方法和装置

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6263308B1 (en) * 2000-03-20 2001-07-17 Microsoft Corporation Methods and apparatus for performing speech recognition using acoustic models which are improved through an interactive process
US20060095264A1 (en) * 2004-11-04 2006-05-04 National Cheng Kung University Unit selection module and method for Chinese text-to-speech synthesis
CN104780388A (zh) * 2015-03-31 2015-07-15 北京奇艺世纪科技有限公司 一种视频数据的切分方法和装置
CN109858427A (zh) * 2019-01-24 2019-06-07 广州大学 一种语料提取方法、装置及终端设备
CN110427930A (zh) * 2019-07-29 2019-11-08 中国工商银行股份有限公司 多媒体数据处理方法及装置、电子设备和可读存储介质
WO2020155750A1 (zh) * 2019-01-28 2020-08-06 平安科技(深圳)有限公司 基于人工智能的语料收集方法、装置、设备及存储介质
CN111881900A (zh) * 2020-07-01 2020-11-03 腾讯科技(深圳)有限公司 语料生成、翻译模型训练、翻译方法、装置、设备及介质
CN112201225A (zh) * 2020-09-30 2021-01-08 北京大米科技有限公司 一种语料获取的方法、装置、可读存储介质和电子设备
CN112418172A (zh) * 2020-12-11 2021-02-26 苏州元启创人工智能科技有限公司 基于多模信息智能处理单元的多模信息融合情感分析方法

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6263308B1 (en) * 2000-03-20 2001-07-17 Microsoft Corporation Methods and apparatus for performing speech recognition using acoustic models which are improved through an interactive process
US20060095264A1 (en) * 2004-11-04 2006-05-04 National Cheng Kung University Unit selection module and method for Chinese text-to-speech synthesis
CN104780388A (zh) * 2015-03-31 2015-07-15 北京奇艺世纪科技有限公司 一种视频数据的切分方法和装置
CN109858427A (zh) * 2019-01-24 2019-06-07 广州大学 一种语料提取方法、装置及终端设备
WO2020155750A1 (zh) * 2019-01-28 2020-08-06 平安科技(深圳)有限公司 基于人工智能的语料收集方法、装置、设备及存储介质
CN110427930A (zh) * 2019-07-29 2019-11-08 中国工商银行股份有限公司 多媒体数据处理方法及装置、电子设备和可读存储介质
CN111881900A (zh) * 2020-07-01 2020-11-03 腾讯科技(深圳)有限公司 语料生成、翻译模型训练、翻译方法、装置、设备及介质
CN112201225A (zh) * 2020-09-30 2021-01-08 北京大米科技有限公司 一种语料获取的方法、装置、可读存储介质和电子设备
CN112418172A (zh) * 2020-12-11 2021-02-26 苏州元启创人工智能科技有限公司 基于多模信息智能处理单元的多模信息融合情感分析方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117240983A (zh) * 2023-11-16 2023-12-15 湖南快乐阳光互动娱乐传媒有限公司 一种自动生成有声剧的方法及装置
CN117240983B (zh) * 2023-11-16 2024-01-26 湖南快乐阳光互动娱乐传媒有限公司 一种自动生成有声剧的方法及装置

Also Published As

Publication number Publication date
CN115269884A (zh) 2022-11-01
EP4322029A1 (en) 2024-02-14
US20240064383A1 (en) 2024-02-22

Similar Documents

Publication Publication Date Title
WO2022228235A1 (zh) 生成视频语料的方法、装置及相关设备
CN110517689B (zh) 一种语音数据处理方法、装置及存储介质
US10878824B2 (en) Speech-to-text generation using video-speech matching from a primary speaker
US9158753B2 (en) Data processing method, presentation method, and corresponding apparatuses
US9066049B2 (en) Method and apparatus for processing scripts
CN111050201B (zh) 数据处理方法、装置、电子设备及存储介质
WO2022134698A1 (zh) 视频处理方法及装置
CN114465737B (zh) 一种数据处理方法、装置、计算机设备及存储介质
CN113035199B (zh) 音频处理方法、装置、设备及可读存储介质
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
US20220343100A1 (en) Method for cutting video based on text of the video and computing device applying method
CN110517668A (zh) 一种中英文混合语音识别系统及方法
US20230325611A1 (en) Video translation platform
WO2023124647A1 (zh) 一种纪要确定方法及其相关设备
CN114996506A (zh) 语料生成方法、装置、电子设备和计算机可读存储介质
TWI769520B (zh) 多國語言語音辨識及翻譯方法與相關的系統
CN111161710A (zh) 同声传译方法、装置、电子设备及存储介质
CN113470617B (zh) 语音识别方法以及电子设备、存储装置
CN112233661B (zh) 基于语音识别的影视内容字幕生成方法、系统及设备
TWI684964B (zh) 知識點標記生成系統及其方法
KR20210081308A (ko) 비디오를 처리하는 방법, 장치, 전자 기기 및 저장 매체
CN113033357A (zh) 基于口型特征的字幕调整方法以及装置
WO2021161908A1 (ja) 情報処理装置及び情報処理方法
CN112241462B (zh) 知识点标记生成系统及其方法
WO2023273702A1 (zh) 一种语音信息与演示信息同步的方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22794706

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022794706

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022794706

Country of ref document: EP

Effective date: 20231106

NENP Non-entry into the national phase

Ref country code: DE