CN114143479B - Video abstract generation method, device, equipment and storage medium

Info

Publication number
CN114143479B
CN114143479B (application CN202111436728.5A)
Authority
CN
China
Prior art keywords
video
target
voice data
text
target object
Prior art date
Legal status
Active
Application number
CN202111436728.5A
Other languages
Chinese (zh)
Other versions
CN114143479A (en)
Inventor
刘钊
Current Assignee
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202111436728.5A priority Critical patent/CN114143479B/en
Publication of CN114143479A publication Critical patent/CN114143479A/en
Application granted granted Critical
Publication of CN114143479B publication Critical patent/CN114143479B/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265 Mixing

Abstract

The embodiment of the application relates to the field of artificial intelligence and discloses a method, a device, equipment and a storage medium for generating a video abstract. The method comprises the steps of obtaining a target text explanation video and dividing it into a plurality of explanation video segments; selecting key video segments from the explanation video segments; extracting text explanation audio and text explanation images of the key video segments; acquiring first voice data of a target object in each key video segment; acquiring second voice data corresponding to the target object; determining target voice data of the target object according to the first voice data and the second voice data; acquiring target text information according to the target voice data; generating a video abstract segment corresponding to each key video segment according to the text explanation image, the target voice data and the target text information corresponding to that key video segment; and splicing the video abstract segments to generate a video abstract corresponding to the target text explanation video.

Description

Video abstract generation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating a video abstract.
Background
With the development of internet and multimedia technology, digital video such as news, advertisements, television, movies and webcasts has emerged in large quantities. Whether for learning or for social entertainment, users are surrounded by massive amounts of video, and quickly finding videos of interest among them is not easy. Video summaries were therefore developed: as the name suggests, they are brief representations of video content that help users quickly understand a video and decide whether to watch it in detail, and that also support indexing and querying in video databases.
In the prior art, in order to quickly generate a video abstract, the video abstract is usually obtained by picking a plurality of video frames from the video randomly or at equal frame intervals, and simply combining the picked video frames.
However, the video summary generated in this way cannot identify whether the material has quality problems, and the quality of the generated video summary is difficult to guarantee.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a method, an apparatus, a device, and a storage medium for generating a video abstract, which are aimed at improving the quality of generating the video abstract and improving the user experience.
In a first aspect, an embodiment of the present application provides a method for generating a video summary, including:
acquiring a target text comment video and dividing the target text comment video into a plurality of comment video fragments;
selecting a key video segment from a plurality of the comment video segments according to the correlation degree of each comment video segment and the explanation of the target document, wherein each key video segment comprises a target object for explaining the target document;
extracting text explanation audio and text explanation images corresponding to the key video clips, acquiring first voice data of the target object in the key video clips according to the text explanation audio, acquiring a plurality of mouth-shaped changing images of the target object in the key video clips according to the text explanation images, and acquiring second voice data corresponding to the target object according to the plurality of mouth-shaped changing images;
determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to obtain target text information;
And generating a video abstract segment corresponding to the key video segment according to the text comment image, the target voice data and the target text information corresponding to each key video segment, and splicing the video abstract segments to generate a video abstract corresponding to the target text comment video.
In a second aspect, an embodiment of the present application further provides a device for generating a video summary, including:
the segment dividing module is used for acquiring a target text interpretation video and dividing the target text interpretation video into a plurality of interpretation video segments;
a segment screening module, configured to select a key video segment from a plurality of the comment video segments according to a correlation degree between each of the comment video segments and a target document explanation, where each of the key video segments includes a target object explaining the target document;
the voice extraction module is used for extracting text explanation audio and text explanation images corresponding to the key video clips, acquiring first voice data of the target object in the key video clips according to the text explanation audio, acquiring a plurality of mouth-shaped variation images of the target object in the key video clips according to the text explanation images, and acquiring second voice data corresponding to the target object according to the mouth-shaped variation images;
The text conversion module is used for determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to acquire target text information;
and the abstract generation module is used for generating a video abstract segment corresponding to the key video segment according to the text description image, the target voice data and the target text information corresponding to each key video segment, and splicing the video abstract segments to generate a video abstract corresponding to the target text description video.
In a third aspect, embodiments of the present application further provide an electronic device, the electronic device comprising a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus for enabling communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the steps of any method for generating a video summary provided in the embodiments of the present application.
In a fourth aspect, embodiments of the present application further provide a storage medium for computer readable storage, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of any of the methods for generating a video summary as provided in the present application.
The embodiment of the application provides a method, a device, equipment and a storage medium for generating a video abstract, wherein the method comprises the steps of obtaining a target text comment video and dividing the target text comment video into a plurality of comment video fragments; selecting a key video segment from a plurality of the comment video segments according to the correlation degree of each comment video segment and the explanation of the target document, wherein each key video segment comprises a target object for explaining the target document; extracting text explanation audio and text explanation images corresponding to the key video clips, acquiring first voice data of the target object in the key video clips according to the text explanation audio, acquiring a plurality of mouth-shaped changing images of the target object in the key video clips according to the text explanation images, and acquiring second voice data corresponding to the target object according to the plurality of mouth-shaped changing images; determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to obtain target text information; and generating a video abstract segment corresponding to the key video segment according to the text comment image, the target voice data and the target text information corresponding to each key video segment, and splicing the video abstract segments to generate a video abstract corresponding to the target text comment video. In the process of generating the video abstract corresponding to the target text interpretation video, the mouth shape variation of the target object is obtained to obtain the interpretation voice data of the target object, so that when the audio in the obtained target text interpretation video has problems, the voice obtained by the mouth shape variation can be used for compensation, thereby improving the generation quality of the video abstract and improving the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for generating a video summary according to an embodiment of the present application;
fig. 2 is a schematic block diagram of a video summary generating apparatus according to an embodiment of the present application;
fig. 3 is a schematic block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
With the development of internet and multimedia technology, digital video such as news, advertisements, television, movies and webcasts has emerged in large quantities. Whether for learning or for social entertainment, users are surrounded by massive amounts of video, and quickly finding videos of interest among them is not easy. Video summaries were therefore developed: as the name suggests, they are brief representations of video content that help users quickly understand a video and decide whether to watch it in detail, and that also support indexing and querying in video databases.
In the prior art, in order to quickly generate a video abstract, the video abstract is usually obtained by picking a plurality of video frames from the video randomly or at equal frame intervals, and simply combining the picked video frames.
The video abstract formed by the method has strong dependence on the quality of the original material, and if the voice in the text description video in the original material is blocked or is not clearly analyzed, the situation that the voice is blocked or is not clearly analyzed in the generated video abstract can be caused, so that the quality of the finally generated video abstract can not be effectively ensured, and the experience of a user is poor.
Therefore, in order to improve the generation quality of the video abstract and improve the user experience, the embodiment of the application provides a method, a device, equipment and a storage medium for generating the video abstract. The method for generating the video abstract can be applied to electronic equipment. The electronic device can be a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, a wearable device and other terminal devices, and can also be a server, wherein the server can be an independent server or a server cluster.
Specifically, the method for generating the video abstract comprises the steps of obtaining a target text comment video and dividing the target text comment video into a plurality of comment video fragments; selecting a key video segment from a plurality of the comment video segments according to the correlation degree of each comment video segment and the explanation of the target document, wherein each key video segment comprises a target object for explaining the target document; extracting text explanation audio and text explanation images corresponding to the key video clips, acquiring first voice data of the target object in the key video clips according to the text explanation audio, acquiring a plurality of mouth-shaped changing images of the target object in the key video clips according to the text explanation images, and acquiring second voice data corresponding to the target object according to the plurality of mouth-shaped changing images; determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to obtain target text information; and generating a video abstract segment corresponding to the key video segment according to the text comment image, the target voice data and the target text information corresponding to each key video segment, and splicing the video abstract segments to generate a video abstract corresponding to the target text comment video. In the process of generating the video abstract corresponding to the target text interpretation video, the mouth shape variation of the target object is obtained to obtain the interpretation voice data of the target object, so that when the audio in the obtained target text interpretation video has problems, the voice obtained by the mouth shape variation can be used for compensation, thereby improving the generation quality of the video abstract and improving the user experience.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a flowchart of a method for generating a video summary according to an embodiment of the present application.
As shown in fig. 1, the method for generating a video summary includes steps S1 to S5.
Step S1: a target narrative video is acquired and divided into a plurality of narrative video segments.
The target text explanation video can be acquired in various ways: for example, it can be read from a storage medium such as a hard disk, a USB flash disk or an SD card, or a video link of the target text explanation video can be obtained and the video downloaded through that link.
After the target text comment video is obtained, the target text comment video is divided into a plurality of comment video fragments, wherein the division of the comment video fragments can be a plurality of comment video fragments which divide the target text comment video into unit duration according to the total duration of the target text comment video, or the target text comment video can be randomly divided into a plurality of comment video fragments, which is not limited herein.
In some implementations, the dividing the target text narrative video into a plurality of narrative video segments includes:
and acquiring the total duration of the target text comment video, and dividing the target text comment video into N comment video fragments according to the total duration, wherein N is more than 2.
For example, if the total duration of the target text explanation video is 120 minutes, segments need to be selected from it in order to generate the video abstract; the 120-minute duration can be divided into 12 equal parts, that is, the 120-minute target text explanation video is equally divided into twelve 10-minute explanation video segments.
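As a rough illustration of this division, the sketch below computes equal-length segment boundaries from the total duration and cuts the source file. It assumes ffmpeg is available on the system; the file names and segment count are illustrative, not part of the original disclosure.

```python
import subprocess

def split_video(path: str, total_seconds: float, n_segments: int, out_prefix: str = "segment"):
    """Split an explanation video into n_segments equal-duration clips (illustrative sketch)."""
    seg_len = total_seconds / n_segments
    outputs = []
    for i in range(n_segments):
        out = f"{out_prefix}_{i:02d}.mp4"
        # -ss seeks to the segment start, -t limits the duration; stream copy avoids re-encoding.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(i * seg_len), "-i", path,
             "-t", str(seg_len), "-c", "copy", out],
            check=True,
        )
        outputs.append(out)
    return outputs

# Example from the text: a 120-minute video split into twelve 10-minute segments.
# clips = split_video("target_explanation_video.mp4", 120 * 60, 12)
```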
Step S2: and selecting a key video segment from a plurality of the comment video segments according to the correlation degree of each comment video segment and the explanation of the target document, wherein each key video segment comprises a target object for explaining the target document.
For example, after the target document explanation video is divided into a plurality of explanation video segments, some of the segments are strongly associated with the target document explanation, for example a lecturer explaining the document, while others are only weakly associated with it, for example transitional relaxation segments, extended digressions from the document explanation, question-and-answer segments, and the like.
Because the duration of the target document explanation in each segment may differ, and some explanation video segments may not even contain the explaining object, the correlation between such segments and the target document explanation video is weak. If these segments were used to build the video abstract, the abstract might not convey that the current video is a document explanation video. It is therefore important to screen out the key video segments from the target document explanation video.
In some embodiments, the selecting a key video snippet from a plurality of the commentary video snippets according to a degree of relevance of each of the commentary video snippets to the target document explanation includes:
judging whether the appearance time of a target object explaining the target document in the explanation video segment exceeds a preset time;
when the appearance time of the target object exceeds the preset time, acquiring caption information of the caption video clip, and carrying out type division on each piece of caption information according to keywords in the caption information;
the subtitle information is weighted and summed based on the subtitle type of each piece of subtitle information and the weighting coefficient of each subtitle type, so that the key degree of the comment video clip is obtained;
And selecting the preset number of the explanation video clips with the highest key degree as the key video clips.
The preset duration is set according to the duration of the explanation video clips. For example, if the duration of each explanation video clip is at least T1, the appearance time T2 of the target object explaining the target document in each clip is calculated through image recognition, where T2 ≤ T1. When T2 ≥ 0.6 T1, the appearance time of the target object in the clip is considered to exceed the preset time, so the explanation video clips that preliminarily meet the requirement can be screened out.
After the explanation video segments that preliminarily meet the requirement are screened out, the caption information of each such segment is obtained, the caption information is split into keywords, and each piece of caption information is assigned a caption type according to the split keywords. For example, mappings between keywords and caption types are stored in the electronic device: when a piece of caption information contains keyword A, keyword B and keyword C, it is classified as caption type A; when it contains keyword A, keyword B and keyword D, it is classified as caption type B; and when it contains keyword H, keyword E and keyword F, it is classified as caption type C.
After all caption information appearing in a video segment has been classified, the caption information is weighted and summed according to the weighting coefficients corresponding to its caption types to obtain the key degree of that segment, and then the preset number of segments with the highest key degree are selected as the key video segments. That is, the target document explanation video is divided into N explanation video segments, and the M segments with the highest key degree are selected from them as the key video segments, where N is greater than M; for example, the 3 most critical segments are selected from the 12 explanation video segments as the key video segments.
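As a rough sketch of the subtitle-type weighted summation, the code below assumes a hypothetical keyword-to-type table and weighting coefficients (the keyword sets, type names and weights are invented for the example) and also applies the 0.6·T1 presence check described above.

```python
from collections import Counter
from typing import Optional

# Hypothetical keyword-to-caption-type table and per-type weights (illustrative values).
TYPE_KEYWORDS = {
    "type_a": {"keyword_a", "keyword_b", "keyword_c"},
    "type_b": {"keyword_a", "keyword_b", "keyword_d"},
    "type_c": {"keyword_h", "keyword_e", "keyword_f"},
}
TYPE_WEIGHTS = {"type_a": 3.0, "type_b": 2.0, "type_c": 1.0}

def classify_subtitle(subtitle_keywords: set) -> Optional[str]:
    """Assign a subtitle to the first caption type whose keyword set it fully contains."""
    for sub_type, keywords in TYPE_KEYWORDS.items():
        if keywords <= subtitle_keywords:
            return sub_type
    return None

def segment_criticality(subtitles: list, presence_time: float, seg_duration: float) -> float:
    """Weighted sum of caption types; segments where the presenter appears
    for less than 60% of the clip are filtered out (criticality 0)."""
    if presence_time < 0.6 * seg_duration:
        return 0.0
    counts = Counter(t for s in subtitles if (t := classify_subtitle(s)) is not None)
    return sum(TYPE_WEIGHTS[t] * n for t, n in counts.items())
```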
In some embodiments, the selecting a key video snippet from a plurality of the commentary video snippets according to a degree of relevance of each of the commentary video snippets to the target document explanation includes:
judging whether the appearance time of a target object explaining the target document in the explanation video segment exceeds a preset time;
when the appearance time of the target object exceeds the preset time, acquiring document information of a target document appearing in each explanation video segment, and extracting keywords from the document information to acquire document keywords;
Acquiring the key degree of each comment video fragment according to the occurrence frequency of the corresponding text keywords in each comment video fragment and the number of the text keywords;
and selecting the preset number of the explanation video clips with the highest key degree as the key video clips.
For example, when the appearance time of the target object in the comment video fragment exceeds a preset time, the comment video fragment partially meeting the initial requirement can be screened out.
After the explanation video clips that preliminarily meet the requirement have been screened out, the video frames in which the target document appears in each clip are selected and converted into corresponding images; the text of the target document in those images is recognized through OCR text recognition, the recognized text is split into keywords to obtain a keyword set, and the document keywords are screened from this set by means of a preset keyword library.
For each screened explanation video segment, the number of document keywords and their occurrence frequency are counted. A correspondence between the key degree and the number and occurrence frequency of document keywords is preset, the key degree of each explanation video segment is obtained from the occurrence frequency and number of its document keywords, and the preset number of segments with the highest key degree are selected as the key video segments.
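A minimal sketch of this keyword-based key-degree computation is shown below; the weighting of occurrence frequency versus distinct-keyword count, and the function and variable names, are illustrative assumptions rather than the patent's concrete formula.

```python
def keyword_criticality(ocr_text: str, keyword_library: set,
                        freq_weight: float = 1.0, count_weight: float = 2.0) -> float:
    """Key degree from OCR'd document text: combines how often document keywords
    occur and how many distinct keywords appear (weights are illustrative)."""
    tokens = ocr_text.split()
    hits = [t for t in tokens if t in keyword_library]
    total_occurrences = len(hits)
    distinct_keywords = len(set(hits))
    return freq_weight * total_occurrences + count_weight * distinct_keywords

def top_key_segments(segment_texts: dict, keyword_library: set, top_m: int = 3):
    """Rank pre-filtered explanation segments and keep the top_m most critical ones."""
    scored = sorted(segment_texts.items(),
                    key=lambda kv: keyword_criticality(kv[1], keyword_library),
                    reverse=True)
    return [seg_id for seg_id, _ in scored[:top_m]]
```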
Step S3: extracting text explanation audio and text explanation images corresponding to the key video clips, acquiring first voice data of the target object in the key video clips according to the text explanation audio, acquiring a plurality of mouth-shaped changing images of the target object in the key video clips according to the text explanation images, and acquiring second voice data corresponding to the target object according to the plurality of mouth-shaped changing images.
For example, each key video snippet includes at least text description audio and text description images, wherein each text description audio includes a plurality of audio frames and each text description image includes a plurality of video image frames. And decoding the key video snippet so as to obtain the text explanation audio and the text explanation image corresponding to the key video snippet.
Ambient noise may be present in the acquired audio data, such as the applause of an ambient listener, the questioning sound of the listener, etc., and in order to reduce the influence of the ambient noise, it is necessary to separate the first voice data of the target object from the audio data.
The mouth-shape-change images of the target object in the key video segment are obtained from the text explanation images, and the lip language information of the target object in the corresponding video is identified from these images, so that the second voice data of the target object in the text explanation images is obtained from the lip language information. The first voice data and the second voice data can then be combined to obtain more accurate and complete target voice data of the target object.
In some implementations, the obtaining first voice data of the target object in the key video snippet according to the text narrative audio includes:
inputting the audio data corresponding to the text explanation audio to a feature extraction network of a voice extraction model to perform feature extraction, and obtaining feature vectors corresponding to the audio data, wherein the audio data comprises first voice data of the target object and noise data of the environment;
inputting a preset vector and the feature vector into a voice extraction network of the voice extraction model to extract first voice data of the target object from the audio data, wherein the voice extraction model is obtained through voice training of the target object, the preset vector is obtained according to the noise data, and the voice extraction network takes the preset vector as a reference to adjust the proportion of the first voice data and the noise data in the audio data so as to obtain the first voice data of the target object.
Illustratively, different sounds have different voiceprint characteristics, so the user's voice and ambient noise can be distinguished by their voiceprint characteristics in order to separate the voice data of the target object from the audio data.
It should be noted that, voiceprint (Voiceprint) is a sound wave spectrum carrying speech information displayed by an electroacoustic device. The generation of human language is a complex physiological and physical process between the human language center and the pronunciation organ, and the pronunciation organ used by the human when speaking, namely tongue, teeth, larynx, lung and nasal cavity, has great difference in size and shape, so that the voiceprint of any two persons has difference.
The acoustic characteristics of each individual's voice are relatively stable yet also variable; they are not absolute or constant. Such variation may come from physiology, pathology, psychology, imitation or disguise, and is also related to environmental disturbances. Nevertheless, since each person's vocal organs are different, people can in general distinguish the voices of different persons or judge whether two voices belong to the same person.
Further, voiceprint features include acoustic features related to the anatomical structure of the human vocal mechanism, such as spectrum, cepstrum, formants, pitch and reflection coefficients, as well as nasal sounds, deep-breath sounds, hoarseness, laughter and the like; human voiceprint characteristics are also affected by socioeconomic status, educational level, birthplace, semantics, rhetoric, pronunciation, speech habits, and so on, and by personal traits or traits influenced by parents such as rhythm, speed, intonation and volume. From the viewpoint of mathematical modeling, the features currently usable by automatic voiceprint recognition models include: acoustic features such as cepstra; lexical features such as speaker-dependent word n-grams and phoneme n-grams; and prosodic features such as pitch and energy "poses" described by n-grams.
In practical application, when voiceprint feature extraction is performed, voiceprint feature data of a user in audio data may be extracted, where the voiceprint feature data includes at least one of a pitch spectrum and its contour, energy of a pitch frame, occurrence frequency and trajectory of a pitch formant, linear prediction cepstrum, line spectrum pair, autocorrelation and logarithmic area ratio, mel frequency cepstrum coefficient (Mel Frequency Cepstrum Coefficient, MFCC), and perceptual linear prediction.
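For illustration, a common way to obtain such voiceprint features is to compute MFCCs with an audio library such as librosa; the sketch below is written under that assumption and is not prescribed by the embodiment.

```python
import librosa
import numpy as np

def voiceprint_features(audio_path: str, n_mfcc: int = 20) -> np.ndarray:
    """Extract MFCC-based voiceprint features from an audio file (one common choice;
    pitch, formant, or LPC features could be added in the same way)."""
    y, sr = librosa.load(audio_path, sr=16000)               # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    # Mean-pool over time to get a fixed-length embedding for comparison.
    return mfcc.mean(axis=1)
```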
For example, the audio data includes first voice data of the target object and noise data of the environment. Because the target object is a user, the target object's voice differs greatly from the environmental noise, so the voice extraction model is trained with the target object's voice and environmental noise. When extracting the voice data of the target object, the obtained audio data is input into the voice extraction model for feature extraction to obtain the feature vectors corresponding to the audio data; the environmental noise of the environment where the terminal device is located is also obtained and converted into the corresponding preset vector.
The method comprises the steps of inputting a preset vector and a feature vector into a voice extraction network of a voice extraction model to extract first voice data of a target object from audio data, wherein the voice extraction model is obtained through user voice and environmental noise training, the preset vector is obtained according to noise data, and the voice extraction network adjusts the proportion of the first voice data and the noise data in the audio data by taking the preset vector as a reference, so that the first voice data of the target object is obtained.
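The voice extraction network described here behaves like a reference-conditioned masking network. The PyTorch sketch below is a simplified, assumed architecture (layer types and sizes are illustrative) showing how a noise-derived preset vector can be concatenated with the audio feature vectors to predict a mask that rebalances target speech and noise; it is not the patent's concrete network.

```python
import torch
import torch.nn as nn

class SpeechExtractionNet(nn.Module):
    """Simplified mask-based extraction network: the audio feature sequence is
    concatenated with a reference (noise-derived) vector, and a soft mask is
    predicted to adjust the proportion of target speech vs. noise."""
    def __init__(self, feat_dim: int = 80, ref_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim + ref_dim, hidden, batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, feat_dim), nn.Sigmoid())

    def forward(self, audio_feats: torch.Tensor, ref_vec: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, feat_dim); ref_vec: (batch, ref_dim)
        ref = ref_vec.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        h, _ = self.rnn(torch.cat([audio_feats, ref], dim=-1))
        mask = self.mask_head(h)          # values in (0, 1)
        return audio_feats * mask         # estimated target-speaker features
```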
In some embodiments, the obtaining a plurality of mouth-shaped variant images of the target object in the key video snippet according to the text comment image, and obtaining second voice data corresponding to the target object according to a plurality of mouth-shaped variant images includes:
extracting a mouth shape image of a target object in each frame of video image of the text explanation image, and giving a corresponding time stamp to the mouth shape image according to a time axis of the text explanation image;
and inputting the mouth shape image into a preset lip language identification model according to the timestamp so as to acquire second voice data corresponding to the target object in the text interpretation image.
The acquired text interpretation images comprise N frames of video images, the mouth shape images of the target object in each frame of the N frames of video images are extracted, corresponding time stamps are given to the extracted mouth shape images according to the sequence of each frame of images, and the mouth shape images are input into the lip language recognition model according to the sequence of the time stamps so as to acquire second voice data corresponding to the target object in the text interpretation images.
For example, a first frame in the explanation image acquires a first mouth shape image, a second frame acquires a second mouth shape image, a third frame acquires a third mouth shape image until an nth frame acquires an nth mouth shape image, a corresponding mouth shape image time stamp is given according to the time sequence of each frame image, so that the mouth shape changing sequence of a target object is accurately identified, and the mouth shape images acquired from the first frame to the nth frame of the explanation image are input into a lip language recognition model according to the time sequence of the time stamps so as to acquire second voice data corresponding to the target object in the explanation image.
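A minimal sketch of the frame-by-frame mouth-image extraction with timestamps is shown below, using OpenCV for decoding; the mouth detector and the lip-reading model are placeholders (assumed components), since the embodiment does not fix a specific implementation.

```python
import cv2

def mouth_frames_with_timestamps(video_path: str, crop_mouth):
    """Read the text explanation image stream frame by frame, crop the presenter's mouth
    region (crop_mouth is a placeholder detector), and attach a timestamp per frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    frames = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mouth = crop_mouth(frame)          # e.g. via a face-landmark model (assumed)
        if mouth is not None:
            frames.append((idx / fps, mouth))
        idx += 1
    cap.release()
    return frames                           # ordered by timestamp by construction

# Hypothetical downstream call: feed the crops to a lip-reading model in timestamp order.
# second_voice = lip_reading_model([m for _, m in sorted(frames, key=lambda t: t[0])])
```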
Step S4: and determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to acquire target text information.
For example, in the first voice data the target object's voice may be masked by environmental noise, or the audio collector may be disturbed by the environment, so that part of the target object's voice is missing; the missing part is then compensated by the corresponding part of the second voice data, and the target voice data of the target object is obtained. The acquired target voice data is then recognized by a preset voice recognition model to obtain the target text information.
In some embodiments, the determining the target voice data of the target object according to the first voice data and the second voice data includes:
comparing the first voice data with the second voice data, and judging whether the first voice data has voice missing or not;
when the first voice data has voice missing, performing voice compensation on the first voice data according to the second voice data to obtain the target voice data.
In some embodiments, the performing the voice compensation on the first voice data according to the second voice data to obtain the target voice data includes:
marking the missing part of the first voice data and acquiring a first time period corresponding to the missing part;
and acquiring a second voice data segment corresponding to the first time segment from the second voice data, and compensating the missing part by utilizing the second voice data segment to obtain the target voice data.
In an exemplary embodiment, because the voice data and the video data are acquired simultaneously, the start time of the first voice data is the same as that of the second voice data. The first audio signal corresponding to the first voice data and the second audio signal corresponding to the second voice data are compared for similarity in time continuity to judge whether the first voice data has missing speech. When missing speech exists, the missing portion is marked and the first time period corresponding to it is acquired; a second voice data segment covering the same time period is then acquired from the second voice data, and the missing portion of the first voice data is compensated with this segment, so that the target voice data of the target object is obtained.
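The compensation step can be sketched as follows on raw waveforms, assuming both signals start at the same instant and share one sample rate; the frame length and the energy threshold used to mark missing spans are illustrative assumptions.

```python
import numpy as np

def compensate_missing_speech(first: np.ndarray, second: np.ndarray, sr: int,
                              frame_ms: int = 20, energy_thresh: float = 1e-4) -> np.ndarray:
    """Fill low-energy (missing) spans of the microphone signal `first` with the
    time-aligned lip-reading-derived signal `second`."""
    hop = int(sr * frame_ms / 1000)
    target = first.copy()
    for start in range(0, min(len(first), len(second)) - hop, hop):
        frame = first[start:start + hop]
        second_frame = second[start:start + hop]
        # A frame is treated as missing when its energy is low while the
        # lip-derived signal indicates the speaker was talking.
        if np.mean(frame ** 2) < energy_thresh and np.mean(second_frame ** 2) >= energy_thresh:
            target[start:start + hop] = second_frame
    return target
```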
The target voice data is then input into a preset voice recognition model to obtain the target text information. For example, the target voice data is converted into corresponding text by automatic speech recognition (ASR), so that the obtained target text information can be used as the subtitle corresponding to the target voice data.
Step S5: and generating a video abstract segment corresponding to the key video segment according to the text comment image, the target voice data and the target text information corresponding to each key video segment, and splicing the video abstract segments to generate a video abstract corresponding to the target text comment video.
Specifically, the text explanation image, the target voice data and the target text information obtained for each key video segment are taken as the source material of the current video abstract segment: the target voice data serves as the explanation voice of the text explanation image, and the target text information serves as its explanation captions. The time point at which the target object starts explaining in the text explanation image is determined and aligned with the start time of the corresponding target voice data, and this time point is also used as the appearance time of the target text information, so that the current video abstract segment can be generated accurately.
And similarly, sequentially generating video abstract segments corresponding to the plurality of key video segments, thereby obtaining the video abstract segments corresponding to all the key video segments.
Each of the comment video clips has been assigned a corresponding time stamp when the video clips are divided, and the sequence of the corresponding video clips can be determined by the time stamp.
Therefore, the order of the corresponding video summary segments can be determined according to the time stamps and the video summary of the target text explanation video can be generated; alternatively, the video summary segments can be ordered randomly and spliced into the corresponding video summary, which is not limited herein.
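For the splicing itself, one possible approach is ffmpeg's concat demuxer, as sketched below; the timestamp ordering mirrors the description above, while the file names and output path are illustrative assumptions.

```python
import subprocess
import tempfile

def splice_summary(segments: list, output: str = "video_summary.mp4") -> str:
    """Concatenate per-segment summary clips in timestamp order.
    `segments` is a list of (timestamp, clip_path) tuples."""
    ordered = [path for _, path in sorted(segments, key=lambda t: t[0])]
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in ordered:
            f.write(f"file '{path}'\n")
        list_file = f.name
    # Stream copy keeps the per-segment encoding untouched.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", output],
        check=True,
    )
    return output
```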
Referring to fig. 2, the present application further provides a device 200 for generating a video summary, where the device 200 for generating a video summary includes a segment dividing module 201, a segment screening module 202, a voice extraction module 203, a text conversion module 204, and a summary generating module 205.
Specifically, the segment dividing module 201 is configured to obtain a target text comment video, and divide the target text comment video into a plurality of comment video segments;
a segment screening module 202, configured to select a key video segment from a plurality of the comment video segments according to a correlation degree between each of the comment video segments and a target document explanation, where each of the key video segments includes a target object explaining the target document;
The voice extraction module 203 is configured to extract text explanation audio and text explanation images corresponding to the key video snippet, obtain first voice data of the target object in the key video snippet according to the text explanation audio, obtain multiple mouth-shaped variation images of the target object in the key video snippet according to the text explanation images, and obtain second voice data corresponding to the target object according to the multiple mouth-shaped variation images;
the text conversion module 204 is configured to determine target voice data of the target object according to the first voice data and the second voice data, and input the target voice data into a preset voice recognition model to obtain target text information;
the summary generating module 205 is configured to generate a video summary segment corresponding to the key video segment according to the text description image, the target voice data, and the target text information corresponding to each key video segment, and splice the video summary segments to generate a video summary corresponding to the target text description video.
In some implementations, the segment partitioning module 201 is further configured to: and acquiring the total duration of the target text comment video, and dividing the target text comment video into N comment video fragments according to the total duration, wherein N is more than 2.
In some embodiments, the fragment screening module 202 is further configured to: judging whether the appearance time of a target object explaining the target document in the explanation video segment exceeds a preset time;
when the appearance time of the target object exceeds the preset time, acquiring caption information of the caption video clip, and carrying out type division on each piece of caption information according to keywords in the caption information;
the subtitle information is weighted and summed based on the subtitle type of each piece of subtitle information and the weighting coefficient of each subtitle type, so that the key degree of the comment video clip is obtained;
and selecting the preset number of the explanation video clips with the highest key degree as the key video clips.
In some embodiments, the fragment screening module 202 is further configured to: the selecting a key video segment from a plurality of the comment video segments according to the correlation degree of each comment video segment and the target document explanation comprises the following steps:
judging whether the appearance time of a target object explaining the target document in the explanation video segment exceeds a preset time;
when the appearance time of the target object exceeds the preset time, acquiring document information of a target document appearing in each explanation video segment, and extracting keywords from the document information to acquire document keywords;
Acquiring the key degree of each comment video fragment according to the occurrence frequency of the corresponding text keywords in each comment video fragment and the number of the text keywords;
and selecting the preset number of the explanation video clips with the highest key degree as the key video clips.
In some implementations, the speech extraction module 203 is further configured to: inputting audio data into a feature extraction network of a voice extraction model to perform feature extraction, and acquiring feature vectors corresponding to the audio data, wherein the audio data comprises first voice data of the target object and noise data of the environment;
inputting a preset vector and the feature vector into a voice extraction network of the voice extraction model to extract first voice data of the target object from the audio data, wherein the voice extraction model is obtained through voice training of the target object, the preset vector is obtained according to the noise data, and the voice extraction network takes the preset vector as a reference to adjust the proportion of the first voice data and the noise data in the audio data so as to obtain the first voice data of the target object.
In some implementations, the speech extraction module 203 is further configured to: extracting a mouth shape image of a target object in each frame of video image of the text explanation image, and giving a corresponding time stamp to the mouth shape image according to a time axis of the text explanation image;
and inputting the mouth shape image into a preset lip language identification model according to the timestamp so as to acquire second voice data corresponding to the target object in the text interpretation image.
In some implementations, the text conversion module 204 is further configured to: comparing the first voice data with the second voice data, and judging whether the first voice data has voice missing or not;
when the first voice data has voice missing, performing voice compensation on the first voice data according to the second voice data to obtain the target voice data.
In some implementations, the text conversion module 204 is further configured to: marking the missing part of the first voice data and acquiring a first time period corresponding to the missing part;
and acquiring a second voice data segment corresponding to the first time segment from the second voice data, and compensating the missing part by utilizing the second voice data segment to obtain the target voice data.
Referring to fig. 3, fig. 3 is a schematic block diagram of an electronic device according to an embodiment of the present application.
As shown in fig. 3, the electronic device 300 includes a processor 301 and a memory 302, the processor 301 and the memory 302 being connected by a bus 303, such as an I2C (Inter-integrated Circuit) bus.
In particular, the processor 301 is used to provide computing and control capabilities, supporting the operation of the entire server. The processor 301 may be a central processing unit (Central Processing Unit, CPU), the processor 301 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Specifically, the memory 302 may be a Flash chip, a Read-Only Memory (ROM), a magnetic disk, an optical disk, a USB flash disk, a removable hard disk, or the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 3 is merely a block diagram of a portion of the structure related to the embodiments of the present application and is not limiting of the electronic device to which the embodiments of the present application apply, and that a particular electronic device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
The processor 301 is configured to execute a computer program stored in the memory, and implement any one of the methods for generating a video summary provided in the embodiments of the present application when the computer program is executed.
In some embodiments, the processor 301 is configured to run a computer program stored in a memory and when executing the computer program implement the steps of:
acquiring a target text comment video and dividing the target text comment video into a plurality of comment video fragments;
selecting a key video segment from a plurality of the comment video segments according to the correlation degree of each comment video segment and the explanation of the target document, wherein each key video segment comprises a target object for explaining the target document;
extracting text explanation audio and text explanation images corresponding to the key video clips, acquiring first voice data of the target object in the key video clips according to the text explanation audio, acquiring a plurality of mouth-shaped changing images of the target object in the key video clips according to the text explanation images, and acquiring second voice data corresponding to the target object according to the plurality of mouth-shaped changing images;
Determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to obtain target text information;
and generating a video abstract segment corresponding to the key video segment according to the text comment image, the target voice data and the target text information corresponding to each key video segment, and splicing the video abstract segments to generate a video abstract corresponding to the target text comment video.
In some implementations, the processor 301, when the dividing the target narrative video into a plurality of narrative video segments, includes:
and acquiring the total duration of the target text comment video, and dividing the target text comment video into N comment video fragments according to the total duration, wherein N is more than 2.
In some embodiments, the processor 301, when selecting a key video snippet from a plurality of the commentary video snippets according to a degree of correlation of each of the commentary video snippets with the target document explanation, comprises:
judging whether the appearance time of a target object explaining the target document in the explanation video segment exceeds a preset time;
When the appearance time of the target object exceeds the preset time, acquiring document information of a target document appearing in each explanation video segment, and extracting keywords from the document information to acquire document keywords;
acquiring the key degree of each comment video fragment according to the occurrence frequency of the corresponding text keywords in each comment video fragment and the number of the text keywords;
and selecting the preset number of the explanation video clips with the highest key degree as the key video clips.
In some embodiments, the processor 301, when selecting a key video snippet from a plurality of the commentary video snippets according to a degree of correlation of each of the commentary video snippets with the target document explanation, comprises:
judging whether the appearance time of a target object explaining the target document in the explanation video segment exceeds a preset time;
when the appearance time of the target object exceeds the preset time, acquiring caption information of the caption video clip, and carrying out type division on each piece of caption information according to keywords in the caption information;
the subtitle information is weighted and summed based on the subtitle type of each piece of subtitle information and the weighting coefficient of each subtitle type, so that the key degree of the comment video clip is obtained;
And selecting the preset number of the explanation video clips with the highest key degree as the key video clips.
In some implementations, the processor 301, when acquiring the first voice data of the target object in the key video snippet according to the text explanation audio, includes:
inputting the audio data corresponding to the text explanation audio to a feature extraction network of a voice extraction model to perform feature extraction, and obtaining feature vectors corresponding to the audio data, wherein the audio data comprises first voice data of the target object and noise data of the environment;
inputting a preset vector and the feature vector into a voice extraction network of the voice extraction model to extract first voice data of the target object from the audio data, wherein the voice extraction model is obtained through voice training of the target object, the preset vector is obtained according to the noise data, and the voice extraction network takes the preset vector as a reference to adjust the proportion of the first voice data and the noise data in the audio data so as to obtain the first voice data of the target object.
In some embodiments, when acquiring a plurality of mouth-shaped variation images of the target object in the key video snippet according to the text explanation image, and acquiring second voice data corresponding to the target object according to the plurality of mouth-shaped variation images, the processor 301 includes:
Extracting a mouth shape image of a target object in each frame of video image of the text explanation image, and giving a corresponding time stamp to the mouth shape image according to a time axis of the text explanation image;
and inputting the mouth shape image into a preset lip language identification model according to the timestamp so as to acquire second voice data corresponding to the target object in the text interpretation image.
In some implementations, the processor 301, when determining target voice data of the target object from the first voice data and the second voice data, includes:
comparing the first voice data with the second voice data, and judging whether the first voice data has voice missing or not;
when the first voice data has voice missing, performing voice compensation on the first voice data according to the second voice data to obtain the target voice data.
In some embodiments, when performing voice compensation on the first voice data according to the second voice data to obtain the target voice data, the processor 301 includes:
marking the missing part of the first voice data and acquiring a first time period corresponding to the missing part;
And acquiring a second voice data segment corresponding to the first time segment from the second voice data, and compensating the missing part by utilizing the second voice data segment to obtain the target voice data.
It should be noted that, for convenience and brevity of description, specific working processes of the above-described electronic device may refer to corresponding processes in the foregoing embodiments of the method for generating a video abstract, which are not described herein again.
The embodiments of the present application also provide a storage medium for computer readable storage, where the storage medium stores one or more programs, and the one or more programs may be executed by one or more processors to implement the steps of any one of the methods for generating a video summary provided in the embodiments of the present application.
The storage medium may be an internal storage unit of the electronic device of the foregoing embodiment, for example, a hard disk or a memory of the electronic device. The storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware embodiment, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
It should be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should also be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article or system that comprises the element.
The serial numbers of the foregoing embodiments of the present application are for description only and do not imply that one embodiment is better than another. The foregoing is merely a specific implementation of the present application, but the protection scope of the present application is not limited thereto; any equivalent modification or substitution that can be readily conceived by those skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating a video abstract, comprising:
acquiring a target document explanation video and dividing the target document explanation video into a plurality of explanation video segments;
selecting key video segments from the plurality of explanation video segments according to a degree of correlation between each explanation video segment and the explanation of a target document, wherein each key video segment comprises a target object explaining the target document;
extracting document explanation audio and document explanation images corresponding to the key video segments, acquiring first voice data of the target object in the key video segments according to the document explanation audio, acquiring a plurality of mouth shape change images of the target object in the key video segments according to the document explanation images, and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images;
determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to obtain target text information;
and generating, according to the document explanation image, the target voice data and the target text information corresponding to each key video segment, a video abstract segment corresponding to the key video segment, and splicing the video abstract segments to generate a video abstract corresponding to the target document explanation video.
2. The method according to claim 1, wherein the selecting key video segments from the plurality of explanation video segments according to the degree of correlation between each explanation video segment and the explanation of the target document comprises:
judging whether the appearance duration of the target object explaining the target document in each explanation video segment exceeds a preset duration;
when the appearance duration of the target object exceeds the preset duration, acquiring document information of the target document appearing in each explanation video segment, and extracting keywords from the document information to obtain document keywords;
acquiring a key degree of each explanation video segment according to the occurrence frequency of the corresponding document keywords in the explanation video segment and the number of the document keywords;
and selecting a preset number of the explanation video segments with the highest key degree as the key video segments.
3. The method according to claim 1, wherein the selecting key video segments from the plurality of explanation video segments according to the degree of correlation between each explanation video segment and the explanation of the target document comprises:
judging whether the appearance duration of the target object explaining the target document in each explanation video segment exceeds a preset duration;
when the appearance duration of the target object exceeds the preset duration, acquiring subtitle information of the explanation video segment, and classifying each piece of subtitle information by type according to keywords in the subtitle information;
performing a weighted summation over the subtitle information based on the subtitle type of each piece of subtitle information and the weighting coefficient of each subtitle type, so as to obtain the key degree of the explanation video segment;
and selecting a preset number of the explanation video segments with the highest key degree as the key video segments.
4. The method according to claim 1, wherein the acquiring first voice data of the target object in the key video segments according to the document explanation audio comprises:
inputting audio data corresponding to the document explanation audio into a feature extraction network of a voice extraction model for feature extraction to obtain a feature vector corresponding to the audio data, wherein the audio data comprises the first voice data of the target object and environmental noise data;
and inputting a preset vector and the feature vector into a voice extraction network of the voice extraction model to extract the first voice data of the target object from the audio data, wherein the voice extraction model is obtained by training with the voice of the target object, the preset vector is obtained according to the noise data, and the voice extraction network adjusts, with the preset vector as a reference, the proportion of the first voice data and the noise data in the audio data so as to obtain the first voice data of the target object.
5. The method according to claim 1, wherein the acquiring a plurality of mouth shape change images of the target object in the key video segments according to the document explanation images, and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images, comprises:
extracting a mouth shape image of the target object from each frame of the document explanation image, and assigning a corresponding timestamp to the mouth shape image according to the time axis of the document explanation image;
and inputting the mouth shape images into a preset lip language recognition model according to the timestamps so as to acquire the second voice data corresponding to the target object in the document explanation image.
6. The method according to claim 1, wherein the determining target voice data of the target object according to the first voice data and the second voice data comprises:
comparing the first voice data with the second voice data, and judging whether there is a voice missing in the first voice data;
and when there is a voice missing in the first voice data, performing voice compensation on the first voice data according to the second voice data to obtain the target voice data.
7. The method according to claim 6, wherein the performing voice compensation on the first voice data according to the second voice data to obtain the target voice data comprises:
marking the missing part of the first voice data and acquiring a first time period corresponding to the missing part;
and acquiring, from the second voice data, a second voice data segment corresponding to the first time period, and compensating the missing part with the second voice data segment to obtain the target voice data.
8. A video abstract generation apparatus, comprising:
a segment dividing module, configured to acquire a target document explanation video and divide the target document explanation video into a plurality of explanation video segments;
a segment screening module, configured to select key video segments from the plurality of explanation video segments according to a degree of correlation between each explanation video segment and the explanation of a target document, wherein each key video segment comprises a target object explaining the target document;
a voice extraction module, configured to extract document explanation audio and document explanation images corresponding to the key video segments, acquire first voice data of the target object in the key video segments according to the document explanation audio, acquire a plurality of mouth shape change images of the target object in the key video segments according to the document explanation images, and acquire second voice data corresponding to the target object according to the plurality of mouth shape change images;
a text conversion module, configured to determine target voice data of the target object according to the first voice data and the second voice data, and input the target voice data into a preset voice recognition model to acquire target text information;
and an abstract generation module, configured to generate, according to the document explanation image, the target voice data and the target text information corresponding to each key video segment, a video abstract segment corresponding to the key video segment, and splice the video abstract segments to generate a video abstract corresponding to the target document explanation video.
9. An electronic device, comprising a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus configured to implement connection communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the steps of the method for generating a video abstract according to any one of claims 1 to 7.
10. A storage medium for computer-readable storage, wherein the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of the method for generating a video abstract according to any one of claims 1 to 7.
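For readability only, the key-segment screening recited in claim 2 can be illustrated with the short Python sketch below: each explanation video segment is scored from the occurrence frequency of the document keywords and the number of distinct keywords that appear, and the highest-scoring segments are kept. The additive scoring rule, the function names and the toy data are illustrative assumptions and are not part of the claims:

```python
from collections import Counter
from typing import Dict, List


def key_degree(segment_text: str, document_keywords: List[str]) -> int:
    """Score one explanation video segment from the occurrence frequency of the
    document keywords and the number of distinct keywords that appear."""
    counts = Counter(segment_text.lower().split())
    occurrences = sum(counts[k.lower()] for k in document_keywords)
    distinct = sum(1 for k in document_keywords if counts[k.lower()] > 0)
    return occurrences + distinct   # simple additive combination (assumption)


def select_key_segments(segment_texts: Dict[str, str],
                        document_keywords: List[str],
                        preset_number: int) -> List[str]:
    """Return the preset number of segment ids with the highest key degree."""
    ranked = sorted(segment_texts,
                    key=lambda sid: key_degree(segment_texts[sid], document_keywords),
                    reverse=True)
    return ranked[:preset_number]


# Toy usage: keep the two segments most related to the document keywords.
segments = {
    "seg-01": "the annuity product covers retirement income and yearly premiums",
    "seg-02": "welcome everyone to today's live session",
    "seg-03": "premiums are paid annually and the income starts at age sixty",
}
print(select_key_segments(segments, ["premiums", "income", "annuity"], 2))
```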
CN202111436728.5A 2021-11-29 2021-11-29 Video abstract generation method, device, equipment and storage medium Active CN114143479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111436728.5A CN114143479B (en) 2021-11-29 2021-11-29 Video abstract generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114143479A CN114143479A (en) 2022-03-04
CN114143479B true CN114143479B (en) 2023-07-25

Family

ID=80390010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111436728.5A Active CN114143479B (en) 2021-11-29 2021-11-29 Video abstract generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114143479B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114760494A (en) * 2022-04-15 2022-07-15 北京字节跳动网络技术有限公司 Video processing method and device, readable medium and electronic equipment
CN114938462B (en) * 2022-06-07 2023-06-30 平安科技(深圳)有限公司 Intelligent editing method, system, electronic equipment and storage medium of teaching video
CN115334367B (en) * 2022-07-11 2023-10-17 北京达佳互联信息技术有限公司 Method, device, server and storage medium for generating abstract information of video

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442747A (en) * 2019-07-09 2019-11-12 中山大学 A kind of video abstraction generating method based on keyword
CN110753269A (en) * 2018-07-24 2020-02-04 Tcl集团股份有限公司 Video abstract generation method, intelligent terminal and storage medium
CN111694984A (en) * 2020-06-12 2020-09-22 百度在线网络技术(北京)有限公司 Video searching method and device, electronic equipment and readable storage medium
CN113283327A (en) * 2021-05-17 2021-08-20 多益网络有限公司 Video text generation method, device, equipment and storage medium
CN113395578A (en) * 2020-11-27 2021-09-14 腾讯科技(深圳)有限公司 Method, device and equipment for extracting video theme text and storage medium

Also Published As

Publication number Publication date
CN114143479A (en) 2022-03-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant