CN114143479A - Video abstract generation method, device, equipment and storage medium

Video abstract generation method, device, equipment and storage medium

Info

Publication number
CN114143479A
Authority
CN
China
Prior art keywords
video
target
voice data
target object
key
Prior art date
Legal status
Granted
Application number
CN202111436728.5A
Other languages
Chinese (zh)
Other versions
CN114143479B (en)
Inventor
刘钊
Current Assignee
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202111436728.5A
Publication of CN114143479A
Application granted
Publication of CN114143479B
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00: Details of television systems
    • H04N 5/222: Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/265: Mixing

Abstract

The embodiments of the present application relate to the field of artificial intelligence and disclose a method, an apparatus, a device, and a storage medium for generating a video summary. The method includes: obtaining a target document commentary video and dividing it into a plurality of commentary video clips; selecting key video clips from the commentary video clips; extracting the commentary audio and the commentary image of each key video clip; acquiring first voice data of a target object in the key video clip; acquiring second voice data corresponding to the target object; determining target voice data of the target object according to the first voice data and the second voice data; acquiring target text information according to the target voice data; generating a video summary segment corresponding to each key video clip according to the commentary image, the target voice data, and the target text information of that key video clip; and splicing the video summary segments to generate a video summary corresponding to the target document commentary video.

Description

Video abstract generation method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating a video summary.
Background
With the development of internet and multimedia technologies, digital video such as news, advertisements, television, movies, and live webcasts has become ubiquitous. Whether studying or being entertained, users are surrounded by massive amounts of video, and quickly finding a video of interest among them is not easy. A video summary is a brief representation of video content; its purpose is to let users quickly understand the content and decide whether to watch it in detail, and to support indexing and querying of a video database.
In the prior art, in order to generate a video summary quickly, a plurality of video frames are usually extracted from the video at random or at fixed frame intervals, and the extracted frames are simply combined to form the video summary.
However, a video summary generated in this way cannot identify whether the source material has quality problems, so the quality of the generated video summary is difficult to guarantee.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a method, an apparatus, a device, and a storage medium for generating a video summary, so as to improve the generation quality of the video summary and the user experience.
In a first aspect, an embodiment of the present application provides a method for generating a video summary, including:
obtaining a target document commentary video, and dividing the target document commentary video into a plurality of commentary video clips;
selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target document, wherein each key video clip includes a target object that explains the target document;
extracting a commentary audio and a commentary image corresponding to each key video clip, acquiring first voice data of the target object in the key video clip according to the commentary audio, acquiring a plurality of mouth shape change images of the target object in the key video clip according to the commentary image, and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images;
determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to obtain target text information;
and generating a video summary segment corresponding to each key video clip according to the commentary image, the target voice data, and the target text information corresponding to that key video clip, and splicing the video summary segments to generate a video summary corresponding to the target document commentary video.
In a second aspect, an embodiment of the present application further provides an apparatus for generating a video summary, including:
a segment dividing module, configured to obtain a target document commentary video and divide the target document commentary video into a plurality of commentary video clips;
a segment screening module, configured to select key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target document, wherein each key video clip includes a target object that explains the target document;
a voice extraction module, configured to extract a commentary audio and a commentary image corresponding to each key video clip, acquire first voice data of the target object in the key video clip according to the commentary audio, acquire a plurality of mouth shape change images of the target object in the key video clip according to the commentary image, and acquire second voice data corresponding to the target object according to the plurality of mouth shape change images;
a text conversion module, configured to determine target voice data of the target object according to the first voice data and the second voice data, and input the target voice data into a preset voice recognition model to obtain target text information;
and a summary generation module, configured to generate a video summary segment corresponding to each key video clip according to the commentary image, the target voice data, and the target text information corresponding to that key video clip, and splice the video summary segments to generate a video summary corresponding to the target document commentary video.
In a third aspect, an embodiment of the present application further provides an electronic device, which includes a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus for implementing connection and communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the steps of any one of the video summary generation methods provided in this specification.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, wherein the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of any one of the video summary generation methods provided in this specification.
The embodiments of the present application provide a method, an apparatus, a device, and a storage medium for generating a video summary. The method includes: obtaining a target document commentary video and dividing it into a plurality of commentary video clips; selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target document, wherein each key video clip includes a target object that explains the target document; extracting a commentary audio and a commentary image corresponding to each key video clip, acquiring first voice data of the target object in the key video clip according to the commentary audio, acquiring a plurality of mouth shape change images of the target object according to the commentary image, and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images; determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to obtain target text information; and generating a video summary segment corresponding to each key video clip according to the commentary image, the target voice data, and the target text information corresponding to that key video clip, and splicing the video summary segments to generate a video summary corresponding to the target document commentary video. In the process of generating the video summary, the commentary voice data of the target object is also obtained from the mouth shape changes of the target object, so that when the audio in the acquired target document commentary video is defective, the voice obtained from the mouth shape changes can be used for compensation, which improves the generation quality of the video summary and the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a method for generating a video summary according to an embodiment of the present disclosure;
fig. 2 is a schematic block structure diagram of a video summary generation apparatus according to an embodiment of the present disclosure;
fig. 3 is a block diagram schematically illustrating a structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
With the development of internet and multimedia technologies, digital video such as news, advertisements, television, movies, and live webcasts has become ubiquitous. Whether studying or being entertained, users are surrounded by massive amounts of video, and quickly finding a video of interest among them is not easy. A video summary is a brief representation of video content; its purpose is to let users quickly understand the content and decide whether to watch it in detail, and to support indexing and querying of a video database.
In the prior art, in order to generate a video summary quickly, a plurality of video frames are usually extracted from the video at random or at fixed frame intervals, and the extracted frames are simply combined to form the video summary.
The video summary formed in this way depends strongly on the quality of the original material: if the speech in the original commentary video stutters or the explanation is unclear, the generated video summary will stutter or be unclear as well, so the quality of the final video summary cannot be effectively guaranteed and the user experience is poor.
Therefore, in order to improve the generation quality of the video summary and the user experience, the embodiments of the present application provide a method, an apparatus, a device, and a storage medium for generating a video summary. The video summary generation method can be applied to an electronic device. The electronic device may be a terminal device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, or a wearable device, or may be a server, where the server may be an independent server or a server cluster.
Specifically, the method for generating the video summary includes: obtaining a target document commentary video and dividing it into a plurality of commentary video clips; selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target document, wherein each key video clip includes a target object that explains the target document; extracting a commentary audio and a commentary image corresponding to each key video clip, acquiring first voice data of the target object in the key video clip according to the commentary audio, acquiring a plurality of mouth shape change images of the target object according to the commentary image, and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images; determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to obtain target text information; and generating a video summary segment corresponding to each key video clip according to the commentary image, the target voice data, and the target text information corresponding to that key video clip, and splicing the video summary segments to generate a video summary corresponding to the target document commentary video. In the process of generating the video summary, the commentary voice data of the target object is also obtained from the mouth shape changes of the target object, so that when the audio in the acquired target document commentary video is defective, the voice obtained from the mouth shape changes can be used for compensation, which improves the generation quality of the video summary and the user experience.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for generating a video summary according to an embodiment of the present disclosure.
As shown in fig. 1, the method for generating a video summary includes steps S1 through S5.
Step S1: a target document commentary video is obtained and divided into a plurality of commentary video clips.
The target document commentary video may be acquired in various ways: for example, it may be read from a storage medium such as a hard disk, a USB disk, or an SD card, or a video link of the target document commentary video may be acquired and the video downloaded through that link.
After the target document commentary video is acquired, it is divided into a plurality of commentary video clips. The division may split the target document commentary video into clips of a unit duration according to its total duration, or split it into clips at random, which is not limited herein.
In some embodiments, the dividing the target document commentary video into a plurality of commentary video clips includes:
acquiring the total duration of the target document commentary video, and dividing the target document commentary video into N commentary video clips according to the total duration, where N is greater than 2.
For example, if the total duration of the target document commentary video is 120 minutes, clips need to be cut from it in order to generate the video summary; the 120 minutes may be divided into 12 equal parts, that is, the 120-minute target document commentary video is divided into 12 commentary video clips of 10 minutes each.
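For illustration only, the following is a minimal sketch of this equal-duration division and is not part of the claimed invention: the clip boundaries are computed from the total duration and the number of clips, and an ffmpeg command line is shown as one possible cutting tool. The file names and the use of ffmpeg are assumptions.

```python
import subprocess

def split_video(path: str, total_seconds: float, n_segments: int, out_prefix: str = "segment"):
    """Cut `path` into n_segments commentary video clips of equal duration."""
    seg_len = total_seconds / n_segments
    clips = []
    for i in range(n_segments):
        start = i * seg_len
        out_file = f"{out_prefix}_{i:02d}.mp4"
        # "-c copy" avoids re-encoding; cut points then snap to the nearest keyframes.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(seg_len),
             "-i", path, "-c", "copy", out_file],
            check=True,
        )
        clips.append((out_file, start, start + seg_len))
    return clips

# Example from the description: a 120-minute video split into 12 ten-minute clips.
# clips = split_video("target_commentary.mp4", total_seconds=120 * 60, n_segments=12)
```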
Step S2: selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target document, wherein each key video clip includes a target object that explains the target document.
For example, after the target document commentary video is divided into a plurality of commentary video clips, some of the clips are strongly related to the explanation of the target document, such as the parts in which the lecturer explains the document, while others are only weakly related, such as transitional relaxation segments, digressions that expand on the document explanation, and question-and-answer segments.
Since the duration of the explanation of the target document may differ between clips, and some commentary video clips may not even contain an object being explained, such clips are not strongly correlated with the target document commentary video. If such clips were used to build the video summary, the summary might fail to convey that the video is a document commentary video, so it is important to screen key video clips out of the target document commentary video.
In some embodiments, the selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target document includes:
judging whether the appearance time of the target object that explains the target document in the commentary video clip exceeds a preset time;
when the appearance time of the target object exceeds the preset time, acquiring subtitle information of the commentary video clip, and classifying each piece of subtitle information by type according to keywords in the subtitle information;
performing a weighted summation over the subtitle information based on the subtitle type of each piece of subtitle information and the weighting coefficient of each subtitle type, to obtain the criticality of the commentary video clip;
and selecting a preset number of the commentary video clips with the highest criticality as the key video clips.
For example, the preset time is set according to the duration of the commentary video clips. If the duration of each commentary video clip is at least T1, the appearance time T2 of the target object explaining the target document in each commentary video clip is calculated through image recognition, where T2 is less than or equal to T1. When T2 is greater than or equal to 0.6 T1, the appearance time of the target object in the commentary video clip is considered to exceed the preset time, so the commentary video clips that meet this initial requirement can be screened out.
After the commentary video clips meeting the initial requirement are preliminarily screened, the subtitle information of each such clip is acquired, the subtitle information is split into keywords, and each piece of subtitle information is assigned a subtitle type according to the split keywords.
Each subtitle type corresponds to a weighting coefficient. After all the subtitle information appearing in a commentary video clip has been classified, the subtitle information is weighted and summed according to the weighting coefficients of its subtitle types to obtain the criticality of the commentary video clip, and then a preset number of commentary video clips with the highest criticality are selected as the key video clips. That is, the target document commentary video is divided into N commentary video clips, and the M commentary video clips with the highest criticality are selected from the N commentary video clips as key video clips, where N is greater than M; for example, the 3 most critical of the 12 commentary video clips are selected as key video clips, as illustrated in the sketch below.
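For illustration only, a minimal sketch of this subtitle-based screening follows; it is not the patented implementation. The subtitle type names, their keywords, the weighting coefficients, and the value of M are illustrative assumptions, since the patent leaves them as presets.

```python
from typing import Dict, List

# Assumed subtitle types, matching keywords, and weighting coefficients.
TYPE_KEYWORDS: Dict[str, List[str]] = {
    "document_explanation": ["coverage", "premium", "benefit"],
    "interaction": ["question", "answer"],
    "filler": ["welcome", "thanks"],
}
TYPE_WEIGHTS: Dict[str, float] = {
    "document_explanation": 1.0,
    "interaction": 0.4,
    "filler": 0.1,
}

def subtitle_type(line: str) -> str:
    """Assign a subtitle type to one piece of subtitle information by keyword matching."""
    for sub_type, keywords in TYPE_KEYWORDS.items():
        if any(k in line.lower() for k in keywords):
            return sub_type
    return "filler"

def clip_criticality(subtitles: List[str]) -> float:
    """Weighted sum over all subtitle lines of a clip, using the type weights."""
    return sum(TYPE_WEIGHTS[subtitle_type(line)] for line in subtitles)

def select_key_clips(clip_subtitles: Dict[str, List[str]], top_m: int) -> List[str]:
    """Return the M clip identifiers with the highest criticality."""
    ranked = sorted(clip_subtitles, key=lambda c: clip_criticality(clip_subtitles[c]), reverse=True)
    return ranked[:top_m]

# key_clips = select_key_clips({"clip_00": [...], "clip_01": [...], ...}, top_m=3)
```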
In some embodiments, the selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target document includes:
judging whether the appearance time of the target object that explains the target document in the commentary video clip exceeds a preset time;
when the appearance time of the target object exceeds the preset time, obtaining the document information of the target document appearing in each commentary video clip, and extracting keywords from the document information to obtain document keywords;
acquiring the criticality of each commentary video clip according to the appearance frequency of the corresponding document keywords in each commentary video clip and the number of the document keywords;
and selecting a preset number of the commentary video clips with the highest criticality as the key video clips.
For example, when the appearance time of the target object in a commentary video clip exceeds the preset time, the commentary video clips meeting this initial requirement can be screened out.
After the commentary video clips meeting the initial requirement are preliminarily screened, the video frames in which the target document appears in each commentary video clip are selected and converted into corresponding video pictures, the text information of the target document in the video pictures is recognized by an OCR character recognition technique, and the text is split into keywords to obtain a keyword vocabulary, from which the document keywords are screened out using a preset keyword library.
The number of document keywords and their appearance frequency in each screened commentary video clip are counted. A correspondence between the criticality of a commentary video clip, the number of document keywords, and the appearance frequency of the document keywords is preset; the criticality of each commentary video clip is obtained according to the appearance frequency and the number of document keywords, and a preset number of commentary video clips with the highest criticality are selected as the key video clips, as in the sketch below.
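For illustration only, a minimal sketch of this keyword-based criticality measure; the contents of the preset keyword library and the linear weighting of distinct keyword count and total appearance frequency are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable

# Assumed preset keyword library for the target document.
KEYWORD_LIBRARY = {"premium", "coverage", "claim", "term", "benefit"}

def document_keywords(ocr_tokens: Iterable[str]) -> Counter:
    """Count occurrences of library keywords among the OCR tokens of one clip."""
    return Counter(t.lower() for t in ocr_tokens if t.lower() in KEYWORD_LIBRARY)

def criticality(ocr_tokens: Iterable[str], w_count: float = 1.0, w_freq: float = 0.5) -> float:
    """Combine the number of distinct document keywords and their total frequency."""
    counts = document_keywords(ocr_tokens)
    distinct = len(counts)            # number of document keywords present
    frequency = sum(counts.values())  # total number of appearances
    return w_count * distinct + w_freq * frequency

# Rank clips by criticality(ocr_tokens_of_clip) and keep the preset number of top clips.
```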
Step S3: extracting a commentary audio and a commentary image corresponding to the key video clip, acquiring first voice data of the target object in the key video clip according to the commentary audio, acquiring a plurality of mouth shape change images of the target object in the key video clip according to the commentary image, and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images.
For example, each key video clip includes at least a commentary audio and a commentary image, where each commentary audio includes a plurality of audio frames and each commentary image includes a plurality of video image frames. The key video clip is decoded to obtain the commentary audio and the commentary image corresponding to the key video clip.
Environmental noise, such as applause from the audience or questions from listeners, may be present in the acquired audio data. In order to reduce the influence of the environmental noise, the first voice data of the target object needs to be separated from the audio data.
The lip language information of the target object in the video is recognized according to the mouth shape change images, so that the second voice data of the target object in the commentary image is obtained from the lip language information; using both the first voice data and the second voice data, a more accurate and complete target voice of the target object can be obtained.
In some embodiments, the extracting the first voice data of the target object according to the audio data includes:
inputting audio data into a feature extraction network of a voice extraction model for feature extraction, and acquiring a feature vector corresponding to the audio data, wherein the audio data comprises first voice data of the target object and noise data of the environment;
inputting a preset vector and the feature vector into a voice extraction network of the voice extraction model to extract first voice data of the target object from the audio data, wherein the voice extraction model is obtained through voice training of the target object, the preset vector is obtained according to the noise data, and the voice extraction network adjusts the proportion of the first voice data and the noise data in the audio data by taking the preset vector as a reference, so as to obtain the first voice data of the target object.
For example, different sounds have different voiceprint characteristics, so the voiceprint characteristics can be used to distinguish the target object's voice from the environmental noise and separate the voice data of the target object from the audio data.
First, a voiceprint is a sound spectrum, displayed by an electroacoustic instrument, that carries speech information. The production of human speech is a complex physiological and physical process between the human language centers and the vocal organs, and the voiceprint maps of any two people differ, because the vocal organs used in speaking, namely the tongue, teeth, larynx, lungs, and nasal cavity, vary greatly in size and shape from person to person.
The speech acoustic characteristics of each person are relatively stable but also variable; they are not absolute or invariant. The variation can come from physiology, pathology, psychology, imitation, or disguise, and is also related to environmental interference. Nevertheless, since the vocal organs of each person are different, in general people can still distinguish different voices or judge whether two voices come from the same person.
Further, voiceprint features include acoustic features related to the anatomy of the human vocal mechanism, such as the spectrum, cepstrum, formants, fundamental tone, and reflection coefficients, as well as nasal sounds, deep-breath sounds, hoarseness, and laughter; voiceprint features are also influenced by socio-economic status, education level, place of birth, semantics, rhetoric, pronunciation, and speech habits, as well as personal traits of rhythm, speed, intonation, and volume. From the perspective of mathematical modeling, the features currently usable by automatic voiceprint recognition models include: acoustic features such as the cepstrum; lexical features such as speaker-dependent word n-grams and phoneme n-grams; and prosodic features such as pitch and energy patterns described with n-grams.
In practical applications, when performing voiceprint feature extraction, the voiceprint feature data of a speaker in the audio data may be extracted, where the voiceprint feature data includes at least one of the pitch spectrum and its contour, the energy of pitch frames, the occurrence frequency and trajectory of the pitch formants, linear prediction cepstrum coefficients, line spectrum pairs, autocorrelation and log-area ratios, Mel-Frequency Cepstrum Coefficients (MFCC), and perceptual linear prediction coefficients.
For example, the audio data includes the first voice data of the target object and the noise data of the environment. Since there is a large difference between the target object's voice and the environmental noise, the voice extraction model is trained on the voice of the target object and the environmental noise. When extracting the voice data of the target object, the acquired audio data is input into the voice extraction model for feature extraction to obtain the feature vector corresponding to the audio data, the environmental noise of the environment in which the terminal device is located is acquired, and the environmental noise is converted into the corresponding preset vector.
The preset vector and the feature vector are input into the voice extraction network of the voice extraction model to extract the first voice data of the target object from the audio data, where the voice extraction model is obtained through training on the user's voice and the environmental noise, the preset vector is obtained according to the noise data, and the voice extraction network adjusts the proportion of the first voice data and the noise data in the audio data with the preset vector as a reference, thereby obtaining the first voice data of the target object. A sketch of this two-stage extraction is shown below.
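For illustration only, a minimal PyTorch sketch of this two-stage extraction follows. The layer sizes and the mask-based formulation are assumptions; the patent only specifies that a feature extraction network produces a feature vector and that a voice extraction network, conditioned on the noise-derived preset vector, adjusts the proportion of the target voice and the noise.

```python
import torch
import torch.nn as nn

class VoiceExtractor(nn.Module):
    def __init__(self, n_features: int = 257, ref_dim: int = 64):
        super().__init__()
        # Feature extraction network: frame-level audio features -> embedding.
        self.feature_net = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU())
        # Voice extraction network: embedding + preset (noise) vector -> per-feature mask.
        self.extract_net = nn.Sequential(
            nn.Linear(256 + ref_dim, 256), nn.ReLU(),
            nn.Linear(256, n_features), nn.Sigmoid(),
        )

    def forward(self, audio_feats: torch.Tensor, preset_vec: torch.Tensor) -> torch.Tensor:
        # audio_feats: (frames, n_features) features of the mixed speech and noise
        # preset_vec:  (ref_dim,) vector derived from the environment noise
        emb = self.feature_net(audio_feats)
        ref = preset_vec.unsqueeze(0).expand(emb.size(0), -1)
        mask = self.extract_net(torch.cat([emb, ref], dim=-1))
        # The mask scales down the noise component, leaving the target speaker's voice.
        return mask * audio_feats

# Usage sketch: feats = magnitude spectrogram of the commentary audio,
# preset = embedding of the recorded environment noise; first_voice = model(feats, preset).
```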
In some embodiments, the acquiring a plurality of mouth shape change images of the target object in the key video clip according to the commentary image, and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images, includes:
extracting a mouth shape image of the target object from each frame of the video image of the commentary image, and assigning a corresponding timestamp to each mouth shape image according to the time axis of the commentary image;
and inputting the mouth shape images into a preset lip language recognition model in timestamp order to obtain the second voice data corresponding to the target object in the commentary image.
For example, the acquired commentary image includes N frames of video images; the mouth shape image of the target object is extracted from each of the N frames, the extracted mouth shape images are assigned corresponding timestamps according to the order of the frames, and the mouth shape images are input into the lip language recognition model in timestamp order to obtain the second voice data corresponding to the target object in the commentary image.
For example, a first mouth shape image is acquired from the first frame of the commentary image, a second mouth shape image from the second frame, a third mouth shape image from the third frame, and so on up to the N-th mouth shape image. Corresponding timestamps are assigned according to the time order of the frames so that the sequence of the target object's mouth shape changes is identified accurately, and the mouth shape images acquired from the first to the N-th frame are input, in timestamp order, into the lip language recognition model to obtain the second voice data corresponding to the target object in the commentary image, as in the sketch below.
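For illustration only, a minimal sketch of preparing the timestamped mouth shape images; the Haar face detector, the lower-third mouth crop heuristic, and the placeholder `lip_model` are assumptions standing in for the preset lip language recognition model.

```python
import cv2

def mouth_crops_with_timestamps(video_path: str):
    """Return (timestamp_ms, mouth_image) pairs for every frame, in time order."""
    face_det = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    crops = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        timestamp_ms = cap.get(cv2.CAP_PROP_POS_MSEC)  # per-frame timestamp from the time axis
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_det.detectMultiScale(gray, 1.1, 5)
        if len(faces):
            x, y, w, h = faces[0]
            mouth = frame[y + 2 * h // 3: y + h, x: x + w]  # lower third of the face box
            crops.append((timestamp_ms, mouth))
    cap.release()
    return sorted(crops, key=lambda item: item[0])

# second_voice = lip_model(  # hypothetical preset lip language recognition model
#     [crop for _, crop in mouth_crops_with_timestamps("key_clip.mp4")])
```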
Step S4: determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to obtain target text information.
For example, due to the influence of environmental noise, part of the target voice in the first voice data may be lost, either because the acquired target voice is drowned out by the environmental noise or because the sound collector fails to pick it up under the environmental interference. The missing part of the voice is therefore compensated with the corresponding part of the second voice data, so as to obtain the target voice data of the target object. The acquired target voice data is then recognized with a preset voice recognition model to obtain the target text information.
In some embodiments, the determining target voice data of the target object according to the first voice data and the second voice data includes:
comparing the first voice data with the second voice data, and judging whether the first voice data has voice missing or not;
and when the first voice data has voice missing, performing voice compensation on the first voice data according to the second voice data to obtain the target voice data.
In some embodiments, the performing the voice compensation on the first voice data according to the second voice data to obtain the target voice data includes:
marking a missing part of the first voice data and acquiring a first time period corresponding to the missing part;
and acquiring a second voice data segment corresponding to the first time period from the second voice data, and compensating the missing part by using the second voice data segment to obtain the target voice data.
For example, since the voice data and the video data are acquired simultaneously, the first voice data and the second voice data have the same start time. Whether the first voice data has a voice missing part is judged by comparing, over time, the similarity of the first audio signal corresponding to the first voice data and the second audio signal corresponding to the second voice data. When a voice missing part exists, it is marked and the first time period corresponding to the missing part is acquired; the second voice data segment corresponding to the same time period is then acquired from the second voice data, and the missing part of the first voice data is compensated with this second voice data segment to obtain the target voice data of the target object. A sketch of this compensation is shown below.
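For illustration only, a minimal sketch of this compensation. Detecting the missing part as near-silent windows, the 0.5-second window size, and the silence threshold are illustrative assumptions, and both signals are assumed to share the same start time, length, and sample rate.

```python
import numpy as np

def compensate(first_voice: np.ndarray, second_voice: np.ndarray,
               sample_rate: int, window_s: float = 0.5,
               silence_threshold: float = 1e-3) -> np.ndarray:
    """Replace missing windows of first_voice with the same time windows of second_voice."""
    target = first_voice.copy()
    win = int(window_s * sample_rate)
    for start in range(0, len(target), win):
        segment = target[start:start + win]
        # A window whose RMS energy is near zero is treated as a voice missing part.
        if np.sqrt(np.mean(segment ** 2)) < silence_threshold:
            target[start:start + win] = second_voice[start:start + len(segment)]
    return target

# target_voice = compensate(first_voice, second_voice, sample_rate=16000)
```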
The target voice data is then input into the preset voice recognition model to obtain the target text information. For example, the target voice data is converted into the corresponding text by an Automatic Speech Recognition (ASR) technique, so that the obtained target text information can be used as the subtitle corresponding to the target voice data.
Step S5: generating a video summary segment corresponding to each key video clip according to the commentary image, the target voice data, and the target text information corresponding to that key video clip, and splicing the video summary segments to generate a video summary corresponding to the target document commentary video.
The commentary image, the target voice data, and the target text information acquired for each key video clip are used as the source material of the current video summary segment: the target voice data serves as the narration voice of the commentary image, and the target text information serves as the narration subtitle of the commentary image. The time point at which the target object starts explaining in the commentary image is determined and aligned with the start time of the corresponding target voice data, and that time point is also used as the display time of the target text information, so that the current video summary segment can be generated accurately.
In the same way, the video summary segments corresponding to the key video clips are generated in turn, so that the video summary segments corresponding to all the key video clips are obtained.
Each commentary video clip is given a corresponding timestamp when it is divided, and the order of the corresponding video clips can be determined from the timestamps.
Therefore, the order of the corresponding video summary segments can be determined according to the timestamps to generate the video summary of the target document commentary video, or the video summary segments can be ordered randomly and spliced into the corresponding video summary, which is not limited herein. A sketch of assembling and splicing the segments is shown below.
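For illustration only, a minimal sketch of assembling and splicing the summary segments using ffmpeg as one possible tool; the file naming, the SRT subtitle format for the target text information, and the concat-demuxer approach are assumptions, as the patent does not mandate a container or tool.

```python
import subprocess

def build_summary_segment(image_track: str, voice_track: str, srt_file: str, out_file: str):
    """Mux one key clip's commentary image, compensated target voice, and subtitles."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", image_track, "-i", voice_track,
         "-vf", f"subtitles={srt_file}",      # burn in the target text information
         "-map", "0:v", "-map", "1:a",        # video from the commentary image, audio from the target voice
         "-c:a", "aac", out_file],
        check=True,
    )

def splice_segments(segment_files, out_file: str = "video_summary.mp4"):
    """Concatenate the summary segments, assumed to be listed in timestamp order."""
    with open("segments.txt", "w") as f:
        f.writelines(f"file '{name}'\n" for name in segment_files)
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", "segments.txt", "-c", "copy", out_file],
        check=True,
    )
```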
Referring to fig. 2, the present application further provides a video summary generation apparatus 200. The video summary generation apparatus 200 includes a segment dividing module 201, a segment screening module 202, a voice extraction module 203, a text conversion module 204, and a summary generation module 205.
Specifically, the segment dividing module 201 is configured to obtain a target document commentary video and divide the target document commentary video into a plurality of commentary video clips;
the segment screening module 202 is configured to select key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target document, wherein each key video clip includes a target object that explains the target document;
the voice extraction module 203 is configured to extract a commentary audio and a commentary image corresponding to each key video clip, acquire first voice data of the target object in the key video clip according to the commentary audio, acquire a plurality of mouth shape change images of the target object in the key video clip according to the commentary image, and acquire second voice data corresponding to the target object according to the plurality of mouth shape change images;
the text conversion module 204 is configured to determine target voice data of the target object according to the first voice data and the second voice data, and input the target voice data into a preset voice recognition model to obtain target text information;
and the summary generation module 205 is configured to generate a video summary segment corresponding to each key video clip according to the commentary image, the target voice data, and the target text information corresponding to that key video clip, and splice the video summary segments to generate a video summary corresponding to the target document commentary video.
In some embodiments, the segment dividing module 201 is further configured to: acquire the total duration of the target document commentary video, and divide the target document commentary video into N commentary video clips according to the total duration, where N is greater than 2.
In some embodiments, the segment screening module 202 is further configured to: judge whether the appearance time of the target object that explains the target document in the commentary video clip exceeds a preset time;
when the appearance time of the target object exceeds the preset time, acquire subtitle information of the commentary video clip, and classify each piece of subtitle information by type according to keywords in the subtitle information;
perform a weighted summation over the subtitle information based on the subtitle type of each piece of subtitle information and the weighting coefficient of each subtitle type, to obtain the criticality of the commentary video clip;
and select a preset number of the commentary video clips with the highest criticality as the key video clips.
In some embodiments, the segment screening module 202 is further configured to:
judge whether the appearance time of the target object that explains the target document in the commentary video clip exceeds a preset time;
when the appearance time of the target object exceeds the preset time, obtain the document information of the target document appearing in each commentary video clip, and extract keywords from the document information to obtain document keywords;
acquire the criticality of each commentary video clip according to the appearance frequency of the corresponding document keywords in each commentary video clip and the number of the document keywords;
and select a preset number of the commentary video clips with the highest criticality as the key video clips.
In some embodiments, the voice extraction module 203 is further configured to: inputting audio data into a feature extraction network of a voice extraction model for feature extraction, and acquiring a feature vector corresponding to the audio data, wherein the audio data comprises first voice data of the target object and noise data of the environment;
inputting a preset vector and the feature vector into a voice extraction network of the voice extraction model to extract first voice data of the target object from the audio data, wherein the voice extraction model is obtained through voice training of the target object, the preset vector is obtained according to the noise data, and the voice extraction network adjusts the proportion of the first voice data and the noise data in the audio data by taking the preset vector as a reference, so as to obtain the first voice data of the target object.
In some embodiments, the voice extraction module 203 is further configured to: extracting a mouth shape image of the target object from each frame of the video image of the commentary image, and assigning a corresponding timestamp to each mouth shape image according to the time axis of the commentary image;
and inputting the mouth shape images into a preset lip language recognition model in timestamp order to obtain the second voice data corresponding to the target object in the commentary image.
In some embodiments, the text conversion module 204 is further configured to: comparing the first voice data with the second voice data, and judging whether the first voice data has a voice missing part;
and when the first voice data has voice missing, performing voice compensation on the first voice data according to the second voice data to obtain the target voice data.
In some embodiments, the text conversion module 204 is further configured to: marking a missing part of the first voice data and acquiring a first time period corresponding to the missing part;
and acquiring a second voice data segment corresponding to the first time period from the second voice data, and compensating the missing part by using the second voice data segment to obtain the target voice data.
Referring to fig. 3, fig. 3 is a schematic block diagram of an electronic device according to an embodiment of the present disclosure.
As shown in fig. 3, the electronic device 300 comprises a processor 301 and a memory 302, the processor 301 and the memory 302 being connected by a bus 303, such as an I2C (Inter-integrated Circuit) bus.
In particular, processor 301 is configured to provide computational and control capabilities, supporting the operation of the entire server. The Processor 301 may be a Central Processing Unit (CPU), and the Processor 301 may also be other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Specifically, the Memory 302 may be a Flash chip, a Read-Only Memory (ROM), a magnetic disk, an optical disk, a USB disk, or a removable hard disk.
Those skilled in the art will appreciate that the structure shown in fig. 3 is a block diagram of only a portion of the structure related to the embodiments of the present application, and does not constitute a limitation on the electronic device to which the embodiments of the present application are applied, and a specific electronic device may include more or less components than those shown in the drawings, or may combine some components, or have a different arrangement of components.
The processor 301 is configured to run a computer program stored in the memory and, when executing the computer program, to implement any one of the video summary generation methods provided by the embodiments of the present application.
In some embodiments, the processor 301 is configured to run a computer program stored in the memory and to implement the following steps when executing the computer program:
obtaining a target document commentary video, and dividing the target document commentary video into a plurality of commentary video clips;
selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target document, wherein each key video clip includes a target object that explains the target document;
extracting a commentary audio and a commentary image corresponding to each key video clip, acquiring first voice data of the target object in the key video clip according to the commentary audio, acquiring a plurality of mouth shape change images of the target object in the key video clip according to the commentary image, and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images;
determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to obtain target text information;
and generating a video summary segment corresponding to each key video clip according to the commentary image, the target voice data, and the target text information corresponding to that key video clip, and splicing the video summary segments to generate a video summary corresponding to the target document commentary video.
In some embodiments, when dividing the target document commentary video into a plurality of commentary video clips, the processor 301 is configured to:
acquire the total duration of the target document commentary video, and divide the target document commentary video into N commentary video clips according to the total duration, where N is greater than 2.
In some embodiments, when selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target document, the processor 301 is configured to:
judge whether the appearance time of the target object that explains the target document in the commentary video clip exceeds a preset time;
when the appearance time of the target object exceeds the preset time, obtain the document information of the target document appearing in each commentary video clip, and extract keywords from the document information to obtain document keywords;
acquire the criticality of each commentary video clip according to the appearance frequency of the corresponding document keywords in each commentary video clip and the number of the document keywords;
and select a preset number of the commentary video clips with the highest criticality as the key video clips.
In some embodiments, when selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target document, the processor 301 is configured to:
judge whether the appearance time of the target object that explains the target document in the commentary video clip exceeds a preset time;
when the appearance time of the target object exceeds the preset time, acquire subtitle information of the commentary video clip, and classify each piece of subtitle information by type according to keywords in the subtitle information;
perform a weighted summation over the subtitle information based on the subtitle type of each piece of subtitle information and the weighting coefficient of each subtitle type, to obtain the criticality of the commentary video clip;
and select a preset number of the commentary video clips with the highest criticality as the key video clips.
In some embodiments, when extracting the first voice data of the target object according to the audio data, the processor 301 is configured to:
inputting audio data into a feature extraction network of a voice extraction model for feature extraction, and acquiring a feature vector corresponding to the audio data, wherein the audio data comprises first voice data of the target object and noise data of the environment;
inputting a preset vector and the feature vector into a voice extraction network of the voice extraction model to extract first voice data of the target object from the audio data, wherein the voice extraction model is obtained through voice training of the target object, the preset vector is obtained according to the noise data, and the voice extraction network adjusts the proportion of the first voice data and the noise data in the audio data by taking the preset vector as a reference, so as to obtain the first voice data of the target object.
In some embodiments, when acquiring a plurality of mouth shape change images of the target object in the key video clip according to the commentary image and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images, the processor 301 is configured to:
extract a mouth shape image of the target object from each frame of the video image of the commentary image, and assign a corresponding timestamp to each mouth shape image according to the time axis of the commentary image;
and input the mouth shape images into a preset lip language recognition model in timestamp order to obtain the second voice data corresponding to the target object in the commentary image.
In some embodiments, when determining the target voice data of the target object according to the first voice data and the second voice data, the processor 301 is configured to:
comparing the first voice data with the second voice data, and judging whether the first voice data has voice missing or not;
and when the first voice data has voice missing, performing voice compensation on the first voice data according to the second voice data to obtain the target voice data.
In some embodiments, when performing the voice compensation on the first voice data according to the second voice data to obtain the target voice data, the processor 301 is configured to:
marking a missing part of the first voice data and acquiring a first time period corresponding to the missing part;
and acquiring a second voice data segment corresponding to the first time period from the second voice data, and compensating the missing part by using the second voice data segment to obtain the target voice data.
It should be noted that, as will be clearly understood by those skilled in the art, for convenience and brevity of description, the specific working process of the electronic device described above may refer to the corresponding process in the foregoing video summary generation method embodiment, and details are not described herein again.
The embodiments of the present application further provide a storage medium for a computer-readable storage, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of any method for generating a video summary as provided in the embodiments of the present application.
The storage medium may be an internal storage unit of the electronic device of the foregoing embodiment, for example, a hard disk or a memory of the electronic device. The storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the electronic device.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media as known to those skilled in the art.
It should be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments. While the present application has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the present application as defined by the appended claims. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating a video summary, characterized by comprising the following steps:
acquiring a target file commentary video, and dividing the target file commentary video into a plurality of commentary video clips;
selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target file, wherein each key video clip comprises a target object explaining the target file;
extracting a file commentary audio and a file commentary image corresponding to the key video clip, acquiring first voice data of the target object in the key video clip according to the file commentary audio, acquiring a plurality of mouth shape change images of the target object in the key video clip according to the file commentary image, and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images;
determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to obtain target text information;
and generating video summary clips corresponding to the key video clips according to the file commentary image, the target voice data and the target text information corresponding to each key video clip, and splicing the video summary clips to generate a video summary corresponding to the target file commentary video.
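For illustration only (not part of the claims), the following Python sketch lays out the order of operations recited in claim 1. Every callable passed in is a placeholder for a component the claim names, and the SummarySegment container and all parameter names are assumptions of this example.

from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class SummarySegment:
    frames: list   # file commentary images of one key video clip
    voice: object  # target voice data
    text: str      # target text information

def generate_video_summary(
    commentary_clips: Sequence[object],
    select_key_clips: Callable,      # relevance-based selection (claims 2 and 3)
    extract_first_voice: Callable,   # audio branch (claim 4)
    extract_second_voice: Callable,  # lip-reading branch (claim 5)
    merge_voice: Callable,           # missing-voice compensation (claims 6 and 7)
    speech_to_text: Callable,        # preset voice recognition model
    splice: Callable,                # joins summary segments into one summary video
) -> object:
    """Walk the claimed pipeline; each injected callable stands in for one step."""
    segments: List[SummarySegment] = []
    for clip in select_key_clips(commentary_clips):
        first_voice = extract_first_voice(clip)       # from the file commentary audio
        second_voice = extract_second_voice(clip)     # from the mouth shape change images
        target_voice = merge_voice(first_voice, second_voice)
        text = speech_to_text(target_voice)
        segments.append(SummarySegment(getattr(clip, "frames", []), target_voice, text))
    return splice(segments)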
2. The method of claim 1, wherein the selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target file comprises:
judging whether the appearance duration of the target object explaining the target file in the commentary video clip exceeds a preset duration;
when the appearance duration of the target object exceeds the preset duration, obtaining file information of the target file appearing in each commentary video clip, and extracting keywords from the file information to obtain file keywords;
acquiring the key degree of each commentary video clip according to the appearance frequency of the corresponding file keywords in each commentary video clip and the number of the file keywords;
and selecting a preset number of the commentary video clips with the highest key degree as the key video clips.
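For illustration only (not part of the claims), a minimal sketch of such a key-degree score. The scoring formula (total keyword occurrences plus the number of distinct file keywords present) and the use of each clip's text as the search field are assumptions of this example, since the claim does not fix an exact formula.

from typing import Dict, List

def key_degree_by_keywords(clip_text: str, file_keywords: List[str]) -> float:
    """Illustrative key degree: total occurrences of the file keywords plus the
    number of distinct file keywords that appear at least once in the clip."""
    text = clip_text.lower()
    occurrences = sum(text.count(kw.lower()) for kw in file_keywords)
    distinct = sum(1 for kw in file_keywords if kw.lower() in text)
    return float(occurrences + distinct)

def select_key_clips(clip_texts: Dict[str, str],
                     file_keywords: List[str],
                     top_n: int = 3) -> List[str]:
    """Return the identifiers of the top_n clips with the highest key degree."""
    ranked = sorted(clip_texts,
                    key=lambda cid: key_degree_by_keywords(clip_texts[cid], file_keywords),
                    reverse=True)
    return ranked[:top_n]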
3. The method of claim 1, wherein the selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target file comprises:
judging whether the appearance duration of the target object explaining the target file in the commentary video clip exceeds a preset duration;
when the appearance duration of the target object exceeds the preset duration, acquiring subtitle information of the commentary video clip, and classifying each piece of subtitle information into a subtitle type according to keywords in the subtitle information;
performing weighted summation on the subtitle information based on the subtitle type of each piece of subtitle information and the weighting coefficient of each subtitle type to obtain the key degree of the commentary video clip;
and selecting a preset number of the commentary video clips with the highest key degree as the key video clips.
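For illustration only (not part of the claims), a sketch of the weighted summation. The subtitle types, the keyword lists used for type division, and the weighting coefficients below are invented for the example and are not specified by the claim.

from typing import Dict, List

# Hypothetical subtitle types and weighting coefficients.
TYPE_WEIGHTS: Dict[str, float] = {
    "product_feature": 1.0,
    "price_or_promotion": 0.8,
    "greeting_or_filler": 0.1,
}

def classify_subtitle(line: str) -> str:
    """Toy keyword-based type division; a real classifier would be richer."""
    lowered = line.lower()
    if any(kw in lowered for kw in ("price", "discount", "offer")):
        return "price_or_promotion"
    if any(kw in lowered for kw in ("hello", "welcome", "thanks")):
        return "greeting_or_filler"
    return "product_feature"

def key_degree_by_subtitles(subtitles: List[str]) -> float:
    """Weighted summation over subtitle lines based on their subtitle type."""
    return sum(TYPE_WEIGHTS[classify_subtitle(line)] for line in subtitles)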
4. The method of claim 1, wherein the extracting the first voice data of the target object from the audio data comprises:
inputting the audio data into a feature extraction network of a voice extraction model for feature extraction, and acquiring a feature vector corresponding to the audio data, wherein the audio data comprises the first voice data of the target object and noise data of the environment;
and inputting a preset vector and the feature vector into a voice extraction network of the voice extraction model to extract the first voice data of the target object from the audio data, wherein the voice extraction model is obtained by training with the voice of the target object, the preset vector is obtained according to the noise data, and the voice extraction network adjusts the proportion of the first voice data to the noise data in the audio data by taking the preset vector as a reference, so as to obtain the first voice data of the target object.
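For illustration only (not part of the claims), a minimal PyTorch sketch of a two-stage voice extraction model of the kind the claim describes: a feature extraction network producing feature vectors, followed by an extraction network that takes a preset (noise-derived) reference vector and rebalances the mixture. The layer choices, dimensions, and masking formulation are assumptions of this example, not the disclosed architecture.

import torch
import torch.nn as nn

class VoiceExtractionModel(nn.Module):
    """Feature extraction network + reference-conditioned voice extraction network."""

    def __init__(self, n_mels: int = 80, hidden: int = 256, ref_dim: int = 128):
        super().__init__()
        # Feature extraction network: audio features -> feature vectors.
        self.feature_net = nn.GRU(n_mels, hidden, batch_first=True)
        # Voice extraction network: feature vectors + preset vector -> mask that
        # adjusts the proportion of the target voice versus the noise.
        self.extract_net = nn.Sequential(
            nn.Linear(hidden + ref_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_mels),
            nn.Sigmoid(),                 # per-bin mask in [0, 1]
        )

    def forward(self, audio_feats: torch.Tensor, preset_vec: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, n_mels); preset_vec: (batch, ref_dim)
        feats, _ = self.feature_net(audio_feats)
        ref = preset_vec.unsqueeze(1).expand(-1, feats.size(1), -1)
        mask = self.extract_net(torch.cat([feats, ref], dim=-1))
        return audio_feats * mask         # estimated first voice data features

# Usage sketch (random tensors stand in for real mel-spectrogram features).
model = VoiceExtractionModel()
mixture = torch.randn(1, 200, 80)
noise_reference = torch.randn(1, 128)
target_voice_feats = model(mixture, noise_reference)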
5. The method according to claim 1, wherein the acquiring a plurality of mouth shape change images of the target object in the key video clip according to the file commentary image, and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images, comprises:
extracting a mouth shape image of the target object from each frame of video image of the file commentary image, and assigning a corresponding timestamp to each mouth shape image according to the time axis of the file commentary image;
and inputting the mouth shape images into a preset lip language recognition model according to their timestamps, so as to obtain the second voice data corresponding to the target object in the file commentary image.
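For illustration only (not part of the claims), a sketch of the per-frame extraction and timestamping with OpenCV. The detect_mouth and lip_reading_model callables are hypothetical placeholders for the mouth detector and the preset lip language recognition model, neither of which is specified in the claim.

import cv2  # OpenCV; pip install opencv-python

def mouth_images_with_timestamps(video_path, detect_mouth, lip_reading_model):
    """Walk the key clip frame by frame, crop the target object's mouth region,
    stamp each crop with its position on the clip's time axis, and feed the
    time-ordered crops to a lip-reading model that returns second voice data."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    stamped_crops = []
    frame_index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        box = detect_mouth(frame)              # hypothetical: (x, y, w, h) or None
        if box is not None:
            x, y, w, h = box
            timestamp = frame_index / fps      # seconds on the clip's time axis
            stamped_crops.append((timestamp, frame[y:y + h, x:x + w]))
        frame_index += 1
    cap.release()
    stamped_crops.sort(key=lambda item: item[0])  # keep time order for the model
    return lip_reading_model(stamped_crops)       # hypothetical: -> second voice data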
6. The method of claim 1, wherein the determining target voice data of the target object according to the first voice data and the second voice data comprises:
comparing the first voice data with the second voice data, and judging whether voice is missing from the first voice data;
and when voice is missing from the first voice data, performing voice compensation on the first voice data according to the second voice data to obtain the target voice data.
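For illustration only (not part of the claims), a sketch of one way to judge whether voice is missing, by comparing the short-time energy of the two time-aligned streams; a frame where the first voice data is near-silent while the second voice data still carries speech energy is treated as missing. The frame length and silence threshold are assumptions of this example. The returned periods could then feed the compensation step of claim 7.

import numpy as np

def find_missing_voice(first_voice: np.ndarray,
                       second_voice: np.ndarray,
                       sample_rate: int,
                       frame_ms: int = 20,
                       silence_thresh: float = 1e-3) -> list:
    """Return time periods (start_s, end_s) where the first voice data appears
    silent while the second voice data still shows speech energy."""
    frame_len = int(sample_rate * frame_ms / 1000)
    missing_periods = []
    for start in range(0, min(len(first_voice), len(second_voice)), frame_len):
        a = first_voice[start:start + frame_len]
        b = second_voice[start:start + frame_len]
        # Root-mean-square energy comparison of the aligned frames.
        if np.sqrt(np.mean(a ** 2)) < silence_thresh <= np.sqrt(np.mean(b ** 2)):
            missing_periods.append((start / sample_rate,
                                    (start + frame_len) / sample_rate))
    return missing_periods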
7. The method of claim 6, wherein the performing voice compensation on the first voice data according to the second voice data to obtain the target voice data comprises:
marking a missing part of the first voice data, and acquiring a first time period corresponding to the missing part;
and acquiring a second voice data segment corresponding to the first time period from the second voice data, and compensating the missing part with the second voice data segment to obtain the target voice data.
8. An apparatus for generating a video summary, comprising:
the clip dividing module is used for acquiring a target file commentary video and dividing the target file commentary video into a plurality of commentary video clips;
the clip screening module is used for selecting key video clips from the plurality of commentary video clips according to the degree of correlation between each commentary video clip and the explanation of the target file, wherein each key video clip comprises a target object explaining the target file;
the voice extraction module is used for extracting a file commentary audio and a file commentary image corresponding to the key video clip, acquiring first voice data of the target object in the key video clip according to the file commentary audio, acquiring a plurality of mouth shape change images of the target object in the key video clip according to the file commentary image, and acquiring second voice data corresponding to the target object according to the plurality of mouth shape change images;
the text conversion module is used for determining target voice data of the target object according to the first voice data and the second voice data, and inputting the target voice data into a preset voice recognition model to acquire target text information;
and the summary generation module is used for generating video summary clips corresponding to the key video clips according to the file commentary image, the target voice data and the target text information corresponding to each key video clip, and splicing the video summary clips to generate a video summary corresponding to the target file commentary video.
9. An electronic device, characterized in that the electronic device comprises a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus for implementing connection and communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the steps of the video summary generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of the video summary generation method according to any one of claims 1 to 7.
CN202111436728.5A 2021-11-29 2021-11-29 Video abstract generation method, device, equipment and storage medium Active CN114143479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111436728.5A CN114143479B (en) 2021-11-29 2021-11-29 Video abstract generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114143479A true CN114143479A (en) 2022-03-04
CN114143479B CN114143479B (en) 2023-07-25

Family

ID=80390010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111436728.5A Active CN114143479B (en) 2021-11-29 2021-11-29 Video abstract generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114143479B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110753269A (en) * 2018-07-24 2020-02-04 Tcl集团股份有限公司 Video abstract generation method, intelligent terminal and storage medium
CN110442747A (en) * 2019-07-09 2019-11-12 中山大学 A kind of video abstraction generating method based on keyword
CN111694984A (en) * 2020-06-12 2020-09-22 百度在线网络技术(北京)有限公司 Video searching method and device, electronic equipment and readable storage medium
CN113395578A (en) * 2020-11-27 2021-09-14 腾讯科技(深圳)有限公司 Method, device and equipment for extracting video theme text and storage medium
CN113283327A (en) * 2021-05-17 2021-08-20 多益网络有限公司 Video text generation method, device, equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114760494A (en) * 2022-04-15 2022-07-15 北京字节跳动网络技术有限公司 Video processing method and device, readable medium and electronic equipment
CN114938462A (en) * 2022-06-07 2022-08-23 平安科技(深圳)有限公司 Intelligent editing method and system of teaching video, electronic equipment and storage medium
CN114938462B (en) * 2022-06-07 2023-06-30 平安科技(深圳)有限公司 Intelligent editing method, system, electronic equipment and storage medium of teaching video
CN115334367A (en) * 2022-07-11 2022-11-11 北京达佳互联信息技术有限公司 Video summary information generation method, device, server and storage medium
CN115334367B (en) * 2022-07-11 2023-10-17 北京达佳互联信息技术有限公司 Method, device, server and storage medium for generating abstract information of video

Also Published As

Publication number Publication date
CN114143479B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN112562721B (en) Video translation method, system, device and storage medium
CN110148427B (en) Audio processing method, device, system, storage medium, terminal and server
CN114143479B (en) Video abstract generation method, device, equipment and storage medium
US9552807B2 (en) Method, apparatus and system for regenerating voice intonation in automatically dubbed videos
CN110148400B (en) Pronunciation type recognition method, model training method, device and equipment
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
CN113707125A (en) Training method and device for multi-language voice synthesis model
CN111681678B (en) Method, system, device and storage medium for automatically generating sound effects and matching videos
CN112382295A (en) Voice recognition method, device, equipment and readable storage medium
CN114461852A (en) Audio and video abstract extraction method, device, equipment and storage medium
EP3839953A1 (en) Automatic caption synchronization and positioning
CN114927126A (en) Scheme output method, device and equipment based on semantic analysis and storage medium
US20140297255A1 (en) System and method for speech to speech translation using cores of a natural liquid architecture system
CN114598933A (en) Video content processing method, system, terminal and storage medium
CN114363531B (en) H5-based text description video generation method, device, equipment and medium
CN117558259A (en) Digital man broadcasting style control method and device
CN110992984B (en) Audio processing method and device and storage medium
WO2023142590A1 (en) Sign language video generation method and apparatus, computer device, and storage medium
CN116738250A (en) Prompt text expansion method, device, electronic equipment and storage medium
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN111681676B (en) Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN110428668B (en) Data extraction method and device, computer system and readable storage medium
CN113726962B (en) Method and device for evaluating service quality, electronic device and storage medium
Lavechin et al. Modeling early phonetic acquisition from child-centered audio data
CN111681677B (en) Video object sound effect construction method, system, device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant