CN112565885B - Video segmentation method, system, device and storage medium - Google Patents

Video segmentation method, system, device and storage medium

Info

Publication number
CN112565885B
CN112565885B (application CN202011374280.4A)
Authority
CN
China
Prior art keywords
video
audio
video segment
lip
node information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011374280.4A
Other languages
Chinese (zh)
Other versions
CN112565885A (en)
Inventor
胡玉针
叶俊杰
李权
王伦基
李嘉雄
朱杰
成秋喜
黄桂芳
韩蓝青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Original Assignee
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CYAGEN BIOSCIENCES (GUANGZHOU) Inc and Research Institute Of Tsinghua Pearl River Delta
Priority to CN202011374280.4A
Publication of CN112565885A
Application granted
Publication of CN112565885B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233 Processing of audio elementary streams
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Abstract

The invention discloses a video segmentation method, system, device and storage medium. The method includes: extracting a first audio from a first video and denoising it to obtain a second audio; analyzing the second audio to obtain a third audio and first time node information, and thereby obtaining a first video segment; performing human voice recognition on the third audio to obtain a second video segment; performing face detection on the second video segment; performing lip synchronization detection on the video segments that contain a human face; and performing voice enhancement on the lip-synchronized video segments, and then performing voice recognition on the resulting video segments to obtain a voice recognition result. By applying processing means such as denoising, neural network analysis, human voice recognition, face detection, lip synchronization detection and voice enhancement to the video, the invention automatically achieves high-precision segmentation of the video. The invention can be widely applied in the technical field of video processing.

Description

Video segmentation method, system, device and storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video segmentation method, system, device, and storage medium.
Background
With the continuous development of internet technology, video has become a new carrier for information transmission. However, because of language differences, videos in different languages face a barrier during dissemination. At present this is mainly addressed by adding subtitles or by manual dubbing in another language. Adding subtitles disperses the viewers' attention: viewers focus on the text, ignore the overall picture, and the viewing experience suffers. Manual dubbing in another language, in turn, does not match the lip movements of the actors in the video, so sound and picture fall out of sync, which also harms the viewing experience. If the lip shapes of the corresponding performers in the video could be synthesized to match the dubbing in the other language, these shortcomings could be effectively overcome.
One difficulty in synthesizing matching lip shapes is that the scenes in most videos are complex, the number of speakers is not fixed, and background noise interferes with the accuracy of speech recognition models. To keep sound and picture synchronized, the video must be segmented at sentence breaks in the speech, at scene changes, and at the junctions between human voice and background sound; only high segmentation precision can satisfy the requirements of later video translation. How to perform this early segmentation of the video with high precision is a technical problem that currently needs to be solved.
Disclosure of Invention
To solve at least one of the technical problems in the prior art, the present invention provides a video segmentation method, system, device and storage medium.
According to a first aspect of the embodiments of the present invention, a video segmentation method includes the following steps:
acquiring a first video, extracting a first audio from the first video, and denoising the first audio to obtain a second audio;
analyzing the second audio by using a convolutional neural network to obtain a third audio and first time node information, and segmenting the first video according to the first time node information to obtain a first video segment; the first time node information comprises first start node information and first end node information;
carrying out voice recognition on the third audio to obtain second time node information, and segmenting the first video segment according to the second time node information to obtain a second video segment; the second time node information comprises second starting node information and second terminating node information;
performing face detection on the second video segment to obtain a video segment containing a face and a video segment without the face;
performing lip synchronization detection on the video segments containing the human face to obtain lip synchronization video segments and lip non-synchronization video segments;
and performing voice enhancement on the lip-shaped synchronous video segment to obtain a voice-enhanced video segment, and performing voice recognition on the voice-enhanced video segment, the lip-shaped unsynchronized video segment and the face-free video segment to obtain a voice recognition result.
Further, the step of analyzing the second audio by using a convolutional neural network to obtain a third audio and a first time node includes:
acquiring the second audio, and framing the second audio to obtain a framing result;
carrying out coarse-grained voice detection on the framing result by utilizing the convolutional neural network, and extracting voice features;
classifying and screening the voice features to obtain the third audio;
and generating the first time node information according to the position of the third audio in the first video.
Further, the step of performing voice recognition on the third audio to obtain second time node information includes:
acquiring the third audio;
carrying out voice tracking and clustering analysis on the third audio to distinguish audio segments of different voices;
and generating the second time node information according to the position of the audio clip in the first video.
Further, the step of performing face detection on the second video segment to obtain a video segment containing a face and a video segment without the face includes:
acquiring the second video clip;
performing face detection on the second video segment to obtain a face detection result;
and dividing the second video segment into the video segment containing the human face and the video segment without the human face according to the human face detection result.
Further, the step of performing lip-sync detection on the video segment containing the face to obtain a lip-sync video segment and a lip-unsynchronized video segment includes:
acquiring the video clip containing the human face;
performing lip-sync detection on the video clips containing the human faces to obtain lip-sync detection results;
and dividing the video segment containing the face into the lip-shaped synchronous video segment and the lip-shaped unsynchronized video segment according to the lip-shaped synchronous detection result.
Further, the step of performing voice enhancement on the lip-sync video segment to obtain a voice-enhanced video segment includes:
acquiring the lip-sync video clip;
and carrying out voice enhancement on the lip-shaped synchronous video clip by utilizing audio-video modal learning and visual lip information to obtain the voice enhanced video clip.
Further, the step of performing voice recognition on the voice-enhanced video segment, the lip unsynchronized video segment, and the unmanned face video segment to obtain a voice recognition result includes:
acquiring the voice enhancement video segment, the lip unsynchronized video segment and the unmanned face video segment;
performing sentence-level pause segmentation on the voice enhanced video segment, the lip unsynchronized video segment and the face-free video segment to obtain a segmentation result;
and performing sentence-by-sentence voice recognition according to the segmentation result to obtain the voice recognition result.
According to a second aspect of the embodiments of the present invention, a video segmentation system comprises the following modules:
the preprocessing module is used for acquiring a first video, extracting a first audio from the first video, and denoising the first audio to obtain a second audio;
the first segmentation module is used for analyzing the second audio by using a convolutional neural network to obtain a third audio and first time node information, and segmenting the first video according to the first time node information to obtain a first video segment; the first time node information comprises first start node information and first termination node information;
the second segmentation module is used for carrying out voice recognition on the third audio to obtain second time node information, and segmenting the first video segment according to the second time node information to obtain a second video segment; the second time node information comprises second starting node information and second terminating node information;
the face detection module is used for carrying out face detection on the second video segment to obtain a video segment containing a face and a video segment without the face;
the lip synchronization detection module is used for carrying out lip synchronization detection on the video segment containing the face to obtain a lip synchronization video segment and a lip non-synchronization video segment;
and the voice recognition module is used for performing voice enhancement on the lip-shaped synchronous video segment to obtain a voice-enhanced video segment, and performing voice recognition on the voice-enhanced video segment, the lip-shaped unsynchronized video segment and the unmanned face video segment to obtain a voice recognition result.
According to a third aspect of an embodiment of the present invention, a video segmentation apparatus includes:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described in the first aspect.
According to a fourth aspect of embodiments of the present invention, a computer-readable storage medium has stored therein a processor-executable program which, when executed by a processor, is configured to implement the method of the first aspect.
The invention has the beneficial effects that: by applying processing means such as denoising, neural network analysis, human voice recognition, face detection, lip synchronization detection and voice enhancement to the video, high-precision segmentation of the video can be realized automatically, which effectively guarantees the smoothness of the later video translation work.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the technical solutions of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method provided by an embodiment of the present invention;
FIG. 2 is a diagram of steps performed by an embodiment of the present invention;
FIG. 3 is a block diagram of a module connection provided by an embodiment of the present invention;
fig. 4 is a diagram of device connection provided by an embodiment of the present invention.
Detailed Description
The conception, the specific structure and the technical effects produced by the present invention will be clearly and completely described in conjunction with the embodiments and the attached drawings, so as to fully understand the objects, the schemes and the effects of the present invention.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of the invention and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The embodiment of the invention provides a video segmentation method, which can be applied to a terminal, a server and software running in the terminal or the server. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and a big data and artificial intelligence platform. Referring to fig. 1, the method includes the following steps S100 to S700:
s100, obtaining a first video, extracting a first audio from the first video, and denoising the first audio to obtain a second audio.
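By way of non-limiting illustration, step S100 may be sketched as follows. The sketch assumes the ffmpeg command-line tool for audio extraction and the Python noisereduce package as one possible denoiser; neither tool is prescribed by this embodiment, and any equivalent extraction or speech-enhancement method may be substituted.

```python
import subprocess

import noisereduce as nr
import soundfile as sf


def extract_and_denoise(video_path: str, wav_path: str, sr: int = 16000):
    """Step S100: extract the first audio from the first video and denoise it."""
    # Extract a mono 16 kHz WAV track from the video with ffmpeg.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", str(sr), wav_path],
        check=True,
    )
    first_audio, rate = sf.read(wav_path)
    # Spectral-gating noise reduction yields the (relatively clean) second audio.
    second_audio = nr.reduce_noise(y=first_audio, sr=rate)
    return second_audio, rate
```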
S200, analyzing the second audio by using a convolutional neural network to obtain a third audio and first time node information, and segmenting the first video according to the first time node information to obtain a first video segment; the first time node information includes first start node information and first termination node information.
Alternatively, step S200 may be implemented by the following sub-steps:
s201, acquiring a second audio, and framing the second audio to obtain a framing result;
s202, carrying out coarse-grained voice detection on the framing result by using a convolutional neural network, and extracting voice features;
s203, classifying and screening the voice features to obtain a third audio;
and S204, generating first time node information according to the position of the third audio in the first video.
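By way of non-limiting illustration, sub-steps S201 to S204 may be sketched as follows. The CNN architecture, the mel-spectrogram feature input and the 0.5 threshold below are illustrative assumptions only; the embodiment does not fix a particular network or feature set, and the classifier would of course have to be trained before use.

```python
import torch
import torch.nn as nn


class FrameVAD(nn.Module):
    """Illustrative frame-level CNN for coarse-grained voice detection (S202)."""

    def __init__(self, n_mels: int = 40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=1),
        )

    def forward(self, mel):                              # mel: (batch, n_mels, n_frames)
        return torch.sigmoid(self.net(mel)).squeeze(1)   # speech probability per frame


def frames_to_first_time_nodes(probs, hop_s: float, thresh: float = 0.5):
    """S203/S204: merge consecutive speech frames into (T1, T2) time node pairs.

    probs is a 1-D tensor of per-frame speech probabilities for one audio,
    e.g. model(mel)[0]; hop_s is the frame hop in seconds.
    """
    nodes, start = [], None
    for i, p in enumerate(probs.tolist()):
        if p >= thresh and start is None:
            start = i * hop_s                    # first start node information (T1)
        elif p < thresh and start is not None:
            nodes.append((start, i * hop_s))     # first termination node information (T2)
            start = None
    if start is not None:
        nodes.append((start, len(probs) * hop_s))
    return nodes
```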
S300, performing human voice recognition on the third audio to obtain second time node information, and segmenting the first video segment according to the second time node information to obtain a second video segment; the second time node information includes second start node information and second end node information.
Alternatively, step S300 may be implemented by the following sub-steps:
s301, acquiring a third audio;
s302, performing voice tracking and clustering analysis on the third audio to distinguish audio segments of different voices;
s303, generating second time node information according to the position of the audio clip in the first video;
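By way of non-limiting illustration, sub-step S303 may be sketched as follows, assuming that the cluster analysis of S302 has already produced one speaker label per audio segment together with that segment's start and end positions in the first video (the function and variable names are illustrative only):

```python
def labels_to_second_time_nodes(labels, seg_starts, seg_ends):
    """S303: turn per-segment speaker labels into second time node information,
    one (speaker, start, end) entry per run of segments with the same speaker."""
    nodes = []
    for spk, start, end in zip(labels, seg_starts, seg_ends):
        if nodes and nodes[-1][0] == spk and abs(nodes[-1][2] - start) < 1e-3:
            nodes[-1] = (spk, nodes[-1][1], end)   # extend the current speaker turn
        else:
            nodes.append((spk, start, end))        # open a new speaker turn
    return nodes
```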
s400, carrying out face detection on the second video segment to obtain a video segment containing a face and a video segment without the face.
Alternatively, step S400 may be implemented by the following sub-steps:
s401, acquiring a second video clip;
s402, carrying out face detection on the second video segment to obtain a face detection result;
and S403, dividing the second video segment into a video segment containing the human face and a video segment without the human face according to the human face detection result.
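By way of non-limiting illustration, sub-steps S401 to S403 may be sketched as follows. OpenCV's Haar-cascade frontal face detector is assumed here purely as an example; the embodiment does not prescribe a particular face detection algorithm.

```python
import cv2


def split_by_face(video_path: str, sample_every: int = 5):
    """S401-S403: mark sampled frames of the second video segment as face / no-face."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    labels, idx = [], 0                          # entries: (timestamp in seconds, has_face)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            labels.append((idx / fps, len(faces) > 0))
        idx += 1
    cap.release()
    return labels   # consecutive runs of True/False give the face and face-free segments
```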
S500, carrying out lip synchronization detection on the video segments containing the human faces to obtain lip synchronization video segments and lip non-synchronization video segments.
Alternatively, step S500 may be implemented by the following sub-steps:
s501, acquiring a video clip containing a human face;
s502, carrying out lip synchronization detection on the video clips containing the human faces to obtain lip synchronization detection results;
and S503, segmenting the video segment containing the face into a lip-shaped synchronous video segment and a lip-shaped unsynchronized video segment according to the lip-shaped synchronous detection result.
S600, performing voice enhancement on the lip-shaped synchronous video clip to obtain a voice-enhanced video clip, and performing voice recognition on the voice-enhanced video clip, the lip-shaped unsynchronized video clip and the face-free video clip to obtain a voice recognition result.
Alternatively, step S600 may be implemented by the following sub-steps:
s601, obtaining a lip synchronization video clip;
and S602, performing voice enhancement on the lip-shaped synchronous video clip by using audio and video mode learning and visual lip information to obtain a voice enhanced video clip.
Optionally, step S600 may also be implemented by the following sub-steps:
s611, acquiring a voice enhancement video segment, a lip unsynchronized video segment and an unmanned face video segment;
s612, carrying out sentence-level pause segmentation on the voice enhanced video segment, the lip unsynchronized video segment and the unmanned face video segment to obtain segmentation results;
and S613, performing sentence-by-sentence voice recognition according to the segmentation result to obtain a voice recognition result.
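By way of non-limiting illustration, the sentence-level pause segmentation of sub-step S612 may be sketched as follows. The webrtcvad package is assumed here as one possible fine-grained voice activity detector, and the pause length that counts as a sentence break is an illustrative parameter; the actual speech recognizer of S613 is not shown.

```python
import webrtcvad


def sentence_cut_points(pcm16: bytes, sr: int = 16000, frame_ms: int = 30,
                        min_pause_frames: int = 10):
    """S612: return cut points (in seconds) at pauses long enough to be sentence breaks."""
    vad = webrtcvad.Vad(3)                           # most aggressive VAD mode
    frame_bytes = int(sr * frame_ms / 1000) * 2      # 16-bit mono samples per frame
    flags = [vad.is_speech(pcm16[i:i + frame_bytes], sr)
             for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes)]
    cuts, silent = [], 0
    for i, is_speech in enumerate(flags):
        silent = 0 if is_speech else silent + 1
        if silent == min_pause_frames:               # pause long enough: sentence boundary
            cuts.append((i - min_pause_frames + 1) * frame_ms / 1000.0)
    return cuts
```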
By applying processing means such as denoising, neural network analysis, human voice recognition, face detection, lip synchronization detection and voice enhancement to the video, the invention can automatically realize high-precision segmentation of the video and effectively guarantees the smoothness of the later video translation work.
Referring to fig. 2, a flowchart of the steps performed according to an embodiment of the present invention is provided. The flow begins by acquiring a first video, namely the original video resource. A first audio is extracted from the first video and then denoised or speech-enhanced to suppress noise or background music in the original video, yielding a relatively clean human voice resource, namely the second audio. The second audio is processed with a sliding frame window, and coarse-grained voice activity detection is performed on the framed result with a convolutional neural network to obtain the start time point T1 and end time point T2 of the human voice, where T1 and T2 are the first start node information and first end node information in the first time node information; it should be noted that several voice sections may exist in the same video, and together they constitute the third audio. The first video is segmented at the first time nodes to obtain the first video segments. Human voice tracking and cluster analysis are then performed on the third audio to obtain the voice segments of different individuals, i.e., different speakers, and the corresponding second time node information, and the first video segments are segmented at the second time nodes to obtain the second video segments. Face detection is performed on the second video segments to separate the segments with and without face pictures; lip synchronization detection is performed on the segments containing a face, while the face-free segments are not processed further at this stage. A segment that meets the synchronization threshold is regarded as a lip-synchronized video segment, and one that does not is regarded as a lip-unsynchronized video segment; the two are separated, and the lip-unsynchronized segments are not processed further. Voice enhancement is applied to the lip-synchronized video segments to obtain voice-enhanced video segments. Fine-grained voice activity detection is then performed on the face-free video segments, the lip-unsynchronized video segments and the voice-enhanced video segments, and the voice is broken at the sentence level. Finally, voice recognition is performed, and all video segments are recombined according to the time points used for segmentation to form the video to be translated, namely the second video. The invention thus identifies and separates the segments that only require voice synthesis from those that also require lip synthesis, which reduces the later workload of translating the second video and ensures the quality of the video translation.
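By way of non-limiting illustration, the cutting of the first video at the obtained time nodes and the later recombination of the processed segments may be sketched with ffmpeg as follows (ffmpeg and stream copying are assumptions of this sketch, not requirements of the embodiment):

```python
import subprocess


def cut_segment(video_path: str, t1: float, t2: float, out_path: str) -> None:
    """Cut the first video between time nodes T1 and T2 (start/end node information)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ss", f"{t1:.3f}", "-to", f"{t2:.3f}",
         "-c", "copy", out_path],
        check=True,
    )


def recombine(segment_paths, list_file: str, out_path: str) -> None:
    """Reassemble the processed segments in their original time order (the second video)."""
    with open(list_file, "w") as f:
        for p in segment_paths:
            f.write(f"file '{p}'\n")
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", out_path],
        check=True,
    )
```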
In some preferred embodiments, the human voice tracking and cluster analysis applied to the third audio mainly adopts a supervised, RNN-based method. This method does not limit the number of speakers: a corresponding recurrent neural network model is built for each speaker on the basis of d-vector features, and its state is continuously updated. The voice segments are first framed with overlapping frames, and an unbounded interleaved-state recurrent neural network is used for modeling, with parameters shared among speakers. An unbounded number of speaker instances can be generated, interleaving different speakers in the time domain. The number of speakers is estimated automatically with a Bayesian non-parametric model, and the speakers are clustered using the temporal information carried by the recurrent neural network.
The specific implementation method comprises the following steps:
given a segment of speech, an embedded representation of the sequence of speech is obtained using an embedded extraction module: x = (X) 1 ,x 2 ,...,x T ) Wherein
Figure BDA0002807770790000071
T e (1,2,.., T). Each x t Are d-vector vectors corresponding to a certain segment in the original speech. And during model training, for each segmentation segment, a corresponding real label of speaker segmentation is provided, and Y = (Y) 1 ,y 2 ,...,y T ). Each y t Are all corresponding to x t The ID of the real speaker, where the ID is expressed in the order of speaker occurrence, such as the first occurring speaker 1, the second occurring speaker 2, and so on. Such as Y = (1,1,2,3,2) which means that the speech has five segments with three different speakers. The same number indicates that the segment belongs to the speaker.
The model belongs to a sequence generation model, and a sequence set (1,2, …, t) is defined as [ t ].
Figure BDA0002807770790000072
To model speaker changes, the above formula can be expressed as
p(X, Y, Z) = p(x_1, y_1) ∏_{t=2..T} p(x_t, y_t, z_t | x_[t-1], y_[t-1], z_[t-1]),
where Z = (z_2, z_3, ..., z_T) and z_t ∈ {0, 1}: 0 indicates no change of speaker and 1 indicates a change of speaker. If Y = (1, 1, 2, 3, 2), then Z = (0, 1, 1, 1); thus Y determines the value of Z. Expanding the formula gives
p(x_t, y_t, z_t | x_[t-1], y_[t-1], z_[t-1]) = p(x_t | x_[t-1], y_[t]) · p(y_t | z_t, y_[t-1]) · p(z_t | z_[t-1]),
where p(x_t | x_[t-1], y_[t]) models sequence generation, p(y_t | z_t, y_[t-1]) models speaker assignment, and p(z_t | z_[t-1]) models speaker change. With y_1 = 1 fixed, speaker assignment and speaker change are not modeled at the first step.
An unknown number of speakers is modeled implicitly using Bayesian non-parametric estimation, where z_t = 0 denotes no change of speaker and z_t = 1 denotes a change of speaker. Let
p(y_t = k | z_t = 1, y_[t-1]) ∝ N_{k,t-1},
p(y_t = K_{t-1} + 1 | z_t = 1, y_[t-1]) ∝ α,
where k indexes a speaker that has already appeared and K_{t-1} is the number of distinct speakers in y_[t-1]. For example, y_[5] = (1, 1, 2, 3, 2) can be divided into four blocks, (1,1) | (2) | (3) | (2), so that N_{1,4} = 1, N_{2,4} = 2, N_{3,4} = 1. N_{k,t-1} denotes the number of blocks occupied by speaker k when the ground-truth labels of the first t-1 segments are divided into blocks. The probability of switching back to a previous speaker is therefore proportional to the number of contiguous speech blocks spoken by that speaker, while the probability of switching to a new speaker is proportional to the constant α. The joint probability distribution of Y and Z is
p(Y, Z) = p(y_1) ∏_{t=2..T} p(y_t | z_t, y_[t-1]) · p(z_t | z_[t-1]).
To generate the sequence Y, it is modeled with a GRU recurrent neural network, where the hidden state h_t of the GRU is associated with speaker y_t and m_t = f(h_t | θ) is the output of the GRU network. Suppose the current label sequence is y_[5] = (1, 1, 2, 3, 2); the next label y_6 then has four possibilities: speaker 1, 2, 3, or a new speaker 4. The new label y_6 depends on the previously assigned label sequence y_[5] and on the observation sequence x_[6]. The hidden state is h_t = GRU(x_{t'}, h_{t'} | θ), where t' = max{0, s < t : y_s = y_t}, i.e., the most recent time step before t at which the current speaker y_t was speaking. Finally, an online decoding method performs a greedy search in time order to reduce the time complexity of searching the whole label space, a constant C limits the maximum number of speakers in each piece of speech, and beam search is used for decoding. Through this algorithm, the time nodes of the speakers can be separated out of a piece of speech.
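By way of non-limiting illustration, the online greedy decoding described above may be sketched as follows. For brevity the GRU observation model is replaced by a running-mean centroid per speaker scored with cosine similarity, so this is a greatly simplified stand-in rather than the full model; α and the maximum speaker number C keep their roles from the description.

```python
import numpy as np


def greedy_diarize(d_vectors, alpha: float = 1.0, max_speakers: int = 10):
    """Greedy online decoding: assign each d-vector to an existing speaker or a new one.
    Score = log(prior) + similarity, with prior ∝ N_k for old speakers and ∝ alpha for a new one."""
    labels, centroids, block_counts = [], [], []
    for x in d_vectors:
        x = np.asarray(x, dtype=float)
        scores = []
        for k, c in enumerate(centroids):
            sim = float(np.dot(x, c) / (np.linalg.norm(x) * np.linalg.norm(c) + 1e-8))
            scores.append(np.log(block_counts[k]) + sim)
        if len(centroids) < max_speakers:            # option: open a new speaker (limit C)
            scores.append(np.log(alpha) + 0.0)
        k_best = int(np.argmax(scores))
        if k_best == len(centroids):                 # new speaker chosen
            centroids.append(x.copy())
            block_counts.append(1)
        else:                                        # existing speaker: update its state
            centroids[k_best] = 0.9 * centroids[k_best] + 0.1 * x
            block_counts[k_best] += 1
        labels.append(k_best + 1)                    # speaker IDs start at 1
    return labels
```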
In some preferred embodiments, lip synchronization detection of the segments containing a human face mainly uses a SyncNet (lip synchronization detector) neural network: the features of the lips obtained from face detection and of the current voice are extracted and compared for similarity to judge whether the current voice belongs to the current face, so that during later video synthesis, lip conversion matched to the target language is performed only on the synchronized video segments, while the other segments only require voice synthesis. Specifically, a spectrogram of a given piece of speech is obtained by short-time Fourier transform; 0.2 seconds of speech and the corresponding lip image from the video are fed into two independent encoders, each encoding its input into a 256-dimensional vector. The encoders are CNN architectures whose purpose is to reduce and compress the feature dimensions, thereby extracting speech and lip features respectively, and the similarity of the two 256-dimensional vectors is then computed. The training goal is that the embeddings produced by the audio and video encoders are close for matching pairs and far apart for mismatched pairs. With this network, the face that matches the voice in the video can be found, so that the correct face is replaced in the later lip synthesis; the result can also be used to verify whether the speaker segmentation is accurate.
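By way of non-limiting illustration, the two-stream structure described above may be sketched as follows. The layer sizes are illustrative and do not reproduce the published SyncNet weights; only the overall idea taken from the description is shown, namely two independent encoders producing 256-dimensional vectors whose similarity serves as the lip-sync score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Encode a 0.2 s speech spectrogram into a 256-dim embedding (illustrative layers)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, 256)

    def forward(self, spec):                        # spec: (B, 1, freq, time)
        return self.fc(self.conv(spec).flatten(1))  # (B, 256)


class LipEncoder(nn.Module):
    """Encode a stack of lip frames into a 256-dim embedding (illustrative layers)."""

    def __init__(self, n_frames: int = 5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3 * n_frames, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, 256)

    def forward(self, lips):                        # lips: (B, 3*n_frames, H, W)
        return self.fc(self.conv(lips).flatten(1))  # (B, 256)


def sync_score(audio_emb, lip_emb):
    """Cosine similarity used as the lip-sync score; compare it against a threshold."""
    return F.cosine_similarity(audio_emb, lip_emb, dim=1)
```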
In some preferred embodiments, audio-video multi-modal learning is applied to the lip-synchronized video segments, and visual lip information is used to enhance the voice information. Specifically, an encoder-decoder neural network architecture is adopted whose inputs are a sequence of lip images of a person and a noisy speech spectrogram and whose output is an enhanced speech spectrogram. The input lip image sequence is compressed into a vector by a multilayer convolutional neural network, and the spectrogram is compressed into a vector by a similar convolutional network. The two vectors are fused by attention weighting, and the fused vector is then reconstructed by deconvolution to obtain the denoised, enhanced speech spectrogram.
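By way of non-limiting illustration, the encoder-decoder with attention-weighted fusion described above may be sketched as follows. The 64×64 spectrogram size, the stack of five RGB lip frames and all layer widths are assumptions of the sketch, not parameters given by the embodiment.

```python
import torch
import torch.nn as nn


class AVEnhancer(nn.Module):
    """Encoder-decoder sketch: fuse a lip-image sequence with a noisy spectrogram
    by attention weighting, then deconvolve back to an enhanced spectrogram."""

    def __init__(self):
        super().__init__()
        self.spec_enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 256),
        )
        self.lip_enc = nn.Sequential(
            nn.Conv2d(15, 32, 4, 2, 1), nn.ReLU(),       # 5 RGB lip frames stacked
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 256),
        )
        self.attn = nn.Linear(512, 2)                    # weights for the two modalities
        self.fc = nn.Linear(256, 64 * 16 * 16)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, 2, 1),
        )

    def forward(self, noisy_spec, lips):                 # (B,1,64,64), (B,15,H,W)
        a = self.spec_enc(noisy_spec)                    # (B, 256) audio vector
        v = self.lip_enc(lips)                           # (B, 256) visual vector
        w = torch.softmax(self.attn(torch.cat([a, v], dim=1)), dim=1)  # (B, 2)
        fused = w[:, :1] * a + w[:, 1:] * v              # attention-weighted fusion
        x = self.fc(fused).view(-1, 64, 16, 16)
        return self.dec(x)                               # enhanced spectrogram (B,1,64,64)
```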
In some preferred embodiments, after sentence-level pause segmentation, the following display results can be obtained:
Speaker 1  Time: 00:00:00,000 --> 00:00:02,640
hi, Dan, what are u doing?
Speaker 2  Time: 00:00:02,640 --> 00:00:07,910
I am playing Warcraft.
Speaker 1  Time: 00:00:07,910 --> 00:00:11,390
Where is Tommy?
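By way of non-limiting illustration, the timestamp format shown above can be produced as follows (the helper names are illustrative only):

```python
def fmt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, as in the display results above."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def fmt_result(speaker: int, start: float, end: float, text: str) -> str:
    """One display entry: speaker label, time range, recognized sentence."""
    return f"Speaker {speaker}  Time: {fmt_timestamp(start)} --> {fmt_timestamp(end)}\n{text}"
```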
Referring to fig. 3, a module connection diagram provided according to an embodiment of the present invention is shown, including the following modules:
the preprocessing module 301 is configured to obtain a first video, extract a first audio from the first video, and denoise the first audio to obtain a second audio;
the first segmentation module 302 is connected with the preprocessing module 301 to realize interaction, and is used for analyzing the second audio by using a convolutional neural network to obtain third audio and first time node information, and segmenting the first video according to the first time node information to obtain a first video segment; the first time node information includes first start node information and first end node information;
the second segmentation module 303 is connected with the first segmentation module 302 to realize interaction, and is configured to perform voice recognition on the third audio to obtain second time node information, and segment the first video segment according to the second time node information to obtain a second video segment; the second time node information comprises second starting node information and second terminating node information;
the face detection module 304 is connected with the second segmentation module 303 to realize interaction, and is used for performing face detection on the second video segment to obtain a video segment containing a face and a video segment without the face;
the lip-shaped synchronous detection module 305 is connected with the face detection module 304 to realize interaction, and is used for performing lip-shaped synchronous detection on the video segment containing the face to obtain a lip-shaped synchronous video segment and a lip-shaped unsynchronized video segment;
the voice recognition module 306 is respectively connected with the face detection module 304 and the lip synchronization detection module 305 to realize interaction, and is used for performing voice enhancement on the lip synchronization video segment to obtain a voice enhanced video segment, and performing voice recognition on the voice enhanced video segment, the lip non-synchronization video segment and the unmanned face video segment to obtain a voice recognition result;
referring to fig. 4, the present invention also provides an apparatus comprising:
at least one processor 401;
at least one memory 402 for storing at least one program;
when the at least one program is executed by the at least one processor 401, the at least one processor 401 is caused to perform the method as shown in fig. 1.
The contents in the method embodiment shown in fig. 1 are all applicable to the apparatus embodiment, the functions implemented in the apparatus embodiment are the same as those in the method embodiment shown in fig. 1, and the beneficial effects achieved by the apparatus embodiment are also the same as those achieved by the method embodiment shown in fig. 1.
Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
The contents in the method embodiment shown in fig. 1 are all applicable to the present storage medium embodiment, the functions implemented by the present storage medium embodiment are the same as those in the method embodiment shown in fig. 1, and the advantageous effects achieved by the present storage medium embodiment are also the same as those achieved by the method embodiment shown in fig. 1.
It will be understood that all or some of the steps and systems of the methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media, as known to those skilled in the art.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (8)

1. A method of video segmentation, comprising the steps of:
acquiring a first video, extracting a first audio from the first video, and denoising the first audio to obtain a second audio;
analyzing the second audio by using a convolutional neural network to obtain third audio and first time node information, and segmenting the first video according to the first time node information to obtain a first video segment; the first time node information comprises first start node information and first termination node information;
carrying out voice recognition on the third audio to obtain second time node information, and segmenting the first video segment according to the second time node information to obtain a second video segment; the second time node information comprises second starting node information and second terminating node information;
performing face detection on the second video segment to obtain a video segment containing a face and a video segment without the face;
performing lip synchronization detection on the video clips containing the human faces to obtain lip synchronization video clips and lip non-synchronization video clips;
performing voice enhancement on the lip-shaped synchronous video segment to obtain a voice enhanced video segment, and performing voice recognition on the voice enhanced video segment, the lip-shaped unsynchronized video segment and the face-free video segment to obtain a voice recognition result;
the step of analyzing the second audio by using a convolutional neural network to obtain a third audio and a first time node includes:
acquiring the second audio, and framing the second audio to obtain a framing result;
performing coarse-grained voice detection on the framing result by utilizing the convolutional neural network, and extracting voice characteristics;
classifying and screening the voice features to obtain the third audio;
generating the first time node information according to the position of the third audio in the first video;
the step of performing voice recognition on the third audio to obtain second time node information includes:
acquiring the third audio;
carrying out voice tracking and clustering analysis on the third audio to distinguish audio segments of different voices;
and generating the second time node information according to the position of the audio clip in the first video.
2. The video segmentation method according to claim 1, wherein the step of performing face detection on the second video segment to obtain a video segment containing a face and a video segment without a face comprises:
acquiring the second video clip;
carrying out face detection on the second video segment to obtain a face detection result;
and segmenting the second video segment into the video segment containing the human face and the video segment without the human face according to the human face detection result.
3. The video segmentation method according to claim 1, wherein the step of performing lip-sync detection on the video segment containing the face to obtain a lip-sync video segment and a lip-unsynchronized video segment includes:
acquiring the video clip containing the human face;
performing lip synchronization detection on the video clips containing the human faces to obtain lip synchronization detection results;
and segmenting the video segment containing the face into the lip-shaped synchronous video segment and the lip-shaped unsynchronized video segment according to the lip-shaped synchronous detection result.
4. The video segmentation method as set forth in claim 1, wherein the step of performing speech enhancement on the lip-sync video segment to obtain a speech-enhanced video segment comprises:
acquiring the lip-sync video clip;
and carrying out voice enhancement on the lip-shaped synchronous video clip by utilizing audio-video modal learning and visual lip information to obtain the voice enhanced video clip.
5. The video segmentation method according to claim 1, wherein the step of performing speech recognition on the speech-enhanced video segment, the lip-unsynchronized video segment, and the unmanned video segment to obtain speech recognition results comprises:
acquiring the voice enhancement video segment, the lip unsynchronized video segment and the unmanned face video segment;
performing sentence-level pause segmentation on the voice enhanced video segment, the lip unsynchronized video segment and the face-free video segment to obtain a segmentation result;
and performing sentence-by-sentence voice recognition according to the segmentation result to obtain the voice recognition result.
6. A video segmentation system comprising the following modules:
the preprocessing module is used for acquiring a first video, extracting a first audio from the first video, and denoising the first audio to obtain a second audio;
the first segmentation module is used for analyzing the second audio by using a convolutional neural network to obtain a third audio and first time node information, and segmenting the first video according to the first time node information to obtain a first video segment; the first time node information comprises first start node information and first end node information;
the second segmentation module is used for carrying out voice recognition on the third audio to obtain second time node information, and segmenting the first video segment according to the second time node information to obtain a second video segment; the second time node information comprises second starting node information and second terminating node information;
the face detection module is used for carrying out face detection on the second video segment to obtain a video segment containing a face and a video segment without the face;
the lip synchronization detection module is used for carrying out lip synchronization detection on the video segment containing the face to obtain a lip synchronization video segment and a lip non-synchronization video segment;
the voice recognition module is used for performing voice enhancement on the lip-shaped synchronous video segment to obtain a voice enhanced video segment, and performing voice recognition on the voice enhanced video segment, the lip-shaped unsynchronized video segment and the unmanned face video segment to obtain a voice recognition result;
the step of analyzing the second audio by using a convolutional neural network to obtain a third audio and a first time node includes:
acquiring the second audio, and framing the second audio to obtain a framing result;
performing coarse-grained voice detection on the framing result by utilizing the convolutional neural network, and extracting voice characteristics;
classifying and screening the voice features to obtain the third audio;
generating the first time node information according to the position of the third audio in the first video;
the step of performing voice recognition on the third audio to obtain second time node information includes:
acquiring the third audio;
carrying out voice tracking and clustering analysis on the third audio to distinguish audio segments of different voices;
and generating the second time node information according to the position of the audio clip in the first video.
7. A video segmentation apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-5.
8. A computer-readable storage medium, in which a program executable by a processor is stored, characterized in that the program executable by the processor is adapted to implement the method according to any one of claims 1-5 when executed by the processor.
CN202011374280.4A 2020-11-30 2020-11-30 Video segmentation method, system, device and storage medium Active CN112565885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011374280.4A CN112565885B (en) 2020-11-30 2020-11-30 Video segmentation method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011374280.4A CN112565885B (en) 2020-11-30 2020-11-30 Video segmentation method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN112565885A CN112565885A (en) 2021-03-26
CN112565885B (en) 2023-01-06

Family

ID=75045385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011374280.4A Active CN112565885B (en) 2020-11-30 2020-11-30 Video segmentation method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN112565885B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362832A (en) * 2021-05-31 2021-09-07 多益网络有限公司 Naming method and related device for audio and video characters
CN114299944B (en) * 2021-12-08 2023-03-24 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium
CN114299953B (en) * 2021-12-29 2022-08-23 湖北微模式科技发展有限公司 Speaker role distinguishing method and system combining mouth movement analysis
CN114282621B (en) * 2021-12-29 2022-08-23 湖北微模式科技发展有限公司 Multi-mode fused speaker role distinguishing method and system
CN116781856A (en) * 2023-07-12 2023-09-19 深圳市艾姆诗电商股份有限公司 Audio-visual conversion control method, system and storage medium based on deep learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1949879A (en) * 2005-10-11 2007-04-18 华为技术有限公司 Lip synchronous method for multimedia real-time transmission in packet network and apparatus thereof
US9548048B1 (en) * 2015-06-19 2017-01-17 Amazon Technologies, Inc. On-the-fly speech learning and computer model generation using audio-visual synchronization
CN107333071A (en) * 2017-06-30 2017-11-07 北京金山安全软件有限公司 Video processing method and device, electronic equipment and storage medium
CN107945789A (en) * 2017-12-28 2018-04-20 努比亚技术有限公司 Audio recognition method, device and computer-readable recording medium
CN109005451A (en) * 2018-06-29 2018-12-14 杭州星犀科技有限公司 Video demolition method based on deep learning
CN109389085A (en) * 2018-10-09 2019-02-26 清华大学 Lip reading identification model training method and device based on parametric curve
CN111556254A (en) * 2020-04-10 2020-08-18 早安科技(广州)有限公司 Method, system, medium and intelligent device for video cutting by using video content
CN111641790A (en) * 2020-05-29 2020-09-08 三维六度(北京)文化有限公司 Method, device and system for film and television production and distribution

Also Published As

Publication number Publication date
CN112565885A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN112565885B (en) Video segmentation method, system, device and storage medium
CN112562721B (en) Video translation method, system, device and storage medium
US10621991B2 (en) Joint neural network for speaker recognition
Chung et al. Learning to lip read words by watching videos
CN108307229B (en) Video and audio data processing method and device
US7636662B2 (en) System and method for audio-visual content synthesis
CN112823380A (en) Matching mouth shapes and actions in digital video with substitute audio
CN112866586B (en) Video synthesis method, device, equipment and storage medium
US11057457B2 (en) Television key phrase detection
US20220215830A1 (en) System and method for lip-syncing a face to target speech using a machine learning model
Halperin et al. Dynamic temporal alignment of speech to lips
US20210390945A1 (en) Text-driven video synthesis with phonetic dictionary
Ivanko et al. Multimodal speech recognition: increasing accuracy using high speed video data
Tapu et al. DEEP-HEAR: A multimodal subtitle positioning system dedicated to deaf and hearing-impaired people
Feng et al. Self-supervised video forensics by audio-visual anomaly detection
CN114022668B (en) Method, device, equipment and medium for aligning text with voice
Schwiebert et al. A multimodal German dataset for automatic lip reading systems and transfer learning
JP2018005011A (en) Presentation support device, presentation support system, presentation support method and presentation support program
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
Mocanu et al. Active speaker recognition using cross attention audio-video fusion
US20140368700A1 (en) Example-based cross-modal denoising
EP3839953A1 (en) Automatic caption synchronization and positioning
Jha et al. Cross-language speech dependent lip-synchronization
CN113312928A (en) Text translation method and device, electronic equipment and storage medium
US20230362451A1 (en) Generation of closed captions based on various visual and non-visual elements in content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant