CN110351591A - Method, apparatus, device and storage medium for calibrating a voice signal - Google Patents
Method, apparatus, device and storage medium for calibrating a voice signal
- Publication number
- CN110351591A CN110351591A CN201910502478.7A CN201910502478A CN110351591A CN 110351591 A CN110351591 A CN 110351591A CN 201910502478 A CN201910502478 A CN 201910502478A CN 110351591 A CN110351591 A CN 110351591A
- Authority
- CN
- China
- Prior art keywords
- target person
- voice signal
- lip
- image
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4398—Processing of audio elementary streams involving reformatting operations of audio signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440245—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
Abstract
This application relates to the field of audio-video processing and discloses a method, apparatus, device and storage medium for calibrating a voice signal using voice processing technology. In the method, the voice signal of a target person is extracted from an audio/video stream; images containing the lips of the target person are successively extracted, in playback-time order, from the video frames of the audio/video stream that contain the target person's lips; and the voice signal of the target person is calibrated according to the extracted lip images to obtain the calibrated voice signal of the target person. The voice signal of the target person is thereby synchronized in playback time with the images containing the target person's lips, i.e., at any playback moment the speech content expressed by the lips is identical to the content expressed by the voice signal.
Description
Technical field
This application relates to the field of audio-video processing, and in particular to a method, apparatus, device and storage medium for calibrating a voice signal.
Background technique
Audio-video synchronization technology is widely applied in daily life. Its main application scenarios include security verification, synchronization of non-live television broadcasts, pre-processing for adding video subtitles in post-production, and automatic audio-video synchronization in film and animation.
An audio-visual synchronizer (Audio Video Synchronizer) can synchronize the audio and video of a film. Audio and video can be out of sync in two ways: compared with the lip shapes or subtitles in the picture, the sound either runs ahead of the image or lags behind it. Out-of-sync audio and video therefore conflicts with people's perception and with everyday common sense.
A common method of guaranteeing audio-visual synchronization is to add a reference clock and use it to keep the voice signal synchronized with the video signal.
Summary of the invention
This application provides a method, apparatus, device and storage medium for calibrating a voice signal, which ensure that the audio-video synchronization result better accords with everyday common sense.
In a first aspect, this application provides a method for calibrating a voice signal, the method comprising:
extracting the voice signal of a target person from an audio/video stream;
successively extracting, in playback-time order, images containing the lips of the target person from the video frames of the audio/video stream that contain the target person's lips; and
calibrating the voice signal of the target person according to the extracted lip images, to obtain the calibrated voice signal of the target person.
In a second aspect, this application further provides an apparatus for calibrating a voice signal, the apparatus comprising:
an extraction unit for extracting the voice signal of a target person from an audio/video stream, and for successively extracting, in playback-time order, images containing the lips of the target person from the video frames of the audio/video stream that contain the target person's lips; and
a calibration unit for calibrating the voice signal of the target person according to the extracted lip images, to obtain the calibrated voice signal of the target person.
In a third aspect, this application further provides a computer device comprising a memory and a processor; the memory stores a computer program, and the processor executes the computer program so as to implement the above method of calibrating a voice signal.
In a fourth aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the above method of calibrating a voice signal.
This application discloses a method, apparatus, device and storage medium for calibrating a voice signal. In the method, the voice signal of a target person is extracted from an audio/video stream; images containing the lips of the target person are successively extracted, in playback-time order, from the video frames of the audio/video stream that contain the target person's lips; and the voice signal of the target person is calibrated according to the extracted lip images to obtain the calibrated voice signal. The voice signal of the target person is thus synchronized in playback time with the images containing the target person's lips, i.e., at any playback moment the speech content expressed by the lips is identical to the content expressed by the voice signal.
Detailed description of the invention
To illustrate the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flow diagram of the steps of calibrating a voice signal provided by an embodiment of this application;
Fig. 2 is a schematic diagram of the voice signal of the target person arranged in time order;
Fig. 3 is a schematic diagram of the images containing the lips of the target person arranged in time order;
Fig. 4 is a schematic block diagram of an apparatus for calibrating a voice signal provided by an embodiment of this application;
Fig. 5 is a schematic structural block diagram of a computer device provided by an embodiment of this application.
Specific embodiment
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art from the embodiments of this application without creative effort fall within the protection scope of this application.
The flow charts shown in the drawings are only illustrative; they need not include all contents and operations/steps, nor must the operations/steps be executed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation.
The embodiments of this application provide a method, apparatus, computer device and storage medium for calibrating a voice signal. The method of calibrating a voice signal can be used to synchronize, in playback time, the voice signal of a target person with the images containing the target person's lips, i.e., at any playback moment the speech content expressed by the lips is identical to the content expressed by the voice signal.
Some embodiments of this application are described in detail below with reference to the drawings. In the absence of conflict, the following embodiments and the features in them may be combined with one another.
Referring to Fig. 1, Fig. 1 is a schematic flow diagram of the steps of the method of calibrating a voice signal provided by an embodiment of this application.
Step S101: extract the voice signal of a target person from an audio/video stream.
In this embodiment, the audio/video stream records a conversation scene of one or more persons, and is therefore composed of multiple audio/video frames arranged in playback-time order.
In an optional implementation, the audio stream in the audio/video stream contains the voice signals of the persons, and the video stream in the audio/video stream contains video frames of the persons. This embodiment can therefore extract a person's voice signal from the audio/video stream, and can also, in playback-time order, successively extract images containing a person's lips from the video frames of the audio/video stream that contain those lips.
Optionally, the audio/video stream may be obtained or downloaded from a network server, or obtained locally. For example, news videos can be downloaded from network servers such as CCTV, BBC and YouTube.
Optionally, the method can be applied to video clips within a designated time period. For example, a segment of audio-video may be extracted from a news video every ten minutes as the audio/video stream to which the method is applied.
In this embodiment, in step S101, the voice signal of the target person is extracted from the audio/video stream. The voice signal of the target person records the target person's speech.
Optionally, step S101 can be implemented as follows: extract the voice signal of the target person from the audio/video stream using a SyncNet convolutional neural network model. Training the SyncNet convolutional neural network model on the audio/video stream makes it possible to extract the voice signal of the target person from the stream; similarly, the voice signals of other persons can also be extracted from the audio/video stream.
Optionally, step S101 can be realized by sub-steps S1011, S1012 and S1013.
Step S1011: from the audio/video stream, extract a partial voice signal of the target person from noise-free video clips in which only the target person is present.
A convolutional neural network (such as a SyncNet convolutional neural network model) is trained on the audio/video stream to separate out the voiced video streams containing multiple persons, so that from the remaining voiced video streams, the voiced stream containing only the target person is filtered out; noise-free video clips are then filtered out of that stream, and the partial voice signal of the target person is extracted from these clips. Note that the voiced video streams containing multiple other persons also contain other portions of the target person's voice signal.
In addition, from the noise-free video clips filtered out of the target-person-only voiced stream, the images containing the target person's lips are extracted.
Optionally, the region recording the features of the target person's lips is extracted from the video clip; the extracted region is the image containing the target person's lips.
It should be understood that each video frame in the video clip has one image containing the target person's lips.
Step S1012: generate a time-frequency mask based on the extracted partial voice signal and the images containing the target person's lips in the video clips.
For example, the partial voice signal of the target person extracted in step S1011 is encoded to obtain an audio code stream; correspondingly, the images containing the target person's lips are encoded to obtain a video code stream; the audio code stream and the video code stream are then merged to obtain the time-frequency mask.
Step S1013: extract the voice signal of the target person from the audio/video stream based on the time-frequency mask.
For example, using the time-frequency mask, the voice signal of the target person is extracted from the audio/video stream by means of the SyncNet convolutional neural network model; the extracted voice signal of the target person is shown in Fig. 2. The extracted voice signal includes the partial voice signal of the target person in the voiced stream containing only the target person, as well as the partial voice signal of the target person in voiced streams where multiple persons are present simultaneously.
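As an illustrative sketch of how a time-frequency mask isolates one speaker's energy from a mixture, assuming the mask is a per-bin weight in [0, 1] applied elementwise to a spectrogram (the array shapes, values, and function name are assumptions for illustration, not taken from this application):

```python
import numpy as np

def apply_tf_mask(mixture_spec: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply a time-frequency mask (values in [0, 1]) to a mixture
    spectrogram to keep only the target speaker's energy."""
    assert mixture_spec.shape == mask.shape
    return mixture_spec * mask

# Toy spectrogram: 4 frequency bins x 3 time frames.
mixture = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0],
                    [7.0, 8.0, 9.0],
                    [1.0, 1.0, 1.0]])
# A binary mask keeping only the two lowest frequency bins.
mask = np.array([[1.0, 1.0, 1.0],
                 [1.0, 1.0, 1.0],
                 [0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0]])
target = apply_tf_mask(mixture, mask)
```

In practice the mask would be predicted by the network from the merged audio and video code streams; here it is hand-written only to show the masking step itself.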
Step S102: in playback-time order, successively extract images containing the lips of the target person from the video frames of the audio/video stream that contain the target person's lips.
The audio/video stream includes video frames containing the lips of the target person. In this embodiment, according to the playback-time order of the audio/video stream, images containing the target person's lips are successively extracted from all the video frames that contain them.
For example, as shown in Fig. 3, images containing the target person's lips are successively extracted, according to a preset region, from all the video frames containing the target person's lips, each extracted image having the size of the preset region. The images containing the target person's lips shown in Fig. 3 correspond one-to-one with the voice signal of the target person shown in Fig. 2.
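A minimal sketch of extracting fixed-size lip images from video frames using a preset region, as step S102 describes; the frame sizes, region coordinates, and function name are illustrative assumptions:

```python
import numpy as np

def crop_lip_region(frame: np.ndarray, box: tuple) -> np.ndarray:
    """Crop a fixed preset region (top, left, height, width) from a
    video frame so that every extracted lip image has the same size."""
    top, left, h, w = box
    return frame[top:top + h, left:left + w]

# Three toy grayscale frames (120x160), filled with their frame index.
frames = [np.full((120, 160), i, dtype=np.uint8) for i in range(3)]
preset_box = (80, 60, 24, 40)  # assumed lip region within each frame
lip_images = [crop_lip_region(f, preset_box) for f in frames]
```

Iterating the frames in playback-time order preserves the one-to-one correspondence with the audio samples that the calibration in step S103 relies on.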
Step S103: calibrate the voice signal of the target person according to the extracted images containing the target person's lips, to obtain the calibrated voice signal of the target person.
In this embodiment, the voice signal of the target person extracted in step S101 and the images containing the target person's lips extracted in step S102 correspond one-to-one according to the playback-time order of the audio/video stream. Therefore, if a deviation arises with respect to the playback-time order, the playback times of the voice signal and of the lip images are calibrated so that the speech content expressed by the lips is identical to the content expressed by the voice signal, thereby synchronizing the voice signal of the target person with the images containing the target person's lips in playback time.
In an optional implementation of step S103, the voice signal of the target person is dynamically stretched/shrunk to align it, in time order, with the extracted images containing the target person's lips.
Specifically, when playing the voice signal of the target person extracted in step S101 and the lip images extracted in step S102, if a deviation arises with respect to the playback-time order, the voice signal of the target person is stretched/shrunk to ensure that, at the same playback moment, the speech content expressed by the lips is identical to the content expressed by the voice signal, thereby synchronizing the voice signal of the target person with the images containing the target person's lips in playback time.
For example, when playing the voice signal and the lip images, if at some playback moment the speech content expressed by the lips differs from the content expressed by the voice signal, and the voice signal plays ahead of the lip images, then the voice signal of the target person is stretched from that playback moment onward. If at another playback moment the speech content expressed by the lips differs from the content expressed by the voice signal, and the voice signal lags behind the lip images, then the voice signal of the target person is shrunk from that playback moment onward. In this way, the voice signal of the target person and the images containing the target person's lips are kept synchronized in playback time.
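The stretch/shrink operation above can be sketched as naive linear resampling of an audio segment; a real system would more likely use a pitch-preserving method (e.g. WSOLA or a phase vocoder), and the function name and factor values here are assumptions:

```python
import numpy as np

def stretch_segment(samples: np.ndarray, factor: float) -> np.ndarray:
    """Stretch (factor > 1) or shrink (factor < 1) an audio segment by
    linear interpolation. This is a didactic sketch only; it changes
    pitch, which production time-scale modification would avoid."""
    n_out = max(1, int(round(len(samples) * factor)))
    old_idx = np.linspace(0, len(samples) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(samples)), samples)

seg = np.array([0.0, 1.0, 2.0, 3.0])
stretched = stretch_segment(seg, 2.0)  # audio was ahead: slow it down
shrunk = stretch_segment(seg, 0.5)     # audio lagged: speed it up
```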
Optionally, a dynamic time warping (DTW) algorithm is used to control the dynamic stretching/shrinking of the voice signal of the target person, so as to align it, in time order, with the extracted images containing the target person's lips.
Specifically, an AV matrix is trained with the voice signal of the target person extracted in step S101 and the lip images extracted in step S102 as input parameters. Based on the AV features obtained from the trained AV matrix, the DTW algorithm then controls the dynamic stretching/shrinking of the voice signal of the target person, ensuring that the voice signal of the target person and the images containing the target person's lips are synchronized in playback time.
During training, if the voice signal of the target person plays ahead of the lip images, the detectability threshold for audio/lip asynchrony is +45 milliseconds (ms); if the voice signal lags behind the lip images, the detectability threshold is -125 milliseconds (ms).
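The asymmetric detectability thresholds above (+45 ms when the audio leads, -125 ms when it lags) can be expressed as a simple acceptance check; the sign convention (positive offset = audio plays ahead of the lip images) follows the text, while the function name is an assumption:

```python
def lip_sync_acceptable(offset_ms: float) -> bool:
    """Return True when the audio/lip offset falls inside the stated
    detectability window: viewers tolerate audio leading the lips by
    up to 45 ms, or lagging them by up to 125 ms."""
    return -125.0 <= offset_ms <= 45.0

# Audio 40 ms early is tolerable; 60 ms early is not.
print(lip_sync_acceptable(40.0), lip_sync_acceptable(60.0))
```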
Optionally, if the voice signal of the target person extracted in step S101 is expressed as a sequence A = (a1, ..., aN) and the images containing the target person's lips extracted in step S102 are expressed as a sequence B = (b1, ..., bM), then by solving for the highest similarity between sequence A and sequence B, the voice signal of the target person and the lip images can be controlled to play simultaneously according to the solution. For example, the highest similarity between sequence A and sequence B is computed using Dijkstra's algorithm, and a cost matrix C is constructed from the solution. The cost matrix C is subsequently used as the pairwise dot product of the video embedding matrix, so that the weight distribution in the matrix controls the length of the matching delay and achieves a match within the error range.
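The sequence-alignment idea above — finding the best correspondence between the audio sequence A and the lip-image sequence B from an accumulated cost matrix — can be sketched with the classic DTW recurrence. The 1-D features and absolute-difference cost here are simplifying assumptions; the application would use learned AV features:

```python
import numpy as np

def dtw_cost(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Fill the DTW accumulated-cost matrix for two 1-D feature
    sequences. The warping path implied by this matrix indicates how
    to stretch or shrink the audio to match the lip-image sequence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[1:, 1:]

audio_feat = np.array([1.0, 2.0, 3.0, 4.0])  # sequence A (toy values)
lip_feat = np.array([1.0, 3.0, 4.0])         # sequence B (toy values)
D = dtw_cost(audio_feat, lip_feat)
total = D[-1, -1]  # minimal total alignment cost
```

A low total cost means the two sequences align well; backtracking through D recovers which audio frames map to which lip images.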
In another optional implementation, step S103 is realized by steps S1031 and S1032.
Step S1031: in time order, based on the extracted images containing the target person's lips, successively calculate the synchronization error between the voice signal of the target person and the target person's lip movements.
A SyncNet convolutional neural network model is trained in advance on synchronized template voice signals and lip images. After training is complete, the model parameters of the trained SyncNet model are used, in time order, to calculate the synchronization error between the voice signal of the target person extracted in step S101 and the images containing the target person's lips extracted in step S102.
In an optional implementation of step S1031, the SyncNet convolutional neural network model is used to calculate the synchronization error between the voice signal of the target person and the target person's lip movements.
Step S1032: according to the calculated synchronization error, control the dynamic stretching/shrinking of the voice signal of the target person.
For example, according to the calculated synchronization error, a dynamic time warping (DTW) algorithm controls the dynamic stretching/shrinking of the voice signal of the target person.
Optionally, as a measure of synchronization error, a contrastive loss function is provided. This contrastive loss function is given by formula (1):
E = (1/(2N)) Σ (n=1..N) [ yn·dn² + (1 − yn)·max(margin − dn, 0)² ] (1)
where dn is the deviation between the voice signal of the target person extracted in step S101 and the image containing the target person's lips extracted in step S102, and yn is the binary similarity measure between the voice signal extracted in step S101 and the lip image extracted in step S102.
The calculation of dn is given by formula (2):
dn = ||vn − an||2 (2)
where vn is the fully-connected fc7 vector of the voice signal of the target person extracted in step S101, and an is the fully-connected fc7 vector of the image containing the target person's lips extracted in step S102.
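A sketch of this contrastive synchronization loss over toy fc7-style embeddings, assuming the standard form yn·dn² + (1 − yn)·max(margin − dn, 0)², averaged over pairs and halved; the margin value, embedding dimension, and sample values are assumptions:

```python
import numpy as np

def contrastive_loss(v: np.ndarray, a: np.ndarray,
                     y: np.ndarray, margin: float = 1.0) -> float:
    """Contrastive loss over paired embeddings: dn = ||vn - an||.
    Synchronized pairs (yn = 1) are pulled together; unsynchronized
    pairs (yn = 0) are pushed apart up to the margin."""
    d = np.linalg.norm(v - a, axis=1)
    per_pair = y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2
    return float(per_pair.mean() / 2.0)

# Two toy pairs: the first synchronized (y=1), the second not (y=0).
v = np.array([[0.0, 0.0], [0.5, 0.0]])
a = np.array([[0.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 0.0])
loss = contrastive_loss(v, a, y)
```

The synchronized pair contributes zero (its distance is 0); the unsynchronized pair is penalized for sitting inside the margin.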
Referring to Fig. 4, Fig. 4 is a schematic block diagram of an apparatus for calibrating a voice signal, also provided by an embodiment of this application; the apparatus is used to execute any of the foregoing methods of calibrating a voice signal. The apparatus may be configured in a server or a terminal.
The server may be an independent server or a server cluster. The terminal may be an electronic device such as a mobile phone, tablet computer, laptop, desktop computer, personal digital assistant, or wearable device.
As shown in Fig. 4, the apparatus 400 includes an extraction unit 401 and a calibration unit 402.
The extraction unit 401 extracts the voice signal of the target person from the audio/video stream, and, in playback-time order, successively extracts images containing the target person's lips from the video frames of the audio/video stream that contain the target person's lips.
The calibration unit 402 calibrates the voice signal of the target person according to the extracted images containing the target person's lips, to obtain the calibrated voice signal of the target person.
In one embodiment, the calibration unit 402 dynamically stretches/shrinks the voice signal of the target person to align it, in time order, with the extracted images containing the target person's lips.
In one embodiment, the calibration unit 402 uses a dynamic time warping (DTW) algorithm to control the dynamic stretching/shrinking of the voice signal of the target person, so as to align it, in time order, with the extracted images containing the target person's lips.
In one embodiment, the calibration unit 402, in time order and based on the extracted images containing the target person's lips, successively calculates the synchronization error between the voice signal of the target person and the target person's lip movements, and, according to the calculated synchronization error, controls the dynamic stretching/shrinking of the voice signal of the target person.
In one embodiment, the calibration unit 402 uses a SyncNet convolutional neural network model to calculate the synchronization error between the voice signal of the target person and the target person's lip movements.
In one embodiment, the extraction unit 401 extracts, from the audio/video stream, a partial voice signal of the target person from noise-free video clips in which only the target person is present; generates a time-frequency mask based on the extracted partial voice signal and the images containing the target person's lips in the video clips; and extracts the voice signal of the target person from the audio/video stream based on the time-frequency mask.
In one embodiment, the extraction unit 401 uses a SyncNet convolutional neural network model to extract the voice signal of the target person from the audio/video stream.
It should be noted that, as is clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the above apparatus for calibrating a voice signal and of its units may refer to the corresponding processes in the foregoing method embodiments of calibrating a voice signal, and are not repeated here.
The above apparatus may be implemented in the form of a computer program, which can be run on a computer device as shown in Fig. 5.
Referring to Fig. 5, Fig. 5 is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device may be a server or a terminal.
Referring to Fig. 5, the computer device includes a processor, a memory and a network interface connected by a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions which, when executed, cause the processor to execute a method of calibrating a voice signal.
The processor provides computing and control capability and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to execute a method of calibrating a voice signal.
The network interface is used for network communication, for example for sending assigned tasks. Those skilled in the art will understand that the structure shown in Fig. 5 is only a block diagram of the parts relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The processor is configured to run the computer program stored in the memory to implement the following steps: extracting the voice signal of a target person from an audio/video stream; successively extracting, in playback-time order, images of the lips of the target person from the video frames of the audio/video stream that contain the lips of the target person; and calibrating the voice signal of the target person according to the extracted images of the lips of the target person, to obtain a calibrated voice signal of the target person.
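The middle step above, extracting lip images frame by frame in playback order, can be illustrated with a minimal sketch. This is an illustration only, not the application's implementation: the `extract_lip_images` helper and the fixed lip bounding boxes are hypothetical stand-ins for the output of a face-landmark detector.

```python
import numpy as np

def extract_lip_images(frames, lip_boxes):
    """Crop the lip region from each video frame, in playback order.

    frames    : list of H x W x 3 frames, already sorted by playback time
    lip_boxes : per-frame (top, bottom, left, right) lip bounding boxes,
                e.g. from a face-landmark detector (hypothetical input)
    """
    return [frame[t:b, l:r] for frame, (t, b, l, r) in zip(frames, lip_boxes)]

# Mock audio/video stream: three 64x64 RGB frames with a fixed lip box.
frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(3)]
boxes = [(40, 56, 16, 48)] * 3
lip_images = extract_lip_images(frames, boxes)  # three 16x32x3 crops
```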
In one embodiment, when calibrating the voice signal of the target person according to the extracted images of the lips of the target person, the processor is configured to: dynamically stretch or contract the voice signal of the target person so as to align it, in time order, with the extracted images of the lips of the target person.
In one embodiment, when dynamically stretching or contracting the voice signal of the target person to align it in time order with the extracted images of the lips of the target person, the processor is configured to: control the dynamic stretching or contraction of the voice signal of the target person using a dynamic time warping (DTW) algorithm, so as to align the voice signal in time order with the extracted images of the lips of the target person.
In one embodiment, when dynamically stretching or contracting the voice signal of the target person to align it in time order with the extracted images of the lips of the target person, the processor is configured to: successively calculate, in time order and based on the extracted images of the lips of the target person, the synchronization error between the voice signal of the target person and the lip movements of the target person; and control the dynamic stretching or contraction of the voice signal of the target person according to the calculated synchronization error.
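One simple way to turn a measured synchronization error into a stretch/contraction command, given purely as an illustrative heuristic (the linear mapping and the 25-frame window are assumptions, not taken from the application):

```python
def stretch_factor_from_error(sync_error_frames, window_frames=25):
    """Map an audio-visual offset (in video frames) to a stretch factor
    for the next audio window: a positive error (audio ahead of the lips)
    stretches the audio (> 1.0), a negative error contracts it (< 1.0).
    Illustrative heuristic only; the window size is an assumption."""
    return 1.0 + sync_error_frames / window_frames
```

A factor of, say, 1.2 would then be handed to a time-scale modification routine that lengthens the next audio window by 20% without changing its pitch.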
In one embodiment, when calculating the synchronization error between the voice signal of the target person and the lip movements of the target person, the processor is configured to: calculate the synchronization error between the voice signal of the target person and the lip movements of the target person using a SyncNet convolutional neural network model.
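SyncNet is a trained two-stream convolutional network that embeds short audio windows and lip-image windows into a common space; the synchronization error is the audio-video offset at which the embedding distance is smallest. The offset-search part can be sketched as follows, with synthetic embeddings standing in for real SyncNet outputs (`estimate_offset` is an illustrative helper, not SyncNet's published code):

```python
import numpy as np

def estimate_offset(audio_emb, video_emb, max_offset=5):
    """Slide the audio embeddings against the video embeddings and return
    the offset (in frames) with the smallest mean Euclidean distance.

    audio_emb, video_emb : T x D arrays, one embedding per video frame.
    """
    T = min(len(audio_emb), len(video_emb))
    best_off, best_dist = 0, np.inf
    for off in range(-max_offset, max_offset + 1):
        lo, hi = max(0, -off), min(T, T - off)  # frame range valid at this offset
        if hi <= lo:
            continue
        d = np.linalg.norm(audio_emb[lo + off:hi + off] - video_emb[lo:hi], axis=1).mean()
        if d < best_dist:
            best_off, best_dist = off, d
    return best_off

rng = np.random.default_rng(0)
video_emb = rng.normal(size=(50, 16))      # mock per-frame lip embeddings
audio_emb = np.roll(video_emb, 3, axis=0)  # audio lags the video by 3 frames
offset = estimate_offset(audio_emb, video_emb)
```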
In one embodiment, when extracting the voice signal of the target person from the audio/video stream, the processor is configured to: extract a partial voice signal of the target person from a video clip of the audio/video stream in which only the target person is present and there is no background noise; generate a time-frequency mask based on the extracted partial voice signal and the images of the lips of the target person in the video clip; and extract the voice signal of the target person from the audio/video stream based on the time-frequency mask.
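A toy example of time-frequency masking: the application generates its mask from the partial voice signal and the lip images, whereas here a fixed low-pass mask over synthetic tones merely shows how such a mask, once generated, separates the target voice from the mixture.

```python
import numpy as np

fs = 16000
n = 512 * 32                      # 32 non-overlapping 512-sample frames
t = np.arange(n) / fs
target = np.sin(2 * np.pi * 440 * t)         # stand-in for the target voice
interference = np.sin(2 * np.pi * 3000 * t)  # stand-in for other sound sources
mix = target + interference

# Frame-wise spectrum, toy binary time-frequency mask, and resynthesis.
frames = mix.reshape(-1, 512)
spec = np.fft.rfft(frames, axis=1)
freqs = np.fft.rfftfreq(512, d=1 / fs)
mask = (freqs < 1000).astype(float)          # keep only bins below 1 kHz
recovered = np.fft.irfft(spec * mask, axis=1).reshape(-1)
```

In the application the mask varies per time-frequency bin and is predicted from audio-visual features rather than fixed, but applying it is the same element-wise multiplication in the time-frequency domain.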
In one embodiment, when extracting the voice signal of the target person from the audio/video stream, the processor is configured to: extract the voice signal of the target person from the audio/video stream using a SyncNet convolutional neural network model.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program. The computer program includes program instructions; when the processor executes the program instructions, the method for calibrating a voice signal of any embodiment of the present application is implemented.

The computer-readable storage medium may be an internal storage unit of the computer device of the foregoing embodiments, for example a hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method for calibrating a voice signal, comprising:
extracting a voice signal of a target person from an audio/video stream;
successively extracting, in playback-time order, images of the lips of the target person from video frames of the audio/video stream that contain the lips of the target person; and
calibrating the voice signal of the target person according to the extracted images of the lips of the target person, to obtain a calibrated voice signal of the target person.
2. The method according to claim 1, wherein calibrating the voice signal of the target person according to the extracted images of the lips of the target person comprises:
dynamically stretching or contracting the voice signal of the target person so as to align it, in time order, with the extracted images of the lips of the target person.
3. The method according to claim 2, wherein dynamically stretching or contracting the voice signal of the target person to align it in time order with the extracted images of the lips of the target person comprises:
controlling the dynamic stretching or contraction of the voice signal of the target person using a dynamic time warping (DTW) algorithm, so as to align the voice signal in time order with the extracted images of the lips of the target person.
4. The method according to claim 2 or 3, wherein dynamically stretching or contracting the voice signal of the target person to align it in time order with the extracted images of the lips of the target person comprises:
successively calculating, in time order and based on the extracted images of the lips of the target person, a synchronization error between the voice signal of the target person and the lip movements of the target person; and
controlling the dynamic stretching or contraction of the voice signal of the target person according to the calculated synchronization error.
5. The method according to claim 4, wherein calculating the synchronization error between the voice signal of the target person and the lip movements of the target person comprises:
calculating the synchronization error between the voice signal of the target person and the lip movements of the target person using a SyncNet convolutional neural network model.
6. The method according to claim 1, wherein extracting the voice signal of the target person from the audio/video stream comprises:
extracting a partial voice signal of the target person from a video clip of the audio/video stream in which only the target person is present and there is no background noise;
generating a time-frequency mask based on the extracted partial voice signal and the images of the lips of the target person in the video clip; and
extracting the voice signal of the target person from the audio/video stream based on the time-frequency mask.
7. The method according to claim 1, wherein extracting the voice signal of the target person from the audio/video stream comprises:
extracting the voice signal of the target person from the audio/video stream using a SyncNet convolutional neural network model.
8. An apparatus for calibrating a voice signal, comprising:
an extraction unit configured to extract a voice signal of a target person from an audio/video stream, and to successively extract, in playback-time order, images of the lips of the target person from video frames of the audio/video stream that contain the lips of the target person; and
a calibration unit configured to calibrate the voice signal of the target person according to the extracted images of the lips of the target person, to obtain a calibrated voice signal of the target person.
9. A computer device, comprising a memory and a processor;
the memory is configured to store a computer program; and
the processor is configured to execute the computer program and, when executing the computer program, to implement the method for calibrating a voice signal according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the method for calibrating a voice signal according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910502478.7A CN110351591A (en) | 2019-06-11 | 2019-06-11 | Calibrate method, apparatus, equipment and the storage medium of voice signal |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110351591A true CN110351591A (en) | 2019-10-18 |
Family
ID=68181848
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910502478.7A Withdrawn CN110351591A (en) | 2019-06-11 | 2019-06-11 | Calibrate method, apparatus, equipment and the storage medium of voice signal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110351591A (en) |
- 2019-06-11 CN CN201910502478.7A patent/CN110351591A/en not_active Withdrawn

Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US20210390970A1 | 2020-06-15 | 2021-12-16 | Tencent America LLC | Multi-modal framework for multi-channel target speech seperation
US11688412B2 | 2020-06-15 | 2023-06-27 | Tencent America LLC | Multi-modal framework for multi-channel target speech separation
CN111988654A | 2020-08-31 | 2020-11-24 | Vivo Mobile Communication Co., Ltd. | Video data alignment method and device and electronic equipment
CN112437336A | 2020-11-19 | 2021-03-02 | Vivo Mobile Communication Co., Ltd. | Audio and video playing method and device, electronic equipment and storage medium
CN114422825A | 2022-01-26 | 2022-04-29 | iFLYTEK Co., Ltd. | Audio and video synchronization method, device, medium, equipment and program product
Similar Documents
Publication | Title
---|---
CN110351591A | Method, apparatus, device, and storage medium for calibrating a voice signal
CN106303658B | Interaction method and device applied to live video streaming
US10182095B2 | Method and system for video call using two-way communication of visual or auditory effect
CN106162235B | Method and apparatus for switching video streams
KR102043088B1 | Synchronization of multimedia streams
US10341745B2 | Methods and systems for providing content
CN108566558A | Video stream processing method and apparatus, computer device, and storage medium
CN108924617A | Method for synchronizing video data and audio data, storage medium, and electronic device
CN105791938B | Method and device for splicing multimedia files
US20210174592A1 | Augmented reality method and device
CN109257659A | Subtitle adding method and apparatus, electronic device, and computer-readable storage medium
EP4099709A1 | Data processing method and apparatus, device, and readable storage medium
CN109089128A | Video processing method, apparatus, device, and medium
JP7442211B2 | Synchronizing auxiliary data for content that includes audio
CN109089127A | Video splicing method, apparatus, device, and medium
CN105611481A | Human-machine interaction method and system based on spatial audio
CN105898556A | Method and device for automatically synchronizing external subtitles
CN106658030B | Method and device for playing a composite video containing single-channel audio and multi-channel video
CN113242361B | Video processing method and device, and computer-readable storage medium
CN107659538A | Video processing method and apparatus
CN105898500A | Network video playing method and device
CN109525865A | Blockchain-based audience rating monitoring method and computer-readable storage medium
CN109040773A | Video enhancement method, apparatus, device, and medium
CN109168059A | Lip synchronization method for playing audio and video separately on different devices
CN114040255A | Live subtitle generation method, system, device, and storage medium
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20191018 |