CN110351591A - Method, apparatus, device and storage medium for calibrating a voice signal - Google Patents
Method, apparatus, device and storage medium for calibrating a voice signal
- Publication number
- CN110351591A CN110351591A CN201910502478.7A CN201910502478A CN110351591A CN 110351591 A CN110351591 A CN 110351591A CN 201910502478 A CN201910502478 A CN 201910502478A CN 110351591 A CN110351591 A CN 110351591A
- Authority
- CN
- China
- Prior art keywords
- target person
- voice signal
- lip
- image
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4398—Processing of audio elementary streams involving reformatting operations of audio signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440245—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
Abstract
This application relates to the field of audio-video processing and discloses a method, apparatus, device and storage medium for calibrating a voice signal using voice processing technology. In the method, the voice signal of a target person is extracted from an audio/video stream; images containing the lips of the target person are successively extracted, in playback-time order, from the video frames of the audio/video stream that contain the target person's lips; and the voice signal of the target person is calibrated according to the extracted lip images to obtain the calibrated voice signal of the target person. The voice signal of the target person is thereby synchronized in playback time with the images containing the target person's lips, i.e., at any playback moment the speech content expressed by the lips is identical to the content expressed by the voice signal.
Description
Technical field
This application relates to the field of audio-video processing, and in particular to a method, apparatus, device and storage medium for calibrating a voice signal.
Background technique
Audio-video synchronization technology is widely applied in daily life. Its main application scenarios include security verification, synchronization of non-live television broadcasts, pre-processing for adding video subtitles in post-production, and automatic audio-video synchronization in film and animation.
An audio-visual synchronizer (Audio Video Synchronizer) can synchronize the audio and video of a film. Audio and video can be out of sync in two ways: compared with the lip shapes or subtitles in the picture, the sound either runs ahead of the image or lags behind it. Out-of-sync audio and video therefore conflicts with people's perception and with everyday common sense.
A common method of guaranteeing audio-visual synchronization is to add a reference clock and use it to keep the voice signal synchronized with the video signal.
Summary of the invention
This application provides a method, apparatus, device and storage medium for calibrating a voice signal, which ensure that the audio-video synchronization result better accords with everyday common sense.
In a first aspect, this application provides a method for calibrating a voice signal, the method comprising:
extracting the voice signal of a target person from an audio/video stream;
successively extracting, in playback-time order, images containing the lips of the target person from the video frames of the audio/video stream that contain the target person's lips; and
calibrating the voice signal of the target person according to the extracted lip images, to obtain the calibrated voice signal of the target person.
In a second aspect, this application further provides an apparatus for calibrating a voice signal, the apparatus comprising:
an extraction unit for extracting the voice signal of a target person from an audio/video stream, and for successively extracting, in playback-time order, images containing the lips of the target person from the video frames of the audio/video stream that contain the target person's lips; and
a calibration unit for calibrating the voice signal of the target person according to the extracted lip images, to obtain the calibrated voice signal of the target person.
In a third aspect, this application further provides a computer device comprising a memory and a processor; the memory stores a computer program, and the processor executes the computer program so as to implement the above method of calibrating a voice signal.
In a fourth aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the above method of calibrating a voice signal.
This application discloses a method, apparatus, device and storage medium for calibrating a voice signal. In the method, the voice signal of a target person is extracted from an audio/video stream; images containing the lips of the target person are successively extracted, in playback-time order, from the video frames of the audio/video stream that contain the target person's lips; and the voice signal of the target person is calibrated according to the extracted lip images to obtain the calibrated voice signal. The voice signal of the target person is thus synchronized in playback time with the images containing the target person's lips, i.e., at any playback moment the speech content expressed by the lips is identical to the content expressed by the voice signal.
Detailed description of the invention
To illustrate the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flow diagram of the steps of calibrating a voice signal provided by an embodiment of this application;
Fig. 2 is a schematic diagram of the voice signal of the target person arranged in time order;
Fig. 3 is a schematic diagram of the images containing the lips of the target person arranged in time order;
Fig. 4 is a schematic block diagram of an apparatus for calibrating a voice signal provided by an embodiment of this application;
Fig. 5 is a schematic structural block diagram of a computer device provided by an embodiment of this application.
Specific embodiment
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art from the embodiments of this application without creative effort fall within the protection scope of this application.
The flow charts shown in the drawings are only illustrative; they need not include all contents and operations/steps, nor must the operations/steps be executed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation.
The embodiments of this application provide a method, apparatus, computer device and storage medium for calibrating a voice signal. The method of calibrating a voice signal can be used to synchronize, in playback time, the voice signal of a target person with the images containing the target person's lips, i.e., at any playback moment the speech content expressed by the lips is identical to the content expressed by the voice signal.
Some embodiments of this application are described in detail below with reference to the drawings. In the absence of conflict, the following embodiments and the features in them may be combined with one another.
Referring to Fig. 1, Fig. 1 is a schematic flow diagram of the steps of the method of calibrating a voice signal provided by an embodiment of this application.
Step S101: extract the voice signal of a target person from an audio/video stream.
In this embodiment, the audio/video stream records a conversation scene of one or more persons, and is therefore composed of multiple audio/video frames arranged in playback-time order.
In an optional implementation, the audio stream in the audio/video stream contains the voice signals of the persons, and the video stream in the audio/video stream contains video frames of the persons. This embodiment can therefore extract a person's voice signal from the audio/video stream, and can also, in playback-time order, successively extract images containing a person's lips from the video frames of the audio/video stream that contain those lips.
Optionally, the audio/video stream may be obtained or downloaded from a network server, or obtained locally. For example, news videos can be downloaded from network servers such as CCTV, BBC and YouTube.
Optionally, the method can be applied to video clips within a designated time period. For example, a segment of audio-video may be extracted from a news video every ten minutes as the audio/video stream to which the method is applied.
In this embodiment, in step S101, the voice signal of the target person is extracted from the audio/video stream. The voice signal of the target person records the target person's speech.
Optionally, step S101 can be implemented as follows: extract the voice signal of the target person from the audio/video stream using a SyncNet convolutional neural network model. Training the SyncNet convolutional neural network model on the audio/video stream makes it possible to extract the voice signal of the target person from the stream; similarly, the voice signals of other persons can also be extracted from the audio/video stream.
Optionally, step S101 can be realized by sub-steps S1011, S1012 and S1013.
Step S1011: from the audio/video stream, extract a partial voice signal of the target person from noise-free video clips in which only the target person is present.
A convolutional neural network (such as a SyncNet convolutional neural network model) is trained on the audio/video stream to separate out the voiced video streams containing multiple persons, so that from the remaining voiced video streams, the voiced stream containing only the target person is filtered out; noise-free video clips are then filtered out of that stream, and the partial voice signal of the target person is extracted from these clips. Note that the voiced video streams containing multiple other persons also contain other portions of the target person's voice signal.
In addition, from the noise-free video clips filtered out of the target-person-only voiced stream, the images containing the target person's lips are extracted.
Optionally, the region recording the features of the target person's lips is extracted from the video clip; the extracted region is the image containing the target person's lips.
It should be understood that each video frame in the video clip has one image containing the target person's lips.
Step S1012: generate a time-frequency mask based on the extracted partial voice signal and the images containing the target person's lips in the video clips.
For example, the partial voice signal of the target person extracted in step S1011 is encoded to obtain an audio code stream; correspondingly, the images containing the target person's lips are encoded to obtain a video code stream; the audio code stream and the video code stream are then merged to obtain the time-frequency mask.
Step S1013: extract the voice signal of the target person from the audio/video stream based on the time-frequency mask.
For example, using the time-frequency mask, the voice signal of the target person is extracted from the audio/video stream by means of the SyncNet convolutional neural network model; the extracted voice signal of the target person is shown in Fig. 2. The extracted voice signal includes the partial voice signal of the target person in the voiced stream containing only the target person, as well as the partial voice signal of the target person in voiced streams where multiple persons are present simultaneously.
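As an illustrative sketch of how a time-frequency mask isolates one speaker's energy from a mixture, assuming the mask is a per-bin weight in [0, 1] applied elementwise to a spectrogram (the array shapes, values, and function name are assumptions for illustration, not taken from this application):

```python
import numpy as np

def apply_tf_mask(mixture_spec: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply a time-frequency mask (values in [0, 1]) to a mixture
    spectrogram to keep only the target speaker's energy."""
    assert mixture_spec.shape == mask.shape
    return mixture_spec * mask

# Toy spectrogram: 4 frequency bins x 3 time frames.
mixture = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0],
                    [7.0, 8.0, 9.0],
                    [1.0, 1.0, 1.0]])
# A binary mask keeping only the two lowest frequency bins.
mask = np.array([[1.0, 1.0, 1.0],
                 [1.0, 1.0, 1.0],
                 [0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0]])
target = apply_tf_mask(mixture, mask)
```

In practice the mask would be predicted by the network from the merged audio and video code streams; here it is hand-written only to show the masking step itself.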
Step S102: in playback-time order, successively extract images containing the lips of the target person from the video frames of the audio/video stream that contain the target person's lips.
The audio/video stream includes video frames containing the lips of the target person. In this embodiment, according to the playback-time order of the audio/video stream, images containing the target person's lips are successively extracted from all the video frames that contain them.
For example, as shown in Fig. 3, images containing the target person's lips are successively extracted, according to a preset region, from all the video frames containing the target person's lips, each extracted image having the size of the preset region. The images containing the target person's lips shown in Fig. 3 correspond one-to-one with the voice signal of the target person shown in Fig. 2.
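A minimal sketch of extracting fixed-size lip images from video frames using a preset region, as step S102 describes; the frame sizes, region coordinates, and function name are illustrative assumptions:

```python
import numpy as np

def crop_lip_region(frame: np.ndarray, box: tuple) -> np.ndarray:
    """Crop a fixed preset region (top, left, height, width) from a
    video frame so that every extracted lip image has the same size."""
    top, left, h, w = box
    return frame[top:top + h, left:left + w]

# Three toy grayscale frames (120x160), filled with their frame index.
frames = [np.full((120, 160), i, dtype=np.uint8) for i in range(3)]
preset_box = (80, 60, 24, 40)  # assumed lip region within each frame
lip_images = [crop_lip_region(f, preset_box) for f in frames]
```

Iterating the frames in playback-time order preserves the one-to-one correspondence with the audio samples that the calibration in step S103 relies on.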
Step S103: calibrate the voice signal of the target person according to the extracted images containing the target person's lips, to obtain the calibrated voice signal of the target person.
In this embodiment, the voice signal of the target person extracted in step S101 and the images containing the target person's lips extracted in step S102 correspond one-to-one according to the playback-time order of the audio/video stream. Therefore, if a deviation arises with respect to the playback-time order, the playback times of the voice signal and of the lip images are calibrated so that the speech content expressed by the lips is identical to the content expressed by the voice signal, thereby synchronizing the voice signal of the target person with the images containing the target person's lips in playback time.
In an optional implementation of step S103, the voice signal of the target person is dynamically stretched/shrunk to align it, in time order, with the extracted images containing the target person's lips.
Specifically, when playing the voice signal of the target person extracted in step S101 and the lip images extracted in step S102, if a deviation arises with respect to the playback-time order, the voice signal of the target person is stretched/shrunk to ensure that, at the same playback moment, the speech content expressed by the lips is identical to the content expressed by the voice signal, thereby synchronizing the voice signal of the target person with the images containing the target person's lips in playback time.
For example, when playing the voice signal and the lip images, if at some playback moment the speech content expressed by the lips differs from the content expressed by the voice signal, and the voice signal plays ahead of the lip images, then the voice signal of the target person is stretched from that playback moment onward. If at another playback moment the speech content expressed by the lips differs from the content expressed by the voice signal, and the voice signal lags behind the lip images, then the voice signal of the target person is shrunk from that playback moment onward. In this way, the voice signal of the target person and the images containing the target person's lips are kept synchronized in playback time.
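The stretch/shrink operation above can be sketched as naive linear resampling of an audio segment; a real system would more likely use a pitch-preserving method (e.g. WSOLA or a phase vocoder), and the function name and factor values here are assumptions:

```python
import numpy as np

def stretch_segment(samples: np.ndarray, factor: float) -> np.ndarray:
    """Stretch (factor > 1) or shrink (factor < 1) an audio segment by
    linear interpolation. This is a didactic sketch only; it changes
    pitch, which production time-scale modification would avoid."""
    n_out = max(1, int(round(len(samples) * factor)))
    old_idx = np.linspace(0, len(samples) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(samples)), samples)

seg = np.array([0.0, 1.0, 2.0, 3.0])
stretched = stretch_segment(seg, 2.0)  # audio was ahead: slow it down
shrunk = stretch_segment(seg, 0.5)     # audio lagged: speed it up
```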
Optionally, a dynamic time warping (DTW) algorithm is used to control the dynamic stretching/shrinking of the voice signal of the target person, so as to align it, in time order, with the extracted images containing the target person's lips.
Specifically, an AV matrix is trained with the voice signal of the target person extracted in step S101 and the lip images extracted in step S102 as input parameters. Based on the AV features obtained from the trained AV matrix, the DTW algorithm then controls the dynamic stretching/shrinking of the voice signal of the target person, ensuring that the voice signal of the target person and the images containing the target person's lips are synchronized in playback time.
During training, if the voice signal of the target person plays ahead of the lip images, the detectability threshold for audio/lip asynchrony is +45 milliseconds (ms); if the voice signal lags behind the lip images, the detectability threshold is -125 milliseconds (ms).
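The asymmetric detectability thresholds above (+45 ms when the audio leads, -125 ms when it lags) can be expressed as a simple acceptance check; the sign convention (positive offset = audio plays ahead of the lip images) follows the text, while the function name is an assumption:

```python
def lip_sync_acceptable(offset_ms: float) -> bool:
    """Return True when the audio/lip offset falls inside the stated
    detectability window: viewers tolerate audio leading the lips by
    up to 45 ms, or lagging them by up to 125 ms."""
    return -125.0 <= offset_ms <= 45.0

# Audio 40 ms early is tolerable; 60 ms early is not.
print(lip_sync_acceptable(40.0), lip_sync_acceptable(60.0))
```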
Optionally, if the voice signal of the target person extracted in step S101 is expressed as a sequence A = (a1, ..., aN) and the images containing the target person's lips extracted in step S102 are expressed as a sequence B = (b1, ..., bM), then by solving for the highest similarity between sequence A and sequence B, the voice signal of the target person and the lip images can be controlled to play simultaneously according to the solution. For example, the highest similarity between sequence A and sequence B is computed using Dijkstra's algorithm, and a cost matrix C is constructed from the solution. The cost matrix C is subsequently used as the pairwise dot product of the video embedding matrix, so that the weight distribution in the matrix controls the length of the matching delay and achieves a match within the error range.
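The sequence-alignment idea above — finding the best correspondence between the audio sequence A and the lip-image sequence B from an accumulated cost matrix — can be sketched with the classic DTW recurrence. The 1-D features and absolute-difference cost here are simplifying assumptions; the application would use learned AV features:

```python
import numpy as np

def dtw_cost(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Fill the DTW accumulated-cost matrix for two 1-D feature
    sequences. The warping path implied by this matrix indicates how
    to stretch or shrink the audio to match the lip-image sequence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[1:, 1:]

audio_feat = np.array([1.0, 2.0, 3.0, 4.0])  # sequence A (toy values)
lip_feat = np.array([1.0, 3.0, 4.0])         # sequence B (toy values)
D = dtw_cost(audio_feat, lip_feat)
total = D[-1, -1]  # minimal total alignment cost
```

A low total cost means the two sequences align well; backtracking through D recovers which audio frames map to which lip images.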
In another optional implementation, step S103 is realized by steps S1031 and S1032.
Step S1031: in time order, based on the extracted images containing the target person's lips, successively calculate the synchronization error between the voice signal of the target person and the target person's lip movements.
A SyncNet convolutional neural network model is trained in advance on synchronized template voice signals and lip images. After training is complete, the model parameters of the trained SyncNet model are used, in time order, to calculate the synchronization error between the voice signal of the target person extracted in step S101 and the images containing the target person's lips extracted in step S102.
In an optional implementation of step S1031, the SyncNet convolutional neural network model is used to calculate the synchronization error between the voice signal of the target person and the target person's lip movements.
Step S1032: according to the calculated synchronization error, control the dynamic stretching/shrinking of the voice signal of the target person.
For example, according to the calculated synchronization error, a dynamic time warping (DTW) algorithm controls the dynamic stretching/shrinking of the voice signal of the target person.
Optionally, as a measure of synchronization error, a contrastive loss function is provided. This contrastive loss function is given by formula (1):
E = (1/(2N)) Σ (n=1..N) [ yn·dn² + (1 − yn)·max(margin − dn, 0)² ] (1)
where dn is the deviation between the voice signal of the target person extracted in step S101 and the image containing the target person's lips extracted in step S102, and yn is the binary similarity measure between the voice signal extracted in step S101 and the lip image extracted in step S102.
The calculation of dn is given by formula (2):
dn = ||vn − an||2 (2)
where vn is the fully-connected fc7 vector of the voice signal of the target person extracted in step S101, and an is the fully-connected fc7 vector of the image containing the target person's lips extracted in step S102.
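A sketch of this contrastive synchronization loss over toy fc7-style embeddings, assuming the standard form yn·dn² + (1 − yn)·max(margin − dn, 0)², averaged over pairs and halved; the margin value, embedding dimension, and sample values are assumptions:

```python
import numpy as np

def contrastive_loss(v: np.ndarray, a: np.ndarray,
                     y: np.ndarray, margin: float = 1.0) -> float:
    """Contrastive loss over paired embeddings: dn = ||vn - an||.
    Synchronized pairs (yn = 1) are pulled together; unsynchronized
    pairs (yn = 0) are pushed apart up to the margin."""
    d = np.linalg.norm(v - a, axis=1)
    per_pair = y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2
    return float(per_pair.mean() / 2.0)

# Two toy pairs: the first synchronized (y=1), the second not (y=0).
v = np.array([[0.0, 0.0], [0.5, 0.0]])
a = np.array([[0.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 0.0])
loss = contrastive_loss(v, a, y)
```

The synchronized pair contributes zero (its distance is 0); the unsynchronized pair is penalized for sitting inside the margin.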
Referring to Fig. 4, Fig. 4 is a schematic block diagram of an apparatus for calibrating a voice signal, also provided by an embodiment of this application; the apparatus is used to execute any of the foregoing methods of calibrating a voice signal. The apparatus may be configured in a server or a terminal.
The server may be an independent server or a server cluster. The terminal may be an electronic device such as a mobile phone, tablet computer, laptop, desktop computer, personal digital assistant, or wearable device.
As shown in Fig. 4, the apparatus 400 includes an extraction unit 401 and a calibration unit 402.
The extraction unit 401 extracts the voice signal of the target person from the audio/video stream, and, in playback-time order, successively extracts images containing the target person's lips from the video frames of the audio/video stream that contain the target person's lips.
The calibration unit 402 calibrates the voice signal of the target person according to the extracted images containing the target person's lips, to obtain the calibrated voice signal of the target person.
In one embodiment, the calibration unit 402 dynamically stretches/shrinks the voice signal of the target person to align it, in time order, with the extracted images containing the target person's lips.
In one embodiment, the calibration unit 402 uses a dynamic time warping (DTW) algorithm to control the dynamic stretching/shrinking of the voice signal of the target person, so as to align it, in time order, with the extracted images containing the target person's lips.
In one embodiment, the calibration unit 402, in time order and based on the extracted images containing the target person's lips, successively calculates the synchronization error between the voice signal of the target person and the target person's lip movements, and, according to the calculated synchronization error, controls the dynamic stretching/shrinking of the voice signal of the target person.
In one embodiment, the calibration unit 402 uses a SyncNet convolutional neural network model to calculate the synchronization error between the voice signal of the target person and the target person's lip movements.
In one embodiment, the extraction unit 401 extracts, from the audio/video stream, a partial voice signal of the target person from noise-free video clips in which only the target person is present; generates a time-frequency mask based on the extracted partial voice signal and the images containing the target person's lips in the video clips; and extracts the voice signal of the target person from the audio/video stream based on the time-frequency mask.
In one embodiment, the extraction unit 401 uses a SyncNet convolutional neural network model to extract the voice signal of the target person from the audio/video stream.
It should be noted that, as is clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the above apparatus for calibrating a voice signal and of its units may refer to the corresponding processes in the foregoing method embodiments of calibrating a voice signal, and are not repeated here.
The above apparatus may be implemented in the form of a computer program, which can be run on a computer device as shown in Fig. 5.
Referring to Fig. 5, Fig. 5 is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device may be a server or a terminal.
Referring to Fig. 5, the computer device includes a processor, a memory and a network interface connected by a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions which, when executed, cause the processor to execute a method of calibrating a voice signal.
The processor provides computing and control capability and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to execute a method of calibrating a voice signal.
The network interface is used for network communication, for example for sending assigned tasks. Those skilled in the art will understand that the structure shown in Fig. 5 is only a block diagram of the parts relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The processor is configured to run the computer program stored in the memory to implement the following steps: extracting the voice signal of a target person from an audio/video stream; successively extracting, in playback-time order, images of the lips of the target person from the video frames of the audio/video stream that contain the lips of the target person; and calibrating the voice signal of the target person according to the extracted images of the lips of the target person, to obtain a calibrated voice signal of the target person.
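The middle step above, extracting lip images frame by frame in playback order, can be illustrated with a minimal sketch. This is an illustration only, not the application's implementation: the `extract_lip_images` helper and the fixed lip bounding boxes are hypothetical stand-ins for the output of a face-landmark detector.

```python
import numpy as np

def extract_lip_images(frames, lip_boxes):
    """Crop the lip region from each video frame, in playback order.

    frames    : list of H x W x 3 frames, already sorted by playback time
    lip_boxes : per-frame (top, bottom, left, right) lip bounding boxes,
                e.g. from a face-landmark detector (hypothetical input)
    """
    return [frame[t:b, l:r] for frame, (t, b, l, r) in zip(frames, lip_boxes)]

# Mock audio/video stream: three 64x64 RGB frames with a fixed lip box.
frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(3)]
boxes = [(40, 56, 16, 48)] * 3
lip_images = extract_lip_images(frames, boxes)  # three 16x32x3 crops
```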
In one embodiment, when calibrating the voice signal of the target person according to the extracted images of the lips of the target person, the processor is configured to: dynamically stretch or contract the voice signal of the target person so as to align it, in time order, with the extracted images of the lips of the target person.
In one embodiment, when dynamically stretching or contracting the voice signal of the target person to align it in time order with the extracted images of the lips of the target person, the processor is configured to: control the dynamic stretching or contraction of the voice signal of the target person using a dynamic time warping (DTW) algorithm, so as to align the voice signal in time order with the extracted images of the lips of the target person.
In one embodiment, when dynamically stretching or contracting the voice signal of the target person to align it in time order with the extracted images of the lips of the target person, the processor is configured to: successively calculate, in time order and based on the extracted images of the lips of the target person, the synchronization error between the voice signal of the target person and the lip movements of the target person; and control the dynamic stretching or contraction of the voice signal of the target person according to the calculated synchronization error.
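One simple way to turn a measured synchronization error into a stretch/contraction command, given purely as an illustrative heuristic (the linear mapping and the 25-frame window are assumptions, not taken from the application):

```python
def stretch_factor_from_error(sync_error_frames, window_frames=25):
    """Map an audio-visual offset (in video frames) to a stretch factor
    for the next audio window: a positive error (audio ahead of the lips)
    stretches the audio (> 1.0), a negative error contracts it (< 1.0).
    Illustrative heuristic only; the window size is an assumption."""
    return 1.0 + sync_error_frames / window_frames
```

A factor of, say, 1.2 would then be handed to a time-scale modification routine that lengthens the next audio window by 20% without changing its pitch.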
In one embodiment, when calculating the synchronization error between the voice signal of the target person and the lip movements of the target person, the processor is configured to: calculate the synchronization error between the voice signal of the target person and the lip movements of the target person using a SyncNet convolutional neural network model.
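SyncNet is a trained two-stream convolutional network that embeds short audio windows and lip-image windows into a common space; the synchronization error is the audio-video offset at which the embedding distance is smallest. The offset-search part can be sketched as follows, with synthetic embeddings standing in for real SyncNet outputs (`estimate_offset` is an illustrative helper, not SyncNet's published code):

```python
import numpy as np

def estimate_offset(audio_emb, video_emb, max_offset=5):
    """Slide the audio embeddings against the video embeddings and return
    the offset (in frames) with the smallest mean Euclidean distance.

    audio_emb, video_emb : T x D arrays, one embedding per video frame.
    """
    T = min(len(audio_emb), len(video_emb))
    best_off, best_dist = 0, np.inf
    for off in range(-max_offset, max_offset + 1):
        lo, hi = max(0, -off), min(T, T - off)  # frame range valid at this offset
        if hi <= lo:
            continue
        d = np.linalg.norm(audio_emb[lo + off:hi + off] - video_emb[lo:hi], axis=1).mean()
        if d < best_dist:
            best_off, best_dist = off, d
    return best_off

rng = np.random.default_rng(0)
video_emb = rng.normal(size=(50, 16))      # mock per-frame lip embeddings
audio_emb = np.roll(video_emb, 3, axis=0)  # audio lags the video by 3 frames
offset = estimate_offset(audio_emb, video_emb)
```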
In one embodiment, when extracting the voice signal of the target person from the audio/video stream, the processor is configured to: extract a partial voice signal of the target person from a video clip of the audio/video stream in which only the target person is present and there is no background noise; generate a time-frequency mask based on the extracted partial voice signal and the images of the lips of the target person in the video clip; and extract the voice signal of the target person from the audio/video stream based on the time-frequency mask.
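A toy example of time-frequency masking: the application generates its mask from the partial voice signal and the lip images, whereas here a fixed low-pass mask over synthetic tones merely shows how such a mask, once generated, separates the target voice from the mixture.

```python
import numpy as np

fs = 16000
n = 512 * 32                      # 32 non-overlapping 512-sample frames
t = np.arange(n) / fs
target = np.sin(2 * np.pi * 440 * t)         # stand-in for the target voice
interference = np.sin(2 * np.pi * 3000 * t)  # stand-in for other sound sources
mix = target + interference

# Frame-wise spectrum, toy binary time-frequency mask, and resynthesis.
frames = mix.reshape(-1, 512)
spec = np.fft.rfft(frames, axis=1)
freqs = np.fft.rfftfreq(512, d=1 / fs)
mask = (freqs < 1000).astype(float)          # keep only bins below 1 kHz
recovered = np.fft.irfft(spec * mask, axis=1).reshape(-1)
```

In the application the mask varies per time-frequency bin and is predicted from audio-visual features rather than fixed, but applying it is the same element-wise multiplication in the time-frequency domain.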
In one embodiment, when extracting the voice signal of the target person from the audio/video stream, the processor is configured to: extract the voice signal of the target person from the audio/video stream using a SyncNet convolutional neural network model.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program. The computer program includes program instructions; when the processor executes the program instructions, the method for calibrating a voice signal of any embodiment of the present application is implemented.

The computer-readable storage medium may be an internal storage unit of the computer device of the foregoing embodiments, for example a hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method for calibrating a voice signal, comprising:
extracting a voice signal of a target person from an audio/video stream;
successively extracting, in playback-time order, images of the lips of the target person from video frames of the audio/video stream that contain the lips of the target person; and
calibrating the voice signal of the target person according to the extracted images of the lips of the target person, to obtain a calibrated voice signal of the target person.
2. The method according to claim 1, wherein calibrating the voice signal of the target person according to the extracted images of the lips of the target person comprises:
dynamically stretching or contracting the voice signal of the target person so as to align it, in time order, with the extracted images of the lips of the target person.
3. The method according to claim 2, wherein dynamically stretching or contracting the voice signal of the target person to align it in time order with the extracted images of the lips of the target person comprises:
controlling the dynamic stretching or contraction of the voice signal of the target person using a dynamic time warping (DTW) algorithm, so as to align the voice signal in time order with the extracted images of the lips of the target person.
4. The method according to claim 2 or 3, wherein dynamically stretching or contracting the voice signal of the target person to align it in time order with the extracted images of the lips of the target person comprises:
successively calculating, in time order and based on the extracted images of the lips of the target person, a synchronization error between the voice signal of the target person and the lip movements of the target person; and
controlling the dynamic stretching or contraction of the voice signal of the target person according to the calculated synchronization error.
5. The method according to claim 4, wherein calculating the synchronization error between the voice signal of the target person and the lip movements of the target person comprises:
calculating the synchronization error between the voice signal of the target person and the lip movements of the target person using a SyncNet convolutional neural network model.
6. The method according to claim 1, wherein extracting the voice signal of the target person from the audio/video stream comprises:
extracting a partial voice signal of the target person from a video clip of the audio/video stream in which only the target person is present and there is no background noise;
generating a time-frequency mask based on the extracted partial voice signal and the images of the lips of the target person in the video clip; and
extracting the voice signal of the target person from the audio/video stream based on the time-frequency mask.
7. The method according to claim 1, wherein extracting the voice signal of the target person from the audio/video stream comprises:
extracting the voice signal of the target person from the audio/video stream using a SyncNet convolutional neural network model.
8. An apparatus for calibrating a voice signal, comprising:
an extraction unit configured to extract a voice signal of a target person from an audio/video stream, and to successively extract, in playback-time order, images of the lips of the target person from video frames of the audio/video stream that contain the lips of the target person; and
a calibration unit configured to calibrate the voice signal of the target person according to the extracted images of the lips of the target person, to obtain a calibrated voice signal of the target person.
9. A computer device, comprising a memory and a processor;
the memory is configured to store a computer program; and
the processor is configured to execute the computer program and, when executing the computer program, to implement the method for calibrating a voice signal according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the method for calibrating a voice signal according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910502478.7A CN110351591A (en) | 2019-06-11 | 2019-06-11 | Calibrate method, apparatus, equipment and the storage medium of voice signal |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110351591A true CN110351591A (en) | 2019-10-18 |
Family
ID=68181848
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910502478.7A Withdrawn CN110351591A (en) | 2019-06-11 | 2019-06-11 | Calibrate method, apparatus, equipment and the storage medium of voice signal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110351591A (en) |
- 2019-06-11 CN CN201910502478.7A patent/CN110351591A/en not_active Withdrawn

Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US20210390970A1 | 2020-06-15 | 2021-12-16 | Tencent America LLC | Multi-modal framework for multi-channel target speech seperation
US11688412B2 | 2020-06-15 | 2023-06-27 | Tencent America LLC | Multi-modal framework for multi-channel target speech separation
CN111988654A | 2020-08-31 | 2020-11-24 | Vivo Mobile Communication Co., Ltd. | Video data alignment method and device and electronic equipment
CN112437336A | 2020-11-19 | 2021-03-02 | Vivo Mobile Communication Co., Ltd. | Audio and video playing method and device, electronic equipment and storage medium
CN114422825A | 2022-01-26 | 2022-04-29 | iFLYTEK Co., Ltd. | Audio and video synchronization method, device, medium, equipment and program product
Similar Documents
Publication | Title
---|---
CN110351591A | Method, apparatus, device, and storage medium for calibrating a voice signal
CN106303658B | Interaction method and device applied to live video streaming
US10182095B2 | Method and system for video call using two-way communication of visual or auditory effect
CN106162235B | Method and apparatus for switching video streams
KR102043088B1 | Synchronization of multimedia streams
US10341745B2 | Methods and systems for providing content
CN108566558A | Video stream processing method and apparatus, computer device, and storage medium
CN108924617A | Method for synchronizing video data and audio data, storage medium, and electronic device
CN105791938B | Method and device for splicing multimedia files
US20210174592A1 | Augmented reality method and device
CN109257659A | Subtitle adding method and apparatus, electronic device, and computer-readable storage medium
EP4099709A1 | Data processing method and apparatus, device, and readable storage medium
CN109089128A | Video processing method, apparatus, device, and medium
JP7442211B2 | Synchronizing auxiliary data for content that includes audio
CN109089127A | Video splicing method, apparatus, device, and medium
CN105611481A | Human-machine interaction method and system based on spatial audio
CN105898556A | Method and device for automatically synchronizing external subtitles
CN106658030B | Method and device for playing a composite video containing single-channel audio and multi-channel video
CN113242361B | Video processing method and device, and computer-readable storage medium
CN107659538A | Video processing method and apparatus
CN105898500A | Network video playing method and device
CN109525865A | Blockchain-based audience rating monitoring method and computer-readable storage medium
CN109040773A | Video enhancement method, apparatus, device, and medium
CN109168059A | Lip synchronization method for playing audio and video separately on different devices
CN114040255A | Live subtitle generation method, system, device, and storage medium
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20191018 |