CN114466179A - Method and device for measuring synchronism of voice and image - Google Patents
Method and device for measuring synchronism of voice and image
- Publication number
- CN114466179A (application CN202111057976.9A)
- Authority
- CN
- China
- Prior art keywords
- voice
- image
- segment
- visual
- face
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 71
- 230000000007 visual effect Effects 0.000 claims abstract description 175
- 238000013528 artificial neural network Methods 0.000 claims abstract description 126
- 230000014509 gene expression Effects 0.000 claims description 55
- 238000012545 processing Methods 0.000 claims description 48
- 230000001360 synchronised effect Effects 0.000 claims description 32
- 239000013598 vector Substances 0.000 claims description 32
- 238000004422 calculation algorithm Methods 0.000 claims description 22
- 230000006870 function Effects 0.000 claims description 21
- 239000011159 matrix material Substances 0.000 claims description 20
- 238000001514 detection method Methods 0.000 claims description 15
- 238000000605 extraction Methods 0.000 claims description 15
- 230000009471 action Effects 0.000 claims description 11
- 230000008921 facial expression Effects 0.000 claims description 6
- 238000004891 communication Methods 0.000 claims description 3
- 238000012549 training Methods 0.000 description 88
- 230000000875 corresponding effect Effects 0.000 description 74
- 230000033001 locomotion Effects 0.000 description 29
- 238000005259 measurement Methods 0.000 description 22
- 238000005070 sampling Methods 0.000 description 18
- 230000008569 process Effects 0.000 description 17
- 238000010586 diagram Methods 0.000 description 16
- 238000005516 engineering process Methods 0.000 description 11
- 239000012634 fragment Substances 0.000 description 11
- 210000001328 optic nerve Anatomy 0.000 description 9
- 210000000887 face Anatomy 0.000 description 5
- 238000007781 pre-processing Methods 0.000 description 5
- 238000004364 calculation method Methods 0.000 description 4
- 238000010276 construction Methods 0.000 description 4
- 230000000694 effects Effects 0.000 description 3
- 238000003062 neural network model Methods 0.000 description 3
- 238000005457 optimization Methods 0.000 description 3
- 230000002238 attenuated effect Effects 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000013136 deep learning model Methods 0.000 description 2
- 210000003128 head Anatomy 0.000 description 2
- 238000005065 mining Methods 0.000 description 2
- 238000003672 processing method Methods 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- 230000001154 acute effect Effects 0.000 description 1
- 230000003044 adaptive effect Effects 0.000 description 1
- 230000003321 amplification Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 210000005069 ears Anatomy 0.000 description 1
- 210000004709 eyebrow Anatomy 0.000 description 1
- 230000001815 facial effect Effects 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000005286 illumination Methods 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 230000007935 neutral effect Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000003199 nucleic acid amplification method Methods 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000005215 recombination Methods 0.000 description 1
- 230000006798 recombination Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 238000012163 sequencing technique Methods 0.000 description 1
- 230000005236 sound signal Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000011410 subtraction method Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Images
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N17/00—Diagnosis, testing or measuring for television systems or their details
- H04N17/004—Diagnosis, testing or measuring for television systems or their details for digital television systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
- H04N21/43072—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Image Analysis (AREA)
Abstract
The application provides a method and a device for measuring the synchronism of voice and image, wherein the method comprises the following steps: acquiring a voice segment and an image segment in a video, wherein the voice segment and the image segment have a corresponding relation in the video; generating a contour map of the target person according to the image segment, wherein the contour map is independent of the individual characteristics of the target person; obtaining the voice features of the voice segment through a voice neural network; obtaining the visual features of the contour map through a visual neural network; and determining whether the voice segment and the image segment are synchronous according to the voice features and the visual features. Before the image segment is input into the visual neural network, it is processed to remove the features related to the individual person in the image segment, and the image data obtained from this processing is then input into the visual neural network. Therefore, the visual features acquired through the visual neural network do not carry the characteristics of the speaker, and the accuracy of measuring the synchronism of voice and image can be improved.
Description
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a method and an apparatus for measuring synchronization between speech and an image.
Background
A video usually contains both images and voice. When a person in the video speaks, the mouth movements of the person in the image should remain synchronized with the speech uttered by that person.
In order to measure whether the mouth movement of a person in a video is synchronous with the voice uttered by the person, a SyncNet-like technology is generally adopted. For SyncNet-like technologies, reference may be made to Chung, Joon Son, and Andrew Zisserman, "Out of Time: Automated Lip Sync in the Wild." Specifically, the voice segment in the video is input into one neural network to obtain speech features, and the image segment in the video is input into another neural network to obtain visual features. Finally, whether the mouth movement of the person in the video is synchronous with the voice uttered by the person is judged by comparing the speech features with the visual features.
However, the accuracy of measuring whether the mouth movement of the person in the video is synchronous with the voice produced by the person by adopting the SyncNet technology is still low.
Disclosure of Invention
The embodiment of the application aims to provide a method and a device for measuring the synchronism of voice and image, so as to improve the accuracy of measuring the synchronism of voice and image.
In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:
The first aspect of the present application provides a method for measuring the synchronicity between voice and image, where the method includes: acquiring a voice segment and an image segment in a video, wherein the voice segment and the image segment have a corresponding relation in the video; generating a contour map of a target person according to the image segment, wherein the contour map is independent of the individual characteristics of the target person; obtaining the voice features of the voice segment through a voice neural network; obtaining the visual features of the contour map through a visual neural network; and determining whether the voice segment and the image segment are synchronous according to the voice features and the visual features, wherein the synchronicity is used for representing that the sound in the voice segment matches the action of the target person in the image segment (specifically, the motion of the lower half of the face).
A second aspect of the present application provides a device for measuring synchronization between voice and image, the device comprising: a receiving module, configured to acquire a voice segment and an image segment in a video, wherein the voice segment and the image segment have a corresponding relation in the video; a data processing module, configured to generate a contour map of a target person according to the image segment, the contour map being independent of the individual characteristics of the target person; a feature extraction module, configured to obtain the voice features of the voice segment through a voice neural network, and further configured to obtain the visual features of the contour map through a visual neural network; and a synchronicity measurement module, configured to determine whether the voice segment and the image segment are synchronous according to the voice features and the visual features, the synchronicity being used for representing that the sound in the voice segment matches the action of the target person in the image segment.
A third aspect of the present application provides an electronic device comprising: a processor, a memory, a bus; the processor and the memory complete mutual communication through the bus; the processor is for invoking program instructions in the memory for performing the method of the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium comprising: a stored program; wherein the program, when executed, controls an apparatus in which the storage medium is located to perform the method of the first aspect.
Compared with the prior art, in the method for measuring the synchronization between voice and image provided by the first aspect, after the voice segment and the image segment in the video are obtained, a contour map of the target person is generated according to the image segment, the contour map being independent of the individual characteristics of the target person; the voice features of the voice segment are obtained through a voice neural network, the visual features of the contour map are obtained through a visual neural network, and finally whether the voice segment and the image segment are synchronous is determined according to the voice features and the visual features. That is, before the image segment is input into the visual neural network, it is processed to remove the features related to the individual person, and the image data obtained from this processing is then input into the visual neural network. Therefore, the visual features acquired through the visual neural network do not carry the characteristics of the speaker, and the accuracy of measuring the synchronism of voice and image can be improved.
The device for measuring the synchronicity between voice and image provided by the second aspect of the present application, the electronic device provided by the third aspect, and the computer-readable storage medium provided by the fourth aspect have the same or similar beneficial effects as the method for measuring the synchronicity between voice and image provided by the first aspect.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts:
FIG. 1 is a first diagram of an image segment according to an embodiment of the present application;
FIG. 2 is a second diagram of an image segment according to an embodiment of the present application;
FIG. 3 is a first flowchart illustrating a method for measuring the synchronicity between speech and image according to an embodiment of the present application;
FIG. 4 is a second flowchart illustrating a method for measuring the synchronicity between speech and image according to an embodiment of the present application;
FIG. 5 is a flow chart illustrating processing of a speech segment according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the range of the lower half face in the embodiment of the present application;
FIG. 7 is a flowchart illustrating processing of image segments according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an architecture for measuring the synchronization between voice and image in an embodiment of the present application;
FIG. 9 is a block diagram of an exemplary speech neural network;
FIG. 10 is a flow chart illustrating the generation of speech features in an embodiment of the present application;
FIG. 11 is a schematic flow chart illustrating the generation of visual features in an embodiment of the present application;
FIG. 12 is a schematic flow chart illustrating a process for training a neural network according to an embodiment of the present application;
FIG. 13 is a flowchart illustrating a method for measuring the synchronicity between a voice and an image according to an embodiment of the present application;
FIG. 14 is a first schematic view of an apparatus for measuring synchronization between audio and video according to an embodiment of the present application;
FIG. 15 is a second schematic structural diagram of an apparatus for measuring synchronization between speech and image according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical terms or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
In the prior art, a SyncNet technology is adopted to measure whether the mouth movement of a person in a video is synchronous with the voice sent by the person, and the accuracy is low.
The inventor has studied this carefully and found that the reason the accuracy of SyncNet-like technologies in measuring whether mouth movement and voice are synchronized is low is as follows: SyncNet-like technologies require two neural networks. One is a speech neural network for extracting speech features; the other is a visual neural network for extracting visual features. Neither the speech neural network nor the visual neural network can be made independent of the characteristics of the speaker during training. That is, when training on samples, the samples carry the speakers' own characteristics, and the trained networks also learn the characteristics of the speakers in the samples. For speakers not covered by the samples, the accuracy of the speech features and visual features obtained through the speech neural network and the visual neural network therefore decreases.
Moreover, SyncNet-like technologies are difficult to make independent of the image coordinate system. That is, when visual features are extracted by the visual neural network, it is mainly mouth features that are extracted, and the extraction of mouth features is very sensitive to mouth alignment. When the speaker performs a three-dimensional motion such as turning the head, it is difficult to align the mouth. The relative motion caused by mouth alignment and the mouth motion caused by speaking are coupled together, so the accuracy of mouth feature extraction by SyncNet-like technologies drops noticeably. Fig. 1 is a first schematic diagram of an image segment in an embodiment of the present application. Referring to fig. 1, there are 3 images in the image segment. In the 1st image, the person is speaking. By the 2nd image, the person's head has rotated, and the position and scale of the mouth in the image differ from those of the frontal face in the 1st image. In the 3rd image, the person is still speaking. Forcing this three-dimensional motion into a two-dimensional representation, as SyncNet-like technologies do, clearly affects the accuracy of judging whether mouth movement and voice are synchronized.
Moreover, SyncNet-like technologies have poor robustness to occlusion in images. That is, when the speaker's face is partially occluded in the image, the visual neural network cannot accurately extract the speaker's mouth features, and the extracted mouth features include features of the occluding object. Thus, the accuracy of the mouth movement and speech synchronicity determination is also reduced. Fig. 2 is a second schematic diagram of an image segment in an embodiment of the present application. Referring to fig. 2, in the two images the mouth of the person is partially blocked by a finger and a pen, respectively. Such occlusion affects the alignment of the mouth in the image, and the obtained mouth features may also contain the occluding objects, thereby affecting the accuracy of the mouth motion and speech synchronicity determination.
In view of this, an embodiment of the present application provides a method for measuring synchronization between speech and an image, where before an image segment is input to a visual neural network, the image segment is processed to remove features related to individual persons in the image segment, and then image data obtained by processing the image segment is input to the visual neural network. Therefore, the visual characteristics acquired through the visual neural network do not carry the characteristics of the speaker, and the accuracy of measuring the synchronism of the voice and the image is improved.
Fig. 3 is a first flowchart illustrating a method for measuring synchronization between voice and image according to an embodiment of the present application, and referring to fig. 3, the method may include:
s301: and acquiring a voice segment and an image segment in the video.
The video is a video in which it is required to determine whether images and voice are synchronous. Here, synchronicity is used to characterize that the sound in the voice segment matches the motion of the target person in the image segment.
Matching means that, in a video, the sound produced by the action of the target person in the image segment (specifically, the action of the lower half face) is semantically and temporally identical to the sound in the voice segment.
For example, if the mouth of the target person in the image segment is shaped like a mouth that utters "apple" and the sound in the voice segment is also "apple", the image segment and the voice segment can be considered to have synchronization. Furthermore, if the mouth of the target person in the image segment has a mouth shape that makes an "apple" sound and the sound in the voice segment is "banana", it can be considered that the image segment and the voice segment are not synchronized.
Generally, rather than judging all the images and all the voice in the video at once, a part of the images and the corresponding voice are taken and judged together. The selected images form an image segment of the video, and correspondingly the selected voice forms a voice segment of the video. The selected voice segment and image segment have a corresponding relation in the video.
The correspondence relationship means that the selected voice segment and the image segment have the same start time and the same end time, or differ by a time offset small enough to be acceptable to a human viewer.
For example, images and voices corresponding to the 1 st frame to the 10 th frame in the video are obtained. The images of the 1 st to 10 th frames in the video constitute an image segment, and the voices of the 1 st to 10 th frames in the video constitute a voice segment. Here, the 1 st frame to the 10 th frame are specific positions. The specific positions of the image segment and the voice segment can be set according to actual situations, and are not limited specifically here.
Of course, the image segment may also be a certain 1 frame image, and the corresponding voice segment may also be the voice of the frame and the voice of several frames before and after the frame.
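As an illustration of this correspondence, the following sketch pairs a run of image frames with the audio samples covering the same time span; the frame rate and sampling rate used here are assumed example values, not requirements of the method.

```python
# Hypothetical helper for pairing an image segment with its voice segment by
# index arithmetic. FPS and SR are assumed example values.
FPS = 25        # assumed video frame rate
SR = 16000      # assumed audio sampling rate (see the resampling step later)

def corresponding_audio_span(start_frame, end_frame, fps=FPS, sr=SR):
    """Sample range [start, end) covering image frames start_frame..end_frame
    (1-indexed, inclusive)."""
    start_sample = int((start_frame - 1) * sr / fps)
    end_sample = int(end_frame * sr / fps)
    return start_sample, end_sample

# Frames 1-10 at 25 fps correspond to the first 0.4 s of audio:
print(corresponding_audio_span(1, 10))   # (0, 6400)
```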
After judging whether a part of images in the video are synchronous with the corresponding voice or not, judging whether another part of images in the video are synchronous with the corresponding voice or not until the judgment of the synchronism of all the images in the video and the corresponding voice is finished.
S302: and generating an outline image of the target person according to the image segments.
Wherein the contour map is independent of individual characteristics of the target person.
Before the image segment is input into the visual neural network to obtain the visual features, it must first be processed to remove the individual features of the persons in it; that is, only those person-related features that are independent of individual identity are extracted from the image segment.
For example, lips vary in thickness and mouths vary in size from person to person: some people have thick lips, some thin lips, some large mouths, some small mouths. If the image segment is directly input into the visual neural network, the obtained visual features will include individual features of each person, which reduces the accuracy of the image and voice synchronicity judgment. Moreover, if image segments containing such individual features are used to train the visual neural network, the trained network cannot accurately acquire the visual features of persons not contained in the training samples, which further reduces the accuracy of the subsequent synchronicity judgment. Therefore, before the image segment is input into the visual neural network, only the features related to the motion of the lower half of the face are extracted from it, so that person-specific features are avoided. For instance, only the degree of opening and closing of the mouth is extracted, not the thickness of the lips. The extracted motion-related features are then combined to obtain the posture or expression of the person, and further a contour map of the target person in the image segment. When this contour map is input into the visual neural network, individual characteristics of the person are kept out of the resulting visual features, which further improves the accuracy of the image and voice synchronicity judgment.
S303: and obtaining the voice characteristics of the voice segments through the voice neural network.
And inputting the voice segments into a voice neural network, and processing the voice segments through the voice neural network, wherein the output of the voice neural network is the voice characteristic.
The speech neural network can be any neural network capable of acquiring speech features in a speech segment. The specific type of the speech neural network is not particularly limited herein.
S304: and obtaining the visual characteristics of the contour map through a visual neural network.
And inputting the contour map obtained after processing the image segments into a visual neural network, and processing the contour map through the visual neural network, wherein the output of the visual neural network is the visual characteristic.
The visual neural network can be any neural network capable of acquiring the visual features in the image segment. The particular type of visual neural network is not specifically limited herein.
S305: and determining whether the voice segment and the image segment have synchronicity according to the voice feature and the visual feature.
After the voice neural network outputs the voice characteristics and the visual neural network outputs the visual characteristics, the voice characteristics and the visual characteristics are compared through an algorithm with a comparison function, and whether the voice segment and the image segment have synchronism can be determined according to a comparison result. Here, synchronicity is used to characterize that the sound in the voice segment matches the motion of the target person in the image segment. That is, it is determined whether the meaning of the action of the sound in the voice section and the target person in the image section is the same or not, based on the comparison result. It can also be understood that the sound emitted by the action of the target person in the image segment is semantically and temporally identical to the sound in the voice segment. The action of the target person here generally refers to the action of the lower half face of the target person, such as the action associated with the mouth.
Typically, the output is a value between 0 and 1. And a threshold value is set between 0 and 1. If the output numerical value is larger than or equal to the threshold value, the similarity between the voice characteristic and the visual characteristic is high, and the voice segment is synchronous with the image segment. If the output numerical value is smaller than the threshold value, the similarity between the voice characteristic and the visual characteristic is low, and the voice segment is not synchronous with the image segment. Specific ranges and thresholds for the values are not specifically limited herein.
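A minimal sketch of such a comparison, assuming the two networks output vectors of equal dimension and that cosine similarity mapped to [0, 1] serves as the comparison function; the threshold of 0.5 is an assumed example, not a value given in this application.

```python
import numpy as np

def is_synchronous(speech_feat, visual_feat, threshold=0.5):
    # Normalize both feature vectors and compare with cosine similarity.
    speech_feat = speech_feat / np.linalg.norm(speech_feat)
    visual_feat = visual_feat / np.linalg.norm(visual_feat)
    cos_sim = float(np.dot(speech_feat, visual_feat))   # in [-1, 1]
    score = (cos_sim + 1.0) / 2.0                       # map to [0, 1]
    return score >= threshold, score
```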
As can be seen from the above, in the method for measuring the synchronization between voice and image, after the voice segment and the image segment in the video are obtained, a contour map of the target person is generated according to the image segment, the contour map being independent of the individual features of the target person; the voice features of the voice segment are obtained through the voice neural network, the visual features of the contour map are obtained through the visual neural network, and finally whether the voice segment and the image segment are synchronous is determined according to the voice features and the visual features. That is, before the image segment is input into the visual neural network, it is processed to remove the features related to the individual person, and the image data obtained from this processing is then input into the visual neural network. Therefore, the visual features acquired through the visual neural network do not carry the characteristics of the speaker, and the accuracy of measuring the synchronism of voice and image can be improved.
Further, as a refinement and an extension of the method shown in fig. 3, the embodiment of the present application further provides a method for measuring the synchronicity of the voice and the image. Fig. 4 is a schematic flowchart of a second method for measuring synchronization between speech and image in the embodiment of the present application, and referring to fig. 4, the method may include:
s401: and acquiring a voice segment and an image segment in the video.
Step S401 is implemented in the same manner as step S301, and is not described herein again.
The following describes a process of processing a voice segment and an image segment before input to the neural network, and correspondingly processing the voice segment and the image segment into voice data and image data, respectively, from the aspect of voice and image.
Firstly, processing the voice fragment.
The voice segment carries the speaker's own characteristics, such as tone and intonation. Therefore, before the voice segment is input into the speech neural network to obtain the speech features, the speaker's characteristics in the voice segment are erased, and the voice data with those characteristics erased is then input into the speech neural network; this improves the accuracy of the synchronicity comparison between voice and image.
S402: the sampling frequency of the speech segment is converted to a specific frequency.
After the voice segment is separated from the video and converted to a single channel, its sampling frequency may still differ from case to case because the terminals used to capture the video are configured differently. To process the voice segments accurately in subsequent steps, their sampling frequencies therefore need to be unified first.
In practical applications, the sampling frequency of the voice segment can be unified to 16 kHz. Of course, the sampling frequency of the speech segment can also be unified into other values, such as: 8kHz, 20kHz, etc. The specific value can be set according to practical situations, and is not limited herein.
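A sketch of this resampling step, assuming librosa is available; the target rate of 16 kHz follows the text.

```python
import librosa

def load_mono_16k(path, target_sr=16000):
    # librosa resamples to target_sr and downmixes to a single channel.
    audio, sr = librosa.load(path, sr=target_sr, mono=True)
    return audio, sr
```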
S403: and denoising the voice segment.
Here, step S403 may include two aspects.
S4031: and removing background sounds in the voice segments.
Specifically, the speech segment may be denoised by using a spectral subtraction method in the short-time spectral estimation to suppress background sounds in the speech segment and highlight speech in the speech segment. Of course, other ways to remove background sounds in a speech segment may be used, such as: adaptive filtering techniques. The specific way to remove the background sound in the speech segment is not limited herein.
S4032: and separating the voices of different speakers in the voice segments to obtain at least one voice sub-segment.
Sometimes, not only one person speaks in the speech segment, but also multiple persons speak simultaneously, so the voices of different speakers in the speech segment need to be separated to obtain the speech sub-segments of each speaker respectively.
After obtaining the voice sub-segments of multiple speakers, sometimes it is only necessary to determine whether the voice of a certain speaker is synchronized with the image, and sometimes it is necessary to determine whether the voice of multiple speakers is synchronized with the image. At this time, the voice sub-segment of a certain speaker or the voice sub-segments of several speakers can be selected as the voice segment after de-noising according to the actual judgment situation.
S404: and dividing the voice segment into a plurality of voice frames in a sliding weighting mode.
Where there is overlap between adjacent speech frames.
In particular, a window function may be used to slide over the speech segment and weight it into multiple speech frames. The window function may be a Hamming window or another type of window function. The resulting speech frames may each be 25 ms long, or may have other lengths; each such segment is called a speech frame. Typically a 10 ms overlap is maintained between adjacent speech frames, because a single frame is short and one sound may not be uttered completely within it; overlapping adjacent frames to a certain degree allows the meaning to be captured more fully and improves the accuracy of measuring the synchronism of voice and image.
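A sketch of this framing step with the frame length, overlap and Hamming window described above; the 16 kHz rate is the assumed unified sampling frequency.

```python
import numpy as np

def frame_speech(audio, sr=16000, frame_ms=25, overlap_ms=10):
    frame_len = int(sr * frame_ms / 1000)              # 400 samples at 16 kHz
    hop = int(sr * (frame_ms - overlap_ms) / 1000)     # 15 ms hop -> 10 ms overlap
    window = np.hamming(frame_len)
    frames = [audio[s:s + frame_len] * window
              for s in range(0, len(audio) - frame_len + 1, hop)]
    return np.stack(frames) if frames else np.empty((0, frame_len))
```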
It should be noted here that the execution order of steps S402, S403, and S404 may not be in the order of the size of the sequence number, and may be executed in any order. The execution sequence of steps S402, S403, and S404 is not particularly limited.
S405: each speech frame is converted to a specific signal.
Wherein the specific signal is independent of individual characteristics of the speaker in the speech segment.
In the prior art, before inputting the speech segment into the speech neural network, the speech segment needs to be converted into Mel-scale Frequency Cepstral Coefficients (MFCC) signals, and then the MFCC signals are input into the speech neural network to obtain corresponding speech features. However, the MFCC signal cannot well erase the identity information of the speaker in the speech segment, and the obtained speech features also include the identity information of the speaker, thereby reducing the accuracy of measuring the synchronization between the speech and the image.
In view of this, the speech segment may be converted into a specific signal before being input into the speech neural network. The specific signal is irrelevant to the characteristics of the speaker in the voice segment, namely, the characteristics of the speaker in the voice segment can be better erased. Therefore, the specific signal is input into the speech neural network, the obtained speech characteristics do not contain the characteristics of the speaker, and the accuracy of measuring the synchronism of the speech and the image is improved.
In practical applications, the specific signal may be a Phonetic PosteriorGram (PPG) signal. The PPG signal erases the speaker-identity-related information in the speech segment well. Moreover, the PPG signal can further suppress background sounds in the voice segment and reduce the variance of the speech neural network's input, further improving the accuracy of the voice and image synchronicity measurement.
Of course, the speech segment can also be converted into other types of signals, such as features extracted by the deep speech model, as long as the identity information of the speaker can be erased. The specific type of signal is not limited herein.
In practical applications, in order to convert the speech segment into a PPG signal, the speech segment may be input into a Speaker-Independent Automatic Speech Recognition (SI-ASR) system and processed by it to generate the PPG signal. In the SI-ASR system, the dimension P of the PPG signal equals the number of phonemes supported by the SI-ASR system, which depends on the supported languages. Here an SI-ASR system supporting Chinese and English is adopted, with a total of P = 400 phonemes. The PPG signal obtained for one speech frame is thus a 1 × 400 feature vector, and the PPG signals obtained for T consecutive speech frames form a T × 400 feature matrix. Other SI-ASR systems may be used, with corresponding adjustments based on the number of phonemes they support.
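A sketch of assembling the T × 400 PPG matrix; `si_asr_posteriors` is a placeholder for whichever SI-ASR front-end is used and is assumed to return a 400-dimensional phoneme posterior vector per speech frame.

```python
import numpy as np

def frames_to_ppg(speech_frames, si_asr_posteriors, n_phonemes=400):
    rows = [si_asr_posteriors(frame) for frame in speech_frames]  # each (400,)
    ppg = np.stack(rows)                                          # shape (T, 400)
    assert ppg.shape[1] == n_phonemes
    return ppg
```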
Of course, the voice segment can also be converted into a signal that erases the speaker's identity information by other means, such as the deep learning model DeepSpeech. This model converts the voice signal into the corresponding words, so the features extracted by DeepSpeech include only the content of the utterance itself and no personal features such as the speaker's timbre. Thus, after extraction, semantically irrelevant content such as the speaker's identity information and the background is erased.
Fig. 5 is a schematic flowchart of processing a speech segment in the embodiment of the present application, and referring to fig. 5, first, a speech is input into a preprocessing module. In the preprocessing module, the voice can be processed by unified sampling frequency, denoising, segmentation and the like. The speech data resulting from processing the speech segments is then input into the SI-ASR system. In the SI-ASR system, speech segments can be converted into PPG signals.
And secondly, processing the image fragment.
The image segment likewise carries the speaker's own characteristics, such as lip thickness and mouth size. Therefore, before the image segment is input into the visual neural network to obtain the visual features, the speaker's characteristics in the image segment are first erased, and the image data with those characteristics erased is then input into the visual neural network, which improves the accuracy of the synchronicity comparison between voice and image.
Next, the following describes the generation of a contour map of a target person from an image segment, taking the extraction of the lower half-face feature from the image segment as an example. The contour map extracted here is independent of the individual features of the target person.
S406: and carrying out face detection on the image segments to obtain a face detection frame.
Generally, face detection is performed on each frame of image to obtain a face detection frame.
S407: and horizontally aligning the face in the face detection frame.
Specifically, the positions of facial key points within the face detection frame can be found in the original image by using a dense face alignment algorithm; these include, but are not limited to, the left eye center, the right eye center, the left mouth corner and the right mouth corner. "Left" and "right" here refer to the physiological left and right of the face (assuming a frontal face), not left and right in the image. Using the positions of these facial key points, the face image is then processed into a normalized form according to rule-based calculations. The rules may be as follows (a code sketch of this computation is given after the list):
calculate the midpoint of the left eye center and right eye center key points, denoted P_eyecenter;
calculate the midpoint of the left mouth corner and right mouth corner key points, denoted P_mouthcenter;
calculate the vector from the left eye center key point to the right eye center key point, denoted V_eyetoeye;
calculate the vector from P_eyecenter to P_mouthcenter and rotate it 90 degrees counterclockwise so that it forms an acute angle with V_eyetoeye, denoting the result V_eyetomouth;
calculate the vector difference between V_eyetoeye and V_eyetomouth and normalize its length to obtain the unit vector X_unit;
scale X_unit by the larger of 2 times the length of V_eyetoeye and 1.8 times the length of V_eyetomouth to obtain the vector X, and rotate X 90 degrees counterclockwise to obtain the vector Y;
taking the point obtained by moving P_eyecenter by 0.1 times V_eyetomouth as the center C, a rectangle can be obtained in the image, with the upper left corner at C + X + Y and the lower right corner at C - X - Y;
extract the image inside this rectangle using an interpolation algorithm and scale it to a preset size, such as 256 × 256 pixels, to obtain the aligned face.
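The rule-based alignment above can be sketched as follows; the key points are assumed to be 2-D numpy arrays in x-right/y-down image coordinates, and both the direction of the 90-degree rotation and the use of the unrotated eye-to-mouth vector for the center shift are assumptions where the text is ambiguous.

```python
import numpy as np

def rot90_ccw(v):
    # 90-degree counter-clockwise rotation assuming x-right / y-down image coordinates.
    return np.array([v[1], -v[0]], dtype=float)

def alignment_rect(left_eye, right_eye, left_mouth, right_mouth):
    p_eyecenter = (left_eye + right_eye) / 2.0
    p_mouthcenter = (left_mouth + right_mouth) / 2.0
    v_eyetoeye = right_eye - left_eye
    v_eyetomouth = rot90_ccw(p_mouthcenter - p_eyecenter)   # rotated as in the rules

    x_unit = v_eyetoeye - v_eyetomouth
    x_unit = x_unit / np.linalg.norm(x_unit)

    scale = max(2.0 * np.linalg.norm(v_eyetoeye),
                1.8 * np.linalg.norm(v_eyetomouth))
    x = x_unit * scale
    y = rot90_ccw(x)

    # Assumed: the center is shifted along the unrotated eye-to-mouth direction.
    c = p_eyecenter + 0.1 * (p_mouthcenter - p_eyecenter)
    return c + x + y, c - x - y     # the two opposite corners of the crop rectangle
```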
The dense face alignment algorithm used to find the facial key points here may be the 3D Dense Face Alignment (3DDFA) algorithm. Of course, other alignment algorithms can be used to obtain the facial key points, after which the above rules are applied to realize the face alignment. The specific algorithm is not limited herein.
Compared with the more common approach of aligning the face by computing an affine transformation between the facial key points and a preset frontal-face key point template, this method can handle both large-angle profile faces and frontal faces.
S408: and extracting the expression coefficient of the target person from the face.
Specifically, the expression coefficients of the target person in the face detection frame can be extracted through a parameter estimation algorithm for a three-dimensional deformable parameterized face model (3D Morphable Model, 3DMM), and the expression coefficients conform to the standard of the three-dimensional deformable parameterized face model.
After the contents of the face detection frame are used as input and processed with the 3DMM parameter estimation algorithm, the identity coefficients and expression coefficients of the target person conforming to the 3DMM standard can be obtained. The expression coefficients are denoted α_exp.
The 3DMM parameter estimation algorithm is an algorithm capable of estimating 3DMM parameters; it is used to estimate the identity coefficients and expression coefficients of the face, which conform to the standard defined by the 3DMM.
Specifically, the 3DMM parameter estimation algorithm adopted in this application is implemented with a deep neural network model. The aligned face image from the face detection frame and the identity coefficients already associated with the target person are input into the pre-trained model, which extracts the expression coefficients and identity coefficients of the target person in the aligned face image; the identity coefficients associated with the target person are then updated from the output identity coefficients for use in estimating subsequent image frames. Here, the identity coefficients associated with the target person are a moving weighted average of the identity coefficients estimated in temporally adjacent image frames.
Compared with directly computing the expression coefficients of the target person from the aligned face image alone, feeding the identity coefficients computed from temporally adjacent frames back into the deep neural network model lets the model fit the changing shape of the face by adjusting the expression coefficients rather than the identity coefficients, and therefore yields more accurate expression coefficients.
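A sketch of the moving weighted average of the identity coefficients, implemented here as an exponential moving average; the decay factor is an assumed value, not specified in the text.

```python
class IdentityCoefficientTracker:
    """Keeps a running identity coefficient that is fed back into the 3DMM
    parameter estimator for the next image frame."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.identity = None   # e.g. a numpy vector of identity coefficients

    def update(self, estimated_identity):
        if self.identity is None:
            self.identity = estimated_identity
        else:
            self.identity = (self.decay * self.identity
                             + (1.0 - self.decay) * estimated_identity)
        return self.identity
```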
Similarly, other 3DMM parameter estimation algorithms that can stabilize the identity coefficients, such as the Face2Face algorithm (Thies, Justus, et al., "Face2Face: Real-time Face Capture and Reenactment of RGB Videos," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016), can be used to obtain the expression coefficients of each frame.
The expression coefficients α_exp capture features that are independent of the individual speaker, such as the position of the mouth and its degree of opening and closing, whereas features related to the individual speaker are captured in the identity coefficients. Therefore, by feeding the generic parameterized face model only the expression coefficients α_exp together with a standard identity coefficient (the standard identity coefficient replaces the identity coefficient of the target person and removes the target person's individual characteristics), a face contour map of the target person can be generated in which the individual characteristics of the speaker are eliminated, improving the accuracy of measuring the synchronism between mouth movement and voice.
S409: and extracting the expression coefficient of the lower half face corresponding to the lower half face in the expression coefficients.
Under the 3DMM definition, every expression coefficient affects the whole face, and some of them have only a negligible influence on the mouth. Therefore, the expression coefficients that are highly correlated with the motion of the lower half of the face are extracted as the lower-half-face expression coefficients.
When the synchronicity between a particular part of the target person in the image and the voice needs to be measured, the coefficients of that part which are independent of individual characteristics are extracted. Here, since the motion of the lower half of the face is to be measured against the voice, the lower-half-face expression coefficients are extracted from the expression coefficients and denoted α_halfface, and a lower-half-face contour map is then generated from them for the synchronicity measurement with the voice.
S410: and inputting the expression coefficient of the lower half face into the universal three-dimensional face model to obtain a three-dimensional face model corresponding to the lower half face of the target person.
And the three-dimensional face model corresponding to the lower half face of the target person is the three-dimensional face model combining the expression coefficient of the lower half face of the target person with the standard identity coefficient.
The universal three-dimensional face model is an abstract face model. In the general three-dimensional face model, data of eyebrow, eyes, nose, face, mouth and other parts are obtained on the basis of the average of a plurality of faces, and the general three-dimensional face model has universality.
And inputting the expression coefficients of the lower half face into the universal three-dimensional face model to obtain the three-dimensional face model corresponding to the lower half face of the target person.
Specifically, in the universal three-dimensional face model, the predefined complete expression orthogonal basis B_exp is replaced by the basis B_halfface associated with the motion of the lower half of the face, giving the following formula (1):

S = S̄ + B_halfface · α_halfface    (1)

where S is the geometric model of the target person under the lower-half-face expression, S̄ is the corresponding average face geometric model under the predefined neutral expression, B_halfface is the orthogonal basis associated with lower-half-face motion, and α_halfface is the lower-half-face expression coefficient.
Therefore, the influence of irrelevant expressions can be eliminated by the obtained three-dimensional face model corresponding to the lower half facial expression of the target person.
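A minimal sketch of formula (1), combining the average face with the lower-half-face expression basis under the standard identity; the array shapes are illustrative assumptions.

```python
import numpy as np

def lower_face_geometry(S_mean, B_halfface, alpha_halfface):
    # S_mean:         (3N,)   average face geometry under the neutral expression
    # B_halfface:     (3N, K) expression basis restricted to lower-face motion
    # alpha_halfface: (K,)    lower-half-face expression coefficients
    return S_mean + B_halfface @ alpha_halfface   # deformed geometry S, shape (3N,)
```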
S411: and acquiring a vertex set of a lower half face in the three-dimensional face model.
The lower half face refers to the area of the face below the line connecting the bottoms of the left and right ears and the tip of the nose. Fig. 6 is a schematic diagram of the range of the lower half face in the embodiment of the present application, and as shown in fig. 6, a position 601 of the bottom of the left ear, a position 602 of the tip of the nose, and a position 603 of the bottom of the right ear are connected to obtain a connection line 604. The line 604 divides the face into an upper half face and a lower half face. And the face below line 604 is the lower half.
When the lower half of the face is selected, the connecting line 604 may have a certain adjustment range, such as moving up to the eye position, or moving down to the nose position. Namely, the selection of the lower half face can be adjusted according to actual needs.
S412: and projecting the vertex set to a two-dimensional plane to obtain a lower half face contour map of the target person, and taking the lower half face contour map as a face contour map of the target person.
Specifically, the vertices corresponding to the mouth contour and the chin region on the obtained geometric model S are collected into a vertex set V. The vertex set V is projected onto a two-dimensional plane by scaled orthographic projection to obtain a contour map I of the lower half of the face, as shown in the following formula (2):

I = f · P · S(V)    (2)

where I is the two-dimensional contour map of the lower half of the target person's face, f is a scale coefficient, P is the orthographic projection matrix, and S(V) is the vertex set of the lower half face on the three-dimensional face model. Here, the contour map I may be a 128 × 256 rectangle with the contours of the mouth and lower half face centered. In particular, to enhance the visibility of the contour map, each vertex is rendered as a two-dimensional Gaussian spot of radius r pixels centered at the vertex's projected position. The radius r is positively correlated with the size of I; for a 128 × 256 I, r is 2 pixels.
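A sketch of formula (2) together with the Gaussian-spot rendering described above; the output size and radius follow the text, while the assumption that the projected vertices are already centered in the image is an illustration choice.

```python
import numpy as np

def render_lower_face_contour(vertices3d, f, height=128, width=256, r=2):
    # Scaled orthographic projection I = f * P * S(V): keep x/y, drop depth.
    P = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
    pts2d = f * (vertices3d @ P.T)            # (N, 2), assumed already image-centred

    image = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for px, py in pts2d:
        # Splat a Gaussian spot of radius ~r pixels at the projected position.
        image += np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * r ** 2))
    return np.clip(image, 0.0, 1.0)
```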
In processing the image segment, the model is neither rotated nor translated: the expression coefficients of the target person are obtained directly through the 3DMM, combined with the standard identity coefficient to obtain the universal three-dimensional face model, and a lower-half-face contour map with the target person's individual features removed is generated. The resulting contour map is a frontal-view contour map, free of the influence of head pose and occluding objects in the original image.
Fig. 7 is a schematic flow chart of processing image segments in the embodiment of the present application, and as shown in fig. 7, first, dense face alignment is performed on an image to obtain an aligned image; then, extracting facial 3D model expression coefficients of the aligned images; secondly, generating a 3D model according to the extracted expression coefficients by adopting a front visual angle and a standard face shape; and finally, projecting the corresponding vertex of the 3D model to obtain the two-dimensional contour of the lower half face.
After the voice segment is processed into a PPG signal and the image segment is processed into a two-dimensional contour map of the lower half part of the front face of the human face, the PPG signal can be input into a voice neural network and the two-dimensional contour map is input into a visual neural network, voice features and visual features are respectively obtained, then the voice features and the visual features are compared, and whether the voice segment and the image segment have synchronism is determined.
S413: and obtaining the voice characteristics of the specific signal through a voice neural network.
S414: and obtaining the visual characteristics of the face contour map through a visual neural network.
S415: and determining whether the voice segment and the image segment have synchronism according to the voice characteristic and the visual characteristic.
Steps S413, S414, and S415 are the same as the foregoing steps S303, S304, and S305, and are not described herein again.
Fig. 8 is a schematic diagram of an architecture for measuring synchronization between voice and image in an embodiment of the present application, and referring to fig. 8, after a voice segment and an image segment are extracted from a video respectively, on one hand, the voice segment is input into a voice neural network to obtain a voice feature. And on the other hand, inputting the image segment into the visual neural network to obtain the visual characteristics. Finally, the voice features and the visual features are input into a synchronicity measuring module, and the synchronicity measuring module determines whether the corresponding voice segments and the image segments have synchronicity or not according to the voice features and the visual features. The synchronicity measurement module is a module for determining whether the corresponding voice segment and the image segment have synchronicity through comparison of the voice feature and the visual feature. The specific form of the synchronicity measurement module is not limited herein.
In practical applications, in order to obtain the speech features of the speech segments, the speech segments may be input into a speech neural network for processing to obtain the speech features. And in order to obtain the visual characteristics of the image segment, the image segment can be input into a visual neural network to be processed so as to obtain the visual characteristics. The following describes the construction of the neural network, the sampling of training data, and the training, respectively.
First, neural network construction
1. Speech neural network construction
Before being input into the speech neural network, the speech segment has been converted into a specific signal, namely a PPG signal of dimension T × P, in which each dimension has a definite physical meaning: P is the number of phonemes, T is the number of sampling points in time, and each speech frame corresponds to one phoneme posterior probability distribution of length P. Based on these explicit physical meanings, the speech neural network can be constructed as follows.
Fig. 9 is a schematic structural diagram of a speech neural network according to an embodiment of the present application. Referring to fig. 9, the speech neural network at least includes: several convolutional layers (Conv1D(3 × 1, stride = (2, 1)), LeakyReLU(0.02)), a recombination (reshape) layer, three fully connected layers (Fully Connected Layer, LeakyReLU(0.02)), and a linear projection layer (Linear Projection Layer).
Considering that adjacent speech segments overlap, the time dimension is first processed with several 1-dimensional convolutional layers (convolution kernel size 3 × 1, convolution stride (2, 1), valid padding). The resulting matrix is then recombined into a feature vector, which is processed by 3 fully connected layers, and finally a 512-dimensional voice feature vector is obtained through 1 linear projection layer. The number of convolutional layers is related to the duration of the input specific signal (the feature matrix corresponding to the PPG signal), and the dimension of the final voice feature vector is consistent with the dimension of the visual feature vector output later. In the embodiment of the present application, the voice feature vector is the voice feature and the visual feature vector is the visual feature.
Specifically, when P is 400 and the input duration is 200 ms, T is 13 and the PPG feature matrix is 13 × 400 dimensional. Correspondingly, a 3 × 400 feature matrix can be obtained with 2 1-dimensional convolutional layers. After recombination into a 1 × 1200 feature vector, the final 512-dimensional voice feature vector is obtained through 3 fully connected layers and 1 linear layer.
Fig. 10 is a schematic flowchart of generating a speech feature in an embodiment of the present application, and referring to fig. 10, the process may include:
s1001: and processing the specific signal on a time dimension by adopting a plurality of 1-dimensional convolution layers to obtain a characteristic matrix.
Wherein the number of 1-dimensional convolutional layers is related to the time duration corresponding to the specific signal.
S1002: and recombining the feature matrix into a feature vector.
S1003: and processing the feature vectors by adopting 3 full-link layers and 1 linear projection layer to obtain 512-dimensional voice feature vectors.
Of course, the dimension of the resulting speech feature is not limited to 512. The dimensionality of the speech feature is related to the amount of speech data input into the model and the type of loss function employed by the speech neural network.
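The following is a minimal PyTorch sketch of a speech branch consistent with the description above (1-dimensional convolutions over time, a reshape, three fully connected layers and a linear projection to 512 dimensions). The hidden-layer widths, the use of LazyLinear and the exact padding behaviour are assumptions for illustration and do not reproduce the embodiment exactly.

```python
import torch
import torch.nn as nn

class SpeechNet(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Convolutions over the time axis only (kernel 3x1, stride (2,1), valid padding),
        # applied to the (1, T, P) PPG matrix.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=(3, 1), stride=(2, 1)), nn.LeakyReLU(0.02),
            nn.Conv2d(1, 1, kernel_size=(3, 1), stride=(2, 1)), nn.LeakyReLU(0.02),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                          # recombine the reduced T' x P matrix into one vector
            nn.LazyLinear(1024), nn.LeakyReLU(0.02),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.02),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.02),
            nn.Linear(1024, feat_dim),             # linear projection to the 512-d voice feature
        )

    def forward(self, ppg):                        # ppg: (batch, T, P) phoneme posteriors
        x = self.conv(ppg.unsqueeze(1))            # -> (batch, 1, T', P)
        return self.fc(x)

speech_net = SpeechNet()
print(speech_net(torch.rand(8, 13, 400)).shape)    # 200 ms example: T = 13, P = 400 -> (8, 512)
```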
2. Visual neural network construction
Because factors that interfere with the motion information of the lower half face in the image segment (such as illumination, expression and pose) have already been largely removed before the image segment is input into the visual neural network, the visual neural network can adopt a network structure with a light computational load.
Specifically, the visual neural network may use the backbone network of ResNet18, modified as follows:
(1) If the input image segment contains several images, these images can be stacked along the channel dimension in order of increasing time and then used as the input of the visual neural network. Accordingly, the parameter dimensions of the layer-1 convolution of the visual neural network need to be adjusted.
(2) Since the image segment is processed into a contour map of the lower half face with a resolution of 128 × 256, the aspect ratio is 1:2, which differs from the default 1:1 input aspect ratio of ResNet18. For this reason, a larger convolution kernel, for example 7 × 7, is used in the layer-1 convolution of ResNet18, and the convolution stride is set to (1, 2).
The above convolution kernel size and stride are only specific examples and are not intended to limit the embodiments of the present application to 7 × 7 and (1, 2). In practical applications, the convolution kernel size and stride of the convolutional layer are related to the size of the contour map: the stride can be set according to the aspect ratio of the contour map, and the convolution kernel is set slightly larger. In this way the contour map can be processed in a single pass by one convolutional layer with a large kernel; alternatively, several convolutional layers with small kernels can be used to process it in several passes.
(3) A 1-layer fully connected layer is added at the end of the ResNet18 backbone network, so that a 512-dimensional visual feature vector is obtained. Of course, the dimension of the resulting visual feature vector is not limited to 512; it is related to the amount of visual data input into the model and the type of loss function employed by the visual neural network.
Of course, besides the backbone network of ResNet18, the visual neural network can also be obtained by modifying other deep neural networks, such as MobileNetV2.
Fig. 11 is a schematic flowchart of generating visual features in an embodiment of the present application, and referring to fig. 11, the process may include:
s1101: and processing the contour map by using the convolution layer to obtain a characteristic matrix.
Wherein the convolution kernel size and step size of the convolution layer are related to the size of the contour map.
S1102: and processing the characteristic matrix by adopting a main network of the visual neural network to obtain a characteristic vector.
The backbone network here refers to the main architecture of a neural network. To construct the visual neural network of the embodiment of the present application, an existing visual neural network is taken, its architecture (i.e., the backbone network) is adopted, and the parameters of some layers are adapted accordingly, which yields the visual neural network of the embodiment of the present application.
S1103: and processing the feature vector by adopting a full connection layer to obtain a 512-dimensional visual feature vector.
Of course, the dimension of the resulting visual feature is not limited to 512. The dimensionality of the visual feature is related to the amount of contour-map data input into the model and the type of loss function employed by the visual neural network.
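The following is a minimal sketch of a visual branch built from the torchvision ResNet-18 backbone with the modifications described above: the first convolution accepts T stacked 128 × 256 contour frames with stride (1, 2), and the classifier is replaced by a layer producing the 512-dimensional visual feature. It assumes a recent torchvision version and is an illustration rather than the exact embodiment.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def build_visual_net(num_frames=5, feat_dim=512):
    net = resnet18(weights=None)
    # (1) contour frames stacked along the channel dimension; (2) 7x7 kernel, stride (1, 2)
    net.conv1 = nn.Conv2d(num_frames, 64, kernel_size=7, stride=(1, 2), padding=3, bias=False)
    # (3) replace the classifier with a fully connected layer giving the 512-d visual feature
    net.fc = nn.Linear(net.fc.in_features, feat_dim)
    return net

visual_net = build_visual_net()
print(visual_net(torch.rand(8, 5, 128, 256)).shape)   # 5 stacked contour maps -> (8, 512)
```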
Second, training data sampling
For the training videos, portrait videos of a single person speaking are used, in which the interference from background sound is below a certain level; that is, single-person speaking videos with relatively clean background sound are required. In addition, the training video set may be large so that the subsequent training is sufficiently thorough. In practice, 25 Hz high-definition video may be used, which improves the accuracy of the visual feature extraction training.
After the training videos are collected, the audio signal of each video segment is resampled to 16 kHz, the video signal is divided into frames, and the timelines are recorded, yielding voice segments and image segments. Each voice segment is then processed into a specific signal using the processing of steps S402-S405 above (in the sampling below this is simply called voice), and each image segment is processed into a face contour map using the processing of steps S406-S412 above (simply called vision).
Training data can then be formally sampled, which mainly includes positive sample sampling and negative sample sampling. A positive sample means that the input voice and vision are synchronous; a negative sample means that they are not. By training with both positive and negative samples, the accuracy of measuring the synchronicity of voice and image can be improved.
1. Positive sample sampling
For a positive sample, the voice and vision used in training must come from the same training video and be synchronized in time.
Moreover, if the voice is too short, it may not contain a complete pronunciation, and even the semantics of the voice may be affected. The specific frame length of the voice can therefore be determined based on the frame rate of the training video.
For example, for a training video with a frame rate of 25 Hz, one image frame at time T and the speech segment (T-20 ms, T+20 ms) may be selected and processed to form a positive sample pair. In this case the visual length is 1 frame and the voice length is 40 ms; that is, the frame length of the voice is made larger than that of the vision, and a voice length of 40 ms matches the 25 Hz frame rate of the training video. If training videos with other frame rates are used, the voice length can be adjusted accordingly.
One training video is selected from the training video set and referred to as the first training video for short; another training video is selected from the set and referred to as the second training video for short. The first training video and the second training video are different training videos.
In the embodiment of the present application, after the first image segment and the first voice segment acquired from the first training video are processed into the first image data and the first voice data, a positive sample is formed.
2. Negative sample sampling
A negative sample means that the voice and vision used in training are not synchronous. The lack of synchronization can arise in several situations, and in order to train more fully, samples can be collected for all of the unsynchronized situations.
Currently, when negative sample pairs are collected, the image segment and the voice segment are taken from different videos, or from different times in the same video. However, positive samples may still appear among negative pairs collected in this way. For example, if a voice segment in video A is the same as a voice segment in video B, the voice segment in video A is also synchronous with the image segment corresponding to that voice segment in video B, so combining the two as a negative sample actually yields a positive sample. As another example, if the voice corresponding to an image segment in video A is silence and the voice corresponding to another image segment in video B is also silence, combining the image segment in video A with the voice segment corresponding to the image segment in video B again actually yields a positive sample. Such unreasonable negative samples in the negative pairs reduce the accuracy of neural network training and hence the accuracy of the subsequent synchronicity measurement.
In view of this, in the embodiment of the present application, unreasonable negative samples need to be removed when negative samples are collected; that is, the training database needs to be cleaned to remove negative samples that are unsuitable for training. This improves the quality of the negative samples, which in turn improves the accuracy of neural network training and of the voice-image synchronicity measurement.
Specifically, the negative sample sampling may be performed in the following three ways.
(1) Misaligned negative sample
A misaligned negative sample means that although the voice and the vision come from the same training video, they are not synchronized in time; that is, there is a small misalignment between them.
For example, one image frame at time T and the speech segment (T-t-20 ms, T-t+20 ms) are collected and processed to form a negative sample pair. That is, the image segment is processed into image data and the voice segment into voice data, and a sample pair of <voice data, image data> is constructed, abbreviated as <voice, vision>.
For example, for a misaligned negative sample, the <voice, vision> pair is taken from the same video with the timeline slightly shifted: one image frame at time T and the speech segment (T-t-20 ms, T-t+20 ms) form a negative sample pair, where |t| > 80 ms. That is, the voice and the vision need to be misaligned by at least 80 ms, corresponding to the duration of two image frames, before they are treated as a negative sample pair. It must also be ensured that the speech segment (T-20 ms, T+20 ms) and the speech segment (T-t-20 ms, T-t+20 ms) have different semantics.
Specifically, when constructing a misaligned negative sample, the misalignment between the voice and the vision needs to be greater than or equal to 2 times the visual duration. In this way the voice in the misaligned negative sample is completely staggered from the voice that is synchronous with the vision in that sample, which guarantees the accuracy of the subsequent training.
If training videos with other frame rates are used, the frame length of the voice and the frame length of the vision can be adjusted accordingly.
In addition, to further improve the accuracy of the subsequent training, it must also be ensured that the semantics of the voice in the misaligned negative sample differ from the semantics of the voice that is synchronous with the vision in that sample.
In the embodiment of the application, after the first image segment and the second voice segment acquired from the first training video are processed into the first image data and the second voice data, the first negative sample is formed.
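As an illustration of the misalignment constraint above, the sketch below draws an offset t whose magnitude is at least twice the visual duration (80 ms for a 25 Hz video with a 40 ms speech window). The helper name and the maximum offset are assumptions; the semantic check on the shifted speech is performed separately.

```python
import random

def sample_misaligned_offset_ms(visual_len_ms=40, max_offset_ms=1000):
    """Return an offset t in milliseconds with |t| >= 2 * visual_len_ms."""
    magnitude = random.uniform(2 * visual_len_ms, max_offset_ms)   # at least 80 ms at 25 Hz
    return magnitude if random.random() < 0.5 else -magnitude

# For an image frame at time T, the candidate negative uses the speech window
# (T - t - 20 ms, T - t + 20 ms); it is kept only if that speech also differs
# semantically from the speech that is synchronous with the frame.
t = sample_misaligned_offset_ms()
```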
(2) Fixed-voice negative sample
A fixed-voice negative sample means that the voice is extracted from one training video, and the vision is randomly extracted from some other training video. Moreover, the voice that is synchronous with that randomly extracted vision differs in semantics from the voice extracted from the first training video.
For example, for a fixed-voice negative sample, the <voice, vision> pair is taken from different training videos: the voice segment is fixed, and one image frame is randomly sampled from another training video to form a negative sample pair. The semantics of the voice in the negative pair must differ from the semantics of the voice in the positive pair to which the vision belongs; for instance, if the semantics of the voice in the negative pair is "silence", the semantics of the voice in the positive pair to which the vision in the negative pair belongs cannot be "silence".
In the embodiment of the present application, after a first image segment obtained from a first training video and a random voice segment obtained from a second training video are processed into first image data and third voice data, a second negative sample is formed.
(3) Fixed-vision negative sample
A fixed-vision negative sample means that the vision is extracted from one training video, and the voice is randomly extracted from some other training video. Moreover, the vision that is synchronous with that randomly extracted voice differs from the vision extracted from the first training video in the lower-half-face motion of the person in the image.
For example, for a fixed-vision negative sample, the <voice, vision> pair is taken from different videos: the video frame is fixed, and a speech segment is randomly sampled from another video to form a negative sample pair. It must be ensured that the video frame in the negative pair and the visual image in the positive pair to which the speech segment belongs differ sufficiently in lower-half-face motion.
In the embodiment of the present application, after the second voice segment obtained from the first training video and the random image segment obtained from the second training video are processed into the second voice data and the second image data, a third negative sample is formed.
The first image segment and the random image segment are images of one or more continuous time points.
In addition, in practical applications, a single image frame carries no context information and therefore cannot fully express the motion of a person's lower half face. During sampling, images of T consecutive time points can therefore be collected as the vision, and the speech segment corresponding to those T time points is collected as the voice; the vision and voice obtained in this way are processed into a sample pair and input into the neural network for training. In general, T can be set to 5, and the corresponding speech segment is then 200 ms.
After the three kinds of candidate negative samples are obtained, they must undergo a visual-rule judgment and a voice-rule judgment, and only candidate negative samples that pass both judgments are kept as qualified negative samples. The specific judgment process is as follows:
1) Voice rule judgment:
For a negative sample <voice a, vision v>, it must be judged whether voice a and voice a_positive in the positive sample pair to which vision v belongs differ in semantics.
Specifically, the core idea is to measure the difference between the PPG feature sequences.
Because each speech sample has been processed into a PPG feature sequence, and each PPG feature is the posterior probability distribution of the phonemes contained in the corresponding speech frame, taking the maximum of the posterior probability distribution yields the phoneme corresponding to that speech frame, so the PPG feature sequence can be converted into the phoneme sequence P = [p_0, …, p_i, …, p_t].
After the phoneme sequence of voice a in the negative sample and the phoneme sequence of voice a_positive in the corresponding positive sample are obtained, the edit distance between the two phoneme sequences is calculated. Specifically, the Levenshtein distance can be used to compute the edit distance D = L(P1, P2) between the phoneme sequence P1 of the voice in the negative sample and the phoneme sequence P2 of the voice in the corresponding positive sample, i.e., the number of delete, insert and replace operations needed to change P1 into P2; the more similar the two sequences, the fewer steps are required. When D is lower than a preset threshold, the two voice samples are judged to be too similar; when D is higher than the threshold, they are judged to be sufficiently different. The preset threshold may be obtained from database statistics.
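The following is a minimal sketch of this voice rule check, assuming the PPG features are given as a (T, P) posterior matrix: each frame is mapped to its arg-max phoneme and the two phoneme sequences are compared with a dynamic-programming Levenshtein distance. The threshold value is an assumed placeholder to be set from database statistics.

```python
import numpy as np

def ppg_to_phonemes(ppg):
    """ppg: (T, P) posterior matrix -> list of per-frame phoneme indices."""
    return np.argmax(ppg, axis=1).tolist()

def levenshtein(p1, p2):
    """Edit distance between two phoneme sequences (delete / insert / replace)."""
    d = np.zeros((len(p1) + 1, len(p2) + 1), dtype=np.int32)
    d[:, 0] = np.arange(len(p1) + 1)
    d[0, :] = np.arange(len(p2) + 1)
    for i in range(1, len(p1) + 1):
        for j in range(1, len(p2) + 1):
            cost = 0 if p1[i - 1] == p2[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # delete
                          d[i, j - 1] + 1,         # insert
                          d[i - 1, j - 1] + cost)  # replace
    return int(d[len(p1), len(p2)])

def speech_rule_ok(ppg_neg, ppg_pos, threshold=4):
    """Keep the candidate negative only if the two speech samples differ enough."""
    return levenshtein(ppg_to_phonemes(ppg_neg), ppg_to_phonemes(ppg_pos)) >= threshold
```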
2) Visual rule judgment:
For a negative sample <voice a, vision v>, it must be judged whether vision v and vision v_positive in the positive sample pair to which voice a belongs differ sufficiently in lower-half-face motion.
Specifically, the core idea is to determine how similar the vision in the negative sample is to the vision in the corresponding positive sample.
Since the two visual samples have been preprocessed into lower-half-face contour maps using the same standard identity information and the same projection coordinate system, they are already aligned. The two contour maps can therefore be converted from 0-255 gray-scale maps into 0/1 binarized contour maps by thresholding, denoted M_v1 and M_v2.
Then the absolute difference of the two binarized contour maps, D_1 = Σ|M_v1 - M_v2|, and their structural similarity (SSIM), D_2 = SSIM(M_v1, M_v2), are calculated, and the weighted sum D = λ_1·D_1 + λ_2·D_2 is obtained. When D is lower than a preset threshold, the two visual samples are judged to be too similar; when D is higher than the threshold, they are judged to be sufficiently different. The weights λ_1, λ_2 and the preset threshold may be obtained from database statistics.
When each visual sample includes images of T consecutive time points, the T frames are preprocessed and the differences between corresponding frames of the two visual samples are judged one by one according to the visual rule. The final judgment is then made according to the proportion of differing frames among all frames of the visual sample: if this proportion is higher than a preset threshold, the two visual samples are judged to be sufficiently different.
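The following is a minimal sketch of this visual rule check for a single pair of aligned contour maps. The binarization threshold, the weights λ_1, λ_2 and the decision threshold are assumed placeholders to be fitted on the database; SSIM is taken from scikit-image.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def visual_rule_ok(contour1, contour2, bin_thresh=128,
                   lambda1=1.0, lambda2=1.0, decision_thresh=500.0):
    m1 = (contour1 >= bin_thresh).astype(np.float32)   # 0/1 binarized contour maps M_v1, M_v2
    m2 = (contour2 >= bin_thresh).astype(np.float32)
    d1 = np.abs(m1 - m2).sum()                         # D_1: absolute difference
    d2 = ssim(m1, m2, data_range=1.0)                  # D_2: structural similarity
    d = lambda1 * d1 + lambda2 * d2                    # weighted sum D
    return d >= decision_thresh                        # high D: sufficiently different, keep the negative
```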
The dual judgment of the visual rule and the voice rule above is important because many different words in speech have very similar mouth movements; for example, two different words may both be pronounced with pursed lips. Therefore, only negative samples that pass the double judgment are reasonable negative samples and can subsequently be used to train the neural network. This improves the accuracy of neural network training and thus the accuracy of measuring the synchronicity of voice and image.
After the three kinds of candidate negative samples are screened in this way, the first negative sample, the second negative sample and the third negative sample are obtained, and these are then used for neural network training.
Three, neural network training
Based on the architecture shown in fig. 8, the positive samples, the first negative samples, the second negative samples and the third negative samples obtained above are input into the voice-image synchronicity measurement model for training, so that the parameters of the model can be adjusted and the synchronicity of voice and image can be measured more accurately.
Here, the voice-image synchronicity measurement model consists of the voice neural network, the visual neural network and the synchronicity measurement module.
Fig. 12 is a schematic flowchart of the process of training the neural networks in the embodiment of the present application. As shown in fig. 12, the process may include an early training stage and a late training stage, as follows:
1. Early training stage
S1201: and dividing the positive sample, the first negative sample, the second negative sample and the third negative sample into different batches of input voice and image synchronism measurement models for training, and adjusting parameters in the voice and image synchronism measurement models. And through balanced sampling, the number of positive samples and the number of negative samples in each batch are similar, and the model training is facilitated.
Specifically, the parameters in the speech and image synchronicity measurement model can be adjusted by a loss function, which is specifically shown in the following formula (3):
where L represents the loss value, N represents the batch size, n represents the sample index, y_n is the label of the sample (y_n = 1 denotes a positive sample and y_n = 0 denotes a negative sample), d_p represents the distance between the visual feature and the voice feature of a positive sample pair, d_n represents the distance between the visual feature and the voice feature of a negative sample pair, v denotes the visual feature extracted by the visual neural network, a denotes the voice feature extracted by the voice neural network, and margin_1 is a specific value. The margin_1 here may differ from the margin_2 used in the late training stage.
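The exact form of formula (3) is not reproduced in this text, so the sketch below shows one standard contrastive loss that is merely consistent with the definitions above (positive pairs pulled together, negative pairs pushed beyond margin_1); it is an assumption for illustration, not the verbatim loss of the embodiment.

```python
import torch

def contrastive_loss(v, a, y, margin1=1.0):
    """v, a: (N, 512) visual / voice features of a batch; y: (N,) labels, 1 positive, 0 negative."""
    d = torch.norm(v - a, dim=1)                                   # distance between paired features
    pos_term = y * d.pow(2)                                        # pull positive pairs together
    neg_term = (1 - y) * torch.clamp(margin1 - d, min=0).pow(2)    # push negatives beyond margin_1
    return (pos_term + neg_term).mean()
```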
A specific way of adjusting the parameters of the model based on the loss value is, for example, to train the model with the Adam optimization algorithm using the parameters beta_1 = 0.99 and beta_2 = 0.999. In the early training stage, the batch size is set to 256, 1000 epochs are trained, the learning rate is initially set to 0.005, and after 100 epochs the learning rate is gradually decayed to 0 with a cosine decay strategy. Similarly, in the late training stage, 500 epochs are trained, the learning rate is initially set to 0.001, and after 100 epochs the learning rate is gradually decayed to 0 with a cosine decay strategy. The specific training parameters and model parameters need to be adjusted as the database changes. Of course, other implementations may be used, which are not limited here.
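A minimal sketch of this early-stage optimization setup is shown below (Adam with beta_1 = 0.99 and beta_2 = 0.999, initial learning rate 0.005, cosine decay to 0 after the first 100 of 1000 epochs). The placeholder model and the constant learning rate during the first 100 epochs are assumptions made for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(400, 512)   # placeholder standing in for the voice and visual networks
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, betas=(0.99, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=900, eta_min=0.0)

for epoch in range(1000):
    # ... iterate over batches, compute the loss of formula (3), backpropagate ...
    optimizer.step()          # placeholder step so the sketch runs end to end
    if epoch >= 100:          # keep the learning rate constant for 100 epochs, then cosine decay
        scheduler.step()
```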
2. Late training stage
To further optimize the voice-image synchronicity measurement model, after the early training stage an online hard sample mining strategy can be adopted in each training batch, and the model is trained again with the hard samples mined online until the trained model stays within a certain precision interval and no longer fluctuates significantly.
Specifically, unlike the early training stage, the late training stage divides only the positive samples into different batches (e.g., M batches) and obtains negative samples online, called in-batch negative samples, by combining different positive samples within a batch. The positive and negative samples in each batch are sorted by the loss value output by the loss function; the hard positive sample in the current batch is obtained according to the loss value, and several hard negative samples in the current batch are obtained according to the loss value.
S1202: and acquiring hard positive samples in the positive samples in a batch.
Specifically, after N positive samples (voice, vision) are randomly sampled from the training set to form a training batch, the voice features a_i and the visual features v_i, i ∈ {1, …, N}, are extracted by the current voice neural network and the current visual neural network, respectively. Then the hard samples within each batch are found. The hard sample is specifically represented by the following formula (4):
where the quantity on the left-hand side denotes the hard positive sample, v denotes the visual feature extracted by the visual neural network, a denotes the voice feature extracted by the voice neural network, and i denotes the sample index.
S1203: and acquiring a negative sample in the batch and a plurality of difficult negative samples in the negative sample.
Specifically, the in-batch negative samples are generated from the positive samples in the batch, and several hard negative samples in the batch are obtained.
Specifically, the N voice features and N visual features obtained from the N positive samples of the training batch in step S1202 are combined pairwise to form an N × N matrix. The positive sample combinations on the diagonal are excluded, giving N × (N-1) combinations as candidate negative samples, and the qualified negative samples obtained through the visual rule judgment and the voice rule judgment are the in-batch negative samples.
Each positive sample thus corresponds to several negative samples; that is, each positive sample in step S1202 corresponds to several negative samples. In step S1203, for each positive sample, a hard negative sample is found among its corresponding negative samples.
To obtain the hard negative samples in each batch: the negative samples corresponding to the voice feature a_i are sorted by the loss value output by the loss function, and the hard negative sample corresponding to the voice feature a_i is obtained according to the loss value; and/or the negative samples corresponding to the visual feature v_i are sorted by the loss value output by the loss function, and the hard negative sample corresponding to the visual feature v_i is obtained according to the loss value.
For example, if there are 3 positive samples, a 3 × 3 matrix can be formed; removing the positive sample combinations on the diagonal leaves 6 candidate negative samples. After the unqualified negative samples are removed from the matrix, all remaining entries in the matrix are qualified negative samples. The ith row of the matrix contains the negative samples corresponding to the voice of the ith positive sample, and the one with the largest loss in each row is marked as the hard negative sample for the voice of the ith positive sample; similarly, the ith column of the matrix contains the negative samples corresponding to the vision of the ith positive sample, and the one with the largest loss in each column is marked as the hard negative sample for the vision of the ith positive sample.
In particular, when a row or a column does not contain any qualified negative sample, no hard negative sample is computed for it.
The hard negative sample is specifically shown in the following formulas (5) and (6):
where the first quantity represents the distance of the hard negative sample corresponding to the voice of the jth positive sample, the second quantity represents the distance of the hard negative sample corresponding to the vision of the jth positive sample, v denotes the visual feature extracted by the visual neural network, and a denotes the voice feature extracted by the voice neural network.
When the jth row does not contain a qualified negative sample, the corresponding hard negative distance is set to a specific value; similarly, when the jth column does not contain a qualified negative sample, the corresponding hard negative distance is set to a specific value.
that is, the nature of hard negative sample mining is ordering. Within a training batch, for a speech sample ajTraversing all visual samples in the batch, and constructing a negative sample pair combination (v)0,aj),…,(vN,aj) If qualified negative samples exist, the qualified negative samples are selectedA difficult negative sample pair is selected. And for a visual sample vjTraversing all voice samples in the batch, and constructing a negative sample pair combination (v)j,a0),…,(vj,aN) And selecting a difficult negative sample pair from the qualified negative samples.
S1204: and inputting the hard positive samples and the hard negative samples into the voice and image synchronism measurement model after parameter adjustment for training, and adjusting the parameters in the voice and image synchronism measurement model again.
After the hard positive samples and hard negative samples have been mined online from the positive and negative samples, the loss no longer needs to be computed over all positive and negative samples in the batch. The loss function of the voice-image synchronicity measurement model therefore changes accordingly; the changed loss function is shown in the following formula (7):
where L represents the loss value, the first quantity represents the hard positive sample distance, the second quantity represents the distance of the hard negative sample corresponding to the voice of the jth positive sample, the third quantity represents the distance of the hard negative sample corresponding to the vision of the jth positive sample, N denotes the batch size, and margin_2 is a specific value.
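Formula (7) itself is not reproduced in this text; the sketch below gives one plausible triplet-style reading of the description (for each positive sample j, the positive distance is pushed below each of its two hard negative distances by at least margin_2) and should be treated as an assumption rather than the verbatim loss.

```python
import torch

def hard_sample_loss(pos_dist, hard_neg_voice, hard_neg_vision, margin2=1.0):
    """pos_dist, hard_neg_voice, hard_neg_vision: (N,) distances per positive sample j."""
    loss_a = torch.clamp(pos_dist - hard_neg_voice + margin2, min=0)   # voice-side hard negative term
    loss_v = torch.clamp(pos_dist - hard_neg_vision + margin2, min=0)  # vision-side hard negative term
    return (loss_a + loss_v).mean()
```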
With the hard positive samples, the hard negative samples and the correspondingly changed loss function, the parameters of the voice-image synchronicity measurement model can be adjusted further, the model is further optimized, and the accuracy of the model's predictions is improved.
In the actual model optimization process, optimization is generally performed not once but many times. That is, after the model is optimized once with the training data of the current batch, the training data of the next batch is used, the corresponding hard positive and hard negative samples are obtained, and they are input into the current model for another round of training; this is repeated until the output value of the loss function stays within a stable region, i.e., within a certain precision interval, and no longer fluctuates significantly.
S1205: and acquiring the hard positive samples in the next batch again.
S1206: and acquiring the negative sample in the batch and a plurality of difficult negative samples in the negative sample again.
Wherein a plurality of hard negative samples correspond to each of the positive samples.
S1207: and inputting the re-acquired hard positive samples and the plurality of hard negative samples into the voice and image synchronism measurement model after the parameters are re-adjusted, and training, adjusting the parameters in the voice and image synchronism measurement model until loss values output by loss functions corresponding to the voice and image synchronism measurement model are converged, namely the loss values are in a certain precision interval and do not generate large fluctuation any more.
Steps S1205, S1206, and S1207 are similar to the specific implementation manner of steps S1202, S1203, and S1204, and are not described herein again.
Thus, after m batches of samples have been processed in the above manner, the training of the voice-image synchronicity measurement model is completed, where m ≤ M (M being the number of batches into which the positive samples were divided). When it is necessary to measure whether the voice segment and the image segment in a certain video are synchronous, the voice segment and the image segment of the video are processed through steps S402-S405 and S406-S412 respectively and then input into the voice-image synchronicity measurement model, and the output of the model indicates whether the voice segment and the image segment in the video are synchronous.
The flow of the method for measuring the synchronicity between voice and image provided by the embodiment of the present application is fully described here.
Fig. 13 is a schematic view of the complete flow of the method for measuring the synchronicity between voice and image in the embodiment of the present application. Referring to fig. 13, after a video stream is acquired, it is split into two paths. In one path, the video stream is input into a preprocessing module for preprocessing, the voice segments are input into the SI-AR system, and the voice is processed into PPG signals; the PPG signals of several single frames are accumulated into voice data, which is then input into the voice neural network to obtain the voice features. In the other path, dense face alignment is performed on the video stream frame by frame. A frame of image may contain several faces, and the following steps are performed for each face: expression coefficients are extracted from the face; a 3D model is generated using the face pose, the standard ID and the expression coefficients extracted from the face image; the corresponding vertices of the 3D model are projected to obtain 2D key-point connection lines; the 2D key-point connection lines of several frames are accumulated into image data; and the image data is input into the visual neural network to obtain the visual features. Finally, the voice features and the visual features are input into the synchronicity measurement module to measure whether the voice and the image in the video stream are synchronous: if the threshold is met, they are determined to be synchronous; if not, they are determined to be out of sync. The synchronicity of the voice feature and the visual feature is judged by the synchronicity measurement module; a specific synchronicity measurement can be realized by calculating the distance between the voice feature vector and the visual feature vector and comparing it with a preset threshold. The face with the best synchronicity can also be determined by the synchronicity measurement module; if the synchronicity of none of the faces in the video reaches the preset threshold, it is determined that there is no suitable face in the video image for the current time segment.
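As an illustration of the final decision step, the sketch below compares the voice feature with the visual feature of each detected face by vector distance, selects the best-matching face, and reports that no suitable face exists when even the best distance misses the preset threshold. The threshold value is an assumed placeholder to be calibrated on the database.

```python
import torch

def pick_synchronous_face(voice_feat, visual_feats, threshold=0.6):
    """voice_feat: (512,) voice feature; visual_feats: (num_faces, 512) visual features."""
    dists = torch.norm(visual_feats - voice_feat, dim=1)   # distance between voice and each face
    best = int(dists.argmin())                             # face with the best synchronicity
    if float(dists[best]) > threshold:                     # none of the faces is synchronous enough
        return None, dists
    return best, dists
```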
In practical application, the method provided by the embodiment of the application can be applied to various scenes in which whether the voice and the image are synchronous or not needs to be judged. The method provided by the embodiment of the present application is further described below by taking three specific scenarios as examples.
Scene one: determining the speaker.
When several people are talking in a video, in order to determine the speaker who is currently speaking, the corresponding voice segment and image segments are first extracted from the video. The voice segment is then processed into a PPG signal to erase personal characteristics of the speaker such as timbre and intonation, and expression coefficients are extracted from the image segments through the 3DMM parameter estimation algorithm so that the image segments are processed into two-dimensional contour maps of the lower half of the frontal face, eliminating interference such as side faces and occlusion; the number of faces in the image is the number of two-dimensional contour maps. Next, the voice data obtained from the voice segment is input into the voice neural network and the image data obtained from the image segments is input into the visual neural network to obtain one voice feature and several visual features. Finally, each visual feature is matched against the voice feature for synchronicity, the visual feature with the highest synchronicity with the voice feature is determined, and the person corresponding to that visual feature is determined to be the current speaker in the video.
Scene two: identifying forged videos.
Some videos do not originally contain the sound or the picture; it has been added artificially at a later time. For example, a celebrity's video may be re-dubbed with words the celebrity never actually said. As another example, in some interactive liveness authentication, the user is required to read out the words displayed on the screen and record this as a video for uploading; to pass the verification, a lawbreaker obtains images of the user in advance and then dubs them to produce a video for uploading.
In order to judge whether a video is forged, the corresponding voice segment and image segment are first extracted from the video. The voice segment is then processed into a PPG signal to erase personal characteristics of the speaker such as timbre and intonation, and expression coefficients are extracted from the image segment through the 3DMM parameter estimation algorithm so that the image segment is processed into a two-dimensional contour map of the lower half of the frontal face, eliminating interference such as side faces and occlusion. Next, the voice data obtained from the voice segment is input into the voice neural network and the image data obtained from the image segment is input into the visual neural network to obtain the voice feature and the visual feature. Finally, the voice feature and the visual feature are matched for synchronicity; the higher the matching degree, the more likely the image and the voice in the video are synchronous rather than added artificially at a later time. When the matching degree is higher than a specific value, it can be determined that the image and the voice in the video were produced by the same person at the same time, i.e., the voice segment in the video belongs to the person in the image segment.
Scene three: video modulation.
When some non-professional multimedia devices record video, the device that captures the voice is separate from the device that captures the images: a microphone may capture the voice and a camera may capture the images, after which the captured voice and images are merged into a video. This easily causes the voice and the images in the video to be temporally misaligned, i.e., the sound and the picture are out of sync.
In order to solve the problem of audio-video asynchrony in such a video, the corresponding voice segment and image segment are first extracted from the video. The voice segment is then processed into a PPG signal to erase personal characteristics of the speaker such as timbre and intonation, and expression coefficients are extracted from the image segment through the 3DMM parameter estimation algorithm so that the image segment is processed into a two-dimensional contour map of the lower half of the frontal face, eliminating interference such as side faces and occlusion. Next, the voice data obtained from the voice segment is input into the voice neural network and the image data obtained from the image segment is input into the visual neural network to obtain the voice feature and the visual feature. Finally, the voice feature and the visual feature are matched for synchronicity to determine the degree of misalignment between the voice and the image, which can then be used for auxiliary calibration so that the voice and the image are aligned in time and the misalignment is eliminated.
Based on the same inventive concept, as the implementation of the method, the embodiment of the application also provides a device for measuring the synchronism of the voice and the image. Fig. 14 is a schematic structural diagram of a device for measuring synchronization between voice and image in the embodiment of the present application, and referring to fig. 14, the device may include:
a receiving module 1401, configured to obtain a voice segment and an image segment in a video.
A data processing module 1402, configured to generate an outline of the target person from the image segments, where the outline is independent of individual features of the target person.
A feature extraction module 1403, configured to obtain the voice features of the voice segments through a voice neural network.
The feature extraction module 1403 is further configured to obtain the visual features of the contour map through a visual neural network.
A synchronicity measuring module 1404, configured to determine whether the voice segment and the image segment are synchronous according to the voice feature and the visual feature.
Further, as a refinement and an extension of the apparatus shown in fig. 14, the embodiment of the present application also provides a measuring apparatus for the synchronicity of the voice and the image. Fig. 15 is a schematic structural diagram of a device for measuring synchronization between voice and image in the embodiment of the present application, and referring to fig. 15, the device may include:
the receiving module 1501 is configured to obtain a voice segment and an image segment in a video.
The pre-processing module 1502 includes:
the detecting unit 1502a is configured to perform face detection on the image segment to obtain a face detection frame.
An alignment unit 1502b is configured to horizontally align the faces in the face detection frame.
A data processing module 1503, configured to generate an outline of the target person from the image segments, where the outline is independent of individual features of the target person.
When the contour map is a face contour map, the data processing module 1503 includes:
the extracting unit 1503a is configured to extract the expression coefficient of the target person from the aligned face through a three-dimensional deformable parameterized face model parameter estimation algorithm.
A generating unit 1503b, configured to generate a face contour map of the target person based on the expression coefficients and the generic parameterized face model.
The generating unit 1503b is specifically configured to extract a lower half face expression coefficient corresponding to a lower half face in the expression coefficients. And inputting the expression coefficient of the lower half face into the universal three-dimensional face model to obtain a three-dimensional face model corresponding to the lower half face of the target person, and processing the three-dimensional face model into a face contour map of the target person.
The generating unit 1503b is specifically configured to input the lower half face expression coefficient into the general three-dimensional face model to obtain a three-dimensional face model corresponding to the lower half face of the target person. And acquiring a vertex set of a lower half face in the three-dimensional face model. And projecting the vertex set to a two-dimensional plane to obtain a lower half face contour map of the target person, and taking the lower half face contour map as a face contour map of the target person.
A feature extraction module 1504, configured to obtain the voice features of the voice segments through a voice neural network.
The feature extraction module 1504 includes:
a first extracting unit 1504a, configured to process the contour map with convolutional layers to obtain a feature matrix, where a convolutional kernel size and a step size of the convolutional layers are related to a size of the contour map.
A second extracting unit 1504b, configured to process the feature matrix with the backbone network of the visual neural network to obtain a feature vector.
A third extracting unit 1504c, configured to process the feature vector with a fully connected layer to obtain a 512-dimensional visual feature.
The feature extraction module 1504 is further configured to obtain visual features of the contour map through a visual neural network.
The synchronicity measuring module 1505 is configured to determine whether the voice segment and the image segment are synchronous according to the voice feature and the visual feature.
When the video is a video of a multi-person conversation, the synchronicity measuring module 1505 is configured to determine a speaker corresponding to the voice segment in the video according to the voice feature and the visual feature.
When the video is to be forged and identified, the synchronicity measuring module 1505 is configured to determine whether the voice segment in the video belongs to a person in the image segment according to the voice feature and the visual feature.
When the video is to be modulated, the synchronicity measuring module 1505 is used to align the start bits of the voice segment and the image segment in the video according to the voice feature and the visual feature, so that the voice segment and the image segment are synchronized.
It is to be noted here that the above description of the embodiments of the apparatus, similar to the description of the embodiments of the method described above, has similar advantageous effects as the embodiments of the method. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
Based on the same inventive concept, the embodiment of the application also provides the electronic equipment. Fig. 16 is a schematic structural diagram of an electronic device in an embodiment of the present application, and referring to fig. 16, the electronic device may include: a processor 1601, a memory 1602, a bus 1603; the processor 1601 and the memory 1602 complete communication with each other through the bus 1603; processor 1601 is configured to call program instructions in memory 1602 to perform the methods in one or more embodiments described above.
It is to be noted here that the above description of the embodiments of the electronic device, similar to the description of the embodiments of the method described above, has similar advantageous effects as the embodiments of the method. For technical details not disclosed in the embodiments of the electronic device of the present application, refer to the description of the embodiments of the method of the present application for understanding.
Based on the same inventive concept, embodiments of the present application further provide a computer-readable storage medium, where the storage medium may include: a stored program; wherein the program controls the device on which the storage medium is located to execute the method in one or more of the above embodiments when the program runs.
It is to be noted here that the above description of the storage medium embodiments, like the description of the above method embodiments, has similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (11)
1. A method for measuring the synchronicity of voice and image is characterized by comprising the following steps:
acquiring a voice segment and an image segment in a video, wherein the voice segment and the image segment have a corresponding relation in the video;
generating an outline of a target person according to the image segments, wherein the outline is independent of the individual characteristics of the target person;
obtaining the voice characteristics of the voice segments through a voice neural network;
obtaining visual features of the contour map through a visual neural network;
and determining whether the voice segment and the image segment have synchronicity according to the voice feature and the visual feature, wherein the synchronicity is used for representing that the sound in the voice segment is matched with the action of the target person in the image segment.
2. The method of claim 1, wherein the contour map is a face contour map; the generating of the contour map of the target person according to the image segment comprises:
extracting the expression coefficient of the target character from the image segment;
and generating a face outline image of the target person based on the expression coefficients and the universal parameterized face model.
3. The method of claim 2, wherein the extracting the expression coefficient of the target person from the image segment comprises:
and extracting the expression coefficient of the target character in the image segment by a three-dimensional deformable parameterized face model parameter estimation algorithm, wherein the expression coefficient accords with the standard of the three-dimensional deformable parameterized face model.
4. The method of claim 3, wherein before said extracting the target human expression coefficients in the image segments by a three-dimensional deformable parameterized face model parameter estimation algorithm, the method further comprises:
carrying out face detection on the image segments to obtain a face detection frame;
horizontally aligning the human face in the human face detection frame;
the extracting of the expression coefficient of the target character in the image segment through the three-dimensional deformable parameterized face model parameter estimation algorithm comprises the following steps:
and extracting the expression coefficient of the target person from the aligned human face.
5. The method of claim 2, wherein the generic parameterized face model is a generic three-dimensional face model; the generating of the face contour map of the target person based on the expression coefficients and the general parameterized face model comprises:
extracting a lower half face expression coefficient corresponding to a lower half face in the expression coefficients;
and inputting the expression coefficient of the lower half face into the universal three-dimensional face model to obtain a three-dimensional face model corresponding to the lower half face of the target person, and processing the three-dimensional face model into a face contour map of the target person.
6. The method of claim 5, wherein the inputting the expression coefficients of the lower half face into the general three-dimensional face model to obtain a three-dimensional face model corresponding to the lower half face of the target person, and processing the three-dimensional face model into the face contour map of the target person comprises:
inputting the expression coefficient of the lower half face into the universal three-dimensional face model to obtain a three-dimensional face model corresponding to the lower half face of the target person;
acquiring a vertex set of a lower half face in the three-dimensional face model;
and projecting the vertex set to a two-dimensional plane to obtain a lower half face contour map of the target person, and taking the lower half face contour map as a face contour map of the target person.
7. The method of claim 1, wherein obtaining the visual features of the contour map through a visual neural network comprises:
processing the contour map by using a convolution layer to obtain a characteristic matrix, wherein the convolution kernel size and the step length of the convolution layer are related to the size of the contour map;
processing the characteristic matrix by adopting a backbone network of the visual neural network to obtain a characteristic vector;
and processing the characteristic vectors by adopting a full connection layer to obtain visual characteristics, wherein the dimensionality of the visual characteristics is related to the data volume of the contour map and the type of the loss function adopted by the visual neural network.
8. The method according to any one of claims 1 to 7, wherein the video is a multi-person conversation video, and the determining whether the voice segment and the image segment are synchronous according to the voice feature and the visual feature comprises: determining, according to the voice feature and the visual feature, the speaker in the video who corresponds to the voice segment;
or the video is a video to be checked for forgery, and the determining whether the voice segment and the image segment are synchronous according to the voice feature and the visual feature comprises: determining, according to the voice feature and the visual feature, whether the voice segment in the video belongs to the person in the image segment;
or the video is a video in which the voice and the image are to be aligned, and the determining whether the voice segment and the image segment are synchronous according to the voice feature and the visual feature comprises: aligning, according to the voice feature and the visual feature, the start positions of the voice segment and the image segment in the video so that the voice segment is synchronized with the image segment.
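For illustration only: one common way to turn the voice and visual features into the three decisions listed in claim 8, using a cosine-similarity score for the speaker and forgery checks and a brute-force offset search for the alignment case. The threshold and the search range are assumptions.

```python
# Illustrative sketch only; threshold and max_shift are assumed values, not claimed ones.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def most_likely_speaker(voice_feat: np.ndarray, face_feats: list) -> int:
    """Multi-person conversation: pick the face whose visual feature best matches the voice."""
    return int(np.argmax([cosine(voice_feat, f) for f in face_feats]))

def is_synchronous(voice_feat: np.ndarray, visual_feat: np.ndarray,
                   threshold: float = 0.5) -> bool:
    """Forgery check: high similarity suggests the speech belongs to the shown person."""
    return cosine(voice_feat, visual_feat) > threshold

def best_offset(voice_feats: np.ndarray, visual_feats: np.ndarray,
                max_shift: int = 15) -> int:
    """Alignment: voice_feats/visual_feats are (T, D) sequences of segment embeddings.
    Returns the shift (in segments) that maximizes the mean similarity."""
    scores = []
    for shift in range(-max_shift, max_shift + 1):
        pairs = [(t, t + shift) for t in range(voice_feats.shape[0])
                 if 0 <= t + shift < visual_feats.shape[0]]
        sims = [cosine(voice_feats[i], visual_feats[j]) for i, j in pairs]
        scores.append(np.mean(sims) if sims else -1.0)
    return int(np.argmax(scores)) - max_shift
```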
9. An apparatus for measuring the synchronicity between voice and image, the apparatus comprising:
a receiving module configured to acquire a voice segment and an image segment from a video, wherein the voice segment and the image segment correspond to each other in the video;
a data processing module configured to generate a contour map of a target person according to the image segment, wherein the contour map is independent of the individual characteristics of the target person;
a feature extraction module configured to obtain a voice feature of the voice segment through a voice neural network;
the feature extraction module being further configured to obtain a visual feature of the contour map through a visual neural network;
and a synchronicity measurement module configured to determine, according to the voice feature and the visual feature, whether the voice segment and the image segment are synchronous, wherein the synchronicity indicates whether the sound in the voice segment matches the action of the target person in the image segment.
10. An electronic device, comprising: a processor, a memory, and a bus;
wherein the processor and the memory communicate with each other via the bus, and the processor is configured to invoke program instructions stored in the memory to perform the method of any one of claims 1 to 8.
11. A computer-readable storage medium, comprising: a stored program; wherein the program, when executed, controls the device on which the storage medium is located to perform the method according to any one of claims 1 to 8.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111057976.9A CN114466179B (en) | 2021-09-09 | 2021-09-09 | Method and device for measuring synchronism of voice and image |
PCT/CN2022/114952 WO2023035969A1 (en) | 2021-09-09 | 2022-08-25 | Speech and image synchronization measurement method and apparatus, and model training method and apparatus |
EP22866437.1A EP4344199A4 (en) | 2021-09-09 | 2022-08-25 | Speech and image synchronization measurement method and apparatus, and model training method and apparatus |
US18/395,253 US20240135956A1 (en) | 2021-09-09 | 2023-12-22 | Method and apparatus for measuring speech-image synchronicity, and method and apparatus for training model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111057976.9A CN114466179B (en) | 2021-09-09 | 2021-09-09 | Method and device for measuring synchronism of voice and image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114466179A true CN114466179A (en) | 2022-05-10 |
CN114466179B CN114466179B (en) | 2024-09-06 |
Family
ID=81406060
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111057976.9A Active CN114466179B (en) | 2021-09-09 | 2021-09-09 | Method and device for measuring synchronism of voice and image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114466179B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115376542A (en) * | 2022-08-22 | 2022-11-22 | 西南科技大学 | Low-invasiveness audio-visual voice separation method and system |
WO2023035969A1 (en) * | 2021-09-09 | 2023-03-16 | 马上消费金融股份有限公司 | Speech and image synchronization measurement method and apparatus, and model training method and apparatus |
Patent Citations (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0674315A1 (en) * | 1994-03-18 | 1995-09-27 | AT&T Corp. | Audio visual dubbing system and method |
US6219639B1 (en) * | 1998-04-28 | 2001-04-17 | International Business Machines Corporation | Method and apparatus for recognizing identity of individuals employing synchronized biometrics |
US20030154084A1 (en) * | 2002-02-14 | 2003-08-14 | Koninklijke Philips Electronics N.V. | Method and system for person identification using video-speech matching |
CN101199207A (en) * | 2005-04-13 | 2008-06-11 | 皮克索尔仪器公司 | Method, system, and program product for measuring audio video synchronization independent of speaker characteristics |
US20080111887A1 (en) * | 2006-11-13 | 2008-05-15 | Pixel Instruments, Corp. | Method, system, and program product for measuring audio video synchronization independent of speaker characteristics |
US9799096B1 (en) * | 2014-07-08 | 2017-10-24 | Carnegie Mellon University | System and method for processing video to provide facial de-identification |
JP2019049829A (en) * | 2017-09-08 | 2019-03-28 | 株式会社豊田中央研究所 | Target section determination device, model learning device and program |
CN107977629A (en) * | 2017-12-04 | 2018-05-01 | 电子科技大学 | A kind of facial image aging synthetic method of feature based separation confrontation network |
CN110677598A (en) * | 2019-09-18 | 2020-01-10 | 北京市商汤科技开发有限公司 | Video generation method and device, electronic equipment and computer storage medium |
JP2021086274A (en) * | 2019-11-26 | 2021-06-03 | 国立大学法人九州工業大学 | Lip reading device and lip reading method |
CN111091824A (en) * | 2019-11-30 | 2020-05-01 | 华为技术有限公司 | Voice matching method and related equipment |
WO2021104110A1 (en) * | 2019-11-30 | 2021-06-03 | 华为技术有限公司 | Voice matching method and related device |
CN111145282A (en) * | 2019-12-12 | 2020-05-12 | 科大讯飞股份有限公司 | Virtual image synthesis method and device, electronic equipment and storage medium |
KR20210108689A (en) * | 2020-02-26 | 2021-09-03 | 고려대학교 산학협력단 | Method and appartus for voice conversion by using neural network |
CN111414862A (en) * | 2020-03-22 | 2020-07-14 | 西安电子科技大学 | Expression recognition method based on neural network fusion key point angle change |
CN111507218A (en) * | 2020-04-08 | 2020-08-07 | 中国人民大学 | Matching method and device of voice and face image, storage medium and electronic equipment |
CN111225237A (en) * | 2020-04-23 | 2020-06-02 | 腾讯科技(深圳)有限公司 | Sound and picture matching method of video, related device and storage medium |
CN111881726A (en) * | 2020-06-15 | 2020-11-03 | 马上消费金融股份有限公司 | Living body detection method and device and storage medium |
CN112949481A (en) * | 2021-03-01 | 2021-06-11 | 西安邮电大学 | Lip language identification method and system for irrelevant speakers |
Also Published As
Publication number | Publication date |
---|---|
CN114466179B (en) | 2024-09-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112088402B (en) | | Federated neural network for speaker recognition |
Fisher et al. | | Speaker association with signal-level audiovisual fusion |
Potamianos et al. | | Recent advances in the automatic recognition of audiovisual speech |
Neti et al. | | Audio visual speech recognition |
Chen | | Audiovisual speech processing |
US20240135956A1 (en) | | Method and apparatus for measuring speech-image synchronicity, and method and apparatus for training model |
Wimmer et al. | | Low-level fusion of audio and video feature for multi-modal emotion recognition |
CN114466179B (en) | 2024-09-06 | Method and device for measuring synchronism of voice and image |
CN114973412B (en) | | Lip language identification method and system |
CN111341350A (en) | | Man-machine interaction control method and system, intelligent robot and storage medium |
CN113177531B (en) | | Speech recognition method, system, equipment and medium based on video analysis |
Luettin et al. | | Continuous audio-visual speech recognition |
CN118197315A (en) | | Cabin voice interaction method, system and computer readable medium |
Haq et al. | | Using lip reading recognition to predict daily Mandarin conversation |
CN114494930B (en) | | Training method and device for voice and image synchronism measurement model |
CN114466178A (en) | | Method and device for measuring synchronism of voice and image |
JP2023117068A (en) | | Speech recognition device, speech recognition method, speech recognition program, speech recognition system |
JP2019152737A (en) | | Speaker estimation method and speaker estimation device |
Fisher et al. | | Signal level fusion for multimodal perceptual user interface |
Ibrahim | | A novel lip geometry approach for audio-visual speech recognition |
Sahrawat et al. | | "Notic My Speech"--Blending Speech Patterns With Multimedia |
Murai et al. | | Face-to-talk: audio-visual speech detection for robust speech recognition in noisy environment |
Talea et al. | | Automatic visual speech segmentation |
Yasui et al. | | Multimodal speech recognition using mouth images from depth camera |
JP2020091559A (en) | | Expression recognition device, expression recognition method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||