WO2023035969A1 - Method for measuring synchronization between speech and image, and method and apparatus for training a model - Google Patents

Method for measuring synchronization between speech and image, and method and apparatus for training a model (语音与图像同步性的衡量方法、模型的训练方法及装置) Download PDF

Info

Publication number
WO2023035969A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
segment
image
voice
data
Prior art date
Application number
PCT/CN2022/114952
Other languages
English (en)
French (fr)
Inventor
王淳
曾定衡
吴海英
周迅溢
蒋宁
Original Assignee
马上消费金融股份有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202111057976.9A (published as CN114466179A)
Priority claimed from CN202111056592.5A (published as CN114466178A)
Priority claimed from CN202111058177.3A (published as CN114494930B)
Application filed by 马上消费金融股份有限公司
Priority to EP22866437.1A (published as EP4344199A1)
Publication of WO2023035969A1

Classifications

    • G10L25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination, for processing of video signals
    • H04N21/4394 - Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • G06N3/04 - Neural networks; Architecture, e.g. interconnection topology
    • G06N3/08 - Neural networks; Learning methods
    • G06V40/165 - Human faces; Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G06V40/168 - Human faces; Feature extraction; Face representation
    • G10L15/24 - Speech recognition using non-acoustical features
    • H04N17/00 - Diagnosis, testing or measuring for television systems or their details
    • H04N21/44008 - Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/8456 - Structuring of content, e.g. decomposing content into time segments, by decomposing the content in the time domain

Definitions

  • the present application relates to the technical field of video processing, and in particular to a method for measuring synchronization of voice and image, a method for training a model, and a device.
  • In order to measure whether the mouth movements of the persons in a video are synchronized with the speech they utter, SyncNet technology is generally used.
  • The so-called SyncNet technology can refer to Chung, Joon Son, and Andrew Zisserman, "Out of time: automated lip sync in the wild," Asian Conference on Computer Vision, Springer, Cham, 2016.
  • In this technology, the speech segment of the video is input into one neural network and the image segment of the video is input into another neural network to obtain speech features and visual features; the speech features and visual features are then compared to judge whether the mouth movement of the person in the video is synchronized with the speech uttered.
  • A first aspect of the present application provides a method for measuring the synchronization of voice and image, the method comprising: acquiring a voice segment and an image segment in a video, the voice segment and the image segment having a corresponding relationship in the video; performing any one of the following operations: converting the voice segment into a specific signal and acquiring speech features of the specific signal and visual features of the image segment, the specific signal being unrelated to the personal characteristics of the speaker in the voice segment; or generating a contour map of the target person according to the image segment and acquiring visual features of the contour map and speech features of the voice segment, the contour map being unrelated to the personal characteristics of the target person; or converting the voice segment into a specific signal, generating a contour map of the target person according to the image segment, and acquiring speech features of the specific signal and visual features of the contour map; and determining, according to the speech features and the visual features, whether the voice segment and the image segment are synchronized, the synchronization being used to indicate that the sound in the voice segment matches the movement of the target person in the image segment.
  • A second aspect of the present application provides a method for training a speech and image synchronization measurement model, the method comprising: processing a first image segment into first image data, processing a first voice segment into first voice data, and processing a second voice segment into second voice data; processing a random image segment into second image data and processing a random voice segment into third voice data; forming a positive sample from the first image data and the first voice data; forming a first negative sample from the first image data and the second voice data; forming a second negative sample from the first image data and the third voice data; forming a third negative sample from the first voice data or the second voice data together with the second image data; and training the speech and image synchronization measurement model using the positive sample, the first negative sample, the second negative sample and the third negative sample.
  • A third aspect of the present application provides a device for measuring the synchronization of voice and image, the device comprising: a receiving module, used to obtain a voice segment and an image segment in a video, the voice segment and the image segment having a corresponding relationship in the video; a data processing module, used to perform any one of the following operations: converting the voice segment into a specific signal and obtaining speech features of the specific signal and visual features of the image segment, the specific signal being unrelated to the personal characteristics of the speaker in the voice segment; or generating a contour map of the target person based on the image segment and obtaining visual features of the contour map and speech features of the voice segment, the contour map being unrelated to the personal characteristics of the target person; or converting the voice segment into a specific signal, generating a contour map of the target person according to the image segment, and obtaining speech features of the specific signal and visual features of the contour map; and a synchronization measurement module, used to determine, according to the speech features and the visual features, whether the voice segment and the image segment are synchronized, the synchronization being used to indicate that the sound in the voice segment matches the movement of the target person in the image segment.
  • A fourth aspect of the present application provides a training device for a speech and image synchronization measurement model, the device comprising: a data processing module, used to process a first image segment into first image data, process a first voice segment into first voice data, and process a second voice segment into second voice data, and also used to process a random image segment into second image data and a random voice segment into third voice data; a sample generation module, used to form a positive sample from the first image data and the first voice data, to form a first negative sample from the first image data and the second voice data, to form a second negative sample from the first image data and the third voice data, and to form a third negative sample from the first voice data or the second voice data together with the second image data; and a training module, used to train the speech and image synchronization measurement model using the positive sample, the first negative sample, the second negative sample and the third negative sample.
  • A fifth aspect of the present application provides an electronic device, including a processor, a memory and a bus, wherein the processor and the memory communicate with each other through the bus, and the processor is configured to call program instructions to perform the method of the first aspect or the second aspect.
  • a sixth aspect of the present application provides a computer-readable storage medium, including: a stored program; wherein, when the program is running, the device where the storage medium is located is controlled to execute the method of the first aspect or the second aspect.
  • FIG. 1 is a schematic diagram of an image segment in the embodiment of the present application.
  • FIG. 2 is a second schematic diagram of an image segment in the embodiment of the present application.
  • FIG. 3A is a schematic flow diagram of a method for measuring the synchronization of voice and image in the embodiment of the present application;
  • FIG. 3B is a schematic flow diagram of another method for measuring the synchronization of voice and image in the embodiment of the present application.
  • FIG. 3C is a schematic flow diagram of another method for measuring the synchronization of voice and image in the embodiment of the present application.
  • FIG. 4 is a schematic flow diagram of a method for measuring the synchronization of voice and image in the embodiment of the present application
  • Fig. 5 is a schematic flow chart of processing a speech segment in an embodiment of the present application.
  • Figure 6 is a schematic diagram of the range of the lower half of the face in the embodiment of the present application.
  • FIG. 7 is a schematic flow chart of processing image segments in an embodiment of the present application.
  • FIG. 8 is a schematic flow diagram of a training method for a voice and image synchronization measurement model in an embodiment of the present application
  • FIG. 9 is a schematic diagram of a framework for measuring voice and image synchronization in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the structure of the speech neural network in the embodiment of the present application.
  • Fig. 11 is a schematic flow chart of generating speech features in the embodiment of the present application.
  • Fig. 12 is a schematic flow chart of generating visual features in the embodiment of the present application.
  • FIG. 13 is a schematic flow diagram of training a neural network in an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a complete flow of a method for measuring the synchronization of voice and image in the embodiment of the present application.
  • FIG. 15 is a structural schematic diagram of a measuring device for voice and image synchronization in an embodiment of the present application.
  • Fig. 16 is a structural schematic diagram II of a measurement device for voice and image synchronization in the embodiment of the present application.
  • Fig. 17 is a structural schematic diagram III of a measurement device for voice and image synchronization in the embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of a training device for a speech and image synchronization measurement model in an embodiment of the present application.
  • Fig. 19 is a schematic diagram of the second structure of the training device for the speech and image synchronization measurement model in the embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of an electronic device in an embodiment of the present application.
  • When SyncNet technology is used to measure whether the mouth movement of the person in a video is synchronized with the speech uttered, the accuracy is relatively low.
  • Speech neural network: a neural network used to extract speech features.
  • Voice personal characteristics: for example, timbre, intonation, etc.
  • Visual personal characteristics: for example, lip thickness, mouth size, etc.
  • FIG. 1 is a first schematic diagram of an image segment in the embodiment of the present application. Referring to FIG. 1, there are 3 frames of images in the image segment. In the first frame, the person is speaking. In a subsequent frame, the head of the person turns, so the position and scale of the mouth in the image differ from those of the frontal face in the first frame, while the person continues to speak. Requiring SyncNet technology to accommodate this three-dimensional movement in a two-dimensional manner obviously affects the accuracy of the mouth-movement and speech synchronization judgment.
  • FIG. 2 is a second schematic diagram of an image segment in the embodiment of the present application. Referring to FIG. 2, in these two images the mouth of the person is partially covered by fingers and a pen. This type of occlusion affects the alignment of the mouth in the image, and the extracted mouth features are mixed with the occluders, which affects the accuracy of the mouth-movement and speech synchronization judgment.
  • the embodiment of the present application provides a method for measuring the synchronization of voice and image.
  • In this method, the speech segment or the image segment is processed first to remove the characteristics related to the individual speaker, and feature extraction is then performed on the speech data or image data obtained from this processing. In this way, the acquired speech features or visual features no longer carry the personal characteristics of the speaker, thereby improving the accuracy of measuring the synchronization of speech and images.
  • FIG. 3A is a schematic flow chart of a method for measuring the synchronization of voice and image in the embodiment of the present application. Referring to FIG. 3A, the method may include:
  • S301 Acquire a voice segment and an image segment in a video.
  • the video here refers to the video whose synchronization between the image and the voice is to be judged.
  • the synchronicity is used to characterize that the sound in the speech segment matches the movement of the target person in the image segment.
  • the movement of the target person generally refers to the movement of the lower half of the face of the target person, and specifically may be a movement related to the mouth.
  • For example, if the mouth of the target person in the image segment makes the mouth movement of saying "apple" and the sound in the speech segment is also "apple", the image segment and the speech segment can be considered to be synchronized.
  • If the mouth of the target person in the image segment makes the mouth movement of saying "apple" while the sound in the speech segment is "banana", the image segment and the speech segment can be considered not to be synchronized.
  • all the images in the video and all the voices are not directly judged together, but part of the images in the video are judged together with the corresponding voices.
  • the selected part of the image is an image segment in the video, and correspondingly, the selected voice is also a voice segment in the video.
  • the selected voice segment and the image segment have a corresponding relationship in the video.
  • The so-called corresponding relationship means that the selected speech segment and image segment have the same start time and end time in the video, or have a certain offset in time (an offset that is acceptable, i.e. not noticeable, to the human eye).
  • images and voices corresponding to frames 1 to 10 in the video are obtained.
  • the images from the 1st frame to the 10th frame in the video constitute the image segment
  • the voices from the 1st frame to the 10th frame in the video constitute the voice segment.
  • The first frame to the tenth frame here is only an example of a specific position.
  • the specific positions for acquiring image segments and voice segments can be set according to actual conditions, and are not specifically limited here.
  • the image segment may also be a certain frame of image
  • the corresponding audio segment may also be the audio of this frame and the audio of several frames before and after this frame.
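  • As a minimal illustration of this correspondence, the following sketch (in Python) slices an image segment and the speech segment covering the same time span; the frame positions, the 25 fps frame rate and the 16 kHz sample rate used here are illustrative assumptions rather than values fixed by this application:

```python
def extract_segments(frames, audio, fps=25, sample_rate=16000,
                     start_frame=0, num_frames=10):
    """Return an image segment and the speech segment covering the same time span.

    frames: sequence of decoded video frames for the whole video.
    audio:  1-D sequence of audio samples for the whole video.
    """
    image_segment = frames[start_frame:start_frame + num_frames]

    # Map the frame range to a sample range so that the speech segment and the
    # image segment share the same start and end time in the video.
    start_sample = int(start_frame / fps * sample_rate)
    end_sample = int((start_frame + num_frames) / fps * sample_rate)
    speech_segment = audio[start_sample:end_sample]

    return image_segment, speech_segment
```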
  • S3021 Convert the speech segment into a specific signal and acquire the speech feature of the specific signal and the visual feature of the image segment, and the specific signal has nothing to do with the personal characteristics of the speaker in the speech segment.
  • the above speech features and visual features may be extracted by using a speech and image synchronization measurement model.
  • The voice and image synchronization measurement model can include a speech neural network, a visual neural network and a synchronization measurement module, wherein the speech neural network can be used to extract the speech features of an input signal (such as a specific signal), the visual neural network can be used to extract the visual features of an input signal (such as an image segment), and the synchronization measurement module can be used to judge whether the speech segment and the image segment are synchronized.
  • the speech segment can be input into the speech neural network, and the speech segment is processed through the speech neural network, and the output of the speech neural network is the speech feature.
  • the speech neural network here can be any kind of neural network capable of acquiring speech features in speech segments. The specific type of the speech neural network is not specifically limited here.
  • the visual neural network here can be any neural network capable of obtaining visual features in image segments. The specific type of the visual neural network is not specifically limited here.
  • Before inputting the speech segment into the speech neural network to obtain speech features, the speech segment can be processed first to delete the personal characteristics of the speaker, that is, to extract from the speech segment semantic features that are unrelated to the personal characteristics of the speaker.
  • If speech segments are directly input into the speech neural network, the obtained speech features will contain the personal characteristics of each speaker, which reduces the accuracy of the speech and image synchronization judgment. Moreover, if speech segments containing personal characteristics are input into the speech network for training, the trained network will not be able to accurately obtain the speech features of speakers not included in the training samples, thereby reducing the accuracy of subsequent speech and image synchronization judgments.
  • Therefore, before inputting the speech segment into the speech neural network, the speech segment is first converted into a specific signal so that only the features related to the semantics of the speech segment are extracted and the personal characteristics of the speaker are avoided, for example: only the speech content itself is extracted, without extracting timbre, etc.
  • In this way, the personal features in the speech segment are deleted, that is, the speech segment is converted into a specific signal, and the specific signal is then input into the speech neural network; the obtained speech features thus avoid carrying the personal characteristics of the speaker, thereby improving the accuracy of the speech and image synchronization judgment.
  • the voice feature corresponding to the voice segment and the visual feature corresponding to the image segment can be obtained through the voice and image synchronization measurement model.
  • Before inputting the speech segment and the image segment into the voice and image synchronization measurement model, the speech segment can be processed while the image segment is left unprocessed; the image segment and the processed speech data are then input into the voice and image synchronization measurement model to obtain the visual features and the speech features respectively.
  • the specific processing means of speech clips and image clips, and the training of speech and image synchronization measurement models will be described in detail below.
  • S303 Determine whether the voice segment and the image segment are synchronized according to the voice feature and the visual feature, and the synchronization is used to indicate that the sound in the voice segment matches the movement of the target person in the image segment.
  • the speech and image synchronization measurement model can include a synchronization measurement module.
  • the speech neural network of the speech and image synchronization measurement model outputs speech features
  • the speech and image synchronization measurement model's visual neural network outputs visual features
  • the synchronicity measurement module compares the speech feature with the visual feature through an algorithm with a comparison function, and according to the comparison result, it can be determined whether the speech segment and the image segment are synchronized.
  • the synchronicity is used to characterize that the sound in the speech segment matches the movement of the target person in the image segment. That is to say, according to the comparison result, it is determined whether the sound in the speech segment has the same meaning as the movement of the target person in the image segment. It can also be understood that the sound produced by the movement of the target person in the image segment is the same as the sound in the speech segment in terms of semantics and time.
  • In some embodiments, the output is a value between 0 and 1, and a threshold between 0 and 1 is set. If the output value is greater than or equal to the threshold, the similarity between the speech feature and the visual feature is high, and the speech segment and the image segment are synchronized; if the output value is less than the threshold, the similarity between the speech feature and the visual feature is low, and the speech segment and the image segment are not synchronized.
  • the specific range and threshold of values are not specifically limited here.
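  • A minimal sketch of this decision step, assuming (as one common choice, not fixed by this application) that the comparison function is cosine similarity rescaled to [0, 1] and that the threshold is 0.5:

```python
import numpy as np

def is_synchronized(speech_feature, visual_feature, threshold=0.5):
    """Compare a speech feature with a visual feature and decide synchronization."""
    a = np.asarray(speech_feature, dtype=np.float32)
    b = np.asarray(visual_feature, dtype=np.float32)

    # Cosine similarity lies in [-1, 1]; rescale it to the [0, 1] range.
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    score = 0.5 * (cos + 1.0)

    # score >= threshold -> the speech segment and the image segment are synchronized.
    return score >= threshold, score
```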
  • In the method for measuring the synchronization of voice and image described above, after the speech segment and the image segment in the video are obtained, the speech segment is first converted into a specific signal that is unrelated to the speaker's personal characteristics, and the speech features of the specific signal and the visual features of the image segment are then obtained. Finally, whether the speech segment and the image segment are synchronized is determined according to the speech features and visual features. That is to say, the speech segment is first processed to remove the features related to the individual speaker, and feature extraction is then performed on the specific signal and the image segment. The speech features obtained in this way no longer carry the personal characteristics of the speaker, which can improve the accuracy of measuring the synchronization of speech and images.
  • the method for measuring the synchronization of voice and image may include the following steps.
  • S301 Acquire a voice segment and an image segment in a video.
  • S3022 Generate a contour map of the target person according to the image segment and acquire the visual features of the contour map and the phonetic features of the voice segment, and the contour map has nothing to do with the personal characteristics of the target person.
  • the above speech features and visual features may be extracted by using a speech and image synchronization measurement model.
  • The speech and image synchronization measurement model can include a speech neural network, a visual neural network and a synchronization measurement module, wherein the speech neural network can be used to extract the speech features of an input signal (such as a speech segment), the visual neural network can be used to extract the visual features of an input signal (such as a contour map), and the synchronization measurement module can be used to judge whether the speech segment and the image segment are synchronized.
  • Before inputting the image segment into the visual neural network to obtain visual features, the image segment is first processed to delete the personal characteristics of the person in the image segment, that is, to extract from the image segment the person-related features that are unrelated to the personal characteristics of the person in the image.
  • The thickness and size of lips vary from person to person: some people have thick lips, some have thin lips, some have big mouths and some have small mouths. If image segments are directly input into the visual neural network, the obtained visual features will contain the personal characteristics of each person, which reduces the accuracy of the image and speech synchronization judgment. Moreover, if image segments containing personal characteristics are input into the visual network for training, the trained network cannot accurately obtain the visual features of people not included in the training samples, thereby reducing the accuracy of subsequent image and speech synchronization judgments.
  • Therefore, before inputting the image segment into the visual neural network, only the features related to the movement of the lower half of the person's face are extracted from the image segment, and extraction of the personal characteristics of the person is avoided, for example: only the degree of opening and closing of the mouth is extracted, without extracting lip thickness, etc. Furthermore, by combining the extracted movement-related features, the posture or expression of the person can be obtained, and the contour map of the target person in the image segment can then be obtained. The contour map is then input into the visual neural network, so that the obtained visual features avoid carrying the personal characteristics of the person, thereby improving the accuracy of the image and speech synchronization judgment.
  • the voice feature corresponding to the voice segment and the visual feature corresponding to the image segment can be obtained through the voice and image synchronization measurement model.
  • Before inputting the speech segment and the image segment into the voice and image synchronization measurement model, the image segment can be processed while the speech segment is left unprocessed; the speech segment and the processed image data are then input into the voice and image synchronization measurement model to obtain the speech features and the visual features respectively.
  • the specific processing means of speech clips and image clips, and the training of speech and image synchronization measurement models will be described in detail below.
  • S303 Determine whether the voice segment and the image segment are synchronized according to the voice feature and the visual feature, and the synchronization is used to indicate that the sound in the voice segment matches the movement of the target person in the image segment.
  • In the method for measuring the synchronization of voice and image described above, after the speech segment and the image segment in the video are acquired, the contour map of the target person is first generated according to the image segment, the contour map being unrelated to the personal characteristics of the target person; the speech features of the speech segment and the visual features of the contour map are then obtained. Finally, whether the speech segment and the image segment are synchronized is determined according to the speech features and visual features. That is to say, the image segment is processed first to remove the features related to the individual person, and feature extraction is then performed on the speech segment and the contour map. In this way, the acquired visual features no longer carry the personal characteristics of the speaker, which can improve the accuracy of measuring the synchronization of speech and images.
  • the method for measuring the synchronization of voice and image may include the following steps.
  • S301 Acquire a voice segment and an image segment in a video.
  • S3023 Convert the speech segment into a specific signal, generate a contour map of the target person according to the image segment, and obtain the speech features of the specific signal and the visual features of the contour map.
  • The specific signal is unrelated to the personal characteristics of the speaker in the speech segment, and the contour map is unrelated to the personal characteristics of the target person.
  • step S3023 both the speech segment and the image segment are processed separately, and then corresponding features are extracted from the processed specific signal and contour map.
  • S303 Determine whether the voice segment and the image segment are synchronized according to the voice feature and the visual feature, and the synchronization is used to indicate that the sound in the voice segment matches the movement of the target person in the image segment.
  • In this method, after the speech segment and the image segment in the video are acquired, the speech segment is first converted into a specific signal that is unrelated to the personal characteristics of the speaker, and the contour map of the target person, which is unrelated to the personal characteristics of the target person, is generated according to the image segment; the speech features of the specific signal and the visual features of the contour map are then obtained.
  • Figure 4 is a schematic flow chart of a method for measuring the synchronization of voice and image in the embodiment of the present application, as shown in Figure 4, the method may include:
  • S401 Acquire a voice segment and an image segment in a video.
  • Step S401 is implemented in the same manner as step S301, and will not be repeated here.
  • The following describes, from the two aspects of speech and image respectively, how the speech segment and/or the image segment are processed into speech data and image data before being input into the voice and image synchronization measurement model.
  • The speech segment contains the speaker's personal characteristics, such as timbre and intonation. Therefore, before inputting the speech segment into the speech neural network to obtain speech features, the speaker's personal characteristics in the speech segment are first erased, and the speech data with the speaker's personal characteristics erased is then input into the speech neural network, which can improve the accuracy of the speech and image synchronization comparison.
  • the processing of the audio segment may specifically include the following steps.
  • S402 Convert the sampling frequency of the speech segment to a specific frequency.
  • Because the terminals that capture the video have different configurations, the sampling frequency of the speech also differs. In order to process the speech segment accurately later, the sampling frequencies of the speech segments need to be unified first.
  • the sampling frequency of the voice segment can be unified to 16kHz.
  • the sampling frequency of the speech segment may also be unified to other values, such as 8 kHz, 20 kHz, and so on.
  • the specific value can be set according to the actual situation, and is not limited here.
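  • For illustration, unifying the sampling frequency to 16 kHz could be sketched as follows; the use of scipy's polyphase resampler is an assumption, not part of this application:

```python
import numpy as np
from scipy.signal import resample_poly

def unify_sampling_rate(speech, orig_rate, target_rate=16000):
    """Resample a speech segment to the unified sampling frequency."""
    speech = np.asarray(speech, dtype=np.float32)
    if orig_rate == target_rate:
        return speech
    g = np.gcd(int(orig_rate), int(target_rate))
    # Polyphase resampling: upsample by target_rate/g, downsample by orig_rate/g.
    return resample_poly(speech, target_rate // g, orig_rate // g)
```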
  • step S403 may include two aspects.
  • the spectral subtraction method in the short-term spectrum estimation can be used to denoise the speech segment, so as to suppress the background sound in the speech segment and highlight the speech in the speech segment.
  • other methods may also be used to remove the background sound in the speech segment, such as adaptive filtering technology.
  • As for the specific way of removing the background sound in the speech segment, it is not limited here.
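  • A simplified sketch of spectral subtraction based on short-term spectrum estimation; estimating the noise spectrum from the first few frames and the 0.05 spectral floor are illustrative assumptions:

```python
import numpy as np

def spectral_subtraction(speech, frame_len=400, hop=160, noise_frames=10):
    """Suppress background sound via magnitude spectral subtraction."""
    speech = np.asarray(speech, dtype=np.float32)
    if len(speech) < frame_len:
        return speech
    window = np.hamming(frame_len).astype(np.float32)
    n = 1 + (len(speech) - frame_len) // hop

    # Short-term spectrum estimation of every frame.
    frames = np.stack([speech[i * hop:i * hop + frame_len] * window for i in range(n)])
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)

    # Estimate the noise magnitude from the first few frames, then subtract it.
    noise_mag = mag[:noise_frames].mean(axis=0)
    clean_mag = np.maximum(mag - noise_mag, 0.05 * mag)  # keep a small spectral floor

    # Approximate overlap-add reconstruction of the denoised speech.
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(n * hop + frame_len, dtype=np.float32)
    for i in range(n):
        out[i * hop:i * hop + frame_len] += clean[i].astype(np.float32)
    return out[:len(speech)]
```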
  • S4032 Separate voices of different speakers in the voice segment to obtain at least one voice sub-segment.
  • the voice sub-segment of a certain speaker or the voice sub-segments of some speakers can be selected as the denoised voice segment according to the actual judgment situation.
  • A window function can be used to segment the speech segment into multiple speech frames in a sliding-window weighted manner.
  • the window function can be a Hamming window function or other types of window functions.
  • The speech frames obtained by this segmentation can be multiple 25 ms segments, or segments of other lengths; each segment is called a speech frame.
  • The overlap between adjacent speech frames is generally maintained at 10 ms. This is because a speech frame is short and a single sound may not be finished within one frame; maintaining a certain overlap between adjacent speech frames therefore allows the semantics to be captured fully, which in turn improves the accuracy of the speech and image synchronization measurement.
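  • The framing described above can be sketched as follows; at 16 kHz, 25 ms frames correspond to 400 samples and a 10 ms overlap to a 240-sample hop, and the Hamming window is one of the window functions mentioned above:

```python
import numpy as np

def split_into_frames(speech, sample_rate=16000, frame_ms=25, overlap_ms=10):
    """Split a speech segment into weighted, overlapping speech frames."""
    speech = np.asarray(speech, dtype=np.float32)
    frame_len = int(sample_rate * frame_ms / 1000)           # 400 samples at 16 kHz
    hop = frame_len - int(sample_rate * overlap_ms / 1000)   # 10 ms overlap -> 240-sample hop
    window = np.hamming(frame_len)

    frames = [speech[start:start + frame_len] * window
              for start in range(0, len(speech) - frame_len + 1, hop)]
    return np.stack(frames) if frames else np.empty((0, frame_len), dtype=np.float32)
```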
  • steps S402, S403, and S404 may not be executed in the order of sequence numbers, but may be executed in any order.
  • the execution order of steps S402, S403, and S404 is not specifically limited here. Regardless of how many steps are executed in steps S402, S403, and S404, when converting to a specific signal, the processing result of the executed steps is taken as the processing object to be converted into a specific signal.
  • For example, if step S402 is performed, the speech segment converted to the specific frequency is converted into the specific signal; if step S403 is performed, the speech sub-segment is converted into the specific signal; if step S404 is performed, each speech frame is converted into the specific signal as described in step S405.
  • S405 Convert each speech frame into a specific signal.
  • the specific signal is independent of the personal characteristics of the speaker in the speech segment.
  • In the related art, before the speech segment is input into the speech neural network, the speech segment needs to be converted into a Mel-scale Frequency Cepstral Coefficients (MFCC) signal, and the MFCC signal is then input into the speech neural network to obtain the corresponding speech features.
  • However, the MFCC signal cannot well erase the personal characteristics of the speaker in the speech segment, that is, the identity information, so the obtained speech features will also contain the speaker's identity information, thereby reducing the accuracy of the measurement of the synchronization between speech and images.
  • speech fragments can be converted into specific signals before being fed into a speech neural network.
  • the specific signal here has nothing to do with the personal characteristics of the speaker in the speech segment, that is, it can better erase the personal characteristics of the speaker in the speech segment. In this way, when a specific signal is input into the speech neural network, the obtained speech features no longer contain the personal characteristics of the speaker, thereby improving the accuracy of speech and image synchronization measurement.
  • The specific signal may be a phonetic posteriorgram (Phonetic PosteriorGrams, PPG) signal.
  • PPG signals are better able to erase information related to speaker identity in speech clips.
  • the PPG signal can further erase the background sound in the speech segment, reduce the variance of the speech neural network input, and thus improve the accuracy of speech and image synchronization measurement.
  • speech fragments can also be converted into other types of signals, such as the features extracted by the DeepSpeech model, as long as the identity information of the speaker can be erased.
  • the specific type of the specific signal is not limited here.
  • In order to convert the speech segment into a PPG signal, the speech segment can be input into a speaker-independent automatic speech recognition (Speaker-Independent Automatic Speech Recognition, SI-ASR) system and processed through the SI-ASR system.
  • The international phoneme table can be used to extend the supported languages.
  • The specific dimension P of the PPG signal, i.e. the number of phonemes supported by the SI-ASR system, is related to the supported languages.
  • the PPG signal obtained from one speech frame is a 1 ⁇ 400-dimensional feature vector.
  • the PPG signal obtained from T consecutive speech frames is a T ⁇ 400-dimensional feature matrix.
  • Other SI-ASR systems can be adjusted according to the number of supported phonemes.
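  • The PPG extraction itself depends on the chosen SI-ASR system; the sketch below only illustrates the shapes described above and assumes a hypothetical si_asr_model that returns per-frame logits over P = 400 phoneme classes (the model name and interface are placeholders, not a real API):

```python
import numpy as np

def speech_frames_to_ppg(speech_frames, si_asr_model, num_phonemes=400):
    """Convert T speech frames into a T x P phonetic posteriorgram (PPG) matrix.

    si_asr_model is a placeholder callable assumed to map a batch of speech
    frames to per-frame logits of shape (T, num_phonemes); it is not a real API.
    """
    logits = np.asarray(si_asr_model(speech_frames), dtype=np.float32)  # (T, 400)

    # Softmax over the phoneme axis gives, for each frame, the posterior
    # probability of every phoneme class, which is speaker-independent.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    ppg = probs / probs.sum(axis=1, keepdims=True)                      # (T, 400)

    assert ppg.shape[1] == num_phonemes
    return ppg
```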
  • DeepSpeech is a deep learning model that can convert speech signals into corresponding text. Therefore, the features extracted by DeepSpeech contain only the content of the speech itself and no personal characteristics such as the speaker's timbre; the speaker's identity information and the background sound, which are unrelated to the semantics, are likewise erased after extraction.
  • the above processing process can also be carried out through the schematic flow chart shown in FIG. 5 .
  • the voice is input into the preprocessing module.
  • the above steps S402-S404 are performed in the pre-processing module, that is, processing such as unified sampling frequency, noise removal, and segmentation is performed on the speech.
  • the processed speech segments are fed into the SI-ASR system.
  • the processing of the above step S405 is performed, that is, the speech segment is converted into a PPG signal.
  • The image segment contains the personal characteristics of the target person, such as lip thickness and mouth size. Therefore, before inputting the image segment into the visual neural network to obtain image features, the personal characteristics of the target person in the image segment should first be erased while the information related to the movement of the lower half of the face is retained; the image data with the personal characteristics of the target person erased is then input into the visual neural network, which can improve the accuracy of the speech and image synchronization comparison.
  • the following is an example of extracting the features of the lower half of the face from an image segment to describe how to generate a contour map of a target person based on an image segment.
  • the profile extracted here has nothing to do with the personal characteristics of the target person.
  • the processing of the image segment may specifically include the following steps.
  • S406 Perform face detection on the image segment to obtain a face detection frame.
  • face detection is performed on each frame of image in the image segment to obtain a face detection frame.
  • the dense face alignment algorithm can be used to find out the positions of the face key points in the face detection frame in the original image, including but not limited to the left eye center position, right eye center position, left mouth corner position and right mouth corner position.
  • the above left and right are the left and right of the physiological meaning of the face in the image, not the left and right in the image, and it is assumed that the face in the image is frontal.
  • Based on rule calculation, the face image is then processed into a form that conforms to the following rules.
  • The rules here can be as follows:
  • calculate the vector from the left-eye-center key point to the right-eye-center key point, denoted V_eyetoeye;
  • the magnification is 2 times the norm of V_eyetoeye and 1.8 times the norm of V_eyetomouth, giving the vector X; rotate X 90 degrees counterclockwise to obtain the vector Y;
  • the image in the above rectangle is taken out using an interpolation algorithm and scaled to a predetermined size, such as 256*256 pixels, to obtain the aligned face.
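  • A rough sketch of an alignment of this style, under explicit assumptions: V_eyetomouth is taken as the vector from the midpoint of the eye centers to the midpoint of the mouth corners, the two magnification terms are combined with max(), and the crop rectangle is the square spanned by X and Y around an assumed face center; none of these details is fully specified in the text above:

```python
import numpy as np
import cv2  # OpenCV, used here only for the interpolated crop

def align_face(image, left_eye, right_eye, left_mouth, right_mouth, out_size=256):
    """Align a face using the eye-center and mouth-corner key points.

    V_eyetomouth, the max() combination of the magnification terms and the crop
    center are assumptions; the 2.0 / 1.8 factors, the 90-degree rotation and
    the 256x256 output follow the rules quoted above.
    """
    left_eye, right_eye = np.asarray(left_eye, float), np.asarray(right_eye, float)
    left_mouth, right_mouth = np.asarray(left_mouth, float), np.asarray(right_mouth, float)

    v_eyetoeye = right_eye - left_eye
    eye_center = (left_eye + right_eye) / 2.0
    mouth_center = (left_mouth + right_mouth) / 2.0
    v_eyetomouth = mouth_center - eye_center           # assumed definition

    direction = v_eyetoeye / (np.linalg.norm(v_eyetoeye) + 1e-8)
    scale = max(2.0 * np.linalg.norm(v_eyetoeye), 1.8 * np.linalg.norm(v_eyetomouth))
    x = direction * scale                              # vector X
    y = np.array([-x[1], x[0]])                        # X rotated 90 degrees counterclockwise

    center = eye_center + 0.5 * v_eyetomouth           # assumed crop center
    quad = np.float32([center - x / 2 - y / 2, center + x / 2 - y / 2,
                       center + x / 2 + y / 2, center - x / 2 + y / 2])
    dst = np.float32([[0, 0], [out_size - 1, 0],
                      [out_size - 1, out_size - 1], [0, out_size - 1]])
    M = cv2.getPerspectiveTransform(quad, dst)
    return cv2.warpPerspective(image, M, (out_size, out_size))
```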
  • the dense face alignment algorithm used to find out the key points of the face can be a three-dimensional dense face alignment (3Dimentional Dense Face Alignment, 3DDFA) algorithm.
  • other alignment algorithms can also be used to obtain face key points, and then the above rules can be used to achieve face alignment.
  • the specific algorithm used here is not limited.
  • this method is compatible with the alignment of large-angle side faces and front faces.
  • the expression coefficient of the target person in the face detection frame can be extracted through the parameter estimation algorithm of the three-dimensional deformable parametric face model (3Dimensional Morphable Models, 3DMM), and the expression coefficient meets the standard of the three-dimensional deformable parametric face model.
  • Since 3DMM explicitly decouples the identity parameter space (the part expressing identity information) from the expression parameter space (the part expressing expression information), the expression information obtained by the 3DMM parameter estimation algorithm does not contain identity information, that is, it contains no personal characteristics.
  • the identity coefficient and expression coefficient of the target person in line with the 3DMM model standard can be obtained.
  • the expression coefficient can be denoted as ⁇ exp .
  • the 3DMM parameter estimation algorithm is an algorithm capable of estimating 3DMM parameters, and is used to estimate the identity coefficient and expression coefficient of a human face, and the identity coefficient and expression coefficient meet the standards defined by 3DMM.
  • the 3DMM parameter estimation algorithm used in this application is implemented with a deep neural network model.
  • Specifically, the aligned face image in the face detection frame and the identity coefficient currently associated with the target person are input into a pre-trained deep neural network model, the model extracts the expression coefficient and the identity coefficient of the target person in the aligned face image, and the identity coefficient associated with the target person is updated according to the output identity coefficient for the estimation of subsequent image frames.
  • The identity coefficient corresponding to the target person is the sliding weighted average of the identity coefficients estimated from temporally adjacent image frames.
  • In this way, instead of changing the identity coefficient, the model uses the expression coefficient to fit the morphological changes of the face; that is, the ambiguity of the parameter estimation process is eliminated by adding a temporal-stability constraint on the identity coefficient, so that a more accurate expression coefficient is obtained.
  • Other 3DMM parameter estimation algorithms that can stabilize the identity coefficient can also be used here for reference, for example the Face2Face algorithm (Thies, Justus, et al. "Face2face: Real-time face capture and reenactment of RGB videos." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016), to obtain the expression coefficient of each frame.
  • the expression coefficient ⁇ exp contains features that represent the position of the mouth, the degree of opening and closing of the mouth, etc., which have nothing to do with the speaker.
  • The characteristics related to the individual speaker are represented in the identity coefficient. Therefore, generating the face contour map of the target person based only on the expression coefficient and the standard identity coefficient (here the standard identity coefficient is used in place of the identity coefficient of the target person, so the personal characteristics of the target person are removed), which are input into the general parameterized face model, can improve the accuracy of the mouth-movement and speech synchronization measurement by excluding the personal characteristics of the target person.
  • S410 Input the lower half face expression coefficients into the general 3D face model to obtain a 3D face model corresponding to the lower half face of the target person.
  • The 3D face model corresponding to the lower half of the face of the target person is the 3D face model obtained by combining the lower-half-face expression coefficient of the target person with the standard identity coefficient.
  • the general 3D face model is an abstract face model.
  • the data of eyebrows, eyes, nose, face, mouth and other parts are obtained based on the average of many faces, which is universal.
  • the 3D face model corresponding to the lower half face of the target person's mouth expression is obtained.
  • The predefined complete expression orthogonal basis B_exp is correspondingly changed to B_halfface, which is related to the movement of the lower half of the face. Specifically, this is shown in formula (1) below, where:
  • S is the geometric model of the mouth shape of the target person under the neutral expression;
  • B_halfface is the orthogonal basis related to mouth movement;
  • ⁇ halfface is the expression coefficient of the lower half of the face.
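  • Formula (1) itself is not reproduced above; the following is a reconstruction under the assumption that it takes the standard linear 3DMM form, with S and B_halfface as defined above and the lower-half-face expression coefficient written here as alpha_halfface:

```latex
% Assumed reconstruction of formula (1): the lower-half-face geometry equals
% the neutral mouth-shape geometry S plus the half-face orthogonal basis
% weighted by the lower-half-face expression coefficient.
S_{\mathrm{3D}} \;=\; S \;+\; B_{\mathrm{halfface}} \, \alpha_{\mathrm{halfface}} \tag{1}
```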
  • the obtained 3D face model corresponding to the expression of the lower half of the face of the target person can eliminate the influence of irrelevant expressions.
  • S411 Obtain a vertex set of the lower half of the face in the 3D face model.
  • Fig. 6 is a schematic diagram of the range of the lower half of the face in the embodiment of the present application. Referring to Fig. 6, the position 601 at the bottom of the left ear, the position 602 at the tip of the nose and the position 603 at the bottom of the right ear are connected to obtain a connection line 604. The connection line 604 divides the human face into an upper half face and a lower half face; the part of the face below the line 604 is the lower half of the face.
  • connection line 604 can have a certain adjustment range, such as moving up to the eye position, or moving down to the nose position. That is, the selection of the lower half of the face can be adjusted according to actual needs.
  • S412 Project the vertex set onto a two-dimensional plane to obtain a lower half face contour map of the target person, and use the lower half face contour map as the face contour map of the target person.
  • I is the two-dimensional contour map of the lower half of the face of the target person
  • f is the scale coefficient
  • P is the orthogonal projection matrix
  • S(v) is the set of vertices of the lower half of the face in the three-dimensional face model.
  • the size of the contour map I can be a rectangle of 128 ⁇ 256, and the contours of the mouth and the lower half of the face are centered.
  • each vertex is projected into a two-dimensional Gaussian circular spot with the center of the vertex projection position and a radius of r pixels.
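  • The projection formula itself is not reproduced above. The sketch below assumes the usual weak-perspective form I = f * P * S(v), i.e. the scale coefficient times the orthographic projection of the lower-half-face vertex set, followed by splatting each projected vertex as a Gaussian circular spot of radius r pixels onto a 128 x 256 canvas; the canvas orientation, centring and normalization are assumptions:

```python
import numpy as np

def project_halfface_contour(vertices, f=1.0, r=3, height=128, width=256):
    """Project lower-half-face 3D vertices to a 2D contour map of Gaussian spots.

    The weak-perspective form I = f * P * S(v), the centring step and the
    canvas orientation (height 128, width 256) are assumptions.
    """
    P = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])                              # orthographic projection
    pts = f * (np.asarray(vertices, dtype=np.float32) @ P.T)     # (N, 2) image-plane points

    # Centre the mouth / lower-half-face contour on the canvas.
    pts = pts - pts.mean(axis=0) + np.array([width / 2.0, height / 2.0])

    canvas = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for cx, cy in pts:
        # Each vertex becomes a two-dimensional Gaussian circular spot of radius ~r pixels.
        canvas += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * (r / 2.0) ** 2))
    return np.clip(canvas, 0.0, 1.0)
```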
  • In this process, the original posture, orientation and illumination information of the input face image is not retained; only the expression coefficients of the target person in the image segment, obtained by the 3DMM parameter estimation algorithm, are retained and then combined with the standard identity coefficient to obtain a general 3D face model, from which a lower-half-face contour map that eliminates the personal characteristics of the target person is generated.
  • The obtained contour map is therefore a contour map under frontal-face conditions, and it eliminates the influence of the face posture, illumination and occluders in the original image.
  • The above processing can be carried out according to the schematic flow diagram shown in Figure 7. Referring to Figure 7, first, the processing of steps S406-S407 is performed on the image segment, that is, dense face alignment is performed on the images to obtain aligned images; then, the processing of steps S408-S409 is performed on the aligned images, that is, the expression coefficients of the 3D face model are extracted from the aligned images; next, the processing of step S410 is performed on the extracted expression coefficients, that is, a 3D model is generated from the extracted expression coefficients using a frontal perspective, a standard face shape and average illumination; finally, the processing of steps S411-S412 is performed on the generated 3D model, and the corresponding vertices of the 3D model are projected to obtain the two-dimensional contour map of the lower half of the face.
  • After the above processing, the PPG signal can be input into the speech neural network and the two-dimensional contour map into the visual neural network to obtain the speech features and the visual features respectively; the speech features are then compared with the visual features to determine whether the speech segment and the image segment are synchronized.
  • the speech segment is input into the speech neural network, and the speech segment is processed through the speech neural network, and the output of the speech neural network is the speech feature.
  • the speech neural network here can be any kind of neural network capable of acquiring speech features in speech segments.
  • the specific type of the speech neural network is not specifically limited here.
  • the contour map obtained after processing the image fragments is input into the visual neural network, and the contour map is processed through the visual neural network, and the output of the visual neural network is the visual feature.
  • the visual neural network here can be any neural network capable of obtaining visual features in image segments.
  • the specific type of the visual neural network is not specifically limited here.
  • step S415 is further included: determining whether the speech segment and the image segment are synchronized according to the speech feature and the visual feature.
  • the embodiment of the present application also provides a method for training a speech and image synchronization measurement model.
  • In the embodiment of the present application, training samples of various types are obtained in advance, for example: image segments and speech segments that are synchronized in the same training video, image segments and speech segments that are not synchronized in the same training video, image segments and speech segments from different training videos, and so on.
  • Using various types of training samples to train the speech and image synchronization measurement model can improve the accuracy of the speech and image synchronization measurement model, thereby improving the accuracy of the speech and image synchronization measurement.
  • the first training video is a training video in the training video set. Select a training video different from the first training video from the training video set as the second training video.
  • the method provided by the embodiment of the present application can be applied in various scenarios where it is necessary to determine whether the voice and the image are synchronized.
  • three specific scenarios are taken as examples to further describe the method provided in the embodiment of the present application.
  • Scenario 1: Determining the speaker.
  • In this scenario, the speech data obtained after processing the speech segment is input into the speech neural network, and the image data obtained after processing the image segments is input into the visual neural network, so as to obtain the speech feature and multiple visual features respectively. Finally, the multiple visual features are matched against the speech feature for synchronization, the visual feature with the highest synchronization with the speech feature is determined, and its synchronization score is compared with a preset threshold; if the score passes the preset threshold, the person corresponding to that visual feature is determined to be the current speaker in the video. This avoids judging a speaker who is not in the video as the current speaker in the video; for example, in a reporter-interview scene, if the reporter is not in the picture, there is no corresponding speaker in the picture when the reporter speaks.
  • Scenario 2: Forged video identification.
  • The sound or picture in some videos may not be original but added artificially afterwards, for example, re-dubbing videos of celebrities with words that they never actually said.
  • Scenario 3: Audio and video alignment.
  • the device for collecting voice and the device for collecting images are often separated.
  • For example, a microphone can be used to capture the voice and a camera to capture the images, and the captured voice and images are then fused into a video. This can easily cause the voice and the images in the video to be misaligned in time, that is, the audio and video become out of sync.
  • sample video data is obtained first, and then a speech and image synchronization measurement model is trained by using the sample video data.
  • the sampling of sample video data has an important impact on the performance of the speech and image synchronization measurement model, such as training efficiency and accuracy.
  • the sampling strategy of the sample video data can be optimized based on the characteristics of the sample video data, so as to train the speech and image synchronization measurement model more efficiently and obtain a higher-precision model.
  • the sample video data is processed through image preprocessing and voice preprocessing, and the information irrelevant to the speaker/target person in the sample video data is erased in a targeted manner, while the information related to the speaker/target person is retained.
  • The speech preprocessing may be to process the speech segment extracted from the sample video data into a PPG signal.
  • The PPG signal is a frame-level representation that is independent of the speaker's language and can therefore be used for synchronization judgment across multiple languages; in addition, distances can be measured between PPG signals, which can be used when sampling positive and negative samples from the sample video data.
  • the image preprocessing may be to process the image segments extracted from the sample video data into a contour map that has nothing to do with the personal characteristics of the target person.
  • Fig. 8 is a schematic flow chart of the training method of the speech and image synchronization measurement model in the embodiment of the present application. Referring to Fig. 8, the method may include:
  • S801 Process the first image segment into first image data, process the first voice segment into first voice data, and process the second voice segment into second voice data.
  • The first image segment, the first voice segment and the second voice segment are from the first training video; the first image segment and the first voice segment are synchronized, while the first image segment and the second voice segment are not synchronized. That is to say, the first image data, the first voice data and the second voice data are from the first training video.
  • The image segment and the voice segment of a first interval of the first training video are obtained as the first image segment and the first voice segment.
  • The voice segment of a second interval of the first training video is obtained as the second voice segment.
  • The first interval and the second interval may not overlap at all, or may only partially overlap. In this way, it is ensured that the content of the first voice segment and the content of the second voice segment are not identical.
  • For example, the image corresponding to the 10th ms to the 30th ms of the first training video is used as the first image segment, the voice corresponding to the 10th ms to the 30th ms of the first training video is used as the first voice segment, and the voice corresponding to the 35th ms to the 55th ms of the first training video is used as the second voice segment.
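  • The interval-based extraction in this example can be sketched as below, assuming the training video has already been decoded into a 16 kHz waveform; the names and the dummy waveform are illustrative only.

```python
import numpy as np

def slice_audio(waveform: np.ndarray, start_ms: float, end_ms: float, sr: int = 16000) -> np.ndarray:
    """Return the samples of `waveform` covering [start_ms, end_ms)."""
    return waveform[int(start_ms * sr / 1000): int(end_ms * sr / 1000)]

wav = np.random.randn(16000)                      # one second of dummy 16 kHz audio
first_voice_segment = slice_audio(wav, 10, 30)    # paired with the 10 ms - 30 ms image segment
second_voice_segment = slice_audio(wav, 35, 55)   # a non-overlapping interval, so its content differs
```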
  • S802 Process a random image segment into second image data, and process the random voice segment into third voice data.
  • the random image segment and the random speech segment come from the second training video. That is, the second image data and the third voice data are from the second training video.
  • the first training video and the second training video are two different videos, both of which are from the training video set. That is to say, in order to enrich the training samples, it is also necessary to obtain image segments and voice segments in other videos except the first training video, and these image segments and voice segments are called random image segments and random voice segments respectively.
  • The first training video and the second training video need to have a certain degree of difference in the specific content of the image or voice, so that the subsequent speech and image synchronization measurement model can learn more accurately, thereby improving the accuracy of the speech and image synchronization measurement.
  • the positive sample can be obtained in the following way: the first image segment and the first voice segment in the same interval in the same training video are processed into the first image data and the first voice data to form a positive sample.
  • the first image segment and the first voice segment corresponding to the 10th ms to the 30th ms in the first training video, and the corresponding first image data and first voice data are used as a positive sample.
  • the first image segment and the first voice segment corresponding to the 40th ms to the 60th ms in the first training video, and the corresponding first image data and first voice data are used as another positive sample.
  • S806 Compose the first voice data or the second voice data, together with the second image data, into a third negative sample.
  • the speech segment that is not synchronized with the first image segment and the first image segment are preprocessed to form a negative sample.
  • The non-synchronized speech segment here covers two situations.
  • In the first situation, the non-synchronized speech segment also comes from the first training video; that is, the speech segment may be the second speech segment.
  • the first image segment and the second audio segment can be processed into first image data and second audio data to form a first negative sample of dislocation of audio and image.
  • In the second situation, the non-synchronized speech segment comes from the second training video; that is, the speech segment may be a random speech segment.
  • the first image segment and the random audio segment can be processed into the first image data and the third audio data to form the image-fixed second negative sample.
  • In the third case, the image segment without synchronization comes from the second training video. That is to say, the second speech segment and another image segment are processed into the second speech data and the second image data to form a third negative sample with fixed speech.
  • Alternatively, the first speech segment and another image segment can be processed into the first speech data and the second image data to form a third negative sample with fixed speech, as long as the speech segment in the third negative sample comes from the first training video.
  • After the first speech segment, the second speech segment and the random speech segment are processed, they are converted into specific signals that are independent of the personal characteristics of the speakers in the speech segments. That is, the first voice data, the second voice data and the third voice data are all specific signals, and these specific signals are independent of the personal characteristics of the speakers in the corresponding speech segments.
  • both the first image data and the second image data are the face profile images of the target person, and the face profile images have nothing to do with the personal characteristics of the target person in the corresponding image segment.
  • S807 Using the positive sample, the first negative sample, the second negative sample, and the third negative sample to train a speech and image synchronization measurement model.
  • Training means adjusting the parameters of the speech and image synchronization measurement model so as to optimize it, so that when image data and voice data to be measured are subsequently input, the speech and image synchronization measurement model can measure them more accurately.
  • In the speech and image synchronization measurement model there are mainly two neural networks, namely the speech neural network and the visual neural network.
  • the speech neural network mainly obtains speech features based on speech data
  • the visual neural network mainly obtains visual features based on image data.
  • a synchronicity measurement module is also included, which can also be a neural network. Therefore, training the speech and image synchronization measurement model means training each neural network in the speech and image synchronization measurement model.
  • The first image segment and the first voice segment, which are synchronized in the first training video, the second voice segment, which is not synchronized with the first image segment, as well as the random image segment and the random voice segment from outside the first training video, are correspondingly processed into the first image data, the first voice data, the second voice data, the second image data and the third voice data.
  • The first image data and the first voice data compose the positive sample.
  • The first image data and the second voice data compose the first negative sample.
  • The first image data and the third voice data compose the second negative sample.
  • The first voice data or the second voice data, together with the second image data, compose the third negative sample.
  • the types of training samples are enriched, especially the types of negative samples that make images and voices out of synchronization are enriched.
  • Training the speech and image synchronization measurement model with these rich types of positive samples, first negative samples, second negative samples and third negative samples can improve the accuracy of the speech and image synchronization measurement model, thereby improving the accuracy of the speech and image synchronization measurement.
  • Fig. 9 is a schematic diagram of the framework for measuring the synchronization of speech and images in the embodiment of the present application.
  • The speech segment is input into the speech neural network to obtain the speech features.
  • The image segment is input into the visual neural network to obtain the visual features.
  • the voice features and visual features are input into the synchronization measurement module, and the synchronization measurement module determines whether the corresponding voice segment and image segment have synchronization through the voice feature and visual feature.
  • the synchronicity measurement module here is a module for determining whether the corresponding speech segment and image segment are synchronous by comparing the speech feature with the visual feature.
  • the specific form of the synchronization measurement module is not limited here.
  • In order to obtain the speech features of a speech segment, the speech segment can be input into a speech neural network for processing to obtain the speech features. Likewise, in order to obtain the visual features of an image segment, the image segment can be input into a visual neural network for processing to obtain the visual features.
  • The following explains three aspects: the construction of the neural networks, the sampling of training data, and the training itself.
  • Before the speech segment is input into the speech neural network, it has already been converted into a specific signal, specifically a PPG signal with dimension T×P, in which each dimension has a clear physical meaning: P is the number of phonemes, T is the number of samples in time, and each column is the posterior probability distribution over phonemes corresponding to one speech frame. Based on these clear physical meanings, the speech neural network can be built as follows.
  • Fig. 10 is a schematic diagram of the architecture of the speech neural network in the embodiment of the present application.
  • The convolution kernel size is 3×1, the convolution stride is (2, 1), and valid padding is used.
  • The resulting matrix is reorganized into a feature vector.
  • Three fully connected layers are used to process the feature vector.
  • A 512-dimensional speech feature vector is then obtained through a linear projection layer.
  • the number of layers of the convolutional layer is related to the duration of the specific signal (the corresponding feature matrix of the PPG signal) of the input.
  • the dimension of the voice feature vector of the final output is consistent with the dimension of the visual feature vector of the subsequent output.
  • In the embodiments of the present application, the speech feature vectors are the speech features, and the visual feature vectors are the visual features.
  • For example, the PPG feature matrix has a dimension of 13×400.
  • Two 1-dimensional convolutional layers can be used to obtain a 3×400 feature matrix.
  • The final 512-dimensional speech feature vector is then obtained through 3 fully connected layers and 1 linear layer.
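  • A minimal PyTorch sketch of such a speech branch is given below, assuming the input is a PPG matrix of shape (T, P) = (13, 400); the time-only convolutions, the three fully connected layers and the final 512-dimensional projection follow the description above, while the exact intermediate shapes and activation choices are illustrative.

```python
import torch
import torch.nn as nn

class SpeechNet(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        # two convolutions applied over the time axis only: kernel (3, 1), stride (2, 1), valid padding
        self.conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=(3, 1), stride=(2, 1)), nn.ReLU(inplace=True),
            nn.Conv2d(1, 1, kernel_size=(3, 1), stride=(2, 1)), nn.ReLU(inplace=True),
        )
        self.mlp = nn.Sequential(
            nn.Flatten(),                               # reorganize the feature matrix into a vector
            nn.LazyLinear(512), nn.ReLU(inplace=True),  # 3 fully connected layers
            nn.Linear(512, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 512), nn.ReLU(inplace=True),
            nn.Linear(512, feat_dim),                   # linear projection to the speech feature
        )

    def forward(self, ppg: torch.Tensor) -> torch.Tensor:
        # ppg: (batch, T, P) phonetic posteriorgram
        x = self.conv(ppg.unsqueeze(1))                 # add a channel dimension
        return self.mlp(x)

speech_features = SpeechNet()(torch.rand(2, 13, 400))   # -> shape (2, 512)
```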
  • Fig. 11 is a schematic flow chart of generating speech features in the embodiment of the present application, as shown in Fig. 11, the process may include:
  • S1101 Use multiple 1-dimensional convolutional layers to process the specific signal in the time dimension to obtain a feature matrix.
  • The number of 1-dimensional convolutional layers is related to the duration corresponding to the specific signal.
  • S1102 Reorganize the feature matrix into a feature vector.
  • S1103 Process the feature vector with 3 fully connected layers and 1 linear projection layer to obtain a 512-dimensional speech feature vector.
  • the dimension of the finally obtained speech feature vector is not limited to only 512 dimensions.
  • the dimension of the speech feature vector is related to the amount of speech data input into the model and the type of loss function adopted by the speech neural network.
  • the speech neural network may be a speech neural network included in the speech and image synchronization measurement model.
  • the visual neural network can adopt a network structure with a relatively light calculation amount.
  • the visual neural network can adopt the backbone network of ResNet18 and make the following changes:
  • the multiple images can be arranged along the channel dimension in increasing order of time and then used as the input of the visual neural network. Therefore, the parameter dimension of the convolution in the first layer of the visual neural network needs to be adjusted accordingly.
  • The input resolution is 128×256 with an aspect ratio of 1:2, which differs from the default input aspect ratio of ResNet18, which is 1:1.
  • the convolution kernel size and stride of the convolutional layer are related to the size of the contour map.
  • the corresponding step size can be set according to the aspect ratio of the contour map, and the size of the convolution kernel can be set slightly larger. In this way, the contour map can be processed at one time by using a convolutional layer with a larger convolution kernel.
  • multiple convolution layers with smaller convolution kernels can also be used for multiple processing.
  • the dimension of the finally obtained visual feature vector is not limited to only 512 dimensions.
  • the dimension of the visual feature vector is related to the amount of visual data input into the model and the type of loss function adopted by the visual neural network.
  • Other deep neural networks, such as MobileNetV2, can also be modified and used as the visual neural network.
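  • Under the assumptions above, the visual branch could be sketched as follows with torchvision: T consecutive 128×256 contour maps are stacked along the channel axis, the first convolution of a ResNet18 backbone is re-parameterized to accept T input channels, and the classifier head is replaced by a 512-dimensional projection. The value T = 5 and the unchanged 7×7 kernel are illustrative choices, not values fixed by this application.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def build_visual_net(num_frames: int = 5, feat_dim: int = 512) -> nn.Module:
    net = resnet18(weights=None)
    # adapt the first layer: num_frames single-channel contour maps instead of one RGB image;
    # the kernel size and stride here could also be tuned to the 1:2 aspect ratio of the input
    net.conv1 = nn.Conv2d(num_frames, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, feat_dim)    # 512-dimensional visual feature
    return net

visual_features = build_visual_net()(torch.rand(2, 5, 128, 256))   # -> shape (2, 512)
```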
  • Fig. 12 is a schematic flow chart of generating visual features in the embodiment of the present application, as shown in Fig. 12, the process may include:
  • S1201 Process the contour map with a convolutional layer to obtain a feature matrix.
  • the convolution kernel size and step size of the convolutional layer are related to the size of the contour map.
  • S1202 Process the feature matrix by using the backbone network of the visual neural network to obtain feature vectors.
  • the backbone network here refers to the main architecture in the neural network.
  • That is, the architecture of a visual neural network in the related art, i.e., its backbone network, is used, and the visual neural network of the embodiments of the present application can be obtained by adaptively modifying the parameters of some of its layers.
  • the visual neural network may be a visual neural network included in the speech and image synchronization measurement model.
  • S1203 Process the feature vector by using a fully connected layer to obtain a 512-dimensional visual feature vector.
  • Portrait videos of a single person talking are used.
  • The interference from background sound is below a certain level.
  • 25 Hz high-definition video can be used. In this way, the accuracy of visual feature extraction training can be improved.
  • The audio signal in each video is processed to 16 kHz, the video signal is divided into frames, and the timeline is recorded; in this way, a speech segment and an image segment are obtained. Then, the speech segment is processed using the processing method of the above steps S402-S405 to obtain a specific signal, referred to as the speech for subsequent sampling, and the image segment is processed using the processing method of the above steps S406-S412 to obtain a face contour map, referred to as the vision for subsequent sampling.
  • the training data can be formally sampled.
  • it mainly includes positive sample sampling and negative sample sampling.
  • the so-called positive sample means that the input speech and vision are synchronized.
  • the so-called negative sample means that the input voice and vision are out of sync.
  • the so-called positive sample means that the voice and vision used in training need to come from the same training video and be synchronized in time.
  • If the speech length is too short, a complete pronunciation may not be included in the speech, which may even affect the understanding of the semantics of the speech.
  • To improve accuracy, the speech frame length can be made larger than the visual frame length.
  • The specific choice of the speech frame length can be determined based on the frame rate of the training video.
  • a frame of image at time T and a speech segment of (T-20ms, T+20ms) can be selected to form a positive sample pair after processing.
  • the length of vision is 1 frame
  • the length of speech is 40ms. Obviously, this is to make the frame length of voice longer than the frame length of vision.
  • the length of the voice is set to 40ms, just to match the frame rate of 25Hz in the training video. However, if a training video with other frame rates is used, the length of the speech can be adjusted accordingly.
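  • The pairing rule described above can be sketched as follows for a 25 Hz video, assuming the frames and the 16 kHz waveform share a common timeline; all names are illustrative.

```python
def positive_pair(frames, frame_idx, waveform, fps=25, sr=16000, half_win_ms=20):
    """Pair the image frame at index `frame_idx` with the 40 ms speech window centred on it."""
    t_ms = frame_idx * 1000.0 / fps                    # timestamp of the frame, in ms
    start = max(int((t_ms - half_win_ms) * sr / 1000), 0)
    end = int((t_ms + half_win_ms) * sr / 1000)
    return frames[frame_idx], waveform[start:end]      # <vision, speech> positive pair
```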
  • a training video is selected from the training video set, referred to as the first training video; another training video is selected from the training video set, referred to as the second training video.
  • the first training video and the second training video are different training videos.
  • the first image segment and the first voice segment obtained from the first training video are processed into the first image data and the first voice data, and then the positive samples are formed.
  • the so-called negative samples mean that the speech and vision used during training are not synchronized.
  • the out-of-sync here can include many situations. In order to be able to train more fully, all situations that are not synchronized can be sampled.
  • image clips and speech clips can be collected from different videos, or image clips and speech clips can be collected from different times in the same video to form negative samples.
  • negative sample sampling can be performed in the following three ways.
  • misplaced negative samples mean that although the speech and vision come from the same training video, the speech and vision are not synchronized in time, that is, there is a small amount of misplacement.
  • a frame of image at time T and a speech segment of (T-t-20ms, T-t+20ms) are collected and processed to form a negative sample pair. That is, the image segment is processed into image data, the speech segment is processed into speech data, and then the sample pair ⁇ speech data, image data> is constructed, abbreviated as ⁇ speech, vision>.
  • For misaligned negative samples: <speech, vision> negative samples are collected from the same video, with the timeline slightly misaligned.
  • A frame of image at time T and the speech segment (T-t-20ms, T-t+20ms) constitute a negative sample pair, where the misalignment duration t between speech and vision needs to be greater than or equal to 2 times the visual duration. In this way, it can be ensured that the speech in the misaligned negative sample is completely staggered from the speech that is synchronized with its vision, thereby ensuring the accuracy of subsequent training.
  • the voice frame length can be adjusted accordingly, and the visual frame length can also be adjusted accordingly.
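  • A sketch of this misaligned sampling is given below: the frame at time T is paired with a speech window shifted by t, where |t| is at least twice the visual duration (one frame is 40 ms at 25 Hz, so |t| ≥ 80 ms here); the shift range is an illustrative assumption.

```python
import random

def misaligned_pair(frames, frame_idx, waveform, fps=25, sr=16000,
                    half_win_ms=20, min_shift_ms=80, max_shift_ms=400):
    """Build a candidate misaligned negative pair <speech, vision> from one training video."""
    t_ms = frame_idx * 1000.0 / fps
    shift = random.uniform(min_shift_ms, max_shift_ms) * random.choice([-1, 1])
    start = max(int((t_ms - shift - half_win_ms) * sr / 1000), 0)
    end = int((t_ms - shift + half_win_ms) * sr / 1000)
    return frames[frame_idx], waveform[start:end]      # candidate for the first negative sample
```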
  • After the misaligned negative sample is obtained, it is used as a candidate negative sample, and the visual rule determination and speech rule determination described below are performed on it to obtain the first negative sample.
  • the so-called voice fixed negative sample means that the voice is extracted from the same training video, and the vision is randomly extracted from a certain training video other than this training video.
  • the speech in the above-mentioned other training video is semantically different from the speech extracted from the same training video above.
  • For fixed-speech negative samples: <speech, vision> negative samples are collected from different videos, in which the speech segment is fixed and a frame of image is randomly sampled from another training video to form a negative sample pair. It must be ensured that the speech in the negative sample pair and the speech in the positive sample pair to which the vision belongs have different semantics. For example, if the semantics of the speech in the negative sample pair is "silent", the semantics of the speech in the positive sample pair to which the vision in the negative sample pair belongs cannot be "silent".
  • After the fixed-speech negative sample is obtained, it is used as a candidate negative sample, and the visual rule determination and speech rule determination described below are performed on it to obtain the second negative sample.
  • the so-called visually fixed negative samples mean that the vision is extracted from the same training video, while the voice is randomly extracted from a certain training video other than this training video.
  • the vision in the above-mentioned other certain training video is different from the vision extracted from the above-mentioned same training video in the movement of the lower half of the face of the person in the image.
  • After the visually fixed negative sample is obtained, it is used as a candidate negative sample, and the visual rule determination and speech rule determination described below are performed on it to obtain the third negative sample.
  • Both the first image segment and the random image segment are images of one or more consecutive time points.
  • For example, T can be set to 5, and the corresponding speech segment is then 200 ms.
  • The speech rule determination refers to judging whether the speech in the candidate negative sample and the speech in the positive sample pair to which the vision in that negative sample belongs are sufficiently different, that is, whether the edit distance D between the two corresponding phoneme sequences is greater than a preset threshold.
  • If the value of D is lower than the preset threshold, the two speech samples are judged to be too similar; if the value of D is higher than the preset threshold, the two speech samples are judged to be sufficiently different.
  • the preset threshold can be obtained statistically from a database.
  • The database may include multiple groups of speech sample pairs manually marked as similar and multiple groups of speech sample pairs manually marked as sufficiently different. Histogram statistics are performed on the edit distances of the two types of manually marked data, and the cutoff value of the edit distance that minimizes the confusion between them is determined as the preset threshold.
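  • A minimal sketch of this speech-rule check follows: each PPG matrix is reduced to a phoneme index sequence by taking the arg-max per frame, the Levenshtein edit distance between the two sequences is computed, and the pair is treated as sufficiently different only if the distance exceeds a threshold; the threshold value below is a placeholder, since in practice it is chosen from the labelled statistics described above.

```python
import numpy as np

def edit_distance(a, b):
    """Levenshtein distance between two sequences, using a single rolling row."""
    dp = np.arange(len(b) + 1)
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i                 # prev holds the previous row's value at column j-1
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return int(dp[-1])

def speech_sufficiently_different(ppg_a, ppg_b, threshold=10):
    seq_a = ppg_a.argmax(axis=1).tolist()      # phoneme index per frame, length T
    seq_b = ppg_b.argmax(axis=1).tolist()
    return edit_distance(seq_a, seq_b) > threshold
```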
  • The visual rule determination is to judge whether the similarity difference between the vision in the candidate negative sample and the vision in the corresponding positive sample is greater than a preset threshold; the preset threshold can be selected according to actual requirements.
  • Thresholding can be used to convert the two contour maps from grayscale images with values 0 to 255 into binary 0/1 contour maps, denoted M_v1 and M_v2.
  • The database may include multiple groups of visual sample pairs manually marked as similar and multiple groups of visual sample pairs manually marked as sufficiently different. Histogram statistics can be performed on the weighted combination of the absolute differences and structural similarities of the two types of manually marked data; the weighting is then adjusted, the weight that minimizes the confusion between the two types of manually marked data is taken as the final weight, and the cutoff value under that optimized weight is determined as the preset threshold.
  • The T frames of images are preprocessed, the differences between corresponding frames of the two visual samples are judged one by one according to the above visual rule, and the ratio of the number of differing frames to the total number of frames in the visual sample is then computed. If the ratio is higher than a preset threshold, the two visual samples are determined to be sufficiently different.
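  • The visual-rule check could be sketched as below: the two grey-scale contour maps are binarized and their mean absolute difference is combined with a structural-similarity term; the binarization threshold, the weight and the decision threshold are placeholders for values that would be tuned on the manually labelled pairs, and scikit-image is assumed to be available.

```python
import numpy as np
from skimage.metrics import structural_similarity

def visual_sufficiently_different(img1, img2, bin_thresh=128, weight=0.5, diff_thresh=0.2):
    m_v1 = (img1 >= bin_thresh).astype(np.float32)     # 0/1 binary contour map M_v1
    m_v2 = (img2 >= bin_thresh).astype(np.float32)     # 0/1 binary contour map M_v2
    abs_diff = np.abs(m_v1 - m_v2).mean()
    dissimilarity = 1.0 - structural_similarity(m_v1, m_v2, data_range=1.0)
    score = weight * abs_diff + (1.0 - weight) * dissimilarity
    return score > diff_thresh
```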
  • the first negative sample, the second negative sample and the third negative sample are obtained, and then the first negative sample, the second negative sample and the third negative sample are used for neural network training .
  • When it is determined that the voice data corresponding to the first image data differs from the second voice data in the posterior probability of the speech category, and the image data corresponding to the first image data and the second voice data differ in the movement of the lower half of the face, the first image data and the second voice data are used to form the first negative sample. When it is determined that the voice data corresponding to the first image data differs from the third voice data in the posterior probability of the speech category, and the first image data differs from the image data corresponding to the third voice data in the movement of the lower half of the face, the first image data and the third voice data are used to form the second negative sample. When it is determined that the voice data corresponding to the second image data differs from the first voice data or the second voice data in the posterior probability of the speech category, and the second image data differs from the image data corresponding to the first voice data or the second voice data in the movement of the lower half of the face, the first voice data or the second voice data and the second image data are used to form the third negative sample.
  • the speech data corresponding to the first image data refers to the speech data in the positive sample pair to which the first image data belongs
  • the speech data corresponding to the second image data refers to the speech data in the positive sample pair to which the second image data belongs.
  • the image data corresponding to the voice data refers to the image data in the positive sample pair to which the first voice data belongs
  • the image data corresponding to the second voice data refers to the image data in the positive sample pair to which the second voice data belongs
  • the image data corresponding to the third voice data refers to the image data in the positive sample pair to which the third speech data belongs.
  • The above misaligned negative sample is used as a candidate negative sample for speech rule determination, that is, to judge whether the second speech data corresponding to the second speech segment and the speech in the positive sample pair to which the first image data corresponding to the first image segment belongs differ in the posterior probability of the speech category, i.e., whether the edit distance between the two corresponding phoneme sequences is greater than the preset threshold; if it is greater than the preset threshold, the two differ in the posterior probability of the speech category.
  • The misaligned negative sample is also used as a candidate negative sample for visual rule determination, that is, to judge whether the similarity difference between the first image data corresponding to the first image segment and the image data in the positive sample pair to which the second speech data corresponding to the second speech segment belongs is greater than the preset threshold; if it is greater than the preset threshold, the two are different.
  • the first image data and the second voice data are combined into a first negative sample.
  • The above fixed-voice negative sample is used as a candidate negative sample for speech rule determination, that is, to determine whether the third voice data corresponding to the third voice segment and the first voice data corresponding to the first image segment differ in the posterior probability of the speech category; the corresponding visual rule determination is performed likewise.
  • the first image data and the third voice data are combined into a second negative sample.
  • The above visually fixed negative sample is used as a candidate negative sample for speech rule determination, that is, to judge whether the first/second speech data corresponding to the first/second speech segment and the speech in the positive sample pair to which the second image data corresponding to the random image segment belongs differ in the posterior probability of the speech category, i.e., whether the edit distance between the two corresponding phoneme sequences is greater than the preset threshold; if it is greater than the preset threshold, the two differ in the posterior probability of the speech category.
  • the second image data and the first/second voice data are combined to form a third negative sample.
  • FIG. 9 shows a schematic diagram of the architecture for measuring the synchronization of speech and images
  • The speech and image synchronization measurement model is composed of the speech neural network, the visual neural network and the synchronization measurement module.
  • Fig. 13 is a schematic flow diagram of the training neural network in the embodiment of the present application, as shown in Fig. 13, the process may include two stages of pre-training and post-training, specifically as follows:
  • S1301 Divide the positive sample, the first negative sample, the second negative sample and the third negative sample into different batches to input the speech and image synchronization measurement model for training, and adjust the parameters in the speech and image synchronization measurement model. Among them, through balanced sampling, the number of positive samples and the number of negative samples in each batch are similar, which is helpful for model training.
  • the parameters in the speech and image synchronization measurement model can be adjusted through the loss function, and the loss function is specifically shown in the following formula (3):
  • where L represents the loss value, N represents the number of samples in the batch, n is the index of a sample, y_n represents the label of the n-th sample, d_p represents the positive sample distance, d_n represents the negative sample distance, v represents the visual feature extracted by the visual neural network, a represents the speech feature extracted by the speech neural network, and margin_1 is a specific value.
  • The margin_1 here can be different from the margin_2 used in the later stage of training.
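  • The body of formula (3) is not reproduced in this text; from the variables listed above, a contrastive loss of the following standard form would be consistent with the description (this reconstruction is an assumption, not necessarily the exact formula of the application):

$$L=\frac{1}{N}\sum_{n=1}^{N}\Big[\,y_{n}\,d_{p}^{2}+(1-y_{n})\max\big(\mathrm{margin}_{1}-d_{n},\,0\big)^{2}\Big],\qquad d=\lVert v-a\rVert_{2}$$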
  • For example, the batch size can be set to 256.
  • In the pre-training stage, train for 1000 epochs with the learning rate initially set to 0.005.
  • Use the cosine decay strategy to gradually decay the learning rate to 0 after 100 epochs.
  • In the later training stage, train for 500 epochs with the learning rate initially set to 0.001, and use the cosine decay strategy to gradually decay the learning rate to 0 after 100 epochs.
  • the above-mentioned specific training parameters and model parameters in use need to be adjusted accordingly as the database changes. Certainly, other specific manners may also be adopted, which are not limited here.
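  • The quoted schedule (batch size 256, initial learning rate 0.005, cosine decay of the learning rate towards 0 over 100 epochs) could be set up in PyTorch as sketched below; the use of SGD with momentum is an assumption, since the optimizer itself is not specified here.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                     # stand-in for the full measurement model
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=0.0)

for epoch in range(100):
    # ... per-batch forward/backward passes with batches of 256 samples would go here ...
    optimizer.step()                            # placeholder for the per-batch update loop
    scheduler.step()                            # decay the learning rate once per epoch
```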
  • The online hard sample mining strategy can then be applied in each training batch, and the model can be trained again using the hard samples mined online, until the accuracy of the trained model stays within a certain interval and no longer fluctuates greatly.
  • The speech features a_i and visual features v_i, i = 1, ..., N, are extracted through the current speech neural network and visual neural network. Then, the difficult positive samples within each batch are found. The difficult positive samples are specifically shown in the following formula (4):
  • where v represents the visual features extracted by the visual neural network, a represents the speech features extracted by the speech neural network, and N represents the number of positive samples in the batch.
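  • The body of formula (4) is likewise not reproduced here; one plausible reading, consistent with the later description of in-batch mining, is that the difficult positive sample is the positive pair with the largest feature distance within the batch (this reconstruction is an assumption):

$$\hat{i}=\arg\max_{1\le i\le N}\;\lVert v_{i}-a_{i}\rVert_{2}$$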
  • the negative samples in the batch are generated according to the positive samples in the batch, and a plurality of difficult negative samples in the negative samples in the batch are obtained.
  • The negative samples in the batch are generated according to the positive samples in each batch; specifically, the N speech features and N visual features obtained from the N positive samples of the training batch in step S1302 are combined in pairs to form an N×N matrix, the positive sample combinations on the diagonal are excluded, and N×(N-1) combinations are obtained as candidate negative samples.
  • The qualified negative samples obtained are the negative samples within the batch.
  • Each positive sample in step S1302 corresponds to multiple negative samples.
  • Step S1303 is to find the difficult negative samples among the multiple negative samples corresponding to each positive sample.
  • Obtaining multiple difficult negative samples among the negative samples in each batch specifically means sorting the negative samples corresponding to the speech feature a_i according to the loss value output by the loss function and obtaining the difficult negative sample corresponding to the speech feature a_i according to the loss value; and/or sorting the negative samples corresponding to the visual feature v_i according to the loss value output by the loss function and obtaining the difficult negative sample corresponding to the visual feature v_i according to the loss value.
  • Each row i of the matrix contains the negative samples corresponding to the speech of the i-th positive sample, and the one with the largest loss value in each row is recorded as the difficult negative sample corresponding to the speech of the i-th positive sample; similarly, each column i of the matrix contains the negative samples corresponding to the vision of the i-th positive sample, and the one with the largest loss value in each column is recorded as the difficult negative sample corresponding to the vision of the i-th positive sample.
  • For negative samples, the largest loss value corresponds to the smallest distance.
  • margin 2 is a specific value.
  • If the j-th column does not contain any qualified negative sample, no difficult negative sample is selected from that column.
  • the essence of hard negative sample mining is sorting.
  • For a speech sample a_j, traverse all the visual samples in the batch and construct the negative sample pairs (v_0, a_j), ..., (v_N, a_j); if there are qualified negative samples, select one difficult negative sample pair from the qualified negative samples.
  • For a visual sample v_j, traverse all the speech samples in the batch and construct the negative sample pairs (v_j, a_0), ..., (v_j, a_N); if there are qualified negative samples, select one difficult negative sample pair from the qualified negative samples.
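  • A sketch of this in-batch mining is given below: from the N positive <vision, speech> pairs, the N×N distance matrix is formed, the diagonal (the positives) is excluded, and for each row/column the off-diagonal pair with the smallest distance (hence the largest loss) is taken as the hard negative; the speech-rule and visual-rule qualification of candidates is omitted for brevity.

```python
import torch

def mine_hard_negatives(v: torch.Tensor, a: torch.Tensor):
    """v, a: (N, 512) visual and speech features of the N positive pairs in a batch."""
    dist = torch.cdist(v, a)                                 # dist[i, j] = ||v_i - a_j||_2
    n = dist.size(0)
    masked = dist + torch.eye(n, device=dist.device) * 1e9   # ignore the positive pairs on the diagonal
    hard_vision_for_speech = masked.argmin(dim=0)            # per column j: hardest v_i for speech a_j
    hard_speech_for_vision = masked.argmin(dim=1)            # per row i: hardest a_j for vision v_i
    return hard_vision_for_speech, hard_speech_for_vision
```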
  • S1304 Input the difficult positive samples and multiple difficult negative samples into the speech and image synchronization measurement model after parameter adjustment for training, and adjust the parameters in the speech and image synchronization measurement model again.
  • the loss function corresponding to the speech and image synchronization measurement model will also undergo some changes accordingly.
  • the changed loss function is specifically shown in the following formula (7):
  • where l represents the loss value, d_p^j represents the distance of the difficult positive sample for the j-th positive sample, d_n^{a,j} represents the distance of the difficult negative sample corresponding to the speech of the j-th positive sample (and d_n^{v,j} that corresponding to its vision), N represents the number of samples in the batch, and margin_2 is a specific value.
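  • The body of formula (7) is not reproduced in this text either; under the notation above, a reconstruction consistent with the description would penalize the difficult positive distance and hinge the two difficult negative distances against margin_2, for example (this is an assumption):

$$l=\frac{1}{N}\sum_{j=1}^{N}\Big[\big(d_{p}^{\,j}\big)^{2}+\max\big(\mathrm{margin}_{2}-d_{n}^{\,a,j},\,0\big)^{2}+\max\big(\mathrm{margin}_{2}-d_{n}^{\,v,j},\,0\big)^{2}\Big]$$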
  • the parameters in the speech and image synchronization measurement model can be further adjusted, the model can be further optimized, and the accuracy of model prediction can be improved.
  • optimization is not performed only once, but multiple times. That is to say, after using the current batch of training data to optimize the model once, use the next batch of training data again to obtain the corresponding difficult positive samples and difficult negative samples, and then input them into the current model for training again. Repeat many times until the output value of the corresponding loss function is maintained in a stable area, that is, the output value is in a certain accuracy range and no longer fluctuates greatly.
  • multiple hard negative samples correspond to each sample in the positive samples.
  • Steps S1305 , S1306 , and S1307 are similar to the specific implementation manners of steps S1302 , S1303 , and S1304 described above, and will not be repeated here.
  • the training of the speech and image synchronization measurement model is completed.
  • m is less than or equal to M, where M is the number of batches into which the positive samples are divided.
  • FIG. 14 is a schematic flowchart of a complete measurement method for voice and image synchronization in the embodiment of the present application.
  • the video stream is input to the preprocessing module, and the video stream is preprocessed to obtain the audio segment.
  • Input the audio segment into the SI-ASR system and process it into a PPG signal, which is used as the speech data.
  • input the speech data into the speech neural network to obtain speech features.
  • At the same time, dense face alignment is performed on the video stream frame by frame. In one frame of image there may be multiple faces, and the following steps need to be performed for each face: extract the expression coefficients from the face.
  • the expression coefficients extracted from the face image are used to generate a 3D model. Project the corresponding vertices in the 3D model to obtain a contour map. Accumulate the obtained multi-frame contour maps into one image data. Then input the image data into the visual neural network to obtain visual features. Finally, the speech features and visual features are input into the synchronicity measurement module to measure whether the speech and images in the video stream are synchronized. If the threshold is met, it is determined to be in sync; if the threshold is not met, it is determined to be out of sync. Through the synchronicity measurement module, the synchronicity of speech features and visual features can be judged.
  • The specific measure of synchronization can be realized by calculating the vector distance between the speech feature and the visual feature and then comparing it with a preset threshold. Finally, through the synchronization measurement module, the face with the best synchronization can be determined. If the synchronization of no face in the video reaches the preset threshold, it is determined that there is no suitable face in the video image for the current time segment.
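  • The decision step described here can be sketched as follows: compute the vector distance between the speech feature and each candidate face's visual feature, pick the best-synchronized face, and accept it only if the distance passes a preset threshold; the threshold value below is a placeholder.

```python
import torch

def pick_speaker(speech_feat: torch.Tensor, visual_feats: torch.Tensor, threshold: float = 0.8):
    """speech_feat: (512,); visual_feats: (num_faces, 512) - one row per face in the frame."""
    dists = torch.norm(visual_feats - speech_feat, dim=1)
    best = int(torch.argmin(dists))
    if dists[best] <= threshold:
        return best        # index of the face judged to be the current speaker
    return None            # no face in the frame is synchronized with the speech
```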
  • Fig. 15 is a schematic structural diagram of a measuring device for voice and image synchronization in the embodiment of the present application. Referring to Fig. 15, the device may include:
  • the receiving module 1501 is configured to acquire audio clips and image clips in the video.
  • The data processing module 1502 is configured to perform any one of the following operations: convert the speech segment into a specific signal and obtain the speech features of the specific signal and the visual features of the image segment, the specific signal being independent of the personal characteristics of the speaker in the speech segment; or, generate a contour map of the target person according to the image segment and obtain the visual features of the contour map and the speech features of the speech segment, the contour map being independent of the personal characteristics of the target person; or, convert the speech segment into a specific signal, generate a contour map of the target person according to the image segment, and obtain the speech features of the specific signal and the visual features of the contour map.
  • the synchronization measurement module 1503 is used to determine whether the voice segment and the image segment are synchronized according to the voice feature and the visual feature.
  • the embodiment of the present application also provides a device for measuring the synchronization of voice and image.
  • Fig. 16 Schematic diagram of the second structure of the device for measuring the synchronization of voice and image in the embodiment of the present application. Referring to Fig. 16, the device may include:
  • the receiving module 1601 is configured to acquire audio clips and image clips in the video.
  • The preprocessing module 1602 is configured to convert the sampling frequency of the speech segment into a specific frequency; correspondingly, the data processing module 1603 is configured to convert the speech segment whose sampling frequency has been converted into the specific frequency into a specific signal.
  • the preprocessing module 1602 is configured to remove the background sound in the speech segment, and separate the voices of different speakers in the speech segment after the background sound is removed, to obtain at least one speech sub-segment; correspondingly, the data processing module 1603, for converting the speech sub-segment into a specific signal.
  • the preprocessing module 1602 is used to divide the speech segment into multiple speech frames in a sliding weighted manner, and there is overlap between adjacent speech frames; correspondingly, the data processing module 1603 is used to Convert multiple speech frames into multiple specific signals respectively.
  • the specific signal is a speech category posterior probability PPG signal.
  • The data processing module 1603 is specifically configured to convert the speech segment into a speech category posterior probability (PPG) signal through a speaker-independent automatic speech recognition (SI-ASR) system.
  • the feature extraction module 1604 is also used to obtain visual features of image segments through a visual neural network.
  • the feature extraction module 1604 includes:
  • the first extraction unit 1604a is configured to use multiple 1-dimensional convolutional layers to process the specific signal in the time dimension to obtain a feature matrix, and the number of 1-dimensional convolutional layers is related to the duration corresponding to the specific signal;
  • the second extraction unit 1604b is configured to reorganize the feature matrix into a feature vector
  • The third extraction unit 1604c is configured to process the feature vector by using 3 fully connected layers and 1 linear projection layer to obtain 512-dimensional speech features.
  • Synchronization measurement module 1605 configured to determine whether the voice segment and the image segment are synchronized according to the voice feature and visual feature.
  • FIG. 17 is a structural schematic diagram three of a measurement device for voice and image synchronization in the embodiment of the present application. Referring to Fig. 17, the device may include:
  • the receiving module 1701 is configured to acquire audio clips and image clips in the video.
  • the preprocessing module 1702 includes:
  • the detection unit 1702a is configured to perform face detection on the image segment to obtain a face detection frame
  • an alignment unit 1702b configured to horizontally align the faces in the face detection frame
  • the data processing module 1703 is configured to generate a contour map of the target person according to the image segment, and the contour map has nothing to do with the personal characteristics of the target person.
  • When the contour map is a face contour map, the data processing module 1703 includes:
  • the extraction unit 1703a is configured to extract the expression coefficient of the target person from the image segment.
  • the generation unit 1703b is configured to generate the face contour map of the target person based on the expression coefficients and the general parameterized face model.
  • the extraction unit 1703a is specifically configured to extract the expression coefficients of the target person in the image segment through a three-dimensional deformable parameterized face model parameter estimation algorithm, and the expression coefficients conform to the three-dimensional deformable parameterized face model standard.
  • the generation unit 1703b is specifically used to extract the lower half face expression coefficient corresponding to the lower half face in the expression coefficients; input the lower half face expression coefficient into the general three-dimensional human face model to obtain the lower half face of the target person The corresponding three-dimensional face model is processed, and the three-dimensional face model is processed into a face contour map of the target person.
  • Further, the generation unit 1703b is specifically configured to input the lower-half-face expression coefficients into a general 3D face model to obtain a 3D face model corresponding to the lower half of the target person's face; obtain the set of vertices of the lower half face in the 3D face model; project the vertex set onto a two-dimensional plane to obtain the lower-half-face contour map of the target person; and use the lower-half-face contour map as the face contour map of the target person.
  • The feature extraction module 1704 is used to obtain the speech features of the speech segment through the speech neural network.
  • The feature extraction module 1704 includes:
  • the first extraction unit 1704a is configured to use a convolution layer to process the contour map to obtain a feature matrix, and the convolution kernel size and step size of the convolution layer are related to the size of the contour map;
  • the second extraction unit 1704b is configured to use the backbone network of the visual neural network to process the feature matrix to obtain a feature vector
  • The third extraction unit 1704c is configured to use a fully connected layer to process the feature vector to obtain a 512-dimensional visual feature.
  • the synchronization measurement module 1705 is configured to determine whether the voice segment and the image segment are synchronized according to the voice feature and the visual feature.
  • the synchronization measurement module 1705 is configured to determine the speaker corresponding to the voice segment in the video according to the voice features and visual features.
  • the synchronization measurement module 1705 is configured to determine whether the voice segment in the video belongs to the person in the image segment according to the voice feature and visual feature.
  • the synchronization measurement module 1705 is used to align the start bits of the voice segment and the image segment in the video according to the voice feature and visual feature, so that the voice segment and the image segment are synchronized.
  • Fig. 18 is a schematic structural diagram of a training device for a speech and image synchronization measurement model in the embodiment of the present application. Referring to Fig. 18, the device may include:
  • a data processing module 1801 configured to process the first image segment into first image data, the first voice segment into first voice data, and the second voice segment into second voice data, wherein: the first image segment, the first The voice segment and the second voice segment are from the first training video, the first image segment is synchronized with the first voice segment, and the first image segment is not synchronized with the second voice segment.
  • the data processing module 1801 is further configured to process a random image segment into second image data, and a random voice segment into third voice data, wherein: the random image segment and the random voice segment are from the second training video.
  • a sample generating module 1802 configured to form positive samples from the first image data and the first voice data.
  • the sample generating module 1802 is further configured to compose the first image data and the second voice data into a first negative sample.
  • the sample generating module 1802 is further configured to form the first image data and the third voice data into a second negative sample.
  • the sample generating module 1802 is further configured to form the first voice data or the second voice data, and the second image data into a third negative sample.
  • the training module 1803 is used to train the speech and image synchronization measurement model by using the positive sample, the first negative sample, the second negative sample and the third negative sample.
  • an embodiment of the present application also provides a training device for a speech and image synchronization measurement model.
  • Fig. 19 is a schematic diagram of the second structure of the training device of the speech and image synchronization measurement model in the embodiment of the present application. Referring to Fig. 19, the device may include:
  • the receiving module 1901 is configured to acquire the first image segment, the first voice segment, and the second voice segment in the first training video, the first image segment and the first voice segment have synchronization, the first image segment and the second voice segment Not synchronous.
  • the receiving module 1901 is further configured to acquire random image segments and random voice segments, the random image segments and random voice segments are from the second training video.
  • the frame lengths of the first image segment and the random image segment are smaller than the frame lengths of the first audio segment, the second audio segment or the random audio segment.
  • the voice frame number of the voice data is related to the image frame number of the image data
  • the voice data includes the first voice data, the second voice data or the third voice data
  • The image data includes the first image data or the second image data.
  • the duration of the dislocation between the second audio segment and the first image segment is greater than or equal to twice the total duration of the second audio segment.
  • both the first image segment and the random image segment are images of one or more consecutive time points.
  • the training video is a portrait video of a single person speaking, and the interference degree of the background sound in the training video is less than a certain degree, wherein the training video includes a first training video and a second training video.
  • the data processing module 1902 is configured to extract the contour map of the target person from the first image segment and the random image segment respectively, and the contour map has nothing to do with the personal characteristics of the target person; and/or,
  • the data processing module 1902 is further configured to respectively convert the first voice segment, the second voice segment and the random voice segment into specific signals, and the specific signal is related to the speaker in the first voice segment, the second voice segment and the random voice segment Personal characteristics are irrelevant.
  • the sample generation module 1903 is used to form the first image data and the first voice data into positive samples; the sample generation module 1903 is also used to form the first image data and the second voice data into the first positive sample. negative sample; the sample generating module 1903 is also used to form the first image data and the third voice data into a second negative sample; the sample generating module 1903 is also used to combine the first voice data or the second voice data, and the second image data form a third negative sample.
  • The sample generation module 1903 is specifically configured to: when it is determined that the voice data corresponding to the first image data differs from the second voice data in the speech category posterior probability (PPG), and the image data corresponding to the first image data and the second voice data differ in the movement of the lower half of the face, compose the first image data and the second voice data into the first negative sample; when it is determined that the voice data corresponding to the first image data differs from the third voice data in the speech category posterior probability, and the image data corresponding to the first image data and the third voice data differ in the movement of the lower half of the face, compose the first image data and the third voice data into the second negative sample; and when it is determined that the voice data corresponding to the second image data differs from the first voice data or the second voice data in the speech category posterior probability, and the second image data differs from the image data corresponding to the first voice data or the second voice data in the movement of the lower half of the face, compose the first voice data or the second voice data and the second image data into the third negative sample.
  • the training module 1904 includes:
  • the parameter adjustment unit 1904a is used to input the positive sample, the first negative sample, the second negative sample and the third negative sample into the speech and image synchronization measurement model in batches for training, and adjust the parameters in the speech and image synchronization measurement model ;
  • Difficult sample selection unit 1904b configured to obtain difficult positive samples among positive samples in each batch
  • the difficult sample selection unit 1904b is also used to generate negative samples in this batch according to the positive samples in each batch, and obtain multiple hard negative samples in the negative samples in each batch;
  • The parameter readjustment unit 1904c is used to input the difficult positive samples and multiple difficult negative samples into the speech and image synchronization measurement model after parameter adjustment for training, and to adjust the parameters in the speech and image synchronization measurement model again, until the loss value output by the loss function corresponding to the speech and image synchronization measurement model converges.
  • The difficult sample selection unit 1904b is also used to combine in pairs the N speech features a_i and N visual features v_i corresponding to the N positive samples in each batch to obtain N×(N-1) candidate negative samples.
  • The difficult sample selection unit 1904b is also used to sort the negative samples corresponding to the speech feature a_i according to the loss value output by the loss function and obtain the difficult negative sample corresponding to the speech feature a_i according to the loss value; and/or sort the negative samples corresponding to the visual feature v_i according to the loss value output by the loss function and obtain the difficult negative sample corresponding to the visual feature v_i according to the loss value.
  • the difficult sample selection unit 1904b is also used to divide all positive samples into different batches; sort the positive samples in each batch according to the loss value output by the loss function; obtain the current The hard-to-positive sample among the positive samples described in the batch.
  • m is less than or equal to M (M is the batch divided by positive samples).
  • the description of the above embodiment of the training device for the speech and image synchronization measurement model is similar to the description of the above embodiment of the training method for the speech and image synchronization measurement model, and has beneficial effects similar to those of the method embodiment.
  • for technical details not disclosed in the device embodiments of the present application, please refer to the description of the method embodiments of the present application.
  • Fig. 20 is a schematic structural diagram of an electronic device in an embodiment of the present application.
  • the electronic device may include: a processor 2001, a memory 2002, and a bus 2003; the processor 2001 and the memory 2002 communicate with each other through the bus 2003; the processor 2001 is configured to call the program instructions in the memory 2002, so as to execute the method in one or more of the above embodiments.
  • an embodiment of the present application also provides a computer-readable storage medium, which may include: a stored program; wherein, when the program runs, the device where the storage medium is located is controlled to execute the method in one or more of the above embodiments.

Abstract

本申请提供一种语音与图像同步性的衡量方法、模型的训练方法及装置,语音与图像同步性的衡量方法包括:获取视频中的语音片段和图像片段,语音片段和图像片段在所述视频中具有对应关系;执行以下操作中的任意一项:将语音片段转换为特定信号并通过预先训练的语音与图像同步性衡量模型获得特定信号的语音特征以及图像片段的视觉特征,特定信号与语音片段中说话人的个人特征无关;或,根据图像片段生成目标人物的轮廓图并通过预先训练的语音与图像同步性衡量模型获得轮廓图的视觉特征以及语音片段的语音特征,轮廓图与所述目标人物的个人特征无关;或,将语音片段转换为特定信号,根据图像片段生成目标人物的轮廓图,并通过预先训练的语音与图像同步性衡量模型获得特定信号的语音特征以及轮廓图的视觉特征;根据语音特征以及所述视觉特征,确定语音片段与所述图像片段是否具有同步性,同步性用于表征语音片段中的声音与图像片段中目标人物的运动相匹配。

Description

语音与图像同步性的衡量方法、模型的训练方法及装置
本申请要求于2021年09月09日提交的申请号为202111057976.9、名称为“语音与图像同步性的衡量方法及装置”、2021年09月09日提交的申请号为202111056592.5、名称为“语音与图像同步性的衡量方法及装置”、于2021年09月09日提交的申请号为202111058177.3、名称为“语音与图像同步性衡量模型的训练方法及装置”的中国专利申请的优先权,上述申请的内容通过引用并入本文。
技术领域
本申请涉及视频处理技术领域,尤其涉及一种语音与图像同步性的衡量方法、模型的训练方法及装置。
背景技术
在一段视频中,往往都包含有图像和语音。并且,当视频中的人物说话时,图像中该人物的嘴部运动应当与该人物所发出的语音保持同步。
为了衡量视频中人物的嘴部运动与其所发出的语音是否同步,一般采用的是SyncNet类技术。所谓SyncNet类技术,可以参考文献Chung,Joon Son,Andrew Zisserman.“时间偏移:自然环境下的自动唇音同步”,亚洲计算机视觉会议,Springer,Cham,2016(Chung,Joon Son,and Andrew Zisserman.“Out of time:automated lip sync in the wild.”Asian conference on computer vision.Springer,Cham,2016)。一般是将视频中的语音片段输入一个神经网络,并将视频中的图像片段输入另一个神经网络,从而获得语音特征和视觉特征,再通过对比语音特征与视觉特征,判断视频中人物的嘴部运动与其所发出的语音是否同步。
但是,采用SyncNet类技术衡量视频中人物的嘴部运动与其所发出的语音是否同步,准确性仍然较低。
发明内容
本申请第一方面提供一种语音与图像同步性的衡量方法,所述方法包括:
获取视频中的语音片段和图像片段,语音片段和图像片段在所述视频中具有对应关系;执行以下操作中的任意一项:将语音片段转换为特定信号并获取特定信号的语音特征以及图像片段的视觉特征,特定信号与语音片段中说话人的个人特征无关;或,根据图像片段生成目标人物的轮廓图并获取轮廓图的视觉特征以及语音片段的语音特征,轮廓图与所述目标人物的个人特征无关;或,将语音片段转换为特定信号,根据图像片段生成目标人物的轮廓图,并获取特定信号的语音特征以及轮廓图的视觉特征;根据语音特征以及所述视觉特征,确定语音片段与所述图像片段是否具有同步性,同步性用于表征语音片段中的声音与图像片段中目标人物的运动相匹配。
本申请第二方面提供一种语音与图像同步性衡量模型的训练方法,所述方法包括:将第一图像片段处理为第一图像数据、第一语音片段处理为第一语音数据、第二语音片 段处理为第二语音数据,将随机图像片段处理为第二图像数据、所述随机语音片段处理为第三语音数据,将所述第一图像数据和所述第一语音数据组成正样本,将所述第一图像数据和所述第二语音数据组成第一负样本,将所述第一图像数据和所述第三语音数据组成第二负样本,将所述第一语音数据或所述第二语音数据,和所述第二图像数据组成第三负样本,采用所述正样本、所述第一负样本、所述第二负样本和所述第三负样本训练语音与图像同步性衡量模型。
本申请第三方面提供一种语音与图像同步性的衡量装置,所述装置包括:
接收模块,用于获取视频中的语音片段和图像片段,语音片段与所述图像片段在视频中具有对应关系;数据处理模块,用于执行以下操作中的任意一项:将语音片段转换为特定信号并获取特定信号的语音特征以及图像片段的视觉特征,特定信号与语音片段中说话人的个人特征无关;或,根据图像片段生成目标人物的轮廓图并获取所述轮廓图的视觉特征以及语音片段的语音特征,轮廓图与目标人物的个人特征无关;或,将语音片段转换为特定信号,根据图像片段生成目标人物的轮廓图,并获取特定信号的语音特征以及轮廓图的视觉特征;同步性衡量模块,根据语音特征与所述视觉特征,确定语音片段与所述图像片段是否具有同步性,同步性用于表征语音片段中的声音与图像片段中所述目标人物的运动相匹配。
本申请第四方面提供一种语音与图像同步性衡量模型的训练装置,所述装置包括:数据处理模块,用于将第一图像片段处理为第一图像数据、第一语音片段处理为第一语音数据、第二语音片段处理为第二语音数据;数据处理模块,还用于将随机图像片段处理为第二图像数据、所述随机语音片段处理为第三语音数据;样本生成模块,用于将所述第一图像数据和所述第一语音数据组成正样本;所述样本生成模块,还用于将所述第一图像数据和所述第二语音数据组成第一负样本;所述样本生成模块,还用于将所述第一图像数据和所述第三语音数据组成第二负样本;所述样本生成模块,还用于将所述第一语音数据或所述第二语音数据,和所述第二图像数据组成第三负样本;训练模块,用于采用所述正样本、所述第一负样本、所述第二负样本和所述第三负样本训练语音与图像同步性衡量模型。
本申请第五方面提供一种电子设备,包括:处理器、存储器、总线;其中,所述处理器、所述存储器通过所述总线完成相互间的通信;所述处理器用于调用所述存储器中的程序指令,以执行第一方面或第二方面的方法。
本申请第六方面提供一种计算机可读存储介质,包括:存储的程序;其中,在所述程序运行时控制所述存储介质所在设备执行第一方面或第二方面的方法。
附图说明
通过参考附图阅读下文的详细描述,本申请示例性实施方式的上述以及其他目的、特征和优点将变得易于理解。在附图中,以示例性而非限制性的方式示出了本申请的若干实施方式,相同或对应的标号表示相同或对应的部分,其中:
图1为本申请实施例中图像片段的示意图一;
图2为本申请实施例中图像片段的示意图二;
图3A为本申请实施例中一种语音与图像同步性的衡量方法的流程示意图;
图3B为本申请实施例中另一种语音与图像同步性的衡量方法的流程示意图;
图3C为本申请实施例中再一种语音与图像同步性的衡量方法的流程示意图;
图4为本申请实施例中一种语音与图像同步性的衡量方法的流程示意图;
图5为本申请实施例中处理语音片段的流程示意图;
图6为本申请实施例中下半脸的范围的示意图;
图7为本申请实施例中处理图像片段的流程示意图;
图8为本申请实施例中语音与图像同步性衡量模型的训练方法的流程示意图;
图9为本申请实施例中衡量语音与图像同步性的架构示意图;
图10为本申请实施例中语音神经网络的架构示意图;
图11为本申请实施例中生成语音特征的流程示意图;
图12为本申请实施例中生成视觉特征的流程示意图;
图13为本申请实施例中训练神经网络的流程示意图;
图14为本申请实施例中语音与图像同步性的衡量方法的完整流程示意图;
图15为本申请实施例中语音与图像同步性的衡量装置的结构示意图一;
图16为本申请实施例中语音与图像同步性的衡量装置的结构示意图二;
图17为本申请实施例中语音与图像同步性的衡量装置的结构示意图三;
图18为本申请实施例中语音与图像同步性衡量模型的训练装置的结构示意图一;
图19为本申请实施例中语音与图像同步性衡量模型的训练装置的结构示意图二;
图20为本申请实施例中电子设备的结构示意图。
具体实施方式
下面将参照附图更详细地描述本申请的示例性实施方式。虽然附图中显示了本申请的示例性实施方式,然而应当理解,可以以各种形式实现本申请而不应被这里阐述的实施方式所限制。相反,提供这些实施方式是为了能够更透彻地理解本申请,并且能够将本申请的范围完整的传达给本领域的技术人员。
需要注意的是,除非另有说明,本申请使用的技术术语或者科学术语应当为本申请所属领域技术人员所理解的通常意义。
在相关技术中,采用SyncNet类技术衡量视频中人物的嘴部运动与其所发出的语音是否同步,准确性较低。
发明人经过仔细研究发现,SyncNet类技术衡量嘴部运动与语音是否同步准确性低的原因在于:SyncNet类技术中需要使用到两个神经网络。一个是语音神经网络,用于提取语音特征。一个是视觉神经网络,用于提取视觉特征。无论是语音神经网络,还是视觉神经网络,在进行训练时,都无法做到与说话人的个人特征无关。也就是说,在采用样本进行训练时,样本中携带有说话人的个人特征,而训练后的网络中也学习到了样本中说话人的个人特征,其中,说话人的个人特征包括语音个人特征(例如,音色、语调等)、视觉个人特征(例如,嘴唇薄厚、嘴大小等)等。对于样本中未覆盖到的说话人,通过语音神经网络和视觉神经网络获取的语音特征和视觉特征的准确性就会有所下降。
此外,一方面,SyncNet类技术也很难做到与坐标系无关。也就是说,在通过视觉神经网络提取视觉特征时,主要提取的是嘴部特征。而嘴部特征的提取对嘴部对齐十分 敏感。当说话人发生转头等三维运动时,嘴部对齐就会产生困难。嘴部对齐导致的相对运动和因为说话导致的嘴部运动耦合在一起,使得SyncNet类技术对于嘴部特征提取的准确性明显下降。图1为本申请实施例中图像片段的示意图一,参见图1所示,该图像片段中存在3帧图像。在第1帧图像中,人物正在说话。到第2帧图像时,人物的头部发生转动,并且图像中嘴部的位置和缩放比例也与第1帧图像中人物的正脸不同。在第3帧图像中,人物仍在继续说话。而使用SyncNet类技术以二维的方式兼容这种三维的运动,显然会影响嘴部运动与语音同步性判断的准确性。
另一方面,SyncNet类技术对图像中遮挡的鲁棒性较差。也就是说,当图像中说话人的脸部被部分遮挡时,视觉神经网络无法准确地提取说话人的嘴部特征,在提取的嘴部特征中包含有遮挡物的特征。这样,也会降低嘴部运动与语音同步性判断的准确性。图2为本申请实施例中图像片段的示意图二,参见图2所示,在这两幅图像中,人物的嘴部分别被手指和笔部分遮挡。这类遮挡会影响图像中嘴部的对齐,并且,获得的嘴部特征中也会混入遮挡物,进而影响嘴部运动与语音同步性判断的准确性。
有鉴于此,本申请实施例提供了一种语音与图像同步性的衡量方法,在该方法中,先对语音片段或图像片段进行处理,去除语音片段中或图像片段与人物个体相关的特征,再将对语音片段或图像片段处理后得到的语音数据或图像数据进行特征提取处理。这样,获取的语音特征或视觉特征就不再携带有说话人的个人特征,进而提高语音与图像同步性衡量的准确性。或者,也可以先对语音片段和图像片段都进行处理,去除语音片段中和图像片段与人物个体相关的特征,再将对语音片段和图像片段处理后得到的语音数据和图像数据进行特征提取处理。这样,获取的语音特征和视觉特征就不再携带有说话人的个人特征,进而提高语音与图像同步性衡量的准确性。
图3A为本申请实施例中一种语音与图像同步性的衡量方法的流程示意图,参见图3A所示,该方法可以包括:
S301:获取视频中的语音片段和图像片段。
此处的视频是指其所包括的图像与语音之间的同步性有待判断的视频。这里的同步性,用于表征语音片段中的声音与图像片段中目标人物的运动相匹配。
所谓相匹配,是指在一段视频中,图像片段中目标人物的运动所发出的声音与语音片段中的声音在语义和时间上是相同的。其中,目标人物的运动一般是指目标人物的下半脸运动,具体可以是与嘴部相关的运动。
举例来说,图像片段中目标人物的嘴部做出了发出“苹果”这一声音的嘴部运动,并且语音片段中的声音也是“苹果”,那么就可以认为该图像片段与语音片段具有同步性。再有,图像片段中目标人物的嘴部做出了发出“苹果”这一声音的嘴部运动,并且语音片段中的声音是“香蕉”,那么就可以认为该图像片段与语音片段不具有同步性。
一般来说,并不会直接将视频中的所有图像与所有语音放在一起进行判断,而是将视频中的一部分图像与相应的语音放在一起进行判断。选取的部分图像就是视频中的图像片段,相应的,选取的语音也就是视频中的语音片段。选取的语音片段与图像片段在视频中具有对应关系。
所谓对应关系,是指选取的语音片段和图像片段在视频中的起始时间相同、终止时 间相同或者在时间上具有一定的错位(该错位在人眼的视觉范围内是可以被接受的)。
举例来说,获取视频中第1帧至第10帧对应的图像和语音。视频中第1帧至第10帧的图像就组成了图像片段,视频中第1帧至第10帧的语音就组成了语音片段。这里的第1帧至第10帧就是一个具体的位置。对于获取图像片段和语音片段的具体位置,可以根据实际情况设置,此处不做具体限定。
当然,图像片段还可以是某1帧图像,相应的语音片段还可以是该帧的语音以及该帧前后几帧的语音。
在判断完视频中的一部分图像与相应的语音是否同步后,再判断视频中的另一部分图像与相应的语音是否同步,直到视频中所有的图像与相应的语音的同步性判断完成为止。
S3021:将语音片段转换为特定信号并获取特定信号的语音特征以及图像片段的视觉特征,特定信号与语音片段中说话人的个人特征无关。
在一种实施方式中,可以通过语音与图像同步性衡量模型来提取上述语音特征和视觉特征。该语音与图像同步性衡量模型可以包括语音神经网络、视觉神经网络以及同步性衡量模块,其中,语音神经网络可以用于提取输入信号(例如特定信号)的语音特征,视觉神经网络可以用于提取输入信号(例如图像片段)的视觉特征,同步性衡量模块可以用于对语音片段和图像片段之间是否具有同步性进行判断。
具体地,可以将语音片段输入语音神经网络,通过语音神经网络对语音片段进行处理,语音神经网络的输出就是语音特征。这里的语音神经网络可以是任何一种能够获取语音片段中语音特征的神经网络。对于语音神经网络的具体类型,此处不做具体限定。将图像片段输入视觉神经网络,通过视觉神经网络对图像片段进行处理,视觉神经网络的输出就是视觉特征。这里的视觉神经网络可以是任何一种能够获取图像片段中视觉特征的神经网络。对于视觉神经网络的具体类型,此处不做具体限定。
在将语音片段输入语音神经网络中获取语音特征前,可以先对语音片段进行处理,将语音片段中人物的个人特征删除,也就是从语音片段中提取与语音中人物的个人特征无关的语义特征。
举例来说,不同人之间,其音色、语调等均有所差异。有的人音色浑厚、坚实,而有的人音色明亮、清透。有的人语调轻柔,而有的人语调高亢。若将语音片段直接输入语音神经网络中,得到的语音特征中就会包含有每个人的个人特征,这样会降低语音与语音同步性判断的准确性。并且,若将包含有个人特征的语音片段输入语音网络中进行训练,训练出的网络在也无法准确地获取训练样本中未包含的人物的语音特征,进而降低后续语音与语音同步性判断的准确性。所以,在将语音片段输入语音神经网络前,先将语音片段转化为特定信号,仅提取语音片段中与语义相关的特征,避免提取人物的个人特征,例如:仅提取语音片段中的语音内容本身,而不提取音色等。这样,就将语音片段中的个人特征删除,即转换成特定信号,进而将特定信号输入语音神经网络,获得的语音特征中就能够避免出现人物的个人特征,进而提高语音与语音同步性判断的准确性。
通过语音与图像同步性衡量模型可以获取语音片段对应的语音特征以及图像片段对应的视觉特征,在将语音片段和图像片段输入至语音与图像同步性衡量模型进行处理之 前,可以先对语音片段进行处理,对图像片段不做处理,然后将图像片段以及处理得到的语音数据输入至语音与图像同步性衡量模型,分别获得视觉特征和语音特征。关于语音片段和图像片段的具体处理手段、语音与图像同步性衡量模型的训练将在下文详细描述。
S303:根据语音特征以及视觉特征,确定语音片段与图像片段是否具有同步性,同步性用于表征语音片段中的声音与图像片段中目标人物的运动相匹配。
如前文所述,语音与图像同步性衡量模型可以包括同步性衡量模块,在语音与图像同步性衡量模型的语音神经网络输出语音特征,语音与图像同步性衡量模型的视觉神经网络输出视觉特征后,同步性衡量模块通过具有对比功能的算法将语音特征与视觉特征进行对比,根据对比结果,就能够确定语音片段与图像片段是否具有同步性了。这里的同步性,用于表征语音片段中的声音与图像片段中目标人物的运动相匹配。也就是说,根据对比结果,确定语音片段中的声音与图像片段中目标人物的运动的含义是否相同。也可以理解为图像片段中目标人物的运动所发出的声音与语音片段中的声音在语义和时间上是相同的。
一般来说,输出为0至1之间的一个数值。并且,在0至1之间设置一个阈值。若输出的数值大于或等于该阈值,则说明语音特征与视觉特征的相似度较高,语音片段与图像片段同步。若输出的数值小于该阈值,则说明语音特征与视觉特征的相似度较低,语音片段与图像片段不同步。对于数值的具体范围和阈值,此处不做具体限定。
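As a non-limiting illustration of this thresholding step, the comparison can be sketched as follows in Python; the use of cosine similarity and the 0.5 threshold are assumptions for the sketch, not values fixed by the disclosure.

```python
import numpy as np

def is_synchronized(speech_feat: np.ndarray, visual_feat: np.ndarray, threshold: float = 0.5) -> bool:
    """Map the feature pair to a score in [0, 1] and compare it with a threshold.

    speech_feat / visual_feat: 512-dim feature vectors produced by the two networks.
    threshold: assumed placeholder value; in practice it is tuned on validation data.
    """
    # Cosine similarity rescaled from [-1, 1] to [0, 1].
    cos = np.dot(speech_feat, visual_feat) / (
        np.linalg.norm(speech_feat) * np.linalg.norm(visual_feat) + 1e-8
    )
    score = (cos + 1.0) / 2.0
    return score >= threshold
```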
上述语音与图像同步性的衡量方法,在获取到视频中的语音片段和图像片段后,先将语音片段转换为与其中的说话人的个人特征无关的特定信号,再获取特定信号的语音特征以及图像片段的视觉特征,最后,根据语音特征与视觉特征,确定语音片段与图像片段是否具有同步性。也就是说,先对语音片段进行处理,去除语音片段中与人物个体相关的特征,再对特定信号或图像片段进行特征提取处理。这样获取的语音特征就不再携带有说话人的个人特征,进而能够提高语音与图像同步性衡量的准确性。
在一种实施方式中,参见图3B,语音与图像同步性的衡量方法可以包括如下步骤。
S301:获取视频中的语音片段和图像片段。
S301的具体实现可参见上文描述,此处不再赘述。
S3022:根据图像片段生成目标人物的轮廓图并获取轮廓图的视觉特征以及语音片段的语音特征,轮廓图与目标人物的个人特征无关。
在一种实施方式中,可以通过语音与图像同步性衡量模型来提取上述语音特征和视觉特征。该语音与图像同步性衡量模型可以包括语音神经网络、视觉神经网络以及同步性衡量模块,其中,语音神经网络可以用于提取输入信号(例如语音片段)的语音特征,视觉神经网络可以用于提取输入信号(例如轮廓图)的视觉特征,同步性衡量模块可以用于对语音片段和图像片段之间是否具有同步性进行判断。
在将图像片段输入视觉神经网络中获取视觉特征前,先对图像片段进行处理,将图像片段中人物的个人特征删除,也就是从图像片段中提取与图像中人物的个人特征无关的人物特征。
举例来说,不同人之间,其嘴唇厚度以及大小有所差异。有的人嘴唇厚,有的人嘴唇薄,有的人嘴大,有的人嘴小。若将图像片段直接输入视觉神经网络中,得到的视觉 特征中就会包含有每个人的个人特征,这样会降低图像与语音同步性判断的准确性。并且,若将包含有个人特征的图像片段输入视觉网络中进行训练,训练出的网络在也无法准确地获取训练样本中未包含的人物的视觉特征,进而降低后续图像与语音同步性判断的准确性。所以,在将图像片段输入视觉神经网络前,先对图像片段进行提取,仅提取图像片段中与人物下半脸运动相关的特征,避免提取人物的个人特征,例如:只提取嘴部开合的程度,而不提取嘴唇厚度等。进而将提取的与人物运动相关的特征组合,就能够得到人物的姿态或表情,进而就得到了图像片段中目标人物的轮廓图。进而将轮廓图输入视觉神经网络,获得的视觉特征中就能够避免出现人物的个人特征,进而提高图像与语音同步性判断的准确性。
通过语音与图像同步性衡量模型可以获取语音片段对应的语音特征以及图像片段对应的视觉特征,在将语音片段和图像片段输入至语音与图像同步性衡量模型进行处理之前,可以先对图像片段进行处理,对语音片段不做处理,然后将语音片段以及处理得到的图像数据输入至语音与图像同步性衡量模型,分别获得语音特征和视觉特征。关于语音片段和图像片段的具体处理手段、语音与图像同步性衡量模型的训练将在下文详细描述。
S303:根据语音特征以及视觉特征,确定语音片段与图像片段是否具有同步性,同步性用于表征语音片段中的声音与图像片段中目标人物的运动相匹配。
S303的具体实现可参见上文描述,此处不再赘述。
上述语音与图像同步性的衡量方法,在获取到视频中的语音片段和图像片段后,先根据图像片段生成目标人物的轮廓图,轮廓图与目标人物的个人特征无关,再获取语音片段的语音特征以及轮廓图的视觉特征,最后,根据语音特征与视觉特征,确定语音片段与图像片段是否具有同步性。也就是说,先对图像片段进行处理,去除图像片段中与人物个体相关的特征,再对语音片段和轮廓图进行特征提取处理。这样,获取的视觉特征就不再携带有说话人的个人特征,进而能够提高语音与图像同步性衡量的准确性。
在一种实施方式中,参见图3C,语音与图像同步性的衡量方法可以包括如下步骤。
S301:获取视频中的语音片段和图像片段。
S3023:将语音片段转换为特定信号,根据图像片段生成目标人物的轮廓图,并获取特定信号的语音特征以及轮廓图的视觉特征,特定信号与语音片段中说话人的个人特征无关,轮廓图与目标人物的个人特征无关。
S3023的具体实现可参见上文S3021和S3022的描述,此处不再赘述。
具体而言,S3023步骤中是对语音片段和图像片段都分别进行了处理,再对处理得到的特定信号和轮廓图提取对应特征。
S303:根据语音特征以及视觉特征,确定语音片段与图像片段是否具有同步性,同步性用于表征语音片段中的声音与图像片段中目标人物的运动相匹配。
S303的具体实现可参见上文描述,此处不再赘述。
上述语音与图像同步性的衡量方法,在获取到视频中的语音片段和图像片段后,先将语音片段转换为与其中的说话人的个人特征无关的特定信号,并且,根据图像片段生成目标人物的轮廓图,轮廓图与目标人物的个人特征无关,再获取特定信号的语音特征以及轮廓图的视觉特征,最后,根据语音特征与视觉特征,确定语音片段与图像片段是 否具有同步性。也就是说,先对语音片段和图像片段进行处理,去除语音片段和图像片段中与人物个体相关的特征,再对语音片段和图像片段进行特征提取处理。这样,获取的语音特征和视觉特征就不再携带有说话人的个人特征,进而能够提高语音与图像同步性衡量的准确性。
进一步地,作为图3所示方法的细化和扩展,本申请实施例还提供了一种语音与图像同步性的衡量方法。图4为本申请实施例中一种语音与图像同步性的衡量方法的流程示意图,参见图4所示,该方法可以包括:
S401:获取视频中的语音片段和图像片段。
步骤S401与步骤S301的实现方式相同,此处不再赘述。
下面分别从语音和图像两个方面,对输入语音与图像同步性衡量模型前的语音片段和/或图像片段进行处理,对应处理成语音数据和图像数据的过程进行具体说明。
一、语音片段处理方面
由于语音片段中包含有说话人的个人特征,例如:音色、语调等。因此,在将语音片段输入语音神经网络中获取语音特征之前,先将语音片段中说话人的个人特征抹去,进而将抹去说话人个人特征的语音数据输入语音神经网络,能够提升语音与图像同步性对比的准确性。
在S401获取视频中的语音片段和图像片段之后,语音片段处理具体可包括如下步骤。
S402:将语音片段的采样频率转换为特定频率。
从视频中将语音片段进行分离,改为单通道后,由于采集视频的终端的配置不同,故而语音的采样频率也存在差异,为了后续能够准确地对语音片段进行处理,因此,需要先将语音片段的采样频率进行统一。
在实际应用中,可以将语音片段的采样频率统一为16kHz。当然,也可以将语音片段的采样频率统一为其它数值,如:8kHz、20kHz等。具体的数值可以根据实际情况设置,此处不做限定。
S403:对语音片段进行去噪。
在这里,步骤S403可以包括两个方面。
S4031:去除语音片段中的背景音。
具体的,可以利用短时谱估计中的谱相减法对语音片段进行去噪,以压制语音片段中的背景音,突出语音片段中的语音。当然,也可以采用其它方式去除语音片段中的背景音,如:自适应滤波技术。而至于采用何种具体的方式去除语音片段中的背景音,此处不做限定。
S4032:将语音片段中不同说话人的语音分离,得到至少一个语音子片段。
有时语音片段中并不是只有一个人在说话,可能有多人同时说话,那么,就需要将语音片段中不同说话人的语音进行分离,分别获得各说话人的语音子片段。
在获得多个说话人的语音子片段后,有时只需要判断某个说话人的语音是否与图像同步,有时候需要判断多个说话人的语音是否与图像同步。此时,可以根据实际判断情况,选择某一个说话人的语音子片段或某几个说话人的语音子片段作为去噪后的语音片段。
S404:采用滑动加权的方式,将语音片段切分为多个语音帧。
其中,相邻的语音帧之间存在重叠。
具体来说,可以利用窗函数将语音片段滑动加权切分为多个语音帧。窗函数可以是汉明窗函数,也可以是其它类的窗函数。切分成的多个语音帧可以是多个25ms的片段,也可以是其它长度的片段。每一个片段称作一个语音帧。相邻语音帧之间一般保持10ms的重叠,这是因为:语音帧太短,可能一个音都没有发完,所以,使相邻语音帧保持一定程度的重叠,能够更加充分的对语义进行理解,进而提高语音与图像同步性衡量的准确性。
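A minimal sketch of this sliding weighted segmentation, assuming 16 kHz audio, 25 ms frames, a 10 ms overlap between adjacent frames and a Hamming window as described above:

```python
import numpy as np

def split_into_frames(waveform: np.ndarray, sample_rate: int = 16000,
                      frame_ms: float = 25.0, overlap_ms: float = 10.0) -> np.ndarray:
    """Slide a Hamming-weighted window over the waveform.

    Adjacent frames overlap by `overlap_ms`, i.e. the hop is frame_ms - overlap_ms.
    Returns an array of shape (num_frames, frame_len).
    """
    frame_len = int(sample_rate * frame_ms / 1000)          # 25 ms -> 400 samples at 16 kHz
    hop = frame_len - int(sample_rate * overlap_ms / 1000)  # 15 ms hop for a 10 ms overlap
    window = np.hamming(frame_len)
    frames = [
        waveform[start:start + frame_len] * window
        for start in range(0, len(waveform) - frame_len + 1, hop)
    ]
    return np.stack(frames) if frames else np.empty((0, frame_len))
```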
这里需要说明的是,步骤S402、S403、S404的执行顺序可以不按照序号的大小顺序执行,可以以任意顺序执行。对于步骤S402、S403、S404的执行顺序,此处不做具体限定。无论步骤S402、S403、S404中的执行了几个步骤,在后续转换为特定信号时,均是将所执行步骤的处理结果作为要转换为特定信号的处理对象。例如,若执行了步骤S402,则在转换时,是将转换为特定频率后的语音片段转换为特定信号;若执行了步骤S403,则在转换时,是将语音子片段转换为特定信号;若执行了步骤S404,则在转换时,是如步骤S405所述,将每个语音帧转换为特定信号。
S405:将每个语音帧转换为特定信号。
其中,特定信号与语音片段中说话人的个人特征无关。
在相关技术中,在将语音片段输入语音神经网络之前,需要先将语音片段转换为梅尔倒谱系数(Mel-scale Frequency Cepstral Coefficients,MFCC)信号,然后将MFCC信号输入语音神经网络,以获得相应的语音特征。然而,MFCC信号并不能够很好地抹去语音片段中说话人的个人特征,即身份信息,进而得到的语音特征中也会包含有说话人的身份信息,进而降低语音与图像同步性衡量的准确性。
有鉴于此,在将语音片段输入语音神经网络之前,可以先将语音片段转换为特定信号。这里的特定信号与语音片段中说话人的个人特征的无关,即能够更好地抹去语音片段中说话人的个人特征。这样,将特定信号输入语音神经网络,得到的语音特征就不再包含说话人的个人特征,进而提升语音与图像同步性衡量的准确性。
在实际应用中,特定信号可以是语音类别后验概率(Phonetic Posterior Grams,PPG)信号。PPG信号能够更好地抹去语音片段中与说话人身份相关的信息。并且,PPG信号还能够进一步抹去语音片段中的背景音,降低语音神经网络输入的方差,进而提升语音与图像同步性衡量的准确性。
当然,还可以将语音片段转换为其它类的信号,如DeepSpeech模型提取的特征,只要能够抹去说话人的身份信息即可。对于特定信号的具体类型,此处不做限定。
在实际应用中,为了将语音片段转换为PPG信号,可以将语音片段输入说话者无关的语音识别(Speaker-Independent Automatic Speech Recognition,SI-ASR)系统,通过SI-ASR系统对语音片段进行处理,生成PPG信号。在SI-ASR系统中,采用国际音素表,可以扩大适配语言,具体的PPG信号的维度P和SI-ASR支持的音素数量与支持的语言有关。这里采用支持中文和英文的SI-ASR系统,共支持P=400个音素。一语音帧所得的PPG信号为1×400维的特征向量。T个连续语音帧所得PPG信号为T×400维的特征矩阵。采用其他SI-ASR系统可根据支持的音素数量做相应调整。
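The internals of the SI-ASR system are not given here, so the following sketch only shows how per-frame phoneme posteriors could be stacked into the T×400 PPG feature matrix; `si_asr_posteriors` is a hypothetical callable standing in for that system.

```python
import numpy as np

def frames_to_ppg(frames: np.ndarray, si_asr_posteriors, num_phonemes: int = 400) -> np.ndarray:
    """Stack per-frame phoneme posteriors into a T x P PPG feature matrix.

    `si_asr_posteriors` is a hypothetical speaker-independent ASR callable that maps
    one speech frame to a 1 x num_phonemes posterior distribution (it stands in for
    the SI-ASR system; no concrete implementation is implied by the text).
    """
    ppg_rows = []
    for frame in frames:                       # frames: (T, frame_len)
        posterior = si_asr_posteriors(frame)   # expected shape: (num_phonemes,)
        posterior = np.asarray(posterior, dtype=np.float32).reshape(num_phonemes)
        ppg_rows.append(posterior)
    return np.stack(ppg_rows)                  # shape: (T, num_phonemes)
```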
当然,还可以通过其它方式将语音片段转换为抹去说话人的身份信息的信号,例如:深度学习模型DeepSpeech。该深度学习模型可以将语音信号转化为相应的文字。因此,在DeepSpeech提取的特征中,仅存在有说话的内容本身,不会存在说话人的音色等个人特征。这样,提取后也能够将说话人的身份信息和背景等与语义无关的内容抹去。
上述的处理过程也可以通过图5所示的流程示意图进行,参见图5所示,首先,将语音输入预处理模块。在预处理模块中执行上述步骤S402-S404的处理,即对语音进行统一采样频率、去噪、分割等处理。然后,将处理后的语音片段输入SI-ASR系统。在SI-ASR系统中执行上述步骤S405的处理,即将语音片段转换成PPG信号。
二、图像片段处理方面
由于图像片段中包含有目标人物的个人特征,例如:嘴唇薄厚、嘴大小等。因此,在将图像片段输入图像神经网络中获取图像特征之前,先将图像片段中目标人物的个人特征抹去,保留与下半脸运动相关的信息,进而将抹去目标人物的个人特征的图像数据输入图像神经网络,能够提升语音与图像同步性对比的准确性。
下面以从图像片段中提取下半脸特征为例，对根据图像片段生成目标人物的轮廓图进行说明。这里提取的轮廓图与目标人物的个人特征无关。
在S401获取视频中的语音片段和图像片段之后,图像片段处理具体可包括如下步骤。
S406:对图像片段进行人脸检测,得到人脸检测框。
一般来说,对图像片段中的每一帧图像进行人脸检测,得到人脸检测框。
S407:将人脸检测框中的人脸进行水平对齐。
具体的,可以利用稠密人脸对齐算法,找出人脸检测框中人脸关键点在原图像中的位置,包括但不限于左眼中心位置、右眼中心位置、左嘴角位置和右嘴角位置。上述的左、右为图像中人脸生理意义的左右,而非在图像中的左右,并假设图像中人脸是正面。利用上述人脸关键点的位置信息,基于规则计算将人脸图像处理成符合规则的形式。此处规则可以如下:
计算左眼中心关键点和右眼中心关键点的中间位置,记为P_eyecentre;
计算左嘴角关键点和右嘴角关键点的中间位置,记为P_mouthcentre;
计算左眼中心关键点到右眼中心关键点的向量,记为V_eyetoeye;
计算P_eyecentre到P_mouthcentre的向量,并逆时针旋转90度,使其与V_eyetoeye成锐角,记为V_eyetomouth;
计算V_eyetoeye和V_eyetomouth的向量差,并对向量差进行模长归一化,得到单位向量X_unit;
将X_unit放大,放大倍率为V_eyetoeye模长的2倍和V_eyetomouth模长的1.8倍两者的较大值,得到向量X,并对X逆时针旋转90度得到向量Y;
以P_eyecentre移动0.1倍V_eyetomouth为中心C,可在图像中得到一个矩形,矩形的左上角坐标为C+X+Y,右下角坐标为C-X-Y;
利用插值算法将上述矩形内的图像取出,缩放到预定尺寸,如256*256像素,就得到了对齐后的人脸。
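The rectangle construction rules above can be sketched with numpy as follows; this is one literal reading of the rules, and the coordinate convention, the sign of the rotation and the acute-angle correction are assumptions:

```python
import numpy as np

def rot90_ccw(v: np.ndarray) -> np.ndarray:
    """Rotate a 2-D vector 90 degrees counter-clockwise (image x-right, y-down assumed)."""
    return np.array([-v[1], v[0]], dtype=np.float64)

def alignment_rectangle(left_eye, right_eye, left_mouth, right_mouth):
    """Return the two rectangle corners (C+X+Y, C-X-Y) defined by the rules above."""
    p_eyecentre = (np.asarray(left_eye, dtype=np.float64) + np.asarray(right_eye, dtype=np.float64)) / 2.0
    p_mouthcentre = (np.asarray(left_mouth, dtype=np.float64) + np.asarray(right_mouth, dtype=np.float64)) / 2.0
    v_eyetoeye = np.asarray(right_eye, dtype=np.float64) - np.asarray(left_eye, dtype=np.float64)
    v_eyetomouth = rot90_ccw(p_mouthcentre - p_eyecentre)
    if np.dot(v_eyetomouth, v_eyetoeye) < 0:          # keep an acute angle with V_eyetoeye
        v_eyetomouth = -v_eyetomouth
    x_unit = v_eyetoeye - v_eyetomouth
    x_unit = x_unit / (np.linalg.norm(x_unit) + 1e-8)
    scale = max(2.0 * np.linalg.norm(v_eyetoeye), 1.8 * np.linalg.norm(v_eyetomouth))
    x = x_unit * scale
    y = rot90_ccw(x)
    c = p_eyecentre + 0.1 * v_eyetomouth
    return c + x + y, c - x - y                       # rectangle corners before resampling
```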
这里用于找出人脸关键点的稠密人脸对齐算法可以是三维稠密人脸对齐(3Dimentional Dense Face Alignment,3DDFA)算法。当然,还可以采用其他对齐算法获取 人脸关键点,继而采用上述规则实现人脸对齐。此处对于使用的具体算法,不做限定。
相比于更为常用的计算人脸关键点和预设正脸人脸关键点模板之间的仿射变换来对齐人脸的方式,此方法可以兼容大角度侧脸和正脸的对齐。
S408:从人脸中提取目标人物的表情系数。
具体的,可以通过三维可形变参数化人脸模型(3Dimensional Morphable Models,3DMM)参数估计算法提取人脸检测框中目标人物的表情系数,表情系数符合三维可形变参数化人脸模型的标准。由于3DMM显式的设计了身份参数空间(表达身份信息的部分)和表情参数空间(表达表情信息的部分)的解耦合,因此利用3DMM参数估计算法获取的表情信息不含有身份信息,即,不含个人特征。
在将人脸检测框中的内容作为输入，利用3DMM参数估计算法对人脸检测框中的内容进行处理后，就能够获取到目标人物的符合3DMM模型标准的身份系数和表情系数。可以将表情系数记为α_exp。
其中,3DMM参数估计算法是能够估计3DMM参数的算法,用来估计人脸的身份系数和表情系数,且身份系数和表情系数符合3DMM定义的标准。
具体来说,本申请采用的3DMM参数估计算法是用深度神经网络模型实现的。可以利用预先训练的深度神经网络模型,向模型中输入对齐后的人脸检测框中的人脸图像和相关技术中的目标人物对应的身份系数,提取对齐后人脸图像中目标人物的表情系数和身份系数,并根据输出的身份系数更新相关技术中的目标人物对应的身份系数,用于后续图像帧估计。此处目标人物对应的身份系数为时序上相邻图像帧估计的身份系数的滑动加权平均。
相比于单独从对齐后人脸图像直接计算目标人物的表情系数,此处通过将时序上相邻图像帧对目标人物的身份系数的计算结果输入深度神经网络模型,可以更好的让模型使用表情系数,而不是改变身份系数,来拟合人脸的形态变化;即,通过增加身份系数时序稳定的约束消除参数估计过程的歧义性,从而获得更为准确的表情系数。
类似的,此处也可以借鉴其他的能够稳定身份系数的3DMM参数估计算法,如Face2Face算法(Thies,Justus,等,Face2face:rgb视频实时人脸的捕获和再现,IEEE计算机视觉和模式识别会议论文集,2016(Thies,Justus,et al."Face2face:Real-time face capture and reenactment of rgb videos."Proceedings of the IEEE conference on computer vision and pattern recognition.2016))获取每一帧的表情系数。
表情系数α_exp包含有表征嘴部的位置、嘴的开合程度等与说话人个人无关的特征。而与说话人个人相关的特征都是在身份系数中表征。所以，仅仅基于表情系数α_exp和标准身份系数（这里使用标准身份系数替代目标人物的身份系数，去除目标人物的个人特征），输入通用参数化人脸模型生成目标人物的人脸轮廓图，能够排除目标人物的个人特征，进而提高嘴部运动与语音同步性衡量的准确性。
S409:提取表情系数中下半脸对应的下半脸表情系数。
在3DMM的定义下,所有表情系数的影响都是全脸的,只是有的对嘴影响大对眼睛影响可忽略。因此,提取表情系数中与下半脸运动相关性高的表情系数,作为下半脸表情系数。
需要将图像中目标人物的某一部位与语音进行同步性衡量，就从系统中提取目标人物该部位的与个人特征无关的系数。在这里，需要将下半脸运动与语音进行同步性衡量，那么就从表情系数中提取下半脸的表情系数，记为α_halfface，进而基于下半脸的表情系数生成下半脸轮廓图，以与语音进行同步性衡量。
S410:将下半脸表情系数输入通用三维人脸模型,得到目标人物的下半脸对应的三维人脸模型。
目标人物的下半脸对应的三维人脸模型也就是目标人物的下半脸表情系数结合标准身份系数的三维人脸模型。
通用三维人脸模型,就是抽象化的人脸模型。在通用三维人脸模型中,眉毛、眼睛、鼻子、脸、嘴等部位的数据均是基于众多的人脸平均后得到的,具有普适性。
将下半脸表情系数输入通用三维人脸模型后，得到的就是目标人物的下半脸表情对应的三维人脸模型。
具体的，在通用三维人脸模型中，将预定义的完整表情正交基底B_exp对应改为与下半脸运动相关的B_halfface。具体如下式(1)所示：
S = \bar{S} + B_{halfface}\,\alpha_{halfface}    (1)
其中，S为目标人物在中性表情下的嘴型的几何模型，\bar{S}为预定义的中性表情下对应的平均人脸几何模型，B_halfface为与嘴部运动相关的正交基底，α_halfface为下半脸表情系数。
这样,得到的目标人物的下半脸表情对应的三维人脸模型就能够消除无关表情的影响。
S411:获取三维人脸模型中下半脸的顶点集合。
所谓下半脸,是指人脸中左右耳底部与鼻尖连线以下的人脸区域。图6为本申请实施例中下半脸的范围的示意图,参见图6所示,将左耳底部的位置601、鼻尖的位置602、右耳底部的位置603连接,得到连线604。连线604就将人脸分为了上半脸和下半脸。而连线604以下的人脸就是下半脸。
下半脸在选取时，连线604可以有一定的调整幅度，如向上移动到眼部位置，或者向下移动到鼻子位置等。即下半脸的选取可以根据实际需要调整。
S412:将顶点集合投影到二维平面,得到目标人物的下半脸轮廓图,并将下半脸轮廓图作为目标人物的人脸轮廓图。
具体的,收集得到的几何模型S上对应嘴部轮廓和下巴区域的顶点,得到顶点集合V。再利用尺度正交投影(Scale Orthographic Projects)将顶点集合V投影到二维平面,得到下半脸的轮廓图I,具体如下式(2)所示:
I = f \cdot P \cdot S(v)    (2)
其中,I为目标人物下半脸的二维轮廓图,f为尺度系数,P为正交投影矩阵,S(v)为三维人脸模型中下半脸的顶点集合。在这里,轮廓图I的尺寸可以是128×256的长方形,嘴部和下半脸的轮廓居中。特别的,为增强轮廓图的可见性,投影时,将每个顶点投影成一个以顶点投影位置为圆心,半径为r个像素的二维高斯圆斑。半径r的取值和I的尺寸正相关,对应于128×256的I,这里取r=2个像素。
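An illustrative sketch of equation (2) plus the Gaussian-disc rasterization on a 128×256 canvas with r = 2 pixels; the pixel-coordinate mapping and the centering offset are assumed rather than specified:

```python
import numpy as np

def render_lower_face_contour(vertices_3d: np.ndarray, f: float,
                              height: int = 128, width: int = 256, r: int = 2) -> np.ndarray:
    """Project lower-face vertices with scaled orthographic projection (eq. 2)
    and splat each projected vertex as a small Gaussian disc of radius ~r pixels."""
    P = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])               # orthographic projection matrix
    pts_2d = f * (P @ vertices_3d.T).T            # I = f * P * S(v), shape (N, 2)

    contour = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for cx, cy in pts_2d:
        # assumed mapping: projected coordinates already expressed in pixel units,
        # with the contour roughly centered by an offset chosen elsewhere
        contour += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * r ** 2))
    return np.clip(contour, 0.0, 1.0) * 255.0     # 0-255 grey-scale contour map
```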
在对图像片段进行处理的过程中,并没有保留输入的人脸图像原有的姿态朝向和光照信息,而是只保留通过3DMM参数估计算法获取到的图像片段中目标人物的表情系数, 进而结合标准身份系数得到通用三维人脸模型,生成消除了目标人物的个人特征的下半脸轮廓图,得到的轮廓图是正脸特征下的轮廓图,消除了原始图像中人脸姿态、光照以及遮挡物的影响。
上述处理过程可以通过图7所示流程示意图进行,参见图7所示,首先,对图像片段执行上述步骤S406-S407的处理,即,将图像进行稠密人脸对齐,得到对齐后的图像;然后,对对齐后的图像执行上述步骤S408-S409的处理,即,将对齐后的图像进行人脸3D模型表情系数提取;接着,对提取的表情系数执行上述步骤S410的处理,即,采用正面视角、标准脸型、平均光照,根据提取的表情系数生成3D模型;最后,对生成的3D模型执行上述步骤S411-S412的处理,将3D模型进行对应顶点的投影,得到下半脸的二维轮廓。
在将语音片段处理为PPG信号,以及将图像片段处理为人脸正面下半部的二维轮廓图之后,就可以将PPG信号输入语音神经网络,以及将二维轮廓图输入视觉神经网络,分别得到语音特征和视觉特征,进而将语音特征与视觉特征进行对比,确定语音片段与图像片段是否具有同步性。
S413:通过语音神经网络获得特定信号的语音特征。
将语音片段输入语音神经网络,通过语音神经网络对语音片段进行处理,语音神经网络的输出就是语音特征。
这里的语音神经网络可以是任何一种能够获取语音片段中语音特征的神经网络。对于语音神经网络的具体类型,此处不做具体限定。
S414:通过视觉神经网络获得人脸轮廓图的视觉特征。
将对图像片段处理后得到的轮廓图输入视觉神经网络,通过视觉神经网络对轮廓图进行处理,视觉神经网络的输出就是视觉特征。
这里的视觉神经网络可以是任何一种能够获取图像片段中视觉特征的神经网络。对于视觉神经网络的具体类型,此处不做具体限定。
在通过语音片段处理和图像片段处理分别得到语音特征和视觉特征后,还包括步骤S415:根据语音特征与视觉特征,确定语音片段与图像片段是否具有同步性。
本申请实施例还提供了一种语音与图像同步性衡量模型的训练方法,在该方法中,当对语音与图像同步性衡量模型进行训练时,预先获取各种类型的训练样本,即获取类型多样的训练样本,例如:同一段训练视频中具有同步性的图像片段和语音片段,同一段训练视频中不具有同步性的图像片段和语音片段,不同训练视频中的图像片段和语音片段,等等。采用多种类型的训练样本对语音与图像同步性衡量模型进行训练,能够提高语音与图像同步性衡量模型的精确度,进而提高语音与图像同步性衡量的准确性。
这里需要说明的是,所有的训练视频来自于训练视频集,训练视频的数量可以是一个,也可以是多个。对于训练视频的数量,此处不做限定。第一训练视频为训练视频集中的一个训练视频。从训练视频集中选取一个与第一训练视频不同的训练视频,作为第二训练视频。
在实际应用中,本申请实施例提供的方法可以应用在各种需要判断语音与图像是否同步的场景下。下面以三个具体场景为例,进一步对本申请实施例提供的方法进行说明。
场景一:判定说话人。
当视频中有多人进行谈话时,为了确定当前正在说话的说话人,首先,从视频中提取出相应的语音片段和图像片段;然后,将语音片段处理为PPG信号,以抹去说话人的音色、语调等个人特征,以及将图像片段通过3DMM参数估计算法提取表情系数,处理成人脸正面下半部的二维轮廓图,以消除侧面、遮挡等情况的干扰,图像中有多少个人脸,就有多少个二维轮廓图;接着,将对语音片段处理后得到的语音数据输入语音神经网络,以及对图像片段处理后得到的图像数据输入视觉神经网络,分别得到语音特征和多个视觉特征;最后,将多个视觉特征分别与语音特征进行同步性匹配,进而确定出与语音特征同步性最高的视觉特征,并将确定的视觉特征与预设阈值比较,若属于预设阈值对应的视觉特征,则将该视觉特征对应的人确定为视频中当前的说话人。可以避免把不在视频中的说话人判定为视频中当前说话人,如记者采访场景,若记者不在视频的画面中,则在记者说话时视频画面中不存在对应的说话人。
场景二:伪造视频鉴别。
某些视频中的声音或者画面可能并不是原有的,而是后期人为加上去的。例如:将一些明星的视频重新进行配音,配上一些明星根本没有说过的话。再例如:在一些交互式的活体认证中,需要用户读出屏幕上所显示的字,然后录制成视频上传。而不法分子为了能够通过验证,就事先获取用户的图像,然后进行配音,制作成视频上传。
为了判断视频是否是伪造的,首先,从视频中提取出相应的语音片段和图像片段;然后,将语音片段处理为PPG信号,以抹去说话人的音色、语调等个人特征,以及将图像片段通过3DMM参数估计算法提取表情系数,处理成人脸正面下半部的二维轮廓图,以消除侧面、遮挡等情况的干扰;接着,对语音片段处理后得到的语音数据输入语音神经网络,以及对图像片段处理后得到的图像数据输入视觉神经网络,分别得到语音特征和视觉特征;最后,将语音特征与视觉特征进行同步性匹配,匹配度越高,说明视频中的图像和语音是同步的,而不是后期人为加入的。当匹配度高于特定值时,就可以确定视频中的图像和语音是同一个人同时产生的,即视频中的语音片段属于图像片段中的人物。
场景三:视频调制。
一些非专业级别的多媒体设备在录制视频时,采集语音的设备和采集图像的设备往往是分开的。采集语音可以使用麦克风,采集图像可以使用摄像头。然后再将采集的语音和图像融合成视频。这样,很容易导致视频中的语音与图像在时间上发生错位,即音画不同步。
为了解决视频中音画不同步的问题,首先,从视频中提取出相应的语音片段和图像片段;然后,将语音片段处理为PPG信号,以抹去说话人的音色、语调等个人特征,以及将图像片段通过3DMM参数估计算法提取表情系数,处理成人脸正面下半部的二维轮廓图,以消除侧面、遮挡等情况的干扰;接着,对语音片段处理后得到的语音数据输入语音神经网络,以及对图像片段处理后得到的图像数据输入视觉神经网络,分别得到语音特征和视觉特征;最后,将语音特征与视觉特征进行同步性匹配,确定语音与图像错位的程度,进而进行辅助标定,从而根据标定将语音与图像的时间对齐,以消除错位。
针对上述三个场景中的问题,可以通过预先训练的语音与图像同步性衡量模型来判 断语音与图像是否具有同步性。相关技术中,先获取样本视频数据,再通过样本视频数据对语音与图像同步性衡量模型进行训练。样本视频数据的采样对语音与图像同步性衡量模型的训练效率和精准度等性能都有重要影响。
在对语音与图像同步性衡量模型进行训练时,可以基于样本视频数据的特点优化样本视频数据的采样策略,从而更高效地训练语音与图像同步性衡量模型,得到更高精度的模型。具体而言,通过图像预处理和语音预处理对样本视频数据进行处理,针对性抹去样本视频数据中与说话人/目标人物无关的信息,保留说话人/目标人物相关的信息。语音预处理具体可以是将从样本视频数据中提取的语音片段处理为PPG信号,PPG信号作为一种帧级结构,其与说话人的语言无关,可用于多种语言的同步性判断;且,PPG信号可以衡量距离,可用于样本视频数据中正负样本的采样。图像预处理具体可以是将从样本视频数据中提取的图像片段处理为与目标人物的个人特征无关的轮廓图。通过上述方式进行采样,由于语音预处理和图像预处理得到的数据格式便于衡量数据差异性,因而可以高效地构建样本视频数据。
图8为本申请实施例中语音与图像同步性衡量模型的训练方法的流程示意图,参见图8所示,该方法可以包括:
S801:将第一图像片段处理为第一图像数据、第一语音片段处理为第一语音数据、第二语音片段处理为第二语音数据。
其中,第一图像片段、第一语音片段和第二语音片段来自于第一训练视频,第一图像片段与第一语音片段具有同步性,第一图像片段与第二语音片段不具有同步性。也就是说,第一图像数据、第一语音数据和第二语音数据来自于第一训练视频。
具体来说,就是获取第一训练视频的第一区间的图像片段和语音片段,得到第一图像片段和第一语音片段。获取第一训练视频的第二区间的语音片段,得到第二语音片段。在这里,第一区间与第二区间可以完全不重叠,或者部分重叠。这样,能够确保第一语音片段与第二语音片段的内容所有差异。
举例来说,将第一训练视频中第10ms至第30ms对应的图像作为第一图像片段,将第一训练视频中第10ms至第30ms对应的语音作为第一语音片段,以及将第一训练视频中第35ms至第55ms对应的语音作为第二语音片段。
S802:将随机图像片段处理为第二图像数据、所述随机语音片段处理为第三语音数据。
其中,随机图像片段和随机语音片段来自于第二训练视频。也就是,第二图像数据和第三语音数据来自于第二训练视频。
第一训练视频和第二训练视频为不同的两段视频,均来自于训练视频集。也就是说,为了丰富训练样本,还需要获取除第一训练视频外的其它视频中的图像片段和语音片段,这些图像片段和语音片段分别称之为随机图像片段和随机语音片段。
这里需要说明的是,第一训练视频与第二训练视频在图像或语音的具体内容上需要存在一定程度的差异,以便后续语音与图像同步性衡量模型能够进行更加准确地学习,进而提升图像与语音同步性衡量的准确性。
S803:将第一图像数据和第一语音数据组成正样本。
为了训练语音与图像同步性衡量模型,就需要获得训练样本。而为了进一步提升训练后的语音与图像同步性衡量模型的精准性,就需要获取各种类型的训练样本。也就是说,不仅需要获取具有同步性的图像片段和语音片段,还需要获取各种类型的不具有同步性的图像片段和语音片段。
正样本可采用如下方式获取:将同一段训练视频中同一个区间的第一图像片段和第一语音片段,处理成第一图像数据和第一语音数据后,组成一个正样本。
而在同一段训练视频中,存在有多个区间,并且这些区间可以是相互独立的,也可以是部分重合的,因此,基于同一段训练视频也能够获得多个正样本。
举例来说,将第一训练视频中第10ms至第30ms对应的第一图像片段和第一语音片段,对应的第一图像数据和第一语音数据作为一个正样本。将第一训练视频中第40ms至第60ms对应的第一图像片段和第一语音片段,对应的第一图像数据和第一语音数据作为另一个正样本。以及将第一训练视频中第20ms至第40ms对应的第一图像片段和第一语音片段,对应的第一图像数据和第一语音数据作为一个正样本。
S804:将第一图像数据和第二语音数据组成第一负样本。
S805:将第一图像数据和第三语音数据组成第二负样本。
S806:将第一语音数据或第二语音数据,和第二图像数据组成第三负样本。
在获取负样本的过程中,由于不具有同步性的图像片段与语音片段是多种多样的,因此,可以将能够罗列出的各种不具有同步性的图像片段和语音片段都罗列出来,以便对语音与图像同步性衡量模型进行更加充分的训练。
具体来说,以第一训练视频中的第一图像片段为基准,将与第一图像片段不具有同步性的语音片段,与第一图像片段,进行数据预处理后组成一个负样本。这里的不具有同步性的语音片段就包含有两种情况。
第一种情况:不具有同步性的语音片段也来自于第一训练视频。即,该语音片段可以是第二语音片段。此时,就可以将第一图像片段与第二语音片段,处理成第一图像数据与第二语音数据后组成语音与图像错位的第一负样本。
第二种情况:不具有同步性的语音片段来自于第二训练视频。即,该语音片段可以是随机语音片段。此时,就可以将第一图像片段与随机语音片段,处理成第一图像数据与第三语音数据后组成图像固定的第二负样本。
除了上述两种情况之外,在不以第一训练视频中的第一图像片段为基准,而是以第一训练视频中的语音片段为基准的情况下,还存在有一种情况。
第三种情况:不具有同步性的图像片段来自于第二训练视频。也就是说,将第二语音片段与其它图像片段,处理成第二语音数据与第二图像数据后组成语音固定的第三负样本。当然,也可以是将第一语音片段与其它图像片段,处理成第一语音数据与第二图像数据后组成语音固定的第三负样本。只要第三负样本中的语音片段来自于第一训练视频即可。
这样,训练样本的类型就比较丰富了,尤其是负样本的类型较为丰富。
第一语音片段、第二语音片段、随机语音片段经过处理后,转化成了特定信号,该特定信号与语音片段中说话人的个人特征无关。即第一语音数据、第二语音数据、第三语音数据均为特定信号,该特定信号与对应的语音片段中说话人的个人特征无关。
在一种可能的实施方式中,第一图像片段和随机图像片段经过处理后,转化为目标人物的人脸轮廓图,该人脸轮廓图与图像片段中的目标人物的个人特征无关。即第一图像数据和第二图像数据均为目标人物的人脸轮廓图,该人脸轮廓图与对应图像片段中的目标人物的个人特征无关。
S807:采用正样本、第一负样本、第二负样本和第三负样本训练语音与图像同步性衡量模型。
在采集到正样本、第一负样本、第二负样本和第三负样本后,将正样本、第一负样本、第二负样本和第三负样本输入语音与图像同步性衡量模型中进行训练,即调整语音与图像同步性衡量模型中的各项参数,优化语音与图像同步性衡量模型,使得后续输入待衡量的图像数据和语音数据后,语音与图像同步性衡量模型能够更加精准地进行衡量。
这里需要说明的是,在语音与图像同步性衡量模型中,主要包含有两个神经网络,即语音神经网络和视觉神经网络。语音神经网络主要基于语音数据获得语音特征,而视觉神经网络主要基于图像数据获取视觉特征。此外,还包含有一个同步性衡量模块,该模块也可以是一个神经网络。因此,对语音与图像同步性衡量模型进行训练,也就是说对语音与图像同步性衡量模型中的各个神经网络进行训练。
由上述内容可知,本申请实施例提供的语音与图像同步性衡量模型的训练方法,在第一训练视频中具有同步性的第一图像片段和第一语音片段,与第一图像片段不具有同步性的第二语音片段,以及第一训练视频外的随机图像片段和随机语音片段,对应处理为第一图像数据、第一语音数据、第二语音数据、第二图像数据和第三语音数据后,将第一图像数据和第一语音数据组成正样本,将第一图像数据和第二语音数据组成第一负样本,将第一图像数据和第三语音数据组成第二负样本,将第一语音数据或第二语音数据,和第二图像数据段组成第三负样本。这样,使得训练样本的类型更加丰富,尤其是使得图像与语音不具有同步性的负样本的类型的更加丰富。进而采用类型丰富的正样本、第一负样本、第二负样本和第三负样本训练语音与图像同步性衡量模型,能够提高语音与图像同步性衡量模型的精确度,进而提高语音与图像同步性衡量的准确性。
图9为本申请实施例中衡量语音与图像同步性的架构示意图,参见图9所示,在从视频中分别提取出语音片段和图像片段后,一方面,将语音片段输入语音神经网络,得到语音特征。另一方面,将图像片段输入视觉神经网络,得到视觉特征。最后,将语音特征和视觉特征输入同步性衡量模块,同步性衡量模块通过语音特征和视觉特征确定相应的语音片段和图像片段是否具有同步性。这里的同步性衡量模块就是通过语音特征与视觉特征的对比,确定相应的语音片段和图像片段是否具有同步性的模块。对于同步性衡量模块的具体形式,此处不做限定。
在实际应用中,为了获得语音片段的语音特征,可以将语音片段输入到语音神经网络中进行处理,以获得语音特征。以及为了获得图像片段的视觉特征,可以将图像片段输入到视觉神经网络中进行处理,以获得视觉特征。下面分别从神经网络的构建、训练数据采样、训练这三个方面进行说明。
一、神经网络构建
1、语音神经网络构建
由于在将语音片段输入语音神经网络之前,已经将语音片段转化为特定信号,具体是维度为T×P的PPG信号。并且每一个维度均具有明确的物理含义,P为音素数量,T为时间上的采样次数,每一列是一个语音帧对应的音素后验概率分布。基于这些明确的物理含义,语音神经网络具体可以做如下搭建。
图10为本申请实施例中语音神经网络的架构示意图,参见图10所示,语音神经网络至少包括有:卷积层(Conv1D(3×1,stride=(2,1))LeakyReLU(0.02))、……卷积层(Conv1D(3×1,stride=(2,1))LeakyReLU(0.02))、重组层(Reshape)、全连接层(Fully Connection Layer LeakyReLU(0.02))、全连接层(Fully Connection Layer LeakyReLU(0.02))、全连接层(Fully Connection Layer LeakyReLU(0.02))、线性投影层(Linear Projection Layer)。
考虑到相邻的语音片段之间存在重叠,因此,先采用多个1维卷积层(卷积核尺寸为3×1,卷积步长为(2,1),并采用有效扩充(valid padding)对时间维度进行处理。再将得到的矩阵重组为特征向量。接着,采用3个全连接层对特征向量进行处理。最后,经过1个线性投影层得到512维的语音特征向量。其中,卷积层的层数与输入的特定信号(PPG信号对应的特征矩阵)的时长相关。最终输出的语音特征向量的维度与后续输出的视觉特征向量的维度一致。本申请实施例中的语音特征向量也就是语音特征,视觉特征向量也就是视觉特征。
具体来说,当P=400,输入时长=200ms时,T=13,PPG特征矩阵为13×400维度。对应的,可以采用2层1维卷积层,得到3×400的特征矩阵。重组为1×1200的特征向量后,经过3个全连层和1个线性层,得到最后的512维语音特征向量。
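A hedged PyTorch sketch of this speech branch (stacked time-axis convolutions with LeakyReLU, reshape, three fully connected layers and a 512-D linear projection); the channel and hidden widths are assumptions, since only the layer types are described:

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Sketch of the audio branch: 1-D convolutions over the time axis of the
    T x P PPG matrix, followed by three FC layers and a linear projection to 512-D.
    The conv channel count (1 -> 64 -> 64) and hidden width 1024 are assumptions."""
    def __init__(self, out_dim: int = 512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(3, 1), stride=(2, 1)),   # "valid" padding over time
            nn.LeakyReLU(0.02),
            nn.Conv2d(64, 64, kernel_size=(3, 1), stride=(2, 1)),
            nn.LeakyReLU(0.02),
        )
        self.mlp = nn.Sequential(
            nn.Flatten(),                 # reshape the remaining T' x P map into a vector
            nn.LazyLinear(1024), nn.LeakyReLU(0.02),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.02),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.02),
            nn.Linear(1024, out_dim),     # final linear projection layer
        )

    def forward(self, ppg: torch.Tensor) -> torch.Tensor:
        # ppg: (batch, T, P), e.g. (B, 13, 400) for a 200 ms clip
        x = self.convs(ppg.unsqueeze(1))  # add a channel dimension
        return self.mlp(x)
```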
图11为本申请实施例中生成语音特征的流程示意图,参见图11所示,该过程可以包括:
S1101:采用多个1维卷积层对特定信号在时间维度上进行处理,得到特征矩阵。
其中,1维卷积层的数量与特定信号对应的时长相关。
S1102:将特征矩阵重组为特征向量。
S1103:采用3个全连层和1个线性投影层对特征向量进行处理,得到512维的语音特征向量。
当然,最终得到的语音特征向量的维度并不仅限于只有512维。语音特征向量的维度与输入到模型中的语音数据的数据量和语音神经网络所采用的损失函数的类型相关。语音神经网络具体可以是语音与图像同步性衡量模型所包括的语音神经网络。
2、视觉神经网络构建
由于在将图像片段输入视觉神经网络之前,已经将图像片段中对下半脸运动信息形成干扰的因素(例如:光照、个人特征、姿态等)在很大程度上进行了去除,因此,视觉神经网络就可以采用计算量较为轻量级的网络结构。
具体来说,视觉神经网络可以采用ResNet18的主干网,并做如下改动:
(1)若输入的图像片段为多张图像,可以将多张图像按照时间增序沿着通道维度排列后作为视觉神经网络的输入。因此,视觉神经网络的第1层中卷积的参数维度需要做相应的调整。
(2)由于图像片段被处理为下半脸的轮廓图,分辨率为128×256,高宽比为1:2, 这与ResNet18默认输入高宽比1:1不同。对此,需要在ResNet18的第1层卷积采用较大卷积核尺寸,例如:7×7,并将卷积步长设置为(1,2)。
以上卷积尺寸和步长仅仅为一种具体的数值,这并不意在限制本申请实施例中采用的卷积核尺寸和步长只能够是7×7和(1,2)。在实际应用中,卷积层的卷积核尺寸和步长与轮廓图的尺寸相关。可以根据轮廓图的高宽比设置相应的步长,并且将卷积核的尺寸设置的稍大一些。这样,采用一个卷积核较大的卷积层就能够将轮廓图一次处理完成。当然,也可以采用多个卷积核较小的卷积层进行多次处理实现。
(3)在ResNet18主干网的最后增加了1层全连接层,这样能够得到512维的视觉特征向量。
当然,最终得到的视觉特征向量的维度并不仅限于只有512维。视觉特征向量的维度与输入到模型中的视觉数据的数据量和视觉神经网络所采用的损失函数的类型相关。
当然,视觉神经网络除了采用ResNet18的主干网之外,还可以采用其它的深度神经网络进行改动后使用,例如:MobilenetV2等。
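A hedged PyTorch/torchvision sketch of the modified ResNet18 visual branch; the exact way the extra 512-D layer is attached to the backbone is an assumption:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VisualEncoder(nn.Module):
    """Sketch of the visual branch: a ResNet18 backbone whose first convolution is
    adapted to T stacked contour maps (128 x 256, aspect ratio 1:2) and whose head
    is replaced by a fully connected layer producing a 512-D visual feature."""
    def __init__(self, num_frames: int = 5, out_dim: int = 512):
        super().__init__()
        backbone = resnet18(weights=None)
        # (1) input channels = number of stacked frames; (2) stride (1, 2) to
        # compensate for the 1:2 height/width ratio of the contour maps.
        backbone.conv1 = nn.Conv2d(num_frames, 64, kernel_size=7,
                                   stride=(1, 2), padding=3, bias=False)
        # (3) replace the classification head with a 512-D embedding layer.
        backbone.fc = nn.Linear(backbone.fc.in_features, out_dim)
        self.backbone = backbone

    def forward(self, contours: torch.Tensor) -> torch.Tensor:
        # contours: (batch, T, 128, 256), frames stacked along the channel axis
        return self.backbone(contours)
```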
图12为本申请实施例中生成视觉特征的流程示意图,参见图12所示,该过程可以包括:
S1201:采用卷积层处理轮廓图,得到特征矩阵。
其中,卷积层的卷积核尺寸和步长与轮廓图的尺寸相关。
S1202:采用视觉神经网络的主干网络处理特征矩阵,得到特征向量。
这里的主干网络,是指神经网络中的主要架构。为了构建本申请实施例中的视觉神经网络,在获取到相关技术中的某一个视觉神经网络后,采用该相关技术中的视觉神经网络中的架构,即主干网络,并对某些层中的参数进行适应性修改,就能够得到本申请实施例的视觉神经网络了。视觉神经网络具体可以是语音与图像同步性衡量模型所包括的视觉神经网络。
S1203:采用全连接层处理特征向量,得到512维的视觉特征向量。
二、训练数据采样
对于训练视频,采用单人说话的人像视频。在该人像视频中,背景声的干扰程度小于特定程度。也就是说,需要采用背景声相对干净的单人说话的视频。并且,训练视频可以是大量的,以便使得后续训练能够更加充分。在实际应用中,可以采用25Hz的高清视频。这样,能够提高视觉特征提取训练的精准性。
在采集完训练视频后,先将每段视频中的音频信号处理成16kHz,以及将视频信号切分为帧,并记录时间线。这样,就得到了语音片段和图像片段。然后,再采用上述步骤S402-S405中的处理方式对语音片段进行处理,得到特定信号,后续采样时简称为语音,以及采用上述步骤S406-S412中的处理方式对图像片段进行处理,得到人脸轮廓图,后续采样时简称为视觉。
接下来,就可以正式对训练数据进行采样了。在这里,主要包括正样本采样和负样本采样。所谓正样本,就是指输入的语音与视觉是同步的。而所谓负样本,就是指输入的语音与视觉是不同步的。通过输入正样本和负样本进行训练,能够提高语音与图像同步性衡量的准确性。
1、正样本采样
所谓正样本,就是训练时使用的语音和视觉需要来自于同一段训练视频,并且在时间上同步。
并且,若语音长度过短,可能会导致一个完整的发音未包含在语音内,甚至还可能会影响语音中语义的理解,有鉴于此,为了提高语音特征识别的准确性,进而提高同步性衡量的准确性,可以使语音的帧长大于视觉的帧长。而语音的帧长具体选择多少,可以基于训练视频的帧率确定。
举例来说,对于25Hz帧率的训练视频,可以选取T时刻的一帧图像,以及(T-20ms,T+20ms)的语音片段,经过处理后构成一个正样本对。此时,视觉的长度为1帧,而语音的长度为40ms。很明显,这就是使语音的帧长大于视觉的帧长。而语音的长度设置为40ms,就是为了配合训练视频中25Hz的帧率。而若采用其它帧率的训练视频,语音的长度可以进行相应的调整。
从训练视频集中选取一个训练视频,简称为第一训练视频;从训练视频集中选取另一个训练视频,简称为第二训练视频。第一训练视频和第二训练视频为不同的训练视频。
在本申请实施例中,从第一训练视频中获取的第一图像片段和第一语音片段,处理成第一图像数据和第一语音数据后,组成的就是正样本。
2、负样本采样
所谓负样本,就是训练时使用的语音和视觉并不同步。这里的不同步可以包含有多种情况。而为了能够更加充分地进行训练,可以将不同步的所有情况都进行样本采集。
在采集负样本对时,可以从不同的视频中分别采集图像片段和语音片段,或者从同一视频中的不同时间处采集图像片段和语音片段,进而组成负样本。但是,这样采集的负样本对中仍有可能存在正样本。例如:若视频A中的语音片段与视频B中的语音片段相同,那么视频A中的语音片段与视频B中的语音片段对应的图像片段也具有同步性,若将视频A中的语音片段与视频B中的语音片段对应的图像片段组成负样本,而实际上,上述两者组成的是正样本。再例如:若视频A中某一图像片段对应的语音为静音,视频B中另一图像片段对应的语音也是静音,若将视频A中的图像片段与视频B中的图像片段对应的语音片段进行组合,实际上,组成的是正样本。这样,在负样本对中就出现了不合理的负样本,进而降低神经网络训练的准确性,进而降低后续同步性度量的准确性。
有鉴于此,在本申请实施例中,在进行负样本采集时,需要去除不合理的负样本,也就是对训练数据库进行清洗,去除不适合用于训练的负样本。这样,能够提高负样本的准确性,进而提高神经网络训练的准确性,进而提高语音与图像同步性衡量的准确性。
具体的,可以通过以下三种方式进行负样本采样。
(1)错位负样本
所谓错位负样本,是指虽然语音和视觉来自于同一段训练视频,但是语音与视觉在时间上没有同步,即存在少量错位。
举例来说,采集T时刻的一帧图像,以及(T-t-20ms,T-t+20ms)的语音片段,进行处理后,构成一个负样本对。即把图像片段处理为图像数据,把语音片段处理为语音数据,再构建样本对<语音数据,图像数据>,简写为<语音,视觉>。
例如:对于错位负样本:<语音,视觉>负样本采自同一段视频,时间线少量错位,T时刻的一帧图像和(T-t-20ms,T-t+20ms)语音片段构成一个负样本对,其中|t|>80ms。即, 语音和视觉需要错位至少80ms,对应两帧图像的时间长度以上才被作为负样本对。并确保(T-20ms,T+20ms)语音片段和(T-t-20ms,T-t+20ms)语音片段,两者语义不同。
具体来说,就是在构建错位负样本时,语音与视觉的错位的时长需要大于或等于2倍的视觉时长。这样,能够确保错位负样本中的语音与错位负样本中与视觉对应同步的语音完全错开,进而确保后续训练的准确性。
而若采用其它帧率的训练视频,语音的帧长可以进行相应的调整,视觉的帧长也进行相应的调整。
此外,为了进一步提高后续训练的精准性,还需要确保错位负样本中的语音与错位负样本中与视觉对应同步的语音的语义不同。
在获得错位负样本后,将其作为候选负样本,并进行下文将要描述的视觉规则判定和语音规则判定,从而得到第一负样本。
(2)语音固定的负样本
所谓语音固定的负样本,是指语音是从同一段训练视频中提取出的,而视觉是从这段训练视频外的其它某段训练视频中随机提取出的。而上述其它某段训练视频中的语音与从上述同一段训练视频中提取出的语音在语义上存在不同。
例如:对于固定语音片段负样本:<语音,视觉>负样本采自不同视频,其中语音片段固定,从其他训练视频随机采样一帧图像,构成一个负样本对。其中,确保负样本对中的语音,和视觉所属正样本对中的语音,两者的语义不同。如负样本对中的语音的语义为“静音”,则负样本对中的视觉所属正样本对中的语音的语义不能为“静音”。
在获得语音固定的负样本后,将其作为候选负样本,并进行下文将要描述的视觉规则判定和语音规则判定,从而得到第二负样本。
(3)视觉固定的负样本
所谓视觉固定的负样本,是指视觉是从同一段训练视频中提取出的,而语音是从这段训练视频外的其它某段训练视频中随机提取出的。而上述其它某段训练视频中的视觉与从上述同一段训练视频中提取出的视觉在图像中人物的下半脸运动上存在不同。
例如:对于固定视觉帧负样本:<语音,视觉>负样本采自不同视频,其中视频帧固定,从其他视频随机采样一个语音片段,构成一个负样本对。其中,确保负样本对中的视频帧,和语音片段所属正样本对中的视觉图像,两个有足够下半脸运动上的差异。
在获得视觉固定的负样本后,将其作为候选负样本,并进行下文将要描述的视觉规则判定和语音规则判定,从而得到第三负样本。
上述第一图像片段和随机图像片段均为一个或多个连续时间点的图像。
此外,在实际应用中,考虑到单帧图像没有上下文信息,进而无法充分表达出图像中人物下半脸运动信息,因此,在采样时,可以采集连续T个时间点的图像,从而得到视觉,以及采集T个时间点的图像对应的语音片段,从而得到语音,进而将得到的视觉和语音进过处理后组成样本对,输入到神经网络中进行训练。一般来说,可以将T设置为5,对应的语音片段就是200ms。
在得到了上述三种类型的样本作为候选负样本后,针对这三种候选负样本,均需要做视觉规则判定和语音规则判定,并且,只保留两种判定都通过的候选负样本作为合格的负样本。具体判断过程如下:
1)语音规则判定:
判定负样本<语音a,视觉v>中,语音a,和视觉v所属正样本对中的语音a_positive,在语义上需要不同。
具体来说,其核心思路就是度量PPG特征序列之间的差异,即,针对候选负样本,语音规则判定是指判定负样本中的语音与负样本中的视觉所属正样本对中的语音对应的两个音素序列之间的编辑距离是否大于预设阈值。
由于语音样本已经处理为PPG特征序列，每个PPG特征是对应语音帧的所含音素的后验概率分布，因此，对后验概率分布取概率最大值，可以得到语音帧对应的音素，从而可以将PPG特征序列转化为音素序列P=[p_0, …, p_i, …, p_t]。
在得到负样本中语音a和对应正样本中语音a_positive的音素序列后，计算两个音素序列的编辑距离。具体的，可以用莱文斯坦距离（Levenshtein Distance）计算负样本中的语音P_1和对应正样本中的语音P_2之间的编辑距离D=L(P_1, P_2)。即通过删除、插入和替换操作，将P_1变成P_2得需要多少步骤，越相似的序列之间所需步骤越小。当D的值低于一个预设阈值，就判定两个语音样本过于相似；当D的值高于预设阈值，则判定两个语音样本有足够差异。预设阈值可由一个数据库统计获得。所述数据库中可以包括多组人工标注为相似的语音样本对和多组人工标注为有足够差异的语音样本对，通过对两类人工标注数据的编辑距离进行直方图统计，将混淆最小化的编辑距离的分界值确定为预设阈值。
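A minimal sketch of this speech rule: take the arg-max phoneme per PPG frame, compute the Levenshtein distance between the two phoneme sequences and compare it with a threshold (the threshold value below is a placeholder, not the calibrated one):

```python
import numpy as np

def ppg_to_phonemes(ppg: np.ndarray) -> list:
    """Take the most probable phoneme per frame: (T, P) posteriors -> length-T id sequence."""
    return np.argmax(ppg, axis=1).tolist()

def levenshtein(p1: list, p2: list) -> int:
    """Minimum number of insertions, deletions and substitutions turning p1 into p2."""
    dp = list(range(len(p2) + 1))
    for i, a in enumerate(p1, 1):
        prev, dp[0] = dp[0], i
        for j, b in enumerate(p2, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (a != b))  # substitution
    return dp[-1]

def speech_rule_passes(ppg_neg: np.ndarray, ppg_pos: np.ndarray, threshold: int = 5) -> bool:
    """Candidate negative passes when the edit distance D = L(P1, P2) exceeds the threshold."""
    d = levenshtein(ppg_to_phonemes(ppg_neg), ppg_to_phonemes(ppg_pos))
    return d > threshold
```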
2)视觉规则判定:
判定负样本<语音a,视觉v>中,视觉v,和语音a所属正样本对中的视觉v_positive,在下半脸运动上有足够不同。
具体来说,其核心思路就是判断负样本中视觉与相应正样本中的视觉的相似度如何,即针对候选负样本,视觉规则判定就是判断该候选负样本中视觉与相应正样本中的视觉的相似度差异是否大于预设阈值,该预设阈值可以根据实际需求选择。
由于上述两个视觉样本已经经过了预处理，都被处理成了下半脸的轮廓图，并且由于用的同一标准身份信息和投影坐标系，已经对齐。因此，可以利用阈值，将两个轮廓图从0~255的灰度图变成0/1的二值化轮廓图，记为M_{v1}和M_{v2}。
然后，计算两个二值化轮廓图的绝对差异D_1=∑|M_{v1}-M_{v2}|，以及计算两个二值化轮廓图的结构相似性（Structural Similarity，SSIM）D_2=SSIM(M_{v1}, M_{v2})，进而得到两者的加权和D=λ_1·D_1+λ_2·D_2。当D的值低于一个预设阈值，就判定两张视觉样本过于相似；当D的值高于预设阈值，则判定两张视觉样本有足够差异。权重λ_1、λ_2和预设阈值可由一个数据库统计获得。所述数据库中可以包括多组人工标注为相似的视觉样本对和多组人工标注为有足够差异的视觉样本对。可以对两类人工标注数据的绝对差异和结构相似性的加权值进行直方图统计，进而调整加权权重，将使得两类人工标注数据的混淆最小化的加权权重确定为最终权重，并将混淆最小化的加权权重的分界值确定为预设阈值。
当每个视觉样本包含连续T个时间点的图像时,对T帧图像进行预处理,逐一根据上述视觉规则判定两个视觉样本间对应帧之间的差异性,进而根据有差异帧的数量,与视觉样本中帧的总数量的比例做最终判定,若比例高于预设阈值,则判定两个视觉样本 有足够差异。
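A minimal sketch of the visual rule, assuming aligned 0-255 contour maps and using scikit-image for SSIM; the weights and thresholds are placeholders to be calibrated on the annotated database (λ2 may well be negative, since a higher SSIM means more similar):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def frame_is_different(c1: np.ndarray, c2: np.ndarray,
                       lam1: float = 1.0, lam2: float = 1.0,
                       bin_thr: int = 128, diff_thr: float = 1000.0) -> bool:
    """Binarize two aligned 0-255 contour maps and test D = lam1*D1 + lam2*D2 > diff_thr.
    lam1, lam2, bin_thr and diff_thr are assumed placeholders calibrated on labeled pairs;
    note that lam2 may be chosen negative, because a larger SSIM means the maps are more similar."""
    m1 = (c1 >= bin_thr).astype(np.float64)
    m2 = (c2 >= bin_thr).astype(np.float64)
    d1 = np.abs(m1 - m2).sum()            # absolute difference of the binarized contours
    d2 = ssim(m1, m2, data_range=1.0)     # structural similarity of the binarized contours
    return lam1 * d1 + lam2 * d2 > diff_thr

def visual_rule_passes(frames1: np.ndarray, frames2: np.ndarray, ratio_thr: float = 0.6) -> bool:
    """For T-frame samples, require that enough frame pairs are sufficiently different."""
    diff = sum(frame_is_different(f1, f2) for f1, f2 in zip(frames1, frames2))
    return diff / len(frames1) > ratio_thr
```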
以上的视觉规则判定和语音规则判定这个双重判定很重要,因为很多语音不同的字,嘴部运动是非常相似的。例如:体育的“育”和出门的“出”,都会嘟嘴一下。因此,只有通过双重判定的负样本,才属于合理的负样本,后续才能够用来对神经网络进行训练。这样,能够提高神经网络训练的准确性,进而提高语音与图像同步性衡量的准确性。
在对这三种候选负样本进行筛选后,就得到了第一负样本、第二负样本和第三负样本,进而采用第一负样本、第二负样本和第三负样本进行神经网络训练。
具体地,当判定出第一图像数据对应的语音数据与第二语音数据在语音类别后验概率PPG上存在不同,以及第一图像数据与第二语音数据对应的图像数据在下半脸运动上存在不同时,将第一图像数据和第二语音数据组成第一负样本;当判定出第一图像数据对应的语音数据与第三语音数据在语音类别后验概率上存在不同,以及第一图像数据与第三语音数据对应的图像数据在下半脸运动上存在不同时,将第一图像数据和第三语音数据组成第二负样本;当判定出第二图像数据对应的语音数据与第一语音数据或第二语音数据在语音类别后验概率上存在不同,以及第二图像数据与第一语音数据或第二语音数据对应的图像数据在下半脸运动上存在不同时,将第一语音数据或第二语音数据,和第二图像数据组成第三负样本。其中,第一图像数据对应的语音数据是指第一图像数据所属正样本对中的语音数据,第二图像数据对应的语音数据是指第二图像数据所属正样本对中的语音数据,第一语音数据对应的图像数据是指第一语音数据所属正样本对中的图像数据,第二语音数据对应的图像数据是指第二语音数据所属正样本对中的图像数据,第三语音数据对应的图像数据是指第三语音数据所属正样本对中的图像数据。
以第一负样本的组成为例进行说明,以上述错位负样本作为候选负样本,进行语音规则判定,即判定第二语音片段对应的第二语音数据与第一图像片段对应的第一图像数据所属正样本对中的语音,两者在语音类别后验概率上是否不同,即,两者对应的两个音素序列之间的编辑距离是否大于预设阈值,若大于预设阈值,则表示两者在语音类别后验概率上不同。此外,还需以上述错位负样本作为候选负样本,进行视觉规则判定,即判定第一图像片段对应的第一图像数据与第二语音片段对应的第二语音数据所属正样本对中的图像数据,两者的相似度差异是否大于预设阈值,若大于预设阈值,则表示两者不同。当判定出第二语音数据与第一图像数据所属正样本对中的语音在语音类别后验概率上不同,且第一图像数据与第二语音数据所属正样本对中的图像数据的相似度差异大于预设阈值时,将第一图像数据和第二语音数据组成第一负样本。
以第二负样本的组成为例进行说明,以上述语音固定的负样本作为候选负样本,进行语音规则判定,即判定第三语音片段对应的第三语音数据与第一图像片段对应的第一图像数据所属正样本对中的语音,两者在语音类别后验概率上是否不同,即,两者对应的两个音素序列之间的编辑距离是否大于预设阈值,若大于预设阈值,则表示两者在语音类别后验概率上不同。此外,还需以上述语音固定的负样本作为候选负样本,进行视觉规则判定,即判定第一图像片段对应的第一图像数据与第三语音片段对应的第三语音数据所属正样本对中的图像数据,两者的相似度差异是否大于预设阈值,若大于预设阈值,则表示两者不同。当判定出第三语音数据与第一图像数据所属正样本对中的语音在语音类别后验概率上不同,且第一图像数据与第三语音数据所属正样本对中的图像数据 的相似度差异大于预设阈值时,将第一图像数据和第三语音数据组成第二负样本。
以第三负样本的组成为例进行说明,以上述视觉固定的负样本作为候选负样本,进行语音规则判定,即判定第一/第二语音片段对应的第一/第二语音数据与第二图像片段对应的第二图像数据所属正样本对中的语音,两者在语音类别后验概率上是否不同,即,两者对应的两个音素序列之间的编辑距离是否大于预设阈值,若大于预设阈值,则表示两者在语音类别后验概率上不同。此外,还需以上述视觉固定的负样本作为候选负样本,进行视觉规则判定,即判定第二图像片段对应的第二图像数据与第一/第二语音片段对应的第一/第二语音数据所属正样本对中的图像数据,两者的相似度差异是否大于预设阈值,若大于预设阈值,则表示两者不同。当判定出第一/第二语音数据与第二图像数据所属正样本对中的语音在语音类别后验概率上不同,且第二图像数据与第一/第二语音数据所属正样本对中的图像数据的相似度差异大于预设阈值时,将第二图像数据和第一/第二语音数据组成第三负样本。通过上述方式构建样本对,可以避免引入错误负样本对,以及实现难样本对的挖掘,进一步提高语音与图像同步性衡量模型的精确度,从而提高语音与图像同步性衡量的准确性。此外,由于对语音数据和图像数据进行处理得到的语音信号和轮廓图对应的数据格式便于衡量数据差异性,因而可以高效地实现样本对的构建。
三、神经网络训练
基于图9所示的架构图,虽然图9中示出的是衡量语音与图像同步性的架构示意图,但是,也可以基于该架构来对语音与图像同步性衡量模型进行训练。具体比如是将上述采集得到的正样本、第一负样本、第二负样本和第三负样本输入语音与图像同步性衡量模型中进行训练,就能够调整语音与图像同步性衡量模型中的各项参数,进而更加准确地对语音与图像的同步性进行衡量。
在这里,语音与图像同步性衡量模型就是语音神经网络、视觉神经网络以及同步性衡量模型所组成的。
图13为本申请实施例中训练神经网络的流程示意图,参见图13所示,该过程可以包括训练前期和训练后期两个阶段,具体如下:
1、训练前期
S1301:将正样本、第一负样本、第二负样本和第三负样本分为不同批次输入语音与图像同步性衡量模型进行训练,调整语音与图像同步性衡量模型中的参数。其中,通过平衡采样,使得每个批次内的正样本数量和负样本数量相近,有助于模型训练。
具体来说,可以通过损失函数来调整语音与图像同步性衡量模型中的参数,损失函数具体如下式(3)所示:
L = \frac{1}{N}\sum_{n=1}^{N}\left[y_n \cdot d_p^2 + (1-y_n)\cdot \max(margin_1 - d_n,\, 0)^2\right]    (3)
其中，L表示损失值，N表示批次内的样本数量，n表示样本的标号，y_n表示样本的标签，y_n=1表示正样本，y_n=0表示负样本，d_p表示正样本距离，d_p=‖v-a‖_2，d_n表示负样本距离，d_n=‖v-a‖_2，v表示视觉神经网络抽取的视觉特征，a表示语音神经网络抽取的语音特征，margin_1为特定值。这里的margin_1与训练后期的margin_2可以不同。
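A hedged PyTorch sketch of the contrastive loss in equation (3); the margin value below is a placeholder:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(visual_feat: torch.Tensor, speech_feat: torch.Tensor,
                     labels: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Eq. (3): y * d^2 + (1 - y) * max(margin - d, 0)^2, averaged over the batch.

    visual_feat, speech_feat: (N, 512) features from the two branches.
    labels: (N,) with 1 for positive pairs and 0 for negative pairs.
    margin: stands for margin_1 in the text (its concrete value is not disclosed).
    """
    d = F.pairwise_distance(visual_feat, speech_feat)          # Euclidean distance ||v - a||_2
    pos_term = labels * d.pow(2)
    neg_term = (1.0 - labels) * torch.clamp(margin - d, min=0.0).pow(2)
    return (pos_term + neg_term).mean()
```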
而基于损失函数值调整模型中的各项参数的具体方式,例如:可以通过利用Adam优化算法训练模型,对应参数为beta_1=0.99,beta_2=0.999。在训练前期,将批次大小设 置为256,训练1000个时代(epochs),并将学习率初始设置为0.005,并在100个时代(epoch)后利用余弦衰减策略将学习率逐渐衰减到0。类似的,在训练后期,训练500个时代(epochs),将学习率初始设置为0.001,并在100个时代(epoch)后利用余弦衰减策略将学习率逐渐衰减到0。使用中上述具体的训练参数和模型参数,需要随数据库变化而做相应调整。当然,还可以采用其它具体方式,此处不做限定。
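A sketch of the quoted optimizer and learning-rate schedule for the early training stage (Adam with beta_1 = 0.99, beta_2 = 0.999, learning rate 0.005 held for 100 epochs and then cosine-decayed to 0 over the remaining epochs); PyTorch is assumed:

```python
import math
import torch

def build_optimizer(model: torch.nn.Module, base_lr: float = 0.005,
                    warm_epochs: int = 100, total_epochs: int = 1000):
    """Adam optimizer plus a schedule that keeps base_lr for warm_epochs
    and then decays it to 0 with a cosine curve, as described for the early training stage."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, betas=(0.99, 0.999))

    def lr_lambda(epoch: int) -> float:
        if epoch < warm_epochs:
            return 1.0
        progress = (epoch - warm_epochs) / max(1, total_epochs - warm_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
    return optimizer, scheduler
```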
2、训练后期
为了进一步对语音与图像同步性衡量模型进行优化,在前期训练完模型后,可以继续在每个训练批次中采用在线难样本挖掘策略,使用在线挖掘出的难样本对模型再次进行训练,直到训练后的模型处于某一精度区间不再产生较大波动为止。
具体来说,与训练前期不同,训练后期只将所有的正样本分为不同批次(如M批),并通过将批次内的不同正样本间做组合,在线获得负样本,称为批次内的负样本。对每个批次内的正样本和负样本按照规则,根据损失函数输出的损失值进行排序;根据所述损失值获取当前批次内所述正样本中的难正样本;根据所述损失值获取当前批次内的负样本中的多个难负样本。
S1302:获取一批次内正样本中的难正样本。
针对每个批次获取一批次内正样本中的难正样本,例如,可以将所有的正样本分为不同批次;对每个批次内的正样本按照损失函数输出的损失值进行排序;根据损失值获取当前批次内正样本中的难正样本。
具体来说，在训练集合中随机采样N个正样本（语音和视觉）组成训练批次后，通过当前的语音神经网络和视觉神经网络分别提取出语音特征a_i和视觉特征v_i，i∈N。然后，找出每个批次内的难正样本。难正样本具体如下式(4)所示：
d_p^{hard} = \max_{i \in N} \|v_i - a_i\|_2    (4)
其中，d_p^{hard}表示难正样本，v表示视觉神经网络抽取的视觉特征，a表示语音神经网络抽取的语音特征，N表示样本的标号。
S1303:获取该批次内的负样本和负样本中的多个难负样本。
具体为,根据该批次内正样本生成该批次内的负样本,获取该批次内的负样本中的多个难负样本。
其中，根据每个批次内正样本生成本批次内的负样本，具体为，将步骤S1302中的训练批次内的N个正样本获取的N个语音特征和N个视觉特征两两组合，可以形成一个N×N矩阵，排除对角线上正样本组合，得到N×(N-1)个组合作为候选负样本，经过视觉规则判定和语音规则判定，得到的合格负样本即为批次内负样本。
其中，多个负样本与正样本中的每个样本对应。也就是说，步骤S1302中的每一个正样本，都对应有多个负样本。步骤S1303就是针对每一个正样本，在其对应的多个负样本中找出难负样本。
其中，获取每个批次内的负样本中的多个难负样本，具体为对语音特征a_i对应的负样本按照损失函数输出的损失值进行排序，根据损失值获取语音特征a_i对应的难负样本；和/或对视觉特征v_i对应的负样本按照损失函数输出的损失值进行排序，根据损失值获取视觉特征v_i对应的难负样本。
举例来说，假设存在3个正样本，则可以组成一个3×3矩阵，除去对角线上的正样本组合，共6个候选负样本，即
(v_1, a_2)、(v_1, a_3)、(v_2, a_1)、(v_2, a_3)、(v_3, a_1)、(v_3, a_2)。
去除矩阵内的不合格负样本后,矩阵内所剩均为合格负样本。矩阵内每i横行为第i个正样本的语音对应的负样本,每横行中损失函数最大的记为第i个正样本的语音对应的难负样本;类似的,矩阵内第i纵列为第i个正样本的视觉对应的负样本,每纵列中损失函数最大的记为第i个正样本的视觉对应的难负样本。
其中，在本实施例中，损失函数最大，对应于距离‖v-a‖_2最小。
特别的,当某一横行或某一纵列不含有合格负样本时,则不计算难负样本。
难负样本具体如下式(5)和(6)所示:
d_{n,a_j}^{hard} = \min_{i \neq j} \|v_i - a_j\|_2    (5)
d_{n,v_j}^{hard} = \min_{i \neq j} \|v_j - a_i\|_2    (6)
其中，最小值仅在合格负样本上选取；d_{n,a_j}^{hard}表示第j个正样本的语音对应的难负样本的距离，d_{n,v_j}^{hard}表示第j个正样本的视觉对应的难负样本的距离，v表示视觉神经网络抽取的视觉特征，a表示语音神经网络抽取的语音特征。
其中，当第j横行不含有合格负样本时，d_{n,a_j}^{hard} = margin_2，margin_2为特定值。类似的，当第j纵列不含有合格负样本时，d_{n,v_j}^{hard} = margin_2。
也就是说，难负样本挖掘的本质就是排序。在一个训练批次内，对于一个语音样本a_j，遍历批次内所有视觉样本，构建负样本对组合(v_0, a_j), …, (v_N, a_j)，若存在合格负样本，则从合格负样本中选出难的一个负样本对。以及对于一个视觉样本v_j，遍历批次内所有语音样本，构建负样本对组合(v_j, a_0), …, (v_j, a_N)，若存在合格负样本，则从合格负样本中选出难的一个负样本对。
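A hedged sketch of this in-batch hard-negative mining step; the qualification mask is assumed to be computed beforehand with the visual and speech rules:

```python
import torch

def mine_hard_negatives(visual_feats: torch.Tensor, speech_feats: torch.Tensor,
                        qualified: torch.Tensor, margin2: float = 1.0):
    """visual_feats, speech_feats: (N, 512) features of the N in-batch positives.
    qualified: (N, N) boolean mask, True where (v_i, a_j) passed both the visual
    and the speech rule (diagonal entries are the positives and must be False).
    Returns per-sample hard-negative distances for the speech and visual anchors."""
    dist = torch.cdist(visual_feats, speech_feats)        # dist[i, j] = ||v_i - a_j||_2
    masked = dist.masked_fill(~qualified, float('inf'))   # ignore positives / unqualified pairs

    # Hard negative for the speech of sample j: the closest qualified v_i paired with a_j.
    hard_for_speech = masked.min(dim=0).values
    # Hard negative for the visual of sample j: the closest qualified a_i paired with v_j.
    hard_for_visual = masked.min(dim=1).values

    # Rows/columns without any qualified negative fall back to margin_2 (zero loss contribution).
    hard_for_speech = torch.where(torch.isinf(hard_for_speech),
                                  torch.full_like(hard_for_speech, margin2), hard_for_speech)
    hard_for_visual = torch.where(torch.isinf(hard_for_visual),
                                  torch.full_like(hard_for_visual, margin2), hard_for_visual)
    return hard_for_speech, hard_for_visual
```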
S1304:将难正样本和多个难负样本输入调整参数后的语音与图像同步性衡量模型进行训练,再次调整语音与图像同步性衡量模型中的参数。
从正样本和负样本中在线挖掘出难正样本以及难负样本后,就无需再对批次内所有的正样本和负样本进行损失计算了。因此,语音与图像同步性衡量模型对应的损失函数也相应的会发生一些变化,变化后的损失函数具体如下式(7)所示:
l = \frac{1}{N}\sum_{j=1}^{N}\left[(d_p^{hard})^2 + \max(margin_2 - d_{n,a_j}^{hard},\, 0)^2 + \max(margin_2 - d_{n,v_j}^{hard},\, 0)^2\right]    (7)
其中，l表示损失值，d_p^{hard}表示难正样本距离，d_{n,a_j}^{hard}表示第j个正样本的语音对应的难负样本的距离，d_{n,v_j}^{hard}表示第j个正样本的视觉对应的难负样本的距离，N表示批次的样本数量，margin_2为特定值。
通过难正样本和难负样本,以及发生相应变化后的损失函数,就能够进一步对语音与图像同步性衡量模型中的参数进行调整,进一步优化模型,提高模型预测的准确性。
在实际的模型优化过程中,一般来说,并不只进行一次优化,而是会多次进行优化。也就是说,在利用当前批次的训练数据优化完一次模型后,再次利用下一批次的训练数据并获取对应的难正样本和难负样本,然后输入到当前的模型中再次进行训练,反复多次,直到对应的损失函数的输出值维持在一个稳定的区域,即输出值处于某一精度区间不再产生较大波动为止。
S1305:再次获取下一批次内正样本中的难正样本。
S1306:再次获取该批次内的负样本和负样本中的多个难负样本。
其中,多个难负样本与正样本中的每个样本对应。
S1307:将再次获取的难正样本和多个难负样本输入再次调整参数后的语音与图像同步性衡量模型进行训练,调整语音与图像同步性衡量模型中的参数,直到语音与图像同步性衡量模型对应的损失函数输出的损失值收敛为止。即损失值处于某一精度区间不再产生较大波动为止。
步骤S1305、S1306、S1307与上述步骤S1302、S1303、S1304的具体实现方式相似,此处不再赘述。
至此按照上述方式处理m批样本后，语音与图像同步性衡量模型就训练完成了。其中，m小于或等于M（M为正样本划分的批次）。当需要衡量某一视频中的语音片段与图像片段是否具有同步性时，将该视频中的语音片段与图像片段分别通过上述步骤S402-S405与S406-S412进行处理后，再分别输入语音与图像同步性衡量模型中，模型的输出结果就能够表征该视频中的语音片段与图像片段是否具有同步性了。
在这里,完整地对本申请实施例提供的语音与图像同步性的衡量方法的流程进行说明。
图14为本申请实施例中语音与图像同步性的衡量方法的完整流程示意图,参见图14所示,在获取到视频流后,分为两路。其中一路,将视频流输入预处理模块,对视频流进行预处理,得到语音片段。再将语音片段输入SI-ASR系统,将视频流处理成PPG信号。再将多个单帧的PPG信号累积为一个语音数据。进而将语音数据输入语音神经网络,得到语音特征。另外一路,将视频流逐帧进行稠密人脸对齐。在一帧图像中,可能有多个人脸,需要对每一张人脸都执行以下步骤:从人脸中提取表情系数。采用正面姿态、标准ID,将从人脸图像中提取到的表情系数生成3D模型。将3D模型中对应的顶点进行投影,得到轮廓图。将获得的多帧轮廓图累积为一个图像数据。进而将图像数据输入视觉神经网络,得到视觉特征。最后,将语音特征和视觉特征输入到同步性衡量模块中,以衡量视频流中的语音与图像是否同步。若满足阈值,则确定同步;若不满足阈值,则确定不同步。通过同步性衡量模块,能够判定语音特征和视觉特征的同步性。具体的同步性度量可以通过计算语音特征与视觉特征在向量上的距离,进而与预设的阈值的比较实现。最后,通过同步性衡量模块,能够判定出同步性最佳的人脸。若视频中所有人脸的同步性都达不到预设的阈值,则判断当前时间片段下视频图像中没有合适的人脸。
基于同一发明构思,作为对上述语音与图像同步性的衡量方法的实现,本申请实施例还提供了一种语音与图像同步性的衡量装置。图15为本申请实施例中语音与图像同步性的衡量装置的结构示意图一,参见图15所示,该装置可以包括:
接收模块1501,用于获取视频中的语音片段和图像片段。
数据处理模块1502，用于执行以下操作中的任意一项：将语音片段转换为特定信号并获取所述特定信号的语音特征以及图像片段的视觉特征，特定信号与语音片段中说话人的个人特征无关；或，根据图像片段生成目标人物的轮廓图并获取轮廓图的视觉特征以及语音片段的语音特征，轮廓图与目标人物的个人特征无关；或，将语音片段转换为特定信号，根据图像片段生成目标人物的轮廓图，并获取特定信号的语音特征以及轮廓图的视觉特征。
同步性衡量模块1503用于根据语音特征与视觉特征,确定语音片段与图像片段是否具有同步性。
进一步地,作为图15示装置的细化和扩展,本申请实施例还提供了一种语音与图像同步性的衡量装置。图16本申请实施例中语音与图像同步性的衡量装置的结构示意图二,参见图16示,该装置可以包括:
接收模块1601,用于获取视频中的语音片段和图像片段。
在一实施方式中,预处理模块1602,用于将语音片段的采样频率转换为特定频率;相应的,数据处理模块1603,用于将转换为特定频率后的语音片段转换为特定信号。
在一实施方式中,预处理模块1602,用于去除语音片段中的背景音,将去除背景音后的语音片段中不同说话人的语音分离,得到至少一个语音子片段;相应的,数据处理模块1603,用于将语音子片段转换为特定信号。
在一实施方式中,预处理模块1602,用于采用滑动加权的方式,将语音片段切分为多个语音帧,相邻的语音帧之间存在重叠;相应的,数据处理模块1603,用于将多个语音帧分别转换为多个特定信号。
在一实施方式中,特定信号为语音类别后验概率PPG信号。
在一实施方式中,数据处理模块1603,具体用于通过说话者语言自动识别SI-ASR系统将语音片段转换为语音类别后验概率PPG信号。
在一实施方式中,特征提取模块1604,还用于通过视觉神经网络获得图像片段的视觉特征。
在一实施方式中,特征提取模块1604包括:
第一提取单元1604a,用于采用多个1维卷积层对特定信号在时间维度上进行处理,得到特征矩阵,1维卷积层的数量与特定信号对应的时长相关;
第二提取单元1604b,用于将特征矩阵重组为特征向量;
第三提取单元1604c，用于采用3个全连层和1个线性投影层对所述特征向量进行处理，得到512维的语音特征。
同步性衡量模块1605,用于根据语音特征与视觉特征,确定语音片段与图像片段是否具有同步性。
进一步地,作为图15所示装置的细化和扩展,本申请实施例还提供了一种语音与图像同步性的衡量装置。图17为本申请实施例中语音与图像同步性的衡量装置的结构示意图三,参见图17所示,该装置可以包括:
接收模块1701,用于获取视频中的语音片段和图像片段。
在一实施方式中,预处理模块1702包括:
检测单元1702a,用于对图像片段进行人脸检测,得到人脸检测框;
对齐单元1702b,用于将人脸检测框中的人脸进行水平对齐;
数据处理模块1703,用于根据图像片段生成目标人物的轮廓图,轮廓图与目标人物的个人特征无关。
在一实施方式中，当轮廓图为人脸轮廓图时，数据处理模块1703包括：
提取单元1703a,用于从图像片段中提取目标人物的表情系数。
生成单元1703b,用于基于表情系数和通用参数化人脸模型生成所述目标人物的人脸轮廓图。
在一实施方式中,提取单元1703a,具体用于通过三维可形变参数化人脸模型参数估计算法提取所述图像片段中所述目标人物的表情系数,表情系数符合三维可形变参数化人脸模型的标准。
在一实施方式中,生成单元1703b,具体用于提取表情系数中下半脸对应的下半脸表情系数;将下半脸表情系数输入所述通用三维人脸模型,得到目标人物的下半脸对应的三维人脸模型,并将三维人脸模型处理为目标人物的人脸轮廓图。
在一实施方式中,所述生成单元1703b,具体用于将下半脸表情系数输入通用三维人脸模型,得到目标人物的下半脸对应的三维人脸模型;获取三维人脸模型中下半脸的顶点集合。将顶点集合投影到二维平面,得到目标人物的下半脸轮廓图,并将下半脸轮廓图作为目标人物的人脸轮廓图。
特征提取模块1704，用于通过语音神经网络获得语音片段的语音特征。
在一实施方式中，所述特征提取模块1704包括：
第一提取单元1704a,用于采用卷积层处理所述轮廓图,得到特征矩阵,所述卷积层的卷积核尺寸和步长与轮廓图的尺寸相关;
第二提取单元1704b,用于采用视觉神经网络的主干网络处理所述特征矩阵,得到特征向量;
第三提取单元1704c，用于采用全连接层处理所述特征向量，得到512维的视觉特征。
同步性衡量模块1705,用于根据语音特征与视觉特征,确定语音片段与图像片段是否具有同步性。
在一实施方式中,当视频为多人谈话的视频时,同步性衡量模块1705,用于根据语音特征与视觉特征,确定视频中语音片段对应的说话人。
当视频为待进行真伪性鉴定的视频时,同步性衡量模块1705,用于根据语音特征与视觉特征,确定视频中语音片段是否属于图像片段中的人物。
当视频为待调制的视频时，同步性衡量模块1705，用于根据语音特征与视觉特征，将视频中的语音片段和图像片段的起始位对齐，使得语音片段与图像片段同步。
这里需要指出的是,以上语音与图像同步性的衡量装置实施例的描述,与上述语音与图像同步性的衡量方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本申请装置实施例中未披露的技术细节,请参照本申请方法实施例的描述而理解。
基于同一发明构思,作为对上述语音与图像同步性衡量模型的训练方法的实现,本申请实施例还提供了一种语音与图像同步性衡量模型的训练装置。图18为本申请实施例中语音与图像同步性衡量模型的训练装置的结构示意图一,参见图18所示,该装置可以包括:
数据处理模块1801,用于将第一图像片段处理为第一图像数据、第一语音片段处理为第一语音数据、第二语音片段处理为第二语音数据,其中:第一图像片段、第一语音片段和第二语音片段来自于第一训练视频,第一图像片段与第一语音片段具有同步性, 第一图像片段与第二语音片段不具有同步性。
数据处理模块1801,还用于将随机图像片段处理为第二图像数据、随机语音片段处理为第三语音数据,其中:随机图像片段和随机语音片段来自于第二训练视频。
样本生成模块1802,用于将第一图像数据和第一语音数据组成正样本。
样本生成模块1802,还用于将第一图像数据和第二语音数据组成第一负样本。
样本生成模块1802,还用于将第一图像数据和第三语音数据组成第二负样本。
样本生成模块1802,还用于将第一语音数据或第二语音数据,和第二图像数据组成第三负样本。
训练模块1803,用于采用正样本、第一负样本、第二负样本和第三负样本训练语音与图像同步性衡量模型。
进一步地,作为图18所示装置的细化和扩展,本申请实施例还提供了一种语音与图像同步性衡量模型的训练装置。图19为本申请实施例中语音与图像同步性衡量模型的训练装置的结构示意图二,参见图19所示,该装置可以包括:
接收模块1901,用于获取第一训练视频中的第一图像片段、第一语音片段、第二语音片段,第一图像片段与第一语音片段具有同步性,第一图像片段与第二语音片段不具有同步性。
接收模块1901,还用于获取随机图像片段和随机语音片段,随机图像片段和随机语音片段来自于第二训练视频。
在一实施方式中,第一图像片段和随机图像片段的帧长均小于第一语音片段、第二语音片段或随机语音片段的帧长。
在一实施方式中,语音数据的语音帧数与所述图像数据的图像帧数相关,语音数据包括第一语音数据、第二语音数据或第三语音数据,图像数据包括第一图像数据或第二图像数据。
在一实施方式中,第二语音片段与第一图像片段错位的时长大于或等于第二语音片段的总时长的2倍。
在一实施方式中,第一图像片段和随机图像片段均为一个或多个连续时间点的图像。
在一实施方式中,训练视频为单人说话的人像视频,训练视频中背景声的干扰程度小于特定程度,其中,训练视频包括第一训练视频和第二训练视频。
在一实施方式中,数据处理模块1902,用于分别从第一图像片段和随机图像片段中提取目标人物的轮廓图,轮廓图与目标人物的个人特征无关;和/或,
所述数据处理模块1902,还用于分别将第一语音片段、第二语音片段和随机语音片段转换为特定信号,特定信号与第一语音片段、第二语音片段以及随机语音片段中说话人的个人特征无关。
在一实施方式中,样本生成模块1903,用于将第一图像数据和第一语音数据组成正样本;所述样本生成模块1903,还用于将第一图像数据和第二语音数据组成第一负样本;所述样本生成模块1903,还用于将第一图像数据和第三语音数据组成第二负样本;所述样本生成模块1903,还用于将第一语音数据或第二语音数据,和第二图像数据组成第三负样本。
在一实施方式中，样本生成模块1903，具体用于当判定出第一图像数据对应的语音数据与第二语音数据在语音类别后验概率PPG上存在不同，以及第一图像数据与第二语音数据对应的图像数据在下半脸运动上存在不同时，将第一图像数据和第二语音数据组成第一负样本；当判定出第一图像数据对应的语音数据与第三语音数据在语音类别后验概率上存在不同，以及第一图像数据与第三语音数据对应的图像数据在下半脸运动上存在不同时，将第一图像数据和第三语音数据组成第二负样本；当判定出第二图像数据对应的语音数据与第一语音数据或第二语音数据在语音类别后验概率上存在不同，以及第二图像数据与第一语音数据或第二语音数据对应的图像数据在下半脸运动上存在不同时，将第一语音数据或第二语音数据，和第二图像数据组成第三负样本。
在一实施方式中,训练模块1904包括:
参数调整单元1904a,用于将正样本、第一负样本、第二负样本和第三负样本分批次输入语音与图像同步性衡量模型进行训练,调整语音与图像同步性衡量模型中的参数;
难样本选择单元1904b,用于获取每个批次内正样本中的难正样本;
所述难样本选择单元1904b,还用于根据每个批次内正样本生成本批次内的负样本,获取每个批次内的负样本中的多个难负样本;
参数再调单元1904c,用于将难正样本和多个难负样本输入调整参数后的语音与图像同步性衡量模型进行训练,调整语音与图像同步性衡量模型中的参数,直到语音与图像同步性衡量模型对应的损失函数输出的损失值收敛为止。
在一实施方式中，难样本选择单元1904b，还用于将每个批次内N个正样本对应的N个语音特征a_i和N个视觉特征v_i两两组合，得到N×(N-1)个候选负样本；将候选负样本经过视觉规则判定和语音规则判定，得到合格负样本确定为本批次内负样本；其中，i∈N，N为正整数。
在一实施方式中，难样本选择单元1904b，还用于对语音特征a_i对应的负样本按照损失函数输出的损失值进行排序；根据损失值获取语音特征a_i对应的难负样本；和/或，对视觉特征v_i对应的负样本按照损失函数输出的损失值进行排序；根据损失值获取视觉特征v_i对应的难负样本。
在一实施方式中,难样本选择单元1904b,还用于将所有的正样本分为不同批次;对每个批次内的正样本按照损失函数输出的损失值进行排序;根据损失值获取当前批次内所述正样本中的难正样本。
至此按照上述方式处理m批样本后,语音与图像同步性衡量模型就训练完成了。其中,m小于或等于M(M为正样本划分的批次)。
这里需要指出的是,以上语音与图像同步性衡量模型的训练装置实施例的描述,与上述语音与图像同步性衡量模型的训练方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本申请装置实施例中未披露的技术细节,请参照本申请方法实施例的描述而理解。
基于同一发明构思,本申请实施例还提供了一种电子设备。图20为本申请实施例中电子设备的结构示意图,参见图20所示,该电子设备可以包括:处理器2001、存储器2002、总线2003;其中,处理器2001、存储器2002通过总线2003完成相互间的通信;处理器2001用于调用存储器2002中的程序指令,以执行上述一个或多个实施例中的方 法。
这里需要指出的是,以上电子设备实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本申请电子设备实施例中未披露的技术细节,请参照本申请方法实施例的描述而理解。
基于同一发明构思,本申请实施例还提供了一种计算机可读存储介质,该存储介质可以包括:存储的程序;其中,在程序运行时控制存储介质所在设备执行上述一个或多个实施例中的方法。
这里需要指出的是,以上存储介质实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本申请存储介质实施例中未披露的技术细节,请参照本申请方法实施例的描述而理解。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (29)

  1. 一种语音与图像同步性的衡量方法,包括:
    获取视频中的语音片段和图像片段,所述语音片段和所述图像片段在所述视频中具有对应关系;
    执行以下操作中的任意一项:
    将所述语音片段转换为特定信号并获取所述特定信号的语音特征以及所述图像片段的视觉特征,所述特定信号与所述语音片段中说话人的个人特征无关;或,
    根据所述图像片段生成目标人物的轮廓图并获取所述轮廓图的视觉特征以及所述语音片段的语音特征,所述轮廓图与所述目标人物的个人特征无关;或,
    将所述语音片段转换为特定信号,根据所述图像片段生成目标人物的轮廓图,并获取所述特定信号的语音特征以及所述轮廓图的视觉特征;
    根据所述语音特征以及所述视觉特征,确定所述语音片段与所述图像片段是否具有同步性,所述同步性用于表征所述语音片段中的声音与所述图像片段中目标人物的运动相匹配。
  2. 根据权利要求1所述的方法,其中,所述轮廓图为人脸轮廓图;所述根据所述图像片段生成目标人物的轮廓图,包括:
    从所述图像片段中提取所述目标人物的表情系数;
    基于所述表情系数和通用参数化人脸模型生成所述目标人物的人脸轮廓图。
  3. 根据权利要求2所述的方法,其中,所述从所述图像片段中提取所述目标人物的表情系数,包括:
    通过三维可形变参数化人脸模型参数估计算法提取所述图像片段中所述目标人物的表情系数,表情系数符合三维可形变参数化人脸模型的标准。
  4. 根据权利要求3所述的方法,其中,在所述通过三维可形变参数化人脸模型参数估计算法提取所述图像片段中所述目标人物表情系数之前,所述方法还包括:
    对所述图像片段进行人脸检测,得到人脸检测框;
    将所述人脸检测框中的人脸进行水平对齐;
    所述通过三维可形变参数化人脸模型参数估计算法提取所述图像片段中所述目标人物的表情系数,包括:
    从对齐后的人脸中提取所述目标人物的表情系数。
  5. 根据权利要求2至4中任一项所述的方法,其中,所述通用参数化人脸模型为通用三维人脸模型;所述基于所述表情系数和通用参数化人脸模型生成所述目标人物的人脸轮廓图,包括:
    提取所述表情系数中下半脸对应的下半脸表情系数;
    将所述下半脸表情系数输入所述通用三维人脸模型,得到所述目标人物的下半脸对应的三维人脸模型,并将所述三维人脸模型处理为所述目标人物的人脸轮廓图。
  6. 根据权利要求5所述的方法,其中,所述将所述下半脸表情系数输入所述通用三维人脸模型,得到所述目标人物的下半脸对应的三维人脸模型,并将所述三维人脸模型处理为所述目标人物的人脸轮廓图,包括:
    将所述下半脸表情系数输入所述通用三维人脸模型，得到所述目标人物的下半脸对应的三维人脸模型；
    获取所述三维人脸模型中下半脸的顶点集合;
    将所述顶点集合投影到二维平面,得到所述目标人物的下半脸轮廓图,并将所述下半脸轮廓图作为所述目标人物的人脸轮廓图。
  7. 根据权利要求1至6中任一项所述的方法,其中,所述获取所述轮廓图的视觉特征,包括:
    采用卷积层处理所述轮廓图,得到特征矩阵,所述卷积层的卷积核尺寸和步长与所述轮廓图的尺寸相关;
    采用视觉神经网络的主干网络处理所述特征矩阵,得到特征向量;
    采用全连接层处理所述特征向量,得到视觉特征,所述视觉特征的维度与所述轮廓图的数据量和视觉神经网络所采用的损失函数的类型相关。
  8. 根据权利要求1至7中任一项所述的方法,其中,所述特定信号为语音类别后验概率PPG信号。
  9. 根据权利要求1至8中任一项所述的方法,其中,在所述将所述语音片段转换为特定信号之前,所述方法还包括:
    将所述语音片段的采样频率转换为特定频率;
    所述将所述语音片段转换为特定信号,包括:
    将转换为特定频率后的语音片段转换为特定信号。
  10. 根据权利要求1至8中任一项所述的方法,其中,在所述将所述语音片段转换为特定信号之前,所述方法还包括:
    去除所述语音片段中的背景音;
    将去除背景音后的语音片段中不同说话人的语音分离,得到至少一个语音子片段;
    所述将所述语音片段转换为特定信号,包括:
    将所述语音子片段转换为特定信号。
  11. 根据权利要求1至8中任一项所述的方法,其中,在所述将所述语音片段转换为特定信号之前,所述方法还包括:
    采用滑动加权的方式,将所述语音片段切分为多个语音帧,相邻的语音帧之间存在重叠;
    所述将所述语音片段转换为特定信号,包括:
    将所述多个语音帧分别转换为多个特定信号。
  12. 根据权利要求1至11中任一项所述的方法,其中,所述获取所述特定信号的语音特征,包括:
    采用多个1维卷积层对所述特定信号在时间维度上进行处理,得到特征矩阵,所述1维卷积层的数量与所述特定信号对应的时长相关;
    将所述特征矩阵重组为特征向量;
    采用3个全连层和1个线性投影层对所述特征向量进行处理,得到所述语音特征,所述语音特征的维度与所述语音片段的数据量和语音神经网络所采用的损失函数的类型相关。
  13. 根据权利要求1至12中任一项所述的方法,其中,所述视频为多人谈话的视频; 所述根据所述语音特征与所述视觉特征,确定所述语音片段与所述图像片段是否具有同步性,包括:根据所述语音特征与所述视觉特征,确定所述视频中所述语音片段对应的说话人;
    或者,所述视频为待进行真伪性鉴定的视频;所述根据所述语音特征与所述视觉特征,确定所述语音片段与所述图像片段是否具有同步性,包括:根据所述语音特征与所述视觉特征,确定所述视频中所述语音片段是否属于所述图像片段中的人物;
    或者，所述视频为待调制的视频；所述根据所述语音特征与所述视觉特征，确定所述语音片段与所述图像片段是否具有同步性，包括：根据所述语音特征与所述视觉特征，将所述视频中的所述语音片段和所述图像片段的起始位对齐，使得所述语音片段与所述图像片段同步。
  14. 根据权利要求1至13中任一项所述的方法,其中,所述获取所述特定信号的语音特征包括:通过预先训练的语音与图像同步性衡量模型获取所述特定信号的语音特征,所述获取所述轮廓图的视觉特征包括:通过所述预先训练的语音与图像同步性衡量模型获取所述轮廓图的视觉特征,所述语音与图像同步性衡量模型的训练方法包括:
    将第一图像片段处理为第一图像数据、第一语音片段处理为第一语音数据、第二语音片段处理为第二语音数据,其中:所述第一图像片段、所述第一语音片段和所述第二语音片段来自于第一训练视频,所述第一图像片段与所述第一语音片段具有同步性,所述第一图像片段与所述第二语音片段不具有同步性;
    将随机图像片段处理为第二图像数据、随机语音片段处理为第三语音数据,其中:所述随机图像片段和所述随机语音片段来自于第二训练视频;
    将所述第一图像数据和所述第一语音数据组成正样本;
    将所述第一图像数据和所述第二语音数据组成第一负样本;
    将所述第一图像数据和所述第三语音数据组成第二负样本;
    将所述第一语音数据或所述第二语音数据,和所述第二图像数据组成第三负样本;
    采用所述正样本、所述第一负样本、所述第二负样本和所述第三负样本训练所述语音与图像同步性衡量模型。
  15. 根据权利要求14所述的方法,其特征在于,语音数据的语音帧数与图像数据的图像帧数相关,所述语音数据包括所述第一语音数据、所述第二语音数据或所述第三语音数据,所述图像数据包括所述第一图像数据或所述第二图像数据。
  16. 根据权利要求14或15所述的方法,其特征在于,所述第二语音片段与所述第一图像片段错位的时长大于或等于所述第二语音片段的总时长的2倍。
  17. 根据权利要求14至16中任一项所述的方法,其特征在于,所述第一图像片段和所述随机图像片段均为一个或多个连续时间点的图像。
  18. 根据权利要求14至17中任一项所述的方法,其特征在于,训练视频为单人说话的人像视频,所述训练视频中背景声的干扰程度小于特定程度;其中:训练视频包括所述第一训练视频和所述第二训练视频。
  19. 根据权利要求14至18中任一项所述的方法,其特征在于,所述将所述第一图像数据和所述第二语音数据组成第一负样本;将所述第一图像数据和所述第三语音数据组成第二负样本;将所述第一语音数据或所述第二语音数据,和所述第二图像数据组成 第三负样本,包括:
    当判定出所述第一图像数据对应的语音数据与所述第二语音数据在语音类别后验概率PPG上存在不同,以及所述第一图像数据与所述第二语音数据对应的图像数据在下半脸运动上存在不同时,将所述第一图像数据和所述第二语音数据组成第一负样本;
    当判定出所述第一图像数据对应的语音数据与所述第三语音数据在语音类别后验概率上存在不同,以及所述第一图像数据与所述第三语音数据对应的图像数据在下半脸运动上存在不同时,将所述第一图像数据和所述第三语音数据组成第二负样本;
    当判定出所述第二图像数据对应的语音数据与所述第一语音数据或所述第二语音数据在语音类别后验概率上存在不同,以及所述第二图像数据与所述第一语音数据或所述第二语音数据对应的图像数据在下半脸运动上存在不同时,将所述第一语音数据或所述第二语音数据,和所述第二图像数据组成第三负样本。
  20. 根据权利要求14至19中任一项所述的方法,其特征在于,将第一图像片段处理为第一图像数据、第一语音片段处理为第一语音数据、第二语音片段处理为第二语音数据、将随机图像片段处理为第二图像数据、所述随机语音片段处理为第三语音数据,包括:
    根据所述第一图像片段生成目标人物的轮廓图,得到第一图像数据;
    根据所述随机图像片段生成目标人物的轮廓图,得到第二图像数据;
    所述轮廓图与所述目标人物的个人特征无关;
    将所述第一语音片段转换为特定信号,得到第一语音数据;
    将所述第二语音片段转换为特定信号,得到第二语音数据;
    将所述随机语音片段转换为特定信号,得到第三语音数据;
    所述特定信号与所述第一语音片段、所述第二语音片段以及所述随机语音片段中说话人的个人特征无关。
  21. 根据权利要求14至20中任一项所述的方法,其特征在于,所述采用所述正样本、所述第一负样本、所述第二负样本和所述第三负样本训练语音与图像同步性衡量模型,包括:
    所述训练语音与图像同步性衡量模型分为训练前期和训练后期两个阶段;其中,在所述训练前期期间,将所述正样本、所述第一负样本、所述第二负样本和所述第三负样本分批次输入语音与图像同步性衡量模型进行训练,调整所述语音与图像同步性衡量模型中的参数;
    在所述训练后期期间,将所述正样本分批次输入调整参数后的语音与图像同步性衡量模型进行训练,包括:
    获取每个批次内所述正样本中的难正样本;
    根据每个批次内所述正样本生成本批次内的负样本;
    获取所述每个批次内的负样本中的多个难负样本;
    将所述难正样本和所述多个难负样本输入调整参数后的语音与图像同步性衡量模型进行训练,调整所述语音与图像同步性衡量模型中的参数,直到所述语音与图像同步性衡量模型对应的损失函数输出的损失值收敛为止。
  22. 根据权利要求21所述的方法，其特征在于，所述根据每个批次内所述正样本生成本批次内的负样本，包括：
    将每个批次内N个正样本对应的N个语音特征a_i和N个视觉特征v_i两两组合，得到N×(N-1)个候选负样本；
    将所述候选负样本经过视觉规则判定和语音规则判定,得到合格负样本确定为本批次内负样本;
    其中,i∈N,N为正整数。
  23. 根据权利要求22所述的方法,其特征在于,获取所述每个批次内的负样本中的多个难负样本,包括:
    对语音特征a_i对应的负样本按照损失函数输出的损失值进行排序；
    根据所述损失值获取语音特征a_i对应的难负样本；和/或
    对视觉特征v_i对应的负样本按照损失函数输出的损失值进行排序；
    根据所述损失值获取视觉特征v_i对应的难负样本。
  24. 根据权利要求21至23中任一项所述的方法,其特征在于,所述获取每个批次内所述正样本中的难正样本,包括:
    将所有的正样本分为不同批次;
    对每个批次内的正样本按照损失函数输出的损失值进行排序;
    根据所述损失值获取当前批次内所述正样本中的难正样本。
  25. 一种语音与图像同步性的衡量装置,其特征在于,所述装置包括:
    接收模块,用于获取视频中的语音片段和图像片段,所述语音片段与所述图像片段在所述视频中具有对应关系;
    数据处理模块,用于执行以下操作中的任意一项:
    将所述语音片段转换为特定信号并获取所述特定信号的语音特征以及所述图像片段的视觉特征,所述特定信号与所述语音片段中说话人的个人特征无关;或,
    根据所述图像片段生成目标人物的轮廓图并获取所述轮廓图的视觉特征以及所述语音片段的语音特征,所述轮廓图与所述目标人物的个人特征无关;或,
    将所述语音片段转换为特定信号,根据所述图像片段生成目标人物的轮廓图,并获取所述特定信号的语音特征以及所述轮廓图的视觉特征;
    同步性衡量模块,根据所述语音特征与所述视觉特征,确定所述语音片段与所述图像片段是否具有同步性,所述同步性用于表征所述语音片段中的声音与所述图像片段中所述目标人物的运动相匹配。
  26. 一种电子设备,其中,包括:处理器、存储器、总线;
    其中,所述处理器、所述存储器通过所述总线完成相互间的通信;所述处理器用于调用所述存储器中的程序指令,以执行如权利要求1至24中任一项所述的方法。
  27. 一种计算机可读存储介质,其中,包括:存储的程序;其中,在所述程序运行时控制所述存储介质所在设备执行如权利要求1至24中任一项所述的方法。
  28. 一种计算机程序产品,包括计算机执行指令,当处理器执行所述计算机执行指令时,实现权利要求1-24中任一项所述的方法。
  29. 一种计算机程序,当处理器执行所述计算机程序时,实现权利要求1-24中任一项所述的方法。

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22866437.1A EP4344199A1 (en) 2021-09-09 2022-08-25 Speech and image synchronization measurement method and apparatus, and model training method and apparatus

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN202111057976.9 2021-09-09
CN202111057976.9A CN114466179A (zh) 2021-09-09 2021-09-09 语音与图像同步性的衡量方法及装置
CN202111056592.5A CN114466178A (zh) 2021-09-09 2021-09-09 语音与图像同步性的衡量方法及装置
CN202111058177.3A CN114494930B (zh) 2021-09-09 2021-09-09 语音与图像同步性衡量模型的训练方法及装置
CN202111058177.3 2021-09-09
CN202111056592.5 2021-09-09

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/395,253 Continuation US20240135956A1 (en) 2021-09-09 2023-12-22 Method and apparatus for measuring speech-image synchronicity, and method and apparatus for training model

Publications (1)

Publication Number Publication Date
WO2023035969A1 true WO2023035969A1 (zh) 2023-03-16

Family

ID=85506097

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/114952 WO2023035969A1 (zh) 2021-09-09 2022-08-25 语音与图像同步性的衡量方法、模型的训练方法及装置

Country Status (2)

Country Link
EP (1) EP4344199A1 (zh)
WO (1) WO2023035969A1 (zh)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101199208A (zh) * 2005-04-13 2008-06-11 皮克索尔仪器公司 使用嘴唇和牙齿特征来测量音频视频同步的方法、系统和程序产品
CN112562720A (zh) * 2020-11-30 2021-03-26 清华珠三角研究院 一种唇形同步的视频生成方法、装置、设备及存储介质
CN113111812A (zh) * 2021-04-20 2021-07-13 深圳追一科技有限公司 一种嘴部动作驱动模型训练方法及组件
CN114466179A (zh) * 2021-09-09 2022-05-10 马上消费金融股份有限公司 语音与图像同步性的衡量方法及装置
CN114466178A (zh) * 2021-09-09 2022-05-10 马上消费金融股份有限公司 语音与图像同步性的衡量方法及装置
CN114494930A (zh) * 2021-09-09 2022-05-13 马上消费金融股份有限公司 语音与图像同步性衡量模型的训练方法及装置

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHUNGJOON SONANDREW ZISSERMANCHAM: "Asian Conference on Computer Vision", 2016, SPRINGER, article "Out of time: automated lip sync in the wild"
THIESJUSTUS ET AL.: "Face2face: Real-time face capture and reenactment of rgb videos", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2016

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152447A (zh) * 2023-04-21 2023-05-23 科大讯飞股份有限公司 一种人脸建模方法、装置、电子设备及存储介质
CN116152447B (zh) * 2023-04-21 2023-09-26 科大讯飞股份有限公司 一种人脸建模方法、装置、电子设备及存储介质
CN117636209A (zh) * 2023-11-24 2024-03-01 广州市希视科电子产品有限公司 一种自动可视化智慧大数据会议管理方法及系统

Also Published As

Publication number Publication date
EP4344199A1 (en) 2024-03-27

Similar Documents

Publication Publication Date Title
Fernandez-Lopez et al. Survey on automatic lip-reading in the era of deep learning
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
Zhou et al. Modality attention for end-to-end audio-visual speech recognition
WO2023035969A1 (zh) 语音与图像同步性的衡量方法、模型的训练方法及装置
CN105976809B (zh) 基于语音和面部表情的双模态情感融合的识别方法及系统
Chen et al. Audio-visual integration in multimodal communication
Chen Audiovisual speech processing
Potamianos et al. Recent advances in the automatic recognition of audiovisual speech
Neti et al. Audio visual speech recognition
Mittal et al. Animating face using disentangled audio representations
CN110096966A (zh) 一种融合深度信息汉语多模态语料库的语音识别方法
KR20060090687A (ko) 시청각 콘텐츠 합성을 위한 시스템 및 방법
CN112037788B (zh) 一种语音纠正融合方法
CN112786052A (zh) 语音识别方法、电子设备和存储装置
Howell Confusion modelling for lip-reading
Fernandez-Lopez et al. Automatic viseme vocabulary construction to enhance continuous lip-reading
CN114494930B (zh) 语音与图像同步性衡量模型的训练方法及装置
Chiţu¹ et al. Automatic visual speech recognition
CN114466179A (zh) 语音与图像同步性的衡量方法及装置
CN114466178A (zh) 语音与图像同步性的衡量方法及装置
US20240135956A1 (en) Method and apparatus for measuring speech-image synchronicity, and method and apparatus for training model
Sahrawat et al. " Notic My Speech"--Blending Speech Patterns With Multimedia
Verma et al. Animating expressive faces across languages
Ibrahim A novel lip geometry approach for audio-visual speech recognition
Gurban Multimodal feature extraction and fusion for audio-visual speech recognition

Legal Events

Code | Title | Description
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22866437; Country of ref document: EP; Kind code of ref document: A1
WWE | Wipo information: entry into national phase | Ref document number: 2022866437; Country of ref document: EP
ENP | Entry into the national phase | Ref document number: 2022866437; Country of ref document: EP; Effective date: 20231221