WO2021196390A1 - Voiceprint data generation method and device, and computer device and storage medium - Google Patents

Voiceprint data generation method and device, and computer device and storage medium

Info

Publication number
WO2021196390A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
image
images
sequence
target
Prior art date
Application number
PCT/CN2020/093318
Other languages
French (fr)
Chinese (zh)
Inventor
王德勋
徐国强
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021196390A1 publication Critical patent/WO2021196390A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building

Definitions

  • This application relates to the technical field of artificial intelligence speech processing, and in particular to a method, device, computer device, and storage medium for generating voiceprint data.
  • Voiceprint recognition is the process of using machines to automatically extract the voiceprint information in the voice and identify the speaker's identity. It plays an important role in security, audit, and education scenarios.
  • The current mainstream voiceprint recognition method is voiceprint recognition based on deep learning: a neural network model (i.e., a voiceprint recognition model) is trained on a large number of voiceprint samples, so that the model automatically mines the speaker's voiceprint features and identifies the speaker's identity based on those features.
  • However, unlike face data, voice data (such as voiceprint data) is more private and harder to collect, and is subject to multiple variable factors such as accents, noise, and dialects. As a result, open-source voiceprint databases are seriously inadequate in both quality and quantity: enough voiceprint samples cannot be obtained, and a high-accuracy voiceprint recognition model cannot be trained. Collecting and labeling voiceprint data oneself also requires substantial money and labor. The lack of voiceprint data largely limits the development and promotion of voiceprint recognition technology.
  • The first aspect of the present application provides a method for generating voiceprint data. In the method, each face image subsequence includes multiple face images of the same user, and the audio segment corresponding to the target image subsequence of each target user is intercepted from the audio stream of the audio and video data to obtain the voiceprint data of each target user.
  • A second aspect of the present application provides a voiceprint data generating device, the device including:
  • an audio and video acquisition module, configured to acquire audio and video data;
  • a face detection module, configured to perform face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and the face frames of the multiple face images;
  • a sequence acquisition module, configured to acquire multiple face image subsequences from the original image sequence according to the multiple face images and the face frames, each face image subsequence containing multiple face images of the same user;
  • a mouth-opening detection module, configured to detect whether each face image in each face image subsequence has its mouth open;
  • a screening module, configured to screen out target face image subsequences according to the open-mouth detection result of each face image subsequence;
  • a feature extraction module, configured to extract face features from each target face image subsequence;
  • a clustering module, configured to cluster the target face image subsequences according to the face features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs;
  • an interception module, configured to intercept, from the audio stream of the audio and video data, the audio segment corresponding to the target image subsequence of each target user to obtain the voiceprint data of each target user.
  • A third aspect of the present application provides a computer device. The computer device includes a memory and a processor, and the processor is configured to execute computer-readable instructions stored in the memory to implement the steps of the method, in which each face image subsequence includes multiple face images of the same user, and the audio segment corresponding to the target image subsequence of each target user is intercepted from the audio stream of the audio and video data to obtain the voiceprint data of each target user.
  • A fourth aspect of the present application provides one or more readable storage media storing computer-readable instructions. When executed by one or more processors, the computer-readable instructions cause the one or more processors to execute the same steps.
  • This application is guided by the more mature development of facial image technology and makes full use of the correlation between voice and images in audio and video data to extract voiceprint data associated with the speaker from the audio stream of the audio and video data.
  • By using this application to process a large amount of audio and video data, a large amount of voiceprint data can be obtained to construct a large-scale voiceprint database.
  • This application can obtain voiceprint data with high efficiency and low cost.
  • The voiceprint data can be used to train a voiceprint recognition model, which solves the problem that voiceprint samples are difficult to obtain and contributes to the development and promotion of voiceprint recognition technology.
  • Fig. 1 is a flowchart of a method for generating voiceprint data provided by an embodiment of the present application.
  • Fig. 2 is a structural diagram of a voiceprint data generating device provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • The voiceprint data generation method of this application is applied to one or more computer devices.
  • A computer device is a device that can automatically perform numerical calculation and/or information processing in accordance with preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • FIG. 1 is a flowchart of a method for generating voiceprint data according to Embodiment 1 of the present application.
  • the voiceprint data generation method is applied to a computer device.
  • the voiceprint data generation method extracts the voiceprint data associated with the speaker from the audio and video data.
  • the voiceprint data can be used as a voiceprint sample to train a voiceprint recognition model.
  • the method for generating voiceprint data includes:
  • Audio and video data refers to multimedia data that contains both voice and image.
  • The content of the audio and video data includes, but is not limited to, variety shows, interviews, TV dramas, and the like.
  • the acquired audio and video data includes the speaker's voice and images.
  • the audio and video data can be obtained from a preset multimedia database.
  • a camera device in the computer device or connected to the computer device can be controlled to collect the audio and video data in real time.
  • The original image sequence and the audio stream can be separated from the audio and video data using audio and video editing software such as MediaCoder or ffmpeg.
  • the original image sequence includes a plurality of original images.
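  • As an illustration only, the following sketch shows how such a separation might be scripted around the ffmpeg command-line tool mentioned above; the file names, frame rate, and audio format are assumptions, not part of this application.

```python
import subprocess

def demux(video_path, frames_dir, audio_path, fps=25):
    """Split audio and video data into an original image sequence and an audio stream."""
    # Extract the original image sequence frame by frame as numbered PNG files.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", f"{frames_dir}/frame_%06d.png"],
        check=True,
    )
    # Extract the audio stream as 16 kHz mono 16-bit PCM WAV.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vn", "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000", audio_path],
        check=True,
    )

demux("interview.mp4", "frames", "audio.wav")  # hypothetical example paths
```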
  • said performing face detection on the original image sequence in the audio and video data frame by frame includes:
  • the MTCNN (Multi-task Cascaded Convolutional Networks) model can be used to perform face detection on the original image sequence in the audio and video data frame by frame.
  • MTCNN is composed of three parts: P-Net (proposal network), R-Net (refine network), and O-Net (output network).
  • the three parts are three independent network structures.
  • Each part is a multi-task network, and the tasks to be processed include: face/non-face judgment, face frame regression, and feature point positioning.
  • Using the MTCNN model to perform face detection on the original image sequence in the audio and video data frame by frame includes:
  • Bounding box regression can be used to correct candidate windows, and non-maximum suppression (NMS) can be used to merge overlapping candidate boxes.
  • Other neural network models, such as the faster R-CNN (faster region-based convolutional neural network) model or the cascade CNN (cascaded convolutional neural network) model, may also be used to perform face detection on the original image sequence in the audio and video data frame by frame.
  • A face image refers to an image containing a human face.
  • If a face frame that meets the requirements is detected from an original image, the original image is determined to be a face image; if no face frame that meets the requirements is detected from the original image (either no face frame is detected, or the detected face frame does not meet the requirements), it is determined that the original image is not a face image.
  • Alternatively, if a face frame is detected from the original image, the original image is determined to be a face image; if no face frame is detected from the original image, it is determined that the original image is not a face image.
  • When multiple face frames are detected from an original image, the face frame with the largest area is selected as the face frame of the original image, so that each face image corresponds to exactly one face frame.
  • For each face frame detected from an original image, it can be determined whether the size of the face frame is less than or equal to a preset threshold; if so, the face frame is determined to be an invalid face frame. For example, it can be judged whether the width and height of the face frame are less than or equal to 50 pixels; if the width or height is less than or equal to 50 pixels, the face frame is determined to be invalid.
  • Accordingly, if a valid face frame is detected from the original image, the original image is determined to be a face image; if no face frame is detected, or the size of every detected face frame is less than or equal to the preset threshold, it is determined that the original image is not a face image.
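  • One possible realization of this detection step is sketched below using the open-source facenet-pytorch implementation of MTCNN (an assumption; the application does not prescribe a particular implementation). It combines detection, the 50-pixel validity check, and the largest-area selection described above.

```python
from facenet_pytorch import MTCNN
from PIL import Image

mtcnn = MTCNN(keep_all=True)  # keep_all=True returns every detected face frame

def detect_face_frame(image_path, min_size=50):
    """Return the face frame of an original image, or None if it is not a face image."""
    boxes, _ = mtcnn.detect(Image.open(image_path))  # each box is (x1, y1, x2, y2)
    if boxes is None:
        return None  # no face frame detected: not a face image
    # Discard invalid face frames whose width or height is <= min_size pixels.
    valid = [b for b in boxes if (b[2] - b[0]) > min_size and (b[3] - b[1]) > min_size]
    if not valid:
        return None  # only invalid face frames: not a face image
    # Select the face frame with the largest area, so one face image has one face frame.
    return max(valid, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
```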
  • each face image subsequence includes multiple face images of the same user.
  • acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frame includes:
  • If the two adjacent original images are both face images and the face frames of the two adjacent original images meet a preset condition, it is determined that the two adjacent original images correspond to the same user and belong to the same face image subsequence;
  • otherwise, the two adjacent original images do not belong to the same face image subsequence.
  • For example, the first original image and the second original image in the original image sequence are selected as two adjacent original images; if the first original image and the second original image are both face images and their face frames meet the preset condition, it is determined that the first original image and the second original image correspond to the same user and belong to the first face image subsequence. The second original image and the third original image are then selected as two adjacent original images; if the second original image and the third original image are both face images and their face frames meet the preset condition, it is determined that the second original image and the third original image correspond to the same user, and the third original image also belongs to the first face image subsequence; and so on.
  • a face image can be used as a starting point, and the current face image and the next face image can be selected one by one to obtain two face images;
  • If the two face images are adjacent frames in the original image sequence and the face frames of the two face images meet a preset condition, it is determined that the two face images correspond to the same user and the two face images belong to the same face image subsequence;
  • if the two face images are not adjacent frames in the original image sequence, or the face frames of the two face images do not meet the preset condition, it is determined that the two face images do not correspond to the same user and do not belong to the same face image subsequence.
  • determining whether the face frames of the two adjacent original images meet a preset condition includes:
  • If the overlapping-area ratio of the face frames of the two adjacent face images is greater than or equal to a preset ratio, it is determined that the two adjacent face images meet the preset condition.
  • Alternatively, the position of each face frame can be obtained, and the distance between the face frames of the two adjacent face images can be calculated according to the positions of the face frames.
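  • A minimal sketch of this grouping step is given below; the intersection-over-union form of the overlapping-area ratio and the 0.5 threshold are illustrative assumptions.

```python
def overlap_ratio(a, b):
    """Intersection-over-union of two face frames given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def build_subsequences(face_frames, min_ratio=0.5):
    """Group face images into per-user face image subsequences.

    face_frames: list of (frame_index, face_frame) pairs for the face images,
    ordered as in the original image sequence. min_ratio is an assumed value
    for the preset overlapping-area ratio.
    """
    subsequences, current = [], []
    for idx, box in face_frames:
        # Same user only if the frames are adjacent and the face frames overlap enough.
        if current and idx == current[-1][0] + 1 and overlap_ratio(current[-1][1], box) >= min_ratio:
            current.append((idx, box))
        else:
            if current:
                subsequences.append(current)
            current = [(idx, box)]  # start a new face image subsequence
    if current:
        subsequences.append(current)
    return subsequences
```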
  • the detecting whether each face image in each face image subsequence has a mouth open includes:
  • the Adaboost algorithm can be used to detect whether each face image in each face image subsequence has its mouth open.
  • Adaboost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) for the same training set, and then group these weak classifiers to form a stronger final classifier (strong classifier).
  • An Adaboost algorithm based on Haar features can be used to train the classifier to distinguish between the normal state of the mouth and the open state.
  • The application of the Adaboost algorithm to feature detection can refer to the prior art and is not repeated here.
  • Alternatively, the MobileNetV2 model can be used to detect whether each face image in each face image subsequence has its mouth open.
  • The screening out of the target face image subsequences according to the open-mouth detection result of each face image subsequence includes the following:
  • if the open-mouth detection result of a face image subsequence contains at least a preset number (for example, 3) of face images detected as open-mouthed, the face image subsequence is a target face image subsequence;
  • otherwise, the face image subsequence is not a target face image subsequence.
  • Median filtering can be used to smooth the open-mouth detection results of each face image subsequence.
  • The sliding window size of the median filter is set to 3; that is, a median value is computed over every 3 consecutive values of the open-mouth detection results of the face image subsequence.
  • The median filter smooths the open-mouth detection results and makes it easier to correctly filter out the target face image subsequences.
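  • The following sketch illustrates the smoothing and screening described above; the screening rule (at least a preset number of open-mouth face images, here 3) is an assumed reading of the preset condition.

```python
import numpy as np
from scipy.signal import medfilt

def is_target_subsequence(mouth_open, min_open_frames=3):
    """Screen one face image subsequence by its open-mouth detection results.

    mouth_open: per-image results, 1 = mouth open, 0 = mouth not open;
    min_open_frames: assumed value for the preset number (for example, 3).
    """
    # Median filter with a sliding window of 3 to suppress isolated
    # misdetections in the open-mouth detection results.
    smoothed = medfilt(np.asarray(mouth_open, dtype=float), kernel_size=3)
    # Keep the subsequence only if enough face images show an open mouth.
    return int(smoothed.sum()) >= min_open_frames

is_target_subsequence([0, 1, 1, 0, 1, 1, 1])  # True under these assumptions
```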
  • The extraction of face features from each target face image subsequence includes the following.
  • A point distribution model can be used to extract face features from each target face image subsequence.
  • The point distribution model is a linear contour model whose realization form is principal component analysis: the face contour (i.e., the feature point coordinate sequence) is described as the sum of the mean of the training samples and a weighted linear combination of the basis vectors of the principal components.
  • Other feature extraction models or algorithms can also be used to extract face features from each target face image subsequence.
  • For example, the SIFT algorithm can be used to extract face features from each target face image subsequence.
  • Face features can be extracted from each face image in each target face image subsequence, and the face features of the target face image subsequence can then be determined from the face features of all the face images in it.
  • For example, the average of the face features of all the face images in the target face image subsequence can be calculated and used as the face features of the target face image subsequence.
  • Alternatively, one or more face images may be selected from each target face image subsequence, face features may be extracted from the selected face images, and the face features of the target face image subsequence may be determined from the face features of the selected images.
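  • As a small illustration of the averaging option above, per-image features (from whatever extractor is chosen) can be aggregated as follows; the helper name is hypothetical.

```python
import numpy as np

def sequence_feature(per_frame_features):
    """Average per-image face features into one feature vector for a
    target face image subsequence, as described above."""
    feats = np.stack(per_frame_features)  # shape: (num_face_images, feature_dim)
    return feats.mean(axis=0)             # mean feature of the subsequence
```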
  • The GMM (Gaussian mixture model), DBSCAN, or K-Means algorithm may be used to cluster the target face image subsequences.
  • Clustering the target face image subsequences includes computing the distance from the face features of each target face image subsequence to each cluster center; each cluster center finally obtained corresponds to one target user.
  • For example, the cosine similarity between the face features of each target face image subsequence and each cluster center can be calculated and used as the distance from the face features of the subsequence to the cluster center.
  • Alternatively, the Euclidean distance, Manhattan distance, Mahalanobis distance, or the like from the face features of each target face image subsequence to each cluster center can be calculated.
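  • A sketch of the clustering step with K-Means is shown below. L2-normalizing the features so that Euclidean distance tracks cosine similarity, and assuming the number of target users is known, are both illustrative choices (DBSCAN or a GMM would avoid fixing the cluster count in advance).

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_subsequences(features, num_users):
    """Assign each target face image subsequence to a target user.

    features: array of shape (num_subsequences, feature_dim);
    num_users: assumed known here for K-Means.
    """
    # L2-normalize rows so that Euclidean distance between them grows as
    # cosine similarity falls, approximating the cosine-based distance above.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    labels = KMeans(n_clusters=num_users, n_init=10).fit_predict(normed)
    return labels  # labels[i] is the target user of subsequence i
```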
  • For example, suppose the target image subsequences of user U1 are S1, S2, and S3, those of user U2 are S4, S5, S6, and S7, and those of user U3 are S8, S9, and S10. The audio segments A1, A2, and A3 corresponding to the target image subsequences S1, S2, and S3 of user U1 are intercepted from the audio stream of the audio and video data; the audio segments A4, A5, A6, and A7 corresponding to the target image subsequences S4, S5, S6, and S7 of user U2 are intercepted from the audio stream; and the audio segments A8, A9, and A10 corresponding to the target image subsequences S8, S9, and S10 of user U3 are intercepted from the audio stream.
  • the audio segment corresponding to the target image subsequence of each target user may be intercepted from the audio stream of the audio and video data according to the start time and the end time corresponding to the target image subsequence of each target user.
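  • As one possible sketch of this interception step, the pydub library can slice the audio stream by the start and end times derived from frame indices; the frame rate, data layout, and paths are assumptions.

```python
from pydub import AudioSegment

def intercept_segments(audio_path, subsequences, fps=25):
    """Cut out the audio segment matching each target image subsequence.

    subsequences maps a target user to a list of (first_frame, last_frame)
    index pairs; fps is the frame rate used when the original image sequence
    was extracted, so frame indices convert to start and end times.
    """
    audio = AudioSegment.from_file(audio_path)
    voiceprints = {}
    for user, spans in subsequences.items():
        segments = []
        for first, last in spans:
            start_ms = int(first / fps * 1000)       # start time of the subsequence
            end_ms = int((last + 1) / fps * 1000)    # end time of the subsequence
            segments.append(audio[start_ms:end_ms])  # pydub slices in milliseconds
        voiceprints[user] = segments                 # voiceprint data of this user
    return voiceprints
```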
  • The voiceprint data generation method is guided by the more mature development of facial image technology and makes full use of the correlation between the voice and the images in audio and video data to extract the voiceprint data associated with the speaker from the audio stream of the audio and video data.
  • The voiceprint data generation method can obtain voiceprint data with high efficiency and low cost.
  • The voiceprint data can be used to train a voiceprint recognition model, which solves the problem that voiceprint samples are difficult to obtain and contributes to the development and promotion of voiceprint recognition technology.
  • Fig. 2 is a structural diagram of a voiceprint data generating device provided in the second embodiment of the present application.
  • the voiceprint data generating device 20 is applied to a computer device.
  • the voiceprint data generating device 20 extracts voiceprint data associated with the speaker from the audio and video data.
  • the voiceprint data can be used as a voiceprint sample to train a voiceprint recognition model.
  • The voiceprint data generating device 20 may include an audio and video acquisition module 201, a face detection module 202, a sequence acquisition module 203, a mouth-opening detection module 204, a screening module 205, a feature extraction module 206, a clustering module 207, and an interception module 208.
  • the audio and video acquisition module 201 is used to acquire audio and video data.
  • Audio and video data refers to multimedia data that contains both voice and image.
  • The content of the audio and video data includes, but is not limited to, variety shows, interviews, TV dramas, and the like.
  • the acquired audio and video data includes the speaker's voice and images.
  • the audio and video data can be obtained from a preset multimedia database.
  • a camera device in the computer device or connected to the computer device can be controlled to collect the audio and video data in real time.
  • the face detection module 202 is configured to perform face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and face frames of the multiple face images.
  • The original image sequence and the audio stream can be separated from the audio and video data using audio and video editing software such as MediaCoder or ffmpeg.
  • the original image sequence includes a plurality of original images.
  • said performing face detection on the original image sequence in the audio and video data frame by frame includes:
  • the MTCNN (Multi-task Cascaded Convolutional Networks) model can be used to perform face detection on the original image sequence in the audio and video data frame by frame.
  • MTCNN is composed of three parts: P-Net (proposal network), R-Net (refine network), and O-Net (output network).
  • the three parts are three independent network structures.
  • Each part is a multi-task network, and the tasks to be processed include: face/non-face judgment, face frame regression, and feature point positioning.
  • Using the MTCNN model to perform face detection on the original image sequence in the audio and video data frame by frame includes:
  • Bounding box regression can be used to correct candidate windows, and non-maximum suppression (NMS) can be used to merge overlapping candidate boxes.
  • Other neural network models, such as the faster R-CNN (faster region-based convolutional neural network) model or the cascade CNN (cascaded convolutional neural network) model, may also be used to perform face detection on the original image sequence in the audio and video data frame by frame.
  • A face image refers to an image containing a human face.
  • If a face frame that meets the requirements is detected from an original image, the original image is determined to be a face image; if no face frame that meets the requirements is detected from the original image (either no face frame is detected, or the detected face frame does not meet the requirements), it is determined that the original image is not a face image.
  • Alternatively, if a face frame is detected from the original image, the original image is determined to be a face image; if no face frame is detected from the original image, it is determined that the original image is not a face image.
  • When multiple face frames are detected from an original image, the face frame with the largest area is selected as the face frame of the original image, so that each face image corresponds to exactly one face frame.
  • For each face frame detected from an original image, it can be determined whether the size of the face frame is less than or equal to a preset threshold; if so, the face frame is determined to be an invalid face frame. For example, it can be judged whether the width and height of the face frame are less than or equal to 50 pixels; if the width or height is less than or equal to 50 pixels, the face frame is determined to be invalid.
  • Accordingly, if a valid face frame is detected from the original image, the original image is determined to be a face image; if no face frame is detected, or the size of every detected face frame is less than or equal to the preset threshold, it is determined that the original image is not a face image.
  • the sequence acquisition module 203 is configured to acquire multiple face image sub-sequences from the original image sequence according to the multiple face images and the face frame, and each face image sub-sequence includes multiple face images of the same user.
  • acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frame includes:
  • If the two adjacent original images are both face images and the face frames of the two adjacent original images meet a preset condition, it is determined that the two adjacent original images correspond to the same user and belong to the same face image subsequence;
  • otherwise, the two adjacent original images do not belong to the same face image subsequence.
  • For example, the first original image and the second original image in the original image sequence are selected as two adjacent original images; if the first original image and the second original image are both face images and their face frames meet the preset condition, it is determined that the first original image and the second original image correspond to the same user and belong to the first face image subsequence. The second original image and the third original image are then selected as two adjacent original images; if the second original image and the third original image are both face images and their face frames meet the preset condition, it is determined that the second original image and the third original image correspond to the same user, and the third original image also belongs to the first face image subsequence; and so on.
  • a face image can be used as a starting point, and the current face image and the next face image can be selected one by one to obtain two face images;
  • If the two face images are adjacent frames in the original image sequence and the face frames of the two face images meet a preset condition, it is determined that the two face images correspond to the same user and the two face images belong to the same face image subsequence;
  • if the two face images are not adjacent frames in the original image sequence, or the face frames of the two face images do not meet the preset condition, it is determined that the two face images do not correspond to the same user and do not belong to the same face image subsequence.
  • determining whether the face frames of the two adjacent original images meet a preset condition includes:
  • If the overlapping-area ratio of the face frames of the two adjacent face images is greater than or equal to a preset ratio, it is determined that the two adjacent face images meet the preset condition.
  • Alternatively, the position of each face frame can be obtained, and the distance between the face frames of the two adjacent face images can be calculated according to the positions of the face frames.
  • the mouth opening detection module 204 is used to detect whether each face image in each face image sub-sequence has a mouth open.
  • the detecting whether each face image in each face image subsequence has a mouth open includes:
  • the Adaboost algorithm can be used to detect whether each face image in each face image subsequence has its mouth open.
  • Adaboost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) for the same training set, and then group these weak classifiers to form a stronger final classifier (strong classifier).
  • An Adaboost algorithm based on Haar features can be used to train the classifier to distinguish between the normal state of the mouth and the open state.
  • The application of the Adaboost algorithm to feature detection can refer to the prior art and is not repeated here.
  • Alternatively, the MobileNetV2 model can be used to detect whether each face image in each face image subsequence has its mouth open.
  • the screening module 205 is configured to screen out the target face image subsequence according to the open mouth detection result of each face image subsequence.
  • The screening out of the target face image subsequences according to the open-mouth detection result of each face image subsequence includes the following:
  • if the open-mouth detection result of a face image subsequence contains at least a preset number (for example, 3) of face images detected as open-mouthed, the face image subsequence is a target face image subsequence;
  • otherwise, the face image subsequence is not a target face image subsequence.
  • Median filtering can be used to smooth the open-mouth detection results of each face image subsequence.
  • The sliding window size of the median filter is set to 3; that is, a median value is computed over every 3 consecutive values of the open-mouth detection results of the face image subsequence.
  • The median filter smooths the open-mouth detection results and makes it easier to correctly filter out the target face image subsequences.
  • the feature extraction module 206 is configured to extract face features from each target face image sub-sequence.
  • The extraction of face features from each target face image subsequence includes the following.
  • A point distribution model can be used to extract face features from each target face image subsequence.
  • The point distribution model is a linear contour model whose realization form is principal component analysis: the face contour (i.e., the feature point coordinate sequence) is described as the sum of the mean of the training samples and a weighted linear combination of the basis vectors of the principal components.
  • Other feature extraction models or algorithms can also be used to extract face features from each target face image subsequence.
  • For example, the SIFT algorithm can be used to extract face features from each target face image subsequence.
  • Face features can be extracted from each face image in each target face image subsequence, and the face features of the target face image subsequence can then be determined from the face features of all the face images in it.
  • For example, the average of the face features of all the face images in the target face image subsequence can be calculated and used as the face features of the target face image subsequence.
  • Alternatively, one or more face images may be selected from each target face image subsequence, face features may be extracted from the selected face images, and the face features of the target face image subsequence may be determined from the face features of the selected images.
  • the clustering module 207 is configured to cluster the target face image subsequence according to the facial features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs.
  • The GMM (Gaussian mixture model), DBSCAN, or K-Means algorithm may be used to cluster the target face image subsequences.
  • Clustering the target face image subsequences includes computing the distance from the face features of each target face image subsequence to each cluster center; each cluster center finally obtained corresponds to one target user.
  • For example, the cosine similarity between the face features of each target face image subsequence and each cluster center can be calculated and used as the distance from the face features of the subsequence to the cluster center.
  • Alternatively, the Euclidean distance, Manhattan distance, Mahalanobis distance, or the like from the face features of each target face image subsequence to each cluster center can be calculated.
  • the interception module 208 is used to intercept the audio segment corresponding to the target image subsequence of each target user from the audio stream of the audio and video data to obtain the voiceprint data of each target user.
  • For example, suppose the target image subsequences of user U1 are S1, S2, and S3, those of user U2 are S4, S5, S6, and S7, and those of user U3 are S8, S9, and S10. The audio segments A1, A2, and A3 corresponding to the target image subsequences S1, S2, and S3 of user U1 are intercepted from the audio stream of the audio and video data; the audio segments A4, A5, A6, and A7 corresponding to the target image subsequences S4, S5, S6, and S7 of user U2 are intercepted from the audio stream; and the audio segments A8, A9, and A10 corresponding to the target image subsequences S8, S9, and S10 of user U3 are intercepted from the audio stream.
  • the audio segment corresponding to the target image subsequence of each target user may be intercepted from the audio stream of the audio and video data according to the start time and the end time corresponding to the target image subsequence of each target user.
  • The voiceprint data generating device 20 is guided by the more mature development of facial image technology and makes full use of the correlation between the voice and the images in audio and video data to extract the voiceprint data associated with the speaker from the audio stream of the audio and video data.
  • By using the voiceprint data generating device 20 to process a large amount of audio and video data, a large amount of voiceprint data can be obtained to construct a large-scale voiceprint database.
  • The voiceprint data generating device 20 can obtain voiceprint data with high efficiency and low cost, and the voiceprint data can be used to train a voiceprint recognition model, which solves the problem that voiceprint samples are difficult to obtain and contributes to the development and promotion of voiceprint recognition technology.
  • This embodiment provides one or more readable storage media storing computer-readable instructions.
  • The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media.
  • When executed by one or more processors, the computer-readable instructions implement the steps in the foregoing embodiment of the method for generating voiceprint data, such as steps 101-108 shown in FIG. 1.
  • When executed by a processor, the computer-readable instructions also realize the functions of the modules in the foregoing device embodiment, such as modules 201-208 in FIG. 2. To avoid repetition, details are not repeated here.
  • FIG. 3 is a schematic diagram of a computer device provided in Embodiment 4 of this application.
  • The computer device 30 includes a memory 301, a processor 302, and computer-readable instructions 303, such as a voiceprint data generating program, stored in the memory 301 and executable on the processor 302.
  • When executing the computer-readable instructions 303, the processor 302 implements the steps in the embodiment of the voiceprint data generation method, for example, steps 101-108 shown in FIG. 1.
  • When executed by the processor, the computer-readable instructions also realize the functions of the modules in the foregoing device embodiment, such as modules 201-208 in FIG. 2.
  • The computer-readable instructions 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete the method.
  • The one or more modules may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments describe the execution process of the computer-readable instructions 303 in the computer device 30.
  • For example, the computer-readable instructions 303 can be divided into the audio and video acquisition module 201, the face detection module 202, the sequence acquisition module 203, the mouth-opening detection module 204, the screening module 205, the feature extraction module 206, the clustering module 207, and the interception module 208 shown in FIG. 2; for the specific functions of each module, refer to the second embodiment.
  • The computer device 30 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server.
  • The schematic diagram in FIG. 3 is only an example of the computer device 30 and does not constitute a limitation on the computer device 30; the computer device 30 may include more or fewer components than those shown in the figure, combine certain components, or have a different arrangement of components.
  • the computer device 30 may also include input and output devices, network access devices, buses, and so on.
  • The processor 302 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • The general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor.
  • The processor 302 is the control center of the computer device 30 and uses various interfaces and lines to connect the various parts of the entire computer device 30.
  • The memory 301 can be used to store the computer-readable instructions 303.
  • The processor 302 runs or executes the computer-readable instructions or modules stored in the memory 301 and calls the data stored in the memory 301 to realize the various functions of the computer device 30.
  • The memory 301 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created according to the use of the computer device 30.
  • The memory 301 may include non-volatile memory and/or volatile memory. The non-volatile memory may include, for example, a hard disk, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. The volatile memory may include random access memory (RAM) or an external cache memory.
  • If the integrated module of the computer device 30 is implemented in the form of a software function module and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, all or part of the processes in the above-mentioned method embodiments of this application can also be completed by instructing relevant hardware through computer-readable instructions.
  • The computer-readable instructions can be stored in a storage medium, and when executed by a processor, they can implement the steps of the foregoing method embodiments.
  • The computer-readable instructions include computer-readable instruction code, and the code may be in the form of source code, object code, an executable file, or some intermediate form.
  • The computer-readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM). It should be noted that the content contained in the computer-readable medium can be appropriately added or deleted according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, or in the form of hardware plus software functional modules.
  • the above-mentioned integrated modules implemented in the form of software functional modules may be stored in a storage medium.
  • The above-mentioned software function module is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, a network device, or the like) or a processor execute part of the steps of the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the technical field of artificial intelligence, and provides a voiceprint data generation method and device, and a computer device and a storage medium. The method comprises: performing face detection on an original image sequence in audio and video data frame by frame to obtain a plurality of face images and the face boxes thereof; obtaining a plurality of face image sub-sequences from the original image sequence according to the plurality of face images and the face boxes thereof; detecting whether the mouth in each face image in each face image sub-sequence is open; screening out target face image sub-sequences according to the mouth-opening detection result of each face image sub-sequence; extracting a face feature from each target face image sub-sequence; clustering the target face image sub-sequences to obtain the target user to which each target face image sub-sequence belongs; and capturing, from an audio stream of the audio and video data, the audio segment corresponding to the target image sub-sequence of each target user to obtain the voiceprint data of each target user. According to the present application, voiceprint data can be obtained with high efficiency and low cost.

Description

Voiceprint data generation method and device, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 31, 2020, with application number 202010244174.8 and entitled "Voiceprint data generation method and device, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the technical field of artificial intelligence speech processing, and in particular to a voiceprint data generation method and device, a computer device, and a storage medium.
Background
Human speech contains rich information, and one important kind is the voiceprint information that characterizes the speaker's identity. Because different people have different vocal cavities and ways of speaking, no two people's voiceprint information is the same. Voiceprint recognition is the process of using machines to automatically extract voiceprint information from speech and identify the speaker's identity; it plays an important role in security, auditing, and education scenarios.
The current mainstream voiceprint recognition method is voiceprint recognition based on deep learning: a neural network model (i.e., a voiceprint recognition model) is trained on a large number of voiceprint samples, so that the model automatically mines the speaker's voiceprint features and identifies the speaker's identity based on those features. However, the inventor realized that, unlike face data, voice data (such as voiceprint data) is more private and harder to collect, and is subject to multiple variable factors such as accents, noise, and dialects. As a result, open-source voiceprint databases are seriously inadequate in both quality and quantity: enough voiceprint samples cannot be obtained, and a high-accuracy voiceprint recognition model cannot be trained. Collecting and labeling voiceprint data oneself also requires substantial money and labor. The lack of voiceprint data largely limits the development and promotion of voiceprint recognition technology.
Summary of the application
In view of the above, it is necessary to provide a voiceprint data generation method and device, a computer device, and a storage medium that can obtain voiceprint data with high efficiency and low cost.
The first aspect of the present application provides a voiceprint data generation method, the method including:
acquiring audio and video data;
performing face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and the face frames of the multiple face images;
acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frames, each face image subsequence including multiple face images of the same user;
detecting whether each face image in each face image subsequence has its mouth open;
screening out target face image subsequences according to the open-mouth detection result of each face image subsequence;
extracting face features from each target face image subsequence;
clustering the target face image subsequences according to the face features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs;
intercepting, from the audio stream of the audio and video data, the audio segment corresponding to the target image subsequence of each target user to obtain the voiceprint data of each target user.
A second aspect of the present application provides a voiceprint data generating device, the device including:
an audio and video acquisition module, configured to acquire audio and video data;
a face detection module, configured to perform face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and the face frames of the multiple face images;
a sequence acquisition module, configured to acquire multiple face image subsequences from the original image sequence according to the multiple face images and the face frames, each face image subsequence containing multiple face images of the same user;
a mouth-opening detection module, configured to detect whether each face image in each face image subsequence has its mouth open;
a screening module, configured to screen out target face image subsequences according to the open-mouth detection result of each face image subsequence;
a feature extraction module, configured to extract face features from each target face image subsequence;
a clustering module, configured to cluster the target face image subsequences according to the face features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs;
an interception module, configured to intercept, from the audio stream of the audio and video data, the audio segment corresponding to the target image subsequence of each target user to obtain the voiceprint data of each target user.
A third aspect of the present application provides a computer device. The computer device includes a memory and a processor, and the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
acquiring audio and video data;
performing face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and the face frames of the multiple face images;
acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frames, each face image subsequence including multiple face images of the same user;
detecting whether each face image in each face image subsequence has its mouth open;
screening out target face image subsequences according to the open-mouth detection result of each face image subsequence;
extracting face features from each target face image subsequence;
clustering the target face image subsequences according to the face features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs;
intercepting, from the audio stream of the audio and video data, the audio segment corresponding to the target image subsequence of each target user to obtain the voiceprint data of each target user.
A fourth aspect of the present application provides one or more readable storage media storing computer-readable instructions. When executed by one or more processors, the computer-readable instructions cause the one or more processors to execute the following steps:
acquiring audio and video data;
performing face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and the face frames of the multiple face images;
acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frames, each face image subsequence including multiple face images of the same user;
detecting whether each face image in each face image subsequence has its mouth open;
screening out target face image subsequences according to the open-mouth detection result of each face image subsequence;
extracting face features from each target face image subsequence;
clustering the target face image subsequences according to the face features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs;
intercepting, from the audio stream of the audio and video data, the audio segment corresponding to the target image subsequence of each target user to obtain the voiceprint data of each target user.
This application is guided by the more mature development of facial image technology and makes full use of the correlation between voice and images in audio and video data to extract voiceprint data associated with the speaker from the audio stream of the audio and video data. By using this application to process a large amount of audio and video data, a large amount of voiceprint data can be obtained to construct a large-scale voiceprint database. This application can obtain voiceprint data with high efficiency and low cost; the voiceprint data can be used to train a voiceprint recognition model, which solves the problem that voiceprint samples are difficult to obtain and contributes to the development and promotion of voiceprint recognition technology. The details of one or more embodiments of the present application are presented in the following drawings and description, and other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Description of the Drawings
Fig. 1 is a flowchart of the voiceprint data generation method provided by an embodiment of the present application.
Fig. 2 is a structural diagram of the voiceprint data generation device provided by an embodiment of the present application.
Fig. 3 is a schematic diagram of the computer device provided by an embodiment of the present application.
Detailed Description of the Embodiments
To make the above objectives, features, and advantages of the present application clearer, the present application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the present application. The described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used in the specification of the present application are only for the purpose of describing specific embodiments and are not intended to limit the application.
The present application relates to the technical field of artificial intelligence speech processing. Preferably, the voiceprint data generation method of the present application is applied in one or more computer devices. A computer device is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), and embedded devices.
The computer device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The computer device can interact with a user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
Embodiment 1
Fig. 1 is a flowchart of the voiceprint data generation method provided by Embodiment 1 of the present application. The voiceprint data generation method is applied to a computer device.
The voiceprint data generation method extracts voiceprint data associated with a speaker from audio-video data. The voiceprint data can be used as voiceprint samples to train a voiceprint recognition model.
As shown in Fig. 1, the voiceprint data generation method includes:
101: Acquire audio-video data.
Audio-video data refers to multimedia data that contains both speech and images. The content of the audio-video data includes, but is not limited to, variety shows, interviews, and TV dramas.
To extract voiceprint data associated with a speaker, the acquired audio-video data includes the speaker's speech and images.
The audio-video data can be obtained from a preset multimedia database. Alternatively, a camera device in the computer device, or one connected to the computer device, can be controlled to collect the audio-video data in real time.
102: Perform face detection frame by frame on the original image sequence in the audio-video data to obtain multiple face images and the face frames of the multiple face images.
The original image sequence and the audio stream can be separated from the audio-video data, for example with audio-video editing software such as MediaCoder or ffmpeg.
The original image sequence includes multiple original images.
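As one illustration, a minimal sketch of this separation step is given below, assuming ffmpeg and OpenCV are installed; the file names, the 16 kHz mono output format, and the helper name `separate_streams` are illustrative choices, not part of the application.

```python
import subprocess
import cv2

def separate_streams(av_path: str, audio_path: str = "audio.wav"):
    # Extract the audio stream with ffmpeg (-vn drops the video track;
    # 16 kHz mono is a common, but not mandated, choice for voice work).
    subprocess.run(
        ["ffmpeg", "-y", "-i", av_path, "-vn", "-ac", "1",
         "-ar", "16000", audio_path],
        check=True,
    )
    # Read the original image sequence frame by frame with OpenCV.
    frames = []
    cap = cv2.VideoCapture(av_path)
    fps = cap.get(cv2.CAP_PROP_FPS)  # needed later to map frames to times
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames, fps, audio_path
```

The frame rate is kept alongside the frames because step 108 needs to convert the frame indices of a subsequence into start and end times in the audio stream.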
Optionally, performing face detection frame by frame on the original image sequence in the audio-video data includes:
using an MTCNN (Multi-task Cascaded Convolutional Networks) model to perform face detection frame by frame on the original image sequence in the audio-video data.
MTCNN consists of three parts: P-Net (proposal network), R-Net (refine network), and O-Net (output network). The three parts are three mutually independent network structures. Each part is a multi-task network whose tasks include face/non-face judgment, face frame regression, and facial feature point localization.
Using the MTCNN model to perform face detection frame by frame on the original image sequence in the audio-video data includes:
(1) Using P-Net to generate candidate windows. Bounding box regression can be used to correct the candidate windows, and non-maximum suppression (NMS) can be used to merge overlapping candidate boxes.
(2) Using R-Net to refine the candidate windows. The candidate windows that pass P-Net are input into R-Net, which removes the non-face boxes among the candidates.
(3) Using O-Net to output the final face frames and the positions of the facial feature points.
For face detection with the MTCNN model, reference can be made to the prior art, which will not be repeated here.
In other embodiments, other neural network models can be used to perform face detection frame by frame on the original image sequence in the audio-video data, such as Faster R-CNN (faster region-based convolutional neural network) or Cascade CNN (cascaded convolutional neural network).
A face image refers to an image containing a human face.
In this embodiment, if a face frame that meets the requirements is detected in an original image, the original image is determined to be a face image; if no face frame that meets the requirements is detected in the original image (either no face frame is detected, or the detected face frames do not meet the requirements), the original image is determined not to be a face image.
In other embodiments, if a face frame is detected in an original image, the original image is determined to be a face image; if no face frame is detected in the original image, the original image is determined not to be a face image.
In this embodiment, if multiple face frames exist in one original image, the face frame with the largest area is selected as the face frame of that original image, so that each face image corresponds to exactly one face frame.
In this embodiment, it can be judged whether the size of a face frame detected in an original image is less than or equal to a preset threshold; if so, the face frame is determined to be invalid. For example, it can be judged whether the width and height of the detected face frame are less than or equal to 50 pixels; if the width or the height is less than or equal to 50 pixels, the face frame is determined to be invalid.
In a specific embodiment, if the size of a face frame detected in an original image is greater than the preset threshold, the original image is determined to be a face image; if no face frame is detected in the original image, or the sizes of all detected face frames are less than or equal to the preset threshold, the original image is determined not to be a face image.
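A minimal sketch of this detection step is given below, assuming the open-source `mtcnn` Python package; only the 50-pixel validity threshold and the largest-area rule come from the text, and the helper name `detect_face` is illustrative.

```python
from mtcnn import MTCNN

detector = MTCNN()
MIN_SIZE = 50  # face frames whose width or height is <= 50 px are invalid

def detect_face(frame_rgb):
    """Return the largest valid face frame (x, y, w, h), or None when
    the frame is not a face image."""
    results = detector.detect_faces(frame_rgb)  # expects an RGB array
    boxes = [r["box"] for r in results]
    # Discard invalid face frames below the preset size threshold.
    boxes = [b for b in boxes if b[2] > MIN_SIZE and b[3] > MIN_SIZE]
    if not boxes:
        return None
    # When several face frames exist, keep the one with the largest area.
    return max(boxes, key=lambda b: b[2] * b[3])
```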
103: Acquire multiple face image subsequences from the original image sequence according to the multiple face images and the face frames, each face image subsequence containing multiple face images of the same user.
Optionally, acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frames includes:
taking one original image in the original image sequence as a starting point, and selecting the current original image and the next original image one by one to obtain two adjacent original images;
judging whether the face frames of the two adjacent original images meet a preset condition;
if the two adjacent original images are face images, and their face frames meet the preset condition, determining that the two adjacent original images correspond to the same user and belong to the same face image subsequence;
otherwise, if at least one of the two adjacent original images is not a face image, or their face frames do not meet the preset condition, determining that the two adjacent original images do not correspond to the same user and do not belong to the same face image subsequence.
For example, taking the first original image in the original image sequence as the starting point, the first and second original images are selected as two adjacent original images; if both are face images and their face frames meet the preset condition, it is determined that the first and second original images correspond to the same user and belong to the first face image subsequence. The second and third original images are then selected as two adjacent original images; if both are face images and their face frames meet the preset condition, it is determined that they correspond to the same user, so the third original image also belongs to the first face image subsequence. This continues until, for instance, the eighth and ninth original images are selected as two adjacent original images; if the ninth original image is not a face image, or the face frames of the eighth and ninth original images do not meet the preset condition, it is determined that the eighth and ninth original images do not correspond to the same user, and the ninth original image does not belong to the first face image subsequence. The acquired first face image subsequence therefore includes the first through eighth original images. Taking the ninth original image as a new starting point, the next face image subsequence is acquired.
It can be understood that other methods may be used to acquire multiple face image subsequences from the original image sequence according to the multiple face images and the face frames. For example, one face image can be taken as a starting point, and the current face image and the next face image can be selected one by one to obtain two face images;
if the two face images are adjacent frames in the original image sequence, and their face frames meet the preset condition, it is determined that the two face images correspond to the same user and belong to the same face image subsequence;
otherwise, if the two face images are not adjacent frames in the original image sequence, or their face frames do not meet the preset condition, it is determined that the two face images do not correspond to the same user and do not belong to the same face image subsequence.
Optionally, judging whether the face frames of the two adjacent original images meet the preset condition includes:
judging whether the overlap ratio (intersection over union, IOU) of the face frames of the two adjacent original images is greater than or equal to a preset ratio;
if the IOU of the face frames of the two adjacent face images is greater than or equal to the preset ratio, determining that the two adjacent face images meet the preset condition.
Alternatively, it can be judged whether the distance between the face frames of the two adjacent face images is less than or equal to a preset distance; if so, it is determined that the two adjacent face images meet the preset condition.
When face detection is performed frame by frame on the original image sequence in the audio-video data, the position of each face frame is obtained, and the distance between the face frames of two adjacent face images can be calculated from those positions.
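A minimal sketch of the IOU condition and of growing subsequences over adjacent frames is given below; the 0.5 preset ratio and the requirement that a subsequence holds at least two images are illustrative readings of the text, and `iou`, `same_user`, and `group_subsequences` are hypothetical helper names.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) face frames."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(ax, bx)
    iy = max(ay, by)
    iw = max(0, min(ax + aw, bx + bw) - ix)
    ih = max(0, min(ay + ah, by + bh) - iy)
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def same_user(box_a, box_b, preset_ratio=0.5):
    """Two adjacent frames are joined when both are face images and the
    overlap ratio of their face frames reaches the preset ratio."""
    return (box_a is not None and box_b is not None
            and iou(box_a, box_b) >= preset_ratio)

def group_subsequences(boxes):
    """Split frame indices into face image subsequences; boxes[i] is the
    face frame of frame i, or None when frame i is not a face image."""
    runs, current = [], []
    for i, box in enumerate(boxes):
        if current and same_user(boxes[i - 1], box):
            current.append(i)
        else:
            if len(current) > 1:  # a subsequence holds multiple images
                runs.append(current)
            current = [i] if box is not None else []
    if len(current) > 1:
        runs.append(current)
    return runs
```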
104: Detect whether the mouth is open in each face image of each face image subsequence.
Optionally, detecting whether the mouth is open in each face image of each face image subsequence includes:
using the Adaboost algorithm to detect whether the mouth is open in each face image of each face image subsequence.
Adaboost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (a strong classifier).
A classifier can be trained with the Haar-feature-based Adaboost algorithm to distinguish the normal state of the mouth from the open state.
For feature detection with the Adaboost algorithm (such as open-mouth detection), reference can be made to the prior art, which will not be repeated here.
In other embodiments, other methods can be used to detect whether the mouth is open in each face image of each face image subsequence. For example, a MobileNetV2 model can be used to detect whether the mouth is open in each face image of each face image subsequence.
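A minimal sketch of the Haar/Adaboost variant is given below, using OpenCV's cascade classifier (OpenCV cascades are trained with Adaboost on Haar features, matching the text); `open_mouth_cascade.xml` stands for a cascade trained as described above and is not a file shipped with OpenCV.

```python
import cv2

mouth_cascade = cv2.CascadeClassifier("open_mouth_cascade.xml")

def is_mouth_open(face_image_bgr) -> bool:
    gray = cv2.cvtColor(face_image_bgr, cv2.COLOR_BGR2GRAY)
    # detectMultiScale returns the boxes of detected open mouths; an
    # empty result means the mouth is judged to be in its normal state.
    hits = mouth_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5)
    return len(hits) > 0
```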
105: Filter out target face image subsequences according to the open-mouth detection results of each face image subsequence.
Optionally, filtering out target face image subsequences according to the open-mouth detection results of each face image subsequence includes:
determining the proportion of closed-mouth face images in each face image subsequence;
if the proportion of closed-mouth face images in a face image subsequence is less than or equal to a preset ratio (for example, 0.3), the face image subsequence is a target face image subsequence;
otherwise, if the proportion of closed-mouth face images in the face image subsequence is greater than the preset ratio, the face image subsequence is not a target face image subsequence.
Alternatively, it can be judged whether the number of closed-mouth face images in each face image subsequence is less than or equal to a preset number (for example, 3). If the number of closed-mouth face images in a face image subsequence is less than or equal to the preset number, the face image subsequence is a target face image subsequence; otherwise, it is not a target face image subsequence.
Before the target face image subsequences are filtered out according to the open-mouth detection results, median filtering can be used to smooth the open-mouth detection results of each face image subsequence.
For example, the sliding window size of the median filter is set to 3; that is, a median is computed over every 3 values of the open-mouth detection results of a face image subsequence. Median filtering smooths the open-mouth detection results so that the target face image subsequences can be filtered out more reliably.
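A minimal sketch of this filtering step is given below, assuming the per-frame detection results are stored as a 0/1 sequence (1 = mouth open); the window size 3 and the 0.3 ratio follow the examples in the text, while the helper name is illustrative.

```python
import numpy as np
from scipy.signal import medfilt

def is_target_subsequence(open_flags, preset_ratio=0.3) -> bool:
    # Smooth the raw open-mouth results with a width-3 median filter.
    smoothed = medfilt(np.asarray(open_flags, dtype=float), kernel_size=3)
    closed_ratio = float(np.mean(smoothed == 0))
    # Keep the subsequence when closed-mouth frames are rare enough.
    return closed_ratio <= preset_ratio
```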
106: Extract facial features from each target face image subsequence.
Optionally, extracting facial features from each target face image subsequence includes:
using a point distribution model to extract facial features from each target face image subsequence.
The point distribution model is a linear contour model whose implementation is principal component analysis. In this model, the face contour (that is, the sequence of feature point coordinates) is described as the sum of the training sample mean and a weighted linear combination of the principal component basis vectors.
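In a common formulation of the point distribution model (stated here as an illustration, not as a formula recited in the application), a shape vector $x$, the concatenated feature point coordinates, is approximated as

$$x \approx \bar{x} + \Phi b,$$

where $\bar{x}$ is the mean shape over the training samples, the columns of $\Phi$ are the leading principal component basis vectors, and $b$ is the weight vector of the linear combination; $b$ can then serve as the facial feature vector of the image.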
In other embodiments, other feature extraction models or algorithms can be used to extract facial features from each target face image subsequence. For example, the SIFT algorithm can be used to extract facial features from each target face image subsequence.
Facial features can be extracted from every face image in a target face image subsequence, and the facial features of the subsequence are then determined from the facial features of all of its face images. For example, the average of the facial features of all face images in the target face image subsequence can be calculated and used as the facial features of the subsequence.
Alternatively, one or more face images (for example, the face image with the best image quality) can be selected from each target face image subsequence, facial features can be extracted from the selected image or images, and the facial features of the target face image subsequence can be determined from them.
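A minimal sketch of the averaging variant is given below; `extract` stands in for whichever per-image feature extractor (point distribution model, SIFT, or another) is used and is an assumption of this sketch.

```python
import numpy as np

def subsequence_feature(face_images, extract):
    """Average the per-image feature vectors into one feature vector
    that represents the whole target face image subsequence."""
    feats = np.stack([extract(img) for img in face_images])
    return feats.mean(axis=0)
```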
107: Cluster the target face image subsequences according to the facial features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs.
The target face image subsequences can be clustered with a GMM (Gaussian mixture model), DBSCAN, or the K-Means algorithm.
Specifically, clustering the target face image subsequences includes:
(1) selecting the facial features of a preset number of target face image subsequences as cluster centers;
(2) calculating the distance from the facial features of each target face image subsequence to each cluster center;
(3) assigning each target face image subsequence to a cluster according to the distance from its facial features to each cluster center;
(4) updating the cluster centers according to the assignment of the target face image subsequences;
and repeating (2)-(4) above until the cluster centers no longer change.
Each final cluster center corresponds to one target user.
The cosine similarity between the facial features of each target face image subsequence and each cluster center can be calculated and used as the distance from the facial features of each target face image subsequence to each cluster center.
Alternatively, the Euclidean distance, Manhattan distance, Mahalanobis distance, or the like from the facial features of each target face image subsequence to each cluster center can be calculated.
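A minimal sketch of the loop in (1)-(4) is given below, using cosine distance (one minus cosine similarity); the cluster count `k`, the random initialization, and the iteration cap are illustrative assumptions.

```python
import numpy as np

def cluster_subsequences(features, k, max_iter=100):
    """features: (n_subsequences, dim) array of facial features.
    Returns one cluster label (target user) per subsequence."""
    rng = np.random.default_rng(0)
    # (1) Pick a preset number of subsequence features as initial centers.
    centers = features[rng.choice(len(features), size=k, replace=False)]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(max_iter):
        # (2) Cosine distance from every feature to every center.
        f = features / np.linalg.norm(features, axis=1, keepdims=True)
        c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
        dist = 1.0 - f @ c.T
        # (3) Assign each subsequence to its nearest cluster center.
        labels = dist.argmin(axis=1)
        # (4) Update each center from the subsequences assigned to it.
        new_centers = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # the cluster centers no longer change
        centers = new_centers
    return labels  # each cluster corresponds to one target user
```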
108: Intercept, from the audio stream of the audio-video data, the audio segment corresponding to the target image subsequence of each target user to obtain the voiceprint data of each target user.
For example, the target image subsequences of user U1 are S1, S2, and S3; those of user U2 are S4, S5, S6, and S7; and those of user U3 are S8, S9, and S10. The audio segments A1, A2, and A3 corresponding to S1, S2, and S3 are intercepted from the audio stream of the audio-video data for user U1; the audio segments A4, A5, A6, and A7 corresponding to S4, S5, S6, and S7 are intercepted for user U2; and the audio segments A8, A9, and A10 corresponding to S8, S9, and S10 are intercepted for user U3.
The audio segment corresponding to the target image subsequence of each target user can be intercepted from the audio stream of the audio-video data according to the start time and end time corresponding to that target image subsequence.
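A minimal sketch of this interception step is given below, assuming the audio stream was exported to a file (as in the separation sketch) and that each target image subsequence is known by its first and last frame index; pydub and the frame-to-time mapping via the video frame rate are illustrative choices.

```python
from pydub import AudioSegment

def cut_voiceprint(audio_path, start_frame, end_frame, fps, out_path):
    """Cut the audio span covered by frames [start_frame, end_frame]."""
    audio = AudioSegment.from_file(audio_path)
    # Map frame indices to milliseconds through the video frame rate.
    start_ms = int(start_frame / fps * 1000)
    end_ms = int((end_frame + 1) / fps * 1000)
    audio[start_ms:end_ms].export(out_path, format="wav")
```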
Guided by the more mature development of facial image technology, the voiceprint data generation method makes full use of the correlation between speech and images in audio-video data to extract speaker-associated voiceprint data from the audio stream of the audio-video data. By processing a large amount of audio-video data with the voiceprint data generation method, a large amount of voiceprint data can be obtained to build a large-scale voiceprint database. The voiceprint data generation method obtains voiceprint data efficiently and at low cost; this voiceprint data can be used to train a voiceprint recognition model, which solves the problem that voiceprint samples are difficult to obtain and contributes to the development and promotion of voiceprint recognition technology.
Embodiment 2
Fig. 2 is a structural diagram of the voiceprint data generation device provided by Embodiment 2 of the present application. The voiceprint data generation device 20 is applied to a computer device. The voiceprint data generation device 20 extracts voiceprint data associated with a speaker from audio-video data. The voiceprint data can be used as voiceprint samples to train a voiceprint recognition model.
As shown in Fig. 2, the voiceprint data generation device 20 may include an audio-video acquisition module 201, a face detection module 202, a sequence acquisition module 203, an open-mouth detection module 204, a filtering module 205, a feature extraction module 206, a clustering module 207, and an interception module 208.
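Before the modules are described one by one, a minimal sketch of how the eight modules could be composed is given below; it reuses the illustrative helpers sketched in Embodiment 1 (`separate_streams`, `detect_face`, `group_subsequences`, `is_mouth_open`, `is_target_subsequence`, `subsequence_feature`, `cluster_subsequences`, `cut_voiceprint`), and the class itself is an assumption of this sketch, not a structure recited in the application.

```python
import cv2
import numpy as np

class VoiceprintDataGenerator:
    """Composes the eight modules into one pipeline; `extract` is any
    per-image feature extractor and `k` the preset number of users."""
    def __init__(self, extract, k):
        self.extract, self.k = extract, k

    def generate(self, av_path):
        frames, fps, audio_path = separate_streams(av_path)        # module 201
        boxes = [detect_face(cv2.cvtColor(f, cv2.COLOR_BGR2RGB))
                 for f in frames]                                  # module 202
        runs = group_subsequences(boxes)                           # module 203
        flags = [[is_mouth_open(frames[i]) for i in r]
                 for r in runs]                                    # module 204
        targets = [r for r, fl in zip(runs, flags)
                   if is_target_subsequence(fl)]                   # module 205
        feats = np.stack([subsequence_feature([frames[i] for i in r],
                                              self.extract)
                          for r in targets])                       # module 206
        labels = cluster_subsequences(feats, self.k)               # module 207
        for n, (r, user) in enumerate(zip(targets, labels)):       # module 208
            cut_voiceprint(audio_path, r[0], r[-1], fps,
                           f"user{user}_segment{n}.wav")
```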
The audio-video acquisition module 201 is used to acquire audio-video data.
Audio-video data refers to multimedia data that contains both speech and images. The content of the audio-video data includes, but is not limited to, variety shows, interviews, and TV dramas.
To extract voiceprint data associated with a speaker, the acquired audio-video data includes the speaker's speech and images.
The audio-video data can be obtained from a preset multimedia database. Alternatively, a camera device in the computer device, or one connected to the computer device, can be controlled to collect the audio-video data in real time.
The face detection module 202 is used to perform face detection frame by frame on the original image sequence in the audio-video data to obtain multiple face images and the face frames of the multiple face images.
The original image sequence and the audio stream can be separated from the audio-video data, for example with audio-video editing software such as MediaCoder or ffmpeg.
The original image sequence includes multiple original images.
Optionally, performing face detection frame by frame on the original image sequence in the audio-video data includes:
using an MTCNN (Multi-task Cascaded Convolutional Networks) model to perform face detection frame by frame on the original image sequence in the audio-video data.
MTCNN consists of three parts: P-Net (proposal network), R-Net (refine network), and O-Net (output network). The three parts are three mutually independent network structures. Each part is a multi-task network whose tasks include face/non-face judgment, face frame regression, and facial feature point localization.
Using the MTCNN model to perform face detection frame by frame on the original image sequence in the audio-video data includes:
(1) Using P-Net to generate candidate windows. Bounding box regression can be used to correct the candidate windows, and non-maximum suppression (NMS) can be used to merge overlapping candidate boxes.
(2) Using R-Net to refine the candidate windows. The candidate windows that pass P-Net are input into R-Net, which removes the non-face boxes among the candidates.
(3) Using O-Net to output the final face frames and the positions of the facial feature points.
For face detection with the MTCNN model, reference can be made to the prior art, which will not be repeated here.
In other embodiments, other neural network models can be used to perform face detection frame by frame on the original image sequence in the audio-video data, such as Faster R-CNN (faster region-based convolutional neural network) or Cascade CNN (cascaded convolutional neural network).
A face image refers to an image containing a human face.
In this embodiment, if a face frame that meets the requirements is detected in an original image, the original image is determined to be a face image; if no face frame that meets the requirements is detected in the original image (either no face frame is detected, or the detected face frames do not meet the requirements), the original image is determined not to be a face image.
In other embodiments, if a face frame is detected in an original image, the original image is determined to be a face image; if no face frame is detected in the original image, the original image is determined not to be a face image.
In this embodiment, if multiple face frames exist in one original image, the face frame with the largest area is selected as the face frame of that original image, so that each face image corresponds to exactly one face frame.
In this embodiment, it can be judged whether the size of a face frame detected in an original image is less than or equal to a preset threshold; if so, the face frame is determined to be invalid. For example, it can be judged whether the width and height of the detected face frame are less than or equal to 50 pixels; if the width or the height is less than or equal to 50 pixels, the face frame is determined to be invalid.
In a specific embodiment, if the size of a face frame detected in an original image is greater than the preset threshold, the original image is determined to be a face image; if no face frame is detected in the original image, or the sizes of all detected face frames are less than or equal to the preset threshold, the original image is determined not to be a face image.
The sequence acquisition module 203 is used to acquire multiple face image subsequences from the original image sequence according to the multiple face images and the face frames, each face image subsequence containing multiple face images of the same user.
Optionally, acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frames includes:
taking one original image in the original image sequence as a starting point, and selecting the current original image and the next original image one by one to obtain two adjacent original images;
judging whether the face frames of the two adjacent original images meet a preset condition;
if the two adjacent original images are face images, and their face frames meet the preset condition, determining that the two adjacent original images correspond to the same user and belong to the same face image subsequence;
otherwise, if at least one of the two adjacent original images is not a face image, or their face frames do not meet the preset condition, determining that the two adjacent original images do not correspond to the same user and do not belong to the same face image subsequence.
For example, taking the first original image in the original image sequence as the starting point, the first and second original images are selected as two adjacent original images; if both are face images and their face frames meet the preset condition, it is determined that the first and second original images correspond to the same user and belong to the first face image subsequence. The second and third original images are then selected as two adjacent original images; if both are face images and their face frames meet the preset condition, it is determined that they correspond to the same user, so the third original image also belongs to the first face image subsequence. This continues until, for instance, the eighth and ninth original images are selected as two adjacent original images; if the ninth original image is not a face image, or the face frames of the eighth and ninth original images do not meet the preset condition, it is determined that the eighth and ninth original images do not correspond to the same user, and the ninth original image does not belong to the first face image subsequence. The acquired first face image subsequence therefore includes the first through eighth original images. Taking the ninth original image as a new starting point, the next face image subsequence is acquired.
It can be understood that other methods may be used to acquire multiple face image subsequences from the original image sequence according to the multiple face images and the face frames. For example, one face image can be taken as a starting point, and the current face image and the next face image can be selected one by one to obtain two face images;
if the two face images are adjacent frames in the original image sequence, and their face frames meet the preset condition, it is determined that the two face images correspond to the same user and belong to the same face image subsequence;
otherwise, if the two face images are not adjacent frames in the original image sequence, or their face frames do not meet the preset condition, it is determined that the two face images do not correspond to the same user and do not belong to the same face image subsequence.
Optionally, judging whether the face frames of the two adjacent original images meet the preset condition includes:
judging whether the overlap ratio (intersection over union, IOU) of the face frames of the two adjacent original images is greater than or equal to a preset ratio;
if the IOU of the face frames of the two adjacent face images is greater than or equal to the preset ratio, determining that the two adjacent face images meet the preset condition.
Alternatively, it can be judged whether the distance between the face frames of the two adjacent face images is less than or equal to a preset distance; if so, it is determined that the two adjacent face images meet the preset condition.
When face detection is performed frame by frame on the original image sequence in the audio-video data, the position of each face frame is obtained, and the distance between the face frames of two adjacent face images can be calculated from those positions.
The open-mouth detection module 204 is used to detect whether the mouth is open in each face image of each face image subsequence.
Optionally, detecting whether the mouth is open in each face image of each face image subsequence includes:
using the Adaboost algorithm to detect whether the mouth is open in each face image of each face image subsequence.
Adaboost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (a strong classifier).
A classifier can be trained with the Haar-feature-based Adaboost algorithm to distinguish the normal state of the mouth from the open state.
For feature detection with the Adaboost algorithm (such as open-mouth detection), reference can be made to the prior art, which will not be repeated here.
In other embodiments, other methods can be used to detect whether the mouth is open in each face image of each face image subsequence. For example, a MobileNetV2 model can be used to detect whether the mouth is open in each face image of each face image subsequence.
The filtering module 205 is used to filter out target face image subsequences according to the open-mouth detection results of each face image subsequence.
Optionally, filtering out target face image subsequences according to the open-mouth detection results of each face image subsequence includes:
determining the proportion of closed-mouth face images in each face image subsequence;
if the proportion of closed-mouth face images in a face image subsequence is less than or equal to a preset ratio (for example, 0.3), the face image subsequence is a target face image subsequence;
otherwise, if the proportion of closed-mouth face images in the face image subsequence is greater than the preset ratio, the face image subsequence is not a target face image subsequence.
Alternatively, it can be judged whether the number of closed-mouth face images in each face image subsequence is less than or equal to a preset number (for example, 3). If the number of closed-mouth face images in a face image subsequence is less than or equal to the preset number, the face image subsequence is a target face image subsequence; otherwise, it is not a target face image subsequence.
Before the target face image subsequences are filtered out according to the open-mouth detection results, median filtering can be used to smooth the open-mouth detection results of each face image subsequence.
For example, the sliding window size of the median filter is set to 3; that is, a median is computed over every 3 values of the open-mouth detection results of a face image subsequence. Median filtering smooths the open-mouth detection results so that the target face image subsequences can be filtered out more reliably.
The feature extraction module 206 is used to extract facial features from each target face image subsequence.
Optionally, extracting facial features from each target face image subsequence includes:
using a point distribution model to extract facial features from each target face image subsequence.
The point distribution model is a linear contour model whose implementation is principal component analysis. In this model, the face contour (that is, the sequence of feature point coordinates) is described as the sum of the training sample mean and a weighted linear combination of the principal component basis vectors.
In other embodiments, other feature extraction models or algorithms can be used to extract facial features from each target face image subsequence. For example, the SIFT algorithm can be used to extract facial features from each target face image subsequence.
Facial features can be extracted from every face image in a target face image subsequence, and the facial features of the subsequence are then determined from the facial features of all of its face images. For example, the average of the facial features of all face images in the target face image subsequence can be calculated and used as the facial features of the subsequence.
Alternatively, one or more face images (for example, the face image with the best image quality) can be selected from each target face image subsequence, facial features can be extracted from the selected image or images, and the facial features of the target face image subsequence can be determined from them.
The clustering module 207 is used to cluster the target face image subsequences according to the facial features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs.
The target face image subsequences can be clustered with a GMM (Gaussian mixture model), DBSCAN, or the K-Means algorithm.
Specifically, clustering the target face image subsequences includes:
(1) selecting the facial features of a preset number of target face image subsequences as cluster centers;
(2) calculating the distance from the facial features of each target face image subsequence to each cluster center;
(3) assigning each target face image subsequence to a cluster according to the distance from its facial features to each cluster center;
(4) updating the cluster centers according to the assignment of the target face image subsequences;
and repeating (2)-(4) above until the cluster centers no longer change.
Each final cluster center corresponds to one target user.
The cosine similarity between the facial features of each target face image subsequence and each cluster center can be calculated and used as the distance from the facial features of each target face image subsequence to each cluster center.
Alternatively, the Euclidean distance, Manhattan distance, Mahalanobis distance, or the like from the facial features of each target face image subsequence to each cluster center can be calculated.
The interception module 208 is used to intercept, from the audio stream of the audio-video data, the audio segment corresponding to the target image subsequence of each target user to obtain the voiceprint data of each target user.
For example, the target image subsequences of user U1 are S1, S2, and S3; those of user U2 are S4, S5, S6, and S7; and those of user U3 are S8, S9, and S10. The audio segments A1, A2, and A3 corresponding to S1, S2, and S3 are intercepted from the audio stream of the audio-video data for user U1; the audio segments A4, A5, A6, and A7 corresponding to S4, S5, S6, and S7 are intercepted for user U2; and the audio segments A8, A9, and A10 corresponding to S8, S9, and S10 are intercepted for user U3.
The audio segment corresponding to the target image subsequence of each target user can be intercepted from the audio stream of the audio-video data according to the start time and end time corresponding to that target image subsequence.
Guided by the more mature development of facial image technology, the voiceprint data generation device 20 makes full use of the correlation between speech and images in audio-video data to extract speaker-associated voiceprint data from the audio stream of the audio-video data. By processing a large amount of audio-video data with the voiceprint data generation device 20, a large amount of voiceprint data can be obtained to build a large-scale voiceprint database. The voiceprint data generation device 20 obtains voiceprint data efficiently and at low cost; this voiceprint data can be used to train a voiceprint recognition model, which solves the problem that voiceprint samples are difficult to obtain and contributes to the development and promotion of voiceprint recognition technology.
Embodiment 3
This embodiment provides one or more readable storage media storing computer-readable instructions; the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media. When the computer-readable instructions are executed by one or more processors, the steps in the above voiceprint data generation method embodiment are implemented, for example 101-108 shown in Fig. 1; alternatively, when executed by the processors, the computer-readable instructions implement the functions of the modules in the above device embodiment, for example modules 201-208 in Fig. 2. To avoid repetition, the details are not repeated here. Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through computer-readable instructions; the computer-readable instructions can be stored in a non-volatile readable storage medium or in a volatile readable storage medium, and when executed may include the processes of the embodiments of the above methods.
Embodiment 4
Fig. 3 is a schematic diagram of the computer device provided by Embodiment 4 of the present application. The computer device 30 includes a memory 301, a processor 302, and computer-readable instructions 303 (for example, a voiceprint data generation program) stored in the memory 301 and executable on the processor 302. When the processor 302 executes the computer-readable instructions 303, the steps in the above voiceprint data generation method embodiment are implemented, for example 101-108 shown in Fig. 1; alternatively, when executed by the processor, the computer-readable instructions implement the functions of the modules in the above device embodiment, for example modules 201-208 in Fig. 2.
示例性的,所述计算可读指令303可以被分割成一个或多个模块,所述一个或者多个模块被存储在所述存储器301中,并由所述处理器302执行,以完成本方法。所述一个或多个模块可以是能够完成特定功能的一系列计算可读指令指令段,该指令段用于描述所述计算可读指令303在所述计算机装置30中的执行过程。例如,所述计算可读指令303可以被分割成图2中的音视频获取模块201、人脸检测模块202、序列获取模块203、张嘴检测模块204、筛选模块205、特征提取模块206、聚类模块207、截取模块208,各模块具体功能参见实施例二。Exemplarily, the computationally readable instruction 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete the method. . The one or more modules may be a series of computationally readable instruction instruction segments capable of completing specific functions, and the instruction segment is used to describe the execution process of the computationally readable instructions 303 in the computer device 30. For example, the computationally readable instruction 303 can be divided into the audio and video acquisition module 201, the face detection module 202, the sequence acquisition module 203, the open mouth detection module 204, the screening module 205, the feature extraction module 206, and the clustering module shown in FIG. Module 207, interception module 208, the specific functions of each module refer to the second embodiment.
The computer device 30 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art can understand that FIG. 3 is merely an example of the computer device 30 and does not constitute a limitation on the computer device 30; the computer device 30 may include more or fewer components than those shown in the figure, may combine certain components, or may have different components. For example, the computer device 30 may further include input/output devices, network access devices, buses, and so on.
The processor 302 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor. The processor 302 is the control center of the computer device 30 and connects all parts of the entire computer device 30 through various interfaces and lines.
The memory 301 may be used to store the computer-readable instructions 303. The processor 302 implements the various functions of the computer device 30 by running or executing the computer-readable instructions or modules stored in the memory 301 and by calling the data stored in the memory 301. The memory 301 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the computer device 30. In addition, the memory 301 may include a non-volatile memory and/or a volatile memory. The non-volatile memory may include, for example, a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. The volatile memory may include a random access memory (RAM) or an external cache memory.
If the modules integrated in the computer device 30 are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a storage medium. Based on this understanding, this application implements all or part of the processes in the methods of the above embodiments, which may also be completed by computer-readable instructions instructing relevant hardware. The computer-readable instructions may be stored in a storage medium, and when executed by a processor, may implement the steps of the foregoing method embodiments. The computer-readable instructions include computer-readable instruction code, and the computer-readable instruction code may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a read-only memory (ROM). It should be noted that the content contained in the computer-readable medium may be appropriately added or deleted according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for example, the division of the modules is only a division by logical function, and there may be other division manners in actual implementation.
The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional modules in the various embodiments of this application may be integrated into one processing module, each module may exist alone physically, or two or more modules may be integrated into one module. The above integrated modules may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
The above integrated modules implemented in the form of software functional modules may be stored in a storage medium. The software functional modules are stored in a storage medium and include several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of this application.
For those skilled in the art, it is obvious that this application is not limited to the details of the foregoing exemplary embodiments, and that this application can be implemented in other specific forms without departing from the spirit or essential characteristics of the application. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and it is therefore intended that all changes falling within the meaning and scope of equivalent elements of the claims be included in this application. Any reference signs in the claims shall not be construed as limiting the claims concerned. In addition, it is clear that the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or devices stated in the system claims may also be implemented by one module or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any specific order.
Finally, it should be noted that the above embodiments are only used to illustrate, not to limit, the technical solutions of this application. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of this application.

Claims (20)

  1. A voiceprint data generation method, wherein the method comprises:
    acquiring audio/video data;
    performing face detection on an original image sequence in the audio/video data frame by frame to obtain multiple face images and face frames of the multiple face images;
    acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frames, wherein each face image subsequence contains multiple face images of a same user;
    detecting whether each face image in each face image subsequence has its mouth open;
    screening out target face image subsequences according to the mouth-opening detection result of each face image subsequence;
    extracting facial features from each target face image subsequence;
    clustering the target face image subsequences according to the facial features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs;
    intercepting, from an audio stream of the audio/video data, the audio segment corresponding to the target face image subsequence of each target user to obtain the voiceprint data of each target user.
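Exemplarily, the clustering and interception steps recited above may be understood through the following minimal sketch. It assumes each target face image subsequence has already been reduced to a single mean feature vector and that frame indices map to timestamps through the video frame rate; the choice of DBSCAN for clustering and of pydub for audio slicing are illustrative assumptions, not features recited by the claim.

```python
# Illustrative sketch only; DBSCAN and pydub are assumptions, not claim features.
import numpy as np
from sklearn.cluster import DBSCAN
from pydub import AudioSegment

def cluster_and_intercept(subsequences, audio_path, fps=25.0):
    """subsequences: list of dicts, each with a 'feature' vector and
    'start_frame'/'end_frame' indices for one target face image subsequence."""
    features = np.stack([s["feature"] for s in subsequences])
    # Cluster subsequence-level features; each cluster label is one target user.
    labels = DBSCAN(eps=0.6, min_samples=1, metric="cosine").fit_predict(features)
    audio = AudioSegment.from_file(audio_path)
    voiceprints = {}
    for sub, user in zip(subsequences, labels):
        start_ms = int(sub["start_frame"] / fps * 1000)
        end_ms = int(sub["end_frame"] / fps * 1000)
        # Intercept the audio segment aligned with this subsequence.
        voiceprints.setdefault(int(user), []).append(audio[start_ms:end_ms])
    return voiceprints
```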
  2. The method according to claim 1, wherein performing face detection on the original image sequence in the audio/video data frame by frame comprises:
    using a multi-task cascaded convolutional network model to perform face detection on the original image sequence in the audio/video data frame by frame.
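Exemplarily, frame-by-frame detection with a multi-task cascaded convolutional network may look like the sketch below; the third-party `mtcnn` package and OpenCV video decoding are assumptions made for illustration only.

```python
# Illustrative sketch; the `mtcnn` package is an assumption, not mandated by the claim.
import cv2
from mtcnn import MTCNN

def detect_faces_per_frame(video_path):
    detector = MTCNN()
    cap = cv2.VideoCapture(video_path)
    frames_boxes = []  # one list of face frames (boxes) per video frame
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        faces = detector.detect_faces(rgb)  # each result has 'box' = [x, y, w, h]
        frames_boxes.append([f["box"] for f in faces])
    cap.release()
    return frames_boxes
```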
  3. The method according to claim 1, wherein acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frames comprises:
    taking one original image in the original image sequence as a starting point, and selecting the current original image and the next original image one by one to obtain two adjacent original images;
    judging whether the face frames of the two adjacent original images meet a preset condition;
    if the two adjacent original images are face images and the face frames of the two adjacent original images meet the preset condition, determining that the two adjacent original images correspond to a same user and belong to a same face image subsequence;
    otherwise, if at least one of the two adjacent original images is not a face image, or the face frames of the two adjacent original images do not meet the preset condition, determining that the two adjacent original images do not correspond to a same user and do not belong to a same face image subsequence.
  4. The method according to claim 3, wherein judging whether the face frames of the two adjacent original images meet a preset condition comprises:
    judging whether the overlap area ratio of the face frames of the two adjacent original images is greater than or equal to a preset ratio;
    or, judging whether the distance between the face frames of the two adjacent face images is less than or equal to a preset distance.
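Exemplarily, the preset condition of claims 3 and 4 may be checked as in the sketch below; the thresholds, and the choice of the smaller box as the denominator of the overlap ratio, are assumptions, since the claims do not fix them.

```python
# Illustrative check for the preset condition; thresholds are assumptions.
def overlap_ratio(box_a, box_b):
    """Boxes are (x, y, w, h); returns intersection area over the smaller box area."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    return (iw * ih) / min(aw * ah, bw * bh)

def same_user(box_a, box_b, min_ratio=0.5, max_dist=40.0):
    """True if two adjacent frames' face frames overlap enough or sit close enough."""
    ca = (box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2)
    cb = (box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2)
    dist = ((ca[0] - cb[0]) ** 2 + (ca[1] - cb[1]) ** 2) ** 0.5
    return overlap_ratio(box_a, box_b) >= min_ratio or dist <= max_dist
```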
  5. The method according to claim 1, wherein detecting whether each face image in each face image subsequence has its mouth open comprises:
    using the Adaboost algorithm to detect whether each face image in each face image subsequence has its mouth open.
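Exemplarily, mouth-opening detection with Adaboost may be sketched as follows, assuming mouth landmark points and a labelled training set are already available; the height/width feature and scikit-learn's AdaBoostClassifier are illustrative stand-ins for the Adaboost algorithm named in the claim.

```python
# Illustrative sketch; the feature design and scikit-learn are assumptions.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def mouth_feature(landmarks):
    """landmarks: dict with 'mouth_top', 'mouth_bottom', 'mouth_left', 'mouth_right'
    as (x, y) points; returns the mouth height/width ratio as a one-element feature."""
    h = np.linalg.norm(np.subtract(landmarks["mouth_top"], landmarks["mouth_bottom"]))
    w = np.linalg.norm(np.subtract(landmarks["mouth_left"], landmarks["mouth_right"]))
    return [h / max(w, 1e-6)]

def train_mouth_classifier(train_landmarks, labels):
    """labels: 1 = mouth open, 0 = mouth closed."""
    X = np.array([mouth_feature(lm) for lm in train_landmarks])
    return AdaBoostClassifier(n_estimators=50).fit(X, np.asarray(labels))
```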
  6. The method according to claim 1, wherein screening out target face image subsequences according to the mouth-opening detection result of each face image subsequence comprises:
    determining the proportion of closed-mouth face images in each face image subsequence;
    if the proportion of closed-mouth face images in a face image subsequence is less than or equal to a preset ratio, taking that face image subsequence as a target face image subsequence.
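Exemplarily, the screening step of claim 6 reduces to a ratio test per subsequence, as in the sketch below; the 50% preset ratio is an assumption.

```python
# Illustrative screening per claim 6; the preset ratio of 0.5 is an assumption.
def select_target_subsequences(subsequences, open_flags, max_closed_ratio=0.5):
    """subsequences[i] is one face image subsequence; open_flags[i] holds one
    boolean per image (True = mouth open). Keeps subsequences likely to contain speech."""
    targets = []
    for sub, flags in zip(subsequences, open_flags):
        closed_ratio = flags.count(False) / len(flags)
        if closed_ratio <= max_closed_ratio:
            targets.append(sub)
    return targets
```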
  7. The method according to claim 1, wherein extracting facial features from each target face image subsequence comprises:
    using a point distribution model to extract facial features from each target face image subsequence.
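Exemplarily, landmark-based feature extraction may be sketched as below. Note the substitution: dlib's 68-point shape predictor is an ensemble-of-regression-trees landmark model, not a classical point distribution model, and the model file path is the conventional dlib one; both are assumptions made for illustration.

```python
# Illustrative sketch; dlib's landmark model stands in for the point distribution
# model named in the claim, and the model path is an assumption.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmark_features(gray_image):
    """Returns one flattened, pose-normalized 68x2 landmark vector per detected face."""
    feats = []
    for rect in detector(gray_image):
        shape = predictor(gray_image, rect)
        pts = np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)
        pts -= pts.mean(axis=0)            # translate to a common origin
        pts /= np.linalg.norm(pts) + 1e-8  # scale-normalize, PDM-style alignment
        feats.append(pts.flatten())
    return feats
```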
  8. A voiceprint data generation device, wherein the device comprises:
    an audio/video acquisition module, configured to acquire audio/video data;
    a face detection module, configured to perform face detection on an original image sequence in the audio/video data frame by frame to obtain multiple face images and face frames of the multiple face images;
    a sequence acquisition module, configured to acquire multiple face image subsequences from the original image sequence according to the multiple face images and the face frames, wherein each face image subsequence contains multiple face images of a same user;
    a mouth-opening detection module, configured to detect whether each face image in each face image subsequence has its mouth open;
    a screening module, configured to screen out target face image subsequences according to the mouth-opening detection result of each face image subsequence;
    a feature extraction module, configured to extract facial features from each target face image subsequence;
    a clustering module, configured to cluster the target face image subsequences according to the facial features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs;
    an interception module, configured to intercept, from an audio stream of the audio/video data, the audio segment corresponding to the target face image subsequence of each target user to obtain the voiceprint data of each target user.
  9. A computer device, wherein the computer device comprises a memory and a processor, and the processor is configured to implement the following steps when executing computer-readable instructions stored in the memory:
    acquiring audio/video data;
    performing face detection on an original image sequence in the audio/video data frame by frame to obtain multiple face images and face frames of the multiple face images;
    acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frames, wherein each face image subsequence contains multiple face images of a same user;
    detecting whether each face image in each face image subsequence has its mouth open;
    screening out target face image subsequences according to the mouth-opening detection result of each face image subsequence;
    extracting facial features from each target face image subsequence;
    clustering the target face image subsequences according to the facial features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs;
    intercepting, from an audio stream of the audio/video data, the audio segment corresponding to the target face image subsequence of each target user to obtain the voiceprint data of each target user.
  10. The computer device according to claim 9, wherein performing face detection on the original image sequence in the audio/video data frame by frame comprises:
    using a multi-task cascaded convolutional network model to perform face detection on the original image sequence in the audio/video data frame by frame.
  11. The computer device according to claim 9, wherein acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frames comprises:
    taking one original image in the original image sequence as a starting point, and selecting the current original image and the next original image one by one to obtain two adjacent original images;
    judging whether the face frames of the two adjacent original images meet a preset condition;
    if the two adjacent original images are face images and the face frames of the two adjacent original images meet the preset condition, determining that the two adjacent original images correspond to a same user and belong to a same face image subsequence;
    otherwise, if at least one of the two adjacent original images is not a face image, or the face frames of the two adjacent original images do not meet the preset condition, determining that the two adjacent original images do not correspond to a same user and do not belong to a same face image subsequence.
  12. The computer device according to claim 11, wherein judging whether the face frames of the two adjacent original images meet a preset condition comprises:
    judging whether the overlap area ratio of the face frames of the two adjacent original images is greater than or equal to a preset ratio;
    or, judging whether the distance between the face frames of the two adjacent face images is less than or equal to a preset distance.
  13. The computer device according to claim 9, wherein detecting whether each face image in each face image subsequence has its mouth open comprises:
    using the Adaboost algorithm to detect whether each face image in each face image subsequence has its mouth open.
  14. The computer device according to claim 9, wherein screening out target face image subsequences according to the mouth-opening detection result of each face image subsequence comprises:
    determining the proportion of closed-mouth face images in each face image subsequence;
    if the proportion of closed-mouth face images in a face image subsequence is less than or equal to a preset ratio, taking that face image subsequence as a target face image subsequence.
  15. The computer device according to claim 9, wherein extracting facial features from each target face image subsequence comprises:
    using a point distribution model to extract facial features from each target face image subsequence.
  16. One or more readable storage media storing computer-readable instructions, wherein, when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
    acquiring audio/video data;
    performing face detection on an original image sequence in the audio/video data frame by frame to obtain multiple face images and face frames of the multiple face images;
    acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frames, wherein each face image subsequence contains multiple face images of a same user;
    detecting whether each face image in each face image subsequence has its mouth open;
    screening out target face image subsequences according to the mouth-opening detection result of each face image subsequence;
    extracting facial features from each target face image subsequence;
    clustering the target face image subsequences according to the facial features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs;
    intercepting, from an audio stream of the audio/video data, the audio segment corresponding to the target face image subsequence of each target user to obtain the voiceprint data of each target user.
  17. The readable storage media according to claim 16, wherein performing face detection on the original image sequence in the audio/video data frame by frame comprises:
    using a multi-task cascaded convolutional network model to perform face detection on the original image sequence in the audio/video data frame by frame.
  18. The readable storage media according to claim 16, wherein acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frames comprises:
    taking one original image in the original image sequence as a starting point, and selecting the current original image and the next original image one by one to obtain two adjacent original images;
    judging whether the face frames of the two adjacent original images meet a preset condition;
    if the two adjacent original images are face images and the face frames of the two adjacent original images meet the preset condition, determining that the two adjacent original images correspond to a same user and belong to a same face image subsequence;
    otherwise, if at least one of the two adjacent original images is not a face image, or the face frames of the two adjacent original images do not meet the preset condition, determining that the two adjacent original images do not correspond to a same user and do not belong to a same face image subsequence.
  19. The readable storage media according to claim 18, wherein judging whether the face frames of the two adjacent original images meet a preset condition comprises:
    judging whether the overlap area ratio of the face frames of the two adjacent original images is greater than or equal to a preset ratio;
    or, judging whether the distance between the face frames of the two adjacent face images is less than or equal to a preset distance.
  20. The readable storage media according to claim 16, wherein detecting whether each face image in each face image subsequence has its mouth open comprises:
    using the Adaboost algorithm to detect whether each face image in each face image subsequence has its mouth open.
PCT/CN2020/093318 2020-03-31 2020-05-29 Voiceprint data generation method and device, and computer device and storage medium WO2021196390A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010244174.8 2020-03-31
CN202010244174.8A CN111613227A (en) 2020-03-31 2020-03-31 Voiceprint data generation method and device, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2021196390A1 true WO2021196390A1 (en) 2021-10-07

Family

ID=72205420

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093318 WO2021196390A1 (en) 2020-03-31 2020-05-29 Voiceprint data generation method and device, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN111613227A (en)
WO (1) WO2021196390A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050182503A1 (en) * 2004-02-12 2005-08-18 Yu-Ru Lin System and method for the automatic and semi-automatic media editing
US20110035221A1 (en) * 2009-08-07 2011-02-10 Tong Zhang Monitoring An Audience Participation Distribution
CN105512348A (en) * 2016-01-28 2016-04-20 北京旷视科技有限公司 Method and device for processing videos and related audios and retrieving method and device
CN106650624A (en) * 2016-11-15 2017-05-10 东软集团股份有限公司 Face tracking method and device
CN108875506A (en) * 2017-11-17 2018-11-23 北京旷视科技有限公司 Face shape point-tracking method, device and system and storage medium
CN110032970A (en) * 2019-04-11 2019-07-19 深圳市华付信息技术有限公司 Biopsy method, device, computer equipment and the storage medium of high-accuracy

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299953A (en) * 2021-12-29 2022-04-08 湖北微模式科技发展有限公司 Speaker role distinguishing method and system combining mouth movement analysis
CN115225326A (en) * 2022-06-17 2022-10-21 中国电信股份有限公司 Login verification method and device, electronic equipment and storage medium
CN115996303A (en) * 2023-03-23 2023-04-21 科大讯飞股份有限公司 Video generation method, device, electronic equipment and storage medium
CN115996303B (en) * 2023-03-23 2023-07-25 科大讯飞股份有限公司 Video generation method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111613227A (en) 2020-09-01

Similar Documents

Publication Publication Date Title
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
Jafar et al. Forensics and analysis of deepfake videos
WO2021196390A1 (en) Voiceprint data generation method and device, and computer device and storage medium
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
WO2020248376A1 (en) Emotion detection method and apparatus, electronic device, and storage medium
US20210012777A1 (en) Context acquiring method and device based on voice interaction
WO2020253051A1 (en) Lip language recognition method and apparatus
Provost Identifying salient sub-utterance emotion dynamics using flexible units and estimates of affective flow
JP2001092974A (en) Speaker recognizing method, device for executing the same, method and device for confirming audio generation
CN111326139B (en) Language identification method, device, equipment and storage medium
Ringeval et al. Emotion recognition in the wild: Incorporating voice and lip activity in multimodal decision-level fusion
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
Hassanat Visual words for automatic lip-reading
CN108520752A (en) A kind of method for recognizing sound-groove and device
Jachimski et al. A comparative study of English viseme recognition methods and algorithms
Schlüter et al. Unsupervised feature learning for speech and music detection in radio broadcasts
CN112507311A (en) High-security identity verification method based on multi-mode feature fusion
El Shafey et al. Audio-visual gender recognition in uncontrolled environment using variability modeling techniques
CN113923521B (en) Video scripting method
Shi et al. Visual speaker authentication by ensemble learning over static and dynamic lip details
Shivakumar et al. Simplified and supervised i-vector modeling for speaker age regression
Yao et al. Anchor voiceprint recognition in live streaming via RawNet-SA and gated recurrent unit
CN116883900A (en) Video authenticity identification method and system based on multidimensional biological characteristics
Borde et al. Recognition of isolated digit using random forest for audio-visual speech recognition
Gao et al. Metric Learning Based Feature Representation with Gated Fusion Model for Speech Emotion Recognition.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20928652

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20928652

Country of ref document: EP

Kind code of ref document: A1