WO2021196390A1

WO2021196390A1 - Voiceprint data generation method and device, and computer device and storage medium

Info

Publication number: WO2021196390A1
Application number: PCT/CN2020/093318
Authority: WO
Inventors: 王德勋; 徐国强
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-03-31
Filing date: 2020-05-29
Publication date: 2021-10-07
Also published as: CN111613227A

Abstract

The present application relates to the technical field of artificial intelligence, and provides a voiceprint data generation method and device, and a computer device and a storage medium. The method comprises: performing face detection on an original image sequence in audio and video data frame by frame to obtain a plurality of face images and face boxes thereof; obtaining a plurality of face image sub-sequences from the original image sequence according to the plurality of face images and the face boxes thereof; detecting whether a mouth in each face image in each face image sub-sequence opens or not; screening out a target face image sub-sequence according to the mouth opening detection result of each face image sub-sequence; extracting a face feature from each target face image sub-sequence; clustering the target face image sub-sequences to obtain a target user to which each target face image sub-sequence belongs; and capturing from an audio stream of the audio and video data an audio segment corresponding to the target image sub-sequence of each target user to obtain voiceprint data of each target user. According to the present application, voiceprint data can be obtained with high efficiency and low costs.

Description

Voiceprint data generation method, device, computer device and storage medium

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on March 31, 2020, with the application number of 202010244174.8, and the application titled "Voiceprint data generation method, device, computer device and storage medium", all of which are approved The reference is incorporated in this application.

Technical field

This application relates to the technical field of artificial intelligence speech processing, and in particular to a method, device, computer device, and storage medium for generating voiceprint data.

Background technique

Human speech contains a wealth of information. One of the important information is the voiceprint information that characterizes the speaker's identity. Because different people have different voice cavities and ways of speaking, the voiceprint information of any two people is different. Voiceprint recognition is the process of using machines to automatically extract the voiceprint information in the voice and identify the speaker's identity. It plays an important role in security, audit, and education scenarios.

The current mainstream voiceprint recognition method is based on deep learning voiceprint recognition. A neural network model (ie, voiceprint recognition model) is trained through a large number of voiceprint samples, so that the neural network model can automatically mine the speaker’s voiceprint features. Identify the speaker's identity based on voiceprint features. However, the inventor realized that: unlike face data, voice data (such as voiceprint data) is more private and harder to collect, and has multiple variables such as accents, noises, and dialects, resulting in an open source voiceprint database. It is seriously inadequate in quality and quantity, unable to obtain enough voiceprint samples, and unable to train a high-accuracy voiceprint recognition model. Collecting and labeling voiceprint data by oneself also requires a lot of money and labor costs. The lack of voiceprint data largely limits the development and promotion of voiceprint recognition technology.

Application content

In view of the above, it is necessary to provide a method, device, computer device, and storage medium for generating voiceprint data, which can obtain voiceprint data with high efficiency and low cost.

The first aspect of the present application provides a method for generating voiceprint data, the method including:

Obtain audio and video data;

Performing face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and face frames of the multiple face images;

Acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frame, and each face image subsequence includes multiple face images of the same user;

Detect whether each face image in each face image subsequence has its mouth open;

Filter out the target face image subsequence according to the open mouth detection result of each face image subsequence;

Extract facial features for each target facial image sub-sequence;

Clustering the target facial image subsequence according to the facial features of each target facial image subsequence to obtain the target user to which each target facial image subsequence belongs;

The audio segment corresponding to the target image subsequence of each target user is intercepted from the audio stream of the audio and video data to obtain the voiceprint data of each target user.

A second aspect of the present application provides a voiceprint data generating device, the device including:

Audio and video acquisition module for acquiring audio and video data;

The face detection module is configured to perform face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and face frames of the multiple face images;

A sequence acquisition module, configured to acquire multiple face image sub-sequences from the original image sequence according to the multiple face images and the face frame, and each face image sub-sequence contains multiple face images of the same user;

Mouth opening detection module, used to detect whether each face image in each face image sub-sequence has a mouth open;

The screening module is used to screen out the target face image subsequence according to the open mouth detection result of each face image subsequence;

The feature extraction module is used to extract face features from each target face image sub-sequence;

A clustering module, configured to cluster the target face image subsequence according to the facial features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs;

The interception module is used to intercept the audio segment corresponding to the target image subsequence of each target user from the audio stream of the audio and video data to obtain the voiceprint data of each target user.

A third aspect of the present application provides a computer device, the computer device includes a memory and a processor, and the processor is configured to execute the computer-readable instructions stored in the memory to implement the following steps:

Obtain audio and video data;

Extract facial features for each target facial image sub-sequence;

The fourth aspect of the present application provides one or more readable storage media storing computer readable instructions. When the computer readable instructions are executed by one or more processors, the one or more processors execute the following step:

Obtain audio and video data;

Extract facial features for each target facial image sub-sequence;

This application is guided by the development of more mature facial image technology, and makes full use of the correlation between voice and image in audio and video data to extract voiceprint data associated with the speaker from the audio stream of audio and video data. By using this application to process a large amount of audio and video data, a large amount of voiceprint data can be obtained to construct a large-scale voiceprint database. This application can obtain voiceprint data with high efficiency and low cost. The voiceprint data can be used to train the voiceprint recognition model, which solves the problem that voiceprint samples are difficult to obtain, and contributes to the development and promotion of voiceprint recognition technology. . The details of one or more embodiments of the present application are presented in the following drawings and description, and other features and advantages of the present application will become apparent from the description, drawings and claims.

Description of the drawings

Fig. 1 is a flowchart of a method for generating voiceprint data provided by an embodiment of the present application.

Fig. 2 is a structural diagram of a voiceprint data generating device provided by an embodiment of the present application.

Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present application.

Detailed ways

In order to be able to understand the above objectives, features and advantages of the application more clearly, the application will be described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the application and the features in the embodiments can be combined with each other if there is no conflict.

In the following description, many specific details are set forth in order to fully understand the present application. The described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terms used in the specification of the application herein are only for the purpose of describing specific embodiments, and are not intended to limit the application.

This application relates to the field of artificial intelligence speech processing technology. Preferably, the voiceprint data generation method of this application is applied to one or more computer devices. The computer device is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions. Its hardware includes, but is not limited to, a microprocessor and an application specific integrated circuit (ASIC) , Field-Programmable Gate Array (FPGA), Digital Processor (Digital Signal Processor, DSP), embedded equipment, etc.

The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. The computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.

Example one

FIG. 1 is a flowchart of a method for generating voiceprint data according to Embodiment 1 of the present application. The voiceprint data generation method is applied to a computer device.

The voiceprint data generation method extracts the voiceprint data associated with the speaker from the audio and video data. The voiceprint data can be used as a voiceprint sample to train a voiceprint recognition model.

As shown in Figure 1, the method for generating voiceprint data includes:

101. Acquire audio and video data.

Audio and video data refers to multimedia data that contains both voice and image. The content of the audio and video data includes, but is not limited to variety shows, interviews, TV dramas, etc.

In order to extract the voiceprint data associated with the speaker, the acquired audio and video data includes the speaker's voice and images.

The audio and video data can be obtained from a preset multimedia database. Alternatively, a camera device in the computer device or connected to the computer device can be controlled to collect the audio and video data in real time.

102. Perform face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and face frames of the multiple face images.

The original image sequence and audio stream sequence can be separated from the audio and video data. For example, audio and video editing software (such as MediaCoder, ffmpeg) can be used to separate the original image sequence and audio stream sequence from the audio and video data.

The original image sequence includes a plurality of original images.

Optionally, said performing face detection on the original image sequence in the audio and video data frame by frame includes:

The MTCNN (Multi-task Cascaded Convolutional Networks, multi-task cascaded convolutional network) model is used to perform face detection on the original image sequence in the audio and video data frame by frame.

MTCNN is composed of three parts: P-Net (proposal network), R-Net (refine network), and O-Net (output network). The three parts are three independent network structures. Each part is a multi-task network, and the tasks to be processed include: face/non-face judgment, face frame regression, and feature point positioning.

Using the MTCNN model to perform face detection on the original image sequence in the audio and video data frame by frame includes:

(1) Use P-Net to generate candidate windows. Bounding box regression can be used to correct candidate windows, and non-maximum suppression (NMS) can be used to merge overlapping candidate boxes.

(2) Use N-Net to improve the candidate window. Input the candidate window through P-Net into R-Net, and remove the non-face frame in the candidate frame.

(3) Use O-Net to output the final face frame and the position of the face feature points.

The use of the MTCNN model for face recognition can refer to the prior art, which will not be repeated here.

In other embodiments, other neural network models may be used to perform face detection on the original image sequence in the audio and video data frame by frame. For example, faster R-CNN (faster region-based convolution neural network, accelerated regional convolution neural network model) or cascadeCNN (cascade convolution neural network, cascade convolution neural network model) can be used to analyze the original audio and video data. Face detection is performed on the image sequence frame by frame.

The human face image refers to an image containing a human face.

In this embodiment, if a face frame that meets the requirements is detected from an original image, the original image is determined to be a face image; if no face frame that meets the requirements is detected from the original image (including no person detected) The face frame or the detected face frame does not meet the requirements), it is determined that the original image is not a face image.

In other embodiments, if a face frame is detected from an original image, the original image is determined to be a face image; if no face frame is detected from the original image, it is determined that the original image is not a face image.

In this embodiment, if there are multiple face frames in an original image, the face frame with the largest area is selected as the face frame of the original image, so that one face image corresponds to one face frame.

In this embodiment, it can be determined whether the size of the face frame detected from the original image is less than or equal to the preset threshold, and if the size of the face frame detected from the original image is less than or equal to the preset threshold, it is determined The face frame is an invalid face frame. For example, it can be judged whether the width and height of the face frame detected from the original image is less than or equal to 50 pixels. If the width or height of the face frame detected from the original image is less than or equal to 50 pixels, it is determined The face frame is an invalid face frame.

In a specific embodiment, if the size of the face frame detected from an original image is greater than a preset threshold, the original image is determined to be a face image; if no face frame is detected from the original image or the detected face frame If the size of the face frame is less than or equal to the preset threshold, it is determined that the original image is not a face image.

103. Acquire multiple face image subsequences from the original image sequence according to the multiple face images and the face frame, and each face image subsequence includes multiple face images of the same user.

Optionally, acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frame includes:

Taking an original image in the original image sequence as a starting point, selecting the current original image and the next original image one by one to obtain two adjacent original images;

Judging whether the face frames of the two adjacent original images meet a preset condition;

If the two adjacent original images are face images, and the face frames of the two adjacent original images meet a preset condition, it is determined that the two adjacent original images correspond to the same user, and the adjacent two original images correspond to the same user. The two original images belong to the same face image subsequence;

Otherwise, if at least one of the two adjacent original images is not a face image, or the face frames of the two adjacent original images do not meet the preset condition, it is determined that the two adjacent original images do not correspond to the same User, the two adjacent original images do not belong to the same face image subsequence.

For example, taking the first original image in the original image sequence as the starting point, the first original image and the second original image in the original image sequence are selected as two adjacent original images, if the first original image is The original image and the second original image are face images, and the face frames of the first original image and the second original image meet the preset conditions, it is determined that the first original image and the second original image correspond to the same user And belong to the first face image sub-sequence; the second original image and the third original image in the original image sequence are selected as two adjacent original images, if the second original image and the third original image are Face image, and the face frames of the second original image and the third original image meet the preset conditions, it is determined that the second original image and the third original image correspond to the same user, and the third original image also belongs to the third original image. A face image sub-sequence; ..... Select the eighth original image and the ninth original image in the original image sequence as two adjacent original images, if the ninth original image is not a face image , Or the face frames of the eighth original image and the ninth original image do not meet the preset conditions, it is determined that the eighth original image and the ninth original image correspond to the same user, and the ninth original image does not belong to the first person Face image subsequence. Therefore, the acquired first face image sub-sequence includes the first original image, the second original image... and the eighth original image. Taking the ninth original image as a new starting point, the next face image sub-sequence is obtained.

It can be understood that other methods may be used to obtain multiple face image subsequences from the original image sequence according to the multiple face images and the face frame. For example, a face image can be used as a starting point, and the current face image and the next face image can be selected one by one to obtain two face images;

If the two face images are adjacent frames in the original image sequence, and the face frames of the two face images meet a preset condition, it is determined that the two face images correspond to the same user, and the two face images correspond to the same user. Personal face images belong to the same face image sub-sequence;

Otherwise, if the two face images are not adjacent frames in the original image sequence, or the face frames of the two face images do not meet a preset condition, it is determined that the two face images do not correspond to the same user , The two face images do not belong to the same face image sub-sequence.

Optionally, determining whether the face frames of the two adjacent original images meet a preset condition includes:

Judging whether the overlap area ratio (Intersection over Union, IOU) of the face frames of the two adjacent original images is greater than or equal to a preset ratio;

If the overlapping area ratio of the face frames of the two adjacent face images is greater than or equal to the preset ratio, it is determined that the two adjacent face images meet the preset condition.

Alternatively, it may be determined whether the distance between the face frames of the two adjacent face images is less than or equal to the preset distance, and if the distance between the face frames of the two adjacent face images is less than or equal to the preset distance, then it is determined The two adjacent face images meet a preset condition.

When face detection is performed on the original image sequence in the audio and video data frame by frame, the position of each face frame can be obtained, and the adjacent face frames can be calculated according to the positions of the face frames of the two adjacent face images. The distance between the face frames of the two face images.

104. Detect whether each face image in each face image sub-sequence has a mouth open.

Optionally, the detecting whether each face image in each face image subsequence has a mouth open includes:

The Adaboost algorithm is used to detect whether each face image in each face image subsequence is open.

Adaboost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) for the same training set, and then group these weak classifiers to form a stronger final classifier (strong classifier).

The Adaboost algorithm based on Haar features can be used to train the classifier to realize the distinction between the normal state of the mouth and the open state.

The use of Adaboost algorithm for feature detection (such as open mouth detection) can refer to the prior art, which will not be repeated here.

In other embodiments, other methods may be used to detect whether each face image in each face image sub-sequence has a mouth open. For example, the MobileNetV2 (mobile network V2) model can be used to detect whether each face image in each face image sub-sequence is open.

105. Filter out the target face image subsequence according to the open mouth detection result of each face image subsequence.

Optionally, the screening of the target face image subsequence according to the open mouth detection result of each face image subsequence includes:

Determine the proportion of closed face images in each face image sub-sequence in the face image sub-sequence;

If the proportion of closed face images in the face image subsequence in the face image subsequence is less than or equal to the preset ratio (for example, 0.3), then the face image subsequence is the target face image subsequence .

Otherwise, if the proportion of closed face images in the face image subsequence in the face image subsequence is greater than the preset ratio, the face image subsequence is not the target face image subsequence.

Alternatively, it can be determined whether the number of closed-mouth face images in each face image sub-sequence is less than or equal to a preset number (for example, 3). If the number of closed-mouth face images in the face image subsequence is less than or equal to the preset number, then the face image subsequence is the target face image subsequence. Otherwise, if the number of closed face images in the face image subsequence is greater than the preset number, then the face image subsequence is not the target face image subsequence.

Before the target face image subsequence is filtered out according to the open mouth detection results of each face image subsequence, median filtering can be used to smooth the open mouth detection results of each face image subsequence.

For example, the sliding window size of the median filter is set to 3, that is, the median value is calculated every 3 numbers of the mouth detection results of the face image subsequence. The median filter can smooth the mouth detection results and better filter out the target face image sub-sequences.

106. Extract facial features for each target facial image subsequence.

Optionally, the extraction of facial features from each target facial image subsequence includes:

The point distribution model is used to extract facial features for each target facial image sub-sequence.

The point distribution model is a linear contour model, and its realization form is principal component analysis. In this model, the face contour (ie, the feature point coordinate sequence) is described as the sum of the weighted linear combination of the mean value of the training sample and the basis vector of each principal component.

In other embodiments, other feature extraction models or algorithms can be used to extract facial features for each target facial image subsequence. For example, the SIFT algorithm is used to extract facial features for each target facial image sub-sequence.

The facial features can be extracted from each face image in each target face image subsequence, and the face features of the target face image subsequence can be determined according to the facial features of all face images in the target face image subsequence . For example, the average value of the facial features of all facial images in the target facial image subsequence can be calculated, and the average value can be used as the facial features of the target facial image subsequence.

Alternatively, one or more face images (for example, a face image with the best image quality) may be selected from each target face image sub-sequence, and face features are extracted from the one or more face images, according to the The facial features of one or more facial images determine the facial features of the target facial image sub-sequence.

107. Cluster the target facial image subsequence according to the facial features of each target facial image subsequence to obtain the target user to which each target facial image subsequence belongs.

The GMM (Gaussian Mixture Model, Gaussian Mixture Model), DBSCAN or K-Means algorithm may be used to cluster the target face image subsequences.

Specifically, clustering the target face image subsequence includes:

(1) Select a preset number of face features of the target face image subsequence as the clustering center;

(2) Calculate the distance from the face feature of each target face image subsequence to each cluster center;

(3) Divide each target face image subsequence into a cluster according to the distance from the face feature of each target face image subsequence to each cluster center;

(4) Update the cluster center according to the division of the target face image subsequence;

Repeat the above (2)-(4) until the cluster center no longer changes.

Each cluster center finally obtained corresponds to a target user.

The cosine similarity between the face features of each target face image subsequence to each cluster center can be calculated, and the cosine similarity is used as the face feature of each target face image subsequence to each cluster center. the distance.

Alternatively, the Euclidean distance, Manhattan distance, Mahalanobis distance, etc. from the facial features of each target face image subsequence to each cluster center can be calculated.

108. Intercept the audio segment corresponding to the target image subsequence of each target user from the audio stream of the audio and video data to obtain voiceprint data of each target user.

For example, the target image subsequence of user U1 includes target image subsequences S1, S2, S3, the target image subsequence of user U2 includes target image subsequences S4, S5, S6, S7, and the target image subsequence of user U3 includes target images The sub-sequences S8, S9, S10 are used to intercept the audio segments A1, A2, and A3 corresponding to the target image sub-sequences S1, S2, and S3 of the user U1 from the audio stream of the audio and video data, and from the audio of the audio and video data The audio segments A4, A5, A6, and A7 corresponding to the target image subsequences S4, S5, S6, and S7 of the user U2 are intercepted in the stream, and the target image subsequences S8, S9 of the user U3 are intercepted from the audio stream of the audio and video data. , S10 corresponds to the audio segments A8, A9, and A10.

The audio segment corresponding to the target image subsequence of each target user may be intercepted from the audio stream of the audio and video data according to the start time and the end time corresponding to the target image subsequence of each target user.

The voiceprint data generation method is guided by the development of more mature facial image technology, and makes full use of the correlation between the voice and the image in the audio and video data to extract the voice associated with the speaker from the audio stream of the audio and video data.纹数据。 Pattern data. By using the voiceprint data generation method to process a large amount of audio and video data, a large amount of voiceprint data can be obtained to construct a large-scale voiceprint database. The voiceprint data generation method can obtain voiceprint data with high efficiency and low cost. The voiceprint data can be used to train a voiceprint recognition model, which solves the problem that voiceprint samples are difficult to obtain, and is helpful for voiceprint recognition. The development and promotion of technology.

Example two

Fig. 2 is a structural diagram of a voiceprint data generating device provided in the second embodiment of the present application. The voiceprint data generating device 20 is applied to a computer device. The voiceprint data generating device 20 analyzes the text to be analyzed, and determines the emotion category of the text to be analyzed. The voiceprint data generating device 20 extracts voiceprint data associated with the speaker from the audio and video data. The voiceprint data can be used as a voiceprint sample to train a voiceprint recognition model.

As shown in FIG. 2, the voiceprint data generating device 20 may include an audio and video acquisition module 201, a face detection module 202, a sequence acquisition module 203, an open mouth detection module 204, a screening module 205, a feature extraction module 206, and a clustering module. 207. The interception module 208.

The audio and video acquisition module 201 is used to acquire audio and video data.

The face detection module 202 is configured to perform face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and face frames of the multiple face images.

The original image sequence includes a plurality of original images.

The human face image refers to an image containing a human face.

The sequence acquisition module 203 is configured to acquire multiple face image sub-sequences from the original image sequence according to the multiple face images and the face frame, and each face image sub-sequence includes multiple face images of the same user.

For example, taking the first original image in the original image sequence as the starting point, the first original image and the second original image in the original image sequence are selected as two adjacent original images, if the first original image is The original image and the second original image are face images, and the face frames of the first original image and the second original image meet the preset conditions, it is determined that the first original image and the second original image correspond to the same user And belong to the first face image sub-sequence; the second original image and the third original image in the original image sequence are selected as two adjacent original images, if the second original image and the third original image are Face image, and the face frames of the second original image and the third original image meet the preset conditions, it is determined that the second original image and the third original image correspond to the same user, and the third original image also belongs to the third original image. A face image sub-sequence;.... Select the eighth original image and the ninth original image in the original image sequence as two adjacent original images, if the ninth original image is not a face image, Or the face frames of the eighth original image and the ninth original image do not meet the preset conditions, it is determined that the eighth original image and the ninth original image correspond to the same user, and the ninth original image does not belong to the first face Image subsequence. Therefore, the acquired first face image sub-sequence includes the first original image, the second original image... and the eighth original image. Taking the ninth original image as a new starting point, the next face image sub-sequence is obtained.

The mouth opening detection module 204 is used to detect whether each face image in each face image sub-sequence has a mouth open.

The screening module 205 is configured to screen out the target face image subsequence according to the open mouth detection result of each face image subsequence.

The feature extraction module 206 is configured to extract face features from each target face image sub-sequence.

The clustering module 207 is configured to cluster the target face image subsequence according to the facial features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs.

Specifically, clustering the target face image subsequence includes:

Repeat the above (2)-(4) until the cluster center no longer changes.

Each cluster center finally obtained corresponds to a target user.

The interception module 208 is used to intercept the audio segment corresponding to the target image subsequence of each target user from the audio stream of the audio and video data to obtain the voiceprint data of each target user.

The voiceprint data generating device 20 is guided by the development of more mature facial image technology, and makes full use of the correlation between the voice and the image in the audio and video data to extract the speaker-related information from the audio stream of the audio and video data. Voiceprint data. By using the voiceprint data generating device 20 to process a large amount of audio and video data, a large amount of voiceprint data can be obtained to construct a large-scale voiceprint database. The voiceprint data generating device 20 can obtain voiceprint data with high efficiency and low cost, and the voiceprint data can be used to train the voiceprint recognition model, which solves the problem that voiceprint samples are difficult to obtain, and helps voiceprint data. The development and promotion of recognition technology.

Example three

This embodiment provides one or more readable storage media storing computer readable instructions. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media; The computationally readable instructions, when executed by one or more processors, implement the steps in the foregoing embodiment of the method for generating voiceprint data, such as 101-108 shown in FIG. 1. Alternatively, the computationally readable instruction realizes the functions of the modules in the foregoing device embodiment when executed by the processor, such as modules 201-208 in FIG. 2. To avoid repetition, I won’t repeat them here. A person of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing relevant hardware through computer-readable instructions, which can be stored in a non-volatile readable storage. The medium may also be stored in a volatile readable storage medium, and when the computer readable instructions are executed, they may include the processes of the above-mentioned method embodiments.

Example four

FIG. 3 is a schematic diagram of a computer device provided in Embodiment 4 of this application. The computer device 30 includes a memory 301, a processor 302, and computationally readable instructions 303 stored in the memory 301 and running on the processor 302, such as a voiceprint data generating program. The processor 302 implements the steps in the embodiment of the voiceprint data generation method when executing the calculation readable instruction 303, for example, 101-108 shown in FIG. 1. Alternatively, the computationally readable instruction realizes the functions of the modules in the foregoing device embodiment when executed by the processor, such as modules 201-208 in FIG. 2.

Exemplarily, the computationally readable instruction 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete the method. . The one or more modules may be a series of computationally readable instruction instruction segments capable of completing specific functions, and the instruction segment is used to describe the execution process of the computationally readable instructions 303 in the computer device 30. For example, the computationally readable instruction 303 can be divided into the audio and video acquisition module 201, the face detection module 202, the sequence acquisition module 203, the open mouth detection module 204, the screening module 205, the feature extraction module 206, and the clustering module shown in FIG. Module 207, interception module 208, the specific functions of each module refer to the second embodiment.

The computer device 30 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. Those skilled in the art can understand that the schematic diagram 3 is only an example of the computer device 30 and does not constitute a limitation on the computer device 30. It may include more or less components than those shown in the figure, or combine certain components, or be different. For example, the computer device 30 may also include input and output devices, network access devices, buses, and so on.

The so-called processor 302 may be a central processing unit (Central Processing Unit, CPU), other general processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or the processor 302 can also be any conventional processor, etc. The processor 302 is the control center of the computer device 30, which uses various interfaces and lines to connect the entire computer device 30. Various parts.

The memory 301 can be used to store the computationally readable instructions 303. The processor 302 executes or executes the computationally readable instructions or modules stored in the memory 301 and calls data stored in the memory 301 to achieve Various functions of the computer device 30. The memory 301 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.); the storage data area may Store the data created according to the use of the computer device 30 (. In addition, the memory 301 may include non-volatile memory or/and volatile memory, and the non-volatile memory may include, for example, hard disk, memory, plug-in hard disk, smart Memory card (Smart Media Card, SMC), Secure Digital (SD) card, flash card (Flash Card), at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. Volatile memory It may include random access memory (RAM) or external cache memory.

If the integrated module of the computer device 30 is implemented in the form of a software function module and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, this application implements all or part of the processes in the above-mentioned embodiments and methods, and can also be completed by instructing relevant hardware by computing readable instructions. The computing readable instructions can be stored in a storage medium. When the computationally readable instructions are executed by the processor, they can implement the steps of the foregoing method embodiments. Wherein, the computationally readable instruction includes computationally readable instruction code, and the computationally readable instruction code may be in the form of source code, object code, executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory). It should be noted that the content contained in the computer-readable medium can be appropriately added or deleted according to the requirements of the legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to the legislation and patent practice, the computer-readable medium Does not include electrical carrier signals and telecommunication signals.

In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method can be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of the modules is only a logical function division, and there may be other division methods in actual implementation.

The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware, or in the form of hardware plus software functional modules.

The above-mentioned integrated modules implemented in the form of software functional modules may be stored in a storage medium. The above-mentioned software function module is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor execute the method described in each embodiment of the present application. Part of the steps.

For those skilled in the art, it is obvious that the present application is not limited to the details of the foregoing exemplary embodiments, and the present application can be implemented in other specific forms without departing from the spirit or basic characteristics of the application. Therefore, no matter from which point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than the above description, and therefore it is intended to fall into the claims. All changes in the meaning and scope of the equivalent elements of are included in this application. Any associated diagram marks in the claims should not be regarded as limiting the claims involved. In addition, it is obvious that the word "including" does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or devices stated in the system claims can also be implemented by one module or device through software or hardware. Words such as first and second are used to denote names, but do not denote any specific order.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the application and not to limit them. Although the application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the application can be Make modifications or equivalent replacements without departing from the spirit and scope of the technical solution of the present application.

Summary of the invention

technical problem

The solution to the problem

The beneficial effects of the invention

Claims

A method for generating voiceprint data, wherein the method includes:

Obtain audio and video data;

Performing face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and face frames of the multiple face images;

Acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frame, and each face image subsequence includes multiple face images of the same user;

Detect whether each face image in each face image subsequence has its mouth open;

Filter out the target face image subsequence according to the open mouth detection result of each face image subsequence;

Extract facial features for each target facial image sub-sequence;

Clustering the target facial image subsequence according to the facial features of each target facial image subsequence to obtain the target user to which each target facial image subsequence belongs;

The audio segment corresponding to the target image subsequence of each target user is intercepted from the audio stream of the audio and video data to obtain the voiceprint data of each target user.
The method according to claim 1, wherein said performing face detection on the original image sequence in the audio and video data frame by frame comprises:

A multi-task cascaded convolutional network model is used to perform face detection on the original image sequence in the audio and video data frame by frame.
The method according to claim 1, wherein said acquiring multiple face image subsequences from said original image sequence according to said multiple face images and said face frame comprises:

Taking an original image in the original image sequence as a starting point, selecting the current original image and the next original image one by one to obtain two adjacent original images;

Judging whether the face frames of the two adjacent original images meet a preset condition;

If the two adjacent original images are face images, and the face frames of the two adjacent original images meet a preset condition, it is determined that the two adjacent original images correspond to the same user, and the adjacent two original images correspond to the same user. The two original images belong to the same face image subsequence;

Otherwise, if at least one of the two adjacent original images is not a face image, or the face frames of the two adjacent original images do not meet the preset condition, it is determined that the two adjacent original images do not correspond to the same User, the two adjacent original images do not belong to the same face image subsequence.
8. The method according to claim 3, wherein said determining whether the face frames of the two adjacent original images meet a preset condition comprises:

Judging whether the overlap area ratio of the face frames of the two adjacent original images is greater than or equal to a preset ratio;

Alternatively, it is determined whether the distance between the face frames of the two adjacent face images is less than or equal to a preset distance.
The method according to claim 1, wherein said detecting whether each face image in each face image sub-sequence has a mouth open comprises:

The Adaboost algorithm is used to detect whether each face image in each face image subsequence is open.
The method according to claim 1, wherein the screening out the target face image sub-sequence according to the open mouth detection result of each face image sub-sequence comprises:

Determine the proportion of closed face images in each face image sub-sequence in the face image sub-sequence;

If the proportion of closed face images in the face image subsequence in the face image subsequence is less than or equal to the preset ratio, then the face image subsequence is the target face image subsequence.
The method according to claim 1, wherein said extracting facial features for each target facial image sub-sequence comprises:

The point distribution model is used to extract facial features for each target facial image sub-sequence.
A voiceprint data generating device, wherein the device includes:

Audio and video acquisition module for acquiring audio and video data;

The face detection module is configured to perform face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and face frames of the multiple face images;

A sequence acquisition module, configured to acquire multiple face image sub-sequences from the original image sequence according to the multiple face images and the face frame, and each face image sub-sequence contains multiple face images of the same user;

Mouth opening detection module, used to detect whether each face image in each face image sub-sequence has a mouth open;

The screening module is used to screen out the target face image subsequence according to the open mouth detection result of each face image subsequence;

The feature extraction module is used to extract face features from each target face image sub-sequence;

A clustering module, configured to cluster the target face image subsequence according to the facial features of each target face image subsequence to obtain the target user to which each target face image subsequence belongs;

The interception module is used to intercept the audio segment corresponding to the target image subsequence of each target user from the audio stream of the audio and video data to obtain the voiceprint data of each target user.
A computer device, wherein the computer device includes a memory and a processor, and the processor is configured to execute the computer-readable instructions stored in the memory to implement the following steps:

Obtain audio and video data;

Performing face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and face frames of the multiple face images;

Acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frame, and each face image subsequence includes multiple face images of the same user;

Detect whether each face image in each face image subsequence has its mouth open;

Filter out the target face image subsequence according to the open mouth detection result of each face image subsequence;

Extract facial features for each target facial image sub-sequence;

Clustering the target facial image subsequence according to the facial features of each target facial image subsequence to obtain the target user to which each target facial image subsequence belongs;

The audio segment corresponding to the target image subsequence of each target user is intercepted from the audio stream of the audio and video data to obtain the voiceprint data of each target user.
9. The computer device according to claim 9, wherein said performing face detection on the original image sequence in the audio and video data frame by frame comprises:

A multi-task cascaded convolutional network model is used to perform face detection on the original image sequence in the audio and video data frame by frame.
9. The computer device according to claim 9, wherein said acquiring a plurality of face image subsequences from said original image sequence according to said plurality of face images and said face frame comprises:

Taking an original image in the original image sequence as a starting point, selecting the current original image and the next original image one by one to obtain two adjacent original images;

Judging whether the face frames of the two adjacent original images meet a preset condition;

If the two adjacent original images are face images, and the face frames of the two adjacent original images meet a preset condition, it is determined that the two adjacent original images correspond to the same user, and the adjacent two original images correspond to the same user. The two original images belong to the same face image subsequence;

Otherwise, if at least one of the two adjacent original images is not a face image, or the face frames of the two adjacent original images do not meet the preset condition, it is determined that the two adjacent original images do not correspond to the same User, the two adjacent original images do not belong to the same face image subsequence.
11. The computer device of claim 11, wherein said determining whether the face frames of the two adjacent original images meet a preset condition comprises:

Judging whether the overlap area ratio of the face frames of the two adjacent original images is greater than or equal to a preset ratio;

Alternatively, it is determined whether the distance between the face frames of the two adjacent face images is less than or equal to a preset distance.
9. The computer device of claim 9, wherein said detecting whether each face image in each face image sub-sequence has a mouth open comprises:

The Adaboost algorithm is used to detect whether each face image in each face image subsequence is open.
9. The computer device according to claim 9, wherein said screening out the target face image sub-sequence according to the open mouth detection result of each face image sub-sequence comprises:

Determine the proportion of closed face images in each face image sub-sequence in the face image sub-sequence;

If the proportion of closed face images in the face image subsequence in the face image subsequence is less than or equal to the preset ratio, then the face image subsequence is the target face image subsequence.
9. The computer device according to claim 9, wherein said extracting facial features for each target facial image sub-sequence comprises:

The point distribution model is used to extract facial features for each target facial image sub-sequence.
One or more readable storage media storing computer readable instructions, where when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:

Obtain audio and video data;

Performing face detection on the original image sequence in the audio and video data frame by frame to obtain multiple face images and face frames of the multiple face images;

Acquiring multiple face image subsequences from the original image sequence according to the multiple face images and the face frame, and each face image subsequence includes multiple face images of the same user;

Detect whether each face image in each face image subsequence has its mouth open;

Filter out the target face image subsequence according to the open mouth detection result of each face image subsequence;

Extract facial features for each target facial image sub-sequence;

Clustering the target facial image subsequence according to the facial features of each target facial image subsequence to obtain the target user to which each target facial image subsequence belongs;

The audio segment corresponding to the target image sub-sequence of each target user is intercepted from the audio stream of the audio and video data to obtain the voiceprint data of each target user.
The readable storage medium according to claim 16, wherein said performing face detection on the original image sequence in the audio and video data frame by frame comprises:

A multi-task cascaded convolutional network model is used to perform face detection on the original image sequence in the audio and video data frame by frame.
The readable storage medium according to claim 16, wherein said acquiring a plurality of face image subsequences from said original image sequence according to said plurality of face images and said face frame comprises:

Taking an original image in the original image sequence as a starting point, selecting the current original image and the next original image one by one to obtain two adjacent original images;

Judging whether the face frames of the two adjacent original images meet a preset condition;

If the two adjacent original images are face images, and the face frames of the two adjacent original images meet a preset condition, it is determined that the two adjacent original images correspond to the same user, and the adjacent two original images correspond to the same user. The two original images belong to the same face image subsequence;

Otherwise, if at least one of the two adjacent original images is not a face image, or the face frames of the two adjacent original images do not meet the preset condition, it is determined that the two adjacent original images do not correspond to the same User, the two adjacent original images do not belong to the same face image subsequence.
18. The readable storage medium according to claim 18, wherein said determining whether the face frames of the two adjacent original images meet a preset condition comprises:

Judging whether the overlap area ratio of the face frames of the two adjacent original images is greater than or equal to a preset ratio;

Alternatively, it is determined whether the distance between the face frames of the two adjacent face images is less than or equal to a preset distance.
The readable storage medium according to claim 16, wherein said detecting whether each face image in each face image sub-sequence has a mouth open comprises:

The Adaboost algorithm is used to detect whether each face image in each face image subsequence is open.