CN116417001A - Voiceprint recognition method, voiceprint recognition device, terminal and storage medium
- Publication number
- CN116417001A (application CN202310485536.6A)
- Authority
- CN
- China
- Prior art keywords
- voice
- voiceprint
- training
- voiceprint recognition
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS > G10—MUSICAL INSTRUMENTS; ACOUSTICS > G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
  - G10L17/04—Training, enrolment or model building
  - G10L17/06—Decision making techniques; Pattern matching strategies
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
  - G10L25/03—characterised by the type of extracted parameters
    - G10L25/06—the extracted parameters being correlation coefficients
Abstract
The invention discloses a voiceprint recognition method, a voiceprint recognition device, a terminal and a storage medium. The method comprises: acquiring a test voice and a training voice, each of which comprises a plurality of voice features; inputting the test voice and the training voice respectively into a trained voiceprint model to obtain a first posterior probability matrix, in which the voice features of the test voice are mapped to preset voice features, and a second posterior probability matrix, in which the voice features of the training voice are mapped to preset voice features; and comparing the similarity of the first and second posterior probability matrices with a CDS similarity algorithm to obtain the voiceprint recognition result of the test voice. By computing the two posterior probability matrices and comparing them with the CDS similarity algorithm, the method improves both the operation speed and the voiceprint recognition accuracy.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a voiceprint recognition method, a voiceprint recognition device, a terminal and a storage medium.
Background
Voiceprint information carries the identity of a speaker: it reflects the speaker's physiological and behavioral characteristics through the voice waveform. Because a voiceprint does not carry the speaker's liveness information, however, a voiceprint recognition model is exposed to impersonation attacks and cannot always defend against them effectively. Voiceprint recognition is the process of comprehensively analyzing and comparing the acoustic voice characteristics of an unknown or uncertain speaker with those of a known speaker, and concluding whether the two are the same person. To improve the accuracy and efficiency of voiceprint recognition, it is necessary to design a voiceprint recognition analysis method.
Disclosure of Invention
Therefore, the present invention provides a voiceprint recognition method, a voiceprint recognition device, a terminal and a storage medium, to overcome the defects of low accuracy and low speed of recognizing the voice to be recognized in the prior art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
in a first aspect, an embodiment of the present invention provides a voiceprint recognition method, including:
acquiring test voice and training voice, wherein the test voice and the training voice comprise a plurality of voice characteristics;
respectively inputting test voice and training voice into a trained voiceprint model to obtain a first posterior probability matrix of a plurality of voice features in the test voice corresponding to preset voice features respectively and a second posterior probability matrix of a plurality of voice features in the training voice corresponding to preset voice features respectively;
and performing similarity comparison on the first posterior probability matrix and the second posterior probability matrix by utilizing a CDS similarity algorithm to obtain a voiceprint recognition result of the test voice.
Optionally, the voiceprint model includes: a voiceprint background sub-model, a voiceprint classification sub-model and a voiceprint recognition sub-model, wherein,
the voiceprint background sub-model is used for filtering background noise of input voice;
the voiceprint classification sub-model is used for classifying input voice, wherein each voice sample corresponds to a class label;
and the voiceprint recognition sub-model is used for carrying out voiceprint target recognition on the input voice.
Optionally, the training process of any one of the submodels in the voiceprint model includes:
acquiring a preset voice set, wherein the preset voice set comprises a plurality of voice samples;
decomposing a preset voice set by adopting wavelet transformation, and extracting wavelet entropy corresponding to the characteristics of a plurality of voice samples;
inputting the wavelet entropy into a preset neural network training voiceprint sub-model for training, and obtaining a trained voiceprint sub-model when preset conditions are met.
Optionally, the voiceprint classification sub-model verifies through an EM estimation algorithm whether the sub-model is trained.
Optionally, the voiceprint recognition sub-model verifies whether the sub-model is trained by a MAP algorithm.
Optionally, the preset neural network is a convolutional neural network, and the structure of the preset neural network includes an input layer, a first hidden layer, a second hidden layer, a third hidden layer, a fourth hidden layer and an output layer; the posterior probability matrix is the output of the preset neural network.
Optionally, the voice features include: voice frequency, voice decibels, voice semantics, and number of voice characters.
In a second aspect, an embodiment of the present invention provides a voiceprint recognition apparatus, including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring test voice and training voice, and the test voice and the training voice comprise a plurality of voice characteristics;
the training module is used for respectively inputting the test voice and the training voice into the trained voiceprint model to obtain a first posterior probability matrix of a plurality of voice features in the test voice, which correspond to preset voice features respectively, and a second posterior probability matrix of a plurality of voice features in the training voice, which correspond to preset voice features respectively;
and the recognition module is used for comparing the similarity of the first posterior probability matrix and the second posterior probability matrix by utilizing a CDS similarity algorithm to obtain a voiceprint recognition result of the test voice.
In a third aspect, an embodiment of the present invention provides a terminal, including: at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to cause the at least one processor to execute the voiceprint recognition method according to the first aspect of the embodiment of the invention.

In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium storing computer instructions, where the computer instructions are configured to cause the computer to perform the voiceprint recognition method according to the first aspect of the embodiment of the present invention.
The technical scheme of the invention has the following advantages:
according to the voiceprint recognition method, the voiceprint recognition device, the terminal and the storage medium, the posterior probability matrixes of the test voice and the training voice are respectively calculated through the voiceprint model, the similarity comparison is carried out on the two matrixes by utilizing the CDS similarity algorithm, so that the voiceprint recognition result of the test voice is obtained, and the operation speed and the voiceprint recognition accuracy are improved through the method.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a specific example of a voiceprint recognition method according to an embodiment of the present invention;
FIG. 2 is a block diagram of one specific example of a convolutional neural network provided by an embodiment of the present invention;
FIG. 3 is a block diagram of a specific example of a voiceprint recognition device according to an embodiment of the present invention;
fig. 4 is a composition diagram of a specific example of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings; obviously, the described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; the two components can be directly connected or indirectly connected through an intermediate medium, or can be communicated inside the two components, or can be connected wirelessly or in a wired way. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Example 1
The voiceprint recognition method provided by the embodiment of the invention is used for terminal equipment such as sound equipment, mobile phones and the like which need to work by recognizing voiceprint information.
As shown in fig. 1, the method comprises the following steps:
step S1: test speech and training speech are obtained, each of which includes a plurality of speech features.
In the embodiment of the invention, the test voice and the training voice are acquired through a voice recording device; the device is not limited herein and is selected according to the actual situation. The voice features include voice frequency, voice decibels, voice semantics and the number of voice characters; these are merely examples, and in practical applications the voice features are divided according to the actual situation. For example, according to a speaker's voice, the voice features can be divided into speed, pitch, length and mood state, where the mood state conveys information such as doubt, affirmation, negation or surprise.
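As a minimal illustration of two of these features, the dominant frequency and the level in decibels can be estimated from a raw waveform. The FFT-peak and RMS-in-dB estimators below are common choices for a sketch, not the estimators prescribed by the embodiment:

```python
import numpy as np

def basic_voice_features(samples, sample_rate=16000):
    """Estimate two of the named voice features from a raw waveform:
    the dominant frequency (FFT peak) and the level in decibels (dB re. full scale)."""
    samples = np.asarray(samples, dtype=float)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    dominant_hz = freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC bin
    rms = np.sqrt(np.mean(samples ** 2))
    level_db = 20 * np.log10(max(rms, 1e-12))          # dBFS of the RMS amplitude
    return {"frequency_hz": float(dominant_hz), "level_db": float(level_db)}
```

For a one-second 440 Hz sine at 16 kHz, this reports a dominant frequency of 440 Hz and a level of about -3 dB (the RMS of a unit sine is 1/√2).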
Step S2: the method comprises the steps of respectively inputting test voice and training voice into a trained voiceprint model, and obtaining a first posterior probability matrix of a plurality of voice features in the test voice, which correspond to preset voice features respectively, and a second posterior probability matrix of a plurality of voice features in the training voice, which correspond to preset voice features respectively.
In an embodiment of the present invention, the voiceprint model includes a voiceprint background sub-model, a voiceprint classification sub-model and a voiceprint recognition sub-model. The voiceprint background sub-model filters the background noise of the input voice: the input voice is denoised to obtain the denoised voice to be recognized, and voiceprint recognition is carried out on that denoised voice, which improves the voiceprint recognition result. The voiceprint classification sub-model classifies the input voice, where each voice sample corresponds to a class label; the class labels may be, for example, man, woman, child or animal, and are not limited here but selected according to the actual situation. The voiceprint recognition sub-model carries out voiceprint target recognition on the input voice, that is, it determines the target speaker of the input voice.
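The three-stage split can be sketched as a pipeline. The concrete denoiser, classifier and matcher below are illustrative stand-ins (a moving-average filter, a threshold rule, and nearest-template matching), not the trained sub-models of the embodiment:

```python
import numpy as np

def background_submodel(speech):
    """Stand-in for the background sub-model: a simple moving-average denoiser."""
    kernel = np.ones(5) / 5.0
    return np.convolve(speech, kernel, mode="same")

def classification_submodel(speech):
    """Stand-in for the classification sub-model: a coarse label from signal roughness."""
    return "high" if np.mean(np.abs(np.diff(speech))) > 0.1 else "low"

def recognition_submodel(speech, enrolled):
    """Stand-in for the recognition sub-model: nearest enrolled template by distance."""
    return min(enrolled, key=lambda name: np.linalg.norm(speech - enrolled[name]))

def recognise(speech, enrolled):
    """Pipeline: denoise -> classify (could route to a per-class model) -> recognise."""
    clean = background_submodel(np.asarray(speech, dtype=float))
    label = classification_submodel(clean)
    return label, recognition_submodel(clean, enrolled)
```

With templates for two speakers, an utterance close to one template is attributed to that speaker after denoising.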
In the embodiment of the invention, the training processes of the three sub-models are the same, and the training process of any one sub-model comprises the following steps:
the method comprises the steps of obtaining a preset voice set, wherein the preset voice set comprises a plurality of voice samples, and respectively obtaining different voice sets when three submodels are trained, wherein the different voice sets are respectively a background voice set, a development voice set and a target voice set. Recording voices of speakers under different backgrounds by using a background voice set, and dividing according to the backgrounds; developing a voice set record, namely dividing a speaker according to the label type; the target voice set records voice information of different people. The voice samples of the preset voice set are not limited in this regard, and the corresponding number is selected according to the actual situation.
The preset voice set is decomposed by wavelet transformation, and the wavelet entropy corresponding to the features of the plurality of voice samples is extracted. The wavelet transformation converts the voice features from the time domain to the frequency domain for analysis. Because a voice signal is a non-stationary, time-varying signal, the adaptivity and global processing capability of the wavelet transform are used to extract the voice feature information while suppressing global noise interference, and the locality of the wavelet is used to suppress the interference of local noise.
The wavelet entropy is input into the preset neural network for training, and the trained voiceprint sub-model is obtained when a preset condition is met. The preset condition is not limited here and is selected according to the actual situation, for example: training stops, and the trained voiceprint sub-model is obtained, when the number of training cycles exceeds a preset value and/or a preset precision is reached. The voiceprint classification sub-model classifies the input voice, and whether this sub-model is trained is verified through an EM estimation algorithm; the voiceprint recognition sub-model determines the target of the input voice, and whether this sub-model is trained is verified through a MAP algorithm. Different verification algorithms are adopted based on the different functions of the two sub-models.
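The wavelet-entropy step can be sketched as follows. The Haar wavelet, the four decomposition levels, and the Shannon form of the entropy are assumptions made for illustration, since the embodiment does not fix them:

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform: (approximation, detail)."""
    s = np.asarray(signal, dtype=float)
    approx = (s[0::2] + s[1::2]) / np.sqrt(2)
    detail = (s[0::2] - s[1::2]) / np.sqrt(2)
    return approx, detail

def wavelet_entropy(signal, levels=4):
    """Shannon entropy of the relative wavelet energy across the sub-bands.

    Low entropy: energy concentrated in one band (e.g. a steady tone);
    high entropy: energy spread across bands (e.g. broadband noise)."""
    energies = []
    approx = np.asarray(signal, dtype=float)
    for _ in range(levels):                 # length must be divisible by 2**levels
        approx, detail = haar_dwt(approx)
        energies.append(np.sum(detail ** 2))
    energies.append(np.sum(approx ** 2))
    p = np.array(energies) / np.sum(energies)
    p = p[p > 0]                            # avoid log(0)
    return float(-np.sum(p * np.log(p)))
```

A constant signal puts all of its energy in the final approximation band and so has zero wavelet entropy, while white noise spreads energy across the detail bands and scores much higher.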
In the embodiment of the present invention, the preset neural network is a convolutional neural network, and the structure thereof includes: the input layer, the first hidden layer, the second hidden layer, the third hidden layer, the fourth hidden layer and the output layer are of a six-layer network structure, and the posterior probability matrix is output of a preset neural network.
In a specific embodiment, a factor analysis technology model (I-Vector) based on a convolutional neural network is provided, where the I-Vector is a speaker model based on a factor analysis technology, and can more accurately encode speaker identity information contained in voice features.
A universal background model (UBM) built on a convolutional neural network (CNN) architecture, denoted CNN-UBM, is proposed. The embodiment of the invention uses the CNN-UBM to estimate the posterior probabilities for the I-vector model, and further provides a CNN/I-vector speaker model based on the CNN-UBM. Compared with a traditional neural network, the neural network provided by the embodiment of the invention has a small amount of computation, a high operation speed and a small memory footprint.
The network structure of the CNN-UBM is shown in fig. 2 and is composed of an input layer, four hidden layers and an output layer. The number of hidden layers may range from 1 to 7; four hidden layers are used in this embodiment, forming, together with the input and output layers, a six-layer network structure. The operations between the layers are indicated in the lower boxes, where f is the size of the convolution kernel, p is the number of zero-padding layers, and s is the stride. The input layer consists of one speech feature and its 15 contextual speech features.
The CNN-UBM provided by the invention requires the voice feature vector to be 16-dimensional, so the input layer is a 1×256 vector, structured as:
V_n = [x_{n-7}, ..., x_n, ..., x_{n+8}]^T
wherein x_n is the current speech feature, and x_{n-7} to x_{n-1} and x_{n+1} to x_{n+8} are the context features of x_n. All features are represented by horizontal vectors, with "T" denoting the transpose operation. Each hidden layer of the convolutional neural network contains eight maps, each of size 1×128, and all hidden layers use ReLU as the activation function. The output layer is a fully connected layer whose number of nodes equals the number of speakers in the background voice set; the kth node represents the posterior probability p_nk = P(V_n | S_k) that speaker S_k says V_n, where V_n is the input. In the CNN-UBM model, the loss function is defined as:
wherein Z_k is the ideal output of the kth node: if the input V_n is from speaker S_k, then Z_k = 1, otherwise Z_k = 0; m is the number of nodes. When the loss function is smaller than a preset value, training is complete and the trained sub-model is obtained.
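The forward pass and training objective described above can be sketched as follows. The kernel size, stride, zero-padding and the single map per hidden layer are illustrative guesses (the exact values are in fig. 2, which is not reproduced in the text), the weights are random rather than trained, and the loss shown is a standard cross-entropy consistent with the Z_k ∈ {0, 1} description, assumed because the displayed loss formula did not survive into the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, p, s):
    """1-D convolution with p zeros of padding per side and stride s."""
    x = np.pad(x, p)
    f = len(w)
    n_out = (len(x) - f) // s + 1
    return np.array([x[i * s:i * s + f] @ w for i in range(n_out)])

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cnn_ubm_forward(v_n, n_speakers=8):
    """Six-layer sketch with random, untrained weights:
    1x256 input -> four hidden layers of length 128 -> softmax posteriors p_nk."""
    # (256 + 2*1 - 3) // 2 + 1 = 128: the first hidden layer halves the length
    h = relu(conv1d(v_n, rng.standard_normal(3) * 0.1, p=1, s=2))
    # (128 + 2*1 - 3) // 1 + 1 = 128: the remaining hidden layers preserve it
    for _ in range(3):
        h = relu(conv1d(h, rng.standard_normal(3) * 0.1, p=1, s=1))
    weights = rng.standard_normal((n_speakers, h.size)) * 0.1  # fully connected output
    return softmax(weights @ h)

def cross_entropy(p, z, eps=1e-12):
    """Assumed loss: -sum_k Z_k * log(p_nk); zero when the true speaker gets p = 1."""
    return float(-np.sum(np.asarray(z, float) * np.log(np.clip(p, eps, 1.0))))

posteriors = cnn_ubm_forward(rng.standard_normal(256))
```

The output is one posterior per background speaker; the posteriors sum to 1 and form one column of the posterior probability matrix of step S2.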
Step S3: and performing similarity comparison on the first posterior probability matrix and the second posterior probability matrix by utilizing a CDS similarity algorithm to obtain a voiceprint recognition result of the test voice.
In a specific embodiment, the CDS (cosine distance scoring) similarity algorithm is a common tool for I-vector classification and has the advantage of fast scoring; it quickly processes the voice features produced by the three sub-models. The cosine of the angle between two I-vectors is used to estimate their similarity, defined as:

score(X, Y) = (X^T · Y) / (||X|| · ||Y||)
wherein X and Y are the known and the unknown I-vector respectively, corresponding to the first and the second posterior probability matrix in step S3, and T denotes the transpose.
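The cosine scoring above translates directly into code; treating the posterior matrices as flattened vectors is an illustrative simplification:

```python
import numpy as np

def cds_score(x, y):
    """Cosine distance scoring: the cosine of the angle between two i-vectors
    (here, flattened posterior matrices). 1.0 means identical direction,
    0.0 orthogonal, -1.0 opposite."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    return float((x @ y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Because only the direction of the vectors matters, scaling either input leaves the score unchanged, which is what makes CDS a fast drop-in comparison for step S3.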
Because the I-vector generation algorithm models the channel information and the speaker information together, the I-vector cannot effectively separate the two. This causes CDS to give incorrect results in a multi-channel environment; to solve this problem, a channel compensation technique is introduced into the CDS similarity calculation process to ensure that CDS gives the correct result.
According to the voiceprint recognition method provided by the embodiment of the invention, the posterior probability matrixes of the test voice and the training voice are respectively calculated through the voiceprint model provided by the invention, and similarity comparison is carried out on the two matrixes by utilizing a CDS similarity algorithm, so that the voiceprint recognition result of the test voice is obtained. The method provided by the invention improves the operation speed and the voiceprint recognition accuracy.
Example 2
An embodiment of the present invention provides a voiceprint recognition apparatus, as shown in fig. 3, including:
the acquisition module 1 is used for acquiring test voice and training voice, wherein the test voice and the training voice comprise a plurality of voice characteristics; this module performs the method described in step S1 in embodiment 1, and will not be described here again.
The training module 2 is used for respectively inputting the test voice and the training voice into the trained voiceprint model to obtain a first posterior probability matrix of a plurality of voice features in the test voice, which respectively correspond to preset voice features, and a second posterior probability matrix of a plurality of voice features in the training voice, which respectively correspond to preset voice features; this module performs the method described in step S2 in embodiment 1, and will not be described here.
The recognition module 3 is used for comparing the similarity of the first posterior probability matrix and the second posterior probability matrix by utilizing a CDS similarity algorithm to obtain a voiceprint recognition result of the test voice; this module performs the method described in step S3 in embodiment 1, and will not be described here.
The embodiment of the invention provides a voiceprint recognition device, which is characterized in that a training module is used for respectively calculating posterior probability matrixes of test voices and training voices, and similarity comparison is carried out on the two matrixes by utilizing a CDS similarity algorithm of the recognition module to obtain a voiceprint recognition result of the test voices. The device provided by the invention improves the operation speed and the voiceprint recognition accuracy.
Example 3
An embodiment of the present invention provides a terminal, as shown in fig. 4, including: at least one processor 401, such as a CPU (Central Processing Unit), at least one communication interface 403, a memory 404, and at least one communication bus 402. The communication bus 402 is used to realize connected communication between these components. The communication interface 403 may include a Display and a Keyboard, and the optional communication interface 403 may further include a standard wired interface and a wireless interface. The memory 404 may be a high-speed RAM (Random Access Memory) or a non-volatile memory, such as at least one magnetic disk memory; the memory 404 may also optionally be at least one storage device located remotely from the processor 401. A set of program codes is stored in the memory 404, and the processor 401 calls the program codes stored in the memory 404 to execute the voiceprint recognition method in embodiment 1. The communication bus 402 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, among others, and may be divided into an address bus, a data bus, a control bus, and the like; for ease of illustration, only one line is shown in fig. 4, but this does not mean there is only one bus or one type of bus. The memory 404 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 404 may also include a combination of the above types of memory.
The processor 401 may be a central processor (English: central processing unit, abbreviated: CPU), a network processor (English: network processor, abbreviated: NP) or a combination of CPU and NP.
The processor 401 may further include a hardware chip, which may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
Optionally, the memory 404 is also used for storing program instructions. The processor 401 may invoke program instructions to implement the voiceprint recognition method as in embodiment 1 of the present application.
The embodiment of the invention also provides a computer readable storage medium storing computer executable instructions, and the computer executable instructions can execute the voiceprint recognition method in embodiment 1. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the storage medium may also comprise a combination of the above types of memory.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. And obvious variations or modifications thereof are contemplated as falling within the scope of the present invention.
Claims (10)
1. A method of voiceprint recognition comprising:
acquiring a test voice and a training voice, wherein the test voice and the training voice each comprise a plurality of voice features;
inputting the test voice and the training voice respectively into a trained voiceprint model to obtain a first posterior probability matrix, in which the plurality of voice features in the test voice correspond respectively to preset voice features, and a second posterior probability matrix, in which the plurality of voice features in the training voice correspond respectively to preset voice features;
and performing similarity comparison on the first posterior probability matrix and the second posterior probability matrix by utilizing a CDS similarity algorithm to obtain a voiceprint recognition result of the test voice.
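The comparison step of claim 1 can be sketched as follows. This is a minimal illustration, assuming CDS denotes cosine distance scoring (the claim does not expand the acronym) and treating each posterior probability matrix as a flattened NumPy array; the `threshold` value is a hypothetical parameter, not taken from the patent:

```python
import numpy as np

def cds_score(first_posterior: np.ndarray, second_posterior: np.ndarray) -> float:
    """Cosine-distance score between two posterior probability matrices.

    Each matrix is flattened to a vector; the score is the cosine of the
    angle between the two vectors (1.0 means identical direction).
    """
    a = first_posterior.ravel().astype(float)
    b = second_posterior.ravel().astype(float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognize(test_posterior, train_posteriors, threshold=0.8):
    """Return the index of the best-matching training matrix, or None
    if even the best score falls below the (assumed) decision threshold."""
    scores = [cds_score(test_posterior, p) for p in train_posteriors]
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None
```

A matrix compared with itself scores 1.0; dissimilar posterior patterns score lower, and the threshold turns the score into an accept/reject recognition result.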
2. The voiceprint recognition method of claim 1, wherein the voiceprint model comprises: a voiceprint background sub-model, a voiceprint classification sub-model, and a voiceprint recognition sub-model, wherein,
the voiceprint background sub-model is used for filtering background noise of input voice;
the voiceprint classification sub-model is used for classifying input voice, wherein each voice sample corresponds to a class label;
and the voiceprint recognition sub-model is used for carrying out voiceprint target recognition on the input voice.
3. The voiceprint recognition method according to claim 2, wherein the training process of any one of the submodels in the voiceprint model includes:
acquiring a preset voice set, wherein the preset voice set comprises a plurality of voice samples;
decomposing the preset voice set by wavelet transformation, and extracting wavelet entropies corresponding to the features of the plurality of voice samples;
inputting the wavelet entropies into a preset neural network for training of the voiceprint sub-model, and obtaining a trained voiceprint sub-model when a preset condition is met.
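The decomposition and entropy-extraction steps of claim 3 can be sketched as follows. The patent does not specify the wavelet family, the number of decomposition levels, or the entropy definition, so the Haar wavelet, three levels, and Shannon entropy over band energies are all assumptions made for illustration:

```python
import numpy as np

def haar_decompose(signal: np.ndarray):
    """One level of a Haar wavelet transform: approximation + detail coefficients."""
    x = signal[: len(signal) // 2 * 2]           # truncate to even length
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def wavelet_entropy(signal: np.ndarray, levels: int = 3) -> float:
    """Shannon entropy of the relative energy in each decomposition band."""
    energies = []
    approx = np.asarray(signal, dtype=float)
    for _ in range(levels):
        approx, detail = haar_decompose(approx)
        energies.append(np.sum(detail ** 2))     # energy of each detail band
    energies.append(np.sum(approx ** 2))         # energy of the final approximation
    p = np.array(energies) / np.sum(energies)
    p = p[p > 0]                                 # ignore empty bands
    return float(-np.sum(p * np.log(p)))
```

A constant signal concentrates all energy in the approximation band, giving entropy 0; a noisy signal spreads energy across bands and yields a higher entropy, which is the scalar feature fed to the network.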
4. The voiceprint recognition method according to claim 3, wherein the voiceprint classification sub-model verifies, by an EM (expectation-maximization) estimation algorithm, whether the sub-model has completed training.
5. The voiceprint recognition method according to claim 3, wherein the voiceprint recognition sub-model verifies, by a MAP (maximum a posteriori) algorithm, whether the sub-model has completed training.
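Claims 4 and 5 do not state the criterion by which the EM estimation decides that training has completed; a common choice is log-likelihood convergence. The sketch below assumes a two-component 1-D Gaussian mixture as a stand-in for the classification sub-model, with deterministic min/max initialization chosen here for reproducibility:

```python
import numpy as np

def em_gmm_1d(data, n_iter=200, tol=1e-6):
    """EM for a two-component 1-D Gaussian mixture; training is deemed
    complete once the log-likelihood improvement falls below `tol`."""
    data = np.asarray(data, dtype=float)
    mu = np.array([data.min(), data.max()])      # deterministic initialization
    var = np.array([np.var(data)] * 2) + 1e-6
    w = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: component densities and responsibilities for each sample
        dens = w * np.exp(-0.5 * (data[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        ll = np.sum(np.log(dens.sum(axis=1)))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0)
        w = nk / len(data)
        mu = (resp * data[:, None]).sum(axis=0) / nk
        var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        if ll - prev_ll < tol:                   # convergence = "training complete"
            break
        prev_ll = ll
    return mu, var, w
```

On data drawn from two well-separated clusters, the estimated means converge to the cluster centers and the loop exits once the likelihood stops improving.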
6. The voiceprint recognition method according to claim 3, wherein the predetermined neural network is a convolutional neural network, and the posterior probability matrix is the output of the predetermined neural network.
7. The voiceprint recognition method of claim 1, wherein the voice feature comprises: voice frequency, voice decibels, voice semantics, and number of voice characters.
8. A voiceprint recognition apparatus, comprising:
the acquisition module is used for acquiring a test voice and a training voice, wherein the test voice and the training voice each comprise a plurality of voice features;
the training module is used for inputting the test voice and the training voice respectively into the trained voiceprint model to obtain a first posterior probability matrix, in which the plurality of voice features in the test voice correspond respectively to preset voice features, and a second posterior probability matrix, in which the plurality of voice features in the training voice correspond respectively to preset voice features;
and the recognition module is used for comparing the similarity of the first posterior probability matrix and the second posterior probability matrix by utilizing a CDS similarity algorithm to obtain a voiceprint recognition result of the test voice.
9. A terminal, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the voiceprint recognition method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing the computer to perform the voiceprint recognition method of any one of claims 1-7.
Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310485536.6A (CN116417001A) | 2023-04-28 | 2023-04-28 | Voiceprint recognition method, voiceprint recognition device, terminal and storage medium
CN202310769243.0A (CN116665680A) | 2023-04-28 | 2023-06-27 | Voiceprint recognition method, voiceprint recognition device, terminal and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310485536.6A (CN116417001A) | 2023-04-28 | 2023-04-28 | Voiceprint recognition method, voiceprint recognition device, terminal and storage medium
Publications (1)

Publication Number | Publication Date
---|---
CN116417001A | 2023-07-11
Family
ID=87056186
Family Applications (2)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310485536.6A (CN116417001A, pending) | Voiceprint recognition method, voiceprint recognition device, terminal and storage medium | 2023-04-28 | 2023-04-28
CN202310769243.0A (CN116665680A, pending) | Voiceprint recognition method, voiceprint recognition device, terminal and storage medium | 2023-04-28 | 2023-06-27

Family Applications After (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310769243.0A (CN116665680A, pending) | Voiceprint recognition method, voiceprint recognition device, terminal and storage medium | 2023-04-28 | 2023-06-27
Country Status (1)

Country | Link
---|---
CN (2) | CN116417001A (en)

- 2023-04-28: CN application CN202310485536.6A (patent CN116417001A), status Pending
- 2023-06-27: CN application CN202310769243.0A (patent CN116665680A), status Pending
Also Published As
Publication number | Publication date |
---|---|
CN116665680A (en) | 2023-08-29 |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20230711