CN110517698B - Method, device and equipment for determining voiceprint model and storage medium

Publication number
CN110517698B
Authority
CN
China
Prior art keywords
feature
voiceprint
feature map
speech
model
Legal status
Active
Application number
CN201910837580.2A
Other languages
Chinese (zh)
Other versions
CN110517698A (en)
Inventor
殷兵
李晋
方昕
方四安
徐承
柳林
Current Assignee
iFlytek Co Ltd
MIGU Digital Media Co Ltd
Original Assignee
iFlytek Co Ltd
MIGU Digital Media Co Ltd
Application filed by iFlytek Co Ltd and MIGU Digital Media Co Ltd
Priority to CN201910837580.2A
Publication of CN110517698A
Application granted
Publication of CN110517698B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04 Training, enrolment or model building

Abstract

The application provides a method, a device, equipment and a storage medium for determining a voiceprint model. The method comprises the following steps: obtaining at least one speech spectrum segment of target speech; determining at least one first feature map of each speech spectrum segment through a pre-established voiceprint extraction model, wherein feature points in the first feature maps are mutually independent; determining, through the voiceprint extraction model, a second feature map which corresponds to each first feature map and contains its global information, to obtain at least one second feature map of each speech spectrum segment, wherein the second feature map corresponding to a first feature map is a feature map obtained by strengthening the feature regions of the first feature map that can distinguish voiceprints; and determining a voiceprint model of the target speech by using at least one second feature map of each speech spectrum segment and the voiceprint extraction model. This method can determine a stable and accurate voiceprint model for the target speech.

Description

Method, device and equipment for determining voiceprint model and storage medium
Technical Field
The present application relates to the field of voiceprint recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining a voiceprint model.
Background
Voiceprint recognition is one of the key technologies in the field of biometric authentication. It performs identity authentication directly on speech signals, requires no memorization and is simple to operate, can authenticate without the user's awareness, enjoys high user acceptance, and is widely applied in fields such as national security, finance and smart home.
It should be noted that the key to voiceprint recognition is the determination of the voiceprint model. At present, a voiceprint model is mainly determined based on total variability factor analysis: a total variability space covering various environments and channels is trained on a large corpus, and a segment of speech is mapped through this space into a voiceprint model vector (i-vector) of fixed, uniform dimension.
Some application fields place high demands on the accuracy of voiceprint recognition, which requires a stable and accurate voiceprint model. However, the voiceprint model determined by current schemes is not stable and accurate enough, resulting in a poor recognition effect that cannot meet the accuracy requirements of these fields.
Disclosure of Invention
In view of this, the present application provides a method, an apparatus, a device and a storage medium for determining a voiceprint model, so as to solve the problem that a voiceprint model determined by a voiceprint model determining scheme in the prior art is not stable and accurate enough, and the technical scheme is as follows:
a method of determining a voiceprint model, comprising:
acquiring at least one speech spectrum segment of target speech;
determining at least one first feature map of each speech spectrum fragment through a pre-established voiceprint extraction model, wherein feature points in the first feature maps are independent of one another;
determining a second feature map which comprises global information and corresponds to each first feature map through the voiceprint extraction model, and obtaining at least one second feature map of each speech spectrum fragment, wherein the second feature map corresponding to one first feature map is a feature map obtained by strengthening a feature region which can distinguish voiceprints in the first feature map;
and determining a voiceprint model of the target voice by using at least one second feature map of each speech spectrum fragment and the voiceprint extraction model.
Optionally, the determining a voiceprint model of the target speech by using at least one second feature map of each speech spectrum segment and the voiceprint extraction model includes:
determining a voiceprint model of the target speech by using the voiceprint extraction model, the at least one first feature map of each speech spectrum segment and the at least one second feature map of each speech spectrum segment.
Optionally, the determining a voiceprint model of the target speech by using the voiceprint extraction model, the at least one first feature map of each speech spectrum segment, and the at least one second feature map of each speech spectrum segment includes:
for any speech spectrum segment of the target speech, fusing at least one first feature map of the speech spectrum segment with at least one second feature map of the speech spectrum segment through the voiceprint extraction model to obtain a voiceprint sub-model of the speech spectrum segment so as to obtain a voiceprint sub-model of each speech spectrum segment of the target speech;
and averaging the voiceprint sub-models of all the spectrum fragments of the target voice to obtain a voiceprint model of the target voice.
Optionally, the obtaining the voiceprint sub-model of the speech spectrum fragment by fusing the at least one first feature map of the speech spectrum fragment with the at least one second feature map of the speech spectrum fragment through the voiceprint extraction model includes:
splicing the first feature maps of the speech spectrum segment into a high-dimensional column vector through the voiceprint extraction model, which serves as the first high-dimensional column vector of the speech spectrum segment;
splicing the second feature maps of the speech spectrum segment into a high-dimensional column vector through the voiceprint extraction model, which serves as the second high-dimensional column vector of the speech spectrum segment;
splicing the first high-dimensional column vector of the speech spectrum segment with the second high-dimensional column vector of the speech spectrum segment through the voiceprint extraction model to obtain a spliced high-dimensional vector;
and reducing the dimension of the spliced high-dimensional vector through the voiceprint extraction model, and determining the vector after dimension reduction as the voiceprint sub-model of the speech spectrum fragment.
Optionally, the determining a second feature map containing global information corresponding to each first feature map includes:
for any first feature map, dividing the first feature map into a plurality of first feature subgraphs of different frequency bands to obtain a plurality of first feature subgraphs contained in each first feature map;
for any first feature subgraph, determining a second feature subgraph which is corresponding to the first feature subgraph and contains global information to obtain a second feature subgraph corresponding to each first feature subgraph;
and for any first feature map, splicing the second feature subgraphs respectively corresponding to the plurality of first feature subgraphs contained in the first feature map into a second feature map which corresponds to the first feature map and contains global information, so as to obtain the second feature map which corresponds to each first feature map and contains the global information.
Optionally, the determining a second feature subgraph corresponding to the first feature subgraph and containing global information includes:
performing dimensionality reduction processing on the first feature subgraph through three convolution kernels with the same size and different parameters to obtain three dimensionality-reduced feature subgraphs;
determining attention weight through two feature sub-images in the three feature sub-images after dimension reduction;
and determining a second feature subgraph which corresponds to the first feature subgraph and contains global information according to the attention weight and the other feature subgraph in the three dimension-reduced feature subgraphs.
Optionally, the obtaining at least one speech spectrum segment of the target speech includes:
determining the voice feature of each voice frame of the target voice to obtain a voice feature sequence of the target voice;
and segmenting the voice feature sequence of the target voice according to a preset segmentation rule to obtain at least one speech spectrum segment of the target voice.
Optionally, the process of pre-establishing the voiceprint extraction model includes:
acquiring training voice and acquiring at least one voice spectrum segment of the training voice;
determining at least one first feature map of each spectrum segment of the training voice through a current voiceprint extraction model, wherein each feature point in the first feature map is independent, if the training is carried out for the first time, the current voiceprint extraction model is an initial voiceprint extraction model, and if the training is not carried out for the first time, the current voiceprint extraction model is a voiceprint extraction model after the training for the previous time;
determining a second feature map which comprises global information and corresponds to each first feature map of each speech spectrum segment of the training speech through a current voiceprint extraction model to obtain at least one second feature map of each speech spectrum segment of the training speech, wherein the second feature map corresponding to one first feature map is a feature map obtained by strengthening a feature region which can distinguish voiceprints in the first feature map;
determining a voiceprint sub-model of each speech spectrum fragment of the training speech by at least using at least one second feature map of each speech spectrum fragment of the training speech and a current voiceprint extraction model;
and predicting the voiceprint identity label corresponding to each speech spectrum segment of the training speech according to the voiceprint sub-model of each speech spectrum segment of the training speech, and updating the parameters of the current voiceprint extraction model according to the prediction result.
Optionally, the determining a voiceprint sub-model of each spectral fragment of the training speech by using at least one second feature map of each spectral fragment of the training speech and a current voiceprint extraction model includes:
and for any speech spectrum segment of the training speech, fusing at least one first characteristic diagram of the speech spectrum segment with at least one second characteristic diagram of the speech spectrum segment through a current voiceprint extraction model to obtain a voiceprint sub-model of the speech spectrum segment so as to obtain the voiceprint sub-model of each speech spectrum segment of the training speech.
An apparatus for determining a voiceprint model, comprising: the system comprises a speech spectrum fragment acquisition module, a first characteristic acquisition module, a second characteristic acquisition module and a voiceprint model determination module;
the speech spectrum segment acquisition module is used for acquiring at least one speech spectrum segment of the target speech;
the first feature acquisition module is used for determining at least one first feature map of each speech spectrum fragment through a pre-established voiceprint extraction model, wherein feature points in the first feature maps are mutually independent;
the second feature obtaining module is configured to determine, through the voiceprint extraction model, a second feature map that includes global information of each first feature map and corresponds to each first feature map, and obtain at least one second feature map of each speech spectrum segment, where the second feature map corresponding to one first feature map is a feature map obtained by enhancing a feature region in the first feature map, where the feature region is capable of distinguishing voiceprints;
and the voiceprint model determining module is used for determining the voiceprint model of the target voice by using at least one second feature map of each speech spectrum fragment and the voiceprint extraction model.
A voiceprint model determination apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the method for determining a voiceprint model according to any one of the above embodiments.
A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of determining a voiceprint model according to any one of the preceding claims.
According to the above scheme, the method, the device, the equipment and the storage medium for determining a voiceprint model provided by the application first obtain at least one speech spectrum segment of the target speech, then determine, through a pre-established voiceprint extraction model, at least one first feature map containing local information for each speech spectrum segment, then determine through the voiceprint extraction model a second feature map containing global information corresponding to each first feature map to obtain at least one second feature map of each speech spectrum segment, and finally determine the voiceprint model of the target speech by using at least one second feature map of each speech spectrum segment and the voiceprint extraction model. Compared with the prior art, this method can obtain the first feature maps of the speech spectrum segments through the pre-established voiceprint extraction model and can already determine a more accurate and stable voiceprint model from them; furthermore, considering that the feature points of a first feature map are mutually independent, i.e., the first feature map contains only local information, the method further mines the global information of the first feature map with the voiceprint extraction model, so that a still more stable and accurate voiceprint model can be determined from the second feature maps containing that global information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for determining a voiceprint model according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of pre-establishing a voiceprint extraction model according to an embodiment of the present application;
fig. 3 is a schematic diagram of a second feature subgraph which includes global information and corresponds to the first feature subgraph determined in the embodiment of the present application;
fig. 4 is a schematic structural diagram of a voiceprint model determination apparatus provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a determining apparatus of a voiceprint model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In voiceprint recognition, the similarity of voiceprint models is used to decide whether two segments of speech come from the same speaker; if the obtained voiceprint models are not stable and accurate enough, the recognition effect is directly affected.
For the voiceprint model determination scheme based on total variability factor analysis, when the speech is short the statistics are insufficiently estimated, so the determined voiceprint model is not stable and accurate enough.
In order to determine a stable and accurate voiceprint model and thereby improve the voiceprint recognition effect, the inventors of the present application conducted research. The original idea was:
adopt a voiceprint model determination scheme based on a deep Convolutional Neural Network (CNN). In recent years, deep learning methods have achieved remarkable success in many research fields; they combine and analyze low-level features to form abstract high-level attribute descriptions and thus discover structural feature representations of the data, and the deep convolutional neural network is an efficient learning method that has developed rapidly and attracted much attention.
Compared with pure total variability factor analysis, a convolutional neural network can jointly analyze the time and frequency domains, deeply mine the voiceprint information in the speech spectrum, obtain a more detailed voiceprint feature expression, and thus establish an accurate voiceprint model.
When determining a voiceprint model based on a deep convolutional neural network, features reflecting voiceprint information, for example Fast Fourier Transform (FFT) features, are first extracted from a segment of speech; a convolutional neural network model is then trained by stacking structures such as convolution, pooling and activation, and the speech features are non-linearly projected by this model to obtain the voiceprint model (c-vector) corresponding to the segment of speech. This voiceprint determination scheme based on a convolutional neural network is relatively simple and efficient.
However, through further research, the inventors found that in the above voiceprint model determination scheme based on a convolutional neural network, during feature map analysis the feature points on each feature map are independent of each other and, limited by the receptive field of the convolution kernel, the global information of the feature map cannot be sufficiently captured, so the voiceprint model determined based on the convolutional neural network is still not stable and accurate enough.
In order to obtain a more stable and accurate voiceprint model, the inventor of the present application has conducted further research, and finally provides a voiceprint model determination method with a better effect, the method is applied to an application scenario where voiceprint recognition is required, and the method can be applied to a terminal with data processing capability and can also be applied to a server. The voiceprint model determination method provided by the present application is described next by the following embodiments.
Referring to fig. 1, a schematic flow chart of a method for determining a voiceprint model according to an embodiment of the present application is shown, where the method may include:
step S101: at least one spectral fragment of the target speech is obtained.
Specifically, the process of acquiring at least one speech spectrum segment of the target speech may include:
step S1011, determining the speech feature of each speech frame of the target speech, and obtaining the speech feature sequence of the target speech.
Specifically, framing, windowing and Fourier transform may be performed on the target speech to obtain an FFT feature sequence, which is used as the speech feature sequence of the target speech.
Step S1012, segmenting the speech feature sequence of the target speech according to a preset segmentation rule, so as to obtain at least one speech spectrum segment.
Specifically, a window length L may be preset, the speech feature sequence of the target speech is segmented according to the window length L, and the dimension of the speech feature is assumed to be d, so that the size of each speech spectrum segment is L × d.
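As an illustration of steps S1011 and S1012, a minimal Python sketch of the framing, FFT and fixed-window segmentation is given below; it is not the patented implementation, and the frame length, hop size, FFT size and window length L are hypothetical values.

```python
import numpy as np

def spectrogram_segments(speech, frame_len=400, hop=160, n_fft=512, L=200):
    """Frame and window the speech, take an FFT per frame, then cut the
    feature sequence into spectral segments of preset window length L.
    All parameter values are illustrative assumptions; speech is assumed
    to be at least frame_len samples long."""
    n_frames = 1 + max(0, (len(speech) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([speech[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # FFT magnitude features: one d-dimensional vector per speech frame.
    features = np.abs(np.fft.rfft(frames, n=n_fft))      # (n_frames, d)
    # Each spectral segment has size L x d; speech shorter than L is
    # handled by the padding rule described in the training section below.
    return [features[s: s + L] for s in range(0, len(features) - L + 1, L)]
```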
Step S102: at least one first feature map of each spectral fragment is determined by a pre-established voiceprint extraction model.
Each feature point in the first feature map is independent of each other, that is, the first feature map includes local information.
The pre-established voiceprint extraction model can be a model based on a convolutional neural network, and is obtained by training a speech spectrum segment of training speech, wherein the speech spectrum segment of the training speech has a voiceprint identity tag.
Step S103: and determining a second feature map containing global information corresponding to each first feature map through a voiceprint extraction model to obtain at least one second feature map of each speech spectrum segment.
The second feature map corresponding to a first feature map is the feature map obtained by strengthening the feature regions of the first feature map that can distinguish voiceprints, and is equivalent to a feature map obtained by optimizing the first feature map. In this embodiment, the voiceprint extraction model is used to fully mine global information from the first feature map, determine the feature regions that require key attention (i.e., the feature regions that can clearly distinguish voiceprints), and then strengthen those regions.
Step S104: and determining a voiceprint model of the target speech by using at least one second feature map of each speech spectrum segment and the voiceprint extraction model.
Because the second feature map contains global information and strengthens the feature region capable of distinguishing the voiceprints in the first feature map, a more accurate and stable voiceprint model can be determined according to the second feature map.
Considering that the first feature map contains local information and the second feature map contains global information, in order to obtain a more accurate and stable voiceprint model, in another possible implementation, the voiceprint model of the target speech may be determined using a voiceprint extraction model, at least one first feature map of each spectral segment of the target speech, and at least one second feature map of each spectral segment of the target speech, i.e., the voiceprint model of the target speech is determined using both the global information and the local information.
Compared with the prior-art voiceprint determination scheme based on total variability factor analysis, the voiceprint model determination method based on the voiceprint extraction model can obtain the first feature maps of the speech spectrum segments of the target speech through the pre-established voiceprint extraction model. Because a first feature map contains voiceprint information in which the time domain and the frequency domain are interwoven, the voiceprint information in the speech spectrum can be deeply mined to obtain a more stable and accurate voiceprint model. Furthermore, considering that the feature points of a first feature map are mutually independent, i.e., the first feature map contains only local information, the method further mines the global information of the first feature map with the voiceprint extraction model and determines the voiceprint model from the second feature maps containing that global information. Because a second feature map contains global information and strengthens the feature regions of the first feature map that can distinguish voiceprints, a more stable and accurate voiceprint model can be determined based on the second feature maps.
As can be seen from the above embodiments, the voiceprint model of the target speech is determined by the pre-established voiceprint extraction model, and the process of pre-establishing the voiceprint extraction model is described below.
Referring to fig. 2, a schematic flow chart of pre-establishing a voiceprint extraction model is shown, which may include:
step S201: training speech is acquired, and at least one speech spectral fragment of the training speech is acquired.
The process of acquiring at least one speech spectrum segment of the training speech is similar to the process of acquiring at least one speech spectrum segment of the target speech, namely, the speech feature of each speech frame of the training speech is determined, and the speech feature sequence of the training speech is obtained; and segmenting the voice feature sequence of the training voice according to a preset segmentation rule to obtain at least one voice spectrum segment of the training voice.
Similarly, framing, windowing and Fourier transform can be performed on the training speech to obtain an FFT feature sequence, which is used as the speech feature sequence of the training speech; this sequence is then segmented according to the preset window length L.
If the length of the training speech is less than L, the speech is padded with a copy of itself until the final length is at least L; if the final length is then not an integer multiple of L, the redundant part is deleted so that the final length is an integer multiple of L. If the length of the training speech is greater than L but not an integer multiple of L, it is likewise first padded with a copy of itself and the redundant part is then removed. A target speech that is shorter than L, or longer than L but not an integer multiple of L, is processed in the same way as the training speech.
It can be understood that if L is set too small, the original spectrogram is fragmented: continuous spectrogram information is cut into many small segments, too much information between segments is lost, and the long-term temporal correlation of the speech cannot be modeled. If L is set too large, the training efficiency of the voiceprint extraction model suffers and GPU memory usage rises significantly. In one possible implementation, the window length L may be set to 1/2 of the average duration of the training speech in the training data set.
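A minimal sketch of the length-normalization rule above, assuming the feature sequence is a NumPy array of shape (n_frames, d); this is one plausible reading of the pad-with-a-copy-then-trim rule, not the patent's code.

```python
import numpy as np

def normalize_length(features, L):
    """Pad the speech with copies of itself and trim the excess so the
    final length is an integer multiple of the window length L."""
    n = len(features)
    target = -(-n // L) * L          # ceil(n / L) * L
    while len(features) < target:
        # Supplement the speech with a copy of itself.
        features = np.concatenate([features, features[:n]], axis=0)
    # Delete the redundant frames beyond the last full window.
    return features[:target]
```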
Step S202: at least one first feature map of each spectral segment of the training speech is determined by the current voiceprint extraction model.
It should be noted that, during the first training, the current voiceprint extraction model is the initial voiceprint extraction model.
Each feature point in the first feature map is independent of each other, that is, the first feature map contains local information.
Specifically, for any speech spectrum segment, the speech spectrum segment can be mapped into at least one first feature map by performing convolution, pooling and activation processing on the speech spectrum segment.
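A hedged PyTorch sketch of such a convolution/pooling/activation stack follows; the number of layers, channel counts and kernel sizes are assumptions for illustration only, not the patent's architecture.

```python
import torch
import torch.nn as nn

class FirstFeatureExtractor(nn.Module):
    """Maps a spectral segment of shape (batch, 1, L, d) to a set of
    first feature maps via stacked convolution, activation and pooling.
    Channel counts and kernel sizes are illustrative assumptions."""
    def __init__(self, channels=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, segment):
        # Each output channel is one first feature map of the segment.
        return self.layers(segment)        # (batch, channels, L/4, d/4)
```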
Step S203: and determining a second feature map containing global information corresponding to each first feature map of each spectral fragment of the training speech through the current voiceprint extraction model to obtain at least one second feature map of each spectral fragment of the training speech.
And the second characteristic diagram corresponding to one first characteristic diagram is the characteristic diagram obtained after strengthening the characteristic area capable of distinguishing the voiceprint in the first characteristic diagram.
Specifically, the process of determining a second feature map containing global information corresponding to each first feature map of each speech spectrum segment of the training speech may include:
step S2031, for any first feature map, dividing the first feature map into a plurality of first feature sub-maps of different frequency bands to obtain a plurality of first feature sub-maps included in each first feature map.
In this embodiment, the first feature map is divided in a frequency domain to obtain a plurality of first feature sub-maps in different frequency bands.
Referring to fig. 3, 301 in fig. 3 is an example of a first feature map; the first feature map 301 is divided into two first feature subgraphs of different frequency bands.
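A one-function sketch of this frequency-domain division, assuming feature maps of shape (batch, channels, time, frequency); the two-band split mirrors fig. 3, and the band count is an example.

```python
import torch

def split_frequency_bands(fmap, n_bands=2):
    # Divide a first feature map into first feature subgraphs on
    # different frequency bands (last axis assumed to be frequency).
    return torch.chunk(fmap, n_bands, dim=3)
```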
Step S2032, for any first feature subgraph, determining a second feature subgraph corresponding to the first feature subgraph and containing global information, so as to obtain a second feature subgraph corresponding to each first feature subgraph.
Specifically, for any first feature subgraph, the process of determining a second feature subgraph which corresponds to the first feature subgraph and contains global information includes: performing dimensionality reduction processing on the first feature subgraph through three convolution kernels with the same size and different parameters to obtain three dimensionality-reduced feature subgraphs; determining attention weight through two feature sub-images in the three feature sub-images after dimension reduction; and determining a second feature subgraph which corresponds to the first feature subgraph and contains global information through attention weight and another feature subgraph in the three dimension-reduced feature subgraphs.
For any first feature subgraph, assume that three convolution kernels of the same size (such as three 1 × 1 convolution kernels) but with different parameters are applied to reduce its dimension, yielding p1, p2 and p3. First, the transpose of p1 is multiplied by p2 to obtain a matrix that represents the correlation between every pair of feature points of p1 and p2; this matrix is then passed through a softmax layer to obtain the attention weight. The attention weight is then multiplied with p3, and the result is finally raised back in dimension by a convolution kernel (such as a 1 × 1 convolution kernel) to obtain the second feature subgraph which corresponds to the first feature subgraph and contains global information; this second feature subgraph has the same size as the first feature subgraph.
As shown in fig. 3, the first feature map 301 is divided into a first feature subgraph 3011 and a first feature subgraph 3012 in two different frequency bands. For the first feature subgraph 3011, three 1 × 1 convolution kernels are used to reduce its dimension, yielding the three feature subgraphs 3011a, 3011b and 3011c; 3011a is transposed and multiplied by 3011b, the result passes through a softmax layer to obtain the attention weight, the attention weight is multiplied by 3011c, and the result is raised in dimension by a 1 × 1 convolution to obtain the second feature subgraph 3011' which corresponds to the first feature subgraph 3011 and contains global information. The second feature subgraph 3011' is the optimized version of the first feature subgraph 3011. Applying the same processing to the first feature subgraph 3012 yields the second feature subgraph 3012' which corresponds to the first feature subgraph 3012 and contains global information.
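A hedged PyTorch sketch of this attention step: three 1 × 1 convolutions with different parameters produce the reduced sub-maps, softmax of p1ᵀp2 gives the attention weight, and a final 1 × 1 convolution raises the dimension again. The channel-reduction ratio and the exact multiplication order for the p3 reweighting are assumptions.

```python
import torch
import torch.nn as nn

class GlobalAttentionBlock(nn.Module):
    """Turns a first feature subgraph into the corresponding second
    feature subgraph containing global information, following the
    p1/p2/p3 construction described above."""
    def __init__(self, channels):
        super().__init__()
        c_red = max(1, channels // 8)   # reduction ratio is an assumption
        # Three convolution kernels of the same size (1x1) but with
        # different parameters, used to reduce the dimension.
        self.conv_p1 = nn.Conv2d(channels, c_red, kernel_size=1)
        self.conv_p2 = nn.Conv2d(channels, c_red, kernel_size=1)
        self.conv_p3 = nn.Conv2d(channels, c_red, kernel_size=1)
        # 1x1 convolution that raises the result back to the input dimension.
        self.conv_up = nn.Conv2d(c_red, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        p1 = self.conv_p1(x).flatten(2)         # (B, C', H*W)
        p2 = self.conv_p2(x).flatten(2)
        p3 = self.conv_p3(x).flatten(2)
        # Correlation of every feature-point pair: p1^T * p2 -> (B, HW, HW),
        # passed through softmax to obtain the attention weight.
        attn = torch.softmax(torch.bmm(p1.transpose(1, 2), p2), dim=-1)
        # Reweight p3 by the attention weight (one plausible ordering).
        out = torch.bmm(p3, attn).view(B, -1, H, W)
        # Raise the dimension; the output has the same size as the input.
        return self.conv_up(out)
```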
Step S2033, for any first feature map, splicing the second feature subgraphs respectively corresponding to the plurality of first feature subgraphs contained in the first feature map into a second feature map containing global information corresponding to the first feature map, so as to obtain the second feature map containing global information corresponding to each first feature map.
As shown in fig. 3, the second feature subgraph 3011' corresponding to the first feature subgraph 3011 and the second feature subgraph 3012' corresponding to the first feature subgraph 3012 are spliced to obtain the second feature map 301' which corresponds to the first feature map 301 and contains global information.
Step S204: and determining the voiceprint sub-model of each speech spectrum fragment of the training speech by using at least one second feature map of each speech spectrum fragment of the training speech and the current voiceprint extraction model.
In one possible implementation, for any speech spectral segment of the training speech, the current voiceprint extraction model and the at least one second feature map of the speech spectral segment can be utilized to determine a voiceprint sub-model of the speech spectral segment. In order to obtain a more stable and accurate voiceprint model, in another possible implementation manner, for any speech spectral fragment of the training speech, the current voiceprint extraction model, the at least one first feature map of the speech spectral fragment and the at least one second feature map of the speech spectral fragment can be used to determine the voiceprint sub-model of the speech spectral fragment.
For any speech spectrum segment, the process of determining the voiceprint sub-model of the speech spectrum segment by using the current voiceprint extraction model and the at least one second feature map of the speech spectrum segment may include: and splicing the second feature maps of the speech spectrum segment into a high-dimensional vector through a current voiceprint extraction model, reducing the dimension of the high-dimensional vector through linear transformation, and obtaining a vector as a voiceprint sub-model of the speech spectrum segment after dimension reduction.
For any speech spectrum segment, the process of determining the voiceprint sub-model of the speech spectrum segment by using the current voiceprint extraction model, at least one first feature map of the speech spectrum segment and at least one second feature map of the speech spectrum segment includes: fusing at least one first feature map of the speech spectrum segment with at least one second feature map of the speech spectrum segment through the current voiceprint extraction model to obtain the voiceprint sub-model of the speech spectrum segment. Specifically, the first feature maps of the speech spectrum segment are spliced into a high-dimensional column vector as the first high-dimensional column vector of the speech spectrum segment, the second feature maps of the speech spectrum segment are spliced into a high-dimensional column vector as the second high-dimensional column vector of the speech spectrum segment, the two column vectors are spliced to obtain a spliced high-dimensional vector, the spliced high-dimensional vector is reduced in dimension through a linear transformation, and the resulting low-dimensional vector is used as the voiceprint sub-model of the speech spectrum segment.
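A minimal PyTorch sketch of this splice-and-reduce fusion; the flattened feature dimension and the sub-model dimension are hypothetical, and equal flattened sizes for the first and second feature maps are assumed (they have the same shapes by construction).

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuses the first and second feature maps of one spectral segment
    into its voiceprint sub-model, following the splice-and-reduce steps
    above; feat_dim is the flattened size of all first feature maps."""
    def __init__(self, feat_dim, out_dim=512):   # out_dim is an assumption
        super().__init__()
        # Linear transform that reduces the spliced high-dimensional vector.
        self.reduce = nn.Linear(2 * feat_dim, out_dim)

    def forward(self, first_maps, second_maps):  # lists of (B, C, H, W)
        # Splice the first feature maps into the first high-dimensional vector.
        v1 = torch.cat([m.flatten(1) for m in first_maps], dim=1)
        # Splice the second feature maps into the second high-dimensional vector.
        v2 = torch.cat([m.flatten(1) for m in second_maps], dim=1)
        # Splice the two vectors and reduce the dimension: the result is
        # the voiceprint sub-model of the segment.
        return self.reduce(torch.cat([v1, v2], dim=1))
```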
Step S205: and predicting the voiceprint identity label corresponding to each spectrum segment of the training voice according to the voiceprint sub-model of each spectrum segment of the training voice, and updating the parameters of the current voiceprint extraction model according to the prediction result.
The voiceprint identity tag corresponding to each speech spectrum segment of the training speech is used for identifying the speaker corresponding to the training speech.
The above training process is executed repeatedly until a preset number of training iterations is reached or the performance of the trained voiceprint extraction model meets the requirements.
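A hedged sketch of one training update as described in step S205; `model` and `classifier` are assumed interfaces (a segment-to-sub-model network and a speaker-classification head), not an API defined in the patent.

```python
import torch.nn.functional as F

def train_step(model, classifier, optimizer, segments, speaker_ids):
    """One parameter update of the current voiceprint extraction model:
    predict the voiceprint identity label of each training segment from
    its voiceprint sub-model and back-propagate the prediction loss."""
    optimizer.zero_grad()
    sub_models = model(segments)          # (n_segments, sub_model_dim)
    logits = classifier(sub_models)       # (n_segments, n_speakers)
    loss = F.cross_entropy(logits, speaker_ids)
    loss.backward()
    optimizer.step()                      # update the model parameters
    return loss.item()
```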
Through the training process, a voiceprint extraction model for determining a voiceprint model of the target speech can be obtained. On the basis of the above embodiment, a process of determining a voiceprint model of a target speech using a trained voiceprint extraction model will be further described below.
The above embodiment mentions that after obtaining at least one spectral fragment of the target speech, at least one first feature map containing local information of each spectral fragment of the target speech is first determined through a pre-established voiceprint extraction model, then a second feature map containing global information corresponding to each first feature map is determined through the pre-established voiceprint extraction model to obtain at least one second feature map of each spectral fragment of the target speech, and a process of determining the second feature map containing global information corresponding to each first feature map through the pre-established voiceprint extraction model is given as follows:
step a1, for any first feature map, dividing the first feature map into a plurality of first feature subgraphs of different frequency bands to obtain a plurality of first feature subgraphs included in each first feature map.
Step a2, for any first feature subgraph, determining a second feature subgraph which is corresponding to the first feature subgraph and contains global information to obtain a second feature subgraph corresponding to each first feature subgraph.
Specifically, dimension reduction processing is respectively carried out on the first feature subgraph through three convolution kernels with the same size and different parameters, and three feature subgraphs after dimension reduction are obtained; determining attention weight through two feature sub-images in the three feature sub-images after dimension reduction; and determining a second feature subgraph which corresponds to the first feature subgraph and contains global information through attention weight and another feature subgraph in the three dimension-reduced feature subgraphs.
Step a3, for any first feature map, splicing the second feature subgraphs respectively corresponding to the plurality of first feature subgraphs contained in the first feature map into a second feature map corresponding to the first feature map and containing global information, so as to obtain the second feature map corresponding to each first feature map and containing global information, namely at least one second feature map of each spectral segment of the target speech.
It should be noted that the process of determining the second feature maps of each speech spectrum segment of the target speech is substantially the same as that of the training speech; for details of steps a1 to a3, refer to the process of determining the second feature maps of each speech spectrum segment of the training speech.
After obtaining at least one second feature map of each speech spectrum segment of the target speech, determining a voiceprint model of the target speech by using at least one second feature map of each speech spectrum segment and the voiceprint extraction model, specifically:
and b1, determining the voiceprint sub-model of each spectral fragment of the target voice by using at least one second feature map of each spectral fragment of the target voice and a pre-established voiceprint model.
If, in the training phase, the voiceprint sub-model of each speech spectrum segment of the training speech was determined only from the at least one second feature map of each segment, then the voiceprint sub-model of each speech spectrum segment of the target speech is likewise determined only from the at least one second feature map of each segment. If, in the training phase, the voiceprint sub-model of each speech spectrum segment of the training speech was determined from both the at least one first feature map and the at least one second feature map of each segment, then the voiceprint sub-model of each speech spectrum segment of the target speech is likewise determined from both.
Specifically, the process of determining the voiceprint sub-model of each speech spectrum segment of the target speech by using the pre-established voiceprint extraction model, at least one first feature map of each speech spectrum segment of the target speech and at least one second feature map of each speech spectrum segment of the target speech includes: for any speech spectrum fragment, at least one first feature map of the speech spectrum fragment and at least one second feature map of the speech spectrum fragment are fused through a pre-established voiceprint extraction model to obtain a voiceprint sub-model of the speech spectrum fragment so as to obtain the voiceprint sub-model of each speech spectrum fragment. Further, the process of fusing at least one first feature map of the speech spectrum segment with at least one second feature map of the speech spectrum segment through a pre-established voiceprint extraction model includes: splicing all the first feature maps of the speech spectrum fragment into a high-dimensional vector through a pre-established voiceprint extraction model, and taking the high-dimensional vector as a first high-dimensional column vector of the speech spectrum fragment; splicing each second feature map of the speech spectrum segment into a high-dimensional column vector through a pre-established voiceprint extraction model, and taking the high-dimensional column vector as a second high-dimensional column vector of the speech spectrum segment; splicing the first high-dimensional column vector of the speech spectrum segment and the second high-dimensional column vector of the speech spectrum segment through a pre-established voiceprint extraction model to obtain a spliced high-dimensional vector; and reducing the dimension of the spliced high-dimensional vector through a voiceprint extraction model, and determining the vector after dimension reduction as a voiceprint sub-model of the speech spectrum fragment.
Step b2, averaging the voiceprint sub-models of each spectrum fragment of the target voice to obtain the voiceprint model of the target voice.
It should be noted that, if there is only one speech spectrum segment of the target speech, the voiceprint sub-model of the speech spectrum segment is directly determined as the voiceprint model of the target speech; and if the target voice has a plurality of speech spectrum segments, determining the average value of the voiceprint models of the speech spectrum segments as the voiceprint model of the target voice.
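A minimal sketch of this averaging step; with a single spectral segment the mean reduces to the sub-model itself, matching the special case noted above.

```python
import torch

def voiceprint_model(sub_models):
    # Average the voiceprint sub-models of all spectral segments of the
    # target speech; for one segment this returns that sub-model itself.
    return torch.stack(sub_models).mean(dim=0)
```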
According to the voiceprint model determining method provided by the embodiment of the application, an attention mechanism is adopted to optimize the first feature maps of each speech spectrum segment of the target speech, and an accurate and stable voiceprint model can be determined from the optimized features; because the first feature maps contain local information and the second feature maps contain global information, complementarily fusing the two allows an even more accurate and stable voiceprint model to be determined. In addition, considering that voiceprint information manifests differently in different frequency bands, the first feature map is divided into subgraphs on different frequency bands before the attention weight is determined, which reduces mutual interference between the information of different frequency bands and allows the attention weight to be calculated accurately.
The following describes the apparatus for determining a voiceprint model provided in the embodiment of the present application, and the apparatus for determining a voiceprint model described below and the method for determining a voiceprint model described above may be referred to in correspondence with each other.
Referring to fig. 4, a schematic structural diagram of a device for determining a voiceprint model according to an embodiment of the present application is shown, and as shown in fig. 4, the device for determining a voiceprint model may include: a speech spectrum segment acquisition module 401, a first feature acquisition module 402, a second feature acquisition module 403 and a voiceprint model determination module 404.
A speech spectrum segment acquiring module 401, configured to acquire at least one speech spectrum segment of the target speech.
A first feature obtaining module 402, configured to determine at least one first feature map of each speech spectrum segment through a pre-established voiceprint extraction model.
Wherein, each characteristic point in the first characteristic diagram is independent.
A second feature obtaining module 403, configured to determine, through the voiceprint extraction model, a second feature map that includes global information of each first feature map, and obtain at least one second feature map of each speech spectrum segment.
And the second characteristic diagram corresponding to one first characteristic diagram is the characteristic diagram obtained after strengthening the characteristic area capable of distinguishing the voiceprint in the first characteristic diagram.
A voiceprint model determining module 404, configured to determine a voiceprint model of the target speech by using at least the at least one second feature map of each speech spectrum segment and the voiceprint extraction model.
The voiceprint model determination device provided by the embodiment of the application can obtain the first feature maps of the speech spectrum segments of the target speech by using the pre-established voiceprint extraction model, and can already determine a more accurate and stable voiceprint model from them. Considering that the feature points of a first feature map are mutually independent, i.e., the first feature map contains only local information, the device further fully mines the global information of the first feature map with the voiceprint extraction model and thereby determines the voiceprint model by using at least the second feature maps containing that global information, so that an even more stable and accurate voiceprint model is obtained.
In a possible implementation manner, in the apparatus for determining a voiceprint model provided in the foregoing embodiment, the speech spectrum fragment obtaining module 401 includes: a feature determination submodule and a segmentation submodule.
And the characteristic determining submodule is used for determining the voice characteristic of each voice frame of the target voice and obtaining the voice characteristic sequence of the target voice.
And the segmentation submodule is used for segmenting the voice feature sequence of the target voice according to a preset segmentation rule to obtain at least one speech spectrum segment of the target voice.
In a possible implementation manner, in the apparatus for determining a voiceprint model provided in the foregoing embodiment, the second feature obtaining module 403 includes: the first feature map dividing sub-module, the second feature sub-map determining sub-module and the second feature map determining sub-module.
And the first feature map dividing submodule is used for dividing any first feature map into a plurality of first feature sub-maps of different frequency bands to obtain a plurality of first feature sub-maps contained in each first feature map.
And the second characteristic subgraph determining sub-module is used for determining a second characteristic subgraph which is corresponding to any first characteristic subgraph and contains global information so as to obtain a second characteristic subgraph corresponding to each first characteristic subgraph.
And the second feature map determining submodule is used for, for any first feature map, splicing the second feature subgraphs respectively corresponding to the plurality of first feature subgraphs contained in the first feature map into a second feature map which corresponds to the first feature map and contains global information, so as to obtain the second feature map which corresponds to each first feature map and contains global information.
In a possible implementation manner, the second feature sub-image determining sub-module is specifically configured to perform dimension reduction processing on the first feature sub-image through three convolution kernels with the same size and different parameters, and obtain three dimension-reduced feature sub-images; determining attention weight through two feature sub-images in the three feature sub-images after dimension reduction; and determining a second feature subgraph which corresponds to the first feature subgraph and contains global information according to the attention weight and the other feature subgraph in the three dimension-reduced feature subgraphs.
In a possible implementation manner, in the apparatus for determining a voiceprint model provided in the foregoing embodiment, the voiceprint model determining module 404 is specifically configured to determine the voiceprint model of the target speech by using the voiceprint extraction model, the at least one first feature map of each speech spectrum segment, and the at least one second feature map of each speech spectrum segment.
In one possible implementation, the voiceprint model determination module 404 includes: a voiceprint sub-model determination sub-module and a voiceprint model determination sub-module.
And the voiceprint sub-model determining sub-module is used for fusing at least one first characteristic diagram of any speech spectrum fragment of the target speech with at least one second characteristic diagram of the speech spectrum fragment through the voiceprint extraction model to obtain the voiceprint sub-model of the speech spectrum fragment so as to obtain the voiceprint sub-model of each speech spectrum fragment of the target speech.
And the voiceprint model determining submodule is used for averaging the voiceprint sub models of all the speech spectrum fragments of the target speech to obtain the voiceprint model of the target speech.
In a possible implementation manner, the voiceprint sub-model determining sub-module is specifically configured to splice, by using the voiceprint extraction model, the first feature maps of the speech spectrum segment into high-dimensional column vectors as the first high-dimensional column vectors of the speech spectrum segment when the voiceprint sub-model of the speech spectrum segment is obtained by fusing the at least one first feature map of the speech spectrum segment with the at least one second feature map of the speech spectrum segment through the voiceprint extraction model; splicing the second feature maps of the speech spectrum segment into high-dimensional vectors through the voiceprint extraction model, wherein the high-dimensional vectors are used as second high-dimensional column vectors of the speech spectrum segment; splicing the first high-dimensional column vector of the speech spectrum segment with the second high-dimensional column vector of the speech spectrum segment through the voiceprint extraction model to obtain a spliced high-dimensional vector; and reducing the dimension of the spliced high-dimensional vector through the voiceprint extraction model, and determining the vector after dimension reduction as the voiceprint sub-model of the speech spectrum fragment.
The apparatus for determining a voiceprint model provided in the foregoing embodiment may further include: and a model building module.
The model building module comprises: the system comprises a training voice acquisition module, a speech spectrum fragment acquisition module, a first characteristic diagram determination module, a second characteristic diagram determination module, a voiceprint sub-model determination module, an identity label prediction module and a parameter updating module.
And the training voice acquisition module is used for acquiring training voice.
A speech spectrum segment obtaining module, configured to obtain at least one speech spectrum segment of the training speech;
and the first feature map determining module is used for determining at least one first feature map of each speech spectrum segment of the training speech through a current voiceprint extraction model.
Wherein, each characteristic point in the first characteristic diagram is independent.
If the training is the first training, the current voiceprint extraction model is the initial voiceprint extraction model, and if the training is not the first training, the current voiceprint extraction model is the voiceprint extraction model after the previous training.
And the second feature map determining module is used for determining a second feature map which corresponds to each first feature map of each spectral fragment of the training speech and contains global information through a current voiceprint extraction model so as to obtain at least one second feature map of each spectral fragment of the training speech.
The second characteristic diagram corresponding to one first characteristic diagram is a characteristic diagram obtained after strengthening a characteristic region capable of distinguishing the voiceprint in the first characteristic diagram;
The voiceprint sub-model determining module is configured to determine a voiceprint sub-model of each speech spectrum segment of the training speech by using at least the at least one second feature map of each speech spectrum segment of the training speech and the current voiceprint extraction model.
The identity label prediction module is configured to predict the voiceprint identity label corresponding to each speech spectrum segment of the training speech according to the voiceprint sub-model of that segment.
The parameter updating module is configured to update the parameters of the current voiceprint extraction model according to the prediction result of the identity label prediction module; a minimal training-step sketch is given after the following paragraph.
In a possible implementation manner, the voiceprint sub-model determining module is specifically configured to, for any speech spectrum segment of the training speech, fuse the at least one first feature map of the segment with the at least one second feature map of the segment through the current voiceprint extraction model to obtain the voiceprint sub-model of the segment, thereby obtaining a voiceprint sub-model for each speech spectrum segment of the training speech.
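Neither the classifier used for identity label prediction nor the training loss is specified at this level of the description, so the sketch below uses a conventional stand-in: a linear speaker-classification head trained with cross-entropy in PyTorch. The `extractor` stand-in, the sizes, and the batch variables are hypothetical placeholders, not values from the patent.

```python
import torch
import torch.nn as nn

dim, num_speakers = 512, 1000                  # hypothetical sizes
extractor = nn.Sequential(nn.Flatten(), nn.Linear(16 * 50, dim))  # stand-in for the model
head = nn.Linear(dim, num_speakers)            # identity label prediction head (assumption)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(extractor.parameters()) + list(head.parameters()), lr=1e-4)

def training_step(segments, speaker_ids):
    """One parameter update of the current voiceprint extraction model.

    segments:    (batch, 16, 50) spectrogram segments (shape is an assumption)
    speaker_ids: (batch,) true voiceprint identity labels
    """
    submodels = extractor(segments)            # per-segment voiceprint sub-models
    logits = head(submodels)                   # predicted identity labels
    loss = criterion(logits, speaker_ids)      # prediction result vs. true labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # update the current model's parameters
    return loss.item()

loss = training_step(torch.randn(4, 16, 50), torch.randint(0, num_speakers, (4,)))
```

Once training is finished, the classification head is no longer needed; the extractor alone serves as the pre-established voiceprint extraction model used at extraction time.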
An embodiment of the present application further provides a device for determining a voiceprint model. FIG. 5 shows a schematic structural diagram of the device, which may include: at least one processor 501, at least one communication interface 502, at least one memory 503 and at least one communication bus 504.
In this embodiment of the present application, the processor 501, the communication interface 502, the memory 503 and the communication bus 504 each number at least one, and the processor 501, the communication interface 502 and the memory 503 communicate with one another through the communication bus 504.
The processor 501 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
The memory 503 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one disk memory.
The memory stores a program, and the processor may call the program stored in the memory, the program being used for:
acquiring at least one speech spectrum segment of target speech;
determining at least one first feature map of each speech spectrum segment through a pre-established voiceprint extraction model, wherein feature points in the first feature maps are independent of one another;
determining, through the voiceprint extraction model, a second feature map containing global information corresponding to each first feature map, and obtaining at least one second feature map of each speech spectrum segment, wherein the second feature map corresponding to a first feature map is a feature map obtained by strengthening a feature region in the first feature map that can distinguish voiceprints;
and determining a voiceprint model of the target speech by using at least the at least one second feature map of each speech spectrum segment and the voiceprint extraction model.
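Taken together, the four program steps form a single extraction pipeline. The sketch below shows that flow under stated assumptions: the input is a precomputed (frames x feat_dim) speech feature sequence, the "preset segmentation rule" is taken to be fixed-length windows with 50% overlap, and `first_maps_of`, `second_maps_of` and `fuse` are hypothetical handles into the voiceprint extraction model rather than names from the patent.

```python
import torch

def get_segments(features, seg_len=300, hop=150):
    """Cut a (frames, feat_dim) speech feature sequence into fixed-length
    speech spectrum segments; the window length and hop are assumptions."""
    starts = range(0, max(features.shape[0] - seg_len + 1, 1), hop)
    return [features[s:s + seg_len] for s in starts]

def voiceprint_model_of(features, first_maps_of, second_maps_of, fuse):
    """Voiceprint model of the target speech: compute one sub-model per
    segment, then average the sub-models over all segments."""
    submodels = []
    for seg in get_segments(features):
        first_maps = first_maps_of(seg)            # local, mutually independent features
        second_maps = second_maps_of(first_maps)   # global-information feature maps
        submodels.append(fuse(first_maps, second_maps))
    return torch.stack(submodels).mean(dim=0)      # voiceprint model
```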
Optionally, the detailed functions and extended functions of the program may be as described above.
Embodiments of the present application further provide a readable storage medium storing a program suitable for being executed by a processor, the program being used for:
acquiring at least one speech spectrum segment of target speech;
determining at least one first feature map of each speech spectrum segment through a pre-established voiceprint extraction model, wherein feature points in the first feature maps are independent of one another;
determining, through the voiceprint extraction model, a second feature map containing global information corresponding to each first feature map, and obtaining at least one second feature map of each speech spectrum segment, wherein the second feature map corresponding to a first feature map is a feature map obtained by strengthening a feature region in the first feature map that can distinguish voiceprints;
and determining a voiceprint model of the target speech by using at least the at least one second feature map of each speech spectrum segment and the voiceprint extraction model.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for determining a voiceprint model, comprising:
acquiring at least one speech spectrum segment of target speech;
determining at least one first feature map of each speech spectrum segment through a pre-established voiceprint extraction model, wherein feature points in the first feature maps are independent of one another;
determining, through the voiceprint extraction model, a second feature map containing global information corresponding to each first feature map, to obtain at least one second feature map of each speech spectrum segment, wherein the second feature map corresponding to a first feature map is a feature map obtained by strengthening a feature region in the first feature map that can distinguish voiceprints;
and determining a voiceprint model of the target speech by using at least the at least one second feature map of each speech spectrum segment and the voiceprint extraction model.
2. The method for determining a voiceprint model according to claim 1, wherein the determining the voiceprint model of the target speech by using at least the at least one second feature map of each speech spectrum segment and the voiceprint extraction model comprises:
determining the voiceprint model of the target speech by using the voiceprint extraction model, the at least one first feature map of each speech spectrum segment and the at least one second feature map of each speech spectrum segment.
3. The method for determining a voiceprint model according to claim 2, wherein the determining the voiceprint model of the target speech by using the voiceprint extraction model, the at least one first feature map of each speech spectrum segment and the at least one second feature map of each speech spectrum segment comprises:
for any speech spectrum segment of the target speech, fusing the at least one first feature map of the speech spectrum segment with the at least one second feature map of the speech spectrum segment through the voiceprint extraction model to obtain a voiceprint sub-model of the speech spectrum segment, so as to obtain a voiceprint sub-model of each speech spectrum segment of the target speech;
and averaging the voiceprint sub-models of all the speech spectrum segments of the target speech to obtain the voiceprint model of the target speech.
4. The method for determining a voiceprint model according to claim 3, wherein the fusing the at least one first feature map of the speech spectrum segment with the at least one second feature map of the speech spectrum segment through the voiceprint extraction model to obtain the voiceprint sub-model of the speech spectrum segment comprises:
splicing the first feature maps of the speech spectrum segment into a high-dimensional column vector serving as a first high-dimensional column vector of the speech spectrum segment through the voiceprint extraction model;
splicing the second feature maps of the speech spectrum segment into a high-dimensional column vector serving as a second high-dimensional column vector of the speech spectrum segment through the voiceprint extraction model;
splicing the first high-dimensional column vector of the speech spectrum segment with the second high-dimensional column vector of the speech spectrum segment through the voiceprint extraction model to obtain a spliced high-dimensional vector;
and reducing the dimension of the spliced high-dimensional vector through the voiceprint extraction model, and determining the dimension-reduced vector as the voiceprint sub-model of the speech spectrum segment.
5. The method for determining a voiceprint model according to claim 1, wherein the determining a second feature map containing global information corresponding to each first feature map comprises:
for any first feature map, dividing the first feature map into a plurality of first feature subgraphs of different frequency bands, to obtain the plurality of first feature subgraphs contained in each first feature map;
for any first feature subgraph, determining a second feature subgraph containing global information corresponding to the first feature subgraph, to obtain a second feature subgraph corresponding to each first feature subgraph;
and for any first feature map, combining the second feature subgraphs corresponding to the plurality of first feature subgraphs contained in the first feature map into a second feature map containing global information corresponding to the first feature map, so as to obtain the second feature map containing global information corresponding to each first feature map.
6. The method for determining a voiceprint model according to claim 5, wherein the determining a second feature subgraph containing global information corresponding to the first feature subgraph comprises:
performing dimension reduction on the first feature subgraph through three convolution kernels of the same size but with different parameters to obtain three dimension-reduced feature subgraphs;
determining an attention weight from two of the three dimension-reduced feature subgraphs;
and determining the second feature subgraph containing global information corresponding to the first feature subgraph according to the attention weight and the remaining feature subgraph of the three dimension-reduced feature subgraphs.
7. The method for determining a voiceprint model according to claim 1, wherein the obtaining at least one speech spectrum segment of the target speech comprises:
determining a speech feature of each speech frame of the target speech to obtain a speech feature sequence of the target speech;
and segmenting the speech feature sequence of the target speech according to a preset segmentation rule to obtain the at least one speech spectrum segment of the target speech.
8. The method for determining a voiceprint model according to any one of claims 1 to 7, wherein the process of pre-establishing the voiceprint extraction model comprises:
acquiring training speech and obtaining at least one speech spectrum segment of the training speech;
determining at least one first feature map of each speech spectrum segment of the training speech through a current voiceprint extraction model, wherein feature points in the first feature map are mutually independent; if this is the first training iteration, the current voiceprint extraction model is an initial voiceprint extraction model, and otherwise it is the voiceprint extraction model obtained after the previous training iteration;
determining, through the current voiceprint extraction model, a second feature map containing global information corresponding to each first feature map of each speech spectrum segment of the training speech, to obtain at least one second feature map of each speech spectrum segment of the training speech, wherein the second feature map corresponding to a first feature map is a feature map obtained by strengthening a feature region in the first feature map that can distinguish voiceprints;
determining a voiceprint sub-model of each speech spectrum segment of the training speech by using at least the at least one second feature map of each speech spectrum segment of the training speech and the current voiceprint extraction model;
and predicting the voiceprint identity label corresponding to each speech spectrum segment of the training speech according to the voiceprint sub-model of each speech spectrum segment of the training speech, and updating parameters of the current voiceprint extraction model according to the prediction result.
9. The method for determining a voiceprint model according to claim 8, wherein the determining a voiceprint sub-model of each speech spectrum segment of the training speech by using at least the at least one second feature map of each speech spectrum segment of the training speech and the current voiceprint extraction model comprises:
for any speech spectrum segment of the training speech, fusing the at least one first feature map of the speech spectrum segment with the at least one second feature map of the speech spectrum segment through the current voiceprint extraction model to obtain a voiceprint sub-model of the speech spectrum segment, so as to obtain the voiceprint sub-model of each speech spectrum segment of the training speech.
10. An apparatus for determining a voiceprint model, comprising: a speech spectrum segment acquisition module, a first feature acquisition module, a second feature acquisition module and a voiceprint model determining module;
the speech spectrum segment acquisition module is configured to acquire at least one speech spectrum segment of target speech;
the first feature acquisition module is configured to determine at least one first feature map of each speech spectrum segment through a pre-established voiceprint extraction model, wherein feature points in the first feature maps are mutually independent;
the second feature acquisition module is configured to determine, through the voiceprint extraction model, a second feature map containing global information corresponding to each first feature map, to obtain at least one second feature map of each speech spectrum segment, wherein the second feature map corresponding to a first feature map is a feature map obtained by strengthening a feature region in the first feature map that can distinguish voiceprints;
and the voiceprint model determining module is configured to determine a voiceprint model of the target speech by using at least the at least one second feature map of each speech spectrum segment and the voiceprint extraction model.
11. An apparatus for determining a voiceprint model, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the method for determining a voiceprint model according to any one of claims 1 to 9.
12. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for determining a voiceprint model according to any one of claims 1 to 9.
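Claims 5 and 6 above pin down the global-information step only at the level of operations: split a first feature map into frequency-band subgraphs, reduce each subgraph with three same-size convolution kernels having different parameters, form attention weights from two of the reduced subgraphs, and combine those weights with the remaining one. This matches the shape of standard self-attention over spatial positions, so the PyTorch sketch below implements it that way; the 1x1 kernel size, the two-band split, and the channel sizes are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalInfoBlock(nn.Module):
    """Second-feature-map computation in the style of claims 5-6 (sketch)."""

    def __init__(self, channels, bands=2):
        super().__init__()
        self.bands = bands
        # Three convolution kernels of the same size but different parameters.
        self.q = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.k = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.v = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.out = nn.Conv2d(channels // 2, channels, kernel_size=1)

    def forward(self, x):                          # x: (N, C, freq, time)
        outs = []
        for sub in torch.chunk(x, self.bands, dim=2):    # first feature subgraphs per band
            q, k, v = self.q(sub), self.k(sub), self.v(sub)   # three reduced subgraphs
            n, c, h, w = q.shape
            q, k, v = (t.reshape(n, c, h * w) for t in (q, k, v))
            # Attention weights from two of the three reduced subgraphs.
            attn = F.softmax(q.transpose(1, 2) @ k, dim=-1)   # (N, HW, HW)
            # Weight the remaining subgraph: each position aggregates all others,
            # so the resulting second feature subgraph carries global information.
            sec = (v @ attn.transpose(1, 2)).reshape(n, c, h, w)
            outs.append(self.out(sec))
        # Recombine the per-band second feature subgraphs into the second feature map.
        return torch.cat(outs, dim=2)

x = torch.randn(1, 64, 8, 50)          # hypothetical batch of first feature maps
y = GlobalInfoBlock(64)(x)             # same shape as x: (1, 64, 8, 50)
```

A 1x1 kernel keeps the subgraph size while halving channels, a common choice for attention projections; the claims only require the three kernels to share a size and differ in parameters, and each band is processed independently, mirroring the per-subgraph treatment in claim 5.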
CN201910837580.2A 2019-09-05 2019-09-05 Method, device and equipment for determining voiceprint model and storage medium Active CN110517698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910837580.2A CN110517698B (en) 2019-09-05 2019-09-05 Method, device and equipment for determining voiceprint model and storage medium


Publications (2)

Publication Number Publication Date
CN110517698A (en) 2019-11-29
CN110517698B (en) 2022-02-01

Family

ID=68631262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910837580.2A Active CN110517698B (en) 2019-09-05 2019-09-05 Method, device and equipment for determining voiceprint model and storage medium

Country Status (1)

Country Link
CN (1) CN110517698B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113140222B (en) * 2021-05-10 2023-08-01 科大讯飞股份有限公司 Voiceprint vector extraction method, voiceprint vector extraction device, voiceprint vector extraction equipment and storage medium
CN113724713A (en) * 2021-09-07 2021-11-30 科大讯飞股份有限公司 Voice recognition method, device, equipment and storage medium
CN114333850B (en) * 2022-03-15 2022-08-19 清华大学 Voice voiceprint visualization method and device
CN116705036B (en) * 2023-08-08 2023-10-27 成都信息工程大学 Multi-level feature fusion-based phrase voice speaker recognition method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8190437B2 (en) * 2008-10-24 2012-05-29 Nuance Communications, Inc. Speaker verification methods and apparatus
WO2016015687A1 (en) * 2014-07-31 2016-02-04 腾讯科技(深圳)有限公司 Voiceprint verification method and device

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104485102A (en) * 2014-12-23 2015-04-01 智慧眼(湖南)科技发展有限公司 Voiceprint recognition method and device
EP3296991A1 (en) * 2015-12-30 2018-03-21 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for voiceprint authentication processing
CN106971732A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of method and system that the Application on Voiceprint Recognition degree of accuracy is lifted based on identification model
CN107274906A (en) * 2017-06-28 2017-10-20 百度在线网络技术(北京)有限公司 Voice information processing method, device, terminal and storage medium
CN108091326A (en) * 2018-02-11 2018-05-29 张晓雷 A kind of method for recognizing sound-groove and system based on linear regression
CN108899037A (en) * 2018-07-05 2018-11-27 平安科技(深圳)有限公司 Animal vocal print feature extracting method, device and electronic equipment
CN109378003A (en) * 2018-11-02 2019-02-22 科大讯飞股份有限公司 A kind of method and system of sound-groove model training
CN110047490A (en) * 2019-03-12 2019-07-23 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, equipment and computer readable storage medium
CN110120224A (en) * 2019-05-10 2019-08-13 平安科技(深圳)有限公司 Construction method, device, computer equipment and the storage medium of bird sound identification model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A study on voiceprint identification systems under the Internet environment; Hai-Yan Yang; 2011 International Conference on Wavelet Analysis and Pattern Recognition; 2011-09-08; full text *
Research and implementation of speaker recognition based on deep belief networks (基于深度信念网络的说话者识别研究与实现); 林舒都; China Master's Theses Full-text Database (中国优秀硕士学位论文全文数据库); 2018-02-15; I136-368 *
Research on key technologies of voiceprint feature extraction (声纹特征提取的关键技术研究); 宋家慧; Ship Science and Technology (舰船科学技术); 2016-01-23; full text *

Also Published As

Publication number Publication date
CN110517698A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110517698B (en) Method, device and equipment for determining voiceprint model and storage medium
CN112100349B (en) Multi-round dialogue method and device, electronic equipment and storage medium
CN107610709B (en) Method and system for training voiceprint recognition model
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN107481717B (en) Acoustic model training method and system
CN108989882B (en) Method and apparatus for outputting music pieces in video
CN109978060B (en) Training method and device of natural language element extraction model
CN109559735B (en) Voice recognition method, terminal equipment and medium based on neural network
CN110457679B (en) User portrait construction method, device, computer equipment and storage medium
CN109584887B (en) Method and device for generating voiceprint information extraction model and extracting voiceprint information
CN109448732B (en) Digital string voice processing method and device
CN111276119A (en) Voice generation method and system and computer equipment
CN112650842A (en) Human-computer interaction based customer service robot intention recognition method and related equipment
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
CN112528029A (en) Text classification model processing method and device, computer equipment and storage medium
CN111739537B (en) Semantic recognition method and device, storage medium and processor
CN109739968A (en) A kind of data processing method and device
CN111125379B (en) Knowledge base expansion method and device, electronic equipment and storage medium
CN113327584A (en) Language identification method, device, equipment and storage medium
CN110164417B (en) Language vector obtaining and language identification method and related device
CN117079671A (en) Audio processing method, device, computer equipment and storage medium
CN116129881A (en) Voice task processing method and device, electronic equipment and storage medium
CN115496734A (en) Quality evaluation method of video content, network training method and device
CN114218428A (en) Audio data clustering method, device, equipment and storage medium
CN114495903A (en) Language category identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant