CN110970036A - Voiceprint recognition method and device, computer storage medium and electronic equipment - Google Patents

Voiceprint recognition method and device, computer storage medium and electronic equipment

Info

Publication number
CN110970036A
CN110970036A (application No. CN201911346904.9A)
Authority
CN
China
Prior art keywords
voice
voiceprint
recognized
voiceprint recognition
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911346904.9A
Other languages
Chinese (zh)
Other versions
CN110970036B (en)
Inventor
万里红
雷进
陈康
王润琦
陆海天
张伟东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN201911346904.9A
Publication of CN110970036A
Application granted
Publication of CN110970036B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to the technical field of computers, and provides a voiceprint recognition method, a voiceprint recognition apparatus, a computer storage medium, and an electronic device. The voiceprint recognition method includes: acquiring the human voice contained in the voice to be recognized; extracting a first acoustic feature corresponding to the human voice and inputting the first acoustic feature into a voiceprint recognition model to obtain a preliminary voiceprint recognition result corresponding to the voice to be recognized, wherein the preliminary voiceprint recognition result comprises attribute prediction results respectively corresponding to a plurality of voiceprint attributes; and determining a speaker recognition result corresponding to the voice to be recognized according to the attribute prediction results respectively corresponding to the plurality of voiceprint attributes. The voiceprint recognition method can effectively distinguish noise from human voice and improves recognition accuracy.

Description

Voiceprint recognition method and device, computer storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a voiceprint recognition method, a voiceprint recognition apparatus, a computer storage medium, and an electronic device.
Background
A voiceprint is the spectrum of sound waves carrying speech information, as displayed by an electro-acoustic instrument. The production of human speech is a complex physiological and physical process involving the language centers of the brain and the vocal organs, and the vocal organs used during speech (tongue, teeth, larynx, lungs, nasal cavity) vary greatly from person to person in size and shape, so the voiceprint maps of any two people are different. The speech acoustic characteristics of each person are relatively stable yet variable; they are not absolute or invariant. The variation can come from physiology, pathology, psychology, imitation or disguise, and is also related to environmental interference. Nevertheless, since each person's vocal organs are different, people can in general distinguish different voices or judge whether two voices belong to the same person.
At present, voiceprint recognition is generally realized by comparing audio signals. On the one hand, noise or environmental sound in the audio is easily mistaken for human voice during voiceprint extraction, which degrades subsequent recognition performance; moreover, a separate learning model needs to be established for different voiceprint recognition tasks, so practical application efficiency is low. On the other hand, voice recordings in games vary in duration, contain rich emotion and span all age groups, which makes model learning difficult and unstable.
In view of the above, there is a need in the art to develop a new voiceprint recognition method and apparatus.
It is to be noted that the information disclosed in the background section above is only used to enhance understanding of the background of the present disclosure.
Disclosure of Invention
The present disclosure is directed to providing a voiceprint recognition method, a voiceprint recognition apparatus, a computer storage medium, and an electronic device, so as to overcome, at least to a certain extent, the low accuracy of voiceprint recognition methods in the prior art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a voiceprint recognition method comprising: acquiring voice of a person contained in the voice to be recognized; extracting first acoustic features corresponding to the voice of the person, inputting the first acoustic features into a voiceprint recognition model to obtain a preliminary voiceprint recognition result corresponding to the voice to be recognized, wherein the preliminary voiceprint recognition result comprises attribute prediction results corresponding to various voiceprint attributes respectively; and determining a speaker recognition result corresponding to the voice to be recognized according to the attribute prediction results respectively corresponding to the multiple voiceprint attributes.
In an exemplary embodiment of the present disclosure, the acquiring voice of a person included in the voice to be recognized includes: extracting a second acoustic feature corresponding to the voice to be recognized; determining the human voice acoustic characteristics in the second acoustic characteristics according to the human voice phonemes; and determining the human voice contained in the voice to be recognized according to the human acoustic features in the plurality of segments of the second acoustic features.
In an exemplary embodiment of the present disclosure, the extracting a second acoustic feature corresponding to the speech to be recognized includes: pre-emphasis processing is carried out on the voice to be recognized to obtain a target voice signal; performing framing processing on the target voice signal to obtain a framing result; windowing the framing result to obtain a windowed signal; performing fast Fourier transform on the windowed signal to obtain frequency spectrum information; performing a modulus square on the frequency spectrum information to obtain a power spectrum; inputting the power spectrum into a triangular filter bank to obtain a logarithmic energy spectrum output by the triangular filter bank; and determining the logarithmic energy spectrum as the second acoustic feature corresponding to the voice to be recognized.
In an exemplary embodiment of the present disclosure, the method further comprises: comparing the human voice phonemes with the multiple segments of the second acoustic features respectively to obtain feature vectors representing comparison results; if the value of the target component in a feature vector is not the maximum value, determining that the corresponding second acoustic feature is a human voice acoustic feature; acquiring multiple human voice segments corresponding to the human voice acoustic features; and splicing the multiple human voice segments to obtain the human voice contained in the voice to be recognized.
In an exemplary embodiment of the present disclosure, the method further comprises: if the value of the target component in the feature vector is the maximum value, determining that the second acoustic feature is an interference feature; and eliminating the voice segments corresponding to the interference features.
In an exemplary embodiment of the present disclosure, the determining, according to the attribute prediction results respectively corresponding to the multiple voiceprint attributes, a speaker recognition result corresponding to the speech to be recognized includes: accumulating and summing the attribute prediction results corresponding to the various voiceprint attributes respectively to obtain an accumulated value; and determining the attribute prediction result corresponding to the accumulated value with the largest numerical value as the speaker recognition result corresponding to the voice to be recognized.
In an exemplary embodiment of the present disclosure, the plurality of voiceprint attributes includes at least: identity attribute, gender attribute, age attribute, and language attribute.
In an exemplary embodiment of the present disclosure, the method further comprises: performing data enhancement processing on the obtained original voice sample to obtain an extended sample; performing phoneme labeling on the extended sample to obtain a sample label; and training a machine learning model according to the original voice sample and the sample label to obtain the voiceprint recognition model.
In an exemplary embodiment of the present disclosure, the training a machine learning model according to the original speech sample and the sample label to obtain the voiceprint recognition model includes: and converging the original voice sample and the sample label according to a cross entropy loss function to obtain the voiceprint recognition model.
According to a second aspect of the disclosure, a voiceprint recognition device is provided, which includes an obtaining module, configured to obtain a human voice included in a voice to be recognized; the information extraction module is used for extracting first acoustic features corresponding to the voice of the person, inputting the first acoustic features into a voiceprint recognition model to obtain a preliminary voiceprint recognition result corresponding to the voice to be recognized, wherein the preliminary voiceprint recognition result comprises attribute prediction results corresponding to various voiceprint attributes respectively; and the joint optimization module is used for determining a speaker recognition result corresponding to the voice to be recognized according to the attribute prediction results respectively corresponding to the multiple voiceprint attributes.
According to a third aspect of the present disclosure, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the voiceprint recognition method of the first aspect described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the voiceprint recognition method of the first aspect described above via execution of the executable instructions.
As can be seen from the foregoing technical solutions, the voiceprint recognition method, the voiceprint recognition apparatus, the computer storage medium and the electronic device in the exemplary embodiments of the present disclosure have at least the following advantages and positive effects:
in the technical solutions provided in some embodiments of the present disclosure, on one hand, the vocal speech included in the speech to be recognized is obtained, so that the technical problem in the prior art that the environmental sound or noise is determined as the vocal by mistake can be solved, invalid redundant data in the speech is eliminated, and the accuracy of subsequent recognition is improved. The method comprises the steps of extracting first acoustic features corresponding to voice, inputting the first acoustic features into a voiceprint recognition model to obtain an initial voiceprint recognition result corresponding to the voice to be recognized, wherein the initial voiceprint recognition result comprises attribute prediction results corresponding to various voiceprint attributes respectively, and recognition speed and recognition accuracy can be improved. The speaker recognition result corresponding to the voice to be recognized is determined according to the attribute prediction results corresponding to the various voiceprint attributes, joint statistics can be carried out on the attribute prediction results corresponding to the various voiceprint attributes, the number of models and calculation resources are obviously reduced, and the model prediction efficiency and the recognition accuracy are improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 shows a flow diagram of a voiceprint recognition method in an exemplary embodiment of the disclosure;
FIG. 2 shows a flow diagram of a voiceprint recognition method in another exemplary embodiment of the disclosure;
FIG. 3 shows a flow diagram of a voiceprint recognition method in yet another exemplary embodiment of the present disclosure;
FIG. 4 shows a flow diagram of a voiceprint recognition method in a further exemplary embodiment of the disclosure;
FIG. 5 shows a flow diagram of a voiceprint recognition method in an example embodiment of the disclosure;
fig. 6 shows a schematic structural diagram of a voiceprint recognition apparatus in an exemplary embodiment of the disclosure;
FIG. 7 shows a schematic diagram of a structure of a computer storage medium in an exemplary embodiment of the disclosure;
fig. 8 shows a schematic structural diagram of an electronic device in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/parts/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first" and "second", etc. are used merely as labels, and are not limiting on the number of their objects.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
Currently, voiceprint recognition is typically accomplished based on template matching. Specifically, voiceprint recognition is generally realized by comparing audio signals. On the one hand, noise or environmental sound in the audio is easily mistaken for human voice during voiceprint extraction, which degrades subsequent recognition performance; moreover, a separate learning model needs to be established for different voiceprint recognition tasks, so practical application efficiency is low. On the other hand, voice recordings in games vary in duration, contain rich emotion and span all age groups, which makes model learning difficult and unstable.
In the embodiments of the present disclosure, a voiceprint recognition method is provided first, which overcomes, at least to some extent, the defect of low accuracy of the voiceprint recognition method provided in the prior art.
Fig. 1 is a flowchart illustrating a voiceprint recognition method according to an exemplary embodiment of the present disclosure, where an execution subject of the voiceprint recognition method may be a server that recognizes a voiceprint.
Referring to fig. 1, a voiceprint recognition method according to one embodiment of the present disclosure includes the steps of:
step S110, acquiring voice of a person contained in the voice to be recognized;
step S120, extracting first acoustic features corresponding to human voice, inputting the first acoustic features into a voiceprint recognition model to obtain a preliminary voiceprint recognition result corresponding to the voice to be recognized, wherein the preliminary voiceprint recognition result comprises attribute prediction results corresponding to various voiceprint attributes respectively;
step S130, determining a speaker recognition result corresponding to the voice to be recognized according to the attribute prediction results respectively corresponding to the multiple voiceprint attributes.
In the technical scheme provided by the embodiment shown in fig. 1, on one hand, the voice of the person included in the voice to be recognized is obtained, so that the technical problem that the environmental sound or noise is judged as the person by mistake in the prior art can be solved, invalid redundant data in the voice is eliminated, and the accuracy of subsequent recognition is improved. The method comprises the steps of extracting first acoustic features corresponding to voice, inputting the first acoustic features into a voiceprint recognition model to obtain an initial voiceprint recognition result corresponding to the voice to be recognized, wherein the initial voiceprint recognition result comprises attribute prediction results corresponding to various voiceprint attributes respectively, and recognition speed and recognition accuracy can be improved. The speaker recognition result corresponding to the voice to be recognized is determined according to the attribute prediction results corresponding to the various voiceprint attributes, joint statistics can be carried out on the attribute prediction results corresponding to the various voiceprint attributes, the number of models and calculation resources are obviously reduced, and the model prediction efficiency and the recognition accuracy are improved.
The following describes the specific implementation of each step in fig. 1 in detail:
in step S110, a human voice included in the voice to be recognized is acquired.
In an exemplary embodiment of the present disclosure, the human voice contained in the voice to be recognized may be acquired. The voice to be recognized may be, for example, one of a large number of voice dialogues recorded for a game by dubbing actors.
Specifically, referring to fig. 2, fig. 2 shows a flow diagram of a voiceprint recognition method in another exemplary embodiment of the present disclosure, specifically shows a flow diagram of acquiring a human voice included in a voice to be recognized, including steps S201 to S202, and the step S110 is explained below with reference to fig. 2.
In step S201, a second acoustic feature corresponding to the speech to be recognized is extracted.
In an exemplary embodiment of the present disclosure, the acoustic features may include MFCC features (Mel-Frequency Cepstral Coefficients), PLP features (Perceptual Linear Predictive coefficients), Tandem features (a "new" feature obtained by taking the logarithm of the posterior probabilities output by an ANN and applying decorrelation and dimensionality reduction), Bottleneck features (the activations of the bottleneck layer, i.e. the last layer of the network before the fully connected layer), and Filterbank features (equivalent to MFCC features without the discrete cosine transform; Filterbank features retain more of the original speech characteristics than MFCC features).
In the exemplary embodiment of the present disclosure, the Filterbank feature is taken as an example of the second acoustic feature. Referring to fig. 3, fig. 3 shows a flowchart of a voiceprint recognition method in another exemplary embodiment of the present disclosure, and specifically shows a flowchart of obtaining the second acoustic feature corresponding to the speech to be recognized, which includes steps S301 to S306. Step S201 is described below with reference to fig. 3.
In step S301, pre-emphasis processing is performed on the speech to be recognized, so as to obtain a target speech signal.
In an exemplary embodiment of the present disclosure, the voice signal of the voice to be recognized may be subjected to pre-emphasis processing to obtain a target voice signal. Pre-emphasis deliberately boosts the amplitude of certain spectral components of a signal relative to the others in advance, in order to facilitate transmission or recording of the signal. Specifically, the speech to be recognized may be passed through a first-order finite impulse response high-pass filter (a high-pass filter strongly suppresses signal components below a certain frequency and lets components above that frequency pass), so that the spectrum of the signal becomes flatter and less susceptible to the finite word-length effect.
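As a minimal sketch of the pre-emphasis step (assuming the speech samples are held in a NumPy array and using the commonly chosen coefficient 0.97, which is not specified in this disclosure):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass (pre-emphasis) filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```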
In step S302, a framing process is performed on the target speech signal to obtain a framing result.
In an exemplary embodiment of the disclosure, after the target speech signal is obtained, it may be framed to obtain a framing result. Every N sampling points are grouped into one observation unit, called a frame. Typically N is 256 or 512, covering a time window of about 20 to 30 ms. To avoid excessive variation between two adjacent frames, adjacent frames overlap, and the overlapping region contains N/2 or N/3 sampling points. Illustratively, a frame length of 32 ms and a frame shift of 16 ms may be set. The target speech signal is then framed accordingly to obtain a framing result containing multiple frame signals.
In step S303, a windowing process is performed on the framing result to obtain a windowed signal.
In an exemplary embodiment of the present disclosure, after the framing result is obtained, windowing may be applied to it to obtain a windowed signal. Specifically, each frame signal in the framing result may be multiplied by a Hamming window (a window function that is nonzero within a certain interval and zero elsewhere) to obtain the windowed signal, thereby alleviating the Gibbs effect (the ringing that arises when a periodic function with discontinuities, such as a rectangular pulse, is expanded as a Fourier series and only a finite number of terms are kept for synthesis).
It should be noted that the window function used in the windowing process may include a rectangular window, a gaussian window, a hamming window, a Bartlett window, a Blackman window, etc., and may be set according to actual situations, which belongs to the protection scope of the present disclosure.
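The framing and windowing steps described above might be sketched as follows; the 32 ms frame length and 16 ms frame shift follow the example in the text, while the 16 kHz sampling rate and the assumption that the signal is at least one frame long are illustrative:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, sample_rate: int = 16000,
                     frame_ms: float = 32.0, shift_ms: float = 16.0) -> np.ndarray:
    """Split the signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 512 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 256 samples at 16 kHz
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames * window  # windowed signal, shape (num_frames, frame_len)
```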
In step S304, a fast fourier transform is performed on the windowed signal to obtain corresponding spectral information.
In an exemplary embodiment of the present disclosure, after the windowed signal is obtained, a fast Fourier transform may be performed on each frame of the windowed signal to transform the time-domain signal into a frequency-domain signal (its spectrum).
Specifically, the calculation formula of Fast Fourier Transform (FFT) is the following formula (1):
X(k) = Σ_{n=0}^{N-1} x(n) e^{-j2πnk/N},  0 ≤ k ≤ N-1    (1)
where x(n) is the input speech signal and N is the number of points of the Fourier transform. Based on this formula, the computation can be divided into three steps: first, the single N-point time-domain signal is decomposed into N one-point time-domain signals; second, the frequency domain of each of the N one-point time-domain signals is computed, yielding N frequency-domain points; third, the N frequency-domain points are combined in a specific order to obtain the corresponding spectral information.
In step S305, the spectral information is squared modulo to obtain a power spectrum.
In an exemplary embodiment of the present disclosure, after the spectral information is obtained, it may be modulus-squared to obtain the power spectrum |X_a(k)|².
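Steps S304 and S305 together amount to computing |X_a(k)|² for every windowed frame; a sketch assuming N = 512 FFT points:

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """FFT each windowed frame and take the squared magnitude (modulus square)."""
    spectrum = np.fft.rfft(frames, n=n_fft)  # spectral information X_a(k)
    return np.abs(spectrum) ** 2             # power spectrum |X_a(k)|^2
```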
In step S306, the power spectrum is input into the triangular filter bank to obtain a logarithmic energy spectrum output by the triangular filter bank; and determining the log energy spectrum as a second acoustic feature corresponding to the voice to be recognized.
In an exemplary embodiment of the present disclosure, after obtaining the power spectrum, the power spectrum may be input into a triangular filter bank, and a logarithmic energy spectrum is obtained according to an output of the triangular filter bank.
In an exemplary embodiment of the present disclosure, a filter bank of M triangular filters may be predefined (the number of filters is close to the number of critical bands), with center frequencies f(m), m = 1, 2, …, M; the power spectrum is then passed through this set of Mel-scale triangular filters; finally, the logarithmic energy output by each filter is calculated based on the following equation (2):
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|² H_m(k) ),  0 ≤ m < M    (2)
where N is the number of points of the Fourier transform, M is the number of triangular filters, and H_m(k) denotes the m-th triangular filter.
In an exemplary embodiment of the present disclosure, after the log energy spectrum (i.e., the Filterbank feature) is calculated, the log energy spectrum may be determined as a second acoustic feature corresponding to the speech to be recognized. Thus, redundancy of the voice signal can be eliminated, and distinctiveness between speakers can be enhanced.
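A sketch of step S306 under the usual Mel-scale construction of the triangular filter bank; the exact filter design is not spelled out in the text, so the Mel spacing and the choice of M = 40 filters are assumptions:

```python
import numpy as np

def mel_filterbank(n_filters: int = 40, n_fft: int = 512,
                   sample_rate: int = 16000) -> np.ndarray:
    """Build M triangular filters H_m(k) with centres equally spaced on the Mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def log_energy_spectrum(power_spec: np.ndarray, fbank: np.ndarray) -> np.ndarray:
    """Log energy of each filter output, i.e. the Filterbank (second acoustic) feature."""
    return np.log(power_spec @ fbank.T + 1e-10)  # small epsilon avoids log(0)
```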
With continuing reference to fig. 2, in step S202, determining the human voice acoustic features in the plurality of segments of the second acoustic features respectively according to the human voice phonemes; and determining the human voice contained in the voice to be recognized according to the human voice acoustic characteristics in the multiple sections of second acoustic characteristics.
In an exemplary embodiment of the present disclosure, after obtaining the plurality of segments of the second acoustic features, the vocal acoustic features included in the plurality of segments of the second acoustic features may be determined according to the vocal phonemes, and then, the vocal speech included in the speech to be recognized is determined. The phoneme is the smallest unit or the smallest voice fragment constituting a syllable, and is the smallest linear voice unit divided from the viewpoint of voice quality.
Specifically, referring to fig. 4, fig. 4 shows a flow chart of a voiceprint recognition method in another exemplary embodiment of the present disclosure, and specifically shows a flow chart of determining a human voice included in a voice to be recognized, which includes steps S401 to S404, and the step S202 is explained below with reference to fig. 4.
In step S401, the human voice phonemes are respectively compared with the multiple segments of the second acoustic features to obtain feature vectors representing comparison results.
In an exemplary embodiment of the present disclosure, a human voice detection model (a machine learning model for detecting and recognizing human voice and interfering sound (i.e., non-human sound such as noise or environmental sound) in the second acoustic features) may be trained in advance. The training process of the human voice detection model may be as follows: a large number of voice samples and their corresponding label information (used to mark whether a voice sample is human voice or interfering sound) are obtained in advance, the voice samples and label information are then input into a machine learning model, and the parameters are adjusted over multiple iterations so that the loss function of the machine learning model converges, yielding the human voice detection model. The machine learning model may be a CNN (Convolutional Neural Network) with a ResNet (Residual Neural Network) structure.
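For illustration only, a toy convolutional classifier of the kind described above; the actual ResNet-style architecture, input size and phoneme count c are not given in the disclosure, and PyTorch is assumed as the framework:

```python
import torch
import torch.nn as nn

class VoiceDetector(nn.Module):
    """Toy CNN mapping a feature-map segment to c + 1 scores:
    c human-voice phoneme classes plus one interference (non-voice) class."""
    def __init__(self, num_phonemes: int = 40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, num_phonemes + 1)  # c + 1 outputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, mel_bins) feature-map segment from one sliding window
        h = self.conv(x).flatten(1)
        return torch.softmax(self.fc(h), dim=-1)  # [P_0, ..., P_{c-1}, P_c]
```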
In an exemplary embodiment of the present disclosure, the human voice phonemes may be respectively compared with the multiple segments of the second acoustic features, so as to obtain feature vectors representing comparison results. Specifically, a sliding window (a fixed window size and a fixed step length) may be set, the feature map corresponding to the second acoustic feature is captured by sliding from left to right to obtain a plurality of feature map segments, the feature map segments are input into a trained voice detection model, the second acoustic feature in each sliding window is compared with a pre-labeled voice phoneme based on the voice detection model, and a feature vector representing a comparison result is output. Thus, the stability of model prediction can be enhanced.
For example, when the feature map corresponding to the second acoustic feature is divided into Y sliding windows, the human voice detection model outputs Y feature vectors, each of dimension c + 1. For the second acoustic feature contained in any one sliding window, the feature vector output by the human voice detection model may be a (c + 1)-dimensional vector [P_0, P_1, P_2, …, P_{c-1}, P_c], where the components P_0, P_1, P_2, …, P_{c-1} represent the predicted probabilities that the sliding window corresponds to each human voice phoneme, and the target component P_c represents the predicted probability of interfering sound (i.e., non-human sound, such as ambient sound, noise, silence, etc.). Here c is a positive integer representing the total number of phonemes.
In step S402, if the value of the target component in the feature vector is not the maximum value, it is determined that the second acoustic feature is a human acoustic feature.
In an exemplary embodiment of the present disclosure, if the value of the above target component P_c is not the maximum, i.e., if any of the components P_0, P_1, P_2, …, P_{c-1} is greater than the target component P_c, then the second acoustic feature within the sliding window may be determined to be a human voice acoustic feature.
In an exemplary embodiment of the present disclosure, if the value of the target component in the feature vector is the maximum, i.e., the target component P_c is greater than each of P_0, P_1, P_2, …, P_{c-1}, the second acoustic feature contained in the sliding window may be determined to be an interference feature (i.e., a non-human-voice feature), and the speech segment corresponding to the interference feature may then be removed. This solves the technical problem in the prior art that environmental sound or noise is mistakenly judged as human voice, eliminates invalid redundant data in the speech, and improves the accuracy of subsequent recognition.
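The keep/discard rule of step S402 and the interference case above can be sketched as follows, assuming window_probs is the (Y, c + 1) array of detection-model outputs whose last column is the interference component P_c:

```python
import numpy as np

def select_voice_windows(window_probs: np.ndarray) -> np.ndarray:
    """Boolean mask over the Y sliding windows: True where the window is judged to be
    human voice (the interference component P_c is not the maximum of the vector)."""
    interference = window_probs[:, -1]               # target component P_c per window
    best_phoneme = window_probs[:, :-1].max(axis=1)  # max over P_0 ... P_{c-1}
    return best_phoneme > interference               # keep voice, drop interference
```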
In step S403, multiple segments of human voice segments corresponding to the human voice acoustic features are obtained.
In an exemplary embodiment of the present disclosure, after the human voice acoustic features are determined, multiple segments of human voice segments corresponding to the human voice acoustic features may be obtained. Referring to the above-mentioned related explanation of step S401, after the detection of the above-mentioned Y sliding windows is completed, for example, the obtained vocal segments may be Y-1 segments.
In step S404, the multiple segments of voice segments are spliced to obtain voice voices contained in the voice to be recognized.
In an exemplary embodiment of the present disclosure, after the Y-1 segment of vocal segment is obtained, the Y-1 segment of vocal segment may be subjected to a concatenation process to obtain vocal speech included in the speech to be recognized.
With reference to fig. 1, in step S120, a first acoustic feature corresponding to the human voice is extracted, and the first acoustic feature is input into the voiceprint recognition model to obtain a preliminary voiceprint recognition result corresponding to the voice to be recognized.
In an exemplary embodiment of the present disclosure, after the human voice is obtained, the first acoustic feature corresponding to the human voice may be extracted. Specifically, referring to the explanations of steps S301 to S306 above: pre-emphasis, framing and windowing are first performed on the human voice; the windowed signal is then fast Fourier transformed, and the resulting spectral information is modulus-squared to obtain a power spectrum; finally, the power spectrum is input into a triangular filter bank, and the logarithmic energy spectrum output by the filter bank is taken as the first acoustic feature corresponding to the human voice.
It should be noted that the specific parameter settings used when processing the human voice (for example, the framing parameters and the type of window function) can be set according to the actual situation, and all such settings fall within the protection scope of the present disclosure.
In an exemplary embodiment of the present disclosure, after the first acoustic feature is extracted, the first acoustic feature may be input into a voiceprint recognition model, and a preliminary voiceprint recognition result corresponding to the voice to be recognized is determined according to the output of the voiceprint recognition model. The preliminary voiceprint recognition result comprises attribute prediction results respectively corresponding to a plurality of voiceprint attributes. The plurality of voiceprint attributes includes at least: an identity attribute (e.g., speaker name, occupation), a gender attribute (male or female), an age attribute (e.g., 10, 20, 30, etc.) and a language attribute (Chinese, English, Korean, etc.), which can be set according to the actual situation and all fall within the protection scope of the present disclosure.
In an exemplary embodiment of the present disclosure, for example, referring to fig. 5, fig. 5 shows a flowchart of a voiceprint recognition method in an exemplary embodiment of the present disclosure, and specifically shows a flowchart of training to obtain the voiceprint recognition model, which includes steps S501 to S503, and the following explains a specific implementation manner with reference to fig. 5.
In step S501, data enhancement processing is performed on the acquired original voice sample to obtain an extended sample.
In an exemplary embodiment of the present disclosure, a large number of original voice samples may be obtained, and data enhancement processing may be performed on the original voice samples. Specifically, the data enhancement processing may be performed based on the SoX tool ("the Swiss Army knife of sound processing programs", abbreviated SoX, a cross-platform command-line tool used to convert between various computer audio file formats and to apply various sound effects). In this way, the number of samples can be expanded, the learning performance of the model under a small number of samples is significantly improved, and the generalization ability during model training is enhanced.
In an exemplary embodiment of the present disclosure, for example, after one piece of original voice data is obtained, it may be subjected to data enhancement processing to obtain 39 corresponding extended samples. The extended samples may cover 39 sound effects (7 basic sound effects + 18 equalization filter sound effects + 14 pitch-variation sound effects); see Tables 1 to 3 below, where Table 1 lists the 7 basic sound effects and their parameters, Table 2 lists the 18 equalization filter sound effects and their parameters, and Table 3 lists the 14 pitch-variation sound effects and their parameters.
TABLE 1
Sound effect name | Parameter(s)
Flanging | {Effect.COMMAND: 'flanger'}
Phase shift 1 | {Effect.COMMAND: 'phaser 0.8 0.74 3 0.4 0.5 -t'}
Phase shift 2 | {Effect.COMMAND: 'phaser 0.6 0.66 3 0.6 2 -t'}
Nonlinear distortion 1 | {Effect.COMMAND: 'overdrive'}
Nonlinear distortion 2 | {Effect.COMMAND: 'overdrive 10 10'}
Reverberation 1 | {Effect.COMMAND: 'reverb'}
Reverberation 2 | {Effect.COMMAND: 'reverb 30 30'}
TABLE 2
(The 18 equalization filter sound effects and their parameters; presented as an image in the original publication.)
TABLE 3
(The 14 pitch-variation sound effects and their parameters; presented as an image in the original publication.)
It should be noted that the sound effect names and their parameter settings can be chosen according to the actual situation, and all such choices fall within the protection scope of the present disclosure.
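A minimal sketch of driving the SoX command-line tool to generate augmented copies; the effect strings mirror Table 1, while the file layout and the wrapper itself are illustrative assumptions rather than part of the disclosure:

```python
import subprocess
from pathlib import Path

# The 7 basic sound effects from Table 1, expressed as SoX effect arguments.
BASIC_EFFECTS = {
    "flanger":     "flanger",
    "phaser_1":    "phaser 0.8 0.74 3 0.4 0.5 -t",
    "phaser_2":    "phaser 0.6 0.66 3 0.6 2 -t",
    "overdrive_1": "overdrive",
    "overdrive_2": "overdrive 10 10",
    "reverb_1":    "reverb",
    "reverb_2":    "reverb 30 30",
}

def augment(wav_path: str, out_dir: str = "augmented") -> None:
    """Create one augmented copy of wav_path per basic sound effect."""
    Path(out_dir).mkdir(exist_ok=True)
    for name, effect in BASIC_EFFECTS.items():
        out_file = Path(out_dir) / f"{Path(wav_path).stem}_{name}.wav"
        subprocess.run(["sox", wav_path, str(out_file), *effect.split()], check=True)
```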
In step S502, phoneme labeling is performed on the extended sample to obtain a sample label.
In an exemplary embodiment of the present disclosure, after the extended exemplar is obtained, the extended exemplar may be subjected to phoneme labeling to obtain an exemplar label. For example, the speaker identity, the speaker age, the speaker language, and the like of each extended sample may be labeled to obtain a sample label of each extended sample.
In step S503, a machine learning model is trained according to the original speech sample and the sample label, and a voiceprint recognition model is obtained.
In an exemplary embodiment of the present disclosure, a machine learning model may be trained according to the original speech sample and the sample label, resulting in a voiceprint recognition model. Specifically, the original voice sample and the sample label may be input into a machine learning model, and the cross entropy loss function is used as a target function to converge the original voice sample and the sample label, so as to obtain the trained voiceprint recognition model. The machine learning model may be a CNN model of a ResNet structure.
The cross entropy loss function is expressed as H(p, q) = -Σ p(x) log(q(x)), where p denotes the distribution of the true labels and q is the label distribution predicted by the trained model; the cross-entropy objective function measures the similarity between p and q. Training the machine learning model with the cross entropy loss function keeps the training process simple and effective and easy to implement adaptively on a computer.
In particular, when the recognizable voiceprint attributes include four categories (identity attribute, gender attribute, age attribute and language attribute), the cross entropy loss function used in training for recognition of the identity attribute may be H_speaker(p_speaker, q_speaker) = -Σ p_speaker(x) log(q_speaker(x)). The cross entropy loss function used in training for recognition of the gender attribute may be H_gender(p_gender, q_gender) = -Σ p_gender(x) log(q_gender(x)). The cross entropy loss function used in training for recognition of the age attribute may be H_age(p_age, q_age) = -Σ p_age(x) log(q_age(x)). The cross entropy loss function used in training for recognition of the language attribute may be H_language(p_language, q_language) = -Σ p_language(x) log(q_language(x)).
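During training, the four per-attribute cross-entropy terms would typically be combined by simple summation; a sketch assuming PyTorch, with the model producing one logit vector per attribute:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # cross entropy H(p, q) = -sum p(x) log q(x)

def multi_attribute_loss(logits: dict, targets: dict) -> torch.Tensor:
    """Sum of the speaker, gender, age and language cross-entropy terms."""
    return (criterion(logits["speaker"], targets["speaker"])
            + criterion(logits["gender"], targets["gender"])
            + criterion(logits["age"], targets["age"])
            + criterion(logits["language"], targets["language"]))
```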
In an exemplary embodiment of the present disclosure, after the voiceprint recognition model is obtained through training, the feature map corresponding to the first acoustic feature may be input into the voiceprint recognition model in a form of a sliding window, the first acoustic feature is recognized based on the voiceprint recognition model, and then, according to the output of the voiceprint recognition model, a preliminary voiceprint recognition result corresponding to the speech to be recognized is obtained. The preliminary voiceprint recognition result may include one or more of an identity prediction result corresponding to the identity attribute of the speaker, a gender prediction result corresponding to the gender attribute, an age prediction result corresponding to the age attribute, and a language prediction result corresponding to the language attribute, and specifically, the preliminary voiceprint recognition result may be set by the user according to an actual situation, and belongs to the protection scope of the present disclosure.
For example, taking recognition of the identity attribute of the voice to be recognized as an example, in the preliminary voiceprint recognition result output by the above voiceprint recognition model, the attribute prediction result of the identity attribute may include x categories in total (e.g., Zhang San, Li Si, …, Wang Wu).
For each sliding window, the voiceprint recognition model can output an attribute prediction result, and one attribute prediction result corresponds to one prediction vector.
For example, when n of the sliding windows yield attribute prediction results of "Zhang San", the number of prediction vectors corresponding to the attribute prediction result "Zhang San" may be n, for example: [a_1, b_1, c_1, …, z_1]^T, [a_2, b_2, c_2, …, z_2]^T, …, [a_n, b_n, c_n, …, z_n]^T. When f of the sliding windows yield attribute prediction results of "Li Si", the number of prediction vectors corresponding to the attribute prediction result "Li Si" may be f, for example: [a_11, b_11, c_11, …, z_11]^T, [a_21, b_21, c_21, …, z_21]^T, …, [a_f1, b_f1, c_f1, …, z_f1]^T.
With continued reference to fig. 1, in step S130, a speaker recognition result corresponding to the speech to be recognized is determined according to the attribute prediction results corresponding to the various voiceprint attributes, respectively.
In an exemplary embodiment of the disclosure, after the preliminary voiceprint recognition result, i.e., the attribute prediction results respectively corresponding to the plurality of voiceprint attributes, is obtained, the preliminary voiceprint recognition result may be jointly optimized based on an objective function H_joint to determine the speaker recognition result corresponding to the voice to be recognized. Illustratively, the expression of the objective function H_joint may be:
H_joint = H_speaker(p_speaker, q_speaker) + H_gender(p_gender, q_gender) + H_age(p_age, q_age) + H_language(p_language, q_language)
Further, the preliminary voiceprint recognition result is jointly optimized using the above objective function H_joint; specifically, the attribute prediction result corresponding to the component with the largest value is determined as the speaker recognition result corresponding to the voice to be recognized. In this way, the number of models and the required computing resources can be significantly reduced, and the prediction efficiency of the models is improved.
Referring to the above explanation, the n prediction vectors may be accumulated and summed to obtain the accumulated value S_11 corresponding to the attribute prediction result "Zhang San": S_11 = [a_1 + a_2 + … + a_n, b_1 + b_2 + … + b_n, c_1 + c_2 + … + c_n, …, z_1 + z_2 + … + z_n]^T. The f prediction vectors may be accumulated and summed to obtain the accumulated value S_12 corresponding to the attribute prediction result "Li Si": S_12 = [a_11 + a_21 + … + a_f1, b_11 + b_21 + … + b_f1, c_11 + c_21 + … + c_f1, …, z_11 + z_21 + … + z_f1]^T. Similarly, the vector corresponding to the recognition result "Wang Wu" may be S_1x. In summary, the prediction for the identity attribute can be expressed as the prediction vector A = [S_11, S_12, …, S_1x]. Further, when the value of the accumulated value S_11 (which may be the modulus of the vector S_11) is the largest, and the attribute prediction result corresponding to it is "Zhang San", the attribute prediction result "Zhang San" can be determined as the speaker recognition result corresponding to the voice to be recognized (i.e., the recognition result corresponding to the identity attribute of the speaker).
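The joint-statistics step for a single attribute can be sketched as follows: the per-window prediction vectors are accumulated and the candidate with the largest accumulated value is taken as the recognition result (NumPy assumed; the candidate names are the placeholder examples used in the text):

```python
import numpy as np

def decide_attribute(window_predictions: np.ndarray, candidates: list) -> str:
    """window_predictions: (num_windows, num_candidates) model outputs for one attribute.
    Accumulate over the windows, then pick the candidate with the largest value."""
    accumulated = window_predictions.sum(axis=0)  # S_11, S_12, ..., S_1x
    return candidates[int(np.argmax(accumulated))]

# Example (hypothetical): identity attribute over Y sliding windows
# speaker = decide_attribute(identity_probs, ["Zhang San", "Li Si", "Wang Wu"])
```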
Similarly, with reference to the explanation of the above steps, the attribute prediction results for the gender attribute are computed (prediction vector B = [S_21, …, S_2y], where y denotes the number of prediction categories, for example: male, female, etc.); further, when the value of the accumulated value S_21 (which may be the modulus of the vector S_21) is the largest and the corresponding recognition result is "female", the attribute prediction result "female" may be determined as the speaker recognition result corresponding to the voice to be recognized (i.e., the recognition result corresponding to the gender attribute of the speaker).
Similarly, with reference to the explanation of the above steps, the attribute prediction results for the age attribute are computed (prediction vector C = [S_31, S_32, S_33, …, S_3u], where u denotes the number of prediction categories, for example: 20, 30, 32, etc.); illustratively, when the value of the accumulated value S_31 (which may be the modulus of the vector S_31) is the largest and the corresponding recognition result is "22", the attribute prediction result "22" may be determined as the speaker recognition result corresponding to the voice to be recognized (i.e., the recognition result corresponding to the age attribute of the speaker).
Similarly, with reference to the explanation of the above steps, the attribute prediction results for the language attribute are computed (prediction vector D = [S_41, S_42, S_43, …, S_4v], where v denotes the number of prediction categories, for example: Chinese, English, Korean, French, etc.); illustratively, when the value of the accumulated value S_41 (which may be the modulus of the vector S_41) is the largest and the corresponding recognition result is "Chinese", the attribute prediction result "Chinese" may be determined as the speaker recognition result corresponding to the voice to be recognized (i.e., the recognition result corresponding to the language attribute of the speaker).
Therefore, the technical problem of low recognition efficiency caused by the fact that different models need to be established for different recognition tasks when different attributes of the speaker are recognized in the prior art can be solved, and the recognition efficiency of the models is improved.
The present disclosure also provides a voiceprint recognition apparatus, and fig. 6 shows a schematic structural diagram of the voiceprint recognition apparatus in an exemplary embodiment of the present disclosure; as shown in fig. 6, the voiceprint recognition apparatus 600 can include an obtaining module 601, an information extracting module 602, and a joint optimization module 603. Wherein:
the obtaining module 601 is configured to obtain voice of a person included in the voice to be recognized.
In an exemplary embodiment of the disclosure, the obtaining module is configured to extract a second acoustic feature corresponding to the voice to be recognized; determine the human voice acoustic features in the multiple segments of the second acoustic features according to the human voice phonemes; and determine the human voice contained in the voice to be recognized according to the human voice acoustic features in the multiple segments of the second acoustic features.
In an exemplary embodiment of the disclosure, the obtaining module is configured to perform pre-emphasis processing on a voice to be recognized to obtain a target voice signal; performing framing processing on the target voice signal to obtain a framing result; windowing the framing result to obtain a windowed signal; performing fast Fourier transform on the windowed signal to obtain frequency spectrum information; performing modulus squaring on the frequency spectrum information to obtain a power spectrum; inputting the power spectrum into a triangular filter bank to obtain a logarithmic energy spectrum output by the triangular filter bank; and determining the log energy spectrum as a second acoustic feature corresponding to the voice to be recognized.
In an exemplary embodiment of the disclosure, the obtaining module is configured to compare the human voice phonemes with multiple segments of second acoustic features respectively to obtain feature vectors representing comparison results; if the value of the target component in the feature vector is not the maximum value, determining that the second acoustic feature is the human acoustic feature; acquiring a plurality of sections of voice segments corresponding to voice acoustic features; and splicing the multiple sections of voice fragments to obtain voice contained in the voice to be recognized.
In an exemplary embodiment of the present disclosure, the obtaining module is configured to determine that the second acoustic feature is an interference feature if a value of the target component in the feature vector is a maximum value; and eliminating the voice segments corresponding to the interference features.
An information extraction module 602, configured to extract a first acoustic feature corresponding to the voice, input the first acoustic feature into a voiceprint recognition model to obtain a preliminary voiceprint recognition result corresponding to the voice to be recognized, where the preliminary voiceprint recognition result includes attribute prediction results corresponding to multiple voiceprint attributes respectively.
In an exemplary embodiment of the present disclosure, the information extraction module is configured to perform data enhancement processing on an obtained original voice sample to obtain an extended sample; performing phoneme labeling on the extended sample to obtain a sample label; and training a machine learning model according to the original voice sample and the sample label to obtain a voiceprint recognition model.
In an exemplary embodiment of the disclosure, the information extraction module is configured to converge the original speech sample and the sample label according to a cross entropy loss function, so as to obtain a voiceprint recognition model.
In an exemplary embodiment of the present disclosure, the information extraction module is configured to compare the human voice phonemes with multiple segments of second acoustic features respectively to obtain feature vectors representing comparison results; if the value of the target component in the feature vector is not the maximum value, determining that the second acoustic feature is the human acoustic feature; acquiring a plurality of sections of voice segments corresponding to voice acoustic features; and splicing the multiple sections of voice fragments to obtain voice contained in the voice to be recognized.
And the joint optimization module 603 is configured to determine a speaker recognition result corresponding to the speech to be recognized according to the attribute prediction results corresponding to the multiple voiceprint attributes respectively.
In an exemplary embodiment of the present disclosure, the plurality of voiceprint attributes includes at least: identity attribute, gender attribute, age attribute, and language attribute. The joint optimization module is used for accumulating and summing attribute prediction results respectively corresponding to the various voiceprint attributes to obtain an accumulated value; and determining the attribute prediction result corresponding to the accumulated value with the maximum numerical value as the speaker recognition result corresponding to the voice to be recognized.
The specific details of each module in the voiceprint recognition apparatus have been described in detail in the corresponding voiceprint recognition method, and therefore are not described herein again.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer storage medium capable of implementing the above method is also provided, on which a program product capable of implementing the above-described method of the present specification is stored. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 7, a program product 700 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method, or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 800 according to this embodiment of the disclosure is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 8, electronic device 800 is in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: at least one processing unit 810, at least one storage unit 820, a bus 830 connecting various system components (including the storage unit 820 and the processing unit 810), and a display unit 840.
Wherein the storage unit stores program code that is executable by the processing unit 810 to cause the processing unit 810 to perform steps according to various exemplary embodiments of the present disclosure as described in the "exemplary methods" section above in this specification. For example, the processing unit 810 may perform the following as shown in fig. 1: step S110, acquiring voice of a person contained in the voice to be recognized; step S120, extracting a first acoustic feature corresponding to the voice of the person, inputting the first acoustic feature into a voiceprint recognition model to obtain a preliminary voiceprint recognition result corresponding to the voice to be recognized, wherein the preliminary voiceprint recognition result comprises attribute prediction results corresponding to various voiceprint attributes respectively; step S130, determining the speaker recognition result corresponding to the voice to be recognized according to the attribute prediction results respectively corresponding to the multiple voiceprint attributes.
The storage unit 820 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 8201 and/or a cache memory unit 8202, and may further include a read-only memory unit (ROM) 8203.
The storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 830 may be any of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 900 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 800 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 850. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (12)

1. A voiceprint recognition method, comprising:
acquiring voice of a person contained in the voice to be recognized;
extracting first acoustic features corresponding to the voice of the person, inputting the first acoustic features into a voiceprint recognition model to obtain a preliminary voiceprint recognition result corresponding to the voice to be recognized, wherein the preliminary voiceprint recognition result comprises attribute prediction results corresponding to various voiceprint attributes respectively;
and determining a speaker recognition result corresponding to the voice to be recognized according to the attribute prediction results respectively corresponding to the multiple voiceprint attributes.
2. The method according to claim 1, wherein the obtaining of the human voice contained in the voice to be recognized comprises:
extracting a second acoustic feature corresponding to the voice to be recognized;
determining human voice acoustic features in the second acoustic features according to the human voice phonemes;
and determining the human voice contained in the voice to be recognized according to the human voice acoustic features in the plurality of segments of the second acoustic features.
3. The method according to claim 2, wherein the extracting a second acoustic feature corresponding to the speech to be recognized comprises:
pre-emphasis processing is carried out on the voice to be recognized to obtain a target voice signal;
performing framing processing on the target voice signal to obtain a framing result;
windowing the framing result to obtain a windowed signal;
performing fast Fourier transform on the windowed signal to obtain frequency spectrum information;
performing a modulus square on the frequency spectrum information to obtain a power spectrum;
inputting the power spectrum into a triangular filter bank to obtain a logarithmic energy spectrum output by the triangular filter bank;
and determining the logarithmic energy spectrum as the second acoustic feature corresponding to the voice to be recognized.
4. The method of claim 2, further comprising:
comparing the human voice phonemes with the multiple sections of the second acoustic features respectively to obtain feature vectors representing comparison results;
if the value of the target component in the feature vector is not the maximum value, determining that the second acoustic feature is a human voice acoustic feature;
acquiring a plurality of sections of voice fragments corresponding to the human voice acoustic features;
and splicing the multiple sections of voice fragments to obtain the human voice contained in the voice to be recognized.
5. The method of claim 4, further comprising:
if the value of the target component in the feature vector is the maximum value, determining that the second acoustic feature is an interference feature;
and eliminating the voice segments corresponding to the interference features.
6. The method according to claim 1, wherein the determining the speaker recognition result corresponding to the speech to be recognized according to the attribute prediction results corresponding to the plurality of voiceprint attributes respectively comprises:
accumulating and summing the attribute prediction results corresponding to the various voiceprint attributes respectively to obtain an accumulated value;
and determining the attribute prediction result corresponding to the accumulated value with the largest numerical value as the speaker recognition result corresponding to the voice to be recognized.
7. The method of claim 6, wherein the plurality of voiceprint attributes comprises at least: identity attribute, gender attribute, age attribute, and language attribute.
8. The method of claim 1, further comprising:
performing data enhancement processing on the obtained original voice sample to obtain an extended sample;
performing phoneme labeling on the extended sample to obtain a sample label;
and training a machine learning model according to the original voice sample and the sample label to obtain the voiceprint recognition model.
9. The method of claim 8, wherein the training of a machine learning model from the original speech samples and the sample labels to obtain the voiceprint recognition model comprises:
and converging the original voice sample and the sample label according to a cross entropy loss function to obtain the voiceprint recognition model.
10. A voiceprint recognition apparatus comprising:
the acquisition module is used for acquiring the voice of the person contained in the voice to be recognized;
the information extraction module is used for extracting first acoustic features corresponding to the voice of the person, inputting the first acoustic features into a voiceprint recognition model to obtain a preliminary voiceprint recognition result corresponding to the voice to be recognized, wherein the preliminary voiceprint recognition result comprises attribute prediction results corresponding to various voiceprint attributes respectively;
and the joint optimization module is used for determining a speaker recognition result corresponding to the voice to be recognized according to the attribute prediction results respectively corresponding to the multiple voiceprint attributes.
11. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements a voiceprint recognition method according to any one of claims 1 to 9.
12. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the voiceprint recognition method of any one of claims 1 to 9 via execution of the executable instructions.
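For readers who want a concrete picture of the feature-extraction chain recited in claim 3 (pre-emphasis, framing, windowing, fast Fourier transform, power spectrum, triangular filter bank, logarithmic energy spectrum), a minimal NumPy sketch follows; every parameter value (sampling rate, frame length, hop, filter count, pre-emphasis coefficient) is an illustrative assumption and none is prescribed by the claims.

import numpy as np

def log_energy_spectrum(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=26):
    # Assumes len(signal) >= frame_len; returns one log filter-bank energy vector per frame.
    # 1) pre-emphasis
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2) framing
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len] for i in range(n_frames)])
    # 3) windowing (Hamming window)
    frames = frames * np.hamming(frame_len)
    # 4) fast Fourier transform and 5) modulus square -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 6) triangular (mel-spaced) filter bank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # 7) logarithmic energy spectrum, i.e. the second acoustic feature
    return np.log(np.maximum(power @ fbank.T, 1e-10))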
CN201911346904.9A 2019-12-24 2019-12-24 Voiceprint recognition method and device, computer storage medium and electronic equipment Active CN110970036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911346904.9A CN110970036B (en) 2019-12-24 2019-12-24 Voiceprint recognition method and device, computer storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110970036A (en) 2020-04-07
CN110970036B CN110970036B (en) 2022-07-12

Family

ID=70036248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911346904.9A Active CN110970036B (en) 2019-12-24 2019-12-24 Voiceprint recognition method and device, computer storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110970036B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2506541B1 (en) * 2011-03-30 2016-01-13 Clarion Co., Ltd. Hands-free terminals and hands-free system for cars
JP2017534905A (en) * 2014-10-10 2017-11-24 アリババ グループ ホウルディング リミテッド Voiceprint information management method, voiceprint information management apparatus, person authentication method, and person authentication system
CN107886943A (en) * 2017-11-21 2018-04-06 广州势必可赢网络科技有限公司 Voiceprint recognition method and device
US20190164535A1 (en) * 2017-11-27 2019-05-30 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for speech synthesis
CN108364654A (en) * 2018-01-30 2018-08-03 网易乐得科技有限公司 Method of speech processing, medium, device and computing device
US10003688B1 (en) * 2018-02-08 2018-06-19 Capital One Services, Llc Systems and methods for cluster-based voice verification
KR20190096618A (en) * 2018-02-09 2019-08-20 삼성전자주식회사 Electronic device and method for executing function of electronic device
CN108564940A (en) * 2018-03-20 2018-09-21 平安科技(深圳)有限公司 Audio recognition method, server and computer readable storage medium
WO2019192250A1 (en) * 2018-04-04 2019-10-10 科大讯飞股份有限公司 Voice wake-up method and apparatus
CN108831487A (en) * 2018-06-28 2018-11-16 深圳大学 Method for recognizing sound-groove, electronic device and computer readable storage medium
CN109473108A (en) * 2018-12-15 2019-03-15 深圳壹账通智能科技有限公司 Auth method, device, equipment and storage medium based on Application on Voiceprint Recognition
CN109979439A (en) * 2019-03-22 2019-07-05 泰康保险集团股份有限公司 Audio recognition method, device, medium and electronic equipment based on block chain
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN110570873A (en) * 2019-09-12 2019-12-13 Oppo广东移动通信有限公司 voiceprint wake-up method and device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SONGGUN HYON ET AL: "Detection of speaker individual information using a phoneme effect suppression method", Speech Communication *
ZHANG Zhao: "Research on Key Issues of Voiceprint Password Recognition", China Masters' Theses Full-text Database (Information Science and Technology) *
LU Xiaoqian: "Research on Speaker Recognition Based on VP Tree and GMM", China Masters' Theses Full-text Database (Information Science and Technology) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111524521A (en) * 2020-04-22 2020-08-11 北京小米松果电子有限公司 Voiceprint extraction model training method, voiceprint recognition method, voiceprint extraction model training device, voiceprint recognition device and voiceprint recognition medium
CN111524521B (en) * 2020-04-22 2023-08-08 北京小米松果电子有限公司 Voiceprint extraction model training method, voiceprint recognition method, voiceprint extraction model training device and voiceprint recognition device
CN111816191A (en) * 2020-07-08 2020-10-23 珠海格力电器股份有限公司 Voice processing method, device, system and storage medium
CN112053695A (en) * 2020-09-11 2020-12-08 北京三快在线科技有限公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN112466310A (en) * 2020-10-15 2021-03-09 讯飞智元信息科技有限公司 Deep learning voiceprint recognition method and device, electronic equipment and storage medium
CN112967718A (en) * 2021-04-02 2021-06-15 江苏吉祥星智能科技有限公司 Sound-based projector control method, device, equipment and storage medium
CN112967718B (en) * 2021-04-02 2024-04-12 深圳吉祥星科技股份有限公司 Projector control method, device, equipment and storage medium based on sound
CN113593582A (en) * 2021-06-24 2021-11-02 青岛海尔科技有限公司 Control method and device of intelligent device, storage medium and electronic device
CN113593582B (en) * 2021-06-24 2024-05-24 青岛海尔科技有限公司 Control method and device of intelligent device, storage medium and electronic device
CN114360510A (en) * 2022-01-14 2022-04-15 腾讯科技(深圳)有限公司 Voice recognition method and related device

Also Published As

Publication number Publication date
CN110970036B (en) 2022-07-12

Similar Documents

Publication Publication Date Title
CN110970036B (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
Deshwal et al. Feature extraction methods in language identification: a survey
CN103617799B (en) A kind of English statement pronunciation quality detection method being adapted to mobile device
Sinha et al. Assessment of pitch-adaptive front-end signal processing for children’s speech recognition
WO2020029404A1 (en) Speech processing method and device, computer device and readable storage medium
Yadav et al. Addressing noise and pitch sensitivity of speech recognition system through variational mode decomposition based spectral smoothing
KR20130133858A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
Fukuda et al. Detecting breathing sounds in realistic Japanese telephone conversations and its application to automatic speech recognition
Sefara The effects of normalisation methods on speech emotion recognition
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
Tolba A high-performance text-independent speaker identification of Arabic speakers using a CHMM-based approach
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
Thirumuru et al. Novel feature representation using single frequency filtering and nonlinear energy operator for speech emotion recognition
Nandi et al. Parametric representation of excitation source information for language identification
Priyadarshani et al. Dynamic time warping based speech recognition for isolated Sinhala words
CN114550706B (en) Intelligent campus voice recognition method based on deep learning
Singh et al. Analysis of constant-Q filterbank based representations for speech emotion recognition
Goyal et al. A comparison of Laryngeal effect in the dialects of Punjabi language
Hafen et al. Speech information retrieval: a review
CN113539243A (en) Training method of voice classification model, voice classification method and related device
Gaudani et al. Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language
Kinnunen Optimizing spectral feature based text-independent speaker recognition
CN113782005B (en) Speech recognition method and device, storage medium and electronic equipment
Chen et al. Teager Mel and PLP fusion feature based speech emotion recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant