CN113516987A - Speaker recognition method, device, storage medium and equipment - Google Patents

Speaker recognition method, device, storage medium and equipment

Info

Publication number
CN113516987A
Authority
CN
China
Prior art keywords
target
speaker
voice
sample
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110807643.7A
Other languages
Chinese (zh)
Other versions
CN113516987B (en)
Inventor
田敬广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202110807643.7A
Publication of CN113516987A
Application granted
Publication of CN113516987B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

The application discloses a speaker recognition method, apparatus, storage medium and device. The method first acquires a target voice to be recognized, determines the sampling rate of the target voice, and extracts a first acoustic feature from the target voice; the first acoustic feature is processed based on the sampling rate of the target voice to obtain a second acoustic feature, and the second acoustic feature is input into a pre-constructed speaker recognition model to recognize a target characterization vector of the target speaker; the speaker recognition model is co-trained on voices with different sampling rates. The target speaker is then identified according to the target characterization vector to obtain a recognition result for the target speaker. Because the second acoustic feature is what is fed to the pre-constructed speaker recognition model, no performance loss is introduced for high-frequency voice acoustic features, the performance drop caused by low-frequency voice acoustic features is compensated, and the accuracy of the recognition result is improved.

Description

Speaker recognition method, device, storage medium and equipment
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speaker recognition method, device, storage medium, and apparatus.
Background
With continuous breakthroughs in artificial intelligence technology and the growing popularity of intelligent terminal devices, human-computer interaction has become an increasingly frequent part of daily work and life. Voice interaction, as a next-generation human-machine interaction mode, brings great convenience to people's lives; in particular, the technology of recognizing who is speaking from voice is called speaker recognition. Speaker recognition can be applied wherever a speaker's identity must be confirmed, such as judicial proceedings, remote financial services, security, and voice retrieval, all of which require accurately identifying the speaker from voice data.
The traditional speaker recognition approach trains and maintains separate speaker recognition models for wideband and narrowband voices with different sampling rates. This is costly to deploy, and because the characterization vectors produced by the two models cannot be compared for similarity, the accuracy of the recognition results is low.
Therefore, how to use a single model for recognition while improving the accuracy of speaker recognition results is a technical problem that urgently needs to be solved.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a method, an apparatus, a storage medium, and a device for speaker recognition, which can effectively improve the accuracy of a recognition result when performing speaker recognition.
The embodiment of the application provides a speaker identification method, which comprises the following steps:
acquiring target voice to be recognized, and determining the sampling rate of the target voice;
extracting a first acoustic feature from the target voice; processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature;
inputting the second acoustic feature into a pre-constructed speaker recognition model, and recognizing to obtain a target characterization vector of a target speaker; the speaker recognition model is co-trained on voices with different sampling rates;
and identifying the target speaker according to the target characterization vector to obtain an identification result of the target speaker.
In one possible implementation, the speaker recognition model is constructed as follows:
acquiring a first sample voice and a teacher speaker recognition model corresponding to a first sampling rate; the teacher speaker recognition model is obtained through voice training based on a first sampling rate;
acquiring a second sample voice corresponding to a second sampling rate; extracting acoustic features of the second sample voice from the second sample voice; the first sample speech and the second sample speech belong to the same sample speaker;
inputting the acoustic characteristics of the first sample voice into the teacher speaker recognition model to obtain a first sample characterization vector; inputting the acoustic features of the first sample voice and the acoustic features corresponding to the second sample voice into an initial speaker recognition model to respectively obtain a second sample characterization vector and a third sample characterization vector;
and training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector and the third sample characterization vector to generate a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In one possible implementation, the processing the first acoustic feature based on the sampling rate of the target speech to obtain a second acoustic feature includes:
when the sampling rate of the target voice is determined to be the first sampling rate, directly taking the first acoustic feature as a second acoustic feature;
and when the sampling rate of the target voice is determined to be the second sampling rate, processing the first acoustic feature to obtain a second acoustic feature.
In one possible implementation, the first sampling rate is higher than the second sampling rate, and the first acoustic feature includes a log mel filter bank FBANK feature; when the sampling rate of the target voice is determined to be the second sampling rate, processing the first acoustic feature to obtain a second acoustic feature, including:
adjusting the number of filters for filtering the power spectrum of the first acoustic feature to obtain an adjusted first acoustic feature, so that the adjusted first acoustic feature is aligned with a low-frequency band region of the acoustic feature of the voice corresponding to the first sampling rate;
and zero-filling the difference dimensions between the adjusted first acoustic feature and the acoustic feature of the voice corresponding to the first sampling rate, so that the dimension of the zero-filled first acoustic feature is the same as the dimension of the acoustic feature of the voice corresponding to the first sampling rate, and taking the zero-filled first acoustic feature as the second acoustic feature.
In one possible implementation, the inputting the acoustic features of the first sample voice and the acoustic features corresponding to the second sample voice into an initial speaker recognition model includes: processing the acoustic features corresponding to the second sample voice and then inputting the processed features into the initial speaker recognition model.
In one possible implementation manner, the training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector, and the third sample characterization vector to generate a student speaker recognition model, and using the student speaker recognition model as a final speaker recognition model includes:
calculating cosine similarity between the first sample characterization vector and the second sample characterization vector as a first cosine loss;
calculating cosine similarity between the first sample characterization vector and the third sample characterization vector as a second cosine loss;
and calculating a sum of the first cosine loss and the second cosine loss, training the initial speaker recognition model according to the sum, generating a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In one possible implementation, the target speech includes M segments of speech; m is a positive integer greater than 1; extracting a first acoustic feature from the target voice; and processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature, including:
respectively extracting M first acoustic features of the M sections of voice from the M sections of voice; processing the M first acoustic features based on the respective sampling rates of the M sections of voice to obtain M second acoustic features;
inputting the second acoustic feature into a pre-constructed speaker recognition model, and recognizing to obtain a target representation vector of a target speaker, wherein the method comprises the following steps:
respectively inputting the M second acoustic features into a pre-constructed speaker recognition model, and recognizing to obtain M target characterization vectors corresponding to a target speaker;
and calculating the average value of the M target characterization vectors, and taking the average value as the final target characterization vector corresponding to the target speaker.
In a possible implementation manner, the recognizing the target speaker according to the target characterization vector to obtain a recognition result of the target speaker includes:
calculating the similarity between the target characterization vector of the target speaker and the preset characterization vector of a preset speaker;
judging whether the similarity is higher than a preset threshold value, if so, determining that the target speaker is the preset speaker; if not, determining that the target speaker is not the preset speaker.
In a possible implementation manner, the recognizing the target speaker according to the target characterization vector to obtain a recognition result of the target speaker includes:
calculating N similarity between the target characterization vector of the target speaker and N preset characterization vectors of N preset speakers; n is a positive integer greater than 1;
and selecting the maximum similarity from the N similarities, and determining the target speaker as a preset speaker corresponding to the maximum similarity.
The embodiment of the present application further provides a speaker recognition apparatus, including:
the device comprises a first acquisition unit, a second acquisition unit and a voice recognition unit, wherein the first acquisition unit is used for acquiring target voice to be recognized; determining the sampling rate of the target voice;
the processing unit is used for extracting a first acoustic feature from the target voice; processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature;
the first recognition unit is used for inputting the second acoustic feature into a pre-constructed speaker recognition model and recognizing to obtain a target characterization vector of a target speaker; the speaker recognition model is co-trained on voices with different sampling rates;
and the second identification unit is used for identifying the target speaker according to the target characterization vector to obtain an identification result of the target speaker.
In a possible implementation manner, the apparatus further includes:
the second acquisition unit is used for acquiring a first sample voice and a teacher speaker recognition model corresponding to the first sampling rate; the teacher speaker recognition model is obtained through voice training based on a first sampling rate;
the third acquisition unit is used for acquiring second sample voice corresponding to the second sampling rate; extracting acoustic features of the second sample voice from the second sample voice; the first sample speech and the second sample speech belong to the same sample speaker;
the obtaining unit is used for inputting the acoustic characteristics of the first sample voice into the teacher speaker recognition model to obtain a first sample characterization vector; inputting the acoustic features of the first sample voice and the acoustic features corresponding to the second sample voice into an initial speaker recognition model to respectively obtain a second sample characterization vector and a third sample characterization vector;
and the training unit is used for training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector and the third sample characterization vector to generate a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In one possible implementation, the processing unit includes:
the first processing subunit is used for directly taking the first acoustic feature as a second acoustic feature when the sampling rate of the target voice is determined to be the first sampling rate;
and the second processing subunit is configured to, when it is determined that the sampling rate of the target speech is the second sampling rate, process the first acoustic feature to obtain a second acoustic feature.
In one possible implementation, the first sampling rate is higher than the second sampling rate, and the first acoustic feature includes a log mel filter bank FBANK feature; the second processing subunit includes:
the adjusting subunit is configured to adjust the number of filters that filter the power spectrum of the first acoustic feature to obtain an adjusted first acoustic feature, so that the adjusted first acoustic feature is aligned with a low frequency band region of the acoustic feature of the speech corresponding to the first sampling rate;
and a zero padding subunit, configured to pad zero for a difference dimension of the adjusted first acoustic feature and the acoustic feature of the speech corresponding to the first sampling rate, so that the dimension of the first acoustic feature after the zero padding is the same as the dimension of the acoustic feature of the speech corresponding to the first sampling rate, and use the first acoustic feature after the zero padding as the second acoustic feature.
In a possible implementation manner, the obtaining unit is specifically configured to:
and processing the acoustic features corresponding to the second sample voice and then inputting the processed features into the initial speaker recognition model.
In one possible implementation, the training unit includes:
a first calculating subunit, configured to calculate a cosine similarity between the first sample characterization vector and the second sample characterization vector as a first cosine loss;
a second calculating subunit, configured to calculate a cosine similarity between the first sample characterization vector and the third sample characterization vector as a second cosine loss;
and the training subunit is used for calculating the sum of the first cosine loss and the second cosine loss, training the initial speaker recognition model according to the sum, generating a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In one possible implementation, the target speech includes M segments of speech; m is a positive integer greater than 1; the processing unit is specifically configured to:
respectively extracting M first acoustic features of the M sections of voice from the M sections of voice; processing the M first acoustic features based on the respective sampling rates of the M sections of voice to obtain M second acoustic features;
the first recognition unit includes:
the identification subunit is used for respectively inputting the M second acoustic features into a pre-constructed speaker identification model, and identifying to obtain M target representation vectors corresponding to the target speaker;
and the third calculating subunit is used for calculating the average value of the M target characterization vectors, and taking the average value as the final target characterization vector corresponding to the target speaker.
In a possible implementation manner, the second identification unit includes:
the fourth calculating subunit is used for calculating the similarity between the target characterization vector of the target speaker and the preset characterization vector of a preset speaker;
the first determining subunit is used for judging whether the similarity is higher than a preset threshold value, and if so, determining that the target speaker is the preset speaker; if not, determining that the target speaker is not the preset speaker.
In a possible implementation manner, the second identification unit includes:
the fifth calculating subunit is used for calculating N similarity between the target characterization vector of the target speaker and N preset characterization vectors of N preset speakers; n is a positive integer greater than 1;
and the second determining subunit is used for selecting the maximum similarity from the N similarities and determining the target speaker as a preset speaker corresponding to the maximum similarity.
The embodiment of the present application further provides a speaker recognition device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any implementation manner of the speaker recognition method.
An embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a terminal device, the terminal device is enabled to execute any implementation manner of the speaker recognition method.
The embodiment of the application also provides a computer program product, and when the computer program product runs on the terminal device, the terminal device is enabled to execute any implementation mode of the speaker identification method.
According to the speaker recognition method, apparatus, storage medium and device provided by the present application, a target voice to be recognized is first acquired, the sampling rate of the target voice is determined, and a first acoustic feature is extracted from the target voice; the first acoustic feature is processed based on the sampling rate of the target voice to obtain a second acoustic feature, and the second acoustic feature is then input into a pre-constructed speaker recognition model to recognize a target characterization vector of the target speaker; the speaker recognition model is co-trained on voices with different sampling rates. The target speaker can then be identified according to the target characterization vector to obtain a recognition result for the target speaker. In this way, because the second acoustic feature corresponding to the target voice is what is input into the pre-constructed speaker recognition model, no performance loss is introduced for high-frequency voice acoustic features, the performance drop caused by low-frequency voice acoustic features is compensated, and the target characterization vector of the target speaker can still be predicted. The high-frequency information missing from low-frequency voice acoustic features is thus compensated without increasing the number of parameters of the speaker recognition model, the same speaker recognition model achieves a good recognition effect on both low-frequency and high-frequency target voice data, and the accuracy of the speaker recognition result is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flowchart of a speaker recognition method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart illustrating a process for constructing a speaker recognition model according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of constructing a speaker recognition model according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a speaker recognition apparatus according to an embodiment of the present disclosure.
Detailed Description
With the rapid development of intelligent recognition technology, more and more scenarios require biometric technology to identify speakers, such as financial security, smart homes, and judicial and administrative applications. The traditional speaker recognition approach trains and maintains separate speaker recognition models for wideband and narrowband voices with different sampling rates; deployment is costly, and because the models obtained from voices with different sampling rates cannot be compared by similarity matching, the accuracy of the recognition results is low.
To address this, existing speaker recognition methods usually train a mixed-bandwidth speaker recognition model to recognize voices with different sampling rates, but the resulting recognition performance is still unsatisfactory. Specifically, the existing approaches to training a mixed-bandwidth speaker recognition model generally fall into the following three categories:
the first one is a down-sampling method, which directly down-samples broadband speech into narrowband speech, extracts narrowband acoustic features, trains speaker recognition models uniformly by using the narrowband acoustic features, and performs speaker recognition, but the down-sampling method ignores high-frequency information in the broadband speech, and related studies show that the high-frequency information in the speech is very helpful for speaker discrimination, so the method sacrifices the effect of speaker recognition models, and results in lower accuracy of model recognition results.
The second method is an up-sampling method, which directly up-samples narrow-band speech into wide-band speech, extracts wide-band acoustic features, trains speaker recognition models uniformly by using the wide-band acoustic features to perform speaker recognition, although the up-sampling method does not lose high-frequency information of the wide-band speech, but does not compensate high-frequency information lacking in the narrow-band speech, and compared with the wide-band speech recognition effect, the up-sampling method still loses loss, so that the accuracy of model recognition results is low.
The third is the bandwidth extension approach, which extracts acoustic features from narrowband and wideband speech, trains a bandwidth-extension neural network to convert narrowband acoustic features into wideband acoustic features and recover the missing high-frequency-band information, and then trains a speaker recognition model uniformly on wideband acoustic features for speaker recognition. However, this requires training and deploying an additional bandwidth-extension network, which increases the number of parameters, and errors in the reconstructed high-frequency information can still affect recognition.
Therefore, the traditional speaker identification method and the existing speaker identification method are not accurate enough for speaker identification.
To address these shortcomings, the present application provides a speaker recognition method: first, a target voice to be recognized is acquired, the sampling rate of the target voice is determined, and a first acoustic feature is extracted from the target voice; the first acoustic feature is processed based on the sampling rate of the target voice to obtain a second acoustic feature, and the second acoustic feature is then input into a pre-constructed speaker recognition model to recognize a target characterization vector of the target speaker; the speaker recognition model is co-trained on voices with different sampling rates. The target speaker can then be identified according to the target characterization vector to obtain a recognition result for the target speaker. In this way, because the second acoustic feature corresponding to the target voice is what is input into the pre-constructed speaker recognition model, no performance loss is introduced for high-frequency voice acoustic features, the performance drop caused by low-frequency voice acoustic features is compensated, and the target characterization vector of the target speaker can still be predicted. The high-frequency information missing from low-frequency voice acoustic features is thus compensated without increasing the number of parameters of the speaker recognition model, the same speaker recognition model achieves a good recognition effect on both low-frequency and high-frequency target voice data, and the accuracy of the speaker recognition result is improved.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First embodiment
Referring to fig. 1, a schematic flow chart of a speaker identification method provided in this embodiment is shown, where the method includes the following steps:
s101: and acquiring target voice to be recognized, and determining the sampling rate of the target voice.
In this embodiment, any speaker who needs to be identified is defined as a target speaker, and the voice of the target speaker that needs to be recognized is defined as the target voice. It should be noted that this embodiment does not limit the language of the target speech; for example, the target speech may be Chinese speech or English speech. Likewise, this embodiment does not limit the length of the target speech; for example, the target speech may be one sentence or multiple sentences.
It can be understood that the target speech can be obtained by recording according to actual needs, for example, telephone call speech or conference recording in daily life of people can be used as the target speech, and when the target speech is obtained, the sampling rate of the target speech is determined, so that the target speech is processed by using the scheme provided by this embodiment, so as to identify the identity of the target speaker who speaks the target speech.
The sampling rate (i.e., sampling frequency) refers to the number of samples the recording device takes from the analog signal per unit time; the higher the sampling frequency, the more faithfully the sampled waveform reproduces the original sound wave. The sampling rate is expressed in hertz (Hz). Different voices may correspond to a variety of different sampling rates. For example, for a 4000 Hz telephone signal in a network, sampling at 8000 Hz is required so that the sampled digital signal fully retains the information in the telephone signal, according to the Nyquist theorem. With the development of computer network technology, a sampling rate of 16000 Hz is widely used for voices such as internet audio.
It will be appreciated that the sampling rate may comprise a plurality of different sampling rates, such as a high frequency (e.g., 16000Hz, etc.) and a low frequency (e.g., 8000Hz, etc.), with the corresponding sampled speech being high frequency speech and low frequency speech, respectively. The high frequency speech may be referred to as wideband speech, the low frequency speech may be referred to as narrowband speech, the sampling rate of wideband speech is higher than narrowband speech, for example, the sampling rate of wideband speech may be twice that of narrowband speech (e.g., 16000Hz is twice 8000 Hz), etc.
S102: extracting a first acoustic feature from the target voice; and processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature.
In this embodiment, after the target voice to be recognized is obtained through step S101 and its sampling rate is determined, the identity of the target speaker who speaks the target voice needs to be recognized accurately. First, an acoustic feature representing the voiceprint information of the target voice is extracted from the target voice by a feature extraction method and defined as the first acoustic feature. The first acoustic feature is then processed according to the sampling rate of the target voice to obtain the second acoustic feature. The obtained second acoustic feature can then be used as the recognition basis in the subsequent steps S103-S104 to effectively recognize the target voice and thereby identify the target speaker.
Specifically, when extracting a first acoustic feature of a target voice, firstly, framing the target voice is required to obtain a corresponding voice frame sequence, and then, pre-emphasis is performed on the framed voice frame sequence; and then, extracting the acoustic features of each voice frame in sequence, wherein the acoustic features refer to feature data used for representing voiceprint information of the corresponding voice frame, and may be Mel-scale Frequency Cepstral Coefficients (MFCCs) features or Log Mel-Filter Bank (FBANK) features, for example.
It should be noted that, in the embodiment of the present application, the method for extracting the first acoustic feature of the target speech is not limited, and a specific extraction process is not limited, and an appropriate extraction method may be selected according to actual situations, and a corresponding feature extraction operation may be performed. For the sake of understanding, the present embodiment will be described below by taking the first acoustic feature of the target speech as the FBANK feature.
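For illustration only, a minimal sketch of FBANK extraction is given below. It assumes the librosa library and generic settings (25 ms frames, 10 ms shift, 40 mel filters, 0.97 pre-emphasis); none of these choices are prescribed by this application, and any equivalent feature extraction pipeline could be substituted.

```python
import numpy as np
import librosa

def extract_fbank(wav_path, n_mels=40, frame_ms=25, shift_ms=10, preemph=0.97):
    """Extract log mel filter bank (FBANK) features from a speech file.

    Frame length, frame shift, filter count and pre-emphasis factor are
    illustrative assumptions, not values mandated by the application.
    """
    y, sr = librosa.load(wav_path, sr=None)          # keep the native sampling rate
    y = np.append(y[0], y[1:] - preemph * y[:-1])    # pre-emphasis
    n_fft = int(sr * frame_ms / 1000)
    hop = int(sr * shift_ms / 1000)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop,
        n_mels=n_mels, power=2.0)                    # power spectrum filtered by the mel bank
    fbank = np.log(mel + 1e-10).T                    # (num_frames, n_mels) log filter bank energies
    return fbank, sr
```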
Further, an optional implementation manner is that, when the sampling rate of the target speech is a first sampling rate, for example, when the target speech is a wideband speech, after a first acoustic feature (e.g., FBANK feature) of the target speech is extracted, the first acoustic feature is not processed, but is directly used as a second acoustic feature to perform subsequent steps S103 to S104, so as to implement effective recognition of the target speech and further recognize the identity of the target speaker.
Alternatively, in another optional implementation, when the sampling rate of the target speech is the second sampling rate, for example when the target speech is narrowband speech, the difference between narrowband and wideband acoustic features needs to be reduced. After the FBANK feature of the target speech is extracted as the first acoustic feature, the number of filters used to filter the power spectrum of the first acoustic feature is adjusted to obtain an adjusted first acoustic feature, so that the adjusted first acoustic feature is aligned with the low-frequency band region of the acoustic feature of the voice corresponding to the first sampling rate. Then, zero padding is performed on the difference dimensions between the adjusted first acoustic feature and the acoustic feature of the voice corresponding to the first sampling rate, so that the dimension of the zero-padded first acoustic feature is the same as the dimension of the acoustic feature of the voice corresponding to the first sampling rate. The zero-padded first acoustic feature is used as the second acoustic feature to perform the subsequent steps S103-S104, thereby effectively recognizing the target voice and identifying the target speaker.
For example, the following steps are carried out: for a narrow band target speech at 8000Hz sampling rate, the corresponding FBANK signature may represent information in the 0-4000Hz band, and for a speech at 16000Hz sampling rate, the corresponding FBANK signature may represent information in the 0-8000Hz band. The information of the 4000-8000Hz frequency band is missing for the target speech at 8000Hz sampling rate compared to the speech at 16000Hz sampling rate.
For this purpose, according to the conversion formula between frequency f and mel-scale frequency m, m = 2595·log10(1 + f/700), the relationship between the numbers of mel filters for narrowband and wideband speech can be calculated. The specific calculation formula is as follows:
M_N = M_W · log10(1 + f_N/700) / log10(1 + f_W/700)    (1)
where f_N denotes the upper bound of the frequency range of the FBANK feature of the narrowband target speech, f_W denotes the upper bound of the frequency range of the FBANK feature of wideband speech, and M_N and M_W denote the numbers of filters used to filter the power spectrum for the narrowband target speech FBANK feature and the wideband speech FBANK feature, respectively.
By way of example: when f_W = 8000, f_N = 4000, and M_W = 40, formula (1) gives M_N ≈ 30.2. A rounding operation can then force M_N to 30, from which f_N can be back-calculated as 3978. Thus, when the FBANK feature extracted from 16000 Hz speech is 40-dimensional and the FBANK feature extracted from the 8000 Hz target speech is 30-dimensional, the two are aligned in the 0-3978 Hz band. Ten zero-valued dimensions can then be appended to the 30-dimensional FBANK feature extracted from the 8000 Hz target speech, so that the FBANK features of the 8000 Hz target speech and the 16000 Hz wideband speech are both 40-dimensional, further reducing the difference between them.
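As a concrete illustration of formula (1) and the zero-padding step, the sketch below recomputes the example above; the function name and the 40-filter wideband reference are assumptions made for illustration only.

```python
import math
import numpy as np

def narrowband_filter_count(f_n, f_w, m_w):
    """Number of mel filters for the narrowband feature so that it aligns with
    the low-frequency band of the wideband feature (formula (1))."""
    mel = lambda f: 2595.0 * math.log10(1.0 + f / 700.0)
    return m_w * mel(f_n) / mel(f_w)

m_n = narrowband_filter_count(f_n=4000, f_w=8000, m_w=40)   # ~30.2
m_n = round(m_n)                                            # rounding forces the value to 30

# fbank_nb stands in for the (num_frames, 30) FBANK of 8000 Hz speech extracted with m_n filters.
fbank_nb = np.zeros((100, m_n))                              # placeholder features
wideband_dim = 40
# Pad the missing 10 dimensions with zeros so both features are 40-dimensional.
fbank_padded = np.pad(fbank_nb, ((0, 0), (0, wideband_dim - m_n)))
assert fbank_padded.shape[1] == wideband_dim
```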
S103: inputting the second acoustic feature into the pre-constructed speaker recognition model, and recognizing to obtain a target characterization vector of the target speaker; the speaker recognition model is co-trained on voices with different sampling rates. In general, if the target voices to be recognized include both voices at the first sampling rate and voices at the second sampling rate, the speaker recognition model is likewise trained on sample voices at the first sampling rate and sample voices at the second sampling rate.
In this embodiment, after the second acoustic feature of the target speech is obtained in step S102, it may be input into the pre-constructed speaker recognition model in order to effectively improve the accuracy of the recognition result, so as to recognize the target characterization vector that the target speaker has when speaking the speech content of the target speech, for use in the subsequent step S104. It should be noted that the specific format of the target characterization vector may be set according to the actual situation and is not limited in this embodiment; for example, the target characterization vector may be a 256-dimensional vector.
Compared with frame-level acoustic features (such as FBANK features), the target characterization vector characterizes sentence-level acoustic information of the target speech and comprehensively considers the relationship between each speech frame and its context, so it can characterize the voice information of the target speech more accurately. Moreover, the speaker recognition model is co-trained on voices with different sampling rates (such as wideband voice at the first sampling rate and narrowband voice at the second sampling rate). Therefore, regardless of whether the target voice is at the first or the second sampling rate, once the corresponding second acoustic feature is input into the speaker recognition model, a target characterization vector that more accurately characterizes the speaker-specific voice information of the target voice can be obtained. The target characterization vector can then be used, through the subsequent step S104, to identify the target speaker to whom the target voice belongs and determine the identity information of the target speaker.
Next, this embodiment describes the process of constructing the speaker recognition model. As shown in fig. 2, which is a schematic flow chart of constructing a speaker recognition model provided by this embodiment, the process includes the following steps A1-A4:
step A1: acquiring a first sample voice and a teacher speaker recognition model corresponding to a first sampling rate; the teacher speaker recognition model is obtained based on sample voice training corresponding to the first sampling rate.
In this embodiment, a large amount of preparation is required before the speaker recognition model can be constructed. First, a large number of voices at the first sampling rate, such as wideband voice data, need to be collected from users while they speak. The voices can be picked up by a microphone array, and the pickup device may be a tablet computer or an intelligent hardware device such as a smart speaker, a television, or an air conditioner. After collection, noise reduction needs to be performed on the collected high-frequency voices. Each piece of collected high-frequency (e.g., wideband) voice data of each user can then be used as a first sample voice, and these first sample voices can be used to train the teacher speaker recognition model for the subsequent step A2.
The implementation process of training the teacher speaker recognition model with the first sample voices may specifically include the following steps A11-A12. In the following steps, this embodiment describes the training process of the teacher speaker recognition model by taking the first sample speech as wideband sample speech as an example:
step A11: from the wideband sample speech, wideband acoustic features characterizing acoustic information of the wideband sample speech are extracted.
In this embodiment, after the wideband sample speech is obtained, it cannot be used directly to train the teacher speaker recognition model. Instead, a method similar to the extraction of the second acoustic feature of the target speech in step S102 is adopted, with the target speech replaced by the wideband sample speech, so that the wideband acoustic features of each wideband sample speech can be extracted; for details, reference is made to the description of step S102, which is not repeated here.
Step A12: and training according to the broadband acoustic characteristics of the broadband sample voice and the speaker recognition label corresponding to the broadband sample voice to generate a teacher speaker recognition model.
In this embodiment, a neural network is first selected as the initialized speaker recognition model and its model parameters are initialized, such as the neural network shown in the left diagram of fig. 3. It should be noted that this embodiment does not limit the specific network structure of the model; it may be a neural network of any form, for example a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Deep Neural Network (DNN), an x-vector system structure, or the like.
Then, as shown in fig. 3, the current round of training may be performed on the initialized speaker recognition model (i.e., the neural network shown in the left diagram of fig. 3) sequentially using the broadband acoustic features corresponding to each broadband sample voice to perform parameter updating, and after multiple rounds of parameter updating (i.e., after a training end condition is satisfied, for example, a preset number of training rounds is reached or a variation of a model parameter is smaller than a preset threshold value, etc.), the teacher speaker recognition model is obtained through training.
Specifically, in the training process, an alternative implementation manner is that a teacher speaker recognition model can be constructed by using a given objective function, and the network parameters of the model are updated. The objective function adopted in this embodiment is as follows:
L_s = -(1/M) · Σ_{i=1}^{M} log[ exp(w_{y_i}^T x_i + b_{y_i}) / Σ_{j=1}^{N} exp(w_j^T x_i + b_j) ]    (2)
where x_i denotes the acoustic feature vector of the i-th wideband sample speech; y_i denotes the manually labeled speaker corresponding to the i-th wideband sample speech; w_j and b are model parameters, specifically, w_j denotes the j-th column of the weight matrix of the model classification layer and b denotes the bias term; and M and N denote the total number of wideband sample voices and the total number of speakers corresponding to these wideband sample voices, respectively.
When the teacher speaker recognition model is trained with the objective function in formula (2), the model parameters (namely w and b) are continuously updated according to the change of the L_s value until the L_s value meets the requirement, for example its variation becomes very small; the updating of the model parameters then stops and the training of the teacher speaker recognition model is complete.
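A minimal PyTorch sketch of this teacher-training step follows. The encoder body, embedding size, speaker count, and optimizer settings are illustrative assumptions rather than the structure used in this application; nn.CrossEntropyLoss plays the role of the softmax objective in formula (2).

```python
import torch
import torch.nn as nn

class SpeakerNet(nn.Module):
    """Toy stand-in for the teacher network body; any CNN/RNN/DNN/x-vector
    style encoder producing a fixed-size characterization vector would do."""
    def __init__(self, feat_dim=40, emb_dim=256, num_speakers=1000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim))
        self.classifier = nn.Linear(emb_dim, num_speakers)   # weights w_j and bias b

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        emb = self.encoder(x).mean(dim=1)  # simple pooling to a sentence-level vector
        return emb, self.classifier(emb)

teacher = SpeakerNet()
criterion = nn.CrossEntropyLoss()          # softmax cross-entropy, as in formula (2)
optimizer = torch.optim.Adam(teacher.parameters(), lr=1e-3)

def teacher_train_step(features, speaker_labels):
    _, logits = teacher(features)
    loss = criterion(logits, speaker_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```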
Alternatively, an existing model that has already been trained on wideband speech may be used as the teacher speaker recognition model in this embodiment, as long as it is ensured that the model was trained on wideband acoustic features of wideband speech. It must also be ensured that the second sample speech (e.g., narrowband speech) corresponding to the second sampling rate subsequently used to train the student speaker recognition model and the wideband speech used to train the teacher speaker recognition model belong to the same sample speakers, and that the sampling rate (namely the second sampling rate) of the second sample speech (e.g., narrowband sample speech) input when training the student speaker recognition model is lower than the sampling rate (namely the first sampling rate) of the first sample speech (e.g., wideband sample speech) input to the teacher network.
Step A2: acquiring a second sample voice corresponding to a second sampling rate; extracting acoustic features of the second sample voice from the second sample voice; wherein the first sample speech and the second sample speech belong to the same sample speaker.
In this embodiment, constructing the speaker recognition model requires not only first sample voices (e.g., wideband sample voices) at the first sampling rate but also second sample voices (e.g., narrowband sample voices) at the second sampling rate, where the first sampling rate is higher than the second sampling rate. After the second sample voices are obtained, a method similar to the extraction of the second acoustic feature of the target speech in step S102 is used, with the target speech replaced by the second sample voices, so that the acoustic features of the second sample voices can be extracted; for details, reference is made to the description of step S102, which is not repeated here. The final speaker recognition model can then be trained, through the subsequent steps A3-A4, using second sample voices and first sample voices that belong to the same sample speaker, for example, voices at a sampling rate of 8000 Hz and voices at a sampling rate of 16000 Hz spoken by the same sample speaker.
Step A3: inputting the acoustic characteristics of the first sample voice into a teacher speaker recognition model to obtain a first sample characterization vector; and inputting the acoustic features of the first sample voice and the acoustic features corresponding to the second sample voice into the initial speaker recognition model to respectively obtain a second sample characterization vector and a third sample characterization vector.
In this embodiment, after the first sample speech (e.g., wideband sample speech) is acquired in step A1 and its acoustic features are extracted, those acoustic features are input into the teacher speaker recognition model to obtain the sample characterization vector corresponding to the first sample speech (defined here as the first sample characterization vector), as shown in the right diagram of fig. 3. Meanwhile, after the second sample speech (e.g., narrowband sample speech) is obtained in step A2, its acoustic features can be extracted in a way similar to the extraction of the second acoustic feature of the target speech in step S102 (with the target speech replaced by the second sample speech). After these acoustic features are processed, they and the acoustic features of the first sample speech are input into the initial speaker recognition model, which recognizes the sample characterization vector corresponding to the first sample speech (defined here as the second sample characterization vector) and the sample characterization vector corresponding to the second sample speech (defined here as the third sample characterization vector), as shown in the right diagram of fig. 3.
Wherein the network structures of the initial speaker recognition model and the teacher speaker recognition model are the same, and both load the model parameters (i.e., w and b) of the teacher speaker recognition model trained through steps A1-A2 as initial parameters.
Step A4: and training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector and the third sample characterization vector to generate a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In this embodiment, after the first, second, and third sample characterization vectors are obtained in step A3, the idea of knowledge distillation on characterization vectors is applied: the first sample characterization vector output by the teacher speaker recognition model is used to directly constrain the second and third sample characterization vectors output by the initial speaker recognition model (regarded as the student speaker recognition model), with the requirement that the more similar the characterization vectors, the better. This realizes the training of the initial speaker recognition model. During training, the network parameters of the teacher speaker recognition model are kept fixed and only the network parameters of the initial speaker recognition model are updated. After the preset conditions are met and training ends, the student speaker recognition model is obtained and serves as the final speaker recognition model.
In a possible implementation of the embodiment of the present application, the specific implementation of step A4 may include: first, calculating the cosine similarity between the first sample characterization vector and the second sample characterization vector as a first cosine loss, and calculating the cosine similarity between the first sample characterization vector and the third sample characterization vector as a second cosine loss; then, calculating the sum of the first cosine loss and the second cosine loss, training the initial speaker recognition model according to the sum, generating a student speaker recognition model, and taking the student speaker recognition model as the final speaker recognition model.
Specifically, in this implementation manner, in order to train a speaker recognition model with a better recognition effect, a specific calculation formula of a sum of the first cosine loss and the second cosine loss in the training process is as follows:
L_total = L_cos(t_wb, s_nb) + L_cos(t_wb, s_wb)    (3)
where L_cos(t_wb, s_nb) denotes the cosine similarity between the first sample characterization vector output by the teacher speaker recognition network and the second sample characterization vector output by the student speaker recognition model, namely the first cosine loss; L_cos(t_wb, s_wb) denotes the cosine similarity between the first sample characterization vector output by the teacher speaker recognition network and the third sample characterization vector output by the student speaker recognition model, namely the second cosine loss; and L_total denotes the sum of the first cosine loss and the second cosine loss.
The calculation formulas of the first cosine loss and the second cosine loss are as follows:
L_cos(t^wb, s) = (1/M) · Σ_{i=1}^{M} (t_i^wb · s_i) / (||t_i^wb|| · ||s_i||)    (4)
where t_i^wb denotes the first sample characterization vector, output by the teacher speaker recognition network, of the i-th first sample speech (e.g., wideband sample speech); s_i denotes the second sample characterization vector or third sample characterization vector, output by the student speaker recognition network, of the i-th sample speech (i.e., the wideband sample speech corresponding to the first sample speech or the narrowband sample speech corresponding to the second sample speech); and M denotes the total number of sample voices.
When the student speaker recognition model is trained with the above formulas (3) and (4), the parameters of the student speaker recognition model are continuously updated according to the change of the L_total value until the L_total value meets the requirement, for example its variation becomes very small; the updating of the model parameters then stops, the training of the student speaker recognition model is complete, and the trained student speaker recognition model is used as the final speaker recognition model.
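For illustration, the sketch below shows one way to realize the distillation objective of formulas (3) and (4) in PyTorch, reusing the assumed SpeakerNet/teacher from the earlier sketch; turning each cosine similarity into a (1 - cosine) term for minimization is an implementation choice made here, not something mandated by the application.

```python
import copy
import torch
import torch.nn.functional as F

# `teacher` is the trained SpeakerNet instance from the previous sketch (assumed).
student = copy.deepcopy(teacher)            # same structure, initialized with the teacher's parameters
for p in teacher.parameters():
    p.requires_grad_(False)                 # teacher parameters stay fixed during distillation

opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(feats_wb, feats_nb_padded):
    """feats_wb: wideband FBANK batch; feats_nb_padded: processed (zero-padded)
    narrowband FBANK batch from the same sample speakers."""
    with torch.no_grad():
        t_wb, _ = teacher(feats_wb)         # first sample characterization vector
    s_wb, _ = student(feats_wb)             # second sample characterization vector
    s_nb, _ = student(feats_nb_padded)      # third sample characterization vector

    # Formula (3): sum of the two cosine losses; minimizing (1 - cos) pulls both
    # student vectors toward the teacher vector ("the more similar the better").
    loss = (1 - F.cosine_similarity(t_wb, s_wb, dim=-1)).mean() \
         + (1 - F.cosine_similarity(t_wb, s_nb, dim=-1)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```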
It should be noted that training the student speaker recognition model with the teacher speaker recognition model uses parallel data of first sample voices (e.g., wideband sample voices) and second sample voices (e.g., narrowband sample voices), such as voices at sampling rates of 8000 Hz and 16000 Hz spoken by a large number of the same sample speakers. When the collected training data cannot meet this condition, the collected first sample voices (e.g., wideband sample voices) can be down-sampled to obtain parallel second sample voices (e.g., narrowband sample voices) to complete the training data for model training.
Thus, through the steps A1-A4, the characterization vectors output by the teacher speaker recognition model are used to guide the training of the student speaker recognition model, no speaker labeling of the training data is needed, and the student speaker recognition model can be retained as the final speaker recognition model after the training is finished. This unsupervised training mode enables the finally obtained speaker recognition model to output, for an input target voice of low frequency (such as narrowband) or high frequency (such as wideband), a target characterization vector that more accurately characterizes the individual voice information of the target voice, so that the target speaker to which the target voice belongs can be recognized more accurately by using the target characterization vector in the subsequent step S104, thereby determining the identity information of the speaker.
The method ensures that there is no effect loss when high-frequency voice acoustic features are subsequently input into the speaker recognition model, and compensates for the effect reduction caused by inputting low-frequency voice acoustic features, so that the same speaker recognition model can obtain a good recognition effect on both low-frequency and high-frequency voice data. That is to say, through the teacher-student model learning mode, the high-frequency information missing from low-frequency voice acoustic features can be compensated without increasing the number of parameters of the speaker recognition model, thereby improving the accuracy of the recognition result.
Furthermore, in a possible implementation manner of the embodiment of the present application, when the target voice obtained in step S101 includes M segments of speech (where M is a positive integer greater than 1), in step S102 the M first acoustic features of the M segments of speech may be extracted from the M segments respectively, and the M first acoustic features are processed based on the respective sampling rates of the M segments of speech to obtain M second acoustic features; then, in step S103, the M second acoustic features are respectively input into the speaker recognition model trained through steps A1-A4, and M target characterization vectors corresponding to the target speaker when speaking the speech content of the M segments of speech are obtained by recognition; finally, the average value of the M target characterization vectors is calculated and used as the final target characterization vector corresponding to the target speaker for executing the following step S104.
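A minimal sketch of this multi-segment case is given below; speaker_model and extract_second_feature are hypothetical stand-ins for the trained speaker recognition model and the feature processing of step S102, and the segment list is assumed to be already split.

import numpy as np

def final_target_vector(segments, sample_rates, speaker_model, extract_second_feature):
    """segments: list of M waveforms; sample_rates: their respective sampling rates."""
    vectors = []
    for seg, sr in zip(segments, sample_rates):
        feat = extract_second_feature(seg, sr)    # step S102: first -> second acoustic feature
        vectors.append(speaker_model(feat))       # step S103: one target characterization vector per segment
    return np.mean(np.stack(vectors), axis=0)     # average of the M target characterization vectors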
S104: and identifying the target speaker according to the target characterization vector to obtain an identification result of the target speaker.
In this embodiment, after the target representation vector that the target speaker has when speaking the speech content of the target speech is obtained in step S103, the target representation vector may be further processed, and the target speaker may be identified according to the processing result, so as to obtain the identification result of the target speaker.
Specifically, an alternative implementation manner is that the specific implementation process of this step S104 may include the following steps B1-B2:
step B1: and calculating the similarity between the target characteristic vector of the target speaker and the preset characteristic vector of the preset speaker.
In this implementation manner, when the identity of the target speaker needs to be confirmed to determine whether the target speaker is a certain preset speaker, after the target characterization vector that the target speaker has when speaking the target voice is obtained in step S103, the similarity between the target characterization vector of the target speaker and the preset characterization vector of the preset speaker may be further calculated, where the specific calculation formula is as follows:
cos(v1, v2) = (v1 · v2) / (||v1|| · ||v2||)    (5)
wherein v1 represents the target characterization vector of the target speaker; v2 represents the preset characterization vector of the preset speaker; and cos(v1, v2) represents the similarity between the target characterization vector of the target speaker and the preset characterization vector of the preset speaker. The higher the value of cos(v1, v2), the more similar the target speaker is to the preset speaker, that is, the higher the possibility that the target speaker and the preset speaker are the same person; conversely, the smaller the value of cos(v1, v2), the less similar the target speaker is to the preset speaker, that is, the lower the possibility that they are the same person.
Step B2: judging whether the similarity is higher than a preset threshold value, if so, determining that the target speaker is a preset speaker; if not, determining that the target speaker is not the preset speaker.
In this implementation, after the similarity cos(v1, v2) between the target characterization vector of the target speaker and the preset characterization vector of the preset speaker is calculated through step B1, it needs to be further determined whether the similarity cos(v1, v2) is higher than a preset threshold; if so, the target speaker is determined to be the preset speaker; if not, the target speaker is determined not to be the preset speaker.
The preset threshold refers to a critical value used to decide whether the target speaker and the preset speaker are the same person, and its specific value may be set according to the actual situation, which is not limited in the embodiment of the present application; for example, the preset threshold may be set to 0.8, or to the value corresponding to the equal error rate, or to the value corresponding to the minimum detection cost function, or to another value determined empirically according to the actual application scenario. When the similarity between the target characterization vector of the target speaker and the preset characterization vector of the preset speaker exceeds this critical value, the target speaker and the preset speaker are considered to be the same person; otherwise, when the similarity does not exceed the critical value, the target speaker and the preset speaker are considered not to be the same person.
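As a small illustration of steps B1-B2, the sketch below computes the cosine similarity of formula (5) and applies the threshold decision; the threshold value 0.8 is just the example value mentioned above, not a required setting.

import numpy as np

def cosine_similarity(v1, v2):
    # Formula (5): dot product of the two characterization vectors over the product of their norms.
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def is_preset_speaker(target_vector, preset_vector, threshold=0.8):
    # Step B2: the target speaker is judged to be the preset speaker when the similarity exceeds the threshold.
    return cosine_similarity(target_vector, preset_vector) > threshold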
Or, in another optional implementation manner, the specific implementation process of step S104 may further include the following steps C1-C2:
step C1: calculating N similarity between a target characterization vector of the target speaker and N preset characterization vectors of N preset speakers; wherein N is a positive integer greater than 1.
In this implementation, when the target speaker needs to be identified to determine which one of N preset speakers (where N is a positive integer greater than 1) the target speaker is, after the target characterization vector that the target speaker has when speaking the target voice is obtained in step S103, the N similarities between the target characterization vector of the target speaker and the N preset characterization vectors of the N preset speakers may be further calculated by using the above formula (5), so as to execute the subsequent step C2.
Step C2: and selecting the maximum similarity from the N similarities, and determining the target speaker as the preset speaker corresponding to the maximum similarity.
In this implementation manner, after the N similarities between the target characterization vector of the target speaker and the N preset characterization vectors of the N preset speakers are calculated in step C1, a maximum similarity may be further selected from the N similarities, and the target speaker is determined to be the preset speaker corresponding to the maximum similarity.
For example: assuming that there are three preset speakers a, b and c, and the similarity between the target characterization vector of the target speaker and the preset characterization vector of preset speaker a is calculated to be 0.1, the similarity with the preset characterization vector of preset speaker b is 0.84, and the similarity with the preset characterization vector of preset speaker c is 0.22, the maximum similarity can be determined to be 0.84, and according to this maximum similarity of 0.84, the recognition result is that the identity of the target speaker is preset speaker b.
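A minimal sketch of steps C1-C2, consistent with the example above, might look as follows; the dictionary of enrolled vectors and the helper name identify_speaker are assumptions for illustration.

import numpy as np

def identify_speaker(target_vector, preset_vectors):
    """preset_vectors: dict mapping a preset speaker's name to its preset characterization vector."""
    def cos(v1, v2):
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    scores = {name: cos(target_vector, vec) for name, vec in preset_vectors.items()}
    best = max(scores, key=scores.get)            # step C2: pick the preset speaker with the largest similarity
    return best, scores[best]

# With similarities of 0.10, 0.84 and 0.22 against preset speakers a, b and c,
# identify_speaker(...) would return ("b", 0.84), matching the example above.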
Therefore, the pre-constructed speaker recognition model can process low-frequency voice and high-frequency voice, and the voice is not required to be subjected to operations such as up-sampling, down-sampling or bandwidth expansion during recognition, so that the effect is superior to that of the existing recognition scheme. Meanwhile, through the steps A1-A4, the output characterization vector of the teacher speaker recognition model is used for guiding the training of the student speaker recognition model, speaker labeling is not needed to be carried out on training data, and the unsupervised training mode reduces the cost of manual labeling data, can be expanded to a large-scale training data set, and fully exerts the technical advantages of deep learning.
In the model training stage, the network structures of the teacher speaker recognition model and the student speaker recognition model are set to be completely the same. The teacher speaker recognition model is first trained by using the acoustic features of high-frequency voice, and the network parameters of the trained teacher speaker recognition model are then loaded into the student speaker recognition model. During training, the network parameters of the teacher speaker recognition model are fixed and only the network parameters of the student speaker recognition model are updated, and the COS criterion is adopted so that the characterization vectors output by the student speaker recognition model for input low-frequency and high-frequency voice acoustic features become more similar to the characterization vectors output by the teacher speaker recognition model for input high-frequency voice acoustic features. After the training is finished, the student speaker recognition model is kept as the final mixed-bandwidth speaker recognition model. This not only ensures that there is no effect loss when acoustic features of high-frequency voice (such as wideband acoustic features) are input into the speaker recognition model, but also compensates for the effect reduction caused by inputting acoustic features of low-frequency voice (such as narrowband acoustic features).
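The setup recapped above can be sketched as follows; the tiny stand-in encoder, the checkpoint path and the optimizer choice are assumptions for illustration, not the actual network structure or training configuration of the patent.

import copy
import torch
import torch.nn as nn

# A tiny stand-in encoder; the real network structure is whichever one the teacher model uses.
def make_encoder():
    return nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 192))

teacher = make_encoder()
# teacher.load_state_dict(torch.load("teacher_wideband.pt"))   # hypothetical checkpoint of the trained teacher
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)            # the teacher's network parameters are fixed during distillation

student = copy.deepcopy(teacher)       # identical network structure, initialized from the teacher's parameters
student.train()
for p in student.parameters():
    p.requires_grad_(True)             # only the student's network parameters are updated

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)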
In summary, in the speaker recognition method provided in this embodiment, a target voice to be recognized is first obtained, the sampling rate of the target voice is determined, and a first acoustic feature is extracted from the target voice; the first acoustic feature is processed based on the sampling rate of the target voice to obtain a second acoustic feature, and the second acoustic feature is then input into a pre-constructed speaker recognition model to obtain a target characterization vector of the target speaker, where the speaker recognition model is obtained by co-training with voices of different sampling rates; the target speaker can then be recognized according to the target characterization vector to obtain the recognition result of the target speaker. In this way, by inputting the second acoustic feature corresponding to the target voice into the pre-constructed speaker recognition model, there is no effect loss when high-frequency voice acoustic features are input, the effect reduction caused by inputting low-frequency voice acoustic features is compensated, and the target characterization vector of the target speaker can be predicted, so that the high-frequency information missing from low-frequency voice acoustic features is compensated without increasing the number of parameters of the speaker recognition model, the same speaker recognition model can obtain a good recognition effect on both low-frequency and high-frequency target voice data, and the accuracy of the speaker recognition result is improved.
Second embodiment
In this embodiment, a speaker recognition apparatus will be described, and please refer to the above method embodiment for related contents.
Referring to fig. 4, a schematic diagram of a speaker recognition apparatus provided in this embodiment is shown, in which the apparatus 400 includes:
a first obtaining unit 401, configured to obtain a target voice to be recognized; determining the sampling rate of the target voice;
a processing unit 402, configured to extract a first acoustic feature from the target speech; processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature;
the first identification unit 403 is configured to input the second acoustic feature into a pre-constructed speaker identification model, and identify to obtain a target characterization vector of a target speaker; the speaker recognition model is obtained by utilizing the voice co-training of different sampling rates;
and a second identifying unit 404, configured to identify the target speaker according to the target characterization vector, so as to obtain an identification result of the target speaker.
In an implementation manner of this embodiment, the apparatus further includes:
the second acquisition unit is used for acquiring a first sample voice and a teacher speaker recognition model corresponding to the first sampling rate; the teacher speaker recognition model is obtained through voice training based on a first sampling rate;
the third acquisition unit is used for acquiring second sample voice corresponding to the second sampling rate; extracting acoustic features of the second sample voice from the second sample voice; the first sample speech and the second sample speech belong to the same sample speaker;
the obtaining unit is used for inputting the acoustic characteristics of the first sample voice into the teacher speaker recognition model to obtain a first sample characterization vector; inputting the acoustic features of the first sample voice and the acoustic features corresponding to the second sample voice into an initial speaker recognition model to respectively obtain a second sample characterization vector and a third sample characterization vector;
and the training unit is used for training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector and the third sample characterization vector to generate a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In an implementation manner of this embodiment, the processing unit 402 includes:
the first processing subunit is used for directly taking the first acoustic feature as a second acoustic feature when the sampling rate of the target voice is determined to be the first sampling rate;
and the second processing subunit is configured to, when it is determined that the sampling rate of the target speech is the second sampling rate, process the first acoustic feature to obtain a second acoustic feature.
In one implementation of this embodiment, the first sampling rate is higher than the second sampling rate, and the first acoustic feature includes a log mel filter bank FBANK feature; the second processing subunit includes:
the adjusting subunit is configured to adjust the number of filters that filter the power spectrum of the first acoustic feature to obtain an adjusted first acoustic feature, so that the adjusted first acoustic feature is aligned with a low frequency band region of the acoustic feature of the speech corresponding to the first sampling rate;
and a zero padding subunit, configured to pad zero for a difference dimension of the adjusted first acoustic feature and the acoustic feature of the speech corresponding to the first sampling rate, so that the dimension of the first acoustic feature after the zero padding is the same as the dimension of the acoustic feature of the speech corresponding to the first sampling rate, and use the first acoustic feature after the zero padding as the second acoustic feature.
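A hedged sketch of these two subunits is given below: the narrowband speech is analysed with a reduced number of mel filters covering only its low band, and the resulting log-FBANK feature is zero-padded up to the wideband feature dimension. The filter counts, frame parameters and use of librosa are illustrative assumptions, not values fixed by the patent.

import numpy as np
import librosa

N_MELS_WB, N_MELS_NB = 64, 48          # assumed filter counts for 16 kHz and 8 kHz speech

def narrowband_second_feature(wav_nb, sr=8000):
    # Fewer mel filters, covering only the 0 - 4000 Hz band of the narrowband speech,
    # so that they align with the low-band region of the wideband FBANK feature.
    mel = librosa.feature.melspectrogram(y=wav_nb, sr=sr, n_fft=256, hop_length=80,
                                         n_mels=N_MELS_NB, fmax=sr / 2)
    fbank = np.log(mel + 1e-6).T                          # frames x N_MELS_NB log-FBANK feature
    # Zero-pad the missing high-band dimensions to reach the wideband feature dimension.
    pad = np.zeros((fbank.shape[0], N_MELS_WB - N_MELS_NB))
    return np.concatenate([fbank, pad], axis=1)           # frames x N_MELS_WB second acoustic feature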
In an implementation manner of this embodiment, the obtaining unit is specifically configured to:
and after the acoustic characteristics corresponding to the second sample voice are processed, inputting the initial speaker recognition model.
In an implementation manner of this embodiment, the training unit includes:
a first calculating subunit, configured to calculate a cosine similarity between the first sample characterization vector and the second sample characterization vector as a first cosine loss;
a second calculating subunit, configured to calculate a cosine similarity between the first sample characterization vector and the third sample characterization vector as a second cosine loss;
and the training subunit is used for calculating the sum of the first cosine loss and the second cosine loss, training the initial speaker recognition model according to the sum, generating a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
In an implementation manner of this embodiment, the target speech includes M segments of speech; m is a positive integer greater than 1; the processing unit 402 is specifically configured to:
respectively extracting M first acoustic features of the M sections of voice from the M sections of voice; processing the M first acoustic features based on the respective sampling rates of the M sections of voice to obtain M second acoustic features;
the first recognition unit 403 includes:
the identification subunit is used for respectively inputting the M second acoustic features into a pre-constructed speaker identification model, and identifying to obtain M target representation vectors corresponding to the target speaker;
and the third calculating subunit is used for calculating the average value of the M target characterization vectors, and taking the average value as the final target characterization vector corresponding to the target speaker.
In an implementation manner of this embodiment, the second identifying unit 404 includes:
the fourth calculating subunit is used for calculating the similarity between the target characterization vector of the target speaker and the preset characterization vector of a preset speaker;
the first determining subunit is used for judging whether the similarity is higher than a preset threshold value, and if so, determining that the target speaker is the preset speaker; if not, determining that the target speaker is not the preset speaker.
In an implementation manner of this embodiment, the second identifying unit 404 includes:
the fifth calculating subunit is used for calculating N similarity between the target characterization vector of the target speaker and N preset characterization vectors of N preset speakers; n is a positive integer greater than 1;
and the second determining subunit is used for selecting the maximum similarity from the N similarities and determining the target speaker as a preset speaker corresponding to the maximum similarity.
Further, an embodiment of the present application further provides a speaker recognition device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which when executed by the processor cause the processor to execute any one of the implementation methods of the speaker recognition method.
Further, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the instructions cause the terminal device to execute any implementation method of the speaker identification method.
Further, an embodiment of the present application further provides a computer program product, which when running on a terminal device, causes the terminal device to execute any one implementation method of the speaker recognition method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A speaker recognition method, comprising:
acquiring target voice to be recognized, and determining the sampling rate of the target voice;
extracting a first acoustic feature from the target voice; processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature;
inputting the second acoustic characteristic into a pre-constructed speaker recognition model, and recognizing to obtain a target representation vector of a target speaker; the speaker recognition model is obtained by utilizing the voice co-training of different sampling rates;
and identifying the target speaker according to the target characterization vector to obtain an identification result of the target speaker.
2. The method of claim 1, wherein the speaker recognition model is constructed as follows:
acquiring a first sample voice and a teacher speaker recognition model corresponding to a first sampling rate; the teacher speaker recognition model is obtained through voice training based on a first sampling rate;
acquiring a second sample voice corresponding to a second sampling rate; extracting acoustic features of the second sample voice from the second sample voice; the first sample speech and the second sample speech belong to the same sample speaker;
inputting the acoustic characteristics of the first sample voice into the teacher speaker recognition model to obtain a first sample characterization vector; inputting the acoustic features of the first sample voice and the acoustic features corresponding to the second sample voice into an initial speaker recognition model to respectively obtain a second sample characterization vector and a third sample characterization vector;
and training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector and the third sample characterization vector to generate a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
3. The method of claim 2, wherein processing the first acoustic feature to obtain a second acoustic feature based on a sampling rate of the target speech comprises:
when the sampling rate of the target voice is determined to be the first sampling rate, directly taking the first acoustic feature as a second acoustic feature;
and when the sampling rate of the target voice is determined to be the second sampling rate, processing the first acoustic feature to obtain a second acoustic feature.
4. The method of claim 3, wherein the first sampling rate is higher than the second sampling rate, wherein the first acoustic features comprise log-Mel Filter Bank (FBANK) features; when the sampling rate of the target voice is determined to be the second sampling rate, processing the first acoustic feature to obtain a second acoustic feature, including:
adjusting the number of filters for filtering the power spectrum of the first acoustic feature to obtain an adjusted first acoustic feature, so that the adjusted first acoustic feature is aligned with a low-frequency band region of the acoustic feature of the voice corresponding to the first sampling rate;
and zero filling is carried out on the difference dimension of the adjusted first acoustic feature and the acoustic feature of the voice corresponding to the first sampling rate, so that the dimension of the zero filled first acoustic feature is the same as that of the voice corresponding to the first sampling rate, and the zero filled first acoustic feature is taken as a second acoustic feature.
5. The method of claim 3, wherein the inputting the acoustic features of the first sample speech and the acoustic features corresponding to the second sample speech into an initial speaker recognition model comprises: and after the acoustic characteristics corresponding to the second sample voice are processed, inputting the initial speaker recognition model.
6. The method of claim 2, wherein training the initial speaker recognition model according to the first sample characterization vector, the second sample characterization vector, and the third sample characterization vector to generate a student speaker recognition model, and using the student speaker recognition model as a final speaker recognition model comprises:
calculating cosine similarity between the first sample characterization vector and the second sample characterization vector as a first cosine loss;
calculating cosine similarity between the first sample characterization vector and the third sample characterization vector as a second cosine loss;
and calculating a sum of the first cosine loss and the second cosine loss, training the initial speaker recognition model according to the sum, generating a student speaker recognition model, and taking the student speaker recognition model as a final speaker recognition model.
7. The method of claim 1, wherein the target speech comprises M segments of speech; m is a positive integer greater than 1; extracting a first acoustic feature from the target voice; and processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature, including:
respectively extracting M first acoustic features of the M sections of voice from the M sections of voice; processing the M first acoustic features based on the respective sampling rates of the M sections of voice to obtain M second acoustic features;
inputting the second acoustic feature into a pre-constructed speaker recognition model, and recognizing to obtain a target representation vector of a target speaker, wherein the method comprises the following steps:
respectively inputting the M second acoustic features into a pre-constructed speaker recognition model, and recognizing to obtain M target characterization vectors corresponding to a target speaker;
and calculating the average value of the M target characterization vectors, and taking the average value as the final target characterization vector corresponding to the target speaker.
8. The method according to claim 1, wherein said recognizing the target speaker according to the target characterization vector to obtain a recognition result of the target speaker comprises:
calculating the similarity between the target characteristic vector of the target speaker and the preset characteristic vector of a preset speaker;
judging whether the similarity is higher than a preset threshold value, if so, determining that the target speaker is the preset speaker; if not, determining that the target speaker is not the preset speaker.
9. The method according to claim 1, wherein said recognizing the target speaker according to the target characterization vector to obtain a recognition result of the target speaker comprises:
calculating N similarity between the target characterization vector of the target speaker and N preset characterization vectors of N preset speakers; n is a positive integer greater than 1;
and selecting the maximum similarity from the N similarities, and determining the target speaker as a preset speaker corresponding to the maximum similarity.
10. A speaker recognition apparatus, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a voice recognition unit, wherein the first acquisition unit is used for acquiring target voice to be recognized; determining the sampling rate of the target voice;
the processing unit is used for extracting a first acoustic feature from the target voice; processing the first acoustic feature based on the sampling rate of the target voice to obtain a second acoustic feature;
the first identification unit is used for inputting the second acoustic characteristic into a pre-constructed speaker identification model and identifying to obtain a target representation vector of a target speaker; the speaker recognition model is obtained by utilizing the voice co-training of different sampling rates;
and the second identification unit is used for identifying the target speaker according to the target characterization vector to obtain an identification result of the target speaker.
11. A speaker recognition device, comprising: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is to store one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of claims 1-9.
12. A computer-readable storage medium having stored therein instructions that, when executed on a terminal device, cause the terminal device to perform the method of any one of claims 1-9.
CN202110807643.7A 2021-07-16 Speaker recognition method, speaker recognition device, storage medium and equipment Active CN113516987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110807643.7A CN113516987B (en) 2021-07-16 Speaker recognition method, speaker recognition device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110807643.7A CN113516987B (en) 2021-07-16 Speaker recognition method, speaker recognition device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN113516987A true CN113516987A (en) 2021-10-19
CN113516987B CN113516987B (en) 2024-04-12



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1629937A (en) * 1997-06-10 2005-06-22 编码技术股份公司 Source coding enhancement using spectral-band replication
WO1999059136A1 (en) * 1998-05-08 1999-11-18 T-Netix, Inc. Channel estimation system and method for use in automatic speaker verification systems
US20080195387A1 (en) * 2006-10-19 2008-08-14 Nice Systems Ltd. Method and apparatus for large population speaker identification in telephone interactions
US20110276323A1 (en) * 2010-05-06 2011-11-10 Senam Consulting, Inc. Speech-based speaker recognition systems and methods
US20150039313A1 (en) * 2010-05-06 2015-02-05 Senam Consulting, Inc. Speech-Based Speaker Recognition Systems and Methods
CN105869645A (en) * 2016-03-25 2016-08-17 腾讯科技(深圳)有限公司 Voice data processing method and device
US20170294191A1 (en) * 2016-04-07 2017-10-12 Fujitsu Limited Method for speaker recognition and apparatus for speaker recognition
CN109564759A (en) * 2016-08-03 2019-04-02 思睿逻辑国际半导体有限公司 Speaker Identification
US20180082692A1 (en) * 2016-09-19 2018-03-22 Pindrop Security, Inc. Channel-Compensated Low-Level Features For Speaker Recognition
EP3516652A1 (en) * 2016-09-19 2019-07-31 Pindrop Security, Inc. Channel-compensated low-level features for speaker recognition
CN111656440A (en) * 2018-01-23 2020-09-11 思睿逻辑国际半导体有限公司 Speaker identification
CN112233680A (en) * 2020-09-27 2021-01-15 科大讯飞股份有限公司 Speaker role identification method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"通信", 中国无线电电子学文摘, no. 04 *
SOUMEN KANRAR ET AL.: "Speaker Identification by GMM based i Vector", ARXIV *

Similar Documents

Publication Publication Date Title
CN108564942B (en) Voice emotion recognition method and system based on adjustable sensitivity
KR100908121B1 (en) Speech feature vector conversion method and apparatus
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
CN110767239A (en) Voiceprint recognition method, device and equipment based on deep learning
CN113488058A (en) Voiceprint recognition method based on short voice
CN113763965B (en) Speaker identification method with multiple attention feature fusion
Chauhan et al. Speech to text converter using Gaussian Mixture Model (GMM)
KR102026226B1 (en) Method for extracting signal unit features using variational inference model based deep learning and system thereof
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
CN111091840A (en) Method for establishing gender identification model and gender identification method
Ozerov et al. GMM-based classification from noisy features
Park et al. The Second DIHARD Challenge: System Description for USC-SAIL Team.
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
CN114038469B (en) Speaker identification method based on multi-class spectrogram characteristic attention fusion network
CN113516987B (en) Speaker recognition method, speaker recognition device, storage medium and equipment
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN113516987A (en) Speaker recognition method, device, storage medium and equipment
Hanifa et al. Comparative Analysis on Different Cepstral Features for Speaker Identification Recognition
Hizlisoy et al. Text independent speaker recognition based on MFCC and machine learning
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
US7454337B1 (en) Method of modeling single data class from multi-class data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant