CN114512133A - Sound object recognition method, sound object recognition device, server and storage medium - Google Patents

Sound object recognition method, sound object recognition device, server and storage medium

Info

Publication number
CN114512133A
CN114512133A (application CN202011159156.6A)
Authority
CN
China
Prior art keywords
voiceprint feature
voice
vector
voiceprint
sound
Prior art date
Legal status
Pending
Application number
CN202011159156.6A
Other languages
Chinese (zh)
Inventor
张大威
姜涛
王晓瑞
王俊
李岩
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011159156.6A
Publication of CN114512133A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/18: Artificial neural networks; Connectionist approaches
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: ... characterised by the type of extracted parameters
    • G10L25/18: ... the extracted parameters being spectral information of each sub-band
    • G10L25/24: ... the extracted parameters being the cepstrum
    • G10L25/27: ... characterised by the analysis technique
    • G10L25/30: ... characterised by the analysis technique using neural networks
    • G10L25/45: ... characterised by the type of analysis window

Abstract

The disclosure relates to a sound-emitting object recognition method, apparatus, server and storage medium. The sound-emitting object recognition method comprises the following steps: extracting a first speech vector from first speech data of a sound-emitting object to be recognized and a second speech vector from second speech data of a target sound-emitting object; inputting the first speech vector and the second speech vector into a voiceprint feature recognition model and performing voiceprint feature extraction on each of them with the activation functions of the hidden layers in the model, to obtain a first voiceprint feature of the sound-emitting object to be recognized and a second voiceprint feature of the target sound-emitting object; calculating the similarity between the first voiceprint feature and the second voiceprint feature; and, if the similarity is greater than or equal to a similarity threshold, determining that the sound-emitting object to be recognized matches the target sound-emitting object. In this way, whether the sound-emitting object to be recognized matches the target sound-emitting object can be determined accurately.

Description

Sound object recognition method, sound object recognition device, server and storage medium
Technical Field
The present disclosure relates to the field of communications technologies, and in particular, to a method, an apparatus, a server, and a storage medium for recognizing a sound object.
Background
With the development of the mobile internet, network platforms have grown rapidly and large numbers of users upload their own audio and video to them. In some cases, a network platform needs to determine the sound-emitting object corresponding to a piece of audio or video.
Sound-emitting object recognition judges, based on the speech uttered by the sound-emitting object to be detected, whether that object is a registered target object. At present, however, because only few speech samples of speaker data are available for training, both the accuracy and the speed of recognizing the sound-emitting object are low.
Disclosure of Invention
The present disclosure provides a sound object recognition method, apparatus, server, and storage medium to at least solve the problem in the related art that the accuracy and speed of recognizing a sound object are not high. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a sound emission object recognition method, including: extracting a first voice vector from first voice data of a voice object to be recognized and extracting a second voice vector from second voice data of a target voice object; inputting the first voice vector and the second voice vector into a voiceprint feature recognition model, and respectively carrying out voiceprint feature extraction on the first voice vector and the second voice vector by utilizing an activation function of a hidden layer in the voiceprint feature recognition model to obtain a first voiceprint feature of a vocal object to be recognized and a second voiceprint feature of a target vocal object, wherein the voiceprint feature recognition model comprises a plurality of cascaded hidden layers, and the number of the hidden layers and the number of hidden neurons of each hidden layer are determined according to the number of training samples; calculating the similarity between the first voiceprint feature and the second voiceprint feature; and if the similarity is greater than or equal to the similarity threshold, determining that the sound-emitting object to be recognized is matched with the target sound-emitting object.
According to a second aspect of the embodiments of the present disclosure, there is provided a sound emission object recognition apparatus including: an extraction module configured to perform extraction of a first speech vector from first speech data of a sound object to be recognized and extraction of a second speech vector from second speech data of a target sound object; the input module is configured to input the first voice vector and the second voice vector into a voiceprint feature recognition model, voiceprint feature extraction is respectively carried out on the first voice vector and the second voice vector by utilizing an activation function of a hidden layer in the voiceprint feature recognition model, so that a first voiceprint feature of a to-be-recognized voice-producing object and a second voiceprint feature of a target voice-producing object are obtained, the voiceprint feature recognition model comprises a plurality of cascaded hidden layers, and the number of the hidden layers and the number of hidden neurons of each hidden layer are determined according to the number of training samples; a calculation module configured to perform calculating a similarity between the first and second voiceprint features; and the matching module is configured to determine that the sound-emitting object to be recognized is matched with the target sound-emitting object if the similarity is greater than or equal to the similarity threshold.
According to a third aspect of the embodiments of the present disclosure, there is provided a server, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the vocal object recognition method according to the first aspect or the second aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having instructions that, when executed by a processor of a server, enable the server to perform the sound emission object recognition method according to the first aspect or the second aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, wherein the instructions of the computer program product, when executed by a processor of a server, enable the server to perform the method of recognizing a sound-emitting object according to the first or second aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
in the embodiments of the disclosure, the first speech vector and the second speech vector, extracted from the first speech data of the sound-emitting object to be recognized and the second speech data of the target sound-emitting object respectively, are input into a voiceprint feature recognition model, and voiceprint feature extraction is performed on them with the activation functions of the hidden layers in the model to obtain a first voiceprint feature of the sound-emitting object to be recognized and a second voiceprint feature of the target sound-emitting object. The voiceprint feature recognition model comprises a plurality of cascaded hidden layers, and the number of hidden layers and the number of hidden neurons in each layer are determined according to the number of training samples, so the model extracts voiceprint features quickly. The first voiceprint feature can accurately represent the characteristics of the sound-emitting object to be recognized, and the second voiceprint feature can accurately represent the characteristics of the target sound-emitting object. Therefore, by determining the similarity between the first voiceprint feature and the second voiceprint feature, whether the sound-emitting object to be recognized matches the target sound-emitting object can be determined quickly and accurately.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a block diagram illustrating a time-delay neural network in accordance with an exemplary embodiment.
Fig. 2 is a diagram illustrating a sound-emitting object recognition architecture in accordance with an exemplary embodiment.
Fig. 3 is a schematic diagram illustrating an application environment of a sound-emitting object recognition method, apparatus, electronic device and storage medium according to an exemplary embodiment.
Fig. 4 is a flow diagram illustrating a sound-emitting object recognition method according to an example embodiment.
FIG. 5 is a flow diagram illustrating extraction of voiceprint features in accordance with an exemplary embodiment.
Fig. 6 is a diagram illustrating a framing method according to an exemplary embodiment.
FIG. 7 is a flow diagram illustrating another method of training a voiceprint feature recognition model in accordance with an illustrative embodiment.
FIG. 8 is a schematic diagram illustrating a time-delay neural network, according to an example embodiment.
FIG. 9 is a flowchart illustrating a process of computing voiceprint feature similarity in accordance with an exemplary embodiment.
FIG. 10 is a schematic diagram illustrating a cosine similarity algorithm in accordance with an exemplary embodiment.
FIG. 11 is a diagram illustrating a voiceprint feature recognition scenario in accordance with an exemplary embodiment.
Fig. 12 is a block diagram illustrating a sound-emitting object recognition device according to an example embodiment.
FIG. 13 is a block diagram illustrating a server in accordance with an exemplary embodiment.
FIG. 14 is a block diagram illustrating an apparatus for data processing according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
With the continuous expansion of the service scale of the network platform, the amount of data to be processed by the server of the network platform increases, and a single instance node in the server may not be able to complete the data processing of all services in time. Therefore, a plurality of instance nodes need to be deployed for the server, so as to implement parallel processing on massive and continuously newly-added data by using the plurality of instance nodes.
Before describing the embodiments of the present disclosure, technical terms used in describing the embodiments of the present disclosure will be first introduced.
First, voiceprint recognition is described. Voiceprint recognition, one of the biometric identification technologies, is also called speaker recognition and includes speaker identification and speaker verification. Speaker verification confirms whether a given utterance was spoken by a specified person, which is a one-to-one discrimination problem. Voiceprint recognition converts an acoustic signal into an electrical signal that is then recognized by a computer. The recognition of the sound-emitting object in the present disclosure can be understood as speaker verification, i.e., confirming whether the sound-emitting object is the target sound-emitting object.
A voiceprint is the sound-wave spectrum, displayed by an electro-acoustic instrument, that carries the speech information. A voiceprint is not only specific to a person but also relatively stable. This is because the vocal organs used in speaking differ greatly from person to person in size and shape, so the voiceprint spectra of any two persons differ.
The vocal characteristics are mainly embodied in the following aspects: (1) resonance mode: pharyngeal resonance, nasal resonance and oral resonance; (2) voice purity: the purity of different people's voices generally differs and can be roughly divided into high purity (bright), low purity (hoarse) and medium purity; (3) average pitch: whether the voice is generally high-pitched or deep; (4) vocal range: whether the voice sounds full or flat.
Because of these vocal characteristics, the formants of different people's voices are distributed differently in the spectrogram. Voiceprint recognition judges whether two segments of speech were spoken by the same person by comparing the speakers' voices on the same phonemes, thereby realizing recognition of a person by voice alone.
Next, the Time-Delay Neural Network (TDNN) is introduced. In a TDNN, the output of each hidden layer is expanded in the time domain: the input received by a hidden layer is not only the output of the previous layer at the current time, but also the output of the previous layer at several times before and after it, and the TDNN expands the output of each hidden layer in this way during network propagation.
The time-delay neural network is a neural network whose input layer takes multiple frames of input information, as shown in fig. 1, which shows a time-delay neural network structure. Assuming the delay of the TDNN is 2, three consecutive frames are all taken into account. The hidden layers serve as feature extractors; the 13 small circles in each rectangle of the input layer represent the 13-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features of that frame. The first hidden layer has 256 hidden neurons, so the number of connection weights is 3 × 13 × 256 = 9984. Since the number of weights to determine is small, feature extraction is fast. Because of this property, the TDNN is widely used for feature extraction.
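As an arithmetic check of the figures above (a minimal sketch assuming the 13-dimensional MFCC input, a delay of 2 and 256 hidden neurons described in the text; the array names are illustrative only):

```python
import numpy as np

# Assumed dimensions from the description: 13-dimensional MFCC per frame,
# a delay of 2 (three consecutive frames are spliced), 256 hidden neurons.
n_mfcc, context, n_hidden = 13, 3, 256

frames = np.random.randn(100, n_mfcc)      # 100 frames of 13-dim MFCC features
t = 50
spliced = frames[t - 1:t + 2].reshape(-1)  # frames t-1, t, t+1 -> 39-dim input

# Fully connecting the spliced input to the first hidden layer needs
# 3 * 13 * 256 = 9984 connection weights, as stated above.
n_weights = context * n_mfcc * n_hidden
print(spliced.shape, n_weights)            # (39,) 9984
```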
However, the inventors have found that with the existing TDNN network structure, when there are few speech data samples, the extracted voiceprint features are poor and prone to overfitting, so the accuracy of subsequently recognizing the sound-emitting object is not high.
On this basis, the embodiments of the present disclosure provide a sound-emitting object recognition method. The method inputs a first speech vector and a second speech vector, extracted from the first speech data of the sound-emitting object to be recognized and the second speech data of the target sound-emitting object respectively, into a voiceprint feature recognition model, and uses the activation functions of the hidden layers in the model to extract voiceprint features from the two vectors, obtaining a first voiceprint feature of the sound-emitting object to be recognized and a second voiceprint feature of the target sound-emitting object.
It should be noted that, the embodiment of the present disclosure provides a sound object recognition principle framework 200, as shown in fig. 2:
first, as part of the training process, the voiceprint feature recognition model is trained on a plurality of first training samples, each of which includes a third speech vector 212 of a first sound-emitting object. The voiceprint feature recognition model 214 extracts a third voiceprint feature 216 from the third speech vector 212 and is trained based on the third voiceprint feature 216 and its corresponding target identification information 218.
Then, the part relating to sound-emitting object recognition acquires a first speech vector 2221 of the sound-emitting object to be recognized and a second speech vector 2222 of the target sound-emitting object; inputs the first and second speech vectors into the trained voiceprint feature recognition model 224 to determine a first voiceprint feature 2261 of the sound-emitting object to be recognized and a second voiceprint feature 2262 of the target sound-emitting object; and calculates a first similarity 228 between the first and second voiceprint features. If the first similarity is greater than or equal to a first similarity threshold, it is determined that the sound-emitting object to be recognized matches the target sound-emitting object, yielding the recognition result.
As can be seen from fig. 2, a pre-trained voiceprint feature recognition model is required to extract a first voiceprint feature of an object to be recognized and a second voiceprint feature of the target object, and therefore, before feature extraction is performed by using the voiceprint feature recognition model, the voiceprint feature recognition model needs to be trained.
Fig. 3 is a schematic diagram of an application environment of the sound-emitting object recognition method, apparatus, electronic device and storage medium according to one or more embodiments of the present disclosure. As shown in fig. 3, the server 100 is communicatively connected to one or more user terminals 200 via a network 300 for data communication or interaction. The server 100 may be a web server, a database server, or the like. The user end 200 may be, but is not limited to, a Personal Computer (PC), a smart phone, a tablet computer, a Personal Digital Assistant (PDA), and the like. The network 300 may be a wired or wireless network.
The sound-emitting object recognition method provided by the embodiments of the present disclosure is described in detail below.
The sound-emitting object recognition method provided by the embodiments of the present disclosure can be applied to the user end 200. For convenience of description, and unless otherwise specified, the embodiments are described with the user end 200 as the execution subject. It is to be understood that this execution subject is not to be construed as limiting the disclosure.
First, a specific implementation of the recognition of the sound-emitting object provided by the embodiment of the present disclosure is described below with reference to the drawings.
Fig. 4 is a flowchart of a method for recognizing an utterance object according to an embodiment of the present disclosure. The method comprises the following steps:
s410, extracting a first voice vector from the first voice data of the voice object to be recognized, and extracting a second voice vector from the second voice data of the target voice object.
And S420, inputting the first voice vector and the second voice vector into a voiceprint feature recognition model, and respectively carrying out voiceprint feature extraction on the first voice vector and the second voice vector by utilizing an activation function of a hidden layer in the voiceprint feature recognition model to obtain a first voiceprint feature of the to-be-recognized sound object and a second voiceprint feature of the target sound object.
S430, calculating the similarity between the first voiceprint feature and the second voiceprint feature.
And S440, if the similarity is greater than or equal to the similarity threshold, determining that the sound generating object to be recognized is matched with the target sound generating object.
According to the embodiments of the disclosure, the first speech vector and the second speech vector, extracted from the first speech data of the sound-emitting object to be recognized and the second speech data of the target sound-emitting object respectively, are input into the voiceprint feature recognition model, and voiceprint feature extraction is performed on each of them with the activation functions of the hidden layers in the model to obtain the first voiceprint feature of the sound-emitting object to be recognized and the second voiceprint feature of the target sound-emitting object, so that whether the two objects match can be determined quickly and accurately from the similarity between the two voiceprint features.
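The overall matching flow of steps S410 to S440 can be sketched as follows. This is a minimal sketch: extract_speech_vector and voiceprint_model stand for the components described in this disclosure, the threshold value is a placeholder, and cosine similarity (cf. FIG. 10) is used here as one possible similarity measure.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two voiceprint feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def matches_target(first_speech_data, second_speech_data,
                   extract_speech_vector, voiceprint_model,
                   similarity_threshold: float = 0.7) -> bool:
    # S410: extract speech vectors (e.g. MFCC) from both utterances.
    first_vec = extract_speech_vector(first_speech_data)
    second_vec = extract_speech_vector(second_speech_data)
    # S420: extract voiceprint features with the voiceprint feature recognition model.
    first_feat = voiceprint_model(first_vec)
    second_feat = voiceprint_model(second_vec)
    # S430: compute the similarity between the two voiceprint features.
    similarity = cosine_similarity(first_feat, second_feat)
    # S440: the sound-emitting object to be recognized matches the target
    # sound-emitting object if the similarity reaches the threshold.
    return similarity >= similarity_threshold
```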
Specific implementations of the above steps are described below.
First, S410 is introduced.
A first speech vector is extracted from first speech data of a sound-emitting object to be recognized, and a second speech vector is extracted from second speech data of a target sound-emitting object.
As an implementation manner of the present disclosure, in order to improve the stability of the speech vector, the step of extracting the first speech vector from the first speech data of the utterance object to be recognized may specifically include the following steps: determining a first voice signal according to an audio frame corresponding to the first voice data and a preset window function; performing fast Fourier transform on the first voice signal to obtain a frequency spectrum signal of an audio frame; filtering the frequency spectrum signal to obtain a filtered frequency spectrum signal; and performing discrete cosine transform on the filtered frequency spectrum signal to obtain a first voice vector.
First, in the step of determining the first speech signal according to the audio frame corresponding to the first speech data and a preset window function, the audio signal of each audio frame may be multiplied by a smooth window function to form a windowed speech signal, i.e., the first speech signal. In this way the two ends of each audio frame decay smoothly to zero, the audio signal becomes more continuous, the subsequent short-time Fourier transform works better, and a higher-quality spectrum is obtained. Specifically, a window function whose width equals the frame length is selected for each frame; commonly used window functions include the rectangular window, the Hamming window and the Gaussian window.
Second, in the step of performing a fast Fourier transform on the first speech signal to obtain the spectrum signal of the audio frame, the fast Fourier transform is applied to the first speech signal to obtain the spectrum signal of the audio frame, that is, a discrete frequency-band spectrum signal is extracted from the first speech signal. The spectrum signal is then filtered to obtain a filtered spectrum signal.
Finally, in the step of performing a discrete cosine transform on the filtered spectrum signal to obtain the first speech vector, the first speech vector (the cepstral coefficients) is obtained by the discrete cosine transform. The cepstrum can be regarded as the spectrum of the logarithm of a spectrum signal: the spectrum transforms a time-domain signal into a frequency-domain signal, and the cepstrum transforms the frequency-domain signal back into the time domain. In terms of waveform, the cepstrum resembles the spectrum: if the spectrum has a peak at low frequencies, the cepstrum has a peak at low cepstral coefficients, and if the spectrum has a peak at high frequencies, the cepstrum has a peak at high cepstral coefficients. The advantage of cepstral coefficients is that the variations of different coefficients are uncorrelated, which greatly reduces the number of parameters in subsequent model training.
Therefore, this processing reduces the influence of the recording equipment and the acquisition environment on the signal, making it more uniform and smooth and providing a more stable signal source (namely, the first speech vector) for the subsequent extraction of voiceprint features.
A specific implementation of extracting the MFCC features provided by the embodiments of the present disclosure is described below with reference to fig. 5.
S321, an audio signal (i.e., first voice data) is input.
And S322, pre-emphasis processing is carried out on the audio signal to obtain the pre-emphasized audio signal.
For a speech signal, the energy of the low-frequency band is large and the energy is mainly distributed in the low-frequency band, while the power spectral density of speech decreases as the frequency increases, which attenuates the high-frequency transmission and affects signal quality. It is therefore necessary to pre-emphasize the audio signal, i.e., to boost its high-frequency part. Emphasizing the high-frequency part of the speech removes the influence of lip radiation and increases the high-frequency resolution of the speech, flattening the spectrum of the signal.
And S323, determining a first voice signal according to the pre-emphasized audio signal and a preset window function.
In this step, two steps of framing and windowing may be included. This is because framing and windowing are actually one continuous operation. In real life, an audio signal is generally a non-stationary signal, but over a period of time, the signal may be considered stationary, i.e., the audio signal has short-term stationarity (e.g., a speech signal may be considered approximately constant within 10-30 ms). This allows the pre-emphasized audio signal to be divided into short segments for processing, i.e. framing.
As shown in fig. 6, there is an overlapping portion between the t-th frame and the (t+1)-th frame, referred to as the frame shift; by keeping this overlap region (frame shift) between two adjacent frames, signal points between the frames are not missed.
Optionally, a Hamming window is used as the default window function, and each audio frame is multiplied by the Hamming window, which increases the continuity of the left and right ends of the audio frame, i.e., windowing.
Specifically, let the framed audio signal be S(n), where n = 0, 1, ..., N-1 and N is the frame length, and let the Hamming window be W(n). The first speech signal S'(n) is determined from the framed audio signal S(n) and the Hamming window W(n) by the following formula: S'(n) = S(n) × W(n). Windowing enhances the continuity of the left and right ends of each frame, making the overall audio signal more continuous.
And S324, carrying out short-time Fourier transform on the first voice signal to obtain a frequency spectrum signal.
The Short-Time Fourier Transform (STFT) is a mathematical transform related to the Fourier transform that is used to determine the frequency and phase of local sections of a time-varying signal. After windowing, a short-time Fourier transform is applied to each frame of the first speech signal to obtain the energy distribution over the spectrum, from which the characteristics of the signal can be observed simply and intuitively.
And S325, performing modulus calculation on the frequency spectrum signal to obtain the frequency spectrum signal after the modulus calculation.
The modulus of a complex number is the positive square root of the sum of the squares of its real and imaginary parts; the modulo computation takes this value for each spectral coefficient.
S326, filtering the spectrum signal after the modulo calculation to obtain a filtered spectrum signal.
After the short-time Fourier transform, the spectrum is passed through a Mel filter bank to obtain a Mel spectrum, which accurately reflects the auditory characteristics of the human ear; that is, the linear frequency scale is converted to the Mel frequency scale.
S327, performing a logarithm processing on the filtered spectrum signal to obtain a spectrum signal after the logarithm processing.
The signal produced by the human vocal system is the convolution of the pitch (excitation) information s and the vocal-tract information v, i.e., s * v. After the Fourier transform, the convolution becomes a multiplication, namely FFT(s) × FFT(v). After taking the logarithm, the multiplication becomes an addition, namely log(FFT(s)) + log(FFT(v)). The convolved signal is thus converted into an additive signal, which is why the STFT and the logarithm are applied.
And S328, performing discrete cosine transform on the frequency spectrum signal subjected to logarithm processing to obtain a first voice vector.
The Discrete Cosine Transform (DCT) is a transform related to the Fourier transform; it is similar to the discrete Fourier transform but uses only real numbers. This completes the preprocessing of the speech information and produces the first speech vector.
S329, the first speech vector is output.
The same process is used during training to extract the fifth speech vector (the MFCC) from the third speech data; the silence intervals are then removed from the fifth speech vector to obtain the third speech vector (the MFCC after VAD processing).
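The pre-emphasis, framing, windowing, STFT, modulus, Mel filtering, logarithm and DCT steps of fig. 5 can be sketched as follows. This is a minimal sketch, not the exact implementation of the disclosure: the sampling rate, 25 ms frame length, 10 ms frame shift and filter-bank size are assumed values, and librosa is used only to build the Mel filter bank.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(signal: np.ndarray, sr: int = 16000, n_mfcc: int = 13,
                  frame_len: float = 0.025, frame_shift: float = 0.010,
                  n_fft: int = 512, n_mels: int = 26) -> np.ndarray:
    # S322: pre-emphasis boosts the high-frequency part of the signal.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # S323: framing with overlap (frame shift) and Hamming windowing,
    # i.e. S'(n) = S(n) * W(n) for every frame (assumes the signal is
    # at least one frame long).
    flen, fshift = int(sr * frame_len), int(sr * frame_shift)
    n_frames = 1 + (len(emphasized) - flen) // fshift
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)

    # S324/S325: short-time Fourier transform and modulus (magnitude spectrum).
    mag = np.abs(np.fft.rfft(frames, n_fft, axis=1))

    # S326: Mel filter bank converts the linear spectrum to the Mel scale.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = mag @ mel_fb.T

    # S327: the logarithm turns the convolved excitation/vocal-tract signal
    # into an additive one.
    log_mel = np.log(mel_spec + 1e-10)

    # S328: the discrete cosine transform yields the cepstral coefficients (MFCC).
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```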
The MFCC obtained through the above process accurately represents the spectral features of the speech data. The third voiceprint feature can then be extracted from the third speech vector, and the model is trained using the third voiceprint feature and its corresponding target identification information, yielding the voiceprint feature recognition model of the present disclosure, namely the mini-TDNN.
Because the number of layers of the mini-TDNN is small and the number of hidden neurons is small, the speed of extracting the features is high, and the third voiceprint features can be well extracted from the third voice vector under the condition that the number of audio frames corresponding to the third voice vector is small.
The implementation manner of extracting the second speech vector from the second speech data of the target utterance object can be obtained in the same manner, and is not described herein again.
Next, S420 is introduced.
In some embodiments of the present disclosure, before S420, the following steps may be further included:
acquiring a third voice vector of the first sound-emitting object and corresponding target identification information thereof; determining a fourth voice vector according to a preset time delay parameter and third voice vectors, wherein the preset time delay parameter is a time delay parameter of a voiceprint feature recognition model, and each third voice vector corresponds to one frame of audio; determining a target voice vector according to the third voice vector and the fourth voice vector; inputting the target voice vector into a voiceprint feature recognition model so that a first hidden layer of the voiceprint feature recognition model utilizes an activation function to perform voiceprint feature extraction on the target voice vector to obtain a third voiceprint feature; and training a voiceprint feature recognition model according to the third voiceprint feature and the corresponding target identification information.
The third speech data may be speech data acquired by the server from a video file or an audio file; when acquiring the speech data, the identification information of the sound-emitting object corresponding to it must also be acquired. For example, if the third speech data A records the sound of sound-emitting object A and the third speech data B records the sound of sound-emitting object B, then the third speech data of object A is obtained together with its target identification information "A", and the third speech data of object B together with its target identification information "B".
The third speech data is a speech signal, and the time-domain waveform of speech only represents how the sound pressure changes with time; it does not represent the characteristics of the speech well, so the speech signal needs to be converted into an acoustic feature vector. Examples of acoustic feature vectors include Mel-frequency cepstral coefficients and linear prediction cepstral coefficients.
Optionally, the MFCC is used as the third speech vector. Before introducing the MFCC, the Mel frequency is introduced: it is defined based on the auditory characteristics of the human ear and has a non-linear correspondence with the linear frequency. The MFCC is the spectral feature computed using this relationship. The human auditory system is a special non-linear system whose sensitivity differs for signals of different frequencies. The MFCC takes human hearing into account: it maps the linear spectrum to the Mel non-linear spectrum based on auditory perception and then converts it to a cepstrum. By simulating the processing characteristics of human auditory perception, the recognition rate of speech can be improved.
As an implementation of the present disclosure, in order to eliminate the influence of silence intervals and improve the accuracy of voiceprint feature extraction, the following steps may be further included before the above step of obtaining the third speech vector of the first sound-emitting object and its corresponding target identification information:
determining a fifth speech vector according to the third speech data; performing voice endpoint detection on the fifth voice vector to obtain a target endpoint of the fifth voice vector; determining a mute interval vector in the fifth voice vector according to a target endpoint of the fifth voice vector; and removing the mute interval vector from the fifth voice vector to obtain a third voice vector.
Voice Activity Detection (VAD) accurately locates the start and end points of speech within noisy speech: because speech contains long silences, the silence must be separated from the actual speech. A fifth speech vector is determined from the third speech data, and voice endpoint detection is performed on the fifth speech vector to obtain its target endpoints. For example, the third speech data may contain a number of mute points (i.e., target endpoints) forming silence intervals that need to be removed from the original speech data; that is, the silence-interval vector in the fifth speech vector is determined from the target endpoints of the fifth speech vector, and the silence-interval vector is removed from the fifth speech vector to obtain the third speech vector.
Thus, by accurately locating the beginning and end points (target end points) of speech from speech with silence or noise, the silence or noise can be separated from the actual speech, the silence or noise can be removed, and the third speech vector without silence or noise can be retained.
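The disclosure does not fix a particular VAD algorithm. A minimal energy-based sketch of removing silence-interval frames from a per-frame feature sequence (the energy threshold and the frame layout are assumptions) could look like this:

```python
import numpy as np

def remove_silence(frames: np.ndarray, features: np.ndarray,
                   energy_ratio: float = 0.1) -> np.ndarray:
    """Drop feature rows whose frames fall inside silence intervals.

    frames   -- windowed audio frames, shape (n_frames, frame_len)
    features -- per-frame speech vectors (e.g. MFCC), shape (n_frames, dim)
    """
    # Frame energy; frames well below the average energy are treated as silence.
    energy = np.sum(frames ** 2, axis=1)
    threshold = energy_ratio * np.mean(energy)
    voiced = energy > threshold      # frames between the detected endpoints
    return features[voiced]          # e.g. the third speech vector after VAD
```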
A specific implementation of performing voiceprint feature extraction with the activation functions of the hidden layers in the voiceprint feature recognition model, illustrated here for the training process of the model, is described below with reference to fig. 7.
As shown in fig. 7, the method may specifically include the following steps:
s332, determining a fourth voice vector according to the preset time delay parameter and the third voice vectors, wherein the preset time delay parameter is a time delay parameter of a voiceprint feature recognition model, and each third voice vector corresponds to one frame of audio.
The step of determining the fourth speech vector according to the preset delay parameter and the third speech vector may specifically include the following steps: determining, according to the preset delay parameter, at least one audio frame separated by the preset delay from the audio frame corresponding to the third speech vector; and determining the speech vector corresponding to the at least one audio frame as the fourth speech vector.
Illustratively, if the third speech vector corresponds to the t-th audio frame and the preset delay parameters are -2, -1, +1 and +2, the fourth speech vectors may include the speech vectors corresponding to the (t-2)-th, (t-1)-th, (t+1)-th and (t+2)-th audio frames.
And S334, determining a target voice vector according to the third voice vector and the fourth voice vector.
Correspondingly, the target speech vector may be formed by splicing the third speech vector and the fourth speech vectors, that is, the speech vectors corresponding to the (t-2)-th, (t-1)-th, t-th, (t+1)-th and (t+2)-th audio frames.
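Steps S332 and S334 amount to splicing each frame's speech vector with the speech vectors at the preset delays. A sketch, assuming the delay parameters -2, -1, +1, +2 of the example above and clamping at the utterance boundaries:

```python
import numpy as np

def splice_context(mfcc: np.ndarray, delays=(-2, -1, 0, 1, 2)) -> np.ndarray:
    """Build target speech vectors by concatenating each frame's speech
    vector with those at the preset delays (edge frames are clamped)."""
    n_frames, dim = mfcc.shape
    spliced = []
    for t in range(n_frames):
        # Indices of the current frame and its delayed neighbours.
        idx = np.clip(np.array(delays) + t, 0, n_frames - 1)
        spliced.append(mfcc[idx].reshape(-1))   # e.g. 5 x 13 = 65-dim vector
    return np.stack(spliced)
```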
S336, inputting the target voice vector into the voiceprint feature recognition model, so that the first hidden layer of the voiceprint feature recognition model utilizes the activation function to extract the voiceprint feature of the target voice vector, and a first extraction result is obtained.
The input of the voiceprint feature recognition model is the target speech vector (e.g., MFCC) extracted from a preset number of audio frames; the model takes a fixed number of frames of speech vectors as input, where one frame of speech vector corresponds to one frame of audio.
The activation function may be the Rectified Linear Unit (ReLU). The ReLU sets the output of some hidden neurons to 0, which makes the neural network sparser and reduces the interdependence of parameters; in other words, the ReLU introduces non-linearity between the hidden layers of the network and alleviates the overfitting problem. A model made sparse by the ReLU can better mine the relevant features and fit the training data.
Optionally, in order to make the input of each hidden layer have the same distribution as much as possible, during the processing of the target speech vector by the activation function ReLU, further processing may be performed by BatchNorm.
For each hidden-layer neuron, BatchNorm pulls the input of the non-linear transformation back into a region where the function is sensitive to its input, thereby avoiding the vanishing-gradient problem. The vanishing-gradient problem arises when the gradient is smaller than 1: the error between the predicted value and the true value is attenuated at every layer of propagation, resulting in poor convergence. Because the gradients of a network processed with BatchNorm remain comparatively large, the network parameters are adjusted more efficiently, which improves the training speed and accelerates convergence.
Therefore, the convergence rate of the model can be increased, the problem of gradient disappearance in a deep network can be relieved, and the trained deep network model is more stable.
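A single hidden layer combining an affine transform, the ReLU activation and BatchNorm, as described above, might be sketched in PyTorch as follows (the layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class HiddenLayer(nn.Module):
    """One TDNN-style hidden layer: linear transform over the spliced input,
    ReLU activation (sparsifies the outputs), then BatchNorm so that the
    inputs to the next layer keep a similar distribution."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.relu = nn.ReLU()
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim) spliced speech vectors
        return self.bn(self.relu(self.linear(x)))
```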
And S338, taking the first extraction result as the input of the hidden layer next to the first hidden layer, performing voiceprint feature extraction on the first extraction result by using the activation function of the hidden layer next to the first hidden layer to obtain a second extraction result, and so on until the preset hidden layer of the voiceprint feature recognition model is reached, and outputting a third voiceprint feature corresponding to the third speech vector by the preset hidden layer.
The first hidden layer comprises a first number of hidden neurons, and the preset hidden layer comprises a second number of hidden neurons.
The first extraction result output by the first hidden layer is used as the input of the next hidden layer (the second hidden layer), whose activation function performs voiceprint feature extraction on it to obtain a second extraction result; the second extraction result output by the second hidden layer is used as the input of the third hidden layer, and so on, until the preset hidden layer of the mini-TDNN is reached, which outputs the third voiceprint feature corresponding to the third speech vector. Each hidden layer has a preset number of hidden neurons.
A specific implementation of extracting voiceprint features from a speech vector by using a voiceprint feature recognition model according to an embodiment of the present disclosure is described below with reference to fig. 8.
Firstly, MFCC corresponding to an audio frame with a preset duration is used as the input of a voiceprint feature recognition model. Secondly, the first hidden layer extracts a first extraction result from the MFCC corresponding to (t-2, t-1, t, t +1, t +2), and inputs the first extraction result into the second hidden layer. Next, the second hidden layer extracts its output at (t-2, t, t +2) from the first extraction result, and inputs the second extraction result to the third hidden layer as the second extraction result.
Then, the third hidden layer extracts its output at (t-3, t, t +3) from the second extraction result, and inputs the third extraction result as a third extraction result into the fourth hidden layer. Then, the fourth hidden layer extracts its output at t from the third extraction result as a fourth extraction result, and inputs the fourth extraction result into the fifth hidden layer. Then, the fifth hidden layer extracts its output at t from the fourth extraction result as a fifth extraction result.
Then, the sixth hidden layer extracts its output at t from the fifth extraction result as a sixth extraction result. Finally, the seventh hidden layer extracts its output at t from the sixth extraction result as a seventh extraction result. The number of the hidden neurons of the first hidden layer, the second hidden layer, the third hidden layer and the fourth hidden layer is 256, the number of the hidden neurons of the fifth hidden layer is 750, and the number of the hidden neurons of the sixth hidden layer and the seventh hidden layer is 512.
The number of layers of the voiceprint feature recognition model is small, the number of hidden neurons is small, so that the feature extraction speed is high, and the third voiceprint feature can be well extracted from the third voice vector under the condition that the number of audio frames corresponding to the third voice vector is small.
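Putting together the frame contexts and layer sizes described above, a mini-TDNN of the kind shown in fig. 8 can be sketched with dilated 1-D convolutions. This is a sketch under stated assumptions, not the exact network of the disclosure: statistics pooling (a time average) is assumed to collapse the frame axis before the sixth hidden layer, and the speaker-classification head corresponds to the DNN discussed below.

```python
import torch
import torch.nn as nn

class MiniTDNN(nn.Module):
    """TDNN over an MFCC sequence; the sixth hidden layer output is used
    as the x-vector style embedding."""

    def __init__(self, feat_dim: int = 13, num_speakers: int = 1000):
        super().__init__()
        def block(cin, cout, k, d):
            # Conv1d with dilation implements the frame context of a TDNN layer.
            return nn.Sequential(nn.Conv1d(cin, cout, kernel_size=k, dilation=d),
                                 nn.ReLU(), nn.BatchNorm1d(cout))
        self.frame_layers = nn.Sequential(
            block(feat_dim, 256, 5, 1),   # layer 1: frames (t-2 .. t+2)
            block(256, 256, 3, 2),        # layer 2: frames (t-2, t, t+2)
            block(256, 256, 3, 3),        # layer 3: frames (t-3, t, t+3)
            block(256, 256, 1, 1),        # layer 4: frame t
            block(256, 750, 1, 1),        # layer 5: frame t
        )
        self.hidden6 = nn.Sequential(nn.Linear(750, 512), nn.ReLU(), nn.BatchNorm1d(512))
        self.hidden7 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.BatchNorm1d(512))
        self.classifier = nn.Linear(512, num_speakers)   # DNN head used only for training

    def embed(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_frames, feat_dim); Conv1d expects (batch, feat_dim, n_frames).
        # The utterance must be longer than the total frame context of layers 1-3.
        h = self.frame_layers(mfcc.transpose(1, 2))
        pooled = h.mean(dim=2)            # statistics pooling over time -> (batch, 750)
        return self.hidden6(pooled)       # sixth hidden layer output = embedding

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.hidden7(self.embed(mfcc)))
```

The dilated convolutions reproduce the sub-sampled contexts of fig. 8 without splicing the frames explicitly.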
S340, training the voiceprint feature recognition model according to the third voiceprint feature and its corresponding target identification information.
Optionally, the output of the result of the sixth hidden layer is selected as the third voiceprint feature.
Optionally, a DNN may be used as the deep neural network. The DNN model is a deep-learning framework whose structure mainly includes an input layer, multiple hidden layers and an output layer; generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The extracted third voiceprint feature is input into the input layer of the DNN, and the DNN adjusts its training parameters according to the first identification information determined from the third voiceprint feature and the target identification information, until the first identification information and the target identification information meet the training stop condition, yielding the voiceprint feature recognition model. The parameters of the DNN model adjusted in this procedure mainly include the weights of the linear transformations connecting the layers of the DNN model.
It will be appreciated that the process shown in figure 8 may be used to represent two phases.
The first stage is training of the neural network. Since the neural network is a combination of a feature extractor and a classifier and each layer has a strong feature-extraction capability, the output of the sixth hidden layer is used as the embedding (i.e., the feature vector) of the input audio frames.
The second stage is feature extraction: the trained classification network (i.e., the DNN) can be removed, and the remaining structure is used to derive an embedding for each segment of speech. An embedding represents information with a low-dimensional vector, with the property that vectors that are close to each other correspond to information with similar meaning.
Optionally, the x-vector is used to denote the voiceprint features (i.e., the embeddings) extracted from the speech vectors by the mini-TDNN.
As shown in fig. 8, the output of the sixth hidden layer of the voiceprint feature recognition model can be used as an embedding xvector feature.
Before obtaining the voiceprint feature recognition model referred to in the present disclosure, namely the mini-TDNN, in order to improve the extraction accuracy of the mini-TDNN, as shown in fig. 8, a neural network (e.g., DNN) may be added to the last hidden layer of the mini-TDNN, so that the output of the mini-TDNN is used as the input of the DNN. The network comprising the mini-TDNN and the DNN can be used as the voiceprint feature recognition model, after the training of the voiceprint feature recognition model is completed, the trained DNN can be removed, the residual mini-TDNN has high-precision feature extraction capability, and voiceprint features can be extracted from voice vectors.
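Continuing the MiniTDNN sketch above, the two phases might look as follows; the cross-entropy training loop is an illustration rather than the disclosure's exact training procedure:

```python
import torch
import torch.nn as nn

# Training phase: the classifier head maps embeddings to speaker identities
# (the target identification information) and is optimized with cross-entropy.
model = MiniTDNN(feat_dim=13, num_speakers=1000)   # MiniTDNN from the sketch above
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(mfcc_batch: torch.Tensor, speaker_ids: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = criterion(model(mfcc_batch), speaker_ids)
    loss.backward()
    optimizer.step()
    return loss.item()

# Feature-extraction phase: the classification head is no longer used; the
# sixth hidden layer output serves as the x-vector embedding of an utterance.
@torch.no_grad()
def extract_xvector(mfcc_utterance: torch.Tensor) -> torch.Tensor:
    model.eval()
    return model.embed(mfcc_utterance.unsqueeze(0)).squeeze(0)
```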
Then, S430 is introduced.
In some embodiments of the present disclosure, a plurality of third voiceprint features of the first sound-emitting object are averaged to determine a fourth voiceprint feature; the fourth voiceprint feature is preprocessed to determine the calculation parameters of the PLDA algorithm; the PLDA algorithm is adjusted with these calculation parameters to determine a first PLDA algorithm; and the similarity between the first voiceprint feature and the second voiceprint feature is calculated with the first PLDA algorithm.
Before introducing the calculation parameters of the Probabilistic Linear Discriminant Analysis (PLDA) algorithm, its principle is introduced. PLDA is a channel compensation algorithm: channel information interferes with speaker recognition and can even seriously affect the recognition accuracy of the system, so this influence needs to be minimized, i.e., channel compensation is applied to the voiceprint features.
Channel compensation is embodied in the following process. In the field of voiceprint recognition, assume the training speech consists of the speech of I speakers, each of whom has J different speech segments. The j-th speech segment of the i-th speaker is denoted x_ij, and according to factor analysis its generative model is defined as: x_ij = μ + F·h_i + G·w_ij + ε_ij. The model can be viewed as two parts. The first two terms on the right of the equals sign depend only on the speaker, not on a specific utterance of that speaker; they are called the signal part and describe the differences between speakers. The last two terms on the right describe the differences between different utterances of the same speaker and are called the noise part. The data structure of a speech segment can thus be described by these two latent variables. The matrices F and G contain the basic factors of the respective latent spaces and can be regarded as the eigenvectors of those spaces: each column of F corresponds to a feature vector of the between-class space, and each column of G corresponds to a feature vector of the within-class space. The vectors h_i and w_ij can be regarded as the feature representations in the respective spaces; for example, h_i can be regarded as the representation of x_ij in the speaker space.
In this way, when the first similarity between the first and second voiceprint features is calculated by the PLDA algorithm, the greater the likelihood that the two features come from the same speaker, the more confidently it can be determined that the sound-emitting object to be recognized matches the target sound-emitting object.
In the embodiment of the present disclosure, a first similarity between the first voiceprint feature and the second voiceprint feature may be calculated through a PLDA algorithm, and if the first similarity is greater than or equal to a first similarity threshold, it is determined that the to-be-recognized sound generating object matches the target sound generating object, that is, it may be determined that the first voiceprint feature belongs to the target sound generating object. If the first similarity is smaller than the first similarity threshold, it is determined that the sound-emitting object to be recognized is not matched with the target sound-emitting object, and it is determined that the first voiceprint feature does not belong to the target sound-emitting object. In the method, the PLDA is used as a channel compensation algorithm, so that the channel compensation capability is better, and the reliability of the voiceprint recognition result is improved.
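The disclosure does not spell out the PLDA scoring formulas. Under a simplified two-covariance reading of the generative model above (a between-class covariance Phi_b from the F term, a within-class covariance Phi_w from the G and epsilon terms, mean-centred features), the log-likelihood ratio between the same-speaker and different-speaker hypotheses can be sketched as follows; the covariance matrices are assumed to have been estimated from training data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1: np.ndarray, x2: np.ndarray,
             phi_b: np.ndarray, phi_w: np.ndarray) -> float:
    """Log-likelihood ratio of 'same speaker' vs 'different speakers' for two
    mean-centred voiceprint features, under a two-covariance PLDA model."""
    d = len(x1)
    tot = phi_b + phi_w
    pair = np.concatenate([x1, x2])
    # Same speaker: the two features share the speaker factor, so their
    # cross-covariance is the between-class covariance phi_b.
    cov_same = np.block([[tot, phi_b], [phi_b, tot]])
    # Different speakers: the two features are independent.
    cov_diff = np.block([[tot, np.zeros((d, d))], [np.zeros((d, d)), tot]])
    return (multivariate_normal.logpdf(pair, mean=np.zeros(2 * d), cov=cov_same)
            - multivariate_normal.logpdf(pair, mean=np.zeros(2 * d), cov=cov_diff))

# A match is declared when the score reaches a threshold tuned on development data.
```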
The step of preprocessing the fourth voiceprint feature and determining the calculation parameters of the PLDA algorithm may specifically include the following steps:
performing linear discrimination analysis processing on the fourth voiceprint feature to determine a fifth voiceprint feature; carrying out mean value normalization processing on the fifth voiceprint characteristic to determine a sixth voiceprint characteristic; carrying out length normalization processing on the sixth voiceprint characteristic to obtain a seventh voiceprint characteristic; and determining the calculation parameters of the PLDA algorithm according to the seventh voiceprint characteristic.
First, in the step of performing Linear Discriminant Analysis (LDA) processing on the fourth voiceprint feature to determine the fifth voiceprint feature, the step may specifically be: performing LDA processing on the fourth voiceprint feature. LDA can be understood as a dimensionality reduction method that tries to remove unwanted classification directions, that is, to maximize the between-class distance and minimize the within-class distance. In speaker recognition, most cases reduce to a binary problem, so the effect of LDA here is to reduce the original high-dimensional feature data to one dimension. When a speaker has many speech segments and those segments are affected by the channel, the variance of that speaker's features appears large. LDA therefore tries to find a new direction onto which all the original data are projected, such that data from the same speaker have the smallest within-class variance in that direction, while the distance between different speakers is as large as possible. In this way, the channel difference is reduced.
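As an illustration of this projection step (a sketch only, assuming scikit-learn; the training matrix X_train, the speaker labels y_train and the chosen number of components are hypothetical stand-ins, not values from the disclosure):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# toy stand-ins: 200 embeddings of dimension 128 from 20 speakers (10 segments each)
X_train = np.random.randn(200, 128)
y_train = np.repeat(np.arange(20), 10)

lda = LinearDiscriminantAnalysis(n_components=19)   # at most n_classes - 1 directions
lda.fit(X_train, y_train)

X_train_lda = lda.transform(X_train)                # channel-compensated training features
# the same fitted projection would later be applied to test embeddings,
# e.g. lda.transform(x_test.reshape(1, -1))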
When LDA is used to process the test data (the first voiceprint feature) and the x-vector output by the voiceprint feature recognition model (the target voiceprint feature), channel compensation can be applied to the first voiceprint feature and the target voiceprint feature, which improves the accuracy of the subsequent sound-emitting object recognition result.
Secondly, in the step of performing mean normalization processing on the fifth voiceprint feature to determine the sixth voiceprint feature, the step may specifically be: performing mean normalization (Mean norm) processing on the fifth voiceprint feature to determine the sixth voiceprint feature.
Since the magnitudes of the components of the fifth voiceprint feature may differ, and the magnitude levels between features may vary greatly, mean normalization needs to be performed on the original data.
Specifically, the normalization may use the following formula: x' = (x - mean(x)) / (max(x) - min(x)). The mean normalization converts all the original data into dimensionless mapped evaluation values (namely, the sixth voiceprint feature), so that the original data are on the same order of magnitude as far as possible. This further ensures the accuracy of the similarity calculated between voiceprint features and the reliability of the sound-emitting object recognition result.
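A minimal sketch of this normalization, assuming NumPy and applying the formula independently to each dimension of a feature matrix (the function name and the toy values are illustrative):

import numpy as np

def mean_normalize(x):
    # x' = (x - mean(x)) / (max(x) - min(x)), applied per feature dimension
    return (x - x.mean(axis=0)) / (x.max(axis=0) - x.min(axis=0))

features = np.array([[1.0, 200.0], [3.0, 600.0], [5.0, 1000.0]])
print(mean_normalize(features))   # each column now lies in [-0.5, 0.5]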
Then, in the step of performing length normalization processing on the sixth voiceprint feature to obtain the seventh voiceprint feature, the step may specifically be: if the length of the sixth voiceprint feature is greater than the first length, determining the difference between the length of the sixth voiceprint feature and the first length and removing that excess portion, so that the length of the sixth voiceprint feature after length normalization equals the first length; if the length of the sixth voiceprint feature is less than the first length, determining the difference between the first length and the length of the sixth voiceprint feature and filling in the missing portion.
Illustratively, if the length of the sixth voiceprint feature is 20 s, it is trimmed to 10 s; if the length of the sixth voiceprint feature is 5 s, a blank voiceprint feature may be appended to make its length equal to 10 s, or the original voiceprint may be copied and appended to the sixth voiceprint feature to make its length equal to 10 s.
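A minimal sketch of this trim-or-pad step, assuming NumPy and a feature stored as a sequence of frame vectors (the target length and padding choices are illustrative; zero frames stand in for the "blank voiceprint feature" above, and tiling stands in for copying the original):

import numpy as np

def length_normalize(frames, target_len, pad_mode="zero"):
    # frames: (n_frames, dim); truncate if too long, pad if too short
    n = frames.shape[0]
    if n >= target_len:
        return frames[:target_len]                       # remove the excess portion
    if pad_mode == "zero":                               # append blank (zero) frames
        pad = np.zeros((target_len - n, frames.shape[1]))
        return np.vstack([frames, pad])
    reps = int(np.ceil(target_len / n))                  # or repeat the original frames
    return np.tile(frames, (reps, 1))[:target_len]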
Finally, in the step of determining the calculation parameters of the PLDA algorithm according to the seventh voiceprint feature, the step may specifically be: the seventh voiceprint feature, obtained through linear discriminant analysis, mean normalization and length normalization, is well suited to similarity calculation. According to the length parameter and the variance parameter of the seventh voiceprint feature, the voiceprint features subsequently input into the PLDA algorithm can be normalized, so that their orders of magnitude are unified and the similarity between voiceprint features can be calculated conveniently.
The step of calculating the first similarity between the first voiceprint feature and the second voiceprint feature by using the PLDA algorithm may specifically include the following steps: adjusting the PLDA algorithm by using the calculation parameters, and determining a first PLDA algorithm; a first similarity between the first and second voiceprint features is calculated using a first PLDA algorithm. The overall process described above can be seen in fig. 9.
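Although the exact adjustment of the PLDA calculation parameters is internal to the algorithm, the verification score itself is typically a log-likelihood ratio between the same-speaker and different-speaker hypotheses. The following is a simplified two-covariance sketch of such a score, assuming NumPy/SciPy and that a between-class covariance Phi_b and a within-class covariance Phi_w have already been estimated from the preprocessed training features; it is an illustration of PLDA-style scoring, not the exact computation used in the disclosure:

import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, phi_b, phi_w):
    # log p(x1, x2 | same speaker) - log p(x1, x2 | different speakers)
    d = x1.shape[0]
    tot = phi_b + phi_w
    same = np.block([[tot, phi_b], [phi_b, tot]])        # x1 and x2 share the speaker factor
    diff = np.block([[tot, np.zeros((d, d))], [np.zeros((d, d)), tot]])
    pair = np.concatenate([x1, x2])
    return (multivariate_normal.logpdf(pair, mean=np.zeros(2 * d), cov=same)
            - multivariate_normal.logpdf(pair, mean=np.zeros(2 * d), cov=diff))

# score = plda_llr(first_voiceprint, second_voiceprint, Phi_b, Phi_w)
# the sound-emitting objects are taken to match when score >= first_similarity_threshold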
Finally, S440 is introduced.
If the similarity is greater than or equal to the similarity threshold, it is determined that the sound-emitting object to be recognized matches the target sound-emitting object. A match between the sound-emitting object to be recognized and the target sound-emitting object indicates that they are the same sound-emitting object.
Therefore, the first voice vector (the first MFCC) and the second voice vector (the second MFCC), extracted from the first voice data of the sound-emitting object to be recognized and the second voice data of the target sound-emitting object, are input into the voiceprint feature recognition model to determine the first voiceprint feature of the sound-emitting object to be recognized and the second voiceprint feature of the target sound-emitting object. The first and second voiceprint features are then scored for similarity using PLDA, which achieves a high processing speed and a low Equal Error Rate (EER).
Here, the EER is the operating point at which the False Rejection Rate (FRR) equals the False Acceptance Rate (FAR); the common value of FAR and FRR at this point is referred to as the equal error rate.
In a classification problem, if two samples are of the same class (the same person) but are mistakenly judged by the system as different classes (not the same person), this is a false rejection; the FRR is the proportion of false rejections among all same-class matching cases. If two samples are of different classes (not the same person) but are mistakenly judged by the system as the same class (the same person), this is a false acceptance; the FAR is the proportion of false acceptances among all different-class matching cases.
When the voiceprint feature recognition model requires higher accuracy, the corresponding false acceptance rate is lower, but the false rejection rate may be higher. Conversely, if the model pursues better usability, that is, a high pass rate, the false acceptance rate is higher and the false rejection rate is lower. The smaller the equal error rate, the better the performance of the model.
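A minimal sketch of how the EER can be estimated from a set of same-speaker (genuine) scores and different-speaker (impostor) scores, assuming NumPy (the score arrays are illustrative toy data):

import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    # sweep candidate thresholds and find where FRR and FAR are closest
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])    # false rejections
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false acceptances
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

genuine = np.array([0.9, 0.8, 0.75, 0.6])
impostor = np.array([0.3, 0.4, 0.55, 0.7])
print(equal_error_rate(genuine, impostor))   # EER estimated on this toy data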
In addition, the present disclosure also provides a specific implementation of determining a similarity between the first voiceprint feature and the second voiceprint feature. The method specifically comprises the following steps:
and calculating the similarity between the first voiceprint feature and the second voiceprint feature by a cosine similarity algorithm.
For ease of introduction, the cosine similarity algorithm is described with reference to fig. 10, which is a schematic diagram of the cosine similarity algorithm provided in an embodiment of the present disclosure. As shown in the figure, the included angle θ between vector a and vector b is first obtained, and the cosine value cos θ of that angle is computed; this cosine value can be used to represent the similarity between the two vectors. The smaller the angle, the closer the cosine value is to 1, and the more similar vector a and vector b are.
If the cosine similarity (for example, 0.8) reaches the second similarity threshold (for example, 0.7), the sound-emitting object to be recognized matches the target sound-emitting object. If the cosine similarity (for example, 0.6) does not reach the second similarity threshold (for example, 0.7), the sound-emitting object to be recognized does not match the target sound-emitting object.
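A minimal sketch of this cosine score and the threshold decision, assuming NumPy (the feature values are illustrative, and 0.7 is the example threshold given above):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

first_voiceprint = np.array([0.2, 0.9, 0.4])
second_voiceprint = np.array([0.25, 0.85, 0.35])

score = cosine_similarity(first_voiceprint, second_voiceprint)
matched = score >= 0.7      # second similarity threshold from the example above
print(score, matched)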
The cosine similarity algorithm distinguishes differences between voiceprint features by direction, so it can correct the problem of inconsistent measurement scales between different voiceprint features and improve the reliability of the voiceprint recognition result.
In addition to determining the voiceprint recognition result by the PLDA algorithm and the cosine similarity algorithm described above, similarity detection may also be performed using the Euclidean distance, the Manhattan distance or the Pearson correlation coefficient.
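For completeness, minimal sketches of those alternative measures, assuming NumPy and the same illustrative vectors as above (note that the two distances decrease as similarity increases, so their thresholds are interpreted in the opposite direction):

import numpy as np

a = np.array([0.2, 0.9, 0.4])
b = np.array([0.25, 0.85, 0.35])

euclidean = np.linalg.norm(a - b)                  # L2 distance
manhattan = np.abs(a - b).sum()                    # L1 distance
pearson = np.corrcoef(a, b)[0, 1]                  # Pearson correlation coefficient

print(euclidean, manhattan, pearson)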
In summary, the first voice vector and the second voice vector, extracted from the first voice data of the sound-emitting object to be recognized and the second voice data of the target sound-emitting object, are input into the voiceprint feature recognition model to determine the first voiceprint feature of the sound-emitting object to be recognized and the second voiceprint feature of the target sound-emitting object. The voiceprint feature recognition model is obtained by training a mini time-delay neural network (mini-TDNN) on a plurality of first training samples; the mini-TDNN extracts voiceprint features quickly and accurately and can accurately represent the voiceprint features of a sound-emitting object, so the first voiceprint feature accurately represents the features of the sound-emitting object to be recognized and the second voiceprint feature accurately represents the features of the target sound-emitting object. Therefore, by determining the similarity between the first voiceprint feature and the second voiceprint feature, whether the sound-emitting object to be recognized matches the target sound-emitting object can be determined quickly and accurately.
The method for recognizing a sound-emitting object provided by the present disclosure is described below with reference to several specific scenarios.
First, the method for recognizing the sounding object provided by the present disclosure can be applied to the field of network platforms. With the development of the mobile internet, various network platforms develop rapidly, a large number of users can upload their own audios and videos on the network platform, and in some cases, the network platform needs to determine the sound production objects corresponding to the audios and videos.
For example, when the title or description of an audio/video on the network platform contains no keywords related to the target sound-emitting object, it is difficult for a platform user or manager to find the audio/video corresponding to the target sound-emitting object among the platform's massive audio/video content. As shown in fig. 11, a user or manager of the network platform wishes to search the platform's massive audio/video content for audios and videos containing the voice data of the singer "Wanglor".
At this time, first voice data of a plurality of objects (including "Wanglor" and small M, small P and small Q) and the second voice data of "Wanglor" stored in advance in the server may be acquired first. Then, a first voice vector is extracted from each piece of first voice data, and a second voice vector of "Wanglor" is extracted from the second voice data. The plurality of first voice vectors and the second voice vector are input into the voiceprint feature recognition model determined above, and the plurality of first voiceprint features and the second voiceprint feature of "Wanglor" are determined. Finally, a first similarity between each first voiceprint feature and the second voiceprint feature of "Wanglor" is calculated by the PLDA algorithm, and the sound-emitting object corresponding to the first voice vector with the maximum similarity value is determined to be "Wanglor".
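A minimal sketch of this selection step, assuming the voiceprint features have already been extracted and a scoring function such as the PLDA similarity is available (score_fn, candidate_features and target_feature are illustrative names, not identifiers from the disclosure):

import numpy as np

def best_matching_object(candidate_features, target_feature, score_fn):
    # score every candidate's first voiceprint feature against the target's second voiceprint feature
    scores = [score_fn(feat, target_feature) for feat in candidate_features]
    return int(np.argmax(scores)), max(scores)   # index of the most similar candidate and its score

# idx, score = best_matching_object(features_of_all_uploaders, target_singer_feature, plda_similarity)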
Therefore, with the sound-emitting object recognition method, the audio/video corresponding to the target sound-emitting object can be found among the massive audio/video content of the network platform, providing a better user experience for platform users and making it easier for platform managers to manage the massive data on the platform.
Secondly, the sound-emitting object recognition method provided by the present disclosure can be applied to the security field. In recent years, with the development of the internet, voice-related fraud cases have increased greatly. The sound-emitting object recognition method can serve as an effective technical detection means, for example against a fraudster A who frequently calls elderly people and defrauds them of their property.
At this time, first voice data of the suspect and second voice data of fraudster A may be acquired first. Then, a first voice vector of the suspect is extracted from the first voice data, and a second voice vector of fraudster A is extracted from the second voice data. The first voice vector and the second voice vector are input into the voiceprint feature recognition model determined above to determine a first voiceprint feature of the suspect and a second voiceprint feature of fraudster A. Finally, a first similarity between the first voiceprint feature and the second voiceprint feature is calculated by the PLDA algorithm; if the first similarity is greater than or equal to the first similarity threshold, it can essentially be determined that the suspect is fraudster A.
Therefore, with the sound-emitting object recognition method, criminal behaviors such as telecom fraud can be prevented quickly, accurately and effectively, helping to build a safe public environment.
Finally, the sound-emitting object recognition method can be applied to intelligent hardware interaction. In recent years, with the rapid development of smart homes, a large number of intelligent hardware products have emerged. However, many intelligent products can currently only recognize the content spoken by a user, but cannot distinguish the identity of the speaker, and therefore cannot meet users' personalized requirements. For example, the intelligent product B can extract valid instructions from the utterances of a sound-emitting object, but in a multi-person scene its ability to receive instructions is reduced; the sound-emitting object recognition method can be adopted to solve this problem.
For example, when the house of house owner C has other guests, the intelligent product B is expected to respond only to the voice information of house owner C and perform the corresponding operations. At this time, a plurality of pieces of first voice data of a plurality of persons (including house owner C and guests D, E and F) and second voice data previously stored by house owner C can first be acquired. Then, a first voice vector is extracted from each piece of first voice data, and a second voice vector of house owner C is extracted from the second voice data. The plurality of first voice vectors and the second voice vector are input into the voiceprint feature recognition model determined above, and the plurality of first voiceprint features and the second voiceprint feature of house owner C are determined. Finally, a first similarity between each first voiceprint feature and the second voiceprint feature of house owner C is calculated by the PLDA algorithm, and the sound-emitting object corresponding to the first voice vector with the maximum similarity value is determined to be house owner C.
Therefore, with the sound-emitting object recognition method, an intelligent product can distinguish different roles and achieve the goal of recognizing a person by voice. The intelligent product can then provide different content and services for each person, making human-computer interaction simpler and giving users a more relaxed and personalized experience.
Based on the above method for recognizing the sound-generating object, the present disclosure further provides a device for recognizing the sound-generating object, which is specifically described with reference to fig. 12.
Fig. 12 is a block diagram illustrating a sound-emitting object recognition apparatus according to an example embodiment. Referring to fig. 12, the sound-emitting object recognition apparatus 1200 may include an extraction module 1210, an input module 1220, a calculation module 1230 and a matching module 1240.
An extraction module 1210 configured to perform extracting a first speech vector from first speech data of a vocal object to be recognized and extracting a second speech vector from second speech data of a target vocal object.
The input module 1220 is configured to perform input of the first voice vector and the second voice vector into the voiceprint feature recognition model, perform voiceprint feature extraction on the first voice vector and the second voice vector by using an activation function of a hidden layer in the voiceprint feature recognition model, to obtain a first voiceprint feature of the vocal object to be recognized and a second voiceprint feature of the target vocal object, where the voiceprint feature recognition model includes a plurality of cascaded hidden layers, and the number of hidden layers and the number of hidden neurons in each hidden layer are determined according to the number of training samples.
A calculation module 1230 configured to perform calculating a similarity between the first and second voiceprint features.
And the matching module 1240 is configured to determine that the sound-generating object to be identified is matched with the target sound-generating object if the similarity is greater than or equal to the similarity threshold.
In some embodiments of the present disclosure, the sound generating object recognizing apparatus 1200 further includes:
and the acquisition module is configured to acquire the third voice vector of the first sound-emitting object and the corresponding target identification information thereof.
And the determining module is configured to determine a fourth voice vector according to preset time delay parameters and third voice vectors, wherein the preset time delay parameters are time delay parameters of the voiceprint feature recognition model, and each third voice vector corresponds to one frame of audio.
The determining module is further configured to perform determining a target speech vector from the third speech vector and the fourth speech vector.
And the input module is configured to input the target voice vector into the voiceprint feature recognition model, so that the first hidden layer of the voiceprint feature recognition model performs voiceprint feature extraction on the target voice vector by using the activation function to obtain a third voiceprint feature.
And the training module is configured to train the voiceprint feature recognition model according to the third voiceprint feature and the corresponding target identification information.
In some embodiments of the present disclosure, the determining module is further configured to determine at least one audio frame separated from the audio frame corresponding to the third speech vector by a preset time delay according to a preset time delay parameter.
The determining module is further configured to perform determining a speech vector corresponding to the at least one audio frame as a fourth speech vector.
In some embodiments of the present disclosure, the extraction module 1210 includes:
and the determining module is configured to execute the audio frame corresponding to the first voice data and a preset window function to determine the first voice signal.
The first transformation module is configured to perform fast Fourier transform on the first voice signal to obtain a spectrum signal of the audio frame.
And the filtering module is configured to perform filtering processing on the spectrum signal to obtain a filtered spectrum signal.
And the second transformation module is configured to perform discrete cosine transformation on the filtered spectrum signal to obtain a first voice vector.
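As an illustration of the windowing, fast Fourier transform, filtering and discrete cosine transform steps performed by these modules (a sketch only, assuming NumPy/SciPy, a single 25 ms frame at 16 kHz, a Hamming window as the preset window function and a simple triangular mel filter bank; all parameter values are illustrative, not values from the disclosure):

import numpy as np
from scipy.fftpack import dct

def frame_to_mfcc(frame, sr=16000, n_fft=512, n_filters=26, n_mfcc=13):
    windowed = frame * np.hamming(len(frame))                 # preset window function
    spectrum = np.abs(np.fft.rfft(windowed, n_fft)) ** 2      # power spectrum via FFT

    # build a simple triangular mel filter bank
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    mel_points = np.linspace(0, mel_max, n_filters + 2)
    hz_points = 700 * (10 ** (mel_points / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    filtered = np.log(spectrum @ fbank.T + 1e-10)             # filtered spectrum (log mel energies)
    return dct(filtered, type=2, norm="ortho")[:n_mfcc]       # discrete cosine transform -> MFCC vector

frame = np.random.randn(400)                                  # one 25 ms frame at 16 kHz
print(frame_to_mfcc(frame).shape)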
In some embodiments of the present disclosure, the calculation module 1230 is further configured to perform averaging calculation on a plurality of third voiceprint features of the first sound-emitting object to determine a fourth voiceprint feature.
The calculation module 1230 further includes:
and the preprocessing module is configured to execute preprocessing on the fourth voiceprint feature and determine the calculation parameters of the PLDA algorithm.
An adjustment module configured to perform an adjustment of the PLDA algorithm using the calculation parameters, determining a first PLDA algorithm.
A calculating module 1230 further configured to perform calculating a similarity between the first and second voiceprint features using a first PLDA algorithm.
In some embodiments of the present disclosure, the preprocessing module mentioned above includes:
and the determining module is configured to perform linear discriminant analysis processing on the fourth voiceprint features and determine fifth voiceprint features.
And the normalization module is configured to perform mean normalization processing on the fifth voiceprint features and determine sixth voiceprint features.
The normalization module is further configured to perform length normalization processing on the sixth voiceprint feature to obtain a seventh voiceprint feature.
A determination module further configured to perform determining a calculation parameter of the PLDA algorithm according to the seventh voiceprint feature.
In some embodiments of the present disclosure, the calculation module 1230 is further configured to perform calculating the similarity between the first voiceprint feature and the second voiceprint feature by a cosine similarity algorithm.
In the embodiment of the present disclosure, the sound-emitting object recognition apparatus 1200 can input the first voice vector and the second voice vector, extracted from the first voice data of the sound-emitting object to be recognized and the second voice data of the target sound-emitting object, into the voiceprint feature recognition model, and use the activation function of the hidden layers in the voiceprint feature recognition model to extract a first voiceprint feature of the sound-emitting object to be recognized and a second voiceprint feature of the target sound-emitting object. Because the voiceprint feature recognition model includes a plurality of cascaded hidden layers, and the number of hidden layers and the number of hidden neurons in each hidden layer are determined according to the number of training samples, the model extracts voiceprint features quickly and accurately and can accurately represent the voiceprint features of the sound-emitting objects, so that the first voiceprint feature accurately represents the features of the sound-emitting object to be recognized and the second voiceprint feature accurately represents the features of the target sound-emitting object. Therefore, by determining the similarity between the first voiceprint feature and the second voiceprint feature, whether the sound-emitting object to be recognized matches the target sound-emitting object can be determined quickly and accurately.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 13 is a block diagram illustrating a server in accordance with an exemplary embodiment. Referring to fig. 13, an embodiment of the present disclosure further provides a server including a processor 1310, a communication interface 1320, a memory 1330 and a communication bus 1340, wherein the processor 1310, the communication interface 1320 and the memory 1330 are in communication with each other through the communication bus 1340.
The memory 1330 is used for storing instructions executable by the processor 1310.
The processor 1310, when executing the instructions stored in the memory 1330, performs the following steps:
extracting a first voice vector from first voice data of a voice object to be recognized and extracting a second voice vector from second voice data of a target voice object; inputting the first voice vector and the second voice vector into a voiceprint feature recognition model, and respectively carrying out voiceprint feature extraction on the first voice vector and the second voice vector by utilizing an activation function of a hidden layer in the voiceprint feature recognition model to obtain a first voiceprint feature of a vocal object to be recognized and a second voiceprint feature of a target vocal object, wherein the voiceprint feature recognition model comprises a plurality of cascaded hidden layers, and the number of the hidden layers and the number of hidden neurons of each hidden layer are determined according to the number of training samples; calculating the similarity between the first voiceprint feature and the second voiceprint feature; and if the similarity is greater than or equal to the similarity threshold, determining that the sound-emitting object to be recognized is matched with the target sound-emitting object.
It can be seen that, by applying the embodiment of the present disclosure, the first voice vector and the second voice vector extracted from the first voice data of the sound-emitting object to be recognized and the second voice data of the target sound-emitting object can be input into the voiceprint feature recognition model, and the activation function of the hidden layers in the voiceprint feature recognition model can be used to extract a first voiceprint feature of the sound-emitting object to be recognized and a second voiceprint feature of the target sound-emitting object. Because the voiceprint feature recognition model includes a plurality of cascaded hidden layers, and the number of hidden layers and the number of hidden neurons in each hidden layer are determined according to the number of training samples, the model extracts voiceprint features quickly and accurately, so that the first voiceprint feature accurately represents the features of the sound-emitting object to be recognized and the second voiceprint feature accurately represents the features of the target sound-emitting object. Therefore, by determining the similarity between the first voiceprint feature and the second voiceprint feature, whether the sound-emitting object to be recognized matches the target sound-emitting object can be determined quickly and accurately.
FIG. 14 is a block diagram illustrating an apparatus for data processing according to an example embodiment. For example, the device 1400 may be provided as a server. Referring to FIG. 14, the server 1400 includes a processing component 1422 that further includes one or more processors and memory resources, represented by memory 1432, for storing instructions, such as applications, that are executable by the processing component 1422. The application programs stored in memory 1432 may include one or more modules each corresponding to a set of instructions. Further, the processing component 1422 is configured to execute instructions to perform the method for recognizing a spoken object as described in any of the embodiments above.
The device 1400 may also include a power component 1426 configured to perform power management of the device 1400, a wired or wireless network interface 1450 configured to connect the device 1400 to a network, and an input output (I/O) interface 1458. The device 1400 may operate based on an operating system stored in the memory 1432, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.
In some embodiments of the present disclosure, there is also provided a storage medium, wherein instructions when executed by a processor of a server enable the server to execute the method for recognizing a sound-emitting object according to any one of the above embodiments.
Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In some embodiments of the present disclosure, there is further provided a computer program product, wherein instructions of the computer program product, when executed by a processor of a server, enable the server to perform the method for recognizing a sound-emitting object according to any of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for recognizing a sound-producing object, comprising:
extracting a first voice vector from first voice data of a voice object to be recognized and extracting a second voice vector from second voice data of a target voice object;
inputting the first voice vector and the second voice vector into a voiceprint feature recognition model, and performing voiceprint feature extraction on the first voice vector and the second voice vector by using an activation function of a hidden layer in the voiceprint feature recognition model to obtain a first voiceprint feature of the vocal object to be recognized and a second voiceprint feature of the target vocal object, wherein the voiceprint feature recognition model comprises a plurality of cascaded hidden layers, and the number of the hidden layers and the number of hidden neurons of each hidden layer are determined according to the number of training samples;
calculating the similarity between the first voiceprint feature and the second voiceprint feature;
and if the similarity is greater than or equal to a similarity threshold value, determining that the sound-emitting object to be recognized is matched with the target sound-emitting object.
2. The method of claim 1, wherein prior to said inputting the first speech vector and the second speech vector into a voiceprint feature recognition model, the method further comprises:
acquiring a third voice vector of the first sound-emitting object and corresponding target identification information thereof;
determining a fourth voice vector according to a preset time delay parameter and the third voice vectors, wherein the preset time delay parameter is a time delay parameter of the voiceprint feature recognition model, and each third voice vector corresponds to one frame of audio;
determining a target speech vector according to the third speech vector and the fourth speech vector;
inputting the target voice vector into the voiceprint feature recognition model, so that a first hidden layer of the voiceprint feature recognition model utilizes an activation function to perform voiceprint feature extraction on the target voice vector to obtain a third voiceprint feature;
and training the voiceprint feature recognition model according to the third voiceprint feature and the corresponding target identification information.
3. The method of claim 2, wherein determining a fourth speech vector according to the preset delay parameter and the third speech vector comprises:
determining at least one audio frame which is separated from the audio frame corresponding to the third voice vector by a preset time delay according to the preset time delay parameter;
and determining a speech vector corresponding to the at least one audio frame as the fourth speech vector.
4. The method of claim 1, wherein extracting the first speech vector from the first speech data of the utterance object to be recognized comprises:
determining a first voice signal according to an audio frame corresponding to the first voice data and a preset window function;
performing fast Fourier transform on the first voice signal to obtain a frequency spectrum signal of the audio frame;
filtering the frequency spectrum signal to obtain a filtered frequency spectrum signal;
and performing discrete cosine transform on the filtered frequency spectrum signal to obtain the first voice vector.
5. The method of claim 1, wherein the calculating the similarity between the first voiceprint feature and the second voiceprint feature comprises:
carrying out averaging calculation on a plurality of third voiceprint characteristics of the first vocal object to determine a fourth voiceprint characteristic;
preprocessing the fourth voiceprint feature, and determining a calculation parameter of a PLDA algorithm;
adjusting the PLDA algorithm by using the calculation parameters to determine a first PLDA algorithm;
calculating a similarity between the first voiceprint feature and the second voiceprint feature using the first PLDA algorithm.
6. The method of claim 5, wherein preprocessing the fourth voiceprint feature to determine the calculation parameters of the PLDA algorithm comprises:
performing linear discriminant analysis processing on the fourth voiceprint feature to determine a fifth voiceprint feature;
carrying out mean value normalization processing on the fifth voiceprint characteristic to determine a sixth voiceprint characteristic;
carrying out length normalization processing on the sixth voiceprint characteristic to obtain a seventh voiceprint characteristic;
and determining the calculation parameters of the PLDA algorithm according to the seventh voiceprint characteristics.
7. The method of claim 1, wherein the calculating the similarity between the first voiceprint feature and the second voiceprint feature comprises:
and calculating the similarity between the first voiceprint feature and the second voiceprint feature by a cosine similarity algorithm.
8. A sound object recognition apparatus, comprising:
an extraction module configured to perform extraction of a first speech vector from first speech data of a sound object to be recognized and extraction of a second speech vector from second speech data of a target sound object;
an input module configured to perform input of the first speech vector and the second speech vector into a voiceprint feature recognition model, and perform voiceprint feature extraction on the first speech vector and the second speech vector by using an activation function of a hidden layer in the voiceprint feature recognition model to obtain a first voiceprint feature of the vocal object to be recognized and a second voiceprint feature of the target vocal object, where the voiceprint feature recognition model includes a plurality of cascaded hidden layers, and the number of layers of the hidden layers and the number of hidden neurons of each hidden layer are determined according to the number of training samples;
a computing module configured to perform computing a similarity between the first voiceprint feature and the second voiceprint feature;
a matching module configured to perform determining that the sound-emitting object to be recognized matches the target sound-emitting object if the similarity is greater than or equal to a similarity threshold.
9. A server, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of sound object recognition according to any one of claims 1 to 7.
10. A storage medium, wherein instructions in the storage medium, when executed by a processor of a server, enable the server to perform the method of recognizing a sound-emitting object according to any one of claims 1 to 7.
CN202011159156.6A 2020-10-26 2020-10-26 Sound object recognition method, sound object recognition device, server and storage medium Pending CN114512133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011159156.6A CN114512133A (en) 2020-10-26 2020-10-26 Sound object recognition method, sound object recognition device, server and storage medium

Publications (1)

Publication Number Publication Date
CN114512133A true CN114512133A (en) 2022-05-17

Family

ID=81546308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011159156.6A Pending CN114512133A (en) 2020-10-26 2020-10-26 Sound object recognition method, sound object recognition device, server and storage medium

Country Status (1)

Country Link
CN (1) CN114512133A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116884437A (en) * 2023-09-07 2023-10-13 北京惠朗时代科技有限公司 Speech recognition processor based on artificial intelligence
CN116884437B (en) * 2023-09-07 2023-11-17 北京惠朗时代科技有限公司 Speech recognition processor based on artificial intelligence

Similar Documents

Publication Publication Date Title
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
WO2021139425A1 (en) Voice activity detection method, apparatus and device, and storage medium
CN106486131B (en) A kind of method and device of speech de-noising
EP3156978A1 (en) A system and a method for secure speaker verification
Mashao et al. Combining classifier decisions for robust speaker identification
WO2014153800A1 (en) Voice recognition system
CN108922541A (en) Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
Murugappan et al. DWT and MFCC based human emotional speech classification using LDA
Bagul et al. Text independent speaker recognition system using GMM
Nandyal et al. MFCC based text-dependent speaker identification using BPNN
Venturini et al. On speech features fusion, α-integration Gaussian modeling and multi-style training for noise robust speaker classification
Pao et al. Combining acoustic features for improved emotion recognition in mandarin speech
Shabani et al. Speech recognition using principal components analysis and neural networks
Biagetti et al. Speaker identification in noisy conditions using short sequences of speech frames
CN113782032B (en) Voiceprint recognition method and related device
Karthikeyan et al. Hybrid machine learning classification scheme for speaker identification
Aroon et al. Speaker recognition system using Gaussian Mixture model
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
Singh et al. Modified group delay cepstral coefficients for voice liveness detection
Islam et al. A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network
Sas et al. Gender recognition using neural networks and ASR techniques
CN111785262A (en) Speaker age and gender classification method based on residual error network and fusion characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination