CN110349588A - An LSTM network voiceprint recognition method based on word embedding - Google Patents

An LSTM network voiceprint recognition method based on word embedding

Info

Publication number
CN110349588A
Authority
CN
China
Prior art keywords
sound
word
indicate
formula
speech segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910642258.4A
Other languages
Chinese (zh)
Inventor
闫河
罗成
李焕
董莺艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN201910642258.4A priority Critical patent/CN110349588A/en
Publication of CN110349588A publication Critical patent/CN110349588A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches

Abstract

The invention discloses an LSTM network voiceprint recognition method based on word embedding, comprising the following steps: S1, obtaining a speech segment to be identified; S2, converting the time scale, frequency and amplitude of the speech segment to be identified by Fast Fourier Transform (FFT) to generate a spectrogram of the speech segment to be identified; S3, reducing the dimensionality of the spectrogram through word-embedding processing, inputting it into the trained LSTM network, and obtaining the speaker identity information of the speech segment to be identified. The invention is based on a spectrogram feature extraction method with word-embedding dimensionality reduction, which improves the effectiveness of the spectrogram in network training; at the same time, exploiting the LSTM network's good ability to capture temporal features, the LSTM network classifies the dimensionality-reduced spectrograms, achieving high-accuracy voiceprint recognition.

Description

An LSTM network voiceprint recognition method based on word embedding
Technical field
The invention belongs to the field of speech technology, and in particular relates to an LSTM network voiceprint recognition method based on word embedding.
Background technique
Voiceprint recognition is a biometric technology that identifies a speaker's identity by voice. It is currently applied in criminal investigation, finance, the Internet and other fields. Voiceprint feature extraction refers, in the voiceprint recognition process, to extracting the information that characterizes the individual traits of a speaker's voice. Because human phonation arises from the joint vibration of the vocal cords and other organs, the phonation characteristics vary with the sound produced; these variations are nonlinear, and the nonlinear variation also differs between people. Current voiceprint recognition research mostly considers parameters such as linear prediction coefficients, Mel-frequency cepstral coefficients and spectrogram features. Among these, the spectrogram is a feature representation commonly used in the voiceprint recognition direction of current deep learning research; it is a time-sequence chart of the speech spectrum. Besides the rich local spatial features and temporal features carrying speaker-specific information, a spectrogram also contains blank speech segments and segments of insufficient speech energy, which leave large amounts of redundancy in the background color of the spectrogram, so that network training cannot converge quickly and the computation is heavily burdened. For feature extraction in deep-learning network training, efficient, highly discriminative features perform well in training, whereas redundant features consume extra resources during feature learning. Therefore, applying spectrograms to deep-learning voiceprint recognition requires eliminating the redundant information in order to improve the effective training efficiency of the network.
Neural network methods are a basic subject of current deep learning research; they simulate the structure of the human brain. Since each person's individual speaking characteristics cannot be expressed accurately in linear form, as deep learning gradually penetrates every field, voiceprint recognition methods have also increasingly turned to deep learning for exploratory development. The researchers who first worked in this area exploited the outstanding feature extraction ability of convolutional neural networks (CNN) to comprehensively extract speaker-specific characterization information. Owing to the complexity of voiceprint features, CNN structures with few layers could not accomplish the recognition task well, so on this basis researchers began to increase the number of network layers, gradually replacing the CNN feature-abstraction function with deep neural networks (Deep Neural Networks, DNN), hoping to improve voiceprint accuracy in such complex-feature situations, and carried out relevant verification work through experiments. As research progressed, it was found that speech signals also contain speaker-specific voiceprint information in the temporal dimension, which previous studies had ignored. Some scholars proposed speaker voiceprint recognition of continuous speech using CNN networks, learning the speaker's personal characteristics contained in speech with a CNN on the basis of time-ordered spectrograms. However, due to the inherent characteristics of CNN networks, spatial features can be extracted and learned well, while temporal features show considerable limitations.
Summary of the invention
In view of the above shortcomings of the prior art, the invention proposes an LSTM network voiceprint recognition method based on word embedding. A spectrogram feature extraction method based on word-embedding dimensionality reduction improves the effectiveness of spectrograms in network training; at the same time, exploiting the good temporal-feature-capturing ability of the LSTM network, the dimensionality-reduced spectrograms are classified by the LSTM network, achieving high-accuracy voiceprint recognition.
The present invention employs the following technical solution:
An LSTM network voiceprint recognition method based on word embedding, comprising the following steps:
S1, obtaining a speech segment to be identified;
S2, converting the time scale, frequency and amplitude of the speech segment to be identified by Fast Fourier Transform (FFT) to generate the spectrogram of the speech segment to be identified;
S3, reducing the dimensionality of the spectrogram of the speech segment to be identified through word-embedding processing, inputting it into the trained LSTM network, and obtaining the speaker identity information of the speech segment to be identified.
Preferably, the training method of the LSTM network includes the following steps:
S200, obtaining a speech-segment training set and a speech-segment test set;
S201, converting the time scale, frequency and amplitude of each speech segment in the speech-segment training set and the speech-segment test set by Fast Fourier Transform to obtain a spectrogram training set and a spectrogram test set;
S203, reducing the dimensionality of the spectrogram training set through word-embedding processing, inputting it together with the voiceprint labels into the LSTM network to be trained, and training that LSTM network;
S204, reducing the dimensionality of the spectrogram test set through word-embedding processing and inputting it into the trained LSTM network; if the output test result meets the preset condition, the training of the LSTM network is complete; otherwise, return to step S203 and retrain until the test result meets the preset condition.
Preferably, if the accuracy rate on the spectrogram test set is greater than or equal to a preset accuracy rate and the loss function value during testing is within a preset threshold, the test result is judged to meet the preset condition, where the accuracy rate and the loss function value are calculated based on the following formulas:
ACC = (P1 + P2 + ... + Pn)/n, with Pi = TPi/(TPi + FNi)
Loss = L(Y, P(Y|X)) = -log P(Y|X)
ACC denotes the recognition accuracy and Loss denotes the loss function value; n denotes the total number of speaker samples; Pi denotes the accuracy for the samples of the i-th speaker; TPi and FNi respectively denote the numbers of correctly classified and misclassified samples in the voiceprint sample class of the i-th speaker; Y denotes the correct class, and P(Y|X) denotes the probability of correct classification.
Preferably, the generation method of the spectrogram includes:
S401, performing speech pre-emphasis on the speech segment based on the formula spreemp[j] = s[j] - α·s[j-1], where spreemp[j] denotes the signal at time j in the pre-emphasized speech segment, s[j] denotes the signal at time j in the speech segment before pre-emphasis, s[j-1] denotes the signal at time j-1 in the speech segment before pre-emphasis, and α denotes a fixed parameter;
S402, performing overlapping segmentation on the pre-emphasized speech segment so that the transition between frames remains smooth and continuous;
S403, windowing each frame signal with the Hamming window w(n1) = 0.54 - 0.46·cos(2π·n1/(N-1)), 0 ≤ n1 ≤ N-1, to obtain stationary short-time signals, where w(n1) denotes the window function, n1 denotes the position within the frame, and N denotes the total frame length;
S404, performing Fast Fourier Transform on the stationary short-time signals to obtain X(m, n1), and obtaining the periodogram Y(m, n1) based on the formula Y(m, n1) = X(m, n1)·X(m, n1)', where m denotes the frame number, n1 denotes the frame length, X(m, n1) denotes the short-time speech signal after the Fast Fourier Transform, and X(m, n1)' denotes the transpose of the speech signal matrix;
S405, taking the logarithm of the periodogram Y(m, n1), and converting the time scale and frequency scale based on m and n1 into P and Q, obtaining the RGB information corresponding to the spectrogram;
S406, obtaining the spectrogram of the speech segment based on the spectrogram frequency and time scales and the corresponding color information.
Preferably, the method of word-embedding dimensionality reduction of the spectrogram includes:
S501, obtaining the word-embedding vectors of the spectrogram from its feature vectors as (v_(c-o+1) = V·x_(c-o+1), ..., v_(c+o) = V·x_(c+o)), where V denotes the weight matrix, x_(c-o+1) denotes the (c-o+1)-th feature vector, x_(c+o) denotes the (c+o)-th feature vector, v_(c-o+1) denotes the (c-o+1)-th word-embedding vector, v_(c+o) denotes the (c+o)-th word-embedding vector, x is a one-hot vector, i.e. the vectors are uniformly processed by one-hot encoding, c denotes the position of the central vector, and o denotes the context distance;
S502, averaging the word-embedding vectors based on the formula v̂ = (v_(c-o+1) + ... + v_(c+o))/(2o), where v̂ denotes the average at the central position c when the context distance of the word-embedding vectors is o;
S503, generating the score vector z = U·v̂, where U denotes the output word matrix;
S504, obtaining the probability form based on the formula ŷ = softmax(z), where ŷ denotes the probability distribution form of the spectrogram and softmax(z) denotes the classification function of z.
In conclusion the invention discloses a kind of LSTM network method for recognizing sound-groove of word-based insertion, including following step It is rapid: S1, to obtain sound bite to be identified;S2, by Fast Fourier Transform (FFT) by the time scale of sound bite to be identified, frequency It is converted with amplitude, generates the sound spectrograph of sound bite to be identified;S3, by the sound spectrograph of sound bite to be identified by word insertion at Reason inputs the LSTM network after training after carrying out dimensionality reduction, obtains the identities information of sound bite to be identified.The present invention is based on Word is embedded in the sound spectrograph feature extracting method of dimensionality reduction, validity of the Lai Tigao sound spectrograph in network training, while utilizing LSTM Network has the characteristics that good temporal aspect capturing ability, and the sound spectrograph after being embedded in dimensionality reduction to word using LSTM network is divided Class realizes the Application on Voiceprint Recognition of high-accuracy.
Detailed description of the invention
In order to make the purposes, technical solutions and advantages of the invention clearer, the present invention is described in further detail below in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a specific embodiment of the LSTM network voiceprint recognition method based on word embedding disclosed by the invention;
Fig. 2 is a schematic diagram of the mapping process between feature vectors and the weight matrix;
Fig. 3 is an accuracy comparison between the word-embedding dimensionality-reduction LSTM and the standard LSTM;
Fig. 4 is a loss-function comparison between the word-embedding dimensionality-reduction LSTM and the standard LSTM;
Fig. 5 shows the accuracy as a function of the number of iterations;
Fig. 6 shows the loss function as a function of the number of iterations.
Specific embodiment
The present invention is described in further detail with reference to the accompanying drawing.
As shown in Fig. 1, the invention discloses an LSTM network voiceprint recognition method based on word embedding, comprising the following steps:
S1, obtaining a speech segment to be identified;
In this method, the acquired raw speech signal can be divided into segments, and the speech data may be partitioned into segments of 4 s duration, as sketched below.
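The following is an illustrative sketch of such fixed-length segmentation, not taken from the patent; the 16 kHz sample rate and the NumPy array representation are assumptions:

    import numpy as np

    def split_into_segments(signal, sr=16000, seconds=4.0):
        """Split a 1-D speech signal into consecutive 4 s segments,
        dropping any trailing remainder shorter than one segment."""
        seg_len = int(sr * seconds)
        n_full = len(signal) // seg_len
        return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]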
S2, converting the time scale, frequency and amplitude of the speech segment to be identified by Fast Fourier Transform to generate the spectrogram of the speech segment to be identified;
S3, reducing the dimensionality of the spectrogram of the speech segment to be identified through word-embedding processing, inputting it into the trained LSTM network, and obtaining the speaker identity information of the speech segment to be identified.
In specific implementation, the training method of the LSTM network includes the following steps:
S200, obtaining a speech-segment training set and a speech-segment test set;
The ratio of the number of speech segments in the training set to that in the test set can be 8:2, for example as follows.
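A minimal sketch of such a split, assuming scikit-learn is available; the names segments and speaker_ids are hypothetical (e.g. produced by the segmentation step above):

    from sklearn.model_selection import train_test_split

    # 8:2 split, stratified so that every speaker appears in both sets.
    train_segs, test_segs, train_ids, test_ids = train_test_split(
        segments, speaker_ids, test_size=0.2, stratify=speaker_ids, random_state=0)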
S201, converting the time scale, frequency and amplitude of each speech segment in the speech-segment training set and the speech-segment test set by Fast Fourier Transform to obtain a spectrogram training set and a spectrogram test set;
S203, reducing the dimensionality of the spectrogram training set through word-embedding processing, inputting it together with the voiceprint labels into the LSTM network to be trained, and training that LSTM network;
S204, reducing the dimensionality of the spectrogram test set through word-embedding processing and inputting it into the trained LSTM network; if the output test result meets the preset condition, the training of the LSTM network is complete; otherwise, return to step S203 and retrain until the test result meets the preset condition.
In the training process of the LSTM network, the following approach can be taken: each spectrogram is input as a sample x, the corresponding speaker ID is input as the label y, and each speaker has the same number of spectrograms; the spectrograms obtained are fed into the network for training in the chronological order of speaking. The input dimension of a spectrogram is 106 × 80 × 3, where 106 is the length, 80 is the width, and 3 is the corresponding number of color channels. Assuming there are 10 different speakers, the speaker IDs in the speech labels are one-hot encoded before network input, turning them into a matrix form that can be fed directly into the network; for example, the first speaker ID is encoded as [0000000001], the second speaker ID as [0000000010], and so on, yielding the ID codes of the 10 speakers. The above is the preprocessing applied to the dataset before it enters the network. The data then passes through the word-embedding dimensionality-reduction layer (Embedding) and is fed into the LSTM network. The word-embedding dimensionality-reduction layer reduces the dimensionality of the sparse input matrix: a 106 × 80 × 3 spectrogram has 25440 dimensions in total, a high-dimensional matrix, which after the dimensionality-reduction layer can be reduced to a 128-dimensional matrix before being passed on to the LSTM for training. Accordingly, the LSTM connected to the word-embedding dimensionality-reduction layer has 128 neurons. To prevent over-fitting during training, dropout and recurrent_dropout can be used. Dropout temporarily discards neural-network units from the network with a certain probability; it controls the dropout rate of the neurons of the linear input transformations, the disconnected links being the connections between different neurons within an LSTM unit; the value is a floating-point number between 0 and 1 indicating the dropout probability, and a rate of 0.2 can be set. Recurrent_dropout controls the dropout rate between the neurons of the linear transformations of the recurrent state; what is disconnected are the connections between recurrent units, and the dropout probability can likewise be set to 0.2. Classification can be done by a Dense layer; Dense is the commonly described fully connected layer, activated with the sigmoid activation function. Finally, the class of the speaker is obtained. In order to optimize the model continuously, the chosen cost function is the logarithmic loss function (binary_crossentropy). The optimizer is adam, which computes an adaptive learning rate for each parameter.
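The following Keras sketch illustrates the configuration described above (a 128-dimensional reduction layer, an LSTM with 128 neurons, dropout and recurrent_dropout of 0.2, a sigmoid Dense classifier, binary_crossentropy loss and the adam optimizer). It is a minimal reading of the description rather than the patented implementation: in particular, a TimeDistributed Dense projection stands in for the word-embedding dimensionality-reduction layer, since Keras's Embedding layer expects integer token indices rather than real-valued spectrogram slices.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    NUM_SPEAKERS = 10        # example speaker count from the description
    TIME_STEPS = 106         # spectrogram length axis
    FEAT_DIM = 80 * 3        # width x colour channels per time step

    model = models.Sequential([
        layers.Input(shape=(TIME_STEPS, FEAT_DIM)),
        layers.TimeDistributed(layers.Dense(128)),          # reduce each step to 128 dims
        layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2),
        layers.Dense(NUM_SPEAKERS, activation='sigmoid'),   # one output per speaker ID
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.summary()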
In specific implementation, if the accuracy rate on the spectrogram test set is greater than or equal to a preset accuracy rate and the loss function value during testing is within a preset threshold, the test result is judged to meet the preset condition, where the accuracy rate and the loss function value are calculated based on the following formulas:
ACC = (P1 + P2 + ... + Pn)/n, with Pi = TPi/(TPi + FNi)
Loss = L(Y, P(Y|X)) = -log P(Y|X)
ACC denotes the recognition accuracy and Loss denotes the loss function value; n denotes the total number of speaker samples; Pi denotes the accuracy for the samples of the i-th speaker; TPi and FNi respectively denote the numbers of correctly classified and misclassified samples in the voiceprint sample class of the i-th speaker; Y denotes the correct class, and P(Y|X) denotes the probability of correct classification.
In the present invention, the accuracy rate indicates the proportion of correct recognition results among all recognition results and ranges between 0% and 100%; existing recognition accuracy is around 93%. The loss function reflects the robustness of network-model training; it has no fixed range, but the closer it tends to 0, the better the training robustness, and the loss function of existing network models is roughly below 0.5. Therefore, the preset accuracy rate may be set to 90%, and the preset threshold of the loss function value may be set to between 0 and 0.5.
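A small NumPy sketch of the two formulas above; the toy counts and probabilities are made up, and averaging -log P(Y|X) over samples is an assumption about how the loss is aggregated:

    import numpy as np

    def acc_and_loss(tp, fn, correct_probs):
        """ACC = mean over speakers of Pi = TPi / (TPi + FNi);
        Loss = -log P(Y|X), averaged over the probabilities assigned
        to the correct classes."""
        tp = np.asarray(tp, dtype=float)
        fn = np.asarray(fn, dtype=float)
        acc = float(np.mean(tp / (tp + fn)))
        loss = float(np.mean(-np.log(np.asarray(correct_probs))))
        return acc, loss

    # Toy example with 3 speakers (illustrative numbers only):
    print(acc_and_loss([9, 8, 10], [1, 2, 0], [0.9, 0.8, 0.95]))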
In specific implementation, the generation method of the spectrogram includes:
S401, performing speech pre-emphasis on the speech segment based on the formula spreemp[j] = s[j] - α·s[j-1], where spreemp[j] denotes the signal at time j in the pre-emphasized speech segment, s[j] denotes the signal at time j in the speech segment before pre-emphasis, s[j-1] denotes the signal at time j-1 in the speech segment before pre-emphasis, and α denotes a fixed parameter;
Here, α can take the value 0.95. In a collected speech signal, the energy of the low-frequency band is large while the energy of the high-frequency band is small. Pre-emphasis filters the low-frequency part so that the low-frequency intensity in the speech data does not exceed the high frequencies, thereby highlighting the high-frequency characteristics and increasing the high-frequency resolution of the speech, as sketched below.
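A one-function sketch of the pre-emphasis formula; the NumPy representation and keeping the first sample unchanged are assumptions (a common convention, not specified by the patent):

    import numpy as np

    def preemphasis(s, alpha=0.95):
        """spreemp[j] = s[j] - alpha * s[j-1]; the first sample is kept as-is."""
        s = np.asarray(s, dtype=float)
        return np.append(s[0], s[1:] - alpha * s[:-1])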
S402, performing overlapping segmentation on the pre-emphasized speech segment so that the transition between frames remains smooth and continuous;
The operation reads the sampled speech with a certain frame length and frame shift; since the frame shift is smaller than the frame length, each frame contains part of the content of the previous frame, which maintains the smooth transition and continuity between frames, so that the voiceprint feature information is fully preserved in the framed speech.
Since speech is a non-stationary process, it cannot be analyzed with the digital signal processing methods used for stationary signals; within a short time (by default 10~30 ms), however, the speech characteristics remain unchanged, so the speech is divided into overlapping frames to keep the transition between frames smooth and continuous, as sketched below.
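An illustrative framing sketch; the 25 ms frame length and 10 ms frame shift at 16 kHz (400 and 160 samples) are typical values within the 10~30 ms range mentioned above, not parameters fixed by the patent:

    import numpy as np

    def frame_signal(signal, frame_len=400, hop=160):
        """Overlapping segmentation: because hop < frame_len, each frame
        shares samples with the previous one, keeping transitions smooth."""
        signal = np.asarray(signal, dtype=float)
        n_frames = 1 + (len(signal) - frame_len) // hop
        return np.stack([signal[i * hop:i * hop + frame_len]
                         for i in range(n_frames)])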
S403, windowing each frame signal with the Hamming window w(n1) = 0.54 - 0.46·cos(2π·n1/(N-1)), 0 ≤ n1 ≤ N-1, to obtain stationary short-time signals, where w(n1) denotes the window function, n1 denotes the position within the frame, and N denotes the total frame length;
Since the two ends of a speech frame change abruptly, the transformed spectrum would differ considerably from the original signal; therefore each frame is windowed so that the endpoints do not jump when the subsequent Fourier transform is performed, a function generally realized with a Hamming window.
When selecting the window shape, a rectangular window is also available; compared with the rectangular window, the main lobe of the Hamming window is wider and high-frequency components are not easily lost. In experiments on speech signals, the Hamming window can therefore be taken as the speech window function in the present invention, as sketched below.
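A sketch of Hamming windowing per frame, following w(n1) = 0.54 - 0.46·cos(2π·n1/(N-1)) (equivalent to NumPy's built-in np.hamming):

    import numpy as np

    def apply_hamming(frames):
        """Multiply every frame by a Hamming window of matching length."""
        n = frames.shape[1]
        w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(n) / (n - 1))
        return frames * w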
S404, performing Fast Fourier Transform on the stationary short-time signals to obtain X(m, n1), and obtaining the periodogram Y(m, n1) based on the formula Y(m, n1) = X(m, n1)·X(m, n1)', where m denotes the frame number, n1 denotes the frame length, X(m, n1) denotes the short-time speech signal after the Fast Fourier Transform, and X(m, n1)' denotes the transpose of the speech signal matrix;
The windowed stationary short-time signals are ready for the next step of speech feature parameter extraction: the original speech sequence is successively decomposed into a series of short sequences. By making full use of the symmetry and periodicity of the exponential factor in the discrete Fourier transform (Discrete Fourier Transform, DFT) formula, the DFTs corresponding to these short sequences are computed and appropriately combined, which removes repeated computation, reduces multiplications and simplifies the structure, as sketched below.
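A sketch of the FFT and periodogram step; forming the conjugate product X·conj(X) = |X|² per frame is one reading of Y(m, n1) = X(m, n1)·X(m, n1)' for complex spectra:

    import numpy as np

    def periodogram(frames):
        """FFT each windowed frame, then form the power spectrum per frame."""
        X = np.fft.rfft(frames, axis=1)     # short-time spectra, one row per frame
        return (X * np.conj(X)).real        # Y = |X|^2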
S405, taking the logarithm of the periodogram Y(m, n1), and converting the time scale and frequency scale based on m and n1 into P and Q, obtaining the RGB information corresponding to the spectrogram;
S406, obtaining the spectrogram of the speech segment based on the spectrogram frequency and time scales and the corresponding color information.
The colored spectrogram can then be displayed as an image according to the spectrogram location information, for example as follows.
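A sketch of these final steps, taking the logarithm of the periodogram and mapping it through a color map to RGB; the 10·log10 scaling and the jet color map are assumptions, as the patent does not fix them:

    import numpy as np
    import matplotlib.pyplot as plt

    def spectrogram_rgb(power):
        """Log-scale the periodogram, normalize to [0, 1], and colour-map it
        to an RGB image of shape (frequency, time, 3)."""
        log_spec = 10.0 * np.log10(power + 1e-10)                    # avoid log(0)
        norm = (log_spec - log_spec.min()) / (np.ptp(log_spec) + 1e-10)
        return plt.cm.jet(norm.T)[..., :3]                           # drop alpha channel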
In specific implementation, the method of word-embedding dimensionality reduction of the spectrogram includes:
S501, obtaining the word-embedding vectors of the spectrogram from its feature vectors as (v_(c-o+1) = V·x_(c-o+1), ..., v_(c+o) = V·x_(c+o)), where V denotes the weight matrix, x_(c-o+1) denotes the (c-o+1)-th feature vector, x_(c+o) denotes the (c+o)-th feature vector, v_(c-o+1) denotes the (c-o+1)-th word-embedding vector, v_(c+o) denotes the (c+o)-th word-embedding vector, x is a one-hot vector, i.e. the vectors are uniformly processed by one-hot encoding, c denotes the position of the central vector, and o denotes the context distance;
S502, averaging the word-embedding vectors based on the formula v̂ = (v_(c-o+1) + ... + v_(c+o))/(2o), where v̂ denotes the average at the central position c when the context distance of the word-embedding vectors is o;
S503, generating the score vector z = U·v̂, where U denotes the output word matrix (here representing the speech-segment information matrix);
S504, obtaining the probability form based on the formula ŷ = softmax(z), where ŷ denotes the probability distribution form of the spectrogram and softmax(z) denotes the classification function of z.
In the present invention, x denotes a feature vector, and the x feature vectors are uniformly one-hot encoded in order; this is similar to numbering the vectors, replacing each vector by its number, with different vectors receiving different numbers. c denotes a position; it does not refer to one specific position but is a generic reference that keeps changing as the position moves, where c-o denotes the vector o positions before c and c+o denotes the vector o positions after c, so that this method obtains the context features before and after position c.
Each speech vector can be used as a node, the nodes are connected by weights, and the mapping relationship is then represented as shown in Fig. 2. A toy sketch of steps S501-S504 follows.
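The sketch below is a CBOW-style illustration of S501-S504 with random toy matrices; the vocabulary size, embedding size and context window are illustrative, and treating the spectrogram feature vectors as one-hot "words" follows the description above:

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB, EMB = 50, 16                      # toy sizes, not from the patent
    V = rng.normal(size=(EMB, VOCAB))        # weight matrix: v = V x
    U = rng.normal(size=(VOCAB, EMB))        # output word matrix

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def embed_and_classify(context_ids):
        """S501: v_i = V x_i (a column lookup equals multiplying by a one-hot x);
        S502: v_hat = average of the 2*o context vectors;
        S503: z = U v_hat;  S504: y_hat = softmax(z)."""
        v_hat = V[:, context_ids].mean(axis=1)
        return softmax(U @ v_hat)

    y_hat = embed_and_classify([3, 7, 11, 42])   # o = 2, so 2*o = 4 context ids
    print(y_hat.shape, y_hat.sum())              # (50,) probabilities summing to 1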
Fig. 3 compares the accuracy of the word-embedding dimensionality-reduction LSTM with that of the standard LSTM, and Fig. 4 compares their loss functions. Table 1 and Table 2 respectively give the structure and data-type parameters of the word-embedding LSTM model and the standard LSTM model:
Table 1: Word-embedding LSTM model structure and data-type parameters
Table 2: Standard LSTM model structure and data-type parameters
To verify the execution efficiency of the algorithm of the invention, models with different numbers of iterations were trained and used for recognition; the accuracy and loss function are shown in Fig. 5 and Fig. 6.
Table 3 gives the test-set accuracy and loss function under 15 iterations.
Table 3: Test-set accuracy and loss function under 15 iterations
From the above experiments it can be seen that, by analyzing the redundancy problem of spectrograms, the invention proposes a spectrogram feature extraction method based on word-embedding dimensionality reduction to improve the effectiveness of spectrograms in network training. At the same time, exploiting the good temporal-feature-capturing ability of the LSTM network, the dimensionality-reduced spectrograms are used to train the LSTM network and voiceprint recognition verification is carried out. Owing to limited computing equipment and time, with the number of iterations fixed at 7, it is verified that the recognition performance of the word-embedding LSTM network is better than that of the LSTM network without word-embedding dimensionality reduction. To further verify the recognition effect of the network, the LSTM network with word-embedding dimensionality-reduced spectrograms was run for 15 iterations, achieving a recognition accuracy of 94.17% and a loss function of 0.1354.
Finally, it is stated that the above embodiments are only used to illustrate the technical solution of the present invention and are not limiting. Although the present invention has been described with reference to its preferred embodiments, those of ordinary skill in the art should appreciate that various changes may be made to it in form and detail without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims (5)

1. An LSTM network voiceprint recognition method based on word embedding, characterized by comprising the following steps:
S1, obtaining a speech segment to be identified;
S2, converting the time scale, frequency and amplitude of the speech segment to be identified by Fast Fourier Transform to generate the spectrogram of the speech segment to be identified;
S3, reducing the dimensionality of the spectrogram of the speech segment to be identified through word-embedding processing, inputting it into the trained LSTM network, and obtaining the speaker identity information of the speech segment to be identified.
2. The LSTM network voiceprint recognition method based on word embedding according to claim 1, characterized in that the training method of the LSTM network includes the following steps:
S200, obtaining a speech-segment training set and a speech-segment test set;
S201, converting the time scale, frequency and amplitude of each speech segment in the speech-segment training set and the speech-segment test set by Fast Fourier Transform to obtain a spectrogram training set and a spectrogram test set;
S203, reducing the dimensionality of the spectrogram training set through word-embedding processing, inputting it together with the voiceprint labels into the LSTM network to be trained, and training that LSTM network;
S204, reducing the dimensionality of the spectrogram test set through word-embedding processing and inputting it into the trained LSTM network; if the output test result meets the preset condition, the training of the LSTM network is complete; otherwise, return to step S203 and retrain until the test result meets the preset condition.
3. The LSTM network voiceprint recognition method based on word embedding according to claim 2, characterized in that if the accuracy rate on the spectrogram test set is greater than or equal to a preset accuracy rate and the loss function value during testing is within a preset threshold, the test result is judged to meet the preset condition, where the accuracy rate and the loss function value are calculated based on the following formulas:
ACC = (P1 + P2 + ... + Pn)/n, with Pi = TPi/(TPi + FNi)
Loss = L(Y, P(Y|X)) = -log P(Y|X)
ACC denotes the recognition accuracy and Loss denotes the loss function value; n denotes the total number of speaker samples; Pi denotes the accuracy for the samples of the i-th speaker; TPi and FNi respectively denote the numbers of correctly classified and misclassified samples in the voiceprint sample class of the i-th speaker; Y denotes the correct class, and P(Y|X) denotes the probability of correct classification.
4. The LSTM network voiceprint recognition method based on word embedding according to claim 1, characterized in that the generation method of the spectrogram includes:
S401, performing speech pre-emphasis on the speech segment based on the formula spreemp[j] = s[j] - α·s[j-1], where spreemp[j] denotes the signal at time j in the pre-emphasized speech segment, s[j] denotes the signal at time j in the speech segment before pre-emphasis, s[j-1] denotes the signal at time j-1 in the speech segment before pre-emphasis, and α denotes a fixed parameter;
S402, performing overlapping segmentation on the pre-emphasized speech segment so that the transition between frames remains smooth and continuous;
S403, windowing each frame signal with the Hamming window w(n1) = 0.54 - 0.46·cos(2π·n1/(N-1)), 0 ≤ n1 ≤ N-1, to obtain stationary short-time signals, where w(n1) denotes the window function, n1 denotes the position within the frame, and N denotes the total frame length;
S404, performing Fast Fourier Transform on the stationary short-time signals to obtain X(m, n1), and obtaining the periodogram Y(m, n1) based on the formula Y(m, n1) = X(m, n1)·X(m, n1)', where m denotes the frame number, n1 denotes the frame length, X(m, n1) denotes the short-time speech signal after the Fast Fourier Transform, and X(m, n1)' denotes the transpose of the speech signal matrix;
S405, taking the logarithm of the periodogram Y(m, n1), and converting the time scale and frequency scale based on m and n1 into P and Q, obtaining the RGB information corresponding to the spectrogram;
S406, obtaining the spectrogram of the speech segment based on the spectrogram frequency and time scales and the corresponding color information.
5. The LSTM network voiceprint recognition method based on word embedding according to claim 1, characterized in that the method of word-embedding dimensionality reduction of the spectrogram includes:
S501, obtaining the word-embedding vectors of the spectrogram from its feature vectors as (v_(c-o+1) = V·x_(c-o+1), ..., v_(c+o) = V·x_(c+o)), where V denotes the weight matrix, x_(c-o+1) denotes the (c-o+1)-th feature vector, x_(c+o) denotes the (c+o)-th feature vector, v_(c-o+1) denotes the (c-o+1)-th word-embedding vector, v_(c+o) denotes the (c+o)-th word-embedding vector, x is a one-hot vector, i.e. the vectors are uniformly processed by one-hot encoding, c denotes the position of the central vector, and o denotes the context distance;
S502, averaging the word-embedding vectors based on the formula v̂ = (v_(c-o+1) + ... + v_(c+o))/(2o), where v̂ denotes the average at the central position c when the context distance of the word-embedding vectors is o;
S503, generating the score vector z = U·v̂, where U denotes the output word matrix;
S504, obtaining the probability form based on the formula ŷ = softmax(z), where ŷ denotes the probability distribution form of the spectrogram and softmax(z) denotes the classification function of z.
CN201910642258.4A 2019-07-16 2019-07-16 An LSTM network voiceprint recognition method based on word embedding Pending CN110349588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910642258.4A CN110349588A (en) 2019-07-16 2019-07-16 An LSTM network voiceprint recognition method based on word embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910642258.4A CN110349588A (en) 2019-07-16 2019-07-16 An LSTM network voiceprint recognition method based on word embedding

Publications (1)

Publication Number Publication Date
CN110349588A true CN110349588A (en) 2019-10-18

Family

ID=68175736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910642258.4A Pending CN110349588A (en) 2019-07-16 2019-07-16 An LSTM network voiceprint recognition method based on word embedding

Country Status (1)

Country Link
CN (1) CN110349588A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010072602A (en) * 2008-09-22 2010-04-02 Dainippon Printing Co Ltd Device for embedding information for audio signal, and device for extracting information from audio signal
CN109524014A * 2018-11-29 2019-03-26 辽宁工业大学 Voiceprint recognition analysis method based on deep convolutional neural networks

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
李靓等: "Voiceprint recognition method for small samples based on deep learning", Computer Engineering (《计算机工程》) *
毛焱颖: "Long-text sentiment classification method based on attention-based two-layer LSTM", Journal of Chongqing College of Electronic Engineering (《重庆电子工程职业学院学报》) *
胡青等: "Speaker recognition algorithm based on convolutional neural network classification", Netinfo Security (《信息网络安全》) *
郑泽: "Research on the Word2Vec word embedding model", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *
闫河等: "Research on voiceprint recognition based on CNN-LSTM network", Computer Applications and Software (《计算机应用与软件》) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111048099A (en) * 2019-12-16 2020-04-21 随手(北京)信息技术有限公司 Sound source identification method, device, server and storage medium
CN111261192A (en) * 2020-01-15 2020-06-09 厦门快商通科技股份有限公司 Audio detection method based on LSTM network, electronic equipment and storage medium
CN111933148A (en) * 2020-06-29 2020-11-13 厦门快商通科技股份有限公司 Age identification method and device based on convolutional neural network and terminal
CN111833843A (en) * 2020-07-21 2020-10-27 苏州思必驰信息科技有限公司 Speech synthesis method and system
US11842722B2 (en) 2020-07-21 2023-12-12 Ai Speech Co., Ltd. Speech synthesis method and system
CN112613481A (en) * 2021-01-04 2021-04-06 上海明略人工智能(集团)有限公司 Bearing abrasion early warning method and system based on frequency spectrum
CN113241054A (en) * 2021-05-10 2021-08-10 北京声智科技有限公司 Speech smoothing model generation method, speech smoothing method and device
WO2023070874A1 (en) * 2021-10-28 2023-05-04 中国科学院深圳先进技术研究院 Voiceprint recognition method

Similar Documents

Publication Publication Date Title
CN110349588A (en) An LSTM network voiceprint recognition method based on word embedding
CN112509564B (en) End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
US20200402497A1 (en) Systems and Methods for Speech Generation
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN112818892B (en) Multi-modal depression detection method and system based on time convolution neural network
CN107146624B A speaker recognition method and device
CN110136731A End-to-end bone-conducted speech blind enhancement method based on dilated causal convolution generative adversarial network
CN109524014A Voiceprint recognition analysis method based on deep convolutional neural networks
JP2654917B2 (en) Speaker independent isolated word speech recognition system using neural network
CN110379412A Speech processing method and apparatus, electronic device, and computer-readable storage medium
CN108922513A Speech discrimination method and apparatus, computer device and storage medium
CN110111797A Speaker recognition method based on Gaussian supervector and deep neural network
CN109559736A An automatic dubbing method for film actors based on adversarial networks
CN113571067B (en) Voiceprint recognition countermeasure sample generation method based on boundary attack
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN109346084A Speaker recognition method based on deep stacked autoencoder network
CN109036470B Speech discrimination method and device, computer equipment and storage medium
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
KR20220047080A (en) A speaker embedding extraction method and system for automatic speech recognition based pooling method for speaker recognition, and recording medium therefor
CN116434758A (en) Voiceprint recognition model training method and device, electronic equipment and storage medium
CN110415685A A speech recognition method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20191018