CN110349588A - An LSTM network voiceprint recognition method based on word embedding - Google Patents

An LSTM network voiceprint recognition method based on word embedding

Info

Publication number
CN110349588A
Authority
CN
China
Prior art keywords
sound
word
indicate
formula
speech segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910642258.4A
Other languages
Chinese (zh)
Inventor
闫河
罗成
李焕
董莺艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN201910642258.4A priority Critical patent/CN110349588A/en
Publication of CN110349588A publication Critical patent/CN110349588A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches

Abstract

The invention discloses an LSTM network voiceprint recognition method based on word embedding, comprising the following steps: S1, obtaining a speech segment to be identified; S2, converting the time scale, frequency and amplitude of the speech segment to be identified by Fast Fourier Transform (FFT) to generate a spectrogram of the speech segment to be identified; S3, reducing the dimensionality of the spectrogram through word-embedding processing, inputting it into the trained LSTM network, and obtaining the speaker identity information of the speech segment to be identified. The invention is based on a spectrogram feature extraction method with word-embedding dimensionality reduction, which improves the effectiveness of the spectrogram in network training; at the same time, exploiting the LSTM network's good ability to capture temporal features, the LSTM network classifies the dimensionality-reduced spectrograms, achieving high-accuracy voiceprint recognition.

Description

An LSTM network voiceprint recognition method based on word embedding
Technical field
The invention belongs to the field of speech technology, and in particular relates to an LSTM network voiceprint recognition method based on word embedding.
Background technique
Voiceprint recognition is a biometric technology that identifies a speaker's identity by voice. It is currently applied in criminal investigation, finance, the Internet and other fields. Voiceprint feature extraction refers, in the voiceprint recognition process, to extracting the information that characterizes the individual traits of a speaker's voice. Because human phonation arises from the joint vibration of the vocal cords and other organs, the phonation characteristics vary with the sound produced; these variations are nonlinear, and the nonlinear variation also differs between people. Current voiceprint recognition research mostly considers parameters such as linear prediction coefficients, Mel-frequency cepstral coefficients and spectrogram features. Among these, the spectrogram is a feature representation commonly used in the voiceprint recognition direction of current deep learning research; it is a time-sequence chart of the speech spectrum. Besides the rich local spatial features and temporal features carrying speaker-specific information, a spectrogram also contains blank speech segments and segments of insufficient speech energy, which leave large amounts of redundancy in the background color of the spectrogram, so that network training cannot converge quickly and the computation is heavily burdened. For feature extraction in deep-learning network training, efficient, highly discriminative features perform well in training, whereas redundant features consume extra resources during feature learning. Therefore, applying spectrograms to deep-learning voiceprint recognition requires eliminating the redundant information in order to improve the effective training efficiency of the network.
Neural network methods are a basic subject of current deep learning research; they simulate the structure of the human brain. Since each person's individual speaking characteristics cannot be expressed accurately in linear form, as deep learning gradually penetrates every field, voiceprint recognition methods have also increasingly turned to deep learning for exploratory development. The researchers who first worked in this area exploited the outstanding feature extraction ability of convolutional neural networks (CNN) to comprehensively extract speaker-specific characterization information. Owing to the complexity of voiceprint features, CNN structures with few layers could not accomplish the recognition task well, so on this basis researchers began to increase the number of network layers, gradually replacing the CNN feature-abstraction function with deep neural networks (Deep Neural Networks, DNN), hoping to improve voiceprint accuracy in such complex-feature situations, and carried out relevant verification work through experiments. As research progressed, it was found that speech signals also contain speaker-specific voiceprint information in the temporal dimension, which previous studies had ignored. Some scholars proposed speaker voiceprint recognition of continuous speech using CNN networks, learning the speaker's personal characteristics contained in speech with a CNN on the basis of time-ordered spectrograms. However, due to the inherent characteristics of CNN networks, spatial features can be extracted and learned well, while temporal features show considerable limitations.
Summary of the invention
In view of the above shortcomings of the prior art, the invention proposes an LSTM network voiceprint recognition method based on word embedding. A spectrogram feature extraction method based on word-embedding dimensionality reduction improves the effectiveness of spectrograms in network training; at the same time, exploiting the good temporal-feature-capturing ability of the LSTM network, the dimensionality-reduced spectrograms are classified by the LSTM network, achieving high-accuracy voiceprint recognition.
The present invention employs the following technical solution:
An LSTM network voiceprint recognition method based on word embedding, comprising the following steps:
S1, obtaining a speech segment to be identified;
S2, converting the time scale, frequency and amplitude of the speech segment to be identified by Fast Fourier Transform (FFT) to generate the spectrogram of the speech segment to be identified;
S3, reducing the dimensionality of the spectrogram of the speech segment to be identified through word-embedding processing, inputting it into the trained LSTM network, and obtaining the speaker identity information of the speech segment to be identified.
Preferably, the training method of the LSTM network includes the following steps:
S200, obtaining a speech-segment training set and a speech-segment test set;
S201, converting the time scale, frequency and amplitude of each speech segment in the speech-segment training set and the speech-segment test set by Fast Fourier Transform to obtain a spectrogram training set and a spectrogram test set;
S203, reducing the dimensionality of the spectrogram training set through word-embedding processing, inputting it together with the voiceprint labels into the LSTM network to be trained, and training that LSTM network;
S204, reducing the dimensionality of the spectrogram test set through word-embedding processing and inputting it into the trained LSTM network; if the output test result meets the preset condition, the training of the LSTM network is complete; otherwise, return to step S203 and retrain until the test result meets the preset condition.
Preferably, if the accuracy rate on the spectrogram test set is greater than or equal to a preset accuracy rate and the loss function value during testing is within a preset threshold, the test result is judged to meet the preset condition, where the accuracy rate and the loss function value are calculated based on the following formulas:
ACC = (P1 + P2 + ... + Pn)/n, with Pi = TPi/(TPi + FNi)
Loss = L(Y, P(Y|X)) = -log P(Y|X)
ACC denotes the recognition accuracy and Loss denotes the loss function value; n denotes the total number of speaker samples; Pi denotes the accuracy for the samples of the i-th speaker; TPi and FNi respectively denote the numbers of correctly classified and misclassified samples in the voiceprint sample class of the i-th speaker; Y denotes the correct class, and P(Y|X) denotes the probability of correct classification.
Preferably, the generation method of the spectrogram includes:
S401, performing speech pre-emphasis on the speech segment based on the formula spreemp[j] = s[j] - α·s[j-1], where spreemp[j] denotes the signal at time j in the pre-emphasized speech segment, s[j] denotes the signal at time j in the speech segment before pre-emphasis, s[j-1] denotes the signal at time j-1 in the speech segment before pre-emphasis, and α denotes a fixed parameter;
S402, performing overlapping segmentation on the pre-emphasized speech segment so that the transition between frames remains smooth and continuous;
S403, windowing each frame signal with the Hamming window w(n1) = 0.54 - 0.46·cos(2π·n1/(N-1)), 0 ≤ n1 ≤ N-1, to obtain stationary short-time signals, where w(n1) denotes the window function, n1 denotes the position within the frame, and N denotes the total frame length;
S404, performing Fast Fourier Transform on the stationary short-time signals to obtain X(m, n1), and obtaining the periodogram Y(m, n1) based on the formula Y(m, n1) = X(m, n1)·X(m, n1)', where m denotes the frame number, n1 denotes the frame length, X(m, n1) denotes the short-time speech signal after the Fast Fourier Transform, and X(m, n1)' denotes the transpose of the speech signal matrix;
S405, taking the logarithm of the periodogram Y(m, n1), and converting the time scale and frequency scale based on m and n1 into P and Q, obtaining the RGB information corresponding to the spectrogram;
S406, obtaining the spectrogram of the speech segment based on the spectrogram frequency and time scales and the corresponding color information.
Preferably, the method of word-embedding dimensionality reduction of the spectrogram includes:
S501, obtaining the word-embedding vectors of the spectrogram from its feature vectors as (v_(c-o+1) = V·x_(c-o+1), ..., v_(c+o) = V·x_(c+o)), where V denotes the weight matrix, x_(c-o+1) denotes the (c-o+1)-th feature vector, x_(c+o) denotes the (c+o)-th feature vector, v_(c-o+1) denotes the (c-o+1)-th word-embedding vector, v_(c+o) denotes the (c+o)-th word-embedding vector, x is a one-hot vector, i.e. the vectors are uniformly processed by one-hot encoding, c denotes the position of the central vector, and o denotes the context distance;
S502, averaging the word-embedding vectors based on the formula v̂ = (v_(c-o+1) + ... + v_(c+o))/(2o), where v̂ denotes the average at the central position c when the context distance of the word-embedding vectors is o;
S503, generating the score vector z = U·v̂, where U denotes the output word matrix;
S504, obtaining the probability form based on the formula ŷ = softmax(z), where ŷ denotes the probability distribution form of the spectrogram and softmax(z) denotes the classification function of z.
In conclusion the invention discloses a kind of LSTM network method for recognizing sound-groove of word-based insertion, including following step It is rapid: S1, to obtain sound bite to be identified;S2, by Fast Fourier Transform (FFT) by the time scale of sound bite to be identified, frequency It is converted with amplitude, generates the sound spectrograph of sound bite to be identified;S3, by the sound spectrograph of sound bite to be identified by word insertion at Reason inputs the LSTM network after training after carrying out dimensionality reduction, obtains the identities information of sound bite to be identified.The present invention is based on Word is embedded in the sound spectrograph feature extracting method of dimensionality reduction, validity of the Lai Tigao sound spectrograph in network training, while utilizing LSTM Network has the characteristics that good temporal aspect capturing ability, and the sound spectrograph after being embedded in dimensionality reduction to word using LSTM network is divided Class realizes the Application on Voiceprint Recognition of high-accuracy.
Detailed description of the invention
In order to make the purposes, technical solutions and advantages of the invention clearer, the present invention is described in further detail below in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a specific embodiment of the LSTM network voiceprint recognition method based on word embedding disclosed by the invention;
Fig. 2 is a schematic diagram of the mapping process between feature vectors and the weight matrix;
Fig. 3 is an accuracy comparison between the word-embedding dimensionality-reduction LSTM and the standard LSTM;
Fig. 4 is a loss-function comparison between the word-embedding dimensionality-reduction LSTM and the standard LSTM;
Fig. 5 shows the accuracy as a function of the number of iterations;
Fig. 6 shows the loss function as a function of the number of iterations.
Specific embodiment
The present invention is described in further detail with reference to the accompanying drawing.
As shown in Fig. 1, the invention discloses an LSTM network voiceprint recognition method based on word embedding, comprising the following steps:
S1, obtaining a speech segment to be identified;
In this method, the acquired raw speech signal can be divided into segments, and the speech data may be partitioned into segments of 4 s duration, as sketched below.
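The following is an illustrative sketch of such fixed-length segmentation, not taken from the patent; the 16 kHz sample rate and the NumPy array representation are assumptions:

    import numpy as np

    def split_into_segments(signal, sr=16000, seconds=4.0):
        """Split a 1-D speech signal into consecutive 4 s segments,
        dropping any trailing remainder shorter than one segment."""
        seg_len = int(sr * seconds)
        n_full = len(signal) // seg_len
        return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]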
S2, converting the time scale, frequency and amplitude of the speech segment to be identified by Fast Fourier Transform to generate the spectrogram of the speech segment to be identified;
S3, reducing the dimensionality of the spectrogram of the speech segment to be identified through word-embedding processing, inputting it into the trained LSTM network, and obtaining the speaker identity information of the speech segment to be identified.
In specific implementation, the training method of the LSTM network includes the following steps:
S200, obtaining a speech-segment training set and a speech-segment test set;
The ratio of the number of speech segments in the training set to that in the test set can be 8:2, for example as follows.
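A minimal sketch of such a split, assuming scikit-learn is available; the names segments and speaker_ids are hypothetical (e.g. produced by the segmentation step above):

    from sklearn.model_selection import train_test_split

    # 8:2 split, stratified so that every speaker appears in both sets.
    train_segs, test_segs, train_ids, test_ids = train_test_split(
        segments, speaker_ids, test_size=0.2, stratify=speaker_ids, random_state=0)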
S201, converting the time scale, frequency and amplitude of each speech segment in the speech-segment training set and the speech-segment test set by Fast Fourier Transform to obtain a spectrogram training set and a spectrogram test set;
S203, reducing the dimensionality of the spectrogram training set through word-embedding processing, inputting it together with the voiceprint labels into the LSTM network to be trained, and training that LSTM network;
S204, reducing the dimensionality of the spectrogram test set through word-embedding processing and inputting it into the trained LSTM network; if the output test result meets the preset condition, the training of the LSTM network is complete; otherwise, return to step S203 and retrain until the test result meets the preset condition.
In the training process of the LSTM network, the following approach can be taken: each spectrogram is input as a sample x, the corresponding speaker ID is input as the label y, and each speaker has the same number of spectrograms; the spectrograms obtained are fed into the network for training in the chronological order of speaking. The input dimension of a spectrogram is 106 × 80 × 3, where 106 is the length, 80 is the width, and 3 is the corresponding number of color channels. Assuming there are 10 different speakers, the speaker IDs in the speech labels are one-hot encoded before network input, turning them into a matrix form that can be fed directly into the network; for example, the first speaker ID is encoded as [0000000001], the second speaker ID as [0000000010], and so on, yielding the ID codes of the 10 speakers. The above is the preprocessing applied to the dataset before it enters the network. The data then passes through the word-embedding dimensionality-reduction layer (Embedding) and is fed into the LSTM network. The word-embedding dimensionality-reduction layer reduces the dimensionality of the sparse input matrix: a 106 × 80 × 3 spectrogram has 25440 dimensions in total, a high-dimensional matrix, which after the dimensionality-reduction layer can be reduced to a 128-dimensional matrix before being passed on to the LSTM for training. Accordingly, the LSTM connected to the word-embedding dimensionality-reduction layer has 128 neurons. To prevent over-fitting during training, dropout and recurrent_dropout can be used. Dropout temporarily discards neural-network units from the network with a certain probability; it controls the dropout rate of the neurons of the linear input transformations, the disconnected links being the connections between different neurons within an LSTM unit; the value is a floating-point number between 0 and 1 indicating the dropout probability, and a rate of 0.2 can be set. Recurrent_dropout controls the dropout rate between the neurons of the linear transformations of the recurrent state; what is disconnected are the connections between recurrent units, and the dropout probability can likewise be set to 0.2. Classification can be done by a Dense layer; Dense is the commonly described fully connected layer, activated with the sigmoid activation function. Finally, the class of the speaker is obtained. In order to optimize the model continuously, the chosen cost function is the logarithmic loss function (binary_crossentropy). The optimizer is adam, which computes an adaptive learning rate for each parameter.
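The following Keras sketch illustrates the configuration described above (a 128-dimensional reduction layer, an LSTM with 128 neurons, dropout and recurrent_dropout of 0.2, a sigmoid Dense classifier, binary_crossentropy loss and the adam optimizer). It is a minimal reading of the description rather than the patented implementation: in particular, a TimeDistributed Dense projection stands in for the word-embedding dimensionality-reduction layer, since Keras's Embedding layer expects integer token indices rather than real-valued spectrogram slices.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    NUM_SPEAKERS = 10        # example speaker count from the description
    TIME_STEPS = 106         # spectrogram length axis
    FEAT_DIM = 80 * 3        # width x colour channels per time step

    model = models.Sequential([
        layers.Input(shape=(TIME_STEPS, FEAT_DIM)),
        layers.TimeDistributed(layers.Dense(128)),          # reduce each step to 128 dims
        layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2),
        layers.Dense(NUM_SPEAKERS, activation='sigmoid'),   # one output per speaker ID
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.summary()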
In specific implementation, if the accuracy rate on the spectrogram test set is greater than or equal to a preset accuracy rate and the loss function value during testing is within a preset threshold, the test result is judged to meet the preset condition, where the accuracy rate and the loss function value are calculated based on the following formulas:
ACC = (P1 + P2 + ... + Pn)/n, with Pi = TPi/(TPi + FNi)
Loss = L(Y, P(Y|X)) = -log P(Y|X)
ACC denotes the recognition accuracy and Loss denotes the loss function value; n denotes the total number of speaker samples; Pi denotes the accuracy for the samples of the i-th speaker; TPi and FNi respectively denote the numbers of correctly classified and misclassified samples in the voiceprint sample class of the i-th speaker; Y denotes the correct class, and P(Y|X) denotes the probability of correct classification.
In the present invention, the accuracy rate indicates the proportion of correct recognition results among all recognition results and ranges between 0% and 100%; existing recognition accuracy is around 93%. The loss function reflects the robustness of network-model training; it has no fixed range, but the closer it tends to 0, the better the training robustness, and the loss function of existing network models is roughly below 0.5. Therefore, the preset accuracy rate may be set to 90%, and the preset threshold of the loss function value may be set to between 0 and 0.5.
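A small NumPy sketch of the two formulas above; the toy counts and probabilities are made up, and averaging -log P(Y|X) over samples is an assumption about how the loss is aggregated:

    import numpy as np

    def acc_and_loss(tp, fn, correct_probs):
        """ACC = mean over speakers of Pi = TPi / (TPi + FNi);
        Loss = -log P(Y|X), averaged over the probabilities assigned
        to the correct classes."""
        tp = np.asarray(tp, dtype=float)
        fn = np.asarray(fn, dtype=float)
        acc = float(np.mean(tp / (tp + fn)))
        loss = float(np.mean(-np.log(np.asarray(correct_probs))))
        return acc, loss

    # Toy example with 3 speakers (illustrative numbers only):
    print(acc_and_loss([9, 8, 10], [1, 2, 0], [0.9, 0.8, 0.95]))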
In specific implementation, the generation method of the spectrogram includes:
S401, performing speech pre-emphasis on the speech segment based on the formula spreemp[j] = s[j] - α·s[j-1], where spreemp[j] denotes the signal at time j in the pre-emphasized speech segment, s[j] denotes the signal at time j in the speech segment before pre-emphasis, s[j-1] denotes the signal at time j-1 in the speech segment before pre-emphasis, and α denotes a fixed parameter;
Here, α can take the value 0.95. In a collected speech signal, the energy of the low-frequency band is large while the energy of the high-frequency band is small. Pre-emphasis filters the low-frequency part so that the low-frequency intensity in the speech data does not exceed the high frequencies, thereby highlighting the high-frequency characteristics and increasing the high-frequency resolution of the speech, as sketched below.
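A one-function sketch of the pre-emphasis formula; the NumPy representation and keeping the first sample unchanged are assumptions (a common convention, not specified by the patent):

    import numpy as np

    def preemphasis(s, alpha=0.95):
        """spreemp[j] = s[j] - alpha * s[j-1]; the first sample is kept as-is."""
        s = np.asarray(s, dtype=float)
        return np.append(s[0], s[1:] - alpha * s[:-1])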
S402, performing overlapping segmentation on the pre-emphasized speech segment so that the transition between frames remains smooth and continuous;
The operation reads the sampled speech with a certain frame length and frame shift; since the frame shift is smaller than the frame length, each frame contains part of the content of the previous frame, which maintains the smooth transition and continuity between frames, so that the voiceprint feature information is fully preserved in the framed speech.
Since speech is a non-stationary process, it cannot be analyzed with the digital signal processing methods used for stationary signals; within a short time (by default 10~30 ms), however, the speech characteristics remain unchanged, so the speech is divided into overlapping frames to keep the transition between frames smooth and continuous, as sketched below.
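An illustrative framing sketch; the 25 ms frame length and 10 ms frame shift at 16 kHz (400 and 160 samples) are typical values within the 10~30 ms range mentioned above, not parameters fixed by the patent:

    import numpy as np

    def frame_signal(signal, frame_len=400, hop=160):
        """Overlapping segmentation: because hop < frame_len, each frame
        shares samples with the previous one, keeping transitions smooth."""
        signal = np.asarray(signal, dtype=float)
        n_frames = 1 + (len(signal) - frame_len) // hop
        return np.stack([signal[i * hop:i * hop + frame_len]
                         for i in range(n_frames)])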
S403, windowing each frame signal with the Hamming window w(n1) = 0.54 - 0.46·cos(2π·n1/(N-1)), 0 ≤ n1 ≤ N-1, to obtain stationary short-time signals, where w(n1) denotes the window function, n1 denotes the position within the frame, and N denotes the total frame length;
Since the two ends of a speech frame change abruptly, the transformed spectrum would differ considerably from the original signal; therefore each frame is windowed so that the endpoints do not jump when the subsequent Fourier transform is performed, a function generally realized with a Hamming window.
When selecting the window shape, a rectangular window is also available; compared with the rectangular window, the main lobe of the Hamming window is wider and high-frequency components are not easily lost. In experiments on speech signals, the Hamming window can therefore be taken as the speech window function in the present invention, as sketched below.
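A sketch of Hamming windowing per frame, following w(n1) = 0.54 - 0.46·cos(2π·n1/(N-1)) (equivalent to NumPy's built-in np.hamming):

    import numpy as np

    def apply_hamming(frames):
        """Multiply every frame by a Hamming window of matching length."""
        n = frames.shape[1]
        w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(n) / (n - 1))
        return frames * w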
S404, performing Fast Fourier Transform on the stationary short-time signals to obtain X(m, n1), and obtaining the periodogram Y(m, n1) based on the formula Y(m, n1) = X(m, n1)·X(m, n1)', where m denotes the frame number, n1 denotes the frame length, X(m, n1) denotes the short-time speech signal after the Fast Fourier Transform, and X(m, n1)' denotes the transpose of the speech signal matrix;
The windowed stationary short-time signals are ready for the next step of speech feature parameter extraction: the original speech sequence is successively decomposed into a series of short sequences. By making full use of the symmetry and periodicity of the exponential factor in the discrete Fourier transform (Discrete Fourier Transform, DFT) formula, the DFTs corresponding to these short sequences are computed and appropriately combined, which removes repeated computation, reduces multiplications and simplifies the structure, as sketched below.
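A sketch of the FFT and periodogram step; forming the conjugate product X·conj(X) = |X|² per frame is one reading of Y(m, n1) = X(m, n1)·X(m, n1)' for complex spectra:

    import numpy as np

    def periodogram(frames):
        """FFT each windowed frame, then form the power spectrum per frame."""
        X = np.fft.rfft(frames, axis=1)     # short-time spectra, one row per frame
        return (X * np.conj(X)).real        # Y = |X|^2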
S405, taking the logarithm of the periodogram Y(m, n1), and converting the time scale and frequency scale based on m and n1 into P and Q, obtaining the RGB information corresponding to the spectrogram;
S406, obtaining the spectrogram of the speech segment based on the spectrogram frequency and time scales and the corresponding color information.
The colored spectrogram can then be displayed as an image according to the spectrogram location information, for example as follows.
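A sketch of these final steps, taking the logarithm of the periodogram and mapping it through a color map to RGB; the 10·log10 scaling and the jet color map are assumptions, as the patent does not fix them:

    import numpy as np
    import matplotlib.pyplot as plt

    def spectrogram_rgb(power):
        """Log-scale the periodogram, normalize to [0, 1], and colour-map it
        to an RGB image of shape (frequency, time, 3)."""
        log_spec = 10.0 * np.log10(power + 1e-10)                    # avoid log(0)
        norm = (log_spec - log_spec.min()) / (np.ptp(log_spec) + 1e-10)
        return plt.cm.jet(norm.T)[..., :3]                           # drop alpha channel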
In specific implementation, the method of word-embedding dimensionality reduction of the spectrogram includes:
S501, obtaining the word-embedding vectors of the spectrogram from its feature vectors as (v_(c-o+1) = V·x_(c-o+1), ..., v_(c+o) = V·x_(c+o)), where V denotes the weight matrix, x_(c-o+1) denotes the (c-o+1)-th feature vector, x_(c+o) denotes the (c+o)-th feature vector, v_(c-o+1) denotes the (c-o+1)-th word-embedding vector, v_(c+o) denotes the (c+o)-th word-embedding vector, x is a one-hot vector, i.e. the vectors are uniformly processed by one-hot encoding, c denotes the position of the central vector, and o denotes the context distance;
S502, averaging the word-embedding vectors based on the formula v̂ = (v_(c-o+1) + ... + v_(c+o))/(2o), where v̂ denotes the average at the central position c when the context distance of the word-embedding vectors is o;
S503, generating the score vector z = U·v̂, where U denotes the output word matrix (here representing the speech-segment information matrix);
S504, obtaining the probability form based on the formula ŷ = softmax(z), where ŷ denotes the probability distribution form of the spectrogram and softmax(z) denotes the classification function of z.
In the present invention, x denotes a feature vector, and the x feature vectors are uniformly one-hot encoded in order; this is similar to numbering the vectors, replacing each vector by its number, with different vectors receiving different numbers. c denotes a position; it does not refer to one specific position but is a generic reference that keeps changing as the position moves, where c-o denotes the vector o positions before c and c+o denotes the vector o positions after c, so that this method obtains the context features before and after position c.
Each speech vector can be used as a node, the nodes are connected by weights, and the mapping relationship is then represented as shown in Fig. 2. A toy sketch of steps S501-S504 follows.
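The sketch below is a CBOW-style illustration of S501-S504 with random toy matrices; the vocabulary size, embedding size and context window are illustrative, and treating the spectrogram feature vectors as one-hot "words" follows the description above:

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB, EMB = 50, 16                      # toy sizes, not from the patent
    V = rng.normal(size=(EMB, VOCAB))        # weight matrix: v = V x
    U = rng.normal(size=(VOCAB, EMB))        # output word matrix

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def embed_and_classify(context_ids):
        """S501: v_i = V x_i (a column lookup equals multiplying by a one-hot x);
        S502: v_hat = average of the 2*o context vectors;
        S503: z = U v_hat;  S504: y_hat = softmax(z)."""
        v_hat = V[:, context_ids].mean(axis=1)
        return softmax(U @ v_hat)

    y_hat = embed_and_classify([3, 7, 11, 42])   # o = 2, so 2*o = 4 context ids
    print(y_hat.shape, y_hat.sum())              # (50,) probabilities summing to 1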
Fig. 3 compares the accuracy of the word-embedding dimensionality-reduction LSTM with that of the standard LSTM, and Fig. 4 compares their loss functions. Table 1 and Table 2 respectively give the structure and data-type parameters of the word-embedding LSTM model and the standard LSTM model:
Table 1: Word-embedding LSTM model structure and data-type parameters
Table 2: Standard LSTM model structure and data-type parameters
To verify the execution efficiency of the algorithm of the invention, models with different numbers of iterations were trained and used for recognition; the accuracy and loss function are shown in Fig. 5 and Fig. 6.
Table 3 gives the test-set accuracy and loss function under 15 iterations.
Table 3: Test-set accuracy and loss function under 15 iterations
From the above experiments it can be seen that, by analyzing the redundancy problem of spectrograms, the invention proposes a spectrogram feature extraction method based on word-embedding dimensionality reduction to improve the effectiveness of spectrograms in network training. At the same time, exploiting the good temporal-feature-capturing ability of the LSTM network, the dimensionality-reduced spectrograms are used to train the LSTM network and voiceprint recognition verification is carried out. Owing to limited computing equipment and time, with the number of iterations fixed at 7, it is verified that the recognition performance of the word-embedding LSTM network is better than that of the LSTM network without word-embedding dimensionality reduction. To further verify the recognition effect of the network, the LSTM network with word-embedding dimensionality-reduced spectrograms was run for 15 iterations, achieving a recognition accuracy of 94.17% and a loss function of 0.1354.
Finally, it is stated that the above embodiments are only used to illustrate the technical solution of the present invention and are not limiting. Although the present invention has been described with reference to its preferred embodiments, those of ordinary skill in the art should appreciate that various changes may be made to it in form and detail without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims (5)

1. An LSTM network voiceprint recognition method based on word embedding, characterized by comprising the following steps:
S1, obtaining a speech segment to be identified;
S2, converting the time scale, frequency and amplitude of the speech segment to be identified by Fast Fourier Transform to generate the spectrogram of the speech segment to be identified;
S3, reducing the dimensionality of the spectrogram of the speech segment to be identified through word-embedding processing, inputting it into the trained LSTM network, and obtaining the speaker identity information of the speech segment to be identified.
2. The LSTM network voiceprint recognition method based on word embedding according to claim 1, characterized in that the training method of the LSTM network includes the following steps:
S200, obtaining a speech-segment training set and a speech-segment test set;
S201, converting the time scale, frequency and amplitude of each speech segment in the speech-segment training set and the speech-segment test set by Fast Fourier Transform to obtain a spectrogram training set and a spectrogram test set;
S203, reducing the dimensionality of the spectrogram training set through word-embedding processing, inputting it together with the voiceprint labels into the LSTM network to be trained, and training that LSTM network;
S204, reducing the dimensionality of the spectrogram test set through word-embedding processing and inputting it into the trained LSTM network; if the output test result meets the preset condition, the training of the LSTM network is complete; otherwise, return to step S203 and retrain until the test result meets the preset condition.
3. The LSTM network voiceprint recognition method based on word embedding according to claim 2, characterized in that if the accuracy rate on the spectrogram test set is greater than or equal to a preset accuracy rate and the loss function value during testing is within a preset threshold, the test result is judged to meet the preset condition, where the accuracy rate and the loss function value are calculated based on the following formulas:
ACC = (P1 + P2 + ... + Pn)/n, with Pi = TPi/(TPi + FNi)
Loss = L(Y, P(Y|X)) = -log P(Y|X)
ACC denotes the recognition accuracy and Loss denotes the loss function value; n denotes the total number of speaker samples; Pi denotes the accuracy for the samples of the i-th speaker; TPi and FNi respectively denote the numbers of correctly classified and misclassified samples in the voiceprint sample class of the i-th speaker; Y denotes the correct class, and P(Y|X) denotes the probability of correct classification.
4. The LSTM network voiceprint recognition method based on word embedding according to claim 1, characterized in that the generation method of the spectrogram includes:
S401, performing speech pre-emphasis on the speech segment based on the formula spreemp[j] = s[j] - α·s[j-1], where spreemp[j] denotes the signal at time j in the pre-emphasized speech segment, s[j] denotes the signal at time j in the speech segment before pre-emphasis, s[j-1] denotes the signal at time j-1 in the speech segment before pre-emphasis, and α denotes a fixed parameter;
S402, performing overlapping segmentation on the pre-emphasized speech segment so that the transition between frames remains smooth and continuous;
S403, windowing each frame signal with the Hamming window w(n1) = 0.54 - 0.46·cos(2π·n1/(N-1)), 0 ≤ n1 ≤ N-1, to obtain stationary short-time signals, where w(n1) denotes the window function, n1 denotes the position within the frame, and N denotes the total frame length;
S404, performing Fast Fourier Transform on the stationary short-time signals to obtain X(m, n1), and obtaining the periodogram Y(m, n1) based on the formula Y(m, n1) = X(m, n1)·X(m, n1)', where m denotes the frame number, n1 denotes the frame length, X(m, n1) denotes the short-time speech signal after the Fast Fourier Transform, and X(m, n1)' denotes the transpose of the speech signal matrix;
S405, taking the logarithm of the periodogram Y(m, n1), and converting the time scale and frequency scale based on m and n1 into P and Q, obtaining the RGB information corresponding to the spectrogram;
S406, obtaining the spectrogram of the speech segment based on the spectrogram frequency and time scales and the corresponding color information.
5. The LSTM network voiceprint recognition method based on word embedding according to claim 1, characterized in that the method of word-embedding dimensionality reduction of the spectrogram includes:
S501, obtaining the word-embedding vectors of the spectrogram from its feature vectors as (v_(c-o+1) = V·x_(c-o+1), ..., v_(c+o) = V·x_(c+o)), where V denotes the weight matrix, x_(c-o+1) denotes the (c-o+1)-th feature vector, x_(c+o) denotes the (c+o)-th feature vector, v_(c-o+1) denotes the (c-o+1)-th word-embedding vector, v_(c+o) denotes the (c+o)-th word-embedding vector, x is a one-hot vector, i.e. the vectors are uniformly processed by one-hot encoding, c denotes the position of the central vector, and o denotes the context distance;
S502, averaging the word-embedding vectors based on the formula v̂ = (v_(c-o+1) + ... + v_(c+o))/(2o), where v̂ denotes the average at the central position c when the context distance of the word-embedding vectors is o;
S503, generating the score vector z = U·v̂, where U denotes the output word matrix;
S504, obtaining the probability form based on the formula ŷ = softmax(z), where ŷ denotes the probability distribution form of the spectrogram and softmax(z) denotes the classification function of z.
CN201910642258.4A 2019-07-16 2019-07-16 An LSTM network voiceprint recognition method based on word embedding Pending CN110349588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910642258.4A CN110349588A (en) 2019-07-16 2019-07-16 An LSTM network voiceprint recognition method based on word embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910642258.4A CN110349588A (en) 2019-07-16 2019-07-16 An LSTM network voiceprint recognition method based on word embedding

Publications (1)

Publication Number Publication Date
CN110349588A true CN110349588A (en) 2019-10-18

Family

ID=68175736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910642258.4A Pending CN110349588A (en) 2019-07-16 2019-07-16 An LSTM network voiceprint recognition method based on word embedding

Country Status (1)

Country Link
CN (1) CN110349588A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010072602A (en) * 2008-09-22 2010-04-02 Dainippon Printing Co Ltd Device for embedding information for audio signal, and device for extracting information from audio signal
CN109524014A * 2018-11-29 2019-03-26 辽宁工业大学 Voiceprint recognition analysis method based on deep convolutional neural networks

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
李靓等: "Voiceprint recognition method for small samples based on deep learning", Computer Engineering (《计算机工程》) *
毛焱颖: "Long-text sentiment classification method based on attention-based two-layer LSTM", Journal of Chongqing College of Electronic Engineering (《重庆电子工程职业学院学报》) *
胡青等: "Speaker recognition algorithm based on convolutional neural network classification", Netinfo Security (《信息网络安全》) *
郑泽: "Research on the Word2Vec word embedding model", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *
闫河等: "Research on voiceprint recognition based on CNN-LSTM network", Computer Applications and Software (《计算机应用与软件》) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111048099A (en) * 2019-12-16 2020-04-21 随手(北京)信息技术有限公司 Sound source identification method, device, server and storage medium
CN111261192A (en) * 2020-01-15 2020-06-09 厦门快商通科技股份有限公司 Audio detection method based on LSTM network, electronic equipment and storage medium
CN111933148A (en) * 2020-06-29 2020-11-13 厦门快商通科技股份有限公司 Age identification method and device based on convolutional neural network and terminal
CN111833843A (en) * 2020-07-21 2020-10-27 苏州思必驰信息科技有限公司 Speech synthesis method and system
US11842722B2 (en) 2020-07-21 2023-12-12 Ai Speech Co., Ltd. Speech synthesis method and system
CN112613481A (en) * 2021-01-04 2021-04-06 上海明略人工智能(集团)有限公司 Bearing abrasion early warning method and system based on frequency spectrum
CN113241054A (en) * 2021-05-10 2021-08-10 北京声智科技有限公司 Speech smoothing model generation method, speech smoothing method and device
WO2023070874A1 (en) * 2021-10-28 2023-05-04 中国科学院深圳先进技术研究院 Voiceprint recognition method

Similar Documents

Publication Publication Date Title
CN110349588A (en) An LSTM network voiceprint recognition method based on word embedding
CN112509564B (en) End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
US20200402497A1 (en) Systems and Methods for Speech Generation
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN112818892B (en) Multi-modal depression detection method and system based on time convolution neural network
CN107146624B A speaker recognition method and device
CN110136731A End-to-end bone-conducted speech blind enhancement method based on dilated causal convolution generative adversarial network
CN109524014A Voiceprint recognition analysis method based on deep convolutional neural networks
JP2654917B2 (en) Speaker independent isolated word speech recognition system using neural network
CN110379412A Speech processing method and apparatus, electronic device, and computer-readable storage medium
CN108922513A Speech discrimination method and apparatus, computer device and storage medium
CN110111797A Speaker recognition method based on Gaussian supervector and deep neural network
CN109559736A An automatic dubbing method for film actors based on adversarial networks
CN113571067B (en) Voiceprint recognition countermeasure sample generation method based on boundary attack
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN109346084A Speaker recognition method based on deep stacked autoencoder network
CN109036470B Speech discrimination method and device, computer equipment and storage medium
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
KR20220047080A (en) A speaker embedding extraction method and system for automatic speech recognition based pooling method for speaker recognition, and recording medium therefor
CN116434758A (en) Voiceprint recognition model training method and device, electronic equipment and storage medium
CN110415685A A speech recognition method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20191018