CN110444223A - Speaker separation method and device based on recurrent neural network and acoustic features - Google Patents
Speaker separation method and device based on recurrent neural network and acoustic features
- Publication number: CN110444223A
- Application number: CN201910561692.XA
- Authority
- CN
- China
- Prior art keywords
- speaker
- word
- result
- identified
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Quality & Reliability (AREA)
- Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a speaker separation method and device based on a recurrent neural network and acoustic features. The method includes: obtaining, by speech recognition, the word-vector set of the voice data to be identified, and extracting its MFCC feature-vector set; fully connecting the two sets to obtain fused feature vectors; encoding the fused feature vectors to obtain an encoding result; decoding the encoding result to obtain a segmentation result corresponding to the fused feature vectors; predicting speaker changes on the segmentation result to obtain the speaker identification result corresponding to the speaker-change symbols; clustering the speaker identification result to obtain a speaker classification result; and sending the speaker classification result to the uploading terminal corresponding to the voice data to be identified. The method performs speaker separation with a sequence-to-sequence model that fuses lexical and acoustic feature information, and can capture the encoded information before and after each speaker-change point.
Description
Technical field
The present invention relates to the technical field of speech classification, and in particular to a speaker separation method and device based on a recurrent neural network and acoustic features.
Background art
For a complete speech recognition (ASR, Automatic Speech Recognition) system containing multiple speakers, speaker separation is a very important pre-processing step, and the speaker separation information is also essential for speech analysis tasks such as tracking role changes.
A typical speaker separation system consists of two parts: segmentation and clustering. The purpose of segmentation is to find all change points between speakers; the most common approach is segmentation based on the Bayesian information criterion. Recently, speaker separation methods using recurrent neural networks, joint factor analysis, and pre-trained deep neural networks combined with supervised and unsupervised learning have all achieved good results. However, few algorithms exploit lexical information, and most research that does involve lexical information targets the identity or role of the speaker; that is, the text produced by speech recognition is not applied to separation itself. One possible reason is that running ASR before separation may introduce additional noise.
Summary of the invention
Embodiments of the invention provide a speaker separation method, device, computer equipment and storage medium based on a recurrent neural network and acoustic features, intended to solve the problem in the prior art that speaker separation systems, which use recurrent neural networks, joint factor analysis, and pre-trained deep neural networks with supervised and unsupervised learning, do not apply the text obtained by speech recognition to speaker separation, because running speech recognition before separation can introduce additional noise.
In a first aspect, an embodiment of the invention provides a speaker separation method based on a recurrent neural network and acoustic features, comprising:
receiving the voice data to be identified transmitted by an uploading terminal;
obtaining, by speech recognition, the word-vector set of the voice data to be identified, obtaining by speech recognition the MFCC feature-vector set of the voice data to be identified, and fully connecting the word-vector set and the MFCC feature-vector set to obtain fused feature vectors;
inputting the fused feature vectors into an encoder for encoding to obtain an encoding result;
decoding the encoding result as the input of a decoder to obtain a segmentation result corresponding to the fused feature vectors, wherein the segmentation result includes a word sequence and speaker-change symbols;
assigning to each word in the segmentation result the speaker-change symbol nearest to it, so as to predict speaker changes on the segmentation result and obtain the speaker identification result corresponding to the speaker-change symbols;
clustering the speaker identification result to obtain a speaker classification result; and
sending the speaker classification result to the uploading terminal corresponding to the voice data to be identified.
In a second aspect, an embodiment of the invention provides a speaker separation device based on a recurrent neural network and acoustic features, comprising:
a voice receiving unit for receiving the voice data to be identified transmitted by an uploading terminal;
a feature fusion unit for obtaining, by speech recognition, the word-vector set of the voice data to be identified, obtaining by speech recognition the MFCC feature-vector set of the voice data to be identified, and fully connecting the word-vector set and the MFCC feature-vector set to obtain fused feature vectors;
an encoding unit for inputting the fused feature vectors into an encoder for encoding to obtain an encoding result;
a decoding unit for decoding the encoding result as the input of a decoder to obtain a segmentation result corresponding to the fused feature vectors, wherein the segmentation result includes a word sequence and speaker-change symbols;
a speaker prediction unit for assigning to each word in the segmentation result the speaker-change symbol nearest to it, so as to predict speaker changes on the segmentation result and obtain the speaker identification result corresponding to the speaker-change symbols;
a speaker classification unit for clustering the speaker identification result to obtain a speaker classification result; and
a result transmission unit for sending the speaker classification result to the uploading terminal corresponding to the voice data to be identified.
In a third aspect, an embodiment of the invention further provides computer equipment comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the speaker separation method based on a recurrent neural network and acoustic features described in the first aspect.
In a fourth aspect, an embodiment of the invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the speaker separation method based on a recurrent neural network and acoustic features described in the first aspect.
Embodiments of the invention provide a speaker separation method, device, computer equipment and storage medium based on a recurrent neural network and acoustic features. The method includes: receiving the voice data to be identified transmitted by an uploading terminal; obtaining, by speech recognition, the word-vector set of the voice data to be identified, obtaining by speech recognition its MFCC feature-vector set, and fully connecting the word-vector set and the MFCC feature-vector set to obtain fused feature vectors; inputting the fused feature vectors into an encoder for encoding to obtain an encoding result; decoding the encoding result as the input of a decoder to obtain a segmentation result corresponding to the fused feature vectors, wherein the segmentation result includes a word sequence and speaker-change symbols; assigning to each word in the segmentation result the speaker-change symbol nearest to it, so as to predict speaker changes and obtain the speaker identification result corresponding to the speaker-change symbols; clustering the speaker identification result to obtain a speaker classification result; and sending the speaker classification result to the uploading terminal corresponding to the voice data to be identified. The method performs speaker separation with a sequence-to-sequence model that fuses lexical and acoustic feature information, and can capture the encoded information before and after each speaker-change point.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flow diagram of the speaker separation method based on a recurrent neural network and acoustic features provided by an embodiment of the invention;
Fig. 2 is a schematic sub-flow diagram of the speaker separation method based on a recurrent neural network and acoustic features provided by an embodiment of the invention;
Fig. 3 is another schematic sub-flow diagram of the speaker separation method based on a recurrent neural network and acoustic features provided by an embodiment of the invention;
Fig. 4 is a schematic diagram of the decoder output and the overlapping speaker-change vectors in the speaker separation method based on a recurrent neural network and acoustic features provided by an embodiment of the invention;
Fig. 5 is a schematic block diagram of the speaker separation device based on a recurrent neural network and acoustic features provided by an embodiment of the invention;
Fig. 6 is a schematic block diagram of subunits of the speaker separation device based on a recurrent neural network and acoustic features provided by an embodiment of the invention;
Fig. 7 is another schematic block diagram of subunits of the speaker separation device based on a recurrent neural network and acoustic features provided by an embodiment of the invention;
Fig. 8 is a schematic block diagram of the computer equipment provided by an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort shall fall within the protection scope of the invention.
It should be understood that when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this description and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this description and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Referring to Fig. 1, which is a schematic flow diagram of the speaker separation method based on a recurrent neural network and acoustic features provided by an embodiment of the invention. The speaker separation method based on a recurrent neural network and acoustic features is applied in a server and is executed by application software installed in the server.
As shown in Fig. 1, the method comprises steps S110 to S170.
S110: receiving the voice data to be identified transmitted by the uploading terminal.
In this embodiment, when voice data to be identified has been collected with the recording or video-recording function of the uploading terminal, the uploading terminal needs to send the voice data to the server in order to perform speaker separation on it; the server then processes the voice data to obtain the speaker classification result.
S120: obtaining, by speech recognition, the word-vector set of the voice data to be identified, obtaining by speech recognition the MFCC feature-vector set of the voice data to be identified, and fully connecting the word-vector set and the MFCC feature-vector set to obtain fused feature vectors.
In this embodiment, in order to perform speaker separation on the voice data to be identified, the corresponding word-vector set and MFCC feature-vector set must first be extracted.
The word-vector set is obtained from the text data corresponding to the voice data to be identified. In one embodiment, as shown in Fig. 2, step S120 includes:
S121: obtaining, by speech recognition, the one-hot encoded word vector corresponding to each segmented word in the voice data to be identified;
S122: converting each one-hot encoded word vector corresponding to the voice to be identified through a Word2Vec model, which maps words to vectors, to obtain the word-vector set corresponding to the voice data to be identified.
That is, the one-hot encoded word vectors corresponding to the segmented words in the voice data to be identified are first transformed by a linear layer into the word embedding layer of the encoder; the Word2Vec model in the word embedding layer then converts each one-hot encoded word vector to obtain the word-vector set corresponding to the voice data to be identified.
The MFCC feature-vector set is acquired directly from the voice data to be identified. In one embodiment, step S120 includes:
performing feature extraction from the voice to be identified according to a preset feature window to obtain the MFCC feature vectors corresponding to the voice to be identified, wherein the window length of the feature window is a preset first time value and the window shift of the feature window is a preset second time value.
In this embodiment, when extracting the MFCC feature-vector set from the voice to be identified (MFCC stands for Mel-scale Frequency Cepstral Coefficients), the window length of the feature window can be preset to 25 ms and the window shift to 10 ms. A 13-dimensional MFCC feature is extracted from each 25 ms window with a 10 ms shift, and the frame features are then averaged over each word segment, so that each word yields one 13x1 vector, forming the MFCC feature-vector set. By extracting the MFCC feature vectors corresponding to the voice to be identified, a set of feature vectors that effectively captures the physical information of the voice (spectral envelope and details) is obtained for the encoding operation.
In one embodiment, the word-vector set and the MFCC feature-vector set are input into GRU models with the same number of hidden layers and fully connected to obtain the fused feature vectors.
For the MFCC feature vectors, a hidden layer with 256 hidden units, word vectors of size 256, and an output layer of size 256 are used. The number of hidden layers for the MFCC features is kept consistent with the number of hidden layers of the word embedding layer; only then can the model deliver good performance.
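A minimal sketch of the full connection that fuses the two per-word vectors. The 256-dim word vector matches the text above; the concatenate-then-linear form, the tanh nonlinearity and the 13-dim per-word MFCC input are assumptions for illustration, not the patent's exact layer.

```python
import numpy as np

rng = np.random.default_rng(2)
word_vec = rng.normal(size=256)   # per-word lexical vector
mfcc_vec = rng.normal(size=13)    # per-word acoustic vector

# Hypothetical fully connected fusion layer: concatenate, then linear + tanh.
W = rng.normal(size=(256, 256 + 13)) * 0.01
b = np.zeros(256)
fused = np.tanh(W @ np.concatenate([word_vec, mfcc_vec]) + b)
assert fused.shape == (256,)      # fused feature vector fed to the GRU encoder
```

The key point is that after fusion each word contributes a single vector carrying both lexical and acoustic information.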
S130: inputting the fused feature vectors into the encoder for encoding to obtain the encoding result.
In this embodiment, the MFCC feature vectors and the word vectors are fused inside the encoder: linear layers in the encoder fully connect the MFCC feature vectors and the word vectors to obtain the fused feature vectors. The fused feature vectors are then input into the GRU model (gated recurrent unit) used in the encoder to obtain the encoding result.
In one embodiment, step S130 includes:
inputting the fused feature vectors into the encoder and applying a nonlinear transformation to obtain an intermediate semantic representation;
obtaining, through an attention mechanism, the attention allocation probability distribution of each segmented word in the intermediate semantic representation, so as to obtain the encoding result corresponding to the intermediate semantic representation.
In this embodiment, the attention mechanism is commonly used in the encoder-decoder framework. As its name suggests, the encoder encodes the input sentence Source (e.g. a sentence in the text corresponding to the voice data to be identified), converting it through a nonlinear transformation into an intermediate semantic representation C, where C = F(x1, x2, ..., xm). The task of the decoder is to generate the word y_i at time i from the intermediate semantic representation C of the input sentence Source and the previously generated history y1, y2, ..., y_{i-1}, i.e. y_i = G(C_i, y1, y2, ..., y_{i-1}), where C_i is determined by the attention allocation probability distribution over the segmented words in the intermediate semantic representation. When an attention model is introduced between the encoder and the decoder, the process becomes y1 = f(C1), y2 = f(C2, y1), y3 = f(C3, y1, y2); that is, each C_i may correspond to a different attention allocation probability distribution over the source words, and this attention mechanism helps the model capture the most important parts of a speaker's features.
The attention mechanism brings a large improvement in sequence-learning tasks. In the encoder-decoder framework, adding an attention model at the encoding stage to apply weighted transformations to the source data sequence, or introducing one at the decoding stage to weight the target data, can effectively improve the performance of sequence-to-sequence systems in a natural way.
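The attention allocation above can be sketched numerically. Dot-product scoring is one common choice that the patent does not specify; the state dimensions are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(3)
H = rng.normal(size=(6, 8))   # 6 encoder states (one per word), 8-dim each
s = rng.normal(size=8)        # current decoder state

scores = H @ s                # dot-product score of each source word
alpha = softmax(scores)       # attention allocation probability distribution
C_i = alpha @ H               # context vector C_i for decoder step i
assert np.isclose(alpha.sum(), 1.0) and C_i.shape == (8,)
```

Each decoder step recomputes `alpha`, so different steps attend to different source words, which is exactly what lets C_i vary per output symbol.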
S140: decoding the encoding result as the input of the decoder to obtain the segmentation result corresponding to the fused feature vectors, wherein the segmentation result includes a word sequence and speaker-change symbols.
In this embodiment, the decoder outputs a segmentation result consisting of a word sequence and speaker-change symbols. For example, for voice data to be identified whose corresponding text is "hello hi my name is James hi James", the decoder outputs the segmentation result "hello#A hi#B my name is James#A hi James".
When computing the decoder's loss function, the speaker IDs are ignored and only the grouping of the speakers matters. For example, the change-symbol sequence ABA is considered equivalent to BAB: during computation, the loss is calculated for both the original and the flipped version of the change-symbol sequence, and the smaller of the two is taken as the loss value. This loss function also avoids learning spurious co-occurrence probabilities between change symbols and words in the training target sequences.
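The flip-invariant loss idea can be sketched with a toy 0/1 sequence loss (the patent does not specify its actual loss function): score the prediction against both the target labelling and its A/B-swapped version, and keep the smaller value.

```python
def seq_loss(pred, target):
    # Toy 0/1 loss: number of mismatched positions.
    return sum(p != t for p, t in zip(pred, target))

def flip(labels):
    # Swap the two speaker IDs: ABA becomes BAB.
    return ['B' if x == 'A' else 'A' for x in labels]

pred = ['B', 'A', 'B']
target = ['A', 'B', 'A']          # ABA is treated as equivalent to BAB
loss = min(seq_loss(pred, target), seq_loss(pred, flip(target)))
assert loss == 0                  # the prediction matches the flipped target
```

Taking the minimum over the two labelings makes the loss invariant to which speaker happens to be called A, which is all that matters for separation.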
S150: assigning to each word in the segmentation result the speaker-change symbol nearest to it, so as to predict speaker changes on the segmentation result and obtain the speaker identification result corresponding to the speaker-change symbols.
In this embodiment, in order to maximize the accuracy of speaker-change prediction, a shift-and-overlap design is used. For example, a window 32 words long is swept over the entire segment from beginning to end, and for each window the trained sequence-to-sequence model predicts a change-symbol sequence.
In one embodiment, as shown in Fig. 3, step S150 includes:
S151: obtaining the segmentation result in the decoder;
S152: obtaining the word pointed to by the marker at the head of the segmentation result as the current starting word;
S153: assigning to each word in the segmentation result the speaker-change symbol nearest to it, so as to build a change vector;
S154: storing the change vector into a speaker-change sequence matrix;
S155: moving the marker one word to the right to update the current starting word;
S156: judging whether the current starting word is the last word in the segmentation result; if the current starting word is not the last word in the segmentation result, returning to step S153; if the current starting word is the last word in the segmentation result, executing step S157;
S157: ending the speaker-change prediction process.
That is, during prediction, the word vectors of 32 words and the 13-dimensional MFCC feature vectors of those words are extracted from the text and the audio file respectively. The prediction of the specific speaker-change sequence proceeds through Fig. 4 and the following steps:
1) obtaining the segmentation result from the decoder;
2) building a change vector by assigning to each word in the segmentation result the change symbol nearest to it;
3) accumulating the change vector into a speaker-change sequence matrix;
4) moving one word to the right and inputting the next group of 32 word vectors and their 13-dimensional MFCC feature vectors into the encoder.
After the window has moved to the end, the speaker-change symbol assigned to each word is determined by majority voting; this is equivalent to deciding among 32 different predictions.
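The shift-and-overlap voting above can be sketched as follows. Each window is "predicted" independently, its labels are accumulated per word position, and the final label is the majority vote; the window size is 4 here (32 in the patent) and the per-window predictions are made up for the toy.

```python
from collections import Counter

def majority_vote(votes_per_word):
    # Pick the most common label among all window predictions for each word.
    return [Counter(v).most_common(1)[0][0] for v in votes_per_word]

words, window = 6, 4
# Hypothetical per-window predictions: one label per word in each window.
window_preds = [['A', 'A', 'B', 'B'],   # window starting at word 0
                ['A', 'B', 'B', 'B'],   # window starting at word 1
                ['B', 'B', 'B', 'A']]   # window starting at word 2

votes = [[] for _ in range(words)]
for start, pred in enumerate(window_preds):
    for offset, label in enumerate(pred):
        votes[start + offset].append(label)

labels = majority_vote(votes)
assert labels == ['A', 'A', 'B', 'B', 'B', 'A']
```

Because every interior word is covered by many overlapping windows, a single bad window prediction is usually outvoted.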
S160: clustering the speaker identification result to obtain the speaker classification result.
In this embodiment, a clustering method based on the Bayesian information criterion (BIC) is used, and the features used are frame-level MFCC features. In multi-speaker speech separation, clustering the speaker identification result is a process of clustering voice streams: one voice stream is clustered into the voice streams of multiple speakers.
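BIC-based clustering typically compares two segments with a delta-BIC score; a positive score suggests two different speakers. The formulation below is one common Gaussian variant with penalty weight lambda, not necessarily the patent's exact variant, and the data are synthetic.

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """One common Gaussian delta-BIC; > 0 suggests two different speakers."""
    x = np.vstack([x1, x2])
    n, d = x.shape
    n1, n2 = len(x1), len(x2)
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    penalty = 0.5 * (d + d * (d + 1) / 2) * np.log(n)
    return 0.5 * (n * logdet(x) - n1 * logdet(x1) - n2 * logdet(x2)) - lam * penalty

rng = np.random.default_rng(4)
# Two segments of 3-dim "features": same distribution vs shifted distribution.
same = delta_bic(rng.normal(0, 1, (200, 3)), rng.normal(0, 1, (200, 3)))
diff = delta_bic(rng.normal(0, 1, (200, 3)), rng.normal(5, 1, (200, 3)))
assert diff > same    # well-separated segments score much higher
```

In a full system this score drives agglomerative merging: segments are merged while the best pairwise delta-BIC stays below a threshold.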
S170: sending the speaker classification result to the uploading terminal corresponding to the voice data to be identified.
In this embodiment, after the identification of the speaker classification result has been completed, the speaker classification result is sent to the uploading terminal corresponding to the voice data to be identified, completing online speaker separation of the voice data on the server.
This method performs speaker separation with a sequence-to-sequence model that fuses lexical and acoustic feature information, and can capture the encoded information before and after each speaker-change point.
An embodiment of the present invention also provides a speaker separation apparatus based on a recurrent neural network and acoustic features, which is configured to execute any embodiment of the aforementioned speaker separation method based on a recurrent neural network and acoustic features. Specifically, referring to Fig. 5, Fig. 5 is a schematic block diagram of the speaker separation apparatus based on a recurrent neural network and acoustic features provided by an embodiment of the present invention. The speaker separation apparatus 100 based on a recurrent neural network and acoustic features may be configured in a server.
As shown in Fig. 5, the speaker separation apparatus 100 based on a recurrent neural network and acoustic features includes a voice receiving unit 110, a feature fusion unit 120, an encoding unit 130, a decoding unit 140, a speaker prediction unit 150, a speaker clustering unit 160, and a result sending unit 170.
The voice receiving unit 110 is configured to receive the voice data to be recognized sent by the upload terminal.
In this embodiment, when voice data to be recognized has been collected through the recording or video-recording function of the upload terminal, the upload terminal sends it to the server in order to perform speaker separation; the server then processes the data to obtain the speaker clustering result.
The feature fusion unit 120 is configured to obtain the word vector set of the voice data to be recognized through speech recognition, obtain the MFCC feature vector set of the voice data to be recognized, and fully connect the word vector set and the MFCC feature vector set to obtain a fused feature vector.
In this embodiment, in order to perform speaker separation on the voice data to be recognized, the corresponding word vector set and MFCC feature vector set must first be extracted.
The word vector set is obtained from the text data corresponding to the voice data to be recognized. In one embodiment, as shown in Fig. 6, the feature fusion unit 120 includes:
a one-hot word vector acquiring unit 121, configured to obtain, through speech recognition, the one-hot encoded word vector corresponding to each word segment in the voice data to be recognized;
a word vector set acquiring unit 122, configured to convert each one-hot encoded word vector through a Word2Vec model, which maps words to vectors, to obtain the word vector set corresponding to the voice data to be recognized.
That is, a linear layer first maps the one-hot encoded word vector of each word segment into the word embedding layer of the encoder, and the Word2Vec model in the word embedding layer then converts each one-hot encoded word vector, yielding the word vector set corresponding to the voice data to be recognized.
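The one-hot-to-word-vector conversion amounts to a single linear map whose weight matrix rows are the word vectors, which is why it can be implemented as the linear layer feeding the encoder's word embedding layer. A minimal sketch, where random weights stand in for a trained Word2Vec model and the vocabulary size is illustrative:

```python
import numpy as np

# Hypothetical sizes: a vocabulary of 10 words, 256-dimensional word vectors.
vocab_size, embed_dim = 10, 256
rng = np.random.default_rng(42)

# A trained Word2Vec model would supply these weights; random values stand in.
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def embed(word_index):
    # Multiplying a one-hot vector by the weight matrix simply selects a row,
    # so an embedding lookup and a linear layer are equivalent here.
    return one_hot(word_index, vocab_size) @ embedding_matrix

word_vectors = np.stack([embed(i) for i in [3, 1, 4]])  # a 3-word utterance
```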
The MFCC feature vector set is obtained directly from the voice data to be recognized. In one embodiment, the feature fusion unit 120 is further configured to:
perform feature extraction on the voice to be recognized according to a preset feature window to obtain the MFCC feature vectors corresponding to the voice to be recognized, where the window length of the feature window is a preset first time value and the window shift of the feature window is a preset second time value.
In this embodiment, when extracting the MFCC feature vector set (MFCC stands for Mel-scale Frequency Cepstral Coefficients) from the voice to be recognized, the window length of the feature window may be preset to 25 ms and the window shift to 10 ms. A 13-dimensional MFCC feature is extracted from each 25 ms window with a 10 ms shift, and the frame-level features are then averaged over each word span, so that each word yields one 13x1 vector; together these form the MFCC feature vector set. Extracting the MFCC feature vectors corresponding to the voice to be recognized effectively yields a group of feature vectors encoding the physical information of the speech (spectral envelope and details) for the encoding operation.
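The word-level pooling step can be sketched as follows. The frame-level MFCC matrix would come from any standard MFCC front end using the 25 ms / 10 ms windowing above, and the per-word frame spans are assumed to come from the recognizer's word timestamps (both are hypothetical inputs here):

```python
import numpy as np

SR = 16000                 # assumed sample rate
WIN = int(0.025 * SR)      # 25 ms analysis window -> 400 samples
HOP = int(0.010 * SR)      # 10 ms window shift    -> 160 samples

def pool_mfcc_per_word(frame_mfcc, word_spans):
    """Average frame-level MFCCs over each word's time span.

    frame_mfcc: (n_frames, 13) array from any MFCC front end.
    word_spans: list of (start_frame, end_frame) pairs, one per word.
    Returns one 13-dimensional vector per word, as described above.
    """
    return np.stack([frame_mfcc[s:e].mean(axis=0) for s, e in word_spans])
```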
In one embodiment, the word vector set and the MFCC feature vector set are input into a GRU model with the same number of hidden layers and fully connected to obtain the fused feature vector.
For the MFCC feature vectors, a hidden layer of 256 hidden units, word vectors of size 256, and an output layer of size 256 are used. Keeping the number of hidden layers on the MFCC side consistent with the number of layers in the word embedding layer yields better performance.
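A minimal numerical sketch of the full connection that fuses a word vector with its word-level MFCC vector. The dimensions follow the description (256-size word vectors, 13-dimensional word-level MFCCs); the random weights stand in for trained encoder parameters, and the layer names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D_WORD, D_MFCC, D_HID = 256, 13, 256   # sizes taken from the description

# Randomly initialised weights stand in for trained encoder parameters.
W_mfcc = rng.normal(scale=0.1, size=(D_MFCC, D_HID))   # lifts MFCC into the hidden size
W_fuse = rng.normal(scale=0.1, size=(D_WORD + D_HID, D_HID))
b_fuse = np.zeros(D_HID)

def fuse(word_vec, mfcc_vec):
    """Fully connect a word vector with its word-level MFCC vector."""
    mfcc_h = np.tanh(mfcc_vec @ W_mfcc)
    return np.tanh(np.concatenate([word_vec, mfcc_h]) @ W_fuse + b_fuse)

fused = fuse(rng.normal(size=D_WORD), rng.normal(size=D_MFCC))
```

The fused vector would then be fed, word by word, into the GRU encoder.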
The encoding unit 130 is configured to input the fused feature vector into the encoder for encoding to obtain an encoding result.
In this embodiment, before the fused feature vector is encoded, the MFCC feature vectors and the word vectors are merged in the encoder: they are fully connected through linear layers in the encoder to obtain the fused feature vector. The fused feature vector is then input into the GRU model (i.e., gated recurrent unit) used in the encoder to obtain the encoding result.
In one embodiment, the encoding unit 130 includes:
an intermediate semantics acquiring unit, configured to input the fused feature vector into the encoder and apply a nonlinear transformation to obtain an intermediate semantic representation;
an attention mechanism processing unit, configured to obtain, through the attention mechanism, the attention allocation probability distribution over each word segment in the intermediate semantic representation, so as to obtain the encoding result corresponding to the intermediate semantic representation.
In this embodiment, the attention mechanism is commonly used in the encoder-decoder framework. As its name suggests, the encoder encodes the input sentence Source (e.g., a sentence in the text corresponding to the voice data to be recognized), converting it through a nonlinear transformation into an intermediate semantic representation C, where C = F(x1, x2, ..., xm). The task of the decoder is to generate the word yi at time i according to the intermediate semantic representation of the input sentence Source and the previously generated history y1, y2, ..., yi-1: yi = f(Ci, y1, y2, ..., yi-1), where Ci is the attention allocation probability distribution over the word segments in the intermediate semantic representation. That is, an attention model is introduced between the encoder and the decoder, so that the above process becomes y1 = f(C1), y2 = f(C2, y1), y3 = f(C3, y1, y2); each Ci may correspond to a different attention allocation probability distribution over the source-sentence words. This attention mechanism helps the model capture the most important parts of the speaker's features.
The attention mechanism provides a large boost in sequence-learning tasks. In the encoder-decoder framework, an attention model can be added on the encoding side to apply a weighted transformation to the source data sequence, or introduced on the decoding side to apply a weighted transformation to the target data; either way, it effectively improves system performance in the natural sequence-to-sequence setting.
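The attention allocation probability distribution can be sketched as a softmax over alignment scores between a decoder state and the encoder states; the context Ci is then the weighted sum of encoder states. Dot-product scoring is used here purely for brevity (the patent does not fix a particular scoring function):

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: one context vector C_i per decoding step.

    encoder_states: (m, d) hidden states for the m source words.
    Returns (context, weights); weights is the attention allocation
    probability distribution over the source words described above.
    """
    scores = encoder_states @ decoder_state          # (m,) alignment scores
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax
    return weights @ encoder_states, weights
```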
The decoding unit 140 is configured to decode the encoding result as the input of the decoder to obtain the segmentation result corresponding to the fused feature vector, where the segmentation result includes a word sequence and speaker-change symbols.
In this embodiment, the decoder outputs a segmentation result consisting of a word sequence and speaker-change symbols. For example, for the text corresponding to the voice data to be recognized: hello hi my name is James hi James, the segmentation result output by the decoder is: hello#A hi#B my name is James#A hi James.
When calculating the decoder loss function, the speaker IDs are ignored and only the speaker grouping is considered. For example, the change-symbol sequence ABA is treated as equivalent to BAB: during calculation, both the original and the flipped version of the change-symbol sequence are evaluated, and the smaller of the two losses is selected as the loss value. This loss function also avoids learning spurious co-occurrence probabilities between change symbols and words in the training target sequences.
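The grouping-only loss can be sketched by scoring the target change-symbol sequence and its A/B-flipped version and keeping the smaller cross-entropy, so that ABA and BAB are interchangeable. The two-symbol alphabet and the probability layout below are illustrative, not the patent's exact formulation:

```python
import numpy as np

def flip(symbols):
    # Swap the speaker groups: ABA is treated as equivalent to BAB.
    return ["B" if s == "A" else "A" if s == "B" else s for s in symbols]

def grouping_loss(pred_probs, target, vocab=("A", "B")):
    """Speaker-ID-agnostic loss: min of the cross-entropy against the
    target change-symbol sequence and against its flipped version.

    pred_probs: (n, 2) predicted probabilities for symbols A and B.
    """
    def xent(tgt):
        idx = [vocab.index(s) for s in tgt]
        return -np.log(pred_probs[np.arange(len(tgt)), idx]).sum()
    return min(xent(target), xent(flip(target)))
```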
The speaker prediction unit 150 is configured to assign, in the segmentation result, the speaker-change symbol nearest to each word to the corresponding word, so as to predict speaker changes over the segmentation result and obtain the speaker recognition result corresponding to the speaker-change symbols.
In this embodiment, in order to maximize the accuracy of speaker-change prediction, a shift-and-overlap design is used. For example, a window 32 words long is swept over the entire passage from beginning to end, and for each window the trained sequence-to-sequence model predicts a change-symbol sequence.
In one embodiment, as shown in Fig. 7, the speaker prediction unit 150 includes:
a segmentation result acquiring unit 151, configured to obtain the segmentation result from the decoder;
a current starting word acquiring unit 152, configured to point a marker at the first word in the segmentation result as the current starting word;
a change vector acquiring unit 153, configured to assign the speaker-change symbol nearest to each word in the segmentation result to that word, so as to establish a change vector;
a change vector storage unit 154, configured to store the change vector into a speaker-change sequence matrix;
a starting word updating unit 155, configured to move the marker one word to the right to update the current starting word;
a last word judging unit 156, configured to judge whether the current starting word is the last word in the segmentation result; if it is not, return to the step of assigning the speaker-change symbol nearest to each word in the segmentation result to each word, so as to establish a change vector; if it is, execute the step of ending the speaker-change prediction process;
a process ending unit 157, configured to end the speaker-change prediction process.
That is, during prediction, the word vectors and MFCC feature vectors of 32 words are extracted from the text and audio files respectively. The prediction of the speaker-change sequence proceeds through Fig. 4 and the following steps:
1) obtain the segmentation result from the decoder;
2) establish a change vector by assigning the change symbol nearest to each word in the segmentation result to that word;
3) accumulate the change vector into the speaker-change sequence matrix;
4) move the window one word to the right and feed the next group of 32 word vectors and the corresponding MFCC feature vectors into the encoder.
After the window has moved to the end, the symbol assigned to each word is determined by majority voting. In this way, each word receives the equivalent of 32 independent prediction decisions.
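The shift-and-overlap voting above can be sketched as follows: each window's predicted symbols are tallied per word position, and each word keeps the symbol that received the most votes. A window length of 3 is used in the test instead of 32, purely for brevity:

```python
from collections import Counter

def majority_vote(window_predictions, n_words, win=32):
    """Combine overlapping window predictions by majority voting.

    window_predictions: list where entry k holds the predicted symbols for
    words k .. k+win-1 (the window moves right one word at a time), so each
    word collects up to `win` votes.
    """
    assert all(len(p) == win for p in window_predictions)
    votes = [[] for _ in range(n_words)]
    for start, preds in enumerate(window_predictions):
        for offset, symbol in enumerate(preds):
            votes[start + offset].append(symbol)
    # Keep, for each word, the symbol with the most votes.
    return [Counter(v).most_common(1)[0][0] for v in votes]
```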
The speaker clustering unit 160 is configured to cluster the speaker recognition result to obtain the speaker clustering result.
In this embodiment, clustering is based on the Bayesian information criterion (BIC), and the features used are frame-level MFCC features. In multi-speaker speech separation, clustering the speaker recognition result is a process of clustering one voice stream into the voice streams of multiple speakers.
The result sending unit 170 is configured to send the speaker clustering result to the upload terminal corresponding to the voice data to be recognized.
In this embodiment, once the speaker clustering result has been obtained, it is sent to the upload terminal corresponding to the voice data to be recognized, completing online speaker separation of the voice data on the server.
This apparatus uses a sequence-to-sequence model that fuses lexical and acoustic feature information to perform speaker separation, and can capture the encoded information before and after each speaker-change point.
The above speaker separation apparatus based on a recurrent neural network and acoustic features can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in Fig. 8.
Referring to Fig. 8, Fig. 8 is a schematic block diagram of a computer device provided by an embodiment of the present invention. The computer device 500 is a server, which may be an independent server or a server cluster composed of multiple servers.
Referring to Fig. 8, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, it can cause the processor 502 to execute the speaker separation method based on a recurrent neural network and acoustic features.
The processor 502 provides computing and control capability and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the speaker separation method based on a recurrent neural network and acoustic features.
The network interface 505 is used for network communication, such as providing data transmission. Those skilled in the art can understand that the structure shown in Fig. 8 is only a block diagram of the part of the structure relevant to the solution of the present invention and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different component arrangement.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the speaker separation method based on a recurrent neural network and acoustic features in the embodiments of the present application.
Those skilled in the art will understand that the embodiment of the computer device shown in Fig. 8 does not limit the specific composition of the computer device; in other embodiments, the computer device may include more or fewer components than shown, combine certain components, or have a different component arrangement. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structure and function of the memory and processor are consistent with the embodiment shown in Fig. 8 and are not repeated here.
It should be understood that in the embodiments of the present invention, the processor 502 may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Another embodiment of the present invention provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the speaker separation method based on a recurrent neural network and acoustic features in the embodiments of the present application is implemented.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods for each specific application to implement the described functions, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed units and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of units is only a division by logical function, and there may be other division manners in actual implementation; units with the same function may be combined into one unit; multiple units or components may be combined or integrated into another system; and some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may also be electrical, mechanical, or other forms of connection.
The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technical field can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and these modifications or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A speaker separation method based on a recurrent neural network and acoustic features, characterized by comprising:
receiving voice data to be recognized sent by an upload terminal;
obtaining a word vector set of the voice data to be recognized through speech recognition, obtaining an MFCC feature vector set of the voice data to be recognized through speech recognition, and fully connecting the word vector set and the MFCC feature vector set to obtain a fused feature vector;
inputting the fused feature vector into an encoder for encoding to obtain an encoding result;
decoding the encoding result as the input of a decoder to obtain a segmentation result corresponding to the fused feature vector, wherein the segmentation result comprises a word sequence and speaker-change symbols;
assigning, in the segmentation result, the speaker-change symbol nearest to each word to the corresponding word, so as to predict speaker changes over the segmentation result and obtain a speaker recognition result corresponding to the speaker-change symbols;
clustering the speaker recognition result to obtain a speaker clustering result; and
sending the speaker clustering result to the upload terminal corresponding to the voice data to be recognized.
2. The speaker separation method based on a recurrent neural network and acoustic features according to claim 1, characterized in that obtaining the word vector set of the voice data to be recognized through speech recognition comprises:
obtaining, through speech recognition, a one-hot encoded word vector corresponding to each word segment in the voice data to be recognized;
converting each one-hot encoded word vector corresponding to the voice to be recognized through a Word2Vec model, which maps words to vectors, to obtain the word vector set corresponding to the voice data to be recognized.
3. The speaker separation method based on a recurrent neural network and acoustic features according to claim 1, characterized in that obtaining the MFCC feature vector set of the voice data to be recognized through speech recognition comprises:
performing feature extraction on the voice to be recognized according to a preset feature window to obtain the MFCC feature vectors corresponding to the voice to be recognized, wherein the window length of the feature window is a preset first time value and the window shift of the feature window is a preset second time value.
4. The speaker separation method based on a recurrent neural network and acoustic features according to claim 1, characterized in that fully connecting the word vector set and the MFCC feature vector set to obtain the fused feature vector comprises:
inputting the word vector set and the MFCC feature vector set into a GRU model with the same number of hidden layers and fully connecting them to obtain the fused feature vector.
5. The speaker separation method based on a recurrent neural network and acoustic features according to claim 1, characterized in that inputting the fused feature vector into the encoder for encoding to obtain the encoding result comprises:
inputting the fused feature vector into the encoder and applying a nonlinear transformation to obtain an intermediate semantic representation;
obtaining, through an attention mechanism, the attention allocation probability distribution over each word segment in the intermediate semantic representation, so as to obtain the encoding result corresponding to the intermediate semantic representation.
6. The speaker separation method based on a recurrent neural network and acoustic features according to claim 1, characterized in that assigning, in the segmentation result, the speaker-change symbol nearest to each word to the corresponding word, so as to predict speaker changes over the segmentation result and obtain the speaker recognition result corresponding to the speaker-change symbols, comprises:
obtaining the segmentation result from the decoder;
pointing a marker at the first word in the segmentation result as the current starting word;
assigning the speaker-change symbol nearest to each word in the segmentation result to each word in the segmentation result, so as to establish a change vector;
storing the change vector into a speaker-change sequence matrix;
moving the marker one word to the right to update the current starting word;
judging whether the current starting word is the last word in the segmentation result; if the current starting word is not the last word in the segmentation result, returning to the step of assigning the speaker-change symbol nearest to each word in the segmentation result to each word in the segmentation result to establish a change vector; if the current starting word is the last word in the segmentation result, executing the step of ending the speaker-change prediction process;
ending the speaker-change prediction process.
7. A speaker separation apparatus based on a recurrent neural network and acoustic features, characterized by comprising:
a voice receiving unit, configured to receive voice data to be recognized sent by an upload terminal;
a feature fusion unit, configured to obtain a word vector set of the voice data to be recognized through speech recognition, obtain an MFCC feature vector set of the voice data to be recognized through speech recognition, and fully connect the word vector set and the MFCC feature vector set to obtain a fused feature vector;
an encoding unit, configured to input the fused feature vector into an encoder for encoding to obtain an encoding result;
a decoding unit, configured to decode the encoding result as the input of a decoder to obtain a segmentation result corresponding to the fused feature vector, wherein the segmentation result comprises a word sequence and speaker-change symbols;
a speaker prediction unit, configured to assign, in the segmentation result, the speaker-change symbol nearest to each word to the corresponding word, so as to predict speaker changes over the segmentation result and obtain a speaker recognition result corresponding to the speaker-change symbols;
a speaker clustering unit, configured to cluster the speaker recognition result to obtain a speaker clustering result; and
a result sending unit, configured to send the speaker clustering result to the upload terminal corresponding to the voice data to be recognized.
8. The speaker separation apparatus based on a recurrent neural network and acoustic features according to claim 7, characterized in that the feature fusion unit is further configured to:
perform feature extraction on the voice to be recognized according to a preset feature window to obtain the MFCC feature vectors corresponding to the voice to be recognized, wherein the window length of the feature window is a preset first time value and the window shift of the feature window is a preset second time value.
9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the speaker separation method based on a recurrent neural network and acoustic features according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the speaker separation method based on a recurrent neural network and acoustic features according to any one of claims 1 to 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910561692.XA CN110444223B (en) | 2019-06-26 | 2019-06-26 | Speaker separation method and device based on cyclic neural network and acoustic characteristics |
PCT/CN2019/117805 WO2020258661A1 (en) | 2019-06-26 | 2019-11-13 | Speaking person separation method and apparatus based on recurrent neural network and acoustic features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910561692.XA CN110444223B (en) | 2019-06-26 | 2019-06-26 | Speaker separation method and device based on cyclic neural network and acoustic characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110444223A true CN110444223A (en) | 2019-11-12 |
CN110444223B CN110444223B (en) | 2023-05-23 |
Family
ID=68428733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910561692.XA Active CN110444223B (en) | 2019-06-26 | 2019-06-26 | Speaker separation method and device based on cyclic neural network and acoustic characteristics |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110444223B (en) |
WO (1) | WO2020258661A1 (en) |
Cited By (17)
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113642422B (en) * | 2021-07-27 | 2024-05-24 | 东北电力大学 | Continuous Chinese sign language recognition method |
CN113555034B (en) * | 2021-08-03 | 2024-03-01 | 京东科技信息技术有限公司 | Compressed audio identification method, device and storage medium |
CN113822276B (en) * | 2021-09-30 | 2024-06-14 | 中国平安人寿保险股份有限公司 | Picture correction method, device, equipment and medium based on neural network |
CN114330474B (en) * | 2021-10-20 | 2024-04-26 | 腾讯科技(深圳)有限公司 | Data processing method, device, computer equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106683661A (en) * | 2015-11-05 | 2017-05-17 | 阿里巴巴集团控股有限公司 | Role separation method and device based on voice |
US20170178666A1 (en) * | 2015-12-21 | 2017-06-22 | Microsoft Technology Licensing, Llc | Multi-speaker speech separation |
CN107731233A (en) * | 2017-11-03 | 2018-02-23 | 王华锋 | Voiceprint recognition method based on RNN |
CN108766440A (en) * | 2018-05-28 | 2018-11-06 | 平安科技(深圳)有限公司 | Speaker separation model training method, two-speaker separation method and related device |
CN109147758A (en) * | 2018-09-12 | 2019-01-04 | 科大讯飞股份有限公司 | Speaker voice conversion method and device |
US20190066713A1 (en) * | 2016-06-14 | 2019-02-28 | The Trustees Of Columbia University In The City Of New York | Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments |
CN109584903A (en) * | 2018-12-29 | 2019-04-05 | 中国科学院声学研究所 | Multi-speaker speech separation method based on deep learning |
US20190156837A1 (en) * | 2017-11-23 | 2019-05-23 | Samsung Electronics Co., Ltd. | Neural network device for speaker recognition, and method of operation thereof |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6895376B2 (en) * | 2001-05-04 | 2005-05-17 | Matsushita Electric Industrial Co., Ltd. | Eigenvoice re-estimation technique of acoustic models for speech recognition, speaker identification and speaker verification |
CN105427858B (en) * | 2015-11-06 | 2019-09-03 | 科大讯飞股份有限公司 | Method and system for automatic voice classification |
CN108320732A (en) * | 2017-01-13 | 2018-07-24 | 阿里巴巴集团控股有限公司 | Method and apparatus for generating a target-speaker speech recognition model |
CN109036454A (en) * | 2018-06-06 | 2018-12-18 | 安徽继远软件有限公司 | DNN-based method and system for speaker-independent single-channel recording separation |
CN110444223B (en) * | 2019-06-26 | 2023-05-23 | 平安科技(深圳)有限公司 | Speaker separation method and device based on recurrent neural network and acoustic features |
- 2019-06-26: CN application CN201910561692.XA, patent CN110444223B/en, status Active
- 2019-11-13: WO application PCT/CN2019/117805, publication WO2020258661A1/en, status Application Filing
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020258661A1 (en) * | 2019-06-26 | 2020-12-30 | 平安科技(深圳)有限公司 | Speaker separation method and apparatus based on recurrent neural network and acoustic features |
CN112951270B (en) * | 2019-11-26 | 2024-04-19 | 新东方教育科技集团有限公司 | Voice fluency detection method and device and electronic equipment |
CN112951270A (en) * | 2019-11-26 | 2021-06-11 | 新东方教育科技集团有限公司 | Voice fluency detection method and device and electronic equipment |
CN110931013A (en) * | 2019-11-29 | 2020-03-27 | 北京搜狗科技发展有限公司 | Voice data processing method and device |
CN111128223A (en) * | 2019-12-30 | 2020-05-08 | 科大讯飞股份有限公司 | Text information-based auxiliary speaker separation method and related device |
CN111128223B (en) * | 2019-12-30 | 2022-08-05 | 科大讯飞股份有限公司 | Text information-based auxiliary speaker separation method and related device |
WO2021139772A1 (en) * | 2020-01-10 | 2021-07-15 | 阿里巴巴集团控股有限公司 | Audio information processing method and apparatus, electronic device, and storage medium |
CN111261186A (en) * | 2020-01-16 | 2020-06-09 | 南京理工大学 | Audio sound source separation method based on improved self-attention mechanism and cross-frequency band characteristics |
CN111276131A (en) * | 2020-01-22 | 2020-06-12 | 厦门大学 | Multi-class acoustic feature integration method and system based on deep neural network |
CN111276131B (en) * | 2020-01-22 | 2021-01-12 | 厦门大学 | Multi-class acoustic feature integration method and system based on deep neural network |
US11217225B2 (en) | 2020-01-22 | 2022-01-04 | Xiamen University | Multi-type acoustic feature integration method and system based on deep neural networks |
CN111461173B (en) * | 2020-03-06 | 2023-06-20 | 华南理工大学 | Multi-speaker clustering system and method based on attention mechanism |
CN111461173A (en) * | 2020-03-06 | 2020-07-28 | 华南理工大学 | Attention mechanism-based multi-speaker clustering system and method |
CN111223476A (en) * | 2020-04-23 | 2020-06-02 | 深圳市友杰智新科技有限公司 | Method and device for extracting voice feature vector, computer equipment and storage medium |
CN111524527B (en) * | 2020-04-30 | 2023-08-22 | 合肥讯飞数码科技有限公司 | Speaker separation method, speaker separation device, electronic device and storage medium |
CN111524527A (en) * | 2020-04-30 | 2020-08-11 | 合肥讯飞数码科技有限公司 | Speaker separation method, device, electronic equipment and storage medium |
CN111640450A (en) * | 2020-05-13 | 2020-09-08 | 广州国音智能科技有限公司 | Multi-person audio processing method, device, equipment and readable storage medium |
CN111640456B (en) * | 2020-06-04 | 2023-08-22 | 合肥讯飞数码科技有限公司 | Method, device and equipment for detecting overlapping sound |
CN111640456A (en) * | 2020-06-04 | 2020-09-08 | 合肥讯飞数码科技有限公司 | Overlapped sound detection method, device and equipment |
CN111883165A (en) * | 2020-07-02 | 2020-11-03 | 中移(杭州)信息技术有限公司 | Speaker voice segmentation method, device, electronic equipment and storage medium |
CN112201275A (en) * | 2020-10-09 | 2021-01-08 | 深圳前海微众银行股份有限公司 | Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium |
CN112201275B (en) * | 2020-10-09 | 2024-05-07 | 深圳前海微众银行股份有限公司 | Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium |
CN112233668A (en) * | 2020-10-21 | 2021-01-15 | 中国人民解放军海军工程大学 | Voice instruction and identity recognition method based on neural network |
CN112233668B (en) * | 2020-10-21 | 2023-04-07 | 中国人民解放军海军工程大学 | Voice instruction and identity recognition method based on neural network |
CN112992175A (en) * | 2021-02-04 | 2021-06-18 | 深圳壹秘科技有限公司 | Voice distinguishing method and voice recording device thereof |
WO2022166219A1 (en) * | 2021-02-04 | 2022-08-11 | 深圳壹秘科技有限公司 | Voice diarization method and voice recording apparatus thereof |
CN112992175B (en) * | 2021-02-04 | 2023-08-11 | 深圳壹秘科技有限公司 | Voice distinguishing method and voice recording device thereof |
CN113707130A (en) * | 2021-08-16 | 2021-11-26 | 北京搜狗科技发展有限公司 | Voice recognition method and device for voice recognition |
Also Published As
Publication number | Publication date |
---|---|
WO2020258661A1 (en) | 2020-12-30 |
CN110444223B (en) | 2023-05-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110444223A (en) | Speaker separation method and device based on recurrent neural network and acoustic features | |
EP3459077B1 (en) | Permutation invariant training for talker-independent multi-talker speech separation | |
CN111429889B (en) | Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention | |
CN109785824B (en) | Training method and device of voice translation model | |
CN111461173B (en) | Multi-speaker clustering system and method based on attention mechanism | |
Kingsbury | Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling | |
KR102294638B1 (en) | Combined learning method and apparatus using deep neural network-based feature enhancement and modified loss function for speaker recognition robust to noisy environments | |
CN111754992B (en) | Noise robust audio/video bimodal speech recognition method and system | |
CN103117060A (en) | Modeling approach and modeling system of acoustic model used in speech recognition | |
CN113782048B (en) | Multi-mode voice separation method, training method and related device | |
CN111178157A (en) | Chinese lip language identification method from cascade sequence to sequence model based on tone | |
WO2022206602A1 (en) | Speech wakeup method and apparatus, and storage medium and system | |
EP4323988A1 (en) | End-to-end speech diarization via iterative speaker embedding | |
CN113643723A (en) | Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information | |
CN110634469A (en) | Speech signal processing method and device based on artificial intelligence and storage medium | |
Maas et al. | Recurrent neural network feature enhancement: The 2nd CHiME challenge | |
CN112420079B (en) | Voice endpoint detection method and device, storage medium and electronic equipment | |
CN114360491A (en) | Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium | |
Jin et al. | Speech separation and emotion recognition for multi-speaker scenarios | |
CN112133294A (en) | Speech recognition method, apparatus and system and storage medium | |
CN116534700A (en) | Control system and method for stair climbing machine | |
CN115376533A (en) | Voice conversion method for personalized voice generation | |
Roy et al. | Phoneme recognition using boosted binary features | |
Zhang et al. | Audio-visual speech separation with visual features enhanced by adversarial training | |
Moons et al. | Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||