CN110444202A - Combination speech recognition methods, device, equipment and computer readable storage medium - Google Patents
- Publication number
- CN110444202A CN110444202A CN201910601019.4A CN201910601019A CN110444202A CN 110444202 A CN110444202 A CN 110444202A CN 201910601019 A CN201910601019 A CN 201910601019A CN 110444202 A CN110444202 A CN 110444202A
- Authority
- CN
- China
- Prior art keywords
- frequency
- preset
- mel
- combination speech
- capsule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L15/06 — Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/22 — Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/18 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
- G10L25/24 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/27 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique
Abstract
The present invention relates to the field of artificial intelligence and uses deep learning, via a capsule network model, to identify the sound types contained in a composite voice signal. Specifically disclosed are a composite speech recognition method, apparatus, computer device and computer-readable storage medium. The method includes: detecting, in real time or periodically, composite speech within a preset range; when the composite speech is detected, acquiring the voice signal of the composite speech; performing a short-time Fourier transform on the voice signal to generate a time-frequency diagram of the composite voice signal; extracting, based on a preset capsule network model, multiple spectra from the time-frequency diagram and obtaining the mel-frequency cepstral coefficient of each spectrum; and calculating, through the preset capsule network model, the vector modulus of each mel-frequency cepstral coefficient and determining the type of the composite speech according to the vector modulus of each mel-frequency cepstral coefficient.
Description
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a composite speech recognition method, apparatus, device and computer-readable storage medium.
Background technique
The goal of sound event detection is to automatically detect the onset and end time of particular events from sound, and to assign a label to each event. With the assistance of this technology, a computer can understand its surroundings through sound and respond to them. Sound event detection has broad application prospects in daily life, including audio surveillance, bioacoustic monitoring and smart homes. Depending on whether multiple sound events are allowed to occur simultaneously, the task is divided into single-sound and composite-sound event detection. In single-sound event detection, each individual sound event has a definite frequency and amplitude in the spectrum; in composite-sound event detection, however, these frequencies or amplitudes may overlap. Existing sound detection technology mainly detects and identifies single sounds, and cannot identify the types of overlapping composite sounds that occur simultaneously.
Summary of the invention
The main purpose of the present invention is to provide a composite speech recognition method, apparatus, device and computer-readable storage medium, aiming to solve the problem that existing sound detection technology cannot identify the types of overlapping composite sounds that occur simultaneously.
In a first aspect, the present application provides a composite speech recognition method, the composite speech recognition method including:
detecting, in real time or periodically, composite speech within a preset range;
when the composite speech is detected, acquiring the voice signal of the composite speech;
performing a short-time Fourier transform on the voice signal to generate a time-frequency diagram of the composite voice signal;
extracting, based on a preset capsule network model, multiple spectra from the time-frequency diagram and obtaining the mel-frequency cepstral coefficient of each spectrum;
calculating, through the preset capsule network model, the vector modulus of each mel-frequency cepstral coefficient, and determining the type of the composite speech according to the vector modulus of each mel-frequency cepstral coefficient.
In a second aspect, the present application further provides a composite speech recognition apparatus, the composite speech recognition apparatus including:
a detection unit, configured to detect, in real time or periodically, composite speech within a preset range;
a first acquisition module, configured to acquire the voice signal of the composite speech when the composite speech is detected;
a generation module, configured to perform a short-time Fourier transform on the voice signal to generate a time-frequency diagram of the composite speech;
a second acquisition module, configured to extract, based on a preset capsule network model, multiple spectra from the time-frequency diagram and obtain the mel-frequency cepstral coefficient of each spectrum;
a third acquisition module, configured to calculate, through the preset capsule network model, the vector modulus of each mel-frequency cepstral coefficient, and to determine the type of the composite speech according to the vector modulus of each mel-frequency cepstral coefficient.
In a third aspect, the present application further provides a computer device, the computer device including a memory, a processor, and a composite speech recognition program stored on the memory and runnable on the processor, where the composite speech recognition program, when executed by the processor, implements the steps of the composite speech recognition method described above.
In a fourth aspect, the present application further provides a computer-readable storage medium on which a composite speech recognition program is stored, where the composite speech recognition program, when executed by a processor, implements the steps of the composite speech recognition method described above.
The composite speech recognition method, apparatus, device and computer-readable storage medium proposed by the embodiments of the present invention detect, in real time or periodically, composite speech within a preset range; acquire the voice signal of the composite speech when it is detected; perform a short-time Fourier transform on the voice signal to generate a time-frequency diagram of the composite voice signal; extract, based on a preset capsule network model, multiple spectra from the time-frequency diagram and obtain the mel-frequency cepstral coefficient of each spectrum; and calculate, through the preset capsule network model, the vector modulus of each mel-frequency cepstral coefficient and determine the type of the composite speech according to those vector moduli, thereby identifying the sound types of composite speech through a capsule network model.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed for the description of the embodiments are briefly introduced below. Obviously, the accompanying drawings described below show some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a composite speech recognition method provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of sub-steps of the composite speech recognition method in Fig. 1;
Fig. 3 is a schematic flowchart of sub-steps of the composite speech recognition method in Fig. 1;
Fig. 4 is a schematic flowchart of another composite speech recognition method provided by an embodiment of the present application;
Fig. 5 is a schematic flowchart of sub-steps of the composite speech recognition method in Fig. 4;
Fig. 6 is a schematic flowchart of another composite speech recognition method provided by an embodiment of the present application;
Fig. 7 is a schematic flowchart of sub-steps of the composite speech recognition method in Fig. 6;
Fig. 8 is a schematic block diagram of a composite speech recognition apparatus provided by an embodiment of the present application;
Fig. 9 is a schematic block diagram of sub-modules of the composite speech recognition apparatus in Fig. 8;
Fig. 10 is a schematic block diagram of sub-modules of the composite speech recognition apparatus in Fig. 8;
Fig. 11 is a schematic block diagram of another composite speech recognition apparatus provided by an embodiment of the present application;
Fig. 12 is a schematic block diagram of sub-modules of the composite speech recognition apparatus in Fig. 11;
Fig. 13 is a schematic block diagram of another composite speech recognition apparatus provided by an embodiment of the present application;
Fig. 14 is a schematic block diagram of sub-modules of the composite speech recognition apparatus in Fig. 13;
Fig. 15 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
The realization of the purpose, the functional characteristics and the advantages of the present application will be further described with reference to the accompanying drawings in combination with the embodiments.
Specific embodiment
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
The flowcharts shown in the drawings are merely illustrative: they need not include all the contents and operations/steps, nor need they be executed in the order described. For example, some operations/steps may be decomposed, combined or partially merged, so the actual order of execution may change according to the actual situation.
The embodiments of the present application provide a composite speech recognition method, apparatus, device and computer-readable storage medium. The composite speech recognition method can be applied to a terminal device, which may be a mobile phone, a tablet computer, a notebook computer or a desktop computer.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. In the absence of conflict, the following embodiments and the features in the embodiments can be combined with one another.
Please refer to Fig. 1, which is a schematic flowchart of a composite speech recognition method provided by an embodiment of the present application.
As shown in Fig. 1, the composite speech recognition method includes steps S10 to S50.
Step S10: detect, in real time or periodically, composite speech within a preset range;
The terminal detects, in real time or periodically, composite speech within a preset range. For example, the range the terminal can monitor serves as its preset range: it may be an indoor room, or an outdoor park, etc. The terminal may be set in advance to detect composite speech in a preset room or preset park at all times, or to detect it every hour, where the composite speech contains at least two different mixed voices. It should be noted that the above preset range can be configured according to the actual situation, and the present application does not specifically limit it.
Step S20: when composite speech is detected, acquire the voice signal of the composite speech;
When the terminal detects composite speech, it collects the detected composite speech and obtains its voice signal by analyzing the composite speech; the voice signal includes the frequency, amplitude and time of the sound, etc. For example, when the terminal detects composite speech in which two or more sounds are mixed, it analyzes the detected composite speech through a preset spectral-analysis function or a preset oscilloscope function to collect the sound frequency of the composite speech, and obtains the sound amplitude of the composite speech through a preset decibel meter. With the spectral-analysis function or oscilloscope function preset in the terminal, the sound frequency of the composite speech can be calculated through the preset spectral-analysis function, or its sound amplitude through the preset oscilloscope function.
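As a rough illustration of how such a spectral-analysis function can recover the component frequencies of a mixed sound, the following sketch picks the strongest peaks of an FFT magnitude spectrum. The 440 Hz / 1000 Hz tones and the 8 kHz sample rate are invented for the example; this is not the patent's implementation.

```python
import numpy as np

def dominant_frequencies(signal, sample_rate, n_peaks=2):
    """Return the n_peaks strongest frequency components of a signal,
    found from the magnitude of its real FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Indices of the largest spectral magnitudes, strongest first.
    top = np.argsort(spectrum)[::-1][:n_peaks]
    return sorted(float(f) for f in freqs[top])

# A mixture of a 440 Hz and a 1000 Hz tone, sampled at 8 kHz for 1 s.
sr = 8000
t = np.arange(sr) / sr
mixed = np.sin(2 * np.pi * 440 * t) + 0.8 * np.sin(2 * np.pi * 1000 * t)
print(dominant_frequencies(mixed, sr))  # -> [440.0, 1000.0]
```

Because both tones complete a whole number of cycles in the one-second window, their energy lands in exactly one FFT bin each, so the two peaks are unambiguous.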
In one embodiment, specifically, referring to Fig. 2, step S20 includes sub-steps S21 to S23.
Sub-step S21 transfers preset sample rate when detecting combination speech;
When terminal detects combination speech, preset sample rate is transferred, sample rate is also referred to as sample rate or sampling frequency
Rate defines the number of samples per second extracted from continuous signal and form discrete signal, it is indicated with hertz (Hz), preset
Sample rate can be 40Hz, be also possible to 60Hz etc..It should be noted that above-mentioned preset sample rate can be carried out based on actual conditions
Setting, the application are not especially limited this.
Sub-step S22 determines the sampling time interval of preset sample rate by preset formula and preset sample rate;
Terminal calculates the sampling time interval of preset sample rate by preset formula and preset sample rate, wherein preset public affairs
Formula is sampling time interval=1/ sample rate, by preset sample rate so as to find out the sampling time interval of sample rate.For example, adopting
Sample frequency is 40KHz, then sampled point has 40 × 1000 in 1s, and each sampling period, (sampling period was consistent under normal conditions
) t=1/40 × 1000.
Sub-step S23: sample the composite speech based on the sampling interval to obtain the discrete signal of the composite speech.
The terminal samples the composite speech at the sampling interval and obtains the discrete signal of the composite speech, where the number of samples of the discrete signal is determined by the sampling interval. A discrete signal is a signal sampled from a continuous signal: the independent variable of a continuous signal varies continuously, whereas a discrete signal is a sequence, i.e. its independent variable is discrete, and each value of the sequence can be regarded as one sample of the continuous signal. Processing the composite speech at the preset sampling rate yields a discrete signal of the composite voice signal with better quality.
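Sub-steps S21 to S23 can be sketched as follows, using the 40 kHz figure from the example above; the 50 Hz test tone is invented for illustration.

```python
import numpy as np

SAMPLE_RATE = 40_000                     # 40 kHz, as in the example above
SAMPLE_INTERVAL = 1.0 / SAMPLE_RATE      # preset formula: interval = 1 / rate

# One second of sampling instants at the preset interval yields
# SAMPLE_RATE discrete samples of a (here invented) 50 Hz tone.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
discrete = np.sin(2 * np.pi * 50 * t)

print(SAMPLE_INTERVAL)    # 2.5e-05 s, i.e. t = 1 / (40 * 1000)
print(len(discrete))      # 40000 samples in one second
```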
Step S30: perform a short-time Fourier transform on the voice signal to generate the time-frequency diagram of the composite voice signal;
When the terminal obtains the voice signal of the composite speech, it applies a short-time Fourier transform to the acquired voice signal. The short-time Fourier transform (STFT) is a mathematical transform related to the Fourier transform and is used to determine the frequency and phase of the sine waves in a local region of a time-varying signal. Specifically, the short-time Fourier transform involves a frame shift, a frame duration and a Fourier transform: the acquired voice signal is preprocessed by framing (frame shift and frame duration), and the preprocessed sound is Fourier-transformed, yielding multiple two-dimensional diagrams. By Fourier-transforming the voice signal, the relationship between frequency and amplitude in the composite speech can be obtained; each two-dimensional diagram is a spectrum, and the multiple two-dimensional signals are stacked along a dimension to generate the time-frequency diagram of the composite speech. Each frame in the time-frequency diagram is a spectrum, and the variation of the spectra over time constitutes the time-frequency diagram.
In one embodiment, specifically, referring to Fig. 3, step S30 includes sub-steps S31 to S33.
Sub-step S31: if the discrete signal is obtained, read the preset frame-duration information and frame-shift information;
If the terminal obtains the discrete signal, the short-time Fourier transform involves the frame duration, the frame shift and the Fourier transform. The terminal reads the preset frame-duration information and frame-shift information; for example, the frame duration may be preset to 40 ms, 50 ms, etc., and the frame shift to 20 ms, 30 ms, etc. It should be noted that the preset frame-duration and frame-shift information can be configured according to the actual situation, and the present application does not specifically limit them.
Sub-step S32: preprocess the discrete signal according to the frame-duration information and frame-shift information to obtain multiple short-time analysis signals;
The terminal preprocesses the acquired discrete signals according to the preset frame-duration and frame-shift information to obtain multiple short-time analysis signals. For example, the acquired discrete signal is framed with a frame duration of 40 ms or 50 ms and a frame shift of 20 ms or 30 ms, yielding the short-time analysis signals of each discrete signal.
Sub-step S33: perform a Fourier transform on the multiple short-time analysis signals to generate the time-frequency diagram of the composite speech.
When the terminal obtains the multiple short-time analysis signals, it performs a Fourier transform on each short-time analysis signal, obtains the relationship between frequency and time, and generates a two-dimensional diagram for each; stacking the two-dimensional diagrams along one dimension generates the time-frequency diagram of the composite voice signal. By applying the frame shift, the frame duration and the Fourier transform to the discrete signal, the time-frequency diagram of the composite voice signal is generated, so that the spectra of the composite voice signal and their variation over time can be better obtained from the time-frequency diagram.
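Sub-steps S31 to S33 can be sketched as below, assuming a 40 ms frame duration and a 20 ms frame shift as in the examples above; the 16 kHz sample rate, the Hann window and the test tones are illustrative choices, not mandated by the description.

```python
import numpy as np

def stft_time_freq(signal, sr, frame_ms=40, hop_ms=20):
    """Short-time Fourier transform: split the signal into overlapping
    frames (frame duration / frame shift), window each frame, and take
    its FFT. Each row of the result is one spectrum; stacking the rows
    over time gives the time-frequency diagram."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectra

sr = 16_000
t = np.arange(sr) / sr                           # 1 s of audio
mixed = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 2000 * t)
tf = stft_time_freq(mixed, sr)
print(tf.shape)   # (frames, frequency bins): (49, 321)
```

With a 640-sample frame and a 320-sample hop, one second of 16 kHz audio yields 49 frames, each with 321 frequency bins.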
Step S40: extract, based on the preset capsule network model, multiple spectra from the time-frequency diagram and obtain the mel-frequency cepstral coefficient of each spectrum;
When the terminal obtains the time-frequency diagram of the composite speech, it relies on the preset capsule network model. A capsule network is a new neural network structure including a convolutional layer, primary capsules, advanced capsules, etc., where a capsule is a group of nested neural network layers. In a capsule network, more layers can be added inside a single network layer; specifically, one neural network layer is nested inside another. The states of the neurons in a capsule characterize the attributes of an entity in the image, and the capsule outputs a vector representing the existence of the entity; the orientation of the vector represents the entity's attributes, and the vector is sent to all parent capsules in the neural network. A capsule computes a prediction vector, which is obtained by multiplying its own output by a weight matrix.
The capsule network model extracts the frame signals in the time-frequency diagram, where each frame in the time-frequency diagram represents one spectrum. When multiple spectra of the time-frequency diagram are obtained, the mel-frequency filter function group in the capsule network is retrieved, the spectra are passed through the mel-frequency filter function group, the logarithm of the filter outputs is read, and the logarithm is taken as the mel-frequency cepstral coefficient of the spectrum.
Step S50: calculate, through the preset capsule network model, the vector modulus of each mel-frequency cepstral coefficient, and determine the type of the composite speech according to the vector modulus of each mel-frequency cepstral coefficient.
When the terminal obtains the mel-frequency cepstral coefficient of each spectrum, it retrieves the preset capsule network model, together with the dynamic routing algorithm and the weight matrix in the preset capsule network model. Through the dynamic routing algorithm and the weight matrix, the vector modulus of the mel-frequency cepstral coefficient of each spectrum is calculated; the vector moduli of the acquired mel-frequency cepstral coefficients are compared, and the mel-frequency cepstral coefficient with the largest vector modulus is obtained, so that the sound type corresponding to that mel-frequency cepstral coefficient is obtained and taken as a sound type of the composite speech. Sound types include barking, glass breaking, etc., and the composite speech contains at least two sound types.
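The final decision step can be illustrated as follows. The 16-dimensional capsule output vectors and the three class labels are invented for the example, and the dynamic-routing computation that would produce the vectors is omitted; only the modulus comparison is shown.

```python
import numpy as np

def classify_by_vector_modulus(capsule_outputs, labels):
    """Each class capsule emits a vector; the Euclidean norm (vector
    modulus) of that vector is treated as the evidence that the class
    is present. The capsule with the largest modulus wins."""
    moduli = np.linalg.norm(capsule_outputs, axis=1)
    return labels[int(np.argmax(moduli))], moduli

# Hypothetical 16-dimensional output vectors for three sound classes.
rng = np.random.default_rng(1)
outputs = np.stack([0.2 * rng.standard_normal(16),
                    0.9 * np.ones(16),             # strongest capsule
                    0.1 * rng.standard_normal(16)])
labels = ["bark", "glass breaking", "alarm"]
best, moduli = classify_by_vector_modulus(outputs, labels)
print(best)   # prints: glass breaking
```

For composite speech with several simultaneous sounds, one would threshold the moduli rather than take a single argmax; the sketch shows only the comparison described in the text.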
The composite speech recognition method provided by the above embodiment generates a time-frequency diagram from the composite speech and processes the time-frequency diagram based on a capsule network model, so that the sound types of the composite speech can be detected.
Please refer to Fig. 4, which is a schematic flowchart of another composite speech recognition method provided by this embodiment. As shown in Fig. 4, the composite speech recognition method includes:
Step S10: detect, in real time or periodically, composite speech within a preset range;
The terminal detects, in real time or periodically, composite speech within a preset range. For example, the range the terminal can monitor serves as its preset range: it may be an indoor room, or an outdoor park, etc. The terminal may be set in advance to detect composite speech in a preset room or preset park at all times, or to detect it every hour, where the composite speech contains at least two different mixed voices.
Step S20: when composite speech is detected, acquire the voice signal of the composite speech;
When the terminal detects composite speech, it collects the detected composite speech and obtains its voice signal by analyzing the composite speech; the voice signal includes the frequency, amplitude and time of the sound, etc. For example, when the terminal detects composite speech in which two or more sounds are mixed, it analyzes the detected composite speech through a preset spectrum analyzer or a preset oscilloscope to collect the sound frequency of the composite speech, and obtains the sound amplitude of the composite speech through a preset decibel meter.
Step S30: perform a short-time Fourier transform on the voice signal to generate the time-frequency diagram of the composite speech;
When the terminal obtains the voice signal of the composite speech, it applies a short-time Fourier transform to the acquired voice signal. The short-time Fourier transform (STFT) is a mathematical transform related to the Fourier transform and is used to determine the frequency and phase of the sine waves in a local region of a time-varying signal. Specifically, the short-time Fourier transform involves a frame shift, a frame duration and a Fourier transform: the acquired voice signal is preprocessed by framing (frame shift and frame duration), and the preprocessed sound is Fourier-transformed, yielding multiple two-dimensional diagrams. By Fourier-transforming the voice signal, the relationship between frequency and amplitude in the composite speech can be obtained; each two-dimensional diagram is a spectrum, and the multiple two-dimensional signals are stacked along a dimension to generate the time-frequency diagram of the composite speech. Each frame in the time-frequency diagram is a spectrum, and the variation of the spectra over time constitutes the time-frequency diagram.
Step S41: if the time-frequency diagram of the composite voice signal is obtained, retrieve the preset capsule network model, where the preset capsule network model includes a convolutional layer, primary capsules, advanced capsules and an output layer;
If the terminal obtains the time-frequency diagram of the composite voice signal, it retrieves the preset capsule network model, where the preset capsule network model includes a convolutional layer, primary capsules, advanced capsules and an output layer. It should be noted that the number of convolution kernels in the convolutional layer can be configured according to the actual situation, and the present application does not specifically limit it.
Step S42: when the time-frequency diagram is input into the preset capsule network model, frame the time-frequency diagram through the convolution kernels of the convolutional layer to extract multiple spectra of the time-frequency diagram;
The terminal inputs the acquired time-frequency diagram into the preset capsule network model and passes it through the convolutional layer of the preset capsule network model. The convolutional layer contains convolution kernels, which frame the input time-frequency diagram and extract multiple spectra from it. For example, the terminal inputs a 28 × 28 time-frequency diagram, and the convolutional layer has 256 convolution kernels of size 9 × 9 with stride 1; the time-frequency diagram is framed according to the number of kernels, the stride and other information, so that 256 spectra of size 20 × 20 are obtained. The size of each spectrum follows the rule (f − n + 1) × (f − n + 1), where f is the size of the time-frequency diagram and n is the size of the convolution kernel. The terminal thus extracts 256 spectra of size 20 × 20 through the convolutional layer of the preset capsule network model.
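The (f − n + 1) rule for the output size of a valid (no-padding) convolution can be checked directly; the 28 × 28 input and 9 × 9 kernels follow the example above.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Spatial size of a valid (no-padding) convolution output,
    matching the rule (f - n + 1) for stride 1."""
    return (input_size - kernel_size) // stride + 1

# 28x28 time-frequency diagram, 9x9 kernels, stride 1:
side = conv_output_size(28, 9)
print(side)   # -> 20, i.e. 256 feature maps of 20 x 20
```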
Step S43: the extracted spectra are filtered through a preset group of filter functions to obtain the mel-frequency cepstrum coefficient of each spectrum.
When the terminal has extracted multiple spectra through the convolutional layer, it passes each extracted spectrum through the preset group of filter functions, reads the logarithm (log) of the filtered output, and takes the logarithm as the mel-frequency cepstrum coefficient of that spectrum. Specifically, a spectrum is described by the spectrum formula X[K] = H[K]E[K], where X[K] is the spectrum, H[K] is the spectral envelope, and E[K] is the spectral detail; that is, a spectrum is composed of an envelope and spectral detail. The envelope is obtained by connecting the formants of the spectrum, and formants represent the main frequency components of a voice; they carry the identifying attributes of a sound (much like a personal identity card). The coefficients of H[K] are read through the preset group of filter functions, and the coefficients of H[K] are the mel-frequency cepstrum coefficients.
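Why the logarithm helps here can be shown numerically (an illustrative sketch with made-up envelope and detail arrays, not the patent's data): taking the log turns the product X[K] = H[K]E[K] into a sum, so the envelope term becomes additively separable from the detail term.

```python
import numpy as np

# Hypothetical envelope H and spectral detail E; X is their product,
# as in the patent's spectrum formula X[K] = H[K] * E[K].
rng = np.random.default_rng(0)
H = np.abs(rng.normal(1.0, 0.1, 64)) + 1.0   # slowly varying envelope (formants)
E = np.abs(rng.normal(1.0, 0.1, 64)) + 1.0   # fine spectral detail
X = H * E

# log X[K] = log H[K] + log E[K]: the product becomes a sum,
# which is what lets cepstral analysis separate the two components.
log_X = np.log(X)
```

This additive form is the basis for reading off the envelope coefficients in the cepstral domain.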
In one embodiment, specifically referring to Fig. 5, step S43 includes sub-step S431 and sub-step S432.
Sub-step S431: when multiple spectra are extracted, the preset group of filter functions in the convolutional layer filters the spectra to obtain the mel-frequency cepstrum of each spectrum, where a spectrum is composed of an envelope and spectral detail.
When the terminal detects that the convolution kernels have extracted multiple spectra, the group of filter functions preset in the convolutional layer filters the spectra. The preset group includes multiple filter functions; for example, 40 filter functions may form one group, or 50 filter functions may form one group. Because the preset group contains low-frequency, mid-frequency, and high-frequency filter functions, the envelope contained in each spectrum can be effectively separated from the spectral detail, thereby obtaining the mel-frequency cepstrum of the envelope in each spectrum.
Sub-step S432: cepstral analysis is performed on each mel-frequency cepstrum by the primary capsules to obtain the cepstrum coefficients of the envelopes, and the cepstrum coefficients of the envelopes are taken as the mel-frequency cepstrum coefficients.
Through the primary capsules, the terminal performs cepstral analysis on the mel-frequency cepstrum of each envelope and obtains each envelope's mel-frequency cepstrum coefficients, which are also the mel-frequency cepstrum coefficients of each spectral envelope.
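The filter-bank-then-cepstral-analysis pipeline of sub-steps S431 and S432 can be sketched as follows. This is a minimal sketch under standard MFCC practice, not the patent's exact implementation: the triangular filters are placed linearly purely for illustration (a true mel scale is logarithmic), and the DCT-II plays the role of the cepstral analysis.

```python
import numpy as np

def filter_bank(num_filters, num_bins):
    """Hypothetical triangular filter group (e.g. 40 filters per group)."""
    centers = np.linspace(0, num_bins - 1, num_filters + 2)
    fb = np.zeros((num_filters, num_bins))
    k = np.arange(num_bins)
    for i in range(num_filters):
        left, mid, right = centers[i], centers[i + 1], centers[i + 2]
        fb[i] = np.clip(np.minimum((k - left) / (mid - left),
                                   (right - k) / (right - mid)), 0.0, None)
    return fb

def envelope_cepstrum_coefficients(power_spectrum, num_filters=40):
    fb = filter_bank(num_filters, power_spectrum.shape[-1])
    log_mel = np.log(fb @ power_spectrum + 1e-10)   # sub-step S431: filter + log
    # Sub-step S432: DCT-II of the log spectrum = cepstral analysis;
    # low-order coefficients describe the envelope.
    j = np.arange(num_filters)[:, None]
    n = np.arange(num_filters)[None, :]
    dct = np.cos(np.pi / num_filters * (n + 0.5) * j)
    return dct @ log_mel

coeffs = envelope_cepstrum_coefficients(
    np.abs(np.random.default_rng(1).normal(size=257)) + 1.0)
```

One coefficient vector of this kind per spectrum is what the primary capsules pass onward.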
Step S50: through the preset capsule network model, the vector norm of each mel-frequency cepstrum coefficient is calculated, and the type of the composite voice is determined according to the vector norm of each mel-frequency cepstrum coefficient.
When the terminal has obtained the mel-frequency cepstrum coefficient of each spectrum, it processes them through the preset capsule network model, which includes a dynamic routing algorithm and a weight matrix. Each acquired mel-frequency cepstrum coefficient passes through the dynamic routing algorithm and the weight matrix, producing the vector norm of the mel-frequency cepstrum coefficient of each spectrum. The vector norms are compared to find the mel-frequency cepstrum coefficient with the largest vector norm, and the voice type corresponding to that coefficient is taken as the voice type of the composite voice. Voice types include, for example, a dog barking and glass breaking, and a composite voice contains at least two voice types.
In the composite voice recognition method provided by the above embodiment, the spectra of the time-frequency diagram are extracted by the capsule network model and the mel-frequency cepstrum coefficient of each spectrum is obtained; the features of the composite voice signal are thus obtained quickly, and human resources are saved.
Please refer to Fig. 6, which is a schematic scenario diagram for implementing the composite voice recognition method provided in this embodiment. As shown in Fig. 6, the method includes:
Step S10: the composite voice within a preset range is detected in real time or periodically.
The terminal detects the composite voice within a preset range in real time or periodically. For example, the range that the terminal can monitor serves as its preset range, such as an indoor room or an outdoor park. The terminal may be preconfigured to detect the composite voice in a preset room or a preset park at all times, or to perform the detection once every hour, where the composite voice includes at least two different mixed voices.
Step S20: when the composite voice is detected, the sound signal of the composite voice is acquired.
When the terminal detects the composite voice, it acquires the detected composite voice and, by analyzing it, obtains the sound signal of the composite voice; the sound signal includes the frequency, amplitude, and duration of the sound. For example, when the terminal detects a composite voice in which two or more voices are mixed, it analyzes the detected composite voice through a preset spectrum analyzer or a preset oscilloscope to collect the sound frequencies of the composite voice, and obtains the sound amplitudes of the composite voice through a preset decibel meter.
Step S30: a short-time Fourier transform is applied to the sound signal to generate the time-frequency diagram of the composite voice.
When the terminal has acquired the sound signal of the composite voice, it applies a short-time Fourier transform to the acquired signal. The short-time Fourier transform (STFT, short-time Fourier transform or short-term Fourier transform) is a mathematical transform related to the Fourier transform, used to determine the frequency and phase of the local sine-wave components of a time-varying signal. Specifically, the short-time Fourier transform involves a frame shift, a frame duration, and a Fourier transform: the acquired sound signal is preprocessed according to the frame shift and frame duration, and the Fourier transform is applied to the preprocessed sound, yielding multiple two-dimensional diagrams. By applying the Fourier transform to the sound signal, the relationship between frequency and amplitude in the composite voice is obtained. Each two-dimensional diagram is a spectrum, and the multiple two-dimensional signals are stacked along a dimension to generate the time-frequency diagram of the composite voice. Each frame in the time-frequency diagram is a spectrum, and the time-frequency diagram describes how the spectrum varies over time.
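Step S30 can be sketched as follows (a minimal sketch; the frame length of 256 samples, frame shift of 128 samples, Hann window, and 8 kHz test tone are assumed values, not specified by the patent): frame the signal, window each frame, apply an FFT per frame, and stack the per-frame spectra into the time-frequency diagram.

```python
import numpy as np

def stft_spectrogram(signal, frame_len=256, frame_shift=128):
    """Stack per-frame magnitude spectra into a time-frequency diagram."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))  # one column = one spectrum
    return np.stack(frames, axis=1)                # rows: frequency, cols: time

fs = 8000                                          # assumed sample rate
t = np.arange(fs) / fs
tf = stft_spectrogram(np.sin(2 * np.pi * 440 * t))  # 1 s of a 440 Hz tone
```

Each column of `tf` corresponds to one frame of the time-frequency diagram, i.e. one spectrum.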
Step S40: based on the preset capsule network model, multiple spectra of the time-frequency diagram are extracted to obtain the mel-frequency cepstrum coefficient of each spectrum.
When the terminal has acquired the time-frequency diagram of the composite voice, it proceeds based on the preset capsule network model. A capsule network is a new type of neural network structure that includes a convolutional layer, primary capsules, and advanced capsules. A capsule is a group of nested neural network layers; in a capsule network, more layers can be added inside a single network layer.
Specifically, one neural network layer is nested inside another. The states of the neurons in a capsule characterize the above-mentioned attributes of an entity in an image. A capsule outputs a vector representing the probability that the entity exists; the orientation of the vector represents the attributes of the entity, and the vector is sent to all parent capsules in the network. A capsule computes a prediction vector, which is obtained by multiplying its own output by a weight matrix. The capsule network model extracts the frame signals in the time-frequency diagram, where each frame represents a spectrum. When the multiple spectra of the time-frequency diagram are obtained, the group of mel-frequency filter functions in the capsule network is invoked; each spectrum passes through the group of mel-frequency filter functions, the logarithm of the filtered output is read, and the logarithm is taken as the mel-frequency cepstrum coefficient of that spectrum.
Step S51: when the multiple primary capsules each forward-propagate the mel-frequency cepstrum coefficients to the advanced capsules, the intermediate vector of the mel-frequency cepstrum coefficients is obtained through the dynamic routing formulas of the preset capsule network.
When the terminal has obtained the mel-frequency cepstrum coefficient output by each primary capsule, each primary capsule forward-propagates its mel-frequency cepstrum coefficient to the advanced capsules, and the intermediate vector of the mel-frequency cepstrum coefficients is obtained through the dynamic routing formulas of the preset capsule network model.
In one embodiment, specifically referring to Fig. 7, step S51 includes sub-step S511 to sub-step S513.
Sub-step S511: when a primary capsule forward-propagates the mel-frequency cepstrum coefficient to the advanced capsules, the weight values of the capsule network model are obtained.
Specifically, when a primary capsule forward-propagates the mel-frequency cepstrum coefficient to the advanced capsules, the weight values of the preset capsule network model are obtained; these weight values were learned by the capsule network model on a training data set.
Sub-step S512: based on the first preset formula of the capsule network model and the weight values, the vectors of the mel-frequency cepstrum coefficients are obtained, and the coupling coefficients of the capsule network model are obtained.
Through the first preset formula of the preset capsule network model, û = w · u, where û is the vector of the mel-frequency cepstrum coefficient, w is a weight value of the preset capsule network model, and u is the mel-frequency cepstrum coefficient output by a primary capsule, the vectors of the mel-frequency cepstrum coefficients and the coupling coefficients of the preset capsule network model are obtained.
Sub-step S513: based on the second preset formula of the capsule network model, the vectors, and the coupling coefficients, the intermediate vector of the mel-frequency cepstrum coefficients is obtained, where the dynamic routing formulas include the first preset formula and the second preset formula.
Through the second preset formula, s = Σ c · û, where s is the intermediate vector of the mel-frequency cepstrum coefficients input to an advanced capsule, c is a coupling coefficient, and û is the vector of a mel-frequency cepstrum coefficient, the intermediate vector of the mel-frequency cepstrum coefficients is obtained. The first preset formula and the second preset formula together constitute the dynamic routing formulas of the preset capsule network model.
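The two routing formulas of sub-steps S512 and S513 can be sketched as follows. This is an illustrative sketch: the capsule counts and vector dimensions are assumptions (only the 8-primary/3-advanced counts echo the patent's later example), and the uniform-softmax initialization of the coupling coefficients follows standard capsule-network practice rather than anything stated in the patent.

```python
import numpy as np

num_primary, num_advanced, in_dim, out_dim = 8, 3, 4, 6   # assumed shapes
rng = np.random.default_rng(0)
W = rng.normal(size=(num_primary, num_advanced, out_dim, in_dim))  # weight values
u = rng.normal(size=(num_primary, in_dim))   # primary-capsule outputs (coefficients)

# First preset formula: u_hat = w . u  (one prediction vector per pair i, j).
u_hat = np.einsum('ijok,ik->ijo', W, u)

# Coupling coefficients c: softmax over the advanced capsules, so each
# primary capsule's couplings sum to 1 (uniform here, before any routing update).
b = np.zeros((num_primary, num_advanced))
c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)

# Second preset formula: s = sum_i c_i . u_hat_i  (intermediate vector per
# advanced capsule).
s = np.einsum('ij,ijo->jo', c, u_hat)
```

Each row of `s` is the intermediate vector that one advanced capsule receives.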
Step S52: based on the activation function of the advanced capsules and the intermediate vectors, the vector norms of the mel-frequency cepstrum coefficients output by the advanced capsules are obtained.
The terminal inputs the intermediate vector of each acquired mel-frequency cepstrum coefficient into the advanced capsules, obtains the activation function in the advanced capsules, and computes each intermediate vector through the activation function, obtaining the vector norm of each mel-frequency cepstrum coefficient output by the advanced capsules.
For example, when there are 8 primary capsules and 3 advanced capsules, the 8 primary capsules each input their mel-frequency cepstrum coefficients toward advanced capsule 1; the intermediate vectors of the 8 primary capsules' mel-frequency cepstrum coefficients are computed separately through the dynamic routing formulas of the preset capsule network model; the computed intermediate vectors are input into advanced capsule 1, and the vector norms of the 8 mel-frequency cepstrum coefficients are computed through the activation function of advanced capsule 1.
The 8 primary capsules then each input their mel-frequency cepstrum coefficients toward advanced capsule 2; the intermediate vectors of the 8 primary capsules' mel-frequency cepstrum coefficients are computed separately through the dynamic routing formulas of the preset capsule network model and input into advanced capsule 2, and the vector norms of the 8 mel-frequency cepstrum coefficients are computed through the activation function of advanced capsule 2. Likewise, the computed intermediate vectors are input into advanced capsule 3, and the vector norms of the 8 mel-frequency cepstrum coefficients are computed through the activation function of advanced capsule 3.
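The activation step of Step S52 can be sketched as follows. The patent does not name the activation function; the standard capsule-network "squash" nonlinearity is assumed here. It maps an intermediate vector to an output vector whose norm lies in [0, 1), so the norm can be read as a confidence.

```python
import numpy as np

def squash(s: np.ndarray) -> np.ndarray:
    """Assumed activation: v = (|s|^2 / (1 + |s|^2)) * s / |s|."""
    norm_sq = np.sum(s ** 2)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + 1e-12)

v = squash(np.array([3.0, 4.0]))    # intermediate vector with |s| = 5
vector_norm = np.linalg.norm(v)     # 25/26, i.e. just under 1
```

Long intermediate vectors are squashed to norms near 1 and short ones to norms near 0, which is what makes the later norm comparison meaningful.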
Step S53: when the vector norms of the mel-frequency cepstrum coefficients output by the multiple advanced capsules are obtained, the vector norms are compared, and the target advanced capsule that outputs the largest vector norm is marked.
When the vector norms of the multiple mel-frequency cepstrum coefficients output by each advanced capsule are obtained, the vector norms are compared, the advanced capsule that outputs the largest vector norm is marked, and the marked advanced capsule is taken as the target advanced capsule. Each advanced capsule corresponds to a labeled voice type.
Step S54: the identification type of the target advanced capsule is output through the output layer to obtain the type of the composite voice.
The identification type of the target advanced capsule is output through the output layer. Each advanced capsule is labeled with a voice type; for example, the type labeled on advanced capsule 1 is a dog barking and the type labeled on advanced capsule 2 is glass breaking, or the type labeled on advanced capsule 1 is a dog barking together with glass breaking. The type labeled on an advanced capsule may be a single voice type or multiple voice types.
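Steps S53 and S54 reduce to an argmax over the output vector norms. A minimal sketch (the labels reuse the examples in the text; the norm values are invented for illustration):

```python
# Hypothetical label per advanced capsule, following the text's examples.
capsule_labels = {0: "dog barking",
                  1: "glass breaking",
                  2: "dog barking and glass breaking"}

vector_norms = [0.31, 0.87, 0.12]   # invented norms from 3 advanced capsules

# Step S53: mark the capsule with the largest vector norm.
target = max(range(len(vector_norms)), key=vector_norms.__getitem__)
# Step S54: output that capsule's identification type.
detected_type = capsule_labels[target]   # -> "glass breaking"
```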
In the composite voice recognition method provided by the above embodiment, the mel-frequency cepstrum coefficient of each spectrum in the time-frequency diagram is obtained within the preset capsule network model, the vector norm of each mel-frequency cepstrum coefficient is calculated, and the identification type of the advanced capsule with the largest vector norm is obtained based on those vector norms. The composite voice is rendered as an image and processed by the capsule network model, so that the sound signal and the image are combined in the computation and the type of the composite voice is obtained quickly.
Please refer to Fig. 8, which is a schematic block diagram of a composite voice recognition apparatus provided by an embodiment of this application.
As shown in Fig. 8, the composite voice recognition apparatus 400 includes: a detection module 401, a first acquisition module 402, a generation module 403, a second acquisition module 404, and a third acquisition module 405.
The detection module 401 is configured to detect the composite voice within a preset range in real time or periodically.
The first acquisition module 402 is configured to acquire the sound signal of the composite voice when the composite voice is detected.
The generation module 403 is configured to apply a short-time Fourier transform to the sound signal and generate the time-frequency diagram of the composite voice.
The second acquisition module 404 is configured to extract multiple spectrograms of the time-frequency diagram based on the preset capsule network model and obtain the mel-frequency cepstrum coefficient of each spectrogram.
The third acquisition module 405 is configured to calculate the vector norm of each mel-frequency cepstrum coefficient through the preset capsule network model and determine the type of the composite voice according to the vector norm of each mel-frequency cepstrum coefficient.
In one embodiment, as shown in Fig. 9, the first acquisition module 402 includes:
a first invoking submodule 4021, configured to invoke a preset sample rate when the composite voice is detected;
a determining submodule 4022, configured to determine the sampling time interval of the preset sample rate through a preset formula and the preset sample rate; and
a first acquisition submodule 4023, configured to sample the composite voice based on the sampling time interval and obtain the discrete signal of the composite voice.
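The determining submodule's "preset formula" is not written out in the text; the usual reciprocal relation between sample rate and sampling interval, T = 1 / fs, is assumed in this sketch:

```python
def sampling_interval(sample_rate_hz: float) -> float:
    """Assumed preset formula: interval T = 1 / fs."""
    return 1.0 / sample_rate_hz

interval = sampling_interval(16000)   # 16 kHz -> 62.5 microseconds per sample
```

Sampling the composite voice every `interval` seconds yields the discrete signal acquired by submodule 4023.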
In one embodiment, as shown in Fig. 10, the generation module 403 includes:
a reading submodule 4031, configured to read preset frame duration information and frame shift information if the discrete signal is acquired;
an obtaining submodule 4032, configured to preprocess the discrete signal according to the frame duration information and the frame shift information and obtain multiple short-time analysis signals; and
a generating submodule 4033, configured to apply a Fourier transform to the multiple short-time analysis signals and generate the time-frequency diagram of the composite voice.
Please refer to Fig. 11, which is a schematic block diagram of another composite voice recognition apparatus provided by an embodiment of this application.
As shown in Fig. 11, the composite voice recognition apparatus 500 includes: a detection module 501, a first acquisition module 502, a generation module 503, a second invoking submodule 504, an extraction submodule 505, a second acquisition submodule 506, and a third acquisition module 507.
The detection module 501 is configured to detect the composite voice within a preset range in real time or periodically.
The first acquisition module 502 is configured to acquire the sound signal of the composite voice when the composite voice is detected.
The generation module 503 is configured to apply a short-time Fourier transform to the sound signal and generate the time-frequency diagram of the composite voice.
The second invoking submodule 504 is configured to invoke the preset capsule network model if the time-frequency diagram of the composite voice is acquired, where the preset capsule network model includes a convolutional layer, primary capsules, advanced capsules, and an output layer.
The extraction submodule 505 is configured to, when the time-frequency diagram is input into the preset capsule network model, frame the time-frequency diagram through the convolution kernels of the convolutional layer and extract multiple spectra of the time-frequency diagram.
The second acquisition submodule 506 is configured to filter the extracted spectra through the preset group of filter functions and obtain the mel-frequency cepstrum coefficient of each spectrum.
The third acquisition module 507 is configured to calculate the vector norm of each mel-frequency cepstrum coefficient through the preset capsule network model and determine the type of the composite voice according to the vector norm of each mel-frequency cepstrum coefficient.
In one embodiment, as shown in Fig. 12, the second acquisition submodule 506 includes:
a first obtaining subunit 5061, configured to, when multiple spectra are extracted, filter the spectra through the preset group of filter functions in the convolutional layer and obtain the mel-frequency cepstrum of each spectrum, where a spectrum is composed of an envelope and spectral detail; and
a second obtaining subunit 5062, configured to perform cepstral analysis on each mel-frequency cepstrum through the primary capsules, obtain the cepstrum coefficients of the envelopes, and take the cepstrum coefficients of the envelopes as the mel-frequency cepstrum coefficients.
Please refer to Fig. 13, which is a schematic block diagram of a further composite voice recognition apparatus provided by an embodiment of this application.
As shown in Fig. 13, the composite voice recognition apparatus 600 includes: a detection module 601, a first acquisition module 602, a generation module 603, a second acquisition module 604, a third acquisition submodule 605, a fourth acquisition submodule 606, a marking submodule 607, and a fifth acquisition submodule 608.
The detection module 601 is configured to detect the composite voice within a preset range in real time or periodically.
The first acquisition module 602 is configured to acquire the sound signal of the composite voice when the composite voice is detected.
The generation module 603 is configured to apply a short-time Fourier transform to the sound signal and generate the time-frequency diagram of the composite voice.
The second acquisition module 604 is configured to extract multiple spectrograms of the time-frequency diagram based on the preset capsule network model and obtain the mel-frequency cepstrum coefficient of each spectrogram.
The third acquisition submodule 605 is configured to, when the multiple primary capsules each forward-propagate the mel-frequency cepstrum coefficients to the advanced capsules, obtain the intermediate vector of the mel-frequency cepstrum coefficients through the dynamic routing formulas of the preset capsule network.
The fourth acquisition submodule 606 is configured to obtain the vector norms of the mel-frequency cepstrum coefficients output by the advanced capsules based on the activation function of the advanced capsules and the intermediate vectors.
The marking submodule 607 is configured to, when the vector norms of the mel-frequency cepstrum coefficients output by the multiple advanced capsules are obtained, compare the vector norms of the multiple mel-frequency cepstrum coefficients and mark the target advanced capsule that outputs the largest vector norm.
The fifth acquisition submodule 608 is configured to output the identification type of the target advanced capsule through the output layer and obtain the type of the composite voice signal.
In one embodiment, as shown in Fig. 14, the third acquisition submodule 605 includes:
a third obtaining subunit 6051, configured to obtain the weight values of the capsule network model when a primary capsule forward-propagates the mel-frequency cepstrum coefficient to the advanced capsules;
a fourth obtaining subunit 6052, configured to obtain the vectors of the mel-frequency cepstrum coefficients based on the first preset formula of the capsule network model and the weight values, and obtain the coupling coefficients of the capsule network model; and
a fifth obtaining subunit 6053, configured to obtain the intermediate vector of the mel-frequency cepstrum coefficients based on the second preset formula of the capsule network model, the vectors, and the coupling coefficients, where the dynamic routing formulas include the first preset formula and the second preset formula.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and of each module and unit described above may refer to the corresponding processes in the foregoing embodiments of the composite voice recognition method, and are not repeated here.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, and the computer program can run on a computer device as shown in Fig. 15.
Please refer to Fig. 15, which is a schematic structural block diagram of a computer device provided by an embodiment of this application. The computer device may be a terminal.
As shown in Fig. 15, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions which, when executed, may cause the processor to perform any one of the composite voice recognition methods.
The processor provides computing and control capability and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer program in the non-volatile storage medium; when the computer program is executed by the processor, the processor may be caused to perform any one of the composite voice recognition methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in Fig. 15 is merely a block diagram of part of the structure relevant to the solution of this application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, may combine certain components, or may have a different arrangement of components.
It should be understood that the processor may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In one embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
detecting the composite voice within a preset range in real time or periodically;
acquiring the sound signal of the composite voice signal when the composite voice signal is detected;
applying a short-time Fourier transform to the sound signal to generate the time-frequency diagram of the composite voice;
extracting multiple spectra of the time-frequency diagram based on the preset capsule network model to obtain the mel-frequency cepstrum coefficient of each spectrum; and
calculating the vector norm of each mel-frequency cepstrum coefficient through the preset capsule network model, and determining the type of the composite voice according to the vector norm of each mel-frequency cepstrum coefficient.
In one embodiment, when acquiring the sound signal of the composite voice signal upon detecting the composite voice signal, the processor is configured to implement: invoking a preset sample rate when the composite voice is detected;
determining the sampling time interval of the preset sample rate through a preset formula and the preset sample rate; and
sampling the composite voice based on the sampling time interval to obtain the discrete signal of the composite voice.
In one embodiment, when applying the short-time Fourier transform to the sound signal and generating the time-frequency diagram of the composite voice, the processor is configured to implement: reading preset frame duration information and frame shift information if the discrete signal is acquired;
preprocessing the discrete signal according to the frame duration information and the frame shift information to obtain multiple short-time analysis signals; and
applying a Fourier transform to the multiple short-time analysis signals to generate the time-frequency diagram of the composite voice.
In another embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
detecting the composite voice within a preset range in real time or periodically;
acquiring the sound signal of the composite voice when the composite voice is detected;
applying a short-time Fourier transform to the sound signal to generate the time-frequency diagram of the composite voice;
invoking the preset capsule network model if the time-frequency diagram of the composite voice is acquired, where the preset capsule network model includes a convolutional layer, primary capsules, advanced capsules, and an output layer;
framing the time-frequency diagram through the convolution kernels of the convolutional layer when the time-frequency diagram is input into the preset capsule network model, and extracting multiple spectra of the time-frequency diagram;
filtering the extracted spectra through the preset group of filter functions to obtain the mel-frequency cepstrum coefficient of each spectrum; and
calculating the vector norm of each mel-frequency cepstrum coefficient through the preset capsule network model, and determining the type of the composite voice according to the vector norm of each mel-frequency cepstrum coefficient.
In one embodiment, when calculating the vector norm of each mel-frequency cepstrum coefficient through the preset capsule network model and determining the type of the composite voice according to the vector norms, the processor is configured to implement:
filtering the multiple extracted spectra through the preset group of filter functions in the convolutional layer to obtain the mel-frequency cepstrum of each spectrum, where a spectrum is composed of an envelope and spectral detail; and
performing cepstral analysis on each mel-frequency cepstrum through the primary capsules to obtain the cepstrum coefficients of the envelopes, and taking the cepstrum coefficients of the envelopes as the mel-frequency cepstrum coefficients.
Wherein, in one embodiment, the processor is for running computer program stored in memory, with reality
Existing following steps:
In real time or timing detects the combination speech in preset enclose;
When detecting the combination speech, the voice signal of the combination speech is obtained;
Short Time Fourier Transform is carried out to the voice signal, generates the time-frequency figure of the combination speech;
Based on preset capsule network model, multiple spectrograms of the time-frequency figure are extracted, obtain each spectrogram
Mel-frequency cepstrum coefficient;
When multiple primary capsules respectively to mel-frequency cepstrum coefficient described in the advanced capsule propagated forward when, lead to
The dynamic routing formula for crossing the preset capsule network, obtains the intermediate vector of the mel-frequency cepstrum coefficient;
Activation primitive and the intermediate vector based on the advanced capsule obtain the plum of the advanced capsule output
The vector mould of your frequency cepstral coefficient;
It is more by comparing in the vector mould for the mel-frequency cepstrum coefficient for getting multiple advanced capsule outputs
The vector mould of a mel-frequency cepstrum coefficient, the target higher capsule of label output maximum vector mould;
The identity type that the target higher capsule is exported by the output layer, obtains the class of the composite voice signal
Type.
In one embodiment, when obtaining the vector norm of the mel-frequency cepstral coefficients output by the advanced capsule based on the activation function of the advanced capsule and the intermediate vector, the processor is configured to implement:
when the primary capsules forward-propagate the mel-frequency cepstral coefficients to the advanced capsules, obtaining the weight values of the capsule network model;
obtaining the vectors of the mel-frequency cepstral coefficients based on a first preset formula of the capsule network model and the weight values, and obtaining the coupling coefficients of the capsule network model;
obtaining the intermediate vector of the mel-frequency cepstral coefficients based on a second preset formula of the capsule network model, the coupling coefficients, and the vectors, wherein the dynamic routing formula includes the first preset formula and the second preset formula.
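The first and second preset formulas are not given explicitly in this excerpt. Assuming they follow the standard dynamic-routing formulation for capsule networks (prediction vectors from weight matrices, then a coupling-coefficient-weighted sum squashed by the activation), the routine can be sketched as below; all shapes and the number of routing iterations are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s):
    """Capsule activation: keep the direction, shrink the norm into [0, 1)."""
    norm2 = np.sum(s ** 2)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + 1e-9)

def routing(u, W, iterations=3):
    """Dynamic routing between primary and advanced capsules.
    u: (n_in, d_in) primary outputs; W: (n_in, n_out, d_out, d_in) weights."""
    n_out = W.shape[1]
    # Assumed "first preset formula": prediction vectors u_hat = W . u
    u_hat = np.einsum('iokd,id->iok', W, u)
    b = np.zeros((W.shape[0], n_out))           # routing logits
    for _ in range(iterations):
        c = softmax(b, axis=1)                  # coupling coefficients
        # Assumed "second preset formula": intermediate vector s_j = sum_i c_ij * u_hat_ij
        s = np.einsum('io,iok->ok', c, u_hat)
        v = np.stack([squash(s[j]) for j in range(n_out)])
        b = b + np.einsum('iok,ok->io', u_hat, v)  # agreement update
    return v

rng = np.random.default_rng(0)
v = routing(rng.normal(size=(4, 8)), rng.normal(size=(4, 2, 6, 8)) * 0.1)
print(v.shape)
```

Because of the squash activation, every output capsule's vector norm lies below 1, so the norms can be compared directly as confidence scores.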
An embodiment of the present application also provides a computer-readable storage medium. A computer program is stored on the computer-readable storage medium, and the computer program includes program instructions. For the method implemented when the program instructions are executed, reference may be made to the embodiments of the composite speech recognition method of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SmartMedia Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device.
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements not only includes those elements but also includes other elements not explicitly listed, or further includes elements inherent to such a process, method, article, or system. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes that element.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Any equivalent structural or equivalent process transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. A composite speech recognition method, characterized in that the composite speech recognition method comprises:
detecting composite speech within a preset range in real time or periodically;
when the composite speech is detected, acquiring the speech signal of the composite speech;
performing a short-time Fourier transform on the speech signal to generate a time-frequency diagram of the composite speech;
extracting multiple spectra of the time-frequency diagram based on a preset capsule network model, and obtaining the mel-frequency cepstral coefficients of each spectrum;
calculating the vector norm of each mel-frequency cepstral coefficient through the preset capsule network model, and determining the type of the composite speech according to the vector norm of each mel-frequency cepstral coefficient.
2. The composite speech recognition method according to claim 1, characterized in that, when the composite speech signal is detected, acquiring the speech signal of the composite speech signal comprises:
when the composite speech is detected, retrieving a preset sample rate;
determining the sampling time interval of the preset sample rate through a preset formula and the preset sample rate;
sampling the composite speech based on the sampling time interval to obtain a discrete signal of the composite speech signal.
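The "preset formula" relating the sample rate to the sampling time interval is presumably the standard reciprocal relation T = 1/f_s; the sketch below assumes exactly that, and the 5 Hz test signal and 100 Hz rate are illustrative only.

```python
import math

def sampling_interval(sample_rate_hz):
    """Assumed "preset formula": the sampling time interval is the
    reciprocal of the preset sample rate, T = 1 / f_s."""
    return 1.0 / sample_rate_hz

def sample_signal(analog, duration_s, sample_rate_hz):
    """Acquire a discrete signal by evaluating the signal every interval T."""
    T = sampling_interval(sample_rate_hz)
    n = int(duration_s * sample_rate_hz)
    return [analog(k * T) for k in range(n)]

# Illustrative 5 Hz tone sampled for one second at 100 Hz
discrete = sample_signal(lambda t: math.sin(2 * math.pi * 5 * t), 1.0, 100)
print(len(discrete))
```

At a 16 kHz sample rate the same formula gives an interval of 62.5 microseconds between consecutive samples.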
3. The composite speech recognition method according to claim 2, characterized in that performing a short-time Fourier transform on the speech signal to generate the time-frequency diagram of the composite speech comprises:
if the discrete signal is obtained, reading preset frame duration information and frame shift information;
pre-processing the discrete signal according to the frame duration information and the frame shift information to obtain multiple short-time analysis signals;
performing a Fourier transform on the multiple short-time analysis signals to generate the time-frequency diagram of the composite speech.
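The framing step of claim 3 can be sketched as follows. The claim leaves the frame duration and frame shift as "preset" values; the 25 ms / 10 ms defaults below are common choices in speech processing, not values taken from the patent.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, shift_ms=10):
    """Pre-process a discrete signal into overlapping short-time analysis
    frames using frame duration and frame shift information."""
    frame_len = int(fs * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)       # e.g. 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    return np.stack([x[i * shift : i * shift + frame_len] for i in range(n_frames)])

fs = 16000
frames = frame_signal(np.zeros(fs), fs)     # one second of silence
print(frames.shape)
# Each row would then be Fourier-transformed to build the time-frequency diagram.
```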
4. The composite speech recognition method according to claim 1 or 3, characterized in that extracting multiple spectra of the time-frequency diagram based on a preset capsule network model and obtaining the mel-frequency cepstral coefficients of each spectrum comprises:
if the time-frequency diagram of the composite speech is obtained, retrieving the preset capsule network model, wherein the preset capsule network model includes a convolutional layer, primary capsules, advanced capsules, and an output layer;
when the time-frequency diagram is input into the preset capsule network model, framing the time-frequency diagram through the convolution kernels of the convolutional layer, and extracting multiple spectra of the time-frequency diagram;
filtering the multiple extracted spectra through a preset filter function group to obtain the mel-frequency cepstral coefficients of each spectrum.
5. The composite speech recognition method according to claim 4, characterized in that filtering the multiple extracted spectra through a preset filter function group to obtain the mel-frequency cepstral coefficients of each spectrum comprises:
when the multiple spectra are extracted, filtering the multiple spectra through the preset filter function group in the convolutional layer to obtain the mel-frequency cepstrum of each spectrum, wherein a spectrum is composed of an envelope and spectral details;
performing cepstral analysis on each mel-frequency cepstrum through the primary capsules to obtain the cepstral coefficients of multiple envelopes, and taking the cepstral coefficients of the envelopes as the mel-frequency cepstral coefficients.
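The filter function group and cepstral analysis of claim 5 can be sketched assuming a standard triangular mel filterbank followed by a logarithm and a DCT-II, which is the usual way the envelope's cepstral coefficients are separated from the spectral details. The filter count (26) and coefficient count (13) below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the mel scale."""
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    inv = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = inv(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_from_power_spectrum(power, fb, n_coeffs=13):
    """Cepstral analysis: log of mel energies, then a DCT-II; keeping the
    low-order terms retains the spectral envelope."""
    log_mel = np.log(fb @ power + 1e-10)
    n = len(log_mel)
    k = np.arange(n)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1) / (2 * n))
    return dct @ log_mel

fs, n_fft = 16000, 512
fb = mel_filterbank(26, n_fft, fs)
coeffs = mfcc_from_power_spectrum(np.ones(n_fft // 2 + 1), fb)
print(coeffs.shape)
```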
6. The composite speech recognition method according to claim 5, characterized in that calculating the vector norm of each mel-frequency cepstral coefficient through the preset capsule network model and obtaining the type of the composite speech signal comprises:
when multiple primary capsules each forward-propagate the mel-frequency cepstral coefficients to the advanced capsules, obtaining the intermediate vectors of the mel-frequency cepstral coefficients through the dynamic routing formula of the preset capsule network;
obtaining the vector norms of the mel-frequency cepstral coefficients output by the advanced capsules based on the activation function of the advanced capsules and the intermediate vectors;
when the vector norms of the mel-frequency cepstral coefficients output by the multiple advanced capsules are obtained, comparing the vector norms of the multiple mel-frequency cepstral coefficients, and marking the target advanced capsule that outputs the maximum vector norm;
outputting the identity type of the target advanced capsule through the output layer to obtain the type of the composite speech.
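Assuming the advanced capsules' activation function is the usual "squash" non-linearity from capsule networks, the norm-comparison step of claim 6 can be sketched as below; the type labels and the intermediate vectors are purely illustrative.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Assumed activation of an advanced capsule: the output keeps the
    direction of the intermediate vector, with its norm squashed into [0, 1)."""
    norm2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def classify_by_norm(intermediate_vectors, labels):
    """Mark the target advanced capsule with the largest output vector norm."""
    v = squash(np.asarray(intermediate_vectors, dtype=float))
    norms = np.linalg.norm(v, axis=-1)
    return labels[int(np.argmax(norms))], norms

labels = ["speech", "drone", "background"]   # illustrative types only
vecs = [[0.1, 0.1], [2.0, 1.0], [0.5, 0.0]]  # illustrative intermediate vectors
best, norms = classify_by_norm(vecs, labels)
print(best)
```

Because squashing is monotone in the input norm, the capsule whose intermediate vector is longest always wins the comparison, and its norm can be read directly as a confidence in that type.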
7. The composite speech recognition method according to claim 6, characterized in that, when the primary capsules forward-propagate the mel-frequency cepstral coefficients to the advanced capsules, obtaining the intermediate vectors of the mel-frequency cepstral coefficients through the dynamic routing algorithm of the preset capsule network comprises:
when the primary capsules forward-propagate the mel-frequency cepstral coefficients to the advanced capsules, obtaining the weight values of the capsule network model;
obtaining the vectors of the mel-frequency cepstral coefficients based on a first preset formula of the capsule network model and the weight values, and obtaining the coupling coefficients of the capsule network model;
obtaining the intermediate vector of the mel-frequency cepstral coefficients based on a second preset formula of the capsule network model, the coupling coefficients, and the vectors, wherein the dynamic routing formula includes the first preset formula and the second preset formula.
8. A composite speech recognition device, characterized in that the composite speech recognition device comprises:
a detection module, configured to detect composite speech within a preset range in real time or periodically;
a first acquisition module, configured to acquire the speech signal of the composite speech signal when the composite speech is detected;
a generation module, configured to perform a short-time Fourier transform on the speech signal to generate a time-frequency diagram of the composite speech;
a second acquisition module, configured to extract multiple spectra of the time-frequency diagram based on a preset capsule network model, and obtain the mel-frequency cepstral coefficients of each spectrum;
a third acquisition module, configured to calculate the vector norm of each mel-frequency cepstral coefficient through the preset capsule network model, and determine the type of the composite speech according to the vector norm of each mel-frequency cepstral coefficient.
9. A computer device, characterized in that the computer device comprises: a memory, a processor, and a composite speech recognition program stored on the memory and operable on the processor, wherein when the composite speech recognition program is executed by the processor, the steps of the composite speech recognition method according to any one of claims 1 to 7 are implemented.
10. A computer-readable storage medium, characterized in that a composite speech recognition program is stored on the computer-readable storage medium, and when the composite speech recognition program is executed by a processor, the steps of the composite speech recognition method according to any one of claims 1 to 7 are implemented.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910601019.4A CN110444202B (en) | 2019-07-04 | 2019-07-04 | Composite voice recognition method, device, equipment and computer readable storage medium |
PCT/CN2019/118458 WO2021000498A1 (en) | 2019-07-04 | 2019-11-14 | Composite speech recognition method, device, equipment, and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910601019.4A CN110444202B (en) | 2019-07-04 | 2019-07-04 | Composite voice recognition method, device, equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110444202A true CN110444202A (en) | 2019-11-12 |
CN110444202B CN110444202B (en) | 2023-05-26 |
Family
ID=68429517
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910601019.4A Active CN110444202B (en) | 2019-07-04 | 2019-07-04 | Composite voice recognition method, device, equipment and computer readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110444202B (en) |
WO (1) | WO2021000498A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110910893A (en) * | 2019-11-26 | 2020-03-24 | 北京梧桐车联科技有限责任公司 | Audio processing method, device and storage medium |
WO2021000498A1 (en) * | 2019-07-04 | 2021-01-07 | 平安科技(深圳)有限公司 | Composite speech recognition method, device, equipment, and computer-readable storage medium |
CN113450775A (en) * | 2020-03-10 | 2021-09-28 | 富士通株式会社 | Model training device, model training method, and storage medium |
CN114173405A (en) * | 2022-01-17 | 2022-03-11 | 上海道生物联技术有限公司 | Rapid awakening method and system in technical field of wireless communication |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113096649B (en) * | 2021-03-31 | 2023-12-22 | 平安科技(深圳)有限公司 | Voice prediction method, device, electronic equipment and storage medium |
CN116705055B (en) * | 2023-08-01 | 2023-10-17 | 国网福建省电力有限公司 | Substation noise monitoring method, system, equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107564530A (en) * | 2017-08-18 | 2018-01-09 | 浙江大学 | Drone detection method based on voiceprint energy features |
CN107993648A (en) * | 2017-11-27 | 2018-05-04 | 北京邮电大学 | Drone identification method and device, and electronic equipment |
CN108281146A (en) * | 2017-12-29 | 2018-07-13 | 青岛真时科技有限公司 | Short-utterance speaker recognition method and device |
CN109146066A (en) * | 2018-11-01 | 2019-01-04 | 重庆邮电大学 | Natural interaction method for a collaborative virtual learning environment based on speech emotion recognition |
CN109147818A (en) * | 2018-10-30 | 2019-01-04 | Oppo广东移动通信有限公司 | Acoustic feature extraction method and device, storage medium, and terminal device |
CN109410917A (en) * | 2018-09-26 | 2019-03-01 | 河海大学常州校区 | Speech data classification method based on an improved capsule network |
CN109559755A (en) * | 2018-12-25 | 2019-04-02 | 沈阳品尚科技有限公司 | Speech enhancement method based on DNN noise classification |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201416303D0 (en) * | 2014-09-16 | 2014-10-29 | Univ Hull | Speech synthesis |
CN108766419B (en) * | 2018-05-04 | 2020-10-27 | 华南理工大学 | Abnormal speech discrimination method based on deep learning |
CN108922559A (en) * | 2018-07-06 | 2018-11-30 | 华南理工大学 | Recording terminal clustering method based on speech time-frequency transform features and integer linear programming |
CN109523993B (en) * | 2018-11-02 | 2022-02-08 | 深圳市网联安瑞网络科技有限公司 | Voice language classification method based on CNN and GRU fusion deep neural network |
CN110444202B (en) * | 2019-07-04 | 2023-05-26 | 平安科技(深圳)有限公司 | Composite voice recognition method, device, equipment and computer readable storage medium |
-
2019
- 2019-07-04 CN CN201910601019.4A patent/CN110444202B/en active Active
- 2019-11-14 WO PCT/CN2019/118458 patent/WO2021000498A1/en active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107564530A (en) * | 2017-08-18 | 2018-01-09 | 浙江大学 | Drone detection method based on voiceprint energy features |
CN107993648A (en) * | 2017-11-27 | 2018-05-04 | 北京邮电大学 | Drone identification method and device, and electronic equipment |
CN108281146A (en) * | 2017-12-29 | 2018-07-13 | 青岛真时科技有限公司 | Short-utterance speaker recognition method and device |
CN109410917A (en) * | 2018-09-26 | 2019-03-01 | 河海大学常州校区 | Speech data classification method based on an improved capsule network |
CN109147818A (en) * | 2018-10-30 | 2019-01-04 | Oppo广东移动通信有限公司 | Acoustic feature extraction method and device, storage medium, and terminal device |
CN109146066A (en) * | 2018-11-01 | 2019-01-04 | 重庆邮电大学 | Natural interaction method for a collaborative virtual learning environment based on speech emotion recognition |
CN109559755A (en) * | 2018-12-25 | 2019-04-02 | 沈阳品尚科技有限公司 | Speech enhancement method based on DNN noise classification |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021000498A1 (en) * | 2019-07-04 | 2021-01-07 | 平安科技(深圳)有限公司 | Composite speech recognition method, device, equipment, and computer-readable storage medium |
CN110910893A (en) * | 2019-11-26 | 2020-03-24 | 北京梧桐车联科技有限责任公司 | Audio processing method, device and storage medium |
CN113450775A (en) * | 2020-03-10 | 2021-09-28 | 富士通株式会社 | Model training device, model training method, and storage medium |
CN114173405A (en) * | 2022-01-17 | 2022-03-11 | 上海道生物联技术有限公司 | Rapid awakening method and system in technical field of wireless communication |
CN114173405B (en) * | 2022-01-17 | 2023-11-03 | 上海道生物联技术有限公司 | Rapid wake-up method and system in wireless communication technical field |
Also Published As
Publication number | Publication date |
---|---|
CN110444202B (en) | 2023-05-26 |
WO2021000498A1 (en) | 2021-01-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110444202A (en) | Combination speech recognition methods, device, equipment and computer readable storage medium | |
CN108597492B (en) | Speech synthesis method and device | |
CN108053838B (en) | Fraud recognition method, device and storage medium combining audio analysis and video analysis | |
Mesgarani et al. | Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations | |
Yang et al. | Psychoacoustical evaluation of natural and urban sounds in soundscapes | |
CN110457432A (en) | Interview scoring method, device, equipment and storage medium | |
CN110880329B (en) | Audio identification method and equipment and storage medium | |
CN107086040A (en) | Speech recognition capabilities method of testing and device | |
CN110047514A (en) | Accompaniment purity assessment method and related device | |
Mittal et al. | Analysis of production characteristics of laughter | |
CN110970036B (en) | Voiceprint recognition method and device, computer storage medium and electronic equipment | |
CN110738998A (en) | Voice-based personal credit evaluation method, device, terminal and storage medium | |
CN109800720A (en) | Emotion recognition model training method, emotion recognition method, apparatus, equipment and storage medium | |
Reddy et al. | A comparison of cepstral features in the detection of pathological voices by varying the input and filterbank of the cepstrum computation | |
Chaki | Pattern analysis based acoustic signal processing: a survey of the state-of-art | |
Hsu et al. | Robust voice activity detection algorithm based on feature of frequency modulation of harmonics and its DSP implementation | |
AU2020210905A1 (en) | Systems and methods for pre-filtering audio content based on prominence of frequency content | |
CN109147146B (en) | Voice number taking method and terminal equipment | |
CN112185342A (en) | Voice conversion and model training method, device and system and storage medium | |
CN111145726A (en) | Deep learning-based sound scene classification method, system, device and storage medium | |
Ling He et al. | Recognition of stress in speech using wavelet analysis and Teager energy operator | |
CN115938364A (en) | Intelligent identification control method, terminal equipment and readable storage medium | |
Sahoo et al. | Analyzing the vocal tract characteristics for out-of-breath speech | |
CN111599345B (en) | Speech recognition algorithm evaluation method, system, mobile terminal and storage medium | |
CN112908299B (en) | Customer demand information identification method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |