CN110444202A - Combination speech recognition methods, device, equipment and computer readable storage medium - Google Patents
- Publication number
- CN110444202A CN110444202A CN201910601019.4A CN201910601019A CN110444202A CN 110444202 A CN110444202 A CN 110444202A CN 201910601019 A CN201910601019 A CN 201910601019A CN 110444202 A CN110444202 A CN 110444202A
- Authority
- CN
- China
- Prior art keywords
- frequency
- preset
- mel
- combination speech
- capsule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L15/06 — Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/22 — Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/18 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
- G10L25/24 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/27 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique
Abstract
The present invention relates to the field of artificial intelligence and uses deep learning, via a capsule network model, to identify the sound types contained in a composite voice signal. Specifically disclosed are a composite speech recognition method, apparatus, computer device and computer-readable storage medium. The method includes: detecting, in real time or periodically, composite speech within a preset range; when the composite speech is detected, acquiring the voice signal of the composite speech; performing a short-time Fourier transform on the voice signal to generate a time-frequency diagram of the composite voice signal; extracting, based on a preset capsule network model, multiple spectra from the time-frequency diagram and obtaining the mel-frequency cepstral coefficient of each spectrum; and calculating, through the preset capsule network model, the vector modulus of each mel-frequency cepstral coefficient and determining the type of the composite speech according to the vector modulus of each mel-frequency cepstral coefficient.
Description
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a composite speech recognition method, apparatus, device and computer-readable storage medium.
Background technique
The goal of sound event detection is to automatically detect the onset and end time of particular events from sound, and to assign a label to each event. With the assistance of this technology, a computer can understand its surroundings through sound and respond to them. Sound event detection has broad application prospects in daily life, including audio surveillance, bioacoustic monitoring and smart homes. Depending on whether multiple sound events are allowed to occur simultaneously, the task is divided into single-sound and composite-sound event detection. In single-sound event detection, each individual sound event has a definite frequency and amplitude in the spectrum; in composite-sound event detection, however, these frequencies or amplitudes may overlap. Existing sound detection technology mainly detects and identifies single sounds, and cannot identify the types of overlapping composite sounds that occur simultaneously.
Summary of the invention
The main purpose of the present invention is to provide a composite speech recognition method, apparatus, device and computer-readable storage medium, aiming to solve the problem that existing sound detection technology cannot identify the types of overlapping composite sounds that occur simultaneously.
In a first aspect, the present application provides a composite speech recognition method, the composite speech recognition method including:
detecting, in real time or periodically, composite speech within a preset range;
when the composite speech is detected, acquiring the voice signal of the composite speech;
performing a short-time Fourier transform on the voice signal to generate a time-frequency diagram of the composite voice signal;
extracting, based on a preset capsule network model, multiple spectra from the time-frequency diagram and obtaining the mel-frequency cepstral coefficient of each spectrum;
calculating, through the preset capsule network model, the vector modulus of each mel-frequency cepstral coefficient, and determining the type of the composite speech according to the vector modulus of each mel-frequency cepstral coefficient.
In a second aspect, the present application further provides a composite speech recognition apparatus, the composite speech recognition apparatus including:
a detection unit, configured to detect, in real time or periodically, composite speech within a preset range;
a first acquisition module, configured to acquire the voice signal of the composite speech when the composite speech is detected;
a generation module, configured to perform a short-time Fourier transform on the voice signal to generate a time-frequency diagram of the composite speech;
a second acquisition module, configured to extract, based on a preset capsule network model, multiple spectra from the time-frequency diagram and obtain the mel-frequency cepstral coefficient of each spectrum;
a third acquisition module, configured to calculate, through the preset capsule network model, the vector modulus of each mel-frequency cepstral coefficient, and to determine the type of the composite speech according to the vector modulus of each mel-frequency cepstral coefficient.
In a third aspect, the present application further provides a computer device, the computer device including a memory, a processor, and a composite speech recognition program stored on the memory and runnable on the processor, where the composite speech recognition program, when executed by the processor, implements the steps of the composite speech recognition method described above.
In a fourth aspect, the present application further provides a computer-readable storage medium on which a composite speech recognition program is stored, where the composite speech recognition program, when executed by a processor, implements the steps of the composite speech recognition method described above.
The composite speech recognition method, apparatus, device and computer-readable storage medium proposed by the embodiments of the present invention detect, in real time or periodically, composite speech within a preset range; acquire the voice signal of the composite speech when it is detected; perform a short-time Fourier transform on the voice signal to generate a time-frequency diagram of the composite voice signal; extract, based on a preset capsule network model, multiple spectra from the time-frequency diagram and obtain the mel-frequency cepstral coefficient of each spectrum; and calculate, through the preset capsule network model, the vector modulus of each mel-frequency cepstral coefficient and determine the type of the composite speech according to those vector moduli, thereby identifying the sound types of composite speech through a capsule network model.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed for the description of the embodiments are briefly introduced below. Obviously, the accompanying drawings described below show some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a composite speech recognition method provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of sub-steps of the composite speech recognition method in Fig. 1;
Fig. 3 is a schematic flowchart of sub-steps of the composite speech recognition method in Fig. 1;
Fig. 4 is a schematic flowchart of another composite speech recognition method provided by an embodiment of the present application;
Fig. 5 is a schematic flowchart of sub-steps of the composite speech recognition method in Fig. 4;
Fig. 6 is a schematic flowchart of another composite speech recognition method provided by an embodiment of the present application;
Fig. 7 is a schematic flowchart of sub-steps of the composite speech recognition method in Fig. 6;
Fig. 8 is a schematic block diagram of a composite speech recognition apparatus provided by an embodiment of the present application;
Fig. 9 is a schematic block diagram of sub-modules of the composite speech recognition apparatus in Fig. 8;
Fig. 10 is a schematic block diagram of sub-modules of the composite speech recognition apparatus in Fig. 8;
Fig. 11 is a schematic block diagram of another composite speech recognition apparatus provided by an embodiment of the present application;
Fig. 12 is a schematic block diagram of sub-modules of the composite speech recognition apparatus in Fig. 11;
Fig. 13 is a schematic block diagram of another composite speech recognition apparatus provided by an embodiment of the present application;
Fig. 14 is a schematic block diagram of sub-modules of the composite speech recognition apparatus in Fig. 13;
Fig. 15 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
The realization of the purpose, the functional characteristics and the advantages of the present application will be further described with reference to the accompanying drawings in combination with the embodiments.
Specific embodiment
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
The flowcharts shown in the drawings are merely illustrative: they need not include all the contents and operations/steps, nor need they be executed in the order described. For example, some operations/steps may be decomposed, combined or partially merged, so the actual order of execution may change according to the actual situation.
The embodiments of the present application provide a composite speech recognition method, apparatus, device and computer-readable storage medium. The composite speech recognition method can be applied to a terminal device, which may be a mobile phone, a tablet computer, a notebook computer or a desktop computer.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. In the absence of conflict, the following embodiments and the features in the embodiments can be combined with one another.
Please refer to Fig. 1, which is a schematic flowchart of a composite speech recognition method provided by an embodiment of the present application.
As shown in Fig. 1, the composite speech recognition method includes steps S10 to S50.
Step S10: detect, in real time or periodically, composite speech within a preset range;
The terminal detects, in real time or periodically, composite speech within a preset range. For example, the range the terminal can monitor serves as its preset range: it may be an indoor room, or an outdoor park, etc. The terminal may be set in advance to detect composite speech in a preset room or preset park at all times, or to detect it every hour, where the composite speech contains at least two different mixed voices. It should be noted that the above preset range can be configured according to the actual situation, and the present application does not specifically limit it.
Step S20: when composite speech is detected, acquire the voice signal of the composite speech;
When the terminal detects composite speech, it collects the detected composite speech and obtains its voice signal by analyzing the composite speech; the voice signal includes the frequency, amplitude and time of the sound, etc. For example, when the terminal detects composite speech in which two or more sounds are mixed, it analyzes the detected composite speech through a preset spectral-analysis function or a preset oscilloscope function to collect the sound frequency of the composite speech, and obtains the sound amplitude of the composite speech through a preset decibel meter. With the spectral-analysis function or oscilloscope function preset in the terminal, the sound frequency of the composite speech can be calculated through the preset spectral-analysis function, or its sound amplitude through the preset oscilloscope function.
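As a rough illustration of how such a spectral-analysis function can recover the component frequencies of a mixed sound, the following sketch picks the strongest peaks of an FFT magnitude spectrum. The 440 Hz / 1000 Hz tones and the 8 kHz sample rate are invented for the example; this is not the patent's implementation.

```python
import numpy as np

def dominant_frequencies(signal, sample_rate, n_peaks=2):
    """Return the n_peaks strongest frequency components of a signal,
    found from the magnitude of its real FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Indices of the largest spectral magnitudes, strongest first.
    top = np.argsort(spectrum)[::-1][:n_peaks]
    return sorted(float(f) for f in freqs[top])

# A mixture of a 440 Hz and a 1000 Hz tone, sampled at 8 kHz for 1 s.
sr = 8000
t = np.arange(sr) / sr
mixed = np.sin(2 * np.pi * 440 * t) + 0.8 * np.sin(2 * np.pi * 1000 * t)
print(dominant_frequencies(mixed, sr))  # -> [440.0, 1000.0]
```

Because both tones complete a whole number of cycles in the one-second window, their energy lands in exactly one FFT bin each, so the two peaks are unambiguous.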
In one embodiment, specifically, referring to Fig. 2, step S20 includes sub-steps S21 to S23.
Sub-step S21 transfers preset sample rate when detecting combination speech;
When terminal detects combination speech, preset sample rate is transferred, sample rate is also referred to as sample rate or sampling frequency
Rate defines the number of samples per second extracted from continuous signal and form discrete signal, it is indicated with hertz (Hz), preset
Sample rate can be 40Hz, be also possible to 60Hz etc..It should be noted that above-mentioned preset sample rate can be carried out based on actual conditions
Setting, the application are not especially limited this.
Sub-step S22 determines the sampling time interval of preset sample rate by preset formula and preset sample rate;
Terminal calculates the sampling time interval of preset sample rate by preset formula and preset sample rate, wherein preset public affairs
Formula is sampling time interval=1/ sample rate, by preset sample rate so as to find out the sampling time interval of sample rate.For example, adopting
Sample frequency is 40KHz, then sampled point has 40 × 1000 in 1s, and each sampling period, (sampling period was consistent under normal conditions
) t=1/40 × 1000.
Sub-step S23: sample the composite speech based on the sampling interval to obtain the discrete signal of the composite speech.
The terminal samples the composite speech at the sampling interval and obtains the discrete signal of the composite speech, where the number of samples of the discrete signal is determined by the sampling interval. A discrete signal is a signal sampled from a continuous signal: the independent variable of a continuous signal varies continuously, whereas a discrete signal is a sequence, i.e. its independent variable is discrete, and each value of the sequence can be regarded as one sample of the continuous signal. Processing the composite speech at the preset sampling rate yields a discrete signal of the composite voice signal with better quality.
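Sub-steps S21 to S23 can be sketched as follows, using the 40 kHz figure from the example above; the 50 Hz test tone is invented for illustration.

```python
import numpy as np

SAMPLE_RATE = 40_000                     # 40 kHz, as in the example above
SAMPLE_INTERVAL = 1.0 / SAMPLE_RATE      # preset formula: interval = 1 / rate

# One second of sampling instants at the preset interval yields
# SAMPLE_RATE discrete samples of a (here invented) 50 Hz tone.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
discrete = np.sin(2 * np.pi * 50 * t)

print(SAMPLE_INTERVAL)    # 2.5e-05 s, i.e. t = 1 / (40 * 1000)
print(len(discrete))      # 40000 samples in one second
```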
Step S30: perform a short-time Fourier transform on the voice signal to generate the time-frequency diagram of the composite voice signal;
When the terminal obtains the voice signal of the composite speech, it applies a short-time Fourier transform to the acquired voice signal. The short-time Fourier transform (STFT) is a mathematical transform related to the Fourier transform and is used to determine the frequency and phase of the sine waves in a local region of a time-varying signal. Specifically, the short-time Fourier transform involves a frame shift, a frame duration and a Fourier transform: the acquired voice signal is preprocessed by framing (frame shift and frame duration), and the preprocessed sound is Fourier-transformed, yielding multiple two-dimensional diagrams. By Fourier-transforming the voice signal, the relationship between frequency and amplitude in the composite speech can be obtained; each two-dimensional diagram is a spectrum, and the multiple two-dimensional signals are stacked along a dimension to generate the time-frequency diagram of the composite speech. Each frame in the time-frequency diagram is a spectrum, and the variation of the spectra over time constitutes the time-frequency diagram.
In one embodiment, specifically, referring to Fig. 3, step S30 includes sub-steps S31 to S33.
Sub-step S31: if the discrete signal is obtained, read the preset frame-duration information and frame-shift information;
If the terminal obtains the discrete signal, the short-time Fourier transform involves the frame duration, the frame shift and the Fourier transform. The terminal reads the preset frame-duration information and frame-shift information; for example, the frame duration may be preset to 40 ms, 50 ms, etc., and the frame shift to 20 ms, 30 ms, etc. It should be noted that the preset frame-duration and frame-shift information can be configured according to the actual situation, and the present application does not specifically limit them.
Sub-step S32: preprocess the discrete signal according to the frame-duration information and frame-shift information to obtain multiple short-time analysis signals;
The terminal preprocesses the acquired discrete signals according to the preset frame-duration and frame-shift information to obtain multiple short-time analysis signals. For example, the acquired discrete signal is framed with a frame duration of 40 ms or 50 ms and a frame shift of 20 ms or 30 ms, yielding the short-time analysis signals of each discrete signal.
Sub-step S33: perform a Fourier transform on the multiple short-time analysis signals to generate the time-frequency diagram of the composite speech.
When the terminal obtains the multiple short-time analysis signals, it performs a Fourier transform on each short-time analysis signal, obtains the relationship between frequency and time, and generates a two-dimensional diagram for each; stacking the two-dimensional diagrams along one dimension generates the time-frequency diagram of the composite voice signal. By applying the frame shift, the frame duration and the Fourier transform to the discrete signal, the time-frequency diagram of the composite voice signal is generated, so that the spectra of the composite voice signal and their variation over time can be better obtained from the time-frequency diagram.
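Sub-steps S31 to S33 can be sketched as below, assuming a 40 ms frame duration and a 20 ms frame shift as in the examples above; the 16 kHz sample rate, the Hann window and the test tones are illustrative choices, not mandated by the description.

```python
import numpy as np

def stft_time_freq(signal, sr, frame_ms=40, hop_ms=20):
    """Short-time Fourier transform: split the signal into overlapping
    frames (frame duration / frame shift), window each frame, and take
    its FFT. Each row of the result is one spectrum; stacking the rows
    over time gives the time-frequency diagram."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectra

sr = 16_000
t = np.arange(sr) / sr                           # 1 s of audio
mixed = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 2000 * t)
tf = stft_time_freq(mixed, sr)
print(tf.shape)   # (frames, frequency bins): (49, 321)
```

With a 640-sample frame and a 320-sample hop, one second of 16 kHz audio yields 49 frames, each with 321 frequency bins.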
Step S40: extract, based on the preset capsule network model, multiple spectra from the time-frequency diagram and obtain the mel-frequency cepstral coefficient of each spectrum;
When the terminal obtains the time-frequency diagram of the composite speech, it relies on the preset capsule network model. A capsule network is a new neural network structure including a convolutional layer, primary capsules, advanced capsules, etc., where a capsule is a group of nested neural network layers. In a capsule network, more layers can be added inside a single network layer; specifically, one neural network layer is nested inside another. The states of the neurons in a capsule characterize the attributes of an entity in the image, and the capsule outputs a vector representing the existence of the entity; the orientation of the vector represents the entity's attributes, and the vector is sent to all parent capsules in the neural network. A capsule computes a prediction vector, which is obtained by multiplying its own output by a weight matrix.
The capsule network model extracts the frame signals in the time-frequency diagram, where each frame in the time-frequency diagram represents one spectrum. When multiple spectra of the time-frequency diagram are obtained, the mel-frequency filter function group in the capsule network is retrieved, the spectra are passed through the mel-frequency filter function group, the logarithm of the filter outputs is read, and the logarithm is taken as the mel-frequency cepstral coefficient of the spectrum.
Step S50: calculate, through the preset capsule network model, the vector modulus of each mel-frequency cepstral coefficient, and determine the type of the composite speech according to the vector modulus of each mel-frequency cepstral coefficient.
When the terminal obtains the mel-frequency cepstral coefficient of each spectrum, it retrieves the preset capsule network model, together with the dynamic routing algorithm and the weight matrix in the preset capsule network model. Through the dynamic routing algorithm and the weight matrix, the vector modulus of the mel-frequency cepstral coefficient of each spectrum is calculated; the vector moduli of the acquired mel-frequency cepstral coefficients are compared, and the mel-frequency cepstral coefficient with the largest vector modulus is obtained, so that the sound type corresponding to that mel-frequency cepstral coefficient is obtained and taken as a sound type of the composite speech. Sound types include barking, glass breaking, etc., and the composite speech contains at least two sound types.
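The final decision step can be illustrated as follows. The 16-dimensional capsule output vectors and the three class labels are invented for the example, and the dynamic-routing computation that would produce the vectors is omitted; only the modulus comparison is shown.

```python
import numpy as np

def classify_by_vector_modulus(capsule_outputs, labels):
    """Each class capsule emits a vector; the Euclidean norm (vector
    modulus) of that vector is treated as the evidence that the class
    is present. The capsule with the largest modulus wins."""
    moduli = np.linalg.norm(capsule_outputs, axis=1)
    return labels[int(np.argmax(moduli))], moduli

# Hypothetical 16-dimensional output vectors for three sound classes.
rng = np.random.default_rng(1)
outputs = np.stack([0.2 * rng.standard_normal(16),
                    0.9 * np.ones(16),             # strongest capsule
                    0.1 * rng.standard_normal(16)])
labels = ["bark", "glass breaking", "alarm"]
best, moduli = classify_by_vector_modulus(outputs, labels)
print(best)   # prints: glass breaking
```

For composite speech with several simultaneous sounds, one would threshold the moduli rather than take a single argmax; the sketch shows only the comparison described in the text.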
The composite speech recognition method provided by the above embodiment generates a time-frequency diagram from the composite speech and processes the time-frequency diagram based on a capsule network model, so that the sound types of the composite speech can be detected.
Please refer to Fig. 4, which is a schematic flowchart of another composite speech recognition method provided by this embodiment. As shown in Fig. 4, the composite speech recognition method includes:
Step S10: detect, in real time or periodically, composite speech within a preset range;
The terminal detects, in real time or periodically, composite speech within a preset range. For example, the range the terminal can monitor serves as its preset range: it may be an indoor room, or an outdoor park, etc. The terminal may be set in advance to detect composite speech in a preset room or preset park at all times, or to detect it every hour, where the composite speech contains at least two different mixed voices.
Step S20: when composite speech is detected, acquire the voice signal of the composite speech;
When the terminal detects composite speech, it collects the detected composite speech and obtains its voice signal by analyzing the composite speech; the voice signal includes the frequency, amplitude and time of the sound, etc. For example, when the terminal detects composite speech in which two or more sounds are mixed, it analyzes the detected composite speech through a preset spectrum analyzer or a preset oscilloscope to collect the sound frequency of the composite speech, and obtains the sound amplitude of the composite speech through a preset decibel meter.
Step S30: perform a short-time Fourier transform on the voice signal to generate the time-frequency diagram of the composite speech;
When the terminal obtains the voice signal of the composite speech, it applies a short-time Fourier transform to the acquired voice signal. The short-time Fourier transform (STFT) is a mathematical transform related to the Fourier transform and is used to determine the frequency and phase of the sine waves in a local region of a time-varying signal. Specifically, the short-time Fourier transform involves a frame shift, a frame duration and a Fourier transform: the acquired voice signal is preprocessed by framing (frame shift and frame duration), and the preprocessed sound is Fourier-transformed, yielding multiple two-dimensional diagrams. By Fourier-transforming the voice signal, the relationship between frequency and amplitude in the composite speech can be obtained; each two-dimensional diagram is a spectrum, and the multiple two-dimensional signals are stacked along a dimension to generate the time-frequency diagram of the composite speech. Each frame in the time-frequency diagram is a spectrum, and the variation of the spectra over time constitutes the time-frequency diagram.
Step S41: if the time-frequency diagram of the composite voice signal is obtained, retrieve the preset capsule network model, where the preset capsule network model includes a convolutional layer, primary capsules, advanced capsules and an output layer;
If the terminal obtains the time-frequency diagram of the composite voice signal, it retrieves the preset capsule network model, where the preset capsule network model includes a convolutional layer, primary capsules, advanced capsules and an output layer. It should be noted that the number of convolution kernels in the convolutional layer can be configured according to the actual situation, and the present application does not specifically limit it.
Step S42: when the time-frequency diagram is input into the preset capsule network model, frame the time-frequency diagram through the convolution kernels of the convolutional layer to extract multiple spectra of the time-frequency diagram;
The terminal inputs the acquired time-frequency diagram into the preset capsule network model and passes it through the convolutional layer of the preset capsule network model. The convolutional layer contains convolution kernels, which frame the input time-frequency diagram and extract multiple spectra from it. For example, the terminal inputs a 28 × 28 time-frequency diagram, and the convolutional layer has 256 convolution kernels of size 9 × 9 with stride 1; the time-frequency diagram is framed according to the number of kernels, the stride and other information, so that 256 spectra of size 20 × 20 are obtained. The size of each spectrum follows the rule (f − n + 1) × (f − n + 1), where f is the size of the time-frequency diagram and n is the size of the convolution kernel. The terminal thus extracts 256 spectra of size 20 × 20 through the convolutional layer of the preset capsule network model.
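The (f − n + 1) rule for the output size of a valid (no-padding) convolution can be checked directly; the 28 × 28 input and 9 × 9 kernels follow the example above.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Spatial size of a valid (no-padding) convolution output,
    matching the rule (f - n + 1) for stride 1."""
    return (input_size - kernel_size) // stride + 1

# 28x28 time-frequency diagram, 9x9 kernels, stride 1:
side = conv_output_size(28, 9)
print(side)   # -> 20, i.e. 256 feature maps of 20 x 20
```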
Step S43: the extracted spectra are filtered through a preset group of filter functions to obtain the mel-frequency cepstrum coefficient of each spectrum.
When the terminal has extracted multiple spectra through the convolutional layer, it passes each extracted spectrum through the preset group of filter functions, reads the logarithm (log) of the filtered output, and takes the logarithm as the mel-frequency cepstrum coefficient of that spectrum. Specifically, a spectrum is described by the spectrum formula X[K] = H[K]E[K], where X[K] is the spectrum, H[K] is the spectral envelope, and E[K] is the spectral detail; that is, a spectrum is composed of an envelope and spectral detail. The envelope is obtained by connecting the formants of the spectrum, and formants represent the main frequency components of a voice; they carry the identifying attributes of a sound (much like a personal identity card). The coefficients of H[K] are read through the preset group of filter functions, and the coefficients of H[K] are the mel-frequency cepstrum coefficients.
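Why the logarithm helps here can be shown numerically (an illustrative sketch with made-up envelope and detail arrays, not the patent's data): taking the log turns the product X[K] = H[K]E[K] into a sum, so the envelope term becomes additively separable from the detail term.

```python
import numpy as np

# Hypothetical envelope H and spectral detail E; X is their product,
# as in the patent's spectrum formula X[K] = H[K] * E[K].
rng = np.random.default_rng(0)
H = np.abs(rng.normal(1.0, 0.1, 64)) + 1.0   # slowly varying envelope (formants)
E = np.abs(rng.normal(1.0, 0.1, 64)) + 1.0   # fine spectral detail
X = H * E

# log X[K] = log H[K] + log E[K]: the product becomes a sum,
# which is what lets cepstral analysis separate the two components.
log_X = np.log(X)
```

This additive form is the basis for reading off the envelope coefficients in the cepstral domain.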
In one embodiment, specifically referring to Fig. 5, step S43 includes sub-step S431 and sub-step S432.
Sub-step S431: when multiple spectra are extracted, the preset group of filter functions in the convolutional layer filters the spectra to obtain the mel-frequency cepstrum of each spectrum, where a spectrum is composed of an envelope and spectral detail.
When the terminal detects that the convolution kernels have extracted multiple spectra, the group of filter functions preset in the convolutional layer filters the spectra. The preset group includes multiple filter functions; for example, 40 filter functions may form one group, or 50 filter functions may form one group. Because the preset group contains low-frequency, mid-frequency, and high-frequency filter functions, the envelope contained in each spectrum can be effectively separated from the spectral detail, thereby obtaining the mel-frequency cepstrum of the envelope in each spectrum.
Sub-step S432: cepstral analysis is performed on each mel-frequency cepstrum by the primary capsules to obtain the cepstrum coefficients of the envelopes, and the cepstrum coefficients of the envelopes are taken as the mel-frequency cepstrum coefficients.
Through the primary capsules, the terminal performs cepstral analysis on the mel-frequency cepstrum of each envelope and obtains each envelope's mel-frequency cepstrum coefficients, which are also the mel-frequency cepstrum coefficients of each spectral envelope.
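The filter-bank-then-cepstral-analysis pipeline of sub-steps S431 and S432 can be sketched as follows. This is a minimal sketch under standard MFCC practice, not the patent's exact implementation: the triangular filters are placed linearly purely for illustration (a true mel scale is logarithmic), and the DCT-II plays the role of the cepstral analysis.

```python
import numpy as np

def filter_bank(num_filters, num_bins):
    """Hypothetical triangular filter group (e.g. 40 filters per group)."""
    centers = np.linspace(0, num_bins - 1, num_filters + 2)
    fb = np.zeros((num_filters, num_bins))
    k = np.arange(num_bins)
    for i in range(num_filters):
        left, mid, right = centers[i], centers[i + 1], centers[i + 2]
        fb[i] = np.clip(np.minimum((k - left) / (mid - left),
                                   (right - k) / (right - mid)), 0.0, None)
    return fb

def envelope_cepstrum_coefficients(power_spectrum, num_filters=40):
    fb = filter_bank(num_filters, power_spectrum.shape[-1])
    log_mel = np.log(fb @ power_spectrum + 1e-10)   # sub-step S431: filter + log
    # Sub-step S432: DCT-II of the log spectrum = cepstral analysis;
    # low-order coefficients describe the envelope.
    j = np.arange(num_filters)[:, None]
    n = np.arange(num_filters)[None, :]
    dct = np.cos(np.pi / num_filters * (n + 0.5) * j)
    return dct @ log_mel

coeffs = envelope_cepstrum_coefficients(
    np.abs(np.random.default_rng(1).normal(size=257)) + 1.0)
```

One coefficient vector of this kind per spectrum is what the primary capsules pass onward.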
Step S50: through the preset capsule network model, the vector norm of each mel-frequency cepstrum coefficient is calculated, and the type of the composite voice is determined according to the vector norm of each mel-frequency cepstrum coefficient.
When the terminal has obtained the mel-frequency cepstrum coefficient of each spectrum, it processes them through the preset capsule network model, which includes a dynamic routing algorithm and a weight matrix. Each acquired mel-frequency cepstrum coefficient passes through the dynamic routing algorithm and the weight matrix, producing the vector norm of the mel-frequency cepstrum coefficient of each spectrum. The vector norms are compared to find the mel-frequency cepstrum coefficient with the largest vector norm, and the voice type corresponding to that coefficient is taken as the voice type of the composite voice. Voice types include, for example, a dog barking and glass breaking, and a composite voice contains at least two voice types.
In the composite voice recognition method provided by the above embodiment, the spectra of the time-frequency diagram are extracted by the capsule network model and the mel-frequency cepstrum coefficient of each spectrum is obtained; the features of the composite voice signal are thus obtained quickly, and human resources are saved.
Please refer to Fig. 6, which is a schematic scenario diagram for implementing the composite voice recognition method provided in this embodiment. As shown in Fig. 6, the method includes:
Step S10: the composite voice within a preset range is detected in real time or periodically.
The terminal detects the composite voice within a preset range in real time or periodically. For example, the range that the terminal can monitor serves as its preset range, such as an indoor room or an outdoor park. The terminal may be preconfigured to detect the composite voice in a preset room or a preset park at all times, or to perform the detection once every hour, where the composite voice includes at least two different mixed voices.
Step S20: when the composite voice is detected, the sound signal of the composite voice is acquired.
When the terminal detects the composite voice, it acquires the detected composite voice and, by analyzing it, obtains the sound signal of the composite voice; the sound signal includes the frequency, amplitude, and duration of the sound. For example, when the terminal detects a composite voice in which two or more voices are mixed, it analyzes the detected composite voice through a preset spectrum analyzer or a preset oscilloscope to collect the sound frequencies of the composite voice, and obtains the sound amplitudes of the composite voice through a preset decibel meter.
Step S30: a short-time Fourier transform is applied to the sound signal to generate the time-frequency diagram of the composite voice.
When the terminal has acquired the sound signal of the composite voice, it applies a short-time Fourier transform to the acquired signal. The short-time Fourier transform (STFT, short-time Fourier transform or short-term Fourier transform) is a mathematical transform related to the Fourier transform, used to determine the frequency and phase of the local sine-wave components of a time-varying signal. Specifically, the short-time Fourier transform involves a frame shift, a frame duration, and a Fourier transform: the acquired sound signal is preprocessed according to the frame shift and frame duration, and the Fourier transform is applied to the preprocessed sound, yielding multiple two-dimensional diagrams. By applying the Fourier transform to the sound signal, the relationship between frequency and amplitude in the composite voice is obtained. Each two-dimensional diagram is a spectrum, and the multiple two-dimensional signals are stacked along a dimension to generate the time-frequency diagram of the composite voice. Each frame in the time-frequency diagram is a spectrum, and the time-frequency diagram describes how the spectrum varies over time.
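Step S30 can be sketched as follows (a minimal sketch; the frame length of 256 samples, frame shift of 128 samples, Hann window, and 8 kHz test tone are assumed values, not specified by the patent): frame the signal, window each frame, apply an FFT per frame, and stack the per-frame spectra into the time-frequency diagram.

```python
import numpy as np

def stft_spectrogram(signal, frame_len=256, frame_shift=128):
    """Stack per-frame magnitude spectra into a time-frequency diagram."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))  # one column = one spectrum
    return np.stack(frames, axis=1)                # rows: frequency, cols: time

fs = 8000                                          # assumed sample rate
t = np.arange(fs) / fs
tf = stft_spectrogram(np.sin(2 * np.pi * 440 * t))  # 1 s of a 440 Hz tone
```

Each column of `tf` corresponds to one frame of the time-frequency diagram, i.e. one spectrum.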
Step S40: based on the preset capsule network model, multiple spectra of the time-frequency diagram are extracted to obtain the mel-frequency cepstrum coefficient of each spectrum.
When the terminal has acquired the time-frequency diagram of the composite voice, it proceeds based on the preset capsule network model. A capsule network is a new type of neural network structure that includes a convolutional layer, primary capsules, and advanced capsules. A capsule is a group of nested neural network layers; in a capsule network, more layers can be added inside a single network layer.
Specifically, one neural network layer is nested inside another. The states of the neurons in a capsule characterize the above-mentioned attributes of an entity in an image. A capsule outputs a vector representing the probability that the entity exists; the orientation of the vector represents the attributes of the entity, and the vector is sent to all parent capsules in the network. A capsule computes a prediction vector, which is obtained by multiplying its own output by a weight matrix. The capsule network model extracts the frame signals in the time-frequency diagram, where each frame represents a spectrum. When the multiple spectra of the time-frequency diagram are obtained, the group of mel-frequency filter functions in the capsule network is invoked; each spectrum passes through the group of mel-frequency filter functions, the logarithm of the filtered output is read, and the logarithm is taken as the mel-frequency cepstrum coefficient of that spectrum.
Step S51: when the multiple primary capsules each forward-propagate the mel-frequency cepstrum coefficients to the advanced capsules, the intermediate vector of the mel-frequency cepstrum coefficients is obtained through the dynamic routing formulas of the preset capsule network.
When the terminal has obtained the mel-frequency cepstrum coefficient output by each primary capsule, each primary capsule forward-propagates its mel-frequency cepstrum coefficient to the advanced capsules, and the intermediate vector of the mel-frequency cepstrum coefficients is obtained through the dynamic routing formulas of the preset capsule network model.
In one embodiment, specifically referring to Fig. 7, step S51 includes sub-step S511 to sub-step S513.
Sub-step S511: when a primary capsule forward-propagates the mel-frequency cepstrum coefficient to the advanced capsules, the weight values of the capsule network model are obtained.
Specifically, when a primary capsule forward-propagates the mel-frequency cepstrum coefficient to the advanced capsules, the weight values of the preset capsule network model are obtained; these weight values were learned by the capsule network model on a training data set.
Sub-step S512: based on the first preset formula of the capsule network model and the weight values, the vectors of the mel-frequency cepstrum coefficients are obtained, and the coupling coefficients of the capsule network model are obtained.
Through the first preset formula of the preset capsule network model, û = w · u, where û is the vector of the mel-frequency cepstrum coefficient, w is a weight value of the preset capsule network model, and u is the mel-frequency cepstrum coefficient output by a primary capsule, the vectors of the mel-frequency cepstrum coefficients and the coupling coefficients of the preset capsule network model are obtained.
Sub-step S513: based on the second preset formula of the capsule network model, the vectors, and the coupling coefficients, the intermediate vector of the mel-frequency cepstrum coefficients is obtained, where the dynamic routing formulas include the first preset formula and the second preset formula.
Through the second preset formula, s = Σ c · û, where s is the intermediate vector of the mel-frequency cepstrum coefficients input to an advanced capsule, c is a coupling coefficient, and û is the vector of a mel-frequency cepstrum coefficient, the intermediate vector of the mel-frequency cepstrum coefficients is obtained. The first preset formula and the second preset formula together constitute the dynamic routing formulas of the preset capsule network model.
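The two routing formulas of sub-steps S512 and S513 can be sketched as follows. This is an illustrative sketch: the capsule counts and vector dimensions are assumptions (only the 8-primary/3-advanced counts echo the patent's later example), and the uniform-softmax initialization of the coupling coefficients follows standard capsule-network practice rather than anything stated in the patent.

```python
import numpy as np

num_primary, num_advanced, in_dim, out_dim = 8, 3, 4, 6   # assumed shapes
rng = np.random.default_rng(0)
W = rng.normal(size=(num_primary, num_advanced, out_dim, in_dim))  # weight values
u = rng.normal(size=(num_primary, in_dim))   # primary-capsule outputs (coefficients)

# First preset formula: u_hat = w . u  (one prediction vector per pair i, j).
u_hat = np.einsum('ijok,ik->ijo', W, u)

# Coupling coefficients c: softmax over the advanced capsules, so each
# primary capsule's couplings sum to 1 (uniform here, before any routing update).
b = np.zeros((num_primary, num_advanced))
c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)

# Second preset formula: s = sum_i c_i . u_hat_i  (intermediate vector per
# advanced capsule).
s = np.einsum('ij,ijo->jo', c, u_hat)
```

Each row of `s` is the intermediate vector that one advanced capsule receives.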
Step S52: based on the activation function of the advanced capsules and the intermediate vectors, the vector norms of the mel-frequency cepstrum coefficients output by the advanced capsules are obtained.
The terminal inputs the intermediate vector of each acquired mel-frequency cepstrum coefficient into the advanced capsules, obtains the activation function in the advanced capsules, and computes each intermediate vector through the activation function, obtaining the vector norm of each mel-frequency cepstrum coefficient output by the advanced capsules.
For example, when there are 8 primary capsules and 3 advanced capsules, the 8 primary capsules each input their mel-frequency cepstrum coefficients toward advanced capsule 1; the intermediate vectors of the 8 primary capsules' mel-frequency cepstrum coefficients are computed separately through the dynamic routing formulas of the preset capsule network model; the computed intermediate vectors are input into advanced capsule 1, and the vector norms of the 8 mel-frequency cepstrum coefficients are computed through the activation function of advanced capsule 1.
The 8 primary capsules then each input their mel-frequency cepstrum coefficients toward advanced capsule 2; the intermediate vectors of the 8 primary capsules' mel-frequency cepstrum coefficients are computed separately through the dynamic routing formulas of the preset capsule network model and input into advanced capsule 2, and the vector norms of the 8 mel-frequency cepstrum coefficients are computed through the activation function of advanced capsule 2. Likewise, the computed intermediate vectors are input into advanced capsule 3, and the vector norms of the 8 mel-frequency cepstrum coefficients are computed through the activation function of advanced capsule 3.
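The activation step of Step S52 can be sketched as follows. The patent does not name the activation function; the standard capsule-network "squash" nonlinearity is assumed here. It maps an intermediate vector to an output vector whose norm lies in [0, 1), so the norm can be read as a confidence.

```python
import numpy as np

def squash(s: np.ndarray) -> np.ndarray:
    """Assumed activation: v = (|s|^2 / (1 + |s|^2)) * s / |s|."""
    norm_sq = np.sum(s ** 2)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + 1e-12)

v = squash(np.array([3.0, 4.0]))    # intermediate vector with |s| = 5
vector_norm = np.linalg.norm(v)     # 25/26, i.e. just under 1
```

Long intermediate vectors are squashed to norms near 1 and short ones to norms near 0, which is what makes the later norm comparison meaningful.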
Step S53: when the vector norms of the mel-frequency cepstrum coefficients output by the multiple advanced capsules are obtained, the vector norms are compared, and the target advanced capsule that outputs the largest vector norm is marked.
When the vector norms of the multiple mel-frequency cepstrum coefficients output by each advanced capsule are obtained, the vector norms are compared, the advanced capsule that outputs the largest vector norm is marked, and the marked advanced capsule is taken as the target advanced capsule. Each advanced capsule corresponds to a labeled voice type.
Step S54: the identification type of the target advanced capsule is output through the output layer to obtain the type of the composite voice.
The identification type of the target advanced capsule is output through the output layer. Each advanced capsule is labeled with a voice type; for example, the type labeled on advanced capsule 1 is a dog barking and the type labeled on advanced capsule 2 is glass breaking, or the type labeled on advanced capsule 1 is a dog barking together with glass breaking. The type labeled on an advanced capsule may be a single voice type or multiple voice types.
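Steps S53 and S54 reduce to an argmax over the output vector norms. A minimal sketch (the labels reuse the examples in the text; the norm values are invented for illustration):

```python
# Hypothetical label per advanced capsule, following the text's examples.
capsule_labels = {0: "dog barking",
                  1: "glass breaking",
                  2: "dog barking and glass breaking"}

vector_norms = [0.31, 0.87, 0.12]   # invented norms from 3 advanced capsules

# Step S53: mark the capsule with the largest vector norm.
target = max(range(len(vector_norms)), key=vector_norms.__getitem__)
# Step S54: output that capsule's identification type.
detected_type = capsule_labels[target]   # -> "glass breaking"
```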
In the composite voice recognition method provided by the above embodiment, the mel-frequency cepstrum coefficient of each spectrum in the time-frequency diagram is obtained within the preset capsule network model, the vector norm of each mel-frequency cepstrum coefficient is calculated, and the identification type of the advanced capsule with the largest vector norm is obtained based on those vector norms. The composite voice is rendered as an image and processed by the capsule network model, so that the sound signal and the image are combined in the computation and the type of the composite voice is obtained quickly.
Please refer to Fig. 8, which is a schematic block diagram of a composite voice recognition apparatus provided by an embodiment of this application.
As shown in Fig. 8, the composite voice recognition apparatus 400 includes: a detection module 401, a first acquisition module 402, a generation module 403, a second acquisition module 404, and a third acquisition module 405.
The detection module 401 is configured to detect the composite voice within a preset range in real time or periodically.
The first acquisition module 402 is configured to acquire the sound signal of the composite voice when the composite voice is detected.
The generation module 403 is configured to apply a short-time Fourier transform to the sound signal and generate the time-frequency diagram of the composite voice.
The second acquisition module 404 is configured to extract multiple spectrograms of the time-frequency diagram based on the preset capsule network model and obtain the mel-frequency cepstrum coefficient of each spectrogram.
The third acquisition module 405 is configured to calculate the vector norm of each mel-frequency cepstrum coefficient through the preset capsule network model and determine the type of the composite voice according to the vector norm of each mel-frequency cepstrum coefficient.
In one embodiment, as shown in Fig. 9, the first acquisition module 402 includes:
a first invoking submodule 4021, configured to invoke a preset sample rate when the composite voice is detected;
a determining submodule 4022, configured to determine the sampling time interval of the preset sample rate through a preset formula and the preset sample rate; and
a first acquisition submodule 4023, configured to sample the composite voice based on the sampling time interval and obtain the discrete signal of the composite voice.
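The determining submodule's "preset formula" is not written out in the text; the usual reciprocal relation between sample rate and sampling interval, T = 1 / fs, is assumed in this sketch:

```python
def sampling_interval(sample_rate_hz: float) -> float:
    """Assumed preset formula: interval T = 1 / fs."""
    return 1.0 / sample_rate_hz

interval = sampling_interval(16000)   # 16 kHz -> 62.5 microseconds per sample
```

Sampling the composite voice every `interval` seconds yields the discrete signal acquired by submodule 4023.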
In one embodiment, as shown in Fig. 10, the generation module 403 includes:
a reading submodule 4031, configured to read preset frame duration information and frame shift information if the discrete signal is acquired;
an obtaining submodule 4032, configured to preprocess the discrete signal according to the frame duration information and the frame shift information and obtain multiple short-time analysis signals; and
a generating submodule 4033, configured to apply a Fourier transform to the multiple short-time analysis signals and generate the time-frequency diagram of the composite voice.
Please refer to Fig. 11, which is a schematic block diagram of another composite voice recognition apparatus provided by an embodiment of this application.
As shown in Fig. 11, the composite voice recognition apparatus 500 includes: a detection module 501, a first acquisition module 502, a generation module 503, a second invoking submodule 504, an extraction submodule 505, a second acquisition submodule 506, and a third acquisition module 507.
The detection module 501 is configured to detect the composite voice within a preset range in real time or periodically.
The first acquisition module 502 is configured to acquire the sound signal of the composite voice when the composite voice is detected.
The generation module 503 is configured to apply a short-time Fourier transform to the sound signal and generate the time-frequency diagram of the composite voice.
The second invoking submodule 504 is configured to invoke the preset capsule network model if the time-frequency diagram of the composite voice is acquired, where the preset capsule network model includes a convolutional layer, primary capsules, advanced capsules, and an output layer.
The extraction submodule 505 is configured to, when the time-frequency diagram is input into the preset capsule network model, frame the time-frequency diagram through the convolution kernels of the convolutional layer and extract multiple spectra of the time-frequency diagram.
The second acquisition submodule 506 is configured to filter the extracted spectra through the preset group of filter functions and obtain the mel-frequency cepstrum coefficient of each spectrum.
The third acquisition module 507 is configured to calculate the vector norm of each mel-frequency cepstrum coefficient through the preset capsule network model and determine the type of the composite voice according to the vector norm of each mel-frequency cepstrum coefficient.
In one embodiment, as shown in Fig. 12, the second acquisition submodule 506 includes:
a first obtaining subunit 5061, configured to, when multiple spectra are extracted, filter the spectra through the preset group of filter functions in the convolutional layer and obtain the mel-frequency cepstrum of each spectrum, where a spectrum is composed of an envelope and spectral detail; and
a second obtaining subunit 5062, configured to perform cepstral analysis on each mel-frequency cepstrum through the primary capsules, obtain the cepstrum coefficients of the envelopes, and take the cepstrum coefficients of the envelopes as the mel-frequency cepstrum coefficients.
Please refer to Fig. 13, which is a schematic block diagram of a further composite voice recognition apparatus provided by an embodiment of this application.
As shown in Fig. 13, the composite voice recognition apparatus 600 includes: a detection module 601, a first acquisition module 602, a generation module 603, a second acquisition module 604, a third acquisition submodule 605, a fourth acquisition submodule 606, a marking submodule 607, and a fifth acquisition submodule 608.
The detection module 601 is configured to detect the composite voice within a preset range in real time or periodically.
The first acquisition module 602 is configured to acquire the sound signal of the composite voice when the composite voice is detected.
The generation module 603 is configured to apply a short-time Fourier transform to the sound signal and generate the time-frequency diagram of the composite voice.
The second acquisition module 604 is configured to extract multiple spectrograms of the time-frequency diagram based on the preset capsule network model and obtain the mel-frequency cepstrum coefficient of each spectrogram.
The third acquisition submodule 605 is configured to, when the multiple primary capsules each forward-propagate the mel-frequency cepstrum coefficients to the advanced capsules, obtain the intermediate vector of the mel-frequency cepstrum coefficients through the dynamic routing formulas of the preset capsule network.
The fourth acquisition submodule 606 is configured to obtain the vector norms of the mel-frequency cepstrum coefficients output by the advanced capsules based on the activation function of the advanced capsules and the intermediate vectors.
The marking submodule 607 is configured to, when the vector norms of the mel-frequency cepstrum coefficients output by the multiple advanced capsules are obtained, compare the vector norms of the multiple mel-frequency cepstrum coefficients and mark the target advanced capsule that outputs the largest vector norm.
The fifth acquisition submodule 608 is configured to output the identification type of the target advanced capsule through the output layer and obtain the type of the composite voice signal.
In one embodiment, as shown in Fig. 14, the third acquisition submodule 605 includes:
a third obtaining subunit 6051, configured to obtain the weight values of the capsule network model when a primary capsule forward-propagates the mel-frequency cepstrum coefficient to the advanced capsules;
a fourth obtaining subunit 6052, configured to obtain the vectors of the mel-frequency cepstrum coefficients based on the first preset formula of the capsule network model and the weight values, and obtain the coupling coefficients of the capsule network model; and
a fifth obtaining subunit 6053, configured to obtain the intermediate vector of the mel-frequency cepstrum coefficients based on the second preset formula of the capsule network model, the vectors, and the coupling coefficients, where the dynamic routing formulas include the first preset formula and the second preset formula.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and of each module and unit described above may refer to the corresponding processes in the foregoing embodiments of the composite voice recognition method, and are not repeated here.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, and the computer program can run on a computer device as shown in Fig. 15.
Please refer to Fig. 15, which is a schematic structural block diagram of a computer device provided by an embodiment of this application. The computer device may be a terminal.
As shown in Fig. 15, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions which, when executed, may cause the processor to perform any one of the composite voice recognition methods.
The processor provides computing and control capability and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer program in the non-volatile storage medium; when the computer program is executed by the processor, the processor may be caused to perform any one of the composite voice recognition methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in Fig. 15 is merely a block diagram of part of the structure relevant to the solution of this application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, may combine certain components, or may have a different arrangement of components.
It should be understood that the processor may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In one embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
detecting the composite voice within a preset range in real time or periodically;
acquiring the sound signal of the composite voice signal when the composite voice signal is detected;
applying a short-time Fourier transform to the sound signal to generate the time-frequency diagram of the composite voice;
extracting multiple spectra of the time-frequency diagram based on the preset capsule network model to obtain the mel-frequency cepstrum coefficient of each spectrum; and
calculating the vector norm of each mel-frequency cepstrum coefficient through the preset capsule network model, and determining the type of the composite voice according to the vector norm of each mel-frequency cepstrum coefficient.
In one embodiment, when acquiring the sound signal of the composite voice signal upon detecting the composite voice signal, the processor is configured to implement: invoking a preset sample rate when the composite voice is detected;
determining the sampling time interval of the preset sample rate through a preset formula and the preset sample rate; and
sampling the composite voice based on the sampling time interval to obtain the discrete signal of the composite voice.
In one embodiment, when applying the short-time Fourier transform to the sound signal and generating the time-frequency diagram of the composite voice, the processor is configured to implement: reading preset frame duration information and frame shift information if the discrete signal is acquired;
preprocessing the discrete signal according to the frame duration information and the frame shift information to obtain multiple short-time analysis signals; and
applying a Fourier transform to the multiple short-time analysis signals to generate the time-frequency diagram of the composite voice.
In another embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
detecting the composite voice within a preset range in real time or periodically;
acquiring the sound signal of the composite voice when the composite voice is detected;
applying a short-time Fourier transform to the sound signal to generate the time-frequency diagram of the composite voice;
invoking the preset capsule network model if the time-frequency diagram of the composite voice is acquired, where the preset capsule network model includes a convolutional layer, primary capsules, advanced capsules, and an output layer;
framing the time-frequency diagram through the convolution kernels of the convolutional layer when the time-frequency diagram is input into the preset capsule network model, and extracting multiple spectra of the time-frequency diagram;
filtering the extracted spectra through the preset group of filter functions to obtain the mel-frequency cepstrum coefficient of each spectrum; and
calculating the vector norm of each mel-frequency cepstrum coefficient through the preset capsule network model, and determining the type of the composite voice according to the vector norm of each mel-frequency cepstrum coefficient.
In one embodiment, when calculating the vector norm of each mel-frequency cepstrum coefficient through the preset capsule network model and determining the type of the composite voice according to the vector norms, the processor is configured to implement:
filtering the multiple extracted spectra through the preset group of filter functions in the convolutional layer to obtain the mel-frequency cepstrum of each spectrum, where a spectrum is composed of an envelope and spectral detail; and
performing cepstral analysis on each mel-frequency cepstrum through the primary capsules to obtain the cepstrum coefficients of the envelopes, and taking the cepstrum coefficients of the envelopes as the mel-frequency cepstrum coefficients.
Wherein, in one embodiment, the processor is for running computer program stored in memory, with reality
Existing following steps:
In real time or timing detects the combination speech in preset enclose;
When detecting the combination speech, the voice signal of the combination speech is obtained;
Short Time Fourier Transform is carried out to the voice signal, generates the time-frequency figure of the combination speech;
Based on preset capsule network model, multiple spectrograms of the time-frequency figure are extracted, obtain each spectrogram
Mel-frequency cepstrum coefficient;
When multiple primary capsules respectively to mel-frequency cepstrum coefficient described in the advanced capsule propagated forward when, lead to
The dynamic routing formula for crossing the preset capsule network, obtains the intermediate vector of the mel-frequency cepstrum coefficient;
Activation primitive and the intermediate vector based on the advanced capsule obtain the plum of the advanced capsule output
The vector mould of your frequency cepstral coefficient;
It is more by comparing in the vector mould for the mel-frequency cepstrum coefficient for getting multiple advanced capsule outputs
The vector mould of a mel-frequency cepstrum coefficient, the target higher capsule of label output maximum vector mould;
The identity type that the target higher capsule is exported by the output layer, obtains the class of the composite voice signal
Type.
In one embodiment, when obtaining the vector norm of the mel-frequency cepstral coefficients output by the advanced capsule based on the activation function of the advanced capsule and the intermediate vector, the processor is configured to implement:
when the primary capsules forward-propagate the mel-frequency cepstral coefficients to the advanced capsules, obtaining the weight values of the capsule network model;
obtaining the vectors of the mel-frequency cepstral coefficients based on a first preset formula of the capsule network model and the weight values, and obtaining the coupling coefficients of the capsule network model;
obtaining the intermediate vector of the mel-frequency cepstral coefficients based on a second preset formula of the capsule network model, the coupling coefficients, and the vectors, wherein the dynamic routing formula includes the first preset formula and the second preset formula.
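The first and second preset formulas are not given explicitly in this excerpt. Assuming they follow the standard dynamic-routing formulation for capsule networks (prediction vectors from weight matrices, then a coupling-coefficient-weighted sum squashed by the activation), the routine can be sketched as below; all shapes and the number of routing iterations are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s):
    """Capsule activation: keep the direction, shrink the norm into [0, 1)."""
    norm2 = np.sum(s ** 2)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + 1e-9)

def routing(u, W, iterations=3):
    """Dynamic routing between primary and advanced capsules.
    u: (n_in, d_in) primary outputs; W: (n_in, n_out, d_out, d_in) weights."""
    n_out = W.shape[1]
    # Assumed "first preset formula": prediction vectors u_hat = W . u
    u_hat = np.einsum('iokd,id->iok', W, u)
    b = np.zeros((W.shape[0], n_out))           # routing logits
    for _ in range(iterations):
        c = softmax(b, axis=1)                  # coupling coefficients
        # Assumed "second preset formula": intermediate vector s_j = sum_i c_ij * u_hat_ij
        s = np.einsum('io,iok->ok', c, u_hat)
        v = np.stack([squash(s[j]) for j in range(n_out)])
        b = b + np.einsum('iok,ok->io', u_hat, v)  # agreement update
    return v

rng = np.random.default_rng(0)
v = routing(rng.normal(size=(4, 8)), rng.normal(size=(4, 2, 6, 8)) * 0.1)
print(v.shape)
```

Because of the squash activation, every output capsule's vector norm lies below 1, so the norms can be compared directly as confidence scores.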
An embodiment of the present application also provides a computer-readable storage medium. A computer program is stored on the computer-readable storage medium, and the computer program includes program instructions. For the method implemented when the program instructions are executed, reference may be made to the embodiments of the composite speech recognition method of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SmartMedia Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device.
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements not only includes those elements but also includes other elements not explicitly listed, or further includes elements inherent to such a process, method, article, or system. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes that element.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Any equivalent structural or equivalent process transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. A composite speech recognition method, characterized in that the composite speech recognition method comprises:
detecting composite speech within a preset range in real time or periodically;
when the composite speech is detected, acquiring the speech signal of the composite speech;
performing a short-time Fourier transform on the speech signal to generate a time-frequency diagram of the composite speech;
extracting multiple spectra of the time-frequency diagram based on a preset capsule network model, and obtaining the mel-frequency cepstral coefficients of each spectrum;
calculating the vector norm of each mel-frequency cepstral coefficient through the preset capsule network model, and determining the type of the composite speech according to the vector norm of each mel-frequency cepstral coefficient.
2. The composite speech recognition method according to claim 1, characterized in that, when the composite speech signal is detected, acquiring the speech signal of the composite speech signal comprises:
when the composite speech is detected, retrieving a preset sample rate;
determining the sampling time interval of the preset sample rate through a preset formula and the preset sample rate;
sampling the composite speech based on the sampling time interval to obtain a discrete signal of the composite speech signal.
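The "preset formula" relating the sample rate to the sampling time interval is presumably the standard reciprocal relation T = 1/f_s; the sketch below assumes exactly that, and the 5 Hz test signal and 100 Hz rate are illustrative only.

```python
import math

def sampling_interval(sample_rate_hz):
    """Assumed "preset formula": the sampling time interval is the
    reciprocal of the preset sample rate, T = 1 / f_s."""
    return 1.0 / sample_rate_hz

def sample_signal(analog, duration_s, sample_rate_hz):
    """Acquire a discrete signal by evaluating the signal every interval T."""
    T = sampling_interval(sample_rate_hz)
    n = int(duration_s * sample_rate_hz)
    return [analog(k * T) for k in range(n)]

# Illustrative 5 Hz tone sampled for one second at 100 Hz
discrete = sample_signal(lambda t: math.sin(2 * math.pi * 5 * t), 1.0, 100)
print(len(discrete))
```

At a 16 kHz sample rate the same formula gives an interval of 62.5 microseconds between consecutive samples.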
3. The composite speech recognition method according to claim 2, characterized in that performing a short-time Fourier transform on the speech signal to generate the time-frequency diagram of the composite speech comprises:
if the discrete signal is obtained, reading preset frame duration information and frame shift information;
pre-processing the discrete signal according to the frame duration information and the frame shift information to obtain multiple short-time analysis signals;
performing a Fourier transform on the multiple short-time analysis signals to generate the time-frequency diagram of the composite speech.
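The framing step of claim 3 can be sketched as follows. The claim leaves the frame duration and frame shift as "preset" values; the 25 ms / 10 ms defaults below are common choices in speech processing, not values taken from the patent.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, shift_ms=10):
    """Pre-process a discrete signal into overlapping short-time analysis
    frames using frame duration and frame shift information."""
    frame_len = int(fs * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)       # e.g. 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    return np.stack([x[i * shift : i * shift + frame_len] for i in range(n_frames)])

fs = 16000
frames = frame_signal(np.zeros(fs), fs)     # one second of silence
print(frames.shape)
# Each row would then be Fourier-transformed to build the time-frequency diagram.
```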
4. The composite speech recognition method according to claim 1 or 3, characterized in that extracting multiple spectra of the time-frequency diagram based on a preset capsule network model and obtaining the mel-frequency cepstral coefficients of each spectrum comprises:
if the time-frequency diagram of the composite speech is obtained, retrieving the preset capsule network model, wherein the preset capsule network model includes a convolutional layer, primary capsules, advanced capsules, and an output layer;
when the time-frequency diagram is input into the preset capsule network model, framing the time-frequency diagram through the convolution kernels of the convolutional layer, and extracting multiple spectra of the time-frequency diagram;
filtering the multiple extracted spectra through a preset filter function group to obtain the mel-frequency cepstral coefficients of each spectrum.
5. The composite speech recognition method according to claim 4, characterized in that filtering the multiple extracted spectra through a preset filter function group to obtain the mel-frequency cepstral coefficients of each spectrum comprises:
when the multiple spectra are extracted, filtering the multiple spectra through the preset filter function group in the convolutional layer to obtain the mel-frequency cepstrum of each spectrum, wherein a spectrum is composed of an envelope and spectral details;
performing cepstral analysis on each mel-frequency cepstrum through the primary capsules to obtain the cepstral coefficients of multiple envelopes, and taking the cepstral coefficients of the envelopes as the mel-frequency cepstral coefficients.
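The filter function group and cepstral analysis of claim 5 can be sketched assuming a standard triangular mel filterbank followed by a logarithm and a DCT-II, which is the usual way the envelope's cepstral coefficients are separated from the spectral details. The filter count (26) and coefficient count (13) below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the mel scale."""
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    inv = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = inv(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_from_power_spectrum(power, fb, n_coeffs=13):
    """Cepstral analysis: log of mel energies, then a DCT-II; keeping the
    low-order terms retains the spectral envelope."""
    log_mel = np.log(fb @ power + 1e-10)
    n = len(log_mel)
    k = np.arange(n)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1) / (2 * n))
    return dct @ log_mel

fs, n_fft = 16000, 512
fb = mel_filterbank(26, n_fft, fs)
coeffs = mfcc_from_power_spectrum(np.ones(n_fft // 2 + 1), fb)
print(coeffs.shape)
```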
6. The composite speech recognition method according to claim 5, characterized in that calculating the vector norm of each mel-frequency cepstral coefficient through the preset capsule network model and obtaining the type of the composite speech signal comprises:
when multiple primary capsules each forward-propagate the mel-frequency cepstral coefficients to the advanced capsules, obtaining the intermediate vectors of the mel-frequency cepstral coefficients through the dynamic routing formula of the preset capsule network;
obtaining the vector norms of the mel-frequency cepstral coefficients output by the advanced capsules based on the activation function of the advanced capsules and the intermediate vectors;
when the vector norms of the mel-frequency cepstral coefficients output by the multiple advanced capsules are obtained, comparing the vector norms of the multiple mel-frequency cepstral coefficients, and marking the target advanced capsule that outputs the maximum vector norm;
outputting the identity type of the target advanced capsule through the output layer to obtain the type of the composite speech.
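Assuming the advanced capsules' activation function is the usual "squash" non-linearity from capsule networks, the norm-comparison step of claim 6 can be sketched as below; the type labels and the intermediate vectors are purely illustrative.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Assumed activation of an advanced capsule: the output keeps the
    direction of the intermediate vector, with its norm squashed into [0, 1)."""
    norm2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def classify_by_norm(intermediate_vectors, labels):
    """Mark the target advanced capsule with the largest output vector norm."""
    v = squash(np.asarray(intermediate_vectors, dtype=float))
    norms = np.linalg.norm(v, axis=-1)
    return labels[int(np.argmax(norms))], norms

labels = ["speech", "drone", "background"]   # illustrative types only
vecs = [[0.1, 0.1], [2.0, 1.0], [0.5, 0.0]]  # illustrative intermediate vectors
best, norms = classify_by_norm(vecs, labels)
print(best)
```

Because squashing is monotone in the input norm, the capsule whose intermediate vector is longest always wins the comparison, and its norm can be read directly as a confidence in that type.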
7. The composite speech recognition method according to claim 6, characterized in that, when the primary capsules forward-propagate the mel-frequency cepstral coefficients to the advanced capsules, obtaining the intermediate vectors of the mel-frequency cepstral coefficients through the dynamic routing algorithm of the preset capsule network comprises:
when the primary capsules forward-propagate the mel-frequency cepstral coefficients to the advanced capsules, obtaining the weight values of the capsule network model;
obtaining the vectors of the mel-frequency cepstral coefficients based on a first preset formula of the capsule network model and the weight values, and obtaining the coupling coefficients of the capsule network model;
obtaining the intermediate vector of the mel-frequency cepstral coefficients based on a second preset formula of the capsule network model, the coupling coefficients, and the vectors, wherein the dynamic routing formula includes the first preset formula and the second preset formula.
8. A composite speech recognition device, characterized in that the composite speech recognition device comprises:
a detection module, configured to detect composite speech within a preset range in real time or periodically;
a first acquisition module, configured to acquire the speech signal of the composite speech signal when the composite speech is detected;
a generation module, configured to perform a short-time Fourier transform on the speech signal to generate a time-frequency diagram of the composite speech;
a second acquisition module, configured to extract multiple spectra of the time-frequency diagram based on a preset capsule network model, and obtain the mel-frequency cepstral coefficients of each spectrum;
a third acquisition module, configured to calculate the vector norm of each mel-frequency cepstral coefficient through the preset capsule network model, and determine the type of the composite speech according to the vector norm of each mel-frequency cepstral coefficient.
9. A computer device, characterized in that the computer device comprises: a memory, a processor, and a composite speech recognition program stored on the memory and operable on the processor, wherein when the composite speech recognition program is executed by the processor, the steps of the composite speech recognition method according to any one of claims 1 to 7 are implemented.
10. A computer-readable storage medium, characterized in that a composite speech recognition program is stored on the computer-readable storage medium, and when the composite speech recognition program is executed by a processor, the steps of the composite speech recognition method according to any one of claims 1 to 7 are implemented.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910601019.4A CN110444202B (en) | 2019-07-04 | 2019-07-04 | Composite voice recognition method, device, equipment and computer readable storage medium |
PCT/CN2019/118458 WO2021000498A1 (en) | 2019-07-04 | 2019-11-14 | Composite speech recognition method, device, equipment, and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910601019.4A CN110444202B (en) | 2019-07-04 | 2019-07-04 | Composite voice recognition method, device, equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110444202A true CN110444202A (en) | 2019-11-12 |
CN110444202B CN110444202B (en) | 2023-05-26 |
Family
ID=68429517
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910601019.4A Active CN110444202B (en) | 2019-07-04 | 2019-07-04 | Composite voice recognition method, device, equipment and computer readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110444202B (en) |
WO (1) | WO2021000498A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110910893A (en) * | 2019-11-26 | 2020-03-24 | 北京梧桐车联科技有限责任公司 | Audio processing method, device and storage medium |
WO2021000498A1 (en) * | 2019-07-04 | 2021-01-07 | 平安科技(深圳)有限公司 | Composite speech recognition method, device, equipment, and computer-readable storage medium |
CN113450775A (en) * | 2020-03-10 | 2021-09-28 | 富士通株式会社 | Model training device, model training method, and storage medium |
CN114173405A (en) * | 2022-01-17 | 2022-03-11 | 上海道生物联技术有限公司 | Rapid awakening method and system in technical field of wireless communication |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113096649B (en) * | 2021-03-31 | 2023-12-22 | 平安科技(深圳)有限公司 | Voice prediction method, device, electronic equipment and storage medium |
CN116705055B (en) * | 2023-08-01 | 2023-10-17 | 国网福建省电力有限公司 | Substation noise monitoring method, system, equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107564530A (en) * | 2017-08-18 | 2018-01-09 | 浙江大学 | Drone detection method based on voiceprint energy features |
CN107993648A (en) * | 2017-11-27 | 2018-05-04 | 北京邮电大学 | Drone identification method and device, and electronic equipment |
CN108281146A (en) * | 2017-12-29 | 2018-07-13 | 青岛真时科技有限公司 | Short-utterance speaker recognition method and device |
CN109146066A (en) * | 2018-11-01 | 2019-01-04 | 重庆邮电大学 | Natural interaction method for a collaborative virtual learning environment based on speech emotion recognition |
CN109147818A (en) * | 2018-10-30 | 2019-01-04 | Oppo广东移动通信有限公司 | Acoustic feature extraction method and device, storage medium, and terminal device |
CN109410917A (en) * | 2018-09-26 | 2019-03-01 | 河海大学常州校区 | Speech data classification method based on an improved capsule network |
CN109559755A (en) * | 2018-12-25 | 2019-04-02 | 沈阳品尚科技有限公司 | Speech enhancement method based on DNN noise classification |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201416303D0 (en) * | 2014-09-16 | 2014-10-29 | Univ Hull | Speech synthesis |
CN108766419B (en) * | 2018-05-04 | 2020-10-27 | 华南理工大学 | Abnormal speech discrimination method based on deep learning |
CN108922559A (en) * | 2018-07-06 | 2018-11-30 | 华南理工大学 | Recording terminal clustering method based on speech time-frequency transform features and integer linear programming |
CN109523993B (en) * | 2018-11-02 | 2022-02-08 | 深圳市网联安瑞网络科技有限公司 | Voice language classification method based on CNN and GRU fusion deep neural network |
CN110444202B (en) * | 2019-07-04 | 2023-05-26 | 平安科技(深圳)有限公司 | Composite voice recognition method, device, equipment and computer readable storage medium |
-
2019
- 2019-07-04 CN CN201910601019.4A patent/CN110444202B/en active Active
- 2019-11-14 WO PCT/CN2019/118458 patent/WO2021000498A1/en active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107564530A (en) * | 2017-08-18 | 2018-01-09 | 浙江大学 | Drone detection method based on voiceprint energy features |
CN107993648A (en) * | 2017-11-27 | 2018-05-04 | 北京邮电大学 | Drone identification method and device, and electronic equipment |
CN108281146A (en) * | 2017-12-29 | 2018-07-13 | 青岛真时科技有限公司 | Short-utterance speaker recognition method and device |
CN109410917A (en) * | 2018-09-26 | 2019-03-01 | 河海大学常州校区 | Speech data classification method based on an improved capsule network |
CN109147818A (en) * | 2018-10-30 | 2019-01-04 | Oppo广东移动通信有限公司 | Acoustic feature extraction method and device, storage medium, and terminal device |
CN109146066A (en) * | 2018-11-01 | 2019-01-04 | 重庆邮电大学 | Natural interaction method for a collaborative virtual learning environment based on speech emotion recognition |
CN109559755A (en) * | 2018-12-25 | 2019-04-02 | 沈阳品尚科技有限公司 | Speech enhancement method based on DNN noise classification |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021000498A1 (en) * | 2019-07-04 | 2021-01-07 | 平安科技(深圳)有限公司 | Composite speech recognition method, device, equipment, and computer-readable storage medium |
CN110910893A (en) * | 2019-11-26 | 2020-03-24 | 北京梧桐车联科技有限责任公司 | Audio processing method, device and storage medium |
CN113450775A (en) * | 2020-03-10 | 2021-09-28 | 富士通株式会社 | Model training device, model training method, and storage medium |
CN114173405A (en) * | 2022-01-17 | 2022-03-11 | 上海道生物联技术有限公司 | Rapid awakening method and system in technical field of wireless communication |
CN114173405B (en) * | 2022-01-17 | 2023-11-03 | 上海道生物联技术有限公司 | Rapid wake-up method and system in wireless communication technical field |
Also Published As
Publication number | Publication date |
---|---|
CN110444202B (en) | 2023-05-26 |
WO2021000498A1 (en) | 2021-01-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110444202A (en) | Combination speech recognition methods, device, equipment and computer readable storage medium | |
CN108597492B (en) | Speech synthesis method and device | |
CN108053838B (en) | Fraud recognition method, device and storage medium combining audio analysis and video analysis | |
Mesgarani et al. | Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations | |
Yang et al. | Psychoacoustical evaluation of natural and urban sounds in soundscapes | |
CN110457432A (en) | Interview scoring method, device, equipment and storage medium | |
CN110880329B (en) | Audio identification method and equipment and storage medium | |
CN107086040A (en) | Speech recognition capabilities method of testing and device | |
CN110047514A (en) | Accompaniment purity assessment method and related device | |
Mittal et al. | Analysis of production characteristics of laughter | |
CN110970036B (en) | Voiceprint recognition method and device, computer storage medium and electronic equipment | |
CN110738998A (en) | Voice-based personal credit evaluation method, device, terminal and storage medium | |
CN109800720A (en) | Emotion recognition model training method, emotion recognition method, apparatus, equipment and storage medium | |
Reddy et al. | A comparison of cepstral features in the detection of pathological voices by varying the input and filterbank of the cepstrum computation | |
Chaki | Pattern analysis based acoustic signal processing: a survey of the state-of-art | |
Hsu et al. | Robust voice activity detection algorithm based on feature of frequency modulation of harmonics and its DSP implementation | |
AU2020210905A1 (en) | Systems and methods for pre-filtering audio content based on prominence of frequency content | |
CN109147146B (en) | Voice number taking method and terminal equipment | |
CN112185342A (en) | Voice conversion and model training method, device and system and storage medium | |
CN111145726A (en) | Deep learning-based sound scene classification method, system, device and storage medium | |
Ling He et al. | Recognition of stress in speech using wavelet analysis and Teager energy operator | |
CN115938364A (en) | Intelligent identification control method, terminal equipment and readable storage medium | |
Sahoo et al. | Analyzing the vocal tract characteristics for out-of-breath speech | |
CN111599345B (en) | Speech recognition algorithm evaluation method, system, mobile terminal and storage medium | |
CN112908299B (en) | Customer demand information identification method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |