CN110033757A - A kind of voice recognizer - Google Patents
A kind of voice recognizer
- Publication number
- CN110033757A (application CN201910272975.2A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- sound
- data
- algorithm
- characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Abstract
The present invention provides a voice recognition algorithm comprising the following steps. S1: adaptive processing of the speaker's volume — after identification-model training, the speaker's volume is globally normalized to a common maximum value. S2: adaptive processing of the silent regions of the speaker's sound — the current speaker's volume level is estimated with a mean filter, and silent regions are then removed by threshold filtering. S3: the background sound is filtered for noise reduction, and the speaker's voice data is normalized for consistency. S4: the speaker's voice features are extracted — a trained neural-network model produces a high-dimensional feature vector of the speaker's voice. S5: the speaker's voice features are compared against a voiceprint library — the high-dimensional features extracted by the neural-network model are compared using cosine distance to obtain the similarity of speaker features. The present invention recognizes the speaker's voice features directly, with low noise and high recognition accuracy.
Description
Technical field
The invention belongs to the technical field of voice recognition, and in particular relates to a voice recognition algorithm.
Background art
Speech recognition technology is an information technology in which a machine, through a process of recognition and understanding, converts the sounds, syllables, or phrases uttered by a human into corresponding text or symbols, or produces a response. With the rapid development of information technology, speech recognition has been widely applied in daily life. For example, when using a terminal device, speech recognition allows information to be entered into the device conveniently by voice input.
Speech recognition is essentially a pattern-recognition process: the pattern of an unknown utterance is compared one by one with reference patterns of known utterances, and the best-matching reference pattern is output as the recognition result. Existing speech recognition technology employs many recognition methods, such as template matching and probabilistic modeling. The industry currently relies mainly on probabilistic-model speech recognition, in which acoustic training is performed in the cloud on the speech input of a large number of different users to obtain a general acoustic model; a speech signal to be recognized is then decoded into text output according to this general acoustic model and a language model. Because it is speaker-independent, this method can recognize the speech of most people. However, since the acoustic model is generic, it cannot match accurately when a user's pronunciation is non-standard or accented. Its essential defect is that it can only recognize what a speaker says; it cannot recognize the speaker's voice characteristics directly.
Summary of the invention
The object of the present invention is to provide a voice recognition algorithm that recognizes the speaker's voice features directly, with low noise and high recognition accuracy.
The present invention provides the following technical solution:
A voice recognition algorithm, comprising the following steps:
S1: adaptive processing of the speaker's volume: after identification-model training, the speaker's volume is globally normalized to a common maximum value;
S2: adaptive processing of the silent regions of the speaker's sound: the current speaker's volume level is estimated with a mean filter, and silent regions are then removed by threshold filtering;
S3: the background sound is filtered for noise reduction, and the speaker's voice data is normalized for consistency;
S4: the speaker's voice features are extracted: a trained neural-network model extracts a high-dimensional feature vector of the speaker's voice;
S5: the speaker's voice features are compared against a voiceprint library: the high-dimensional features extracted by the neural-network model are compared using cosine distance to obtain the similarity of speaker features.
Preferably, the adaptive processing of the speaker's volume in S1 uses a volume-adaptive algorithm on the input sound, comprising the following steps:
S11: find the maximum value of the current sound data;
S12: using the volume maximum as a reference, compute a coefficient relative to a constant;
S13: multiply the current sound data by the coefficient to obtain volume-adapted data.
Preferably, the adaptive processing of the silent regions of the speaker's sound in S2 uses a silence-truncation algorithm on the input sound, comprising the following steps:
S21: compute the mean of the absolute values of 1600 sample amplitudes (about 0.1 s);
S22: judge by threshold whether the current data lies in a silent region;
S23: truncate the data in silent regions.
Preferably, the filtering noise reduction of the background sound in S3 uses a filtering noise-reduction algorithm comprising the following steps:
S31: extract background-noise data using a wavelet algorithm;
S32: divide the noisy band data into multiple frequency bands, compute the matrix inner product of each band with the noise data, and use the principle that the inner product of orthogonal vectors is zero to filter out the currently extracted noise samples.
Preferably, the consistency processing of the speaker's voice data in S3 subtracts the mean from the speaker's FBANK features and their first-order and second-order delta data and then divides by the standard deviation of the data, yielding consistent voice data.
Preferably, the speaker's voice features in S4 are extracted by a trained neural-network model, using multilayer convolution and fully connected layers to extract a 512-dimensional feature vector of the speaker's voice.
The beneficial effects of the present invention are: the algorithm preprocesses the input sound so that the input features are more complete, the noise is lower, and the recognition accuracy is higher; a deep convolutional neural network extracts and classifies high-dimensional voice features, so the speaker's voice features are recognized directly, avoiding the defect of identifying the speaker only by the content of his or her speech.
Brief description of the drawings
The accompanying drawings provide a further understanding of the invention and constitute a part of the specification; together with the embodiments of the invention they serve to explain the invention and are not to be construed as limiting it. In the drawings:
Fig. 1 is a flow diagram of the recognition algorithm of the present invention;
Fig. 2 is a schematic diagram of the deep neural network structure.
Specific embodiment
As shown in Fig. 1, a voice recognition algorithm comprises the following steps:
S1: adaptive processing of the speaker's volume: after identification-model training, the speaker's volume is globally normalized to a common maximum value;
S2: adaptive processing of the silent regions of the speaker's sound: the current speaker's volume level is estimated with a mean filter, and silent regions are then removed by threshold filtering;
S3: the background sound is filtered for noise reduction, and the speaker's voice data is normalized for consistency;
S4: the speaker's voice features are extracted: a trained neural-network model extracts a high-dimensional feature vector of the speaker's voice;
S5: the speaker's voice features are compared against a voiceprint library: the high-dimensional features extracted by the neural-network model are compared using cosine distance to obtain the similarity of speaker features.
Specifically, the adaptive processing of the speaker's volume in S1 uses a volume-adaptive algorithm on the input sound, comprising the following steps:
S11: find the maximum value of the current sound data;
S12: using the volume maximum as a reference, compute a coefficient relative to a constant;
S13: multiply the current sound data by the coefficient to obtain volume-adapted data.
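Steps S11–S13 amount to peak normalization; a minimal numpy sketch, where `TARGET_MAX` stands in for the patent's unspecified constant reference level:

```python
import numpy as np

TARGET_MAX = 0.9  # assumed constant reference level; the patent does not name a value

def normalize_volume(samples: np.ndarray, target_max: float = TARGET_MAX) -> np.ndarray:
    peak = np.max(np.abs(samples))   # S11: maximum of the current sound data
    if peak == 0:
        return samples.copy()        # pure silence: nothing to scale
    coeff = target_max / peak        # S12: coefficient relative to the constant
    return samples * coeff           # S13: volume-adapted data
```

Applying this to every utterance maps them all to the same maximum amplitude, which is the "global normalization to a common maximum" of step S1.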
Specifically, the adaptive processing of the silent regions of the speaker's sound in S2 uses a silence-truncation algorithm on the input sound, comprising the following steps:
S21: compute the mean of the absolute values of 1600 sample amplitudes (about 0.1 s);
S22: judge by threshold whether the current data lies in a silent region;
S23: truncate the data in silent regions.
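Steps S21–S23 might look like the following sketch, assuming 16 kHz audio (so that 1600 samples ≈ 0.1 s, as the patent states) and an illustrative threshold value:

```python
import numpy as np

FRAME = 1600       # 1600 samples ≈ 0.1 s at an assumed 16 kHz sampling rate
THRESHOLD = 0.02   # assumed silence threshold; the patent does not give a value

def trim_silence(samples: np.ndarray, frame: int = FRAME,
                 threshold: float = THRESHOLD) -> np.ndarray:
    """Drop ~0.1 s frames whose mean absolute amplitude falls below the threshold."""
    kept = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        if np.mean(np.abs(chunk)) >= threshold:  # S21 + S22: mean-filter and judge
            kept.append(chunk)                   # S23: silent frames are truncated
    return np.concatenate(kept) if kept else np.empty(0)
```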
Specifically, the filtering noise reduction of the background sound in S3 uses a filtering noise-reduction algorithm comprising the following steps:
S31: extract background-noise data using a wavelet algorithm;
S32: divide the noisy band data into multiple frequency bands, compute the matrix inner product of each band with the noise data, and use the principle that the inner product of orthogonal vectors is zero to filter out the currently extracted noise samples.
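The patent's wavelet noise extraction (S31) is not specified in enough detail to reproduce; the sketch below illustrates only the band-splitting and inner-product idea of S32, using FFT bands and an externally supplied noise estimate (both assumptions made for illustration). Bands whose spectrum is nearly orthogonal to the noise (inner product close to zero) are kept; bands that correlate strongly with the noise are discarded.

```python
import numpy as np

def filter_noise_bands(noisy: np.ndarray, noise_est: np.ndarray,
                       n_bands: int = 8, corr_thresh: float = 0.5) -> np.ndarray:
    """Zero out frequency bands whose spectrum correlates strongly with the noise."""
    spec = np.fft.rfft(noisy)
    noise_spec = np.fft.rfft(noise_est, n=len(noisy))
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band, nband = spec[lo:hi], noise_spec[lo:hi]
        denom = np.linalg.norm(band) * np.linalg.norm(nband)
        if denom > 0:
            corr = abs(np.vdot(band, nband)) / denom  # normalized inner product
            if corr > corr_thresh:
                spec[lo:hi] = 0                       # treat the whole band as noise
    return np.fft.irfft(spec, n=len(noisy))
```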
Specifically, the consistency processing of the speaker's voice data in S3 subtracts the mean from the speaker's FBANK features and their first-order and second-order delta data and then divides by the standard deviation of the data, yielding consistent voice data.
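The consistency processing described here is essentially mean/variance normalization of stacked FBANK + delta features. A sketch follows; the simple difference-based delta is an assumption made for brevity (speech toolkits often use a regression formula instead):

```python
import numpy as np

def deltas(feats: np.ndarray) -> np.ndarray:
    """First-order delta as a simple frame-to-frame difference along time."""
    return np.diff(feats, axis=0, prepend=feats[:1])

def consistency_normalize(fbank: np.ndarray) -> np.ndarray:
    """Stack FBANK with first- and second-order deltas, then mean/variance normalize."""
    d1 = deltas(fbank)
    d2 = deltas(d1)
    feats = np.concatenate([fbank, d1, d2], axis=1)  # (frames, 3 * n_mels)
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-8                   # guard against division by zero
    return (feats - mean) / std                      # "consistent" voice data
```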
As shown in Figs. 1 and 2, the speaker's voice features in S4 are extracted by a trained neural-network model, using multilayer convolution and fully connected layers to extract a 512-dimensional feature vector of the speaker's voice.
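A toy numpy forward pass illustrating the shape of S4: one convolution layer followed by a fully connected projection to a 512-dimensional, length-normalized embedding. The layer sizes and random weights are purely illustrative; the patent's trained multi-convolution-layer network is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the illustrative weights are reproducible

def conv2d(x: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Valid-mode 2-D convolution of a single-channel input with several kernels + ReLU."""
    kh, kw = kernels.shape[1:]
    H, W = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((len(kernels), H, W))
    for k, ker in enumerate(kernels):
        for i in range(H):
            for j in range(W):
                out[k, i, j] = np.sum(x[i:i + kh, j:j + kw] * ker)
    return np.maximum(out, 0)  # ReLU activation

def embed(feature_map: np.ndarray, n_dims: int = 512) -> np.ndarray:
    """Toy embedding: one conv layer, flatten, one random fully connected layer."""
    kernels = rng.standard_normal((4, 3, 3)) * 0.1   # 4 illustrative 3x3 kernels
    h = conv2d(feature_map, kernels).ravel()
    W = rng.standard_normal((n_dims, h.size)) * 0.01  # illustrative FC weights
    v = W @ h
    return v / (np.linalg.norm(v) + 1e-8)  # length-normalized 512-d voiceprint
```

Length-normalizing the output makes the cosine comparison of step S5 equivalent to a dot product.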
As shown in Figs. 1 and 2, the convolution and fully connected algorithm works as follows. The basic structure of a convolutional neural network comprises two kinds of layers. The first is the feature-extraction layer: the input of each neuron is connected to a local receptive field of the previous layer, from which the local feature is extracted; once a local feature is extracted, its positional relationship to the other features is also fixed. The second is the feature-mapping layer: each computational layer of the network consists of multiple feature maps, each of which is a plane in which all neurons share equal weights. The feature-mapping structure uses a sigmoid function with a small influence-function kernel as the activation function of the convolutional network, giving the feature maps shift invariance. Moreover, since the neurons in one feature map share weights, the number of free parameters of the network is reduced. Each convolutional layer of the network is followed by a computational layer performing local averaging and secondary extraction; this two-stage feature-extraction structure reduces the feature resolution.
Convolutional neural networks are mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion. Since the feature-detection layers of a convolutional neural network are learned from training data, explicit feature extraction is avoided when such a network is used; the features are learned implicitly from the training data. Furthermore, since the neurons in the same feature map share identical weights, the network can learn in parallel, a major advantage of convolutional networks over networks in which neurons are fully interconnected. With their special structure of locally shared weights, convolutional neural networks have unique advantages in speech recognition and image processing: their layout is closer to real biological neural networks; weight sharing reduces the complexity of the network; and, in particular, images with multidimensional input vectors can be fed into the network directly, avoiding the complexity of data reconstruction during feature extraction and classification. Finally, a fully connected layer is added, taking the output of the fourth convolutional layer as its input, so that both local and global features can be learned. The present invention mainly uses multilayer convolution plus a fully connected layer to extract and classify high-dimensional voice features, so that the speaker's voice features are recognized directly, avoiding the defect of identifying the speaker only by the content of his or her speech, with higher recognition accuracy.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or replace some of their technical features by equivalents. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within its scope of protection.
Claims (6)
1. A voice recognition algorithm, characterized by comprising the following steps:
S1: adaptive processing of the speaker's volume: after identification-model training, the speaker's volume is globally normalized to a common maximum value;
S2: adaptive processing of the silent regions of the speaker's sound: the current speaker's volume level is estimated with a mean filter, and silent regions are then removed by threshold filtering;
S3: the background sound is filtered for noise reduction, and the speaker's voice data is normalized for consistency;
S4: the speaker's voice features are extracted: a trained neural-network model extracts a high-dimensional feature vector of the speaker's voice;
S5: the speaker's voice features are compared against a voiceprint library: the high-dimensional features extracted by the neural-network model are compared using cosine distance to obtain the similarity of speaker features.
2. The voice recognition algorithm according to claim 1, characterized in that the adaptive processing of the speaker's volume in S1 uses a volume-adaptive algorithm on the input sound, comprising the following steps:
S11: find the maximum value of the current sound data;
S12: using the volume maximum as a reference, compute a coefficient relative to a constant;
S13: multiply the current sound data by the coefficient to obtain volume-adapted data.
3. The voice recognition algorithm according to claim 1, characterized in that the adaptive processing of the silent regions of the speaker's sound in S2 uses a silence-truncation algorithm on the input sound, comprising the following steps:
S21: compute the mean of the absolute values of 1600 sample amplitudes (about 0.1 s);
S22: judge by threshold whether the current data lies in a silent region;
S23: truncate the data in silent regions.
4. The voice recognition algorithm according to claim 1, characterized in that the filtering noise reduction of the background sound in S3 uses a filtering noise-reduction algorithm comprising the following steps:
S31: extract background-noise data using a wavelet algorithm;
S32: divide the noisy band data into multiple frequency bands, compute the matrix inner product of each band with the noise data, and use the principle that the inner product of orthogonal vectors is zero to filter out the currently extracted noise samples.
5. The voice recognition algorithm according to claim 1, characterized in that the consistency processing of the speaker's voice data in S3 subtracts the mean from the speaker's FBANK features and their first-order and second-order delta data and then divides by the standard deviation of the data, yielding consistent voice data.
6. The voice recognition algorithm according to claim 1, characterized in that the speaker's voice features in S4 are extracted by a trained neural-network model, using multilayer convolution and fully connected layers to extract a 512-dimensional feature vector of the speaker's voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910272975.2A CN110033757A (en) | 2019-04-04 | 2019-04-04 | A kind of voice recognizer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110033757A true CN110033757A (en) | 2019-07-19 |
Family
ID=67237513
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910272975.2A Pending CN110033757A (en) | 2019-04-04 | 2019-04-04 | A kind of voice recognizer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110033757A (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102136272A (en) * | 2010-01-26 | 2011-07-27 | 雅马哈株式会社 | Masker sound generation apparatus |
CN103778917A (en) * | 2014-01-10 | 2014-05-07 | 厦门快商通信息技术有限公司 | System and method for detecting identity impersonation in telephone satisfaction survey |
CN103971690A (en) * | 2013-01-28 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Voiceprint recognition method and device |
CN104269178A (en) * | 2014-08-08 | 2015-01-07 | 华迪计算机集团有限公司 | Method and device for conducting self-adaption spectrum reduction and wavelet packet noise elimination processing on voice signals |
CN104732978A (en) * | 2015-03-12 | 2015-06-24 | 上海交通大学 | Text-dependent speaker recognition method based on joint deep learning |
CN105118511A (en) * | 2015-07-31 | 2015-12-02 | 国网电力科学研究院武汉南瑞有限责任公司 | Thunder identification method |
CN106935248A (en) * | 2017-02-14 | 2017-07-07 | 广州孩教圈信息科技股份有限公司 | A kind of voice similarity detection method and device |
CN107492382A (en) * | 2016-06-13 | 2017-12-19 | 阿里巴巴集团控股有限公司 | Voiceprint extracting method and device based on neutral net |
CN107886943A (en) * | 2017-11-21 | 2018-04-06 | 广州势必可赢网络科技有限公司 | A kind of method for recognizing sound-groove and device |
CN108648759A (en) * | 2018-05-14 | 2018-10-12 | 华南理工大学 | A kind of method for recognizing sound-groove that text is unrelated |
CN108711436A (en) * | 2018-05-17 | 2018-10-26 | 哈尔滨工业大学 | Speaker verification's system Replay Attack detection method based on high frequency and bottleneck characteristic |
CN108766445A (en) * | 2018-05-30 | 2018-11-06 | 苏州思必驰信息科技有限公司 | Method for recognizing sound-groove and system |
CN108986825A (en) * | 2018-07-02 | 2018-12-11 | 北京百度网讯科技有限公司 | Context acquisition methods and equipment based on interactive voice |
CN109243488A (en) * | 2018-10-30 | 2019-01-18 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio-frequency detection, device and storage medium |
Non-Patent Citations (1)
Title |
---|
杨再礼 et al.: 《广播技术基础概论》 [Introduction to the Fundamentals of Broadcasting Technology], 30 November 2014 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110534127A (en) * | 2019-09-24 | 2019-12-03 | 华南理工大学 | Applied to the microphone array voice enhancement method and device in indoor environment |
CN112509555A (en) * | 2020-11-25 | 2021-03-16 | 平安科技(深圳)有限公司 | Dialect voice recognition method, dialect voice recognition device, dialect voice recognition medium and electronic equipment |
CN112509555B (en) * | 2020-11-25 | 2023-05-23 | 平安科技(深圳)有限公司 | Dialect voice recognition method, device, medium and electronic equipment |
CN112951266A (en) * | 2021-02-05 | 2021-06-11 | 杭州网易云音乐科技有限公司 | Tooth sound adjusting method, tooth sound adjusting device, electronic equipment and computer readable storage medium |
CN112951266B (en) * | 2021-02-05 | 2024-02-06 | 杭州网易云音乐科技有限公司 | Tooth sound adjusting method, tooth sound adjusting device, electronic equipment and computer readable storage medium |
CN115967894A (en) * | 2022-12-15 | 2023-04-14 | 广州迅控电子科技有限公司 | Microphone sound processing method, system, terminal equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110033757A (en) | A kind of voice recognizer | |
Bai et al. | Speaker recognition based on deep learning: An overview | |
CN105047194B (en) | A kind of self study sound spectrograph feature extracting method for speech emotion recognition | |
CN112466326B (en) | Voice emotion feature extraction method based on transducer model encoder | |
CN106504768B (en) | Phone testing audio frequency classification method and device based on artificial intelligence | |
Demircan et al. | Feature extraction from speech data for emotion recognition | |
Alsobhani et al. | Speech recognition using convolution deep neural networks | |
Kumbhar et al. | Speech emotion recognition using MFCC features and LSTM network | |
La Mura et al. | Human-machine interaction personalization: a review on gender and emotion recognition through speech analysis | |
CN111128240B (en) | Voice emotion recognition method based on anti-semantic-erasure | |
US20230298616A1 (en) | System and Method For Identifying Sentiment (Emotions) In A Speech Audio Input with Haptic Output | |
US20210193159A1 (en) | Training a voice morphing apparatus | |
Shahzadi et al. | Recognition of emotion in speech using spectral patterns | |
Zhiyan et al. | Speech emotion recognition based on deep learning and kernel nonlinear PSVM | |
Singh et al. | Speech emotion recognition system using gender dependent convolution neural network | |
Ribeiro et al. | Binary neural networks for classification of voice commands from throat microphone | |
CN114463688A (en) | Cross-modal context coding dialogue emotion recognition method and system | |
Sailor et al. | Unsupervised Deep Auditory Model Using Stack of Convolutional RBMs for Speech Recognition. | |
JP2015175859A (en) | Pattern recognition device, pattern recognition method, and pattern recognition program | |
CN116230020A (en) | Speech emotion recognition and classification method | |
CN115145402A (en) | Intelligent toy system with network interaction function and control method | |
Mohamed et al. | Speech recognition system based on discrete wave atoms transform partial noisy environment | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
Kovács et al. | The joint optimization of spectro-temporal features and neural net classifiers | |
CN114881668A (en) | Multi-mode-based deception detection method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| TA01 | Transfer of patent application right | Effective date of registration: 2019-12-04. Applicant after: Li Dongming, Room 201, 33 Tianmu Road, Gulou District, Nanjing, Jiangsu 210000. Applicant before: Xingzhi Technology Co. Ltd., No. 12 Mozhou Road, Moling Street, Jiangning District, Nanjing, Jiangsu 210000. |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-07-19 |