CN108597501A - An audio-visual speech model based on a residual network and bidirectional gated recurrent units - Google Patents
An audio-visual speech model based on a residual network and bidirectional gated recurrent units Download PDF Info
- Publication number
- CN108597501A (application CN201810383059.1A)
- Authority
- CN
- China
- Prior art keywords
- layers
- audio
- training
- residual error
- stream
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
An audio-visual speech model based on a residual network and bidirectional gated recurrent units (BGRUs) is proposed. Its main components are a vision stream, an audio stream, a classification layer, and audio-visual fusion. The workflow is as follows: within the vision stream and the audio stream, temporal dynamics are modeled by a 2-layer bidirectional gated recurrent unit (BGRU); the BGRU outputs of the two streams are then concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly; the final output comes from a Softmax layer, which labels every frame, and the label sequence is chosen by the highest average probability. The model can extract features directly and simultaneously from raw pixels and audio waveforms, performs word recognition on a large open-context dataset, and, under strong noise, significantly improves classification accuracy compared with traditional audio-visual speech recognition models.
Description
Technical field
The present invention relates to the field of audio-visual speech recognition, and more particularly to an audio-visual speech model based on a residual network and bidirectional gated recurrent units.
Background art
With the marked improvement in personal-computer performance, human-computer interaction has gradually shifted from being computer-centered to being human-centered, and audio-visual speech recognition technology has developed rapidly in this context. Audio-visual speech recognition is widely used in telephone and communication systems, where voice commands let users conveniently query and retrieve information from remote databases; it is also heavily used in devices such as interactive kiosks, voice notebooks, and self-service business terminals, greatly reducing labor costs; and in criminal investigation, audio-visual speech recognition can help establish a suspect's identity by combining the captured acoustic information with facial-expression information. Traditional audio-visual speech recognition, however, is mainly based on mel-frequency cepstral coefficient (MFCC) features and models temporal dynamics with a long short-term memory (LSTM) network, so its recognition accuracy is low under strong noise.
In the audio-visual speech model of the present invention, based on a residual network and bidirectional gated recurrent units, temporal dynamics within the vision stream and the audio stream are modeled by a 2-layer bidirectional gated recurrent unit (BGRU). The BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly. The final output comes from a Softmax layer, which labels every frame; the label sequence is chosen by the highest average probability. The model can extract features directly and simultaneously from raw pixels and audio waveforms, performs word recognition on a large open-context dataset, and, under strong noise, significantly improves classification accuracy compared with traditional audio-visual speech recognition models.
Summary of the invention
To address the problem of low recognition accuracy under strong noise, the purpose of the present invention is to provide an audio-visual speech model based on a residual network and bidirectional gated recurrent units. Within the vision stream and the audio stream, temporal dynamics are modeled by a 2-layer bidirectional gated recurrent unit (BGRU); the BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly; the final output comes from a Softmax layer, which labels every frame, and the label sequence is chosen by the highest average probability.
To solve the above problems, the present invention provides an audio-visual speech model based on a residual network and bidirectional gated recurrent units, whose main components are:
(1) a vision stream;
(2) an audio stream;
(3) a classification layer;
(4) audio-visual fusion.
Wherein the vision stream consists of a spatiotemporal convolution, a 34-layer residual network (ResNet-34), and a 2-layer bidirectional gated recurrent unit (BGRU). The identity-mapping version of the 34-layer network is used. Its main flow is that the residual network progressively reduces the spatial dimensions until the output of each step becomes a one-dimensional tensor; finally, the output of the 34-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
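As an illustration of the recurrence each BGRU layer applies, the standard gated-recurrent-unit equations can be sketched in NumPy (this is a generic re-implementation for clarity; the weight shapes, random initialization, and the small sizes in the usage example are assumptions, not values from the patent):

```python
import numpy as np

def gru_layer(x, h0, Wz, Uz, Wr, Ur, Wh, Uh):
    """Run a single-direction GRU over a sequence x of shape (T, d_in)."""
    h, out = h0, []
    for t in range(x.shape[0]):
        z = 1 / (1 + np.exp(-(x[t] @ Wz + h @ Uz)))   # update gate
        r = 1 / (1 + np.exp(-(x[t] @ Wr + h @ Ur)))   # reset gate
        h_tilde = np.tanh(x[t] @ Wh + (r * h) @ Uh)   # candidate state
        h = (1 - z) * h + z * h_tilde                 # gated state update
        out.append(h)
    return np.stack(out)                              # (T, d_hidden)

def bgru_layer(x, d_hidden, rng):
    """Bidirectional GRU: run forward and backward passes, concatenate states."""
    d_in = x.shape[1]
    def params():  # hypothetical random initialization for illustration
        return [rng.standard_normal((a, d_hidden)) * 0.1
                for a in (d_in, d_hidden) * 3]
    h0 = np.zeros(d_hidden)
    fwd = gru_layer(x, h0, *params())
    bwd = gru_layer(x[::-1], h0, *params())[::-1]
    return np.concatenate([fwd, bwd], axis=1)         # (T, 2 * d_hidden)
```

Stacking two such layers, each with the 1024-cell width stated above, would mirror the 2-layer BGRU back-end of the vision stream.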
Wherein the audio stream consists of an 18-layer residual network (ResNet-18) connected to a 2-layer BGRU. The 18-layer residual network uses the standard architecture, the main difference being that it uses 1D kernels rather than the 2D kernels used for image data. To extract fine spectral information, the first spatiotemporal convolutional layer uses a 5-millisecond temporal kernel with a 0.25-millisecond stride. To match the frame rate of the video, the output of the residual network is average-pooled into 29 frames/windows. These audio frames are then fed into the subsequent residual blocks, which use default kernels of size 3 × 1, so that the deeper levels can extract long-term speech features. The output of the 18-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
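The rate-matching step described above, distributing the audio back-end's output evenly over 29 frames/windows and averaging within each, can be sketched as follows (the feature dimensionality and the use of NumPy's `array_split` are illustrative assumptions):

```python
import numpy as np

def pool_to_frames(features, n_frames=29):
    """Average-pool a (T, d) feature sequence into n_frames windows, so the
    audio stream's rate matches the 29 video frames of a clip."""
    # array_split distributes T as evenly as possible over n_frames windows
    windows = np.array_split(features, n_frames, axis=0)
    return np.stack([w.mean(axis=0) for w in windows])   # (n_frames, d)
```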
Wherein the classification layer consists of a 2-layer BGRU. The BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly. The output layer is a Softmax layer that labels every frame; the label sequence is chosen by the highest average probability.
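The labeling rule of the Softmax output, choosing the class whose probability averaged over all frames is highest, can be sketched as (the logit shapes in the test are assumed):

```python
import numpy as np

def label_sequence(frame_logits):
    """Softmax each frame of a (T, n_classes) logit array, then pick the
    class with the highest probability averaged over all frames."""
    e = np.exp(frame_logits - frame_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)        # per-frame softmax
    return int(probs.mean(axis=0).argmax())         # highest average probability
```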
Wherein, for audio-visual fusion, the end-to-end audio-visual model is the first audio-visual fusion model that can extract features directly and simultaneously from raw pixels and audio waveforms while performing word recognition on a large open-context dataset. Its operating procedure comprises preprocessing, evaluation, and training.
Further, preprocessing is divided into video preprocessing and audio preprocessing. For video, the first step is extracting the mouth region of interest (ROI); because the mouth ROI has been extracted, a fixed 98 × 98 bounding box is used for all videos, and finally every frame is converted to grayscale and standardized according to the population mean and variance. For audio, each segment is z-normalized, i.e. transformed to zero mean and unit standard deviation, to account for the varying degrees of loudness difference between speakers.
Further, for evaluation, the video clips are divided into a training set, a validation set, and a test set. Each word has 800 to 1000 sequences in the training set and 50 sequences each in the validation set and the test set; in total the training, validation, and test sets contain 488,766, 25,000, and 25,000 samples, respectively.
Further, training has two main stages: first the audio and video streams are trained independently, and then the combined audio-visual network is trained.
Further, training an individual audio or video stream is divided into initialization and end-to-end training. Initialization has three main steps: first, a temporal convolutional back-end is used in place of the 2-layer BGRU; then the combination of the residual network and the convolutional back-end (with a Softmax layer) is trained until the classification accuracy on the validation set has not improved for 5 consecutive epochs; finally, the convolutional back-end is removed and replaced with the BGRU back-end. For end-to-end training, once the residual network and 2-layer BGRU of each stream have been pre-trained, they are merged into one complete stream and trained end to end (with a Softmax output layer). End-to-end training uses the Adam optimizer with minibatches of 36 sequences and an initial learning rate of 0.0003, and stops after 5 epochs without improvement.
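The stopping rule used in both training stages, halting once the validation accuracy has not improved for 5 consecutive epochs, can be sketched as a generic early-stopping loop (the `run_epoch` callable and the `max_epochs` cap are illustrative assumptions; the patent does not specify them):

```python
def train_with_early_stopping(run_epoch, patience=5, max_epochs=100):
    """Run training epochs until validation accuracy stops improving for
    `patience` consecutive epochs; return the best accuracy observed.
    `run_epoch` is any callable that trains one epoch and returns that
    epoch's validation accuracy."""
    best, since_best = float("-inf"), 0
    for _ in range(max_epochs):
        acc = run_epoch()
        if acc > best:
            best, since_best = acc, 0     # new best: reset the counter
        else:
            since_best += 1               # no improvement this epoch
            if since_best >= patience:
                break                     # 5 stale epochs -> stop
    return best
```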
Further, training the combined audio-visual network is likewise divided into initialization and end-to-end training. For initialization, once each individual stream has finished training, it is used to initialize the corresponding stream in the multi-stream architecture; then an additional 2-layer BGRU is added on top of all streams to fuse their outputs. This 2-layer BGRU is first trained for 5 epochs (with a Softmax output layer) while the weights of the audio and video streams are kept fixed. For end-to-end training, the entire audio-visual network is then trained jointly, using the Adam optimizer with minibatches of 18 sequences and an initial learning rate of 0.0001; training stops after 5 epochs without improvement.
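The merging step above, concatenating the per-frame outputs of the video and audio BGRUs to form the input of the extra 2-layer fusion BGRU, can be sketched as (the feature sizes in the test are assumptions):

```python
import numpy as np

def fuse_streams(video_out, audio_out):
    """Concatenate the per-frame BGRU outputs of the two streams along the
    feature axis, forming the input to the fusion BGRU."""
    assert video_out.shape[0] == audio_out.shape[0], "streams must be frame-aligned"
    return np.concatenate([video_out, audio_out], axis=1)
```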
Description of the drawings
Fig. 1 is the system framework diagram of the audio-visual speech model based on a residual network and bidirectional gated recurrent units.
Fig. 2 is the flowchart of the audio-visual speech model based on a residual network and bidirectional gated recurrent units.
Fig. 3 shows the ROI extraction of the audio-visual speech model based on a residual network and bidirectional gated recurrent units.
Detailed description
It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The invention is further described in detail below with reference to the drawings and specific embodiments.
Fig. 1 is the system framework diagram of the audio-visual speech model based on a residual network and bidirectional gated recurrent units. The model mainly comprises the vision stream, the audio stream, the classification layer, and audio-visual fusion.
The vision stream consists of a spatiotemporal convolution, a 34-layer residual network (ResNet-34), and a 2-layer bidirectional gated recurrent unit (BGRU). The identity-mapping version of the 34-layer network is used; its main flow is that the residual network progressively reduces the spatial dimensions until the output of each step becomes a one-dimensional tensor. Finally, the output of the 34-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
The audio stream consists of an 18-layer residual network (ResNet-18) connected to a 2-layer BGRU. The 18-layer residual network uses the standard architecture, the main difference being that it uses 1D kernels rather than the 2D kernels used for image data. To extract fine spectral information, the first spatiotemporal convolutional layer uses a 5-millisecond temporal kernel with a 0.25-millisecond stride. To match the frame rate of the video, the output of the residual network is average-pooled into 29 frames/windows. These audio frames are then fed into the subsequent residual blocks, which use default kernels of size 3 × 1, so that the deeper levels can extract long-term speech features. The output of the 18-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
The classification layer consists of a 2-layer BGRU. The BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly. The output layer is a Softmax layer that labels every frame; the label sequence is chosen by the highest average probability.
As for audio-visual fusion, the end-to-end audio-visual speech model is the first audio-visual fusion model that can extract features directly and simultaneously from raw pixels and audio waveforms while performing word recognition on a large open-context dataset. Its operating procedure comprises preprocessing, evaluation, and training.
Fig. 2 is the flowchart of the audio-visual speech model based on a residual network and bidirectional gated recurrent units. It shows the model's workflow: within the vision stream and the audio stream, temporal dynamics are modeled by a 2-layer bidirectional gated recurrent unit (BGRU); the BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly; the final output comes from a Softmax layer, which labels every frame, and the label sequence is chosen by the highest average probability.
Fig. 3 shows the ROI extraction of the audio-visual speech model based on a residual network and bidirectional gated recurrent units. It illustrates how the model extracts the mouth ROI: a fixed 98 × 98 bounding box is used for all videos, and finally every frame is converted to grayscale and standardized according to the population mean and variance.
For those skilled in the art, the present invention is not limited to the details of the above embodiments, and the invention may be realized in other specific forms without departing from its spirit and scope. In addition, those skilled in the art may make various modifications and variations to the invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the protection scope of the invention. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Claims (10)
1. An audio-visual speech model based on a residual network and bidirectional gated recurrent units, characterized in that it mainly comprises a vision stream (1); an audio stream (2); a classification layer (3); and audio-visual fusion (4).
2. The vision stream (1) according to claim 1, characterized in that the vision stream consists of a spatiotemporal convolution, a 34-layer residual network (ResNet-34), and a 2-layer bidirectional gated recurrent unit (BGRU); the identity-mapping version of the 34-layer network is used; its main flow is that the residual network progressively reduces the spatial dimensions until the output of each step becomes a one-dimensional tensor; finally, the output of the 34-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
3. The audio stream (2) according to claim 1, characterized in that the audio stream consists of an 18-layer residual network (ResNet-18) connected to a 2-layer BGRU; the 18-layer residual network uses the standard architecture, the main difference being that it uses 1D kernels rather than the 2D kernels used for image data; to extract fine spectral information, the first spatiotemporal convolutional layer uses a 5-millisecond temporal kernel with a 0.25-millisecond stride; to match the frame rate of the video, the output of the residual network is average-pooled into 29 frames/windows; these audio frames are then fed into the subsequent residual blocks, which use default kernels of size 3 × 1, so that the deeper levels can extract long-term speech features; the output of the 18-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
4. The classification layer (3) according to claim 1, characterized in that the classification layer consists of a 2-layer BGRU; the BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly; the output layer is a Softmax layer that labels every frame; the label sequence is chosen by the highest average probability.
5. The audio-visual fusion (4) according to claim 1, characterized in that the end-to-end audio-visual speech model is the first audio-visual fusion model that can extract features directly and simultaneously from raw pixels and audio waveforms while performing word recognition on a large open-context dataset; its operating procedure comprises preprocessing, evaluation, and training.
6. The preprocessing according to claim 5, characterized in that it is divided into video preprocessing and audio preprocessing; for video, the first step is extracting the mouth region of interest (ROI); because the mouth ROI has been extracted, a fixed 98 × 98 bounding box is used for all videos; finally, every frame is converted to grayscale and standardized according to the population mean and variance; for audio, each segment is z-normalized, i.e. transformed to zero mean and unit standard deviation, to account for the varying degrees of loudness difference between speakers.
7. The evaluation according to claim 5, characterized in that the video clips are divided into a training set, a validation set, and a test set; each word has 800 to 1000 sequences in the training set and 50 sequences each in the validation set and the test set; in total the training, validation, and test sets contain 488,766, 25,000, and 25,000 samples, respectively.
8. The training according to claim 5, characterized in that it has two main stages: first the audio and video streams are trained independently, and then the combined audio-visual network is trained.
9. The independent training of the audio stream or video stream according to claim 8, characterized in that it is divided into initialization and end-to-end training; initialization has three main steps: first, a temporal convolutional back-end is used in place of the 2-layer BGRU; then the combination of the residual network and the convolutional back-end (with a Softmax layer) is trained until the classification accuracy on the validation set has not improved for 5 consecutive epochs; finally, the convolutional back-end is removed and replaced with the BGRU back-end; for end-to-end training, once the residual network and 2-layer BGRU of each stream have been pre-trained, they are merged into one complete stream and trained end to end (with a Softmax output layer); end-to-end training uses the Adam optimizer with minibatches of 36 sequences and an initial learning rate of 0.0003, and stops after 5 epochs without improvement.
10. The training of the combined audio-visual network according to claim 8, characterized in that it is divided into initialization and end-to-end training; for initialization, once each individual stream has finished training, it is used to initialize the corresponding stream in the multi-stream architecture; then an additional 2-layer BGRU is added on top of all streams to fuse their outputs; this 2-layer BGRU is first trained for 5 epochs (with a Softmax output layer) while the weights of the audio and video streams are kept fixed; for end-to-end training, the entire audio-visual network is trained jointly, using the Adam optimizer with minibatches of 18 sequences and an initial learning rate of 0.0001; training stops after 5 epochs without improvement.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810383059.1A CN108597501A (en) | 2018-04-26 | 2018-04-26 | An audio-visual speech model based on a residual network and bidirectional gated recurrent units |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810383059.1A CN108597501A (en) | 2018-04-26 | 2018-04-26 | An audio-visual speech model based on a residual network and bidirectional gated recurrent units |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108597501A true CN108597501A (en) | 2018-09-28 |
Family
ID=63609339
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810383059.1A Withdrawn CN108597501A (en) | 2018-04-26 | 2018-04-26 | A kind of audio-visual speech model based on residual error network and bidirectional valve controlled cycling element |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108597501A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109801621A (en) * | 2019-03-15 | 2019-05-24 | 三峡大学 | A speech recognition method based on residual gated recurrent units |
CN110097541A (en) * | 2019-04-22 | 2019-08-06 | 电子科技大学 | A no-reference image rain-removal quality assessment system |
CN110600053A (en) * | 2019-07-30 | 2019-12-20 | 广东工业大学 | Cerebral stroke dysarthria risk prediction method based on ResNet and LSTM network |
CN110865705A (en) * | 2019-10-24 | 2020-03-06 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-mode converged communication method and device, head-mounted equipment and storage medium |
CN111128122A (en) * | 2019-12-31 | 2020-05-08 | 苏州思必驰信息科技有限公司 | Method and system for optimizing rhythm prediction model |
- 2018-04-26: application CN201810383059.1A published as CN108597501A (en); status: not active (withdrawn)
Non-Patent Citations (1)
Title |
---|
STAVROS PETRIDIS et al.: "END-TO-END AUDIOVISUAL SPEECH RECOGNITION", published online: https://arxiv.org/abs/1802.06424v2 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109801621A (en) * | 2019-03-15 | 2019-05-24 | 三峡大学 | A speech recognition method based on residual gated recurrent units |
CN110097541A (en) * | 2019-04-22 | 2019-08-06 | 电子科技大学 | A no-reference image rain-removal quality assessment system |
CN110097541B (en) * | 2019-04-22 | 2023-03-28 | 电子科技大学 | No-reference image rain removal quality evaluation system |
CN110600053A (en) * | 2019-07-30 | 2019-12-20 | 广东工业大学 | Cerebral stroke dysarthria risk prediction method based on ResNet and LSTM network |
CN110865705A (en) * | 2019-10-24 | 2020-03-06 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-mode converged communication method and device, head-mounted equipment and storage medium |
CN110865705B (en) * | 2019-10-24 | 2023-09-19 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-mode fusion communication method and device, head-mounted equipment and storage medium |
CN111128122A (en) * | 2019-12-31 | 2020-05-08 | 苏州思必驰信息科技有限公司 | Method and system for optimizing rhythm prediction model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | Lip movements generation at a glance | |
CN108597501A (en) | An audio-visual speech model based on a residual network and bidirectional gated recurrent units | |
US10621991B2 (en) | Joint neural network for speaker recognition | |
JP6993353B2 (en) | Neural network-based voiceprint information extraction method and device | |
US11862145B2 (en) | Deep hierarchical fusion for machine intelligence applications | |
WO2020119630A1 (en) | Multi-mode comprehensive evaluation system and method for customer satisfaction | |
CN112069484A (en) | Multi-mode interactive information acquisition method and system | |
CN109344781A (en) | Expression recognition method in a kind of video based on audio visual union feature | |
CN108269133A (en) | A kind of combination human bioequivalence and the intelligent advertisement push method and terminal of speech recognition | |
CN111292765B (en) | Bimodal emotion recognition method integrating multiple deep learning models | |
CN102930297B (en) | Emotion recognition method for enhancing coupling hidden markov model (HMM) voice-vision fusion | |
KR102167760B1 (en) | Sign language analysis Algorithm System using Recognition of Sign Language Motion process and motion tracking pre-trained model | |
Tao et al. | End-to-end audiovisual speech activity detection with bimodal recurrent neural models | |
CN115329779A (en) | Multi-person conversation emotion recognition method | |
CN112151030A (en) | Multi-mode-based complex scene voice recognition method and device | |
CN109829499A (en) | Image, text and data fusion sensibility classification method and device based on same feature space | |
CN107358947A (en) | Speaker recognition methods and system again | |
Argones Rua et al. | Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden markov models | |
CN112101096A (en) | Suicide emotion perception method based on multi-mode fusion of voice and micro-expression | |
CN114724224A (en) | Multi-mode emotion recognition method for medical care robot | |
Saiful et al. | Real-time sign language detection using cnn | |
Ivanko et al. | An experimental analysis of different approaches to audio–visual speech recognition and lip-reading | |
CN113326868A (en) | Decision layer fusion method for multi-modal emotion classification | |
CN116758451A (en) | Audio-visual emotion recognition method and system based on multi-scale and global cross attention | |
CN116434786A (en) | Text-semantic-assisted teacher voice emotion recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20180928 ||