CN108597501A - An audio-visual speech model based on a residual network and bidirectional gated recurrent units - Google Patents
An audio-visual speech model based on a residual network and bidirectional gated recurrent units Download PDF Info
- Publication number
- CN108597501A (application CN201810383059.1A)
- Authority
- CN
- China
- Prior art keywords
- layers
- audio
- training
- residual error
- stream
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
An audio-visual speech model based on a residual network and bidirectional gated recurrent units (BGRUs) is proposed. Its main components are a vision stream, an audio stream, a classification layer, and audio-visual fusion. The workflow is as follows: within the vision stream and the audio stream, temporal dynamics are modeled by a 2-layer bidirectional gated recurrent unit (BGRU); the BGRU outputs of the two streams are then concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly; the final output comes from a Softmax layer, which labels every frame, and the label sequence is chosen by the highest average probability. The model can extract features directly and simultaneously from raw pixels and audio waveforms, performs word recognition on a large open-context dataset, and, under strong noise, significantly improves classification accuracy compared with traditional audio-visual speech recognition models.
Description
Technical field
The present invention relates to the field of audio-visual speech recognition, and more particularly to an audio-visual speech model based on a residual network and bidirectional gated recurrent units.
Background art
With the marked improvement in personal-computer performance, human-computer interaction has gradually shifted from being computer-centered to being human-centered, and audio-visual speech recognition technology has developed rapidly in this context. Audio-visual speech recognition is widely used in telephone and communication systems, where voice commands let users conveniently query and retrieve information from remote databases; it is also heavily used in devices such as interactive kiosks, voice notebooks, and self-service business terminals, greatly reducing labor costs; and in criminal investigation, audio-visual speech recognition can help establish a suspect's identity by combining the captured acoustic information with facial-expression information. Traditional audio-visual speech recognition, however, is mainly based on mel-frequency cepstral coefficient (MFCC) features and models temporal dynamics with a long short-term memory (LSTM) network, so its recognition accuracy is low under strong noise.
In the audio-visual speech model of the present invention, based on a residual network and bidirectional gated recurrent units, temporal dynamics within the vision stream and the audio stream are modeled by a 2-layer bidirectional gated recurrent unit (BGRU). The BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly. The final output comes from a Softmax layer, which labels every frame; the label sequence is chosen by the highest average probability. The model can extract features directly and simultaneously from raw pixels and audio waveforms, performs word recognition on a large open-context dataset, and, under strong noise, significantly improves classification accuracy compared with traditional audio-visual speech recognition models.
Summary of the invention
To address the problem of low recognition accuracy under strong noise, the purpose of the present invention is to provide an audio-visual speech model based on a residual network and bidirectional gated recurrent units. Within the vision stream and the audio stream, temporal dynamics are modeled by a 2-layer bidirectional gated recurrent unit (BGRU); the BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly; the final output comes from a Softmax layer, which labels every frame, and the label sequence is chosen by the highest average probability.
To solve the above problems, the present invention provides an audio-visual speech model based on a residual network and bidirectional gated recurrent units, whose main components are:
(1) a vision stream;
(2) an audio stream;
(3) a classification layer;
(4) audio-visual fusion.
Wherein the vision stream consists of a spatiotemporal convolution, a 34-layer residual network (ResNet-34), and a 2-layer bidirectional gated recurrent unit (BGRU). The identity-mapping version of the 34-layer network is used. Its main flow is that the residual network progressively reduces the spatial dimensions until the output of each step becomes a one-dimensional tensor; finally, the output of the 34-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
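As an illustration of the recurrence each BGRU layer applies, the standard gated-recurrent-unit equations can be sketched in NumPy (this is a generic re-implementation for clarity; the weight shapes, random initialization, and the small sizes in the usage example are assumptions, not values from the patent):

```python
import numpy as np

def gru_layer(x, h0, Wz, Uz, Wr, Ur, Wh, Uh):
    """Run a single-direction GRU over a sequence x of shape (T, d_in)."""
    h, out = h0, []
    for t in range(x.shape[0]):
        z = 1 / (1 + np.exp(-(x[t] @ Wz + h @ Uz)))   # update gate
        r = 1 / (1 + np.exp(-(x[t] @ Wr + h @ Ur)))   # reset gate
        h_tilde = np.tanh(x[t] @ Wh + (r * h) @ Uh)   # candidate state
        h = (1 - z) * h + z * h_tilde                 # gated state update
        out.append(h)
    return np.stack(out)                              # (T, d_hidden)

def bgru_layer(x, d_hidden, rng):
    """Bidirectional GRU: run forward and backward passes, concatenate states."""
    d_in = x.shape[1]
    def params():  # hypothetical random initialization for illustration
        return [rng.standard_normal((a, d_hidden)) * 0.1
                for a in (d_in, d_hidden) * 3]
    h0 = np.zeros(d_hidden)
    fwd = gru_layer(x, h0, *params())
    bwd = gru_layer(x[::-1], h0, *params())[::-1]
    return np.concatenate([fwd, bwd], axis=1)         # (T, 2 * d_hidden)
```

Stacking two such layers, each with the 1024-cell width stated above, would mirror the 2-layer BGRU back-end of the vision stream.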
Wherein the audio stream consists of an 18-layer residual network (ResNet-18) connected to a 2-layer BGRU. The 18-layer residual network uses the standard architecture, the main difference being that it uses 1D kernels rather than the 2D kernels used for image data. To extract fine spectral information, the first spatiotemporal convolutional layer uses a 5-millisecond temporal kernel with a 0.25-millisecond stride. To match the frame rate of the video, the output of the residual network is average-pooled into 29 frames/windows. These audio frames are then fed into the subsequent residual blocks, which use default kernels of size 3 × 1, so that the deeper levels can extract long-term speech features. The output of the 18-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
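The rate-matching step described above, distributing the audio back-end's output evenly over 29 frames/windows and averaging within each, can be sketched as follows (the feature dimensionality and the use of NumPy's `array_split` are illustrative assumptions):

```python
import numpy as np

def pool_to_frames(features, n_frames=29):
    """Average-pool a (T, d) feature sequence into n_frames windows, so the
    audio stream's rate matches the 29 video frames of a clip."""
    # array_split distributes T as evenly as possible over n_frames windows
    windows = np.array_split(features, n_frames, axis=0)
    return np.stack([w.mean(axis=0) for w in windows])   # (n_frames, d)
```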
Wherein the classification layer consists of a 2-layer BGRU. The BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly. The output layer is a Softmax layer that labels every frame; the label sequence is chosen by the highest average probability.
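The labeling rule of the Softmax output, choosing the class whose probability averaged over all frames is highest, can be sketched as (the logit shapes in the test are assumed):

```python
import numpy as np

def label_sequence(frame_logits):
    """Softmax each frame of a (T, n_classes) logit array, then pick the
    class with the highest probability averaged over all frames."""
    e = np.exp(frame_logits - frame_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)        # per-frame softmax
    return int(probs.mean(axis=0).argmax())         # highest average probability
```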
Wherein, for audio-visual fusion, the end-to-end audio-visual model is the first audio-visual fusion model that can extract features directly and simultaneously from raw pixels and audio waveforms while performing word recognition on a large open-context dataset. Its operating procedure comprises preprocessing, evaluation, and training.
Further, preprocessing is divided into video preprocessing and audio preprocessing. For video, the first step is extracting the mouth region of interest (ROI); because the mouth ROI has been extracted, a fixed 98 × 98 bounding box is used for all videos, and finally every frame is converted to grayscale and standardized according to the population mean and variance. For audio, each segment is z-normalized, i.e. transformed to zero mean and unit standard deviation, to account for the varying degrees of loudness difference between speakers.
Further, for evaluation, the video clips are divided into a training set, a validation set, and a test set. Each word has 800 to 1000 sequences in the training set and 50 sequences each in the validation set and the test set; in total the training, validation, and test sets contain 488,766, 25,000, and 25,000 samples, respectively.
Further, training has two main stages: first the audio and video streams are trained independently, and then the combined audio-visual network is trained.
Further, training an individual audio or video stream is divided into initialization and end-to-end training. Initialization has three main steps: first, a temporal convolutional back-end is used in place of the 2-layer BGRU; then the combination of the residual network and the convolutional back-end (with a Softmax layer) is trained until the classification accuracy on the validation set has not improved for 5 consecutive epochs; finally, the convolutional back-end is removed and replaced with the BGRU back-end. For end-to-end training, once the residual network and 2-layer BGRU of each stream have been pre-trained, they are merged into one complete stream and trained end to end (with a Softmax output layer). End-to-end training uses the Adam optimizer with minibatches of 36 sequences and an initial learning rate of 0.0003, and stops after 5 epochs without improvement.
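The stopping rule used in both training stages, halting once the validation accuracy has not improved for 5 consecutive epochs, can be sketched as a generic early-stopping loop (the `run_epoch` callable and the `max_epochs` cap are illustrative assumptions; the patent does not specify them):

```python
def train_with_early_stopping(run_epoch, patience=5, max_epochs=100):
    """Run training epochs until validation accuracy stops improving for
    `patience` consecutive epochs; return the best accuracy observed.
    `run_epoch` is any callable that trains one epoch and returns that
    epoch's validation accuracy."""
    best, since_best = float("-inf"), 0
    for _ in range(max_epochs):
        acc = run_epoch()
        if acc > best:
            best, since_best = acc, 0     # new best: reset the counter
        else:
            since_best += 1               # no improvement this epoch
            if since_best >= patience:
                break                     # 5 stale epochs -> stop
    return best
```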
Further, training the combined audio-visual network is likewise divided into initialization and end-to-end training. For initialization, once each individual stream has finished training, it is used to initialize the corresponding stream in the multi-stream architecture; then an additional 2-layer BGRU is added on top of all streams to fuse their outputs. This 2-layer BGRU is first trained for 5 epochs (with a Softmax output layer) while the weights of the audio and video streams are kept fixed. For end-to-end training, the entire audio-visual network is then trained jointly, using the Adam optimizer with minibatches of 18 sequences and an initial learning rate of 0.0001; training stops after 5 epochs without improvement.
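The merging step above, concatenating the per-frame outputs of the video and audio BGRUs to form the input of the extra 2-layer fusion BGRU, can be sketched as (the feature sizes in the test are assumptions):

```python
import numpy as np

def fuse_streams(video_out, audio_out):
    """Concatenate the per-frame BGRU outputs of the two streams along the
    feature axis, forming the input to the fusion BGRU."""
    assert video_out.shape[0] == audio_out.shape[0], "streams must be frame-aligned"
    return np.concatenate([video_out, audio_out], axis=1)
```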
Description of the drawings
Fig. 1 is the system framework diagram of the audio-visual speech model based on a residual network and bidirectional gated recurrent units.
Fig. 2 is the flowchart of the audio-visual speech model based on a residual network and bidirectional gated recurrent units.
Fig. 3 shows the ROI extraction of the audio-visual speech model based on a residual network and bidirectional gated recurrent units.
Detailed description
It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The invention is further described in detail below with reference to the drawings and specific embodiments.
Fig. 1 is the system framework diagram of the audio-visual speech model based on a residual network and bidirectional gated recurrent units. The model mainly comprises the vision stream, the audio stream, the classification layer, and audio-visual fusion.
The vision stream consists of a spatiotemporal convolution, a 34-layer residual network (ResNet-34), and a 2-layer bidirectional gated recurrent unit (BGRU). The identity-mapping version of the 34-layer network is used; its main flow is that the residual network progressively reduces the spatial dimensions until the output of each step becomes a one-dimensional tensor. Finally, the output of the 34-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
The audio stream consists of an 18-layer residual network (ResNet-18) connected to a 2-layer BGRU. The 18-layer residual network uses the standard architecture, the main difference being that it uses 1D kernels rather than the 2D kernels used for image data. To extract fine spectral information, the first spatiotemporal convolutional layer uses a 5-millisecond temporal kernel with a 0.25-millisecond stride. To match the frame rate of the video, the output of the residual network is average-pooled into 29 frames/windows. These audio frames are then fed into the subsequent residual blocks, which use default kernels of size 3 × 1, so that the deeper levels can extract long-term speech features. The output of the 18-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
The classification layer consists of a 2-layer BGRU. The BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly. The output layer is a Softmax layer that labels every frame; the label sequence is chosen by the highest average probability.
As for audio-visual fusion, the end-to-end audio-visual speech model is the first audio-visual fusion model that can extract features directly and simultaneously from raw pixels and audio waveforms while performing word recognition on a large open-context dataset. Its operating procedure comprises preprocessing, evaluation, and training.
Fig. 2 is the flowchart of the audio-visual speech model based on a residual network and bidirectional gated recurrent units. It shows the model's workflow: within the vision stream and the audio stream, temporal dynamics are modeled by a 2-layer bidirectional gated recurrent unit (BGRU); the BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly; the final output comes from a Softmax layer, which labels every frame, and the label sequence is chosen by the highest average probability.
Fig. 3 shows the ROI extraction of the audio-visual speech model based on a residual network and bidirectional gated recurrent units. It illustrates how the model extracts the mouth ROI: a fixed 98 × 98 bounding box is used for all videos, and finally every frame is converted to grayscale and standardized according to the population mean and variance.
For those skilled in the art, the present invention is not limited to the details of the above embodiments, and the invention may be realized in other specific forms without departing from its spirit and scope. In addition, those skilled in the art may make various modifications and variations to the invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the protection scope of the invention. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Claims (10)
1. An audio-visual speech model based on a residual network and bidirectional gated recurrent units, characterized in that it mainly comprises a vision stream (1); an audio stream (2); a classification layer (3); and audio-visual fusion (4).
2. The vision stream (1) according to claim 1, characterized in that the vision stream consists of a spatiotemporal convolution, a 34-layer residual network (ResNet-34), and a 2-layer bidirectional gated recurrent unit (BGRU); the identity-mapping version of the 34-layer network is used; its main flow is that the residual network progressively reduces the spatial dimensions until the output of each step becomes a one-dimensional tensor; finally, the output of the 34-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
3. The audio stream (2) according to claim 1, characterized in that the audio stream consists of an 18-layer residual network (ResNet-18) connected to a 2-layer BGRU; the 18-layer residual network uses the standard architecture, the main difference being that it uses 1D kernels rather than the 2D kernels used for image data; to extract fine spectral information, the first spatiotemporal convolutional layer uses a 5-millisecond temporal kernel with a 0.25-millisecond stride; to match the frame rate of the video, the output of the residual network is average-pooled into 29 frames/windows; these audio frames are then fed into the subsequent residual blocks, which use default kernels of size 3 × 1, so that the deeper levels can extract long-term speech features; the output of the 18-layer residual network is fed into the 2-layer BGRU (each layer contains 1024 cells).
4. The classification layer (3) according to claim 1, characterized in that the classification layer consists of a 2-layer BGRU; the BGRU outputs of the two streams are concatenated and fed into the classification layer, where they are fused and their temporal dynamics are modeled jointly; the output layer is a Softmax layer that labels every frame; the label sequence is chosen by the highest average probability.
5. The audio-visual fusion (4) according to claim 1, characterized in that the end-to-end audio-visual speech model is the first audio-visual fusion model that can extract features directly and simultaneously from raw pixels and audio waveforms while performing word recognition on a large open-context dataset; its operating procedure comprises preprocessing, evaluation, and training.
6. The preprocessing according to claim 5, characterized in that it is divided into video preprocessing and audio preprocessing; for video, the first step is extracting the mouth region of interest (ROI); because the mouth ROI has been extracted, a fixed 98 × 98 bounding box is used for all videos; finally, every frame is converted to grayscale and standardized according to the population mean and variance; for audio, each segment is z-normalized, i.e. transformed to zero mean and unit standard deviation, to account for the varying degrees of loudness difference between speakers.
7. The evaluation according to claim 5, characterized in that the video clips are divided into a training set, a validation set, and a test set; each word has 800 to 1000 sequences in the training set and 50 sequences each in the validation set and the test set; in total the training, validation, and test sets contain 488,766, 25,000, and 25,000 samples, respectively.
8. The training according to claim 5, characterized in that it has two main stages: first the audio and video streams are trained independently, and then the combined audio-visual network is trained.
9. The independent training of the audio stream or video stream according to claim 8, characterized in that it is divided into initialization and end-to-end training; initialization has three main steps: first, a temporal convolutional back-end is used in place of the 2-layer BGRU; then the combination of the residual network and the convolutional back-end (with a Softmax layer) is trained until the classification accuracy on the validation set has not improved for 5 consecutive epochs; finally, the convolutional back-end is removed and replaced with the BGRU back-end; for end-to-end training, once the residual network and 2-layer BGRU of each stream have been pre-trained, they are merged into one complete stream and trained end to end (with a Softmax output layer); end-to-end training uses the Adam optimizer with minibatches of 36 sequences and an initial learning rate of 0.0003, and stops after 5 epochs without improvement.
10. The training of the combined audio-visual network according to claim 8, characterized in that it is divided into initialization and end-to-end training; for initialization, once each individual stream has finished training, it is used to initialize the corresponding stream in the multi-stream architecture; then an additional 2-layer BGRU is added on top of all streams to fuse their outputs; this 2-layer BGRU is first trained for 5 epochs (with a Softmax output layer) while the weights of the audio and video streams are kept fixed; for end-to-end training, the entire audio-visual network is trained jointly, using the Adam optimizer with minibatches of 18 sequences and an initial learning rate of 0.0001; training stops after 5 epochs without improvement.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810383059.1A CN108597501A (en) | 2018-04-26 | 2018-04-26 | An audio-visual speech model based on a residual network and bidirectional gated recurrent units |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810383059.1A CN108597501A (en) | 2018-04-26 | 2018-04-26 | An audio-visual speech model based on a residual network and bidirectional gated recurrent units |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108597501A true CN108597501A (en) | 2018-09-28 |
Family
ID=63609339
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810383059.1A Withdrawn CN108597501A (en) | 2018-04-26 | 2018-04-26 | A kind of audio-visual speech model based on residual error network and bidirectional valve controlled cycling element |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108597501A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109801621A (en) * | 2019-03-15 | 2019-05-24 | 三峡大学 | A speech recognition method based on residual gated recurrent units |
CN110097541A (en) * | 2019-04-22 | 2019-08-06 | 电子科技大学 | A no-reference image rain-removal quality assessment system |
CN110600053A (en) * | 2019-07-30 | 2019-12-20 | 广东工业大学 | Cerebral stroke dysarthria risk prediction method based on ResNet and LSTM network |
CN110865705A (en) * | 2019-10-24 | 2020-03-06 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-mode converged communication method and device, head-mounted equipment and storage medium |
CN111128122A (en) * | 2019-12-31 | 2020-05-08 | 苏州思必驰信息科技有限公司 | Method and system for optimizing rhythm prediction model |
- 2018-04-26: application CN201810383059.1A published as CN108597501A (en); status: not active (withdrawn)
Non-Patent Citations (1)
Title |
---|
STAVROS PETRIDIS et al.: "END-TO-END AUDIOVISUAL SPEECH RECOGNITION", published online: https://arxiv.org/abs/1802.06424v2 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109801621A (en) * | 2019-03-15 | 2019-05-24 | 三峡大学 | A speech recognition method based on residual gated recurrent units |
CN110097541A (en) * | 2019-04-22 | 2019-08-06 | 电子科技大学 | A no-reference image rain-removal quality assessment system |
CN110097541B (en) * | 2019-04-22 | 2023-03-28 | 电子科技大学 | No-reference image rain removal quality evaluation system |
CN110600053A (en) * | 2019-07-30 | 2019-12-20 | 广东工业大学 | Cerebral stroke dysarthria risk prediction method based on ResNet and LSTM network |
CN110865705A (en) * | 2019-10-24 | 2020-03-06 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-mode converged communication method and device, head-mounted equipment and storage medium |
CN110865705B (en) * | 2019-10-24 | 2023-09-19 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-mode fusion communication method and device, head-mounted equipment and storage medium |
CN111128122A (en) * | 2019-12-31 | 2020-05-08 | 苏州思必驰信息科技有限公司 | Method and system for optimizing rhythm prediction model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | Lip movements generation at a glance | |
CN108597501A (en) | An audio-visual speech model based on a residual network and bidirectional gated recurrent units | |
US10621991B2 (en) | Joint neural network for speaker recognition | |
JP6993353B2 (en) | Neural network-based voiceprint information extraction method and device | |
US11862145B2 (en) | Deep hierarchical fusion for machine intelligence applications | |
WO2020119630A1 (en) | Multi-mode comprehensive evaluation system and method for customer satisfaction | |
CN112069484A (en) | Multi-mode interactive information acquisition method and system | |
CN109344781A (en) | Expression recognition method in a kind of video based on audio visual union feature | |
CN108269133A (en) | A kind of combination human bioequivalence and the intelligent advertisement push method and terminal of speech recognition | |
CN111292765B (en) | Bimodal emotion recognition method integrating multiple deep learning models | |
CN102930297B (en) | Emotion recognition method for enhancing coupling hidden markov model (HMM) voice-vision fusion | |
KR102167760B1 (en) | Sign language analysis Algorithm System using Recognition of Sign Language Motion process and motion tracking pre-trained model | |
Tao et al. | End-to-end audiovisual speech activity detection with bimodal recurrent neural models | |
CN115329779A (en) | Multi-person conversation emotion recognition method | |
CN112151030A (en) | Multi-mode-based complex scene voice recognition method and device | |
CN109829499A (en) | Image, text and data fusion sensibility classification method and device based on same feature space | |
CN107358947A (en) | Speaker recognition methods and system again | |
Argones Rua et al. | Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden markov models | |
CN112101096A (en) | Suicide emotion perception method based on multi-mode fusion of voice and micro-expression | |
CN114724224A (en) | Multi-mode emotion recognition method for medical care robot | |
Saiful et al. | Real-time sign language detection using cnn | |
Ivanko et al. | An experimental analysis of different approaches to audio–visual speech recognition and lip-reading | |
CN113326868A (en) | Decision layer fusion method for multi-modal emotion classification | |
CN116758451A (en) | Audio-visual emotion recognition method and system based on multi-scale and global cross attention | |
CN116434786A (en) | Text-semantic-assisted teacher voice emotion recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20180928 ||