CN115019772A - Cantonese speech recognition enhancement method based on visual information - Google Patents

Cantonese speech recognition enhancement method based on visual information

Info

Publication number
CN115019772A
Authority
CN
China
Prior art keywords
audio
video
data
network
visual information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210636176.0A
Other languages
Chinese (zh)
Inventor
肖业伟
滕连伟
刘烜铭
朱澳苏
田丕承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University
Priority to CN202210636176.0A
Publication of CN115019772A
Legal status: Pending (Current)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/8547 Content authoring involving timestamps for synchronizing content

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computer Security & Cryptography (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a Cantonese speech recognition enhancement method based on visual information, which comprises the following steps: S1, constructing a data set; S2, processing the data; S3, training a model on the preprocessed data using an algorithm; and S4, using the trained model to compare recognition performance with an audio-only model in different speech environments. The method adopts an end-to-end audio-visual enhancement network based on a multi-scale temporal convolutional network and uses visual information to enhance Cantonese speech recognition, effectively improving Cantonese recognition in complex acoustic environments.

Description

Cantonese speech recognition enhancement method based on visual information
Technical Field
The invention relates to the technical field of speech recognition and lip reading, and in particular to a Cantonese speech recognition enhancement method based on visual information.
Background
Speech is the most natural way for humans to communicate and express themselves, and it is a central part of human-computer interaction. In recent decades, with the rapid advance of deep learning, the field of speech recognition has made breakthrough progress, and many commercial speech recognition products reach an accuracy above 95%. Intelligent speech recognition has become one of the main modes of human-computer interaction and is used in many products such as in-car systems, smart home appliances and voice assistants. In everyday environments, however, intelligent speech products encounter all kinds of noise, which sharply degrades recognition accuracy and harms the user experience. How to recover clean speech in a complex acoustic environment has therefore become a popular research topic in recent years.
Speech recognition is multi-modal in nature: in addition to the acoustic information reaching the ear, a listener can read speech content from other cues of the speaker, such as the tongue, teeth, chin and facial expressions. Research in neuroscience and speech perception has shown that visual cues have a potentially powerful effect on a listener's ability to focus auditory attention on a specific stimulus. At the same time, visual information is unaffected by acoustic noise, which makes it a reliable cue for speech recognition in complex acoustic environments.
Cantonese is widely spoken in the Hong Kong and Macao Special Administrative Regions, in the Guangdong and Guangxi regions of China, and in Chinese communities around the world, so research on Cantonese speech enhancement in complex acoustic environments brooks no delay. Even when the spoken content is the same as in Mandarin, Cantonese differs in its sound patterns, tones and syllable durations. A Mandarin speech enhancement model therefore may not transfer directly to the Cantonese task.
There has been very little research on Cantonese in existing speech recognition enhancement work. Combining research progress in this field at home and abroad, the Cantonese speech recognition enhancement task faces the following main technical difficulties:
(1) no research institution or individual has published a large-scale Cantonese audio-visual data set;
(2) how to use visual information to improve speech recognition accuracy in a complex acoustic environment;
(3) how to obtain multi-scale temporal information by acquiring receptive fields of multiple sizes.
Disclosure of Invention
The invention aims to provide a Cantonese speech recognition enhancement method based on visual information, filling the gap left by the absence of a large-scale data set in the field of Cantonese audio-visual enhancement.
To achieve this aim, the invention provides a Cantonese speech recognition enhancement method based on visual information, comprising the following steps:
S1, constructing a data set;
S2, processing the data;
S3, training a model on the preprocessed data using an algorithm to obtain a trained model;
and S4, using the trained model to compare recognition performance with an audio-only model in different speech environments.
Preferably, in step S1, the you-get tool is used to obtain Cantonese video resources, the video sources are fed into an automatic data acquisition system, and the audio data and video data are processed to obtain a Cantonese audio-visual data set containing both audio data and video data.
Preferably, the video source is first manually cut into short sentence-level clips before entering the automatic data acquisition system.
Preferably, in step S2, the video data and the audio data are processed separately to obtain a lip-region image sequence and an audio waveform, and the video sequence, the audio waveform and the corresponding text information are encoded to obtain packed training data.
Preferably, the processing is based on a ResNet-18 and MS-TCN backbone network, specifically as follows:
(1) the first 2D convolutional layer of the ResNet-18 network in the sub-network corresponding to the video stream is replaced with a 3D convolutional layer with a kernel size of 5 × 7 × 7, so that the temporal information of lip motion is captured more effectively while its fine-grained features are preserved;
(2) for the audio-stream sub-network, a ResNet-18 network based on 1D convolutional layers is used; the kernel of the first layer is set to 80 samples (5 ms) and the stride to 4;
(3) a multi-scale temporal convolutional network is designed; the receptive field of the temporal convolutional network is varied by changing the kernel size and stride, and acquiring receptive fields of several scales allows long-term and short-term feature information to be mixed during feature encoding;
(4) the loss function and the optimizer are improved;
(5) the training strategy is improved;
(6) an audio-visual enhancement model is constructed.
Preferably, the method further comprises preprocessing the data before training in step S3: noise from the NOISEX database is added to all audio data at signal-to-noise ratios from -5 dB to 20 dB to simulate different complex speech environments, and the processed video data, audio data and text information are encoded with the libjpeg tool.
Preferably, in step S3, a video network is used to extract features from the video data and an audio network is used to extract features from the audio data; the extracted audio and video features are concatenated as the input of the fusion network, a prediction is generated by the fusion network, and the whole system is then trained end to end.
With the Cantonese speech recognition enhancement method based on visual information, online video resources are downloaded, manually cut and screened for invalid scenes, and the sentence-by-sentence clips are fed into a self-designed Cantonese audio-visual data acquisition system for collection. The ResNet-18 network is modified separately for the video data and the audio data, and long-term and short-term feature information can be mixed during feature encoding by acquiring receptive fields of multiple scales.
Specifically, the first convolutional layer of the ResNet-18 network in the sub-network corresponding to the video stream is set to a 3D convolutional layer with a kernel size of 5 × 7 × 7, which effectively captures the temporal information of lip motion while preserving its fine-grained features.
A multi-scale temporal convolutional network is added at the back end of the fusion network and of the sub-networks corresponding to the video and audio streams. The multi-scale temporal convolution model consists of three temporal convolutional network branches with different kernel sizes, whose outputs are simply combined by concatenation. Obtaining a multi-scale receptive field in this way allows long-term and short-term feature information to be mixed during feature encoding, yields better recognition, and overcomes the limitation that the audio signal is one-dimensional, single-channel information.
By collecting a word-level Cantonese audio-visual data set, the application fills the gap left by the absence of a large-scale data set in the field of Cantonese audio-visual enhancement; by proposing an end-to-end audio-visual enhancement network based on a multi-scale temporal convolutional network and using visual information to enhance Cantonese speech recognition, it effectively improves Cantonese speech recognition in complex acoustic environments.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flow chart of the Cantonese speech recognition enhancement method based on visual information according to the present invention;
FIG. 2 is a flow chart of collecting the Cantonese audio-visual data set according to the present invention;
FIG. 3 is a schematic illustration of the lip-region extraction of the present invention;
FIG. 4 shows the complete network of the end-to-end audio-visual enhancement method based on a multi-scale temporal convolutional network proposed by the present invention.
Detailed Description
The technical solution of the present invention is further illustrated by the accompanying drawings and examples.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments and may be embodied in other specific forms without departing from its spirit or essential attributes. The embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description; all changes that come within the meaning and range of equivalency of the claims are intended to be embraced therein, and any reference signs in the claims shall not be construed as limiting the claims.
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single technical solution; the description is arranged this way only for clarity, and those skilled in the art should read the description as a whole, since the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art. These other embodiments are also covered by the scope of the present invention.
It should be understood that the above embodiments are intended only to explain the present invention and do not limit its scope of protection; any equivalent replacement or modification of the technical solution and its inventive concept made by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.
The use of the word "comprising" or "comprises" means that the element preceding the word covers the elements listed after it and does not exclude the possibility of other elements. Terms such as "inner", "outer", "upper" and "lower" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience in describing the invention and simplifying the description, do not indicate or imply that the referenced devices or elements must have a specific orientation or be constructed and operated in a specific orientation, and are therefore not to be construed as limiting the invention; when the absolute position of the described object changes, the relative positional relationships may change accordingly. Unless otherwise expressly stated or limited, terms such as "attached" are to be construed broadly, for example as a fixed connection, a removable connection or an integral part, and as a direct or indirect connection through intervening media; the specific meanings of these terms can be understood by those skilled in the art according to the specific situation. The term "about" has the meaning well known to those skilled in the art and preferably means that the modified value lies within ±50%, ±40%, ±30%, ±20%, ±10%, ±5% or ±1% of it.
All terms (including technical or scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs unless specifically defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
The disclosures of the prior art documents cited in the present description are incorporated by reference in their entirety and are, therefore, part of the present disclosure.
Example one
A Cantonese speech recognition enhancement method based on visual information comprises the following steps:
S1, as shown in FIG. 2, the Cantonese lip-reading data collection system collects and constructs the Cantonese audio-visual data set as follows:
(1) Cantonese television programs, such as Cantonese news broadcasts, Cantonese variety shows, Cantonese interview programs and talk shows, are crawled from the Internet with the you-get tool;
(2) useless scenes are removed manually (scenes with speech but no visible speaker, or where the visible speaker does not match the speech);
(3) the collected audio and video are synchronized and aligned, and the video is cut into segments sentence by sentence;
(4) the audio and video of each collected segment are separated; the audio is segmented into words and timestamped with the iFlytek speech transcription service, and the video and audio files are labeled following the audio naming scheme;
(5) every word-level timestamp is expanded by 0.02 s at both the start and the end, and an annotation text is generated in the order video sequence name, word timestamps, word pinyin, word (an annotation sketch is given after this list);
(6) the faces in the video sequence are extracted with the MediaPipe tool to obtain facial landmark points;
(7) FIG. 3 is a schematic illustration of the lip-region extraction. In the x-axis direction, the distance between the two mouth corners each extended outward by 12% is compared with twice the distance d_MN from the lip center to the nose tip, and the larger of the two is taken as the side length. The coordinates of the midpoint of the two mouth corners are then calculated, this point is taken as the center of the cropped lip region, and a square lip region with side length L is extracted around it. X_l is the x-axis coordinate of the left mouth corner and X_r is the x-axis coordinate of the right mouth corner. The side length L is calculated as follows (a cropping sketch implementing equation (1) is given after this list):
L = max{2 · d_MN, 1.12 · X_r - 0.88 · X_l}    (1)
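As an illustration of equation (1), the square crop can be computed from the landmark coordinates as follows. This is only a minimal sketch: the function name, the assumption of pixel coordinates and the integer rounding are illustrative and are not specified in the description.

    import numpy as np

    def lip_crop_box(left_corner, right_corner, lip_center, nose_tip):
        """Square lip-region box of equation (1).

        left_corner, right_corner: (x, y) pixel coordinates of the left and right mouth corners.
        lip_center, nose_tip: (x, y) pixel coordinates of the lip center and the nose tip.
        Returns (x0, y0, side) of the square crop.
        """
        x_l, x_r = left_corner[0], right_corner[0]
        # d_MN: distance from the lip center to the nose tip
        d_mn = np.linalg.norm(np.asarray(lip_center, float) - np.asarray(nose_tip, float))
        # L = max{2 * d_MN, 1.12 * X_r - 0.88 * X_l}
        side = max(2.0 * d_mn, 1.12 * x_r - 0.88 * x_l)
        # the crop is centered on the midpoint of the two mouth corners
        cx = (left_corner[0] + right_corner[0]) / 2.0
        cy = (left_corner[1] + right_corner[1]) / 2.0
        return int(round(cx - side / 2)), int(round(cy - side / 2)), int(round(side))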
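The annotation generation of step (5) can be sketched in the same spirit; the 0.02 s padding and the field order follow the description, while the tab-separated layout and the function name are assumptions.

    def make_annotation_lines(video_name, segments, pad=0.02):
        """segments: list of (start_s, end_s, pinyin, word) returned by the transcription step.
        Returns one annotation line per word: video name, padded timestamps, pinyin, word."""
        lines = []
        for start, end, pinyin, word in segments:
            start = max(0.0, start - pad)  # expand 0.02 s to the left
            end = end + pad                # expand 0.02 s to the right
            lines.append(f"{video_name}\t{start:.3f}\t{end:.3f}\t{pinyin}\t{word}")
        return lines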
S2, as shown in FIG. 4, the complete network of the end-to-end audio-visual enhancement method is based on a multi-scale temporal convolutional network; the backbone consists of a video-stream sub-network, an audio-stream sub-network and a fusion network. The video-stream and audio-stream sub-networks directly extract features from the input image sequence and the input audio waveform, respectively. Each sub-network consists of a modified ResNet-18 network followed by an MS-TCN. The outputs of the two MS-TCNs are concatenated and used as the input of another MS-TCN that fuses the video and audio features and jointly models the fused features along the time dimension.
The ResNet-18 + MS-TCN backbone network is configured as follows:
(1) for the video-stream sub-network, in order to capture the temporal information of lip motion more effectively while preserving its fine-grained features, the first 2D convolutional layer of the ResNet-18 network is replaced with a 3D convolutional layer with a kernel size of 5 × 7 × 7, after which spatial max pooling compresses the spatial features;
(2) for the audio-stream sub-network, since the audio signal is one-dimensional, a ResNet-18 network based on 1D convolutional layers is used; the kernel of the first layer is set to 80 samples (5 ms) and the stride to 4. To down-sample along the time axis, the stride of each subsequent layer is set to 2, and average pooling then down-samples the audio features to 25 frames per second to match the frame rate of the video features;
(3) the multi-scale temporal convolutional network is added at the back end of the fusion network and of the sub-networks corresponding to the video and audio streams. The multi-scale temporal convolution model consists of three temporal convolutional network branches with different kernel sizes whose outputs are simply combined by concatenation; obtaining a multi-scale receptive field in this way allows long-term and short-term feature information to be mixed during feature encoding and yields better recognition. Minimal sketches of these front-end layers and of the multi-scale branches are given after this list.
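A minimal PyTorch sketch of the modified front-end layers of items (1) and (2). Only the 5 × 7 × 7 video kernel, the 80-sample stride-4 audio kernel and the spatial max pooling come from the description; the 64-channel width, the video strides and paddings, the 16 kHz sampling rate implied by "5 ms" and the example input shapes are assumptions.

    import torch
    import torch.nn as nn

    # Video front end: the first 2D conv of ResNet-18 is replaced by a 3D conv with
    # kernel 5 x 7 x 7 (time x height x width), followed by spatial max pooling.
    video_frontend = nn.Sequential(
        nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3), bias=False),
        nn.BatchNorm3d(64),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
    )

    # Audio front end: first layer of the 1D-conv ResNet-18, kernel of 80 samples
    # (5 ms at a 16 kHz sampling rate) and stride 4.
    audio_frontend = nn.Conv1d(1, 64, kernel_size=80, stride=4, padding=38, bias=False)

    frames = torch.randn(2, 1, 29, 88, 88)   # (batch, channel, time, height, width)
    wave = torch.randn(2, 1, 18560)          # (batch, channel, samples), about 1.16 s at 16 kHz
    print(video_frontend(frames).shape)      # -> (2, 64, 29, 22, 22)
    print(audio_frontend(wave).shape)        # -> (2, 64, 4640)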
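The multi-scale temporal convolution of item (3) concatenates branches with different kernel sizes. The sketch below captures that idea; the branch kernel sizes (3, 5, 7), the channel counts and the plain (non-dilated) convolutions are assumptions, since the description only states that three branches with different kernels are concatenated.

    import torch
    import torch.nn as nn

    class MultiScaleTCNBlock(nn.Module):
        """Three temporal-convolution branches with different kernel sizes;
        their outputs are concatenated, mixing short- and long-term context."""

        def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7)):
            super().__init__()
            branch_ch = out_ch // len(kernel_sizes)
            self.branches = nn.ModuleList([
                nn.Sequential(
                    nn.Conv1d(in_ch, branch_ch, k, padding=k // 2, bias=False),
                    nn.BatchNorm1d(branch_ch),
                    nn.ReLU(inplace=True),
                )
                for k in kernel_sizes
            ])

        def forward(self, x):  # x: (batch, channels, time)
            return torch.cat([branch(x) for branch in self.branches], dim=1)

    # Fusion: the video and audio MS-TCN outputs are concatenated along the channel
    # axis and fed to another MS-TCN that models the fused features over time.
    video_feats = torch.randn(2, 256, 29)   # (batch, channels, frames) from the video sub-network
    audio_feats = torch.randn(2, 256, 29)   # audio features down-sampled to 25 frames per second
    fusion = MultiScaleTCNBlock(in_ch=512, out_ch=768)
    fused = fusion(torch.cat([video_feats, audio_feats], dim=1))   # -> (2, 768, 29)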
S3, data preprocessing.
(1) For the video data: the lip image sequence is randomly cropped to 88 × 88, horizontally flipped with probability 0.5, converted to grayscale and normalized to [0, 1].
(2) For the audio data: to study the robustness of the audio-visual enhancement model in different noise environments, experiments are carried out under different noise conditions. Noise from the NOISEX database is added to all audio data at signal-to-noise ratios from -5 dB to 20 dB to simulate different complex speech environments. All audio is then normalized to zero mean and unit variance.
(3) The video, the audio and the corresponding labels are encoded with the libjpeg tool, producing the processed training data. A preprocessing sketch covering steps (1) and (2) follows this list.
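A sketch of the preprocessing in steps (1) and (2). The crop size, flip probability, SNR range and zero-mean/unit-variance normalization follow the description; the SNR-based mixing formula, the assumption that the input frames are already grayscale and larger than 88 × 88, and the function names are illustrative.

    import numpy as np

    def preprocess_video(frames, train=True):
        """frames: (T, H, W) grayscale lip images with H, W >= 88.
        Random 88 x 88 crop, horizontal flip with probability 0.5, scaled to [0, 1]."""
        t, h, w = frames.shape
        y = np.random.randint(0, h - 88 + 1)
        x = np.random.randint(0, w - 88 + 1)
        out = frames[:, y:y + 88, x:x + 88].astype(np.float32) / 255.0
        if train and np.random.rand() < 0.5:
            out = out[:, :, ::-1].copy()    # horizontal flip
        return out

    def add_noise(wave, noise, snr_db):
        """Mix a NOISEX noise segment into the waveform at the given SNR (in dB),
        then normalize the result to zero mean and unit variance."""
        noise = np.resize(noise, wave.shape)
        p_signal = np.mean(wave ** 2) + 1e-12
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
        noisy = wave + scale * noise
        return (noisy - noisy.mean()) / (noisy.std() + 1e-12)

    snr_db = np.random.uniform(-5.0, 20.0)   # simulate environments from -5 dB to 20 dB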
S4, training.
The training process has two steps: first the audio and video sub-networks are jointly trained, and then the whole end-to-end Cantonese audio-visual enhancement network based on the multi-scale temporal convolutional network is trained.
(1) The outputs of the audio sub-network and the video sub-network are concatenated and fed to a softmax layer, and the whole network consisting of the two sub-networks and the softmax layer is trained end to end. An Adam optimizer is used with an initial learning rate η = 3e-4 and a weight decay of 1e-4. Training uses a single GPU with the batch size set to 32. A cosine learning-rate schedule is used over 80 training epochs. The learning rate for each epoch is calculated as follows:
η_t = (η / 2) · (1 + cos(π · t / T))    (2)
where T is the total number of training epochs (T = 80), t is the current epoch, η is the initial learning rate and η_t is the learning rate of the current epoch. The advantage of the cosine schedule is that the learning rate decays from the very start of training while remaining relatively large in the early epochs, which benefits training.
(2) After the single-stream networks are trained, the softmax layer is replaced with a multi-scale temporal convolutional network, and the whole audio-visual speech enhancement network is trained end to end with the same settings, yielding the pre-trained weights of the whole network. A minimal optimizer and scheduler sketch follows.
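A minimal optimizer and scheduler sketch matching equation (2) and the stated hyper-parameters (Adam, η = 3e-4, weight decay 1e-4, 80 epochs). The stand-in model is a placeholder for the joint audio-visual network, which is not reproduced here.

    import math
    import torch

    def cosine_lr(t, total_epochs=80, base_lr=3e-4):
        """Equation (2): eta_t = 0.5 * eta * (1 + cos(pi * t / T))."""
        return 0.5 * base_lr * (1.0 + math.cos(math.pi * t / total_epochs))

    model = torch.nn.Linear(10, 2)   # placeholder for the audio-visual enhancement network
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80)

    for epoch in range(80):
        # ... one pass over the training data (batch size 32) would go here ...
        scheduler.step()
        # the scheduler's learning rate now equals cosine_lr(epoch + 1)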
S5, demonstration of the Cantonese speech recognition enhancement method.
(1) A system UI is designed with the PySide2 toolkit and divided into four areas: a prompt-word area, a face display area, a noise-adding area and a result display area (a minimal layout sketch is given after this list).
(2) The training weights, the model and the functions bound to the buttons of the UI are wired together in code.
(3) Clicking the "Start recognition" button starts recognition: a video sequence and an audio waveform are collected through the face display area, and the faces in the video sequence are extracted with the MediaPipe tool.
(4) Clicking the "Add noise" button adds random noise from the NOISEX audio library at -5 dB to 20 dB to the collected audio.
(5) The network is loaded and the video sequence and audio data are processed with the trained weights.
(6) The recognition result is shown in the result display area.
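A minimal PySide2 layout sketch of the four UI areas described above. The widget names, button labels and the commented-out connections are illustrative assumptions; the actual demo wires the buttons to the trained model and to the noise-mixing routine.

    import sys
    from PySide2 import QtWidgets

    app = QtWidgets.QApplication(sys.argv)
    window = QtWidgets.QWidget()
    window.setWindowTitle("Cantonese audio-visual speech recognition demo")

    layout = QtWidgets.QVBoxLayout(window)
    prompt_area = QtWidgets.QLabel("Prompt words")                 # prompt-word area
    face_area = QtWidgets.QLabel("Camera preview / face display")  # face display area
    noise_button = QtWidgets.QPushButton("Add noise")              # noise-adding area
    start_button = QtWidgets.QPushButton("Start recognition")
    result_area = QtWidgets.QLabel("Recognition result")           # result display area
    for widget in (prompt_area, face_area, noise_button, start_button, result_area):
        layout.addWidget(widget)

    # start_button.clicked.connect(run_recognition)    # run_recognition is assumed
    # noise_button.clicked.connect(add_random_noise)   # add_random_noise is assumed

    window.show()
    sys.exit(app.exec_())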
In summary, the invention adopts a Cantonese speech recognition enhancement method based on visual information, proposes an end-to-end audio-visual enhancement network based on a multi-scale temporal convolutional network, uses visual information to enhance Cantonese speech recognition, and effectively improves Cantonese speech recognition in complex acoustic environments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it; although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications and equivalent replacements may be made without departing from the spirit and scope of the invention.

Claims (7)

1. A Cantonese speech recognition enhancement method based on visual information, characterized by comprising the following steps:
S1, constructing a data set;
S2, processing the data;
S3, training a model on the preprocessed data using an algorithm to obtain a trained model;
and S4, using the trained model to compare recognition performance with an audio-only model in different speech environments.
2. The Cantonese speech recognition enhancement method based on visual information according to claim 1, wherein: in step S1, Cantonese video resources are obtained with the you-get tool, the video sources are fed into an automatic data acquisition system, and the audio data and video data are processed to obtain a Cantonese audio-visual data set containing both audio data and video data.
3. The Cantonese speech recognition enhancement method based on visual information according to claim 2, wherein: the video source is first manually cut into short sentence-level clips before entering the automatic data acquisition system.
4. The Cantonese speech recognition enhancement method based on visual information according to claim 1, wherein: in step S2, the video data and the audio data are processed separately to obtain a lip-region image sequence and an audio waveform, and the video sequence, the audio waveform and the corresponding text information are encoded to obtain packed training data.
5. The Cantonese speech recognition enhancement method based on visual information according to claim 4, wherein the processing is based on a ResNet-18 and MS-TCN backbone network, specifically as follows:
(1) the first 2D convolutional layer of the ResNet-18 network in the sub-network corresponding to the video stream is replaced with a 3D convolutional layer with a kernel size of 5 × 7 × 7, so that the temporal information of lip motion is captured more effectively while its fine-grained features are preserved;
(2) for the audio-stream sub-network, a ResNet-18 network based on 1D convolutional layers is used; the kernel of the first layer is set to 80 and the stride to 4;
(3) a multi-scale temporal convolutional network is designed; the receptive field of the temporal convolutional network is varied by changing the kernel size and stride, and acquiring receptive fields of several scales allows long-term and short-term feature information to be mixed during feature encoding;
(4) the loss function and the optimizer are improved;
(5) the training strategy is improved;
(6) an audio-visual enhancement model is constructed.
6. The Cantonese speech recognition enhancement method based on visual information according to claim 1, wherein: the method further comprises preprocessing the data before training in step S3: noise from the NOISEX database is added to all audio data at signal-to-noise ratios from -5 dB to 20 dB to simulate different complex speech environments, and the processed video data, audio data and text information are encoded with the libjpeg tool.
7. The Cantonese speech recognition enhancement method based on visual information according to claim 1, wherein: in step S3, a video network is used to extract features from the video data and an audio network is used to extract features from the audio data; the extracted audio and video features are concatenated as the input of the fusion network, a prediction is generated by the fusion network, and the whole system is then trained end to end.
CN202210636176.0A 2022-06-07 2022-06-07 Cantonese speech recognition enhancement method based on visual information Pending CN115019772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210636176.0A CN115019772A (en) 2022-06-07 2022-06-07 Cantonese speech recognition enhancement method based on visual information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210636176.0A CN115019772A (en) 2022-06-07 2022-06-07 Cantonese speech recognition enhancement method based on visual information

Publications (1)

Publication Number Publication Date
CN115019772A true CN115019772A (en) 2022-09-06

Family

ID=83073076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210636176.0A Pending CN115019772A (en) 2022-06-07 2022-06-07 Cantonese speech recognition enhancement method based on visual information

Country Status (1)

Country Link
CN (1) CN115019772A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862942A (en) * 2020-07-28 2020-10-30 苏州思必驰信息科技有限公司 Method and system for training mixed speech recognition model of Mandarin and Sichuan
US20210110831A1 (en) * 2018-05-18 2021-04-15 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN114299418A (en) * 2021-12-10 2022-04-08 湘潭大学 Guangdong language lip reading identification method, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210110831A1 (en) * 2018-05-18 2021-04-15 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN111862942A (en) * 2020-07-28 2020-10-30 苏州思必驰信息科技有限公司 Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN114299418A (en) * 2021-12-10 2022-04-08 湘潭大学 Guangdong language lip reading identification method, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN104777911B (en) A kind of intelligent interactive method based on holographic technique
CN110070065A (en) The sign language systems and the means of communication of view-based access control model and speech-sound intelligent
KR20220097118A (en) Mouth shape synthesis device and method using artificial neural network
Dreuw et al. SignSpeak-understanding, recognition, and translation of sign languages
CN108256458B (en) Bidirectional real-time translation system and method for deaf natural sign language
Bourbakis et al. Extracting and associating meta-features for understanding people’s emotional behaviour: face and speech
CN110210416B (en) Sign language recognition system optimization method and device based on dynamic pseudo tag decoding
CN109394258A (en) A kind of classification method, device and the terminal device of lung's breath sound
CN111967334B (en) Human body intention identification method, system and storage medium
CN106997243A (en) Speech scene monitoring method and device based on intelligent robot
CN109166409B (en) Sign language conversion method and device
CN112668407A (en) Face key point generation method and device, storage medium and electronic equipment
CN109919114A (en) One kind is based on the decoded video presentation method of complementary attention mechanism cyclic convolution
CN111126280A (en) Gesture recognition fusion-based aphasia patient auxiliary rehabilitation training system and method
CN114550057A (en) Video emotion recognition method based on multi-modal representation learning
CN104361787A (en) System and method for converting signals
CN111539408A (en) Intelligent point reading scheme based on photographing and object recognizing
CN117055724A (en) Generating type teaching resource system in virtual teaching scene and working method thereof
CN110096987B (en) Dual-path 3DCNN model-based mute action recognition method
CN115019772A (en) Cantonese speech recognition enhancement method based on visual information
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
CN116244473A (en) Multi-mode emotion recognition method based on feature decoupling and graph knowledge distillation
Shahira et al. Assistive technologies for visual, hearing, and speech impairments: Machine learning and deep learning solutions
CN114630190A (en) Joint posture parameter determining method, model training method and device
CN111832412B (en) Sounding training correction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination