CN110096966A - Speech recognition method based on a Chinese multi-modal corpus fused with depth information - Google Patents

Speech recognition method based on a Chinese multi-modal corpus fused with depth information

Info

Publication number
CN110096966A
Authority
CN
China
Prior art keywords
modal
corpus
data
depth information
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910284877.0A
Other languages
Chinese (zh)
Inventor
徐天一
张奕超
赵满坤
高洁
于健
于瑞国
喻梅
王丽媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201910284877.0A
Publication of CN110096966A
Legal status: Pending


Classifications

    • G06F 18/253 Pattern recognition; Analysing; Fusion techniques of extracted features
    • G06N 3/045 Neural networks; Architecture; Combinations of networks
    • G06N 3/08 Neural networks; Learning methods
    • G06V 10/462 Extraction of image or video features; Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 40/168 Human faces; Feature extraction; Face representation
    • G10L 15/02 Speech recognition; Feature extraction; Selection of recognition unit
    • G10L 15/063 Speech recognition; Training
    • G10L 15/25 Speech recognition using position of the lips, movement of the lips or face analysis
    • G10L 15/26 Speech to text systems
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech recognition method based on a Chinese multi-modal corpus fused with depth information. The method comprises: incorporating depth information into a bimodal database by building a multi-modal synchronous data acquisition system with the Microsoft second-generation Kinect sensor, the system capturing color images and depth images of the speaker; collecting a small-scale corpus, and producing, by automatic corpus selection, a phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%; collecting multi-modal data comprising audio, color video, depth images, and 3D information; preprocessing the collected multi-modal data and extracting multi-modal features; and building the Chinese multi-modal corpus fused with depth information and performing multi-modal speech recognition. The invention addresses problems of domestic multi-modal database research, such as limited vocabulary and poor audio-visual quality, and overcomes the susceptibility of conventional two-dimensional images to illumination, speaker head rotation, occlusion, and similar factors.

Description

Speech recognition method based on a Chinese multi-modal corpus fused with depth information
Technical field
The present invention relates to the field of multi-modal database construction, to the construction of a Chinese multi-modal corpus fused with depth information and to feature extraction fusing depth information, and in particular to a speech recognition method based on the multi-modal database.
Background art
A bimodal database contains both visual and acoustic signals. The acoustic signal is simple in structure, and its scale depends on the sampling rate and the total pronunciation duration; visual information is more complex, and its evaluation criteria are usually image clarity and frame rate. The design purpose of the corpus determines the number of speakers in the database and the selection of corpus material.
A standard audio-visual bimodal database is an essential data basis for research on bimodal speech recognition technology. However, compared with the diverse audio-visual corpora available abroad, domestic research on bimodal databases is far from sufficient: the published Chinese bimodal databases suffer from limited vocabulary and poor audio-visual quality, and Chinese multi-modal corpora remain at the stage of bimodal datasets composed of audio and color video, whose two-dimensional image quality is highly susceptible to illumination, speaker head rotation, occlusion, and similar factors. Moreover, foreign multi-modal corpora have been applied to a variety of research such as identity verification and face recognition, whereas Chinese multi-modal corpora are limited to audio-visual speech recognition research.
In the development of lip-motion feature extraction methods, approaches based on two-dimensional image information and approaches fusing depth information are the two current mainstream directions. Lip-motion feature extraction based on two-dimensional image information comprises pixel-based methods and model-based methods. Pixel-based methods can extract lip-motion features directly from the gray-level image of the entire lip region, or compress the lip-region image and apply transforms to the processed image, such as the discrete wavelet transform, discrete cosine transform, linear discriminant analysis, and principal component analysis, to generate lip-region feature vectors. Model-based methods mainly comprise geometric-feature methods and parametric-curve methods; geometric methods take the height, width, perimeter, and area of the mouth shape, together with the distances between key coordinate points, as lip-region features.
The above feature extraction methods based on two-dimensional images are, to some extent, highly susceptible to factors such as illumination, speaker head rotation, and occlusion. Given the great differences in individual mouth shapes, traditional feature extraction methods cannot serve as a general approach that characterizes lip-motion information fully and effectively.
Speech recognition is the core stage of a recognition system. Earlier speech recognition approaches are broadly divided by modeling technique into four classes: template matching, dynamic time warping (DTW), hidden Markov models (HMM), and artificial neural networks (ANN). In recent years, deep learning has attracted wide attention; using standard frontal-face image data, it has significantly improved the performance of multi-modal speech recognition systems. Network models based on convolutional neural networks (CNN) and long short-term memory networks (LSTM) have also achieved multi-modal speech recognition.
Summary of the invention
The present invention provides a speech recognition method based on a Chinese multi-modal corpus fused with depth information. It addresses problems of domestic multi-modal database research such as limited vocabulary and poor audio-visual quality, and overcomes the susceptibility of conventional two-dimensional image quality to illumination, speaker head rotation, occlusion, and similar factors, as described below:
A speech recognition method fusing depth information into a Chinese multi-modal corpus, the method comprising the following steps:
incorporating depth information into a bimodal database by building a multi-modal synchronous data acquisition system with the Microsoft second-generation Kinect sensor, the system being used to obtain color images and depth images of the speaker;
collecting a small-scale corpus, and producing, by automatic corpus selection, a phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%; collecting multi-modal data comprising audio, color video, depth images, and 3D information;
performing data preprocessing on the collected multi-modal data and extracting multi-modal features, building the Chinese multi-modal corpus fused with depth information, and performing multi-modal speech recognition.
The multi-modal synchronous data acquisition system comprises a microphone array, a color camera, an infrared projector, an infrared camera, and a USB bus assembly.
Further, the data preprocessing comprises data segmentation, voice annotation, and database storage;
data segmentation: each audio file, color image, depth image, and 3D data point captured by the data acquisition system carries a millisecond-accurate timestamp from acquisition time; the registered color image sequence, depth image sequence, and depth data are each cut, according to the recording time of each audio utterance, into corresponding sentence-level sets, synchronizing the multi-modal data;
voice annotation: phone-level automatic labeling of the speech using a forced alignment tool.
The multi-modal feature extraction specifically comprises color image feature extraction and depth information feature extraction;
facial feature points are obtained using the pre-trained face keypoint detector and face recognition model in the Dlib machine learning library;
the multi-modal corpus is built and multi-modal speech recognition experiments based on the HTK toolkit are conducted.
The beneficial effects of the technical solution provided by the present invention are:
1. The present invention uses the multi-modal synchronous data acquisition system developed around the Kinect sensor to pre-collect a compact multi-modal corpus fused with depth information, and on this data basis conducts research on two-dimensional image feature extraction, research on depth-information-based lip-region feature extraction, and multi-modal speech recognition experiments based on the HTK toolkit, providing benchmark results for multi-modal speech recognition research based on this database;
2. The present invention designs a phoneme-balanced Chinese corpus and, in a professional studio environment, collects the multi-modal data of 69 speakers and 10,074 phoneme-balanced utterances in total, establishing the first Chinese multi-modal corpus that incorporates depth information and providing a data basis for researchers engaged in Chinese multi-modal speech recognition;
3. The present invention analyzes speech recognition performance in different acoustic environments by adding bubble noise to the audio at signal-to-noise ratios of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. Fig. 3 gives the word-level recognition accuracy of audio at each signal-to-noise ratio under each acoustic model. The data in Fig. 3 show that, on the one hand, recognition accuracy declines steadily as the signal-to-noise ratio decreases, and recognition becomes very poor once the signal-to-noise ratio falls below zero; on the other hand, the design of the acoustic model strongly affects recognition results: a triphone acoustic model characterizes coarticulation in the speech stream better than a monophone model, and the DNN-HMM acoustic model both accelerates model training and improves its accuracy, showing that deep neural networks are a milestone in the development of speech recognition.
Description of the drawings
Fig. 1 is a flow chart of the speech recognition method fusing depth information into a Chinese multi-modal corpus;
Fig. 2 is a schematic diagram of the data acquisition;
Fig. 3 is a schematic diagram of the speech recognition accuracy of audio at different signal-to-noise ratios.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below.
Embodiment 1
An embodiment of the present invention provides a speech recognition method built on a Chinese multi-modal corpus fused with depth information. Referring to Fig. 1, the method comprises the following steps:
101: incorporate depth information into the bimodal database by developing a multi-modal synchronous data acquisition system based on the Microsoft second-generation Kinect sensor. The Kinect comprises five key components: a microphone array, a color camera, an infrared projector, an infrared camera, and a USB bus assembly. Through the synchronous acquisition system, color images and depth images of the speaker are obtained;
102: collect a small-scale corpus in advance; multi-modal experiments carried out on this corpus demonstrate that depth information is very helpful for speech recognition;
103: design an automatic corpus selection algorithm and produce a phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%;
104: in a professional recording environment, as shown in Fig. 2, collect the multi-modal data of 69 speakers, comprising audio, color video, depth images, and 3D information;
105: build the Chinese multi-modal corpus fused with depth information, with a total duration of 22.4 hours and total storage of 6 TB, and perform speech recognition.
In one embodiment, step 101 obtains depth images with the Kinect sensor and develops a multi-modal synchronous data acquisition system, as follows:
An important research area of the Microsoft Kinect sensor is computer vision, in which depth information is the key point of this research. The first-generation Kinect sensor used structured-light technology to obtain depth information in the scene, while the depth camera of the second-generation Kinect sensor implements an entirely different time-of-flight algorithm. Facial model reconstruction then uses the current state-of-the-art HD Face library, which not only detects faces quickly from color and depth images but can also establish a 3D face mesh model in real time from 1,347 predefined facial feature points.
In one embodiment, step 102 collects a small-scale corpus in advance on the basis of step 101; the specific steps are as follows:
A simple data acquisition environment was built from a Microsoft Kinect v2 sensor and a desktop computer. The corpus consists of 100 non-repeating digit-string sequences composed of the digits one through ten. Two volunteers with fluent, accent-free pronunciation (one female, one male) were selected, each reading the corpus once at normal speed, so the final database contains the multi-modal data of both speakers uttering each of the 100 texts. Multi-modal experiments carried out on this corpus demonstrate that depth information is very helpful for speech recognition.
In one embodiment, step 103 designs the automatic corpus selection algorithm and produces the phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%; the specific steps are as follows:
The corpus is the data basis of speech recognition training, and its selection is crucial to model training. The design principles of a corpus also vary slightly with the semantic task. Chinese takes the syllable as its unit, with every syllable composed of an initial and a final, which distinguishes it from Western languages.
Based on the principles of corpus design, the embodiment of the present invention designs an autonomous corpus selection algorithm suited to Mandarin continuous speech recognition. The algorithm jointly considers the corpus's coverage of syllables, tones, biphone models, and triphone models, and screens texts that satisfy the conditions with a greedy strategy according to an evaluation function. The final corpus uniformly covers 78% of the toneless syllables and 93.3% of the biphones, covering most phonetic phenomena with relatively few texts.
In one embodiment, step 104 performs the data acquisition; the specific steps are as follows:
The database was recorded by 69 graduate students of Tianjin University in the university's professional recording studio; the acquisition equipment comprises a Sony voice recorder and a second-generation Microsoft Kinect sensor.
In one embodiment, step 105 finally builds the database; the specific steps are as follows:
The collected data are preprocessed through data segmentation, voice annotation, and database storage.
Data segmentation: every audio file, color image, depth image, and 3D data point captured by the data acquisition system carries a millisecond-accurate timestamp from the moment of acquisition. RAW files whose names begin with "audio" hold the audio; PNG files beginning with "color" hold the color images; PNG files beginning with "depth" hold the depth images; CSV files beginning with "depth" hold the depth points captured by the Kinect depth camera; and CSV files beginning with "facePoints" hold the 3D face geometry models generated by the Kinect. Using the timestamps in the stored file names and the frame rate of each data channel, a program can easily cut the registered color image sequence, depth image sequence, and depth data into corresponding sentence-level sets according to the recording time of each audio utterance, synchronizing the multi-modal data.
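The segmentation described above is straightforward to script. Below is a minimal Python sketch, assuming a file layout in which each frame's millisecond timestamp is embedded in its name (e.g. color_1554879123456.png) and per-utterance time ranges taken from the audio recording times; the patent specifies only the prefixes and millisecond-accurate timestamps, so the parsing details are illustrative.

    # Sketch: cut per-frame files into sentence-level sets by timestamp.
    from pathlib import Path
    from collections import defaultdict

    def parse_stamp_ms(path: Path) -> int:
        """Assume names like color_1554879123456.png, i.e. prefix_stampMs.ext."""
        return int(path.stem.split("_")[-1])

    def segment_session(session_dir: str, utterances: list[tuple[int, int]]) -> dict:
        """Cut color/depth/facePoints frames into sentence-level sets.

        utterances: list of (start_ms, end_ms) taken from the recording time
        of each audio file, as produced by the acquisition system.
        """
        sets = defaultdict(lambda: defaultdict(list))  # utt_idx -> channel -> files
        for f in Path(session_dir).iterdir():
            channel = f.stem.split("_")[0]             # "color", "depth", "facePoints"
            if channel == "audio":
                continue                               # audio is already one file per utterance
            t = parse_stamp_ms(f)
            for i, (start, end) in enumerate(utterances):
                if start <= t <= end:                  # frame falls inside this utterance
                    sets[i][channel].append(f)
                    break
        return sets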
Voice annotation: the method uses the Penn Phonetics Lab Forced Aligner (P2FA) to perform phone-level automatic labeling of the speech.
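For reference, a hedged sketch of driving P2FA from Python follows; the align.py entry point and its wav/transcript/output argument order follow the standard P2FA distribution, but both are assumptions and should be checked against the local copy's usage string before use.

    # Sketch: run P2FA forced alignment for one utterance.
    import subprocess

    def align_utterance(wav_path: str, transcript_path: str, textgrid_out: str) -> None:
        # P2FA writes phone- and word-level tiers into the output TextGrid.
        subprocess.run(
            ["python", "align.py", wav_path, transcript_path, textgrid_out],
            check=True,  # raise if alignment fails for this utterance
        )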
Multi-modal feature extraction follows, comprising color image feature extraction and depth information feature extraction. The method uses the pre-trained face keypoint detector and face recognition model in the Dlib machine learning library to obtain 68 facial feature points, of which 20 describe the lip region. Finally, the multi-modal corpus is built and multi-modal speech recognition experiments based on the HTK toolkit are conducted, providing benchmark results for multi-modal speech recognition research based on this database.
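The lip-region step can be sketched with the real Dlib API as follows; the predictor file name is the usual one for Dlib's pre-trained 68-point model, and indices 48 to 67 are the 20 mouth points in that standard layout.

    # Sketch: extract the 20 lip landmarks from a color frame with Dlib.
    import dlib
    import cv2

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def lip_landmarks(image_path: str) -> list[tuple[int, int]]:
        img = cv2.imread(image_path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            return []
        shape = predictor(gray, faces[0])  # 68 facial feature points
        # indices 48-67 are the mouth region (20 points)
        return [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]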
Embodiment 2
The solution of Embodiment 1 is further described below with reference to specific examples and formulas:
201: obtain depth images with the Kinect sensor and develop a multi-modal synchronous data acquisition system;
Regarding the structure of the second-generation Kinect sensor: it mainly comprises five key components: a microphone array, a color camera, an infrared projector, an infrared camera, and a USB bus assembly. The Kinect sensor supports up to six human bodies, adds neck, (left and right) fingertip, and (left and right) thumb joints for a total of 25 collectable joints, and successfully resolves more complex and subtler gesture actions.
202: collect a small-scale corpus in advance with the multi-modal synchronous data acquisition system;
203: design the automatic corpus selection algorithm;
Before selecting texts, the following evaluation tables are predefined:
1. A syllable statistics table (tonal and toneless), storing all syllables in the original corpus and the frequency with which each syllable occurs in the selected corpus;
2. A biphone statistics table, storing all biphones covered in the original corpus and their frequencies of occurrence in the selected corpus;
3. A triphone statistics table, storing all triphones contained in the original corpus and their frequencies of occurrence in the selected corpus;
4. A tone combination table, storing all tone combinations covered in the original corpus and their frequencies of occurrence in the selected corpus.
Based on the eight-layer structure obtained from the concordance, a text is scored as follows:
Let syScore denote the syllable score of a text and ESY the predefined syllable weight. Traversing the syllables in the text: if a syllable's frequency in the syllable statistics table is zero, the syllable has not yet appeared in the selected corpus, so the text is given a larger syllable score:

syScore += ESY    (1)

Otherwise the text is given a relatively small syllable score. Assuming the syllable's frequency in the syllable statistics table is count, i.e., the syllable has already appeared count times in the selected corpus, then:

syScore += 1/(count + 1)²    (2)
The scoring of biphones, triphones, and tone combinations proceeds analogously, producing bipScore, tripScore, and toneScore respectively; the total score of the text is then computed as

score = syScore + bipScore + tripScore + toneScore    (3)
where score is the total score of the text. The scores of all texts are compared, and the greedy algorithm adds the text with the maximum score to the selected corpus. The syllable, biphone, triphone, and tone combination statistics tables are then updated, and the above operations are repeated until a corpus that suits the experimental purpose and has high phone coverage is selected. A sketch of this selection loop is given below.
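The loop implied by equations (1) to (3) can be sketched in Python as follows; the per-text unit inventories and the weight ESY are illustrative stand-ins for the patent's concordance tables, while the scoring and table-update logic follow the description above.

    # Sketch: greedy corpus selection driven by equations (1)-(3).
    from collections import Counter

    E_SY = 1.0  # predefined weight for an unseen unit, per equation (1)

    def text_score(units: list[str], table: Counter) -> float:
        score = 0.0
        for u in units:
            count = table[u]
            score += E_SY if count == 0 else 1.0 / (count + 1) ** 2  # eqs (1)-(2)
        return score

    def select_corpus(texts: dict[str, dict[str, list[str]]], n_select: int) -> list[str]:
        """texts maps a text id to its units per level:
        {"syl": [...], "bip": [...], "trip": [...], "tone": [...]}."""
        tables = {lvl: Counter() for lvl in ("syl", "bip", "trip", "tone")}
        selected, remaining = [], dict(texts)
        for _ in range(min(n_select, len(texts))):
            # total score = syScore + bipScore + tripScore + toneScore, eq. (3)
            best = max(remaining, key=lambda t: sum(
                text_score(remaining[t][lvl], tables[lvl]) for lvl in tables))
            selected.append(best)
            for lvl in tables:  # update the four statistics tables
                tables[lvl].update(remaining[best][lvl])
            del remaining[best]
        return selected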
204: the biphone and toneless syllable coverage of the unscreened original texts reach 87.6% and 92% respectively. As the corpus scale shrinks, the selected corpus's coverage of toneless syllables and biphones also decreases step by step, and the degradation of biphone coverage is markedly smaller than that of toneless syllable coverage.
With the automatic corpus selection algorithm, the embodiment of the present invention finally produced a phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%.
205: perform large-scale multi-modal data acquisition;
206: preprocess the collected data (data segmentation, voice annotation, and database storage), then perform multi-modal feature extraction (color image feature extraction and depth information feature extraction), finally building the Chinese multi-modal corpus fused with depth information, with a total duration of 22.4 hours and total storage of 6 TB, and conduct multi-modal speech recognition experiments based on the HTK toolkit.
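A hedged sketch of one HTK recognition-and-scoring pass follows; HVite and HResults are real HTK tools and the flags shown are standard usage, but the model, network, dictionary, and list file names are placeholders for this database's setup, not names from the patent.

    # Sketch: decode a test set with HVite and score it with HResults.
    import subprocess

    def run_htk_recognition() -> None:
        # Viterbi decoding against the trained HMM set and word network.
        subprocess.run(["HVite", "-H", "hmmdefs", "-S", "test.scp",
                        "-i", "recout.mlf", "-w", "wdnet",
                        "dict", "tiedlist"], check=True)
        # Compare the recognition MLF against the reference labels.
        subprocess.run(["HResults", "-I", "ref.mlf", "tiedlist", "recout.mlf"],
                       check=True)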
Embodiment 3
The solutions of Embodiments 1 and 2 are further described below with reference to specific examples and calculation formulas:
During recording, unconscious head tilt and rotation are unavoidable, and the 3D mesh model generated in real time by the Kinect sensor's built-in SDK conveniently allows the invention to adjust the face model, avoiding the information loss and errors caused by face offset. In the Kinect solid-space coordinate system, head rotation can be decomposed into left-right rotation in the XZ plane, horizontal offset in the XY plane, and up-down rotation in the YZ plane.
The depth-information preprocessing step is introduced below, taking as an example the adjustment of a face model rotated left or right about the Y axis:
With the midpoint of the left and right mouth corners as the origin, the embodiment applies a linear transformation to the 1,347 depth feature points of the face model so that the left and right mouth corners obtain the same value along the Z axis, i.e., after normalization the two mouth corners lie at the same depth. Let L and R denote the left and right mouth corners of the original face model, newL and newR the normalized mouth corners, w the left-right rotation angle of the mouth corners, and l the length of the line connecting the two mouth corners. Then:
L = [xl, yl, zl]    (4)
R = [xr, yr, zr]    (5)
newL = [newxl, newyl, newzl]    (6)
newR = [newxr, newyr, newzr]    (7)
w = tan⁻¹((zr − zl)/(xr − xl))    (8)
Let M denote the 3 × 3 transformation matrix that rotates the face model about the Y axis to face the Kinect camera; it must satisfy the following conditions:
|newxr − newxl| = l    (10)
newyr = yr    (11)
newyl = yl    (12)
newzr = newzl    (13)
By calculation, the transformation matrix M is obtained as the rotation about the Y axis through the angle w:

M = [ cos w   0   sin w
      0       1   0
     −sin w   0   cos w ]    (14)

that is:

newR = MR    (15)
newL = ML    (16)
Similarly, subsequent rotations of the 3D face model about the Z axis and about the X axis regularize the offset depth point cloud to a position horizontal to the camera plane.
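The Y-axis normalization of equations (4) to (16) reduces to a few lines of NumPy, sketched below; the sign convention of the rotation is an assumption chosen to satisfy equation (13), and the mouth-corner indices are placeholders for the model's actual point layout.

    # Sketch: rotate the 1,347-point face model about the Y axis so that both
    # mouth corners reach the same depth (eq. (13)).
    import numpy as np

    def normalize_yaw(points: np.ndarray, left_idx: int, right_idx: int) -> np.ndarray:
        """points: (1347, 3) array of face-model coordinates."""
        L, R = points[left_idx], points[right_idx]
        origin = (L + R) / 2.0                    # midpoint of the mouth corners
        w = np.arctan2(R[2] - L[2], R[0] - L[0])  # eq. (8); arctan2 avoids xr == xl
        c, s = np.cos(w), np.sin(w)
        M = np.array([[c, 0.0, s],                # rotation about the Y axis, eq. (14)
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
        # newP = M (P - origin) + origin, applying eqs. (15)-(16) to every point
        return (points - origin) @ M.T + origin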
The multi-modal isolated-word recognition experiments on the pre-collected corpus show that the unimodal recognition result based on depth information is very close to that based on two-dimensional images, with recognition rates of 72.27% and 69.91% respectively, while the multi-modal isolated-word recognition rate reaches 93.68%. The higher recognition rate of the depth information collected by the second-generation Kinect sensor means that it carries more complete facial feature information. The multi-modal data acquisition system designed around the Microsoft Kinect sensor is therefore of great value in advancing multi-modal speech recognition research.
To analyze speech recognition performance in different acoustic environments, the embodiment of the present invention adds bubble noise to the audio at signal-to-noise ratios of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. Fig. 3 gives the word-level recognition accuracy of audio at each signal-to-noise ratio under each acoustic model. The data show that, on the one hand, recognition accuracy declines steadily as the signal-to-noise ratio decreases, and recognition becomes very poor once the signal-to-noise ratio falls below zero; on the other hand, the design of the acoustic model strongly affects recognition results: a triphone acoustic model characterizes coarticulation in the speech stream better than a monophone model, and the DNN-HMM acoustic model both accelerates model training and improves its accuracy, showing that deep neural networks are a milestone in the development of speech recognition.
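The noise-mixing step can be reproduced with the usual power-ratio scaling, sketched below under the assumption that the bubble-noise clip is simply looped or trimmed to the utterance length; the patent does not specify the mixing procedure itself.

    # Sketch: add noise to clean audio at a target SNR in dB.
    import numpy as np

    def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        noise = np.resize(noise, clean.shape)  # loop/trim the noise to length
        p_clean = np.mean(clean.astype(np.float64) ** 2)
        p_noise = np.mean(noise.astype(np.float64) ** 2)
        # choose scale so that 10*log10(p_clean / (scale^2 * p_noise)) == snr_db
        scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise

    # e.g. the six conditions of Fig. 3:
    # for snr in (-5, 0, 5, 10, 15, 20): noisy = mix_at_snr(clean, bubble, snr)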
A further comparison covers audio-visual speech recognition after splicing audio features with lip color image features and lip depth features of different dimensions. AV15 denotes the audio-visual feature obtained by fusing pure audio with 15-dimensional lip color image features; ALip15 denotes pure audio fused with 15-dimensional lip depth features; the remaining labels follow the same pattern. The audio features here are all 13-dimensional MFCC features. The data show that under the GMM-HMM model, lip depth features give slightly better recognition than lip color image features, and 32-dimensional lip features outperform 15-dimensional ones; under the DNN-HMM model the effect is reversed: lip color image features give slightly better recognition than lip depth features, and 15-dimensional lip features outperform 32-dimensional ones.
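The AV15/ALip15 splices amount to frame-synchronous concatenation, sketched below; truncating to the shorter stream stands in for the frame-rate alignment between audio and video features (audio frames typically outnumber video frames), which the patent does not detail.

    # Sketch: splice 13-dim MFCCs with a 15- or 32-dim lip feature stream.
    import numpy as np

    def splice(mfcc: np.ndarray, lip: np.ndarray) -> np.ndarray:
        """mfcc: (T1, 13), lip: (T2, 15 or 32) -> (min(T1, T2), 13 + lip_dim)."""
        n = min(len(mfcc), len(lip))
        return np.hstack([mfcc[:n], lip[:n]])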
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the above embodiments are for description only and do not indicate relative merit.
The foregoing describes merely preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (4)

1. A speech recognition method fusing depth information into a Chinese multi-modal corpus, characterized in that the method comprises the following steps:
incorporating depth information into a bimodal database by building a multi-modal synchronous data acquisition system with the Microsoft second-generation Kinect sensor, the system being used to obtain color images and depth images of the speaker;
collecting a small-scale corpus, and producing, by automatic corpus selection, a phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%; collecting multi-modal data comprising audio, color video, depth images, and 3D information;
performing data preprocessing on the collected multi-modal data and extracting multi-modal features, building the Chinese multi-modal corpus fused with depth information, and performing multi-modal speech recognition.
2. The speech recognition method fusing depth information into a Chinese multi-modal corpus according to claim 1, characterized in that the multi-modal synchronous data acquisition system comprises a microphone array, a color camera, an infrared projector, an infrared camera, and a USB bus assembly.
3. The speech recognition method fusing depth information into a Chinese multi-modal corpus according to claim 1, characterized in that the data preprocessing comprises data segmentation, voice annotation, and database storage;
data segmentation: each audio file, color image, depth image, and 3D data point captured by the data acquisition system carries a millisecond-accurate timestamp from acquisition time; the registered color image sequence, depth image sequence, and depth data are each cut, according to the recording time of each audio utterance, into corresponding sentence-level sets, synchronizing the multi-modal data;
voice annotation: phone-level automatic labeling of the speech using a forced alignment tool.
4. The speech recognition method fusing depth information into a Chinese multi-modal corpus according to claim 1, characterized in that the multi-modal feature extraction specifically comprises color image feature extraction and depth information feature extraction;
facial feature points are obtained using the pre-trained face keypoint detector and face recognition model in the Dlib machine learning library;
the multi-modal corpus is built and multi-modal speech recognition experiments based on the HTK toolkit are conducted.
CN201910284877.0A 2019-04-10 2019-04-10 Speech recognition method based on a Chinese multi-modal corpus fused with depth information Pending CN110096966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910284877.0A CN110096966A (en) 2019-04-10 2019-04-10 Speech recognition method based on a Chinese multi-modal corpus fused with depth information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910284877.0A CN110096966A (en) 2019-04-10 2019-04-10 Speech recognition method based on a Chinese multi-modal corpus fused with depth information

Publications (1)

Publication Number Publication Date
CN110096966A (en) 2019-08-06

Family

ID=67444603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910284877.0A 2019-04-10 2019-04-10 Speech recognition method based on a Chinese multi-modal corpus fused with depth information

Country Status (1)

Country Link
CN (1) CN110096966A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909613A (en) * 2019-10-28 2020-03-24 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
CN111462733A (en) * 2020-03-31 2020-07-28 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium
CN111554279A (en) * 2020-04-27 2020-08-18 天津大学 Multi-mode man-machine interaction system based on Kinect
CN111933120A (en) * 2020-08-19 2020-11-13 潍坊医学院 Voice data automatic labeling method and system for voice recognition
TWI727395B (en) * 2019-08-15 2021-05-11 亞東技術學院 Language pronunciation learning system and method
CN112863538A (en) * 2021-02-24 2021-05-28 复旦大学 Audio-visual network-based multi-modal voice separation method and device
CN114615450A (en) * 2020-12-08 2022-06-10 中国科学院深圳先进技术研究院 Multi-mode pronunciation data acquisition method and system
US11899765B2 (en) 2019-12-23 2024-02-13 Dts Inc. Dual-factor identification system and method with adaptive enrollment
EP4191579A4 (en) * 2020-08-14 2024-05-08 Huawei Technologies Co., Ltd. Electronic device and speech recognition method therefor, and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
J. Wang, L. Wang, J. Zhang, J. Wei, M. Yu, R. Yu, "A Large-Scale Depth-Based Multimodal Audio-Visual Corpus in Mandarin," 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS).

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI727395B (en) * 2019-08-15 2021-05-11 亞東技術學院 Language pronunciation learning system and method
CN110909613A (en) * 2019-10-28 2020-03-24 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
CN110909613B (en) * 2019-10-28 2024-05-31 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
US11899765B2 (en) 2019-12-23 2024-02-13 Dts Inc. Dual-factor identification system and method with adaptive enrollment
CN111462733A (en) * 2020-03-31 2020-07-28 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium
CN111462733B (en) * 2020-03-31 2024-04-16 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium
CN111554279A (en) * 2020-04-27 2020-08-18 天津大学 Multi-mode man-machine interaction system based on Kinect
EP4191579A4 (en) * 2020-08-14 2024-05-08 Huawei Technologies Co., Ltd. Electronic device and speech recognition method therefor, and medium
CN111933120A (en) * 2020-08-19 2020-11-13 潍坊医学院 Voice data automatic labeling method and system for voice recognition
CN114615450A (en) * 2020-12-08 2022-06-10 中国科学院深圳先进技术研究院 Multi-mode pronunciation data acquisition method and system
CN114615450B (en) * 2020-12-08 2023-02-17 中国科学院深圳先进技术研究院 Multi-mode pronunciation data acquisition method and system
CN112863538A (en) * 2021-02-24 2021-05-28 复旦大学 Audio-visual network-based multi-modal voice separation method and device

Similar Documents

Publication Publication Date Title
CN110096966A (en) Speech recognition method based on a Chinese multi-modal corpus fused with depth information
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
Fernandez-Lopez et al. Survey on automatic lip-reading in the era of deep learning
Makino et al. Recurrent neural network transducer for audio-visual speech recognition
Anina et al. Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis
Hazen et al. A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments
Harte et al. TCD-TIMIT: An audio-visual corpus of continuous speech
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
US7636662B2 (en) System and method for audio-visual content synthesis
WO2018049979A1 (en) Animation synthesis method and device
CN101359473A (en) Auto speech conversion method and apparatus
JP2016029576A (en) Computer generation head
CN105390133A (en) Tibetan TTVS system realization method
Goecke et al. The audio-video Australian English speech data corpus AVOZES
WO2023035969A1 (en) Speech and image synchronization measurement method and apparatus, and model training method and apparatus
CN110570842B (en) Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree
Liu et al. A novel resynchronization procedure for hand-lips fusion applied to continuous french cued speech recognition
CN115312030A (en) Display control method and device of virtual role and electronic equipment
Gimeno-Gómez et al. LIP-RTVE: An audiovisual database for continuous Spanish in the wild
Taylor et al. A mouth full of words: Visually consistent acoustic redubbing
Paleček Experimenting with lipreading for large vocabulary continuous speech recognition
Chiţu¹ et al. Automatic visual speech recognition
Karpov et al. A framework for recording audio-visual speech corpora with a microphone and a high-speed camera
Karpov et al. Designing a multimodal corpus of audio-visual speech using a high-speed camera
Narwekar et al. PRAV: A Phonetically Rich Audio Visual Corpus.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190806

RJ01 Rejection of invention patent application after publication