CN110096966A - Speech recognition method based on a Chinese multi-modal corpus fused with depth information - Google Patents
Speech recognition method based on a Chinese multi-modal corpus fused with depth information
- Publication number
- CN110096966A (publication), CN201910284877.0A (application)
- Authority
- CN
- China
- Prior art keywords
- modal
- corpus
- data
- depth information
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Acoustics & Sound (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a speech recognition method based on a Chinese multi-modal corpus fused with depth information. The method comprises: incorporating depth information into a bimodal database by building a multi-modal synchronous data acquisition system around the second-generation Microsoft Kinect sensor, the system being used to capture color images and depth images of the speaker; collecting a small-scale corpus and producing, through automatic corpus selection, a phoneme-balanced corpus with a toneless syllable coverage of 78% and a biphone coverage of 93.3%; acquiring multi-modal data consisting of audio, color video, depth images and 3D information; preprocessing the acquired multi-modal data and extracting multi-modal features; and establishing a Chinese multi-modal corpus fused with depth information and performing multi-modal speech recognition on it. The invention addresses problems of domestic multi-modal database research, such as limited vocabulary and poor audio-visual quality, and mitigates the sensitivity of conventional two-dimensional images to illumination, speaker head rotation, occlusion and other factors.
Description
Technical field
The present invention relates to the field of multi-modal database construction, and in particular to the construction of a Chinese multi-modal corpus fused with depth information, to feature extraction fused with depth information, and to a speech recognition method based on such a multi-modal database.
Background art
A bimodal database contains both visual and acoustic signals. The structure of the acoustic signal is simple, and its scale depends on the sample rate and the total duration of the recordings; the visual information is more complicated, and its evaluation criteria are typically image clarity and frame rate. The design purpose of the corpus determines the number of speakers in the database and the selection of corpus material.
A standard audio-visual bimodal database is the indispensable data basis for bimodal speech recognition research. However, compared with the variety of audio-visual corpora available abroad, domestic research on bimodal databases is far from sufficient: the published Chinese bimodal databases suffer from limited vocabulary and poor audio-visual quality, Chinese multi-modal corpora remain limited to bimodal datasets consisting of audio and color video, and the quality of two-dimensional images is highly susceptible to illumination, speaker head rotation, occlusion and other factors. Moreover, foreign multi-modal corpora have been applied to identity verification, face recognition and various other research areas, whereas Chinese multi-modal corpora are restricted to audio-visual speech recognition research.
Among lip-movement feature extraction methods, approaches based on two-dimensional image information and approaches that fuse depth information are currently the two mainstream directions. Methods based on two-dimensional image information include pixel-based feature extraction and model-based feature extraction. Pixel-based methods can extract lip-movement features directly from the gray-level image of the entire lip region, or can compress the lip-region image and apply transformations to the processed image, such as the discrete wavelet transform, discrete cosine transform, linear discriminant analysis and principal component analysis, to generate lip-region feature vectors. Model-based methods mainly include geometric-feature methods and parametric-curve methods; geometric-feature methods use the height, width, perimeter and area of the mouth shape and the distances between key coordinate points as lip-region features.
The above feature extraction methods based on two-dimensional images are, to some degree, highly susceptible to illumination, speaker head rotation, occlusion and other factors. Moreover, given the large differences in people's articulation, traditional feature extraction methods cannot serve as a general approach that characterizes lip-movement information fully and effectively.
Recognition is the core stage of a speech recognition system. Earlier speech recognition approaches fall into four classes according to the underlying model: template matching, dynamic time warping (DTW), hidden Markov models (HMM) and artificial neural networks (ANN). In recent years deep learning has attracted wide attention; using image data of standard frontal faces, it has significantly improved the performance of multi-modal speech recognition systems. Network models based on convolutional neural networks (CNN) and long short-term memory networks (LSTM) have also been used to build multi-modal speech recognition systems.
Summary of the invention
The present invention provides a speech recognition method based on a Chinese multi-modal corpus fused with depth information. It addresses problems of domestic multi-modal database research, such as limited vocabulary and poor audio-visual quality, and mitigates the sensitivity of conventional two-dimensional images to illumination, speaker head rotation, occlusion and other factors, as described below:
A speech recognition method based on a Chinese multi-modal corpus fused with depth information, the method comprising the following steps:
incorporating depth information into a bimodal database by building a multi-modal synchronous data acquisition system with the second-generation Microsoft Kinect sensor, the system being used to capture color images and depth images of the speaker;
collecting a small-scale corpus and producing, through automatic corpus selection, a phoneme-balanced corpus with a toneless syllable coverage of 78% and a biphone coverage of 93.3%; acquiring multi-modal data consisting of audio, color video, depth images and 3D information;
preprocessing the acquired multi-modal data and extracting multi-modal features, establishing a Chinese multi-modal corpus fused with depth information, and performing multi-modal speech recognition.
The multi-modal synchronous data acquisition system comprises: a microphone array, a color camera, an infrared projector, an infrared camera and a USB bus assembly.
Further, the data preprocessing comprises data segmentation, speech annotation and database storage.
Data segmentation: the audio files, color images, depth images and 3D data points captured by the data acquisition system all carry millisecond-precision timestamps from the time of acquisition; the merged color image sequence, depth image sequence and depth data are each cut into the corresponding sentence-level sets according to the recording time of each audio file, thereby synchronizing the multi-modal data.
Speech annotation: a forced-alignment tool for speech labeling is used to annotate the speech automatically at the phone level.
The extraction of multi-modal features specifically comprises color image feature extraction and depth information feature extraction:
the pre-trained face keypoint detector and face recognition model in the Dlib machine learning library are used to obtain facial feature points;
a multi-modal corpus is established and multi-modal speech recognition experiments based on the HTK toolkit are carried out.
The beneficial effects of the technical solution provided by the present invention are:
1. The present invention uses a multi-modal synchronous data acquisition system developed around the Kinect sensor to pre-acquire a compact multi-modal corpus fused with depth information and, taking this corpus as the data basis, carries out research on two-dimensional image feature extraction and on lip-region feature extraction based on depth information, as well as multi-modal speech recognition experiments based on the HTK toolkit, thereby providing a benchmark for multi-modal speech recognition research on this database.
2. The present invention designs a phoneme-balanced Chinese corpus. In a professional recording studio, multi-modal data were collected from 69 speakers reading a total of 10,074 phoneme-balanced utterances, establishing the first Chinese multi-modal corpus that incorporates depth information and providing a data basis for more researchers engaged in Chinese multi-modal speech recognition research.
3. The present invention analyzes speech recognition performance under different acoustic environments. Bubble noise at signal-to-noise ratios of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB and 20 dB was added to the audio; Fig. 3 gives the word-level speech recognition accuracy of the audio at each signal-to-noise ratio under each acoustic model. The data in Fig. 3 show, on the one hand, that recognition accuracy decreases continuously as the signal-to-noise ratio decreases, and that recognition becomes very poor once the signal-to-noise ratio falls below zero. On the other hand, the design of the acoustic model also has a particularly important influence on the recognition result: the triphone acoustic model characterizes the phonetic phenomena in the speech stream better than the monophone acoustic model, and the DNN-HMM acoustic model not only accelerates model training but also improves its accuracy, showing that deep neural networks are a milestone in the development of speech recognition.
Brief description of the drawings
Fig. 1 is a flowchart of a speech recognition method based on a Chinese multi-modal corpus fused with depth information;
Fig. 2 is a schematic diagram of the data acquisition;
Fig. 3 is a schematic diagram of speech recognition accuracy for audio at different signal-to-noise ratios.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below.
Embodiment 1
An embodiment of the invention provides a speech recognition method that establishes a Chinese multi-modal corpus fused with depth information. Referring to Fig. 1, the method comprises the following steps:
101: incorporating depth information into the bimodal database by developing a multi-modal synchronous data acquisition system around the second-generation Microsoft Kinect sensor; the Kinect sensor mainly comprises five key components — a microphone array, a color camera, an infrared projector, an infrared camera and a USB bus assembly — and the synchronous acquisition system is used to capture the color images and depth images of the speaker;
102: pre-collecting a small-scale corpus; multi-modal experiments carried out on this corpus demonstrate that depth information is very helpful for speech recognition;
103: designing an automatic corpus selection algorithm and producing a phoneme-balanced corpus with a toneless syllable coverage of 78% and a biphone coverage of 93.3%;
104: in a professional recording setting, as shown in Fig. 2, acquiring from 69 speakers multi-modal data comprising audio, color video, depth images and 3D information;
105: establishing a Chinese multi-modal corpus fused with depth information, with a total duration of 22.4 hours and a total storage footprint of 6 TB, and performing speech recognition on it.
In one embodiment, step 101 obtains depth images with the Kinect sensor and develops a multi-modal synchronous data acquisition system, as follows:
An important research field of the Microsoft Kinect sensor is computer vision, and depth information is the key point of this work. The first-generation Kinect sensor uses structured-light technology to obtain the depth information of a scene, whereas the depth camera of the second-generation Kinect sensor implements an entirely different time-of-flight algorithm. Face feature model reconstruction is then carried out using the current state-of-the-art HD Face library, which not only detects faces quickly from the color and depth images, but can also build a three-dimensional face mesh model in real time from 1,347 predefined facial feature points.
In one embodiment, step 102 pre-collects a small-scale corpus on the basis of step 101, with the following specific steps:
A simple data acquisition environment was set up with one Microsoft Kinect v2 sensor and one desktop computer. The corpus consists of 100 non-repeating digit strings composed of the digits one to ten. Two volunteers with fluent, accent-free pronunciation (one female, one male) were selected and asked to read the corpus once each at normal speaking speed, so the final multi-modal database contains 100 spoken texts from each of the two speakers. Multi-modal experiments carried out on this corpus demonstrate that depth information is very helpful for speech recognition.
In one embodiment, step 103 designs an automatic corpus selection algorithm and produces a phoneme-balanced corpus with a toneless syllable coverage of 78% and a biphone coverage of 93.3%, with the following specific steps:
The corpus is the data basis of speech recognition training, and its selection is crucial for model training; the design principles of a corpus also differ slightly with the semantic task. Chinese characters are organized by syllables, and each syllable is composed of an initial and a final, which distinguishes Chinese from western languages.
Based on this knowledge of corpus design, the embodiment of the present invention devises an autonomous corpus selection algorithm suitable for Mandarin continuous speech recognition systems. The algorithm comprehensively considers the coverage of syllables, tones, biphone models and triphone models by the corpus, and screens corpus texts that satisfy the conditions with a greedy strategy driven by an evaluation function. The final corpus design uniformly covers 78% of the toneless syllables and 93.3% of the biphones, covering most phonetic phenomena with a relatively small amount of text.
In one embodiment, step 104 carries out data acquisition, with the following specific steps:
The database was recorded by 69 graduate students of Tianjin University in the professional recording studio of Tianjin University; the equipment used for data acquisition comprised a Sony voice recorder and a second-generation Microsoft Kinect sensor.
In one embodiment, step 105 finally establishes the database, with the following specific steps:
The acquired data are preprocessed, comprising data segmentation, speech annotation and database storage.
Data segmentation: the audio files, color images, depth images and 3D data points captured by the data acquisition system all carry millisecond-precision timestamps from the time of acquisition. RAW files whose names start with "audio" store the audio, PNG files starting with "color" store the color images, PNG files starting with "depth" store the depth images, CSV files starting with "depth" store the depth data points captured by the Kinect depth camera, and CSV files starting with "facePoints" store the three-dimensional face geometric model generated by Kinect. Based on the timestamps in the stored file names and the frame rates of the individual channels, a program can easily cut the merged color image sequence, depth image sequence and depth data into the corresponding sentence-level sets according to the recording time of each audio file, thereby synchronizing the multi-modal data.
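A minimal sketch of this timestamp-based segmentation is given below. It assumes frames are named with millisecond timestamps (e.g. color_<ms>.png) and that the start and end times of each utterance are known from the audio recording time; the directory layout, file names and helper names are illustrative, not taken from the patent.

```python
import glob
import os
import shutil

def parse_timestamp_ms(path):
    """Extract the millisecond timestamp embedded in a file name
    such as 'color_1554883200123.png' (assumed naming scheme)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return int(stem.split("_")[-1])

def cut_sentence(frame_dir, pattern, start_ms, end_ms, out_dir):
    """Copy all frames whose timestamp lies inside one utterance's
    recording interval into a sentence-level folder."""
    os.makedirs(out_dir, exist_ok=True)
    for path in sorted(glob.glob(os.path.join(frame_dir, pattern))):
        if start_ms <= parse_timestamp_ms(path) <= end_ms:
            shutil.copy(path, out_dir)

# Hypothetical usage: utterance 0007 was recorded between these two timestamps.
cut_sentence("session01", "color_*.png", 1554883200123, 1554883203456, "sent_0007/color")
cut_sentence("session01", "depth_*.png", 1554883200123, 1554883203456, "sent_0007/depth")
```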
Speech annotation: the method uses the Penn Phonetics Lab Forced Aligner (P2FA), a forced-alignment tool from the University of Pennsylvania, to annotate the speech automatically at the phone level.
Multi-modal feature extraction is then carried out, comprising color image feature extraction and depth information feature extraction. The method uses the pre-trained face keypoint detector and face recognition model in the Dlib machine learning library to obtain 68 facial feature points, of which 20 describe the lip region. Finally, the multi-modal corpus is established and multi-modal speech recognition experiments based on the HTK toolkit are carried out, providing a benchmark for multi-modal speech recognition research on this database.
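A minimal sketch of extracting the 68 facial landmarks (and the 20 lip points, Dlib indices 48-67) with the Dlib library is shown below; the predictor model file name refers to the standard publicly released Dlib model and is an assumption, since the patent does not name the file.

```python
import cv2
import dlib

# Standard Dlib models; the 68-point predictor file must be downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_landmarks(image_path):
    """Return the 20 lip landmarks (indices 48-67) of the first detected face, or None."""
    img = cv2.imread(image_path)
    if img is None:
        return None
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]

# points = lip_landmarks("sent_0007/color/color_1554883200123.png")  # hypothetical frame
```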
Embodiment 2
The solution of Embodiment 1 is further described below with reference to specific examples and formulas:
201: obtaining depth images with the Kinect sensor and developing a multi-modal synchronous data acquisition system.
Regarding the structure of the second-generation Kinect sensor, it mainly comprises the following five key components: a microphone array, a color camera, an infrared projector, an infrared camera and a USB bus assembly. The Kinect sensor supports up to six human bodies and adds joints for the neck, the (left and right) fingertips and the (left and right) thumbs, so that 25 joints can be collected, successfully capturing more complex and subtle gestures.
202: pre-collecting a small-scale corpus with the multi-modal synchronous data acquisition system;
203: designing the automatic corpus selection algorithm;
Before selecting corpus texts, the evaluation statistics tables must be defined:
1. a syllable statistics table (with and without tone), storing all syllables in the original corpus and the frequency with which each syllable occurs in the selected corpus;
2. a biphone statistics table, storing all biphones covered by the original corpus and their frequencies in the selected corpus;
3. a triphone statistics table, storing all triphones contained in the original corpus and their frequencies in the selected corpus;
4. a tone combination table, storing all tone combinations covered by the original corpus and their frequencies in the selected corpus.
Based on these statistics tables, the scoring procedure for a candidate text is as follows:
Let syScore denote the syllable score of the text and ESY the predefined weight of a syllable. The syllables of the text are traversed; if the frequency of a syllable in the syllable statistics table is zero, the syllable has not yet appeared in the selected corpus, and the text is assigned a larger syllable score:
syScore += ESY (1)
Otherwise, the text is assigned a relatively smaller syllable score. Assuming the frequency of the syllable in the syllable statistics table is count, i.e. the syllable has already appeared count times in the selected corpus, then:
syScore += 1/(count + 1)² (2)
The scoring of biphones, triphones and tone combinations is analogous, denoted bipScore, tripScore and toneScore respectively; the total score of the text is then computed as:
score = syScore + bipScore + tripScore + toneScore (3)
where score denotes the total score of the text. The scores of all texts are compared and, following a greedy strategy, the text with the highest score is added to the selected corpus. The syllable statistics table, biphone statistics table, triphone statistics table and tone combination table are then updated, and the above operations are repeated until a corpus that suits the experimental purpose and has a high phone coverage has been selected.
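A compact sketch of this greedy scoring loop is given below, assuming each candidate text has already been converted into its syllable, biphone, triphone and tone-combination units; the unit-extraction step and the weight value ESY are placeholders rather than values from the patent.

```python
from collections import Counter

ESY = 1.0  # assumed weight for an unseen unit; the patent only names the weight, not its value

def text_score(units, seen):
    """Score one candidate text: unseen units earn the full weight,
    already-covered units earn 1/(count + 1)**2, per equations (1)-(3)."""
    score = 0.0
    for u in units:
        c = seen[u]
        score += ESY if c == 0 else 1.0 / (c + 1) ** 2
    return score

def greedy_select(candidates, target_size):
    """candidates: list of unit lists (syllables, biphones, triphones, tones pooled per text)."""
    seen = Counter()
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < target_size:
        best = max(remaining, key=lambda i: text_score(candidates[i], seen))
        selected.append(best)
        seen.update(candidates[best])  # update the statistics tables
        remaining.remove(best)
    return selected

# Hypothetical toy usage with three candidate texts.
picked = greedy_select([["ni3", "ni3-hao3"], ["hao3"], ["ni3", "hao3"]], target_size=2)
```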
204: the biphone and toneless syllable coverages of the unscreened original corpus texts reach 87.6% and 92%, respectively. As the corpus scale is reduced, the coverage of toneless syllables and biphones in the selected corpus also decreases step by step, but the biphone coverage changes significantly less than the toneless syllable coverage.
With the automatic corpus selection algorithm, the embodiment of the present invention finally produced a phoneme-balanced corpus with a toneless syllable coverage of 78% and a biphone coverage of 93.3%.
205: carrying out large-scale multi-modal data acquisition;
206: preprocessing the acquired data, comprising data segmentation, speech annotation and database storage; then carrying out multi-modal feature extraction, comprising color image feature extraction and depth information feature extraction; and finally establishing a Chinese multi-modal corpus fused with depth information, with a total duration of 22.4 hours and a total storage footprint of 6 TB, and carrying out multi-modal speech recognition experiments based on the HTK toolkit.
Embodiment 3
The solutions of Embodiments 1 and 2 are further described below with reference to specific examples and calculation formulas:
During recording, unconscious head tilts and rotations are unavoidable. The three-dimensional mesh model generated in real time by the built-in SDK of the Kinect sensor makes it convenient to adjust the face model, avoiding the information loss and errors caused by face offsets. In the Kinect solid-space coordinate system, head rotation can be decomposed into left-right rotation in the XZ plane, horizontal offset in the XY plane, and up-down rotation in the YZ plane.
The preprocessing of the depth information is introduced below, taking as an example the adjustment of a face model rotated left or right about the Y axis:
The embodiment of the present invention takes the midpoint of the left and right mouth corners as the origin and applies a linear transformation to the 1,347 depth feature points of the face model, so that the left and right mouth corners obtain the same value along the Z axis, i.e. after normalization the two mouth corners lie at the same depth. In the figure, L and R denote the left and right mouth corners of the original face model, newL and newR denote the left and right mouth corners after normalization, w denotes the left-right rotation angle of the mouth corners, and l denotes the length of the line connecting the left and right mouth corners. Then:
L = [x_l, y_l, z_l] (4)
R = [x_r, y_r, z_r] (5)
newL = [newx_l, newy_l, newz_l] (6)
newR = [newx_r, newy_r, newz_r] (7)
w = arctan((z_r − z_l)/(x_r − x_l)) (8)
Let M denote the 3×3 transformation matrix that rotates the face model about the Y axis so that it faces the Kinect camera; it must satisfy the following conditions:
|newx_r − newx_l| = l (10)
newy_r = y_r (11)
newy_l = y_l (12)
newz_r = newz_l (13)
From these conditions the transformation matrix M can be computed (its explicit form is given as equation (14) of the original description), that is:
newR = M·R (15)
newL = M·L (16)
Similarly, by subsequently rotating the three-dimensional face model about the Z axis and about the X axis, the offset depth point cloud can be normalized to a position parallel to the camera plane.
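A small numpy sketch of the Y-axis step is shown below: it rotates the 1,347-point cloud about the Y axis through the mouth-corner midpoint by the angle w of equation (8), so that the two mouth corners end up at the same depth. The explicit rotation matrix plays the role of M in equation (14), whose printed form is not reproduced here, so the sign convention used is an assumption.

```python
import numpy as np

def normalize_yaw(points, left_corner, right_corner):
    """points: (N, 3) array of face-model depth points (e.g. N = 1347).
    Rotates the cloud about the Y axis through the mouth-corner midpoint so that
    both mouth corners share the same Z value, in the spirit of equations (4)-(16)."""
    xl, _, zl = left_corner
    xr, _, zr = right_corner
    w = np.arctan2(zr - zl, xr - xl)           # rotation angle, equation (8)
    origin = (np.asarray(left_corner) + np.asarray(right_corner)) / 2.0
    c, s = np.cos(w), np.sin(w)                # rotate by +w about Y (assumed sign)
    M = np.array([[ c, 0.0,   s],
                  [0.0, 1.0, 0.0],
                  [ -s, 0.0,   c]])
    return (points - origin) @ M.T + origin

# Hypothetical check: after normalization the two mouth corners have equal depth.
cloud = np.random.rand(1347, 3)
newL, newR = normalize_yaw(np.array([cloud[0], cloud[1]]), cloud[0], cloud[1])
assert np.isclose(newL[2], newR[2])
```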
The multi-modal isolated-word recognition experiments based on the pre-acquired corpus show that the single-modality recognition result based on depth information is very close to that based on two-dimensional images, with recognition rates of 72.27% and 69.91% respectively, while the recognition rate of the multi-modal isolated-word experiment reaches 93.68%. The higher recognition rate obtained from the depth information collected by the second-generation Kinect sensor means that it carries more complete facial feature information. A multi-modal data acquisition system designed around the Microsoft Kinect sensor therefore has great value for advancing multi-modal speech recognition research.
To analyze speech recognition performance under different acoustic environments, the embodiment of the present invention adds bubble noise at signal-to-noise ratios of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB and 20 dB to the audio; Fig. 3 gives the word-level speech recognition accuracy of the audio at each signal-to-noise ratio under each acoustic model. The data show, on the one hand, that recognition accuracy decreases continuously as the signal-to-noise ratio decreases, and that recognition becomes very poor once the signal-to-noise ratio falls below zero. On the other hand, the design of the acoustic model also has a particularly important influence on the recognition result: the triphone acoustic model characterizes the phonetic phenomena in the speech stream better than the monophone acoustic model, and the DNN-HMM acoustic model not only accelerates model training but also improves its accuracy, showing that deep neural networks are a milestone in the development of speech recognition.
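As an illustration of how noise can be mixed into the clean audio at a target signal-to-noise ratio, here is a short sketch; the patent does not specify the mixing procedure, so the energy-scaling formula below is a standard assumption rather than the method used in the experiments.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix a noise signal into clean speech at the requested SNR (in dB).
    Both signals are 1-D float arrays at the same sample rate."""
    noise = np.resize(noise, clean.shape)            # loop/trim the noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Hypothetical usage over the SNR grid used in the experiments.
for snr in (-5, 0, 5, 10, 15, 20):
    noisy = add_noise_at_snr(np.random.randn(16000), np.random.randn(8000), snr)
```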
A further statistical comparison shows the audio-visual speech recognition performance obtained by concatenating lip color image features or lip depth features of different dimensions with the audio features. AV15 denotes the audio-visual feature obtained by fusing the pure audio features with 15-dimensional lip color image features, ALip15 denotes the audio-visual feature obtained by fusing the pure audio features with 15-dimensional lip depth features, and the remaining configurations are named analogously. Note that the audio features here are always 13-dimensional MFCC features. The data show that under the GMM-HMM model, the lip depth features give slightly better speech recognition than the lip color image features, and 32-dimensional lip features outperform 15-dimensional lip features; under the DNN-HMM model the effect is reversed — the lip color image features give slightly better recognition than the lip depth features, and 15-dimensional lip features outperform 32-dimensional lip features.
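A minimal sketch of the early feature fusion implied by names such as AV15 (13-dimensional MFCCs concatenated frame by frame with a 15-dimensional lip feature) is shown below; the MFCC call uses the python_speech_features package as an assumed stand-in for the HTK front end, and the crude frame alignment is illustrative only.

```python
import numpy as np
from python_speech_features import mfcc  # assumed stand-in for the HTK MFCC front end

def fuse_av(audio_signal, sample_rate, lip_feats):
    """Frame-level early fusion: 13-dim MFCC concatenated with per-frame lip features
    (e.g. 15- or 32-dim lip color or lip depth features), as in AV15 / ALip15."""
    audio_feats = mfcc(audio_signal, samplerate=sample_rate, numcep=13)   # (T_a, 13)
    n = min(len(audio_feats), len(lip_feats))           # crude frame alignment
    return np.hstack([audio_feats[:n], lip_feats[:n]])  # (n, 13 + lip_dim)

# Hypothetical usage: one second of audio and 100 interpolated 15-dim lip frames.
fused = fuse_av(np.random.randn(16000), 16000, np.random.randn(100, 15))
```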
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment only, and that the serial numbers of the embodiments of the present invention are for description only and do not imply any ranking of the embodiments.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (4)
1. A speech recognition method based on a Chinese multi-modal corpus fused with depth information, characterized in that the method comprises the following steps:
incorporating depth information into a bimodal database by building a multi-modal synchronous data acquisition system with the second-generation Microsoft Kinect sensor, the system being used to capture color images and depth images of the speaker;
collecting a small-scale corpus and producing, through automatic corpus selection, a phoneme-balanced corpus with a toneless syllable coverage of 78% and a biphone coverage of 93.3%; acquiring multi-modal data consisting of audio, color video, depth images and 3D information;
preprocessing the acquired multi-modal data and extracting multi-modal features, establishing a Chinese multi-modal corpus fused with depth information, and performing multi-modal speech recognition.
2. The speech recognition method based on a Chinese multi-modal corpus fused with depth information according to claim 1, characterized in that the multi-modal synchronous data acquisition system comprises: a microphone array, a color camera, an infrared projector, an infrared camera and a USB bus assembly.
3. The speech recognition method based on a Chinese multi-modal corpus fused with depth information according to claim 1, characterized in that the data preprocessing comprises data segmentation, speech annotation and database storage;
data segmentation: the audio files, color images, depth images and 3D data points captured by the data acquisition system all carry millisecond-precision timestamps from the time of acquisition; the merged color image sequence, depth image sequence and depth data are each cut into the corresponding sentence-level sets according to the recording time of each audio file, thereby synchronizing the multi-modal data;
speech annotation: a forced-alignment tool for speech labeling is used to annotate the speech automatically at the phone level.
4. The speech recognition method based on a Chinese multi-modal corpus fused with depth information according to claim 1, characterized in that extracting multi-modal features specifically comprises color image feature extraction and depth information feature extraction:
the pre-trained face keypoint detector and face recognition model in the Dlib machine learning library are used to obtain facial feature points;
a multi-modal corpus is established and multi-modal speech recognition experiments based on the HTK toolkit are carried out.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910284877.0A CN110096966A (en) | 2019-04-10 | 2019-04-10 | A kind of audio recognition method merging the multi-modal corpus of depth information Chinese |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110096966A true CN110096966A (en) | 2019-08-06 |
Family
ID=67444603
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910284877.0A Pending CN110096966A (en) | 2019-04-10 | 2019-04-10 | A kind of audio recognition method merging the multi-modal corpus of depth information Chinese |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110096966A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101101752A (en) * | 2007-07-19 | 2008-01-09 | 华中科技大学 | Monosyllabic language lip-reading recognition system based on vision character |
Non-Patent Citations (1)
Title |
---|
J. Wang, L. Wang, J. Zhang, J. Wei, M. Yu, R. Yu: "A Large-Scale Depth-Based Multimodal Audio-Visual Corpus in Mandarin", 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI727395B (en) * | 2019-08-15 | 2021-05-11 | 亞東技術學院 | Language pronunciation learning system and method |
CN110909613A (en) * | 2019-10-28 | 2020-03-24 | Oppo广东移动通信有限公司 | Video character recognition method and device, storage medium and electronic equipment |
CN110909613B (en) * | 2019-10-28 | 2024-05-31 | Oppo广东移动通信有限公司 | Video character recognition method and device, storage medium and electronic equipment |
US11899765B2 (en) | 2019-12-23 | 2024-02-13 | Dts Inc. | Dual-factor identification system and method with adaptive enrollment |
CN111462733A (en) * | 2020-03-31 | 2020-07-28 | 科大讯飞股份有限公司 | Multi-modal speech recognition model training method, device, equipment and storage medium |
CN111462733B (en) * | 2020-03-31 | 2024-04-16 | 科大讯飞股份有限公司 | Multi-modal speech recognition model training method, device, equipment and storage medium |
CN111554279A (en) * | 2020-04-27 | 2020-08-18 | 天津大学 | Multi-mode man-machine interaction system based on Kinect |
EP4191579A4 (en) * | 2020-08-14 | 2024-05-08 | Huawei Technologies Co., Ltd. | Electronic device and speech recognition method therefor, and medium |
CN111933120A (en) * | 2020-08-19 | 2020-11-13 | 潍坊医学院 | Voice data automatic labeling method and system for voice recognition |
CN114615450A (en) * | 2020-12-08 | 2022-06-10 | 中国科学院深圳先进技术研究院 | Multi-mode pronunciation data acquisition method and system |
CN114615450B (en) * | 2020-12-08 | 2023-02-17 | 中国科学院深圳先进技术研究院 | Multi-mode pronunciation data acquisition method and system |
CN112863538A (en) * | 2021-02-24 | 2021-05-28 | 复旦大学 | Audio-visual network-based multi-modal voice separation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110096966A (en) | A kind of audio recognition method merging the multi-modal corpus of depth information Chinese | |
Czyzewski et al. | An audio-visual corpus for multimodal automatic speech recognition | |
Fernandez-Lopez et al. | Survey on automatic lip-reading in the era of deep learning | |
Makino et al. | Recurrent neural network transducer for audio-visual speech recognition | |
Anina et al. | Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis | |
Hazen et al. | A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments | |
Harte et al. | TCD-TIMIT: An audio-visual corpus of continuous speech | |
CN102779508B (en) | Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof | |
US7636662B2 (en) | System and method for audio-visual content synthesis | |
WO2018049979A1 (en) | Animation synthesis method and device | |
CN101359473A (en) | Auto speech conversion method and apparatus | |
JP2016029576A (en) | Computer generation head | |
CN105390133A (en) | Tibetan TTVS system realization method | |
Goecke et al. | The audio-video Australian English speech data corpus AVOZES | |
WO2023035969A1 (en) | Speech and image synchronization measurement method and apparatus, and model training method and apparatus | |
CN110570842B (en) | Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree | |
Liu et al. | A novel resynchronization procedure for hand-lips fusion applied to continuous french cued speech recognition | |
CN115312030A (en) | Display control method and device of virtual role and electronic equipment | |
Gimeno-Gómez et al. | LIP-RTVE: An audiovisual database for continuous Spanish in the wild | |
Taylor et al. | A mouth full of words: Visually consistent acoustic redubbing | |
Paleček | Experimenting with lipreading for large vocabulary continuous speech recognition | |
Chiţu¹ et al. | Automatic visual speech recognition | |
Karpov et al. | A framework for recording audio-visual speech corpora with a microphone and a high-speed camera | |
Karpov et al. | Designing a multimodal corpus of audio-visual speech using a high-speed camera | |
Narwekar et al. | PRAV: A Phonetically Rich Audio Visual Corpus. |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190806