CN110096966A - Speech recognition method based on a Chinese multi-modal corpus fused with depth information - Google Patents

Speech recognition method based on a Chinese multi-modal corpus fused with depth information

Info

Publication number
CN110096966A
Authority
CN
China
Prior art keywords
modal
corpus
data
depth information
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910284877.0A
Other languages
Chinese (zh)
Inventor
徐天一
张奕超
赵满坤
高洁
于健
于瑞国
喻梅
王丽媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201910284877.0A
Publication of CN110096966A
Legal status: Pending


Classifications

    • G06F 18/253 Pattern recognition; Analysing; Fusion techniques of extracted features
    • G06N 3/045 Neural networks; Architecture; Combinations of networks
    • G06N 3/08 Neural networks; Learning methods
    • G06V 10/462 Extraction of image or video features; Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 40/168 Human faces; Feature extraction; Face representation
    • G10L 15/02 Speech recognition; Feature extraction; Selection of recognition unit
    • G10L 15/063 Speech recognition; Training
    • G10L 15/25 Speech recognition using position of the lips, movement of the lips or face analysis
    • G10L 15/26 Speech to text systems
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech recognition method based on a Chinese multi-modal corpus fused with depth information. The method comprises: incorporating depth information into a bimodal database by building a multi-modal synchronous data acquisition system with the Microsoft second-generation Kinect sensor, the system capturing color images and depth images of the speaker; collecting a small-scale corpus, and producing, by automatic corpus selection, a phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%; collecting multi-modal data comprising audio, color video, depth images, and 3D information; preprocessing the collected multi-modal data and extracting multi-modal features; and building the Chinese multi-modal corpus fused with depth information and performing multi-modal speech recognition. The invention addresses problems of domestic multi-modal database research, such as limited vocabulary and poor audio-visual quality, and overcomes the susceptibility of conventional two-dimensional images to illumination, speaker head rotation, occlusion, and similar factors.

Description

Speech recognition method based on a Chinese multi-modal corpus fused with depth information
Technical field
The present invention relates to the field of multi-modal database construction, to the construction of a Chinese multi-modal corpus fused with depth information and to feature extraction fusing depth information, and in particular to a speech recognition method based on the multi-modal database.
Background art
A bimodal database contains both visual and acoustic signals. The acoustic signal is simple in structure, and its scale depends on the sampling rate and the total pronunciation duration; visual information is more complex, and its evaluation criteria are usually image clarity and frame rate. The design purpose of the corpus determines the number of speakers in the database and the selection of corpus material.
A standard audio-visual bimodal database is an essential data basis for research on bimodal speech recognition technology. However, compared with the diverse audio-visual corpora available abroad, domestic research on bimodal databases is far from sufficient: the published Chinese bimodal databases suffer from limited vocabulary and poor audio-visual quality, and Chinese multi-modal corpora remain at the stage of bimodal datasets composed of audio and color video, whose two-dimensional image quality is highly susceptible to illumination, speaker head rotation, occlusion, and similar factors. Moreover, foreign multi-modal corpora have been applied to a variety of research such as identity verification and face recognition, whereas Chinese multi-modal corpora are limited to audio-visual speech recognition research.
In the development of lip-motion feature extraction methods, approaches based on two-dimensional image information and approaches fusing depth information are the two current mainstream directions. Lip-motion feature extraction based on two-dimensional image information comprises pixel-based methods and model-based methods. Pixel-based methods can extract lip-motion features directly from the gray-level image of the entire lip region, or compress the lip-region image and apply transforms to the processed image, such as the discrete wavelet transform, discrete cosine transform, linear discriminant analysis, and principal component analysis, to generate lip-region feature vectors. Model-based methods mainly comprise geometric-feature methods and parametric-curve methods; geometric methods take the height, width, perimeter, and area of the mouth shape, together with the distances between key coordinate points, as lip-region features.
The above feature extraction methods based on two-dimensional images are, to some extent, highly susceptible to factors such as illumination, speaker head rotation, and occlusion. Given the great differences in individual mouth shapes, traditional feature extraction methods cannot serve as a general approach that characterizes lip-motion information fully and effectively.
Speech recognition is the core stage of a recognition system. Earlier speech recognition approaches are broadly divided by modeling technique into four classes: template matching, dynamic time warping (DTW), hidden Markov models (HMM), and artificial neural networks (ANN). In recent years, deep learning has attracted wide attention; using standard frontal-face image data, it has significantly improved the performance of multi-modal speech recognition systems. Network models based on convolutional neural networks (CNN) and long short-term memory networks (LSTM) have also achieved multi-modal speech recognition.
Summary of the invention
The present invention provides a speech recognition method based on a Chinese multi-modal corpus fused with depth information. It addresses problems of domestic multi-modal database research such as limited vocabulary and poor audio-visual quality, and overcomes the susceptibility of conventional two-dimensional image quality to illumination, speaker head rotation, occlusion, and similar factors, as described below:
A speech recognition method fusing depth information into a Chinese multi-modal corpus, the method comprising the following steps:
incorporating depth information into a bimodal database by building a multi-modal synchronous data acquisition system with the Microsoft second-generation Kinect sensor, the system being used to obtain color images and depth images of the speaker;
collecting a small-scale corpus, and producing, by automatic corpus selection, a phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%; collecting multi-modal data comprising audio, color video, depth images, and 3D information;
performing data preprocessing on the collected multi-modal data and extracting multi-modal features, building the Chinese multi-modal corpus fused with depth information, and performing multi-modal speech recognition.
The multi-modal synchronous data acquisition system comprises a microphone array, a color camera, an infrared projector, an infrared camera, and a USB bus assembly.
Further, the data preprocessing comprises data segmentation, voice annotation, and database storage;
data segmentation: each audio file, color image, depth image, and 3D data point captured by the data acquisition system carries a millisecond-accurate timestamp from acquisition time; the registered color image sequence, depth image sequence, and depth data are each cut, according to the recording time of each audio utterance, into corresponding sentence-level sets, synchronizing the multi-modal data;
voice annotation: phone-level automatic labeling of the speech using a forced alignment tool.
The multi-modal feature extraction specifically comprises color image feature extraction and depth information feature extraction;
facial feature points are obtained using the pre-trained face keypoint detector and face recognition model in the Dlib machine learning library;
the multi-modal corpus is built and multi-modal speech recognition experiments based on the HTK toolkit are conducted.
The beneficial effects of the technical solution provided by the present invention are:
1. The present invention uses the multi-modal synchronous data acquisition system developed around the Kinect sensor to pre-collect a compact multi-modal corpus fused with depth information, and on this data basis conducts research on two-dimensional image feature extraction, research on depth-information-based lip-region feature extraction, and multi-modal speech recognition experiments based on the HTK toolkit, providing benchmark results for multi-modal speech recognition research based on this database;
2. The present invention designs a phoneme-balanced Chinese corpus and, in a professional studio environment, collects the multi-modal data of 69 speakers and 10,074 phoneme-balanced utterances in total, establishing the first Chinese multi-modal corpus that incorporates depth information and providing a data basis for researchers engaged in Chinese multi-modal speech recognition;
3. The present invention analyzes speech recognition performance in different acoustic environments by adding bubble noise to the audio at signal-to-noise ratios of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. Fig. 3 gives the word-level recognition accuracy of audio at each signal-to-noise ratio under each acoustic model. The data in Fig. 3 show that, on the one hand, recognition accuracy declines steadily as the signal-to-noise ratio decreases, and recognition becomes very poor once the signal-to-noise ratio falls below zero; on the other hand, the design of the acoustic model strongly affects recognition results: a triphone acoustic model characterizes coarticulation in the speech stream better than a monophone model, and the DNN-HMM acoustic model both accelerates model training and improves its accuracy, showing that deep neural networks are a milestone in the development of speech recognition.
Description of the drawings
Fig. 1 is a flow chart of the speech recognition method fusing depth information into a Chinese multi-modal corpus;
Fig. 2 is a schematic diagram of the data acquisition;
Fig. 3 is a schematic diagram of the speech recognition accuracy of audio at different signal-to-noise ratios.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below.
Embodiment 1
An embodiment of the present invention provides a speech recognition method built on a Chinese multi-modal corpus fused with depth information. Referring to Fig. 1, the method comprises the following steps:
101: incorporate depth information into the bimodal database by developing a multi-modal synchronous data acquisition system based on the Microsoft second-generation Kinect sensor. The Kinect comprises five key components: a microphone array, a color camera, an infrared projector, an infrared camera, and a USB bus assembly. Through the synchronous acquisition system, color images and depth images of the speaker are obtained;
102: collect a small-scale corpus in advance; multi-modal experiments carried out on this corpus demonstrate that depth information is very helpful for speech recognition;
103: design an automatic corpus selection algorithm and produce a phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%;
104: in a professional recording environment, as shown in Fig. 2, collect the multi-modal data of 69 speakers, comprising audio, color video, depth images, and 3D information;
105: build the Chinese multi-modal corpus fused with depth information, with a total duration of 22.4 hours and total storage of 6 TB, and perform speech recognition.
In one embodiment, step 101 obtains depth images with the Kinect sensor and develops a multi-modal synchronous data acquisition system, as follows:
An important research area of the Microsoft Kinect sensor is computer vision, in which depth information is the key point of this research. The first-generation Kinect sensor used structured-light technology to obtain depth information in the scene, while the depth camera of the second-generation Kinect sensor implements an entirely different time-of-flight algorithm. Facial model reconstruction then uses the current state-of-the-art HD Face library, which not only detects faces quickly from color and depth images but can also establish a 3D face mesh model in real time from 1,347 predefined facial feature points.
In one embodiment, step 102 collects a small-scale corpus in advance on the basis of step 101; the specific steps are as follows:
A simple data acquisition environment was built from a Microsoft Kinect v2 sensor and a desktop computer. The corpus consists of 100 non-repeating digit-string sequences composed of the digits one through ten. Two volunteers with fluent, accent-free pronunciation (one female, one male) were selected, each reading the corpus once at normal speed, so the final database contains the multi-modal data of both speakers uttering each of the 100 texts. Multi-modal experiments carried out on this corpus demonstrate that depth information is very helpful for speech recognition.
In one embodiment, step 103 designs the automatic corpus selection algorithm and produces the phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%; the specific steps are as follows:
The corpus is the data basis of speech recognition training, and its selection is crucial to model training. The design principles of a corpus also vary slightly with the semantic task. Chinese takes the syllable as its unit, with every syllable composed of an initial and a final, which distinguishes it from Western languages.
Based on the principles of corpus design, the embodiment of the present invention designs an autonomous corpus selection algorithm suited to Mandarin continuous speech recognition. The algorithm jointly considers the corpus's coverage of syllables, tones, biphone models, and triphone models, and screens texts that satisfy the conditions with a greedy strategy according to an evaluation function. The final corpus uniformly covers 78% of the toneless syllables and 93.3% of the biphones, covering most phonetic phenomena with relatively few texts.
In one embodiment, step 104 performs the data acquisition; the specific steps are as follows:
The database was recorded by 69 graduate students of Tianjin University in the university's professional recording studio; the acquisition equipment comprises a Sony voice recorder and a second-generation Microsoft Kinect sensor.
In one embodiment, step 105 finally builds the database; the specific steps are as follows:
The collected data are preprocessed through data segmentation, voice annotation, and database storage.
Data segmentation: every audio file, color image, depth image, and 3D data point captured by the data acquisition system carries a millisecond-accurate timestamp from the moment of acquisition. RAW files whose names begin with "audio" hold the audio; PNG files beginning with "color" hold the color images; PNG files beginning with "depth" hold the depth images; CSV files beginning with "depth" hold the depth points captured by the Kinect depth camera; and CSV files beginning with "facePoints" hold the 3D face geometry models generated by the Kinect. Using the timestamps in the stored file names and the frame rate of each data channel, a program can easily cut the registered color image sequence, depth image sequence, and depth data into corresponding sentence-level sets according to the recording time of each audio utterance, synchronizing the multi-modal data.
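The segmentation described above is straightforward to script. Below is a minimal Python sketch, assuming a file layout in which each frame's millisecond timestamp is embedded in its name (e.g. color_1554879123456.png) and per-utterance time ranges taken from the audio recording times; the patent specifies only the prefixes and millisecond-accurate timestamps, so the parsing details are illustrative.

    # Sketch: cut per-frame files into sentence-level sets by timestamp.
    from pathlib import Path
    from collections import defaultdict

    def parse_stamp_ms(path: Path) -> int:
        """Assume names like color_1554879123456.png, i.e. prefix_stampMs.ext."""
        return int(path.stem.split("_")[-1])

    def segment_session(session_dir: str, utterances: list[tuple[int, int]]) -> dict:
        """Cut color/depth/facePoints frames into sentence-level sets.

        utterances: list of (start_ms, end_ms) taken from the recording time
        of each audio file, as produced by the acquisition system.
        """
        sets = defaultdict(lambda: defaultdict(list))  # utt_idx -> channel -> files
        for f in Path(session_dir).iterdir():
            channel = f.stem.split("_")[0]             # "color", "depth", "facePoints"
            if channel == "audio":
                continue                               # audio is already one file per utterance
            t = parse_stamp_ms(f)
            for i, (start, end) in enumerate(utterances):
                if start <= t <= end:                  # frame falls inside this utterance
                    sets[i][channel].append(f)
                    break
        return sets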
Voice annotation: the method uses the Penn Phonetics Lab Forced Aligner (P2FA) to perform phone-level automatic labeling of the speech.
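For reference, a hedged sketch of driving P2FA from Python follows; the align.py entry point and its wav/transcript/output argument order follow the standard P2FA distribution, but both are assumptions and should be checked against the local copy's usage string before use.

    # Sketch: run P2FA forced alignment for one utterance.
    import subprocess

    def align_utterance(wav_path: str, transcript_path: str, textgrid_out: str) -> None:
        # P2FA writes phone- and word-level tiers into the output TextGrid.
        subprocess.run(
            ["python", "align.py", wav_path, transcript_path, textgrid_out],
            check=True,  # raise if alignment fails for this utterance
        )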
Multi-modal feature extraction follows, comprising color image feature extraction and depth information feature extraction. The method uses the pre-trained face keypoint detector and face recognition model in the Dlib machine learning library to obtain 68 facial feature points, of which 20 describe the lip region. Finally, the multi-modal corpus is built and multi-modal speech recognition experiments based on the HTK toolkit are conducted, providing benchmark results for multi-modal speech recognition research based on this database.
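The lip-region step can be sketched with the real Dlib API as follows; the predictor file name is the usual one for Dlib's pre-trained 68-point model, and indices 48 to 67 are the 20 mouth points in that standard layout.

    # Sketch: extract the 20 lip landmarks from a color frame with Dlib.
    import dlib
    import cv2

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def lip_landmarks(image_path: str) -> list[tuple[int, int]]:
        img = cv2.imread(image_path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            return []
        shape = predictor(gray, faces[0])  # 68 facial feature points
        # indices 48-67 are the mouth region (20 points)
        return [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]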
Embodiment 2
The solution of Embodiment 1 is further described below with reference to specific examples and formulas:
201: obtain depth images with the Kinect sensor and develop a multi-modal synchronous data acquisition system;
Regarding the structure of the second-generation Kinect sensor: it mainly comprises five key components: a microphone array, a color camera, an infrared projector, an infrared camera, and a USB bus assembly. The Kinect sensor supports up to six human bodies, adds neck, (left and right) fingertip, and (left and right) thumb joints for a total of 25 collectable joints, and successfully resolves more complex and subtler gesture actions.
202: collect a small-scale corpus in advance with the multi-modal synchronous data acquisition system;
203: design the automatic corpus selection algorithm;
Before selecting texts, the following evaluation tables are predefined:
1. A syllable statistics table (tonal and toneless), storing all syllables in the original corpus and the frequency with which each syllable occurs in the selected corpus;
2. A biphone statistics table, storing all biphones covered in the original corpus and their frequencies of occurrence in the selected corpus;
3. A triphone statistics table, storing all triphones contained in the original corpus and their frequencies of occurrence in the selected corpus;
4. A tone combination table, storing all tone combinations covered in the original corpus and their frequencies of occurrence in the selected corpus.
Based on the eight-layer structure obtained from the concordance, a text is scored as follows:
Let syScore denote the syllable score of a text and ESY the predefined syllable weight. Traversing the syllables in the text: if a syllable's frequency in the syllable statistics table is zero, the syllable has not yet appeared in the selected corpus, so the text is given a larger syllable score:

syScore += ESY    (1)

Otherwise the text is given a relatively small syllable score. Assuming the syllable's frequency in the syllable statistics table is count, i.e., the syllable has already appeared count times in the selected corpus, then:

syScore += 1/(count + 1)²    (2)
The scoring of biphones, triphones, and tone combinations proceeds analogously, producing bipScore, tripScore, and toneScore respectively; the total score of the text is then computed as

score = syScore + bipScore + tripScore + toneScore    (3)
where score is the total score of the text. The scores of all texts are compared, and the greedy algorithm adds the text with the maximum score to the selected corpus. The syllable, biphone, triphone, and tone combination statistics tables are then updated, and the above operations are repeated until a corpus that suits the experimental purpose and has high phone coverage is selected. A sketch of this selection loop is given below.
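The loop implied by equations (1) to (3) can be sketched in Python as follows; the per-text unit inventories and the weight ESY are illustrative stand-ins for the patent's concordance tables, while the scoring and table-update logic follow the description above.

    # Sketch: greedy corpus selection driven by equations (1)-(3).
    from collections import Counter

    E_SY = 1.0  # predefined weight for an unseen unit, per equation (1)

    def text_score(units: list[str], table: Counter) -> float:
        score = 0.0
        for u in units:
            count = table[u]
            score += E_SY if count == 0 else 1.0 / (count + 1) ** 2  # eqs (1)-(2)
        return score

    def select_corpus(texts: dict[str, dict[str, list[str]]], n_select: int) -> list[str]:
        """texts maps a text id to its units per level:
        {"syl": [...], "bip": [...], "trip": [...], "tone": [...]}."""
        tables = {lvl: Counter() for lvl in ("syl", "bip", "trip", "tone")}
        selected, remaining = [], dict(texts)
        for _ in range(min(n_select, len(texts))):
            # total score = syScore + bipScore + tripScore + toneScore, eq. (3)
            best = max(remaining, key=lambda t: sum(
                text_score(remaining[t][lvl], tables[lvl]) for lvl in tables))
            selected.append(best)
            for lvl in tables:  # update the four statistics tables
                tables[lvl].update(remaining[best][lvl])
            del remaining[best]
        return selected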
204: the biphone and toneless syllable coverage of the unscreened original texts reach 87.6% and 92% respectively. As the corpus scale shrinks, the selected corpus's coverage of toneless syllables and biphones also decreases step by step, and the degradation of biphone coverage is markedly smaller than that of toneless syllable coverage.
With the automatic corpus selection algorithm, the embodiment of the present invention finally produced a phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%.
205: perform large-scale multi-modal data acquisition;
206: preprocess the collected data (data segmentation, voice annotation, and database storage), then perform multi-modal feature extraction (color image feature extraction and depth information feature extraction), finally building the Chinese multi-modal corpus fused with depth information, with a total duration of 22.4 hours and total storage of 6 TB, and conduct multi-modal speech recognition experiments based on the HTK toolkit.
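A hedged sketch of one HTK recognition-and-scoring pass follows; HVite and HResults are real HTK tools and the flags shown are standard usage, but the model, network, dictionary, and list file names are placeholders for this database's setup, not names from the patent.

    # Sketch: decode a test set with HVite and score it with HResults.
    import subprocess

    def run_htk_recognition() -> None:
        # Viterbi decoding against the trained HMM set and word network.
        subprocess.run(["HVite", "-H", "hmmdefs", "-S", "test.scp",
                        "-i", "recout.mlf", "-w", "wdnet",
                        "dict", "tiedlist"], check=True)
        # Compare the recognition MLF against the reference labels.
        subprocess.run(["HResults", "-I", "ref.mlf", "tiedlist", "recout.mlf"],
                       check=True)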
Embodiment 3
The solutions of Embodiments 1 and 2 are further described below with reference to specific examples and calculation formulas:
During recording, unconscious head tilt and rotation are unavoidable, and the 3D mesh model generated in real time by the Kinect sensor's built-in SDK conveniently allows the invention to adjust the face model, avoiding the information loss and errors caused by face offset. In the Kinect solid-space coordinate system, head rotation can be decomposed into left-right rotation in the XZ plane, horizontal offset in the XY plane, and up-down rotation in the YZ plane.
The depth-information preprocessing step is introduced below, taking as an example the adjustment of a face model rotated left or right about the Y axis:
With the midpoint of the left and right mouth corners as the origin, the embodiment applies a linear transformation to the 1,347 depth feature points of the face model so that the left and right mouth corners obtain the same value along the Z axis, i.e., after normalization the two mouth corners lie at the same depth. Let L and R denote the left and right mouth corners of the original face model, newL and newR the normalized mouth corners, w the left-right rotation angle of the mouth corners, and l the length of the line connecting the two mouth corners. Then:
L = [xl, yl, zl]    (4)
R = [xr, yr, zr]    (5)
newL = [newxl, newyl, newzl]    (6)
newR = [newxr, newyr, newzr]    (7)
w = tan⁻¹((zr − zl)/(xr − xl))    (8)
Let M denote the 3 × 3 transformation matrix that rotates the face model about the Y axis to face the Kinect camera; it must satisfy the following conditions:
|newxr − newxl| = l    (10)
newyr = yr    (11)
newyl = yl    (12)
newzr = newzl    (13)
By calculation, the transformation matrix M is obtained as the rotation about the Y axis through the angle w:

M = [ cos w   0   sin w
      0       1   0
     −sin w   0   cos w ]    (14)

that is:

newR = MR    (15)
newL = ML    (16)
Similarly, subsequent rotations of the 3D face model about the Z axis and about the X axis regularize the offset depth point cloud to a position horizontal to the camera plane.
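The Y-axis normalization of equations (4) to (16) reduces to a few lines of NumPy, sketched below; the sign convention of the rotation is an assumption chosen to satisfy equation (13), and the mouth-corner indices are placeholders for the model's actual point layout.

    # Sketch: rotate the 1,347-point face model about the Y axis so that both
    # mouth corners reach the same depth (eq. (13)).
    import numpy as np

    def normalize_yaw(points: np.ndarray, left_idx: int, right_idx: int) -> np.ndarray:
        """points: (1347, 3) array of face-model coordinates."""
        L, R = points[left_idx], points[right_idx]
        origin = (L + R) / 2.0                    # midpoint of the mouth corners
        w = np.arctan2(R[2] - L[2], R[0] - L[0])  # eq. (8); arctan2 avoids xr == xl
        c, s = np.cos(w), np.sin(w)
        M = np.array([[c, 0.0, s],                # rotation about the Y axis, eq. (14)
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
        # newP = M (P - origin) + origin, applying eqs. (15)-(16) to every point
        return (points - origin) @ M.T + origin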
The multi-modal isolated-word recognition experiments on the pre-collected corpus show that the unimodal recognition result based on depth information is very close to that based on two-dimensional images, with recognition rates of 72.27% and 69.91% respectively, while the multi-modal isolated-word recognition rate reaches 93.68%. The higher recognition rate of the depth information collected by the second-generation Kinect sensor means that it carries more complete facial feature information. The multi-modal data acquisition system designed around the Microsoft Kinect sensor is therefore of great value in advancing multi-modal speech recognition research.
To analyze speech recognition performance in different acoustic environments, the embodiment of the present invention adds bubble noise to the audio at signal-to-noise ratios of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. Fig. 3 gives the word-level recognition accuracy of audio at each signal-to-noise ratio under each acoustic model. The data show that, on the one hand, recognition accuracy declines steadily as the signal-to-noise ratio decreases, and recognition becomes very poor once the signal-to-noise ratio falls below zero; on the other hand, the design of the acoustic model strongly affects recognition results: a triphone acoustic model characterizes coarticulation in the speech stream better than a monophone model, and the DNN-HMM acoustic model both accelerates model training and improves its accuracy, showing that deep neural networks are a milestone in the development of speech recognition.
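The noise-mixing step can be reproduced with the usual power-ratio scaling, sketched below under the assumption that the bubble-noise clip is simply looped or trimmed to the utterance length; the patent does not specify the mixing procedure itself.

    # Sketch: add noise to clean audio at a target SNR in dB.
    import numpy as np

    def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        noise = np.resize(noise, clean.shape)  # loop/trim the noise to length
        p_clean = np.mean(clean.astype(np.float64) ** 2)
        p_noise = np.mean(noise.astype(np.float64) ** 2)
        # choose scale so that 10*log10(p_clean / (scale^2 * p_noise)) == snr_db
        scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise

    # e.g. the six conditions of Fig. 3:
    # for snr in (-5, 0, 5, 10, 15, 20): noisy = mix_at_snr(clean, bubble, snr)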
A further comparison covers audio-visual speech recognition after splicing audio features with lip color image features and lip depth features of different dimensions. AV15 denotes the audio-visual feature obtained by fusing pure audio with 15-dimensional lip color image features; ALip15 denotes pure audio fused with 15-dimensional lip depth features; the remaining labels follow the same pattern. The audio features here are all 13-dimensional MFCC features. The data show that under the GMM-HMM model, lip depth features give slightly better recognition than lip color image features, and 32-dimensional lip features outperform 15-dimensional ones; under the DNN-HMM model the effect is reversed: lip color image features give slightly better recognition than lip depth features, and 15-dimensional lip features outperform 32-dimensional ones.
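The AV15/ALip15 splices amount to frame-synchronous concatenation, sketched below; truncating to the shorter stream stands in for the frame-rate alignment between audio and video features (audio frames typically outnumber video frames), which the patent does not detail.

    # Sketch: splice 13-dim MFCCs with a 15- or 32-dim lip feature stream.
    import numpy as np

    def splice(mfcc: np.ndarray, lip: np.ndarray) -> np.ndarray:
        """mfcc: (T1, 13), lip: (T2, 15 or 32) -> (min(T1, T2), 13 + lip_dim)."""
        n = min(len(mfcc), len(lip))
        return np.hstack([mfcc[:n], lip[:n]])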
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the above embodiments are for description only and do not indicate relative merit.
The foregoing describes merely preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (4)

1. A speech recognition method fusing depth information into a Chinese multi-modal corpus, characterized in that the method comprises the following steps:
incorporating depth information into a bimodal database by building a multi-modal synchronous data acquisition system with the Microsoft second-generation Kinect sensor, the system being used to obtain color images and depth images of the speaker;
collecting a small-scale corpus, and producing, by automatic corpus selection, a phoneme-balanced corpus with toneless syllable coverage of 78% and biphone coverage of 93.3%; collecting multi-modal data comprising audio, color video, depth images, and 3D information;
performing data preprocessing on the collected multi-modal data and extracting multi-modal features, building the Chinese multi-modal corpus fused with depth information, and performing multi-modal speech recognition.
2. The speech recognition method fusing depth information into a Chinese multi-modal corpus according to claim 1, characterized in that the multi-modal synchronous data acquisition system comprises a microphone array, a color camera, an infrared projector, an infrared camera, and a USB bus assembly.
3. The speech recognition method fusing depth information into a Chinese multi-modal corpus according to claim 1, characterized in that the data preprocessing comprises data segmentation, voice annotation, and database storage;
data segmentation: each audio file, color image, depth image, and 3D data point captured by the data acquisition system carries a millisecond-accurate timestamp from acquisition time; the registered color image sequence, depth image sequence, and depth data are each cut, according to the recording time of each audio utterance, into corresponding sentence-level sets, synchronizing the multi-modal data;
voice annotation: phone-level automatic labeling of the speech using a forced alignment tool.
4. The speech recognition method fusing depth information into a Chinese multi-modal corpus according to claim 1, characterized in that the multi-modal feature extraction specifically comprises color image feature extraction and depth information feature extraction;
facial feature points are obtained using the pre-trained face keypoint detector and face recognition model in the Dlib machine learning library;
the multi-modal corpus is built and multi-modal speech recognition experiments based on the HTK toolkit are conducted.
CN201910284877.0A 2019-04-10 2019-04-10 Speech recognition method based on a Chinese multi-modal corpus fused with depth information Pending CN110096966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910284877.0A CN110096966A (en) 2019-04-10 2019-04-10 Speech recognition method based on a Chinese multi-modal corpus fused with depth information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910284877.0A CN110096966A (en) 2019-04-10 2019-04-10 Speech recognition method based on a Chinese multi-modal corpus fused with depth information

Publications (1)

Publication Number Publication Date
CN110096966A (en) 2019-08-06

Family

ID=67444603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910284877.0A 2019-04-10 2019-04-10 Speech recognition method based on a Chinese multi-modal corpus fused with depth information

Country Status (1)

Country Link
CN (1) CN110096966A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909613A (en) * 2019-10-28 2020-03-24 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
CN111462733A (en) * 2020-03-31 2020-07-28 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium
CN111554279A (en) * 2020-04-27 2020-08-18 天津大学 Multi-mode man-machine interaction system based on Kinect
CN111933120A (en) * 2020-08-19 2020-11-13 潍坊医学院 Voice data automatic labeling method and system for voice recognition
TWI727395B (en) * 2019-08-15 2021-05-11 亞東技術學院 Language pronunciation learning system and method
CN112863538A (en) * 2021-02-24 2021-05-28 复旦大学 Audio-visual network-based multi-modal voice separation method and device
CN114615450A (en) * 2020-12-08 2022-06-10 中国科学院深圳先进技术研究院 Multi-mode pronunciation data acquisition method and system
US11899765B2 (en) 2019-12-23 2024-02-13 Dts Inc. Dual-factor identification system and method with adaptive enrollment
EP4191579A4 (en) * 2020-08-14 2024-05-08 Huawei Technologies Co., Ltd. Electronic device and speech recognition method therefor, and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
J. Wang, L. Wang, J. Zhang, J. Wei, M. Yu, R. Yu, "A Large-Scale Depth-Based Multimodal Audio-Visual Corpus in Mandarin," 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS).

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI727395B (en) * 2019-08-15 2021-05-11 亞東技術學院 Language pronunciation learning system and method
CN110909613A (en) * 2019-10-28 2020-03-24 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
CN110909613B (en) * 2019-10-28 2024-05-31 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
US11899765B2 (en) 2019-12-23 2024-02-13 Dts Inc. Dual-factor identification system and method with adaptive enrollment
CN111462733A (en) * 2020-03-31 2020-07-28 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium
CN111462733B (en) * 2020-03-31 2024-04-16 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium
CN111554279A (en) * 2020-04-27 2020-08-18 天津大学 Multi-mode man-machine interaction system based on Kinect
EP4191579A4 (en) * 2020-08-14 2024-05-08 Huawei Technologies Co., Ltd. Electronic device and speech recognition method therefor, and medium
CN111933120A (en) * 2020-08-19 2020-11-13 潍坊医学院 Voice data automatic labeling method and system for voice recognition
CN114615450A (en) * 2020-12-08 2022-06-10 中国科学院深圳先进技术研究院 Multi-mode pronunciation data acquisition method and system
CN114615450B (en) * 2020-12-08 2023-02-17 中国科学院深圳先进技术研究院 Multi-mode pronunciation data acquisition method and system
CN112863538A (en) * 2021-02-24 2021-05-28 复旦大学 Audio-visual network-based multi-modal voice separation method and device

Similar Documents

Publication Publication Date Title
CN110096966A (en) Speech recognition method based on a Chinese multi-modal corpus fused with depth information
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
Fernandez-Lopez et al. Survey on automatic lip-reading in the era of deep learning
Makino et al. Recurrent neural network transducer for audio-visual speech recognition
Anina et al. Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis
Hazen et al. A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments
Harte et al. TCD-TIMIT: An audio-visual corpus of continuous speech
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
US7636662B2 (en) System and method for audio-visual content synthesis
WO2018049979A1 (en) Animation synthesis method and device
CN101359473A (en) Auto speech conversion method and apparatus
JP2016029576A (en) Computer generation head
CN105390133A (en) Tibetan TTVS system realization method
Goecke et al. The audio-video Australian English speech data corpus AVOZES
WO2023035969A1 (en) Speech and image synchronization measurement method and apparatus, and model training method and apparatus
CN110570842B (en) Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree
Liu et al. A novel resynchronization procedure for hand-lips fusion applied to continuous french cued speech recognition
CN115312030A (en) Display control method and device of virtual role and electronic equipment
Gimeno-Gómez et al. LIP-RTVE: An audiovisual database for continuous Spanish in the wild
Taylor et al. A mouth full of words: Visually consistent acoustic redubbing
Paleček Experimenting with lipreading for large vocabulary continuous speech recognition
Chiţu¹ et al. Automatic visual speech recognition
Karpov et al. A framework for recording audio-visual speech corpora with a microphone and a high-speed camera
Karpov et al. Designing a multimodal corpus of audio-visual speech using a high-speed camera
Narwekar et al. PRAV: A Phonetically Rich Audio Visual Corpus.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190806

RJ01 Rejection of invention patent application after publication