CN108538283B - Method for converting lip image characteristics into voice coding parameters

Method for converting lip image characteristics into voice coding parameters

Info

Publication number
CN108538283B
Authority
CN
China
Prior art keywords
lip
time
predictor
frame
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810215220.4A
Other languages
Chinese (zh)
Other versions
CN108538283A (en)
Inventor
贾振堂 (Jia Zhentang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University of Electric Power
Original Assignee
Shanghai University of Electric Power
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University of Electric Power filed Critical Shanghai University of Electric Power
Priority to CN201810215220.4A
Publication of CN108538283A
Application granted
Publication of CN108538283B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for converting lip image features into speech coding parameters, comprising the following steps: 1) constructing a speech-coding-parameter converter comprising an input buffer and a trained predictor, which receives lip feature vectors in chronological order and stores them in the input buffer; 2) at regular time intervals, feeding the k most recently buffered lip feature vectors into the predictor as a short-time vector sequence and obtaining a prediction result, which is the coding parameter vector of one speech frame; 3) outputting the prediction result from the converter. Compared with the prior art, the method offers direct conversion, requires no intermediate text conversion, and is convenient to construct and train.

Description

Method for converting lip image characteristics into voice coding parameters
Technical Field
The invention relates to the technical fields of computer vision, digital image processing and microelectronics, and in particular to a method for converting lip image features into speech coding parameters.
Background
Lip reading (lip-language recognition) generates a corresponding text representation from lip video. The prior art includes the following technical schemes:
(1) CN107122646A, title of invention: a lip-language unlocking method. Lip features acquired in real time are compared with pre-stored lip features to verify identity; its output is limited to the lip features themselves.
(2) CN107437019A, title of invention: a lip-language-recognition identity authentication method and device. The principle is similar to (1), except that 3D images are used.
(3) CN106504751A, title of invention: an adaptive lip-language interaction method and interaction device. It still recognizes the lips as text and then performs command interaction based on that text, so the conversion chain is complex.
(4) LipNet, a deep-learning lip-reading algorithm released by the University of Oxford in collaboration with DeepMind, aims to recognize lip movements as text. Its recognition rate is higher than that of earlier work, but the conversion process remains indirect.
(5) CN107610703A, title of invention: a multi-language translator based on lip-language collection and voice pickup. It uses an existing recognition module to obtain text and then an existing speech synthesis module to convert the text into speech.
Disclosure of Invention
The present invention is directed to overcoming the above-mentioned drawbacks of the prior art and providing a method for converting lip image features into speech coding parameters.
The purpose of the invention can be realized by the following technical scheme:
a method for converting lip image characteristics into voice coding parameters comprises the following steps:
1) constructing a speech-coding-parameter converter comprising an input buffer and a trained predictor; the converter receives lip feature vectors in chronological order and stores them in its input buffer;
2) at regular time intervals, feeding the k most recently buffered lip feature vectors into the predictor as a short-time vector sequence and obtaining a prediction result, which is the coding parameter vector of one speech frame;
3) the converter outputs the prediction result.
The training method of the predictor specifically comprises the following steps:
21) synchronously acquiring video and speech: video and the corresponding speech data are captured synchronously by video and audio acquisition equipment; lip images, i.e. rectangular regions containing the whole mouth and centered on the mouth, are extracted from the video, giving a series of lip images I1, I2, ..., In that form the lip video; the speech data is a sequence of speech sample values S1, S2, ..., SM; the time correspondence between the lip images and the speech data is preserved;
22) obtaining the short-time lip-feature-vector sequence FISt for any time t: the image feature vector FI is computed for each lip image I in the lip video, giving a series of lip feature vectors FI1, FI2, ..., FIn; for any given time t, k consecutive lip feature vectors are taken as the short-time sequence FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), where FIt is the lip feature vector closest in time to t and k is a specified parameter;
23) obtaining the coding parameter vector FAt of the speech frame at any time t: for any time t, L consecutive speech sample values are taken as one speech frame At = (St-L+1, ..., St-2, St-1, St), where St is the speech sample closest in time to t; a speech analysis algorithm is applied to compute the coding parameters of this frame, i.e. the coding parameter vector FAt of the speech frame at time t, where L is a fixed parameter;
24) training the predictor with samples: for any given time t, the training sample pair {FISt, FAt} obtained according to steps 22) and 23) is used as the input and the expected output of the predictor; a number of t values are selected at random within the valid range, and the predictor is trained by the method appropriate to its type.
In step 22), the frame rate of the lip feature vectors is increased either by temporal interpolation, which doubles the frame rate, or by capturing with high-speed image acquisition equipment.
The predictor adopts an artificial neural network formed by 3 LSTM layers and 2 fully-connected layers connected in sequence.
In step 22), the lip feature vector is obtained as follows:
for each lip image, 20 feature points around the inner and outer edges of the lips are extracted; the center coordinates of the 20 points are computed and subtracted from the coordinates of each point, giving 40 coordinate values; these 40 values are normalized to yield the lip feature vector.
In step 23), the speech analysis algorithm is the LPC10e algorithm, and the coding parameter vector consists of the LPC parameters: 1 voiced/unvoiced flag for the first half-frame, 1 voiced/unvoiced flag for the second half-frame, 1 pitch period, 1 gain, and 10 reflection coefficients.
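For illustration only, the 14 LPC10e parameter values listed above could be held in a small structure like the following Python sketch; the class and field names are hypothetical and not part of the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LpcFrameParams:
    """Hypothetical container for the 14 LPC10e coding parameters of one speech frame."""
    voiced_first_half: float   # voiced/unvoiced flag, first half-frame
    voiced_second_half: float  # voiced/unvoiced flag, second half-frame
    pitch_period: float        # pitch period
    gain: float                # gain
    reflection: List[float] = field(default_factory=lambda: [0.0] * 10)  # 10 reflection coefficients

    def to_vector(self) -> List[float]:
        # Flatten to the 14-dimensional target vector FAt used to train the predictor.
        return [self.voiced_first_half, self.voiced_second_half,
                self.pitch_period, self.gain, *self.reflection]
```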
Compared with the prior art, the invention has the following features:
First, direct conversion: the invention uses machine learning to construct a dedicated converter that maps lip-image feature vectors directly to speech-frame coding parameter vectors. The predictor can be implemented with an artificial neural network, but is not limited to one.
Second, no text intermediate: the converter takes a sequence of lip-image feature vectors as input and outputs speech-frame coding parameter vectors, which can be synthesized directly into speech sample frames by speech synthesis techniques, without any intermediate 'text' step.
Third, easy construction and training: the invention also provides a training method for the designed predictor and a method for constructing the training samples.
Drawings
Fig. 1 is a diagram showing the composition and interface structure of the converter.
FIG. 2 is a flow chart of predictor training.
FIG. 3 is an artificial neural network structure of a predictor.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
The invention designs a converter that converts lip image features into speech coding parameters: it receives a sequence of lip-image feature vectors, converts it into a sequence of speech-frame coding parameter vectors, and outputs that sequence.
The converter mainly comprises an input buffer, a predictor and configuration parameters. Its core is the predictor, a machine learning model that can be trained with training samples. The trained predictor maps a short-time sequence of lip feature vectors to the corresponding speech coding parameter vector and outputs it.
As shown in fig. 1, the converter basically comprises an input buffer, a predictor and configuration parameters, together with input and output interfaces. The converter receives individual lip feature vectors and stores them in the input buffer. Every time interval Δt, the k most recently buffered lip feature vectors are fed into the predictor, which produces a prediction result that is emitted at the output port. The prediction result is the set of coding parameters of one speech frame. The configuration parameters mainly store the predictor's configuration.
The operation of the converter is described as follows:
(1) The converter receives a series of lip feature vectors FI1, FI2, ..., FIn and stores them in the input buffer. These lip feature vectors arrive sequentially in chronological order.
(2) Every time interval Δt, the converter takes the k most recently buffered lip feature vectors as a short-time vector sequence FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), feeds it into the predictor, and obtains a prediction result FAt. The prediction result is the coding parameter vector of one speech frame. Here Δt equals the duration of one speech frame, and k is a fixed parameter.
(3) Once the prediction result FAt is obtained, it is output through the output interface.
These steps run continuously in a loop, so that the lip-image feature vector sequence FI1, FI2, ..., FIn is converted into the speech-frame coding parameter vector sequence FA1, FA2, ..., FAm. Since the speech-frame rate is not necessarily equal to the video frame rate, the number n of input image feature vectors FI is not necessarily equal to the number m of output speech-frame parameter vectors FA.
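The conversion loop just described can be sketched as follows in Python. This is a minimal illustration, not the patented implementation: the class name, the deque-based input cache and the `predict` call on the trained predictor are all assumptions.

```python
from collections import deque
import numpy as np

class LipToSpeechConverter:
    """Illustrative converter: buffers lip feature vectors FI and, every interval
    delta_t, feeds the k most recent vectors to the trained predictor."""

    def __init__(self, predictor, k: int):
        self.predictor = predictor      # trained machine-learning model
        self.k = k                      # length of the short-time sequence FISt
        self.buffer = deque(maxlen=k)   # input cache holding the latest k vectors

    def push(self, fi: np.ndarray) -> None:
        # Receive one lip feature vector FI in chronological order.
        self.buffer.append(fi)

    def step(self):
        # Called once per speech-frame interval delta_t.
        if len(self.buffer) < self.k:
            return None                          # not enough history yet
        fis_t = np.stack(self.buffer)            # shape (k, LIP_DIM)
        # The predictor expects a batch dimension: (1, k, LIP_DIM) -> (1, LPC_DIM).
        fa_t = self.predictor.predict(fis_t[np.newaxis, ...])[0]
        return fa_t                              # coding parameter vector of one speech frame
```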
The converter described in this patent relies on a predictor implemented with a machine learning model that has data-prediction capability, such as, but not limited to, an artificial neural network. Before use it must be trained (i.e. the predictor must learn). A training method follows; its principle is shown in fig. 2. Lip-image feature vectors are extracted from the lip video, and speech coding parameter vectors are extracted from the corresponding speech. The short-time sequence of lip-image feature vectors FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt) serves as the training input, and the coding parameter vector FAt of the speech frame corresponding to FISt serves as the expected output, i.e. the label. This yields a large number of training sample/label pairs {FISt, FAt}, where t is any randomly chosen valid time, with which the predictor is trained.
Training the predictor comprises the following steps:
(1) Video and speech are captured synchronously. The video and the corresponding speech data are acquired synchronously by video and audio acquisition equipment. The video must include the lip region. The lip region, i.e. a rectangular area containing the whole mouth and centered on the mouth, is extracted from the video. The resulting lip video consists of a series of lip images I1, I2, ..., In. The speech data is represented as a sequence of speech samples S1, S2, ..., SM (capital M denotes the number of samples; the number of speech frames is denoted by lowercase m). The time correspondence between the images and the speech is preserved.
(2) Short-time lip-feature-vector sequence FISt at any time t. The image feature vector FI is computed for each lip image I in the lip video, giving a series of lip feature vectors FI1, FI2, ..., FIn. For any given time t, k consecutive lip feature vectors are taken as the short-time sequence FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), where FIt is the lip feature vector closest in time to t and k is a specified parameter. To raise the frame rate of the lip feature vectors, they can be interpolated in time to double the frame rate, or high-speed image acquisition equipment can be used directly; a minimal interpolation sketch is given below.
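As a sketch of the interpolation option mentioned above, linear midpoint interpolation between consecutive lip feature vectors doubles the frame rate; the patent does not prescribe a particular interpolation scheme, so this is only one plausible choice.

```python
import numpy as np

def double_frame_rate(features: np.ndarray) -> np.ndarray:
    """Insert the midpoint between consecutive lip feature vectors.

    features: array of shape (n, LIP_DIM) holding FI1 ... FIn.
    Returns an array of shape (2n - 1, LIP_DIM).
    """
    midpoints = 0.5 * (features[:-1] + features[1:])
    out = np.empty((2 * len(features) - 1, features.shape[1]), dtype=features.dtype)
    out[0::2] = features    # original frames at even indices
    out[1::2] = midpoints   # interpolated frames in between
    return out
```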
(3) Speech-frame coding parameter vector FAt at any time t. For any time t, L consecutive speech sample values are taken as one speech frame At = (St-L+1, ..., St-2, St-1, St), where St is the speech sample closest in time to t. A speech analysis algorithm is applied to compute the coding parameters of this frame, i.e. the coding parameter vector FAt of the speech frame at time t. Here L is a fixed parameter.
(4) The predictor is trained with the samples. For any time t, a training sample pair {FISt, FAt} is obtained according to (2) and (3), where FISt is the input to the predictor and FAt is its expected output, i.e. the label. A large number of samples can be obtained by randomly selecting many values of t within the valid range. With these samples, the predictor is trained by the method appropriate to its type.
(5) The trained predictor is then used as a component of the lip-to-speech converter.
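A sketch of how one training pair {FISt, FAt} could be assembled from the synchronized data of steps (1)-(3); the helper name, the timestamp arrays and the external `lpc_analysis` function are assumptions, and t must lie far enough inside the recording that both slices are complete.

```python
import numpy as np

def build_training_pair(t, feat_times, features, sample_times, samples,
                        k, L, lpc_analysis):
    """Return (FIS_t, FA_t) for a chosen valid time t.

    feat_times:   (n,) timestamps of the lip feature vectors FI1..FIn
    features:     (n, LIP_DIM) lip feature vectors
    sample_times: (M,) timestamps of the speech samples S1..SM
    samples:      (M,) speech sample values
    lpc_analysis: function mapping a speech frame (L samples) to its
                  coding parameter vector (e.g. 14 LPC10e parameters)
    """
    # Index of the lip feature vector closest in time to t (FIt).
    i = int(np.argmin(np.abs(feat_times - t)))
    # Index of the speech sample closest in time to t (St).
    j = int(np.argmin(np.abs(sample_times - t)))
    assert i >= k - 1 and j >= L - 1, "t must lie in the valid range"
    fis_t = features[i - k + 1:i + 1]   # k consecutive vectors ending at FIt
    a_t = samples[j - L + 1:j + 1]      # speech frame At of L samples
    fa_t = lpc_analysis(a_t)            # coding parameter vector FAt (the label)
    return fis_t, fa_t
```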
Example 1:
The following is a specific implementation; the methods and principles of the invention are not limited to the specific numbers given here.
(1) The predictor can be implemented with an artificial neural network; other machine learning techniques could also be used. In the following, the predictor is an artificial neural network, i.e. the predictor is equivalent to an artificial neural network.
In this embodiment, the neural network consists of 3 LSTM layers followed by 2 fully-connected (Dense) layers, connected in sequence, as shown in fig. 3. Dropout is applied between layers and inside the recurrent connections of the LSTM layers; for clarity these dropout layers are not shown in the figure.
Each of the three LSTM layers has 80 neurons, and the first two LSTM layers operate in 'return_sequences' mode. The two Dense layers have 100 and 14 neurons, respectively.
The first LSTM layer receives the lip feature sequence as a 3-dimensional array of shape (BATCHES, STEPS, LIP_DIM). The last fully-connected layer is the output layer of the network, and its output is a 2-dimensional array of shape (BATCHES, LPC_DIM). In these shapes, BATCHES is the number of samples fed into the network at a time (the batch size); it is usually greater than 1 during training and equal to 1 during application. (STEPS, LIP_DIM) is the shape of one input sample: STEPS is the length (number of steps) of the short-time lip feature sequence, i.e. the k in FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), so STEPS = k; LIP_DIM is the dimension of one lip feature vector FI, and for a lip feature vector consisting of 40 coordinate values, LIP_DIM = 40. In the output shape, LPC_DIM is the dimension of one speech coding parameter vector; for LPC10e, LPC_DIM = 14.
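The described network could be sketched in Keras as follows. The framework choice, the dropout rate and the Dense activation are assumptions; the patent only fixes the layer counts, the 80/100/14 neuron counts, the return-sequences mode of the first two LSTM layers, and the use of dropout.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

STEPS = 25      # k, length of the short-time lip feature sequence (example value)
LIP_DIM = 40    # dimension of one lip feature vector (20 points x 2 coordinates)
LPC_DIM = 14    # dimension of one speech coding parameter vector (LPC10e)

def build_predictor(dropout_rate: float = 0.2) -> Sequential:
    # dropout_rate is an assumed value; the patent does not specify it.
    model = Sequential([
        LSTM(80, return_sequences=True, recurrent_dropout=dropout_rate,
             input_shape=(STEPS, LIP_DIM)),
        Dropout(dropout_rate),
        LSTM(80, return_sequences=True, recurrent_dropout=dropout_rate),
        Dropout(dropout_rate),
        LSTM(80, recurrent_dropout=dropout_rate),
        Dropout(dropout_rate),
        Dense(100, activation='relu'),   # activation is an assumption
        Dense(LPC_DIM),                  # linear output: one coding parameter vector
    ])
    return model
```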
The number of neurons and the number of layers can be adjusted to the application scenario; for applications with a large vocabulary, both can be set larger.
(2) Determining the value of k. The value of k depends on the application scenario. For a simple scenario, recognizing Chinese characters one by one may suffice; since the pronunciation of one character lasts about 0.5 seconds, at a video rate of 50 frames/second k is the number of video frames in 0.5 seconds, i.e. k = 50 × 0.5 = 25. For scenarios with a larger vocabulary, whole words or even phrases must be recognized as a unit, and k increases accordingly. For example, in the Chinese words for 'size' and 'truck', the mouth shapes of their first characters are similar and individual characters are hard to distinguish, so the two-character words must be recognized as a whole and k should be at least about 2 × 25 = 50.
(3) Computing the lip feature vectors. For each image frame, 20 feature points around the inner and outer edges of the lips are extracted to describe the current lip shape. The center coordinates of the 20 points are computed and subtracted from the coordinates of each point. Each point has two coordinate values, x and y, so the 20 points give 40 values in total. These 40 coordinate values are normalized to obtain the lip feature vector FI. From consecutive video images a series of lip feature vectors FI1, FI2, ..., FIn is obtained. Since the image frame rate is usually not high, the lip feature vectors can be interpolated to increase the frame rate. k consecutive lip feature vectors form the short-time sequence FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), which serves as the input sample of the predictor, where FIt is the vector closest in time to t.
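A sketch of this feature computation, assuming the 20 lip landmarks are already available as (x, y) coordinates; the normalization by the maximum absolute coordinate is one plausible choice, since the patent does not specify the normalization scheme.

```python
import numpy as np

def lip_feature_vector(points: np.ndarray) -> np.ndarray:
    """points: (20, 2) array of lip landmark coordinates (x, y).
    Returns the 40-dimensional lip feature vector FI."""
    center = points.mean(axis=0)    # center coordinates of the 20 points
    centered = points - center      # subtract the center from each point
    flat = centered.reshape(-1)     # 40 coordinate values
    scale = np.max(np.abs(flat))    # normalization: one plausible choice
    return flat / scale if scale > 0 else flat
```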
(4) Computing the coding parameter vector of a speech frame. The speech frame at time t is At = (St-L+1, ..., St-2, St-1, St), where St is the speech sample closest in time to t and t is any valid time. Here the speech may be sampled at 8000 Hz with L set to 180, i.e. every 180 samples form one audio frame, covering 22.5 ms. The speech coding may use the LPC10e algorithm. Analyzing a speech frame At with this algorithm yields the coding parameter vector FAt of that frame, i.e. 14 LPC parameter values: 1 voiced/unvoiced flag for the first half-frame, 1 voiced/unvoiced flag for the second half-frame, 1 pitch period, 1 gain, and 10 reflection coefficients. In this way the coding parameter vector FAt can be computed for any valid time; different speech frames may overlap. The speech-frame coding parameter vector defines the expected output format during training and the actual output format of the predictor during application.
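A sketch of the frame slicing at 8000 Hz with L = 180. A complete LPC10e analysis (voicing decisions, pitch, gain) is outside the scope of this example, so the Levinson-Durbin routine below is only a simplified stand-in that derives the 10 reflection coefficients from the frame's autocorrelation.

```python
import numpy as np

FRAME_LEN = 180     # L: samples per frame at 8000 Hz (22.5 ms)
LPC_ORDER = 10      # number of reflection coefficients in LPC10e

def frames(samples: np.ndarray, frame_len: int = FRAME_LEN):
    """Yield consecutive speech frames At of L samples (non-overlapping here,
    although the patent notes that frames may overlap)."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]

def reflection_coefficients(frame: np.ndarray, order: int = LPC_ORDER) -> np.ndarray:
    """Levinson-Durbin recursion on the frame's autocorrelation.
    Simplified stand-in for the reflection-coefficient part of LPC10e."""
    r = np.array([np.dot(frame[:len(frame) - i], frame[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0] if r[0] > 0 else 1e-9
    k = np.zeros(order)
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = -acc / e
        a[1:i + 1] = a[1:i + 1] + k[i - 1] * a[i - 1::-1]   # update prediction coefficients
        e *= (1.0 - k[i - 1] ** 2)                          # residual energy
    return k   # the 10 reflection coefficients
```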
(5) Training the predictor. The short-time lip-feature sequence FISt is used as the input sample and the coding parameter vector FAt of the speech frame at the corresponding time as its label (i.e. the prediction target), forming a sample pair {FISt, FAt}. Since t can take any value within the valid time range, a large number of training samples can be obtained for training the predictor. During training, the prediction error is measured by the mean squared error (MSE), and the network weights are adjusted step by step by error backpropagation. This eventually yields a usable predictor.
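Continuing the earlier Keras sketch, training with mean-squared-error loss and backpropagation might look as follows; the optimizer, batch size, epoch count and the random placeholder data are assumptions, and in practice X and Y come from the sample construction step (X stacks the FISt sequences, Y stacks the FAt labels).

```python
import numpy as np

# Illustrative placeholder data standing in for the real {FISt, FAt} pairs.
N = 1000
X = np.random.rand(N, STEPS, LIP_DIM).astype('float32')
Y = np.random.rand(N, LPC_DIM).astype('float32')

model = build_predictor()
model.compile(optimizer='adam', loss='mse')        # MSE prediction error; 'adam' is an assumption
model.fit(X, Y, batch_size=32, epochs=10,          # hyperparameters are assumptions
          validation_split=0.1)
```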
(6) After training is completed, the predictor is used as a module inside the converter. The structure description data and weight data of the predictor, together with other parameters, are stored in the 'configuration parameters'; when the converter starts, these configuration parameters are read and the predictor is reconstructed from them.
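One way to realize the 'configuration parameters' with the earlier Keras sketch is to serialize the network structure and weights and reload them at start-up; the file names are illustrative and the exact serialization calls depend on the Keras version in use.

```python
from tensorflow.keras.models import model_from_json

# After training: store the structure description and the weight data.
with open('predictor_config.json', 'w') as f:
    f.write(model.to_json())                    # network structure description
model.save_weights('predictor.weights.h5')      # weight data

# At converter start-up: read the configuration parameters and rebuild the predictor.
with open('predictor_config.json') as f:
    restored = model_from_json(f.read())
restored.load_weights('predictor.weights.h5')
```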
(7) The methods described herein may be implemented in software or may be implemented partially or wholly in hardware.

Claims (2)

1. A method for converting lip image features into speech coding parameters, comprising the steps of:
1) constructing a speech coding parameter converter, which comprises an input buffer and a trained predictor, receives lip feature vectors sequentially in chronological order, and stores them in the input buffer of the converter;
2) at regular time intervals, sending the k most recent lip feature vectors in the buffer into the predictor as a short-time vector sequence and obtaining a prediction result, the prediction result being the coding parameter vector of one speech frame, wherein the predictor adopts an artificial neural network formed by 3 LSTM layers and 2 fully-connected layers connected in sequence, and the training method of the predictor specifically comprises the following steps:
21) synchronously acquiring video and speech: acquiring video and the corresponding speech data synchronously through video and audio acquisition equipment, and extracting lip images from the video, each lip image being a rectangular region containing the whole mouth and centered on the mouth, to obtain a series of lip images I1, I2, ..., In forming the lip video, the speech data being a sequence of speech sample values S1, S2, ..., SM, while keeping the time correspondence between the lip images and the speech data;
22) obtaining the short-time lip-feature-vector sequence FISt for any time t: computing the image feature vector FI of each lip image I in the lip video to obtain a series of lip feature vectors FI1, FI2, ..., FIn, and for any given time t taking k consecutive lip feature vectors as the short-time sequence FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), where FIt is the lip feature vector closest in time to t and k is a specified parameter, which specifically comprises the following steps:
for each lip image, extracting 20 feature points around the inner and outer edges of the lips, obtaining the center coordinates of the 20 feature points, subtracting the center coordinates from the coordinates of each point to obtain 40 coordinate values, and normalizing the 40 coordinate values to finally obtain the lip feature vector;
23) obtaining the coding parameter vector FAt of the speech frame at any time t: for any time t, extracting L consecutive speech sample values as one speech frame At = (St-L+1, ..., St-2, St-1, St), where St is the speech sample closest in time to t, and computing the coding parameters of the speech frame with a speech analysis algorithm, i.e. the coding parameter vector FAt of the speech frame at time t, where L is a fixed parameter;
24) training the predictor with samples: for any given time t, using the training sample pair {FISt, FAt} obtained according to steps 22) and 23) as the input and the expected output of the predictor, randomly selecting a plurality of t values within the valid range, and training the predictor according to its type, wherein the speech analysis algorithm is the LPC10e algorithm and the coding parameter vector consists of the LPC parameters, comprising 1 voiced/unvoiced flag for the first half-frame, 1 voiced/unvoiced flag for the second half-frame, 1 pitch period, 1 gain and 10 reflection coefficients;
3) the speech coding parameter converter outputs a prediction result.
2. The method as claimed in claim 1, wherein in step 22), the lip feature vector is temporally interpolated to double its frame rate, or the frame rate of the lip feature vector is increased by using a high-speed image capturing device.
CN201810215220.4A 2018-03-15 2018-03-15 Method for converting lip image characteristics into voice coding parameters Active CN108538283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810215220.4A CN108538283B (en) 2018-03-15 2018-03-15 Method for converting lip image characteristics into voice coding parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810215220.4A CN108538283B (en) 2018-03-15 2018-03-15 Method for converting lip image characteristics into voice coding parameters

Publications (2)

Publication Number Publication Date
CN108538283A CN108538283A (en) 2018-09-14
CN108538283B true CN108538283B (en) 2020-06-26

Family

ID=63484002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810215220.4A Active CN108538283B (en) 2018-03-15 2018-03-15 Method for converting lip image characteristics into voice coding parameters

Country Status (1)

Country Link
CN (1) CN108538283B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10891969B2 (en) * 2018-10-19 2021-01-12 Microsoft Technology Licensing, Llc Transforming audio content into images
CN111023470A (en) * 2019-12-06 2020-04-17 厦门快商通科技股份有限公司 Air conditioner temperature adjusting method, medium, equipment and device
CN111508509A (en) * 2020-04-02 2020-08-07 广东九联科技股份有限公司 Sound quality processing system and method based on deep learning
CN113869212B (en) * 2021-09-28 2024-06-21 平安科技(深圳)有限公司 Multi-mode living body detection method, device, computer equipment and storage medium
CN116013354B (en) * 2023-03-24 2023-06-09 北京百度网讯科技有限公司 Training method of deep learning model and method for controlling mouth shape change of virtual image

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104217218A (en) * 2014-09-11 2014-12-17 广州市香港科大霍英东研究院 Lip language recognition method and system
CN105321519A (en) * 2014-07-28 2016-02-10 刘璟锋 Speech recognition system and unit
CN105632497A (en) * 2016-01-06 2016-06-01 昆山龙腾光电有限公司 Voice output method, voice output system
CN107799125A (en) * 2017-11-09 2018-03-13 维沃移动通信有限公司 A kind of audio recognition method, mobile terminal and computer-readable recording medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7133535B2 (en) * 2002-12-21 2006-11-07 Microsoft Corp. System and method for real time lip synchronization

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105321519A (en) * 2014-07-28 2016-02-10 刘璟锋 Speech recognition system and unit
CN104217218A (en) * 2014-09-11 2014-12-17 广州市香港科大霍英东研究院 Lip language recognition method and system
CN105632497A (en) * 2016-01-06 2016-06-01 昆山龙腾光电有限公司 Voice output method, voice output system
CN107799125A (en) * 2017-11-09 2018-03-13 维沃移动通信有限公司 A kind of audio recognition method, mobile terminal and computer-readable recording medium

Also Published As

Publication number Publication date
CN108538283A (en) 2018-09-14

Similar Documents

Publication Publication Date Title
CN108538283B (en) Method for converting lip image characteristics into voice coding parameters
CN110119703B (en) Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene
CN111243626B (en) Method and system for generating speaking video
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN111488807B (en) Video description generation system based on graph rolling network
CN108648745B (en) Method for converting lip image sequence into voice coding parameter
CN111914076B (en) User image construction method, system, terminal and storage medium based on man-machine conversation
CN110866510A (en) Video description system and method based on key frame detection
CN114245215B (en) Method, device, electronic equipment, medium and product for generating speaking video
CN113961736B (en) Method, apparatus, computer device and storage medium for text generation image
CN115237255B (en) Natural image co-pointing target positioning system and method based on eye movement and voice
CN114882873B (en) Speech recognition model training method and device and readable storage medium
CN114974215A (en) Audio and video dual-mode-based voice recognition method and system
CN111259785A (en) Lip language identification method based on time offset residual error network
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN108959512B (en) Image description network and technology based on attribute enhanced attention model
CN114694255A (en) Sentence-level lip language identification method based on channel attention and time convolution network
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
CN115713535B (en) Image segmentation model determination method and image segmentation method
CN113450824B (en) Voice lip reading method and system based on multi-scale video feature fusion
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium
CN108538282B (en) Method for directly generating voice from lip video
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN115019137A (en) Method and device for predicting multi-scale double-flow attention video language event
CN111340329B (en) Actor evaluation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant