CN108538283B - Method for converting lip image characteristics into voice coding parameters - Google Patents
- Publication number
- CN108538283B (application CN201810215220.4A)
- Authority
- CN
- China
- Prior art keywords
- lip
- time
- predictor
- frame
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Acoustics & Sound (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Signal Processing (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a method for converting lip image features into speech coding parameters, comprising the following steps: 1) construct a speech coding parameter converter comprising an input cache and a trained predictor; receive lip feature vectors sequentially in time order and store them in the converter's input cache; 2) at regular intervals, send the k latest lip feature vectors cached at the current moment into the predictor as a short-time vector sequence and obtain a prediction result, which is the coding parameter vector of one speech frame; 3) the speech coding parameter converter outputs the prediction result. Compared with the prior art, the method has the advantages of direct conversion, no need of an intermediate text step, and easy construction and training.
Description
Technical Field
The invention relates to the technical fields of computer vision, digital image processing and microelectronics, and in particular to a method for converting lip image characteristics into voice coding parameters.
Background
Lip language recognition (lip reading) generates corresponding text from lip videos. The following prior art is related:
(1) CN107122646A, invention title: a lip-language unlocking method. Its principle is to compare lip features captured in real time with pre-stored lip features to verify identity; it only captures lip features.
(2) CN107437019A, invention title: an identity authentication method and device based on lip-language recognition. The principle is similar to (1), except that 3D images are used.
(3) CN106504751A, invention title: an adaptive lip-language interaction method and an interaction device. It still recognizes the lips as text and then performs command interaction based on that text; the conversion steps are complex.
(4) LipNet, a deep learning lip-reading algorithm published by the University of Oxford together with DeepMind, which recognizes lips as text. Its recognition rate is higher than earlier work, but the conversion process is likewise complicated.
(5) CN107610703A, invention title: a multi-language translator based on lip-language collection and voice pickup. An existing recognition module is used to produce text, and an existing speech synthesis module then converts the text into speech.
Disclosure of Invention
The present invention aims to overcome the above drawbacks of the prior art and to provide a method for converting lip image features into speech coding parameters.
The purpose of the invention can be realized by the following technical scheme:
a method for converting lip image characteristics into voice coding parameters comprises the following steps:
1) constructing a voice coding parameter converter, which comprises an input cache and a trained predictor, sequentially receiving lip feature vectors according to time sequence, and storing the lip feature vectors in the input cache of the converter;
2) sending k latest lip feature vectors cached at the current moment into a predictor as a short-time vector sequence at regular intervals, and obtaining a prediction result which is a coding parameter vector of a voice frame;
3) the speech coding parameter converter outputs a prediction result.
The training method of the predictor specifically comprises the following steps:
21) synchronously acquiring video and voice: synchronously acquire video and the corresponding voice data through video and audio acquisition equipment, extract lip images from the video, each lip image being a rectangular area containing the whole mouth and centered on the mouth, obtaining a series of lip images I1, I2, ..., In that form the lip video, the voice data being a voice sample sequence S1, S2, ..., SM, and keep the correspondence between the lip images and the voice data;
22) acquiring the lip feature vector short-time sequence FISt at any time t: calculate the image feature vector FI of each lip image I in the lip video to obtain a series of lip feature vectors FI1, FI2, ..., FIn; for any given time t, extract k consecutive lip feature vectors as the lip feature vector short-time sequence at time t, FISt = (FI(t-k+1), ..., FI(t-2), FI(t-1), FIt), where FIt is the lip feature vector closest in time to t and k is a specified parameter;
23) acquiring the coding parameter vector FAt of the speech frame at any time t: for any time t, extract L consecutive voice samples as a voice frame At = (S(t-L+1), ..., S(t-2), S(t-1), St), where St is the voice sample closest in time to t; calculate the coding parameters of the voice frame with a speech analysis algorithm, i.e. the coding parameter vector FAt of the speech frame at time t, where L is a fixed parameter;
24) training the predictor with samples: for any given time t, the training sample pair {FISt, FAt} obtained according to steps 22) and 23) serves as the input and the expected output of the predictor; randomly select a number of t values within the valid range and train the predictor according to its type.
In the step 22), the frame rate of the lip feature vector is doubled by performing time interpolation on the lip feature vector, or the frame rate of the lip feature vector is increased by adopting a high-speed image acquisition device for acquisition.
The predictor adopts an artificial neural network, and the artificial neural network is formed by sequentially connecting 3 LSTM layers and 2 full-connection layers.
In the step 22), the obtaining of the lip feature vector specifically includes the following steps:
for each frame of lip image, 20 feature points surrounding the inner edge and the outer edge of the lip are extracted, the center coordinates of the 20 feature points are obtained, the center coordinates are subtracted from the coordinates of each point to obtain 40 coordinate data, the 40 coordinate values are subjected to normalization processing, and finally a lip feature vector is obtained.
In step 23), the speech analysis algorithm is an LPC10e algorithm, and the coding parameter vector is an LPC parameter, and includes 1 voiced/unvoiced flag of the first half frame, 1 voiced/unvoiced flag of the second half frame, 1 pitch period, 1 gain, and 10 reflection coefficients.
Compared with the prior art, the invention has the following characteristics:
firstly, direct conversion: the invention adopts machine learning technology to construct a special converter, which realizes the conversion from lip image characteristic vector to speech frame coding parameter vector. The predictor can be realized by an artificial neural network, but is not limited to the artificial neural network.
Secondly, character conversion is not needed: the converter adopts lip image characteristic vector sequence as input and speech frame coding parameter vector as output. The output speech frame coding parameter vector can be directly synthesized into a speech sampling frame by a speech synthesis technology without the intermediate link of 'text'.
Thirdly, easy construction and training: the invention also provides a training method for the designed predictor and a method for constructing training samples.
Drawings
Fig. 1 is a diagram showing the constitution and interface structure of a converter.
FIG. 2 is a flow chart of predictor training.
FIG. 3 is an artificial neural network structure of a predictor.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
The invention designs a converter for converting lip image characteristics into voice coding parameters, which can receive the characteristic vector sequence of the lip image, convert the characteristic vector sequence into a voice frame coding parameter vector sequence and output the voice frame coding parameter vector sequence.
The converter mainly includes an input buffer, a predictor, and configuration parameters. The core of the method is a predictor, which is a machine learning model and can be trained by utilizing training samples. The trained predictor can predict and output a short-time sequence of the lip feature vector as a corresponding speech coding parameter vector.
As shown in fig. 1, the converter basically comprises an input cache, a predictor, and configuration parameters, together with input and output interfaces. The converter receives individual lip feature vectors and stores them in the input cache. At regular time intervals Δt, the k latest cached lip feature vectors are sent into the predictor, which produces a prediction result that is output from the output port. The prediction result is the coding parameters of one speech frame. The configuration parameters mainly store the predictor's configuration.
The operation of the converter is described as follows:
(1) The converter receives a series of lip feature vectors FI1, FI2, ..., FIn and stores them in the input cache. These lip feature vectors are input sequentially in chronological order.
(2) Every time interval Δt, the converter takes the k latest lip feature vectors cached at the current moment as a short-time vector sequence FISt = (FI(t-k+1), ..., FI(t-2), FI(t-1), FIt), feeds it into the predictor, and obtains a prediction result FAt. The prediction result is the coding parameter vector of one speech frame. Here Δt equals the duration of one speech frame, and k is a fixed parameter.
(3) After the prediction result FAt is obtained, it is output from the output interface.
The above steps run in a continuous loop, so that the lip image feature vector sequence FI1, FI2, ..., FIn is converted into a sequence of speech-frame coding parameter vectors FA1, FA2, ..., FAm. Since the speech frame rate is not necessarily equal to the video frame rate, the number n of input image feature vectors FI is not necessarily equal to the number m of output speech-frame parameter vectors FA.
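The buffering loop above can be sketched as follows. This is a minimal sketch with a stub predictor; the real predictor is the trained neural network, and the class and method names here are illustrative, not from the patent.

```python
from collections import deque

class SpeechCodingParameterConverter:
    """Minimal sketch of the converter: an input cache plus a predictor.
    The predictor is any callable mapping a k-long sequence of lip
    feature vectors to one coding-parameter vector."""

    def __init__(self, predictor, k):
        self.predictor = predictor
        self.k = k
        self.cache = deque(maxlen=k)    # keeps only the k latest vectors

    def receive(self, fi):
        """Store one incoming lip feature vector (called in time order)."""
        self.cache.append(fi)

    def tick(self):
        """Called once per Δt (one speech-frame duration). Returns one
        prediction FA_t, or None until the cache holds k vectors."""
        if len(self.cache) < self.k:
            return None
        return self.predictor(list(self.cache))

# Stub predictor that just reports the window length, to show the data flow:
conv = SpeechCodingParameterConverter(predictor=len, k=3)
outs = []
for fi in ["FI1", "FI2", "FI3", "FI4", "FI5"]:
    conv.receive(fi)
    outs.append(conv.tick())
print(outs)  # [None, None, 3, 3, 3]
```

Note how the first k-1 ticks produce no output: the cache must fill before the first speech frame can be predicted, which matches the converter's start-up behavior.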
The converter described in this patent relies on a predictor implemented with a machine learning model that has data-prediction capability, such as (but not limited to) an artificial neural network. Before use it must be trained (i.e. the predictor must learn). A training method follows; its principle is shown in fig. 2. The method extracts lip image feature vectors from the lip video, and speech coding parameter vectors from the corresponding speech. A short-time sequence of lip image feature vectors FISt = (FI(t-k+1), ..., FI(t-2), FI(t-1), FIt) serves as a training input sample, and the coding parameter vector FAt of the speech frame corresponding to FISt serves as the expected output, i.e. the label. This yields a large number of training sample/label pairs {FISt, FAt}, where t is any random valid moment, with which the predictor is trained.
The training predictor specifically comprises the following steps:
(1) Synchronously capture video and voice. Capture video and the corresponding voice data synchronously with video and audio acquisition equipment. The video must include the lip region. Extract the lip part, i.e. a rectangular area containing the whole mouth and centered on the mouth, from the video. The final lip video consists of a series of lip images I1, I2, ..., In. The voice data is represented as a sequence of voice samples S1, S2, ..., SM (capital M denotes the number of samples; the number of speech frames is written with lowercase m). Keep the time correspondence between the images and the voice.
(2) Obtain the lip feature vector short-time sequence FISt at any time t. Calculate the image feature vector FI of each lip image I in the lip video, obtaining a series of lip feature vectors FI1, FI2, ..., FIn. For any given time t, extract k consecutive lip feature vectors as the lip feature vector short-time sequence at time t, FISt = (FI(t-k+1), ..., FI(t-2), FI(t-1), FIt), where FIt is the lip feature vector closest in time to t and k is a specified parameter. To raise the frame rate of the lip feature vectors, they can be interpolated in time to double the frame rate, or high-speed image acquisition equipment can be used directly.
(3) Obtain the speech frame coding parameter vector FAt at any time t. For any time t, extract L consecutive voice samples as a voice frame At = (S(t-L+1), ..., S(t-2), S(t-1), St), where St is the voice sample closest in time to t. Calculate the coding parameters of the speech frame with a speech analysis algorithm, i.e. the coding parameter vector FAt of the speech frame at time t. Here L is a fixed parameter.
(4) Train the predictor with samples. For any time t, a training sample pair {FISt, FAt} is obtained according to (2) and (3), where FISt is the input of the predictor and FAt is its expected output, i.e. the label. A large number of samples can be obtained by randomly selecting many t values within the valid range. Use these samples to train the predictor with the method appropriate to its type.
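Steps (2)-(4) amount to slicing time-aligned arrays. A sketch follows; it assumes the lip feature vectors have already been interpolated to the speech-frame rate so that index t aligns across both arrays, and the function name is illustrative.

```python
import numpy as np

def make_training_pairs(lip_vectors, lpc_vectors, k, num_samples, rng):
    """Build training pairs {FIS_t, FA_t} by sampling random valid times t.
    lip_vectors: (n, 40) lip feature vectors, time-aligned with
    lpc_vectors: (n, 14) speech coding parameter vectors."""
    n = len(lip_vectors)
    X, y, ts = [], [], []
    for _ in range(num_samples):
        t = int(rng.integers(k - 1, n))            # t must leave room for k vectors
        X.append(lip_vectors[t - k + 1 : t + 1])   # FIS_t: the k latest vectors
        y.append(lpc_vectors[t])                   # FA_t: the label
        ts.append(t)
    return np.stack(X), np.stack(y), np.array(ts)

rng = np.random.default_rng(0)
lips = rng.normal(size=(200, 40))   # dummy lip feature vectors
lpcs = rng.normal(size=(200, 14))   # dummy LPC parameter vectors
X, y, ts = make_training_pairs(lips, lpcs, k=25, num_samples=64, rng=rng)
print(X.shape, y.shape)  # (64, 25, 40) (64, 14)
```

Because t ranges over every valid moment, the number of distinct training pairs grows with the recording length, which is why the patent can claim "a large number of samples" from modest captured data.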
(5) And using the trained predictor as a component for constructing the lip sound converter.
Example 1:
the following is a specific implementation, but the methods and principles of the present invention are not limited to the specific numbers given therein.
(1) The predictor can be implemented with an artificial neural network; other machine learning techniques can also be used to construct it. In the following, the predictor uses an artificial neural network, i.e. the predictor is equivalent to an artificial neural network.
In this embodiment, the neural network consists of 3 LSTM layers and 2 fully connected (Dense) layers connected in sequence. Dropout layers are added between the layers and on the internal feedback connections of the LSTMs; for structural clarity these are not shown in the figure. The structure is shown in fig. 3:
Each of the three LSTM layers has 80 neurons, and the first two operate in 'return_sequences' mode. The two Dense layers have 100 and 14 neurons, respectively.
The first LSTM layer receives the lip feature sequence as input, in the format of a 3-dimensional array (BATCHES, STEPS, LIP_DIM). The last fully connected layer is the output layer of the network; its output format is a 2-dimensional array (BATCHES, LPC_DIM). In these formats, BATCHES is the number of samples fed into the network at a time (customarily called the batch size); BATCHES is typically greater than 1 during training and equal to 1 during application. (STEPS, LIP_DIM) specifies the shape of one input sample: STEPS is the length of the lip feature short-time sequence (the number of time steps), i.e. the k in FISt = (FI(t-k+1), ..., FI(t-2), FI(t-1), FIt), so STEPS = k; LIP_DIM is the dimension of one lip feature vector FI, and for a lip feature vector of 40 coordinate values, LIP_DIM = 40. In the output format, LPC_DIM is the dimension of one speech coding parameter vector; for LPC10e, LPC_DIM = 14.
The number of neurons and the number of layers should be adjusted to the application scenario; for scenarios with a large vocabulary of exchanged words, both can be set larger.
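The 3×LSTM + 2×Dense network described above can be sketched in Keras as follows. This is a sketch, not the patent's exact implementation: the dropout rate, activation function, and optimizer are assumptions the patent does not specify.

```python
# Sketch of the predictor network: 3 LSTM layers (80 units each, the first
# two with return_sequences=True) followed by Dense(100) and Dense(14).
from tensorflow import keras
from tensorflow.keras import layers

STEPS, LIP_DIM, LPC_DIM = 25, 40, 14   # k = 25 time steps, 40 coords, 14 LPC values

model = keras.Sequential([
    keras.Input(shape=(STEPS, LIP_DIM)),   # one sample: (STEPS, LIP_DIM)
    layers.LSTM(80, return_sequences=True),
    layers.Dropout(0.2),                   # dropout rate is an assumption
    layers.LSTM(80, return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(80),                       # final LSTM emits only the last step
    layers.Dropout(0.2),
    layers.Dense(100, activation="relu"),  # activation is an assumption
    layers.Dense(LPC_DIM),                 # linear output: one LPC parameter vector
])
model.compile(optimizer="adam", loss="mse")  # MSE loss, trained by backpropagation
```

The last LSTM layer omits `return_sequences`, collapsing the k-step sequence to a single vector, which is what lets the Dense output produce exactly one coding-parameter vector per window.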
(2) Determine the value of k. The value of k depends on the application scenario. A simple application may only need to recognize Chinese characters one by one; since pronouncing one Chinese character takes about 0.5 seconds, at a video rate of 50 frames/second k is the number of video frames contained in 0.5 seconds, i.e. k = 50 × 0.5 = 25. Scenarios with more words require recognizing words or even phrases as a whole, and k must be multiplied accordingly. For example, for the two words 'size' and 'truck' (in Chinese), the mouth shapes of the characters for 'large' and 'card' are similar and individual characters are hard to distinguish, so both words must be recognized as wholes, and k must then be at least about 2 × 25 = 50.
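The frame-count arithmetic above is simple enough to encode directly (the helper name `window_length` is illustrative, not from the patent):

```python
def window_length(fps: float, utterance_seconds: float) -> int:
    """Number of video frames (the parameter k) spanned by one recognition unit."""
    return round(fps * utterance_seconds)

# Single Chinese character (~0.5 s) at 50 frames/s:
k_char = window_length(50, 0.5)   # 25
# Two-character word recognized as a whole:
k_word = window_length(50, 1.0)   # 50
```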
(3) Calculate the lip feature vectors. For each image frame, extract 20 feature points around the inner and outer edges of the lips to describe the current lip shape. Compute the center coordinates of the 20 points and subtract them from the coordinates of each point. Each point has x and y coordinate values, giving 40 values for the 20 points. Normalize the 40 coordinate values to obtain a lip feature vector FI. A series of lip feature vectors FI1, FI2, ..., FIn is obtained from consecutive video images. Since the image frame rate is usually not high, the lip feature vectors can be interpolated to raise the frame rate. k consecutive lip feature vectors form the short-time sequence FISt = (FI(t-k+1), ..., FI(t-2), FI(t-1), FIt), the input sample of the predictor, where FIt is the vector closest in time to t.
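The feature-vector computation in step (3) can be sketched with NumPy. The patent does not specify the exact normalization, so the unit-norm scaling below is an assumption.

```python
import numpy as np

def lip_feature_vector(points):
    """Turn 20 (x, y) lip landmarks into a 40-value feature vector:
    subtract the landmarks' center, then normalize (unit L2 norm here,
    one plausible choice; the patent leaves the normalization unspecified)."""
    points = np.asarray(points, dtype=np.float64)   # shape (20, 2)
    centered = points - points.mean(axis=0)         # subtract center coordinates
    norm = np.linalg.norm(centered)
    if norm > 0.0:
        centered = centered / norm                  # scale-invariant
    return centered.reshape(-1)                     # flatten to 40 values

# Synthetic lip contour: 20 points on an ellipse
angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
pts = np.stack([50 + 30 * np.cos(angles), 80 + 12 * np.sin(angles)], axis=1)
fv = lip_feature_vector(pts)
print(fv.shape)  # (40,)
```

Subtracting the center makes the vector invariant to where the mouth sits in the frame, and the norm scaling makes it invariant to how large the mouth appears, which is what the centering-plus-normalization step is for.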
(4) Calculate the coding parameter vector of a speech frame. The speech frame at time t is At = (S(t-L+1), ..., S(t-2), S(t-1), St), where St is the voice sample closest in time to t and t is any valid moment. Here the speech may be sampled at 8000 Hz with L set to 180, i.e. every 180 samples form one audio frame, spanning 22.5 ms. The speech coding may use the LPC10e algorithm. Analyzing a speech frame At with this algorithm yields the coding parameter vector FAt of that frame, i.e. the 14 LPC parameter values: 1 first-half-frame voiced/unvoiced flag, 1 second-half-frame voiced/unvoiced flag, 1 pitch period, 1 gain, and 10 reflection coefficients. In this way the coding parameter vector FAt of the speech frame at any valid moment can be calculated. Different speech frames may overlap. The speech-frame coding parameter vector serves as the expected output format when training the predictor and as its actual output format when applying it.
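The frame extraction in step (4) (8000 Hz, L = 180, hence 22.5 ms per frame) can be sketched as follows; the LPC10e analysis itself is omitted, since it is a separate codec routine.

```python
import numpy as np

SAMPLE_RATE = 8000
L = 180                                   # samples per speech frame
FRAME_MS = 1000.0 * L / SAMPLE_RATE       # 180 / 8000 s = 22.5 ms

def speech_frame(samples, t):
    """Return A_t = (S_{t-L+1}, ..., S_t): the L samples ending at
    index t (requires t >= L - 1 for a full frame)."""
    if t < L - 1:
        raise ValueError("t too small for a full frame")
    return samples[t - L + 1 : t + 1]

signal = np.arange(16000)                 # two seconds of dummy samples
frame = speech_frame(signal, 999)
print(len(frame), FRAME_MS)               # 180 22.5
```

Since t may be any valid moment, consecutive frames taken at different t values overlap, exactly as the text notes.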
(5) Training the predictor. A short-time sequence of lip feature vectors FISt serves as the input sample, and the coding parameter vector FAt of the speech frame at the corresponding moment serves as its label (i.e. the prediction target), forming a sample pair {FISt, FAt}. Since t can take any value within the valid time range, a large number of training samples can be obtained for training the predictor. During training, the prediction error is measured with the mean squared error (MSE), and the network weights are adjusted step by step with error backpropagation. This finally yields a usable predictor.
(6) After the training of the predictor is completed, the predictor is used as a module in the converter. The structure description data and weight data of the predictor, as well as other parameters, are stored in the 'configuration parameters', and when the converter is started, the configuration parameters are read out, and the predictor is reconstructed according to the parameters.
(7) The methods described herein may be implemented in software or may be implemented partially or wholly in hardware.
Claims (2)
1. A method for converting lip image features into speech coding parameters, comprising the steps of:
1) constructing a voice coding parameter converter, which comprises an input cache and a trained predictor, sequentially receiving lip feature vectors according to time sequence, and storing the lip feature vectors in the input cache of the converter;
2) sending k latest lip feature vectors cached at the current moment into a predictor as a short-time vector sequence at regular intervals, and obtaining a prediction result, wherein the prediction result is a coding parameter vector of a speech frame, the predictor adopts an artificial neural network, the artificial neural network is formed by sequentially connecting 3 LSTM layers and 2 full-connection layers, and the training method of the predictor specifically comprises the following steps:
21) synchronously acquiring video and voice: synchronously acquiring video and the corresponding voice data through video and audio acquisition equipment, extracting lip images from the video, each lip image being a rectangular area containing the whole mouth and centered on the mouth, obtaining a series of lip images I1, I2, ..., In that form the lip video, the voice data being a voice sample sequence S1, S2, ..., SM, and keeping the correspondence between the lip images and the voice data;
22) acquiring the lip feature vector short-time sequence FISt at any time t: calculating the image feature vector FI of each lip image I in the lip video to obtain a series of lip feature vectors FI1, FI2, ..., FIn, and, for any given time t, extracting k consecutive lip feature vectors as the lip feature vector short-time sequence at time t, FISt = (FI(t-k+1), ..., FI(t-2), FI(t-1), FIt), wherein FIt is the lip feature vector closest in time to t and k is a specified parameter, specifically comprising the following steps:
for each lip image frame, extracting 20 feature points surrounding the inner and outer edges of the lips, acquiring the center coordinates of the 20 feature points, subtracting the center coordinates from the coordinates of each point to obtain 40 coordinate values, and normalizing the 40 coordinate values to finally obtain a lip feature vector;
23) acquiring the coding parameter vector FAt of the speech frame at any time t: for any time t, extracting L consecutive voice samples as a voice frame At = (S(t-L+1), ..., S(t-2), S(t-1), St), wherein St is the voice sample closest in time to t, and calculating the coding parameters of the voice frame with a speech analysis algorithm, i.e. the coding parameter vector FAt of the speech frame at time t, wherein L is a fixed parameter;
24) training the predictor with samples: for any given time t, the training sample pair {FISt, FAt} obtained according to steps 22) and 23) serves as the input and the expected output of the predictor, a number of t values being randomly selected within the valid range and the predictor being trained according to its type; the speech analysis algorithm is the LPC10e algorithm, and the coding parameter vector is the LPC parameters, comprising 1 first-half-frame voiced/unvoiced flag, 1 second-half-frame voiced/unvoiced flag, 1 pitch period, 1 gain, and 10 reflection coefficients;
3) the speech coding parameter converter outputs a prediction result.
2. The method as claimed in claim 1, wherein in step 22), the lip feature vector is temporally interpolated to double its frame rate, or the frame rate of the lip feature vector is increased by using a high-speed image capturing device.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810215220.4A CN108538283B (en) | 2018-03-15 | 2018-03-15 | Method for converting lip image characteristics into voice coding parameters |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810215220.4A CN108538283B (en) | 2018-03-15 | 2018-03-15 | Method for converting lip image characteristics into voice coding parameters |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108538283A CN108538283A (en) | 2018-09-14 |
CN108538283B true CN108538283B (en) | 2020-06-26 |
Family
ID=63484002
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810215220.4A Active CN108538283B (en) | 2018-03-15 | 2018-03-15 | Method for converting lip image characteristics into voice coding parameters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108538283B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10891969B2 (en) * | 2018-10-19 | 2021-01-12 | Microsoft Technology Licensing, Llc | Transforming audio content into images |
CN111023470A (en) * | 2019-12-06 | 2020-04-17 | 厦门快商通科技股份有限公司 | Air conditioner temperature adjusting method, medium, equipment and device |
CN111508509A (en) * | 2020-04-02 | 2020-08-07 | 广东九联科技股份有限公司 | Sound quality processing system and method based on deep learning |
CN113869212B (en) * | 2021-09-28 | 2024-06-21 | 平安科技(深圳)有限公司 | Multi-mode living body detection method, device, computer equipment and storage medium |
CN116013354B (en) * | 2023-03-24 | 2023-06-09 | 北京百度网讯科技有限公司 | Training method of deep learning model and method for controlling mouth shape change of virtual image |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104217218A (en) * | 2014-09-11 | 2014-12-17 | 广州市香港科大霍英东研究院 | Lip language recognition method and system |
CN105321519A (en) * | 2014-07-28 | 2016-02-10 | 刘璟锋 | Speech recognition system and unit |
CN105632497A (en) * | 2016-01-06 | 2016-06-01 | 昆山龙腾光电有限公司 | Voice output method, voice output system |
CN107799125A (en) * | 2017-11-09 | 2018-03-13 | 维沃移动通信有限公司 | A kind of audio recognition method, mobile terminal and computer-readable recording medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7133535B2 (en) * | 2002-12-21 | 2006-11-07 | Microsoft Corp. | System and method for real time lip synchronization |
Also Published As
Publication number | Publication date |
---|---|
CN108538283A (en) | 2018-09-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108538283B (en) | Method for converting lip image characteristics into voice coding parameters | |
CN110119703B (en) | Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene | |
CN111243626B (en) | Method and system for generating speaking video | |
CN108875807B (en) | Image description method based on multiple attention and multiple scales | |
CN111488807B (en) | Video description generation system based on graph rolling network | |
CN108648745B (en) | Method for converting lip image sequence into voice coding parameter | |
CN111914076B (en) | User image construction method, system, terminal and storage medium based on man-machine conversation | |
CN110866510A (en) | Video description system and method based on key frame detection | |
CN114245215B (en) | Method, device, electronic equipment, medium and product for generating speaking video | |
CN113961736B (en) | Method, apparatus, computer device and storage medium for text generation image | |
CN115237255B (en) | Natural image co-pointing target positioning system and method based on eye movement and voice | |
CN114882873B (en) | Speech recognition model training method and device and readable storage medium | |
CN114974215A (en) | Audio and video dual-mode-based voice recognition method and system | |
CN111259785A (en) | Lip language identification method based on time offset residual error network | |
CN116597857A (en) | Method, system, device and storage medium for driving image by voice | |
CN108959512B (en) | Image description network and technology based on attribute enhanced attention model | |
CN114694255A (en) | Sentence-level lip language identification method based on channel attention and time convolution network | |
CN117409121A (en) | Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving | |
CN115713535B (en) | Image segmentation model determination method and image segmentation method | |
CN113450824B (en) | Voice lip reading method and system based on multi-scale video feature fusion | |
CN114360491B (en) | Speech synthesis method, device, electronic equipment and computer readable storage medium | |
CN108538282B (en) | Method for directly generating voice from lip video | |
CN115496134A (en) | Traffic scene video description generation method and device based on multi-modal feature fusion | |
CN115019137A (en) | Method and device for predicting multi-scale double-flow attention video language event | |
CN111340329B (en) | Actor evaluation method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||