CN108538283B - Method for converting lip image characteristics into voice coding parameters

Method for converting lip image characteristics into voice coding parameters

Info

Publication number
CN108538283B
Authority
CN
China
Prior art keywords
lip
time
predictor
frame
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810215220.4A
Other languages
Chinese (zh)
Other versions
CN108538283A (en)
Inventor
贾振堂 (Jia Zhentang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University of Electric Power
Original Assignee
Shanghai University of Electric Power
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University of Electric Power filed Critical Shanghai University of Electric Power
Priority to CN201810215220.4A
Publication of CN108538283A
Application granted
Publication of CN108538283B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for converting lip image features into speech coding parameters, comprising the following steps: 1) constructing a speech-coding-parameter converter comprising an input buffer and a trained predictor, which receives lip feature vectors in chronological order and stores them in the input buffer; 2) at regular time intervals, feeding the k most recently buffered lip feature vectors into the predictor as a short-time vector sequence and obtaining a prediction result, which is the coding parameter vector of one speech frame; 3) outputting the prediction result from the converter. Compared with the prior art, the method offers direct conversion, requires no intermediate text conversion, and is convenient to construct and train.

Description

Method for converting lip image characteristics into voice coding parameters
Technical Field
The invention relates to the technical fields of computer vision, digital image processing and microelectronics, and in particular to a method for converting lip image features into speech coding parameters.
Background
Lip reading (lip-language recognition) generates a corresponding text representation from lip video. The prior art includes the following technical schemes:
(1) CN107122646A, title of invention: a lip-language unlocking method. Lip features acquired in real time are compared with pre-stored lip features to verify identity; its output is limited to the lip features themselves.
(2) CN107437019A, title of invention: a lip-language-recognition identity authentication method and device. The principle is similar to (1), except that 3D images are used.
(3) CN106504751A, title of invention: an adaptive lip-language interaction method and interaction device. It still recognizes the lips as text and then performs command interaction based on that text, so the conversion chain is complex.
(4) LipNet, a deep-learning lip-reading algorithm released by the University of Oxford in collaboration with DeepMind, aims to recognize lip movements as text. Its recognition rate is higher than that of earlier work, but the conversion process remains indirect.
(5) CN107610703A, title of invention: a multi-language translator based on lip-language collection and voice pickup. It uses an existing recognition module to obtain text and then an existing speech synthesis module to convert the text into speech.
Disclosure of Invention
The present invention is directed to overcoming the above-mentioned drawbacks of the prior art and providing a method for converting lip image features into speech coding parameters.
The purpose of the invention can be realized by the following technical scheme:
a method for converting lip image characteristics into voice coding parameters comprises the following steps:
1) constructing a speech-coding-parameter converter comprising an input buffer and a trained predictor; the converter receives lip feature vectors in chronological order and stores them in its input buffer;
2) at regular time intervals, feeding the k most recently buffered lip feature vectors into the predictor as a short-time vector sequence and obtaining a prediction result, which is the coding parameter vector of one speech frame;
3) the converter outputs the prediction result.
The training method of the predictor specifically comprises the following steps:
21) synchronously acquiring video and speech: video and the corresponding speech data are captured synchronously by video and audio acquisition equipment; lip images, i.e. rectangular regions containing the whole mouth and centered on the mouth, are extracted from the video, giving a series of lip images I1, I2, ..., In that form the lip video; the speech data is a sequence of speech sample values S1, S2, ..., SM; the time correspondence between the lip images and the speech data is preserved;
22) obtaining the short-time lip-feature-vector sequence FISt for any time t: the image feature vector FI is computed for each lip image I in the lip video, giving a series of lip feature vectors FI1, FI2, ..., FIn; for any given time t, k consecutive lip feature vectors are taken as the short-time sequence FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), where FIt is the lip feature vector closest in time to t and k is a specified parameter;
23) obtaining the coding parameter vector FAt of the speech frame at any time t: for any time t, L consecutive speech sample values are taken as one speech frame At = (St-L+1, ..., St-2, St-1, St), where St is the speech sample closest in time to t; a speech analysis algorithm is applied to compute the coding parameters of this frame, i.e. the coding parameter vector FAt of the speech frame at time t, where L is a fixed parameter;
24) training the predictor with samples: for any given time t, the training sample pair {FISt, FAt} obtained according to steps 22) and 23) is used as the input and the expected output of the predictor; a number of t values are selected at random within the valid range, and the predictor is trained by the method appropriate to its type.
In step 22), the frame rate of the lip feature vectors is increased either by temporal interpolation, which doubles the frame rate, or by capturing with high-speed image acquisition equipment.
The predictor adopts an artificial neural network formed by 3 LSTM layers and 2 fully-connected layers connected in sequence.
In step 22), the lip feature vector is obtained as follows:
for each lip image, 20 feature points around the inner and outer edges of the lips are extracted; the center coordinates of the 20 points are computed and subtracted from the coordinates of each point, giving 40 coordinate values; these 40 values are normalized to yield the lip feature vector.
In step 23), the speech analysis algorithm is the LPC10e algorithm, and the coding parameter vector consists of the LPC parameters: 1 voiced/unvoiced flag for the first half-frame, 1 voiced/unvoiced flag for the second half-frame, 1 pitch period, 1 gain, and 10 reflection coefficients.
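For illustration only, the 14 LPC10e parameter values listed above could be held in a small structure like the following Python sketch; the class and field names are hypothetical and not part of the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LpcFrameParams:
    """Hypothetical container for the 14 LPC10e coding parameters of one speech frame."""
    voiced_first_half: float   # voiced/unvoiced flag, first half-frame
    voiced_second_half: float  # voiced/unvoiced flag, second half-frame
    pitch_period: float        # pitch period
    gain: float                # gain
    reflection: List[float] = field(default_factory=lambda: [0.0] * 10)  # 10 reflection coefficients

    def to_vector(self) -> List[float]:
        # Flatten to the 14-dimensional target vector FAt used to train the predictor.
        return [self.voiced_first_half, self.voiced_second_half,
                self.pitch_period, self.gain, *self.reflection]
```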
Compared with the prior art, the invention has the following features:
First, direct conversion: the invention uses machine learning to construct a dedicated converter that maps lip-image feature vectors directly to speech-frame coding parameter vectors. The predictor can be implemented with an artificial neural network, but is not limited to one.
Second, no text intermediate: the converter takes a sequence of lip-image feature vectors as input and outputs speech-frame coding parameter vectors, which can be synthesized directly into speech sample frames by speech synthesis techniques, without any intermediate 'text' step.
Third, easy construction and training: the invention also provides a training method for the designed predictor and a method for constructing the training samples.
Drawings
Fig. 1 is a diagram showing the composition and interface structure of the converter.
FIG. 2 is a flow chart of predictor training.
FIG. 3 is an artificial neural network structure of a predictor.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
The invention designs a converter that converts lip image features into speech coding parameters: it receives a sequence of lip-image feature vectors, converts it into a sequence of speech-frame coding parameter vectors, and outputs that sequence.
The converter mainly comprises an input buffer, a predictor and configuration parameters. Its core is the predictor, a machine learning model that can be trained with training samples. The trained predictor maps a short-time sequence of lip feature vectors to the corresponding speech coding parameter vector and outputs it.
As shown in fig. 1, the converter basically comprises an input buffer, a predictor and configuration parameters, together with input and output interfaces. The converter receives individual lip feature vectors and stores them in the input buffer. Every time interval Δt, the k most recently buffered lip feature vectors are fed into the predictor, which produces a prediction result that is emitted at the output port. The prediction result is the set of coding parameters of one speech frame. The configuration parameters mainly store the predictor's configuration.
The operation of the converter is described as follows:
(1) The converter receives a series of lip feature vectors FI1, FI2, ..., FIn and stores them in the input buffer. These lip feature vectors arrive sequentially in chronological order.
(2) Every time interval Δt, the converter takes the k most recently buffered lip feature vectors as a short-time vector sequence FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), feeds it into the predictor, and obtains a prediction result FAt. The prediction result is the coding parameter vector of one speech frame. Here Δt equals the duration of one speech frame, and k is a fixed parameter.
(3) Once the prediction result FAt is obtained, it is output through the output interface.
These steps run continuously in a loop, so that the lip-image feature vector sequence FI1, FI2, ..., FIn is converted into the speech-frame coding parameter vector sequence FA1, FA2, ..., FAm. Since the speech-frame rate is not necessarily equal to the video frame rate, the number n of input image feature vectors FI is not necessarily equal to the number m of output speech-frame parameter vectors FA.
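The conversion loop just described can be sketched as follows in Python. This is a minimal illustration, not the patented implementation: the class name, the deque-based input cache and the `predict` call on the trained predictor are all assumptions.

```python
from collections import deque
import numpy as np

class LipToSpeechConverter:
    """Illustrative converter: buffers lip feature vectors FI and, every interval
    delta_t, feeds the k most recent vectors to the trained predictor."""

    def __init__(self, predictor, k: int):
        self.predictor = predictor      # trained machine-learning model
        self.k = k                      # length of the short-time sequence FISt
        self.buffer = deque(maxlen=k)   # input cache holding the latest k vectors

    def push(self, fi: np.ndarray) -> None:
        # Receive one lip feature vector FI in chronological order.
        self.buffer.append(fi)

    def step(self):
        # Called once per speech-frame interval delta_t.
        if len(self.buffer) < self.k:
            return None                          # not enough history yet
        fis_t = np.stack(self.buffer)            # shape (k, LIP_DIM)
        # The predictor expects a batch dimension: (1, k, LIP_DIM) -> (1, LPC_DIM).
        fa_t = self.predictor.predict(fis_t[np.newaxis, ...])[0]
        return fa_t                              # coding parameter vector of one speech frame
```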
The converter described in this patent relies on a predictor implemented with a machine learning model that has data-prediction capability, such as, but not limited to, an artificial neural network. Before use it must be trained (i.e. the predictor must learn). A training method follows; its principle is shown in fig. 2. Lip-image feature vectors are extracted from the lip video, and speech coding parameter vectors are extracted from the corresponding speech. The short-time sequence of lip-image feature vectors FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt) serves as the training input, and the coding parameter vector FAt of the speech frame corresponding to FISt serves as the expected output, i.e. the label. This yields a large number of training sample/label pairs {FISt, FAt}, where t is any randomly chosen valid time, with which the predictor is trained.
Training the predictor comprises the following steps:
(1) Video and speech are captured synchronously. The video and the corresponding speech data are acquired synchronously by video and audio acquisition equipment. The video must include the lip region. The lip region, i.e. a rectangular area containing the whole mouth and centered on the mouth, is extracted from the video. The resulting lip video consists of a series of lip images I1, I2, ..., In. The speech data is represented as a sequence of speech samples S1, S2, ..., SM (capital M denotes the number of samples; the number of speech frames is denoted by lowercase m). The time correspondence between the images and the speech is preserved.
(2) Short-time lip-feature-vector sequence FISt at any time t. The image feature vector FI is computed for each lip image I in the lip video, giving a series of lip feature vectors FI1, FI2, ..., FIn. For any given time t, k consecutive lip feature vectors are taken as the short-time sequence FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), where FIt is the lip feature vector closest in time to t and k is a specified parameter. To raise the frame rate of the lip feature vectors, they can be interpolated in time to double the frame rate, or high-speed image acquisition equipment can be used directly; a minimal interpolation sketch is given below.
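As a sketch of the interpolation option mentioned above, linear midpoint interpolation between consecutive lip feature vectors doubles the frame rate; the patent does not prescribe a particular interpolation scheme, so this is only one plausible choice.

```python
import numpy as np

def double_frame_rate(features: np.ndarray) -> np.ndarray:
    """Insert the midpoint between consecutive lip feature vectors.

    features: array of shape (n, LIP_DIM) holding FI1 ... FIn.
    Returns an array of shape (2n - 1, LIP_DIM).
    """
    midpoints = 0.5 * (features[:-1] + features[1:])
    out = np.empty((2 * len(features) - 1, features.shape[1]), dtype=features.dtype)
    out[0::2] = features    # original frames at even indices
    out[1::2] = midpoints   # interpolated frames in between
    return out
```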
(3) Speech-frame coding parameter vector FAt at any time t. For any time t, L consecutive speech sample values are taken as one speech frame At = (St-L+1, ..., St-2, St-1, St), where St is the speech sample closest in time to t. A speech analysis algorithm is applied to compute the coding parameters of this frame, i.e. the coding parameter vector FAt of the speech frame at time t. Here L is a fixed parameter.
(4) The predictor is trained with the samples. For any time t, a training sample pair {FISt, FAt} is obtained according to (2) and (3), where FISt is the input to the predictor and FAt is its expected output, i.e. the label. A large number of samples can be obtained by randomly selecting many values of t within the valid range. With these samples, the predictor is trained by the method appropriate to its type.
(5) The trained predictor is then used as a component of the lip-to-speech converter.
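A sketch of how one training pair {FISt, FAt} could be assembled from the synchronized data of steps (1)-(3); the helper name, the timestamp arrays and the external `lpc_analysis` function are assumptions, and t must lie far enough inside the recording that both slices are complete.

```python
import numpy as np

def build_training_pair(t, feat_times, features, sample_times, samples,
                        k, L, lpc_analysis):
    """Return (FIS_t, FA_t) for a chosen valid time t.

    feat_times:   (n,) timestamps of the lip feature vectors FI1..FIn
    features:     (n, LIP_DIM) lip feature vectors
    sample_times: (M,) timestamps of the speech samples S1..SM
    samples:      (M,) speech sample values
    lpc_analysis: function mapping a speech frame (L samples) to its
                  coding parameter vector (e.g. 14 LPC10e parameters)
    """
    # Index of the lip feature vector closest in time to t (FIt).
    i = int(np.argmin(np.abs(feat_times - t)))
    # Index of the speech sample closest in time to t (St).
    j = int(np.argmin(np.abs(sample_times - t)))
    assert i >= k - 1 and j >= L - 1, "t must lie in the valid range"
    fis_t = features[i - k + 1:i + 1]   # k consecutive vectors ending at FIt
    a_t = samples[j - L + 1:j + 1]      # speech frame At of L samples
    fa_t = lpc_analysis(a_t)            # coding parameter vector FAt (the label)
    return fis_t, fa_t
```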
Example 1:
The following is a specific implementation; the methods and principles of the invention are not limited to the specific numbers given here.
(1) The predictor can be implemented with an artificial neural network; other machine learning techniques could also be used. In the following, the predictor is an artificial neural network, i.e. the predictor is equivalent to an artificial neural network.
In this embodiment, the neural network consists of 3 LSTM layers followed by 2 fully-connected (Dense) layers, connected in sequence, as shown in fig. 3. Dropout is applied between layers and inside the recurrent connections of the LSTM layers; for clarity these dropout layers are not shown in the figure.
Each of the three LSTM layers has 80 neurons, and the first two LSTM layers operate in 'return_sequences' mode. The two Dense layers have 100 and 14 neurons, respectively.
The first LSTM layer receives the lip feature sequence as a 3-dimensional array of shape (BATCHES, STEPS, LIP_DIM). The last fully-connected layer is the output layer of the network, and its output is a 2-dimensional array of shape (BATCHES, LPC_DIM). In these shapes, BATCHES is the number of samples fed into the network at a time (the batch size); it is usually greater than 1 during training and equal to 1 during application. (STEPS, LIP_DIM) is the shape of one input sample: STEPS is the length (number of steps) of the short-time lip feature sequence, i.e. the k in FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), so STEPS = k; LIP_DIM is the dimension of one lip feature vector FI, and for a lip feature vector consisting of 40 coordinate values, LIP_DIM = 40. In the output shape, LPC_DIM is the dimension of one speech coding parameter vector; for LPC10e, LPC_DIM = 14.
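The described network could be sketched in Keras as follows. The framework choice, the dropout rate and the Dense activation are assumptions; the patent only fixes the layer counts, the 80/100/14 neuron counts, the return-sequences mode of the first two LSTM layers, and the use of dropout.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

STEPS = 25      # k, length of the short-time lip feature sequence (example value)
LIP_DIM = 40    # dimension of one lip feature vector (20 points x 2 coordinates)
LPC_DIM = 14    # dimension of one speech coding parameter vector (LPC10e)

def build_predictor(dropout_rate: float = 0.2) -> Sequential:
    # dropout_rate is an assumed value; the patent does not specify it.
    model = Sequential([
        LSTM(80, return_sequences=True, recurrent_dropout=dropout_rate,
             input_shape=(STEPS, LIP_DIM)),
        Dropout(dropout_rate),
        LSTM(80, return_sequences=True, recurrent_dropout=dropout_rate),
        Dropout(dropout_rate),
        LSTM(80, recurrent_dropout=dropout_rate),
        Dropout(dropout_rate),
        Dense(100, activation='relu'),   # activation is an assumption
        Dense(LPC_DIM),                  # linear output: one coding parameter vector
    ])
    return model
```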
The number of neurons and the number of layers can be adjusted to the application scenario; for applications with a large vocabulary, both can be set larger.
(2) Determining the value of k. The value of k depends on the application scenario. For a simple scenario, recognizing Chinese characters one by one may suffice; since the pronunciation of one character lasts about 0.5 seconds, at a video rate of 50 frames/second k is the number of video frames in 0.5 seconds, i.e. k = 50 × 0.5 = 25. For scenarios with a larger vocabulary, whole words or even phrases must be recognized as a unit, and k increases accordingly. For example, in the Chinese words for 'size' and 'truck', the mouth shapes of their first characters are similar and individual characters are hard to distinguish, so the two-character words must be recognized as a whole and k should be at least about 2 × 25 = 50.
(3) Computing the lip feature vectors. For each image frame, 20 feature points around the inner and outer edges of the lips are extracted to describe the current lip shape. The center coordinates of the 20 points are computed and subtracted from the coordinates of each point. Each point has two coordinate values, x and y, so the 20 points give 40 values in total. These 40 coordinate values are normalized to obtain the lip feature vector FI. From consecutive video images a series of lip feature vectors FI1, FI2, ..., FIn is obtained. Since the image frame rate is usually not high, the lip feature vectors can be interpolated to increase the frame rate. k consecutive lip feature vectors form the short-time sequence FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), which serves as the input sample of the predictor, where FIt is the vector closest in time to t.
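A sketch of this feature computation, assuming the 20 lip landmarks are already available as (x, y) coordinates; the normalization by the maximum absolute coordinate is one plausible choice, since the patent does not specify the normalization scheme.

```python
import numpy as np

def lip_feature_vector(points: np.ndarray) -> np.ndarray:
    """points: (20, 2) array of lip landmark coordinates (x, y).
    Returns the 40-dimensional lip feature vector FI."""
    center = points.mean(axis=0)    # center coordinates of the 20 points
    centered = points - center      # subtract the center from each point
    flat = centered.reshape(-1)     # 40 coordinate values
    scale = np.max(np.abs(flat))    # normalization: one plausible choice
    return flat / scale if scale > 0 else flat
```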
(4) Computing the coding parameter vector of a speech frame. The speech frame at time t is At = (St-L+1, ..., St-2, St-1, St), where St is the speech sample closest in time to t and t is any valid time. Here the speech may be sampled at 8000 Hz with L set to 180, i.e. every 180 samples form one audio frame, covering 22.5 ms. The speech coding may use the LPC10e algorithm. Analyzing a speech frame At with this algorithm yields the coding parameter vector FAt of that frame, i.e. 14 LPC parameter values: 1 voiced/unvoiced flag for the first half-frame, 1 voiced/unvoiced flag for the second half-frame, 1 pitch period, 1 gain, and 10 reflection coefficients. In this way the coding parameter vector FAt can be computed for any valid time; different speech frames may overlap. The speech-frame coding parameter vector defines the expected output format during training and the actual output format of the predictor during application.
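A sketch of the frame slicing at 8000 Hz with L = 180. A complete LPC10e analysis (voicing decisions, pitch, gain) is outside the scope of this example, so the Levinson-Durbin routine below is only a simplified stand-in that derives the 10 reflection coefficients from the frame's autocorrelation.

```python
import numpy as np

FRAME_LEN = 180     # L: samples per frame at 8000 Hz (22.5 ms)
LPC_ORDER = 10      # number of reflection coefficients in LPC10e

def frames(samples: np.ndarray, frame_len: int = FRAME_LEN):
    """Yield consecutive speech frames At of L samples (non-overlapping here,
    although the patent notes that frames may overlap)."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]

def reflection_coefficients(frame: np.ndarray, order: int = LPC_ORDER) -> np.ndarray:
    """Levinson-Durbin recursion on the frame's autocorrelation.
    Simplified stand-in for the reflection-coefficient part of LPC10e."""
    r = np.array([np.dot(frame[:len(frame) - i], frame[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0] if r[0] > 0 else 1e-9
    k = np.zeros(order)
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = -acc / e
        a[1:i + 1] = a[1:i + 1] + k[i - 1] * a[i - 1::-1]   # update prediction coefficients
        e *= (1.0 - k[i - 1] ** 2)                          # residual energy
    return k   # the 10 reflection coefficients
```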
(5) Training the predictor. The short-time lip-feature sequence FISt is used as the input sample and the coding parameter vector FAt of the speech frame at the corresponding time as its label (i.e. the prediction target), forming a sample pair {FISt, FAt}. Since t can take any value within the valid time range, a large number of training samples can be obtained for training the predictor. During training, the prediction error is measured by the mean squared error (MSE), and the network weights are adjusted step by step by error backpropagation. This eventually yields a usable predictor.
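Continuing the earlier Keras sketch, training with mean-squared-error loss and backpropagation might look as follows; the optimizer, batch size, epoch count and the random placeholder data are assumptions, and in practice X and Y come from the sample construction step (X stacks the FISt sequences, Y stacks the FAt labels).

```python
import numpy as np

# Illustrative placeholder data standing in for the real {FISt, FAt} pairs.
N = 1000
X = np.random.rand(N, STEPS, LIP_DIM).astype('float32')
Y = np.random.rand(N, LPC_DIM).astype('float32')

model = build_predictor()
model.compile(optimizer='adam', loss='mse')        # MSE prediction error; 'adam' is an assumption
model.fit(X, Y, batch_size=32, epochs=10,          # hyperparameters are assumptions
          validation_split=0.1)
```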
(6) After training is completed, the predictor is used as a module inside the converter. The structure description data and weight data of the predictor, together with other parameters, are stored in the 'configuration parameters'; when the converter starts, these configuration parameters are read and the predictor is reconstructed from them.
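One way to realize the 'configuration parameters' with the earlier Keras sketch is to serialize the network structure and weights and reload them at start-up; the file names are illustrative and the exact serialization calls depend on the Keras version in use.

```python
from tensorflow.keras.models import model_from_json

# After training: store the structure description and the weight data.
with open('predictor_config.json', 'w') as f:
    f.write(model.to_json())                    # network structure description
model.save_weights('predictor.weights.h5')      # weight data

# At converter start-up: read the configuration parameters and rebuild the predictor.
with open('predictor_config.json') as f:
    restored = model_from_json(f.read())
restored.load_weights('predictor.weights.h5')
```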
(7) The methods described herein may be implemented in software or may be implemented partially or wholly in hardware.

Claims (2)

1. A method for converting lip image features into speech coding parameters, comprising the steps of:
1) constructing a speech coding parameter converter, which comprises an input buffer and a trained predictor, receives lip feature vectors sequentially in chronological order, and stores them in the input buffer of the converter;
2) at regular time intervals, sending the k most recent lip feature vectors in the buffer into the predictor as a short-time vector sequence and obtaining a prediction result, the prediction result being the coding parameter vector of one speech frame, wherein the predictor adopts an artificial neural network formed by 3 LSTM layers and 2 fully-connected layers connected in sequence, and the training method of the predictor specifically comprises the following steps:
21) synchronously acquiring video and speech: acquiring video and the corresponding speech data synchronously through video and audio acquisition equipment, and extracting lip images from the video, each lip image being a rectangular region containing the whole mouth and centered on the mouth, to obtain a series of lip images I1, I2, ..., In forming the lip video, the speech data being a sequence of speech sample values S1, S2, ..., SM, while keeping the time correspondence between the lip images and the speech data;
22) obtaining the short-time lip-feature-vector sequence FISt for any time t: computing the image feature vector FI of each lip image I in the lip video to obtain a series of lip feature vectors FI1, FI2, ..., FIn, and for any given time t taking k consecutive lip feature vectors as the short-time sequence FISt = (FIt-k+1, ..., FIt-2, FIt-1, FIt), where FIt is the lip feature vector closest in time to t and k is a specified parameter, which specifically comprises the following steps:
for each lip image, extracting 20 feature points around the inner and outer edges of the lips, obtaining the center coordinates of the 20 feature points, subtracting the center coordinates from the coordinates of each point to obtain 40 coordinate values, and normalizing the 40 coordinate values to finally obtain the lip feature vector;
23) obtaining the coding parameter vector FAt of the speech frame at any time t: for any time t, extracting L consecutive speech sample values as one speech frame At = (St-L+1, ..., St-2, St-1, St), where St is the speech sample closest in time to t, and computing the coding parameters of the speech frame with a speech analysis algorithm, i.e. the coding parameter vector FAt of the speech frame at time t, where L is a fixed parameter;
24) training the predictor with samples: for any given time t, using the training sample pair {FISt, FAt} obtained according to steps 22) and 23) as the input and the expected output of the predictor, randomly selecting a plurality of t values within the valid range, and training the predictor according to its type, wherein the speech analysis algorithm is the LPC10e algorithm and the coding parameter vector consists of the LPC parameters, comprising 1 voiced/unvoiced flag for the first half-frame, 1 voiced/unvoiced flag for the second half-frame, 1 pitch period, 1 gain and 10 reflection coefficients;
3) the speech coding parameter converter outputs a prediction result.
2. The method as claimed in claim 1, wherein in step 22), the lip feature vector is temporally interpolated to double its frame rate, or the frame rate of the lip feature vector is increased by using a high-speed image capturing device.
CN201810215220.4A 2018-03-15 2018-03-15 Method for converting lip image characteristics into voice coding parameters Active CN108538283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810215220.4A CN108538283B (en) 2018-03-15 2018-03-15 Method for converting lip image characteristics into voice coding parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810215220.4A CN108538283B (en) 2018-03-15 2018-03-15 Method for converting lip image characteristics into voice coding parameters

Publications (2)

Publication Number Publication Date
CN108538283A CN108538283A (en) 2018-09-14
CN108538283B true CN108538283B (en) 2020-06-26

Family

ID=63484002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810215220.4A Active CN108538283B (en) 2018-03-15 2018-03-15 Method for converting lip image characteristics into voice coding parameters

Country Status (1)

Country Link
CN (1) CN108538283B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10891969B2 (en) * 2018-10-19 2021-01-12 Microsoft Technology Licensing, Llc Transforming audio content into images
CN111023470A (en) * 2019-12-06 2020-04-17 厦门快商通科技股份有限公司 Air conditioner temperature adjusting method, medium, equipment and device
CN111508509A (en) * 2020-04-02 2020-08-07 广东九联科技股份有限公司 Sound quality processing system and method based on deep learning
CN113869212B (en) * 2021-09-28 2024-06-21 平安科技(深圳)有限公司 Multi-mode living body detection method, device, computer equipment and storage medium
CN116013354B (en) * 2023-03-24 2023-06-09 北京百度网讯科技有限公司 Training method of deep learning model and method for controlling mouth shape change of virtual image

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104217218A (en) * 2014-09-11 2014-12-17 广州市香港科大霍英东研究院 Lip language recognition method and system
CN105321519A (en) * 2014-07-28 2016-02-10 刘璟锋 Speech recognition system and unit
CN105632497A (en) * 2016-01-06 2016-06-01 昆山龙腾光电有限公司 Voice output method, voice output system
CN107799125A (en) * 2017-11-09 2018-03-13 维沃移动通信有限公司 A kind of audio recognition method, mobile terminal and computer-readable recording medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7133535B2 (en) * 2002-12-21 2006-11-07 Microsoft Corp. System and method for real time lip synchronization

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105321519A (en) * 2014-07-28 2016-02-10 刘璟锋 Speech recognition system and unit
CN104217218A (en) * 2014-09-11 2014-12-17 广州市香港科大霍英东研究院 Lip language recognition method and system
CN105632497A (en) * 2016-01-06 2016-06-01 昆山龙腾光电有限公司 Voice output method, voice output system
CN107799125A (en) * 2017-11-09 2018-03-13 维沃移动通信有限公司 A kind of audio recognition method, mobile terminal and computer-readable recording medium

Also Published As

Publication number Publication date
CN108538283A (en) 2018-09-14

Similar Documents

Publication Publication Date Title
CN108538283B (en) Method for converting lip image characteristics into voice coding parameters
CN110119703B (en) Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene
CN111243626B (en) Method and system for generating speaking video
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN111488807B (en) Video description generation system based on graph rolling network
CN108648745B (en) Method for converting lip image sequence into voice coding parameter
CN111914076B (en) User image construction method, system, terminal and storage medium based on man-machine conversation
CN110866510A (en) Video description system and method based on key frame detection
CN114245215B (en) Method, device, electronic equipment, medium and product for generating speaking video
CN113961736B (en) Method, apparatus, computer device and storage medium for text generation image
CN115237255B (en) Natural image co-pointing target positioning system and method based on eye movement and voice
CN114882873B (en) Speech recognition model training method and device and readable storage medium
CN114974215A (en) Audio and video dual-mode-based voice recognition method and system
CN111259785A (en) Lip language identification method based on time offset residual error network
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN108959512B (en) Image description network and technology based on attribute enhanced attention model
CN114694255A (en) Sentence-level lip language identification method based on channel attention and time convolution network
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
CN115713535B (en) Image segmentation model determination method and image segmentation method
CN113450824B (en) Voice lip reading method and system based on multi-scale video feature fusion
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium
CN108538282B (en) Method for directly generating voice from lip video
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN115019137A (en) Method and device for predicting multi-scale double-flow attention video language event
CN111340329B (en) Actor evaluation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant