CN113223123A - Image processing method and image processing apparatus - Google Patents

Image processing method and image processing apparatus

Info

Publication number
CN113223123A
CN113223123A
Authority
CN
China
Prior art keywords
image
audio
lip
target
image sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110560268.0A
Other languages
Chinese (zh)
Inventor
赵明瑶
闫嵩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd filed Critical Beijing Dami Technology Co Ltd
Priority to CN202110560268.0A
Publication of CN113223123A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the invention discloses an image processing method and an image processing apparatus. After a target audio is obtained, the phoneme label corresponding to each audio segment in the target audio is determined, the lip image sequence corresponding to each audio segment is then determined according to its phoneme label, and each lip image sequence is processed according to the length of its audio segment, so that each audio segment and the corresponding lip image sequence have the same length. The method of the embodiment of the invention can automatically determine the phonemes corresponding to the audio and adjust the duration of the corresponding lip image sequence according to the duration of each phoneme, thereby effectively enhancing the matching degree between the speech and the lip shape.

Description

Image processing method and image processing apparatus
Technical Field
The present invention relates to a data processing method, and more particularly, to an image processing method and an image processing apparatus.
Background
With the increasing popularity of the Internet and computer technology, online teaching activities, particularly language-related online teaching activities, are becoming more and more frequent. Language learning is vital to learners, and word pronunciation is the foundation of language learning, so learning the pronunciation of words is an essential part of online language teaching. Online teaching of word pronunciation shows the learner, in a visual manner, how the lip shape changes while a word is pronounced; however, this manner easily causes the problem that the speech does not match the lip shape.
Disclosure of Invention
In view of the above, an object of the embodiments of the present invention is to provide an image processing method and an image processing apparatus for adjusting the duration of a corresponding lip image sequence according to the duration of each phoneme in audio to enhance the matching degree between speech and lips.
According to a first aspect of embodiments of the present invention, there is provided an image processing method, the method including:
acquiring a target audio;
determining a phoneme label corresponding to each audio segment in the target audio;
determining a lip image sequence corresponding to each audio segment according to each phoneme label;
and processing each lip image sequence so that each audio segment and the corresponding lip image sequence keep the same length.
According to a second aspect of the embodiments of the present invention, there is provided an image processing apparatus including:
an audio acquisition unit for acquiring a target audio;
a phoneme label determining unit, configured to perform speech recognition on the target audio and determine a phoneme label corresponding to each audio segment in the target audio;
a lip sequence determining unit, configured to determine a lip image sequence corresponding to each audio segment according to each phoneme label;
and an image sequence processing unit, configured to process each lip image sequence so that each audio segment and the corresponding lip image sequence keep the same length.
According to a third aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of any of the first aspects.
According to a fourth aspect of embodiments of the present invention, there is provided an electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method according to any one of the first aspect.
After the target audio is obtained, the phoneme label corresponding to each audio segment in the target audio is determined, the lip image sequence corresponding to each audio segment is then determined according to its phoneme label, and each lip image sequence is processed according to the length of its audio segment, so that each audio segment and the corresponding lip image sequence have the same length. The method of the embodiment of the invention can automatically determine the phonemes corresponding to the audio and adjust the duration of the corresponding lip image sequence according to the duration of each phoneme, thereby effectively enhancing the matching degree between the speech and the lip shape.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a schematic interface diagram of an online teaching activity according to an embodiment of the present invention;
FIG. 2 is a flowchart of an image processing method according to a first embodiment of the present invention;
FIG. 3 is a flow chart of the generation of a sequence of lip images in an alternative implementation of the first embodiment of the invention;
FIG. 4 is a schematic diagram of obtaining a target image from a first image and a second image in the first embodiment of the present invention;
FIG. 5 is a schematic diagram of acquiring a target image in an alternative implementation of the first embodiment of the invention;
fig. 6 is a schematic diagram of an image processing apparatus of a second embodiment of the present invention;
fig. 7 is a schematic view of an electronic device according to a third embodiment of the present invention.
Detailed Description
The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the traditional process of learning word pronunciation, a learner usually imitates the teacher's lip-shape changes and pronunciation to achieve a more standard pronunciation, so online teaching of word pronunciation, as a way of learning word pronunciation, also needs to teach pronunciation in a visual manner. FIG. 1 is a schematic interface diagram of an online teaching activity according to an embodiment of the present invention. As shown in fig. 1, when a learner learns word pronunciation, the interface may display a window 11 for showing the lip shape (i.e., for playing a lip image sequence), a window 12 for showing the avatar of the teacher or a captured facial image of the teacher, a window 13 for showing the avatar of the learner or a captured facial image of the learner, and a window 14 for showing the word being learned by the learner (e.g., "apple"). The lip image sequence, which is generally used to show the lip variation during word pronunciation, is usually not recorded at the same time as the audio of the word pronunciation, so when the audio is played through the window 11 simultaneously with the lip image sequence, the problem of mismatch between speech and lip shape easily arises.
Fig. 2 is a flowchart of an image processing method according to a first embodiment of the present invention. As shown in fig. 2, the method of the present embodiment includes the following steps:
in step S201, a target audio is acquired.
In a possible case, the audio of the teacher reading each word aloud in historical online teaching activities may be recorded in real time at least once through an audio recording device, or the audio of the teacher reading each word may be recorded in any other situation, or the text of a predetermined word may be converted into audio based on Text-To-Speech (TTS) technology, and each word and its corresponding audio are stored in a database as key-value pairs. In subsequent online teaching activities, when a learner is learning any word, the server may obtain the audio of that word from the database as the target audio.
In another possible case, the audio recording device may record the audio of the teacher reading words in the on-line teaching activities in real time, and the recorded audio is used as the target audio.
In this embodiment, the word corresponding to the target audio may be a word in any language, such as a Chinese word, an English word, a German word, a Japanese word, a French word, and the like.
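
As an illustration of the database case above, the following is a minimal Python sketch, not taken from the patent, of storing each word and its recorded or TTS-generated audio as a key-value pair and later fetching it as the target audio; the table name and helper functions are assumptions introduced for the example.

    import sqlite3
    from typing import Optional

    def store_word_audio(db_path: str, word: str, audio_bytes: bytes) -> None:
        # Store a word (key) and its pronunciation audio (value) as a key-value pair.
        with sqlite3.connect(db_path) as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS word_audio (word TEXT PRIMARY KEY, audio BLOB)")
            conn.execute("INSERT OR REPLACE INTO word_audio VALUES (?, ?)", (word, audio_bytes))

    def get_target_audio(db_path: str, word: str) -> Optional[bytes]:
        # Retrieve the stored audio of a word to be used as the target audio.
        with sqlite3.connect(db_path) as conn:
            row = conn.execute("SELECT audio FROM word_audio WHERE word = ?", (word,)).fetchone()
        return row[0] if row else None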
Step S202, determining phoneme labels corresponding to the audio fragments in the target audio.
A phoneme is the smallest speech unit divided according to the natural attributes of speech; analyzed in terms of the articulatory actions within a syllable, each articulatory action forms one phoneme. Taking English phonemes as an example, the English international phonetic alphabet has 48 phonemes, including 20 vowel phonemes and 28 consonant phonemes; the consonant phonemes include /p/, /b/, /ts/, etc.
An audio segment in the target audio is a segment intercepted from the target audio based on a window of predetermined length. Optionally, the server may slide such a window over the target audio to intercept it into a plurality of audio segments, where the window length is typically greater than the sliding step of the window. For example, if the target audio is 1 second long, the window length is 20 milliseconds, and the sliding step is 10 milliseconds, the server may intercept the target audio into 99 audio segments of 0-20 milliseconds, 10-30 milliseconds, ..., 980-1000 milliseconds. Sliding interception improves the continuity of the speech variation across audio segments, which subsequently improves the accuracy of speech recognition.
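
For concreteness, the following minimal sketch, added for illustration rather than taken from the patent, implements the sliding-window interception described above; the sample rate and the sample array layout are assumptions.

    import numpy as np

    def slide_segments(samples, sample_rate, window_ms=20, hop_ms=10):
        # Intercept the audio into overlapping segments of window_ms, advanced by hop_ms.
        win = int(sample_rate * window_ms / 1000)
        hop = int(sample_rate * hop_ms / 1000)
        return [samples[start:start + win]
                for start in range(0, len(samples) - win + 1, hop)]

    # Example: 1 second of audio at 16 kHz yields 99 overlapping 20 ms segments.
    audio = np.zeros(16000, dtype=np.float32)
    print(len(slide_segments(audio, 16000)))  # 99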
In an optional implementation manner, if the target audio is the audio acquired from the database according to the word identifier of the word, the phoneme labels corresponding to the audio segments in the target audio are known, and the server may directly determine the phoneme label corresponding to each audio segment.
In another alternative implementation, if the target audio is the audio of a word read aloud by the teacher and recorded in real time during an online teaching activity, the server may perform speech recognition on the target audio, for example with an HMM-based speech recognition system (such as the one described in the 2016 Jilin University thesis on an HMM-based speech recognition system), to determine the phoneme label corresponding to each audio segment.
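
The sketch below illustrates one simple way, assumed here for illustration, of assigning a phoneme label to each sliding-window segment when the phoneme alignment of the word is already known (as in the database case above): each segment takes the label of the phoneme whose time span contains the segment's midpoint. For real-time audio, the speech recognition system mentioned above would supply the alignment instead.

    def label_segments(alignment, num_segments, hop_ms=10, window_ms=20):
        # alignment: list of (phoneme, start_ms, end_ms) triples covering the whole audio.
        labels = []
        for i in range(num_segments):
            mid = i * hop_ms + window_ms / 2  # midpoint of segment i, in milliseconds
            label = next((p for p, s, e in alignment if s <= mid < e), alignment[-1][0])
            labels.append(label)
        return labels

    # Illustrative alignment for a 1-second word (phoneme names and durations are assumptions).
    alignment = [("ae", 0, 400), ("p", 400, 650), ("l", 650, 1000)]
    print(label_segments(alignment, num_segments=99)[:3])  # ['ae', 'ae', 'ae']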
Step S203 determines a lip image sequence corresponding to each audio clip according to each phoneme label.
In order to increase the learner's interest in learning, the online teaching activity may be conducted with the teacher's image replaced, that is, the teacher's image seen by the learner may be a favorite cartoon character or another real-person character. Therefore, in the present embodiment, the lip image sequence corresponding to each audio segment may be the lip image sequence of a target character obtained by combining the teacher's image with a predetermined character.
Fig. 3 is a flow chart of the generation of a sequence of lip images in an alternative implementation of the first embodiment of the invention. As shown in fig. 3, in an alternative implementation, the lip image sequence of the present embodiment is generated by the following steps:
in step S301, a first image sequence is acquired.
In a possible case, the image sequence of the teacher reading each word aloud in historical online teaching activities may be collected in advance in real time through an image collection device, or the image sequence of the teacher reading each word in any other situation may be collected, and the collected image sequence is intercepted according to the lips corresponding to different phonemes, so as to obtain the first image sequence corresponding to each phoneme. In subsequent online teaching activities, when a learner is learning any word, the server may obtain from the database the first image sequence of each phoneme corresponding to that word.
In another possible case, an image sequence of words read aloud by a teacher in an on-line teaching activity may be acquired in real time through an image acquisition device, and the acquired image sequence is intercepted according to lips corresponding to different phonemes to obtain a first image sequence corresponding to each phoneme.
In this embodiment, the first image sequence includes a plurality of first images, and each first image includes an original character. The original character represents the image of the teacher and includes at least the facial image of the teacher.
In step S302, a second image is determined.
In this embodiment, the second image includes a predetermined character. The predetermined character represents an image other than the teacher's, such as a cartoon character or an authorized real-person character, and may be selected by the learner. The terminal may send a selection request for the predetermined character to the server in response to the learner's selection operation, so that the server can determine the second image including the predetermined character after receiving the request.
Step S303, a target image sequence is obtained based on a pre-trained image processing model according to each first image and each second image.
In this embodiment, the second image is used as the source image of the image processing model, and each first image is used as a driving image of the image processing model. The server may combine each first image with the second image into an image pair and input the image pairs into the image processing model in the order of the first images, so as to obtain a target image sequence composed of a plurality of target images. Each target image includes a target character, and the target character is the predetermined character having the posture of the original character. Optionally, the posture may include at least one of the original character's actions, facial expressions, and the like. It is easy to understand that the number of target images in the target image sequence is the same as the number of first images in the first image sequence.
FIG. 4 is a schematic diagram of obtaining a target image from a first image and a second image. As shown in fig. 4, the image P1 is the driving image, i.e., the first image, and the character in the image P1 is the original character, whose facial expression is expression 41. The image P2 is the source image, i.e., the second image, and the character in the image P2 is the predetermined character, whose facial expression is expression 42. After the image P1 and the image P2 are combined into an image pair and input into the image processing model, the image P3, i.e., the target image, is obtained. The character in the image P3 is the target character: its facial expression is expression 43, which is the facial expression of the original character, while its facial features are those of the predetermined character.
In this embodiment, the image processing model belongs to the class of video synthesis models, and may specifically be a First Order Motion Model for Image Animation (FOMM). The FOMM mainly comprises two modules, a motion prediction module and an image generation module. The motion prediction module is configured to predict a dense motion field from a frame D of the driving video (i.e., a driving image) to the source image S; the dense motion field maps each pixel in D to S and may be denoted by the function T_{S←D}: R^2 → R^2, i.e., the backward optical flow. The motion prediction module includes two sub-modules, a keypoint detector and a dense motion network.
Optionally, the server may train the first order motion model for image animation following the method described in "First Order Motion Model for Image Animation", Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, Nicu Sebe, Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI), 2020. Specifically, in the training samples used to train the first order motion model, the characters included in the training images are objects of the same category as the original character and the predetermined character. For example, if the images are used to implement motion transfer, the characters in the training images are all performing the same or different motions; if the images are used to implement expression transfer, the characters in the training images all have the same or different expressions.
Fig. 5 is a schematic diagram of acquiring a target image in an alternative implementation of the first embodiment of the invention. As shown in fig. 5, the character 51 is the predetermined character and the character 52 is the original character. The server composes the source image (i.e., the second image) including the character 51 and the driving image (i.e., the first image) including the character 52 into an image pair; after processing by the first order motion model for image animation, a target image including a character 53 is obtained. The character 53 is the target character, namely the character 51 having the posture of the character 52.
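
As an illustration, the sketch below runs a pre-trained first order motion model over such image pairs. It assumes the interface of the publicly available first-order-model reference implementation (its load_checkpoints and make_animation helpers); the configuration, checkpoint, and media file names are placeholders rather than values given in the patent.

    import imageio
    from skimage.transform import resize
    from demo import load_checkpoints, make_animation  # demo module of the first-order-model repository

    generator, kp_detector = load_checkpoints(config_path="config/vox-256.yaml",
                                              checkpoint_path="vox-cpk.pth.tar")

    # Source image: the second image containing the predetermined character.
    source = resize(imageio.imread("predetermined_character.png"), (256, 256))[..., :3]
    # Driving frames: the first image sequence containing the original character.
    driving = [resize(frame, (256, 256))[..., :3]
               for frame in imageio.mimread("first_image_sequence.mp4", memtest=False)]

    # Each driving frame is paired with the source image in order; the output is the
    # target image sequence: the predetermined character with the original character's posture.
    target_image_sequence = make_animation(source, driving, generator, kp_detector, relative=True)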
In step S304, each target image is cropped to obtain the lip image sequence of the target character.
After obtaining the target image sequence, the server may perform key point detection on each target image in the sequence, determine the positions (e.g., coordinates) of the lip key points of the target character, and then crop each target image according to the positions of the lip key points, thereby obtaining the lip image sequence of the target character, that is, the lip image sequence corresponding to each phoneme.
In an alternative implementation, the server may use Dlib to determine the positions of the lip key points of the target character through key point detection. Dlib is an open-source C++ toolkit that contains machine learning algorithms. In Dlib, the facial features and contour are identified by 68 key points, of which key points 49-68 are the lip key points.
After determining the positions of the lip key points of the target character, the server may crop the target image with a capture box of preset size. It is easy to understand that the width of the capture box is generally equal to or greater than the lip width of the target character, and the height of the capture box is generally equal to or greater than the lip height of the target character.
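
A minimal sketch of this key point detection and cropping step, using Dlib's 68-point shape predictor, is given below; the predictor file path and the padding around the capture box are illustrative assumptions, and the input is assumed to be an 8-bit BGR image as read by OpenCV.

    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def crop_lips(image, pad=10):
        # Return the lip region of one target image, or None if no face is detected.
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            return None
        landmarks = predictor(gray, faces[0])
        xs = [landmarks.part(i).x for i in range(48, 68)]  # lip key points 49-68 (0-based indices 48-67)
        ys = [landmarks.part(i).y for i in range(48, 68)]
        # The capture box is at least as wide and high as the lips, enlarged by a small padding.
        x0, x1 = max(min(xs) - pad, 0), max(xs) + pad
        y0, y1 = max(min(ys) - pad, 0), max(ys) + pad
        return image[y0:y1, x0:x1]

    # Usage: lip_image_sequence = [crop_lips(img) for img in target_image_sequence]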
It is easy to understand that, in this embodiment, if the first image sequence is an image sequence of the teacher reading words collected in at least one historical online teaching activity, steps S301 to S304 may be performed before step S201, and the server may store the lip image sequence corresponding to each phoneme in the database in the form of key-value pairs; if the first image sequence is an image sequence of the teacher reading words collected during the current online teaching activity, steps S301 to S304 may be executed simultaneously with step S202, or sequentially with step S202.
Step S204, each lip image sequence is processed so that each audio clip and the corresponding lip image sequence keep the same length.
The audio clip is obtained by cutting the target audio according to a window with a preset length, so that the server can determine the length of the audio clip, namely the first length. Meanwhile, the server may also determine the length of each lip image sequence, that is, the second length, according to the acquisition frequency of the first image sequence corresponding to each phoneme. Thus, in this step, the server may compare each first length with the corresponding second length. Specifically, for each audio clip, if the first length is greater than the second length, the server may perform frame interpolation on the corresponding lip image sequence, so that the first length is equal to the second length; if the first length is smaller than the second length, the server may perform frame extraction processing on the corresponding lip image sequence, so that the first length is equal to the second length.
The lip images corresponding to the same phoneme generally change little, so optionally, when the server performs frame interpolation processing on the lip image sequence corresponding to any phoneme, the server may select at least one lip image from the lip image sequence and insert it at a middle position, an end position, or the like of the lip image sequence. Alternatively, a new image may be generated between two adjacent lip images in the lip image sequence in other ways, for example, by using the Super SloMo algorithm, so as to achieve the purpose of frame interpolation on the lip image sequence.
Specifically, the server may determine the number of lip images to insert into the lip image sequence according to the product of the difference between the first length and the second length and the acquisition frequency. For example, if the audio segment corresponding to the phoneme /p/ has a length of 1.1 seconds, the lip image sequence has a length of 1 second, and the acquisition frequency is 30 frames per second (i.e., 30 lip images per second), the server may determine that the number of lip images to insert into the lip image sequence is (1.1 - 1) × 30 = 3.
Similarly, when the server performs frame extraction processing on the lip image sequence corresponding to any phoneme, the server may select at least one lip image from the lip image sequence and remove it. In particular, the server may determine the number of lip images to remove from the lip image sequence according to the product of the difference between the second length and the first length and the acquisition frequency.
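
The sketch below combines the two cases of step S204: it compares the audio-segment length with the lip image sequence length, computes the number of frames to insert or remove as described above, and duplicates or drops frames at evenly spaced positions. Duplicating nearby frames is only one option; a real system could instead synthesize new frames (e.g., with Super SloMo). The function and variable names are assumptions introduced for illustration.

    import numpy as np

    def match_length(lip_images, audio_len_s, fps=30):
        # Make the lip image sequence last exactly as long as the audio segment.
        seq_len_s = len(lip_images) / fps
        delta = round(abs(audio_len_s - seq_len_s) * fps)  # e.g. (1.1 - 1.0) * 30 = 3 frames
        if delta == 0:
            return list(lip_images)
        if audio_len_s > seq_len_s:  # frame interpolation: insert duplicated frames
            positions = sorted(np.linspace(0, len(lip_images) - 1, delta, dtype=int))
            out = list(lip_images)
            for offset, pos in enumerate(positions):
                out.insert(pos + offset, lip_images[pos])
            return out
        keep = np.linspace(0, len(lip_images) - 1,  # frame extraction: drop evenly spaced frames
                           len(lip_images) - delta, dtype=int)
        return [lip_images[i] for i in keep]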
In this embodiment, after the target audio is acquired, the phoneme tags corresponding to the audio segments in the target audio are determined, then the lip image sequences corresponding to the audio segments are determined according to the phoneme tags of the audio segments, and the lip image sequences are processed according to the lengths of the audio segments, so that the lengths of the audio segments are the same as the lengths of the corresponding lip image sequences. The method of the embodiment can automatically determine the phonemes corresponding to the audio, and adjust the duration of the corresponding lip image sequence according to the duration of each phoneme, so that the matching degree of the voice and the lip shape can be effectively enhanced.
Fig. 6 is a schematic diagram of an image processing apparatus according to a second embodiment of the present invention. As shown in fig. 6, the apparatus of the present embodiment includes an audio acquisition unit 61, a phoneme label determination unit 62, a lip sequence determination unit 63, and an image sequence processing unit 64.
The audio acquiring unit 61 is configured to acquire the target audio. The phoneme label determining unit 62 is configured to perform speech recognition on the target audio and determine the phoneme label corresponding to each audio segment in the target audio. The lip sequence determining unit 63 is configured to determine the lip image sequence corresponding to each audio segment according to each phoneme label. The image sequence processing unit 64 is configured to process each lip image sequence so that each audio segment is kept the same length as the corresponding lip image sequence.
In this embodiment, after the target audio is acquired, the phoneme tags corresponding to the audio segments in the target audio are determined, then the lip image sequences corresponding to the audio segments are determined according to the phoneme tags of the audio segments, and the lip image sequences are processed according to the lengths of the audio segments, so that the lengths of the audio segments are the same as the lengths of the corresponding lip image sequences. The device of the embodiment can automatically determine the phonemes corresponding to the audio, and adjust the duration of the corresponding lip image sequence according to the duration of each phoneme, so that the matching degree of the voice and the lip shape can be effectively enhanced.
Fig. 7 is a schematic view of an electronic device according to a third embodiment of the present invention. The electronic device shown in fig. 7 is a general-purpose data processing apparatus, and may be specifically a first terminal, a second terminal or a server according to an embodiment of the present invention, and includes a general-purpose computer hardware structure, which includes at least a processor 71 and a memory 72. The processor 71 and the memory 72 are connected by a bus 73. The memory 72 is adapted to store instructions or programs executable by the processor 71. The processor 71 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 71 implements the processing of data and the control of other devices by executing the commands stored in the memory 72 to thereby execute the method flows of the embodiments of the present invention as described above. The bus 73 connects the above-described components together, and also connects the above-described components to a display controller 74 and a display device and an input/output (I/O) device 75. Input/output (I/O) devices 75 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, and other devices known in the art. Typically, input/output (I/O) devices 75 are connected to the system through an input/output (I/O) controller 76.
The memory 72 may store, among other things, software components such as an operating system, communication modules, interaction modules, and application programs. Each of the modules and applications described above corresponds to a set of executable program instructions that perform one or more functions and methods described in embodiments of the invention.
The flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention described above illustrate various aspects of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Also, as will be appreciated by one skilled in the art, aspects of embodiments of the present invention may be embodied as a system, method or computer program product. Accordingly, various aspects of embodiments of the invention may take the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module," or "system." Further, aspects of the invention may take the form of: a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
Any combination of one or more computer-readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of embodiments of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to: electromagnetic, optical, or any suitable combination thereof. The computer readable signal medium may be any of the following computer readable media: is not a computer readable storage medium and may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including: object oriented programming languages such as Java, Smalltalk, C++, PHP, Python, and the like; and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. An image processing method, characterized in that the method comprises:
acquiring a target audio;
determining a phoneme label corresponding to each audio segment in the target audio;
determining a lip image sequence corresponding to each audio segment according to each phoneme label;
and processing each lip image sequence so that each audio segment and the corresponding lip image sequence keep the same length.
2. The method of claim 1, wherein the lip image sequence is a lip image sequence of a target character;
the lip image sequence is generated by:
acquiring a first image sequence, wherein the first image sequence comprises a plurality of first images, and each first image comprises an original character;
determining a second image, the second image comprising a predetermined character;
obtaining a target image sequence based on a pre-trained image processing model according to each first image and the second image, wherein each target image in the target image sequence comprises a target character, the target character being the predetermined character having the posture of the original character;
and cropping each target image to obtain the lip image sequence of the target character.
3. The method of claim 2, wherein the determining a second image comprises:
determining the second image in response to receiving a selection request for the predetermined character.
4. The method of claim 1, wherein processing each of the lip image sequences such that each of the audio segments remains the same length as the corresponding lip image sequence comprises:
determining a first length of each of the audio segments and a second length of each of the lip image sequences;
for each audio segment, in response to the first length being greater than the second length, performing frame interpolation processing on the corresponding lip image sequence so that the first length is equal to the second length;
for each audio segment, in response to the first length being less than the second length, performing frame extraction processing on the corresponding lip image sequence so that the first length is equal to the second length.
5. The method of claim 2, wherein the image processing model is a first order motion model for image animation.
6. An image processing apparatus, characterized in that the apparatus comprises:
an audio acquisition unit for acquiring a target audio;
a phoneme label determining unit, configured to perform speech recognition on the target audio and determine a phoneme label corresponding to each audio segment in the target audio;
a lip sequence determining unit, configured to determine a lip image sequence corresponding to each audio segment according to each phoneme label;
and an image sequence processing unit, configured to process each lip image sequence so that each audio segment and the corresponding lip image sequence keep the same length.
7. A computer-readable storage medium on which computer program instructions are stored, which, when executed by a processor, implement the method of any one of claims 1-5.
8. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-5.
CN202110560268.0A 2021-05-21 2021-05-21 Image processing method and image processing apparatus Pending CN113223123A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110560268.0A CN113223123A (en) 2021-05-21 2021-05-21 Image processing method and image processing apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110560268.0A CN113223123A (en) 2021-05-21 2021-05-21 Image processing method and image processing apparatus

Publications (1)

Publication Number Publication Date
CN113223123A true CN113223123A (en) 2021-08-06

Family

ID=77097880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110560268.0A Pending CN113223123A (en) 2021-05-21 2021-05-21 Image processing method and image processing apparatus

Country Status (1)

Country Link
CN (1) CN113223123A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210110831A1 (en) * 2018-05-18 2021-04-15 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN110876024A (en) * 2018-08-31 2020-03-10 百度在线网络技术(北京)有限公司 Method and device for determining lip action of avatar
CN110347867A (en) * 2019-07-16 2019-10-18 北京百度网讯科技有限公司 Method and apparatus for generating lip motion video
CN111369967A (en) * 2020-03-11 2020-07-03 北京字节跳动网络技术有限公司 Virtual character-based voice synthesis method, device, medium and equipment
CN111476871A (en) * 2020-04-02 2020-07-31 百度在线网络技术(北京)有限公司 Method and apparatus for generating video
CN112102448A (en) * 2020-09-14 2020-12-18 北京百度网讯科技有限公司 Virtual object image display method and device, electronic equipment and storage medium
CN112131988A (en) * 2020-09-14 2020-12-25 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for determining virtual character lip shape
CN112465935A (en) * 2020-11-19 2021-03-09 科大讯飞股份有限公司 Virtual image synthesis method and device, electronic equipment and storage medium
CN112511853A (en) * 2020-11-26 2021-03-16 北京乐学帮网络技术有限公司 Video processing method and device, electronic equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114025235A (en) * 2021-11-12 2022-02-08 北京捷通华声科技股份有限公司 Video generation method and device, electronic equipment and storage medium
CN115984427A (en) * 2022-12-08 2023-04-18 上海积图科技有限公司 Animation synthesis method, device and equipment based on audio and storage medium
CN115984427B (en) * 2022-12-08 2024-05-17 上海积图科技有限公司 Animation synthesis method, device, equipment and storage medium based on audio
CN116012505A (en) * 2022-12-29 2023-04-25 上海师范大学天华学院 Pronunciation animation generation method and system based on key point self-detection and style migration
CN116166125A (en) * 2023-03-03 2023-05-26 北京百度网讯科技有限公司 Avatar construction method, apparatus, device and storage medium
CN116166125B (en) * 2023-03-03 2024-04-23 北京百度网讯科技有限公司 Avatar construction method, apparatus, device and storage medium

Similar Documents

Publication Publication Date Title
CN111582241B (en) Video subtitle recognition method, device, equipment and storage medium
US10299008B1 (en) Smart closed caption positioning system for video content
CN113223123A (en) Image processing method and image processing apparatus
JP7432556B2 (en) Methods, devices, equipment and media for man-machine interaction
US9548048B1 (en) On-the-fly speech learning and computer model generation using audio-visual synchronization
US8793118B2 (en) Adaptive multimodal communication assist system
US7636662B2 (en) System and method for audio-visual content synthesis
CN110688008A (en) Virtual image interaction method and device
US20200202859A1 (en) Generating interactive audio-visual representations of individuals
Laraba et al. Dance performance evaluation using hidden Markov models
US20210390748A1 (en) Personalized speech-to-video with three-dimensional (3d) skeleton regularization and expressive body poses
CN110232340B (en) Method and device for establishing video classification model and video classification
CN107209552A (en) Based on the text input system and method stared
CN112420014A (en) Virtual face construction method and device, computer equipment and computer readable medium
US10825224B2 (en) Automatic viseme detection for generating animatable puppet
CN112329451B (en) Sign language action video generation method, device, equipment and storage medium
Naert et al. Coarticulation analysis for sign language synthesis
CN116958342A (en) Method for generating actions of virtual image, method and device for constructing action library
CN117152308B (en) Virtual person action expression optimization method and system
CN113077819A (en) Pronunciation evaluation method and device, storage medium and electronic equipment
CN113253838A (en) AR-based video teaching method and electronic equipment
KR20180012192A (en) Infant Learning Apparatus and Method Using The Same
CN115438210A (en) Text image generation method, text image generation device, terminal and computer readable storage medium
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
CN113990351A (en) Sound correction method, sound correction device and non-transient storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination