CN109391842B - Dubbing method and mobile terminal - Google Patents

Dubbing method and mobile terminal

Info

Publication number
CN109391842B
CN109391842B (application CN201811368673.7A)
Authority
CN
China
Prior art keywords
feature
determining
region
subject
video data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811368673.7A
Other languages
Chinese (zh)
Other versions
CN109391842A (en)
Inventor
秦帅 (Qin Shuai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN201811368673.7A priority Critical patent/CN109391842B/en
Publication of CN109391842A publication Critical patent/CN109391842A/en
Application granted granted Critical
Publication of CN109391842B publication Critical patent/CN109391842B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects

Abstract

The invention provides a dubbing method, a mobile terminal and a computer-readable storage medium, and relates to the technical field of image processing. The method comprises the following steps: receiving video data to be dubbed; determining characteristic information of each frame of image in the video data; and dubbing the video data according to the characteristic information. The embodiment of the invention can automatically dub the video data according to the characteristic information of each frame of image in the video data, thereby avoiding manual dubbing and synthesis and improving dubbing efficiency.

Description

Dubbing method and mobile terminal
Technical Field
The invention relates to the field of image processing, in particular to a dubbing method and a mobile terminal.
Background
At present, short videos, GIF (Graphics Interchange Format) animations, animated stickers and the like are indispensable elements of social contact and entertainment on the internet. People strengthen their relationships by sharing interesting short videos, and often send animated stickers or GIF images in place of written words.
Sometimes a short video lacks dubbing, or the user wants to add an extra dubbing effect to it. In particular, more and more people now keep pets, videos shot of pets are popular, and adding a dubbing effect to a pet video makes it more fun. Likewise, adding a dubbing effect to an animation or sticker pack conveys the expression better and increases the overall interest.
In the prior art, professional video processing software and audio processing software are needed to dub a video, and users must adjust and synthesize the dubbing manually; for users without professional skills, the dubbing process is cumbersome.
Disclosure of Invention
The invention provides a dubbing method, which aims to solve the problem that the dubbing process is cumbersome for users.
In a first aspect, an embodiment of the present invention provides a dubbing method, which is applied to a mobile terminal, and the method includes:
receiving video data to be dubbed;
determining characteristic information of each frame of image in the video data;
and dubbing the video data according to the characteristic information.
In a second aspect, an embodiment of the present invention provides a mobile terminal, where the mobile terminal includes:
the receiving module is used for receiving video data to be dubbed;
the determining module is used for determining the characteristic information of each frame of image in the video data;
and the dubbing module is used for dubbing the video data according to the characteristic information.
In a third aspect, a mobile terminal is provided, which comprises a processor, a memory and a computer program stored on the memory and operable on the processor, wherein the computer program, when executed by the processor, implements the steps of the dubbing method according to the invention.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the dubbing method according to the invention.
In the embodiment of the invention, the video data to be dubbed is received; determining characteristic information of each frame of image in the video data; and dubbing the video data according to the characteristic information. The embodiment of the invention can automatically dub the video data according to the characteristic information of each frame of image in the video data, thereby avoiding manual dubbing and synthesizing and improving the dubbing efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
Fig. 1 shows a flow chart of a dubbing method in a first embodiment of the present invention;
fig. 2 is a flowchart illustrating a dubbing method according to a second embodiment of the present invention;
fig. 3 is a block diagram illustrating a mobile terminal according to a third embodiment of the present invention;
fig. 4 is a block diagram illustrating a mobile terminal according to a third embodiment of the present invention;
fig. 5 is a block diagram illustrating a mobile terminal according to a fourth embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Example one
Referring to fig. 1, a flowchart of a dubbing method according to a first embodiment of the present invention is shown, which may specifically include the following steps:
step 101, receiving video data to be dubbed.
In the embodiment of the present invention, the video data includes a short video, a GIF animation, an animated sticker, or the like, where a short video is a video of limited length, such as a 10 s video recorded in WeChat.
In the embodiment of the present invention, the video data may also include other forms of video, which is not limited herein.
Step 102, determining characteristic information of each frame of image in the video data.
In the embodiment of the present invention, the feature information of each frame image includes: a subject object in each frame image, a form of the subject object, a background in which the subject object is located, and the like.
Wherein the subject object includes: a person, an animal, or a cartoon figure, where persons can be divided into the elderly, young adults, and children, animals can be classified by species, and cartoon figures can be classified by the specific cartoon character. The form of the subject object includes: standing, climbing on an object, eating, speaking, and the like. The background in which the subject object is located includes: the environment in which the subject object is located, the weather, and the like.
For example, it is recognized that the subject object of each frame image in the video data is a boy, and the boy's form is eating at home. Combining the feature information of the frames yields the information that the boy is eating at home.
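As an illustrative sketch only (the patent does not prescribe a data format), the per-frame feature information and its combination across frames might be represented as follows in Python; the field names and the majority-vote aggregation are assumptions for illustration.

```python
from dataclasses import dataclass
from collections import Counter
from typing import List, Optional

@dataclass
class FrameFeatures:
    subject: str      # e.g. "boy", "dog", "cartoon cat"
    form: str         # e.g. "eating", "standing", "running"
    background: str   # e.g. "home", "road", "forest"

def summarize(frames: List[FrameFeatures]) -> Optional[str]:
    """Combine the feature information of all frames into one description,
    e.g. 'boy eating at home', by taking the most common value of each
    field across the frames."""
    if not frames:
        return None
    subject = Counter(f.subject for f in frames).most_common(1)[0][0]
    form = Counter(f.form for f in frames).most_common(1)[0][0]
    background = Counter(f.background for f in frames).most_common(1)[0][0]
    return f"{subject} {form} at {background}"
```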
And 103, dubbing the video data according to the characteristic information.
In the embodiment of the invention, corresponding dubbing information can be stored in advance for various kinds of feature information, where the dubbing information includes text and voice information. For example, when the feature information of the video data is determined to be a boy eating at home, the corresponding dubbing information may be called to dub the video data. The correspondence between dubbing information and feature information is established in advance, and when the corresponding feature information is recognized, the corresponding dubbing information can be called directly to dub the video data.
For example, when the feature information is that the boy is eating, the dubbing information "hao xiang na" (roughly, "smells so good") may be called directly for dubbing, where the dubbing information may be a previously recorded utterance of a boy or may be synthesized by a computer. When the feature information is that a dog is chasing a car, the dubbing information "where to go" can be called directly, where the dubbing information may likewise be recorded in advance or synthesized by a computer. Such dubbing increases the interest of the video data.
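As a minimal sketch of such a pre-established correspondence (the table keys, entries, and file names below are illustrative assumptions, not values fixed by the patent):

```python
from typing import Optional

# Pre-stored correspondence between feature information and dubbing
# information (text plus a reference to recorded or synthesized sound).
DUBBING_TABLE = {
    ("boy", "eating", "home"): {"text": "hao xiang na", "sound": "sounds/boy_eating.wav"},
    ("dog", "chasing car", "road"): {"text": "where to go", "sound": "sounds/dog_chase.wav"},
}

def lookup_dubbing(subject: str, form: str, background: str) -> Optional[dict]:
    """Return the pre-stored dubbing information for the recognized
    feature information, or None if no entry exists."""
    return DUBBING_TABLE.get((subject, form, background))
```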
In the embodiment of the invention, the video data to be dubbed is received; determining characteristic information of each frame of image in the video data; and dubbing the video data according to the characteristic information. The embodiment of the invention can automatically dub the video data according to the characteristic information of each frame of image in the video data, thereby avoiding manual dubbing and synthesizing and improving the dubbing efficiency.
Example two
Referring to fig. 2, a flowchart of a dubbing method according to a second embodiment of the present invention is shown, which may specifically include the following steps:
step 201, receiving video data to be dubbed.
Referring to step 101, the description is omitted here.
Step 202, determining a main body object and background information of each frame of image in the video data.
In an embodiment of the present invention, the feature information includes: a subject feature and a background feature.
In the embodiment of the present invention, the subject object includes a person, an animal, a cartoon figure, or the like; that is, the subject object is a specific living being or object. For example, the subject object is a boy, a young lady, a young girl, a dog, a cat, a magical-girl warrior, a robot cat, a table with a facial expression, a stool, and so on.
In an embodiment of the invention, the background features include: the background environment of the subject object, such as a forest, a home, a school, a road, etc.
In the embodiment of the invention, a matting technique can be adopted to cut out the region where the subject object is located, after which the subject object is identified; the remaining region outside the subject region is then analyzed to identify the background information. This improves the accuracy of recognition.
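One possible realization of this matting step (an assumption for illustration; the patent does not name a specific algorithm) is OpenCV's GrabCut, given a rough bounding rectangle around the subject, for example from a detector:

```python
import cv2
import numpy as np

def split_subject_background(frame, rect):
    """Cut out the subject region inside rect = (x, y, w, h) with GrabCut and
    return (subject_image, background_image) for separate recognition."""
    mask = np.zeros(frame.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
    subject_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype("uint8")
    subject = frame * subject_mask[:, :, None]
    background = frame * (1 - subject_mask)[:, :, None]
    return subject, background
```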
Step 203, determining the subject features of the subject object.
In an embodiment of the present invention, the main features include: at least one of a global feature, a facial feature, and an oral feature of the subject.
In the embodiment of the present invention, the overall feature refers to the overall form of the subject object. When the subject object is a person, the overall feature includes whether the person is standing, sitting, lying, running, dancing, or swimming; specifically, the overall feature is the state presented by the motion of the person's whole body. The facial feature refers to the subject object's facial expression, such as surprise, fear, disgust, anger, happiness, or sadness. The mouth feature refers to the mouth shape: for a person the mouth shapes are varied, and the content the person wants to express can be determined from the mouth shape; for an animal the mouth shape has only a few states, such as wide open, slightly open, or closed; for a cartoon figure, the mouth shape can be as varied as a person's.
In an embodiment of the present invention, when the main feature includes an integral feature, the step 203 includes:
sub-step 2031, extracting a main body region in each frame image in the video data.
In the embodiment of the invention, the subject region in each frame image can be extracted by using a matting technique. When only the back or side of the subject object appears in the image, and the face and mouth are not present, only the subject region is extracted.
Sub-step 2032, determining the overall characteristics of the subject object based on the subject region.
In an embodiment of the present invention, a subject area is identified, wherein when a subject object of the subject area is a person or an animal, the overall characteristics can be identified by identifying the limb motion of the subject object.
Substep 2032 specifically comprises: and if the main body area is extracted, inputting the main body area into a first recognition model to obtain the overall characteristics of the main body object.
In the embodiment of the present invention, the first recognition model may be obtained by training in advance on first training samples, where the first training samples are images corresponding to subject regions labeled with descriptions of the subject object's overall form. When the first recognition model is used, the subject region is input into the first recognition model to obtain the overall feature of the subject object.
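The patent does not specify the architecture of the first recognition model, so the following PyTorch sketch only illustrates the pattern: a small image classifier trained on subject-region crops labeled with overall-form categories. The same pattern applies to the second, third, and fourth recognition models (face, mouth, and background regions); the class list and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

OVERALL_CLASSES = ["standing", "sitting", "lying", "running", "dancing", "swimming", "eating"]

class OverallFeatureNet(nn.Module):
    """First recognition model: maps a subject-region crop to an overall-form class."""
    def __init__(self, num_classes: int = len(OVERALL_CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):  # x: (N, 3, H, W) subject-region crops
        h = self.features(x).flatten(1)
        return self.classifier(h)

def recognize_overall_feature(model: OverallFeatureNet, subject_region: torch.Tensor) -> str:
    """Run the trained model on one (3, H, W) subject-region crop and return its label."""
    model.eval()
    with torch.no_grad():
        logits = model(subject_region.unsqueeze(0))
    return OVERALL_CLASSES[logits.argmax(dim=1).item()]
```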
In an embodiment of the present invention, when the main body feature includes an integral feature and a face feature, the step 203 includes:
a substep 2033 of extracting a subject region and a face region in the subject region in each frame image in the video data;
in the embodiment of the present invention, a matting technique may be adopted to extract a main body region in each frame image, and when a face feature needs to be determined, a face region in the main body region is extracted.
When the subject object is a bird or other animal without a mouth shape, a subject region and a face region of each frame image are extracted.
Substep 2034, determining the overall characteristics of the subject object according to the subject region;
sub-step 2035 of determining facial features of the subject object based on the facial region;
in the embodiment of the present invention, the sub-step 2035 specifically includes: and inputting the facial area into a second recognition model to obtain the facial features of the main object.
In the embodiment of the present invention, the second recognition model may be obtained by training in advance on second training samples, where the second training samples are images corresponding to face regions labeled with descriptions of the subject object's facial expression. When the second recognition model is used, the face region is input into the second recognition model to obtain the facial feature of the subject object.
In an embodiment of the present invention, when the main body feature includes a whole feature, a face feature and an oral feature, the step 203 includes:
sub-step 2036 of extracting a main body region, a face region in the main body region, and a mouth region in the face region in each frame image in the video data;
in the embodiment of the present invention, a matting technique may be adopted to extract a body region in each frame image, when a face feature needs to be determined, a face region in the body region is extracted, and when an oral region needs to be analyzed, an oral region is directly extracted in the body region for analysis, or an oral region is extracted in the face region for analysis.
In the embodiment of the present invention, when the subject object is a human figure or an animal having a mouth shape, and is a front face of the human figure or the animal, the subject region, the face region, and the mouth region of each frame image are extracted.
Substep 2037, determining the overall characteristics of the subject object according to the subject region;
sub-step 2038 of determining facial features of the subject object based on the facial region;
substep 2039 of determining an oral feature of the subject object based on the oral region;
in the embodiment of the present invention, the sub-step 2039 specifically includes: inputting the mouth region into a third recognition model, and obtaining the mouth feature of the subject object.
In the embodiment of the present invention, the third recognition model may be obtained by training in advance on third training samples, where the third training samples are images corresponding to mouth regions labeled with descriptions of the subject object's mouth shape. When the third recognition model is used, the mouth region is input into the third recognition model to obtain the mouth feature of the subject object.
In the embodiment of the present invention, when the main body feature includes a facial feature and an oral feature, the step 203 includes:
a substep 20310 of extracting a face region and an oral region in the face region in each frame image in the video data;
in the embodiment of the present invention, there is an image having only a face area and a mouth area and no limbs, and in this case, only the face area and the mouth area are extracted.
Sub-step 20311 of determining facial features of the subject object based on the facial region;
substep 20312 of determining an oral characteristic of the subject object based on the oral region.
In the embodiment of the invention, when an image contains only a face region and a mouth region, only these two regions are extracted, and the corresponding features are determined.
Step 204, determining the background feature of the background information.
In the embodiment of the present invention, step 204 includes: inputting the background information of each frame of image into a fourth recognition model to obtain the background feature.
In the embodiment of the present invention, the fourth recognition model may be obtained by training in advance on fourth training samples, where the fourth training samples are images corresponding to background regions labeled with descriptions of the background. When the fourth recognition model is used, the background region is input into the fourth recognition model to obtain the background feature.
Step 205, determining target voice information according to the main body characteristics.
In the embodiment of the invention, corresponding target voice information can be stored according to the subject features, where the voice information includes timbre, pitch, and volume. For example, when the overall feature is a person running, the facial feature is a pained expression, and the mouth shape changes slowly, the pitch is determined to lie in a certain frequency range, the volume may be set lower, and the timbre may be determined from the person's characteristics.
In the embodiment of the invention, correspondences between different subject features and voice information can be established, and the corresponding target voice information is called directly once the subject features are determined.
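A simple way to realize this correspondence is a lookup keyed on the subject features; the field names and values below are illustrative assumptions rather than the patent's prescribed mapping.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class VoiceInfo:
    timbre: str                    # e.g. "boy", "adult male", "cartoon"
    pitch_hz: Tuple[float, float]  # allowed fundamental-frequency range
    volume_db: float               # relative loudness in decibels

# Correspondence between subject features and target voice information,
# keyed by (overall feature, facial feature, mouth feature).
VOICE_TABLE = {
    ("running", "pained", "slowly changing"): VoiceInfo("adult male", (80.0, 150.0), -6.0),
    ("eating", "happy", "open"): VoiceInfo("boy", (200.0, 300.0), 0.0),
}

def target_voice(overall: str, facial: str, mouth: str) -> Optional[VoiceInfo]:
    """Return the pre-stored target voice information for the subject features."""
    return VOICE_TABLE.get((overall, facial, mouth))
```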
Step 206, determining target text information according to the subject features and the background features.
In the embodiment of the invention, corresponding target text information can be stored in advance for different combinations of subject features and background features. After the subject features and the background features are determined, the corresponding target text information is called directly.
And step 207, dubbing the video data according to the target voice information and the target text information.
In the embodiment of the invention, the target voice information and the target text information are combined to obtain the dubbing information, and the dubbing information is synthesized with the video data to complete the dubbing.
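As one possible way to perform this final synthesis (an illustrative assumption; the patent does not mandate a specific tool), the generated dubbing audio can be muxed with the video data by calling FFmpeg:

```python
import subprocess

def mux_dubbing(video_path: str, audio_path: str, out_path: str) -> None:
    """Synthesize the dubbing audio with the video data by replacing the
    video's audio track with the generated dubbing track."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,            # video data to be dubbed
        "-i", audio_path,            # generated dubbing audio
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", "copy",              # keep the video stream unchanged
        "-shortest",                 # stop at the shorter of the two streams
        out_path,
    ], check=True)
```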
In this embodiment of the present invention, step 207 includes:
substep 2071, obtaining general sound data corresponding to the target text information based on a general sound library, wherein the general sound library comprises: and universal voice data corresponding to each character information.
In the embodiment of the invention, the text information covers all characters that may be used in expression, and each character has its own fixed general sound data, where the general sound data includes timbre, pitch, and volume; that is, each character has a fixed timbre, pitch, and volume. In the embodiment of the present invention, the same timbre, pitch, and volume can be stored for each character to form the general sound library.
In the embodiment of the invention, each character in the general sound library has a standard volume, pitch, and timbre.
Substep 2072, adjusting the general sound data according to the target voice information to obtain target sound data.
In the embodiment of the invention, when the target voice information is obtained, the general sound data corresponding to the target text information is adjusted according to the target voice information: the volume, pitch, and timbre of the general sound data are adjusted to match the target voice information.
Substep 2073 dubs the video data based on the target sound data and the target text information.
In the embodiment of the invention, each character could have many combinations of timbre, pitch, and volume, and storing every sound variant of every character would require a large database. Therefore, each character is stored only as general sound data, and after the target voice information is determined the general sound data is adjusted directly, which reduces the amount of text and voice data that must be stored.
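The following sketch illustrates the adjustment step under the assumption that the general sound data is stored as waveform audio; it uses librosa for pitch shifting and a plain gain for volume (timbre adjustment is omitted), which is one possible realization rather than the method fixed by the patent.

```python
import librosa
import numpy as np
import soundfile as sf

def adjust_general_sound(wav_path: str, out_path: str,
                         pitch_steps: float = 0.0, gain_db: float = 0.0) -> None:
    """Adjust the general sound data for one piece of text to the target voice
    information: shift its pitch by pitch_steps semitones and scale its volume
    by gain_db decibels, then write out the target sound data."""
    y, sr = librosa.load(wav_path, sr=None)
    if pitch_steps:
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=pitch_steps)
    y = y * (10.0 ** (gain_db / 20.0))
    y = np.clip(y, -1.0, 1.0)  # avoid clipping after the gain
    sf.write(out_path, y, sr)

# Example (hypothetical paths): lower the pitch by two semitones and reduce the
# volume by 6 dB before synthesizing the adjusted audio with the video data.
# adjust_general_sound("general/hao_xiang_na.wav", "target/hao_xiang_na.wav",
#                      pitch_steps=-2.0, gain_db=-6.0)
```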
In the embodiment of the invention, the video data to be dubbed is received; determining characteristic information of each frame of image in the video data; and dubbing the video data according to the characteristic information. The embodiment of the invention can automatically dub the video data according to the characteristic information of each frame of image in the video data, thereby avoiding manual dubbing and synthesizing and improving the dubbing efficiency.
EXAMPLE III
Referring to fig. 3, a block diagram of a mobile terminal 300 according to a third embodiment of the present invention is shown, which may specifically include:
a receiving module 301, configured to receive video data to be dubbed;
a determining module 302, configured to determine feature information of each frame of image in the video data;
and a dubbing module 303, configured to dub the video data according to the feature information.
Optionally, on the basis of fig. 3, referring to fig. 4, the determining module 302 includes:
a first determining unit 3021 configured to determine a subject object and background information of each frame of image in the video data;
wherein the feature information includes: a subject feature and a background feature;
a second determination unit 3022 configured to determine a subject feature of the subject object;
a third determining unit 3023 configured to determine a background feature of the background information;
the dubbing module 303 includes:
a fourth determining unit 3031, configured to determine target voice information according to the body characteristic;
a fifth determining unit 3032, configured to determine target text information according to the main feature and the background feature;
and a dubbing unit 3033, configured to dub the video data according to the target vocal information and the target text information.
When the main body feature includes an integral feature, the second determination unit 3022 includes:
a first extraction subunit, configured to extract a main body region in each frame image in the video data;
the first determining subunit is used for determining the overall characteristics of the subject object according to the subject region;
when the main body feature includes an integral feature and a face feature, the second determination unit 3022 includes:
a second extraction subunit operable to extract a subject region and a face region in the subject region in each frame image in the video data;
the second determining subunit is used for determining the overall characteristics of the subject object according to the subject region;
a third determining subunit configured to determine a facial feature of the subject object based on the face region;
when the main body feature includes a whole feature, a face feature, and a mouth feature, the second determination unit 3022 includes:
a third extraction subunit operable to extract a main body region, a face region in the main body region, and a mouth region in the face region in each frame image in the video data;
the fourth determining subunit is used for determining the overall characteristics of the subject object according to the subject region;
a fifth determining subunit configured to determine a facial feature of the subject object based on the face region;
a sixth determining subunit configured to determine an oral characteristic of the subject object based on the oral area;
when the main body feature includes a facial feature and a mouth feature, the second determination unit 3022 includes:
a fourth extraction subunit operable to extract a face region and an oral region in the face region in each frame image in the video data;
a seventh determining subunit that determines a facial feature of the subject object from the face region;
an eighth determining subunit that determines a mouth feature of the subject object based on the mouth region.
The first determining subunit is specifically configured to input the subject region into a first recognition model, and obtain an overall feature of the subject object;
the second determining subunit is specifically configured to input the facial region into a second recognition model, and obtain a facial feature of the subject object;
the third determining subunit is specifically configured to input the mouth region into a third recognition model, and obtain a mouth feature of the subject object;
the third determination unit includes:
and the fourth determining subunit is configured to input the background information of each frame of image into a fourth recognition model, so as to obtain the background feature.
The dubbing unit 3033 includes:
an obtaining subunit, configured to obtain, based on a general sound library, general sound data corresponding to the target text information, where the general sound library includes: general voice data corresponding to each text message;
the obtaining subunit is used for adjusting the general sound data according to the target voice information to obtain target sound data;
and the dubbing subunit is used for dubbing the video data based on the target sound data and the target character information.
The mobile terminal provided in the embodiment of the present invention can implement each process implemented by the mobile terminal in the method embodiments of fig. 1 to fig. 2, and is not described herein again to avoid repetition.
In the embodiment of the invention, the mobile terminal receives the video data to be dubbed; determining characteristic information of each frame of image in the video data; and dubbing the video data according to the characteristic information. The mobile terminal of the embodiment of the invention can automatically dub the video data according to the characteristic information of each frame of image in the video data, thereby avoiding manual dubbing and synthesizing and improving the dubbing efficiency.
Example four
Figure 5 is a schematic diagram of a hardware configuration of a mobile terminal implementing various embodiments of the present invention.
the mobile terminal 500 includes, but is not limited to: a radio frequency unit 501, a network module 502, an audio output unit 503, an input unit 504, a sensor 505, a display unit 506, a user input unit 507, an interface unit 508, a memory 509, a processor 510, and a power supply 511. Those skilled in the art will appreciate that the mobile terminal architecture shown in fig. 5 is not intended to be limiting of mobile terminals, and that a mobile terminal may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present invention, the mobile terminal includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
A processor 510 for receiving video data to be dubbed; determining characteristic information of each frame of image in the video data; and dubbing the video data according to the characteristic information.
In the embodiment of the invention, the video data to be dubbed is received; determining characteristic information of each frame of image in the video data; and dubbing the video data according to the characteristic information. The embodiment of the invention can automatically dub the video data according to the characteristic information of each frame of image in the video data, thereby avoiding manual dubbing and synthesizing and improving the dubbing efficiency.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 501 may be used for receiving and sending signals during a message sending and receiving process or a call process; specifically, it receives downlink data from a base station and forwards it to the processor 510 for processing, and transmits uplink data to the base station. In general, the radio frequency unit 501 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 501 can also communicate with a network and other devices through a wireless communication system.
The mobile terminal provides the user with wireless broadband internet access through the network module 502, such as helping the user send and receive e-mails, browse webpages, access streaming media, and the like.
The audio output unit 503 may convert audio data received by the radio frequency unit 501 or the network module 502 or stored in the memory 509 into an audio signal and output as sound. Also, the audio output unit 503 may also provide audio output related to a specific function performed by the mobile terminal 500 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 503 includes a speaker, a buzzer, a receiver, and the like.
The input unit 504 is used to receive an audio or video signal. The input unit 504 may include a Graphics Processing Unit (GPU) 5041 and a microphone 5042; the graphics processor 5041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 506. The image frames processed by the graphics processor 5041 may be stored in the memory 509 (or other storage medium) or transmitted via the radio frequency unit 501 or the network module 502. The microphone 5042 may receive sounds and process them into audio data. In the phone call mode, the processed audio data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 501.
The mobile terminal 500 also includes at least one sensor 505, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that adjusts the brightness of the display panel 5061 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 5061 and/or a backlight when the mobile terminal 500 is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of the mobile terminal (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), and vibration identification related functions (such as pedometer, tapping); the sensors 505 may also include fingerprint sensors, pressure sensors, iris sensors, molecular sensors, gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc., which are not described in detail herein.
The display unit 506 is used to display information input by the user or information provided to the user. The Display unit 506 may include a Display panel 5061, and the Display panel 5061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 507 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile terminal. Specifically, the user input unit 507 includes a touch panel 5071 and other input devices 5072. Touch panel 5071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations by a user on or near touch panel 5071 using a finger, stylus, or any suitable object or accessory). The touch panel 5071 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 510, and receives and executes commands sent by the processor 510. In addition, the touch panel 5071 may be implemented in various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. In addition to the touch panel 5071, the user input unit 507 may include other input devices 5072. In particular, other input devices 5072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
Further, the touch panel 5071 may be overlaid on the display panel 5061, and when the touch panel 5071 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 510 to determine the type of the touch event, and then the processor 510 provides a corresponding visual output on the display panel 5061 according to the type of the touch event. Although in fig. 5, the touch panel 5071 and the display panel 5061 are two independent components to implement the input and output functions of the mobile terminal, in some embodiments, the touch panel 5071 and the display panel 5061 may be integrated to implement the input and output functions of the mobile terminal, and is not limited herein.
The interface unit 508 is an interface through which an external device is connected to the mobile terminal 500. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 508 may be used to receive input (e.g., data information, power, etc.) from external devices and transmit the received input to one or more elements within the mobile terminal 500 or may be used to transmit data between the mobile terminal 500 and external devices.
The memory 509 may be used to store software programs as well as various data. The memory 509 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 509 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 510 is a control center of the mobile terminal, connects various parts of the entire mobile terminal using various interfaces and lines, and performs various functions of the mobile terminal and processes data by operating or executing software programs and/or modules stored in the memory 509 and calling data stored in the memory 509, thereby performing overall monitoring of the mobile terminal. Processor 510 may include one or more processing units; preferably, the processor 510 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 510.
The mobile terminal 500 may further include a power supply 511 (e.g., a battery) for supplying power to various components, and preferably, the power supply 511 may be logically connected to the processor 510 via a power management system, so that functions of managing charging, discharging, and power consumption are performed via the power management system.
In addition, the mobile terminal 500 includes some functional modules that are not shown, and thus, are not described in detail herein.
Preferably, an embodiment of the present invention further provides a mobile terminal, which includes a processor 510, a memory 509, and a computer program stored in the memory 509 and capable of running on the processor 510, where the computer program, when executed by the processor 510, implements each process of the foregoing dubbing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not described here again.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the dubbing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A dubbing method applied to a mobile terminal is characterized by comprising the following steps:
receiving video data to be dubbed;
determining characteristic information of each frame of image in the video data;
dubbing the video data according to the characteristic information;
wherein the feature information includes: a subject feature and a background feature; the step of determining the feature information of each frame of image in the video data includes:
determining a main body object and background information of each frame of image in the video data;
determining a subject feature of the subject object;
determining a background feature of the background information;
the dubbing the video data according to the feature information includes:
determining target voice information according to the main body characteristics;
determining target character information according to the main body characteristics and the background characteristics;
and carrying out dubbing on the video data according to the target voice information and the target character information.
2. The dubbing method according to claim 1,
when the subject feature includes an integral feature, the step of determining the subject feature of the subject object includes:
extracting a main body area in each frame image in the video data;
determining the overall characteristics of the subject object according to the subject region;
when the subject feature includes an integral feature and a facial feature, the step of determining the subject feature of the subject object includes:
extracting a main body region and a face region in the main body region in each frame image in the video data;
determining the overall characteristics of the subject object according to the subject region;
determining facial features of the subject object according to the facial region;
when the subject feature includes a whole feature, a facial feature, and an oral feature, the step of determining the subject feature of the subject includes:
extracting a main body region, a face region in the main body region, and a mouth region in the face region in each frame image in the video data;
determining the overall characteristics of the subject object according to the subject region;
determining facial features of the subject object according to the facial region;
determining oral characteristics of the subject according to the oral region;
when the body feature includes a facial feature and an oral feature, the step of determining the body feature of the subject includes:
extracting a face region and an oral region in the face region in each frame image in the video data;
determining facial features of the subject object according to the facial region;
determining oral characteristics of the subject object based on the oral region.
3. The dubbing method according to claim 2, wherein the step of determining the overall feature of the subject object based on the subject region comprises:
inputting the main body area into a first recognition model to obtain the overall characteristics of the main body object;
the step of determining facial features of the subject object from the facial region comprises:
inputting the facial region into a second recognition model to obtain facial features of the subject object;
the step of determining the oral characteristics of the subject according to the oral area includes:
inputting the mouth region into a third recognition model to obtain mouth features of the subject object;
the step of determining the background feature of the background information includes:
and inputting the background information of each frame of image into a fourth recognition model to obtain the background characteristics.
4. The method of claim 1,
the step of dubbing the video data according to the target voice information and the target text information comprises:
based on a general sound library, obtaining general sound data corresponding to the target text information, wherein the general sound library comprises: general voice data corresponding to each text message;
according to the target voice information, adjusting the general voice data to obtain target voice data;
and dubbing the video data based on the target sound data and the target character information.
5. A mobile terminal, characterized in that the mobile terminal comprises:
the receiving module is used for receiving video data to be dubbed;
the determining module is used for determining the characteristic information of each frame of image in the video data;
the dubbing module is used for dubbing the video data according to the characteristic information;
wherein the feature information includes: a subject feature and a background feature; the determining module includes:
a first determining unit, configured to determine a subject object and background information of each frame of image in the video data;
a second determination unit configured to determine a subject feature of the subject object;
a third determining unit, configured to determine a background feature of the background information;
the dubbing module comprises:
the fourth determining unit is used for determining target voice information according to the main body characteristics;
a fifth determining unit, configured to determine target text information according to the main feature and the background feature;
and the dubbing unit is used for dubbing the video data according to the target voice information and the target character information.
6. The mobile terminal of claim 5, wherein the body characteristics comprise:
when the main body feature includes an integral feature, the second determination unit includes:
a first extraction subunit, configured to extract a main body region in each frame image in the video data;
the first determining subunit is used for determining the overall characteristics of the subject object according to the subject region;
when the main body feature includes an integral feature and a face feature, the second determination unit includes:
a second extraction subunit operable to extract a subject region and a face region in the subject region in each frame image in the video data;
the first determining subunit is used for determining the overall characteristics of the subject object according to the subject region;
a second determining subunit configured to determine a facial feature of the subject object based on the face region;
when the main body feature includes a whole feature, a face feature, and an oral feature, the second determination unit includes:
a third extraction subunit operable to extract a main body region, a face region in the main body region, and a mouth region in the face region in each frame image in the video data;
the first determining subunit is used for determining the overall characteristics of the subject object according to the subject region;
a second determining subunit configured to determine a facial feature of the subject object based on the face region;
a third determining subunit for determining an oral feature of the subject object based on the oral region;
when the main body feature includes a facial feature and an oral feature, the second determination unit includes:
a fourth extraction subunit operable to extract a face region and an oral region in the face region in each frame image in the video data;
a second determining subunit that determines a facial feature of the subject object from the face region;
and a third determining subunit that determines the mouth feature of the subject object based on the mouth region.
7. The mobile terminal of claim 6,
the first determining subunit is specifically configured to input the subject region into a first recognition model, and obtain an overall feature of the subject object;
the second determining subunit is specifically configured to input the facial region into a second recognition model, and obtain a facial feature of the subject object;
the third determining subunit is specifically configured to input the mouth region into a third recognition model, and obtain a mouth feature of the subject object;
the third determination unit includes:
and the fourth determining subunit is configured to input the background information of each frame of image into a fourth recognition model, so as to obtain the background feature.
8. The mobile terminal of claim 6, further comprising:
the providing module is used for providing the general sound library; the universal sound bank includes: general voice data corresponding to each text message;
the dubbing unit includes:
the obtaining subunit is configured to obtain, based on the general sound library, general sound data corresponding to the target text information;
the obtaining subunit is used for adjusting the general sound data according to the target voice information to obtain target sound data;
and the dubbing subunit is used for dubbing the video data based on the target sound data and the target character information.
9. A mobile terminal, characterized in that it comprises a processor, a memory and a computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the dubbing method of any one of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the dubbing method of any one of claims 1 to 4.
CN201811368673.7A 2018-11-16 2018-11-16 Dubbing method and mobile terminal Active CN109391842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811368673.7A CN109391842B (en) 2018-11-16 2018-11-16 Dubbing method and mobile terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811368673.7A CN109391842B (en) 2018-11-16 2018-11-16 Dubbing method and mobile terminal

Publications (2)

Publication Number Publication Date
CN109391842A CN109391842A (en) 2019-02-26
CN109391842B 2021-01-26

Family

ID=65429646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811368673.7A Active CN109391842B (en) 2018-11-16 2018-11-16 Dubbing method and mobile terminal

Country Status (1)

Country Link
CN (1) CN109391842B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110459200A (en) * 2019-07-05 2019-11-15 深圳壹账通智能科技有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN111046814A (en) * 2019-12-18 2020-04-21 维沃移动通信有限公司 Image processing method and electronic device
CN112954453B (en) * 2021-02-07 2023-04-28 北京有竹居网络技术有限公司 Video dubbing method and device, storage medium and electronic equipment
CN113630630B (en) * 2021-08-09 2023-08-15 咪咕数字传媒有限公司 Method, device and equipment for processing video comment dubbing information

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937570A (en) * 2009-10-11 2011-01-05 上海本略信息科技有限公司 Animation mouth shape automatic matching implementation method based on voice and text recognition
CN103763480A (en) * 2014-01-24 2014-04-30 三星电子(中国)研发中心 Method and equipment for obtaining video dubbing
CN106060424A (en) * 2016-06-14 2016-10-26 徐文波 Video dubbing method and device
CN106254933A (en) * 2016-08-08 2016-12-21 腾讯科技(深圳)有限公司 Subtitle extraction method and device
CN107172449A (en) * 2017-06-19 2017-09-15 微鲸科技有限公司 Multi-medium play method, device and multimedia storage method
CN107293286A (en) * 2017-05-27 2017-10-24 华南理工大学 A kind of speech samples collection method that game is dubbed based on network
CN107659850A (en) * 2016-11-24 2018-02-02 腾讯科技(北京)有限公司 Media information processing method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007200401A (en) * 2006-01-24 2007-08-09 Funai Electric Co Ltd Optical disk reproducing device
KR100850577B1 (en) * 2006-01-27 2008-08-06 삼성전자주식회사 Device and method for processing multi-data in terminal having digital broadcasting receiver

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937570A (en) * 2009-10-11 2011-01-05 上海本略信息科技有限公司 Animation mouth shape automatic matching implementation method based on voice and text recognition
CN103763480A (en) * 2014-01-24 2014-04-30 三星电子(中国)研发中心 Method and equipment for obtaining video dubbing
CN106060424A (en) * 2016-06-14 2016-10-26 徐文波 Video dubbing method and device
CN106254933A (en) * 2016-08-08 2016-12-21 腾讯科技(深圳)有限公司 Subtitle extraction method and device
CN107659850A (en) * 2016-11-24 2018-02-02 腾讯科技(北京)有限公司 Media information processing method and device
CN107293286A (en) * 2017-05-27 2017-10-24 华南理工大学 A kind of speech samples collection method that game is dubbed based on network
CN107172449A (en) * 2017-06-19 2017-09-15 微鲸科技有限公司 Multi-medium play method, device and multimedia storage method

Also Published As

Publication number Publication date
CN109391842A (en) 2019-02-26

Similar Documents

Publication Publication Date Title
CN109391842B (en) Dubbing method and mobile terminal
CN110740259B (en) Video processing method and electronic equipment
CN108184050B (en) Photographing method and mobile terminal
CN109819167B (en) Image processing method and device and mobile terminal
CN109040641B (en) Video data synthesis method and device
CN109065060B (en) Voice awakening method and terminal
CN108668024B (en) Voice processing method and terminal
KR20100062207A (en) Method and apparatus for providing animation effect on video telephony call
CN108683850B (en) Shooting prompting method and mobile terminal
CN107886969B (en) Audio playing method and audio playing device
US20230005506A1 (en) Audio processing method and electronic device
CN107818787B (en) Voice information processing method, terminal and computer readable storage medium
CN108174236A (en) A kind of media file processing method, server and mobile terminal
CN111372029A (en) Video display method and device and electronic equipment
CN110062281B (en) Play progress adjusting method and terminal equipment thereof
CN110808019A (en) Song generation method and electronic equipment
WO2019201235A1 (en) Video communication method and mobile terminal
CN107563316A (en) A kind of image pickup method, terminal and computer-readable recording medium
CN109350961A (en) A kind of content processing method, terminal and computer readable storage medium
CN111915744A (en) Interaction method, terminal and storage medium for augmented reality image
CN114630135A (en) Live broadcast interaction method and device
CN108009200A (en) The method to set up and mobile terminal of contact image, computer-readable recording medium
CN111596841B (en) Image display method and electronic equipment
CN109453526A (en) A kind of sound processing method, terminal and computer readable storage medium
CN114065168A (en) Information processing method, intelligent terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant