CN113259778A - Method, system and storage medium for using virtual character for automatic video production - Google Patents

Method, system and storage medium for using virtual character for automatic video production

Info

Publication number
CN113259778A
Authority
CN
China
Prior art keywords
video
information
virtual character
pronunciation
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110434256.3A
Other languages
Chinese (zh)
Inventor
李�权
王伦基
叶俊杰
朱杰
成秋喜
韩蓝青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Original Assignee
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CYAGEN BIOSCIENCES (GUANGZHOU) Inc and Research Institute Of Tsinghua Pearl River Delta
Priority to CN202110434256.3A
Publication of CN113259778A
Legal status: Pending

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 - Monomedia components thereof
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/4302 - Content synchronisation processes, e.g. decoder synchronisation
    • H04N 21/4307 - Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 - Monomedia components thereof
    • H04N 21/8106 - Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 - Monomedia components thereof
    • H04N 21/816 - Monomedia components thereof involving special video data, e.g. 3D video

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method, a system and a storage medium for using a virtual character for automatic video production. The method comprises the steps of synthesizing a pronunciation sound attribute and an explanation manuscript with a neural network to obtain voice information, generating a virtual character, generating video information according to image information, and embedding the virtual character into the video information. When the video information embedded with the virtual character is played, the image information it contains is displayed while the virtual character simulates the motions of a real person reading the explanation manuscript aloud and plays the synchronized voice information. The overall effect is that of a virtual presenter introducing the image information shown as the background, with lifelike qualities such as lip movements matched to the speech and rich facial expressions. This overcomes the shortcomings of the prior art, namely concatenative speech synthesis and the absence of a real-person or virtual cartoon presenter, and can greatly improve the efficiency of automatic video creation. The invention is widely applicable in the technical field of multimedia.

Description

Method, system and storage medium for using virtual character for automatic video production
Technical Field
The invention relates to the technical field of multimedia, in particular to a method, a system and a storage medium for using a virtual character for automatic video production.
Background
Videos frequently need to be produced in fields such as self-media, campus publicity, and tourism promotion. Video production has always had a high technical threshold, constrained by the quality of the camera equipment and the professional skill of the people doing the filming. For example, recording video requires equipment that is generally expensive, and during shooting a non-professional presenter is prone to logical mistakes, slips of the tongue, breaking into laughter, and disfluent sentences. In response to these shortcomings, some techniques have attempted to replace the filming of a live presenter with computer-synthesized speech when producing promotional material. However, the synthesized speech used in the prior art is mostly produced by concatenation: it is neither natural nor fluent, lacks prosody, seriously degrades video quality, and makes the video feel fake. Moreover, such methods put no real person on camera and synthesize only a cold, impersonal voice with little warmth, which is not engaging enough to hold the audience's attention. In short, the viewing experience produced by the prior art falls far short of that of professionally filmed work, and there remains considerable room for improvement.
Disclosure of Invention
In view of at least one of the above technical problems, it is an object of the present invention to provide a method, system and storage medium for using a virtual character for automatic video production.
In one aspect, an embodiment of the present invention includes a method for using a virtual character for automatic video production, including:
determining a pronunciation sound attribute, a pronunciation image attribute and a pronunciation action attribute;
acquiring image information and an explanation manuscript corresponding to the image information;
synthesizing the pronunciation sound attribute and the explanation manuscript by using a neural network to obtain voice information;
generating a virtual character; the virtual character plays the voice information under the driving of the pronunciation image attribute and the pronunciation action attribute;
generating video information according to the image information;
embedding the virtual character in the video information.
Further, the determining the attribute of the pronunciation sound, the attribute of the pronunciation image and the attribute of the pronunciation action comprises:
acquiring a voice sample, a picture sample and a video sample;
selecting and determining the pronunciation sound attribute matched with the voice sample from a database through voiceprint recognition;
manually selecting an image in a database or taking a photo image of a speaker uploaded by a user as the virtual character;
the actions of the virtual character are selected manually or according to the actions of the speaker uploaded by the user.
Further, the generating the virtual character comprises:
synthesizing the pronunciation action attribute and the voice information by using a character lip-shape generation model to obtain a lip-synchronized character action video;
synthesizing the lip-synchronized character action video and the pronunciation image attribute by using a video-driven virtual character model to obtain a virtual character explanation video; the virtual character explanation video includes the virtual character.
Further, the generating the virtual character further includes:
and carrying out cutout processing on the virtual character explanation video by using a video cutout model to obtain a background-free virtual character explanation video.
Further, the embedding the virtual character into the video information includes:
embedding the virtual character explanation video into the video information;
or
Embedding the virtual character explanation background-free video into the video information.
Further, the image information is information in a picture form; the generating of the video information according to the image information includes:
expanding the image information into the video information; the duration of the video information is equal to the duration of the voice information.
Further, the image information is information in the form of video clips; the generating of the video information according to the image information includes:
carrying out variable-speed processing so that the duration of the image information is equal to the duration of the voice information;
and after the variable-speed processing, taking the image information as the video information.
Further, the performing of the variable-speed processing includes:
performing variable-speed processing on only part or all of the image information;
or
performing variable-speed processing on only part or all of the voice information;
or
performing variable-speed processing on corresponding parts of the image information and the voice information.
In another aspect, an embodiment of the present invention further includes a system for using a virtual character for automatic video production, including:
the first module is used for determining the attribute of pronunciation sound, the attribute of pronunciation image and the attribute of pronunciation action;
the second module is used for acquiring image information and an explanation manuscript corresponding to the image information;
the third module is used for synthesizing the pronunciation sound attribute and the explanation manuscript by using a neural network to obtain voice information;
a fourth module for generating a virtual character; the virtual character plays the voice information under the driving of the pronunciation image attribute and the pronunciation action attribute;
a fifth module, configured to generate video information according to the image information;
a sixth module for embedding the virtual character into the video information.
In another aspect, embodiments of the present invention also include a storage medium having stored therein processor-executable instructions that, when executed by a processor, perform the method for using a virtual character for automatic video production.
The invention has the following beneficial effects: when the video information embedded with the virtual character in this embodiment is played, the image information contained in the video information is displayed while the virtual character simulates the motions of a real person reading the explanation manuscript aloud and plays the synchronized voice information. The overall effect is that of a virtual presenter introducing the image information shown as the background, with lifelike qualities such as lip movements matched to the speech and rich facial expressions. This removes the equipment requirements that constrain conventional video recording, overcomes the prior art's reliance on concatenative speech synthesis, remedies its lack of a real-person or virtual cartoon presenter in automatic video generation, and can greatly improve the efficiency of automatic video creation.
Drawings
FIG. 1 is a flowchart of a method for using a virtual character for automatic video production according to an embodiment;
FIG. 2 is a schematic diagram of a method for using a virtual character for automatic video production in an embodiment.
Detailed Description
In this embodiment, referring to fig. 1, the method for using a virtual character for automatic video production includes the following steps:
s1, determining pronunciation sound attribute, pronunciation image attribute and pronunciation action attribute;
s2, acquiring image information and an explanation manuscript corresponding to the image information;
s3, synthesizing the pronunciation sound attribute and the explanation manuscript by using a neural network to obtain voice information;
s4, generating a virtual character; the virtual character plays voice information under the drive of the attribute of the pronunciation image and the attribute of the pronunciation action;
s5, generating video information according to the image information;
and S6, embedding the virtual character into the video information.
The principle of steps S1-S6 is shown in FIG. 2.
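As an overview, the dataflow of steps S1-S6 can be sketched in code. The sketch below is illustrative only: the patent describes the stages abstractly, so each stage is injected as a callable, and every name and signature is an assumption rather than a real API.

```python
from typing import Callable, Tuple

def produce_video(
    script_text: str,                                          # explanation manuscript (S2)
    image_path: str,                                           # image information (S2)
    resolve_attributes: Callable[[], Tuple[object, object, object]],  # S1
    synthesize_speech: Callable[[object, str], str],           # S3: attribute + text -> wav path
    generate_character: Callable[[object, object, str], str],  # S4: -> character clip path
    make_background: Callable[[str, str], str],                # S5: image + wav -> clip path
    embed_character: Callable[[str, str], str],                # S6: -> final video path
) -> str:
    sound, image_attr, action = resolve_attributes()           # S1: pick the speaker profile
    wav = synthesize_speech(sound, script_text)                # S3: neural TTS
    character = generate_character(action, image_attr, wav)    # S4: lip-synced virtual character
    background = make_background(image_path, wav)              # S5: background timed to the speech
    return embed_character(background, character)              # S6: composite character over background
```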
In step S1, a number of pronunciation sound attributes, pronunciation image attributes and pronunciation action attributes may be stored in a database in advance and offered to the user for selection. The user generates a selection command by choosing a particular set of pronunciation sound, pronunciation image and pronunciation action attributes, and the system determines the attributes corresponding to that command from the database.
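As a minimal sketch of this selection mechanism, the snippet below models the database as an in-memory catalogue keyed by the selection command; the profile fields and identifiers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SpeakerProfile:
    sound: str   # pronunciation sound attribute (e.g. a TTS voice id)
    image: str   # pronunciation image attribute (e.g. a portrait path)
    action: str  # pronunciation action attribute (e.g. a motion clip path)

# Illustrative catalogue; a real system would back this with a database.
CATALOGUE = {
    "presenter_female_01": SpeakerProfile("voice/f01", "img/f01.png", "motion/f01.mp4"),
    "presenter_male_02": SpeakerProfile("voice/m02", "img/m02.png", "motion/m02.mp4"),
}

def resolve_selection(selection_command: str) -> SpeakerProfile:
    """Map the user's selection command to a stored attribute set."""
    try:
        return CATALOGUE[selection_command]
    except KeyError:
        raise ValueError(f"unknown speaker profile: {selection_command}")
```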
In step S1, the system may instead obtain a voice sample, a picture sample and a video sample uploaded by the user. The voice sample may be a recording of a person's speech made with the user's mobile phone, the picture sample a portrait photograph, and the video sample a video of a person's movements. Through voiceprint recognition and image recognition, the pronunciation sound attribute matching the voice sample is selected from the database, and the virtual character's actions are either chosen manually from the database or taken from the speaker's movements uploaded by the user. Here, a voice sample "matching" a pronunciation sound attribute means that the voiceprint analysis of the sample is close to that of the attribute, so the selected attribute resembles the pronunciation characteristics of the person who recorded the sample.
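The patent does not specify how the voiceprint matching is computed. A common realization is to compare fixed-length speaker embeddings by cosine similarity, as in the sketch below; the embedding extractor itself (for example an x-vector speaker encoder) is assumed and not shown.

```python
from typing import Dict
import numpy as np

def best_matching_voice(sample_embedding: np.ndarray,
                        database: Dict[str, np.ndarray]) -> str:
    """Return the id of the stored pronunciation sound attribute whose
    voiceprint embedding is closest, by cosine similarity, to the
    embedding of the uploaded voice sample."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(database, key=lambda vid: cosine(sample_embedding, database[vid]))
```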
In step S2, the system acquires image information input by the user together with the explanation manuscript corresponding to it. The image information may take the form of a picture or of a video clip. For example, it may be a high-definition photograph of a cultural relic taken by a museum, with an explanation manuscript recounting the relic's provenance and history; or it may be an aerial video clip of a spot within a scenic area, with a manuscript describing the area's opening hours, transport routes and sightseeing suggestions.
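For illustration, each unit of input from step S2 can be held in a simple record pairing the image material with its explanation manuscript; this representation is an assumption, not part of the patent.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Segment:
    media_path: str                         # a picture file or a video clip
    media_kind: Literal["picture", "clip"]  # which of the two forms it takes
    script: str                             # the corresponding explanation manuscript

# e.g. a museum exhibit paired with its narration (hypothetical values):
relic = Segment("relic_photo.jpg", "picture", "This bronze vessel dates from ...")
```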
In step S3, the system synthesizes the pronunciation sound attribute and the explanation manuscript using a neural network to obtain the voice information. When the voice information is played, it expresses the content of the explanation manuscript, with pronunciation characteristics such as timbre, speaking rate and pausing determined by the pronunciation sound attribute.
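The patent names only "a neural network" for this step. As one concrete stand-in, the sketch below uses the open-source Coqui TTS library (pip install TTS); the model choice, the cloning of the voice from a reference recording, and all file paths are assumptions for illustration.

```python
from TTS.api import TTS

def synthesize_speech(script_text: str, reference_wav: str, out_path: str) -> str:
    # XTTS v2 clones a voice from a short reference clip, standing in here
    # for the "pronunciation sound attribute" of the patent.
    tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text=script_text,           # the explanation manuscript
        speaker_wav=reference_wav,  # reference recording for the voice
        language="zh-cn",
        file_path=out_path,
    )
    return out_path
```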
In step S4, referring to fig. 2, the system first synthesizes the pronunciation action attribute and the voice information using a character lip-shape generation model to obtain a lip-synchronized character action video, whose lip movements match those of a real person reading the voice information aloud. The system then synthesizes the lip-synchronized character action video with the pronunciation image attribute using a video-driven virtual character model to obtain the virtual character explanation video. When this video is played, it shows a virtual character whose appearance is given by the pronunciation image attribute and whose lips are synchronized with a real person reading the voice information; that is, the virtual character plays the voice information driven by the pronunciation image attribute and the pronunciation action attribute.
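The "character lip-shape generation model" is likewise unnamed in the patent; Wav2Lip is one published model of this kind. The sketch below shells out to that project's inference script; the checkpoint name and CLI flags are taken from the Wav2Lip repository and are quoted here as an assumption.

```python
import subprocess

def lip_sync(face_video: str, speech_wav: str, out_path: str) -> str:
    """Drive a Wav2Lip-style model: re-render the mouth region of the
    action video so its lips match the synthesized speech."""
    subprocess.run(
        [
            "python", "inference.py",
            "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
            "--face", face_video,   # pronunciation action attribute (motion video)
            "--audio", speech_wav,  # synthesized voice information
            "--outfile", out_path,
        ],
        check=True,
    )
    return out_path
```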
In step S4, referring to fig. 2, the system may further apply a video matting model to the virtual character explanation video, separating the virtual character from the background and keeping only the character, thereby obtaining the background-free virtual character explanation video.
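Assuming a video matting model (published examples include RVM and MODNet; the patent names none) has produced a per-frame alpha matte in [0, 1], turning a character frame into a background-free frame is a simple channel operation:

```python
import numpy as np

def to_background_free(frame_bgr: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Attach the matte as an alpha channel, yielding an RGBA frame in
    which the background is fully transparent.

    frame_bgr: HxWx3 uint8 character frame (OpenCV channel order).
    alpha:     HxW matte in [0, 1] produced by the matting model.
    """
    rgb = frame_bgr[..., ::-1]                             # BGR -> RGB
    a = (np.clip(alpha, 0.0, 1.0) * 255).astype(np.uint8)  # HxW uint8 alpha
    return np.dstack([rgb, a])                             # HxWx4 RGBA
```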
In step S5, referring to fig. 2, the image information is expanded or retimed to generate the video information. When the image information is a picture, it can be processed with effects such as zooming in, zooming out, fading in and fading out, and the results arranged along a time axis, thereby expanding the picture into video information.
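A minimal sketch of expanding a still picture into a clip of the required duration follows, here using a slow zoom-in; the frame rate, codec and zoom range are illustrative choices, not prescribed by the patent.

```python
import cv2

def picture_to_video(picture_path: str, duration_s: float, out_path: str,
                     fps: int = 25, zoom_to: float = 1.15) -> str:
    """Expand a still picture into a video whose duration equals the
    speech duration, by cropping and rescaling with a linear zoom-in."""
    img = cv2.imread(picture_path)
    h, w = img.shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    n = max(int(duration_s * fps), 1)
    for i in range(n):
        z = 1.0 + (zoom_to - 1.0) * i / max(n - 1, 1)  # zoom factor for this frame
        ch, cw = int(h / z), int(w / z)                # size of the centered crop
        y0, x0 = (h - ch) // 2, (w - cw) // 2
        writer.write(cv2.resize(img[y0:y0 + ch, x0:x0 + cw], (w, h)))
    writer.release()
    return out_path
```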
When the image information is a video clip whose duration does not match that of the voice information, variable-speed processing may be applied. Specifically, the processing may be applied to only part or all of the image information, to only part or all of the voice information, or to corresponding parts of both, so that the two durations coincide and playback remains synchronized. When the durations differ, "corresponding parts" means a first portion of the image information and a second portion of the voice information that occupy the same relative position on their respective time axes.
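One common way to realize the variable-speed processing is with ffmpeg's setpts (video) and atempo (audio) filters, as sketched below. This assumes ffmpeg is installed and the clip carries both a video and an audio stream; since each atempo stage only accepts factors in [0.5, 2.0], larger changes are chained.

```python
import subprocess

def retime(in_path: str, out_path: str, speed: float) -> str:
    """Change a clip's playback speed by the given factor
    (speed > 1 shortens the clip, speed < 1 lengthens it)."""
    # Build an atempo chain, since each atempo stage is limited to [0.5, 2.0].
    stages, s = [], speed
    while s > 2.0:
        stages.append("atempo=2.0")
        s /= 2.0
    while s < 0.5:
        stages.append("atempo=0.5")
        s /= 0.5
    stages.append(f"atempo={s:.4f}")
    subprocess.run([
        "ffmpeg", "-y", "-i", in_path,
        "-filter:v", f"setpts={1.0 / speed:.6f}*PTS",  # retime the video stream
        "-filter:a", ",".join(stages),                 # retime the audio stream
        out_path,
    ], check=True)
    return out_path
```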
In step S6, referring to fig. 2, either the virtual character explanation video or the background-free virtual character explanation video may be embedded in the video information, and the position of the virtual character within the frame can be chosen freely, thereby embedding the virtual character into the video information.
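Embedding the character then amounts to alpha-compositing each background-free character frame onto the corresponding background frame at the chosen position, e.g.:

```python
import numpy as np

def overlay(background: np.ndarray, character_rgba: np.ndarray,
            x: int, y: int) -> np.ndarray:
    """Alpha-composite a background-free (RGBA) character frame onto a
    background frame with its top-left corner at (x, y). Assumes the
    character fits inside the background at that position and that both
    frames share the same color channel order."""
    h, w = character_rgba.shape[:2]
    out = background.copy()
    region = out[y:y + h, x:x + w].astype(np.float32)
    fg = character_rgba[..., :3].astype(np.float32)
    a = character_rgba[..., 3:4].astype(np.float32) / 255.0
    out[y:y + h, x:x + w] = (a * fg + (1.0 - a) * region).astype(np.uint8)
    return out
```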
Referring to fig. 2, executing steps S1-S6 outputs video information with the virtual character embedded. Because the video information is derived from the image information, it carries the content to be introduced, such as a relic's provenance and history or a scenic area's opening hours, transport routes and sightseeing advice; and because the virtual character is derived from the explanation manuscript, its display simulates a real person reading the manuscript aloud while playing the synchronized voice information. The overall effect is that of a virtual presenter introducing the image information shown as the background, with lifelike qualities such as lip movements matched to the speech and rich facial expressions. This removes the equipment requirements that constrain conventional video recording, overcomes the prior art's reliance on concatenative speech synthesis, remedies its lack of a real-person or virtual cartoon presenter in automatic video generation, and can greatly improve the efficiency of automatic video creation. Beyond virtual-character explanation itself, the method in this embodiment can be applied in fields such as online education, cultural relic interpretation, audiobook reading, intelligent virtual-human media, tour-guide robots, intelligent customer service, catering robots and home robots.
In this embodiment, the method for using a virtual character for automatic video production may be performed by a system for using a virtual character for automatic video production. The system comprises:
the first module is used for determining the attribute of pronunciation sound, the attribute of pronunciation image and the attribute of pronunciation action;
the second module is used for acquiring the image information and the explanation manuscript corresponding to the image information;
the third module is used for synthesizing the pronunciation sound attribute and the explanation manuscript by using a neural network to obtain voice information;
a fourth module for generating a virtual character; the virtual character plays voice information under the drive of the attribute of the pronunciation image and the attribute of the pronunciation action;
a fifth module for generating video information according to the image information;
a sixth module for embedding the virtual character into the video information.
The first to sixth modules may be computer software modules, hardware modules, or combinations of software and hardware modules having the corresponding technical features; when the system runs, it achieves the technical effects of the method for using a virtual character for automatic video production described in this embodiment.
In this embodiment, a storage medium has stored therein processor-executable instructions which, when executed by a processor, perform the method for using a virtual character for automatic video production described in this embodiment, achieving the same technical effects.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. Furthermore, the descriptions of upper, lower, left, right, etc. used in the present disclosure are only relative to the mutual positional relationship of the constituent parts of the present disclosure in the drawings. As used in this disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless defined otherwise, all technical and scientific terms used in this example have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description of the embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this embodiment, the term "and/or" includes any combination of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element of the same type from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language ("e.g.," such as "or the like") provided with this embodiment is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer-readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on an application-specific integrated circuit programmed for this purpose.
Further, operations of processes described in this embodiment can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this embodiment (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this embodiment includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described in this embodiment, transforming the input data to generate output data that is stored in a non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the present invention, the transformed data represents a physical and tangible object, including a particular visual depiction of that object produced on a display.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.

Claims (10)

1. A method for using a virtual character for automatic video production, comprising:
determining a pronunciation sound attribute, a pronunciation image attribute and a pronunciation action attribute;
acquiring image information and an explanation manuscript corresponding to the image information;
synthesizing the pronunciation sound attribute and the explanation manuscript by using a neural network to obtain voice information;
generating a virtual character; the virtual character plays the voice information under the driving of the pronunciation image attribute and the pronunciation action attribute;
generating video information according to the image information;
embedding the virtual character in the video information.
2. The method of claim 1, wherein determining the attributes of the pronunciation sound, the pronunciation image and the pronunciation action comprises:
acquiring a voice sample, a picture sample and a video sample;
selecting and determining the pronunciation sound attribute matched with the voice sample from a database through voiceprint recognition;
manually selecting an image in a database or taking a photo image of a speaker uploaded by a user as the virtual character;
the actions of the virtual character are selected manually or according to the actions of the speaker uploaded by the user.
3. The method of claim 1, wherein the generating a virtual character comprises:
synthesizing the pronunciation action attribute and the voice information by using a character lip-shape generation model to obtain a lip-synchronized character action video;
synthesizing the lip-synchronized character action video and the pronunciation image attribute by using a video-driven virtual character model to obtain a virtual character explanation video; the virtual character explanation video includes the virtual character.
4. The method of claim 3, wherein the generating a virtual character further comprises:
and carrying out cutout processing on the virtual character explanation video by using a video cutout model to obtain a background-free virtual character explanation video.
5. The method of claim 4, wherein the embedding the virtual character into the video information comprises:
embedding the virtual character explanation video into the video information;
or
Embedding the virtual character explanation background-free video into the video information.
6. The method of claim 1, wherein the image information is information in the form of a picture; the generating of the video information according to the image information includes:
expanding the image information into the video information; the duration of the video information is equal to the duration of the voice information.
7. The method of claim 1, wherein the image information is information in the form of a video clip; the generating of the video information according to the image information includes:
carrying out variable-speed processing so that the duration of the image information is equal to the duration of the voice information;
and after the variable-speed processing, taking the image information as the video information.
8. The method of claim 7, wherein the performing of the variable-speed processing comprises:
performing variable-speed processing on only part or all of the image information;
or
performing variable-speed processing on only part or all of the voice information;
or
performing variable-speed processing on corresponding parts of the image information and the voice information.
9. A system for using a virtual character for automatic video production, comprising:
the first module is used for determining the attribute of pronunciation sound, the attribute of pronunciation image and the attribute of pronunciation action;
the second module is used for acquiring image information and an explanation manuscript corresponding to the image information;
the third module is used for synthesizing the pronunciation sound attribute and the explanation manuscript by using a neural network to obtain voice information;
a fourth module for generating a virtual character; the virtual character plays the voice information under the driving of the pronunciation image attribute and the pronunciation action attribute;
a fifth module, configured to generate video information according to the image information;
a sixth module for embedding the virtual character into the video information.
10. A storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the method of any one of claims 1-8.
CN202110434256.3A 2021-04-22 2021-04-22 Method, system and storage medium for using virtual character for automatic video production Pending CN113259778A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110434256.3A CN113259778A (en) 2021-04-22 2021-04-22 Method, system and storage medium for using virtual character for automatic video production

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110434256.3A CN113259778A (en) 2021-04-22 2021-04-22 Method, system and storage medium for using virtual character for automatic video production

Publications (1)

Publication Number Publication Date
CN113259778A 2021-08-13

Family

ID=77221259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110434256.3A Pending CN113259778A (en) 2021-04-22 2021-04-22 Method, system and storage medium for using virtual character for automatic video production

Country Status (1)

Country Link
CN (1) CN113259778A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115016648A (en) * 2022-07-15 2022-09-06 大爱全息(北京)科技有限公司 Holographic interaction device and processing method thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107213642A (en) * 2017-05-12 2017-09-29 北京小米移动软件有限公司 Virtual portrait outward appearance change method and device
CN109118562A (en) * 2018-08-31 2019-01-01 百度在线网络技术(北京)有限公司 Explanation video creating method, device and the terminal of virtual image
CN110381266A (en) * 2019-07-31 2019-10-25 百度在线网络技术(北京)有限公司 A kind of video generation method, device and terminal
CN110913267A (en) * 2019-11-29 2020-03-24 上海赛连信息科技有限公司 Image processing method, device, system, interface, medium and computing equipment
CN111739507A (en) * 2020-05-07 2020-10-02 广东康云科技有限公司 AI-based speech synthesis method, system, device and storage medium
CN112562721A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Video translation method, system, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210813