WO2023050650A1 - Animation video generation method and apparatus, and device and storage medium - Google Patents

Animation video generation method and apparatus, and device and storage medium

Info

Publication number
WO2023050650A1
WO2023050650A1 (PCT/CN2022/071302)
Authority
WO
WIPO (PCT)
Prior art keywords
video
information
text
human body
initial
Prior art date
Application number
PCT/CN2022/071302
Other languages
French (fr)
Chinese (zh)
Inventor
郑喜民
陈振宏
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2023050650A1

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 - Monomedia components thereof

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular to a method, device, equipment and storage medium for generating an animation video.
  • animated video teaching can stimulate students' interest and enthusiasm for learning.
  • animation video teaching has also developed.
  • steps such as story script writing, storyboard design, live anchor shooting, illustration material drawing, animation production and post-editing are involved, resulting in low efficiency of complete animation video generation.
  • because different video producers hold different views on video production, the generation quality of the video cannot be ensured.
  • the first aspect of the present application provides a method for generating an animation video, the method including:
  • when a video generation request is received, acquiring text information according to the video generation request;
  • inputting the text information into a pre-trained video generation model to obtain an initial video;
  • identifying human body feature points of each frame of image in the initial video;
  • generating posture information of the user in each frame of image according to the human body feature points;
  • if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
  • analyzing the text information based on a pre-trained audio generation model to obtain audio information;
  • generating an animation video according to the second video and the audio information.
  • a second aspect of the present application provides an electronic device, the electronic device includes a processor and a memory, and the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
  • when a video generation request is received, acquiring text information according to the video generation request;
  • inputting the text information into a pre-trained video generation model to obtain an initial video;
  • identifying human body feature points of each frame of image in the initial video;
  • generating posture information of the user in each frame of image according to the human body feature points;
  • if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
  • analyzing the text information based on a pre-trained audio generation model to obtain audio information;
  • generating an animation video according to the second video and the audio information.
  • a third aspect of the present application provides a computer-readable storage medium, on which at least one computer-readable instruction is stored, and the at least one computer-readable instruction is executed by a processor to implement the following steps:
  • when a video generation request is received, acquiring text information according to the video generation request;
  • inputting the text information into a pre-trained video generation model to obtain an initial video;
  • identifying human body feature points of each frame of image in the initial video;
  • generating posture information of the user in each frame of image according to the human body feature points;
  • if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
  • analyzing the text information based on a pre-trained audio generation model to obtain audio information;
  • generating an animation video according to the second video and the audio information.
  • a fourth aspect of the present application provides an animation video generation device, the animation video generation device comprising:
  • An acquisition unit configured to acquire text information according to the video generation request when receiving the video generation request
  • An input unit configured to input the text information into a pre-trained video generation model to obtain an initial video
  • a recognition unit configured to recognize the human body feature points of each frame of image in the initial video
  • a generating unit configured to generate posture information of the user in each frame of image according to the human body feature points
  • An adjustment unit configured to adjust the posture information according to the human body feature points to obtain a second video if the posture information is a preset posture
  • An analysis unit configured to analyze the text information based on a pre-trained audio generation model to obtain audio information
  • the generating unit is configured to generate animation video according to the second video and the audio information.
  • It can be seen from the above technical solutions that, by analyzing the text information through the video generation model, the present application can quickly generate the initial video, thereby improving the generation efficiency of the animation video. By identifying the human body feature points, the posture information of the user in each frame of image can be determined accurately, and the posture information is adjusted when it is a preset posture, which prevents bad posture information such as the preset posture from appearing in the second video. Since good posture information can play an educational role for users, avoiding such bad posture information in the second video improves the quality of the second video. In addition, the audio generation model can accurately generate the audio information corresponding to the text information, and generating the animation video from the audio information and the second video improves its generation quality.
  • Fig. 1 is a flowchart of a preferred embodiment of the animation video generation method of the present application.
  • Fig. 2 is a functional block diagram of a preferred embodiment of the animation video generation device of the present application.
  • Fig. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the animation video generation method of the present application.
  • As shown in FIG. 1, it is a flow chart of a preferred embodiment of the animation video generation method of the present application. According to different requirements, the order of the steps in the flowchart can be changed, and some steps can be omitted.
  • the animation video generation method can acquire and process relevant data based on artificial intelligence technology.
  • artificial intelligence is a theory, method, technology and application system that uses digital computers or digital-computer-controlled machines to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the animation video generation method is applied in the field of smart education, thereby promoting the development of smart cities.
  • the animation video generation method is applied to one or more electronic devices. An electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored computer-readable instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.
  • the electronic device may be any electronic product capable of human-machine interaction with the user, for example, a personal computer, a tablet computer, a smart phone, a personal digital assistant (Personal Digital Assistant, PDA), a game console, an interactive Internet protocol television (Internet Protocol Television, IPTV), a smart wearable device, etc.
  • the electronic devices may include network devices and/or user devices.
  • the network device includes, but is not limited to, a single network electronic device, an electronic device group composed of multiple network electronic devices, or a cloud composed of a large number of hosts or network electronic devices based on Cloud Computing.
  • the network where the electronic device is located includes, but is not limited to: the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN) and the like.
  • Depending on the application scenario of the video generation request, the triggering user of the video generation request also differs.
  • For example, if the application scenario of the video generation request is the field of education, the triggering user of the video generation request may be a teacher.
  • the video generation request may include, but is not limited to: a text path, a preset tag, and the like.
  • the text information refers to text information that needs to be converted into a video, for example, the text information may be a teacher's handout.
  • the electronic device acquiring text information according to the video generation request includes:
  • the text information is obtained from the text path.
  • the preset label refers to a label used to indicate a path.
  • for example, the preset label may be "storage location".
  • the text path can be accurately extracted through the preset tag, so that the text information can be accurately obtained, which is beneficial to the generation of corresponding animation videos.
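As an illustration of this request-parsing step, here is a minimal Python sketch. It assumes a JSON request payload and a preset tag literally named "storage location"; the field names, file name and format are assumptions for illustration, not details fixed by the application.

```python
import json
from pathlib import Path

PRESET_TAG = "storage location"  # preset label used to indicate the text path

def get_text_information(request_message: str) -> str:
    # Parse the request message to obtain the data information it carries
    data = json.loads(request_message)
    # Extract the text path from the data information according to the preset tag
    text_path = data[PRESET_TAG]
    # Obtain the text information from the text path
    return Path(text_path).read_text(encoding="utf-8")

# Hypothetical request; the path is a placeholder
request = json.dumps({"storage location": "lecture_notes.txt"})
# text_information = get_text_information(request)
```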
  • the video generation model refers to a model capable of converting text into video.
  • the video generation model includes an encoding layer, a decoding layer, and a preset mapping table.
  • the preset mapping table stores a mapping relationship between pixel values and vectors.
  • the initial video refers to a video generated after the text information is analyzed by the video generation model.
  • the initial video does not contain voice information.
  • before inputting the text information into the pre-trained video generation model to obtain the initial video, the method further includes:
  • each video training sample includes a training video and training text corresponding to the training video
  • the learner includes an encoding layer and a decoding layer
  • mapping processing on the training video based on a preset mapping table to obtain an image vector of the training video
  • the text vector is used to characterize the training text.
  • the learning index is used to evaluate the accuracy of the learner.
  • the network parameters include preset parameters in the encoding layer and the decoding layer.
  • for example, if the encoding layer includes a convolution layer, the network parameter may be the size of a convolution kernel in the convolution layer.
  • the learning index is generated by the similarity between the training text and the predicted video and the similarity between the training text and the training video, and then adjusting the network parameters according to the learning index can improve the performance of the video generation model.
  • the manner in which the electronic device analyzes the text information based on the video generation model is similar to the manner in which the electronic device analyzes the training text based on the learner, and will not be repeated here.
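The learning index described above can be pictured with a short numeric sketch. Cosine similarity is used here as one plausible similarity measure (the application does not fix a specific one), and the random vectors are placeholders standing in for the text vector, the decoder output vector and the image vector.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def learning_index(text_vec, output_vec, image_vec) -> float:
    first_similarity = cosine_similarity(text_vec, output_vec)   # text vs. predicted video
    second_similarity = cosine_similarity(text_vec, image_vec)   # text vs. training video
    # The learning index is the ratio of the first similarity to the second
    return first_similarity / second_similarity

rng = np.random.default_rng(0)
text_vec, output_vec, image_vec = rng.normal(size=(3, 128))
print(learning_index(text_vec, output_vec, image_vec))
```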
  • the human body feature points include, but are not limited to, key feature points of the human face (such as the pupil center), hand joint points, bone joint points, etc.
  • the electronic device identifying the human body feature points of each frame image in the initial video includes:
  • the human body feature points are screened out from the initial feature points according to the initial coordinate information.
  • the preset detector can be used to identify the person information in the image.
  • the preset feature points include hand joint points and bone joint points.
  • the feature gray value may be determined according to pixel information corresponding to preset feature points of multiple preset users.
  • the preset threshold can be set according to requirements.
  • the coordinate system includes an abscissa axis and an ordinate axis.
  • Detecting each frame of image with the preset detector not only eliminates the interference of background information in each frame of image on the human body feature points, improving the recognition accuracy of the human body feature points, but also reduces the number of pixels to be analyzed, improving the recognition efficiency of the human body feature points. Then, by comparing the pixel gray values with the feature gray value, the initial feature points can be determined quickly, and screening by the initial coordinate information of the initial feature points improves the determination accuracy of the human body feature points.
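A minimal sketch of this gray-value comparison follows: given a person region proposed by the preset detector, pixels whose gray value is within the preset threshold of the feature gray value are kept as initial feature points together with their coordinates. The box format and the exact threshold semantics are assumptions.

```python
import numpy as np

def detect_initial_feature_points(gray_frame: np.ndarray,
                                  person_box: tuple,
                                  feature_gray: float,
                                  preset_threshold: float):
    """Return (row, col) initial coordinate information of pixels whose gray
    value is close to the characteristic gray value, searched only inside the
    person region proposed by the preset detector."""
    top, left, bottom, right = person_box
    region = gray_frame[top:bottom, left:right].astype(float)
    mask = np.abs(region - feature_gray) < preset_threshold
    rows, cols = np.nonzero(mask)
    # Shift back to full-frame coordinates (ordinate and abscissa of each point)
    return list(zip(rows + top, cols + left))
```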
  • the electronic device selecting the human body feature points from the initial feature points according to the initial coordinate information includes:
  • a target feature point refers to any one of the initial feature points other than the initial feature point currently under consideration
  • An initial feature point corresponding to a target distance whose probability value is greater than a preset probability value is determined as the human body feature point.
  • the preset probability value can be set according to requirements, for example, the preset probability value can be 99.44%.
  • Through the analysis of the feature distances, the adjacent feature points of any initial feature point can be determined quickly, and by performing normal distribution processing on the target distances and further analyzing their probability values, the human body feature points can be screened out from the initial feature points accurately.
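The normal-distribution screening might look like the sketch below, which fits a normal distribution to the target distances and keeps the points whose probability value exceeds the preset probability value. Reading the "probability value" as the CDF of the fitted distribution is an interpretation for illustration, not something the text pins down.

```python
import numpy as np
from scipy.stats import norm

def screen_human_body_feature_points(initial_points, target_distances,
                                     preset_probability=0.9944):
    d = np.asarray(target_distances, dtype=float)
    mu, sigma = d.mean(), d.std(ddof=1)
    # Probability value of each target distance under the fitted N(mu, sigma^2)
    probability_values = norm.cdf(d, loc=mu, scale=sigma)
    # Keep the initial feature points whose probability value is large enough
    return [p for p, pv in zip(initial_points, probability_values)
            if pv > preset_probability]
```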
  • the posture information refers to the posture of the user in each frame of image; for example, the posture information may be head down or head up.
  • the electronic device generating the gesture information of the user in each frame of image according to the human body feature points includes:
  • a feature point pair refers to any two adjacent feature points obtained from the human body feature points, where two adjacent feature points are the human body feature points with the closest feature distance. For example, for human body feature point A, human body feature point B, human body feature point C and human body feature point D, if the feature distance between A and B is 5, the feature distance between A and C is 2, and the feature distance between A and D is 3, then human body feature point C is the adjacent feature point of human body feature point A.
  • the posture information may be determined according to a mapping table of angles and preset posture information.
  • the preset posture information may be marked by the user.
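One way to realize this step is sketched below: the angle of each adjacent feature point pair is computed, the angles are averaged, and the mean angle is looked up in an angle-to-posture mapping table. The angle ranges in the table are invented for illustration; in the application they would come from user-marked preset posture information.

```python
import math

def pair_angle(p1, p2) -> float:
    """Angle (in degrees) of the line through one feature point pair."""
    return math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0]))

def posture_from_feature_pairs(feature_pairs, posture_table,
                               default="standard posture"):
    # Mean of the pair angles across all feature point pairs in the frame
    mean_angle = sum(pair_angle(a, b) for a, b in feature_pairs) / len(feature_pairs)
    for (low, high), posture in posture_table.items():
        if low <= mean_angle < high:
            return posture
    return default

# Illustrative angle ranges only; real ranges would be user-marked
posture_table = {(-90.0, -15.0): "head down", (15.0, 90.0): "head up"}
print(posture_from_feature_pairs([((0, 0), (4, 3)), ((1, 1), (5, 4))], posture_table))
```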
  • If the posture information is a preset posture, adjust the posture information according to the human body feature points to obtain a second video.
  • the preset postures may include, but are not limited to: bad postures such as bowing the head and raising the head.
  • the user posture of each frame of image in the second video is not the preset posture.
  • the electronic device adjusting the posture information according to the human body feature points to obtain the second video includes:
  • the posture mapping table stores the mapping relationship between a plurality of pieces of preset posture information and angles, and the plurality of pieces of preset posture information includes the standard posture and bad postures such as bowing or raising the head.
  • the plurality of pieces of preset posture information in the posture mapping table can be marked by the user, and the calculation method of the angle in the posture mapping table is similar to the calculation method of the angle mean value in each frame of image, which will not be repeated here.
  • the human body feature points that affect the posture information can be quickly determined and then adjusted, thereby improving the quality of the second video.
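The adjustment itself could be a geometric correction such as the sketch below, which rotates the posture-affecting feature points about their centroid until the measured mean angle matches the standard posture angle from the posture mapping table. This rotation is one plausible reading of the step, not the only possible implementation.

```python
import math

def adjust_feature_points(points, mean_angle_deg, standard_angle_deg):
    """Rotate the offending feature points about their centroid so that the
    measured mean angle matches the standard angle from the posture table."""
    theta = math.radians(standard_angle_deg - mean_angle_deg)
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(cx + (x - cx) * cos_t - (y - cy) * sin_t,
             cy + (x - cx) * sin_t + (y - cy) * cos_t)
            for x, y in points]
```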
  • the audio generation model is used to convert the text information into speech.
  • the audio information refers to voice corresponding to the text information.
  • the audio generation model includes an emotion recognition network layer and a speech conversion network layer
  • the electronic device analyzes the text information based on a pre-trained audio generation model, and the obtained audio information includes:
  • the emotion recognition network layer is used to analyze the emotion corresponding to the text.
  • the text emotion may be happy, sad and so on.
  • the speech conversion network layer is used to convert text into speech.
  • the audio information includes the text emotion, thereby making the audio information more interesting.
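The two-stage audio generation model can be pictured as below. Both layer classes are stand-ins (the application names the layers but no concrete models), and the toy emotion rule and fake waveform are obviously placeholders.

```python
class EmotionRecognitionLayer:
    """Stand-in for the emotion recognition network layer."""
    def predict(self, text: str) -> str:
        # A trained classifier in practice; a toy rule here
        return "happy" if "!" in text else "neutral"

class SpeechConversionLayer:
    """Stand-in for the speech conversion network layer."""
    def synthesize(self, text: str, emotion: str) -> bytes:
        # A real TTS network would condition the generated voice on the emotion
        return f"<waveform emotion={emotion}>{text}</waveform>".encode()

def generate_audio_information(text_information: str) -> bytes:
    text_emotion = EmotionRecognitionLayer().predict(text_information)
    return SpeechConversionLayer().synthesize(text_information, text_emotion)

print(generate_audio_information("Today we learn fractions!"))
```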
  • the animation video refers to a video including the audio information and the second video.
  • the animation video can also be stored in a block chain node.
  • the electronic device generating animation video according to the second video and the audio information includes:
  • if the first duration is not equal to the second duration, obtaining the information with the larger duration from the second video and the audio information as the information to be processed;
  • Compressing the information with the larger duration ensures that the durations of the processed second video and the processed audio information are equal, which allows the processed second video and the processed audio information to be combined directly, thereby improving the generation efficiency of the animation video.
  • the electronic device combining the processed second video and the processed audio information to obtain the animation video includes:
  • the audio track information is replaced with the processed audio information to obtain the animated video.
  • the animation video can be quickly generated.
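The duration alignment and track replacement might be implemented as in the sketch below, which uses the third-party moviepy package as one implementation choice (nothing in the application mandates it): whichever stream is longer is time-compressed so the durations match, and the video's audio track is then replaced with the processed audio information.

```python
from moviepy.editor import AudioFileClip, VideoFileClip, vfx

def merge_into_animation_video(video_path: str, audio_path: str, out_path: str) -> None:
    video = VideoFileClip(video_path)   # second video (first duration)
    audio = AudioFileClip(audio_path)   # audio information (second duration)
    if video.duration > audio.duration:
        # Compress the information with the larger duration: speed the video up
        video = video.fx(vfx.speedx, video.duration / audio.duration)
    elif audio.duration > video.duration:
        audio = audio.fx(vfx.speedx, audio.duration / video.duration)
    # Replace the audio track with the processed audio information
    video.set_audio(audio).write_videofile(out_path)
```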
  • It can be seen from the above technical solutions that, by analyzing the text information through the video generation model, the present application can quickly generate the initial video, thereby improving the generation efficiency of the animation video. By identifying the human body feature points, the posture information of the user in each frame of image can be determined accurately, and the posture information is adjusted when it is a preset posture, which prevents bad posture information such as the preset posture from appearing in the second video. Since good posture information can play an educational role for users, avoiding such bad posture information in the second video improves the quality of the second video. In addition, the audio generation model can accurately generate the audio information corresponding to the text information, and generating the animation video from the audio information and the second video improves its generation quality.
  • the animation video generation device 11 includes an acquisition unit 110, an input unit 111, a recognition unit 112, a generation unit 113, an adjustment unit 114, an analysis unit 115, a construction unit 116, an encoding unit 117, a mapping unit 118 and a calculation unit 119.
  • the module/unit referred to in this application refers to a series of computer-readable instruction segments that can be acquired by the processor 13 and can perform fixed functions, and are stored in the memory 12. In this embodiment, the functions of each module/unit will be described in detail in subsequent embodiments.
  • When receiving a video generation request, the obtaining unit 110 obtains text information according to the video generation request.
  • Depending on the application scenario of the video generation request, the triggering user of the video generation request also differs.
  • For example, if the application scenario of the video generation request is the field of education, the triggering user of the video generation request may be a teacher.
  • the video generation request may include, but is not limited to: a text path, a preset tag, and the like.
  • the text information refers to text information that needs to be converted into a video, for example, the text information may be a teacher's handout.
  • the acquiring unit 110 acquiring text information according to the video generation request includes:
  • the text information is obtained from the text path.
  • the preset label refers to a label used to indicate a path.
  • the preset label may be storage location.
  • the text path can be accurately extracted through the preset tag, so that the text information can be accurately obtained, which is beneficial to the generation of corresponding animation videos.
  • the input unit 111 inputs the text information into a pre-trained video generation model to obtain an initial video.
  • the video generation model refers to a model capable of converting text into video.
  • the video generation model includes an encoding layer, a decoding layer, and a preset mapping table.
  • the preset mapping table stores a mapping relationship between pixel values and vectors.
  • the initial video refers to a video generated after the text information is analyzed by the video generation model.
  • the initial video does not contain voice information.
  • the acquiring unit 110 acquires a plurality of video training samples, each video training sample including a training video and training text corresponding to the training video;
  • the construction unit 116 constructs a learner, wherein the learner includes an encoding layer and a decoding layer;
  • Encoding unit 117 performs text encoding processing on the training text to obtain a text vector
  • the analysis unit 115 analyzes the text vector based on the coding layer to obtain feature information of the training text
  • the analysis unit 115 analyzes the feature information based on the decoding layer to obtain an output vector
  • the mapping unit 118 performs mapping processing on the training video based on a preset mapping table to obtain an image vector of the training video;
  • Calculation unit 119 calculates the similarity between the text vector and the output vector to obtain a first similarity, and calculates the similarity between the text vector and the image vector to obtain a second similarity;
  • the calculation unit 119 calculates the ratio of the first similarity to the second similarity to obtain the learning index of the learner
  • the adjustment unit 114 adjusts the network parameters in the learner until the learning index no longer increases, so as to obtain the video generation model.
  • the text vector is used to characterize the training text.
  • the learning index is used to evaluate the accuracy of the learner.
  • the network parameters include preset parameters in the encoding layer and the decoding layer.
  • for example, if the encoding layer includes a convolution layer, the network parameter may be the size of a convolution kernel in the convolution layer.
  • the learning index is generated by the similarity between the training text and the predicted video and the similarity between the training text and the training video, and then adjusting the network parameters according to the learning index can improve the performance of the video generation model.
  • the manner of analyzing the text information based on the video generation model is similar to the manner of analyzing the training text based on the learner, which will not be repeated in the present application.
  • the identification unit 112 identifies human body feature points of each frame of image in the initial video.
  • the human body feature points include, but are not limited to, key feature points of the human face (such as the pupil center), hand joint points, bone joint points, etc.
  • the identifying unit 112 identifying the human body feature points of each frame image in the initial video includes:
  • the human body feature points are screened out from the initial feature points according to the initial coordinate information.
  • the preset detector can be used to identify the person information in the image.
  • the preset feature points include hand joint points and bone joint points.
  • the feature gray value may be determined according to pixel information corresponding to preset feature points of multiple preset users.
  • the preset threshold can be set according to requirements.
  • the coordinate system includes an abscissa axis and an ordinate axis.
  • Detecting each frame of image with the preset detector not only eliminates the interference of background information in each frame of image on the human body feature points, improving the recognition accuracy of the human body feature points, but also reduces the number of pixels to be analyzed, improving the recognition efficiency of the human body feature points. Then, by comparing the pixel gray values with the feature gray value, the initial feature points can be determined quickly, and screening by the initial coordinate information of the initial feature points improves the determination accuracy of the human body feature points.
  • the identifying unit 112 selecting the human body feature points from the initial feature points according to the initial coordinate information includes:
  • a target feature point refers to any one of the initial feature points other than the initial feature point currently under consideration
  • An initial feature point corresponding to a target distance whose probability value is greater than a preset probability value is determined as the human body feature point.
  • the preset probability value can be set according to requirements, for example, the preset probability value can be 99.44%.
  • Through the analysis of the feature distances, the adjacent feature points of any initial feature point can be determined quickly, and by performing normal distribution processing on the target distances and further analyzing their probability values, the human body feature points can be screened out from the initial feature points accurately.
  • the generating unit 113 generates gesture information of the user in each frame of image according to the human body feature points.
  • the posture information refers to the posture of the user in each frame of image; for example, the posture information may be head down or head up.
  • the generating unit 113 generating the gesture information of the user in each frame of image according to the human body feature points includes:
  • a feature point pair refers to any two adjacent feature points obtained from the human body feature points, where two adjacent feature points are the human body feature points with the closest feature distance. For example, for human body feature point A, human body feature point B, human body feature point C and human body feature point D, if the feature distance between A and B is 5, the feature distance between A and C is 2, and the feature distance between A and D is 3, then human body feature point C is the adjacent feature point of human body feature point A.
  • the posture information may be determined according to a mapping table of angles and preset posture information.
  • the preset posture information may be marked by the user.
  • If the posture information is a preset posture, the adjustment unit 114 adjusts the posture information according to the human body feature points to obtain a second video.
  • the preset postures may include, but are not limited to: bad postures such as bowing the head and raising the head.
  • the user posture of each frame of image in the second video is not the preset posture.
  • the adjustment unit 114 adjusting the posture information according to the human body feature points to obtain the second video includes:
  • the posture mapping table stores the mapping relationship between a plurality of pieces of preset posture information and angles, and the plurality of pieces of preset posture information includes the standard posture and bad postures such as bowing or raising the head.
  • the plurality of pieces of preset posture information in the posture mapping table can be marked by the user, and the calculation method of the angle in the posture mapping table is similar to the calculation method of the angle mean value in each frame of image, which will not be repeated here.
  • the human body feature points that affect the posture information can be quickly determined and then adjusted, thereby improving the quality of the second video.
  • the analysis unit 115 analyzes the text information based on a pre-trained audio generation model to obtain audio information.
  • the audio generation model is used to convert the text information into speech.
  • the audio information refers to voice corresponding to the text information.
  • the audio generation model includes an emotion recognition network layer and a speech conversion network layer
  • the analysis unit 115 analyzes the text information based on a pre-trained audio generation model, and obtains the audio information including:
  • the emotion recognition network layer is used to analyze the emotion corresponding to the text.
  • the text emotion may be happy, sad and so on.
  • the speech conversion network layer is used to convert text into speech.
  • the audio information includes the text emotion, thereby making the audio information more interesting.
  • the generating unit 113 generates an animation video according to the second video and the audio information.
  • the animation video refers to a video including the audio information and the second video.
  • the animation video can also be stored in a block chain node.
  • the generating unit 113 generating animation video according to the second video and the audio information includes:
  • if the first duration is not equal to the second duration, obtaining the information with the larger duration from the second video and the audio information as the information to be processed;
  • Compressing the information with the larger duration ensures that the durations of the processed second video and the processed audio information are equal, which allows the processed second video and the processed audio information to be combined directly, thereby improving the generation efficiency of the animation video.
  • the generating unit 113 merging the processed second video and the processed audio information to obtain the animation video includes:
  • the audio track information is replaced with the processed audio information to obtain the animated video.
  • the animation video can be quickly generated.
  • It can be seen from the above technical solutions that, by analyzing the text information through the video generation model, the present application can quickly generate the initial video, thereby improving the generation efficiency of the animation video. By identifying the human body feature points, the posture information of the user in each frame of image can be determined accurately, and the posture information is adjusted when it is a preset posture, which prevents bad posture information such as the preset posture from appearing in the second video. Since good posture information can play an educational role for users, avoiding such bad posture information in the second video improves the quality of the second video. In addition, the audio generation model can accurately generate the audio information corresponding to the text information, and generating the animation video from the audio information and the second video improves its generation quality.
  • As shown in FIG. 3, it is a schematic structural diagram of an electronic device implementing a preferred embodiment of the animation video generation method of the present application.
  • the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and computer-readable instructions stored in the memory 12 and operable on the processor 13, such as an animation video generation program.
  • the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation on the electronic device 1. It may include more or fewer components than those shown in the illustration, or combine certain components, or have different components; for example, the electronic device 1 may also include input and output devices, network access devices, buses, and the like.
  • the processor 13 can be a central processing unit (Central Processing Unit, CPU), and can also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the processor 13 is the computing core and control center of the electronic device 1, and uses various interfaces and lines to connect the entire electronic device 1, and execute the operating system of the electronic device 1 and various installed applications, program codes, etc.
  • the computer-readable instructions may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 12 and executed by the processor 13 to Complete this application.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of accomplishing specific functions, and the computer-readable instruction segments are used to describe the execution process of the computer-readable instructions in the electronic device 1 .
  • the computer readable instructions may be divided into an acquisition unit 110, an input unit 111, an identification unit 112, a generation unit 113, an adjustment unit 114, an analysis unit 115, a construction unit 116, an encoding unit 117, a mapping unit 118, and a computing unit 119.
  • the memory 12 can be used to store the computer-readable instructions and/or modules, and the processor 13 runs or executes the computer-readable instructions and/or modules stored in the memory 12 and calls the data stored in the memory 12 to realize various functions of the electronic device 1.
  • the memory 12 can mainly include a program storage area and a data storage area, where the program storage area can store an operating system and an application program required by at least one function (such as a sound playback function or an image playback function), and the data storage area can store data created in accordance with the use of the electronic device.
  • The memory 12 can include non-volatile and volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another storage device.
  • the memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory in physical form, such as a memory stick or a TF card (Trans-flash Card).
  • if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium, and the computer-readable storage medium may be a non-volatile storage medium or a volatile storage medium.
  • all or part of the processes in the methods of the above embodiments of the present application can also be completed by instructing related hardware through computer-readable instructions, and the computer-readable instructions can be stored in a computer-readable storage medium; when the computer-readable instructions are executed by a processor, the steps of the above method embodiments can be realized.
  • the computer-readable instructions include computer-readable instruction codes
  • the computer-readable instruction codes may be in the form of source code, object code, executable file or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory).
  • Blockchain, essentially a decentralized database, is a series of data blocks associated with each other using cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the memory 12 in the electronic device 1 stores computer-readable instructions to implement a method for generating animated videos, and the processor 13 can execute the computer-readable instructions to achieve:
  • when a video generation request is received, acquiring text information according to the video generation request;
  • inputting the text information into a pre-trained video generation model to obtain an initial video;
  • identifying human body feature points of each frame of image in the initial video;
  • generating posture information of the user in each frame of image according to the human body feature points;
  • if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
  • analyzing the text information based on a pre-trained audio generation model to obtain audio information;
  • generating an animation video according to the second video and the audio information.
  • Computer-readable instructions are stored on the computer-readable storage medium, wherein the computer-readable instructions are used to implement the following steps when executed by the processor 13:
  • when a video generation request is received, acquiring text information according to the video generation request;
  • inputting the text information into a pre-trained video generation model to obtain an initial video;
  • identifying human body feature points of each frame of image in the initial video;
  • generating posture information of the user in each frame of image according to the human body feature points;
  • if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
  • analyzing the text information based on a pre-trained audio generation model to obtain audio information;
  • generating an animation video according to the second video and the audio information.
  • modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or in the form of hardware plus software function modules.

Abstract

The present application relates to artificial intelligence. Provided are an animation video generation method and apparatus, and a device and a storage medium. The method may comprise: when a video generation request is received, acquiring text information according to the video generation request; inputting the text information into a pre-trained video generation model, so as to obtain an initial video; identifying human body feature points of each frame of image in the initial video; generating posture information of a user in each frame of image according to the human body feature points; if the posture information is a preset posture, adjusting the posture information according to the human body feature points, so as to obtain a second video; analyzing the text information on the basis of a pre-trained audio generation model, so as to obtain audio information; and generating an animation video according to the second video and the audio information. By means of the present application, the generation efficiency and generation quality of an animation video can be improved. In addition, the present application further relates to a blockchain technology, and the animation video can be stored in a blockchain.

Description

Animation Video Generation Method, Device, Equipment and Storage Medium

This application claims priority to the Chinese patent application filed with the China Patent Office on September 29, 2021, with application number 202111152667.X and entitled "Animation Video Generation Method, Device, Equipment, and Storage Medium", the entire content of which is incorporated into this application by reference.
Technical Field

The present application relates to the technical field of artificial intelligence, and in particular to a method, device, equipment and storage medium for generating an animation video.
Background

In educational scenarios for children and students, animated video teaching can stimulate students' interest in and enthusiasm for learning. With the development of artificial intelligence, animation video teaching has also developed. However, the inventor realized that the current animation video generation process involves steps such as story script writing, storyboard design, live anchor shooting, illustration material drawing, animation production and post-editing, resulting in low efficiency of complete animation video generation. In addition, because different video producers hold different views on video production, the generation quality of the video cannot be ensured.
Summary

In view of the above, it is necessary to provide an animation video generation method, device, equipment and storage medium that can improve the generation efficiency and generation quality of animation videos.
The first aspect of the present application provides an animation video generation method, which includes:

when a video generation request is received, acquiring text information according to the video generation request;

inputting the text information into a pre-trained video generation model to obtain an initial video;

identifying human body feature points of each frame of image in the initial video;

generating posture information of the user in each frame of image according to the human body feature points;

if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;

analyzing the text information based on a pre-trained audio generation model to obtain audio information; and

generating an animation video according to the second video and the audio information.
The second aspect of the present application provides an electronic device, which includes a processor and a memory, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:

when a video generation request is received, acquiring text information according to the video generation request;

inputting the text information into a pre-trained video generation model to obtain an initial video;

identifying human body feature points of each frame of image in the initial video;

generating posture information of the user in each frame of image according to the human body feature points;

if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;

analyzing the text information based on a pre-trained audio generation model to obtain audio information; and

generating an animation video according to the second video and the audio information.
The third aspect of the present application provides a computer-readable storage medium storing at least one computer-readable instruction, the at least one computer-readable instruction being executed by a processor to implement the following steps:

when a video generation request is received, acquiring text information according to the video generation request;

inputting the text information into a pre-trained video generation model to obtain an initial video;

identifying human body feature points of each frame of image in the initial video;

generating posture information of the user in each frame of image according to the human body feature points;

if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;

analyzing the text information based on a pre-trained audio generation model to obtain audio information; and

generating an animation video according to the second video and the audio information.
The fourth aspect of the present application provides an animation video generation device, which includes:

an acquisition unit, configured to acquire text information according to a video generation request when the video generation request is received;

an input unit, configured to input the text information into a pre-trained video generation model to obtain an initial video;

a recognition unit, configured to identify human body feature points of each frame of image in the initial video;

a generating unit, configured to generate posture information of the user in each frame of image according to the human body feature points;

an adjustment unit, configured to adjust the posture information according to the human body feature points to obtain a second video if the posture information is a preset posture;

an analysis unit, configured to analyze the text information based on a pre-trained audio generation model to obtain audio information;

the generating unit being further configured to generate an animation video according to the second video and the audio information.
It can be seen from the above technical solutions that, by analyzing the text information through the video generation model, the present application can quickly generate the initial video, thereby improving the generation efficiency of the animation video. By identifying the human body feature points, the posture information of the user in each frame of image can be determined accurately, and the posture information is adjusted when it is a preset posture, which prevents bad posture information such as the preset posture from appearing in the second video. Since good posture information can play an educational role for users, avoiding such bad posture information in the second video improves the quality of the second video. In addition, the audio generation model can accurately generate the audio information corresponding to the text information, and generating the animation video from the audio information and the second video improves its generation quality.
Description of Drawings

FIG. 1 is a flowchart of a preferred embodiment of the animation video generation method of the present application.

FIG. 2 is a functional block diagram of a preferred embodiment of the animation video generation device of the present application.

FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the animation video generation method of the present application.
Detailed Description

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in detail below in conjunction with the accompanying drawings and specific embodiments.

As shown in FIG. 1, it is a flowchart of a preferred embodiment of the animation video generation method of the present application. According to different requirements, the order of the steps in the flowchart can be changed, and some steps can be omitted.

The animation video generation method can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or digital-computer-controlled machines to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.

Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

The animation video generation method is applied in the field of smart education, thereby promoting the development of smart cities. The animation video generation method is applied to one or more electronic devices. An electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored computer-readable instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.

The electronic device may be any electronic product capable of human-machine interaction with the user, for example, a personal computer, a tablet computer, a smart phone, a personal digital assistant (Personal Digital Assistant, PDA), a game console, an interactive Internet protocol television (Internet Protocol Television, IPTV), a smart wearable device, etc.

The electronic device may include a network device and/or a user device. The network device includes, but is not limited to, a single network electronic device, an electronic device group composed of multiple network electronic devices, or a cloud composed of a large number of hosts or network electronic devices based on cloud computing.

The network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
S10: When a video generation request is received, text information is acquired according to the request.
In at least one embodiment of the present application, the triggering user of the video generation request differs with the application scenario. For example, if the application scenario is the field of education, the triggering user may be a teacher.
The video generation request may include, but is not limited to, a text path and a preset tag.
The text information is the text that needs to be converted into a video; for example, it may be a teacher's lecture notes.
In at least one embodiment of the present application, the electronic device acquiring text information according to the video generation request includes:
parsing the message of the video generation request to obtain the data information carried by the message;
extracting the text path from the data information according to the preset tag;
acquiring the text information from the text path.
The preset tag is a tag used to indicate a path; for example, it may be storage location.
Through the preset tag, the text path can be extracted accurately, so the text information can be obtained accurately, which facilitates generation of the corresponding animation video.
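As an illustrative, non-limiting sketch of this step: assuming the request message carries a JSON payload and the preset tag is literally named "storage location" (both assumptions; the method does not prescribe a message format), the extraction might look as follows.

```python
import json

def extract_text_info(request_message: str, preset_tag: str = "storage location") -> str:
    """Parse the video generation request, pull the text path out of the
    payload via the preset tag, and read the text information from it.
    The JSON payload and the tag name are illustrative assumptions."""
    data = json.loads(request_message)           # data information carried by the message
    text_path = data[preset_tag]                 # extract the text path via the preset tag
    with open(text_path, encoding="utf-8") as f:
        return f.read()                          # the text information to be converted
```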
S11: The text information is input into a pre-trained video generation model to obtain an initial video.
In at least one embodiment of the present application, the video generation model is a model capable of converting text into video. It includes an encoding layer, a decoding layer, and a preset mapping table, where the preset mapping table stores a mapping relationship between pixel values and vectors.
The initial video is the video generated after the video generation model analyzes the text information. The initial video contains no voice information.
In at least one embodiment of the present application, before the text information is input into the pre-trained video generation model to obtain the initial video, the method further includes:
acquiring a plurality of video training samples, each comprising a training video and the training text corresponding to that video;
constructing a learner, where the learner includes an encoding layer and a decoding layer;
performing text encoding on the training text to obtain a text vector;
analyzing the text vector based on the encoding layer to obtain feature information of the training text;
analyzing the feature information based on the decoding layer to obtain an output vector;
mapping the training video based on the preset mapping table to obtain an image vector of the training video;
calculating the similarity between the text vector and the output vector to obtain a first similarity, and calculating the similarity between the text vector and the image vector to obtain a second similarity;
calculating the ratio of the first similarity to the second similarity to obtain the learning index of the learner;
adjusting the network parameters in the learner until the learning index no longer increases, to obtain the video generation model.
The text vector is used to characterize the training text.
The learning index is used to evaluate the accuracy of the learner.
The network parameters are the preset parameters in the encoding layer and the decoding layer. For example, if the encoding layer contains a convolutional layer, a network parameter may be the size of its convolution kernel.
Generating the learning index from the similarity between the training text and the predicted video and the similarity between the training text and the training video, and then adjusting the network parameters according to that index, improves the model's ability to represent text information and thus the accuracy of video generation.
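A minimal sketch of the learning index described above, assuming cosine similarity as the (unspecified) similarity measure and NumPy vectors for the text, output, and image representations:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; the method does not fix a similarity measure,
    so this choice is an assumption."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def learning_index(text_vec: np.ndarray, output_vec: np.ndarray,
                   image_vec: np.ndarray) -> float:
    first = cosine(text_vec, output_vec)    # first similarity: text vs. learner output
    second = cosine(text_vec, image_vec)    # second similarity: text vs. training video
    return first / second                   # ratio used as the learning index
```

Training would then adjust the encoder and decoder parameters and stop once this index no longer increases between iterations.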
In at least one embodiment of the present application, the way the electronic device analyzes the text information based on the video generation model is similar to the way it analyzes the training text based on the learner, and is not repeated here.
S12: Human body feature points are identified in each frame of the initial video.
In at least one embodiment of the present application, the human body feature points include, but are not limited to, key facial feature points such as the pupil centers, as well as hand joint points and bone joint points.
In at least one embodiment of the present application, the electronic device identifying the human body feature points of each frame in the initial video includes:
detecting each frame with a preset detector to obtain the human body region in the frame;
performing grayscale processing on the human body region to obtain its pixels and the grayscale value of each pixel;
calculating the pixel difference between each pixel and a preset feature point according to the pixel grayscale value and the feature grayscale value of the preset feature point;
determining the pixels whose pixel difference is smaller than a preset threshold as initial feature points;
constructing a coordinate system based on each frame and acquiring the initial coordinate information of the initial feature points in that frame;
screening the human body feature points out of the initial feature points according to the initial coordinate information.
The preset detector can be used to identify person information in an image.
The preset feature points include hand joint points, bone joint points, and the like. The feature grayscale value may be determined from the pixel information corresponding to the preset feature points of multiple preset users.
The preset threshold can be set as required.
The coordinate system includes an abscissa axis and an ordinate axis.
Detecting each frame with the preset detector removes the interference of background information with the human body feature points, improving recognition accuracy, and also reduces the number of pixels to analyze, improving recognition efficiency. Comparing the pixel grayscale values with the feature grayscale value then quickly determines the initial feature points, and the initial coordinate information of those points improves the accuracy with which the human body feature points are determined.
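A sketch of the grayscale screening described above, assuming the preset detector has already returned a grayscale crop of the human body region and that a single feature grayscale value is compared against (both simplifications for illustration):

```python
import numpy as np

def initial_feature_points(body_region: np.ndarray,
                           feature_gray: float,
                           threshold: float) -> np.ndarray:
    """body_region: 2-D grayscale crop of the detected human body region.
    Returns the (x, y) coordinates, in the per-frame coordinate system,
    of pixels whose grayscale difference from the preset feature
    grayscale value is below the preset threshold."""
    diff = np.abs(body_region.astype(np.float32) - feature_gray)  # per-pixel difference
    ys, xs = np.nonzero(diff < threshold)        # pixels under the preset threshold
    return np.stack([xs, ys], axis=1)            # initial feature points as (x, y)
```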
In at least one embodiment of the present application, the electronic device screening the human body feature points out of the initial feature points according to the initial coordinate information includes:
for any initial feature point, calculating the feature distance between that point and each target feature point according to the initial coordinate information, where the target feature points are all initial feature points other than the point in question;
determining the smallest feature distance as the target distance, and determining the target feature point corresponding to the target distance as the adjacent feature point of that initial feature point;
performing normal distribution processing on the target distances to obtain a probability value for each target distance;
determining the initial feature points whose target-distance probability value is greater than a preset probability value as the human body feature points.
The preset probability value can be set as required; for example, it may be 99.44%.
Analyzing the feature distances quickly determines the adjacent feature point of each initial feature point, and applying normal distribution processing to the target distances and analyzing their probability values accurately screens the human body feature points out of the initial feature points.
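A sketch of this screening, with one deliberate assumption: the method does not specify what its "normal distribution processing" computes, so a normal CDF fitted to the nearest-neighbour distances is used as the probability value here.

```python
import numpy as np
from scipy.stats import norm

def filter_body_points(points: np.ndarray, preset_prob: float = 0.9944) -> np.ndarray:
    """points: (N, 2) initial coordinates. Keeps the points whose
    target-distance probability value exceeds the preset probability."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))     # pairwise feature distances
    np.fill_diagonal(dist, np.inf)                # ignore self-distances
    target = dist.min(axis=1)                     # target distance: nearest neighbour
    mu, sigma = target.mean(), target.std() + 1e-8
    prob = norm.cdf(target, loc=mu, scale=sigma)  # assumed "normal distribution processing"
    return points[prob > preset_prob]             # points above the preset probability
```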
S13: The user's posture information in each frame is generated according to the human body feature points.
In at least one embodiment of the present application, the posture information is the posture of the user in each frame; for example, it may be head down or head up.
In at least one embodiment of the present application, the electronic device generating the user's posture information in each frame according to the human body feature points includes:
acquiring the coordinate information of the human body feature points in the coordinate system as human body coordinate information;
taking any two adjacent feature points among the human body feature points as a feature point pair;
calculating the Euler angle of each feature point pair according to the human body coordinate information and the abscissa axis of the coordinate system;
calculating the average of the Euler angles to obtain the angle mean, and determining the preset posture information corresponding to the angle mean as the posture information.
A feature point pair consists of any two adjacent feature points among the human body feature points, where "adjacent" means closest in feature distance. For example, given human body feature points A, B, C, and D, if the feature distance between A and B is 5, between A and C is 2, and between A and D is 3, then C is the adjacent feature point of A.
Calculating the Euler angles only between adjacent feature points avoids interference from feature points that are far apart, improving the accuracy of the posture information.
Specifically, the posture information may be determined from a mapping table of angles to preset posture information, where the preset posture information may be annotated by the user.
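A sketch of the per-frame posture estimate, using the in-plane angle of each adjacent pair against the frame's abscissa axis as a stand-in for the Euler angle, and an illustrative angle-to-posture mapping table (the real table is annotated by the user):

```python
import math

# Illustrative mapping table; ranges and labels are assumptions.
POSE_TABLE = [(-90.0, -30.0, "head down"),
              (-30.0, 30.0, "standard"),
              (30.0, 90.0, "head up")]

def posture_from_pairs(pairs):
    """pairs: list of ((x1, y1), (x2, y2)) adjacent feature point pairs.
    Returns the per-pair angles, their mean, and the mapped posture."""
    angles = [math.degrees(math.atan2(y2 - y1, x2 - x1))
              for (x1, y1), (x2, y2) in pairs]
    mean_angle = sum(angles) / len(angles)        # the angle mean
    posture = next((name for lo, hi, name in POSE_TABLE if lo <= mean_angle < hi),
                   "unknown")
    return angles, mean_angle, posture
```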
S14: If the posture information is a preset posture, the posture information is adjusted according to the human body feature points to obtain a second video.
In at least one embodiment of the present application, the preset postures may include, but are not limited to, bad postures such as head down and head up.
In the second video, the user posture in every frame is not the preset posture.
In at least one embodiment of the present application, the electronic device adjusting the posture information according to the human body feature points to obtain the second video includes:
acquiring the posture angle of the standard posture from a posture mapping table;
comparing the angle mean with the posture angle;
if the angle mean is greater than the posture angle, comparing each Euler angle with the angle mean;
taking the feature point pairs whose Euler angle is greater than the angle mean as the feature points to be processed;
adjusting the positions of the feature points to be processed in the image until the adjusted posture information is no longer a preset posture, obtaining the second video.
The posture mapping table stores the mapping relationships between multiple pieces of preset posture information and angles; the preset posture information includes the standard posture and bad postures such as pitching. The preset posture information in the table may be annotated by the user, and the angles in the table are calculated in the same way as the per-frame angle mean, which is not repeated here.
Comparing the angle mean with the posture angle, and the Euler angles with the angle mean, quickly determines the human body feature points that affect the posture information so that they can be adjusted, improving the quality of the second video.
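A sketch of the adjustment step using the selection logic above; the method does not prescribe how a feature point is moved, so the small vertical nudge below is purely illustrative:

```python
def adjust_posture(points, pairs, angles, mean_angle, standard_angle, step=1.0):
    """points: list of [x, y] coordinates; pairs: (i, j) index pairs into
    points; angles: per-pair Euler angles. Nudges the pairs whose angle
    exceeds the angle mean; in the full method this repeats until the
    recomputed posture is no longer a preset posture."""
    if mean_angle <= standard_angle:
        return points                       # posture already within the standard
    for (i, j), angle in zip(pairs, angles):
        if angle > mean_angle:              # feature point pair to be processed
            points[i][1] -= step            # assumed, illustrative displacement
            points[j][1] -= step
    return points
```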
S15: The text information is analyzed based on a pre-trained audio generation model to obtain audio information.
In at least one embodiment of the present application, the audio generation model is used to convert the text information into speech.
The audio information is the speech corresponding to the text information.
In at least one embodiment of the present application, the audio generation model includes an emotion recognition network layer and a speech conversion network layer, and the electronic device analyzing the text information based on the pre-trained audio generation model to obtain the audio information includes:
analyzing the text information based on the emotion recognition network layer to obtain the text emotion of the text information;
acquiring the emotional voice features of the text emotion from a voice feature library;
processing the text information based on the speech conversion network layer to obtain voice information, and extracting text voice features from the voice information;
performing audio mixing on the text voice features and the emotional voice features to obtain the audio information.
The emotion recognition network layer is used to analyze the emotion corresponding to a text. The text emotion may be happy, sad, and so on.
The speech conversion network layer is used to convert text into speech.
Mixing the text voice features with the emotional voice features embeds the text emotion in the audio information, making the audio information more engaging.
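A pipeline sketch of this step. All four callables are stand-ins for the trained network layers, the voice feature library, and the mixer described above; none of them name a real library API.

```python
def generate_audio(text, emotion_layer, conversion_layer, voice_feature_bank, mix):
    """Hypothetical components: emotion_layer and conversion_layer stand in
    for the two trained network layers, voice_feature_bank for the voice
    feature library, and mix for the audio mixing step."""
    text_emotion = emotion_layer(text)               # e.g. "happy" or "sad"
    emotion_feat = voice_feature_bank[text_emotion]  # emotional voice features
    voice_info = conversion_layer(text)              # plain speech for the text
    return mix(voice_info, emotion_feat)             # mixed audio information
```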
S16: An animation video is generated according to the second video and the audio information.
In at least one embodiment of the present application, the animation video is a video containing the audio information and the second video.
It should be emphasized that, to further ensure the privacy and security of the animation video, the animation video may also be stored in a node of a blockchain.
In at least one embodiment of the present application, the electronic device generating the animation video according to the second video and the audio information includes:
counting the duration of the second video to obtain a first duration;
counting the duration of the audio information to obtain a second duration;
if the first duration is not equal to the second duration, taking whichever of the second video and the audio information has the larger duration as the information to be processed;
compressing the information to be processed until the durations of the processed second video and the processed audio information are equal;
merging the processed second video and the processed audio information to obtain the animation video.
In this embodiment, when the first duration and the second duration are not equal, compressing the longer of the two ensures that the processed second video and processed audio information have equal durations, so they can be merged directly, improving the generation efficiency of the animation video.
Specifically, the electronic device merging the processed second video and the processed audio information to obtain the animation video includes:
acquiring the sound track information of the processed second video in the sound track dimension;
replacing the sound track information with the processed audio information to obtain the animation video.
Replacing the sound track information with the processed audio information generates the animation video quickly.
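A sketch of duration alignment and sound-track replacement using moviepy as an illustrative tool (the method names no library; the moviepy 1.x API is assumed). Only the video-longer branch is shown; the audio-longer case would be time-compressed analogously.

```python
from moviepy.editor import VideoFileClip, AudioFileClip
from moviepy.video.fx.all import speedx

def merge_to_animation(video_path: str, audio_path: str, out_path: str) -> None:
    video = VideoFileClip(video_path)     # the processed second video
    audio = AudioFileClip(audio_path)     # the processed audio information
    if video.duration > audio.duration:
        # compress the longer item so both durations are equal
        video = speedx(video, factor=video.duration / audio.duration)
    final = video.set_audio(audio)        # replace the sound-track information
    final.write_videofile(out_path)       # the resulting animation video
```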
As the above technical solution shows, the present application analyzes the text information with the video generation model to generate the initial video quickly, improving the generation efficiency of the animation video. Identifying the human body feature points accurately determines the user's posture information in each frame, and adjusting that information when it matches a preset posture keeps bad postures such as the preset posture out of the second video. Because good posture information has an educational effect on users, avoiding bad posture information improves the quality of the second video. The audio generation model accurately generates the audio information corresponding to the text information, and the audio information together with the second video improves the generation quality of the animation video.
As shown in FIG. 2, which is a functional block diagram of a preferred embodiment of the animation video generation apparatus of the present application, the animation video generation apparatus 11 includes an acquisition unit 110, an input unit 111, a recognition unit 112, a generation unit 113, an adjustment unit 114, an analysis unit 115, a construction unit 116, an encoding unit 117, a mapping unit 118, and a calculation unit 119. A module/unit in this application is a series of computer-readable instruction segments that can be fetched by the processor 13 to perform a fixed function, and is stored in the memory 12. In this embodiment, the function of each module/unit is detailed in the following embodiments.
When a video generation request is received, the acquisition unit 110 acquires text information according to the request.
In at least one embodiment of the present application, the triggering user of the video generation request differs with the application scenario; for example, if the application scenario is the field of education, the triggering user may be a teacher.
The video generation request may include, but is not limited to, a text path and a preset tag.
The text information is the text that needs to be converted into a video; for example, it may be a teacher's lecture notes.
In at least one embodiment of the present application, the acquisition unit 110 acquiring text information according to the video generation request includes:
parsing the message of the video generation request to obtain the data information carried by the message;
extracting the text path from the data information according to the preset tag;
acquiring the text information from the text path.
The preset tag is a tag used to indicate a path; for example, it may be storage location.
Through the preset tag, the text path can be extracted accurately, so the text information can be obtained accurately, which facilitates generation of the corresponding animation video.
The input unit 111 inputs the text information into a pre-trained video generation model to obtain an initial video.
In at least one embodiment of the present application, the video generation model is a model capable of converting text into video. It includes an encoding layer, a decoding layer, and a preset mapping table, where the preset mapping table stores a mapping relationship between pixel values and vectors.
The initial video is the video generated after the video generation model analyzes the text information. The initial video contains no voice information.
In at least one embodiment of the present application, before the text information is input into the pre-trained video generation model to obtain the initial video, the acquisition unit 110 acquires a plurality of video training samples, each comprising a training video and the training text corresponding to that video;
the construction unit 116 constructs a learner, where the learner includes an encoding layer and a decoding layer;
the encoding unit 117 performs text encoding on the training text to obtain a text vector;
the analysis unit 115 analyzes the text vector based on the encoding layer to obtain feature information of the training text;
the analysis unit 115 analyzes the feature information based on the decoding layer to obtain an output vector;
the mapping unit 118 maps the training video based on a preset mapping table to obtain an image vector of the training video;
the calculation unit 119 calculates the similarity between the text vector and the output vector to obtain a first similarity, and calculates the similarity between the text vector and the image vector to obtain a second similarity;
the calculation unit 119 calculates the ratio of the first similarity to the second similarity to obtain the learning index of the learner;
the adjustment unit 114 adjusts the network parameters in the learner until the learning index no longer increases, to obtain the video generation model.
The text vector is used to characterize the training text.
The learning index is used to evaluate the accuracy of the learner.
The network parameters are the preset parameters in the encoding layer and the decoding layer. For example, if the encoding layer contains a convolutional layer, a network parameter may be the size of its convolution kernel.
Generating the learning index from the similarity between the training text and the predicted video and the similarity between the training text and the training video, and then adjusting the network parameters according to that index, improves the model's ability to represent text information and thus the accuracy of video generation.
In at least one embodiment of the present application, the way the text information is analyzed based on the video generation model is similar to the way the training text is analyzed based on the learner, and is not repeated here.
The recognition unit 112 identifies human body feature points in each frame of the initial video.
In at least one embodiment of the present application, the human body feature points include, but are not limited to, key facial feature points such as the pupil centers, as well as hand joint points and bone joint points.
In at least one embodiment of the present application, the recognition unit 112 identifying the human body feature points of each frame in the initial video includes:
detecting each frame with a preset detector to obtain the human body region in the frame;
performing grayscale processing on the human body region to obtain its pixels and the grayscale value of each pixel;
calculating the pixel difference between each pixel and a preset feature point according to the pixel grayscale value and the feature grayscale value of the preset feature point;
determining the pixels whose pixel difference is smaller than a preset threshold as initial feature points;
constructing a coordinate system based on each frame and acquiring the initial coordinate information of the initial feature points in that frame;
screening the human body feature points out of the initial feature points according to the initial coordinate information.
The preset detector can be used to identify person information in an image.
The preset feature points include hand joint points, bone joint points, and the like. The feature grayscale value may be determined from the pixel information corresponding to the preset feature points of multiple preset users.
The preset threshold can be set as required.
The coordinate system includes an abscissa axis and an ordinate axis.
Detecting each frame with the preset detector removes the interference of background information with the human body feature points, improving recognition accuracy, and also reduces the number of pixels to analyze, improving recognition efficiency. Comparing the pixel grayscale values with the feature grayscale value then quickly determines the initial feature points, and the initial coordinate information of those points improves the accuracy with which the human body feature points are determined.
In at least one embodiment of the present application, the recognition unit 112 screening the human body feature points out of the initial feature points according to the initial coordinate information includes:
for any initial feature point, calculating the feature distance between that point and each target feature point according to the initial coordinate information, where the target feature points are all initial feature points other than the point in question;
determining the smallest feature distance as the target distance, and determining the target feature point corresponding to the target distance as the adjacent feature point of that initial feature point;
performing normal distribution processing on the target distances to obtain a probability value for each target distance;
determining the initial feature points whose target-distance probability value is greater than a preset probability value as the human body feature points.
The preset probability value can be set as required; for example, it may be 99.44%.
Analyzing the feature distances quickly determines the adjacent feature point of each initial feature point, and applying normal distribution processing to the target distances and analyzing their probability values accurately screens the human body feature points out of the initial feature points.
The generation unit 113 generates the user's posture information in each frame according to the human body feature points.
In at least one embodiment of the present application, the posture information is the posture of the user in each frame; for example, it may be head down or head up.
In at least one embodiment of the present application, the generation unit 113 generating the user's posture information in each frame according to the human body feature points includes:
acquiring the coordinate information of the human body feature points in the coordinate system as human body coordinate information;
taking any two adjacent feature points among the human body feature points as a feature point pair;
calculating the Euler angle of each feature point pair according to the human body coordinate information and the abscissa axis of the coordinate system;
calculating the average of the Euler angles to obtain the angle mean, and determining the preset posture information corresponding to the angle mean as the posture information.
A feature point pair consists of any two adjacent feature points among the human body feature points, where "adjacent" means closest in feature distance. For example, given human body feature points A, B, C, and D, if the feature distance between A and B is 5, between A and C is 2, and between A and D is 3, then C is the adjacent feature point of A.
Calculating the Euler angles only between adjacent feature points avoids interference from feature points that are far apart, improving the accuracy of the posture information.
Specifically, the posture information may be determined from a mapping table of angles to preset posture information, where the preset posture information may be annotated by the user.
If the posture information is a preset posture, the adjustment unit 114 adjusts the posture information according to the human body feature points to obtain a second video.
In at least one embodiment of the present application, the preset postures may include, but are not limited to, bad postures such as head down and head up.
In the second video, the user posture in every frame is not the preset posture.
In at least one embodiment of the present application, the adjustment unit 114 adjusting the posture information according to the human body feature points to obtain the second video includes:
acquiring the posture angle of the standard posture from a posture mapping table;
comparing the angle mean with the posture angle;
if the angle mean is greater than the posture angle, comparing each Euler angle with the angle mean;
taking the feature point pairs whose Euler angle is greater than the angle mean as the feature points to be processed;
adjusting the positions of the feature points to be processed in the image until the adjusted posture information is no longer a preset posture, obtaining the second video.
The posture mapping table stores the mapping relationships between multiple pieces of preset posture information and angles; the preset posture information includes the standard posture and bad postures such as pitching. The preset posture information in the table may be annotated by the user, and the angles in the table are calculated in the same way as the per-frame angle mean, which is not repeated here.
Comparing the angle mean with the posture angle, and the Euler angles with the angle mean, quickly determines the human body feature points that affect the posture information so that they can be adjusted, improving the quality of the second video.
The analysis unit 115 analyzes the text information based on a pre-trained audio generation model to obtain audio information.
In at least one embodiment of the present application, the audio generation model is used to convert the text information into speech.
The audio information is the speech corresponding to the text information.
In at least one embodiment of the present application, the audio generation model includes an emotion recognition network layer and a speech conversion network layer, and the analysis unit 115 analyzing the text information based on the pre-trained audio generation model to obtain the audio information includes:
analyzing the text information based on the emotion recognition network layer to obtain the text emotion of the text information;
acquiring the emotional voice features of the text emotion from a voice feature library;
processing the text information based on the speech conversion network layer to obtain voice information, and extracting text voice features from the voice information;
performing audio mixing on the text voice features and the emotional voice features to obtain the audio information.
The emotion recognition network layer is used to analyze the emotion corresponding to a text. The text emotion may be happy, sad, and so on.
The speech conversion network layer is used to convert text into speech.
Mixing the text voice features with the emotional voice features embeds the text emotion in the audio information, making the audio information more engaging.
The generation unit 113 generates an animation video according to the second video and the audio information.
In at least one embodiment of the present application, the animation video is a video containing the audio information and the second video.
It should be emphasized that, to further ensure the privacy and security of the animation video, the animation video may also be stored in a node of a blockchain.
In at least one embodiment of the present application, the generation unit 113 generating the animation video according to the second video and the audio information includes:
counting the duration of the second video to obtain a first duration;
counting the duration of the audio information to obtain a second duration;
if the first duration is not equal to the second duration, taking whichever of the second video and the audio information has the larger duration as the information to be processed;
compressing the information to be processed until the durations of the processed second video and the processed audio information are equal;
merging the processed second video and the processed audio information to obtain the animation video.
In this embodiment, when the first duration and the second duration are not equal, compressing the longer of the two ensures that the processed second video and processed audio information have equal durations, so they can be merged directly, improving the generation efficiency of the animation video.
Specifically, the generation unit 113 merging the processed second video and the processed audio information to obtain the animation video includes:
acquiring the sound track information of the processed second video in the sound track dimension;
replacing the sound track information with the processed audio information to obtain the animation video.
Replacing the sound track information with the processed audio information generates the animation video quickly.
As the above technical solution shows, the present application analyzes the text information with the video generation model to generate the initial video quickly, improving the generation efficiency of the animation video. Identifying the human body feature points accurately determines the user's posture information in each frame, and adjusting that information when it matches a preset posture keeps bad postures such as the preset posture out of the second video. Because good posture information has an educational effect on users, avoiding bad posture information improves the quality of the second video. The audio generation model accurately generates the audio information corresponding to the text information, and the audio information together with the second video improves the generation quality of the animation video.
As shown in FIG. 3, which is a schematic structural diagram of an electronic device implementing a preferred embodiment of the animation video generation method of the present application.
In one embodiment of the present application, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and computer-readable instructions stored in the memory 12 and executable on the processor 13, for example an animation video generation program.
Those skilled in the art can understand that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation on it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device 1 may also include input/output devices, network access devices, buses, and the like.
The processor 13 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor 13 is the computing core and control center of the electronic device 1; it connects all parts of the device through various interfaces and lines, and executes the operating system of the electronic device 1 as well as the installed applications and program code.
Exemplarily, the computer-readable instructions may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to complete the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing a specific function, the segments describing the execution of the computer-readable instructions in the electronic device 1. For example, the computer-readable instructions may be divided into the acquisition unit 110, input unit 111, recognition unit 112, generation unit 113, adjustment unit 114, analysis unit 115, construction unit 116, encoding unit 117, mapping unit 118, and calculation unit 119.
The memory 12 may be used to store the computer-readable instructions and/or modules. The processor 13 implements the various functions of the electronic device 1 by running or executing the computer-readable instructions and/or modules stored in the memory 12 and invoking the data stored in it. The memory 12 may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the applications required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created according to the use of the electronic device. The memory 12 may include non-volatile and volatile memory, for example a hard disk, internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another storage device.
The memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory with a physical form, such as a memory stick or a trans-flash (TF) card.
If the integrated modules/units of the electronic device 1 are implemented as software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium, which may be a non-volatile or a volatile storage medium. Based on this understanding, the present application may implement all or part of the processes of the above method embodiments by instructing the relevant hardware through computer-readable instructions, which may be stored in a computer-readable storage medium; when executed by a processor, the computer-readable instructions implement the steps of the above method embodiments.
The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or apparatus capable of carrying the computer-readable instruction code: a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), or a random access memory (RAM).
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with each other by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
With reference to FIG. 1, the memory 12 in the electronic device 1 stores computer-readable instructions implementing an animation video generation method, and the processor 13 can execute those computer-readable instructions to implement:
when a video generation request is received, acquiring text information according to the video generation request;
inputting the text information into a pre-trained video generation model to obtain an initial video;
identifying the human body feature points of each frame in the initial video;
generating the user's posture information in each frame according to the human body feature points;
if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
analyzing the text information based on a pre-trained audio generation model to obtain audio information;
generating an animation video according to the second video and the audio information.
Specifically, for the specific implementation of the above computer-readable instructions by the processor 13, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the modules is only a logical functional division, and other divisions are possible in actual implementation.
The computer-readable storage medium stores computer-readable instructions, which, when executed by the processor 13, implement the following steps:
when a video generation request is received, acquiring text information according to the video generation request;
inputting the text information into a pre-trained video generation model to obtain an initial video;
identifying the human body feature points of each frame in the initial video;
generating the user's posture information in each frame according to the human body feature points;
if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
analyzing the text information based on a pre-trained audio generation model to obtain audio information;
generating an animation video according to the second video and the audio information.
所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本申请各个实施例中的各功能模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能模块的形式实现。In addition, each functional module in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, or in the form of hardware plus software function modules.
因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本申请的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附关联图标记视为限制所涉及的权利要求。Therefore, the embodiments should be regarded as exemplary and not restrictive in all points of view, and the scope of the application is defined by the appended claims rather than the foregoing description, and it is intended that the scope of the present application be defined by the appended claims rather than by the foregoing description. All changes within the meaning and range of equivalents of the elements are embraced in this application. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, the word "comprising" obviously does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or apparatuses recited may also be implemented by one unit or apparatus through software or hardware. Terms such as "first" and "second" denote names only and do not imply any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of this application. Although this application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements may be made to the technical solutions of this application without departing from their spirit and scope.

Claims (20)

  1. An animation video generation method, wherein the animation video generation method comprises:
    when a video generation request is received, acquiring text information according to the video generation request;
    inputting the text information into a pre-trained video generation model to obtain an initial video;
    identifying human body feature points in each frame of image of the initial video;
    generating posture information of a user in each frame of image according to the human body feature points;
    if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
    analyzing the text information based on a pre-trained audio generation model to obtain audio information; and
    generating an animation video according to the second video and the audio information.
  2. The animation video generation method according to claim 1, wherein before inputting the text information into the pre-trained video generation model to obtain the initial video, the method further comprises:
    acquiring a plurality of video training samples, each video training sample comprising a training video and training text corresponding to the training video;
    constructing a learner, wherein the learner comprises an encoding layer and a decoding layer;
    performing text encoding on the training text to obtain a text vector;
    analyzing the text vector based on the encoding layer to obtain feature information of the training text;
    analyzing the feature information based on the decoding layer to obtain an output vector;
    mapping the training video based on a preset mapping table to obtain an image vector of the training video;
    calculating the similarity between the text vector and the output vector to obtain a first similarity, and calculating the similarity between the text vector and the image vector to obtain a second similarity;
    calculating the ratio of the first similarity to the second similarity to obtain a learning index of the learner; and
    adjusting network parameters of the learner until the learning index no longer increases, to obtain the video generation model.
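
The training criterion in claim 2 is a ratio of two similarities rather than a conventional loss. The claim does not fix a similarity measure, so the sketch below assumes cosine similarity; the training loop and the learner.forward interface are likewise illustrative.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def learning_index(text_vec, output_vec, image_vec) -> float:
    s1 = cosine(text_vec, output_vec)   # first similarity: text vector vs. decoder output
    s2 = cosine(text_vec, image_vec)    # second similarity: text vector vs. mapped video
    return s1 / s2                      # learning index of the learner

def train(learner, samples, adjust_step):
    # Hypothetical loop: adjust network parameters until the mean learning
    # index over the training samples no longer increases.
    best = float("-inf")
    while True:
        adjust_step(learner)            # one round of parameter adjustment
        idx = float(np.mean([learning_index(*learner.forward(s)) for s in samples]))
        if idx <= best:
            return learner              # index stopped increasing: model is ready
        best = idx
```
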
  3. The animation video generation method according to claim 1, wherein identifying the human body feature points in each frame of image of the initial video comprises:
    detecting each frame of image based on a preset detector to obtain a human body region in each frame of image;
    performing grayscale processing on the human body region to obtain a plurality of pixels of the human body region and a pixel gray value corresponding to each pixel;
    calculating a pixel difference between each pixel and a preset feature point according to the pixel gray value and a feature gray value of the preset feature point;
    determining the pixels whose pixel difference is smaller than a preset threshold as initial feature points;
    constructing a coordinate system based on each frame of image, and acquiring initial coordinate information of the initial feature points in each frame of image; and
    screening the human body feature points out of the initial feature points according to the initial coordinate information.
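
A minimal sketch of the thresholding in claim 3, assuming an 8-bit RGB-to-grayscale conversion; the default feature gray value and threshold are illustrative, since the claim leaves both open.

```python
import numpy as np

def initial_feature_points(human_region: np.ndarray,
                           feature_gray: int = 128,
                           threshold: int = 10) -> np.ndarray:
    # human_region: H x W x 3 crop returned by the preset detector
    r, g, b = human_region[..., 0], human_region[..., 1], human_region[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b      # grayscale processing per pixel
    diff = np.abs(gray - feature_gray)            # pixel difference to the preset feature gray value
    ys, xs = np.nonzero(diff < threshold)         # keep pixels below the preset threshold
    return np.stack([xs, ys], axis=1)             # (x, y) initial coordinates in the frame's coordinate system
```
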
  4. The animation video generation method according to claim 3, wherein screening the human body feature points out of the initial feature points according to the initial coordinate information comprises:
    for any initial feature point, calculating feature distances between that initial feature point and target feature points according to the initial coordinate information, the target feature points being the initial feature points other than that initial feature point;
    determining the feature distance with the smallest value as a target distance, and determining the target feature point corresponding to the target distance as an adjacent feature point of that initial feature point;
    performing normal distribution processing on the target distances to obtain a probability value for each target distance; and
    determining the initial feature points whose target distance has a probability value greater than a preset probability value as the human body feature points.
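
One way to read the screening in claim 4: score each point's nearest-neighbour (target) distance under a normal distribution fitted to all target distances, and keep the points whose score clears a preset probability value. The scoring function and default threshold below are assumptions, as the claim fixes neither.

```python
import numpy as np

def filter_keypoints(points: np.ndarray, p_min: float = 0.5) -> np.ndarray:
    # points: N x 2 array of (x, y) initial feature points, N >= 2
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)        # pairwise feature distances
    np.fill_diagonal(dists, np.inf)               # ignore self-distances
    nearest = dists.min(axis=1)                   # target distance to each point's adjacent point
    mu, sigma = nearest.mean(), nearest.std() + 1e-8
    # Normal-density score normalized to peak at 1 for a typical spacing.
    score = np.exp(-0.5 * ((nearest - mu) / sigma) ** 2)
    return points[score > p_min]                  # keep points above the preset probability value
```
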
  5. The animation video generation method according to claim 3, wherein generating the posture information of the user in each frame of image according to the human body feature points comprises:
    acquiring coordinate information of the human body feature points according to the coordinate system as human body coordinate information;
    taking any two adjacent feature points among the human body feature points as a feature point pair;
    calculating the Euler angle of each feature point pair according to the human body coordinate information and the abscissa axis of the coordinate system; and
    calculating the average of the Euler angles to obtain an angle mean, and determining the preset posture information corresponding to the angle mean as the posture information.
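
With per-frame 2D coordinates, the "Euler angle" of a feature point pair relative to the abscissa axis reduces to a planar angle. The sketch below averages those angles and looks up a preset posture; the angle-to-posture table is invented for illustration, since the application does not specify the mapping.

```python
import math

def posture_from_pairs(pairs, pose_table, default="unknown"):
    # pairs: [((x1, y1), (x2, y2)), ...] adjacent feature point pairs
    angles = [math.degrees(math.atan2(y2 - y1, x2 - x1))
              for (x1, y1), (x2, y2) in pairs]       # angle of each pair to the abscissa axis
    mean_angle = sum(angles) / len(angles)           # angle mean over all pairs
    for (low, high), pose in pose_table.items():     # preset posture for each angle range
        if low <= mean_angle < high:
            return pose
    return default

# Purely illustrative lookup table.
POSE_TABLE = {(-30.0, 30.0): "standing", (30.0, 90.0): "raising_arm"}
```
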
  6. The animation video generation method according to claim 1, wherein the audio generation model comprises an emotion recognition network layer and a speech conversion network layer, and analyzing the text information based on the pre-trained audio generation model to obtain the audio information comprises:
    analyzing the text information based on the emotion recognition network layer to obtain a text emotion of the text information;
    acquiring emotional speech features of the text emotion from a speech feature library;
    processing the text information based on the speech conversion network layer to obtain speech information, and extracting text speech features from the speech information; and
    performing audio mixing on the text speech features and the emotional speech features to obtain the audio information.
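
A sketch of the two-branch audio path in claim 6, with the emotion network, speech conversion network, feature library, feature extractor, and mixer all injected as hypothetical callables; the application does not specify their interfaces.

```python
def synthesize_audio(text, emotion_net, tts_net, feature_library,
                     extract_features, mix):
    emotion = emotion_net(text)              # emotion recognition network layer -> text emotion
    emo_feat = feature_library[emotion]      # emotional speech features from the library
    speech = tts_net(text)                   # speech conversion network layer -> speech information
    txt_feat = extract_features(speech)      # text speech features from the speech
    return mix(txt_feat, emo_feat)           # audio mixing yields the final audio information
```
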
  7. The animation video generation method according to claim 1, wherein generating the animation video according to the second video and the audio information comprises:
    counting the duration of the second video to obtain a first duration;
    counting the duration of the audio information to obtain a second duration;
    if the first duration is not equal to the second duration, taking whichever of the second video and the audio information has the longer duration as information to be processed;
    compressing the information to be processed until the durations of the processed second video and the processed audio information are equal; and
    merging the processed second video and the processed audio information to obtain the animation video.
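
Claim 7 compresses whichever track is longer until the durations match, then merges. The sketch below stands in for that compression with uniform sample dropping; a real implementation would more likely time-scale video frames and audio independently.

```python
from dataclasses import dataclass

@dataclass
class Track:
    samples: list      # frames for video, audio chunks for audio
    duration: float    # seconds

def compress_to(track: Track, target: float) -> Track:
    # Uniformly drop samples so the track plays in `target` seconds --
    # a crude stand-in for the claimed compression step.
    keep = max(1, round(len(track.samples) * target / track.duration))
    step = len(track.samples) / keep
    samples = [track.samples[int(i * step)] for i in range(keep)]
    return Track(samples, target)

def align(video: Track, audio: Track) -> tuple[Track, Track]:
    first, second = video.duration, audio.duration   # first and second durations
    if first > second:
        video = compress_to(video, second)           # video is the to-be-processed track
    elif second > first:
        audio = compress_to(audio, first)            # audio is the to-be-processed track
    return video, audio                              # equal-length tracks, ready to merge
```
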
  8. An animation video generation apparatus, wherein the animation video generation apparatus comprises:
    an acquisition unit, configured to acquire text information according to a video generation request when the video generation request is received;
    an input unit, configured to input the text information into a pre-trained video generation model to obtain an initial video;
    an identification unit, configured to identify human body feature points in each frame of image of the initial video;
    a generation unit, configured to generate posture information of a user in each frame of image according to the human body feature points;
    an adjustment unit, configured to adjust the posture information according to the human body feature points to obtain a second video if the posture information is a preset posture;
    an analysis unit, configured to analyze the text information based on a pre-trained audio generation model to obtain audio information; and
    the generation unit, further configured to generate an animation video according to the second video and the audio information.
  9. An electronic device, wherein the electronic device comprises a processor and a memory, the processor being configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:
    when a video generation request is received, acquiring text information according to the video generation request;
    inputting the text information into a pre-trained video generation model to obtain an initial video;
    identifying human body feature points in each frame of image of the initial video;
    generating posture information of a user in each frame of image according to the human body feature points;
    if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
    analyzing the text information based on a pre-trained audio generation model to obtain audio information; and
    generating an animation video according to the second video and the audio information.
  10. The electronic device according to claim 9, wherein before inputting the text information into the pre-trained video generation model to obtain the initial video, the processor executes the at least one computer-readable instruction to further implement the following steps:
    acquiring a plurality of video training samples, each video training sample comprising a training video and training text corresponding to the training video;
    constructing a learner, wherein the learner comprises an encoding layer and a decoding layer;
    performing text encoding on the training text to obtain a text vector;
    analyzing the text vector based on the encoding layer to obtain feature information of the training text;
    analyzing the feature information based on the decoding layer to obtain an output vector;
    mapping the training video based on a preset mapping table to obtain an image vector of the training video;
    calculating the similarity between the text vector and the output vector to obtain a first similarity, and calculating the similarity between the text vector and the image vector to obtain a second similarity;
    calculating the ratio of the first similarity to the second similarity to obtain a learning index of the learner; and
    adjusting network parameters of the learner until the learning index no longer increases, to obtain the video generation model.
  11. The electronic device according to claim 9, wherein, when identifying the human body feature points in each frame of image of the initial video, the processor executes the at least one computer-readable instruction to implement the following steps:
    detecting each frame of image based on a preset detector to obtain a human body region in each frame of image;
    performing grayscale processing on the human body region to obtain a plurality of pixels of the human body region and a pixel gray value corresponding to each pixel;
    calculating a pixel difference between each pixel and a preset feature point according to the pixel gray value and a feature gray value of the preset feature point;
    determining the pixels whose pixel difference is smaller than a preset threshold as initial feature points;
    constructing a coordinate system based on each frame of image, and acquiring initial coordinate information of the initial feature points in each frame of image; and
    screening the human body feature points out of the initial feature points according to the initial coordinate information.
  12. The electronic device according to claim 11, wherein, when screening the human body feature points out of the initial feature points according to the initial coordinate information, the processor executes the at least one computer-readable instruction to implement the following steps:
    for any initial feature point, calculating feature distances between that initial feature point and target feature points according to the initial coordinate information, the target feature points being the initial feature points other than that initial feature point;
    determining the feature distance with the smallest value as a target distance, and determining the target feature point corresponding to the target distance as an adjacent feature point of that initial feature point;
    performing normal distribution processing on the target distances to obtain a probability value for each target distance; and
    determining the initial feature points whose target distance has a probability value greater than a preset probability value as the human body feature points.
  13. The electronic device according to claim 11, wherein, when generating the posture information of the user in each frame of image according to the human body feature points, the processor executes the at least one computer-readable instruction to implement the following steps:
    acquiring coordinate information of the human body feature points according to the coordinate system as human body coordinate information;
    taking any two adjacent feature points among the human body feature points as a feature point pair;
    calculating the Euler angle of each feature point pair according to the human body coordinate information and the abscissa axis of the coordinate system; and
    calculating the average of the Euler angles to obtain an angle mean, and determining the preset posture information corresponding to the angle mean as the posture information.
  14. The electronic device according to claim 9, wherein the audio generation model comprises an emotion recognition network layer and a speech conversion network layer, and when analyzing the text information based on the pre-trained audio generation model to obtain the audio information, the processor executes the at least one computer-readable instruction to implement the following steps:
    analyzing the text information based on the emotion recognition network layer to obtain a text emotion of the text information;
    acquiring emotional speech features of the text emotion from a speech feature library;
    processing the text information based on the speech conversion network layer to obtain speech information, and extracting text speech features from the speech information; and
    performing audio mixing on the text speech features and the emotional speech features to obtain the audio information.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction, and the at least one computer-readable instruction, when executed by a processor, implements the following steps:
    when a video generation request is received, acquiring text information according to the video generation request;
    inputting the text information into a pre-trained video generation model to obtain an initial video;
    identifying human body feature points in each frame of image of the initial video;
    generating posture information of a user in each frame of image according to the human body feature points;
    if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
    analyzing the text information based on a pre-trained audio generation model to obtain audio information; and
    generating an animation video according to the second video and the audio information.
  16. The storage medium according to claim 15, wherein before inputting the text information into the pre-trained video generation model to obtain the initial video, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    acquiring a plurality of video training samples, each video training sample comprising a training video and training text corresponding to the training video;
    constructing a learner, wherein the learner comprises an encoding layer and a decoding layer;
    performing text encoding on the training text to obtain a text vector;
    analyzing the text vector based on the encoding layer to obtain feature information of the training text;
    analyzing the feature information based on the decoding layer to obtain an output vector;
    mapping the training video based on a preset mapping table to obtain an image vector of the training video;
    calculating the similarity between the text vector and the output vector to obtain a first similarity, and calculating the similarity between the text vector and the image vector to obtain a second similarity;
    calculating the ratio of the first similarity to the second similarity to obtain a learning index of the learner; and
    adjusting network parameters of the learner until the learning index no longer increases, to obtain the video generation model.
  17. The storage medium according to claim 15, wherein, when identifying the human body feature points in each frame of image of the initial video, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    detecting each frame of image based on a preset detector to obtain a human body region in each frame of image;
    performing grayscale processing on the human body region to obtain a plurality of pixels of the human body region and a pixel gray value corresponding to each pixel;
    calculating a pixel difference between each pixel and a preset feature point according to the pixel gray value and a feature gray value of the preset feature point;
    determining the pixels whose pixel difference is smaller than a preset threshold as initial feature points;
    constructing a coordinate system based on each frame of image, and acquiring initial coordinate information of the initial feature points in each frame of image; and
    screening the human body feature points out of the initial feature points according to the initial coordinate information.
  18. The storage medium according to claim 17, wherein, when screening the human body feature points out of the initial feature points according to the initial coordinate information, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    for any initial feature point, calculating feature distances between that initial feature point and target feature points according to the initial coordinate information, the target feature points being the initial feature points other than that initial feature point;
    determining the feature distance with the smallest value as a target distance, and determining the target feature point corresponding to the target distance as an adjacent feature point of that initial feature point;
    performing normal distribution processing on the target distances to obtain a probability value for each target distance; and
    determining the initial feature points whose target distance has a probability value greater than a preset probability value as the human body feature points.
  19. The storage medium according to claim 17, wherein, when generating the posture information of the user in each frame of image according to the human body feature points, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    acquiring coordinate information of the human body feature points according to the coordinate system as human body coordinate information;
    taking any two adjacent feature points among the human body feature points as a feature point pair;
    calculating the Euler angle of each feature point pair according to the human body coordinate information and the abscissa axis of the coordinate system; and
    calculating the average of the Euler angles to obtain an angle mean, and determining the preset posture information corresponding to the angle mean as the posture information.
  20. The storage medium according to claim 15, wherein the audio generation model comprises an emotion recognition network layer and a speech conversion network layer, and when analyzing the text information based on the pre-trained audio generation model to obtain the audio information, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    analyzing the text information based on the emotion recognition network layer to obtain a text emotion of the text information;
    acquiring emotional speech features of the text emotion from a speech feature library;
    processing the text information based on the speech conversion network layer to obtain speech information, and extracting text speech features from the speech information; and
    performing audio mixing on the text speech features and the emotional speech features to obtain the audio information.
PCT/CN2022/071302 2021-09-29 2022-01-11 Animation video generation method and apparatus, and device and storage medium WO2023050650A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111152667.XA CN113870395A (en) 2021-09-29 2021-09-29 Animation video generation method, device, equipment and storage medium
CN202111152667.X 2021-09-29

Publications (1)

Publication Number Publication Date
WO2023050650A1

Family

ID=79000531

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071302 WO2023050650A1 (en) 2021-09-29 2022-01-11 Animation video generation method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN113870395A (en)
WO (1) WO2023050650A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870395A (en) * 2021-09-29 2021-12-31 平安科技(深圳)有限公司 Animation video generation method, device, equipment and storage medium
CN114598926B (en) * 2022-01-20 2023-01-03 中国科学院自动化研究所 Video generation method and device, electronic equipment and storage medium
CN114567693B (en) * 2022-02-11 2024-01-30 维沃移动通信有限公司 Video generation method and device and electronic equipment
CN114979764B (en) * 2022-04-25 2024-02-06 中国平安人寿保险股份有限公司 Video generation method, device, computer equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109118562A (en) * 2018-08-31 2019-01-01 百度在线网络技术(北京)有限公司 Explanation video creating method, device and the terminal of virtual image
CN110880198A (en) * 2018-09-06 2020-03-13 百度在线网络技术(北京)有限公司 Animation generation method and device
WO2020070483A1 (en) * 2018-10-05 2020-04-09 Blupoint Ltd Data processing apparatus and method
CN110866968A (en) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 Method for generating virtual character video based on neural network and related equipment
CN112184858A (en) * 2020-09-01 2021-01-05 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
CN112381926A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method and apparatus for generating video
CN112927712A (en) * 2021-01-25 2021-06-08 网易(杭州)网络有限公司 Video generation method and device and electronic equipment
CN113194348A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human lecture video generation method, system, device and storage medium
CN113077537A (en) * 2021-04-29 2021-07-06 广州虎牙科技有限公司 Video generation method, storage medium and equipment
CN113392231A (en) * 2021-06-30 2021-09-14 中国平安人寿保险股份有限公司 Method, device and equipment for generating freehand drawing video based on text and storage medium
CN113870395A (en) * 2021-09-29 2021-12-31 平安科技(深圳)有限公司 Animation video generation method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824010A (en) * 2023-07-04 2023-09-29 安徽建筑大学 Feedback type multiterminal animation design online interaction method and system
CN116824010B (en) * 2023-07-04 2024-03-26 安徽建筑大学 Feedback type multiterminal animation design online interaction method and system

Also Published As

Publication number Publication date
CN113870395A (en) 2021-12-31

Legal Events

Code 121 — EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22874053
Country of ref document: EP
Kind code of ref document: A1