WO2023050650A1 - Animation video generation method and apparatus, and device and storage medium - Google Patents

Animation video generation method and apparatus, and device and storage medium

Info

Publication number
WO2023050650A1
WO2023050650A1 (PCT/CN2022/071302)
Authority
WO
WIPO (PCT)
Prior art keywords
video
information
text
human body
initial
Prior art date
Application number
PCT/CN2022/071302
Other languages
French (fr)
Chinese (zh)
Inventor
郑喜民
陈振宏
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2023050650A1

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 - Monomedia components thereof

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular to a method, device, equipment and storage medium for generating an animation video.
  • animated video teaching can stimulate students' interest and enthusiasm for learning.
  • animation video teaching has also developed.
  • steps such as story script writing, storyboard design, live anchor shooting, illustration material drawing, animation production and post-editing are involved, resulting in low efficiency of complete animation video generation.
  • because different video producers hold different views on video production, the generation quality of the video cannot be ensured.
  • the first aspect of the present application provides a method for generating an animation video, the method including:
  • when a video generation request is received, acquiring text information according to the video generation request;
  • inputting the text information into a pre-trained video generation model to obtain an initial video;
  • identifying human body feature points of each frame of image in the initial video;
  • generating posture information of the user in each frame of image according to the human body feature points;
  • if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
  • analyzing the text information based on a pre-trained audio generation model to obtain audio information;
  • generating an animation video according to the second video and the audio information.
  • a second aspect of the present application provides an electronic device, the electronic device includes a processor and a memory, and the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
  • when a video generation request is received, acquiring text information according to the video generation request;
  • inputting the text information into a pre-trained video generation model to obtain an initial video;
  • identifying human body feature points of each frame of image in the initial video;
  • generating posture information of the user in each frame of image according to the human body feature points;
  • if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
  • analyzing the text information based on a pre-trained audio generation model to obtain audio information;
  • generating an animation video according to the second video and the audio information.
  • a third aspect of the present application provides a computer-readable storage medium, on which at least one computer-readable instruction is stored, and the at least one computer-readable instruction is executed by a processor to implement the following steps:
  • when a video generation request is received, acquiring text information according to the video generation request;
  • inputting the text information into a pre-trained video generation model to obtain an initial video;
  • identifying human body feature points of each frame of image in the initial video;
  • generating posture information of the user in each frame of image according to the human body feature points;
  • if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
  • analyzing the text information based on a pre-trained audio generation model to obtain audio information;
  • generating an animation video according to the second video and the audio information.
  • a fourth aspect of the present application provides an animation video generation device, the animation video generation device comprising:
  • An acquisition unit configured to acquire text information according to the video generation request when receiving the video generation request
  • An input unit configured to input the text information into a pre-trained video generation model to obtain an initial video
  • a recognition unit configured to recognize the human body feature points of each frame of image in the initial video
  • a generating unit configured to generate posture information of the user in each frame of image according to the human body feature points
  • An adjustment unit configured to adjust the posture information according to the human body feature points to obtain a second video if the posture information is a preset posture
  • An analysis unit configured to analyze the text information based on a pre-trained audio generation model to obtain audio information
  • the generating unit is configured to generate animation video according to the second video and the audio information.
  • It can be seen from the above technical solutions that, by analyzing the text information through the video generation model, the present application can quickly generate the initial video, thereby improving the generation efficiency of the animation video. By identifying the human body feature points, the posture information of the user in each frame of image can be determined accurately, and the posture information is adjusted when it is a preset posture, which prevents bad posture information such as the preset posture from appearing in the second video. Since good posture information can play an educational role for users, avoiding such bad posture information in the second video improves the quality of the second video. In addition, the audio generation model can accurately generate the audio information corresponding to the text information, and generating the animation video from the audio information and the second video improves its generation quality.
  • Fig. 1 is a flowchart of a preferred embodiment of the animation video generation method of the present application.
  • Fig. 2 is a functional block diagram of a preferred embodiment of the animation video generation device of the present application.
  • Fig. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the animation video generation method of the present application.
  • As shown in FIG. 1, it is a flow chart of a preferred embodiment of the animation video generation method of the present application. According to different requirements, the order of the steps in the flowchart can be changed, and some steps can be omitted.
  • the animation video generation method can acquire and process relevant data based on artificial intelligence technology.
  • artificial intelligence is a theory, method, technology and application system that uses digital computers or digital-computer-controlled machines to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the animation video generation method is applied in the field of smart education, thereby promoting the development of smart cities.
  • the animation video generation method is applied to one or more electronic devices. An electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored computer-readable instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.
  • the electronic device may be any electronic product capable of human-machine interaction with the user, for example, a personal computer, a tablet computer, a smart phone, a personal digital assistant (Personal Digital Assistant, PDA), a game console, an interactive Internet protocol television (Internet Protocol Television, IPTV), a smart wearable device, etc.
  • the electronic devices may include network devices and/or user devices.
  • the network device includes, but is not limited to, a single network electronic device, an electronic device group composed of multiple network electronic devices, or a cloud composed of a large number of hosts or network electronic devices based on Cloud Computing.
  • the network where the electronic device is located includes, but is not limited to: the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN) and the like.
  • Depending on the application scenario of the video generation request, the triggering user of the video generation request also differs.
  • For example, if the application scenario of the video generation request is the field of education, the triggering user of the video generation request may be a teacher.
  • the video generation request may include, but is not limited to: a text path, a preset tag, and the like.
  • the text information refers to text information that needs to be converted into a video, for example, the text information may be a teacher's handout.
  • the electronic device acquiring text information according to the video generation request includes:
  • the text information is obtained from the text path.
  • the preset label refers to a label used to indicate a path.
  • for example, the preset label may be "storage location".
  • the text path can be accurately extracted through the preset tag, so that the text information can be accurately obtained, which is beneficial to the generation of corresponding animation videos.
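As an illustration of this request-parsing step, here is a minimal Python sketch. It assumes a JSON request payload and a preset tag literally named "storage location"; the field names, file name and format are assumptions for illustration, not details fixed by the application.

```python
import json
from pathlib import Path

PRESET_TAG = "storage location"  # preset label used to indicate the text path

def get_text_information(request_message: str) -> str:
    # Parse the request message to obtain the data information it carries
    data = json.loads(request_message)
    # Extract the text path from the data information according to the preset tag
    text_path = data[PRESET_TAG]
    # Obtain the text information from the text path
    return Path(text_path).read_text(encoding="utf-8")

# Hypothetical request; the path is a placeholder
request = json.dumps({"storage location": "lecture_notes.txt"})
# text_information = get_text_information(request)
```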
  • the video generation model refers to a model capable of converting text into video.
  • the video generation model includes an encoding layer, a decoding layer, and a preset mapping table.
  • the preset mapping table stores a mapping relationship between pixel values and vectors.
  • the initial video refers to a video generated after the text information is analyzed by the video generation model.
  • the initial video does not contain voice information.
  • before inputting the text information into the pre-trained video generation model to obtain the initial video, the method further includes:
  • each video training sample includes a training video and training text corresponding to the training video
  • the learner includes an encoding layer and a decoding layer
  • mapping processing on the training video based on a preset mapping table to obtain an image vector of the training video
  • the text vector is used to characterize the training text.
  • the learning index is used to evaluate the accuracy of the learner.
  • the network parameters include preset parameters in the encoding layer and the decoding layer.
  • for example, if the encoding layer includes a convolution layer, the network parameter may be the size of a convolution kernel in the convolution layer.
  • the learning index is generated by the similarity between the training text and the predicted video and the similarity between the training text and the training video, and then adjusting the network parameters according to the learning index can improve the performance of the video generation model.
  • the manner in which the electronic device analyzes the text information based on the video generation model is similar to the manner in which the electronic device analyzes the training text based on the learner, and will not be repeated here.
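The learning index described above can be pictured with a short numeric sketch. Cosine similarity is used here as one plausible similarity measure (the application does not fix a specific one), and the random vectors are placeholders standing in for the text vector, the decoder output vector and the image vector.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def learning_index(text_vec, output_vec, image_vec) -> float:
    first_similarity = cosine_similarity(text_vec, output_vec)   # text vs. predicted video
    second_similarity = cosine_similarity(text_vec, image_vec)   # text vs. training video
    # The learning index is the ratio of the first similarity to the second
    return first_similarity / second_similarity

rng = np.random.default_rng(0)
text_vec, output_vec, image_vec = rng.normal(size=(3, 128))
print(learning_index(text_vec, output_vec, image_vec))
```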
  • the human body feature points include, but are not limited to, key feature points of the human face (such as the pupil center), hand joint points, bone joint points, etc.
  • the electronic device identifying the human body feature points of each frame image in the initial video includes:
  • the human body feature points are screened out from the initial feature points according to the initial coordinate information.
  • the preset detector can be used to identify the person information in the image.
  • the preset feature points include hand joint points and bone joint points.
  • the feature gray value may be determined according to pixel information corresponding to preset feature points of multiple preset users.
  • the preset threshold can be set according to requirements.
  • the coordinate system includes an abscissa axis and an ordinate axis.
  • Detecting each frame of image with the preset detector not only eliminates the interference of background information in each frame of image on the human body feature points, improving the recognition accuracy of the human body feature points, but also reduces the number of pixels to be analyzed, improving the recognition efficiency of the human body feature points. Then, by comparing the pixel gray values with the feature gray value, the initial feature points can be determined quickly, and screening by the initial coordinate information of the initial feature points improves the determination accuracy of the human body feature points.
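A minimal sketch of this gray-value comparison follows: given a person region proposed by the preset detector, pixels whose gray value is within the preset threshold of the feature gray value are kept as initial feature points together with their coordinates. The box format and the exact threshold semantics are assumptions.

```python
import numpy as np

def detect_initial_feature_points(gray_frame: np.ndarray,
                                  person_box: tuple,
                                  feature_gray: float,
                                  preset_threshold: float):
    """Return (row, col) initial coordinate information of pixels whose gray
    value is close to the characteristic gray value, searched only inside the
    person region proposed by the preset detector."""
    top, left, bottom, right = person_box
    region = gray_frame[top:bottom, left:right].astype(float)
    mask = np.abs(region - feature_gray) < preset_threshold
    rows, cols = np.nonzero(mask)
    # Shift back to full-frame coordinates (ordinate and abscissa of each point)
    return list(zip(rows + top, cols + left))
```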
  • the electronic device selecting the human body feature points from the initial feature points according to the initial coordinate information includes:
  • a target feature point refers to any one of the initial feature points other than the initial feature point currently under consideration
  • An initial feature point corresponding to a target distance whose probability value is greater than a preset probability value is determined as the human body feature point.
  • the preset probability value can be set according to requirements, for example, the preset probability value can be 99.44%.
  • Through the analysis of the feature distances, the adjacent feature points of any initial feature point can be determined quickly, and by performing normal distribution processing on the target distances and further analyzing their probability values, the human body feature points can be screened out from the initial feature points accurately.
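The normal-distribution screening might look like the sketch below, which fits a normal distribution to the target distances and keeps the points whose probability value exceeds the preset probability value. Reading the "probability value" as the CDF of the fitted distribution is an interpretation for illustration, not something the text pins down.

```python
import numpy as np
from scipy.stats import norm

def screen_human_body_feature_points(initial_points, target_distances,
                                     preset_probability=0.9944):
    d = np.asarray(target_distances, dtype=float)
    mu, sigma = d.mean(), d.std(ddof=1)
    # Probability value of each target distance under the fitted N(mu, sigma^2)
    probability_values = norm.cdf(d, loc=mu, scale=sigma)
    # Keep the initial feature points whose probability value is large enough
    return [p for p, pv in zip(initial_points, probability_values)
            if pv > preset_probability]
```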
  • the posture information refers to the posture of the user in each frame of image; for example, the posture information may be head down or head up.
  • the electronic device generating the gesture information of the user in each frame of image according to the human body feature points includes:
  • a feature point pair refers to any two adjacent feature points obtained from the human body feature points, where two adjacent feature points are the human body feature points with the closest feature distance. For example, for human body feature point A, human body feature point B, human body feature point C and human body feature point D, if the feature distance between A and B is 5, the feature distance between A and C is 2, and the feature distance between A and D is 3, then human body feature point C is the adjacent feature point of human body feature point A.
  • the posture information may be determined according to a mapping table of angles and preset posture information.
  • the preset posture information may be marked by the user.
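One way to realize this step is sketched below: the angle of each adjacent feature point pair is computed, the angles are averaged, and the mean angle is looked up in an angle-to-posture mapping table. The angle ranges in the table are invented for illustration; in the application they would come from user-marked preset posture information.

```python
import math

def pair_angle(p1, p2) -> float:
    """Angle (in degrees) of the line through one feature point pair."""
    return math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0]))

def posture_from_feature_pairs(feature_pairs, posture_table,
                               default="standard posture"):
    # Mean of the pair angles across all feature point pairs in the frame
    mean_angle = sum(pair_angle(a, b) for a, b in feature_pairs) / len(feature_pairs)
    for (low, high), posture in posture_table.items():
        if low <= mean_angle < high:
            return posture
    return default

# Illustrative angle ranges only; real ranges would be user-marked
posture_table = {(-90.0, -15.0): "head down", (15.0, 90.0): "head up"}
print(posture_from_feature_pairs([((0, 0), (4, 3)), ((1, 1), (5, 4))], posture_table))
```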
  • If the posture information is a preset posture, adjust the posture information according to the human body feature points to obtain a second video.
  • the preset postures may include, but are not limited to: bad postures such as bowing the head and raising the head.
  • the user posture of each frame of image in the second video is not the preset posture.
  • the electronic device adjusting the posture information according to the human body feature points to obtain the second video includes:
  • the posture mapping table stores the mapping relationship between a plurality of pieces of preset posture information and angles, and the plurality of pieces of preset posture information includes the standard posture and bad postures such as bowing or raising the head.
  • the plurality of pieces of preset posture information in the posture mapping table can be marked by the user, and the calculation method of the angle in the posture mapping table is similar to the calculation method of the angle mean value in each frame of image, which will not be repeated here.
  • the human body feature points that affect the posture information can be quickly determined and then adjusted, thereby improving the quality of the second video.
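The adjustment itself could be a geometric correction such as the sketch below, which rotates the posture-affecting feature points about their centroid until the measured mean angle matches the standard posture angle from the posture mapping table. This rotation is one plausible reading of the step, not the only possible implementation.

```python
import math

def adjust_feature_points(points, mean_angle_deg, standard_angle_deg):
    """Rotate the offending feature points about their centroid so that the
    measured mean angle matches the standard angle from the posture table."""
    theta = math.radians(standard_angle_deg - mean_angle_deg)
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(cx + (x - cx) * cos_t - (y - cy) * sin_t,
             cy + (x - cx) * sin_t + (y - cy) * cos_t)
            for x, y in points]
```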
  • the audio generation model is used to convert the text information into speech.
  • the audio information refers to voice corresponding to the text information.
  • the audio generation model includes an emotion recognition network layer and a speech conversion network layer
  • the electronic device analyzes the text information based on a pre-trained audio generation model, and the obtained audio information includes:
  • the emotion recognition network layer is used to analyze the emotion corresponding to the text.
  • the text emotion may be happy, sad and so on.
  • the speech conversion network layer is used to convert text into speech.
  • the audio information includes the text emotion, thereby making the audio information more interesting.
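The two-stage audio generation model can be pictured as below. Both layer classes are stand-ins (the application names the layers but no concrete models), and the toy emotion rule and fake waveform are obviously placeholders.

```python
class EmotionRecognitionLayer:
    """Stand-in for the emotion recognition network layer."""
    def predict(self, text: str) -> str:
        # A trained classifier in practice; a toy rule here
        return "happy" if "!" in text else "neutral"

class SpeechConversionLayer:
    """Stand-in for the speech conversion network layer."""
    def synthesize(self, text: str, emotion: str) -> bytes:
        # A real TTS network would condition the generated voice on the emotion
        return f"<waveform emotion={emotion}>{text}</waveform>".encode()

def generate_audio_information(text_information: str) -> bytes:
    text_emotion = EmotionRecognitionLayer().predict(text_information)
    return SpeechConversionLayer().synthesize(text_information, text_emotion)

print(generate_audio_information("Today we learn fractions!"))
```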
  • the animation video refers to a video including the audio information and the second video.
  • the animation video can also be stored in a block chain node.
  • the electronic device generating animation video according to the second video and the audio information includes:
  • if the first duration is not equal to the second duration, obtaining the information with the larger duration from the second video and the audio information as the information to be processed;
  • Compressing the information with the larger duration ensures that the durations of the processed second video and the processed audio information are equal, which allows the processed second video and the processed audio information to be combined directly, thereby improving the generation efficiency of the animation video.
  • the electronic device combining the processed second video and the processed audio information to obtain the animation video includes:
  • the audio track information is replaced with the processed audio information to obtain the animated video.
  • the animation video can be quickly generated.
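The duration alignment and track replacement might be implemented as in the sketch below, which uses the third-party moviepy package as one implementation choice (nothing in the application mandates it): whichever stream is longer is time-compressed so the durations match, and the video's audio track is then replaced with the processed audio information.

```python
from moviepy.editor import AudioFileClip, VideoFileClip, vfx

def merge_into_animation_video(video_path: str, audio_path: str, out_path: str) -> None:
    video = VideoFileClip(video_path)   # second video (first duration)
    audio = AudioFileClip(audio_path)   # audio information (second duration)
    if video.duration > audio.duration:
        # Compress the information with the larger duration: speed the video up
        video = video.fx(vfx.speedx, video.duration / audio.duration)
    elif audio.duration > video.duration:
        audio = audio.fx(vfx.speedx, audio.duration / video.duration)
    # Replace the audio track with the processed audio information
    video.set_audio(audio).write_videofile(out_path)
```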
  • It can be seen from the above technical solutions that, by analyzing the text information through the video generation model, the present application can quickly generate the initial video, thereby improving the generation efficiency of the animation video. By identifying the human body feature points, the posture information of the user in each frame of image can be determined accurately, and the posture information is adjusted when it is a preset posture, which prevents bad posture information such as the preset posture from appearing in the second video. Since good posture information can play an educational role for users, avoiding such bad posture information in the second video improves the quality of the second video. In addition, the audio generation model can accurately generate the audio information corresponding to the text information, and generating the animation video from the audio information and the second video improves its generation quality.
  • the animation video generation device 11 includes an acquisition unit 110, an input unit 111, a recognition unit 112, a generation unit 113, an adjustment unit 114, an analysis unit 115, a construction unit 116, an encoding unit 117, a mapping unit 118 and a calculation unit 119.
  • the module/unit referred to in this application refers to a series of computer-readable instruction segments that can be acquired by the processor 13 and can perform fixed functions, and are stored in the memory 12. In this embodiment, the functions of each module/unit will be described in detail in subsequent embodiments.
  • When receiving a video generation request, the obtaining unit 110 obtains text information according to the video generation request.
  • Depending on the application scenario of the video generation request, the triggering user of the video generation request also differs.
  • For example, if the application scenario of the video generation request is the field of education, the triggering user of the video generation request may be a teacher.
  • the video generation request may include, but is not limited to: a text path, a preset tag, and the like.
  • the text information refers to text information that needs to be converted into a video, for example, the text information may be a teacher's handout.
  • the acquiring unit 110 acquiring text information according to the video generation request includes:
  • the text information is obtained from the text path.
  • the preset label refers to a label used to indicate a path.
  • the preset label may be storage location.
  • the text path can be accurately extracted through the preset tag, so that the text information can be accurately obtained, which is beneficial to the generation of corresponding animation videos.
  • the input unit 111 inputs the text information into a pre-trained video generation model to obtain an initial video.
  • the video generation model refers to a model capable of converting text into video.
  • the video generation model includes an encoding layer, a decoding layer, and a preset mapping table.
  • the preset mapping table stores a mapping relationship between pixel values and vectors.
  • the initial video refers to a video generated after the text information is analyzed by the video generation model.
  • the initial video does not contain voice information.
  • the acquiring unit 110 acquires a plurality of video training samples, each video training sample including a training video and training text corresponding to the training video;
  • the construction unit 116 constructs a learner, wherein the learner includes an encoding layer and a decoding layer;
  • Encoding unit 117 performs text encoding processing on the training text to obtain a text vector
  • the analysis unit 115 analyzes the text vector based on the coding layer to obtain feature information of the training text
  • the analysis unit 115 analyzes the feature information based on the decoding layer to obtain an output vector
  • the mapping unit 118 performs mapping processing on the training video based on a preset mapping table to obtain an image vector of the training video;
  • Calculation unit 119 calculates the similarity between the text vector and the output vector to obtain a first similarity, and calculates the similarity between the text vector and the image vector to obtain a second similarity;
  • the calculation unit 119 calculates the ratio of the first similarity to the second similarity to obtain the learning index of the learner
  • the adjustment unit 114 adjusts the network parameters in the learner until the learning index no longer increases, so as to obtain the video generation model.
  • the text vector is used to characterize the training text.
  • the learning index is used to evaluate the accuracy of the learner.
  • the network parameters include preset parameters in the encoding layer and the decoding layer.
  • for example, if the encoding layer includes a convolution layer, the network parameter may be the size of a convolution kernel in the convolution layer.
  • the learning index is generated by the similarity between the training text and the predicted video and the similarity between the training text and the training video, and then adjusting the network parameters according to the learning index can improve the performance of the video generation model.
  • the manner of analyzing the text information based on the video generation model is similar to the manner of analyzing the training text based on the learner, which will not be repeated in the present application.
  • the identification unit 112 identifies human body feature points of each frame of image in the initial video.
  • the human body feature points include, but are not limited to, key feature points of the human face (such as the pupil center), hand joint points, bone joint points, etc.
  • the identifying unit 112 identifying the human body feature points of each frame image in the initial video includes:
  • the human body feature points are screened out from the initial feature points according to the initial coordinate information.
  • the preset detector can be used to identify the person information in the image.
  • the preset feature points include hand joint points and bone joint points.
  • the feature gray value may be determined according to pixel information corresponding to preset feature points of multiple preset users.
  • the preset threshold can be set according to requirements.
  • the coordinate system includes an abscissa axis and an ordinate axis.
  • Detecting each frame of image with the preset detector not only eliminates the interference of background information in each frame of image on the human body feature points, improving the recognition accuracy of the human body feature points, but also reduces the number of pixels to be analyzed, improving the recognition efficiency of the human body feature points. Then, by comparing the pixel gray values with the feature gray value, the initial feature points can be determined quickly, and screening by the initial coordinate information of the initial feature points improves the determination accuracy of the human body feature points.
  • the identifying unit 112 selecting the human body feature points from the initial feature points according to the initial coordinate information includes:
  • a target feature point refers to any one of the initial feature points other than the initial feature point currently under consideration
  • An initial feature point corresponding to a target distance whose probability value is greater than a preset probability value is determined as the human body feature point.
  • the preset probability value can be set according to requirements, for example, the preset probability value can be 99.44%.
  • Through the analysis of the feature distances, the adjacent feature points of any initial feature point can be determined quickly, and by performing normal distribution processing on the target distances and further analyzing their probability values, the human body feature points can be screened out from the initial feature points accurately.
  • the generating unit 113 generates gesture information of the user in each frame of image according to the human body feature points.
  • the posture information refers to the posture of the user in each frame of image; for example, the posture information may be head down or head up.
  • the generating unit 113 generating the gesture information of the user in each frame of image according to the human body feature points includes:
  • a feature point pair refers to any two adjacent feature points obtained from the human body feature points, where two adjacent feature points are the human body feature points with the closest feature distance. For example, for human body feature point A, human body feature point B, human body feature point C and human body feature point D, if the feature distance between A and B is 5, the feature distance between A and C is 2, and the feature distance between A and D is 3, then human body feature point C is the adjacent feature point of human body feature point A.
  • the posture information may be determined according to a mapping table of angles and preset posture information.
  • the preset posture information may be marked by the user.
  • If the posture information is a preset posture, the adjustment unit 114 adjusts the posture information according to the human body feature points to obtain a second video.
  • the preset postures may include, but are not limited to: bad postures such as bowing the head and raising the head.
  • the user posture of each frame of image in the second video is not the preset posture.
  • the adjustment unit 114 adjusting the posture information according to the human body feature points to obtain the second video includes:
  • the posture mapping table stores the mapping relationship between a plurality of pieces of preset posture information and angles, and the plurality of pieces of preset posture information includes the standard posture and bad postures such as bowing or raising the head.
  • the plurality of pieces of preset posture information in the posture mapping table can be marked by the user, and the calculation method of the angle in the posture mapping table is similar to the calculation method of the angle mean value in each frame of image, which will not be repeated here.
  • the human body feature points that affect the posture information can be quickly determined and then adjusted, thereby improving the quality of the second video.
  • the analysis unit 115 analyzes the text information based on a pre-trained audio generation model to obtain audio information.
  • the audio generation model is used to convert the text information into speech.
  • the audio information refers to voice corresponding to the text information.
  • the audio generation model includes an emotion recognition network layer and a speech conversion network layer
  • the analysis unit 115 analyzes the text information based on a pre-trained audio generation model, and obtains the audio information including:
  • the emotion recognition network layer is used to analyze the emotion corresponding to the text.
  • the text emotion may be happy, sad and so on.
  • the speech conversion network layer is used to convert text into speech.
  • the audio information includes the text emotion, thereby making the audio information more interesting.
  • the generating unit 113 generates an animation video according to the second video and the audio information.
  • the animation video refers to a video including the audio information and the second video.
  • the animation video can also be stored in a block chain node.
  • the generating unit 113 generating animation video according to the second video and the audio information includes:
  • if the first duration is not equal to the second duration, obtaining the information with the larger duration from the second video and the audio information as the information to be processed;
  • Compressing the information with the larger duration ensures that the durations of the processed second video and the processed audio information are equal, which allows the processed second video and the processed audio information to be combined directly, thereby improving the generation efficiency of the animation video.
  • the generating unit 113 merging the processed second video and the processed audio information to obtain the animation video includes:
  • the audio track information is replaced with the processed audio information to obtain the animated video.
  • the animation video can be quickly generated.
  • It can be seen from the above technical solutions that, by analyzing the text information through the video generation model, the present application can quickly generate the initial video, thereby improving the generation efficiency of the animation video. By identifying the human body feature points, the posture information of the user in each frame of image can be determined accurately, and the posture information is adjusted when it is a preset posture, which prevents bad posture information such as the preset posture from appearing in the second video. Since good posture information can play an educational role for users, avoiding such bad posture information in the second video improves the quality of the second video. In addition, the audio generation model can accurately generate the audio information corresponding to the text information, and generating the animation video from the audio information and the second video improves its generation quality.
  • As shown in FIG. 3, it is a schematic structural diagram of an electronic device implementing a preferred embodiment of the animation video generation method of the present application.
  • the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and computer-readable instructions stored in the memory 12 and operable on the processor 13, such as an animation video generation program.
  • the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation on the electronic device 1. It may include more or fewer components than those shown in the illustration, or combine certain components, or have different components; for example, the electronic device 1 may also include input and output devices, network access devices, buses, and the like.
  • the processor 13 can be a central processing unit (Central Processing Unit, CPU), and can also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the processor 13 is the computing core and control center of the electronic device 1, and uses various interfaces and lines to connect the entire electronic device 1, and execute the operating system of the electronic device 1 and various installed applications, program codes, etc.
  • the computer-readable instructions may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 12 and executed by the processor 13 to Complete this application.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of accomplishing specific functions, and the computer-readable instruction segments are used to describe the execution process of the computer-readable instructions in the electronic device 1 .
  • the computer readable instructions may be divided into an acquisition unit 110, an input unit 111, an identification unit 112, a generation unit 113, an adjustment unit 114, an analysis unit 115, a construction unit 116, an encoding unit 117, a mapping unit 118, and a computing unit 119.
  • the memory 12 can be used to store the computer-readable instructions and/or modules, and the processor 13 runs or executes the computer-readable instructions and/or modules stored in the memory 12 and calls the data stored in the memory 12 to realize various functions of the electronic device 1.
  • the memory 12 can mainly include a program storage area and a data storage area, where the program storage area can store an operating system and an application program required by at least one function (such as a sound playback function or an image playback function), and the data storage area can store data created in accordance with the use of the electronic device.
  • The memory 12 can include non-volatile and volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another storage device.
  • the memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory in physical form, such as a memory stick or a TF card (Trans-flash Card).
  • if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium, and the computer-readable storage medium may be a non-volatile storage medium or a volatile storage medium.
  • all or part of the processes in the methods of the above embodiments of the present application can also be completed by instructing related hardware through computer-readable instructions, and the computer-readable instructions can be stored in a computer-readable storage medium; when the computer-readable instructions are executed by a processor, the steps of the above method embodiments can be realized.
  • the computer-readable instructions include computer-readable instruction codes
  • the computer-readable instruction codes may be in the form of source code, object code, executable file or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory).
  • Blockchain, essentially a decentralized database, is a series of data blocks associated with each other using cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the memory 12 in the electronic device 1 stores computer-readable instructions to implement a method for generating animated videos, and the processor 13 can execute the computer-readable instructions to achieve:
  • when a video generation request is received, acquiring text information according to the video generation request;
  • inputting the text information into a pre-trained video generation model to obtain an initial video;
  • identifying human body feature points of each frame of image in the initial video;
  • generating posture information of the user in each frame of image according to the human body feature points;
  • if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
  • analyzing the text information based on a pre-trained audio generation model to obtain audio information;
  • generating an animation video according to the second video and the audio information.
  • Computer-readable instructions are stored on the computer-readable storage medium, wherein the computer-readable instructions are used to implement the following steps when executed by the processor 13:
  • when a video generation request is received, acquiring text information according to the video generation request;
  • inputting the text information into a pre-trained video generation model to obtain an initial video;
  • identifying human body feature points of each frame of image in the initial video;
  • generating posture information of the user in each frame of image according to the human body feature points;
  • if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
  • analyzing the text information based on a pre-trained audio generation model to obtain audio information;
  • generating an animation video according to the second video and the audio information.
  • modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or in the form of hardware plus software function modules.

Abstract

The present application relates to artificial intelligence. Provided are an animation video generation method and apparatus, and a device and a storage medium. The method may comprise: when a video generation request is received, acquiring text information according to the video generation request; inputting the text information into a pre-trained video generation model, so as to obtain an initial video; identifying human body feature points of each frame of image in the initial video; generating posture information of a user in each frame of image according to the human body feature points; if the posture information is a preset posture, adjusting the posture information according to the human body feature points, so as to obtain a second video; analyzing the text information on the basis of a pre-trained audio generation model, so as to obtain audio information; and generating an animation video according to the second video and the audio information. By means of the present application, the generation efficiency and generation quality of an animation video can be improved. In addition, the present application further relates to a blockchain technology, and the animation video can be stored in a blockchain.

Description

Animation Video Generation Method, Device, Equipment and Storage Medium

This application claims priority to the Chinese patent application filed with the China Patent Office on September 29, 2021, with application number 202111152667.X and entitled "Animation Video Generation Method, Device, Equipment, and Storage Medium", the entire content of which is incorporated into this application by reference.
Technical Field

The present application relates to the technical field of artificial intelligence, and in particular to a method, device, equipment and storage medium for generating an animation video.
Background

In educational scenarios for children and students, animated video teaching can stimulate students' interest in and enthusiasm for learning. With the development of artificial intelligence, animation video teaching has also developed. However, the inventor realized that the current animation video generation process involves steps such as story script writing, storyboard design, live anchor shooting, illustration material drawing, animation production and post-editing, resulting in low efficiency of complete animation video generation. In addition, because different video producers hold different views on video production, the generation quality of the video cannot be ensured.
Summary

In view of the above, it is necessary to provide an animation video generation method, device, equipment and storage medium that can improve the generation efficiency and generation quality of animation videos.
The first aspect of the present application provides an animation video generation method, which includes:

when a video generation request is received, acquiring text information according to the video generation request;

inputting the text information into a pre-trained video generation model to obtain an initial video;

identifying human body feature points of each frame of image in the initial video;

generating posture information of the user in each frame of image according to the human body feature points;

if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;

analyzing the text information based on a pre-trained audio generation model to obtain audio information; and

generating an animation video according to the second video and the audio information.
The second aspect of the present application provides an electronic device, which includes a processor and a memory, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:

when a video generation request is received, acquiring text information according to the video generation request;

inputting the text information into a pre-trained video generation model to obtain an initial video;

identifying human body feature points of each frame of image in the initial video;

generating posture information of the user in each frame of image according to the human body feature points;

if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;

analyzing the text information based on a pre-trained audio generation model to obtain audio information; and

generating an animation video according to the second video and the audio information.
The third aspect of the present application provides a computer-readable storage medium storing at least one computer-readable instruction, the at least one computer-readable instruction being executed by a processor to implement the following steps:

when a video generation request is received, acquiring text information according to the video generation request;

inputting the text information into a pre-trained video generation model to obtain an initial video;

identifying human body feature points of each frame of image in the initial video;

generating posture information of the user in each frame of image according to the human body feature points;

if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;

analyzing the text information based on a pre-trained audio generation model to obtain audio information; and

generating an animation video according to the second video and the audio information.
The fourth aspect of the present application provides an animation video generation device, which includes:

an acquisition unit, configured to acquire text information according to a video generation request when the video generation request is received;

an input unit, configured to input the text information into a pre-trained video generation model to obtain an initial video;

a recognition unit, configured to identify human body feature points of each frame of image in the initial video;

a generating unit, configured to generate posture information of the user in each frame of image according to the human body feature points;

an adjustment unit, configured to adjust the posture information according to the human body feature points to obtain a second video if the posture information is a preset posture;

an analysis unit, configured to analyze the text information based on a pre-trained audio generation model to obtain audio information;

the generating unit being further configured to generate an animation video according to the second video and the audio information.
It can be seen from the above technical solutions that, by analyzing the text information through the video generation model, the present application can quickly generate the initial video, thereby improving the generation efficiency of the animation video. By identifying the human body feature points, the posture information of the user in each frame of image can be determined accurately, and the posture information is adjusted when it is a preset posture, which prevents bad posture information such as the preset posture from appearing in the second video. Since good posture information can play an educational role for users, avoiding such bad posture information in the second video improves the quality of the second video. In addition, the audio generation model can accurately generate the audio information corresponding to the text information, and generating the animation video from the audio information and the second video improves its generation quality.
Description of Drawings

FIG. 1 is a flowchart of a preferred embodiment of the animation video generation method of the present application.

FIG. 2 is a functional block diagram of a preferred embodiment of the animation video generation device of the present application.

FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the animation video generation method of the present application.
Detailed Description

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in detail below in conjunction with the accompanying drawings and specific embodiments.

As shown in FIG. 1, it is a flowchart of a preferred embodiment of the animation video generation method of the present application. According to different requirements, the order of the steps in the flowchart can be changed, and some steps can be omitted.

The animation video generation method can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or digital-computer-controlled machines to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.

Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

The animation video generation method is applied in the field of smart education, thereby promoting the development of smart cities. The animation video generation method is applied to one or more electronic devices. An electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored computer-readable instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.

The electronic device may be any electronic product capable of human-machine interaction with the user, for example, a personal computer, a tablet computer, a smart phone, a personal digital assistant (Personal Digital Assistant, PDA), a game console, an interactive Internet protocol television (Internet Protocol Television, IPTV), a smart wearable device, etc.

The electronic device may include a network device and/or a user device. The network device includes, but is not limited to, a single network electronic device, an electronic device group composed of multiple network electronic devices, or a cloud composed of a large number of hosts or network electronic devices based on cloud computing.

The network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
S10: When a video generation request is received, text information is acquired according to the request.
In at least one embodiment of the present application, the triggering user of the video generation request differs with the application scenario. For example, if the application scenario is the field of education, the triggering user may be a teacher.
The video generation request may include, but is not limited to, a text path and a preset tag.
The text information is the text that needs to be converted into a video; for example, it may be a teacher's lecture notes.
In at least one embodiment of the present application, the electronic device acquiring text information according to the video generation request includes:
parsing the message of the video generation request to obtain the data information carried by the message;
extracting the text path from the data information according to the preset tag;
acquiring the text information from the text path.
The preset tag is a tag used to indicate a path; for example, it may be storage location.
Through the preset tag, the text path can be extracted accurately, so the text information can be obtained accurately, which facilitates generation of the corresponding animation video.
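As an illustrative, non-limiting sketch of this step: assuming the request message carries a JSON payload and the preset tag is literally named "storage location" (both assumptions; the method does not prescribe a message format), the extraction might look as follows.

```python
import json

def extract_text_info(request_message: str, preset_tag: str = "storage location") -> str:
    """Parse the video generation request, pull the text path out of the
    payload via the preset tag, and read the text information from it.
    The JSON payload and the tag name are illustrative assumptions."""
    data = json.loads(request_message)           # data information carried by the message
    text_path = data[preset_tag]                 # extract the text path via the preset tag
    with open(text_path, encoding="utf-8") as f:
        return f.read()                          # the text information to be converted
```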
S11: The text information is input into a pre-trained video generation model to obtain an initial video.
In at least one embodiment of the present application, the video generation model is a model capable of converting text into video. It includes an encoding layer, a decoding layer, and a preset mapping table, where the preset mapping table stores a mapping relationship between pixel values and vectors.
The initial video is the video generated after the video generation model analyzes the text information. The initial video contains no voice information.
In at least one embodiment of the present application, before the text information is input into the pre-trained video generation model to obtain the initial video, the method further includes:
acquiring a plurality of video training samples, each comprising a training video and the training text corresponding to that video;
constructing a learner, where the learner includes an encoding layer and a decoding layer;
performing text encoding on the training text to obtain a text vector;
analyzing the text vector based on the encoding layer to obtain feature information of the training text;
analyzing the feature information based on the decoding layer to obtain an output vector;
mapping the training video based on the preset mapping table to obtain an image vector of the training video;
calculating the similarity between the text vector and the output vector to obtain a first similarity, and calculating the similarity between the text vector and the image vector to obtain a second similarity;
calculating the ratio of the first similarity to the second similarity to obtain the learning index of the learner;
adjusting the network parameters in the learner until the learning index no longer increases, to obtain the video generation model.
The text vector is used to characterize the training text.
The learning index is used to evaluate the accuracy of the learner.
The network parameters are the preset parameters in the encoding layer and the decoding layer. For example, if the encoding layer contains a convolutional layer, a network parameter may be the size of its convolution kernel.
Generating the learning index from the similarity between the training text and the predicted video and the similarity between the training text and the training video, and then adjusting the network parameters according to that index, improves the model's ability to represent text information and thus the accuracy of video generation.
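A minimal sketch of the learning index described above, assuming cosine similarity as the (unspecified) similarity measure and NumPy vectors for the text, output, and image representations:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; the method does not fix a similarity measure,
    so this choice is an assumption."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def learning_index(text_vec: np.ndarray, output_vec: np.ndarray,
                   image_vec: np.ndarray) -> float:
    first = cosine(text_vec, output_vec)    # first similarity: text vs. learner output
    second = cosine(text_vec, image_vec)    # second similarity: text vs. training video
    return first / second                   # ratio used as the learning index
```

Training would then adjust the encoder and decoder parameters and stop once this index no longer increases between iterations.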
In at least one embodiment of the present application, the way the electronic device analyzes the text information based on the video generation model is similar to the way it analyzes the training text based on the learner, and is not repeated here.
S12: Human body feature points are identified in each frame of the initial video.
In at least one embodiment of the present application, the human body feature points include, but are not limited to, key facial feature points such as the pupil centers, as well as hand joint points and bone joint points.
In at least one embodiment of the present application, the electronic device identifying the human body feature points of each frame in the initial video includes:
detecting each frame with a preset detector to obtain the human body region in the frame;
performing grayscale processing on the human body region to obtain its pixels and the grayscale value of each pixel;
calculating the pixel difference between each pixel and a preset feature point according to the pixel grayscale value and the feature grayscale value of the preset feature point;
determining the pixels whose pixel difference is smaller than a preset threshold as initial feature points;
constructing a coordinate system based on each frame and acquiring the initial coordinate information of the initial feature points in that frame;
screening the human body feature points out of the initial feature points according to the initial coordinate information.
The preset detector can be used to identify person information in an image.
The preset feature points include hand joint points, bone joint points, and the like. The feature grayscale value may be determined from the pixel information corresponding to the preset feature points of multiple preset users.
The preset threshold can be set as required.
The coordinate system includes an abscissa axis and an ordinate axis.
Detecting each frame with the preset detector removes the interference of background information with the human body feature points, improving recognition accuracy, and also reduces the number of pixels to analyze, improving recognition efficiency. Comparing the pixel grayscale values with the feature grayscale value then quickly determines the initial feature points, and the initial coordinate information of those points improves the accuracy with which the human body feature points are determined.
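A sketch of the grayscale screening described above, assuming the preset detector has already returned a grayscale crop of the human body region and that a single feature grayscale value is compared against (both simplifications for illustration):

```python
import numpy as np

def initial_feature_points(body_region: np.ndarray,
                           feature_gray: float,
                           threshold: float) -> np.ndarray:
    """body_region: 2-D grayscale crop of the detected human body region.
    Returns the (x, y) coordinates, in the per-frame coordinate system,
    of pixels whose grayscale difference from the preset feature
    grayscale value is below the preset threshold."""
    diff = np.abs(body_region.astype(np.float32) - feature_gray)  # per-pixel difference
    ys, xs = np.nonzero(diff < threshold)        # pixels under the preset threshold
    return np.stack([xs, ys], axis=1)            # initial feature points as (x, y)
```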
In at least one embodiment of the present application, the electronic device screening the human body feature points out of the initial feature points according to the initial coordinate information includes:
for any initial feature point, calculating the feature distance between that point and each target feature point according to the initial coordinate information, where the target feature points are all initial feature points other than the point in question;
determining the smallest feature distance as the target distance, and determining the target feature point corresponding to the target distance as the adjacent feature point of that initial feature point;
performing normal distribution processing on the target distances to obtain a probability value for each target distance;
determining the initial feature points whose target-distance probability value is greater than a preset probability value as the human body feature points.
The preset probability value can be set as required; for example, it may be 99.44%.
Analyzing the feature distances quickly determines the adjacent feature point of each initial feature point, and applying normal distribution processing to the target distances and analyzing their probability values accurately screens the human body feature points out of the initial feature points.
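A sketch of this screening, with one deliberate assumption: the method does not specify what its "normal distribution processing" computes, so a normal CDF fitted to the nearest-neighbour distances is used as the probability value here.

```python
import numpy as np
from scipy.stats import norm

def filter_body_points(points: np.ndarray, preset_prob: float = 0.9944) -> np.ndarray:
    """points: (N, 2) initial coordinates. Keeps the points whose
    target-distance probability value exceeds the preset probability."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))     # pairwise feature distances
    np.fill_diagonal(dist, np.inf)                # ignore self-distances
    target = dist.min(axis=1)                     # target distance: nearest neighbour
    mu, sigma = target.mean(), target.std() + 1e-8
    prob = norm.cdf(target, loc=mu, scale=sigma)  # assumed "normal distribution processing"
    return points[prob > preset_prob]             # points above the preset probability
```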
S13: The user's posture information in each frame is generated according to the human body feature points.
In at least one embodiment of the present application, the posture information is the posture of the user in each frame; for example, it may be head down or head up.
In at least one embodiment of the present application, the electronic device generating the user's posture information in each frame according to the human body feature points includes:
acquiring the coordinate information of the human body feature points in the coordinate system as human body coordinate information;
taking any two adjacent feature points among the human body feature points as a feature point pair;
calculating the Euler angle of each feature point pair according to the human body coordinate information and the abscissa axis of the coordinate system;
calculating the average of the Euler angles to obtain the angle mean, and determining the preset posture information corresponding to the angle mean as the posture information.
A feature point pair consists of any two adjacent feature points among the human body feature points, where "adjacent" means closest in feature distance. For example, given human body feature points A, B, C, and D, if the feature distance between A and B is 5, between A and C is 2, and between A and D is 3, then C is the adjacent feature point of A.
Calculating the Euler angles only between adjacent feature points avoids interference from feature points that are far apart, improving the accuracy of the posture information.
Specifically, the posture information may be determined from a mapping table of angles to preset posture information, where the preset posture information may be annotated by the user.
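A sketch of the per-frame posture estimate, using the in-plane angle of each adjacent pair against the frame's abscissa axis as a stand-in for the Euler angle, and an illustrative angle-to-posture mapping table (the real table is annotated by the user):

```python
import math

# Illustrative mapping table; ranges and labels are assumptions.
POSE_TABLE = [(-90.0, -30.0, "head down"),
              (-30.0, 30.0, "standard"),
              (30.0, 90.0, "head up")]

def posture_from_pairs(pairs):
    """pairs: list of ((x1, y1), (x2, y2)) adjacent feature point pairs.
    Returns the per-pair angles, their mean, and the mapped posture."""
    angles = [math.degrees(math.atan2(y2 - y1, x2 - x1))
              for (x1, y1), (x2, y2) in pairs]
    mean_angle = sum(angles) / len(angles)        # the angle mean
    posture = next((name for lo, hi, name in POSE_TABLE if lo <= mean_angle < hi),
                   "unknown")
    return angles, mean_angle, posture
```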
S14: If the posture information is a preset posture, the posture information is adjusted according to the human body feature points to obtain a second video.
In at least one embodiment of the present application, the preset postures may include, but are not limited to, bad postures such as head down and head up.
In the second video, the user posture in every frame is not the preset posture.
In at least one embodiment of the present application, the electronic device adjusting the posture information according to the human body feature points to obtain the second video includes:
acquiring the posture angle of the standard posture from a posture mapping table;
comparing the angle mean with the posture angle;
if the angle mean is greater than the posture angle, comparing each Euler angle with the angle mean;
taking the feature point pairs whose Euler angle is greater than the angle mean as the feature points to be processed;
adjusting the positions of the feature points to be processed in the image until the adjusted posture information is no longer a preset posture, obtaining the second video.
The posture mapping table stores the mapping relationships between multiple pieces of preset posture information and angles; the preset posture information includes the standard posture and bad postures such as pitching. The preset posture information in the table may be annotated by the user, and the angles in the table are calculated in the same way as the per-frame angle mean, which is not repeated here.
Comparing the angle mean with the posture angle, and the Euler angles with the angle mean, quickly determines the human body feature points that affect the posture information so that they can be adjusted, improving the quality of the second video.
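A sketch of the adjustment step using the selection logic above; the method does not prescribe how a feature point is moved, so the small vertical nudge below is purely illustrative:

```python
def adjust_posture(points, pairs, angles, mean_angle, standard_angle, step=1.0):
    """points: list of [x, y] coordinates; pairs: (i, j) index pairs into
    points; angles: per-pair Euler angles. Nudges the pairs whose angle
    exceeds the angle mean; in the full method this repeats until the
    recomputed posture is no longer a preset posture."""
    if mean_angle <= standard_angle:
        return points                       # posture already within the standard
    for (i, j), angle in zip(pairs, angles):
        if angle > mean_angle:              # feature point pair to be processed
            points[i][1] -= step            # assumed, illustrative displacement
            points[j][1] -= step
    return points
```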
S15: The text information is analyzed based on a pre-trained audio generation model to obtain audio information.
In at least one embodiment of the present application, the audio generation model is used to convert the text information into speech.
The audio information is the speech corresponding to the text information.
In at least one embodiment of the present application, the audio generation model includes an emotion recognition network layer and a speech conversion network layer, and the electronic device analyzing the text information based on the pre-trained audio generation model to obtain the audio information includes:
analyzing the text information based on the emotion recognition network layer to obtain the text emotion of the text information;
acquiring the emotional voice features of the text emotion from a voice feature library;
processing the text information based on the speech conversion network layer to obtain voice information, and extracting text voice features from the voice information;
performing audio mixing on the text voice features and the emotional voice features to obtain the audio information.
The emotion recognition network layer is used to analyze the emotion corresponding to a text. The text emotion may be happy, sad, and so on.
The speech conversion network layer is used to convert text into speech.
Mixing the text voice features with the emotional voice features embeds the text emotion in the audio information, making the audio information more engaging.
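A pipeline sketch of this step. All four callables are stand-ins for the trained network layers, the voice feature library, and the mixer described above; none of them name a real library API.

```python
def generate_audio(text, emotion_layer, conversion_layer, voice_feature_bank, mix):
    """Hypothetical components: emotion_layer and conversion_layer stand in
    for the two trained network layers, voice_feature_bank for the voice
    feature library, and mix for the audio mixing step."""
    text_emotion = emotion_layer(text)               # e.g. "happy" or "sad"
    emotion_feat = voice_feature_bank[text_emotion]  # emotional voice features
    voice_info = conversion_layer(text)              # plain speech for the text
    return mix(voice_info, emotion_feat)             # mixed audio information
```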
S16: An animation video is generated according to the second video and the audio information.
In at least one embodiment of the present application, the animation video is a video containing the audio information and the second video.
It should be emphasized that, to further ensure the privacy and security of the animation video, the animation video may also be stored in a node of a blockchain.
In at least one embodiment of the present application, the electronic device generating the animation video according to the second video and the audio information includes:
counting the duration of the second video to obtain a first duration;
counting the duration of the audio information to obtain a second duration;
if the first duration is not equal to the second duration, taking whichever of the second video and the audio information has the larger duration as the information to be processed;
compressing the information to be processed until the durations of the processed second video and the processed audio information are equal;
merging the processed second video and the processed audio information to obtain the animation video.
In this embodiment, when the first duration and the second duration are not equal, compressing the longer of the two ensures that the processed second video and processed audio information have equal durations, so they can be merged directly, improving the generation efficiency of the animation video.
Specifically, the electronic device merging the processed second video and the processed audio information to obtain the animation video includes:
acquiring the sound track information of the processed second video in the sound track dimension;
replacing the sound track information with the processed audio information to obtain the animation video.
Replacing the sound track information with the processed audio information generates the animation video quickly.
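A sketch of duration alignment and sound-track replacement using moviepy as an illustrative tool (the method names no library; the moviepy 1.x API is assumed). Only the video-longer branch is shown; the audio-longer case would be time-compressed analogously.

```python
from moviepy.editor import VideoFileClip, AudioFileClip
from moviepy.video.fx.all import speedx

def merge_to_animation(video_path: str, audio_path: str, out_path: str) -> None:
    video = VideoFileClip(video_path)     # the processed second video
    audio = AudioFileClip(audio_path)     # the processed audio information
    if video.duration > audio.duration:
        # compress the longer item so both durations are equal
        video = speedx(video, factor=video.duration / audio.duration)
    final = video.set_audio(audio)        # replace the sound-track information
    final.write_videofile(out_path)       # the resulting animation video
```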
As the above technical solution shows, the present application analyzes the text information with the video generation model to generate the initial video quickly, improving the generation efficiency of the animation video. Identifying the human body feature points accurately determines the user's posture information in each frame, and adjusting that information when it matches a preset posture keeps bad postures such as the preset posture out of the second video. Because good posture information has an educational effect on users, avoiding bad posture information improves the quality of the second video. The audio generation model accurately generates the audio information corresponding to the text information, and the audio information together with the second video improves the generation quality of the animation video.
As shown in FIG. 2, which is a functional block diagram of a preferred embodiment of the animation video generation apparatus of the present application, the animation video generation apparatus 11 includes an acquisition unit 110, an input unit 111, a recognition unit 112, a generation unit 113, an adjustment unit 114, an analysis unit 115, a construction unit 116, an encoding unit 117, a mapping unit 118, and a calculation unit 119. A module/unit in this application is a series of computer-readable instruction segments that can be fetched by the processor 13 to perform a fixed function, and is stored in the memory 12. In this embodiment, the function of each module/unit is detailed in the following embodiments.
When a video generation request is received, the acquisition unit 110 acquires text information according to the request.
In at least one embodiment of the present application, the triggering user of the video generation request differs with the application scenario; for example, if the application scenario is the field of education, the triggering user may be a teacher.
The video generation request may include, but is not limited to, a text path and a preset tag.
The text information is the text that needs to be converted into a video; for example, it may be a teacher's lecture notes.
In at least one embodiment of the present application, the acquisition unit 110 acquiring text information according to the video generation request includes:
parsing the message of the video generation request to obtain the data information carried by the message;
extracting the text path from the data information according to the preset tag;
acquiring the text information from the text path.
The preset tag is a tag used to indicate a path; for example, it may be storage location.
Through the preset tag, the text path can be extracted accurately, so the text information can be obtained accurately, which facilitates generation of the corresponding animation video.
The input unit 111 inputs the text information into a pre-trained video generation model to obtain an initial video.
In at least one embodiment of the present application, the video generation model is a model capable of converting text into video. It includes an encoding layer, a decoding layer, and a preset mapping table, where the preset mapping table stores a mapping relationship between pixel values and vectors.
The initial video is the video generated after the video generation model analyzes the text information. The initial video contains no voice information.
In at least one embodiment of the present application, before the text information is input into the pre-trained video generation model to obtain the initial video, the acquisition unit 110 acquires a plurality of video training samples, each comprising a training video and the training text corresponding to that video;
the construction unit 116 constructs a learner, where the learner includes an encoding layer and a decoding layer;
the encoding unit 117 performs text encoding on the training text to obtain a text vector;
the analysis unit 115 analyzes the text vector based on the encoding layer to obtain feature information of the training text;
the analysis unit 115 analyzes the feature information based on the decoding layer to obtain an output vector;
the mapping unit 118 maps the training video based on a preset mapping table to obtain an image vector of the training video;
the calculation unit 119 calculates the similarity between the text vector and the output vector to obtain a first similarity, and calculates the similarity between the text vector and the image vector to obtain a second similarity;
the calculation unit 119 calculates the ratio of the first similarity to the second similarity to obtain the learning index of the learner;
the adjustment unit 114 adjusts the network parameters in the learner until the learning index no longer increases, to obtain the video generation model.
The text vector is used to characterize the training text.
The learning index is used to evaluate the accuracy of the learner.
The network parameters are the preset parameters in the encoding layer and the decoding layer. For example, if the encoding layer contains a convolutional layer, a network parameter may be the size of its convolution kernel.
Generating the learning index from the similarity between the training text and the predicted video and the similarity between the training text and the training video, and then adjusting the network parameters according to that index, improves the model's ability to represent text information and thus the accuracy of video generation.
In at least one embodiment of the present application, the way the text information is analyzed based on the video generation model is similar to the way the training text is analyzed based on the learner, and is not repeated here.
The recognition unit 112 identifies human body feature points in each frame of the initial video.
In at least one embodiment of the present application, the human body feature points include, but are not limited to, key facial feature points such as the pupil centers, as well as hand joint points and bone joint points.
In at least one embodiment of the present application, the recognition unit 112 identifying the human body feature points of each frame in the initial video includes:
detecting each frame with a preset detector to obtain the human body region in the frame;
performing grayscale processing on the human body region to obtain its pixels and the grayscale value of each pixel;
calculating the pixel difference between each pixel and a preset feature point according to the pixel grayscale value and the feature grayscale value of the preset feature point;
determining the pixels whose pixel difference is smaller than a preset threshold as initial feature points;
constructing a coordinate system based on each frame and acquiring the initial coordinate information of the initial feature points in that frame;
screening the human body feature points out of the initial feature points according to the initial coordinate information.
The preset detector can be used to identify person information in an image.
The preset feature points include hand joint points, bone joint points, and the like. The feature grayscale value may be determined from the pixel information corresponding to the preset feature points of multiple preset users.
The preset threshold can be set as required.
The coordinate system includes an abscissa axis and an ordinate axis.
Detecting each frame with the preset detector removes the interference of background information with the human body feature points, improving recognition accuracy, and also reduces the number of pixels to analyze, improving recognition efficiency. Comparing the pixel grayscale values with the feature grayscale value then quickly determines the initial feature points, and the initial coordinate information of those points improves the accuracy with which the human body feature points are determined.
In at least one embodiment of the present application, the recognition unit 112 screening the human body feature points out of the initial feature points according to the initial coordinate information includes:
for any initial feature point, calculating the feature distance between that point and each target feature point according to the initial coordinate information, where the target feature points are all initial feature points other than the point in question;
determining the smallest feature distance as the target distance, and determining the target feature point corresponding to the target distance as the adjacent feature point of that initial feature point;
performing normal distribution processing on the target distances to obtain a probability value for each target distance;
determining the initial feature points whose target-distance probability value is greater than a preset probability value as the human body feature points.
The preset probability value can be set as required; for example, it may be 99.44%.
Analyzing the feature distances quickly determines the adjacent feature point of each initial feature point, and applying normal distribution processing to the target distances and analyzing their probability values accurately screens the human body feature points out of the initial feature points.
The generation unit 113 generates the user's posture information in each frame according to the human body feature points.
In at least one embodiment of the present application, the posture information is the posture of the user in each frame; for example, it may be head down or head up.
In at least one embodiment of the present application, the generation unit 113 generating the user's posture information in each frame according to the human body feature points includes:
acquiring the coordinate information of the human body feature points in the coordinate system as human body coordinate information;
taking any two adjacent feature points among the human body feature points as a feature point pair;
calculating the Euler angle of each feature point pair according to the human body coordinate information and the abscissa axis of the coordinate system;
calculating the average of the Euler angles to obtain the angle mean, and determining the preset posture information corresponding to the angle mean as the posture information.
A feature point pair consists of any two adjacent feature points among the human body feature points, where "adjacent" means closest in feature distance. For example, given human body feature points A, B, C, and D, if the feature distance between A and B is 5, between A and C is 2, and between A and D is 3, then C is the adjacent feature point of A.
Calculating the Euler angles only between adjacent feature points avoids interference from feature points that are far apart, improving the accuracy of the posture information.
Specifically, the posture information may be determined from a mapping table of angles to preset posture information, where the preset posture information may be annotated by the user.
If the posture information is a preset posture, the adjustment unit 114 adjusts the posture information according to the human body feature points to obtain a second video.
In at least one embodiment of the present application, the preset postures may include, but are not limited to, bad postures such as head down and head up.
In the second video, the user posture in every frame is not the preset posture.
In at least one embodiment of the present application, the adjustment unit 114 adjusting the posture information according to the human body feature points to obtain the second video includes:
acquiring the posture angle of the standard posture from a posture mapping table;
comparing the angle mean with the posture angle;
if the angle mean is greater than the posture angle, comparing each Euler angle with the angle mean;
taking the feature point pairs whose Euler angle is greater than the angle mean as the feature points to be processed;
adjusting the positions of the feature points to be processed in the image until the adjusted posture information is no longer a preset posture, obtaining the second video.
The posture mapping table stores the mapping relationships between multiple pieces of preset posture information and angles; the preset posture information includes the standard posture and bad postures such as pitching. The preset posture information in the table may be annotated by the user, and the angles in the table are calculated in the same way as the per-frame angle mean, which is not repeated here.
Comparing the angle mean with the posture angle, and the Euler angles with the angle mean, quickly determines the human body feature points that affect the posture information so that they can be adjusted, improving the quality of the second video.
The analysis unit 115 analyzes the text information based on a pre-trained audio generation model to obtain audio information.
In at least one embodiment of the present application, the audio generation model is used to convert the text information into speech.
The audio information is the speech corresponding to the text information.
In at least one embodiment of the present application, the audio generation model includes an emotion recognition network layer and a speech conversion network layer, and the analysis unit 115 analyzing the text information based on the pre-trained audio generation model to obtain the audio information includes:
analyzing the text information based on the emotion recognition network layer to obtain the text emotion of the text information;
acquiring the emotional voice features of the text emotion from a voice feature library;
processing the text information based on the speech conversion network layer to obtain voice information, and extracting text voice features from the voice information;
performing audio mixing on the text voice features and the emotional voice features to obtain the audio information.
The emotion recognition network layer is used to analyze the emotion corresponding to a text. The text emotion may be happy, sad, and so on.
The speech conversion network layer is used to convert text into speech.
Mixing the text voice features with the emotional voice features embeds the text emotion in the audio information, making the audio information more engaging.
The generation unit 113 generates an animation video according to the second video and the audio information.
In at least one embodiment of the present application, the animation video is a video containing the audio information and the second video.
It should be emphasized that, to further ensure the privacy and security of the animation video, the animation video may also be stored in a node of a blockchain.
In at least one embodiment of the present application, the generation unit 113 generating the animation video according to the second video and the audio information includes:
counting the duration of the second video to obtain a first duration;
counting the duration of the audio information to obtain a second duration;
if the first duration is not equal to the second duration, taking whichever of the second video and the audio information has the larger duration as the information to be processed;
compressing the information to be processed until the durations of the processed second video and the processed audio information are equal;
merging the processed second video and the processed audio information to obtain the animation video.
In this embodiment, when the first duration and the second duration are not equal, compressing the longer of the two ensures that the processed second video and processed audio information have equal durations, so they can be merged directly, improving the generation efficiency of the animation video.
Specifically, the generation unit 113 merging the processed second video and the processed audio information to obtain the animation video includes:
acquiring the sound track information of the processed second video in the sound track dimension;
replacing the sound track information with the processed audio information to obtain the animation video.
Replacing the sound track information with the processed audio information generates the animation video quickly.
As the above technical solution shows, the present application analyzes the text information with the video generation model to generate the initial video quickly, improving the generation efficiency of the animation video. Identifying the human body feature points accurately determines the user's posture information in each frame, and adjusting that information when it matches a preset posture keeps bad postures such as the preset posture out of the second video. Because good posture information has an educational effect on users, avoiding bad posture information improves the quality of the second video. The audio generation model accurately generates the audio information corresponding to the text information, and the audio information together with the second video improves the generation quality of the animation video.
As shown in FIG. 3, which is a schematic structural diagram of an electronic device implementing a preferred embodiment of the animation video generation method of the present application.
In one embodiment of the present application, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and computer-readable instructions stored in the memory 12 and executable on the processor 13, for example an animation video generation program.
Those skilled in the art can understand that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation on it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device 1 may also include input/output devices, network access devices, buses, and the like.
The processor 13 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor 13 is the computing core and control center of the electronic device 1; it connects all parts of the device through various interfaces and lines, and executes the operating system of the electronic device 1 as well as the installed applications and program code.
Exemplarily, the computer-readable instructions may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to complete the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing a specific function, the segments describing the execution of the computer-readable instructions in the electronic device 1. For example, the computer-readable instructions may be divided into the acquisition unit 110, input unit 111, recognition unit 112, generation unit 113, adjustment unit 114, analysis unit 115, construction unit 116, encoding unit 117, mapping unit 118, and calculation unit 119.
The memory 12 may be used to store the computer-readable instructions and/or modules. The processor 13 implements the various functions of the electronic device 1 by running or executing the computer-readable instructions and/or modules stored in the memory 12 and invoking the data stored in it. The memory 12 may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the applications required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created according to the use of the electronic device. The memory 12 may include non-volatile and volatile memory, for example a hard disk, internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another storage device.
The memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory with a physical form, such as a memory stick or a trans-flash (TF) card.
If the integrated modules/units of the electronic device 1 are implemented as software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium, which may be a non-volatile or a volatile storage medium. Based on this understanding, the present application may implement all or part of the processes of the above method embodiments by instructing the relevant hardware through computer-readable instructions, which may be stored in a computer-readable storage medium; when executed by a processor, the computer-readable instructions implement the steps of the above method embodiments.
The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or apparatus capable of carrying the computer-readable instruction code: a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), or a random access memory (RAM).
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with each other by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
With reference to FIG. 1, the memory 12 in the electronic device 1 stores computer-readable instructions implementing an animation video generation method, and the processor 13 can execute those computer-readable instructions to implement:
when a video generation request is received, acquiring text information according to the video generation request;
inputting the text information into a pre-trained video generation model to obtain an initial video;
identifying the human body feature points of each frame in the initial video;
generating the user's posture information in each frame according to the human body feature points;
if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
analyzing the text information based on a pre-trained audio generation model to obtain audio information;
generating an animation video according to the second video and the audio information.
Specifically, for the specific implementation of the above computer-readable instructions by the processor 13, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the modules is only a logical functional division, and other divisions are possible in actual implementation.
The computer-readable storage medium stores computer-readable instructions, which, when executed by the processor 13, implement the following steps:
when a video generation request is received, acquiring text information according to the video generation request;
inputting the text information into a pre-trained video generation model to obtain an initial video;
identifying the human body feature points of each frame in the initial video;
generating the user's posture information in each frame according to the human body feature points;
if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
analyzing the text information based on a pre-trained audio generation model to obtain audio information;
generating an animation video according to the second video and the audio information.
所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本申请各个实施例中的各功能模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能模块的形式实现。In addition, each functional module in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, or in the form of hardware plus software function modules.
因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本申请的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附关联图标记视为限制所涉及的权利要求。Therefore, the embodiments should be regarded as exemplary and not restrictive in all points of view, and the scope of the application is defined by the appended claims rather than the foregoing description, and it is intended that the scope of the present application be defined by the appended claims rather than by the foregoing description. All changes within the meaning and range of equivalents of the elements are embraced in this application. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, the word "comprising" obviously does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or apparatuses recited may also be implemented by one unit or apparatus through software or hardware. Terms such as "first" and "second" denote names only and do not imply any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of this application. Although this application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements may be made to the technical solutions of this application without departing from their spirit and scope.

Claims (20)

  1. An animation video generation method, wherein the animation video generation method comprises:
    when a video generation request is received, acquiring text information according to the video generation request;
    inputting the text information into a pre-trained video generation model to obtain an initial video;
    identifying human body feature points in each frame of image of the initial video;
    generating posture information of a user in each frame of image according to the human body feature points;
    if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
    analyzing the text information based on a pre-trained audio generation model to obtain audio information; and
    generating an animation video according to the second video and the audio information.
  2. The animation video generation method according to claim 1, wherein before inputting the text information into the pre-trained video generation model to obtain the initial video, the method further comprises:
    acquiring a plurality of video training samples, each video training sample comprising a training video and training text corresponding to the training video;
    constructing a learner, wherein the learner comprises an encoding layer and a decoding layer;
    performing text encoding on the training text to obtain a text vector;
    analyzing the text vector based on the encoding layer to obtain feature information of the training text;
    analyzing the feature information based on the decoding layer to obtain an output vector;
    mapping the training video based on a preset mapping table to obtain an image vector of the training video;
    calculating the similarity between the text vector and the output vector to obtain a first similarity, and calculating the similarity between the text vector and the image vector to obtain a second similarity;
    calculating the ratio of the first similarity to the second similarity to obtain a learning index of the learner; and
    adjusting network parameters of the learner until the learning index no longer increases, to obtain the video generation model.
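
The training criterion in claim 2 is a ratio of two similarities rather than a conventional loss. The claim does not fix a similarity measure, so the sketch below assumes cosine similarity; the training loop and the learner.forward interface are likewise illustrative.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def learning_index(text_vec, output_vec, image_vec) -> float:
    s1 = cosine(text_vec, output_vec)   # first similarity: text vector vs. decoder output
    s2 = cosine(text_vec, image_vec)    # second similarity: text vector vs. mapped video
    return s1 / s2                      # learning index of the learner

def train(learner, samples, adjust_step):
    # Hypothetical loop: adjust network parameters until the mean learning
    # index over the training samples no longer increases.
    best = float("-inf")
    while True:
        adjust_step(learner)            # one round of parameter adjustment
        idx = float(np.mean([learning_index(*learner.forward(s)) for s in samples]))
        if idx <= best:
            return learner              # index stopped increasing: model is ready
        best = idx
```
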
  3. The animation video generation method according to claim 1, wherein identifying the human body feature points in each frame of image of the initial video comprises:
    detecting each frame of image based on a preset detector to obtain a human body region in each frame of image;
    performing grayscale processing on the human body region to obtain a plurality of pixels of the human body region and a pixel gray value corresponding to each pixel;
    calculating a pixel difference between each pixel and a preset feature point according to the pixel gray value and a feature gray value of the preset feature point;
    determining the pixels whose pixel difference is smaller than a preset threshold as initial feature points;
    constructing a coordinate system based on each frame of image, and acquiring initial coordinate information of the initial feature points in each frame of image; and
    screening the human body feature points out of the initial feature points according to the initial coordinate information.
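
A minimal sketch of the thresholding in claim 3, assuming an 8-bit RGB-to-grayscale conversion; the default feature gray value and threshold are illustrative, since the claim leaves both open.

```python
import numpy as np

def initial_feature_points(human_region: np.ndarray,
                           feature_gray: int = 128,
                           threshold: int = 10) -> np.ndarray:
    # human_region: H x W x 3 crop returned by the preset detector
    r, g, b = human_region[..., 0], human_region[..., 1], human_region[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b      # grayscale processing per pixel
    diff = np.abs(gray - feature_gray)            # pixel difference to the preset feature gray value
    ys, xs = np.nonzero(diff < threshold)         # keep pixels below the preset threshold
    return np.stack([xs, ys], axis=1)             # (x, y) initial coordinates in the frame's coordinate system
```
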
  4. The animation video generation method according to claim 3, wherein screening the human body feature points out of the initial feature points according to the initial coordinate information comprises:
    for any initial feature point, calculating feature distances between that initial feature point and target feature points according to the initial coordinate information, the target feature points being the initial feature points other than that initial feature point;
    determining the feature distance with the smallest value as a target distance, and determining the target feature point corresponding to the target distance as an adjacent feature point of that initial feature point;
    performing normal distribution processing on the target distances to obtain a probability value for each target distance; and
    determining the initial feature points whose target distance has a probability value greater than a preset probability value as the human body feature points.
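
One way to read the screening in claim 4: score each point's nearest-neighbour (target) distance under a normal distribution fitted to all target distances, and keep the points whose score clears a preset probability value. The scoring function and default threshold below are assumptions, as the claim fixes neither.

```python
import numpy as np

def filter_keypoints(points: np.ndarray, p_min: float = 0.5) -> np.ndarray:
    # points: N x 2 array of (x, y) initial feature points, N >= 2
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)        # pairwise feature distances
    np.fill_diagonal(dists, np.inf)               # ignore self-distances
    nearest = dists.min(axis=1)                   # target distance to each point's adjacent point
    mu, sigma = nearest.mean(), nearest.std() + 1e-8
    # Normal-density score normalized to peak at 1 for a typical spacing.
    score = np.exp(-0.5 * ((nearest - mu) / sigma) ** 2)
    return points[score > p_min]                  # keep points above the preset probability value
```
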
  5. The animation video generation method according to claim 3, wherein generating the posture information of the user in each frame of image according to the human body feature points comprises:
    acquiring coordinate information of the human body feature points according to the coordinate system as human body coordinate information;
    taking any two adjacent feature points among the human body feature points as a feature point pair;
    calculating the Euler angle of each feature point pair according to the human body coordinate information and the abscissa axis of the coordinate system; and
    calculating the average of the Euler angles to obtain an angle mean, and determining the preset posture information corresponding to the angle mean as the posture information.
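
With per-frame 2D coordinates, the "Euler angle" of a feature point pair relative to the abscissa axis reduces to a planar angle. The sketch below averages those angles and looks up a preset posture; the angle-to-posture table is invented for illustration, since the application does not specify the mapping.

```python
import math

def posture_from_pairs(pairs, pose_table, default="unknown"):
    # pairs: [((x1, y1), (x2, y2)), ...] adjacent feature point pairs
    angles = [math.degrees(math.atan2(y2 - y1, x2 - x1))
              for (x1, y1), (x2, y2) in pairs]       # angle of each pair to the abscissa axis
    mean_angle = sum(angles) / len(angles)           # angle mean over all pairs
    for (low, high), pose in pose_table.items():     # preset posture for each angle range
        if low <= mean_angle < high:
            return pose
    return default

# Purely illustrative lookup table.
POSE_TABLE = {(-30.0, 30.0): "standing", (30.0, 90.0): "raising_arm"}
```
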
  6. The animation video generation method according to claim 1, wherein the audio generation model comprises an emotion recognition network layer and a speech conversion network layer, and analyzing the text information based on the pre-trained audio generation model to obtain the audio information comprises:
    analyzing the text information based on the emotion recognition network layer to obtain a text emotion of the text information;
    acquiring emotional speech features of the text emotion from a speech feature library;
    processing the text information based on the speech conversion network layer to obtain speech information, and extracting text speech features from the speech information; and
    performing audio mixing on the text speech features and the emotional speech features to obtain the audio information.
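
A sketch of the two-branch audio path in claim 6, with the emotion network, speech conversion network, feature library, feature extractor, and mixer all injected as hypothetical callables; the application does not specify their interfaces.

```python
def synthesize_audio(text, emotion_net, tts_net, feature_library,
                     extract_features, mix):
    emotion = emotion_net(text)              # emotion recognition network layer -> text emotion
    emo_feat = feature_library[emotion]      # emotional speech features from the library
    speech = tts_net(text)                   # speech conversion network layer -> speech information
    txt_feat = extract_features(speech)      # text speech features from the speech
    return mix(txt_feat, emo_feat)           # audio mixing yields the final audio information
```
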
  7. The animation video generation method according to claim 1, wherein generating the animation video according to the second video and the audio information comprises:
    counting the duration of the second video to obtain a first duration;
    counting the duration of the audio information to obtain a second duration;
    if the first duration is not equal to the second duration, taking whichever of the second video and the audio information has the longer duration as information to be processed;
    compressing the information to be processed until the durations of the processed second video and the processed audio information are equal; and
    merging the processed second video and the processed audio information to obtain the animation video.
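
Claim 7 compresses whichever track is longer until the durations match, then merges. The sketch below stands in for that compression with uniform sample dropping; a real implementation would more likely time-scale video frames and audio independently.

```python
from dataclasses import dataclass

@dataclass
class Track:
    samples: list      # frames for video, audio chunks for audio
    duration: float    # seconds

def compress_to(track: Track, target: float) -> Track:
    # Uniformly drop samples so the track plays in `target` seconds --
    # a crude stand-in for the claimed compression step.
    keep = max(1, round(len(track.samples) * target / track.duration))
    step = len(track.samples) / keep
    samples = [track.samples[int(i * step)] for i in range(keep)]
    return Track(samples, target)

def align(video: Track, audio: Track) -> tuple[Track, Track]:
    first, second = video.duration, audio.duration   # first and second durations
    if first > second:
        video = compress_to(video, second)           # video is the to-be-processed track
    elif second > first:
        audio = compress_to(audio, first)            # audio is the to-be-processed track
    return video, audio                              # equal-length tracks, ready to merge
```
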
  8. An animation video generation apparatus, wherein the animation video generation apparatus comprises:
    an acquisition unit, configured to acquire text information according to a video generation request when the video generation request is received;
    an input unit, configured to input the text information into a pre-trained video generation model to obtain an initial video;
    an identification unit, configured to identify human body feature points in each frame of image of the initial video;
    a generation unit, configured to generate posture information of a user in each frame of image according to the human body feature points;
    an adjustment unit, configured to adjust the posture information according to the human body feature points to obtain a second video if the posture information is a preset posture;
    an analysis unit, configured to analyze the text information based on a pre-trained audio generation model to obtain audio information; and
    the generation unit, further configured to generate an animation video according to the second video and the audio information.
  9. An electronic device, wherein the electronic device comprises a processor and a memory, the processor being configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:
    when a video generation request is received, acquiring text information according to the video generation request;
    inputting the text information into a pre-trained video generation model to obtain an initial video;
    identifying human body feature points in each frame of image of the initial video;
    generating posture information of a user in each frame of image according to the human body feature points;
    if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
    analyzing the text information based on a pre-trained audio generation model to obtain audio information; and
    generating an animation video according to the second video and the audio information.
  10. The electronic device according to claim 9, wherein before inputting the text information into the pre-trained video generation model to obtain the initial video, the processor executes the at least one computer-readable instruction to further implement the following steps:
    acquiring a plurality of video training samples, each video training sample comprising a training video and training text corresponding to the training video;
    constructing a learner, wherein the learner comprises an encoding layer and a decoding layer;
    performing text encoding on the training text to obtain a text vector;
    analyzing the text vector based on the encoding layer to obtain feature information of the training text;
    analyzing the feature information based on the decoding layer to obtain an output vector;
    mapping the training video based on a preset mapping table to obtain an image vector of the training video;
    calculating the similarity between the text vector and the output vector to obtain a first similarity, and calculating the similarity between the text vector and the image vector to obtain a second similarity;
    calculating the ratio of the first similarity to the second similarity to obtain a learning index of the learner; and
    adjusting network parameters of the learner until the learning index no longer increases, to obtain the video generation model.
  11. The electronic device according to claim 9, wherein, when identifying the human body feature points in each frame of image of the initial video, the processor executes the at least one computer-readable instruction to implement the following steps:
    detecting each frame of image based on a preset detector to obtain a human body region in each frame of image;
    performing grayscale processing on the human body region to obtain a plurality of pixels of the human body region and a pixel gray value corresponding to each pixel;
    calculating a pixel difference between each pixel and a preset feature point according to the pixel gray value and a feature gray value of the preset feature point;
    determining the pixels whose pixel difference is smaller than a preset threshold as initial feature points;
    constructing a coordinate system based on each frame of image, and acquiring initial coordinate information of the initial feature points in each frame of image; and
    screening the human body feature points out of the initial feature points according to the initial coordinate information.
  12. The electronic device according to claim 11, wherein, when screening the human body feature points out of the initial feature points according to the initial coordinate information, the processor executes the at least one computer-readable instruction to implement the following steps:
    for any initial feature point, calculating feature distances between that initial feature point and target feature points according to the initial coordinate information, the target feature points being the initial feature points other than that initial feature point;
    determining the feature distance with the smallest value as a target distance, and determining the target feature point corresponding to the target distance as an adjacent feature point of that initial feature point;
    performing normal distribution processing on the target distances to obtain a probability value for each target distance; and
    determining the initial feature points whose target distance has a probability value greater than a preset probability value as the human body feature points.
  13. The electronic device according to claim 11, wherein, when generating the posture information of the user in each frame of image according to the human body feature points, the processor executes the at least one computer-readable instruction to implement the following steps:
    acquiring coordinate information of the human body feature points according to the coordinate system as human body coordinate information;
    taking any two adjacent feature points among the human body feature points as a feature point pair;
    calculating the Euler angle of each feature point pair according to the human body coordinate information and the abscissa axis of the coordinate system; and
    calculating the average of the Euler angles to obtain an angle mean, and determining the preset posture information corresponding to the angle mean as the posture information.
  14. The electronic device according to claim 9, wherein the audio generation model comprises an emotion recognition network layer and a speech conversion network layer, and when analyzing the text information based on the pre-trained audio generation model to obtain the audio information, the processor executes the at least one computer-readable instruction to implement the following steps:
    analyzing the text information based on the emotion recognition network layer to obtain a text emotion of the text information;
    acquiring emotional speech features of the text emotion from a speech feature library;
    processing the text information based on the speech conversion network layer to obtain speech information, and extracting text speech features from the speech information; and
    performing audio mixing on the text speech features and the emotional speech features to obtain the audio information.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction, and the at least one computer-readable instruction, when executed by a processor, implements the following steps:
    when a video generation request is received, acquiring text information according to the video generation request;
    inputting the text information into a pre-trained video generation model to obtain an initial video;
    identifying human body feature points in each frame of image of the initial video;
    generating posture information of a user in each frame of image according to the human body feature points;
    if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
    analyzing the text information based on a pre-trained audio generation model to obtain audio information; and
    generating an animation video according to the second video and the audio information.
  16. The storage medium according to claim 15, wherein before inputting the text information into the pre-trained video generation model to obtain the initial video, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    acquiring a plurality of video training samples, each video training sample comprising a training video and training text corresponding to the training video;
    constructing a learner, wherein the learner comprises an encoding layer and a decoding layer;
    performing text encoding on the training text to obtain a text vector;
    analyzing the text vector based on the encoding layer to obtain feature information of the training text;
    analyzing the feature information based on the decoding layer to obtain an output vector;
    mapping the training video based on a preset mapping table to obtain an image vector of the training video;
    calculating the similarity between the text vector and the output vector to obtain a first similarity, and calculating the similarity between the text vector and the image vector to obtain a second similarity;
    calculating the ratio of the first similarity to the second similarity to obtain a learning index of the learner; and
    adjusting network parameters of the learner until the learning index no longer increases, to obtain the video generation model.
  17. The storage medium according to claim 15, wherein, when identifying the human body feature points in each frame of image of the initial video, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    detecting each frame of image based on a preset detector to obtain a human body region in each frame of image;
    performing grayscale processing on the human body region to obtain a plurality of pixels of the human body region and a pixel gray value corresponding to each pixel;
    calculating a pixel difference between each pixel and a preset feature point according to the pixel gray value and a feature gray value of the preset feature point;
    determining the pixels whose pixel difference is smaller than a preset threshold as initial feature points;
    constructing a coordinate system based on each frame of image, and acquiring initial coordinate information of the initial feature points in each frame of image; and
    screening the human body feature points out of the initial feature points according to the initial coordinate information.
  18. The storage medium according to claim 17, wherein, when screening the human body feature points out of the initial feature points according to the initial coordinate information, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    for any initial feature point, calculating feature distances between that initial feature point and target feature points according to the initial coordinate information, the target feature points being the initial feature points other than that initial feature point;
    determining the feature distance with the smallest value as a target distance, and determining the target feature point corresponding to the target distance as an adjacent feature point of that initial feature point;
    performing normal distribution processing on the target distances to obtain a probability value for each target distance; and
    determining the initial feature points whose target distance has a probability value greater than a preset probability value as the human body feature points.
  19. The storage medium according to claim 17, wherein, when generating the posture information of the user in each frame of image according to the human body feature points, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    acquiring coordinate information of the human body feature points according to the coordinate system as human body coordinate information;
    taking any two adjacent feature points among the human body feature points as a feature point pair;
    calculating the Euler angle of each feature point pair according to the human body coordinate information and the abscissa axis of the coordinate system; and
    calculating the average of the Euler angles to obtain an angle mean, and determining the preset posture information corresponding to the angle mean as the posture information.
  20. The storage medium according to claim 15, wherein the audio generation model comprises an emotion recognition network layer and a speech conversion network layer, and when analyzing the text information based on the pre-trained audio generation model to obtain the audio information, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    analyzing the text information based on the emotion recognition network layer to obtain a text emotion of the text information;
    acquiring emotional speech features of the text emotion from a speech feature library;
    processing the text information based on the speech conversion network layer to obtain speech information, and extracting text speech features from the speech information; and
    performing audio mixing on the text speech features and the emotional speech features to obtain the audio information.
PCT/CN2022/071302 2021-09-29 2022-01-11 Animation video generation method and apparatus, and device and storage medium WO2023050650A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111152667.XA CN113870395A (en) 2021-09-29 2021-09-29 Animation video generation method, device, equipment and storage medium
CN202111152667.X 2021-09-29

Publications (1)

Publication Number Publication Date
WO2023050650A1

Family

ID=79000531

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071302 WO2023050650A1 (en) 2021-09-29 2022-01-11 Animation video generation method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN113870395A (en)
WO (1) WO2023050650A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870395A (en) * 2021-09-29 2021-12-31 平安科技(深圳)有限公司 Animation video generation method, device, equipment and storage medium
CN114598926B (en) * 2022-01-20 2023-01-03 中国科学院自动化研究所 Video generation method and device, electronic equipment and storage medium
CN114567693B (en) * 2022-02-11 2024-01-30 维沃移动通信有限公司 Video generation method and device and electronic equipment
CN114979764B (en) * 2022-04-25 2024-02-06 中国平安人寿保险股份有限公司 Video generation method, device, computer equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109118562A (en) * 2018-08-31 2019-01-01 百度在线网络技术(北京)有限公司 Explanation video creating method, device and the terminal of virtual image
CN110880198A (en) * 2018-09-06 2020-03-13 百度在线网络技术(北京)有限公司 Animation generation method and device
WO2020070483A1 (en) * 2018-10-05 2020-04-09 Blupoint Ltd Data processing apparatus and method
CN110866968A (en) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 Method for generating virtual character video based on neural network and related equipment
CN112184858A (en) * 2020-09-01 2021-01-05 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
CN112381926A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method and apparatus for generating video
CN112927712A (en) * 2021-01-25 2021-06-08 网易(杭州)网络有限公司 Video generation method and device and electronic equipment
CN113194348A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human lecture video generation method, system, device and storage medium
CN113077537A (en) * 2021-04-29 2021-07-06 广州虎牙科技有限公司 Video generation method, storage medium and equipment
CN113392231A (en) * 2021-06-30 2021-09-14 中国平安人寿保险股份有限公司 Method, device and equipment for generating freehand drawing video based on text and storage medium
CN113870395A (en) * 2021-09-29 2021-12-31 平安科技(深圳)有限公司 Animation video generation method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824010A (en) * 2023-07-04 2023-09-29 安徽建筑大学 Feedback type multiterminal animation design online interaction method and system
CN116824010B (en) * 2023-07-04 2024-03-26 安徽建筑大学 Feedback type multiterminal animation design online interaction method and system

Also Published As

Publication number Publication date
CN113870395A (en) 2021-12-31

Legal Events

Code 121 — EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22874053
Country of ref document: EP
Kind code of ref document: A1