CN112633129A - Video analysis method and device, electronic equipment and storage medium - Google Patents

Video analysis method and device, electronic equipment and storage medium

Info

Publication number
CN112633129A
Authority
CN
China
Prior art keywords
video
distance
key point
key points
meets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011506833.7A
Other languages
Chinese (zh)
Inventor
王鑫宇
杨国基
刘炫鹏
陈泷翔
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202011506833.7A
Publication of CN112633129A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30168 Image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video analysis method, a video analysis device, an electronic device, and a storage medium, which relate to the technical field of artificial intelligence. The method comprises the following steps: acquiring a first video and a first voice corresponding to the first video, wherein the first video comprises a plurality of first frame images and each first frame image comprises a plurality of first key points; inputting the first voice into a video generation model to obtain a second video, wherein the second video comprises a plurality of second frame images, each second frame image comprises a plurality of second key points, and the second key points correspond to the first key points; acquiring a first distance between each first key point in the first video and each second key point in the second video; determining whether the second video meets a preset condition according to the first distance; and if the preset condition is met, determining that the second video is a first-level video. The generated second video is evaluated accurately and effectively by utilizing the distance between the first key points and the second key points.

Description

Video analysis method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a video analysis method, an apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology, many different machine learning and deep learning algorithms are applied to digital human generation. In the prior art, when a digital person is generated, voice, text, or the like is generally input into a machine learning or deep learning model to generate a video of the digital person, but the prior art has no specific scheme for evaluating the generated digital person. Therefore, how to evaluate the generated video is an urgent problem to be solved.
Disclosure of Invention
In view of the foregoing, the present application provides a video analysis method, an apparatus, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present application provides a video analysis method, where the method includes: acquiring a first video and a first voice corresponding to the first video, wherein the first video comprises a plurality of first frame images, and each first frame image comprises a plurality of first key points; inputting the first voice into a video generation model to obtain a second video, wherein the second video comprises a plurality of second frame images, each second frame image comprises a plurality of second key points, and the second key points correspond to the first key points; acquiring a first distance between each first key point in the first video and each second key point in the second video; determining whether the second video meets a preset condition according to the first distance; and if the preset condition is met, determining that the second video is a first-level video.
Further, a third video and a third voice corresponding to the third video are obtained; and inputting the third video and the third voice into a video generation network to obtain a video generation model.
Further, inputting the third voice into the video generation model to obtain a fourth video; acquiring a second distance between each third key point in the third video and each fourth key point in the fourth video; and determining whether the second video meets a preset condition according to the first distance and the second distance.
Further, obtaining a difference value between the first distance and the second distance to obtain a distance difference value; and determining whether the second video meets a preset condition or not according to the distance difference.
Further, obtaining a ratio of the distance difference to the second distance to obtain a target parameter; and determining whether the second video meets a preset condition or not according to the target parameter.
Further, determining whether the target parameter is smaller than a first preset threshold value; and if the target parameter is smaller than a first preset threshold value, determining that the second video meets a preset condition.
Further, if the target parameter is greater than or equal to a first preset threshold, determining whether the target parameter is less than a second preset threshold; and if the target parameter is smaller than a second preset threshold value, determining that the second video is a second-level video, wherein the user satisfaction of the second-level video is lower than that of the first-level video.
Further, if the target parameter is greater than or equal to a second preset threshold, it is determined that the second video is a third-level video, and the user satisfaction of the third-level video is lower than that of the second-level video.
Further, the first preset threshold is 0.05, and the second preset threshold is 0.1.
Further, inputting the first voice into a video generation model to obtain a candidate video; determining whether the candidate video contains a face image; and if the candidate video comprises the face image, taking the candidate video as a second video.
Further, if the candidate video does not contain the face image, it is determined that the second video is failed to be generated.
Further, searching a plurality of first mouth key points in the plurality of first key points, and searching a plurality of second mouth key points in the plurality of second key points, wherein each first mouth key point and each second mouth key point correspond to each other; a first distance between each of the first mouth key points in the first video and each of the second mouth key points in the second video is obtained.
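For illustration, a minimal sketch of this mouth-only variant in Python, assuming the common 68-point landmark convention places the mouth key points at indices 48-67 (an assumption; the text does not fix the indices):

```python
import numpy as np

MOUTH_IDX = np.arange(48, 68)  # assumed mouth indices in the 68-point layout

def mouth_keypoint_distance(kpts_a: np.ndarray, kpts_b: np.ndarray) -> float:
    """Mean Euclidean distance restricted to the mouth key points; inputs are
    (num_frames, 68, 2) coordinate arrays for the two videos."""
    diff = kpts_a[:, MOUTH_IDX] - kpts_b[:, MOUTH_IDX]
    return float(np.linalg.norm(diff, axis=-1).mean())
```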
In a second aspect, an embodiment of the present application provides a video analysis apparatus, including: a first acquisition module, a second acquisition module, a third acquisition module, a condition determination module, and a video determination module. The first obtaining module is configured to obtain a first video and a first voice corresponding to the first video, where the first video includes a plurality of first frame images, and each of the first frame images includes a plurality of first key points. The second obtaining module is configured to input the first voice to a video generation model to obtain a second video, where the second video includes a plurality of second frame images, each of the second frame images includes a plurality of second key points, and the second key points correspond to the first key points. The third obtaining module is configured to obtain a first distance between each first key point in the first video and each second key point in the second video. The condition determining module is configured to determine whether the second video meets a preset condition according to the first distance. The video determining module is configured to determine that the second video is a first-level video if the preset condition is met.
In a third aspect, an embodiment of the present application provides an electronic device, which includes: memory, one or more processors, and one or more applications. Wherein the one or more processors are coupled with the memory. One or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of the first aspect as described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, in which program code is stored, and the program code can be called by a processor to execute the method according to the first aspect.
The video analysis method, apparatus, electronic device, and storage medium provided by the embodiments of the application determine whether a user is satisfied with a generated second video based on the distance between the first key points and the second key points, an evaluation approach that is simple and effective. First, a first video and a first voice corresponding to the first video are acquired, where the first video comprises a plurality of first frame images and each first frame image comprises a plurality of first key points. The first voice is then input into a video generation model to obtain a second video, where the second video comprises a plurality of second frame images, each second frame image comprises a plurality of second key points, and the second key points correspond to the first key points. Next, a first distance between each first key point in the first video and each second key point in the second video is acquired, and when the second video meets a preset condition, the second video is determined to be a first-level video. According to the embodiments of the application, the generated second video is evaluated accurately and effectively by acquiring the distance between the first key points and the second key points.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 shows a schematic flow chart of a video analysis method provided in a first embodiment of the present application.
Fig. 2 is a schematic diagram illustrating a face key point in a video analysis method according to a first embodiment of the present application.
Fig. 3 shows a schematic flow chart of a video analysis method according to a second embodiment of the present application.
Fig. 4 shows a schematic flow chart of a video analysis method according to a third embodiment of the present application.
Fig. 5 is a schematic flowchart illustrating step S380 in a video analysis method according to a third embodiment of the present application.
Fig. 6 shows a schematic flow chart of a video analysis method according to a fourth embodiment of the present application.
Fig. 7 is a flowchart illustrating a video analysis method according to a fifth embodiment of the present application.
Fig. 8 is a flowchart illustrating a video analysis method according to a sixth embodiment of the present application.
Fig. 9 is a flowchart illustrating a video analysis method according to a seventh embodiment of the present application.
Fig. 10 shows a block diagram of a video analysis apparatus according to an eighth embodiment of the present application.
Fig. 11 is a block diagram of an electronic device according to a ninth embodiment of the present application for executing a video analysis method according to the embodiment of the present application.
Fig. 12 is a storage unit according to a tenth embodiment of the present application, configured to store or carry program codes for implementing a video analysis method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, more and more schemes for generating digital people exist, but there is no good evaluation scheme for judging the quality of a generated digital person. In particular, when a digital person is synthesized from voice, there are no corresponding key points, so there is no good scheme for evaluating the quality of the generated digital person, and evaluation can only rely on users' subjective impressions. In other words, in the prior art a generated digital person is evaluated mainly by human eyes, without a standardized evaluation mode; different people may hold different opinions of the generated digital person, so the accuracy and consistency of the evaluation cannot be guaranteed. This is especially true for non-professionals, whose lack of professional knowledge means the accuracy of the evaluation cannot be guaranteed, and the confidence of the evaluation result is correspondingly weak.
In order to improve on the above problem, the inventor proposes, in the embodiments of the present application, a video analysis method, a video analysis apparatus, an electronic device, and a storage medium, which accurately and effectively evaluate a generated second video based on the distance between the first key points in the first video and the corresponding second key points in the second video.
The following describes in detail a video analysis method, an apparatus, an electronic device, and a storage medium provided in embodiments of the present application with specific embodiments.
First embodiment
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a video analysis method according to an embodiment of the present disclosure, where the method may include steps S110 to S150.
Step S110: the method comprises the steps of obtaining a first video and a first voice corresponding to the first video.
The embodiment of the application can be applied to an electronic device, which may be a smartphone, a tablet computer, or another electronic device capable of running application programs. The electronic device may acquire a first video and a first voice corresponding to the first video, where the first video may include a plurality of first frame images, each first frame image includes a plurality of first key points, and the first key points may be key points of a human face. In an embodiment of the present invention, the first key points included in the first frame image may be key points on the eyebrows, eyes, nose, mouth, and facial contour.
In some embodiments, the number of first key points in each first frame image may be 68, divided into internal key points (51 key points in total, covering the eyebrows, eyes, nose, and mouth) and contour key points (17 key points). To show the distribution of the key points more clearly, a diagram is provided as fig. 2. As can be seen from fig. 2, each eyebrow may include 5 key points uniformly sampled from its left boundary to its right boundary, 5 × 2 = 10 in total; each eye includes 6 key points, namely the left and right corners plus uniformly sampled points on the upper and lower eyelids, 6 × 2 = 12 in total; the lips include 20 key points: the outer boundary comprises 2 lip corners plus 5 uniformly sampled points on each of the upper and lower lips (12 in total), and the inner boundary comprises 2 corner points plus 3 uniformly sampled points on each lip (8 in total); the nose bridge includes 4 key points and the nose tip 5, i.e., the nose includes 9 key points; and 17 key points are uniformly sampled along the face contour. In summary, the number of first key points in the embodiment of the present invention is 68.
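For reference, the layout above can be written out as index groups. A minimal sketch assuming the widely used 68-point landmark ordering (e.g., dlib's); the exact index ranges are an assumption, not taken from the text:

```python
# Hypothetical index layout for the 68 face key points described above,
# following the common 68-point landmark convention.
FACE_KEYPOINT_GROUPS = {
    "face_contour": range(0, 17),    # 17 contour key points
    "right_eyebrow": range(17, 22),  # 5 key points per eyebrow
    "left_eyebrow": range(22, 27),   # 5 x 2 = 10 eyebrow key points
    "nose_bridge": range(27, 31),    # 4 nose-bridge key points
    "nose_tip": range(31, 36),       # 5 nose-tip key points (9 nose total)
    "right_eye": range(36, 42),      # 6 key points per eye
    "left_eye": range(42, 48),       # 6 x 2 = 12 eye key points
    "outer_lip": range(48, 60),      # 2 corners + 5 upper + 5 lower = 12
    "inner_lip": range(60, 68),      # 2 corners + 3 upper + 3 lower = 8
}

# 17 + 10 + 9 + 12 + 20 = 68 key points in total, matching the text.
assert sum(len(g) for g in FACE_KEYPOINT_GROUPS.values()) == 68
```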
In some embodiments, the first video is mainly used for providing first voice, and after the first video and the first voice corresponding to the first video are acquired, the electronic device may input the acquired first voice to the video generation model to obtain a second video, that is, step S120 is performed.
In some embodiments, the plurality of first frame images included in the first video may include a plurality of face images, which may constitute different nodding, blinking, head-shaking, or speaking actions, and the first voice included in the first video may correspond to those actions. In the embodiment of the present application, as the voice data progresses, the content of the corresponding first frame images changes; for example, the mouth of person A in first frame image A is closed at a first time, while the mouth of person A in first frame image B is open at a second time.
Step S120: and inputting the first voice into a video generation model to obtain a second video.
In some embodiments, after acquiring a first voice corresponding to a first video, the electronic device may input the first voice to a video generation model to obtain a second video. The video generation model is mainly used for generating videos based on voice, and can be obtained through training of a large amount of voice data and video data. In addition, the second video may include a plurality of second frame images, each of the second frame images including a plurality of second keypoints, the second keypoints corresponding to the first keypoints included in the first frame image. To more clearly understand the relationship between the first keypoints and the second keypoints, the following example is given. For example, person a in video a is speaking and is keeping smiling, and person B in video B generated using speech in video a may also be speaking and is keeping smiling. It can be seen that the second video is generated based on the speech of the first video, and the closer the second video and the first video are, the more the generated second video conforms to the actual needs of the user.
In other embodiments, after acquiring the first voice corresponding to the first video, the electronic device may also acquire a carrier video, then input the first voice and the carrier video together into the video generation model, and obtain the second video through the video generation model. The carrier video mainly serves to guide the digital person's nodding, blinking, expressions, illumination, and the like, where illumination may include brightness, saturation, and so on. When the second video is obtained based on the first voice, the carrier video, and the video generation model, it is ensured not only that the digital person performs actions such as nodding and blinking but also that the digital person has expressions, so that the finally generated second video better meets the actual needs of the user to a certain extent.
In other embodiments, in order to make the finally generated second video more realistic, the electronic device may take illumination, including brightness, saturation, and the like, into account when generating the second video with the aid of the carrier video. In addition, when acquiring the carrier video, the electronic device may determine the brightness and saturation corresponding to the carrier video, then determine whether the brightness is greater than a brightness threshold; if so, determine whether the saturation is greater than a saturation threshold; and if so, input the carrier video and the first voice into the video generation model to generate the second video. In this way the finally obtained second video better meets the actual needs of the user; in practice, a video with higher brightness and saturation can improve the user's mood.
In addition, when the brightness corresponding to the carrier video is smaller than the brightness threshold, the electronic device can perform video processing on the carrier video, that is, the brightness of the carrier video is increased.
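A hedged sketch of this carrier-video screening, assuming per-frame averaging of the HSV saturation and value (brightness) channels with OpenCV; the threshold values are illustrative assumptions:

```python
import cv2
import numpy as np

def screen_carrier_video(frames, brightness_thresh=100.0, saturation_thresh=60.0):
    """Return True if the carrier video passes the brightness check first
    and then the saturation check; thresholds are assumed values."""
    hsv_frames = [cv2.cvtColor(f, cv2.COLOR_BGR2HSV) for f in frames]
    brightness = np.mean([h[..., 2].mean() for h in hsv_frames])  # V channel
    if brightness <= brightness_thresh:
        return False  # would instead be brightened by video processing
    saturation = np.mean([h[..., 1].mean() for h in hsv_frames])  # S channel
    return saturation > saturation_thresh
```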
Step S130: and acquiring a first distance between each first key point in the first video and each second key point in the second video.
In some implementations, after acquiring the second video, the electronic device may acquire a first distance between each first keypoint in the first video and each second keypoint in the second video. As can be understood from the above description, the first video may include a plurality of first frame images, each of which may include the first keypoint, and similarly, the second video may include a plurality of second frame images, each of which may include the second keypoint, where the second keypoint and the first keypoint are corresponding to each other. For example, the first keypoint is the left eyebrow left corner 36, and the second keypoint is the left eyebrow left corner 36, except that the first keypoint belongs to the first frame image of the first video, and the second keypoint belongs to the second frame image of the second video. For another example, if the first keypoint is the center 48 of the left mouth corner, then the second keypoint is also the center 48 of the left mouth corner, and the difference between them is that the first keypoint belongs to the first frame image of the first video, and the second keypoint belongs to the second frame image of the second video.
As one mode, after acquiring each first key point in the first video and each second key point in the second video, the electronic device may determine the position coordinates of each first key point in the first frame image and the position coordinates of each second key point in the second frame image, and then obtain the distance between each first key point and each second key point using the Euclidean distance, so as to obtain the first distance. For example, if the coordinate position of the left corner of the left eyebrow (the first key point) in the first frame image is (25, 27) and the coordinate position of the left corner of the left eyebrow (the second key point) in the second frame image is (22, 25), then the Euclidean distance between the first key point and the second key point is √(3² + 2²) = √13 ≈ 3.61. The coordinate values here are illustrative only; actual values prevail.
In some embodiments, the average of the distances between all the first keypoints in the first video and all the second keypoints in the second video may be used as the first distance, that is, the distances between the plurality of first keypoints and the plurality of second keypoints are summed and then averaged, so as to obtain the first distance; for example, if the total sum of the distances is 408, the first distance is 408 / 68 = 6. Alternatively, the electronic device may use a weighted average of the distances between all the first keypoints and the second keypoints as the first distance.
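A minimal sketch of this computation, assuming the key points of each video are given as (x, y) coordinate arrays of shape (num_frames, 68, 2); the plain unweighted mean is shown, and the weighted variant would simply supply per-point weights:

```python
import numpy as np

def mean_keypoint_distance(kpts_a: np.ndarray, kpts_b: np.ndarray) -> float:
    """Average Euclidean distance between corresponding key points.

    kpts_a, kpts_b: float arrays of shape (num_frames, 68, 2) holding the
    (x, y) position of every key point in every frame of the two videos.
    """
    per_point = np.linalg.norm(kpts_a - kpts_b, axis=-1)  # (num_frames, 68)
    return float(per_point.mean())

# Single-point check with the worked example above: the distance between
# (25, 27) and (22, 25) is sqrt(3**2 + 2**2) ~= 3.61.
print(np.linalg.norm(np.array([25, 27]) - np.array([22, 25])))  # 3.605...
```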
Step S140: and determining whether the second video meets a preset condition or not according to the first distance.
As a mode, after a first distance between each first key point in the first video and each second key point in the second video is obtained, the embodiment of the present invention may determine whether the second video meets the preset condition according to the first distance. Specifically, the electronic device may determine whether the first distance is smaller than a first distance threshold, and if the first distance is smaller than the first distance threshold, determine that the second video meets the preset condition.
Alternatively, the electronic device may also determine whether the second video meets the preset condition in combination with the first distance and the other distances. When the second video meets the preset condition, the second video is determined to be the first-level video, and then the process proceeds to step S150. If the second video does not meet the preset condition, the generated second video is not good in effect, namely the second video may be a second-level video which is a video acceptable to the user, namely the user satisfaction of the second-level video is lower than that of the first-level video, or the second video may be a third-level video which is a video unacceptable to the user, namely the user satisfaction of the third-level video is lower than that of the second-level video. Specifically, how to judge that the second video is the second-level video or the third-level video is described in detail in the following embodiments, which will not be described herein again.
Step S150: and if the second video meets the preset condition, determining that the second video is the first-level video.
In the embodiment of the application, when the electronic device determines that the second video meets the preset condition according to the first distance, the electronic device may determine that the second video is a first-level video, where the first-level video is a video that is very satisfactory to a user. For example, if the first distance is smaller than the first distance threshold, it indicates that the difference between the generated second video and the original first video is not very large, that is, it indicates that the effect of the second video generated by using the video generation model is relatively good, and at this time, it may be determined that the generated second video better meets the actual needs of the user.
The video analysis method provided by this embodiment of the application determines whether a user is satisfied with a generated second video based on the distance between the first key points and the second key points, an evaluation approach that is simple and effective. First, a first video and a first voice corresponding to the first video are acquired, where the first video comprises a plurality of first frame images and each first frame image comprises a plurality of first key points. The first voice is then input into a video generation model to obtain a second video, where the second video comprises a plurality of second frame images, each second frame image comprises a plurality of second key points, and the second key points correspond to the first key points. Next, a first distance between each first key point in the first video and each second key point in the second video is acquired, and when the second video meets a preset condition, the second video is determined to be a first-level video. According to this embodiment of the application, the second video is evaluated accurately and effectively by acquiring the distance between the first key points and the second key points.
Second embodiment
Referring to fig. 3, fig. 3 is a flowchart illustrating a video analysis method according to another embodiment of the present application, where the method may include steps S210 to S270.
Step S210: the method comprises the steps of obtaining a first video and a first voice corresponding to the first video.
Step S220: and acquiring a third video and a third voice corresponding to the third video.
As one way, the third video may be called a carrier video; its main function is to guide the generated digital person's nodding, blinking, expressions, illumination, and so on. The third video may be considered an input video that has only undergone video-level processing, e.g., for actions such as nodding or head-shaking.
In some embodiments, the electronic device may acquire a plurality of videos to be selected, determine whether the videos to be selected meet preset selection conditions, perform video processing on the videos to be selected to obtain a third video if the videos to be selected meet the preset selection conditions, and acquire actions such as nodding or shaking a head in the second video through the third video. In other embodiments, when determining whether the video to be selected meets the preset selection condition, the electronic device may determine whether the video to be selected contains a face image, and if so, determine that the video to be selected meets the preset selection condition.
In other embodiments, when it is determined that the video to be selected contains a face image, the electronic device may further determine whether the face in the video to be selected exhibits a nodding or head-shaking gesture, and if it does, perform video processing on the video to be selected to obtain the third video.
In other embodiments, when it is determined that a face in the video to be selected exhibits a nodding or head-shaking gesture, the electronic device may also count the number of nodding or head-shaking actions in the video to be selected and determine whether that number is greater than a number threshold; if it is, the electronic device performs video processing on the video to be selected to obtain the third video.
Step S230: and inputting the third video and the third voice into a video generation network to obtain a video generation model.
In some embodiments, after acquiring the third video and the third voice corresponding to the third video, the electronic device may input the third video and the third voice to a video generation network, so as to obtain a video generation model. As a mode, the electronic device may train the video generation network by using a plurality of third videos and third voices corresponding to each third video to obtain a video generation model, where the video generation model may generate a corresponding second video based on the input voices, and a mode of generating the second video by using the video generation model is simple and more intelligent.
Step S240: and inputting the first voice into a video generation model to obtain a second video.
In the embodiment of the invention, the electronic equipment can input the first voice into the video generation model to obtain the second video, and can also simultaneously input the first voice and the third video into the video generation model, so that the finally obtained second video can be more accurate by combining the first voice and the third video.
Step S250: and acquiring a first distance between each first key point in the first video and each second key point in the second video.
Step S260: and determining whether the second video meets a preset condition or not according to the first distance.
Step S270: and if the second video meets the preset condition, determining that the second video is the first-level video.
The video analysis method provided by this embodiment of the application determines whether a user is satisfied with a generated second video based on the distance between the first key points and the second key points, an evaluation approach that is simple and effective. First, a first video and a first voice corresponding to the first video are acquired, where the first video comprises a plurality of first frame images and each first frame image comprises a plurality of first key points. The first voice is then input into a video generation model to obtain a second video, where the second video comprises a plurality of second frame images, each second frame image comprises a plurality of second key points, and the second key points correspond to the first key points. Next, a first distance between each first key point in the first video and each second key point in the second video is acquired, and when the second video meets a preset condition, the second video is determined to be a first-level video. According to this embodiment of the application, the second video is evaluated accurately and effectively by acquiring the distance between the first key points and the second key points. In addition, training the video generation network with the third video makes the finally obtained second video more accurate.
Third embodiment
Referring to fig. 4, fig. 4 is a flowchart illustrating a video analysis method according to another embodiment of the present application, where the method may include steps S310 to S390.
Step S310: the method comprises the steps of obtaining a first video and a first voice corresponding to the first video.
Step S320: and acquiring a third video and a third voice corresponding to the third video.
Step S330: and inputting the third video and the third voice into a video generation network to obtain a video generation model.
Step S340: and inputting the first voice into a video generation model to obtain a second video.
Step S350: and acquiring a first distance between each first key point in the first video and each second key point in the second video.
In some embodiments, in order to more accurately evaluate the generated second video (digital human video), the electronic device may also obtain a second distance, and determine whether the generated second video meets a preset condition by combining the first distance and the second distance, where the second distance is a distance between each third key point in the third video and each fourth key point in the fourth video, as described below.
Step S360: and inputting the third voice into the video generation model to obtain a fourth video.
It can be known from the above description that the third video is a carrier video whose main function is to guide the digital person's nodding, blinking, expressions, illumination, and the like; that is, the nodding, blinking, expressions, illumination, and so on of the digital person in the second video are all acquired on the basis of the third video. In order to better evaluate the generated second video, in the embodiment of the present invention, after the second video is acquired, the third voice corresponding to the third video may be input into the video generation model to obtain a fourth video. The fourth video is therefore generated from the third video and the third voice, whose voice and video both come from the third video, and the second video can be better evaluated by measuring the difference between the fourth video and the third video.
Step S370: obtaining a second distance between each third key point in the third video and each fourth key point in the fourth video
After the fourth video is acquired by using the third voice, the electronic device in the embodiment of the invention can acquire the distance between each third key point in the third video and each fourth key point in the fourth video. The third video may include a plurality of third frame images each of which may include the third keypoint, and similarly, the fourth video may include a plurality of fourth frame images each of which may include the fourth keypoint, the fourth keypoint and the third keypoint corresponding to each other. For example, if the third keypoint is the right eyebrow right corner 45, the fourth keypoint is also the right eyebrow right corner 45, and the difference between the third keypoint and the fourth keypoint is that the third keypoint belongs to the third frame image of the third video and the fourth keypoint belongs to the fourth frame image of the fourth video. For another example, if the third key point is the right mouth corner center 54, the fourth key point is also the right mouth corner center 54, and the difference between them is that the third key point belongs to the third frame image of the third video, and the fourth key point belongs to the fourth frame image of the fourth video.
As one mode, after acquiring each third key point in the third video and each fourth key point in the fourth video, the electronic device may determine the position coordinates of each third key point in the third frame image and the position coordinates of each fourth key point in the fourth frame image, and then obtain the distance between each third key point and each fourth key point using the Euclidean distance, so as to obtain the second distance. For example, if the coordinate position of the right eyebrow right corner (the third key point) in the third frame image is (25, 27) and the coordinate position of the right eyebrow right corner (the fourth key point) in the fourth frame image is (21, 29), then the Euclidean distance between the third key point and the fourth key point is √(4² + 2²) = √20 ≈ 4.47. The coordinate values here are illustrative only; actual values prevail.
In some embodiments, the average of the distances between all the third keypoints in the third video and all the fourth keypoints in the fourth video may be used as the second distance, i.e., the distances between the plurality of third keypoints and the plurality of fourth keypoints are summed and then averaged, so as to obtain the second distance; for example, if the sum of the distances is 408, the second distance is 408 / 68 = 6. Alternatively, the electronic device may use a weighted average of the distances between all the third key points and the fourth key points as the second distance.
Step S380: and determining whether the second video meets a preset condition according to the first distance and the second distance.
As one mode, after acquiring the first distance and the second distance, the electronic device may determine whether the second video meets the preset condition according to the first distance and the second distance. Specifically, referring to fig. 5, step S380 may include steps S381 to S382.
Step S381: and obtaining a difference value of the first distance and the second distance to obtain a distance difference value.
In some embodiments, the electronic device may obtain a difference between the first distance and the second distance to obtain a distance difference, and then determine whether the second video meets the preset condition according to the distance difference, i.e., enter step S382. Specifically, the electronic device may determine whether the distance difference is smaller than a difference threshold; if it is, the second video meets the preset condition, and if it is not, the second video does not meet the preset condition.
Step S382: and determining whether the second video meets a preset condition or not according to the distance difference.
As described above, the electronic device may determine whether the second video meets the preset condition according to the distance difference between the first distance and the second distance, where the first distance may be the average distance between key points in the first video and the second video, and the second distance may be the average distance between key points in the third video and the fourth video. After obtaining the distance difference between the first distance and the second distance, the electronic device may determine whether the distance difference is smaller than a difference threshold. If the distance difference is smaller than the difference threshold, the second video meets the preset condition, that is, the second video is a video the user is very satisfied with; if the distance difference is greater than or equal to the difference threshold, the electronic device further determines which level the second video belongs to.
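A minimal sketch of this distance-difference check; the difference threshold is an assumed illustrative value, not given in the text:

```python
def meets_preset_condition(first_distance: float, second_distance: float,
                           diff_thresh: float = 0.5) -> bool:
    """The second video meets the preset condition when the distance
    difference falls below the difference threshold (assumed value)."""
    return (first_distance - second_distance) < diff_thresh
```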
Step S390: and if the second video meets the preset condition, determining that the second video is the first-level video.
The video analysis method provided by this embodiment of the application determines whether a user is satisfied with a generated second video based on the distance between the first key points and the second key points, an evaluation approach that is simple and effective. First, a first video and a first voice corresponding to the first video are acquired, where the first video comprises a plurality of first frame images and each first frame image comprises a plurality of first key points. The first voice is then input into a video generation model to obtain a second video, where the second video comprises a plurality of second frame images, each second frame image comprises a plurality of second key points, and the second key points correspond to the first key points. Next, a first distance between each first key point in the first video and each second key point in the second video is acquired, and when the second video meets a preset condition, the second video is determined to be a first-level video. According to this embodiment of the application, the second video is evaluated accurately and effectively by acquiring the distance between the first key points and the second key points. In addition, by introducing the second distance and the distance difference between the first distance and the second distance, the electronic device can evaluate the second video more accurately.
Fourth embodiment
Referring to fig. 6, fig. 6 is a flowchart illustrating a video analysis method according to another embodiment of the present application, where the method may include steps S401 to S411.
Step S401: the method comprises the steps of obtaining a first video and a first voice corresponding to the first video.
Step S402: and acquiring a third video and a third voice corresponding to the third video.
Step S403: and inputting the third video and the third voice into a video generation network to obtain a video generation model.
Step S404: and inputting the first voice into a video generation model to obtain a second video.
Step S405: and acquiring a first distance between each first key point in the first video and each second key point in the second video.
Step S406: and inputting the third voice into the video generation model to obtain a fourth video.
Step S407: and acquiring a second distance between each third key point in the third video and each fourth key point in the fourth video.
Step S408: and obtaining a difference value of the first distance and the second distance to obtain a distance difference value.
In some embodiments, the electronic device obtains a first distance between a first video key point and a second video key point, and obtains a second distance between a third video key point and a fourth video key point, and then may obtain a difference between the first distance and the second distance to obtain a distance difference. Wherein, the first video may be referred to as a voice providing video; the second video may be referred to as a speech-generated video, and the input speech may be the speech of the first video while the input video may be the third video when the second video is generated; the third video can be called a carrier video and is used for training the action required by the generation of the second video; the fourth video may be referred to as carrier generation video, and the input voice thereof is the voice of the third video and the input video thereof is the third video at the time of generating the fourth video.
As another mode, before obtaining the distance difference, the electronic device may also input the first video and the first voice obtained by the electronic device to the video generation model at the same time to obtain a fifth video, and then obtain the distance between each key point in the fifth video and each key point in the first video to obtain a third distance. On this basis, the electronic device may obtain an average value of the second distance and the third distance to obtain a target distance, and use a difference value between the first distance and the target distance as a distance difference value, and then obtain a ratio of the distance difference value to the target distance as a target parameter. How to obtain the target parameters is not specifically limited, and the target parameters can be selected according to actual conditions.
Step S409: and acquiring the ratio of the distance difference to the second distance to obtain a target parameter.
In some embodiments, after acquiring the difference between the first distance and the second distance, the electronic device may continue to acquire the ratio of the distance difference to the second distance to obtain the target parameter. For example, if the first distance is 5 and the second distance is 4.98, the distance difference is the first distance minus the second distance, 5 - 4.98 = 0.02; the ratio of the distance difference to the second distance is then obtained, giving a target parameter of 0.02 / 4.98 ≈ 0.004.
Step S410: and determining whether the second video meets a preset condition or not according to the target parameter.
As a mode, after the target parameter is obtained, the electronic device may determine whether the second video meets a preset condition according to the target parameter, that is, the electronic device may determine whether the target parameter is smaller than a ratio threshold, and if so, determine that the second video meets the preset condition. In this embodiment of the present invention, the ratio threshold may be a first preset threshold, and the first preset threshold may be 0.05. As in the above example, the target parameter is 0.004, and it can be seen that the target parameter 0.004 is smaller than the first preset threshold value 0.05, and it may be determined that the second video meets the preset condition.
Step S411: and if the second video meets the preset condition, determining that the second video is the first-level video.
The video analysis method provided by this embodiment of the application determines whether a user is satisfied with a generated second video based on the distance between the first key points and the second key points, an evaluation approach that is simple and effective. First, a first video and a first voice corresponding to the first video are acquired, where the first video comprises a plurality of first frame images and each first frame image comprises a plurality of first key points. The first voice is then input into a video generation model to obtain a second video, where the second video comprises a plurality of second frame images, each second frame image comprises a plurality of second key points, and the second key points correspond to the first key points. Next, a first distance between each first key point in the first video and each second key point in the second video is acquired, and when the second video meets a preset condition, the second video is determined to be a first-level video. According to this embodiment of the application, the second video is evaluated accurately and effectively by acquiring the distance between the first key points and the second key points. In addition, by acquiring the ratio of the distance difference to the second distance, the electronic device can evaluate the second video more accurately.
Fifth embodiment
Referring to fig. 7, fig. 7 is a flowchart illustrating a video analysis method according to another embodiment of the present application, where the method may include steps S510 to S560.
Step S510: the method comprises the steps of obtaining a first video and a first voice corresponding to the first video.
Step S520: and inputting the first voice into a video generation model to obtain a second video.
Step S530: and acquiring a first distance between each first key point in the first video and each second key point in the second video.
As can be known from the above description, after the electronic device acquires the first distance, it may continue to acquire the second distance, the third distance, and the like, and then may acquire the target parameter according to the acquired first distance, second distance, and third distance. On the basis, the electronic device may determine whether the acquired target parameter meets a preset condition, that is, determine whether the target parameter is smaller than a first preset threshold, that is, enter step S540.
Step S540: it is determined whether the target parameter is less than a first preset threshold.
In one embodiment, the first preset threshold may be set according to an empirical value, may be determined according to the second video generation process, or may be determined according to the number of updates of the video generation model, or the like. In a specific embodiment, the video generation model is updated once, and the corresponding first preset threshold may be updated once, that is, the first preset threshold may be updated correspondingly according to a weight parameter ratio of the video generation model, where the weight parameter ratio of the video generation model may be a ratio between a weight parameter of a latest model and a weight parameter of a previous model, and after obtaining the weight parameter ratio, the electronic device may multiply the weight parameter ratio by the first preset threshold, so as to obtain a new first preset threshold.
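As a sketch of the update rule just described, with hypothetical names; the weight parameter ratio is whatever scalar the deployment derives from the two model versions:

```python
def update_first_threshold(threshold: float, w_latest: float, w_previous: float) -> float:
    """Scale the first preset threshold by the ratio of the latest model's
    weight parameter to the previous model's (names are hypothetical)."""
    return threshold * (w_latest / w_previous)
```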
Step S550: and if the target parameter is smaller than a first preset threshold value, determining that the second video meets a preset condition.
In some embodiments, when it is determined that the target parameter is smaller than the first preset threshold, it indicates that the second video meets the preset condition, that is, the second video generated by using the first voice and the third video is the first-level video, that is, the generated second video is a video that is satisfactory for the user.
In other embodiments, if the target parameter is greater than or equal to a first preset threshold, the electronic device may continue to determine whether the target parameter is less than a second preset threshold, and if the target parameter is less than the second preset threshold, determine that the second video is a second-level video, where the user satisfaction of the second-level video is lower than the user satisfaction of the first-level video, and the second-level video is a user-acceptable video.
In other embodiments, the second video is determined to be a third-level video if the target parameter is greater than or equal to a second preset threshold, wherein the user satisfaction of the third-level video is lower than the user satisfaction of the second-level video, and the third-level video is a video that is not acceptable to the user. In the embodiment of the present invention, the second preset threshold is similar to the first preset threshold, and may be set according to an empirical value, or may be set according to an actual situation of video generation, and how to set the second preset threshold is not specifically limited herein. In addition, the first preset threshold may be set to 0.05, and the second preset threshold may be set to 0.1.
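Combining the target-parameter computation of the fourth embodiment with the grading thresholds above, a minimal sketch; the 0.05 / 0.1 defaults follow the values given in the text, while the function name and return labels are illustrative:

```python
def grade_second_video(first_distance: float, second_distance: float,
                       t1: float = 0.05, t2: float = 0.1) -> str:
    """Grade the generated second video from the two key-point distances."""
    target = (first_distance - second_distance) / second_distance
    if target < t1:
        return "first-level"   # meets the preset condition
    if target < t2:
        return "second-level"  # acceptable, but lower user satisfaction
    return "third-level"       # not acceptable to the user

# With the worked numbers from the fourth embodiment:
# (5 - 4.98) / 4.98 ~= 0.004 < 0.05, so the second video is first-level.
print(grade_second_video(5.0, 4.98))  # "first-level"
```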
Step S560: and if the second video meets the preset condition, determining that the second video is the first-level video.
The video analysis method provided by this embodiment of the application determines whether a user is satisfied with a generated second video based on the distance between the first key points and the second key points, an evaluation approach that is simple and effective. First, a first video and a first voice corresponding to the first video are acquired, where the first video comprises a plurality of first frame images and each first frame image comprises a plurality of first key points. The first voice is then input into a video generation model to obtain a second video, where the second video comprises a plurality of second frame images, each second frame image comprises a plurality of second key points, and the second key points correspond to the first key points. Next, a first distance between each first key point in the first video and each second key point in the second video is acquired, and when the second video meets a preset condition, the second video is determined to be a first-level video. According to this embodiment of the application, the second video is evaluated accurately and effectively by acquiring the distance between the first key points and the second key points. In addition, this embodiment of the invention determines the grade of the second video by introducing the first preset threshold and the second preset threshold, a judgment method that is simple and easy to implement.
Sixth embodiment
Referring to fig. 8, fig. 8 is a flowchart illustrating a video analysis method according to another embodiment of the present application, where the method may include steps S610 to S670.
Step S610: the method comprises the steps of obtaining a first video and a first voice corresponding to the first video.
Step S620: and inputting the first voice into a video generation model to obtain a candidate video.
In some embodiments, in order to make the finally acquired second video more accurate, after the electronic device inputs the first voice and the carrier video (third video) into the video generation model to obtain the candidate video, it may determine whether the candidate video includes a face image, i.e., proceed to step S630.
Step S630: determining whether the candidate video comprises a face image.
As one mode, after acquiring the candidate video, the electronic device may determine whether the candidate video includes a face image, and if it does, the electronic device takes the candidate video as the second video, that is, proceeds to step S640.
As another way, in this embodiment of the application, the candidate video may be sampled first, and it is then determined whether the candidate frame images obtained by sampling include face images. If they do, the number of candidate frame images that include a face image is obtained, and it is determined whether that number is greater than a number threshold; if it is, the candidate video is determined to include a face image.
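As an illustration of this sampling check, the following sketch samples frames and counts those containing a face. It is a minimal example under stated assumptions: the embodiment does not prescribe a particular detector, so the OpenCV Haar-cascade face detector, the sampling step, and the count threshold used here are all illustrative choices.

```python
import cv2

def candidate_contains_face(video_path: str,
                            sample_step: int = 10,
                            count_threshold: int = 5) -> bool:
    """Sample every sample_step-th frame and report whether enough frames contain a face."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    capture = cv2.VideoCapture(video_path)
    face_frames = 0
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of video
            break
        if index % sample_step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if len(detector.detectMultiScale(gray)) > 0:
                face_frames += 1
        index += 1
    capture.release()
    # The candidate video is considered to include a face image when the
    # number of sampled face frames exceeds the number threshold.
    return face_frames > count_threshold
```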
In other embodiments, when it is determined that the candidate video includes a face image, the electronic device may also determine, according to the face image in the candidate video, a gender of a face in the candidate video, determine whether the gender is the same as the gender of the face in the third video, and if the gender is the same as the gender, identify the candidate video as the second video.
In other embodiments, when the gender of the face in the candidate video is the same as the gender of the face in the third video, the electronic device may also determine, according to the face image in the candidate video, whether the age stage, nationality, and the like of the face in the candidate video are the same as those of the face in the third video. If they are the same, the candidate video is taken as the second video.
Step S640: and if the candidate video comprises the face image, taking the candidate video as a second video.
Step S650: and acquiring a first distance between each first key point in the first video and each second key point in the second video.
Step S660: and determining whether the second video meets a preset condition or not according to the first distance.
Step S670: and if the second video meets the preset condition, determining that the second video is the first-level video.
The video analysis method provided by this embodiment of the application determines whether a user is satisfied with the generated second video according to the distance between the first key points and the second key points, and the evaluation mode is simple and effective. First, a first video and a first voice corresponding to the first video are acquired, where the first video includes a plurality of first frame images and each first frame image includes a plurality of first key points. The first voice is then input into a video generation model to obtain a second video, where the second video includes a plurality of second frame images, each second frame image includes a plurality of second key points, and the second key points correspond to the first key points. Next, a first distance between each first key point in the first video and each second key point in the second video is acquired, and the second video is determined to be a first-level video when it meets a preset condition. In this way, the second video is accurately and effectively evaluated by means of the distance between the first key points and the second key points. In addition, by checking whether the candidate video includes a face image, this embodiment makes the finally generated second video more accurate.
Seventh embodiment
Referring to fig. 9, fig. 9 is a flowchart illustrating a video analysis method according to another embodiment of the present application, where the method may include steps S710 to S760.
Step S710: the method comprises the steps of obtaining a first video and a first voice corresponding to the first video.
Step S720: and inputting the first voice into a video generation model to obtain a second video.
Step S730: finding a plurality of first mouth keypoints among the plurality of first keypoints, and finding a plurality of second mouth keypoints among the plurality of second keypoints.
In some embodiments, after inputting the first voice and the carrier video (the third video) into the video generation model to obtain the second video, the electronic device may find a plurality of first mouth key points among the plurality of first key points in the first video, and a plurality of second mouth key points among the plurality of second key points in the second video. The mouth key points may be as shown in fig. 2; as can be seen from fig. 2, the mouth key points include key points 48 to 67.
Step S740: a first distance between each of the first mouth key points in the first video and each of the second mouth key points in the second video is obtained.
In this embodiment of the application, the electronic device may obtain a first distance between each first mouth key point in the first video and each second mouth key point in the second video, and then determine whether the second video meets the preset condition according to the first distance, that is, enter step S750. As one mode, after obtaining the distance between each first mouth key point and the corresponding second mouth key point, the electronic device may sum the distances of all the mouth key points and divide the sum by the total number of mouth key points to obtain an average key point distance value, which is used as the first distance. The first mouth key points and the second mouth key points correspond to each other.
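A sketch of the averaging just described is given below, assuming the key points follow the common 68-point facial-landmark layout in which indices 48 to 67 are the mouth points (consistent with fig. 2). The array shapes and names are assumptions for illustration.

```python
import numpy as np

MOUTH = slice(48, 68)  # mouth key points 48-67 in the 68-point layout

def mean_mouth_distance(first_keypoints: np.ndarray,
                        second_keypoints: np.ndarray) -> float:
    """Average Euclidean distance between corresponding mouth key points.

    Both inputs are assumed to be aligned arrays of shape
    (num_frames, 68, 2) holding (x, y) landmark coordinates.
    """
    diffs = first_keypoints[:, MOUTH, :] - second_keypoints[:, MOUTH, :]
    distances = np.linalg.norm(diffs, axis=-1)  # distance per mouth key point
    # Sum of all distances divided by the total number of mouth key points.
    return float(distances.mean())
```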
Step S750: and determining whether the second video meets a preset condition or not according to the first distance.
Step S760: and if the second video meets the preset condition, determining that the second video is the first-level video.
The scheme for determining whether the second video meets the preset condition through the mouth key points is similar to the scheme that uses all the key points; steps S750 and S760 have been described in detail in the foregoing embodiments and are not repeated here.
The video analysis method provided by this embodiment of the application determines whether a user is satisfied with the generated second video according to the distance between the first key points and the second key points, and the evaluation mode is simple and effective. First, a first video and a first voice corresponding to the first video are acquired, where the first video includes a plurality of first frame images and each first frame image includes a plurality of first key points. The first voice is then input into a video generation model to obtain a second video, where the second video includes a plurality of second frame images, each second frame image includes a plurality of second key points, and the second key points correspond to the first key points. Next, a first distance between each first key point in the first video and each second key point in the second video is acquired, and the second video is determined to be a first-level video when it meets a preset condition. In this way, the second video is accurately and effectively evaluated by means of the distance between the first key points and the second key points. In addition, this embodiment evaluates the second video using only the mouth key points, which can improve the efficiency of the video analysis to a certain extent.
Eighth embodiment
Referring to fig. 10, fig. 10 is a block diagram illustrating a video analysis apparatus according to an embodiment of the present application. As explained below with respect to the block diagram of fig. 10, the video analysis apparatus 800 includes: a first obtaining module 810, a second obtaining module 820, a third obtaining module 830, a condition determining module 840, and a video determining module 850.
A first obtaining module 810, configured to obtain a first video and a first voice corresponding to the first video, where the first video includes a plurality of first frame images, and each of the first frame images includes a plurality of first key points.
A second obtaining module 820, configured to input the first voice into a video generation model to obtain a second video, where the second video includes a plurality of second frame images, each of the second frame images includes a plurality of second key points, and the second key points correspond to the first key points.
Further, the second obtaining module 820 is further configured to input the first voice to a video generation model to obtain a candidate video; determining whether the candidate video contains a face image; and if the candidate video comprises the face image, taking the candidate video as a second video.
Further, the second obtaining module 820 is further configured to determine that the second video is failed to be generated if the candidate video does not include the face image.
A third obtaining module 830, configured to obtain a first distance between each first key point in the first video and each second key point in the second video.
Further, the third obtaining module 830 is further configured to find a plurality of first mouth key points among the plurality of first key points, and find a plurality of second mouth key points among the plurality of second key points, where each first mouth key point and each second mouth key point correspond to each other; a first distance between each of the first mouth key points in the first video and each of the second mouth key points in the second video is obtained.
A condition determining module 840, configured to determine whether the second video meets a preset condition according to the first distance.
Further, the condition determining module 840 is further configured to input the third voice to the video generation model to obtain a fourth video; acquiring a second distance between each third key point in the third video and each fourth key point in the fourth video; and determining whether the second video meets a preset condition according to the first distance and the second distance.
Further, the condition determining module 840 is further configured to obtain a difference between the first distance and the second distance to obtain a distance difference; and determining whether the second video meets a preset condition or not according to the distance difference.
Further, the condition determining module 840 is further configured to obtain a ratio of the distance difference to the second distance to obtain a target parameter; and determining whether the second video meets a preset condition or not according to the target parameter.
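The distance difference and target parameter handled by the condition determining module 840 amount to a relative-gap computation. A minimal sketch under assumed names follows; it presumes the second distance is positive, and its output could feed a grading function such as the grade_video sketch given earlier.

```python
def target_parameter(first_distance: float, second_distance: float) -> float:
    """Relative gap between the first and second distances (assumes second_distance > 0)."""
    distance_difference = first_distance - second_distance
    return distance_difference / second_distance
```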
Further, the condition determining module 840 is further configured to determine whether the target parameter is smaller than a first preset threshold; and if the target parameter is smaller than a first preset threshold value, determining that the second video meets a preset condition.
Further, the condition determining module 840 is further configured to determine whether the target parameter is smaller than a second preset threshold if the target parameter is greater than or equal to a first preset threshold; and if the target parameter is smaller than a second preset threshold value, determining that the second video is a second-level video, wherein the user satisfaction of the second-level video is lower than that of the first-level video.
Further, the condition determining module 840 is further configured to determine that the second video is a third-level video if the target parameter is greater than or equal to a second preset threshold, where user satisfaction of the third-level video is lower than user satisfaction of the second-level video. The first preset threshold is 0.05, and the second preset threshold is 0.1.
The video determining module 850 is configured to determine that the second video is the first-level video if the preset condition is met.
Further, before the first voice is input to the video generation model to obtain the second video, the apparatus 800 is further configured to obtain a third video and a third voice corresponding to the third video; and inputting the third video and the third voice into a video generation network to obtain a video generation model.
The video analysis apparatus 800 provided in this embodiment of the present application is used to implement the corresponding video analysis method in the foregoing method embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
It can be clearly understood by those skilled in the art that the video analysis apparatus 800 provided in the embodiment of the present application can implement each process in the foregoing method embodiments, and for convenience and brevity of description, the specific working processes of the apparatus 800 and the modules described above may refer to corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, the coupling, direct coupling, or communication connection between the modules shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between the apparatus 800 or the modules may be in an electrical, mechanical, or other form.
In addition, each functional module in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Ninth embodiment
Referring to fig. 11, a block diagram of an electronic device 1000 according to an embodiment of the present application is shown. The electronic device 1000 may be an electronic device capable of running an application, such as a smart phone or a tablet computer. The electronic device 1000 in the present application may include one or more of the following components: a processor 1010, a memory 1020, and one or more applications, where the one or more applications may be stored in the memory 1020 and configured to be executed by the one or more processors 1010, and the one or more applications are configured to perform the method described in the foregoing method embodiments.
Processor 1010 may include one or more processing cores. The processor 1010 connects various parts of the electronic device 1000 using various interfaces and circuitry, and performs the various functions of the electronic device 1000 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 1020 and invoking data stored in the memory 1020. Optionally, the processor 1010 may be implemented in hardware in at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA) forms. The processor 1010 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 1010 but be implemented by a separate communication chip.
The memory 1020 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 1020 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1020 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the electronic device 1000 during use (such as a phone book, audio and video data, and chat log data), and the like.
Tenth embodiment
Referring to fig. 12, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 1100 has stored therein program code that can be invoked by a processor to perform the methods described in the method embodiments above.
The computer-readable storage medium 1100 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 1100 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 1100 has storage space for program code 1110 for performing any of the method steps described above. The program code can be read from or written to one or more computer program products. The program code 1110 may, for example, be compressed in a suitable form.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features thereof may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions in the embodiments of the present application.

Claims (15)

1. A method of video analysis, the method comprising:
acquiring a first video and a first voice corresponding to the first video, wherein the first video comprises a plurality of first frame images, and each first frame image comprises a plurality of first key points;
inputting the first voice into a video generation model to obtain a second video, wherein the second video comprises a plurality of second frame images, each second frame image comprises a plurality of second key points, and the second key points correspond to the first key points;
acquiring a first distance between each first key point in the first video and each second key point in the second video;
determining whether the second video meets a preset condition or not according to the first distance;
and if the second video meets the preset condition, determining that the second video is the first-level video.
2. The method of claim 1, wherein before inputting the first speech into a video generation model to obtain a second video, the method comprises:
acquiring a third video and a third voice corresponding to the third video;
and inputting the third video and the third voice into a video generation network to obtain a video generation model.
3. The method of claim 2, wherein the determining whether the second video meets a preset condition according to the first distance comprises:
inputting the third voice into the video generation model to obtain a fourth video;
acquiring a second distance between each third key point in the third video and each fourth key point in the fourth video;
and determining whether the second video meets a preset condition according to the first distance and the second distance.
4. The method of claim 3, wherein the determining whether the second video meets a preset condition according to the first distance and the second distance comprises:
obtaining a difference value between the first distance and the second distance to obtain a distance difference value;
and determining whether the second video meets a preset condition or not according to the distance difference.
5. The method of claim 4, wherein the determining whether the second video meets a preset condition according to the distance difference comprises:
obtaining the ratio of the distance difference to the second distance to obtain a target parameter;
and determining whether the second video meets a preset condition or not according to the target parameter.
6. The method of claim 5, wherein the determining whether the second video meets a preset condition according to the target parameter comprises:
determining whether the target parameter is smaller than a first preset threshold value;
and if the target parameter is smaller than a first preset threshold value, determining that the second video meets a preset condition.
7. The method of claim 6, further comprising:
if the target parameter is greater than or equal to a first preset threshold, determining whether the target parameter is smaller than a second preset threshold;
and if the target parameter is smaller than a second preset threshold value, determining that the second video is a second-level video, wherein the user satisfaction of the second-level video is lower than that of the first-level video.
8. The method of claim 7, further comprising:
and if the target parameter is greater than or equal to a second preset threshold value, determining that the second video is a third-level video, wherein the user satisfaction of the third-level video is lower than that of the second-level video.
9. The method according to claim 7 or 8, wherein the first preset threshold is 0.05 and the second preset threshold is 0.1.
10. The method of claim 1, wherein inputting the first speech into a video generation model to obtain a second video comprises:
inputting the first voice into a video generation model to obtain a candidate video;
determining whether the candidate video contains a face image;
and if the candidate video comprises the face image, taking the candidate video as a second video.
11. The method of claim 10, further comprising:
and if the candidate video does not contain the face image, determining that the second video is failed to generate.
12. The method according to any one of claims 1 to 8, wherein the obtaining a first distance between each first keypoint in the first video and each second keypoint in the second video comprises:
searching a plurality of first mouth key points in the plurality of first key points, and searching a plurality of second mouth key points in the plurality of second key points, wherein each first mouth key point and each second mouth key point correspond to each other;
a first distance between each of the first mouth key points in the first video and each of the second mouth key points in the second video is obtained.
13. A video analysis apparatus, characterized in that the apparatus comprises:
the device comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is used for acquiring a first video and a first voice corresponding to the first video, the first video comprises a plurality of first frame images, and each first frame image comprises a plurality of first key points;
a second obtaining module, configured to input the first voice to a video generation model to obtain a second video, where the second video includes a plurality of second frame images, each of the second frame images includes a plurality of second key points, and the second key points correspond to the first key points;
a third obtaining module, configured to obtain a first distance between each first key point in the first video and each second key point in the second video;
the condition determining module is used for determining whether the second video meets a preset condition according to the first distance;
and the video determining module is used for determining that the second video is the first-level video if the second video meets the preset condition.
14. An electronic device, comprising:
one or more processors;
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-12.
15. A computer-readable storage medium, characterized in that a program code is stored in the computer-readable storage medium, which program code can be called by a processor to perform the method according to any of claims 1-12.
CN202011506833.7A 2020-12-18 2020-12-18 Video analysis method and device, electronic equipment and storage medium Pending CN112633129A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011506833.7A CN112633129A (en) 2020-12-18 2020-12-18 Video analysis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011506833.7A CN112633129A (en) 2020-12-18 2020-12-18 Video analysis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112633129A true CN112633129A (en) 2021-04-09

Family

ID=75317295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011506833.7A Pending CN112633129A (en) 2020-12-18 2020-12-18 Video analysis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112633129A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106653052A (en) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 Virtual human face animation generation method and device
WO2018107914A1 (en) * 2016-12-16 2018-06-21 中兴通讯股份有限公司 Video analysis platform, matching method, and accurate advertisement push method and system
WO2018192406A1 (en) * 2017-04-20 2018-10-25 腾讯科技(深圳)有限公司 Identity authentication method and apparatus, and storage medium
CN110880198A (en) * 2018-09-06 2020-03-13 百度在线网络技术(北京)有限公司 Animation generation method and device
CN110942501A (en) * 2019-11-27 2020-03-31 深圳追一科技有限公司 Virtual image switching method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111432267B (en) Video adjusting method and device, electronic equipment and storage medium
US9449256B1 (en) Providing image candidates based on diverse adjustments to an image
CN110555896B (en) Image generation method and device and storage medium
CN111968248B (en) Intelligent cosmetic method and device based on virtual image, electronic equipment and storage medium
CN108198130B (en) Image processing method, image processing device, storage medium and electronic equipment
CN111508079A (en) Virtual clothing fitting method and device, terminal equipment and storage medium
CN110969682B (en) Virtual image switching method and device, electronic equipment and storage medium
CN109952594A (en) Image processing method, device, terminal and storage medium
US20120007859A1 (en) Method and apparatus for generating face animation in computer system
US10650564B1 (en) Method of generating 3D facial model for an avatar and related device
CN112750186A (en) Virtual image switching method and device, electronic equipment and storage medium
WO2022257766A1 (en) Image processing method and apparatus, device, and medium
JP2021068404A (en) Facial expression generation system for avatar and facial expression generation method for avatar
CN114429767A (en) Video generation method and device, electronic equipment and storage medium
Marin et al. The effect of latent space dimension on the quality of synthesized human face images
US11361467B2 (en) Pose selection and animation of characters using video data and training techniques
CN113392769A (en) Face image synthesis method and device, electronic equipment and storage medium
CN112633136B (en) Video analysis method, device, electronic equipment and storage medium
CN111966671A (en) Digital human training data cleaning method and device, electronic equipment and storage medium
CN108334821B (en) Image processing method and electronic equipment
CN112633129A (en) Video analysis method and device, electronic equipment and storage medium
CN115690276A (en) Video generation method and device of virtual image, computer equipment and storage medium
US20220103891A1 (en) Live broadcast interaction method and apparatus, live broadcast system and electronic device
CN114219704A (en) Animation image generation method and device
KR102554442B1 (en) Face synthesis method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination