CN115909128A - Video identification method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115909128A
Authority
CN
China
Prior art keywords
target
video
opening
preset
closing ratio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211266910.5A
Other languages
Chinese (zh)
Inventor
郑雪 (Zheng Xue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202211266910.5A priority Critical patent/CN115909128A/en
Publication of CN115909128A publication Critical patent/CN115909128A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure relates to a video identification method, apparatus, electronic device, and storage medium. The method comprises: acquiring a video to be detected, wherein the video to be detected comprises a plurality of target video frames; for each target video frame, determining a plurality of target key points through a preset key point detection algorithm; determining the opening-closing ratio of the target video frame according to the coordinate information of the target key points, and determining a target opening-closing ratio sequence of the video to be detected; respectively calculating the similarity between the target opening-closing ratio sequence and each of a plurality of preset standard opening-closing ratio sequences; and, in a case where a similarity satisfying a preset similarity condition is found, determining that the video to be detected is a video of the target interaction mode. With the method and apparatus, whether the video to be detected belongs to the target interaction mode can be identified from the visual information of the video alone, by calculating how similar the opening-closing ratio sequence of the video to be detected is to the opening-closing ratio sequences of preset standard videos.

Description

Video identification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of short video technologies, and in particular, to a video identification method and apparatus, an electronic device, and a storage medium.
Background
With the development of short video technology, more and more users experience the multiple video interaction modes provided by short video platforms. One such mode is the lip-sync ("mouth-to-mouth") video interaction mode, which is popular on short video platforms; specifically, a lip-sync performance is completed by matching personal lip movements, facial expressions, and even body movements to an existing sound.
Video identification schemes in the related art include mouth-shape synchronization identification, which identifies whether the visual modality information and the audio modality information in a video express the same meaning at the same time, and video content type identification, which classifies videos based on their content. The related art has no scheme for identifying a video interaction mode, so a scheme for identifying whether a video belongs to the lip-sync video interaction mode is urgently needed.
Disclosure of Invention
The present disclosure provides a video identification method, apparatus, electronic device and storage medium, to at least solve the problem that lip-sync video works cannot be identified in the related art. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a video identification method, including:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of target video frames;
aiming at each target video frame, determining a plurality of target key points of a target part in the target video frame through a preset key point detection algorithm;
determining the opening and closing ratio of a target part in the target video frame according to the coordinate information of each target key point;
constructing a target opening-closing ratio sequence corresponding to the video to be detected according to the opening-closing ratio of the target part in each target video frame;
and in a case where it is determined that the target opening-closing ratio sequence and a plurality of standard opening-closing ratio sequences satisfy a preset similarity condition, determining that the video to be detected is a video of the target interaction mode, wherein each standard opening-closing ratio sequence is obtained by calculation based on a preset standard video, and the preset standard video is a video of the target interaction mode.
In one embodiment, the determining, by a preset keypoint detection algorithm, a plurality of target keypoints of a target portion in the target video frame includes:
determining a face region in the target video frame through a preset face region recognition algorithm, and recognizing a target part in the face region;
and determining a plurality of target key points of the target part through a preset key point identification algorithm.
In one embodiment, the determining a plurality of target keypoints for the target portion by a preset keypoint detection algorithm includes:
determining a plurality of initial key points corresponding to the target part through a preset key point identification algorithm;
determining a middle position point of a first transverse boundary of the target part as a first target key point, determining a middle position point of a second transverse boundary of the target part as a second target key point, determining a first longitudinal boundary position point of the target part as a third target key point, and determining a second longitudinal boundary position point of the target part as a fourth target key point in a plurality of initial key points corresponding to the target part;
the determining the opening and closing ratio of the target part in the target video frame according to the coordinate information of each target key point comprises the following steps:
calculating a first distance between the first target key point and the second target key point according to the coordinate information of the first target key point and the coordinate information of the second target key point, and calculating a second distance between the third target key point and the fourth target key point according to the coordinate information of the third target key point and the coordinate information of the fourth target key point;
and calculating a ratio between the first distance and the second distance, and determining the ratio as an opening and closing ratio of a target part in the target video frame.
In one embodiment, before determining that the video to be detected is a video of the target interaction mode in a case where it is determined that the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences satisfy a preset similarity condition, the method further includes:
respectively calculating the similarity between the target opening-closing ratio sequence and each standard opening-closing ratio sequence;
and under the condition that the similarity greater than or equal to a preset similarity threshold exists, determining that the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences meet a preset similarity condition.
In one embodiment, before the acquiring the video to be detected, the method includes:
acquiring audio information of an initial video and determining the audio information of the preset standard video;
and determining the initial video as the video to be detected in a case where the audio information of the initial video and the audio information of the preset standard video satisfy a preset matching condition.
In one embodiment, before determining that the video to be detected is a video in a target interaction mode, the method further includes:
aiming at each standard video frame in the preset standard video, determining a plurality of target key points of a target part in the standard video frame through a preset key point detection algorithm; calculating the opening-closing ratio of a target part in the standard video frame according to the coordinate information of each target key point;
and constructing a standard opening-closing ratio sequence corresponding to the preset standard video according to the opening-closing ratio of the target part in each standard video frame.
In one embodiment, after determining that the video to be detected is a video of the target interaction mode in a case where it is determined that the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences satisfy a preset similarity condition, the method further includes:
and adding the target opening-closing ratio sequence corresponding to the video to be detected, as a newly added standard opening-closing ratio sequence, to a preset standard opening-closing ratio sequence set, wherein the preset standard opening-closing ratio sequence set is used for storing the standard opening-closing ratio sequences calculated based on the preset standard videos and the newly added standard opening-closing ratio sequences.
In one embodiment, the acquiring a video to be detected, where the video to be detected includes a plurality of target video frames, includes:
acquiring a video to be detected comprising a plurality of initial video frames;
and for each initial video frame, performing pixel recovery on the initial video frame according to a preset pixel recovery network to obtain a target video frame.
According to a second aspect of the embodiments of the present disclosure, there is provided a video recognition apparatus including:
a first acquisition unit configured to perform acquisition of a video to be detected, the video to be detected including a plurality of target video frames;
a first determining unit, configured to execute, for each of the target video frames, determining a plurality of target key points of a target portion in the target video frame by a preset key point detection algorithm;
a second determining unit configured to determine an opening/closing ratio of a target portion in the target video frame according to the coordinate information of each target key point;
the first construction unit is configured to execute construction of a target opening-closing ratio sequence corresponding to the video to be detected according to the opening-closing ratio of a target part in each target video frame;
and a third determining unit configured to determine that the video to be detected is a video of the target interaction mode in a case where the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences satisfy a preset similarity condition, wherein each standard opening-closing ratio sequence is obtained by calculation based on a preset standard video, and the preset standard video is a video of the target interaction mode.
In one embodiment, the first determining unit includes:
a first determining subunit configured to determine a face region in the target video frame through a preset face region recognition algorithm, and to recognize a target portion in the face region;
a second determining subunit configured to execute determining a plurality of target keypoints of the target site by a preset keypoint recognition algorithm.
In one embodiment, the second determining subunit is specifically configured to determine, through a preset keypoint identification algorithm, a plurality of initial keypoints corresponding to the target portion; and to determine, among the plurality of initial key points corresponding to the target part, a middle position point of a first transverse boundary of the target part as a first target key point, a middle position point of a second transverse boundary of the target part as a second target key point, a first longitudinal boundary position point of the target part as a third target key point, and a second longitudinal boundary position point of the target part as a fourth target key point;
the second determining unit includes:
a first calculating subunit configured to perform calculating a first distance between the first target key point and the second target key point according to the coordinate information of the first target key point and the coordinate information of the second target key point, and calculating a second distance between the third target key point and the fourth target key point according to the coordinate information of the third target key point and the coordinate information of the fourth target key point;
a second calculating subunit configured to perform calculating a ratio between the first distance and the second distance, and determine the ratio as an opening/closing ratio of a target portion in the target video frame.
In one embodiment, the video recognition apparatus further includes:
a first calculation unit configured to perform calculation of a similarity between the target opening-closing ratio sequence and each of the standard opening-closing ratio sequences, respectively;
a fifth determining unit configured to determine, in a case where there is a similarity greater than or equal to a preset similarity threshold, that the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences satisfy the preset similarity condition.
In one embodiment, the video recognition device further comprises:
an audio matching unit configured to acquire audio information of an initial video, determine the audio information of the preset standard video, and determine the initial video as the video to be detected in a case where the audio information of the initial video and the audio information of the preset standard video satisfy a preset matching condition.
In one embodiment, the video recognition apparatus further includes:
a fourth determining unit configured to perform, for each standard video frame in the preset standard video, determining a plurality of target key points of a target portion in the standard video frame by a preset key point detection algorithm; calculating the opening and closing ratio of a target part in the standard video frame according to the coordinate information of each target key point;
and a second construction unit configured to construct a standard opening-closing ratio sequence corresponding to the preset standard video according to the opening-closing ratio of the target part in each standard video frame.
In one embodiment, the video recognition apparatus further includes:
the adding unit is configured to add a target opening-closing ratio sequence corresponding to a video to be detected as a new standard opening-closing ratio sequence to a preset standard opening-closing ratio sequence set, wherein the preset standard opening-closing ratio sequence set is used for storing a standard opening-closing ratio sequence calculated based on the preset standard video and the new standard opening-closing ratio sequence.
In one embodiment, the first obtaining unit includes:
a first acquisition subunit configured to perform acquisition of a video to be detected including a plurality of initial video frames;
and a recovery subunit configured to perform, for each initial video frame, pixel recovery on the initial video frame according to a preset pixel recovery network to obtain a target video frame.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video identification method according to any one of the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video identification method according to any one of the first aspect.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a computer program product, wherein instructions in the computer program product, when executed by a processor of an electronic device, enable the electronic device to perform the video identification method according to any one of the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of target video frames; for each target video frame, determining a plurality of target key points of a target part in the target video frame through a preset key point detection algorithm; determining the opening-closing ratio of the target part in the target video frame according to the coordinate information of each target key point; constructing a target opening-closing ratio sequence corresponding to the video to be detected according to the opening-closing ratio of the target part in each target video frame; and, in a case where the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences satisfy the preset similarity condition, determining that the video to be detected is a video of the target interaction mode. With the method and apparatus, whether the video to be detected belongs to the target interaction mode can be identified from the visual information of the video alone, by calculating how similar the opening-closing ratio sequence of the video to be detected is to the opening-closing ratio sequences of preset standard videos. This video identification approach consumes few computing resources and achieves high identification efficiency and accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a flow diagram illustrating a video recognition method according to an example embodiment.
Fig. 2 is a flowchart illustrating a step of identifying target key points of a target portion in a face region in a video identification method according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating a step of determining target keypoints in a video identification method according to an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating a target site in a video recognition method according to an example embodiment.
Fig. 5 is a flowchart illustrating a step of calculating an opening/closing ratio of a target region in a video recognition method according to an exemplary embodiment.
Fig. 6A is a flowchart illustrating a step of determining whether a preset similarity condition is satisfied in a video recognition method according to an exemplary embodiment.
FIG. 6B is a flowchart illustrating a step of obtaining initial audio information in a video recognition method according to an example embodiment.
Fig. 7 is a flowchart illustrating steps of constructing a standard opening-closing ratio sequence in a video recognition method according to an exemplary embodiment.
Fig. 8 is a flowchart illustrating a step of extracting a target video frame in a video recognition method according to an exemplary embodiment.
Fig. 9A is a schematic diagram of a video frame before pixel recovery in a video identification method according to an example embodiment.
Fig. 9B is a diagram illustrating a video frame after pixel recovery in a video recognition method according to an example embodiment.
Fig. 10 is a block diagram illustrating a video recognition device according to an example embodiment.
FIG. 11 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be further noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Fig. 1 is a flowchart illustrating a video recognition method according to an exemplary embodiment, where the video recognition method illustrated in fig. 1 is used in an electronic device, and includes the following steps.
In step S110, a video to be detected is acquired.
The video to be detected comprises a plurality of target video frames.
In implementation, the electronic device may acquire a plurality of videos to be detected based on the requirements of the actual application scenario. Each video to be detected comprises a plurality of video frames, and the electronic device may acquire a plurality of target video frames from the video frames of the video to be detected according to a preset sampling frequency. For example, the duration of the video to be detected may be ten seconds, and the number of target video frames acquired by the electronic device based on the preset sampling frequency may be thirty.
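The uniform sampling described above can be sketched as follows; the sampling frequency and the frame counts in the example are the illustrative values from the paragraph above, not values fixed by the disclosure.

```python
def sample_target_frames(frame_count, fps, sample_hz=3.0):
    """Uniformly pick target-frame indices from a video's frames.

    frame_count: total number of frames in the video to be detected
    fps: frames per second of the video
    sample_hz: preset sampling frequency (illustrative value)
    """
    step = max(1, round(fps / sample_hz))
    return list(range(0, frame_count, step))

# A ten-second video at 30 fps sampled at 3 Hz yields thirty target frames,
# matching the example counts above.
indices = sample_target_frames(frame_count=300, fps=30)
```

Here `len(indices)` is 30, one target frame every tenth source frame.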
In step S120, for each target video frame, a plurality of target keypoints of a target portion in the target video frame are determined through a preset keypoint detection algorithm.
The preset key point detection algorithm may be any algorithm for detecting key points in the face region, for example, a 68-bit key point detection algorithm, a 101-bit key point detection algorithm, or a 203-bit key point detection algorithm. The target portion may be a body portion in the image of the user included in the target video frame, such as a lip portion, an eye portion, or the like, and the key point may be a pixel position point in the image data corresponding to the video frame. The target key point may be a key point in the target portion that satisfies a preset position detection condition.
In implementation, the electronic device may determine, for each target video frame of a plurality of target video frames included in the video to be detected, a target portion in the target video frame, and detect, through a preset key point detection algorithm, a plurality of target key points of the target portion of the target video frame.
In step 130, the opening/closing ratio of the target portion in the target video frame is determined according to the coordinate information of each target key point.
The opening/closing ratio (mouth ratio) may be used to represent the degree to which the target portion is open. Words carrying different semantic information correspond to different opening/closing degrees of the target portion, as do words with different pronunciation characteristics; a pronunciation characteristic may be, for example, a syllable.
In implementation, the electronic device may acquire coordinate information of each target key point in image data corresponding to a video frame, and calculate an opening/closing ratio of a target portion in the target video frame based on the coordinate information of each target key point.
In one possible implementation, the electronic device may determine, through the preset key point detection algorithm, a plurality of key points corresponding to the video frame, and screen them based on the pixel data and the position coordinate information of each key point, thereby obtaining the target portion and the plurality of target key points of the target portion.
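The distance-and-ratio computation can be sketched as below, following the four-keypoint description in the claims: the first distance between the two transverse-boundary midpoints, divided by the second distance between the two longitudinal-boundary points. The coordinates in the example are illustrative; the disclosure does not fix a coordinate convention.

```python
import math

def opening_closing_ratio(top_mid, bottom_mid, left_corner, right_corner):
    """Opening/closing ratio of the target part: the first distance
    (between the two transverse-boundary midpoints) divided by the second
    distance (between the two longitudinal-boundary points)."""
    first_distance = math.dist(top_mid, bottom_mid)
    second_distance = math.dist(left_corner, right_corner)
    return first_distance / second_distance

# A partly open mouth: midpoints 2 units apart, corners 4 units apart.
ratio = opening_closing_ratio((0.0, 2.0), (0.0, 0.0), (-2.0, 1.0), (2.0, 1.0))
```

A closed mouth drives the first distance, and hence the ratio, toward zero, while a wide-open mouth raises it; this is what makes the per-frame ratio a usable proxy for lip movement.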
In step S140, a target opening/closing ratio sequence corresponding to the video to be detected is constructed according to the opening/closing ratio of the target portion in each target video frame.
The target opening and closing ratio sequence comprises opening and closing ratios of target parts in a plurality of target video frames.
In implementation, the electronic device determines the time sequence of each target video frame in the time dimension, and arranges the opening and closing ratios of the target parts in each target video frame according to the time sequence to obtain a target opening and closing ratio sequence corresponding to the video to be detected.
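The time-ordered arrangement above might be sketched like this; representing each frame as a (timestamp, ratio) pair is an assumption made for illustration.

```python
def build_ratio_sequence(frame_ratios):
    """frame_ratios: iterable of (timestamp, opening_closing_ratio) pairs,
    possibly out of order. Returns the ratios ordered by timestamp, i.e.
    the target opening-closing ratio sequence of the video."""
    return [ratio for _, ratio in sorted(frame_ratios)]

seq = build_ratio_sequence([(0.2, 0.45), (0.0, 0.10), (0.1, 0.30)])
```

Here `seq` comes out time-ordered as `[0.10, 0.30, 0.45]` regardless of the order the frames were processed in.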
In step S150, in a case where it is determined that the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences satisfy the preset similarity condition, the video to be detected is determined to be a video of the target interaction mode.
The standard opening and closing ratio sequence is obtained by calculation based on a preset standard video, and the preset standard video is a video of a target interaction mode.
Specifically, the target interaction mode may include a plurality of interaction templates. In one example, the video of the target interaction mode may be a lip-sync video and the target interaction mode a lip-sync interaction mode, and the lip-sync interaction mode may include a plurality of lip-sync templates, for example templates built from various song segments and templates built from various movie and television segments. The electronic device may determine a target interaction template, i.e., the interaction template that needs to be identified, based on the actual video operation policy. That is to say, in a case where the interaction templates are lip-sync templates, the electronic device may determine one or more target lip-sync templates among the plurality of lip-sync templates based on a preset video operation policy, and acquire the lip-sync videos of each target template.
In this way, the electronic device may calculate an opening-closing ratio sequence for each of these target-template videos, determine the resulting opening-closing ratio sequences as standard opening-closing ratio sequences, and construct a standard opening-closing ratio sequence set based on the plurality of standard opening-closing ratio sequences.
In implementation, the electronic device may obtain the plurality of standard opening-closing ratio sequences in the preset standard opening-closing ratio sequence set, and then judge whether the target opening-closing ratio sequence of the video to be detected and the plurality of standard opening-closing ratio sequences satisfy the preset similarity condition. In a case where the electronic device determines that they do, it may determine that the video to be detected is a video of the target interaction mode.
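The disclosure does not name a particular similarity measure or threshold. As one hedged possibility, cosine similarity between the target sequence and each standard sequence, compared against a preset threshold, could realize the check; both the measure and the 0.9 threshold below are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length opening-closing ratio sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_target_interaction_video(target_seq, standard_seqs, threshold=0.9):
    """True if any standard sequence is at least `threshold`-similar to the
    target sequence (threshold value is illustrative, not from the disclosure)."""
    return any(cosine_similarity(target_seq, s) >= threshold
               for s in standard_seqs)
```

Sequences of different lengths would first need alignment or resampling (for example, dynamic time warping), a detail the disclosure leaves open.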
In the video identification method, a video to be detected comprising a plurality of target video frames is acquired; for each target video frame, a plurality of target key points of a target part in the target video frame are determined through a preset key point detection algorithm; the opening-closing ratio of the target part in the target video frame is determined according to the coordinate information of each target key point; a target opening-closing ratio sequence corresponding to the video to be detected is constructed according to the opening-closing ratio of the target part in each target video frame; and, in a case where the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences satisfy the preset similarity condition, the video to be detected is determined to be a video of the target interaction mode. With the method and apparatus, whether the video to be detected belongs to the target interaction mode can be identified from the visual information of the video alone, by calculating how similar its opening-closing ratio sequence is to the opening-closing ratio sequences of preset standard videos. This video identification approach consumes few computing resources and achieves high identification efficiency and accuracy, so that videos of the target interaction mode, once identified, can be effectively distributed, improving the experience of short-video creators.
In an exemplary embodiment, as shown in fig. 2, in step 120, a plurality of target keypoints of a target portion in a target video frame are determined through a preset keypoint detection algorithm, which may be specifically implemented by:
in step 1201, a face region in the target video frame is determined by a preset face region recognition algorithm, and a target part in the face region is recognized.
The preset face region recognition algorithm may be, for example, the Eigenface method, a Local Binary Pattern (LBP) based method, or the Fisherface algorithm.
In implementation, the electronic device acquires pixel data corresponding to a target video frame, identifies the pixel data corresponding to the target video frame based on a preset face region identification algorithm, and determines a face region, namely a face image, in the target video frame. In this way, the electronic device can recognize the target part in the image data corresponding to the face region through a preset target part recognition algorithm.
In one example, the target portion may be a lip portion, and thus, for a target video frame, the electronic device may determine a face region in the target video frame according to a preset face region recognition algorithm, and perform recognition in the face region to determine the lip portion.
In another example, the target portion may be a left eye portion and a right eye portion, so that, for a target video frame, the electronic device may determine a face region in the target video frame according to a preset face region recognition algorithm, perform recognition in the face region, and determine the left eye portion and the right eye portion.
Alternatively, the target site may also be other body sites.
In step 1202, a plurality of target keypoints of the target site are determined by a preset keypoint identification algorithm.
In implementation, the electronic device may scan the target portion through a preset key point detection algorithm, and extract a plurality of target key points of the target portion.
Based on the scheme, a plurality of target key points of the target part are determined, the outline of the target part is accurately described, and accurate data support is provided for subsequent opening-closing ratio calculation.
In an exemplary embodiment, as shown in fig. 3, in step 1202, the determining a plurality of target keypoints of the target portion by using a preset keypoint identification algorithm may specifically be implemented by:
in step S310, a plurality of initial keypoints corresponding to the target region are determined by a preset keypoint identification algorithm.
The initial key points may be a plurality of key points obtained by the electronic device according to a preset key point identification algorithm, and the plurality of initial key points may be used for completely depicting the contour of the target portion.
In implementation, the electronic device may locate a boundary of a target portion (an outline of the target portion) in the face region based on a preset key point identification algorithm, so as to obtain a plurality of initial key points corresponding to the boundary of the target portion.
In one example, the target region may be a lip region in the face region, and as shown in fig. 4, the lip region may include an upper lip and a lower lip, and the lip region may also include an inner lip and an outer lip. The electronic device may obtain a plurality of initial key points based on a preset key point recognition algorithm, where the plurality of initial key points may include a plurality of initial key points corresponding to the boundary of the inner lip and a plurality of initial key points corresponding to the boundary of the outer lip. For example, the initial key points corresponding to the boundary of the inner lip may be initial key point 75, initial key point 76, …, initial key point 86, and the initial key points corresponding to the boundary of the outer lip may be initial key point 87, initial key point 88, …, initial key point 94.
In step S320, among the plurality of initial key points corresponding to the target portion, the middle position point of the first transverse boundary of the target portion is determined as a first target key point, the middle position point of the second transverse boundary of the target portion is determined as a second target key point, the first longitudinal boundary position point of the target portion is determined as a third target key point, and the second longitudinal boundary position point of the target portion is determined as a fourth target key point.
Based on the fact that the target region may be a lip region, the lip region may include an upper lip and a lower lip, a first lateral boundary of the target region may be an upper boundary of the upper lip, a second lateral boundary of the target region may be a lower boundary of the lower lip, a middle position point of the first lateral boundary of the target region may be a middle position point of the upper boundary of the upper lip, and a middle position point of the second lateral boundary of the target region may be a middle position point of a lower boundary of the lower lip; the first longitudinal boundary position point of the target portion may be a position point on a leftmost boundary of the lip portion, and the second longitudinal boundary position point of the target portion may be a position point on a rightmost boundary of the lip portion.
In implementation, the electronic device may determine a plurality of initial key points corresponding to the target portion, and obtain a plurality of target key points based on a preset position detection condition among the plurality of initial key points. Because the mouth shapes corresponding to different pronunciations are different when the user speaks, the change of the mouth shape (the opening and closing degree of the mouth) of the user can reflect the semantic expression of the user. In this way, the electronic device may acquire a first longitudinal boundary position point of the lip portion, a second longitudinal boundary position point of the lip portion, a middle position point of the first transverse boundary, and a middle position point of the second transverse boundary of the target portion, as a plurality of target key points corresponding to the video to be detected.
In an example, the electronic device may consider the initial keypoint 78 as a first target keypoint, the initial keypoint 84 as a second target keypoint, the initial keypoint 75 as a third target keypoint, and the initial keypoint 81 as a fourth target keypoint.
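As a minimal illustration, the selection of the four target key points from the initial key points can be sketched as follows. The index numbers 75, 78, 81 and 84 follow the example above; the dictionary-based representation and the function name are assumptions of this sketch, not part of the disclosure:

```python
from typing import Dict, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates of a key point


def select_target_keypoints(initial: Dict[int, Point]) -> Dict[str, Point]:
    """Pick the four target key points from the initial lip key points.

    Index 78: middle of the upper boundary of the upper lip (first target)
    Index 84: middle of the lower boundary of the lower lip (second target)
    Index 75: leftmost boundary point of the lips (third target)
    Index 81: rightmost boundary point of the lips (fourth target)
    """
    return {
        "first": initial[78],
        "second": initial[84],
        "third": initial[75],
        "fourth": initial[81],
    }
```

Any landmark detector that returns the initial key points indexed as above could feed this selection step.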
Based on the above-described scheme, accordingly, as shown in fig. 5, in step 120, determining the opening/closing ratio of the target portion in the target video frame according to the coordinate information of each target key point may specifically be implemented by the following steps:
in step S510, a first distance between the first target key point and the second target key point is calculated according to the coordinate information of the first target key point and the coordinate information of the second target key point, and a second distance between the third target key point and the fourth target key point is calculated according to the coordinate information of the third target key point and the coordinate information of the fourth target key point.
The first distance may be an euclidean distance between two target key points, or a cosine distance between two target key points, and the second distance may be an euclidean distance between two target key points, or a cosine distance between two target key points.
In implementation, the electronic device may calculate a first distance between the first target key point and the second target key point according to a preset distance calculation method, and calculate a second distance between the third target key point and the fourth target key point according to the preset distance calculation method.
In step S520, a ratio between the first distance and the second distance is calculated, and the ratio is determined as an opening/closing ratio of the target portion in the target video frame.
In implementation, the electronic device may calculate the opening and closing degree of the target portion according to the first distance and the second distance; that is, the electronic device may calculate a ratio between the first distance and the second distance, and use the calculated ratio as the opening/closing ratio of the target portion in the target video frame.
Alternatively, the electronic device may calculate the opening/closing ratio Ratio_i of the target portion in the i-th target video frame according to the following formula:

Ratio_i = L(dot_78, dot_84) / L(dot_75, dot_81)

where dot_78 represents the first target key point, dot_84 represents the second target key point, dot_75 represents the third target key point, dot_81 represents the fourth target key point, and L(dot_75, dot_81) represents the distance between dot_75 and dot_81.
Specifically, the electronic device may calculate the distance L by the following formula:

L(x, y) = √((x_1 − y_1)² + (x_2 − y_2)² + … + (x_n − y_n)²)

where the distance L represents the distance between key point x and key point y, the coordinate information of key point x may include x_1, x_2, …, x_n, and the coordinate information of key point y may include y_1, y_2, …, y_n.
Based on the scheme, the target key points in the target video can be quickly detected, and the opening and closing ratio calculated through the target key points can be ensured to comprehensively and accurately represent semantic information to be expressed by the user.
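The two formulas above can be sketched in Python as follows. This is a minimal sketch assuming two-dimensional pixel coordinates; the key-point numbering follows the example above, and the function names are illustrative:

```python
import math
from typing import Sequence


def distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance L(x, y) between two key points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def opening_closing_ratio(dot78, dot84, dot75, dot81) -> float:
    """Ratio_i = L(dot78, dot84) / L(dot75, dot81):
    vertical lip opening divided by horizontal lip width."""
    return distance(dot78, dot84) / distance(dot75, dot81)
```

A wide-open mouth gives a ratio near or above 1, while a closed mouth gives a ratio near 0, which is what lets the sequence of ratios trace the mouth-shape changes over time.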
In an exemplary embodiment, as shown in fig. 6A, in step 140, before determining that the video to be detected is the video in the target interaction mode when it is determined that the target open-close ratio sequence and the plurality of standard open-close ratio sequences satisfy the preset similar condition, the video identification method further includes:
in step 610, the similarity between the target opening/closing ratio sequence and each standard opening/closing ratio sequence is calculated.
The standard opening-closing ratio sequence is obtained by calculation according to a preset standard video.
In implementation, the electronic device calculates a plurality of similarities between the target opening-closing ratio sequence and each standard opening-closing ratio sequence according to a preset similarity algorithm. That is, for each standard opening-closing ratio sequence in the plurality of standard opening-closing ratio sequences, the electronic device may calculate the similarity between the target opening-closing ratio sequence and that standard opening-closing ratio sequence through the preset similarity algorithm.
The preset similarity algorithm may also be a similarity algorithm in a Faiss library, so that, for each standard opening-closing ratio sequence in the plurality of standard opening-closing ratio sequences, the electronic device may calculate the similarity between the target opening-closing ratio sequence and the standard opening-closing ratio sequence by calling the Faiss library.
Alternatively, the preset similarity algorithm may be a cosine similarity algorithm; the present disclosure does not limit the specific type of the similarity algorithm.
In step 620, in the case that there is a similarity greater than or equal to a preset similarity threshold, it is determined that the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences satisfy a preset similarity condition.
Wherein, satisfying the preset similarity condition indicates that the target opening-closing ratio sequence is similar to at least one of the plurality of standard opening-closing ratio sequences.
In implementation, the electronic device may compare the calculated similarity between the target opening-closing ratio sequence and each standard opening-closing ratio sequence with a preset similarity threshold, and determine whether the target opening-closing ratio sequence and the standard opening-closing ratio sequences satisfy a preset similarity condition according to a plurality of comparison results.
One possible implementation may be: the electronic device can compare the similarity between the calculated target opening-closing ratio sequence and each standard opening-closing ratio sequence with a preset similarity threshold respectively to obtain a plurality of similarity comparison results. If the similarity greater than or equal to the preset similarity threshold exists, the electronic device can determine that the target opening-closing ratio sequence and the standard opening-closing ratio sequences meet the preset similarity conditions, and can determine that the interactive template of the video to be detected is the interactive template of the corresponding standard opening-closing ratio sequence.
Another possible implementation may be: the electronic device may compare the calculated similarity between the target opening-closing ratio sequence and each standard opening-closing ratio sequence with a preset similarity threshold respectively to obtain a plurality of similarity comparison results. If no similarity is greater than or equal to the preset similarity threshold, but at least a target number of similarities are greater than or equal to a first similarity threshold, the electronic device may determine that the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences satisfy the preset similarity condition. The preset similarity threshold is greater than the first similarity threshold, and the target number may be determined according to the number of standard opening-closing ratio sequences, for example, one half of the number of standard opening-closing ratio sequences. The present disclosure does not limit the specific value of the preset similarity threshold, the specific value of the first similarity threshold, or the specific value of the target number.
Based on the scheme, comprehensiveness and flexibility of judging whether the target opening-closing ratio sequence and the standard opening-closing ratio sequences meet preset similar conditions can be guaranteed.
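The two branches described above can be sketched as follows. Cosine similarity over equal-length sequences is used here as one of the options the disclosure mentions; the threshold values and the "half of the standard sequences" target count are placeholder assumptions, not values fixed by the disclosure:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length ratio sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def satisfies_similarity_condition(
    target: Sequence[float],
    standards: Sequence[Sequence[float]],
    preset_threshold: float = 0.95,
    first_threshold: float = 0.85,
) -> bool:
    """Two branches: (1) any similarity >= the preset threshold, or
    (2) at least a target number (here: half of the standard
    sequences) of similarities >= the lower first threshold."""
    sims = [cosine_similarity(target, s) for s in standards]
    if any(s >= preset_threshold for s in sims):
        return True
    target_count = max(1, len(standards) // 2)
    return sum(1 for s in sims if s >= first_threshold) >= target_count
```

In practice the sequences would first be aligned to a common length (for example by resampling), since cosine similarity assumes equal-length vectors.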
In an exemplary embodiment, as shown in fig. 6B, before "acquiring the video to be detected" in step 110, the video identification method further includes:
in step 601, audio information of the initial video is obtained, and audio information of the preset standard video is determined.
In step 602, in a case that the audio information of the initial video and the audio information of the preset standard video satisfy a preset matching condition, it is determined that the initial video is a video to be detected.
The initial video may be a video collected by the electronic device from a video pool corresponding to the target program, or a video published by a collected target account. In the case that the preset standard video is a video of the lip-sync interaction mode, the audio information of the preset standard video may be the audio information contained in the lip-sync video template. In the case that the lip-sync video template is of a song type, the audio information may be the audio information of the song; in the case that the lip-sync video template is of a movie type, the audio information may be the audio information of lines in the movie.
In implementation, an operator may initiate a video identification task for a target interaction mode based on an actual video operation policy. And the electronic equipment can determine a target interaction mode contained in the identification task and a target interaction template corresponding to the target interaction mode under the condition that the identification task is received. In this way, the electronic device may obtain the audio information included in the target interaction template as the audio information of the preset standard video. Based on the method, the electronic equipment can screen the collected multiple initial videos based on the audio information of the preset standard video, eliminate the initial videos which do not meet the preset matching condition with the audio information of the preset standard video, and take the remaining initial videos as the videos to be detected.
Specifically, the electronic device may respectively obtain a first audio feature corresponding to the audio information of the preset standard video and a second audio feature corresponding to the audio information of each initial video, and the electronic device may perform similarity calculation on the first audio feature and the second audio feature to obtain an audio similarity, and if the audio similarity is greater than or equal to a preset audio similarity threshold, it is determined that the initial video may be the video to be detected.
In one example, the target interaction mode contained in the identification task received by the electronic device may be the lip-sync type, and the target interaction templates contained in the lip-sync type may be a lip-sync template of an A song segment and a lip-sync template of a B movie and television episode segment; the electronic device may acquire the A song segment contained in the lip-sync template of the A song segment as the audio information of the first preset standard video, and acquire the B audio segment contained in the lip-sync template of the B movie and television episode segment as the audio information of the second preset standard video.
In this way, after the electronic device collects the initial video, it may acquire the audio feature corresponding to the audio information of the initial video, perform similarity calculation with the audio feature corresponding to the audio information of the first preset standard video to obtain a first audio similarity, and obtain a second audio similarity through a similar process, where the second audio similarity represents the similarity between the audio information of the initial video and the audio information of the B movie and television episode segment.
If the first audio similarity or the second audio similarity is greater than or equal to a preset audio similarity threshold, it is determined that the audio information of the initial video and the audio information of the preset standard video satisfy the preset matching condition. If the first audio similarity calculated by the electronic device is smaller than the preset audio similarity threshold and the second audio similarity is also smaller than the preset audio similarity threshold, it is determined that the initial video does not satisfy the preset matching condition, that is, the initial video is determined not to be a video to be detected and not to be a lip-sync video. In a case where it is determined that the initial video does not satisfy the preset matching condition, the electronic device may reject the initial video.
Based on the scheme, the collected initial video can be pre-screened in the audio dimension, the subsequent speed of video identification is accelerated, and the video identification accuracy and the video identification efficiency are improved.
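The audio pre-screening step can be sketched as follows. The feature-extraction and similarity functions are deliberately passed in by the caller, since the disclosure does not fix a specific audio feature or similarity measure; the function name and threshold are assumptions of this sketch:

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

Video = TypeVar("Video")


def prescreen_by_audio(
    initial_videos: Iterable[Video],
    standard_audio_features: Sequence[Sequence[float]],
    extract_feature: Callable[[Video], Sequence[float]],
    similarity: Callable[[Sequence[float], Sequence[float]], float],
    audio_threshold: float = 0.9,
) -> List[Video]:
    """Keep only the initial videos whose audio matches at least one
    preset standard video's audio; the rest are rejected."""
    kept = []
    for video in initial_videos:
        feature = extract_feature(video)
        if any(
            similarity(feature, std) >= audio_threshold
            for std in standard_audio_features
        ):
            kept.append(video)
    return kept
```

Only the videos returned here would go on to the key-point and opening-closing-ratio stages, which is what makes the pre-screening cheap relative to full visual analysis.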
In an exemplary embodiment, as shown in fig. 7, in step 140, before determining that the video to be detected is the video in the target interaction mode when it is determined that the target opening-closing ratio sequence and the multiple standard opening-closing ratio sequences satisfy the preset similar condition, the video identification method further includes:
in step 710, for each standard video frame in the preset standard video, a plurality of target key points of the target portion in the standard video frame are determined through a preset key point detection algorithm, and the opening and closing ratio of the target portion in the standard video frame is calculated according to the coordinate information of each target key point.
The preset standard video comprises a plurality of standard video frames.
In implementation, the electronic device may acquire a video frame in a preset standard video according to a preset sampling frequency to obtain a plurality of standard video frames. For each standard video frame in a plurality of standard video frames included in a preset standard video, the electronic device may determine a target portion in the standard video frame, and detect a plurality of target key points of the target portion of the standard video frame through a preset key point detection algorithm. Based on the above, the electronic device may acquire coordinate information of each target key point in the image data corresponding to the video frame, and calculate the opening/closing ratio of the target portion in the standard video frame based on the coordinate information of each target key point.
Optionally, the preset sampling frequency in the above step is the same frequency as the preset sampling frequency in step S110.
In step 720, a standard open-close ratio sequence corresponding to the preset standard video is constructed according to the open-close ratio of the target portion in each standard video frame.
In implementation, the electronic device determines a time sequence of each standard video frame in a time dimension, and arranges the opening and closing ratios of the target parts in each standard video frame according to the time sequence to obtain a standard opening and closing ratio sequence corresponding to the preset standard video.
In this way, the electronic device may calculate the opening-closing ratio sequence of each preset standard video as a standard opening-closing ratio sequence, and construct a set of standard opening-closing ratio sequences based on the plurality of standard opening-closing ratio sequences.
Based on the scheme, a standard open-close ratio sequence can be constructed in advance, and a judgment basis for video identification is provided.
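The time-ordered arrangement of per-frame ratios described above can be sketched as follows; representing each sampled frame as a (timestamp, ratio) pair is an assumption of this sketch:

```python
from typing import Iterable, List, Tuple


def build_ratio_sequence(
    frame_ratios: Iterable[Tuple[float, float]]
) -> List[float]:
    """Arrange per-frame opening/closing ratios in time order.

    frame_ratios: (timestamp, ratio) pairs, one per sampled video frame.
    Returns the ratios sorted by timestamp, i.e. the opening-closing
    ratio sequence for the video.
    """
    return [ratio for _, ratio in sorted(frame_ratios, key=lambda p: p[0])]
```

The same helper applies to both the preset standard videos and the video to be detected, since both sequences are built the same way.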
In an exemplary embodiment, in step 140, after determining that the video to be detected is the video in the target interaction mode when it is determined that the target open-close ratio sequence and the plurality of standard open-close ratio sequences satisfy the preset similar condition, the video identification method further includes:
and adding a target opening-closing ratio sequence corresponding to the video to be detected as a newly increased standard opening-closing ratio sequence to a preset standard opening-closing ratio sequence set.
The preset standard opening-closing ratio sequence set is used for storing a standard opening-closing ratio sequence and a newly added standard opening-closing ratio sequence which are obtained based on preset standard video calculation.
In implementation, after the electronic device determines that the video to be detected is a video in a target interaction manner, the electronic device may obtain a target opening/closing ratio sequence corresponding to the video to be detected, and add the target opening/closing ratio sequence as a newly added standard opening/closing ratio sequence to a preset standard opening/closing ratio sequence set.
Based on the scheme, the standard open-close ratio sequence set can be continuously updated in real time, and the standard open-close ratio sequence set can be more comprehensive and complete along with continuous updating.
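The continuously updated set can be sketched as a simple container; the class name and in-memory list representation are assumptions of this sketch (an implementation might instead persist the set or store it in a vector index):

```python
from typing import List, Sequence


class StandardRatioSequenceSet:
    """Holds the preset standard sequences plus any newly added ones."""

    def __init__(self, preset_sequences: Sequence[Sequence[float]]):
        # Sequences calculated from the preset standard videos.
        self.sequences: List[List[float]] = [list(s) for s in preset_sequences]

    def add_recognized(self, target_sequence: Sequence[float]) -> None:
        """Add the sequence of a video already recognized as the target
        interaction mode, as a newly added standard sequence."""
        self.sequences.append(list(target_sequence))
```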
In an exemplary embodiment, as shown in fig. 8, step 110 "acquiring a video to be detected, where the video to be detected includes a plurality of target video frames", may specifically be implemented by the following steps:
in step 810, a video to be detected comprising a plurality of initial video frames is obtained.
In step 820, for each initial video frame, pixel recovery is performed on the initial video frame according to a preset pixel recovery network to obtain a target video frame.
In implementation, the electronic device may judge the definition of the collected video to be detected, and, in the case that the definition of the video to be detected is determined to be less than or equal to a preset definition threshold, perform pixel recovery processing on each initial video frame contained in the video to be detected based on a preset pixel recovery network to obtain target video frames, where the definition of a target video frame is greater than that of the corresponding initial video frame, and the preset pixel recovery network may be an FSRCNN network. The specific pixel recovery process may be: the electronic device detects a face region in the initial video frame through the FSRCNN network, and crops the initial video frame based on the face region to obtain the cropped initial video frame. On this basis, the electronic device can adjust the definition of the pixel data in the cropped initial video frame to obtain a target video frame with increased definition.
One possible implementation manner may be that the electronic device extracts each initial video frame in the collected video to be detected according to a preset sampling frequency to obtain a plurality of initial video frames. In one example, as shown in fig. 9A, it may be a schematic diagram of an initial video frame before performing a pixel recovery process, as shown in fig. 9B, it may be a schematic diagram of an initial video frame after performing a pixel recovery process, and a definition of the video frame represented in fig. 9A is smaller than that of the video frame represented in fig. 9B.
Based on the scheme, the definition of the video to be detected can be improved, and subsequent video identification is facilitated.
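The per-frequency frame extraction mentioned above can be sketched as follows; the FSRCNN pixel-recovery step itself is omitted, and the parameter names are assumptions of this sketch:

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_frames(
    frames: Sequence[Frame], video_fps: float, sample_hz: float
) -> List[Frame]:
    """Extract frames at a preset sampling frequency: keep one frame
    every video_fps / sample_hz frames of the original video."""
    step = max(1, round(video_fps / sample_hz))
    return list(frames[::step])
```

Using the same sampling frequency for the preset standard videos and the video to be detected (as step S110 and step 710 suggest) keeps the two opening-closing ratio sequences on comparable time scales.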
It should be understood that although the various steps in the flowcharts of figs. 1-8 are shown in order as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated otherwise herein, the steps are not subject to a strict order limitation and may be performed in other orders. Moreover, at least some of the steps in figs. 1-8 may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It is understood that the same/similar parts between the embodiments of the method described above in this specification can be referred to each other, and each embodiment focuses on the differences from the other embodiments, and it is sufficient that the relevant points are referred to the descriptions of the other method embodiments.
Fig. 10 is an apparatus block diagram illustrating a video recognition apparatus 1000 according to an example embodiment. Referring to fig. 10, the apparatus includes a first acquisition unit 1002, a first determination unit 1004, a second determination unit 1006, a first construction unit 1008, and a third determination unit 1010.
A first obtaining unit 1002 configured to perform obtaining a video to be detected, the video to be detected including a plurality of target video frames;
a first determining unit 1004 configured to execute determining, for each target video frame, a plurality of target keypoints of a target portion in the target video frame by a preset keypoint detection algorithm;
a second determining unit 1006, configured to determine an opening/closing ratio of the target portion in the target video frame according to the coordinate information of each target key point;
a first constructing unit 1008 configured to execute constructing a target opening-closing ratio sequence corresponding to the video to be detected according to the opening-closing ratio of the target part in each target video frame;
a third determining unit 1010 configured to determine that the video to be detected is the video of the target interaction mode when it is determined that the target open-close ratio sequence and the plurality of standard open-close ratio sequences satisfy the preset similar condition, where the standard open-close ratio sequence is calculated based on a preset standard video, and the preset standard video is the video of the target interaction mode.
In an exemplary embodiment, the first determining unit 1004 includes:
a first determining subunit configured to determine a face region in the target video frame through a preset face region recognition algorithm, and recognize a target portion in the face region;
and a second determining subunit configured to execute determining a plurality of target keypoints of the target portion by a preset keypoint detection algorithm.
In one embodiment, the second determining subunit is specifically configured to determine, through a preset key point identification algorithm, a plurality of initial key points corresponding to the target portion; and, among the plurality of initial key points corresponding to the target portion, determine a middle position point of a first transverse boundary of the target portion as a first target key point, determine a middle position point of a second transverse boundary of the target portion as a second target key point, determine a first longitudinal boundary position point of the target portion as a third target key point, and determine a second longitudinal boundary position point of the target portion as a fourth target key point;
the second determining unit 1006 includes:
a first calculating subunit configured to perform calculating a first distance between the first target key point and the second target key point according to the coordinate information of the first target key point and the coordinate information of the second target key point, and calculating a second distance between the third target key point and the fourth target key point according to the coordinate information of the third target key point and the coordinate information of the fourth target key point;
and the second calculating subunit is configured to calculate a ratio between the first distance and the second distance, and determine the ratio as an opening/closing ratio of the target part in the target video frame.
In an exemplary embodiment, the video recognition apparatus 1000 further comprises:
a first calculation unit configured to perform respective calculations of similarities between the target opening-closing ratio sequence and each of the standard opening-closing ratio sequences;
and a third determination unit configured to perform, in a case where there is a similarity greater than or equal to a preset similarity threshold, a determination that the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences satisfy a preset similarity condition.
In an exemplary embodiment, the first obtaining unit 1002 is specifically configured to:
acquiring audio information of an initial video and determining the audio information of a preset standard video; and determining the initial video to be the video to be detected under the condition that the audio information of the initial video and the audio information of the preset standard video meet the preset matching condition.
In an exemplary embodiment, the video recognition apparatus 1000 further includes:
a fourth determining unit configured to perform, for each standard video frame in the preset standard video, determining a plurality of target key points of the target portion in the standard video frame by a preset key point detection algorithm; calculating the opening and closing ratio of a target part in a standard video frame according to the coordinate information of each target key point;
and the construction unit is configured to execute the construction of a standard opening-closing ratio sequence corresponding to the preset standard video according to the opening-closing ratio of the target part in each standard video frame.
In an exemplary embodiment, the video recognition apparatus 1000 further includes:
and the adding unit is configured to add a target opening-closing ratio sequence corresponding to the video to be detected as a new standard opening-closing ratio sequence to a preset standard opening-closing ratio sequence set, wherein the preset standard opening-closing ratio sequence set is used for storing the standard opening-closing ratio sequence calculated based on the preset standard video and the new standard opening-closing ratio sequence.
In an exemplary embodiment, the video recognition apparatus 1000 further comprises:
the second acquisition unit is configured to acquire a video to be detected comprising a plurality of initial video frames;
and the recovery unit is configured to execute pixel recovery on each initial video frame according to a preset pixel recovery network to obtain a target video frame.
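A minimal sketch of the recovery unit's frame-by-frame operation; the patent leaves the pixel recovery network unspecified, so `recovery_net` below is assumed to be any callable that maps a degraded frame to a restored one:

```python
def recover_frames(initial_frames, recovery_net):
    """Apply a preset pixel recovery network to each initial video frame
    to obtain the target video frames. `recovery_net` stands in for the
    (unspecified) network; any frame-to-frame callable works here."""
    return [recovery_net(frame) for frame in initial_frames]
```

The restored target frames, rather than the raw initial frames, are then fed to the key point detection step, so low-quality uploads do not degrade the landmark coordinates.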
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 11 is a block diagram illustrating an electronic device 1100 for video recognition, according to an example embodiment. For example, the electronic device 1100 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and so forth.
Referring to fig. 11, electronic device 1100 may include one or more of the following components: processing component 1102, memory 1104, power component 1106, multimedia component 1108, audio component 1110, input/output (I/O) interface 1112, sensor component 1114, and communications component 1116.
The processing component 1102 generally controls the overall operation of the electronic device 1100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1102 may include one or more processors 1120 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 1102 may include one or more modules that facilitate interaction between the processing component 1102 and other components. For example, the processing component 1102 may include a multimedia module to facilitate interaction between the multimedia component 1108 and the processing component 1102.
The memory 1104 is configured to store various types of data to support operation at the electronic device 1100. Examples of such data include instructions for any application or method operating on the electronic device 1100, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1104 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, an optical disk, or graphene memory.
The power component 1106 provides power to the various components of the electronic device 1100. The power components 1106 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 1100.
The multimedia component 1108 includes a screen that provides an output interface between the electronic device 1100 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1108 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 1100 is in an operating mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focus and optical zoom capability.
The audio component 1110 is configured to output and/or input audio signals. For example, the audio component 1110 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 1100 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1104 or transmitted via the communication component 1116. In some embodiments, audio component 1110 further includes a speaker for outputting audio signals.
The I/O interface 1112 provides an interface between the processing component 1102 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1114 includes one or more sensors for providing various aspects of state assessment for the electronic device 1100. For example, the sensor assembly 1114 can detect an open/closed state of the electronic device 1100 and the relative positioning of components, such as the display and keypad of the electronic device 1100. The sensor assembly 1114 can also detect a change in position of the electronic device 1100 or one of its components, the presence or absence of user contact with the electronic device 1100, the orientation or acceleration/deceleration of the electronic device 1100, and a change in temperature of the electronic device 1100. The sensor assembly 1114 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 1114 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1114 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1116 is configured to facilitate wired or wireless communication between the electronic device 1100 and other devices. The electronic device 1100 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 1116 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1116 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 1100 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a computer-readable storage medium including instructions, such as the memory 1104 including instructions executable by the processor 1120 of the electronic device 1100 to perform the above-described method, is also provided. For example, the computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, which includes instructions executable by the processor 1120 of the electronic device 1100 to perform the above-described method.
It should be noted that the descriptions of the above-mentioned apparatus, the electronic device, the computer-readable storage medium, the computer program product, and the like according to the method embodiments may also include other embodiments, and specific implementations may refer to the descriptions of the related method embodiments, which are not described in detail herein.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A video recognition method, comprising:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of target video frames;
for each target video frame, determining a plurality of target key points of a target part in the target video frame through a preset key point detection algorithm;
determining the opening and closing ratio of a target part in the target video frame according to the coordinate information of each target key point;
constructing a target opening-closing ratio sequence corresponding to the video to be detected according to the opening-closing ratio of the target part in each target video frame;
and determining, in a case where the target opening-closing ratio sequence and a plurality of standard opening-closing ratio sequences satisfy a preset similarity condition, that the video to be detected is a video of a target interaction mode, wherein each standard opening-closing ratio sequence is calculated based on a preset standard video, and the preset standard video is a video of the target interaction mode.
2. The video identification method according to claim 1, wherein the determining a plurality of target key points of the target portion in the target video frame by a preset key point detection algorithm comprises:
determining a face region in the target video frame through a preset face region recognition algorithm, and recognizing a target part in the face region;
and determining a plurality of target key points of the target part through a preset key point identification algorithm.
3. The video identification method according to claim 2, wherein the determining a plurality of target keypoints for the target portion by a preset keypoint identification algorithm comprises:
determining a plurality of initial key points corresponding to the target part through a preset key point identification algorithm;
determining, among the plurality of initial key points corresponding to the target part, a middle position point of a first transverse boundary of the target part as a first target key point, a middle position point of a second transverse boundary of the target part as a second target key point, a first longitudinal boundary position point of the target part as a third target key point, and a second longitudinal boundary position point of the target part as a fourth target key point;
the determining the opening and closing ratio of the target part in the target video frame according to the coordinate information of each target key point comprises the following steps:
calculating a first distance between the first target key point and the second target key point according to the coordinate information of the first target key point and the coordinate information of the second target key point, and calculating a second distance between the third target key point and the fourth target key point according to the coordinate information of the third target key point and the coordinate information of the fourth target key point;
and calculating a ratio of the first distance to the second distance, and determining the ratio as the opening-closing ratio of the target part in the target video frame.
4. The video identification method according to any one of claims 1 to 3, wherein before determining that the video to be detected is a video in a target interaction mode when it is determined that the target opening-closing ratio sequence and a plurality of standard opening-closing ratio sequences satisfy a preset similarity condition, the method further comprises:
respectively calculating the similarity between the target opening-closing ratio sequence and each standard opening-closing ratio sequence;
and under the condition that the similarity greater than or equal to a preset similarity threshold exists, determining that the target opening-closing ratio sequence and the plurality of standard opening-closing ratio sequences meet a preset similarity condition.
5. The video identification method according to any one of claims 1 to 3, further comprising, before the acquiring the video to be detected:
acquiring audio information of an initial video and determining the audio information of the preset standard video;
and determining the initial video as the video to be detected in a case where the audio information of the initial video and the audio information of the preset standard video satisfy a preset matching condition.
6. The video identification method according to any one of claims 1 to 3, wherein before determining that the video to be detected is a video in a target interaction mode when it is determined that the target opening-closing ratio sequence and a plurality of standard opening-closing ratio sequences satisfy a preset similarity condition, the method further comprises:
for each standard video frame in the preset standard video, determining a plurality of target key points of a target part in the standard video frame through a preset key point detection algorithm;
calculating the opening-closing ratio of a target part in the standard video frame according to the coordinate information of each target key point;
and constructing a standard opening-closing ratio sequence corresponding to the preset standard video according to the opening-closing ratio of the target part in each standard video frame.
7. The video identification method according to any one of claims 1 to 3, wherein after determining that the video to be detected is a video in a target interaction mode when it is determined that the target opening-closing ratio sequence and a plurality of standard opening-closing ratio sequences satisfy a preset similarity condition, the method further comprises:
and adding a target opening-closing ratio sequence corresponding to the video to be detected as a new standard opening-closing ratio sequence to a preset standard opening-closing ratio sequence set, wherein the preset standard opening-closing ratio sequence set is used for storing a standard opening-closing ratio sequence calculated based on the preset standard video and the new standard opening-closing ratio sequence.
8. The video identification method according to any one of claims 1 to 3, wherein the acquiring a video to be detected, the video to be detected including a plurality of target video frames, comprises:
acquiring a video to be detected comprising a plurality of initial video frames;
and for each initial video frame, performing pixel recovery on the initial video frame according to a preset pixel recovery network to obtain a target video frame.
9. A video recognition apparatus, comprising:
a first acquisition unit configured to acquire a video to be detected, the video to be detected comprising a plurality of target video frames;
a first determining unit configured to determine, for each target video frame, a plurality of target key points of a target portion in the target video frame through a preset key point detection algorithm;
a second determining unit configured to determine an opening-closing ratio of the target portion in the target video frame according to the coordinate information of each target key point;
a first construction unit configured to construct a target opening-closing ratio sequence corresponding to the video to be detected according to the opening-closing ratio of the target portion in each target video frame;
and a third determining unit configured to determine, in a case where the target opening-closing ratio sequence and a plurality of standard opening-closing ratio sequences satisfy a preset similarity condition, that the video to be detected is a video of a target interaction mode, wherein each standard opening-closing ratio sequence is calculated based on a preset standard video, and the preset standard video is a video of the target interaction mode.
10. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video recognition method of any of claims 1 to 8.
11. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video recognition method of any of claims 1-8.
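The decision step of the claimed method (claims 1 and 4) can be sketched end to end; the claims do not fix a similarity measure, so the cosine similarity over length-aligned sequences and the 0.9 threshold below are assumptions chosen purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Similarity between two opening-closing ratio sequences; the
    sequences are truncated to a common length (an assumed alignment)."""
    m = min(len(a), len(b))
    a, b = np.asarray(a[:m], float), np.asarray(b[:m], float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def is_target_interaction(target_seq, standard_seqs, threshold=0.9):
    """Claim 4: the preset similarity condition holds as soon as the
    similarity to any standard sequence reaches the preset threshold."""
    return any(cosine_similarity(target_seq, s) >= threshold
               for s in standard_seqs)
```

When `is_target_interaction` returns `True`, the video to be detected is determined to be a video of the target interaction mode, and (per claim 7) its target sequence may be added to the preset standard opening-closing ratio sequence set as a new standard sequence.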
CN202211266910.5A 2022-10-17 2022-10-17 Video identification method and device, electronic equipment and storage medium Pending CN115909128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211266910.5A CN115909128A (en) 2022-10-17 2022-10-17 Video identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115909128A true CN115909128A (en) 2023-04-04

Family

ID=86481513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211266910.5A Pending CN115909128A (en) 2022-10-17 2022-10-17 Video identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115909128A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination