CN112270231A - Method for determining target video attribute characteristics, storage medium and electronic equipment

Method for determining target video attribute characteristics, storage medium and electronic equipment

Info

Publication number
CN112270231A
CN112270231A
Authority
CN
China
Prior art keywords
target video
gesture
image
neural network
target
Prior art date
Legal status
Pending
Application number
CN202011120702.5A
Other languages
Chinese (zh)
Inventor
黄恺
周佳
包英泽
Current Assignee
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd filed Critical Beijing Dami Technology Co Ltd
Priority to CN202011120702.5A priority Critical patent/CN112270231A/en
Publication of CN112270231A publication Critical patent/CN112270231A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm

Abstract

The embodiment of the invention provides a method for determining the attribute feature of a target video, a storage medium, and an electronic device. A target video of online teaching is acquired; an image set is obtained from the target video and assembled into a detection set; the detection set is input into a gesture detection neural network to detect the gesture labels of the lecturing teacher during classroom teaching; whether the gestures the teacher uses in class are meaningful is determined from the gesture labels; the teaching quality of the online lesson is determined from the number of times meaningful gesture labels appear per unit time, and an evaluation result is given. The teacher's classroom performance can thus be identified, and negative lessons of poor teaching quality can be screened out in time, avoiding adverse effects.

Description

Method for determining target video attribute characteristics, storage medium and electronic equipment
Technical Field
The invention relates to the field of online education, in particular to a method for determining target video attribute characteristics.
Background
During online teaching, the classroom performance of the lecturing teacher needs to be monitored in order to improve teaching quality, i.e., lesson supervision is carried out. In the prior art, the teacher's classroom performance is monitored manually; as online education develops and the number of courses grows rapidly, manual lesson supervision can no longer meet demand. Negative teaching lessons cannot be discovered in time, which may cause adverse effects.
Disclosure of Invention
In view of this, in order to find out a negative teaching course in time and avoid causing adverse effects, embodiments of the present invention provide a method, a storage medium, and an electronic device for determining an attribute characteristic of a target video.
In a first aspect, an embodiment of the present invention provides a method for determining attribute characteristics of a target video, including:
acquiring an image set according to a target video, wherein the image set comprises a plurality of images, and the images contain a target person;
recognizing each image in the image set through a pre-trained gesture detection neural network to obtain a gesture label of a target person in each image;
determining the label characteristics of the target video according to the gesture labels of the images in the image set;
and determining the attribute characteristics of the target video according to the label characteristics, wherein the attribute characteristics are used for representing the performance of the target person in the target video.
Preferably, the gesture detection neural network is obtained by training as follows:
acquiring each training sample in a training sample set;
acquiring an initialized gesture detection neural network;
and adjusting the parameters of the gesture detection neural network according to the training samples so as to extract feature information through a preset loss function, and, in response to the feature information satisfying a verification condition, determining that the gesture detection neural network is trained.
Preferably, the gesture detection neural network comprises a YOLO (You Only Look Once) neural network.
Preferably, the feature information includes:
prediction box position information and confidence;
the prediction box position information comprises localization information and aspect-ratio information, and the aspect-ratio information is preset or determined by clustering pictures of the target object.
Preferably, the step of training the gesture detection neural network further comprises:
expanding a training sample set through image enhancement;
wherein the expanding the training sample set through image enhancement comprises:
mirroring a training sample image to obtain an augmented training sample image; and/or
rotating a training sample image to obtain an augmented training sample image.
Preferably, the predetermined loss function is a focal loss function.
Preferably, the method further comprises:
obtaining a confidence and an intersection over union (IOU) corresponding to a target image through the preset loss function;
and determining the label according to the confidence and the IOU.
Preferably, the label characteristics of the target video are obtained by counting the number of times meaningful gestures occur in the target video within a predetermined time;
the attribute characteristics are obtained through the label characteristics of the target video and are used for representing the performance of the target person in the target video; in response to the label characteristic being high-quality, the attribute characteristic is that the target person provides a high-quality teaching lesson in the target video; in response to the label characteristic being negative, the attribute characteristic is that the target person provides a negative teaching lesson in the target video.
In a second aspect, an embodiment of the present invention provides a computer-readable storage medium for storing computer program instructions, wherein the computer program instructions, when executed by a processor, implement any of the above methods.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory is configured to store one or more computer program instructions, where the one or more computer program instructions are executed by the processor to implement any one of the methods described above.
The embodiment of the invention provides a method for determining the attribute feature of a target video, a storage medium, and an electronic device. A target video of online teaching is acquired; an image set is obtained from the target video and assembled into a detection set; the detection set is input into a gesture detection neural network to detect the gesture labels of the lecturing teacher during classroom teaching; whether the gestures the teacher uses in class are meaningful is determined from the gesture labels; the teaching quality of the online lesson is determined from the number of times meaningful gesture labels appear per unit time, and an evaluation result is given. The teacher's classroom performance can thus be identified, and negative lessons of poor teaching quality can be screened out in time, avoiding adverse effects.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a method of determining target video attribute characteristics according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a recognition gesture tag of an embodiment of the present invention;
FIG. 3 is a schematic diagram of a tag feature of a target video of an embodiment of the invention;
FIG. 4 is a schematic diagram of a tag feature of a target video of an embodiment of the invention;
FIG. 5 is a flow diagram of training a gesture detection neural network according to an embodiment of the present invention;
FIG. 6 is a diagram of a prediction box of an embodiment of the present invention;
FIG. 7 is a schematic diagram of an online education lesson-supervising system according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an electronic device of an embodiment of the invention.
Detailed Description
The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
At present, the teaching quality of the lecturing teacher in online education courses is mainly supervised manually. High-quality teaching lessons benefit students and bring a better reputation to the platform; negative teaching lessons adversely affect students, harm the platform's reputation, and may even cause adverse social effects.
As online education develops, the number of courses grows sharply, and manual lesson supervision can no longer meet demand. If negative teaching lessons are not discovered in time, adverse effects may result.
Excellent online education lessons often share a characteristic: the lecturing teacher uses many meaningful gestures. In contrast, in negative online education lessons, the lecturing teacher uses significantly fewer meaningful gestures.
Specifically, the meaningful gestures used by the lecturing teacher in a teaching class include: instruction gestures, interactive gestures, teaching gestures, and the like.
Subdividing further, the instruction gestures include: one hand open at the ear to indicate listening, one hand above the brow to indicate looking, and other similar instruction gestures.
The interactive gestures include: spreading the five fingers to indicate "well done", clapping both hands to encourage the student, and other similar interactive gestures.
The teaching gestures include: the teacher holding a teaching aid, holding a small whiteboard, and other similar teaching gestures.
Specifically, refer to Table 1.
TABLE 1 Meaningful gestures (the table is reproduced as an image in the original publication)
The teaching lesson of the lecturing teacher is recorded as the target video, the gesture labels of the lecturing teacher are recognized by the gesture detection neural network, the meaningful gestures used by the teacher in class are identified from the gesture labels, and the tag feature of the target video is determined by counting the number of times meaningful gestures occur within a predetermined time period.
The embodiment of the invention divides the tag features of the target video into high-quality and negative.
In the target video, if meaningful gestures occur 50-55 times every 3 minutes, the tag feature of the target video is defined as high-quality. In such a target video, the lecturing teacher interacts with the students many times, the students receive much encouragement from the teacher, and the teacher gives clear teaching instructions through gestures.
In the other case, if meaningful gestures occur only 5-10 times every 3 minutes in the target video, the tag feature of the target video is defined as negative. In such a target video, the lecturing teacher interacts little with the students, the students receive little encouragement, and the teacher gives no clear teaching instructions through gestures.
Next, the attribute feature of the target video is determined from the tag feature of the target video; the attribute feature can represent the performance of the lecturing teacher in the classroom.
If the tag feature is high-quality, the attribute feature of the target video is: the lecturing teacher provides a high-quality teaching lesson.
If the tag feature is negative, the attribute feature of the target video is: the lecturing teacher provides a negative teaching lesson (also commonly referred to as a gray lesson).
According to the embodiment of the invention, negative teaching lessons provided by the lecturing teacher can be discovered in time, so that adverse effects on students can be reduced.
Fig. 1 is a flowchart of a method for determining attribute characteristics of a target video according to an embodiment of the present invention.
Referring to fig. 1, a method for determining attribute characteristics of a target video according to an embodiment of the present invention includes the following steps.
Step 100, acquiring an image set according to a target video, wherein the image set comprises a plurality of images, and the images contain a target person.
In the embodiment of the invention, the whole process of the lecturing teacher teaching students is recorded as the target video.
The target video is converted into an image set by a video conversion tool; the image set comprises a plurality of images, and the images can show the performance of the lecturing teacher in the classroom. In an alternative implementation, the target video is converted into the image set using FFmpeg, an open-source multimedia processing tool. FFmpeg is a set of open-source computer programs that can record, convert, and stream digital audio and video, providing a complete solution for recording, converting, and streaming.
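As a minimal, non-limiting sketch: the disclosure names FFmpeg but gives no command line, so the sampling rate, file names, and output layout below are illustrative assumptions.
```python
import subprocess
from pathlib import Path

def video_to_image_set(video_path: str, out_dir: str, fps: float = 1.0) -> list:
    """Sample frames from the target video at `fps` frames per second
    using the ffmpeg command-line tool, returning the frame paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path,                    # input classroom video
         "-vf", f"fps={fps}",                           # sample at a fixed rate
         str(Path(out_dir) / "frame_%06d.jpg")],        # numbered output frames
        check=True,
    )
    return sorted(str(p) for p in Path(out_dir).glob("frame_*.jpg"))
```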
Each image in the image set can show the expressions, body movements, and gestures of the lecturing teacher in class. Taking online education as an example, the teacher's expressions, body movements, and gestures in the classroom reflect how attractive the lesson is to the students. Corresponding targets can be recognized by different types of neural networks; the embodiment of the present invention is described with reference to a preferred embodiment in which the gestures of the lecturing teacher are recognized.
In order to identify the gestures of the lecturing teacher, the embodiment of the invention recognizes the gesture label of the lecturing teacher in each image through the gesture detection neural network.
Step 200, recognizing each image in the image set through a pre-trained gesture detection neural network to obtain a gesture label of a target person in each image.
In the online teaching lesson, the lecturing teacher is taken as the target person. The image set containing the teacher's classroom performance is taken as the detection set and input into the gesture detection neural network to obtain the gesture label of the lecturing teacher in each image.
Referring to fig. 2 in particular, fig. 2 is a schematic diagram of a gesture recognition tag according to an embodiment of the present invention.
An image in which the lecturing teacher places one hand open at the ear is input into the gesture detection neural network.
The image is recognized by the gesture detection neural network, which identifies the teacher's gesture label in the image as "one hand open at the ear, indicating listening".
Next, the gesture label is extracted, thereby obtaining the gesture of the lecturing teacher: in this image, the teacher's gesture is "listen".
That is, if the teacher in the image uses a meaningful gesture, the gesture detection neural network outputs a gesture label; if the teacher in the image does not use a meaningful gesture, the gesture detection neural network outputs no gesture label.
In particular, meaningful gestures can represent the teacher's interaction with students, the teacher's encouragement of students, and the teacher giving explicit teaching instructions. Meaningful gestures may be further subdivided into instruction gestures, interactive gestures, teaching gestures, and the like.
In order to determine the teaching quality of the lecturing teacher in the target video, the tag feature of the target video needs to be determined according to how often meaningful gestures appear in the target video.
Step 300, determining the label characteristics of the target video according to the gesture labels of the images in the image set.
With particular reference to fig. 3 and 4, fig. 3 is a schematic diagram of a tag feature of a target video of an embodiment of the present invention; fig. 4 is a schematic diagram of a tag feature of a target video according to an embodiment of the present invention.
The abscissa of fig. 3 and fig. 4 represents the number of times meaningful gestures occur, in counts; the ordinate represents the time interval, in minutes.
As can be seen from fig. 3, if the meaningful gestures in the target video occur 50-55 times every 3 minutes, the lecturing teacher in such a target video interacts with the students many times, the students receive much encouragement from the teacher, and the teacher gives clear teaching instructions through gestures. Therefore, the tag feature of a target video matching these characteristics is defined as high-quality.
As can be seen from fig. 4, if meaningful gestures occur only 5-10 times every 3 minutes in the target video, the lecturing teacher interacts little with the students, the students receive little encouragement, and the teacher gives no clear teaching instructions through gestures. The tag feature of a target video matching these characteristics is defined as negative.
That is, the tag feature of the target video can be obtained by counting the number of times meaningful gestures occur in the target video within a predetermined time.
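The following sketch illustrates this counting step under stated assumptions: detections arrive as timestamps in seconds, the window length is 3 minutes, and the cut-offs follow the 50-55 and 5-10 per-3-minutes figures above. How counts between those two bands are handled is not specified in the disclosure, so the "undetermined" branch is an assumption.
```python
def tag_feature(gesture_times_s, video_length_s, window_s=180,
                good_min=50, negative_max=10):
    """Derive the tag feature of the target video from the timestamps
    (seconds) at which meaningful gesture labels were detected."""
    n_windows = max(1, int(video_length_s // window_s))
    counts = [0] * n_windows
    for t in gesture_times_s:
        idx = min(int(t // window_s), n_windows - 1)  # 3-minute window index
        counts[idx] += 1
    avg = sum(counts) / n_windows                      # mean count per window
    if avg >= good_min:
        return "high-quality"
    if avg <= negative_max:
        return "negative"
    return "undetermined"                              # assumed middle band
```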
In order to determine the classroom performance of the lecturing teacher in the target video, the attribute feature of the target video needs to be determined next.
Step 400, determining the attribute feature of the target video according to the tag feature, wherein the attribute feature is used for representing the performance of the target person in the target video.
The embodiment of the invention characterizes the classroom performance of the lecturing teacher through the attribute feature.
The attribute feature is obtained from the tag feature of the target video.
Specifically, the tag feature of the target video is obtained; if the tag feature is high-quality, the teacher's classroom performance is defined as the lecturing teacher providing a high-quality teaching lesson. That is, the attribute feature is: the lecturing teacher provides a high-quality teaching lesson.
Alternatively, if the tag feature is negative, the teacher's classroom performance is defined as the lecturing teacher providing a negative teaching lesson (also commonly referred to as a gray lesson). That is, the attribute feature is: the lecturing teacher provides a negative teaching lesson.
By determining the attribute feature of the target video in time, negative teaching lessons can be discovered in time, thereby reducing adverse effects on students.
Before performing the method for determining the target video attribute feature, the embodiment of the invention trains the gesture detection neural network in advance.
Training the gesture detection neural network requires preparing a large number of images, which are usually divided in a certain proportion into training sample images and verification images. For example, 70%-80% of the images are partitioned into the training sample set, and the remaining 20%-30% into the validation set.
The more sufficient the number of training sample images in the training sample set, the better the training effect and the more accurately the gestures of the target person are recognized.
However, some gestures have fewer training sample images; to achieve a better training effect, the training sample set needs to be expanded by image enhancement.
Preferably, the embodiment of the present invention expands the training samples by mirroring the training sample image and/or by rotating it. Specifically, several images can be generated from one image by operations such as left-right mirroring and top-bottom mirroring, or by rotating the training sample image through various angles.
Several images can also be generated from one image by processing methods such as cropping, translation, interpolation, Gaussian noise, and contrast transformation, thereby expanding the training samples. A sketch of the mirroring and rotation operations follows.
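A minimal sketch of the mirroring and rotation expansion using the Pillow library; the particular rotation angles are illustrative assumptions, not values fixed by this disclosure.
```python
from PIL import Image

def augment(sample_path: str, angles=(90, 180, 270)) -> list:
    """Expand one training sample image into several augmented images
    by left-right / top-bottom mirroring and by rotation."""
    img = Image.open(sample_path)
    augmented = [
        img.transpose(Image.Transpose.FLIP_LEFT_RIGHT),  # left-right mirror
        img.transpose(Image.Transpose.FLIP_TOP_BOTTOM),  # top-bottom mirror
    ]
    augmented += [img.rotate(a, expand=True) for a in angles]  # rotations
    return augmented
```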
In order to obtain a trained gesture detection neural network, the gesture detection neural network of the embodiment of the present invention is obtained through the following steps.
FIG. 5 is a flow chart of training a gesture detection neural network according to an embodiment of the present invention.
Referring to fig. 5, step 500, each training sample in the set of training samples is obtained.
The training samples are images that include gestures of the target person. The large number of training samples constitutes a set of training samples.
The embodiment of the invention forms a training sample set by collecting a large number of images comprising the gestures of the target person.
Step 600, obtaining an initialized gesture detection neural network.
The initial model of the gesture detection neural network of the embodiment of the present invention is the YOLO V3 neural network (YOLO: You Only Look Once; V3 denotes the 3rd version).
The YOLO V3 neural network can solve both the localization (also commonly formulated as regression) and the classification problem for images.
That is, upon receiving a given image, the YOLO V3 neural network can identify the location of the target person's gesture in the image and give a classification of the gesture.
Specifically, the YOLO V3 neural network determines prediction boxes (bounding boxes) for the image; each prediction box carries feature information from which the location of the target person's gesture in the image and the classification of the gesture can be obtained.
Step 700, adjusting the parameters of the gesture detection neural network according to the training samples so as to extract feature information through a preset loss function, and, in response to the feature information satisfying a verification condition, determining that the gesture detection neural network is trained.
The embodiment of the invention uses the focal loss function (Focal Loss) in the process of training the gesture detection network. The focal loss function is a modification of the standard cross-entropy loss function.
In the sample set of the embodiment of the invention, some gestures have sufficient training sample images while others have fewer. If a standard cross-entropy loss function were used, the accuracy of classifying an image would not be high. The focal loss function reduces the weight of classes whose training samples are abundant and easily classified, so a better classification effect is obtained and the resulting gesture labels are more accurate. That is, with the focal loss function the same recognition effect is obtained while the classification accuracy is improved. A sketch of the focal loss follows.
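For reference, the focal loss has the form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The sketch below implements the binary case with NumPy; the alpha and gamma defaults are the values commonly used with focal loss, not values fixed by this disclosure.
```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p : predicted probability of the positive class; y : 0/1 label.
    gamma > 0 down-weights well-classified (easy, abundant) samples so
    that scarce gesture classes contribute more to training; gamma = 0
    recovers an alpha-weighted cross-entropy."""
    p = np.clip(p, eps, 1.0 - eps)               # numerical stability
    p_t = np.where(y == 1, p, 1.0 - p)           # probability of true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```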
With the network trained under the focal loss, prediction boxes of the target person's gesture can be identified at multiple locations in a given image. For example, given an image of a teacher with one hand placed at the ear, prediction boxes of the gesture are identified at several positions in the image.
Each prediction box carries feature information. The feature information includes: prediction box position information and confidence.
The prediction box position information comprises localization information and aspect-ratio information (anchor boxes). Since the gestures of the target person have certain shape characteristics, appropriate aspect-ratio information helps the prediction boxes capture the gestures. The aspect-ratio information is either preset or determined by clustering the image set of the target object: specifically, it may be set from empirical values, or obtained by clustering the bounding boxes in the target object image set, as in the sketch below.
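A sketch of the clustering alternative, following the k-means-with-IoU-distance procedure commonly used to derive YOLO anchor priors; the number of anchors and the iteration count are illustrative assumptions.
```python
import numpy as np

def iou_wh(wh, anchors):
    """Width/height-only IoU between ground-truth boxes and anchors."""
    inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0]) *
             np.minimum(wh[:, None, 1], anchors[None, :, 1]))
    union = (wh[:, None, 0] * wh[:, None, 1] +
             anchors[None, :, 0] * anchors[None, :, 1] - inter)
    return inter / union

def cluster_anchors(box_wh, k=9, iters=100, seed=0):
    """k-means on (width, height) pairs with d = 1 - IoU as the distance,
    yielding anchor aspect-ratio priors for the prediction boxes."""
    box_wh = np.asarray(box_wh, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = box_wh[rng.choice(len(box_wh), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(box_wh, anchors), axis=1)   # nearest anchor
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = box_wh[assign == j].mean(axis=0)  # recentre
    return anchors
```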
FIG. 6 is a diagram of a prediction box according to an embodiment of the present invention.
Referring to fig. 6, the prediction boxes are illustrated taking as an example a training sample image in which the lecturing teacher places one hand at the ear. The prediction boxes obtained are: a first prediction box 1, a second prediction box 2, and a third prediction box 3.
Those skilled in the art will understand that the number of prediction boxes is related to the computer hardware configuration and the desired response speed of the gesture detection neural network. In theory the number of prediction boxes is not limited; the embodiment of the present invention is illustrated with only 3 prediction boxes, and the number used in practice can be chosen according to the usage scenario.
Next, the final prediction box and classification of the target person's gesture must be determined; specifically, they can be determined based on the confidence. In an alternative implementation, the final prediction box and classification of the target person's gesture are determined by Non-Maximum Suppression (NMS), with the specific steps as follows (a sketch follows the steps below):
step S1: sorting the confidence degrees of all the prediction frames in a descending order;
step S2: selecting a prediction frame P with the highest confidence coefficient, and calculating the Intersection ratio (IOU) of the P and other prediction frames;
step S3, obtaining each IOU, deleting the prediction frames with the IOU larger than the preset value, and only keeping the prediction frames with the IOU smaller than the preset value;
step S4, return to S1 until only one predicted box remains as the final predicted box for the target person gesture, and the classification of this predicted box is taken as the classification of the target person gesture.
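A sketch of steps S1-S4, assuming prediction boxes are given as (x1, y1, x2, y2, confidence) tuples; the IOU threshold is an illustrative assumption.
```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, iou_threshold=0.5):
    """Steps S1-S4: sort by confidence, keep the top box P, delete boxes
    whose IOU with P exceeds the threshold, repeat on the remainder.
    With a single gesture per image this typically leaves one final box,
    whose class is taken as the gesture classification."""
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)        # S1
    kept = []
    while boxes:
        p = boxes.pop(0)                                           # S2
        kept.append(p)
        boxes = [b for b in boxes if iou(p, b) <= iou_threshold]   # S3, S4
    return kept
```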
The position information and confidence of the final prediction box are obtained as the output result on the training set. The output result on the training set is compared with the data of the verification set, and when the error between them satisfies a preset condition, the gesture detection neural network is determined to be trained.
For some training samples with larger intra-class variation, for example samples such as the "meaningful other gestures" in Table 1, the step of acquiring the final prediction box for such samples is changed to: screening the label of the target object according to whether its confidence and IOU reach preset values. The method specifically comprises the following steps (a sketch follows these steps):
step S1: sorting the confidence degrees of all the prediction frames in a descending order;
step S2: screening out a prediction box with confidence coefficient larger than 0.5 and IOU larger than 0.3;
step S3: a gesture determined to be meaningful.
Fig. 7 is a schematic view of an online education lesson-supervising system according to an embodiment of the present invention.
The on-line education lesson-supervising system 70 of the embodiment of the present invention is operated on the server side.
Experience with manual lesson supervision shows that in an online education classroom, if the lecturing teacher uses rich meaningful gestures, the students' experience is better; in contrast, in some classes the lecturing teacher uses few meaningful gestures, in which case the student experience is often poor, and such classes are defined as gray lessons. Recognizing lessons with few meaningful gestures (i.e., gray lessons) avoids adverse effects on students.
To this end, embodiments of the present invention contemplate identifying meaningful gestures through a gesture detection neural network. Specifically, the initial model of the gesture detection neural network of the embodiment of the present invention is the YOLO V3 neural network.
The initial model YOLO V3 is trained first. A training sample set and a validation set are prepared before training; both consist of a large number of pictures containing the lecturing teacher's gestures.
The training samples are input into the initial model YOLO V3, and the samples are recognized by the initial model. The recognition result is compared with the data in the verification set, and if the recognition requirement is not met, the parameters of YOLO V3 are adjusted. When the recognition requirement is met, the gesture detection neural network is obtained.
Next, the classroom video file is recognized by the online education lesson-supervising system 70, which gives an evaluation result for the classroom video. The online education lesson-supervising system 70 includes: an obtaining module 71, a recognition module 72, an extraction module 73, and an evaluation module 74.
The obtaining module 71 obtains a classroom video file and obtains an image set by sampling and framing it with FFmpeg.
Specifically, the classroom video file may be acquired by the server.
When the lecturing teacher is teaching students, images of the teacher in the classroom are transmitted back to the server through the teacher's terminal. The server records the images of the lecturing teacher in the classroom as a classroom video file.
Specifically, the server samples and frames the classroom video file through FFmpeg and converts it into an image set, wherein the images in the image set include the teacher's expressions, body movements, and gestures in class.
The recognition module 72 recognizes meaningful gestures in the image set.
The gesture detection neural network identifies the gesture label in each image: a picture is input into the gesture detection neural network; if the picture contains a meaningful gesture, the network gives a gesture label; if not, the network gives no gesture label.
The extraction module 73 extracts the gesture tags and summarizes the gesture data.
For each picture for which the gesture detection neural network gives a gesture label, the recognized gesture label is extracted, and the accumulated count is incremented by 1 for each gesture label extracted.
The gesture data is then aggregated, summarizing the accumulated gesture-label counts for every 3 minutes.
An online education class usually lasts more than 20 minutes. In a classroom where students have a good experience, meaningful gestures typically occur 50-55 times every three minutes, with a minimum of 7 and a maximum of 103 times per three minutes.
In a classroom where students have a poor experience, meaningful gestures typically occur only 5-10 times every three minutes, with a minimum of 0 and a maximum of 104 times per three minutes.
The evaluation module 74 evaluates the teaching quality and gives an evaluation result.
Lessons in which the teacher performs negatively can be identified from the number of times meaningful gestures occur per unit time, and such lessons are evaluated as gray lessons. In particular, thresholds may be set to identify classes in which the teacher performs negatively and to evaluate them as gray lessons.
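A sketch of such a threshold rule, assuming per-window counts from the extraction module; the threshold value reuses the 5-10-per-3-minutes figure above and is otherwise an assumption.
```python
def evaluate_lesson(counts_per_window, negative_max=10):
    """Evaluation module sketch: given the number of meaningful gesture
    labels per 3-minute window, flag the lesson as a gray lesson when the
    average count falls at or below the assumed negative threshold."""
    avg = sum(counts_per_window) / max(1, len(counts_per_window))
    return "gray lesson" if avg <= negative_max else "normal lesson"
```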
Gray lessons, i.e., lessons that may adversely affect students, are thus discovered in time from the evaluation result.
The embodiment of the invention provides a method for determining the attribute feature of a target video: a target video of online teaching is acquired; an image set is obtained from the target video and assembled into a detection set, which is input into the gesture detection neural network; the gesture labels of the lecturing teacher in classroom teaching are detected; whether the gestures the teacher uses in class are meaningful is determined from the gesture labels; the teaching quality of the online lesson is determined from the number of times meaningful gesture labels appear per unit time, and an evaluation result is given. The teacher's classroom performance can thus be identified, and negative lessons of poor teaching quality can be screened out in time, avoiding adverse effects.
Fig. 8 is a schematic diagram of an electronic device of an embodiment of the invention.
The electronic device 8 as shown in fig. 8 comprises a general hardware structure comprising at least a processor 81 and a memory 82. The processor 81 and the memory 82 are connected by a bus 83. The memory 82 is adapted to store instructions or programs executable by the processor 81. Processor 81 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 81 implements the processing of data and the control of other devices by executing instructions stored by the memory 82 to perform the method flows of embodiments of the present invention as described above. The bus 83 connects the above components together, and also connects the above components to a display controller 84 and a display device and an input/output (I/O) device 85. Input/output (I/O) devices 85 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, and other devices known in the art. Typically, the input/output devices 85 are coupled to the system through an input/output (I/O) controller 86.
As will be appreciated by one skilled in the art, embodiments of the present application may provide a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the invention relates to a non-transitory readable storage medium storing a computer-readable program for causing a computer to perform an embodiment of some or all of the above methods.
That is, as will be understood by those skilled in the art, all or part of the steps of the methods in the embodiments described above may be accomplished by instructing the relevant hardware through a program; the program is stored in a readable storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned readable storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for determining attribute characteristics of a target video, the method comprising:
acquiring an image set according to a target video, wherein the image set comprises a plurality of images, and the images contain a target person;
recognizing each image in the image set through a pre-trained gesture detection neural network to obtain a gesture label of a target person in each image;
determining the label characteristics of the target video according to the gesture labels of the images in the image set;
and determining the attribute characteristics of the target video according to the label characteristics, wherein the attribute characteristics are used for representing the performance of the target person in the target video.
2. The method of claim 1, wherein the gesture detection neural network is obtained by training:
acquiring each training sample in a training sample set;
acquiring an initialized gesture detection neural network;
and adjusting the parameters of the gesture detection neural network according to the training samples so as to extract feature information through a preset loss function, and, in response to the feature information satisfying a verification condition, determining that the gesture detection neural network is trained.
3. The method of claim 2, wherein the gesture detection neural network comprises a YOLO neural network.
4. The method of claim 3, wherein the feature information comprises:
prediction box position information and confidence;
the prediction box position information comprises localization information and aspect-ratio information, and the aspect-ratio information is preset or determined by clustering pictures of the target object.
5. The method of claim 4, wherein the step of training the gesture detection neural network further comprises:
expanding a training sample set through image enhancement;
wherein the expanding the training sample set through image enhancement comprises:
mirroring a training sample image to obtain an augmented training sample image; and/or
rotating a training sample image to obtain an augmented training sample image.
6. The method of claim 2, wherein the predetermined loss function is a focal loss function.
7. The method of claim 2, further comprising:
obtaining a confidence and an intersection over union (IOU) corresponding to a target image through the preset loss function;
and determining the label according to the confidence and the IOU.
8. The method according to claim 1, wherein the label characteristics of the target video are obtained by counting the number of times meaningful gestures occur in the target video within a predetermined time;
the attribute characteristics are obtained through the label characteristics of the target video and are used for representing the performance of the target person in the target video; in response to the label characteristic being high-quality, the attribute characteristic is that the target person provides a high-quality teaching lesson in the target video; in response to the label characteristic being negative, the attribute characteristic is that the target person provides a negative teaching lesson in the target video.
9. A computer readable storage medium storing computer program instructions, which when executed by a processor implement the method of any one of claims 1-8.
10. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-8.
CN202011120702.5A 2020-10-19 2020-10-19 Method for determining target video attribute characteristics, storage medium and electronic equipment Pending CN112270231A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011120702.5A CN112270231A (en) 2020-10-19 2020-10-19 Method for determining target video attribute characteristics, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112270231A (en) 2021-01-26

Family

ID=74338385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011120702.5A Pending CN112270231A (en) 2020-10-19 2020-10-19 Method for determining target video attribute characteristics, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112270231A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination