CN111708913A - Label generation method and device and computer readable storage medium - Google Patents

Label generation method and device and computer readable storage medium

Info

Publication number
CN111708913A
CN111708913A (Application No. CN202010835038.6A)
Authority
CN
China
Prior art keywords: model, video frame, classified, target, full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010835038.6A
Other languages
Chinese (zh)
Other versions
CN111708913B (en)
Inventor
周驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010835038.6A priority Critical patent/CN111708913B/en
Publication of CN111708913A publication Critical patent/CN111708913A/en
Application granted granted Critical
Publication of CN111708913B publication Critical patent/CN111708913B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/7867 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/75 - Clustering; Classification

Abstract

The application provides a label generation method, a label generation device, and a computer-readable storage medium. The method comprises the following steps: when a video to be generated is received, extracting at least one video frame to be classified from the video to be generated; acquiring a full-scale classification model, and determining, for each video frame to be classified in the at least one video frame to be classified, a plurality of model scheduling indexes corresponding to the full-scale classification model, wherein each model scheduling index in the plurality of model scheduling indexes represents the importance degree of one classification model in the full-scale classification model to the video frame to be classified; determining, from the full-scale classification model and based on the plurality of model scheduling indexes, a target model corresponding to each video frame to be classified; and identifying each video frame to be classified by using the target model corresponding to that video frame, so as to obtain the identification label of each video frame to be classified. The method and the device improve label generation efficiency through big data analysis.

Description

Label generation method and device and computer readable storage medium
Technical Field
The present application relates to artificial intelligence technology, and in particular, to a method and apparatus for generating a tag, and a computer-readable storage medium.
Background
A video usually contains rich information, and some video segments may contain important information that the user pays attention to. If corresponding tags can be generated for these video segments, the user can quickly and accurately locate the video segments containing the information of interest.
In the related art, video frames are generally identified by using a plurality of different classification models, so as to generate labels for video segments. However, as the number of models increases, the time taken to identify the video frames also increases, making tag generation less efficient.
Disclosure of Invention
The embodiments of the application provide a label generation method, a label generation device, and a computer-readable storage medium, which can improve the efficiency of label generation.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a label generation method, which comprises the following steps:
when a video to be generated is received, extracting at least one video frame to be classified from the video to be generated; the at least one video frame to be classified is a key frame in the video to be generated;
acquiring a full-scale classification model, and determining a plurality of model scheduling indexes corresponding to the full-scale classification model aiming at each video frame to be classified in the at least one video frame to be classified; wherein each model scheduling index of the plurality of model scheduling indexes characterizes the importance degree of each classification model of the full-scale classification model to each video frame to be classified of the at least one video frame to be classified; the full scale classification model comprises at least one classification model;
determining a target model corresponding to each video frame to be classified from the full-scale classification model based on the model scheduling indexes;
and identifying each video frame to be classified by using the target model corresponding to each video frame to be classified to obtain an identification label of each video frame to be classified.
An embodiment of the present application provides a tag generation apparatus, including:
the extraction module is configured to extract at least one video frame to be classified from a video to be generated when the video to be generated is received; the at least one video frame to be classified is a key frame in the video to be generated;
the prediction module is used for acquiring a full-scale classification model and determining a plurality of model scheduling indexes corresponding to the full-scale classification model aiming at each video frame to be classified in the at least one video frame to be classified; wherein each model scheduling index of the plurality of model scheduling indexes characterizes the importance degree of each classification model of the full-scale classification model to each video frame to be classified of the at least one video frame to be classified; the full scale classification model comprises at least one classification model;
the model determining module is used for determining a target model corresponding to each video frame to be classified from the full-scale classification model based on the plurality of model scheduling indexes;
and the identification module is used for identifying each video frame to be classified by using the target model corresponding to each video frame to be classified to obtain an identification label of each video frame to be classified.
An embodiment of the present application provides a tag generation device, including:
a memory to store executable tag generation instructions;
and the processor is used for realizing the label generation method provided by the embodiment of the application when executing the executable label generation instruction stored in the memory.
The embodiment of the present application provides a computer-readable storage medium, which stores executable tag generation instructions, and is used for causing a processor to execute the executable tag generation instructions, so as to implement the tag generation method provided by the embodiment of the present application.
The embodiment of the application has the following beneficial effects:
in the embodiment of the application, the label generation device can extract at least one video frame to be classified from a video to be generated, then predict the importance degree of each classification model for each video frame to be classified in the at least one video frame to be classified, namely determine the model scheduling index, and determine a proper target model for each video frame to be classified according to the model scheduling index, so that each video frame to be classified is classified and identified only by using the target model which is important for each video frame to be classified, the number of models for classifying and identifying each video frame to be classified is reduced, and the label generation efficiency is improved.
Drawings
FIG. 1 is a diagram of exemplary labeling of video frames by different classification models;
fig. 2 is an alternative architecture diagram of the tag generation system 100 provided by the embodiment of the present application;
fig. 3 is a schematic structural diagram of the label generation apparatus 200 in fig. 2 provided in an embodiment of the present application;
fig. 4 is a first alternative flow chart of a tag generation method provided in the embodiment of the present application;
fig. 5 is another alternative flowchart of the tag generation method provided in the embodiment of the present application;
fig. 6 is an exemplary diagram of performing recognition by using a target model corresponding to each video frame to be classified according to an embodiment of the present application;
FIG. 7 is a diagram of an example of a process for performing classification recognition within a limited time according to an embodiment of the present application;
fig. 8 is a third alternative flowchart of a tag generation method provided in the embodiment of the present application;
fig. 9 is a diagram illustrating a structure of a preset scheduling index prediction model according to an embodiment of the present application;
FIG. 10 is an exemplary diagram of a general network architecture provided by embodiments of the present application;
fig. 11 is an exemplary diagram of adding an advertisement to a key frame in an actual scene provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first/second/third" are only used to distinguish similar objects and do not denote a particular order; it is understood that, where permissible, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) The tag management platform can be used for performing annotation analysis on the video and generating corresponding tags for some playing pictures in the video, so that a user can process the playing pictures according to the generated tags, for example, the user can add advertisements in a form of "copy-as-you-go" on the playing pictures according to the tags, or the user can determine whether the playing pictures contain sensitive content or not according to the tags. The tag management platform can be carried on a server and receives videos sent by different users, so that different videos are processed.
2) The label generation means that the event described by the playing picture in the video is represented by short characters. For example, when a scene of makeup for girls is present in the video playback screen, a label of "makeup for beauty" may be generated.
3) And the classification model is used for classifying and identifying the playing pictures of the video and generating corresponding labels for the playing pictures of the video. The classification model is usually a trained model, and when generating a label for a playing picture of a video, the plurality of classification models can be used to respectively identify the playing picture of the video to obtain the label.
4) The model scheduling index is an index for representing the importance degree of different classification models to the playing picture of the video. Because the scenes of the played pictures in which different classification models are focused are different, for example, the classification model focused on recognizing human body actions and the classification model focused on recognizing objects are different, the recognition results obtained for the same played picture are different, even some classification models cannot obtain effective recognition results for some played pictures, so that the importance degrees of different classification models to the played pictures of the video are different. The model scheduling index measures the importance degree of different classification models for identifying the played pictures, so that the most appropriate classification model can be found for the played pictures of the video conveniently.
5) Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
6) Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specially studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures, so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
7) Cloud Computing (Cloud Computing) is a Computing model that distributes Computing tasks over a resource pool of large numbers of computers, enabling various application systems to obtain Computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user as being infinitely expandable and available at any time, available on demand, expandable at any time, and paid for on-demand.
8) Computer Vision technology (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as identification, tracking and measurement on targets, and performs further image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
A video contains rich information, and some video segments contain important information that the user pays attention to, so the user may want to perform various operations on those segments, for example, adding advertisements to them. However, a video is composed of a large number of video frames; to find, from the video, the video clips containing the important information of interest, the video frames need to be classified so that corresponding tags are generated for the video clips, allowing the user to quickly and accurately locate the video clips containing the information of interest.
In the related art, a plurality of different classification models are generally used to classify video frames, so as to generate corresponding labels for video segments. However, the number of models increases, so that the time consumed for identifying a single video frame also increases correspondingly, for example, the time consumed for performing face recognition, target detection, vehicle type detection, scene detection and the like on a single video frame is certainly more than the time consumed for performing face recognition only. Moreover, for a single video frame, many models have no tag output, and at this time, it is clearly inefficient to identify the video frame by using all classification models, i.e., the tag generation efficiency is low.
For example, fig. 1 shows an example of the labels generated for video frames by different classification models. In fig. 1, there are 5 classification models, namely a face recognition model 1-1, an object classification model 1-2, a behavior classification model 1-3, a scene classification model 1-4 and a pet dog classification model 1-5; there are 5 video frames to be classified, namely video frame 1-a, video frame 1-b, video frame 1-c, video frame 1-d and video frame 1-e. The label generated by the object classification model 1-2 for video frame 1-a is: dog (0.96); the label generated by the pet dog classification model 1-5 for video frame 1-a is: Akita (0.91); none of the remaining models outputs a label for video frame 1-a. The label generated by the object classification model 1-2 for video frame 1-b is: human (0.43); the label generated by the behavior classification model 1-3 for video frame 1-b is: sitting (0.87); none of the remaining models outputs a label for video frame 1-b. The label generated by the face recognition model 1-1 for video frame 1-c is: face (0.99); the label generated by the object classification model 1-2 for video frame 1-c is: human (0.96); the label generated by the behavior classification model 1-3 for video frame 1-c is: makeup (0.9); none of the remaining models outputs a label for video frame 1-c. The label generated by the scene classification model 1-4 for video frame 1-d is: mansion (0.89); none of the remaining models outputs a label for video frame 1-d. The label generated by the object classification model 1-2 for video frame 1-e is: bicycle (0.97); the label generated by the behavior classification model 1-3 for video frame 1-e is: cycling (0.92); none of the remaining models outputs a label for video frame 1-e. As can be seen from fig. 1, for the same single video frame, many models output no label, and running these models is wasted work, which reduces the efficiency of label generation.
Based on the above problem, a scheduling platform can be used to call a suitable classification model for the video frame, so that the video frame is classified only by using the suitable classification model, and the tag generation efficiency is improved. In the related art, there are some model scheduling manners, which can implement scheduling of models. For example, a multi-model scheduling simulation platform is used for realizing real-time switching of simulation models; realizing multi-model parallel scheduling by utilizing the node optimal path; and performing model scheduling by using reinforcement learning and the like.
However, the multi-model scheduling simulation platform only achieves real-time, fast switching among multiple models and does not select a suitable model for the image to be identified; the node optimal-path approach is too complex, which increases the time taken by model scheduling; and reinforcement learning does not take the information of the image itself into account, so it is difficult to determine a suitable model for different images.
Therefore, although some model scheduling methods exist in the related art, the effect of finding a suitable model for a playing picture of a video in a short time cannot be achieved, so that the efficiency of label generation is still low.
The embodiments of the application provide a label generation method, a label generation device, and a computer-readable storage medium, which can improve the efficiency of label generation. An exemplary application of the tag generation device provided in the embodiments of the present application is described below; the device provided in the embodiments may be implemented as various types of user terminals, and may also be implemented as a server. The server may be an independent physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (content delivery network), and big data and artificial intelligence platforms; the terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected by wired or wireless communication. Next, an exemplary application of the tag generation device will be explained.
Referring to fig. 2, fig. 2 is an alternative architecture diagram of the tag generation system 100 provided in this embodiment of the present application, in order to support a tag generation application, a terminal 400 is connected to a tag generation device 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal 400 is configured to send a video to be generated to the tag generation device 200 through the network 300. When the tag generation device 200 receives the video to be generated, at least one video frame to be classified is extracted from the video to be generated, wherein the at least one video frame to be classified is a key frame in the video to be generated. The tag generation device 200 obtains a full-scale classification model, and determines a plurality of model scheduling indexes corresponding to the full-scale classification model for each video frame to be classified in the at least one video frame to be classified, wherein each model scheduling index in the plurality of model scheduling indexes represents the importance degree of each classification model in the full-scale classification model to each video frame to be classified in the at least one video frame to be classified, and the full-scale classification model comprises at least one classification model. The tag generation device 200 determines, for each video frame to be classified and based on the plurality of model scheduling indexes, a target model corresponding to that video frame from the full-scale classification model. Next, the tag generation device 200 identifies each video frame to be classified by using the target model corresponding to that video frame, so as to obtain an identification label of each video frame to be classified. Finally, the tag generation device 200 sends the identification label of each video frame to be classified to the terminal 400, so that the terminal 400 displays the obtained identification labels and receives user operations for subsequent processing of the video to be generated.
Referring to fig. 3, fig. 3 is a schematic structural diagram of the label generating apparatus 200 in fig. 2 according to an embodiment of the present application, where the label generating apparatus 200 shown in fig. 3 includes: at least one processor 210, memory 250, at least one network interface 220, and a user interface 230. The various components in the label producing device 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 3.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 252 for communicating with other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: Bluetooth, Wireless Fidelity (Wi-Fi), Universal Serial Bus (USB), and the like;
a display module 253 to enable presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 231 (e.g., a display screen, speakers, etc.) associated with the user interface 230;
an input processing module 254 for detecting one or more user inputs or interactions from one of the one or more input devices 232 and translating the detected inputs or interactions.
In some embodiments, the tag generation apparatus provided in the embodiments of the present application may be implemented in software, and fig. 3 illustrates the tag generation apparatus 255 stored in the memory 250, which may be software in the form of programs and plug-ins, and includes the following software modules: an extraction module 2551, a prediction module 2552, a model determination module 2553, a recognition module 2554 and a model training module 2555, which are logical and thus can be arbitrarily combined or further split depending on the functionality implemented. The functions of the respective modules will be explained below.
In other embodiments, the tag generation apparatus provided in this embodiment may be implemented in hardware, and for example, the tag generation apparatus provided in this embodiment may be a processor in the form of a hardware decoding processor, which is programmed to execute the tag generation method provided in this embodiment, for example, the processor in the form of the hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
An exemplary embodiment of the present application provides a label generating apparatus, which includes:
a memory to store executable tag generation instructions;
and the processor is used for realizing the label generation method provided by the embodiment of the application when executing the executable label generation instruction stored in the memory.
The tag generation method provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the tag generation device provided by the embodiment of the present application.
Referring to fig. 4, fig. 4 is a first optional flowchart of a tag generation method provided in the embodiment of the present application, and will be described with reference to the steps shown in fig. 4.
S101, when a video to be generated is received, at least one video frame to be classified is extracted from the video to be generated.
The embodiment of the application is realized in the scene of generating the label for the video clip in the video. When the label generation device receives a video to be generated through a network, it is clear that the label generation needs to be carried out on the video to be generated at present, and at the moment, the label generation device carries out key frame extraction on the video to be generated to obtain at least one video key frame. The video key frames contain information capable of describing scenes of video clips, and the labels of the video clips can be obtained by identifying and classifying the video key frames, so that the label generation equipment can use the extracted video key frames as video frames to be classified. In other words, at least one video frame to be classified is a key frame in the video to be generated.
It is to be understood that the video to be generated refers to a video waiting for tag generation. The video to be generated may be a video such as a movie, a television play, or the like, may also be a video such as a live broadcast, or may also be a video of another type, which is not limited herein in this embodiment of the application.
It should be noted that the tag generation device may extract the video frames to be classified from the video to be generated by using a preset key frame extraction algorithm, where the preset key frame extraction algorithm may be set according to actual requirements, and the embodiments of the present application are not limited herein.
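For illustration only, a minimal Python sketch of such a key frame extraction step is given below. It assumes a simple frame-difference criterion implemented with OpenCV; the function name extract_key_frames and the difference threshold are assumptions introduced for the example and do not correspond to any specific algorithm of the embodiments.

```python
# Minimal sketch (assumption): key frame extraction by frame differencing.
# The embodiments leave the concrete "preset key frame extraction algorithm" open.
import cv2
import numpy as np

def extract_key_frames(video_path: str, diff_threshold: float = 30.0):
    """Keep a frame whenever its mean absolute difference from the previously
    kept frame exceeds diff_threshold, treating it as the start of a new scene."""
    capture = cv2.VideoCapture(video_path)
    key_frames = []
    previous_gray = None
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if previous_gray is None or np.mean(cv2.absdiff(gray, previous_gray)) > diff_threshold:
            key_frames.append(frame)
            previous_gray = gray
    capture.release()
    return key_frames
```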
S102, acquiring a full-scale classification model, and determining a plurality of model scheduling indexes corresponding to the full-scale classification model for each video frame to be classified in at least one video frame to be classified.
After the label generation device obtains the at least one video frame to be classified, it first obtains the full-scale classification model, and then determines, for the current video frame to be classified, the model scheduling index of each classification model in the full-scale classification model. Because the full-scale classification model comprises at least one classification model, the label generation device obtains a plurality of model scheduling indexes for the current video frame to be classified; each classification model has a model scheduling index corresponding to the current video frame to be classified, and that model scheduling index represents the importance degree of the classification model to the current video frame to be classified, that is, the model scheduling indexes correspond one to one to the classification models in the full-scale classification model. After determining the plurality of model scheduling indexes for the current video frame to be classified, the label generation device takes the next video frame to be classified as the current video frame to be classified, until a plurality of model scheduling indexes have been determined for each video frame to be classified. The full-scale classification model comprises at least one trained classification model.
It should be noted that each model scheduling index in the plurality of model scheduling indexes represents the importance degree of each classification model in the full-scale classification model to each video frame to be classified in at least one video frame to be classified. Because some classification models may not perform classification recognition on the video frames to be classified in the full-scale classification model, for example, the classification models do not generate tags for the video frames to be classified, or the classification models have low classification accuracy on the video frames to be classified, for the video frames to be classified, the importance degree of the classification models on the video frames to be classified is lower than that of the classification models capable of generating correct tags for the video frames to be classified. Therefore, the label generation device needs to use the model scheduling index to measure the importance degree of the classification model to the video frame to be classified, so as to select a suitable classification model for the video frame to be classified subsequently.
In some embodiments of the present application, the label generation device may predict the model scheduling index by using a trained preset scheduling index prediction model, and may also predict the model scheduling index by using a preset scheduling index prediction algorithm, which is not limited herein.
It is understood that the model scheduling index may exist in the form of a scoring score, for example, the tag generation device scores each classification model for each video frame to be classified, and the obtained scoring score is the model scheduling index. The model scheduling index may also exist in the form of an importance level, for example, the tag generation device divides each classification model into different importance levels for each video frame to be classified, for example, into levels with high importance and low importance, so as to obtain the model scheduling index.
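As a hedged illustration of the score form only, the following Python (PyTorch) sketch shows one possible way to produce one scheduling score per classification model for an input frame. The backbone and head structure, as well as all names, are assumptions made for the example; they are not the preset scheduling index prediction model of the embodiments (described later with reference to fig. 9).

```python
# Illustrative sketch (assumption): one scheduling score in [0, 1] per classification
# model, predicted from the frame itself. Not the embodiments' actual model.
import torch
import torch.nn as nn

class SchedulingIndexPredictor(nn.Module):
    def __init__(self, num_classification_models: int, feature_dim: int = 128):
        super().__init__()
        # Stand-in image feature extractor; any backbone could be used instead.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feature_dim), nn.ReLU(),
        )
        # One output per classification model in the full-scale classification model.
        self.head = nn.Linear(feature_dim, num_classification_models)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, H, W) -> scores: (batch, num_classification_models)
        return torch.sigmoid(self.head(self.backbone(frames)))
```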
S103, determining a target model corresponding to each video frame to be classified from the full-scale classification model based on the plurality of model scheduling indexes.
After the label generation device obtains a plurality of model scheduling indexes of each video frame to be classified and reads each model scheduling index, it is clear which classification models are important and which classification models are unimportant for each video frame to be classified. Then, the label generating device will select the important classification model for each video frame to be classified as the target model corresponding to each video frame to be classified. After the label generation device determines the target models for all the video frames to be classified, at least one target model is obtained, wherein each video frame to be classified may correspond to one or more target models.
It should be noted that, in the embodiment of the present application, for a certain video frame to be classified, the tag generation device may select an important classification model, or may select multiple important classification models, where the important classification models are all target models of the video frame to be classified.
Further, the number of target models corresponding to each video frame to be classified may be different, that is, the number of models included in the target models corresponding to one video frame to be classified may differ from the number of models included in the target models corresponding to another video frame to be classified. Therefore, the total number of video frames in the at least one video frame to be classified and the total number of models in the at least one target model may be different.
It should be noted that, although the total number of video frames of at least one video frame to be classified is different from the total number of models of at least one target model, each target model has a corresponding video frame to be classified because the target model is selected for each video frame to be classified, that is, each target model also corresponds to one or more video frames to be classified.
And S104, identifying each video frame to be classified by using the target model corresponding to each video frame to be classified to obtain the identification label of each video frame to be classified.
After the target model corresponding to each video frame to be classified is selected for each video frame to be classified, the label generation device can obtain a plurality of target model sets corresponding to the video frames to be classified respectively, and at the moment, the plurality of target model sets can be used for classifying and identifying the corresponding video frames to be classified. At this time, the label generating device first selects at least one video frame to be classified corresponding to each target model (that is, each target model is included in a plurality of target model sets corresponding to the at least one video frame to be classified), then inputs the video frame to be classified corresponding to each target model into each target model for identification and classification, and uses the label output by each target model as the identification label of the video frame to be classified corresponding to each target model. And when all the target models finish the identification and classification process, obtaining the identification label of each video frame to be classified, and realizing the label generation process aiming at the video to be generated so as to facilitate the user to process the video to be generated by utilizing the identification labels.
It should be noted that, because one or more classification models may be provided in the target model corresponding to the video frame to be classified, there may be some video frames to be classified that need to be classified and identified by multiple classification models, and at this time, the tag generation device may consider that the classification and identification process is completed for each video frame to be classified only when all the target models, that is, all the important classification models for each video frame to be classified, complete the classification and identification operation.
It can be understood that the at least one video frame to be classified and the at least one identification tag do not necessarily correspond one to one; that is, a certain video frame to be classified may have a plurality of identification tags, while another video frame to be classified may have only one tag, or even no tag. For example, when the video frame to be classified shows the action of a girl putting on makeup, the labels "motion - makeup", "article - makeup", and "face - star XXX" may be obtained respectively.
In the embodiment of the application, the label generation device can extract at least one video frame to be classified from a video to be generated, then predict the importance degree of each classification model for each video frame to be classified in the at least one video frame to be classified, namely determine the model scheduling index, and determine a proper target model for each video frame to be classified according to the model scheduling index, so that each video frame to be classified is classified and identified only by using the target model which is important for each video frame to be classified, the number of models for classifying and identifying each video frame to be classified is reduced, and the label generation efficiency is improved.
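Purely as an illustration of how S101 to S104 fit together, a minimal Python sketch is given below. Every helper passed in (and the reuse of extract_key_frames from the earlier sketch) is a placeholder standing in for the corresponding step described above, not an implementation of the embodiments.

```python
# Illustrative end-to-end flow of S101-S104 (assumption: all helpers are placeholders).
def generate_labels(video_path, full_classification_models,
                    predict_scheduling_indexes, select_target_models):
    labels = {}
    frames = extract_key_frames(video_path)                                        # S101
    for i, frame in enumerate(frames):
        indexes = predict_scheduling_indexes(frame, full_classification_models)    # S102
        target_models = select_target_models(indexes, full_classification_models)  # S103
        labels[i] = [model(frame) for model in target_models]                      # S104
    return labels
```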
In some embodiments of the present application, the model scheduling index corresponds to each classification model in the full classification model one to one, that is, each classification model in the full classification model corresponds to a model scheduling index, in this case, the tag generation device determines, for each video frame to be classified, a target model corresponding to each video frame to be classified from the full classification model based on the plurality of model scheduling indexes, that is, a specific implementation process of S103 may include: S1031-S1032 are as follows:
and S1031, sequencing the model scheduling indexes of each video frame to be classified.
When the label generation device screens the target model from the full-scale classification model, it first sorts, by magnitude, the plurality of model scheduling indexes corresponding to the classification models in the full-scale classification model, so as to determine the rank of each model scheduling index among the plurality of model scheduling indexes.
It should be noted that the multiple model scheduling indexes of each to-be-classified video frame represent the importance degree of each classification model in the full-scale classification model to each to-be-classified video frame, and therefore, the multiple model scheduling indexes of each to-be-classified video frame are sequenced, a relative relationship can be determined for the importance degrees of different classification models, and the relative relationship can be used for determining which classification models in the full-scale classification model are more important for each to-be-classified video frame. For example, when the model scheduling indexes of all classification models are higher for a certain video frame to be classified, it seems that all classification models are more important for the video frame to be classified, but after the model scheduling indexes of all classification models are sorted, a target model with a higher importance degree can be selected for the video frame to be classified according to the order.
S1032, the classification model corresponding to the highest preset number of model scheduling indexes in the plurality of model scheduling indexes of each video frame to be classified is used as the target model corresponding to each video frame to be classified.
The label generation equipment sequences a plurality of model scheduling indexes of each video frame to be classified, extracts the highest preset number of model scheduling indexes from the plurality of model scheduling indexes after determining the ranking of each model scheduling index in the plurality of model scheduling indexes of each video frame to be classified, and then finds out the classification model corresponding to the highest preset number of model scheduling indexes according to the corresponding relation between the model scheduling indexes and the classification models, wherein the classification models are the target models corresponding to each video frame to be classified.
It is understood that the preset number may be set to 3, may also be set to 1, and may also be set to other values, and the embodiment of the present application is not limited herein.
Illustratively, when three classification models are provided in the full-scale classification model, and the model scheduling indexes of a certain video frame to be classified of the three classification models are 0.5, 0.1 and 0.3, respectively, the tag generation device sorts the three model scheduling indexes to obtain a sequence of [0.5, 0.3 and 0.1], and when the preset number is 2, the tag generation device takes the classification models corresponding to 0.5 and 0.3 as the target models corresponding to the video frame to be classified.
In the embodiment of the application, the label generation equipment can sequence the model scheduling indexes, so that the classification models with the model scheduling indexes in the front order are selected as the target models, the accuracy of selecting the target models is improved, the number of the models for classifying and identifying the video frames to be classified is reduced, and the label generation efficiency is improved.
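A minimal Python sketch of this top-N selection follows; the function and variable names are assumptions introduced for the example.

```python
# Sketch of S1031-S1032: keep the classification models whose scheduling indexes
# rank highest for one video frame to be classified.
def select_top_models(scheduling_indexes, classification_models, preset_number=2):
    ranked = sorted(zip(classification_models, scheduling_indexes),
                    key=lambda pair: pair[1], reverse=True)   # sort by index, descending
    return [model for model, _ in ranked[:preset_number]]

# With indexes [0.5, 0.1, 0.3] and preset_number=2, the models scoring 0.5 and 0.3
# are returned, matching the example above.
```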
In some embodiments of the present application, each classification model in the full-scale classification model has a model scheduling index corresponding to at least one to-be-classified video frame, and at this time, the tag generation device determines, for each to-be-classified video frame, a target model corresponding to each to-be-classified video frame from the full-scale classification model based on the plurality of model scheduling indexes, that is, a specific implementation process of S103 may include: S1033-S1034, as follows:
s1033, comparing the model scheduling index corresponding to each video frame to be classified in the model scheduling indexes with a preset index threshold value to obtain a comparison result.
The label generation equipment can not only sequence the model scheduling indexes and determine the target model, but also directly compare the model scheduling index corresponding to each video frame to be classified with a preset index threshold value to obtain a comparison result, and then determine the target model according to the comparison result.
It should be noted that, the comparison result represents whether the model scheduling index is greater than a preset index threshold. And when the comparison result is greater than the preset index threshold, the classification model is important for the video frame to be classified, otherwise, when the comparison result is less than or equal to the preset index threshold, the classification model is less important for the video frame to be classified.
It is understood that the specific value of the preset index threshold may be set according to practical situations, for example, set to 0.5, or set to 0.8, and the embodiment of the present application is not limited herein.
S1034, selecting the classification model with the comparison result representing model scheduling index larger than a preset index threshold value from the full-scale classification models, and using the classification model as a target model corresponding to each video frame to be classified.
After the label generation device obtains the comparison result between the model scheduling index corresponding to each classification model and the preset index threshold, the classification models whose model scheduling indexes are greater than the preset index threshold can be selected, and the selected classification models are used as the target models; in this way, the label generation device obtains the target model.
For example, when there are three classification models in the full-scale classification model, and the model scheduling indexes of the three classification models are 0.5, 0.1 and 0.3, respectively, and the preset index threshold is 0.4, the label classification model may use the classification model corresponding to 0.5 as the target model.
In the embodiment of the application, the label generation equipment can directly compare a plurality of model scheduling indexes with the preset index threshold value, and then directly extract the classification model of which the model scheduling index is greater than the preset index threshold value as the target model, so that the number of models for classifying and identifying the video frames to be classified is reduced, and the label generation efficiency is improved.
It should be noted that, in some embodiments of the present application, the label generation device may further sequence the plurality of model scheduling indicators, and use a classification model corresponding to a preset number of model scheduling indicators that are the highest among the plurality of model scheduling indicators as a first candidate model; comparing the model scheduling index corresponding to each video frame to be classified in the plurality of model scheduling indexes with a preset index threshold value, and representing the classification model with the model scheduling index larger than the preset index threshold value as a second candidate model according to the comparison result; then, the label generation device compares the number of classification models in the first candidate model with the number of classification models in the second candidate model, and uses the candidate model with a larger number of classification models from the first candidate model and the second candidate model as a final target model. In this way, the label generation device can determine as many important classification models as possible for each video frame to be classified.
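A minimal Python sketch of the threshold comparison (S1033-S1034) and of the combined strategy just described is given below; select_top_models refers to the earlier sketch, and all names are assumptions introduced for the example.

```python
# Sketch of S1033-S1034: keep every model whose scheduling index exceeds the threshold.
def select_by_threshold(scheduling_indexes, classification_models, index_threshold=0.4):
    return [model for model, index in zip(classification_models, scheduling_indexes)
            if index > index_threshold]

# Sketch of the combined strategy: take whichever candidate set contains more models.
def select_target_models_combined(scheduling_indexes, classification_models,
                                  preset_number=2, index_threshold=0.4):
    first_candidates = select_top_models(scheduling_indexes, classification_models,
                                         preset_number)
    second_candidates = select_by_threshold(scheduling_indexes, classification_models,
                                            index_threshold)
    return (first_candidates if len(first_candidates) >= len(second_candidates)
            else second_candidates)
```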
Referring to fig. 5, fig. 5 is a schematic view of an optional flowchart of a tag generation method according to an embodiment of the present application. In some embodiments of the present application, identifying each video frame to be classified by using the target model corresponding to each video frame to be classified to obtain an identification tag corresponding to each video frame to be classified, that is, a specific implementation process of S104 may include: S1041-S1045, as follows:
s1041, determining the video frame to be classified corresponding to each target model as a target video frame.
In the embodiment of the invention, the label generation device determines the target model for each video frame to be classified, namely, the corresponding relation between the video frame to be classified and the target model is constructed, so that each target model has the video frame to be classified which needs to be predicted and identified, namely, each target model corresponds to some video frames to be classified, and the video frames are the video frames which are identified by the target model in at least one video frame to be classified. Therefore, the label generation device can classify at least one video frame to be classified according to the target models, that is, the video frames to be classified which need to be classified by each target model are divided together to obtain the target video frame corresponding to each target model.
S1042, counting the number of the target video frames of each target model to obtain the number of the target video frames of each target model.
After obtaining the target video frames of each target model, the tag generation device will count the number of target video frames corresponding to each target model, and the obtained statistical result is the number of target video frames corresponding to each target model.
And S1043, determining a corresponding execution sequence for each target model by using the number of the target video frames.
Since for a certain object model, the larger the number of object video frames, the more video frames to be classified are indicated, which need to be classified. In the tag generation process, in order to ensure that most of the video frames to be classified are identified first, the tag generation device may select to execute a target model with a larger number of target video frames first. At this time, the tag generation device sorts the number of target video frames corresponding to each target model according to a principle from large to small, and then obtains the sort of the target models by using the corresponding relation between the number of target video frames and the target models, that is, the execution sequence is determined for each target model.
And S1044, identifying the target video frames corresponding to each target model by using each target model according to the execution sequence to obtain the identification labels corresponding to the target video frames.
And S1045, after the target video frame corresponding to each target model is identified, obtaining an identification label corresponding to each video frame to be classified.
After obtaining the execution sequence of each target model, the tag generation device calls the target models in sequence according to the execution sequence. When a certain target model is called, the label generation device inputs a target video frame corresponding to the target model into the target model for classification and identification, and the label finally output by the target model is the identification label of the target video frame corresponding to the target model. Because at least one video frame to be classified is divided into at least one target model, when each target model in the at least one target model completes the identification analysis of the target video frame corresponding to the target model, the identification of all the video frames to be classified is also indicated, and at this time, the label generating device can obtain the identification label corresponding to each video frame to be classified.
For example, fig. 6 is an exemplary diagram of performing recognition by using the target model corresponding to each video frame to be classified according to an embodiment of the present application. As shown in FIG. 6, there are 3 target models in the at least one target model, namely model 6-1, model 6-2, and model 6-3. The label generation device divides the obtained 10 video frames to be classified among the 3 target models, where 3 video frames to be classified correspond to model 6-1, 5 video frames to be classified correspond to model 6-2, and 2 video frames to be classified correspond to model 6-3. The label generation device then counts the number of video frames to be classified corresponding to each of the 3 target models, obtaining a frame count of 3 for model 6-1, 5 for model 6-2, and 2 for model 6-3. Based on these counts, the label generation device can determine the execution order of the 3 target models, namely model 6-2 first, model 6-1 second, and model 6-3 third, and call the 3 target models in this order to complete the classification and identification of the 10 video frames to be classified, obtaining 10 identification labels.
In the embodiment of the application, the tag generation device can perform statistics according to the number of the target video frames corresponding to each target model, and then determine an execution sequence for each target model by using the counted number of the target video frames, so that each target model is called according to the execution sequence. Therefore, the label generation equipment can ensure that the target model with more video frames to be classified is executed first, so as to ensure that most video frames to be classified can be classified and identified first.
In some embodiments of the present application, according to the execution sequence, identifying, by using each target model, a target video frame corresponding to each target model to obtain an identification tag corresponding to the target video frame, that is, the specific implementation process in S1044 may include: s1044a-S1044c, as follows:
s1044a, comparing the execution sequence with a preset sequence to obtain a sequence comparison result; and the sequence comparison result represents the front-back relation between the execution sequence and the preset sequence.
In practical applications, the classification and identification process can be too time-consuming, while label generation often has to be completed within a limited time. To avoid excessive time consumption that would reduce label generation efficiency, the label generation device needs to interrupt the classification and identification process in time, while still ensuring that most of the video frames to be classified are successfully classified and identified. In this case, the label generation device may select only part of the target models from the at least one target model for invocation and execution, thereby further reducing the time spent on classifying and identifying the at least one video frame to be classified and improving label generation efficiency. To this end, the label generation device may obtain the preset order, compare the execution order of each target model with the preset order, and determine whether the execution order of each target model precedes or follows the preset order; the result of this determination is the order comparison result.
It should be noted that, in some embodiments of the present application, the preset sequence may be set according to actual requirements, for example, the preset sequence is set to 3, or set to 5, so as to achieve flexible adjustment of the video frame recognition effect according to the service condition or the model performance.
And S1044b, when the sequence comparison result represents that the execution sequence is before the preset sequence, identifying the target video frame corresponding to each target model by using each target model to obtain the identification tag corresponding to the target video frame.
S1044c, when the sequence comparison result indicates that the execution sequence is after the preset sequence, ending the identification of the target video frame corresponding to each target model by using each target model.
When the tag generation device determines that the execution order precedes the preset order, it uses each such target model normally to identify its corresponding target video frames, thereby obtaining the identification tags of the target video frames corresponding to the target models whose execution order precedes the preset order. When the execution order follows the preset order, the tag generation device stops calling the target model, that is, it no longer uses that target model to identify the corresponding target video frames. In other words, in the embodiment of the present application, the tag generation device only calls the target models whose execution order precedes the preset order, that is, it only classifies and identifies the target video frames corresponding to those target models, so that within a limited time the tag generation device can preferentially call and execute the target models with more target video frames.
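As a sketch of S1044a-S1044c under the same illustrative assumptions as above, comparing with the preset order amounts to executing only those target models whose position in the execution order precedes the preset position; the `models` mapping to per-frame callables is hypothetical.

```python
def run_with_order_cutoff(execution_order, model_to_frames, models, preset_order):
    """Run only the target models whose execution position precedes `preset_order`.

    `models` is assumed to map a model id to a callable that takes a single
    frame and returns its identification label.
    """
    labels = {}
    for position, model_id in enumerate(execution_order):
        if position >= preset_order:                 # S1044c: order comparison result is "after"
            break                                    # stop identifying with the remaining models
        for frame in model_to_frames[model_id]:      # S1044b: order precedes the preset order
            labels[frame] = models[model_id](frame)
    return labels
```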
In some embodiments of the present application, according to the execution sequence, identifying, by using each target model, a target video frame corresponding to each target model to obtain an identification tag corresponding to the target video frame, that is, the specific implementation process in S1044 may include: s1044d-S1044e, as follows:
s1044d, when the current target model corresponding to the current order is used to identify the current target video frame corresponding to the current target model, obtaining a current identification completion time corresponding to the current target model.
The label generation device may also use a maximum elapsed time to decide whether to stop calling target models, while still ensuring that most of the video frames to be classified can be successfully classified and identified and that the classification and identification process is interrupted in time. In this case, the tag generation device first uses the current target model corresponding to the current order to classify and identify the current target video frames corresponding to the current target model. When the current target model completes the classification and identification of its current target video frames at a certain moment, the label generation device records that moment and takes it as the current identification completion time.
It should be noted that the current sequence is any one of the execution sequences corresponding to each object model, that is, when the tag generation device completes execution of any object model, the identification completion time corresponding to the object model is obtained once.
S1044e, when the current recognition completion time is greater than or equal to the preset maximum time, stopping recognizing a target video frame corresponding to a next target model by using a next target model corresponding to a next sequence of the current sequence.
After obtaining the current recognition completion time of the current target model, the tag generation device also obtains a preset maximum time, and compares the current recognition completion time with the preset maximum time. When the tag generation device judges that the current recognition completion time is greater than or equal to the preset maximum time, it indicates that no redundant time is left for a next target model corresponding to the next sequence of the current sequence, and at this time, the tag generation device stops calling the next target model corresponding to the next sequence of the current sequence, that is, stops using the next target model to recognize a target video frame corresponding to the next target model, so that the time for classification recognition is controlled within a proper range under the condition that most video frames to be classified can be classified and recognized, and the tag generation efficiency is improved.
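A minimal sketch of this elapsed-time variant (S1044d-S1044e), again with hypothetical model callables; expressing the budget as a duration in seconds is an assumption, since the embodiment only speaks of a preset maximum time.

```python
import time

def run_with_time_budget(execution_order, model_to_frames, models, max_seconds):
    """Stop calling the next target model once the current model finishes after the budget."""
    labels = {}
    start = time.monotonic()
    for model_id in execution_order:
        for frame in model_to_frames[model_id]:
            labels[frame] = models[model_id](frame)
        completion = time.monotonic() - start        # current identification completion time
        if completion >= max_seconds:                # S1044e: no time left for the next model
            break
    return labels
```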
For example, referring to fig. 7, an exemplary diagram of a process for performing classification and recognition within a limited time is provided in the embodiments of the present application. Suppose the execution order of the target models 7-1, 7-2, 7-3, 7-4 and 7-5 is: model 7-3, model 7-2, model 7-1, model 7-5, model 7-4. The label generating device first calls model 7-3 (the current target model) for classification and recognition, obtains the recognition completion time (the current recognition completion time) of model 7-3 when its classification and recognition finishes, and compares it with the preset maximum time; since the recognition completion time of model 7-3 is less than the preset maximum time, the device continues to call model 7-2 for classification and recognition. By analogy, when the tag generation device finds that the recognition completion time corresponding to model 7-5 is greater than the preset maximum time, it no longer calls model 7-4 for classification and recognition.
In the embodiment of the application, the tag generation device can compare the current identification completion time of the current target model corresponding to the current sequence with the preset maximum time, so as to judge whether the next target model corresponding to the next sequence of the current sequence needs to be called for classification and identification, thereby ensuring that most video frames to be classified can be classified and identified within a limited time, and improving the tag generation efficiency.
In some embodiments of the present application, for each to-be-classified video frame in the at least one to-be-classified video frame, determining a plurality of model scheduling indexes corresponding to the full-scale classification model, that is, a specific implementation process of S102 may include: S1021-S1022, as follows:
and S1021, predicting a model scheduling index corresponding to each classification model in the full-scale classification model by using a preset scheduling index prediction model aiming at each video frame to be classified.
And S1022, when the prediction of the model scheduling indexes is completed for all the full-scale classification models, obtaining a plurality of model scheduling indexes corresponding to the full-scale classification models.
The label generation device may predict the model scheduling index using a pre-trained preset scheduling index prediction model. In this case, the label generation device may first obtain the preset scheduling index prediction model and then input each video frame to be classified into it; for a given classification model, the preset scheduling index prediction model predicts the degree of importance of that classification model to the input video frame to be classified, thereby obtaining the model scheduling index corresponding to that classification model. When the label generation device completes the prediction for all classification models in the full-scale classification model, it obtains a plurality of model scheduling indexes, the number of which is the same as the number of classification models in the full-scale classification model.
For example, when there are 5 classification models in the full-scale classification model, the tag generation device may obtain 5 model scheduling indexes for a certain video frame to be classified, where the 5 model scheduling indexes are in one-to-one correspondence with the 5 classification models.
In the embodiment of the application, the label generation device can predict a plurality of model scheduling indexes for each video frame to be classified by using the preset scheduling index prediction model, so that the target model can be determined for each video frame to be classified by using the plurality of model scheduling indexes in the follow-up process.
Referring to fig. 8, fig. 8 is a third optional flowchart of a tag generation method provided in the embodiment of the present application. In some embodiments of the present application, before predicting, by using a preset scheduling index prediction model, a model scheduling index corresponding to each classification model in a full-scale classification model for each video frame to be classified, that is, before S1021, the method may further include: S201-S203, as follows:
s201, obtaining an initial prediction model, historical video frames, a plurality of historical label data, a plurality of label profit indexes corresponding to the historical label data and a full model speed corresponding to a full classification model.
The historical video frames are classified by using a full-scale classification model in historical time, the plurality of historical label data are label data obtained by identifying the historical video frames by using the full-scale classification model, and the plurality of historical label data correspond to the plurality of label profit indexes one by one; the full-scale model speed includes the predicted speed of each classification model for the historical video frames.
Before predicting model scheduling indexes with the preset scheduling index prediction model, the label generation device needs to train this model. To do so, the label generation device may first obtain the initial prediction model, the historical video frames, and the plurality of historical label data obtained by classifying and identifying the historical video frames with the full-scale classification model; some classification models may generate several labels for a historical video frame, while others may generate one label or even none. The label generation device also obtains the label revenue index corresponding to each historical label data and the model speed corresponding to each classification model in the full-scale classification model, so that training scheduling indexes can subsequently be constructed from these data.
It is to be understood that the tag revenue index may be an index provided by the user for characterizing the revenue degree of the tag, for example, the click rate of the tag, the revenue rate of placing the advertisement on the tag, and the like, and the embodiment of the present application is not limited herein.
S202, constructing a plurality of training scheduling indexes corresponding to a full-scale classification model aiming at historical video frames by utilizing a plurality of historical label data, a plurality of label income indexes and a full-scale model speed.
After obtaining the plurality of historical label data, the plurality of label revenue indexes, and the full-scale model speed, the label generation device may generate a corresponding training scheduling index for each classification model in the full-scale classification model by using the historical label data, the label revenue indexes, and the model speed of each classification model included in the full-scale model speed. After constructing the corresponding training scheduling index for every classification model in the full-scale classification model, the label generation device obtains the plurality of training scheduling indexes.
It can be understood that each training scheduling index in the plurality of training scheduling indexes corresponds to each classification model in the full classification model one to one, that is, how many classification models are in the full classification model, and how many training scheduling indexes can be obtained by the label generation device.
S203, training the initial prediction model by using the historical video frame and the training scheduling indexes until a training stopping condition is reached, and obtaining a preset scheduling index prediction model.
The label generation equipment takes the historical video frames as input and takes a plurality of training scheduling indexes as output supervision items to update the model parameters in the initial prediction model. When the training process reaches a training stopping condition, for example, the training times reach a preset training time, or the model has converged, the label generation device stops training, and at this time, the obtained model is the preset scheduling index prediction model.
In the embodiment of the application, the label generation device can firstly construct a plurality of corresponding training scheduling indexes for the full-scale classification model, and then train a preset scheduling index prediction model by using the training scheduling indexes and the historical video frames. Therefore, the label generation equipment can obtain the preset scheduling index prediction model so as to predict the video frame to be classified by utilizing the preset scheduling index prediction model subsequently.
In some embodiments of the present application, each historical tag data of the plurality of historical tag data comprises a tag confidence and a tag accuracy; by using a plurality of historical tag data, a plurality of tag revenue indexes and a full-scale model speed, a plurality of training scheduling indexes corresponding to the full-scale classification model are constructed for the historical video frames, that is, the specific implementation process of S202 may include: S2021-S2023, as follows:
s2021, extracting current historical label data from the historical label data and extracting a current model speed from the full-scale model speed according to a current classification model in the full-scale classification model; the current classification model is any one of the full-scale classification models.
The label generation equipment firstly uses any one classification model in the full-scale classification model as a current classification model, then extracts label data obtained by predicting by using the current classification model from a plurality of historical label data, uses the label data as the current historical label data, and finally extracts the current model speed corresponding to the current classification model from the full-scale model speed, so that a training scheduling index is constructed for the current classification model by using the data.
It can be understood that the current historical tag data refers to tag data obtained by prediction of a current classification model, and when the current classification model is a multi-classification model, the obtained current historical tag data also includes a plurality of sub-tag data.
S2022, constructing a training scheduling index corresponding to the current classification model by using the current tag income index corresponding to the current historical tag, the current tag accuracy of the current historical tag data, the current tag confidence coefficient and the current model speed.
Each piece of historical tag data consists of a tag confidence and a tag accuracy, and the current historical tag data is no exception: it carries a current tag accuracy and a current tag confidence. When the label generation device constructs the corresponding training scheduling index for the current model, the current tag revenue index, the current tag accuracy, the current tag confidence, and the current model speed are multiplied together, and the resulting product is used as the training scheduling index corresponding to the current classification model.
It should be noted that, when the current classification model is a multi-classification model, the tag generation device takes each sub-tag data among the plurality of sub-tag data included in the current historical tag data, multiplies the tag accuracy, the tag confidence, and the tag revenue index corresponding to that sub-tag data together with the current model speed to obtain the sub-training scheduling index corresponding to that sub-tag data, and then sums the sub-training scheduling indexes corresponding to all the sub-tag data to obtain the training scheduling index corresponding to the current classification model.
For example, the embodiment of the present application provides a formula for constructing a training scheduling index corresponding to a current classification model, see formula (1):
model_importance_i = Σ_j ( label_conf_{i,j} × tag_value_{i,j} × tag_accuracy_{i,j} × model_speed_i )    (1)

wherein i denotes the i-th model, j denotes the j-th sub-tag data, label_conf is the tag confidence of the sub-tag data, tag_value is the tag revenue index of the sub-tag data, tag_accuracy is the tag accuracy of the sub-tag data, model_speed is the current model speed, and model_importance is the training scheduling index. When the label generation device obtains the specific values of these parameters, it can substitute them into formula (1) to calculate the training scheduling index corresponding to the current classification model.
S2023, when corresponding training scheduling indexes are constructed for the full-scale classification models, obtaining a plurality of training scheduling indexes.
According to the above manner, the label generation device can construct the training scheduling index corresponding to each classification model, and when the training scheduling index is constructed for all classification models in the full-scale classification model, the label generation device can obtain a plurality of training scheduling indexes corresponding to the full-scale classification model.
In some embodiments of the application, after obtaining the plurality of training scheduling indexes, the label generation device further normalizes the training scheduling index corresponding to each classification model by the maximum value among the plurality of training scheduling indexes, that is, the ratio of the training scheduling index corresponding to each classification model to that maximum value is the normalized training scheduling index corresponding to that classification model.
For example, the embodiment of the present application provides a process of normalizing the training scheduling index corresponding to each classification model, see formula (2):
model_importance_i^norm = model_importance_i / max_k ( model_importance_k )    (2)

wherein model_importance_i is the training scheduling index corresponding to each classification model. After the label generation device obtains the specific values of these parameters, it can substitute them into formula (2) to complete the normalization of the training scheduling index corresponding to each classification model.
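For illustration, the following sketch computes formula (1) for one classification model and then applies the normalization of formula (2) across models; the field names of the historical tag records and the numeric values are assumptions made for the example.

```python
def training_scheduling_index(sub_tags, model_speed):
    """Formula (1): sum over sub-tag data of confidence x revenue x accuracy x model speed."""
    return sum(t["label_conf"] * t["tag_value"] * t["tag_accuracy"] * model_speed
               for t in sub_tags)

def normalize_indexes(indexes):
    """Formula (2): divide each training scheduling index by the maximum one."""
    top = max(indexes.values())
    return {model_id: value / top for model_id, value in indexes.items()}

# Hypothetical historical data for two classification models: (sub-tag records, model speed)
history = {
    "model_a": ([{"label_conf": 0.9, "tag_value": 2.0, "tag_accuracy": 1.0}], 30.0),
    "model_b": ([{"label_conf": 0.4, "tag_value": 1.0, "tag_accuracy": 1.0},
                 {"label_conf": 0.7, "tag_value": 3.0, "tag_accuracy": 1.0}], 10.0),
}
raw = {m: training_scheduling_index(tags, speed) for m, (tags, speed) in history.items()}
print(normalize_indexes(raw))   # model_a: 54.0 -> 1.0, model_b: 25.0 -> ~0.463
```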
In some embodiments of the present application, after training an initial prediction model by using a historical video frame and a plurality of training scheduling indexes until a training stop condition is reached, and after obtaining a preset scheduling index prediction model, for each video frame to be classified, before predicting a model scheduling index corresponding to each classification model in a full-scale classification model by using the preset scheduling index prediction model, that is, after S203 and before S1021, the method further includes: S204-S207, as follows:
s204, aiming at the test video frame, predicting a plurality of test model scheduling indexes corresponding to the full-scale classification model by using a preset scheduling index prediction model, and sequencing the plurality of test model scheduling indexes to obtain a model index sequence.
After the label generation device obtains the preset scheduling index prediction model, it is further required to evaluate the performance of the preset scheduling index prediction model. At this time, the label generating device may first obtain a test video frame prepared in advance, where the test video frame has real model scheduling indexes corresponding to the classification models. And the label generation equipment inputs the test video frames into a preset scheduling index prediction model, and the predicted results are a plurality of test model scheduling indexes corresponding to the full-scale classification model. And then, sequencing all the obtained test model scheduling indexes by the label generation equipment, wherein the sequencing result is a model index sequence.
S205, acquiring a preset model index sequence corresponding to the test video frame; the preset model index sequence represents the real sequence of the importance degree of each classification model in the full-scale classification model to the test video frame.
And S206, determining the prediction accuracy of the preset scheduling index prediction model according to the model index sequence and the preset model index sequence.
The label generation equipment can sort the real model scheduling indexes of the test video frames in advance to obtain a sequence which can represent the real sequence of the importance degree of each classification model to the test video frames, and the sequence is recorded as a preset model index sequence. After the label generation equipment obtains the model index sequence, a preset model index sequence is obtained, then the preset model index sequence is compared with the model index sequence, the distance difference between the preset model index sequence and the model index sequence is judged, and whether the preset model index sequence is similar to the model index sequence or not is measured according to the distance difference. When the distance difference is smaller than the preset distance difference threshold value, the preset model index sequence is similar to the model index sequence, namely the predicted model index sequence is very close to the preset model index sequence, and at the moment, the performance of the preset scheduling index prediction model is better, so that the prediction accuracy of the preset scheduling index prediction model is higher. On the contrary, when the distance difference is greater than or equal to the preset distance difference threshold, it indicates that the model index sequence is far from the preset model index sequence, the accuracy of the preset scheduling index prediction model is low, and in practical application, the predicted model scheduling index is not accurate, that is, the prediction accuracy of the preset scheduling index prediction model is low.
It should be noted that, in some embodiments of the present application, the distance between the model index sequence and the preset model index sequence may be used as the distance difference, and the reciprocal of this distance difference may then be taken as the prediction accuracy. For example, when the model index sequence is [1, 2, 3, 4, 5] and the preset model index sequence is [5, 4, 3, 2, 1], 10 steps are required to reorder [1, 2, 3, 4, 5] into [5, 4, 3, 2, 1]; in this case, the distance difference is 10, and the prediction accuracy 1/10 is obtained by taking the reciprocal of 10. Further, when the model index sequence has n elements, the distance difference ranges from 0 to n(n-1)/2 steps.
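A sketch of this evaluation, counting the number of adjacent-swap steps (bubble-sort distance) needed to turn the predicted model index sequence into the preset one and taking its reciprocal as the prediction accuracy; how a distance of 0 (a perfect match) maps to an accuracy value is not defined by the text, so returning 1.0 there is an assumption.

```python
def adjacent_swap_distance(predicted, reference):
    """Number of adjacent-swap steps needed to reorder `predicted` into `reference`."""
    target_pos = {model: pos for pos, model in enumerate(reference)}
    seq = [target_pos[model] for model in predicted]
    swaps = 0
    for i in range(len(seq)):                     # simple bubble-sort count, O(n^2)
        for j in range(len(seq) - 1 - i):
            if seq[j] > seq[j + 1]:
                seq[j], seq[j + 1] = seq[j + 1], seq[j]
                swaps += 1
    return swaps

def prediction_accuracy(predicted, reference):
    distance = adjacent_swap_distance(predicted, reference)
    return 1.0 if distance == 0 else 1.0 / distance   # assumption: a perfect match scores 1.0

print(adjacent_swap_distance([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # 10, as in the example above
```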
It can be understood that the value of the preset distance difference threshold may be any integer between 0 and n (n-1)/2, and the specific preset distance difference threshold may be set according to an actual situation, which is not limited herein in this embodiment of the present application.
And S207, when the prediction accuracy is smaller than a preset accuracy threshold, training the initial prediction model by reusing the historical video frame and the training scheduling indexes until a training stopping condition is reached, and obtaining a latest scheduling index prediction model.
When the prediction accuracy is smaller than a preset accuracy threshold, the label generation device needs to take the historical video frame as input again, and takes a plurality of training scheduling indexes as supervision items to train the initial prediction model until a training stopping condition is reached, so that the latest scheduling index prediction model is obtained. Correspondingly, the label generating device predicts the model scheduling index corresponding to each classification model in the full classification model by using the preset scheduling index prediction model for each video frame to be classified in the subsequent process, that is, the specific implementation process of S1021 becomes S1021 a: and predicting the model scheduling indexes corresponding to each classification model in the full-scale classification model by using the latest scheduling index prediction model aiming at each video frame to be classified. Therefore, the inaccurate model scheduling index can be prevented from being predicted by using the preset scheduling index prediction model with low prediction accuracy.
It can be understood that, in the embodiment of the present application, a specific value of the preset accuracy threshold may be set according to an actual situation, for example, set to 0.5, set to 0.8, and the like, and is not specifically limited herein.
In the embodiment of the application, the label generation equipment can also predict the prediction accuracy of the preset scheduling index prediction model, and when the prediction accuracy is lower than a preset accuracy threshold, the label generation equipment can retrain the initial preset model to obtain the latest scheduling index prediction model, so that the model scheduling index is predicted by using the model with higher preset accuracy, the prediction accuracy of the model scheduling index is improved, and the accuracy of the target model is ensured.
In some embodiments of the present application, the preset scheduling index prediction model includes: the device comprises an input layer, at least two convolution layers, at least two pooling layers, a dimension reduction layer, a full connection layer and an output layer;
the output of the input layer is connected with the input of at least two convolution layers, the at least two convolution layers and the at least two pooling layers are sequentially and alternately connected, the input of the dimensionality reduction layer is connected with the output of the at least two pooling layers, the input of the full-connection layer is connected with the output of the dimensionality reduction layer, and the input of the output layer is connected with the output of the full-connection layer;
each video frame to be classified enters at least two convolution layers through an input layer, at least two pooling layers are used for receiving feature maps output by the at least two convolution layers, a dimensionality reduction layer is used for receiving a pooling feature map output by the last pooling layer of the at least two pooling layers, a full connection layer is used for receiving dimensionality reduction data output by the dimensionality reduction layer, and an output layer is used for receiving processing data output by the full connection layer and outputting a model scheduling index corresponding to each classification model.
It is understood that the convolutional layer is used to convolve the input image to obtain the feature map, where the input image to the convolutional layer may be the image output by the input layer or the pooled feature map output by the pooling layer. The pooling layer is used for pooling the characteristic map to obtain a pooled characteristic map.
The at least two convolution layers and the at least two pooling layers being alternately connected in sequence means that one pooling layer is connected after each of the at least two convolution layers, that is, they are connected in the order convolution layer, pooling layer, convolution layer, pooling layer.
It is understood that, in the embodiment of the present application, the number of convolutional layers and the number of pooling layers may be set according to actual requirements, for example, the number of convolutional layers is set to 2, the number of pooling layers is set to 2, or the number of convolutional layers is set to 3, the number of pooling layers is set to 4, and the like, and the embodiment of the present application is not limited herein.
In some embodiments of the present application, the at least two convolutional layers include a first convolution layer and a second convolution layer, where the number of channels of the first convolution layer is different from that of the second convolution layer; the at least two pooling layers include a first pooling layer and a second pooling layer;
the output of the input layer is connected with the input of the first convolution layer, the input of the first pooling layer is connected with the output of the first convolution layer, the input of the second convolution layer is connected with the output of the first pooling layer, the input of the second pooling layer is connected with the output of the second convolution layer, and the input of the dimension reduction layer is connected with the output of the second pooling layer.
In some embodiments of the present application, each video frame to be classified enters the first convolution layer through the input layer, the first pooling layer is configured to receive the first feature map output by the first convolution layer, the second convolution layer is configured to receive the first pooled feature map output by the first pooling layer, the second pooling layer is configured to receive the second feature map output by the second convolution layer, the dimensionality reduction layer is configured to receive the second pooled feature map output by the second pooling layer, the fully connected layer is configured to receive the dimensionality reduction data output by the dimensionality reduction layer, and the output layer is configured to receive the processing data output by the fully connected layer and output the model scheduling index corresponding to each classification model.
It is understood that the sizes of the convolution kernels of the first convolution layer and the second convolution layer may be the same, for example, the convolution kernels of the first convolution layer and the second convolution layer are both set to 3 × 3, or 5 × 5; the sizes of the convolution kernels of the first convolutional layer and the second convolutional layer may be different, for example, the size of the convolution kernel of the first convolutional layer is set to 3 × 3, and the size of the convolution kernel of the second convolutional layer is set to 1 × 1, and the like, and the embodiment of the present application is not limited herein.
It should be noted that, in the embodiment of the present application, the number of channels of the second convolution layer may be greater than the number of channels of the first convolution layer, or may be less than the number of channels of the first convolution layer.
In some embodiments of the present application, the convolution kernel of the first convolution layer has a size of 3 × 3 and the number of channels is 8; the convolution kernel of the second convolution layer has a size of 3 × 3 and the number of channels is 16.
In this case, the structure of the preset scheduling index prediction model may be: the first convolution layer is connected after the input layer, the convolution kernel of the first convolution layer having a size of 3 × 3 and 8 channels; the first pooling layer is connected after the first convolution layer; the second convolution layer is connected after the first pooling layer, the convolution kernel of the second convolution layer having a size of 3 × 3 and 16 channels; the second pooling layer is connected after the second convolution layer; the dimensionality reduction layer is connected after the second pooling layer, and the fully connected layer is connected after the dimensionality reduction layer; the output layer is connected after the fully connected layer.
For example, an exemplary example of the structure of the preset scheduling index prediction model is provided in the embodiment of the present application, referring to fig. 9: the structure of the first convolution layer 9-1 of the preset scheduling index prediction model is 3 × 3 conv, 8; a first 2 × 2 pooling layer 9-2 is connected after the first convolution layer 9-1, and a second convolution layer 9-3 is connected after the first pooling layer 9-2, where the structure of the second convolution layer 9-3 is 3 × 3 conv, 16. After the second convolution layer 9-3, a 2 × 2 second pooling layer 9-4 is connected. A dimension reduction layer 9-5 is connected after the second pooling layer 9-4 and is used to map two-dimensional data to one dimension; a fully connected layer 9-6 is connected after the dimension reduction layer 9-5, a fully connected layer 9-7 is connected after the fully connected layer 9-6, and the fully connected layer 9-7 is connected to the output layer 9-8.
For example, the embodiment of the present application provides the output and parameter amount of each network layer in the preset scheduling index prediction model shown in fig. 9, see table 1:
TABLE 1
Network layer | Output size | Number of parameters
Input layer | (none, 64, 64, 3) | 0
First convolution layer 9-1 | (none, 62, 62, 8) | 224
First pooling layer 9-2 | (none, 31, 31, 8) | 0
Second convolution layer 9-3 | (none, 29, 29, 16) | 1168
Second pooling layer 9-4 | (none, 14, 14, 16) | 0
Dimension reduction layer 9-5 | (none, 3136) | 0
Fully connected layer 9-6 | (none, 128) | 401536
Fully connected layer 9-7 | (none, 16) | 2064
Output layer 9-8 | (none, 5) | 85

Referring to table 1, the convolution step size is 1 and each 2 × 2 pooling layer has a step size of 2; "none" indicates that the number of input video frames is not fixed for any network layer.
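Purely as an illustrative sketch of the structure in fig. 9 and Table 1, a Keras-style definition could look as follows; the 64 × 64 × 3 input, the assumption of 5 scheduled models, and the ReLU activations are example choices, not mandated by the embodiment. The layer shapes and parameter counts produced by this sketch match Table 1.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_scheduling_predictor(input_shape=(64, 64, 3), num_models=5):
    """Sketch of the preset scheduling index prediction model of fig. 9 / Table 1."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(8, 3, activation="relu"),    # first convolution layer 9-1: 3x3, 8 channels
        layers.MaxPooling2D(2),                    # first pooling layer 9-2: 2x2
        layers.Conv2D(16, 3, activation="relu"),   # second convolution layer 9-3: 3x3, 16 channels
        layers.MaxPooling2D(2),                    # second pooling layer 9-4: 2x2
        layers.Flatten(),                          # dimension reduction layer 9-5
        layers.Dense(128, activation="relu"),      # fully connected layer 9-6
        layers.Dense(16, activation="relu"),       # fully connected layer 9-7
        layers.Dense(num_models),                  # output layer 9-8: one scheduling index per model
    ])

build_scheduling_predictor().summary()  # parameter counts match Table 1 (224, 1168, 401536, 2064, 85)
```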
As can be seen from fig. 9 and table 1, the preset scheduling index prediction model provided in the embodiment of the present application has a simpler structure and requires a smaller amount of parameters to be calculated, so that the operation speed of the preset scheduling index prediction model is ensured, and the efficiency of generating the label is improved.
In the embodiment of the application, the quantity of parameters needing to be calculated in the preset scheduling index prediction model is small, so that the running speed of the preset scheduling index prediction model is high, and the label generation efficiency is guaranteed.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described. The embodiment of the application is realized in a scene that a label is generated for a video so as to put an advertisement based on the label.
First, the background (the tag generation device) needs to prepare the data for training the multi-model scheduling module. Before the multi-model scheduling module is added, any picture (historical video frame) processed by the models passes through the I models (the full-scale classification model) in the background, which generates a plurality of labels: some models each generate one label, and some models generate several labels. For that picture, the importance score of the i-th model (the training scheduling index corresponding to the current classification model) is the normalized result of multiplying the confidence score of each generated label (the current tag confidence) by the revenue of the label (the current tag revenue index), then by the label accuracy (the current tag accuracy), and finally by the model speed (the current model speed). The confidence score is generated by the model, which outputs a record of it; the revenue of a label is provided by the front end and equals the sum of the investment prices of the advertisers that select that label; the label accuracy is manually corrected and is generally 1, or is set to 0 if the label was produced incorrectly; the model speed is the number of pictures processed per second, measured by staff when debugging the model, and does not change as long as the model itself is unchanged.
Through the mode, each picture can obtain an I x 1 dimensional label (a plurality of training scheduling indexes corresponding to the full-scale classification model).
After the importance scores are constructed, the multi-model scheduling module (the preset scheduling index prediction model) can be trained. Here, a network structure like that of fig. 9 may be trained, and a general network structure, such as a VGG16 network structure, may also be trained. Fig. 10 is an exemplary diagram of a general network structure provided in an embodiment of the present application. As shown in fig. 10, the general network consists of 5 network modules, pooling layers, and fully connected layers. The network module 10-1 has an input size of 224 × 224 and consists of two 3 × 3, 64-channel convolution layers; the network module 10-2 consists of two 3 × 3, 128-channel convolution layers, with an input size of 112 × 112; the network module 10-3 consists of three 3 × 3, 256-channel convolution layers, with an input size of 56 × 56; the network module 10-4 consists of three 3 × 3, 512-channel convolution layers, with an input size of 28 × 28; the network module 10-5 consists of three 3 × 3, 512-channel convolution layers, with a corresponding input size of 14 × 14. A pooling layer 10-6 is connected after each network module; after the last pooling layer a global pooling layer (not shown) is connected, followed by a fully connected layer 10-7 with 1024 units, then a fully connected layer 10-8 with 128 units, and finally an output layer 10-9.
Correspondingly, the embodiment of the present application provides the output and parameter quantity of each network layer of the general network, see table 2:
TABLE 2
Figure 19279DEST_PATH_IMAGE005
Referring to table 2, the output size of the input layer is (none, none, none, 3) (none indicates that the number of input video frames is not limited to each network layer and the length and width size of a picture is not limited), and the parameter number is 0; the output sizes of the convolutional layers in the network module 10-1 are (none, none, none, 64), the parameter number of the first convolutional layer in the network module 10-1 is 1792, and the parameter number of the second convolutional layer in the network module 10-1 is 36928; the output size of the first pooling layer 10-6 is (none, none, 64), and the parameter number is 0; the output sizes of each convolutional layer in the network module 10-2 are (none, none, none, 128), the parameter number of the first convolutional layer of the network module 10-2 is 73856, and the parameter number of the second convolutional layer of the network module 10-2 is 147584; the output size of the second pooling layer 10-6 is (none, none, 128), and the parameter number is 0; the output sizes of each convolutional layer in the network module 10-3 are (none, none, none, 256), the parameter number of the first convolutional layer of the network module 10-3 is 295168, the parameter number of the second convolutional layer of the network module 10-3 is 590080, and the parameter number of the third convolutional layer of the network module 10-3 is 590080; the output size of the third pooling layer 10-6 is (none, none, 256), and the parameter number is 0; the output sizes of each convolutional layer in the network module 10-4 are (none, none, none, 512), the parameter number of the first convolutional layer of the network module 10-4 is 1180160, the parameter number of the second convolutional layer of the network module 10-4 is 2359808, and the parameter number of the third convolutional layer of the network module 10-4 is 2359808; the output size of the fourth pooling layer 10-6 is (none, none, 512), and the parameter number is 0; the output sizes of each convolution layer in the network module 10-5 are (none, none, none, 512), the parameter number of the first convolution layer of the network module 10-5 is 2359808, the parameter number of the second convolution layer of the network module 10-5 is 2359808, and the parameter number of the third convolution layer of the network module 10-5 is 2359808; the output size of the fifth pooling layer 10-6 is (none, none, 512), and the parameter number is 0; the output of global pooling is (none, 512) and the parameter number is 0; the output size of the fully connected layer 10-7 is (none, 1024), and the parameter number is 525312; the output size of the fully connected layer 10-8 is (none, 128), and the parameter number is 131200; the output size of the output layer 10-9 is (none, 5) and the parameter number is 645. It can be seen that the parameter amount of the general model is large, and therefore, in practical application, the preset scheduling index prediction model provided in fig. 9 is recommended to be used.
During training, stochastic gradient descent optimization can be used, the loss function is the mean square error, the network input is RGB three-channel image data, and the network output is the I × 1 model importance scores (the plurality of training scheduling indexes). For example, if the multi-model scheduling module schedules 5 models, the output obtained after inputting a certain picture may be [0.34, 0.12, 0.86, 0.66, 0.07]. After training, the shortest moving distance between the order scored by the multi-model scheduling module (the model index sequence) and the order of the correctly labeled values (the preset model index sequence) can be used as an evaluation index to evaluate the effect of the multi-model scheduling module.
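Continuing the illustrative Keras sketch above, training with stochastic gradient descent and a mean-square-error loss might look as follows; the learning rate, batch size, epoch count, and the placeholder training arrays are arbitrary assumptions made only to show the shapes involved.

```python
import numpy as np
from tensorflow import keras

model = build_scheduling_predictor(num_models=5)   # I = 5 scheduled models, from the sketch above
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss="mse")

# frames: RGB images resized to 64x64; importance: normalized I x 1 scores per frame
frames = np.random.rand(32, 64, 64, 3).astype("float32")      # placeholder training data
importance = np.random.rand(32, 5).astype("float32")          # placeholder training scheduling indexes
model.fit(frames, importance, batch_size=8, epochs=10)
```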
In actual use, after the front end sends a video (the video to be generated) to the background (the label generation device), the background automatically extracts key frames of the video (the at least one video frame to be classified). Each model is then scored with the trained multi-model scheduling module (the preset scheduling index prediction model), and the important models (target models) for a key frame are determined according to the scores (the model scheduling indexes). When determining the important models, the background may send the key frame only to the top n/2 models (the classification models corresponding to the highest preset number of model scheduling indexes), or only to the models whose score exceeds 0.5 (the preset index threshold), or may adopt whichever of the two criteria selects more models for the key frame. During actual prediction, the models can also be run in descending order of the number of pictures assigned to each model (that is, classification and identification are performed according to the execution order). Here, the model queues may run in series. In addition, when the models are run in descending order of picture count, if a certain model reaches the maximum time consumption setting (the current recognition completion time is greater than or equal to the preset maximum time), the models after it are no longer run.
When the model prediction is completed, the label of the key frame can be obtained, the background transmits the label to the front end, and the front end can add the advertisement in the key frame in a 'sticky note' mode according to the transmitted label. Fig. 11 is a schematic diagram of adding an advertisement to a key frame in an actual scene provided in an embodiment of the present application, and when a label of the key frame 11-1 is yoga movement 11-2, the front end may add an advertisement 11-3 related to the movement on the key frame 11-1: sports give you quality life.
By the method, the background can select the important models for the key frames from the models, so that the key frames are classified and identified only by the important models, the number of models needing to identify the key frames is reduced, the effective degree of the generated labels is guaranteed, and the label generation efficiency is improved.
Continuing with the exemplary structure of the tag generation apparatus 255 provided by the embodiments of the present application as software modules, in some embodiments, as shown in fig. 3, the software modules stored in the tag generation apparatus 255 of the memory 250 may include:
the extracting module 2551 is configured to, when a video to be generated is received, extract at least one video frame to be classified from the video to be generated; the at least one video frame to be classified is a key frame in the video to be generated;
a prediction module 2552, configured to obtain a full classification model, and determine, for each to-be-classified video frame in the at least one to-be-classified video frame, a plurality of model scheduling indexes corresponding to the full classification model; wherein each model scheduling index of the plurality of model scheduling indexes characterizes the importance degree of each classification model of the full-scale classification model to each video frame to be classified of the at least one video frame to be classified; the full scale classification model comprises at least one classification model;
a model determining module 2553, configured to determine, based on the plurality of model scheduling indexes, a target model corresponding to each video frame to be classified from the full-scale classification model for the each video frame to be classified;
an identifying module 2554, configured to identify each video frame to be classified by using the target model corresponding to each video frame to be classified, to obtain an identification tag of each video frame to be classified.
In some embodiments of the present application, the model determining module 2553 is specifically configured to rank the plurality of model scheduling indicators of each video frame to be classified, where the plurality of model scheduling indicators of each video frame to be classified represent the importance degree of each classification model in the full-scale classification model for each video frame to be classified; and taking the classification model corresponding to the highest preset number of model scheduling indexes in the plurality of model scheduling indexes of each video frame to be classified as the target model corresponding to each video frame to be classified.
In some embodiments of the present application, each of the full-scale classification models has a model scheduling index corresponding to the at least one video frame to be classified; the model determining module 2553 is specifically configured to compare a model scheduling index corresponding to each to-be-classified video frame in the multiple model scheduling indexes with a preset index threshold to obtain a comparison result; and selecting the classification model with the comparison result representing that the model scheduling index is larger than the preset index threshold value from the full-scale classification model as a target model corresponding to each video frame to be classified.
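For illustration, the two selection strategies handled by the model determining module (taking the top preset number of models by scheduling index, or thresholding the indexes) could be sketched as follows; the function and parameter names are assumptions, and the example scores reuse the [0.34, 0.12, 0.86, 0.66, 0.07] output mentioned in the application scenario above.

```python
def select_target_models(scheduling_indexes, top_k=None, threshold=None):
    """scheduling_indexes: dict mapping a classification model id to its model scheduling index.

    Returns the target models for one video frame to be classified: either the
    `top_k` highest-scoring models, or every model whose index exceeds `threshold`.
    """
    if top_k is not None:
        ranked = sorted(scheduling_indexes, key=scheduling_indexes.get, reverse=True)
        return ranked[:top_k]
    return [m for m, score in scheduling_indexes.items() if score > threshold]

scores = {"m1": 0.34, "m2": 0.12, "m3": 0.86, "m4": 0.66, "m5": 0.07}
print(select_target_models(scores, top_k=2))        # ['m3', 'm4']
print(select_target_models(scores, threshold=0.5))  # ['m3', 'm4']
```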
In some embodiments of the present application, the identifying module 2554 is specifically configured to determine a video frame to be classified corresponding to each target model as a target video frame; counting the number of the target video frames of each target model to obtain the number of the target video frames of each target model; determining a corresponding execution sequence for each target model by using the number of the target video frames; according to the execution sequence, identifying the target video frame corresponding to each target model by using each target model to obtain an identification label corresponding to the target video frame; and after the target video frame corresponding to each target model is identified, obtaining an identification label corresponding to each video frame to be classified.
In some embodiments of the present application, the identifying module 2554 is further specifically configured to compare the execution sequence with a preset sequence to obtain a sequence comparison result; the sequence comparison result represents the front-back relation between the execution sequence and the preset sequence; when the sequence comparison result represents that the execution sequence is before the preset sequence, identifying the target video frame corresponding to each target model by using each target model to obtain an identification tag corresponding to the target video frame; and when the sequence comparison result represents that the execution sequence is behind the preset sequence, finishing identifying the target video frame corresponding to each target model by using each target model.
In some embodiments of the application, the identifying module 2554 is further specifically configured to, when the current target model corresponding to the current order is used to identify the current target video frame corresponding to the current target model, obtain a current identification completion time corresponding to the current target model; the current sequence is any one of the execution sequences corresponding to each target model; and when the current identification completion time is greater than or equal to the preset maximum time, stopping utilizing a next target model corresponding to a next sequence of the current sequence to identify a target video frame corresponding to the next target model.
In some embodiments of the present application, the model determining module 2553 is specifically configured to predict, for each to-be-classified video frame, a model scheduling index corresponding to each classification model in the full classification model by using a preset scheduling index prediction model; and when the prediction of model scheduling indexes is completed for all the full-scale classification models, obtaining the plurality of model scheduling indexes corresponding to the full-scale classification models.
In some embodiments of the present application, the label generating device 255 further includes: a model training module 2555;
the model training module 2555 is configured to obtain an initial prediction model, a historical video frame, a plurality of historical label data, a plurality of label revenue indexes corresponding to the plurality of historical label data, and a full-scale model speed corresponding to the full-scale classification model; the historical video frames are classified by utilizing the full-scale classification model at historical time, the plurality of historical label data are label data obtained by identifying the historical video frames by utilizing the full-scale classification model, and the plurality of historical label data and the plurality of label revenue indexes are in one-to-one correspondence; the full-scale model speed comprises a predicted speed of each classification model on the historical video frame; constructing a plurality of training scheduling indexes corresponding to the full-scale classification model aiming at the historical video frame by utilizing the plurality of historical label data, the plurality of label revenue indexes and the full-scale model speed; each training scheduling index in the plurality of training scheduling indexes corresponds to each classification model in the full-scale classification model one to one; and training the initial prediction model by using the historical video frame and the training scheduling indexes until a training stopping condition is reached to obtain the preset scheduling index prediction model.
In some embodiments of the present application, each historical label data of the plurality of historical label data comprises a label confidence and a label accuracy; the model training module 2555 is specifically configured to, for a current classification model in the full-scale classification model, extract current historical label data from the plurality of historical label data and extract a current model speed from the full-scale model speed, the current classification model being any one of the classification models in the full-scale classification model; construct a training scheduling index corresponding to the current classification model by using a current label revenue index corresponding to the current historical label data, the current label accuracy of the current historical label data, the current label confidence and the current model speed; and when corresponding training scheduling indexes have been constructed for all classification models in the full-scale classification model, obtain the plurality of training scheduling indexes.
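The description does not fix the exact combination rule, so the following is only one plausible construction, assuming the scheduling index grows with the label revenue index, label accuracy and label confidence and shrinks with the predicted model speed (interpreted here as runtime in seconds):

```python
def build_training_scheduling_index(revenue, accuracy, confidence, model_speed_seconds):
    """One assumed way to combine the four factors into a single training scheduling index."""
    return (revenue * accuracy * confidence) / max(model_speed_seconds, 1e-6)

def build_all_training_indexes(historical_label_data, revenue_indexes, model_speeds):
    """Construct one training scheduling index per classification model for a historical video frame.

    historical_label_data: dict model name -> {"accuracy": float, "confidence": float}
    revenue_indexes:       dict model name -> label revenue index
    model_speeds:          dict model name -> predicted speed on the historical video frame
    """
    return {
        name: build_training_scheduling_index(
            revenue_indexes[name],
            historical_label_data[name]["accuracy"],
            historical_label_data[name]["confidence"],
            model_speeds[name],
        )
        for name in historical_label_data
    }
```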
In some embodiments of the present application, the model training module 2555 is further configured to predict, for a test video frame, a plurality of test model scheduling indexes corresponding to the full-scale classification model by using the preset scheduling index prediction model, and sort the plurality of test model scheduling indexes to obtain a model index sequence; acquire a preset model index sequence corresponding to the test video frame, the preset model index sequence representing the real order of the importance degree of each classification model in the full-scale classification model to the test video frame; determine the prediction accuracy of the preset scheduling index prediction model according to the model index sequence and the preset model index sequence; and when the prediction accuracy is smaller than a preset accuracy threshold, retrain the initial prediction model by reusing the historical video frame and the plurality of training scheduling indexes until the training stop condition is reached, so as to obtain a latest scheduling index prediction model;

correspondingly, the identifying module 2554 is further configured to predict, for each video frame to be classified, the model scheduling index corresponding to each classification model in the full-scale classification model by using the latest scheduling index prediction model.
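By way of example, one hedged way to realize this check is to measure the fraction of correctly ordered model pairs between the predicted model index sequence and the preset model index sequence; the description only states that the two orders are compared, so the exact accuracy measure below is an assumption.

```python
from itertools import combinations

def ranking_accuracy(predicted_order, true_order):
    """Fraction of model pairs whose relative order in the predicted sequence matches the real sequence."""
    pred_pos = {name: i for i, name in enumerate(predicted_order)}
    true_pos = {name: i for i, name in enumerate(true_order)}
    pairs = list(combinations(true_pos, 2))
    concordant = sum(
        (pred_pos[a] - pred_pos[b]) * (true_pos[a] - true_pos[b]) > 0
        for a, b in pairs
    )
    return concordant / len(pairs) if pairs else 1.0

# Usage (names hypothetical): retrain when the accuracy falls below the preset threshold.
# if ranking_accuracy(predicted_order, true_order) < accuracy_threshold:
#     predictor = train_prediction_model(initial_model, historical_frames, training_indexes)
```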
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the label generation method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to execute the tag generation method provided by the embodiments of the present application, for example, the method shown in fig. 4, fig. 5 or fig. 8.
In some embodiments, the computer-readable storage medium may be a memory such as an FRAM, a ROM, a PROM, an EPROM, an EEPROM, a flash memory, a magnetic surface memory, an optical disk, or a CD-ROM; or may be any device including one of, or any combination of, the above memories.
In some embodiments, the executable tag generation instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of a program, software module, script, or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, the executable tag generation instructions may, but need not, correspond to files in a file system, may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, the executable tag generation instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A tag generation method, comprising:
when a video to be generated is received, extracting at least one video frame to be classified from the video to be generated; the at least one video frame to be classified is a key frame in the video to be generated;
acquiring a full-scale classification model, and determining a plurality of model scheduling indexes corresponding to the full-scale classification model aiming at each video frame to be classified in the at least one video frame to be classified; wherein each model scheduling index of the plurality of model scheduling indexes characterizes the importance degree of each classification model of the full-scale classification model to each video frame to be classified of the at least one video frame to be classified; the full scale classification model comprises at least one classification model;
determining a target model corresponding to each video frame to be classified from the full-scale classification model based on the model scheduling indexes;
and identifying each video frame to be classified by using the target model corresponding to each video frame to be classified to obtain an identification label of each video frame to be classified.
2. The method according to claim 1, wherein the determining, for each video frame to be classified, a target model corresponding to each video frame to be classified from the full-scale classification model based on the plurality of model scheduling indexes comprises:

sorting the plurality of model scheduling indexes of each video frame to be classified, wherein the plurality of model scheduling indexes of each video frame to be classified represent the importance degree of each classification model in the full-scale classification model to each video frame to be classified;

and taking the classification models corresponding to the highest preset number of model scheduling indexes among the plurality of model scheduling indexes of each video frame to be classified as the target model corresponding to each video frame to be classified.
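By way of example, a minimal sketch of this top-N selection for a single video frame, assuming its scheduling indexes are held in a dict keyed by classification model name:

```python
def select_target_models(scheduling_indexes, preset_number):
    """Return the classification models with the highest model scheduling indexes for one frame."""
    ranked = sorted(scheduling_indexes, key=scheduling_indexes.get, reverse=True)
    return ranked[:preset_number]
```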
3. The method of claim 1, wherein each classification model in the full-scale classification model has a model scheduling index corresponding to the at least one video frame to be classified; and the determining, for each video frame to be classified, a target model corresponding to each video frame to be classified from the full-scale classification model based on the plurality of model scheduling indexes comprises:

comparing the model scheduling index corresponding to each video frame to be classified among the plurality of model scheduling indexes with a preset index threshold to obtain a comparison result;

and selecting, from the full-scale classification model, a classification model whose comparison result represents that the model scheduling index is larger than the preset index threshold as the target model corresponding to each video frame to be classified.
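By way of example, the threshold-based variant can be sketched in the same assumed setting:

```python
def select_target_models_by_threshold(scheduling_indexes, index_threshold):
    """Return the classification models whose model scheduling index exceeds the preset index threshold."""
    return [name for name, index in scheduling_indexes.items() if index > index_threshold]
```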
4. The method according to any one of claims 1 to 3, wherein the identifying each video frame to be classified by using the target model corresponding to each video frame to be classified to obtain the identification label of each video frame to be classified comprises:
determining a video frame to be classified corresponding to each target model as a target video frame;
counting the number of the target video frames of each target model to obtain the number of the target video frames of each target model;
determining a corresponding execution sequence for each target model by using the number of the target video frames;
according to the execution sequence, identifying the target video frame corresponding to each target model by using each target model to obtain an identification label corresponding to the target video frame;
and after the target video frame corresponding to each target model is identified, obtaining an identification label corresponding to each video frame to be classified.
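By way of example, the grouping, counting and ordering steps could be sketched as follows; it is not stated whether models with more target video frames run earlier or later, so running them earlier is an assumption here.

```python
from collections import defaultdict

def plan_execution(target_models_per_frame):
    """Group frames by target model, count them, and derive an execution sequence.

    target_models_per_frame: dict mapping frame id -> list of target model names for that frame.
    """
    frames_by_model = defaultdict(list)
    for frame_id, model_names in target_models_per_frame.items():
        for name in model_names:
            frames_by_model[name].append(frame_id)
    # Models with more target video frames are executed first (assumption).
    execution_sequence = sorted(frames_by_model, key=lambda m: len(frames_by_model[m]), reverse=True)
    return frames_by_model, execution_sequence
```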
5. The method of claim 4, wherein identifying the target video frame corresponding to each target model by using each target model according to the execution sequence to obtain the identification tag corresponding to the target video frame comprises:
comparing the execution sequence with a preset sequence to obtain a sequence comparison result; the sequence comparison result represents whether the execution sequence precedes or follows the preset sequence;

when the sequence comparison result represents that the execution sequence precedes the preset sequence, identifying the target video frame corresponding to each target model by using each target model to obtain an identification tag corresponding to the target video frame;

and when the sequence comparison result represents that the execution sequence follows the preset sequence, ending the identification of the target video frame corresponding to each target model by using each target model.
6. The method of claim 4, wherein identifying the target video frame corresponding to each target model by using each target model according to the execution sequence to obtain the identification tag corresponding to the target video frame comprises:
when the current target model corresponding to the current sequence is used for identifying the current target video frame corresponding to the current target model, obtaining the current identification completion time corresponding to the current target model; the current sequence is any one of the execution sequences corresponding to each target model;
and when the current identification completion time is greater than or equal to a preset maximum time, stopping use of the next target model corresponding to the sequence following the current sequence to identify the target video frame corresponding to the next target model.
7. The method according to claim 1, wherein the determining, for each video frame to be classified in the at least one video frame to be classified, a plurality of model scheduling indexes corresponding to the full-scale classification model comprises:
for each video frame to be classified, predicting a model scheduling index corresponding to each classification model in the full-scale classification model by using a preset scheduling index prediction model;
and when model scheduling indexes have been predicted for all classification models in the full-scale classification model, obtaining the plurality of model scheduling indexes corresponding to the full-scale classification model.
8. The method according to claim 7, wherein before predicting the model scheduling index corresponding to each classification model in the full-scale classification model by using a preset scheduling index prediction model for each video frame to be classified, the method further comprises:
acquiring an initial prediction model, a historical video frame, a plurality of historical label data, a plurality of label revenue indexes corresponding to the plurality of historical label data, and a full-scale model speed corresponding to the full-scale classification model;

wherein the historical video frame is a video frame classified by using the full-scale classification model at a historical time, the plurality of historical label data are label data obtained by identifying the historical video frame with the full-scale classification model, and the plurality of historical label data correspond one to one with the plurality of label revenue indexes; the full-scale model speed comprises a predicted speed of each classification model on the historical video frame;

constructing, for the historical video frame, a plurality of training scheduling indexes corresponding to the full-scale classification model by using the plurality of historical label data, the plurality of label revenue indexes and the full-scale model speed, wherein each training scheduling index in the plurality of training scheduling indexes corresponds to each classification model in the full-scale classification model one to one;
and training the initial prediction model by using the historical video frame and the training scheduling indexes until a training stopping condition is reached to obtain the preset scheduling index prediction model.
9. The method of claim 8, wherein each historical label data of the plurality of historical label data comprises a label confidence and a label accuracy; and the constructing, for the historical video frame, a plurality of training scheduling indexes corresponding to the full-scale classification model by using the plurality of historical label data, the plurality of label revenue indexes and the full-scale model speed comprises:

for a current classification model in the full-scale classification model, extracting current historical label data from the plurality of historical label data and extracting a current model speed from the full-scale model speed, the current classification model being any one of the classification models in the full-scale classification model;

constructing a training scheduling index corresponding to the current classification model by using a current label revenue index corresponding to the current historical label data, the current label accuracy of the current historical label data, the current label confidence and the current model speed;

and when corresponding training scheduling indexes have been constructed for all classification models in the full-scale classification model, obtaining the plurality of training scheduling indexes.
10. The method according to claim 8 or 9, wherein after the initial prediction model is trained by using the historical video frame and the plurality of training scheduling indexes until a training stop condition is reached to obtain the preset scheduling index prediction model, and before the model scheduling index corresponding to each classification model in the full-scale classification model is predicted by using the preset scheduling index prediction model for each video frame to be classified, the method further comprises:

for a test video frame, predicting a plurality of test model scheduling indexes corresponding to the full-scale classification model by using the preset scheduling index prediction model, and sorting the plurality of test model scheduling indexes to obtain a model index sequence;
acquiring a preset model index sequence corresponding to the test video frame; the preset model index sequence represents the real sequence of the importance degree of each classification model in the full-scale classification model to the test video frame;
determining the prediction accuracy of the preset scheduling index prediction model according to the model index sequence and the preset model index sequence;
when the prediction accuracy is smaller than a preset accuracy threshold value, training the initial prediction model by reusing the historical video frame and the training scheduling indexes until a training stopping condition is reached to obtain a latest scheduling index prediction model;
correspondingly, the predicting the model scheduling index corresponding to each classification model in the full-scale classification model by using a preset scheduling index prediction model for each video frame to be classified includes:
and predicting, for each video frame to be classified, the model scheduling index corresponding to each classification model in the full-scale classification model by using the latest scheduling index prediction model.
11. The method according to any one of claims 7 to 9, wherein the preset scheduling index prediction model comprises: an input layer, at least two convolution layers, at least two pooling layers, a dimension reduction layer, a full connection layer and an output layer;
the output of the input layer is connected with the input of the at least two convolution layers, the at least two convolution layers and the at least two pooling layers are sequentially and alternately connected, the input of the dimension reduction layer is connected with the output of the at least two pooling layers, the input of the full connection layer is connected with the output of the dimension reduction layer, and the input of the output layer is connected with the output of the full connection layer;

the video frame to be classified enters the at least two convolution layers through the input layer, the at least two pooling layers are used for receiving the feature maps output by the at least two convolution layers, the dimension reduction layer is used for receiving the pooled feature map output by the last pooling layer of the at least two pooling layers, the full connection layer is used for receiving the dimension-reduced data output by the dimension reduction layer, and the output layer is used for receiving the processed data output by the full connection layer and outputting the model scheduling index corresponding to each classification model.
12. The method of claim 11, wherein the at least two convolution layers comprise a first convolution layer and a second convolution layer, wherein the number of channels of the first convolution layer is different from that of the second convolution layer; and the at least two pooling layers comprise a first pooling layer and a second pooling layer;
the output of the input layer is connected with the input of the first convolution layer, the input of the first pooling layer is connected with the output of the first convolution layer, the input of the second convolution layer is connected with the output of the first pooling layer, the input of the second pooling layer is connected with the output of the second convolution layer, and the input of the dimension reduction layer is connected with the output of the second pooling layer.
13. The method of claim 12, wherein the convolution kernel of the first convolution layer has a size of 3 × 3 and the first convolution layer has 8 channels; and the convolution kernel of the second convolution layer has a size of 3 × 3 and the second convolution layer has 16 channels.
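By way of example, a hedged PyTorch sketch of the network laid out in claims 11 to 13 (input layer, two 3 × 3 convolution layers with 8 and 16 channels alternated with two pooling layers, a dimension reduction layer, a full connection layer, and an output layer emitting one scheduling index per classification model); the input resolution, pooling type, activation functions and hidden width are assumptions, since they are not specified here.

```python
import torch
import torch.nn as nn

class SchedulingIndexPredictor(nn.Module):
    """Input -> conv 3x3 (8 ch) -> pool -> conv 3x3 (16 ch) -> pool -> flatten -> fc -> output."""

    def __init__(self, num_models, in_channels=3, input_size=64):     # input_size is an assumed resolution
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 8, kernel_size=3, padding=1)   # first convolution layer
        self.pool1 = nn.MaxPool2d(2)                                        # first pooling layer
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)             # second convolution layer
        self.pool2 = nn.MaxPool2d(2)                                        # second pooling layer
        self.flatten = nn.Flatten()                                         # dimension reduction layer
        self.fc = nn.Linear(16 * (input_size // 4) ** 2, 128)               # full connection layer
        self.out = nn.Linear(128, num_models)                               # one index per classification model

    def forward(self, x):
        x = self.pool1(torch.relu(self.conv1(x)))
        x = self.pool2(torch.relu(self.conv2(x)))
        x = torch.relu(self.fc(self.flatten(x)))
        return self.out(x)
```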
14. A label generation apparatus, comprising:
a memory to store executable tag generation instructions;
a processor for implementing the tag generation method of any one of claims 1 to 13 when executing executable tag generation instructions stored in the memory.
15. A computer-readable storage medium having stored thereon executable tag generation instructions for, when executed by a processor, implementing the tag generation method of any one of claims 1 to 13.
CN202010835038.6A 2020-08-19 2020-08-19 Label generation method and device and computer readable storage medium Active CN111708913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010835038.6A CN111708913B (en) 2020-08-19 2020-08-19 Label generation method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111708913A true CN111708913A (en) 2020-09-25
CN111708913B CN111708913B (en) 2021-01-08

Family

ID=72547162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010835038.6A Active CN111708913B (en) 2020-08-19 2020-08-19 Label generation method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111708913B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710800A (en) * 2018-11-08 2019-05-03 北京奇艺世纪科技有限公司 Model generating method, video classification methods, device, terminal and storage medium
CN110163115A (en) * 2019-04-26 2019-08-23 腾讯科技(深圳)有限公司 A kind of method for processing video frequency, device and computer readable storage medium
US20190333520A1 (en) * 2018-04-30 2019-10-31 International Business Machines Corporation Cognitive print speaker modeler
CN110688482A (en) * 2019-09-12 2020-01-14 新华三大数据技术有限公司 Multi-label identification method, training method and device
JP2020009141A (en) * 2018-07-06 2020-01-16 株式会社 日立産業制御ソリューションズ Machine learning device and method
CN111291688A (en) * 2020-02-12 2020-06-16 咪咕文化科技有限公司 Video tag obtaining method and device
CN111368138A (en) * 2020-02-10 2020-07-03 北京达佳互联信息技术有限公司 Method and device for sorting video category labels, electronic equipment and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112312205A (en) * 2020-10-21 2021-02-02 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and computer storage medium
CN112312205B (en) * 2020-10-21 2024-03-22 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and computer storage medium
CN112580750A (en) * 2020-12-30 2021-03-30 Oppo广东移动通信有限公司 Image recognition method and device, electronic equipment and storage medium
CN114866788A (en) * 2021-02-03 2022-08-05 阿里巴巴集团控股有限公司 Video processing method and device
WO2023035896A1 (en) * 2021-09-08 2023-03-16 北京有竹居网络技术有限公司 Video recognition method and apparatus, readable medium, and electronic device
CN114390368A (en) * 2021-12-29 2022-04-22 腾讯科技(深圳)有限公司 Live video data processing method and device, equipment and readable medium
CN114390368B (en) * 2021-12-29 2022-12-16 腾讯科技(深圳)有限公司 Live video data processing method and device, equipment and readable medium

Also Published As

Publication number Publication date
CN111708913B (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN111708913B (en) Label generation method and device and computer readable storage medium
CN109325148A (en) The method and apparatus for generating information
CN110837579A (en) Video classification method, device, computer and readable storage medium
CN108629043A (en) Extracting method, device and the storage medium of webpage target information
CN112100438A (en) Label extraction method and device and computer readable storage medium
CN103299324A (en) Learning tags for video annotation using latent subtags
US10685236B2 (en) Multi-model techniques to generate video metadata
CN111783712A (en) Video processing method, device, equipment and medium
EP3623998A1 (en) Character recognition
CN110851641A (en) Cross-modal retrieval method and device and readable storage medium
CN114283351A (en) Video scene segmentation method, device, equipment and computer readable storage medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
WO2022148108A1 (en) Systems, devices and methods for distributed hierarchical video analysis
CN111797765B (en) Image processing method, device, server and storage medium
CN115935049A (en) Recommendation processing method and device based on artificial intelligence and electronic equipment
WO2021147084A1 (en) Systems and methods for emotion recognition in user-generated video(ugv)
CN116484085A (en) Information delivery method, device, equipment, storage medium and program product
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN113762324A (en) Virtual object detection method, device, equipment and computer readable storage medium
CN115130453A (en) Interactive information generation method and device
CN111523318A (en) Chinese phrase analysis method, system, storage medium and electronic equipment
Vrochidis et al. A multi-modal audience analysis system for predicting popularity of online videos
CN112989869A (en) Optimization method, device and equipment of face quality detection model and storage medium
CN117156078B (en) Video data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant