WO2024114341A1 - Method, apparatus and device for video content recognition and model training - Google Patents

Method, apparatus and device for video content recognition and model training

Info

Publication number
WO2024114341A1
Authority
WO
WIPO (PCT)
Prior art keywords
label
model
content
tag
target
Prior art date
Application number
PCT/CN2023/131057
Other languages
English (en)
French (fr)
Inventor
李斌泉
Original Assignee
百果园技术(新加坡)有限公司
李斌泉
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司 and 李斌泉
Publication of WO2024114341A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0985Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Definitions

  • the present application relates to the field of image processing technology, for example, to a video content recognition method, a model training method, a video content recognition device, a model training device, an electronic device, a computer-readable storage medium and a computer program product.
  • Video content recognition technology mainly applies deep learning technology: after fitting parameters with a certain generalization capability on large-scale data, it recognizes new data.
  • The main characteristics of this technology are: large-scale data, high data purity, and a black box that cannot be interpreted.
  • High data purity: large-scale data alone is still not enough to fit the required model capability, because the composition and purity of the data directly determine the accuracy of model recognition. In actual business scenarios, the data that the model recognizes must be highly accurate to achieve optimal use of human resources in auditing or machine-labeling applications.
  • the high purity of the data and the need to combine it with a large scale make the cost of building the training set very high.
  • the non-interpretable nature of the black box means that the model cannot be directly migrated or reused between multiple businesses.
  • the model can only be customized for each business based on the business data, repeatedly wasting manpower and machine resources.
  • the present application provides a method, apparatus and device for video content recognition and model training, which reduces the cost of collecting data sets, improves the portability and reusability of label models, and reduces migration or reuse costs.
  • the present application provides a video content recognition method, the method comprising:
  • using a pre-generated general detection model to perform target detection on a target video to obtain a first detection result;
  • selecting at least one target label model from a plurality of pre-generated label models according to the first detection result, wherein each label model has a corresponding content label, and each label model is a model generated by using a pre-generated meta-model to perform hyper-parameter learning on the training data set of the corresponding content label, and then training based on the obtained hyper-parameters and the training data set;
  • using the at least one target label model to perform label detection on the target video to obtain at least one second detection result;
  • combining the at least one second detection result to generate a video content recognition result.
  • the present application provides a model training method, the method comprising:
  • acquiring a tag set, wherein the tag set includes at least two tag levels, and each tag level has at least one content tag;
  • for each content tag, respectively acquiring a training data set of each content tag, wherein the size of the training data set is less than a set small-scale threshold;
  • using a pre-generated meta-model to perform hyper-parameter learning on the training data set of each content tag to obtain hyper-parameters of each content tag;
  • training a label model of each content tag according to the hyper-parameters of each content tag and the training data set of each content tag.
  • the present application provides a video content recognition device, the device comprising:
  • a general detection module configured to use a pre-generated general detection model to perform target detection on a target video to obtain a first detection result
  • a target label model determination module configured to select at least one target label model from a plurality of pre-generated label models according to the first detection result, wherein each label model has a corresponding content label, and each label model is a model generated by using a pre-generated meta-model to perform hyper-parameter learning on the training data set of the corresponding content label, and then training based on the obtained hyper-parameters and the training data set;
  • a tag detection module configured to use the at least one target tag model to perform label detection on the target video to obtain at least one second detection result;
  • the video content recognition result generation module is configured to generate a video content recognition result in combination with the at least one second detection result.
  • the present application provides a model training device, the device comprising:
  • a tag set acquisition module configured to acquire a tag set, wherein the tag set includes at least two tag levels, and each tag level has at least one content tag;
  • a training data set acquisition module configured to acquire, for each content tag, a training data set for each content tag, respectively, wherein the number of the training data sets is less than a set small-scale threshold;
  • a hyperparameter determination module configured to use a pre-generated meta-model to perform hyperparameter learning on the training data set of each content label to obtain a hyperparameter of each content label;
  • the label model training module is configured to train the label model of each content label according to the hyperparameters of each content label and the training data set of each content label.
  • the present application provides an electronic device, the electronic device comprising:
  • at least one processor; and
  • a memory communicatively coupled to the at least one processor;
  • the memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the method described above.
  • the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the computer instructions are used to enable a processor to implement the above-mentioned method when executed.
  • the present application provides a computer program product, which includes computer-executable instructions; when executed, the computer-executable instructions are used to implement the above-mentioned method.
  • FIG1 is a flow chart of a model training method provided in Embodiment 1 of the present application.
  • FIG2 is a flow chart of a video content recognition method provided in Embodiment 2 of the present application.
  • FIG3 is a schematic diagram of the structure of a video content recognition device provided in Embodiment 3 of the present application.
  • FIG4 is a schematic diagram of the structure of a model training device provided in Embodiment 4 of the present application.
  • FIG5 is a schematic diagram of the structure of an electronic device provided in Embodiment 5 of the present application.
  • FIG1 is a flowchart of a model training method provided in Embodiment 1 of the present application. This method embodiment can be applied in a server to train a label model for outputting the probability that the current input data belongs to a corresponding content label.
  • this embodiment may include the following steps:
  • Step 101 Acquire a tag set, wherein the tag set includes at least two tag levels, and each tag level has at least one content tag.
  • a tag set includes three tag levels, namely, a primary tag, a secondary tag, and a tertiary tag
  • the content tag of the primary tag is the initial content tag
  • the content tag of the secondary tag is the content tag obtained by decomposing the initial content tag
  • the content tag of the tertiary tag is the content tag obtained by further decomposing the content tag of the secondary tag.
  • the decomposition method of the content tag may include decomposition in multiple dimensions such as content and data format. For example, assuming that the content tag of the first-level tag is "beautiful girl dancing", “beautiful girl dancing” can be decomposed into two second-level tags “beauty” and “dance” according to the content dimension. Then, each content tag of the second-level tag can be further decomposed, and “beauty” can be decomposed into two third-level tags “gender” and “appearance”, and “dance” can be decomposed into three third-level tags "popular dance", "traditional dance” and “sports dance”.
  • step 101 may include the following steps: receiving configuration data input by a user, and extracting a tag set from the configuration data.
  • the server can provide a configuration entry to users (such as developers or product managers).
  • the user inputs the configuration data of the tag set through the configuration entry.
  • the server parses out the tag set.
  • the tag set may be recorded in a tree structure, a list structure, a key-value pair, etc., and this embodiment does not limit this, as long as different tag levels and content tags under each tag level can be distinguished.
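  • As a purely illustrative sketch (not part of the original disclosure), the fine-grained tag set described above could, for example, be recorded as a nested tree structure and flattened into (tag level, content tag) pairs; the tag names follow the "beautiful girl dancing" example, and all helper names are hypothetical:
```python
# Hypothetical sketch: record a hierarchical tag set as a nested dict (tree
# structure) and flatten it into (tag_level, content_tag) pairs.
# Tag names follow the "beautiful girl dancing" example above; all helper
# names are illustrative, not part of the original disclosure.
TAG_SET_CONFIG = {
    "beautiful girl dancing": {            # level-1 (primary) tag
        "beauty": {                        # level-2 tags
            "gender": {},                  # level-3 tags
            "appearance": {},
        },
        "dance": {
            "popular dance": {},
            "traditional dance": {},
            "sports dance": {},
        },
    },
}

def flatten_tag_set(node, level=1):
    """Yield (tag_level, content_tag) pairs from the nested configuration."""
    for tag, children in node.items():
        yield level, tag
        yield from flatten_tag_set(children, level + 1)

if __name__ == "__main__":
    for lvl, tag in flatten_tag_set(TAG_SET_CONFIG):
        print(f"level {lvl}: {tag}")
```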
  • The more tag levels there are and the finer the decomposition granularity of the content tags, the clearer the covered content, which lowers the difficulty of collecting high-purity positive samples during rapid training iterations; and the finer the granularity, the fewer positive samples are needed, which further increases the iteration rate. Each content label can participate in multiple businesses, and sufficiently fine-grained content labels can avoid the need to re-collect data to train labels for new businesses, thereby enhancing the reusability and portability of the subsequently trained models.
  • Step 102 For each content tag, a training data set of each content tag is obtained respectively.
  • The size of the training data set for each content label is less than the set small-scale threshold, making the training data set for each content label a small-scale data set; for example, the number of positive samples is on the order of tens of thousands. Based on actual time consumption in projects, positive samples of this order can generally be collected and labeled within a few days, and algorithm personnel can quickly balance the composition of the training data set to meet the requirements of high data purity and well-balanced data components. This improves the training speed and accuracy of the model, so that training of the label model can be completed within a short R&D cycle.
  • Step 103 Use the pre-generated meta-model to perform hyper-parameter learning on the training data set of the current content label to obtain the hyper-parameters of the current content label.
  • the label model to be trained in this embodiment may be a deep learning model.
  • a small-scale data set can improve the speed and accuracy of model training
  • deep learning technology is prone to overfitting on a small-scale data set, making the model fail to achieve the expected generalization effect.
  • optional measures may include: selecting a small and shallow network structure, adding regularization during training, using a specific loss function for suppression, enriching the data composition at the sampling step, and so on.
  • transfer learning based on public pre-trained models on large-scale open source data sets can also avoid overfitting to a certain extent.
  • For the label set of step 101, if the number of content labels is large and each fine-grained content label starts iterating from a model pre-trained on a large-scale open-source data set, then training a small model with good precision-recall performance and strong generalization ability requires algorithm personnel to have considerable experience in tuning multiple hyperparameters and designing training iterations.
  • However, once the model training of every fine-grained label requires careful intervention by algorithm personnel, it becomes difficult to implement a short-cycle rapid-iteration plan effectively, because algorithm manpower may become a bottleneck for R&D progress.
  • an additional meta-model can be trained to guide the learning of the hyperparameters of the label model of each content label.
  • The neural network of a label model learns the model's weight parameters through data iteration, while the hyperparameters involved in training, such as the model's learning rate, decay rate and initial weight parameters, need to be manually tuned and set by algorithm personnel.
  • Different hyperparameter configurations are used for different data sets; when the hyperparameter configuration matches the data set distribution well, it greatly helps to train better results faster and can considerably improve training efficiency. Hyperparameters such as initial weight parameters, learning rate and decay rate that are adapted to a specific data set allow the required model performance to be reached quickly.
  • The purpose of the meta-model is to automate this exploration process to improve the efficiency of algorithm manpower, ultimately making it possible to iterate many satisfactory label models in a short time with a small amount of algorithm manpower.
  • step 103 may include the following steps:
  • Step 103-1: randomly extract, from the training data set of the current content tag, a first data subset not exceeding a first set ratio and a second data subset not exceeding a second set ratio.
  • This embodiment aims to learn the hyperparameters of the label model using a small amount of data. Therefore, a small amount of data can be randomly extracted from the training data set of the current content label to obtain the first data subset and the second data subset.
  • The first set ratio and the second set ratio can be relatively small; for example, the first set ratio can be set to 10% and the second set ratio to 20%. This preserves the generalization ability of the final training stage, so that the final training stage can learn as much previously unseen data as possible, and, with the help of the meta-model, the initial hyper-parameters can be learned from a small amount of data.
  • the first set ratio may be set smaller than the second set ratio, so that the amount of data increases during the update process from step 103-2 to step 103-3, thereby avoiding overfitting of the model and improving its generalization ability.
  • Step 103-2: iterate the initial deep learning model using the first data subset to obtain initial weight parameters.
  • In this step, the first data subset is used to briefly iterate the initial deep learning model to learn the initial weight parameters, which are recorded as Weights-1.
  • Step 103-3: use the second data subset to continue iterating the deep learning model based on the initial weight parameters, and calculate the loss of each iteration and the gradient value of the loss.
  • In this step, the second data subset, which contains more data, is used to continue iterating the deep learning model initialized with Weights-1, and the loss of each iteration and the gradient value of that loss are calculated.
  • This embodiment does not limit the specific method of calculating the loss and the gradient value.
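  • The following is a hedged, minimal sketch of how steps 103-1 to 103-3 might look in practice, assuming a PyTorch binary-classification label model and a data set that yields (input, label) pairs; the function names, iteration counts and loss choice are illustrative assumptions rather than the patent's prescribed implementation:
```python
# Illustrative sketch (assumptions: PyTorch model, dataset[i] -> (x, y) with a
# float 0/1 label, BCE loss). Only the structure of steps 103-1 to 103-3 is
# shown; the real loss/gradient computation is not limited by the embodiment.
import random
import torch
import torch.nn as nn

def sample_subsets(dataset, first_ratio=0.1, second_ratio=0.2):
    """Step 103-1: randomly draw two small, disjoint index subsets."""
    indices = list(range(len(dataset)))
    random.shuffle(indices)
    n1 = int(len(indices) * first_ratio)
    n2 = int(len(indices) * second_ratio)
    return indices[:n1], indices[n1:n1 + n2]

def brief_iteration(model, dataset, indices, lr=1e-3, steps=50):
    """Briefly iterate the model on one subset; record losses and grad norms."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    losses, grad_norms = [], []
    for step in range(steps):
        x, y = dataset[indices[step % len(indices)]]      # assumed sample format
        optimizer.zero_grad()
        loss = criterion(model(x.unsqueeze(0)), y.view(1, -1))
        loss.backward()
        grad_norms.append(torch.norm(torch.cat(
            [p.grad.flatten() for p in model.parameters() if p.grad is not None]
        )).detach())
        optimizer.step()
        losses.append(loss.detach())
    return losses, grad_norms

# Step 103-2: first subset -> initial weights (Weights-1), learned in place.
# Step 103-3: second subset -> continue from Weights-1, keeping per-iteration
# losses and gradients, which are then fed to the meta-model in step 103-4.
# first_idx, second_idx = sample_subsets(train_set)
# brief_iteration(model, train_set, first_idx)
# losses, grads = brief_iteration(model, train_set, second_idx)
```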
  • Step 103-4: input the initial weight parameters, the loss obtained in each iteration, and the gradient values into a pre-generated meta-model for hyper-parameter learning to obtain the hyper-parameters of the current content label.
  • This embodiment uses a meta-model to perform hyper-parameter learning on label models of multiple different content labels.
  • For the meta-model, a certain memory capability is necessary, because a small amount of data (the first data subset and the second data subset) is used to quickly find the hyper-parameters in the label model training tasks of different content labels. Therefore, the meta-model can adopt a network structure with a memory function, such as a long short-term memory (LSTM) shallow network, a recurrent neural network (RNN) model with memory units, a transformer (a neural network using an attention mechanism) model, or a machine learning model.
  • Thanks to the memory capability of the meta-model, the meta-model can be applied to the training of multiple label models, thereby realizing reuse of the meta-model and improving efficiency.
  • the meta-model learns and updates the initial weight parameters, learning rate and decay rate in the deep learning model, and uses the new weight parameters, learning rate and decay rate as the hyperparameters of the current content label.
  • In one implementation, the relationship among the weight parameter, learning rate, decay rate and gradient value is as follows: c_t = f_t * c_{t-1} - i_t * ∇_t, i.e., the new weight parameter equals the decayed previous weight parameter minus the learning-rate-scaled gradient, where c_t is the weight parameter at time t, f_t is the decay rate at time t, i_t is the learning rate at time t, ∇_t is the gradient value at time t, and c_{t-1} is the weight parameter at time t-1.
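  • As a toy illustration only, the cell-state-style update above can be written as an LSTM-like gating module in which the decay rate plays the role of a forget gate and the learning rate the role of an input gate; the gating network below and the sign convention are assumptions consistent with common meta-learning formulations, not a verbatim description of the patent's meta-model:
```python
# Toy illustration of c_t = f_t * c_{t-1} - i_t * grad_t. The gating network
# mapping (loss, gradient, previous weight) to the two gates is an assumed,
# minimal design; the loss is broadcast to the shape of the parameters.
import torch
import torch.nn as nn

class HyperparamCell(nn.Module):
    """Decay rate acts like a forget gate, learning rate like an input gate,
    applied element-wise to the label model's weight parameters."""
    def __init__(self, hidden=32):
        super().__init__()
        self.gates = nn.Sequential(nn.Linear(3, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 2), nn.Sigmoid())

    def forward(self, c_prev, loss, grad):
        feats = torch.stack([loss.expand_as(grad), grad, c_prev], dim=-1)
        f_t, i_t = self.gates(feats).unbind(-1)   # decay rate, learning rate
        c_t = f_t * c_prev - i_t * grad           # c_t = f_t*c_{t-1} - i_t*grad_t
        return c_t, f_t, i_t

# Example of one meta-model update step over a flattened weight vector:
# cell = HyperparamCell()
# c_t, f_t, i_t = cell(weights_prev, loss_t, grad_t)
```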
  • Step 104: train a label model for each content label according to the hyperparameters of each content label and the training data set of each content label.
  • a label model of the content label may be generated by using a full training data set to perform model training based on hyperparameters of the content label.
  • the structure of the label model can be determined based on the hyperparameters, and then the model can be trained using the full training data set to obtain the final weight parameters that meet the conditions for precision and generalization ability, and the label model of the content label can be obtained based on the final weight parameters.
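  • A minimal sketch, under the assumption of a PyTorch model and standard SGD, of how the learned hyper-parameters (initial weights, learning rate, decay rate) might configure the final full-data training of one content label's model; the dictionary keys, optimizer choice and loop details are illustrative, and mapping the decay rate to weight decay is only one possible interpretation:
```python
# Hypothetical sketch of step 104: configure final training on the full
# training data set from the meta-model's hyper-parameters. All names and the
# optimizer mapping are assumptions, not the patent's prescribed choices.
import torch

def train_label_model(model, full_loader, hyperparams, epochs=10):
    model.load_state_dict(hyperparams["initial_weights"])     # learned init
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=hyperparams["learning_rate"],
                                weight_decay=hyperparams["decay_rate"])
    criterion = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for frames, label in full_loader:
            optimizer.zero_grad()
            loss = criterion(model(frames), label)
            loss.backward()
            optimizer.step()
    return model   # label model outputting P(input belongs to the content label)
```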
  • In practice, the form of the training data differs for different content labels.
  • For example, if a content label includes an action label with time-sequence characteristics, the training data in the corresponding training data set is a frame sequence of a set length, such as a frame sequence including 8 or 15 video frames.
  • For a non-action content label, one piece of training data is a single video frame.
  • This embodiment provides a solution on how to use a small amount of data to build a framework that can be easily migrated or reused between multiple businesses.
  • In order that one set of models can be migrated or reused across multiple businesses while still performing satisfactorily in each business, this embodiment uses fine-grained content labels, a single-dimension small-scale training data set corresponding to each content label, and the hyperparameter learning process of the meta-model, which makes it possible to train a large number of reusable and transferable label models. This reduces model migration costs and data set collection costs, improves the data purity of the data sets, and yields more accurate label models. At the same time, a great deal of iterative development time can be saved in the training stage of the many label models, for example 30%-50% of iterative development time.
  • FIG2 is a flow chart of a video content recognition method provided in Embodiment 2 of the present application.
  • This method embodiment can be applied in a server and uses the label models mentioned in Embodiment 1 to recognize video content. It can be applied in scenarios such as content review and content tagging, and is also suitable for scenarios where complex content from multiple businesses must be recognized within a relatively short R&D cycle.
  • this embodiment may include the following steps:
  • Step 201 Use a pre-generated universal detection model to perform target detection on a target video to obtain a first detection result.
  • the general detection model may include a detection model that outputs basic signals for other label models, for example, a face detection model for outputting face detection results, a scene detection model for outputting scene types, a target detection model for outputting target locations, etc.
  • the target video may come from different services, for example, may include live video, video files imported by users, etc., which is not limited in this embodiment.
  • the target video may be input into the universal detection model, and the universal detection model performs corresponding target detection on the target video and outputs a first detection result.
  • If there is more than one universal detection model, according to a preset topology strategy, the target video can be input into different universal detection models in parallel, in which case the output of each universal detection model is a first detection result; alternatively, the universal detection models can be connected in series: the target video is input into the first universal detection model, the output of the first universal detection model is passed to the second universal detection model, and so on until the last universal detection model has finished processing, and the output of the last universal detection model is taken as the first detection result, or the output of every universal detection model is taken as a first detection result.
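  • A hypothetical sketch of the two orchestration modes described above, assuming each general detection model is callable on the video or on the previous model's output; the interfaces are assumptions:
```python
# Illustrative sketch of parallel and serial topologies for the general
# detection models; each "model" is assumed to be a callable returning a
# detection result (dict, signal, etc.).
def detect_parallel(video, models):
    """Each general detection model sees the raw target video independently;
    every output is treated as a first detection result."""
    return [model(video) for model in models]

def detect_serial(video, models):
    """Each model's output feeds the next model; the last output (or every
    intermediate output) can be used as the first detection result."""
    signal = video
    outputs = []
    for model in models:
        signal = model(signal)
        outputs.append(signal)
    return outputs[-1], outputs
```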
  • Step 202 Select one or more target label models from a plurality of pre-generated label models according to the first detection result.
  • Each label model has a corresponding content label.
  • Each label model is a model generated by using a pre-generated meta-model to perform hyper-parameter learning on the training data set of the corresponding content label, and then training based on the obtained hyper-parameters and the training data set.
  • For the training process of the label model, please refer to the description of Embodiment 1.
  • the content tags in this embodiment are fine-grained content tags, and each content tag has a corresponding tag model.
  • a tag model matching the first detection result can be selected from all tag models generated in advance as the target tag model.
  • step 202 may include the following steps: selecting a label model with the first detection result as an input signal from a plurality of pre-generated label models, and using the label model as a target label model.
  • For example, suppose a content tag is an "age" tag. In order to detect a person's age, a face detection model can be used to perform face detection and obtain a face signal, and the face signal is then input to the "age" tag model to obtain the corresponding age type. In this case, if the face detection model is the general detection model and the face signal is the first detection result, then the "age" tag model is the target tag model.
  • In other words, if the general detection model is a face detection model and the first detection result is a face signal, all tag models that take the "face signal" as their input signal can be used as target tag models.
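  • One possible (hypothetical) way to implement this selection is to keep a registry recording which input signal each label model consumes and to pick every label model whose declared input signal appears in the first detection result; the signal names and registry layout below are illustrative assumptions:
```python
# Illustrative sketch of step 202: select target label models by matching
# their declared input signal against the first detection result. The model
# entries are placeholders; "face_signal" etc. are hypothetical signal names.
LABEL_MODEL_REGISTRY = {
    "age":           {"input_signal": "face_signal",  "model": None},
    "gender":        {"input_signal": "face_signal",  "model": None},
    "popular dance": {"input_signal": "person_boxes", "model": None},
}

def select_target_label_models(first_detection_result: dict) -> list:
    """Return (content_label, label_model) pairs whose input signal is present
    in the first detection result."""
    targets = []
    for tag, entry in LABEL_MODEL_REGISTRY.items():
        if entry["input_signal"] in first_detection_result:
            targets.append((tag, entry["model"]))
    return targets
```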
  • Step 203 Use one or more target label models to perform label detection on the target video to obtain one or more second detection results.
  • one or more target label models can run in parallel.
  • the target video can be input into multiple target label models for processing respectively, and the output result of each target label model is used as the second detection result; in other implementations, multiple target label models can also run in series.
  • the output of the previous target label model can be used as the input of the next target label model until the last target label model is processed, and its output result is used as the second detection result.
  • step 203 may include the following steps:
  • Step 203-1 obtaining the priority of each target label model.
  • the label model may also carry priority information, and the priority of the target label model may be determined according to the priority information of the target label model.
  • the priority information of the tag model can be configured by the user.
  • Step 203-2 Determine the calling order of one or more target tag models according to the priority.
  • For example, it may be specified that a target label model with a higher priority is called first; in this case, the target label models can be sorted by priority from high to low, and the sorted order is used as the calling order.
  • Step 203-3 calling each target label model in turn according to the calling order to perform label detection on the target video.
  • the signal can be transmitted among the multiple target label models according to the calling sequence, and the signal output by the target label model ranked last is used as the second detection result.
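  • A minimal sketch of steps 203-1 to 203-3, under the assumption that each target label model is callable and carries configured priority information; higher-priority models are called first and the last output is taken as the second detection result:
```python
# Illustrative sketch of priority-ordered serial calling of target label
# models; the entry format {"name", "priority", "model"} is an assumption.
def run_by_priority(video, target_models):
    ordered = sorted(target_models, key=lambda m: m["priority"], reverse=True)
    signal = video
    for entry in ordered:
        signal = entry["model"](signal)   # pass the signal along the chain
    return signal                         # second detection result
```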
  • the content tag may include an action tag with a time sequence characteristic.
  • a label model corresponding to the action label is used to perform label detection on the target video, and the detection process may include: dividing the target video into multiple video segments; selecting multiple video frames from each video segment as input data, respectively inputting the input data of the multiple video segments into the label model corresponding to the action label, and obtaining multiple label results output by the label model corresponding to the action label; integrating the multiple label results output by the label model corresponding to the action label for the input data of the multiple video segments to generate a second detection result of the label model corresponding to the action label.
  • the target video can be divided into three video clips. For each video clip, 3 frames are extracted per second. If a video clip is longer than 3 seconds and has more than 9 frames, 9 consecutive frames are randomly extracted. Then, the 9 frames selected from each video clip are input into the label model corresponding to the action label as the input data of the video clip, and the label model corresponding to the action label obtains the output label result for the video clip. In this way, three video clips can obtain three label results, and the fusion of these three label results can obtain the second detection result corresponding to the label model corresponding to the current action label.
  • This embodiment divides the video into video segments and then extracts video frames from each video segment as input data, which can improve the research and development efficiency of action label models and greatly reduce the required positive sample size. While reducing the sampling interval of sequence images, it can still achieve a model accuracy of 80%+ that meets the requirements on a small model and small-scale data (the number of positive sample videos is about 30,000).
  • the label result output by the label model corresponding to the action label is the probability that the input data meets the corresponding action label.
  • the label results of multiple video clips can be integrated in the following manner: multiple label results are calculated according to set calculation rules, and the calculation results are used as the second detection result of the label model corresponding to the action label.
  • The set calculation rule may include averaging, a weighted calculation method, etc., which is not limited in this embodiment.
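  • The segment-wise detection and fusion described above might be sketched as follows, assuming roughly 30 fps input, 3 sampled frames per second, at most 9 frames per segment, and mean fusion as the set calculation rule; the model interface and NumPy-based sampling are illustrative assumptions:
```python
# Illustrative sketch of action-label detection: split the target video into
# segments, sample frames per segment, run the action label model on each
# segment, and fuse the per-segment probabilities by a simple mean.
import numpy as np

def sample_segment_frames(frames, fps=30, frames_per_second=3, max_frames=9):
    step = max(fps // frames_per_second, 1)
    sampled = frames[::step]
    if len(sampled) > max_frames:                     # take 9 consecutive frames
        start = np.random.randint(0, len(sampled) - max_frames + 1)
        sampled = sampled[start:start + max_frames]
    return sampled

def detect_action_label(video_frames, action_model, num_segments=3):
    segments = np.array_split(np.arange(len(video_frames)), num_segments)
    probs = []
    for seg in segments:
        clip = sample_segment_frames([video_frames[i] for i in seg])
        probs.append(action_model(clip))              # probability per segment
    return float(np.mean(probs))                      # fused second detection result
```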
  • Step 204 Generate a video content recognition result by combining one or more second detection results.
  • the video content recognition results are processed differently depending on the application scenario.
  • the video content recognition results can be passed to the recommendation side or the review side for recommendation to users for viewing or to handle violation risks.
  • the video content recognition result is a diversified recognition result that integrates the second detection results of one or more target label models, which satisfies diversified processing needs and can improve recognition accuracy.
  • For example, in a security or sensitive content review scenario, if the target label models include an "age" label model, a "gender" label model and a "body exposure" label model, the output results of these three label models can be combined to obtain the video content recognition result, such as "male, adult, shirtless".
  • the first detection result and the second detection result may also be combined to obtain a video content recognition result.
  • For example, if a "gun" label model detects that a gun is present in the target video, the first detection result of a general detection model such as a scene detection model can be obtained. If the first detection result is a real scene, the output video content recognition result is "a gun is present in a real scene", which can trigger manual review or an alarm. If the first detection result is a game scene, the output video content recognition result is "a gun is present in a game scene", which does not trigger manual review or an alarm, and the recognition result can be recorded or ignored.
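  • As an illustrative sketch only, combining a scene-type first detection result with a "gun" label probability (a second detection result) into a recognition result and follow-up action could look like this; the threshold and result strings are assumptions based on the example above:
```python
# Illustrative sketch of step 204 for the gun/scene example; the 0.5 threshold
# and the returned strings are hypothetical.
def combine_results(scene_type: str, gun_probability: float, threshold=0.5):
    if gun_probability < threshold:
        return {"result": "no gun detected", "action": "ignore"}
    if scene_type == "real":
        return {"result": "a gun is present in a real scene",
                "action": "manual review / alarm"}
    return {"result": "a gun is present in a game scene",
            "action": "record or ignore"}
```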
  • the basic signal output by the general detection model is used as the first detection result, and then one or more target label models are selected from a plurality of pre-generated label models using a policy control method according to the first detection result to perform label detection on the target video, and a second detection result is obtained.
  • the video content recognition result can be obtained by combining the second detection results of the plurality of target label models.
  • Different services have different policy controls.
  • According to the policy control, the target label models are selected directly from the pre-generated label models, without retraining the network model corresponding to each content label for each different business, thereby improving the portability and reusability of the label models and reducing migration or reuse costs.
  • If a new business is added, this embodiment likewise only needs to develop a parsable comprehensive policy to determine the target label models, or to add some new label models, based on the product requirements of the new business. The existing label models can be reused directly without retraining or adjustment, thereby achieving cross-business migration or reuse of label models.
  • FIG. 3 is a schematic diagram of the structure of a video content recognition device provided in Embodiment 3 of the present application.
  • the device is arranged in a server and may include the following modules: a general detection module 301, configured to use a pre-generated general detection model to perform target detection on a target video to obtain a first detection result; a target label model determination module 302, configured to select at least one target label model from a plurality of pre-generated label models according to the first detection result, wherein each label model has a corresponding content label, and each label model is a model generated by performing hyperparameter learning on a training data set of a corresponding content label using a pre-generated meta-model, and then training the model based on the obtained hyperparameters and the training data set; a label detection module 303, configured to use the at least one target label model to perform label detection on the target video to obtain at least one second detection result; a video content recognition result generation module 304, configured to generate a video content recognition result in combination with the at least one second detection result.
  • the target label model determination module 302 is configured to: select a label model with the first detection result as an input signal from a plurality of pre-generated label models, and use the selected label model as the target label model.
  • the tag detection module 303 is configured to: obtain the priority of each target label model; determine the calling order of the at least one target label model according to the priority; and call each target label model in turn according to the calling order to perform label detection on the target video.
  • the content label includes an action label with a timing characteristic
  • the selected target label model is a label model corresponding to the action label.
  • the label detection module 303 may include the following modules: a video division module, configured to divide the target video into multiple video segments; a video segment processing module, configured to select multiple video frames from each video segment as input data, respectively input the input data of the multiple video segments into the label model corresponding to the action label, and obtain multiple label results output by the label model corresponding to the action label; a label result integration module, configured to integrate the multiple label results output by the label model corresponding to the action label for the input data of multiple video segments, and generate a second detection result of the label model corresponding to the action label.
  • each label result is the probability that the input data meets the action label; the label result integration module is configured to: calculate the multiple label results according to the set calculation rules, and use the calculation result as the second detection result of the label model corresponding to the action label.
  • a video content recognition device provided in the embodiment of the present application can execute a video content recognition method provided in the second embodiment of the present application, and has functional modules and effects corresponding to the execution method.
  • The model training device provided in Embodiment 4 of the present application is arranged in a server and may include the following modules: a label set acquisition module 401, configured to acquire a label set, wherein the label set includes at least two label levels, and each label level has at least one content label;
  • a training data set acquisition module 402 configured to acquire, for each content label, a training data set for each content label, respectively, and the number of the training data sets is less than a set small-scale threshold
  • a hyperparameter determination module 403 configured to perform hyperparameter learning on the training data set for each content label using a pre-generated meta-model to obtain the hyperparameters of each content label
  • a label model training module 404 configured to train a label model for each content label according to the hyperparameters of each content label and the training data set for each content label.
  • the hyperparameter determination module 403 is configured to: randomly extract a first data subset not exceeding a first set ratio and a second data subset not exceeding a second set ratio from the training data set of each content tag, wherein the first set ratio is less than the second set ratio; use the first data subset to iterate the initial deep learning model to obtain initial weight parameters; use the second data subset to continue to iterate the deep learning model based on the initial weight parameters, and calculate the loss of each iteration and the gradient value of the loss; and input the initial weight parameters, the loss obtained in each iteration, and the gradient value of the loss into a pre-generated meta-model for hyperparameter learning to obtain the hyperparameters of each content label.
  • the label model training module 404 is configured to: perform model training based on the hyperparameters of each content label using the full training data set of each content label to generate the label model of each content label.
  • the content label includes an action label with a time sequence characteristic
  • the training data in the training data set of the action label is a frame sequence of a set length
  • the tag set acquisition module 401 is configured to: receive configuration data input by a user; and extract a tag set from the configuration data.
  • The model training device provided in this embodiment of the present application can execute the model training method provided in Embodiment 1 of the present application, and has functional modules and effects corresponding to the executed method.
  • FIG5 shows a schematic diagram of the structure of an electronic device 10 that can be used to implement the method embodiment of the present application.
  • the electronic device 10 includes at least one processor 11, and a storage device that is communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12, a random access memory (RAM) 13, etc., wherein the storage device stores one or more computer programs that can be executed by at least one processor, and the processor 11 can perform a variety of appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 to the RAM 13.
  • In the RAM 13, a variety of programs and data required for the operation of the electronic device 10 can also be stored.
  • the method in Embodiment 1 or Embodiment 2 may be implemented as a computer program, which is tangibly contained in a computer-readable storage medium, such as storage unit 18.
  • part or all of the computer program may be loaded and/or installed on electronic device 10 via ROM 12 and/or communication unit 19.
  • When the computer program is loaded into RAM 13 and executed by the processor 11, one or more steps of the method in Embodiment 1 or Embodiment 2 described above may be performed.
  • the method in Embodiment 1 or Embodiment 2 may be implemented as a computer program product, which includes computer-executable instructions; when executed, the computer-executable instructions are used to perform one or more steps of the method in Embodiment 1 or Embodiment 2 described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A method, apparatus and device for video content recognition and model training, wherein the video content recognition method comprises: performing target detection on a target video by using a pre-generated general detection model to obtain a first detection result (S201); selecting at least one target label model from a plurality of pre-generated label models according to the first detection result (S202); performing label detection on the target video by using the at least one target label model to obtain at least one second detection result (S203); and combining the at least one second detection result to generate a video content recognition result (S204).

Description

Method, apparatus and device for video content recognition and model training
This application claims priority to Chinese patent application No. 202211529260.9, filed with the Chinese Patent Office on November 30, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of image processing technology, and relates, for example, to a video content recognition method, a model training method, a video content recognition apparatus, a model training apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
As video has become an increasingly popular carrier of information, a variety of video-related applications have developed rapidly, which places higher requirements on video-related technologies. As a fundamental task in video processing, video content recognition technology has therefore received more and more attention.
Video content recognition technology mainly applies deep learning technology: after fitting parameters with a certain generalization capability on large-scale data, it recognizes new data. The main characteristics of this technology are: large-scale data, high data purity, and a black box that cannot be interpreted.
Large-scale data: R&D experience in actual business shows that deep learning for image recognition routinely requires training sets counted in the hundreds of thousands or millions of samples, and for video action recognition the training sets have reached a scale of tens of millions of images; leading research institutions abroad even conduct research with data on the order of hundreds of millions. The need for large-scale data results in long data set construction times and therefore relatively long R&D iteration cycles.
High data purity: large-scale data alone is still not enough to fit the required model capability, because the composition and purity of the data directly determine the recognition accuracy of the model. In actual business scenarios, the data recognized by the model must be highly accurate for human resources to be used optimally in auditing or machine-labeling applications.
Black box that cannot be interpreted: deep learning or machine learning techniques train model parameters on raw data or feature data in a neural network or machine learning algorithm, and the trained model parameters cannot be given an understandable logical interpretation. Users can only roughly probe the model's capability boundary through a certain amount of data testing, which is time-consuming and labor-intensive; moreover, if the content to be recognized changes even slightly, the original model parameters need to be retrained, so migration capability is poor.
Because high data purity must be combined with a large-scale quantity of data, the cost of building a training set is very high; furthermore, the non-interpretable black-box property means the model cannot be directly migrated or reused across multiple businesses, and the model has to be customized for each business based on that business's data, repeatedly consuming manpower and machine resources.
Summary
The present application provides a method, apparatus and device for video content recognition and model training, which reduce the cost of collecting data sets, improve the portability and reusability of label models, and reduce migration or reuse costs.
The present application provides a video content recognition method, the method comprising:
performing target detection on a target video by using a pre-generated general detection model to obtain a first detection result;
selecting at least one target label model from a plurality of pre-generated label models according to the first detection result, wherein each label model has a corresponding content label, and each label model is a model generated by using a pre-generated meta-model to perform hyper-parameter learning on the training data set of the corresponding content label and then training based on the obtained hyper-parameters and the training data set;
performing label detection on the target video by using the at least one target label model to obtain at least one second detection result;
combining the at least one second detection result to generate a video content recognition result.
The present application provides a model training method, the method comprising:
acquiring a label set, wherein the label set includes at least two label levels, and each label level has at least one content label;
for each content label, respectively acquiring a training data set of each content label, wherein the size of the training data set is less than a set small-scale threshold;
performing hyper-parameter learning on the training data set of each content label by using a pre-generated meta-model to obtain hyper-parameters of each content label;
training a label model of each content label according to the hyper-parameters of each content label and the training data set of each content label.
The present application provides a video content recognition apparatus, the apparatus comprising:
a general detection module, configured to perform target detection on a target video by using a pre-generated general detection model to obtain a first detection result;
a target label model determination module, configured to select at least one target label model from a plurality of pre-generated label models according to the first detection result, wherein each label model has a corresponding content label, and each label model is a model generated by using a pre-generated meta-model to perform hyper-parameter learning on the training data set of the corresponding content label and then training based on the obtained hyper-parameters and the training data set;
a label detection module, configured to perform label detection on the target video by using the at least one target label model to obtain at least one second detection result;
a video content recognition result generation module, configured to combine the at least one second detection result to generate a video content recognition result.
The present application provides a model training apparatus, the apparatus comprising:
a label set acquisition module, configured to acquire a label set, wherein the label set includes at least two label levels, and each label level has at least one content label;
a training data set acquisition module, configured to, for each content label, respectively acquire a training data set of each content label, wherein the size of the training data set is less than a set small-scale threshold;
a hyper-parameter determination module, configured to perform hyper-parameter learning on the training data set of each content label by using a pre-generated meta-model to obtain hyper-parameters of each content label;
a label model training module, configured to train a label model of each content label according to the hyper-parameters of each content label and the training data set of each content label.
The present application provides an electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively connected to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the method described above.
The present application provides a computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a processor to implement the method described above when executed.
The present application provides a computer program product, the computer program product comprising computer-executable instructions which, when executed, are used to implement the method described above.
Brief Description of the Drawings
The drawings needed for describing the embodiments are briefly introduced below.
FIG. 1 is a flowchart of a model training method provided in Embodiment 1 of the present application;
FIG. 2 is a flowchart of a video content recognition method provided in Embodiment 2 of the present application;
FIG. 3 is a schematic structural diagram of a video content recognition apparatus provided in Embodiment 3 of the present application;
FIG. 4 is a schematic structural diagram of a model training apparatus provided in Embodiment 4 of the present application;
FIG. 5 is a schematic structural diagram of an electronic device provided in Embodiment 5 of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application.
The terms "first", "second" and the like in the specification, claims and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that includes a series of steps or units shown in the embodiments of the present application may also include other steps or units that are not explicitly listed or that are inherent to such a process, method, system, product or device.
Embodiment 1
FIG. 1 is a flowchart of a model training method provided in Embodiment 1 of the present application. This method embodiment can be applied in a server to train a label model, and the label model is used to output the probability that the current input data belongs to the corresponding content label.
As shown in FIG. 1, this embodiment may include the following steps:
Step 101: acquire a label set, wherein the label set includes at least two label levels, and each label level has at least one content label.
For example, if a label set includes three label levels, namely a primary label, a secondary label and a tertiary label, and the content label of the primary label is assumed to be the initial content label, then the content label of the secondary label is a content label obtained by decomposing the initial content label, and the content label of the tertiary label is a content label obtained by further decomposing the content label of the secondary label.
In one implementation, the content label may be decomposed along multiple dimensions such as content and data format. For example, assuming that the content label of the primary label is "beautiful girl dancing", "beautiful girl dancing" can be decomposed along the content dimension into two secondary content labels, "beauty" and "dance". Each content label of the secondary level can then be further decomposed: "beauty" can be decomposed into two tertiary content labels, "gender" and "appearance", and "dance" can be decomposed into three tertiary content labels, "popular dance", "traditional dance" and "sports dance".
In one embodiment, step 101 may include the following steps: receiving configuration data input by a user, and extracting the label set from the configuration data.
In implementation, the server may provide a configuration entry to users (such as developers or product managers), and a user inputs the configuration data of the label set through the configuration entry. After receiving the configuration data, the server parses out the label set.
The label set may be recorded in a tree structure, a list structure, key-value pairs, or the like; this embodiment does not limit this, as long as the different label levels and the content labels under each label level can be distinguished.
The more label levels there are and the finer the decomposition granularity of the content labels, the clearer the covered content, which reduces the difficulty of collecting high-purity positive samples during rapid training iterations; moreover, the finer the granularity, the fewer positive samples are needed, which can further increase the iteration rate. Each content label can participate in multiple businesses, and sufficiently fine-grained content labels can avoid having to re-collect data to train labels for a new business, thereby enhancing the reusability and portability of the subsequently trained models.
Step 102: for each content label, respectively acquire a training data set of each content label.
The size of the training data set of each content label is less than a set small-scale threshold, so that the training data set of each content label is a small-scale data set; for example, the number of positive samples is on the order of tens of thousands. Based on actual time consumption in projects, positive samples of this order can generally be collected and labeled within a few days, and algorithm personnel can quickly balance the composition of the training data set to meet the requirements of high data purity and well-balanced data components, thereby improving the training speed and accuracy of the model so that training of the label model can be completed within a short R&D cycle.
Step 103: perform hyper-parameter learning on the training data set of the current content label by using a pre-generated meta-model to obtain the hyper-parameters of the current content label.
Exemplarily, the label model to be trained in this embodiment may be a deep learning model. In practice, although a small-scale data set can improve the speed and accuracy of model training, deep learning techniques are prone to overfitting on small-scale data sets, so that the model fails to achieve the expected generalization effect.
To prevent overfitting of the model, optional measures may include: selecting a small and shallow network structure, adding regularization during training, using a specific loss function for suppression, enriching the data composition at the sampling step, and so on. In addition, transfer learning based on public pre-trained models on large-scale open-source data sets can also avoid overfitting to a certain extent.
For the label set of step 101, if the number of content labels is large and each fine-grained content label starts iterating from a model pre-trained on a large-scale open-source data set, then training a small model with good precision-recall performance and strong generalization ability requires algorithm personnel to have considerable experience in tuning multiple hyper-parameters and designing training iterations. However, once the model training of every fine-grained label requires careful intervention by algorithm personnel, it is difficult to implement a short-cycle rapid-iteration plan effectively, because algorithm manpower may become a bottleneck for R&D progress. For this reason, to improve the training efficiency of the label models, an additional meta-model can be trained to guide the learning of the hyper-parameters of the label model of each content label.
In practice, the neural network of a label model learns the model's weight parameters through data iteration, while hyper-parameters involved in training, such as the model's learning rate, decay rate and initial weight parameters, need to be manually tuned and set by algorithm personnel. In neural network training, different hyper-parameter configurations are used for different data sets; when the hyper-parameter configuration matches the data set distribution well, it greatly helps to train better results faster and can considerably improve training efficiency. Hyper-parameters such as initial weight parameters, learning rate and decay rate adapted to a specific data set allow the required model performance to be iterated quickly, and the purpose of the meta-model is to automate this exploration process to improve the efficiency of algorithm manpower, ultimately making it possible to iterate many satisfactory label models within a short time using only a small amount of algorithm manpower.
In one embodiment, step 103 may include the following steps:
Step 103-1: randomly extract, from the training data set of the current content label, a first data subset not exceeding a first set ratio and a second data subset not exceeding a second set ratio.
This embodiment is intended to learn the hyper-parameters of the label model using only a small amount of data. Therefore, a small amount of data can be randomly extracted from the training data set of the current content label to obtain the first data subset and the second data subset.
The first set ratio and the second set ratio can be relatively small; for example, the first set ratio can be set to 10% and the second set ratio to 20%. This preserves the generalization ability of the final training stage, so that the final training stage can learn as much previously unseen data as possible, and, with the help of the meta-model, the initial hyper-parameters can be learned from a small amount of data.
Optionally, the first set ratio may be set smaller than the second set ratio, so that the amount of data increases during the update process from step 103-2 to step 103-3, which avoids overfitting of the model and improves its generalization ability.
Step 103-2: iterate the initial deep learning model using the first data subset to obtain initial weight parameters.
In this step, the first data subset is used to briefly iterate the initial deep learning model, thereby learning initial weight parameters, which are recorded as Weights-1.
Step 103-3: continue to iterate the deep learning model based on the initial weight parameters using the second data subset, and calculate the loss of each iteration and the gradient value of the loss.
In this step, the second data subset, which contains more data, is used to continue iterating the deep learning model initialized with Weights-1, and the loss of each iteration and the gradient value of that loss are calculated. This embodiment does not limit the specific way of calculating the loss and the gradient value.
Step 103-4: input the initial weight parameters, the loss obtained in each iteration, and the gradient values into a pre-generated meta-model for hyper-parameter learning to obtain the hyper-parameters of the current content label.
This embodiment uses one meta-model to perform hyper-parameter learning for the label models of multiple different content labels. The meta-model needs a certain memory capability, because a small amount of data (the first data subset and the second data subset) is used to quickly find the hyper-parameters in the label model training tasks of different content labels. Therefore, the meta-model can adopt a network structure with a memory function, for example a long short-term memory (LSTM) shallow network, a recurrent neural network (RNN) model with memory units, a transformer (a neural network using an attention mechanism) model, or a machine learning model.
Thanks to the memory capability of the meta-model, the meta-model can be applied to the training of multiple label models, thereby realizing reuse of the meta-model and improving efficiency.
After the initial weight parameters, the loss of each iteration and the gradient values are obtained, they are input into the meta-model; the meta-model learns and updates the initial weight parameters, learning rate and decay rate of the deep learning model, and the resulting new weight parameters, learning rate and decay rate are taken as the hyper-parameters of the current content label.
In one implementation, the relationship among the weight parameter, learning rate, decay rate and gradient value is as follows: c_t = f_t * c_{t-1} - i_t * ∇_t, where c_t is the weight parameter at time t, f_t is the decay rate at time t, i_t is the learning rate at time t, ∇_t is the gradient value at time t, and c_{t-1} is the weight parameter at time t-1.
Step 104: train the label model of each content label according to the hyper-parameters of each content label and the training data set of each content label.
In one implementation, model training may be performed using the full training data set based on the hyper-parameters of the content label to generate the label model of the content label.
After the hyper-parameters of the label model to be trained are determined, the structure of the label model can be determined according to the hyper-parameters, and the model can then be trained using the full training data set to obtain final weight parameters whose precision-recall rate and generalization ability satisfy the conditions; the label model of the content label is obtained according to the final weight parameters.
In practice, the form of the training data differs for different content labels. For example, if a content label includes an action label with time-sequence characteristics, the training data in the corresponding training data set is a frame sequence of a set length, for example a frame sequence including 8 or 15 video frames; for a non-action content label, one piece of training data is a single video frame.
This embodiment provides a solution for using a small amount of data to build a framework that can be easily migrated or whose models can be reused across multiple businesses. In order that one set of models can be migrated or reused across multiple businesses while still performing satisfactorily in each business, this embodiment uses fine-grained content labels, a single-dimension small-scale training data set corresponding to each content label, and the hyper-parameter learning process of the meta-model, which makes it possible to train a large number of reusable and transferable label models, reducing model migration costs and data set collection costs, improving the data purity of the data sets, and yielding more accurate label models. At the same time, a great deal of iterative development time can be saved in the training stage of the many label models, for example 30%-50% of iterative development time.
Embodiment 2
FIG. 2 is a flowchart of a video content recognition method provided in Embodiment 2 of the present application. This method embodiment can be applied in a server and uses the label models mentioned in Embodiment 1 to recognize video content. It can be applied in scenarios such as content review and content labeling, and is also suitable for scenarios where complex content from multiple businesses must be recognized within a relatively short R&D cycle.
As shown in FIG. 2, this embodiment may include the following steps:
Step 201: perform target detection on a target video by using a pre-generated general detection model to obtain a first detection result.
The general detection model may include a detection model that outputs basic signals for other label models, for example a face detection model for outputting face detection results, a scene detection model for outputting scene types, a target detection model for outputting target positions, and so on.
The target video may come from different businesses, and may include, for example, live video, video files imported by users, etc.; this embodiment does not limit this.
After the target video is obtained, the target video may be input into the general detection model, and the general detection model performs the corresponding target detection on the target video and outputs a first detection result.
If there is more than one general detection model, according to a preset topology strategy, the target video can be input into different general detection models in parallel, in which case the output of each general detection model is a first detection result; alternatively, the different general detection models can be connected in series: the target video is input into the first general detection model, the output of the first general detection model is passed to the second general detection model, and so on until the last general detection model has finished processing, and the output of the last general detection model is taken as the first detection result, or the output of every general detection model is taken as a first detection result.
Step 202: select one or more target label models from a plurality of pre-generated label models according to the first detection result.
Each label model has a corresponding content label, and each label model is a model generated by using a pre-generated meta-model to perform hyper-parameter learning on the training data set of the corresponding content label and then training based on the obtained hyper-parameters and the training data set. For the training process of the label model, refer to the description of Embodiment 1.
The content labels in this embodiment are fine-grained content labels, and each content label has a corresponding label model. After the first detection result serving as the basic signal is determined, the label model matching the first detection result can be selected from all previously generated label models as the target label model.
In one embodiment, step 202 may include the following steps: selecting, from the plurality of pre-generated label models, a label model that takes the first detection result as its input signal, and using that label model as the target label model.
For example, if a content label is an "age" label, then in order to detect a person's age, a face detection model can be used to perform face detection and obtain a face signal, and the face signal is then input into the "age" label model to obtain the corresponding age type. In this case, if the face detection model is the general detection model and the face signal is the first detection result, the "age" label model is the target label model. In other words, if the general detection model is a face detection model and the first detection result is a face signal, all label models that take the "face signal" as their input signal can be used as target label models.
Step 203: perform label detection on the target video by using the one or more target label models to obtain one or more second detection results.
In implementation, the one or more target label models can run in parallel; in this case the target video can be input into the multiple target label models for processing respectively, and the output of each target label model is taken as a second detection result. In other implementations, multiple target label models can also run in series; in this case the output of the previous target label model can be used as the input of the next target label model until the last target label model has finished processing, and its output is taken as the second detection result.
In one embodiment, step 203 may include the following steps:
Step 203-1: acquire the priority of each target label model.
In practice, a label model may also carry priority information, and the priority of the target label model can be determined according to the priority information of the target label model.
In implementation, the priority information of a label model can be configured by the user.
Step 203-2: determine the calling order of the one or more target label models according to the priorities.
For example, it may be specified that a target label model with a higher priority is called first; in this case the target label models can be sorted by priority from high to low, and the sorted order is used as the calling order.
Step 203-3: call each target label model in turn according to the calling order to perform label detection on the target video.
After the calling order is determined, the signal can be passed among the multiple target label models according to the calling order, and the signal output by the last target label model in the order is taken as the second detection result.
In one embodiment, the content label may include an action label with time-sequence characteristics. If the label model corresponding to the action label is used to perform label detection on the target video, the detection process may include: dividing the target video into multiple video segments; selecting multiple video frames from each video segment as input data, respectively inputting the input data of the multiple video segments into the label model corresponding to the action label, and obtaining multiple label results output by the label model corresponding to the action label; and integrating the multiple label results output by the label model corresponding to the action label for the input data of the multiple video segments to generate the second detection result of the label model corresponding to the action label.
For example, the target video can be divided into three video segments, and for each video segment, 3 frames are extracted per second; if a video segment is longer than 3 seconds and contains more than 9 frames, 9 consecutive frames are randomly extracted. The 9 frames selected from each video segment are then input into the label model corresponding to the action label as the input data of that video segment, and the label model corresponding to the action label outputs a label result for that video segment. In this way, three video segments yield three label results, and fusing these three label results gives the second detection result corresponding to the label model of the current action label.
By dividing the video into video segments and then extracting video frames from each video segment as input data, this embodiment can improve the R&D efficiency of action-type label models and greatly reduce the required positive sample size; while reducing the sampling interval of the image sequence, it can still achieve a model accuracy of 80%+ that meets the requirements on a small model and small-scale data (about 30,000 positive sample videos).
In one embodiment, the label result output by the label model corresponding to the action label is the probability that the input data matches the corresponding action label, and the label results of the multiple video segments can be integrated as follows: the multiple label results are computed according to a set calculation rule, and the computation result is used as the second detection result of the label model corresponding to the action label.
Exemplarily, the set calculation rule may include averaging, a weighted calculation method, etc.; this embodiment does not limit this.
Step 204: combine the one or more second detection results to generate a video content recognition result.
The video content recognition result is handled differently depending on the application scenario; for example, the video content recognition result can be passed to the recommendation side or the review side, to be recommended and delivered to users for viewing or to handle violation risks.
In this step, the video content recognition result is a diversified recognition result that integrates the second detection results of one or more target label models, which satisfies diversified processing needs and can improve recognition accuracy.
For example, in a security or sensitive content review scenario, if the target label models include an "age" label model, a "gender" label model and a "body exposure" label model, the output results of these three label models can be combined to obtain a video content recognition result such as "male, adult, shirtless".
In the embodiments of the present application, the first detection result and the second detection result may also be combined to obtain the video content recognition result.
For example, if a "gun" label model detects that a gun is present in the target video, the first detection result of a general detection model such as a scene detection model can be obtained; if the first detection result is a real scene, the output video content recognition result is "a gun is present in a real scene", which can trigger manual review or an alarm; if the first detection result is a game scene, the output video content recognition result is "a gun is present in a game scene", which does not trigger manual review or an alarm, and the recognition result can be recorded or ignored.
In this embodiment, when performing video content recognition on a target video, the basic signal output by the general detection model is taken as the first detection result, one or more target label models are then selected from the plurality of pre-generated label models by a policy control method according to the first detection result to perform label detection on the target video and obtain second detection results, and the video content recognition result can be obtained by combining the second detection results of the multiple target label models. Different businesses have different policy controls; according to the policy control, the target label models are selected directly from the pre-generated label models without retraining the network model corresponding to each content label for each different business, which improves the portability and reusability of the label models and reduces migration or reuse costs.
If a new business is added, this embodiment likewise only needs to develop a parsable comprehensive policy to determine the target label models, or to add some new label models, according to the product requirements of the new business; the existing label models can be reused directly without retraining or adjustment, thereby achieving cross-business migration or reuse of label models.
Embodiment 3
FIG. 3 is a schematic structural diagram of a video content recognition apparatus provided in Embodiment 3 of the present application. The apparatus is arranged in a server and may include the following modules: a general detection module 301, configured to perform target detection on a target video by using a pre-generated general detection model to obtain a first detection result; a target label model determination module 302, configured to select at least one target label model from a plurality of pre-generated label models according to the first detection result, wherein each label model has a corresponding content label, and each label model is a model generated by using a pre-generated meta-model to perform hyper-parameter learning on the training data set of the corresponding content label and then training based on the obtained hyper-parameters and the training data set; a label detection module 303, configured to perform label detection on the target video by using the at least one target label model to obtain at least one second detection result; and a video content recognition result generation module 304, configured to combine the at least one second detection result to generate a video content recognition result.
In one embodiment, the target label model determination module 302 is configured to: select, from the plurality of pre-generated label models, a label model that takes the first detection result as its input signal, and use the selected label model as the target label model.
In one embodiment, the label detection module 303 is configured to: acquire the priority of each target label model; determine the calling order of the at least one target label model according to the priority; and call each target label model in turn according to the calling order to perform label detection on the target video.
In one embodiment, the content label includes an action label with time-sequence characteristics, the selected target label model is the label model corresponding to the action label, and the label detection module 303 may include the following modules: a video division module, configured to divide the target video into multiple video segments; a video segment processing module, configured to select multiple video frames from each video segment as input data, respectively input the input data of the multiple video segments into the label model corresponding to the action label, and obtain multiple label results output by the label model corresponding to the action label; and a label result integration module, configured to integrate the multiple label results output by the label model corresponding to the action label for the input data of the multiple video segments, and generate the second detection result of the label model corresponding to the action label.
In one embodiment, each label result is the probability that the input data matches the action label; the label result integration module is configured to: compute the multiple label results according to a set calculation rule, and use the computation result as the second detection result of the label model corresponding to the action label.
The video content recognition apparatus provided in this embodiment of the present application can execute the video content recognition method provided in Embodiment 2 of the present application, and has functional modules and effects corresponding to the executed method.
Embodiment 4
FIG. 4 is a schematic structural diagram of a model training apparatus provided in Embodiment 4 of the present application. The apparatus is arranged in a server and may include the following modules: a label set acquisition module 401, configured to acquire a label set, wherein the label set includes at least two label levels, and each label level has at least one content label; a training data set acquisition module 402, configured to, for each content label, respectively acquire a training data set of each content label, wherein the size of the training data set is less than a set small-scale threshold; a hyper-parameter determination module 403, configured to perform hyper-parameter learning on the training data set of each content label by using a pre-generated meta-model to obtain the hyper-parameters of each content label; and a label model training module 404, configured to train the label model of each content label according to the hyper-parameters of each content label and the training data set of each content label.
In one embodiment, the hyper-parameter determination module 403 is configured to: randomly extract, from the training data set of each content label, a first data subset not exceeding a first set ratio and a second data subset not exceeding a second set ratio, wherein the first set ratio is less than the second set ratio; iterate the initial deep learning model using the first data subset to obtain initial weight parameters; continue to iterate the deep learning model based on the initial weight parameters using the second data subset, and calculate the loss of each iteration and the gradient value of the loss; and input the initial weight parameters, the loss obtained in each iteration, and the gradient value of the loss into a pre-generated meta-model for hyper-parameter learning to obtain the hyper-parameters of each content label.
In one embodiment, the label model training module 404 is configured to: based on the hyper-parameters of each content label, perform model training using the full training data set of each content label to generate the label model of each content label.
In one embodiment, the content label includes an action label with time-sequence characteristics, and the training data in the training data set of the action label is a frame sequence of a set length.
In one embodiment, the label set acquisition module 401 is configured to: receive configuration data input by a user; and extract the label set from the configuration data.
The model training apparatus provided in this embodiment of the present application can execute the model training method provided in Embodiment 1 of the present application, and has functional modules and effects corresponding to the executed method.
Embodiment 5
FIG. 5 is a schematic structural diagram of an electronic device 10 that can be used to implement the method embodiments of the present application. As shown in FIG. 5, the electronic device 10 includes at least one processor 11 and a storage device communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, wherein the storage device stores one or more computer programs executable by the at least one processor, and the processor 11 can perform a variety of appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10.
In some embodiments, the method of Embodiment 1 or Embodiment 2 may be implemented as a computer program, which is tangibly contained in a computer-readable storage medium such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or a communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the method of Embodiment 1 or Embodiment 2 described above may be performed.
In some embodiments, the method of Embodiment 1 or Embodiment 2 may be implemented as a computer program product comprising computer-executable instructions which, when executed, are used to perform one or more steps of the method of Embodiment 1 or Embodiment 2 described above.

Claims (15)

  1. A video content recognition method, comprising:
    performing target detection on a target video by using a pre-generated general detection model to obtain a first detection result;
    selecting at least one target label model from a plurality of pre-generated label models according to the first detection result, wherein each label model has a corresponding content label, and each label model is a model generated by using a pre-generated meta-model to perform hyper-parameter learning on a training data set of the corresponding content label and then training based on the obtained hyper-parameters and the training data set;
    performing label detection on the target video by using the at least one target label model to obtain at least one second detection result; and
    combining the at least one second detection result to generate a video content recognition result.
  2. The method according to claim 1, wherein selecting at least one target label model from a plurality of pre-generated label models according to the first detection result comprises:
    selecting, from the plurality of pre-generated label models, a label model that takes the first detection result as an input signal, and using the selected label model as the target label model.
  3. The method according to claim 1 or 2, wherein performing label detection on the target video by using the at least one target label model comprises:
    acquiring a priority of each target label model;
    determining a calling order of the at least one target label model according to the priority; and
    calling each target label model in turn according to the calling order to perform label detection on the target video.
  4. The method according to claim 3, wherein the content label comprises an action label with time-sequence characteristics, the selected target label model is a label model corresponding to the action label, and performing label detection on the target video by using the at least one target label model to obtain at least one second detection result comprises:
    dividing the target video into a plurality of video segments;
    selecting a plurality of video frames from each video segment as input data, respectively inputting the input data of the plurality of video segments into the label model corresponding to the action label, and obtaining a plurality of label results output by the label model corresponding to the action label; and
    integrating the plurality of label results output by the label model corresponding to the action label for the input data of the plurality of video segments, to generate a second detection result of the label model corresponding to the action label.
  5. The method according to claim 4, wherein each label result is a probability that the input data matches the action label;
    and integrating the plurality of label results output by the label model corresponding to the action label for the input data of the plurality of video segments to generate the second detection result of the label model corresponding to the action label comprises:
    computing the plurality of label results according to a set calculation rule, and using the computation result as the second detection result of the label model corresponding to the action label.
  6. A model training method, comprising:
    acquiring a label set, wherein the label set comprises at least two label levels, and each label level has at least one content label;
    for each content label, respectively acquiring a training data set of each content label, wherein the size of the training data set is less than a set small-scale threshold;
    performing hyper-parameter learning on the training data set of each content label by using a pre-generated meta-model to obtain hyper-parameters of each content label; and
    training a label model of each content label according to the hyper-parameters of each content label and the training data set of each content label.
  7. The method according to claim 6, wherein performing hyper-parameter learning on the training data set of each content label by using a pre-generated meta-model to obtain the hyper-parameters of each content label comprises:
    randomly extracting, from the training data set of each content label, a first data subset not exceeding a first set ratio and a second data subset not exceeding a second set ratio, wherein the first set ratio is less than the second set ratio;
    iterating an initial deep learning model using the first data subset to obtain initial weight parameters;
    continuing to iterate the deep learning model based on the initial weight parameters using the second data subset, and calculating a loss of each iteration and a gradient value of the loss; and
    inputting the initial weight parameters, the loss obtained in each iteration, and the gradient value of the loss into the pre-generated meta-model for hyper-parameter learning, to obtain the hyper-parameters of each content label.
  8. The method according to claim 6, wherein training the label model of each content label according to the hyper-parameters of each content label and the training data set of each content label comprises:
    based on the hyper-parameters of each content label, performing model training using the full training data set of each content label to generate the label model of each content label.
  9. The method according to any one of claims 6-8, wherein the content label comprises an action label with time-sequence characteristics, and the training data in the training data set of the action label is a frame sequence of a set length.
  10. The method according to any one of claims 6-8, wherein acquiring the label set comprises:
    receiving configuration data input by a user; and
    extracting the label set from the configuration data.
  11. A video content recognition apparatus, comprising:
    a general detection module, configured to perform target detection on a target video by using a pre-generated general detection model to obtain a first detection result;
    a target label model determination module, configured to select at least one target label model from a plurality of pre-generated label models according to the first detection result, wherein each label model has a corresponding content label, and each label model is a model generated by using a pre-generated meta-model to perform hyper-parameter learning on a training data set of the corresponding content label and then training based on the obtained hyper-parameters and the training data set;
    a label detection module, configured to perform label detection on the target video by using the at least one target label model to obtain at least one second detection result; and
    a video content recognition result generation module, configured to combine the at least one second detection result to generate a video content recognition result.
  12. A model training apparatus, comprising:
    a label set acquisition module, configured to acquire a label set, wherein the label set comprises at least two label levels, and each label level has at least one content label;
    a training data set acquisition module, configured to, for each content label, respectively acquire a training data set of each content label, wherein the size of the training data set is less than a set small-scale threshold;
    a hyper-parameter determination module, configured to perform hyper-parameter learning on the training data set of each content label by using a pre-generated meta-model to obtain hyper-parameters of each content label; and
    a label model training module, configured to train a label model of each content label according to the hyper-parameters of each content label and the training data set of each content label.
  13. An electronic device, comprising:
    at least one processor; and
    a storage device configured to store at least one program,
    wherein when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the method according to any one of claims 1-10.
  14. A computer-readable storage medium on which a computer program is stored, wherein when the program is executed by a processor, the method according to any one of claims 1-10 is implemented.
  15. A computer program product, comprising computer-executable instructions, wherein the computer-executable instructions, when executed, are used to implement the method according to any one of claims 1-10.
PCT/CN2023/131057 2022-11-30 2023-11-10 Method, apparatus and device for video content recognition and model training WO2024114341A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211529260.9A CN115965890A (zh) 2022-11-30 2022-11-30 Method, apparatus and device for video content recognition and model training
CN202211529260.9 2022-11-30

Publications (1)

Publication Number Publication Date
WO2024114341A1 true WO2024114341A1 (zh) 2024-06-06

Family

ID=87352120

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/131057 WO2024114341A1 (zh) 2022-11-30 2023-11-10 视频内容识别及模型训练的方法、装置和设备

Country Status (2)

Country Link
CN (1) CN115965890A (zh)
WO (1) WO2024114341A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965890A (zh) * 2022-11-30 2023-04-14 百果园技术(新加坡)有限公司 一种视频内容识别及模型训练的方法、装置和设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190244139A1 (en) * 2018-02-02 2019-08-08 Oracle International Corporation Using meta-learning for automatic gradient-based hyperparameter optimization for machine learning and deep learning models
CN111428806A (zh) * 2020-04-03 2020-07-17 北京达佳互联信息技术有限公司 Image tag determination method and apparatus, electronic device, and storage medium
CN111898703A (zh) * 2020-08-14 2020-11-06 腾讯科技(深圳)有限公司 Multi-label video classification method, model training method, apparatus, and medium
CN114329065A (zh) * 2021-11-30 2022-04-12 腾讯科技(深圳)有限公司 Processing method for video tag prediction model, and video tag prediction method and apparatus
CN115223218A (зh) * 2022-05-27 2022-10-21 天翼电子商务有限公司 Adaptive face recognition technology based on the ALFA meta-learning optimization algorithm
CN115965890A (зh) * 2022-11-30 2023-04-14 百果园技术(新加坡)有限公司 Method, apparatus and device for video content recognition and model training


Also Published As

Publication number Publication date
CN115965890A (zh) 2023-04-14


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23896493

Country of ref document: EP

Kind code of ref document: A1