CN111311475A - Detection model training method and device, storage medium and computer equipment - Google Patents


Info

Publication number
CN111311475A
Authority
CN
China
Prior art keywords
image
target
training
video
marked
Prior art date
Legal status
Pending
Application number
CN202010108690.8A
Other languages
Chinese (zh)
Inventor
毛懿荣
李岩
王汉杰
陈波
Current Assignee
Guangzhou Tencent Technology Co Ltd
Original Assignee
Guangzhou Tencent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Tencent Technology Co Ltd
Priority to CN202010108690.8A
Publication of CN111311475A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/0021 Image watermarking
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06T2201/00 General purpose image data processing
    • G06T2201/005 Image watermarking
    • G06T2201/0065 Extraction of an embedded watermark; Reliable detection

Abstract

The application relates to a detection model training method and apparatus, a storage medium, and computer equipment. The method comprises the following steps: acquiring an original image to be processed and more than one type of marked image; for each type of marked image, randomly selecting a target position from a target area of the original image as the embedding position of that marked image; for each type of marked image, embedding at least a part of the marked image into the original image according to the corresponding embedding position to obtain a corresponding sample image; taking the sample image as a training sample, and taking the mark category of the marked image embedded in the sample image as the corresponding training label; and training the detection model to be trained with the training samples and the corresponding training labels. The scheme provided by the application can improve training efficiency.

Description

Detection model training method and device, storage medium and computer equipment
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a detection model training method, an apparatus, a computer-readable storage medium, and a computer device.
Background
With the development of computer technology, machine learning has emerged. Through machine learning, a computer can be trained to simulate or reproduce human learning behavior, bringing convenience to people's life and work. For example, in the field of image processing, a model can be trained on training data so that it learns to classify or locate objects, allowing a machine to process images in place of a human.
In practical applications, for example, when a detection model capable of identifying a target object (such as a watermark or a trademark) needs to be trained, a large amount of labeled data is usually required before model training can be performed. However, in the conventional method, the category of the target object and its position in the image are labeled manually; labeling is slow, so model training efficiency is low.
Disclosure of Invention
Based on this, it is necessary to provide a detection model training method, apparatus, computer-readable storage medium, and computer device that solve the technical problem of inefficient model training caused by manually labeled data.
A detection model training method, comprising:
acquiring an original image to be processed and more than one type of marked images;
for each type of marked image, respectively randomly selecting a target position from a target area of the original image as an embedding position of the marked image;
for each type of marked image, embedding at least one part of the marked image into the original image according to the corresponding embedding position to obtain a corresponding sample image;
taking the sample image as a training sample, and taking the label category of the label image embedded in the sample image as a corresponding training label;
and training the detection model to be trained through the training sample and the corresponding training label.
A detection model training apparatus, the apparatus comprising:
the acquisition module is used for acquiring an original image to be processed and more than one type of marked images;
the selecting module is used for respectively and randomly selecting a target position from the target area of the original image as the embedding position of the marked image for each type of marked image;
the embedding module is used for embedding, for each type of marked image, at least one part of the marked image into the original image according to the corresponding embedding position to obtain a corresponding sample image;
the determining module is used for taking the sample image as a training sample and taking the label type of the label image embedded in the sample image as a corresponding training label;
and the training module is used for training the detection model to be trained through the training sample and the corresponding training label.
In one embodiment, the acquiring module is further configured to acquire an original image to be processed and more than one type of mark template; randomly selecting a target size proportion from a preset size proportion range; and according to the size of the original image, respectively carrying out scaling processing on each type of marking template according to the target size proportion to obtain a corresponding marking image.
In one embodiment, the target locations include a core location and a non-core location; the selection module is further used for determining a core position in a target area of the original image; obtaining a probability value when the core position is taken as an embedding position; the probability value corresponding to the core position is the maximum value in the probability values corresponding to all target positions in the target area; determining probability values when the non-core positions are respectively used as embedding positions according to the distances between the non-core positions in the target area and the core positions; the probability value corresponding to the non-core position is in negative correlation with the distance from the non-core position to the core position; and for each type of marked image, selecting a corresponding target position as the embedded position of the marked image according to the probability value corresponding to each target position in the target area of the original image.
In one embodiment, the selection module is further configured to obtain a preset number of platform-specific images; each platform special image comprises a mark image corresponding to a corresponding platform; determining an average coordinate corresponding to a target vertex according to the coordinate of the target vertex of each marked image in the platform-specific image; and taking the average coordinate as a core position in a target area of the original image.
In one embodiment, the target area comprises an upper left corner area and a lower right corner area; the selecting module is further configured to select, for each type of labeled template, a corresponding target position as an embedded position corresponding to an upper left vertex of the labeled image according to a probability value corresponding to each target position in the upper left corner region of the original image, when the target region is an upper left corner region; and when the target area is a lower right corner area, selecting a corresponding target position as an embedding position corresponding to a lower right vertex of the marked image for each type of marked template according to the probability value corresponding to each target position in the lower right corner area of the original image.
In one embodiment, the sample image comprises a first sample image and a second sample image; the embedding module is also used for determining a first marker image to be completely embedded and a second marker image to be shielded and embedded in each type of marker image; completely embedding the first mark image into the original image according to the corresponding embedding position to obtain a corresponding first sample image; and completely embedding the second marked image into the original image according to the corresponding embedding position, randomly selecting a target shielding proportion from a preset shielding proportion range, and moving a part of the second marked image out to the boundary of the original image according to the target shielding proportion to obtain a corresponding second sample image.
In one embodiment, the determining module is further configured to determine a label category of a label image embedded in each sample image and position information of the label image in the original image; and taking the sample image as a training sample, and taking the label type of the label image embedded in the sample image and the corresponding position information as a training label of the training sample.
In one embodiment, the training module is further configured to crop the sample image according to the target area to obtain a corresponding sample image block; respectively extracting the characteristics of each sample image block through a detection model to be trained to obtain a corresponding characteristic diagram, and detecting and outputting a prediction result based on the characteristic diagram; and adjusting the model parameters of the detection model according to the difference between the prediction result corresponding to the sample image block and the corresponding training label until the training stopping condition is met, and stopping training.
In one embodiment, the device further comprises a mark detection module, which is used for acquiring a video to be detected and a trained detection model; extracting a preset number of video frames from the video to be detected, and cutting each video frame according to the target area to obtain a corresponding target image block; inputting each target image block into the trained detection model respectively, and outputting a detection result corresponding to each target image block; and fusing the detection results of the target image blocks to obtain a detection result corresponding to the video to be detected.
In one embodiment, the mark detection module is further configured to input each of the target image blocks into the trained detection model respectively; sequentially process the input target image blocks through at least three convolution groups in the trained detection model, where the down-sampling layer in the last convolution group is a dilated convolution with a stride of a preset value so as to keep the feature map output by the last convolution group at a preset size; perform convolution processing on the feature map output by the intermediate convolution group to obtain a first feature map to be detected; take the feature map output by the last convolution group as a second feature map to be detected; perform convolution processing on the second feature map to be detected to obtain at least one third feature map to be detected; perform detection on the first, second, and third feature maps to be detected respectively to obtain candidate detection results and confidences corresponding to the candidate results; and screen out, from the candidate detection results corresponding to the feature maps to be detected, the candidate detection results whose confidences meet the high-confidence condition as the detection results corresponding to the input target image blocks.
In one embodiment, the detection result includes a label category to which a label image in the video to be detected belongs; the device also comprises a video pushing module used for obtaining a video filtering instruction; the video filtering instructions include a first target category; determining the label type respectively corresponding to each video in the video library through the trained detection model; searching videos to be filtered with the mark category as a first target category from the video library; and responding to the video filtering instruction, and pushing the videos except the video to be filtered in the video library to a user terminal initiating the video filtering instruction.
In one embodiment, the video pushing module is further configured to obtain a video search instruction; the video search instruction comprises a second target category; determining the label type respectively corresponding to each video in the video library through the trained detection model; searching a target video with a mark category being a second target category from the video library; and responding to the video searching instruction, and pushing the target video to a user terminal initiating the video searching instruction.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring an original image to be processed and more than one type of marked images;
for each type of marked image, respectively randomly selecting a target position from a target area of the original image as an embedding position of the marked image;
for each type of marked image, embedding at least one part of the marked image into the original image according to the corresponding embedding position to obtain a corresponding sample image;
taking the sample image as a training sample, and taking the label category of the label image embedded in the sample image as a corresponding training label;
and training the detection model to be trained through the training sample and the corresponding training label.
A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring an original image to be processed and more than one type of marked images;
for each type of marked image, respectively randomly selecting a target position from a target area of the original image as an embedding position of the marked image;
for each type of marked image, embedding at least one part of the marked image into the original image according to the corresponding embedding position to obtain a corresponding sample image;
taking the sample image as a training sample, and taking the label category of the label image embedded in the sample image as a corresponding training label;
and training the detection model to be trained through the training sample and the corresponding training label.
According to the detection model training method and apparatus, the computer-readable storage medium, and the computer equipment, each type of marked image is randomly embedded into the original image; during embedding, the situation that a marked image in a real scene may have been edited or compressed is simulated, with some marked images embedded completely and others embedded with partial occlusion, so that labeled training data can be generated automatically for training the detection model. The training labels in the training data are the mark categories to which the embedded marked images belong. Therefore, training data do not need to be labeled manually, and various random strategies are used to simulate how real marked images appear in original images, so that the labor cost of labeling training data is greatly reduced while labeling efficiency and model training efficiency are greatly improved.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a detection model training method;
FIG. 2 is a schematic flow chart of a detection model training method in one embodiment;
FIG. 3 is a diagram illustrating marking templates, in one embodiment;
FIG. 4 is a flowchart illustrating the steps of randomly selecting a target position from a target area of an original image as an embedding position of a marker image for each type of marker image in one embodiment;
FIG. 5 is a flowchart illustrating steps of performing label detection on a video to be detected by a trained detection model according to an embodiment;
FIG. 6 is a schematic diagram of a network structure of a detection network based on the SSD algorithm in one embodiment;
FIG. 7 is a schematic diagram of a network structure of a RetinaNet network according to an embodiment;
FIG. 8 is a flowchart illustrating steps of performing label detection on a video to be detected by using a trained detection model and obtaining a detection result in an embodiment;
FIG. 9 is a block diagram showing the structure of a detection model training apparatus according to an embodiment;
FIG. 10 is a block diagram showing the structure of a detection model training apparatus according to another embodiment;
FIG. 11 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
FIG. 1 is a diagram of an exemplary implementation of a training method for a detection model. Referring to fig. 1, the detection model training method is applied to a detection model training system. The detection model training system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers. Both the terminal 110 and the server 120 can be independently used to perform the detection model training method provided in the embodiments of the present application. The terminal 110 and the server 120 may also be cooperatively used to perform the detection model training method provided in the embodiments of the present application.
It should be noted that the detection model training method relates to Machine Learning (ML). Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
As shown in FIG. 2, in one embodiment, a detection model training method is provided. The embodiment is mainly exemplified by applying the method to a computer device, and the computer device may specifically be the terminal 110 or the server 120 in fig. 1. Referring to fig. 2, the detection model training method specifically includes the following steps:
s202, acquiring an original image to be processed and more than one type of marked images.
The original image is the image to be processed that does not yet contain mark information. The original image may be a real image captured by a camera, one or more video frames extracted from a video file, or an image synthesized by a computer device. The mark information is information used for making a specific mark, and may specifically be information with a certain degree of recognizability, such as specific characters, icons, signs, or audio. The marked image is an image that contains mark information, and may specifically be a watermark image or a trademark image.
Specifically, the computer device may obtain one or more original images locally or from another computer device. Unless otherwise specified, the terms "plurality", "a plurality", and "a plurality of types" in the embodiments of the present application mean "more than one" or "more than one type". The computer device may acquire in advance mark information belonging to different mark categories and generate corresponding marked images from the corresponding mark information. The mark category may be a category used to distinguish different platforms, that is, the category of the platform to which the marked image belongs. The different platforms may specifically be different media platforms, such as a "tremble" platform, a "micro-view" platform, a "watermelon video" platform, a "fast-hand" platform, a "volcano small video" platform, or a "prawns" platform. The marked images corresponding to different platforms have the characteristics of their respective platforms and are distinct from one another.
In one embodiment, to ensure the training effect on the detection model, the computer device may adjust the format of the acquired original image to obtain an original image with a uniform format and size.
In an embodiment, the computer device may further obtain a marking template having four-channel RGBA (red, green, blue, alpha) information, and scale the marking template according to a preset size ratio and the size of the original image to obtain a corresponding marked image.
And S204, respectively randomly selecting a target position from the target area of the original image as the embedding position of the marker image for each type of marker image.
The target region is a region for embedding the marker image, and may specifically be the entire region of the original image, or may be a partial region in the original image, such as an upper left corner region, a lower right corner region, a lower left corner region, or an upper right corner region having a preset size in the original image. The embedding position is a position point for locating a specific position of the marker image in the original image.
Specifically, the computer device may randomly select a certain type of marked image from the various types of marked images, and randomly select a target position from the target area of the original image as the embedding position of that marked image. It can be understood that, in order for the detection model to accurately distinguish different mark categories, an equal number of sample images may be constructed for each category of marked image when constructing the training data; that is, when marked images are selected, each category has an equal probability of being selected. The training data specifically includes training samples and training labels.
In one embodiment, the computer device may preset a fixed area as a target area of the original image. Alternatively, the computer device may determine the target region based on a region in which the marker image is present in the real image including the marker image in the actual situation. For example, the computer device may count the positions of the occurrence of the marker images in a large number of real images, and take a region in which the probability of the occurrence of the marker images in the real images is greater than a preset probability threshold or a larger region as a target region. The real image here is an image with mark information of a platform corresponding to each of different platforms in a real scene, and may also be referred to as a platform-specific image.
In one embodiment, when the computer device selects the target location from the target area, the probability values of the selection of the respective locations in the target area may be set to the same probability value or may be set to different probability values. In one embodiment, the computer device may determine a location in the target area as a core location, the core location corresponding to a maximum selected probability value, and the selected probability values corresponding to other non-core locations decrease in magnitude around the core location.
In an embodiment, the computer device may use the target position selected in the target area as the embedding position of the marker image, and specifically may use the target position selected in the target area as the embedding position corresponding to the upper left vertex of the marker image, as the embedding position corresponding to the lower right vertex of the marker image, or as the embedding position corresponding to the center point of the marker image, and the like, which is not limited in this embodiment of the application.
And S206, for each type of marked image, embedding at least one part of the marked image into the original image according to the corresponding embedding position to obtain a corresponding sample image.
Specifically, after determining an embedding position corresponding to a certain marker image, the computer device may embed at least a portion of image content in the marker image into the original image according to the embedding position to obtain a corresponding sample image.
In one embodiment, the computer device can paste the complete marker image entirely onto the original image according to the embedding location to obtain a corresponding sample image.
In one embodiment, the computer device may paste the complete marked image onto the original image at the embedding position, and then move the marked image toward an edge of the original image so that part of the marked image moves beyond the edge and only the remaining part is retained on the original image, yielding the corresponding sample image. This better simulates the fact that the edges of a marked image may be cropped in practice.
In one embodiment, the computer device may crop out portions of the image content from the marker image, and paste the portions of the marker image into the original image according to the embedding location to obtain a corresponding sample image.
For example, the computer device may preset an occlusion probability value, such as 0.25, meaning that 25% of the training samples have their marked image occluded and 75% have their marked image intact. For the training samples whose marked image is occluded, the computer device can randomly select a target occlusion ratio from a preset occlusion ratio range and adjust the height, width, or area of the marked image that remains in the original image according to that ratio. For example, if the preset occlusion ratio range is 0.3 to 0.7, the computer device can uniformly sample the target occlusion ratio from 0.3 to 0.7.
And S208, taking the sample image as a training sample, and taking the label type of the label image embedded in the sample image as a corresponding training label.
Specifically, the computer device may randomly generate a large number of sample images including marker images of different marker categories based on the random manner mentioned in the above steps. The embedding positions of the mark images in the generated sample images are randomly distributed, and the coverage areas of the mark images in the sample images are also randomly distributed. And then the computer equipment can take the sample images as training samples and take the label types of the label images embedded in the sample images as training labels corresponding to the corresponding training samples.
In one embodiment, the step S208, that is, the step of using the sample image as a training sample and using the label category of the label image embedded in the sample image as a corresponding training label specifically includes: determining the mark type of the mark image embedded in each sample image and the position information of the mark image in the original image; and taking the sample image as a training sample, and taking the label type of the label image embedded in the sample image and the corresponding position information as a training label of the training sample.
In one embodiment, the computer device may determine the label class of the label image embedded in each sample image, as well as the position information of the label image in the original image. The position information of the marker image in the original image is used to locate the marker image, and specifically may be coordinates of an upper left vertex and a lower right vertex of the embedded marker image, or coordinates of the lower left vertex and the upper right vertex, or the like. Furthermore, the computer device may use the sample image as a training sample, and use the label category of the label image embedded in the sample image and the corresponding position information together as a training label of the training sample. Therefore, the detection model obtained through training corresponding to the training sample and the training label can predict the label type of the image to be detected and can also position the label image in the image to be detected.
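By way of illustration only, the following Python sketch shows one way such a training label, pairing the mark category with the coordinates of the embedded marked image, could be represented; the dictionary layout, field names, and example numbers are assumptions and are not part of the application:

```python
# Hypothetical sketch: build a training label as described above.
def make_training_label(mark_category: int, x1: int, y1: int, x2: int, y2: int) -> dict:
    """Label = mark category plus the embedded marked image's bounding box
    (upper left vertex (x1, y1) and lower right vertex (x2, y2))."""
    return {
        "category": mark_category,   # which platform's mark was embedded
        "bbox": (x1, y1, x2, y2),    # position of the marked image in the original image
    }

# Example: a category-2 mark pasted with its upper left vertex at (36, 18)
# and a scaled size of 120 x 44 pixels.
label = make_training_label(2, 36, 18, 36 + 120, 18 + 44)
```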
And S210, training the detection model to be trained through the training sample and the corresponding training label.
The detection model is a convolutional neural network model and is used for carrying out mark detection on the image to be detected or the video to be detected. When the training labels are only label categories, the detection model obtained through training can be used for classifying the label images included in the input images to be detected. When the training labels comprise the label categories and the position information of the label images, the detection model obtained through training can be used for classifying and positioning the label images included in the input images to be detected.
In particular, the training of the detection model is a supervised training process. The computer equipment inputs the training sample into the detection model, takes the corresponding training label of the training sample as target output, and enables the actual output of the detection model to continuously approach the target output by adjusting the model parameters of the detection model.
In one embodiment, the computer device may input the training samples into the detection model for training to obtain prediction results, construct a loss function according to the difference between the prediction results and the training labels, take the model parameters that minimize the loss function as the model parameters of the detection model, and return to the step of inputting training samples into the detection model to obtain prediction results, until the training stop condition is met and training stops.
Wherein the training stop condition is a condition for ending the model training. The training stopping condition may be that a preset number of iterations is reached, or that the performance index of the detection model after the model parameters are adjusted reaches a preset index.
In an embodiment, the model structure of the detection model and the detection algorithm used in the detection model are not limited in this application. The detection model may be implemented with a two-stage detection algorithm based on a neural network, or with a one-stage detection algorithm based on a neural network. For example, the computer device may construct a detection model as a convolutional neural network based on the SSD algorithm (Single Shot MultiBox Detector, a target detection algorithm) or the RetinaNet algorithm (a single-stage target detection algorithm).
In an embodiment, the step S210 of training the detection model to be trained by using the training samples and the corresponding training labels specifically includes: cutting the sample image according to the target area to obtain a corresponding sample image block; respectively extracting the characteristics of each sample image block through a detection model to be trained to obtain a corresponding characteristic diagram, and detecting and outputting a prediction result based on the characteristic diagram; and adjusting the model parameters of the detection model according to the difference between the prediction result corresponding to the sample image block and the corresponding training label until the training stopping condition is met, and stopping training.
In one embodiment, the computer device may crop the sample image according to the target area to obtain corresponding sample image blocks. The computer device can then input each sample image block into the detection model to be trained, perform feature extraction on each sample image block through the convolution layers included in the detection model to obtain corresponding feature maps, and perform mark detection based on each feature map to output a prediction result. When the training label is the mark category, the corresponding prediction result is the predicted mark category. When the training label is the mark category plus the position information of the marked image in the target image block, the corresponding prediction result is the predicted mark category and a regression box. The computer device can adjust the model parameters of the detection model in the direction that reduces the difference between the prediction results corresponding to the sample image blocks and the corresponding training labels, and stop training when the training stop condition is met.
In the embodiment, the sample image is cut according to the target area to obtain the corresponding sample image block, and then the detection model to be trained is trained through the sample image block, so that the interference and extra processing amount caused by background information irrelevant to the marked image in the training process can be greatly reduced, and the model training efficiency and accuracy are improved.
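A minimal training-loop sketch is given below, assuming a PyTorch-style detection model that computes its own loss from cropped sample image blocks and their training labels; the framework choice, function names, and hyperparameters are assumptions for illustration and do not describe the application's specific implementation:

```python
# Minimal sketch: crop samples to the target region and train a detection model.
import torch

def crop_to_target_region(sample, region):
    """Crop a sample image tensor (C, H, W) to the target region (left, top, right, bottom)."""
    left, top, right, bottom = region
    return sample[:, top:bottom, left:right]

def train(model, data_loader, target_region, epochs=10, lr=1e-4, device="cpu"):
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, targets in data_loader:            # targets: mark category (+ box) labels
            blocks = torch.stack([crop_to_target_region(img, target_region) for img in images])
            loss = model(blocks.to(device), targets)   # assumed: model returns its training loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                           # adjust parameters toward smaller difference
    return model
```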
It can be understood that, in practical applications, in order to avoid blocking the effective content of the original image, a marked image generally appears in the upper left corner region or the lower right corner region of the original image, which achieves the marking effect without blocking the effective content. Therefore, when designing training data, if marked images were embedded only at fixed positions in the upper left and lower right corner regions, the detection model might mistakenly learn that marked images can only exist at those fixed positions; it would not truly acquire the ability to judge whether different positions contain a marked image, but only the ability to distinguish positions. For this reason, step S204 adds random factors when determining the embedding position of the marked image so as to simulate the real situation. In step S206, at least a part of the marked image is embedded into the original image, mainly considering that a marked image in a real scene may have been edited or compressed so that only part of it appears; a random occlusion factor is therefore added so that the detection model can handle the case where the marked image is only partially visible. In this way, the way real marked images arise is simulated by various random strategies, reliable and effective training data with labeling information is generated automatically, and the labor cost of obtaining training data is greatly reduced. Training the detection model with generated training data that fits the real situation gives the trained detection model higher detection precision.
According to the detection model training method, each type of marked image is randomly embedded into the original image; during embedding, the situation that a marked image in a real scene may have been edited or compressed is simulated, with some marked images embedded completely and others embedded with partial occlusion, so that labeled training data can be generated automatically to train the detection model. The training labels in the training data are the mark categories to which the embedded marked images belong. Therefore, training data do not need to be labeled manually, and various random strategies are used to simulate how real marked images appear in original images, so that the labor cost of labeling training data is greatly reduced while labeling efficiency and model training efficiency are greatly improved.
In one embodiment, the step S202, that is, the step of acquiring the original image to be processed and the more than one type of marked images specifically includes: acquiring an original image to be processed and more than one type of marking template; randomly selecting a target size proportion from a preset size proportion range; and according to the size of the original image, scaling the various marking templates according to the target size proportion to obtain corresponding marking images.
The marking template is a marking sample corresponding to each platform and can be used as a template. The size ratio is a size ratio between sizes of two objects, and specifically refers to a size ratio of the mark image to the original image in the embodiment of the present application. The preset size ratio range is a preset range composed of a series of size ratios, such as a height ratio range, a width ratio range, or an area ratio range.
Specifically, the computer device may obtain the marking templates corresponding to different platforms, that is, marking templates belonging to different mark categories. The computer device may specifically obtain marking templates having four-channel RGBA data. The computer device can then select a target size ratio from the preset size ratio range and scale each type of marking template according to the size of the original image and the target size ratio to obtain the corresponding marked images.
In one embodiment, the computer device may randomly select a marking template from the candidate marking templates, where each marking template has the same probability of being selected. The computer device may preset a range for the height ratio of the marked image to the original image, such as a height ratio range of 0.04 to 0.14. The computer device can randomly select a target height ratio from this range and then scale the marking template according to the target height ratio to obtain a marked image. It can be understood that the preset size ratio range may also be a preset width ratio range or area ratio range, and the correspondingly selected target size ratio may be a target width ratio or a target area ratio, which is not limited in this application.
In one embodiment, the computer device may set a selection probability corresponding to different size ratios in a range of preset size ratios. And then randomly selecting the target size proportion from the range of the preset size proportion according to the corresponding selection probability. In one embodiment, the selection probability is set according to the real image containing the marked image, a larger selection probability is set for the size proportion which frequently appears, and a smaller selection probability is set for the size proportion which rarely appears. Therefore, the real situation can be simulated more truly for the size situation of the marked image relative to the original image, and the training effect of the detection model can be improved.
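A minimal sketch of this scaling step is shown below; the use of Pillow, the file names, and the function names are assumptions, while the 0.04 to 0.14 height-ratio range is taken from the example above:

```python
# Sketch: randomly scale an RGBA marking template relative to the original image.
import random
from PIL import Image

HEIGHT_RATIO_RANGE = (0.04, 0.14)   # preset size-ratio range from the example above

def scale_template(template: Image.Image, original: Image.Image) -> Image.Image:
    """Pick a target height ratio uniformly at random and resize the template
    so that its height is that fraction of the original image's height."""
    ratio = random.uniform(*HEIGHT_RATIO_RANGE)
    target_h = max(1, int(original.height * ratio))
    target_w = max(1, int(template.width * target_h / template.height))  # keep aspect ratio
    return template.resize((target_w, target_h), Image.BILINEAR)

template = Image.open("mark_template.png").convert("RGBA")   # hypothetical file name
original = Image.open("original.jpg").convert("RGB")          # hypothetical file name
marked_image = scale_template(template, original)
```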
Referring to FIG. 3, FIG. 3 is a diagram illustrating various marking templates in one embodiment. For example, the mark template may specifically be a mark template corresponding to a platform such as "micro view", "time video", or "video for dinner". It is understood that fig. 3 is only a partial example, and the mark template and the mark image mentioned in the embodiments of the present application include, but are not limited to, the above-mentioned several. For example, the marked template may also be a template corresponding to a "tremble" platform, a "watermelon video" platform, a "fast-hand" platform, a "volcano small video" platform, or a "shrimp" platform.
In an embodiment, for each class of marking template, the frequency with which the computer device uses it to construct corresponding training data may be the same or different, which is not limited here. As different platforms keep appearing on the market, the number of corresponding marking templates keeps growing. For a newly appearing marking template, corresponding training data can be constructed in the manner described in the embodiments of the present application, without manually labeling additional training data, and the detection model can be retrained to extend its detection range and detection capability.
In the embodiment, the target size proportion is randomly selected from the range of the preset size proportion, and then the scaling processing is respectively carried out on various marking templates according to the size of the original image and the target size proportion, so that the marking images with different sizes can be obtained, the real situation can be better fitted, the constructed training sample is more accurate, and the training effect on the detection model can be greatly improved.
In one embodiment, the target locations include core locations and non-core locations; step S204, that is, for each type of marked image, the step of randomly selecting a target position from the target region of the original image as the embedding position of the marked image, specifically includes the following steps:
s402, determining the core position in the target area of the original image.
The core location is the position in the target region that has the highest probability of being selected as the embedding position. The computer device may select, from image data of a preset number of real scenes, the position at which marked images are embedded most frequently as the core position. Alternatively, the computer device may take the average of the embedding positions of the marked images in the image data of the multiple real scenes as the core position.
In one embodiment, the step S402, namely the step of determining the core position in the target region of the original image, specifically includes: acquiring a preset number of platform special images; the special images of each platform respectively comprise mark images corresponding to the corresponding platforms; determining an average coordinate corresponding to the target vertex according to the coordinates of the target vertex of each marked image in the platform-specific image; the average coordinates are taken as the core position in the target area of the original image.
The target vertex of the marker image may be specifically an upper left vertex, a lower left vertex, an upper right vertex, or a lower right vertex. When the target area is the upper left corner area of the original image, the target vertex of the marked image can be the upper left vertex specifically; when the target area is a lower right corner area of the original image, the target vertex of the marker image may be a lower right vertex.
In one embodiment, a computer device may obtain a preset number of platform-specific images, each of which is an image from a real scene. The computer device may determine the vertex coordinates of the target vertex of the marked image in each platform-specific image, calculate the average of these vertex coordinates, and then use the average coordinate as the core position in the target area of the original image. It will be appreciated that, because the core position is determined from target vertices, when it is selected as the embedding position during training-data construction, the marked image can be pasted into the original image by aligning its corresponding target vertex with the core position.
For example, the computer device may acquire 50 images of a real scene, i.e., platform-specific images, calculate the average coordinate of the corresponding marked image (such as a watermark image) in the upper left corner area of the 50 platform-specific images, and take the position corresponding to the average coordinate as the core position of the upper left corner area. Let the height of the original image be h and the width be w, and establish a rectangular coordinate system with the upper left vertex of the original image as the origin and the axes pointing right and down. The coordinates of the core position may, for example, be (w * 0.03, h * 0.015). When the target region is the lower right corner region, the coordinates of the core position of the lower right corner region may, for example, be (w - w * 0.03, h - h * 0.035).
In the above embodiment, the average position corresponding to the target vertex of the marker image in the platform specific images of the preset number in the real scene is used as the core position to allocate the maximum probability value, so as to better fit the real situation.
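A small sketch of this averaging step is shown below; the function and variable names are hypothetical, and the example numbers simply echo the illustrative coordinates above:

```python
# Sketch: estimate the core position as the mean target vertex of the marked
# images observed in a set of platform-specific images.
def core_position(target_vertices):
    """target_vertices: list of (x, y) marked-image vertices measured in real images."""
    n = len(target_vertices)
    mean_x = sum(x for x, _ in target_vertices) / n
    mean_y = sum(y for _, y in target_vertices) / n
    return (mean_x, mean_y)

# e.g. 50 measured upper-left vertices might average to roughly (0.03 * w, 0.015 * h),
# matching the illustrative coordinates given above.
```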
S404, obtaining a probability value when the core position is taken as an embedding position; the probability value corresponding to the core position is the maximum value of the probability values corresponding to all target positions in the target area.
Specifically, the computer device may preset a core position as a probability value when the core position is embedded, where the probability value corresponding to the core position is a maximum value of the probability values corresponding to the target positions in the target region.
S406, determining probability values when the non-core positions are respectively used as embedding positions according to the distances between the non-core positions in the target area and the core positions; the probability value corresponding to the non-core position is inversely related to the distance from the non-core position to the core position.
Specifically, positions in the target region other than the core position may be referred to as non-core positions. The computer device may calculate the distance between each non-core position in the target area and the core position; the distance may specifically be a Euclidean distance. The probability value of each non-core position being used as the embedding position is then determined according to its distance to the core position, where the probability value is inversely related to the distance from the core position. That is, the probability distribution decreases outward from the core position. It can be understood that the probability values corresponding to all target positions in the target area sum to 1.
And S408, selecting corresponding target positions as the embedding positions of the marked images according to the probability values corresponding to the target positions in the target area of the original image for each type of marked images.
Specifically, for the marker images selected from the various marker images, the computer device may select corresponding target positions as the embedding positions of the marker images according to the probability values corresponding to the target positions in the target region of the original image.
In one embodiment, the target region includes an upper left corner region and a lower right corner region; for each type of marked image, selecting a corresponding target position as an embedded position of the marked image according to the probability value corresponding to each target position in the target area of the original image respectively, wherein the method comprises the following steps: when the target area is the upper left corner area, for each type of marking template, respectively selecting a corresponding target position as an embedded position corresponding to the upper left vertex of the marking image according to the probability value corresponding to each target position in the upper left corner area of the original image; and when the target area is a lower right corner area, selecting the corresponding target position as an embedding position corresponding to a lower right vertex of the marked image for each type of marked template according to the probability value corresponding to each target position in the lower right corner area of the original image.
In an embodiment, when the target region is an upper left corner region, the computer device may select a corresponding target position as an embedding position corresponding to an upper left vertex of the tagged image according to respective probability values corresponding to respective target positions in the upper left corner region of the original image, and further embed the upper left corner of the tagged image into the position when the tagged image is embedded. When the target area is a lower right corner area, the computer device may select a corresponding target position as an embedding position corresponding to a lower right vertex of the tagged image according to respective probability values corresponding to target positions in the lower right corner area of the original image, and further embed the lower right corner of the tagged image into the position when the tagged image is embedded. In this way, for the upper left corner area and the lower right corner area, the upper left vertex and the lower right vertex of the marked image are respectively used as reference points for embedding, so that the embedding is more convenient and accurate.
In the above embodiment, the core position in the target region is assigned the maximum probability value, and the probability distribution over the non-core positions decreases outward from the core position; the target position is then selected as the embedding position of the marked image according to the probability value of each position. This better fits the actual situation while satisfying a random distribution, and makes the construction of sample images faster and more accurate.
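The position-selection strategy described above can be sketched as follows. The Gaussian-style decay and all names are assumptions; the embodiment only requires that the core position receive the largest probability and that probabilities decrease with distance from it:

```python
# Sketch: sample an embedding position from a probability map that peaks at the
# core position and decays with distance from it.
import numpy as np

def sample_embedding_position(region_w, region_h, core_xy, sigma=20.0, rng=None):
    rng = rng or np.random.default_rng()
    ys, xs = np.indices((region_h, region_w))
    dist2 = (xs - core_xy[0]) ** 2 + (ys - core_xy[1]) ** 2   # squared distance to the core
    weights = np.exp(-dist2 / (2.0 * sigma ** 2))             # decays away from the core
    probs = (weights / weights.sum()).ravel()                  # probabilities sum to 1
    idx = rng.choice(probs.size, p=probs)
    y, x = np.unravel_index(idx, (region_h, region_w))
    return int(x), int(y)   # used e.g. as the marked image's upper left vertex in the upper left region
```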
In one embodiment, the sample image includes a first sample image and a second sample image; step S206, that is, for each type of marked image, embedding at least a part of the marked image into the original image according to the corresponding embedding position to obtain a corresponding sample image, specifically includes: for each type of marked image, determining a first marked image to be completely embedded and a second marked image to be embedded with occlusion; completely embedding the first marked image into the original image according to the corresponding embedding position to obtain a corresponding first sample image; and completely embedding the second marked image into the original image according to the corresponding embedding position, randomly selecting a target occlusion ratio from a preset occlusion ratio range, and moving part of the second marked image beyond the boundary of the original image according to the target occlusion ratio to obtain a corresponding second sample image.
The first marker image is the marker image which is selected to be completely embedded into the original image, and the second marker image is the marker image which is selected to be shielded and embedded into the original image. And correspondingly, the first sample image is a sample image corresponding to the first marker image, and the second sample image is a sample image corresponding to the second marker image. It is to be understood that the original images may be the same original image or different original images, and this is not limited in this embodiment of the application.
In one embodiment, the computer device may preset the proportion of marked images to be completely embedded and to be embedded with occlusion; for example, the ratio of the number of second marked images to be embedded with occlusion to the number of first marked images to be completely embedded may be 1:3, that is, each marked image is occluded with a probability of 0.25. In other words, for each class of marked image, out of a certain number of marked images, 25% are embedded in the original image with occlusion and 75% are embedded completely. Of course, the ratio may take other values, and this embodiment is not limited thereto.
Further, the computer device may determine, from all the marker images, a first marker image to be completely embedded and a second marker image to be occluded and embedded. The computer device can completely embed the first mark image into the original image according to the corresponding embedding position to obtain a corresponding first sample image. When the target area is an upper left corner area, the computer device may embed an upper left vertex of the first marker image into the embedding position, so as to paste the complete first marker image into the original image, resulting in a corresponding first sample image. When the target area is a lower right corner area, the computer device may embed a lower right vertex of the first marker image into the embedded position, so as to paste the complete first marker image into the original image, resulting in a corresponding first sample image.
And for the second marked image, the computer equipment can randomly select a target shielding proportion from a preset shielding proportion range, completely embed the second marked image into the original image according to the corresponding embedding position, and then move out a part of the second marked image to the boundary of the original image according to the selected target shielding proportion to obtain a corresponding second sample image. The position information of the marker image embedded in the second sample image is updated again according to the moved position.
The occlusion ratio is the ratio of the height, width or area of the portion of the second marked image lying outside the original image to the corresponding height, width or area of the second marked image itself, such as a height occlusion ratio, a width occlusion ratio or an area occlusion ratio. The preset occlusion ratio range is a preset range composed of a series of occlusion ratios, such as a height occlusion ratio range, a width occlusion ratio range or an area occlusion ratio range. That is, the computer device may randomly select a target occlusion ratio from the range, and then move the second marked image upward, downward, leftward or rightward toward the boundary of the original image according to the target occlusion ratio, so that the ratio of the height, width or area of the part of the second marked image outside the original image to the height, width or area of the second marked image equals the target occlusion ratio.
For example, when the preset occlusion ratio range is a height occlusion ratio range, such as 0.3 to 0.7, the computer device may randomly select a target height occlusion ratio, such as 0.4, from that range, and then shift the second marked image upward or downward so that a partial area of the second marked image lies outside the original image and the height of that part is 0.4 times the height of the marked image. When the preset occlusion ratio range is a width occlusion ratio range, such as 0.3 to 0.7, the computer device may randomly select a target width occlusion ratio, such as 0.4, and shift the second marked image to the left or to the right so that the width of the part outside the original image is 0.4 times the width of the marked image. When the preset occlusion ratio range is an area occlusion ratio range, such as 0.3 to 0.7, the computer device may randomly select a target area occlusion ratio, such as 0.4, and move the second marked image toward the boundary of the original image so that the area of the part outside the original image is 0.4 times the area of the marked image. It is to be understood that the above occlusion ratio ranges and target occlusion ratios are only exemplary values and are not intended to limit the present application.
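As a companion sketch (again with illustrative names and an assumed Pillow dependency, simplified to the width-occlusion case in the upper left corner area), the occluded embedding can be approximated by pasting the marker at a negative horizontal offset so that the chosen fraction of its width falls outside the image; Pillow clips the part that lies beyond the boundary.

```python
import random
from PIL import Image

def embed_with_width_occlusion(original: Image.Image, marker: Image.Image,
                               y: int, ratio_range=(0.3, 0.7)):
    """Embed the marker so that a randomly chosen fraction of its width is occluded
    by the left image boundary; returns the sample image, the ratio used, and the
    clipped bounding box that the updated position label should describe."""
    ratio = random.uniform(*ratio_range)                    # target width occlusion ratio
    x = -int(marker.width * ratio)                          # shift left: this fraction ends up outside
    sample = original.copy()
    sample.paste(marker, (x, y), marker if marker.mode == "RGBA" else None)
    visible_box = (0, y, x + marker.width, y + marker.height)  # position info after the move
    return sample, ratio, visible_box
```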
In one embodiment, for different occlusion proportions within the preset occlusion proportion range, the computer device may set a selection probability corresponding thereto. The probability values selected by different shielding proportions may be the same or different, and the embodiment of the present application does not limit this.
In this embodiment, when the marked images are embedded into the original image, some marked images are completely embedded into the original image and some are embedded with partial occlusion, which well simulates the situation in real scenes where marked images may have been edited or compressed, thereby greatly improving the accuracy of the constructed training data.
In one embodiment, the detection model training method further includes a step of performing label detection on a video to be detected through the trained detection model, where the step specifically includes:
and S502, acquiring a video to be detected and a trained detection model.
Specifically, the computer device may obtain a video to be detected and a trained detection model in a video library. In a specific application scenario, the video to be detected is a small video to be detected. The small video to be detected is a video with the video duration being less than the preset duration or the video size being less than the preset size.
S504, extracting a preset number of video frames from the video to be detected, and cutting each video frame according to the target area to obtain a corresponding target image block.
Specifically, the computer device may convert the video to be detected into individual video frames at a preset frequency; for example, the computer device may extract one frame of image per second as a video frame. The computer device may then screen out a preset number of video frames from these frames, for example selecting the first frame and the intermediate frame as the video frames to be detected. Practical experience shows that, for a small video, a watermark image (namely, a marked image) appears with a very high probability in the upper left corner region of the first frame and in the lower right corner region of the intermediate frame, so selecting the first frame and the intermediate frame as the video frames to be detected can improve the accuracy of mark detection.
Furthermore, the computer device can cut out the image data of the target area in each extracted frame video frame to obtain a corresponding target image block. Specifically, the upper left corner region and the lower right corner region in each frame of video frame may be clipped out to obtain the corresponding target image block.
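The frame extraction and corner cropping described above could look like the following sketch; OpenCV is an assumed dependency, and the 300x300 crop size is only an example aligned with the SSD input size mentioned below.

```python
import cv2  # OpenCV, assumed available

def extract_corner_blocks(video_path: str, crop_w: int = 300, crop_h: int = 300):
    """Take the first frame and the intermediate frame of the video, and crop the
    upper left and lower right corner regions of each as target image blocks."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    blocks = []
    for idx in (0, max(total // 2, 0)):                     # first frame and intermediate frame
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        h, w = frame.shape[:2]
        blocks.append(frame[:crop_h, :crop_w])              # upper left corner block
        blocks.append(frame[h - crop_h:, w - crop_w:])      # lower right corner block
    cap.release()
    return blocks
```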
And S506, inputting each target image block into the trained detection model respectively, and outputting the detection result corresponding to each target image block.
Specifically, the computer device may input each target image block as input data to the trained detection model, process the input data through the model structure and the model parameters of the detection model, and output the detection result corresponding to each target image block.
It can be understood that, when the training labels of the detection model in the training process are only the label categories of the label images, the detection result output after the trained detection model is processed is the label category corresponding to each target image block. When the training labels of the detection model in the training process are the label types and the position information of the labeled images, the detection result output after the trained detection model is processed is the label type and the position information corresponding to each target image block.
In an embodiment, the step S506, that is, the step of inputting each target image block into the trained detection model respectively and outputting the detection result corresponding to each target image block specifically includes: inputting each target image block into the trained detection model respectively; sequentially processing the input target image blocks through at least three groups of convolution groups in the trained detection model; the down-sampling layer in the last group of convolution groups is a cavity convolution with the step length being a preset value so as to keep the size of the characteristic graph output by the last group of convolution groups as a preset size; carrying out convolution processing on the feature map output by the convolution group of the middle group to obtain a first feature map to be detected; taking the feature map output by the last group of convolution groups as a second feature map to be detected; performing convolution processing on the second characteristic diagram to be detected to obtain at least one third characteristic diagram to be detected; respectively detecting the first characteristic diagram to be detected, the second characteristic diagram to be detected and the third characteristic diagram to be detected to obtain corresponding candidate detection results and confidence degrees corresponding to the candidate results; and screening out candidate detection results with corresponding confidence degrees meeting the high-confidence-degree condition from the candidate detection results corresponding to the feature graphs to be detected as the detection results corresponding to the input target image blocks.
The convolution group (Block) is a network structure including a plurality of convolution layers. Specifically, the trained detection model includes at least three convolution groups; the computer device inputs the target image block into the trained convolution network, and the target image block is processed by each convolution group in turn. After each convolution group, the width and height of the corresponding feature map are reduced by half. The computer device may set the downsampling layer in the last convolution group to be a hole (dilated) convolution with a step size of a preset value (such as 2), so that the size of the feature map output by the last convolution group remains a preset size. The computer device may perform convolution processing on the feature map output by a convolution group of an intermediate group (such as the second group) to obtain the first feature map to be detected, and take the feature map output by the last convolution group as the second feature map to be detected. In addition, the computer device may also perform different convolution processing on the second feature map to be detected to obtain at least one third feature map to be detected, where the sizes of the third feature maps to be detected are different from one another.
Further, the computer device may perform detection on multiple layers of feature maps to be detected, where each layer has its own detector (these detectors do not share model parameters), and each detector outputs a candidate detection result (specifically, a regression frame and a classification result) corresponding to each position of the feature map to be detected. For each layer of feature map to be detected, the computer device may use the candidate detection result with the highest confidence level in the candidate detection results of each position in the feature map to be detected as the candidate detection result corresponding to the feature map to be detected. And then the computer equipment can screen out candidate detection results with corresponding confidence degrees meeting the high-confidence-degree condition from the candidate detection results corresponding to the feature graphs to be detected as the detection results corresponding to the input target image blocks. The confidence satisfying the high confidence condition may specifically be the maximum confidence or the top N (N is a positive integer greater than 1) names after ordering the confidence from high to low, and the like.
In one particular embodiment, the computer device may employ a neural network-based detection algorithm. The detection algorithm based on the neural network can be roughly divided into a one-stage detection algorithm and a two-stage detection algorithm, and considering that the speed of the one-stage detection algorithm is higher, the detection model in the application can specifically adopt an SSD detection algorithm in the one-stage detection algorithm. Referring to fig. 6, fig. 6 is a schematic diagram of a network structure of a detection network based on the SSD algorithm in an embodiment. The main network of the detection network can adopt a ResNet-34 layer network, a target image block with the size of 300x300 is input, then detection is carried out on a multi-layer characteristic graph to be detected of the network, each layer is provided with a respective detector (the detectors do not share parameters), and each detector outputs a regression frame and a classification result of each position of the characteristic graph to be detected.
The detection network mentioned in this embodiment of the application sequentially includes a convolution layer and 4 Block layers, and the width and height of the feature map are halved after each Block. The feature map output by the target image block after the second Block layer has a size of 38x38, and the feature map output by the third Block layer has a size of 19x19. In order to keep the feature map output by the fourth Block layer larger, the computer device implements the downsampling layer of the fourth Block with a hole convolution with a step size of 2, so that the output feature map is still 19x19 while its feature expression capability is stronger. The 38x38 feature map output by the second Block layer passes through two convolution layers and then yields the first feature map to be detected (of size 38x38); the 19x19 feature map corresponding to the fourth Block layer is the second feature map to be detected; and after the second feature map to be detected passes through 3 two-layer convolution structures (the second convolution in each having a stride of 2), third feature maps to be detected of sizes 10x10, 5x5 and 3x3 are output in sequence. The first, second and third feature maps to be detected together form 5 feature maps to be detected, and each feature map has a corresponding detector (detection head). For each position of a feature map to be detected, a preset number of anchor boxes are designed, and the detector predicts, for each anchor box, the probability that it belongs to each mark category and the rectangular bounding box of the marked image. The probability that each anchor box belongs to each mark category and the rectangular bounding box of the marked image constitute the candidate detection result corresponding to each position. The computer device may take the candidate detection result with the highest confidence among the candidate detection results of all positions in the 5 feature maps to be detected as the candidate detection result corresponding to the target image block.
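One possible reading of the hole convolution used in the fourth Block is a 3x3 convolution with dilation rate 2 and stride 1, which keeps a 19x19 feature map at 19x19 while enlarging the receptive field. The PyTorch sketch below illustrates this reading; the channel numbers and the exact layer composition are assumptions, not the disclosed configuration.

```python
import torch
import torch.nn as nn

# Dilation 2 with padding 2 preserves the 19x19 spatial size of a 3x3 convolution.
dilated_downsample = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=2, dilation=2),
    nn.BatchNorm2d(512),
    nn.ReLU(inplace=True),
)

feature_map = torch.randn(1, 256, 19, 19)     # feature map entering the fourth Block
print(dilated_downsample(feature_map).shape)  # torch.Size([1, 512, 19, 19])
```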
In the above embodiment, mark detection is performed separately on the multiple feature maps to be detected that correspond to the target image block, so that a candidate detection result with higher confidence is selected from the candidate detection results corresponding to the feature maps to be detected as the detection result corresponding to the target image block, which ensures the accuracy of the detection result.
In one embodiment, the detection model can be implemented with a RetinaNet network. The RetinaNet network adopts a feature pyramid network structure and adds top-down lateral branches to it; each lateral branch enlarges the high-level features and adds them element-wise to the low-level features, thereby enhancing the expression capability of the low-level features. Referring to fig. 7, fig. 7 is a schematic diagram of the network structure of a RetinaNet network in an embodiment. As shown in fig. 7, the network structure of the RetinaNet network mainly includes (a) ResNet, (b) FPN, and 2 FCN sub-networks. ResNet is short for Residual Network; FPN is short for Feature Pyramid Network; FCN is short for Fully Convolutional Network. The 2 fully convolutional networks in this embodiment of the application may specifically be (c) a classification sub-network (class sub-net) and (d) a detection box regression sub-network (box sub-net), which are used to predict the mark category to which a marked image belongs and the position information of the marked image, respectively. The backbone network (Backbone) of the RetinaNet network can be formed by ResNet + FPN, and a feature pyramid is obtained after the input image passes through feature extraction by the backbone network. After the feature pyramid is obtained, still referring to fig. 7, the two sub-networks, namely the class sub-net and the box sub-net, are applied to each level of the feature pyramid, and the final detection result is produced from their outputs.
In one embodiment, the RetinaNet backbone network employs a ResNet-50 network, and the size of the input target image block is 600 x 600. In practical applications, the foreground objects in image data are often much smaller than the background, and since the detector predicts class probabilities at every position of the feature map, the number of background samples is much larger than the number of foreground samples. To address this foreground-background imbalance problem, the loss function can be weighted with the Focal Loss used by the RetinaNet network. The advantage of Focal Loss is that hard samples can be mined automatically according to the magnitude of the loss while easy samples are down-weighted, so that the network learns efficiently from the hard samples, that is, it learns the classification of the marked images better.
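For reference, a standard binary focal loss can be written as below; the hyperparameters alpha=0.25 and gamma=2.0 are common defaults and are not values taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha: float = 0.25, gamma: float = 2.0):
    """Down-weight easy samples so the network focuses on hard samples;
    alpha balances the foreground and background classes."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)               # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```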
It is understood that the detection model can also be implemented by using other neural network algorithms, and the SSD algorithm and the RetinaNet mentioned in the above embodiments are only used for exemplary illustration, and are not used to limit the structure and algorithm of the detection network in the embodiments of the present application.
And S508, fusing the detection results of the target image blocks to obtain a detection result corresponding to the video to be detected.
Specifically, the computer device may determine a confidence of a detection result corresponding to each target image block, and use the detection result of the target image block corresponding to the maximum confidence as the detection result corresponding to the video to be detected. And the confidence corresponding to the detection result represents the credibility of the detection result.
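A minimal fusion sketch, assuming each target image block yields a tuple of (mark category, position information, confidence):

```python
def fuse_block_results(block_results):
    """Keep the block-level detection result with the highest confidence as the
    detection result of the whole video to be detected."""
    return max(block_results, key=lambda r: r[2]) if block_results else None
```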
In the above embodiment, image data of the target area is cut out from the video frame in the video to be detected to obtain the corresponding target image block, and then each target image block is labeled and detected through the trained detection model, and the detection result corresponding to each target image block is fused to obtain the detection result corresponding to the video to be detected, so that the detection speed is high and the detection precision is high when the video to be detected is labeled and detected.
In an embodiment, referring to fig. 8, fig. 8 is a flowchart illustrating the steps of performing label detection on a video to be detected through the trained detection model and obtaining a detection result in an embodiment. As shown in fig. 8, the computer device obtains a video to be detected, such as a mobile phone small video, extracts video frames from the small video, and crops out target image blocks corresponding to the upper left corner region and the lower right corner region from the extracted video frames. The computer device may then perform mark detection on each target image block. In this embodiment, the marked image is embodied as a watermark image, so the mark detection may also be referred to as watermark detection. The computer device may fuse the watermark detection results corresponding to the image blocks to obtain the detection result corresponding to the video to be detected.
In practical applications, when the marked image is a watermark image, the detection model obtained through the detection model training method provided in the embodiments of the application can, in actual use, accurately identify and locate the watermarks of the various platforms appearing in small videos, with high detection precision. Moreover, as new platforms keep emerging and the number of watermarks derived from small videos keeps growing, new training data can be generated automatically with the detection model training method provided in the embodiments of the application and the detection model can be retrained, so that new watermarks can also be identified well, which greatly reduces the waste of human resources and greatly improves the training efficiency and accuracy.
In a specific application scene, the detection result comprises a mark category to which a mark image in a video to be detected belongs; the detection model training method further comprises a step of filtering the specific video, and the step specifically comprises the following steps: acquiring a video filtering instruction; the video filtering instructions include a first target category; determining the label types respectively corresponding to the videos in the video library through the trained detection model; searching a video to be filtered with a mark category as a first target category from a video library; and responding to the video filtering instruction, and pushing the videos except the videos to be filtered in the video library to the user terminal initiating the video filtering instruction.
In a specific application scenario, when a user initiates a video filtering instruction to the computer device through a user terminal, the computer device may extract the first target category carried in the video filtering instruction. The computer device may perform label detection on each video in the video library through the trained detection model to obtain the detection result corresponding to each video, where the detection result includes the mark category corresponding to each video. The computer device may then search the video library for the videos to be filtered whose mark category is the first target category, and push the videos in the video library other than the videos to be filtered to the corresponding user terminal.
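A simple filtering sketch, assuming a hypothetical detect_category(video) helper that returns the mark category detected for a video (or None when no mark is found):

```python
def videos_after_filtering(video_library, detect_category, first_target_category):
    """Return the videos to push: all videos whose detected mark category is not
    the first target category carried in the video filtering instruction."""
    return [video for video in video_library
            if detect_category(video) != first_target_category]
```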
In one embodiment, the marked image is a watermark image, and the computer device can filter the video containing the specific watermark image according to the video filtering instruction, so that the user can filter the video from the specific video application according to the likes and dislikes of the user.
In the embodiment, the label category corresponding to each video in the video library can be determined through the trained detection model, and further, videos of a certain specific label category which is disliked by the user are filtered out, so that the video service can be conveniently and intelligently provided for the user.
In a specific application scenario, the detection result includes the mark category to which a marked image in the video to be detected belongs; the detection model training method further includes a step of searching for a specific video, and the step specifically includes: acquiring a video searching instruction, where the video searching instruction includes a second target category; determining the mark categories respectively corresponding to the videos in the video library through the trained detection model; searching the video library for a target video whose mark category is the second target category; and responding to the video searching instruction by pushing the target video to the user terminal initiating the video searching instruction.
In one embodiment, when a user initiates a video search instruction to the computer device through a user terminal, the computer device may extract the second target category carried in the video search instruction. The computer device may perform label detection on each video in the video library through the trained detection model to obtain the detection result corresponding to each video, where the detection result includes the mark category corresponding to each video. The computer device may then search the video library for the target video whose mark category is the second target category and push the target video to the user terminal initiating the video search instruction.
In one embodiment, the tagged image is a watermark image, and a user may enter a watermark category associated with a video application so that a computer device may quickly retrieve a corresponding video, such that the user may search for video from a particular video application according to his or her preferences.
In the embodiment, the label category corresponding to each video in the video library can be determined through the trained detection model, and then the video of a certain favorite specific label category of the user is searched, so that the video service can be conveniently and intelligently provided for the user.
In one embodiment, the detection result comprises the mark category and the position information to which the mark image in the video to be detected belongs; the detection model training method further comprises the step of editing the marked images in the video, and the step specifically comprises the following steps: the computer equipment can determine the mark types respectively corresponding to the videos in the video library and the position information of the mark images through the trained detection model, and then positions the mark images so as to realize the editing processing of the mark images.
FIG. 2 is a flowchart illustrating a method for training a detection model according to an embodiment. It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least a portion of the steps in fig. 2 may include multiple sub-steps or multiple stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 9, a detection model training apparatus 900 is provided, which may be a part of a computer device in the form of software modules or hardware modules, or a combination of both. The apparatus includes an acquisition module 901, a selection module 902, an embedding module 903, a determination module 904 and a training module 905, wherein:
an obtaining module 901, configured to obtain an original image to be processed and more than one type of marked images.
And a selecting module 902, configured to, for each type of marked image, randomly select a target position from a target area of the original image as an embedding position of the marked image.
And the embedding module 903 is configured to embed at least a part of the marker image into the original image according to the corresponding embedding position to obtain a corresponding sample image for each type of marker image.
The determining module 904 is configured to take the sample image as a training sample, and take the label category of the marked image embedded in the sample image as the corresponding training label.
The training module 905 is configured to train the detection model to be trained through the training samples and the corresponding training labels.
In one embodiment, the obtaining module 901 is further configured to obtain an original image to be processed and more than one type of mark templates; randomly selecting a target size proportion from a preset size proportion range; and according to the size of the original image, scaling the various marking templates according to the target size proportion to obtain corresponding marking images.
In one embodiment, the target locations include core locations and non-core locations; the selecting module 902 is further configured to determine a core position in the target region of the original image; acquiring a probability value when the core position is taken as an embedded position; the probability value corresponding to the core position is the maximum value in the probability values corresponding to all target positions in the target area; determining probability values when the non-core positions are respectively used as embedding positions according to the distances between the non-core positions in the target area and the core positions; the probability value corresponding to the non-core position is negatively correlated with the distance from the non-core position to the core position; and for each type of marked image, selecting a corresponding target position as an embedded position of the marked image according to the probability value corresponding to each target position in the target area of the original image.
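As an illustration of the distance-based position sampling performed by the selecting module, the sketch below assigns each candidate position a weight that decays with its distance to the core position and samples one position accordingly; the 1/(distance + 1) weighting is only an assumed example of a negative correlation.

```python
import numpy as np

def sample_embed_position(core_xy, region_w, region_h, eps: float = 1.0):
    """Sample an embedding position in the target area: the core position has the
    largest probability, and farther positions have smaller probabilities."""
    xs, ys = np.meshgrid(np.arange(region_w), np.arange(region_h))
    dist = np.hypot(xs - core_xy[0], ys - core_xy[1])
    weights = 1.0 / (dist + eps)                  # probability negatively correlated with distance
    probs = (weights / weights.sum()).ravel()
    idx = np.random.choice(region_w * region_h, p=probs)
    return int(idx % region_w), int(idx // region_w)   # (x, y) within the target area
```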
In one embodiment, the selecting module 902 is further configured to obtain a preset number of platform-specific images; the special images of each platform respectively comprise mark images corresponding to the corresponding platforms; determining an average coordinate corresponding to the target vertex according to the coordinates of the target vertex of each marked image in the platform-specific image; the average coordinates are taken as the core position in the target area of the original image.
In one embodiment, the target region includes an upper left corner region and a lower right corner region; the selecting module 902 is further configured to, when the target region is an upper left corner region, select, for each type of labeled template, a corresponding target position as an embedded position corresponding to an upper left vertex of the labeled image according to a probability value corresponding to each target position in the upper left corner region of the original image; and when the target area is a lower right corner area, selecting the corresponding target position as an embedding position corresponding to a lower right vertex of the marked image for each type of marked template according to the probability value corresponding to each target position in the lower right corner area of the original image.
In one embodiment, the sample image includes a first sample image and a second sample image; the embedding module 903 is further configured to determine, for each type of marker image, a first marker image to be completely embedded and a second marker image to be occluded and embedded in the marker image; completely embedding the first mark image into the original image according to the corresponding embedding position to obtain a corresponding first sample image; and embedding the second marked image into the original image according to the corresponding embedding position, randomly selecting a target shielding proportion from a preset shielding proportion range, and moving a part of the second marked image out to the boundary of the original image according to the target shielding proportion to obtain a corresponding second sample image.
In one embodiment, the determining module 904 is further configured to determine a label category of the label image embedded in each sample image and position information of the label image in the original image; and taking the sample image as a training sample, and taking the label type of the label image embedded in the sample image and the corresponding position information as a training label of the training sample.
In one embodiment, the training module 905 is further configured to crop the sample image according to the target area to obtain a corresponding sample image block; respectively extracting the characteristics of each sample image block through a detection model to be trained to obtain a corresponding characteristic diagram, and detecting and outputting a prediction result based on the characteristic diagram; and adjusting the model parameters of the detection model according to the difference between the prediction result corresponding to the sample image block and the corresponding training label until the training stopping condition is met, and stopping training.
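A minimal training-loop sketch for the training module, with illustrative names; the data loader is assumed to yield cropped sample image blocks together with their training labels, and a fixed step budget stands in for the training stop condition.

```python
import torch

def train_detection_model(model, loader, optimizer, criterion, max_steps: int = 10000):
    """Adjust the model parameters from the difference between predictions and labels."""
    model.train()
    for step, (sample_blocks, training_labels) in enumerate(loader):
        predictions = model(sample_blocks)        # feature extraction and detection
        loss = criterion(predictions, training_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step + 1 >= max_steps:                 # stand-in for the training stop condition
            break
    return model
```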
In one embodiment, the detection model training apparatus 900 further includes a label detection module 906, configured to obtain a video to be detected and a trained detection model; extracting a preset number of video frames from a video to be detected, and cutting each video frame according to a target area to obtain a corresponding target image block; inputting each target image block into a trained detection model respectively, and outputting a detection result corresponding to each target image block; and fusing the detection results of all the target image blocks to obtain a detection result corresponding to the video to be detected.
In one embodiment, the label detection module 906 is further configured to input each target image block to the trained detection model; sequentially processing the input target image blocks through at least three groups of convolution groups in the trained detection model; the down-sampling layer in the last group of convolution groups is a cavity convolution with the step length being a preset value so as to keep the size of the characteristic graph output by the last group of convolution groups as a preset size; carrying out convolution processing on the feature map output by the convolution group of the middle group to obtain a first feature map to be detected; taking the feature map output by the last group of convolution groups as a second feature map to be detected; performing convolution processing on the second characteristic diagram to be detected to obtain at least one third characteristic diagram to be detected; respectively detecting the first characteristic diagram to be detected, the second characteristic diagram to be detected and the third characteristic diagram to be detected to obtain corresponding candidate detection results and confidence degrees corresponding to the candidate results; and screening out candidate detection results with corresponding confidence degrees meeting the high-confidence-degree condition from the candidate detection results corresponding to the feature graphs to be detected as the detection results corresponding to the input target image blocks.
Referring to fig. 10, in one embodiment, the detection result includes a label category to which a label image in the video to be detected belongs; the detection model training device 900 further includes a video pushing module 907 for acquiring a video filtering instruction; the video filtering instructions include a first target category; determining the label types respectively corresponding to the videos in the video library through the trained detection model; searching a video to be filtered with a mark category as a first target category from a video library; and responding to the video filtering instruction, and pushing the videos except the videos to be filtered in the video library to the user terminal initiating the video filtering instruction.
In one embodiment, the video pushing module 907 is further configured to acquire a video search instruction, where the video search instruction includes a second target category; determine the mark categories respectively corresponding to the videos in the video library through the trained detection model; search the video library for a target video whose mark category is the second target category; and, in response to the video search instruction, push the target video to the user terminal initiating the video search instruction.
According to the above detection model training apparatus, each type of marked image is randomly embedded into the original image, and the embedding simulates the situation where marked images in real scenes may have been edited or compressed, for example by embedding part of the marked images with occlusion, so that labeled training data can be generated automatically to train the detection model. The training labels in the training data are the label categories to which the embedded marked images belong. Therefore, the training data does not need to be labeled manually, and various random strategies simulate how real marked images appear in original images, which greatly reduces the labor cost of labeling training data and greatly improves the labeling efficiency and the model training efficiency.
For specific limitations of the detection model training apparatus, reference may be made to the above limitations of the detection model training method, which are not described herein again. The modules in the detection model training device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
FIG. 11 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 or the server 120 in fig. 1. As shown in fig. 11, the computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the detection model training method. The internal memory may also have a computer program stored therein, which when executed by the processor, causes the processor to perform a detection model training method.
Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (15)

1. A detection model training method, comprising:
acquiring an original image to be processed and more than one type of marked images;
for each type of marked image, respectively randomly selecting a target position from a target area of the original image as an embedding position of the marked image;
for each type of marked image, embedding at least one part of the marked image into the original image according to the corresponding embedding position to obtain a corresponding sample image;
taking the sample image as a training sample, and taking the label category of the label image embedded in the sample image as a corresponding training label;
and training the detection model to be trained through the training sample and the corresponding training label.
2. The method of claim 1, wherein the acquiring the raw image to be processed and the more than one type of labeled image comprises:
acquiring an original image to be processed and more than one type of marking template;
randomly selecting a target size proportion from a preset size proportion range;
and according to the size of the original image, respectively carrying out scaling processing on each type of marking template according to the target size proportion to obtain a corresponding marking image.
3. The method of claim 1, wherein the target locations comprise core locations and non-core locations; for each type of marked image, respectively randomly selecting a target position from a target area of the original image as an embedding position of the marked image, including:
determining a core position in a target region of the original image;
obtaining a probability value when the core position is taken as an embedding position; the probability value corresponding to the core position is the maximum value in the probability values corresponding to all target positions in the target area;
determining probability values when the non-core positions are respectively used as embedding positions according to the distances between the non-core positions in the target area and the core positions; the probability value corresponding to the non-core position is in negative correlation with the distance from the non-core position to the core position;
and for each type of marked image, selecting a corresponding target position as the embedded position of the marked image according to the probability value corresponding to each target position in the target area of the original image.
4. The method of claim 3, wherein the determining the core location in the target region of the original image comprises:
acquiring a preset number of platform special images; each platform special image comprises a mark image corresponding to a corresponding platform;
determining an average coordinate corresponding to a target vertex according to the coordinate of the target vertex of each marked image in the platform-specific image;
and taking the average coordinate as a core position in a target area of the original image.
5. The method of claim 3, wherein the target region comprises an upper left corner region and a lower right corner region; for each type of marked image, selecting a corresponding target position as an embedded position of the marked image according to the probability value corresponding to each target position in the target area of the original image, respectively, including:
when the target area is the upper left corner area, selecting corresponding target positions as embedding positions corresponding to the upper left vertex of the marked image for each type of marked template according to the probability values corresponding to the target positions in the upper left corner area of the original image;
and when the target area is a lower right corner area, selecting a corresponding target position as an embedding position corresponding to a lower right vertex of the marked image for each type of marked template according to the probability value corresponding to each target position in the lower right corner area of the original image.
6. The method of claim 1, wherein the sample image comprises a first sample image and a second sample image; for each type of marked image, embedding at least one part of the marked image into the original image according to the corresponding embedding position to obtain a corresponding sample image, including:
for each type of marked image, determining a first marked image to be completely embedded and a second marked image to be shielded and embedded in the marked image;
completely embedding the first mark image into the original image according to the corresponding embedding position to obtain a corresponding first sample image;
and completely embedding the second marked image into the original image according to the corresponding embedding position, randomly selecting a target shielding proportion from a preset shielding proportion range, and moving a part of the second marked image out to the boundary of the original image according to the target shielding proportion to obtain a corresponding second sample image.
7. The method according to claim 1, wherein the using the sample image as a training sample and the label category of the label image embedded in the sample image as a corresponding training label comprises:
determining a mark type of a mark image embedded in each sample image and position information of the mark image in the original image;
and taking the sample image as a training sample, and taking the label type of the label image embedded in the sample image and the corresponding position information as a training label of the training sample.
8. The method of claim 1, wherein training the detection model to be trained through the training samples and corresponding training labels comprises:
cutting the sample image according to the target area to obtain a corresponding sample image block;
respectively extracting the characteristics of each sample image block through a detection model to be trained to obtain a corresponding characteristic diagram, and detecting and outputting a prediction result based on the characteristic diagram;
and adjusting the model parameters of the detection model according to the difference between the prediction result corresponding to the sample image block and the corresponding training label until the training stopping condition is met, and stopping training.
9. The method according to any one of claims 1 to 8, further comprising:
acquiring a video to be detected and a trained detection model;
extracting a preset number of video frames from the video to be detected, and cutting each video frame according to the target area to obtain a corresponding target image block;
inputting each target image block into the trained detection model respectively, and outputting a detection result corresponding to each target image block;
and fusing the detection results of the target image blocks to obtain a detection result corresponding to the video to be detected.
10. The method according to claim 9, wherein the inputting each of the target image blocks into the trained detection model and outputting the detection result corresponding to each of the target image blocks respectively comprises:
inputting each target image block into the trained detection model respectively;
sequentially processing the input target image blocks through at least three groups of convolution groups in the trained detection model; the down-sampling layer in the last group of convolution groups is a cavity convolution with the step length being a preset value so as to keep the size of the characteristic graph output by the last group of convolution groups as a preset size;
carrying out convolution processing on the feature map output by the convolution group of the middle group to obtain a first feature map to be detected;
taking the feature map output by the last group of convolution groups as a second feature map to be detected;
performing convolution processing on the second characteristic diagram to be detected to obtain at least one third characteristic diagram to be detected;
respectively detecting the first characteristic diagram to be detected, the second characteristic diagram to be detected and the third characteristic diagram to be detected to obtain a candidate detection result and a confidence coefficient corresponding to the candidate result;
and screening out candidate detection results with corresponding confidence degrees meeting the high-confidence-degree condition from the candidate detection results corresponding to the feature graphs to be detected as the input detection results corresponding to the target image blocks.
11. The method according to claim 9, wherein the detection result comprises a label category to which a label image in the video to be detected belongs; the method further comprises the following steps:
acquiring a video filtering instruction; the video filtering instructions include a first target category;
determining the label type respectively corresponding to each video in the video library through the trained detection model;
searching videos to be filtered with the mark category as a first target category from the video library;
and responding to the video filtering instruction, and pushing the videos except the video to be filtered in the video library to a user terminal initiating the video filtering instruction.
12. The method according to claim 9, wherein the detection result comprises a label category to which a label image in the video to be detected belongs; the method further comprises the following steps:
acquiring a video searching instruction; the video search instruction comprises a second target category;
determining the label type respectively corresponding to each video in the video library through the trained detection model;
searching a target video with a mark category being a second target category from the video library;
and responding to the video searching instruction, and pushing the target video to a user terminal initiating the video searching instruction.
13. A detection model training apparatus, comprising:
the acquisition module is used for acquiring an original image to be processed and more than one type of marked images;
the selecting module is used for respectively and randomly selecting a target position from the target area of the original image as the embedding position of the marked image for each type of marked image;
the embedding module is used for embedding at least one part of the marked images into the original image to obtain corresponding sample images according to corresponding embedding positions of the marked images;
the determining module is used for taking the sample image as a training sample and taking the label type of the label image embedded in the sample image as a corresponding training label;
and the training module is used for training the detection model to be trained through the training sample and the corresponding training label.
14. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 12.
15. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any one of claims 1 to 12.
CN202010108690.8A 2020-02-21 2020-02-21 Detection model training method and device, storage medium and computer equipment Pending CN111311475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010108690.8A CN111311475A (en) 2020-02-21 2020-02-21 Detection model training method and device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010108690.8A CN111311475A (en) 2020-02-21 2020-02-21 Detection model training method and device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN111311475A true CN111311475A (en) 2020-06-19

Family

ID=71158469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010108690.8A Pending CN111311475A (en) 2020-02-21 2020-02-21 Detection model training method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN111311475A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814905A (en) * 2020-07-23 2020-10-23 上海眼控科技股份有限公司 Target detection method, target detection device, computer equipment and storage medium
CN112200004B (en) * 2020-09-15 2024-01-16 深圳市优必选科技股份有限公司 Training method and device for image detection model and terminal equipment
CN112200004A (en) * 2020-09-15 2021-01-08 深圳市优必选科技股份有限公司 Training method and device of image detection model and terminal equipment
CN112784675A (en) * 2020-11-25 2021-05-11 上海芯翌智能科技有限公司 Target detection method and device, storage medium and terminal
CN112784675B (en) * 2020-11-25 2023-06-30 上海芯翌智能科技有限公司 Target detection method and device, storage medium and terminal
CN112465088A (en) * 2020-12-07 2021-03-09 合肥维天运通信息科技股份有限公司 Two-dimensional code position generation method
CN112633357A (en) * 2020-12-18 2021-04-09 北京地平线信息技术有限公司 Sample image generation method and device and image recognition model generation method and device
TWI793521B (en) * 2021-02-09 2023-02-21 竹陞科技股份有限公司 Image recognition and management system
CN113111960A (en) * 2021-04-25 2021-07-13 北京文安智能技术股份有限公司 Image processing method and device and training method and system of target detection model
CN113111960B (en) * 2021-04-25 2024-04-26 北京文安智能技术股份有限公司 Image processing method and device and training method and system of target detection model
CN113688748A (en) * 2021-08-27 2021-11-23 武汉大千信息技术有限公司 Fire detection model and method
CN113688748B (en) * 2021-08-27 2023-08-18 武汉大千信息技术有限公司 Fire detection model and method
CN116468985B (en) * 2023-03-22 2024-03-19 北京百度网讯科技有限公司 Model training method, quality detection device, electronic equipment and medium
CN116468985A (en) * 2023-03-22 2023-07-21 北京百度网讯科技有限公司 Model training method, quality detection device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN111311475A (en) Detection model training method and device, storage medium and computer equipment
CN108876791B (en) Image processing method, device and system and storage medium
RU2693906C2 (en) Rule-based analysis of video importance
CN109035304B (en) Target tracking method, medium, computing device and apparatus
CN109815843B (en) Image processing method and related product
US9626585B2 (en) Composition modeling for photo retrieval through geometric image segmentation
CN110008962B (en) Weak supervision semantic segmentation method based on attention mechanism
US9633446B2 (en) Method, apparatus and computer program product for segmentation of objects in media content
WO2022105608A1 (en) Rapid face density prediction and face detection method and apparatus, electronic device, and storage medium
CN110176024B (en) Method, device, equipment and storage medium for detecting target in video
US20220172476A1 (en) Video similarity detection method, apparatus, and device
CN110942456B (en) Tamper image detection method, device, equipment and storage medium
CN113255685B (en) Image processing method and device, computer equipment and storage medium
CN108647703A (en) A kind of type judgement method of the classification image library based on conspicuousness
CN107948586A (en) Trans-regional moving target detecting method and device based on video-splicing
JP4369308B2 (en) Representative image selection device, representative image selection method, and representative image selection program
CN109685079B (en) Method and device for generating characteristic image category information
CN116095363A (en) Mobile terminal short video highlight moment editing method based on key behavior recognition
CN110852172B (en) Method for expanding crowd counting data set based on Cycle Gan picture collage and enhancement
CN112380970A (en) Video target detection method based on local area search
Qiu et al. A methodology review on multi-view pedestrian detection
Kalboussi et al. Object proposals for salient object segmentation in videos
WO2024025134A1 (en) A system and method for real time optical illusion photography
Lin et al. Moving shadow detection using fusion of multiple features
Medina et al. Where is this? Video geolocation based on neural network features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40024265

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination