WO2019223361A1 - Video analysis method and apparatus


Info

Publication number
WO2019223361A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
analyzed
image
target identifier
preset
Prior art date
Application number
PCT/CN2019/073661
Other languages
French (fr)
Chinese (zh)
Inventor
戴威 (Dai Wei)
Original Assignee
北京国双科技有限公司 (Beijing Gridsum Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 北京国双科技有限公司 (Beijing Gridsum Technology Co., Ltd.)
Publication of WO2019223361A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Definitions

  • the present application relates to the field of video processing, and in particular, to a video analysis method and device.
  • program title sponsorship has become an effective channel for advertisers to promote corporate brands.
  • advertisers embed corporate brand advertisements in TV programs so that viewers notice the embedded advertisements while watching the programs, thereby publicizing the corporate brand.
  • exposure data, such as whether a corporate brand is exposed in a television program, the position of the exposure, and the duration of the exposure, affect the effectiveness of the brand's promotion. Therefore, the exposure data of the corporate brand in TV programs needs to be analyzed, either to find a publicity approach that achieves a better promotional effect for the advertiser's corporate brand, or to analyze the exposure data of competitors' corporate brands.
  • in view of the above problems, the present invention provides a video analysis method and device that overcome the above problems or at least partially solve them.
  • a video analysis method includes:
  • the preset condition includes: the placeholders of the target identifiers distributed in at least two adjacent frames of images at least partially overlap, to obtain a detection result;
  • the identifying the target identifier in the video to be analyzed includes:
  • for any one frame of image in the video to be analyzed, the preset model identifies the target identifier in that frame of image according to the following steps:
  • by fully connecting the region sets of the at least two scales, the target identifier in that frame of image is identified.
  • the preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
  • extracting the multi-scale features of the arbitrary frame of images to obtain a multi-scale feature image set includes:
  • the generating candidate regions based on the multi-scale feature image set includes:
  • the multi-scale feature image set is input to the candidate region generation network, and the candidate region is generated by the candidate region generation network.
  • the preset model is trained in the following manner to obtain the trained preset model:
  • the training set includes: a plurality of frames of images to which the target identifier is marked;
  • the corrected image is: an image that has been manually corrected for the incorrect annotation
  • the preset conditions further include:
  • the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage; among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
  • determining, according to the detection result, the exposure data of the target identifier in the video to be analyzed includes:
  • if the detection result is that the target identifier meets the preset condition, determining that the target identifier is exposed in the video to be analyzed and further determining an exposure parameter, wherein the exposure parameter includes at least one of the following: exposure duration, exposure position;
  • if the detection result is that the target identifier does not meet the preset condition, determining that the target identifier is not exposed in the video to be analyzed.
  • a video analysis device includes:
  • a first identification unit configured to identify a target identifier in the video to be analyzed
  • a detection unit configured to detect whether the identified target identifier meets a preset condition, the preset condition including: the placeholders of the target identifiers distributed in at least two adjacent frames of images at least partially overlap, to obtain a detection result;
  • a determining unit configured to determine, according to the detection result, the exposure data of the target identifier in the video to be analyzed.
  • the first identification unit includes:
  • a first input subunit configured to input each frame image in the video to be analyzed into a preset model after training, so that the trained preset model identifies a target identifier in each frame of the video to be analyzed ;
  • the preset model includes:
  • a first extraction unit configured to extract multi-scale features of the arbitrary one-frame image to obtain a multi-scale feature image set
  • a generating unit configured to generate a candidate region based on the multi-scale feature image set
  • a selection unit configured to select a feature image set of at least two scales from the multi-scale feature image set
  • a second extraction unit configured to respectively extract region sets corresponding to the candidate region from the feature image sets of the at least two scales, to obtain region sets of at least two scales corresponding to the feature image sets of the at least two scales;
  • the second recognition unit recognizes the target identifier in the arbitrary one-frame image by fully connecting the region sets of at least two scales.
  • the preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
  • the first extraction unit is specifically configured to extract the multi-scale features of the arbitrary one-frame image by using the underlying feature extraction module to obtain the multi-scale feature image set;
  • the generating unit is specifically configured to input the multi-scale feature image set into the candidate region generating network, and generate the candidate region through the candidate region generating network.
  • the training unit is configured to train the preset model to obtain the trained preset model
  • the training unit includes:
  • a first acquisition subunit configured to acquire a training set, where the training set includes: multiple frames of images to which the target identifier has been labeled;
  • a first training subunit configured to train the preset model by using the multi-frame image to obtain a first preset model
  • a second input subunit configured to input an image in the video to be analyzed into the first preset model
  • a second acquisition subunit configured to acquire an image labeled with the target identifier through the first preset model in the video to be analyzed; the image labeled with the target identifier has an incorrect label;
  • a third acquisition subunit configured to acquire a corrected image;
  • the corrected image is: an image that has been manually corrected for the incorrect annotation;
  • a second training subunit is configured to use the modified image to train the first preset model to obtain the trained preset model.
  • the detection unit is further configured to detect whether the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage, and whether, among those target identifiers, the total number whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
  • the determining unit includes:
  • a first determining subunit configured to, if the detection result is that the target identifier meets the preset condition, determine that the target identifier is exposed in the video to be analyzed and further determine an exposure parameter, where the exposure parameter includes at least one of the following: exposure duration, exposure position;
  • a second determining subunit is configured to determine that the target identifier is not exposed in the video to be analyzed if the detection result is that the target identifier does not meet the preset condition.
  • a storage medium stores a program, and when the program is executed by a processor, the video analysis method according to any one of the foregoing is implemented.
  • a processor is configured to run a program, and when the program runs, the video analysis method according to any one of the foregoing is performed.
  • the technical solution provided by the present invention identifies the target identifier in the video to be analyzed and detects whether the identified target identifier meets a preset condition; the characteristics of a target identifier that meets the preset condition specifically match the characteristics the target identifier has when it is exposed in the video. In this embodiment, a detection result is obtained by detecting whether the identified target identifier satisfies the preset condition, and the detection result is either that the identified target identifier satisfies the preset condition or that it does not. Therefore, whether the target identifier is exposed can be determined from the detection result: when the target identifier is exposed, the identified target identifier meets the preset condition, and the preset condition requires that the placeholders of the target identifiers distributed in at least two adjacent frames of images at least partially overlap.
  • the placeholders of the at least partially overlapping target identifiers reflect the position of the exposed target identifier. Further, according to the position of the exposed target identifier, the playback duration of the exposure at that position in the video to be analyzed can be determined. Accordingly, in the embodiments of the present application, the exposure data of the target identifier in the video to be analyzed can be determined according to the detection result, which saves human resources.
  • FIG. 1 shows a flowchart of an embodiment of a model training method in the present application
  • FIG. 2 shows a schematic diagram of labeling each BMW brand logo in an image by using a box in the present application
  • FIG. 3 shows a flowchart of an embodiment of a method for analyzing target identification in a video in the present application
  • FIG. 4 is a schematic diagram showing a distribution of target identifiers identified in an image included in an image set in the present application
  • FIG. 5 is a schematic structural diagram of an embodiment of an analysis apparatus for target identification in a video in the present application.
  • a model for target recognition is set, and specifically, it can be applied to a scene based on target recognition, for example, image classification and image segmentation.
  • the model architecture can be a Faster-RCNN architecture.
  • the ResNet model is used as the underlying feature extraction model, and the RPN network is used as the candidate region generation network.
  • the ResNet model includes five parts, which are part 1, part 2, part 3, part 4 and part 5, each of which includes a pooling layer and a convolution layer.
  • the processing flow of the model is improved. Specifically, taking the model for image recognition as an example, the specific improvement of the processing flow of the model in this embodiment is introduced.
  • the image to be processed is input into the model, and the convolutional layers in different parts of the model output information of the image to be processed at different scales (different scales of the image can be understood as different resolutions). For example, when the size of the image to be processed is M * M, the convolutional layer of part 1 outputs a first feature image set of size M * M, the convolutional layer of part 2 outputs a second feature image set of size M * M, the convolutional layer of part 3 outputs a third feature image set of size M/2 * M/2, the convolutional layer of part 4 outputs a fourth feature image set of size M/4 * M/4, and the convolutional layer of part 5 outputs a fifth feature image set of size M/8 * M/8.
  • the first feature image set, the second feature image set, the third feature image set, the fourth feature image set, and the fifth feature image set are all composed of multiple layers of images, and the number of image layers in each feature image set is the same as the number of convolution kernels in the convolutional layer corresponding to that feature image set (a code sketch of this multi-scale extraction follows below).
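  • As a concrete illustration of the multi-scale extraction above, the following is a minimal sketch assuming PyTorch/torchvision (the patent names no framework); torchvision's ResNet stage names 'layer1' through 'layer4' stand in for the 'part 1' to 'part 5' terminology used here, and the 512 * 512 input size is illustrative.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

backbone = resnet50(weights=None)
# Tap the outputs of several stages to obtain feature image sets at
# different scales; deeper stages give coarser, higher-level features.
extractor = create_feature_extractor(
    backbone,
    return_nodes={"layer1": "c2", "layer2": "c3", "layer3": "c4", "layer4": "c5"},
)

frame = torch.randn(1, 3, 512, 512)  # one frame of the video to be analyzed
features = extractor(frame)
for name, fmap in features.items():
    # c2: (1, 256, 128, 128) ... c5: (1, 2048, 16, 16) for a 512x512 input
    print(name, tuple(fmap.shape))
```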
  • next, the feature image sets of different scales are input into an RPN network, which generates candidate regions; then, at least two feature image sets are selected from the five feature image sets, the region sets corresponding to the candidate regions are extracted from the selected feature image sets respectively, the extracted region sets are unified to a preset length and width, and the region sets unified to the preset size are spliced along the layer dimension.
  • for example, suppose the first feature image set is 128 * 128 * 3, where 3 is the number of layers of first feature images included in the set and 128 * 128 is the size of any one first feature image; the second feature image set is 64 * 64 * 6; the third feature image set is 32 * 32 * 4; the fourth feature image set is 16 * 16 * 2; and the fifth feature image set is 4 * 4 * 3. The meanings of the parameters of the second through fifth feature image sets are the same as those of the first feature image set and are not repeated here.
  • data sampling is performed so that the images included in the selected at least two feature image sets are unified in size, for example to 7 * 7.
  • after the selected at least two feature image sets are unified in size, they are superimposed along the layer dimension. Specifically, suppose the selected feature image sets are the third feature image set and the fifth feature image set, and the sizes of the images included in the two sets are unified to 7 * 7; the two 7 * 7 feature image sets are then superimposed along the layer dimension, and the superimposed feature image set is 7 * 7 * 7 (the 4 layers of the third set plus the 3 layers of the fifth set), as sketched below.
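  • The unify-and-splice step can be sketched as follows, again assuming PyTorch/torchvision; torchvision's roi_align is used here as one possible way to extract and resize the region sets, and the tensor sizes follow the third (32 * 32 * 4) and fifth (4 * 4 * 3) feature image sets of the example above.

```python
import torch
from torchvision.ops import roi_align

feat3 = torch.randn(1, 4, 32, 32)   # third feature image set: 4 layers of 32x32
feat5 = torch.randn(1, 3, 4, 4)     # fifth feature image set: 3 layers of 4x4
# One candidate region in original-image coordinates (batch_idx, x1, y1, x2, y2),
# assuming the original frame is 128x128 as in the first feature image set.
rois = torch.tensor([[0, 16.0, 16.0, 80.0, 80.0]])

# Unify both region sets to the preset 7x7 size; spatial_scale maps image
# coordinates onto each feature map's resolution.
r3 = roi_align(feat3, rois, output_size=7, spatial_scale=32 / 128)
r5 = roi_align(feat5, rois, output_size=7, spatial_scale=4 / 128)

# Splice the unified region sets along the layer (channel) dimension:
# 4 layers + 3 layers stack into a 7x7x7 set that feeds the fully connected layers.
stacked = torch.cat([r3, r5], dim=1)
print(tuple(stacked.shape))  # (1, 7, 7, 7)
```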
  • the model in this embodiment uses the ResNet model as the underlying feature extraction model, uses the RPN network as the candidate region generation network, and uses an improved processing flow to process the input to-be-processed image.
  • based on the candidate regions generated by the RPN, the model in this embodiment extracts the region sets corresponding to the candidate regions from at least two feature image sets respectively, obtaining at least two region sets. Since the at least two region sets come from different feature image sets, and different feature image sets reflect information of the image to be processed at different scales, the model in this embodiment fully connects information of at least two scales of the image to be processed, and therefore recognizes information of the image to be processed at different scales.
  • by contrast, a model with the standard Faster-RCNN architecture fully connects only the region set extracted from the image to be processed at a single scale, so such a model identifies information at only one scale of the processed image.
  • target identifiers of different sizes may exist in the image to be processed, and the characteristics of target identifiers of different sizes are reflected on feature image sets of different scales.
  • the model in this embodiment can identify information in feature image sets of different scales. Therefore, the model in this embodiment can identify information reflected on feature image sets of different scales. Furthermore, compared with a model using a standard Faster-RCNN architecture, the model in this embodiment has a higher recognition accuracy rate for identifying target identifiers of different sizes.
  • FIG. 1 shows a flowchart of an embodiment of a model training method in the present application.
  • this method embodiment may include:
  • Step 101 Obtain a training set.
  • the training process of the model is introduced below, taking a model used for image recognition as an example.
  • the specific image recognition scene is: identifying whether the BMW brand logo exists in the image.
  • a training set for training the model is obtained, where the training set includes a large number of images marked with the BMW brand logo.
  • the large number of images used to compose the training set can be obtained by searching for images containing the BMW brand logo on search platforms such as Baidu or Google, or on other material websites; images containing the BMW brand logo can also be captured from videos, such as live shows, using screen capture software.
  • of course, other methods can also be used to obtain a large number of images containing the BMW brand logo; this step merely provides two ways of obtaining such images and does not restrict the specific method of obtaining them.
  • for each of the acquired images, the BMW brand logo in the image is labeled. Specifically, as shown in FIG. 2, each BMW brand logo in the image is labeled with a box.
  • Step 102 Train the model using the acquired training set to obtain a first model.
  • the model is trained using the large number of images in the acquired training set. Specifically, images labeled with the BMW brand logo are input into the model, the model uses the improved processing flow to identify and label the BMW brand logo in each input image, the labels in the training set are used as the benchmark, and the parameters in the model are automatically adjusted over multiple iterations; when a certain standard is reached, the first model is obtained (a training sketch follows below).
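  • A minimal training sketch, assuming PyTorch/torchvision and using torchvision's off-the-shelf Faster R-CNN as a stand-in for the improved model described here; the tiny inline dataset, learning rate, and fixed step count are illustrative only.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Two classes: background plus the target identifier (the BMW brand logo).
model = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None, num_classes=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# A tiny illustrative "training set": one image with one labeled logo box.
images = [torch.rand(3, 256, 256)]
targets = [{"boxes": torch.tensor([[30.0, 30.0, 90.0, 90.0]]),
            "labels": torch.tensor([1])}]

model.train()
for step in range(100):                 # "a certain standard" simplified to a step count
    loss_dict = model(images, targets)  # detection losses (RPN + box head)
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```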
  • Step 103 Input a preset number of frames of images to be identified into the first model.
  • a preset number of frames of images to be identified are input into the first model, and for each input frame, the first model recognizes and labels the BMW brand logo contained in that frame.
  • Step 104 Obtain a preset number of frame images that the first model separately recognizes and labels the target identifier.
  • a preset number of frames of images identified by the first model and labeled with the target identifier are obtained.
  • in practical applications, misidentification occurs, and the labeled target identifier may then be wrong. Therefore, among the preset number of frames identified and labeled with the target identifier by the first model, there are labeled symbols that are not the target identifier; for convenience of description, this embodiment collectively refers to such labeled non-target symbols as error symbols.
  • Step 105 Obtain a preset number of frame images with artificially corrected error symbols.
  • Step 106 Input the corrected preset number of frame images into the first model, and train the first model to obtain a trained model.
  • the corrected preset number of frame images are input to a first model, and the first model is further trained.
  • the process of training the first model in this step is the same as the idea of training the model in step 102.
  • the models obtained after training the first model are collectively referred to as a trained model.
  • the first model is obtained after training the model with the training set. Since the images of the training set are collected from search platforms, after training on them the model only learns the target identifiers as they appear in that training set. In practical applications, the images to be identified may contain similar identifiers that resemble the target identifier. To allow the model to better distinguish the target identifier from similar identifiers, in this embodiment a preset number of frames of images to be identified are input into the first model, error symbols exist among the symbols the first model outputs for labeling the target identifier, the error symbols are manually corrected, and the corrected preset number of frame images are used to train the first model again, obtaining the trained model. Compared with the first model, the trained model has improved recognition accuracy for the target identifier in images to be identified. Therefore, the training method of this embodiment can further improve the model's accuracy in identifying the target identifier in images to be identified.
  • after the trained model is obtained, it is applied in this embodiment to a scenario of analyzing the implantation of target identifiers in a video.
  • FIG. 3 shows a flowchart of an embodiment of a method for analyzing a target identifier in a video in the present application.
  • the method embodiment may include:
  • Step 301 Obtain a video to be analyzed.
  • the video to be analyzed obtained in this step may be an encoded video to be analyzed.
  • Step 302 Decode the obtained video to be analyzed to obtain a decoded video to be analyzed.
  • Step 303 For the decoded video to be analyzed, divide the video into multiple image sets according to the order of the video frames, taking each first preset number of frames as one image set.
  • in practical applications, a target logo embedded in a video is generally played continuously for two to three seconds, where the target logo represents a preset type of logo. For example, if the BMW brand logo in the video needs to be analyzed, the BMW brand logo is the target logo.
  • in the decoded video, about 5 frames of images are played per second, so the images embedded with the target identifier generally appear in 10 to 15 consecutive frames. Therefore, in order to analyze the implantation of the target identifier in the video to be analyzed more accurately, in this step the decoded video is divided, in the order of the video frames, into image sets of a first preset number of frames each, where the preset number can be any number from 5 to 7; the decoded video to be analyzed is thus divided into multiple image sets (a grouping sketch follows below).
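  • The grouping described in this step can be sketched as follows, assuming OpenCV for decoding; the video file name and the 5-frame set size are illustrative of the "first preset number" described above.

```python
import cv2

def split_into_image_sets(video_path, frames_per_set=5):
    """Decode a video and group its frames, in playback order, into image sets."""
    capture = cv2.VideoCapture(video_path)
    image_sets, current = [], []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        current.append(frame)
        if len(current) == frames_per_set:
            image_sets.append(current)
            current = []
    if current:                      # keep the trailing partial set, if any
        image_sets.append(current)
    capture.release()
    return image_sets

sets = split_into_image_sets("program.mp4", frames_per_set=5)  # hypothetical file
```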
  • Step 304 Input the images in each image set into the trained model separately, so that the trained model recognizes the target identifiers in the images contained in each image set.
  • the images in each image set are respectively input into the trained model, and the trained model identifies the target identifier in each frame of image. In practical applications, after the trained model identifies the target logo in the video to be analyzed, the identified target logo is labeled; for example, when the trained model recognizes a BMW brand logo, a box can be used to frame the identified BMW brand logo, and an image in which the identified BMW brand logo is framed by a box is output.
  • Step 305 Obtain an image set labeled with a target identifier corresponding to each image set and output by the trained model.
  • for each of the divided image sets, the corresponding image set labeled with the target identifier and output by the trained model is obtained, yielding multiple recognized image sets.
  • Step 306 Detect whether the target identifier marked in each image set meets a preset condition.
  • after the multiple image sets labeled with target identifiers are obtained, this step detects, for each image set, whether the target identifiers labeled in that image set satisfy a preset condition.
  • taking any one image set as an example, how to detect whether the target identifiers labeled in that image set satisfy the preset condition is introduced below.
  • the preset condition may include that the placeholders of the target identifiers distributed in at least two adjacent images at least partially overlap.
  • the placeholder of a target identifier refers to the spatial area occupied by the target identifier in a reference coordinate system.
  • the following uses a specific scenario as an example to introduce whether the identified target identifiers in the image collection meet a preset condition.
  • the specific scene is: the image set includes 5 frames of images, namely the first frame, the second frame, the third frame, the fourth frame, and the fifth frame, and the target logo is the BMW brand logo;
  • the position distribution of the identified target identifiers in the second frame image, the third frame image, the fourth frame image, and the fifth frame image is shown in FIG. 4.
  • two BMW brand logos are identified in the first frame of image, one distributed at the upper left corner of the image and the other at the lower right corner; two BMW brand logos are likewise identified in the second frame of image.
  • the preset condition is that "the placeholders of the target identifiers distributed in at least two adjacent frames of images at least partially overlap". In this scene, the target identifiers distributed in at least two adjacent frames of images are the five BMW brand logos in the first frame image, the second frame image, and the third frame image. It is then determined whether the placeholders of these target logos at least partially overlap: the placeholders of the three BMW brand logos in the lower right corner of the first frame image, the second frame image, and the third frame image overlap. Therefore, the BMW brand identity identified in this image set meets the preset condition.
  • two possible detection results can be obtained: one is that the identified target identifiers in the image set meet the preset condition, and the other is that the identified target identifiers do not meet the preset condition.
  • the preset condition may further include: the overlap ratio between target identifiers whose placeholders at least partially overlap is greater than a preset percentage; and, among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number (a sketch of this check follows below).
  • the value range of the preset percentage may be not less than 50%, and the value range of the preset total number may not be less than 5.
  • this embodiment only provides a preferred value range of the preset percentage and the preset total number.
  • the preset percentage and the preset total number can also be determined based on actual conditions. This embodiment does not limit the specific values of the preset percentage and the preset total number.
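  • One way to realize the check described above is sketched below, assuming Python with OpenCV; the (x1, y1, x2, y2) box format, the intersection-over-union overlap measure, and the variance-of-Laplacian sharpness proxy are implementation choices not fixed by the text, and the check is simplified to a single pair of adjacent frames.

```python
import cv2

def overlap_ratio(a, b):
    """Intersection-over-union of two placeholders given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union) if union else 0.0

def sharpness(patch):
    """Variance of the Laplacian of an identifier patch: a common sharpness proxy."""
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def meets_preset_condition(boxes_a, boxes_b, patches,
                           min_ratio=0.5, min_sharp=100.0, min_count=5):
    """boxes_a, boxes_b: identifier placeholders in two adjacent frames;
    patches: cropped identifier regions from the image set."""
    overlaps = any(overlap_ratio(a, b) > min_ratio
                   for a in boxes_a for b in boxes_b)
    sharp_total = sum(1 for p in patches if sharpness(p) > min_sharp)
    return overlaps and sharp_total > min_count
```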
  • Step 307 Determine the exposure data of the target identifier in the video to be analyzed according to the detection result.
  • the exposure data of the target identifier in the video to be analyzed is determined.
  • the exposure data includes: whether the target identifier is exposed, the exposure position, and the exposure duration. Specifically, in this step, if the detection result is that the identified target identifiers in the image set meet the preset condition, it indicates that the target identifier is exposed in that image set; the spatial position occupied by the at least partially overlapping target identifiers is determined as the exposure position of the target identifier; and, based on the exposure position, the number of frames of consecutive images in which the target identifier exists at that exposure position in the video to be analyzed is counted, and the playing duration of the target identifier is determined from that number of frames.
  • if the target identifier has multiple exposure positions in the video to be analyzed, the playing duration for each exposure position is determined separately, and the sum of the playing durations corresponding to all the exposure positions is taken as the total playing duration of the target identifier, as sketched below.
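  • The duration computation can be sketched as follows, assuming a fixed decoded frame rate; the 5 fps value matches the rate mentioned in step 303, and the frame indices are illustrative.

```python
def exposure_duration(frame_indices, fps=5.0):
    """Playing time, in seconds, of the identifier at one exposure position,
    counting the frames in which it appears there."""
    return len(set(frame_indices)) / fps

# e.g. the identifier appears at one position in frames 10-14 and 30-32:
print(exposure_duration([10, 11, 12, 13, 14, 30, 31, 32]))  # 8 frames -> 1.6 s

# Summing over all exposure positions gives the identifier's total playing time.
positions = {"lower_right": [10, 11, 12, 13, 14], "upper_left": [40, 41, 42]}
total = sum(exposure_duration(f) for f in positions.values())
print(total)  # 1.0 s + 0.6 s = 1.6 s
```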
  • if the detection result is that the identified target identifier in the image set does not meet the preset condition, it indicates that the target identifier is not exposed in that image set; and if the target identifier is not exposed in any image set, it indicates that the target identifier is not exposed in the video to be analyzed, in which case there is no exposure position and no exposure duration.
  • in this embodiment, the target identifier in the video to be analyzed is identified, and whether the identified target identifier satisfies a preset condition is detected; the characteristics of a target identifier satisfying the preset condition specifically match the characteristics the target identifier has when it is exposed in the video. A detection result is obtained by detecting whether the identified target identifier satisfies the preset condition, and the detection result is either that the identified target identifier meets the preset condition or that it does not.
  • therefore, in this embodiment, whether the target logo is exposed can be determined according to the detection result; when the target logo is exposed, the identified target logo meets the preset condition.
  • the preset condition requires that the placeholders of the target identifiers distributed in at least two adjacent frames of images at least partially overlap.
  • the placeholders of the at least partially overlapping target identifiers reflect the position of the exposed target identifier; further, based on the position of the exposed target identifier, the playing duration of the exposure at that position in the video to be analyzed can be determined. Therefore, in the embodiments of the present application, the exposure data of the target identifier in the video to be analyzed can be determined according to the detection result.
  • the apparatus embodiment may include:
  • An obtaining unit 501 configured to obtain a video to be analyzed
  • a first identification unit 502 configured to identify a target identifier in the video to be analyzed
  • a detection unit 503 is configured to detect whether the identified target identifier meets a preset condition, where the preset condition includes: at least part of the placeholders of the target identifiers distributed in at least two frames of adjacent images overlap to obtain a detection result;
  • a determining unit 504 is configured to determine, according to the detection result, the exposure data of the target identifier in the video to be analyzed.
  • the first identification unit 502 may include:
  • a first input subunit configured to input each frame image in the video to be analyzed into a preset model after training, so that the trained preset model identifies a target identifier in each frame of the video to be analyzed ;
  • the preset model includes:
  • a first extraction unit configured to extract multi-scale features of the arbitrary one-frame image to obtain a multi-scale feature image set
  • a generating unit configured to generate a candidate region based on the multi-scale feature image set
  • a selection unit configured to select a feature image set of at least two scales from the multi-scale feature image set
  • a second extraction unit configured to respectively extract region sets corresponding to the candidate region from the feature image sets of the at least two scales, to obtain region sets of at least two scales corresponding to the feature image sets of the at least two scales;
  • the second recognition unit recognizes the target identifier in the arbitrary one-frame image by fully connecting the region sets of at least two scales.
  • the preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
  • the first extraction unit is specifically configured to extract the multi-scale features of the arbitrary frame of images through the underlying feature extraction module to obtain the multi-scale feature image set;
  • the generating unit is specifically configured to input the multi-scale feature image set into the candidate region generating network, and generate the candidate region through the candidate region generating network.
  • the device may further include: a training unit;
  • the training unit is configured to train the preset model to obtain the trained preset model
  • the training unit includes:
  • a first acquisition subunit configured to acquire a training set, where the training set includes: multiple frames of images to which the target identifier has been labeled;
  • a first training subunit configured to train the preset model by using the multi-frame image to obtain a first preset model
  • a second input subunit configured to input an image in the video to be analyzed into the first preset model
  • a second acquisition subunit configured to acquire an image labeled with the target identifier through the first preset model in the video to be analyzed; the image labeled with the target identifier has an incorrect label;
  • a third acquisition subunit configured to acquire a corrected image;
  • the corrected image is: an image that has been manually corrected for the incorrect annotation;
  • a second training subunit is configured to use the modified image to train the first preset model to obtain the trained preset model.
  • the detection unit 503 is further configured to detect whether the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage, and whether, among those target identifiers, the total number whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
  • the determining unit 504 may include:
  • a first determining subunit configured to, if the detection result is that the target identifier meets the preset condition, determine that the target identifier is exposed in the video to be analyzed and further determine an exposure parameter, where the exposure parameter includes at least one of the following: exposure duration, exposure position;
  • a second determining subunit is configured to determine that the target identifier is not exposed in the video to be analyzed if the detection result is that the target identifier does not meet the preset condition.
  • the analysis device for the target identification in the video includes a processor and a memory.
  • the acquisition unit, the first identification unit, the detection unit, the determination unit, and the training unit are all stored in the memory as program units, and the processor executes the above program units stored in the memory to implement the corresponding functions.
  • the processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels can be set, and the exposure data of the target logo in the video is analyzed by adjusting the kernel parameters.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
  • An embodiment of the present invention provides a storage medium on which a program is stored, and the video analysis method is implemented when the program is executed by a processor.
  • An embodiment of the present invention provides a processor, where the processor is configured to run a program, and the video analysis method is executed when the program runs.
  • An embodiment of the present invention provides a device.
  • the device includes a processor, a memory, and a program stored on the memory and executable on the processor.
  • when the processor executes the program, the following steps are implemented:
  • for any one frame of image in the video to be analyzed, the preset model identifies the target identifier in that frame of image according to the following steps:
  • by fully connecting the region sets of the at least two scales, the target identifier in that frame of image is identified.
  • the preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
  • extracting the multi-scale features of the arbitrary frame of images to obtain a multi-scale feature image set includes:
  • the generating candidate regions based on the multi-scale feature image set includes:
  • the multi-scale feature image set is input to the candidate region generation network, and the candidate region is generated by the candidate region generation network.
  • the preset model is trained in the following manner to obtain the trained preset model:
  • the training set includes: a plurality of frames of images to which the target identifier is marked;
  • the corrected image is: an image that has been manually corrected for the incorrect annotation
  • the preset condition includes: the placeholders of the target identifiers distributed in at least two adjacent frames of images at least partially overlap, to obtain a detection result;
  • the preset conditions may further include:
  • the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage; among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
  • the exposure data of the target identifier in the video to be analyzed is determined according to the detection result.
  • if the detection result is that the target identifier meets the preset condition, determining that the target identifier is exposed in the video to be analyzed and further determining an exposure parameter, where the exposure parameter includes at least one of the following: exposure duration, exposure position;
  • if the detection result is that the target identifier does not meet the preset condition, determining that the target identifier is not exposed in the video to be analyzed.
  • the device herein may be a server, a PC, a PAD, a mobile phone, or the like.
  • This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
  • for any one frame of image in the video to be analyzed, the preset model identifies the target identifier in that frame of image according to the following steps:
  • by fully connecting the region sets of the at least two scales, the target identifier in that frame of image is identified.
  • the preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
  • extracting the multi-scale features of the arbitrary frame of images to obtain a multi-scale feature image set includes:
  • the generating candidate regions based on the multi-scale feature image set includes:
  • the multi-scale feature image set is input to the candidate region generation network, and the candidate region is generated by the candidate region generation network.
  • the preset model is trained in the following manner to obtain the trained preset model:
  • the training set includes: a plurality of frames of images to which the target identifier is marked;
  • the corrected image is: an image that has been manually corrected for the incorrect annotation
  • the preset condition includes: the placeholders of the target identifiers distributed in at least two adjacent frames of images at least partially overlap, to obtain a detection result;
  • the preset conditions further include:
  • the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage; among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
  • the exposure data of the target identifier in the video to be analyzed is determined according to the detection result.
  • if the detection result is that the target identifier meets the preset condition, determining that the target identifier is exposed in the video to be analyzed and further determining an exposure parameter, where the exposure parameter includes at least one of the following: exposure duration, exposure position;
  • if the detection result is that the target identifier does not meet the preset condition, determining that the target identifier is not exposed in the video to be analyzed.
  • this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce a computer-implemented process, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • a computing device includes one or more processors (CPUs), input / output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory, random access memory (RAM), and / or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM).
  • Computer-readable media include permanent and non-permanent, removable and non-removable media.
  • Information storage can be accomplished by any method or technology.
  • Information may be computer-readable instructions, data structures, modules of a program, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which may be used to store information that can be accessed by a computing device.
  • As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

Abstract

Disclosed are a video analysis method and apparatus, the method comprising: acquiring a video to be analyzed; identifying target identifiers in the video to be analyzed; detecting whether the identified target identifiers meet a preset condition, wherein the preset condition comprises: the placeholders of target identifiers distributed in at least two adjacent frames of images at least partially overlapping, so as to obtain a detection result; and determining, according to the detection result, the exposure data, in the video to be analyzed, of the target identifiers meeting the preset condition. By means of the embodiments of the present application, the exposure data of the target identifiers in the video to be analyzed can be determined, and human resources can be saved.

Description

Video analysis method and device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 23, 2018, with application number 201810502120.X and the invention name "A Video Analysis Method and Device", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of video processing, and in particular, to a video analysis method and device.
Background Art
At present, program title sponsorship has become an effective channel for advertisers to promote corporate brands. Specifically, advertisers embed corporate brand advertisements in TV programs so that viewers notice the embedded advertisements while watching the programs, thereby publicizing the corporate brand. In practical applications, exposure data such as whether a corporate brand is exposed in a television program, the position of the exposure, and the duration of the exposure all affect the effectiveness of the brand's promotion. Therefore, the exposure data of the corporate brand in TV programs needs to be analyzed, either to find a publicity approach that achieves a better promotional effect for the advertiser's corporate brand, or to analyze the exposure data of competitors' corporate brands.
At present, professionals watch TV programs and analyze the exposure data, in those programs, of the target identifier representing the corporate brand to be analyzed.
However, having professionals analyze the exposure data of the target identifier in TV programs wastes human resources.
Summary of the Invention
In view of the above problems, the present invention is provided so as to provide a video analysis method and device that overcome the above problems or at least partially solve them.
A video analysis method includes:
acquiring a video to be analyzed;
identifying a target identifier in the video to be analyzed;
detecting whether the identified target identifier satisfies a preset condition, the preset condition including: the placeholders of the target identifiers distributed in at least two adjacent frames of images at least partially overlap, to obtain a detection result;
determining, according to the detection result, the exposure data of the target identifier in the video to be analyzed.
Wherein, the identifying the target identifier in the video to be analyzed includes:
inputting each frame of image in the video to be analyzed into a trained preset model, so that the trained preset model recognizes the target identifier in each frame of image in the video to be analyzed;
wherein, for any one frame of image in the video to be analyzed, the preset model identifies the target identifier in that frame of image according to the following steps:
extracting multi-scale features of the frame of image to obtain a multi-scale feature image set;
generating candidate regions based on the multi-scale feature image set;
selecting feature image sets of at least two scales from the multi-scale feature image set;
respectively extracting region sets corresponding to the candidate regions from the feature image sets of the at least two scales, to obtain region sets of at least two scales corresponding to the feature image sets of the at least two scales;
identifying the target identifier in the frame of image by fully connecting the region sets of the at least two scales.
Wherein, the preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
wherein, the extracting multi-scale features of the frame of image to obtain a multi-scale feature image set includes:
extracting the multi-scale features of the frame of image through the low-level feature extraction module to obtain the multi-scale feature image set;
the generating candidate regions based on the multi-scale feature image set includes:
inputting the multi-scale feature image set into the candidate region generation network, and generating the candidate regions through the candidate region generation network.
Wherein, the preset model is trained in the following manner to obtain the trained preset model:
acquiring a training set, the training set including: multiple frames of images in which the target identifier has been labeled;
training the preset model with the multiple frames of images to obtain a first preset model;
inputting images in the video to be analyzed into the first preset model;
acquiring images in the video to be analyzed in which the target identifier has been labeled by the first preset model, the images labeled with the target identifier containing incorrect labels;
acquiring corrected images, the corrected images being: images in which the incorrect labels have been manually corrected;
training the first preset model with the corrected images to obtain the trained preset model.
Wherein, the preset condition further includes:
the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage; among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
Wherein, determining, according to the detection result, the exposure data of the target identifier in the video to be analyzed includes:
if the detection result is that the target identifier meets the preset condition, determining that the target identifier is exposed in the video to be analyzed, and further determining an exposure parameter, where the exposure parameter includes at least one of the following: exposure duration, exposure position;
if the detection result is that the target identifier does not meet the preset condition, determining that the target identifier is not exposed in the video to be analyzed.
一种视频分析装置,包括:A video analysis device includes:
获取单元,用于获取待分析视频;An acquisition unit for acquiring a video to be analyzed;
第一识别单元,用于识别所述待分析视频中的目标标识;A first identification unit, configured to identify a target identifier in the video to be analyzed;
A detection unit, configured to detect whether the recognized target identifiers meet a preset condition to obtain a detection result, where the preset condition includes: the placeholders of target identifiers distributed in at least two adjacent frames at least partially overlap;
A determining unit, configured to determine, according to the detection result, the exposure data of the target identifier in the video to be analyzed.
其中,所述第一识别单元,包括:The first identification unit includes:
第一输入子单元,用于将所述待分析视频中的每帧图像输入训练后的预设模型,使得所述训练后的预设模型识别所述待分析视频中每帧图像中的目标标识;A first input subunit, configured to input each frame image in the video to be analyzed into a preset model after training, so that the trained preset model identifies a target identifier in each frame of the video to be analyzed ;
其中,针对所述待分析视频中的任意一帧图像,所述预设模型包括:For any one frame image in the video to be analyzed, the preset model includes:
第一提取单元,用于提取所述任意一帧图像的多尺度特征,得到多尺度特征图像集合;A first extraction unit, configured to extract multi-scale features of the arbitrary one-frame image to obtain a multi-scale feature image set;
生成单元,用于基于所述多尺度的特征图像集合生成候选区域;A generating unit, configured to generate a candidate region based on the multi-scale feature image set;
选取单元,用于从所述多尺度特征图像集合中选取至少两个尺度的特征图像集合;A selection unit, configured to select a feature image set of at least two scales from the multi-scale feature image set;
A second extraction unit, configured to extract, from the feature image sets of the at least two scales, the region sets corresponding to the candidate region, to obtain region sets of at least two scales corresponding to the feature image sets of the at least two scales;
第二识别单元,通过对所述至少两个尺度的区域集合进行全连接,识别出所述任意一帧图像中的所述目标标识。The second recognition unit recognizes the target identifier in the arbitrary one-frame image by fully connecting the region sets of at least two scales.
其中,所述预设模型为:以Faster-RCNN为架构,所述架构包括底层特征提取模型和候选区域生成网络;The preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
其中,所述第一提取单元,具体用于通过所述底层特征提取模块提取所述任意一帧图像的多尺度特征,得到所述多尺度特征图像集合;The first extraction unit is specifically configured to extract the multi-scale features of the arbitrary one-frame image by using the underlying feature extraction module to obtain the multi-scale feature image set;
所述生成单元,具体用于将所述多尺度的特征图像集合输入所述候选区域生成网络,通过所述候选区域生成网络生成所述侯选区域。The generating unit is specifically configured to input the multi-scale feature image set into the candidate region generating network, and generate the candidate region through the candidate region generating network.
其中,还包括:训练单元;Which also includes: training units;
The training unit is configured to train the preset model to obtain the trained preset model;
其中,所述训练单元,包括:The training unit includes:
第一获取子单元,用于获取训练集;所述训练集包括:已标注出所述目标标识的多帧图像;A first acquisition subunit, configured to acquire a training set, where the training set includes: multiple frames of images to which the target identifier has been labeled;
第一训练子单元,用于采用所述多帧图像对所述预设模型进行训练,得到第一预设模型;A first training subunit, configured to train the preset model by using the multi-frame image to obtain a first preset model;
第二输入子单元,用于将所述待分析视频中的图像输入所述第一预设模型;A second input subunit, configured to input an image in the video to be analyzed into the first preset model;
第二获取子单元,用于获取所述待分析视频中经所述第一预设模型标注出所述目标标识的图像;所述标注出所述目标标识的图像中存在错误标注;A second acquisition subunit, configured to acquire an image labeled with the target identifier through the first preset model in the video to be analyzed; the image labeled with the target identifier has an incorrect label;
第三获取子单元,用于获取修正图像;所述修正图像为:经人工对所述错误标注进行修正后的图像;A third acquisition subunit, configured to acquire a corrected image; the corrected image is: an image that has been manually corrected for the incorrect annotation;
第二训练子单元,用于采用所述修正图像对所述第一预设模型进行训练,得到所述训练后的预设模型。A second training subunit is configured to use the modified image to train the first preset model to obtain the trained preset model.
The detection unit is further configured to detect whether the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage, and whether, among those target identifiers, the total number whose sharpness exceeds a preset sharpness threshold is greater than a preset total number.
其中,确定单元,包括:The determining unit includes:
A first determining subunit, configured to determine, when the detection result is that the target identifier meets the preset condition, that the target identifier is exposed in the video to be analyzed, and to further determine an exposure parameter, where the exposure parameter includes at least one of the following: exposure duration and exposure position;
第二确定子单元,用于在所述检测结果为所述目标标识不满足所述预设条件的情况下,确定所述目标标识未在所述待分析视频中曝光。A second determining subunit is configured to determine that the target identifier is not exposed in the video to be analyzed if the detection result is that the target identifier does not meet the preset condition.
一种存储介质,所述存储介质上存储有程序,所述程序被处理器执行时实现上述任意一项所述的视频分析方法。A storage medium stores a program on the storage medium, and when the program is executed by a processor, the video analysis method according to any one of the foregoing is implemented.
一种处理器,所述处理器用于运行程序,其中,所述程序运行时执行上述任意一项所述的视频分析方法。A processor is configured to run a program, and when the program runs, the video analysis method according to any one of the foregoing is performed.
With the above technical solution, the solution provided by the present invention recognizes the target identifier in the video to be analyzed and detects whether the recognized target identifier meets a preset condition; the characteristics of a target identifier that meets the preset condition match the characteristics a target identifier has when it is exposed in the video. In this embodiment, a detection result is obtained by detecting whether the recognized target identifier meets the preset condition, and the detection result is either that the recognized target identifier meets the preset condition or that it does not; therefore, whether the target identifier is exposed can be determined from the detection result. When the target identifier is exposed, that is, the recognized target identifier meets the preset condition, the preset condition requires that the placeholders of the target identifiers distributed in at least two adjacent frames at least partially overlap, and the placeholders of those overlapping target identifiers reveal the position of the exposed target identifier. Further, from the position of the exposed target identifier, the playback duration of the exposed target identifier at that position in the video can be determined. Therefore, in the embodiments of the present application, the exposure data of the target identifier in the video to be analyzed can be determined from the detection result, and human labor can be saved.
The above description is only an overview of the technical solution of the present invention. It is provided so that the technical means of the present invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the present invention become more apparent; specific embodiments of the present invention are set forth below.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
通过阅读下文优选实施方式的详细描述,各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的,而并不认为是对本发明的限制。而且在整个附图中,用相同的参考符号表示相同的部件。在附图中:Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the detailed description of the preferred embodiments below. The drawings are only for the purpose of illustrating preferred embodiments and are not to be considered as limiting the invention. Moreover, the same reference numerals are used throughout the drawings to refer to the same parts. In the drawings:
图1示出了本申请中一种模型训练方法实施例的流程图;FIG. 1 shows a flowchart of an embodiment of a model training method in the present application;
FIG. 2 shows a schematic diagram in the present application in which each BMW brand logo in an image is marked with a box;
图3示出了本申请中一种视频中目标标识的分析方法实施例的流程图;FIG. 3 shows a flowchart of an embodiment of a method for analyzing target identification in a video in the present application;
图4示出了本申请中一种图像集合包含的图像中所识别出的目标标识的分布示意图;FIG. 4 is a schematic diagram showing a distribution of target identifiers identified in an image included in an image set in the present application; FIG.
图5示出了本申请中一种视频中目标标识的分析装置实施例的结构示意图。FIG. 5 is a schematic structural diagram of an embodiment of an analysis apparatus for target identification in a video in the present application.
DETAILED DESCRIPTION OF THE EMBODIMENTS
下面将参照附图更详细地描述本公开的示例性实施例。虽然附图中显示了本公开的示例性实施例,然而应当理解,可以以各种形式实现本公开而不应被这里阐述的实施例所限制。相反,提供这些实施例是为了能够更透彻地理解本公开,并且能够将本公开的范围完整的传达给本领域的技术人员。Hereinafter, exemplary embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided to enable a thorough understanding of the present disclosure, and to fully convey the scope of the present disclosure to those skilled in the art.
In this embodiment, a model for target recognition is provided. Specifically, it can be applied to scenarios based on target recognition, such as image classification and image segmentation. The model architecture may be the Faster-RCNN architecture, in which a ResNet model serves as the underlying feature extraction model and an RPN serves as the candidate region generation network. The ResNet model includes five parts, namely part 1, part 2, part 3, part 4, and part 5, each of which contains a pooling layer and a convolution layer.
In this embodiment, the processing flow of the model is improved. Taking image recognition as an example, the improvement is as follows. First, the image to be processed is input into the model, and the convolution layers of the different parts of the model output information of the image to be processed at different scales (different scales of an image can be understood as different resolutions). For example, when the image to be processed is of size M*M, the convolution layer of part 1 outputs a first feature image set of size M*M, the convolution layer of part 2 outputs a second feature image set of size M*M, the convolution layer of part 3 outputs a third feature image set of size M/2*M/2, the convolution layer of part 4 outputs a fourth feature image set of size M/4*M/4, and the convolution layer of part 5 outputs a fifth feature image set of size M/8*M/8. Note that each of the first through fifth feature image sets consists of multiple layers of images, and the number of image layers equals the number of convolution kernels in the convolution layer corresponding to that feature image set.
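To make the multi-scale extraction concrete, the following is a minimal PyTorch sketch (an illustration of this description, not code from the patent); torchvision's resnet50 stands in for the ResNet model, with the stem and the four residual stages playing the role of parts 1 through 5, and a recent torchvision version is assumed. Note that the output sizes of a standard torchvision ResNet do not exactly match the M*M, M*M, M/2*M/2, M/4*M/4, M/8*M/8 sizes stated above; the sketch only shows how the five feature image sets are collected.

```python
import torch
import torchvision

class MultiScaleBackbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)
        # Part 1: the stem (conv + pooling); parts 2-5: the four residual stages.
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = torch.nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])

    def forward(self, x):
        feature_sets = []
        x = self.stem(x)
        feature_sets.append(x)          # first feature image set
        for stage in self.stages:
            x = stage(x)
            feature_sets.append(x)      # second to fifth feature image sets
        return feature_sets

backbone = MultiScaleBackbone()
frame = torch.randn(1, 3, 512, 512)     # one M*M video frame (M = 512 here)
for i, f in enumerate(backbone(frame), start=1):
    # the layer count of each set equals the kernel count of its conv layer
    print(f"feature image set {i}: {tuple(f.shape)}")
```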
Next, after parts 1 through 5 output feature image sets at different scales, these feature image sets are input into the RPN, and the RPN generates candidate regions. Then, at least two feature image sets are selected from the five feature image sets, the region sets corresponding to the candidate regions are extracted from each of the selected feature image sets, and the extracted region sets are unified to a preset size in length and width. Finally, the region sets unified to the preset size are stacked along the layer dimension, and the stacked region sets are fully connected to recognize the target to be recognized in the image to be processed. Note that the processing flow above is described with an image recognition scenario as an example; the model in this embodiment can also be used in other scenarios, and this embodiment does not limit the specific scenario of the model.
A concrete example of extracting the region sets, unifying the extracted region sets to a preset size, and stacking the unified region sets along the layer dimension is as follows. Assume the first feature image set is 128*128*3, where 3 is the number of first feature images contained in the set and 128*128 is the size of each first feature image; the second feature image set is 64*64*6, where 6 is the number of second feature images and 64*64 is the size of each second feature image; the third feature image set is 32*32*4, the fourth feature image set is 16*16*2, and the fifth feature image set is 4*4*3, where the parameters of the third, fourth, and fifth feature image sets have the same meaning as those of the first feature image set and are not repeated here.
At least two feature image sets are selected from the first through fifth feature image sets, and the selected feature image sets are resampled so that the images they contain are unified in size, for example, to images of size 7*7.
After the selected feature image sets are unified in size, they are stacked along the layer dimension. Specifically, assume the selected feature image sets are the third feature image set and the fifth feature image set, and the images they contain are unified to a size of 7*7; the two 7*7 feature image sets are then stacked along the layer dimension, and the stacked feature image set is 7*7*7 (the 4 layers of the third set plus the 3 layers of the fifth set).
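The resizing and stacking in this example can be sketched as follows, assuming torchvision's roi_align as the resampling operator (the patent does not name a specific operator) and using the third and fifth feature image sets from the example above:

```python
import torch
from torchvision.ops import roi_align

feat3 = torch.randn(1, 4, 32, 32)   # third feature image set: 32*32*4
feat5 = torch.randn(1, 3, 4, 4)     # fifth feature image set: 4*4*3

# One candidate region in input-image coordinates (batch index, x1, y1, x2, y2),
# assuming a 128*128 input image.
rois = torch.tensor([[0.0, 16.0, 16.0, 80.0, 80.0]])

# Resample the region from each scale to a unified 7*7 size.
r3 = roi_align(feat3, rois, output_size=(7, 7), spatial_scale=32 / 128)
r5 = roi_align(feat5, rois, output_size=(7, 7), spatial_scale=4 / 128)

# Stack along the layer dimension: 4 + 3 = 7 layers of 7*7 regions.
stacked = torch.cat([r3, r5], dim=1)
flattened = stacked.flatten(1)         # fed to the fully connected layers
print(stacked.shape, flattened.shape)  # (1, 7, 7, 7) and (1, 343)
```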
The model in this embodiment uses a ResNet model as the underlying feature extraction model, uses an RPN as the candidate region generation network, and processes the input image with the improved flow: after the RPN generates a candidate region, the region sets corresponding to that candidate region are extracted from at least two feature image sets, yielding at least two region sets. Because the at least two region sets come from different feature image sets, and different feature image sets carry information about the image to be processed at different scales, the model in this embodiment fully connects image information from at least two scales of the image to be processed, so that it recognizes information of the image at different scales. By contrast, a model with the standard Faster-RCNN architecture fully connects only the image region set corresponding to the candidate region extracted from the image to be processed, and therefore recognizes information at only one scale of the image.
In practical applications, target identifiers of different sizes may appear in the image to be processed, and the features of identifiers of different sizes may show up in feature image sets of different scales. Because the model in this embodiment recognizes information in feature image sets of different scales, it achieves, compared with a model with the standard Faster-RCNN architecture, higher recognition accuracy for target identifiers of different sizes.
In this embodiment, the model with the configured architecture is trained. For the specific training process, refer to FIG. 1, which shows a flowchart of an embodiment of a model training method in the present application. The method embodiment may include:
步骤101:获取训练集。Step 101: Obtain a training set.
在本实施例中,还以该模型用于图像识别为例,介绍对该模型的训练过程。具体的图像识别场景为:识别图像中是否存在宝马品牌标识。在本步骤中,获取用于训练该模型的训练集,其中,训练集包括标注出宝马品牌标识的大量图像。In this embodiment, the model is used for image recognition as an example to introduce the training process of the model. The specific image recognition scene is: identifying whether the BMW brand logo exists in the image. In this step, a training set for training the model is obtained, where the training set includes a large number of images marked with the BMW brand logo.
Specifically, the large number of images composing the training set can be obtained as follows: images containing the BMW brand logo are collected from search platforms such as Baidu and Google or from other material websites, or captured from videos such as live broadcasts with screenshot software. Of course, in practical applications, other ways of obtaining a large number of images containing the BMW brand logo may also be used; this step only gives two such ways and does not limit the specific way of obtaining images containing the BMW brand logo.
After a large number of images containing the BMW brand logo are obtained, the BMW brand logo is marked in each acquired frame of image; specifically, as shown in FIG. 2, each BMW brand logo in the image is marked with a box.
步骤102:采用所获取的训练集对模型进行训练,得到第一模型。Step 102: Train the model using the acquired training set to obtain a first model.
After the training set is obtained, in this step the model is trained with the large number of images in the training set. Specifically, the images with the BMW brand logo marked are input into the model; the model recognizes and marks the BMW brand logo in each input image with the improved flow and, taking the marks of the BMW brand logo in the training set as the reference, automatically adjusts its parameters. After the parameters are adjusted many times, when a certain criterion is reached, the first model is obtained.
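As an illustrative sketch of this training step, the following assumes torchvision's built-in Faster R-CNN (an FPN variant, used here only as a stand-in for the modified model of this embodiment) with two classes, background and the BMW brand logo:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Two classes: background and the target logo.
model = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None, num_classes=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()

# One dummy labeled image; in practice these come from the training set
# of images with the BMW brand logo marked by boxes.
images = [torch.rand(3, 480, 640)]
targets = [{
    "boxes": torch.tensor([[100.0, 120.0, 180.0, 200.0]]),  # a marked logo box
    "labels": torch.tensor([1]),                            # 1 = target logo
}]

for step in range(10):                  # iterate until some criterion is reached
    loss_dict = model(images, targets)  # classification and box regression losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```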
步骤103:将预设数量帧待识别图像输入该第一模型。Step 103: Input a preset number of frames of images to be identified into the first model.
After the model is trained with the training set to obtain the first model, in this step a preset number of frames of images to be recognized are input into the first model; for each input frame of image to be recognized, the first model recognizes and marks the BMW brand logos contained in that frame.
步骤104:获取第一模型分别识别并标注出目标标识的预设数目帧图像。Step 104: Obtain a preset number of frame images that the first model separately recognizes and labels the target identifier.
After the first model recognizes and marks the target identifiers in each input frame of image to be recognized, the preset number of frames of images with the target identifiers recognized and marked by the first model are obtained. In practical applications, when the first model recognizes and marks target identifiers in the images to be recognized, misrecognition can occur, in which case the marked target identifiers are also wrong. Therefore, among the preset number of frames of images obtained in this step, there are symbols that mark non-target identifiers; for convenience of description, this embodiment collectively refers to the symbols that mark non-target identifiers as error symbols.
步骤105:获取将人工对错误符号修正后的预设数量帧图像。Step 105: Obtain a preset number of frame images with artificially corrected error symbols.
The error symbols are corrected manually, that is, the error symbols are identified manually and the target identifiers are marked manually. In this step, the preset number of frames of images with the error symbols manually corrected are obtained.
步骤106:将修正后的预设数量帧图像输入第一模型中,对该第一模型进行训练,得到训练后的模型。Step 106: input the corrected preset number of frame images into the first model, and train the first model to obtain a trained model.
After the preset number of manually corrected frames of images are obtained, in this step the corrected frames are input into the first model, and the first model is trained further. The process of training the first model in this step follows the same idea as training the model in step 102; for the specific training process, refer to step 102, which is not repeated here. For convenience of description, this embodiment collectively refers to the model obtained by training the first model as the trained model.
In this embodiment, the first model is obtained after the model is trained with the training set. Because the images of the training set are collected from search platforms, after training with them the model has only learned the target identifiers in that training set. In practice, images to be recognized may contain similar identifiers that resemble the target identifier. To let the model better distinguish the target identifier from similar identifiers, in this embodiment a preset number of frames of images to be recognized are input into the first model; the symbols output by the first model for marking target identifiers contain error symbols; and the preset number of frames with the error symbols manually corrected are used to train the first model again, giving the trained model. Compared with the first model, the trained model recognizes target identifiers in images to be recognized more accurately; therefore, the training method of this embodiment can further improve the model's recognition accuracy for target identifiers in images to be recognized.
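The two-stage procedure of steps 101 through 106 can be summarized in the following high-level sketch, in which the helpers train, predict, and manually_correct are hypothetical placeholders, not functions defined by the patent:

```python
# High-level sketch of the two-stage training described above. The helpers
# train(), predict(), and manually_correct() are hypothetical placeholders.
def two_stage_training(model, training_set, frames_to_recognize):
    first_model = train(model, training_set)          # step 102: initial training

    corrected_frames = []
    for frame in frames_to_recognize:                 # step 103
        labeled = predict(first_model, frame)         # step 104: may contain
                                                      # error symbols
        corrected_frames.append(manually_correct(labeled))  # step 105

    return train(first_model, corrected_frames)       # step 106: retrain
```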
After the trained model is obtained, in this embodiment the trained model is applied to the scenario of analyzing the placement of a target identifier in a video. Specifically, referring to FIG. 3, a flowchart of an embodiment of a method for analyzing a target identifier in a video in the present application is shown; the method embodiment may include:
步骤301:获取待分析视频。Step 301: Obtain a video to be analyzed.
在本步骤中所获取的待分析视频可以为编码后的待分析视频。The video to be analyzed obtained in this step may be an encoded video to be analyzed.
步骤302:对所获取的待分析视频进行解码,得到解码后的待分析视频。Step 302: Decode the obtained video to be analyzed to obtain a decoded video to be analyzed.
步骤303:对于解码后的待分析视频,按照视频帧的先后顺序,以及以第一预设数量帧的图像作为一个图像集合的原则,将解码后的视频划分为多个图像集合。Step 303: For the decoded video to be analyzed, the decoded video is divided into multiple image sets according to the sequence of the video frames and the principle of using the first preset number of frames as an image set.
In this embodiment, a target identifier placed in a video is generally played continuously for two to three seconds, where the target identifier is a preset identifier; for example, if the BMW brand logo in a video needs to be analyzed, the BMW brand logo is the target identifier. In practical applications, roughly 5 frames are played per second, so the images containing the placed target identifier in the decoded video to be analyzed generally appear in 10 to 15 consecutive frames. Therefore, to analyze the placement of the target identifier more accurately, in this step the decoded video to be analyzed is divided, in the order of its video frames, into image sets of a first preset number of frames each, where the preset number can be any number from 5 to 7. The decoded video to be analyzed is thus divided into multiple image sets.
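Steps 302 and 303 can be sketched as follows, assuming OpenCV as the decoding tool (the patent does not prescribe one) and a first preset number of 5 frames per image set:

```python
import cv2

def split_into_image_sets(video_path, frames_per_set=5):
    """Decode a video and group its frames, in order, into image sets."""
    capture = cv2.VideoCapture(video_path)   # decoding happens inside VideoCapture
    image_sets, current_set = [], []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        current_set.append(frame)
        if len(current_set) == frames_per_set:
            image_sets.append(current_set)
            current_set = []
    if current_set:                          # keep a trailing partial set, if any
        image_sets.append(current_set)
    capture.release()
    return image_sets
```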
步骤304:分别将每个图像集合中的图像输入训练后的模型,使得训练后的模型识别每个图像集合所包含的图像中的目标标识。Step 304: Input the images in each image set into the trained models separately, so that the trained models recognize the target identifiers in the images contained in each image set.
After the decoded video to be analyzed is divided into multiple image sets, in this step the images of each image set are input into the trained model, which recognizes the target identifiers in each frame of image. In practical applications, after recognizing a target identifier in the video to be analyzed, the trained model marks the recognized target identifier; for example, when the trained model recognizes a BMW brand logo, it can frame the recognized logo with a box and output the image with the recognized BMW brand logo framed.
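Continuing the training sketch above, per-frame recognition and marking might look as follows; model, frame_tensor (a 3xHxW float tensor in [0, 1]), and frame_bgr (the decoded frame) are assumed variables, and the 0.5 confidence threshold is illustrative:

```python
import cv2
import torch

# `model` is the trained detection model from the training sketch above.
model.eval()
with torch.no_grad():
    detections = model([frame_tensor])[0]   # boxes, labels, and scores per frame

for box, score in zip(detections["boxes"], detections["scores"]):
    if score < 0.5:                         # an assumed confidence threshold
        continue
    x1, y1, x2, y2 = box.int().tolist()
    cv2.rectangle(frame_bgr, (x1, y1), (x2, y2), (0, 255, 0), 2)  # frame the logo
```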
步骤305:获取训练后的模型所输出的与每个图像集合对应的标注出目标标识的图像集合。Step 305: Obtain an image set labeled with a target identifier corresponding to each image set and output by the trained model.
After the trained model outputs the images marked with the preset symbol, an image set marked with the preset symbol is obtained for each of the divided image sets, giving multiple recognized image sets.
步骤306:检测每个图像集合中所标注出的目标标识是否满足预设条件。Step 306: Detect whether the target identifier marked in each image set meets a preset condition.
After the multiple image sets with target identifiers marked are obtained, in this step it is detected, for each image set, whether the target identifiers marked in that image set meet a preset condition. Taking an arbitrary image set as an example, this step describes how to judge whether the target identifiers marked in that image set meet the preset condition.
其中,预设条件可以包括:分布在相邻的至少两帧图像中的目标标识的占位至少部分重叠。其中,目标标识的占位是指该目标标识在一个基准坐标系中所占的空间区域。Wherein, the preset condition may include that the placeholders of the target identifiers distributed in at least two adjacent images at least partially overlap. The placeholder of the target identifier refers to a space area occupied by the target identifier in a reference coordinate system.
The following uses a specific scenario as an example to describe whether the recognized target identifiers in an image set meet the preset condition. The scenario is: the image set includes 5 frames, namely the first through fifth frames, and the target identifier is the BMW brand logo; the positions of the target identifiers recognized in the five frames are distributed as shown in FIG. 4. Specifically, two BMW brand logos are recognized in the first frame, one in the upper left corner and one in the lower right corner; two are recognized in the second frame, one in the upper right corner and one in the lower right corner; one is recognized in the third frame, in the lower right corner; none is recognized in the fourth frame; and one is recognized in the fifth frame, in the lower right corner. The placeholders of the target identifiers in the lower right corner of the first, second, and third frames overlap.
The preset condition is that "the placeholders of target identifiers distributed in at least two adjacent frames at least partially overlap". In this scenario, the target identifiers distributed in at least two adjacent frames are the 5 BMW brand logos distributed in the first, second, and third frames. It is then judged whether the placeholders of target identifiers in at least two adjacent frames at least partially overlap: the 3 BMW brand logos in the lower right corner of the first, second, and third frames overlap. Therefore, the BMW brand logos recognized in this image set meet the preset condition.
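A minimal sketch of this overlap check follows; the box coordinates are illustrative values that mimic the FIG. 4 layout in a shared reference coordinate system:

```python
def overlap_area(a, b):
    """Overlap of two placeholders a, b given as (x1, y1, x2, y2) boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)

def placeholders_overlap_in_adjacent_frames(frames_boxes):
    """frames_boxes: one list of detected logo boxes per frame, in frame order."""
    for previous, current in zip(frames_boxes, frames_boxes[1:]):
        for a in previous:
            if any(overlap_area(a, b) > 0 for b in current):
                return True   # placeholders in two adjacent frames overlap
    return False

# In the FIG. 4 scenario, the lower-right boxes of frames 1-3 overlap:
frames = [
    [(0, 0, 2, 2), (8, 8, 10, 10)],   # frame 1: upper left + lower right
    [(8, 0, 10, 2), (8, 8, 10, 10)],  # frame 2: upper right + lower right
    [(8, 8, 10, 10)],                 # frame 3: lower right
    [],                               # frame 4: none
    [(8, 8, 10, 10)],                 # frame 5: lower right
]
print(placeholders_overlap_in_adjacent_frames(frames))  # True
```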
Detecting whether the recognized target identifiers in an image set meet the preset condition yields one of two detection results: the recognized target identifiers in the image set meet the preset condition, or they do not.
To make the detection result more accurate, in this embodiment the preset condition may further include: the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage; and, among those target identifiers, the total number whose sharpness is greater than a preset sharpness threshold is greater than a preset total number. The preset percentage may be no less than 50%, and the preset total number may be no less than 5.
Note that this embodiment only gives preferred value ranges for the preset percentage and the preset total number; in practical applications, their specific values may also be determined according to the actual situation, and this embodiment does not limit them.
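A sketch of the stricter condition follows. Reading the overlap ratio as intersection-over-union and measuring sharpness by the variance of the Laplacian are both assumptions, since the patent names neither; min_ratio and min_count echo the value ranges above, min_sharpness is illustrative, and overlap_area is the helper from the previous sketch:

```python
import cv2

def overlap_ratio(a, b):
    """Intersection-over-union of two boxes; one reasonable reading of the
    'overlap ratio' above, which the patent does not define precisely."""
    inter = overlap_area(a, b)   # helper from the previous sketch
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def sharpness(gray_patch):
    """Variance of the Laplacian as an assumed sharpness measure."""
    return cv2.Laplacian(gray_patch, cv2.CV_64F).var()

def strict_condition(overlapping_pairs, logo_patches,
                     min_ratio=0.5, min_sharpness=100.0, min_count=5):
    ratio_ok = all(overlap_ratio(a, b) > min_ratio for a, b in overlapping_pairs)
    sharp_count = sum(1 for p in logo_patches if sharpness(p) > min_sharpness)
    return ratio_ok and sharp_count > min_count
```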
步骤307:依据检测结果确定该目标标识在待分析视频中的曝光数据。Step 307: Determine the exposure data of the target identifier in the video to be analyzed according to the detection result.
After the detection result is obtained, in this step the exposure data of the target identifier in the video to be analyzed is determined from the detection result. The exposure data includes whether the identifier is exposed, the exposure position, the exposure duration, and the like. Specifically, if the detection result is that the recognized target identifiers in an image set meet the preset condition, the target identifier is exposed in that image set; the spatial position occupied by the target identifiers whose placeholders at least partially overlap in the at least two adjacent frames is determined as the exposure position of the target identifier; and, based on the exposure position, the number of consecutive frames of images in the video to be analyzed in which the target identifier appears at that exposure position is counted, and the playback duration of the target identifier is determined from that frame count.
Note that, in practical applications, the target identifier may have multiple exposure positions; in that case, the playback duration of the target identifier at each exposure position is determined separately, and the sum of the playback durations over all exposure positions is taken as the total playback duration of the target identifier.
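The conversion from frame counts to playback duration can be sketched as follows; dividing the frame count by the frame rate is an assumed reading of how the frame count determines the duration, and the 5 frames per second matches the figure used earlier in this description:

```python
def duration_at_position(presence_flags, fps):
    """presence_flags[i] is True if the logo occupies this exposure position
    in frame i; the frame count is converted to seconds via the frame rate."""
    return sum(presence_flags) / fps

def total_playback_duration(positions, fps):
    """positions maps each exposure position to its per-frame presence flags;
    the durations of all exposure positions are summed."""
    return sum(duration_at_position(flags, fps) for flags in positions.values())

# e.g. 12 consecutive frames at roughly 5 frames per second -> about 2.4 s
print(total_playback_duration({"lower_right": [True] * 12}, fps=5))
```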
If the detection result is that the recognized target identifiers in an image set do not meet the preset condition, the target identifier is not exposed in that image set; if the target identifier is not exposed in any image set, it is not exposed in the video to be analyzed, in which case there is no exposure position or exposure duration.
In this embodiment, the target identifier in the video to be analyzed is recognized, and it is detected whether the recognized target identifier meets a preset condition; the characteristics of a target identifier that meets the preset condition match the characteristics a target identifier has when it is exposed in the video. A detection result is obtained by detecting whether the recognized target identifier meets the preset condition, and the detection result is either that the recognized target identifier meets the preset condition or that it does not; therefore, whether the target identifier is exposed can be determined from the detection result. When the target identifier is exposed, that is, the recognized target identifier meets the preset condition, the preset condition requires that the placeholders of the target identifiers distributed in at least two adjacent frames at least partially overlap, and the placeholders of those overlapping target identifiers reveal the position of the exposed target identifier. Further, from the position of the exposed target identifier, the playback duration of the exposed target identifier at that position in the video can be determined. Therefore, in the embodiments of the present application, the exposure data of the target identifier in the video to be analyzed can be determined from the detection result.
参考图5,示出了本申请中一种视频中目标标识的分析装置实施例的结构示意图,该装置实施例可以包括:Referring to FIG. 5, a schematic structural diagram of an embodiment of an apparatus for analyzing target identification in a video in the present application is shown. The apparatus embodiment may include:
获取单元501,用于获取待分析视频;An obtaining unit 501, configured to obtain a video to be analyzed;
第一识别单元502,用于识别所述待分析视频中的目标标识;A first identification unit 502, configured to identify a target identifier in the video to be analyzed;
A detection unit 503, configured to detect whether the recognized target identifiers meet a preset condition to obtain a detection result, where the preset condition includes: the placeholders of target identifiers distributed in at least two adjacent frames at least partially overlap;
A determining unit 504, configured to determine, according to the detection result, the exposure data of the target identifier in the video to be analyzed.
其中,第一识别单元502,可以包括:The first identification unit 502 may include:
第一输入子单元,用于将所述待分析视频中的每帧图像输入训练后的预设模型,使得所述训练后的预设模型识别所述待分析视频中每帧图像中的目标标识;A first input subunit, configured to input each frame image in the video to be analyzed into a preset model after training, so that the trained preset model identifies a target identifier in each frame of the video to be analyzed ;
其中,针对所述待分析视频中的任意一帧图像,所述预设模型包括:For any one frame image in the video to be analyzed, the preset model includes:
第一提取单元,用于提取所述任意一帧图像的多尺度特征,得到多尺度特征图像集合;A first extraction unit, configured to extract multi-scale features of the arbitrary one-frame image to obtain a multi-scale feature image set;
生成单元,用于基于所述多尺度的特征图像集合生成候选区域;A generating unit, configured to generate a candidate region based on the multi-scale feature image set;
选取单元,用于从所述多尺度特征图像集合中选取至少两个尺度的特征图像集合;A selection unit, configured to select a feature image set of at least two scales from the multi-scale feature image set;
A second extraction unit, configured to extract, from the feature image sets of the at least two scales, the region sets corresponding to the candidate region, to obtain region sets of at least two scales corresponding to the feature image sets of the at least two scales;
第二识别单元,通过对所述至少两个尺度的区域集合进行全连接,识别出所述任意一帧图像中的所述目标标识。The second recognition unit recognizes the target identifier in the arbitrary one-frame image by fully connecting the region sets of at least two scales.
其中,所述预设模型为:以Faster-RCNN为架构,所述架构包括底层特征提取模型和候选区域生成网络;The preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
其中,所述第一提取单元,具体用于通过所述底层特征提取模块提取 所述任意一帧图像的多尺度特征,得到所述多尺度特征图像集合;Wherein, the first extraction unit is specifically configured to extract the multi-scale features of the arbitrary frame of images through the underlying feature extraction module to obtain the multi-scale feature image set;
所述生成单元,具体用于将所述多尺度的特征图像集合输入所述候选区域生成网络,通过所述候选区域生成网络生成所述侯选区域。The generating unit is specifically configured to input the multi-scale feature image set into the candidate region generating network, and generate the candidate region through the candidate region generating network.
其中,该装置还可以包括:训练单元;The device may further include: a training unit;
该训练单元,用于对所述预设模型进行训练,得到所述训练后的预设模型;The training unit is configured to train the preset model to obtain the trained preset model;
其中,所述训练单元,包括:The training unit includes:
第一获取子单元,用于获取训练集;所述训练集包括:已标注出所述目标标识的多帧图像;A first acquisition subunit, configured to acquire a training set, where the training set includes: multiple frames of images to which the target identifier has been labeled;
第一训练子单元,用于采用所述多帧图像对所述预设模型进行训练,得到第一预设模型;A first training subunit, configured to train the preset model by using the multi-frame image to obtain a first preset model;
第二输入子单元,用于将所述待分析视频中的图像输入所述第一预设模型;A second input subunit, configured to input an image in the video to be analyzed into the first preset model;
第二获取子单元,用于获取所述待分析视频中经所述第一预设模型标注出所述目标标识的图像;所述标注出所述目标标识的图像中存在错误标注;A second acquisition subunit, configured to acquire an image labeled with the target identifier through the first preset model in the video to be analyzed; the image labeled with the target identifier has an incorrect label;
第三获取子单元,用于获取修正图像;所述修正图像为:经人工对所述错误标注进行修正后的图像;A third acquisition subunit, configured to acquire a corrected image; the corrected image is: an image that has been manually corrected for the incorrect annotation;
第二训练子单元,用于采用所述修正图像对所述第一预设模型进行训练,得到所述训练后的预设模型。A second training subunit is configured to use the modified image to train the first preset model to obtain the trained preset model.
The detection unit 503 is further configured to detect whether the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage, and whether, among those target identifiers, the total number whose sharpness exceeds a preset sharpness threshold is greater than a preset total number.
其中,确定单元504,可以包括:The determining unit 504 may include:
A first determining subunit, configured to determine, when the detection result is that the target identifier meets the preset condition, that the target identifier is exposed in the video to be analyzed, and to further determine an exposure parameter, where the exposure parameter includes at least one of the following: exposure duration and exposure position;
第二确定子单元,用于在所述检测结果为所述目标标识不满足所述预设条件的情况下,确定所述目标标识未在所述待分析视频中曝光。A second determining subunit is configured to determine that the target identifier is not exposed in the video to be analyzed if the detection result is that the target identifier does not meet the preset condition.
The apparatus for analyzing a target identifier in a video includes a processor and a memory. The acquisition unit, first recognition unit, detection unit, determining unit, training unit, and so on are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels can be provided, and the exposure data of the target identifier in the video is analyzed by adjusting kernel parameters.
The memory may include a non-persistent memory, a random access memory (RAM), and/or a non-volatile memory among computer-readable media, such as a read-only memory (ROM) or a flash memory (flash RAM); the memory includes at least one memory chip.
本发明实施例提供了一种存储介质,其上存储有程序,该程序被处理器执行时实现所述视频分析方法。An embodiment of the present invention provides a storage medium on which a program is stored, and the video analysis method is implemented when the program is executed by a processor.
本发明实施例提供了一种处理器,所述处理器用于运行程序,其中,所述程序运行时执行所述视频分析方法。An embodiment of the present invention provides a processor, where the processor is configured to run a program, and the video analysis method is executed when the program runs.
本发明实施例提供了一种设备,设备包括处理器、存储器及存储在存储器上并可在处理器上运行的程序,处理器执行程序时实现以下步骤:An embodiment of the present invention provides a device. The device includes a processor, a memory, and a program stored on the memory and executable on the processor. When the processor executes the program, the following steps are implemented:
获取待分析视频;Get the video to be analyzed;
识别所述待分析视频中的目标标识;Identifying a target identifier in the video to be analyzed;
具体的,将所述待分析视频中的每帧图像输入训练后的预设模型,使得所述训练后的预设模型识别所述待分析视频中每帧图像中的目标标识;Specifically, inputting each frame image in the video to be analyzed into a trained preset model, so that the trained preset model recognizes a target identifier in each frame of the video to be analyzed;
其中,针对所述待分析视频中的任意一帧图像,所述预设模型按照以下步骤识别所述任意一帧图像中的所述目标标识:For any one frame image in the video to be analyzed, the preset model identifies the target identifier in the any one frame image according to the following steps:
提取所述任意一帧图像的多尺度特征,得到多尺度特征图像集合;Extracting the multi-scale features of the arbitrary one-frame image to obtain a multi-scale feature image set;
基于所述多尺度的特征图像集合生成候选区域;Generating candidate regions based on the multi-scale feature image set;
从所述多尺度特征图像集合中选取至少两个尺度的特征图像集合;Selecting a feature image set of at least two scales from the multi-scale feature image set;
分别从所述至少两个尺度的特征图像集合中提取所述候选区域对应的区域集合,得到与所述至少两个尺度的特征图像集合对应的至少两个尺度的区域集合;Respectively extracting a region set corresponding to the candidate region from the feature image set of the at least two scales to obtain a region set of at least two scales corresponding to the feature image set of the at least two scales;
通过对所述至少两个尺度的区域集合进行全连接,识别出所述任意一帧图像中的所述目标标识。By fully connecting the region sets of at least two scales, the target identifier in the arbitrary one-frame image is identified.
其中,所述预设模型为:以Faster-RCNN为架构,所述架构包括底层特征提取模型和候选区域生成网络;The preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
其中,所述提取所述任意一帧图像的多尺度特征,得到多尺度特征图像集合,包括:Wherein, extracting the multi-scale features of the arbitrary frame of images to obtain a multi-scale feature image set includes:
通过所述底层特征提取模块提取所述任意一帧图像的多尺度特征,得到所述多尺度特征图像集合;Extracting the multi-scale features of the arbitrary frame image through the underlying feature extraction module to obtain the multi-scale feature image set;
所述基于所述多尺度的特征图像集合生成候选区域,包括:The generating candidate regions based on the multi-scale feature image set includes:
将所述多尺度的特征图像集合输入所述候选区域生成网络,通过所述候选区域生成网络生成所述侯选区域。The multi-scale feature image set is input to the candidate region generation network, and the candidate region is generated by the candidate region generation network.
其中,通过以下方式对所述预设模型进行训练,得到所述训练后的预设模型:The preset model is trained in the following manner to obtain the trained preset model:
获取训练集;所述训练集包括:已标注出所述目标标识的多帧图像;Acquiring a training set; the training set includes: a plurality of frames of images to which the target identifier is marked;
采用所述多帧图像对所述预设模型进行训练,得到第一预设模型;Using the multi-frame image to train the preset model to obtain a first preset model;
将所述待分析视频中的图像输入所述第一预设模型;Inputting an image in the video to be analyzed into the first preset model;
获取所述待分析视频中经所述第一预设模型标注出所述目标标识的图像;所述标注出所述目标标识的图像中存在错误标注;Acquiring an image labeled with the target identifier in the video to be analyzed through the first preset model; the image labeled with the target identifier has an incorrect label;
获取修正图像;所述修正图像为:经人工对所述错误标注进行修正后的图像;Obtaining a corrected image; the corrected image is: an image that has been manually corrected for the incorrect annotation;
采用所述修正图像对所述第一预设模型进行训练,得到所述训练后的预设模型。Training the first preset model by using the modified image to obtain the trained preset model.
Detecting whether the recognized target identifiers meet a preset condition to obtain a detection result, where the preset condition includes: the placeholders of target identifiers distributed in at least two adjacent frames at least partially overlap;
其中,预设条件还可以包括:The preset conditions may further include:
The overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage; and, among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
Determining, according to the detection result, the exposure data of the target identifier in the video to be analyzed.
Specifically, when the detection result is that the target identifier meets the preset condition, determining that the target identifier is exposed in the video to be analyzed, and further determining an exposure parameter, where the exposure parameter includes at least one of the following: exposure duration and exposure position;
在所述检测结果为所述目标标识不满足所述预设条件的情况下,确定所述目标标识未在所述待分析视频中曝光。When the detection result is that the target identifier does not satisfy the preset condition, it is determined that the target identifier is not exposed in the video to be analyzed.
The device herein may be a server, a PC, a PAD, a mobile phone, or the like.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
获取待分析视频;Get the video to be analyzed;
识别所述待分析视频中的目标标识;Identifying a target identifier in the video to be analyzed;
具体的,将所述待分析视频中的每帧图像输入训练后的预设模型,使得所述训练后的预设模型识别所述待分析视频中每帧图像中的目标标识;Specifically, inputting each frame image in the video to be analyzed into a trained preset model, so that the trained preset model recognizes a target identifier in each frame of the video to be analyzed;
其中,针对所述待分析视频中的任意一帧图像,所述预设模型按照以下步骤识别所述任意一帧图像中的所述目标标识:For any one frame image in the video to be analyzed, the preset model identifies the target identifier in the any one frame image according to the following steps:
提取所述任意一帧图像的多尺度特征,得到多尺度特征图像集合;Extracting the multi-scale features of the arbitrary one-frame image to obtain a multi-scale feature image set;
基于所述多尺度的特征图像集合生成候选区域;Generating candidate regions based on the multi-scale feature image set;
从所述多尺度特征图像集合中选取至少两个尺度的特征图像集合;Selecting a feature image set of at least two scales from the multi-scale feature image set;
分别从所述至少两个尺度的特征图像集合中提取所述候选区域对应的区域集合,得到与所述至少两个尺度的特征图像集合对应的至少两个尺度的区域集合;Respectively extracting a region set corresponding to the candidate region from the feature image set of the at least two scales to obtain a region set of at least two scales corresponding to the feature image set of the at least two scales;
通过对所述至少两个尺度的区域集合进行全连接,识别出所述任意一帧图像中的所述目标标识。By fully connecting the region sets of at least two scales, the target identifier in the arbitrary one-frame image is identified.
其中,所述预设模型为:以Faster-RCNN为架构,所述架构包括底层特征提取模型和候选区域生成网络;The preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
其中,所述提取所述任意一帧图像的多尺度特征,得到多尺度特征图像集合,包括:Wherein, extracting the multi-scale features of the arbitrary frame of images to obtain a multi-scale feature image set includes:
通过所述底层特征提取模块提取所述任意一帧图像的多尺度特征,得到所述多尺度特征图像集合;Extracting the multi-scale features of the arbitrary frame image through the underlying feature extraction module to obtain the multi-scale feature image set;
所述基于所述多尺度的特征图像集合生成候选区域,包括:The generating candidate regions based on the multi-scale feature image set includes:
将所述多尺度的特征图像集合输入所述候选区域生成网络,通过所述候选区域生成网络生成所述侯选区域。The multi-scale feature image set is input to the candidate region generation network, and the candidate region is generated by the candidate region generation network.
The preset model is trained as follows to obtain the trained preset model (a schematic of this two-stage procedure is sketched after the list):

obtaining a training set, the training set including multiple frames of images in which the target identifier has been annotated;

training the preset model on the multiple frames of images to obtain a first preset model;

inputting images from the video to be analyzed into the first preset model;

obtaining images of the video to be analyzed in which the first preset model has annotated the target identifier, where some of these annotations are erroneous;

obtaining corrected images, the corrected images being images in which the erroneous annotations have been manually corrected;

training the first preset model on the corrected images to obtain the trained preset model.
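Schematically, the procedure might look as follows; train_model, predict_and_annotate, and manual_review are hypothetical placeholders standing in for ordinary detector training, inference, and the human correction step, none of which the application ties to a particular implementation.

```python
def bootstrap_training(preset_model, training_set, video_frames):
    """Two-stage training: a hand-annotated training set first, then frames
    from the video to be analyzed with manually corrected machine labels."""
    # Stage 1: train on frames where the target identifier is hand-annotated.
    first_model = train_model(preset_model, training_set)

    # Stage 2: let the first model annotate frames of the video itself;
    # some of these machine annotations will be erroneous.
    machine_annotated = [(frame, predict_and_annotate(first_model, frame))
                         for frame in video_frames]

    # A human reviewer corrects the erroneous annotations.
    corrected_images = manual_review(machine_annotated)

    # Stage 3: continue training on the corrected, in-domain images.
    return train_model(first_model, corrected_images)
```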
detecting whether the identified target identifier satisfies a preset condition to obtain a detection result, the preset condition including: the footprints of the target identifier distributed across at least two adjacent frames of images at least partially overlap;

The preset condition further includes:

the overlap ratio between the target identifiers whose footprints at least partially overlap is greater than a preset percentage; and, among the target identifiers whose footprints at least partially overlap, the total number of target identifiers whose sharpness exceeds a preset sharpness threshold is greater than a preset total. (One way such a check might be implemented is sketched below.)
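A condition of this shape might be checked as in the following sketch, which uses intersection-over-union for the overlap ratio and the variance of the Laplacian as a sharpness proxy; both choices, and all threshold values, are assumptions of the example rather than limitations of the application.

```python
import cv2

def overlap_ratio(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) footprints."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def sharpness(frame, box):
    """Variance of the Laplacian over the box: one common sharpness proxy."""
    x1, y1, x2, y2 = map(int, box)
    patch = cv2.cvtColor(frame[y1:y2, x1:x2], cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(patch, cv2.CV_64F).var()

def satisfies_preset_condition(frames, boxes, preset_percentage=0.5,
                               sharpness_threshold=100.0, preset_total=1):
    """frames/boxes: target-identifier detections in adjacent frames."""
    # Footprints in adjacent frames must overlap by more than the percentage.
    overlaps = [overlap_ratio(a, b) for a, b in zip(boxes, boxes[1:])]
    if not overlaps or min(overlaps) <= preset_percentage:
        return False
    # More than preset_total of the overlapping identifiers must be sharp.
    sharp = sum(sharpness(f, b) > sharpness_threshold
                for f, b in zip(frames, boxes))
    return sharp > preset_total
```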
determining, according to the detection result, exposure data of the target identifier in the video to be analyzed.

Specifically, when the detection result indicates that the target identifier satisfies the preset condition, it is determined that the target identifier is exposed in the video to be analyzed, and exposure parameters are further determined, the exposure parameters including at least one of exposure duration and exposure position;

when the detection result indicates that the target identifier does not satisfy the preset condition, it is determined that the target identifier has not been exposed in the video to be analyzed.
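To make the exposure data concrete, the sketch below derives an exposure duration from the count of qualifying frames and the frame rate, and an exposure position from the mean footprint centre; both derivations are merely illustrative assumptions.

```python
def exposure_parameters(per_frame_boxes, fps=25.0):
    """per_frame_boxes: one entry per frame, either None or the
    (x1, y1, x2, y2) footprint of a target identifier that passed the
    preset condition. Returns None when the identifier was never exposed."""
    exposed = [box for box in per_frame_boxes if box is not None]
    if not exposed:
        return None  # target identifier not exposed in the video

    duration = len(exposed) / fps  # exposure duration in seconds
    # Exposure position summarized as the mean footprint centre.
    cx = sum((b[0] + b[2]) / 2 for b in exposed) / len(exposed)
    cy = sum((b[1] + b[3]) / 2 for b in exposed) / len(exposed)
    return {"exposure_duration_s": duration, "exposure_position": (cx, cy)}

# Example: identifier visible in 3 of 5 frames of a 25 fps clip.
print(exposure_parameters(
    [None, (10, 10, 60, 40), (12, 11, 62, 41), (13, 12, 63, 42), None]))
```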
Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent storage in a computer-readable medium, in the form of random-access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Absent further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application admits various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of its claims.

Claims (10)

  1. A video analysis method, comprising:
    obtaining a video to be analyzed;
    identifying a target identifier in the video to be analyzed;
    detecting whether the identified target identifier satisfies a preset condition to obtain a detection result, the preset condition comprising: the footprints of the target identifier distributed across at least two adjacent frames of images at least partially overlap; and
    determining, according to the detection result, exposure data of the target identifier in the video to be analyzed.
  2. The method according to claim 1, wherein identifying the target identifier in the video to be analyzed comprises:
    inputting each frame of the video to be analyzed into a trained preset model, so that the trained preset model identifies the target identifier in each frame of the video to be analyzed;
    wherein, for any frame of the video to be analyzed, the preset model identifies the target identifier in that frame according to the following steps:
    extracting multi-scale features of the frame to obtain a multi-scale feature image set;
    generating candidate regions based on the multi-scale feature image set;
    selecting feature image sets of at least two scales from the multi-scale feature image set;
    extracting, from each of the feature image sets of the at least two scales, the region set corresponding to the candidate regions, to obtain region sets of at least two scales corresponding to the feature image sets of the at least two scales; and
    identifying the target identifier in the frame by fully connecting the region sets of the at least two scales.
  3. The method according to claim 2, wherein the preset model is built on the Faster R-CNN architecture, the architecture comprising a low-level feature extraction module and a candidate region generation network;
    wherein extracting the multi-scale features of the frame to obtain the multi-scale feature image set comprises:
    extracting the multi-scale features of the frame through the low-level feature extraction module to obtain the multi-scale feature image set; and
    generating the candidate regions based on the multi-scale feature image set comprises:
    inputting the multi-scale feature image set into the candidate region generation network, and generating the candidate regions through the candidate region generation network.
  4. The method according to claim 2, wherein the preset model is trained as follows to obtain the trained preset model:
    obtaining a training set, the training set comprising multiple frames of images in which the target identifier has been annotated;
    training the preset model on the multiple frames of images to obtain a first preset model;
    inputting images from the video to be analyzed into the first preset model;
    obtaining images of the video to be analyzed in which the first preset model has annotated the target identifier, wherein some of the annotations are erroneous;
    obtaining corrected images, the corrected images being images in which the erroneous annotations have been manually corrected; and
    training the first preset model on the corrected images to obtain the trained preset model.
  5. The method according to claim 1, wherein the preset condition further comprises:
    the overlap ratio between the target identifiers whose footprints at least partially overlap is greater than a preset percentage; and, among the target identifiers whose footprints at least partially overlap, the total number of target identifiers whose sharpness exceeds a preset sharpness threshold is greater than a preset total.
  6. The method according to claim 1, wherein determining, according to the detection result, the exposure data of the target identifier in the video to be analyzed comprises:
    when the detection result indicates that the target identifier satisfies the preset condition, determining that the target identifier is exposed in the video to be analyzed, and further determining exposure parameters, the exposure parameters comprising at least one of exposure duration and exposure position; and
    when the detection result indicates that the target identifier does not satisfy the preset condition, determining that the target identifier is not exposed in the video to be analyzed.
  7. A video analysis apparatus, comprising:
    an acquisition unit, configured to obtain a video to be analyzed;
    a first identification unit, configured to identify a target identifier in the video to be analyzed;
    a detection unit, configured to detect whether the identified target identifier satisfies a preset condition to obtain a detection result, the preset condition comprising: the footprints of the target identifier distributed across at least two adjacent frames of images at least partially overlap; and
    a determination unit, configured to determine, according to the detection result, exposure data of the target identifier in the video to be analyzed.
  8. The apparatus according to claim 7, wherein the first identification unit comprises:
    a first input subunit, configured to input each frame of the video to be analyzed into a trained preset model, so that the trained preset model identifies the target identifier in each frame of the video to be analyzed;
    wherein, for any frame of the video to be analyzed, the preset model comprises:
    a first extraction unit, configured to extract multi-scale features of the frame to obtain a multi-scale feature image set;
    a generation unit, configured to generate candidate regions based on the multi-scale feature image set;
    a selection unit, configured to select feature image sets of at least two scales from the multi-scale feature image set;
    a second extraction unit, configured to extract, from each of the feature image sets of the at least two scales, the region set corresponding to the candidate regions, to obtain region sets of at least two scales corresponding to the feature image sets of the at least two scales; and
    a second identification unit, configured to identify the target identifier in the frame by fully connecting the region sets of the at least two scales.
  9. A storage medium, wherein a program is stored on the storage medium, and the program, when executed by a processor, implements the video analysis method according to any one of claims 1 to 6.
  10. A processor, wherein the processor is configured to run a program, and the program, when run, performs the video analysis method according to any one of claims 1 to 6.
PCT/CN2019/073661 2018-05-23 2019-01-29 Video analysis method and apparatus WO2019223361A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810502120.XA CN110532833A (en) 2018-05-23 2018-05-23 A kind of video analysis method and device
CN201810502120.X 2018-05-23

Publications (1)

Publication Number Publication Date
WO2019223361A1 true WO2019223361A1 (en) 2019-11-28

Family

ID=68616536

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/073661 WO2019223361A1 (en) 2018-05-23 2019-01-29 Video analysis method and apparatus

Country Status (2)

Country Link
CN (1) CN110532833A (en)
WO (1) WO2019223361A1 (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113496230A (en) * 2020-03-18 2021-10-12 中国电信股份有限公司 Image matching method and system
CN111556337B (en) * 2020-05-15 2021-09-21 腾讯科技(深圳)有限公司 Media content implantation method, model training method and related device
CN113573043B (en) * 2021-01-18 2022-11-08 腾讯科技(深圳)有限公司 Video noise point identification method, storage medium and equipment


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101777124A (en) * 2010-01-29 2010-07-14 北京新岸线网络技术有限公司 Method for extracting video text message and device thereof
CN102567982A (en) * 2010-12-24 2012-07-11 浪潮乐金数字移动通信有限公司 Extraction system and method for specific information of video frequency program and mobile terminal
CN107197269B (en) * 2017-07-04 2020-02-21 广东工业大学 Video splicing method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020056124A1 (en) * 2000-03-15 2002-05-09 Cameron Hay Method of measuring brand exposure and apparatus therefor
CN105163127A (en) * 2015-09-07 2015-12-16 浙江宇视科技有限公司 Video analysis method and device
CN107122773A (en) * 2017-07-05 2017-09-01 司马大大(北京)智能系统有限公司 A kind of video commercial detection method, device and equipment
CN107679250A (en) * 2017-11-01 2018-02-09 浙江工业大学 A kind of multitask layered image search method based on depth own coding convolutional neural networks
CN107944409A (en) * 2017-11-30 2018-04-20 清华大学 video analysis method and device

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062527A (en) * 2019-12-10 2020-04-24 北京爱奇艺科技有限公司 Video collection flow prediction method and device
CN111062527B (en) * 2019-12-10 2023-12-05 北京爱奇艺科技有限公司 Video traffic collection prediction method and device
CN111027510A (en) * 2019-12-23 2020-04-17 上海商汤智能科技有限公司 Behavior detection method and device and storage medium
CN111046849A (en) * 2019-12-30 2020-04-21 珠海格力电器股份有限公司 Kitchen safety implementation method and device, intelligent terminal and storage medium
CN111310695A (en) * 2020-02-26 2020-06-19 酷黑科技(北京)有限公司 Forced landing method and device and electronic equipment
CN111310695B (en) * 2020-02-26 2023-11-24 酷黑科技(北京)有限公司 Forced landing method and device and electronic equipment
CN111950424A (en) * 2020-08-06 2020-11-17 腾讯科技(深圳)有限公司 Video data processing method and device, computer and readable storage medium
CN112055249A (en) * 2020-09-17 2020-12-08 京东方科技集团股份有限公司 Video frame interpolation method and device
CN113312951B (en) * 2020-10-30 2023-11-07 阿里巴巴集团控股有限公司 Dynamic video target tracking system, related method, device and equipment
CN113312951A (en) * 2020-10-30 2021-08-27 阿里巴巴集团控股有限公司 Dynamic video target tracking system, related method, device and equipment
CN112989934A (en) * 2021-02-05 2021-06-18 方战领 Video analysis method, device and system
CN113191293A (en) * 2021-05-11 2021-07-30 创新奇智(重庆)科技有限公司 Advertisement detection method, device, electronic equipment, system and readable storage medium
CN113825013B (en) * 2021-07-30 2023-11-14 腾讯科技(深圳)有限公司 Image display method and device, storage medium and electronic equipment
CN113825013A (en) * 2021-07-30 2021-12-21 腾讯科技(深圳)有限公司 Image display method and apparatus, storage medium, and electronic device
CN114095722A (en) * 2021-10-08 2022-02-25 钉钉(中国)信息技术有限公司 Definition determining method, device and equipment

Also Published As

Publication number Publication date
CN110532833A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
WO2019223361A1 (en) Video analysis method and apparatus
CN109740670B (en) Video classification method and device
CN109117848B (en) Text line character recognition method, device, medium and electronic equipment
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
CN106649316B (en) Video pushing method and device
CN110827247B (en) Label identification method and device
TW201834463A (en) Recommendation method and apparatus for video data
US8879894B2 (en) Pixel analysis and frame alignment for background frames
Yang et al. Lecture video indexing and analysis using video ocr technology
US20150248592A1 (en) Method and device for identifying target object in image
WO2019062388A1 (en) Advertisement effect analysis method and device
CN110827292B (en) Video instance segmentation method and device based on convolutional neural network
CN111147891A (en) Method, device and equipment for acquiring information of object in video picture
CN111160134A (en) Human-subject video scene analysis method and device
Nguyen et al. Semantic prior analysis for salient object detection
US20110216939A1 (en) Apparatus and method for tracking target
CN111836118B (en) Video processing method, device, server and storage medium
CN111798543A (en) Model training method, data processing method, device, equipment and storage medium
CN111541939B (en) Video splitting method and device, electronic equipment and storage medium
CN108229285B (en) Object classification method, object classifier training method and device and electronic equipment
CN113923504B (en) Video preview moving picture generation method and device
CN112348566A (en) Method and device for determining recommended advertisements and storage medium
KR20110087620A (en) Layout based page recognition method for printed medium
CN110019951B (en) Method and equipment for generating video thumbnail
CN111798542B (en) Model training method, data processing device, model training apparatus, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19806421

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19806421

Country of ref document: EP

Kind code of ref document: A1