WO2019223361A1 - Video analysis method and apparatus - Google Patents

Video analysis method and apparatus

Info

Publication number
WO2019223361A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
analyzed
image
target identifier
preset
Prior art date
Application number
PCT/CN2019/073661
Other languages
English (en)
Chinese (zh)
Inventor
戴威
Original Assignee
北京国双科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京国双科技有限公司
Publication of WO2019223361A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Definitions

  • the present application relates to the field of video processing, and in particular, to a video analysis method and device.
  • TV programs have become an effective channel for advertisers to promote corporate brands.
  • Advertisers embed corporate brand advertisements in TV programs so that viewers notice the embedded advertisements while watching, thereby promoting the corporate brand.
  • Exposure data, such as whether a corporate brand is exposed in a TV program, the position of the exposure, and the duration of the exposure, affect the effectiveness of the brand promotion. It is therefore necessary to analyze the exposure data of a corporate brand in TV programs, whether to find a publicity approach that better promotes the advertiser's brand or to analyze the exposure data of a competitor's brand.
  • The present invention provides a video analysis method and device that overcome the above problems or at least partially solve them.
  • a video analysis method includes:
  • detecting whether the identified target identifier meets a preset condition to obtain a detection result, where the preset condition includes: the placeholders of the target identifiers distributed in at least two frames of adjacent images at least partially overlap;
  • the separately identifying target identifiers in the videos to be analyzed includes:
  • the preset model identifies the target identifier in any one frame of image according to the following steps:
  • the target identifier in that frame of image is thereby identified.
  • the preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
  • extracting the multi-scale features of the arbitrary frame of images to obtain a multi-scale feature image set includes:
  • the generating candidate regions based on the multi-scale feature image set includes:
  • the multi-scale feature image set is input to the candidate region generation network, and the candidate region is generated by the candidate region generation network.
  • the preset model is trained in the following manner to obtain the trained preset model:
  • the training set includes: a plurality of frames of images to which the target identifier is marked;
  • the corrected image is: an image that has been manually corrected for the incorrect annotation
  • the preset conditions further include:
  • the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage; and, among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
  • determining, according to the detection result, the exposure data of the target identifier that satisfies the preset condition in the video to be analyzed includes:
  • if the detection result is that the target identifier meets the preset condition, determining that the target identifier is exposed in the video to be analyzed and further determining an exposure parameter, wherein the exposure parameter includes at least one of the following: exposure duration, exposure position;
  • if the detection result is that the target identifier does not satisfy the preset condition, it is determined that the target identifier is not exposed in the video to be analyzed.
  • a video analysis device includes:
  • a first identification unit configured to identify a target identifier in the video to be analyzed
  • a detection unit configured to detect whether the identified target identifier meets a preset condition to obtain a detection result, where the preset condition includes: the placeholders of the target identifiers distributed in at least two frames of adjacent images at least partially overlap;
  • a determining unit configured to determine, according to the detection result, the exposure data of the target identifier that satisfies the preset condition in the video to be analyzed.
  • the first identification unit includes:
  • a first input subunit configured to input each frame of image in the video to be analyzed into a trained preset model, so that the trained preset model identifies the target identifier in each frame of the video to be analyzed;
  • the preset model includes:
  • a first extraction unit configured to extract multi-scale features of any one frame of image to obtain a multi-scale feature image set;
  • a generating unit configured to generate candidate regions based on the multi-scale feature image set;
  • a selection unit configured to select feature image sets of at least two scales from the multi-scale feature image set;
  • a second extraction unit configured to respectively extract region sets corresponding to the candidate regions from the feature image sets of the at least two scales, to obtain region sets of at least two scales corresponding to the feature image sets of the at least two scales;
  • a second recognition unit configured to recognize the target identifier in the frame of image by fully connecting the region sets of the at least two scales.
  • the preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
  • the first extraction unit is specifically configured to extract the multi-scale features of any one frame of image through the low-level feature extraction module to obtain the multi-scale feature image set;
  • the generating unit is specifically configured to input the multi-scale feature image set into the candidate region generating network, and generate the candidate region through the candidate region generating network.
  • the training unit is configured to train the preset model to obtain the trained preset model
  • the training unit includes:
  • a first acquisition subunit configured to acquire a training set, where the training set includes: multiple frames of images to which the target identifier has been labeled;
  • a first training subunit configured to train the preset model by using the multi-frame image to obtain a first preset model
  • a second input subunit configured to input an image in the video to be analyzed into the first preset model
  • a second acquisition subunit configured to acquire an image labeled with the target identifier through the first preset model in the video to be analyzed; the image labeled with the target identifier has an incorrect label;
  • a third acquisition subunit configured to acquire a corrected image;
  • the corrected image is: an image that has been manually corrected for the incorrect annotation;
  • a second training subunit is configured to use the modified image to train the first preset model to obtain the trained preset model.
  • the detection unit is further configured to detect whether the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage, and whether, among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
  • the determining unit includes:
  • a first determining subunit configured to determine, if the detection result is that the target identifier meets the preset condition, that the target identifier is exposed in the video to be analyzed, and to further determine an exposure parameter, where
  • the exposure parameter includes at least one of the following: exposure duration and exposure position;
  • a second determining subunit is configured to determine that the target identifier is not exposed in the video to be analyzed if the detection result is that the target identifier does not meet the preset condition.
  • a storage medium stores a program, and when the program is executed by a processor, the video analysis method according to any one of the foregoing is implemented.
  • a processor is configured to run a program, and when the program runs, the video analysis method according to any one of the foregoing is performed.
  • The technical solution provided by the present invention identifies the target identifier in the video to be analyzed and detects whether the identified target identifier meets a preset condition; the characteristics of a target identifier that meets the preset condition specifically match the characteristics a target identifier has when it is exposed in the video. Detecting whether the identified target identifier satisfies the preset condition yields a detection result, which is either that the identified target identifier satisfies the preset condition or that it does not. Whether the target identifier is exposed can therefore be determined from the detection result: when the target identifier is exposed, the identified target identifier meets the preset condition, which requires that the placeholders of the target identifiers distributed in at least two frames of adjacent images at least partially overlap.
  • The placeholders of the at least partially overlapping target identifiers reflect the positions of the exposed target identifiers. Further, according to the exposure position, the playback duration of the exposed target identifier at that position in the video to be analyzed can be determined. Accordingly, in the embodiments of the present application, the exposure data of the target identifier in the video to be analyzed can be determined from the detection result, which saves manpower.
  • FIG. 1 shows a flowchart of an embodiment of a model training method in the present application
  • FIG. 2 shows a schematic diagram of labeling each BMW brand logo in an image by using a box in the present application
  • FIG. 3 shows a flowchart of an embodiment of a method for analyzing target identification in a video in the present application
  • FIG. 4 is a schematic diagram showing a distribution of target identifiers identified in an image included in an image set in the present application
  • FIG. 5 is a schematic structural diagram of an embodiment of an analysis apparatus for target identification in a video in the present application.
  • In this embodiment, a model for target recognition is provided; it can be applied to scenarios based on target recognition, for example image classification and image segmentation.
  • the model architecture can be a Faster-RCNN architecture.
  • the ResNet model is used as the underlying feature extraction model, and the RPN network is used as the candidate region generation network.
  • the ResNet model includes five parts, which are part 1, part 2, part 3, part 4 and part 5, each of which includes a pooling layer and a convolution layer.
  • In this embodiment, the processing flow of the model is improved. Taking the model used for image recognition as an example, the specific improvement of the processing flow is introduced below.
  • the image to be processed is input into the model, and the convolutional layers in different parts of the model output information of different scales of the image to be processed (different scales of the image can be understood as different resolutions). For example, when the size of the image to be processed is M * M, the convolutional layer of part 1 outputs a first feature image set of size M * M, and the convolutional layer of part 2 outputs a second feature image set of size M * M.
  • the convolutional layer of part 3 outputs a third feature image set of size M / 2 * M / 2
  • the convolutional layer of part 4 outputs a fourth feature image set of size M / 4 * M / 4
  • the convolutional layer of part 5 outputs a fifth feature image set of size M / 8 * M / 8.
  • the first feature image set, the second feature image set, the third feature image set, the fourth feature image set, and the fifth feature image set are all composed of multiple layers of images.
  • the specific number of image layers in each feature image set is the same as the number of convolution kernels in the corresponding convolutional layer. A sketch of this multi-scale extraction is given below.
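  • The following is a minimal sketch of this multi-scale feature extraction, assuming PyTorch and torchvision; it is an illustration, not the patent's implementation, and the mapping of ResNet layers to the five "parts" is an assumption based on the description above.

```python
import torch
from torchvision.models import resnet50

backbone = resnet50(weights=None)
backbone.eval()

def multi_scale_features(image):
    """Return feature image sets of several scales for one frame of image.

    `image` is a (1, 3, M, M) tensor. Each returned feature image set is
    composed of multiple layers, and the number of layers equals the number
    of convolution kernels in the corresponding convolutional layer.
    """
    x = backbone.conv1(image)      # part 1
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    c2 = backbone.layer1(x)        # part 2
    c3 = backbone.layer2(c2)       # part 3: half the resolution of part 2
    c4 = backbone.layer3(c3)       # part 4
    c5 = backbone.layer4(c4)       # part 5
    return {"part2": c2, "part3": c3, "part4": c4, "part5": c5}

with torch.no_grad():
    feats = multi_scale_features(torch.randn(1, 3, 512, 512))
for name, f in feats.items():
    print(name, tuple(f.shape))    # spatial size shrinks as depth grows
```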
  • Next, the feature image sets of different scales are input into an RPN network, which generates candidate regions. Then, at least two feature image sets are selected from the five feature image sets, the region sets corresponding to the candidate regions are extracted from each of the selected feature image sets, and the extracted region sets are unified to a preset length and width.
  • After the region sets are extracted and unified to the preset size, the region sets are spliced along the layer dimension.
  • For example, the first feature image set is 128 * 128 * 3, where 3 represents the number of layers of first feature images included in the first feature image set, and 128 * 128 represents that the size of any one of the first feature images is 128 * 128;
  • the second feature image set is 64 * 64 * 6, where 6 represents the number of layers of second feature images included in the second feature image set, and 64 * 64 represents that the size of any one of the second feature images is 64 * 64;
  • the third feature image set is 32 * 32 * 4,
  • the fourth feature image set is 16 * 16 * 2, and the fifth feature image set is 4 * 4 * 3.
  • The parameters of the third, fourth, and fifth feature image sets have the same meanings as those of the first feature image set and are not repeated here.
  • Data sampling is performed on the extracted region sets so that the images included in the selected at least two feature image sets are unified in size, for example to 7 * 7.
  • After the selected at least two feature image sets are unified in size, they are superimposed along the layer dimension. Specifically, assume the selected feature image sets are the third feature image set and the fifth feature image set, and the sizes of the images included in the two sets are unified to 7 * 7. The two feature image sets, now unified to 7 * 7, are then superimposed along the layer dimension, and the superimposed feature image set is 7 * 7 * 10. A sketch of this extraction and splicing is given below.
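  • The following is a minimal sketch of extracting the candidate-region sets from two feature image sets, unifying them to 7 * 7, and splicing them along the layer dimension, assuming torchvision's roi_align; the tensors and scales are illustrative, not the patent's values.

```python
import torch
from torchvision.ops import roi_align

def fuse_multiscale_rois(feat_a, feat_b, boxes, scale_a, scale_b):
    """feat_a, feat_b: (1, Ca, Ha, Wa) and (1, Cb, Hb, Wb) feature image
    sets; boxes: (N, 5) candidate regions as (batch_idx, x1, y1, x2, y2)
    in image coordinates; scale_*: feature-map size / image size."""
    rois_a = roi_align(feat_a, boxes, output_size=(7, 7), spatial_scale=scale_a)
    rois_b = roi_align(feat_b, boxes, output_size=(7, 7), spatial_scale=scale_b)
    # splice along the layer (channel) dimension: (N, Ca + Cb, 7, 7)
    return torch.cat([rois_a, rois_b], dim=1)

# e.g. the third (32 * 32 * 4) and fifth (4 * 4 * 3) feature image sets of
# a 128 * 128 image, with one candidate region
third = torch.randn(1, 4, 32, 32)
fifth = torch.randn(1, 3, 4, 4)
boxes = torch.tensor([[0.0, 10.0, 10.0, 50.0, 50.0]])
fused = fuse_multiscale_rois(third, fifth, boxes, 32 / 128, 4 / 128)
print(tuple(fused.shape))  # layers from both sets, each region now 7 * 7
```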
  • the model in this embodiment uses the ResNet model as the underlying feature extraction model, uses the RPN network as the candidate region generation network, and uses an improved processing flow to process the input to-be-processed image.
  • The model in this embodiment extracts the region sets corresponding to the candidate regions generated by the RPN from at least two feature image sets respectively, obtaining at least two region sets. Since the at least two region sets come from different feature image sets, and different feature image sets reflect information of the image to be processed at different scales, the model in this embodiment fully connects information of at least two scales of the image to be processed, so that it recognizes information at different scales of the image.
  • In contrast, a model using the standard Faster-RCNN architecture fully connects only the single region set corresponding to the candidate regions extracted from the image to be processed, so it recognizes information at only one scale of the image.
  • However, target identifiers of different sizes may exist in the image to be processed, and the characteristics of target identifiers of different sizes are reflected on feature image sets of different scales.
  • The model in this embodiment can identify information in feature image sets of different scales and therefore can identify target identifiers whose characteristics are reflected at different scales. Compared with a model using the standard Faster-RCNN architecture, the model in this embodiment thus achieves higher recognition accuracy for target identifiers of different sizes.
  • FIG. 1 shows a flowchart of an embodiment of a model training method in the present application.
  • the method embodiment may include:
  • Step 101 Obtain a training set.
  • the model is used for image recognition as an example to introduce the training process of the model.
  • the specific image recognition scene is: identifying whether the BMW brand logo exists in the image.
  • a training set for training the model is obtained, where the training set includes a large number of images marked with the BMW brand logo.
  • A large number of images used to compose the training set can be obtained by searching for images containing the BMW brand logo on search platforms such as Baidu or Google, or on other material websites; screen capture software can also be used to capture images containing the BMW brand logo from videos such as live shows.
  • Other methods can also be used to obtain a large number of images containing the BMW brand logo; this step merely provides two such methods and does not restrict the specific way the images are obtained.
  • For each of the acquired images, the BMW brand logo in the image is labeled. Specifically, as shown in FIG. 2, a box is used to label each BMW brand logo in the image.
  • Step 102 Train the model using the acquired training set to obtain a first model.
  • In this step, the model is trained using the large number of images in the acquired training set. Specifically, an image labeled with the BMW brand logo is input into the model, and the model uses the improved processing flow to identify and label the BMW brand logo in the input image. Using the labeled BMW brand logos in the training set as the benchmark, the parameters of the model are automatically adjusted over multiple iterations; when a certain standard is reached, the first model is obtained.
  • Step 103 Input a preset number of frames of images to be identified into the first model.
  • In this step, a preset number of frames of images to be identified are input into the first model, and for each input frame, the first model recognizes and marks the BMW brand logo contained in that frame.
  • Step 104 Obtain a preset number of frame images that the first model separately recognizes and labels the target identifier.
  • a preset number of frames of images identified by the first model and labeled with the target identifier are obtained.
  • Misidentification may occur, in which case the labeled target identifier is wrong. Therefore, among the preset number of frames identified and labeled by the first model in this step, there may be labeled symbols that are not target identifiers. For convenience of description, this embodiment collectively refers to labeled non-target symbols as error symbols.
  • Step 105 Obtain a preset number of frame images with artificially corrected error symbols.
  • Step 106 Input the corrected preset number of frame images into the first model, and train the first model to obtain a trained model.
  • the corrected preset number of frame images are input to a first model, and the first model is further trained.
  • The process of training the first model in this step follows the same idea as training the model in step 102.
  • the models obtained after training the first model are collectively referred to as a trained model.
  • The first model is obtained by training the model on the training set. Since the images in the training set are collected from search platforms, after being trained on them the model has only learned the target identifiers as they appear in that training set. In practical applications, the images to be identified may contain similar identifiers that resemble the target identifier. To allow the model to better learn the features that distinguish the target identifier from similar identifiers, in this embodiment a preset number of frames to be identified are input into the first model; the labels output by the first model contain error symbols; and the preset number of frames with the error symbols manually corrected are used to train the first model again, yielding the trained model. Compared with the first model, the trained model has a higher recognition accuracy for target identifiers in images to be identified, so the training method of this embodiment further improves the model's identification accuracy. A sketch of this training flow is given below.
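  • The following is a minimal sketch of this two-stage training flow (steps 101 to 106); `train`, `predict`, and `manually_correct` are hypothetical stand-ins for the real training, inference, and manual correction, not APIs from the patent or any library.

```python
from dataclasses import dataclass, field

@dataclass
class Model:
    seen_labels: list = field(default_factory=list)  # toy stand-in state

def train(model, labeled_images):
    model.seen_labels.extend(labeled_images)  # stand-in for parameter updates
    return model

def predict(model, frames):
    # stand-in: the real model outputs boxed target identifiers per frame,
    # possibly containing error symbols
    return [{"frame": f, "labels": ["BMW?"]} for f in frames]

def manually_correct(predictions):
    # stand-in for a human correcting the error symbols
    return [{**p, "labels": ["BMW"]} for p in predictions]

training_set = ["img_with_bmw_1", "img_with_bmw_2"]    # step 101
first_model = train(Model(), training_set)             # step 102
frames = ["video_frame_%d" % i for i in range(5)]      # step 103
predictions = predict(first_model, frames)             # step 104
corrected = manually_correct(predictions)              # step 105
trained_model = train(first_model, corrected)          # step 106
```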
  • Based on the foregoing, the trained model is obtained; in this embodiment, the trained model is then applied to a scenario of analyzing the implantation of target identifiers in a video.
  • Referring to FIG. 3, a flowchart of an embodiment of a method for analyzing a target identifier in a video in the present application is shown.
  • the method embodiment may include:
  • Step 301 Obtain a video to be analyzed.
  • the video to be analyzed obtained in this step may be an encoded video to be analyzed.
  • Step 302 Decode the obtained video to be analyzed to obtain a decoded video to be analyzed.
  • Step 303 For the decoded video to be analyzed, divide the video into multiple image sets according to the sequence of the video frames, with every preset number of consecutive frames forming one image set.
  • The target logo embedded in a video is generally played continuously for two to three seconds, where the target logo represents a preset type of logo.
  • For example, if the BMW brand logo in the video needs to be analyzed, the BMW brand logo is the target logo.
  • About 5 frames of image are played per second, so the images embedded with the target identifier in the decoded video generally appear in 10 to 15 consecutive frames. Therefore, to analyze the implantation of the target identifier in the video more accurately, in this step the decoded video is divided into multiple image sets according to the sequence of the video frames, with every preset number of frames forming one image set, where the preset number can be any number from 5 to 7. A sketch of this grouping is given below.
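  • The following is a minimal sketch of steps 301 to 303, assuming OpenCV; the file name is hypothetical.

```python
import cv2

def split_into_image_sets(path, set_size=5):
    """Decode the video at `path` and group its frames, in order, into
    image sets of `set_size` consecutive frames (here 5, from the 5 to 7
    range described above)."""
    cap = cv2.VideoCapture(path)   # step 301/302: obtain and decode video
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    # step 303: every `set_size` consecutive frames form one image set
    return [frames[i:i + set_size] for i in range(0, len(frames), set_size)]

image_sets = split_into_image_sets("program.mp4", set_size=5)
```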
  • Step 304 Input the images in each image set into the trained model separately, so that the trained model recognizes the target identifiers in the images contained in each image set.
  • In this step, the images in each image set are respectively input into the trained model, and the trained model identifies the target identifier in each frame of image.
  • In practical applications, after the trained model identifies the target logo in the video to be analyzed, the identified target logo is labeled. For example, when the trained model recognizes a BMW brand logo, a box can be used to frame the identified logo, and the model outputs an image in which the identified BMW brand logo is framed.
  • Step 305 Obtain an image set labeled with a target identifier corresponding to each image set and output by the trained model.
  • In this step, the image set labeled with the target identifier corresponding to each of the divided image sets and output by the trained model is obtained, yielding multiple recognized image sets.
  • Step 306 Detect whether the target identifier marked in each image set meets a preset condition.
  • After the multiple image sets labeled with target identifiers are obtained, this step detects, for each image set, whether the labeled target identifiers satisfy a preset condition.
  • Taking any one image set as an example, the following introduces how to detect whether the target identifiers labeled in that image set satisfy the preset condition.
  • the preset condition may include that the placeholders of the target identifiers distributed in at least two adjacent images at least partially overlap.
  • The placeholder of a target identifier refers to the spatial area occupied by the target identifier in a reference coordinate system. A sketch of this overlap check is given below.
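  • The following is a minimal sketch of this part of the preset condition; representing placeholders as (x1, y1, x2, y2) boxes and measuring overlap by intersection-over-union are assumptions, since the patent does not fix a representation.

```python
def iou(a, b):
    """Overlap between two placeholders a, b given as (x1, y1, x2, y2)
    in a shared reference coordinate system."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def placeholders_overlap(frame_boxes):
    """frame_boxes: per-frame lists of placeholders, in video order.
    True if a placeholder in one frame at least partially overlaps a
    placeholder in the adjacent frame."""
    for prev, cur in zip(frame_boxes, frame_boxes[1:]):
        if any(iou(p, c) > 0 for p in prev for c in cur):
            return True
    return False
```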
  • the following uses a specific scenario as an example to introduce whether the identified target identifiers in the image collection meet a preset condition.
  • the specific scene is: the image set includes 5 frames of images, namely the first frame, the second frame, the third frame, the fourth frame, and the fifth frame, and the target logo is the BMW brand logo;
  • the position distribution of the identified target identifiers in the second frame image, the third frame image, the fourth frame image, and the fifth frame image is shown in FIG. 4.
  • Two BMW brand logos are identified in the first frame of the image: one in the upper left corner and one in the lower right corner. Two BMW brand logos are likewise identified in the second frame of the image.
  • Since the preset condition is that "the placeholders of the target identifiers distributed in at least two frames of adjacent images at least partially overlap", in this scene the target identifiers distributed in at least two adjacent frames are specifically the 5 BMW brand logos in the first, second, and third frame images. It is then determined whether the placeholders of these target identifiers at least partially overlap:
  • the placeholders of the three BMW brand logos in the lower right corners of the first, second, and third frame images overlap. Therefore, the BMW brand identifiers identified in this image set meet the preset condition.
  • In this step, one of two detection results is obtained: either the identified target identifiers in the image set meet the preset condition, or they do not.
  • In this embodiment, the preset condition may further include: the overlap ratio between target identifiers whose placeholders at least partially overlap is greater than a preset percentage; and, among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
  • the value range of the preset percentage may be not less than 50%, and the value range of the preset total number may not be less than 5.
  • this embodiment only provides a preferred value range of the preset percentage and the preset total number.
  • The preset percentage and the preset total number can also be determined based on actual conditions; this embodiment does not limit their specific values. A sketch of these further checks is given below.
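  • The following is a minimal sketch of these further checks, assuming OpenCV; measuring sharpness by the variance of the Laplacian is an assumption, since the patent does not specify how sharpness is computed.

```python
import cv2

def sharpness(crop_bgr):
    """Sharpness of one target identifier's image region (assumed metric:
    variance of the Laplacian; higher means sharper)."""
    gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def further_conditions(overlap_ratios, crops,
                       preset_percentage=0.5,   # "not less than 50%"
                       preset_sharpness=100.0,  # assumed threshold
                       preset_total=5):         # "not less than 5"
    """overlap_ratios: overlap ratios between the target identifiers whose
    placeholders at least partially overlap; crops: their image regions."""
    ratio_ok = all(r > preset_percentage for r in overlap_ratios)
    sharp_total = sum(1 for c in crops if sharpness(c) > preset_sharpness)
    return ratio_ok and sharp_total > preset_total
```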
  • Step 307 Determine the exposure data of the target identifier in the video to be analyzed according to the detection result.
  • the exposure data of the target identifier in the video to be analyzed is determined.
  • The exposure data includes: whether the target identifier is exposed, the exposure position, and the exposure duration. Specifically, in this step, if the detection result is that the identified target identifiers in the image set meet the preset condition, the target identifier is exposed in that image set; the spatial position occupied by the at least partially overlapping target identifiers is determined as the exposure position of the target identifier; and, based on the exposure position, the number of consecutive frames in which the target identifier appears at that position in the video to be analyzed is counted, and the playback duration of the target identifier is determined from that frame count.
  • If the target identifier has multiple exposure positions, the playback duration at each exposure position is determined separately, and the sum of the playback durations over all exposure positions is taken as the total playing time of the target identifier. A sketch of this calculation is given below.
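  • The following is a minimal sketch of this calculation; the frame rate is taken from the roughly 5 frames per second mentioned in step 303, and the position names and frame counts are made up for illustration.

```python
FRAMES_PER_SECOND = 5  # assumption taken from the description of step 303

def exposure_durations(consecutive_frames_per_position):
    """Map each exposure position's consecutive-frame count to seconds,
    and sum them into the target identifier's total playing time."""
    per_position = {pos: n / FRAMES_PER_SECOND
                    for pos, n in consecutive_frames_per_position.items()}
    return per_position, sum(per_position.values())

per_pos, total = exposure_durations({"lower_right": 12, "upper_left": 6})
print(per_pos, total)  # {'lower_right': 2.4, 'upper_left': 1.2} 3.6
```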
  • If the detection result is that the identified target identifiers in the image set do not meet the preset condition, the target identifier is not exposed in that image set; and if the target identifier is not exposed in any of the image sets, the target identifier is not exposed in the video to be analyzed. In that case, there is no exposure position or exposure duration.
  • In this embodiment, the target identifier in the video to be analyzed is identified, and it is detected whether the identified target identifier satisfies a preset condition; the characteristics of a target identifier that satisfies the preset condition specifically match the characteristics a target identifier has when it is exposed in the video.
  • Detecting whether the identified target identifier satisfies the preset condition yields a detection result, which is either that the identified target identifier meets the preset condition or that it does not. Therefore, in this embodiment, whether the target logo is exposed can be determined from the detection result; when the target logo is exposed, the identified target logo meets the preset condition.
  • The preset condition requires that the placeholders of the target identifiers distributed in at least two frames of adjacent images at least partially overlap.
  • The placeholders of the at least partially overlapping target identifiers reflect the positions of the exposed target identifiers; further, based on the exposure position, the playing time of the exposed target identifier at that position in the video to be analyzed can be determined. Accordingly, in the embodiments of the present application, the exposure data of the target identifier in the video to be analyzed can be determined from the detection result.
  • the apparatus embodiment may include:
  • An obtaining unit 501 configured to obtain a video to be analyzed
  • a first identification unit 502 configured to identify a target identifier in the video to be analyzed
  • a detection unit 503 configured to detect whether the identified target identifier meets a preset condition to obtain a detection result, where the preset condition includes: the placeholders of the target identifiers distributed in at least two frames of adjacent images at least partially overlap;
  • a determining unit 504 configured to determine, according to the detection result, the exposure data of the target identifier that satisfies the preset condition in the video to be analyzed.
  • the first identification unit 502 may include:
  • a first input subunit configured to input each frame of image in the video to be analyzed into a trained preset model, so that the trained preset model identifies the target identifier in each frame of the video to be analyzed;
  • the preset model includes:
  • a first extraction unit configured to extract multi-scale features of any one frame of image to obtain a multi-scale feature image set;
  • a generating unit configured to generate candidate regions based on the multi-scale feature image set;
  • a selection unit configured to select feature image sets of at least two scales from the multi-scale feature image set;
  • a second extraction unit configured to respectively extract region sets corresponding to the candidate regions from the feature image sets of the at least two scales, to obtain region sets of at least two scales corresponding to the feature image sets of the at least two scales;
  • a second recognition unit configured to recognize the target identifier in the frame of image by fully connecting the region sets of the at least two scales.
  • the preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
  • the first extraction unit is specifically configured to extract the multi-scale features of any one frame of image through the low-level feature extraction module to obtain the multi-scale feature image set;
  • the generating unit is specifically configured to input the multi-scale feature image set into the candidate region generating network, and generate the candidate region through the candidate region generating network.
  • the device may further include: a training unit;
  • the training unit is configured to train the preset model to obtain the trained preset model
  • the training unit includes:
  • a first acquisition subunit configured to acquire a training set, where the training set includes: multiple frames of images to which the target identifier has been labeled;
  • a first training subunit configured to train the preset model by using the multi-frame image to obtain a first preset model
  • a second input subunit configured to input an image in the video to be analyzed into the first preset model
  • a second acquisition subunit configured to acquire an image labeled with the target identifier through the first preset model in the video to be analyzed; the image labeled with the target identifier has an incorrect label;
  • a third acquisition subunit configured to acquire a corrected image;
  • the corrected image is: an image that has been manually corrected for the incorrect annotation;
  • a second training subunit is configured to use the modified image to train the first preset model to obtain the trained preset model.
  • the detection unit 503 is further configured to detect whether the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage, and whether, among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
  • the determining unit 504 may include:
  • a first determining subunit configured to determine, if the detection result is that the target identifier meets the preset condition, that the target identifier is exposed in the video to be analyzed, and to further determine an exposure parameter, where
  • the exposure parameter includes at least one of the following: exposure duration and exposure position;
  • a second determining subunit is configured to determine that the target identifier is not exposed in the video to be analyzed if the detection result is that the target identifier does not meet the preset condition.
  • the analysis device for the target identification in the video includes a processor and a memory.
  • The acquisition unit, the first identification unit, the detection unit, the determination unit, and the training unit are all stored in the memory as program units, and the processor executes the program units stored in the memory.
  • the above program units are used to implement the corresponding functions.
  • the processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory.
  • One or more kernels can be set, and the exposure data of the target logo in the video is analyzed by adjusting the kernel parameters.
  • The memory may include non-permanent memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM).
  • The memory includes at least one memory chip.
  • An embodiment of the present invention provides a storage medium on which a program is stored, and the video analysis method is implemented when the program is executed by a processor.
  • An embodiment of the present invention provides a processor, where the processor is configured to run a program, and the video analysis method is executed when the program runs.
  • An embodiment of the present invention provides a device.
  • the device includes a processor, a memory, and a program stored on the memory and executable on the processor.
  • When the processor executes the program, the following steps are implemented:
  • the preset model identifies the target identifier in any one frame of image according to the following steps:
  • the target identifier in that frame of image is thereby identified.
  • the preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
  • extracting the multi-scale features of the arbitrary frame of images to obtain a multi-scale feature image set includes:
  • the generating candidate regions based on the multi-scale feature image set includes:
  • the multi-scale feature image set is input to the candidate region generation network, and the candidate region is generated by the candidate region generation network.
  • the preset model is trained in the following manner to obtain the trained preset model:
  • the training set includes: a plurality of frames of images to which the target identifier is marked;
  • the corrected image is: an image that has been manually corrected for the incorrect annotation
  • detecting whether the identified target identifier meets a preset condition to obtain a detection result, where the preset condition includes: the placeholders of the target identifiers distributed in at least two frames of adjacent images at least partially overlap;
  • the preset conditions may further include:
  • the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage; and, among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
  • the exposure data of the target identifier that satisfies the preset condition in the video to be analyzed is determined.
  • if the detection result is that the target identifier meets the preset condition, determining that the target identifier is exposed in the video to be analyzed and further determining an exposure parameter, where the exposure parameter includes at least one of the following: exposure duration, exposure position;
  • if the detection result is that the target identifier does not satisfy the preset condition, it is determined that the target identifier is not exposed in the video to be analyzed.
  • The device herein can be a server, a PC, a PAD, a mobile phone, and the like.
  • This application also provides a computer program product which, when executed on a data processing device, is suitable for executing a program initialized with the following method steps:
  • the preset model identifies the target identifier in any one frame of image according to the following steps:
  • the target identifier in that frame of image is thereby identified.
  • the preset model is based on a Faster-RCNN architecture, and the architecture includes a low-level feature extraction model and a candidate region generation network;
  • extracting the multi-scale features of the arbitrary frame of images to obtain a multi-scale feature image set includes:
  • the generating candidate regions based on the multi-scale feature image set includes:
  • the multi-scale feature image set is input to the candidate region generation network, and the candidate region is generated by the candidate region generation network.
  • the preset model is trained in the following manner to obtain the trained preset model:
  • the training set includes: a plurality of frames of images to which the target identifier is marked;
  • the corrected image is: an image that has been manually corrected for the incorrect annotation
  • detecting whether the identified target identifier meets a preset condition to obtain a detection result, where the preset condition includes: the placeholders of the target identifiers distributed in at least two frames of adjacent images at least partially overlap;
  • the preset conditions further include:
  • the overlap ratio between the target identifiers whose placeholders at least partially overlap is greater than a preset percentage; and, among the target identifiers whose placeholders at least partially overlap, the total number of target identifiers whose sharpness is greater than a preset sharpness threshold is greater than a preset total number.
  • the exposure data of the target identifier that satisfies the preset condition in the video to be analyzed is determined.
  • if the detection result is that the target identifier meets the preset condition, determining that the target identifier is exposed in the video to be analyzed and further determining an exposure parameter, where the exposure parameter includes at least one of the following: exposure duration, exposure position;
  • if the detection result is that the target identifier does not satisfy the preset condition, it is determined that the target identifier is not exposed in the video to be analyzed.
  • this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • a computing device includes one or more processors (CPUs), input / output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory, random access memory (RAM), and / or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM).
  • Computer-readable media include permanent and non-permanent, removable and non-removable media.
  • Information storage can be accomplished by any method or technology.
  • Information may be computer-readable instructions, data structures, modules of a program, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Multimedia (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a video analysis method and apparatus, the method comprising the steps of: acquiring a video to be analyzed; identifying target identifiers in the video to be analyzed; detecting whether the identified target identifiers satisfy a preset condition, the preset condition comprising: the placeholders of target identifiers distributed in at least two frames of adjacent images at least partially overlap, so as to obtain a detection result; and determining, according to the detection result, the exposure data of the target identifiers satisfying the preset condition in the video to be analyzed. By means of the embodiments of the present invention, the exposure data of the target identifiers in the video to be analyzed can be determined, and manpower can be saved.
PCT/CN2019/073661 2018-05-23 2019-01-29 Video analysis method and apparatus WO2019223361A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810502120.X 2018-05-23
CN201810502120.XA CN110532833A (zh) 2018-05-23 2018-05-23 一种视频分析方法及装置

Publications (1)

Publication Number Publication Date
WO2019223361A1 true WO2019223361A1 (fr) 2019-11-28

Family

ID=68616536

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/073661 WO2019223361A1 (fr) 2018-05-23 2019-01-29 Video analysis method and apparatus

Country Status (2)

Country Link
CN (1) CN110532833A (fr)
WO (1) WO2019223361A1 (fr)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027510A (zh) * 2019-12-23 2020-04-17 上海商汤智能科技有限公司 行为检测方法及装置、存储介质
CN111046849A (zh) * 2019-12-30 2020-04-21 珠海格力电器股份有限公司 一种厨房安全的实现方法、装置以及智能终端、存储介质
CN111062527A (zh) * 2019-12-10 2020-04-24 北京爱奇艺科技有限公司 一种视频集流量预测方法及装置
CN111310695A (zh) * 2020-02-26 2020-06-19 酷黑科技(北京)有限公司 一种迫降方法、装置及电子设备
CN111950424A (zh) * 2020-08-06 2020-11-17 腾讯科技(深圳)有限公司 一种视频数据处理方法、装置、计算机及可读存储介质
CN112055249A (zh) * 2020-09-17 2020-12-08 京东方科技集团股份有限公司 一种视频插帧方法及装置
CN112989934A (zh) * 2021-02-05 2021-06-18 方战领 视频分析方法、装置及系统
CN113191293A (zh) * 2021-05-11 2021-07-30 创新奇智(重庆)科技有限公司 广告检测方法、装置、电子设备、系统及可读存储介质
CN113312951A (zh) * 2020-10-30 2021-08-27 阿里巴巴集团控股有限公司 动态视频目标跟踪系统、相关方法、装置及设备
CN113825013A (zh) * 2021-07-30 2021-12-21 腾讯科技(深圳)有限公司 图像显示方法和装置、存储介质及电子设备
CN114095722A (zh) * 2021-10-08 2022-02-25 钉钉(中国)信息技术有限公司 清晰度的确定方法、装置及设备
CN112989934B (zh) * 2021-02-05 2024-05-24 方战领 视频分析方法、装置及系统

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113496230A (zh) * 2020-03-18 2021-10-12 中国电信股份有限公司 图像匹配方法和系统
CN111556337B (zh) * 2020-05-15 2021-09-21 腾讯科技(深圳)有限公司 一种媒体内容植入方法、模型训练方法以及相关装置
CN113573043B (zh) * 2021-01-18 2022-11-08 腾讯科技(深圳)有限公司 视频噪点识别方法、存储介质及设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020056124A1 (en) * 2000-03-15 2002-05-09 Cameron Hay Method of measuring brand exposure and apparatus therefor
CN105163127A (zh) * 2015-09-07 2015-12-16 浙江宇视科技有限公司 视频分析方法及装置
CN107122773A (zh) * 2017-07-05 2017-09-01 司马大大(北京)智能系统有限公司 一种视频广告检测方法、装置及设备
CN107679250A (zh) * 2017-11-01 2018-02-09 浙江工业大学 一种基于深度自编码卷积神经网络的多任务分层图像检索方法
CN107944409A (zh) * 2017-11-30 2018-04-20 清华大学 视频分析方法及装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101777124A (zh) * 2010-01-29 2010-07-14 北京新岸线网络技术有限公司 一种提取视频文本信息的方法及装置
CN102567982A (zh) * 2010-12-24 2012-07-11 浪潮乐金数字移动通信有限公司 一种视频节目特定信息的提取系统及其方法、移动终端
CN107197269B (zh) * 2017-07-04 2020-02-21 广东工业大学 一种视频拼接的方法与装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020056124A1 (en) * 2000-03-15 2002-05-09 Cameron Hay Method of measuring brand exposure and apparatus therefor
CN105163127A (zh) * 2015-09-07 2015-12-16 浙江宇视科技有限公司 视频分析方法及装置
CN107122773A (zh) * 2017-07-05 2017-09-01 司马大大(北京)智能系统有限公司 一种视频广告检测方法、装置及设备
CN107679250A (zh) * 2017-11-01 2018-02-09 浙江工业大学 一种基于深度自编码卷积神经网络的多任务分层图像检索方法
CN107944409A (zh) * 2017-11-30 2018-04-20 清华大学 视频分析方法及装置

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062527A (zh) * 2019-12-10 2020-04-24 北京爱奇艺科技有限公司 一种视频集流量预测方法及装置
CN111062527B (zh) * 2019-12-10 2023-12-05 北京爱奇艺科技有限公司 一种视频集流量预测方法及装置
CN111027510A (zh) * 2019-12-23 2020-04-17 上海商汤智能科技有限公司 行为检测方法及装置、存储介质
CN111046849A (zh) * 2019-12-30 2020-04-21 珠海格力电器股份有限公司 一种厨房安全的实现方法、装置以及智能终端、存储介质
CN111310695A (zh) * 2020-02-26 2020-06-19 酷黑科技(北京)有限公司 一种迫降方法、装置及电子设备
CN111310695B (zh) * 2020-02-26 2023-11-24 酷黑科技(北京)有限公司 一种迫降方法、装置及电子设备
CN111950424A (zh) * 2020-08-06 2020-11-17 腾讯科技(深圳)有限公司 一种视频数据处理方法、装置、计算机及可读存储介质
CN112055249A (zh) * 2020-09-17 2020-12-08 京东方科技集团股份有限公司 一种视频插帧方法及装置
CN113312951A (zh) * 2020-10-30 2021-08-27 阿里巴巴集团控股有限公司 动态视频目标跟踪系统、相关方法、装置及设备
CN113312951B (zh) * 2020-10-30 2023-11-07 阿里巴巴集团控股有限公司 动态视频目标跟踪系统、相关方法、装置及设备
CN112989934A (zh) * 2021-02-05 2021-06-18 方战领 视频分析方法、装置及系统
CN112989934B (zh) * 2021-02-05 2024-05-24 方战领 视频分析方法、装置及系统
CN113191293A (zh) * 2021-05-11 2021-07-30 创新奇智(重庆)科技有限公司 广告检测方法、装置、电子设备、系统及可读存储介质
CN113825013A (zh) * 2021-07-30 2021-12-21 腾讯科技(深圳)有限公司 图像显示方法和装置、存储介质及电子设备
CN113825013B (zh) * 2021-07-30 2023-11-14 腾讯科技(深圳)有限公司 图像显示方法和装置、存储介质及电子设备
CN114095722A (zh) * 2021-10-08 2022-02-25 钉钉(中国)信息技术有限公司 清晰度的确定方法、装置及设备

Also Published As

Publication number Publication date
CN110532833A (zh) 2019-12-03

Similar Documents

Publication Publication Date Title
WO2019223361A1 (fr) Video analysis method and apparatus
CN109740670B (zh) 视频分类的方法及装置
CN109117848B (zh) 一种文本行字符识别方法、装置、介质和电子设备
CN107707931B (zh) 根据视频数据生成解释数据、数据合成方法及装置、电子设备
CN110827247B (zh) 一种识别标签的方法及设备
CN106649316B (zh) 一种视频推送方法及装置
US8879894B2 (en) Pixel analysis and frame alignment for background frames
Yang et al. Lecture video indexing and analysis using video ocr technology
Yang et al. Automatic lecture video indexing using video OCR technology
WO2019062388A1 (fr) Procédé et dispositif d'analyse d'effet de publicité
US20150248592A1 (en) Method and device for identifying target object in image
CN110827292B (zh) 一种基于卷积神经网络的视频实例分割方法及设备
CN111147891A (zh) 视频画面中对象的信息的获取方法、装置及设备
CN111160134A (zh) 一种以人为主体的视频景别分析方法和装置
Nguyen et al. Semantic prior analysis for salient object detection
US20110216939A1 (en) Apparatus and method for tracking target
CN111541939B (zh) 一种视频拆分方法、装置、电子设备及存储介质
CN111836118A (zh) 视频处理方法、装置、服务器及存储介质
CN111798543A (zh) 模型训练方法、数据处理方法、装置、设备及存储介质
CN108229285B (zh) 物体分类方法、物体分类器的训练方法、装置和电子设备
CN113923504B (zh) 视频预览动图生成方法和装置
CN112348566A (zh) 推荐广告的确定方法、装置及存储介质
KR20110087620A (ko) 레이아웃 기반의 인쇄매체 페이지 인식방법
CN110019951B (zh) 一种生成视频缩略图的方法及设备
Nag et al. Offline extraction of Indic regional language from natural scene image using text segmentation and deep convolutional sequence

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19806421

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19806421

Country of ref document: EP

Kind code of ref document: A1