CN111126457A - Information acquisition method and device, storage medium and electronic device - Google Patents

Information acquisition method and device, storage medium and electronic device

Info

Publication number
CN111126457A
Authority
CN
China
Prior art keywords
target
video
image
detected
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911239753.7A
Other languages
Chinese (zh)
Inventor
李冠楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201911239753.7A
Publication of CN111126457A
Legal status: Pending


Classifications

    • G06F18/22 Pattern recognition > Analysing > Matching criteria, e.g. proximity measures
    • G06F18/23 Pattern recognition > Analysing > Clustering techniques
    • G06Q30/0277 Commerce > Marketing > Advertisements > Online advertisement
    • G06T7/20 Image analysis > Analysis of motion
    • G06V10/255 Image or video recognition or understanding > Image preprocessing > Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G06V20/46 Scenes; scene-specific elements in video content > Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H04N5/265 Studio circuitry > Special effects circuits > Mixing
    • H04N5/272 Studio circuitry > Special effects circuits > Means for inserting a foreground image in a background image, i.e. inlay, outlay
    • G06T2207/10016 Indexing scheme for image analysis or enhancement > Image acquisition modality > Video; image sequence
    • G06V2201/07 Indexing scheme relating to image or video recognition or understanding > Target detection

Abstract

The application provides an information acquisition method and device, a storage medium and an electronic device. The method includes: extracting key frames from a video to be detected; determining a candidate region with a target shape feature in a frame image of a key frame; extracting a target image feature of the candidate region; when a target region whose image feature matches the target image feature is determined from among a plurality of reference regions, acquiring item object information corresponding to the target item object contained in the target region, where each of the plurality of reference regions contains a reference item object having the target shape feature; and performing region tracking on the candidate region in the video frame sequence of the video to be detected to determine occurrence information of the target item object in the video to be detected. The method and the device solve the related-art problem that the complex content background of video programs leads to a low recall rate of item object information.

Description

Information acquisition method and device, storage medium and electronic device
Technical Field
The present application relates to the field of computers, and in particular, to a method and an apparatus for acquiring information, a storage medium, and an electronic apparatus.
Background
At present, when a commodity object appears in a video program (e.g., a variety show), implanting the corresponding commodity information can provide the user with a convenient purchase entry for that commodity object and meet the user's demand for buying the same or similar commodities.
In order to obtain the information of the commodity objects contained in a video program, the commodity objects appearing in the program, together with the corresponding brands, styles and the like, need to be identified; the information can then be promoted to users, for example in a pop-up window, whenever a commodity object appears.
However, the content background of a video program is complex, and when existing general object detection algorithms are used for commodity detection, many commodity objects cannot be detected accurately, resulting in a low recall rate of the commodity information of the commodity objects.
Therefore, the related-art way of acquiring item object information (e.g., the commodity information of a commodity object) suffers from a low recall rate of item object information because the content background of video programs is complex.
Disclosure of Invention
The embodiments of the present application provide an information acquisition method and device, a storage medium and an electronic device, so as to at least solve the related-art problem that the complex content background of a video program leads to a low recall rate of item object information.
According to an aspect of an embodiment of the present application, an information acquisition method is provided, including: extracting key frames from a video to be detected; determining a candidate region with a target shape feature in a frame image of a key frame; extracting a target image feature of the candidate region; when a target region whose image feature matches the target image feature is determined from among a plurality of reference regions, acquiring item object information corresponding to the target item object contained in the target region, where each of the plurality of reference regions contains a reference item object having the target shape feature; and performing region tracking on the candidate region in the video frame sequence of the video to be detected to determine occurrence information of the target item object in the video to be detected.
According to another aspect of the embodiments of the present application, an information acquisition apparatus is provided, including: a first extraction unit configured to extract key frames from a video to be detected; a first determination unit configured to determine a candidate region having a target shape feature in a frame image of a key frame; a second extraction unit configured to extract a target image feature of the candidate region; a first acquisition unit configured to acquire, when a target region whose image feature matches the target image feature is determined from a plurality of reference regions, item object information corresponding to the target item object contained in the target region, where each of the plurality of reference regions contains a reference item object having the target shape feature; and a second determination unit configured to perform region tracking on the candidate region in the video frame sequence of the video to be detected and determine occurrence information of the target item object in the video to be detected.
Optionally, the first extraction unit includes: the first extraction module is used for extracting key frames from a video to be detected according to a target interval; or, the second extraction module is configured to extract a key frame corresponding to a shot from the shot included in the video to be detected.
Optionally, the first determination unit includes: and the detection module is used for detecting the frame image of the key frame by using the target shape detector to obtain a candidate region with the target shape feature in the frame image of the key frame, wherein the target shape detector is obtained by training the initial shape detector by using a training sample, and the training sample is an image marked with the training region containing the target shape feature.
Optionally, the apparatus further comprises: the third determining unit is used for clustering the shape features of the multiple article objects before determining the candidate region with the target shape feature in the frame image of the key frame, and determining the target shape feature, wherein the target shape feature comprises multiple sub-shape features; and the training unit is used for training the initial shape detector by using a plurality of training samples to obtain the target shape detector, wherein each training sample comprises a training article object with a sub-shape feature, the coincidence degree of a first region and a second region, which are detected by the target shape detector and comprise the sub-shape feature, is greater than or equal to a target threshold value, and the second region is a marked region comprising the training article object.
Optionally, the apparatus further comprises: the detection unit is used for detecting each reference image in the plurality of reference images by using the target shape detector respectively before determining the candidate area with the target shape characteristic in the frame image of the key frame to obtain a plurality of reference areas, wherein each reference image comprises at least one reference object with the target shape characteristic.
Optionally, the apparatus further comprises: a second acquisition unit configured to acquire, when the target image feature includes a target global image feature and a target local image feature, and after the target image feature of the candidate region is extracted, a candidate target region whose global image feature matches the target global image feature from the plurality of reference regions; and a fourth determination unit configured to determine a target region from the candidate target regions by using the target local image feature, where the target region is a candidate target region whose local image feature matches the target local image feature.
Optionally, the second extraction unit comprises: and the third extraction module is used for extracting a target global image feature of the candidate region and a target local image feature of the candidate region by using the target convolutional neural network, wherein the target global image feature is input of a full connection layer of the target convolutional neural network, and the target local image feature is output of one convolutional layer in the target convolutional neural network.
Optionally, the second determination unit includes a determination module configured to perform region detection, according to the candidate region, on video frames before the key frame and video frames after the key frame in the video to be detected, and to determine time period information and position information of the target item object appearing in the video to be detected, where the occurrence information includes the time period information and the position information. The apparatus further includes an adding unit configured to add control information to the video to be detected after the occurrence information is determined, where the control information is used to control the item object information to be displayed, in a pop-up window, at the position corresponding to the position information when the video to be detected is played to the time period corresponding to the time period information.
According to a further embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to carry out the steps of any of the above-described method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
In the present application, image information is matched using a candidate region that has a target shape feature in the frame image of a key frame: key frames are extracted from the video to be detected; a candidate region with a target shape feature is determined in the frame image of a key frame; a target image feature of the candidate region is extracted; when a target region whose image feature matches the target image feature is determined from among a plurality of reference regions, item object information corresponding to the target item object contained in the target region is acquired, where each reference region contains a reference item object having the target shape feature; and the candidate region is tracked in the video frame sequence of the video to be detected to determine occurrence information of the target item object. Because the item object region (the candidate region) is located through shape information, the content background in the image can be excluded, achieving the technical effect of improving the recall rate of item object information and solving the related-art problem that the complex content background of video programs leads to a low recall rate of item object information.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a block diagram of an alternative server hardware configuration according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an alternative information obtaining method according to an embodiment of the present application;
fig. 3 is a schematic flow chart diagram illustrating another alternative information acquisition method according to an embodiment of the present application; and
fig. 4 is a block diagram of an alternative information acquisition apparatus according to an embodiment of the present application.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
According to one aspect of the embodiment of the application, an information acquisition method is provided. Alternatively, the method may be executed in a server (server of a video content playing platform), a user terminal, or a similar computing device. Taking an example of an application running on a server, fig. 1 is a block diagram of a hardware structure of an optional server according to an embodiment of the present application. As shown in fig. 1, the server 10 may include one or more (only one shown in fig. 1) processors 102 (the processors 102 may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and is not intended to limit the structure of the server. For example, the server 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to the information obtaining method in the embodiment of the present application, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 104 may further include memory located remotely from processor 102, which may be connected to server 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the server 10. In one example, the transmission device 106 includes a NIC (Network Interface Controller) that can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be an RF (Radio Frequency) module, which is used for communicating with the internet in a wireless manner.
In this embodiment, a method for acquiring information running on the above server is provided, and fig. 2 is a flowchart of an optional method for acquiring information according to an embodiment of the present application, and as shown in fig. 2, the flowchart includes the following steps:
step S202, extracting key frames from a video to be detected;
step S204, determining a candidate area with target shape characteristics in the frame image of the key frame;
step S206, extracting the target image characteristics of the candidate region;
step S208, under the condition that a target area with the image characteristics matched with the target image characteristics is determined from a plurality of reference areas, acquiring item object information corresponding to a target item object contained in the target area, wherein each reference area in the plurality of reference areas contains a reference item object with target shape characteristics;
step S210, carrying out region tracking on the candidate region in the video frame sequence of the video to be detected, and determining the occurrence information of the target object in the video to be detected.
Alternatively, the execution subject of the above steps may be a server, a user terminal, etc., but is not limited thereto.
With this embodiment, image information is matched using a candidate region that has a target shape feature in the frame image of a key frame, and the item object region (the candidate region) is located through shape information. This solves the related-art problem that the complex content background of video programs leads to a low recall rate of item object information, and improves that recall rate.
The method for acquiring the above information will be described with reference to fig. 2.
In step S202, key frames are extracted from the video to be detected.
For a program video of a video program (e.g., a variety program), a user may view the program video through a means such as a client, a web page, etc.
In order to display item object information (e.g., commodity information) corresponding to an item object (e.g., a commodity) in the program video, the item object information may be added in advance, before the program video is broadcast, or it may be determined and acquired in real time while the program video is being broadcast.
In acquiring the object information, a key frame of a video to be detected (e.g., a program video) may be first extracted, so as to process a frame image of the extracted key frame. The manner of extracting the key frame of the video to be detected can be various, and may include but is not limited to one of the following: and extracting at equal intervals and extracting according to the shots.
As an alternative embodiment, the extracting key frames from the video to be detected includes: extracting key frames from a video to be detected according to a target interval; or extracting key frames corresponding to the shots from the shots contained in the video to be detected.
As an alternative implementation, the key frames may be extracted from the video to be detected according to the target interval. The target interval may be a target time interval, for example, one video frame may be extracted every 2s (which may be set or modified as needed) as the current key frame, or a predetermined number of video frame intervals, for example, one video frame may be extracted every 50 video frames (which may be set or modified as needed) as the current key frame. The specific manner of extracting the key frames at equal intervals may be set as needed, which is not specifically limited in this embodiment.
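As an illustration only, the following is a minimal OpenCV sketch of equal-interval key frame extraction; the 50-frame stride mirrors the example above and, like the function name, is an assumption rather than something fixed by this application.

```python
import cv2

def extract_keyframes_by_interval(video_path, frame_stride=50):
    """Hypothetical helper: keep one frame out of every `frame_stride` frames."""
    capture = cv2.VideoCapture(video_path)
    keyframes = []  # (frame_index, frame_image) pairs
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % frame_stride == 0:
            keyframes.append((index, frame))
        index += 1
    capture.release()
    return keyframes
```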
As another alternative, key frames corresponding to shots may be extracted from shots included in the video to be detected. The video to be detected may include a plurality of shots, wherein, in video frames within the same shot, the similarity of adjacent video frames may be greater than or equal to a first threshold. For each shot, one or more video frames may be extracted as key frames. The specific manner for determining the shots included in the video to be detected and the manner for extracting the key frames from each shot may be set as required (e.g., according to the similarity of adjacent video frames), which is not specifically limited in this embodiment.
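A sketch of the shot-based alternative, under the assumption (consistent with the first threshold mentioned above) that a shot boundary is declared when the histogram similarity of adjacent frames falls below a threshold; the threshold value and the choice of histogram correlation are illustrative.

```python
import cv2

def extract_keyframes_by_shot(video_path, similarity_threshold=0.7):
    """Hypothetical helper: keep the first frame of each detected shot."""
    capture = cv2.VideoCapture(video_path)
    keyframes, prev_hist, index = [], None, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        similar = (prev_hist is not None and
                   cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) >= similarity_threshold)
        if not similar:  # first frame, or similarity dropped: a new shot begins
            keyframes.append((index, frame))
        prev_hist = hist
        index += 1
    capture.release()
    return keyframes
```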
With this embodiment, extracting key frames from the video to be detected at a specified interval, or extracting them per shot, keeps the key frame extraction reasonable and improves its efficiency.
In step S204, a candidate region having a target shape feature in the frame image of the key frame is determined.
For the extracted key frames, the frame images of the key frames can be sequentially processed, and candidate regions with target shape features in the frame images of the key frames are determined.
There are various ways to determine the candidate region having the target shape feature in the frame image of the key frame, and any way that can perform region detection according to the shape information can be used to determine the candidate region.
As an alternative embodiment, determining the candidate region having the target shape feature in the frame image of the key frame includes: detecting the frame image of the key frame by using a target shape detector to obtain a candidate region with the target shape feature in the frame image of the key frame, where the target shape detector is obtained by training an initial shape detector with training samples, and a training sample is an image marked with a training region containing the target shape feature.
For the frame image of the key frame, detection of the candidate region may be performed using the target shape detector. The target shape detector may be obtained by training the initial shape detector with training samples, each of which is an image marked with a training region containing the target shape feature.
In order to increase the recall rate of item object information, the target shape detector may be made sensitive to specific shape information and detect any area of the image containing the target shape feature; a detected area may or may not contain a specific item object (for example, a commodity).
For example, a keyframe calculation may be performed on a video to be detected, and a commodity object detection may be performed on each keyframe by using a trained commodity object shape detector (target shape detector) to obtain a commodity object candidate region (candidate region) in the video.
By the embodiment, the trained target shape detector is used for detecting the candidate region, so that the efficiency of determining the candidate region can be improved, and the accuracy of determining the candidate region can be improved.
Before using the target shape detector, the initial shape detector may be trained using training samples to obtain the target shape detector.
As an alternative embodiment, before determining a candidate region having a target shape feature in a frame image of a key frame, clustering shape features of a plurality of object objects to determine the target shape feature, where the target shape feature includes a plurality of sub-shape features; and training the initial shape detector by using a plurality of training samples to obtain the target shape detector, wherein each training sample comprises a training article object with a sub-shape feature, the coincidence degree of a first region and a second region, which are detected by the target shape detector and comprise the sub-shape feature, is greater than or equal to a target threshold value, and the second region is a marked region comprising the training article object.
To detect the shape information of many different types of item objects, the shape features of the item objects can be clustered to determine the target shape feature, where the target shape feature includes multiple sub-shape features.
For example, commodity objects are clustered based on shape information (e.g., contour information), and common object shapes corresponding to daily consumables are abstracted.
Each training sample of the plurality of training samples contains one or more training item objects, and the shape of each training item object belongs to one of the multiple sub-shape features. Each training sample can be marked manually or by machine, and a region (the second region) containing one training item object can be marked. The marking rule can be: the second region takes the two transversely furthest points of the training item object as its transverse length and the two longitudinally furthest points as its longitudinal length, as sketched below. Besides the second region, the shape information of the training item object may be annotated as auxiliary information for training the initial shape detector, to improve the performance of the target shape detector.
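A minimal sketch of the marking rule just described, assuming the training item object's outline is available as a list of (x, y) points; the helper name is hypothetical.

```python
def mark_second_region(contour_points):
    """The transversely furthest point pair fixes the transverse length and the
    longitudinally furthest pair the longitudinal length, i.e. an axis-aligned box."""
    xs = [x for x, _ in contour_points]
    ys = [y for _, y in contour_points]
    return (min(xs), min(ys), max(xs), max(ys))  # (x_min, y_min, x_max, y_max)

# Usage: mark_second_region([(12, 30), (48, 25), (40, 70)]) -> (12, 25, 48, 70)
```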
When the initial shape detector is trained, the training samples can be fed into it iteratively, one by one, to obtain a detection result (the first region) output by the detector. The parameters of the detector are adjusted according to the similarity (for example, the degree of overlap) between the first region and the second region, so that the output first region becomes more similar to the second region; when the convergence condition is satisfied, training is deemed complete and the target shape detector is obtained.
The above convergence condition may be defined by an objective function, and may be: the coincidence degree of the first region output by the target shape detector and the second region is greater than or equal to a second threshold (the target threshold).
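Reading the coincidence degree as an intersection-over-union style overlap, a minimal sketch of the convergence check follows; that reading, and the 0.5 threshold, are assumptions for illustration.

```python
def coincidence_degree(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

TARGET_THRESHOLD = 0.5  # assumed example value for the target threshold
converged = coincidence_degree((10, 10, 50, 50), (12, 8, 52, 48)) >= TARGET_THRESHOLD
```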
For example, commodity image data may be collected, an initial commodity object shape detector trained, and a commodity object shape detector (target shape detector) obtained.
The server using the target shape detector and the server training the initial shape detector may be the same server or different servers, and the server performing the shape feature clustering and the server performing the training of the initial shape detector may be the same server or different servers.
By means of the method and the device, the target shape characteristics of the article objects are obtained by clustering the shape information of the article objects, the training of the shape detector is carried out based on the obtained target shape characteristics, and the detection performance of the shape detector obtained through training can be improved.
In order to obtain a plurality of reference regions, the reference regions can be determined from the reference image by means of manual marking or machine marking.
As an alternative embodiment, before determining the candidate region having the target shape feature in the frame image of the key frame, a target shape detector may be used to detect each of a plurality of reference images, respectively, to obtain a plurality of reference regions, where each reference image includes at least one reference object having the target shape feature.
For each reference image, one or more reference item objects having a target shape feature may be included in each reference image. The target shape detector may be used to detect each reference image to obtain a reference region having target shape features in each reference image, and the reference regions may be in one-to-one correspondence with the reference object.
After the reference areas of the reference images are obtained, the reference areas can be rechecked in a manual rechecking mode, so that the accuracy of determining the reference areas in the reference images is guaranteed.
For example, each commodity image to be put in storage (reference image, commodity in the image library is commodity article to be detected) may be detected by using a commodity object shape detector, and a commodity object area (reference area) in the image may be obtained.
By the embodiment, the target shape detector is used for acquiring the reference region in the reference image, so that the efficiency of acquiring the reference region can be improved, and the labor cost for acquiring the reference region can be reduced.
In step S206, the target image feature of the candidate region is extracted.
After the candidate region of the frame image of the key frame is obtained, the image feature (target image feature) of the candidate region may be extracted using a method of extracting an image feature. The way of extracting the target image features may be various, and may include, but is not limited to, at least one of the following: convolutional neural networks, Scale-invariant feature transform (SIFT) algorithms and Speeded Up Robust Features (SURF) algorithms, etc.
To improve the accuracy of item object information acquisition, multi-level image features of the candidate region may be extracted, which may include, but are not limited to, at least one of: global image features and local image features. The global image feature may be a feature of the candidate region as a whole, and may represent high-level semantic information of the candidate region, and the local image feature (image detail feature) may be local information of the candidate region, and may represent image details of the candidate region. The global image feature and the local image feature of the candidate region may be extracted using the same algorithm model or different algorithm models.
For example, a general image classification model may be used to extract multi-level image features of the merchandise region, including global features and detail features of the merchandise object candidate region.
As an alternative embodiment, extracting the target image feature of the candidate region includes: extracting, using a target convolutional neural network, a target global image feature of the candidate region and a target local image feature of the candidate region, where the target global image feature is the input of a fully connected layer of the target convolutional neural network, and the target local image feature is the output of one convolutional layer in the target convolutional neural network.
The extraction of the candidate region image features may be performed using a convolutional neural network, which may include: the convolutional layer, the pooling layer, the full link layer, can also include: and activating the layer.
For a convolutional layer, the layer convolves its input with a convolution kernel to obtain its output, so the output of a convolutional layer can be used to represent local image features of an image. For a fully connected layer, the layer determines, from the features extracted from the image, the probability of each possible output, so the input of the fully connected layer can be used to represent global image features of the image.
For a candidate region, the image information of the candidate region may be input to a target convolutional neural network, the target convolutional neural network is used for performing image feature extraction (the training process may refer to a process of training a convolutional neural network for performing image feature extraction in the related art), the output of one convolutional layer of the target convolutional neural network is used as the target local image feature of the candidate region, and the input of a fully-connected layer of the target convolutional neural network is used as the target global image feature of the candidate region.
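A sketch of this extraction with a torchvision ResNet-50: a forward hook on one convolutional stage captures the local (detail) feature, and a hook on the fully connected layer captures its input as the global feature. The network, the tapped layer (layer3) and the input size are illustrative assumptions, not choices made by this application.

```python
import torch
import torchvision.models as models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
features = {}

# Output of one convolutional stage -> target local image feature.
model.layer3.register_forward_hook(
    lambda module, inputs, output: features.update(local=output.detach()))
# Input of the fully connected layer -> target global image feature.
model.fc.register_forward_hook(
    lambda module, inputs, output: features.update(global_=inputs[0].detach()))

with torch.no_grad():
    candidate_region = torch.randn(1, 3, 224, 224)  # stand-in for a cropped candidate region
    model(candidate_region)

local_feature = features["local"]     # convolutional output, e.g. shape (1, 1024, 14, 14)
global_feature = features["global_"]  # FC-layer input, shape (1, 2048)
```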
By the embodiment, the global image features and the local image features of the candidate region are extracted by using the convolutional neural network, the multi-level image features of the candidate region can be extracted by using the same network, the occupation of processing resources by image extraction is reduced, and the capability of the extracted image features for representing the candidate region is improved.
As an alternative embodiment, when the target image feature includes a target global image feature and a target local image feature, after the target image feature of the candidate region is extracted, a candidate target region whose global image feature matches the target global image feature is acquired from the plurality of reference regions; a target region is then determined from the candidate target regions using the target local image feature, where the target region is a candidate target region whose local image feature matches the target local image feature.
After the multi-level image features of the candidate region are obtained, the target global image feature may first be matched against the global image feature of each of the plurality of reference regions, for example by respectively computing their similarities; the reference regions whose global image features match the target global image feature are then taken as candidate target regions, of which there may be one or more. For example, reference regions whose global-feature similarity to the target global image feature is greater than or equal to a first similarity threshold may be taken as candidate target regions.
For the obtained candidate target regions, the target local image feature may be matched against the local image features of the candidate target regions, for example by respectively computing the similarity between the target local image feature and the local image feature of each candidate target region; the region whose local image feature matches the target local image feature is then taken as the target region. For example, the candidate target region whose local image feature has the highest similarity to the target local image feature, that similarity being greater than or equal to a second similarity threshold, may be taken as the target region.
For example, a general image classification model can be used to extract multi-level image features of each commodity region (reference region), including global image features and image detail features, which are stored separately to build a commodity feature database. The global image features computed for a commodity object candidate region are used to query the database, and database images meeting the similarity requirement are returned as commodity style candidates. The style of the returned results (commodity style candidates) is then confirmed using the image detail features of the commodity object candidate region: the consistency of the detail features is judged, and if they are consistent, the commodity object candidate region is taken as an identified commodity region (target region), as sketched below.
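A minimal sketch of that two-stage lookup, assuming the stored features are flat vectors and using cosine similarity; both thresholds and the database layout are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_target_region(target_global, target_local, reference_db,
                        first_threshold=0.8, second_threshold=0.85):
    """reference_db: iterable of dicts with 'global' and 'local' feature vectors.
    Stage 1 shortlists by global similarity; stage 2 confirms detail consistency."""
    candidates = [ref for ref in reference_db
                  if cosine_similarity(target_global, ref["global"]) >= first_threshold]
    best, best_score = None, second_threshold
    for ref in candidates:
        score = cosine_similarity(target_local, ref["local"])
        if score >= best_score:
            best, best_score = ref, score
    return best  # the matched reference region's entry, or None
```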
In step S208, when a target area whose image characteristics match the target image characteristics is specified from among the plurality of reference areas, item object information corresponding to the target item object included in the target area is acquired.
After a target region matching the candidate region is obtained, it may be determined that the item objects contained in the candidate region match the target item objects contained in the target region. Item object information corresponding to the target item object may be obtained, which may include, but is not limited to: text information, link information, or other information associated with the target item object.
Acquiring the item object information corresponding to the target item object may be done by extracting the item object information of the target item object, according to the item identifier corresponding to the target item object, from a database that stores the item object information of each item object.
In step S210, the candidate region is subjected to region tracking in the video frame sequence of the video to be detected, and occurrence information of the target object in the video to be detected is determined.
After acquiring the item object information corresponding to the target item object contained in the target area, the candidate area may be subjected to area tracking in the video frame sequence of the video to be detected, and occurrence information of the target item object in the video to be detected is determined, where the occurrence information is used to indicate information that the target item object appears in the video to be detected, and the information may include, but is not limited to, at least one of the following: time period information (time point location information), location information.
The time period (target time period) in which the target object appears in the video to be detected may be one time period or may be a plurality of time periods. In each video frame within the target time period, the position information of the target item object may be coordinate information of the target item object in the video frame (for example, coordinate information expressed by x, y coordinates), or may be area information of the target item object in the video frame (for example, a left half area, a right half area, and also, for example, a middle area, an upper left area, a lower left area, an upper right area, and a lower right area).
The position information of the target item object appearing in the video to be detected may be position information of a specific point (e.g., a center point) of the candidate region appearing in the video to be detected. When the position information is stored, the position information of the target object in each video frame in the target time period may be stored, or only the change of the position information of the target object in the video frame in the target time period may be stored.
For example, suppose the target item object appears in the video frames from the 5th to the 10th second of the video to be detected: in the frames of the 5th-7th seconds, its position coordinate is (x1, y1); in the frames of the 7th-9th seconds, it is (x2, y2); and in the frames of the 9th-10th seconds, it is (x3, y3). Then the time period of the target item object in the video to be detected is 5-10s, and its positions are: 5th-7th seconds, (x1, y1); 7th-9th seconds, (x2, y2) (or, alternatively, (x2-x1, y2-y1)); 9th-10th seconds, (x3, y3) (or, alternatively, (x3-x1, y3-y1), or (x3-x2, y3-y2)).
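The occurrence information of that example could be recorded in a structure like the following; every field name and the sample coordinates are illustrative placeholders.

```python
occurrence_info = {
    "item_id": "target_item_001",   # hypothetical identifier of the target item object
    "time_period_s": (5, 10),       # the item appears from the 5th to the 10th second
    "positions": [
        {"from_s": 5, "to_s": 7, "xy": (120, 340)},   # stands in for (x1, y1)
        {"from_s": 7, "to_s": 9, "xy": (150, 360)},   # (x2, y2), or a delta from (x1, y1)
        {"from_s": 9, "to_s": 10, "xy": (180, 355)},  # (x3, y3), or a delta from an earlier position
    ],
}
```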
As an alternative embodiment, performing region tracking on the candidate region in the video frame sequence of the video to be detected and determining the occurrence information of the target item object in the video to be detected includes: performing region detection, according to the candidate region, on the video frames before the key frame and the video frames after the key frame in the video to be detected, respectively, and determining the time period information and the position information of the target item object appearing in the video to be detected, where the occurrence information includes the time period information and the position information.
After the item object information corresponding to the target item object contained in the target region is acquired, the time point location information of the target item object appearing in the video to be detected can further be determined. This may be done by bidirectionally tracking the candidate region in the video frame sequence of the video to be detected and determining the time period information of the target item object appearing in the video to be detected.
The bidirectional tracking method may be multiple, for example, the method may perform region detection on a video frame before a key frame and a video frame after the key frame in the video to be detected, determine video frames in which candidate regions appear in a video frame sequence of the video to be detected, obtain time point location information of the candidate regions, and further determine time period information of the candidate regions. For another example, the region detection may be performed on a key frame before the current key frame and a key frame after the current key frame in the video to be detected, the key frame in which the candidate region appears in the video frame sequence of the video to be detected is determined, the time point location information of the candidate region is obtained, and the time period information of the candidate region is determined.
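A minimal sketch of the bidirectional variant using an OpenCV tracker (CSRT, from the opencv-contrib package, is an assumed choice); frames are assumed preloaded as a list, and tracking in each direction stops once the region is lost.

```python
import cv2

def track_candidate(frames, key_index, candidate_box):
    """Track `candidate_box` (x, y, w, h) from frames[key_index] both forward
    and backward; returns the sorted indices of frames containing the region."""
    found = {key_index}

    def run(sequence, index_of):
        tracker = cv2.TrackerCSRT_create()
        tracker.init(frames[key_index], candidate_box)
        for step, frame in enumerate(sequence):
            ok, _ = tracker.update(frame)
            if not ok:
                break  # region lost: stop tracking in this direction
            found.add(index_of(step))

    run(frames[key_index + 1:], lambda s: key_index + 1 + s)         # forward
    if key_index > 0:
        run(frames[key_index - 1::-1], lambda s: key_index - 1 - s)  # backward
    return sorted(found)
```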
Besides the time point location information, the position information of the candidate region in the video frame where the candidate region appears can be determined, and then the position information of the target object appearing in the video to be detected can be determined.
After time period information of a target object appearing in a video to be detected is acquired, control information can be added to the video to be detected, wherein the control information is used for controlling the object information to be displayed in a pop-up window mode when the video to be detected is played to a time period corresponding to the time period information.
After the time point location information and the position information of the target object appearing in the video to be detected are obtained, control information can be added into the video to be detected, wherein the control information is used for controlling that when the video to be detected is played to the time period corresponding to the time period information, the object information is displayed on the position corresponding to the position information in the video to be detected in a pop-up window mode (or other modes).
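The control information might take a form like the sketch below; the schema and the `show_popup` call on the player overlay are illustrative assumptions, not an interface defined by this application.

```python
control_info = {
    "trigger_s": (5, 10),     # play-time window from the time period information
    "anchor_xy": (120, 340),  # pop-up position from the position information
    "item_info": {            # hypothetical item object information
        "title": "Example item",
        "link": "https://example.com/item",  # placeholder purchase entry
    },
}

def on_playback_tick(current_s, overlay, info=control_info):
    """Show the pop-up while playback is inside the trigger window (sketch only)."""
    start_s, end_s = info["trigger_s"]
    if start_s <= current_s <= end_s:
        overlay.show_popup(info["anchor_xy"], info["item_info"])  # assumed overlay API
```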
According to the embodiment, the candidate region is subjected to region tracking in the video to be detected, and the time period information and the position information of the target object appearing in the video to be detected are determined, so that the object information is added conveniently, and the accuracy of adding the object information is improved.
The following describes the above-described information acquisition method with reference to an alternative example. The method may operate in a video server.
The information acquisition method in this example identifies commodity objects in a video. A reference commodity library is constructed in advance, and when a video is analyzed, the time points at which reference commodities appear and the positions of the commodity objects in the picture are identified automatically. In addition, to address the low recall rate of commodity object detection, commodity objects can be clustered based on shape information and a commodity shape detector trained, enabling fast detection of common commodity objects.
As shown in fig. 3, the information acquisition method in the present example may include the steps of:
step 1, training a commodity shape detector.
Firstly, clustering commodity objects based on shape information, abstracting common object shapes corresponding to daily consumer goods, then collecting commodity image data, and training a commodity object shape detector.
And 2, extracting the multi-level features of the commodity image and constructing a commodity feature library.
Each commodity image to be warehoused can be detected by using the generated commodity shape detector, and a commodity object area in the image is obtained. And extracting multi-level image features of the commodity region by using a general image classification model, wherein the multi-level image features comprise global image features and image detail features. And respectively storing the global image characteristics and the image detail characteristics to construct a commodity characteristic library.
And 3, extracting key frames of the video to be detected, detecting the shape of the commodity of each key frame image, and extracting a candidate region of the commodity object.
For the video to be processed, the keyframe calculation can be performed on the video to be processed, and the commodity object detection is performed on each keyframe by using the generated commodity shape detector to obtain the commodity object candidate area in the video to be processed.
And 4, extracting the multi-level appearance characteristics of the commodity extraction area by using the image characteristic expression model.
Similar to step 2, a general image classification model (image feature expression model) may be used to extract multi-level image features of the commodity object candidate region, including global image features and image detail features of the commodity object candidate region.
And 5, searching in the commodity feature library based on the high-level appearance features, returning candidate commodity objects in the database, confirming commodity style by using the low-level local appearance features, and confirming whether the candidate areas exist in the database.
And (4) searching and inquiring in the commodity database by using the global image characteristics obtained by calculation in the step (4), and taking the images in the database which meet the similarity requirement as commodity style candidates. And (4) confirming the style of the returned result (commodity style candidate) by using the image detail features obtained in the step (4), judging the consistency of the image detail features, and if the image detail features are consistent, taking the commodity object candidate area as an identified commodity area.
And 6, performing bidirectional tracking on the commodity objects in the database by using an object tracking technology to obtain time point location information of the commodity objects in the video.
The identified commodity region can be tracked bidirectionally in the video sequence to obtain the time point information of the commodity appearing in the video, which supports prompting and promoting the commodity information via a video pop-up window during that time period.
This method realizes commodity identification by combining parallel frame extraction, detection, identification and tracking; compared with common identification methods, it processes faster and its identification results are more consistent over time. Candidate commodity object regions are detected with a shape detector, so when a new daily consumer product is added, only an image needs to be added to update the commodity feature database, which markedly reduces the cost of retraining the commodity detection model. And a style-confirmation scheme of detection plus multi-level identification achieves high-precision identification of same-style commodities.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
According to another aspect of the embodiments of the present application, there is provided an information acquisition apparatus for implementing the above-described information acquisition method. Optionally, the apparatus is used to implement the above embodiments and preferred embodiments, and details are not repeated for what has been described. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a block diagram of an alternative information acquisition apparatus according to an embodiment of the present application, and as shown in fig. 4, the apparatus includes:
(1) a first extraction unit 402, configured to extract a key frame from a video to be detected;
(2) a first determining unit 404, connected to the first extracting unit 402, for determining a candidate region having a target shape feature in the frame image of the key frame;
(3) a second extracting unit 406, connected to the first determining unit 404, for extracting the target image feature of the candidate region;
(4) a first acquisition unit 408 connected to the second extraction unit 406, for acquiring item object information corresponding to a target item object contained in a target area in a case where the target area whose image features match the target image features is determined from a plurality of reference areas, each of which contains a reference item object having a target shape feature;
(5) the second determining unit 410 is connected to the first obtaining unit 408, and is configured to perform region tracking on the candidate region in the video frame sequence of the video to be detected, and determine occurrence information of the target object in the video to be detected.
Alternatively, the first extracting unit 402 may be used in step S202 in the foregoing embodiment, the first determining unit 404 may be used in step S204 in the foregoing embodiment, the second extracting unit 406 may be used in step S206 in the foregoing embodiment, the first obtaining unit 408 may be used in step S208 in the foregoing embodiment, and the second determining unit 410 may be used in step S210 in the foregoing embodiment.
By the embodiment, the method for matching the image information by using the candidate area with the target shape feature in the frame image of the key frame is adopted, and the object area (candidate area) of the article is positioned by the shape information, so that the problem that the recall rate of the object information of the article is low due to the complicated content background of the video program in the acquisition method of the object information in the related art is solved, and the recall rate of the object information of the article is improved.
As an alternative embodiment, the first extraction unit 402 includes:
(1) a first extraction module, configured to extract the key frames from the video to be detected at a target interval; or,
(2) a second extraction module, configured to extract, from each shot contained in the video to be detected, the key frame corresponding to that shot (both strategies are sketched below).
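For illustration only — this sketch is not part of the original disclosure — the two extraction strategies might look as follows in Python with OpenCV; the 25-frame interval, the Bhattacharyya histogram distance, and the 0.5 shot-boundary threshold are all illustrative assumptions:

```python
import cv2

def key_frames_by_interval(video_path, interval=25):
    # strategy 1: sample one key frame every `interval` frames
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            frames.append((idx, frame))
        idx += 1
    cap.release()
    return frames

def key_frames_by_shot(video_path, threshold=0.5):
    # strategy 2: take the first frame of each shot, approximating shot
    # boundaries by a colour-histogram distance between adjacent frames
    cap = cv2.VideoCapture(video_path)
    frames, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is None or cv2.compareHist(
                prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            frames.append((idx, frame))
        prev_hist, idx = hist, idx + 1
    cap.release()
    return frames
```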
As an alternative embodiment, the first determining unit 404 includes:
(1) a detection module, configured to detect the frame image of the key frame with a target shape detector to obtain the candidate region having the target shape feature in the frame image of the key frame, wherein the target shape detector is obtained by training an initial shape detector with training samples, each training sample being an image annotated with a training region containing the target shape feature (a stand-in detector is sketched below).
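The embodiment does not fix a detector architecture, only its contract: frame image in, candidate regions with the target shape feature out. As a hedged stand-in (the patent trains its own detector from annotated samples), a generic pretrained detector from torchvision illustrates that contract; the Faster R-CNN backbone and the 0.7 score threshold are assumptions:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# generic pretrained detector as a stand-in for the trained target shape detector
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def candidate_regions(frame_bgr, score_thresh=0.7):
    # frame_bgr: an OpenCV-style BGR numpy array; torchvision expects RGB in [0, 1]
    rgb = frame_bgr[:, :, ::-1].copy()
    with torch.no_grad():
        out = detector([to_tensor(rgb)])[0]
    keep = out["scores"] >= score_thresh
    return out["boxes"][keep].tolist()  # candidate boxes as (x1, y1, x2, y2)
```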
As an alternative embodiment, the apparatus further comprises:
(1) a third determining unit, configured to cluster, before the candidate region having the target shape feature is determined in the frame image of the key frame, the shape features of a plurality of item objects to determine the target shape feature, wherein the target shape feature comprises a plurality of sub-shape features;
(2) a training unit, configured to train the initial shape detector with a plurality of training samples to obtain the target shape detector, wherein each training sample contains a training item object having one of the sub-shape features, the degree of coincidence between a first region and a second region is greater than or equal to a target threshold, the first region being the region containing the sub-shape feature as detected by the target shape detector, and the second region being the annotated region containing the training item object (both ideas are sketched below).
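Two ideas above lend themselves to a short sketch: clustering item shapes into sub-shape classes, and the coincidence-degree test between a detected region and its annotation. The width/height descriptor, k-means with four clusters, and the 0.3 threshold are illustrative assumptions; the embodiment specifies neither the descriptor nor the clustering algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def sub_shape_clusters(box_sizes, n_clusters=4):
    # box_sizes: (N, 2) array of (width, height) of annotated item regions;
    # each cluster index then plays the role of one sub-shape feature
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.asarray(box_sizes))

def iou(a, b):
    # coincidence degree of two boxes (x1, y1, x2, y2) as intersection-over-union
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

# acceptance test in the spirit of the embodiment: a detection counts as
# correct only if it coincides with the annotated region at or above a threshold
assert iou((0, 0, 10, 10), (2, 2, 12, 12)) >= 0.3
```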
As an alternative embodiment, the apparatus further comprises:
(1) a detection unit, configured to detect, before the candidate region having the target shape feature is determined in the frame image of the key frame, each of a plurality of reference images with the target shape detector to obtain the plurality of reference areas, wherein each reference image contains at least one reference item object having the target shape feature (see the sketch below).
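This offline step amounts to pre-computing regions over the reference set. A minimal sketch, assuming `detect` is any detector callable (for instance the candidate_regions stand-in above) and that the item object information for each reference image is supplied externally:

```python
def build_reference_regions(reference_images, item_infos, detect):
    # reference_images and item_infos are parallel sequences; each detected
    # box is stored together with the item object information of its image
    db = []
    for image, info in zip(reference_images, item_infos):
        for box in detect(image):
            db.append({"box": box, "image": image, "item_info": info})
    return db
```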
As an alternative embodiment, the apparatus further comprises:
(1) a second acquisition unit, configured to acquire, after the target image features of the candidate region are extracted and when the target image features comprise a target global image feature and a target local image feature, candidate target regions whose global image features match the target global image feature from the plurality of reference areas;
(2) a fourth determining unit, configured to determine the target area from the candidate target regions by using the target local image feature, wherein the target area is the candidate target region whose local image feature matches the target local image feature (this coarse-to-fine matching is sketched below).
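The coarse-to-fine matching just described reduces to a shortlist-then-refine loop. A minimal sketch, assuming features are numpy vectors (spatial local features are flattened here for simplicity) and using cosine similarity with a 0.8 shortlist threshold, both illustrative assumptions:

```python
import numpy as np

def cosine(u, v):
    u, v = np.ravel(u), np.ravel(v)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def match_region(target_global, target_local, references, thresh=0.8):
    # references: iterable of (global_feat, local_feat, item_info) tuples;
    # global features shortlist candidates, local features pick the match
    shortlist = [r for r in references if cosine(target_global, r[0]) >= thresh]
    if not shortlist:
        return None
    best = max(shortlist, key=lambda r: cosine(target_local, r[1]))
    return best[2]  # item object information of the matched reference region
```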
As an alternative embodiment, the second extraction unit 406 includes:
(1) a third extraction module, configured to extract the target global image feature and the target local image feature of the candidate region by using a target convolutional neural network, wherein the target global image feature is the input of the fully connected layer of the target convolutional neural network and the target local image feature is the output of one convolutional layer in that network (see the sketch below).
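With a standard CNN the two features can be captured via forward hooks. The sketch below assumes ResNet-50 (from a recent torchvision) as a stand-in for the target convolutional neural network, with layer3 as the illustrative choice of convolutional layer; the input is assumed to be a normalized crop of the candidate region:

```python
import torch
import torchvision

model = torchvision.models.resnet50(weights="DEFAULT").eval()
captured = {}

# global feature: the vector fed into the fully connected layer
model.fc.register_forward_hook(
    lambda mod, inp, out: captured.update(global_feat=inp[0]))
# local feature: the output of one convolutional stage
model.layer3.register_forward_hook(
    lambda mod, inp, out: captured.update(local_feat=out))

def extract_features(region_tensor):
    # region_tensor: (1, 3, 224, 224) normalized crop of the candidate region
    with torch.no_grad():
        model(region_tensor)
    return captured["global_feat"], captured["local_feat"]
```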
As an alternative embodiment, the apparatus further comprises an adding unit, and the second determining unit 410 comprises a determining module, wherein:
(1) the determining module is configured to perform region detection, based on the candidate region, on the video frames preceding the key frame and the video frames following the key frame in the video to be detected, and to determine time period information and position information of the target item object appearing in the video to be detected, wherein the appearance information comprises the time period information and the position information;
(2) the adding unit is configured to add control information to the video to be detected after the region tracking of the candidate region in the video frame sequence has determined the appearance information of the target item object, wherein the control information is used to control the item object information to be displayed, in a pop-up window, at the position corresponding to the position information when the video to be detected is played to the time period corresponding to the time period information (a tracking sketch follows).
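The bidirectional tracking can be sketched as a scan outwards from the key frame. Template matching stands in for the unspecified tracker, and the 0.8 matching threshold is an assumption; the returned frame span and per-frame positions correspond to the time period information and position information above:

```python
import cv2

def track_appearance(frames, key_idx, box, min_score=0.8):
    # frames: list of BGR frames; box: integer (x1, y1, x2, y2) in the key frame
    x1, y1, x2, y2 = box
    template = frames[key_idx][y1:y2, x1:x2]
    positions = {key_idx: (x1, y1)}
    first = last = key_idx
    for step in (-1, 1):                      # scan backwards, then forwards
        idx = key_idx + step
        while 0 <= idx < len(frames):
            result = cv2.matchTemplate(frames[idx], template,
                                       cv2.TM_CCOEFF_NORMED)
            _, score, _, top_left = cv2.minMaxLoc(result)
            if score < min_score:             # region no longer present
                break
            positions[idx] = top_left
            first, last = min(first, idx), max(last, idx)
            idx += step
    return (first, last), positions

def control_info(time_span, positions, item_info):
    # pop-up instruction: when playback reaches the time period, display the
    # item object information at the stored position
    return {"frames": time_span, "positions": positions, "popup": item_info}
```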
It should be noted that the above modules may be implemented by software or by hardware. In the latter case this may be achieved, without limitation, in the following manner: all of the modules are located in the same processor, or the modules are distributed over different processors in any combination.
According to yet another aspect of the embodiments of the present application, a computer-readable storage medium is provided. Optionally, the storage medium stores a computer program configured, when executed, to perform the steps of any of the methods provided in the embodiments of the present application.
Optionally, in the present embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1, extracting key frames from the video to be detected;
S2, determining candidate regions having the target shape feature in the frame images of the key frames;
S3, extracting the target image features of the candidate region;
S4, when a target area whose image features match the target image features is determined from a plurality of reference areas, each of which contains a reference item object having the target shape feature, acquiring the item object information corresponding to the target item object contained in the target area;
S5, performing region tracking on the candidate region in the video frame sequence of the video to be detected, and determining the appearance information of the target item object in the video to be detected.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing a computer program, such as a USB flash disk, a ROM (Read-Only Memory), a RAM (Random Access Memory), a removable hard disk, a magnetic disk, or an optical disk.
According to still another aspect of an embodiment of the present application, an electronic apparatus is provided, comprising a processor (which may be the processor 102 in Fig. 1) and a memory (which may be the memory 104 in Fig. 1) in which a computer program is stored, the processor being configured to execute the computer program so as to perform the steps of any of the above methods provided in the embodiments of the present application.
Optionally, the electronic apparatus may further include a transmission device (which may be the transmission device 106 in Fig. 1) and an input/output device (which may be the input/output device 108 in Fig. 1), both connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute, by means of a computer program, the following steps:
S1, extracting key frames from the video to be detected;
S2, determining candidate regions having the target shape feature in the frame images of the key frames;
S3, extracting the target image features of the candidate region;
S4, when a target area whose image features match the target image features is determined from a plurality of reference areas, each of which contains a reference item object having the target shape feature, acquiring the item object information corresponding to the target item object contained in the target area;
S5, performing region tracking on the candidate region in the video frame sequence of the video to be detected, and determining the appearance information of the target item object in the video to be detected.
Optionally, for further examples of this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they may be stored in a storage device and executed by that device, and in some cases the steps shown or described may be performed in a different order than described herein. They may also be fabricated as individual integrated circuit modules, or several of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, or improvement made within the principle of the present invention shall fall within its protection scope.

Claims (11)

1. An information acquisition method, comprising:
extracting key frames from a video to be detected;
determining a candidate region with a target shape feature in a frame image of the key frame;
extracting target image features of the candidate region;
when a target area whose image features match the target image features is determined from a plurality of reference areas, acquiring the item object information corresponding to the target item object contained in the target area, wherein each of the plurality of reference areas contains a reference item object having the target shape feature;
and performing region tracking on the candidate region in the video frame sequence of the video to be detected, and determining appearance information of the target item object in the video to be detected.
2. The method of claim 1, wherein extracting the key frames from the video to be detected comprises:
extracting the key frames from the video to be detected according to a target interval; or,
and extracting the key frames corresponding to the shots from the shots contained in the video to be detected.
3. The method of claim 1, wherein determining the candidate region having the target shape feature in the frame image of the key frame comprises:
detecting the frame image of the key frame with a target shape detector to obtain the candidate region having the target shape feature in the frame image of the key frame, wherein the target shape detector is obtained by training an initial shape detector with training samples, each training sample being an image annotated with a training region containing the target shape feature.
4. The method of claim 3, wherein prior to determining the candidate region having the target shape feature in the frame image of the key frame, the method further comprises:
clustering the shape features of a plurality of item objects to determine the target shape feature, wherein the target shape feature comprises a plurality of sub-shape features;
and training an initial shape detector with a plurality of training samples to obtain the target shape detector, wherein each training sample contains a training item object having one of the sub-shape features, the degree of coincidence between a first region and a second region is greater than or equal to a target threshold, the first region being the region containing the sub-shape feature as detected by the target shape detector, and the second region being the annotated region containing the training item object.
5. The method of claim 3, wherein prior to determining the candidate region having the target shape feature in the frame image of the key frame, the method further comprises:
detecting each of a plurality of reference images with the target shape detector to obtain the plurality of reference areas, wherein each reference image contains at least one reference item object having the target shape feature.
6. The method of claim 1, wherein after extracting the target image feature of the candidate region, the method further comprises:
when the target image features comprise a target global image feature and a target local image feature, acquiring, from the plurality of reference areas, candidate target regions whose global image features match the target global image feature;
and determining the target area from the candidate target regions by using the target local image feature, wherein the target area is the candidate target region whose local image feature matches the target local image feature.
7. The method of claim 6, wherein extracting the target image feature of the candidate region comprises:
extracting the target global image feature of the candidate region and the target local image feature of the candidate region by using a target convolutional neural network, wherein the target global image feature is the input of a fully connected layer of the target convolutional neural network, and the target local image feature is the output of one convolutional layer in the target convolutional neural network.
8. The method according to any one of claims 1 to 7,
performing region tracking on the candidate region in the video frame sequence of the video to be detected and determining the appearance information of the target item object in the video to be detected comprises: performing region detection, based on the candidate region, on the video frames preceding the key frame and the video frames following the key frame in the video to be detected, and determining time period information and position information of the target item object appearing in the video to be detected, wherein the appearance information comprises the time period information and the position information;
and after performing region tracking on the candidate region in the video frame sequence of the video to be detected and determining the appearance information of the target item object in the video to be detected, the method further comprises: adding control information to the video to be detected, wherein the control information is used to control the item object information to be displayed, in a pop-up window, at the position corresponding to the position information when the video to be detected is played to the time period corresponding to the time period information.
9. An information acquisition apparatus, comprising:
the first extraction unit is used for extracting key frames from a video to be detected;
a first determination unit, configured to determine a candidate region having a target shape feature in a frame image of the key frame;
a second extraction unit configured to extract a target image feature of the candidate region;
a first acquisition unit, configured to acquire, when a target area whose image features match the target image features is determined from a plurality of reference areas, the item object information corresponding to the target item object contained in the target area, wherein each of the plurality of reference areas contains a reference item object having the target shape feature;
and a second determining unit, configured to perform region tracking on the candidate region in the video frame sequence of the video to be detected and determine the appearance information of the target item object in the video to be detected.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 8 when executed.
11. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 8 by means of the computer program.
CN201911239753.7A 2019-12-05 2019-12-05 Information acquisition method and device, storage medium and electronic device Pending CN111126457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911239753.7A CN111126457A (en) 2019-12-05 2019-12-05 Information acquisition method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN111126457A true CN111126457A (en) 2020-05-08

Family

ID=70496263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911239753.7A Pending CN111126457A (en) 2019-12-05 2019-12-05 Information acquisition method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN111126457A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2575077A2 (en) * 2011-09-29 2013-04-03 Ricoh Company, Ltd. Road sign detecting method and road sign detecting apparatus
US20150302027A1 (en) * 2014-02-14 2015-10-22 Nant Holdings Ip, Llc Object ingestion through canonical shapes, systems and methods
CN104318782A (en) * 2014-10-31 2015-01-28 浙江力石科技股份有限公司 Expressway video speed measuring method and system for zone overlapping
CN104715023A (en) * 2015-03-02 2015-06-17 北京奇艺世纪科技有限公司 Commodity recommendation method and system based on video content

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112235650A (en) * 2020-10-19 2021-01-15 广州酷狗计算机科技有限公司 Video processing method, device, terminal and storage medium
CN112767348A (en) * 2021-01-18 2021-05-07 上海明略人工智能(集团)有限公司 Method and device for determining detection information
CN112767348B (en) * 2021-01-18 2023-11-24 上海明略人工智能(集团)有限公司 Method and device for determining detection information

Similar Documents

Publication Publication Date Title
CN106776619B (en) Method and device for determining attribute information of target object
CN110019896B (en) Image retrieval method and device and electronic equipment
CN109284729B (en) Method, device and medium for acquiring face recognition model training data based on video
CN107193962B (en) Intelligent map matching method and device for Internet promotion information
WO2019001481A1 (en) Vehicle appearance feature identification and vehicle search method and apparatus, storage medium, and electronic device
US11475500B2 (en) Device and method for item recommendation based on visual elements
CN107590267B (en) Information-pushing method and device, terminal and readable storage medium storing program for executing based on picture
CN105117399B (en) Image searching method and device
CN107679449A (en) Lip motion method for catching, device and storage medium
CN108492160A (en) Information recommendation method and device
CN111767420A (en) Method and device for generating clothing matching data
CN111488385A (en) Data processing method and device based on artificial intelligence and computer equipment
CN111126457A (en) Information acquisition method and device, storage medium and electronic device
CN113963303A (en) Image processing method, video recognition method, device, equipment and storage medium
CN113657087B (en) Information matching method and device
CN110232133B (en) Clothing image retrieval method and system based on feature fusion and style classification
CN106407281B (en) Image retrieval method and device
CN114168768A (en) Image retrieval method and related equipment
CN110929057A (en) Image processing method, device and system, storage medium and electronic device
CN112115354A (en) Information processing method, information processing apparatus, server, and storage medium
CN110895555B (en) Data retrieval method and device, storage medium and electronic device
CN116703503A (en) Intelligent recommendation method and system for campus canteen dishes
CN115983873A (en) Big data based user data analysis management system and method
CN115049962A (en) Video clothing detection method, device and equipment
CN111008210B (en) Commodity identification method, commodity identification device, codec and storage device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination