CN110378215A - Purchase analysis method based on first person shopping video - Google Patents
Purchase analysis method based on first person shopping video
- Publication number
- CN110378215A (application CN201910508074.9A)
- Authority
- CN
- China
- Prior art keywords
- commodity
- video
- shopping
- frames
- local
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Business, Economics & Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Strategic Management (AREA)
- Development Economics (AREA)
- Multimedia (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Entrepreneurship & Innovation (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Marketing (AREA)
- Game Theory and Decision Science (AREA)
- General Business, Economics & Management (AREA)
- Economics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to the technical field of artificial intelligence applications, and in particular to a shopping analysis method based on first-person-perspective shopping video. The method specifically includes: dividing a complete shopping video into a plurality of video clips; extracting N image frames from each video clip; analyzing the extracted image frames to obtain the shopping action type corresponding to each video clip; identifying, according to the obtained shopping action types, the commodities in the video clips whose shopping action type matches a preset shopping action type; and establishing a correspondence between each identified commodity and its shopping action type. Because the invention performs comprehensive consumption analysis on the consumer's first-person shopping video, compared with picture-based analysis methods it relieves the consumer of the burden of shooting and uploading pictures, can analyze the entire shopping process, and obtains a complete consumption record.
Description
Technical Field
The invention relates to the technical field of artificial intelligence applications, and in particular to a shopping analysis method based on a first-person-perspective shopping video.
Background
The analysis and recording of consumer shopping form the basis for analyzing consumer preferences, identifying the key factors that influence purchases, and making targeted recommendations that help consumers shop. They are important for intelligent in-store services and for improving consumers' quality of life, and have great commercial value.
Fig. 1 illustrates a conventional consumption analysis. Such methods detect the regions containing commodities in pictures uploaded by users, extract image features for each region, extract features for the commodity pictures in a database, and compare the features of each region of the shot picture with the features of the database pictures to obtain the final commodity identification result. This approach relies on the user manually shooting and uploading pictures; it is inefficient, cumbersome to operate, and makes comprehensive consumption analysis difficult to obtain. Once the user forgets to shoot or upload, no comprehensive shopping record can be obtained. Moreover, the commodity-user relationship obtained in this way is single-faceted, and the rich consumer behavior in the whole shopping process cannot be exploited to establish multiple kinds of commodity-user correlations.
Disclosure of Invention
The embodiment of the invention provides a shopping analysis method based on a first-person-perspective shopping video, which performs comprehensive consumption analysis on a consumer's shopping video recorded from the first-person perspective.
According to a first aspect of an embodiment of the present invention, a shopping analysis method based on a first person perspective shopping video specifically includes:
dividing the complete shopping video into a plurality of video segments;
extracting N image frames from each video segment, wherein N is an integer greater than 1;
analyzing the extracted image frames to obtain the shopping action type corresponding to the video segment; and
Identifying commodities corresponding to the video clips with preset shopping action types according to the obtained shopping action types corresponding to the video clips;
and establishing a corresponding relation between the identified commodities and the shopping action types corresponding to the commodities.
The preset shopping action types comprise a picking action and a purchasing action; and
establishing a corresponding relation between the identified commodity and the shopping action type corresponding to the commodity, specifically comprising:
determining a plurality of first commodities identified in a video clip corresponding to the purchasing action as shopping records;
and determining the first commodities identified in the video clip corresponding to the picking action as records of commodities of interest to the user.
Analyzing the extracted image frames to obtain the shopping action type corresponding to the video segment specifically comprises:
analyzing the extracted image frames by using a non-local neural network to obtain the shopping action type corresponding to the video segment.
Identifying the commodity corresponding to the video clip with the preset shopping action type specifically comprises the following steps:
inputting a video clip corresponding to a preset shopping action type into a classification network to obtain a commodity type contained in the video clip, wherein the commodity type comprises a food material type or a non-food material type;
for food material commodities, identifying commodities of key frames by using a multi-classification model;
and for non-food material commodities, searching the non-food material commodities in the key frame by using a multi-object searching method.
The basic network of the non-local neural network is ResNet50, ResNet50 is converted into a 3D ResNet50 network, and a non-local block is inserted at the end of the first three blocks of the 3D ResNet50 network.
The food material class commodity identification method comprises the following substeps:
2.a.1, extracting key frames from the image frames of the video clip;
2.a.2, sequentially inputting the key frames into a pre-trained spatial regularization network to obtain each frame's prediction score on each food material category;
2.a.3, adding the corresponding category scores of all key frames and dividing the sum by the number of key frames to obtain the video clip's prediction score on each food material category.
The non-food material commodity identification method specifically comprises the following substeps:
2.b.1, extracting key frames from the image frames of the video clip;
2.b.2, preprocessing: training a Fast R-CNN network with the publicly available commodity data set RPC; the RPC data set contains many commodity pictures, each annotated with several detection boxes bbox, and all detection boxes bbox are given the uniform label 'commodity'; a commodity image library is also constructed, containing many commodity images, each showing a single commodity on a clean background, and for all pictures in the commodity library, features are extracted and an index is built with the compact visual search technique;
2.b.3, for each key frame, detecting commodity regions with the trained Fast R-CNN to generate several detection boxes bbox and their prediction scores, and retaining the detection boxes bbox whose prediction score is greater than 0.5;
2.b.4, for each key frame, cropping the image with the retained detection boxes bbox to generate a plurality of local images;
2.b.5, for each key frame cropped into a plurality of local images, extracting features from each local image with the compact visual search technique and retrieving related commodities from the commodity library using the index built over it, so that each local image obtains a related-commodity list ordered from high to low correlation;
2.b.6, for the plurality of key frames of a video clip, each key frame having a plurality of local images and each local image having a related-commodity list, fusing these lists according to the prediction scores of the local images to obtain the related-commodity list of the video clip.
The spatial regularization network of said step 2.a.2 comprises the following:
the key frames are sequentially input into ResNet50, which provides a coarse class prediction ŷ_cls and preliminary features f_cls;
the preliminary features f_cls are input into a spatial regularization module, which generates two feature maps: an attention feature map f_att and a confidence feature map f_cof;
f_att is then re-weighted by f_cof, and a series of convolutional layers outputs a refined prediction ŷ_sr; applying a linear transformation to f_att also yields a rough prediction ŷ_att;
the predicted value is obtained by combining ŷ_cls with ŷ_sr, and this combined value is used as the prediction score in the application.
The feature extraction of the compact visual search technology in the step 2.b.2 and the step 2.b.5 comprises interest point detection, local feature selection, local feature description, local feature compression, local feature position compression and local feature aggregation.
In the first step, for the inserted non-local block, the output at position i is y_i = (1/C(x)) * Σ_j f(x_i, x_j) g(x_j), wherein x_i is the input at position i, x_j is the input at position j, C(x) is a normalization factor, f(x_i, x_j) is a scalar reflecting the relationship between the two positions, and g(x_j) = W_g x_j, wherein W_g is a learnable weight matrix.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
1. Compared with picture-based analysis methods, the method relieves the consumer of the burden of shooting and uploading, can comprehensively analyze the whole shopping process, and obtains a complete consumption record.
2. For the problem that store merchandise changes over time, the method reduces the model changes required when commodity categories change. For food material commodities, although origin and manufacturer may differ, the food material types are the same, and food material commodities from new manufacturers still belong to the original food material types, so the model remains unchanged. For non-food material commodities, an individual-level identification model must be established according to differences in manufacturer and attributes, and introducing new commodities brings new commodity categories; the method therefore uses a compact retrieval technique, so that as the commodities in the store change, related commodities can still be found simply by adding clean-background pictures of the new commodities to the commodity library, without changing the retrieval model. Other methods do not consider the problem of model change and process food material and non-food material commodities uniformly.
3. The method deeply mines the whole shopping video and establishes multiple kinds of 'commodity-user' relationships; compared with the traditional single 'purchase' relationship, it can provide richer consumption information and build a complete consumer portrait.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic flow diagram of a conventional shopping analysis method;
FIG. 2 is a schematic diagram of a shopping analysis method based on a first-person perspective shopping video according to an embodiment of the present invention;
FIG. 3 is a flowchart of a shopping analysis method based on a first-person perspective shopping video according to an embodiment of the present invention;
FIG. 4 is a flow chart of the prediction scores of the local graph of the non-food material commodity according to the present invention;
FIG. 5 is a flow diagram of the compact visual search used in feature extraction.
Detailed Description
Figure 2 shows a schematic flow diagram of the present invention. As shown in fig. 2 and 3, the present invention provides a shopping analysis method based on a first-person perspective shopping video, including:
for a complete shopping video, dividing the video into a plurality of video segments;
and selecting N image frames from each video clip at equal time intervals, and classifying the video clips for shopping actions.
For a complete shopping video of a user in a shop, based on different consumer behavior definitions, the shopping video is divided into a plurality of video segments at equal time intervals, and N image frames are extracted from each video segment, wherein N is a positive integer. Preferably, a two-second video segment is taken every two seconds, and 16 equally spaced frames are extracted from the segment for action prediction.
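A minimal sketch of this clip and frame sampling, assuming OpenCV is available and using a hypothetical input file name, could look as follows:

```python
# Illustrative sketch (not the patented implementation): split a shopping video into
# two-second clips and sample 16 equally spaced frames from each clip.
# The input file name is a hypothetical placeholder.
import cv2
import numpy as np

def sample_clips(video_path, clip_seconds=2.0, frames_per_clip=16):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    clip_len = max(int(round(clip_seconds * fps)), frames_per_clip)
    clips = []
    for start in range(0, total - clip_len + 1, clip_len):
        # 16 equally spaced frame indices inside this clip
        idxs = np.linspace(start, start + clip_len - 1, frames_per_clip).astype(int)
        frames = []
        for i in idxs:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        if len(frames) == frames_per_clip:
            clips.append(np.stack(frames))        # one clip: (16, H, W, 3)
    cap.release()
    return clips

clips = sample_clips("shopping_video.mp4")        # hypothetical first-person video file
print(len(clips), "clips sampled")
```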
Preferably, to classify the shopping action of a video segment, the N image frames of the segment are input into a pre-trained non-local neural network to obtain the segment's prediction score for each shopping action, and the shopping action with the highest score is taken as the action category of the video segment.
the method comprises the steps of pre-training a non-local neural network into collected videos, dividing the videos into video segments, manually marking category labels, dividing the videos into frames, making the frames into matrixes, inputting the matrixes into the non-local neural network, outputting a fractional vector by the non-local neural network, calculating loss by using a cross entropy loss function on the vectors and the real category labels, and updating network parameters by using a back propagation mode.
Shopping videos are first input into a shopping behavior classification model to obtain action segments for different consumer behaviors. Because a first-person shopping video only records scene changes, the consumer's own behavior is not directly visible, so the action category is difficult to estimate from the video. Furthermore, the shopping action data show large inter-class similarity, since the background in the video is always a store and the differences between shopping actions are small. The classification model should therefore focus on the variation and correlation between frames to find category-discriminating appearance. In this system, we use a non-local neural network for shopping behavior classification.
Preferably, the basic network of the non-local neural network may be ResNet50. To use it on video data, ResNet50 is converted into a 3D ResNet50 network, i.e. all convolutional layers are converted into 3D convolutions, and a non-local block is inserted at the end of each of the first three blocks of the 3D ResNet50 network, i.e. at the outputs of activation_59, activation_71 and activation_89.
Non-local neural networks use non-local blocks to capture the spatial, temporal and spatiotemporal dependencies of data.
Preferably, for an inserted non-local block, the output at position i is treated as a normalized linear combination of the depth information at all positions of the input, i.e. y_i = (1/C(x)) * Σ_j f(x_i, x_j) g(x_j). The linear coefficient f(x_i, x_j) is a scalar reflecting the relationship between the two positions, g(x_j) contains the depth information of the input at position j, and C(x) is a normalization factor. Here x_i is the input at position i, x_j is the input at position j, and g(x_j) is a linear transformation W_g x_j, where W_g is a learnable weight matrix. The non-local block can thus process messages over all input positions; using this network, the classification model can discover changes in information flow between frames.
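A minimal sketch of such a non-local block, assuming PyTorch and following the standard embedded-Gaussian design (the softmax plays the role of the normalization factor C(x)); this is an illustrative implementation, not necessarily the exact block used in the patent:

```python
# Illustrative sketch of a 3D non-local block: the output at each space-time position is a
# normalized weighted sum over all other positions, added back to the input as a residual.
import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv3d(channels, inner, kernel_size=1)   # embeds x_i
        self.phi = nn.Conv3d(channels, inner, kernel_size=1)     # embeds x_j
        self.g = nn.Conv3d(channels, inner, kernel_size=1)       # g(x_j) = W_g x_j
        self.out = nn.Conv3d(inner, channels, kernel_size=1)     # projects back to C channels

    def forward(self, x):                       # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        n = t * h * w
        theta = self.theta(x).reshape(b, -1, n).transpose(1, 2)  # (B, N, C')
        phi = self.phi(x).reshape(b, -1, n)                      # (B, C', N)
        g = self.g(x).reshape(b, -1, n).transpose(1, 2)          # (B, N, C')
        # f(x_i, x_j) followed by normalization (softmax over j plays the role of 1/C(x))
        attn = torch.softmax(theta @ phi, dim=-1)                # (B, N, N)
        y = (attn @ g).transpose(1, 2).reshape(b, -1, t, h, w)   # y_i = sum_j f * g
        return x + self.out(y)                                   # residual connection

block = NonLocalBlock3D(64)
video_feat = torch.randn(2, 64, 4, 8, 8)        # dummy feature map (B, C, T, H, W)
print(block(video_feat).shape)                  # torch.Size([2, 64, 4, 8, 8])
```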
The classification for video actions is shown in table 1.
TABLE 1 Classification of video actions
The video clips whose shopping action belongs to picking or purchasing are input into a classification network, and the commodities in the video clips are distinguished into food materials or non-food materials;
for food material commodities, a multi-classification model is used to identify the food material categories in the key frames of the video clip;
for non-food material commodities, because their varieties are numerous and keep growing, a multi-object retrieval method is used to retrieve the non-food material commodities in the key frames of the video clip.
After dividing the video into a plurality of action segments, we perform video content analysis on the picking and purchasing action segments to obtain consumer shopping records, because these segments contain information about the goods liked and purchased by the user. The commodities include food material and non-food material commodities, and two visual analysis models are used for the two types.
Preferably, we first use a ResNet50 classification network on the key frames of the input video segment to distinguish food material frames from non-food material frames. The frames are then input into the corresponding classification or retrieval model.
For food categories, such as vegetables and meat, a multi-classification model is employed because while they may have different growth areas, the categories are limited and fixed. The method specifically comprises the following substeps:
2.a.1, extracting key frames from the image frames of the video clip using ffmpeg;
2.a.2, sequentially inputting the key frames into a pre-trained Spatial Regularization Network (SRN) to obtain each frame's prediction score on each food material category;
2.a.3, adding the corresponding category scores of all key frames and dividing the sum by the number of key frames to obtain the video clip's prediction score on each food material category.
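A minimal sketch of this per-clip score aggregation, assuming NumPy and dummy per-frame scores in place of the SRN outputs:

```python
# Illustrative sketch (assumed shapes and dummy values): average the per-key-frame
# category scores to obtain the clip-level prediction score for each food material category.
import numpy as np

# frame_scores[k][c] = prediction score of key frame k for food material category c,
# e.g. as produced by the spatial regularization network.
frame_scores = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
])

clip_scores = frame_scores.sum(axis=0) / len(frame_scores)   # add, then divide by the number of key frames
print(clip_scores)                                           # clip-level score per category
```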
The environment of a store is complex: reflections, color changes and similar problems appear in the captured frames, and food materials are often cut and packaged in the store. A Spatial Regularization Network (SRN) is therefore used as the multi-classification model: it concentrates on class-relevant regions to find fine-grained features, while locally compensating for the reflection and color-change problems of the picture.
The SRN consists of two parts: a feature extraction module and a spatial regularization module. The feature extraction module uses ResNet50 to provide a coarse class prediction ŷ_cls and preliminary features f_cls.
The spatial regularization module takes the preliminary features f_cls as input and first generates two feature maps: an attention feature map f_att and a confidence feature map f_cof. f_att is then re-weighted by f_cof, and a series of convolutional layers outputs a refined prediction ŷ_sr; applying a linear transformation to f_att also yields a rough prediction ŷ_att. This mechanism greatly improves performance, because the attention feature map highlights the important regions of each class to discover subtle class features, while the confidence feature map adjusts f_att to compensate for problems such as reflection and color change.
Preferably, during training the model is optimized with a cross-entropy loss, and the predicted value obtained by combining ŷ_cls with ŷ_sr is used as the prediction score in the application.
For non-food material commodities, considering their category diversity and ever-increasing number, a retrieval technique is adopted to keep the method usable as the data expand: the system only needs to update the commodity database incrementally, without retraining new models.
Identification of non-food material commodities specifically comprises the following substeps:
2.b.1, extracting key frames from the image frames of the video clip using ffmpeg;
2.b.2, preprocessing: a Fast R-CNN network is trained with the publicly available commodity data set RPC, finally achieving a 97.6% detection result on that data set. The RPC data set contains many commodity pictures; each picture is annotated with several detection boxes (bbox) around its commodity regions, and each detection box carries a commodity category label. During training we ignore the attached category labels and give all bboxes the uniform label 'commodity'. A commodity image library is also constructed; it contains many commodity images, each showing a single commodity on a clean background. For all pictures in the commodity library, features are extracted and an index is built with the compact visual search technique.
2.b.3, for each key frame, the trained Fast R-CNN is used for commodity region detection, generating multiple bboxes and their prediction scores (between 0 and 1, indicating how likely each bbox contains a commodity); only bboxes with a prediction score greater than 0.5 are retained.
2.b.4, for each key frame, the image is cropped with the retained bboxes, generating a plurality of local images.
2.b.5, for each key frame cropped into a plurality of local images, features are extracted from each local image with the compact visual search technique, and the index built over the commodity library is used to retrieve related commodities from the library, giving each local image a related-commodity list ordered from high to low correlation.
2.b.6, for the plurality of key frames of a video clip, each key frame has a plurality of local images and each local image has a related-commodity list; the local images are ordered from top to bottom by their prediction scores. The result is shown in fig. 4, where the circles represent commodity retrieval lists; within one horizontal row the commodities are unlikely to repeat, but a vertical column may contain repeats, because the detections of the local images do not affect one another.
The results of a key frame are first fused. Suppose there are k local images B_1 to B_k, ordered from high to low prediction score; for local image B_i, its top 30 commodities c_{i,1}, ..., c_{i,30} are taken, ordered from high to low correlation. When merging, a list L is maintained: first the top commodity of each of B_1 to B_k, i.e. c_{1,1}, ..., c_{k,1}, is added to L in sequence, skipping any commodity already in L; then the second commodity of each of B_1 to B_k, i.e. c_{1,2}, ..., c_{k,2}, is added in sequence; and so on, until L contains 30 commodities. Thus each key frame has a list L of length 30.
The results of all key frames are then fused. Suppose there are t key frames F_1 to F_t, ordered from high to low; for key frame F_i, its list L_i contains commodities d_{i,1}, ..., d_{i,30}, ordered from high to low correlation. During fusion, a list E is maintained: first the top commodities of F_1 to F_t, i.e. d_{1,1}, ..., d_{t,1}, are counted (the number of commodity categories and the number of occurrences of each category), and the categories are added to E in order from high to low; then the second commodities d_{1,2}, ..., d_{t,2} of F_1 to F_t are added to E, skipping any that are already present; and so on, until E contains 30 commodities.
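A minimal sketch of this round-robin style fusion, assuming simple Python lists of commodity identifiers (the occurrence-count ordering used in the cross-frame step is simplified to the same round-robin scheme here):

```python
# Illustrative sketch (assumed data layout): round-robin fusion of ranked retrieval lists,
# first across the local images of one key frame, then across the key frames of a clip.
def round_robin_merge(ranked_lists, limit=30):
    """Each element of ranked_lists is a list of commodity ids ordered by decreasing relevance."""
    merged = []
    depth = max((len(lst) for lst in ranked_lists), default=0)
    for rank in range(depth):                     # take rank 1 of every list, then rank 2, ...
        for lst in ranked_lists:
            if rank < len(lst) and lst[rank] not in merged:
                merged.append(lst[rank])          # skip commodities already in the merged list
            if len(merged) >= limit:
                return merged
    return merged

# Per-local-image lists of one key frame (dummy commodity ids), ordered by relevance:
local_lists = [["cola", "soda", "juice"], ["chips", "cola", "crackers"]]
frame_list = round_robin_merge(local_lists, limit=30)

# The same scheme is then applied across the per-key-frame lists of the clip:
clip_list = round_robin_merge([frame_list, ["cola", "candy", "chips"]], limit=30)
print(clip_list)
```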
We use a multi-commodity retrieval method to obtain the products purchased or liked by consumers. To obtain more accurate retrieval results, we first use the commodity position detection model to cut the image into several regions that may contain commodities, which increases the computation requirement and time. In addition, ultra-fine-grained commodity retrieval, such as distinguishing different flavors of the same potato-chip brand, faces small inter-class differences (for example in the text and texture of the commodity package). To address both of these issues, the compact visual search technique is used to retrieve products; it focuses more on local features and makes retrieval more efficient. The commodity regions are detected and cropped before the compact visual search technique is applied.
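A minimal sketch of the detection-and-crop stage, assuming torchvision (>= 0.13) and using its generic pretrained Faster R-CNN only as a stand-in for the Fast R-CNN trained on the RPC data set described above:

```python
# Illustrative sketch (assumed torchvision >= 0.13): detect candidate commodity regions in a
# key frame, keep boxes scoring above 0.5, and crop local images.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")   # stand-in for the RPC-trained detector
model.eval()

key_frame = torch.rand(3, 480, 640)                  # dummy key frame, RGB values in [0, 1]
with torch.no_grad():
    pred = model([key_frame])[0]                     # dict with 'boxes', 'scores', 'labels'

local_images = []
for box, score in zip(pred["boxes"], pred["scores"]):
    if score > 0.5:                                  # keep confident detections only
        x1, y1, x2, y2 = box.int().tolist()
        local_images.append(key_frame[:, y1:y2, x1:x2])   # cropped local image
print(len(local_images), "local images cropped")
```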
fig. 5 shows a schematic diagram of a feature extraction flow of the compact visual search technique.
The feature extraction of the compact visual search technique in steps 2.b.2 and 2.b.5 can be divided into 6 parts: interest point detection, local feature selection, local feature description, local feature compression, local feature position compression, and local feature aggregation. A block-based frequency-domain Laplacian of Gaussian (BFLoG) method is integrated with an ALP detector as the interest point detection method. Features are ranked by computing their relevance, and a fixed number of local features are selected. A SIFT descriptor is used as the feature descriptor. For local feature compression, a low-complexity transform coding scheme is adopted: a small linear transform is applied to the 8 values of each individual spatial bin of the SIFT descriptor, and only a subset of the transformed descriptor elements is included in the bitstream. The local feature positions are compressed with a histogram coding scheme, where the location data are represented as a spatial histogram consisting of a binary map and a set of histogram counts. For aggregation, a scalable compressed Fisher vector is used: a subset of Gaussian components from the Gaussian mixture model is selected based on the total feature data budget, and only the information in the selected components is retained; local feature aggregation selects a different set of components for each image based on the concentration of energy in the Fisher vector.
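A minimal sketch of the first three stages (interest point detection, selection of a fixed number of local features, and SIFT description), assuming OpenCV with SIFT support; the BFLoG/ALP detector, the descriptor and position compression, and the Fisher-vector aggregation of the full pipeline are not shown:

```python
# Illustrative sketch (assumed OpenCV build with SIFT support): detect interest points,
# keep a fixed number of the strongest ones, and compute SIFT descriptors on a dummy local image.
import cv2
import numpy as np

image = np.full((240, 320, 3), 255, dtype=np.uint8)        # dummy local image (white)
cv2.circle(image, (160, 120), 40, (0, 0, 0), 3)            # add some structure to detect
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

sift = cv2.SIFT_create()
keypoints = sift.detect(gray, None)                         # interest point detection
keypoints = sorted(keypoints, key=lambda k: k.response, reverse=True)[:300]   # fixed-size selection
keypoints, descriptors = sift.compute(gray, keypoints)      # 128-dimensional SIFT descriptors
print(len(keypoints), None if descriptors is None else descriptors.shape)
```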
In step 2.b.2, the index is built with the MBIT retrieval technique, which computes Hamming distances very quickly for long binary global descriptors. MBIT reduces the exhaustive distance computation between features to an aligned, component-to-component independent matching problem and constructs multiple hash tables for these components. Given a query descriptor, the candidate data are retrieved by using each binary sub-vector (i.e. component) of the query as an index into its corresponding hash table, which significantly reduces the number of candidate images needed for the subsequent linear search.
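A minimal sketch of this multi-index idea, assuming toy binary descriptors; it follows the spirit of component-wise hashing with an exact Hamming check on the shortlist, not the exact MBIT algorithm:

```python
# Illustrative sketch: split binary descriptors into aligned components, hash each component into
# its own table, shortlist candidates by component matches, then do an exact Hamming check.
from collections import defaultdict

def split(bits, parts=4):
    step = len(bits) // parts
    return [bits[i * step:(i + 1) * step] for i in range(parts)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

database = {                                       # dummy binary global descriptors
    "item_a": "1010110010100110",
    "item_b": "1110000011110000",
    "item_c": "1010110010111111",
}

tables = [defaultdict(set) for _ in range(4)]      # one hash table per component
for name, bits in database.items():
    for table, comp in zip(tables, split(bits)):
        table[comp].add(name)

query = "1010110010100111"
candidates = set()
for table, comp in zip(tables, split(query)):      # shortlist via component-to-component matches
    candidates |= table[comp]

ranked = sorted(candidates, key=lambda n: hamming(database[n], query))
print(ranked)                                      # exact Hamming distance only on the shortlist
```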
For video clips of the purchasing action, the top-1 predicted food material category (for food materials) and the top-1 retrieval result (for non-food materials) are used as the record of commodities purchased by the user; for video clips of the picking action, the food material categories with the top-3 prediction scores and the top-3 retrieval results are used as the record of commodities the user is interested in.
The final consumer shopping record is composed of the commodities purchased by the user and the commodities the user is interested in: the purchased commodities are the top-1 food material category from food material classification and the top-1 commodity from non-food material retrieval on the purchasing-action video clips, and the commodities of interest are the top-3 food material categories from food material classification and the top-3 commodities from non-food material retrieval on the picking-action video clips.
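A minimal sketch of assembling this final record, assuming per-clip action labels and ranked commodity lists as produced by the earlier steps (all names and values here are illustrative):

```python
# Illustrative sketch (dummy data): assemble the final consumer record, taking the top-1
# prediction on purchasing-action clips as purchased commodities and the top-3 predictions
# on picking-action clips as commodities of interest.
clip_results = [
    {"action": "purchase", "ranked_items": ["milk", "yogurt", "cheese"]},
    {"action": "pick",     "ranked_items": ["chips", "cola", "crackers"]},
    {"action": "walk",     "ranked_items": []},    # clips of other actions are ignored
]

record = {"purchased": [], "interested": []}
for clip in clip_results:
    if clip["action"] == "purchase" and clip["ranked_items"]:
        record["purchased"].append(clip["ranked_items"][0])        # top-1
    elif clip["action"] == "pick":
        record["interested"].extend(clip["ranked_items"][:3])      # top-3

print(record)
```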
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.
Claims (10)
1. A shopping analysis method based on a first person perspective shopping video is characterized by specifically comprising the following steps:
dividing the complete shopping video into a plurality of video segments;
extracting N image frames from each video segment, wherein N is an integer greater than 1;
analyzing the extracted image frames to obtain a shopping action type corresponding to the video segment; and
Identifying commodities corresponding to the video clips with preset shopping action types according to the obtained shopping action types corresponding to the video clips;
and establishing a corresponding relation between the identified commodities and the shopping action types corresponding to the commodities.
2. The method of claim 1, wherein the preset shopping action types include a picking action and a purchasing action; and
establishing a corresponding relation between the identified commodity and the shopping action type corresponding to the commodity, specifically comprising:
determining a plurality of first commodities identified in a video clip corresponding to the purchasing action as shopping records;
and determining the first commodities identified in the video clip corresponding to the picking action as records of commodities of interest to the user.
3. The method of claim 1, wherein analyzing the extracted image frames to obtain the shopping action types corresponding to the video segments specifically comprises:
analyzing the extracted image frames by using a non-local neural network to obtain the shopping action type corresponding to the video segment.
4. The method of claim 1, wherein identifying the commodity corresponding to the video segment of the preset shopping action type specifically comprises:
inputting a video clip corresponding to a preset shopping action type into a classification network to obtain a commodity type contained in the video clip, wherein the commodity type comprises a food material type or a non-food material type;
for food material commodities, identifying commodities of key frames by using a multi-classification model;
and for non-food material commodities, searching the non-food material commodities in the key frame by using a multi-object searching method.
5. The method of claim 3 or 4, wherein the base network of the non-local neural network is ResNet50, ResNet50 is converted to a 3D ResNet50 network, and a non-local block is inserted at the end of each of the first three blocks of the 3D ResNet50 network.
6. The method according to claim 5, characterized in that for food-class goods identification, the following sub-steps are included:
2.a.1, extracting key frames from the image frames of the video clip;
2.a.2, sequentially inputting the key frames into a pre-trained spatial regularization network to obtain each frame's prediction score on each food material category;
2.a.3, adding the corresponding category scores of all key frames and dividing the sum by the number of key frames to obtain the video clip's prediction score on each food material category.
7. The method of claim 6, wherein for non-food item commodity identification, the following sub-steps are specifically included:
2.b.1, extracting key frames from the image frames of the video clip;
2.b.2, preprocessing: training a Fast R-CNN network with the publicly available commodity data set RPC; the RPC data set contains many commodity pictures, each annotated with several detection boxes bbox, and all detection boxes bbox are given the uniform label 'commodity'; a commodity image library is also constructed, containing many commodity images, each showing a single commodity on a clean background, and for all pictures in the commodity library, features are extracted and an index is built with the compact visual search technique;
2.b.3, for each key frame, detecting commodity regions with the trained Fast R-CNN to generate several detection boxes bbox and their prediction scores, and retaining the detection boxes bbox whose prediction score is greater than 0.5;
2.b.4, for each key frame, cropping the image with the retained detection boxes bbox to generate a plurality of local images;
2.b.5, for each key frame cropped into a plurality of local images, extracting features from each local image with the compact visual search technique and retrieving related commodities from the commodity library using the index built over it, so that each local image obtains a related-commodity list ordered from high to low correlation;
2.b.6, for the plurality of key frames of a video clip, each key frame having a plurality of local images and each local image having a related-commodity list, fusing these lists according to the prediction scores of the local images to obtain the related-commodity list of the video clip.
8. The method of claim 7, wherein the spatial regularization network of step 2.a.2 comprises
the key frames are sequentially input into ResNet50, which provides a coarse class prediction ŷ_cls and preliminary features f_cls;
the preliminary features f_cls are input into a spatial regularization module, which generates two feature maps: an attention feature map f_att and a confidence feature map f_cof;
f_att is then re-weighted by f_cof, and a series of convolutional layers outputs a refined prediction ŷ_sr; applying a linear transformation to f_att also yields a rough prediction ŷ_att;
and the predicted value is obtained by combining ŷ_cls with ŷ_sr.
9. The method of claim 8, wherein the feature extraction of the compact visual search technique in steps 2.b.2 and 2.b.5 comprises point of interest detection, local feature selection, local feature description, local feature compression, local feature location compression, local feature aggregation.
10. The method of claim 9, wherein, in step one, for the inserted non-local block, the output at position i is y_i = (1/C(x)) * Σ_j f(x_i, x_j) g(x_j), wherein x_i is the input at position i, x_j is the input at position j, C(x) is a normalization factor, f(x_i, x_j) is a scalar reflecting the relationship between the two positions, and g(x_j) = W_g x_j, wherein W_g is a learnable weight matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910508074.9A CN110378215B (en) | 2019-06-12 | 2019-06-12 | Shopping analysis method based on first-person visual angle shopping video |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910508074.9A CN110378215B (en) | 2019-06-12 | 2019-06-12 | Shopping analysis method based on first-person visual angle shopping video |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110378215A true CN110378215A (en) | 2019-10-25 |
CN110378215B CN110378215B (en) | 2021-11-02 |
Family
ID=68250201
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910508074.9A Active CN110378215B (en) | 2019-06-12 | 2019-06-12 | Shopping analysis method based on first-person visual angle shopping video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110378215B (en) |
- 2019-06-12 CN CN201910508074.9A patent/CN110378215B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101978370A (en) * | 2008-03-21 | 2011-02-16 | 日升研发控股有限责任公司 | Acquiring actual real-time shopper behavior data during a shopper's product selection |
US20150154456A1 (en) * | 2012-07-11 | 2015-06-04 | Rai Radiotelevisione Italiana S.P.A. | Method and an apparatus for the extraction of descriptors from video content, preferably for search and retrieval purpose |
CN109063534A (en) * | 2018-05-25 | 2018-12-21 | 隆正信息科技有限公司 | A kind of shopping identification and method of expressing the meaning based on image |
CN109166007A (en) * | 2018-08-23 | 2019-01-08 | 深圳码隆科技有限公司 | A kind of Method of Commodity Recommendation and its device based on automatic vending machine |
CN109711481A (en) * | 2019-01-02 | 2019-05-03 | 京东方科技集团股份有限公司 | Neural network, correlation technique, medium and equipment for the identification of paintings multi-tag |
Non-Patent Citations (2)
Title |
---|
Zhu, Liuyi: "Research on Shelf Commodity Localization and Recognition Combining Template Matching and One-Shot Deep Learning", China Masters' Theses Full-text Database, Information Science and Technology Series *
Chen, Ruoyu: "Research on Methods for Recognizing Abnormal Human Behavior in Supermarkets", China Masters' Theses Full-text Database, Information Science and Technology Series *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113392671A (en) * | 2020-02-26 | 2021-09-14 | 上海依图信息技术有限公司 | Commodity retrieval method and device based on customer actions and electronic equipment |
CN113836981A (en) * | 2020-06-24 | 2021-12-24 | 阿里巴巴集团控股有限公司 | Data processing method, data processing device, storage medium and computer equipment |
CN112906759A (en) * | 2021-01-29 | 2021-06-04 | 哈尔滨工业大学 | Pure vision-based entrance-guard-free unmanned store checkout method |
Also Published As
Publication number | Publication date |
---|---|
CN110378215B (en) | 2021-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10671853B2 (en) | Machine learning for identification of candidate video insertion object types | |
US11775800B2 (en) | Methods and apparatus for detecting, filtering, and identifying objects in streaming video | |
CN110378215B (en) | Shopping analysis method based on first-person visual angle shopping video | |
Kuanar et al. | Video key frame extraction through dynamic Delaunay clustering with a structural constraint | |
JP3568117B2 (en) | Method and system for video image segmentation, classification, and summarization | |
CN102334118B (en) | Promoting method and system for personalized advertisement based on interested learning of user | |
US10467507B1 (en) | Image quality scoring | |
US20110106656A1 (en) | Image-based searching apparatus and method | |
Ullah et al. | Image-based service recommendation system: A JPEG-coefficient RFs approach | |
CN106021575A (en) | Retrieval method and device for same commodities in video | |
CN112714349B (en) | Data processing method, commodity display method and video playing method | |
CN111984824A (en) | Multi-mode-based video recommendation method | |
CN109934681B (en) | Recommendation method for user interested goods | |
US20200327600A1 (en) | Method and system for providing product recommendation to a user | |
CN105894362A (en) | Method and device for recommending related item in video | |
CN109272390A (en) | The personalized recommendation method of fusion scoring and label information | |
CN111327930A (en) | Method and device for acquiring target object, electronic equipment and storage medium | |
Fengzi et al. | Neural networks for fashion image classification and visual search | |
US20230230378A1 (en) | Method and system for selecting highlight segments | |
CN107944946B (en) | Commodity label generation method and device | |
Yang et al. | Keyframe recommendation based on feature intercross and fusion | |
JP2002513487A (en) | Algorithms and systems for video search based on object-oriented content | |
Kobs et al. | Indirect: Language-guided zero-shot deep metric learning for images | |
CN117132368A (en) | Novel media intelligent marketing platform based on AI | |
Vandecasteele et al. | Spott: On-the-spot e-commerce for television using deep learning-based video analysis techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |