CN110378215B - Shopping analysis method based on first-person visual angle shopping video - Google Patents

Shopping analysis method based on first-person visual angle shopping video

Info

Publication number
CN110378215B
Authority
CN
China
Prior art keywords
commodity
shopping
video
commodities
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910508074.9A
Other languages
Chinese (zh)
Other versions
CN110378215A (en)
Inventor
段凌宇 (Duan Lingyu)
张琳 (Zhang Lin)
王策 (Wang Ce)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN201910508074.9A priority Critical patent/CN110378215B/en
Publication of CN110378215A publication Critical patent/CN110378215A/en
Application granted granted Critical
Publication of CN110378215B publication Critical patent/CN110378215B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Abstract

The invention relates to the technical field of artificial intelligence applications, in particular to a shopping analysis method based on a first-person-perspective shopping video. The method specifically comprises the following steps: dividing the complete shopping video into a plurality of video segments; extracting N image frames from each video segment; analyzing the extracted image frames to obtain the shopping action type corresponding to each video segment; identifying, according to the obtained shopping action types, the commodities corresponding to the video segments of preset shopping action types; and establishing a correspondence between the identified commodities and the shopping action types corresponding to them. The invention performs comprehensive consumption analysis on the consumer's first-person-perspective shopping video; compared with picture-based analysis methods, it spares the consumer the burden of shooting and uploading pictures, and it can analyze the whole shopping process comprehensively to obtain a complete consumption record.

Description

Shopping analysis method based on first-person visual angle shopping video
Technical Field
The invention relates to the technical field of artificial intelligence application, in particular to a shopping analysis method based on a first-person visual angle shopping video.
Background
The analysis and recording of consumer shopping behavior are the basis for analyzing consumer preferences, finding the key factors that influence purchasing, making targeted recommendations and helping consumers shop; they are of great significance for the intelligent services of shopping malls and for improving consumers' quality of life, and have great commercial value.
Fig. 1 illustrates a conventional consumption analysis method: commodity regions are detected in the commodity pictures uploaded by users, image features are extracted for each region as well as for the commodity pictures in a database, and the features of each region of the photographed picture are compared with the features of the database pictures to obtain the final commodity recognition result. This approach relies on users manually photographing and uploading pictures, which is inefficient and cumbersome, and it is difficult to obtain a comprehensive consumption analysis: once a user forgets to photograph or upload, a complete shopping record cannot be obtained. Moreover, the commodity-user relationship obtained by this method is single, and the rich consumer behaviors in the whole shopping process cannot be exploited to establish diverse commodity-user correlations.
Disclosure of Invention
The embodiment of the invention provides a shopping analysis method based on a first-person-perspective shopping video, which performs comprehensive consumption analysis on the shopping video recorded from the consumer's first-person perspective.
According to a first aspect of an embodiment of the present invention, a shopping analysis method based on a first person perspective shopping video specifically includes:
dividing the complete shopping video into a plurality of video segments;
extracting N image frames from each video segment, wherein N is an integer greater than 1;
analyzing the extracted image frames to obtain a shopping action type corresponding to the video segment; and
Identifying commodities corresponding to the video clips with preset shopping action types according to the obtained shopping action types corresponding to the video clips;
and establishing a corresponding relation between the identified commodities and the shopping action types corresponding to the commodities.
The preset shopping action type comprises a selecting action and a purchasing action; and
establishing a corresponding relation between the identified commodity and the shopping action type corresponding to the commodity, specifically comprising:
determining a plurality of top-ranked commodities identified in the video clip corresponding to the purchasing action as shopping records;
and determining the top-ranked commodities identified in the video clip corresponding to the selecting action as records of commodities the user is interested in.
Analyzing the extracted image frames to obtain the shopping action type corresponding to the video segment specifically comprises:
analyzing the extracted image frames by using a non-local neural network to obtain the shopping action type corresponding to the video segment.
Identifying the commodities corresponding to the video clips of the preset shopping action types specifically comprises the following steps:
inputting a video clip corresponding to a preset shopping action type into a classification network to obtain the commodity type contained in the video clip, wherein the commodity type is a food-material type or a non-food-material type;
for food-material commodities, identifying the commodities in the key frames by using a multi-classification model;
and for non-food-material commodities, retrieving the non-food-material commodities in the key frames by using a multi-object retrieval method.
The basic network of the non-local neural network is ResNet50, ResNet50 is converted into a 3D ResNet50 network, and a non-local block is inserted at the end of the first three blocks of the 3D ResNet50 network.
The food-material commodity identification method comprises the following substeps:
2.a.1, extracting key frames from the image frames of the video clip;
2.a.2, sequentially inputting the key frames into a pre-trained spatial regularization network to obtain the prediction scores of each frame on each food-material category;
2.a.3, adding the corresponding category scores of all key frames and dividing by the number of key frames to obtain the prediction score of the video clip on each food-material category.
The non-food-material commodity identification method specifically comprises the following substeps:
2.b.1, extracting key frames from the image frames of the video clip;
2.b.2, preprocessing: a Fast R-CNN network is trained by using the publicly available commodity data set RPC; the RPC data set comprises a plurality of commodity pictures, each picture is annotated with a plurality of detection boxes bbox, and all detection boxes bbox are given the uniform label 'commodity'; in addition, a commodity image library is constructed, in which each image contains one commodity against a clean background, and for all pictures in the commodity library, features are extracted and an index is built by using the compact visual search technology;
2.b.3, for each key frame, commodity region detection is performed with the trained Fast R-CNN, generating a plurality of detection boxes bbox and their prediction scores, and the detection boxes bbox with prediction scores greater than 0.5 are retained;
2.b.4, for each key frame, the image is cropped with the detection boxes bbox, generating a plurality of local images;
2.b.5, for each key frame cut into a plurality of local images, features are extracted from each local image by using the compact visual search technology, and related commodities are retrieved from the commodity library by using the index built over the library, yielding for each local image a related-commodity list ordered from high to low relevance;
2.b.6, for the plurality of key frames of a video clip, each key frame has a plurality of local images and each local image has a related-commodity list, and the related-commodity lists are merged according to the prediction scores of the local images.
The spatial regularization network of said step 2.a.2 operates as follows:
the key frames are sequentially input into ResNet50, which provides a coarse class prediction ŷ_cls and a preliminary feature f_cls;
the preliminary feature f_cls is input into a spatial regularization module to generate two feature maps, an attention feature map f_att and a confidence feature map f_cof;
f_att is then re-weighted by f_cof, and a series of convolutional layers output an accurate prediction ŷ_sr;
a rough prediction ŷ_att can also be obtained by applying a linear transformation to f_att;
a predicted value is then obtained by combining ŷ_cls and ŷ_sr; during training the model is optimized with a cross-entropy loss, and in the application the combined value, for example the average (ŷ_cls + ŷ_sr)/2, is used as the prediction score.
The feature extraction of the compact visual search technology in the step 2.b.2 and the step 2.b.5 comprises interest point detection, local feature selection, local feature description, local feature compression, local feature position compression and local feature aggregation.
In the first step, for the inserted non-local block, the output at position i is

y_i = (1 / C(x)) Σ_j f(x_i, x_j) g(x_j),

wherein C(x) = Σ_j f(x_i, x_j) is a normalization factor, x_i is the input at position i, x_j is the input at position j, g(x_j) = W_g x_j, and W_g is a learnable weight matrix.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
1. Compared with picture-based analysis methods, the method spares the consumer the burden of shooting and uploading pictures, can analyze the whole shopping process comprehensively, and obtains a complete consumption record.
2. For the problem that store merchandise changes over time, the method reduces the model changes required when commodity categories change. For food-material commodities, although the origin and the producer may differ, the food-material categories remain the same, and food materials from new producers still belong to the original categories, so the model stays unchanged. For non-food-material commodities, an individual-level identification model has to be built because of differences between manufacturers and attributes, and the introduction of new commodities brings new commodity categories. The method uses a compact retrieval technique, which ensures that, as the commodities in a store change, related commodities can still be found simply by adding white-background pictures of the new commodities to the commodity library, without changing the retrieval model. Other methods do not consider the problem of model change and treat food-material and non-food-material commodities uniformly.
3. The method deeply mines the whole shopping video and establishes multiple 'commodity-user' relationships; compared with the traditional single 'purchase' relationship, it can provide richer information about consumer behavior and build a complete consumer portrait.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic flow diagram of a conventional shopping analysis method;
FIG. 2 is a schematic diagram of a shopping analysis method based on a first-person perspective shopping video according to an embodiment of the present invention;
FIG. 3 is a flowchart of a shopping analysis method based on a first-person perspective shopping video according to an embodiment of the present invention;
FIG. 4 is a flow chart of the prediction scores of the local graph of the non-food material commodity according to the present invention;
FIG. 5 is a flow diagram of the compact visual search used in feature extraction.
Detailed Description
Figure 2 shows a schematic flow diagram of the present invention. As shown in fig. 2 and 3, the present invention provides a shopping analysis method based on a first-person perspective shopping video, including:
for a complete shopping video, dividing the video into a plurality of video segments;
and selecting N image frames from each video segment at equal time intervals, and classifying the shopping action of each video segment.
For a complete shopping video of a user in a store, the shopping video is divided into a plurality of video segments at equal time intervals based on the definitions of different consumer behaviors, and N image frames are extracted from each video segment, where N is a positive integer; preferably, a two-second video segment is taken every two seconds, and 16 equally spaced frames are extracted from each segment for action prediction.
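A minimal Python sketch of this segmentation and equal-interval sampling is given below, assuming OpenCV is available; the function name, the 224×224 resize and the returned array layout are illustrative assumptions rather than details fixed by the method.

```python
import cv2
import numpy as np

def sample_segments(video_path, segment_seconds=2.0, frames_per_segment=16):
    """Split a shopping video into fixed-length segments and sample
    equally spaced frames from each segment (hypothetical helper)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    seg_len = max(int(round(fps * segment_seconds)), frames_per_segment)

    segments = []
    for start in range(0, total - seg_len + 1, seg_len):
        # frames_per_segment equally spaced frame indices inside the window
        idxs = np.linspace(start, start + seg_len - 1, frames_per_segment).astype(int)
        frames = []
        for idx in idxs:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ok, frame = cap.read()
            if ok:
                frames.append(cv2.resize(frame, (224, 224)))
        if len(frames) == frames_per_segment:
            segments.append(np.stack(frames))   # (16, 224, 224, 3) per segment
    cap.release()
    return segments
```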
Preferably, to classify the shopping action of a video segment, the N image frames of the segment are input into a pre-trained non-local neural network to obtain a prediction score for each shopping action, and the shopping action with the highest score is taken as the action category of the video segment.
The non-local neural network is pre-trained as follows: videos are collected and divided into video segments, and category labels are annotated manually; each segment is split into frames, the frames are assembled into matrices and input into the non-local neural network, which outputs a score vector; the loss between the score vector and the ground-truth category label is computed with a cross-entropy loss function, and the network parameters are updated by back-propagation.
Shopping videos are first input into a shopping behavior classification model to obtain a number of action segments corresponding to different consumer behaviors. Because a first-person shopping video records only the scene changes and the consumer himself is not visible, the action category is difficult to estimate from the video; furthermore, the shopping action data exhibit large inter-class similarity, since the background in the video is always the store and the differences between shopping actions are small. Therefore, the classification model should focus on the variation and correlation between frames to find category-discriminating appearance. In this system, we use a non-local neural network for shopping behavior classification.
Preferably, the basic network of the non-local neural network may be ResNet50. To use it on video data, ResNet50 is converted into a 3D ResNet50 network, i.e., all convolutional layers are converted into 3D convolutions, and a non-local block is inserted at the end of each of the first three blocks of the 3D ResNet50 network, i.e., at the outputs of activation_59, activation_71 and activation_89.
Non-local neural networks use non-local blocks to capture the spatial, temporal and spatiotemporal dependencies of data.
Preferably, for an inserted non-local block, the output at position i is treated as a normalized linear combination of the depth information at all positions of the input, i.e.

y_i = (1 / C(x)) Σ_j f(x_i, x_j) g(x_j),

where C(x) = Σ_j f(x_i, x_j) is a normalization factor. The linear coefficient f(x_i, x_j) is a scalar reflecting the relationship between positions i and j, and g(x_j) contains the depth information of the input at position j: g(x_j) is a linear transformation W_g x_j, where W_g is a learnable weight matrix. The non-local block can process messages over all input signals, so with this network the classification model can discover the information flow and changes between frames.
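A minimal PyTorch sketch of one common (embedded-Gaussian) realization of such a non-local block for video features is shown below. The 1×1×1 embedding convolutions, the softmax normalization and the residual connection are assumptions taken from the standard non-local network design; the patent itself only specifies the normalized combination and g(x_j) = W_g x_j.

```python
import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    """Embedded-Gaussian non-local block: output at position i is
    sum_j f(x_i, x_j) g(x_j) / C(x), with g(x_j) = W_g x_j (sketch)."""
    def __init__(self, channels):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)    # plays the role of W_g
        self.out = nn.Conv3d(inter, channels, kernel_size=1)

    def forward(self, x):                                     # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)      # (B, THW, C')
        phi = self.phi(x).flatten(2)                          # (B, C', THW)
        g = self.g(x).flatten(2).transpose(1, 2)              # (B, THW, C')
        f = torch.softmax(theta @ phi, dim=-1)                # f(x_i, x_j) with 1/C(x) folded into softmax
        y = (f @ g).transpose(1, 2).reshape(b, -1, t, h, w)   # sum_j f(x_i, x_j) g(x_j)
        return x + self.out(y)                                # residual connection around the block
```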
The classification for video actions is shown in table 1.
TABLE 1 Classification of video actions
The video clips whose shopping action belongs to 'picking' or 'selecting' are input into a classification network, which distinguishes whether the commodities in the video clip are food materials or non-food materials;
for food-material commodities, a multi-classification model is used to identify the food-material categories in the key frames of the video clip;
for non-food-material commodities, because the varieties are numerous and keep growing, a multi-object retrieval method is used to retrieve the non-food-material commodities in the key frames of the video clip;
After dividing the video into a plurality of action segments, we perform video content analysis on the 'picking' and 'selecting' action segments to obtain the consumer's shopping records, because these segments contain information about the goods the user likes and purchases. The commodities include food-material and non-food-material commodities, and two visual analysis models are used for the two types.
Preferably, we first use a ResNet50 classification network on the key frames of the input video segment to distinguish food-material frames from non-food-material commodity frames. The frames are then input into the corresponding classification or retrieval model.
For food-material categories, such as vegetables and meat, a multi-classification model is adopted because, although the growing areas may differ, the categories are limited and fixed. The method specifically comprises the following substeps:
2.a.1, extracting key frames from the image frames of the video clip by using ffmpeg;
2.a.2, sequentially inputting the key frames into a pre-trained Spatial Regularization Network (SRN) to obtain the prediction scores of each frame on each food-material category;
2.a.3, adding the corresponding category scores of all key frames and dividing by the number of key frames to obtain the prediction score of the video clip on each food-material category, as sketched below.
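A one-function sketch of the averaging in step 2.a.3; the array layout is an assumption made for illustration.

```python
import numpy as np

def clip_food_scores(keyframe_scores):
    """keyframe_scores: array of shape (num_keyframes, num_food_categories)
    holding the per-frame SRN prediction scores; the clip-level score of
    each category is the per-category sum divided by the number of key
    frames, i.e. the mean over key frames."""
    keyframe_scores = np.asarray(keyframe_scores, dtype=float)
    return keyframe_scores.sum(axis=0) / keyframe_scores.shape[0]
```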
The environment of a store is complex: photographs may suffer from reflections, color shifts and similar problems, and food materials in a store are often cut and packaged. A Spatial Regularization Network (SRN) is therefore used as the multi-classification model; it concentrates on class-specific regions to find fine-grained features while locally compensating for reflections and color changes in the picture.
The SRN consists of two parts, a feature extraction module and a spatial regularization module. The feature extraction module uses ResNet50 to provide a coarse class prediction ŷ_cls and a preliminary feature f_cls.
The spatial regularization module takes the preliminary feature f_cls as input and first generates two feature maps, an attention feature map f_att and a confidence feature map f_cof. f_att is then re-weighted by f_cof, and a series of convolutional layers output an accurate prediction ŷ_sr. A rough prediction ŷ_att can also be obtained by applying a linear transformation to f_att.
The mechanism in the spatial regularization module greatly improves performance, because the attention feature map highlights the important regions of each class to discover subtle class features, while the confidence feature map adjusts f_att to compensate for problems such as reflections and color changes.
Preferably, during training the model is optimized with a cross-entropy loss on the predicted values, and in the application the combination of the coarse and spatially regularized predictions, for example the average (ŷ_cls + ŷ_sr)/2, is used as the prediction score.
For non-food-material commodities, considering the diversity of categories and their ever-growing number, a retrieval technique is adopted to keep the method usable as the data expand: the system only needs to update the commodity database incrementally, without retraining new models.
The identification of non-food-material commodities specifically comprises the following substeps:
2.b.1, extracting key frames from the image frames of the video clip by using ffmpeg;
2.b.2, preprocessing: a Fast R-CNN network is trained using the publicly available commodity data set RPC, finally achieving a 97.6% detection result on that data set. The RPC data set comprises a large number of commodity pictures; each picture is annotated with a plurality of detection boxes (bbox) around its commodity regions, and each box carries a commodity category label. During training we ignore the attached category labels and give all bboxes the uniform label 'commodity'. A commodity image library is also constructed, in which each image contains one commodity against a clean background. For all pictures of the commodity library, features are extracted and an index is built using the compact visual search technology.
2.b.3, for each key frame, commodity region detection is performed with the trained Fast R-CNN, generating a plurality of bboxes together with their prediction scores (between 0 and 1, indicating how likely a bbox contains a commodity); bboxes with a prediction score greater than 0.5 are retained.
2.b.4, for each key frame, the image is cropped with the retained bboxes, generating a plurality of local images (a sketch of steps 2.b.3 and 2.b.4 is given after step 2.b.6).
2.b.5, for each key frame that has been cut into a plurality of local images, features are extracted from each local image using the compact visual search technology, and related commodities are retrieved from the commodity library using the index built over the library, yielding for each local image a related-commodity list ordered from high to low relevance.
2.b.6, for the several key frames of a video clip, each key frame has a plurality of local images and each local image has a related-commodity list; the lists are merged according to the prediction scores of the local images. The result is illustrated in fig. 4, where the circles represent commodity retrieval lists: a horizontal row of circles (the list of one local image) is unlikely to contain repeated commodities, but a vertical column may, since the detections of the individual local images do not affect one another.
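The following sketch illustrates steps 2.b.3 and 2.b.4 on a single key frame; the detector output format (a list of (bbox, score) pairs) is an assumption used only for illustration, not the actual Fast R-CNN interface.

```python
def crop_commodity_regions(frame, detections, score_threshold=0.5):
    """Keep detection boxes whose score exceeds the threshold and crop
    the corresponding local images from the key frame (sketch).
    frame: HxWx3 image array; detections: [((x1, y1, x2, y2), score), ...]."""
    h, w = frame.shape[:2]
    local_images = []
    for (x1, y1, x2, y2), score in detections:
        if score <= score_threshold:            # retain only bbox with score > 0.5
            continue
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        if x2 > x1 and y2 > y1:
            local_images.append((frame[y1:y2, x1:x2].copy(), score))
    # local images ordered by detection score, high to low (used in step 2.b.6)
    return sorted(local_images, key=lambda p: p[1], reverse=True)
```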
The results within a key frame are fused first. Suppose there are k local images B_1, ..., B_k, ordered from high to low prediction score, and that for local image B_i the top 30 related commodities P^i_1, ..., P^i_30 are taken, ordered from high to low relevance. During merging, a list L is maintained: first the top commodities P^1_1, ..., P^k_1 of B_1, ..., B_k are added to L in order, skipping any commodity that is already in L; then the second commodities P^1_2, ..., P^k_2 are added to L in order, and so on, until L contains 30 commodities. In this way each key frame obtains a list L of length 30, as sketched below.
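A minimal sketch of this within-key-frame round-robin merge; the list contents and the length of 30 follow the description above.

```python
def fuse_local_lists(local_lists, top_k=30):
    """local_lists[i] is the ranked commodity list of local image B_i
    (local images ordered by prediction score, commodities by relevance).
    Commodities are merged column by column - first the top-1 of every
    local image, then the top-2, and so on - skipping duplicates, until
    the merged list L holds top_k commodities (sketch)."""
    merged = []
    max_len = max((len(lst) for lst in local_lists), default=0)
    for rank in range(max_len):
        for lst in local_lists:
            if rank < len(lst) and lst[rank] not in merged:
                merged.append(lst[rank])
                if len(merged) == top_k:
                    return merged
    return merged
```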
The results of all key frames are then fused. Suppose there are t key frames F_1, ..., F_t, ordered from high to low, and that the list L_i of key frame F_i contains commodities Q^i_1, ..., Q^i_30, ordered from high to low relevance. During fusion, a list E is maintained: the first commodities Q^1_1, ..., Q^t_1 of F_1, ..., F_t are counted to obtain the commodity categories and the number of occurrences of each category, and the categories are added to E in order of occurrence count, from high to low; then the second commodities Q^1_2, ..., Q^t_2 are added to E in the same way, skipping categories already present in E, and so on, until E contains 30 commodities. A sketch of this fusion is given below.
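A sketch of this cross-key-frame fusion; representing commodities by hashable category identifiers is an assumption made for illustration.

```python
from collections import Counter

def fuse_keyframe_lists(keyframe_lists, top_k=30):
    """keyframe_lists[i] is the merged list L_i of key frame F_i. At each
    rank, the commodity categories are counted over all key frames and
    added to E in order of occurrence count, from high to low, skipping
    categories already in E, until E holds top_k commodities (sketch)."""
    fused = []
    max_len = max((len(lst) for lst in keyframe_lists), default=0)
    for rank in range(max_len):
        counts = Counter(lst[rank] for lst in keyframe_lists if rank < len(lst))
        for item, _ in counts.most_common():    # occurrence count, high to low
            if item not in fused:
                fused.append(item)
                if len(fused) == top_k:
                    return fused
    return fused
```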
We use a multi-commodity retrieval method to obtain the commodities purchased or liked by consumers. To obtain more accurate retrieval results, the commodity position detection model is first used to cut the image into several regions that may contain commodities, which increases the computational requirements and time. In addition, ultra-fine-grained commodity retrieval, such as distinguishing different flavours of the same brand of potato chips, faces very small inter-class differences (for example, only the text and texture of the commodity package differ). To address both issues, the compact visual search technology is used to retrieve commodities: it focuses more on local features and yields more efficient retrieval.
fig. 5 shows a schematic diagram of a feature extraction flow of the compact visual search technique.
The feature extraction of the compact visual search technology in steps 2.b.2 and 2.b.5 can be divided into 6 parts: interest point detection, local feature selection, local feature description, local feature compression, local feature location compression, and local feature aggregation. A block-based frequency-domain Laplacian-of-Gaussian (BFLoG) method is integrated with an ALP detector as the interest point detection method. A relevance measure is computed to rank the features, and a fixed number of local features is selected. A SIFT descriptor is used as the feature descriptor. For local feature compression, a low-complexity transform coding scheme is adopted: a small linear transform is applied to the 8 values of each individual spatial bin of the SIFT descriptor, and only a subset of the transformed descriptor elements is included in the bitstream. The local feature locations are compressed using a histogram coding scheme, the location data being represented as a spatial histogram composed of a binary map and a set of histogram counts. For aggregation, the scalable compressed Fisher vector is used: a subset of Gaussian components of the Gaussian mixture model is selected according to the total feature data budget, only the information in the selected components is retained, and a different set of components is selected for each image based on the energy concentration in the Fisher vector.
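The sketch below is a deliberately simplified stand-in for this pipeline, assuming OpenCV: it uses OpenCV SIFT instead of the BFLoG/ALP detector, keypoint response instead of the relevance-based feature selection, and a sign-binarized mean of the descriptors instead of the scalable compressed Fisher vector, so it only illustrates the overall detect-select-describe-aggregate flow.

```python
import cv2
import numpy as np

def compact_global_descriptor(image_bgr, max_features=300):
    """Simplified compact-descriptor extraction: detect interest points,
    keep a fixed number of local features, describe them with SIFT and
    aggregate them into a single binary global descriptor so that
    matching can use fast Hamming distance (sketch)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None or len(keypoints) == 0:
        return np.zeros(128, dtype=np.uint8)
    # local feature selection: keep the strongest keypoint responses
    order = np.argsort([-kp.response for kp in keypoints])[:max_features]
    selected = descriptors[order]
    # aggregation: mean of the selected descriptors, then sign binarization
    aggregated = selected.mean(axis=0)
    return (aggregated > aggregated.mean()).astype(np.uint8)   # 128-bit binary descriptor
```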
In step 2.b.2, the index is built with the MBIT retrieval technology, which computes Hamming distances very quickly for long binary global descriptors. MBIT reduces the exhaustive distance computation between features to an independent component-to-component matching problem over aligned components and constructs multiple hash tables for these components. Given a query descriptor, the relevant candidate data are retrieved by using the query binary sub-vectors (i.e., components) as indexes into their corresponding hash tables, thereby significantly reducing the number of candidate images required for the subsequent linear search.
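A toy sketch of this multi-hash-table idea follows; the block size, the exact-match bucketing and the data layout are assumptions for illustration and do not reproduce the actual MBIT specification.

```python
import numpy as np
from collections import defaultdict

class MultiTableBinaryIndex:
    """Split a long binary global descriptor into fixed-size components;
    each component is a key into its own hash table. A query collects
    candidates from every table and only those candidates go through the
    exact Hamming-distance linear search (sketch)."""
    def __init__(self, block_bits=16):
        self.block_bits = block_bits
        self.tables = defaultdict(set)     # (block index, block bytes) -> item ids
        self.descriptors = {}              # item id -> full binary descriptor

    def _blocks(self, desc):
        bits = np.asarray(desc, dtype=np.uint8)
        for i in range(0, len(bits), self.block_bits):
            yield i // self.block_bits, bits[i:i + self.block_bits].tobytes()

    def add(self, item_id, desc):
        self.descriptors[item_id] = np.asarray(desc, dtype=np.uint8)
        for idx, key in self._blocks(desc):
            self.tables[(idx, key)].add(item_id)

    def query(self, desc, top_k=30):
        candidates = set()
        for idx, key in self._blocks(desc):          # candidate matches at least one component
            candidates |= self.tables[(idx, key)]
        q = np.asarray(desc, dtype=np.uint8)
        ranked = sorted(candidates,
                        key=lambda i: int(np.count_nonzero(self.descriptors[i] != q)))
        return ranked[:top_k]                        # linear Hamming search on candidates only
```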
For video clips of the 'selecting' action, the top-1 food-material category predicted by the food-material classifier and the top-1 result of the non-food-material retrieval are taken as records of commodities purchased by the user; for video clips of the 'picking' action, the top-3 food-material categories and the top-3 results of the non-food-material retrieval are taken as records of commodities the user is interested in.
The final consumer shopping record consists of the commodities purchased by the user and the commodities the user is interested in: the purchased commodities are the top-1 food-material category of the food-material classification and the top-1 commodity of the non-food-material retrieval on the 'selecting' action video clips, and the commodities of interest are the top-3 food-material categories and the top-3 commodities of the non-food-material retrieval on the 'picking' action video clips.
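A small sketch of assembling the final record from the per-clip results; the dictionary layout and the action names are assumptions chosen to mirror the description above.

```python
def build_shopping_record(clip_results):
    """clip_results: list of dicts such as
    {"action": "selecting" or "picking", "ranked_commodities": [...]},
    produced by the food-material classifier or the non-food-material
    retrieval on each action clip. 'selecting' clips contribute their
    top-1 commodity as a purchase, 'picking' clips contribute their
    top-3 commodities as items of interest (sketch)."""
    purchased, interested = [], []
    for clip in clip_results:
        ranked = clip.get("ranked_commodities", [])
        if clip.get("action") == "selecting" and ranked:
            purchased.append(ranked[0])
        elif clip.get("action") == "picking":
            interested.extend(ranked[:3])
    return {"purchased": purchased, "interested": interested}
```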
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (5)

1. A shopping analysis method based on a first person perspective shopping video is characterized by specifically comprising the following steps:
dividing the complete shopping video into a plurality of video segments;
extracting N image frames from each video segment, wherein N is an integer greater than 1;
analyzing the extracted image frames by using a non-local neural network to obtain the shopping action types corresponding to the video clips, wherein the basic network of the non-local neural network is ResNet50, ResNet50 is converted into a 3D ResNet50 network, and a non-local block is inserted at the end of the first three blocks of the 3D ResNet50 network; and
identifying commodities corresponding to the video clips of preset shopping action types according to the obtained shopping action types corresponding to the video clips, wherein the commodity identification method comprises the following steps: inputting a video clip corresponding to a preset shopping action type into a classification network to obtain the commodity type contained in the video clip, wherein the commodity type is a food-material type or a non-food-material type; for food-material commodities, identifying the commodities in the key frames by using a multi-classification model; for non-food-material commodities, retrieving the non-food-material commodities in the key frames by using a multi-object retrieval method; the food-material commodity identification method comprises the following substeps: 2.a.1, extracting key frames from the image frames of the video clip; 2.a.2, sequentially inputting the key frames into a pre-trained spatial regularization network to obtain the prediction scores of each frame on each food-material category; 2.a.3, adding the corresponding category scores of all key frames and dividing by the number of key frames to obtain the prediction score of the video clip on each food-material category;
the non-food-material commodity identification method specifically comprises the following substeps:
2.b.1, extracting key frames from the image frames of the video clip;
2.b.2, preprocessing: a Fast R-CNN network is trained by using the publicly available commodity data set RPC; the RPC data set comprises a plurality of commodity pictures, each picture is annotated with a plurality of detection boxes bbox, and all detection boxes bbox are given the uniform label 'commodity'; in addition, a commodity image library is constructed, in which each image contains one commodity against a clean background, and for all pictures in the commodity image library, features are extracted and an index is built by using the compact visual search technology;
2.b.3, for each key frame, commodity region detection is performed with the trained Fast R-CNN, generating a plurality of detection boxes bbox and their prediction scores, and the detection boxes bbox with prediction scores greater than 0.5 are retained;
2.b.4, for each key frame, the image is cropped with the detection boxes bbox, generating a plurality of local images;
2.b.5, for each key frame cut into a plurality of local images, features are extracted from each local image by using the compact visual search technology, and related commodities are retrieved from the commodity library by using the index built over the commodity library, yielding for each local image a related-commodity list ordered from high to low relevance;
2.b.6, for the plurality of key frames of a video clip, each key frame has a plurality of local images and each local image has a related-commodity list, and the related-commodity lists are merged according to the prediction scores of the local images;
and establishing a corresponding relation between the identified commodities and the shopping action types corresponding to the commodities.
2. The method of claim 1, wherein the preset shopping action type includes a pick action and a buy action; and
establishing a corresponding relation between the identified commodity and the shopping action type corresponding to the commodity, specifically comprising:
determining a plurality of top-ranked commodities identified in the video clip corresponding to the purchasing action as shopping records;
and determining the top-ranked commodities identified in the video clip corresponding to the selecting action as records of commodities the user is interested in.
3. The method of claim 1, wherein the spatial regularization network of step 2.a.2 operates as follows:
the key frames are sequentially input into ResNet50, which provides a coarse class prediction ŷ_cls and a preliminary feature f_cls;
the preliminary feature f_cls is input into a spatial regularization module to generate two feature maps, an attention feature map f_att and a confidence feature map f_cof;
f_att is then re-weighted by f_cof, and a series of convolutional layers output an accurate prediction ŷ_sr;
a rough prediction ŷ_att is obtained by applying a linear transformation to f_att;
and a predicted value is obtained by combining ŷ_cls and ŷ_sr.
4. The method of claim 3, wherein the feature extraction of the compact visual search technique in steps 2.b.2 and 2.b.5 comprises point of interest detection, local feature selection, local feature description, local feature compression, local feature location compression, local feature aggregation.
5. The method of claim 4, wherein for an inserted non-local block, the output at position i is
y_i = (1 / C(x)) Σ_j f(x_i, x_j) g(x_j),
wherein C(x) = Σ_j f(x_i, x_j) is a normalization factor, x_i is the input at position i, x_j is the input at position j, g(x_j) = W_g x_j, and W_g is a learnable weight matrix.
CN201910508074.9A 2019-06-12 2019-06-12 Shopping analysis method based on first-person visual angle shopping video Active CN110378215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910508074.9A CN110378215B (en) 2019-06-12 2019-06-12 Shopping analysis method based on first-person visual angle shopping video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910508074.9A CN110378215B (en) 2019-06-12 2019-06-12 Shopping analysis method based on first-person visual angle shopping video

Publications (2)

Publication Number Publication Date
CN110378215A CN110378215A (en) 2019-10-25
CN110378215B true CN110378215B (en) 2021-11-02

Family

ID=68250201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910508074.9A Active CN110378215B (en) 2019-06-12 2019-06-12 Shopping analysis method based on first-person visual angle shopping video

Country Status (1)

Country Link
CN (1) CN110378215B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392671A (en) * 2020-02-26 2021-09-14 上海依图信息技术有限公司 Commodity retrieval method and device based on customer actions and electronic equipment
CN112906759A (en) * 2021-01-29 2021-06-04 哈尔滨工业大学 Pure vision-based entrance-guard-free unmanned store checkout method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101978370A (en) * 2008-03-21 2011-02-16 日升研发控股有限责任公司 Acquiring actual real-time shopper behavior data during a shopper's product selection
CN109063534A (en) * 2018-05-25 2018-12-21 隆正信息科技有限公司 A kind of shopping identification and method of expressing the meaning based on image
CN109166007A (en) * 2018-08-23 2019-01-08 深圳码隆科技有限公司 A kind of Method of Commodity Recommendation and its device based on automatic vending machine
CN109711481A (en) * 2019-01-02 2019-05-03 京东方科技集团股份有限公司 Neural network, correlation technique, medium and equipment for the identification of paintings multi-tag

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ITMI20121210A1 (en) * 2012-07-11 2014-01-12 Rai Radiotelevisione Italiana A METHOD AND AN APPARATUS FOR THE EXTRACTION OF DESCRIPTORS FROM VIDEO CONTENT, PREFERABLY FOR SEARCH AND RETRIEVAL PURPOSE

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101978370A (en) * 2008-03-21 2011-02-16 日升研发控股有限责任公司 Acquiring actual real-time shopper behavior data during a shopper's product selection
CN109063534A (en) * 2018-05-25 2018-12-21 隆正信息科技有限公司 A kind of shopping identification and method of expressing the meaning based on image
CN109166007A (en) * 2018-08-23 2019-01-08 深圳码隆科技有限公司 A kind of Method of Commodity Recommendation and its device based on automatic vending machine
CN109711481A (en) * 2019-01-02 2019-05-03 京东方科技集团股份有限公司 Neural network, correlation technique, medium and equipment for the identification of paintings multi-tag

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on shelf commodity localization and recognition combining template matching and one-shot deep learning; 朱柳依 (Zhu Liuyi); China Master's Theses Full-text Database, Information Science and Technology; 2019-01-15; pp. 56-69 *
Research on recognition methods for abnormal human behavior in supermarkets; 陈若愚 (Chen Ruoyu); China Master's Theses Full-text Database, Information Science and Technology; 2016-03-15; full text *

Also Published As

Publication number Publication date
CN110378215A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
US10671853B2 (en) Machine learning for identification of candidate video insertion object types
CN110263265B (en) User tag generation method, device, storage medium and computer equipment
JP3568117B2 (en) Method and system for video image segmentation, classification, and summarization
CN107220365B (en) Accurate recommendation system and method based on collaborative filtering and association rule parallel processing
KR20190117584A (en) Method and apparatus for detecting, filtering and identifying objects in streaming video
US20120102033A1 (en) Systems and methods for building a universal multimedia learner
US20110106656A1 (en) Image-based searching apparatus and method
Ullah et al. Image-based service recommendation system: A JPEG-coefficient RFs approach
Gitte et al. Content based video retrieval system
CN110378215B (en) Shopping analysis method based on first-person visual angle shopping video
CN112714349B (en) Data processing method, commodity display method and video playing method
CN111984824A (en) Multi-mode-based video recommendation method
CN105792010A (en) Television shopping method and device based on image content analysis and picture index
CN113766330A (en) Method and device for generating recommendation information based on video
CN105894362A (en) Method and device for recommending related item in video
Papadopoulos et al. Automatic summarization and annotation of videos with lack of metadata information
CN109086830A (en) Typical association analysis based on sample punishment closely repeats video detecting method
CN109934681B (en) Recommendation method for user interested goods
Ulges et al. A system that learns to tag videos by watching youtube
US20200327600A1 (en) Method and system for providing product recommendation to a user
CN107944946B (en) Commodity label generation method and device
WO2021250009A1 (en) Method and system for selecting highlight segments
Fei et al. Learning user interest with improved triplet deep ranking and web-image priors for topic-related video summarization
JP2002513487A (en) Algorithms and systems for video search based on object-oriented content
CN110379483A (en) For the diet supervision of sick people and recommended method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant