CN110378215A - Purchase analysis method based on first person shopping video - Google Patents
Purchase analysis method based on first person shopping video
- Publication number
- CN110378215A (application CN201910508074.9A)
- Authority
- CN
- China
- Prior art keywords
- commodity
- video
- shopping
- frames
- local
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Business, Economics & Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Strategic Management (AREA)
- Development Economics (AREA)
- Multimedia (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Entrepreneurship & Innovation (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Marketing (AREA)
- Game Theory and Decision Science (AREA)
- General Business, Economics & Management (AREA)
- Economics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to the technical field of artificial intelligence applications, and in particular to a shopping analysis method based on first-person-perspective shopping video. The method specifically includes: dividing a complete shopping video into a plurality of video clips; extracting N image frames from each video clip; analyzing the extracted image frames to obtain the shopping action type corresponding to each video clip; identifying, according to the obtained shopping action types, the commodities in the video clips whose shopping action type matches a preset shopping action type; and establishing a correspondence between each identified commodity and its shopping action type. Because the invention performs comprehensive consumption analysis on the consumer's first-person shopping video, compared with picture-based analysis methods it relieves the consumer of the burden of shooting and uploading pictures, can analyze the entire shopping process, and obtains a complete consumption record.
Description
Technical Field
The invention relates to the technical field of artificial intelligence applications, and in particular to a shopping analysis method based on a first-person-perspective shopping video.
Background
The analysis and recording of consumer shopping form the basis for analyzing consumer preferences, identifying the key factors that influence purchases, and making targeted recommendations that help consumers shop. They are important for intelligent in-store services and for improving consumers' quality of life, and have great commercial value.
Fig. 1 illustrates a conventional consumption analysis. Such methods detect the regions containing commodities in pictures uploaded by users, extract image features for each region, extract features for the commodity pictures in a database, and compare the features of each region of the shot picture with the features of the database pictures to obtain the final commodity identification result. This approach relies on the user manually shooting and uploading pictures; it is inefficient, cumbersome to operate, and makes comprehensive consumption analysis difficult to obtain. Once the user forgets to shoot or upload, no comprehensive shopping record can be obtained. Moreover, the commodity-user relationship obtained in this way is single-faceted, and the rich consumer behavior in the whole shopping process cannot be exploited to establish multiple kinds of commodity-user correlations.
Disclosure of Invention
The embodiment of the invention provides a shopping analysis method based on a first-person-perspective shopping video, which performs comprehensive consumption analysis on a consumer's shopping video recorded from the first-person perspective.
According to a first aspect of an embodiment of the present invention, a shopping analysis method based on a first person perspective shopping video specifically includes:
dividing the complete shopping video into a plurality of video segments;
extracting N image frames from each video segment, wherein N is an integer greater than 1;
analyzing the extracted image frames to obtain the shopping action type corresponding to the video segment; and
Identifying commodities corresponding to the video clips with preset shopping action types according to the obtained shopping action types corresponding to the video clips;
and establishing a corresponding relation between the identified commodities and the shopping action types corresponding to the commodities.
The preset shopping action types comprise a picking action and a purchasing action; and
establishing a corresponding relation between the identified commodity and the shopping action type corresponding to the commodity, specifically comprising:
determining a plurality of first commodities identified in a video clip corresponding to the purchasing action as shopping records;
and determining the first commodities identified in the video clip corresponding to the picking action as records of commodities of interest to the user.
Analyzing the extracted image frames to obtain the shopping action type corresponding to the video segment specifically comprises:
analyzing the extracted image frames by using a non-local neural network to obtain the shopping action type corresponding to the video segment.
Identifying the commodity corresponding to the video clip with the preset shopping action type specifically comprises the following steps:
inputting a video clip corresponding to a preset shopping action type into a classification network to obtain a commodity type contained in the video clip, wherein the commodity type comprises a food material type or a non-food material type;
for food material commodities, identifying commodities of key frames by using a multi-classification model;
and for non-food material commodities, searching the non-food material commodities in the key frame by using a multi-object searching method.
The basic network of the non-local neural network is ResNet50, ResNet50 is converted into a 3D ResNet50 network, and a non-local block is inserted at the end of the first three blocks of the 3D ResNet50 network.
The food material class commodity identification method comprises the following substeps:
2.a.1, extracting key frames from the image frames of the video clip;
2.a.2, sequentially inputting the key frames into a pre-trained spatial regularization network to obtain each frame's prediction score on each food material category;
2.a.3, adding the corresponding category scores of all key frames and dividing the sum by the number of key frames to obtain the video clip's prediction score on each food material category.
The non-food material commodity identification method specifically comprises the following substeps:
2.b.1, extracting key frames from the image frames of the video clip;
2.b.2, preprocessing: training a Fast R-CNN network with the publicly available commodity data set RPC; the RPC data set contains many commodity pictures, each annotated with several detection boxes bbox, and all detection boxes bbox are given the uniform label 'commodity'; a commodity image library is also constructed, containing many commodity images, each showing a single commodity on a clean background, and for all pictures in the commodity library, features are extracted and an index is built with the compact visual search technique;
2.b.3, for each key frame, detecting commodity regions with the trained Fast R-CNN to generate several detection boxes bbox and their prediction scores, and retaining the detection boxes bbox whose prediction score is greater than 0.5;
2.b.4, for each key frame, cropping the image with the retained detection boxes bbox to generate a plurality of local images;
2.b.5, for each key frame cropped into a plurality of local images, extracting features from each local image with the compact visual search technique and retrieving related commodities from the commodity library using the index built over it, so that each local image obtains a related-commodity list ordered from high to low correlation;
2.b.6, for the plurality of key frames of a video clip, each key frame having a plurality of local images and each local image having a related-commodity list, fusing these lists according to the prediction scores of the local images to obtain the related-commodity list of the video clip.
The spatial regularization network of said step 2.a.2 comprises the following:
the key frames are sequentially input into ResNet50, which provides a coarse class prediction ŷ_cls and preliminary features f_cls;
the preliminary features f_cls are input into a spatial regularization module, which generates two feature maps: an attention feature map f_att and a confidence feature map f_cof;
f_att is then re-weighted by f_cof, and a series of convolutional layers outputs a refined prediction ŷ_sr; applying a linear transformation to f_att also yields a rough prediction ŷ_att;
the predicted value is obtained by combining ŷ_cls with ŷ_sr, and this combined value is used as the prediction score in the application.
The feature extraction of the compact visual search technology in the step 2.b.2 and the step 2.b.5 comprises interest point detection, local feature selection, local feature description, local feature compression, local feature position compression and local feature aggregation.
In the first step, for the inserted non-local block, the output at position i is y_i = (1/C(x)) * Σ_j f(x_i, x_j) g(x_j), wherein x_i is the input at position i, x_j is the input at position j, C(x) is a normalization factor, f(x_i, x_j) is a scalar reflecting the relationship between the two positions, and g(x_j) = W_g x_j, wherein W_g is a learnable weight matrix.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
1. Compared with picture-based analysis methods, the method relieves the consumer of the burden of shooting and uploading, can comprehensively analyze the whole shopping process, and obtains a complete consumption record.
2. For the problem that store merchandise changes over time, the method reduces the model changes required when commodity categories change. For food material commodities, although origin and manufacturer may differ, the food material types are the same, and food material commodities from new manufacturers still belong to the original food material types, so the model remains unchanged. For non-food material commodities, an individual-level identification model must be established according to differences in manufacturer and attributes, and introducing new commodities brings new commodity categories; the method therefore uses a compact retrieval technique, so that as the commodities in the store change, related commodities can still be found simply by adding clean-background pictures of the new commodities to the commodity library, without changing the retrieval model. Other methods do not consider the problem of model change and process food material and non-food material commodities uniformly.
3. The method deeply mines the whole shopping video and establishes multiple kinds of 'commodity-user' relationships; compared with the traditional single 'purchase' relationship, it can provide richer consumption information and build a complete consumer portrait.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic flow diagram of a conventional shopping analysis method;
FIG. 2 is a schematic diagram of a shopping analysis method based on a first-person perspective shopping video according to an embodiment of the present invention;
FIG. 3 is a flowchart of a shopping analysis method based on a first-person perspective shopping video according to an embodiment of the present invention;
FIG. 4 is a flow chart of the prediction scores of the local graph of the non-food material commodity according to the present invention;
FIG. 5 is a flow diagram of the compact visual search used in feature extraction.
Detailed Description
Figure 2 shows a schematic flow diagram of the present invention. As shown in fig. 2 and 3, the present invention provides a shopping analysis method based on a first-person perspective shopping video, including:
for a complete shopping video, dividing the video into a plurality of video segments;
and selecting N image frames from each video clip at equal time intervals, and classifying the video clips for shopping actions.
For a complete shopping video of a user in a shop, based on different consumer behavior definitions, the shopping video is divided into a plurality of video segments at equal time intervals, and N image frames are extracted from each video segment, wherein N is a positive integer. Preferably, a two-second video segment is taken every two seconds, and 16 equally spaced frames are extracted from the segment for action prediction.
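A minimal sketch of this clip and frame sampling, assuming OpenCV is available and using a hypothetical input file name, could look as follows:

```python
# Illustrative sketch (not the patented implementation): split a shopping video into
# two-second clips and sample 16 equally spaced frames from each clip.
# The input file name is a hypothetical placeholder.
import cv2
import numpy as np

def sample_clips(video_path, clip_seconds=2.0, frames_per_clip=16):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    clip_len = max(int(round(clip_seconds * fps)), frames_per_clip)
    clips = []
    for start in range(0, total - clip_len + 1, clip_len):
        # 16 equally spaced frame indices inside this clip
        idxs = np.linspace(start, start + clip_len - 1, frames_per_clip).astype(int)
        frames = []
        for i in idxs:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        if len(frames) == frames_per_clip:
            clips.append(np.stack(frames))        # one clip: (16, H, W, 3)
    cap.release()
    return clips

clips = sample_clips("shopping_video.mp4")        # hypothetical first-person video file
print(len(clips), "clips sampled")
```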
Preferably, to classify the shopping action of a video segment, the N image frames of the segment are input into a pre-trained non-local neural network to obtain the segment's prediction score for each shopping action, and the shopping action with the highest score is taken as the action category of the video segment.
the method comprises the steps of pre-training a non-local neural network into collected videos, dividing the videos into video segments, manually marking category labels, dividing the videos into frames, making the frames into matrixes, inputting the matrixes into the non-local neural network, outputting a fractional vector by the non-local neural network, calculating loss by using a cross entropy loss function on the vectors and the real category labels, and updating network parameters by using a back propagation mode.
Shopping videos are first input into a shopping behavior classification model to obtain action segments for different consumer behaviors. Because a first-person shopping video only records scene changes, the consumer's own behavior is not directly visible, so the action category is difficult to estimate from the video. Furthermore, the shopping action data show large inter-class similarity, since the background in the video is always a store and the differences between shopping actions are small. The classification model should therefore focus on the variation and correlation between frames to find category-discriminating appearance. In this system, we use a non-local neural network for shopping behavior classification.
Preferably, the basic network of the non-local neural network may be ResNet50. To use it on video data, ResNet50 is converted into a 3D ResNet50 network, i.e. all convolutional layers are converted into 3D convolutions, and a non-local block is inserted at the end of each of the first three blocks of the 3D ResNet50 network, i.e. at the outputs of activation_59, activation_71 and activation_89.
Non-local neural networks use non-local blocks to capture the spatial, temporal and spatiotemporal dependencies of data.
Preferably, for an inserted non-local block, the output at position i is treated as a normalized linear combination of the depth information at all positions of the input, i.e. y_i = (1/C(x)) * Σ_j f(x_i, x_j) g(x_j). The linear coefficient f(x_i, x_j) is a scalar reflecting the relationship between the two positions, g(x_j) contains the depth information of the input at position j, and C(x) is a normalization factor. Here x_i is the input at position i, x_j is the input at position j, and g(x_j) is a linear transformation W_g x_j, where W_g is a learnable weight matrix. The non-local block can thus process messages over all input positions; using this network, the classification model can discover changes in information flow between frames.
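A minimal sketch of such a non-local block, assuming PyTorch and following the standard embedded-Gaussian design (the softmax plays the role of the normalization factor C(x)); this is an illustrative implementation, not necessarily the exact block used in the patent:

```python
# Illustrative sketch of a 3D non-local block: the output at each space-time position is a
# normalized weighted sum over all other positions, added back to the input as a residual.
import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv3d(channels, inner, kernel_size=1)   # embeds x_i
        self.phi = nn.Conv3d(channels, inner, kernel_size=1)     # embeds x_j
        self.g = nn.Conv3d(channels, inner, kernel_size=1)       # g(x_j) = W_g x_j
        self.out = nn.Conv3d(inner, channels, kernel_size=1)     # projects back to C channels

    def forward(self, x):                       # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        n = t * h * w
        theta = self.theta(x).reshape(b, -1, n).transpose(1, 2)  # (B, N, C')
        phi = self.phi(x).reshape(b, -1, n)                      # (B, C', N)
        g = self.g(x).reshape(b, -1, n).transpose(1, 2)          # (B, N, C')
        # f(x_i, x_j) followed by normalization (softmax over j plays the role of 1/C(x))
        attn = torch.softmax(theta @ phi, dim=-1)                # (B, N, N)
        y = (attn @ g).transpose(1, 2).reshape(b, -1, t, h, w)   # y_i = sum_j f * g
        return x + self.out(y)                                   # residual connection

block = NonLocalBlock3D(64)
video_feat = torch.randn(2, 64, 4, 8, 8)        # dummy feature map (B, C, T, H, W)
print(block(video_feat).shape)                  # torch.Size([2, 64, 4, 8, 8])
```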
The classification for video actions is shown in table 1.
TABLE 1 Classification of video actions
The video clips whose shopping action belongs to picking or purchasing are input into a classification network, and the commodities in the video clips are distinguished into food materials or non-food materials;
for food material commodities, a multi-classification model is used to identify the food material categories in the key frames of the video clip;
for non-food material commodities, because their varieties are numerous and keep growing, a multi-object retrieval method is used to retrieve the non-food material commodities in the key frames of the video clip.
After dividing the video into a plurality of action segments, we perform video content analysis on the picking and purchasing action segments to obtain consumer shopping records, because these segments contain information about the goods liked and purchased by the user. The commodities include food material and non-food material commodities, and two visual analysis models are used for the two types.
Preferably, we first use a ResNet50 classification network on the key frames of the input video segment to distinguish food material frames from non-food material frames. The frames are then input into the corresponding classification or retrieval model.
For food categories, such as vegetables and meat, a multi-classification model is employed because while they may have different growth areas, the categories are limited and fixed. The method specifically comprises the following substeps:
2.a.1, extracting key frames from the image frames of the video clip using ffmpeg;
2.a.2, sequentially inputting the key frames into a pre-trained Spatial Regularization Network (SRN) to obtain each frame's prediction score on each food material category;
2.a.3, adding the corresponding category scores of all key frames and dividing the sum by the number of key frames to obtain the video clip's prediction score on each food material category.
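A minimal sketch of this per-clip score aggregation, assuming NumPy and dummy per-frame scores in place of the SRN outputs:

```python
# Illustrative sketch (assumed shapes and dummy values): average the per-key-frame
# category scores to obtain the clip-level prediction score for each food material category.
import numpy as np

# frame_scores[k][c] = prediction score of key frame k for food material category c,
# e.g. as produced by the spatial regularization network.
frame_scores = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
])

clip_scores = frame_scores.sum(axis=0) / len(frame_scores)   # add, then divide by the number of key frames
print(clip_scores)                                           # clip-level score per category
```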
The environment of a store is complex: reflections, color changes and similar problems appear in the captured frames, and food materials are often cut and packaged in the store. A Spatial Regularization Network (SRN) is therefore used as the multi-classification model: it concentrates on class-relevant regions to find fine-grained features, while locally compensating for the reflection and color-change problems of the picture.
The SRN consists of two parts: a feature extraction module and a spatial regularization module. The feature extraction module uses ResNet50 to provide a coarse class prediction ŷ_cls and preliminary features f_cls.
The spatial regularization module takes the preliminary features f_cls as input and first generates two feature maps: an attention feature map f_att and a confidence feature map f_cof. f_att is then re-weighted by f_cof, and a series of convolutional layers outputs a refined prediction ŷ_sr; applying a linear transformation to f_att also yields a rough prediction ŷ_att. This mechanism greatly improves performance, because the attention feature map highlights the important regions of each class to discover subtle class features, while the confidence feature map adjusts f_att to compensate for problems such as reflection and color change.
Preferably, during training the model is optimized with a cross-entropy loss, and the predicted value obtained by combining ŷ_cls with ŷ_sr is used as the prediction score in the application.
For non-food material commodities, considering their category diversity and ever-increasing number, a retrieval technique is adopted to keep the method usable as the data expand: the system only needs to update the commodity database incrementally, without retraining new models.
Identification of non-food material commodities specifically comprises the following substeps:
2.b.1, extracting key frames from the image frames of the video clip using ffmpeg;
2.b.2, preprocessing: a Fast R-CNN network is trained with the publicly available commodity data set RPC, finally achieving a 97.6% detection result on that data set. The RPC data set contains many commodity pictures; each picture is annotated with several detection boxes (bbox) around its commodity regions, and each detection box carries a commodity category label. During training we ignore the attached category labels and give all bboxes the uniform label 'commodity'. A commodity image library is also constructed; it contains many commodity images, each showing a single commodity on a clean background. For all pictures in the commodity library, features are extracted and an index is built with the compact visual search technique.
2.b.3, for each key frame, the trained Fast R-CNN is used for commodity region detection, generating multiple bboxes and their prediction scores (between 0 and 1, indicating how likely each bbox contains a commodity); only bboxes with a prediction score greater than 0.5 are retained.
2.b.4, for each key frame, the image is cropped with the retained bboxes, generating a plurality of local images.
2.b.5, for each key frame cropped into a plurality of local images, features are extracted from each local image with the compact visual search technique, and the index built over the commodity library is used to retrieve related commodities from the library, giving each local image a related-commodity list ordered from high to low correlation.
2.b.6, for the plurality of key frames of a video clip, each key frame has a plurality of local images and each local image has a related-commodity list; the local images are ordered from top to bottom by their prediction scores. The result is shown in fig. 4, where the circles represent commodity retrieval lists; within one horizontal row the commodities are unlikely to repeat, but a vertical column may contain repeats, because the detections of the local images do not affect one another.
The results of a key frame are first fused. Suppose there are k local images B_1 to B_k, ordered from high to low prediction score; for local image B_i, its top 30 commodities c_{i,1}, ..., c_{i,30} are taken, ordered from high to low correlation. When merging, a list L is maintained: first the top commodity of each of B_1 to B_k, i.e. c_{1,1}, ..., c_{k,1}, is added to L in sequence, skipping any commodity already in L; then the second commodity of each of B_1 to B_k, i.e. c_{1,2}, ..., c_{k,2}, is added in sequence; and so on, until L contains 30 commodities. Thus each key frame has a list L of length 30.
The results of all key frames are then fused. Suppose there are t key frames F_1 to F_t, ordered from high to low; for key frame F_i, its list L_i contains commodities d_{i,1}, ..., d_{i,30}, ordered from high to low correlation. During fusion, a list E is maintained: first the top commodities of F_1 to F_t, i.e. d_{1,1}, ..., d_{t,1}, are counted (the number of commodity categories and the number of occurrences of each category), and the categories are added to E in order from high to low; then the second commodities d_{1,2}, ..., d_{t,2} of F_1 to F_t are added to E, skipping any that are already present; and so on, until E contains 30 commodities.
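A minimal sketch of this round-robin style fusion, assuming simple Python lists of commodity identifiers (the occurrence-count ordering used in the cross-frame step is simplified to the same round-robin scheme here):

```python
# Illustrative sketch (assumed data layout): round-robin fusion of ranked retrieval lists,
# first across the local images of one key frame, then across the key frames of a clip.
def round_robin_merge(ranked_lists, limit=30):
    """Each element of ranked_lists is a list of commodity ids ordered by decreasing relevance."""
    merged = []
    depth = max((len(lst) for lst in ranked_lists), default=0)
    for rank in range(depth):                     # take rank 1 of every list, then rank 2, ...
        for lst in ranked_lists:
            if rank < len(lst) and lst[rank] not in merged:
                merged.append(lst[rank])          # skip commodities already in the merged list
            if len(merged) >= limit:
                return merged
    return merged

# Per-local-image lists of one key frame (dummy commodity ids), ordered by relevance:
local_lists = [["cola", "soda", "juice"], ["chips", "cola", "crackers"]]
frame_list = round_robin_merge(local_lists, limit=30)

# The same scheme is then applied across the per-key-frame lists of the clip:
clip_list = round_robin_merge([frame_list, ["cola", "candy", "chips"]], limit=30)
print(clip_list)
```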
We use a multi-commodity retrieval method to obtain the products purchased or liked by consumers. To obtain more accurate retrieval results, we first use the commodity position detection model to cut the image into several regions that may contain commodities, which increases the computation requirement and time. In addition, ultra-fine-grained commodity retrieval, such as distinguishing different flavors of the same potato-chip brand, faces small inter-class differences (for example in the text and texture of the commodity package). To address both of these issues, the compact visual search technique is used to retrieve products; it focuses more on local features and makes retrieval more efficient. The commodity regions are detected and cropped before the compact visual search technique is applied.
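A minimal sketch of the detection-and-crop stage, assuming torchvision (>= 0.13) and using its generic pretrained Faster R-CNN only as a stand-in for the Fast R-CNN trained on the RPC data set described above:

```python
# Illustrative sketch (assumed torchvision >= 0.13): detect candidate commodity regions in a
# key frame, keep boxes scoring above 0.5, and crop local images.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")   # stand-in for the RPC-trained detector
model.eval()

key_frame = torch.rand(3, 480, 640)                  # dummy key frame, RGB values in [0, 1]
with torch.no_grad():
    pred = model([key_frame])[0]                     # dict with 'boxes', 'scores', 'labels'

local_images = []
for box, score in zip(pred["boxes"], pred["scores"]):
    if score > 0.5:                                  # keep confident detections only
        x1, y1, x2, y2 = box.int().tolist()
        local_images.append(key_frame[:, y1:y2, x1:x2])   # cropped local image
print(len(local_images), "local images cropped")
```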
fig. 5 shows a schematic diagram of a feature extraction flow of the compact visual search technique.
The feature extraction of the compact visual search technique in steps 2.b.2 and 2.b.5 can be divided into 6 parts: interest point detection, local feature selection, local feature description, local feature compression, local feature position compression, and local feature aggregation. A block-based frequency-domain Laplacian of Gaussian (BFLoG) method is integrated with an ALP detector as the interest point detection method. Features are ranked by computing their relevance, and a fixed number of local features are selected. A SIFT descriptor is used as the feature descriptor. For local feature compression, a low-complexity transform coding scheme is adopted: a small linear transform is applied to the 8 values of each individual spatial bin of the SIFT descriptor, and only a subset of the transformed descriptor elements is included in the bitstream. The local feature positions are compressed with a histogram coding scheme, where the location data are represented as a spatial histogram consisting of a binary map and a set of histogram counts. For aggregation, a scalable compressed Fisher vector is used: a subset of Gaussian components from the Gaussian mixture model is selected based on the total feature data budget, and only the information in the selected components is retained; local feature aggregation selects a different set of components for each image based on the concentration of energy in the Fisher vector.
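A minimal sketch of the first three stages (interest point detection, selection of a fixed number of local features, and SIFT description), assuming OpenCV with SIFT support; the BFLoG/ALP detector, the descriptor and position compression, and the Fisher-vector aggregation of the full pipeline are not shown:

```python
# Illustrative sketch (assumed OpenCV build with SIFT support): detect interest points,
# keep a fixed number of the strongest ones, and compute SIFT descriptors on a dummy local image.
import cv2
import numpy as np

image = np.full((240, 320, 3), 255, dtype=np.uint8)        # dummy local image (white)
cv2.circle(image, (160, 120), 40, (0, 0, 0), 3)            # add some structure to detect
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

sift = cv2.SIFT_create()
keypoints = sift.detect(gray, None)                         # interest point detection
keypoints = sorted(keypoints, key=lambda k: k.response, reverse=True)[:300]   # fixed-size selection
keypoints, descriptors = sift.compute(gray, keypoints)      # 128-dimensional SIFT descriptors
print(len(keypoints), None if descriptors is None else descriptors.shape)
```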
In step 2.b.2, the index is built with the MBIT retrieval technique, which computes Hamming distances very quickly for long binary global descriptors. MBIT reduces the exhaustive distance computation between features to an aligned, component-to-component independent matching problem and constructs multiple hash tables for these components. Given a query descriptor, the candidate data are retrieved by using each binary sub-vector (i.e. component) of the query as an index into its corresponding hash table, which significantly reduces the number of candidate images needed for the subsequent linear search.
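A minimal sketch of this multi-index idea, assuming toy binary descriptors; it follows the spirit of component-wise hashing with an exact Hamming check on the shortlist, not the exact MBIT algorithm:

```python
# Illustrative sketch: split binary descriptors into aligned components, hash each component into
# its own table, shortlist candidates by component matches, then do an exact Hamming check.
from collections import defaultdict

def split(bits, parts=4):
    step = len(bits) // parts
    return [bits[i * step:(i + 1) * step] for i in range(parts)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

database = {                                       # dummy binary global descriptors
    "item_a": "1010110010100110",
    "item_b": "1110000011110000",
    "item_c": "1010110010111111",
}

tables = [defaultdict(set) for _ in range(4)]      # one hash table per component
for name, bits in database.items():
    for table, comp in zip(tables, split(bits)):
        table[comp].add(name)

query = "1010110010100111"
candidates = set()
for table, comp in zip(tables, split(query)):      # shortlist via component-to-component matches
    candidates |= table[comp]

ranked = sorted(candidates, key=lambda n: hamming(database[n], query))
print(ranked)                                      # exact Hamming distance only on the shortlist
```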
For video clips of the purchasing action, the top-1 predicted food material category (for food materials) and the top-1 retrieval result (for non-food materials) are used as the record of commodities purchased by the user; for video clips of the picking action, the food material categories with the top-3 prediction scores and the top-3 retrieval results are used as the record of commodities the user is interested in.
The final consumer shopping record is composed of the commodities purchased by the user and the commodities the user is interested in: the purchased commodities are the top-1 food material category from food material classification and the top-1 commodity from non-food material retrieval on the purchasing-action video clips, and the commodities of interest are the top-3 food material categories from food material classification and the top-3 commodities from non-food material retrieval on the picking-action video clips.
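A minimal sketch of assembling this final record, assuming per-clip action labels and ranked commodity lists as produced by the earlier steps (all names and values here are illustrative):

```python
# Illustrative sketch (dummy data): assemble the final consumer record, taking the top-1
# prediction on purchasing-action clips as purchased commodities and the top-3 predictions
# on picking-action clips as commodities of interest.
clip_results = [
    {"action": "purchase", "ranked_items": ["milk", "yogurt", "cheese"]},
    {"action": "pick",     "ranked_items": ["chips", "cola", "crackers"]},
    {"action": "walk",     "ranked_items": []},    # clips of other actions are ignored
]

record = {"purchased": [], "interested": []}
for clip in clip_results:
    if clip["action"] == "purchase" and clip["ranked_items"]:
        record["purchased"].append(clip["ranked_items"][0])        # top-1
    elif clip["action"] == "pick":
        record["interested"].extend(clip["ranked_items"][:3])      # top-3

print(record)
```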
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.
Claims (10)
1. A shopping analysis method based on a first person perspective shopping video is characterized by specifically comprising the following steps:
dividing the complete shopping video into a plurality of video segments;
extracting N image frames from each video segment, wherein N is an integer greater than 1;
analyzing the extracted image frames to obtain a shopping action type corresponding to the video segment; and
Identifying commodities corresponding to the video clips with preset shopping action types according to the obtained shopping action types corresponding to the video clips;
and establishing a corresponding relation between the identified commodities and the shopping action types corresponding to the commodities.
2. The method of claim 1, wherein the preset shopping action types include a picking action and a purchasing action; and
establishing a corresponding relation between the identified commodity and the shopping action type corresponding to the commodity, specifically comprising:
determining a plurality of first commodities identified in a video clip corresponding to the purchasing action as shopping records;
and determining the first commodities identified in the video clip corresponding to the picking action as records of commodities of interest to the user.
3. The method of claim 1, wherein analyzing the extracted image frames to obtain the shopping action types corresponding to the video segments specifically comprises:
analyzing the extracted image frames by using a non-local neural network to obtain the shopping action type corresponding to the video segment.
4. The method of claim 1, wherein identifying the commodity corresponding to the video segment of the preset shopping action type specifically comprises:
inputting a video clip corresponding to a preset shopping action type into a classification network to obtain a commodity type contained in the video clip, wherein the commodity type comprises a food material type or a non-food material type;
for food material commodities, identifying commodities of key frames by using a multi-classification model;
and for non-food material commodities, searching the non-food material commodities in the key frame by using a multi-object searching method.
5. The method of claim 3 or 4, wherein the base network of the non-local neural network is ResNet50, ResNet50 is converted to a 3D ResNet50 network, and a non-local block is inserted at the end of each of the first three blocks of the 3D ResNet50 network.
6. The method according to claim 5, characterized in that for food-class goods identification, the following sub-steps are included:
2.a.1, extracting key frames from the image frames of the video clip;
2.a.2, sequentially inputting the key frames into a pre-trained spatial regularization network to obtain each frame's prediction score on each food material category;
2.a.3, adding the corresponding category scores of all key frames and dividing the sum by the number of key frames to obtain the video clip's prediction score on each food material category.
7. The method of claim 6, wherein for non-food item commodity identification, the following sub-steps are specifically included:
2.b.1, extracting key frames from the image frames of the video clip;
2.b.2, preprocessing: training a Fast R-CNN network with the publicly available commodity data set RPC; the RPC data set contains many commodity pictures, each annotated with several detection boxes bbox, and all detection boxes bbox are given the uniform label 'commodity'; a commodity image library is also constructed, containing many commodity images, each showing a single commodity on a clean background, and for all pictures in the commodity library, features are extracted and an index is built with the compact visual search technique;
2.b.3, for each key frame, detecting commodity regions with the trained Fast R-CNN to generate several detection boxes bbox and their prediction scores, and retaining the detection boxes bbox whose prediction score is greater than 0.5;
2.b.4, for each key frame, cropping the image with the retained detection boxes bbox to generate a plurality of local images;
2.b.5, for each key frame cropped into a plurality of local images, extracting features from each local image with the compact visual search technique and retrieving related commodities from the commodity library using the index built over it, so that each local image obtains a related-commodity list ordered from high to low correlation;
2.b.6, for the plurality of key frames of a video clip, each key frame having a plurality of local images and each local image having a related-commodity list, fusing these lists according to the prediction scores of the local images to obtain the related-commodity list of the video clip.
8. The method of claim 7, wherein the spatial regularization network of step 2.a.2 comprises
the key frames are sequentially input into ResNet50, which provides a coarse class prediction ŷ_cls and preliminary features f_cls;
the preliminary features f_cls are input into a spatial regularization module, which generates two feature maps: an attention feature map f_att and a confidence feature map f_cof;
f_att is then re-weighted by f_cof, and a series of convolutional layers outputs a refined prediction ŷ_sr; applying a linear transformation to f_att also yields a rough prediction ŷ_att;
and the predicted value is obtained by combining ŷ_cls with ŷ_sr.
9. The method of claim 8, wherein the feature extraction of the compact visual search technique in steps 2.b.2 and 2.b.5 comprises point of interest detection, local feature selection, local feature description, local feature compression, local feature location compression, local feature aggregation.
10. The method of claim 9, wherein, in step one, for the inserted non-local block, the output at position i is y_i = (1/C(x)) * Σ_j f(x_i, x_j) g(x_j), wherein x_i is the input at position i, x_j is the input at position j, C(x) is a normalization factor, f(x_i, x_j) is a scalar reflecting the relationship between the two positions, and g(x_j) = W_g x_j, wherein W_g is a learnable weight matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910508074.9A CN110378215B (en) | 2019-06-12 | 2019-06-12 | Shopping analysis method based on first-person visual angle shopping video |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910508074.9A CN110378215B (en) | 2019-06-12 | 2019-06-12 | Shopping analysis method based on first-person visual angle shopping video |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110378215A true CN110378215A (en) | 2019-10-25 |
CN110378215B CN110378215B (en) | 2021-11-02 |
Family
ID=68250201
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910508074.9A Active CN110378215B (en) | 2019-06-12 | 2019-06-12 | Shopping analysis method based on first-person visual angle shopping video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110378215B (en) |
- 2019-06-12 CN CN201910508074.9A patent/CN110378215B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101978370A (en) * | 2008-03-21 | 2011-02-16 | 日升研发控股有限责任公司 | Acquiring actual real-time shopper behavior data during a shopper's product selection |
US20150154456A1 (en) * | 2012-07-11 | 2015-06-04 | Rai Radiotelevisione Italiana S.P.A. | Method and an apparatus for the extraction of descriptors from video content, preferably for search and retrieval purpose |
CN109063534A (en) * | 2018-05-25 | 2018-12-21 | 隆正信息科技有限公司 | A kind of shopping identification and method of expressing the meaning based on image |
CN109166007A (en) * | 2018-08-23 | 2019-01-08 | 深圳码隆科技有限公司 | A kind of Method of Commodity Recommendation and its device based on automatic vending machine |
CN109711481A (en) * | 2019-01-02 | 2019-05-03 | 京东方科技集团股份有限公司 | Neural network, correlation technique, medium and equipment for the identification of paintings multi-tag |
Non-Patent Citations (2)
Title |
---|
Zhu, Liuyi: "Research on Shelf Commodity Localization and Recognition Combining Template Matching and One-Shot Deep Learning", China Masters' Theses Full-text Database, Information Science and Technology Series *
Chen, Ruoyu: "Research on Methods for Recognizing Abnormal Human Behavior in Supermarkets", China Masters' Theses Full-text Database, Information Science and Technology Series *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113392671A (en) * | 2020-02-26 | 2021-09-14 | 上海依图信息技术有限公司 | Commodity retrieval method and device based on customer actions and electronic equipment |
CN113836981A (en) * | 2020-06-24 | 2021-12-24 | 阿里巴巴集团控股有限公司 | Data processing method, data processing device, storage medium and computer equipment |
CN112906759A (en) * | 2021-01-29 | 2021-06-04 | 哈尔滨工业大学 | Pure vision-based entrance-guard-free unmanned store checkout method |
Also Published As
Publication number | Publication date |
---|---|
CN110378215B (en) | 2021-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10671853B2 (en) | Machine learning for identification of candidate video insertion object types | |
US11775800B2 (en) | Methods and apparatus for detecting, filtering, and identifying objects in streaming video | |
CN110378215B (en) | Shopping analysis method based on first-person visual angle shopping video | |
Kuanar et al. | Video key frame extraction through dynamic Delaunay clustering with a structural constraint | |
JP3568117B2 (en) | Method and system for video image segmentation, classification, and summarization | |
CN102334118B (en) | Promoting method and system for personalized advertisement based on interested learning of user | |
US10467507B1 (en) | Image quality scoring | |
US20110106656A1 (en) | Image-based searching apparatus and method | |
Ullah et al. | Image-based service recommendation system: A JPEG-coefficient RFs approach | |
CN106021575A (en) | Retrieval method and device for same commodities in video | |
CN112714349B (en) | Data processing method, commodity display method and video playing method | |
CN111984824A (en) | Multi-mode-based video recommendation method | |
CN109934681B (en) | Recommendation method for user interested goods | |
US20200327600A1 (en) | Method and system for providing product recommendation to a user | |
CN105894362A (en) | Method and device for recommending related item in video | |
CN109272390A (en) | The personalized recommendation method of fusion scoring and label information | |
CN111327930A (en) | Method and device for acquiring target object, electronic equipment and storage medium | |
Fengzi et al. | Neural networks for fashion image classification and visual search | |
US20230230378A1 (en) | Method and system for selecting highlight segments | |
CN107944946B (en) | Commodity label generation method and device | |
Yang et al. | Keyframe recommendation based on feature intercross and fusion | |
JP2002513487A (en) | Algorithms and systems for video search based on object-oriented content | |
Kobs et al. | Indirect: Language-guided zero-shot deep metric learning for images | |
CN117132368A (en) | Novel media intelligent marketing platform based on AI | |
Vandecasteele et al. | Spott: On-the-spot e-commerce for television using deep learning-based video analysis techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |