CN109977896A - A kind of supermarket's intelligence vending system - Google Patents
- Publication number
- CN109977896A CN109977896A CN201910263910.1A CN201910263910A CN109977896A CN 109977896 A CN109977896 A CN 109977896A CN 201910263910 A CN201910263910 A CN 201910263910A CN 109977896 A CN109977896 A CN 109977896A
- Authority
- CN
- China
- Prior art keywords
- layer
- image
- exporting
- inputting
- region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/90—Determination of colour characteristics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
Abstract
The invention discloses a supermarket intelligent vending system aimed at the problem that scanning commodities one by one during traditional checkout is excessively time-consuming. By moving commodity entry forward to the moment the shopper picks up the goods, the time spent scanning items at checkout is removed, checkout speed is greatly improved, and the customer's shopping experience is improved. The invention uses pattern recognition algorithms to recognize and count the shopper's movements while selecting goods: images of the commodities are recognized when the customer picks up or puts back an item to obtain the commodity type; face recognition is performed on the customer, and body-image recognition is used to obtain the customer's identity when face recognition is unreliable; and abnormal customer behaviour is recognized to determine whether theft has occurred. The system achieves an automatic tallying function without degrading the customer's shopping experience. The invention requires no change to the supermarket's existing organizational structure for the customer's shopping and checkout process, and therefore interfaces seamlessly with the existing supermarket organization.
Description
Technical field
The present invention relates to the fields of computer-vision monitoring, target detection, target tracking and pattern recognition, and specifically to detecting, tracking and recognizing the actions of individuals in front of shelves based on monitoring cameras.
Background art
Under the traditional supermarket model, checkout is performed by scanning commodities one by one. This mode easily causes congestion: large numbers of shoppers queue at the cashier counters, and the throughput of the whole checkout process is limited by the space available for cashier counters and by the number of cashiers, so it cannot be increased substantially. Checkout congestion is therefore unavoidable under the traditional cash-register model. Existing self-scanning checkout, in which customers scan the commodities themselves, reduces the scanning time, but the commodities still need to be manually inspected at the exit, so congestion still occurs. Analysing the cause of the congestion, the most time-consuming step is entering the commodities; if commodity entry is moved forward to the moment the shopper picks up the goods, the most time-consuming step is performed in advance and in parallel, so checkout speed is greatly improved and the customer's shopping experience is improved.

The proposed system uses monitoring cameras to recognize and count actions while the shopper selects goods. It recognizes the customer's picking-up and putting-back of commodities to add to or subtract from the picked quantity, thereby improving checkout speed; it recognizes the commodity itself when the customer picks it up or puts it back to obtain the commodity type; and it recognizes abnormal customer behaviour to determine whether theft has occurred, thereby achieving an automatic tallying function without degrading the shopping experience during goods selection. The invention requires no change to the supermarket's existing organizational structure for the customer's shopping and checkout process, and therefore interfaces seamlessly with the existing supermarket organization.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the slow checkout of the traditional supermarket model by proposing a supermarket intelligent vending system in which recognition of customer shopping behaviour and recognition of commodities are completed using monitoring cameras.
The technical solution adopted by the present invention to solve this technical problem is as follows.
A supermarket intelligent vending system takes as input the video images captured by monitoring cameras fixed in the supermarket and on the shelves. It comprises an image preprocessing module, a target detection module, a shopping action recognition module, a product identification module, an individual identification module and a recognition result processing module. The image preprocessing module preprocesses the video images captured by the monitoring cameras: it first removes the noise that may be contained in the input image, then applies illumination compensation to the denoised image, then applies image enhancement to the illumination-compensated image, and finally passes the enhanced data to the target detection module. The target detection module performs target detection on the received image, detecting the whole-body region, the facial region, the hand region and the product region in the current video image; it sends the hand region and product region to the shopping action recognition module, sends the body region and facial region to the individual identification module, and passes the product region to the product identification module. The shopping action recognition module performs static action recognition on the received hand-region information to find the start frame in which the hand grasps, keeps recognizing subsequent frames until the putting-down action is found as the end frame, and then classifies the resulting video with a dynamic action recognition classifier as taking out an item, putting back an item, taking out and putting back, having taken out an item without putting it back, or suspected theft. The recognition result is sent to the recognition result processing module; videos containing only a grasping action or only a putting-down action are also sent to the recognition result processing module. The product identification module recognizes the video of the received product region, identifies which product is being moved, and sends the recognition result to the recognition result processing module; products can be added to or deleted from the product identification module at any time. The individual identification module recognizes the received face region and body region and, combining the two, identifies which individual in the supermarket the current person is, then sends the recognition result to the recognition result processing module. The recognition result processing module integrates the received recognition results: the customer ID passed by the individual identification module determines which customer the current shopping information belongs to, the result passed by the product identification module determines which product the current shopping action concerns, and the result passed by the shopping action recognition module determines whether the current shopping action modifies the shopping cart, thereby obtaining the current customer's shopping list. A suspected theft recognized by the shopping action recognition module raises an alarm.
The image preprocessing module works as follows. In the initialization phase the module does nothing. During detection: first, the monitoring image captured by the monitoring camera is median-denoised to obtain the denoised monitoring image; second, illumination compensation is applied to the denoised monitoring image to obtain the illumination-compensated image; third, image enhancement is applied to the illumination-compensated image, and the enhanced data are passed to the target detection module.
The median denoising of the monitoring image captured by the monitoring camera proceeds as follows. Let the captured monitoring image be X_src. Because X_src is a colour RGB image it has three components X_src-R, X_src-G and X_src-B. For each component X_src', proceed as follows: set a 3 × 3 window; for each pixel X_src'(i, j), sort the nine pixel values of the 3 × 3 window centred on that point, [X_src'(i-1, j-1), X_src'(i-1, j), X_src'(i-1, j+1), X_src'(i, j-1), X_src'(i, j), X_src'(i, j+1), X_src'(i+1, j-1), X_src'(i+1, j), X_src'(i+1, j+1)], from largest to smallest and assign the middle value to the denoised pixel X_src''(i, j). For boundary points of X_src', some positions of the 3 × 3 window fall outside the image; in that case compute the median of only the pixels that do fall inside the window, and if the number of such pixels is even, take the average of the two middle values as the denoised pixel value X_src''(i, j). The new image matrix X_src'' is the denoised image of the current RGB component. After the three components X_src-R, X_src-G and X_src-B have each been denoised in this way, the resulting components X_src-R'', X_src-G'' and X_src-B'' are combined into a new colour image X_Den, which is the denoised image.
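A minimal sketch, assuming numpy is available, of the per-channel 3 × 3 median denoising described above; boundary pixels use only the window positions that fall inside the image.

```python
import numpy as np

def median_denoise(x_src: np.ndarray) -> np.ndarray:
    """x_src: H x W x 3 RGB image; returns the median-filtered image X_Den."""
    h, w, _ = x_src.shape
    x_den = np.empty_like(x_src)
    for c in range(3):                       # process R, G, B independently
        chan = x_src[:, :, c]
        for i in range(h):
            for j in range(w):
                win = np.sort(chan[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2].ravel())
                n = win.size
                if n % 2:                    # odd count: take the middle value
                    x_den[i, j, c] = win[n // 2]
                else:                        # even count: average of the two middle values
                    x_den[i, j, c] = (int(win[n // 2 - 1]) + int(win[n // 2])) // 2
    return x_den
```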
The illumination compensation of the denoised monitoring image proceeds as follows. Let the denoised monitoring image be X_Den. Because X_Den is a colour RGB image it has three components; illumination compensation is applied to each component X_Den' separately, and the resulting components X_cpst' are combined into the colour RGB image X_cpst, which is X_Den after illumination compensation. The steps applied to each component X_Den' are: first, let X_Den' have m rows and n columns; construct X_Den'_sum and Num_Den as m × n matrices with all entries 0; the window size l and the step length s are computed from min(m, n), where min(m, n) is the minimum of m and n, ⌊·⌋ denotes the integer part and sqrt(l) is the square root of l; if l < 1 then l = 1. Second, taking the top-left coordinate of X_Den as (1, 1) and starting from (1, 1), determine each candidate frame according to window size l and step length s, a candidate frame being the region [(a, b), (a+l, b+l)]. For the image matrix of X_Den' inside the candidate-frame region, perform histogram equalization to obtain the equalized image matrix X_Den'' of the region [(a, b), (a+l, b+l)]; then for each element of X_Den'_sum in the region [(a, b), (a+l, b+l)] compute X_Den'_sum(a+i_Xsum, b+j_Xsum) = X_Den'_sum(a+i_Xsum, b+j_Xsum) + X_Den''(i_Xsum, j_Xsum), where (i_Xsum, j_Xsum) are integers with 1 ≤ i_Xsum ≤ l and 1 ≤ j_Xsum ≤ l, and add 1 to each element of Num_Den in the region [(a, b), (a+l, b+l)]. Finally, compute X_cpst'(i_XsumNum, j_XsumNum) = X_Den'_sum(i_XsumNum, j_XsumNum) / Num_Den(i_XsumNum, j_XsumNum) for every point (i_XsumNum, j_XsumNum) of X_Den, which gives X_cpst', the illumination compensation of the current component X_Den'.
The determination of each candidate frame according to window size l and step length s proceeds as follows. Let the monitoring image have m rows and n columns, let (a, b) be the top-left coordinate of the selected region and (a+l, b+l) its bottom-right coordinate, the region being denoted [(a, b), (a+l, b+l)]; the initial value of (a, b) is (1, 1).
While a + l ≤ m:
 b = 1;
 While b + l ≤ n:
  the selected region is [(a, b), (a+l, b+l)];
  b = b + s;
 the inner loop ends;
 a = a + s;
the outer loop ends.
Each region [(a, b), (a+l, b+l)] selected in this process is a candidate frame.
The histogram equalization of the image matrix of X_Den' inside the candidate-frame region proceeds as follows. Let the candidate-frame region be [(a, b), (a+l, b+l)] and let X_Den'' be the image information of X_Den' in that region. First, construct a vector I, where I(i_I) is the number of pixels of X_Den'' whose value equals i_I, 0 ≤ i_I ≤ 255. Second, compute the vector I', the cumulative histogram of I rescaled to the range [0, 255]. Third, for each point (i_XDen, j_XDen) of X_Den'' with pixel value X_Den''(i_XDen, j_XDen), set X_Den''(i_XDen, j_XDen) = I'(X_Den''(i_XDen, j_XDen)). When all pixel values of X_Den'' have been computed and replaced, the histogram equalization ends, and the values stored in X_Den'' are its result.
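A minimal sketch, assuming numpy is available, of the sliding-window illumination compensation described above: each l × l window is histogram-equalized, the equalized windows are accumulated, and every pixel is divided by the number of windows covering it. The exact expressions for l and s are not reproduced in the text, so l = ⌊sqrt(min(m, n))⌋ and s = ⌊sqrt(l)⌋ are assumed here.

```python
import numpy as np

def hist_equalize(patch: np.ndarray) -> np.ndarray:
    """Standard histogram equalization of a uint8 patch."""
    hist = np.bincount(patch.ravel(), minlength=256)
    cdf = np.cumsum(hist)
    lut = np.floor(255.0 * cdf / patch.size).astype(np.uint8)
    return lut[patch]

def illumination_compensate(chan: np.ndarray) -> np.ndarray:
    """chan: one denoised channel (H x W, uint8); returns the compensated channel."""
    h, w = chan.shape
    l = max(int(np.sqrt(min(h, w))), 1)      # assumed window size
    s = max(int(np.sqrt(l)), 1)              # assumed step length
    acc = np.zeros((h, w), dtype=np.float64)
    num = np.zeros((h, w), dtype=np.float64)
    a = 0
    while a + l <= h:
        b = 0
        while b + l <= w:
            acc[a:a+l, b:b+l] += hist_equalize(chan[a:a+l, b:b+l])
            num[a:a+l, b:b+l] += 1
            b += s
        a += s
    uncovered = num == 0                     # pixels never covered keep their original value
    num[uncovered] = 1
    out = acc / num
    out[uncovered] = chan[uncovered]
    return np.clip(out, 0, 255).astype(np.uint8)
```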
The image enhancement of the illumination-compensated image proceeds as follows. Let the illumination-compensated image be X_cpst with RGB channels X_cpstR, X_cpstG and X_cpstB, and let X_enh be the image obtained after enhancement. The steps are: first, for each of the channels X_cpstR, X_cpstG and X_cpstB, compute the image blurred at the specified scale. Second, construct matrices LX_enhR, LX_enhG and LX_enhB of the same dimensions as X_cpstR; for the R channel of X_cpst compute LX_enhR(i, j) = log(X_cpstR(i, j)) − LX_cpstR(i, j), where (i, j) ranges over all points of the image matrix and LX_cpstR is the blurred R channel obtained in the first step; the G and B channels use the same algorithm to obtain LX_enhG and LX_enhB. Third, for the R channel compute the mean MeanR and the mean-square deviation VarR of all values of LX_enhR, compute MinR = MeanR − 2 × VarR and MaxR = MeanR + 2 × VarR, and then compute X_enhR(i, j) = Fix((LX_enhR(i, j) − MinR) / (MaxR − MinR) × 255), where Fix takes the integer part, values < 0 are set to 0 and values > 255 are set to 255. The G and B channels use the same algorithm to obtain X_enhG and X_enhB, and the channels X_enhR, X_enhG and X_enhB are combined into the colour image X_enh.
The computation of the blurred image at the specified scale for each of the channels X_cpstR, X_cpstG and X_cpstB proceeds as follows, taking the R channel X_cpstR as an example. Define the Gaussian function G(x, y, σ) = k × exp(−(x² + y²)/σ²), where σ is the scale parameter and k = 1 / ∫∫G(x, y) dx dy; then for each point X_cpstR(i, j) compute LX_cpstR(i, j) = Fix((X_cpstR ∗ G)(i, j)), where ∗ denotes convolution; for points closer to the boundary than the scale σ, only the convolution of X_cpstR with the corresponding part of G(x, y, σ) is computed; Fix() takes the integer part, values < 0 are set to 0 and values > 255 are set to 255. The G and B channels are blurred with the same algorithm to update X_cpstG and X_cpstB.
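A minimal sketch, assuming numpy and SciPy are available, of the per-channel enhancement described above (a single-scale Retinex-style operation): the logarithm of the channel minus the logarithm of its Gaussian blur, stretched to [0, 255] using the mean plus or minus two standard deviations. The scale parameter σ is not specified in the text, so σ = 80 is assumed.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def enhance_channel(chan: np.ndarray, sigma: float = 80.0) -> np.ndarray:
    """chan: one illumination-compensated channel (H x W); returns the enhanced channel."""
    chan = chan.astype(np.float64) + 1.0            # avoid log(0)
    blurred = gaussian_filter(chan, sigma=sigma)    # convolution with the Gaussian G(x, y, sigma)
    lx = np.log(chan) - np.log(blurred)             # LX_enh for this channel
    mean, std = lx.mean(), lx.std()
    lo, hi = mean - 2.0 * std, mean + 2.0 * std     # MinR / MaxR
    out = (lx - lo) / (hi - lo) * 255.0
    return np.clip(out, 0, 255).astype(np.uint8)
```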
The target detection module uses, during initialization, images in which the body region, facial region, hand region and product region have been calibrated to initialize the parameters of the target detection algorithm. During detection it receives the images passed by the image preprocessing module and processes them: each frame is subjected to target detection with the target detection algorithm to obtain the body region, facial region, hand region and product region of the current image; the hand region and product region are then sent to the shopping action recognition module, the body region and facial region are sent to the individual identification module, and the product region is passed to the product identification module.
The parameter initialization of the target detection algorithm using images with calibrated body regions, facial regions, hand regions and product regions consists of the following steps: first, construct the feature-extraction depth network; second, construct the region selection network; third, for each image X in the database used to construct the feature-extraction depth network and each manually calibrated region of that image, pass the image X and the region through the ROI layer, whose output is of dimension 7 × 7 × 512; fourth, build the coordinate refining network.
The feature-extraction depth network is a deep-learning network with the following structure:
Layer 1: convolution, input 768 × 1024 × 3, output 768 × 1024 × 64, channels = 64.
Layer 2: convolution, input 768 × 1024 × 64, output 768 × 1024 × 64, channels = 64.
Layer 3: pooling; input is the output of layer 1 (768 × 1024 × 64) concatenated with the output of layer 2 (768 × 1024 × 64) along the third dimension; output 384 × 512 × 128.
Layer 4: convolution, input 384 × 512 × 128, output 384 × 512 × 128, channels = 128.
Layer 5: convolution, input 384 × 512 × 128, output 384 × 512 × 128, channels = 128.
Layer 6: pooling; input is the output of layer 4 (384 × 512 × 128) concatenated with the output of layer 5 (384 × 512 × 128) along the third dimension; output 192 × 256 × 256.
Layer 7: convolution, input 192 × 256 × 256, output 192 × 256 × 256, channels = 256.
Layer 8: convolution, input 192 × 256 × 256, output 192 × 256 × 256, channels = 256.
Layer 9: convolution, input 192 × 256 × 256, output 192 × 256 × 256, channels = 256.
Layer 10: pooling; input is the output of layer 7 (192 × 256 × 256) concatenated with the output of layer 9 (192 × 256 × 256) along the third dimension; output 96 × 128 × 512.
Layer 11: convolution, input 96 × 128 × 512, output 96 × 128 × 512, channels = 512.
Layer 12: convolution, input 96 × 128 × 512, output 96 × 128 × 512, channels = 512.
Layer 13: convolution, input 96 × 128 × 512, output 96 × 128 × 512, channels = 512.
Layer 14: pooling; input is the output of layer 11 (96 × 128 × 512) concatenated with the output of layer 13 (96 × 128 × 512) along the third dimension; output 48 × 64 × 1024.
Layer 15: convolution, input 48 × 64 × 1024, output 48 × 64 × 512, channels = 512.
Layer 16: convolution, input 48 × 64 × 512, output 48 × 64 × 512, channels = 512.
Layer 17: convolution, input 48 × 64 × 512, output 48 × 64 × 512, channels = 512.
Layer 18: pooling; input is the output of layer 15 (48 × 64 × 512) concatenated with the output of layer 17 (48 × 64 × 512) along the third dimension; output 48 × 64 × 1024.
Layer 19: convolution, input 48 × 64 × 1024, output 48 × 64 × 256, channels = 256.
Layer 20: pooling, input 48 × 64 × 256, output 24 × 32 × 256.
Layer 21: convolution, input 24 × 32 × 256, output 24 × 32 × 256, channels = 256.
Layer 22: pooling, input 24 × 32 × 256, output 12 × 16 × 256.
Layer 23: convolution, input 12 × 16 × 256, output 12 × 16 × 128, channels = 128.
Layer 24: pooling, input 12 × 16 × 128, output 6 × 8 × 128.
Layer 25: fully connected; the 6 × 8 × 128 input is first flattened into a 6144-dimensional vector and then fed to the fully connected layer; output vector length 768; relu activation.
Layer 26: fully connected, input vector length 768, output vector length 96, relu activation.
Layer 27: fully connected, input vector length 96, output vector length 2, soft-max activation.
All convolutional layers use kernel size 3, stride (1, 1) and relu activation; all pooling layers are max-pooling layers with pooling window kernel_size = 2 and stride (2, 2). Denote this depth network by Fconv27; for a colour image X, the set of feature maps obtained from the network is denoted Fconv27(X). The evaluation function of the network is the cross-entropy loss computed on (Fconv27(X) − y), minimized during training, where y is the class corresponding to the input. The database consists of images collected in natural scenes that contain passers-by and non-passers-by; each image is a colour image of dimension 768 × 1024 and is labelled into one of two classes according to whether it contains a pedestrian; 2000 training iterations are used. After training, layers 1 to 17 are taken as the feature-extraction depth network Fconv, and Fconv(X) denotes the output of this network for a colour image X.
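A minimal PyTorch sketch, offered as an illustration rather than the patent's training code, of the repeating block used by layers 1 to 3: two 3 × 3 convolutions whose outputs are concatenated along the channel dimension and then max-pooled with kernel 2 and stride 2, so each block halves the spatial size and doubles the channels. Padding of 1 is assumed so the convolutions preserve the spatial size.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.act = nn.ReLU()

    def forward(self, x):
        a = self.act(self.conv1(x))
        b = self.act(self.conv2(a))
        # concatenate the two convolution outputs along the channel (third) dimension, then pool
        return self.pool(torch.cat([a, b], dim=1))

block1 = ConvBlock(3, 64)
out = block1(torch.randn(1, 3, 96, 128))   # -> (1, 128, 48, 64); the full network input is 768x1024x3
```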
The region selection network receives the set Fconv(X) of 512 feature maps of size 48 × 64 produced by the depth network Fconv. First, a convolutional layer produces Conv1(Fconv(X)); its parameters are kernel size 1, stride (1, 1), input 48 × 64 × 512, output 48 × 64 × 512, channels = 512. Conv1(Fconv(X)) is then fed separately into two convolutional layers, Conv2-1 and Conv2-2. The structure of Conv2-1 is: input 48 × 64 × 512, output 48 × 64 × 18, channels = 18; its output Conv2-1(Conv1(Fconv(X))) is passed through the softmax activation to give softmax(Conv2-1(Conv1(Fconv(X)))). The structure of Conv2-2 is: input 48 × 64 × 512, output 48 × 64 × 36, channels = 36. The network has two loss functions: the first error function loss1 is the softmax error of Wshad-cls(X) ⊙ (Conv2-1(Conv1(Fconv(X))) − Wcls(X)); the second error function loss2 is the smooth L1 error of Wshad-reg(X) ⊙ (Conv2-2(Conv1(Fconv(X))) − Wreg(X)). The loss function of the region selection network is loss1 / sum(Wcls(X)) + loss2 / sum(Wcls(X)), where sum(·) is the sum of all elements of a matrix, and it is minimized. Wcls(X) and Wreg(X) are the positive and negative sample information corresponding to database image X, ⊙ denotes element-wise multiplication, and Wshad-cls(X) and Wshad-reg(X) are masks that select the entries of Wcls(X) and Wreg(X) whose weight is 1 for training, so as to avoid an excessive gap between the numbers of positive and negative samples; Wshad-cls(X) and Wshad-reg(X) are regenerated at every iteration, and the algorithm runs for 1000 iterations.
The database used to construct the feature-extraction depth network is prepared as follows. For each image in the database: step 1, manually calibrate every body region, facial region, hand region and product region; if a region's centre coordinate in the input image is (a_bas_tr, b_bas_tr), its vertical distance from the centre to the top and bottom edges is l_bas_tr, and its horizontal distance from the centre to the left and right edges is w_bas_tr, then its position on Conv1 has centre coordinate (⌊a_bas_tr/16⌋, ⌊b_bas_tr/16⌋), half-length ⌊l_bas_tr/16⌋ and half-width ⌊w_bas_tr/16⌋, where ⌊·⌋ denotes the integer part; step 2, generate positive and negative samples at random.
The random generation of positive and negative samples proceeds as follows: first, construct 9 regional frames; second, for each image X_tr in the database, let Wcls be of dimension 48 × 64 × 18 and Wreg of dimension 48 × 64 × 36, both initialized to 0, and fill Wcls and Wreg.
The 9 regional frames are constructed as follows: Ro1(x_Ro, y_Ro) = (x_Ro, y_Ro, 64, 64), Ro2(x_Ro, y_Ro) = (x_Ro, y_Ro, 45, 90), Ro3(x_Ro, y_Ro) = (x_Ro, y_Ro, 90, 45), Ro4(x_Ro, y_Ro) = (x_Ro, y_Ro, 128, 128), Ro5(x_Ro, y_Ro) = (x_Ro, y_Ro, 90, 180), Ro6(x_Ro, y_Ro) = (x_Ro, y_Ro, 180, 90), Ro7(x_Ro, y_Ro) = (x_Ro, y_Ro, 256, 256), Ro8(x_Ro, y_Ro) = (x_Ro, y_Ro, 360, 180), Ro9(x_Ro, y_Ro) = (x_Ro, y_Ro, 180, 360). For each regional frame Roi(x_Ro, y_Ro), with i from 1 to 9, (x_Ro, y_Ro) is the centre coordinate of the current frame, the third entry is the pixel distance from the centre to the top and bottom edges, and the fourth entry is the pixel distance from the centre to the left and right edges.
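A minimal sketch of the nine regional frames as (centre, half-height, half-width) tuples; the helper name anchors_at is illustrative rather than taken from the patent.

```python
# (half-height, half-width) pairs of Ro_1 .. Ro_9
ANCHOR_HALF_SIZES = [
    (64, 64), (45, 90), (90, 45),
    (128, 128), (90, 180), (180, 90),
    (256, 256), (360, 180), (180, 360),
]

def anchors_at(x_ro: int, y_ro: int):
    """Return the nine regional frames Ro_i(x_ro, y_ro) for one centre point."""
    return [(x_ro, y_ro, hh, hw) for hh, hw in ANCHOR_HALF_SIZES]
```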
Wcls and Wreg are filled as follows. For each manually calibrated body region, let its centre coordinate in the input image be (a_bas_tr, b_bas_tr), its vertical distance from the centre to the top and bottom edges l_bas_tr, and its horizontal distance from the centre to the left and right edges w_bas_tr; its position on Conv1 then has centre coordinate (⌊a_bas_tr/16⌋, ⌊b_bas_tr/16⌋), half-length ⌊l_bas_tr/16⌋ and half-width ⌊w_bas_tr/16⌋. For each point (x_Ctr, y_Ctr) in the rectangle on Conv1 bounded by this centre minus the half-length and half-width (upper-left corner) and this centre plus the half-length and half-width (lower-right corner):
for i from 1 to 9:
 for the point (x_Ctr, y_Ctr), consider its mapping range in the database image, namely the 16 × 16 rectangle with upper-left corner (16(x_Ctr − 1) + 1, 16(y_Ctr − 1) + 1) and lower-right corner (16·x_Ctr, 16·y_Ctr); for each point (x_Otr, y_Otr) in that rectangle:
  compute the coincidence rate between the region Roi(x_Otr, y_Otr) and the currently calibrated region;
 select the point (x_IoUMax, y_IoUMax) with the highest coincidence rate in the current 16 × 16 rectangle; if its coincidence rate > 0.7 then Wcls(x_Ctr, y_Ctr, 2i−1) = 1 and Wcls(x_Ctr, y_Ctr, 2i) = 0, the point is a positive sample, and Wreg(x_Ctr, y_Ctr, 4i−3) = (x_IoUMax − 16·x_Ctr + 8)/8, Wreg(x_Ctr, y_Ctr, 4i−2) = (y_IoUMax − 16·y_Ctr + 8)/8, Wreg(x_Ctr, y_Ctr, 4i−1) = Down1(l_bas_tr / third entry of Roi), Wreg(x_Ctr, y_Ctr, 4i) = Down1(w_bas_tr / fourth entry of Roi), where Down1(·) clips values greater than 1 to 1; if the coincidence rate < 0.3 then Wcls(x_Ctr, y_Ctr, 2i−1) = 0 and Wcls(x_Ctr, y_Ctr, 2i) = 1; otherwise Wcls(x_Ctr, y_Ctr, 2i−1) = −1 and Wcls(x_Ctr, y_Ctr, 2i) = −1.
If no Roi(x_Otr, y_Otr) has coincidence rate > 0.6 with the currently calibrated body region, the Roi(x_Otr, y_Otr) with the highest coincidence rate is used to assign Wcls and Wreg, with the same assignment method as for coincidence rate > 0.7.
The coincidence rate between the region Roi(x_Otr, y_Otr) and the manually calibrated region is computed as follows. Let the manually calibrated body region have centre coordinate (a_bas_tr, b_bas_tr) in the input image, vertical distance from centre to top and bottom edges l_bas_tr, and horizontal distance from centre to left and right edges w_bas_tr; let the third and fourth entries of Roi(x_Otr, y_Otr) be l_Otr and w_Otr. If |x_Otr − a_bas_tr| ≤ l_Otr + l_bas_tr − 1 and |y_Otr − b_bas_tr| ≤ w_Otr + w_bas_tr − 1, the two regions overlap and the overlapping area = (l_Otr + l_bas_tr − 1 − |x_Otr − a_bas_tr|) × (w_Otr + w_bas_tr − 1 − |y_Otr − b_bas_tr|); otherwise the overlapping area = 0. The whole area = (2·l_Otr − 1) × (2·w_Otr − 1) + (2·l_bas_tr − 1) × (2·w_bas_tr − 1) − overlapping area. The coincidence rate = overlapping area / whole area, where |·| denotes the absolute value.
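A minimal sketch of the coincidence-rate computation above for two boxes given as (centre x, centre y, half-length, half-width), with the half-sizes measured in pixels from the centre to the box edges.

```python
def coincidence_rate(box_a, box_b) -> float:
    """Intersection-over-union style coincidence rate between two centre/half-size boxes."""
    xa, ya, la, wa = box_a
    xb, yb, lb, wb = box_b
    if abs(xa - xb) <= la + lb - 1 and abs(ya - yb) <= wa + wb - 1:
        overlap = (la + lb - 1 - abs(xa - xb)) * (wa + wb - 1 - abs(ya - yb))
    else:
        overlap = 0
    whole = (2 * la - 1) * (2 * wa - 1) + (2 * lb - 1) * (2 * wb - 1) - overlap
    return overlap / whole
```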
The masks Wshad-cls(X) and Wshad-reg(X) are constructed as follows. For an image X with positive/negative sample information Wcls(X) and Wreg(X): first, construct Wshad-cls(X) and Wshad-reg(X), where Wshad-cls(X) has the same dimensions as Wcls(X) and Wshad-reg(X) has the same dimensions as Wreg(X). Second, record the information of all positive samples: for i = 1 to 9, if Wcls(X)(a, b, 2i−1) = 1 then Wshad-cls(X)(a, b, 2i−1) = 1, Wshad-cls(X)(a, b, 2i) = 1, Wshad-reg(X)(a, b, 4i−3) = 1, Wshad-reg(X)(a, b, 4i−2) = 1, Wshad-reg(X)(a, b, 4i−1) = 1 and Wshad-reg(X)(a, b, 4i) = 1; altogether sum(Wshad-cls(X)) positive samples are selected, where sum(·) sums all elements of a matrix, and if sum(Wshad-cls(X)) > 256, 256 positive samples are retained at random. Third, select negative samples at random: randomly choose (a, b, i) and, if Wcls(X)(a, b, 2i) = 1, set Wshad-cls(X)(a, b, 2i−1) = 1, Wshad-cls(X)(a, b, 2i) = 1, Wshad-reg(X)(a, b, 4i−3) = 1, Wshad-reg(X)(a, b, 4i−2) = 1, Wshad-reg(X)(a, b, 4i−1) = 1 and Wshad-reg(X)(a, b, 4i) = 1. The number of negative samples chosen is 256 − sum(Wshad-cls(X)); if the negative samples are insufficient, that is, no negative sample can be obtained in 20 consecutive draws of random (a, b, i), the algorithm terminates.
The ROI layer takes as input an image X and a region (a, b, l, w). Its method is: for image X, the output Fconv(X) of the feature-extraction depth network Fconv has dimension 48 × 64 × 512; for each of the 512 matrices V_ROI_I of size 48 × 64, extract the sub-region of V_ROI_I corresponding to the input region, bounded by its upper-left and lower-right corners mapped onto the feature map, where ⌊·⌋ denotes the integer part. The output roi_I(X) has dimension 7 × 7 and is obtained by dividing the extracted region into a 7 × 7 grid of cells:
for i_ROI = 1 to 7:
 for j_ROI = 1 to 7:
  construct the cell of the grid with index (i_ROI, j_ROI);
  roi_I(X)(i_ROI, j_ROI) = the value of the maximum point in that cell.
When all 512 matrices of size 48 × 64 have been processed, the 512 outputs are spliced into a 7 × 7 × 512 output ROI(X, (a, b, l, w)), which represents the ROI of image X within the given regional frame.
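A minimal numpy sketch of the ROI layer: the region, given as a centre plus half-sizes in image coordinates, is mapped onto the 48 × 64 feature maps and max-pooled into a fixed 7 × 7 grid per channel. The exact rounding used in the patent's mapping is not reproduced in the text, so a stride-16 floor/ceil mapping is assumed here.

```python
import numpy as np

def roi_pool(fconv: np.ndarray, a: float, b: float, l: float, w: float) -> np.ndarray:
    """fconv: 48 x 64 x 512 feature maps; (a, b, l, w): centre and half-sizes in image pixels."""
    rows, cols = fconv.shape[0], fconv.shape[1]
    top    = min(max(int(np.floor((a - l) / 16.0)), 0), rows - 1)
    left   = min(max(int(np.floor((b - w) / 16.0)), 0), cols - 1)
    bottom = min(max(int(np.ceil((a + l) / 16.0)), top + 1), rows)
    right  = min(max(int(np.ceil((b + w) / 16.0)), left + 1), cols)
    region = fconv[top:bottom, left:right, :]
    h, wd = region.shape[:2]
    out = np.zeros((7, 7, fconv.shape[2]), dtype=fconv.dtype)
    for i in range(7):                        # divide the region into a 7 x 7 grid of cells
        for j in range(7):
            r0, r1 = (i * h) // 7, max(((i + 1) * h) // 7, (i * h) // 7 + 1)
            c0, c1 = (j * wd) // 7, max(((j + 1) * wd) // 7, (j * wd) // 7 + 1)
            out[i, j, :] = region[r0:r1, c0:c1, :].max(axis=(0, 1))   # max over each cell
    return out
```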
The coordinate refining network is built as follows. First, extend the database: for each image X in the database and each manually calibrated region (a, b, l, w), its ROI is ROI(X, (a, b, l, w)); set BClass = [1, 0, 0, 0, 0] and BBox = [0, 0, 0, 0] if the current region is a body region, BClass = [0, 1, 0, 0, 0] and BBox = [0, 0, 0, 0] if it is a facial region, BClass = [0, 0, 1, 0, 0] and BBox = [0, 0, 0, 0] if it is a hand region, and BClass = [0, 0, 0, 1, 0] and BBox = [0, 0, 0, 0] if it is a product region. Random numbers a_rand, b_rand, l_rand, w_rand with values between −1 and 1 are generated to obtain a new region perturbed from (a, b, l, w) (⌊·⌋ denoting the integer part), whose BBox = [a_rand, b_rand, l_rand, w_rand]; if the coincidence rate of the new region with the calibrated region is > 0.7 its BClass is that of the current region, if the coincidence rate is < 0.3 its BClass = [0, 0, 0, 0, 1], and otherwise no assignment is made. Each region generates at most 10 positive sample regions; if Num1 positive sample regions are generated, Num1 + 1 negative sample regions are generated, and if there are not enough negative sample regions the ranges of a_rand, b_rand, l_rand and w_rand are widened until enough negative samples are found. Second, build the coordinate refining network: for each image X in the database and each calibrated region, its 7 × 7 × 512 ROI is flattened into a 25088-dimensional vector and passed through two fully connected layers Fc2 to obtain the output Fc2(ROI); Fc2(ROI) is then passed through a classification layer FClass and an interval fine-tuning layer FBBox respectively, giving outputs FClass(Fc2(ROI)) and FBBox(Fc2(ROI)). The classification layer FClass is a fully connected layer with input vector length 512 and output vector length 5; the interval fine-tuning layer FBBox is a fully connected layer with input vector length 512 and output vector length 4. The network has two loss functions: the first error function loss1 is the softmax error of FClass(Fc2(ROI)) − BClass, and the second error function loss2 is the Euclidean distance error of (FBBox(Fc2(ROI)) − BBox); the whole loss function of the refining network is loss1 + loss2. The iteration process is: first iterate 1000 times to converge the error function loss2, then iterate 1000 times to converge the whole loss function.
The two fully connected layers Fc2 have the structure: layer 1: fully connected, input vector length 25088, output vector length 4096, relu activation; layer 2: fully connected, input vector length 4096, output vector length 512, relu activation.
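A minimal PyTorch sketch, offered as an illustration rather than the patent's code, of the coordinate refining head: the flattened 7 × 7 × 512 ROI passes through the two fully connected layers Fc2 and then through the classification layer FClass (5 classes) and the interval fine-tuning layer FBBox (4 offsets).

```python
import torch
import torch.nn as nn

class RefineHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc2 = nn.Sequential(
            nn.Linear(7 * 7 * 512, 4096), nn.ReLU(),
            nn.Linear(4096, 512), nn.ReLU(),
        )
        self.fclass = nn.Linear(512, 5)   # body / face / hand / product / negative sample
        self.fbbox = nn.Linear(512, 4)    # offsets for (a, b, l, w)

    def forward(self, roi):
        x = self.fc2(roi.flatten(start_dim=1))
        return self.fclass(x), self.fbbox(x)

head = RefineHead()
cls_scores, bbox_offsets = head(torch.randn(1, 512, 7, 7))
```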
The target detection performed on each frame with the target detection algorithm, yielding the body region, facial region, hand region and product region of the current image, proceeds as follows.
Step 1: split the input image X_cpst into sub-images of dimension 768 × 1024.
Step 2: for each sub-image Xs:
Step 2.1: transform it with the feature-extraction depth network Fconv constructed during initialization to obtain the set of 512 feature sub-maps Fconv(Xs);
Step 2.2: apply to Fconv(Xs) the first layer Conv1, the second layer Conv2-1 followed by the softmax activation, and Conv2-2 of the region selection network, obtaining the outputs softmax(Conv2-1(Conv1(Fconv(Xs)))) and Conv2-2(Conv1(Fconv(Xs))), and from these output values obtain all preliminary candidate regions of the sub-image;
Step 2.3: for all preliminary candidate regions of all sub-images of the current frame:
Step 2.3.1: rank them by the score of the current candidate region and choose the 50 preliminary candidate regions with the highest scores as candidate regions;
Step 2.3.2: adjust all out-of-bounds candidate regions in the candidate set, then weed out overlapping frames from the candidate regions to obtain the final candidate regions;
Step 2.3.3: input the sub-image Xs and each final candidate region to the ROI layer to obtain the corresponding ROI output; if the current final candidate region is (a_BB(1), b_BB(2), l_BB(3), w_BB(4)), compute FBBox(Fc2(ROI)) to obtain the four outputs (Out_BB(1), Out_BB(2), Out_BB(3), Out_BB(4)) and hence the updated coordinates (a_BB(1) + 8 × Out_BB(1), b_BB(2) + 8 × Out_BB(2), l_BB(3) + 8 × Out_BB(3), w_BB(4) + 8 × Out_BB(4)); then compute the output of FClass(Fc2(ROI)): if the first position is the maximum the current region is a body region, if the second position is the maximum it is a facial region, if the third position is the maximum it is a hand region, if the fourth position is the maximum it is a product region, and if the fifth position is the maximum it is a negative sample region and the final candidate region is deleted.
Step 3: update the coordinates of all refined final candidate regions of all sub-images; the update method is: if the coordinates of the current candidate region are (TLx, TLy, RBx, RBy) and the top-left coordinate of the corresponding sub-image is (Se_a_sub, Se_b_sub), the updated coordinates are (TLx + Se_a_sub − 1, TLy + Se_b_sub − 1, RBx, RBy).
The splitting of the input image X_cpst into sub-images of dimension 768 × 1024 proceeds as follows. Let the splitting step lengths be 384 and 512, let the image have m rows and n columns, and let (a_sub, b_sub) be the top-left coordinate of the selected region, with initial value (1, 1).
While a_sub < m:
 b_sub = 1;
 While b_sub < n:
  the selected region is [(a_sub, b_sub), (a_sub + 768, b_sub + 1024)]; the image information of X_cpst in this region is copied into a new sub-image, with the top-left coordinate (a_sub, b_sub) attached as location information; if the selected region extends beyond the input image X_cpst, the RGB values of the pixels outside the image are set to 0;
  b_sub = b_sub + 512;
 the inner loop ends;
 a_sub = a_sub + 384;
the outer loop ends.
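A minimal numpy sketch of the splitting described above: overlapping 768 × 1024 sub-images are cut with steps 384 and 512, regions that extend past the image border are zero-padded, and each sub-image carries its top-left coordinate as location information.

```python
import numpy as np

def split_into_subimages(x_cpst: np.ndarray):
    """x_cpst: m x n x 3 image; returns a list of (top_row, top_col, 768 x 1024 x 3 tile)."""
    m, n = x_cpst.shape[:2]
    subimages = []
    a = 0
    while a < m:
        b = 0
        while b < n:
            tile = np.zeros((768, 1024, 3), dtype=x_cpst.dtype)   # out-of-range pixels stay 0
            src = x_cpst[a:a + 768, b:b + 1024, :]
            tile[:src.shape[0], :src.shape[1], :] = src
            subimages.append((a + 1, b + 1, tile))                # 1-based coordinates as in the text
            b += 512
        a += 384
    return subimages
```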
All preliminary candidate regions of the sub-image are obtained from the output values as follows. The output of softmax(Conv2-1(Conv1(Fconv(Xs)))) is 48 × 64 × 18 and the output of Conv2-2(Conv1(Fconv(Xs))) is 48 × 64 × 36. For any point (x, y) of the 48 × 64 grid, softmax(Conv2-1(Conv1(Fconv(Xs))))(x, y) is an 18-dimensional vector II and Conv2-2(Conv1(Fconv(Xs)))(x, y) is a 36-dimensional vector IIII. For i from 1 to 9, let l_Otr be the third entry of Roi(x_Otr, y_Otr) and w_Otr its fourth entry; if II(2i−1) > II(2i), the preliminary candidate region is [II(2i−1), (8 × IIII(4i−3) + x, 8 × IIII(4i−2) + y, l_Otr × IIII(4i−1), w_Otr × IIII(4i))], where the first entry II(2i−1) is the score of the current candidate region and the second entry indicates that the centre of the candidate region is (8 × IIII(4i−3) + x, 8 × IIII(4i−2) + y) and that the half-length and half-width of the candidate frame are l_Otr × IIII(4i−1) and w_Otr × IIII(4i) respectively.
The adjustment of all out-of-bounds candidate regions in the candidate set proceeds as follows. Let the monitoring image have m rows and n columns. For each candidate region with centre (a_ch, b_ch) and half-length and half-width l_ch and w_ch: if a_ch + l_ch > m, the centre and half-length are adjusted to new values a'_ch and l'_ch so that the frame lies within the image, and a_ch = a'_ch and l_ch = l'_ch are updated; similarly, if b_ch + w_ch > n, the centre and half-width are adjusted to b'_ch and w'_ch and b_ch = b'_ch and w_ch = w'_ch are updated.
The weeding out of overlapping frames among the candidate regions proceeds as follows:
While the candidate region set is not empty:
 take the candidate region i_out with the maximum score out of the candidate region set;
 compute the coincidence rate between candidate region i_out and every candidate region i_c remaining in the set, and if the coincidence rate > 0.7 delete candidate region i_c from the set;
 put candidate region i_out into the output candidate region set.
When the candidate region set is empty, the candidate regions contained in the output set are the candidate regions obtained after weeding out the overlapping frames.
The coincidence rate between candidate region i_out and a candidate region i_c in the set is computed as follows. Let candidate region i_c have centre point (a_ic, b_ic) and half-length and half-width l_ic and w_ic, and let candidate region i_out have centre point (a_iout, b_iout) and half-length and half-width l_iout and w_iout. Compute xA = max(a_ic, a_iout), yA = max(b_ic, b_iout), xB = min(l_ic, l_iout), yB = min(w_ic, w_iout). If |a_ic − a_iout| ≤ l_ic + l_iout − 1 and |b_ic − b_iout| ≤ w_ic + w_iout − 1, the regions overlap and the overlapping area = (l_ic + l_iout − 1 − |a_ic − a_iout|) × (w_ic + w_iout − 1 − |b_ic − b_iout|); otherwise the overlapping area = 0. The whole area = (2·l_ic − 1) × (2·w_ic − 1) + (2·l_iout − 1) × (2·w_iout − 1) − overlapping area, and the coincidence rate = overlapping area / whole area.
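A minimal sketch of the overlap pruning described above: the highest-scoring candidate is kept and every remaining candidate whose coincidence rate with it exceeds 0.7 is discarded, repeating until the set is empty (standard non-maximum suppression). It reuses the coincidence_rate sketch given earlier for centre/half-size boxes.

```python
def weed_out_overlaps(candidates, threshold: float = 0.7):
    """candidates: list of (score, (a, b, l, w)); returns the retained candidates."""
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)                      # candidate with the maximum score
        kept.append(best)
        remaining = [c for c in remaining
                     if coincidence_rate(best[1], c[1]) <= threshold]
    return kept
```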
The shopping action recognition module works as follows. During initialization, the static action recognition classifier is first initialized with standard hand action images so that it can recognize the grasping and putting-down actions of the hand; then the dynamic action recognition classifier is initialized with hand action videos so that it can recognize taking out an item, putting back an item, taking out and putting back, having taken out an item without putting it back, and suspected theft. During detection: step 1, each received hand region is recognized with the static action recognition classifier; the recognition method is: let the input image be Handp1, the output StaticN(Handp1) is a 3-dimensional vector, and the result is recognized as grasping if the first position is the maximum, putting down if the second position is the maximum, and other if the third position is the maximum. Step 2, after a grasping action is recognized, target tracking is performed on the region corresponding to the current grasping action; when the static action recognition classifier applied to the hand region corresponding to the tracking box in a following frame recognizes a putting-down action, the target tracking ends, and the frames from the recognition of the grasping action to the recognition of the putting-down action form a continuous video of the hand action, which is marked as a complete video. If the track is lost during tracking, the frames from the recognition of the grasping action up to the frame before the track was lost form a video containing only the grasping action, which is marked as a grasp-only video. If a putting-down action is recognized but this action does not appear in the images obtained by target tracking, the grasping action of this movement has been lost; the hand region of the current image is taken as the end of the video, target tracking is carried out backwards from the current frame until the track is lost, and the frame following the lost frame is taken as the start frame of the video, which is marked as a put-down-only video. Step 3, the complete videos obtained in step 2 are recognized with the dynamic action recognition classifier; the recognition method is: let the input be Handv1, the output DynamicN(Handv1) is a 5-dimensional vector, and the result is recognized as taking out an item if the first position is the maximum, putting back an item if the second position is the maximum, taking out and putting back if the third position is the maximum, having taken out an item without putting it back if the fourth position is the maximum, and a suspected theft action if the fifth position is the maximum. The recognition result is then sent to the recognition result processing module; the grasp-only videos and put-down-only videos are sent to the recognition result processing module, and the complete videos and grasp-only videos are sent to the product identification module and the individual identification module.
The initialization of the static action recognition classifier with standard hand action images proceeds as follows. Step 1, arrange the video data: first, choose a large number of videos of people shopping in supermarkets; these videos include taking out an item, putting back an item, taking out and putting back, having taken out an item without putting it back, and suspected theft. Each video segment is intercepted manually, with the frame in which the hand touches the commodity as the start frame and the frame in which the hand leaves the commodity as the end frame; for every frame of the video the hand region is extracted with the target detection module and scaled to a 256 × 256 colour image; the scaled video is put into the hand action video set and labelled as one of taking out an item, putting back an item, taking out and putting back, having taken out an item without putting it back, or suspected theft. For every video whose class is taking out an item, putting back an item, taking out and putting back, or having taken out an item without putting it back, the first frame of the video is put into the hand action image set and labelled as a grasping action, the last frame is put into the hand action image set and labelled as a putting-down action, and one frame other than the first and last frames is taken at random, put into the hand action image set and labelled as other. This yields the hand action video set and the hand action image set. Step 2, construct the static action recognition classifier StaticN. Step 3, initialize the static action recognition classifier StaticN with the hand action image set constructed in step 1 as input; let each input image be Handp, the output be StaticN(Handp) and the class be y_Handp, where y_Handp is represented as: grasping: y_Handp = [1, 0, 0]; putting down: y_Handp = [0, 1, 0]; other: y_Handp = [0, 0, 1]. The evaluation function of the network is the cross-entropy loss computed on (StaticN(Handp) − y_Handp), minimized during training, with 2000 iterations.
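A minimal PyTorch sketch, assumed rather than taken from the patent, of one training step of the static classifier: the one-hot labels y_Handp ([1,0,0] grasping, [0,1,0] putting down, [0,0,1] other) are compared with StaticN's output using the cross-entropy loss, which is minimized over 2000 iterations.

```python
import torch
import torch.nn as nn

def train_step(static_n: nn.Module, optimizer, images: torch.Tensor, class_idx: torch.Tensor):
    """images: N x 3 x 256 x 256 batch; class_idx: N class indices (0 grasp, 1 put down, 2 other)."""
    optimizer.zero_grad()
    logits = static_n(images)                               # N x 3 class scores
    loss = nn.functional.cross_entropy(logits, class_idx)   # cross-entropy against the labels
    loss.backward()
    optimizer.step()
    return loss.item()
```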
The static action recognition classifier StaticN has the network structure:
Layer 1: convolution, input 256 × 256 × 3, output 256 × 256 × 64, channels = 64.
Layer 2: convolution, input 256 × 256 × 64, output 256 × 256 × 64, channels = 64.
Layer 3: pooling; input is the output of layer 1 (256 × 256 × 64) concatenated with the output of layer 2 (256 × 256 × 64) along the third dimension; output 128 × 128 × 128.
Layer 4: convolution, input 128 × 128 × 128, output 128 × 128 × 128, channels = 128.
Layer 5: convolution, input 128 × 128 × 128, output 128 × 128 × 128, channels = 128.
Layer 6: pooling; input is the output of layer 4 (128 × 128 × 128) concatenated with the output of layer 5 (128 × 128 × 128) along the third dimension; output 64 × 64 × 256.
Layer 7: convolution, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256.
Layer 8: convolution, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256.
Layer 9: convolution, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256.
Layer 10: pooling; input is the output of layer 7 (64 × 64 × 256) concatenated with the output of layer 9 (64 × 64 × 256) along the third dimension; output 32 × 32 × 512.
Layer 11: convolution, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512.
Layer 12: convolution, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512.
Layer 13: convolution, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512.
Layer 14: pooling; input is the output of layer 11 (32 × 32 × 512) concatenated with the output of layer 13 (32 × 32 × 512) along the third dimension; output 16 × 16 × 1024.
Layer 15: convolution, input 16 × 16 × 1024, output 16 × 16 × 512, channels = 512.
Layer 16: convolution, input 16 × 16 × 512, output 16 × 16 × 512, channels = 512.
Layer 17: convolution, input 16 × 16 × 512, output 16 × 16 × 512, channels = 512.
Layer 18: pooling; input is the output of layer 15 (16 × 16 × 512) concatenated with the output of layer 17 (16 × 16 × 512) along the third dimension; output 8 × 8 × 1024.
Layer 19: convolution, input 8 × 8 × 1024, output 8 × 8 × 256, channels = 256.
Layer 20: pooling, input 8 × 8 × 256, output 4 × 4 × 256.
Layer 21: convolution, input 4 × 4 × 256, output 4 × 4 × 128, channels = 128.
Layer 22: pooling, input 4 × 4 × 128, output 2 × 2 × 128.
Layer 23: fully connected; the 2 × 2 × 128 input is first flattened into a 512-dimensional vector and then fed to the fully connected layer; output vector length 128; relu activation.
Layer 24: fully connected, input vector length 128, output vector length 32, relu activation.
Layer 25: fully connected, input vector length 32, output vector length 3, soft-max activation.
All convolutional layers use kernel size 3, stride (1, 1) and relu activation; all pooling layers are max-pooling layers with pooling window kernel_size = 2 and stride (2, 2).
The initialization of the dynamic action recognition classifier with hand action videos proceeds as follows. Step 1, construct the data set: from each video in the hand action video set constructed in step 1 of the static classifier initialization, 10 frames are extracted uniformly and used as input. Step 2, construct the dynamic action recognition classifier DynamicN. Step 3, initialize the dynamic action recognition classifier DynamicN; the input is the set of 10 frames extracted from each video in step 1; let each input of 10 frames be Handv, the output be DynamicN(Handv) and the class be y_Handv, where y_Handv is represented as: taking out an item: y_Handv = [1, 0, 0, 0, 0]; putting back an item: y_Handv = [0, 1, 0, 0, 0]; taking out and putting back: y_Handv = [0, 0, 1, 0, 0]; having taken out an item without putting it back: y_Handv = [0, 0, 0, 1, 0]; suspected theft action: y_Handv = [0, 0, 0, 0, 1]. The evaluation function of the network is the cross-entropy loss computed on (DynamicN(Handv) − y_Handv), minimized during training, with 2000 iterations.
The uniform extraction of 10 frames proceeds as follows. For a video of Nf frames, the first frame of the video is extracted as frame 1 of the extracted set and the last frame of the video is extracted as frame 10 of the extracted set; the i_ckt-th frame of the extracted set, for i_ckt = 2 to 9, is taken from the video at a position spaced uniformly between the first and last frames, ⌊·⌋ denoting the integer part.
The construction dynamic action recognition classifier DynamicN, network structure are as follows:
First layer: convolutional layer, inputting is 256 × 256 × 30, and exporting is 256 × 256 × 512, port number channels=
512;The second layer: convolutional layer, inputting is 256 × 256 × 512, and exporting is 256 × 256 × 128, port number channels=
128;Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: convolutional layer, input
It is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer inputs and is
128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, input the 4th
Layer output 128 × 128 × 128 is connected in third dimension with layer 5 128 × 128 × 128, export as 64 × 64 ×
256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The
Eight layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: volume
Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, it is defeated
Enter and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, exporting is 32 × 32
×512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=
512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;
13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth
Four layers: pond layer inputs and exports 32 × 32 × 512 and the 13rd layer 32 × 32 × 512 in third dimension for eleventh floor
It is connected, exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, export as 16 × 16 ×
512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512,
Port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, channel
Number channels=512;18th layer: pond layer inputs and exports 16 × 16 × 512 and the 17th layer 16 × 16 for the 15th layer
× 512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024,
Output is 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, export as 4 ×
4×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128;
Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will
The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length
It is 128, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 128, and output vector is long
Degree is 32, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 32, and output vector is long
Degree is 3, and activation primitive is soft-max activation primitive;The parameter of all convolutional layers is size=3 convolution kernel kernel, step-length
Stride=(1,1), activation primitive are relu activation primitive;All pond layers are maximum pond layer, parameter Chi Huaqu
Between size kernel_size=2, step-length stride=(2,2).
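A minimal sketch (assuming TensorFlow/Keras; the framework and helper names are not part of the patent text) of the repeating pattern used by DynamicN and the other networks described here: 3 × 3, stride-1, relu convolutions whose feature maps are concatenated along the third (channel) dimension and then max-pooled with kernel_size=2, stride=(2,2).

import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, n_convs=2):
    """n_convs convolutions, concatenate the first and last, then 2x2 max-pool."""
    first = layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu")(x)
    out = first
    for _ in range(n_convs - 1):
        out = layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu")(out)
    cat = layers.Concatenate(axis=-1)([first, out])     # e.g. 128 + 128 -> 256 channels
    return layers.MaxPooling2D(pool_size=2, strides=(2, 2))(cat)

inp = tf.keras.Input(shape=(256, 256, 30))              # 10 sampled frames, 3 channels each
x = layers.Conv2D(512, 3, padding="same", activation="relu")(inp)
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2)(x)                           # layers 1-3: 128 x 128 x 128
x = conv_block(x, 128)                                  # layers 4-6: 64 x 64 x 256
# further conv_block calls repeat the same pattern down to the fully connected head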
After a grasp motion is recognized, target tracking is carried out on the region corresponding to the current grasp motion as follows: let the image of the currently recognized grasp motion be Hgrab; the current tracking region is the region corresponding to image Hgrab. First step, extract the ORB feature ORBHgrab of image Hgrab; second step, for the images corresponding to all hand regions in the frame following Hgrab, compute their ORB features to obtain an ORB feature set, and delete from it the ORB features already chosen by other tracking boxes; third step, compare ORBHgrab with each value of the ORB feature set by Hamming distance and select the ORB feature with the smallest Hamming distance to ORBHgrab as the chosen ORB feature. If the similarity between the chosen ORB feature and ORBHgrab is greater than 0.85, where similarity = 1 - (Hamming distance of the two ORB features / ORB feature length), then the hand region corresponding to the chosen ORB feature is the tracking box of image Hgrab in the next frame; otherwise, if the similarity is below 0.85, the tracking is lost.
The ORB feature: the method of extracting ORB features from an image is relatively mature and is already implemented in the OpenCV computer vision library. Extracting the ORB features of a picture takes the current image as input and outputs several character strings of identical length, each of which represents one ORB feature.
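A simplified sketch (assuming OpenCV's Python bindings) of the matching step used by the tracker: extract an ORB descriptor for the tracked hand region, compare it against candidate regions by Hamming distance, and accept the best match only when similarity = 1 - hamming / feature length exceeds 0.85. Treating a single descriptor as "the" ORB feature of a region is a simplification of the equal-length feature strings described in the text.

import cv2

orb = cv2.ORB_create()

def region_descriptor(img):
    """Return one ORB descriptor (32 bytes = 256 bits) for a region image."""
    _, des = orb.detectAndCompute(img, None)
    return None if des is None else des[0]

def best_match(track_des, candidate_imgs):
    """Pick the candidate region whose descriptor is closest in Hamming distance."""
    best_idx, best_sim = None, -1.0
    for i, img in enumerate(candidate_imgs):
        des = region_descriptor(img)
        if des is None:
            continue
        hamming = cv2.norm(track_des, des, cv2.NORM_HAMMING)
        sim = 1.0 - hamming / (len(des) * 8)         # descriptor length in bits
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return (best_idx, best_sim) if best_sim > 0.85 else (None, best_sim)  # else tracking is lost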
The step of taking the hand region of the present image as the end of the video and tracking from the present frame towards earlier frames with the target tracking method, until tracking is lost, proceeds as follows: let the image of the currently recognized put-down motion be Hdown; the current tracking region is the region corresponding to image Hdown.
While tracking is not lost:
First step, extract the ORB feature ORBHdown of image Hdown; since this feature was already computed while carrying out target tracking on the grasp motion region after the grasp motion was recognized, it does not need to be computed again here;
Second step, for the images corresponding to all hand regions in the frame preceding image Hdown, compute their ORB features to obtain an ORB feature set, and delete from it the ORB features already chosen by other tracking boxes;
Third step, compare ORBHdown with each value of the ORB feature set by Hamming distance and select the ORB feature with the smallest Hamming distance to ORBHdown as the chosen ORB feature. If the similarity between the chosen ORB feature and ORBHdown is greater than 0.85, where similarity = 1 - (Hamming distance of the two ORB features / ORB feature length), then the hand region corresponding to the chosen ORB feature is the tracking box of image Hdown in that preceding frame; otherwise, if the similarity is below 0.85, the tracking is lost and the algorithm terminates.
The product identification module works as follows. In initialization, the product identification classifier is first initialized using the product image set of all angles, and a product list is generated from the product images. When the product list changes: if a product is deleted, the images of that product are deleted from the product image set of all angles and the corresponding position in the product list is deleted; if a product is added, the product images of all angles of the current product are put into the product image set of all angles and the name of the newly added product is appended to the end of the product list; then the product recognition classifier is upgraded with the new product image set of all angles and the new product list. In the detection process: first, according to the complete video and the grasp-motion-only video transmitted by the shopping action recognition module, take the position obtained by the target detection module corresponding to the first frame of the current video, and search the input video images for that position, starting from the first frame of the current video and moving towards earlier frames, until a frame in which the region is not blocked is detected; finally, the image of the region corresponding to that frame is used as the input of the product identification classifier for identification, so as to obtain the recognition result of the current product. The recognition method is: let the image input each time be Goods1 and the output GoodsN(Goods1) be a vector; if the igoods-th position of the vector is largest, the current recognition result is the igoods-th product in the product list, and the recognition result is sent to the recognition result processing module.
The step of first initializing the product identification classifier using the product image set of all angles and generating a product list from the product images is as follows: first step, construct the data set and the product list: the data set consists of the images of each product from all angles, and the product list listGoods is a vector in which each element corresponds to a product name; second step, construct the product identification classifier GoodsN; third step, initialize the constructed product identification classifier GoodsN, whose input is the product image set of all angles. If the input image is Goods, the output is GoodsN(Goods) and the class label is yGoods, a vector whose length equals the number of products in the product list; yGoods is represented as follows: if image Goods is the product at the iGoods-th position, then the iGoods-th position of yGoods is 1 and the other positions are 0. The evaluation function of the network computes the cross entropy loss of (GoodsN(Goods)-yGoods), the convergence direction is minimization, and the number of iterations is 2000.
The construction product identification classifier GoodsN, two groups of GoodsN1 and GoodsN2 of network layer structure, wherein
The network structure of GoodsN1 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, and exporting is 256 × 256 × 64, port number
Channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, and exporting is 256 × 256 × 128, port number
Channels=128;Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: volume
Lamination, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolution
Layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer,
It inputs the 4th layer of output 128 × 128 × 128 to be connected in third dimension with layer 5 128 × 128 × 128, exporting is 64
×64×256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=
256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The
Nine layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond
Change layer, inputs and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, export
It is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number
Channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number
Channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number
Channels=512;14th layer: pond layer, input for eleventh floor export 32 × 32 × 512 and the 13rd layer 32 × 32 ×
512 are connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, input as 16 × 16 ×
1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512,
Output is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, output
It is 16 × 16 × 512, port number channels=512;18th layer: pond layer, input for the 15th layer output 16 × 16 ×
512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th layer: convolution
Layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: pond layer, input
It is 8 × 8 × 256, exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, export as 4 × 4 ×
128, port number channels=128;The parameter of all convolutional layers be size=3 convolution kernel kernel, step-length stride=(1,
1), activation primitive is relu activation primitive;All pond layers are maximum pond layer, and parameter is pond section size
Kernel_size=2, step-length stride=(2,2).The network structure of GoodsN2 are as follows: inputting is 4 × 4 × 128, first will be defeated
The data entered are launched into the vector of 2048 dimensions, then input into first layer;First layer: full articulamentum, input vector length are
2048, output vector length is 1024, and activation primitive is relu activation primitive;The second layer: full articulamentum, input vector length are
1024, output vector length is 1024, and activation primitive is relu activation primitive;Third layer: full articulamentum, input vector length are
1024, output vector length is len (listGoods), activation primitive is soft-max activation primitive;len(listGoods) indicate
The length of product list.For any input Goods2, GoodsN (Goods2)=GoodsN2 (GoodsN1 (Goods2)).
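A minimal sketch (assuming TensorFlow/Keras) of how the product classifier splits into a convolutional trunk GoodsN1 and a fully connected head GoodsN2, so that GoodsN(x) = GoodsN2(GoodsN1(x)). The intermediate layers are abbreviated; only the shapes named in the text (4 × 4 × 128 trunk output, 2048-dimensional flatten, final softmax of length len(listGoods)) are kept, and the pooling placeholder is an assumption.

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_goodsn1():
    inp = tf.keras.Input(shape=(256, 256, 3))
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)
    # ... further conv/concat/pool blocks as in the text, down to 4 x 4 x 128 ...
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.AveragePooling2D(32)(x)               # placeholder to reach 4 x 4 spatially
    return Model(inp, x, name="GoodsN1")

def build_goodsn2(num_products):
    inp = tf.keras.Input(shape=(4, 4, 128))
    x = layers.Flatten()(inp)                        # 2048-dimensional vector
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dense(1024, activation="relu")(x)
    out = layers.Dense(num_products, activation="softmax")(x)   # length = len(listGoods)
    return Model(inp, out, name="GoodsN2")

goodsn1 = build_goodsn1()
goodsn2 = build_goodsn2(num_products=50)             # hypothetical product-list length
goods_inp = tf.keras.Input(shape=(256, 256, 3))
goodsn = Model(goods_inp, goodsn2(goodsn1(goods_inp)), name="GoodsN")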
The step of upgrading the product recognition classifier with the new product image set of all angles and the new product list is as follows: first step, modify the network structure: for the newly constructed product identification classifier GoodsN', the network structure of GoodsN1' is unchanged and identical to the GoodsN1 network structure at initialization; the first and second layers of the GoodsN2' network structure remain unchanged, and the output vector length of the third layer becomes the length of the updated product list. Second step, initialize the newly constructed product identification classifier GoodsN': its input is the product image set of the new all angles; if the input image is Goods3, the output is GoodsN'(Goods3)=GoodsN2'(GoodsN1(Goods3)) and the class label is yGoods3, a vector whose length equals the number of entries of the updated product list; yGoods3 is represented as follows: if image Goods3 is the product at the iGoods-th position, then the iGoods-th position of yGoods3 is 1 and the other positions are 0. The evaluation function of the network computes the cross entropy loss of (GoodsN'(Goods3)-yGoods3), the convergence direction is minimization, the parameter values in GoodsN1 remain unchanged during initialization, and the number of iterations is 500.
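A hedged sketch of the upgrade step, continuing the Keras sketch above (model and function names are assumptions): GoodsN1 is kept frozen, the first two fully connected layers of GoodsN2' keep their structure, and only the final softmax layer is resized to the length of the updated product list before re-initializing on the new image set.

import tensorflow as tf
from tensorflow.keras import Model

goodsn1.trainable = False                             # parameter values of GoodsN1 stay unchanged

new_len = 55                                          # hypothetical updated product-list length
goodsn2_new = build_goodsn2(num_products=new_len)     # same first two Dense layers, resized output

inp = tf.keras.Input(shape=(256, 256, 3))
goodsn_new = Model(inp, goodsn2_new(goodsn1(inp)), name="GoodsN_prime")
goodsn_new.compile(optimizer="adam", loss="categorical_crossentropy")
# goodsn_new.fit(new_angle_images, new_onehot_labels, ...)   # cross entropy, minimized, 500 iterations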
The step of searching, for the position obtained by the target detection module corresponding to the first frame of the current video, through the input video images starting from the first frame of the current video and moving towards earlier frames, until a frame in which the region is not blocked is detected, is as follows: let the position obtained by the target detection module corresponding to the first frame of the current video be (agoods, bgoods, lgoods, wgoods), and let the first frame of the current video be the icrgs-th frame; the frame under processing is icr = icrgs. First step, obtain all detection regions of the icr-th frame from the target detection module, denoted Taskicr; second step, for each region frame (atask, btask, ltask, wtask) in Taskicr, compute its distance dgt=(atask-agoods)²+(btask-bgoods)²-(ltask+lgoods)²-(wtask+wgoods)². If no distance < 0 exists, then the region corresponding to (agoods, bgoods, lgoods, wgoods) in the icr-th frame is the detected frame in which the region is not blocked, and the algorithm terminates; otherwise, if some distance < 0 exists, record d(icr) = the minimum distance in the distance list d and set icr = icr-1; if icr > 0, the algorithm jumps to the first step; if icr ≤ 0, select the record with the largest value in the distance list d, and take the frame corresponding to that record, with its corresponding (agoods, bgoods, lgoods, wgoods), as the detected frame in which the region is not blocked; the algorithm terminates.
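A short Python transcription of the occlusion test above: boxes are given as (centre x, centre y, half-height l, half-width w); the signed distance dgt is negative when a detection may overlap the product region, so a frame is accepted only when no detection gives a negative distance. Function names are assumptions.

def dgt(task_box, goods_box):
    at, bt, lt, wt = task_box
    ag, bg, lg, wg = goods_box
    return (at - ag) ** 2 + (bt - bg) ** 2 - (lt + lg) ** 2 - (wt + wg) ** 2

def frame_is_unoccluded(detections, goods_box):
    """True when no detected region comes close enough to block goods_box."""
    return all(dgt(box, goods_box) >= 0 for box in detections)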
The individual identification module works as follows. In initialization, the face feature extractor FaceN is first initialized using the face image set of all angles and μface is computed; then the human body feature extractor BodyN is initialized using the human body images of all angles and μbody is computed. In the detection process, when a user enters the supermarket, the target detection module obtains the current human body region Body1 and the face image Face1 within the human body region; the human body feature extractor BodyN and the face feature extractor FaceN are then used to extract the human body feature BodyN(Body1) and the face feature FaceN(Face1) respectively; BodyN(Body1) is saved into the BodyFtu set, FaceN(Face1) is saved into the FaceFtu set, and the ID information of the current customer is saved. The ID information can be the user's supermarket account or a non-repeating number randomly assigned when the user enters the supermarket; it is used to distinguish different customers, and whenever a customer enters the supermarket, his or her human body feature and face feature are extracted. When a user in the supermarket moves a product, according to the complete video and the grasp-motion-only video transmitted by the shopping action recognition module, the corresponding human body region and face region are found, and face recognition or human body recognition is carried out using the face feature extractor FaceN and the human body feature extractor BodyN to obtain the ID of the customer corresponding to the video transmitted by the shopping action recognition module.
The step of initializing the face feature extractor FaceN using the face image set of all angles and computing μface is as follows: first step, choose the face image set of all angles to constitute the face data set; second step, construct the face feature extractor FaceN and initialize it using the face data set; third step:
For each person iPeop in the face data set, obtain the set FaceSet(iPeop) of all face images in the face data set belonging to iPeop:
For each face image Face(jiPeop) in FaceSet(iPeop):
Compute the face feature FaceN(Face(jiPeop));
Take the average of all face features in the current face image set FaceSet(iPeop) as the centre of the current face images, center(FaceN(Face(jiPeop))); the distances between all face features in the current face image set FaceSet(iPeop) and the centre center(FaceN(Face(jiPeop))) constitute the distance set corresponding to iPeop.
Obtain the corresponding distance set for every person in the face data set; after the distance set is arranged from small to large, if the distance set length is ndiset, μface is taken as the corresponding element of the sorted distance set, where ⌊ ⌋ indicates taking the integer part.
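A hedged sketch of the μface statistics: for every person, average their face features to obtain a centre, collect each feature's distance to that centre, and sort the resulting distance set (pooled over all persons in this sketch). The exact position in the sorted set from which μface is read does not survive legibly in the text, so it is left as a caller-supplied index function here; all names are assumptions.

import numpy as np

def mu_from_features(features_by_person, quantile_index_fn):
    distances = []
    for feats in features_by_person:                 # feats: (n_i, d) array of FaceN outputs
        feats = np.asarray(feats)
        centre = feats.mean(axis=0)                  # centre of this person's face images
        distances.extend(np.linalg.norm(feats - centre, axis=1))
    distances.sort()
    n_diset = len(distances)
    return distances[quantile_index_fn(n_diset)]     # index formula supplied by the caller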
The face feature extractor FaceN is constructed and initialized using the face data set as follows. Let the face data set consist of Nfaceset individuals; the network structure FaceN25 is: first layer: convolutional layer, inputting is 256 × 256 × 3, exporting is
256 × 256 × 64, port number channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, export as 256 ×
256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are defeated with third layer
256 × 256 × 64 are connected in third dimension out, and exporting is 128 × 128 × 128;4th layer: convolutional layer inputs and is
128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer, inputting is 128
× 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, the 4th layer of input are defeated
128 × 128 × 128 are connected in third dimension with layer 5 128 × 128 × 128 out, and exporting is 64 × 64 × 256;The
Seven layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;8th layer: volume
Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: convolutional layer, it is defeated
Entering is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, inputting is
Seven layers of output 64 × 64 × 256 are connected in third dimension with the 9th layer 64 × 64 × 256, and exporting is 32 × 32 × 512;
Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth
Two layers: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;13rd layer:
Convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;14th layer: Chi Hua
Layer inputs and is connected in third dimension for eleventh floor output 32 × 32 × 512 with the 13rd layer 32 × 32 × 512, defeated
It is out 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, and exporting is 16 × 16 × 512, port number
Channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number
Channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number
Channels=512;18th layer: pond layer, input for the 15th layer export 16 × 16 × 512 and the 17th layer 16 × 16 ×
512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, defeated
It is out 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, and exporting is 4 × 4
×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128;
Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will
The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length
It is 512, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 512, and output vector is long
Degree is 512, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 512, output vector
Length is Nfaceset, activation primitive is soft-max activation primitive;The parameter of all convolutional layers be convolution kernel kernel size=
3, step-length stride=(1,1), activation primitive are relu activation primitive;All pond layers are maximum pond layer, and parameter is
pond section size kernel_size=2, step-length stride=(2,2). The initialization procedure is as follows: for each face image face4, the output is FaceN25(face4) and the class label is yface, a vector of length Nfaceset; yface is represented as follows: if face face4 belongs to the iface4-th person in the face image set, then the iface4-th position of yface is 1 and the other positions are 0. The evaluation function of the network computes the cross entropy loss of (FaceN25(face4)-yface), the convergence direction is minimization, and the number of iterations is 2000. After the iterations, the face feature extractor FaceN is the FaceN25 network from the first layer to the 24th layer.
The step of initializing the human body feature extractor BodyN using the human body images of all angles and computing μbody is as follows: first step, choose the human body image set of all angles to constitute the human body data set; second step, construct the human body feature extractor BodyN and initialize it using the human body data set; third step:
For each person iPeop1 in the human body data set, obtain the set BodySet(iPeop1) of all human body images in the human body data set belonging to iPeop1:
For each human body image Body(jiPeop1) in BodySet(iPeop1):
Compute the human body feature BodyN(Body(jiPeop1));
Take the average of all human body features in the current human body image set BodySet(iPeop1) as the centre of the current human body images, center(BodyN(Body(jiPeop1))); the distances between all human body features in the current human body image set BodySet(iPeop1) and the centre center(BodyN(Body(jiPeop1))) constitute the distance set corresponding to iPeop1.
Obtain the corresponding distance set for every person in the human body data set; after the distance set is arranged from small to large, if the distance set length is ndiset1, μbody is taken as the corresponding element of the sorted distance set, where ⌊ ⌋ indicates taking the integer part.
The human body feature extractor BodyN is constructed and initialized using the human body data set as follows. Let the human body data set consist of Nbodyset individuals; the network structure BodyN25 is: first layer: convolutional layer, inputting is 256 × 256 × 3, exporting is
256 × 256 × 64, port number channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, export as 256 ×
256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are defeated with third layer
256 × 256 × 64 are connected in third dimension out, and exporting is 128 × 128 × 128;4th layer: convolutional layer inputs and is
128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer, inputting is 128
× 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, the 4th layer of input are defeated
128 × 128 × 128 are connected in third dimension with layer 5 128 × 128 × 128 out, and exporting is 64 × 64 × 256;The
Seven layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;8th layer: volume
Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: convolutional layer, it is defeated
Entering is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, inputting is
Seven layers of output 64 × 64 × 256 are connected in third dimension with the 9th layer 64 × 64 × 256, and exporting is 32 × 32 × 512;
Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth
Two layers: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;13rd layer:
Convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;14th layer: Chi Hua
Layer inputs and is connected in third dimension for eleventh floor output 32 × 32 × 512 with the 13rd layer 32 × 32 × 512, defeated
It is out 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, and exporting is 16 × 16 × 512, port number
Channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number
Channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number
Channels=512;18th layer: pond layer, input for the 15th layer export 16 × 16 × 512 and the 17th layer 16 × 16 ×
512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, defeated
It is out 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, and exporting is 4 × 4
×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128;
Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will
The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length
It is 512, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 512, and output vector is long
Degree is 512, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 512, output vector
Length is Nbodyset, activation primitive is soft-max activation primitive;The parameter of all convolutional layers be convolution kernel kernel size=
3, step-length stride=(1,1), activation primitive are relu activation primitive;All pond layers are maximum pond layer, and parameter is
pond section size kernel_size=2, step-length stride=(2,2). The initialization procedure is as follows: for each human body image body4, the output is BodyN25(body4) and the class label is ybody, a vector of length Nbodyset; ybody is represented as follows: if human body body4 belongs to the ibody4-th person in the human body image set, then the ibody4-th position of ybody is 1 and the other positions are 0. The evaluation function of the network computes the cross entropy loss of (BodyN25(body4)-ybody), the convergence direction is minimization, and the number of iterations is 2000. After the iterations, the human body feature extractor BodyN is the BodyN25 network from the first layer to the 24th layer.
The step of finding the corresponding human body region and face region for the complete video and the grasp-motion-only video transmitted by the shopping action recognition module, and carrying out face recognition or human body recognition with the face feature extractor FaceN and the human body feature extractor BodyN to obtain the ID of the customer corresponding to the video currently transmitted by the shopping action recognition module, proceeds as follows: according to the video transmitted by the shopping action recognition module, the corresponding human body region and face region are looked for starting from the first frame of the video, until the algorithm terminates or the last frame of the video has been processed:
The corresponding human body region image Body2 and face region image Face2 are passed through the human body feature extractor BodyN and the face feature extractor FaceN respectively to extract the human body feature BodyN(Body2) and the face feature FaceN(Face2);
Face identification information is then used first: compare the Euclidean distances dFace between FaceN(Face2) and all face features in the FaceFtu set, and select the feature in the FaceFtu set corresponding to the smallest Euclidean distance; let this feature be FaceN(Face3). If dFace < μface, the current face image is identified as belonging to the customer of the face image corresponding to FaceN(Face3), that customer ID is the ID corresponding to the video action transmitted by the shopping action recognition module, and the current identification process terminates;
If dFace ≥ μface, the current individual cannot be identified by face recognition alone, so compare the Euclidean distances dBody between BodyN(Body2) and all human body features in the BodyFtu set, and select the feature in the BodyFtu set corresponding to the smallest Euclidean distance; let this feature be BodyN(Body3). If dBody+dFace < μface+μbody, the current human body image is identified as belonging to the customer of the human body image corresponding to BodyN(Body3), and that customer ID is the ID corresponding to the video action transmitted by the shopping action recognition module.
If, after all frames of the video have been processed, the ID corresponding to the video action has still not been found, then, in order to avoid charging the wrong account by misidentifying the shopper, the video currently transmitted by the shopping action recognition module is not processed further.
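A minimal Python sketch of the two-stage identification above: face features are tried first, and a combined face-plus-body decision is used when the face distance alone exceeds μface. The parallel-list data structures are assumptions; the thresholds follow the text.

import numpy as np

def identify(face_feat, body_feat, face_ftu, body_ftu, ids, mu_face, mu_body):
    d_face = np.linalg.norm(face_ftu - face_feat, axis=1)     # distances to stored face features
    i_face = int(np.argmin(d_face))
    if d_face[i_face] < mu_face:
        return ids[i_face]                                    # face recognition succeeds
    d_body = np.linalg.norm(body_ftu - body_feat, axis=1)     # distances to stored body features
    i_body = int(np.argmin(d_body))
    if d_body[i_body] + d_face[i_face] < mu_face + mu_body:
        return ids[i_body]                                    # combined face + body decision
    return None                                               # no reliable identity for this frame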
The step of looking for the corresponding human body region and face region starting from the first frame of the video transmitted by the shopping action recognition module is as follows: according to the video transmitted by the shopping action recognition module, processing starts from the first frame of the video. Suppose the ifRg-th frame is currently being processed; let the position obtained by the target detection module corresponding to this frame of the video be (aifRg, bifRg, lifRg, wifRg), let the set of human body regions obtained by the target detection module for this frame be BodyFrameSetifRg, and let the set of face regions be FaceFrameSetifRg. For each human body region (aBFSifRg, bBFSifRg, lBFSifRg, wBFSifRg) in BodyFrameSetifRg, compute its distance dgbt=(aBFSifRg-aifRg)²+(bBFSifRg-bifRg)²-(lBFSifRg-lifRg)²-(wBFSifRg-wifRg)², and select the human body region with the smallest distance among all human body regions as the human body region corresponding to the current video; let the position of the chosen human body region be (aBFS1, bBFS1, lBFS1, wBFS1). For each face region (aFFSifRg, bFFSifRg, lFFSifRg, wFFSifRg) in FaceFrameSetifRg, compute its distance dgft=(aBFS1-aFFSifRg)²+(bBFS1-bFFSifRg)²-(lBFS1-lFFSifRg)²-(wBFS1-wFFSifRg)², and select the face region with the smallest distance among all face regions as the face region corresponding to the current video.
The recognition result processing module does not work during initialization. In the identification process, it integrates the recognition results it receives to generate the shopping list corresponding to each customer: first, the customer ID transmitted by the individual identification module determines the customer to which the current shopping information belongs, so the shopping list to be modified is the one numbered ID; then the recognition result transmitted by the product identification module determines the product corresponding to the current shopping action of the customer, and this product is denoted GoodA; then the recognition result transmitted by the shopping action recognition module determines whether the current shopping action modifies the shopping list: if the action is identified as taking out an article, product GoodA is added to shopping list ID with a quantity of 1; if it is identified as putting back an article, product GoodA is reduced on shopping list ID with a quantity of 1; if it is identified as "taken out and put back" or "taken out an article and not put back", the shopping list does not change; if the recognition result is "suspicious stealing", an alarm signal and the location information corresponding to the current video are transmitted to supermarket monitoring.
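A small Python sketch of this bookkeeping: the customer ID selects the shopping list, the product recognition result names the product, and the action class decides whether the quantity is incremented, decremented, left unchanged, or an alarm is raised. The dictionary storage and the alarm callback are assumptions.

def process_result(shopping_lists, customer_id, product, action, alarm):
    basket = shopping_lists.setdefault(customer_id, {})
    if action == "take_out":
        basket[product] = basket.get(product, 0) + 1             # quantity added is 1
    elif action == "put_back":
        basket[product] = max(basket.get(product, 0) - 1, 0)     # quantity reduced is 1
    elif action in ("take_out_and_put_back", "took_out_not_put_back"):
        pass                                                     # shopping list unchanged
    elif action == "suspicious_theft":
        alarm(customer_id)                                       # alert supermarket monitoring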
The invention has the advantages that: by moving the entry of goods forward to the moment the shopper picks them up, the most time-consuming step is absorbed into the shopping process itself, so the time spent scanning items at checkout is removed, the checkout speed is greatly improved, and the shopping experience of customers is improved. The present invention uses pattern recognition algorithms to identify and count the goods selected by the shopper dynamically while the goods are being picked, identifies the picture of the goods when the customer picks up or puts back goods to obtain the goods type, carries out face recognition of the customer and uses human body image recognition to obtain the customer's identity when face recognition is unsatisfactory, and recognizes abnormal behaviour of the customer to determine whether there is theft. This system can realize automatic settlement without reducing the customer's shopping experience. The shopping and accounting process of the present invention does not require changing the original organizational structure of the supermarket, so it can be seamlessly integrated with the existing supermarket organizational structure.
Detailed description of the invention
Fig. 1 is functional flow diagram of the invention
Fig. 2 is whole functional module of the invention and its correlation block diagram
Specific embodiment
The present invention will be further described below with reference to the drawings.
A kind of supermarket's intelligence vending system, functional flow diagram is as shown in Figure 1, correlation between its module
As shown in Figure 2.
Be provided below three specific embodiments to a kind of detailed process of supermarket's intelligence vending system of the present invention into
Row explanation: embodiment 1:
The present embodiment realizes a kind of process of the parameter initialization of supermarket's intelligence vending system.
1. image pre-processing module, in initial phase, the module does not work;
2. Human body target detection module: during initialization, images with manually demarcated human body image regions, face regions, hand regions and product regions are used to carry out parameter initialization of the target detection algorithm.
The step of using images with demarcated human body image regions, face regions, hand regions and product regions to initialize the parameters of the target detection algorithm is: first step, construct the feature extraction depth network; second step, construct the region selection network; third step, for each image X in the database used by the feature extraction depth network and each corresponding manually demarcated human region, pass them through the ROI layer, whose input is the image X and the region and whose output is of 7 × 7 × 512 dimensions; fourth step, build the coordinate refining network.
The construction feature extracts depth network, which is deep learning network structure, network structure are as follows: first
Layer: convolutional layer, inputting is 768 × 1024 × 3, and exporting is 768 × 1024 × 64, port number channels=64;The second layer: volume
Lamination, inputting is 768 × 1024 × 64, and exporting is 768 × 1024 × 64, port number channels=64;Third layer: Chi Hua
Layer, input first layer output 768 × 1024 × 64 are connected in third dimension with third layer output 768 × 1024 × 64,
Output is 384 × 512 × 128;4th layer: convolutional layer, inputting is 384 × 512 × 128, and exporting is 384 × 512 × 128, is led to
Road number channels=128;Layer 5: convolutional layer, inputting is 384 × 512 × 128, and exporting is 384 × 512 × 128, channel
Number channels=128;Layer 6: pond layer, the 4th layer of output 384 × 512 × 128 of input and layer 5 384 × 512 ×
128 are connected in third dimension, and exporting is 192 × 256 × 256;Layer 7: convolutional layer, input as 192 × 256 ×
256, exporting is 192 × 256 × 256, port number channels=256;8th layer: convolutional layer, input as 192 × 256 ×
256, exporting is 192 × 256 × 256, port number channels=256;9th layer: convolutional layer, input as 192 × 256 ×
256, exporting is 192 × 256 × 256, port number channels=256;Tenth layer: pond layer inputs as layer 7 output 192
× 256 × 256 are connected in third dimension with the 9th layer 192 × 256 × 256, and exporting is 96 × 128 × 512;11st
Layer: convolutional layer, inputting is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;Floor 12:
Convolutional layer, inputting is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;13rd layer: volume
Lamination, inputting is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;14th layer: Chi Hua
Layer inputs and is connected in third dimension for eleventh floor output 96 × 128 × 512 with the 13rd layer 96 × 128 × 512,
Output is 48 × 64 × 1024;15th layer: convolutional layer, inputting is 48 × 64 × 1024, and exporting is 48 × 64 × 512, channel
Number channels=512;16th layer: convolutional layer, inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number
Channels=512;17th layer: convolutional layer, inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number
Channels=512;18th layer: pond layer, input for the 15th layer export 48 × 64 × 512 and the 17th layer 48 × 64 ×
512 are connected in third dimension, and exporting is 48 × 64 × 1024;19th layer: convolutional layer, input as 48 × 64 ×
1024, exporting is 48 × 64 × 256, port number channels=256;20th layer: pond layer, inputting is 48 × 64 × 256,
Output is 24 × 32 × 256;Second eleventh floor: convolutional layer, inputting is 24 × 32 × 1024, and exporting is 24 × 32 × 256, channel
Number channels=256;Second Floor 12: pond layer, inputting is 24 × 32 × 256, and exporting is 12 × 16 × 256;20th
Three layers: convolutional layer, inputting is 12 × 16 × 256, and exporting is 12 × 16 × 128, port number channels=128;24th
Layer: pond layer, inputting is 12 × 16 × 128, and exporting is 6 × 8 × 128;25th layer: full articulamentum, first by the 6 of input
The data of × 8 × 128 dimensions are launched into the vector of 6144 dimensions, then input into full articulamentum, and output vector length is 768,
Activation primitive is relu activation primitive;26th layer: full articulamentum, input vector length are 768, and output vector length is
96, activation primitive is relu activation primitive;27th layer: full articulamentum, input vector length are 96, and output vector length is
2, activation primitive is soft-max activation primitive;The parameter of all convolutional layers is size=3 convolution kernel kernel, step-length stride
=(1,1), activation primitive are relu activation primitive;All pond layers are maximum pond layer, and parameter is pond section size
kernel_size=2, step-length stride=(2,2). Let this depth network be Fconv27; for a colour image X, the feature map set obtained through the depth network is denoted Fconv27(X). The evaluation function of the network computes the cross entropy loss of (Fconv27(X)-y), the convergence direction is minimization, and y is the class corresponding to the input. The database consists of images acquired in the natural world containing passers-by and non-passers-by; every image is a colour image of 768 × 1024 dimensions, the images are divided into two classes according to whether they contain a pedestrian, and the number of iterations is 2000. After training, the first layer to the 17th layer are taken as the feature extraction depth network Fconv, and the output obtained by passing a colour image X through this depth network is denoted Fconv(X).
The construction of the region selection network is as follows: it receives the set Fconv(X) of 512 feature maps of size 48 × 64 extracted by the Fconv depth network. First step, obtain Conv1(Fconv(X)) through a convolutional layer with parameters: convolution kernel kernel size=1, step-length stride=(1,1), inputting is 48 × 64 × 512, exporting is 48 × 64 × 512, port number channels=512. Then Conv1(Fconv(X)) is separately input to two convolutional layers, Conv2-1 and Conv2-2. The structure of Conv2-1 is: inputting is 48 × 64 × 512, exporting is 48 × 64 × 18, port number channels=18; the output of this layer is Conv2-1(Conv1(Fconv(X))), and applying the activation function softmax to this output gives softmax(Conv2-1(Conv1(Fconv(X)))). The structure of Conv2-2 is: inputting is 48 × 64 × 512, exporting is 48 × 64 × 36, port number channels=36. The network has two loss functions: the first error function loss1 computes the softmax error of Wshad-cls(X)⊙(Conv2-1(Conv1(Fconv(X)))-Wcls(X)); the second error function loss2 computes the smooth L1 error of Wshad-reg(X)⊙(Conv2-2(Conv1(Fconv(X)))-Wreg(X)); the loss function of the region selection network = loss1/sum(Wcls(X)) + loss2/sum(Wcls(X)), where sum() denotes the sum of all elements of the matrix, and the convergence direction is minimization. Wcls(X) and Wreg(X) are respectively the positive and negative sample information corresponding to database image X, ⊙ denotes element-wise multiplication of matrices, and Wshad-cls(X) and Wshad-reg(X) are masks whose role is to select the parts with weight 1 for training, so as to avoid an excessive gap between the numbers of positive and negative samples; Wshad-cls(X) and Wshad-reg(X) are regenerated at each iteration, and the algorithm iterates 1000 times.
The database used by the feature extraction depth network is constructed as follows. For each image in the database, step 1: each human body image region, face region, hand region and product region is manually demarcated; if its centre coordinate in the input image is (abas_tr, bbas_tr), the distance from the centre coordinate to the upper and lower frame edges in the longitudinal direction is lbas_tr, and the distance from the centre coordinate to the left and right frame edges in the lateral direction is wbas_tr, then the corresponding position on Conv1 has centre coordinate (⌊abas_tr/16⌋, ⌊bbas_tr/16⌋), half-length ⌊lbas_tr/16⌋ and half-width ⌊wbas_tr/16⌋, where ⌊ ⌋ indicates taking the integer part; step 2: positive and negative samples are generated at random.
The random generation of positive and negative samples proceeds as follows: first step, construct 9 regional frames; second step, for each image Xtr in the database, let Wcls be of 48 × 64 × 18 dimensions and Wreg be of 48 × 64 × 36 dimensions, with all initial values 0, and fill Wcls and Wreg.
Described 9 regional frames of construction, this 9 regional frames are respectively as follows: Ro1(xRo, yRo)=(xRo, yRo, 64,64), Ro2
(xRo, yRo)=(xRo, yRo, 45,90), Ro3(xRo, yRo)=(xRo, yRo, 90,45), Ro4(xRo, yRo)=(xRo, yRo, 128,
128), Ro5(xRo, yRo)=(xRo, yRo, 90,180), Ro6(xRo, yRo)=(xRo, yRo, 180,90), Ro7(xRo, yRo)=
(xRo, yRo, 256, 256), Ro8(xRo, yRo)=(xRo, yRo, 360, 180), Ro9(xRo, yRo)=(xRo, yRo, 180, 360). For each regional frame, Roi(xRo, yRo) denotes the i-th regional frame with centre coordinate (xRo, yRo); the third element indicates the pixel distance from the centre point to the upper and lower frame edges, the fourth element indicates the pixel distance from the centre point to the left and right frame edges, and i takes values from 1 to 9.
The filling of Wcls and Wreg proceeds as follows:
For each manually demarcated body interval, let its centre coordinate in the input image be (abas_tr, bbas_tr), the distance from the centre coordinate to the upper and lower frame edges in the longitudinal direction be lbas_tr, and the distance from the centre coordinate to the left and right frame edges in the lateral direction be wbas_tr; it corresponds to the position on Conv1 with centre coordinate (⌊abas_tr/16⌋, ⌊bbas_tr/16⌋), half-length ⌊lbas_tr/16⌋ and half-width ⌊wbas_tr/16⌋.
For each point (xctr, yctr) in the section bounded by the upper-left corner (⌊abas_tr/16⌋-⌊lbas_tr/16⌋, ⌊bbas_tr/16⌋-⌊wbas_tr/16⌋) and the lower-right corner (⌊abas_tr/16⌋+⌊lbas_tr/16⌋, ⌊bbas_tr/16⌋+⌊wbas_tr/16⌋):
For i from 1 to 9:
For point (xctr, yctr), its mapping range in the database image is the 16 × 16 section bounded by the upper-left corner (16(xctr-1)+1, 16(yctr-1)+1) and the lower-right corner (16xctr, 16yctr); for each point (xOtr, yOtr) in that section:
Compute the coincidence factor of the region Roi(xOtr, yOtr) corresponding to (xOtr, yOtr) with the currently manually demarcated section;
Select the point (xIoUMax, yIoUMax) with the highest coincidence factor in the current 16 × 16 section. If the coincidence factor > 0.7, then Wcls(xctr, yctr, 2i-1)=1 and Wcls(xctr, yctr, 2i)=0, the point is a positive sample, Wreg(xctr, yctr, 4i-3)=(xOtr-16xctr+8)/8, Wreg(xctr, yctr, 4i-2)=(yOtr-16yctr+8)/8, Wreg(xctr, yctr, 4i-1)=Down1(lbas_tr/the third element of Roi), Wreg(xctr, yctr, 4i)=Down1(wbas_tr/the fourth element of Roi), where Down1() takes the value 1 if the value is greater than 1; if the coincidence factor < 0.3, then Wcls(xctr, yctr, 2i-1)=0 and Wcls(xctr, yctr, 2i)=1; otherwise Wcls(xctr, yctr, 2i-1)=-1 and Wcls(xctr, yctr, 2i)=-1.
If the human region currently demarcated manually has no Roi(xOtr, yOtr) with coincidence factor > 0.6, then the Roi(xOtr, yOtr) with the highest coincidence factor is selected to assign values to Wcls and Wreg; the assignment method is the same as that for coincidence factor > 0.7.
The computation of the coincidence factor of the region Roi(xOtr, yOtr) corresponding to (xOtr, yOtr) with the manually demarcated section is as follows: let the manually demarcated body interval have centre coordinate (abas_tr, bbas_tr) in the input image, with distance lbas_tr from the centre coordinate to the upper and lower frame edges and distance wbas_tr from the centre coordinate to the left and right frame edges; let the third element of Roi(xOtr, yOtr) be lOtr and the fourth element be wOtr. If |xOtr-abas_tr| ≤ lOtr+lbas_tr-1 and |yOtr-bbas_tr| ≤ wOtr+wbas_tr-1, then there is an overlapping region, and overlapping region = (lOtr+lbas_tr-1-|xOtr-abas_tr|)×(wOtr+wbas_tr-1-|yOtr-bbas_tr|); otherwise overlapping region = 0. Compute whole region = (2lOtr-1)×(2wOtr-1)+(2lbas_tr-1)×(2wbas_tr-1)-overlapping region, so that coincidence factor = overlapping region/whole region, where | | denotes taking the absolute value.
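A direct Python transcription of the coincidence-factor computation above: boxes are given as (centre x, centre y, half-height l, half-width w) in pixels, the overlap is measured in whole pixels, and the ratio is the intersection over the union of the two pixel areas. Variable names are assumptions.

def coincidence(box_a, box_b):
    xa, ya, la, wa = box_a
    xb, yb, lb, wb = box_b
    if abs(xa - xb) <= la + lb - 1 and abs(ya - yb) <= wa + wb - 1:
        overlap = (la + lb - 1 - abs(xa - xb)) * (wa + wb - 1 - abs(ya - yb))
    else:
        overlap = 0
    whole = (2 * la - 1) * (2 * wa - 1) + (2 * lb - 1) * (2 * wb - 1) - overlap
    return overlap / whole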
The construction of Wshad-cls(X) and Wshad-reg(X) is as follows: for image X, the corresponding positive and negative sample information is Wcls(X) and Wreg(X). First step, construct Wshad-cls(X) and Wshad-reg(X); Wshad-cls(X) has the same dimensions as Wcls(X), and Wshad-reg(X) has the same dimensions as Wreg(X). Second step, record the information of all positive samples: for i = 1 to 9, if Wcls(X)(a, b, 2i-1)=1, then Wshad-cls(X)(a, b, 2i-1)=1, Wshad-cls(X)(a, b, 2i)=1, Wshad-reg(X)(a, b, 4i-3)=1, Wshad-reg(X)(a, b, 4i-2)=1, Wshad-reg(X)(a, b, 4i-1)=1, Wshad-reg(X)(a, b, 4i)=1; in total sum(Wshad-cls(X)) positive samples are selected, where sum() denotes summing all elements of the matrix; if sum(Wshad-cls(X)) > 256, 256 positive samples are retained at random. Third step, randomly select negative samples: randomly choose (a, b, i); if Wcls(X)(a, b, 2i)=1, then Wshad-cls(X)(a, b, 2i-1)=1, Wshad-cls(X)(a, b, 2i)=1, Wshad-reg(X)(a, b, 4i-3)=1, Wshad-reg(X)(a, b, 4i-2)=1, Wshad-reg(X)(a, b, 4i-1)=1, Wshad-reg(X)(a, b, 4i)=1; the number of negative samples chosen is 256-sum(Wshad-cls(X)); if the negative samples are insufficient to reach 256-sum(Wshad-cls(X)) and 20 consecutive random draws of (a, b, i) all fail to yield a negative sample, the algorithm terminates.
The ROI layer takes as input the image X and a region. Its method is as follows: for image X, the output Fconv(X) obtained through the feature extraction depth network Fconv has dimensions 48 × 64 × 512; for each of the 512 matrices VROI_I of size 48 × 64, extract the sub-region of VROI_I bounded by the upper-left and lower-right corners of the region frame scaled onto the feature map (⌊ ⌋ indicates taking the integer part); the output roiI(X) has dimensions 7 × 7, obtained as follows:
For iROI = 1 to 7:
For jROI = 1 to 7:
Construct the corresponding sub-section of the extracted region;
roiI(X)(iROI, jROI) = the value of the maximum point in that sub-section.
When all 512 matrices of size 48 × 64 have been processed, the outputs are spliced to obtain an output of 7 × 7 × 512 dimensions, denoted as the ROI of image X within the range of the regional frame.
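A hedged Python sketch of the ROI layer: for each of the 512 feature maps of Fconv(X), the sub-matrix covered by the region frame is divided into a 7 × 7 grid and the maximum of each cell is taken, giving a 7 × 7 × 512 output. The exact scaling of the region corners onto the 48 × 64 map is not fully legible in the text, so the region is passed here already in feature-map coordinates; all names are assumptions.

import numpy as np

def roi_pool(feature_map, top, left, bottom, right, out_size=7):
    """feature_map: (48, 64, 512); corners are inclusive feature-map indices."""
    h = bottom - top + 1
    w = right - left + 1
    out = np.zeros((out_size, out_size, feature_map.shape[2]), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            r0 = top + (i * h) // out_size
            r1 = top + ((i + 1) * h) // out_size
            c0 = left + (j * w) // out_size
            c1 = left + ((j + 1) * w) // out_size
            # maximum of the current grid cell across its rows and columns
            out[i, j, :] = feature_map[r0:max(r1, r0 + 1), c0:max(c1, c0 + 1), :].max(axis=(0, 1))
    return out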
The coordinate refining network is built as follows. First step, extend the database: the extension method is, for each image X in the database and each corresponding manually demarcated region, with its corresponding ROI: if the current interval is a human body image region, then BClass=[1,0,0,0,0] and BBox=[0,0,0,0]; if the current interval is a face region, then BClass=[0,1,0,0,0] and BBox=[0,0,0,0]; if the current interval is a hand region, then BClass=[0,0,1,0,0] and BBox=[0,0,0,0]; if the current interval is a product region, then BClass=[0,0,0,1,0] and BBox=[0,0,0,0]. Random numbers arand, brand, lrand, wrand with values between -1 and 1 are generated to obtain a new interval, for which BBox=[arand, brand, lrand, wrand] (⌊ ⌋ indicates taking the integer part); if the coincidence factor of the new interval with the original interval is > 0.7, then BClass = the BClass of the current region; if the coincidence factor of the new interval with the original interval is < 0.3, then BClass=[0,0,0,0,1]; if neither is satisfied, no assignment is made. Each interval generates at most 10 positive sample regions; if Num1 positive sample regions are generated, then Num1+1 negative sample regions are generated; if there are fewer than Num1+1 negative sample regions, the range of arand, brand, lrand, wrand is enlarged until enough negative samples are found. Second step, build the coordinate refining network: for each image X in the database and each corresponding manually demarcated human region with its corresponding ROI, the ROI of 7 × 7 × 512 dimensions is unfolded into a 25088-dimensional vector and then passed through two fully connected layers Fc2 to obtain the output Fc2(ROI); Fc2(ROI) is then passed respectively through the classification layer FClass and the interval fine-tuning layer FBBox to obtain the outputs FClass(Fc2(ROI)) and FBBox(Fc2(ROI)). The classification layer FClass is a fully connected layer with input vector length 512 and output vector length 5; the interval fine-tuning layer FBBox is a fully connected layer with input vector length 512 and output vector length 4. The network has two loss functions: the first error function loss1 computes the softmax error of FClass(Fc2(ROI))-BClass, and the second error function loss2 computes the Euclidean distance error of (FBBox(Fc2(ROI))-BBox); the whole loss function of the refining network = loss1+loss2. The algorithm iteration process is: first iterate 1000 times to converge the error function loss2, then iterate 1000 times to converge the whole loss function.
The two fully connected layers Fc2 have the structure: first layer: fully connected layer, input vector length is 25088, output vector length is 4096, activation function is the relu activation function; second layer: fully connected layer, input vector length is 4096, output vector length is 512, activation function is the relu activation function.
3. Shopping action recognition module: in initialization, the static action recognition classifier is first initialized using standard hand motion images, so that the static action recognition classifier can recognize the grasping and putting-down motions of the hand; then the dynamic action recognition classifier is initialized using hand motion videos, so that the dynamic action recognition classifier can recognize taking out an article, putting back an article, taking out and putting back, having taken out an article and not put it back, or suspicious stealing.
The step of initializing the static action recognition classifier using standard hand motion images is as follows: first step, arrange the video data: firstly, choose a large number of videos of people shopping in supermarkets; these videos include the actions of taking out a product, putting back an article, taking out and putting back, having taken out an article and not put it back, and suspicious stealing. Each video clip is intercepted manually, taking the frame where the hand touches the goods as the start frame and the frame where the hand leaves the goods as the end frame; then, for each frame of the video, the target detection module is used to extract its hand region, each frame image of the hand region is scaled to a 256 × 256 colour image, the scaled video is put into the hand motion video set, and the video is labelled as one of the actions of taking out an article, putting back an article, taking out and putting back, having taken out an article and not put it back, or suspicious stealing. For each video whose class is taking out an article, putting back an article, taking out and putting back, or having taken out an article and not put it back, the first frame of the video is put into the hand motion image set and labelled as a grasp motion, the last frame of the video is put into the hand motion image set and labelled as a put-down motion, and one frame other than the first and the last frame is taken at random from the video, put into the hand motion image set and labelled as other. The hand motion video set and the hand motion image set are thus obtained. Second step, construct the static action recognition classifier StaticN. Third step, initialize the static action recognition classifier StaticN with the hand motion image set constructed in the first step as input; if the image input each time is Handp, the output is StaticN(Handp) and the class label is yHandp, represented as follows: grasp: yHandp=[1,0,0]; put down: yHandp=[0,1,0]; other: yHandp=[0,0,1]. The evaluation function of the network computes the cross entropy loss of (StaticN(Handp)-yHandp), the convergence direction is minimization, and the number of iterations is 2000.
The construction static state action recognition classifier StaticN, network structure are as follows: first layer: convolutional layer inputs and is
256 × 256 × 3, exporting is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, input as 256 ×
256 × 64, exporting is 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256
× 256 × 64 are connected in third dimension with third layer output 256 × 256 × 64, and exporting is 128 × 128 × 128;The
Four layers: convolutional layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;5th
Layer: convolutional layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6:
Pond layer inputs the 4th layer of output 128 × 128 × 128 and is connected in third dimension with layer 5 128 × 128 × 128,
Output is 64 × 64 × 256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number
Channels=256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number
Channels=256;9th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number
Channels=256;Tenth layer: pond layer, input for layer 7 output 64 × 64 × 256 and the 9th layer 64 × 64 × 256
It is connected in third dimension, exporting is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, exports and is
32 × 32 × 512, port number channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, export as 32 ×
32 × 512, port number channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 ×
512, port number channels=512;14th layer: pond layer inputs as eleventh floor output 32 × 32 × 512 and the 13rd
Layer 32 × 32 × 512 is connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16
× 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16
× 512, exporting is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, input as 16 × 16 ×
512, exporting is 16 × 16 × 512, port number channels=512;18th layer: pond layer is inputted and is exported for the 15th layer
16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th
layer: convolutional layer, the input is 8 × 8 × 1024, the output is 8 × 8 × 256, number of channels channels=256; 20th layer: pooling layer, the input is 8 × 8 × 256, the output is 4 × 4 × 256; 21st layer: convolutional layer, the input is 4 × 4 × 256, the output is 4 × 4 × 128, number of channels channels=128; 22nd layer: pooling layer, the input is 4 × 4 × 128, the output is 2 × 2 × 128; 23rd layer: fully connected layer, the input data of dimension 2 × 2 × 128 is first flattened into a vector of length 512 and then fed into the fully connected layer, the output vector length is 128, and the activation function is the relu activation function; 24th layer: fully connected layer, the input vector length is 128, the output vector length is 32, and the activation function is the relu activation function; 25th layer: fully connected layer, the input vector length is 32, the output vector length is 3, and the activation function is the soft-max activation function; the parameters of all convolutional layers are convolution kernel size kernel size=3, step length stride=(1,1), and the activation function is the relu activation function; all pooling layers are max pooling layers with parameters pooling window size kernel_size=2 and step length stride=(2,2).
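As an illustration only, the following is a minimal PyTorch sketch of the StaticN structure described above; the framework choice and all identifier names are assumptions, and only the layer types, sizes, 3 × 3 kernels, stride (1,1) and 2 × 2 max pooling are taken from the description.

import torch
import torch.nn as nn

def conv(cin, cout):
    # 3x3 convolution, stride 1, padding 1 keeps the spatial size, followed by ReLU
    return nn.Sequential(nn.Conv2d(cin, cout, kernel_size=3, stride=1, padding=1),
                         nn.ReLU(inplace=True))

class StaticN(nn.Module):
    def __init__(self):
        super().__init__()
        self.c1, self.c2 = conv(3, 64), conv(64, 64)
        self.c4, self.c5 = conv(128, 128), conv(128, 128)
        self.c7, self.c8, self.c9 = conv(256, 256), conv(256, 256), conv(256, 256)
        self.c11, self.c12, self.c13 = conv(512, 512), conv(512, 512), conv(512, 512)
        self.c15, self.c16, self.c17 = conv(1024, 512), conv(512, 512), conv(512, 512)
        self.c19 = conv(1024, 256)
        self.c21 = conv(256, 128)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc = nn.Sequential(nn.Flatten(),                # 2 x 2 x 128 -> 512
                                nn.Linear(512, 128), nn.ReLU(inplace=True),
                                nn.Linear(128, 32), nn.ReLU(inplace=True),
                                nn.Linear(32, 3))            # soft-max applied in the loss

    def forward(self, x):                                    # x: (N, 3, 256, 256)
        a = self.c1(x); b = self.c2(a)
        x = self.pool(torch.cat([a, b], dim=1))              # 128 x 128 x 128
        a = self.c4(x); b = self.c5(a)
        x = self.pool(torch.cat([a, b], dim=1))              # 64 x 64 x 256
        a = self.c7(x); b = self.c9(self.c8(a))
        x = self.pool(torch.cat([a, b], dim=1))              # 32 x 32 x 512
        a = self.c11(x); b = self.c13(self.c12(a))
        x = self.pool(torch.cat([a, b], dim=1))              # 16 x 16 x 1024
        a = self.c15(x); b = self.c17(self.c16(a))
        x = self.pool(torch.cat([a, b], dim=1))              # 8 x 8 x 1024
        x = self.pool(self.c19(x))                           # 4 x 4 x 256
        x = self.pool(self.c21(x))                           # 2 x 2 x 128
        return self.fc(x)

# Training as described: cross-entropy between StaticN(Handp) and yHandp, minimized, e.g.
# loss = nn.CrossEntropyLoss()(StaticN()(images), labels)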
Described initializes dynamic action recognition classifier using hand motion video, method are as follows: the first step,
constructing the data set: 10 frame images are uniformly extracted from each video in the hand motion video set constructed in the first step of the initialization of the static action recognition classifier with standard hand motion images, and serve as the input; the second step, constructing the dynamic action recognition classifier DynamicN; the third step, initializing the dynamic action recognition classifier DynamicN: the input is the set of 10 frame images extracted from each video in the first step; if the 10 frame images input each time are Handv, the output is DynamicN(Handv) and its class is yHandv, where yHandv is represented as: taking out an article: yHandv = [1, 0, 0, 0, 0]; putting back an article: yHandv = [0, 1, 0, 0, 0]; taking out and putting back: yHandv = [0, 0, 1, 0, 0]; taking out an article without putting it back: yHandv = [0, 0, 0, 1, 0]; suspicious theft action: yHandv = [0, 0, 0, 0, 1]. The evaluation function of the network is the cross-entropy loss computed between DynamicN(Handv) and yHandv, the convergence direction is minimization, and the number of iterations is 2000.
The uniform extraction of 10 frame images, method as follows: for a video of Nf frames, the 1st frame of the video is extracted as the 1st frame of the extracted set, the last frame of the video is extracted as the 10th frame of the extracted set, and the ickt-th frame of the extracted set (ickt = 2 to 9) is taken from the video at the position that divides the span between the first and last frames into equal parts, the fractional frame index being reduced to its integer part.
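A minimal sketch of this uniform sampling, assuming evenly spaced 1-based indices with floor rounding; the exact index formula used by the patent may differ slightly.

def sample_ten_frames(num_frames: int) -> list[int]:
    # 1-based frame indices; the first and last frames are always included
    if num_frames < 10:
        raise ValueError("clip must contain at least 10 frames")
    return [1 + (num_frames - 1) * k // 9 for k in range(10)]

# Example: sample_ten_frames(55) -> [1, 7, 13, 19, 25, 31, 37, 43, 49, 55]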
The construction dynamic action recognition classifier DynamicN, network structure are as follows:
First layer: convolutional layer, inputting is 256 × 256 × 30, and exporting is 256 × 256 × 512, port number channels=
512;The second layer: convolutional layer, inputting is 256 × 256 × 512, and exporting is 256 × 256 × 128, port number channels=
128;Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: convolutional layer, input
It is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer inputs and is
128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, input the 4th
Layer output 128 × 128 × 128 is connected in third dimension with layer 5 128 × 128 × 128, export as 64 × 64 ×
256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The
Eighth layer: convolutional layer, the input is 64 × 64 × 256, the output is 64 × 64 × 256, port number channels=256; 9th layer: convolutional layer, the input is 64 × 64 × 256, the output is 64 × 64 × 256, port number channels=256; Tenth layer: pooling layer, the input is the layer 7 output 64 × 64 × 256 concatenated with the 9th layer output 64 × 64 × 256 in the third dimension, the output is 32 × 32
×512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=
512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;
13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth
Four layers: pond layer inputs and exports 32 × 32 × 512 and the 13rd layer 32 × 32 × 512 in third dimension for eleventh floor
It is connected, exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, export as 16 × 16 ×
512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512,
Port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, channel
Number channels=512;18th layer: pond layer inputs and exports 16 × 16 × 512 and the 17th layer 16 × 16 for the 15th layer
× 512, concatenated in the third dimension, the output is 8 × 8 × 1024; 19th layer: convolutional layer, the input is 8 × 8 × 1024, the output is 8 × 8 × 256, number of channels channels=256; 20th layer: pooling layer, the input is 8 × 8 × 256, the output is 4 × 4 × 256; 21st layer: convolutional layer, the input is 4 × 4 × 256, the output is 4 × 4 × 128, number of channels channels=128; 22nd layer: pooling layer, the input is 4 × 4 × 128, the output is 2 × 2 × 128; 23rd layer: fully connected layer, the input data of dimension 2 × 2 × 128 is first flattened into a vector of length 512 and then fed into the fully connected layer, the output vector length is 128, and the activation function is the relu activation function; 24th layer: fully connected layer, the input vector length is 128, the output vector length is 32, and the activation function is the relu activation function; 25th layer: fully connected layer, the input vector length is 32, the output vector length is 5 (one position per action class), and the activation function is the soft-max activation function; the parameters of all convolutional layers are convolution kernel size kernel size=3, step length stride=(1,1), and the activation function is the relu activation function; all pooling layers are max pooling layers with parameters pooling window size kernel_size=2 and step length stride=(2,2).
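The DynamicN front end differs from StaticN mainly in that the 10 sampled RGB frames are stacked along the channel dimension into a 256 × 256 × 30 input; a minimal sketch of that stacking step (the framework and tensor layout (N, C, H, W) are assumptions):

import torch

def stack_frames(frames):                          # frames: list of 10 tensors (3, 256, 256)
    assert len(frames) == 10
    return torch.cat(frames, dim=0).unsqueeze(0)   # -> (1, 30, 256, 256)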
4. product identification module: during initialization, the product identification classifier is first initialized using the set of product images from all angles, and a product list is generated from the product images.
The initialization of the product identification classifier using the set of product images from all angles, together with the generation of the product list, proceeds as follows: the first step, constructing the data set and the product list: the data set consists of the images of each product from all angles, and the product list listGoods is a vector in which each position corresponds to one product name; the second step, constructing the product identification classifier GoodsN; the third step, initializing the constructed product identification classifier GoodsN: the input is the set of product images from all angles; if the input image is Goods, the output is GoodsN(Goods) and its class is yGoods, where yGoods is a vector whose length equals the number of products in the product list, represented as: if the image Goods shows the product in the iGoods-th position of the list, then the iGoods-th position of yGoods is 1 and all other positions are 0. The evaluation function of the network is the cross-entropy loss computed between GoodsN(Goods) and yGoods, the convergence direction is minimization, and the number of iterations is 2000.
The construction of the product identification classifier GoodsN: the network consists of two parts, GoodsN1 and GoodsN2;
The network structure of GoodsN1 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, and exporting is 256 × 256 × 64, port number
Channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, and exporting is 256 × 256 × 128, port number
Channels=128; Third layer: pooling layer, the input is 256 × 256 × 128, the output is 128 × 128 × 128; 4th layer: convolutional layer, the input is 128 × 128 × 128, the output is 128 × 128 × 128, port number channels=128; Layer 5: convolution
Layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer,
It inputs the 4th layer of output 128 × 128 × 128 to be connected in third dimension with layer 5 128 × 128 × 128, exporting is 64
×64×256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=
256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The
Nine layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond
Change layer, inputs and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, export
It is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number
Channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number
Channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number
Channels=512;14th layer: pond layer, input for eleventh floor export 32 × 32 × 512 and the 13rd layer 32 × 32 ×
512 are connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, input as 16 × 16 ×
1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512,
Output is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, output
It is 16 × 16 × 512, port number channels=512;18th layer: pond layer, input for the 15th layer output 16 × 16 ×
512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th layer: convolution
Layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: pond layer, input
It is 8 × 8 × 256, exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, export as 4 × 4 ×
128, port number channels=128;The parameter of all convolutional layers be size=3 convolution kernel kernel, step-length stride=(1,
1), activation primitive is relu activation primitive;All pond layers are maximum pond layer, and parameter is pond section size
Kernel_size=2, step-length stride=(2,2).The network structure of GoodsN2 are as follows: inputting is 4 × 4 × 128, first will be defeated
The data entered are launched into the vector of 2048 dimensions, then input into first layer;First layer: full articulamentum, input vector length are
2048, output vector length is 1024, and activation primitive is relu activation primitive;The second layer: full articulamentum, input vector length are
1024, output vector length is 1024, and activation primitive is relu activation primitive;Third layer: full articulamentum, input vector length are
1024, output vector length is len (listGoods), activation primitive is soft-max activation primitive;len(listGoods) indicate
The length of product list.For any input Goods2, GoodsN (Goods2)=GoodsN2 (GoodsN1 (Goods2)).
5. individual identification module: during initialization, the face feature extractor FaceN is first initialized using the face image set of all angles and μface is calculated; then the human body feature extractor BodyN is initialized using the human body images of all angles and μbody is calculated.
The initialization of the face feature extractor FaceN using the face image set of all angles, and the calculation of μface, proceed as follows: the first step, the face image sets of all angles are chosen to constitute the face data set; the second step, the face feature extractor FaceN is constructed and initialized using the face data set; the third step:
For each person iPeop in the face data set, obtain the set FaceSet(iPeop) of all face images in the face data set belonging to iPeop:
For each face image Face(jiPeop) in FaceSet(iPeop):
Calculate the face feature FaceN(Face(jiPeop));
Take the average of all face features in the current face image set FaceSet(iPeop) as the center of the current face images, center(FaceN(Face(jiPeop))), and calculate the distances between every face feature in the current face image set FaceSet(iPeop) and this center; these distances constitute the distance set corresponding to iPeop.
The corresponding distance set is obtained in this way for every person in the face data set; after a distance set is sorted from small to large, let its length be ndiset; μface is then taken from a fixed position of the sorted distance set determined by ndiset (the index formula appears in the original, with ⌊·⌋ denoting the integer part).
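A small sketch of the per-person distance-set computation above; numpy, the extract_feature() interface, and the choice of a 95th-percentile index are assumptions, since the exact index formula from the sorted distances is not reproduced here.

import numpy as np

def person_distance_set(face_images, extract_feature):
    feats = np.stack([extract_feature(img) for img in face_images])   # (n, d)
    center = feats.mean(axis=0)                                       # class center
    return np.sort(np.linalg.norm(feats - center, axis=1))            # sorted distances

def threshold_from_distances(sorted_distances, quantile=0.95):
    n_diset = len(sorted_distances)
    idx = min(int(n_diset * quantile), n_diset - 1)                   # integer part of index
    return sorted_distances[idx]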
The construction of the face feature extractor FaceN and its initialization using the face data set: suppose the face data set consists of Nfaceset individuals; the network structure FaceN25 is as follows: first layer: convolutional layer, the input is 256 × 256 × 3, the output is
256 × 256 × 64, port number channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, export as 256 ×
256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are defeated with third layer
256 × 256 × 64 are connected in third dimension out, and exporting is 128 × 128 × 128;4th layer: convolutional layer inputs and is
128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer, inputting is 128
× 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, the 4th layer of input are defeated
128 × 128 × 128 are connected in third dimension with layer 5 128 × 128 × 128 out, and exporting is 64 × 64 × 256;The
Seven layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;8th layer: volume
Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: convolutional layer, it is defeated
Entering is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, inputting is
Seven layers of output 64 × 64 × 256 are connected in third dimension with the 9th layer 64 × 64 × 256, and exporting is 32 × 32 × 512;
Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth
Two layers: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;13rd layer:
Convolutional layer, the input is 32 × 32 × 512, the output is 32 × 32 × 512, port number channels=512; 14th layer: pooling layer, the input is the eleventh layer output 32 × 32 × 512 concatenated with the 13th layer output 32 × 32 × 512 in the third dimension, the output is 16 × 16 × 1024; 15th layer: convolutional layer, the input is 16 × 16 × 1024, the output is 16 × 16 × 512, port number
Channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number
Channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number
Channels=512;18th layer: pond layer, input for the 15th layer export 16 × 16 × 512 and the 17th layer 16 × 16 ×
512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, defeated
It is out 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, and exporting is 4 × 4
× 256; 21st layer: convolutional layer, the input is 4 × 4 × 256, the output is 4 × 4 × 128, port number channels=128; 22nd layer: pooling layer, the input is 4 × 4 × 128, the output is 2 × 2 × 128; 23rd layer: fully connected layer, the input data of dimension 2 × 2 × 128 is first flattened into a vector of length 512 and then fed into the fully connected layer, the output vector length is 512, and the activation function is the relu activation function; 24th layer: fully connected layer, the input vector length is 512, the output vector length is 512, and the activation function is the relu activation function; 25th layer: fully connected layer, the input vector length is 512, the output vector length is Nfaceset, and the activation function is the soft-max activation function; the parameters of all convolutional layers are convolution kernel size kernel size=3, step length stride=(1,1), and the activation function is the relu activation function; all pooling layers are max pooling layers with parameters pooling window size kernel_size=2 and step length stride=(2,2). Its initialization procedure is as follows: for each face image face4, the output is FaceN25(face4) and its class is yface, where yface is a vector of length Nfaceset represented as: if the face face4 belongs to the iface4-th person in the face image set, then the iface4-th position of yface is 1 and all other positions are 0. The evaluation function of the network is the cross-entropy loss computed between FaceN25(face4) and yface, the convergence direction is minimization, and the number of iterations is 2000; after the iterations, the face feature extractor FaceN consists of the FaceN25 network from the first layer to the 24th layer.
The initialization of the human body feature extractor BodyN using the human body images of all angles, and the calculation of μbody, proceed as follows: the first step, the human body image sets of all angles are chosen to constitute the body data set; the second step, the human body feature extractor BodyN is constructed and initialized using the body data set; the third step:
For each person iPeop1 in the body data set, obtain the set BodySet(iPeop1) of all human body images in the body data set belonging to iPeop1:
For each human body image Body(jiPeop1) in BodySet(iPeop1):
Calculate the human body feature BodyN(Body(jiPeop1));
Take the average of all human body features in the current human body image set BodySet(iPeop1) as the center of the current human body images, center(BodyN(Body(jiPeop1))), and calculate the distances between every human body feature in the current human body image set BodySet(iPeop1) and this center; these distances constitute the distance set corresponding to iPeop1.
The corresponding distance set is obtained in this way for every person in the body data set; after a distance set is sorted from small to large, let its length be ndiset1; μbody is then taken from a fixed position of the sorted distance set determined by ndiset1 (the index formula appears in the original, with ⌊·⌋ denoting the integer part).
The construction of the human body feature extractor BodyN and its initialization using the body data set: suppose the body data set consists of Nbodyset individuals; the network structure BodyN25 is as follows: first layer: convolutional layer, the input is 256 × 256 × 3, the output is
256 × 256 × 64, port number channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, export as 256 ×
256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are defeated with third layer
256 × 256 × 64 are connected in third dimension out, and exporting is 128 × 128 × 128;4th layer: convolutional layer inputs and is
128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer, inputting is 128
× 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, the 4th layer of input are defeated
128 × 128 × 128 are connected in third dimension with layer 5 128 × 128 × 128 out, and exporting is 64 × 64 × 256;The
Seven layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;8th layer: volume
Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: convolutional layer, it is defeated
Entering is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, inputting is
Seven layers of output 64 × 64 × 256 are connected in third dimension with the 9th layer 64 × 64 × 256, and exporting is 32 × 32 × 512;
Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth
Two layers: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;13rd layer:
Convolutional layer, the input is 32 × 32 × 512, the output is 32 × 32 × 512, port number channels=512; 14th layer: pooling layer, the input is the eleventh layer output 32 × 32 × 512 concatenated with the 13th layer output 32 × 32 × 512 in the third dimension, the output is 16 × 16 × 1024; 15th layer: convolutional layer, the input is 16 × 16 × 1024, the output is 16 × 16 × 512, port number
Channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number
Channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number
Channels=512;18th layer: pond layer, input for the 15th layer export 16 × 16 × 512 and the 17th layer 16 × 16 ×
512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, defeated
It is out 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, and exporting is 4 × 4
× 256; 21st layer: convolutional layer, the input is 4 × 4 × 256, the output is 4 × 4 × 128, port number channels=128; 22nd layer: pooling layer, the input is 4 × 4 × 128, the output is 2 × 2 × 128; 23rd layer: fully connected layer, the input data of dimension 2 × 2 × 128 is first flattened into a vector of length 512 and then fed into the fully connected layer, the output vector length is 512, and the activation function is the relu activation function; 24th layer: fully connected layer, the input vector length is 512, the output vector length is 512, and the activation function is the relu activation function; 25th layer: fully connected layer, the input vector length is 512, the output vector length is Nbodyset, and the activation function is the soft-max activation function; the parameters of all convolutional layers are convolution kernel size kernel size=3, step length stride=(1,1), and the activation function is the relu activation function; all pooling layers are max pooling layers with parameters pooling window size kernel_size=2 and step length stride=(2,2). Its initialization procedure is as follows: for each human body image body4, the output is BodyN25(body4) and its class is ybody, where ybody is a vector of length Nbodyset represented as: if the human body body4 belongs to the ibody4-th person in the human body image set, then the ibody4-th position of ybody is 1 and all other positions are 0. The evaluation function of the network is the cross-entropy loss computed between BodyN25(body4) and ybody, the convergence direction is minimization, and the number of iterations is 2000; after the iterations, the human body feature extractor BodyN consists of the BodyN25 network from the first layer to the 24th layer.
6. recognition result processing module: it does not work during initialization.
Embodiment 2:
This embodiment implements the detection process of a supermarket intelligent vending system.
1. image pre-processing module, during detection: the first step, mean denoising is applied to the monitoring image captured by the monitoring camera, yielding the denoised monitoring image; the second step, illumination compensation is applied to the denoised monitoring image, yielding the illumination-compensated image; the third step, image enhancement is applied to the illumination-compensated image, and the enhanced data are passed to the target detection module.
The mean denoising of the monitoring image captured by the monitoring camera is performed as follows: let the monitoring image captured by the camera be Xsrc; since Xsrc is a color RGB image it has three components Xsrc-R, Xsrc-G, Xsrc-B, and each component Xsrc′ is processed separately: a 3 × 3 window is set; for each pixel Xsrc′(i, j) of the image Xsrc′, the pixel values of the 3 × 3 neighborhood centered on that point, [Xsrc′(i-1, j-1), Xsrc′(i-1, j), Xsrc′(i-1, j+1), Xsrc′(i, j-1), Xsrc′(i, j), Xsrc′(i, j+1), Xsrc′(i+1, j-1), Xsrc′(i+1, j), Xsrc′(i+1, j+1)], are sorted from large to small, and the middle value is taken as the pixel value of the denoised image Xsrc″ at (i, j), i.e. it is assigned to the filtered Xsrc″(i, j); for boundary points of Xsrc′ some pixels of the 3 × 3 window do not exist, in which case the median is computed only over the pixels that fall inside the image, and if the window contains an even number of points the average of the two middle values is taken as the denoised pixel value and assigned to Xsrc″(i, j). The new image matrix Xsrc″ is then the denoised image matrix of the current RGB component; after the three components Xsrc-R, Xsrc-G, Xsrc-B have each been denoised in this way, the resulting components Xsrc-R″, Xsrc-G″, Xsrc-B″ are integrated into a new color image XDen, which is the image obtained after denoising.
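A minimal per-channel sketch of this 3 × 3 median step; numpy is an assumption, and border pixels here use only the neighbors that exist, matching the rule in the text (median, or mean of the two middle values when the window holds an even number of points).

import numpy as np

def median_denoise_channel(ch: np.ndarray) -> np.ndarray:
    h, w = ch.shape
    out = np.empty((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            win = ch[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2].ravel()
            out[i, j] = np.median(win)          # even count: mean of the two middle values
    return out.astype(ch.dtype)

def median_denoise_rgb(img: np.ndarray) -> np.ndarray:   # img: (H, W, 3)
    return np.dstack([median_denoise_channel(img[..., c]) for c in range(3)])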
The illumination compensation of the denoised monitoring image: let the denoised monitoring image be XDen; since XDen is a color RGB image it has three RGB components, illumination compensation is applied to each component XDen′ separately, and the resulting components Xcpst′ are then integrated into a color RGB image Xcpst, which is the illumination-compensated version of XDen. The steps for compensating each component XDen′ are: the first step, let XDen′ have m rows and n columns, and construct XDen′sum and NumDen as m × n matrices with all elements initialized to 0; the window size l and the step length s are computed from min(m, n) (the exact formulas appear in the original; min(m, n) denotes the minimum of m and n, ⌊·⌋ denotes the integer part, sqrt(l) denotes the square root of l, and if l < 1 then l = 1). The second step, with the top-left coordinate of XDen taken as (1, 1), each candidate frame determined by the window size l and the step length s is enumerated starting from (1, 1), a candidate frame being the region [(a, b), (a+l, b+l)]; histogram equalization is applied to the image matrix of XDen′ inside the candidate frame region, giving the equalized image matrix XDen″ of the candidate region [(a, b), (a+l, b+l)]; then, for each element of XDen′sum in the region [(a, b), (a+l, b+l)], XDen′sum(a+iXsum, b+jXsum) = XDen′sum(a+iXsum, b+jXsum) + XDen″(iXsum, jXsum) is computed, where (iXsum, jXsum) are integers with 1 ≤ iXsum ≤ l and 1 ≤ jXsum ≤ l, and each element of NumDen in the region [(a, b), (a+l, b+l)] is incremented by 1. Finally, Xcpst′(iXsumNum, jXsumNum) = XDen′sum(iXsumNum, jXsumNum) / NumDen(iXsumNum, jXsumNum) is computed for every point (iXsumNum, jXsumNum) of XDen′, and the resulting Xcpst′ is the illumination compensation of the current component XDen′.
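A minimal sketch of this sliding-window histogram-equalization averaging; numpy and the concrete choices l = floor(sqrt(min(m, n))) and s = max(l // 2, 1) are assumptions, since the patent's exact window and step formulas are not reproduced here.

import numpy as np

def equalize(block: np.ndarray) -> np.ndarray:
    hist = np.bincount(block.ravel().astype(np.int64), minlength=256)
    cdf = np.cumsum(hist) / block.size
    lut = np.floor(255 * cdf)                    # standard histogram-equalization mapping
    return lut[block.astype(np.int64)]

def compensate_channel(x: np.ndarray) -> np.ndarray:
    m, n = x.shape
    l = max(int(np.sqrt(min(m, n))), 1)          # assumed window size
    s = max(l // 2, 1)                           # assumed step length
    acc = np.zeros((m, n), dtype=np.float64)
    cnt = np.zeros((m, n), dtype=np.float64)
    for a in range(0, m - l + 1, s):
        for b in range(0, n - l + 1, s):
            acc[a:a + l, b:b + l] += equalize(x[a:a + l, b:b + l])
            cnt[a:a + l, b:b + l] += 1
    out = np.where(cnt > 0, acc / np.maximum(cnt, 1), x.astype(np.float64))
    return out.astype(x.dtype)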
The determination of each candidate frame according to the window size l and the step length s proceeds as follows:
Let the monitoring image have m rows and n columns; (a, b) is the top-left coordinate of the selected region and (a+l, b+l) is its bottom-right coordinate, the region being denoted [(a, b), (a+l, b+l)]; the initial value of (a, b) is (1, 1);
While a + l ≤ m:
b = 1;
While b + l ≤ n:
The selected region is [(a, b), (a+l, b+l)];
b = b + s;
The inner loop ends;
a = a + s;
The outer loop ends;
In the above process, each selected region [(a, b), (a+l, b+l)] is a candidate frame.
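A small sketch of the candidate-frame enumeration loop above; 1-based coordinates as in the text, returned as ((a, b), (a + l, b + l)) tuples (names are assumptions).

def candidate_frames(m: int, n: int, l: int, s: int):
    a = 1
    while a + l <= m:
        b = 1
        while b + l <= n:
            yield (a, b), (a + l, b + l)
            b += s
        a += s

# Example: list(candidate_frames(8, 8, 4, 2)) enumerates a 2 x 2 grid of 4 x 4 windows.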
The histogram equalization of the image matrix of XDen′ inside a candidate frame region: let the candidate frame region be [(a, b), (a+l, b+l)] and let XDen″ be the image information of XDen′ in the region [(a, b), (a+l, b+l)]. The steps are: the first step, construct the vector I, where I(iI) is the number of pixels in XDen″ whose value equals iI, 0 ≤ iI ≤ 255; the second step, compute from I the equalization mapping vector I′ (the standard histogram-equalization mapping built from the cumulative sum of I and scaled to the range 0–255; the exact formula appears in the original); the third step, for each point (iXDen, jXDen) of XDen″ with pixel value XDen″(iXDen, jXDen), compute XDen″(iXDen, jXDen) = I′(XDen″(iXDen, jXDen)). After all pixel values of XDen″ have been computed and replaced, the histogram equalization process ends, and what is stored in XDen″ is the result of the histogram equalization.
The image enhancement of the illumination-compensated image: let the image after illumination compensation be Xcpst with RGB channels XcpstR, XcpstG, XcpstB, and let Xenh be the image obtained from Xcpst after image enhancement. The enhancement steps are: the first step, for each of the components XcpstR, XcpstG, XcpstB of Xcpst, compute the image obtained by blurring it at the specified scale; the second step, construct matrices LXenhR, LXenhG, LXenhB with the same dimensions as XcpstR; for the R channel of the RGB channels of image Xcpst, compute LXenhR(i, j) = log(XcpstR(i, j)) − LXcpstR(i, j), where (i, j) ranges over all points of the image matrix and LXcpstR denotes the blurred R channel; LXenhG and LXenhB are obtained for the G channel and the B channel of Xcpst with the same algorithm as for the R channel; the third step, for the R channel of the RGB channels of image Xcpst, compute the mean MeanR and the mean square deviation VarR of all values in LXenhR (note: this is the mean square deviation), compute MinR = MeanR − 2 × VarR and MaxR = MeanR + 2 × VarR, and then compute XenhR(i, j) = Fix((LXenhR(i, j) − MinR)/(MaxR − MinR) × 255), where Fix denotes taking the integer part, values < 0 are assigned 0 and values > 255 are assigned 255; XenhG and XenhB are obtained for the G channel and the B channel with the same algorithm as for the R channel, and the components XenhR, XenhG, XenhB belonging to the RGB channels are integrated into one color image Xenh.
The computation, for each of the components XcpstR, XcpstG, XcpstB of Xcpst, of the image obtained by blurring it at the specified scale, taking the R channel XcpstR of the RGB channels as an example: the first step, define the Gaussian function G(x, y, σ) = k × exp(−(x² + y²)/σ²), where σ is the scale parameter and k = 1/∫∫G(x, y)dxdy; then, for each point XcpstR(i, j) of XcpstR, the blurred value LXcpstR(i, j) is obtained by convolving XcpstR with G(x, y, σ) at that point (⊗ denoting the convolution operation); for points whose distance from the boundary is less than the scale σ, only the convolution of XcpstR with the part of G(x, y, σ) that overlaps the image is computed; Fix(·) denotes taking the integer part, values < 0 are assigned 0 and values > 255 are assigned 255. The G channel and the B channel of the RGB channels are blurred with the same algorithm as the R channel, yielding the blurred XcpstG and XcpstB.
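A minimal per-channel sketch of this single-scale log-ratio enhancement. scipy's gaussian_filter is an assumption and treats borders by reflection rather than by the truncated-kernel rule above; this sketch also takes the log of the blurred surround, as in standard single-scale Retinex, whereas the text as written subtracts the blurred value directly.

import numpy as np
from scipy.ndimage import gaussian_filter

def enhance_channel(x: np.ndarray, sigma: float = 30.0) -> np.ndarray:
    x = x.astype(np.float64) + 1.0                     # avoid log(0)
    blurred = gaussian_filter(x, sigma=sigma)
    lx = np.log(x) - np.log(blurred)                   # log-ratio of pixel to surround
    mean, std = lx.mean(), lx.std()
    lo, hi = mean - 2 * std, mean + 2 * std
    out = np.floor((lx - lo) / (hi - lo) * 255)        # stretch to 0..255
    return np.clip(out, 0, 255).astype(np.uint8)       # clamp values < 0 and > 255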
2. target detection module: during detection, it receives the images passed from the image pre-processing module and processes them; target detection is applied to each frame image using the target detection algorithm to obtain the human body image region, the face facial region, the hand region and the product region of the current image; the hand region and the product region are then sent to the shopping action recognition module, the human body image region and the face facial region are sent to the individual identification module, and the product region is passed to the product identification module;
The target detection applied to each frame image using the target detection algorithm, yielding the human body image region, the face facial region, the hand region and the product region of the current image, proceeds as follows:
The first step, the input image Xcpst is divided into sub-images of dimension 768 × 1024;
The second step, for each sub-image Xs:
Step 2.1, the feature extraction deep network Fconv constructed during initialization is applied, giving the set Fconv(Xs) of 512 feature sub-images;
Step 2.2, Fconv(Xs) is transformed by the first layer Conv1 of the region selection network, by the second layer Conv2-1 followed by the softmax activation function, and by Conv2-2, giving the outputs softmax(Conv2-1(Conv1(Fconv(Xs)))) and Conv2-2(Conv1(Fconv(Xs))); all preliminary candidate sections of the sub-image are then obtained from these output values;
Step 2.3, for all preliminary candidate sections of all sub-images of the current frame image:
Step 2.3.1, the preliminary candidate sections are ranked by the score of the current candidate region, and the 50 with the largest scores are chosen as candidate regions;
Step 2.3.2, all out-of-bounds candidate sections in the candidate section set are adjusted, and overlapping frames among the candidate sections are then weeded out, giving the final candidate sections;
Step 2.3.3, the sub-image Xs and each final candidate section are input to the ROI layer to obtain the corresponding ROI output; if the current final candidate section is (aBB(1), bBB(2), lBB(3), wBB(4)), then FBBox(Fc2(ROI)) is computed to obtain four outputs (OutBB(1), OutBB(2), OutBB(3), OutBB(4)), giving the updated coordinates (aBB(1)+8×OutBB(1), bBB(2)+8×OutBB(2), lBB(3)+8×OutBB(3), wBB(4)+8×OutBB(4)); then FClass(Fc2(ROI)) is computed and its output examined: if the first position of the output is the largest, the current section is a human body image region; if the second position is the largest, the current section is a face facial region; if the third position is the largest, the current section is a hand region; if the fourth position is the largest, the current section is a product region; and if the fifth position is the largest, the current section is a negative sample region and the final candidate section is deleted.
The third step, the coordinates of the refined final candidate sections of all sub-images are updated; the update method is: let the coordinates of the current candidate region be (TLx, TLy, RBx, RBy) and the top-left coordinate of the corresponding sub-image be (Seasub, Sebsub); the updated coordinates are (TLx+Seasub−1, TLy+Sebsub−1, RBx, RBy).
The division of the input image Xcpst into sub-images of dimension 768 × 1024 proceeds as follows: let the step lengths of the segmentation be 384 and 512, let the image have m rows and n columns, and let (asub, bsub) be the top-left coordinate of the selected region, with initial value (1, 1);
While asub < m:
bsub = 1;
While bsub < n:
The selected region is [(asub, bsub), (asub + 384, bsub + 512)]; the information of the image region of the input image Xcpst corresponding to this section is copied into a new sub-image, and the top-left coordinate (asub, bsub) is attached as its location information; if the selected region extends beyond the extent of the input image Xcpst, the RGB pixel values of the pixels outside the range are assigned 0;
bsub = bsub + 512;
The inner loop ends;
asub = asub + 384;
The outer loop ends;
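A small sketch of this tiling step with zero padding beyond the image border (numpy assumed); the tile extent here follows the +384/+512 region written in the loop above, while the 768 × 1024 dimension in the heading would correspond to a larger tile.

import numpy as np

def split_into_subimages(img: np.ndarray, tile=(384, 512), step=(384, 512)):
    m, n = img.shape[:2]                           # img: (H, W, 3)
    tiles = []                                     # list of (top_left, sub_image)
    for a in range(0, m, step[0]):
        for b in range(0, n, step[1]):
            sub = np.zeros((tile[0], tile[1], img.shape[2]), dtype=img.dtype)
            patch = img[a:a + tile[0], b:b + tile[1]]
            sub[:patch.shape[0], :patch.shape[1]] = patch
            tiles.append(((a + 1, b + 1), sub))    # 1-based coordinate as location info
    return tiles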
The derivation of all preliminary candidate sections of a sub-image from the output values proceeds as follows: the output of softmax(Conv2-1(Conv1(Fconv(Xs)))) is 48 × 64 × 18, and the output of Conv2-2(Conv1(Fconv(Xs))) is 48 × 64 × 36; for any point (x, y) of the 48 × 64 space, softmax(Conv2-1(Conv1(Fconv(Xs))))(x, y) is an 18-dimensional vector II and Conv2-2(Conv1(Fconv(Xs)))(x, y) is a 36-dimensional vector IIII; for i from 1 to 9, if II(2i−1) > II(2i), let lOtr be the third position of Roi(xOtr, yOtr) and wOtr the fourth position of Roi(xOtr, yOtr); the preliminary candidate section is then [II(2i−1), (8×IIII(4i−3)+x, 8×IIII(4i−2)+y, lOtr×IIII(4i−1), wOtr×IIII(4i))], where the first element II(2i−1) is the score of the current candidate region and the second element indicates that the center point of the current candidate section is (8×IIII(4i−3)+x, 8×IIII(4i−2)+y) and that the half-length and half-width of the candidate frame are lOtr×IIII(4i−1) and wOtr×IIII(4i) respectively.
The adjustment of all out-of-bounds candidate sections in the candidate section set: let the monitoring image have m rows and n columns; for each candidate section centered at (ach, bch) with half-length lch and half-width wch: if ach + lch > m, new values a′ch and l′ch are computed so that the frame lies within the image (the formulas appear in the original) and the section is updated with ach = a′ch, lch = l′ch; if bch + wch > n, new values b′ch and w′ch are computed in the same way and the section is updated with bch = b′ch, wch = w′ch.
The weeding out of overlapping frames among the candidate sections proceeds as follows:
While the candidate section set is not empty:
Take the candidate section iout with the largest score out of the candidate section set;
Calculate the overlap ratio between the candidate section iout and each candidate section ic in the candidate section set; if the overlap ratio is > 0.7, delete the candidate section ic from the candidate section set;
Put the candidate section iout into the output candidate section set;
When the candidate section set is empty, the candidate sections contained in the output candidate section set form the candidate section set obtained after weeding out the overlapping frames among the candidate sections.
The calculation of the overlap ratio between the candidate section iout and each candidate section ic in the candidate section set proceeds as follows: let the candidate section ic be the coordinate section centered at the point (aic, bic) with half-length lic and half-width wic, and let the candidate section iout be the coordinate section centered at the point (aiout, biout) with half-length liout and half-width wiout; compute xA = max(aic, aiout), yA = max(bic, biout), xB = min(lic, liout), yB = min(wic, wiout); if |aic − aiout| ≤ lic + liout − 1 and |bic − biout| ≤ wic + wiout − 1, there is an overlapping region and overlapping region = (lic + liout − 1 − |aic − aiout|) × (wic + wiout − 1 − |bic − biout|), otherwise overlapping region = 0; compute whole region = (2lic − 1) × (2wic − 1) + (2liout − 1) × (2wiout − 1) − overlapping region; the overlap ratio is then overlapping region / whole region.
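A sketch combining the overlap-ratio formula above with the greedy suppression loop described before it. A candidate is a tuple (score, a, b, l, w): center (a, b), half-length l, half-width w, as in the text; the function names are assumptions.

def overlap_ratio(c1, c2):
    _, a1, b1, l1, w1 = c1
    _, a2, b2, l2, w2 = c2
    if abs(a1 - a2) <= l1 + l2 - 1 and abs(b1 - b2) <= w1 + w2 - 1:
        overlap = (l1 + l2 - 1 - abs(a1 - a2)) * (w1 + w2 - 1 - abs(b1 - b2))
    else:
        overlap = 0
    whole = (2 * l1 - 1) * (2 * w1 - 1) + (2 * l2 - 1) * (2 * w2 - 1) - overlap
    return overlap / whole

def suppress_overlaps(candidates, threshold=0.7):
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)                    # candidate with the largest score
        kept.append(best)
        remaining = [c for c in remaining if overlap_ratio(best, c) <= threshold]
    return kept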
3. shopping action recognition module, during detection: the first step, each received hand region is identified with the static action recognition classifier; the recognition method is: let the image input each time be Handp1; the output StaticN(Handp1) is a 3-position vector; if the first position is the largest the action is identified as a grasp, if the second position is the largest it is identified as a put-down, and if the third position is the largest it is identified as other. The second step, after a grasp action is recognized, target tracking is applied to the region corresponding to the current grasp action; when the recognition result of the static action recognition classifier for the tracking box of the current hand region in a following frame is a put-down action, target tracking ends, and the frames accumulated from the frame in which the grasp action was recognized to the frame in which the put-down action was recognized form a video that is marked as a complete video, thereby obtaining a continuous video of the hand motion. If tracking is lost during the process, the frames accumulated from the frame in which the grasp action was recognized up to the image before the tracking loss form a video, which is marked as a grasp-only video, thereby obtaining a video containing only the grasp action. When a put-down action is recognized but the corresponding region is not among the regions obtained by target tracking, the grasp action belonging to this put-down has been lost; in that case the hand region corresponding to the current image is taken as the end of a video, tracking is carried backwards from the current frame using the target tracking method until the tracking is lost, the next frame after the lost frame is taken as the start frame of the video, and the video is marked as a put-down-only video. The third step, the complete videos obtained in the second step are identified with the dynamic action recognition classifier; the recognition method is: let the images input each time be Handv1; the output DynamicN(Handv1) is a 5-position vector; if the first position is the largest the action is identified as taking out an article, if the second position is the largest as putting back an article, if the third position is the largest as taking out and putting back, if the fourth position is the largest as having taken out an article without putting it back, and if the fifth position is the largest as a suspicious theft action; the recognition result is then sent to the recognition result processing module, the grasp-only and put-down-only videos are sent to the recognition result processing module, and the complete videos and the grasp-only videos are sent to the product identification module and the individual identification module.
The target tracking of the region corresponding to the current grasp action after a grasp action has been recognized proceeds as follows: let the image of the currently recognized grasp action be Hgrab; the current tracking region is the region corresponding to the image Hgrab. The first step, extract the ORB feature ORBHgrab of the image Hgrab; the second step, for the images corresponding to all hand regions in the next frame after Hgrab, calculate their ORB features to obtain an ORB feature set, and delete the ORB features already chosen by other tracking boxes; the third step, compare ORBHgrab with each value of the ORB feature set by Hamming distance and select the ORB feature with the smallest Hamming distance to ORBHgrab as the chosen ORB feature; if the similarity between the chosen ORB feature and ORBHgrab is > 0.85, where similarity = (1 − Hamming distance of the two ORB features / ORB feature length), then the hand region corresponding to the chosen ORB feature is the tracking box of the image Hgrab in the next frame; otherwise, if the similarity < 0.85, the tracking is lost.
The ORB feature: the method of extracting ORB features from an image is relatively mature and is implemented in the OpenCV computer vision library; when ORB features are extracted from a picture, the input is the current image and the output is a number of character strings of equal length, each representing one ORB feature.
The backward tracking from the current frame using the target tracking method, with the hand region corresponding to the current image taken as the end of the video, until the tracking is lost, proceeds as follows: let the image of the currently recognized put-down action be Hdown; the current tracking region is the region corresponding to the image Hdown.
While the tracking is not lost:
The first step, extract the ORB feature ORBHdown of the image Hdown; since this feature has already been calculated during the target tracking that follows a recognized grasp action, it does not need to be calculated again here;
The second step, for the images corresponding to all hand regions in the frame preceding the image Hdown, calculate their ORB features to obtain an ORB feature set, and delete the ORB features already chosen by other tracking boxes;
The third step, compare ORBHdown with each value of the ORB feature set by Hamming distance and select the ORB feature with the smallest Hamming distance to ORBHdown as the chosen ORB feature; if the similarity between the chosen ORB feature and ORBHdown is > 0.85, where similarity = (1 − Hamming distance of the two ORB features / ORB feature length), then the hand region corresponding to the chosen ORB feature is the tracking box of the image Hdown in the preceding frame; otherwise, if the similarity < 0.85, the tracking is lost and the algorithm ends.
4. product identification module, during detection: the first step, for the complete videos and grasp-only videos passed from the shopping action recognition module, the input video images of the position obtained by the target detection module for the first frame of the current video are examined backwards from that first frame to find a frame in which the region is not occluded; the image of the region corresponding to that frame is then used as the input of the product identification classifier, giving the recognition result of the current product; the recognition method is: let the image input each time be Goods1; the output GoodsN(Goods1) is a vector; if the igoods-th position of the vector is the largest, the current recognition result is the product at the igoods-th position of the product list; the recognition result is sent to the recognition result processing module;
The search, backwards from the first frame of the current video, through the input video images of the position obtained by the target detection module for that first frame, for a frame in which the region is not occluded, proceeds as follows: let the position obtained by the target detection module for the first frame of the current video be (agoods, bgoods, lgoods, wgoods), and let the first frame of the current video be the icrgs-th frame; the frame under processing is icr = icrgs: the first step, let Taskicr be all detection regions obtained by the target detection module for the icr-th frame; the second step, for each region frame (atask, btask, ltask, wtask) in Taskicr, calculate its distance dgt = (atask − agoods)² + (btask − bgoods)² − (ltask + lgoods)² − (wtask + wgoods)². If no distance < 0 exists, the region corresponding to (agoods, bgoods, lgoods, wgoods) in the icr-th frame is the detected frame in which the region is not occluded, and the algorithm ends; otherwise, if a distance < 0 exists, record d(icr) = the minimum distance in the distance list d and set icr = icr − 1; if icr > 0 the algorithm jumps back to the first step, and if icr ≤ 0 the record with the largest value in the distance list d is selected, and the region (agoods, bgoods, lgoods, wgoods) of the frame corresponding to that record is taken as the detected frame in which the region is not occluded; the algorithm ends.
5. individual identification module, during detection: when a customer enters the supermarket, the image Body1 of the current human body region and the image Face1 of the face within that human body region are obtained through the target detection module; the human body feature BodyN(Body1) and the face feature FaceN(Face1) are then extracted with the human body feature extractor BodyN and the face feature extractor FaceN respectively; BodyN(Body1) is saved in the BodyFtu set, FaceN(Face1) is saved in the FaceFtu set, and the ID information of the current customer is saved; the ID information may be the customer's supermarket account or an unduplicated number randomly assigned when the customer enters the supermarket, and is used to distinguish different customers; whenever a customer enters the supermarket, his or her human body feature and face feature are extracted. When a customer in the supermarket moves a product, the corresponding human body region and face region are searched out according to the complete videos and grasp-only videos passed from the shopping action recognition module, and face recognition or human body recognition with the face feature extractor FaceN and the human body feature extractor BodyN is used to obtain the ID of the customer corresponding to the video currently passed from the shopping action recognition module.
The search for the corresponding human body region and face region according to the complete videos and grasp-only videos passed from the shopping action recognition module, and the use of face recognition or human body recognition with the face feature extractor FaceN and the human body feature extractor BodyN to obtain the ID of the customer corresponding to the video currently passed from the shopping action recognition module, proceed as follows: for the video passed from the shopping action recognition module, the corresponding human body region and face region are looked for starting from the first frame of the video, until the algorithm ends or the last frame of the video has been processed:
The corresponding human body region image Body2 and face region image Face2 are passed through the human body feature extractor BodyN and the face feature extractor FaceN respectively to extract the human body feature BodyN(Body2) and the face feature FaceN(Face2);
Face identification is then used first: the Euclidean distances dFace between FaceN(Face2) and all face features in the FaceFtu set are compared, and the feature in the FaceFtu set with the smallest Euclidean distance is selected; let this feature be FaceN(Face3); if dFace < μface, the current face image is identified as belonging to the customer ID of the face image corresponding to FaceN(Face3), this ID is the ID corresponding to the video action passed from the shopping action recognition module, and the current identification process ends;
If dFace ≥ μface, the current individual cannot be identified by the face identification method alone; the Euclidean distances dBody between BodyN(Body2) and all human body features in the BodyFtu set are then compared, and the feature in the BodyFtu set with the smallest Euclidean distance is selected; let this feature be BodyN(Body3); if dBody + dFace < μface + μbody, the current human body image is identified as belonging to the customer ID of the human body image corresponding to BodyN(Body3), and this ID is the ID corresponding to the video action passed from the shopping action recognition module.
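A sketch of this two-stage nearest-feature decision, assuming the feature stores are dicts from customer ID to feature vectors and that distances are Euclidean; mu_face and mu_body are the thresholds computed at initialization (all names are assumptions).

import numpy as np

def identify(face_feat, body_feat, face_store, body_store, mu_face, mu_body):
    ids = list(face_store.keys())
    d_face = {i: np.linalg.norm(face_feat - face_store[i]) for i in ids}
    best_face = min(d_face, key=d_face.get)
    if d_face[best_face] < mu_face:
        return best_face                           # confident face match
    d_body = {i: np.linalg.norm(body_feat - body_store[i]) for i in ids}
    best_body = min(d_body, key=d_body.get)
    if d_body[best_body] + d_face[best_face] < mu_face + mu_body:
        return best_body                           # combined face + body match
    return None                                    # no reliable identification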
If the ID corresponding to the video action has still not been found after all frames of the video have been processed, then, to avoid misidentifying the shopping subject and producing an incorrect billing entry, the video currently passed from the shopping action recognition module is not processed further.
The search for the corresponding human body region and face region, starting from the first frame of the video passed from the shopping action recognition module, proceeds as follows: the video passed from the shopping action recognition module is processed from its first frame. Suppose the ifRg-th frame is currently being processed; let the position obtained by the target detection module for the video in this frame be (aifRg, bifRg, lifRg, wifRg), let the set of human body regions obtained by the target detection module for this frame be BodyFrameSetifRg, and let the set of face regions be FaceFrameSetifRg. For each human body region (aBFSifRg, bBFSifRg, lBFSifRg, wBFSifRg) in BodyFrameSetifRg, calculate its distance dgbt = (aBFSifRg − aifRg)² + (bBFSifRg − bifRg)² − (lBFSifRg − lifRg)² − (wBFSifRg − wifRg)²; the human body region with the smallest distance among all human body regions is the human body region corresponding to the current video; let the position of the chosen human body region be (aBFS1, bBFS1, lBFS1, wBFS1). For each face region (aFFSifRg, bFFSifRg, lFFSifRg, wFFSifRg) in FaceFrameSetifRg, calculate its distance dgft = (aBFS1 − aFFSifRg)² + (bBFS1 − bFFSifRg)² − (lBFS1 − lFFSifRg)² − (wBFS1 − wFFSifRg)²; the face region with the smallest distance among all face regions is the face region corresponding to the current video.
6. recognition result processing module: during detection, it integrates the recognition results it receives to generate the shopping list corresponding to each customer: first, the customer corresponding to the current shopping information is determined from the customer ID passed by the individual identification module, so the shopping list to be modified is the one numbered ID; next, the product corresponding to the current customer's shopping action is determined from the recognition result passed by the product identification module, say product GoodA; then whether the current shopping action modifies the shopping cart is determined from the recognition result passed by the shopping action recognition module: if the action is identified as taking out an article, product GoodA is added to shopping list ID with quantity increased by 1; if it is identified as putting back an article, product GoodA on shopping list ID is reduced by 1; if it is identified as "taken out and put back" or "taken out but not put back", the shopping list is not changed; and if the recognition result is "suspicious theft", an alarm signal and the location information corresponding to the current video are sent to the supermarket monitoring.
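A small sketch of this bookkeeping step. Shopping lists are assumed to be a dict from customer ID to {product name: quantity}, the action labels follow the five DynamicN classes, and alert() stands in for the alarm signal to supermarket monitoring; all names are assumptions.

from collections import defaultdict

shopping_lists = defaultdict(lambda: defaultdict(int))

def process_result(customer_id, product, action, location, alert):
    cart = shopping_lists[customer_id]
    if action == "take_out":
        cart[product] += 1
    elif action == "put_back":
        cart[product] = max(cart[product] - 1, 0)
    elif action in ("take_out_and_put_back", "taken_out_not_put_back"):
        pass                                   # the list is left unchanged, as described
    elif action == "suspicious_theft":
        alert(location)                        # alarm plus the video's location information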
Embodiment 3:
This embodiment implements the product list update process of a supermarket intelligent vending system.
1. This process uses only the product identification module. When the product list is changed: if a product is deleted, the images of that product are deleted from the product image set of all angles and the corresponding position in the product list is deleted; if a product is added, the product images of all angles of the new product are put into the product image set of all angles and the name of the newly added product is appended at the end of the product list; the product identification classifier is then updated with the new product image set of all angles and the new product list.
The update of the product identification classifier with the new product image set of all angles and the new product list proceeds as follows: the first step, the network structure is modified: for the newly constructed product identification classifier GoodsN′, the structure of GoodsN1′ is unchanged and identical to the GoodsN1 structure at initialization, the first and second layers of the GoodsN2′ structure remain unchanged, and the output vector length of the third layer becomes the length of the updated product list; the second step, the newly constructed product identification classifier GoodsN′ is initialized: its input is the new product image set of all angles; if the input image is Goods3, the output is GoodsN′(Goods3) = GoodsN2′(GoodsN1(Goods3)) and its class is yGoods3, where yGoods3 is a vector whose length equals the number of entries of the updated product list, represented as: if the image Goods3 shows the product in the iGoods-th position of the list, then the iGoods-th position of yGoods3 is 1 and all other positions are 0. The evaluation function of the network is the cross-entropy loss computed between GoodsN′(Goods3) and yGoods3, the convergence direction is minimization, the parameter values in GoodsN1 remain unchanged during the initialization, and the number of iterations is 500.
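A minimal PyTorch-style sketch of this update: the convolutional front end GoodsN1 is frozen, the first two fully connected layers keep their structure, and the final layer is resized to the new list length before retraining; the framework, layer sizes taken from the GoodsN2 description, and all names are assumptions.

import torch.nn as nn

def rebuild_goods_classifier(goods_n1, new_list_len):
    for p in goods_n1.parameters():
        p.requires_grad = False                 # GoodsN1 parameter values stay unchanged
    new_head = nn.Sequential(                   # GoodsN2': same first two layers,
        nn.Linear(2048, 1024), nn.ReLU(),       # 4 x 4 x 128 flattened to 2048
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, new_list_len),          # third layer resized to the new list length
    )
    return nn.Sequential(goods_n1, nn.Flatten(), new_head)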
Claims (7)
1. A supermarket intelligent vending system, characterized in that it takes as input the video images captured by the monitoring cameras fixed in the supermarket and on the shelves; it is composed of the following 6 functional modules: an image pre-processing module, a target detection module, a shopping action recognition module, a product identification module, an individual identification module, and a recognition result processing module; the respective implementation methods of these 6 functional modules are as follows:
The image pre-processing module pre-processes the video images captured by the monitoring cameras: it first denoises the noise that may be contained in the input image, then applies illumination compensation to the denoised image, then applies image enhancement to the illumination-compensated image, and finally passes the enhanced data to the target detection module;
The target detection module performs target detection on the received images, detecting the whole human body region, the face facial region, the hand region and the product region in the current video image; the hand region and the product region are then sent to the shopping action recognition module, the human body image region and the face facial region are sent to the individual identification module, and the product region is passed to the product identification module;
The shopping action recognition module carries out static action recognition on the received hand region information, finds the start frame of the video at which grasping occurs, then keeps identifying the motion until the motion of putting down an article is found as the end frame; the video is then identified with the dynamic action recognition classifier, which identifies whether the current action is taking out an article, putting back an article, taking out and putting back again, having taken out an article without putting it back, or suspicious stealing; the recognition result is then sent to the recognition result processing module, and the videos of only a grasp motion and of only a putting-down motion are sent to the recognition result processing module;
Product identification module identifies the video of the product area received, identify currently by it is mobile be any production
Product, are then sent to recognition result processing module for recognition result, and product identification module can also increase at any time or delete some
Product;
Individual identification module identifies the human face region and human region that receive, believes in conjunction with human face region and human region
Breath, for identification out current individual be in entire supermarket who individual, then recognition result is sent at recognition result
Manage module;
Recognition result processing module integrates the recognition result received, according to individual identification module transmit come customer
ID determines the corresponding customer of current shopping information, according to product identification module transmit come recognition result determine current customer
Shopping acts corresponding product, determined according to the recognition result that shopping action recognition module transmitting comes current shopping act whether
It modifies to shopping cart;To obtain the shopping list of current customer;Suspicious stealing to shopping action recognition module identification
Behavior sounds an alarm.
2. a kind of supermarket's intelligence vending system according to claim 1, it is characterised in that the image pre-processing module
Concrete methods of realizing are as follows:
In the initialization phase, this module does not work. In the detection process: first step, mean denoising is applied to the monitoring image taken by the monitoring camera, giving the denoised monitoring image; second step, illumination compensation is applied to the denoised monitoring image, giving the illumination-compensated image; third step, image enhancement is applied to the illumination-compensated image, and the enhanced data are passed to the target detection module;
The mean denoising of the monitoring image taken by the monitoring camera is performed as follows: let the monitoring image taken by the monitoring camera be Xsrc; since Xsrc is a colour RGB image, it has three components Xsrc-R, Xsrc-G, Xsrc-B. For each component Xsrc', proceed as follows: first set a window of dimension 3 × 3; for each pixel Xsrc'(i, j) of the image Xsrc', the pixel values of the 3 × 3 matrix centred on that point, namely [Xsrc'(i-1,j-1), Xsrc'(i-1,j), Xsrc'(i-1,j+1), Xsrc'(i,j-1), Xsrc'(i,j), Xsrc'(i,j+1), Xsrc'(i+1,j-1), Xsrc'(i+1,j), Xsrc'(i+1,j+1)], are arranged from large to small, and the value ranked in the middle is taken as the value of pixel (i, j) of the denoised image Xsrc'' and assigned to the filtered Xsrc''(i, j). For a boundary point of Xsrc', some pixels of its 3 × 3 window may not exist; in that case only the median of the pixels that do fall within the window is computed, and if the window contains an even number of points, the average of the two middle pixel values is taken as the denoised pixel value and assigned to Xsrc''(i, j). The new image matrix Xsrc'' is thus the denoised image matrix of the current RGB component. After the denoising operation is performed on the three components Xsrc-R, Xsrc-G, Xsrc-B respectively, the resulting components Xsrc-R'', Xsrc-G'', Xsrc-B'' are integrated into a new colour image XDen, which is the image after denoising;
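The denoising just described is, per channel, a 3 × 3 median filter with truncated windows at the borders. A minimal NumPy sketch under that reading follows; the function names are illustrative.

```python
import numpy as np

def median_denoise_channel(x: np.ndarray) -> np.ndarray:
    # x: one colour component (H x W); 3x3 window, boundary windows truncated.
    h, w = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            window = x[max(i - 1, 0):min(i + 2, h), max(j - 1, 0):min(j + 2, w)]
            vals = np.sort(window.ravel())
            n = vals.size
            if n % 2 == 1:
                out[i, j] = vals[n // 2]
            else:  # even number of points: average of the two middle values
                out[i, j] = (int(vals[n // 2 - 1]) + int(vals[n // 2])) // 2
    return out

def median_denoise_rgb(x_src: np.ndarray) -> np.ndarray:
    # Apply the filter to the R, G, B components and re-assemble X_Den.
    return np.stack([median_denoise_channel(x_src[:, :, c]) for c in range(3)], axis=2)
```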
The illumination compensation of the denoised monitoring image is performed as follows: let the denoised monitoring image be XDen; since XDen is a colour RGB image, XDen has three RGB components; illumination compensation is carried out on each component XDen' separately, and the resulting Xcpst' components are integrated into the colour RGB image Xcpst, which is XDen after illumination compensation. The steps for carrying out illumination compensation on each component XDen' are: first step, let XDen' have m rows and n columns; construct XDen'sum and NumDen as matrices of the same m rows and n columns with all initial values 0; the window size is l and the step length is s, where the function min(m, n) takes the minimum of m and n, ⌊·⌋ denotes the integer part, sqrt(l) denotes the square root of l, and l = 1 if l < 1. Second step, let the top-left coordinate of XDen be (1, 1); starting from coordinate (1, 1), determine each candidate frame according to the window size l and the step length s, where a candidate frame is the region bounded by [(a, b), (a+l, b+l)]; perform histogram equalization on the image matrix of XDen' corresponding to the candidate-frame region, giving the equalized image matrix XDen'' of the candidate region [(a, b), (a+l, b+l)]; then, for each element of XDen'sum in the corresponding region [(a, b), (a+l, b+l)], compute XDen'sum(a+iXsum, b+jXsum) = XDen'sum(a+iXsum, b+jXsum) + XDen''(iXsum, jXsum), where (iXsum, jXsum) are integers with 1 ≤ iXsum ≤ l and 1 ≤ jXsum ≤ l, and add 1 to each element of NumDen in the corresponding region [(a, b), (a+l, b+l)]. Finally, compute Xcpst'(iXsumNum, jXsumNum) = XDen'sum(iXsumNum, jXsumNum) / NumDen(iXsumNum, jXsumNum), where (iXsumNum, jXsumNum) ranges over each corresponding point of XDen, thereby obtaining Xcpst' as the illumination compensation of the present component XDen';
Described is that l and step-length s determines each candidate frame according to window size, be the steps include:
If monitoring image is m row n column, (a, b) is the top left co-ordinate in selected region, and (a+l, b+l) is the right side of selection area
Lower angular coordinate, the region are indicated that the initial value of (a, b) is (1,1) by [(a, b), (a+l, b+l)];
As a+l≤m:
B=1;
As b+l≤n:
Selected region is [(a, b), (a+l, b+l)];
B=b+s;
Interior loop terminates;
A=a+s;
Outer loop terminates;
In the above process, selected region [(a, b), (a+l, b+l)] is candidate frame every time;
The described histogram equalization of the image matrix of XDen' corresponding to the candidate-frame region: let the candidate-frame region be the area bounded by [(a, b), (a+l, b+l)], and let XDen'' be the image information of XDen' in the region [(a, b), (a+l, b+l)]; the steps are: first step, construct a vector I, where I(iI) is the number of pixels in XDen'' whose value equals iI, with 0 ≤ iI ≤ 255; second step, compute the vector I', where I'(iI) is obtained from the cumulative sum of I normalised to the range 0 to 255; third step, for each point (iXDen, jXDen) of XDen'' with pixel value XDen''(iXDen, jXDen), compute XDen''(iXDen, jXDen) = I'(XDen''(iXDen, jXDen)); the histogram equalization process ends after the values of all pixel points of XDen'' have been computed and changed, and the result saved in XDen'' is the result of the histogram equalization;
The described image enhancement of the image after illumination compensation: let the image after illumination compensation be Xcpst, whose RGB channels are respectively XcpstR, XcpstG, XcpstB, and let the image obtained after image enhancement of Xcpst be Xenh. The steps of the image enhancement are: first step, for each of the channels XcpstR, XcpstG, XcpstB of Xcpst, compute the image obtained by blurring it at the specified scale; second step, construct matrices LXenhR, LXenhG, LXenhB with the same dimensions as XcpstR; for the R channel of the RGB channels of image Xcpst, compute LXenhR(i, j) = log(XcpstR(i, j)) - LXcpstR(i, j), where (i, j) ranges over all points of the image matrix; for the G channel and the B channel of the RGB channels of image Xcpst, LXenhG and LXenhB are obtained with the same algorithm as for the R channel; third step, for the R channel of the RGB channels of image Xcpst, compute the mean MeanR and the mean square deviation VarR of all point values in LXenhR, compute MinR = MeanR - 2 × VarR and MaxR = MeanR + 2 × VarR, and then compute XenhR(i, j) = Fix((LXenhR(i, j) - MinR)/(MaxR - MinR) × 255), where Fix denotes taking the integer part, values < 0 are assigned 0 and values > 255 are assigned 255; for the G channel and the B channel of the RGB channels the same algorithm as for the R channel gives XenhG and XenhB, and XenhR, XenhG, XenhB belonging to the RGB channels are integrated into a colour image Xenh;
The described computation, for each of the channels XcpstR, XcpstG, XcpstB of Xcpst, of the image obtained by blurring it at the specified scale: for the R channel XcpstR of the RGB channels, the steps are: first step, define the Gaussian function G(x, y, σ) = k × exp(-(x² + y²)/σ²), where σ is the scale parameter and k = 1/∫∫G(x, y)dxdy; then for each point XcpstR(i, j) of XcpstR compute LXcpstR(i, j) = log(Fix((XcpstR ⊗ G(x, y, σ))(i, j))), where ⊗ denotes the convolution operation; for points whose distance to the boundary is less than the scale σ, only the convolution of XcpstR with the corresponding part of G(x, y, σ) is computed; Fix() denotes taking the integer part, values < 0 are assigned 0 and values > 255 are assigned 255; for the G channel and the B channel of the RGB channels the same algorithm as for the R channel is used to obtain LXcpstG and LXcpstB.
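The enhancement described above behaves like a single-scale Retinex: the log of each channel minus the log of its Gaussian-blurred version, rescaled around the mean ± 2 × deviation and clamped to [0, 255]. A minimal per-channel sketch follows; cv2.GaussianBlur stands in for the convolution with G(x, y, σ), and the +1 guarding log(0) is an added assumption.

```python
import numpy as np
import cv2

def enhance_channel(x: np.ndarray, sigma: float = 30.0) -> np.ndarray:
    # x: one illumination-compensated channel (uint8).
    xf = x.astype(np.float64) + 1.0               # small offset so log is defined
    blurred = cv2.GaussianBlur(xf, ksize=(0, 0), sigmaX=sigma)
    lx = np.log(xf) - np.log(blurred)             # LX_enh for this channel
    mean, dev = lx.mean(), lx.std()
    lo, hi = mean - 2 * dev, mean + 2 * dev
    out = np.trunc((lx - lo) / (hi - lo) * 255)   # Fix(): integer part
    return np.clip(out, 0, 255).astype(np.uint8)

def enhance_rgb(x_cpst: np.ndarray) -> np.ndarray:
    return np.stack([enhance_channel(x_cpst[:, :, c]) for c in range(3)], axis=2)
```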
3. a kind of supermarket's intelligence vending system according to claim 1, it is characterised in that the module of target detection
Concrete methods of realizing are as follows:
During initialization, using with having demarcated human body image region, face facial area, hand region and product area
Image to algorithm of target detection carry out parameter initialization;In the detection process, receive what image pre-processing module transmitted
Then image is handled it, carry out target detection using algorithm of target detection to each frame image, obtain present image
Human body image region, face facial area, hand region and product area, are then sent to purchase for hand region and product area
Human body image region and face facial area are sent to individual identification module, product area are transmitted by object action recognition module
Give product identification module;
The described parameter initialization of the target detection algorithm using images in which the human body image regions, face facial regions, hand regions and product regions have been calibrated proceeds in the following steps: first step, construct the feature extraction depth network; second step, construct the region selection network; third step, for each image X in the database used to construct the feature extraction depth network and each corresponding manually calibrated human region, pass them through the ROI layer, whose input is the image X and the region and whose output is of 7 × 7 × 512 dimensions; fourth step, construct the coordinate refining network;
The construction feature extracts depth network, which is deep learning network structure, network structure are as follows: first layer:
Convolutional layer, inputting is 768 × 1024 × 3, and exporting is 768 × 1024 × 64, port number channels=64;The second layer: convolution
Layer, inputting is 768 × 1024 × 64, and exporting is 768 × 1024 × 64, port number channels=64;Third layer: pond layer,
Input first layer output 768 × 1024 × 64 is connected in third dimension with third layer output 768 × 1024 × 64, exports
It is 384 × 512 × 128;4th layer: convolutional layer, inputting is 384 × 512 × 128, and exporting is 384 × 512 × 128, port number
Channels=128;Layer 5: convolutional layer, inputting is 384 × 512 × 128, and exporting is 384 × 512 × 128, port number
Channels=128;Layer 6: pond layer inputs the 4th layer of output 384 × 512 × 128 and layer 5 384 × 512 × 128
It is connected in third dimension, exporting is 192 × 256 × 256;Layer 7: convolutional layer, inputting is 192 × 256 × 256, defeated
It is out 192 × 256 × 256, port number channels=256;8th layer: convolutional layer, inputting is 192 × 256 × 256, output
It is 192 × 256 × 256, port number channels=256;9th layer: convolutional layer, inputting is 192 × 256 × 256, exports and is
192 × 256 × 256, port number channels=256;Tenth layer: pond layer inputs as layer 7 output 192 × 256 × 256
It is connected in third dimension with the 9th layer 192 × 256 × 256, exporting is 96 × 128 × 512;Eleventh floor: convolutional layer,
Input is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;Floor 12: convolutional layer, it is defeated
Entering is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;13rd layer: convolutional layer, input
It is 96 × 128 × 512, exporting is 96 × 128 × 512, port number channels=512;14th layer: pond layer inputs and is
Eleventh floor output 96 × 128 × 512 is connected in third dimension with the 13rd layer 96 × 128 × 512, export as 48 ×
64×1024;15th layer: convolutional layer, inputting is 48 × 64 × 1024, and exporting is 48 × 64 × 512, port number channels
=512;16th layer: convolutional layer, inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number channels=
512;17th layer: convolutional layer, inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number channels=512;
18th layer: pond layer inputs and exports 48 × 64 × 512 and the 17th layer 48 × 64 × 512 in third dimension for the 15th layer
It is connected on degree, exporting is 48 × 64 × 1024;19th layer: convolutional layer, inputting is 48 × 64 × 1024, and exporting is 48 × 64
× 256, port number channels=256;20th layer: pond layer, inputting is 48 × 64 × 256, exporting is 24 × 32 ×
256;Second eleventh floor: convolutional layer, inputting is 24 × 32 × 256, and exporting is 24 × 32 × 256, port number channels=
256;Second Floor 12: pond layer, inputting is 24 × 32 × 256, and exporting is 12 × 16 × 256;23rd layer: convolutional layer,
Input is 12 × 16 × 256, and exporting is 12 × 16 × 128, port number channels=128;24th layer: pond layer, it is defeated
Entering is 12 × 16 × 128, and exporting is 6 × 8 × 128;25th layer: full articulamentum, first by 6 × 8 × 128 dimensions of input
Data be launched into the vectors of 6144 dimensions, then input into full articulamentum, output vector length is 768, and activation primitive is
Relu activation primitive;26th layer: full articulamentum, input vector length are 768, and output vector length is 96, activation primitive
For relu activation primitive;27th layer: full articulamentum, input vector length are 96, and output vector length is 2, activation primitive
For soft-max activation primitive;The parameter of all convolutional layers is size=3 convolution kernel kernel, and step-length stride=(1,1) swashs
Function living is relu activation primitive;All pond layers are maximum pond layer, and parameter is pond section size kernel_size
=2, step-length stride=(2,2);Let this depth network be Fconv27; for a colour image X, the feature map set obtained through the depth network is denoted Fconv27(X); the evaluation function of the network computes the cross-entropy loss of (Fconv27(X)-y), the convergence direction is minimization, and y is the classification corresponding to the input; the database contains naturally acquired images of pedestrians and non-pedestrians, every image is a colour image of 768 × 1024 dimensions, the images are divided into two classes according to whether the image contains a pedestrian, and the number of iterations is 2000; after training, the first layer through the 17th layer are taken as the feature extraction depth network Fconv, and the output obtained by passing a colour image X through this depth network is denoted Fconv(X);
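Each stage of the Fconv network above follows the same pattern: two 3 × 3 convolutions, concatenation of their outputs along the channel dimension, then 2 × 2 max pooling. A minimal PyTorch sketch of the first stage (layers one to three) is given below; the later stages repeat the pattern with the channel counts listed in the claim, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class FconvStage(nn.Module):
    """One stage: conv -> conv -> concat(conv1_out, conv2_out) -> 2x2 max pool."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.relu(self.conv1(x))                # e.g. 768 x 1024 x 64
        b = torch.relu(self.conv2(a))                # e.g. 768 x 1024 x 64
        return self.pool(torch.cat([a, b], dim=1))   # channels doubled, H and W halved

# First stage of Fconv: 768 x 1024 x 3 -> 384 x 512 x 128
stage1 = FconvStage(3, 64)
feat = stage1(torch.zeros(1, 3, 768, 1024))
print(feat.shape)  # torch.Size([1, 128, 384, 512])
```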
The structure realm selects network, receives Fconv depth network and extracts 512 48 × 64 feature set of graphs Fconv
(X), then the first step obtains Conv by convolutional layer1(Fconv (X)), the parameter of the convolutional layer are as follows: convolution kernel kernel size
=1, step-length stride=(1,1), inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number channels=
512;Then by Conv1(Fconv (X)) is separately input to two convolutional layer (Conv2-1And Conv2-2), Conv2-1Structure are as follows:
Input is 48 × 64 × 512, and exporting is 48 × 64 × 18, port number channels=18, and the output that this layer obtains is Conv2-1
(Conv1(Fconv (X))), then softmax (Conv is obtained using activation primitive softmax to the output2-1(Conv1(Fconv
(X))));Conv2-2Structure are as follows: inputting is 48 × 64 × 512, and exporting is 48 × 64 × 36, port number channels=36;
There are two loss functions for this network: the first error function loss1 computes the softmax error of Wshad-cls(X)⊙(Conv2-1(Conv1(Fconv(X)))-Wcls(X)), and the second error function loss2 computes the smooth L1 error of Wshad-reg(X)⊙(Conv2-2(Conv1(Fconv(X)))-Wreg(X)); the loss function of the region selection network is loss1/sum(Wcls(X))+loss2/sum(Wcls(X)), where sum() denotes the sum of all elements of a matrix and the convergence direction is minimization; Wcls(X) and Wreg(X) are respectively the positive and negative sample information corresponding to database image X, ⊙ denotes multiplication of matrices element-wise at corresponding positions, and Wshad-cls(X) and Wshad-reg(X) are masks whose role is to select for training only the parts where the weight is 1, so as to avoid an excessive gap between the numbers of positive and negative samples; Wshad-cls(X) and Wshad-reg(X) are regenerated at each iteration, and the algorithm iterates 1000 times;
The construction feature extracts database used in depth network, for each image in database, first
Step: each human body image-region, face facial area, hand region and product area are manually demarcated, if it is in input picture
Centre coordinate is (abas_tr,bbas_tr), centre coordinate is l in the distance of fore-and-aft distance upper and lower side framebas_tr, centre coordinate is in cross
It is w to the distance apart from left and right side framebas_tr, then it corresponds to Conv1Position be that center coordinate isHalf is a length ofHalf-breadth is Indicate round numbers part;The
Two steps: positive negative sample is generated at random;
The positive negative sample of generation at random, method are as follows: the first step constructs 9 regional frames, second step, for database
Each image XtrIf WclsFor 48 × 64 × 18 dimensions, WregFor 48 × 64 × 36 dimensions, all initial values are 0, to Wcls
And WregIt is filled;
Described 9 regional frames of construction, this 9 regional frames are respectively as follows: Ro1(xRo,yRo)=(xRo,yRo, 64,64), Ro2(xRo,
yRo)=(xRo,yRo,45,90),Ro3(xRo,yRo)=(xRo,yRo,90,45),Ro4(xRo,yRo)=(xRo,yRo, 128,128),
Ro5(xRo,yRo)=(xRo,yRo,90,180),Ro6(xRo,yRo)=(xRo,yRo,180,90),Ro7(xRo,yRo)=(xRo,yRo,
256,256), Ro8(xRo,yRo)=(xRo,yRo,360,180),Ro9(xRo,yRo)=(xRo,yRo, 180,360), for each
Region unit, Roi(xRo,yRo) indicate for ith zone frame, the centre coordinate (x of current region frameRo,yRo), third position indicates
Pixel distance of the central point apart from upper and lower side frame, the 4th indicates pixel distance of the central point apart from left and right side frame, the value of i from
1 to 9;
It is described to WclsAnd WregIt is filled, method are as follows:
For the body compartments that each is manually demarcated, if it is (a in the centre coordinate of input picturebas_tr,bbas_tr), center
Coordinate is l in the distance of fore-and-aft distance upper and lower side framebas_tr, centre coordinate is w in the distance of lateral distance left and right side framebas_tr,
Then it corresponds to Conv1Position be that center coordinate isHalf is a length ofHalf-breadth
For
For the upper left cornerBottom right angular coordinateEach point in the section surrounded
(xCtr,yCtr):
For i value from 1 to 9:
For point (xCtr,yCtr), it is upper left angle point (16 (x in the mapping range of database imagesCtr-1)+1,16(yCtr-1)+
1) bottom right angle point (16xCtr,16yCtr) 16 × 16 sections that are surrounded, for each point (x in the sectionOtr,yOtr):
Calculate (xOtr,yOtr) corresponding to region Roi(xOtr,yOtr) with current manual calibration section coincidence factor;
Select the point (xIoUMax,yIoUMax) with the highest coincidence factor in the current 16 × 16 section; if the coincidence factor > 0.7, then Wcls(xCtr,yCtr,2i-1)=1, Wcls(xCtr,yCtr,2i)=0, and the point is a positive sample, with Wreg(xCtr,yCtr,4i-3)=(xOtr-16xCtr+8)/8, Wreg(xCtr,yCtr,4i-2)=(yOtr-16yCtr+8)/8, Wreg(xCtr,yCtr,4i-1)=Down1(lbas_tr / the third position of Roi), Wreg(xCtr,yCtr,4i)=Down1(wbas_tr / the fourth position of Roi), where Down1() means the value is taken as 1 if it is greater than 1; if the coincidence factor < 0.3, then Wcls(xCtr,yCtr,2i-1)=0, Wcls(xCtr,yCtr,2i)=1; otherwise Wcls(xCtr,yCtr,
2i-1)=- 1, Wcls(xCtr,yCtr, 2i)=- 1;
If the human region of current manual's calibration does not have the Ro of coincidence factor > 0.6i(xOtr,yOtr), then select the highest Ro of coincidence factori
(xOtr,yOtr) to WclsAnd WregAssignment, assignment method are identical as the assignment method of coincidence factor > 0.7;
Calculating (the xOtr,yOtr) corresponding to region Roi(xOtr,yOtr) with current manual calibration section coincidence factor, side
Method are as follows: set the body compartments that manually demarcate in the centre coordinate of input picture as (abas_tr,bbas_tr), centre coordinate longitudinal direction away from
It is l with a distance from upper and lower side framebas_tr, centre coordinate is w in the distance of lateral distance left and right side framebas_trIf Roi(xOtr,
yOtr) third position be lOtr, the 4th is wOtrIf meeting | xOtr-abas_tr|≤lOtr+lbas_tr- 1 and | yOtr-bbas_tr|≤
wOtr+wbas_tr-1, there is an overlapping region, and overlapping region=(lOtr+lbas_tr-1-|xOtr-abas_tr|)×(wOtr+wbas_tr-1-|yOtr-bbas_tr|); otherwise overlapping region=0;Calculate whole region=(2lOtr-1)×(2wOtr-1)+(2lbas_tr-1)×(2wbas_tr-1)-overlapping region;To obtain coincidence factor=overlapping region/whole region, where | | denotes the absolute value;
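The coincidence factor is an overlap ratio computed from box centres and half-sizes rather than corner coordinates. A direct Python transcription of the formulas above (with the whole-region term written symmetrically in the two boxes) is:

```python
def coincidence_factor(a1, b1, l1, w1, a2, b2, l2, w2):
    """Overlap ratio of two boxes given as (centre_x, centre_y, half_len, half_wid)."""
    if abs(a1 - a2) <= l1 + l2 - 1 and abs(b1 - b2) <= w1 + w2 - 1:
        overlap = (l1 + l2 - 1 - abs(a1 - a2)) * (w1 + w2 - 1 - abs(b1 - b2))
    else:
        overlap = 0
    whole = (2 * l1 - 1) * (2 * w1 - 1) + (2 * l2 - 1) * (2 * w2 - 1) - overlap
    return overlap / whole
```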
The Wshad-cls(X) and Wshad-reg(X), building method are as follows: for image X, corresponding positive and negative sample information
For Wcls(X) and Wreg(X), the first step constructs Wshad-cls(X) with and Wshad-reg(X), Wshad-cls(X) and Wcls(X) dimension phase
Together, Wshad-reg(X) and Wreg(X) dimension is identical;Second step records the information of all positive samples, for i=1 to 9, if Wcls(X)
(a, b, 2i-1)=1, then Wshad-cls(X) (a, b, 2i-1)=1, Wshad-cls(X) (a, b, 2i)=1, Wshad-reg(X)(a,b,
4i-3)=1, Wshad-reg(X) (a, b, 4i-2)=1, Wshad-reg(X) (a, b, 4i-1)=1, Wshad-reg(X) (a, b, 4i)=1,
Positive sample has selected altogether sum (Wshad-cls(X)) a, sum () indicates to sum to all elements of matrix, if sum
(Wshad-cls(X)) > 256, 256 positive samples are retained at random;Third step, randomly choose negative samples: randomly choose (a, b, i); if Wcls(X)(a, b, 2i)=1, then Wshad-cls(X)(a, b, 2i-1)=1, Wshad-cls(X)(a, b, 2i)=1, Wshad-reg(X)(a, b, 4i-3)=1, Wshad-reg(X)(a, b, 4i-2)=1, Wshad-reg(X)(a, b, 4i-1)=1, Wshad-reg(X)(a, b, 4i)=1; the number of negative samples chosen is 256-sum(Wshad-cls(X)); if the negative samples are insufficient to reach 256-sum(Wshad-cls(X)) but 20 consecutive generations of random numbers (a, b, i) all fail to yield a negative sample, the algorithm terminates;
The ROI layer, input are image X and regionIts method are as follows: for image X
By feature extraction depth network Fconv it is obtained output Fconv (X) dimension be 48 × 64 × 512, for each 48
× 64 matrix VsROI_IInformation (512 matrixes altogether), extract VROI_IThe upper left corner in matrix The lower right cornerIt is surrounded
Region,Indicate round numbers part;Output is roiI(X) dimension is 7 × 7, then step-length
For iROI=1: to 7:
For jROI=1 to 7:
Construct section
roiI(X)(iROI,jROIThe value of maximum point in)=section;
When 512 48 × 64 matrix whole after treatments, output splicing is obtained into the output of 7 × 7 × 512 dimensionsParameter is indicated for image X, in regional frame
ROI in range;
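The ROI layer therefore reduces an arbitrary region of the 48 × 64 × 512 feature maps to a fixed 7 × 7 × 512 output by taking the maximum of each grid cell per channel. A small NumPy sketch follows; the way the region is split into a 7 × 7 grid is an illustrative choice where the claim's exact step formula is not legible above.

```python
import numpy as np

def roi_pool(feature_maps: np.ndarray, x0: int, y0: int, x1: int, y1: int) -> np.ndarray:
    """feature_maps: (512, 48, 64); (x0, y0)-(x1, y1): region on the feature map.
    Returns a 7 x 7 x 512 output (max value of each grid cell per channel)."""
    c = feature_maps.shape[0]
    out = np.zeros((7, 7, c), dtype=feature_maps.dtype)
    xs = np.linspace(x0, x1 + 1, 8).astype(int)   # 7 intervals along rows
    ys = np.linspace(y0, y1 + 1, 8).astype(int)   # 7 intervals along columns
    for i in range(7):
        for j in range(7):
            cell = feature_maps[:, xs[i]:max(xs[i + 1], xs[i] + 1),
                                   ys[j]:max(ys[j + 1], ys[j] + 1)]
            out[i, j, :] = cell.reshape(c, -1).max(axis=1)
    return out
```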
The building coordinate refines network, method are as follows: the first step, extending database: extended method is in database
Each image X and the corresponding each region manually demarcatedIts corresponding ROI isThe BClass=[1,0,0,0,0] if current interval is human body image-region,
BBox=[0,0,0,0], the BClass=[0,1,0,0,0] if current interval is people's face facial area, BBox=[0,0,0,
0], BClass=[0,0,1,0,0], BBox=[0,0,0,0] if current interval is hand region, if current interval is product
Region then [0,0,0,1,0] BClass=, BBox=[0,0,0,0];It is random to generate value random number a between -1 to 1rand,
brand,lrand,wrand, to obtain new section Indicate round numbers part, the BBox=[a in the sectionrand,brand,lrand,
wrand], if new section withCoincidence factor > 0.7 item BClass=current region
BClass, if new section withCoincidence factor < 0.3, then [0,0,0,0,1] BClass=,
The two is not satisfied, then not assignment;Each section at most generates 10 positive sample regions, if generating Num1A positive sample region,
Then generate Num1+ 1 negative sample region, if the inadequate Num in negative sample region1+ 1, then expand arand,brand,lrand,wrandModel
It encloses, until finding enough negative sample numbers;Second step, building coordinate refine network: for each in database
Image X and the corresponding each human region manually demarcatedIts corresponding ROI isThe ROI of 7 × 7 × 512 dimensions will be launched into 25088 dimensional vectors, then passed through
Cross two full articulamentum Fc2, obtain output Fc2(ROI), then by Fc2(ROI) micro- by classification layer FClass and section respectively
Layer FBBox is adjusted, output FClass (Fc is obtained2And FBBox (Fc (ROI))2(ROI)), classification layer FClass is full articulamentum,
Input vector length is 512, and output vector length is 5, and it is full articulamentum that layer FBBox is finely tuned in section, and input vector length is
512, output vector length is 4;There are two the loss functions of the network: first error function loss1 is to FClass (Fc2
(ROI))-BClass calculates softmax error, and second error function loss2 is to (FBBox (Fc2(ROI))-BBox) meter
Euclidean distance error is calculated, then whole loss function=loss1+loss2 of the refining network, algorithm iteration process are as follows: change first
1000 convergence error function loss2 of generation, then 1000 convergence whole loss functions of iteration;
The full articulamentum Fc of described two2, structure are as follows: first layer: full articulamentum, input vector length be 25088, export to
Measuring length is 4096, and activation primitive is relu activation primitive;The second layer: full articulamentum, input vector length be 4096, export to
Measuring length is 512, and activation primitive is relu activation primitive;
Described carries out target detection using algorithm of target detection to each frame image, obtains the human body image area of present image
Domain, face facial area, hand region and product area, the steps include:
The first step, by input picture XcpstIt is divided into the subgraph of 768 × 1024 dimensions;
Second step, for each subgraph Xs:
2.1st step is converted using the feature extraction depth network Fconv constructed in initialization, obtains 512 feature
Set of graphs Fconv (Xs);
2.2nd step, to Fconv (Xs) using area selection network in first layer Conv1, second layer Conv2-1+ softmax activation
Function and Conv2-2Into transformation, output softmax (Conv is respectively obtained2-1(Conv1(Fconv(Xs)))) and Conv2-2(Conv1
(Fconv(Xs))), all preliminary candidate sections in the section are then obtained according to output valve;
2.3rd step, for all preliminary candidate sections of all subgraphs of current frame image:
2.3.1 step, is chosen according to the score size in its current candidate region, chooses maximum 50 preliminary candidate sections
As candidate region;
2.3.2 step adjusts candidate section of crossing the border all in candidate section set, then weeds out and is overlapped in candidate section
Frame, to obtain final candidate section;
2.3.3 step, by subgraph XsROI layers are input to each final candidate section, corresponding ROI output is obtained, if currently
Final candidate section be (aBB(1),bBB(2),lBB(3),wBB(4)) FBBox (Fc, is then calculated2(ROI)) obtain four it is defeated
(Out outBB(1),OutBB(2),OutBB(3),OutBB(4)) to obtain updated coordinate (aBB(1)+8×OutBB(1),bBB
(2)+8×OutBB(2),lBB(3)+8×OutBB(3),wBB(4)+8×OutBB(4));Then FClass (Fc is calculated2(ROI))
To output, current interval is human body image-region if exporting first maximum, if output second maximum current interval is
Face facial area, current interval is hand region if exporting third position maximum, if the 4th maximum current interval of output
For product area, current interval, which is negative, if exporting the 5th maximum sample areas and deletes the final candidate section;
Third step, the coordinate in the final candidate section after updating the refining of all subgraphs, the method for update is to set current candidate area
The coordinate in domain is (TLx, TLy, RBx, RBy), and the top left co-ordinate of corresponding subgraph is (Seasub,Sebsub), it is updated
Coordinate is (TLx+Seasub-1,TLy+Sebsub-1,RBx,RBy);
It is described by input picture XcpstIt is divided into the subgraph of 768 × 1024 dimensions, the steps include: the step-length for setting segmentation as 384 Hes
512, if window size is m row n column, (asub,bsub) be selected region top left co-ordinate, the initial value of (a, b) be (1,
1);Work as asubWhen < m:
bsub=1;
Work as bsubWhen < n:
Selected region is [(asub,bsub),(asub+384,bsub+ 512)], by input picture XcpstFigure corresponding to the upper section
It is copied to as the information in region in new subgraph, and is attached to top left co-ordinate (asub,bsub) it is used as location information;
If selection area exceeds input picture XcpstSection then will exceed the corresponding equal assignment of rgb pixel value of pixel in range
It is 0;
bsub=bsub+512;
Interior loop terminates;
asub=asub+384;
Outer loop terminates;
Described obtains all preliminary candidate sections in the section, method according to output valve are as follows: step 1: for
softmax(Conv2-1(Conv1(Fconv(Xs)))) its output be 48 × 64 × 18, for Conv2-2(Conv1(Fconv
(Xs))), output is 48 × 64 × 36, for any point (x, y) on 48 × 64 dimension spaces, softmax (Conv2-1
(Conv1(Fconv(Xs)))) (x, y) be 18 dimensional vector II, Conv2-2(Conv1(Fconv(Xs))) (x, y) be 36 dimensional vectors
IIII, if II (2i-1) > II (2i), for i value from 1 to 9, lOtrFor Roi (xOtr,yOtr) third position, wOtrFor Roi
(xOtr,yOtr) the 4th, then preliminary candidate section be [II (2i-1), (8 × IIII (4i-3)+x, 8 × IIII (4i-2)+y,
lOtr×IIII(4i-1),wOtr× IIII (4i))], wherein the score in first II (2i-1) expression current candidate region, second
Position (8 × IIII (4i-3)+x, 8 × IIII (4i-2)+y, IIII (4i-1), IIII (4i)) indicates the center in current candidate section
Point is (8 × IIII (4i-3)+x, 8 × IIII (4i-2)+y), and the long half-breadth of the half of candidate frame is respectively lOtr× IIII (4i-1) and
wOtr×IIII(4i));
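At every position (x, y) of the 48 × 64 output, the 18-dimensional score vector II and the 36-dimensional regression vector IIII are decoded into up to nine candidate boxes as just described. A minimal sketch of that decoding, with the nine regional-frame half-sizes listed earlier, is:

```python
import numpy as np

RO_SIZES = [(64, 64), (45, 90), (90, 45), (128, 128), (90, 180),
            (180, 90), (256, 256), (360, 180), (180, 360)]  # (half_len, half_wid)

def decode_candidates(scores: np.ndarray, regs: np.ndarray):
    """scores: (48, 64, 18) softmax output; regs: (48, 64, 36) regression output.
    Returns a list of (score, cx, cy, half_len, half_wid) preliminary candidates."""
    candidates = []
    for x in range(scores.shape[0]):
        for y in range(scores.shape[1]):
            ii, iiii = scores[x, y], regs[x, y]
            for i in range(1, 10):
                if ii[2 * i - 2] > ii[2 * i - 1]:   # II(2i-1) > II(2i), 0-based here
                    l_o, w_o = RO_SIZES[i - 1]
                    candidates.append((
                        float(ii[2 * i - 2]),        # score of the candidate
                        8 * iiii[4 * i - 4] + x,     # centre x
                        8 * iiii[4 * i - 3] + y,     # centre y
                        l_o * iiii[4 * i - 2],       # half length
                        w_o * iiii[4 * i - 1],       # half width
                    ))
    return candidates
```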
All candidate sections of crossing the border, method in the candidate section set of the adjustment are as follows: it sets monitoring image and is arranged as m row n, it is right
In each candidate section, if its [(ach,bch)], the long half-breadth of the half of candidate frame is respectively lchAnd wchIf ach+lch> m, thenThen its a is updatedch=a 'ch, lch=l 'ch;
If bch+wch> n, thenThen its b is updatedch
=b 'ch, wch=w 'ch;
Described weeds out the frame being overlapped in candidate section, the steps include:
If candidate section set is not sky:
The maximum candidate section i of score is taken out from the set of candidate sectionout:
Calculate candidate section ioutWith candidate section set each of candidate section icCoincidence factor, if coincidence factor > 0.7,
Gather from candidate section and deletes candidate section ic;
By candidate section ioutIt is put into the candidate section set of output;
When candidate section set is empty, exporting candidate section contained in candidate section set is to weed out in candidate section
Obtained candidate section set after the frame of overlapping;
The calculating candidate section ioutWith candidate section set each of candidate section icCoincidence factor, method are as follows:
If candidate section icCoordinate section centered on point [(aic,bic)], the long half-breadth of the half of candidate frame is respectively licAnd wic, candidate regions
Between icCoordinate section centered on point [(aiout,bicout)], the long half-breadth of the half of candidate frame is respectively lioutAnd wiout;Calculate xA=
max(aic,aiout);YA=max (bic,biout);XB=min (lic,liout), yB=min (wic,wiout);If meeting | aic-
aiout|≤lic+liout- 1 and | bic-biout|≤wic+wiout- 1, illustrate that there are overlapping region, overlapping regions=(lic+liout-
1-|aic-aiout|)×(wic+wiout-1-|bic-biout|), otherwise overlapping region=0;Calculate whole region=(2lic-1)×
(2wic-1)+(2liout-1)×(2wiout- 1)-overlapping region;To obtain coincidence factor=overlapping region/whole region.
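The procedure for weeding out overlapping frames is a greedy non-maximum suppression driven by the coincidence factor. A compact sketch, reusing coincidence_factor from the earlier example and the 0.7 threshold stated above, is:

```python
def nms(candidates, threshold: float = 0.7):
    """candidates: list of (score, cx, cy, half_len, half_wid); greedy suppression."""
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)                   # highest-score candidate
        kept.append(best)
        remaining = [c for c in remaining
                     if coincidence_factor(best[1], best[2], best[3], best[4],
                                           c[1], c[2], c[3], c[4]) <= threshold]
    return kept
```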
4. a kind of supermarket's intelligence vending system according to claim 1, it is characterised in that the shopping action recognition mould
The concrete methods of realizing of block are as follows:
During initialization, the static action recognition classifier is first initialized using standard hand motion images, so that the static action recognition classifier can recognize the grasping and putting-down motions of a hand; the dynamic action recognition classifier is then initialized using hand motion videos, so that the dynamic action recognition classifier can recognize taking out an article, putting back an article, taking out and putting back again, having taken out an article without putting it back, and suspicious stealing. In the detection process: first step, each received hand region information is identified using the static action recognition classifier; the recognition method is: let the image input each time be Handp1; the output StaticN(Handp1) is a 3-dimensional vector, and the action is identified as grasping if the first position is the maximum, as putting down if the second position is the maximum, and as other if the third position is the maximum. Second step, after a grasp motion is recognized, target tracking is performed on the region corresponding to the current grasp motion; when the recognition result of the static action recognition classifier for the tracking box of the current hand region in the next frame is a putting-down motion, the target tracking ends, and the video from the frame in which the grasp motion was recognized to the frame in which the putting-down motion was recognized is taken as one video; this continuous video of the hand motion is marked as a complete video. If tracking is lost during tracking, the video from the frame in which the grasp motion was recognized to the image before tracking was lost is taken as one video and marked as a video of only a grasp motion, thereby obtaining the video of only a grasp motion. When a putting-down motion is recognized but that motion is not in an image obtained by target tracking, which shows that the grasp motion corresponding to it was lost, the hand region corresponding to the current image is taken as the end of a video, and the target tracking method is used to track backwards from the current frame until tracking is lost; the next frame after the lost frame is then taken as the start frame of the video, and the video is marked as a video of only a putting-down motion. Third step, the complete videos obtained in the second step are identified using the dynamic action recognition classifier; the recognition method is: let the images input each time be Handv1; the output DynamicN(Handv1) is a 5-dimensional vector, and the action is identified as taking out an article if the first position is the maximum, as putting back an article if the second position is the maximum, as taking out and putting back again if the third position is the maximum, as having taken out an article without putting it back if the fourth position is the maximum, and as a suspicious stealing motion if the fifth position is the maximum. The recognition result is then sent to the recognition result processing module, the videos of only a grasp motion and the videos of only a putting-down motion are sent to the recognition result processing module, and the complete videos and the videos of only a grasp motion are sent to the product identification module and the individual identification module;
The initialization of the static action recognition classifier using standard hand motion images is as follows: first step, arrange the video data: firstly, a large number of videos of people shopping in supermarkets are chosen; these videos include the motions of taking out a product, putting back an article, taking out and putting back again, having taken out an article without putting it back, and suspicious stealing; each video clip is cut manually, taking the frame in which the hand touches the commodity as the start frame and the frame in which the hand leaves the commodity as the end frame; the target detection module is then used on each frame of the video to extract its hand region, each frame image of the hand region is scaled to a 256 × 256 colour image, the scaled video is put into the hand motion video set, and the video is marked as one of taking out an article, putting back an article, taking out and putting back again, having taken out an article without putting it back, or a suspicious stealing motion; for each video whose classification is taking out an article, putting back an article, taking out and putting back again, or having taken out an article without putting it back, the first frame of the video is put into the hand motion image set and labelled as a grasp motion, the last frame of the video is put into the hand motion image set and labelled as a putting-down motion, and one frame other than the first and last frames is taken at random from the video, put into the hand motion image set and labelled as other; the hand motion video set and the hand motion image set are thus obtained;Second step, construct the static action recognition classifier StaticN;Third step, initialize the static action recognition classifier StaticN: the input is the hand motion image set constructed in the first step; let the image input each time be Handp, the output is StaticN(Handp) and the classification is yHandp; the representation of yHandp is: grasp: yHandp=[1,0,0], put down: yHandp=[0,1,0], other: yHandp=[0,0,1]; the evaluation function of the network computes the cross-entropy loss of (StaticN(Handp)-yHandp), the convergence direction is minimization, and the number of iterations is 2000;
The construction static state action recognition classifier StaticN, network structure are as follows: first layer: convolutional layer, inputting is 256
× 256 × 3, exporting is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, input as 256 × 256 ×
64, exporting is 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256
× 64 are connected in third dimension with third layer output 256 × 256 × 64, and exporting is 128 × 128 × 128;4th layer:
Convolutional layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 5: volume
Lamination, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: Chi Hua
Layer inputs the 4th layer of output 128 × 128 × 128 and is connected in third dimension with layer 5 128 × 128 × 128, exports
It is 64 × 64 × 256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number
Channels=256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number
Channels=256;9th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number
Channels=256;Tenth layer: pond layer, input for layer 7 output 64 × 64 × 256 and the 9th layer 64 × 64 × 256
It is connected in third dimension, exporting is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, exports and is
32 × 32 × 512, port number channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, export as 32 ×
32 × 512, port number channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 ×
512, port number channels=512;14th layer: pond layer inputs as eleventh floor output 32 × 32 × 512 and the 13rd
Layer 32 × 32 × 512 is connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16
× 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16
× 512, exporting is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, input as 16 × 16 ×
512, exporting is 16 × 16 × 512, port number channels=512;18th layer: pond layer is inputted and is exported for the 15th layer
16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th
Layer: convolutional layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: Chi Hua
Layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4
× 4 × 128, port number channels=128;Second Floor 12: pond layer, inputting is 4 × 4 × 128, export as 2 × 2 ×
128;23rd layer: the data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, so by full articulamentum first
After input into full articulamentum, output vector length is 128, and activation primitive is relu activation primitive;24th layer: full connection
Layer, input vector length are 128, and output vector length is 32, and activation primitive is relu activation primitive;25th layer: Quan Lian
Layer is connect, input vector length is 32, and output vector length is 3, and activation primitive is soft-max activation primitive;All convolutional layers
Parameter is size=3 convolution kernel kernel, and step-length stride=(1,1), activation primitive is relu activation primitive;All pond layers
It is maximum pond layer, parameter is pond section size kernel_size=2, step-length stride=(2,2);
Described initializes dynamic action recognition classifier using hand motion video, method are as follows: the first step, construction
Data acquisition system: the first step when hand motion image using standard initializes static action recognition classifier
The hand motion video collection constructed uniformly extracts 10 frame images, as input;Second step, construction dynamic action identification classification
Device DynamicN;Third step initializes dynamic action recognition classifier DynamicN, and input is the first step to each
The set that 10 frame images of a video extraction are constituted exports if the 10 frame images inputted each time are Handv as DynamicN
(Handv), classification yHandv, yHandvRepresentation method are as follows: take out article: yHandvArticle is put back to in=[1,0,0,0,0]:
yHandvIt takes out and puts back to again in=[0,1,0,0,0]: yHandvIt has taken out article and has not put back to in=[0,0,1,0,0]: yHandv=[0,0,
0,1,0] and the movement of suspicious stealing: yHandv=[0,0,0,0,1], the evaluation function of the network are to (DynamicN
(Handv)-yHandv) its cross entropy loss function is calculated, convergence direction is to be minimized, and the number of iterations is 2000 times;
The described uniform extraction of 10 frame images is performed as follows: for a section of video of length Nf frames, the 1st frame of the video image is first extracted as the 1st frame of the extracted set, the last frame of the video image is extracted as the 10th frame of the extracted set, and the ickt-th frame of the extracted set is the correspondingly evenly spaced frame of the video image, for ickt = 2 to 9, where ⌊·⌋ denotes the integer part;
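The exact index formula for the in-between frames is not legible above; one natural reading of "uniform extraction" that keeps the first and last frames fixed is linear spacing, sketched below.

```python
def extract_10_frames(video_frames):
    """video_frames: list of Nf frames; returns 10 evenly spaced frames,
    keeping the first and last frames as frames 1 and 10 of the extracted set."""
    nf = len(video_frames)
    indices = [round(k * (nf - 1) / 9) for k in range(10)]  # 0-based indices
    return [video_frames[i] for i in indices]
```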
The construction dynamic action recognition classifier DynamicN, network structure are as follows:
First layer: convolutional layer, inputting is 256 × 256 × 30, and exporting is 256 × 256 × 512, port number channels=512;
The second layer: convolutional layer, inputting is 256 × 256 × 512, and exporting is 256 × 256 × 128, port number channels=
128;
Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: convolutional layer, input
It is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer inputs and is
128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, input the 4th
Layer output 128 × 128 × 128 is connected in third dimension with layer 5 128 × 128 × 128, export as 64 × 64 ×
256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The
Eight layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: volume
Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, it is defeated
Enter and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, exporting is 32 × 32
×512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=
512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;
13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth
Four layers: pond layer inputs and exports 32 × 32 × 512 and the 13rd layer 32 × 32 × 512 in third dimension for eleventh floor
It is connected, exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, export as 16 × 16 ×
512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512,
Port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, channel
Number channels=512;18th layer: pond layer inputs and exports 16 × 16 × 512 and the 17th layer 16 × 16 for the 15th layer
× 512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024,
Output is 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, export as 4 ×
4×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128;
Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will
The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length
It is 128, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 128, and output vector is long
Degree is 32, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 32, and output vector is long
Degree is 3, and activation primitive is soft-max activation primitive;The parameter of all convolutional layers is size=3 convolution kernel kernel, step-length
Stride=(1,1), activation primitive are relu activation primitive;All pond layers are maximum pond layer, parameter Chi Huaqu
Between size kernel_size=2, step-length stride=(2,2);
The described target tracking of the region corresponding to the current grasp motion after a grasp motion is recognized is performed as follows: let the image of the currently recognized grasp motion be Hgrab; the current tracking area is the region corresponding to image Hgrab. First step, extract the ORB feature ORBHgrab of image Hgrab. Second step, for the images corresponding to all hand regions in the next frame after Hgrab, calculate their ORB features to obtain an ORB feature set, and delete the ORB features already chosen by other tracking boxes. Third step, compare ORBHgrab with each value of the ORB feature set by its Hamming distance, and select the ORB feature with the smallest Hamming distance to ORBHgrab as the chosen ORB feature; if the similarity between the chosen ORB feature and ORBHgrab is > 0.85, where similarity = 1 - (Hamming distance of the two ORB features / ORB feature length), then the hand region corresponding to the chosen ORB feature is the tracking box of image Hgrab in the next frame; otherwise, if the similarity is < 0.85, tracking is lost;
The ORB feature: the method of extracting ORB features from an image is relatively mature and is implemented in the OpenCV computer vision library; when ORB features are extracted from a picture, the input is the current image and the output is several character strings of equal length, each of which represents one ORB feature;
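The tracking step therefore reduces to matching ORB descriptors between the grasped hand region and the candidate hand regions of the next frame, accepting the best match only when similarity = 1 − distance/length exceeds 0.85. A minimal OpenCV sketch is below; collapsing the per-keypoint descriptors into one mean match distance is an illustrative simplification of "the ORB feature of the image".

```python
import cv2
import numpy as np

orb = cv2.ORB_create()
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def orb_descriptors(image: np.ndarray):
    _, desc = orb.detectAndCompute(image, None)
    return desc                                  # N x 32 uint8 array, or None

def similarity(desc_a, desc_b) -> float:
    # 1 - (mean Hamming distance / descriptor length in bits)
    if desc_a is None or desc_b is None:
        return 0.0
    matches = matcher.match(desc_a, desc_b)
    if not matches:
        return 0.0
    mean_dist = np.mean([m.distance for m in matches])
    return 1.0 - mean_dist / (desc_a.shape[1] * 8)

def track_next_frame(grab_region, next_frame_regions, threshold: float = 0.85):
    """Pick the hand region of the next frame most similar to the grasped region,
    or return None when tracking is lost (similarity below the threshold)."""
    ref = orb_descriptors(grab_region)
    best_idx, best_sim = None, 0.0
    for idx, region in enumerate(next_frame_regions):
        s = similarity(ref, orb_descriptors(region))
        if s > best_sim:
            best_idx, best_sim = idx, s
    return best_idx if best_sim > threshold else None
```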
Described terminates using the corresponding hand region of present image as video, using method for tracking target since present frame forward
It is tracked, until tracking is lost, method are as follows: set the image for putting down movement currently recognized as Hdown, current tracking area
Domain is region corresponding to image Hdown;
If not tracking loss:
The first step extracts the ORB feature ORB of image HdownHdown, due to the process described after recognizing grasp motion,
It has calculated during carrying out target following to current grasp motion corresponding region, has been counted again so being not required here
It calculates;
Second step calculates its ORB feature for the corresponding image of all hand regions in the former frame of image Hdown to obtain
To ORB characteristic set, and delete the ORB feature chosen by other tracking box;
Third step, by ORBHdownIts Hamming distance compared with each value of ORB characteristic set, selection and ORBHdownThe Chinese of feature
The smallest ORB feature of prescribed distance is the ORB feature chosen, if the ORB feature and ORB chosenHdownSimilarity > 0.85 of feature,
Similarity=(Hamming distance/ORB characteristic lengths of two ORB features of 1-), then the corresponding hand region of ORB feature chosen is i.e.
It is image Hdown in the tracking box of next frame, if otherwise similarity < 0.85 shows that tracking is lost, algorithm terminates.
5. a kind of supermarket's intelligence vending system according to claim 1, it is characterised in that the product identification module
Concrete methods of realizing are as follows:
In initialization, product identification classifier is initialized using the product image set of all angles first, and right
Product image generates product list;When changing product list: if deleting certain product, from the product image set of all angles
The middle image for deleting the product, and corresponding position in product list is deleted, if increasing certain product, by each of current production
The product image of angle is put into the product image set of all angles, and by product list, last back addition current increases
The title of product, then with the product image set of new all angles and new product list upgrading products recognition classifier;
In the detection process: first step, for the complete videos and the videos of only a grasp motion transmitted by the shopping action recognition module, first, according to the position obtained by the target detection module corresponding to the first frame of the current video, the input video image of that position is examined forward from the first frame of the current video to find a frame in which the region is not blocked; finally the image of the region corresponding to that frame is used as the input of the product identification classifier for identification, so as to obtain the recognition result of the current product; the recognition method is: let the image input each time be Goods1; the output GoodsN(Goods1) is a vector; if the igoods-th position of the vector is the maximum, the current recognition result is the product at the igoods-th position of the product list; the recognition result is sent to the recognition result processing module;
The described initialization of the product identification classifier using the product image sets of all angles, and the generation of the product list from the product images, proceed as follows: first step, construct the data set and the product list: the data set consists of the images of each product from all angles, and the product list listGoods is a vector in which each position corresponds to a product name;Second step, construct the product identification classifier GoodsN;Third step, initialize the constructed product identification classifier GoodsN: the input is the product image set of all angles; if the input image is Goods, the output is GoodsN(Goods) and the classification is yGoods; yGoods is a vector whose length equals the number of products in the product list, and its representation is: if image Goods is the product at the iGoods-th position, then the iGoods-th position of yGoods is 1 and the other positions are 0;The evaluation function of the network computes the cross-entropy loss of (GoodsN(Goods)-yGoods), the convergence direction is minimization, and the number of iterations is 2000;
The construction of the product identification classifier GoodsN: the network consists of two parts, GoodsN1 and GoodsN2, wherein
The network structure of GoodsN1 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, and exporting is 256 × 256 × 64, port number
Channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, and exporting is 256 × 256 × 128, port number
Channels=128;Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: volume
Lamination, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolution
Layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer,
It inputs the 4th layer of output 128 × 128 × 128 to be connected in third dimension with layer 5 128 × 128 × 128, exporting is 64
×64×256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=
256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The
Nine layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond
Change layer, inputs and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, export
It is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number
Channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number
Channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number
Channels=512;14th layer: pond layer, input for eleventh floor export 32 × 32 × 512 and the 13rd layer 32 × 32 ×
512 are connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, input as 16 × 16 ×
1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512,
Output is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, output
It is 16 × 16 × 512, port number channels=512;18th layer: pond layer, input for the 15th layer output 16 × 16 ×
512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th layer: convolution
Layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: pond layer, input
It is 8 × 8 × 256, exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, export as 4 × 4 ×
128, port number channels=128;The parameter of all convolutional layers be size=3 convolution kernel kernel, step-length stride=(1,
1), activation primitive is relu activation primitive;All pond layers are maximum pond layer, and parameter is pond section size
Kernel_size=2, step-length stride=(2,2);The network structure of GoodsN2 are as follows: inputting is 4 × 4 × 128, first will be defeated
The data entered are launched into the vector of 2048 dimensions, then input into first layer;First layer: full articulamentum, input vector length are
2048, output vector length is 1024, and activation primitive is relu activation primitive;The second layer: full articulamentum, input vector length are
1024, output vector length is 1024, and activation primitive is relu activation primitive;Third layer: full articulamentum, input vector length are
1024, output vector length is len (listGoods), activation primitive is soft-max activation primitive;len(listGoods) indicate
The length of product list;For any input Goods2, GoodsN (Goods2)=GoodsN2 (GoodsN1 (Goods2));
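As an illustration only, the layer stack above can be sketched as follows in PyTorch; the padding choice, weight initialization and flattening order are assumptions not stated in the source, and the concatenation before each pooling layer is taken along the channel axis.

```python
# Hedged PyTorch sketch of GoodsN1 + GoodsN2 as described above. Each pooling step
# (except layers 3 and 20) receives the channel-wise concatenation of the two
# preceding convolutional outputs. Not the patentee's reference code.
import torch
import torch.nn as nn

def conv3x3(in_ch, out_ch):
    # 3x3 convolution, stride 1; padding=1 is assumed so the stated spatial sizes hold
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                         nn.ReLU(inplace=True))

class GoodsN1(nn.Module):
    def __init__(self):
        super().__init__()
        self.c1, self.c2 = conv3x3(3, 64), conv3x3(64, 128)
        self.c4, self.c5 = conv3x3(128, 128), conv3x3(128, 128)
        self.c7, self.c8, self.c9 = conv3x3(256, 256), conv3x3(256, 256), conv3x3(256, 256)
        self.c11, self.c12, self.c13 = conv3x3(512, 512), conv3x3(512, 512), conv3x3(512, 512)
        self.c15, self.c16, self.c17 = conv3x3(1024, 512), conv3x3(512, 512), conv3x3(512, 512)
        self.c19, self.c21 = conv3x3(1024, 256), conv3x3(256, 128)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):                                   # x: N x 3 x 256 x 256
        x = self.pool(self.c2(self.c1(x)))                  # layer 3  -> 128 x 128, 128 ch
        a = self.c4(x); b = self.c5(a)
        x = self.pool(torch.cat([a, b], dim=1))             # layer 6  -> 64 x 64, 256 ch
        a = self.c7(x); b = self.c9(self.c8(a))
        x = self.pool(torch.cat([a, b], dim=1))             # layer 10 -> 32 x 32, 512 ch
        a = self.c11(x); b = self.c13(self.c12(a))
        x = self.pool(torch.cat([a, b], dim=1))             # layer 14 -> 16 x 16, 1024 ch
        a = self.c15(x); b = self.c17(self.c16(a))
        x = self.pool(torch.cat([a, b], dim=1))             # layer 18 -> 8 x 8, 1024 ch
        x = self.pool(self.c19(x))                          # layers 19-20 -> 4 x 4, 256 ch
        return self.c21(x)                                  # layer 21 -> 4 x 4, 128 ch

class GoodsN2(nn.Module):
    def __init__(self, num_products):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(4 * 4 * 128, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_products), nn.Softmax(dim=1))

    def forward(self, x):
        return self.fc(torch.flatten(x, start_dim=1))       # flatten 4 x 4 x 128 -> 2048

class GoodsN(nn.Module):
    def __init__(self, num_products):
        super().__init__()
        self.trunk, self.head = GoodsN1(), GoodsN2(num_products)

    def forward(self, goods):                                # GoodsN(Goods) = GoodsN2(GoodsN1(Goods))
        return self.head(self.trunk(goods))
```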
The said upgrading of the product recognition classifier with the product image set of the new all angles and the new product list proceeds as follows: first step, modify the network structure: for the newly constructed product identification classifier GoodsN', the GoodsN1' part remains unchanged and is identical to the GoodsN1 network structure at initialization; the first and second layers of the GoodsN2' network structure remain unchanged, and the output vector length of its third layer becomes the length of the updated product list; second step, initialize the newly constructed product identification classifier GoodsN': its input is the product image set of the new all angles; if the input image is Goods3, the output is GoodsN'(Goods3) = GoodsN2'(GoodsN1(Goods3)) and the class label is y_Goods3, where y_Goods3 is a vector whose length equals the number of products in the updated product list, represented as follows: if the image Goods3 is the product at the i_Goods-th position, then the i_Goods-th position of y_Goods3 is 1 and all other positions are 0; the evaluation function of the network is the cross-entropy loss computed between GoodsN'(Goods3) and y_Goods3, the convergence direction is minimization, the parameter values in GoodsN1 remain unchanged during initialization, and the number of iterations is 500;
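A minimal sketch of this upgrade step, under the assumption of the PyTorch modules sketched earlier: the GoodsN1 trunk is frozen, the first two fully connected layers keep their weights, and only the final fully connected layer is rebuilt for the new product-list length.

```python
# Sketch of upgrading GoodsN -> GoodsN': keep GoodsN1 fixed, resize the last fully
# connected layer of GoodsN2 to the new product-list length, retrain for 500 steps.
# Assumes the GoodsN / GoodsN2 classes from the previous sketch.
import copy
import torch.nn as nn

def upgrade_classifier(old_model, new_num_products):
    new_model = copy.deepcopy(old_model)
    # Replace only the third fully connected layer; layers 1 and 2 keep their weights.
    new_model.head.fc[4] = nn.Linear(1024, new_num_products)
    new_model.head.fc[5] = nn.Softmax(dim=1)
    # Freeze the GoodsN1 trunk: its parameters remain unchanged during initialization.
    for p in new_model.trunk.parameters():
        p.requires_grad = False
    return new_model

# Only the unfrozen parameters are handed to the optimizer, e.g.
# optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.01),
# followed by 500 iterations of the same cross-entropy training step as before.
```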
The said detection, according to the position obtained by the target detection module for the first frame of the current video, of the frame in which that region is not occluded, by searching the video frames from the first frame of the current video toward earlier frames, proceeds as follows: let the position obtained by the target detection module for the first frame of the current video be (a_goods, b_goods, l_goods, w_goods), and let the first frame of the current video be the i_crgs-th frame; the frame under processing is i_cr = i_crgs: first step, let Task_icr be all detection regions obtained by the target detection module in the i_cr-th frame; second step, for each region frame (a_task, b_task, l_task, w_task) in Task_icr, compute its distance d_gt = (a_task - a_goods)² + (b_task - b_goods)² - (l_task + l_goods)² - (w_task + w_goods)²; if no distance < 0 exists, then the region (a_goods, b_goods, l_goods, w_goods) in the i_cr-th frame is the frame in which the detected region is not occluded, and the algorithm terminates; otherwise, if a distance < 0 exists, record d(i_cr) = the minimum distance in the distance list d and set i_cr = i_cr - 1; if i_cr > 0 the algorithm jumps to the first step; if i_cr ≤ 0, the record with the maximum value in the distance list d is selected, and the frame corresponding to that record, with region (a_goods, b_goods, l_goods, w_goods), is taken as the frame in which the detected region is not occluded; the algorithm terminates.
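As shown in the following sketch, the backward search can be written compactly; detections(frame_idx) is a hypothetical accessor returning the target-detection regions (a, b, l, w) of a frame, and the overlap test and fallback rule follow the text above.

```python
# Sketch of the backward search for an unoccluded frame. A negative distance d_gt
# means the candidate region overlaps the product region, i.e. the region is occluded.
def find_unoccluded_frame(goods_box, first_frame_idx, detections):
    a_g, b_g, l_g, w_g = goods_box
    d = {}                                        # distance list d, indexed by frame number
    i_cr = first_frame_idx
    while i_cr > 0:
        dists = [(a - a_g) ** 2 + (b - b_g) ** 2 - (l + l_g) ** 2 - (w + w_g) ** 2
                 for (a, b, l, w) in detections(i_cr)]
        if not any(x < 0 for x in dists):
            return i_cr                           # region not occluded in this frame
        d[i_cr] = min(dists)                      # record the minimum (most negative) distance
        i_cr -= 1
    # No fully unoccluded frame found: fall back to the frame whose recorded
    # minimum distance is largest, i.e. the least-occluded frame.
    return max(d, key=d.get) if d else first_frame_idx
```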
6. A kind of supermarket's intelligence vending system according to claim 1, characterised in that the concrete implementation method of the individual identification module is as follows:
At initialization, the face feature extractor FaceN is first initialized using the face image set of all angles and μface is calculated; then the human body feature extractor BodyN is initialized using the human body images of all angles and μbody is calculated. During detection, when a user enters the supermarket, the current human body region Body1 and the face image Face1 within that human body region are obtained through the target detection module; then the human body feature BodyN(Body1) and the face feature FaceN(Face1) are extracted with the human body feature extractor BodyN and the face feature extractor FaceN respectively; BodyN(Body1) is saved in the BodyFtu set, FaceN(Face1) is saved in the FaceFtu set, and the ID information of the current customer is saved; the ID information may be the user's supermarket account or an unduplicated number randomly assigned when the user enters the supermarket, and is used to distinguish different customers; whenever a customer enters the supermarket, his or her human body feature and face feature are extracted in this way. When a user moves a product in the supermarket, according to the complete video and the grasp-action-only video transmitted by the shopping action recognition module, the corresponding human body region and face region are searched out, and face recognition or human body recognition with the face feature extractor FaceN and the human body feature extractor BodyN is used to obtain the ID of the customer corresponding to the video currently transmitted by the shopping action recognition module;
The said initialization of the face feature extractor FaceN using the face image set of all angles, and the calculation of μface, proceed as follows: first step, choose the face image sets of all angles to constitute the face data set; second step, construct the face feature extractor FaceN and initialize it using the face data set; third step:
For each person i_Peop in the face data set, obtain the set FaceSet(i_Peop) of all face images in the face data set that belong to i_Peop:
For each face image Face(j_iPeop) in FaceSet(i_Peop): calculate the face feature FaceN(Face(j_iPeop));
Compute the average of all face features in the current face image set FaceSet(i_Peop) as the current face image center, center(FaceN(Face(j_iPeop))); the distances between all face features in the current face image set FaceSet(i_Peop) and this center constitute the distance set corresponding to i_Peop;
For every person in the face data set, the corresponding distance set is obtained in this way; after the distance set is sorted from small to large, if the length of the distance set is n_diset, then μface equals the value at the ⌊…⌋-th position of the sorted distance set, where ⌊·⌋ denotes taking the integer part;
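The per-person center-distance statistic can be sketched as follows; since the exact index expression for μface is not reproduced in the text, the sketch assumes, purely for illustration, the ⌊0.95·n⌋-th order statistic of the pooled, sorted distances.

```python
# Sketch of computing mu_face: per-person feature centers, distances to each center,
# then a threshold taken as an order statistic of the sorted distances. The 0.95
# quantile index and the pooling of all persons' distances are assumptions; the
# source only states "the value at the floor(...)-th position".
import numpy as np

def mu_threshold(features_by_person, quantile=0.95):
    distances = []
    for person_id, feats in features_by_person.items():
        feats = np.stack(feats)                      # all FaceN(Face(j)) of this person
        center = feats.mean(axis=0)                  # current face image center
        distances.extend(np.linalg.norm(feats - center, axis=1))
    distances.sort()
    n_diset = len(distances)
    return distances[int(quantile * n_diset)]        # value at the floor(q * n)-th position

# mu_face = mu_threshold(face_features_by_person); mu_body is computed the same way
# from BodyN features, as described further below.
```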
The said construction of the face feature extractor FaceN and its initialization with the face data set proceed as follows: let the face data set consist of N_faceset persons; the network FaceN25 has the layer structure:
- Layer 1: convolutional layer, input 256 × 256 × 3, output 256 × 256 × 64, channels = 64;
- Layer 2: convolutional layer, input 256 × 256 × 64, output 256 × 256 × 64, channels = 64;
- Layer 3: pooling layer, input is the layer-1 output 256 × 256 × 64 concatenated with the layer-2 output 256 × 256 × 64 along the third dimension, output 128 × 128 × 128;
- Layer 4: convolutional layer, input 128 × 128 × 128, output 128 × 128 × 128, channels = 128;
- Layer 5: convolutional layer, input 128 × 128 × 128, output 128 × 128 × 128, channels = 128;
- Layer 6: pooling layer, input is the layer-4 output 128 × 128 × 128 concatenated with the layer-5 output 128 × 128 × 128 along the third dimension, output 64 × 64 × 256;
- Layer 7: convolutional layer, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256;
- Layer 8: convolutional layer, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256;
- Layer 9: convolutional layer, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256;
- Layer 10: pooling layer, input is the layer-7 output 64 × 64 × 256 concatenated with the layer-9 output 64 × 64 × 256 along the third dimension, output 32 × 32 × 512;
- Layer 11: convolutional layer, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512;
- Layer 12: convolutional layer, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512;
- Layer 13: convolutional layer, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512;
- Layer 14: pooling layer, input is the layer-11 output 32 × 32 × 512 concatenated with the layer-13 output 32 × 32 × 512 along the third dimension, output 16 × 16 × 1024;
- Layer 15: convolutional layer, input 16 × 16 × 1024, output 16 × 16 × 512, channels = 512;
- Layer 16: convolutional layer, input 16 × 16 × 512, output 16 × 16 × 512, channels = 512;
- Layer 17: convolutional layer, input 16 × 16 × 512, output 16 × 16 × 512, channels = 512;
- Layer 18: pooling layer, input is the layer-15 output 16 × 16 × 512 concatenated with the layer-17 output 16 × 16 × 512 along the third dimension, output 8 × 8 × 1024;
- Layer 19: convolutional layer, input 8 × 8 × 1024, output 8 × 8 × 256, channels = 256;
- Layer 20: pooling layer, input 8 × 8 × 256, output 4 × 4 × 256;
- Layer 21: convolutional layer, input 4 × 4 × 256, output 4 × 4 × 128, channels = 128;
- Layer 22: pooling layer, input 4 × 4 × 128, output 2 × 2 × 128;
- Layer 23: fully connected layer, the input data of dimension 2 × 2 × 128 is first flattened into a vector of length 512 and then fed into the fully connected layer, output vector length 512, ReLU activation;
- Layer 24: fully connected layer, input vector length 512, output vector length 512, ReLU activation;
- Layer 25: fully connected layer, input vector length 512, output vector length N_faceset, soft-max activation.
All convolutional layers use kernel size kernel = 3 and stride = (1, 1), with the ReLU activation function; all pooling layers are max-pooling layers with window size kernel_size = 2 and stride = (2, 2). The initialization procedure is: for each face image face4, the output is FaceN25(face4) and the class label is y_face, where y_face is a vector of length N_faceset, represented as follows: if the face face4 belongs to the i_face4-th person in the face image set, then the i_face4-th position of y_face is 1 and all other positions are 0; the evaluation function of the network is the cross-entropy loss computed between FaceN25(face4) and y_face, the convergence direction is minimization, and the number of iterations is 2000; after the iterations, the face feature extractor FaceN consists of layers 1 to 24 of the FaceN25 network;
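A brief sketch of how the feature extractor is obtained from the trained classifier: the final N_faceset-way soft-max layer is dropped and the 512-dimensional output of layer 24 is used as the face feature. The trunk/head attribute names below are assumptions carried over from the earlier GoodsN sketch, not names given in the source.

```python
# Sketch: FaceN is FaceN25 truncated at layer 24, i.e. the 512-dimensional embedding
# before the final classification layer. Assumes facen25 has a convolutional trunk
# (layers 1-22) and a fully connected head ending in (Linear(512, N_faceset), Softmax).
import torch
import torch.nn as nn

class FaceN(nn.Module):
    """Feature extractor: the trained FaceN25 with its final classification layer removed."""
    def __init__(self, facen25):
        super().__init__()
        self.trunk = facen25.trunk                                        # layers 1-22
        self.embed = nn.Sequential(*list(facen25.head.children())[:-2])   # layers 23-24
    def forward(self, x):
        return self.embed(torch.flatten(self.trunk(x), 1))                # 512-dim face feature
```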
The said initialization of the human body feature extractor BodyN using the human body images of all angles, and the calculation of μbody, proceed as follows: first step, choose the human body image sets of all angles to constitute the human body data set; second step, construct the human body feature extractor BodyN and initialize it using the human body data set; third step:
For each person i_Peop1 in the human body data set, obtain the set BodySet(i_Peop1) of all human body images in the human body data set that belong to i_Peop1:
For each human body image Body(j_iPeop1) in BodySet(i_Peop1): calculate the human body feature BodyN(Body(j_iPeop1));
Compute the average of all human body features in the current human body image set BodySet(i_Peop1) as the current human body image center, center(BodyN(Body(j_iPeop1))); the distances between all human body features in the current human body image set BodySet(i_Peop1) and this center constitute the distance set corresponding to i_Peop1;
For every person in the human body data set, the corresponding distance set is obtained in this way; after the distance set is sorted from small to large, if the length of the distance set is n_diset1, then μbody equals the value at the ⌊…⌋-th position of the sorted distance set, where ⌊·⌋ denotes taking the integer part;
The said construction of the human body feature extractor BodyN and its initialization with the human body data set proceed as follows: let the human body data set consist of N_bodyset persons; the network BodyN25 has the layer structure:
- Layer 1: convolutional layer, input 256 × 256 × 3, output 256 × 256 × 64, channels = 64;
- Layer 2: convolutional layer, input 256 × 256 × 64, output 256 × 256 × 64, channels = 64;
- Layer 3: pooling layer, input is the layer-1 output 256 × 256 × 64 concatenated with the layer-2 output 256 × 256 × 64 along the third dimension, output 128 × 128 × 128;
- Layer 4: convolutional layer, input 128 × 128 × 128, output 128 × 128 × 128, channels = 128;
- Layer 5: convolutional layer, input 128 × 128 × 128, output 128 × 128 × 128, channels = 128;
- Layer 6: pooling layer, input is the layer-4 output 128 × 128 × 128 concatenated with the layer-5 output 128 × 128 × 128 along the third dimension, output 64 × 64 × 256;
- Layer 7: convolutional layer, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256;
- Layer 8: convolutional layer, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256;
- Layer 9: convolutional layer, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256;
- Layer 10: pooling layer, input is the layer-7 output 64 × 64 × 256 concatenated with the layer-9 output 64 × 64 × 256 along the third dimension, output 32 × 32 × 512;
- Layer 11: convolutional layer, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512;
- Layer 12: convolutional layer, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512;
- Layer 13: convolutional layer, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512;
- Layer 14: pooling layer, input is the layer-11 output 32 × 32 × 512 concatenated with the layer-13 output 32 × 32 × 512 along the third dimension, output 16 × 16 × 1024;
- Layer 15: convolutional layer, input 16 × 16 × 1024, output 16 × 16 × 512, channels = 512;
- Layer 16: convolutional layer, input 16 × 16 × 512, output 16 × 16 × 512, channels = 512;
- Layer 17: convolutional layer, input 16 × 16 × 512, output 16 × 16 × 512, channels = 512;
- Layer 18: pooling layer, input is the layer-15 output 16 × 16 × 512 concatenated with the layer-17 output 16 × 16 × 512 along the third dimension, output 8 × 8 × 1024;
- Layer 19: convolutional layer, input 8 × 8 × 1024, output 8 × 8 × 256, channels = 256;
- Layer 20: pooling layer, input 8 × 8 × 256, output 4 × 4 × 256;
- Layer 21: convolutional layer, input 4 × 4 × 256, output 4 × 4 × 128, channels = 128;
- Layer 22: pooling layer, input 4 × 4 × 128, output 2 × 2 × 128;
- Layer 23: fully connected layer, the input data of dimension 2 × 2 × 128 is first flattened into a vector of length 512 and then fed into the fully connected layer, output vector length 512, ReLU activation;
- Layer 24: fully connected layer, input vector length 512, output vector length 512, ReLU activation;
- Layer 25: fully connected layer, input vector length 512, output vector length N_bodyset, soft-max activation.
All convolutional layers use kernel size kernel = 3 and stride = (1, 1), with the ReLU activation function; all pooling layers are max-pooling layers with window size kernel_size = 2 and stride = (2, 2). The initialization procedure is: for each human body image body4, the output is BodyN25(body4) and the class label is y_body, where y_body is a vector of length N_bodyset, represented as follows: if the human body body4 belongs to the i_body4-th person in the human body image set, then the i_body4-th position of y_body is 1 and all other positions are 0; the evaluation function of the network is the cross-entropy loss computed between BodyN25(body4) and y_body, the convergence direction is minimization, and the number of iterations is 2000; after the iterations, the human body feature extractor BodyN consists of layers 1 to 24 of the BodyN25 network;
The said process of, according to the complete video and the grasp-action-only video transmitted by the shopping action recognition module, searching out the corresponding human body region and face region and using face recognition or human body recognition with the face feature extractor FaceN and the human body feature extractor BodyN to obtain the ID of the customer corresponding to the video currently transmitted by the shopping action recognition module, is as follows: according to the video transmitted by the shopping action recognition module, the corresponding human body region and face region are looked for starting from the first frame of the video, until the algorithm terminates or the last frame of the video has been processed:
The human body feature BodyN(Body2) and the face feature FaceN(Face2) are extracted from the corresponding human body region image Body2 and face region image Face2 with the human body feature extractor BodyN and the face feature extractor FaceN respectively;
Face identification information is then used first: the Euclidean distance d_Face between FaceN(Face2) and every face feature in the FaceFtu set is compared, and the feature in the FaceFtu set with the smallest Euclidean distance is selected; let this feature be FaceN(Face3); if d_Face < μface, the current face image is identified as belonging to the customer of the face image corresponding to FaceN(Face3), that customer's ID is the ID corresponding to the video action transmitted by the shopping action recognition module, and the current identification process terminates;
If d_Face ≥ μface, which shows that the current individual cannot be identified by the face recognition method alone, then the Euclidean distance d_Body between BodyN(Body2) and every human body feature in the BodyFtu set is compared, and the feature in the BodyFtu set with the smallest Euclidean distance is selected; let this feature be BodyN(Body3); if d_Body + d_Face < μface + μbody, the current human body image is identified as belonging to the customer of the human body image corresponding to BodyN(Body3), and that customer's ID is the ID corresponding to the video action transmitted by the shopping action recognition module;
If no ID corresponding to the video action has been found after all frames of the video have been processed, then, in order to avoid misidentifying the purchaser and causing a wrong charge, the video currently transmitted by the shopping action recognition module is not processed further;
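A compact sketch of this two-stage matching, assuming the FaceFtu and BodyFtu sets are stored as lists of (customer ID, feature vector) pairs; the function and variable names are illustrative.

```python
# Sketch of the face-first, then combined face-plus-body identification described above.
# face_ftu / body_ftu: lists of (customer_id, feature_vector); mu_face and mu_body are
# the thresholds computed during initialization.
import numpy as np

def identify_customer(face_feat, body_feat, face_ftu, body_ftu, mu_face, mu_body):
    # Stage 1: nearest face feature by Euclidean distance.
    face_id, d_face = min(((cid, np.linalg.norm(face_feat - f)) for cid, f in face_ftu),
                          key=lambda t: t[1])
    if d_face < mu_face:
        return face_id                     # face recognition alone is sufficient
    # Stage 2: nearest body feature; accept only if the combined distance is small enough.
    body_id, d_body = min(((cid, np.linalg.norm(body_feat - f)) for cid, f in body_ftu),
                          key=lambda t: t[1])
    if d_body + d_face < mu_face + mu_body:
        return body_id
    return None                            # this frame yields no ID; move on to the next frame
```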
The said search, according to the video transmitted by the shopping action recognition module, for the corresponding human body region and face region starting from the first frame of the video, proceeds as follows: the video transmitted by the shopping action recognition module is processed from its first frame onward; suppose the i_fRg-th frame is currently being processed, the position obtained for this frame of the video by the target detection module is (a_ifRg, b_ifRg, l_ifRg, w_ifRg), the set of human body regions obtained for this frame by the target detection module is BodyFrameSet_ifRg, and the set of face regions is FaceFrameSet_ifRg; for each human body region (a_BFSifRg, b_BFSifRg, l_BFSifRg, w_BFSifRg) in BodyFrameSet_ifRg, its distance d_gbt = (a_BFSifRg - a_ifRg)² + (b_BFSifRg - b_ifRg)² - (l_BFSifRg - l_ifRg)² - (w_BFSifRg - w_ifRg)² is calculated, and the human body region with the smallest distance among all human body regions is selected as the human body region corresponding to the current video; let the position of the chosen human body region be (a_BFS1, b_BFS1, l_BFS1, w_BFS1); for each face region (a_FFSifRg, b_FFSifRg, l_FFSifRg, w_FFSifRg) in FaceFrameSet_ifRg, its distance d_gft = (a_BFS1 - a_FFSifRg)² + (b_BFS1 - b_FFSifRg)² - (l_BFS1 - l_FFSifRg)² - (w_BFS1 - w_FFSifRg)² is calculated, and the face region with the smallest distance among all face regions is selected as the face region corresponding to the current video.
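A short sketch of the region matching above; the boxes are (a, b, l, w) tuples and the signed distance follows the d_gbt and d_gft formulas, with the smallest value selected in each case.

```python
# Sketch of matching the action position to the nearest detected human body region,
# then matching that body region to the nearest detected face region.
def region_distance(box1, box2):
    a1, b1, l1, w1 = box1
    a2, b2, l2, w2 = box2
    return (a1 - a2) ** 2 + (b1 - b2) ** 2 - (l1 - l2) ** 2 - (w1 - w2) ** 2

def match_regions(action_box, body_boxes, face_boxes):
    body = min(body_boxes, key=lambda b: region_distance(b, action_box))   # smallest d_gbt
    face = min(face_boxes, key=lambda f: region_distance(body, f))         # smallest d_gft
    return body, face
```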
7. A kind of supermarket's intelligence vending system according to claim 1, characterised in that the concrete implementation method of the recognition result processing module is as follows:
It does not work at initialization; during the identification process, the recognition results received are integrated to generate the shopping list corresponding to each customer: first, the customer corresponding to the current shopping information is determined according to the customer ID transmitted by the individual identification module, so that the shopping list chosen for modification is the one numbered ID; then the product corresponding to the current customer's shopping action is determined according to the recognition result transmitted by the product identification module, and this product is denoted GoodA; then, whether the current shopping action modifies the shopping cart is determined according to the recognition result transmitted by the shopping action recognition module: if the action is identified as taking out an article, the product GoodA is added to shopping list ID with an increase quantity of 1; if the action is identified as putting an article back, the product GoodA is reduced on shopping list ID with a decrease quantity of 1; if the action is identified as "taken out and put back" or "taken out an article and not put it back again", the shopping list does not change; if the recognition result is "suspected stealing", an alarm signal and the location information corresponding to the current video are sent to the supermarket monitoring.
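A minimal sketch of this bookkeeping logic; the action labels, the shopping-list structure (a dictionary of counters keyed by customer ID) and the monitoring callback are illustrative assumptions that only mirror the rules stated above.

```python
# Sketch of the recognition-result processing rules: the shopping list keyed by
# customer ID is updated according to the recognized shopping action.
from collections import Counter, defaultdict

shopping_lists = defaultdict(Counter)          # customer ID -> {product name: quantity}

def process_result(customer_id, product, action, location, notify_monitoring):
    cart = shopping_lists[customer_id]
    if action == "take out":                   # article taken from the shelf: quantity + 1
        cart[product] += 1
    elif action == "put back":                 # article returned to the shelf: quantity - 1
        cart[product] -= 1
    elif action in ("take out and put back", "take out without putting back"):
        pass                                   # shopping list does not change
    elif action == "suspected stealing":       # alert supermarket monitoring with the video location
        notify_monitoring(alarm=True, location=location)
```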
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910263910.1A CN109977896A (en) | 2019-04-03 | 2019-04-03 | A kind of supermarket's intelligence vending system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910263910.1A CN109977896A (en) | 2019-04-03 | 2019-04-03 | A kind of supermarket's intelligence vending system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109977896A true CN109977896A (en) | 2019-07-05 |
Family
ID=67082544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910263910.1A Withdrawn CN109977896A (en) | 2019-04-03 | 2019-04-03 | A kind of supermarket's intelligence vending system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109977896A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110379108A (en) * | 2019-08-19 | 2019-10-25 | 铂纳思(东莞)高新科技投资有限公司 | A kind of method and its system of unmanned shop anti-thefting monitoring |
WO2021047232A1 (en) * | 2019-09-11 | 2021-03-18 | 苏宁易购集团股份有限公司 | Interaction behavior recognition method, apparatus, computer device, and storage medium |
CN110674712A (en) * | 2019-09-11 | 2020-01-10 | 苏宁云计算有限公司 | Interactive behavior recognition method and device, computer equipment and storage medium |
CN110619308A (en) * | 2019-09-18 | 2019-12-27 | 名创优品(横琴)企业管理有限公司 | Aisle sundry detection method, device, system and equipment |
CN110796051A (en) * | 2019-10-19 | 2020-02-14 | 北京工业大学 | Real-time access behavior detection method and system based on container scene |
CN110796051B (en) * | 2019-10-19 | 2024-04-26 | 北京工业大学 | Real-time access behavior detection method and system based on container scene |
CN111582202A (en) * | 2020-05-13 | 2020-08-25 | 上海海事大学 | Intelligent course system |
CN111582202B (en) * | 2020-05-13 | 2023-10-17 | 上海海事大学 | Intelligent net class system |
CN111723741A (en) * | 2020-06-19 | 2020-09-29 | 江苏濠汉信息技术有限公司 | Temporary fence movement detection alarm system based on visual analysis |
CN113408501A (en) * | 2021-08-19 | 2021-09-17 | 北京宝隆泓瑞科技有限公司 | Oil field park detection method and system based on computer vision |
CN113901895A (en) * | 2021-09-18 | 2022-01-07 | 武汉未来幻影科技有限公司 | Door opening action recognition method and device for vehicle and processing equipment |
CN114596661A (en) * | 2022-02-28 | 2022-06-07 | 安顺市成威科技有限公司 | Multifunctional intelligent sales counter |
CN114596661B (en) * | 2022-02-28 | 2023-03-10 | 安顺市成威科技有限公司 | Multifunctional intelligent sales counter |
CN117253194A (en) * | 2023-11-13 | 2023-12-19 | 网思科技股份有限公司 | Commodity damage detection method, commodity damage detection device and storage medium |
CN117253194B (en) * | 2023-11-13 | 2024-03-19 | 网思科技股份有限公司 | Commodity damage detection method, commodity damage detection device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977896A (en) | A kind of supermarket's intelligence vending system | |
US11270260B2 (en) | Systems and methods for deep learning-based shopper tracking | |
Liu et al. | PestNet: An end-to-end deep learning approach for large-scale multi-class pest detection and classification | |
Liu et al. | Adversarial learning for constrained image splicing detection and localization based on atrous convolution | |
US20210158053A1 (en) | Constructing shopper carts using video surveillance | |
CN104217214B (en) | RGB D personage's Activity recognition methods based on configurable convolutional neural networks | |
CN108460356A (en) | A kind of facial image automated processing system based on monitoring system | |
KR102554724B1 (en) | Method for identifying an object in an image and mobile device for practicing the method | |
CA3072056A1 (en) | Subject identification and tracking using image recognition | |
CN108470354A (en) | Video target tracking method, device and realization device | |
CN107292339A (en) | The unmanned plane low altitude remote sensing image high score Geomorphological Classification method of feature based fusion | |
CN113239801B (en) | Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment | |
US20210248421A1 (en) | Channel interaction networks for image categorization | |
Zhou et al. | Multi-label learning of part detectors for occluded pedestrian detection | |
CN108009493A (en) | Face anti-fraud recognition methods based on action enhancing | |
Tagare et al. | A maximum-likelihood strategy for directing attention during visual search | |
Liu et al. | Customer behavior recognition in retail store from surveillance camera | |
CN109977251A (en) | A method of building identifies commodity based on RGB histogram feature | |
CN110516533A (en) | A kind of pedestrian based on depth measure discrimination method again | |
CN110222587A (en) | A kind of commodity attribute detection recognition methods again based on characteristic pattern | |
CN110070002A (en) | A kind of Activity recognition method based on 3D convolutional neural networks | |
CN114187546B (en) | Combined action recognition method and system | |
CN107563293A (en) | A kind of new finger vena preprocess method and system | |
Fang et al. | Pedestrian attributes recognition in surveillance scenarios with hierarchical multi-task CNN models | |
CN108960005A (en) | The foundation and display methods, system of subjects visual label in a kind of intelligent vision Internet of Things |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20190705 |