CN109977896A - Supermarket intelligent vending system - Google Patents

Supermarket intelligent vending system

Info

Publication number
CN109977896A
CN109977896A (application CN201910263910.1A)
Authority
CN
China
Prior art keywords
layer
image
output
input
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910263910.1A
Other languages
Chinese (zh)
Inventor
刘昱昊 (Liu Yuhao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN201910263910.1A priority Critical patent/CN109977896A/en
Publication of CN109977896A publication Critical patent/CN109977896A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a supermarket intelligent vending system, which addresses the problem that scanning commodities one by one during traditional checkout is excessively time-consuming. Commodity entry is moved forward to the moment the shopper picks up the goods, so that the time spent scanning items at checkout is eliminated, checkout speed is greatly improved, and the customer's shopping experience is improved. The present invention uses pattern recognition algorithms to recognize and count the shopper's actions while selecting goods: the image of a commodity is recognized when the customer picks it up or puts it back in order to obtain the commodity type; face recognition is performed on the customer to obtain the customer's identity, and human-body image recognition is used when face recognition is unsatisfactory; and abnormal customer behaviour is recognized to determine whether theft has occurred. The system realizes automatic tallying without degrading the customer's shopping experience. The invention does not require changing the supermarket's existing organizational structure or the customer's shopping and checkout procedure, and therefore interfaces seamlessly with existing supermarket organizations.

Description

Supermarket intelligent vending system
Technical field
The present invention relates to the fields of computer-vision-based monitoring, target detection, target tracking and pattern recognition, and in particular to detecting, tracking and recognizing the actions of individuals in front of shelves on the basis of surveillance cameras.
Background technique
Under the traditional supermarket model, checkout is performed by scanning commodities one by one. This easily causes congestion, with large numbers of shoppers queuing at the cash registers, and because overall checkout throughput is limited by the space of the checkout area and the number of cashiers, it cannot be increased substantially; under the traditional cash-register model, checkout congestion is therefore unavoidable. Existing schemes in which customers scan the commodities themselves can reduce the time spent scanning items, but the goods still need to be inspected manually on the way out, so congestion still occurs. Analysing the causes of congestion, the most time-consuming step is entering the commodities; if commodity entry is moved forward to the moment the shopper picks up the goods, the most time-consuming step is performed in advance and in parallel, so that checkout speed is greatly improved and the customer's shopping experience is improved.
The system proposed by the invention uses surveillance cameras to recognize and count commodities while the shopper is selecting goods: the picked quantities are incremented and decremented by recognizing the customer's picking-up and putting-back actions, the commodity type is obtained by recognizing the commodity when the customer picks it up or puts it back, and abnormal customer behaviour is recognized to determine whether theft has occurred, thereby realizing automatic tallying without degrading the shopping experience during goods selection. The invention does not require changing the supermarket's existing organizational structure or the customer's shopping and checkout procedure, and therefore interfaces seamlessly with existing supermarket organizations.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the slow checkout of the traditional supermarket model by proposing a supermarket intelligent vending system, in which surveillance cameras are used to recognize customers' shopping behaviour and the commodities involved.
The technical solution adopted by the present invention to solve the technical problems is:
A supermarket intelligent vending system takes as input the video images captured by surveillance cameras fixed in the supermarket and on the shelves. It comprises an image preprocessing module, a target detection module, a shopping action recognition module, a product identification module, an individual identification module and a recognition result processing module. The image preprocessing module preprocesses the video images captured by the surveillance cameras: it first removes the noise that may be contained in the input image, then performs illumination compensation on the denoised image, then performs image enhancement on the illumination-compensated image, and finally passes the enhanced data to the target detection module. The target detection module performs target detection on the received image, detecting the whole-body region, the face region, the hand region and the product region in the current video image; the hand region and product region are sent to the shopping action recognition module, the body region and face region are sent to the individual identification module, and the product region is passed to the product identification module. The shopping action recognition module performs static action recognition on the received hand-region information to find the start frame in which the hand grasps an item, keeps recognizing until it finds the putting-down action, which it takes as the end frame, and then classifies the video with a dynamic action recognition classifier, identifying the current action as taking out an item, putting back an item, taking out and then putting back, taking out without putting back, or suspected theft; the recognition result is sent to the recognition result processing module, and videos that contain only a grasping action or only a putting-down action are likewise sent to the recognition result processing module. The product identification module recognizes the video of the received product region, identifies which product is currently being moved, and sends the result to the recognition result processing module; products can be added to or removed from the product identification module at any time. The individual identification module recognizes the received face region and body region and, combining the two, identifies which individual in the supermarket the current person is, then sends the result to the recognition result processing module. The recognition result processing module integrates the received results: the customer ID passed by the individual identification module determines which customer the current shopping information belongs to, the result passed by the product identification module determines which product the current shopping action concerns, and the result passed by the shopping action recognition module determines whether the current action modifies the shopping cart, thereby producing the current customer's shopping list. A suspected-theft action recognized by the shopping action recognition module raises an alarm.
The image preprocessing module works as follows: in the initialization phase the module does nothing; during detection: first, median denoising is applied to the monitoring image captured by the surveillance camera to obtain the denoised monitoring image; second, illumination compensation is applied to the denoised monitoring image to obtain the illumination-compensated image; third, image enhancement is applied to the illumination-compensated image, and the enhanced data are passed to the target detection module.
The median denoising of the monitoring image captured by the surveillance camera proceeds as follows: let the monitoring image captured by the camera be X_src; since X_src is a color RGB image, it has three components X_src-R, X_src-G and X_src-B. For each component X_src′ the following is done: a window of dimension 3 × 3 is set; for each pixel X_src′(i, j) of the image X_src′, the pixel values of the 3 × 3 matrix centred on that point, namely [X_src′(i−1, j−1), X_src′(i−1, j), X_src′(i−1, j+1), X_src′(i, j−1), X_src′(i, j), X_src′(i, j+1), X_src′(i+1, j−1), X_src′(i+1, j), X_src′(i+1, j+1)], are sorted from large to small, and the value in the middle is taken as the value of pixel (i, j) of the denoised image X_src″ and assigned to X_src″(i, j). For boundary points of X_src′, some pixels of the corresponding 3 × 3 window do not exist; in that case the median is computed only over the pixels that fall inside the window, and if the number of points in the window is even, the average of the two middle pixel values is assigned to X_src″(i, j) as the denoised pixel value. The new image matrix X_src″ is thus the denoised image matrix of the current RGB component. After the denoising operation has been carried out separately on the three components X_src-R, X_src-G and X_src-B, the resulting components X_src-R″, X_src-G″ and X_src-B″ are combined into a new color image X_Den, which is the denoised image.
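A minimal numpy sketch of the per-channel median filter described above, assuming 8-bit RGB input; the function names are illustrative and not part of the patent:

```python
import numpy as np

def median_denoise_channel(chan):
    """3x3 median filter as described: pixels outside the image are ignored at the
    borders, and when the window holds an even number of pixels the two middle
    values are averaged."""
    m, n = chan.shape
    out = np.zeros((m, n), dtype=np.float64)
    for i in range(m):
        for j in range(n):
            window = np.sort(chan[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2].ravel())
            k = window.size                      # 9 in the interior, 4 or 6 at borders
            if k % 2:
                out[i, j] = window[k // 2]       # odd count: the middle value
            else:
                out[i, j] = (float(window[k // 2 - 1]) + float(window[k // 2])) / 2.0
    return out

def median_denoise_rgb(img):
    """Apply the per-channel median filter to an RGB image X_src, giving X_Den."""
    return np.dstack([median_denoise_channel(img[:, :, c]) for c in range(3)])
```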
The illumination compensation of the denoised monitoring image proceeds as follows: let the denoised monitoring image be X_Den; since X_Den is a color RGB image, it has three RGB components, and illumination compensation is carried out separately on each component X_Den′; the resulting components X_cpst′ are then combined into the color RGB image X_cpst, which is X_Den after illumination compensation. The steps for each component X_Den′ are: first, let X_Den′ have m rows and n columns; construct X_Densum and Num_Den as m × n matrices with all initial values 0; the window size l is computed from min(m, n) (the minimum of m and n) by taking the integer part, and the step length s is computed from sqrt(l), the square root of l, by taking the integer part; if l < 1 then l = 1. Second, let the top-left coordinate of X_Den be (1, 1); starting from coordinate (1, 1), the candidate frames, each being the area bounded by [(a, b), (a + l, b + l)], are determined according to the window size l and the step length s; for the image matrix of X_Den′ corresponding to each candidate frame region, histogram equalization is performed, giving the equalized image matrix X_Den″ of the candidate region [(a, b), (a + l, b + l)]; then, for each element of X_Densum in the corresponding region [(a, b), (a + l, b + l)], X_Densum(a + iXsum, b + jXsum) = X_Densum(a + iXsum, b + jXsum) + X_Den″(iXsum, jXsum) is computed, where (iXsum, jXsum) are integers with 1 ≤ iXsum ≤ l and 1 ≤ jXsum ≤ l, and each element of Num_Den in the region [(a, b), (a + l, b + l)] is incremented by 1. Finally, X_cpst′(iXsumNum, jXsumNum) = X_Densum(iXsumNum, jXsumNum) / Num_Den(iXsumNum, jXsumNum) is computed for every point (iXsumNum, jXsumNum) of X_Den, giving X_cpst′, the present component X_Den′ after illumination compensation.
The determination of the candidate frames according to the window size l and the step length s proceeds as follows:
Let the monitoring image have m rows and n columns, let (a, b) be the top-left coordinate of the selected region and (a + l, b + l) the bottom-right coordinate of the selected region, the region being denoted [(a, b), (a + l, b + l)]; the initial value of (a, b) is (1, 1).
While a + l ≤ m:
    b = 1
    While b + l ≤ n:
        the selected region is [(a, b), (a + l, b + l)]
        b = b + s
    end of inner loop
    a = a + s
end of outer loop
Every region [(a, b), (a + l, b + l)] selected in this process is a candidate frame.
The histogram equalization of the image matrix of X_Den′ corresponding to a candidate frame region proceeds as follows: let the candidate frame region be the area bounded by [(a, b), (a + l, b + l)], and let X_Den″ be the image information of X_Den′ within the region [(a, b), (a + l, b + l)]. The steps are: first, construct the vector I, where I(iI) is the number of pixels of X_Den″ whose value equals iI, 0 ≤ iI ≤ 255; second, compute from I the cumulative mapping vector I′, whose entries are the cumulative pixel counts normalized to the range 0–255; third, for each point (iXDen, jXDen) of X_Den″ with pixel value X_Den″(iXDen, jXDen), compute X_Den″(iXDen, jXDen) = I′(X_Den″(iXDen, jXDen)). After the values of all pixels of the image X_Den″ have been computed and changed, the histogram equalization process ends, and X_Den″ holds the result of the histogram equalization.
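A numpy sketch of the sliding-window illumination compensation described above (window size l, step length s). The exact normalization of the cumulative mapping and the treatment of pixels never covered by any window are assumptions, since the corresponding formulas were lost from the original text:

```python
import numpy as np

def equalize_block(block):
    """Histogram-equalize one l x l block with values in 0..255: build the
    histogram I, turn it into a cumulative 0..255 mapping I', remap every pixel."""
    hist = np.bincount(block.ravel().astype(np.int64), minlength=256)
    cdf = np.cumsum(hist)
    mapping = np.floor(cdf * 255.0 / max(cdf[-1], 1))   # assumed normalization
    return mapping[block.astype(np.int64)]

def illumination_compensate_channel(chan, l, s):
    """Equalize every l x l candidate frame stepped by s, accumulate the results
    in X_Densum, count contributions in Num_Den, and average at the end."""
    m, n = chan.shape
    dsum = np.zeros((m, n), dtype=np.float64)
    cnt = np.zeros((m, n), dtype=np.float64)
    a = 0
    while a + l <= m:
        b = 0
        while b + l <= n:
            dsum[a:a + l, b:b + l] += equalize_block(chan[a:a + l, b:b + l])
            cnt[a:a + l, b:b + l] += 1.0
            b += s
        a += s
    uncovered = cnt == 0
    cnt[uncovered] = 1.0
    out = dsum / cnt
    out[uncovered] = chan[uncovered]   # assumption: border pixels never covered keep their value
    return out
```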
The image enhancement of the illumination-compensated image proceeds as follows: let the illumination-compensated image be X_cpst, with RGB channels X_cpstR, X_cpstG and X_cpstB, and let X_enh be the image obtained from X_cpst after image enhancement. The steps are: first, for each of the channels X_cpstR, X_cpstG and X_cpstB, compute the image blurred at the specified scale; second, construct matrices LX_enhR, LX_enhG and LX_enhB with the same dimensions as X_cpstR; for the R channel of the RGB channels of image X_cpst, compute LX_enhR(i, j) = log(X_cpstR(i, j)) − LX_cpstR(i, j), where (i, j) ranges over all points of the image matrix and LX_cpstR is computed from the blurred R channel obtained in the first step; the G and B channels of X_cpst are processed with the same algorithm to obtain LX_enhG and LX_enhB; third, for the R channel, compute the mean MeanR and the standard deviation VarR (note: the standard deviation) of all values of LX_enhR, compute MinR = MeanR − 2 × VarR and MaxR = MeanR + 2 × VarR, and then compute X_enhR(i, j) = Fix((LX_enhR(i, j) − MinR)/(MaxR − MinR) × 255), where Fix denotes taking the integer part, values < 0 are set to 0 and values > 255 are set to 255; the G and B channels are processed with the same algorithm to obtain X_enhG and X_enhB, and the channels X_enhR, X_enhG and X_enhB are combined into the color image X_enh.
The computation, for each of the channels X_cpstR, X_cpstG and X_cpstB, of the image blurred at the specified scale (shown for the R channel X_cpstR of the RGB channels) proceeds as follows: first, define the Gaussian function G(x, y, σ) = k × exp(−(x² + y²)/σ²), where σ is the scale parameter and k = 1/∫∫G(x, y) dx dy; then, for each point X_cpstR(i, j), compute the convolution of X_cpstR with G(x, y, σ) at that point (⊛ denoting the convolution operation) and take Fix() of the result, where Fix() denotes taking the integer part, values < 0 are set to 0 and values > 255 are set to 255; for points whose distance from the boundary is less than the scale σ, only the convolution of X_cpstR with the part of G(x, y, σ) that overlaps the image is computed. The G and B channels of the RGB channels are updated with the same algorithm to obtain the blurred X_cpstG and X_cpstB.
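A numpy/scipy sketch of this single-scale, Retinex-style enhancement (the blur stands in for the G(x, y, σ) convolution; the scale value and the use of log on the blurred channel are assumptions, since the exact lost formula is not recoverable):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def enhance_channel(chan, sigma=80.0):
    """Blur the channel at scale sigma, take log(channel) - log(blurred),
    then stretch by mean +/- 2 * std and clip to 0..255, as in the three steps above."""
    chan = chan.astype(np.float64) + 1.0           # avoid log(0)
    blurred = gaussian_filter(chan, sigma=sigma)   # stands in for the G(x, y, sigma) convolution
    lx = np.log(chan) - np.log(blurred)
    lo = lx.mean() - 2.0 * lx.std()
    hi = lx.mean() + 2.0 * lx.std()
    out = (lx - lo) / max(hi - lo, 1e-12) * 255.0
    return np.clip(np.floor(out), 0, 255).astype(np.uint8)

def enhance_rgb(img, sigma=80.0):
    """Apply the enhancement to each RGB channel of X_cpst to obtain X_enh."""
    return np.dstack([enhance_channel(img[:, :, c], sigma) for c in range(3)])
```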
The target detection module, during initialization, initializes the parameters of the target detection algorithm using images in which the body region, face region, hand region and product region have been calibrated manually. During detection it receives the images passed by the image preprocessing module and processes them: target detection is applied to each frame with the target detection algorithm to obtain the body region, face region, hand region and product region of the current image; the hand region and product region are then sent to the shopping action recognition module, the body region and face region are sent to the individual identification module, and the product region is passed to the product identification module.
The parameter initialization of the target detection algorithm using images with manually calibrated body, face, hand and product regions proceeds as follows: first, construct the feature extraction depth network; second, construct the region selection network; third, for each image X in the database used to construct the feature extraction depth network and each manually calibrated region of that image (given by its centre coordinates, half-height and half-width), pass the image X and the region through the ROI layer, whose input is the image X and the region and whose output has dimension 7 × 7 × 512; fourth, construct the coordinate refinement network.
The feature extraction depth network is constructed as a deep-learning network with the following structure:
Layer 1: convolutional layer, input 768 × 1024 × 3, output 768 × 1024 × 64, channels = 64;
Layer 2: convolutional layer, input 768 × 1024 × 64, output 768 × 1024 × 64, channels = 64;
Layer 3: pooling layer, input = the layer-1 output 768 × 1024 × 64 concatenated with the layer-2 output 768 × 1024 × 64 along the third dimension, output 384 × 512 × 128;
Layer 4: convolutional layer, input 384 × 512 × 128, output 384 × 512 × 128, channels = 128;
Layer 5: convolutional layer, input 384 × 512 × 128, output 384 × 512 × 128, channels = 128;
Layer 6: pooling layer, input = the layer-4 output 384 × 512 × 128 concatenated with the layer-5 output 384 × 512 × 128 along the third dimension, output 192 × 256 × 256;
Layer 7: convolutional layer, input 192 × 256 × 256, output 192 × 256 × 256, channels = 256;
Layer 8: convolutional layer, input 192 × 256 × 256, output 192 × 256 × 256, channels = 256;
Layer 9: convolutional layer, input 192 × 256 × 256, output 192 × 256 × 256, channels = 256;
Layer 10: pooling layer, input = the layer-7 output 192 × 256 × 256 concatenated with the layer-9 output 192 × 256 × 256 along the third dimension, output 96 × 128 × 512;
Layer 11: convolutional layer, input 96 × 128 × 512, output 96 × 128 × 512, channels = 512;
Layer 12: convolutional layer, input 96 × 128 × 512, output 96 × 128 × 512, channels = 512;
Layer 13: convolutional layer, input 96 × 128 × 512, output 96 × 128 × 512, channels = 512;
Layer 14: pooling layer, input = the layer-11 output 96 × 128 × 512 concatenated with the layer-13 output 96 × 128 × 512 along the third dimension, output 48 × 64 × 1024;
Layer 15: convolutional layer, input 48 × 64 × 1024, output 48 × 64 × 512, channels = 512;
Layer 16: convolutional layer, input 48 × 64 × 512, output 48 × 64 × 512, channels = 512;
Layer 17: convolutional layer, input 48 × 64 × 512, output 48 × 64 × 512, channels = 512;
Layer 18: pooling layer, input = the layer-15 output 48 × 64 × 512 concatenated with the layer-17 output 48 × 64 × 512 along the third dimension, output 48 × 64 × 1024;
Layer 19: convolutional layer, input 48 × 64 × 1024, output 48 × 64 × 256, channels = 256;
Layer 20: pooling layer, input 48 × 64 × 256, output 24 × 32 × 256;
Layer 21: convolutional layer, input 24 × 32 × 256, output 24 × 32 × 256, channels = 256;
Layer 22: pooling layer, input 24 × 32 × 256, output 12 × 16 × 256;
Layer 23: convolutional layer, input 12 × 16 × 256, output 12 × 16 × 128, channels = 128;
Layer 24: pooling layer, input 12 × 16 × 128, output 6 × 8 × 128;
Layer 25: fully connected layer, the 6 × 8 × 128 input is first flattened into a 6144-dimensional vector and then fed to the fully connected layer, output vector length 768, relu activation;
Layer 26: fully connected layer, input vector length 768, output vector length 96, relu activation;
Layer 27: fully connected layer, input vector length 96, output vector length 2, soft-max activation.
All convolutional layers have convolution kernel size kernel = 3, stride = (1, 1) and relu activation; all pooling layers are max-pooling layers with pooling kernel size kernel_size = 2 and stride = (2, 2).
Let this network be Fconv27; for a color image X, the feature map set obtained from the network is denoted Fconv27(X). The evaluation function of the network is the cross-entropy loss computed on (Fconv27(X) − y), minimized, where y is the class corresponding to the input. The database consists of images collected in the natural world containing passers-by and images not containing passers-by, each a color image of dimension 768 × 1024, divided into two classes according to whether a pedestrian is present; the number of iterations is 2000. After training, layers 1 to 17 are taken as the feature extraction depth network Fconv, and Fconv(X) denotes the output obtained from this network for a color image X.
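A condensed PyTorch sketch of layers 1–17, the part retained as Fconv. The block and class names are illustrative, the "concatenate two conv outputs, then max-pool" pattern follows the layer list above, and the 27-layer classifier used only for pre-training is omitted:

```python
import torch
import torch.nn as nn

class ConcatPoolBlock(nn.Module):
    """Two or three 3x3 convs with relu; the pooling layer takes the first and the
    last conv outputs concatenated along the channel dimension, then 2x2 max-pools."""
    def __init__(self, in_ch, out_ch, n_convs=2):
        super().__init__()
        convs = [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)]
        convs += [nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
                  for _ in range(n_convs - 1)]
        self.convs = nn.ModuleList(convs)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        outs = []
        for conv in self.convs:
            x = self.relu(conv(x))
            outs.append(x)
        return self.pool(torch.cat([outs[0], outs[-1]], dim=1))

class FconvBackbone(nn.Module):
    """Layers 1-17: for a 3x768x1024 input this produces 512 feature maps of 48x64."""
    def __init__(self):
        super().__init__()
        self.block1 = ConcatPoolBlock(3, 64, n_convs=2)     # -> 128 x 384 x 512
        self.block2 = ConcatPoolBlock(128, 128, n_convs=2)  # -> 256 x 192 x 256
        self.block3 = ConcatPoolBlock(256, 256, n_convs=3)  # -> 512 x 96 x 128
        self.block4 = ConcatPoolBlock(512, 512, n_convs=3)  # -> 1024 x 48 x 64
        self.conv15 = nn.Conv2d(1024, 512, kernel_size=3, stride=1, padding=1)
        self.conv16 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)
        self.conv17 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.block4(self.block3(self.block2(self.block1(x))))
        x = self.relu(self.conv15(x))
        x = self.relu(self.conv16(x))
        return self.relu(self.conv17(x))                    # 512 x 48 x 64
```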
The region selection network is constructed as follows. It receives the 512 feature maps of dimension 48 × 64, Fconv(X), extracted by the depth network Fconv. First, a convolutional layer produces Conv1(Fconv(X)); the parameters of this layer are: convolution kernel size kernel = 1, stride = (1, 1), input 48 × 64 × 512, output 48 × 64 × 512, channels = 512. Conv1(Fconv(X)) is then input separately into two convolutional layers, Conv2-1 and Conv2-2. The structure of Conv2-1 is: input 48 × 64 × 512, output 48 × 64 × 18, channels = 18; the output of this layer is Conv2-1(Conv1(Fconv(X))), to which the activation function softmax is applied to obtain softmax(Conv2-1(Conv1(Fconv(X)))). The structure of Conv2-2 is: input 48 × 64 × 512, output 48 × 64 × 36, channels = 36. The network has two loss functions: the first error function loss1 computes the softmax error of Wshad-cls(X) ⊙ (Conv2-1(Conv1(Fconv(X))) − Wcls(X)), and the second error function loss2 computes the smooth L1 error of Wshad-reg(X) ⊙ (Conv2-2(Conv1(Fconv(X))) − Wreg(X)); the loss function of the region selection network = loss1/sum(Wcls(X)) + loss2/sum(Wcls(X)), where sum() denotes the sum of all elements of a matrix, and the convergence direction is minimization. Wcls(X) and Wreg(X) are the positive and negative sample information corresponding to database image X, ⊙ denotes element-wise (position-wise) multiplication, and Wshad-cls(X) and Wshad-reg(X) are masks whose role is to select for training the entries of weight 1, so as to avoid an excessive gap between the numbers of positive and negative samples; Wshad-cls(X) and Wshad-reg(X) are regenerated at every iteration, and the algorithm iterates 1000 times.
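A PyTorch sketch of this region selection head. The 1 × 1 kernels of Conv2-1/Conv2-2 and the application of softmax per anchor score pair are assumptions (the original gives only the channel counts); names are illustrative:

```python
import torch
import torch.nn as nn

class RegionSelectionHead(nn.Module):
    """Conv1 (1x1, 512 ch) on the 512x48x64 feature maps, then two parallel convs:
    Conv2-1 -> 18 channels (9 anchors x 2 scores), Conv2-2 -> 36 channels (9 x 4 offsets)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(512, 512, kernel_size=1, stride=1)
        self.conv2_cls = nn.Conv2d(512, 18, kernel_size=1)
        self.conv2_reg = nn.Conv2d(512, 36, kernel_size=1)

    def forward(self, fconv_x):
        h = self.conv1(fconv_x)
        cls = self.conv2_cls(h)                                   # (B, 18, 48, 64)
        b, _, hh, ww = cls.shape
        # assumed layout: channels 2i-1, 2i hold the (positive, negative) pair of anchor i
        cls = torch.softmax(cls.view(b, 9, 2, hh, ww), dim=2).view(b, 18, hh, ww)
        reg = self.conv2_reg(h)                                   # (B, 36, 48, 64)
        return cls, reg
```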
The database used to construct the feature extraction depth network is prepared as follows. For each image in the database: first, every body region, face region, hand region and product region is calibrated manually; if the centre coordinate of a region in the input image is (a_bas_tr, b_bas_tr), the distance from the centre to the top and bottom edges is l_bas_tr and the distance from the centre to the left and right edges is w_bas_tr, then its corresponding position on Conv1 has the centre coordinate, half-height and half-width obtained by dividing these values by 16 and taking the integer part (16 being the down-sampling factor between the input image and Conv1). Second, positive and negative samples are generated at random.
The random generation of positive and negative samples proceeds as follows: first, construct 9 region frames; second, for each image X_tr in the database, let Wcls be a matrix of dimension 48 × 64 × 18 and Wreg a matrix of dimension 48 × 64 × 36, both with all initial values 0, and fill Wcls and Wreg.
The 9 region frames are: Ro1(xRo, yRo) = (xRo, yRo, 64, 64), Ro2(xRo, yRo) = (xRo, yRo, 45, 90), Ro3(xRo, yRo) = (xRo, yRo, 90, 45), Ro4(xRo, yRo) = (xRo, yRo, 128, 128), Ro5(xRo, yRo) = (xRo, yRo, 90, 180), Ro6(xRo, yRo) = (xRo, yRo, 180, 90), Ro7(xRo, yRo) = (xRo, yRo, 256, 256), Ro8(xRo, yRo) = (xRo, yRo, 360, 180), Ro9(xRo, yRo) = (xRo, yRo, 180, 360). For each region frame Roi(xRo, yRo), (xRo, yRo) is the centre coordinate of the current frame, the third element is the pixel distance from the centre to the top and bottom edges, the fourth element is the pixel distance from the centre to the left and right edges, and i takes values from 1 to 9.
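A small Python sketch of these nine anchor frames (names are illustrative):

```python
# The nine region frames Ro1..Ro9 as (half_height, half_width) in pixels,
# placed at every grid position (xRo, yRo).
ANCHORS = [
    (64, 64), (45, 90), (90, 45),
    (128, 128), (90, 180), (180, 90),
    (256, 256), (360, 180), (180, 360),
]

def anchor_box(i, x_ro, y_ro):
    """Return the i-th region frame Ro_i centred at (xRo, yRo) as
    (centre_x, centre_y, half_height, half_width); i is 0-based here."""
    half_h, half_w = ANCHORS[i]
    return (x_ro, y_ro, half_h, half_w)
```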
The filling of Wcls and Wreg proceeds as follows:
For each manually calibrated body region, let its centre coordinate in the input image be (a_bas_tr, b_bas_tr), the distance from the centre to the top and bottom edges be l_bas_tr and the distance from the centre to the left and right edges be w_bas_tr; its corresponding position on Conv1 then has the centre coordinate, half-height and half-width obtained by dividing these values by 16 and taking the integer part.
For each point (xCtr, yCtr) in the rectangle on Conv1 bounded by the top-left corner (mapped centre minus mapped half-height and half-width) and the bottom-right corner (mapped centre plus mapped half-height and half-width):
    For i taking values from 1 to 9:
        For the point (xCtr, yCtr), its mapping range in the database image is the 16 × 16 block bounded by the top-left corner (16(xCtr − 1) + 1, 16(yCtr − 1) + 1) and the bottom-right corner (16xCtr, 16yCtr); for each point (xOtr, yOtr) in this block:
            compute the coincidence rate between the region Roi(xOtr, yOtr) and the currently manually calibrated region;
        Select the point (xIoUMax, yIoUMax) with the highest coincidence rate in the current 16 × 16 block. If the coincidence rate > 0.7, then Wcls(xCtr, yCtr, 2i−1) = 1 and Wcls(xCtr, yCtr, 2i) = 0 (the point is a positive sample), and Wreg(xCtr, yCtr, 4i−3) = (xOtr − 16xCtr + 8)/8, Wreg(xCtr, yCtr, 4i−2) = (yOtr − 16yCtr + 8)/8, Wreg(xCtr, yCtr, 4i−1) = Down1(l_bas_tr / third element of Roi), Wreg(xCtr, yCtr, 4i) = Down1(w_bas_tr / fourth element of Roi), where Down1() sets the value to 1 if it is greater than 1; if the coincidence rate < 0.3, then Wcls(xCtr, yCtr, 2i−1) = 0 and Wcls(xCtr, yCtr, 2i) = 1; otherwise Wcls(xCtr, yCtr, 2i−1) = −1 and Wcls(xCtr, yCtr, 2i) = −1.
If the current manually calibrated human region has no Roi(xOtr, yOtr) with coincidence rate > 0.6, the Roi(xOtr, yOtr) with the highest coincidence rate is used to assign Wcls and Wreg, with the same assignment method as for coincidence rate > 0.7.
The coincidence rate between the region Roi(xOtr, yOtr) corresponding to (xOtr, yOtr) and the manually calibrated region is computed as follows: let the manually calibrated body region have centre coordinate (a_bas_tr, b_bas_tr) in the input image, distance l_bas_tr from the centre to the top and bottom edges and distance w_bas_tr from the centre to the left and right edges, and let the third element of Roi(xOtr, yOtr) be lOtr and the fourth element be wOtr. If |xOtr − a_bas_tr| ≤ lOtr + l_bas_tr − 1 and |yOtr − b_bas_tr| ≤ wOtr + w_bas_tr − 1, the regions overlap, and the overlap area = (lOtr + l_bas_tr − 1 − |xOtr − a_bas_tr|) × (wOtr + w_bas_tr − 1 − |yOtr − b_bas_tr|); otherwise the overlap area = 0. The total area = (2lOtr − 1) × (2wOtr − 1) + (2l_bas_tr − 1) × (2w_bas_tr − 1) − overlap area, giving coincidence rate = overlap area / total area, where | · | denotes the absolute value.
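A direct Python transcription of this coincidence-rate formula for two frames given as (centre_x, centre_y, half_height, half_width); the function name is illustrative:

```python
def coincidence_rate(box_a, box_b):
    """Overlap area divided by the sum of both areas minus the overlap,
    following the formula above."""
    xa, ya, la, wa = box_a
    xb, yb, lb, wb = box_b
    if abs(xa - xb) <= la + lb - 1 and abs(ya - yb) <= wa + wb - 1:
        overlap = (la + lb - 1 - abs(xa - xb)) * (wa + wb - 1 - abs(ya - yb))
    else:
        overlap = 0
    whole = (2 * la - 1) * (2 * wa - 1) + (2 * lb - 1) * (2 * wb - 1) - overlap
    return overlap / whole
```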
Wshad-cls(X) and Wshad-reg(X) are constructed as follows. For an image X whose positive and negative sample information is Wcls(X) and Wreg(X): first, construct Wshad-cls(X) and Wshad-reg(X), where Wshad-cls(X) has the same dimensions as Wcls(X) and Wshad-reg(X) has the same dimensions as Wreg(X). Second, record the information of all positive samples: for i = 1 to 9, if Wcls(X)(a, b, 2i−1) = 1, then Wshad-cls(X)(a, b, 2i−1) = 1, Wshad-cls(X)(a, b, 2i) = 1, Wshad-reg(X)(a, b, 4i−3) = 1, Wshad-reg(X)(a, b, 4i−2) = 1, Wshad-reg(X)(a, b, 4i−1) = 1 and Wshad-reg(X)(a, b, 4i) = 1; in total sum(Wshad-cls(X)) positive samples are selected, where sum() denotes summing all elements of a matrix; if sum(Wshad-cls(X)) > 256, 256 positive samples are kept at random. Third, select negative samples at random: randomly choose (a, b, i); if Wcls(X)(a, b, 2i) = 1, then Wshad-cls(X)(a, b, 2i−1) = 1, Wshad-cls(X)(a, b, 2i) = 1, Wshad-reg(X)(a, b, 4i−3) = 1, Wshad-reg(X)(a, b, 4i−2) = 1, Wshad-reg(X)(a, b, 4i−1) = 1 and Wshad-reg(X)(a, b, 4i) = 1; the number of negative samples chosen is 256 − sum(Wshad-cls(X)); if there are not enough negative samples to reach 256 − sum(Wshad-cls(X)) and 20 consecutive random draws of (a, b, i) fail to produce a negative sample, the algorithm terminates.
The ROI layer takes as input an image X and a region given by its centre coordinates, half-height and half-width. Method: for image X, the output Fconv(X) of the feature extraction depth network Fconv has dimension 48 × 64 × 512; for each of the 512 matrices V_ROI_I of dimension 48 × 64, the sub-matrix of V_ROI_I bounded by the top-left and bottom-right corners obtained by mapping the region onto the 48 × 64 grid (dividing the coordinates by 16 and taking the integer part) is extracted; the output roi_I(X) has dimension 7 × 7, and the extracted sub-matrix is divided into a 7 × 7 grid of sub-intervals with the corresponding step lengths:
For iROI = 1 to 7:
    For jROI = 1 to 7:
        construct the sub-interval of the extracted region corresponding to position (iROI, jROI);
        roi_I(X)(iROI, jROI) = the value of the maximum point within that sub-interval.
When all 512 matrices of dimension 48 × 64 have been processed, the 512 outputs are concatenated into an output of dimension 7 × 7 × 512, which represents the features of image X within the given region frame.
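A numpy sketch of this ROI max-pooling onto a fixed 7 × 7 grid. The handling of fractional sub-intervals and of regions smaller than 7 cells is an assumption, since the original step-length formulas were lost:

```python
import numpy as np

def roi_max_pool(feature_maps, top_left, bottom_right, out_size=7):
    """feature_maps: (48, 64, 512); top_left=(r0, c0), bottom_right=(r1, c1) on the
    feature-map grid. The region is split into out_size x out_size sub-intervals and
    max-pooled per channel, yielding a (7, 7, 512) output."""
    r0, c0 = top_left
    r1, c1 = bottom_right
    ch = feature_maps.shape[2]
    out = np.zeros((out_size, out_size, ch), dtype=feature_maps.dtype)
    rows = np.linspace(r0, r1 + 1, out_size + 1).astype(int)
    cols = np.linspace(c0, c1 + 1, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            rs, re = rows[i], max(rows[i + 1], rows[i] + 1)
            cs, ce = cols[j], max(cols[j + 1], cols[j] + 1)
            out[i, j, :] = feature_maps[rs:re, cs:ce, :].max(axis=(0, 1))
    return out
```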
The coordinate refinement network is built as follows. First, extend the database: for each image X in the database and each manually calibrated region, the corresponding ROI is computed; if the current region is a body region, then BClass = [1, 0, 0, 0, 0] and BBox = [0, 0, 0, 0]; if it is a face region, then BClass = [0, 1, 0, 0, 0] and BBox = [0, 0, 0, 0]; if it is a hand region, then BClass = [0, 0, 1, 0, 0] and BBox = [0, 0, 0, 0]; if it is a product region, then BClass = [0, 0, 0, 1, 0] and BBox = [0, 0, 0, 0]. Random numbers a_rand, b_rand, l_rand, w_rand with values between −1 and 1 are generated to obtain a new region (a perturbation of the calibrated region determined by these random numbers, taking the integer part of the resulting coordinates), for which BBox = [a_rand, b_rand, l_rand, w_rand]; if the new region has coincidence rate > 0.7 with the calibrated region, its BClass is that of the current region; if the coincidence rate < 0.3, BClass = [0, 0, 0, 0, 1]; if neither condition holds, no assignment is made. Each region generates at most 10 positive sample regions; if Num1 positive sample regions are generated, Num1 + 1 negative sample regions are generated; if there are not enough negative sample regions, the ranges of a_rand, b_rand, l_rand, w_rand are enlarged until enough negative samples are found. Second, build the coordinate refinement network: for each image X in the database and each manually calibrated region, the corresponding ROI of dimension 7 × 7 × 512 is flattened into a 25088-dimensional vector and passed through two fully connected layers Fc2, giving the output Fc2(ROI); Fc2(ROI) is then passed through the classification layer FClass and the region fine-tuning layer FBBox, giving the outputs FClass(Fc2(ROI)) and FBBox(Fc2(ROI)); the classification layer FClass is a fully connected layer with input vector length 512 and output vector length 5, and the region fine-tuning layer FBBox is a fully connected layer with input vector length 512 and output vector length 4. The network has two loss functions: the first error function loss1 computes the softmax error of FClass(Fc2(ROI)) − BClass, and the second error function loss2 computes the Euclidean distance error of (FBBox(Fc2(ROI)) − BBox); the overall loss function of the refinement network = loss1 + loss2. The training procedure first iterates 1000 times to converge the error function loss2, then iterates 1000 times to converge the overall loss function.
The two fully connected layers Fc2 have the structure: first layer: fully connected layer, input vector length 25088, output vector length 4096, relu activation; second layer: fully connected layer, input vector length 4096, output vector length 512, relu activation.
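A PyTorch sketch of the refinement head described above (Fc2 followed by the FClass and FBBox layers); the class name is illustrative:

```python
import torch.nn as nn

class CoordinateRefinementHead(nn.Module):
    """Flatten the 7x7x512 ROI to 25088 values, pass it through Fc2
    (25088 -> 4096 -> 512, relu), then through FClass (512 -> 5 classes)
    and FBBox (512 -> 4 box offsets)."""
    def __init__(self):
        super().__init__()
        self.fc2 = nn.Sequential(
            nn.Linear(25088, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 512), nn.ReLU(inplace=True),
        )
        self.fclass = nn.Linear(512, 5)
        self.fbbox = nn.Linear(512, 4)

    def forward(self, roi):
        h = self.fc2(roi.flatten(start_dim=1))   # (batch, 25088) -> (batch, 512)
        return self.fclass(h), self.fbbox(h)
```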
The target detection applied to each frame image with the target detection algorithm, which obtains the body region, face region, hand region and product region of the current image, proceeds as follows:
First step: divide the input image X_cpst into sub-images of dimension 768 × 1024.
Second step: for each sub-image Xs:
Step 2.1: transform Xs with the feature extraction depth network Fconv constructed during initialization to obtain the 512 feature sub-maps Fconv(Xs);
Step 2.2: apply to Fconv(Xs) the first layer Conv1 of the region selection network, the second layer Conv2-1 followed by the softmax activation function, and Conv2-2, obtaining the outputs softmax(Conv2-1(Conv1(Fconv(Xs)))) and Conv2-2(Conv1(Fconv(Xs))), and then obtain all preliminary candidate intervals of the sub-image from these output values;
Step 2.3: for all preliminary candidate intervals of all sub-images of the current frame image:
Step 2.3.1: rank them by the score of the current candidate region and keep the 50 preliminary candidate intervals with the highest scores as candidate regions;
Step 2.3.2: adjust all out-of-bounds candidate intervals in the candidate interval set, then weed out the overlapping frames among the candidate intervals to obtain the final candidate intervals;
Step 2.3.3: input the sub-image Xs and each final candidate interval into the ROI layer to obtain the corresponding ROI output; if the current final candidate interval is (aBB(1), bBB(2), lBB(3), wBB(4)), compute FBBox(Fc2(ROI)) to obtain the four outputs (OutBB(1), OutBB(2), OutBB(3), OutBB(4)) and hence the updated coordinates (aBB(1) + 8 × OutBB(1), bBB(2) + 8 × OutBB(2), lBB(3) + 8 × OutBB(3), wBB(4) + 8 × OutBB(4)); then compute the output of FClass(Fc2(ROI)): if its first element is the largest, the current interval is a body region; if the second element is the largest, it is a face region; if the third element is the largest, it is a hand region; if the fourth element is the largest, it is a product region; and if the fifth element is the largest, the current interval is a negative sample region and the final candidate interval is deleted.
Third step: update the coordinates of the refined final candidate intervals of all sub-images; the update is: let the coordinates of the current candidate region be (TLx, TLy, RBx, RBy) and the top-left coordinate of the corresponding sub-image be (Seasub, Sebsub); the updated coordinates are (TLx + Seasub − 1, TLy + Sebsub − 1, RBx, RBy).
The division of the input image X_cpst into sub-images of dimension 768 × 1024 proceeds as follows: let the division step lengths be 384 and 512, let the input image have m rows and n columns, and let (asub, bsub) be the top-left coordinate of the selected region, with initial value (1, 1).
While asub < m:
    bsub = 1
    While bsub < n:
        the selected region is [(asub, bsub), (asub + 384, bsub + 512)]; the information of the image region of X_cpst corresponding to this interval is copied into a new sub-image, and the top-left coordinate (asub, bsub) is attached as location information; if the selected region extends beyond the bounds of the input image X_cpst, the RGB pixel values of the pixels beyond the range are set to 0;
        bsub = bsub + 512
    end of inner loop
    asub = asub + 384
end of outer loop
All preliminary candidate intervals of the sub-image are obtained from the output values as follows: the output of softmax(Conv2-1(Conv1(Fconv(Xs)))) has dimension 48 × 64 × 18 and the output of Conv2-2(Conv1(Fconv(Xs))) has dimension 48 × 64 × 36; for any point (x, y) of the 48 × 64 grid, let II be the 18-dimensional vector of softmax(Conv2-1(Conv1(Fconv(Xs)))) at (x, y) and IIII the 36-dimensional vector of Conv2-2(Conv1(Fconv(Xs))) at (x, y); for i taking values from 1 to 9, if II(2i−1) > II(2i), then with lOtr the third element and wOtr the fourth element of Roi(xOtr, yOtr), a preliminary candidate interval [II(2i−1), (8 × IIII(4i−3) + x, 8 × IIII(4i−2) + y, lOtr × IIII(4i−1), wOtr × IIII(4i))] is generated, where the first element II(2i−1) is the score of the current candidate region and the second element indicates that the centre point of the current candidate interval is (8 × IIII(4i−3) + x, 8 × IIII(4i−2) + y) and that the half-height and half-width of the candidate frame are lOtr × IIII(4i−1) and wOtr × IIII(4i) respectively.
The adjustment of all out-of-bounds candidate intervals in the candidate interval set proceeds as follows: let the monitoring image have m rows and n columns; for each candidate interval with centre point (ach, bch) and half-height and half-width lch and wch, if ach + lch > m, the centre ach and half-height lch are recomputed so that the frame is clipped to the lower image boundary and are updated to a′ch and l′ch; likewise, if bch + wch > n, the centre bch and half-width wch are recomputed so that the frame is clipped to the right image boundary and are updated to b′ch and w′ch.
The overlapping frames among the candidate intervals are weeded out as follows:
While the candidate interval set is not empty:
    take the candidate interval iout with the highest score out of the candidate interval set;
    compute the coincidence rate between candidate interval iout and each candidate interval ic in the candidate interval set; if the coincidence rate > 0.7, delete candidate interval ic from the candidate interval set;
    put candidate interval iout into the output candidate interval set.
When the candidate interval set is empty, the candidate intervals contained in the output candidate interval set are the candidate interval set obtained after weeding out the overlapping frames.
The coincidence rate between candidate interval iout and each candidate interval ic in the candidate interval set is computed as follows: let candidate interval ic have centre point (aic, bic) and half-height and half-width lic and wic, and candidate interval iout have centre point (aiout, biout) and half-height and half-width liout and wiout. Compute xA = max(aic, aiout), yA = max(bic, biout), xB = min(lic, liout), yB = min(wic, wiout). If |aic − aiout| ≤ lic + liout − 1 and |bic − biout| ≤ wic + wiout − 1, the two intervals overlap and the overlap area = (lic + liout − 1 − |aic − aiout|) × (wic + wiout − 1 − |bic − biout|); otherwise the overlap area = 0. The total area = (2lic − 1) × (2wic − 1) + (2liout − 1) × (2wiout − 1) − overlap area, giving coincidence rate = overlap area / total area.
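A short Python sketch of this weeding-out (non-maximum-suppression-style) step, reusing the coincidence_rate() function sketched earlier; names are illustrative:

```python
def weed_out_overlaps(candidates, threshold=0.7):
    """Repeatedly take the remaining candidate with the highest score, drop every
    remaining candidate whose coincidence rate with it exceeds the threshold, and
    keep the taken one. Each candidate is
    (score, (centre_x, centre_y, half_height, half_width))."""
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        remaining = [c for c in remaining
                     if coincidence_rate(best[1], c[1]) <= threshold]
        kept.append(best)
    return kept
```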
The shopping action recognition module works as follows. During initialization, the static action recognition classifier is first initialized with standard hand action images, so that it can recognize the grasping and putting-down actions of the hand; the dynamic action recognition classifier is then initialized with hand action videos, so that it can recognize the hand taking out an item, putting back an item, taking out and then putting back, taking out without putting back, or suspected theft. During detection: first, each piece of received hand-region information is recognized with the static action recognition classifier; the recognition method is: let the image input each time be Handp1; the output StaticN(Handp1) is a 3-dimensional vector; if its first element is the largest, the action is recognized as grasping, if the second element is the largest, as putting down, and if the third element is the largest, as other. Second, once a grasping action is recognized, target tracking is applied to the region corresponding to the current grasping action; when the recognition result of the static action recognition classifier for the tracking box of the current hand region in a following frame is a putting-down action, target tracking ends, and the frames from the recognized grasping action to the recognized putting-down action are taken as a video of the continuous hand action, which is marked as a complete video. If the track is lost during tracking, the frames from the recognized grasping action up to the image before the track was lost are taken as a video, giving a video containing only a grasping action, which is marked as such. If a putting-down action is recognized in an image that does not belong to any tracking box obtained by target tracking, the grasping action belonging to it has been lost; the hand region corresponding to the current image is then taken as the end of a video, the target tracking method is applied backwards starting from the current frame until the track is lost, the frame following the lost frame is taken as the start frame of the video, and the video is marked as a video containing only a putting-down action. Third, the complete videos obtained in the second step are recognized with the dynamic action recognition classifier; the recognition method is: let the input be Handv1; the output DynamicN(Handv1) is a 5-dimensional vector; if the first element is the largest, the action is recognized as taking out an item; if the second, as putting back an item; if the third, as taking out and putting back again; if the fourth, as having taken out an item without putting it back; and if the fifth, as a suspected theft action. The recognition result is then sent to the recognition result processing module; the videos containing only a grasping action and the videos containing only a putting-down action are sent to the recognition result processing module, and the complete videos and the videos containing only a grasping action are sent to the product identification module and the individual identification module.
The initialization of the static action recognition classifier with standard hand action images proceeds as follows. First, the video data are arranged: a large number of videos of people shopping in supermarkets is selected, containing the actions of taking out an item, putting back an item, taking out and putting back, taking out without putting back, and suspected theft; each video clip is cut manually, with the frame in which the hand touches the commodity as the start frame and the frame in which the hand leaves the commodity as the end frame; for each frame of a video, the hand region is extracted with the target detection module and each frame image of the hand region is scaled to a 256 × 256 color image; the scaled video is put into the hand action video set and labelled as one of taking out an item, putting back an item, taking out and putting back, taking out without putting back, or a suspected theft action. For every video whose class is taking out an item, putting back an item, taking out and putting back, or taking out without putting back, the first frame of the video is put into the hand action image set and labelled as a grasping action, the last frame of the video is put into the hand action image set and labelled as a putting-down action, and one frame chosen at random from the video other than the first and last frames is put into the hand action image set and labelled as other. This yields the hand action video set and the hand action image set. Second, the static action recognition classifier StaticN is constructed. Third, StaticN is initialized, the input being the hand action image set constructed in the first step; let the image input each time be Handp, the output be StaticN(Handp) and the class be yHandp, where the representation of yHandp is: grasping: yHandp = [1, 0, 0], putting down: yHandp = [0, 1, 0], other: yHandp = [0, 0, 1]; the evaluation function of the network is the cross-entropy loss computed on (StaticN(Handp) − yHandp), minimized; the number of iterations is 2000.
The construction static state action recognition classifier StaticN, network structure are as follows: first layer: convolutional layer inputs and is 256 × 256 × 3, exporting is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, input as 256 × 256 × 64, exporting is 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are connected in third dimension with third layer output 256 × 256 × 64, and exporting is 128 × 128 × 128;The Four layers: convolutional layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;5th Layer: convolutional layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: Pond layer inputs the 4th layer of output 128 × 128 × 128 and is connected in third dimension with layer 5 128 × 128 × 128, Output is 64 × 64 × 256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;9th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;Tenth layer: pond layer, input for layer 7 output 64 × 64 × 256 and the 9th layer 64 × 64 × 256 It is connected in third dimension, exporting is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, exports and is 32 × 32 × 512, port number channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 × 512, port number channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 × 512, port number channels=512;14th layer: pond layer inputs as eleventh floor output 32 × 32 × 512 and the 13rd Layer 32 × 32 × 512 is connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, exporting is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, input as 16 × 16 × 512, exporting is 16 × 16 × 512, port number channels=512;18th layer: pond layer is inputted and is exported for the 15th layer 16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th Layer: convolutional layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: Chi Hua Layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128;Second Floor 12: pond layer, inputting is 4 × 4 × 128, export as 2 × 2 × 128;23rd layer: the data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, so by full articulamentum first After input into full articulamentum, output vector length is 128, and activation primitive is relu activation primitive;24th layer: full connection Layer, input vector length are 128, and output vector length is 32, and activation primitive is relu activation primitive;25th layer: Quan Lian Layer is connect, input vector length is 32, and output vector length is 3, and activation primitive is soft-max activation primitive;All convolutional layers Parameter is size=3 convolution kernel kernel, and step-length stride=(1,1), activation primitive is relu activation 
primitive; all pooling layers are max-pooling layers, with pooling kernel size kernel_size=2 and stride=(2,2).
The initialization of the dynamic action recognition classifier with hand action videos proceeds as follows. First, the data set is constructed: from each video of the hand action video set built in the first step of initializing the static action recognition classifier with standard hand action images, 10 frames are extracted uniformly and used as input. Second, the dynamic action recognition classifier DynamicN is constructed. Third, DynamicN is initialized, the input being the set of 10 frames extracted from each video in the first step; let the 10 frames input each time be Handv, the output be DynamicN(Handv) and the class be yHandv, where the representation of yHandv is: taking out an item: yHandv = [1, 0, 0, 0, 0]; putting back an item: yHandv = [0, 1, 0, 0, 0]; taking out and putting back again: yHandv = [0, 0, 1, 0, 0]; having taken out an item without putting it back: yHandv = [0, 0, 0, 1, 0]; suspected theft action: yHandv = [0, 0, 0, 0, 1]; the evaluation function of the network is the cross-entropy loss computed on (DynamicN(Handv) − yHandv), minimized; the number of iterations is 2000.
The uniform extraction of 10 frames proceeds as follows: for a video segment of Nf frames, the 1st frame of the video is extracted as the 1st frame of the extracted set, the last frame of the video is extracted as the 10th frame of the extracted set, and the i-th frame of the extracted set, for i from 2 to 9, is the frame of the video at the corresponding uniformly spaced position between the first and last frames, taking the integer part of the computed frame index.
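A small Python sketch of this sampling; the rounding of the intermediate indices is an assumption, since the exact index formula was lost from the original text:

```python
def extract_10_frames(frames):
    """Keep the first and last frames and space the remaining eight indices evenly
    between them (integer part)."""
    nf = len(frames)
    idx = [0] + [int((i - 1) * (nf - 1) / 9) for i in range(2, 10)] + [nf - 1]
    return [frames[k] for k in idx]
```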
The dynamic action recognition classifier DynamicN is constructed with the following network structure:
First layer: convolutional layer, inputting is 256 × 256 × 30, and exporting is 256 × 256 × 512, port number channels= 512;The second layer: convolutional layer, inputting is 256 × 256 × 512, and exporting is 256 × 256 × 128, port number channels= 128;Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: convolutional layer, input It is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer inputs and is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, input the 4th Layer output 128 × 128 × 128 is connected in third dimension with layer 5 128 × 128 × 128, export as 64 × 64 × 256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The Eight layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: volume Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, it is defeated Enter and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, exporting is 32 × 32 ×512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels= 512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512; 13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth Four layers: pond layer inputs and exports 32 × 32 × 512 and the 13rd layer 32 × 32 × 512 in third dimension for eleventh floor It is connected, exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, export as 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, Port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, channel Number channels=512;18th layer: pond layer inputs and exports 16 × 16 × 512 and the 17th layer 16 × 16 for the 15th layer × 512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, Output is 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, export as 4 × 4×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128; Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length It is 128, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 128, and output vector is long Degree is 32, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 32, and output vector is long Degree is 3, and activation primitive is soft-max activation primitive;The parameter of all convolutional layers is size=3 convolution kernel kernel, step-length Stride=(1,1), activation primitive are relu activation primitive;All pond layers are maximum pond layer, parameter Chi Huaqu Between size kernel_size=2, step-length stride=(2,2).
After a grasp action is recognized, target tracking is carried out on the region corresponding to the current grasp action. The method is as follows: let the image of the currently recognized grasp action be Hgrab; the current tracking region is the region corresponding to image Hgrab. First step, extract the ORB feature ORB_Hgrab of image Hgrab. Second step, for the images corresponding to all hand regions in the frame following Hgrab, compute their ORB features to obtain an ORB feature set, and remove any ORB feature already claimed by another tracking box. Third step, compare ORB_Hgrab with each member of the ORB feature set by Hamming distance and select the ORB feature with the smallest Hamming distance to ORB_Hgrab as the chosen feature. If the similarity between the chosen ORB feature and ORB_Hgrab is > 0.85, where similarity = 1 − (Hamming distance of the two ORB features / ORB feature length), then the hand region corresponding to the chosen ORB feature is the tracking box of image Hgrab in the next frame; otherwise (similarity ≤ 0.85) the tracking is considered lost.
The ORB feature: methods for extracting ORB features from an image are well established, and an implementation is available in the OpenCV computer vision library. Extracting the ORB features of a picture takes the current image as input and outputs several character strings of identical length, each string representing one ORB feature.
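For reference, a minimal sketch of ORB extraction with the OpenCV library is given below; the default detector parameters are assumed and the function name is illustrative.

```python
import cv2

def extract_orb(image_bgr):
    """Extract ORB keypoints and binary descriptors from an image with OpenCV."""
    orb = cv2.ORB_create()                        # default parameters; nfeatures etc. are tunable
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return keypoints, descriptors                 # descriptors: uint8 array, 32 bytes (256 bits) per keypoint
```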
Taking the hand region corresponding to the current image as the end of the video, target tracking is carried out from the current frame back through earlier frames until the tracking is lost. The method is as follows: let the image of the currently recognized put-down action be Hdown; the current tracking region is the region corresponding to image Hdown.
While the tracking is not lost:
First step, extract the ORB feature ORB_Hdown of image Hdown; since this feature has already been computed during the target tracking of the grasp-action region described above, it does not need to be recalculated here;
Second step, for the images corresponding to all hand regions in the frame preceding image Hdown, compute their ORB features to obtain an ORB feature set, and remove any ORB feature already claimed by another tracking box;
Third step, compare ORB_Hdown with each member of the ORB feature set by Hamming distance and select the ORB feature with the smallest Hamming distance to ORB_Hdown as the chosen feature. If the similarity between the chosen ORB feature and ORB_Hdown is > 0.85, where similarity = 1 − (Hamming distance of the two ORB features / ORB feature length), then the hand region corresponding to the chosen ORB feature is the tracking box of image Hdown in that preceding frame; otherwise (similarity ≤ 0.85) the tracking is lost and the algorithm terminates.
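A minimal sketch of this matching rule (smallest Hamming distance, accepted only if similarity > 0.85) is given below. It assumes, for simplicity, that each hand region is summarised by a single fixed-length binary descriptor, whereas the text works with a set of ORB features per region; all function names are illustrative.

```python
import numpy as np

def hamming_distance(d1, d2):
    """Bitwise Hamming distance between two equal-length binary ORB descriptors (uint8 arrays)."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

def match_tracking_box(query_desc, candidate_descs, bit_length=256):
    """Pick the candidate hand-region descriptor closest to the query in Hamming distance;
    accept it only if similarity = 1 - distance/length exceeds 0.85, otherwise report loss."""
    best_idx, best_dist = None, None
    for idx, cand in enumerate(candidate_descs):
        dist = hamming_distance(query_desc, cand)
        if best_dist is None or dist < best_dist:
            best_idx, best_dist = idx, dist
    if best_idx is None:
        return None                                   # no candidates: tracking lost
    similarity = 1.0 - best_dist / bit_length
    return best_idx if similarity > 0.85 else None    # None means tracking lost
```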
The product identification module works as follows. During initialization, the product identification classifier is first initialized with the set of product images taken from all angles, and a product list is generated from the product images. When the product list changes: if a product is removed, its images are deleted from the multi-angle product image set and its entry is deleted from the product list; if a product is added, its multi-angle product images are added to the multi-angle product image set, the name of the newly added product is appended to the end of the product list, and the product identification classifier is then updated with the new multi-angle product image set and the new product list. During detection, the first step is, based on the complete video and the grasp-only video passed from the shopping action recognition module, to start from the position obtained by the target detection module for the first frame of the current video and search the input video from that first frame toward earlier frames until a frame is found in which the region is not occluded; the image of the region corresponding to that frame is then used as the input of the product identification classifier, yielding the recognition result for the current product. The recognition method is: let the input image be Goods1; the output GoodsN(Goods1) is a vector, and if its i_goods-th position is the maximum, the current recognition result is the product at the i_goods-th position of the product list; the recognition result is sent to the recognition result processing module;
The initialization of the product identification classifier with the multi-angle product image set, and the generation of the product list from the product images, proceed as follows. First step, construct the data set and the product list: the data set consists of the images of each product from all angles, and the product list listGoods is a vector in which each element corresponds to a product name. Second step, construct the product identification classifier GoodsN. Third step, initialize the constructed product identification classifier GoodsN, whose input is the multi-angle product image set: if the input image is Goods, the output is GoodsN(Goods) and the class label is yGoods, where yGoods is a vector whose length equals the number of products in the product list; yGoods is defined so that if image Goods is the product at the i_Goods-th position, then the i_Goods-th position of yGoods is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss of (GoodsN(Goods) − yGoods), the convergence direction is minimization, and the number of iterations is 2000.
The construction product identification classifier GoodsN, two groups of GoodsN1 and GoodsN2 of network layer structure, wherein The network structure of GoodsN1 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, and exporting is 256 × 256 × 64, port number Channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, and exporting is 256 × 256 × 128, port number Channels=128;Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: volume Lamination, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolution Layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, It inputs the 4th layer of output 128 × 128 × 128 to be connected in third dimension with layer 5 128 × 128 × 128, exporting is 64 ×64×256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels= 256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The Nine layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond Change layer, inputs and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, export It is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;14th layer: pond layer, input for eleventh floor export 32 × 32 × 512 and the 13rd layer 32 × 32 × 512 are connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, input as 16 × 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, Output is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, output It is 16 × 16 × 512, port number channels=512;18th layer: pond layer, input for the 15th layer output 16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th layer: convolution Layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: pond layer, input It is 8 × 8 × 256, exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, export as 4 × 4 × 128, port number channels=128;The parameter of all convolutional layers be size=3 convolution kernel kernel, step-length stride=(1, 1), activation primitive is relu activation primitive;All pond layers are maximum pond layer, and parameter is pond section size Kernel_size=2, step-length stride=(2,2).The network structure of GoodsN2 are as follows: inputting is 4 × 4 × 128, first will be defeated The data entered are launched into the vector of 2048 dimensions, then input into first layer;First layer: full articulamentum, input vector length are 2048, output vector length is 1024, and activation primitive is relu activation primitive;The second layer: full articulamentum, input vector length are 1024, output vector length is 1024, and activation primitive is relu activation primitive;Third layer: full 
articulamentum, input vector length are 1024, output vector length is len (listGoods), activation primitive is soft-max activation primitive;len(listGoods) indicate The length of product list.For any input Goods2, GoodsN (Goods2)=GoodsN2 (GoodsN1 (Goods2)).
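As an illustration of the composition GoodsN(x) = GoodsN2(GoodsN1(x)), a minimal PyTorch sketch of the fully connected head GoodsN2 attached to the convolutional trunk GoodsN1 is given below; the trunk is taken as given, the layer names are illustrative, and only the dimensions stated in the text (2048 → 1024 → 1024 → len(listGoods)) are assumed.

```python
import torch.nn as nn

class GoodsN(nn.Module):
    """Sketch: GoodsN(x) = GoodsN2(GoodsN1(x)); GoodsN1 ends in a 4x4x128 feature map,
    GoodsN2 is the fully connected head whose last layer has len(listGoods) outputs."""
    def __init__(self, goods_n1: nn.Module, num_products: int):
        super().__init__()
        self.goods_n1 = goods_n1
        self.goods_n2 = nn.Sequential(
            nn.Flatten(),                                  # 4x4x128 -> 2048
            nn.Linear(2048, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_products), nn.Softmax(dim=1),
        )

    def forward(self, x):
        return self.goods_n2(self.goods_n1(x))
```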
Updating the product identification classifier with the new multi-angle product image set and the new product list proceeds as follows. First step, modify the network structure: for the newly constructed product identification classifier GoodsN′, the structure of GoodsN1′ is unchanged and identical to the GoodsN1 structure used at initialization; the first and second layers of GoodsN2′ remain unchanged, and the output vector length of its third layer becomes the length of the updated product list. Second step, initialize the newly constructed product identification classifier GoodsN′: its input is the new multi-angle product image set; if the input image is Goods3, the output is GoodsN′(Goods3) = GoodsN2′(GoodsN1(Goods3)) and the class label is yGoods3, where yGoods3 is a vector whose length equals the number of products in the updated product list; yGoods3 is defined so that if image Goods3 is the product at the i_Goods-th position, then the i_Goods-th position of yGoods3 is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss of (GoodsN′(Goods3) − yGoods3), the convergence direction is minimization, the parameter values of GoodsN1 remain unchanged during this initialization, and the number of iterations is 500.
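A minimal sketch of this update step is given below: GoodsN1 is reused and frozen, the first two fully connected layers keep their weights, and only the final layer is rebuilt to match the new product list length. The helper name and the assumption that GoodsN2 is an nn.Sequential ending in [Linear, Softmax] are illustrative.

```python
import copy
import torch.nn as nn

def build_updated_goodsn(goods_n1: nn.Module, old_goods_n2: nn.Sequential, new_num_products: int):
    """Sketch of the product-list update: freeze GoodsN1 for the 500 fine-tuning iterations,
    keep the first two FC layers of GoodsN2, and rebuild only the final output layer."""
    for p in goods_n1.parameters():
        p.requires_grad = False                            # GoodsN1 parameters stay unchanged
    new_goods_n2 = copy.deepcopy(old_goods_n2)
    idx = len(new_goods_n2) - 2                            # final Linear (assumes [..., Linear, Softmax] layout)
    new_goods_n2[idx] = nn.Linear(1024, new_num_products)
    return nn.Sequential(goods_n1, new_goods_n2)
```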
Searching the input video from the first frame of the current video toward earlier frames, starting from the position obtained by the target detection module for that first frame, until a frame is found in which the region is not occluded, proceeds as follows. Let the position obtained by the target detection module for the first frame of the current video be (a_goods, b_goods, l_goods, w_goods), and let that first frame be the i_crgs-th frame; the frame being processed is i_cr = i_crgs. First step, let Task_icr be the set of all detection regions obtained by the target detection module for the i_cr-th frame. Second step, for each region frame (a_task, b_task, l_task, w_task) in Task_icr, compute its distance d_gt = (a_task − a_goods)² + (b_task − b_goods)² − (l_task + l_goods)² − (w_task + w_goods)². If no distance < 0 exists, then the region (a_goods, b_goods, l_goods, w_goods) in the i_cr-th frame is the unoccluded frame being sought and the algorithm terminates. Otherwise, if some distance < 0 exists, record d(i_cr) = the minimum distance in the distance list d and set i_cr = i_cr − 1; if i_cr > 0 the algorithm jumps back to the first step, and if i_cr ≤ 0 the record with the maximum value in the distance list d is selected, the region (a_goods, b_goods, l_goods, w_goods) in the frame corresponding to that record is taken as the unoccluded frame, and the algorithm terminates.
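A minimal sketch of this backward search is given below, under the assumption that per-frame detections are available as lists of (a, b, l, w) boxes indexed by frame number; all names are illustrative.

```python
def find_unoccluded_frame(first_pos, detections_per_frame, start_idx):
    """Sketch of the backward occlusion search.  first_pos = (a, b, l, w) for the product
    region in the video's first frame; detections_per_frame[i] lists the detection boxes of
    frame i; start_idx is the index of that first frame.  Returns the chosen frame index."""
    a_g, b_g, l_g, w_g = first_pos
    dist_list = {}                                    # frame index -> minimum (negative) distance
    icr = start_idx
    while icr > 0:
        dists = [(a - a_g) ** 2 + (b - b_g) ** 2 - (l + l_g) ** 2 - (w + w_g) ** 2
                 for (a, b, l, w) in detections_per_frame[icr]]
        negative = [d for d in dists if d < 0]
        if not negative:                              # no detection overlaps the region
            return icr                                # region judged unoccluded in this frame
        dist_list[icr] = min(negative)
        icr -= 1
    if not dist_list:                                 # degenerate case, not covered by the text
        return start_idx
    # every examined frame had an overlapping box: fall back to the frame whose recorded
    # distance is largest, as the text prescribes
    return max(dist_list, key=dist_list.get)
```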
The individual identification module works as follows. During initialization, the face feature extractor FaceN is first initialized with the multi-angle face image set and μface is computed; then the body feature extractor BodyN is initialized with the multi-angle body images and μbody is computed. During detection, when a user enters the supermarket, the target detection module obtains the current body region Body1 and the face image Face1 within that body region; the body feature BodyN(Body1) and the face feature FaceN(Face1) are then extracted with the body feature extractor BodyN and the face feature extractor FaceN respectively, BodyN(Body1) is stored in the BodyFtu set, FaceN(Face1) is stored in the FaceFtu set, and the current customer's ID information is stored. The ID information may be the user's supermarket account or a unique, non-repeating number assigned at random when the user enters the supermarket; it is used to distinguish different customers, and whenever a customer enters the supermarket their body feature and face feature are extracted. When a user moves a product in the supermarket, based on the complete video and the grasp-only video passed from the shopping action recognition module, the corresponding body region and face region are located, and face recognition or body recognition is carried out with the face feature extractor FaceN and the body feature extractor BodyN to obtain the ID of the customer corresponding to the video passed from the shopping action recognition module.
The initialization of the face feature extractor FaceN with the multi-angle face image set and the computation of μface proceed as follows: first step, the face image sets of all angles are chosen to form the face data set; second step, the face feature extractor FaceN is constructed and initialized with the face data set; third step:
For each person i_Peop in the face data set, obtain the set FaceSet(i_Peop) of all face images in the face data set that belong to i_Peop:
For each face image Face(j_iPeop) in FaceSet(i_Peop):
Compute the face feature FaceN(Face(j_iPeop));
Take the average of all face features in the current face image set FaceSet(i_Peop) as the centre of the current person's face images, center(FaceN(Face(j_iPeop))); the distances between all face features in FaceSet(i_Peop) and this centre form the distance set corresponding to i_Peop.
The distance sets of everyone in the face data set are obtained; after the distance set is arranged from small to large, let its length be n_diset; ⌊ ⌋ indicates taking the integer part.
The construction face characteristic extractor FaceN is simultaneously initialized using face data set, if human face data collection By NfacesetIndividual is constituted, and network layer structure FaceN25 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, is exported and is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, export as 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are defeated with third layer 256 × 256 × 64 are connected in third dimension out, and exporting is 128 × 128 × 128;4th layer: convolutional layer inputs and is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer, inputting is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, the 4th layer of input are defeated 128 × 128 × 128 are connected in third dimension with layer 5 128 × 128 × 128 out, and exporting is 64 × 64 × 256;The Seven layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;8th layer: volume Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: convolutional layer, it is defeated Entering is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, inputting is Seven layers of output 64 × 64 × 256 are connected in third dimension with the 9th layer 64 × 64 × 256, and exporting is 32 × 32 × 512; Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth Two layers: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;13rd layer: Convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;14th layer: Chi Hua Layer inputs and is connected in third dimension for eleventh floor output 32 × 32 × 512 with the 13rd layer 32 × 32 × 512, defeated It is out 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, and exporting is 16 × 16 × 512, port number Channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;18th layer: pond layer, input for the 15th layer export 16 × 16 × 512 and the 17th layer 16 × 16 × 512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, defeated It is out 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 ×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128; Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length It is 512, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 512, and output vector is long Degree is 512, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 512, output vector Length 
is N_faceset, and the activation function is the soft-max activation function. The parameters of all convolutional layers are convolution kernel size kernel size = 3 and step length stride = (1, 1), with the relu activation function; all pooling layers are maximum pooling layers with pooling window size kernel_size = 2 and step length stride = (2, 2). The initialization procedure is: for each face image face4, the output is FaceN25(face4) and the class label is yface, where yface is a vector of length N_faceset; yface is defined so that if face4 belongs to the i_face4-th person in the face image set, then the i_face4-th position of yface is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss of (FaceN25(face4) − yface), the convergence direction is minimization, and the number of iterations is 2000. After the iterations, the face feature extractor FaceN consists of layers 1 through 24 of the FaceN25 network.
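A minimal sketch of turning the trained classifier FaceN25 into the feature extractor FaceN (dropping the final soft-max classification layer) is given below, under the assumption that FaceN25 is held as an nn.Sequential; the helper name is illustrative.

```python
import torch.nn as nn

def make_face_feature_extractor(face_n25: nn.Sequential) -> nn.Sequential:
    """Sketch: FaceN is FaceN25 without its 25th (soft-max) layer, so its output is the
    512-dimensional vector produced by the 24th layer."""
    return nn.Sequential(*list(face_n25.children())[:-1])   # keep layers 1..24
```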
The initialization of the body feature extractor BodyN with the multi-angle body images and the computation of μbody proceed as follows: first step, the body image sets of all angles are chosen to form the body data set; second step, the body feature extractor BodyN is constructed and initialized with the body data set; third step:
For each person i_Peop1 in the body data set, obtain the set BodySet(i_Peop1) of all body images in the body data set that belong to i_Peop1:
For each body image Body(j_iPeop1) in BodySet(i_Peop1):
Compute the body feature BodyN(Body(j_iPeop1));
Take the average of all body features in the current body image set BodySet(i_Peop1) as the centre of the current person's body images, center(BodyN(Body(j_iPeop1))); the distances between all body features in BodySet(i_Peop1) and this centre form the distance set corresponding to i_Peop1.
The distance sets of everyone in the body data set are obtained; after the distance set is arranged from small to large, let its length be n_diset1; ⌊ ⌋ indicates taking the integer part.
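A minimal sketch of the per-person statistics behind μface and μbody is given below. It assumes the per-person distances are pooled and sorted; the rule that picks the threshold from the sorted list is left to the caller, and all names are illustrative.

```python
import numpy as np

def pooled_center_distances(features_by_person):
    """Sketch: features_by_person maps a person id to an array of feature vectors (one per
    image).  Each person's mean feature is the 'center'; the distances of that person's
    features to the center are collected, pooled over all persons, and sorted ascending."""
    all_distances = []
    for person_id, feats in features_by_person.items():
        feats = np.asarray(feats)
        center = feats.mean(axis=0)                        # per-person feature center
        all_distances.extend(np.linalg.norm(feats - center, axis=1))
    return np.sort(np.asarray(all_distances))              # threshold is then read off this list
```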
Construction characteristics of human body's extractor BodyN and user's volumetric data set is initialized, if somatic data collection By NbodysetIndividual is constituted, and network layer structure BodyN25 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, is exported and is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, export as 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are defeated with third layer 256 × 256 × 64 are connected in third dimension out, and exporting is 128 × 128 × 128;4th layer: convolutional layer inputs and is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer, inputting is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, the 4th layer of input are defeated 128 × 128 × 128 are connected in third dimension with layer 5 128 × 128 × 128 out, and exporting is 64 × 64 × 256;The Seven layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;8th layer: volume Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: convolutional layer, it is defeated Entering is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, inputting is Seven layers of output 64 × 64 × 256 are connected in third dimension with the 9th layer 64 × 64 × 256, and exporting is 32 × 32 × 512; Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth Two layers: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;13rd layer: Convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;14th layer: Chi Hua Layer inputs and is connected in third dimension for eleventh floor output 32 × 32 × 512 with the 13rd layer 32 × 32 × 512, defeated It is out 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, and exporting is 16 × 16 × 512, port number Channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;18th layer: pond layer, input for the 15th layer export 16 × 16 × 512 and the 17th layer 16 × 16 × 512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, defeated It is out 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 ×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128; Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length It is 512, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 512, and output vector is long Degree is 512, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 512, output vector Length 
is N_bodyset, and the activation function is the soft-max activation function. The parameters of all convolutional layers are convolution kernel size kernel size = 3 and step length stride = (1, 1), with the relu activation function; all pooling layers are maximum pooling layers with pooling window size kernel_size = 2 and step length stride = (2, 2). The initialization procedure is: for each body image body4, the output is BodyN25(body4) and the class label is ybody, where ybody is a vector of length N_bodyset; ybody is defined so that if body4 belongs to the i_body4-th person in the body image set, then the i_body4-th position of ybody is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss of (BodyN25(body4) − ybody), the convergence direction is minimization, and the number of iterations is 2000. After the iterations, the body feature extractor BodyN consists of layers 1 through 24 of the BodyN25 network.
Based on the complete video and the grasp-only video passed from the shopping action recognition module, the corresponding body region and face region are located, and face recognition or body recognition is carried out with the face feature extractor FaceN and the body feature extractor BodyN to obtain the ID of the customer corresponding to the video currently passed from the shopping action recognition module. The process is: for the video passed from the shopping action recognition module, the corresponding body region and face region are looked for starting from the first frame of the video, until the algorithm terminates or the last frame of the video has been processed:
The corresponding body region image Body2 and face region image Face2 are passed through the body feature extractor BodyN and the face feature extractor FaceN respectively, extracting the body feature BodyN(Body2) and the face feature FaceN(Face2);
Face identification information is used first: the Euclidean distances dFace between FaceN(Face2) and all face features in the FaceFtu set are compared, and the feature in the FaceFtu set with the minimum Euclidean distance is selected; let this feature be FaceN(Face3). If dFace < μface, the current face image is identified as belonging to the customer of the face image corresponding to FaceN(Face3), that customer's ID is the ID corresponding to the video action passed from the shopping action recognition module, and the current identification process terminates;
If dFace ≥ μface, the current individual cannot be identified by the face identification method alone; the Euclidean distances dBody between BodyN(Body2) and all body features in the BodyFtu set are then compared, and the feature in the BodyFtu set with the minimum Euclidean distance is selected; let this feature be BodyN(Body3). If dBody + dFace < μface + μbody, the current body image is identified as belonging to the customer of the body image corresponding to BodyN(Body3), and that customer's ID is the ID corresponding to the video action passed from the shopping action recognition module.
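A minimal sketch of this per-frame decision rule (face first, then combined face + body) is given below; it assumes, for simplicity, one stored feature vector per customer, whereas the text stores features in the FaceFtu and BodyFtu sets, and all names are illustrative.

```python
import numpy as np

def identify_customer(face_feat, body_feat, face_db, body_db, mu_face, mu_body):
    """Return a customer id, or None if neither the face test nor the combined test passes
    (in which case the next frame of the video is tried)."""
    ids = list(face_db.keys())
    d_face = {cid: np.linalg.norm(face_feat - face_db[cid]) for cid in ids}
    best_face = min(d_face, key=d_face.get)
    if d_face[best_face] < mu_face:
        return best_face                              # face recognition alone is sufficient
    d_body = {cid: np.linalg.norm(body_feat - body_db[cid]) for cid in ids}
    best_body = min(d_body, key=d_body.get)
    if d_body[best_body] + d_face[best_face] < mu_face + mu_body:
        return best_body                              # combined face + body test
    return None
```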
If no ID corresponding to the video action has been found after all frames of the video have been processed, then, to avoid misidentifying the shopping subject and producing an incorrect billing entry, the video passed from the current shopping action recognition module is not processed further.
Looking for the corresponding body region and face region starting from the first frame of the video passed from the shopping action recognition module proceeds as follows: the video passed from the shopping action recognition module is processed from its first frame. Suppose the i_fRg-th frame is currently being processed, the position obtained by the target detection module for this frame of the video is (a_ifRg, b_ifRg, l_ifRg, w_ifRg), the set of body regions obtained by the target detection module for this frame is BodyFrameSet_ifRg, and the set of face regions is FaceFrameSet_ifRg. For each body region (a_BFSifRg, b_BFSifRg, l_BFSifRg, w_BFSifRg) in BodyFrameSet_ifRg, compute its distance d_gbt = (a_BFSifRg − a_ifRg)² + (b_BFSifRg − b_ifRg)² − (l_BFSifRg − l_ifRg)² − (w_BFSifRg − w_ifRg)², and select the body region with the smallest distance among all body regions as the body region corresponding to the current video; let the position of the chosen body region be (a_BFS1, b_BFS1, l_BFS1, w_BFS1). For each face region (a_FFSifRg, b_FFSifRg, l_FFSifRg, w_FFSifRg) in FaceFrameSet_ifRg, compute its distance d_gft = (a_BFS1 − a_FFSifRg)² + (b_BFS1 − b_FFSifRg)² − (l_BFS1 − l_FFSifRg)² − (w_BFS1 − w_FFSifRg)², and select the face region with the smallest distance among all face regions as the face region corresponding to the current video.
The recognition result processing module does nothing during initialization. During recognition it integrates the received recognition results to generate the shopping list corresponding to each customer: first, the customer ID passed from the individual identification module determines which customer the current shopping information belongs to, so the shopping list to be modified is that of ID; next, the recognition result passed from the product identification module determines which product the current shopping action concerns, say product GoodA; then the recognition result passed from the shopping action recognition module determines the current shopping action and whether the shopping cart is modified. If the action is recognized as taking out an article, product GoodA is added to shopping list ID with a quantity increase of 1; if it is recognized as putting back an article, product GoodA is reduced on shopping list ID with a quantity decrease of 1; if it is recognized as "taken out and put back" or "taken out and not put back again", the shopping list is not changed; if the recognition result is "suspicious stealing", an alarm signal and the location information corresponding to the current video are sent to supermarket monitoring.
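A minimal sketch of this bookkeeping rule is given below; the shopping lists are modelled as nested dictionaries and the alarm callback is a hypothetical stand-in for the signal sent to supermarket monitoring.

```python
def apply_recognition_result(shopping_lists, customer_id, product_name, action, alarm_fn=None):
    """Sketch of the recognition-result processing rule: adjust the customer's shopping list
    according to the recognized action, or raise an alarm for suspected theft.
    shopping_lists: dict customer_id -> dict product_name -> quantity."""
    cart = shopping_lists.setdefault(customer_id, {})
    if action == "take out article":
        cart[product_name] = cart.get(product_name, 0) + 1
    elif action == "put back article":
        cart[product_name] = max(cart.get(product_name, 0) - 1, 0)
    elif action in ("taken out and put back", "taken out and not put back again"):
        pass                                           # shopping list unchanged
    elif action == "suspicious stealing" and alarm_fn is not None:
        alarm_fn(customer_id)                          # plus the location of the current video
```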
The invention has the advantage that, by moving the commodity entry process forward to the moment the shopper picks up the goods, the most time-consuming part of checkout is absorbed into the shopping process itself; the time spent scanning items at checkout is removed, checkout speed is greatly improved, and the customer's shopping experience is improved. During picking, the invention uses pattern recognition algorithms to dynamically identify and count the goods selected by the shopper; the product image is recognized when the customer picks up or puts back a product to obtain the product type; face recognition is performed on the customer, and human body image recognition is used to obtain the customer's identity when face recognition is unreliable; and abnormal customer behaviour is recognized to determine whether theft has occurred. The system achieves automatic settlement without degrading the customer's shopping experience. The invention does not require changing the supermarket's existing organizational structure for the customer's shopping and settlement process, and can therefore be seamlessly integrated with existing supermarket organizational structures.
Detailed description of the invention
Fig. 1 is functional flow diagram of the invention
Fig. 2 is whole functional module of the invention and its correlation block diagram
Specific embodiment
The present invention will be further described below with reference to the drawings.
A kind of supermarket's intelligence vending system, functional flow diagram is as shown in Figure 1, correlation between its module As shown in Figure 2.
Three specific embodiments are provided below to explain the detailed process of the supermarket intelligent vending system of the present invention. Embodiment 1:
The present embodiment realizes a kind of process of the parameter initialization of supermarket's intelligence vending system.
1. image pre-processing module, in initial phase, the module does not work;
2. The human body target detection module, during initialization, initializes the parameters of the target detection algorithm using images with calibrated human body regions, face regions, hand regions and product regions.
The parameter initialization of the target detection algorithm using images with calibrated human body regions, face regions, hand regions and product regions consists of the following steps: first step, construct the feature extraction depth network; second step, construct the region selection network; third step, for each image X in the database used to construct the feature extraction depth network and each corresponding manually calibrated human region, pass them through the ROI layer, whose input is the image X and the region and whose output is of dimension 7 × 7 × 512; fourth step, build the coordinate refining network.
The construction feature extracts depth network, which is deep learning network structure, network structure are as follows: first Layer: convolutional layer, inputting is 768 × 1024 × 3, and exporting is 768 × 1024 × 64, port number channels=64;The second layer: volume Lamination, inputting is 768 × 1024 × 64, and exporting is 768 × 1024 × 64, port number channels=64;Third layer: Chi Hua Layer, input first layer output 768 × 1024 × 64 are connected in third dimension with third layer output 768 × 1024 × 64, Output is 384 × 512 × 128;4th layer: convolutional layer, inputting is 384 × 512 × 128, and exporting is 384 × 512 × 128, is led to Road number channels=128;Layer 5: convolutional layer, inputting is 384 × 512 × 128, and exporting is 384 × 512 × 128, channel Number channels=128;Layer 6: pond layer, the 4th layer of output 384 × 512 × 128 of input and layer 5 384 × 512 × 128 are connected in third dimension, and exporting is 192 × 256 × 256;Layer 7: convolutional layer, input as 192 × 256 × 256, exporting is 192 × 256 × 256, port number channels=256;8th layer: convolutional layer, input as 192 × 256 × 256, exporting is 192 × 256 × 256, port number channels=256;9th layer: convolutional layer, input as 192 × 256 × 256, exporting is 192 × 256 × 256, port number channels=256;Tenth layer: pond layer inputs as layer 7 output 192 × 256 × 256 are connected in third dimension with the 9th layer 192 × 256 × 256, and exporting is 96 × 128 × 512;11st Layer: convolutional layer, inputting is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;Floor 12: Convolutional layer, inputting is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;13rd layer: volume Lamination, inputting is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;14th layer: Chi Hua Layer inputs and is connected in third dimension for eleventh floor output 96 × 128 × 512 with the 13rd layer 96 × 128 × 512, Output is 48 × 64 × 1024;15th layer: convolutional layer, inputting is 48 × 64 × 1024, and exporting is 48 × 64 × 512, channel Number channels=512;16th layer: convolutional layer, inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number Channels=512;17th layer: convolutional layer, inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number Channels=512;18th layer: pond layer, input for the 15th layer export 48 × 64 × 512 and the 17th layer 48 × 64 × 512 are connected in third dimension, and exporting is 48 × 64 × 1024;19th layer: convolutional layer, input as 48 × 64 × 1024, exporting is 48 × 64 × 256, port number channels=256;20th layer: pond layer, inputting is 48 × 64 × 256, Output is 24 × 62 × 256;Second eleventh floor: convolutional layer, inputting is 24 × 32 × 1024, and exporting is 24 × 32 × 256, channel Number channels=256;Second Floor 12: pond layer, inputting is 24 × 32 × 256, and exporting is 12 × 16 × 256;20th Three layers: convolutional layer, inputting is 12 × 16 × 256, and exporting is 12 × 16 × 128, port number channels=128;24th Layer: pond layer, inputting is 12 × 16 × 128, and exporting is 6 × 8 × 128;25th layer: full articulamentum, first by the 6 of input The data of × 8 × 128 dimensions are launched into the vector of 6144 dimensions, then input into full articulamentum, and output vector length is 768, Activation primitive is relu activation primitive;26th layer: full articulamentum, input vector length are 768, and output vector length is 96, activation primitive is relu activation primitive;27th layer: full 
articulamentum (fully connected layer), input vector length 96, output vector length 2, with the soft-max activation function. The parameters of all convolutional layers are convolution kernel size kernel = 3 and step length stride = (1, 1), with the relu activation function; all pooling layers are maximum pooling layers with pooling window size kernel size = 2 and step length stride = (2, 2). Let this depth network be Fconv27; for a colour image X, the set of feature maps obtained through the depth network is denoted Fconv27(X). The evaluation function of the network computes the cross-entropy loss of (Fconv27(X) − y), the convergence direction is minimization, and y is the class corresponding to the input. The database consists of images collected in natural scenes containing pedestrians and non-pedestrians; every image is a colour image of dimension 768 × 1024, the images are divided into two classes according to whether they contain a pedestrian, and the number of iterations is 2000. After training, layers 1 through 17 are taken as the feature extraction depth network Fconv, and the output obtained for a colour image X through this depth network is denoted Fconv(X).
The construction of the region selection network: it receives the set Fconv(X) of 512 feature maps of size 48 × 64 extracted by the Fconv depth network. First, a convolutional layer produces Conv1(Fconv(X)); the parameters of this convolutional layer are convolution kernel size kernel = 1, step length stride = (1, 1), input 48 × 64 × 512, output 48 × 64 × 512, channel number channels = 512. Conv1(Fconv(X)) is then fed separately into two convolutional layers, Conv2-1 and Conv2-2. The structure of Conv2-1 is: input 48 × 64 × 512, output 48 × 64 × 18, channel number channels = 18; its output is Conv2-1(Conv1(Fconv(X))), to which the softmax activation function is applied to obtain softmax(Conv2-1(Conv1(Fconv(X)))). The structure of Conv2-2 is: input 48 × 64 × 512, output 48 × 64 × 36, channel number channels = 36. The network has two loss functions: the first error function loss1 computes the softmax error of Wshad-cls(X) ⊙ (Conv2-1(Conv1(Fconv(X))) − Wcls(X)), and the second error function loss2 computes the smooth L1 error of Wshad-reg(X) ⊙ (Conv2-2(Conv1(Fconv(X))) − Wreg(X)). The loss function of the region selection network = loss1/sum(Wcls(X)) + loss2/sum(Wcls(X)), where sum(·) denotes the sum of all elements of a matrix and the convergence direction is minimization. Wcls(X) and Wreg(X) are the positive/negative sample information corresponding to database image X, ⊙ denotes element-wise multiplication of matrices, and Wshad-cls(X) and Wshad-reg(X) are masks whose role is to train only the parts where the weight in Wshad(X) is 1, so as to avoid an excessive gap between the numbers of positive and negative samples; Wshad-cls(X) and Wshad-reg(X) are regenerated at each iteration, and the algorithm iterates 1000 times.
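A minimal PyTorch sketch of this region selection network (a shared convolution followed by a classification head with 18 channels and a regression head with 36 channels) is given below; the kernel size of the two heads is not stated in the text, so 1 × 1 is assumed, and the class name is illustrative.

```python
import torch.nn as nn

class RegionSelectionNet(nn.Module):
    """Sketch of the region selection network over the 48 x 64 x 512 Fconv feature maps."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(512, 512, kernel_size=1, stride=1)   # Conv1
        self.cls_head = nn.Conv2d(512, 18, kernel_size=1)           # Conv2-1: 2 scores x 9 region frames
        self.reg_head = nn.Conv2d(512, 36, kernel_size=1)           # Conv2-2: 4 offsets x 9 region frames

    def forward(self, fconv_x):                  # fconv_x: (N, 512, 48, 64)
        x = self.conv1(fconv_x)
        return self.cls_head(x), self.reg_head(x)
```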
The database used to construct the feature extraction depth network: for each image in the database, step 1, each human body region, face region, hand region and product region is manually calibrated; if its centre coordinate in the input image is (a_bas_tr, b_bas_tr), the distance from the centre coordinate to the top and bottom frame edges is l_bas_tr, and the distance from the centre coordinate to the left and right frame edges is w_bas_tr, then the corresponding position in Conv1 is given by its centre coordinate, half-length and half-width, where ⌊ ⌋ indicates taking the integer part; step 2, positive and negative samples are generated at random.
The random generation of positive and negative samples proceeds as follows: first step, construct 9 region frames; second step, for each image Xtr in the database, let Wcls be of dimension 48 × 64 × 18 and Wreg of dimension 48 × 64 × 36, with all initial values 0, and fill Wcls and Wreg.
The construction of the 9 region frames: the 9 region frames are Ro1(xRo, yRo) = (xRo, yRo, 64, 64), Ro2(xRo, yRo) = (xRo, yRo, 45, 90), Ro3(xRo, yRo) = (xRo, yRo, 90, 45), Ro4(xRo, yRo) = (xRo, yRo, 128, 128), Ro5(xRo, yRo) = (xRo, yRo, 90, 180), Ro6(xRo, yRo) = (xRo, yRo, 180, 90), Ro7(xRo, yRo) = (xRo, yRo, 256, 256), Ro8(xRo, yRo) = (xRo, yRo, 360, 180), Ro9(xRo, yRo) = (xRo, yRo, 180, 360). For each region frame, Roi(xRo, yRo) denotes the i-th region frame with centre coordinate (xRo, yRo); its third element is the pixel distance from the centre point to the top and bottom frame edges, its fourth element is the pixel distance from the centre point to the left and right frame edges, and i takes values from 1 to 9.
The filling of Wcls and Wreg proceeds as follows:
For each manually calibrated body region, if its centre coordinate in the input image is (a_bas_tr, b_bas_tr), the distance from the centre coordinate to the top and bottom frame edges is l_bas_tr, and the distance from the centre coordinate to the left and right frame edges is w_bas_tr, then the corresponding position in Conv1 is given by its centre coordinate, half-length and half-width.
For the upper left cornerThe lower right corner CoordinateEach point in the section surrounded (xctr, yCtr):
For i value from 1 to 9:
For the point (xCtr, yCtr), its mapping range in the database image is the 16 × 16 section bounded by the upper-left corner point (16(xCtr − 1) + 1, 16(yCtr − 1) + 1) and the lower-right corner point (16xCtr, 16yCtr); for each point (xOtr, yOtr) in this section:
Calculate (xOtr, yotr) corresponding to region Roi(xOtr, yOtr) with current manual calibration section coincidence factor;
Select the point (xIoUMax, yIoUMax) with the highest coincidence factor in the current 16 × 16 section. If the coincidence factor > 0.7, then Wcls(xCtr, yCtr, 2i−1) = 1 and Wcls(xCtr, yCtr, 2i) = 0, i.e. the point is a positive sample, and Wreg(xCtr, yCtr, 4i−3) = (xOtr − 16xCtr + 8)/8, Wreg(xCtr, yCtr, 4i−2) = (yOtr − 16yCtr + 8)/8, Wreg(xCtr, yCtr, 4i−1) = Down1(l_bas_tr / third element of Roi), Wreg(xCtr, yCtr, 4i) = Down1(w_bas_tr / fourth element of Roi), where Down1(·) clips the value to 1 if it is greater than 1. If the coincidence factor < 0.3, then Wcls(xCtr, yCtr, 2i−1) = 0 and Wcls(xCtr, yCtr, 2i) = 1; otherwise Wcls(xCtr, yCtr, 2i−1) = −1 and Wcls(xCtr, yCtr, 2i) = −1.
If no Roi(xOtr, yOtr) has a coincidence factor > 0.6 with the current manually calibrated human region, then the Roi(xOtr, yOtr) with the highest coincidence factor is selected and Wcls and Wreg are assigned in the same way as in the coincidence factor > 0.7 case.
The computation of the coincidence factor between the region Roi(xOtr, yOtr) corresponding to (xOtr, yOtr) and the manually calibrated section: let the manually calibrated body region have centre coordinate (a_bas_tr, b_bas_tr) in the input image, distance l_bas_tr from the centre coordinate to the top and bottom frame edges, and distance w_bas_tr from the centre coordinate to the left and right frame edges, and let the third element of Roi(xOtr, yOtr) be lOtr and the fourth be wOtr. If |xOtr − a_bas_tr| ≤ lOtr + l_bas_tr − 1 and |yOtr − b_bas_tr| ≤ wOtr + w_bas_tr − 1, then an overlapping region exists and overlapping region = (lOtr + l_bas_tr − 1 − |xOtr − a_bas_tr|) × (wOtr + w_bas_tr − 1 − |yOtr − b_bas_tr|); otherwise overlapping region = 0. Then whole region = (2lOtr − 1) × (2wOtr − 1) + (2l_bas_tr − 1) × (2w_bas_tr − 1) − overlapping region, and coincidence factor = overlapping region / whole region, where | · | denotes the absolute value.
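A minimal sketch of this coincidence-factor computation is given below; it follows the formulas above, using the half-extent l_bas (rather than the centre coordinate) in the union term, and the function name is illustrative.

```python
def coincidence_factor(x_otr, y_otr, l_otr, w_otr, a_bas, b_bas, l_bas, w_bas):
    """Overlap ratio between a region frame centred at (x_otr, y_otr) with half-extents
    (l_otr, w_otr) and a calibrated box centred at (a_bas, b_bas) with half-extents (l_bas, w_bas)."""
    if abs(x_otr - a_bas) <= l_otr + l_bas - 1 and abs(y_otr - b_bas) <= w_otr + w_bas - 1:
        overlap = ((l_otr + l_bas - 1 - abs(x_otr - a_bas)) *
                   (w_otr + w_bas - 1 - abs(y_otr - b_bas)))
    else:
        overlap = 0
    union = (2 * l_otr - 1) * (2 * w_otr - 1) + (2 * l_bas - 1) * (2 * w_bas - 1) - overlap
    return overlap / union
```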
The construction of Wshad-cls(X) and Wshad-reg(X): for an image X with corresponding positive/negative sample information Wcls(X) and Wreg(X), the first step constructs Wshad-cls(X) and Wshad-reg(X), where Wshad-cls(X) has the same dimensions as Wcls(X) and Wshad-reg(X) has the same dimensions as Wreg(X). The second step records the information of all positive samples: for i = 1 to 9, if Wcls(X)(a, b, 2i−1) = 1, then Wshad-cls(X)(a, b, 2i−1) = 1, Wshad-cls(X)(a, b, 2i) = 1, Wshad-reg(X)(a, b, 4i−3) = 1, Wshad-reg(X)(a, b, 4i−2) = 1, Wshad-reg(X)(a, b, 4i−1) = 1, Wshad-reg(X)(a, b, 4i) = 1; in total sum(Wshad-cls(X)) positive samples are selected, where sum(·) sums all elements of a matrix, and if sum(Wshad-cls(X)) > 256, 256 positive samples are retained at random. The third step randomly selects negative samples: a triple (a, b, i) is chosen at random, and if Wcls(X)(a, b, 2i) = 1, then Wshad-cls(X)(a, b, 2i−1) = 1, Wshad-cls(X)(a, b, 2i) = 1, Wshad-reg(X)(a, b, 4i−3) = 1, Wshad-reg(X)(a, b, 4i−2) = 1, Wshad-reg(X)(a, b, 4i−1) = 1, Wshad-reg(X)(a, b, 4i) = 1. The number of negative samples to be chosen is 256 − sum(Wshad-cls(X)); if the number of negative samples is still insufficient but 20 consecutive random draws of (a, b, i) fail to yield a negative sample, the algorithm terminates.
The ROI layer: its inputs are an image X and a region. The method is as follows: for image X, the output Fconv(X) obtained through the feature extraction depth network Fconv has dimension 48 × 64 × 512; for each 48 × 64 matrix V_ROI_I (512 matrices in total), extract the sub-region of V_ROI_I bounded by the given upper-left and lower-right corners, where ⌊ ⌋ indicates taking the integer part; the output roi_I(X) has dimension 7 × 7, with the corresponding step lengths. Then:
For iROI = 1 to 7:
For jROI=1 to 7:
Construct section
roi_I(X)(iROI, jROI) = the maximum value among the points of the section.
When all 512 of the 48 × 64 matrices have been processed, the outputs are concatenated to obtain an output of dimension 7 × 7 × 512, which represents, for image X, the feature within the range of the region frame ROI.
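A minimal NumPy sketch of this ROI layer is given below. It assumes the region has already been converted to row/column bounds on the 48 × 64 feature maps and that the cropped window is at least 7 × 7; the bin boundaries use a simple even split, since the exact step formulas are not reproduced above, and all names are illustrative.

```python
import numpy as np

def roi_pool_7x7(feature_maps, region):
    """feature_maps: (512, 48, 64) output of Fconv; region = (top, bottom, left, right) in
    feature-map coordinates.  Each channel of the crop is max-pooled onto a 7 x 7 grid."""
    top, bottom, left, right = region
    crop = feature_maps[:, top:bottom + 1, left:right + 1]       # (512, H, W)
    h, w = crop.shape[1], crop.shape[2]
    out = np.zeros((7, 7, 512), dtype=feature_maps.dtype)
    rows = np.linspace(0, h, 8).astype(int)                      # 8 edges -> 7 roughly equal bins
    cols = np.linspace(0, w, 8).astype(int)
    for i in range(7):
        for j in range(7):
            r0, r1 = rows[i], max(rows[i + 1], rows[i] + 1)      # ensure non-empty bin
            c0, c1 = cols[j], max(cols[j + 1], cols[j] + 1)
            out[i, j, :] = crop[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out                                                   # 7 x 7 x 512
```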
The building of the coordinate refining network proceeds as follows. First step, extend the database: for each image X in the database and each corresponding manually calibrated region, its corresponding ROI is computed; if the current region is a human body region then BClass = [1,0,0,0,0] and BBox = [0,0,0,0]; if the current region is a face region then BClass = [0,1,0,0,0] and BBox = [0,0,0,0]; if the current region is a hand region then BClass = [0,0,1,0,0] and BBox = [0,0,0,0]; if the current region is a product region then BClass = [0,0,0,1,0] and BBox = [0,0,0,0]. Random numbers a_rand, b_rand, l_rand, w_rand with values between −1 and 1 are generated to obtain a new section (⌊ ⌋ indicates taking the integer part), and for this section BBox = [a_rand, b_rand, l_rand, w_rand]; if the coincidence factor between the new section and the original region is > 0.7 then BClass = the BClass of the current region, if the coincidence factor is < 0.3 then BClass = [0,0,0,0,1], and if neither holds then no assignment is made. Each section generates at most 10 positive sample regions; if Num1 positive sample regions are generated, then Num1 + 1 negative sample regions are generated, and if there are fewer than Num1 + 1 negative sample regions, the range of a_rand, b_rand, l_rand, w_rand is enlarged until enough negative samples are found. Second step, build the coordinate refining network: for each image X in the database and each corresponding manually calibrated human region with its corresponding ROI, the ROI of dimension 7 × 7 × 512 is flattened into a 25088-dimensional vector and passed through the two fully connected layers Fc2 to obtain the output Fc2(ROI); Fc2(ROI) is then passed through the classification layer FClass and the section fine-tuning layer FBBox respectively, giving the outputs FClass(Fc2(ROI)) and FBBox(Fc2(ROI)). The classification layer FClass is a fully connected layer with input vector length 512 and output vector length 5, and the section fine-tuning layer FBBox is a fully connected layer with input vector length 512 and output vector length 4. The network has two loss functions: the first error function loss1 computes the softmax error of FClass(Fc2(ROI)) − BClass, and the second error function loss2 computes the Euclidean distance error of (FBBox(Fc2(ROI)) − BBox); the overall loss function of the refining network = loss1 + loss2. The algorithm's iteration process is: first iterate 1000 times to converge the error function loss2, then iterate 1000 times to converge the overall loss function.
The two fully connected layers Fc2 have the structure: first layer, a fully connected layer with input vector length 25088 and output vector length 4096, with the relu activation function; second layer, a fully connected layer with input vector length 4096 and output vector length 512, with the relu activation function.
3. The shopping action recognition module, during initialization, first initializes the static action recognition classifier with standard hand action images, so that the static action recognition classifier can recognize grasping and putting-down actions of the hand; it then initializes the dynamic action recognition classifier with hand action videos, so that the dynamic action recognition classifier can recognize taking out an article, putting back an article, taking out and putting back, having taken out an article without putting it back, and suspicious stealing.
The initialization of the static action recognition classifier with standard hand action images proceeds as follows. First step, arrange the video data: a large number of videos of people shopping in supermarkets is chosen, covering taking out a product, putting back an article, taking out and putting back, having taken out an article without putting it back, and suspicious stealing. Each video segment is intercepted manually, using the frame in which the hand touches the goods as the start frame and the frame in which the hand leaves the goods as the end frame; for each frame of a video, its hand region is extracted with the target detection module, each frame image of the hand region is scaled to a 256 × 256 colour image, the scaled video is placed into the hand action video set, and the video is labelled as one of taking out an article, putting back an article, taking out and putting back, having taken out an article without putting it back, or suspicious stealing. For each video whose class is taking out an article, putting back an article, taking out and putting back, or having taken out an article without putting it back, the first frame of the video is placed into the hand action image set labelled as a grasp action, the last frame of the video is placed into the hand action image set labelled as a put-down action, and one frame chosen at random from the video other than the first and last frames is placed into the hand action image set labelled as other. This yields the hand action video set and the hand action image set. Second step, construct the static action recognition classifier StaticN. Third step, initialize the static action recognition classifier StaticN with the hand action image set constructed in the first step as input: if the image input each time is Handp, the output is StaticN(Handp) and the class label is yHandp, where yHandp is defined as grasp: yHandp = [1,0,0], put down: yHandp = [0,1,0], other: yHandp = [0,0,1]. The evaluation function of the network computes the cross-entropy loss of (StaticN(Handp) − yHandp), the convergence direction is minimization, and the number of iterations is 2000.
The construction static state action recognition classifier StaticN, network structure are as follows: first layer: convolutional layer inputs and is 256 × 256 × 3, exporting is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, input as 256 × 256 × 64, exporting is 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are connected in third dimension with third layer output 256 × 256 × 64, and exporting is 128 × 128 × 128;The Four layers: convolutional layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;5th Layer: convolutional layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: Pond layer inputs the 4th layer of output 128 × 128 × 128 and is connected in third dimension with layer 5 128 × 128 × 128, Output is 64 × 64 × 256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;9th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;Tenth layer: pond layer, input for layer 7 output 64 × 64 × 256 and the 9th layer 64 × 64 × 256 It is connected in third dimension, exporting is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, exports and is 32 × 32 × 512, port number channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 × 512, port number channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 × 512, port number channels=512;14th layer: pond layer inputs as eleventh floor output 32 × 32 × 512 and the 13rd Layer 32 × 32 × 512 is connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, exporting is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, input as 16 × 16 × 512, exporting is 16 × 16 × 512, port number channels=512;18th layer: pond layer is inputted and is exported for the 15th layer 16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th Layer: convolutional layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: Chi Hua Layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128;Second Floor 12: pond layer, inputting is 4 × 4 × 128, export as 2 × 2 × 128;23rd layer: the data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, so by full articulamentum first After input into full articulamentum, output vector length is 128, and activation primitive is relu activation primitive;24th layer: full connection Layer, input vector length are 128, and output vector length is 32, and activation primitive is relu activation primitive;25th layer: Quan Lian Layer is connect, input vector length is 32, and output vector length is 3, and activation primitive is soft-max activation primitive;All convolutional layers Parameter is size=3 convolution kernel kernel, and step-length stride=(1,1), activation primitive is relu activation 
primitive;All pond layers It is maximum pond layer, parameter is pond section size kernel_size=2, step-length stride=(2,2).
The initialization of the dynamic action recognition classifier with hand action videos proceeds as follows. First step, construct the data set: from the hand action video set constructed in the first step of the initialization of the static action recognition classifier with standard hand action images, 10 frames are uniformly extracted from each video as input. Second step, construct the dynamic action recognition classifier DynamicN. Third step, initialize the dynamic action recognition classifier DynamicN, whose input is the set of 10 frames extracted from each video in the first step: if the 10 frames input each time are Handv, the output is DynamicN(Handv) and the class label is yHandv, where yHandv is defined as taking out an article: yHandv = [1,0,0,0,0]; putting back an article: yHandv = [0,1,0,0,0]; taking out and putting back: yHandv = [0,0,1,0,0]; having taken out an article without putting it back: yHandv = [0,0,0,1,0]; and suspicious stealing: yHandv = [0,0,0,0,1]. The evaluation function of the network computes the cross-entropy loss of (DynamicN(Handv) − yHandv), the convergence direction is minimization, and the number of iterations is 2000.
The described uniform extraction of 10 frame images, method is as follows: for a section of video, let its length be Nf frames. The 1st frame of the video is extracted as the 1st frame of the extracted set, the last frame of the video is extracted as the 10th frame of the extracted set, and the ickt-th frame of the extracted set is the (⌊(ickt-1)(Nf-1)/9⌋+1)-th frame of the video, for ickt=2 to 9, where ⌊ ⌋ indicates taking the integer part.
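A small Python helper matching the extraction rule above (first and last frames anchored, interior frames placed by linear interpolation with the integer part taken):

    def uniform_frame_indices(nf, k=10):
        # 1-based frame numbers of the k frames extracted from a video of nf frames.
        return [int((i - 1) * (nf - 1) / (k - 1)) + 1 for i in range(1, k + 1)]

    # uniform_frame_indices(37) -> [1, 5, 9, 13, 17, 21, 25, 29, 33, 37]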
The construction dynamic action recognition classifier DynamicN, network structure are as follows:
First layer: convolutional layer, inputting is 256 × 256 × 30, and exporting is 256 × 256 × 512, port number channels= 512;The second layer: convolutional layer, inputting is 256 × 256 × 512, and exporting is 256 × 256 × 128, port number channels= 128;Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: convolutional layer, input It is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer inputs and is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, input the 4th Layer output 128 × 128 × 128 is connected in third dimension with layer 5 128 × 128 × 128, export as 64 × 64 × 256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The Eight layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: volume Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, it is defeated Enter and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, exporting is 32 × 32 ×512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels= 512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512; 13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth Four layers: pond layer inputs and exports 32 × 32 × 512 and the 13rd layer 32 × 32 × 512 in third dimension for eleventh floor It is connected, exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, export as 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, Port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, channel Number channels=512;18th layer: pond layer inputs and exports 16 × 16 × 512 and the 17th layer 16 × 16 for the 15th layer × 512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, Output is 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, export as 4 × 4×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128; Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length It is 128, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 128, and output vector is long Degree is 32, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 32, and output vector is long Degree is 3, and activation primitive is soft-max activation primitive;The parameter of all convolutional layers is size=3 convolution kernel kernel, step-length Stride=(1,1), activation primitive are relu activation primitive;All pond layers are maximum pond layer, parameter Chi Huaqu Between size kernel_size=2, step-length stride=(2,2).
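The 30-channel input of DynamicN's first layer presumably comes from concatenating the 10 extracted RGB frames along the channel dimension; a small numpy sketch of that assumption:

    import numpy as np

    def stack_frames(frames):
        # frames: list of the 10 extracted RGB images, each 256 x 256 x 3.
        # Returns one 256 x 256 x 30 array; the channels of frame i occupy positions 3i .. 3i+2.
        assert len(frames) == 10
        return np.concatenate([f.astype(np.float32) for f in frames], axis=2)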
4. Product identification module: during initialization, the product identification classifier is first initialized using the product image set of all angles, and a product list is generated from the product images.
The described initialization of the product identification classifier using the product image set of all angles, together with generating a product list from the product images, method is as follows: first step, construct the data set and the product list: the data set consists of the images of each product from all angles, and the product list listGoods is a vector in which each position corresponds to one product name; second step, construct the product identification classifier GoodsN; third step, initialize the constructed product identification classifier GoodsN: its input is the product image set of all angles; if the input image is Goods, the output is GoodsN(Goods) and the class label is yGoods, where yGoods is a vector whose length equals the number of products in the product list; yGoods is represented as follows: if the image Goods shows the product at the iGoods-th position, then the iGoods-th position of yGoods is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss on (GoodsN(Goods)-yGoods), the convergence direction is minimization, and the number of iterations is 2000.
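The initialization recipe shared by these classifiers (one-hot target, cross-entropy loss, minimization, a fixed iteration count) corresponds to a standard supervised training loop; a hedged PyTorch sketch, in which the learning rate, the optimizer and the batch handling are assumptions, and the final soft-max is folded into the loss:

    import torch
    import torch.nn as nn

    def init_classifier(model, images, labels, n_iter=2000, lr=1e-3):
        # images: float tensor of shape (N, 3, 256, 256); labels: long tensor of shape (N,)
        # holding the index of the 1 in each one-hot target vector y.
        # The cross-entropy loss is minimized for n_iter iterations.
        opt = torch.optim.Adam(model.parameters(), lr=lr)   # optimizer and lr are assumptions
        loss_fn = nn.CrossEntropyLoss()                      # applies soft-max internally
        for _ in range(n_iter):
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
        return model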
The construction product identification classifier GoodsN, two groups of GoodsN1 and GoodsN2 of network layer structure, wherein The network structure of GoodsN1 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, and exporting is 256 × 256 × 64, port number Channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, and exporting is 256 × 256 × 128, port number Channels=128;Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: volume Lamination, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolution Layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, It inputs the 4th layer of output 128 × 128 × 128 to be connected in third dimension with layer 5 128 × 128 × 128, exporting is 64 ×64×256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels= 256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The Nine layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond Change layer, inputs and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, export It is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;14th layer: pond layer, input for eleventh floor export 32 × 32 × 512 and the 13rd layer 32 × 32 × 512 are connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, input as 16 × 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, Output is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, output It is 16 × 16 × 512, port number channels=512;18th layer: pond layer, input for the 15th layer output 16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th layer: convolution Layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: pond layer, input It is 8 × 8 × 256, exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, export as 4 × 4 × 128, port number channels=128;The parameter of all convolutional layers be size=3 convolution kernel kernel, step-length stride=(1, 1), activation primitive is relu activation primitive;All pond layers are maximum pond layer, and parameter is pond section size Kernel_size=2, step-length stride=(2,2).The network structure of GoodsN2 are as follows: inputting is 4 × 4 × 128, first will be defeated The data entered are launched into the vector of 2048 dimensions, then input into first layer;First layer: full articulamentum, input vector length are 2048, output vector length is 1024, and activation primitive is relu activation primitive;The second layer: full articulamentum, input vector length are 1024, output vector length is 1024, and activation primitive is relu activation primitive;Third layer: full 
connected layer, input vector length is 1024, output vector length is len(listGoods), and the activation function is the soft-max activation function; len(listGoods) indicates the length of the product list. For any input Goods2, GoodsN(Goods2)=GoodsN2(GoodsN1(Goods2)).
5. Individual identification module: during initialization, the face feature extractor FaceN is first initialized using the face image set of all angles and μface is computed; then the human body feature extractor BodyN is initialized using the human body images of all angles and μbody is computed.
The described initialization of the face feature extractor FaceN using the face image set of all angles and the computation of μface, method is as follows: first step, choose face image sets of all angles to constitute the face data set; second step, construct the face feature extractor FaceN and initialize it using the face data set; third step:
For every person iPeop in the face data set, obtain the set FaceSet(iPeop) of all face images in the face data set that belong to iPeop:
For each face image Face(jiPeop) in FaceSet(iPeop):
Compute the face feature FaceN(Face(jiPeop));
Compute the average of all face features in the current face image set FaceSet(iPeop) as the center of the current person's face images, center(FaceN(Face(jiPeop))); then compute the distance between every face feature in FaceSet(iPeop) and this center; these distances constitute the distance set corresponding to iPeop.
For every person in the face data set, the corresponding distance set is obtained as above; after the distance set is sorted in ascending order, let its length be ndiset; μface is then determined from the sorted distance set, where ⌊ ⌋ indicates taking the integer part.
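A numpy sketch of the per-person center and distance-set bookkeeping described above; since the exact index formula for reading μface off the sorted distances did not survive in the text, the sketch stops at the sorted distance set:

    import numpy as np

    def person_distance_set(features):
        # features: (n_images, d) array of FaceN features for one person.
        # Returns the distances from each feature to the person's feature centre, sorted
        # ascending; mu_face is then read off the sorted distance sets of all persons at the
        # index given by the integer-part formula of the patent text.
        center = features.mean(axis=0)
        return np.sort(np.linalg.norm(features - center, axis=1))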
The construction face characteristic extractor FaceN is simultaneously initialized using face data set, if human face data collection By NfacesetIndividual is constituted, and network layer structure FaceN25 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, is exported and is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, export as 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are defeated with third layer 256 × 256 × 64 are connected in third dimension out, and exporting is 128 × 128 × 128;4th layer: convolutional layer inputs and is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer, inputting is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, the 4th layer of input are defeated 128 × 128 × 128 are connected in third dimension with layer 5 128 × 128 × 128 out, and exporting is 64 × 64 × 256;The Seven layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;8th layer: volume Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: convolutional layer, it is defeated Entering is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, inputting is Seven layers of output 64 × 64 × 256 are connected in third dimension with the 9th layer 64 × 64 × 256, and exporting is 32 × 32 × 512; Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth Two layers: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;13rd layer: Convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;14th layer: Chi Hua Layer inputs and is connected in third dimension for eleventh floor output 32 × 32 × 512 with the 13rd layer 32 × 32 × 512, defeated It is out 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, and exporting is 16 × 16 × 512, port number Channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;18th layer: pond layer, input for the 15th layer export 16 × 16 × 512 and the 17th layer 16 × 16 × 512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, defeated It is out 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 ×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128; Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length It is 512, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 512, and output vector is long Degree is 512, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 512, output vector Length 
is Nfaceset, and the activation function is the soft-max activation function; the parameters of all convolutional layers are convolution kernel size kernel=3, stride=(1,1), and relu activation function; all pooling layers are maximum pooling layers, with pooling window size kernel_size=2 and stride=(2,2). The initialization procedure is as follows: for each face image face4, the output is FaceN25(face4) and the class label is yface, where yface is a vector of length Nfaceset; yface is represented as follows: if the face face4 belongs to the iface4-th person in the face image set, then the iface4-th position of yface is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss on (FaceN25(face4)-yface), the convergence direction is minimization, and the number of iterations is 2000; after the iterations, the face feature extractor FaceN consists of the FaceN25 network from the first layer to the 24th layer.
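A minimal PyTorch illustration of "keep layers 1 to 24 as the feature extractor", under the simplifying assumption that the trained FaceN25 is exposed as an nn.Sequential of its 25 layers (the actual network also has the concatenation connections described above, which a plain Sequential does not capture):

    import torch.nn as nn

    def to_feature_extractor(face_n25):
        # Drop the 25th (soft-max classification) layer so the network outputs the
        # 512-dimensional vector of layer 24 instead of the N_faceset-way class scores.
        return nn.Sequential(*list(face_n25.children())[:-1])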
The described initialization of the human body feature extractor BodyN using the human body images of all angles and the computation of μbody, method is as follows: first step, choose human body image sets of all angles to constitute the human body data set; second step, construct the human body feature extractor BodyN and initialize it using the human body data set; third step:
For every person iPeop1 in the human body data set, obtain the set BodySet(iPeop1) of all human body images in the human body data set that belong to iPeop1:
For each human body image Body(jiPeop1) in BodySet(iPeop1):
Compute the human body feature BodyN(Body(jiPeop1));
Compute the average of all human body features in the current human body image set BodySet(iPeop1) as the center of the current person's body images, center(BodyN(Body(jiPeop1))); then compute the distance between every body feature in BodySet(iPeop1) and this center; these distances constitute the distance set corresponding to iPeop1.
For every person in the human body data set, the corresponding distance set is obtained as above; after the distance set is sorted in ascending order, let its length be ndiset1; μbody is then determined from the sorted distance set, where ⌊ ⌋ indicates taking the integer part.
Construction characteristics of human body's extractor BodyN and user's volumetric data set is initialized, if somatic data collection By NbodysetIndividual is constituted, and network layer structure BodyN25 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, is exported and is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, export as 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are defeated with third layer 256 × 256 × 64 are connected in third dimension out, and exporting is 128 × 128 × 128;4th layer: convolutional layer inputs and is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer, inputting is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, the 4th layer of input are defeated 128 × 128 × 128 are connected in third dimension with layer 5 128 × 128 × 128 out, and exporting is 64 × 64 × 256;The Seven layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;8th layer: volume Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: convolutional layer, it is defeated Entering is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, inputting is Seven layers of output 64 × 64 × 256 are connected in third dimension with the 9th layer 64 × 64 × 256, and exporting is 32 × 32 × 512; Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth Two layers: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;13rd layer: Convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;14th layer: Chi Hua Layer inputs and is connected in third dimension for eleventh floor output 32 × 32 × 512 with the 13rd layer 32 × 32 × 512, defeated It is out 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, and exporting is 16 × 16 × 512, port number Channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;18th layer: pond layer, input for the 15th layer export 16 × 16 × 512 and the 17th layer 16 × 16 × 512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, defeated It is out 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 ×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128; Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length It is 512, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 512, and output vector is long Degree is 512, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 512, output vector Length 
is Nbodyset, and the activation function is the soft-max activation function; the parameters of all convolutional layers are convolution kernel size kernel=3, stride=(1,1), and relu activation function; all pooling layers are maximum pooling layers, with pooling window size kernel_size=2 and stride=(2,2). The initialization procedure is as follows: for each human body image body4, the output is BodyN25(body4) and the class label is ybody, where ybody is a vector of length Nbodyset; ybody is represented as follows: if the body body4 belongs to the ibody4-th person in the human body image set, then the ibody4-th position of ybody is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss on (BodyN25(body4)-ybody), the convergence direction is minimization, and the number of iterations is 2000; after the iterations, the human body feature extractor BodyN consists of the BodyN25 network from the first layer to the 24th layer.
6. The recognition result processing module does not operate during initialization.
Embodiment 2:
The present embodiment realizes the detection process of a supermarket intelligent vending system.
1. Image preprocessing module, in the detection process: first step, mean denoising is performed on the monitoring image captured by the monitoring camera, thereby obtaining the denoised monitoring image; second step, illumination compensation is performed on the denoised monitoring image, thereby obtaining the illumination-compensated image; third step, image enhancement is performed on the illumination-compensated image, and the enhanced data is passed to the target detection module.
The monitoring image that the monitoring camera is taken the photograph carries out mean denoising, and method is: setting monitoring camera and is taken the photograph Monitoring image be Xsrc, because of XsrcFor color RGB image, therefore there are Xsrc-R, Xsrc-G, Xsrc-BThree components, for each A component Xsrc', it proceeds as follows respectively: the window of one 3 × 3 dimension being set first, considers image Xsrc' each pixel Point Xsrc' (i, j), it is respectively [X that pixel value corresponding to matrixes is tieed up in 3 × 3 put centered on the pointsrc' (i-1, j-1), Xsrc′ (i-1, j), Xsrc' (i-1, j+1), Xsrc' (i, j-1), Xsrc' (i, j), Xsrc' (i, j+1), Xsrc' (i+1, j-1), Xsrc′(i+ 1, j), Xsrc' (j+1, j+1)] it is arranged from big to small, take it to come intermediate value as image X after denoisingsrc" pixel (i, J) value is assigned to X after corresponding filteringsrc" (i, j);For Xsrc' boundary point, it may appear that its 3 × 3 dimension window corresponding to The case where certain pixels are not present, then the median for falling in existing pixel in window need to be only calculated, if window Interior is even number point, is assigned to X for the average value for coming intermediate two pixel values as the pixel value after pixel denoisingsrc″ (i, j), thus, new image array XsrcIt " is XsrcImage array after the denoising of current RGB component, for Xsrc-R, Xsrc-G, Xsrc-BAfter three components carry out denoising operation respectively, the X that will obtainsrc-R", Xsrc-c", Xsrc-B" component, by this three A new component is integrated into a new color image XDenResulting image after as denoising.
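The operation described above (sorting the 3×3 window and taking the middle value, averaging the two middle values when the count is even) is in effect a median filter; a numpy sketch per channel, for illustration only:

    import numpy as np

    def median_denoise_channel(x):
        # x: one colour channel (2-D array).  3x3 window shrunk at the borders; the sorted
        # middle value is taken, and the two middle values are averaged when the count is even.
        m, n = x.shape
        out = np.empty((m, n), dtype=np.float64)
        for i in range(m):
            for j in range(n):
                win = np.sort(x[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2].ravel())
                k = win.size
                out[i, j] = win[k // 2] if k % 2 else (win[k // 2 - 1] + win[k // 2]) / 2.0
        return out

    def median_denoise_rgb(img):
        # The R, G and B components are filtered independently and reassembled.
        return np.stack([median_denoise_channel(img[..., c]) for c in range(3)], axis=2)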
The described illumination compensation of the denoised monitoring image: let the denoised monitoring image be XDen; because XDen is a colour RGB image, XDen has three RGB components; illumination compensation is performed on each component XDen' separately, and the resulting components Xcpst' are then integrated into the colour RGB image Xcpst, which is XDen after illumination compensation. The steps of performing illumination compensation on each component XDen' are: first step, let XDen' have m rows and n columns; construct XDensum and NumDen as matrices of the same m rows and n columns with initial value 0; the window size is l and the step length is s, where the function min(m, n) takes the minimum of m and n, ⌊ ⌋ indicates taking the integer part, sqrt(l) indicates the square root of l, and l=1 if l<1; second step, let the top-left coordinate of XDen' be (1,1); starting from coordinate (1,1), determine each candidate frame according to the window size l and the step length s, where a candidate frame is the region defined by [(a, b), (a+l, b+l)]; perform histogram equalization on the image matrix of XDen' within the candidate frame region, obtaining the equalized image matrix XDen″ of the candidate region [(a, b), (a+l, b+l)]; then, for each element of XDensum in the corresponding region [(a, b), (a+l, b+l)], compute XDensum(a+iXsum, b+jXsum)=XDensum(a+iXsum, b+jXsum)+XDen″(iXsum, jXsum), where (iXsum, jXsum) are integers with 1≤iXsum≤l and 1≤jXsum≤l, and add 1 to each element of NumDen in the corresponding region [(a, b), (a+l, b+l)]; finally, compute Xcpst'(iXsumNum, jXsumNum)=XDensum(iXsumNum, jXsumNum)/NumDen(iXsumNum, jXsumNum), where (iXsumNum, jXsumNum) ranges over every point of XDen', and the resulting Xcpst' is the illumination-compensated present component XDen'.
Described is that l and step-length s determines each candidate frame according to window size, be the steps include:
If monitoring image is m row n column, (a, b) is the top left co-ordinate in selected region, and (a+l, b+l) is selection area Bottom right angular coordinate, which is indicated that the initial value of (a, b) is (1,1) by [(a, b), (a+l, b+l)];
While a+l≤m:
b=1;
While b+l≤n:
Selected region is [(a, b), (a+l, b+l)];
b=b+s;
Interior loop terminates;
a=a+s;
Outer loop terminates;
In the above process, selected region [(a, b), (a+l, b+l)] is candidate frame every time.
The described histogram equalization of the image matrix of XDen' within the candidate frame region: let the candidate frame region be the region defined by [(a, b), (a+l, b+l)], and let XDen″ be the image information of XDen' within the region [(a, b), (a+l, b+l)]; the steps are: first step, construct the vector I, where I(iI) is the number of pixels in XDen″ whose value equals iI, 0≤iI≤255; second step, compute the vector I′, where I′(iI)=⌊255×(I(0)+I(1)+…+I(iI))/(I(0)+I(1)+…+I(255))⌋; third step, for each point (iXDen, jXDen) of XDen″ with pixel value XDen″(iXDen, jXDen), compute XDen″(iXDen, jXDen)=I′(XDen″(iXDen, jXDen)). After the values of all pixels in the image XDen″ have been computed and replaced, the histogram equalization process ends, and the content now saved in XDen″ is the result of the histogram equalization.
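Putting the pieces above together, a numpy sketch of the per-channel illumination compensation (overlapping windows are equalized, accumulated in XDensum, counted in NumDen, and averaged); the window-size and step formulas are only partially legible in the text, so plausible stand-ins l=⌊sqrt(min(m, n))⌋ and s=⌊sqrt(l)⌋ are assumed here:

    import numpy as np

    def equalize_window(patch):
        # Histogram equalization of one window, following the I / I' construction above.
        hist = np.bincount(patch.astype(np.uint8).ravel(), minlength=256)
        cdf = np.cumsum(hist)
        lut = np.floor(255.0 * cdf / cdf[-1]).astype(np.uint8)
        return lut[patch.astype(np.uint8)]

    def illumination_compensate_channel(x):
        # x: one channel (2-D uint8 array) of the denoised image.  Overlapping windows of
        # side l are equalized, accumulated in acc (XDensum), counted in cnt (NumDen), and
        # the compensated channel is the element-wise quotient.
        m, n = x.shape
        l = max(int(np.sqrt(min(m, n))), 1)   # assumed window-size formula
        s = max(int(np.sqrt(l)), 1)           # assumed step-length formula
        acc = np.zeros((m, n), dtype=np.float64)
        cnt = np.zeros((m, n), dtype=np.float64)
        a = 0
        while a + l <= m:
            b = 0
            while b + l <= n:
                acc[a:a + l, b:b + l] += equalize_window(x[a:a + l, b:b + l])
                cnt[a:a + l, b:b + l] += 1.0
                b += s
            a += s
        cnt[cnt == 0] = 1.0                   # pixels never covered by a window are left as 0
        return (acc / cnt).astype(np.uint8)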
The described image enhancement of the illumination-compensated image: let the illumination-compensated image be Xcpst, whose RGB channels are XcpstR, XcpstG, XcpstB respectively, and let Xenh be the image obtained after applying image enhancement to Xcpst. The steps of the image enhancement are: first step, for each of XcpstR, XcpstG, XcpstB, compute the image obtained after blurring it at the specified scale; second step, construct matrices LXenhR, LXenhG, LXenhB with the same dimensions as XcpstR; for the R channel of the RGB channels of image Xcpst, compute LXenhR(i, j)=log(XcpstR(i, j))-LXcpstR(i, j), where (i, j) ranges over all points of the image matrix; for the G channel and the B channel of the RGB channels of image Xcpst, obtain LXenhG and LXenhB with the same algorithm as for the R channel; third step, for the R channel of the RGB channels of image Xcpst, compute the mean MeanR and the mean square deviation (standard deviation) VarR of all values in LXenhR, compute MinR=MeanR-2×VarR and MaxR=MeanR+2×VarR, and then compute XenhR(i, j)=Fix((LXenhR(i, j)-MinR)/(MaxR-MinR)×255), where Fix indicates taking the integer part, values < 0 are assigned 0, and values > 255 are assigned 255; for the G channel and the B channel of the RGB channels, obtain XenhG and XenhB with the same algorithm as for the R channel; the XenhR, XenhG, XenhB belonging to the respective RGB channels are integrated into one colour image Xenh.
The described computation, for each of XcpstR, XcpstG, XcpstB, of the image obtained after blurring at the specified scale: for the R channel XcpstR of the RGB channels, the steps are: first step, define the Gaussian function G(x, y, σ)=k×exp(-(x²+y²)/σ²), where σ is the scale parameter and k=1/∫∫G(x, y)dxdy; then, for each point XcpstR(i, j) of XcpstR, compute LXcpstR(i, j)=Fix((XcpstR⊗G(x, y, σ))(i, j)), where ⊗ indicates the convolution operation; for points whose distance to the boundary is less than the scale σ, only the convolution of XcpstR with the corresponding part of G(x, y, σ) is computed; Fix() indicates taking the integer part, values < 0 are assigned 0, and values > 255 are assigned 255. For the G channel and the B channel of the RGB channels, LXcpstG and LXcpstB are obtained with the same algorithm as for the R channel.
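A numpy/scipy sketch of the per-channel enhancement in the spirit of the description above (single-scale Retinex: log of the image minus log of its Gaussian blur, followed by the mean ± 2×standard-deviation stretch); the text is ambiguous about whether the blurred image is log-transformed, and the σ value is an assumption:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def enhance_channel(x, sigma=80.0):
        # x: one channel of the illumination-compensated image, values in [0, 255].
        x = x.astype(np.float64) + 1.0                 # +1 avoids log(0)
        lx = np.log(x) - np.log(gaussian_filter(x, sigma=sigma))
        lo = lx.mean() - 2.0 * lx.std()
        hi = lx.mean() + 2.0 * lx.std()
        out = np.floor((lx - lo) / max(hi - lo, 1e-12) * 255.0)
        return np.clip(out, 0, 255).astype(np.uint8)

    def enhance_rgb(img):
        # Each RGB channel is enhanced independently and the results are reassembled.
        return np.stack([enhance_channel(img[..., c]) for c in range(3)], axis=2)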
2. module of target detection receives image pre-processing module and transmits the image come, then to it in the detection process Handled, to each frame image using algorithm of target detection carry out target detection, obtain present image human body image region, Then hand region and product area are sent to shopping action recognition mould by face facial area, hand region and product area Human body image region and face facial area, are sent to individual identification module by block, and product area is passed to product identification mould Block;
Described carries out target detection using algorithm of target detection to each frame image, obtains the human body image of present image Region, face facial area, hand region and product area, the steps include:
The first step, by input picture XcpstIt is divided into the subgraph of 768 × 1024 dimensions;
Second step, for each subgraph Xs:
2.1st step is converted using the feature extraction depth network Fconv constructed in initialization, obtains 512 spies Levy subgraph set Fconv (Xs);
2.2nd step, apply to Fconv(Xs) the first layer Conv1 of the region selection network, the second layer Conv2-1 followed by the softmax activation function, and Conv2-2, respectively obtaining the outputs softmax(Conv2-1(Conv1(Fconv(Xs)))) and Conv2-2(Conv1(Fconv(Xs))); then obtain all preliminary candidate sections from these output values;
2.3rd step, for all preliminary candidate sections of all subgraphs of current frame image:
2.3.1 step, is chosen according to the score size in its current candidate region, chooses maximum 50 preliminary candidates Section is as candidate region;
2.3.2 step adjusts candidate section of crossing the border all in candidate section set, then weeds out weight in candidate section Folded frame, to obtain final candidate section;
2.3.3 step, input the subgraph Xs and each final candidate section into the ROI layer to obtain the corresponding ROI output; if the current final candidate section is (aBB(1), bBB(2), lBB(3), wBB(4)), compute FBBox(Fc2(ROI)) to obtain four outputs (OutBB(1), OutBB(2), OutBB(3), OutBB(4)) and thus the updated coordinates (aBB(1)+8×OutBB(1), bBB(2)+8×OutBB(2), lBB(3)+8×OutBB(3), wBB(4)+8×OutBB(4)); then compute the output of FClass(Fc2(ROI)): if its first position is the maximum the current section is a human body image region, if its second position is the maximum the current section is a human face facial region, if its third position is the maximum the current section is a hand region, if its fourth position is the maximum the current section is a product region, and if its fifth position is the maximum the current section is a negative sample region and the final candidate section is deleted. Third step, update the coordinates of the refined final candidate sections of all subgraphs; the method of the update is: let the coordinates of the current candidate region be (TLx, TLy, RBx, RBy) and the top-left coordinate of the corresponding subgraph be (Seasub, Sebsub); the updated coordinates are (TLx+Seasub-1, TLy+Sebsub-1, RBx, RBy).
The described division of the input image Xcpst into subgraphs of 768×1024 dimensions, the steps are (a sketch of this tiling follows the loop below): let the step lengths of the division be 384 and 512, let the input image have m rows and n columns, and let (asub, bsub) be the top-left coordinate of the selected region, with initial value (1,1);
While asub < m:
bsub=1;
While bsub < n:
The selected region is [(asub, bsub), (asub+768, bsub+1024)]; the information of the image region of the input image Xcpst corresponding to this section is copied into a new subgraph, and the top-left coordinate (asub, bsub) is attached as its location information; if the selected region extends beyond the input image Xcpst, the RGB pixel values of the pixels beyond the range are assigned 0;
bsub=bsub+512;
Interior loop terminates;
asub=asub+384;
Outer loop terminates;
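A numpy sketch of the tiling loop above, with zero padding for regions that extend beyond the input image:

    import numpy as np

    def split_into_subimages(img, tile_h=768, tile_w=1024, step_h=384, step_w=512):
        # img: m x n x 3 image.  Returns (subimage, (a_sub, b_sub)) pairs, where
        # (a_sub, b_sub) is the 1-based top-left coordinate attached as location information.
        m, n = img.shape[:2]
        tiles = []
        a = 0
        while a < m:
            b = 0
            while b < n:
                tile = np.zeros((tile_h, tile_w, img.shape[2]), dtype=img.dtype)
                part = img[a:a + tile_h, b:b + tile_w]
                tile[:part.shape[0], :part.shape[1]] = part   # out-of-range pixels stay 0
                tiles.append((tile, (a + 1, b + 1)))
                b += step_w
            a += step_h
        return tiles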
Described obtains all preliminary candidate sections in the section, method according to output valve are as follows: step 1: for softmax(Conv2-1(Conv1(Fconv(Xs)))) its output be 48 × 64 × 18, for Conv2-2(Conv1(Fconv (Xs))), output is 48 × 64 × 36, for any point (x, y) on 48 × 64 dimension spaces, softmax (Conv2-1 (Conv1(Fconv(Xs)))) (x, y) be 18 dimensional vector II, Conv2-2(Conv1(Fconv(Xs))) (x, y) be 36 dimensional vectors IIII, if II (2i-1) > II (2i), for i value from 1 to 9, lOtrFor Roi(xOtr, yOtr) third position, wOtrFor Roi (xOtr, yotr) the 4th, then preliminary candidate section be [II (2i-1), (8 × IIII (4i-3)+x, 8 × IIII (4i-2)+y, lOtr× IIII (4i-1), wOtr× IIII (4i))], wherein the score in first II (2i-1) expression current candidate region, second Position (8 × IIII (4i-3)+x, 8 × IIII (4i-2)+y, IIII (4i-1), IIII (4i)) indicates the center in current candidate section Point is (8 × IIII (4i-3)+x, 8 × IIII (4i-2)+y), and the long half-breadth of the half of candidate frame is respectively lOtr× IIII (4i-1) and wOtr×IIII(4i))。
The described adjustment of all out-of-bounds candidate sections in the candidate section set, method is as follows: let the monitoring image have m rows and n columns; for each candidate section with centre point (ach, bch) and half-length and half-width lch and wch respectively: if ach+lch > m, compute a′ch=(ach-lch+m)/2 and l′ch=(m-ach+lch)/2, and then update ach=a′ch, lch=l′ch; if bch+wch > n, compute b′ch=(bch-wch+n)/2 and w′ch=(n-bch+wch)/2, and then update bch=b′ch, wch=w′ch.
Described weeds out the frame being overlapped in candidate section, the steps include:
If candidate section set is not sky:
The maximum candidate section i of score is taken out from the set of candidate sectionout:
Compute the coincidence factor between candidate section iout and each candidate section ic in the candidate section set; if the coincidence factor > 0.7, delete candidate section ic from the candidate section set;
By candidate section ioutIt is put into the candidate section set of output;
When candidate section set is empty, exporting candidate section contained in candidate section set is to weed out candidate regions Between middle overlapping frame after obtained candidate section set.
The described computation of the coincidence factor between candidate section iout and each candidate section ic in the candidate section set, method is as follows: let candidate section ic have centre point coordinates (aic, bic) and half-length and half-width lic and wic respectively, and let candidate section iout have centre point coordinates (aiout, biout) and half-length and half-width liout and wiout respectively; compute xA=max(aic, aiout), yA=max(bic, biout), xB=min(lic, liout), yB=min(wic, wiout); if |aic-aiout|≤lic+liout-1 and |bic-biout|≤wic+wiout-1, an overlapping region exists and overlapping region=(lic+liout-1-|aic-aiout|)×(wic+wiout-1-|bic-biout|); otherwise overlapping region=0; compute whole region=(2lic-1)×(2wic-1)+(2liout-1)×(2wiout-1)-overlapping region; the coincidence factor=overlapping region/whole region.
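A Python sketch of the pruning loop and the coincidence factor defined above (boxes are given as score plus centre point plus half-length and half-width):

    def coincidence(box1, box2):
        # box = (score, a, b, l, w): centre point (a, b), half-length l, half-width w.
        _, a1, b1, l1, w1 = box1
        _, a2, b2, l2, w2 = box2
        if abs(a1 - a2) > l1 + l2 - 1 or abs(b1 - b2) > w1 + w2 - 1:
            inter = 0.0
        else:
            inter = (l1 + l2 - 1 - abs(a1 - a2)) * (w1 + w2 - 1 - abs(b1 - b2))
        whole = (2 * l1 - 1) * (2 * w1 - 1) + (2 * l2 - 1) * (2 * w2 - 1) - inter
        return inter / whole

    def prune_overlaps(candidates, thr=0.7):
        # Greedy removal of overlapping candidate boxes, highest score first.
        remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
        kept = []
        while remaining:
            best = remaining.pop(0)
            kept.append(best)
            remaining = [c for c in remaining if coincidence(best, c) <= thr]
        return kept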
3. action recognition module of doing shopping, in the detection process: the first step makes each the hand region information received It is identified with static action recognition classifier, recognition methods are as follows: set the image inputted each time as Handp1, export and be StaticN (Handp1) is 3 bit vectors, is identified as grasping if first maximum, if second maximum is identified as putting down, if Third position maximum is then identified as other;Second step carries out mesh to current grasp motion corresponding region after recognizing grasp motion Mark tracking, if being using the recognition result of static action recognition classifier corresponding to the next frame tracking box of current hand region When putting down movement, target following terminates, by it is currently available since recognize grasp motion be video, recognize and put down movement Terminate for video, is complete video by the video marker to obtain the continuous videos of hand motion.If being tracked during tracking It loses, then terminates currently available since recognizing grasp motion and being video, from the image before tracking loss as video, It is then the video of only grasp motion by the video marker to obtain the video of only grasp motion;When recognize put down it is dynamic Make, and the movement illustrates that the grasp motion of the movement is lost, then with present image not in the obtained image of target following Corresponding hand region terminates for video, is carried forward tracking since present frame using method for tracking target, until tracking is lost It loses, then start frame of the next frame of lost frames as video, is the video for only putting down movement by the video marker.Third step, The obtained complete video of second step is identified using dynamic action recognition classifier, recognition methods are as follows: set defeated each time The image entered is Handv1, and exporting as DynamicN (Handv1) is 5 bit vectors, is identified as extract if first maximum Product are identified as putting back to article if second maximum, put back to again if third position maximum is identified as taking out, if the 4th maximum It is identified as having taken out article and not put back to, the movement of suspicious stealing is identified as if the 5th maximum, then sends out the recognition result Recognition result processing module is given, the video for by the video of only grasp motion and only putting down movement is sent at recognition result Module is managed, the video of complete video and only grasp motion is sent to product identification module and individual identification module.
The described target tracking of the region corresponding to the current grasping action after a grasping action is recognized, method is as follows: let the image of the currently recognized grasping action be Hgrab; the current tracking region is the region corresponding to the image Hgrab. First step, extract the ORB feature ORBHgrab of the image Hgrab; second step, for the images corresponding to all hand regions in the next frame after Hgrab, compute their ORB features to obtain an ORB feature set, and delete the ORB features already chosen by other tracking boxes; third step, compare ORBHgrab with each value of the ORB feature set by their Hamming distance, and choose the ORB feature with the smallest Hamming distance to ORBHgrab as the chosen ORB feature; if the similarity between the chosen ORB feature and ORBHgrab is > 0.85, where similarity=(1 - Hamming distance of the two ORB features/ORB feature length), the hand region corresponding to the chosen ORB feature is the tracking box of the image Hgrab in the next frame; otherwise the tracking is lost.
The ORB feature: the method of extracting ORB features from an image is relatively mature and is already implemented in the OpenCV computer vision library; extracting the ORB features of a picture takes the current image as input and outputs several character strings of identical length, each of which represents one ORB feature.
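For illustration, ORB descriptors can be obtained with OpenCV as below; the text treats each region as having a single ORB feature string, so aggregating the many per-keypoint descriptors into one region-level Hamming similarity is an assumption of this sketch:

    import cv2
    import numpy as np

    def orb_descriptors(region_bgr):
        # Extract ORB descriptors for a region image; returns an (n, 32) uint8 array
        # (each row is one 256-bit descriptor) or None if nothing was detected.
        orb = cv2.ORB_create()
        gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
        _, des = orb.detectAndCompute(gray, None)
        return des

    def region_similarity(des1, des2):
        # Hamming-based similarity in [0, 1]: 1 - distance/256 per matched descriptor,
        # averaged over the cross-checked matches (the averaging is an assumption).
        if des1 is None or des2 is None:
            return 0.0
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
        if not matches:
            return 0.0
        return float(np.mean([1.0 - m.distance / 256.0 for m in matches]))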
Described terminates using the corresponding hand region of present image as video, using method for tracking target since present frame Be carried forward tracking, until tracking is lost, method are as follows: set the image for putting down movement currently recognized as Hdown, currently with Track region is region corresponding to image Hdown.
If not tracking loss:
The first step extracts the ORB feature ORBHdown of the image Hdown; since this feature has already been computed during the target tracking of the grasping-action region described above, it does not need to be computed again here;
Second step, for the corresponding image of all hand regions in the former frame of image Hdown calculate its ORB feature from And ORB characteristic set is obtained, and delete the ORB feature chosen by other tracking box;
Third step, compare ORBHdown with each value of the ORB feature set by their Hamming distance, and choose the ORB feature with the smallest Hamming distance to ORBHdown as the chosen ORB feature; if the similarity between the chosen ORB feature and ORBHdown is > 0.85, where similarity=(1 - Hamming distance of the two ORB features/ORB feature length), the hand region corresponding to the chosen ORB feature is the tracking box of the image Hdown in the next frame; otherwise the tracking is lost and the algorithm ends.
4. product identification module, in the detection process, the first step, according to the transmitting of shopping action recognition module come complete view The video of frequency and only grasp motion, first in the obtained position of module of target detection according to corresponding to current video first frame It sets, the inputted video image of the position is detected forward from current video first frame, detect the frame that the region is not blocked, It is finally identified the image in region corresponding to frame as the input of product identification classifier, to obtain current production Recognition result, recognition methods are as follows: set the image inputted each time as Goods1, export for GoodsN (Goods1) be one to Amount, if the i-th of the vectorgoodsPosition is maximum, then shows that current recognition result is i-th in product listgoodsThe product of position, will Recognition result is sent to recognition result processing module;
The described detection, going forward from the first frame of the current video, of a frame in which the region given by the target detection module for the first frame of the current video is not occluded, method is as follows: let the position obtained by the target detection module for the first frame of the current video be (agoods, bgoods, lgoods, wgoods), and let the first frame of the current video be the icrgs-th frame; process frame icr=icrgs: first step, obtain all detection regions of the icr-th frame from the target detection module as Taskicr; second step, for each region box (atask, btask, ltask, wtask) in Taskicr, compute its distance dgt=(atask-agoods)²+(btask-bgoods)²-(ltask+lgoods)²-(wtask+wgoods)². If no distance < 0 exists, the region (agoods, bgoods, lgoods, wgoods) of the icr-th frame is the detected frame in which the region is not occluded, and the algorithm ends; otherwise, if a distance < 0 exists, record d(icr)=the minimum distance in the distance list d and set icr=icr-1; if icr > 0, the algorithm jumps to the first step; if icr ≤ 0, choose the record with the maximum value in the distance list d, and the region (agoods, bgoods, lgoods, wgoods) of the frame corresponding to that record is taken as the detected frame in which the region is not occluded; the algorithm ends.
5. individual identification module when user enters supermarket, is obtained currently in the detection process by module of target detection The image Face1 of human region Body1 and the face in human region, then respectively using characteristics of human body's extractor BodyN and Face characteristic extractor FaceN extracts characteristics of human body BodyN (Body1) and face characteristic FaceN (Face1), saves BodyN (Body1) in BodyFtu set, FaceN (Face1) is saved in FaceFtu set, and saves the ID letter of existing customer Breath, id information can be the unduplicated number that user is randomly assigned when either user enters supermarket in the account of supermarket, ID Information is used to distinguish different customers, whenever there is customer to enter supermarket, then extracts its characteristics of human body and face characteristic;When being used in supermarket When the mobile product of family, according to shopping action recognition module transmitting come complete video and only grasp motion video, search out Its corresponding human region and human face region carry out people using face feature extractor FaceN and characteristics of human body's extractor BodyN Face identification or human bioequivalence mode obtain the ID of customer corresponding to the video that currently transmitting of shopping action recognition module comes.
It is described according to the transmitting of shopping action recognition module come complete video and only grasp motion video, search out Its corresponding human region and human face region carry out people using face feature extractor FaceN and characteristics of human body's extractor BodyN Face identification or human bioequivalence mode obtain the ID of customer corresponding to the video that currently transmitting of shopping action recognition module comes.Its Process are as follows: according to shopping action recognition module transmitting come video, begin look for from the first frame of video to corresponding human body area Domain and human face region, until algorithm terminates or handled the last frame of video:
Corresponding human region image Body2 and human face region image Face2 are used into characteristics of human body's extractor respectively BodyN and face characteristic extractor FaceN extracts characteristics of human body BodyN (Body2) and face characteristic FaceN (Face2);
Then the face identification information is used first: compute the Euclidean distance dFace between FaceN(Face2) and every face feature in the FaceFtu set, and select the feature in the FaceFtu set with the smallest Euclidean distance, say FaceN(Face3); if dFace < μface, the current face image is identified as belonging to the customer of the face image corresponding to FaceN(Face3), that customer's ID is the ID corresponding to the video action passed by the shopping action recognition module, and the current identification process ends;
If dFace ≥ μface, the current individual cannot be identified by the face identification method alone; in that case compute the Euclidean distance dBody between BodyN(Body2) and every human body feature in the BodyFtu set, and select the feature in the BodyFtu set with the smallest Euclidean distance, say BodyN(Body3); if dBody+dFace < μface+μbody, the current human body image is identified as belonging to the customer of the human body image corresponding to BodyN(Body3), and that customer's ID is the ID corresponding to the video action passed by the shopping action recognition module.
If no ID corresponding to the video action has been found after all frames of the video have been processed, then, in order to avoid wrongly identifying the shopping subject and thereby producing an erroneous billing record, the video passed by the current shopping action recognition module is not processed further.
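A numpy sketch of the two-stage face/body lookup described above; the reuse of the minimum face distance dFace in the combined test follows the text, while the array and ID bookkeeping are assumptions:

    import numpy as np

    def identify_customer(face_ftu, body_ftu, ids, face2, body2, mu_face, mu_body):
        # face_ftu: (N, d_f) stored face features; body_ftu: (N, d_b) stored body features;
        # ids: list of the N customer IDs saved when the customers entered the supermarket;
        # face2, body2: features of the current face / body region.
        d_face = np.linalg.norm(face_ftu - face2, axis=1)
        i_face = int(np.argmin(d_face))
        if d_face[i_face] < mu_face:
            return ids[i_face]                    # identified by the face alone
        d_body = np.linalg.norm(body_ftu - body2, axis=1)
        i_body = int(np.argmin(d_body))
        if d_body[i_body] + d_face[i_face] < mu_face + mu_body:
            return ids[i_body]                    # identified by the combined face + body test
        return None                               # no reliable identity for this frame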
It is described according to the transmitting of shopping action recognition module come video, begin look for from the first frame of video to corresponding Human region and human face region, method are as follows: according to the transmitting of shopping action recognition module come video, from the first frame of video into Row processing.If currently processed to i-thfRgFrame, if it is (a that the frame, which corresponds to video in the obtained position of module of target detection,ifRg, bifRg, lifRg, wifRg), the frame is corresponding to be combined into BodyFrameSet in the obtained human region collection of module of target detectionifRg Human region collection is combined into FaceFrameSetifRg, for BodyFrameSetifRgEach of human region (aBFSifRg, bBFsifRg, lBFSifRg, wBFSifRg), calculate its distance dgbt=(aBFSifRg-aifRg)2+(bBFSifRg-bifRg)2-(lBFSifRg-lifRg)2- (wBFSifRg-wifRg)2, selecting the smallest human region of distance in all human region set is the corresponding human body area of current video Domain, if it is (a that the human region chosen, which is position,BFS1, bBFS1, lBFS1, wBFS1), human face region collection is combined into FaceFrameSetifRgEach of human face region (aFFSifRg, bFFsifRg, lFFsifRg, wFFSifRg), calculate its distance dgft= (aBFS1-aFFSifRg)2+(bBFS1-bFFSifRg)2-(lBFS1-lFFSifRg)2-(wBFS1-wFFSifRg)2, select all face regional ensembles It is middle apart from the smallest human face region be the corresponding human face region of current video.
6. Recognition result processing module: in the detection process, it integrates the received recognition results to generate a shopping list for each customer: first, the customer corresponding to the current shopping information is determined from the customer ID passed by the individual identification module, so the shopping list to be modified is the one of that ID; then the product corresponding to the current customer's shopping action is determined from the recognition result passed by the product identification module, and this product is denoted GoodA; then whether the current shopping action modifies the shopping cart is determined from the recognition result passed by the shopping action recognition module: if the action is recognized as taking out an article, the product GoodA is added to shopping list ID and the quantity increased is 1; if the action is recognized as putting back an article, the product GoodA on shopping list ID is reduced and the quantity reduced is 1; if the action is recognized as "taken out and put back again" or "taken out an article and not put back", the shopping list does not change; if the recognition result is "suspicious stealing", an alarm signal and the location information corresponding to the current video are sent to supermarket monitoring.
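A small Python sketch of the list-update rules above; the action names and the alert callback are assumptions of this sketch:

    def update_shopping_list(shopping_lists, customer_id, product, action, alert):
        # shopping_lists: dict mapping customer ID -> dict of product name -> quantity.
        items = shopping_lists.setdefault(customer_id, {})
        if action == 'take_out':
            items[product] = items.get(product, 0) + 1          # quantity increased by 1
        elif action == 'put_back':
            items[product] = max(items.get(product, 0) - 1, 0)  # quantity reduced by 1
        elif action in ('take_out_and_put_back', 'taken_out_not_put_back'):
            pass                                                 # list unchanged, as stated above
        elif action == 'suspected_theft':
            alert(customer_id, product)                          # notify supermarket monitoring
        return shopping_lists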
Embodiment 3:
The present embodiment realizes the process of updating the product list of a supermarket intelligent vending system.
1. This process uses only the product identification module. When the product list is changed: if a product is deleted, the images of that product are deleted from the product image set of all angles and the corresponding position in the product list is deleted; if a product is added, the product images of all angles of the current product are put into the product image set of all angles and the name of the newly added product is appended at the end of the product list; the product identification classifier is then updated with the new product image set of all angles and the new product list.
The described update of the product identification classifier with the new product image set of all angles and the new product list, method is as follows: first step, modify the network structure: for the newly constructed product identification classifier GoodsN', the structure of GoodsN1' is unchanged and identical to the GoodsN1 network structure at initialization; the first and second layers of the GoodsN2' network structure remain unchanged, and the output vector length of its third layer becomes the length of the updated product list; second step, initialize the newly constructed product identification classifier GoodsN': its input is the new product image set of all angles; if the input image is Goods3, the output is GoodsN'(Goods3)=GoodsN2'(GoodsN1(Goods3)) and the class label is yGoods3, where yGoods3 is a vector whose length equals the length of the updated product list; yGoods3 is represented as follows: if the image Goods3 shows the product at the iGoods-th position, then the iGoods-th position of yGoods3 is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss on (GoodsN'(Goods3)-yGoods3), the convergence direction is minimization, the parameter values in GoodsN1 remain unchanged during the initialization, and the number of iterations is 500.
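A PyTorch sketch of the head replacement described above: GoodsN2' keeps the stated layer sizes but its last layer's output length becomes the new product-list length, while GoodsN1 is kept frozen during the 500-iteration re-initialization:

    import torch.nn as nn

    def build_goods_head(new_list_len):
        # GoodsN2': the first two fully connected layers keep the sizes given at
        # initialization, the third one outputs the length of the updated product list.
        return nn.Sequential(
            nn.Flatten(),                      # 4 x 4 x 128 feature map -> vector of 2048
            nn.Linear(2048, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, new_list_len),     # soft-max is applied inside the cross-entropy loss
        )

    def freeze_trunk(goods_n1):
        # GoodsN1's parameter values remain unchanged during the re-initialization.
        for p in goods_n1.parameters():
            p.requires_grad = False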

Claims (7)

1. a kind of supermarket's intelligence vending system, which is characterized in that based on the monitoring camera being fixed in supermarket and on shelf The video image taken the photograph is as input;It is made of following 6 functional modules: image pre-processing module, module of target detection, shopping Action recognition module, product identification module, individual identification module, recognition result processing module;This 6 respective realities of functional module Existing method is as follows:
The video image that image pre-processing module takes the photograph monitoring camera pre-processes, first to possible in the image of input The noise that contains carries out denoising, then carries out illumination compensation to the image after denoising, then to the image after illumination compensation into Data after image enhancement are finally passed to module of target detection by row image enhancement;
Module of target detection carries out target detection to the image received, detects that the human body in current video image is whole respectively Then hand region and product area are sent to shopping movement and known by region, face facial area, hand region and product area Human body image region and face facial area are sent to individual identification module by other module, and product area is passed to product and is known Other module;
Shopping action recognition module carries out static action recognition to the hand region information received, finds the starting for grasping video Frame, it is then lasting that movement is identified until finding the movement for putting down article as end frame, then video is used dynamic State action recognition classifier is identified, identifies that current action is to take out article, put back to article, take out and put back to, taken out Article does not put back to either suspicious stealing;Then recognition result is sent to recognition result processing module, by only grasp motion Video and only put down the video of movement and be sent to recognition result processing module;
Product identification module identifies the video of the product area received, identify currently by it is mobile be any production Product, are then sent to recognition result processing module for recognition result, and product identification module can also increase at any time or delete some Product;
Individual identification module identifies the human face region and human region that receive, believes in conjunction with human face region and human region Breath, for identification out current individual be in entire supermarket who individual, then recognition result is sent at recognition result Manage module;
Recognition result processing module integrates the recognition result received, according to individual identification module transmit come customer ID determines the corresponding customer of current shopping information, according to product identification module transmit come recognition result determine current customer Shopping acts corresponding product, determined according to the recognition result that shopping action recognition module transmitting comes current shopping act whether It modifies to shopping cart;To obtain the shopping list of current customer;Suspicious stealing to shopping action recognition module identification Behavior sounds an alarm.
2. a kind of supermarket's intelligence vending system according to claim 1, it is characterised in that the image pre-processing module Concrete methods of realizing are as follows:
In initial phase, the module does not work;In the detection process: the first step, the monitoring image that monitoring camera is taken the photograph into Row mean denoising, thus the monitoring image after being denoised;Second step carries out illumination compensation to the monitoring image after denoising, from And obtain the image after illumination compensation;Image after illumination compensation is carried out image enhancement, by the number after image enhancement by third step According to passing to module of target detection;
The monitoring image that the monitoring camera is taken the photograph carries out mean denoising, and method is: setting the prison that monitoring camera is taken the photograph Control image is Xsrc, because of XsrcFor color RGB image, therefore there are Xsrc-R, Xsrc-G, Xsrc-BThree components, for each point Measure Xsrc', it proceeds as follows respectively: the window of one 3 × 3 dimension being set first, considers image Xsrc' each pixel Xsrc' (i, j), it is respectively [X that pixel value corresponding to matrixes is tieed up in 3 × 3 put centered on the pointsrc′(i-1,j-1),Xsrc′ (i-1,j),Xsrc′(i-1,j+1),Xsrc′(i,j-1),Xsrc′(i,j),Xsrc′(i,j+1),Xsrc′(i+1,j-1),Xsrc′(i+ 1,j),Xsrc' (j+1, j+1)] it is arranged from big to small, take it to come intermediate value as image X after denoisingsrc" pixel (i, J) value is assigned to X after corresponding filteringsrc″(i,j);For Xsrc' boundary point, it may appear that its 3 × 3 dimension window corresponding to The case where certain pixels are not present, then the median for falling in existing pixel in window need to be only calculated, if window Interior is even number point, is assigned to X for the average value for coming intermediate two pixel values as the pixel value after pixel denoisingsrc″ (i, j), thus, new image array XsrcIt " is XsrcImage array after the denoising of current RGB component, for Xsrc-R, Xsrc-G, Xsrc-BAfter three components carry out denoising operation respectively, the X that will obtainsrc-R", Xsrc-G", Xsrc-B" component, by this three A new component is integrated into a new color image XDenResulting image after as denoising;
The illumination compensation applied to the denoised monitoring image is performed as follows: let the denoised monitoring image be X_Den; because X_Den is a colour RGB image it has three components, and illumination compensation is applied to each component X_Den′ separately; the resulting components X_cpst′ are then combined into the colour RGB image X_cpst, which is X_Den after illumination compensation. The steps for compensating one component X_Den′ are: first, with X_Den′ of m rows and n columns, construct X_Densum and Num_Den as matrices of the same m rows and n columns with all initial values 0; the window size l and the step length s are computed from min(m, n) and sqrt(l) respectively, where min(m, n) is the minimum of m and n, sqrt(l) is the square root of l, values are rounded down to integers, and l = 1 if l < 1. Second, with the top-left coordinate of X_Den taken as (1, 1), determine every candidate frame according to the window size l and step length s; a candidate frame is the region delimited by [(a, b), (a + l, b + l)]. For the image matrix of X_Den′ inside the candidate frame region, perform histogram equalization to obtain the equalized image matrix X_Den″ of the candidate region [(a, b), (a + l, b + l)]; then for each element of X_Densum in the region [(a, b), (a + l, b + l)] compute X_Densum(a + i_Xsum, b + j_Xsum) = X_Densum(a + i_Xsum, b + j_Xsum) + X_Den″(i_Xsum, j_Xsum), where (i_Xsum, j_Xsum) are integers with 1 ≤ i_Xsum ≤ l and 1 ≤ j_Xsum ≤ l, and add 1 to each element of Num_Den in the region [(a, b), (a + l, b + l)]. Finally, compute X_cpst at every corresponding point (i_XsumNum, j_XsumNum) of X_Den as the element-wise quotient of X_Densum and Num_Den; the resulting X_cpst is the illumination compensation of the present component X_Den′;
The determination of each candidate frame according to the window size l and step length s proceeds as follows:
Let the monitoring image have m rows and n columns, let (a, b) be the top-left coordinate of the selected region and (a + l, b + l) the bottom-right coordinate of the selected region, so that the region is denoted [(a, b), (a + l, b + l)]; the initial value of (a, b) is (1, 1);
While a + l ≤ m:
b = 1;
While b + l ≤ n:
the selected region is [(a, b), (a + l, b + l)];
b = b + s;
the inner loop ends;
a = a + s;
the outer loop ends;
In the above process, each selected region [(a, b), (a + l, b + l)] is one candidate frame (a compact sketch of this scan is given below);
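For illustration only, a minimal Python sketch of the candidate-frame scan; the exact expressions for the window size l and step length s are given as formula images in the claim, so here l is passed in as a parameter and only s = floor(sqrt(l)) with the stated lower bound of 1 is assumed:

import math

def candidate_frames(m, n, l):
    # Yield the corners [(a, b), (a + l, b + l)] of each candidate frame for an
    # m-by-n image.  l is the window size defined in the claim (computed there
    # from min(m, n)); the step length is s = floor(sqrt(l)); both are at least 1.
    l = max(l, 1)
    s = max(int(math.sqrt(l)), 1)
    a = 1
    while a + l <= m:
        b = 1
        while b + l <= n:
            yield (a, b), (a + l, b + l)
            b += s
        a += s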
The histogram equalization of the image matrix of X_Den′ inside a candidate frame region is performed as follows: let the candidate frame region be the area delimited by [(a, b), (a + l, b + l)] and let X_Den″ be the image information of X_Den′ inside the region [(a, b), (a + l, b + l)]. First, construct a vector I, where I(i_I) is the number of pixels of X_Den″ whose value equals i_I, 0 ≤ i_I ≤ 255. Second, compute the mapping vector I′ from I. Third, for each point (i_XDen, j_XDen) of X_Den″ with pixel value X_Den″(i_XDen, j_XDen), compute X_Den″(i_XDen, j_XDen) = I′(X_Den″(i_XDen, j_XDen)). When all pixel values of X_Den″ have been computed and changed, the histogram equalization ends and the content of X_Den″ is the result of the histogram equalization;
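For illustration only, a minimal NumPy sketch of the region equalization described above; the exact expression for the mapping vector I′ is given as a formula image in the claim, so the standard cumulative-histogram mapping is assumed here, and the function name equalize_region is illustrative:

import numpy as np

def equalize_region(X):
    # Histogram-equalize one component region X with values in 0..255.
    X = X.astype(np.int64)
    I = np.bincount(X.ravel(), minlength=256)           # I(i) = number of pixels equal to i
    cdf = np.cumsum(I)
    I_prime = np.round(255.0 * cdf / cdf[-1]).astype(np.int64)   # assumed standard mapping
    return I_prime[X]                                    # X''(i, j) = I'(X''(i, j))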
The image enhancement applied to the illumination-compensated image is performed as follows: let the illumination-compensated image be X_cpst with RGB channels X_cpstR, X_cpstG, X_cpstB, and let X_enh be the image obtained from X_cpst after image enhancement. The steps are: first, for each of the channels X_cpstR, X_cpstG, X_cpstB of X_cpst, compute the image blurred at the specified scale. Second, construct matrices LX_enhR, LX_enhG, LX_enhB of the same dimensions as X_cpstR; for the R channel of the RGB channels of X_cpst compute LX_enhR(i, j) = log(X_cpstR(i, j)) - LX_cpstR(i, j), where (i, j) ranges over all points of the image matrix; the G channel and B channel of the RGB channels of X_cpst are processed with the same algorithm as the R channel to obtain LX_enhG and LX_enhB. Third, for the R channel of the RGB channels, compute the mean MeanR and the standard deviation VarR of all values of LX_enhR, compute MinR = MeanR - 2 × VarR and MaxR = MeanR + 2 × VarR, and then compute X_enhR(i, j) = Fix((LX_enhR(i, j) - MinR)/(MaxR - MinR) × 255), where Fix denotes taking the integer part, values < 0 are set to 0, and values > 255 are set to 255. The G channel and B channel of the RGB channels are processed with the same algorithm as the R channel to obtain X_enhG and X_enhB, and the channels X_enhR, X_enhG, X_enhB are combined into one colour image X_enh;
The computation of the image blurred at the specified scale for each of the channels X_cpstR, X_cpstG, X_cpstB of X_cpst proceeds as follows, taking the R channel X_cpstR of the RGB channels as an example: first, define the Gaussian function G(x, y, σ) = k × exp(-(x² + y²)/σ²), where σ is the scale parameter and k = 1/∫∫G(x, y) dx dy; then for each point X_cpstR(i, j) compute the convolution of X_cpstR with G(x, y, σ) at that point, where ⊗ denotes the convolution operation; for points whose distance to the boundary is less than the scale σ, only the convolution of X_cpstR with the corresponding part of G(x, y, σ) is computed; Fix() denotes taking the integer part, values < 0 are set to 0, and values > 255 are set to 255. The G channel and B channel of the RGB channels are updated with the same algorithm as the R channel, giving the blurred X_cpstG and X_cpstB;
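For illustration only, a minimal Python sketch of this single-scale, Retinex-style enhancement (Gaussian blur, log-ratio, stretch by mean ± 2 × standard deviation); scipy.ndimage.gaussian_filter stands in for the explicit convolution with G(x, y, σ), the default sigma value and the +1 offset that avoids log(0) are assumptions, and the function names are illustrative:

import numpy as np
from scipy.ndimage import gaussian_filter

def enhance_channel(X, sigma):
    # Compare the log of the channel with the log of its Gaussian-blurred
    # version, then stretch to [0, 255] using mean +/- 2 * standard deviation.
    X = X.astype(np.float64) + 1.0            # +1 avoids log(0); not part of the claim
    blurred = gaussian_filter(X, sigma)
    L = np.log(X) - np.log(blurred)
    mean, std = L.mean(), L.std()
    lo, hi = mean - 2.0 * std, mean + 2.0 * std
    out = np.fix((L - lo) / (hi - lo) * 255.0)
    return np.clip(out, 0, 255).astype(np.uint8)

def enhance_rgb(X_cpst, sigma=80.0):
    # Apply the per-channel enhancement to R, G and B and re-assemble X_enh.
    return np.dstack([enhance_channel(X_cpst[:, :, c], sigma) for c in range(3)])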
3. The supermarket intelligent vending system according to claim 1, characterized in that the target detection module is implemented as follows:
During initialization, the parameters of the target detection algorithm are initialized using images in which the human-body image regions, face regions, hand regions and product regions have been manually calibrated. In the detection process, the module receives the images transmitted by the image pre-processing module and processes them: target detection is performed on each frame image with the target detection algorithm to obtain the human-body image regions, face regions, hand regions and product regions of the current image; the hand regions and product regions are then sent to the shopping action recognition module, the human-body image regions and face regions are sent to the individual identification module, and the product regions are sent to the product identification module;
The parameter initialization of the target detection algorithm using images with manually calibrated human-body image regions, face regions, hand regions and product regions comprises the following steps: first, construct the feature extraction deep network; second, construct the region selection network; third, for each image X in the database used to construct the feature extraction deep network and each corresponding manually calibrated region, pass the image X and the region through the ROI layer, whose output has dimensions 7 × 7 × 512; fourth, build the coordinate refining network;
The construction of the feature extraction deep network: the network is a deep learning network with the following structure:
Layer 1: convolutional layer, input 768 × 1024 × 3, output 768 × 1024 × 64, channels = 64;
Layer 2: convolutional layer, input 768 × 1024 × 64, output 768 × 1024 × 64, channels = 64;
Layer 3: pooling layer, input the layer-1 output 768 × 1024 × 64 concatenated with the layer-2 output 768 × 1024 × 64 along the third dimension, output 384 × 512 × 128;
Layer 4: convolutional layer, input 384 × 512 × 128, output 384 × 512 × 128, channels = 128;
Layer 5: convolutional layer, input 384 × 512 × 128, output 384 × 512 × 128, channels = 128;
Layer 6: pooling layer, input the layer-4 output 384 × 512 × 128 concatenated with the layer-5 output 384 × 512 × 128 along the third dimension, output 192 × 256 × 256;
Layer 7: convolutional layer, input 192 × 256 × 256, output 192 × 256 × 256, channels = 256;
Layer 8: convolutional layer, input 192 × 256 × 256, output 192 × 256 × 256, channels = 256;
Layer 9: convolutional layer, input 192 × 256 × 256, output 192 × 256 × 256, channels = 256;
Layer 10: pooling layer, input the layer-7 output 192 × 256 × 256 concatenated with the layer-9 output 192 × 256 × 256 along the third dimension, output 96 × 128 × 512;
Layer 11: convolutional layer, input 96 × 128 × 512, output 96 × 128 × 512, channels = 512;
Layer 12: convolutional layer, input 96 × 128 × 512, output 96 × 128 × 512, channels = 512;
Layer 13: convolutional layer, input 96 × 128 × 512, output 96 × 128 × 512, channels = 512;
Layer 14: pooling layer, input the layer-11 output 96 × 128 × 512 concatenated with the layer-13 output 96 × 128 × 512 along the third dimension, output 48 × 64 × 1024;
Layer 15: convolutional layer, input 48 × 64 × 1024, output 48 × 64 × 512, channels = 512;
Layer 16: convolutional layer, input 48 × 64 × 512, output 48 × 64 × 512, channels = 512;
Layer 17: convolutional layer, input 48 × 64 × 512, output 48 × 64 × 512, channels = 512;
Layer 18: pooling layer, input the layer-15 output 48 × 64 × 512 concatenated with the layer-17 output 48 × 64 × 512 along the third dimension, output 48 × 64 × 1024;
Layer 19: convolutional layer, input 48 × 64 × 1024, output 48 × 64 × 256, channels = 256;
Layer 20: pooling layer, input 48 × 64 × 256, output 24 × 32 × 256;
Layer 21: convolutional layer, input 24 × 32 × 256, output 24 × 32 × 256, channels = 256;
Layer 22: pooling layer, input 24 × 32 × 256, output 12 × 16 × 256;
Layer 23: convolutional layer, input 12 × 16 × 256, output 12 × 16 × 128, channels = 128;
Layer 24: pooling layer, input 12 × 16 × 128, output 6 × 8 × 128;
Layer 25: fully connected layer, the input data of dimensions 6 × 8 × 128 being first unrolled into a vector of length 6144 and fed into the fully connected layer, output vector length 768, activation function relu;
Layer 26: fully connected layer, input vector length 768, output vector length 96, activation function relu;
Layer 27: fully connected layer, input vector length 96, output vector length 2, activation function soft-max.
The parameters of all convolutional layers are kernel size = 3 and stride = (1, 1) with relu as the activation function; all pooling layers are max-pooling layers with pooling window size kernel_size = 2 and stride = (2, 2). Let this deep network be Fconv27; for a colour image X, the feature map set obtained from the network is denoted Fconv27(X). The evaluation function of the network is the cross-entropy loss computed on (Fconv27(X) - y), the convergence direction being minimization, where y is the class corresponding to the input. The database consists of naturally collected images of passers-by and non-passers-by; each image is a colour image of dimensions 768 × 1024, the images are divided into two classes according to whether they contain a pedestrian, and the number of iterations is 2000. After training, layers 1 to 17 are taken as the feature extraction deep network Fconv, and Fconv(X) denotes the output obtained from a colour image X through this network;
The construction of the region selection network: the network receives the set Fconv(X) of 512 feature maps of dimensions 48 × 64 produced by the feature extraction deep network Fconv. First, a convolutional layer produces Conv1(Fconv(X)); its parameters are kernel size = 1, stride = (1, 1), input 48 × 64 × 512, output 48 × 64 × 512, channels = 512. Conv1(Fconv(X)) is then fed separately into two convolutional layers Conv2-1 and Conv2-2. The structure of Conv2-1 is: input 48 × 64 × 512, output 48 × 64 × 18, channels = 18; the output of this layer is Conv2-1(Conv1(Fconv(X))), to which the activation function softmax is applied to obtain softmax(Conv2-1(Conv1(Fconv(X)))). The structure of Conv2-2 is: input 48 × 64 × 512, output 48 × 64 × 36, channels = 36. The network has two loss functions: the first error function loss1 computes the softmax error on W_shad-cls(X) ⊙ (Conv2-1(Conv1(Fconv(X))) - W_cls(X)), and the second error function loss2 computes the smooth L1 error on W_shad-reg(X) ⊙ (Conv2-2(Conv1(Fconv(X))) - W_reg(X)). The loss function of the region selection network is loss1/sum(W_cls(X)) + loss2/sum(W_cls(X)), where sum() denotes the sum of all elements of a matrix and the convergence direction is minimization; W_cls(X) and W_reg(X) are respectively the positive/negative sample information corresponding to the database image X, and ⊙ denotes element-wise multiplication of matrices. W_shad-cls(X) and W_shad-reg(X) are masks whose role is to select for training only the parts whose weight is 1, so as to avoid an excessive imbalance between positive and negative sample counts; W_shad-cls(X) and W_shad-reg(X) are regenerated at every iteration, and the algorithm iterates 1000 times;
The database used to construct the feature extraction deep network: for each image in the database, first, every human-body image region, face region, hand region and product region is manually calibrated; if a region has centre coordinates (a_bas_tr, b_bas_tr) in the input image, with distance l_bas_tr from the centre to the top/bottom borders in the longitudinal direction and distance w_bas_tr from the centre to the left/right borders in the lateral direction, then the corresponding position in Conv1 has centre coordinates, half-height and half-width obtained by dividing these values by 16 and rounding down to integers (the feature map being 1/16 the size of the input image); second, positive and negative samples are generated at random;
The random generation of positive and negative samples proceeds as follows: first, construct 9 region frames; second, for each database image X_tr, let W_cls be of dimensions 48 × 64 × 18 and W_reg of dimensions 48 × 64 × 36, with all initial values 0, and fill W_cls and W_reg;
The construction of the 9 region frames: the 9 region frames are Ro_1(x_Ro, y_Ro) = (x_Ro, y_Ro, 64, 64), Ro_2(x_Ro, y_Ro) = (x_Ro, y_Ro, 45, 90), Ro_3(x_Ro, y_Ro) = (x_Ro, y_Ro, 90, 45), Ro_4(x_Ro, y_Ro) = (x_Ro, y_Ro, 128, 128), Ro_5(x_Ro, y_Ro) = (x_Ro, y_Ro, 90, 180), Ro_6(x_Ro, y_Ro) = (x_Ro, y_Ro, 180, 90), Ro_7(x_Ro, y_Ro) = (x_Ro, y_Ro, 256, 256), Ro_8(x_Ro, y_Ro) = (x_Ro, y_Ro, 360, 180), Ro_9(x_Ro, y_Ro) = (x_Ro, y_Ro, 180, 360); for each region frame, Ro_i(x_Ro, y_Ro) denotes the i-th region frame with centre coordinates (x_Ro, y_Ro), the third position being the pixel distance from the centre point to the top/bottom borders and the fourth position the pixel distance from the centre point to the left/right borders, with i running from 1 to 9;
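For illustration only, a minimal Python sketch of the nine region frames as fixed anchor shapes; the names ANCHOR_SHAPES and region_frame are illustrative:

# The nine anchor shapes from the claim, given as (half-height, half-width) in pixels.
ANCHOR_SHAPES = [
    (64, 64), (45, 90), (90, 45),
    (128, 128), (90, 180), (180, 90),
    (256, 256), (360, 180), (180, 360),
]

def region_frame(i, x_ro, y_ro):
    # Return Ro_i(x_ro, y_ro) = (x_ro, y_ro, l, w): centre coordinates plus the
    # pixel distances from the centre to the top/bottom and left/right borders.
    l, w = ANCHOR_SHAPES[i - 1]   # i runs from 1 to 9 as in the claim
    return (x_ro, y_ro, l, w)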
It is described to WclsAnd WregIt is filled, method are as follows:
For the body compartments that each is manually demarcated, if it is (a in the centre coordinate of input picturebas_tr,bbas_tr), center Coordinate is l in the distance of fore-and-aft distance upper and lower side framebas_tr, centre coordinate is w in the distance of lateral distance left and right side framebas_tr, Then it corresponds to Conv1Position be that center coordinate isHalf is a length ofHalf-breadth For
For the upper left cornerBottom right angular coordinateEach point in the section surrounded (xCtr,yCtr):
For i value from 1 to 9:
For point (xCtr,yCtr), it is upper left angle point (16 (x in the mapping range of database imagesCtr-1)+1,16(yCtr-1)+ 1) bottom right angle point (16xCtr,16yCtr) 16 × 16 sections that are surrounded, for each point (x in the sectionOtr,yOtr):
Calculate (xOtr,yOtr) corresponding to region Roi(xOtr,yOtr) with current manual calibration section coincidence factor;
Select the point (x_IoUMax, y_IoUMax) with the highest coincidence factor in the current 16 × 16 section; if the coincidence factor > 0.7, then W_cls(x_Ctr, y_Ctr, 2i-1) = 1 and W_cls(x_Ctr, y_Ctr, 2i) = 0, the point is a positive sample, and W_reg(x_Ctr, y_Ctr, 4i-3) = (x_Otr - 16x_Ctr + 8)/8, W_reg(x_Ctr, y_Ctr, 4i-2) = (y_Otr - 16y_Ctr + 8)/8, W_reg(x_Ctr, y_Ctr, 4i-1) = Down1(l_bas_tr / third position of Ro_i), W_reg(x_Ctr, y_Ctr, 4i) = Down1(w_bas_tr / fourth position of Ro_i), where Down1() returns 1 if the value is greater than 1; if the coincidence factor < 0.3, then W_cls(x_Ctr, y_Ctr, 2i-1) = 0 and W_cls(x_Ctr, y_Ctr, 2i) = 1; otherwise W_cls(x_Ctr, y_Ctr, 2i-1) = -1 and W_cls(x_Ctr, y_Ctr, 2i) = -1;
If the human region of current manual's calibration does not have the Ro of coincidence factor > 0.6i(xOtr,yOtr), then select the highest Ro of coincidence factori (xOtr,yOtr) to WclsAnd WregAssignment, assignment method are identical as the assignment method of coincidence factor > 0.7;
The computation of the coincidence factor between the region Ro_i(x_Otr, y_Otr) corresponding to (x_Otr, y_Otr) and the manually calibrated section proceeds as follows: let the manually calibrated body section have centre coordinates (a_bas_tr, b_bas_tr) in the input image, with distance l_bas_tr from the centre to the top/bottom borders and distance w_bas_tr from the centre to the left/right borders, and let the third position of Ro_i(x_Otr, y_Otr) be l_Otr and its fourth position w_Otr. If |x_Otr - a_bas_tr| ≤ l_Otr + l_bas_tr - 1 and |y_Otr - b_bas_tr| ≤ w_Otr + w_bas_tr - 1, there is an overlapping region, and overlapping region = (l_Otr + l_bas_tr - 1 - |x_Otr - a_bas_tr|) × (w_Otr + w_bas_tr - 1 - |y_Otr - b_bas_tr|); otherwise overlapping region = 0. Compute whole region = (2l_Otr - 1) × (2w_Otr - 1) + (2l_bas_tr - 1) × (2w_bas_tr - 1) - overlapping region, so that coincidence factor = overlapping region / whole region, where | | denotes the absolute value;
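For illustration only, a minimal Python sketch of this coincidence factor (an intersection-over-union on centre/half-extent boxes); the function name coincidence_factor is illustrative:

def coincidence_factor(box_a, box_b):
    # Overlap ratio of two boxes given as (centre_x, centre_y, half_height, half_width);
    # a box with half-extents (l, w) covers (2l - 1) x (2w - 1) pixels, as in the claim.
    xa, ya, la, wa = box_a
    xb, yb, lb, wb = box_b
    if abs(xa - xb) <= la + lb - 1 and abs(ya - yb) <= wa + wb - 1:
        overlap = (la + lb - 1 - abs(xa - xb)) * (wa + wb - 1 - abs(ya - yb))
    else:
        overlap = 0
    whole = (2 * la - 1) * (2 * wa - 1) + (2 * lb - 1) * (2 * wb - 1) - overlap
    return overlap / whole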
The construction of W_shad-cls(X) and W_shad-reg(X): for an image X whose corresponding positive/negative sample information is W_cls(X) and W_reg(X), first construct W_shad-cls(X) and W_shad-reg(X), with all initial values 0, where W_shad-cls(X) has the same dimensions as W_cls(X) and W_shad-reg(X) has the same dimensions as W_reg(X). Second, record the information of all positive samples: for i = 1 to 9, if W_cls(X)(a, b, 2i-1) = 1, then W_shad-cls(X)(a, b, 2i-1) = 1, W_shad-cls(X)(a, b, 2i) = 1, W_shad-reg(X)(a, b, 4i-3) = 1, W_shad-reg(X)(a, b, 4i-2) = 1, W_shad-reg(X)(a, b, 4i-1) = 1, W_shad-reg(X)(a, b, 4i) = 1; altogether sum(W_shad-cls(X)) positive samples are selected, where sum() sums all elements of a matrix, and if sum(W_shad-cls(X)) > 256, then 256 positive samples are kept at random. Third, negative samples are selected at random: a triple (a, b, i) is drawn at random, and if W_cls(X)(a, b, 2i) = 1, then W_shad-cls(X)(a, b, 2i-1) = 1, W_shad-cls(X)(a, b, 2i) = 1, W_shad-reg(X)(a, b, 4i-3) = 1, W_shad-reg(X)(a, b, 4i-2) = 1, W_shad-reg(X)(a, b, 4i-1) = 1, W_shad-reg(X)(a, b, 4i) = 1; the number of negative samples to be chosen is 256 - sum(W_shad-cls(X)), and if the negative samples are insufficient but 20 consecutive random draws of (a, b, i) all fail to yield a negative sample, the algorithm terminates;
The ROI layer: its inputs are the image X and a region. Its method is as follows: the output Fconv(X) obtained from the image X through the feature extraction deep network Fconv has dimensions 48 × 64 × 512; for each of the 512 matrices V_ROI_I of dimensions 48 × 64, extract the part of V_ROI_I enclosed by the upper-left and lower-right corners of the region mapped onto the 48 × 64 feature map (coordinates divided by 16 and rounded down to integers). The output roi_I(X) has dimensions 7 × 7, and the pooling step length is determined accordingly;
For i_ROI = 1 to 7:
For j_ROI = 1 to 7:
construct the corresponding bin section of the extracted region;
roi_I(X)(i_ROI, j_ROI) = the value of the maximum point in the section;
When all 512 matrices of dimensions 48 × 64 have been processed, the outputs are spliced together to give the 7 × 7 × 512-dimensional output of the ROI layer for the image X within the region frame ROI;
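For illustration only, a minimal NumPy sketch of this 7 × 7 max pooling over each feature map; the exact corner and step formulas in the claim appear as formula images, so the mapped window is simply split into a near-uniform 7 × 7 grid of bins here, and the function name roi_pool is illustrative:

import numpy as np

def roi_pool(feature_maps, top, left, bottom, right, out_size=7):
    # feature_maps: 48 x 64 x 512 array; [top:bottom, left:right] is the mapped region.
    c = feature_maps.shape[2]
    out = np.zeros((out_size, out_size, c), dtype=feature_maps.dtype)
    rows = np.linspace(top, bottom, out_size + 1)
    cols = np.linspace(left, right, out_size + 1)
    for i in range(out_size):
        for j in range(out_size):
            r0 = int(np.floor(rows[i]))
            r1 = max(int(np.ceil(rows[i + 1])), r0 + 1)
            c0 = int(np.floor(cols[j]))
            c1 = max(int(np.ceil(cols[j + 1])), c0 + 1)
            out[i, j, :] = feature_maps[r0:r1, c0:c1, :].max(axis=(0, 1))
    return out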
The building of the coordinate refining network proceeds as follows. First, the database is extended: for each image X in the database and each corresponding manually calibrated region, whose corresponding ROI is obtained from the ROI layer, set BClass = [1, 0, 0, 0, 0] and BBox = [0, 0, 0, 0] if the current section is a human-body image region, BClass = [0, 1, 0, 0, 0] and BBox = [0, 0, 0, 0] if the current section is a face region, BClass = [0, 0, 1, 0, 0] and BBox = [0, 0, 0, 0] if the current section is a hand region, and BClass = [0, 0, 0, 1, 0] and BBox = [0, 0, 0, 0] if the current section is a product region. Random numbers a_rand, b_rand, l_rand, w_rand with values between -1 and 1 are then generated to obtain a new section (values rounded down to integers), and for this new section BBox = [a_rand, b_rand, l_rand, w_rand]; if the coincidence factor of the new section with the calibrated section is > 0.7 its BClass equals the BClass of the current region, if the coincidence factor is < 0.3 then BClass = [0, 0, 0, 0, 1], and if neither condition is satisfied no assignment is made. Each section generates at most 10 positive sample regions; if Num_1 positive sample regions are generated, then Num_1 + 1 negative sample regions are generated, and if there are fewer than Num_1 + 1 negative sample regions the range of a_rand, b_rand, l_rand, w_rand is enlarged until enough negative samples are found. Second, the coordinate refining network is built: for each image X in the database and each corresponding manually calibrated region with corresponding ROI, the ROI of dimensions 7 × 7 × 512 is unrolled into a 25088-dimensional vector and passed through two fully connected layers Fc2 to obtain the output Fc2(ROI); Fc2(ROI) is then passed through the classification layer FClass and the section fine-tuning layer FBBox respectively, giving the outputs FClass(Fc2(ROI)) and FBBox(Fc2(ROI)); the classification layer FClass is a fully connected layer with input vector length 512 and output vector length 5, and the section fine-tuning layer FBBox is a fully connected layer with input vector length 512 and output vector length 4. The network has two loss functions: the first error function loss1 computes the softmax error on FClass(Fc2(ROI)) - BClass, and the second error function loss2 computes the Euclidean distance error on (FBBox(Fc2(ROI)) - BBox); the whole loss function of the refining network is loss1 + loss2. The iteration process of the algorithm is: first iterate 1000 times to converge the error function loss2, then iterate 1000 times to converge the whole loss function;
The two fully connected layers Fc2 have the structure: layer 1: fully connected layer, input vector length 25088, output vector length 4096, activation function relu; layer 2: fully connected layer, input vector length 4096, output vector length 512, activation function relu;
The target detection performed on each frame image with the target detection algorithm, which obtains the human-body image regions, face regions, hand regions and product regions of the current image, comprises the following steps:
Step 1: divide the input image X_cpst into sub-images of dimensions 768 × 1024;
Step 2: for each sub-image X_s:
Step 2.1: transform it with the feature extraction deep network Fconv constructed in the initialization to obtain the set of 512 feature maps Fconv(X_s);
Step 2.2: apply to Fconv(X_s) the first layer Conv1 of the region selection network, the second layer Conv2-1 followed by the softmax activation function, and Conv2-2, obtaining the outputs softmax(Conv2-1(Conv1(Fconv(X_s)))) and Conv2-2(Conv1(Fconv(X_s))); all preliminary candidate sections of the section are then obtained from these output values;
Step 2.3: for all preliminary candidate sections of all sub-images of the current frame image:
Step 2.3.1: according to the scores of the current candidate regions, choose the 50 preliminary candidate sections with the largest scores as candidate regions;
Step 2.3.2: adjust all out-of-bounds candidate sections in the candidate section set, then weed out the overlapping frames among the candidate sections to obtain the final candidate sections;
Step 2.3.3: input the sub-image X_s and each final candidate section to the ROI layer to obtain the corresponding ROI output; if the current final candidate section is (a_BB(1), b_BB(2), l_BB(3), w_BB(4)), compute FBBox(Fc2(ROI)) to obtain the four outputs (Out_BB(1), Out_BB(2), Out_BB(3), Out_BB(4)) and hence the updated coordinates (a_BB(1) + 8 × Out_BB(1), b_BB(2) + 8 × Out_BB(2), l_BB(3) + 8 × Out_BB(3), w_BB(4) + 8 × Out_BB(4)); then compute the output of FClass(Fc2(ROI)): the current section is a human-body image region if the first position of the output is the largest, a face region if the second position is the largest, a hand region if the third position is the largest, a product region if the fourth position is the largest, and a negative sample region if the fifth position is the largest, in which case the final candidate section is deleted;
Step 3: update the coordinates of the refined final candidate sections of all sub-images; the update method is: let the coordinates of the current candidate region be (TLx, TLy, RBx, RBy) and the top-left coordinate of the corresponding sub-image be (Se_asub, Se_bsub); the updated coordinates are (TLx + Se_asub - 1, TLy + Se_bsub - 1, RBx, RBy);
The division of the input image X_cpst into sub-images of dimensions 768 × 1024 proceeds as follows (a sketch of this division is given below): let the division step lengths be 384 and 512, let the input image have m rows and n columns, and let (a_sub, b_sub) be the top-left coordinate of the selected region, with initial value (1, 1);
While a_sub < m:
b_sub = 1;
While b_sub < n:
the selected region is [(a_sub, b_sub), (a_sub + 384, b_sub + 512)]; the information of the image region of the input image X_cpst corresponding to this section is copied into a new sub-image, to which the top-left coordinate (a_sub, b_sub) is attached as location information;
if the selected region extends beyond the input image X_cpst, the RGB pixel values of the pixels beyond the range are all set to 0;
b_sub = b_sub + 512;
the inner loop ends;
a_sub = a_sub + 384;
the outer loop ends;
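For illustration only, a minimal NumPy sketch of the tiling loop above; the function name split_into_subimages is illustrative, and indices are kept 0-based internally while the stored location information is 1-based as in the claim:

import numpy as np

def split_into_subimages(X_cpst, sub_h=768, sub_w=1024, step_h=384, step_w=512):
    # Split X_cpst (m x n x 3) into 768 x 1024 sub-images with strides 384/512;
    # pixels outside the input are zero, and each sub-image carries its
    # top-left coordinate (a_sub, b_sub) as location information.
    m, n = X_cpst.shape[:2]
    subimages = []
    a = 0
    while a < m:
        b = 0
        while b < n:
            sub = np.zeros((sub_h, sub_w, 3), dtype=X_cpst.dtype)
            patch = X_cpst[a:a + sub_h, b:b + sub_w]
            sub[:patch.shape[0], :patch.shape[1]] = patch
            subimages.append(((a + 1, b + 1), sub))
            b += step_w
        a += step_h
    return subimages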
The derivation of all preliminary candidate sections of the section from the output values proceeds as follows: the output of softmax(Conv2-1(Conv1(Fconv(X_s)))) has dimensions 48 × 64 × 18 and the output of Conv2-2(Conv1(Fconv(X_s))) has dimensions 48 × 64 × 36. For any point (x, y) of the 48 × 64-dimensional space, softmax(Conv2-1(Conv1(Fconv(X_s))))(x, y) is an 18-dimensional vector II and Conv2-2(Conv1(Fconv(X_s)))(x, y) is a 36-dimensional vector IIII. For i running from 1 to 9, if II(2i-1) > II(2i), then with l_Otr the third position of Ro_i(x_Otr, y_Otr) and w_Otr its fourth position, the preliminary candidate section is [II(2i-1), (8 × IIII(4i-3) + x, 8 × IIII(4i-2) + y, l_Otr × IIII(4i-1), w_Otr × IIII(4i))], where the first position II(2i-1) is the score of the current candidate region and the second position indicates that the centre point of the current candidate section is (8 × IIII(4i-3) + x, 8 × IIII(4i-2) + y) and that the half-height and half-width of the candidate frame are l_Otr × IIII(4i-1) and w_Otr × IIII(4i) respectively;
The adjustment of all out-of-bounds candidate sections in the candidate section set proceeds as follows: let the monitoring image have m rows and n columns; for each candidate section with centre (a_ch, b_ch) and half-height and half-width l_ch and w_ch, if a_ch + l_ch > m, compute the adjusted centre a′_ch and half-height l′_ch (clamping the frame so that it lies inside the image) and update a_ch = a′_ch, l_ch = l′_ch; likewise, if b_ch + w_ch > n, compute the adjusted centre b′_ch and half-width w′_ch and update b_ch = b′_ch, w_ch = w′_ch;
The weeding-out of the overlapping frames among the candidate sections comprises the following steps:
While the candidate section set is not empty:
take the candidate section i_out with the maximum score out of the candidate section set;
compute the coincidence factor between candidate section i_out and each candidate section i_c in the candidate section set; if the coincidence factor > 0.7, delete candidate section i_c from the candidate section set;
put candidate section i_out into the output candidate section set;
When the candidate section set is empty, the candidate sections contained in the output candidate section set form the candidate section set obtained after weeding out the overlapping frames among the candidate sections;
The computation of the coincidence factor between candidate section i_out and each candidate section i_c of the candidate section set proceeds as follows: let candidate section i_c have centre point coordinates (a_ic, b_ic) and half-height and half-width l_ic and w_ic, and let candidate section i_out have centre point coordinates (a_iout, b_iout) and half-height and half-width l_iout and w_iout. Compute xA = max(a_ic, a_iout), yA = max(b_ic, b_iout), xB = min(l_ic, l_iout), yB = min(w_ic, w_iout). If |a_ic - a_iout| ≤ l_ic + l_iout - 1 and |b_ic - b_iout| ≤ w_ic + w_iout - 1, there is an overlapping region, and overlapping region = (l_ic + l_iout - 1 - |a_ic - a_iout|) × (w_ic + w_iout - 1 - |b_ic - b_iout|); otherwise overlapping region = 0. Compute whole region = (2l_ic - 1) × (2w_ic - 1) + (2l_iout - 1) × (2w_iout - 1) - overlapping region, giving coincidence factor = overlapping region / whole region.
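For illustration only, a minimal Python sketch of the greedy overlap removal described above; the function name remove_overlapping is illustrative, and the overlap function (for example the coincidence_factor sketch given earlier) is passed in as a parameter:

def remove_overlapping(candidates, overlap_fn, threshold=0.7):
    # candidates: list of (score, (centre_x, centre_y, half_height, half_width)).
    # Repeatedly take the highest-scoring section, drop every remaining section
    # whose overlap with it exceeds the threshold, and keep the taken section.
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        remaining = [c for c in remaining if overlap_fn(best[1], c[1]) <= threshold]
        kept.append(best)
    return kept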
4. The supermarket intelligent vending system according to claim 1, characterized in that the shopping action recognition module is implemented as follows:
During initialization, the static action recognition classifier is first initialized with standard hand-action images, so that it can recognise the grasping and putting-down actions of the hand; the dynamic action recognition classifier is then initialized with hand-action videos, so that it can recognise taking out an article, putting back an article, taking out and putting back again, having taken out an article without putting it back, and suspicious theft. In the detection process: first, each received piece of hand-region information is identified with the static action recognition classifier; the recognition method is: let the image input each time be Handp1, the output StaticN(Handp1) is a 3-dimensional vector, and the action is identified as grasping if the first position is the largest, as putting down if the second position is the largest, and as other if the third position is the largest. Second, after a grasping action is recognised, target tracking is performed on the region corresponding to the current grasping action; when the recognition result of the static action recognition classifier for the tracking box in the next frame of the current hand region is a putting-down action, the target tracking ends, and the video from the frame in which the grasping action was recognised to the frame in which the putting-down action was recognised is marked as a complete video, giving a continuous video of the hand action. If tracking is lost during the tracking, the video from the frame in which the grasping action was recognised to the image before the tracking loss is taken as the video and marked as a video of only a grasping action. When a putting-down action is recognised that does not appear in any image obtained by target tracking, its grasping action has been lost; in that case the hand region corresponding to the present image is taken as the end of the video, target tracking is performed backwards starting from the present frame until tracking is lost, the frame following the lost frame is taken as the start frame of the video, and the video is marked as a video of only a putting-down action. Third, the complete videos obtained in the second step are identified with the dynamic action recognition classifier; the recognition method is: let the images input each time be Handv1, the output DynamicN(Handv1) is a 5-dimensional vector, and the action is identified as taking out an article if the first position is the largest, as putting back an article if the second position is the largest, as taking out and putting back again if the third position is the largest, as having taken out an article without putting it back if the fourth position is the largest, and as a suspicious theft action if the fifth position is the largest; the recognition result is then sent to the recognition result processing module, the videos of only a grasping action and of only a putting-down action are sent to the recognition result processing module, and the complete videos and the videos of only a grasping action are sent to the product identification module and the individual identification module;
The initialization of the static action recognition classifier with standard hand-action images proceeds as follows: first, the video data are arranged: a large number of videos of people shopping in supermarkets is chosen, covering the actions of taking out an article, putting back an article, taking out and putting back again, having taken out an article without putting it back, and suspicious theft; each video clip is cut manually, with the frame in which the hand touches the goods as the start frame and the frame in which the hand leaves the goods as the end frame; the hand region of each video frame is then extracted with the target detection module, each hand-region frame is scaled to a 256 × 256 colour image, the scaled video is put into the hand-action video set, and the video is labelled as one of taking out an article, putting back an article, taking out and putting back again, having taken out an article without putting it back, or suspicious theft. For every video whose class is taking out an article, putting back an article, taking out and putting back again, or having taken out an article without putting it back, the first frame of the video is put into the hand-action image set labelled as a grasping action, the last frame of the video is put into the hand-action image set labelled as a putting-down action, and one frame other than the first and last frames is taken from the video at random and put into the hand-action image set labelled as other; this yields the hand-action video set and the hand-action image set. Second, the static action recognition classifier StaticN is constructed. Third, the static action recognition classifier StaticN is initialized, with the hand-action image set constructed in the first step as input; let the image input each time be Handp, the output be StaticN(Handp) and the class be y_Handp, where y_Handp is represented as follows: grasping: y_Handp = [1, 0, 0]; putting down: y_Handp = [0, 1, 0]; other: y_Handp = [0, 0, 1]; the evaluation function of the network is the cross-entropy loss computed on (StaticN(Handp) - y_Handp), the convergence direction is minimization, and the number of iterations is 2000;
The construction static state action recognition classifier StaticN, network structure are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, exporting is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, input as 256 × 256 × 64, exporting is 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are connected in third dimension with third layer output 256 × 256 × 64, and exporting is 128 × 128 × 128;4th layer: Convolutional layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 5: volume Lamination, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: Chi Hua Layer inputs the 4th layer of output 128 × 128 × 128 and is connected in third dimension with layer 5 128 × 128 × 128, exports It is 64 × 64 × 256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;9th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;Tenth layer: pond layer, input for layer 7 output 64 × 64 × 256 and the 9th layer 64 × 64 × 256 It is connected in third dimension, exporting is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, exports and is 32 × 32 × 512, port number channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 × 512, port number channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 × 512, port number channels=512;14th layer: pond layer inputs as eleventh floor output 32 × 32 × 512 and the 13rd Layer 32 × 32 × 512 is connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, exporting is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, input as 16 × 16 × 512, exporting is 16 × 16 × 512, port number channels=512;18th layer: pond layer is inputted and is exported for the 15th layer 16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th Layer: convolutional layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: Chi Hua Layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128;Second Floor 12: pond layer, inputting is 4 × 4 × 128, export as 2 × 2 × 128;23rd layer: the data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, so by full articulamentum first After input into full articulamentum, output vector length is 128, and activation primitive is relu activation primitive;24th layer: full connection Layer, input vector length are 128, and output vector length is 32, and activation primitive is relu activation primitive;25th layer: Quan Lian Layer is connect, input vector length is 32, and output vector length is 3, and activation primitive is soft-max activation primitive;All convolutional layers Parameter is size=3 convolution kernel kernel, and step-length stride=(1,1), activation primitive is relu activation 
primitive;All pond layers It is maximum pond layer, parameter is pond section size kernel_size=2, step-length stride=(2,2);
The initialization of the dynamic action recognition classifier with hand-action videos proceeds as follows: first, the data set is constructed: from each video of the hand-action video set constructed in the first step of the initialization of the static action recognition classifier with standard hand-action images, 10 frames are extracted uniformly and used as input. Second, the dynamic action recognition classifier DynamicN is constructed. Third, the dynamic action recognition classifier DynamicN is initialized, its input being the set formed by the 10 frames extracted from each video in the first step; let the 10 frames input each time be Handv, the output be DynamicN(Handv) and the class be y_Handv, where y_Handv is represented as follows: taking out an article: y_Handv = [1, 0, 0, 0, 0]; putting back an article: y_Handv = [0, 1, 0, 0, 0]; taking out and putting back again: y_Handv = [0, 0, 1, 0, 0]; having taken out an article without putting it back: y_Handv = [0, 0, 0, 1, 0]; suspicious theft action: y_Handv = [0, 0, 0, 0, 1]; the evaluation function of the network is the cross-entropy loss computed on (DynamicN(Handv) - y_Handv), the convergence direction is minimization, and the number of iterations is 2000;
The uniform extraction of 10 frames proceeds as follows: for a video segment of Nf frames, the 1st frame of the video image is extracted as the 1st frame of the extracted set and the last frame of the video image is extracted as the 10th frame of the extracted set; the i_ckt-th frame of the extracted set, for i_ckt = 2 to 9, is the correspondingly evenly spaced frame of the video image, with the index rounded down to an integer;
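For illustration only, a minimal Python sketch of the uniform 10-frame extraction; the exact index formula in the claim is given as a formula image, so evenly spaced indices between the first and last frames are assumed here, and the function name extract_10_frames is illustrative:

def extract_10_frames(frames):
    # Pick 10 frames from a video of Nf frames: the first frame, the last
    # frame, and 8 intermediate frames at (assumed) evenly spaced positions.
    nf = len(frames)
    indices = [round(k * (nf - 1) / 9) for k in range(10)]   # 0-based indices
    return [frames[i] for i in indices]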
The construction of the dynamic action recognition classifier DynamicN: its network structure is as follows:
Layer 1: convolutional layer, input 256 × 256 × 30, output 256 × 256 × 512, channels = 512;
Layer 2: convolutional layer, input 256 × 256 × 512, output 256 × 256 × 128, channels = 128;
Layer 3: pooling layer, input 256 × 256 × 128, output 128 × 128 × 128;
Layer 4: convolutional layer, input 128 × 128 × 128, output 128 × 128 × 128, channels = 128;
Layer 5: convolutional layer, input 128 × 128 × 128, output 128 × 128 × 128, channels = 128;
Layer 6: pooling layer, input the layer-4 output 128 × 128 × 128 concatenated with the layer-5 output 128 × 128 × 128 along the third dimension, output 64 × 64 × 256;
Layer 7: convolutional layer, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256;
Layer 8: convolutional layer, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256;
Layer 9: convolutional layer, input 64 × 64 × 256, output 64 × 64 × 256, channels = 256;
Layer 10: pooling layer, input the layer-7 output 64 × 64 × 256 concatenated with the layer-9 output 64 × 64 × 256 along the third dimension, output 32 × 32 × 512;
Layer 11: convolutional layer, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512;
Layer 12: convolutional layer, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512;
Layer 13: convolutional layer, input 32 × 32 × 512, output 32 × 32 × 512, channels = 512;
Layer 14: pooling layer, input the layer-11 output 32 × 32 × 512 concatenated with the layer-13 output 32 × 32 × 512 along the third dimension, output 16 × 16 × 1024;
Layer 15: convolutional layer, input 16 × 16 × 1024, output 16 × 16 × 512, channels = 512;
Layer 16: convolutional layer, input 16 × 16 × 512, output 16 × 16 × 512, channels = 512;
Layer 17: convolutional layer, input 16 × 16 × 512, output 16 × 16 × 512, channels = 512;
Layer 18: pooling layer, input the layer-15 output 16 × 16 × 512 concatenated with the layer-17 output 16 × 16 × 512 along the third dimension, output 8 × 8 × 1024;
Layer 19: convolutional layer, input 8 × 8 × 1024, output 8 × 8 × 256, channels = 256;
Layer 20: pooling layer, input 8 × 8 × 256, output 4 × 4 × 256;
Layer 21: convolutional layer, input 4 × 4 × 256, output 4 × 4 × 128, channels = 128;
Layer 22: pooling layer, input 4 × 4 × 128, output 2 × 2 × 128;
Layer 23: fully connected layer, the input data of dimensions 2 × 2 × 128 being first unrolled into a vector of length 512 and fed into the fully connected layer, output vector length 128, activation function relu;
Layer 24: fully connected layer, input vector length 128, output vector length 32, activation function relu;
Layer 25: fully connected layer, input vector length 32, output vector length 5, activation function soft-max.
The parameters of all convolutional layers are kernel size = 3 and stride = (1, 1) with relu as the activation function; all pooling layers are max-pooling layers with pooling window size kernel_size = 2 and stride = (2, 2);
The target tracking of the region corresponding to the current grasping action after a grasping action is recognised proceeds as follows: let the image of the currently recognised grasping action be Hgrab, the current tracking area being the region corresponding to the image Hgrab. First, extract the ORB feature ORB_Hgrab of the image Hgrab. Second, for the images corresponding to all hand regions in the frame following Hgrab, compute their ORB features to obtain an ORB feature set, and delete the ORB features already chosen by other tracking boxes. Third, compare ORB_Hgrab with each element of the ORB feature set by Hamming distance and select the ORB feature with the smallest Hamming distance to ORB_Hgrab as the chosen ORB feature; if the similarity between the chosen ORB feature and ORB_Hgrab is > 0.85, where similarity = (1 - Hamming distance of the two ORB features / ORB feature length), then the hand region corresponding to the chosen ORB feature is the tracking box of the image Hgrab in the next frame; otherwise, if the similarity < 0.85, the tracking is lost;
The ORB feature: the method of extracting ORB features from an image is well established and is implemented in the OpenCV computer vision library; extracting the ORB features of a picture takes the current image as input and outputs several character strings of equal length, each representing one ORB feature;
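For illustration only, a minimal Python sketch using the OpenCV ORB implementation mentioned above; the way per-descriptor matches are aggregated into one region score is an assumption (the claim compares "the ORB feature" of each region as a whole), and the function names are illustrative:

import cv2
import numpy as np

def orb_descriptors(image):
    # ORB keypoints/descriptors of a hand-region image (OpenCV implementation).
    orb = cv2.ORB_create()
    _, descriptors = orb.detectAndCompute(image, None)
    return descriptors                     # N x 32 uint8 array, 256 bits per descriptor

def region_similarity(desc_a, desc_b):
    # similarity = 1 - (Hamming distance / feature length), here averaged over
    # the brute-force matches between the two descriptor sets.
    if desc_a is None or desc_b is None:
        return 0.0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc_a, desc_b)
    if not matches:
        return 0.0
    mean_dist = np.mean([m.distance for m in matches])
    return 1.0 - mean_dist / 256.0         # 256 bits per ORB descriptor

def pick_tracking_box(track_desc, candidate_regions, threshold=0.85):
    # candidate_regions: list of (region_box, descriptors) from the adjacent frame.
    # Return the most similar region, or None if even the best similarity is
    # below the threshold (tracking lost).
    scored = [(region_similarity(track_desc, d), box) for box, d in candidate_regions]
    best_sim, best_box = max(scored, key=lambda s: s[0])
    return best_box if best_sim >= threshold else None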
The procedure of taking the hand region corresponding to the present image as the end of the video and tracking backwards from the present frame with the target tracking method until tracking is lost is as follows: let the image of the currently recognised putting-down action be Hdown, the current tracking area being the region corresponding to the image Hdown;
While tracking is not lost:
first, extract the ORB feature ORB_Hdown of the image Hdown; since this has already been computed in the process, described above, of performing target tracking on the region corresponding to the current grasping action after a grasping action is recognised, it need not be computed again here;
second, for the images corresponding to all hand regions in the frame preceding the image Hdown, compute their ORB features to obtain an ORB feature set, and delete the ORB features already chosen by other tracking boxes;
third, compare ORB_Hdown with each element of the ORB feature set by Hamming distance and select the ORB feature with the smallest Hamming distance to ORB_Hdown as the chosen ORB feature; if the similarity between the chosen ORB feature and ORB_Hdown is > 0.85, where similarity = (1 - Hamming distance of the two ORB features / ORB feature length), then the hand region corresponding to the chosen ORB feature is the tracking box of the image Hdown in the preceding frame; otherwise, if the similarity < 0.85, the tracking is lost and the algorithm ends.
5. The supermarket intelligent vending system according to claim 1, characterized in that the product identification module is implemented as follows:
During initialization, the product identification classifier is first initialized with the set of product images from all angles, and a product list is generated for the products. When the product list is changed: if a product is deleted, its images are deleted from the set of product images from all angles and the corresponding position in the product list is deleted; if a product is added, the product images of the current product from each angle are put into the set of product images from all angles, the name of the currently added product is appended to the end of the product list, and the product identification classifier is then updated with the new set of product images from all angles and the new product list. In the detection process: first, for the complete videos and the videos of only a grasping action transmitted by the shopping action recognition module, according to the position obtained by the target detection module for the first frame of the current video, the input video images of that position are examined backwards starting from the first frame of the current video to detect a frame in which the region is not occluded; the image of the region corresponding to that frame is then used as the input of the product identification classifier for identification, giving the recognition result of the current product. The recognition method is: let the image input each time be Goods1, the output GoodsN(Goods1) is a vector, and if the i_goods-th position of the vector is the largest, the current recognition result is the product at the i_goods-th position of the product list; the recognition result is sent to the recognition result processing module;
The initialization of the product identification classifier with the set of product images from all angles and the generation of the product list proceed as follows: first, the data set and the product list are constructed: the data set consists of the images of each product from all angles, and the product list listGoods is a vector in which each position corresponds to one product name. Second, the product identification classifier GoodsN is constructed. Third, the product identification classifier GoodsN is initialized, its input being the set of product images from all angles; let the input image be Goods, the output be GoodsN(Goods) and the class be y_Goods, where y_Goods is a vector whose length equals the number of products in the product list and is represented as follows: if the image Goods shows the product at the i_Goods-th position, then the i_Goods-th position of y_Goods is 1 and the other positions are 0. The evaluation function of the network is the cross-entropy loss computed on (GoodsN(Goods) - y_Goods), the convergence direction is minimization, and the number of iterations is 2000;
The construction product identification classifier GoodsN, two groups of GoodsN1 and GoodsN2 of network layer structure, wherein The network structure of GoodsN1 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, and exporting is 256 × 256 × 64, port number Channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, and exporting is 256 × 256 × 128, port number Channels=128;Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: volume Lamination, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolution Layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, It inputs the 4th layer of output 128 × 128 × 128 to be connected in third dimension with layer 5 128 × 128 × 128, exporting is 64 ×64×256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels= 256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The Nine layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond Change layer, inputs and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, export It is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;14th layer: pond layer, input for eleventh floor export 32 × 32 × 512 and the 13rd layer 32 × 32 × 512 are connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, input as 16 × 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, Output is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, output It is 16 × 16 × 512, port number channels=512;18th layer: pond layer, input for the 15th layer output 16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th layer: convolution Layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: pond layer, input It is 8 × 8 × 256, exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, export as 4 × 4 × 128, port number channels=128;The parameter of all convolutional layers be size=3 convolution kernel kernel, step-length stride=(1, 1), activation primitive is relu activation primitive;All pond layers are maximum pond layer, and parameter is pond section size Kernel_size=2, step-length stride=(2,2);The network structure of GoodsN2 are as follows: inputting is 4 × 4 × 128, first will be defeated The data entered are launched into the vector of 2048 dimensions, then input into first layer;First layer: full articulamentum, input vector length are 2048, output vector length is 1024, and activation primitive is relu activation primitive;The second layer: full articulamentum, input vector length are 1024, output vector length is 1024, and activation primitive is relu activation primitive;Third layer: full 
articulamentum, input vector length are 1024, output vector length is len (listGoods), activation primitive is soft-max activation primitive;len(listGoods) indicate The length of product list;For any input Goods2, GoodsN (Goods2)=GoodsN2 (GoodsN1 (Goods2));
The updating of the product identification classifier with the new set of product images from all angles and the new product list proceeds as follows: first, the network structure is modified: for the newly constructed product identification classifier GoodsN′, the network structure of GoodsN1′ remains unchanged and identical to the network structure of GoodsN1 at initialization; the first and second layers of the GoodsN2′ network structure remain unchanged, and the output vector length of the third layer becomes the length of the updated product list. Second, the newly constructed product identification classifier GoodsN′ is initialized: its input is the new set of product images from all angles; let the input image be Goods3, the output be GoodsN′(Goods3) = GoodsN2′(GoodsN1(Goods3)) and the class be y_Goods3, where y_Goods3 is a vector whose length equals the number of products in the updated product list and is represented as follows: if the image Goods3 shows the product at the i_Goods-th position, then the i_Goods-th position of y_Goods3 is 1 and the other positions are 0. The evaluation function of the network is the cross-entropy loss computed on (GoodsN′(Goods3) - y_Goods3), the convergence direction is minimization, the parameter values in GoodsN1 remain unchanged during the initialization, and the number of iterations is 500;
The detection of a frame in which the region is not occluded, examining the input video images of the position obtained by the target detection module for the first frame of the current video backwards from that first frame, proceeds as follows: let the position obtained by the target detection module for the first frame of the current video be (a_goods, b_goods, l_goods, w_goods), and let the first frame of the current video be the i_crgs-th frame; for the frame being processed i_cr = i_crgs: first, let Task_icr be all detection regions obtained by the target detection module for the i_cr-th frame; second, for each region frame (a_task, b_task, l_task, w_task) of Task_icr, compute its distance d_gt = (a_task - a_goods)² + (b_task - b_goods)² - (l_task + l_goods)² - (w_task + w_goods)². If no distance < 0 exists, then the region corresponding to (a_goods, b_goods, l_goods, w_goods) in the i_cr-th frame is the detected frame in which the region is not occluded, and the algorithm ends. Otherwise, if a distance < 0 exists, record d(i_cr) = the minimum distance in the distance list d and set i_cr = i_cr - 1; if i_cr > 0 the algorithm jumps back to the first step, and if i_cr ≤ 0 the record with the maximum value in the distance list d is selected, and the region corresponding to (a_goods, b_goods, l_goods, w_goods) in that record's frame is taken as the detected frame in which the region is not occluded, and the algorithm ends.
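For illustration only, a minimal Python sketch of this backward search for an unoccluded frame using the distance test d_gt above; the function name find_unoccluded_frame and the dictionary input format are illustrative:

def find_unoccluded_frame(detections_per_frame, goods_box, start_frame):
    # detections_per_frame: frame index -> list of detected boxes (a, b, l, w);
    # goods_box: the product position (a_goods, b_goods, l_goods, w_goods).
    a_g, b_g, l_g, w_g = goods_box
    distance_record = {}
    frame = start_frame
    while frame > 0:
        dists = [(a - a_g) ** 2 + (b - b_g) ** 2 - (l + l_g) ** 2 - (w + w_g) ** 2
                 for a, b, l, w in detections_per_frame.get(frame, [])]
        negatives = [d for d in dists if d < 0]
        if not negatives:
            return frame                    # nothing overlaps the product box in this frame
        distance_record[frame] = min(negatives)
        frame -= 1
    # every examined frame is occluded: take the frame with the largest recorded minimum distance
    return max(distance_record, key=distance_record.get)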
6. The supermarket intelligent vending system according to claim 1, characterized in that the individual identification module is implemented as follows:
During initialization, the face feature extractor FaceN is first initialized with the set of face images from all angles and μ_face is computed; the body feature extractor BodyN is then initialized with the human-body images from all angles and μ_body is computed. In the detection process, when a user enters the supermarket, the current human-body region Body1 and the image Face1 of the face within the human-body region are obtained by the target detection module; the body feature BodyN(Body1) and the face feature FaceN(Face1) are then extracted with the body feature extractor BodyN and the face feature extractor FaceN respectively, BodyN(Body1) is saved in the BodyFtu set, FaceN(Face1) is saved in the FaceFtu set, and the ID information of the current customer is saved; the ID information can be the user's account number in the supermarket or a non-repeating number randomly assigned when the user enters the supermarket, and is used to distinguish different customers; whenever a customer enters the supermarket, his or her body feature and face feature are extracted in this way. When a user moves a product in the supermarket, the corresponding human-body region and face region are searched out according to the complete video or the video of only a grasping action transmitted by the shopping action recognition module, and face recognition or body recognition is carried out with the face feature extractor FaceN and the body feature extractor BodyN to obtain the ID of the customer corresponding to the video currently transmitted by the shopping action recognition module;
Initializing the face feature extractor FaceN with the multi-angle face image set and computing μ_face proceeds as follows (a sketch of the μ computation is given after the steps below): first step, the multi-angle face image sets are gathered into the face data set; second step, the face feature extractor FaceN is constructed and initialized with the face data set; third step:
For every person i_Peop in the face data set, obtain the set FaceSet(i_Peop) of all face images in the face data set belonging to i_Peop:
For each face image Face(j_iPeop) in FaceSet(i_Peop):
Compute the face feature FaceN(Face(j_iPeop));
Take the average of all face features in the current face image set FaceSet(i_Peop) as the center of the current face image set, center(FaceN(Face(j_iPeop)));
Compute the distances between all face features in the current face image set FaceSet(i_Peop) and the center center(FaceN(Face(j_iPeop))); these distances constitute the distance set corresponding to i_Peop.
For every person in the face data set, the corresponding distance set is obtained in this way; after the distances are arranged from small to large, if the length of the distance set is n_diset, μ_face is the value at the corresponding position of the sorted distance set, where ⌊ ⌋ denotes taking the integer part;
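A compact sketch of this threshold computation is given below; it applies equally to μ_body when used with the body feature extractor. Euclidean distance is assumed (matching the matching step later in this claim), each extractor call is assumed to return a one-dimensional feature vector, and position_fraction stands in for the position formula in the sorted distance set, which is not reproduced here.

```python
import numpy as np

def estimate_mu(feature_extractor, images_by_person, position_fraction):
    """Sketch of the per-person distance statistic used for mu_face (and mu_body).
    images_by_person maps a person id to that person's list of images;
    position_fraction is an assumed stand-in for the claim's position formula."""
    distances = []
    for person_id, images in images_by_person.items():
        feats = np.stack([feature_extractor(img) for img in images])  # FaceN(Face(j))
        center = feats.mean(axis=0)                                   # centre of this person's set
        # Euclidean distance of every feature of this person to the person's centre.
        distances.extend(np.linalg.norm(feats - center, axis=1).tolist())
    distances.sort()                                                   # small to large
    n_diset = len(distances)
    idx = int(position_fraction * n_diset)                            # integer part of the position
    return distances[min(idx, n_diset - 1)]
```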
Constructing the face feature extractor FaceN and initializing it with the face data set: let the face data set consist of N_faceset persons; the network layer structure FaceN25 is as follows. Layer 1: convolutional layer, input 256×256×3, output 256×256×64, channels=64. Layer 2: convolutional layer, input 256×256×64, output 256×256×64, channels=64. Layer 3: pooling layer, whose input is the layer-1 output 256×256×64 concatenated with the layer-2 output 256×256×64 along the third dimension, output 128×128×128. Layer 4: convolutional layer, input 128×128×128, output 128×128×128, channels=128. Layer 5: convolutional layer, input 128×128×128, output 128×128×128, channels=128. Layer 6: pooling layer, whose input is the layer-4 output 128×128×128 concatenated with the layer-5 output 128×128×128 along the third dimension, output 64×64×256. Layer 7: convolutional layer, input 64×64×256, output 64×64×256, channels=256. Layer 8: convolutional layer, input 64×64×256, output 64×64×256, channels=256. Layer 9: convolutional layer, input 64×64×256, output 64×64×256, channels=256. Layer 10: pooling layer, whose input is the layer-7 output 64×64×256 concatenated with the layer-9 output 64×64×256 along the third dimension, output 32×32×512. Layer 11: convolutional layer, input 32×32×512, output 32×32×512, channels=512. Layer 12: convolutional layer, input 32×32×512, output 32×32×512, channels=512. Layer 13: convolutional layer, input 32×32×512, output 32×32×512, channels=512. Layer 14: pooling layer, whose input is the layer-11 output 32×32×512 concatenated with the layer-13 output 32×32×512 along the third dimension, output 16×16×1024. Layer 15: convolutional layer, input 16×16×1024, output 16×16×512, channels=512. Layer 16: convolutional layer, input 16×16×512, output 16×16×512, channels=512. Layer 17: convolutional layer, input 16×16×512, output 16×16×512, channels=512. Layer 18: pooling layer, whose input is the layer-15 output 16×16×512 concatenated with the layer-17 output 16×16×512 along the third dimension, output 8×8×1024. Layer 19: convolutional layer, input 8×8×1024, output 8×8×256, channels=256. Layer 20: pooling layer, input 8×8×256, output 4×4×256. Layer 21: convolutional layer, input 4×4×256, output 4×4×128, channels=128. Layer 22: pooling layer, input 4×4×128, output 2×2×128. Layer 23: fully connected layer, which first unrolls the 2×2×128 input into a 512-dimensional vector and then feeds it into the fully connected layer; output vector length 512, activation function ReLU. Layer 24: fully connected layer, input vector length 512, output vector length 512, activation function ReLU. Layer 25: fully connected layer, input vector length 512, output vector length N_faceset, activation function soft-max. All convolutional layers have kernel size = 3, stride = (1, 1), and ReLU activation; all pooling layers are max-pooling layers with pooling window size kernel_size = 2 and stride = (2, 2). The initialization procedure is as follows: for each face face4, the output is FaceN25(face4) and the class label is y_face, a vector of length N_faceset encoded as follows: if face face4 belongs to the i_face4-th person in the face image set, then the i_face4-th position of y_face is 1 and all other positions are 0. The evaluation function of the network is the cross-entropy loss computed between FaceN25(face4) and y_face, the convergence direction is its minimization, and the number of iterations is 2000. After the iterations are finished, the face feature extractor FaceN is the FaceN25 network from the first layer to the 24th layer;
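The 25-layer structure can be expressed as a PyTorch sketch; padding=1 is assumed so that the stated spatial sizes are preserved, the concatenation "along the third dimension" (HWC) corresponds to dim=1 in NCHW, and the class and method names are illustrative. The same module, constructed with a different num_classes, serves as BodyN25 below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceN25(nn.Module):
    """Sketch of the 25-layer network in the claim (BodyN25 is identical except for
    the size of the final layer).  Kernel 3, stride 1, padding 1 is assumed."""
    def __init__(self, num_classes):
        super().__init__()
        c = lambda i, o: nn.Conv2d(i, o, kernel_size=3, stride=1, padding=1)
        self.conv1, self.conv2 = c(3, 64), c(64, 64)
        self.conv4, self.conv5 = c(128, 128), c(128, 128)
        self.conv7, self.conv8, self.conv9 = c(256, 256), c(256, 256), c(256, 256)
        self.conv11, self.conv12, self.conv13 = c(512, 512), c(512, 512), c(512, 512)
        self.conv15, self.conv16, self.conv17 = c(1024, 512), c(512, 512), c(512, 512)
        self.conv19, self.conv21 = c(1024, 256), c(256, 128)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc23, self.fc24 = nn.Linear(512, 512), nn.Linear(512, 512)
        self.fc25 = nn.Linear(512, num_classes)

    def features(self, x):                       # layers 1-24: the extractor FaceN
        a = F.relu(self.conv1(x)); b = F.relu(self.conv2(a))
        x = self.pool(torch.cat([a, b], dim=1))                          # layer 3
        a = F.relu(self.conv4(x)); b = F.relu(self.conv5(a))
        x = self.pool(torch.cat([a, b], dim=1))                          # layer 6
        a = F.relu(self.conv7(x)); b = F.relu(self.conv9(F.relu(self.conv8(a))))
        x = self.pool(torch.cat([a, b], dim=1))                          # layer 10
        a = F.relu(self.conv11(x)); b = F.relu(self.conv13(F.relu(self.conv12(a))))
        x = self.pool(torch.cat([a, b], dim=1))                          # layer 14
        a = F.relu(self.conv15(x)); b = F.relu(self.conv17(F.relu(self.conv16(a))))
        x = self.pool(torch.cat([a, b], dim=1))                          # layer 18
        x = self.pool(F.relu(self.conv19(x)))                            # layers 19-20
        x = self.pool(F.relu(self.conv21(x)))                            # layers 21-22
        x = torch.flatten(x, 1)                                          # 2*2*128 = 512
        x = F.relu(self.fc23(x))                                         # layer 23
        return F.relu(self.fc24(x))                                      # layer 24: 512-d feature

    def forward(self, x):                        # layer 25: soft-max classification head
        return F.softmax(self.fc25(self.features(x)), dim=1)
```

In this sketch, FaceN of the claim corresponds to the features() method (layers 1 to 24), while forward() adds the soft-max layer used only during initialization.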
Initializing the body feature extractor BodyN with the multi-angle human body images and computing μ_body proceeds as follows: first step, the multi-angle human body image sets are gathered into the body data set; second step, the body feature extractor BodyN is constructed and initialized with the body data set; third step:
For every person i_Peop1 in the body data set, obtain the set BodySet(i_Peop1) of all human body images in the body data set belonging to i_Peop1:
For each human body image Body(j_iPeop1) in BodySet(i_Peop1):
Compute the body feature BodyN(Body(j_iPeop1));
Take the average of all body features in the current body image set BodySet(i_Peop1) as the center of the current body image set, center(BodyN(Body(j_iPeop1)));
Compute the distances between all body features in the current body image set BodySet(i_Peop1) and the center center(BodyN(Body(j_iPeop1))); these distances constitute the distance set corresponding to i_Peop1.
For every person in the body data set, the corresponding distance set is obtained in this way; after the distances are arranged from small to large, if the length of the distance set is n_diset1, μ_body is the value at the corresponding position of the sorted distance set, where ⌊ ⌋ denotes taking the integer part;
Constructing the body feature extractor BodyN and initializing it with the body data set: let the body data set consist of N_bodyset persons; the network layer structure BodyN25 is identical, layer by layer, to the FaceN25 structure described above (the same 25 layers, convolution parameters kernel size = 3, stride = (1, 1), ReLU activation, and max-pooling layers with kernel_size = 2 and stride = (2, 2)), except that the output vector length of the 25th fully connected layer is N_bodyset, with soft-max activation. The initialization procedure is as follows: for each human body image body4, the output is BodyN25(body4) and the class label is y_body, a vector of length N_bodyset encoded as follows: if human body body4 belongs to the i_body4-th person in the human body image set, then the i_body4-th position of y_body is 1 and all other positions are 0. The evaluation function of the network is the cross-entropy loss computed between BodyN25(body4) and y_body, the convergence direction is its minimization, and the number of iterations is 2000. After the iterations are finished, the body feature extractor BodyN is the BodyN25 network from the first layer to the 24th layer;
Using the complete video and the grasp-only motion video passed on by the shopping action recognition module, the corresponding human body region and face region are located, and face recognition or body recognition with the face feature extractor FaceN and the body feature extractor BodyN yields the ID of the customer corresponding to the video currently passed on by the shopping action recognition module. The procedure is as follows: for the video passed on by the shopping action recognition module, the corresponding human body region and face region are searched for starting from the first frame of the video, until the algorithm terminates or the last frame of the video has been processed:
For the corresponding human body region image Body2 and face region image Face2, the body feature extractor BodyN and the face feature extractor FaceN are used to extract the body feature BodyN(Body2) and the face feature FaceN(Face2), respectively;
The face identification information is used first: the Euclidean distance between FaceN(Face2) and every face feature in the FaceFtu set is computed, and the feature in the FaceFtu set with the smallest Euclidean distance d_Face is selected; let this feature be FaceN(Face3). If d_Face < μ_face, the current face image is identified as belonging to the customer whose face image corresponds to FaceN(Face3), that customer's ID is the ID corresponding to the video action passed on by the shopping action recognition module, and the current identification process terminates;
If d_Face ≥ μ_face, the face identification method alone cannot identify the current individual. In that case the Euclidean distance d_Body between BodyN(Body2) and every body feature in the BodyFtu set is computed, and the feature in the BodyFtu set with the smallest Euclidean distance is selected; let this feature be BodyN(Face3). If d_Body + d_Face < μ_face + μ_body, the current human body image is identified as belonging to the customer whose human body image corresponds to BodyN(Face3), and that customer's ID is the ID corresponding to the video action passed on by the shopping action recognition module;
If no ID corresponding to the video action has been found after all frames of the video have been processed, then, to avoid charging the wrong account by misidentifying the shopper, the video currently passed on by the shopping action recognition module is not processed any further;
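The face-first matching rule with the body-feature fallback can be summarized in the following sketch; a single stored feature per customer ID and a shared ID key space for FaceFtu and BodyFtu are simplifying assumptions.

```python
import numpy as np

def identify_customer(face_feat, body_feat, face_db, body_db, mu_face, mu_body):
    """Sketch of the matching rule.  face_db / body_db map customer IDs to the stored
    FaceFtu / BodyFtu features; face_feat and body_feat are FaceN(Face2) and BodyN(Body2)."""
    ids = list(face_db.keys())
    d_face = {i: np.linalg.norm(face_feat - face_db[i]) for i in ids}   # Euclidean distances
    best_face = min(ids, key=d_face.get)
    if d_face[best_face] < mu_face:
        return best_face                          # face recognition alone suffices

    # Face alone is not conclusive: fall back on the combined face + body rule.
    d_body = {i: np.linalg.norm(body_feat - body_db[i]) for i in ids}
    best_body = min(ids, key=d_body.get)
    if d_body[best_body] + d_face[best_face] < mu_face + mu_body:
        return best_body
    return None                                   # this frame identifies no one; try the next frame
```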
Searching for the corresponding human body region and face region from the first frame of the video passed on by the shopping action recognition module proceeds as follows: the video passed on by the shopping action recognition module is processed from its first frame. Suppose the i_fRg-th frame is currently being processed, the position obtained by the target detection module for this frame is (a_ifRg, b_ifRg, l_ifRg, w_ifRg), the set of human body regions obtained by the target detection module for this frame is BodyFrameSet_ifRg, and the set of face regions is FaceFrameSet_ifRg. For each human body region (a_BFSifRg, b_BFSifRg, l_BFSifRg, w_BFSifRg) in BodyFrameSet_ifRg, compute its distance d_gbt = (a_BFSifRg − a_ifRg)² + (b_BFSifRg − b_ifRg)² − (l_BFSifRg − l_ifRg)² − (w_BFSifRg − w_ifRg)²; the human body region with the smallest distance among all human body regions is taken as the human body region corresponding to the current video. Let the position of the chosen human body region be (a_BFS1, b_BFS1, l_BFS1, w_BFS1). For each face region (a_FFSifRg, b_FFSifRg, l_FFSifRg, w_FFSifRg) in the face region set FaceFrameSet_ifRg, compute its distance d_gft = (a_BFS1 − a_FFSifRg)² + (b_BFS1 − b_FFSifRg)² − (l_BFS1 − l_FFSifRg)² − (w_BFS1 − w_FFSifRg)²; the face region with the smallest distance among all face regions is taken as the face region corresponding to the current video.
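A short sketch of this region association, using the two distance formulas above; the box and list representations are illustrative.

```python
def match_regions(action_box, body_boxes, face_boxes):
    """Sketch of the region association.  action_box is the box (a, b, l, w) reported
    for the current frame; body_boxes / face_boxes are the candidate human body and
    face boxes from the target detection module for that frame."""
    def d(box1, box2):
        (a1, b1, l1, w1), (a2, b2, l2, w2) = box1, box2
        # (a1-a2)^2 + (b1-b2)^2 - (l1-l2)^2 - (w1-w2)^2, as in the claim
        return (a1 - a2)**2 + (b1 - b2)**2 - (l1 - l2)**2 - (w1 - w2)**2

    body = min(body_boxes, key=lambda box: d(box, action_box))   # closest human body region
    face = min(face_boxes, key=lambda box: d(box, body))          # face region closest to it
    return body, face
```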
7. The supermarket intelligent vending system according to claim 1, characterized in that the recognition result processing module is specifically implemented as follows:
It does not work during initialization. During recognition, the received recognition results are integrated to generate the shopping list corresponding to each customer: first, the customer ID passed on by the individual identification module determines the customer corresponding to the current shopping information, so the shopping list to be modified is the one numbered ID; then the recognition result passed on by the product identification module determines the product involved in the current customer's shopping action, denoted GoodA; then the recognition result passed on by the shopping action recognition module determines the current shopping action and the shopping cart is modified accordingly: if the action is recognized as taking out an item, product GoodA on shopping list ID is increased by 1; if the action is recognized as putting back an item, product GoodA on shopping list ID is reduced by 1; if the action is recognized as "taken out and put back" or "taken out but not put back", the shopping list is not changed; if the recognition result is "suspicious stealing", an alarm signal and the location information corresponding to the current video are sent to supermarket monitoring.
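The bookkeeping described in this claim can be sketched as follows; the action label strings, the cart data structure, and the alert callback are illustrative assumptions, not part of the claim.

```python
def process_recognition(carts, customer_id, product, action, alert_fn=None, location=None):
    """Sketch of the recognition result processing: carts maps a customer ID to a
    {product: count} shopping list; action is the label produced by the shopping
    action recognition module."""
    cart = carts.setdefault(customer_id, {})
    if action == "take out":                       # item taken from the shelf
        cart[product] = cart.get(product, 0) + 1
    elif action == "put back":                     # item returned to the shelf
        cart[product] = max(cart.get(product, 0) - 1, 0)
    elif action in ("take out and put back", "took out but did not put back"):
        pass                                       # shopping list unchanged
    elif action == "suspicious stealing" and alert_fn is not None:
        alert_fn(location)                         # alarm plus camera location to monitoring
    return carts
```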
CN201910263910.1A 2019-04-03 2019-04-03 A kind of supermarket's intelligence vending system Withdrawn CN109977896A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910263910.1A CN109977896A (en) 2019-04-03 2019-04-03 A kind of supermarket's intelligence vending system


Publications (1)

Publication Number Publication Date
CN109977896A true CN109977896A (en) 2019-07-05

Family

ID=67082544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910263910.1A Withdrawn CN109977896A (en) 2019-04-03 2019-04-03 A kind of supermarket's intelligence vending system

Country Status (1)

Country Link
CN (1) CN109977896A (en)


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110379108A (en) * 2019-08-19 2019-10-25 铂纳思(东莞)高新科技投资有限公司 A kind of method and its system of unmanned shop anti-thefting monitoring
WO2021047232A1 (en) * 2019-09-11 2021-03-18 苏宁易购集团股份有限公司 Interaction behavior recognition method, apparatus, computer device, and storage medium
CN110674712A (en) * 2019-09-11 2020-01-10 苏宁云计算有限公司 Interactive behavior recognition method and device, computer equipment and storage medium
CN110619308A (en) * 2019-09-18 2019-12-27 名创优品(横琴)企业管理有限公司 Aisle sundry detection method, device, system and equipment
CN110796051A (en) * 2019-10-19 2020-02-14 北京工业大学 Real-time access behavior detection method and system based on container scene
CN110796051B (en) * 2019-10-19 2024-04-26 北京工业大学 Real-time access behavior detection method and system based on container scene
CN111582202A (en) * 2020-05-13 2020-08-25 上海海事大学 Intelligent course system
CN111582202B (en) * 2020-05-13 2023-10-17 上海海事大学 Intelligent net class system
CN111723741A (en) * 2020-06-19 2020-09-29 江苏濠汉信息技术有限公司 Temporary fence movement detection alarm system based on visual analysis
CN113408501A (en) * 2021-08-19 2021-09-17 北京宝隆泓瑞科技有限公司 Oil field park detection method and system based on computer vision
CN113901895A (en) * 2021-09-18 2022-01-07 武汉未来幻影科技有限公司 Door opening action recognition method and device for vehicle and processing equipment
CN114596661A (en) * 2022-02-28 2022-06-07 安顺市成威科技有限公司 Multifunctional intelligent sales counter
CN114596661B (en) * 2022-02-28 2023-03-10 安顺市成威科技有限公司 Multifunctional intelligent sales counter
CN117253194A (en) * 2023-11-13 2023-12-19 网思科技股份有限公司 Commodity damage detection method, commodity damage detection device and storage medium
CN117253194B (en) * 2023-11-13 2024-03-19 网思科技股份有限公司 Commodity damage detection method, commodity damage detection device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20190705)