CN111079699A - Commodity identification method and device


Info

Publication number: CN111079699A
Application number: CN201911394895.0A (filed by Beijing Missfresh Ecommerce Co Ltd)
Authority: CN (China)
Legal status: Pending (assumed; not a legal conclusion)
Prior art keywords: image, images, sample, detection, target
Other languages: Chinese (zh)
Inventors: 蔡丁丁, 龙寿伦
Original Assignee: Beijing Missfresh Ecommerce Co Ltd
Current Assignee: Beijing Daily Youxian Technology Co., Ltd.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm

Abstract

The application discloses a commodity identification method and device, belonging to the field of internet technology. In the application, each of a plurality of images is identified by a commodity identification model. Because the commodity identification model is trained on a plurality of sample images that include hands and commodities, it can detect the hands and commodities contained in the images. From the detection results produced by the model, a plurality of target images that include a hand and the commodity held by the hand can be determined. A commodity identification result is then determined from the position information of the hand and of the held commodity in each target image, together with the acquisition time of each target image. In this way, other interference information contained in the images is filtered out and only the category of the commodity held in the hand is considered, which reduces misidentification of irrelevant commodities and improves commodity identification accuracy.

Description

Commodity identification method and device
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method and an apparatus for identifying a commodity.
Background
At present, with the development of intelligent retail, automatic vending cabinets have been deployed at scale in public places such as shopping malls, schools and office buildings. Vending cabinets are divided into closed vending cabinets and open vending cabinets. When purchasing commodities from an open vending cabinet, a user can scan a code to open the cabinet door and then pick the commodities. The vending cabinet acquires images of the purchase process through an image acquisition apparatus arranged on the cabinet body, identifies the commodities the user takes from the acquired images, and settles the bill for the identified commodities after the user closes the cabinet door. However, because the images captured during the purchase process contain a large amount of interference information, commodities are easily misidentified. A commodity identification method that improves the accuracy of commodity identification is therefore desirable.
Disclosure of Invention
The embodiment of the application provides a commodity identification method, a commodity identification device and a storage medium, which can reduce the probability of commodity misidentification and improve commodity identification accuracy. The technical scheme is as follows:
in one aspect, a method for identifying a commodity is provided, the method comprising:
acquiring a plurality of images acquired in the process of purchasing commodities by a user;
identifying each image in the multiple images through a commodity identification model to obtain a detection result of each image, wherein the commodity identification model is obtained by training multiple sample images including hands and commodities;
determining a plurality of target images including the hand and the commodity held by the hand from the plurality of images according to the detection result of each image;
and determining a commodity identification result according to the position information of the hand and the commodity held by the hand in each target image and the acquisition time of each target image.
Optionally, before the identifying each of the plurality of images by the product identification model, the method further includes:
acquiring a plurality of test images and the labeling information of each test image, and acquiring a plurality of sample images and the labeling information of each sample image, wherein the labeling information is used for indicating the positions of hands and the positions of commodities in the corresponding images;
training the initial network according to the multiple sample images and the labeling information of each sample image to obtain a basic recognition model;
determining the detection precision of the basic recognition model according to the multiple test images and the labeling information of each test image;
and if the detection precision does not reach the reference precision value, updating the multiple sample images and returning to the step of acquiring the multiple sample images and the labeling information of each sample image, until the detection precision reaches the reference precision value, and taking the basic recognition model obtained last as the commodity identification model.
Optionally, the initial network comprises a feature extraction network and a target detection network;
the training the initial network according to the labeling information of the multiple sample images and each sample image to obtain a basic recognition model comprises the following steps:
performing feature extraction on a first sample image through the feature extraction network to obtain a feature matrix of the first sample image, wherein the first sample image is any one sample image in the plurality of sample images;
processing the characteristic matrix through the target detection network to obtain a sample detection result of the first sample image;
determining a loss function value according to the marking information of the first sample image and the sample detection result;
and if the loss function value does not meet a first preset condition, adjusting parameters in the feature extraction network and the target detection network according to the loss function value, updating the first sample image, returning to the step of performing feature extraction on the first sample image through the feature extraction network, and determining the network after the last parameter adjustment as the basic identification model until the loss function value meets the first preset condition.
Optionally, the annotation information includes a position and a size of each annotation frame in the corresponding image and a category of an annotation object in each annotation frame, where the annotation object is a hand or a commodity; the sample detection result includes the position and size of each detection frame detected in the corresponding sample image, and the category of the detection object within each detection frame.
Optionally, the determining the detection precision of the basic recognition model according to the multiple test images and the labeling information of each test image includes:
identifying each test image through the basic identification model to obtain a detection result of the corresponding test image;
filtering the detection result of each test image to obtain the identification result of each test image;
and determining the detection precision of the basic recognition model according to the recognition result and the labeling information of each test image.
Optionally, the detection result of each of the plurality of images includes a position and a size of each of one or more candidate frames in the corresponding image, an object existence probability indicating whether each candidate frame contains an object, a plurality of candidate categories of the object in each candidate frame, and a probability corresponding to each of the plurality of candidate categories;
the determining a plurality of target images including a hand and a commodity from the plurality of images according to a detection result of each image includes:
determining the object category in the corresponding candidate frame and the corresponding confidence of the corresponding candidate frame according to the object existence probability of each candidate frame in the detection result of each image and the probability corresponding to each candidate category in a plurality of candidate categories of the object in the corresponding candidate frame;
filtering the detection result of each image according to the category of the object in each candidate frame in each image, the corresponding confidence coefficient of each candidate frame and the position and the size of each candidate frame to obtain the identification result of each image;
and determining the plurality of target images from the plurality of images according to the recognition result of each image.
Optionally, the determining the plurality of target images from the plurality of images according to the recognition result of each image includes:
if one or more first candidate frames exist in the first image, wherein the category of the contained object is a hand and the corresponding confidence coefficient is greater than the reference confidence coefficient, acquiring the candidate frame with the maximum confidence coefficient from the one or more first candidate frames as a first target frame;
if one or more second candidate frames exist in the first image in which the category of the contained object is any one of the preset commodity categories, acquiring, from the one or more second candidate frames, the second target frames whose confidence is greater than the reference confidence;
determining the relative position relation between each second target frame and the first target frame according to the position of each second target frame and the position of the first target frame;
and if the relative position relation between any second target frame and the first target frame meets a second preset condition, determining the first image as the target image.
In another aspect, there is provided an article recognition apparatus, the apparatus including:
the system comprises a first acquisition module, a second acquisition module and a display module, wherein the first acquisition module is used for acquiring a plurality of images acquired in the process of purchasing commodities by a user;
the identification module is used for identifying each image in the images through a commodity identification model to obtain a detection result of each image, and the commodity identification model is obtained by training a plurality of sample images including hands and commodities;
the first determining module is used for determining a plurality of target images comprising the hand and the commodity held by the hand from the plurality of images according to the detection result of each image;
and the second determining module is used for determining a commodity identification result according to the position information of the hand and the commodity held by the hand in each target image and the acquisition time of each target image.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring a plurality of test images and the labeling information of each test image, and acquiring a plurality of sample images and the labeling information of each sample image, wherein the labeling information is used for indicating the positions of hands and the positions of commodities in the corresponding images;
the training module is used for training the initial network according to the multiple sample images and the labeling information of each sample image to obtain a basic recognition model;
the third determining module is used for determining the detection precision of the basic recognition model according to the multiple test images and the labeling information of each test image;
and the triggering module is used for, if the detection precision does not reach the reference precision value, updating the multiple sample images and triggering the second acquisition module to acquire the multiple sample images and the labeling information of each sample image, until the detection precision reaches the reference precision value, and taking the basic recognition model obtained last as the commodity identification model.
Optionally, the initial network comprises a feature extraction network and a target detection network;
the training module is specifically configured to:
performing feature extraction on a first sample image through the feature extraction network to obtain a feature matrix of the first sample image, wherein the first sample image is any one sample image in the plurality of sample images;
processing the characteristic matrix through the target detection network to obtain a sample detection result of the first sample image;
determining a loss function value according to the marking information of the first sample image and the sample detection result;
and if the loss function value does not meet a first preset condition, adjusting parameters in the feature extraction network and the target detection network according to the loss function value, updating the first sample image, returning to the step of performing feature extraction on the first sample image through the feature extraction network, and determining the network after the last parameter adjustment as the basic identification model until the loss function value meets the first preset condition.
Optionally, the annotation information includes a position and a size of each annotation frame in the corresponding image and a category of an annotation object in each annotation frame, where the annotation object is a hand or a commodity; the sample detection result includes the position and size of each detection frame detected in the corresponding sample image, and the category of the detection object within each detection frame.
Optionally, the third determining module is specifically configured to:
identifying each test image through the basic identification model to obtain a detection result of the corresponding test image;
filtering the detection result of each test image to obtain the identification result of each test image;
and determining the detection precision of the basic recognition model according to the recognition result and the labeling information of each test image.
Optionally, the detection result of each of the plurality of images includes a position and a size of each of one or more candidate frames in the corresponding image, an object existence probability indicating whether each candidate frame contains an object, a plurality of candidate categories of the object in each candidate frame, and a probability corresponding to each of the plurality of candidate categories;
the first determining module is specifically configured to:
determining the object category in the corresponding candidate frame and the corresponding confidence of the corresponding candidate frame according to the object existence probability of each candidate frame in the detection result of each image and the probability corresponding to each candidate category in a plurality of candidate categories of the object in the corresponding candidate frame;
filtering the detection result of each image according to the category of the object in each candidate frame in each image, the corresponding confidence coefficient of each candidate frame and the position and the size of each candidate frame to obtain the identification result of each image;
and determining the plurality of target images from the plurality of images according to the recognition result of each image.
Optionally, the first determining module is specifically configured to:
if one or more first candidate frames exist in the first image, wherein the category of the contained object is a hand and the corresponding confidence coefficient is greater than the reference confidence coefficient, acquiring the candidate frame with the maximum confidence coefficient from the one or more first candidate frames as a first target frame;
if one or more second candidate frames exist in the first image in which the category of the contained object is any one of the preset commodity categories, acquiring, from the one or more second candidate frames, the second target frames whose confidence is greater than the reference confidence;
determining the relative position relation between each second target frame and the first target frame according to the position of each second target frame and the position of the first target frame;
and if the relative position relation between any second target frame and the first target frame meets a second preset condition, determining the first image as the target image.
In another aspect, a merchandise identification device is provided, the merchandise identification device comprising a processor, a communication interface, a memory, and a communication bus;
the processor, the communication interface and the memory complete mutual communication through the communication bus;
the memory is used for storing computer programs;
the processor is used for executing the program stored in the memory so as to realize the commodity identification method.
In another aspect, a computer-readable storage medium is provided, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the article identification method as provided above.
In another aspect, a computer program product comprising instructions is provided, which when run on a computer causes the computer to perform the steps of the aforementioned article identification method.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
in the embodiment of the application, each of the plurality of images can be identified by the commodity identification model. Because the commodity identification model is trained on a plurality of sample images that include hands and commodities, it can detect the hands and commodities contained in the images. From the detection results produced by the model, a plurality of target images that include a hand and the commodity held by the hand can be determined. A commodity identification result is then determined from the position information of the hand and of the held commodity in each target image, together with the acquisition time of each target image. In this way, other interference information contained in the images is filtered out and only the category of the commodity held in the hand is considered, which reduces misidentification of irrelevant commodities and improves commodity identification accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of an automatic vending cabinet according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for training a commodity identification model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an annotated sample image provided by an embodiment of the present application;
fig. 4 is a flowchart of a method for identifying a commodity according to an embodiment of the present application;
FIG. 5 is a schematic view of a user's hand inside and outside a cabinet door according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an article identification device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an identification device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before explaining the embodiments of the present application in detail, an application scenario related to the embodiments of the present application will be described.
For an open vending cabinet, a user can open the cabinet door by scanning a code, take the goods, and close the cabinet door. The vending cabinet collects images while the user takes commodities, recognizes the collected images, and settles the bill according to the commodity recognition result, thereby completing the whole vending process. The images collected while the user takes the goods contain a large amount of information about the surrounding environment; for example, they may contain information about commodities in the cabinet that the user did not take. Such extra information interferes with commodity identification and leads to misidentification. The commodity identification method provided by the embodiment of the application can be applied in such scenarios to reduce the interference of extra information in the images and improve the accuracy of commodity identification.
Next, a system architecture related to the method for identifying a commodity provided by the embodiment of the present application is described.
Fig. 1 shows a vending cabinet to which the commodity identification method provided in the embodiment of the present application can be applied. As shown in fig. 1, the vending cabinet may comprise an image acquisition apparatus 101 and a recognition device 102.
The image acquisition device 101 is disposed on the cabinet of the vending cabinet, and the image acquisition device may be a multi-angle camera. In this embodiment of the application, the image capturing device 101 may capture an image when the user opens the door of the automatic sales counter to shop for a commodity, and send the captured image to the identification device 102. It should be noted that the image capturing device 101 may capture an image according to a preset period, or may directly capture video data and send the video data to the identification device 102.
Alternatively, the number of the image capturing devices 101 may be plural. In this case, the plurality of image capturing devices 101 may be disposed at different positions of the vending cabinet so as to capture images from various angles.
The identification device 102 may receive a plurality of images or video data acquired by the image acquisition apparatus, and identify the commodity purchased by the user according to the received images or video data by using the commodity identification method provided in the embodiment of the present application. The identification device 102 may be an industrial personal computer or an intelligent terminal, which is not limited in this embodiment of the application.
In some possible cases, the identification device 102 may also be used to control the vending cabinet, e.g. to control the opening and closing of the doors of the vending cabinet, etc.
Next, a description is given of a product identification method provided in the embodiment of the present application.
In the embodiment of the application, the collected multiple images can be identified through the commodity identification model, and then the commodity identification result is determined according to the detection results of the multiple images. The commodity identification model may be obtained by training in advance based on a plurality of sample images including the hand and the commodity. Based on this, before describing the product identification method provided in the embodiment of the present application, a description is first given of a training process of a product identification model.
Fig. 2 is a flowchart of a training method for a commodity recognition model according to an embodiment of the present application. The training method may be performed by a recognition device, a server, or other devices, which is not limited in this embodiment of the present application. In the following embodiments, the training method is explained by taking the application to a server as an example. Referring to fig. 2, the method comprises the steps of:
step 201: and acquiring the multiple test images and the labeling information of each test image, and acquiring the multiple sample images and the labeling information of each sample image.
In the embodiment of the application, user purchase operations are simulated multiple times on a plurality of different vending cabinets, and the image acquisition apparatus on each vending cabinet collects images of the simulated purchase process, thereby producing a large number of images. After a large number of images are obtained, the images containing a human hand and the goods held by the hand can be labeled manually. For example, the hand and the product held by the hand may be marked in each image with rectangular labeling frames, and the labeling information of the image may be generated from the position and size of each labeling frame and the category of the object it contains.
For example, referring to fig. 3, a hand is marked with a first labeling frame 301 and the product held by the hand is marked with a second labeling frame 302 in one image. The center position coordinates and size of the first labeling frame 301 and of the second labeling frame 302 are then acquired. The center position coordinates and size of the first labeling frame 301, together with the category number 1 indicating that the object in the frame is a hand, form one mapping relation; the center position coordinates and size of the second labeling frame 302, together with the category number 2 indicating the category of the product in the frame, form another. These two mapping relations are stored as the labeling information of the image in a text file named after the image ID.
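The record layout below is a minimal sketch of how such labeling information might be serialized. The file naming and class numbering follow the example above; the field order and text format are otherwise assumptions rather than a format specified by this application.

```python
# Assumed annotation record layout: one line per labeled frame, storing the
# frame center, size, and a category number (1 = hand, 2 = commodity).
from dataclasses import dataclass

@dataclass
class LabelBox:
    cx: float      # center x, in pixels
    cy: float      # center y, in pixels
    w: float       # frame width
    h: float       # frame height
    class_id: int  # 1 = hand, 2 = commodity

def save_annotation(image_id: str, boxes: list[LabelBox]) -> None:
    # Store the mapping relations in a text file named after the image ID.
    with open(f"{image_id}.txt", "w") as f:
        for b in boxes:
            f.write(f"{b.cx} {b.cy} {b.w} {b.h} {b.class_id}\n")

# Example: a hand frame (category 1) and a held-commodity frame (category 2).
save_annotation("img_0001", [
    LabelBox(320.0, 240.0, 80.0, 60.0, 1),
    LabelBox(335.0, 300.0, 70.0, 90.0, 2),
])
```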
The method can be adopted for labeling a plurality of images collected according to the operation of a simulation user, so that the labeling information of each image is obtained. Then, a training data set may be generated from the partial images of the plurality of marked images, and a test data set may be generated from the remaining partial images. The images included in the training data set are multiple sample images for model training, and the images included in the test data set are multiple test images for model testing.
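A minimal sketch of such a split, assuming a random partition and an illustrative 80/20 ratio (the application does not specify the proportion):

```python
# Partition labeled image IDs into a training data set and a test data set.
# The 80/20 ratio and fixed seed are assumed values for illustration.
import random

def split_dataset(image_ids, train_ratio=0.8, seed=42):
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]  # (sample images, test images)

train_ids, test_ids = split_dataset([f"img_{i:04d}" for i in range(1000)])
```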
Based on this, in this step, the server may obtain a plurality of sample images from the training data set, and obtain the annotation information of each sample image. Meanwhile, the server can acquire a plurality of test images from the test data set and acquire the labeling information of each test image.
Step 202: and training the initial network according to the multiple sample images and the labeling information of each sample image to obtain a basic recognition model.
In an embodiment of the present application, the initial network may include a feature extraction network and an object detection network. The server can perform feature extraction on the first sample image through a feature extraction network to obtain a feature matrix of the first sample image, wherein the first sample image is any one of the plurality of sample images; processing the characteristic matrix through a target detection network to obtain a sample detection result of the first sample image; determining a loss function value according to the marking information of the first sample image and the sample detection result; and adjusting parameters in the feature extraction network and the target detection network according to the loss function value, updating the first sample image, returning to the step of extracting the features of the first sample image through the feature extraction network, and determining the network with the parameters adjusted for the last time as a basic identification model until the loss function value meets a first preset condition.
The server may obtain one sample image from the plurality of sample images as a first sample image, use the first sample image as the input of the feature extraction network, and perform feature extraction on the first sample image through the feature extraction network, thereby obtaining a feature matrix of the first sample image. The feature matrix characterizes the information within the first sample image at a high-dimensional abstraction level. In order to reduce the amount of calculation and improve the calculation speed, the feature extraction network may adopt a MobileNet-V2 lightweight network structure built under the TensorFlow deep learning framework. Of course, the feature extraction network may also adopt a lightweight structure such as ShuffleNet or SqueezeNet, or a stronger structure such as Darknet or ResNet, which is not limited in this embodiment of the present application.
After obtaining the feature matrix of the first sample image, the feature matrix may be used as an input of the target detection network, and the feature matrix may be processed by the target detection network, so as to output a sample detection result of the first sample image. The sample detection result may include a position and a size of each detection frame detected in the first sample image and a category of the detection object within each detection frame. The number of the detection frames detected in the first sample image may be one or multiple, the shape of each detection frame may be a rectangle, and the position of each detection frame may be the center position coordinate of the corresponding detection frame.
In addition, it should be further noted that the target detection network may adopt a YOLO-v3 network structure built under the TensorFlow deep learning framework. Optionally, a network structure such as YOLO-v2, SSD, RetinaNet, RefineNet or Faster R-CNN may also be used for target detection, which is not limited in this embodiment of the present application.
Optionally, in addition to the TensorFlow deep learning framework, other deep learning frameworks, such as PyTorch, Caffe or MXNet, may also be selected in this embodiment.
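As an illustration of the architecture named above, the following TensorFlow/Keras sketch wires a MobileNet-V2 backbone to a single-scale detection head. A real YOLO-v3 head predicts at three scales with anchor boxes; the single 1x1 convolution here is a deliberate simplification, and the class count and anchors per cell are assumed values.

```python
# Sketch of the initial network: MobileNet-V2 feature extraction network
# feeding a simplified single-scale detection head.
import tensorflow as tf

NUM_CLASSES = 2      # hand, commodity (assumed label set)
BOXES_PER_CELL = 3   # anchors per grid cell (assumed)

def build_initial_network(input_shape=(416, 416, 3)) -> tf.keras.Model:
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=None)
    x = backbone.output
    # Per box: 4 coordinates + 1 object existence probability + class scores.
    preds = tf.keras.layers.Conv2D(
        BOXES_PER_CELL * (5 + NUM_CLASSES), kernel_size=1)(x)
    return tf.keras.Model(backbone.input, preds)

model = build_initial_network()
```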
After the sample detection result of the first sample image is obtained, a loss function value may be calculated according to a difference between the sample detection result of the first sample image and the annotation information, and each parameter in the network may be adjusted according to the loss function value.
For each detection frame included in the sample detection result of the first sample image, the annotation frame corresponding to the detection frame can be searched from the annotation information of the first sample image according to the position of the detection frame and the type of the object included in the detection frame. Then, the loss function value can be calculated according to the position difference between each detection frame and the corresponding labeling frame, and the difference between the type of the object contained in each detection frame and the type of the object contained in the corresponding labeling frame. The loss function value can be used to represent the difference between the sample detection result and the labeled information, and the larger the loss function value, the more the current detection result deviates from the real result.
It should be noted that, before the initial network is trained, an initial learning rate may be set, and as the number of times of training the network increases, the learning rate may gradually decrease. The smaller the learning rate, the smaller the adjustment range of the parameters in the network, and the slower the variation speed of the loss function value.
After determining the loss function value, it may be determined whether the loss function value satisfies a first preset condition. If the loss function value does not meet the first preset condition, one sample image can be obtained from the multiple sample images again to serve as the first sample image, namely, the previous first sample image is updated, so that the internal parameters of the network are adjusted by referring to the updated sample image again to the previous process until the loss function value meets the first preset condition, and the network after the parameters are adjusted for the last time is used as the basic identification model.
The first preset condition is that the loss function value has stopped decreasing, that is, it has plateaued. Based on this, when judging whether the loss function value meets the first preset condition, it can be judged whether the difference between the currently calculated loss function value and each of the loss function values from the previous several calculations falls within a preset numerical range. If so, the loss function value has stabilized and will not decrease further; at this point the training process can be stopped, and the network obtained after the parameters are adjusted according to the loss function value for the last time is used as the basic recognition model.
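A sketch of this training loop and stopping rule, under assumed choices for the optimizer, the decay schedule, and the plateau tolerance (the application only requires that the learning rate decrease over training and that the loss stop decreasing):

```python
# Per-sample training with a plateau stopping rule (first preset condition).
import tensorflow as tf

def loss_has_plateaued(history, window=5, tol=1e-3):
    # First preset condition: the current loss differs from each of the
    # previous `window` losses by less than `tol` (assumed values).
    if len(history) <= window:
        return False
    current, recent = history[-1], history[-window - 1:-1]
    return all(abs(current - prev) < tol for prev in recent)

def train(model, sample_stream, compute_loss):
    # The description mentions a learning rate that decays as training
    # proceeds; an exponential schedule is one assumed realization.
    lr = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    history = []
    for image, annotation in sample_stream:  # each step updates the first sample image
        with tf.GradientTape() as tape:
            prediction = model(image[tf.newaxis], training=True)
            loss = compute_loss(prediction, annotation)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        history.append(float(loss))
        if loss_has_plateaued(history):
            break  # the last-adjusted network becomes the basic recognition model
    return model
```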
Step 203: and determining the detection precision of the basic recognition model according to the multiple test images and the labeling information of each test image.
After training the basic recognition model, the server may further test the basic recognition model through a plurality of test images obtained from the test data set.
Illustratively, the server can identify each test image through the basic identification model to obtain the detection result of the corresponding test image; filtering the detection result of each test image to obtain the identification result of each test image; and determining the detection precision of the basic recognition model according to the recognition result and the labeling information of each test image.
Taking any one of the plurality of test images as an example, for convenience of description, the test image is referred to as a first test image. The server may use the first test image as an input of the basic recognition model, and the basic recognition model may perform feature extraction on the first test image through a feature extraction network to obtain a feature matrix of the first test image. And finally, processing the characteristic matrix through the target detection network, thereby outputting the detection result of the first test image. The detection result of the first test image comprises the position and the size of each detection frame detected in the first test image, the object existence probability for indicating whether each detection frame contains the object, a plurality of candidate categories of the object contained in each detection frame and the probability corresponding to each candidate category.
After the detection result of the first test image is obtained, considering that there may be a detection frame which is erroneously identified or a detection frame which is used for framing the same object repeatedly exists in the detection frames included in the detection result, the server may further perform filtering on a plurality of detection frames included in the detection result of the first test image, thereby obtaining a final identification result of the first test image.
For example, for each detection frame included in the detection result of the first test image, the server may determine whether the detection frame is located in the image range of the first test image according to the position and size of each detection frame, and if some or all of the detection frame is not located in the image range of the first test image, the detection frame may be determined to be an invalid detection frame with a recognition error, and at this time, the invalid detection frame may be deleted.
After removing the invalid detection box in the detection result, the server may remove the duplicate detection box included in the detection result through a Non-maximum suppression (NMS) algorithm.
For example, the server may calculate a product between the object existence probability of the detection box and the probability corresponding to each candidate category of the object included in the detection box, take a maximum value of the calculated products as the confidence of the corresponding detection box, and take the candidate category for which the maximum value is calculated as the target category of the object included in the corresponding detection box. For each detection frame in the first test image, the server may obtain the confidence corresponding to the corresponding detection frame and the target class of the object included in the corresponding detection frame by referring to the above method.
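This confidence rule can be stated compactly; the function below is a sketch with an assumed dictionary representation for the per-category probabilities:

```python
# Confidence = object existence probability x best class probability; the
# arg-max category becomes the target category of the detection frame.
def box_confidence(objectness: float, class_probs: dict[str, float]):
    """Return (confidence, target_class) for one detection frame."""
    target_class = max(class_probs, key=class_probs.get)
    return objectness * class_probs[target_class], target_class

conf, cls = box_confidence(0.9, {"hand": 0.7, "commodity": 0.2})
# conf == 0.63, cls == "hand"
```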
Thereafter, the server may search for whether there is a duplicate detection frame from the detection frames included in the detection result of the first test image. If only one detection frame is included in the detection result of the first test image, it can be directly determined that no repeated detection frame exists in the first test image. At this time, the server may use the calculated confidence degree corresponding to the detection frame and the target type of the object included in the detection frame as the recognition result of the first test image.
Alternatively, if a plurality of detection frames are included in the detection result of the first test image, the server may calculate the relative positional relationship between every two detection frames. And judging whether the two detection frames are repeated detection frames according to the relative position relationship between every two detection frames.
The server may determine whether an overlapping area exists between the two detection frames according to their position coordinates and sizes; if no overlapping area exists, the two detection frames are not duplicates. If an overlapping area exists, and the ratio of the area of the intersection of the two detection frames to the area of their union (the intersection over union, IoU) exceeds a preset value, and the objects in the two detection frames are of the same category, the two detection frames can be determined to be duplicate detection frames.
For the two determined repeated detection frames, the server may reserve the detection frame with the highest confidence level in the repeated detection frames and delete the detection frame with the low confidence level.
It should be noted that there may be more than two detection frames that are both duplicate detection frames, and in this case, the one with the highest confidence may also be retained.
After filtering the detection frames in the detection result of the first test image, the remaining detection frames may be used as the final recognition result of the first test image.
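The two filtering steps, dropping out-of-image frames and applying NMS, can be sketched as follows, assuming each frame is a (cx, cy, w, h, confidence, category) tuple; the IoU threshold is an assumed value:

```python
# Filter invalid detection frames, then suppress duplicates with NMS:
# among same-category frames whose IoU exceeds the threshold, keep only
# the frame with the highest confidence.
def inside_image(box, img_w, img_h):
    cx, cy, w, h = box[:4]
    return (0 <= cx - w / 2 and cx + w / 2 <= img_w
            and 0 <= cy - h / 2 and cy + h / 2 <= img_h)

def iou(a, b):
    ax1, ay1, ax2, ay2 = a[0]-a[2]/2, a[1]-a[3]/2, a[0]+a[2]/2, a[1]+a[3]/2
    bx1, by1, bx2, by2 = b[0]-b[2]/2, b[1]-b[3]/2, b[0]+b[2]/2, b[1]+b[3]/2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def filter_detections(boxes, img_w, img_h, iou_thresh=0.5):
    boxes = [b for b in boxes if inside_image(b, img_w, img_h)]
    boxes.sort(key=lambda b: b[4], reverse=True)  # highest confidence first
    kept = []
    for b in boxes:
        # Suppress b if a kept frame of the same category overlaps too much.
        if all(k[5] != b[5] or iou(k, b) <= iou_thresh for k in kept):
            kept.append(b)
    return kept
```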
For each test image, the server may refer to the above processing procedure for the first test image to process it, so as to obtain the recognition result of each test image.
After the recognition result of each test image is determined, the server may determine the detection accuracy of the basic recognition model according to the recognition result of each test image and the label information of the corresponding test image.
For any detection frame in the recognition result of any test image, the server may search, according to the category and position of the object in the detection frame, for a matching annotation frame in the annotation information of the test image. If a matching annotation frame is found, the intersection over union (IoU) of the detection frame and the annotation frame can be calculated, and if the IoU is greater than a preset threshold, the detection frame is determined to be a correct detection. The server may then group the correct detection frames in the test image by the category of the contained object, and likewise group the annotation frames in the annotation information of the test image. For each category, the server can calculate the ratio of the number of correct detection frames to the number of annotation frames, thereby obtaining the detection accuracy rate of the basic recognition model for each category of object contained in the test image. The server can then average the detection accuracy rates for the same category across all test images to obtain the average detection precision of the basic recognition model for each category of object. Finally, the server may average the average precision over all categories, thereby obtaining the mAP (Mean Average Precision) used to evaluate the detection precision of the basic recognition model.
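A sketch of this accuracy computation, reusing the `iou` helper from the previous sketch. It follows the simplified per-image matching procedure described here rather than the standard ranked-precision mAP; frames are assumed to carry their category in the last tuple position.

```python
# Per-image, per-category accuracy, averaged over images and then over
# categories to yield the mAP used to evaluate the basic recognition model.
from collections import defaultdict

def image_accuracy_per_class(detections, annotations, iou_thresh=0.5):
    # Frames are (cx, cy, w, h, ..., category): coordinates first, category
    # last, so `iou` applies to both detection and annotation frames.
    correct, total = defaultdict(int), defaultdict(int)
    for ann in annotations:
        total[ann[-1]] += 1
        if any(d[-1] == ann[-1] and iou(d, ann) > iou_thresh
               for d in detections):
            correct[ann[-1]] += 1
    return {c: correct[c] / total[c] for c in total}

def mean_average_precision(per_image_results):
    # per_image_results: one dict per test image, as produced above.
    sums, counts = defaultdict(float), defaultdict(int)
    for result in per_image_results:
        for cls, acc in result.items():
            sums[cls] += acc
            counts[cls] += 1
    per_class = [sums[c] / counts[c] for c in sums]  # per-category precision
    return sum(per_class) / len(per_class)           # mAP over all categories
```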
Step 204: and judging whether the detection precision of the basic identification model reaches a reference precision value.
The reference precision value may be 0.8, 0.9, or other preset values. The server may determine whether the recognition accuracy of the basic recognition model is less than the reference accuracy value, and if the recognition accuracy is less than the reference accuracy value, it may be determined that the reference accuracy value is not reached, and at this time, step 206 may be performed. If the recognition accuracy is not less than the reference accuracy value, it may be determined that the recognition accuracy reaches the reference accuracy value, and step 205 may be performed.
Step 205: and if the detection precision of the basic recognition model reaches the reference precision value, determining the basic recognition model as a commodity recognition model.
Step 206: and if the detection precision of the basic recognition model does not reach the reference precision value, updating the multiple sample images, and returning to the step 201.
If the detection accuracy of the basic recognition model does not reach the reference accuracy value, the server may update the plurality of sample images, for example, by updating the training data set. The server may then return to step 201 to obtain a new plurality of sample images, and train and test the initial network again through steps 202 to 204 according to the newly obtained sample images, until the detection accuracy of the trained model reaches the reference accuracy value, whereupon that model is used as the commodity identification model.
Optionally, in some possible scenarios, the server may train the initial network through multiple sets of sample images by referring to the above method, obtain multiple basic recognition models, and determine the detection accuracy of each basic recognition model through the test data set. Then, the basic recognition model, which has the detection accuracy higher than the reference accuracy value and is the highest among the plurality of basic recognition models, may be used as the product recognition model.
After the commodity identification model is trained, the commodity identification model can be deployed in the identification device of the automatic sales counter shown in the aforementioned fig. 1. In this way, the identification device can identify the commodities purchased by the user through the commodity identification model.
Fig. 4 is a flowchart of a method for identifying a product according to an embodiment of the present application. As shown in fig. 4, the method may be applied to the identification apparatus of the vending cabinet shown in fig. 1, and the method may include the steps of:
step 401: a plurality of images collected in the process of purchasing commodities by a user are acquired.
In the embodiment of the application, when the door of the vending cabinet is opened, the image acquisition apparatus installed on the cabinet starts acquiring images, and it stops when the door is closed. Because the door is open exactly while the user purchases commodities, the images acquired during this period are images of the purchase process. The image acquisition apparatus may send the collected images to the identification device in a batch after the cabinet door is closed, or it may send each image to the identification device in real time as it is acquired. Accordingly, the identification device receives the plurality of images captured by the image acquisition apparatus.
Alternatively, the image capturing device may capture video data. In this case, the identification device may receive the video data acquired by the image acquisition apparatus, acquire a plurality of video frames included in the video data, and treat the plurality of video frames as a plurality of images.
It should be noted that, if the image capturing device captures images, each image may have a time stamp thereon, and the time stamp may be used to indicate the capturing time of the corresponding image. If the image capturing device captures video data, each frame of image included in the video data may have a time stamp thereon, and the time stamp may also indicate the capturing time of the corresponding image.
Alternatively, if there are a plurality of image capturing devices installed on the vending cabinet, the identification device may receive the image or video data captured by each image capturing device, and obtain an image set according to the image or video data captured by each image capturing device, where the image set corresponding to each image capturing device may include a plurality of images captured by the corresponding image capturing device. In this case, for each image set, the identification device may employ the following steps 402 and 404 to process the images included in the image set.
Step 402: and identifying each image in the plurality of images through the commodity identification model to obtain the detection result of each image.
The commodity identification model may be obtained by training through the training method shown in fig. 2. Based on this, after the identification device acquires the plurality of images acquired by the automatic sales counter, each image in the plurality of images can be sequentially identified through the commodity identification model according to the sequence of the acquisition time of the plurality of images, so that the detection result of each image is obtained. The detection result of each image may include a position and a size of each of one or more candidate frames detected in the corresponding image, an object existence probability indicating whether each of the candidate frames contains an object, a plurality of candidate categories of the object in each of the candidate frames, and a probability corresponding to each of the candidate categories.
Step 403: a plurality of target images including the hand and the product held by the hand are determined from the plurality of images based on the detection result of each image.
After obtaining the detection result of each image, the recognition device may determine the category of the object in the corresponding candidate frame and the confidence corresponding to the corresponding candidate frame according to the object existence probability of each candidate frame in the detection result of each image and the probability corresponding to each candidate category in the multiple candidate categories of the object in the corresponding candidate frame; and determining a plurality of target images from the plurality of images according to the category of the object in each candidate frame in each image and the corresponding confidence degree of the corresponding candidate frame.
Taking any one of the plurality of images as an example, it may be referred to as a first image for convenience of description. For each candidate frame included in the detection result of the first image, the recognition apparatus may calculate a product between the object existence probability of the candidate frame and a probability corresponding to each candidate category of the object within the candidate frame, take a maximum value of the determined products as a confidence of the candidate frame, and take the candidate category for which the maximum value is calculated as a category of the object within the candidate frame.
After determining the confidence of each candidate frame and the class of the contained object, the recognition device may filter the candidate frames in the first image with reference to the method of filtering the detection frames in the first test image described above.
After filtering the candidate frames included in the detection result of the first image, the recognition apparatus may detect whether one or more first candidate frames in which the category of the included object is a hand and the corresponding confidence is greater than the reference confidence exist in the first image, and detect whether one or more second candidate frames in which the category of the included object is any one of the preset commodity categories exist in the first image. If both of the above-mentioned two candidate frames exist, the recognition device may obtain, as the first target frame, a candidate frame with the highest confidence from the one or more first candidate frames. Meanwhile, the recognition device may further acquire, as the second target frame, a second candidate frame having a confidence greater than the reference confidence from among the one or more second candidate frames.
After determining the first target frame and the second target frame, if there is only one second target frame, the identification device may determine the relative positional relationship between the first target frame and the second target frame, and if there are a plurality of second target frames, the identification device may determine the relative positional relationship between each of the second target frames and the first target frame.
The recognition device may determine whether an overlapping area exists between the first target frame and the second target frame according to the center position coordinates of the first target frame and the size of the first target frame, and the center position coordinates of the second target frame and the size of the second target frame. If there is an overlap region between the first target box and the second target box, the area of the overlap region may be determined. The relative position relationship between the first target frame and the second target frame is characterized by the area of the overlapping area. If there is no overlapping area between the first target frame and the second target frame, it may be determined that the relative positional relationship between the two is non-overlapping.
After determining the relative position relationship between the first target frame and the second target frame, if the relative position relationship between the first target frame and the second target frame indicates that there is an overlap between the two, it may be determined that the two target frames satisfy a second preset condition. If a first target frame and a second target frame satisfying a second preset condition exist in the first image, the first image may be determined as a target image.
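The second preset condition thus reduces to an overlap test between the hand frame and each commodity frame; a sketch, using the same center/size frame layout as above:

```python
# Second preset condition: a first image is a target image when the first
# target frame (hand) overlaps any second target frame (commodity).
def boxes_overlap(hand_box, commodity_box) -> bool:
    iw = (min(hand_box[0] + hand_box[2] / 2, commodity_box[0] + commodity_box[2] / 2)
          - max(hand_box[0] - hand_box[2] / 2, commodity_box[0] - commodity_box[2] / 2))
    ih = (min(hand_box[1] + hand_box[3] / 2, commodity_box[1] + commodity_box[3] / 2)
          - max(hand_box[1] - hand_box[3] / 2, commodity_box[1] - commodity_box[3] / 2))
    return iw > 0 and ih > 0

def is_target_image(first_target_frame, second_target_frames) -> bool:
    return any(boxes_overlap(first_target_frame, b) for b in second_target_frames)
```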
For each of the plurality of images, the recognition device may determine the plurality of target images from the plurality of images by processing with reference to the processing method for the first image.
Step 404: and determining a commodity identification result according to the position information of the hand and the commodity held by the hand in each target image and the acquisition time of each target image.
As can be seen from the process of determining the target images, each target image includes a first target frame and a second target frame having an overlapping area, where an object in the first target frame is a hand and an object in the second target frame is a commodity. Since the first target frame and the second target frame have an overlapping area, it can be seen that the product in the second target frame is the product held by the hand. On the basis, the identification device can determine a commodity identification result according to the position information of the hands and commodities held by the hands in the target images and the acquisition time of each target image.
For example, the recognition device may sort the plurality of target images in chronological order of acquisition. Then, from the position of the first target frame (that is, the position of the hand) and the position of the second target frame (that is, the position of the held commodity) in each target image, the motion track of the hand and the held commodity across the ordered target images can be determined. If the motion track moves gradually away from the preset reference line in the image along a first direction and does not turn back, the category of the object in the second target frame can be used as the final commodity identification result.
Optionally, if the motion track first moves away from the preset reference line along the first direction, then turns back toward the reference line along a second direction, crosses it, and continues to move away from it along the second direction, this indicates that the user took a commodity and then put it back on the shelf. In this case, the objects in the second target frames that make up this motion track are not commodities taken by the user, and those commodities can be excluded.
The preset reference line is formed by the image positions of the hand, as captured by the image acquisition device, when the user's hand is located at different points along the edge of the cabinet door. When the hand appears on one side of the reference line in the image, the hand is inside the cabinet door in the real environment; when it appears on the other side, the hand is outside the cabinet door.
For example, referring to fig. 5, the preset reference line is L1. When the user's hand is above the preset reference line, the hand is already outside the cabinet door; when the hand is below the line, it is still inside the cabinet door. When the hand and the commodity it holds gradually move away from the preset reference line in the upward direction, it can be determined that the user has taken the commodity; otherwise, it can be determined that the user has put the commodity back. Therefore, the commodity category can be determined from the object category in the second target frames of a motion trajectory that moves away from the reference line without folding back.
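A minimal sketch of this trajectory judgment, assuming (as in fig. 5) a horizontal reference line in image coordinates where smaller y means "outside the cabinet door"; the data layout and labels are assumptions for illustration:

```python
def classify_trajectory(frames, line_y):
    """frames: list of (acquisition_time, cy) pairs, where cy is the vertical
    center of the second target frame (the held commodity) in each target image.
    Returns "take", "put_back", or None if neither pattern is matched."""
    if not frames:
        return None
    ys = [cy for _, cy in sorted(frames)]        # order by acquisition time
    went_outside = any(y < line_y for y in ys)   # crossed above the reference line
    ends_inside = ys[-1] > line_y                # final position back below the line
    if went_outside and not ends_inside:
        return "take"       # moved away from the line and never folded back
    if went_outside and ends_inside:
        return "put_back"   # folded back across the line: exclude these commodities
    return None
```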
In the embodiment of the application, each of the plurality of images can be identified by the commodity identification model. Since the model is trained on sample images that include hands and commodities, it can detect the hands and commodities contained in the images. From the detection results, the target images that include a hand and the commodity held by the hand can be determined. The commodity identification result is then determined according to the position information of the hand and the held commodity in each target image and the acquisition time of each target image. In this way, other interference information contained in the images is filtered out, and only the category of the commodity held by the hand is considered, which reduces misidentification of irrelevant commodities and improves identification accuracy.
Next, the commodity identification device provided in an embodiment of the present application is described.
Referring to fig. 6, an embodiment of the present application provides a commodity identification apparatus 600, which may be applied to the automatic sales counter shown in fig. 1. The apparatus 600 includes:
the first acquisition module 601 is used for acquiring a plurality of images acquired in the process of purchasing commodities by a user;
the identification module 602 is configured to identify each image of the multiple images through a commodity identification model to obtain a detection result of each image, where the commodity identification model is obtained by training multiple sample images including a hand and a commodity;
a first determining module 603, configured to determine, according to a detection result of each image, a plurality of target images including a hand and a commodity held by the hand from the plurality of images;
the second determining module 604 is configured to determine a product identification result according to the position information of the hand and the product held by the hand in each target image and the acquisition time of each target image.
Optionally, the apparatus 600 further comprises:
the second acquisition module is used for acquiring a plurality of test images and the label information of each test image, and acquiring a plurality of sample images and the label information of each sample image, wherein the label information is used for indicating the positions of hands and the positions of commodities in the corresponding images;
the training module is used for training the initial network according to the multiple sample images and the labeling information of each sample image to obtain a basic recognition model;
the third determining module is used for determining the detection precision of the basic recognition model according to the multiple test images and the marking information of each test image;
and the triggering module is used for, if the detection precision does not reach the reference precision value, updating the plurality of sample images and triggering the second acquisition module to acquire the plurality of sample images and the labeling information of each sample image; when the detection precision reaches the reference precision value, the most recently obtained basic recognition model is taken as the commodity identification model.
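This module interaction amounts to a train-evaluate-augment loop. The following is a minimal sketch under assumed names; the callables for acquiring samples and tests, training, and measuring precision are illustrative stand-ins, not this application's concrete interfaces:

```python
def build_commodity_model(get_samples, get_tests, train, measure_precision,
                          reference_precision):
    """Repeat training until the basic recognition model reaches the
    reference precision value on the test set."""
    test_images, test_annotations = get_tests()
    while True:
        # Acquire (or update) the sample images and their annotation information.
        sample_images, sample_annotations = get_samples()
        model = train(sample_images, sample_annotations)
        precision = measure_precision(model, test_images, test_annotations)
        if precision >= reference_precision:
            # The most recently obtained basic recognition model becomes
            # the commodity identification model.
            return model
```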
Optionally, the initial network comprises a feature extraction network and a target detection network;
the training module is specifically configured to:
performing feature extraction on the first sample image through a feature extraction network to obtain a feature matrix of the first sample image, wherein the first sample image is any one of the plurality of sample images;
processing the characteristic matrix through a target detection network to obtain a sample detection result of the first sample image;
determining a loss function value according to the marking information of the first sample image and the sample detection result;
and if the loss function value does not meet the first preset condition, adjusting parameters in the feature extraction network and the target detection network according to the loss function value, updating the first sample image, and returning to the step of performing feature extraction on the first sample image through the feature extraction network until the loss function value meets the first preset condition, at which point the network after the last parameter adjustment is determined as the basic recognition model.
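A minimal PyTorch-style sketch of one iteration of this inner loop; the two networks, the optimizer, and the loss function are placeholders supplied by the caller, not the concrete networks of this application:

```python
def train_step(feature_net, detection_net, optimizer, loss_fn, image, annotation):
    features = feature_net(image)          # feature matrix of the sample image
    detection = detection_net(features)    # sample detection result
    loss = loss_fn(detection, annotation)  # compare detection with annotation info
    optimizer.zero_grad()
    loss.backward()                        # gradients flow into both networks
    optimizer.step()                       # adjust parameters according to the loss
    # The caller stops iterating once the loss meets the first preset condition.
    return loss.item()
```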
Optionally, the annotation information includes a position and a size of each annotation frame in the corresponding image and a category of an annotation object in each annotation frame, and the annotation object is a hand or a commodity; the sample detection result includes the position and size of each detection frame detected in the corresponding sample image, and the category of the detection object within each detection frame.
Optionally, the third determining module is specifically configured to:
identifying each test image through the basic identification model to obtain a detection result of the corresponding test image;
filtering the detection result of each test image to obtain the identification result of each test image;
and determining the detection precision of the basic recognition model according to the recognition result and the labeling information of each test image.
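One way to realize this precision measurement is to count recognitions that match an annotation in both category and location. The sketch below is an assumption about the matching rule, which the description does not fix; the recognition and matching callables are hypothetical:

```python
def detection_precision(recognize, test_images, test_annotations, matches):
    """recognize: callable returning the filtered recognition result of an image.
    matches: callable deciding whether a recognized frame matches an annotation
    frame (e.g. same category and sufficient box overlap)."""
    correct = total = 0
    for image, annotations in zip(test_images, test_annotations):
        for rec in recognize(image):
            total += 1
            if any(matches(rec, ann) for ann in annotations):
                correct += 1
    # Fraction of recognized frames that match some annotation frame.
    return correct / max(total, 1)
```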
Optionally, the detection result of each of the plurality of images includes a position and a size of each of one or more candidate frames in the corresponding image, an object existence probability indicating whether each of the candidate frames contains the object, a plurality of candidate categories of the object in each of the candidate frames, and a probability corresponding to each of the plurality of candidate categories;
the first determining module 603 is specifically configured to:
determining the object category in the corresponding candidate frame and the corresponding confidence of the corresponding candidate frame according to the object existence probability of each candidate frame in the detection result of each image and the probability corresponding to each candidate category in a plurality of candidate categories of the object in the corresponding candidate frame;
filtering the detection result of each image according to the category of the object in each candidate frame in each image, the corresponding confidence coefficient of each candidate frame and the position and the size of each candidate frame to obtain the identification result of each image;
and determining a plurality of target images from the plurality of images according to the recognition result of each image.
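For illustration, the confidence of a candidate frame can be taken as its object existence probability multiplied by its best candidate-category probability, a reading consistent with the description above, though the exact combination rule is not spelled out here. The data layout is an assumption:

```python
def score_candidates(candidates):
    """candidates: dicts with keys "box" (cx, cy, w, h), "objectness",
    and "class_probs" (mapping candidate category -> probability)."""
    scored = []
    for cand in candidates:
        # Best candidate category and its probability for this frame.
        category, class_prob = max(cand["class_probs"].items(),
                                   key=lambda kv: kv[1])
        scored.append({
            "box": cand["box"],
            "category": category,
            "confidence": cand["objectness"] * class_prob,
        })
    return scored
```

The subsequent filtering by category, confidence, position, and size could then be, for example, a per-category non-maximum suppression; NMS is a common choice but an assumption here.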
Optionally, the first determining module 603 is specifically configured to:
if one or more first candidate frames exist in the first image, wherein the category of the contained object is a hand and the corresponding confidence coefficient is greater than the reference confidence coefficient, acquiring the candidate frame with the maximum confidence coefficient from the one or more first candidate frames as a first target frame;
if one or more second candidate frames exist in the first image whose contained object category is any category among the preset commodity categories, acquiring, from the one or more second candidate frames, the second target frames whose confidence is greater than the reference confidence;
determining the relative position relation between each second target frame and the first target frame according to the position of each second target frame and the position of the first target frame;
and if the relative position relation between any second target frame and the first target frame meets a second preset condition, determining the first image as the target image.
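Putting these conditions together, a sketch of the per-image target test, reusing the overlap_area helper sketched earlier; names and thresholds are illustrative assumptions:

```python
def is_target_image(recognitions, commodity_categories, ref_confidence):
    """recognitions: scored candidate frames of one image (see score_candidates)."""
    hands = [r for r in recognitions
             if r["category"] == "hand" and r["confidence"] > ref_confidence]
    if not hands:
        return False
    # First target frame: the highest-confidence hand frame.
    first_target = max(hands, key=lambda r: r["confidence"])
    second_targets = [r for r in recognitions
                      if r["category"] in commodity_categories
                      and r["confidence"] > ref_confidence]
    # Second preset condition: some commodity frame overlaps the hand frame.
    return any(overlap_area(first_target["box"], t["box"]) > 0
               for t in second_targets)
```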
In summary, in the embodiment of the present application, each of the plurality of images can be identified by the commodity identification model. Since the model is trained on sample images that include hands and commodities, it can detect the hands and commodities contained in the images. From the detection results, the target images that include a hand and the commodity held by the hand can be determined. The commodity identification result is then determined according to the position information of the hand and the held commodity in each target image and the acquisition time of each target image. In this way, other interference information contained in the images is filtered out, and only the category of the commodity held by the hand is considered, which reduces misidentification of irrelevant commodities and improves identification accuracy.
It should be noted that the division of the commodity identification apparatus into the above functional modules is merely illustrative. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the commodity identification device and the commodity identification method provided by the above embodiments belong to the same concept; the specific implementation process is detailed in the method embodiment and is not repeated here.
Fig. 7 is a schematic structural diagram of an identification device 700 for performing commodity identification according to an embodiment of the present application. The functions of the identification device in the embodiment shown in fig. 4 can be realized by the identification device shown in fig. 7. Specifically, the identification device 700 includes the following components:
the processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 702 may include one or more computer-readable storage media, which may be non-transitory. Memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 702 is used to store at least one instruction for execution by processor 701 to implement the article identification method provided by the method embodiments herein.
In some embodiments, the recognition device 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 704, a positioning component 705, and a power supply 706.
The peripheral interface 703 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 701 and the memory 702. In some embodiments, processor 701, memory 702, and peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 704 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 704 communicates with communication networks and other communication devices via electromagnetic signals: it converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 704 may communicate with other identification devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the World Wide Web, metropolitan area networks, intranets, mobile communication networks of each generation (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.
The positioning component 705 is used to locate the current geographic location of the identification device 700 to implement navigation or LBS (Location Based Service). The positioning component 705 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
The power supply 706 is used to power the various components in the identification device 700. The power source 706 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power source 706 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
It should be understood that the configuration shown in FIG. 7 above is not limiting of the identification device 700 and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components may be used.
In addition, an embodiment of the present application provides an identification device including a processor and a memory for storing processor-executable instructions, where the processor is configured to perform the commodity identification method shown in fig. 4. An embodiment of the present application further provides a computer-readable storage medium having a computer program stored therein, which, when executed by a processor, implements the commodity identification method shown in fig. 4.
The embodiment of the present application further provides a computer program product containing instructions which, when run on a computer, causes the computer to perform the commodity identification method provided in the embodiment shown in fig. 4 or the method for training a commodity identification model provided in the embodiment shown in fig. 3.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for identifying an article, the method comprising:
acquiring a plurality of images acquired in the process of purchasing commodities by a user;
identifying each image in the multiple images through a commodity identification model to obtain a detection result of each image, wherein the commodity identification model is obtained by training multiple sample images including hands and commodities;
determining a plurality of target images including the hand and the commodity held by the hand from the plurality of images according to the detection result of each image;
and determining a commodity identification result according to the position information of the hand and the commodity held by the hand in each target image and the acquisition time of each target image.
2. The method of claim 1, wherein, before identifying each of the plurality of images through the commodity identification model, the method further comprises:
acquiring a plurality of test images and the label information of each test image, and acquiring the plurality of sample images and the label information of each sample image, wherein the label information is used for indicating the positions of hands and the positions of commodities in the corresponding images;
training the initial network according to the multiple sample images and the labeling information of each sample image to obtain a basic recognition model;
determining the detection precision of the basic recognition model according to the multiple test images and the labeling information of each test image;
and if the detection precision does not reach the reference precision value, updating the plurality of sample images and returning to the step of acquiring the plurality of sample images and the label information of each sample image; when the detection precision reaches the reference precision value, taking the most recently obtained basic recognition model as the commodity identification model.
3. The method of claim 2, wherein the initial network comprises a feature extraction network and a target detection network;
the training the initial network according to the labeling information of the multiple sample images and each sample image to obtain a basic recognition model comprises the following steps:
performing feature extraction on a first sample image through the feature extraction network to obtain a feature matrix of the first sample image, wherein the first sample image is any one sample image in the plurality of sample images;
processing the characteristic matrix through the target detection network to obtain a sample detection result of the first sample image;
determining a loss function value according to the marking information of the first sample image and the sample detection result;
and if the loss function value does not meet a first preset condition, adjusting parameters in the feature extraction network and the target detection network according to the loss function value, updating the first sample image, and returning to the step of performing feature extraction on the first sample image through the feature extraction network until the loss function value meets the first preset condition, at which point the network after the last parameter adjustment is determined as the basic recognition model.
4. The method of claim 3,
the annotation information comprises the position and the size of each annotation frame in the corresponding image and the type of an annotation object in each annotation frame, and the annotation object is a hand or a commodity;
the sample detection result includes the position and size of each detection frame detected in the corresponding sample image, and the category of the detection object within each detection frame.
5. The method according to claim 2, wherein the determining the detection accuracy of the basic recognition model according to the plurality of test images and the label information of each test image comprises:
identifying each test image through the basic identification model to obtain a detection result of the corresponding test image;
filtering the detection result of each test image to obtain the identification result of each test image;
and determining the detection precision of the basic recognition model according to the recognition result and the labeling information of each test image.
6. The method according to any one of claims 1-5, wherein the detection result of each of the plurality of images comprises a position and a size of each of one or more candidate frames in the corresponding image, an object existence probability indicating whether each of the candidate frames contains the object, a plurality of candidate categories of the object in each of the candidate frames, and a corresponding probability of each of the candidate categories;
the determining a plurality of target images including a hand and a commodity from the plurality of images according to a detection result of each image includes:
determining the object category in the corresponding candidate frame and the corresponding confidence of the corresponding candidate frame according to the object existence probability of each candidate frame in the detection result of each image and the probability corresponding to each candidate category in a plurality of candidate categories of the object in the corresponding candidate frame;
filtering the detection result of each image according to the category of the object in each candidate frame in each image, the corresponding confidence coefficient of each candidate frame and the position and the size of each candidate frame to obtain the identification result of each image;
and determining the plurality of target images from the plurality of images according to the recognition result of each image.
7. The method according to claim 6, wherein the determining the plurality of target images from the plurality of images according to the recognition result of each image comprises:
if one or more first candidate frames exist in the first image, wherein the category of the contained object is a hand and the corresponding confidence coefficient is greater than the reference confidence coefficient, acquiring the candidate frame with the maximum confidence coefficient from the one or more first candidate frames as a first target frame;
if one or more second candidate frames exist in the first image whose contained object category is any category among the preset commodity categories, acquiring, from the one or more second candidate frames, the second target frames whose confidence is greater than the reference confidence;
determining the relative position relation between each second target frame and the first target frame according to the position of each second target frame and the position of the first target frame;
and if the relative position relation between any second target frame and the first target frame meets a second preset condition, determining the first image as the target image.
8. An article identification device, the device comprising:
the system comprises a first acquisition module, a second acquisition module and a display module, wherein the first acquisition module is used for acquiring a plurality of images acquired in the process of purchasing commodities by a user;
the identification module is used for identifying each image in the images through a commodity identification model to obtain a detection result of each image, and the commodity identification model is obtained by training a plurality of sample images including hands and commodities;
the first determining module is used for determining a plurality of target images comprising the hand and the commodity held by the hand from the plurality of images according to the detection result of each image;
and the second determining module is used for determining a commodity identification result according to the position information of the hand and the commodity held by the hand in each target image and the acquisition time of each target image.
9. The apparatus of claim 8, further comprising:
the second acquisition module is used for acquiring a plurality of test images and the label information of each test image, and acquiring the plurality of sample images and the label information of each sample image, wherein the label information is used for indicating the positions of hands and the positions of commodities in the corresponding images;
the training module is used for training the initial network according to the multiple sample images and the labeling information of each sample image to obtain a basic recognition model;
the third determining module is used for determining the detection precision of the basic recognition model according to the multiple test images and the labeling information of each test image;
and the triggering module is used for, if the detection precision does not reach the reference precision value, updating the plurality of sample images and triggering the second acquisition module to acquire the plurality of sample images and the label information of each sample image; when the detection precision reaches the reference precision value, taking the most recently obtained basic recognition model as the commodity identification model.
10. The apparatus of claim 9, wherein the initial network comprises a feature extraction network and a target detection network;
the training module is specifically configured to:
performing feature extraction on a first sample image through the feature extraction network to obtain a feature matrix of the first sample image, wherein the first sample image is any one sample image in the plurality of sample images;
processing the characteristic matrix through the target detection network to obtain a sample detection result of the first sample image;
determining a loss function value according to the marking information of the first sample image and the sample detection result;
and if the loss function value does not meet a first preset condition, adjusting parameters in the feature extraction network and the target detection network according to the loss function value, updating the first sample image, and returning to the step of performing feature extraction on the first sample image through the feature extraction network until the loss function value meets the first preset condition, at which point the network after the last parameter adjustment is determined as the basic recognition model.
CN201911394895.0A 2019-12-30 2019-12-30 Commodity identification method and device Pending CN111079699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911394895.0A CN111079699A (en) 2019-12-30 2019-12-30 Commodity identification method and device


Publications (1)

Publication Number Publication Date
CN111079699A true CN111079699A (en) 2020-04-28

Family

ID=70319581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911394895.0A Pending CN111079699A (en) 2019-12-30 2019-12-30 Commodity identification method and device

Country Status (1)

Country Link
CN (1) CN111079699A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018108129A1 (en) * 2016-12-16 2018-06-21 北京市商汤科技开发有限公司 Method and apparatus for use in identifying object type, and electronic device
CN208506804U (en) * 2017-12-25 2019-02-15 图灵通诺(北京)科技有限公司 Checkout apparatus
CN108460933A (en) * 2018-02-01 2018-08-28 王曼卿 A kind of management system and method based on image procossing
CN109522967A (en) * 2018-11-28 2019-03-26 广州逗号智能零售有限公司 A kind of commodity attribute recognition methods, device, equipment and storage medium
CN109635690A (en) * 2018-11-30 2019-04-16 任飞翔 The commodity recognition detection method and device of view-based access control model
CN110033027A (en) * 2019-03-15 2019-07-19 深兰科技(上海)有限公司 A kind of item identification method, device, terminal and readable storage medium storing program for executing
CN110188724A (en) * 2019-06-05 2019-08-30 中冶赛迪重庆信息技术有限公司 The method and system of safety cap positioning and color identification based on deep learning

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626150A (en) * 2020-05-11 2020-09-04 广东顺德工业设计研究院(广东顺德创新设计研究院) Commodity identification method
CN111626150B (en) * 2020-05-11 2023-08-18 广东顺德工业设计研究院(广东顺德创新设计研究院) Commodity identification method
CN111598091A (en) * 2020-05-20 2020-08-28 北京字节跳动网络技术有限公司 Image recognition method and device, electronic equipment and computer readable storage medium
CN111797896A (en) * 2020-06-01 2020-10-20 锐捷网络股份有限公司 Commodity identification method and device based on intelligent baking
CN111723777A (en) * 2020-07-07 2020-09-29 广州织点智能科技有限公司 Method and device for judging commodity taking and placing process, intelligent container and readable storage medium
CN111666927A (en) * 2020-07-08 2020-09-15 广州织点智能科技有限公司 Commodity identification method and device, intelligent container and readable storage medium
WO2022022292A1 (en) * 2020-07-29 2022-02-03 华为技术有限公司 Method and device for recognizing handheld object
EP4181015A4 (en) * 2020-07-29 2023-11-22 Huawei Technologies Co., Ltd. Method and device for recognizing handheld object
CN112016398A (en) * 2020-07-29 2020-12-01 华为技术有限公司 Handheld object identification method and device
CN112308011A (en) * 2020-11-12 2021-02-02 湖北九感科技有限公司 Multi-feature combined target detection method and device
CN112308011B (en) * 2020-11-12 2024-03-19 湖北九感科技有限公司 Multi-feature combined target detection method and device
CN112364925B (en) * 2020-11-16 2021-06-04 哈尔滨市科佳通用机电股份有限公司 Deep learning-based rolling bearing oil shedding fault identification method
CN112364925A (en) * 2020-11-16 2021-02-12 哈尔滨市科佳通用机电股份有限公司 Deep learning-based rolling bearing oil shedding fault identification method
CN112348112A (en) * 2020-11-24 2021-02-09 深圳市优必选科技股份有限公司 Training method and device for image recognition model and terminal equipment
CN112348112B (en) * 2020-11-24 2023-12-15 深圳市优必选科技股份有限公司 Training method and training device for image recognition model and terminal equipment
CN112365324A (en) * 2020-12-02 2021-02-12 杭州微洱网络科技有限公司 Commodity picture detection method suitable for E-commerce platform
CN112712076A (en) * 2020-12-29 2021-04-27 中信重工开诚智能装备有限公司 Visual positioning device and method based on label-free positioning
CN113449606A (en) * 2021-06-04 2021-09-28 南京苏宁软件技术有限公司 Target object identification method and device, computer equipment and storage medium
CN113449606B (en) * 2021-06-04 2022-12-16 南京苏宁软件技术有限公司 Target object identification method and device, computer equipment and storage medium
CN113627512A (en) * 2021-08-05 2021-11-09 上海购吖科技有限公司 Picture identification method and device
CN113963197A (en) * 2021-09-29 2022-01-21 北京百度网讯科技有限公司 Image recognition method and device, electronic equipment and readable storage medium
CN117422937A (en) * 2023-12-18 2024-01-19 成都阿加犀智能科技有限公司 Intelligent shopping cart state identification method, device, equipment and storage medium
CN117422937B (en) * 2023-12-18 2024-03-15 成都阿加犀智能科技有限公司 Intelligent shopping cart state identification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111079699A (en) Commodity identification method and device
CN109508688B (en) Skeleton-based behavior detection method, terminal equipment and computer storage medium
CN109165645B (en) Image processing method and device and related equipment
CN103514432B (en) Face feature extraction method, equipment and computer program product
CN110378420A (en) A kind of image detecting method, device and computer readable storage medium
CN108171207A (en) Face identification method and device based on video sequence
CN111415461A (en) Article identification method and system and electronic equipment
CN111259751A (en) Video-based human behavior recognition method, device, equipment and storage medium
CN112052815B (en) Behavior detection method and device and electronic equipment
CN106663196A (en) Computerized prominent person recognition in videos
CN106355367A (en) Warehouse monitoring management device
CN112614085A (en) Object detection method and device and terminal equipment
CN112508109B (en) Training method and device for image recognition model
CN111382808A (en) Vehicle detection processing method and device
CN111310531B (en) Image classification method, device, computer equipment and storage medium
CN108875500A (en) Pedestrian recognition methods, device, system and storage medium again
CN113470013A (en) Method and device for detecting moved article
CN111680670B (en) Cross-mode human head detection method and device
CN116152576B (en) Image processing method, device, equipment and storage medium
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
CN112037255A (en) Target tracking method and device
CN114360182A (en) Intelligent alarm method, device, equipment and storage medium
CN108875501A (en) Human body attribute recognition approach, device, system and storage medium
CN113269730B (en) Image processing method, image processing device, computer equipment and storage medium
CN114255321A (en) Method and device for collecting pet nose print, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211207

Address after: Room 084, No. 1-309, 3rd floor, commercial building, No. 9 Wangjing street, Chaoyang District, Beijing 100102

Applicant after: Beijing Daily Youxian Technology Co.,Ltd.

Address before: 100102 room 801, 08 / F, building 7, yard 34, Chuangyuan Road, Chaoyang District, Beijing

Applicant before: BEIJING MISSFRESH E-COMMERCE Co.,Ltd.

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200428