CN116468816B - Training method of image reconstruction model, commodity identification method, device and equipment


Info

Publication number
CN116468816B
CN116468816B (application CN202310342126.6A)
Authority
CN
China
Prior art keywords
image
sample image
random mask
reconstruction model
detection frames
Prior art date
Legal status
Active
Application number
CN202310342126.6A
Other languages
Chinese (zh)
Other versions
CN116468816A
Inventor
万星宇 (Wan Xingyu)
倪子涵 (Ni Zihan)
章成全 (Zhang Chengquan)
姚锟 (Yao Kun)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310342126.6A
Publication of CN116468816A
Application granted
Publication of CN116468816B
Status: Active


Classifications

    • G — PHYSICS; G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T 11/00 — 2D [Two Dimensional] image generation
    • G06N 3/02 — Neural networks; G06N 3/08 — Learning methods
    • G06V 10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/40 — Extraction of image or video features
    • G06V 10/7715 — Feature extraction, e.g. by transforming the feature space
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 — Recognition or understanding using neural networks


Abstract

The disclosure provides a training method for an image reconstruction model, together with a commodity identification method, apparatus, and device. The disclosure relates to the field of artificial intelligence, in particular to computer vision, image processing, and deep learning, and can be applied to scenarios such as smart cities. The implementation scheme is as follows: randomly select K random mask areas of each sample image and obtain the K corresponding original images; apply random masking to the K random mask areas of each sample image to obtain a masked input image for each sample image; input the input image of each sample image and the detection frames of the K random mask areas into an image reconstruction model to obtain K predicted images of each sample image; and train the image reconstruction model based on the K predicted images and K original images of each sample image. This scheme strengthens the model's fine-grained feature expression capability and thereby improves the accuracy of commodity identification.

Description

Training method of image reconstruction model, commodity identification method, device and equipment
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to computer vision, image processing, and deep learning, and can be applied to scenarios such as smart cities.
Background
The commodity identification task refers to automatically detecting and identifying the categories of commodities in a retail scene. Commodity categories are generally identified by first locating the commodity regions and then classifying each region by its appearance features. However, image data in retail scenes are densely distributed, fine-grained, spread across many categories, and difficult to annotate. As a result, models pre-trained on generic object detection datasets struggle to distinguish the category attributes of different commodities in an image, leading to low commodity identification accuracy in practical application scenarios.
Disclosure of Invention
The disclosure provides a training method of an image reconstruction model, a commodity identification method, a commodity identification device and equipment.
According to a first aspect of the present disclosure, there is provided a training method of an image reconstruction model, including:
randomly selecting K random mask areas of each sample image, wherein K is a positive integer;
obtaining K original images corresponding to the K random mask areas of each sample image;
carrying out random mask processing on the K random mask areas of each sample image to obtain an input image of each sample image after the random mask processing;
inputting the input image of each sample image and the detection frames of the K random mask areas into an image reconstruction model to obtain K predicted images of each sample image; and
training the image reconstruction model based on the K predicted images and K original images of each sample image to obtain the trained image reconstruction model.
According to a second aspect of the present disclosure, there is provided a commodity identification method comprising:
acquiring an image to be identified;
inputting the image to be identified into a commodity identification model to obtain a commodity identification result of the image to be identified output by the commodity identification model;
wherein the commodity identification model is obtained by training with the image reconstruction model as an initial model, and the image reconstruction model is trained by the training method provided in the first aspect.
According to a third aspect of the present disclosure, there is provided a training apparatus of an image reconstruction model, comprising:
a random selection module for randomly selecting K random mask areas of each sample image, wherein K is a positive integer;
a first acquisition module for acquiring K original images corresponding to the K random mask areas of each sample image;
a processing module for carrying out random mask processing on the K random mask areas of each sample image to obtain an input image of each sample image after the random mask processing;
a first input module for inputting the input image of each sample image and the detection frames of the K random mask areas into an image reconstruction model to obtain K predicted images of each sample image; and
a training module for training the image reconstruction model based on the K predicted images and K original images of each sample image to obtain the trained image reconstruction model.
According to a fourth aspect of the present disclosure, there is provided a commodity identification apparatus comprising:
a fifth acquisition module for acquiring an image to be identified; and
a second input module for inputting the image to be identified into a commodity identification model to obtain a commodity identification result of the image to be identified output by the commodity identification model;
wherein the commodity identification model is obtained by training with the image reconstruction model as an initial model, and the image reconstruction model is trained by the training method provided in the first aspect.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of the image reconstruction model provided in the first aspect and/or the article identification method provided in the second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the training method of the image reconstruction model provided in the first aspect and/or the commodity identification method provided in the second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the training method of the image reconstruction model provided in the first aspect and/or the merchandise identification method provided in the second aspect.
According to the technical scheme, the regional random mask and regional image reconstruction are adopted, so that the model can be more focused on the feature expression of each commodity target region in the training stage, the fine-grained feature expression capability of the model is effectively enhanced, and the commodity identification accuracy of the model in an actual service application scene is further improved.
The foregoing summary is provided for convenience of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will become apparent from the drawings and the following detailed description.
Drawings
In the drawings, the same reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily drawn to scale. It is appreciated that these drawings depict only some embodiments according to the disclosure and are not therefore to be considered limiting of its scope.
FIG. 1 is a flow diagram of a training method of an image reconstruction model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a training framework for an image reconstruction model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of K random mask areas of a randomly selected sample image according to an embodiment of the present disclosure;
FIG. 4 is a network architecture schematic of an image reconstruction model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the acquisition of a target dimension feature map according to an embodiment of the present disclosure;
FIG. 6 is a flow diagram of a method of article identification according to an embodiment of the present disclosure;
FIG. 7 is a schematic illustration of an application of a merchandise identification model according to an embodiment of the disclosure;
FIG. 8 is a schematic structural view of a training device of an image reconstruction model according to an embodiment of the present disclosure;
Fig. 9 is a schematic structural view of a commodity identification apparatus according to an embodiment of the present disclosure;
FIG. 10 is a schematic view of a scenario of a training method of an image reconstruction model according to an embodiment of the present disclosure;
FIG. 11 is a schematic view of a scenario of a merchandise identification method according to an embodiment of the present disclosure;
Fig. 12 is a schematic structural diagram of an electronic device for implementing a training method and/or a commodity identification method for an image reconstruction model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terms "first", "second", "third", and the like in the description, the claims, and the drawings above are used to distinguish between similar objects and do not necessarily describe a particular order or sequence. Furthermore, the terms "comprise" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion: a process, method, system, article, or apparatus that comprises a series of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the related art, the pre-training of commodity identification mainly comprises the following two technical schemes:
(1): a supervised pre-training scheme is performed on the public dataset. This scheme commonly uses a manually annotated public dataset for multi-stage pre-training: the backbone network of the model is first trained using the public dataset of image recognition tasks (e.g., image net 1K), and then the entire network model is trained using the public dataset of object detection tasks (objects 365) or the COCO dataset (Microsoft Common Objects in Context, MS COCO).
(2) Self-supervised pre-training on public datasets. In the backbone training stage, this scheme applies random masking to the input image, performs an image reconstruction task through the encoder-decoder network of a Transformer model, and finally trains the entire network on a public dataset such as Objects365 or COCO.
Both schemes have clear drawbacks. Scheme (1) requires large-scale annotated data: a commodity identification dataset must simultaneously annotate the position and the specific category attribute of every commodity in each image, so manual annotation is prohibitively expensive, and since very few commodity identification datasets are public, data acquisition costs are high; moreover, the resulting fine-grained feature expression for commodity targets is not strong enough, and identification accuracy is unsatisfactory. Scheme (2), self-supervised pre-training on public datasets, only focuses the model on the feature expression of the more salient target areas in an image, so the backbone tends to extract generic image features; it has difficulty capturing the fine-grained feature differences between different commodity target areas and therefore cannot effectively distinguish visually similar commodity targets. In addition, encoder-decoder networks based on the Transformer structure incur high computational cost, which makes them difficult to apply in practical business scenarios.
To at least partially solve one or more of the above problems and other potential problems, the present disclosure proposes an image reconstruction scheme for commodity target areas. Using self-supervised learning, it fully mines the fine-grained characteristics inside commodity image regions during the pre-training stage and strengthens the model's feature expression for fine-grained commodity targets, thereby effectively improving the accuracy of commodity identification.
An embodiment of the present disclosure provides a training method for an image reconstruction model. Fig. 1 is a schematic flowchart of the training method according to an embodiment of the present disclosure; the method may be applied to a training apparatus for an image reconstruction model, which is located in an electronic device. The electronic device includes, but is not limited to, a stationary device and/or a mobile device. For example, stationary devices include, but are not limited to, servers, which may be cloud servers or ordinary servers; mobile devices include, but are not limited to, cell phones, tablet computers, and vehicle-mounted terminals. In some possible implementations, the training method of the image reconstruction model may also be implemented by a processor invoking computer-readable instructions stored in a memory. As shown in Fig. 1, the training method of the image reconstruction model includes:
S101: Randomly select K random mask areas of each sample image, where K is a positive integer.
S102: Obtain K original images corresponding to the K random mask areas of each sample image.
S103: Perform random mask processing on the K random mask areas of each sample image to obtain an input image of each sample image after the random mask processing.
S104: Input the input image of each sample image and the detection frames of the K random mask areas into an image reconstruction model to obtain K predicted images of each sample image.
S105: Train the image reconstruction model based on the K predicted images and K original images of each sample image to obtain the trained image reconstruction model.
Here, the image reconstruction model may be used as an initial model of the commodity identification model.
In the disclosed embodiments, each sample image corresponds to an input image that has undergone random masking over the K random mask areas on that sample image.
In the embodiments of the present disclosure, the input image is the entire image of each sample image. If there are N sample images, N input images are obtained, all produced by the random masking process.
In the embodiments of the present disclosure, random masking means that image blocks (patches) in the image are randomly masked and the masked areas are then predicted from the unmasked areas, so that the model learns the features of the image. On this basis, a random mask area is obtained by randomly selecting an area and then randomly masking within it.
In the disclosed embodiments, the detection frames of the K random mask regions of a sample image are produced by a trained class-agnostic generic object detector. A detection frame may be the ground-truth (GT) bounding box of a target, represented as a rectangle, and may carry both rectangle information and category information.
In an embodiment of the present disclosure, training the image reconstruction model based on the K predicted images and K original images of each sample image includes: constructing a loss function based on the K predicted images and K original images of each sample image, and updating the network parameters of the image reconstruction model based on the loss function. Illustratively, the loss function may be the mean square error (MSE), the mean absolute error (MAE), or the like.
In the embodiments of the present disclosure, in the forward-propagation stage of the image reconstruction model, the MSE loss between the predicted images and the original images is computed, and this loss is used as the optimization target to update the network parameters in the backward-propagation stage. Illustratively, training may employ AdamW (Adam with decoupled weight decay regularization) as the optimizer, which corrects the flawed interaction between weight decay and the Adam update. The initial learning rate of the image reconstruction model may be 0.0001, decayed by a factor of 0.2 at the 20th iteration round (epoch), with 300 rounds in total.
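As an illustration of this optimization setup, the following sketch pairs the MSE reconstruction loss with AdamW and the stated schedule (initial learning rate 0.0001, decayed by a factor of 0.2 at epoch 20, 300 epochs in total). The PyTorch-style API is an assumption for readability only (PP-YOLOE itself is typically implemented in PaddlePaddle), and `model` and `loader` are hypothetical stand-ins for the reconstruction network and the sample-image pipeline described above.

```python
import torch
import torch.nn.functional as F

def train_reconstruction(model, loader, epochs=300):
    """Train the image reconstruction model with MSE loss and AdamW.

    `model` and `loader` are hypothetical stand-ins: the model maps
    (masked input images, detection frames) to the K predicted region
    images per sample, and the loader yields the matching K original
    (unmasked) region crops as regression targets.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # Decay the learning rate by a factor of 0.2 at the 20th epoch.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.2)
    for _ in range(epochs):
        for inputs, boxes, region_targets in loader:
            preds = model(inputs, boxes)              # K predicted images per sample
            loss = F.mse_loss(preds, region_targets)  # MSE vs. the K original crops
            optimizer.zero_grad()
            loss.backward()                           # backward-propagation stage
            optimizer.step()                          # AdamW parameter update
        scheduler.step()
    return model
```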
Fig. 2 shows a schematic diagram of the training framework of the image reconstruction model. As shown in Fig. 2, a sample image is acquired and K random mask regions of it are randomly selected; K original images corresponding to the K random mask regions are obtained; random masking is applied to the K random mask regions to produce the masked input image; the input image and the detection frames of the K random mask regions are fed into the image reconstruction model to obtain K predicted images; and the model is trained based on the K predicted images and K original images to obtain the trained image reconstruction model.
The image reconstruction model adopts an encoder-decoder network structure based on a convolutional neural network. Specifically, the encoder adopts the network structure of PP-YOLOE (an evolved version of YOLO, an improved object detection algorithm), comprising a backbone network structure and a neck network structure: the backbone adopts a Cross-Stage-Partial Residual Network (CSPResNet), and the neck adopts a Path Aggregation Network (PAN). YOLO (You Only Look Once) is a classic one-stage object detection algorithm.
In an embodiment of the present disclosure, the input to the convolutional neural network (CNN) encoder is the input image of each sample image, and the output is a feature map of that input image. The input image is the image obtained by applying random masking to each sample image.
In the embodiments of the present disclosure, the input of the CNN decoder is the feature vectors of the K random mask areas of each sample image; the output of the CNN decoder is the K predicted images of each sample image.
In the disclosed embodiments, a Region of Interest Align (ROI Align) module is disposed between the encoder and the decoder. The ROI Align module extracts the feature vectors of the K random mask regions from the feature map of each sample image according to the positions of the detection frames of those K regions.
In the disclosed embodiments, the image reconstruction model serves as the initial model of the commodity identification model. The scheme can be offered as a complete self-supervised pre-training product for object detection tasks, or as a standalone commodity identification product for retail inspection scenarios.
In the technical solution of the embodiments of the present disclosure, K random mask areas of each sample image are randomly selected; K original images corresponding to those areas are obtained; random masking is applied to the K areas to produce the masked input image of each sample image; the input image and the detection frames of the K areas are fed into the image reconstruction model to obtain K predicted images per sample image; and the model is trained on the K predicted images and K original images, yielding an image reconstruction model that serves as the initial model of the commodity identification model. By combining region-level random masking with region-level image reconstruction, the model is made to focus on the feature expression of each commodity target region during training, which effectively strengthens the fine-grained feature expression of the model backbone and markedly improves commodity identification accuracy in actual business application scenarios.
In some embodiments, the training method of the image reconstruction model may further include: acquiring a plurality of images acquired in different retail scenarios; and screening the plurality of images to obtain a plurality of sample images, wherein each sample image comprises at least one commodity.
In an embodiment of the present disclosure, a plurality of images acquired under different retail scenarios are obtained. The data can be collected through an interface, or image data captured in each client's retail inspection scenario can be retrieved online. A retail inspection scenario may be, for example, the merchandise shelves of a convenience store or a supermarket.
Preferably, the sample dataset of the image reconstruction model uses unlabeled retail inspection business data. Building the dataset from images collected in different retail scenes avoids the high manual annotation cost of generic datasets, where the position and specific category attribute of every commodity target in each picture must be annotated simultaneously. Moreover, because the method does not depend on any labeled or unlabeled public dataset and trains directly on unlabeled business data, it sidesteps the scarcity of public commodity identification datasets and their high acquisition cost.
In the embodiments of the present disclosure, a trained class-agnostic generic object detector is run over large-scale unlabeled retail inspection business images to generate pseudo labels. Specifically, the generic object detector outputs the position of each detected commodity area in each image as GT; a region-level random masking strategy then randomly picks detection-frame areas from the GT and randomly masks the original image inside the picked areas at a fixed ratio, yielding the randomly masked image.
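A minimal sketch of this pseudo-labeling step is given below. The detector interface and the score threshold are assumptions for illustration; the text states only that a trained class-agnostic generic object detector supplies the commodity boxes.

```python
import torch

def generate_pseudo_labels(detector, images, score_thresh=0.5):
    """Run a trained class-agnostic detector over unlabeled retail images
    and keep the detected commodity boxes as pseudo ground truth (GT).

    `detector` and `score_thresh` are hypothetical; the patent does not
    specify the detector's API or a confidence threshold.
    """
    pseudo_gt = []
    with torch.no_grad():
        for img in images:
            boxes, scores = detector(img)      # assumed interface: boxes [N, 4], scores [N]
            keep = scores > score_thresh
            pseudo_gt.append(boxes[keep])      # (x1, y1, x2, y2) per commodity area
    return pseudo_gt
```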
With this solution, training can be performed directly on datasets without any manual annotation, which makes it an efficient training method for business deployment scenarios.
In some embodiments, randomly selecting the K random mask regions of each sample image may include: for each sample image, when the number N of detection frames is greater than or equal to K, randomly shuffling the order of the N detection frames, selecting the first K of the shuffled frames, and taking the areas where those K frames lie as the K random mask areas, where N is a positive integer; and, when N is smaller than K, increasing the number of detection frames to K by adding random disturbance to the N frames, and taking the areas where the resulting K frames lie as the K random mask areas.
In the embodiments of the present disclosure, the random disturbance includes, but is not limited to, randomly translating the position of each GT frame up, down, left, and right by a fixed proportion.
In the embodiments of the present disclosure, a region-level random masking scheme is designed for the training stage to randomly mask the image inside each randomly picked target area.
Fig. 3 shows a schematic diagram of randomly selecting the K random mask areas of a sample image. As shown in Fig. 3, suppose the sample image is detected to contain N detection frames (also referred to as GT target frames). S301: if N ≥ K, randomly shuffle the order of the N detection frames, select the first K of the shuffled frames, and take the areas where those K frames lie as the K random mask areas of the sample image. S302: if N < K, increase the number of detection frames to K by randomly translating the N frames up, down, left, and right by a fixed proportion, and take the areas where the K frames lie as the K random mask areas of the sample image.
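A minimal sketch of this selection strategy follows. The box format (xyxy tensor) and the jitter magnitude are assumptions; the text says only "random translation by a fixed proportion".

```python
import random
import torch

def select_mask_regions(boxes, k, jitter_ratio=0.1):
    """Pick K random mask regions from N detection boxes (Tensor[N, 4], xyxy).

    If N >= K: shuffle the boxes and take the first K.
    If N <  K: expand to K by randomly translating existing boxes
    up/down/left/right by a fixed proportion of their size
    (jitter_ratio is an assumed value).
    """
    n = boxes.shape[0]
    if n >= k:
        idx = torch.randperm(n)[:k]        # random shuffle, keep first K
        return boxes[idx]
    expanded = [boxes[i] for i in range(n)]
    while len(expanded) < k:
        b = boxes[random.randrange(n)].clone()
        w, h = b[2] - b[0], b[3] - b[1]
        dx = (random.random() * 2 - 1) * jitter_ratio * w  # horizontal shift
        dy = (random.random() * 2 - 1) * jitter_ratio * h  # vertical shift
        b[0] += dx; b[2] += dx
        b[1] += dy; b[3] += dy
        expanded.append(b)
    return torch.stack(expanded)
```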
According to the technical scheme, the number of the detection frames is increased by adding random disturbance to the detection frames, so that data expansion can be automatically performed, labor cost and time cost are reduced, and the efficiency of the training method of the image reconstruction model is improved.
In some embodiments, performing random masking on the K random mask regions of each sample image may include: dividing each of the K original images corresponding to the K random mask areas into M × M image blocks of equal size, and randomly selecting a fixed proportion of those blocks for masking, where M is a positive integer not less than 2.
Illustratively, for each sample image, the original image is first cropped according to the GT frame positions of the K random mask regions. Each cropped image is then divided into 16 × 16 image blocks of equal size along both width and height, and a fixed proportion of them (denoted the mask ratio) is randomly picked for masking. Masking sets the pixel values of all three channels (RGB) at the corresponding positions to 0. Illustratively, K is set to 5 and the mask ratio to 45%. Because masking simply zeroes the three channel values, it is independent of the distribution of any specific dataset, making the approach simple and effective.
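A sketch of this region masking is shown below, with values taken from the text (a 16 × 16 patch grid, mask ratio 45%, masked RGB pixels set to 0). Snapping the box so the patches divide evenly is an implementation assumption.

```python
import torch

def mask_region(image, box, grid=16, mask_ratio=0.45):
    """Randomly mask patches inside one detection-box region of `image`
    (Tensor[3, H, W], RGB). The region is divided into grid x grid equal
    patches; a `mask_ratio` fraction of them has all three channel values
    set to 0. Integer patch sizes are an implementation assumption.
    """
    x1, y1, x2, y2 = [int(v) for v in box]
    ph = max((y2 - y1) // grid, 1)            # patch height
    pw = max((x2 - x1) // grid, 1)            # patch width
    n_patches = grid * grid                   # 256 patches for a 16x16 grid
    n_mask = int(n_patches * mask_ratio)      # e.g. 45% of 256 patches
    chosen = torch.randperm(n_patches)[:n_mask]
    for p in chosen.tolist():
        r, c = divmod(p, grid)
        ys, xs = y1 + r * ph, x1 + c * pw
        image[:, ys:ys + ph, xs:xs + pw] = 0  # zero the R, G and B channels
    return image
```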
With this region-level random masking, the model is made to focus on the feature expression of each commodity target region during training, which strengthens the effect of the training method of the image reconstruction model and markedly improves the model's commodity identification accuracy in actual business application scenarios.
Fig. 4 shows a schematic diagram of the network architecture of the image reconstruction model. As shown in Fig. 4, the model adopts a convolutional neural network (CNN) based encoder-decoder architecture comprising a CNN encoder 401 and a CNN decoder 402. The CNN encoder 401 adopts the PP-YOLOE network structure, specifically a backbone network structure 403 and a neck network structure 404, where the backbone 403 adopts CSPResNet and the neck 404 adopts PAN.
In the embodiments of the present disclosure, the backbone network structure 403 extracts the feature map of the input image of each sample image. The neck network structure 404 fuses the feature maps extracted by the backbone's three convolution modules (C3, C4, C5) and outputs feature maps of three different pixel sizes (H × W) through P3, P4, and P5.
After the three feature maps extracted by C3, C4, and C5 pass through the PAN, three feature maps of different dimensions are obtained for each sample image. For example, P3 outputs a 240-dimensional feature map, P4 a 480-dimensional one, and P5 a 960-dimensional one; that is, the channel counts of the P3, P4, and P5 outputs are C = 240, C = 480, and C = 960 respectively, where C denotes the number of channels.
Compared with the Transformer network structures adopted by conventional self-supervised training schemes, the CNN structure in this scheme is lighter and requires less computation, making it better suited to actual business deployment.
In some embodiments, the training method of the image reconstruction model may further include: superimposing, along the channel dimension, the feature maps of different dimensions of each sample image output by the PAN to obtain one multi-dimensional feature map per sample image; and reducing the channel dimension of that multi-dimensional feature map to obtain one target-dimension feature map per sample image. The target dimension may be set or adjusted as needed, e.g., 128 or 256 dimensions.
In the embodiments of the present disclosure, the three feature maps of different dimensions output by the PAN for each sample image are concatenated along the channel dimension to obtain one multi-dimensional feature map per sample image, whose channel count is then reduced to the target dimension. For example, each multi-dimensional feature map is fed into a 1 × 1 convolution layer for dimension reduction, producing an output feature map with 128 channels.
Fig. 5 shows a schematic diagram of obtaining the target-dimension feature map. As shown in Fig. 5, the dimension reduction of each multi-dimensional feature map is performed with a fully connected (FC) layer. Illustratively, P3 outputs a 240-dimensional feature map, P4 a 480-dimensional one, and P5 a 960-dimensional one. Concatenating the three along the channel dimension yields a 1680-dimensional (240 + 480 + 960) feature map per sample image, which is fed into the FC layer; the FC layer outputs one 128-dimensional feature map per sample image.
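A sketch of this fusion and reduction follows. The text describes the reduction both as a 1 × 1 convolution and as an FC layer, which are equivalent per spatial position; resizing the three maps to a common resolution before concatenation is an assumption, since P3/P4/P5 have different spatial sizes and the text does not specify this step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseReduce(nn.Module):
    """Concatenate the PAN outputs P3/P4/P5 along channels and reduce
    240 + 480 + 960 = 1680 channels to 128 with a 1x1 convolution
    (equivalent, per spatial position, to a fully connected layer)."""
    def __init__(self, in_ch=240 + 480 + 960, out_ch=128):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, p3, p4, p5):
        # Assumed step: bring all maps to P3's resolution before concatenation.
        size = p3.shape[-2:]
        p4 = F.interpolate(p4, size=size, mode="nearest")
        p5 = F.interpolate(p5, size=size, mode="nearest")
        fused = torch.cat([p3, p4, p5], dim=1)   # [B, 1680, H, W]
        return self.reduce(fused)                # [B, 128, H, W]
```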
Reducing each multi-dimensional feature map to a single target-dimension feature map lowers the difficulty and complexity of feature recognition and thereby improves the accuracy of the commodity identification model.
In the embodiments of the present disclosure, an ROI Align module is disposed between the encoder and the decoder; it extracts the feature vectors of the K random mask regions from the feature map of each sample image according to the positions of the detection frames of those regions.
In the embodiments of the present disclosure, the input of the ROI Align module comprises the positions of the detection frames of the K random mask regions of each sample image and the feature map of the input image; its output comprises the feature vectors of the K random mask regions, which may specifically be 128-dimensional.
In the embodiments of the present disclosure, the ROI Align module also adjusts the extracted feature vectors spatially: the pixel size (H × W) of the 128-dimensional feature vectors of the K random mask areas is uniformly resized to 14 × 14.
In the embodiments of the present disclosure, the FC layer outputs the feature map of the input image of each sample image with 128 dimensions; the feature vectors output by the ROI Align module are likewise 128-dimensional, the module unifying only their spatial size.
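A sketch of this extraction using torchvision's ROI Align is shown below. The spatial scale (tied to an assumed stride between the input image and the fused feature map) is an assumption; boxes are given in input-image coordinates.

```python
import torch
from torchvision.ops import roi_align

def extract_region_features(feature_map, boxes, stride=8):
    """Extract 14x14 feature vectors for the K random mask regions.

    feature_map: [B, 128, H, W], the output of the FC/1x1-conv reduction.
    boxes: list (one entry per image) of Tensor[K, 4], (x1, y1, x2, y2)
           in input-image pixels.
    stride: assumed downsampling factor between input image and feature map.
    Returns: [B * K, 128, 14, 14].
    """
    return roi_align(
        feature_map,
        boxes,
        output_size=(14, 14),        # unified pixel size, per the text
        spatial_scale=1.0 / stride,  # map input-pixel boxes onto the feature grid
        aligned=True,
    )
```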
By placing the ROI Align module between the encoder and decoder and extracting the feature vectors of the K random mask areas from each sample image's feature map according to the detection-frame positions, the fine-grained feature expression capability of the image reconstruction model is enhanced, improving the accuracy of commodity identification.
In the disclosed embodiments, the decoder stacks multiple convolution layers and multiple deconvolution layers, downsampling the channel count while upsampling the spatial size of the feature image.
Illustratively, since the original image has C = 3, the CNN decoder must reduce C = 128 down to C = 3.
In the embodiments of the present disclosure, upsampling means that the 14 × 14 spatial size of the random-mask-region feature vectors output by the ROI Align layer is enlarged by a factor of 4, to 56 × 56. To make the region feature vectors match the pixels of the K original images corresponding to the K random mask regions, those original images all have a pixel size of 56 × 56; concretely, the 14 × 14 × 128 feature vector becomes 56 × 56 × 3, which is the size of the K original region images.
In the embodiments of the present disclosure, the input image has size H × W × C = 640 × 640 × 3, and the K original images corresponding to the randomly selected mask regions have size 56 × 56 × 3, where H represents the width of the image, W the height, and C the number of channels.
During region masking, the pixel values on the three channels of the K random mask areas (H × W × C = 56 × 56 × 3) of the original image are uniformly set to 0; the three channels are the "3" in "56 × 56 × 3", i.e., the R, G, and B channels.
In the embodiments of the present disclosure, the decoder comprises five convolution layers and three deconvolution layers. Among the convolution layers, the first and fourth have kernel size 1 and stride 1; the second, third, and fifth have kernel size 3 and stride 1. Among the deconvolution layers, the first has kernel size 4 and stride 1; the second and third have kernel size 4 and stride 2. The decoder outputs a 56 × 56 × 3 predicted image. In addition, each convolution layer is followed by a batch normalization (BN) layer and a rectified linear unit (ReLU), also called the ReLU activation function.
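The sketch below assembles a decoder from these stated ingredients. The text fixes only the kernel sizes, strides, and layer counts; the layer ordering, channel widths, and paddings here are assumptions, and the kernel-4/stride-1 deconvolution cannot exactly preserve 14 × 14 with symmetric padding, so the extra row/column is cropped (a workaround not in the text).

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, k):
    # Each convolution is followed by BatchNorm and ReLU, per the text.
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=k, stride=1, padding=k // 2),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class RegionDecoder(nn.Module):
    """Five convolutions (kernels 1, 3, 3, 1, 3) and three deconvolutions
    (kernel 4; strides 1, 2, 2), reducing channels 128 -> 3 while
    upsampling 14x14 -> 56x56."""
    def __init__(self):
        super().__init__()
        self.conv1 = conv_bn_relu(128, 64, k=1)
        self.conv2 = conv_bn_relu(64, 64, k=3)
        self.deconv1 = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=1, padding=1)  # 14 -> 15
        self.conv3 = conv_bn_relu(64, 32, k=3)
        self.deconv2 = nn.ConvTranspose2d(32, 32, kernel_size=4, stride=2, padding=1)  # 14 -> 28
        self.conv4 = conv_bn_relu(32, 16, k=1)
        self.deconv3 = nn.ConvTranspose2d(16, 16, kernel_size=4, stride=2, padding=1)  # 28 -> 56
        self.conv5 = nn.Conv2d(16, 3, kernel_size=3, padding=1)  # left linear (an assumption)

    def forward(self, x):                     # x: [K, 128, 14, 14]
        x = self.conv2(self.conv1(x))
        h, w = x.shape[-2:]
        x = self.deconv1(x)[..., :h, :w]      # crop back to 14x14 (workaround)
        x = self.conv3(x)
        x = self.deconv2(x)
        x = self.conv4(x)
        x = self.deconv3(x)
        return self.conv5(x)                  # [K, 3, 56, 56] predicted region images
```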
In the disclosed embodiments, the stride is the sampling interval of the convolution; it is set to reduce the number of parameters and the amount of computation.
In the technical solution of the embodiments of the present disclosure, the CNN structure adopted is lighter and computationally cheaper. Meanwhile, adding a BN layer and a ReLU activation function after each convolution layer speeds up network training and convergence while mitigating gradient explosion, gradient vanishing, and overfitting.
The embodiments of the present disclosure provide a commodity identification method, which can be applied to an electronic device. The electronic device includes, but is not limited to, a stationary device and/or a mobile device. For example, stationary devices include, but are not limited to, servers, which may be cloud servers or ordinary servers; mobile devices include, but are not limited to, cell phones, tablet computers, and scanning devices. In some possible implementations, the commodity identification method may also be implemented by a processor invoking computer-readable instructions stored in a memory. As shown in Fig. 6, the commodity identification method includes:
S601: Acquire an image to be identified.
S602: Input the image to be identified into the commodity identification model to obtain the commodity identification result of the image to be identified output by the commodity identification model.
The commodity identification model is obtained by training with the image reconstruction model as the initial model, and the image reconstruction model is trained according to the training method of the image reconstruction model described above.
In the embodiment of the disclosure, the image to be identified can be acquired from an image database, can be input by a user, and can be acquired on site through a camera. It should be noted that the present disclosure is not limited to the source of the image to be identified.
In the embodiments of the present disclosure, the commodity identification model may adopt the PP-YOLOE network structure. In practical application, the weights corresponding to the encoder of the trained image reconstruction model are loaded to obtain a commodity identification base model, which is then trained further to obtain the commodity identification model. The present disclosure does not limit the specific training procedure of the commodity identification model. For example, an image set formed from images collected in different retail scenes is used to train a commodity identification model with the PP-YOLOE structure into a base model, and the base model is then trained on a commodity image set to obtain the final commodity identification model. This removes the need for large-scale shelf image collection and annotation to produce training images, reducing the training cost of the commodity identification model, and the two-stage training improves recognition accuracy. Fig. 7 shows an application of the commodity identification model: the image to be identified shows two shelf rows of a warehouse; it is input into the commodity identification model, which outputs that the first shelf level holds brand-A mineral water and the second holds brand-B mineral water.
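A sketch of this weight transfer is given below. The checkpoint layout and key prefixes ("encoder.", "model") are illustrative assumptions; in practice the keys must match the detector's backbone/neck module names.

```python
import torch

def init_detector_from_pretrain(detector, reconstruction_ckpt_path):
    """Initialize a PP-YOLOE-style commodity detector from the encoder
    weights of the trained image reconstruction model, then fine-tune.

    The 'model' checkpoint key and 'encoder.' prefix are assumptions.
    """
    ckpt = torch.load(reconstruction_ckpt_path, map_location="cpu")
    encoder_state = {
        k.removeprefix("encoder."): v
        for k, v in ckpt["model"].items()
        if k.startswith("encoder.")           # backbone + neck weights only
    }
    # strict=False: detection heads stay randomly initialized and are
    # learned during the subsequent fine-tuning on commodity images.
    detector.load_state_dict(encoder_state, strict=False)
    return detector
```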
According to the technical scheme, the image reconstruction model with better fine granularity expression capability is used as an initial model to train to obtain the commodity identification model, and commodity identification is carried out on the image to be identified through the commodity identification model, so that commodity identification accuracy in an actual business application scene can be remarkably improved.
It should be understood that the schematic diagrams shown in fig. 2, 3,4, 5 and 7 are merely exemplary and not limiting, and that they are scalable, and that those skilled in the art may make various obvious changes and/or substitutions based on the examples of fig. 2, 3,4, 5 and 7, while still falling within the scope of the disclosed embodiments.
An embodiment of the present disclosure provides a training apparatus for an image reconstruction model, as shown in Fig. 8, including: a random selection module 801 for randomly selecting K random mask areas of each sample image, where K is a positive integer; a first acquisition module 802 for acquiring K original images corresponding to the K random mask areas of each sample image; a processing module 803 for performing random mask processing on the K random mask areas of each sample image to obtain an input image of each sample image after the random mask processing; a first input module 804 for inputting the input image of each sample image and the detection frames of the K random mask areas into an image reconstruction model to obtain K predicted images of each sample image; and a training module 805 for training the image reconstruction model based on the K predicted images and K original images of each sample image to obtain the trained image reconstruction model. Here, the image reconstruction model serves as the initial model of the commodity identification model.
In some embodiments, the training device of the image reconstruction model may further include: a second acquisition module 806 (not shown in fig. 8) for acquiring a plurality of images acquired under different retail scenarios; a screening module 807 (not shown in fig. 8) for screening the plurality of images to obtain a plurality of sample images, wherein each sample image includes at least one commodity.
In some embodiments, the random selection module 801 includes: a first selecting submodule for, when the number N of detection frames of a sample image is greater than or equal to K, randomly shuffling the order of the N detection frames, selecting the first K of the shuffled frames, and taking the areas where those K frames lie as the K random mask areas of the sample image, where N is a positive integer; and a second selecting submodule for, when N is smaller than K, increasing the number of detection frames to K by adding random disturbance to the N frames, and taking the areas where the K frames lie as the K random mask areas of the sample image.
In some embodiments, the processing module 803 includes a first processing submodule for dividing each of the K original images corresponding to the K random mask areas of each sample image into M × M image blocks of equal size and randomly selecting a fixed proportion of those blocks for masking, where M is a positive integer not less than 2.
In some embodiments, the image reconstruction model employs a convolutional-neural-network-based encoder-decoder architecture, where the encoder adopts the PP-YOLOE backbone and neck network structures: the backbone adopts CSPResNet and the neck adopts PAN.
In some embodiments, the training apparatus of the image reconstruction model further includes: a third acquisition module 808 (not shown in Fig. 8) for superimposing, along the channel dimension, the feature maps of different dimensions of each sample image output by the PAN to obtain a multi-dimensional feature map of each sample image; and a fourth acquisition module 809 (not shown in Fig. 8) for reducing the channel dimension of the multi-dimensional feature map of each sample image to obtain a target-dimension feature map of each sample image.
In some embodiments, an ROI Align module is disposed between the encoder and the decoder, the ROI Align module being configured to extract feature vectors of the K random mask regions from the feature map of each sample image according to the positions of the detection frames of the K random mask regions.
In some embodiments, the decoder is composed of stacked multi-layer convolution and multi-layer deconvolution structures, and performs channel-count downsampling and spatial upsampling of the feature image.
It should be understood by those skilled in the art that the functions of each processing module in the training apparatus for an image reconstruction model according to the embodiments of the present disclosure may be understood with reference to the foregoing description of the training method for an image reconstruction model, and each processing module in the training apparatus for an image reconstruction model according to the embodiments of the present disclosure may be implemented by an analog circuit that implements the functions described in the embodiments of the present disclosure, or may be implemented by running software that implements the functions described in the embodiments of the present disclosure on an electronic device.
The training device for the image reconstruction model can enhance the fine granularity feature expression capability of the model, so that the accuracy of commodity identification is improved.
An embodiment of the present disclosure provides a commodity identification apparatus, as shown in Fig. 9, including: a fifth acquisition module 901 for acquiring an image to be identified; and a second input module 902 for inputting the image to be identified into the commodity identification model to obtain the commodity identification result of the image to be identified output by the commodity identification model. The commodity identification model is obtained by training with the image reconstruction model as the initial model, and the image reconstruction model is trained by the training method described above.
It should be understood by those skilled in the art that the functions of each processing module in the article identifying apparatus according to the embodiments of the present disclosure may be understood by referring to the foregoing description of the article identifying method, and each processing module in the article identifying apparatus according to the embodiments of the present disclosure may be implemented by using an analog circuit that implements the functions described in the embodiments of the present disclosure, or may be implemented by running software that implements the functions described in the embodiments of the present disclosure on an electronic device.
The commodity identification device disclosed by the embodiment of the invention can improve the accuracy of commodity identification.
An embodiment of the present disclosure provides a scene schematic diagram of a training method of an image reconstruction model, as shown in fig. 10. As described above, the training method of the image reconstruction model provided by the embodiment of the present disclosure is applied to an electronic device. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses.
In particular, the electronic device may specifically perform the following operations:
randomly selecting K random mask areas of each sample image, wherein K is a positive integer;
obtaining K original images corresponding to the K random mask areas of each sample image;
carrying out random mask processing on the K random mask areas of each sample image to obtain an input image of each sample image after the random mask processing;
inputting the input image of each sample image and the detection frames of the K random mask areas into the image reconstruction model to obtain K predicted images of each sample image; and
training the image reconstruction model based on the K predicted images and K original images of each sample image to obtain the trained image reconstruction model, which serves as the initial model of the commodity identification model.
Wherein each sample image, K random mask areas of each sample image, and K original images corresponding to the K random mask areas of each sample image may be obtained from an image data source. The image data source may be various forms of data storage devices such as a laptop computer, desktop computer, workstation, personal digital assistant, server, blade server, mainframe computer, and other suitable computer. The image data source may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. Furthermore, the image data source and the user terminal may be the same device.
It should be understood that the scene diagram shown in fig. 10 is merely illustrative and not restrictive, and that various obvious changes and/or substitutions may be made by one skilled in the art based on the example of fig. 10, and the resulting technical solutions still fall within the scope of the disclosure of the embodiments of the present disclosure.
The embodiment of the disclosure also provides a scene schematic diagram of the commodity identification method, as shown in fig. 11. As described above, the commodity identification method provided by the embodiment of the present disclosure is applied to an electronic device. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses.
Specifically, the electronic device may perform the following operations:
acquiring an image to be identified;
inputting the image to be identified into a commodity identification model to obtain a commodity identification result of the image to be identified output by the commodity identification model;
wherein the commodity identification model is obtained by training an image reconstruction model serving as an initial model, and the image reconstruction model is obtained by the training method described above. A code sketch of this flow is given below.
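As a hedged illustration of this flow, the sketch below wraps a pre-trained encoder in a classification head and runs a single prediction. CommodityClassifier, the toy encoder, and the random input are hypothetical stand-ins; in the disclosed method the encoder weights would come from the trained image reconstruction model rather than being freshly initialized as here.

```python
# A hypothetical sketch of commodity identification; names are illustrative.
import torch
import torch.nn as nn


class CommodityClassifier(nn.Module):
    """Encoder (initialized from the reconstruction model) plus a class head."""

    def __init__(self, encoder: nn.Module, num_classes: int):
        super().__init__()
        self.encoder = encoder                 # pre-trained feature extractor
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.LazyLinear(num_classes))

    def forward(self, x):
        return self.head(self.encoder(x))


# Stand-in encoder; in practice, load the pre-trained reconstruction encoder.
encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
model = CommodityClassifier(encoder, num_classes=10).eval()
image = torch.rand(1, 3, 128, 128)             # stands in for the query image
with torch.no_grad():
    label = model(image).argmax(dim=1)         # predicted commodity class
print(int(label))
```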
Wherein the image to be identified may be obtained from an image data source. The image data source may be various forms of data storage devices such as a laptop computer, desktop computer, workstation, personal digital assistant, server, blade server, mainframe computer, and other suitable computers. The image data source may also be various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. Furthermore, the image data source and the user terminal may be the same device.
It should be understood that the scene diagram shown in fig. 11 is merely illustrative and not restrictive, and that various obvious changes and/or substitutions may be made by one skilled in the art based on the example of fig. 11, and the resulting technical solutions still fall within the scope of the disclosure of the embodiments of the present disclosure.
In the technical scheme of the present disclosure, the acquisition, storage, and application of the user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
Fig. 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the apparatus 1200 includes a computing unit 1201, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (Read Only Memory, ROM) 1202 or a computer program loaded from a storage unit 1208 into a random access memory (Random Access Memory, RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Various components in device 1200 are connected to I/O interface 1205, including: an input unit 1206 such as a keyboard, mouse, etc.; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208 such as a magnetic disk, an optical disk, or the like; and a communication unit 1209, such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1201 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), various specialized artificial intelligence (Artificial Intelligence, AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (Digital Signal Processor, DSP), and any suitable processors, controllers, microcontrollers, etc. The computing unit 1201 performs the respective methods and processes described above, for example, the training method of the image reconstruction model or the commodity identification method. For example, in some embodiments, the training method of the image reconstruction model or the commodity identification method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the training method of the image reconstruction model or the commodity identification method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the training method of the image reconstruction model or the commodity identification method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here can be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (Field Programmable Gate Array, FPGA), application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), application-specific standard products (Application Specific Standard Product, ASSP), systems on chips (System on Chip, SOC), complex programmable logic devices (Complex Programmable Logic Device, CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable image reconstruction model training apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram blocks to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (Compact Disk Read Only Memory, CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube (Cathode Ray Tube, CRT) or liquid crystal display (Liquid Crystal Display, LCD) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with the user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. that are within the principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (19)

1. A training method of an image reconstruction model, comprising:
randomly selecting K random mask areas of each sample image, wherein K is a positive integer;
obtaining K original images corresponding to the K random mask areas of each sample image;
carrying out random mask processing on the K random mask areas of each sample image to obtain an input image of each sample image after the random mask processing;
inputting the input image of each sample image and the detection frames of the K random mask areas into an image reconstruction model to obtain K predicted images of each sample image; and
training the image reconstruction model based on the K predicted images and the K original images of each sample image to obtain the image reconstruction model, wherein the image reconstruction model is used as an initial model of a commodity identification model;
wherein the image reconstruction model adopts an encoder-decoder network structure based on a convolutional neural network, a region of interest (ROI) alignment module is arranged between the encoder and the decoder, and the ROI alignment module is configured to extract feature vectors of the K random mask areas from feature maps of each sample image according to positions of the detection frames of the K random mask areas of each sample image.
2. The method of claim 1, further comprising:
Acquiring a plurality of images acquired in different retail scenarios;
and screening the plurality of images to obtain a plurality of sample images, wherein each sample image comprises at least one commodity.
3. The method of claim 1, wherein the randomly selecting K random mask regions for each sample image comprises:
for each sample image, under the condition that the number N of the detection frames of the sample image is greater than or equal to K, randomly scrambling the order of the N detection frames, selecting the first K detection frames from the N randomly scrambled detection frames, and taking the areas where the first K detection frames are located as the K random mask areas of the sample image, wherein N is a positive integer; and
under the condition that the number N of the detection frames of the sample image is smaller than K, increasing the number of the detection frames to K by adding random disturbance to the N detection frames, and taking the areas where the K detection frames are located as the K random mask areas of the sample image.
4. The method of claim 1, wherein the performing random masking processing on the K random mask areas of each sample image comprises:
dividing the K original images corresponding to the K random mask areas of each sample image into M×M image blocks of the same size, randomly selecting a portion of the image blocks from the M×M image blocks according to a fixed proportion, and performing mask processing on the selected image blocks, wherein M is a positive integer not less than 2.
5. The method of claim 1, wherein the encoder employs a backbone network structure and a neck network structure of a modified target detection algorithm, wherein the backbone network structure employs a cross-stage partial residual network, and the neck network structure employs a path aggregation network, PAN.
6. The method of claim 5, further comprising:
carrying out channel number superposition on the feature maps of each sample image in different dimensions output by the PAN to obtain a multi-dimensional feature map of each sample image; and
carrying out feature dimension reduction on the channel number of the multi-dimensional feature map of each sample image to obtain a target dimension feature map of each sample image.
7. The method of claim 5, wherein the decoder is composed of a stacked structure of multiple convolution layers and multiple deconvolution layers.
8. A method of article identification comprising:
acquiring an image to be identified;
inputting the image to be identified into a commodity identification model to obtain a commodity identification result of the image to be identified output by the commodity identification model;
wherein the commodity identification model is obtained by training an image reconstruction model serving as an initial model, and the image reconstruction model is trained based on the training method of any one of claims 1 to 7.
9. A training apparatus for an image reconstruction model, comprising:
a random selection module configured to randomly select K random mask areas of each sample image, wherein K is a positive integer;
a first acquisition module configured to acquire K original images corresponding to the K random mask areas of each sample image;
a processing module configured to carry out random mask processing on the K random mask areas of each sample image to obtain an input image of each sample image after the random mask processing;
a first input module configured to input the input image of each sample image and the detection frames of the K random mask areas into an image reconstruction model to obtain K predicted images of each sample image; and
a training module configured to train the image reconstruction model based on the K predicted images and the K original images of each sample image to obtain the image reconstruction model, wherein the image reconstruction model is used as an initial model of a commodity identification model;
wherein the image reconstruction model adopts an encoder-decoder network structure based on a convolutional neural network, a region of interest (ROI) alignment module is arranged between the encoder and the decoder, and the ROI alignment module is configured to extract feature vectors of the K random mask areas from feature maps of each sample image according to positions of the detection frames of the K random mask areas of each sample image.
10. The apparatus of claim 9, further comprising:
a second acquisition module configured to acquire a plurality of images acquired in different retail scenes; and
a screening module configured to screen the plurality of images to obtain a plurality of sample images, wherein each sample image comprises at least one commodity.
11. The apparatus of claim 9, wherein the random selection module comprises:
a first selection submodule configured to, for each sample image, randomly scramble the order of N detection frames under the condition that the number N of the detection frames of the sample image is greater than or equal to K, select the first K detection frames from the N randomly scrambled detection frames, and take the areas where the first K detection frames are located as the K random mask areas of the sample image, wherein N is a positive integer; and
a second selection submodule configured to, for each sample image, increase the number of the detection frames to K by adding random disturbance to the N detection frames under the condition that the number N of the detection frames of the sample image is smaller than K, and take the areas where the K detection frames are located as the K random mask areas of the sample image.
12. The apparatus of claim 9, wherein the processing module comprises:
a first processing submodule configured to divide the K original images corresponding to the K random mask areas of each sample image into M×M image blocks of the same size, randomly select a portion of the image blocks from the M×M image blocks according to a fixed proportion, and perform mask processing on the selected image blocks, wherein M is a positive integer not less than 2.
13. The apparatus of claim 9, wherein the encoder employs a backbone network structure and a neck network structure of a modified target detection algorithm, wherein the backbone network structure employs a cross-stage partial residual network, and the neck network structure employs a path aggregation network, PAN.
14. The apparatus of claim 13, further comprising:
a third acquisition module configured to carry out channel number superposition on the feature maps of each sample image in different dimensions output by the PAN to obtain a multi-dimensional feature map of each sample image; and
a fourth acquisition module configured to carry out feature dimension reduction on the channel number of the multi-dimensional feature map of each sample image to obtain a target dimension feature map of each sample image.
15. The apparatus of claim 13, wherein the decoder is composed of a stacked structure of multiple convolution layers and multiple deconvolution layers.
16. A merchandise identification device comprising:
a fifth acquisition module configured to acquire an image to be identified; and
a second input module configured to input the image to be identified into a commodity identification model to obtain a commodity identification result of the image to be identified output by the commodity identification model;
wherein the commodity identification model is obtained by training an image reconstruction model serving as an initial model, and the image reconstruction model is trained based on the training method of any one of claims 1 to 7.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-8.
CN202310342126.6A 2023-03-31 2023-03-31 Training method of image reconstruction model, commodity identification method, device and equipment Active CN116468816B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310342126.6A CN116468816B (en) 2023-03-31 2023-03-31 Training method of image reconstruction model, commodity identification method, device and equipment

Publications (2)

Publication Number Publication Date
CN116468816A CN116468816A (en) 2023-07-21
CN116468816B (en) 2024-04-16

Family

ID=87181672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310342126.6A Active CN116468816B (en) 2023-03-31 2023-03-31 Training method of image reconstruction model, commodity identification method, device and equipment

Country Status (1)

Country Link
CN (1) CN116468816B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821223A (en) * 2022-03-30 2022-07-29 阿里巴巴(中国)有限公司 Pre-training image text model processing method and image-text retrieval system
CN115082966A (en) * 2022-07-22 2022-09-20 中国科学院自动化研究所 Pedestrian re-recognition model training method, pedestrian re-recognition method, device and equipment
CN115294349A (en) * 2022-06-29 2022-11-04 北京百度网讯科技有限公司 Method and device for training model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116468816A (en) 2023-07-21

Similar Documents

Publication Publication Date Title
US10943145B2 (en) Image processing methods and apparatus, and electronic devices
Zhang et al. Branch detection for apple trees trained in fruiting wall architecture using depth features and Regions-Convolutional Neural Network (R-CNN)
CN110084292B (en) Target detection method based on DenseNet and multi-scale feature fusion
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
US20210224609A1 (en) Method, system and device for multi-label object detection based on an object detection network
US11270158B2 (en) Instance segmentation methods and apparatuses, electronic devices, programs, and media
CN110956225B (en) Contraband detection method and system, computing device and storage medium
Yue et al. Deep recursive super resolution network with Laplacian Pyramid for better agricultural pest surveillance and detection
Yin et al. FD-SSD: An improved SSD object detection algorithm based on feature fusion and dilated convolution
Jiang et al. Fusion of the YOLOv4 network model and visual attention mechanism to detect low-quality young apples in a complex environment
Li et al. Implementation of deep-learning algorithm for obstacle detection and collision avoidance for robotic harvester
US20230030431A1 (en) Method and apparatus for extracting feature, device, and storage medium
Li et al. A multi-scale cucumber disease detection method in natural scenes based on YOLOv5
Shen et al. Real-time tracking and counting of grape clusters in the field based on channel pruning with YOLOv5s
JP7393472B2 (en) Display scene recognition method, device, electronic device, storage medium and computer program
Zhang et al. Lightweight fruit-detection algorithm for edge computing applications
Zheng et al. AGHRNet: An attention ghost-HRNet for confirmation of catch‐and‐shake locations in jujube fruits vibration harvesting
US20230095533A1 (en) Enriched and discriminative convolutional neural network features for pedestrian re-identification and trajectory modeling
Xu et al. Real-time and accurate detection of citrus in complex scenes based on HPL-YOLOv4
CN112381107A (en) Article X-ray detection method and device based on deep learning and computer equipment
CN116109947A (en) Unmanned aerial vehicle image target detection method based on large-kernel equivalent convolution attention mechanism
Peng et al. ResDense-focal-DeepLabV3+ enabled litchi branch semantic segmentation for robotic harvesting
Xie et al. An omni-scale global–local aware network for shadow extraction in remote sensing imagery
Li et al. Region NMS-based deep network for gigapixel level pedestrian detection with two-step cropping
Sheng et al. An edge-guided method to fruit segmentation in complex environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant