CN115661747A - Method for estimating quantity of stored goods based on computer vision - Google Patents

Method for estimating quantity of stored goods based on computer vision

Info

Publication number
CN115661747A
Authority
CN
China
Prior art keywords: goods, cargo, loss, unit, model
Legal status: Pending (assumed; not a legal conclusion)
Application number
CN202211298789.4A
Other languages
Chinese (zh)
Inventor
张广渊
吴杰昊
李克峰
王朋
靳华磊
王国锋
赵峰
Current Assignee
Shandong Jiaotong University
Original Assignee
Shandong Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shandong Jiaotong University filed Critical Shandong Jiaotong University
Priority to CN202211298789.4A
Publication of CN115661747A


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a computer-vision-based method for estimating the quantity of stored goods, characterized by three steps: first, constructing a cargo region segmentation data set and a cargo quantity estimation data set; second, constructing and training a CSwin-Unet model for cargo region segmentation; and third, building a cargo placement and quantity calculation model. The method provides a concrete cargo quantity estimation process and has strong practicability. To address the limited accuracy of vision-based methods, Swin-Unet is improved with the CSwin Transformer block, and the CSwin-Unet network model is used to segment the cargo region; in addition, in view of the actual production environment, a PPW (pixel position weight) loss function is adopted to further optimize the cargo region segmentation model, so that the final estimate of the cargo quantity has excellent accuracy and meets the task requirements.

Description

Method for estimating quantity of stored goods based on computer vision
Technical Field
The invention relates to the technical field of warehousing management, in particular to a method for estimating the quantity of warehoused goods based on computer vision.
Background
Warehousing is an important link in commodity storage and circulation. Estimating the quantity of goods in a target warehouse helps managers keep track of inventory information in time and provides important guidance for commodity storage, circulation, and the formulation of scientific management strategies. Traditional warehouse management relies mainly on manual operation and verification for statistical management of the quantity and type of goods. With the development of Internet of Things (IoT) technology, approaches combining RFID with the IoT are increasingly applied to warehouse management. For example, Tianxiang Zhang et al. proposed ARCago, a multi-device integrated cargo loading management system that monitors goods by fusing sensing information from multiple devices in real time. Existing computer-vision-based warehouse management systems are mostly deployed at warehouse entrances and exits or are used only for security management.
Traditional warehouse management relies on manual operation, which is time-consuming, labor-intensive, and inefficient; it is gradually being eliminated or relegated to an auxiliary role within automated management systems.
Methods based on RFID and the IoT require a large number of sensors; the equipment is complex and unstable, carries potential safety hazards, and its complexity increases cost. RFID counting also requires attaching RFID tags to the goods in advance, further increasing equipment cost and operational complexity.
At present, most computer-vision-based management methods deploy image acquisition systems at warehouse entrances and exits to identify goods entering and leaving the warehouse. Such schemes cannot monitor goods inside the warehouse in real time and have poor timeliness; moreover, because the computer vision algorithms used are immature, recognition accuracy is low and easily affected by the external environment.
Disclosure of Invention
The invention provides a computer-vision-based method for estimating the quantity of stored goods that improves the accuracy of cargo quantity estimation.
The technical scheme adopted by the invention for solving the technical problems is as follows: a method for estimating the quantity of stored goods based on computer vision is characterized by comprising three steps:
firstly, constructing a cargo region segmentation data set and a cargo quantity estimation data set;
establishing a cargo inventory image acquisition system based on fixed-position cameras, wherein the system comprises m cameras at fixed positions in the warehouse; the cameras are installed directly at m preset fixed points in the warehouse, record video data of the goods in the warehouse, and transmit the recorded video to a computer for storage; the number m of cameras is determined by the number of bin positions, and when the number of bin positions is n, the number of cameras is at least n/2;
selecting the videos of the 2a cargo areas, i.e., the bin positions, covered by a group of a cameras for the experiment, extracting video frames from the cargo video data, and manually annotating and recording the quantity of goods;
then acquiring cargo images in the RGB color space and jpg format from the video frames;
screening the video images and removing near-duplicate images (two images are considered similar when, at 80% of pixel positions, the difference in pixel values is less than 5), then labeling the two cargo areas with the labelme data annotation tool to obtain a semantic segmentation data set containing two semantic classes (left and right bin positions) and X images in total; the number X of pictures is not less than 100 times the number of cameras, i.e., not less than 100m;
further subdividing the X segmented pictures and manually annotating the quantity of goods, constructing a cargo quantity estimation data set covering 2a different bin positions with 2X data records in total, and dividing it into training, validation, and test sets in a 7:2:1 ratio;
secondly, constructing and training a CSwin-Unet model for cargo region segmentation;
a cargo region segmentation model is constructed through a semantic segmentation algorithm so that the cargo region is segmented accurately;
a Swin-Unet semantic segmentation model based on the vision Transformer (ViT) is selected among U-Net-type semantic segmentation models (U-Net is a model commonly used in this field); the U-Net structure comprises an encoder and a decoder, and each pixel in the image is classified by adopting the encoder-decoder structure and the skip connections of U-Net to complete the semantic segmentation task;
in the encoder part, features are extracted with the Swin Transformer block structure from ViT (the Swin Transformer block is a substructure of the model), and downsampling is performed with patch merging (block merging) for feature fusion;
in the decoder part, the fused features are decoded with Swin Transformer blocks as well, and the resolution is restored by upsampling with patch expanding;
a more accurate cargo region segmentation model is used, together with a loss function better suited to the actual situation during model training;
the CSwin block structure is selected to improve the Swin block in Swin-Unet, constructing a CSwin-Unet semantic segmentation model; the CSwin block uses a cross-shaped self-attention window and locally enhanced position coding;
the Cross-Shaped Window splices the two self-attention stripes, horizontal and vertical, around the query point into a cross-shaped attention window;
LePE (locally enhanced position coding) learns the positional information of the value through a depth-wise convolution, adds it in a residual manner, and embeds it into the block; it locally enhances the position coding on the assumption that the most important positional information comes from the vicinity of a given input element, and implements the enhancement with a depth-wise convolution operator, thereby obtaining the position encoding:
Attention(Q, K, V) = Softmax(Q·K^T/√d)·V + DepthWiseConv(V)
where Attention denotes the self-attention function with LePE; the input Q is the query vector, K is the key (index) vector, and V is the value (content) vector; Softmax is the softmax function; d is the dimension that scales the product of Q and the transpose of K (the dot product has variance on the order of d); and DepthWiseConv(V) is the result of a depth-wise separable convolution applied to V;
in the training of CSwin-Unet, the network is first pre-trained on the COCO Stuff data set (a commonly used data set), and the model is then fine-tuned on the constructed cargo region segmentation data set;
during fine-tuning, a pixel position weight (PPW) is added to the cross-entropy loss to form a PPW loss function, giving larger weights to pixels farther from the center of the camera's line of sight; the cross-entropy loss is:
CEL = −∑_{c=1}^{M} y_c · log(p_c)
where M is the number of classes; y_c is the label indicator, taking the value 1 when the sample belongs to class c and 0 otherwise; and p_c is the predicted probability that the sample belongs to class c;
Dice Loss is derived from the Dice coefficient, a metric used to evaluate the similarity of two sample regions, defined as follows:
Dice = 2·|X ∩ Y| / (|X| + |Y|)
where |X ∩ Y| is the number of elements in the intersection of regions X and Y, and |X| and |Y| are the numbers of elements in regions X and Y respectively; the numerator is multiplied by 2 to keep the value of Dice within [0, 1], since the intersection is counted twice in the denominator; Dice Loss is therefore:
Loss_Dice = 1 − 2·|X ∩ Y| / (|X| + |Y|)
setting the vertical viewing angle of the camera to 45 degrees, with the lower boundary of the captured image located directly below the camera, so that the world-coordinate distance between the upper and lower boundaries of the captured image equals the distance between the camera and the lower boundary, where h is the distance between the camera and the lower boundary of the captured image, w is the world-coordinate distance between the left and right boundaries of the captured image, and the gray area is the camera's field of view; constructing a pixel position weight matrix W, max-min normalizing its elements, and adding a smoothing coefficient k, each element w being given by:
[element-wise formula given as an image in the original: each element of W is derived from the camera-to-point distance for the corresponding pixel, max-min normalized and offset by the smoothing coefficient k]
when computing the cross-entropy loss, the weight matrix is combined with the output tensor O and the label tensor L by element-wise multiplication to obtain the pixel-position-weighted cross-entropy loss, which is combined with Dice Loss to form the PPW loss:
PPW Loss = CEL(W × O, W × L) + Loss_Dice(O, L)
fine-tuning the model with this loss function yields a segmentation model suited to the cargo region segmentation data set; when estimating the quantity of goods, this model is used to segment the cargo region;
thirdly, building a goods placement and quantity calculation model:
modeling the cargo area in the picture in advance and constructing a cargo unit mask; matching the segmented cargo regions with the masks to convert individual goods into corresponding units in the three-dimensional model, every constructed unit being a visible piece of cargo; the cargo units are divided into three categories according to their positions in the three-dimensional model: units on corners, units on edges, and units in the interior of a face; the image problem is thereby converted into a mathematical problem on a three-dimensional model;
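(For illustration only, and not part of the claimed method: the three categories can be read as counting how many of a unit's coordinates lie on an outer face of the stack. The sketch below encodes this reading in Python; the grid-coordinate representation and the extra "hidden" category for fully interior units are assumptions added for completeness.)

```python
from enum import Enum

class UnitType(Enum):
    CORNER = "corner"
    EDGE = "edge"
    FACE_INTERIOR = "face interior"
    HIDDEN = "hidden"            # not on any outer face of the stack

def classify_unit(i, j, k, rows, cols, layers):
    """Classify a cargo unit by its position (i, j, k) inside a rows x cols x layers
    stack: extreme in 3 axes -> corner, 2 axes -> edge, 1 axis -> interior of a face,
    0 axes -> hidden (invisible) unit."""
    extremes = sum((
        i in (0, rows - 1),
        j in (0, cols - 1),
        k in (0, layers - 1),
    ))
    return {3: UnitType.CORNER, 2: UnitType.EDGE,
            1: UnitType.FACE_INTERIOR, 0: UnitType.HIDDEN}[extremes]

# e.g. in a 4 x 3 x 2 stack, unit (0, 0, 0) is a corner and (1, 0, 0) lies on an edge
print(classify_unit(0, 0, 0, 4, 3, 2))   # UnitType.CORNER
print(classify_unit(1, 0, 0, 4, 3, 2))   # UnitType.EDGE
```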
different types of cargo units have different physical characteristics, for example: there must be invisible units beneath a unit in an upper layer, and a unit at a corner must have no other units on its three outer faces or above it; all determinable cargo boundaries in the three-dimensional model are constructed according to these characteristics, invisible units are filled in, and for units that cannot be determined, all possible configurations are enumerated exhaustively;
designing a rationality loss function according to the placement rules, evaluating each possible configuration, and seeking the configuration with the best rationality;
for the rationality loss function, each unit has 6 faces, and the placement rule requires that goods be placed starting from one side, with the next side started only after the current side is full, or the next layer started only after the current layer is full; therefore, for an existing unit, a loss arises when the layer below it has a non-masked boundary but vacant side positions (i.e., the lower layer is not full), or when its own layer has a non-masked boundary but vacant side positions (i.e., its layer is not full) while other units exist above it; in the latter case the loss value of the unit is positively and linearly related to the number of units above it;
when calculating the rationality loss, the placement direction is first judged from the numbers of rows, columns, and layers of the existing units, and in each placement direction the losses of the two corresponding faces must be calculated; if the number of layers is smaller than the numbers of rows and columns, the units are judged to be placed layer by layer, and the upper and lower faces of each unit are evaluated: the lower-face loss is the number of units missing from the lower layer multiplied by the lower-face weight, and when the unit's own layer is not full, the upper-face loss is the number of units in the upper layer multiplied by the upper-face weight; the rationality loss of such a unit is therefore:
Loss = num(MAX_layer − n_1) × w_1 + num(MAX_layer − n_2) × w_2
where MAX_layer is the maximum number of units in a single layer, n_1 is the number of units in the lower layer, n_2 is the number of units in the upper layer, w_1 is the weight of the lower-face loss, and w_2 is the weight of the upper-face loss; when the unit's layer is full, w_2 takes the value 0; when the placement direction is along rows or along columns, the two faces evaluated are the front/back or left/right faces, and the loss function has the same form;
the rationality losses of all units in a placement configuration are summed to obtain the rationality loss of that configuration; among all exhaustively enumerated configurations, the one with the minimum rationality loss is selected as the cargo placement model, and its number of units is taken as the estimated quantity of goods.
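(For illustration only: the sketch below gives a simplified, layer-wise reading of this selection step. Representing a placement as per-layer unit counts, the way per-unit losses are aggregated, and the example candidate placements are assumptions; the full method also evaluates row-wise and column-wise placement directions and the individual side faces of each unit.)

```python
def placement_rationality_loss(layer_counts, max_per_layer, w1=1.0, w2=1.0):
    """Sum of per-unit rationality losses for one candidate placement.

    layer_counts: number of units in each layer, bottom to top
    max_per_layer: maximum number of units a full layer can hold (MAX_layer)

    For every unit, a penalty is added when the layer below it is not full
    (missing units x w1) and, if its own layer is not full, when units sit
    above it (units in the layer above x w2).
    """
    total = 0.0
    for idx, count in enumerate(layer_counts):
        below_missing = 0 if idx == 0 else max_per_layer - layer_counts[idx - 1]
        above_count = layer_counts[idx + 1] if idx + 1 < len(layer_counts) else 0
        layer_full = count == max_per_layer
        per_unit = below_missing * w1 + (0 if layer_full else above_count * w2)
        total += per_unit * count          # add the loss of every unit in this layer
    return total

# Among all exhaustively enumerated placements, the one with the smallest
# rationality loss is selected and its unit count is the quantity estimate.
candidates = {"A": [6, 6, 3], "B": [6, 4, 5]}       # hypothetical layer counts
best = min(candidates, key=lambda name: placement_rationality_loss(candidates[name], max_per_layer=6))
print(best, sum(candidates[best]))                   # prints: A 15
```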
The further technical scheme is as follows: in the first step, the number m of cameras is determined by the number of bin positions; when the number of bin positions is n, the number of cameras satisfies n/2 ≤ m ≤ 3n/2.
The further technical scheme is as follows: in the first step, an image of the cargo in RGB color space, jpg format, and resolution 2560 x 1440 is obtained from the video frame.
The invention has the beneficial effects that:
1. Compared with the traditional approach of managing stored goods through manual checking, the method requires no complex manual operation; the computer handles cargo quantity estimation and information storage, so the whole procedure is automated and informatized, with high efficiency and low cost.
2. Compared with cargo counting methods based on RFID and the Internet of Things, the method achieves high-precision estimation of the quantity of warehoused goods by reasonably planning the positions of the monitoring cameras already present in the warehouse, without complex equipment such as additional sensors or RFID tags. The equipment is simple, potential safety hazards are reduced, and economic cost is lowered. In addition, the quantity of goods in the warehouse can be estimated in real time, and human-machine efficiency is enhanced.
3. Compared with current computer-vision-based management methods, the invention provides a concrete cargo quantity estimation process and has stronger practicability. To address the low accuracy of vision-based methods, Swin-Unet is improved with the CSwin Transformer block, and the CSwin-Unet network model is used to segment the cargo region; in addition, in view of the actual production environment, a PPW (pixel position weight) loss function is adopted to further optimize the cargo region segmentation model, so that the final estimate of the cargo quantity has excellent accuracy and meets the task requirements.
Drawings
FIG. 1 is a general flow chart of an embodiment of a method for estimating the quantity of warehoused goods based on computer vision according to the present invention;
FIG. 2 is a schematic diagram of the camera arrangement in the present invention;
FIG. 3 is a schematic diagram of the relationship between the camera settings and the positions of the bins in an embodiment of the present invention;
Detailed Description
The invention will be further described with reference to the accompanying drawings.
As shown in fig. 1-3, the method of the present embodiment comprises three parts: constructing a cargo region segmentation data set and a cargo quantity estimation data set; constructing and training a CSwin-Unet model for cargo region segmentation; and building a cargo placement and quantity calculation model to estimate the cargo quantity.
Firstly, constructing a cargo region segmentation data set and a cargo quantity estimation data set:
A cargo inventory image acquisition system based on fixed-position cameras is established, consisting mainly of 18 fixed-position cameras (Hikvision DS-2CD2746FWDA2-IZS, 4 megapixels) in the warehouse. The cameras are installed directly at 18 preset fixed points in the warehouse, record video data of the goods in the warehouse, and transmit the recorded video to a computer for storage. Videos of the 6 cargo areas covered by 3 of the cameras are selected for the experiment, video frames are extracted from the cargo video data, and the quantity of goods is manually annotated and recorded. Subsequently, cargo images in the RGB color space, jpg format, and 2560 × 1440 resolution are obtained from the video frames.
The video images are then screened. After removing images with high similarity (two images are considered highly similar when, at 80% of pixel positions, the difference in pixel values is less than 5), the two cargo areas are labeled with the labelme data annotation tool, yielding a semantic segmentation data set with two semantic classes (left and right bin positions) and 3500 images in total. On this basis, the 3500 segmented pictures are further subdivided and the quantity of goods is manually annotated, producing a cargo quantity estimation data set covering 6 different bin positions with 7000 data records in total, which is divided into training, validation, and test sets in a 7:2:1 ratio.
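A minimal sketch of the screening criterion, assuming 3-channel RGB frames and a simple policy of keeping a frame only when it is not a near-duplicate of the previously kept frame (the exact screening procedure is not specified in the text):

```python
import cv2
import numpy as np

def is_near_duplicate(img_a, img_b, pixel_tol=5, fraction=0.8):
    """True if, at `fraction` of pixel positions, every channel of the two
    RGB frames differs by less than `pixel_tol`."""
    diff = cv2.absdiff(img_a, img_b)            # per-channel absolute difference
    close = (diff < pixel_tol).all(axis=-1)     # a pixel counts as "same" only if all channels are close
    return close.mean() >= fraction

def screen_frames(frames):
    """Keep a frame only when it is not a near-duplicate of the last kept frame."""
    kept = []
    for frame in frames:
        if not kept or not is_near_duplicate(kept[-1], frame):
            kept.append(frame)
    return kept

# toy demonstration with two random frames (real input would be the extracted video frames)
demo = [np.random.randint(0, 256, (1440, 2560, 3), dtype=np.uint8) for _ in range(2)]
print(len(screen_frames(demo)))
```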
In other embodiments, the number of cameras may differ; it is determined by the number of bin positions, preferably as follows: when the number of bin positions is n, the number m of cameras satisfies n/2 ≤ m ≤ 3n/2; that is, for n bin positions the minimum required number of cameras is n/2 and the maximum is 3n/2, and within this interval, the closer the number of cameras is to the maximum, the more accurate the result.
In the actual case shown in fig. 3, the number of bin positions was 12, so 3 × 12 / 2 = 18 cameras were used.
The numbers of cameras and pictures above are those of an actual case used to acquire data for training the model. To improve the accuracy of the model, in this embodiment 3 cameras are selected to cover three different backgrounds, improving the robustness of the model as far as possible; each camera covers 2 bin positions according to the equipment design, so the number of bin positions finally covered is twice the number of cameras. The 3500 pictures are the data used to train the model; there is no strict required value, but the number should not be too small and is estimated to be at least 100 times the number of cameras.
As for camera selection, the resolution should in principle be no lower than 224 × 224.
The 7000 data records are twice the 3500 pictures because each picture contains 2 bin positions, each corresponding to 1 data record. The 7:2:1 split into training, validation, and test sets is common practice in machine learning. After the split, 4900 of the 7000 records form the training set used to train the model; 1400 records form the validation set, used to evaluate and tune the model during training; and 700 records form the test set, used for the final evaluation after training is complete.
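A minimal sketch of the 7:2:1 split (the shuffling and the fixed seed are assumptions; the text does not specify how records are assigned to the three sets):

```python
import random

def split_records(records, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle the data records and split them into train/val/test sets in a 7:2:1 ratio."""
    rng = random.Random(seed)
    records = records[:]
    rng.shuffle(records)
    n = len(records)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return records[:n_train], records[n_train:n_train + n_val], records[n_train + n_val:]

# With 7000 records this yields 4900 / 1400 / 700 records, matching the embodiment.
train, val, test = split_records(list(range(7000)))
print(len(train), len(val), len(test))
```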
Secondly, constructing and training a CSwin-Unet model for cargo region segmentation:
A cargo region segmentation model is constructed through a semantic segmentation algorithm to achieve accurate segmentation of the cargo region. Swin-Unet is a U-Net-type semantic segmentation model based on the vision Transformer (ViT); it classifies each pixel in the image by adopting the encoder-decoder structure and skip connections of U-Net to complete the semantic segmentation task. Its structure comprises an encoder and a decoder. In the encoder part, features are extracted with the Swin Transformer block structure from ViT, and downsampling is performed with patch merging (block merging) for feature fusion. In the decoder part, the fused features are decoded with Swin Transformer blocks as well, and the resolution is restored by upsampling with patch expanding.
Considering that images in the actual production environment are strongly affected by noise such as illumination and clutter, so that precise segmentation of the cargo region boundary cannot be achieved in practice, a more accurate cargo region segmentation model is needed, together with a loss function that better matches the actual situation during model training. CSWin Transformer is a more advanced and accurate ViT model that uses cross-shaped self-attention windows (Cross-Shaped Windows) and locally enhanced position coding (LePE), with better performance than Swin. The invention therefore selects the CSwin block structure to improve the Swin block in Swin-Unet, constructing a CSwin-Unet semantic segmentation model that improves model precision and makes the model lighter.
Compared with the Swin block, the improvements in the CSwin block are the use of a cross-shaped self-attention window and locally enhanced position coding. The Cross-Shaped Window splices the two self-attention stripes, horizontal and vertical, around the query point into a cross-shaped attention window, obtaining a larger self-attention region by changing the shape of the window while reducing the number of stacked blocks (the structural blocks of the neural network). LePE directly learns the positional information of the value through a depth-wise convolution, adds it in a residual manner, and is thus very conveniently embedded into the block:
Attention(Q, K, V) = Softmax(Q·K^T/√d)·V + DepthWiseConv(V)
where Attention denotes the self-attention function with LePE; the input Q is the query vector, K is the key (index) vector, and V is the value (content) vector; Softmax is the softmax function; d is the dimension that scales the product of Q and the transpose of K (the dot product has variance on the order of d); and DepthWiseConv(V) is the result of a depth-wise separable convolution applied to V.
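For illustration, a minimal single-window PyTorch sketch of attention with LePE is given below; it omits the cross-shaped window partitioning and multi-stripe layout of the full CSwin block, and the module name, head count, and 3 × 3 kernel size are assumptions rather than details taken from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LePEAttention(nn.Module):
    """Self-attention over one window with a locally-enhanced positional term:
    a depth-wise convolution of V is added to the attention output as a residual."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        # depth-wise 3x3 convolution that produces the positional term from V
        self.get_pe = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, q, k, v, h, w):
        # q, k, v: (B, N, C) token sequences for one window, with N = h * w
        B, N, C = q.shape
        heads, hd = self.num_heads, C // self.num_heads

        # positional encoding from V via the depth-wise convolution
        pe = self.get_pe(v.transpose(1, 2).reshape(B, C, h, w))
        pe = pe.reshape(B, C, N).transpose(1, 2)

        q = q.reshape(B, N, heads, hd).transpose(1, 2)        # (B, heads, N, hd)
        k = k.reshape(B, N, heads, hd).transpose(1, 2)
        v = v.reshape(B, N, heads, hd).transpose(1, 2)

        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return out + pe                                        # residual LePE term

# toy usage on an 8x8 window of 32-dimensional tokens
x = torch.randn(1, 64, 32)
attn = LePEAttention(dim=32, num_heads=4)
print(attn(x, x, x, h=8, w=8).shape)   # torch.Size([1, 64, 32])
```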
In the training of CSwin-Unet, to obtain stronger feature extraction capability, this embodiment adopts a transfer learning strategy: the network is first pre-trained on the COCO Stuff data set and the model is then fine-tuned on the constructed cargo region segmentation data set. Analysis of the images in the cargo region segmentation data set shows that, when the goods are placed uniformly according to a specific rule, the closer the goods are to the camera, the more pixels a single piece of cargo covers, the less information each pixel contains, and the lower the importance of each pixel. Therefore, a pixel position weight (PPW) is added to the cross-entropy loss during fine-tuning to address the uneven importance of pixels caused by their different distances in this application. To this end, the invention designs a PPW loss function that improves the accuracy of cargo quantity estimation by giving larger weights to pixels farther from the center of the camera's line of sight, so that the model tends to segment the more distant parts of the image more accurately. The cross-entropy loss is:
CEL = −∑_{c=1}^{M} y_c · log(p_c)
where M is the number of classes; y_c is the label indicator, taking the value 1 when the sample belongs to class c and 0 otherwise; and p_c is the predicted probability that the sample belongs to class c.
Dice Loss is derived from the Dice coefficient, a metric used to evaluate the similarity of two sample regions, defined as follows:
Dice = 2·|X ∩ Y| / (|X| + |Y|)
where |X ∩ Y| is the number of elements in the intersection of regions X and Y, and |X| and |Y| are the numbers of elements in regions X and Y respectively; the numerator is multiplied by 2 to keep the value of Dice within [0, 1], since the intersection is counted twice in the denominator. Dice Loss is therefore:
Loss_Dice = 1 − 2·|X ∩ Y| / (|X| + |Y|)
setting a vertical visual angle of the camera to be 45 degrees, wherein the lower boundary of the shot image is positioned right below the camera, and in the graph of FIG. 2, P is a point right below the camera, namely the midpoint of the lowermost edge of the shot image; a is any point in the image; d is the distance from the camera to point a. At this time, the distance between the upper and lower boundaries in the world coordinate system in the image is the same as the distance between the camera and the lower boundary, and the device architecture is shown in fig. 2, where h is the distance between the camera and the lower boundary of the image, w is the distance between the left and right boundaries of the image in the world coordinate system, and the gray area is the field of view of the camera. Thus, a pixel position weight matrix W is constructed, the maximum and minimum values of the elements are normalized, and a smoothing coefficient k is added, wherein any element W comprises:
[element-wise formula given as an image in the original: each element of W is derived from the camera-to-point distance D for the corresponding pixel, max-min normalized and offset by the smoothing coefficient k]
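Because the element-wise expression for w is only available as an image in the original, the following sketch encodes one plausible reading: the raw weight of each pixel is the world-coordinate distance from the camera to the ground point it images, max-min normalized and offset by k. The function name and the example values of h, w, and the resolution are assumptions for illustration only:

```python
import numpy as np

def pixel_position_weights(h, w, rows, cols, k=0.1):
    """Hypothetical pixel position weight matrix W.

    h: distance from the camera to the lower image boundary (camera height)
    w: world-coordinate width between the left and right image boundaries
    rows, cols: image resolution
    """
    # forward ground distance from the point directly below the camera:
    # h at the top image row, 0 at the bottom row (45-degree vertical view angle)
    forward = np.linspace(h, 0.0, rows).reshape(-1, 1)
    lateral = np.linspace(-w / 2.0, w / 2.0, cols).reshape(1, -1)
    D = np.sqrt(h ** 2 + forward ** 2 + lateral ** 2)       # camera-to-point distance
    W = (D - D.min()) / (D.max() - D.min()) + k             # max-min normalization plus k
    return W

W = pixel_position_weights(h=4.0, w=3.0, rows=1440, cols=2560)
print(W.min(), W.max())   # roughly k and 1 + k
```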
and when the cross entropy loss is calculated, performing dot product on the weight matrix, the output tensor O and the label tensor L to obtain the cross entropy loss based on the pixel position weight, and combining the cross entropy loss with the DiceLoss to form PPWLoss.
PPW Loss=CEL(W×O,W×L)+Loss Dice (O,L),
The model is fine-tuned with this loss function to obtain a segmentation model suited to the cargo region segmentation data set. This model is used for cargo region segmentation in cargo quantity estimation.
The second step is characterized in that: 1. CSwin-Unet is designed by improving Swin-Unet with the CSwin block; 2. the PPW loss function is designed and used for fine-tuning in transfer learning.
The Cross-Shaped Window and LePE mentioned above are modules within the CSwin block, and the attention function is a further description of LePE. In addition, the cross-entropy loss CEL and the Dice loss are loss functions in common use at present.
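For illustration, a minimal PyTorch sketch of the PPW loss is given below. It assumes the pixel position weights are applied to the per-pixel cross-entropy map (one way of realizing CEL(W × O, W × L)) and that the Dice term is computed on the foreground class; the function names and tensor shapes are illustrative, not part of the patent:

```python
import torch
import torch.nn.functional as F

def dice_loss(probs, target, eps=1e-6):
    """Dice loss for a single foreground class: 1 - 2|X∩Y| / (|X| + |Y|)."""
    inter = (probs * target).sum()
    return 1.0 - 2.0 * inter / (probs.sum() + target.sum() + eps)

def ppw_loss(logits, target, weight):
    """Pixel-position-weighted loss: weighted cross entropy plus Dice loss.

    logits: (B, C, H, W) raw network outputs
    target: (B, H, W) integer class labels
    weight: (H, W) pixel position weight matrix W (larger weights for pixels
            farther from the center of the camera's line of sight)
    """
    # per-pixel cross entropy, then weighted by the pixel position weights
    ce_map = F.cross_entropy(logits, target, reduction="none")   # (B, H, W)
    weighted_ce = (ce_map * weight).mean()
    # Dice loss on the unweighted prediction for the foreground class
    probs = torch.softmax(logits, dim=1)[:, 1]
    return weighted_ce + dice_loss(probs, (target == 1).float())

# toy usage: 2-class logits on a 4x4 image
logits = torch.randn(1, 2, 4, 4)
target = torch.randint(0, 2, (1, 4, 4))
weight = torch.ones(4, 4)
print(ppw_loss(logits, target, weight))
```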
Thirdly, building a goods placement and quantity calculation model:
The cargo area in the picture is modeled in advance and a cargo unit mask is constructed; the segmented cargo regions are matched with the masks to convert individual goods into corresponding units in the three-dimensional model, every constructed unit being a visible piece of cargo. The cargo units are divided into three categories according to their positions in the three-dimensional model: units on corners, units on edges, and units in the interior of a face. The image problem is thereby converted into a mathematical problem on a three-dimensional model.
Different types of cargo units have different physical characteristics, for example: there must be invisible units beneath a unit in an upper layer, and a unit at a corner must have no other units on its three outer faces or above it. All determinable cargo boundaries in the three-dimensional model are constructed according to these characteristics, invisible units are filled in, and for units that cannot be determined, all possible configurations are enumerated exhaustively.
A rationality loss function is designed according to the placement rules, each possible configuration is evaluated, and the configuration with the best rationality is sought.
For the rationality loss function, each unit has 6 faces, and the placement rule requires that goods be placed starting from one side, with the next side started only after the current side is full, or the next layer started only after the current layer is full. Therefore, for an existing unit, a loss arises when the layer below it has a non-masked boundary but vacant side positions (i.e., the lower layer is not full), or when its own layer has a non-masked boundary but vacant side positions (i.e., its layer is not full) while other units exist above it; in the latter case the loss value of the unit is positively and linearly related to the number of units above it. For example, if unit a has 3 units above it, 3 is added to the loss value of a.
When calculating the rationality loss, the placement direction is first judged from the numbers of rows, columns, and layers of the existing units, and in each placement direction the losses of the two corresponding faces must be calculated. If the number of layers is smaller than the numbers of rows and columns, the units are judged to be placed layer by layer, and the upper and lower faces of each unit are evaluated: the lower-face loss is the number of units missing from the lower layer multiplied by the lower-face weight, and when the unit's own layer is not full, the upper-face loss is the number of units in the upper layer multiplied by the upper-face weight. The rationality loss of such a unit is therefore:
Loss = num(MAX_layer − n_1) × w_1 + num(MAX_layer − n_2) × w_2
where MAX_layer is the maximum number of units in a single layer, n_1 is the number of units in the lower layer, n_2 is the number of units in the upper layer, w_1 is the weight of the lower-face loss, and w_2 is the weight of the upper-face loss; when the unit's layer is full, w_2 takes the value 0. When the placement direction is along rows or along columns, the two faces evaluated are the front/back or left/right faces, and the loss function has the same form.
The rationality losses of all units in a placement configuration are summed to obtain the rationality loss of that configuration; among all exhaustively enumerated configurations, the one with the minimum rationality loss is selected as the cargo placement model, and its number of units is taken as the estimated quantity of goods.
In a specific implementation, the computer CPU can be an i5-9300H and the GPU a Quadro RTX 5000. The programming language is Python 3.7 (64-bit) with PyCharm as the integrated development environment; CSwin-Unet and the other segmentation model algorithms are implemented with the Python-based deep learning library PyTorch, and the operating environment is deployed on an Ubuntu 18.04.3 system. The image preprocessing algorithms and shape feature extraction are implemented with the OpenCV-Python package, and XGBoost and the other numerical estimation model algorithms are implemented with the sklearn package.

Claims (3)

1. A method for estimating the quantity of stored goods based on computer vision is characterized by comprising three steps:
firstly, constructing a cargo region segmentation data set and a cargo quantity estimation data set;
establishing a cargo inventory image acquisition system based on fixed-position cameras, wherein the system comprises m cameras at fixed positions in the warehouse; the cameras are installed directly at m preset fixed points in the warehouse, record video data of the goods in the warehouse, and transmit the recorded video to a computer for storage; the number m of cameras is determined by the number of bin positions, and when the number of bin positions is n, the number of cameras is at least n/2;
selecting the videos of the 2a cargo areas, i.e., the bin positions, covered by a group of a cameras for the experiment, extracting video frames from the cargo video data, and manually annotating and recording the quantity of goods;
then acquiring cargo images in the RGB color space and jpg format from the video frames;
screening the video images and removing near-duplicate images (two images are considered similar when, at 80% of pixel positions, the difference in pixel values is less than 5), then labeling the two cargo areas with the labelme data annotation tool to obtain a semantic segmentation data set containing two semantic classes (left and right bin positions) and X images in total; the number X of pictures is not less than 100 times the number of cameras, i.e., not less than 100m;
further subdividing the X segmented pictures and manually annotating the quantity of goods, constructing a cargo quantity estimation data set covering 2a different bin positions with 2X data records in total, and dividing it into training, validation, and test sets in a 7:2:1 ratio;
secondly, constructing and training a CSwin-Unet model for cargo region segmentation;
a cargo region segmentation model is constructed through a semantic segmentation algorithm so that the cargo region is segmented accurately;
a Swin-Unet semantic segmentation model based on the vision Transformer (ViT) is selected among U-Net-type semantic segmentation models; the U-Net structure comprises an encoder and a decoder, and each pixel in the image is classified by adopting the encoder-decoder structure and the skip connections of U-Net to complete the semantic segmentation task;
in the encoder part, features are extracted with the Swin Transformer block structure from ViT, and downsampling is performed with patch merging for feature fusion;
in the decoder part, the fused features are decoded with Swin Transformer blocks as well, and the resolution is restored by upsampling with patch expanding;
a more accurate cargo region segmentation model is used, together with a loss function better suited to the actual situation during model training;
the CSwin block structure is selected to improve the Swin block in Swin-Unet, constructing a CSwin-Unet semantic segmentation model; the CSwin block uses a cross-shaped self-attention window and locally enhanced position coding;
the Cross-Shaped Window splices the two self-attention stripes, horizontal and vertical, around the query point into a cross-shaped attention window;
LePE learns the positional information of the value through a depth-wise convolution, adds it in a residual manner, and embeds it into the block; it locally enhances the position coding on the assumption that the most important positional information comes from the vicinity of a given input element, and implements the enhancement with a depth-wise convolution operator, thereby obtaining the positional information of the value by depth-wise convolution:
Attention(Q, K, V) = Softmax(Q·K^T/√d)·V + DepthWiseConv(V)
where Attention denotes the self-attention function with LePE; the input Q is the query vector, K is the key (index) vector, and V is the value (content) vector; Softmax is the softmax function; d is the dimension that scales the product of Q and the transpose of K (the dot product has variance on the order of d); and DepthWiseConv(V) is the result of a depth-wise separable convolution applied to V;
in the CSwin-Unet training, the network is first pre-trained on the COCO Stuff data set, and the model is then fine-tuned on the constructed cargo region segmentation data set;
during fine-tuning, a pixel position weight (PPW) is added to the cross-entropy loss to form a PPW loss function, giving larger weights to pixels farther from the center of the camera's line of sight; the cross-entropy loss is:
CEL = −∑_{c=1}^{M} y_c · log(p_c)
where M is the number of classes; y_c is the label indicator, taking the value 1 when the sample belongs to class c and 0 otherwise; and p_c is the predicted probability that the sample belongs to class c;
Dice Loss is derived from the Dice coefficient, a metric used to evaluate the similarity of two sample regions, defined as follows:
Dice = 2·|X ∩ Y| / (|X| + |Y|)
where |X ∩ Y| is the number of elements in the intersection of regions X and Y, and |X| and |Y| are the numbers of elements in regions X and Y respectively; the numerator is multiplied by 2 to keep the value of Dice within [0, 1], since the intersection is counted twice in the denominator; Dice Loss is therefore:
Loss_Dice = 1 − 2·|X ∩ Y| / (|X| + |Y|)
setting the vertical viewing angle of the camera to 45 degrees, with the lower boundary of the captured image located directly below the camera, so that the world-coordinate distance between the upper and lower boundaries of the captured image equals the distance between the camera and the lower boundary, where h is the distance between the camera and the lower boundary of the captured image, w is the world-coordinate distance between the left and right boundaries of the captured image, and the gray area is the camera's field of view; constructing a pixel position weight matrix W, max-min normalizing its elements, and adding a smoothing coefficient k, each element w being given by:
[element-wise formula given as an image in the original: each element of W is derived from the camera-to-point distance for the corresponding pixel, max-min normalized and offset by the smoothing coefficient k]
when computing the cross-entropy loss, the weight matrix is combined with the output tensor O and the label tensor L by element-wise multiplication to obtain the pixel-position-weighted cross-entropy loss, which is combined with Dice Loss to form the PPW loss:
PPW Loss = CEL(W × O, W × L) + Loss_Dice(O, L)
fine-tuning the model with this loss function yields a segmentation model suited to the cargo region segmentation data set; when estimating the quantity of goods, this model is used to segment the cargo region;
thirdly, building a goods placement and quantity calculation model:
modeling the cargo area in the picture in advance and constructing a cargo unit mask; matching the segmented cargo regions with the masks to convert individual goods into corresponding units in the three-dimensional model, every constructed unit being a visible piece of cargo; the cargo units are divided into three categories according to their positions in the three-dimensional model: units on corners, units on edges, and units in the interior of a face; the image problem is thereby converted into a mathematical problem on a three-dimensional model;
different types of cargo units have different physical characteristics, for example: there must be invisible units beneath a unit in an upper layer, and a unit at a corner must have no other units on its three outer faces or above it; all determinable cargo boundaries in the three-dimensional model are constructed according to these characteristics, invisible units are filled in, and for units that cannot be determined, all possible configurations are enumerated exhaustively;
designing a rationality loss function according to the placement rules, evaluating each possible configuration, and seeking the configuration with the best rationality;
for the rationality loss function, each unit has 6 faces, and the placement rule requires that goods be placed starting from one side, with the next side started only after the current side is full, or the next layer started only after the current layer is full; therefore, for an existing unit, a loss arises when the layer below it has a non-masked boundary but vacant side positions (i.e., the lower layer is not full), or when its own layer has a non-masked boundary but vacant side positions (i.e., its layer is not full) while other units exist above it; in the latter case the loss value of the unit is positively and linearly related to the number of units above it;
when calculating the rationality loss, the placement direction is first judged from the numbers of rows, columns, and layers of the existing units, and in each placement direction the losses of the two corresponding faces must be calculated; if the number of layers is smaller than the numbers of rows and columns, the units are judged to be placed layer by layer, and the upper and lower faces of each unit are evaluated: the lower-face loss is the number of units missing from the lower layer multiplied by the lower-face weight, and when the unit's own layer is not full, the upper-face loss is the number of units in the upper layer multiplied by the upper-face weight; the rationality loss of such a unit is therefore:
Loss = num(MAX_layer − n_1) × w_1 + num(MAX_layer − n_2) × w_2
where MAX_layer is the maximum number of units in a single layer, n_1 is the number of units in the lower layer, n_2 is the number of units in the upper layer, w_1 is the weight of the lower-face loss, and w_2 is the weight of the upper-face loss; when the unit's layer is full, w_2 takes the value 0; when the placement direction is along rows or along columns, the two faces evaluated are the front/back or left/right faces, and the loss function has the same form;
the rationality losses of all units in a placement configuration are summed to obtain the rationality loss of that configuration; among all exhaustively enumerated configurations, the one with the minimum rationality loss is selected as the cargo placement model, and its number of units is taken as the estimated quantity of goods.
2. The computer vision-based warehouse cargo quantity estimation method according to claim 1, characterized in that: in the first step, the number m of cameras is determined by the number of bin positions, and when the number of bin positions is n, the number m of cameras satisfies n/2 ≤ m ≤ 3n/2.
3. The computer vision-based warehouse cargo quantity estimation method according to claim 1, characterized in that: in the first step, an image of the cargo in RGB color space, jpg format and resolution 2560 × 1440 is obtained from the video frame.
CN202211298789.4A 2022-10-24 2022-10-24 Method for estimating quantity of stored goods based on computer vision Pending CN115661747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211298789.4A CN115661747A (en) 2022-10-24 2022-10-24 Method for estimating quantity of stored goods based on computer vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211298789.4A CN115661747A (en) 2022-10-24 2022-10-24 Method for estimating quantity of stored goods based on computer vision

Publications (1)

Publication Number Publication Date
CN115661747A true CN115661747A (en) 2023-01-31

Family

ID=84990085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211298789.4A Pending CN115661747A (en) 2022-10-24 2022-10-24 Method for estimating quantity of stored goods based on computer vision

Country Status (1)

Country Link
CN (1) CN115661747A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116456184A (en) * 2023-06-19 2023-07-18 北京博点智合科技有限公司 Method, device, equipment and storage medium for adjusting camera mounting point positions
CN116456184B (en) * 2023-06-19 2023-09-08 北京博点智合科技有限公司 Method, device, equipment and storage medium for adjusting camera mounting point positions
CN117115413A (en) * 2023-10-25 2023-11-24 国网江苏省电力有限公司苏州供电分公司 Storage electric power safety tool quantity estimation method and system
CN117115413B (en) * 2023-10-25 2024-02-13 国网江苏省电力有限公司苏州供电分公司 Storage electric power safety tool quantity estimation method and system
CN118505126A (en) * 2024-07-19 2024-08-16 安徽三禾一信息科技有限公司 Warehouse state real-time display method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination