CN108537329B

CN108537329B - Method and device for performing operation by using Volume R-CNN neural network

Info

Publication number: CN108537329B
Application number: CN201810351549.3A
Authority: CN
Inventors: 张团; 陈云霁
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2018-04-18
Filing date: 2018-04-18
Publication date: 2021-03-23
Anticipated expiration: 2038-04-18
Also published as: CN108537329A

Abstract

A method and a device for operating by using a Volume R-CNN neural network comprise the following steps: acquiring a plurality of images of the same sample at different angles; determining a recommended region for sample detection by using the RPN; predicting the category and the frame of an object in the recommended region by using Fast R-CNN; predicting the Volume proportion occupied by each object by using Volume R-CNN according to the predicted frame of the object; calculating the Volume proportion of different classes of objects according to the object class predicted by Fast R-CNN and the object Volume proportion predicted by Volume R-CNN; wherein the RPN, Fast R-CNN and Volume R-CNN share a convolutional layer. The invention can measure complex and multi-type samples, and can identify different types of objects more accurately and quickly by adopting the artificial neural network technology and the chip.

Description

Method and device for performing operation by using Volume R-CNN neural network

Technical Field

The invention relates to the technical field of image processing, in particular to a method and a device for performing operation by using a Volume R-CNN neural network.

Background

In real life and production, the volume of each kind of object in a sample comprising different kinds of objects is often required to be measured, but a rapid and accurate measurement method is not available at present. One of the prior arts is to take a top view and a side view of a measurement sample by a mobile phone, identify the kind of an object by an artificial neural network, and calculate the volume of each object according to a formula.

The above-mentioned technology has the problems of complicated operation and high requirement for inputting photos; side views are prone to occlusion problems with objects. The method is applicable to the objects with irregular shapes, and the error of the calculation result is large.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a method for performing operation by using a Volume R-CNN neural network, which comprises the following steps:

acquiring a plurality of images of the same sample at different angles;

determining a recommended region for sample detection by using the RPN;

predicting the category and the frame of an object in the recommended region by using Fast R-CNN;

predicting the Volume proportion occupied by each object by using Volume R-CNN according to the predicted frame of the object;

calculating the Volume proportion of different classes of objects according to the object class predicted by Fast R-CNN and the object Volume proportion predicted by Volume R-CNN;

wherein the RPN, Fast R-CNN and Volume R-CNN share a convolutional layer.

Preferably, when the recommended region is determined, the RPN performs multilayer convolution operation on the input picture to extract feature mapping of the picture, performs convolution operation on the feature mapping by using a sliding window, and calculates region classification and region regression by using two branches of a classification loss function and a frame regression loss function to obtain the recommended region.

Preferably, the Fast R-CNN maps the recommended regions to the feature maps to obtain RoIs, performs pooling operation on each RoI to convert the RoI into feature maps of the same size, and then performs two full-connection network operations on the pooled RoIs respectively to calculate the object type in each recommended region and accurately predict the borders.

Preferably, the Volume R-CNN maps the predicted frame parameters to the feature map, performs pooling operation on the corresponding mapping regions to obtain sample regions with the same size, performs multilayer full-connection network operation on each sample region, and calculates a Volume intermediate variable v of each object in the graph_i，v_iIs a positive number; then the volume intermediate variable is converted into the corresponding volume proportion f_iThe calculation formula is as follows:

where i is 1, 2 … … n, n being the number of objects in the image.

Preferably, the method for mapping the predicted bounding box parameters onto the feature map comprises: each coordinate data is multiplied by the ratio of the size of the feature map and the original image.

Preferably, the loss function Volume loss in the Volume R-CNN is in the form of

Wherein f is_iFor the predicted volume fraction of each object, f_i ^*And the actual value is the label data input in training.

Preferably, the output of the neural network in the prediction process comprises: the method comprises the steps of calculating an n-dimensional vector representing the Volume proportion of each object in an image by Volume R-CNN, wherein each element is positioned in an interval (0, 1), the sum of all elements is 1, calculating an n x m matrix representing the class of each object in the image by Fast R-CNN, wherein m is the number of classes of identifiable objects, only one element in each row of the matrix is 1, the rest m-1 elements are 0, the column corresponding to the element 1 is the class of the object, and a two-dimensional array representing n 4 of the border of each object.

Preferably, the method further includes multiplying the n-dimensional vector representing the volume proportion of each object by the n × m two-dimensional array representing the category to which each object belongs to obtain a volume proportion vector of each category object, where the volume proportion vector is an m-dimensional vector, each dimension of the m-dimensional vector corresponds to one category object, and a value in each dimension represents the volume proportion occupied by the corresponding category object.

Preferably, the method further includes calculating an m-dimensional vector representing the volume ratio of each object for each image, and then adding all the m-dimensional vectors and dividing by the number of images to obtain an average vector as the final volume ratio vector of each object.

Preferably, the method further comprises an adaptive training step comprising:

step one, an RPN network initializes network parameters, and calculates a class label and a region adjustment parameter of each detection region according to the forward propagation of input image information; updating relevant parameters of the RPN by using a random gradient descent algorithm or an Adam algorithm according to back propagation, wherein the relevant parameters comprise specific partial parameters of the RPN and parameters of a shared convolution part, and training until convergence;

step two, the Fast R-CNN initializes the convolutional layer parameters by using the shared convolutional layer parameters trained in the step one, trains the recommended region obtained in the step one as the recommended region in the neural network calculation process, and updates the network parameters including the shared convolutional network until the network converges;

thirdly, the RPN continues to train and update the unique partial parameters of the RPN by using the shared convolutional network obtained in the second step, and the parameters of the shared convolutional layer are not included;

step four, the Fast R-CNN network trains by using the recommended area obtained in the step three, and only the unique part of the Fast R-CNN network is updated, and the shared convolutional layer parameters are unchanged;

mapping the object frame obtained in the fourth step to the last layer of feature mapping of the shared convolutional network by the Volume R-CNN network, and training and updating unique partial parameters until the network is converged;

the training operation of each step is that input data is subjected to network forward calculation to obtain a loss function of each part, then backward propagation is carried out, and network parameters are updated by using random gradient descent or an Adam algorithm;

wherein the above-mentioned five-step training process can be executed circularly.

In another aspect, the present invention provides an apparatus for performing operations by using a Volume R-CNN neural network, including:

an information input section for acquiring a plurality of images of the same sample at different angles;

an information processing section for processing and calculating the image;

wherein the information processing section includes:

a storage unit for storing the image;

a recommended region generation unit which determines a recommended region for sample detection using the RPN;

a category and border prediction unit which predicts a category and border of an object in the recommended region by using Fast R-CNN; and

the object Volume ratio prediction unit predicts the Volume ratio of each object by using Volume R-CNN according to the predicted frame of the object;

and the class Volume ratio predicting unit calculates the class Volume ratio according to the object class predicted by Fast R-CNN and the object Volume ratio predicted by Volume R-CNN, and averages the calculation results of different images.

Wherein the RPN, Fast R-CNN and Volume R-CNN share a convolutional layer.

Preferably, the information processing part further comprises a data conversion unit for converting the volume proportion output by the processing unit into a corresponding output.

Preferably, the device further comprises an information output part for receiving the output information from the information processing part and displaying the information.

Preferably, the device further comprises a networking component for uploading the measurement data to the database in real time, and meanwhile, the latest parameter model can be updated from the cloud.

Preferably, the information processing unit is a neural network chip.

Compared with the prior art, the invention has the following beneficial effects:

1) compared with the prior invention, more complex and various objects can be measured.

2) The artificial neural network technology and the chip are adopted, so that the identification of different objects is more accurate and faster.

3) The oblique upper overlooking picture is adopted, so that the shielding among different objects can be effectively avoided, and meanwhile, the object objects are comprehensively known.

4) The artificial neural network technology and the chip are adopted to calculate the object volume, the calculation result is more accurate, and the prediction precision is improved along with the continuous increase of training data.

5) The artificial neural network chip has strong computing power, supports offline operation of the neural network, and can realize the object volume detection and corresponding control work by offline the user terminal/front end under the condition of no cloud server for assisting in computing. When the chips are networked and the cloud server assists in computing, the computing capacity of the chips is stronger.

6) The device is simple to operate, is more intelligent, and meets the requirements of life and production.

Drawings

FIG. 1 is a block diagram of a neural network in accordance with an embodiment of the present invention;

FIG. 2 is a diagram illustrating the prediction of object types and borders according to an embodiment of the present invention;

FIG. 3 is a network structure diagram of Volume R-CNN in an embodiment of the present invention;

fig. 4 is a schematic structural diagram of the device of the present invention.

FIG. 5 is a schematic view of the structure of the device of the present invention.

Detailed Description

In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.

The invention discloses a method for operating by using a Volume R-CNN neural network, which extracts and processes key characteristics of an image by using a neural network algorithm and identifies the types of objects and the proportion of the volumes of various objects in the image.

The input image includes multiple different angle top views of the same sample.

In the processing stage of a single image, the processor calculates the input image by using an improved fast Region conditional Neural Network (false R-CNN) Network, and marks each object class (class), each object border (bounding box) and the predicted volume proportion (volume) of each object in the image. Where volume represents a two-bit decimal number from 0 to 1.

The neural network structure in the invention is improved on the basis of the Faster R-CNN network, and a part for predicting the volume ratio of an object is added. The network structure is shown in fig. 1:

the neural network can be divided into three parts: a Region pro-social Networks (RPN) network for predicting recommended regions; a Fast R-CNN network for predicting the class of objects in the image and fine-tuning the bounding box; and the Volume R-CNN network is used for predicting the Volume proportion occupied by each object in the image. The three networks share the convolutional layer to form a unified whole network.

The bounding box of an object refers to the smallest rectangular box that can contain an object of an image. Specifically, as shown in fig. 2, the ellipses and irregular figures in the drawing represent objects with different shapes, and the dotted line frame is the object frame.

FIG. 3 is a Volume R-CNN network structure in which the convolutional layer (CNN) is a shared part. The frame required for this part of the operation is obtained by the second part (Fast R-CNN). The loss function Volume loss is in the form of

The present neural network uses Region pro-possible Networks (RPN) to determine target detection candidate regions. The RPN method comprises the steps of firstly carrying out multilayer convolution operation on an input image, extracting feature maps (feature maps) of the image, then carrying out convolution operation on the feature maps by using a 3-by-3 sliding window, and then calculating region classification and region regression by using two branches to obtain a recommended region. The region classification is to judge the probability that the prediction region belongs to the foreground and the background; the parameters of the recommended region here are parameters with respect to the original input image.

In order to predict the object type of each recommended area and refine the object frame, the recommended area is mapped to the feature maps to obtain rois (region of interest), and then pooling operation is performed on each RoI to convert into feature maps with the same size. And then, two full-connection network operations can be respectively carried out on the pooled RoIs, the object class of each area is calculated, and accurate prediction is carried out on the frame.

Finally, mapping the frame parameters of the frame branch prediction to feature maps, and performing pooling operation on the corresponding mapping area to obtain areas with the same size. And performing multilayer full-connection operation on each target area, and calculating a volume intermediate variable of each object, wherein the intermediate variable is a positive number and does not represent the volume of the object. Each target region comprises an object corresponding to a volume intermediate variable v_iThen converting the intermediate volume variable into the corresponding ratio f_i(ii) a The concrete formula is as follows:

where i is 1, 2 … … n, n being the number of objects in the image. f. of_iThe number of (2) is the number n of objects in the image, so the object volume proportion can be output as a vector containing n elements. The output of the object class prediction is a matrix of n x m, wherein m is the number of the object classes, each row vector of the matrix has only one element of 1, and the rest elements are 0; the number of columns in which element 1 is located corresponds to the category of the object. The output of the object frame prediction branch is a two-dimensional matrix of n x 4, and the elements of each row respectively correspond to the center coordinates (x, y) and the height (h, w) of the frame. The operation of mapping the frame from the original image to the feature map is as follows: each coordinate data is multiplied by the ratio of the size of the feature map and the original image.

Therefore, the output of the neural network in the prediction process is: an n-dimensional vector representing the volume fraction of each object, each element being located within the interval [0, 1] and the sum of the elements being 1; an n x m two-dimensional array representing the class to which each object belongs; a two-dimensional array of n x 4 representing the bounding box of each object. Then, the n-dimensional vector representing the volume proportion of each object is multiplied by the two-dimensional array representing the class to which each object belongs to obtain the volume proportion vector of each object, which is an m-dimensional vector. Each dimension of the m-dimension vector corresponds to one class of objects, and the numerical value on each dimension represents the volume proportion occupied by the corresponding class of objects.

The method of the present invention further includes calculating an m-dimensional vector representing the volume proportion of each type of object for each image in a set of images (photographs of the same sample at different angles), then adding all the m-dimensional vectors and dividing by the number of images in each set to obtain an average vector as the final object type volume proportion vector.

In one embodiment of the invention, the method is used to calculate the energy and nutrient content of a food.

In this embodiment, the method further comprises multiplying the calculated food category volume proportion vector by the food category density vector in bits to obtain a food category mass proportion vector, and multiplying the food category mass proportion vector by the input total mass of the food to obtain a food category mass vector, each bit representing the mass of the food of the corresponding category.

In this embodiment, the method further comprises multiplying the m-dimensional food category mass vector by the corresponding food category nutrient content matrix to obtain a food nutrient content vector, each bit representing a certain nutrient content in the food. Wherein the food category nutrient content matrix is a two-dimensional matrix of m x q, wherein q is the number of nutrient element types measurable by the system. Each row of the food category nutrient content matrix corresponds to a type of food, and each column corresponds to a nutrient element and represents the content of the nutrient element contained in the unit mass of each type of food. The finally obtained vector of the content of the nutrient elements of the food is a q-dimensional vector.

The method of the present invention also includes a method of adaptively training an information processing apparatus.

The input data is an image with mark data, the mark data corresponding to each image is the category (n-dimensional vector) of each object in the image, the frame information (n-4 two-dimensional matrix) of each object and the volume proportion (n-dimensional vector) occupied by each object; where n is the total number of different objects in the image. The processing unit performs preprocessing on the input data information, for example, if the object type information is a character, the object type information is converted into a number corresponding to the type.

The training process is divided into five steps, namely RPN network, Fast R-CNN network for object class detection and frame regression, and network cross training for predicting the volume proportion of the object.

Step one, an RPN network initializes network parameters, and calculates a class label and a region adjustment parameter of each detection region according to the forward propagation of input image information; and updating relevant parameters of the RPN by using a random gradient descent algorithm or an Adam algorithm according to back propagation, wherein the relevant parameters comprise specific partial parameters of the RPN and parameters of a shared convolution part. Training until convergence.

And secondly, initializing convolutional layer parameters by the Fast R-CNN by using the shared convolutional layer parameters trained in the step one, training the recommended area obtained in the step one as a recommended area in the network computing process, and updating network parameters including the shared convolutional network. Until the network converges.

And step three, the RPN continues to train and update the unique partial parameters of the RPN by using the shared convolutional network obtained in the step two, and the parameters of the shared convolutional layer are not included.

And step four, the Fast R-CNN network trains by using the recommended area obtained in the step three, and only the unique part of the Fast R-CNN network is updated, and the shared convolutional layer parameters are unchanged.

And step five, the Volume R-CNN network maps the object frame obtained in the step four to the last layer of feature mapping of the shared convolutional network, and trains and updates the unique partial parameters until the network converges.

The training operation of each step is to forward calculate the input data through the network to obtain the loss function of each part, then to reversely propagate, and to update the network parameters by using the random gradient descent or Adam algorithm.

The five-step training process described above may be performed in a loop.

On the other hand, the present invention further provides a device for performing an operation by using a Volume R-CNN neural network, as shown in fig. 4, including:

an information processing section for processing and calculating the image;

wherein the information processing section includes:

a storage unit for storing the image;

the class Volume ratio predicting unit is used for calculating the class Volume ratio according to the object class predicted by Fast R-CNN and the object Volume ratio predicted by Volume R-CNN, and averaging the calculation results of different images;

wherein the RPN, Fast R-CNN and Volume R-CNN share a convolutional layer.

The device also comprises an information output part which is used for receiving the output information from the information processing part and displaying the information.

In one embodiment of the invention, the device for calculating the energy and the nutritional ingredients of the food based on the method comprises an information input component, an information processing component and an information output component, and is shown in fig. 5.

The information input part comprises one or more cameras and is used for inputting a group of food overlooking images at different angles; a quality measuring device for measuring the quality of the food and transmitting it to the processing unit.

The information processing part comprises a storage unit and a data processing unit, wherein the storage unit is used for receiving and storing input data, instructions and output data, wherein the input data comprises a group of images and a positive number (food quality); the data processing unit firstly utilizes the neural network to extract and process key features contained in input data, a vector for representing the content of nutrient elements in food is generated for each image, and for the same group of images, the average value of the corresponding vectors of all the images is calculated to be used as the final nutrient content vector of the tested food.

The information processing component also comprises a data conversion module for converting the q-dimensional nutrient content vector output by the processing unit into corresponding output, wherein the output can be in the form of a table or a pie chart.

The information output section includes a liquid crystal display which receives output information from the information processing section and displays the information.

The information processing part controls an output result on the screen according to the predicted food nutrient content vector (q-dimensional vector). The data conversion processor converts the q-dimensional vector into corresponding storage information in the format of: name and content of nutrient elements. The names of the nutrient elements can be correspondingly obtained by each index subscript of the q-dimensional vector, and 0 element in the vector is ignored. In addition, the device can store or network to obtain the daily recommended nutrient element intake of people of all ages, and evaluate the test food, namely, the content of various nutrient elements in the food is too high or too low compared with the content required by each meal of the human body, and reasonable diet suggestions are given. And finally, outputting the content of the nutrient elements in the food and the diet suggestion on a display screen. The output form of the nutrient content can be a table and a pie chart.

The device can also include the networking part, can be connected to the internet, uploads the database with measured data in real time, enlarges the data bulk, also can follow the newest parameter model of high in the clouds update simultaneously, improves computational efficiency and precision.

The data processing unit adopts a neural network chip, is suitable for neural network calculation and has strong calculation capability.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for operating by using a Volume R-CNN neural network comprises the following steps:

acquiring a plurality of images of the same sample at different angles;

extracting feature mapping of the picture by using an RPN (resilient packet network), and determining a recommended region for sample detection;

mapping the predicted frame parameters onto the feature map by using Volume R-CNN, performing pooling operation on corresponding mapping regions to obtain sample regions with the same size, performing multilayer full-connection network operation on each sample region, and calculating Volume intermediate variable v of each object in the graph_i ， v_iIs a positive number; then the volume intermediate variable is converted into the corresponding volume proportion f_iThe calculation formula is as follows:

where i is 1, 2 … … n, n is the number of objects in the image, where the loss function Volume loss in the Volume R-CNN is in the form of

Wherein f is_iFor the predicted volume fraction of each object, f_i ^*The actual value is the label data input during training;

wherein the RPN, Fast R-CNN and Volume R-CNN share a convolutional layer.

2. The method according to claim 1, wherein when determining the recommended region, the RPN performs a multi-layer convolution operation on the input picture to extract the feature mapping of the picture, performs a convolution operation on the feature mapping using a sliding window, and then calculates the region classification and the region regression using two branches of a classification loss function and a bounding box regression loss function to obtain the recommended region.

3. The method according to claim 1, wherein the Fast R-CNN maps recommended regions to the feature maps to obtain RoIs, performs pooling operation on each RoI to convert into feature maps of the same size, and then performs two full-connection network operations on the RoIs after the pooling operation, respectively, calculates object classes in each recommended region, and accurately predicts borders.

4. The method of claim 3, wherein the predicted bounding box parameters are mapped onto the feature map by: each coordinate data is multiplied by the ratio of the size of the feature map and the original image.

5. The method of claim 1, wherein predicting the output of the in-process neural network comprises: the method comprises the steps of calculating an n-dimensional column vector which represents the Volume proportion of each object in an image by Volume R-CNN, wherein each element is positioned in an interval (0, 1), the sum of all elements is 1, calculating an n x m matrix which represents the class of each object in the image by Fast R-CNN, m is the number of classes of identifiable objects, only one element in each row of the matrix is 1, the rest m-1 elements are 0, the column corresponding to the element 1 is the class of the object, and calculating a two-dimensional array of n x 4 which represents the frame of each object.

6. The method of claim 1, wherein the method further comprises multiplying an n-dimensional column vector representing the volume fraction of each object by an n x m two-dimensional array representing the class to which each object belongs to obtain a volume fraction vector of each class object, wherein the volume fraction vector is an m-dimensional row vector, each dimension of the m-dimensional row vector corresponds to one class object, and the value in each dimension represents the volume fraction occupied by the corresponding class object.

7. The method of claim 1, further comprising calculating an m-dimensional row vector for each image representing the volumetric proportion of each object, and then adding all m-dimensional row vectors and dividing by the number of images to find the average vector as the final volumetric proportion vector for each object.

8. The method of claim 1, wherein the method further comprises an adaptive training step comprising:

9. An apparatus for performing operations using a Volume R-CNN neural network, comprising:

an information processing section for processing and calculating the image;

wherein the information processing section includes:

a storage unit for storing the image;

the recommended region generating unit extracts feature mapping of the picture by using the RPN and determines a recommended region of sample detection;

the object Volume ratio prediction unit maps the predicted frame parameters to the feature map by using Volume R-CNN, and then performs pooling on the corresponding mapping areaOperating to obtain equal-sized sample regions, performing multilayer full-connection network operation on each sample region, and calculating volume intermediate variable v of each object in the graph_i，v_iIs a positive number; then the volume intermediate variable is converted into the corresponding volume proportion f_iThe calculation formula is as follows:

wherein the RPN, Fast R-CNN and Volume R-CNN share a convolutional layer.

10. The apparatus of claim 9, wherein the information processing section further comprises a data conversion unit for converting the volume proportion output by the processing unit into a corresponding output.

11. The apparatus of claim 9, wherein the apparatus further comprises an information output section for receiving output information from the information processing section and displaying the information.

12. The apparatus of claim 9, wherein the apparatus further comprises a networking component for uploading the measurement data to a database in real time, while the latest parametric model is also updated from a cloud.

13. The apparatus according to claim 9, wherein the information processing section is a neural network chip.