CN108597582B - Method and device for executing fast R-CNN neural network operation - Google Patents


Info

Publication number
CN108597582B
CN108597582B · CN201810352111.7A
Authority
CN
China
Prior art keywords
food
volume
cnn
network
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810352111.7A
Other languages
Chinese (zh)
Other versions
CN108597582A (en)
Inventor
张团 (Zhang Tuan)
陈云霁 (Chen Yunji)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201810352111.7A priority Critical patent/CN108597582B/en
Publication of CN108597582A publication Critical patent/CN108597582A/en
Application granted granted Critical
Publication of CN108597582B publication Critical patent/CN108597582B/en
Status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 20/00 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H 20/60 ICT specially adapted for therapies or health-improving plans relating to nutrition control, e.g. diets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

A method and apparatus for performing fast R-CNN neural network operations. The method comprises: acquiring a plurality of images of the same portion of food at different angles; determining recommended regions for sample detection using the RPN; predicting the category and bounding box of each food object in the recommended regions using Fast R-CNN; predicting the volume proportion of each food object using Volume R-CNN according to the predicted food bounding boxes; calculating the volume proportions of the different types of food from the food object categories and the food object volume proportions; multiplying the calculated volume proportion of each kind of food by the density of that kind of food to obtain the mass proportion of each kind of food; multiplying the mass proportion of each kind of food by the total mass of the food to obtain the mass of each kind of food; and multiplying the mass of each food by its corresponding nutrient content to obtain the nutrient element content of the food. The invention can measure complex and varied foods, and by adopting artificial neural network technology and a dedicated chip the food can be identified more accurately and rapidly.

Description

Method and device for executing fast R-CNN neural network operation
Technical Field
The invention relates to the technical field of image processing, in particular to a method and a device for executing fast R-CNN neural network operation.
Background
With the quickening pace of life and rising living standards in modern society, people's demands on their diet are ever higher. People no longer care merely about eating their fill, but about whether their diet is healthy. However, many people lack sufficient knowledge of dietary health, so there is a need for an apparatus that can intelligently measure the energy and nutritional ingredients of food to help people eat more properly.
One prior-art approach calculates food energy by weight. The metering device mainly comprises a tray, a weight measuring device, and a display screen. The weight measuring device measures the weight of the food and transmits the weight information to a microcomputer processor; the processor calculates the energy of the food and displays the result on the liquid crystal display screen.
The problem with this technique is that its processing power is weak: it is suitable only for measuring the energy of a single kind of food. For food containing multiple categories the measurement is inaccurate, and the corresponding nutrient content cannot be calculated. Its food information is not expandable, so foods that have not been recorded cannot be processed.
Another prior art captures a top view and a side view of the measured food with a mobile phone, identifies the food type with an artificial neural network, calculates the volume of each food according to a formula, and then calculates the nutrient content from the food volume.
This technology suffers from complicated operation and strict requirements on the input photos, and side views are prone to occlusion between foods. Parameters such as the length and width of the food are predicted from the focal length, which may deviate between different mobile phones; and calculating food volume by formula is unsuitable for irregularly shaped food, so the calculation error is large.
Disclosure of Invention
To solve the problems in the prior art, in one aspect, the present invention provides a method for performing fast R-CNN neural network operations, comprising:
acquiring a plurality of images of the same portion of food at different angles;
determining a recommended region for sample detection by using the RPN;
predicting the category and the frame of the food object in the recommended area by using Fast R-CNN;
predicting the Volume proportion of each food object by using Volume R-CNN according to the predicted food frame;
calculating the Volume proportion of different types of food according to the food object type predicted by Fast R-CNN and the Volume proportion of the food object predicted by Volume R-CNN;
respectively multiplying the calculated volume proportion of each kind of food by the density of that kind of food to obtain the mass proportion of each kind of food;
multiplying the mass proportion of each kind of food by the total mass of the food to obtain the mass of each kind of food;
multiplying the mass of each food by its corresponding nutrient content to obtain the content of the nutrient elements of the food;
wherein the RPN, Fast R-CNN and Volume R-CNN share a convolutional layer.
Preferably, when the recommended region is determined, the RPN performs multilayer convolution operation on the input picture to extract feature mapping of the picture, performs convolution operation on the feature mapping by using a sliding window, and calculates region classification and region regression by using two branches of a classification loss function and a frame regression loss function to obtain the recommended region.
Preferably, the Fast R-CNN maps the recommended regions to the feature maps to obtain RoIs, performs pooling operation on each RoI to convert into feature maps of the same size, and then performs two full-connection network operations on the pooled RoIs respectively to calculate the food object category in each recommended region and accurately predict the frame.
Preferably, the Volume R-CNN maps the predicted bounding-box parameters onto the feature map, performs a pooling operation on the corresponding mapped regions to obtain sample regions of equal size, performs a multilayer fully connected network operation on each sample region, and computes a volume intermediate variable $v_i$ for each food object in the image, where $v_i$ is a positive number; the volume intermediate variable is then converted into the corresponding volume proportion $f_i$ by:

$$f_i = \frac{v_i}{\sum_{j=1}^{n} v_j}$$

where $i = 1, 2, \dots, n$ and $n$ is the number of food objects in the image.
Preferably, the method for mapping the predicted bounding-box parameters onto the feature map is: each coordinate is multiplied by the ratio of the feature-map size to the original image size.
Preferably, the loss function Volume loss in the Volume R-CNN takes the form

$$L_{volume} = \sum_{i=1}^{n} \left(f_i - f_i^*\right)^2$$

where $f_i$ is the predicted volume proportion of each food object and $f_i^*$ is the ground-truth value, i.e. the label data input during training.
Preferably, the output of the neural network in the prediction process comprises: an n-dimensional vector, computed by Volume R-CNN, representing the volume proportion of each food object in the image, where each element lies in the interval (0, 1) and the elements sum to 1; an n × m matrix, computed by Fast R-CNN, representing the category of each food object in the image, where m is the number of identifiable food object categories, each row of the matrix has exactly one element equal to 1 and the remaining m − 1 elements equal to 0, and the column of the element 1 indicates the category of the food object; and an n × 4 two-dimensional array representing the bounding box of each food object.
Preferably, the method further includes multiplying the n-dimensional vector representing the volume proportion of each food object by the n × m two-dimensional array representing the category to which each food object belongs, obtaining the volume proportion vector of each category of food: an m-dimensional vector in which each dimension corresponds to one category of food and whose value in each dimension represents the volume proportion occupied by that category.
Preferably, the method further comprises calculating an m-dimensional vector representing the volume proportion of each category of food for each image, then adding all the m-dimensional vectors and dividing by the number of the images to obtain an average vector as the final volume proportion vector of each category of food.
Preferably, the method further comprises an adaptive training step comprising:
step one, the RPN network initializes its parameters and computes a class label and region adjustment parameters for each detection region by forward propagation of the input image information; relevant parameters of the RPN, comprising the RPN-specific parameters and the parameters of the shared convolution part, are updated by back propagation using a stochastic gradient descent algorithm or the Adam algorithm, training until convergence;
step two, Fast R-CNN initializes its convolutional layer parameters with the shared convolutional layer parameters trained in step one, trains using the recommended regions obtained in step one as the recommended regions in the neural network computation, and updates the network parameters, including the shared convolutional network, until the network converges;
step three, using the shared convolutional network obtained in step two, the RPN continues training and updates only its own parameters, excluding the shared convolutional layer parameters;
step four, the Fast R-CNN network trains with the recommended regions obtained in step three, updating only its own part while the shared convolutional layer parameters remain unchanged;
step five, the Volume R-CNN network maps the food object bounding boxes obtained in step four onto the last feature map of the shared convolutional network, and trains and updates its own parameters until the network converges;
the training operation of each step forward-computes the input data through the network to obtain the loss function of each part, then back-propagates and updates the network parameters using stochastic gradient descent or the Adam algorithm;
wherein the above five-step training process can be executed in a loop.
In another aspect, the invention provides an apparatus for performing fast R-CNN neural network operations, comprising
The information input part is used for acquiring a plurality of images of the same food in different angles, the total mass of the food, the density of different types of food in the food and the content of nutrient elements;
an information processing section for processing and calculating the image;
wherein the information processing section includes:
the storage unit is used for storing the image, the total mass, the density and the content of the nutrient elements;
a recommended region generation unit which determines a recommended region for sample detection using the RPN;
a category and frame prediction unit that predicts categories and frames of food objects in the recommended area using Fast R-CNN;
a food object Volume ratio prediction unit which predicts the Volume ratio of each food object in the image by using Volume R-CNN according to the predicted frame of the food object;
the food category Volume ratio prediction unit is used for calculating the Volume ratio of each category of food according to the food object category predicted by Fast R-CNN and the food object Volume ratio predicted by Volume R-CNN, and averaging the calculation results of different images;
the mass ratio prediction unit is used for multiplying the calculated volume ratio of different types of food and the density of different types of food respectively to obtain the mass ratio of different types of food;
the food quality prediction unit multiplies the mass proportion of different types of food by the total mass of the food to obtain the mass of the different types of food; and
the nutrition content prediction unit multiplies the quality of each food by the corresponding nutrition content to obtain the content of the food nutrient elements;
wherein the RPN, Fast R-CNN and Volume R-CNN share a convolutional layer.
Preferably, the information input section includes an image input device and a mass input device.
Preferably, the information processing part further comprises a data conversion unit for converting the q-dimensional nutrient content vector output by the processing unit into a corresponding output.
Preferably, the device further comprises an information output part for receiving the output information from the information processing part and displaying the information.
Preferably, the device further comprises a networking component for uploading the measurement data to the database in real time, and meanwhile, the latest parameter model can be updated from the cloud.
Preferably, the information processing unit is a neural network chip.
Compared with the prior art, the invention has the following beneficial effects:
1) compared with the prior invention, more complex and various foods can be measured.
2) The food identification is more accurate and rapid by adopting the artificial neural network technology and the chip.
3) Top-view pictures are taken obliquely from above, which effectively avoids occlusion between different foods and gives a comprehensive view of the objects.
4) The food volume is calculated by adopting an artificial neural network technology and a chip, the calculation result is more accurate, and the prediction precision is improved along with the continuous increase of training data.
5) The artificial neural network chip has strong computing power and supports offline operation of the neural network, so the user terminal/front end can detect food nutrient components and perform corresponding control offline, without a cloud server assisting the computation. When the chip is networked and a cloud server assists the computation, its computing capability is even stronger.
6) The device is simple to operate, is more intelligent, and meets the daily life requirements of people.
7) It can provide more reasonable suggestions for people's daily diet and improve their quality of life.
Drawings
FIG. 1 is a block diagram of a neural network in accordance with an embodiment of the present invention;
FIG. 2 is a diagram illustrating the prediction of food object categories and borders in an embodiment of the present invention;
FIG. 3 is a network structure diagram of Volume R-CNN in an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
The invention discloses a method for executing fast R-CNN neural network operations, which mainly comprises: extracting and processing the key features of the image, and identifying the types of food and the volume proportion of each food in the image; calculating the mass proportion of each kind of food according to its density; and finally, the processing unit calculating the actual mass of each kind of food under test from the mass proportions and the total mass, from which the energy and nutrient content of the food are obtained by combining the element content of each kind of food.
The input image includes a plurality of top-view photographs from different angles for the same serving of food.
In the processing stage of a single image, the processor runs the input image through a modified Faster Region-based Convolutional Neural Network (Faster R-CNN) and marks, for each food in the image, its class, its bounding box, and its predicted volume proportion (volume), where volume is a decimal between 0 and 1, given to two decimal places.
The neural network structure in the invention is improved on the basis of the Faster R-CNN network, and a part for predicting the volume ratio of food is added. The network structure is shown in fig. 1:
the neural network can be divided into three parts: a Region pro-social Networks (RPN) network for predicting recommended regions; a Fast R-CNN network for predicting the class of objects in the image and fine-tuning the bounding box; volume RCNN network for predicting the Volume fraction occupied by individual food subjects in an image. The three networks share the convolutional layer to form a unified whole network.
The bounding box of a food is the smallest rectangular box that can enclose the image of that food. As shown in fig. 2, the oval and irregular figures represent foods of different shapes, and the dashed boxes are the food bounding boxes.
FIG. 3 shows the Volume R-CNN network structure, in which the convolutional layers (CNN) are the shared part. The bounding boxes required for this part of the operation are obtained from the second part (Fast R-CNN). The loss function Volume loss takes the form

$$L_{volume} = \sum_{i=1}^{n} \left(f_i - f_i^*\right)^2$$

where $f_i$ is the predicted volume proportion of each object and $f_i^*$ is the ground-truth value, i.e. the label data input during training.
The neural network uses a Region Proposal Network (RPN) to determine target-detection recommended regions. The RPN first performs multilayer convolution operations on the input image to extract its feature maps, then performs a convolution operation on the feature maps with a 3×3 sliding window, and then computes region classification and region regression with two branches to obtain the recommended regions. The region classification judges the probability that a predicted region belongs to foreground or background; the parameters of the recommended regions here are expressed with respect to the original input image.
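As a concrete illustration of the two-branch head just described, the following is a minimal sketch assuming PyTorch; the channel count (512) and anchor count (9) are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal RPN head: a 3x3 convolution slides over the shared feature
    maps, followed by two 1x1 branches computing region classification
    (foreground/background scores) and region regression (box adjustments)."""
    def __init__(self, in_channels=512, num_anchors=9):  # sizes are assumptions
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_anchors * 2, kernel_size=1)
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feats):
        h = torch.relu(self.conv(feats))
        return self.cls(h), self.reg(h)

# feature maps as produced by the shared convolutional layers (sizes made up)
scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
```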
In order to predict the food category of each recommended region and fine-tune the food bounding box, the recommended regions are mapped onto the feature maps to obtain RoIs (Regions of Interest), and a pooling operation is then performed on each RoI to convert it into a feature map of the same size. Two fully connected network operations can then be performed on the pooled RoIs, computing the food category of each region and accurately predicting the bounding box.
Finally, the bounding-box parameters predicted by the box branch are mapped onto the feature maps, and a pooling operation is performed on the corresponding mapped regions to obtain regions of equal size. A multilayer fully connected operation is performed on each target region to compute a volume intermediate variable for each food; this intermediate variable is a positive number and does not itself represent the food volume. Each target region contains one food corresponding to a volume intermediate variable $v_i$, which is then converted into the corresponding proportion $f_i$ by the formula:

$$f_i = \frac{v_i}{\sum_{j=1}^{n} v_j}$$

where $i = 1, 2, \dots, n$ and $n$ is the number of food objects in the image. There are as many values $f_i$ as there are foods in the image (n), so the food volume proportions can be output as a vector of n elements. The predicted food-category output is an n × m two-dimensional matrix, where m is the number of recognizable food categories; each row vector of the matrix has exactly one element equal to 1 and the rest equal to 0, and the column in which the element 1 lies indicates the food's category. The output of the food bounding-box prediction branch is an n × 4 two-dimensional matrix whose rows hold the box center coordinates (x, y) and the height and width (h, w). Mapping a bounding box from the original image onto the feature map consists of multiplying each coordinate by the ratio of the feature-map size to the original image size.
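The volume branch just described can be sketched as follows, assuming PyTorch and torchvision; the 7×7 pooling size, the single fully connected layer standing in for the multilayer network, the exp used to keep each $v_i$ positive, and the corner box format are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

def volume_branch(feature_map, boxes, img_size, feat_size, fc):
    """Map predicted boxes onto the feature map (multiply coordinates by
    feat_size / img_size), pool each mapped region to a fixed size, compute
    a positive intermediate variable v_i per food, and normalize to f_i."""
    scale = feat_size / img_size                  # the coordinate-mapping ratio
    pooled = roi_pool(feature_map, [boxes], output_size=(7, 7), spatial_scale=scale)
    v = torch.exp(fc(pooled.flatten(1))).squeeze(1)   # v_i > 0 by construction
    return v / v.sum()                            # f_i = v_i / sum_j v_j

fc = nn.Linear(512 * 7 * 7, 1)                    # stand-in for the multilayer FC net
boxes = torch.tensor([[48., 80., 240., 260.],     # (x1, y1, x2, y2) corner format
                      [300., 60., 560., 300.]])
f = volume_branch(torch.randn(1, 512, 38, 38), boxes,
                  img_size=608.0, feat_size=38.0, fc=fc)
print(f, f.sum())                                 # proportions summing to 1
```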
Therefore, the outputs of the neural network in the prediction process are: an n-dimensional vector representing the volume proportion of each food, with each element in the interval [0, 1] and the elements summing to 1; an n × m two-dimensional array representing the category to which each food belongs; and an n × 4 two-dimensional array representing each food's bounding box. The n-dimensional volume-proportion vector is then multiplied by the two-dimensional category array to obtain the volume proportion vector of each category of food, an m-dimensional vector. Each dimension of this m-dimensional vector corresponds to one category of food, and its value represents the volume proportion of the corresponding category.
The method of the invention further comprises computing, for each image in a group of images (photographs of the same dish of food from different angles), an m-dimensional vector representing the volume proportion of each category of food, then adding all the m-dimensional vectors and dividing by the number of images in the group to obtain an average vector as the final per-category volume proportion vector, as sketched below.
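A minimal numeric sketch of this combination-and-averaging step, with made-up proportions and categories (three objects, m = 3 categories in image 1; two objects in image 2):

```python
import numpy as np

def category_volume_vector(f, onehot):
    # (n,) per-object proportions @ (n, m) one-hot categories -> (m,) per category
    return f @ onehot

f1 = np.array([0.5, 0.3, 0.2])                    # image 1: three food objects
c1 = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 1]])  # their one-hot categories
f2 = np.array([0.55, 0.45])                       # image 2 of the same dish
c2 = np.array([[1, 0, 0], [0, 0, 1]])
per_image = [category_volume_vector(f1, c1), category_volume_vector(f2, c2)]
final = np.mean(per_image, axis=0)                # average over the image group
print(final)                                      # [0.525 0.    0.475]
```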
The method of the invention further comprises multiplying the computed food-category volume proportion vector element-wise by the food-category density vector to obtain the food-category mass proportion vector, and then multiplying the food-category mass proportion vector by the total mass of the input food to obtain the food-category mass vector, in which each element represents the mass of the corresponding category of food.
The method further comprises multiplying the m-dimensional food-category mass vector by the corresponding food-category nutrient content matrix to obtain the food nutrient content vector, in which each element represents the content of one nutrient element in the food. The food-category nutrient content matrix is an m × q two-dimensional matrix, where q is the number of nutrient element types measurable by the system. Each row of the matrix corresponds to one category of food and each column to one nutrient element, giving the content of that nutrient element per unit mass of that category of food. The resulting food nutrient element content vector is a q-dimensional vector.
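The volume-to-mass-to-nutrient chain above amounts to a few vector operations. The sketch below uses made-up densities, masses, and nutrient contents, and makes explicit a renormalization after the density multiplication so that the mass proportions sum to 1 (an assumption the text leaves implicit).

```python
import numpy as np

volume_ratio = np.array([0.525, 0.0, 0.475])   # m-dim per-category volume proportions
density = np.array([1.05, 0.60, 0.90])         # g/cm^3 per category (made up)

mass_ratio = volume_ratio * density            # element-wise multiply by density
mass_ratio /= mass_ratio.sum()                 # renormalize so proportions sum to 1
mass = mass_ratio * 450.0                      # total measured food mass, in grams

# m x q nutrient content matrix: nutrient per unit mass of each food category
nutrient_per_gram = np.array([[0.20, 0.05, 0.002],
                              [0.02, 0.01, 0.000],
                              [0.03, 0.10, 0.001]])
nutrient_vector = mass @ nutrient_per_gram     # q-dim nutrient element content
print(mass, nutrient_vector)
```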
The method of the present invention also includes a method of adaptively training an information processing apparatus.
The input data are images with label data; the label data corresponding to each image are the category of each food in the image (an n-dimensional vector), the bounding-box information of each food (an n × 4 two-dimensional matrix), and the volume proportion occupied by each food (an n-dimensional vector), where n is the total number of food objects in the image. The processing unit preprocesses the input data information; for example, if the food category information is given as text, it is converted into the number corresponding to that category.
The training process is divided into five steps, in which the RPN, the Fast R-CNN network for food category detection and bounding-box regression, and the network predicting the food volume proportions are cross-trained.
Step one, the RPN network initializes its parameters and computes a class label and region adjustment parameters for each detection region by forward propagation of the input image information; relevant parameters of the RPN, comprising the RPN-specific parameters and the parameters of the shared convolution part, are updated by back propagation using a stochastic gradient descent algorithm or the Adam algorithm. Training proceeds until convergence.
Step two, Fast R-CNN initializes its convolutional layer parameters with the shared convolutional layer parameters trained in step one, trains using the recommended regions obtained in step one as the recommended regions in the network computation, and updates the network parameters, including the shared convolutional network, until the network converges.
Step three, using the shared convolutional network obtained in step two, the RPN continues training and updates only its own parameters, excluding the shared convolutional layer parameters.
Step four, the Fast R-CNN network trains with the recommended regions obtained in step three, updating only its own part while the shared convolutional layer parameters remain unchanged.
Step five, the Volume R-CNN network maps the food bounding boxes obtained in step four onto the last feature map of the shared convolutional network, and trains and updates its own parameters until the network converges.
The training operation of each step forward-computes the input data through the network to obtain the loss function of each part, then back-propagates and updates the network parameters using stochastic gradient descent or the Adam algorithm.
The five-step training schedule described above may be executed in a loop, as sketched below.
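The schedule reduces to bookkeeping over which parameter groups each step updates. The following runnable sketch encodes only that bookkeeping; the train() stub stands in for "forward pass, loss, back propagation, and SGD/Adam updates until convergence", and the function and group names are placeholders, not an API from the patent.

```python
def train(step, updated_params):
    # placeholder for: forward pass, loss, backprop, SGD/Adam until convergence
    print(f"step {step}: updating {updated_params}")

def alternating_training(rounds=1):
    for _ in range(rounds):                     # the five steps may be looped
        train(1, ["RPN head", "shared conv layers"])
        train(2, ["Fast R-CNN head", "shared conv layers"])  # uses step-1 proposals
        train(3, ["RPN head"])                  # shared conv layers now frozen
        train(4, ["Fast R-CNN head"])           # uses step-3 proposals, conv frozen
        train(5, ["Volume R-CNN head"])         # uses step-4 boxes, conv frozen

alternating_training()
```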
The invention also provides a device for executing the Faster R-CNN neural network operation, which comprises an information input component, an information processing component and an information output component, as shown in FIG. 4.
The information input part comprises one or more cameras for inputting a group of top-view food images from different angles, and a mass measuring device for measuring the mass of the food and transmitting it to the processing unit.
The information processing part comprises a storage unit and a data processing unit. The storage unit receives and stores the input data, instructions and output data, where the input data comprise a group of images and a positive number (the food mass). The data processing unit first uses the neural network to extract and process the key features contained in the input data, generating for each image a vector representing the content of nutrient elements in the food; for the same group of images, the average of the vectors of all images is computed as the final nutrient content vector of the tested food.
The information processing component also comprises a data conversion module for converting the q-dimensional nutrient content vector output by the processing unit into corresponding output, wherein the output can be in the form of a table or a pie chart.
The information output section includes a liquid crystal display which receives output information from the information processing section and displays the information.
The information processing part controls the output shown on the screen according to the predicted food nutrient content vector (a q-dimensional vector). The data conversion processor converts the q-dimensional vector into corresponding stored information in the format: nutrient element name and content. The nutrient element names are obtained from the index subscripts of the q-dimensional vector, and zero elements in the vector are ignored. In addition, the device can store, or obtain over the network, the recommended daily nutrient element intake for people of all ages and evaluate the tested food, i.e. indicate which nutrient elements in the food are too high or too low compared with the amounts the human body needs per meal, and give reasonable dietary suggestions. Finally, the nutrient element content of the food and the dietary suggestions are output on the display screen; the nutrient content can be output as a table or a pie chart.
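A small sketch of this data-conversion step, assuming hypothetical nutrient names and values; it renders the q-dimensional vector as "name: content" lines and skips zero elements as described.

```python
NUTRIENT_NAMES = ["protein", "fat", "carbohydrate", "sodium"]  # index -> name (made up)

def format_nutrients(vector, unit="g"):
    lines = []
    for name, value in zip(NUTRIENT_NAMES, vector):
        if value == 0:                     # zero elements in the vector are ignored
            continue
        lines.append(f"{name}: {value:.1f} {unit}")
    return "\n".join(lines)

print(format_nutrients([31.5, 12.0, 0.0, 1.2]))
```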
The device may also include a networking component that connects to the Internet, uploads measured data to the database in real time to enlarge the data volume, and can likewise update the latest parameter model from the cloud to improve computational efficiency and accuracy.
The data processing unit adopts a neural network chip, is suitable for neural network calculation and has strong calculation capability.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (14)

1. A method for performing fast R-CNN neural network operations, comprising:
acquiring a plurality of images of the same portion of food at different angles;
determining a recommended region for sample detection by using the RPN;
predicting the category and the frame of the food object in the recommended area by using Fast R-CNN;
predicting the Volume proportion of each food object by using Volume R-CNN according to the predicted food frame;
calculating the Volume proportion of different types of food according to the food object type predicted by Fast R-CNN and the Volume proportion of the food object predicted by Volume R-CNN;
respectively multiplying the calculated volume proportion of each kind of food by the density of that kind of food to obtain the mass proportion of each kind of food;
multiplying the mass proportion of each kind of food by the total mass of the food to obtain the mass of each kind of food;
multiplying the mass of each food by its corresponding nutrient content to obtain the content of the nutrient elements of the food;
wherein the RPN, Fast R-CNN and Volume R-CNN share a convolutional layer;
the Volume R-CNN maps the predicted frame parameters to feature maps of pictures extracted by RPN, performs pooling operation on corresponding mapping areas to obtain sample areas with the same size, performs multilayer full-connection network operation on each sample area, and calculates a Volume intermediate variable v of each food object in the graphi,viIs a positive number; then the volume intermediate variable is converted into the corresponding volume proportion fiThe calculation formula is as follows:
Figure DEST_PATH_IMAGE002
wherein i =1,2 … … n, n being the number of food objects in the image;
the loss function Volume loss in the Volume R-CNN is in the form of
Figure DEST_PATH_IMAGE004
Wherein f isiFor the predicted volume fraction of each food object, fi *And the actual value is the label data input in training.
2. The method according to claim 1, wherein when determining the recommended region, the RPN performs a multi-layer convolution operation on the input picture to extract the feature mapping of the picture, performs a convolution operation on the feature mapping using a sliding window, and then calculates the region classification and the region regression using two branches of a classification loss function and a bounding box regression loss function to obtain the recommended region.
3. The method according to claim 1, wherein the Fast R-CNN maps recommended regions to the feature maps to obtain RoIs, performs pooling operation on each RoI to convert into feature maps of the same size, and then performs two full-connection network operations on the pooled RoIs respectively to calculate food object categories in each recommended region and accurately predict borders.
4. The method of claim 1, wherein the predicted bounding box parameters are mapped onto the feature map by: each coordinate data is multiplied by the ratio of the size of the feature map and the original image.
5. The method of claim 1, wherein the output of the neural network in the prediction process comprises: an n-dimensional vector, computed by Volume R-CNN, representing the volume proportion of each food object in the image, wherein each element lies in the interval (0, 1) and the elements sum to 1; an n × m matrix, computed by Fast R-CNN, representing the category of each food object in the image, wherein m is the number of identifiable food object categories, each row of the matrix has exactly one element equal to 1 and the remaining m − 1 elements equal to 0, and the column of the element 1 indicates the category of the food object; and an n × 4 two-dimensional array representing the bounding box of each food object.
6. The method of claim 1, wherein the method further comprises multiplying an n-dimensional vector representing the volume fraction of each food object by an n x m two-dimensional array representing the category to which each food object belongs to obtain a volume fraction vector of each category, wherein the volume fraction vector is an m-dimensional vector, each dimension of the m-dimensional vector corresponds to one category of food, and the value in each dimension represents the volume fraction of the corresponding category of food.
7. The method of claim 1, further comprising calculating an m-dimensional vector representing the volume fraction of each food category for each image, and then adding all m-dimensional vectors and dividing by the number of images to find the average vector as the final volume fraction vector for each food category.
8. The method of claim 1, wherein the method further comprises an adaptive training step comprising:
step one, the RPN network initializes its parameters and computes a class label and region adjustment parameters for each detection region by forward propagation of the input image information; relevant parameters of the RPN, comprising the RPN-specific parameters and the parameters of the shared convolution part, are updated by back propagation using a stochastic gradient descent algorithm or the Adam algorithm, training until convergence;
step two, Fast R-CNN initializes its convolutional layer parameters with the shared convolutional layer parameters trained in step one, trains using the recommended regions obtained in step one as the recommended regions in the neural network computation, and updates the network parameters, including the shared convolutional network, until the network converges;
step three, using the shared convolutional network obtained in step two, the RPN continues training and updates only its own parameters, excluding the shared convolutional layer parameters;
step four, the Fast R-CNN network trains with the recommended regions obtained in step three, updating only its own part while the shared convolutional layer parameters remain unchanged;
step five, the Volume R-CNN network maps the food object bounding boxes obtained in step four onto the last feature map of the shared convolutional network, and trains and updates its own parameters until the network converges;
the training operation of each step forward-computes the input data through the network to obtain the loss function of each part, then back-propagates and updates the network parameters using stochastic gradient descent or the Adam algorithm;
wherein the above five-step training process can be executed in a loop.
9. An apparatus for performing fast R-CNN neural network operations, comprising
The information input part is used for acquiring a plurality of images of the same food in different angles, the total mass of the food, the density of different types of food in the food and the content of nutrient elements;
an information processing section for processing and calculating the image;
wherein the information processing section includes:
the storage unit is used for storing the image, the total mass, the density and the content of the nutrient elements;
a recommended region generation unit which determines a recommended region for sample detection using the RPN;
a category and frame prediction unit that predicts categories and frames of food objects in the recommended area using Fast R-CNN;
a food object Volume ratio prediction unit which predicts the Volume ratio of each food object in the image by using Volume R-CNN according to the predicted frame of the food object;
the food category Volume ratio prediction unit is used for calculating the Volume ratio of each category of food according to the food object category predicted by Fast R-CNN and the food object Volume ratio predicted by Volume R-CNN, and averaging the calculation results of different images;
the mass ratio prediction unit is used for multiplying the calculated volume ratio of different types of food and the density of different types of food respectively to obtain the mass ratio of different types of food;
the food quality prediction unit multiplies the mass proportion of different types of food by the total mass of the food to obtain the mass of the different types of food; and
the nutrition content prediction unit multiplies the quality of each food by the corresponding nutrition content to obtain the content of the food nutrient elements;
wherein the RPN, Fast R-CNN and Volume R-CNN share a convolutional layer;
the Volume R-CNN maps the predicted frame parameters to feature maps of pictures extracted by RPN, performs pooling operation on corresponding mapping areas to obtain sample areas with the same size, performs multilayer full-connection network operation on each sample area, and calculates a Volume intermediate variable v of each food object in the graphi,viIs a positive number; then the volume intermediate variable is converted into the corresponding volume proportion fiThe calculation formula is as follows:
Figure 922970DEST_PATH_IMAGE002
wherein i =1,2 … … n, n being the number of food objects in the image;
the loss function Volume loss in the Volume R-CNN takes the form

$$L_{volume} = \sum_{i=1}^{n} \left(f_i - f_i^*\right)^2$$

where $f_i$ is the predicted volume proportion of each food object and $f_i^*$ is the ground-truth value, i.e. the label data input during training.
10. The apparatus of claim 9, wherein the information input section includes an image input device and a mass input device.
11. The apparatus of claim 9, wherein the information processing part further comprises a data conversion unit for converting the q-dimensional nutrient content vector output by the processing unit into a corresponding output.
12. The apparatus of claim 9, wherein the apparatus further comprises an information output section for receiving output information from the information processing section and displaying the information.
13. The apparatus of claim 9, wherein the apparatus further comprises a networking component for uploading the measurement data to a database in real time, while the latest parametric model is also updated from a cloud.
14. The apparatus according to claim 9, wherein the information processing section is a neural network chip.
CN201810352111.7A 2018-04-18 2018-04-18 Method and device for executing fast R-CNN neural network operation Active CN108597582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810352111.7A CN108597582B (en) 2018-04-18 2018-04-18 Method and device for executing fast R-CNN neural network operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810352111.7A CN108597582B (en) 2018-04-18 2018-04-18 Method and device for executing fast R-CNN neural network operation

Publications (2)

Publication Number Publication Date
CN108597582A CN108597582A (en) 2018-09-28
CN108597582B true CN108597582B (en) 2021-02-12

Family

ID=63613739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810352111.7A Active CN108597582B (en) 2018-04-18 2018-04-18 Method and device for executing fast R-CNN neural network operation

Country Status (1)

Country Link
CN (1) CN108597582B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10303981B1 (en) * 2018-10-04 2019-05-28 StradVision, Inc. Learning method and testing method for R-CNN based object detector, and learning device and testing device using the same
CN109846303A (en) * 2018-11-30 2019-06-07 广州富港万嘉智能科技有限公司 Service plate surplus automatic testing method, system, electronic equipment and storage medium
CN111696151A (en) * 2019-03-15 2020-09-22 青岛海尔智能技术研发有限公司 Method and device for identifying volume of food material in oven and computer readable storage medium
CN110174399A (en) * 2019-04-10 2019-08-27 晋江双龙制罐有限公司 Solid content qualification detection method and its detection system in a kind of transparent can
CN110569759B (en) * 2019-08-26 2020-11-03 王睿琪 Method, system, server and front end for acquiring individual eating data
CN113539427A (en) * 2020-04-22 2021-10-22 深圳市前海高新国际医疗管理有限公司 Convolutional neural network-based nutrition intervention analysis system and analysis method
CN111564200A (en) * 2020-05-08 2020-08-21 深圳市万佳安人工智能数据技术有限公司 Old people diet feature extraction device and method based on rapid random gradient descent
CN114556444A (en) * 2020-09-11 2022-05-27 京东方科技集团股份有限公司 Training method of combined model and object information processing method, device and system
CN112257761A (en) * 2020-10-10 2021-01-22 天津大学 Method for analyzing food nutrient components in image based on machine learning
WO2022133985A1 (en) * 2020-12-25 2022-06-30 京东方科技集团股份有限公司 Food product recommendation method and apparatus, and storage medium and electronic device
CN113111925A (en) * 2021-03-29 2021-07-13 宁夏新大众机械有限公司 Feed qualification classification method based on deep learning


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103162627A (en) * 2013-03-28 2013-06-19 广西工学院鹿山学院 Method for estimating fruit size by citrus fruit peel mirror reflection
CN106709525A (en) * 2017-01-05 2017-05-24 北京大学 Method for measuring food nutritional component by means of camera

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Computer Vision-Based Food Calorie Estimation: Dataset, Method, and Experiment; Yanchao Liang et al.; Computer Vision and Pattern Recognition; 2017-05-24; see Section 3, Fig. 3 *
Estimating Food Calories for Multiple-dish Food Photos; Takumi Ege et al.; 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR); 2017-11-30; see Section 2, Fig. 2 *
Estimating Fruit Volume from Digital Images; K. A. Forbes et al.; 1999 IEEE Africon, 5th Africon Conference in Africa (Cat. No. 99CH36342); 1999-10-01; see p. 109, Section 3 *
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks; Shaoqing Ren et al.; Advances in Neural Information Processing Systems 28 (NIPS 2015); 2015-12-12; see Section 3: Sharing Convolutional Features for Region Proposal and Object Detection *

Also Published As

Publication number Publication date
CN108597582A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
CN108597582B (en) Method and device for executing fast R-CNN neural network operation
CN111353542B (en) Training method and device for image classification model, computer equipment and storage medium
WO2021000423A1 (en) Pig weight measurement method and apparatus
CN108921057B (en) Convolutional neural network-based prawn form measuring method, medium, terminal equipment and device
CN110532970B (en) Age and gender attribute analysis method, system, equipment and medium for 2D images of human faces
US20220351501A1 (en) Three-dimensional target detection and model training method and device, and storage medium
CN110490252B (en) Indoor people number detection method and system based on deep learning
CN108537329B (en) Method and device for performing operation by using Volume R-CNN neural network
WO2021242368A1 (en) Analysis and sorting in aquaculture
CN108766528B (en) Diet management system, construction method thereof and food material management method
CN110610149B (en) Information processing method and device and computer storage medium
CN115661943A (en) Fall detection method based on lightweight attitude assessment network
CN114331985A (en) Electronic component scratch defect detection method and device and computer equipment
CN114429459A (en) Training method of target detection model and corresponding detection method
CN115131783A (en) User diet nutrient component information autonomous perception method based on machine vision
Deshmukh et al. Caloriemeter: Food calorie estimation using machine learning
CN104657987B (en) Evaluation method and system based on the objective algorithm of PET/CT picture qualities
CN116863341B (en) Crop classification and identification method and system based on time sequence satellite remote sensing image
CN116662593B (en) FPGA-based full-pipeline medical hyperspectral image neural network classification method
CN114360690B (en) Method and system for managing diet nutrition of chronic disease patient
Patel et al. Deep Learning-Based Plant Organ Segmentation and Phenotyping of Sorghum Plants Using LiDAR Point Cloud
CN114359299A (en) Diet segmentation method and diet nutrition management method for chronic disease patients
CN113674205A (en) Method and system for measuring human body based on monocular depth camera
Liu et al. Research and application of dairy cows body condition score based on attention mechanism
CN109558791B (en) Bamboo shoot searching device and method based on image recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant