Disclosure of Invention
The invention aims to solve the technical problem that existing food identification approaches are complicated and, in view of the defects in the prior art, provides a food identification method and device based on deep learning.
In order to solve the above technical problem, the present invention provides a food identification method, including:
acquiring an image to be identified;
performing food identification on the image to be identified, and determining the image to be identified containing food as a first identification image;
performing food target detection on the first identification image, and determining position parameters of food in the first identification image, wherein the position parameters comprise coordinates of the food, and the length and the width of an area where the food is located in the first identification image;
determining a second identification image according to the first identification image and the position parameter;
and performing food identification on the second identification image to determine the name of the food.
In a possible implementation manner, the performing food identification on the image to be identified includes:
performing food identification on the image to be identified by using a pre-constructed first neural network;
wherein the first neural network is constructed by:
adopting 3 x 3 convolution kernels in the first to twelfth layers, wherein a feature pyramid structure is added between the fifth layer and the eighth layer for fusing low-dimensional features and high-dimensional features;
applying 5 x 5 convolution kernels to the thirteenth to seventeenth layers;
adopting a flatten layer on the eighteenth layer;
a fully connected layer is employed at the nineteenth layer, and a softmax classifier is employed after the fully connected layer.
In a possible implementation manner, the performing food target detection on the first identification image and determining the position parameters of the food in the first identification image includes:
performing feature extraction six times on the first identification image to obtain a first high-dimensional feature map;
performing feature extraction once on the first identification image to obtain a first low-dimensional feature map;
performing feature extraction three times on the first identification image to obtain a second low-dimensional feature map;
performing feature extraction five times on the first identification image to obtain a third low-dimensional feature map;
performing feature fusion on the first high-dimensional feature map, the first low-dimensional feature map, the second low-dimensional feature map and the third low-dimensional feature map to obtain a first fused feature map;
and determining the position parameter of the food in the first identification image according to the first fusion feature map.
In one possible implementation manner, the determining the position parameter of the food in the first recognition image according to the first fusion feature map includes:
performing feature extraction on the first fused feature map, and sending the extracted features to a classifier;
performing feature extraction on the first fused feature map, performing max pooling three times on the extracted features, and fusing the three max-pooled feature sets to obtain a second fused feature map;
performing feature extraction on the second fused feature map, and sending the extracted features to a position regressor;
and determining the position parameters of the food in the first identification image according to the classifier and the position regressor.
In a possible implementation manner, the performing food recognition on the second recognition image and determining the name of the food includes:
performing first downsampling processing on the second identification image to obtain a first downsampling feature map;
performing second downsampling processing on the first downsampling feature map to obtain a second downsampling feature map;
performing feature extraction on the second downsampling feature map to obtain a first extracted feature map;
performing second downsampling processing on the first extracted feature map to obtain a third downsampled feature map;
performing second downsampling processing on the third downsampled feature map to obtain a fourth downsampled feature map;
performing feature extraction on the fourth down-sampling feature map to obtain a second extracted feature map;
performing feature fusion on the first downsampling feature map, the first extracted feature map, the fourth downsampling feature map and the second extracted feature map to obtain a third fused feature map;
assigning weights exceeding a preset threshold to the detail features in the third fused feature map to obtain a target fused feature map;
and after pooling the target fused feature map, determining the name of the food through full-connection-layer classification.
In one possible implementation, the image to be recognized is captured by a rotatable camera from at least a plurality of positions above the food;
after the acquiring the image to be recognized, further comprising:
acquiring the mass of the food to be detected;
identifying the image to be identified by using a pre-constructed food space parameter model to obtain the space parameters of the food to be detected, wherein the space parameters are used for representing a three-dimensional profile of the food to be detected;
identifying the image to be identified by using a pre-constructed food color identification model to obtain the color of the food to be detected;
and determining the quality of the food to be detected according to its mass, space parameters and color.
In a possible implementation manner, the rotatable camera includes a first camera and a second camera. The camera internal parameters of the first camera and the second camera are the same, and their optical axes are parallel to each other. The X axis of the first camera coordinate system of the first camera coincides with the X axis of the second camera coordinate system of the second camera, the Y axis of the first camera coordinate system is parallel to the Y axis of the second camera coordinate system, and the distance between the origins of the two coordinate systems along the X axis is b. The first camera coordinate system is a three-dimensional rectangular coordinate system with the optical center of the first camera as its origin and the optical axis of the first camera as its Z axis; the second camera coordinate system is a three-dimensional rectangular coordinate system with the optical center of the second camera as its origin and the optical axis of the second camera as its Z axis. When the rotatable camera shoots the food to be detected, the included angle between the optical axes of the first camera and the second camera is set to a preset angle in degrees;
the food space parameter model is constructed in the following way:
acquiring preset standard images of a plurality of standard foods, wherein each standard food has a plurality of standard images;
for each standard food, performing the following steps:
A1, for any space point of the current standard food, determining the projection point coordinates of the space point in each standard image: a first coordinate (u1, v1) in the standard image of the first camera and a second coordinate (u2, v2) in the standard image of the second camera;
A2, obtaining a first formula representing the first coordinate according to the correspondence between the first coordinate (u1, v1) and the second coordinate (u2, v2), wherein the correspondence is as follows:
u1 = u2 + fx * b / Zc, v1 = v2;
A3, obtaining a second formula according to the central projection relation between the current space point (Xc, Yc, Zc) and the projection point coordinates, wherein the second formula is as follows:
u1 = fx * Xc / Zc + u0, v1 = fy * Yc / Zc + v0;
wherein fx characterizes the pixel focal length of the first camera along the X axis, fy characterizes the pixel focal length of the first camera along the Y axis, and (u0, v0) is the center pixel coordinate of the standard image, i.e. the linear model parameters of the cameras respectively representing the horizontal pixel value and the vertical pixel value between the central pixel coordinate of the standard image and the origin pixel coordinate of the standard image; since the camera internal parameters of the first camera and the second camera are the same, these parameters are shared by both cameras;
A4, obtaining a third formula representing the space point in the first camera coordinate system according to the first formula and the second formula, wherein the third formula is as follows:
Xc = b * (u1 - u0) / (u1 - u2), Yc = fx * b * (v1 - v0) / (fy * (u1 - u2)), Zc = fx * b / (u1 - u2);
A5, determining the corresponding coordinates of each space point of the current standard food in the first camera coordinate system according to the third formula;
and obtaining a food space parameter model according to the corresponding coordinates of each space point in each standard food in the first camera coordinate system.
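The triangulation in steps A1 to A5 can be sketched numerically as follows. This is a minimal illustration of the standard parallel binocular model assumed here (shared intrinsics fx, fy, principal point (u0, v0), baseline b); all numeric values are arbitrary examples, not taken from the patent:

```python
import numpy as np

def project(point, fx, fy, u0, v0, baseline=0.0):
    # Central projection of a space point (first-camera coordinates) into pixel
    # coordinates; for the second camera the origin is shifted along X by b.
    x, y, z = point
    u = fx * (x - baseline) / z + u0
    v = fy * y / z + v0
    return u, v

def triangulate(u1, v1, u2, fx, fy, u0, v0, b):
    # Third formula: recover the space point in the first camera coordinate
    # system from the disparity u1 - u2.
    d = u1 - u2
    zc = fx * b / d
    xc = b * (u1 - u0) / d
    yc = fx * b * (v1 - v0) / (fy * d)
    return np.array([xc, yc, zc])

fx, fy, u0, v0, b = 800.0, 800.0, 320.0, 240.0, 0.1
p = (0.05, -0.02, 0.6)                           # a space point on the food surface
u1, v1 = project(p, fx, fy, u0, v0)              # first coordinate
u2, _ = project(p, fx, fy, u0, v0, baseline=b)   # second coordinate
print(triangulate(u1, v1, u2, fx, fy, u0, v0, b))  # approx. [0.05, -0.02, 0.6]
```

Projecting a known point into both cameras and triangulating it back recovers the original coordinates, which is the consistency the third formula relies on.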
The present invention also provides a food recognition apparatus, comprising:
the acquisition module is used for acquiring an image to be identified;
the first identification module is used for identifying food in the image to be identified and determining the image to be identified containing food as a first identification image;
the second identification module is used for carrying out food target detection on the first identification image and determining position parameters of food in the first identification image, wherein the position parameters comprise coordinates of the food, and the length and the width of an area where the food is located in the first identification image; determining a second identification image according to the first identification image and the position parameter;
and the third identification module is used for identifying the food in the second identification image and determining the name of the food.
The present invention also provides a food recognition apparatus, comprising: at least one memory and at least one processor;
the at least one memory is configured to store a machine-readable program;
the at least one processor is configured to invoke the machine readable program to perform the method as described above.
The invention also provides a computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the method as described above.
The deep-learning-based food identification method and device have the following beneficial effects:
firstly, food identification is performed on an image to be identified, and the image to be identified containing food is determined as a first identification image; then, food target detection is performed on the first identification image, and the position parameters of the food in the first identification image are determined, wherein the position parameters comprise the coordinates of the food and the length and width of the area where the food is located in the first identification image; a second identification image is determined according to the first identification image and the position parameters; and finally, food identification is performed on the second identification image to determine the name of the food. Through this technical scheme, the determination of the name of the food is divided into three stages: determining whether the image to be identified contains food, determining the position of the food in the first identification image, and determining the name of the food. The name of the food can thus be quickly marked on the image to be identified, avoiding the need for the user to manually query the name of the food in order to learn related information about it.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, the food identification method provided by the embodiment of the present invention includes:
step 101, obtaining an image to be identified.
The image to be recognized may come from an image acquired by the recognition end in real time (for example, when the recognition end is a smartphone equipped with a camera), or from an image stored in advance by the recognition end (for example, when the recognition end is a server that obtains images by local reading or network transmission).
In other words, for the food recognition device disposed at the recognition end, the image to be recognized collected in real time may be obtained so as to perform food recognition on the image to be recognized in real time, and the image to be recognized collected in a historical time period may also be obtained so as to perform food recognition on the image to be recognized when the processing task is less, or perform food recognition on the image to be recognized under the instruction of the operator, which is not specifically limited in this embodiment.
Further, if the camera assembly configured at the recognition end can be used as an independent device, such as a camera or a video recorder, it can be arranged around the environment where the food is located, so that the food can be shot from different angles. Images to be recognized that reflect the food from different angles can thus be obtained, which helps guarantee the accuracy of subsequent recognition.
It should be noted that the shooting may be a single shooting or a continuous shooting, and accordingly, in the case of a single shooting, the obtained image to be recognized is a picture, and in the case of a continuous shooting, a video including a plurality of images to be recognized is obtained. Therefore, in each embodiment of the present invention, the image to be recognized for food recognition may be a single picture taken at a time, or may also be a certain image to be recognized in a section of video taken continuously, which is not specifically limited by the present invention.
It should be noted that, when a mobile terminal is used for food identification, a first neural network, a second neural network and a third neural network are pre-constructed in the mobile terminal. The first neural network is used for performing food identification on the image to be identified and determining the image to be identified containing food as the first identification image; the second neural network is used for performing food target detection on the first identification image and determining the position parameters of the food in the first identification image; and the third neural network is used for performing food identification on the second identification image and determining the name of the food. Correspondingly, the neural networks are trained on pictures of the food types preset for the mobile terminal. Specifically, for example, by means of crawling, shooting, purchasing and labeling, more than 2000 common food categories were collected, totalling 12 million pictures covering top-down, side and level shooting angles; operations such as translation, flipping, grayscale conversion and sharpening were applied with traditional image processing tools such as OpenCV, so as to increase the generalization capability of the first, second and third neural networks.
For example, the first neural network uses an SGD optimizer with an initial learning rate of 0.04, iterates for 250,000 steps with a batch size of 256, trains for 1 week, and reaches an accuracy (acc) of 99.5%; the second neural network uses an Adam optimizer with an initial learning rate of 0.05, iterates for 300,000 steps with a batch size of 128, trains for 2 weeks, and reaches a mean Average Precision (mAP) of 0.654; the third neural network uses an RMSprop optimizer with an initial learning rate of 0.03, iterates for 600,000 steps with a batch size of 256, trains for 3 weeks, and reaches an acc of 98.3%.
And 102, identifying food for the image to be identified, and determining the image to be identified containing the food as a first identification image.
In this step, food identification can be performed on the image to be identified by using the pre-constructed first neural network;
wherein the first neural network is constructed by:
3 x 3 convolution kernels are adopted in the first layer to the twelfth layer, wherein a characteristic pyramid structure is added between the fifth layer and the eighth layer and is used for fusing low-dimensional characteristics and high-dimensional characteristics;
applying 5 x 5 convolution kernels to the thirteenth to seventeenth layers;
adopting a flatten layer at the eighteenth layer (i.e. an operation that flattens the feature map into a one-dimensional vector);
a fully connected layer is used at the nineteenth layer and a softmax classifier is used after the fully connected layer.
In this embodiment, the convolution kernels may adopt depthwise convolution (DW Conv); experiments show that using DW convolution kernels instead of ordinary convolution kernels reduces the operation time of the first neural network by 45%. The first neural network adopts a nineteen-layer structure to provide a good receptive field (RF): 3 x 3 convolution kernels are adopted from the first layer to the twelfth layer to extract low-dimensional features and ensure that features are not lost, while in the thirteenth to seventeenth layers a single 5 x 5 convolution kernel runs faster than two stacked 3 x 3 convolution kernels. A feature pyramid structure is added between the fifth layer and the eighth layer to fuse low-dimensional features and high-dimensional features: the low-dimensional features mainly focus on local information of the image, and the high-dimensional features mainly focus on overall information of the image, so fusing them can improve the accuracy of food identification. Experiments show that adding the feature pyramid structure improves the mAP by 3.5 compared with not adding it, and replacing two 3 x 3 convolution kernels with a 5 x 5 convolution kernel reduces the operation time by 4.5%.
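The saving from depthwise-separable convolution can be illustrated with a simple parameter count; the channel sizes below are illustrative assumptions, not values from the patent:

```python
def standard_conv_params(k, c_in, c_out):
    # k x k standard convolution: every output channel sees every input channel.
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    # Depthwise (one k x k filter per input channel) followed by a 1 x 1
    # pointwise convolution that mixes channels.
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 64, 64)   # 36864 parameters
dw = dw_separable_params(3, 64, 64)     # 4672 parameters
print(std, dw, round(std / dw, 1))      # the depthwise variant is ~7.9x smaller
```

The reduced parameter count is what drives the reduced operation time reported above; the exact speedup depends on hardware and layer shapes.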
It should be noted that the feature pyramid structure is a basic component in a recognition system for detecting objects with different scales (i.e., low-dimensional features and high-dimensional features), which is well known to those skilled in the art and will not be described herein.
Step 103, performing food target detection on the first identification image, and determining the position parameter of the food in the first identification image.
In this step, the position parameters include coordinates of the food, and a length and a width of the region in the first recognition image where the food is located. That is, by performing feature extraction on the image to be recognized using a neural network including a plurality of feature extraction modules (e.g., residual network modules), a bounding box (bounding box) of the food in the first recognition image can be obtained.
As previously mentioned, food target detection may be performed on the first identification image using the second neural network, whose specific operation is described below.
In an embodiment of the present invention, step 103 specifically includes the following steps:
performing feature extraction six times on the first identification image to obtain a first high-dimensional feature map;
performing feature extraction once on the first identification image to obtain a first low-dimensional feature map;
performing feature extraction three times on the first identification image to obtain a second low-dimensional feature map;
performing feature extraction five times on the first identification image to obtain a third low-dimensional feature map;
performing feature fusion on the first high-dimensional feature map, the first low-dimensional feature map, the second low-dimensional feature map and the third low-dimensional feature map to obtain a first fusion feature map;
according to the first fused feature map, the position parameters of the food in the first recognition image are determined.
In the embodiment of the invention, by performing feature fusion on feature maps of different scales (the first high-dimensional feature map and the first, second and third low-dimensional feature maps), for example by means of Concat, the global features of the food in the image to be identified and the local features of particular interest can be effectively fused together. This greatly enhances the feature robustness of the fused feature map, making the determined position parameters of the food in the first identification image more accurate.
In addition, a residual network module may be used, for example, to perform feature extraction on the image to be recognized. Within the residual network module, two layers of DW convolution may be used to reduce operation time: one layer with a 1 x 1 convolution kernel, and another with a 3 x 3 convolution kernel to extract fine-grained features. A ReLU activation function is then used to improve the fitting capability on nonlinear data and prevent overfitting, and finally the high-dimensional and low-dimensional features in each residual network module are fused in an Add manner to prevent feature loss.
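The Add and Concat fusion styles mentioned here behave differently on tensor shapes; a minimal NumPy illustration (the shapes are arbitrary examples):

```python
import numpy as np

low = np.random.rand(8, 8, 32)    # low-dimensional (local) features, H x W x C
high = np.random.rand(8, 8, 32)   # high-dimensional (global) features

added = low + high                          # Add fusion: element-wise, shape kept
concat = np.concatenate([low, high], -1)    # Concat fusion: channels stacked

print(added.shape, concat.shape)  # (8, 8, 32) (8, 8, 64)
```

Add fusion requires matching shapes and keeps the channel count fixed (as in the residual module here), while Concat preserves both feature sets at the cost of a wider feature map (as in the multi-scale fusion above).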
Further, in an embodiment of the present invention, determining the position parameter of the food in the first recognition image according to the first fused feature map includes:
extracting the features of the first fusion feature map, and sending the features after feature extraction to a classifier;
performing feature extraction on the first fusion feature map, performing three times of maximum pooling on the features subjected to feature extraction, and performing feature fusion on the features subjected to the three times of maximum pooling to obtain a second fusion feature map;
extracting the features of the second fusion feature map, and sending the features after feature extraction to a position regression device;
according to the classifier and the location regressor, a location parameter of the food in the first recognition image is determined.
In the embodiment of the present invention, the first fused feature map is subjected to feature extraction, and the extracted features are subjected to max pooling three times, for example with max pooling layers of sizes 5 x 5, 9 x 9 and 13 x 13. Max pooling layers reduce the feature dimensionality and the parameter quantity of the neural network, extract the most representative features, and maintain the position and rotation invariance of the features, which is particularly useful for processing global features. The sizes 5 x 5, 9 x 9 and 13 x 13 are chosen because, on a CPU, achieving the same receptive field with 3 x 3 pooling requires multiple layers, and stacked 3 x 3 layers compute more slowly than a single 5 x 5, 9 x 9 or 13 x 13 layer; using max pooling layers of these sizes reduces the operation time by 23.6% compared with 3 x 3 max pooling layers, without loss of accuracy.
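The three max-pooling branches can be sketched as stride-1, same-padding pooling followed by channel concatenation, a common spatial-pyramid-pooling layout; the window sizes come from the text, while the stride-1 same-padding arrangement and the feature-map size are assumptions for illustration:

```python
import numpy as np

def max_pool_same(x, k):
    # Stride-1 max pooling with "same" padding so the spatial size is preserved.
    pad = k // 2
    h, w, c = x.shape
    padded = np.full((h + 2 * pad, w + 2 * pad, c), -np.inf)
    padded[pad:pad + h, pad:pad + w] = x
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max(axis=(0, 1))
    return out

feat = np.random.rand(13, 13, 8)                       # extracted features
pooled = [max_pool_same(feat, k) for k in (5, 9, 13)]  # three pooling windows
fused = np.concatenate([feat] + pooled, axis=-1)       # second fused feature map
print(fused.shape)  # (13, 13, 32)
```

Keeping the un-pooled branch alongside the three pooled branches mirrors the text's design: pooled features feed the position regressor while the un-pooled, detail-rich features feed the classifier.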
In addition, the second fused feature map obtained by the three poolings and feature fusion is finally sent to a position regressor (location head), while the un-pooled features are sent to a classifier (classification head), so that the determined position parameters of the food in the first identification image are more accurate. This is because max pooling may lose some detail features, which are decisive for the classification accuracy of the algorithm, whereas in position regression the overall appearance features have a great influence on accuracy and the detail features have little influence.
And 104, determining a second identification image according to the first identification image and the position parameter.
It is understood that the second identification image is an image region containing food, i.e., can be understood as a region of interest (ROI) image.
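Determining the second identification image from the first identification image and the position parameters amounts to a crop; a minimal sketch, assuming (x, y) denotes the top-left corner of the food region (the patent does not fix this convention):

```python
import numpy as np

def crop_roi(image, x, y, width, height):
    # Second identification image: the sub-region of the first identification
    # image described by the position parameters.
    return image[y:y + height, x:x + width]

first_image = np.zeros((480, 640, 3), dtype=np.uint8)  # illustrative image
second_image = crop_roi(first_image, x=100, y=50, width=200, height=150)
print(second_image.shape)  # (150, 200, 3)
```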
And 105, identifying the food in the second identification image, and determining the name of the food.
As described above, the second recognition image may be subjected to food recognition using a third neural network, and a specific operation of the third neural network will be described below.
In an embodiment of the present invention, step 105 specifically includes:
performing first downsampling processing on the second identification image to obtain a first downsampling feature map;
performing second downsampling processing on the first downsampling feature map to obtain a second downsampling feature map;
performing feature extraction on the second down-sampling feature map to obtain a first extracted feature map;
performing second downsampling processing on the first extracted feature map to obtain a third downsampled feature map;
performing second downsampling processing on the third downsampling feature map to obtain a fourth downsampling feature map;
performing feature extraction on the fourth down-sampling feature map to obtain a second extracted feature map;
performing feature fusion on the first downsampling feature map, the first extracted feature map, the fourth downsampling feature map and the second extracted feature map to obtain a third fused feature map;
giving weights exceeding a preset threshold to the detail features in the third fused feature map to obtain a target fused feature map;
and after pooling the target fused feature map, determining the name of the food through full-connection-layer classification.
In the embodiment of the present invention, first, a first downsampling process may be performed on the second identification image by using a SepConv module. The SepConv module convolves each input channel separately to obtain as many feature maps as there are input channels, and then aggregates the values of these feature maps with several 1 x 1 convolutions to obtain the output feature map. Its main function is downsampling to remove redundant features, for example reducing an original 224 x 224 feature map to a 112 x 112 feature map, which reduces the operation time.
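The 224 to 112 reduction follows from the usual convolution output-size formula; the kernel size, stride and padding below are assumptions for illustration (the patent only states the halving itself):

```python
def conv_out_size(size, kernel, stride, padding):
    # Output spatial size of a convolution (or pooling) layer.
    return (size + 2 * padding - kernel) // stride + 1

# A stride-2 convolution with a 3 x 3 kernel and padding 1 halves the map.
print(conv_out_size(224, kernel=3, stride=2, padding=1))  # -> 112
```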
Secondly, when the first downsampling feature map is subjected to the second downsampling process, convolutions of 1 x 1, 3 x 3 and 1 x 1 can be adopted in sequence; this convolution sequence plays a transition role in the algorithm, and performing the second downsampling process after the first is beneficial for extracting the key features in the first downsampling feature map. The low-dimensional and high-dimensional features extracted by the 1 x 1, 3 x 3 and 1 x 1 convolutions can then be fused in an Add manner, which improves the fitting performance of the third neural network on complex data.
In addition, the MBConv module may be used to perform feature extraction on the second downsampled feature map and perform feature extraction on the fourth downsampled feature map, and the MBConv module is well known to those skilled in the art and will not be described herein again.
Performing feature fusion on the first downsampling feature map, the first extracted feature map, the fourth downsampling feature map and the second extracted feature map gives the third fused feature map more detail features; assigning weights exceeding the preset threshold to the detail features in the third fused feature map then makes the resulting target fused feature map attend more to those detail features, which improves the identification accuracy for similar foods. For example, in experiments on pineapple and the closely similar fenli, the difference in external form is not obvious, but the detail features differ (such as differences in the flesh spikes and color); weighting the detail features in the third fused feature map above the preset threshold improved the identification accuracy for the two fruits by 18.3%.
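One way to read "giving weights exceeding a preset threshold to the detail features" is a channel reweighting in which channels judged to carry detail features receive a weight above the threshold; the selection rule and all numeric values below are assumptions for illustration, not the patented scheme:

```python
import numpy as np

def reweight_detail_channels(feature_map, detail_channels, threshold=1.0, boost=1.5):
    # Assumed scheme: detail channels are scaled by a weight above the preset
    # threshold; all other channels keep weight 1.
    assert boost > threshold
    weights = np.ones(feature_map.shape[-1])
    weights[detail_channels] = boost
    return feature_map * weights

fused = np.ones((4, 4, 6))                                   # third fused feature map
target = reweight_detail_channels(fused, detail_channels=[1, 4])
print(target[0, 0])  # channels 1 and 4 now carry weight 1.5
```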
In summary, the food identification method provided by the embodiment of the present invention first performs food identification on an image to be identified, and determines the image to be identified containing food as a first identification image; then, food recognition is carried out on the first recognition image, and position parameters of the food in the first recognition image are determined, wherein the position parameters comprise coordinates of the food, and the length and the width of an area where the food is located in the first recognition image; determining a second identification image according to the first identification image and the position parameter; and finally, carrying out food identification on the second identification image to determine the name of the food. Through the technical scheme, the method and the device have the advantages that the name of the food is determined to be divided into three stages, namely, whether the image to be recognized contains the food or not, the position of the food in the first recognition image and the name of the food are determined, so that the name of the food can be quickly marked on the image to be recognized, and the problem that a user can know the related information of the food only by manually inquiring the name of the food is avoided.
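The three-stage flow summarized above can be sketched as a pipeline; the stage functions here are hypothetical placeholders standing in for the first, second and third neural networks, not the actual models:

```python
def contains_food(image):
    # Stage 1 placeholder: decides whether the image to be identified shows food.
    return True

def detect_food_region(image):
    # Stage 2 placeholder: returns position parameters (x, y, width, height).
    return (40, 30, 120, 80)

def classify_food(roi):
    # Stage 3 placeholder: returns the food name for the cropped region.
    return "apple"

def identify_food(image):
    if not contains_food(image):            # stage 1: food / no food
        return None
    x, y, w, h = detect_food_region(image)  # stage 2: position parameters
    roi = ("crop", x, y, w, h)              # second identification image
    return classify_food(roi)               # stage 3: name of the food

print(identify_food("image-to-be-identified"))  # -> apple
```

Splitting the task this way lets each stage stay small: the first network only answers yes/no, so most non-food images are rejected before the more expensive detection and classification stages run.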
In addition to the above-described identification of the kind of food, the quality of the food can also be identified.
In one embodiment of the invention, the image to be recognized is captured by a rotatable camera from at least a plurality of positions above the food;
after acquiring the image to be identified, the method further comprises the following steps:
acquiring the quality of food to be detected;
identifying an image to be identified by utilizing a pre-constructed food space parameter model to obtain a space parameter of the food to be detected, wherein the space parameter is used for representing a three-dimensional profile of the food to be detected;
identifying the image to be identified by utilizing a pre-constructed food color identification model to obtain the color of the food to be detected;
and determining the quality of the food to be detected according to the mass, the spatial parameters and the color of the food to be detected.
In the embodiment of the invention, the mass of the food to be detected and a plurality of images to be identified of the food to be detected are obtained, the images being captured by a rotatable camera from at least a plurality of positions above the food to be detected. The plurality of images to be identified are recognized by a pre-constructed food space parameter model to obtain spatial parameters of the food to be detected, the spatial parameters characterizing the three-dimensional profile of the food to be detected, and by a pre-constructed food color identification model to obtain the color of the food to be detected. The quality of the food to be detected is then determined according to its mass, spatial parameters and color: the density of the food to be detected is obtained from its mass and spatial parameters, and a pre-constructed density classifier identifies the quality of the food to be detected accurately and quickly according to that density. Identifying the quality of the food to be detected through the density classifier together with the food color identification model improves the identification accuracy. By integrating image recognition technology into food quality identification, the method solves the problem that ordinary people cannot accurately judge the quality of food.
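The mass-to-density step and the density classifier mentioned above can be sketched as follows. The bounding-box volume estimate, the function names, and the density bands are illustrative assumptions, not values taken from the embodiment:

```python
def estimate_volume(spatial_points):
    """Approximate the volume of the food from its reconstructed 3-D
    profile using an axis-aligned bounding box (a deliberate
    simplification of the spatial parameters)."""
    xs, ys, zs = zip(*spatial_points)
    return (max(xs) - min(xs)) * (max(ys) - min(ys)) * (max(zs) - min(zs))

def classify_by_density(mass, spatial_points,
                        bands=((0.0, 0.5, "low"),
                               (0.5, 1.5, "normal"),
                               (1.5, float("inf"), "high"))):
    """Stand-in for the pre-constructed density classifier: map
    density = mass / volume onto labelled density bands."""
    density = mass / estimate_volume(spatial_points)
    for lo, hi, label in bands:
        if lo <= density < hi:
            return label, density
```

In the embodiment, the density grade would then be combined with the output of the food color identification model to give the final quality result.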
It should be noted that the mass of the food to be detected and the images to be identified are obtained by the food identification system shown in fig. 4. The food identification system comprises a rotary weighing platform 1 and a rotatable camera 2. The food to be detected is placed on the rotary weighing platform 1 for weighing; the rotary weighing platform 1 is connected with the rotatable camera 2, and the rotatable camera 2 can rotate about the center of the rotary weighing platform 1.
When the food to be detected is placed on the rotary weighing platform 1, its mass can be obtained; meanwhile, the rotatable camera 2 shoots the food to be detected at a plurality of set angles to obtain a plurality of images to be identified. Because the food to be detected is an irregular three-dimensional entity, its three-dimensional profile can be reflected to the maximum extent only by shooting it from a plurality of angles. When the rotatable camera 2 shoots the food to be detected, the straight-line distance between the rotatable camera 2 and the food to be detected is kept unchanged (that is, the distance from the rotatable camera 2 to the center of the rotary weighing platform 1 is considered unchanged), so that the shooting angle is the only variable, and the spatial parameters of the food to be detected can be obtained from the plurality of images to be identified.
In one embodiment of the present invention, the rotatable camera comprises a first camera and a second camera, wherein the camera intrinsic parameters of the first camera and the second camera are the same and their optical axes are parallel to each other; the X axes of the first camera coordinate system of the first camera and of the second camera coordinate system of the second camera coincide with each other, and their Y axes are parallel to each other; the distance between the origins of the first camera coordinate system and the second camera coordinate system on the X axis is b; the first camera coordinate system is a three-dimensional rectangular coordinate system established with the optical center of the first camera as the origin and its optical axis as the Z axis, and the second camera coordinate system is a three-dimensional rectangular coordinate system established with the optical center of the second camera as the origin and its optical axis as the Z axis; when the rotatable camera shoots the food to be detected, the included angle between the optical axes of the first camera and the second camera is set to a preset number of degrees;
the food space parameter model is constructed by the following method:
acquiring standard images of a plurality of preset standard foods, wherein several standard images are acquired for each standard food;
for each standard food, the following steps are performed:
a1, determining, for any spatial point of the current standard food, the coordinates of its projection point in each standard image, its first coordinates in a first camera coordinate system, and its second coordinates in a second camera coordinate system;
A2, obtaining a first formula for representing the first coordinate according to the corresponding relation between the first coordinate and the second coordinate, wherein the corresponding relation is as follows:
a3, obtaining a second formula according to the central projection relation between the current space point and the projection point coordinate, wherein the second formula is as follows:
wherein the second formula involves the pixel focal length of the first camera on the X axis, the pixel focal length of the first camera on the Y axis, the center pixel coordinate of the standard image, the linear model parameters of the first camera, which are respectively used for characterizing the horizontal pixel value and the vertical pixel value between the center pixel coordinate of the standard image and the origin pixel coordinate of the standard image, and the linear model parameters of the second camera, which are respectively used for characterizing the horizontal pixel value and the vertical pixel value between the center pixel coordinate of the standard image and the origin pixel coordinate of the standard image;
a4, obtaining a third formula for representing the space point in the first camera coordinate system according to the first formula and the second formula, wherein the third formula is as follows:
a5, determining the corresponding coordinates of each spatial point of the current standard food in the first camera coordinate system according to the third formula;
and obtaining the food space parameter model according to the corresponding coordinates of each spatial point of each standard food in the first camera coordinate system.
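Under the stated assumptions (identical camera intrinsics, parallel optical axes, baseline b along the X axis), steps a1 to a4 correspond to standard rectified-stereo triangulation. The following is a minimal sketch of that computation, not the embodiment's exact formulas:

```python
def triangulate(u1, v1, u2, fx, fy, u0, v0, b):
    """Recover the first-camera coordinates (X, Y, Z) of a spatial point
    from its projection (u1, v1) in the first image and u2 in the second.
    fx, fy are the pixel focal lengths on the X and Y axes, (u0, v0) is
    the center pixel coordinate, and b is the distance between the two
    coordinate-system origins on the X axis."""
    d = u1 - u2                    # disparity along the X axis
    if d == 0:
        raise ValueError("zero disparity: point at infinity")
    Z = fx * b / d                 # depth from similar triangles
    X = (u1 - u0) * Z / fx         # back-project with the pinhole model
    Y = (v1 - v0) * Z / fy
    return X, Y, Z
```

Applying this to each matched projection-point pair yields the per-point coordinates from which the food space parameter model is assembled.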
In the embodiment of the invention, the rotatable camera comprises two cameras. When each camera shoots the food to be detected, an image of the three-dimensional entity of the food on a two-dimensional plane is obtained, and the two cameras shoot two images of the same three-dimensional entity at the same time. Because the origins of the two camera coordinate systems are a certain distance apart in the X-axis direction, a parallax angle exists between the two cameras; the images of the three-dimensional entity formed by the two cameras are similar yet differ somewhat, forming camera parallax, so that a sense of depth exists between the images shot by the two cameras. On this basis, the three-dimensional entity of the food to be detected is recovered by three-dimensional reconstruction, and its spatial parameters are then obtained. Specifically, the rotatable camera comprises a first camera and a second camera with the same camera intrinsic parameters, whose optical axes are parallel to each other, whose camera coordinate systems have coincident X axes and parallel Y axes, and whose coordinate-system origins are a distance b apart on the X axis. When shooting with the rotatable camera, the first camera and the second camera each rotate about the Y axis of its own camera coordinate system so that a set included angle is formed between their optical axes; in this embodiment, the included angle is set to 90°. Therefore, according to the positional relationship between the first camera and the second camera, the relationship between the first coordinates in the first camera coordinate system and the second coordinates in the second camera coordinate system can be determined, and the second coordinates can be expressed by the first coordinates. According to the central projection relation between the current spatial point of the standard food and the coordinates of its projection point, the projection is expressed using the first coordinates, the second coordinates, the camera intrinsic parameters, the linear model parameters, and the pixel focal lengths of the first camera on the X axis and the Y axis. Further, the coordinates of the spatial point in the first camera coordinate system are expressed by the known parameters: the camera intrinsic parameters, the linear model parameters, the distance b between the two cameras on the X axis, and the pixel focal lengths of the first camera on the X and Y axes. Therefore, the corresponding coordinates of each spatial point of the current standard food in the first camera coordinate system can be obtained by this method of three-dimensional reconstruction, and the food space parameter model is obtained from the corresponding coordinates of each spatial point of each standard food in the first camera coordinate system.
As shown in fig. 2 and fig. 3, the embodiment of the invention provides a food identification device and the equipment in which it is located. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. From the hardware level, fig. 2 is a hardware structure diagram of the equipment in which the food identification device of the embodiment of the present invention is located; besides the processor, memory, network interface and nonvolatile memory shown in fig. 2, the equipment in the embodiment may generally include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as shown in fig. 3, the device is a logical apparatus formed by the CPU of the equipment in which it is located reading the corresponding computer program instructions from the nonvolatile memory into memory and executing them.
As shown in fig. 3, the food identification device provided in this embodiment includes:
an obtaining module 301, configured to obtain an image to be identified;
a first identification module 302, configured to perform food identification on the image to be identified, and determine the image to be identified containing food as a first identification image;
the second identification module 303 is configured to perform food target detection on the first identification image, and determine a position parameter of food in the first identification image, where the position parameter includes a coordinate of the food, and a length and a width of an area where the food is located in the first identification image; determining a second identification image according to the first identification image and the position parameter;
and the third identification module 304 is configured to perform food identification on the second identification image, and determine a name of the food.
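As a concrete illustration of how the second identification image is determined from the first identification image and the position parameter, the following minimal sketch crops the region described by the coordinates and the length and width of the area. It assumes the coordinates give the top-left corner of the region and that the image is a nested list of pixel rows; the embodiment does not fix these conventions:

```python
def crop_second_image(first_image, x, y, length, width):
    """Cut the food region out of the first identification image:
    `length` rows starting at y, `width` columns starting at x
    (corner and axis conventions are assumptions)."""
    return [row[x:x + width] for row in first_image[y:y + length]]
```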
In this embodiment of the present invention, the obtaining module 301 may be configured to perform step 101 in the foregoing method embodiment, and the first identifying module 302 may be configured to perform step 102 in the foregoing method embodiment; the second identification module 303 may be configured to perform steps 103 and 104 in the above-described method embodiment; the third identification module 304 may be configured to perform step 105 in the above-described method embodiment.
In an embodiment of the present invention, the first identifying module 302 is configured to perform the following operations:
carrying out food identification on the image to be identified by utilizing a pre-constructed neural network;
wherein the neural network is constructed by:
3 x 3 convolution kernels are adopted in the first to twelfth layers, wherein a feature pyramid structure is added between the fifth layer and the eighth layer for fusing low-dimensional features and high-dimensional features;
applying 5 x 5 convolution kernels to the thirteenth to seventeenth layers;
adopting a flatten layer on the eighteenth layer;
a fully connected layer is employed at the nineteenth layer, and a softmax classifier is employed after the fully connected layer.
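The nineteen-layer design described above can be written down as a layer table with a spatial-size check. The strides and paddings below are assumptions (the text fixes only the kernel sizes and layer roles), so this is a shape sketch rather than the embodiment's exact network:

```python
# Layers 1-12: 3x3 convolutions (a feature pyramid fuses features
# between layers 5 and 8); layers 13-17: 5x5 convolutions; layer 18:
# flatten; layer 19: fully connected layer followed by softmax.
# Stride 1 with "same" padding is an assumed choice.
LAYERS = ([("conv", 3, 1, 1)] * 12 +
          [("conv", 5, 1, 2)] * 5 +
          [("flatten",), ("fc_softmax",)])

def spatial_size(side, layers=LAYERS):
    """Track the feature-map side length through the convolution stack
    using the standard formula (side + 2*pad - kernel) // stride + 1."""
    for layer in layers:
        if layer[0] != "conv":
            break
        _, k, stride, pad = layer
        side = (side + 2 * pad - k) // stride + 1
    return side
```

With the assumed stride-1 "same" padding, the spatial size is preserved through all seventeen convolutional layers, so the flatten layer sees the input resolution times the final channel count.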
In an embodiment of the present invention, the second identifying module 303 is configured to perform the following operations:
performing feature extraction six times on the first identification image to obtain a first high-dimensional feature map;
performing feature extraction once on the first identification image to obtain a first low-dimensional feature map;
performing feature extraction three times on the first identification image to obtain a second low-dimensional feature map;
performing feature extraction five times on the first identification image to obtain a third low-dimensional feature map;
performing feature fusion on the first high-dimensional feature map, the first low-dimensional feature map, the second low-dimensional feature map and the third low-dimensional feature map to obtain a first fused feature map;
and determining the position parameter of the food in the first identification image according to the first fusion feature map.
In an embodiment of the present invention, the second identification module 303, when performing the determining of the position parameter of the food in the first identification image according to the first fused feature map, is configured to perform the following operations:
carrying out feature extraction on the first fusion feature map, and sending the features subjected to feature extraction to a classifier;
performing feature extraction on the first fusion feature map, performing three times of maximum pooling on the features subjected to feature extraction, and performing feature fusion on the features subjected to the three times of maximum pooling to obtain a second fusion feature map;
extracting the features of the second fusion feature map, and sending the features after feature extraction to a position regression device;
determining a location parameter of the food in the first identification image according to the classifier and the location regressor.
In an embodiment of the present invention, the third identifying module 304 is configured to perform the following operations:
performing first downsampling processing on the second identification image to obtain a first downsampling feature map;
performing second downsampling processing on the first downsampling feature map to obtain a second downsampling feature map;
performing feature extraction on the second downsampling feature map to obtain a first extracted feature map;
performing second downsampling processing on the first extracted feature map to obtain a third downsampled feature map;
performing second downsampling processing on the third downsampled feature map to obtain a fourth downsampled feature map;
performing feature extraction on the fourth down-sampling feature map to obtain a second extracted feature map;
performing feature fusion on the first downsampling feature map, the first extracted feature map, the fourth downsampling feature map and the second extracted feature map to obtain a third fused feature map;
assigning a weight exceeding a preset threshold value to the detail features in the third fusion feature map to obtain a target fusion feature map;
and after the target fusion characteristic graph is subjected to pooling treatment, determining the name of food through full-connection layer classification.
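The weighting step above, which gives the detail features a weight exceeding the preset threshold, resembles a channel-wise attention rescaling. The sketch below assumes that per-channel variance measures "detail" and that weights are mapped into the interval from the threshold up to 1; both are illustrative choices, not the embodiment's definition:

```python
def emphasize_details(feature_maps, threshold=0.5):
    """feature_maps: list of 2-D channel maps (lists of pixel rows).
    Each channel receives a weight between `threshold` and 1 that grows
    with its variance, so detail-rich channels dominate the resulting
    target fusion feature map."""
    def variance(channel):
        vals = [v for row in channel for v in row]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)

    energies = [variance(ch) for ch in feature_maps]
    peak = max(energies) or 1.0          # avoid division by zero
    weights = [threshold + (1 - threshold) * (e / peak) for e in energies]
    return [[[v * w for v in row] for row in ch]
            for ch, w in zip(feature_maps, weights)]
```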
In one embodiment of the invention, the image to be identified is captured by a rotatable camera from at least a plurality of positions above the food;
the food identification device further comprises:
a mass obtaining module, configured to obtain the mass of the food to be detected;
a fourth identification module, configured to identify the image to be identified by using a pre-constructed food space parameter model to obtain spatial parameters of the food to be detected, the spatial parameters characterizing the three-dimensional profile of the food to be detected;
a fifth identification module, configured to identify the image to be identified by using a pre-constructed food color identification model to obtain the color of the food to be detected;
and a quality determining module, configured to determine the quality of the food to be detected according to the mass, the spatial parameters and the color of the food to be detected.
In an embodiment of the present invention, the rotatable camera comprises a first camera and a second camera, wherein the camera intrinsic parameters of the first camera and the second camera are the same and their optical axes are parallel to each other; the X axes of the first camera coordinate system of the first camera and of the second camera coordinate system of the second camera coincide with each other, and their Y axes are parallel to each other; the distance between the origins of the first camera coordinate system and the second camera coordinate system on the X axis is b; the first camera coordinate system is a three-dimensional rectangular coordinate system established with the optical center of the first camera as the origin and its optical axis as the Z axis, and the second camera coordinate system is a three-dimensional rectangular coordinate system established with the optical center of the second camera as the origin and its optical axis as the Z axis; when the rotatable camera shoots the food to be detected, the included angle between the optical axes of the first camera and the second camera is set to a preset number of degrees;
the food space parameter model is constructed in the following way:
acquiring standard images of a plurality of preset standard foods, wherein several standard images are acquired for each standard food;
for each standard food, the following were performed:
a1, determining, for any spatial point of the current standard food, the coordinates of its projection point in each standard image, its first coordinates in the first camera coordinate system, and its second coordinates in the second camera coordinate system;
A2, obtaining a first formula for representing the first coordinate according to the corresponding relation between the first coordinate and the second coordinate, wherein the corresponding relation is as follows:
a3, obtaining a second formula according to the central projection relation between the current space point and the projection point coordinate, wherein the second formula is as follows:
wherein the second formula involves the pixel focal length of the first camera on the X axis, the pixel focal length of the first camera on the Y axis, the center pixel coordinate of the standard image, the linear model parameters of the first camera, which are respectively used for characterizing the horizontal pixel value and the vertical pixel value between the center pixel coordinate of the standard image and the origin pixel coordinate of the standard image, and the linear model parameters of the second camera, which are respectively used for characterizing the horizontal pixel value and the vertical pixel value between the center pixel coordinate of the standard image and the origin pixel coordinate of the standard image;
a4, obtaining a third formula for representing the space point in the first camera coordinate system according to the first formula and the second formula, wherein the third formula is as follows:
a5, determining the corresponding coordinates of each spatial point of the current standard food in the first camera coordinate system according to the third formula;
and obtaining the food space parameter model according to the corresponding coordinates of each spatial point of each standard food in the first camera coordinate system.
It is to be understood that the illustrated structure of the embodiments of the present invention does not constitute a specific limitation to the food identifying device. In other embodiments of the invention, the food identification device may include more or fewer components than illustrated, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Because the content of information interaction, execution process, and the like among the modules in the device is based on the same concept as the method embodiment of the present invention, specific content can be referred to the description in the method embodiment of the present invention, and is not described herein again.
An embodiment of the present invention further provides a food identification device, including: at least one memory and at least one processor;
the at least one memory to store a machine readable program;
the at least one processor is configured to invoke the machine readable program to perform the food identification method of any embodiment of the present invention.
Embodiments of the present invention also provide a computer-readable medium storing instructions for causing a computer to perform the food identification method described herein. Specifically, a system or an apparatus equipped with a storage medium may be provided, on which software program code realizing the functions of any of the above-described embodiments is stored, and the computer (or CPU or MPU) of the system or apparatus reads out and executes the program code stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD + RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any of the above-described embodiments can be implemented not only by the computer executing the read-out program code, but also by an operating system or the like running on the computer performing part or all of the actual operations based on the instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium may be written into a memory provided in an expansion board inserted into the computer or into a memory provided in an expansion unit connected to the computer, after which a CPU or the like mounted on the expansion board or expansion unit performs part or all of the actual operations based on the instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.