CN114743153B

CN114743153B - Non-sensory vegetable taking model establishing and vegetable taking method and device based on video understanding

Info

Publication number: CN114743153B
Application number: CN202210649671.5A
Authority: CN
Inventors: 金一舟; 范时朝; 周钢; 刘庆杰; 王蕴红
Original assignee: Hangzhou Innovation Research Institute of Beihang University
Current assignee: Hangzhou Innovation Research Institute of Beihang University
Priority date: 2022-06-10
Filing date: 2022-06-10
Publication date: 2022-09-30
Anticipated expiration: 2042-06-10
Also published as: CN114743153A

Abstract

The invention relates to a method and a device for establishing a non-sensory dish-taking model based on video understanding, wherein the method comprises the following steps: get the dish action of getting of dish model discernment diner through the good noninductive of training, the dish and the diner dinner plate of diner will have dinner the people according to getting the dish action and match, through calculating the weight of every kind of dish, do individual statistics to the diner, get the dish action through noninductive dish model discernment of getting, compare in prior art, place the dinner plate in appointed induction area before need not having diner to take every kind of dish, the efficiency of getting of diner has been improved, the diner is getting the dish in-process, only need normally get the dish and go, need not do extra action, the degree of coordination of diner has been improved.

Description

Non-sensory dish-taking model establishing and dish-taking method and device based on video understanding

Technical Field

The invention relates to the technical field of motion recognition, in particular to a method and a device for building and fetching a non-sensory dish fetching model based on video understanding.

Background

In some special canteen scenes (such as canteens for professional athletes), refined statistics are needed for food ingested by a user every meal;

in the prior art, generally plant the RFID chip on the dinner plate, before every kind of dish, can place an induction area, before the people of having a dinner takes the dish, need earlier the dinner plate, place in the induction area, wait for the discernment back, get the operation of dish again, just also every takes a kind of dish, all need put the dinner plate to appointed induction area, and the response also has certain delay at every turn, and the people of having a dinner degree of adaptability is lower.

Disclosure of Invention

In view of the above, the present invention provides a method and an apparatus for establishing a non-sensory dish-taking model based on video understanding, so as to solve the problems in the prior art that each time a diner takes a dish, the diner plate needs to be placed in a designated sensing area to wait for sensing, sensing is delayed, the time for the diner to take the dish is long, and the matching degree is low.

According to a first aspect of the embodiments of the present invention, there is provided a video understanding-based non-sensory dish model building method, including:

acquiring a video for picking up a dish by a human body, performing frame extraction on the video to acquire a plurality of video frames, marking the video frames containing dish picking actions in the plurality of video frames as normal samples, and selecting the normal samples as a data set;

inputting the normal samples in the data set into an encoder of a target neural network model, mapping the normal samples through the encoder to obtain mapping vectors of the normal samples, reconstructing the mapping vectors of the normal samples by a decoder of the target neural network model, and outputting reconstructed video frames;

calculating a loss value between the reconstructed video frame and the video frame of the normal sample, and adjusting parameters of the target neural network model by a gradient descent method until the loss value is not reduced any more or reaches a preset iteration number, so as to obtain a non-sensory dish model.

According to a second aspect of the embodiments of the present invention, there is provided a video understanding-based non-sensory dish taking method, where the method is based on the non-sensory dish model obtained by the video understanding-based non-sensory dish model building method, and the dish taking method includes:

binding the identity information of the diners with the face images thereof in advance;

when a diner has a meal, matching the personal information of the diner with the dinner plate;

matching the diner plate of the diner with dishes taken by the diner according to the trained noninductive dish taking model;

acquiring the weight of each dish of a diner;

and carrying out individual statistics on the diners according to the weight of each dish of the diners.

Preferably, the first and second electrodes are formed of a metal,

the pre-binding of the identity information of the diner and the face recognition image thereof comprises the following steps:

and according to the face image provided by the diner, binding the face image with a meal card of the diner during dining.

Preferably, the first and second liquid crystal display panels are,

when the diner has a meal, matching the diner identity information with the dinner plate comprises:

and setting a serial number for each dinner plate, using the meal card by a diner to fetch the dinner plate, and binding the meal card with the serial number corresponding to the dinner plate.

Preferably, the first and second liquid crystal display panels are,

the method for matching the diner dinner plate with the dishes taken by the diner according to the trained non-inductive dish taking model comprises the following steps:

when a diner takes dishes, acquiring a face image and a dish taking video of the diner, inputting the face image into a preset face recognition model, acquiring the identity information of the diner corresponding to the face image, acquiring a meal card of the diner according to the identity information of the diner, and acquiring a meal plate number of the diner according to the meal card of the diner;

dividing the dish taking video into a plurality of video frames and inputting the video frames into a trained non-sensory dish taking model;

if the loss values between all reconstructed video frames output by the non-sensory dish taking model and the input video frames are larger than a preset threshold value, it is indicated that the diner does not take dishes;

if the loss value between the reconstructed video frame output by the non-sensory dish taking model and the input video frame is less than or equal to a preset threshold value, indicating that the dish taking action exists for the diner;

and when the diner takes the dish, acquiring the dish information of the diner, and matching the dish information with the diner dinner plate.

Preferably, the first and second liquid crystal display panels are,

the acquiring of the dish information of the diner comprises the following steps: and a weight sensor is arranged below each dish, and when the weight of the corresponding dish is reduced when the dish taking action of the diner exists, the diner is judged to take the dish with the reduced weight.

Preferably, the first and second electrodes are formed of a metal,

the method for acquiring the weight of each dish of the diner comprises the following steps:

and calculating a difference value according to the weight of the dish after the diner takes the dish and the weight of the dish before the diner takes the dish, wherein the difference value is the weight of the dish taken by the diner.

According to a third aspect of the embodiments of the present invention, there is provided a device for building a non-perceptual model based on video understanding, the device including:

a dataset screening module: the method comprises the steps of obtaining a video for a human body to pick up a dish, carrying out frame extraction on the video to obtain a plurality of video frames, marking the video frames containing dish picking actions in the video frames as normal samples, and selecting the normal samples as a data set;

a reconstructed video frame acquisition module: the decoder of the target neural network model reconstructs the mapping vector of the normal sample and outputs a reconstructed video frame;

the non-sensory dish-taking model acquisition module comprises: and the method is used for calculating a loss value between the reconstructed video frame and the video frame of the normal sample, and adjusting the parameters of the target neural network model by a gradient descent method so that the loss value is not reduced or reaches a preset iteration number, thereby obtaining the non-sensory dish model.

According to a fourth aspect of the embodiments of the present invention, there is provided a video understanding-based non-sensory dish device, which uses the non-sensory dish model obtained by the video understanding-based non-sensory dish model building method, the device including:

an identity binding module: the system is used for binding the identity information of the diner and the face image thereof;

a service plate binding module: the service plate is used for matching the personal information of the diner with the service plate when the diner has a meal;

the dish binding module: the food service plate is used for matching the diner service plate with dishes taken by the diner according to the trained non-inductive dish taking model;

a weight acquisition module: the weight acquisition device is used for acquiring the weight of each dish of a diner;

a statistic module: the method is used for carrying out individual statistics on the diners according to the weight of each dish of the diners.

The technical scheme provided by the embodiment of the invention can have the following beneficial effects:

in the establishment process of the non-sensory vegetable taking model, only the video frames containing the vegetable taking action are trained, namely all the video frames in the data set are the video frames containing the vegetable taking action, compared with the existing model training process, the video frames in the data set do not need to be labeled to distinguish the video frames containing the vegetable taking action from the video frames not containing the vegetable taking action, the early-stage manual labeling process in the model establishment process is reduced, in addition, the method only trains the video frames containing the vegetable taking action but does not train the video frames not containing the vegetable taking action, the training amount is obviously reduced, namely the model training process is faster, the vegetable taking action of a diner is identified through the trained non-sensory vegetable taking model, when the vegetable taking action exists, the quality of the dishes taken by the diner is counted according to the weight reduced by the dishes, the application matches the identity information of the diners with the meal cards and the meal plates, thereby realizing the purpose of matching dishes with the meal plates,

compared with the prior art, the dinner plate is placed in the appointed induction area before each type of dish is not required to be taken by a diner, the dish taking efficiency of the diner is improved, the diner only needs to normally take the dish in the dish taking process, extra action is not required, and the matching degree of the diner is improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow diagram illustrating a method for video understanding-based sensorless dish model building in accordance with an exemplary embodiment;

FIG. 2 is a flow diagram illustrating a video understanding-based no-sense-dishes method according to another exemplary embodiment;

FIG. 3 is a system diagram illustrating a video understanding-based non-sensory-dish model building apparatus according to another exemplary embodiment;

FIG. 4 is a system diagram illustrating a video understanding-based non-sensory-dish device according to another exemplary embodiment;

in the drawings: the method comprises the steps of 1-a data set screening module, 2-a reconstructed video frame obtaining module, 3-a non-sensory dish model obtaining module, 101-an identity binding module, 102-a dinner plate binding module, 103-a dish binding module, 104-a weight obtaining module and 105-a counting module.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

Example one

Fig. 1 is a diagram illustrating a video understanding-based non-sensory dish model building method according to an exemplary embodiment, as shown in fig. 1, the method including:

s1, acquiring a video of a human body for fetching dishes, performing frame extraction on the video to acquire a plurality of video frames, marking the video frames containing the dish fetching action in the plurality of video frames as normal samples, and selecting the normal samples as a data set;

s2, inputting the normal samples in the data set into an encoder of a target neural network model, mapping the normal samples through the encoder to obtain mapping vectors of the normal samples, reconstructing the mapping vectors of the normal samples by a decoder of the target neural network model, and outputting reconstructed video frames;

and S3, calculating a loss value between the reconstructed video frame and the video frame of the normal sample, and adjusting the parameters of the target neural network model by a gradient descent method until the loss value is not reduced any more or reaches a preset iteration number, so as to obtain a non-sensory dish-taking model.

It should be noted that the technical solution provided in this embodiment is applicable to some specific canteens, such as athlete canteens, in the technical field where refined statistics needs to be performed on food of diners.

It can be understood that, in the technical solution provided in this embodiment, a neural network model of an encoder based on 3d convolution and a decoder based on 3d convolution is adopted, a camera is installed to collect a video for picking up a dish of a user on site or search a related dish-picking video on the network, a frame of the dish-picking video is picked up to obtain a plurality of video frames, a video frame including a dish-picking action in the plurality of video frames is selected as a data set, the video frame in the data set is input into the preset neural network model, the input video frame is mapped by the encoder based on 3d convolution to obtain a mapping vector of an input vector, the mapping vector is reconstructed by the decoder based on 3d convolution, the reconstructed video frame is output, a loss value between the output video frame and the input video frame is calculated, parameters of the preset neural network model are iteratively adjusted so that the loss value output by the model does not drop or reaches a preset iteration number, the method comprises the steps that a non-sensory vegetable-taking model is obtained, only video frames containing vegetable-taking actions are trained in the model building process, and all video frames contained in a data set are video frames containing vegetable-taking actions, so that compared with the existing building processes of other models, the data set does not need to be labeled and distinguished;

it should be noted that the model building method based on reconstruction is based on the assumption that a model learned on normal data cannot accurately represent and reconstruct an abnormality. However, because the generalization of the neural network is strong, the reconstruction of some anomalies can be well realized, so that the anomalies cannot be well distinguished, and the model classification precision is reduced. To this end, we want to introduce a reliable mechanism to encourage the model to generate larger reconstruction error for the anomaly, and therefore, the embodiments of the present invention also disclose that a memory module (memory module) is added in the encoder and the decoder, the memory module can store k features (the features are extracted from data after passing through the encoder), an input is preset, so that the encoder obtains a feature a, the feature a is used to search in a preset memory library, the most relevant feature B in the memory library is retrieved, the feature B is sent to the decoder for reconstruction, the preset memory library is provided with k data at most, before starting, k normal data (dishes are taken) can be run first, the corresponding extracted features are stored in the memory library, the decoder and the encoder are optimized during the model training process, the reconstruction error is minimized, meanwhile, the characteristics in the memory library can be updated in a same step, in the testing stage, a testing sample is given, and the model is reconstructed only by using the characteristics of a limited number of normal modes (dish taking actions) recorded in the memory library, so that the reconstructed image approaches to a normal sample, the reconstruction error of the normal sample (including the video frame of the dish taking action) is smaller, and the error of abnormal reconstruction (not including the video frame of the dish taking action) is larger and can be used as the basis for judging the dish taking action;

meanwhile, the embodiment also discloses that a reverse distillation mode is introduced, the encoder and the decoder are replaced by a multi-scale encoder and a multi-scale decoder which are pre-trained on a large data set, convolution layers are added simultaneously, the features extracted from the video frame by the pre-trained multi-scale feature encoder are sent into a preset memory bank for searching after a convolution change, and it is worth noting that the structure of the multi-scale decoder and the structure of the multi-scale encoder correspond to each other, in the training stage, only the multi-scale decoder and the added convolution layers are trained, the model parameters of the multi-scale encoder are kept unchanged, the multi-scale features of the multi-scale encoder and the difference values of the multi-scale features corresponding to the multi-scale decoder are calculated simultaneously, the difference values are reduced through gradient reduction, in the actual testing stage, the difference values of the multi-scale features are obtained, and also as part of the computation of the anomaly score for the final vegetable-fetching action and the determination of other actions.

Example two

Fig. 2 is a flow chart illustrating a video understanding-based non-sensory-dish method according to another embodiment, as shown in fig. 2, the method is based on an obtained non-sensory-dish model, and the method includes:

s101, binding identity information of a diner and a face image thereof in advance;

s102, matching the personal information of the diner with a dinner plate when the diner has a dinner;

s103, matching a diner dinner plate with dishes taken by a diner according to the trained non-inductive dish taking model;

s104, acquiring the weight of each dish taken by a diner;

s105, carrying out individual statistics on the diners according to the weight of each dish of the diners;

it can be understood that, in this embodiment, the diner needs to provide a face image, the face image is matched and bound with a meal card used in a dining room, the diner needs to match and bind self information with a meal plate when having a meal, the diner matches the meal plate of the diner with the dishes through a trained non-sensory dish taking model when taking the dishes, the quality of each dish taken by the diner is calculated, the diner is subjected to individual statistics according to the quality of each dish, the diner (especially an athlete) is convenient to manage the diet of the diner, compared with the prior art, the meal plate does not need to be placed in a designated sensing area before the diner takes each dish, the dish taking efficiency of the diner is improved, the diner only needs to take the dishes normally and does not need to do additional actions during the dish taking process, the degree of adaptability of the diners is improved.

Preferably, the first and second electrodes are formed of a metal,

according to a face image provided by a diner, binding the face image with a meal card of the diner during dining;

the meal card and the face image are bound, when a diner has a meal, the face image of the diner is identified, and the meal card corresponding to the diner can be known, so that when the diner has a meal information in subsequent statistics, the information corresponding to dishes is better bound with the identity of the diner.

Preferably, the first and second electrodes are formed of a metal,

setting a serial number for each dinner plate, taking the dinner plate after a user uses the meal card to punch the card, and binding the meal card with the serial number corresponding to the dinner plate;

it can be understood that the serial numbers are set for each dinner plate, and because the identity information of the diners is bound with the meal cards, when the diners swipe the meal cards to pick up the dinner plates, the serial numbers of the dinner plates and the identity information of the diners can be bound, and similarly, when the diners take dinner information for follow-up statistics, the statistics errors are avoided.

Preferably, the first and second liquid crystal display panels are,

the noninductive dish model of getting according to training will have dinner people dinner plate and have dinner people to take the vegetable to match including:

if the loss values between all reconstructed video frames output by the non-sensory vegetable-taking model and the input video frames are larger than a preset threshold value, it is indicated that the diner does not take the vegetable;

if the loss value between the reconstructed video frame output by the non-sensory dish taking model and the input video frame is less than or equal to a preset threshold value, indicating that a dish taking action exists for a diner;

when the diner takes the dishes, acquiring dish information of the diner, and matching the dish information with the diner dinner plate;

it can be understood that according to the above-mentioned built non-sensory dish model, a camera or other equipment for acquiring a face image and video information is arranged in front of each dish, it is worth emphasizing that the camera or other equipment for acquiring a face image and video information is bound to the name of the dish, when a person takes a dish by diner, the camera captures the face image of the user and the video for taking the dish, when the face is detected at a fixed position of the video image (for example, in the middle part of the image, the main purpose of which is to filter an irrelevant face image), it is worth emphasizing that the means for filtering the irrelevant face image here may also be: detecting a spoon for taking dishes (the position of the spoon can be output) through a target detection model, outputting each skeleton point position (comprising a hand and a head) of a human body through a posture estimation model, judging who takes the spoon after being combined with a face detection model, wherein the person taking the spoon is the person for taking the dishes, and better filtering the interference of other background figures, inputting an image through the face detection model (a common detection model such as SSD or YooloX can be adopted), outputting the specific position (x, y, w, h) of the detected face, performing feature matching with a face library through a face recognition model (simply CNN extraction features can be adopted), inputting an image (x, y, w, h) of the position of a face frame detected by the face recognition model, outputting the identity of a diner matched with the face frame, and inputting a dish taking video of the diner into a built non-inductive dish taking model, when in input, the dish taking video is extracted into a single video frame, the single video frame is output as a reconstructed video frame, a loss value between the reconstructed video frame and the input video frame is calculated, if the loss value is greater than a preset threshold value, the video frame does not contain the dish taking action, if the loss value is less than or equal to the preset threshold value, the video frame contains the dish taking action, because in the building process of a non-dish taking model, only the video frame containing the dish taking action is trained, namely if the video frame contains the dish taking action, the loss value between the reconstructed video frame and the input video frame is very small, and because the non-dish taking model does not train other video frames not containing the dish taking action, if the video frame does not contain the dish taking action, the loss value between the reconstructed video frame and the input video frame is larger, whether the diner is getting the dish or not can be identified through the design, when the diner is identified to have the dish getting action, the dish which is got by the diner is obtained, and the dish which is got is matched with the diner dinner plate.

It can be understood that, in this application, through the noninductive dish model of getting binds the people of having dinner with the dish, so this application also can be applied to the noninductive payment of dish, promptly, need not use technical means such as RFID chip, dinner plate discernment and dish discernment to the people of having dinner to get the dish and discern, bind identity information and meal card, cell-phone payment back, can realize the noninductive payment of dish.

Preferably, the first and second liquid crystal display panels are,

the acquiring of the dish information of the diner comprises the following steps: a weight sensor is arranged below each dish, and when the weight of the corresponding dish is reduced when the dish taking action of the diner exists, the diner is judged to take the dish with the reduced weight;

it can be understood that a weight sensor is arranged below each type of dish and used for acquiring the remaining weight of the current dish, when the weight sensor of the dish detects that the diner has the action of taking the dish, the weight sensor of the dish detects that the weight of the remaining dish is reduced, the dish with the reduced weight is matched with the dinner plate of the diner at best, and the diner can know which dishes are taken by the diner.

Preferably, the first and second electrodes are formed of a metal,

calculating a difference value according to the weight of the dish after the diner takes the dish and the weight of the dish before the diner takes the dish, wherein the difference value is the weight of the dish taken by the diner;

it can be understood that when the diner takes the dishes, the difference value of the weight of the dishes after the dishes are taken and the weight of the dishes before the dishes are taken, which are detected by the weight sensor, is the weight of the dishes taken by the diner, so that the diner can be known, the weight of each kind of dishes taken can be used for carrying out individual statistics on the diner information of the diner, and the diner can conveniently carry out fine management on the self diet.

Example three:

fig. 3 is a system diagram illustrating a video understanding-based non-sensory-dish model building apparatus according to an exemplary embodiment, and as shown in fig. 3, the apparatus includes:

data set filtering module 1: the method comprises the steps of obtaining a video for a human body to pick up a dish, carrying out frame extraction on the video to obtain a plurality of video frames, marking the video frames containing dish picking actions in the video frames as normal samples, and selecting the normal samples as a data set;

the reconstructed video frame acquisition module 2: the decoder of the target neural network model reconstructs the mapping vector of the normal sample and outputs a reconstructed video frame;

the non-sensory dish-taking model obtaining module 3: the method is used for calculating a loss value between a reconstructed video frame and a video frame of a normal sample, and adjusting parameters of the target neural network model through a gradient descent method so that the loss value is not reduced or reaches a preset iteration number, and obtaining a non-sensory dish model;

it can be understood that, in the technical scheme provided in this embodiment, a 3d convolution-based encoder and a 3d convolution-based decoder are adopted as the neural network model, a video for a user to pick up dishes is acquired through the data set screening module 1, a frame of the video for picking up dishes is extracted to acquire a plurality of video frames, a video frame including a dish-picking action in the plurality of video frames is selected as a data set, the video frame in the data set is input into the preset neural network model through the reconstructed video frame acquisition module 2, the input video frame is mapped through the 3d convolution encoder to obtain a mapping vector of an input vector, the mapping vector is reconstructed through the 3d convolution decoder to output a reconstructed video frame, a loss value between the output video frame and the input video frame is calculated through the non-sensory dish-picking model acquisition module 3, parameters of the preset neural network model are iteratively adjusted, the loss value output by the model does not decrease or reaches the preset iteration number, a non-sensory dish-taking model is obtained, only video frames containing dish-taking actions are trained in the model building process, and all the video frames contained in the data set are video frames containing dish-taking actions, so that compared with the existing building processes of other models, the data set does not need to be labeled and distinguished.

Example four:

fig. 4 is a system diagram illustrating a video understanding-based no-sense-menu apparatus according to an exemplary embodiment, as shown in fig. 4, the apparatus including: identity binding module 101: the system is used for binding the identity information of the diner and the face image thereof;

dinner plate binding module 102: the meal taking system is used for matching the identity information of a diner with a dinner plate when the diner has a meal;

the dish binding module 103: the food service plate is used for matching the diner service plate with dishes taken by the diner according to the trained non-inductive dish taking model;

the weight acquisition module 104: the weight acquisition device is used for acquiring the weight of each dish of a diner;

the statistics module 105: the system is used for carrying out individual statistics on diners according to the weight of each dish of the diners;

it can be understood that, in this embodiment, the identity binding module 101 matches and binds the face image of the diner with the meal card used in the dining room, the meal binding module 102 matches and binds the information of the diner with the meal plate used, the meal binding module 103 matches the meal plate of the diner with the dishes through the trained non-sensory dish taking model, the weight obtaining module 104 calculates the quality of each dish taken by the diner, the counting module 105 performs individual counting on the diner according to the quality of each dish, so that the diner (especially an athlete) can conveniently manage his diet, compared with the prior art, the scheme of the present application does not need to place the diner in a designated sensing area before taking each dish, improves the dish taking efficiency of the diner, and the diner only needs to take dishes normally during the dish taking process, no extra action is needed, and the matching degree of the diners is improved.

It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.

It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A non-sensory dish method based on video understanding is characterized by comprising the following steps:

when a person has a meal, matching the identity information of the person with the dinner plate;

matching the diner dinner plate with dishes taken by the diners according to the trained non-sensory dish taking model;

acquiring the weight of each dish of a diner;

2. The method of claim 1,

the training step of the trained non-sensory dish model comprises the following steps:

acquiring a video for picking up a dish by a human body, carrying out frame extraction on the video to acquire a plurality of video frames, marking the video frames containing dish picking actions in the plurality of video frames as normal samples, and selecting the normal samples as a data set;

3. The method of claim 2,

4. The method of claim 3,

and setting a serial number for each dinner plate, using the meal card by a diner to punch the card and then taking the dinner plate, and binding the meal card with the serial number corresponding to the dinner plate.

5. The method of claim 4,

6. The method of claim 5,

7. A non-sensory-dish device based on video understanding, the device comprising:

a service plate binding module: the meal taking system is used for matching the identity information of a diner with a dinner plate when the diner has a meal;

the method comprises the following steps: