CN111860448A - Hand washing action recognition method and system - Google Patents

Hand washing action recognition method and system

Info

Publication number
CN111860448A
CN111860448A (application CN202010764529.6A)
Authority
CN
China
Prior art keywords
image, detection, hand washing, sample, image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010764529.6A
Other languages
Chinese (zh)
Inventor
李江
李骊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing HJIMI Technology Co Ltd
Original Assignee
Beijing HJIMI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing HJIMI Technology Co Ltd filed Critical Beijing HJIMI Technology Co Ltd
Priority to CN202010764529.6A priority Critical patent/CN111860448A/en
Publication of CN111860448A publication Critical patent/CN111860448A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention provides a hand washing action recognition method and system. The method comprises the following steps: in the pre-detection stage, acquiring image data of a current frame, the image data comprising a registered color image and a registered depth image; pre-detecting the image data; if the pre-detection is passed, using the depth image to perform background removal processing on the color image to obtain a foreground image; and using the foreground image to perform hand washing action recognition to obtain a recognition result corresponding to the image data, the recognition result including the recognized hand washing action category. As can be seen, in the embodiment of the present invention, hand washing action recognition is implemented in two steps: the first step performs the pre-detection, and the second step, action recognition, is performed only after the pre-detection is passed. In the second step, the registered depth image assists in background removal, filtering out much of the background information and increasing recognition robustness; the foreground image is then recognized to obtain the recognition result.

Description

Hand washing action recognition method and system
Technical Field
The invention relates to the field of computers, in particular to a hand washing action recognition method and system.
Background
Many industries impose requirements on hand washing procedures. Traditionally, compliance has relied on personal awareness and training, which lacks effective supervision and inevitably leads to oversights. Automated supervision based on computer vision combined with machine learning can therefore save labor and cost while ensuring that the hand washing steps are performed correctly and according to specification.
The prerequisite for automated supervision is the recognition of the hand washing action. How to recognize the hand washing action has therefore become a topic of active research.
Disclosure of Invention
In view of this, embodiments of the present invention provide a hand washing action recognition method and system to realize hand washing action recognition.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
a hand washing action recognition method, comprising:
in the pre-detection stage, acquiring image data of a current frame; the image data comprises a registered color image and a registered depth image;
pre-detecting the image data;
if the pre-detection is passed, the depth image is used for carrying out background removal processing on the color image to obtain a foreground image;
using the foreground image to identify the hand washing action to obtain an identification result corresponding to the image data; the recognition result includes the recognized hand washing action category.
Optionally, the performing the pre-detection includes: performing flush detection using the image data; performing foam detection using the image data; and if no flushing state exists and no foam exists, determining that the pre-detection is passed.
Optionally, before the pre-detection stage, a sample preparation phase is further included; the sample preparation phase comprises: acquiring an image sample; the image samples comprise registered color image samples, depth image samples, and labels; the label is a first label, a second label or a third label; wherein the first label comprises: information characterizing the flushing state; the second label comprises: information characterizing the presence of foam; and the third label includes a hand washing action category; performing data enhancement on the registered color image samples and depth image samples to expand the number of image samples; and carrying out normalization processing on each image sample obtained after the data enhancement.
Optionally, the label is the first label, and the normalized image sample is a first target image sample; the label is the second label, and the normalized image sample is a second target image sample; the label is the third label, the color image sample in the normalized image sample is a target color image sample, and the depth image sample is a target depth image sample; the sample preparation phase further comprises: performing background removal processing on the corresponding target color image by using the target depth image to obtain a foreground image, and forming a third target image sample from the obtained foreground image, the target depth image and the third label.
Optionally, after the sample preparation phase and before the pre-detection phase, a training phase is further included; the flush detection is performed by a trained first machine learning model, the foam detection is performed by a trained second machine learning model, and the hand wash action recognition is performed by a trained third machine learning model; the training phase comprises: performing a plurality of iterative trainings on a first machine learning model based on the first target image sample to obtain the trained first machine learning model; performing a plurality of iterative trainings on a second machine learning model based on the second target image sample to obtain the trained second machine learning model; and performing multiple times of iterative training on a third machine learning model based on the third target image sample to obtain the trained third machine learning model.
Optionally, the using the registered depth image to perform background removal processing on the color image to obtain a foreground image includes: and aiming at any pixel point, if the depth value of the pixel point in the depth image is out of a preset range, setting the pixel value of the pixel point in the color image to be 0.
Optionally, the third machine learning model includes: a plurality of directly connected depth-separable convolutional layers; a full convolution layer; a global pooling layer; and a classification task layer.
Optionally, the method further includes: determining and outputting the current hand washing action category by using the identification result of the continuous multi-frame image data; the continuous multi-frame image data comprises the image data of the current frame and continuous N frames of image data before the image data of the current frame; n is a positive integer.
A hand washing action recognition system comprising:
the acquisition unit is used for acquiring the image data of the current frame in the pre-detection stage; the image data comprises a registered color image and depth image;
a pre-detection unit to: pre-detecting the image data;
a pre-processing unit for:
if the pre-detection is passed, the depth image is used for carrying out background removal processing on the color image to obtain a foreground image;
a hand washing action recognition unit for:
using the foreground image to identify the hand washing action to obtain an identification result corresponding to the image data; the recognition result includes the recognized hand washing action category.
As can be seen, in the embodiment of the present invention, hand washing action recognition is implemented in two steps: the first step performs the pre-detection, and the second step, action recognition, is performed only after the pre-detection is passed. In the second step, the registered depth image assists in background removal, filtering out much of the background information and increasing recognition robustness; the foreground image is then recognized to obtain the recognition result.
Drawings
Fig. 1 is an exemplary configuration of a hand washing action recognition system according to an embodiment of the present invention;
fig. 2 is an exemplary flow chart of a hand washing action recognition method according to an embodiment of the present invention;
FIG. 3 is another exemplary configuration of a hand washing action recognition system provided by an embodiment of the present invention;
fig. 4 is another exemplary flow chart of a hand washing action recognition method according to an embodiment of the present invention;
FIG. 5 is an exemplary flow of a sample preparation phase provided by embodiments of the present invention;
fig. 6 is an exemplary structure of a CNN model provided in an embodiment of the present invention.
Detailed Description
For reference and clarity, the terms and abbreviations used hereinafter are summarized as follows:
CNN: Convolutional Neural Network;
depth image: also called range image; an image in which the distance (depth) from the image collector to each point in the scene is taken as the pixel value;
3D: 3-Dimensional;
Loss Function: the function that measures training error;
SGD: Stochastic Gradient Descent;
NMS: Non-Maximum Suppression.
Most existing gesture recognition schemes on the market are static recognition schemes based purely on color images, in which deep-learning classification experiments are carried out on large numbers of collected images of different hand shapes. However, hand washing action recognition that relies only on color data has poor accuracy, is easily disturbed by illumination, background and other factors, and lacks robustness.
In view of the above, the present invention provides a hand washing action recognition method and system, so as to realize hand washing action recognition and solve the above problems.
Referring to fig. 1, an exemplary structure of the hand washing action recognition system includes: an acquisition unit 1, a pre-detection unit 2, a preprocessing unit 3 and a hand washing action recognition unit 4.
Furthermore, the system may further comprise an output unit 5 for outputting information for interaction with a person, such as the recognized action, prompt tones, alarms, and the like.
Wherein, the acquisition unit 1 includes an RGBD data module, where RGB refers to red, green and blue, and D refers to depth. The module includes a device that captures color images (RGB) (e.g., a color camera) and a device that captures depth images (e.g., a depth camera).
A depth camera is also known as a 3D camera. An ordinary color camera records all objects within its field of view as a picture (2D image), but the recorded data does not contain the distance of those objects from the camera. The data collected by a depth camera gives the exact distance from the camera to each point in the image, so combining this distance with the (x, y) coordinates of a pixel in the 2D image yields the three-dimensional spatial coordinates of each pixel.
The acquisition unit 1 may be deployed at the hand washing location. The deployment position and angle need to ensure that the color image and the depth image can be acquired simultaneously.
The acquisition unit 1 and the output unit 5 may be installed in the same apparatus.
As for the pre-detection unit 2, the pre-processing unit 3, and the hand washing action recognition unit 4, they may be installed in the same device as the acquisition unit 1, or may be deployed in an action recognition server and communicate via a network, or the pre-detection unit 2, the pre-processing unit 3, and the hand washing action recognition unit 4 may be separate servers.
Fig. 2 illustrates an exemplary flow of a hand washing action recognition method performed by the hand washing action recognition system described above, including:
S1: in the pre-detection stage, image data of the current frame is acquired.
The image data (RGBD data) includes registered color images and depth images.
The purpose of registration is to allow the depth image and the color image to be fused together, i.e., to convert the image coordinate system of the depth image into the image coordinate system of the color image.
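By way of illustration only, the following is a minimal sketch of such a registration (including the pixel-to-3D back-projection described above), assuming a pinhole camera model with known intrinsic matrices K_d and K_c for the depth and color cameras and a known rotation R and translation t between them; these names and interfaces are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def register_depth_to_color(depth, K_d, K_c, R, t):
    """Warp a depth image into the color camera's image coordinate system.

    depth : HxW array of depth values in meters (0 = invalid).
    K_d, K_c : 3x3 intrinsic matrices of the depth and color cameras.
    R, t : rotation (3x3) and translation (3,) from the depth to the color frame.
    """
    h, w = depth.shape
    registered = np.zeros((h, w), dtype=depth.dtype)
    fx, fy = K_d[0, 0], K_d[1, 1]
    cx, cy = K_d[0, 2], K_d[1, 2]
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    # Back-project each depth pixel (u, v, z) to a 3D point in the depth frame.
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=0)
    # Transform into the color camera frame and project with its intrinsics.
    pts_c = R @ pts + t.reshape(3, 1)
    uvw = K_c @ pts_c
    uc = np.round(uvw[0] / uvw[2]).astype(int)
    vc = np.round(uvw[1] / uvw[2]).astype(int)
    ok = (uc >= 0) & (uc < w) & (vc >= 0) & (vc < h)
    registered[vc[ok], uc[ok]] = pts_c[2, ok]
    return registered
```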
Specifically, step S1 may be performed by the aforementioned acquisition unit 1.
As mentioned above, the acquiring unit 1 is disposed at the hand washing site, and may be an RGBD data module, which periodically and simultaneously acquires a color image and a depth image.
The registered color image and depth image acquired at any acquisition time may be referred to as a frame of image data. The image data acquired at the current time can be regarded as the image data of the current frame.
S2: the image data is pre-detected.
Step S2 may be performed by the aforementioned pre-detection unit 2.
The steps involved in the pre-detection may be designed as appropriate; for example, the pre-detection may include flush detection, foam detection, or both.
S3: and if the image passes the pre-detection, the depth image is used for carrying out background removal processing on the color image to obtain a foreground image.
Step S3 may be executed by the preprocessing unit 3 described above.
Specifically, for any pixel point, if the depth value of the pixel point in the depth image is out of the preset range, the pixel value of the pixel point in the color image is set to be 0.
The preset range may be determined according to the optimal recognition distance of the depth camera, and may be, for example, 50 cm to 1.2 m: pixels whose depth value is smaller than 50 cm or larger than 1.2 m are set to 0 in the color image. In this way most of the background in the image can be removed effectively, extracting the color hand image.
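A minimal sketch of this depth-threshold background removal follows, assuming the depth image is a registered H x W array in meters alongside an H x W x 3 color image; the 0.5 m and 1.2 m defaults follow the example range given above:

```python
import numpy as np

def remove_background(color, depth, near=0.5, far=1.2):
    """Zero out color pixels whose registered depth falls outside [near, far].

    color : HxWx3 uint8 color image.
    depth : HxW float array of registered depth values in meters.
    Returns the foreground image; the 0.5 m / 1.2 m defaults follow the
    example range given in the description.
    """
    mask = (depth >= near) & (depth <= far)
    foreground = color.copy()
    foreground[~mask] = 0  # out-of-range pixels (including depth 0) are cleared
    return foreground
```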
S4: and performing hand washing action recognition by using the foreground image to obtain a recognition result corresponding to the image data.
The recognition result includes recognized hand washing action categories, such as crossed finger rubbing, fist rotating rubbing, ten-finger gathering rubbing, and the like.
Different numbers or characters can be used to characterize different hand washing action categories, and those skilled in the art can flexibly design the hand washing action categories, which are not described in detail herein.
Step S4 may be performed by hand washing action recognition unit 4 as described above.
In addition, for image data that fails the pre-detection, the corresponding recognition result may be the pre-detection result.
The pre-detection result may include one or more of: a result characterizing the presence or absence of flushing, and a result characterizing the presence or absence of foam.
As can be seen, in the embodiment of the present invention, hand washing action recognition is implemented in two steps: the first step performs the pre-detection (pre-detection stage), and the second step, action recognition, is performed only after the pre-detection is passed. In the second step, the registered depth image assists in background removal, filtering out much of the background information and increasing recognition robustness; the foreground image is then recognized to obtain the recognition result.
In practice, one hand washing action category may include a series of different hand gestures; for example, the fist rotating-rubbing action may include handshake-like gestures and gestures in which the two hands are in different rotation states. A recognition result based on a single frame of image data may therefore cause misjudgment.
Therefore, referring to fig. 2, in other embodiments of the present invention, after step S4, the following steps may be further included:
s5: and determining and outputting the current hand washing action category by using the identification result of the continuous multi-frame image data.
Referring to fig. 3, the system may further include a post-processing unit 6, and the post-processing unit 6 may execute step S5 to output the current hand washing action category to the output unit 5.
The continuous multi-frame image data comprises image data of a current frame and continuous N frames of image data before the image data of the current frame; n is a positive integer.
That is, the current hand washing action type can be comprehensively considered and output in combination with the identification results (whether flush water exists, whether foam exists, and which action type) of the continuous multi-frame image data.
The value of N can be flexibly designed by those skilled in the art.
For example, when N = 29, the current hand washing action category may be determined using the recognition results of 30 consecutive frames.
More specifically, for 30 consecutive frames of image data, the number of occurrences of each recognition result (flushing, foam, and each recognized hand washing action category) may be counted. A result whose count exceeds a preset threshold (for example, 10 frames), or whose proportion of the 30 frames is the largest and exceeds 30%, may be taken as the hand washing action category and output to the output unit 5.
After output, the oldest of the 30 frames can be removed; for example, when the current frame is frame 31, the recognition results of frames 2-31 are used for the determination, and when the current frame is frame 32, the recognition results of frames 3-32 are used.
It is also possible to discard all 30 frames and accumulate another 30: for example, step S5 is performed once for frames 1-30 (the current frame being frame 30), once for frames 31-60, and so on, which is not described again.
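A minimal sketch of this sliding-window voting follows; the 30-frame window, 10-frame threshold and 30% proportion mirror the example above, while the per-frame interface (plain string labels) is an illustrative assumption:

```python
from collections import Counter, deque

class VotingPostProcessor:
    """Sliding-window vote over per-frame recognition results (a sketch).

    Keeps the last `window` results; when the window is full, outputs the
    most frequent result if its count exceeds `min_count` or its share of
    the window exceeds `min_ratio`, mirroring the 10-frame / 30% example.
    """

    def __init__(self, window=30, min_count=10, min_ratio=0.30):
        self.results = deque(maxlen=window)  # oldest frame drops out automatically
        self.window, self.min_count, self.min_ratio = window, min_count, min_ratio

    def push(self, result):
        """Add one frame's result ('flushing', 'foam', or an action label)."""
        self.results.append(result)
        if len(self.results) < self.window:
            return None  # not enough frames accumulated yet
        label, count = Counter(self.results).most_common(1)[0]
        if count > self.min_count or count / self.window > self.min_ratio:
            return label
        return None
```

Called once per frame with the latest recognition result, push() returns a category only when the window is full and one label dominates; otherwise it returns None.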
In other embodiments of the present invention, other logic may be used to determine whether the user has washed his or her hands carefully according to the prescribed gestures, including whether flushing occurred, whether liquid soap was used (foam detection), whether the hand movements meet the specification, and the like.
It should be noted that deep learning models based on RGBD data fusion (models trained on the color image and depth image together) already exist, but they have the following problems: under flushing conditions the depth values become 0, so model training cannot reach the higher recognition accuracy required; in addition, water and foam partially occlude the hands, which affects the hand recognition result to a great extent.
To solve the above problem, fig. 4 shows another exemplary flow of a hand washing action recognition method performed by the hand washing action recognition system, including:
S41: in the pre-detection stage, image data of the current frame is acquired.
S41 is the same as S1, and is not repeated here.
S42: flush detection is performed using image data.
Flush detection is performed by a trained first machine learning model.
Specifically, flush detection may be performed using the depth image alone, or using the depth image and the color image together.
S43: the image data is used for foam detection.
The foam detection is performed by a trained second machine learning model.
The flushing detection and the foam detection belong to the pre-detection. Therefore, the pre-detection unit 2 may further include: a trained first machine learning model and a trained second machine learning model.
S44: if no flushing is present and no foam is present, it is determined that the pre-detection is passed.
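A minimal sketch of this pre-detection gate follows, assuming the trained first and second machine learning models are available as callables that return True when flushing or foam, respectively, is detected; these interfaces are assumptions, not from the patent:

```python
def pre_detect(color, depth, flush_model, foam_model):
    """Gate of the pre-detection stage (a sketch; model APIs are assumed).

    flush_model and foam_model stand in for the trained first and second
    machine learning models; each is assumed to return True when flushing
    or foam is detected in the current frame.
    """
    flushing = flush_model(depth)           # flush detection from the depth image
    foam = foam_model(color)                # foam detection from the color image
    passed = (not flushing) and (not foam)  # pass only when neither is present
    return passed, flushing, foam
```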
S45 is similar to S3, and is not described herein.
S46: and performing hand washing action recognition by using the foreground image to obtain a recognition result corresponding to the image data.
Hand wash action recognition may be performed by a trained third machine learning model.
The first to third machine learning models may exemplarily be CNN models.
S47 is similar to S5, and is not described herein.
When flushing or foam is detected, hand washing action recognition cannot be performed reliably on such frames, nor is it necessary; therefore, in the embodiment of the invention, hand washing action recognition mainly targets the hand-rubbing actions performed after flushing or foaming. In this embodiment, the hand washing action is recognized only when it is determined that the current frame contains neither flushing nor foam.
Furthermore, for better hand washing action recognition, tests can be carried out when the acquisition unit is deployed.
The testing step may include:
step A: the user executes a preset action;
and B: the hand washing action recognition system recognizes and outputs the recognized hand washing action type, and can judge whether the recognized hand washing action type is consistent with an expected result.
That is, after deployment, a testing stage is entered: the user is prompted to perform some preset actions (for example, closing the two hands together, or turning on the faucet), an RGBD image is then acquired, and the image is recognized to check whether the recognition is correct. For how the recognition is performed, reference is made to the above description, which is not repeated here.
If the result is inconsistent with the expectation, the distance between the acquisition unit and the faucet, as well as the illumination intensity, chromaticity, saturation and the like of the current scene, can be adjusted on site.
After the adjustment is completed, the test is performed again; if the result is still inconsistent, adjustment and testing are repeated, and so on, which is not described in detail.
The first to third machine learning models described above all need to be trained.
Before training, the training samples need to be prepared.
Therefore, before the pre-detection stage, a sample preparation stage and a training stage are required. This is described below.
First, a sample preparation phase.
Referring to fig. 5, the sample preparation phase may illustratively include the following steps:
s51: an image sample is acquired.
The image samples may include color image samples, depth image samples, and labels that are registered.
In one example, the color image sample and the depth image sample in the image sample may be collected by the RGBD data module of the acquisition unit, and the labels are then added manually.
In one example, the labels for the first through third machine learning models may be a first label, a second label, and a third label, respectively.
Wherein the first label may include: information characterizing the flushing state;
the second label may include: information characterizing the presence of foam;
the third label may include a hand washing action category.
S52: data enhancement is performed on the registered color image samples and depth image samples to expand the number of image samples.
The data enhancement processing may include: rotating, translating or mirroring the image, adding random noise, and the like.
In this way, one image sample can be expanded into multiple samples; the label, of course, remains unchanged.
S53: and carrying out normalization processing on each image sample obtained after the data enhancement.
Specifically, the color image sample is normalized.
The normalization operation is conventional and is not described in detail.
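A minimal sketch of the enhancement and normalization steps follows, using only a few simple transforms for illustration; the transform set and magnitudes are assumptions, and the same geometric transform is applied to both images of a registered pair so that registration is preserved:

```python
import numpy as np

def augment_pair(color, depth, rng):
    """Apply one random geometric/noise transform to a registered pair (sketch).

    The same geometric transform must be applied to the color and the depth
    image so they stay registered; the label is unchanged. Only a few simple
    transforms are shown here as an illustration.
    """
    choice = rng.integers(4)
    if choice == 0:                       # horizontal mirror
        return np.fliplr(color).copy(), np.fliplr(depth).copy()
    if choice == 1:                       # small translation
        dy, dx = rng.integers(-10, 11, size=2)
        return (np.roll(color, (dy, dx), axis=(0, 1)),
                np.roll(depth, (dy, dx), axis=(0, 1)))
    if choice == 2:                       # 90-degree rotation
        return np.rot90(color).copy(), np.rot90(depth).copy()
    noisy = color.astype(np.float32) + rng.normal(0, 5, color.shape)  # random noise
    return np.clip(noisy, 0, 255).astype(color.dtype), depth

def normalize(color):
    """Scale color values to [0, 1] (one common normalization choice)."""
    return color.astype(np.float32) / 255.0
```

A generator such as rng = np.random.default_rng() would be passed in, and augment_pair would be called repeatedly on each sample to expand it into multiple copies with the label unchanged.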
For the sake of distinction, an image sample labeled with the first label and normalized may be referred to as a first target image sample; an image sample labeled with the second label and normalized is referred to as a second target image sample; and in an image sample labeled with the third label and normalized, the color image sample is referred to as a target color image sample and the depth image sample as a target depth image sample.
For the target color image samples and target depth image samples, the following operation may also be performed:
s54: and performing background removal processing on the corresponding target color image by using the target depth image to obtain a foreground image, wherein the obtained foreground image, the target depth image and the third label form a third target image sample.
For background removal, reference is made to the above description, and details are not repeated here.
It should be noted that the first to third target image samples may be divided into a training set and a test set; that is, the training set includes one portion of the first to third target image samples, and the test set includes the remaining portion.
Second, the training phase.
still referring to fig. 5, the training phase includes:
s55: performing multiple iterative training on the first machine learning model based on the first target image sample to obtain a trained first machine learning model;
specifically, the depth image in the first target image sample may be used to perform iterative training on the first machine learning model, and the depth image and the color image may also be used to perform iterative training on the first machine learning model simultaneously.
S56: performing multiple iterative training on the second machine learning model based on the second target image sample to obtain a trained second machine learning model;
s57: and performing multiple times of iterative training on the third machine learning model based on the third target image sample to obtain a trained third machine learning model.
Wherein each iterative training comprises:
the first/second/third machine learning model learns on consecutive multi-frame image samples in the training set to obtain a learned first/second/third machine learning model;
consecutive multi-frame image samples in the test set are then input into the learned first/second/third machine learning model, and parameter optimization is performed according to the recognition results output by the learned model and the labels of the image samples.
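A minimal PyTorch-style sketch of one such training procedure follows, using the SGD optimizer listed in the abbreviations above; the epoch count, learning rate, momentum and loader interfaces are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def train_model(model, train_loader, test_loader, epochs=30, lr=0.01):
    """Iterative training loop (a sketch; the loaders are assumed to yield
    (image_batch, label_batch) pairs built from the target image samples)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:      # learn on training samples
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), labels)
            loss.backward()
            optimizer.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():                    # evaluate on the test set
            for images, labels in test_loader:
                pred = model(images).argmax(dim=1)
                correct += (pred == labels).sum().item()
                total += labels.numel()
        print(f"epoch {epoch}: test accuracy {correct / total:.3f}")
    return model
```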
The third machine learning model is described below.
In one example, the third machine learning model may be a CNN model (see fig. 6), which may include:
1, a plurality of directly connected depth separable convolutional layers (or convolution blocks, Cn in fig. 6), where n denotes the n-th depth separable convolutional layer; each block may specifically consist of:
DepthwiseConv + BN + ReLU + PointwiseConv + BN + ReLU, and is used to extract the features of the image data, i.e., the feature expression;
2, a full convolution layer C1x1 for compressing or expanding the number of channels;
3, a global pooling layer (GP) for reducing each feature map to a single feature value;
and 4, a classification task layer Cls, which outputs the probability of each class; the current data is recognized as the action of the class whose probability is the largest.
In a conventional CNN model, there are pooling downsampling layers between the depth separable convolutional layers. Pooling downsampling discards potentially relevant information in the original data, which affects recognition accuracy; in particular, in some complex gesture interaction scenes, two finely subdivided gesture actions may differ by only a few pixels, and misjudging those pixels causes recognition errors.
The CNN model of the present invention has no pooling downsampling layers, so the resolution remains unchanged. The strategy for expanding the receptive field instead relies on the convolution kernel size of each convolution block Cn, enlarged layer by layer from the shallow layers to the deep layers, or uses dilated (atrous) convolution to expand the receptive field (in short, holes are inserted into an ordinary convolution kernel, which enlarges the receptive field).
It should be noted that the convolution kernel size is generally an odd number; the larger the kernel, the more global the image features that can be extracted, but the greater the amount of computation, so sizes such as 3, 5 and 7 are typical.
In a convolutional neural network, the receptive field is defined as the size of the region of the input image onto which a pixel of the feature map output by a layer of the network is mapped.
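A minimal PyTorch sketch of a model along these lines follows: directly connected depthwise separable blocks with no pooling between them, a dilation schedule that grows layer by layer to expand the receptive field, a 1x1 full convolution layer, a global pooling layer and a classification layer. The channel widths, dilation schedule and 4-channel (foreground color plus depth) input are illustrative assumptions, not taken from the patent:

```python
import torch.nn as nn

def sep_block(cin, cout, dilation=1):
    """DepthwiseConv + BN + ReLU + PointwiseConv + BN + ReLU, stride 1
    (no pooling or downsampling, so the resolution is preserved)."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, kernel_size=3, padding=dilation,
                  dilation=dilation, groups=cin, bias=False),
        nn.BatchNorm2d(cin), nn.ReLU(inplace=True),
        nn.Conv2d(cin, cout, kernel_size=1, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class HandWashCNN(nn.Module):
    """Sketch of the described model; channel widths and the layer-by-layer
    dilation schedule are illustrative assumptions."""

    def __init__(self, num_classes, in_ch=4):   # e.g. RGB foreground + depth
        super().__init__()
        self.features = nn.Sequential(           # Cn blocks, dilation grows with depth
            sep_block(in_ch, 32, dilation=1),
            sep_block(32, 64, dilation=2),
            sep_block(64, 128, dilation=4),
        )
        self.compress = nn.Conv2d(128, 256, kernel_size=1)  # full conv layer C1x1
        self.gp = nn.AdaptiveAvgPool2d(1)        # global pooling layer GP
        self.cls = nn.Linear(256, num_classes)   # classification task layer Cls

    def forward(self, x):
        x = self.gp(self.compress(self.features(x)))
        return self.cls(x.flatten(1))
```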
A hand washing action recognition system is described below. Please refer to fig. 1, which exemplarily includes:
an obtaining unit 1, configured to obtain image data of a current frame in a pre-detection stage; the image data comprises a registered color image and depth image;
a pre-detection unit 2 for: pre-detecting image data;
a preprocessing unit 3 for:
if the pre-detection is passed, the depth image is used to perform background removal processing on the color image to obtain a foreground image;
a hand washing action recognition unit 4 for:
performing hand washing action recognition by using the foreground image to obtain a recognition result corresponding to the image data; the recognition result includes the recognized hand washing action category.
Furthermore, the system may further comprise an output unit 5 for outputting information for interaction with a person, such as the recognized action, prompt tones, alarms, and the like.
For details, please refer to the foregoing description, which is not repeated herein.
In one example, referring to fig. 3, the system may further include:
a post-processing unit 6 for:
determining and outputting the current hand washing action category by using the identification result of the continuous multi-frame image data;
the continuous multi-frame image data comprises image data of a current frame and continuous N frames of image data before the image data of the current frame; n is a positive integer.
For details, please refer to the foregoing description, which is not repeated herein.
In other embodiments of the present invention, in terms of performing the background removal process, the preprocessing units in all the embodiments described above may be specifically configured to:
and aiming at any pixel point, if the depth value of the pixel point in the depth image is out of the preset range, setting the pixel value of the pixel point in the color image as 0.
For details, please refer to the foregoing description, which is not repeated herein.
In other embodiments of the present invention, in terms of performing the pre-detection, the pre-detection units in all the embodiments are specifically configured to:
perform flush detection using the depth image;
perform foam detection using the color image;
and if no flushing is present and no foam is present, determine that the pre-detection is passed.
For details, please refer to the foregoing description, which is not repeated herein.
In one example, the pre-detection unit may further include: a trained first machine learning model and a trained second machine learning model.
Wherein the flush detection is performed by the trained first machine learning model, and the foam detection is performed by the trained second machine learning model.
In one example, the hand washing action recognition unit may include a trained third machine learning model. Hand wash action recognition is performed by a trained third machine learning model.
For details, please refer to the foregoing description, which is not repeated herein.
In other embodiments of the present invention, before the pre-detection stage, a sample preparation stage may be further included;
the system may further comprise: a sample acquisition unit for:
acquiring an image sample; the image samples comprise registered color image samples, depth image samples, and labels; the label is a first label, a second label or a third label; wherein the first label comprises: information characterizing the flushing state; the second label includes: information characterizing the presence of foam; the third label includes a hand washing action category;
performing data enhancement on the registered color image samples and depth image samples to expand the number of image samples;
and carrying out normalization processing on each image sample obtained after the data enhancement.
For details, please refer to the foregoing description, which is not repeated herein.
An image sample labeled with the first label and normalized may be referred to as a first target image sample; an image sample labeled with the second label and normalized is referred to as a second target image sample;
in an image sample labeled with the third label and normalized, the color image sample is referred to as a target color image sample and the depth image sample as a target depth image sample;
the sample acquiring unit is further configured to, in a sample preparation phase:
and performing background removal processing on the corresponding target color image by using the target depth image to obtain a foreground image, wherein the obtained foreground image, the target depth image and the third label form a third target image sample.
For details, please refer to the foregoing description, which is not repeated herein.
After the sample preparation stage and before the pre-detection stage, the method also comprises a training stage;
the system may further comprise a training unit for:
performing multiple iterative training on the first machine learning model based on the first target image sample to obtain a trained first machine learning model;
performing multiple iterative training on the second machine learning model based on the second target image sample to obtain a trained second machine learning model;
and performing multiple times of iterative training on the third machine learning model based on the third target image sample to obtain a trained third machine learning model.
For details, please refer to the foregoing description, which is not repeated herein.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is simple, and the description can be referred to the method part.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A hand washing action recognition method, comprising:
in the pre-detection stage, acquiring image data of a current frame; the image data comprises a registered color image and a registered depth image;
pre-detecting the image data;
if the pre-detection is passed, the depth image is used for carrying out background removal processing on the color image to obtain a foreground image;
using the foreground image to identify the hand washing action to obtain an identification result corresponding to the image data; the recognition result includes the recognized hand washing action category.
2. The method of claim 1, wherein said performing the pre-detection comprises:
flushing detection is performed using the image data;
performing foam detection using the image data;
and if no flushing state exists and no foam exists, determining that the pre-detection is passed.
3. The method of claim 2, further comprising, prior to the pre-detection phase, a sample preparation phase;
the sample preparation phase comprises:
acquiring an image sample; the image samples comprise registered color image samples, depth image samples, and labels; the label is a first label, a second label or a third label; wherein the first label comprises: information characterizing the flushing state; the second label includes: information characterizing the presence of foam; the third label includes a hand washing action category;
performing data enhancement on the registered color image samples and depth image samples to expand the number of image samples;
and carrying out normalization processing on each image sample obtained after the data enhancement.
4. The method of claim 3,
the label is the first label, and the normalized image sample is a first target image sample;
the label is the second label, and the normalized image sample is a second target image sample;
the label is the third label, the color image sample in the normalized image sample is a target color image sample, and the depth image sample is a target depth image sample;
the sample preparation phase further comprises:
and performing background removal processing on the corresponding target color image by using the target depth image to obtain a foreground image, and forming a third target image sample from the obtained foreground image, the target depth image and the third label.
5. The method of claim 4, wherein after the sample preparation phase, and before the pre-detection phase, further comprising a training phase;
the flush detection is performed by a trained first machine learning model, the foam detection is performed by a trained second machine learning model, and the hand wash action recognition is performed by a trained third machine learning model;
the training phase comprises:
performing a plurality of iterative trainings on a first machine learning model based on the first target image sample to obtain the trained first machine learning model;
performing a plurality of iterative trainings on a second machine learning model based on the second target image sample to obtain the trained second machine learning model;
and performing multiple times of iterative training on a third machine learning model based on the third target image sample to obtain the trained third machine learning model.
6. The method of claim 1, wherein the performing background removal processing on the color image using the registered depth image to obtain a foreground image comprises:
and aiming at any pixel point, if the depth value of the pixel point in the depth image is out of a preset range, setting the pixel value of the pixel point in the color image to be 0.
7. The method of claim 1, wherein the third machine learning model comprises:
a plurality of directly connected depth-separable convolutional layers;
a full convolution layer;
a global pooling layer;
and a classification task layer.
8. The method of claim 1, further comprising:
determining and outputting the current hand washing action category by using the identification result of the continuous multi-frame image data;
the continuous multi-frame image data comprises the image data of the current frame and continuous N frames of image data before the image data of the current frame; n is a positive integer.
9. A hand washing action recognition system, comprising:
the acquisition unit is used for acquiring the image data of the current frame in the pre-detection stage; the image data comprises a registered color image and depth image;
a pre-detection unit to: pre-detecting the image data;
a pre-processing unit for:
if the pre-detection is passed, the depth image is used for carrying out background removal processing on the color image to obtain a foreground image;
a hand washing action recognition unit for:
using the foreground image to identify the hand washing action to obtain an identification result corresponding to the image data; the recognition result includes the recognized hand washing action category.
CN202010764529.6A 2020-07-30 2020-07-30 Hand washing action recognition method and system Pending CN111860448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010764529.6A CN111860448A (en) 2020-07-30 2020-07-30 Hand washing action recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010764529.6A CN111860448A (en) 2020-07-30 2020-07-30 Hand washing action recognition method and system

Publications (1)

Publication Number Publication Date
CN111860448A (en) 2020-10-30

Family

ID=72954297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010764529.6A Pending CN111860448A (en) 2020-07-30 2020-07-30 Hand washing action recognition method and system

Country Status (1)

Country Link
CN (1) CN111860448A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113034503A (en) * 2021-05-28 2021-06-25 博奥生物集团有限公司 High-flux automatic cup separating method, device and system
CN113240723A (en) * 2021-05-18 2021-08-10 中德(珠海)人工智能研究院有限公司 Monocular depth estimation method and device and depth evaluation equipment
CN116071687A (en) * 2023-03-06 2023-05-05 四川港通医疗设备集团股份有限公司 Hand cleanliness detection method and system
US11823438B2 (en) 2020-11-09 2023-11-21 Industrial Technology Research Institute Recognition system and image augmentation and training method thereof

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070064986A1 (en) * 2001-03-13 2007-03-22 Johnson Raymond C Touchless identification in system for monitoring hand washing or application of a disinfectant
US20090087028A1 (en) * 2006-05-04 2009-04-02 Gerard Lacey Hand Washing Monitoring System
CN103295016A (en) * 2013-06-26 2013-09-11 天津理工大学 Behavior recognition method based on depth and RGB information and multi-scale and multidirectional rank and level characteristics
CN104167016A (en) * 2014-06-16 2014-11-26 西安工业大学 Three-dimensional motion reconstruction method based on RGB color and depth image
CN108256421A (en) * 2017-12-05 2018-07-06 盈盛资讯科技有限公司 A kind of dynamic gesture sequence real-time identification method, system and device
CN109726668A (en) * 2018-12-25 2019-05-07 大连海事大学 It is based on computer vision to wash one's hands and disinfection process normalization automatic testing method
CN110263689A (en) * 2019-06-11 2019-09-20 深圳市第三人民医院 One kind is washed one's hands monitoring method and its system, hand washing device
CN110309813A (en) * 2019-07-10 2019-10-08 南京行者易智能交通科技有限公司 A kind of model training method, detection method, device, mobile end equipment and the server of the human eye state detection based on deep learning
KR20190133867A (en) * 2018-05-24 2019-12-04 (주)네모랩스 System for providing ar service and method for generating 360 angle rotatable image file thereof
CN110796018A (en) * 2019-09-30 2020-02-14 武汉科技大学 Hand motion recognition method based on depth image and color image

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070064986A1 (en) * 2001-03-13 2007-03-22 Johnson Raymond C Touchless identification in system for monitoring hand washing or application of a disinfectant
US20090087028A1 (en) * 2006-05-04 2009-04-02 Gerard Lacey Hand Washing Monitoring System
CN103295016A (en) * 2013-06-26 2013-09-11 天津理工大学 Behavior recognition method based on depth and RGB information and multi-scale and multidirectional rank and level characteristics
CN104167016A (en) * 2014-06-16 2014-11-26 西安工业大学 Three-dimensional motion reconstruction method based on RGB color and depth image
CN108256421A (en) * 2017-12-05 2018-07-06 盈盛资讯科技有限公司 A kind of dynamic gesture sequence real-time identification method, system and device
KR20190133867A (en) * 2018-05-24 2019-12-04 (주)네모랩스 System for providing ar service and method for generating 360 angle rotatable image file thereof
CN109726668A (en) * 2018-12-25 2019-05-07 大连海事大学 It is based on computer vision to wash one's hands and disinfection process normalization automatic testing method
CN110263689A (en) * 2019-06-11 2019-09-20 深圳市第三人民医院 One kind is washed one's hands monitoring method and its system, hand washing device
CN110309813A (en) * 2019-07-10 2019-10-08 南京行者易智能交通科技有限公司 A kind of model training method, detection method, device, mobile end equipment and the server of the human eye state detection based on deep learning
CN110796018A (en) * 2019-09-30 2020-02-14 武汉科技大学 Hand motion recognition method based on depth image and color image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
司阳; 任松; 肖秦琨; 马媛; 钟昆; 杨雪梦: "基于彩色-深度图像的手语识别算法" [Sign language recognition algorithm based on color-depth images], 科学技术与工程 [Science Technology and Engineering], no. 11, pages 109-114 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11823438B2 (en) 2020-11-09 2023-11-21 Industrial Technology Research Institute Recognition system and image augmentation and training method thereof
CN113240723A (en) * 2021-05-18 2021-08-10 中德(珠海)人工智能研究院有限公司 Monocular depth estimation method and device and depth evaluation equipment
CN113034503A (en) * 2021-05-28 2021-06-25 博奥生物集团有限公司 High-flux automatic cup separating method, device and system
CN116071687A (en) * 2023-03-06 2023-05-05 四川港通医疗设备集团股份有限公司 Hand cleanliness detection method and system
CN116071687B (en) * 2023-03-06 2023-06-06 四川港通医疗设备集团股份有限公司 Hand cleanliness detection method and system

Similar Documents

Publication Publication Date Title
CN111860448A (en) Hand washing action recognition method and system
JP6397144B2 (en) Business discovery from images
CN110060237B (en) Fault detection method, device, equipment and system
EP1693782B1 (en) Method for facial features detection
CN112052186B (en) Target detection method, device, equipment and storage medium
Singh et al. Currency recognition on mobile phones
CN109829467A (en) Image labeling method, electronic device and non-transient computer-readable storage medium
Mery et al. Student attendance system in crowded classrooms using a smartphone camera
CN107305635A (en) Object identifying method, object recognition equipment and classifier training method
CN107766864B (en) Method and device for extracting features and method and device for object recognition
CN109344864B (en) Image processing method and device for dense object
CN112633255B (en) Target detection method, device and equipment
CN111652145B (en) Formula detection method and device, electronic equipment and storage medium
CN109784322A (en) A kind of recognition methods of vin code, equipment and medium based on image procossing
CN114332911A (en) Head posture detection method and device and computer equipment
CN112541394A (en) Black eye and rhinitis identification method, system and computer medium
CN112686122B (en) Human body and shadow detection method and device, electronic equipment and storage medium
CN110188610A (en) A kind of emotional intensity estimation method and system based on deep learning
Kieu et al. Ocr accuracy prediction method based on blur estimation
CN110334703B (en) Ship detection and identification method in day and night image
CN116580232A (en) Automatic image labeling method and system and electronic equipment
Babu et al. A feature based approach for license plate-recognition of Indian number plates
CN106611417A (en) A method and device for classifying visual elements as a foreground or a background
CN115131355A (en) Intelligent method for detecting abnormality of waterproof cloth by using data of electronic equipment
CN112132822A (en) Suspicious illegal building detection algorithm based on transfer learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination