CN117755315A - Method and device for controlling equipment operation based on user gesture and electronic equipment

Info

Publication number: CN117755315A
Application number: CN202311787715.1A
Original language: Chinese (zh)
Inventors: 单浩吉, 郑彩虹
Applicant / Assignee: Chongqing Changan Automobile Co Ltd
Legal status: Pending
Classification landscape: Image Analysis

Abstract

An embodiment of the present application relates to a method and a device for controlling device operation based on a user gesture, and an electronic device. The method includes: acquiring an image to be identified that contains an image of a target user; performing semantic segmentation on the image to be identified with a preset user region segmentation model to obtain the target user region corresponding to the target user in the image to be identified; performing key point detection on the target user region to obtain a key point set; performing gesture analysis on the target user based on the key point set to obtain gesture information representing the current gesture of the target user; and, based on the gesture information, controlling the operating state of a target device to reach the target operating state corresponding to the gesture information. In this embodiment, the position of the target user is accurately detected by the model before key point detection is performed, which reduces the influence of the surrounding environment on gesture detection of the target user and improves the accuracy of controlling the operation of the target device based on the gesture information.

Description

Method and device for controlling equipment operation based on user gesture and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for controlling operation of a device based on a user gesture, an electronic device, and a storage medium.
Background
With the development of machine vision technology, methods that detect a user's gesture and then control devices accordingly have been applied in many fields.
For example, in the field of intelligent vehicles, gesture detection can be performed on passengers with an IMS (Interior Monitoring System) camera. By detecting different passenger gestures, the current state of the passengers is determined, so that the operating state of devices in the cabin can be adjusted automatically, or the user can be prompted to adjust them manually, creating cabin conditions better suited to the ride. Although such methods can detect human posture, when the in-vehicle background is complex, or the passenger image is ghosted, shadowed, or blurred because the vehicle is moving too fast, the position of the human body cannot be detected accurately. Posture detection then fails, the device cannot be controlled correctly, and user experience and even driving safety are affected. In addition, the IMS approach places high demands on camera hardware.
Disclosure of Invention
In view of this, in order to solve some or all of the above technical problems, embodiments of the present application provide a method, an apparatus, an electronic device, and a storage medium for controlling operation of a device based on a user gesture.
In a first aspect, an embodiment of the present application provides a method for controlling operation of a device based on a user gesture, where the method includes: acquiring an image to be identified that contains an image of a target user; performing semantic segmentation on the image to be identified with a preset user region segmentation model to obtain the target user region corresponding to the target user in the image to be identified; performing key point detection on the target user region to obtain a key point set; performing gesture analysis on the target user based on the key point set to obtain gesture information representing the current gesture of the target user; and, based on the gesture information, controlling the operating state of the target device to reach the target operating state corresponding to the gesture information.
In one possible embodiment, acquiring an image to be identified containing an image of a target user includes: acquiring an original image shot by a target camera for a target user; and carrying out image enhancement processing on the original image to obtain an image to be identified.
In one possible implementation, performing semantic segmentation on the image to be identified with a preset user region segmentation model to obtain the target user region corresponding to the target user in the image to be identified includes: performing shallow feature extraction on the image to be identified with a shallow feature extraction module included in the user region segmentation model to obtain shallow feature data; performing a multi-scale deep feature extraction operation on the image to be identified with a multi-scale deep feature extraction module included in the user region segmentation model to obtain multi-scale deep feature data; fusing the shallow feature data and the multi-scale deep feature data to obtain fused feature data; and, based on the fused feature data, classifying the pixels in the image to be identified and determining the region composed of pixels belonging to the image of the target user as the target user region.
In one possible implementation, performing the multi-scale deep feature extraction operation on the image to be identified with the multi-scale deep feature extraction module included in the user region segmentation model to obtain multi-scale deep feature data includes: splitting the image to be identified to obtain a preset number of sub-images; performing feature extraction on the preset number of sub-images with an initial feature extraction network included in the multi-scale deep feature extraction module to obtain a preset number of sets of feature data; integrating the preset number of sets of feature data to obtain initial feature data; performing a multi-scale deep feature extraction and fusion operation on the initial feature data with a multi-scale feature fusion network included in the multi-scale deep feature extraction module to obtain multi-scale fused feature data; and fusing the initial feature data and the multi-scale fused feature data to obtain the multi-scale deep feature data.
In one possible embodiment, the multi-scale feature fusion network includes at least two expansion convolution layers, each of which has a corresponding expansion rate. Performing the multi-scale deep feature extraction and fusion operation on the initial feature data with the multi-scale feature fusion network included in the multi-scale deep feature extraction module includes: for each expansion convolution layer of the at least two expansion convolution layers, if the layer is a first-type expansion convolution layer, performing expansion convolution on the initial feature data with that layer to obtain the feature data corresponding to it; if the layer is a second-type expansion convolution layer, combining the initial feature data with the feature data output by the corresponding target expansion convolution layer and performing expansion convolution on the combination with that layer to obtain the feature data corresponding to it; and fusing the feature data output by each of the at least two expansion convolution layers to obtain the multi-scale fused feature data.
In one possible implementation, performing gesture analysis on the target user based on the set of keypoints to obtain gesture information representing the current gesture of the target user includes: performing part affinity domain matching operation on the key point set based on a preset part affinity domain prediction network, and determining association relations among key points in the key point set; connecting key points in the key point set based on the association relation; pose information representing the current pose of the target user is determined based on the geometry formed by the connected keypoints.
In one possible embodiment, based on the gesture information, controlling the operation state of the target device to reach the target operation state corresponding to the gesture information includes: determining a target stage of a target user at the current moment after entering a shooting area of a target camera; based on the gesture information, controlling the running state of the target equipment to reach the target running state corresponding to the gesture information according to a preset control strategy corresponding to the target stage.
In one possible implementation, controlling, based on the gesture information and according to the preset control strategy corresponding to the target stage, the operating state of the target device to reach the target operating state corresponding to the gesture information includes: if the target stage is the initial stage, controlling the operating state of the target device to reach the target operating state corresponding to the gesture information; if the target stage is an intermediate stage, determining, based on the gesture information, whether the same gesture made by the target user meets a preset confirmation condition, and if the preset confirmation condition is met, controlling the operating state of the target device to reach the target operating state corresponding to the target gesture information; and if the target stage is the last stage, determining whether the gesture indicated by the target gesture information is a preset gesture, and if so, controlling the operating state of the target device to reach a target operating state that prompts the target user to change the gesture.
In one possible implementation, the user region segmentation model is trained in advance based on the following steps: acquiring a first sample image and corresponding background region labeling information; taking the first sample image as the input of a preset initial model and the background region labeling information as the expected output of the initial model, and adjusting the parameters of the initial model; in response to determining that the adjusted initial model meets a preset first training end condition, determining the adjusted initial model as an initial user region segmentation model; acquiring a second sample image and corresponding user region labeling information; taking the second sample image as the input of the initial user region segmentation model and the user region labeling information as its expected output, and adjusting the parameters of the initial user region segmentation model; and, in response to determining that the adjusted initial user region segmentation model meets a preset second training end condition, determining it as the user region segmentation model.
In a second aspect, an embodiment of the present application provides an apparatus for controlling operation of a device based on a user gesture, where the apparatus includes: the acquisition module is used for acquiring an image to be identified containing the image of the target user; the segmentation module is used for carrying out semantic segmentation on the image to be identified by utilizing a preset user region segmentation model to obtain a target user region corresponding to the target user in the image to be identified; the detection module is used for detecting key points of the target user area to obtain a key point set; the analysis module is used for carrying out gesture analysis on the target user based on the key point set to obtain gesture information representing the current gesture of the target user; and the control module is used for controlling the running state of the target equipment to reach the target running state corresponding to the gesture information based on the gesture information.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory for storing a computer program; a processor, configured to execute a computer program stored in a memory, and when the computer program is executed, implement a method according to any embodiment of the method for controlling operation of a device based on a user gesture according to the first aspect of the present application.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method according to any one of the embodiments of the method for controlling operation of a device based on a user gesture of the first aspect described above.
In a fifth aspect, embodiments of the present application provide a computer program comprising computer readable code which, when run on a device, causes a processor in the device to implement a method as in any of the embodiments of the method of controlling device operation based on user gestures of the first aspect described above.
According to the method, apparatus, electronic device, and storage medium for controlling device operation based on a user gesture provided by the embodiments of the present application, the user region segmentation model is used to perform semantic segmentation on the image to be identified to obtain the target user region corresponding to the target user; key point detection is then performed on the target user region, and gesture analysis is performed on the target user based on the resulting key point set to obtain the gesture information of the target user; finally, based on the gesture information, the target device is controlled to reach the target operating state corresponding to that information. The position of the target user is thus accurately detected by the model before key point detection, so the key points of the target user are detected accurately and high-precision gesture information is obtained from them. This reduces the influence of the surrounding environment on gesture detection of the target user and improves the accuracy with which the target device is controlled based on the gesture information. In addition, detecting the user gesture only requires processing images acquired by a camera, without additional expensive hardware, which reduces the implementation cost of the solution.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.
Fig. 1 is a schematic flow chart of a method for controlling operation of a device based on a user gesture according to an embodiment of the present application;
FIG. 2 is a flowchart of another method for controlling operation of a device based on user gestures according to an embodiment of the present application;
FIG. 3 is a flowchart of another method for controlling operation of a device based on user gestures according to an embodiment of the present application;
FIG. 4 is a flowchart of another method for controlling the operation of a device based on a user gesture according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a multi-scale feature fusion network according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of another method for controlling the operation of a device based on a user gesture according to an embodiment of the present application;
FIG. 7 is a flowchart of another method for controlling operation of a device based on user gestures according to an embodiment of the present application;
FIG. 8 is a flowchart of another method for controlling the operation of a device based on a user gesture according to an embodiment of the present application;
FIG. 9 is a flowchart of a method for training a user region segmentation model according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an apparatus for controlling operation of a device based on a user gesture according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings, it being apparent that the described embodiments are some, but not all embodiments of the present application. It should be noted that: the relative arrangement of the parts and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise.
It will be appreciated by those skilled in the art that terms such as "first," "second," and the like in the embodiments of the present application are used merely to distinguish between different steps, devices, or modules, and do not represent any particular technical meaning or logical sequence therebetween.
It should also be understood that in this embodiment, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in the embodiments of the present application may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in this application is merely an association relationship describing an association object, and indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone. In this application, the character "/" generally indicates that the associated object is an or relationship.
It should also be understood that the description of the embodiments herein emphasizes the differences between the embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. For an understanding of the embodiments of the present application, the present application will be described in detail below with reference to the drawings in conjunction with the embodiments. It will be apparent that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In order to solve the technical problem in the prior art that gesture detection of a target user is strongly affected by the surrounding environment, which leads to device misoperation based on the detection result, the present application provides a method for controlling device operation based on a user gesture that improves the accuracy of both gesture detection and device control.
Fig. 1 is a flowchart of a method for controlling operation of a device based on a user gesture according to an embodiment of the present application. The method may be applied to one or more electronic devices of a vehicle (e.g., a smart drive vehicle), a smart phone, a notebook computer, a desktop computer, a portable computer, a server, etc. The main execution body of the method may be hardware or software. When the execution body is hardware, the execution body may be one or more of the electronic devices. For example, a single electronic device may perform the method, or a plurality of electronic devices may cooperate with one another to perform the method. When the execution subject is software, the method may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module. The present invention is not particularly limited herein.
As shown in fig. 1, the method specifically includes:
step 101, obtaining an image to be identified including the image of the target user.
In this embodiment, the target user may be a user located at a specific position within the shooting range of the camera. As an example, the embodiment may be applied to an intelligent vehicle, where the target user may be a driver, or may be a passenger at any position on the vehicle, and the vehicle is provided with a camera, so that the target user may be photographed, and the obtained image may be an image to be identified.
And 102, carrying out semantic segmentation on the image to be identified by using a preset user region segmentation model to obtain a target user region corresponding to the target user in the image to be identified.
In this embodiment, the user region segmentation model may be a semantic segmentation model built on neural networks of various structures (for example, U-Net, FCN, SegNet, etc.). The user region segmentation model performs semantic segmentation on the image to be identified that is input to it, that is, it classifies each pixel of the image to be identified and determines the set of pixels belonging to the target user as the target user region.
The semantic segmentation model can be trained in advance: a sample image and labeling information are obtained beforehand, where the labeling information represents the region in which the object to be identified is located in the sample image. The sample image is processed by an initial model to obtain a predicted region of the object, the error between the region indicated by the labeling information and the predicted region is determined (the error is represented by a loss value computed by a loss function), and the parameters of the initial model are adjusted based on the error so as to minimize it, finally yielding the user region segmentation model.
And 103, detecting key points of the target user area to obtain a key point set.
In this embodiment, after the target user region is obtained, key point detection may be further performed on it. Key point detection may be implemented with a key point detection model built on convolutional neural networks of various structures, such as OpenPose, CFN, RMPE, or Mask R-CNN. The key point detection model can likewise be trained in advance on sample images and annotated key points.
As an example, the key point detection model may include a feature extraction network composed of multiple convolution modules, each containing three parts: a convolution layer, batch normalization, and an activation function. The convolution layers continuously extract features, while batch normalization reduces covariate shift within the modules so that each convolution module produces a feature map with a stable feature distribution. The activation function (e.g., LeakyReLU) gives the convolution module its nonlinear transformation capability.
The key point detection model may further include a key point detection network. After the image of the target user region is input and features are extracted, a structure such as a fully connected layer maps the feature map to key point coordinates. For each key point, the x and y coordinates are predicted and a two-dimensional heat map is output in which each pixel represents the probability of the key point being located there; the coordinates of each key point can then be determined from this probability distribution.
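For concreteness, the following is a minimal sketch (not code from the patent) of one common way to decode such per-keypoint heat maps into coordinates; the tensor layout is an assumption.

```python
# Minimal sketch: decode a (num_keypoints, H, W) heat-map tensor into pixel
# coordinates by taking the most probable location per key point.
import torch

def heatmaps_to_keypoints(heatmaps: torch.Tensor) -> list[tuple[int, int, float]]:
    """heatmaps: (K, H, W) tensor of per-keypoint probability maps."""
    k, h, w = heatmaps.shape
    flat = heatmaps.view(k, -1)
    conf, idx = flat.max(dim=1)          # peak value and flat index per key point
    ys, xs = idx // w, idx % w           # recover 2-D coordinates from the flat index
    return [(int(x), int(y), float(c)) for x, y, c in zip(xs, ys, conf)]
```

Sub-pixel refinement or confidence thresholding could be layered on top of this simple peak decoding.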
And 104, analyzing the gesture of the target user based on the key point set to obtain gesture information representing the current gesture of the target user.
In this embodiment, the posture information of the target user may be determined from the positional relationship between the body part represented by each key point in the key point set and the other body parts. For example, when the two key points representing the target user's hands are close together and moving back and forth, gesture information indicating that the target user is rubbing their hands may be generated. As another example, when the distribution direction and relative inclination of the key points representing the torso are maintained for a certain time, posture information indicating that the target user is sleeping is generated.
Step 105, based on the gesture information, controlling the operation state of the target device to reach the target operation state corresponding to the gesture information.
In this embodiment, the target device may be any type of device whose operating state can be adjusted according to a received control signal. For example, the target device may be an air conditioner, a seat, or a multimedia player on a vehicle. The correspondence between gesture information and operating states may be preset. The target operating state may include the state of the air conditioner after adjusting the temperature, the state of the audio system outputting a warning sound, the state of the seat after adjustment, and so on.
As an example, when the target user is a passenger in the vehicle, if the gesture information indicates that the target user is wiping sweat, fanning themselves, or the like, a control signal instructing the vehicle's air conditioner to lower the temperature may be generated and sent to the air conditioner, which then performs the cooling operation, i.e., reaches the target operating state after the temperature is lowered.
As another example, when the gesture information indicates that the target user is sleeping, the air conditioner may be controlled to raise the temperature, or the multimedia player may be controlled to play soothing music.
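As a purely illustrative sketch of the preset correspondence described above (the gesture labels, device names, and commands below are assumptions, not values defined in this application):

```python
# Hypothetical mapping from recognized posture labels to target operating
# states; the labels and actions are illustrative assumptions only.
POSTURE_TO_ACTION = {
    "rubbing_hands": ("air_conditioner", "raise_temperature"),
    "fanning":       ("air_conditioner", "lower_temperature"),
    "sleeping":      ("media_player", "play_soothing_music"),
}

def control_target_device(posture: str, send_command) -> None:
    action = POSTURE_TO_ACTION.get(posture)
    if action is not None:
        device, command = action
        send_command(device, command)   # deliver the control signal to the device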
Optionally, the electronic device may further output corresponding prompt information based on the gesture information, where the prompt information may include various types of information such as voice, light, and image. For example, after the air conditioner is controlled to cool according to the gesture information, voice prompt information can be output at the same time.
According to the method for controlling device operation based on a user gesture provided by this embodiment, the user region segmentation model is used to perform semantic segmentation on the image to be identified to obtain the target user region corresponding to the target user; key point detection is then performed on the target user region, and gesture analysis is performed on the target user based on the resulting key point set to obtain the gesture information of the target user; finally, based on the gesture information, the target device is controlled to reach the target operating state corresponding to that information. The position of the target user is accurately detected by the model before key point detection, so the key points of the target user are detected accurately and high-precision gesture information is obtained from them, reducing the influence of the surrounding environment on gesture detection and improving the accuracy with which the target device is controlled. In addition, detecting the user gesture only requires processing images acquired by a camera, without additional expensive hardware, which reduces the implementation cost of the solution.
In some alternative implementations of the present embodiment, as shown in fig. 2, step 101 includes:
in step 1011, an original image taken by the target camera on the target user is acquired.
The target camera may be various types of cameras, among others. For example, the target camera may be a general camera for capturing a color image, or may be a special type camera such as an infrared camera.
Step 1012, performing image enhancement processing on the original image to obtain an image to be identified.
Image enhancement is applied to the original image to improve its clarity and to avoid blurring caused by factors such as camera shake or rapid movement of the target user. Any suitable image enhancement algorithm may be used; as an example, the Retinex algorithm may be applied to improve the sharpness of the original image.
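By way of illustration only, a single-scale Retinex enhancement of the kind mentioned above could be sketched as follows; the Gaussian sigma and the input path are placeholder assumptions.

```python
# Hypothetical sketch of single-scale Retinex enhancement (sigma and the
# image path are illustrative assumptions, not values from the patent).
import cv2
import numpy as np

def single_scale_retinex(original_bgr: np.ndarray, sigma: float = 80.0) -> np.ndarray:
    """Return an enhanced image to be identified from the raw camera frame."""
    img = original_bgr.astype(np.float32) + 1.0        # avoid log(0)
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)     # estimate the illumination
    retinex = np.log(img) - np.log(blurred)            # keep the reflectance component
    # Stretch back to the displayable 0-255 range.
    retinex = (retinex - retinex.min()) / (retinex.max() - retinex.min() + 1e-6)
    return (retinex * 255.0).astype(np.uint8)

enhanced = single_scale_retinex(cv2.imread("cabin_frame.jpg"))  # placeholder path
```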
Alternatively, if the original image includes both a color image and another special type of image, such as an infrared image, the individual channels of the color image and the infrared image may be combined, resulting in an image to be identified that contains RGB data channels and an infrared data channel.
As an example, when this embodiment is applied to an intelligent vehicle, jolts or turns during driving may cause camera shake or passenger movement and blur the captured original image; performing the image enhancement operation on the original image effectively reduces the influence of vehicle motion on image quality.
In this embodiment, image enhancement is performed on the original image to obtain an enhanced image to be identified. This improves the clarity of the image to be identified, reduces the impact of blur caused by camera shake or rapid movement of the target user on the subsequent semantic segmentation, improves segmentation precision, and improves the scene adaptability of the method.
In some alternative implementations of the present embodiment, as shown in fig. 3, step 102 includes:
and 1021, performing shallow feature extraction operation on the image to be identified by using a shallow feature extraction module included in the user region segmentation model to obtain shallow feature data.
The shallow feature extraction module may be a convolutional neural network module with a relatively small number of convolution layers. Because it performs fewer convolution operations, its downsampling factor is smaller, so more of the spatial information in the image is preserved.
Step 1022, performing multi-scale deep feature extraction operation on the image to be identified by using a multi-scale deep feature extraction module included in the user region segmentation model to obtain multi-scale deep feature data.
The multi-scale deep feature extraction module may be a convolutional neural network module with a larger number of convolution layers. Because it performs more convolution operations, its downsampling factor is larger, so it can extract more and deeper features, at the cost of losing more spatial information.
The multi-scale deep feature extraction module can be constructed based on convolutional neural networks of various structures. As an example, the multi-scale deep feature extraction module may be a neural network model constructed based on a feature pyramid, or may be a neural network model of a structure such as U-net.
Step 1023, fusing the shallow characteristic data and the multi-scale deep characteristic data to obtain fused characteristic data.
Specifically, the channels of the shallow feature data and the multi-scale deep feature data may be adjusted to the same size (for example, the multi-scale deep feature data is upsampled to match the shallow feature data), and the resized channels of the two are then concatenated to obtain the fused feature data.
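A minimal sketch of this fusion step, assuming PyTorch-style tensors (the channel counts and upsampling mode are assumptions):

```python
# Sketch of fusing shallow and multi-scale deep features by upsampling the
# deep branch and concatenating along the channel dimension.
import torch
import torch.nn.functional as F

def fuse_features(shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
    """shallow: (N, C1, H, W); deep: (N, C2, h, w) with h <= H and w <= W."""
    deep_up = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear",
                            align_corners=False)
    return torch.cat([shallow, deep_up], dim=1)   # (N, C1 + C2, H, W)
```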
Step 1024, classifying pixels in the image to be identified based on the fusion feature data, and determining an area composed of pixels belonging to the image of the target user as the target user area.
Specifically, the data corresponding to each pixel of the image to be identified in the fusion feature data may be classified based on a preset classifier, the category to which each pixel belongs is determined, and the area composed of pixels belonging to the target user category is determined as the target user area.
Alternatively, in this embodiment, a more lightweight depthwise separable convolution may be used in place of the convolution operation commonly used for feature extraction in deep learning, and the 1×1 convolution used to strengthen the association between channels may be replaced by a grouped convolution whose channels are then shuffled (for example, with a Channel Shuffle operation) to strengthen the exchange of information between groups and reduce information loss, while a LeakyReLU activation function is used to prevent the dying-neuron problem. This improves the feature extraction capability of the model while also improving the response speed of the passenger-state reminders in the vehicle.
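The lightweight block described above might look roughly like the following sketch; the channel count, group count, and LeakyReLU slope are assumptions rather than values from this application.

```python
# Hypothetical lightweight block: depthwise separable convolution, a grouped
# 1x1 convolution, channel shuffle, and LeakyReLU, as described above.
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(n, c, h, w))

class LightweightBlock(nn.Module):
    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1, groups=groups)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.depthwise(x)                 # per-channel spatial filtering
        x = self.pointwise(x)                 # grouped 1x1 channel mixing
        x = channel_shuffle(x, self.groups)   # exchange information across groups
        return self.act(self.bn(x))
```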
In this embodiment, a shallow feature extraction operation and a multi-scale deep feature extraction operation are performed on the image to be identified, and the resulting shallow and multi-scale deep feature data are fused. The fused feature data therefore contains both the richer spatial information from the shallow features and the deeper features from the multi-scale deep features, so the image to be identified can be segmented more accurately and the target user region determined more precisely.
In some alternative implementations of the present embodiment, as shown in fig. 4, step 1022 includes:
step 10221, segmenting the image to be identified to obtain a preset number of sub-images.
As an example, the image to be recognized may be segmented into 9 sub-images of the same size.
Step 10222, respectively performing feature extraction on the preset number of sub-images by using an initial feature extraction network included in the multi-scale deep feature extraction module to obtain preset number of feature data.
The initial feature extraction network can be realized by adopting a convolution neural network with a preset structure.
Step 10223, integrating the preset number of sets of feature data to obtain initial feature data.
Specifically, the preset number of sets of feature data may be concatenated along the channel dimension, and the size of each channel of the combined feature data may then be converted (for example, downsampled through a pooling layer) so that it matches the input size expected by the multi-scale feature fusion network.
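A rough sketch of this split-extract-integrate step, assuming a 3×3 grid and an arbitrary shared backbone (the grid size and output size are assumptions):

```python
# Sketch: split the image into a 3x3 grid of sub-images, run a shared feature
# extractor on each, and merge the results along the channel dimension.
import torch
import torch.nn as nn

def extract_initial_features(image: torch.Tensor, backbone: nn.Module,
                             grid: int = 3) -> torch.Tensor:
    """image: (N, C, H, W) with H and W divisible by `grid`."""
    n, c, h, w = image.shape
    tiles = [image[:, :, i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
             for i in range(grid) for j in range(grid)]
    features = [backbone(t) for t in tiles]    # one feature map per sub-image
    merged = torch.cat(features, dim=1)        # combine along the channel dimension
    # Optionally downsample so the spatial size matches what the multi-scale
    # feature fusion network expects (the target size here is an assumption).
    return nn.functional.adaptive_avg_pool2d(merged, output_size=(32, 32))
```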
Step 10224, performing multi-scale deep feature extraction and fusion operation on the initial feature data by using a multi-scale feature fusion network included in the multi-scale deep feature extraction module to obtain multi-scale fusion feature data.
Specifically, the multi-scale feature fusion network may include multiple convolution layers, each with its own expansion (dilation) rate, so that each convolution layer has a receptive field of a different scale and extracts features at a different spatial scale. Because the feature data output by different convolution layers differ in size, the output of each convolution layer can be resized, and the resized feature data are then concatenated along the channel dimension to obtain the multi-scale fused feature data.
Step 10225, fusing the initial feature data and the multi-scale fusion feature data to obtain multi-scale deep feature data.
As an example, the initial feature data and the multi-scale fusion feature data may be transformed to be the same size, and the initial feature data and the multi-scale fusion feature data of the same size are combined in the channel dimension to obtain multi-scale deep feature data. Or, the combined characteristic data can be convolved through 1×1 convolution to establish the association between channels, and simultaneously, the dimension reduction operation is performed, so that the response speed of the model is improved.
In this embodiment, feature extraction is performed on the segmented sub-images to obtain initial feature data that contains more spatial information; multi-scale deep feature extraction is then performed on the initial feature data to obtain multi-scale fused feature data that contains more deep features; finally, the initial feature data and the multi-scale fused feature data are fused. The resulting multi-scale deep feature data thus retains more spatial features while also containing richer deep semantic features, so pixels can be classified more accurately during the subsequent semantic segmentation and the target user region can be determined more accurately.
In some alternative implementations of this embodiment, the multi-scale feature fusion network includes at least two expansion convolution layers, each with its own expansion rate. When the convolution kernel performs the convolution over the feature map, the corresponding region of the feature map is enlarged according to the expansion rate, which increases the receptive field of the kernel without increasing the number of parameters, so each convolution output covers a larger range of spatial information while the size of the output feature map remains unchanged.
Step 10224 may be performed as follows:
First, for each expansion convolution layer of the at least two expansion convolution layers: if it is a first-type expansion convolution layer, it performs expansion convolution on the initial feature data to obtain the feature data corresponding to that layer; if it is a second-type expansion convolution layer, it combines the initial feature data with the feature data output by the corresponding target expansion convolution layer and performs expansion convolution on the combination to obtain the feature data corresponding to that layer.
Specifically, a first-type expansion convolution layer is a convolution layer that receives only the initial feature data. As shown in fig. 5, 501 is a 1×1 convolution layer, 506 is a pooling layer, and 502 and 503 are first-type expansion convolution layers, with expansion rates of 4 and 8 respectively, that receive only the input initial feature data. 504 and 505 are second-type expansion convolution layers with expansion rates of 12 and 15 respectively: 504 receives the output of 502 in addition to the initial feature data, and 505 receives the outputs of 503 and 504. The connections between convolution layers shown in fig. 5 may also be referred to as skip connections, and the expansion convolution layers in fig. 5 are computed with depthwise separable convolutions (DSConv).
And then, fusing the characteristic data output by each expansion convolution layer in at least two expansion convolution layers to obtain multi-scale fusion characteristic data.
Specifically, the feature data output by each expansion convolution layer and the other layers of the multi-scale feature fusion network shown in fig. 5 may be concatenated along the channel dimension, and the combined feature data may additionally be convolved, for example with a 1×1 convolution, to obtain the multi-scale fused feature data.
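A rough sketch of a fusion network with the shape described for fig. 5 (expansion rates 4, 8, 12, and 15, depthwise separable convolutions, skip connections); the channel widths and the exact wiring of the merge are assumptions and may differ from the original network.

```python
# Rough sketch of the fig. 5 multi-scale feature fusion network: cascaded
# dilated (expansion) depthwise separable convolutions with skip connections.
import torch
import torch.nn as nn

def ds_conv(in_ch: int, out_ch: int, dilation: int) -> nn.Sequential:
    """Depthwise separable 3x3 convolution with the given dilation rate."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, 1),
        nn.LeakyReLU(0.1),
    )

class MultiScaleFusion(nn.Module):
    def __init__(self, ch: int = 128):
        super().__init__()
        self.conv1x1 = nn.Conv2d(ch, ch, 1)               # branch 501
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)  # branch 506
        self.d4 = ds_conv(ch, ch, 4)                      # first-type layer (502)
        self.d8 = ds_conv(ch, ch, 8)                      # first-type layer (503)
        self.d12 = ds_conv(2 * ch, ch, 12)                # second-type: initial + 502 output
        self.d15 = ds_conv(3 * ch, ch, 15)                # second-type: initial + 503 + 504
        self.merge = nn.Conv2d(6 * ch, ch, 1)             # 1x1 fusion of all branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1, fp = self.conv1x1(x), self.pool(x)
        f4, f8 = self.d4(x), self.d8(x)
        f12 = self.d12(torch.cat([x, f4], dim=1))
        f15 = self.d15(torch.cat([x, f8, f12], dim=1))
        return self.merge(torch.cat([f1, fp, f4, f8, f12, f15], dim=1))
```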
In this embodiment, at least two expansion convolution layers are arranged in the multi-scale feature fusion network and connected by skip connections. The convolution is therefore computed over the initial feature data at several different spatial scales, capturing spatial information at more scales and fully fusing information across scales, so the output multi-scale fused feature data contains richer spatial information and deep semantic information, further improving the accuracy of determining the target user region.
In some alternative implementations of the present embodiment, as shown in fig. 6, step 104 includes:
step 1041, performing a location affinity domain matching operation on the key point set based on a preset location affinity domain prediction network, to determine an association relationship between key points in the key point set.
The part affinity domain prediction network can be built in advance on a convolutional neural network and trained beforehand with labeling information corresponding to sample key point sets. In general, each key point may carry attribute information, including the body part to which it belongs, from which the association relationships between adjacent key points can be determined.
Step 1042, based on the association relationship, connect the key points in the key point set.
According to the association relationship, whether the adjacent key points of each key point can be connected can be determined. For example, a keypoint representing the left hand may be connected to a keypoint representing the left elbow and the left shoulder.
Step 1043, determining pose information representing the current pose of the target user based on the geometry formed by the connected keypoints.
Specifically, the geometric shapes formed by the connected key points can be matched with preset gestures, so that gesture information corresponding to the currently formed geometric shapes is determined. For example, when the key points representing both hands are approaching and repeatedly moving, it may be determined that the target user is doing a hand-rubbing motion. For another example, when the key point representing the hand coincides with the key point representing the forehead, it may be determined that the target user is touching the forehead, and at this time, it may also be determined that the user is in a motion sickness state.
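A toy sketch of this geometric matching; the key point names, distance thresholds, and gesture labels are illustrative assumptions only.

```python
# Toy sketch: infer a coarse posture label from the geometry of named
# key points; names and thresholds here are illustrative assumptions.
import math

def classify_posture(kps: dict[str, tuple[float, float]]) -> str:
    def dist(a: str, b: str) -> float:
        (x1, y1), (x2, y2) = kps[a], kps[b]
        return math.hypot(x1 - x2, y1 - y2)

    if "left_hand" in kps and "right_hand" in kps and dist("left_hand", "right_hand") < 30:
        return "rubbing_hands"          # hands kept close together
    if "right_hand" in kps and "forehead" in kps and dist("right_hand", "forehead") < 20:
        return "touching_forehead"      # possible motion-sickness indicator
    return "unknown"
```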
According to the embodiment, the key points with the association relationship are connected by analyzing the key point information, so that the current gesture of the target user can be judged more accurately according to the geometric shape formed after connection, further analysis of the actual demands of the user is facilitated, and the accuracy of controlling the target device is further improved.
In some alternative implementations of the present embodiment, step 105, as shown in fig. 7, includes:
step 1051, determining a target stage in which the target user is located after entering the shooting area of the target camera at the current moment.
Specifically, a period of time for which the target user enters the photographing range of the target camera may be divided into a plurality of stages in advance, and each stage may correspond to a different preset control policy.
Step 1052, based on the gesture information, controlling the operation state of the target device to reach the target operation state corresponding to the gesture information according to the preset control strategy corresponding to the target stage.
Specifically, each stage corresponds to a different preset control strategy. For example, during the initial period after a user enters the shooting range of the target camera, the target device can be controlled in real time according to the detected gesture information; in a later stage, whether the target user really needs the target device to be controlled can be judged from the number of times the same gesture is made, and the corresponding control signal is generated only once it is judged that the control operation should be executed.
According to the embodiment, through setting the control strategies corresponding to different stages, the control strategies are automatically adjusted according to the use scenes of the target users, and the accuracy of controlling the target equipment and the adaptability of the scenes are improved.
In some alternative implementations of the present embodiment, as shown in fig. 8, step 1052 includes:
in step 10521, if the target phase is the initial phase, the operation state of the target device is controlled to reach the target operation state corresponding to the gesture information.
The initial stage, i.e., the initial period of time (e.g., 3 minutes) during which the target user enters the shooting range of the target camera, may adjust the operating state of the target device according to the gesture information detected in real time for the target user.
As an example, when this embodiment is applied to a vehicle, the gesture of the driver (target user) is detected in real time after the driver enters the vehicle. If the driver is judged to be rubbing their hands, the temperature in the vehicle is likely low, and the air conditioner can be controlled to raise the temperature; if the driver is judged to be fanning themselves, the temperature in the vehicle is likely high, and the air conditioner can be controlled to lower the temperature.
Step 10522, if the target stage is an intermediate stage, determining whether the same gesture made by the target user meets the preset confirmation condition based on the gesture information; and if the preset confirmation condition is met, controlling the running state of the target equipment to reach the target running state corresponding to the target posture information.
The intermediate stage may be a time period after the initial stage, in which the gesture of the target user may be continuously detected to determine whether the target user does a certain action, thereby accurately controlling the target device. The preset confirmation condition can be set arbitrarily according to actual requirements. For example, a fixed period of time (for example, 10 seconds) may be set, in which if the number of times a certain gesture occurs exceeds a preset number of times, it is determined that the confirmation condition is met; or if the duration of a certain gesture exceeds a fixed duration, determining that the confirmation condition is met.
As an example, the intermediate stage may be the normal driving stage of the vehicle. In this stage, if it is determined that the target user has repeatedly made the same gesture (for example, touching the forehead) over a period of 10 seconds, it is judged that the target user may be carsick; the window may then be opened automatically, or a prompt tone suggesting opening the window may be output. To ensure driving safety, this step is performed to accurately determine the user's actual posture, improving the accuracy of device control.
In step 10523, if the target stage is the last stage, it is determined whether the gesture indicated by the target gesture information is a preset gesture, if so, the operation state of the target device is controlled to reach the target operation state for prompting the target user to change the gesture.
How the last stage is identified can be set according to the actual scene. For example, the operating duration of the target device may be preset, and the last stage is judged to begin as the end of that duration approaches.
As an example, when the embodiment is applied to a vehicle, it may be determined whether the destination is approaching according to navigation information, if the destination is approaching (for example, 1 minute remains for reaching the destination or the distance from the destination is less than 1 km), it is determined whether the gesture of a passenger (target user) on the vehicle is a sleeping gesture (preset gesture), if so, the air conditioner on the vehicle is controlled to cool down, and at the same time, the audio device is controlled to emit a prompt tone, thereby creating an environment for the passenger that is favorable for recovering an awake state.
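The three-stage strategy above could be dispatched roughly as follows; the stage boundaries, confirmation window, repetition count, and action names are assumptions chosen only to illustrate the flow.

```python
# Hypothetical dispatch of the stage-dependent control strategy; the stage
# boundaries, confirmation counts, and gestures are illustrative assumptions.
import time

INITIAL_STAGE_SECONDS = 180        # assumed length of the initial stage
CONFIRM_WINDOW_SECONDS = 10        # assumed confirmation window
CONFIRM_COUNT = 3                  # assumed number of repetitions required

def handle_posture(posture: str, entered_at: float, near_destination: bool,
                   history: list[tuple[float, str]], apply_action) -> None:
    now = time.monotonic()
    history.append((now, posture))

    if near_destination:                                   # last stage
        if posture == "sleeping":
            apply_action("wake_up_environment")            # cool down + prompt tone
    elif now - entered_at < INITIAL_STAGE_SECONDS:         # initial stage
        apply_action(posture)                              # react immediately
    else:                                                  # intermediate stage
        recent = [p for t, p in history if now - t <= CONFIRM_WINDOW_SECONDS]
        if recent.count(posture) >= CONFIRM_COUNT:         # confirmation condition
            apply_action(posture)
```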
According to the method, the operation of controlling the target equipment is respectively executed according to different strategies at the initial stage, the middle stage and the end stage after the target user enters the shooting range of the target camera, so that the target equipment is controlled in different modes at different stages more pertinently, and the method is better adapted to actual scenes.
In some optional implementations of the present embodiment, as shown in fig. 9, the user region segmentation model is trained in advance based on the following steps:
Step 901, acquiring a first sample image and corresponding background region labeling information.
The first sample image may be an image including an arbitrary user image, and in general, a shooting scene of the first sample image is the same as that of the target user image, for example, all of the first sample image is shot by a camera installed in a vehicle. The background region labeling information may be preset information representing a background region other than the user image, for example, contour information of the background region.
In step 902, the first sample image is used as input of a preset initial model, the background region labeling information is used as expected output of the initial model, and parameters of the initial model are adjusted.
The initial model may be an untrained neural network model, or a neural network model whose training has not been completed. Model training means training the neural network model on a large amount of data so that, like a human brain, it acquires a certain ability to make judgments about the information in the data.
Training a neural network consists mainly of two phases, forward inference and backpropagation. During forward inference, image features are extracted through operations such as convolution, pooling, batch normalization, and activation functions, and the desired image information is inferred. During backpropagation, the model computes the difference between the inferred result and the ground truth to update the model parameters and optimize performance, giving the model better discriminative ability. The background region labeling information is the ground truth, and the difference between the inferred result and the ground truth can be determined by computing a loss value with a loss function (such as a cross-entropy loss function).
As an example, the loss value may be computed with a cross-entropy loss function, represented by the following equation (1):
L = -\sum_{i=1}^{n} x_i \log(q_i)    (1)
where n is the number of categories, x_i is an element of a one-hot vector that is 1 when the category is the same as the labeled category and 0 otherwise, and q_i is the probability that the pixel belongs to category i, obtained by passing the model's prediction for the pixel through a softmax or sigmoid function.
In addition, the mean intersection-over-union may be used as the criterion for evaluating model performance, as shown in the following equation (2):
mIoU = \frac{1}{n_{cls}} \sum_{i} \frac{n_{ii}}{\sum_{j} n_{ij} + \sum_{j} n_{ji} - n_{ii}}    (2)
where n_{cls} is the number of target classes including the background, n_{ii} is the number of pixels of target i predicted as target i (a correct prediction), n_{ij} is the number of pixels of target i predicted as target j (a prediction error), and n_{ji} is the number of pixels of target j predicted as target i (a prediction error).
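For concreteness, a small sketch (an assumption about the evaluation code, not taken from this application) of computing the mean intersection-over-union of equation (2) from a confusion matrix:

```python
# Sketch: mean IoU per equation (2) from an (n_cls x n_cls) confusion matrix
# whose entry [i, j] counts pixels of class i predicted as class j.
import numpy as np

def mean_iou(confusion: np.ndarray) -> float:
    n_ii = np.diag(confusion).astype(np.float64)
    union = confusion.sum(axis=1) + confusion.sum(axis=0) - n_ii
    return float(np.mean(n_ii / np.maximum(union, 1)))
```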
In step 903, in response to determining that the initial model after the adjustment parameters meets the preset first training ending condition, the initial model after the adjustment parameters is determined as an initial user region segmentation model.
Specifically, the objective of the training process described above is to enable the model to initially determine the location of the background region, and when the first training end condition is satisfied, it is determined that the training objective is reached. As an example, the first training end condition may include, but is not limited to, at least one of: the training times reach the preset times, the training time reaches the preset time, the loss value converges, and the like. The trained initial user region segmentation model is obtained by training the background region labeling information, so that the initial user region segmentation model has certain background information extraction capability, and the parameters included in the initial user region segmentation model can be used as initial parameters of the subsequent training user region segmentation model.
Step 904, obtaining a second sample image and corresponding user region labeling information.
Wherein the second sample image may be the same as or different from the first sample image. In general, the shooting scene of the second sample image is the same as the shooting scene of the target user image and the first sample image described above, and is, for example, all shot by a camera installed in the vehicle. The user region labeling information may be preset information indicating a region where the user image is located, for example, contour information of the user image.
In step 905, the second sample image is used as an input of the initial user region segmentation model, the user region labeling information is used as an expected output of the initial user region segmentation model, and parameters of the initial user region segmentation model are adjusted.
The method for adjusting the parameters of the initial user region segmentation model is substantially identical to the above step 702, and will not be described herein.
Step 906, in response to determining that the initial user region segmentation model after the adjustment parameters meets the preset second training ending condition, determining the initial user region segmentation model after the adjustment parameters as a user region segmentation model.
Specifically, the objective of the training process described above is to enable the model to accurately segment the user area, and when the second training end condition is satisfied, it is determined that the training objective is reached. The second training ending condition may be the same as the first training ending condition described above, and will not be described here again. The initialization parameters of the user region segmentation model have certain background extraction capability, so that when model training is performed based on the second sample image and the user region labeling information, the background and the user region can be distinguished, and the accuracy of extracting the user region is improved.
It should be noted that, the training process described in this embodiment is a process described for a set of samples, and the actual training process needs to use multiple sets of samples to repeatedly train until the first training end condition or the second training end condition is satisfied.
In the embodiments of the present application, when the user region segmentation model is trained, pre-training is first performed on backgrounds that frequently occur in the actual application scene, so that the model acquires a certain ability to extract background information. This further improves the user region segmentation accuracy of the model, alleviates the problem of data scarcity during user region segmentation, and speeds up the training of the user region segmentation model.
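The two-stage flow of steps 901 to 906 can be sketched, purely for illustration, as a pre-training pass on background labels followed by fine-tuning on user region labels; the model, data loader and hyper-parameter names below are assumptions and not part of the disclosure:

```python
import copy
from torch import nn, optim

def train_stage(model, loader, epochs, lr):
    """Generic supervised loop used for both training stages."""
    criterion = nn.CrossEntropyLoss()                # pixel-wise loss, as in formula (1)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:                # labels: per-pixel class map
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()                          # back propagation
            optimizer.step()                         # adjust the model parameters
    return model

def build_user_region_segmentation_model(initial_model, background_loader, user_loader):
    # Stage 1: pre-train with background region labels -> initial user region segmentation model
    initial_user_model = train_stage(initial_model, background_loader, epochs=20, lr=1e-3)
    # Stage 2: fine-tune with user region labels, starting from the stage-1 parameters
    return train_stage(copy.deepcopy(initial_user_model), user_loader, epochs=40, lr=1e-4)
```

In practice the loop would also evaluate the end conditions of steps 903 and 906 after each round rather than training for a fixed number of epochs.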
Fig. 10 is a schematic structural diagram of an apparatus for controlling operation of a device based on a user gesture according to an embodiment of the present application. The apparatus specifically comprises: an acquisition module 1001, configured to acquire an image to be identified including an image of a target user; a segmentation module 1002, configured to perform semantic segmentation on the image to be identified by using a preset user region segmentation model, so as to obtain a target user region corresponding to the target user in the image to be identified; a detection module 1003, configured to perform key point detection on the target user region to obtain a key point set; an analysis module 1004, configured to perform gesture analysis on the target user based on the key point set, so as to obtain gesture information that represents the current gesture of the target user; and a control module 1005, configured to control, based on the gesture information, the running state of the target device to reach a target running state corresponding to the gesture information.
In one possible embodiment, the obtaining module includes: the acquisition unit is used for acquiring an original image shot by a target camera for a target user; and the enhancement unit is used for carrying out image enhancement processing on the original image to obtain an image to be identified.
In one possible embodiment, the segmentation module includes: the first extraction unit is used for carrying out shallow feature extraction operation on the image to be identified by utilizing a shallow feature extraction module included in the user region segmentation model to obtain shallow feature data; the second feature extraction unit is used for carrying out multi-scale deep feature extraction operation on the image to be identified by utilizing a multi-scale deep feature extraction module included in the user region segmentation model to obtain multi-scale deep feature data; the fusion unit is used for fusing the shallow characteristic data and the multi-scale deep characteristic data to obtain fused characteristic data; and the classification unit is used for classifying pixels in the image to be identified based on the fusion characteristic data and determining an area consisting of pixels belonging to the image of the target user as a target user area.
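A minimal sketch of how the two feature branches described above could be fused for per-pixel classification is given below; the module names, channel counts and the choice of concatenation followed by a 1x1 convolution are assumptions for illustration:

```python
import torch
from torch import nn
import torch.nn.functional as F

class UserRegionSegmenter(nn.Module):
    """Fuse shallow features with multi-scale deep features, then classify pixels."""
    def __init__(self, shallow_net, deep_net, shallow_ch, deep_ch, n_classes=2):
        super().__init__()
        self.shallow_net = shallow_net      # shallow feature extraction module
        self.deep_net = deep_net            # multi-scale deep feature extraction module
        self.classifier = nn.Conv2d(shallow_ch + deep_ch, n_classes, kernel_size=1)

    def forward(self, image):
        shallow = self.shallow_net(image)                       # shallow feature data
        deep = self.deep_net(image)                             # multi-scale deep feature data
        deep = F.interpolate(deep, size=shallow.shape[-2:],     # bring both to one resolution
                             mode='bilinear', align_corners=False)
        fused = torch.cat([shallow, deep], dim=1)               # fused feature data
        logits = self.classifier(fused)                         # per-pixel class scores
        return logits.argmax(dim=1) == 1                        # mask of pixels classified as "user"
```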
In one possible embodiment, the second feature extraction unit includes: the sub-cutting unit is used for cutting the image to be identified to obtain a preset number of sub-images; the extraction subunit is used for respectively carrying out feature extraction on the preset number of sub-images by utilizing an initial feature extraction network included in the multi-scale deep feature extraction module to obtain preset number group feature data; the integration subunit is used for integrating the preset quantity group of characteristic data to obtain initial characteristic data; the first fusion subunit is used for carrying out multi-scale deep feature extraction and fusion operation on the initial feature data by utilizing a multi-scale feature fusion network included in the multi-scale deep feature extraction module to obtain multi-scale fusion feature data; and the second fusion subunit is used for fusing the initial characteristic data and the multi-scale fusion characteristic data to obtain multi-scale deep characteristic data.
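Purely as an illustration of the data flow in this embodiment (sub-image cutting, per-sub-image feature extraction, integration, multi-scale fusion), and assuming for simplicity that the feature extractor keeps the relative spatial layout of its input:

```python
import torch
import torch.nn.functional as F

def multi_scale_deep_features(image, init_net, fusion_net, grid=2):
    """Cut the image into grid x grid sub-images, extract features of each with
    init_net, stitch them back into one initial feature map, then run the
    multi-scale fusion network and fuse its output with the initial features."""
    b, c, h, w = image.shape                     # height and width assumed divisible by grid
    rows = []
    for i in range(grid):
        cols = []
        for j in range(grid):
            sub = image[:, :, i * h // grid:(i + 1) * h // grid,
                              j * w // grid:(j + 1) * w // grid]
            cols.append(init_net(sub))           # feature data of one sub-image
        rows.append(torch.cat(cols, dim=-1))     # integrate sub-image features along the width
    initial = torch.cat(rows, dim=-2)            # initial feature data (integrated along the height)
    fused = fusion_net(initial)                  # multi-scale fusion feature data
    if fused.shape[-2:] != initial.shape[-2:]:
        fused = F.interpolate(fused, size=initial.shape[-2:],
                              mode='bilinear', align_corners=False)
    return torch.cat([initial, fused], dim=1)    # multi-scale deep feature data
```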
In one possible embodiment, the multi-scale feature fusion network comprises at least two expanded convolution layers, each of the at least two expanded convolution layers having a corresponding expansion ratio; the first fusion subunit is further configured to: for each expansion convolution layer in at least two expansion convolution layers, if the expansion convolution layer is a first expansion convolution layer, performing expansion convolution processing on initial characteristic data by the expansion convolution layer to obtain characteristic data corresponding to the expansion convolution layer; if the expansion convolution layer is a second expansion convolution layer, combining and carrying out expansion convolution processing on the initial characteristic data and the characteristic data output by the corresponding target expansion convolution layer by the expansion convolution layer to obtain the characteristic data corresponding to the expansion convolution layer; and fusing the characteristic data output by each expansion convolution layer in the at least two expansion convolution layers to obtain multi-scale fused characteristic data.
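The cascade of expansion (dilated) convolutions described above might look as follows; here it is assumed, for illustration only, that each later layer combines the initial features with the output of the immediately preceding layer, and that all layer outputs are fused with a 1x1 convolution:

```python
import torch
from torch import nn

class DilatedFusion(nn.Module):
    """Cascade of dilated convolutions with increasing expansion rates."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.layers = nn.ModuleList()
        for k, d in enumerate(dilations):
            in_ch = channels if k == 0 else channels * 2          # later layers also see the initial features
            self.layers.append(nn.Conv2d(in_ch, channels, kernel_size=3,
                                         padding=d, dilation=d))  # padding=d keeps the spatial size
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, initial):
        outputs = []
        for k, layer in enumerate(self.layers):
            if k == 0:
                x = initial                                        # first expansion convolution layer
            else:
                x = torch.cat([initial, outputs[-1]], dim=1)       # combine with the previous output
            outputs.append(torch.relu(layer(x)))
        return self.fuse(torch.cat(outputs, dim=1))                # multi-scale fusion feature data
```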
In one possible embodiment, the analysis module comprises: the matching unit is used for carrying out part affinity domain matching operation on the key point set based on a preset part affinity domain prediction network and determining the association relation between key points in the key point set; the connection unit is used for connecting the key points in the key point set based on the association relation; and a first determining unit for determining pose information representing a current pose of the target user based on the geometry formed by the connected keypoints.
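A greatly simplified sketch of how candidate key points could be associated using a part affinity field is shown below; the sampling scheme, the greedy matching and all names are assumptions, and a real part affinity domain prediction network would be considerably more involved:

```python
import numpy as np

def limb_score(paf_x, paf_y, p1, p2, samples=10):
    """Score a candidate connection between key points p1 and p2 (given as (x, y))
    by sampling the predicted affinity field along the segment between them and
    projecting it onto the unit vector from p1 to p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    v = p2 - p1
    v = v / (np.linalg.norm(v) + 1e-8)
    score = 0.0
    for t in np.linspace(0.0, 1.0, samples):
        x, y = (p1 + t * (p2 - p1)).round().astype(int)
        score += paf_x[y, x] * v[0] + paf_y[y, x] * v[1]
    return score / samples

def connect_keypoints(candidates_a, candidates_b, paf_x, paf_y, threshold=0.3):
    """Greedily associate two candidate key point sets (e.g. shoulders and elbows)."""
    pairs, used_b = [], set()
    for i, pa in enumerate(candidates_a):
        scored = [(limb_score(paf_x, paf_y, pa, pb), j)
                  for j, pb in enumerate(candidates_b) if j not in used_b]
        if scored:
            best_score, best_j = max(scored)
            if best_score > threshold:                 # association relation found
                pairs.append((i, best_j))
                used_b.add(best_j)
    return pairs
```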
In one possible embodiment, the control module includes: the second determining unit is used for determining a target stage of the target user at the current moment after entering the shooting area of the target camera; the control unit is used for controlling the running state of the target equipment to reach the target running state corresponding to the gesture information according to the preset control strategy corresponding to the target stage based on the gesture information.
In one possible embodiment, the control unit comprises: the control subunit is used for controlling the running state of the target equipment to reach the target running state corresponding to the gesture information if the target stage is the initial stage; the first determining subunit is used for determining whether the same gesture made by the target user meets a preset confirmation condition or not based on gesture information if the target stage is an intermediate stage; if the preset confirmation condition is met, controlling the running state of the target equipment to reach the target running state corresponding to the target posture information; and the second determining subunit is used for determining whether the gesture indicated by the target gesture information is a preset gesture if the target stage is the tail stage, and controlling the running state of the target equipment to reach the target running state for prompting the target user to change the gesture if the gesture indicated by the target gesture information is the preset gesture.
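As a minimal, purely illustrative sketch of such a staged control strategy (the stage names, the gesture labels, the helper target_state_for() and the device.set_state() call are all assumed placeholders):

```python
def target_state_for(gesture):
    """Assumed placeholder mapping from a recognized gesture to a device state."""
    return {"leaning_forward": "raise_backrest",
            "slouching": "prompt_posture_change"}.get(gesture, "keep_current_state")

def control_device(stage, gesture, history, device,
                   confirm_count=3, reminder_gestures=("slouching",)):
    """Apply a different control policy depending on the stage the target user is in."""
    if stage == "initial":
        # Initial stage: respond to the recognized gesture directly
        device.set_state(target_state_for(gesture))
    elif stage == "intermediate":
        # Intermediate stage: act only when the same gesture is confirmed
        # over several consecutive detections
        history.append(gesture)
        if len(history) >= confirm_count and len(set(history[-confirm_count:])) == 1:
            device.set_state(target_state_for(gesture))
    else:
        # Final stage: if a preset gesture is detected, switch the device into a
        # state that prompts the target user to change posture
        if gesture in reminder_gestures:
            device.set_state("prompt_posture_change")
```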
In one possible implementation, the user region segmentation model is trained in advance based on the following steps: acquiring a first sample image and corresponding background region labeling information; taking the first sample image as input of a preset initial model, taking background region labeling information as expected output of the initial model, and adjusting parameters of the initial model; responding to the fact that the initial model after the adjustment parameters meet the preset first training ending conditions, and determining the initial model after the adjustment parameters as an initial user area segmentation model; acquiring a second sample image and corresponding user region labeling information; taking the second sample image as input of an initial user region segmentation model, taking user region labeling information as expected output of the initial user region segmentation model, and adjusting parameters of the initial user region segmentation model; and determining the initial user region segmentation model after the parameter adjustment as a user region segmentation model in response to the fact that the initial user region segmentation model after the parameter adjustment meets a preset second training ending condition.
The apparatus for controlling operation of a device based on a user gesture provided in this embodiment may be the apparatus shown in Fig. 10, and may perform all the steps of the above methods for controlling operation of a device based on a user gesture, so as to achieve the technical effects of those methods; for specifics, reference is made to the related description above, which is not repeated here for brevity.
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 1100 shown in Fig. 11 includes: at least one processor 1101, a memory 1102, at least one network interface 1104, and other user interfaces 1103. The various components in the electronic device 1100 are coupled together by a bus system 1105. It is appreciated that the bus system 1105 is used to implement connection and communication between these components. In addition to a data bus, the bus system 1105 includes a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are labeled as the bus system 1105 in Fig. 11.
The user interface 1103 may include, among other things, a display, keyboard, or pointing device (e.g., mouse, trackball, touch pad, or touch screen, etc.).
It is to be appreciated that the memory 1102 in embodiments of the present application can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 1102 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 1102 stores the following elements, executable units or data structures, or a subset thereof, or an extended set thereof: an operating system 11021 and application programs 11022.
The operating system 11021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application programs 11022 include various application programs such as a Media Player (Media Player), a Browser (Browser), and the like for realizing various application services. A program for implementing the method of the embodiment of the present application may be included in the application program 11022.
In this embodiment, by calling a program or an instruction stored in the memory 1102, specifically, a program or an instruction stored in the application 11022, the processor 1101 is configured to execute the method steps provided by the method embodiments, for example, including:
acquiring an image to be identified containing the image of the target user; carrying out semantic segmentation on the image to be identified by using a preset user region segmentation model to obtain a corresponding target user region of the target user in the image to be identified; performing key point detection on the target user area to obtain a key point set; based on the key point set, carrying out gesture analysis on the target user to obtain gesture information representing the current gesture of the target user; and controlling the running state of the target equipment to reach the target running state corresponding to the gesture information based on the gesture information.
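Purely as an illustration of how these five steps fit together (every argument below, such as enhance and the state_for_pose mapping, is an assumed placeholder component rather than an API defined by this application):

```python
def run_gesture_control(camera, enhance, seg_model, keypoint_detector,
                        pose_classifier, state_for_pose, device):
    """One pass of the gesture-based control flow listed above."""
    image = enhance(camera.capture())                 # image to be identified (after enhancement)
    user_mask = seg_model(image)                      # semantic segmentation -> target user region
    keypoints = keypoint_detector(image, user_mask)   # key point set within the user region
    pose = pose_classifier(keypoints)                 # gesture information for the current gesture
    device.set_state(state_for_pose(pose))            # drive the device to the target running state
```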
The method disclosed in the embodiments of the present application may be applied to the processor 1101 or implemented by the processor 1101. The processor 1101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuitry in hardware in the processor 1101 or by instructions in the form of software. The processor 1101 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1102, and the processor 1101 reads the information in the memory 1102 and completes the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field-Programmable Gate Arrays (FPGA), general purpose processors, controllers, microcontrollers, microprocessors, other electronic units configured to perform the functions described in this application, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The electronic device provided in this embodiment may be an electronic device as shown in fig. 11, and may perform all the steps of the above-described method for operating based on the user gesture control device, so as to achieve the technical effects of the above-described method for operating based on the user gesture control device, and specific reference is made to the above-described related description, which is not repeated herein for brevity.
The embodiment of the application also provides a storage medium (computer readable storage medium). The storage medium here stores one or more programs. Wherein the storage medium may comprise volatile memory, such as random access memory; the memory may also include non-volatile memory, such as read-only memory, flash memory, hard disk, or solid state disk; the memory may also comprise a combination of the above types of memories.
When one or more programs in the storage medium are executable by one or more processors, the method for controlling the operation of the device based on the user gesture performed on the electronic device side is implemented.
The above processor is configured to execute a program stored in the memory, so as to implement the following steps of a method executed on the electronic device side and based on the operation of the user gesture control device:
acquiring an image to be identified containing the image of the target user; carrying out semantic segmentation on the image to be identified by using a preset user region segmentation model to obtain a corresponding target user region of the target user in the image to be identified; performing key point detection on the target user area to obtain a key point set; based on the key point set, carrying out gesture analysis on the target user to obtain gesture information representing the current gesture of the target user; and controlling the running state of the target equipment to reach the target running state corresponding to the gesture information based on the gesture information.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
It is to be understood that the terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," and "having" are inclusive and therefore specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order described or illustrated, unless an order of performance is explicitly stated. It should also be appreciated that additional or alternative steps may be used.
The foregoing is merely a specific embodiment of the application to enable one skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method of controlling operation of a device based on a user gesture, the method comprising:
acquiring an image to be identified containing the image of the target user;
carrying out semantic segmentation on the image to be identified by using a preset user region segmentation model to obtain a corresponding target user region of the target user in the image to be identified;
performing key point detection on the target user area to obtain a key point set;
based on the key point set, carrying out gesture analysis on the target user to obtain gesture information representing the current gesture of the target user;
and controlling the running state of the target equipment to reach the target running state corresponding to the gesture information based on the gesture information.
2. The method of claim 1, wherein the acquiring the image to be identified comprising the image of the target user comprises:
acquiring an original image shot by a target camera for the target user;
and carrying out image enhancement processing on the original image to obtain the image to be identified.
3. The method according to claim 1, wherein the performing semantic segmentation on the image to be identified by using a preset user region segmentation model to obtain a target user region corresponding to the target user in the image to be identified includes:
carrying out shallow feature extraction operation on the image to be identified by using a shallow feature extraction module included in the user region segmentation model to obtain shallow feature data;
performing multi-scale deep feature extraction operation on the image to be identified by utilizing a multi-scale deep feature extraction module included in the user region segmentation model to obtain multi-scale deep feature data;
fusing the shallow characteristic data and the multi-scale deep characteristic data to obtain fused characteristic data;
and classifying pixels in the image to be identified based on the fusion characteristic data, and determining an area composed of pixels belonging to the image of the target user as the target user area.
4. A method according to claim 3, wherein the performing a multi-scale deep feature extraction operation on the image to be identified by using a multi-scale deep feature extraction module included in the user region segmentation model to obtain multi-scale deep feature data includes:
cutting the image to be identified to obtain a preset number of sub-images;
respectively extracting the characteristics of the preset number of sub-images by using an initial characteristic extraction network included in the multi-scale deep characteristic extraction module to obtain preset number group characteristic data;
integrating the preset number of sets of characteristic data to obtain initial characteristic data;
performing multi-scale deep feature extraction and fusion operation on the initial feature data by utilizing a multi-scale feature fusion network included in the multi-scale deep feature extraction module to obtain multi-scale fusion feature data;
and fusing the initial characteristic data and the multi-scale fusion characteristic data to obtain the multi-scale deep characteristic data.
5. The method of claim 4, wherein the multi-scale feature fusion network comprises at least two expanded convolution layers, each of the at least two expanded convolution layers having a corresponding expansion ratio;
the multi-scale deep feature extraction and fusion operation is performed on the initial feature data by using a multi-scale feature fusion network included in the multi-scale deep feature extraction module to obtain multi-scale fusion feature data, including:
for each expansion convolution layer in the at least two expansion convolution layers, if the expansion convolution layer is a first expansion convolution layer, performing expansion convolution processing on the initial characteristic data by the expansion convolution layer to obtain characteristic data corresponding to the expansion convolution layer; if the expansion convolution layer is a second expansion convolution layer, combining and carrying out expansion convolution processing on the initial characteristic data and the characteristic data output by the corresponding target expansion convolution layer by the expansion convolution layer to obtain the characteristic data corresponding to the expansion convolution layer;
and fusing the characteristic data output by each expansion convolution layer in the at least two expansion convolution layers to obtain the multi-scale fusion characteristic data.
6. The method according to claim 1, wherein performing gesture analysis on the target user based on the set of keypoints to obtain gesture information representing a current gesture of the target user comprises:
performing part affinity domain matching operation on the key point set based on a preset part affinity domain prediction network, and determining association relations among key points in the key point set;
connecting key points in the key point set based on the association relation;
pose information representing the current pose of the target user is determined based on the geometry formed by the connected keypoints.
7. The method according to claim 1, wherein controlling the operation state of the target device to reach the target operation state corresponding to the gesture information based on the gesture information comprises:
determining a target stage of the target user at the current moment after entering a shooting area of the target camera;
and controlling the running state of the target equipment to reach the target running state corresponding to the gesture information according to the preset control strategy corresponding to the target stage based on the gesture information.
8. The method of claim 6, wherein controlling the operation state of the target device to reach the target operation state corresponding to the gesture information according to the preset control policy corresponding to the target stage based on the gesture information comprises:
if the target stage is an initial stage, controlling the running state of the target equipment to reach a target running state corresponding to the gesture information;
if the target stage is an intermediate stage, determining whether the same gesture made by the target user meets a preset confirmation condition or not based on the gesture information; if the preset confirmation condition is met, controlling the running state of the target equipment to reach a target running state corresponding to the target posture information;
and if the target stage is the tail stage, determining whether the gesture indicated by the target gesture information is a preset gesture, and if so, controlling the running state of the target equipment to reach a target running state for prompting the target user to change the gesture.
9. The method of claim 1, wherein the user region segmentation model is trained in advance based on:
acquiring a first sample image and corresponding background region labeling information;
taking the first sample image as input of a preset initial model, taking the background region labeling information as expected output of the initial model, and adjusting parameters of the initial model;
responding to the fact that the initial model after the adjustment parameters meet a preset first training ending condition, and determining the initial model after the adjustment parameters as an initial user area segmentation model;
acquiring a second sample image and corresponding user region labeling information;
taking the second sample image as input of the initial user region segmentation model, taking the user region labeling information as expected output of the initial user region segmentation model, and adjusting parameters of the initial user region segmentation model;
and determining the initial user region segmentation model after the adjustment parameters as the user region segmentation model in response to the fact that the initial user region segmentation model after the adjustment parameters are determined to meet a preset second training ending condition.
10. An apparatus for controlling operation of a device based on a user gesture, the apparatus comprising:
the acquisition module is used for acquiring an image to be identified containing the image of the target user;
the segmentation module is used for carrying out semantic segmentation on the image to be identified by utilizing a preset user region segmentation model to obtain a target user region corresponding to the target user in the image to be identified;
the detection module is used for detecting the key points of the target user area to obtain a key point set;
the analysis module is used for carrying out gesture analysis on the target user based on the key point set to obtain gesture information representing the current gesture of the target user;
And the control module is used for controlling the running state of the target equipment to reach the target running state corresponding to the gesture information based on the gesture information.
11. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing a computer program stored in said memory, and which, when executed, implements the method of any of the preceding claims 1-9.
12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of the preceding claims 1-9.
CN202311787715.1A 2023-12-22 2023-12-22 Method and device for controlling equipment operation based on user gesture and electronic equipment Pending CN117755315A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311787715.1A CN117755315A (en) 2023-12-22 2023-12-22 Method and device for controlling equipment operation based on user gesture and electronic equipment

Publications (1)

Publication Number Publication Date
CN117755315A 2024-03-26


Similar Documents

Publication Publication Date Title
WO2020216008A1 (en) Image processing method, apparatus and device, and storage medium
CN106599773B (en) Deep learning image identification method and system for intelligent driving and terminal equipment
CN106682602B (en) Driver behavior identification method and terminal
CN111566612A (en) Visual data acquisition system based on posture and sight line
JP2022530605A (en) Child state detection method and device, electronic device, storage medium
CN109598234B (en) Key point detection method and device
WO2021016873A1 (en) Cascaded neural network-based attention detection method, computer device, and computer-readable storage medium
CN109657551B (en) Face detection method based on context information enhancement
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN107944351A (en) Image-recognizing method, device and computer-readable recording medium
CN110781980B (en) Training method of target detection model, target detection method and device
CN110929805A (en) Neural network training method, target detection device, circuit and medium
CN112950477A (en) High-resolution saliency target detection method based on dual-path processing
CN110807384A (en) Small target detection method and system under low visibility
CN111158457A (en) Vehicle-mounted HUD (head Up display) human-computer interaction system based on gesture recognition
Jegham et al. Deep learning-based hard spatial attention for driver in-vehicle action monitoring
CN106250878B (en) Multi-modal target tracking method combining visible light and infrared images
CN111860253A (en) Multitask attribute identification method, multitask attribute identification device, multitask attribute identification medium and multitask attribute identification equipment for driving scene
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN114821810A (en) Static gesture intention recognition method and system based on dynamic feature assistance and vehicle
CN111582057B (en) Face verification method based on local receptive field
CN117755315A (en) Method and device for controlling equipment operation based on user gesture and electronic equipment
CN116453230A (en) Living body detection method, living body detection device, terminal equipment and storage medium
CN111539420B (en) Panoramic image saliency prediction method and system based on attention perception features
CN113971797A (en) Dangerous driving behavior identification method and system based on action behavior characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination