WO2023020327A1 - Image processing - Google Patents

Image processing

Info

Publication number
WO2023020327A1
Authority
WO
WIPO (PCT)
Prior art keywords
foot
image
key point
dimensional
model
Prior art date
Application number
PCT/CN2022/111023
Other languages
French (fr)
Chinese (zh)
Inventor
何野
四建楼
王玉峰
杜天元
王明峰
钱晨
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 filed Critical 上海商汤智能科技有限公司
Publication of WO2023020327A1 publication Critical patent/WO2023020327A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present application relates to the technical field of computer vision, in particular to image processing.
  • Human body key point detection refers to the technique of using deep learning to extract features from an input image and locating key points from the extracted feature maps.
  • however, human body key point detection can only detect key points of the body as a whole; it cannot detect the two-dimensional position information of foot key points from which a foot pose could be estimated, so the three-dimensional pose of the foot cannot be obtained.
  • the present application discloses an image processing method.
  • the method may include: acquiring a region image corresponding to a foot in an image to be processed; performing key point detection on the region image using a foot key point detection model to obtain two-dimensional position information of first foot key points of the foot; and determining the three-dimensional pose of the foot in a three-dimensional space based on the mapping relationship between the two-dimensional position information and preset position information, in the three-dimensional space, of second foot key points in a preset three-dimensional foot model that correspond to the first foot key points.
  • the present application also proposes an image processing device, which includes: a first acquisition module, configured to acquire a region image corresponding to a foot in an image to be processed; a key point detection module, configured to perform key point detection on the region image using a foot key point detection model to obtain two-dimensional position information of first foot key points of the foot; and a determination module, configured to determine the three-dimensional pose of the foot in a three-dimensional space based on the mapping relationship between the two-dimensional position information and preset position information, in the three-dimensional space, of second foot key points in a preset three-dimensional foot model that correspond to the first foot key points.
  • the present application also proposes an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor executes the executable instructions to implement the above image processing method.
  • the present application also provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is used to cause a processor to execute the image processing method as shown in any one of the foregoing embodiments.
  • the foot key point detection model can be used to perform key point detection on the region image corresponding to the foot to obtain the two-dimensional position information of the first foot key points; then, based on the mapping relationship between the preset position information, in the three-dimensional space, of the second foot key points in the preset three-dimensional foot model corresponding to the first foot key points and the two-dimensional position information, the three-dimensional pose of the foot in the three-dimensional space can be determined.
  • neural network regression can be used to obtain the two-dimensional position information of the first foot key points for pose estimation, facilitating the subsequent determination of the foot's three-dimensional pose in the three-dimensional space based on the mapping relationship.
  • a three-dimensional virtual model of the shoe material can be superimposed at the position corresponding to the foot in the image to be processed, to obtain the augmented reality effect of virtual shoe fitting.
  • FIG. 1 is a flowchart of an image processing method shown in the present application.
  • FIG. 2 is a schematic flowchart of a method for acquiring a region image shown in the present application.
  • FIG. 3 is a schematic flowchart of a model training method shown in the present application.
  • FIG. 4 is a schematic flowchart of a foot tracking method shown in the present application.
  • FIG. 5 is a schematic diagram of a virtual shoe fitting process shown in the present application.
  • FIG. 6 is a schematic flowchart of a virtual shoe fitting method shown in the present application.
  • FIG. 7 is a schematic structural diagram of an image processing device shown in the present application.
  • FIG. 8 is a schematic diagram of a hardware structure of an electronic device shown in the present application.
  • This application relates to the field of augmented reality.
  • the target object may involve faces, limbs, gestures, and actions related to the human body; markers related to objects; or sand tables, display areas, or display items related to venues or places.
  • Vision-related algorithms may involve visual positioning, SLAM (Simultaneous Localization and Mapping), 3D reconstruction, image registration, background segmentation, object key point extraction and tracking, object pose or depth detection, etc.
  • Specific applications may involve not only interactive scenarios related to real scenes or objects, such as guided tours, navigation, explanation, reconstruction, and virtual effect overlay and display, but also special effects processing related to people, such as makeup beautification, body beautification, special effect display, and interactive scenarios such as virtual model display.
  • the relevant features, states and attributes of the target object can be detected or identified through the convolutional neural network.
  • the aforementioned convolutional neural network is a network model obtained through model training based on a deep learning framework.
  • This application proposes an image processing method.
  • the method can use the foot key point detection model to perform key point detection on the region image corresponding to the foot and obtain the two-dimensional position information of the first foot key points; then, based on the mapping relationship between the preset position information, in the three-dimensional space, of the second foot key points in the preset three-dimensional foot model corresponding to the first foot key points and the two-dimensional position information, the three-dimensional pose of the foot in the three-dimensional space can be determined.
  • neural network regression can be used to obtain the two-dimensional position information of the first foot key points for pose estimation, facilitating the subsequent determination of the foot's three-dimensional pose in the three-dimensional space based on the mapping relationship.
  • the first foot key point is at least one key point in at least one preset foot area of the foot in the image.
  • the predetermined foot area may be any area in the foot.
  • for example, the preset area may be the big toe area of the foot.
  • the second foot key points refer to those key points, among the foot key points included in the preset three-dimensional foot model, that lie in the same foot area as the first foot key points.
  • the method can be applied to electronic equipment.
  • the electronic device may implement the method by carrying a software device corresponding to the image processing method.
  • the type of the electronic device may be a notebook computer, a computer, a server, a mobile phone, a PAD terminal and the like.
  • the present application does not specifically limit the specific type of the electronic device.
  • the electronic device may be a device on the client side or on the server side.
  • the server side may be a single server, a server cluster, or a cloud provided by a distributed server cluster.
  • FIG. 1 is a flowchart of an image processing method shown in the present application. As shown in FIG. 1, the method may include S102-S106.
  • S102 acquire an area image corresponding to the foot in the image to be processed.
  • the image to be processed may include feet.
  • the purpose of this application is to capture the foot in the image to be processed and obtain the three-dimensional pose of the foot.
  • the image to be processed may be an image transmitted by a user through a client program. In this way, images uploaded by users can be processed.
  • the image to be processed may also be an image collected by image collection hardware.
  • the image acquisition hardware may be a camera mounted on a mobile phone terminal. Users can collect video streams in real time through the camera.
  • the image to be processed may be a picture, or an image in the video stream.
  • the foot can be captured in the image to be processed in real time, and the three-dimensional pose estimation of the foot can be performed.
  • the feet may refer to the feet of a human body or other animals or robots.
  • the foot may comprise a left foot or a right foot.
  • the region image may be an image corresponding to the foot region among the images to be processed.
  • if the image to be processed is a picture or the first frame of a video stream, the area enclosed by the foot detection frame in the image to be processed may be determined as the region image.
  • if the image to be processed is a non-first frame in the video stream, the method for acquiring the region image is described in subsequent embodiments and will not be detailed here.
  • FIG. 2 is a schematic flowchart of a method for acquiring an area image shown in the present application.
  • S21-S22 may be executed when S102 is executed.
  • S21 using an object detection model to perform object detection on the image to be processed to obtain an object detection result, where the object detection result includes a detection frame of a foot in the image to be processed.
  • the image to be processed may be input into the trained object detection model to obtain a foot detection frame corresponding to the foot included in the image to be processed.
  • the object detection model may be a model built based on RCNN (Region-based Convolutional Neural Networks), FAST-RCNN (Fast Region-based Convolutional Neural Networks), or FASTER-RCNN (Faster Region-based Convolutional Neural Networks).
  • the object detection model can be trained. Specifically, multiple training samples marked with detection frame information corresponding to the feet can be obtained. The model may then be trained under supervision based on the training samples until the model converges.
  • the object detection model can be used to detect the feet and the foot detection frame in the image.
  • the foot detection frame obtained by object detection and the image to be processed (or the feature map obtained by performing feature extraction on the image to be processed by using the backbone network) can be input into the area feature extraction unit to obtain the area image.
  • the region feature extraction unit may be an ROI Align (Region of Interest Align) unit or an ROI Pooling (Region of Interest Pooling) unit. This unit can extract the region image enclosed by the foot detection frame in the image to be processed.
  • in this way, deep learning can be used to accurately crop the image to be processed and obtain the region image corresponding to the foot.
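As a minimal sketch (not the patent's implementation), extracting the region enclosed by a detection frame can be illustrated with plain NumPy array slicing; the image size and frame coordinates below are hypothetical:

```python
import numpy as np

def crop_region_image(image: np.ndarray, box: tuple) -> np.ndarray:
    """Crop the area enclosed by a detection frame (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    # Clamp the frame to the image bounds before slicing.
    h, w = image.shape[:2]
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(w, x1), min(h, y1)
    return image[y0:y1, x0:x1]

# Hypothetical 100x100 single-channel image and a foot detection frame.
image = np.zeros((100, 100), dtype=np.uint8)
region = crop_region_image(image, (10, 20, 60, 90))
print(region.shape)  # (70, 50)
```

In a real pipeline the crop would typically be an ROI Align / ROI Pooling unit operating on a feature map rather than raw pixel slicing.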
  • the foot key point detection model includes a neural network model trained based on a plurality of foot region image samples labeled with position information of the first foot key point.
  • the first foot key point refers to at least one key point in at least one preset foot area of the foot in the image.
  • the preset foot area may be any area of the foot.
  • for example, the preset area may be the big toe area of the foot.
  • the number of the first foot key points and the preset foot area can be set according to business requirements.
  • the first foot key points may include a plurality of key points on the edge contour of the foot. The outline of the foot can therefore be well represented by the first foot key points, improving the foot pose estimation.
  • the two-dimensional position information may indicate the two-dimensional position information of the first foot key point in the image to be processed. In some embodiments, the two-dimensional position information may indicate the two-dimensional coordinates of the first foot key point in the image to be processed.
  • the number of the first foot key points may be no less than four. This can improve the fine-grainedness of the key points of the feet, increase the pose information of the feet, and thus improve the pose estimation effect of the feet.
  • the number of first foot key points may be any number in the range of four to fifteen; by placing key points at the necessary foot positions, the number of first foot key points stays within an appropriate range, which also facilitates the subsequent three-dimensional pose detection of the foot.
  • the first foot key points may include key points of at least one of the following areas: protruding areas and/or concave areas on the foot contour.
  • using points in the protruding and/or concave areas of the foot contour as first foot key points improves the accuracy of the contour representation, thereby improving the foot pose estimation.
  • the foot key point detection model can be a regression or classification model constructed based on a neural network or a deep learning network.
  • the model is used to detect the two-dimensional position information of the key point of the first foot.
  • FIG. 3 is a schematic flowchart of a model training method shown in the present application. As shown in FIG. 3, S31-S33 may be executed during model training.
  • S31 acquire multiple training samples.
  • the training samples may be foot images including feet.
  • the two-dimensional position information of the first key point of the foot may be marked in the foot image.
  • the model has the ability to predict the two-dimensional position information of the key points of the first foot.
  • the area image obtained in S102 may be input into the trained foot key point detection model to obtain two-dimensional position information of the first foot key point of the foot.
  • the three-dimensional space refers to the three-dimensional space into which the foot is to be projected. This space is usually set according to business requirements.
  • the preset three-dimensional model of the foot may be a three-dimensional model maintained in advance according to business requirements.
  • the preset three-dimensional model of the foot may include a plurality of key points of the foot.
  • the second foot key points refer to at least one key point, among the foot key points included in the preset three-dimensional foot model, located in at least one foot area that is the same as the at least one foot area of the first foot key points.
  • the foot region may be any region of the foot.
  • for example, the foot area may be the big toe area of the foot.
  • the preset position information may indicate the three-dimensional position information of the second key point of the foot in the three-dimensional space under the standard pose of the preset three-dimensional model of the foot.
  • the preset position information may be the three-dimensional coordinates of the second key point of the foot in the three-dimensional space in the standard pose of the three-dimensional model of the foot.
  • the standard pose may be a preset pose according to business requirements.
  • for example, the pose in which the model is located at the origin of the three-dimensional coordinate system corresponding to the three-dimensional space and is perpendicular to the plane formed by the X and Y axes of that coordinate system may be determined as the standard pose.
  • the three-dimensional coordinates of the second foot key points when the three-dimensional foot model is in the standard pose constitute the preset position information.
  • the three-dimensional pose is used to indicate the posture of the foot in the three-dimensional space in the image to be processed.
  • the three-dimensional pose may include translation and rotation of the foot relative to X, Y, and Z axes in the three-dimensional coordinate system.
  • the three-dimensional mapping algorithm may include PnP (Perspective-n-Point) and other similar algorithms. The present application does not specifically limit the algorithm used to solve the mapping relationship.
  • the PnP algorithm has two inputs: the two-dimensional coordinates of the first foot key points in the image to be processed, and the three-dimensional coordinates of the second foot key points in the three-dimensional space.
  • the algorithm obtains the three-dimensional pose of the foot in the three-dimensional space based on the mapping relationship between the two-dimensional coordinates of each first foot key point and the three-dimensional coordinates of the corresponding second foot key point.
  • the three-dimensional coordinates of the second foot key points in the three-dimensional space may be obtained first; the three-dimensional coordinates and the two-dimensional coordinates are then input into the solving formula corresponding to the PnP algorithm to obtain the three-dimensional pose of the foot in the three-dimensional space.
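To make the 2D-3D mapping concrete, the following NumPy sketch shows the forward pinhole projection that a PnP solver inverts: given assumed 3D coordinates of second foot key points in the standard pose and a candidate pose (R, t), it produces the 2D projections that the solver tries to match to the observed first-foot-key-point coordinates. The points, pose, and unit focal length are illustrative assumptions, not values from the application; a typical solver in practice would be OpenCV's `cv2.solvePnP`:

```python
import numpy as np

def project_points(pts_3d, R, t, f=1.0):
    """Pinhole projection of 3D foot key points under pose (R, t)."""
    cam = pts_3d @ R.T + t               # model coordinates -> camera coordinates
    return f * cam[:, :2] / cam[:, 2:3]  # perspective divide

# Hypothetical second-foot-key-point coordinates in the standard pose.
pts_3d = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 1.0, 0.5]])
R = np.eye(3)                  # identity rotation for illustration
t = np.array([0.0, 0.0, 5.0])  # translate the foot 5 units along Z
pts_2d = project_points(pts_3d, R, t)
print(pts_2d[1])  # [0.2 0. ]
```

PnP searches for the (R, t) under which this projection best reproduces the detected 2D key points; that (R, t) is the foot's three-dimensional pose.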
  • the foot key point detection model can be used to perform key point detection on the region image corresponding to the foot to obtain the two-dimensional position information of the first foot key points; then, based on the mapping relationship between the preset position information, in the three-dimensional space, of the second foot key points in the preset three-dimensional foot model corresponding to the first foot key points and the two-dimensional position information, the three-dimensional pose of the foot in the three-dimensional space can be determined.
  • neural network regression can be used to obtain the two-dimensional position information of the first foot key points for pose estimation, facilitating the subsequent determination of the foot's three-dimensional pose in the three-dimensional space based on the mapping relationship.
  • the image processing method shown in this application can also identify the type of the foot, that is, whether it is a left foot or a right foot.
  • the object detection result obtained through S21 includes not only the detection frame of the foot in the image to be processed, but also the type of the foot.
  • when performing S21, an object detection model may be used to perform object detection on the image to be processed to obtain the detection frame of the foot included in the image to be processed and the type of the foot.
  • the object detection model can be used to distinguish the left and right feet on the basis of detecting the foot detection frame in the image to be processed.
  • the object detection model includes a neural network model trained based on a plurality of training samples marked with detection frames and type information corresponding to feet; the type indicates whether the foot is a left foot or a right foot.
  • the object detection model may be a model built based on RCNN (Region-based Convolutional Neural Networks), FAST-RCNN (Fast Region-based Convolutional Neural Networks), or FASTER-RCNN (Faster Region-based Convolutional Neural Networks).
  • the object detection model may be trained. Specifically, a plurality of training samples marked with detection frame information and type information corresponding to the feet (that is, left foot or right foot) can be obtained. The model may then be trained under supervision based on the training samples until the model converges.
  • the object detection model can be used to detect the detection frame and the type of the foot in the image.
  • S103 may also be executed: in response to the foot being of a preset type, the region image is flipped so that the foot types in all region images input to the foot key point detection model are consistent, thereby ensuring that the feet in all region images input to the model are of the same type (i.e., left foot or right foot), which facilitates processing by the foot key point detection model.
  • the preset type can be set according to business requirements. In some embodiments, when training the foot key point detection model, left-foot samples may be used for training; in that case, the preset type can be set to right foot. When executing S103, if the foot is recognized as a right foot, the region image corresponding to the foot is flipped so that the foot in the region image becomes a left-foot type, which facilitates processing by the foot key point detection model.
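A minimal sketch of this normalization step, assuming the preset type is "right" and using a horizontal mirror as the flip; the tiny array stands in for a region image:

```python
import numpy as np

def normalize_foot_type(region_image: np.ndarray, foot_type: str,
                        preset_type: str = "right") -> np.ndarray:
    """Flip the region image horizontally when the detected foot is the
    preset type, so the key point model always sees one foot type."""
    if foot_type == preset_type:
        return region_image[:, ::-1]  # mirror along the horizontal axis
    return region_image

img = np.arange(6).reshape(2, 3)      # hypothetical 2x3 region image
flipped = normalize_foot_type(img, "right")
print(flipped[0].tolist())  # [2, 1, 0]
```

Predicted key point x-coordinates would then need the inverse mirroring applied before being used in the original image frame.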
  • image processing needs to be performed on the images in the captured video stream.
  • foot tracking is required.
  • the object detection model can be used to perform object detection on each frame of the video stream to obtain the foot detection frames, and the same foot can then be identified across frames according to the positions of the detected foot detection frames, thereby performing foot tracking.
  • alternatively, the fact that the position of a foot does not change significantly between adjacent frames can be exploited to track the foot in the video stream, reducing reliance on the object detection model for foot tracking.
  • FIG. 4 is a schematic flowchart of a foot tracking method shown in the present application. As shown in FIG. 4, S41-S43 may be executed when implementing foot tracking.
  • S41 acquire an area image corresponding to the foot in the image to be processed.
  • the image to be processed is an image in a video stream.
  • the image to be processed may be the first frame image or images after the first frame image in a certain video stream.
  • the area image can be acquired through steps S21-S22.
  • the position information of the first foot key points of the foot in the previous frame of the video stream relative to the image to be processed may be acquired, and the foot key point frame may be determined according to the acquired position information. The region image corresponding to the foot is then obtained based on the foot key point frame and the image to be processed.
  • the key point frame of the foot can be a key point frame of any shape.
  • the foot key point box may be a rectangular box.
  • the two-dimensional position information of the first key point of the foot in the cached image of the previous frame of the image to be processed may be acquired.
  • the minimum coordinates x0, y0 and the maximum coordinates x1, y1 along the X and Y axes can be determined from those two-dimensional coordinates.
  • a rectangular frame composed of four vertices (x0, y0), (x0, y1), (x1, y0) and (x1, y1) can be used as the foot key point frame.
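The rectangular foot key point frame described above can be sketched in a few lines; the cached key point coordinates are hypothetical:

```python
def keypoint_frame(points):
    """Build a rectangular foot key point frame from the previous frame's
    first-foot-key-point coordinates: (x0, y0) = minima, (x1, y1) = maxima."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    # The four vertices of the rectangular frame.
    return (x0, y0), (x0, y1), (x1, y0), (x1, y1)

# Hypothetical cached key points from the previous frame.
pts = [(12, 40), (30, 25), (18, 55), (27, 33)]
print(keypoint_frame(pts))  # ((12, 25), (12, 55), (30, 25), (30, 55))
```

In practice the frame is often padded by a margin so that small inter-frame motion keeps the foot inside it.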
  • the key point frame of the foot and the image to be processed can be input into the region feature extraction unit to obtain the second region image .
  • the region feature extraction unit may be an ROI Align (Region of Interest Align) unit or an ROI Pooling (Region of Interest Pooling) unit.
  • the unit can extract the second area image surrounded by the frame of the key points of the foot in the second image.
  • after the region image corresponding to the foot is obtained based on the foot key point frame and the image to be processed, the type of the foot stored when object detection or classification was performed on the region image of the previous frame may also be obtained; in response to the type of the foot in the previous frame being the preset type, the region image is flipped so that the foot types in all region images input to the foot key point detection model are consistent. This ensures that the feet in all region images input to the foot key point detection model are of the same type (i.e., left foot or right foot), which facilitates processing by the model.
  • the classification result indicates whether the region image is a foot image.
  • the image classification model may include a convolutional neural network.
  • image classification information indicates whether the image sample is a foot image. This model can then be trained under supervision based on image samples.
  • the image classification model can be used to perform image classification on the region image to obtain a classification result.
  • since the region image obtained in S41 is taken from the image to be processed at the same position as the foot in the previous frame image, if the classification result indicates that the region image is a foot image, it can be concluded that a foot is present in the image to be processed at that same position. Given that the position of the same foot does not change significantly across adjacent frames, it can be determined that the foot in the region image is the same foot as in the previous frame image.
  • in this way, foot tracking in the video stream is realized, the computation required for tracking through the object detection model is reduced, overhead is lowered, foot tracking efficiency is improved, and the real-time performance of the image processing method is enhanced.
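The per-frame tracking decision described above can be sketched as follows; the Boolean values stand in for the image classification model's result on each frame's region image (True meaning the crop at the previous foot position still classifies as a foot):

```python
def process_stream(frames_are_foot):
    """Simulate the tracking loop: when the region image cut at the previous
    foot position classifies as a foot, tracking continues; otherwise the
    frame falls back to full object detection (steps S21-S22)."""
    actions = []
    for is_foot in frames_are_foot:
        actions.append("track" if is_foot else "redetect")
    return actions

# Hypothetical classification results for five consecutive frames.
print(process_stream([True, True, False, True, True]))
# ['track', 'track', 'redetect', 'track', 'track']
```

Only the "redetect" frames pay the cost of running the object detection model, which is the source of the efficiency gain.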
  • S104 may be continued to perform key point detection on the region image by using the foot key point detection model to obtain two-dimensional position information of the first foot key point of the foot. S104 will not be described in detail here.
  • if the foot in the region image is the same foot as in the previous frame image, the three-dimensional pose of the foot in the three-dimensional space is determined based on the mapping relationship between the preset position information, in the three-dimensional space, of the second foot key points in the preset three-dimensional foot model corresponding to the first foot key points and the two-dimensional position information.
  • the pose estimation steps are not described in detail here.
  • if the classification result indicates that the region image is not a foot image, the image to be processed can be treated as an image for which video tracking failed; the region image is then acquired through steps S21-S22, and the three-dimensional pose of the foot is obtained through S104 and S106.
  • in some embodiments, the image classification model shares a feature extraction network, for example the backbone network, with the foot key point detection model.
  • the image classification model and the foot key point detection model can be trained jointly, improving the feature extraction network's ability to extract features useful for both classification and key point detection, and thereby improving classification and key point detection performance.
  • the first image sample labeled with image classification information and the second image sample labeled with position information of the first key point of the foot may be acquired.
  • the first image sample and the second image sample may be the same image.
  • the classification information indicates whether the first image sample is an image of a foot.
  • the first image sample may be input into the image classification model to obtain a classification prediction result, and first loss information may be obtained from the classification prediction result and the labeled image classification information; the second image sample may be input into the foot key point detection model to obtain a first-foot-key-point position prediction result, and second loss information may be obtained from that prediction result and the labeled first-foot-key-point position information.
  • model parameters of the image classification model and the foot key point detection model may be adjusted based on the first loss information and the second loss information.
  • the foregoing training steps may be performed iteratively for multiple times until the image classification model and the foot key point detection model converge.
  • Because the image classification model shares the feature extraction network with the foot key point detection model, training either model affects the training of the other, so the training of the two models can complement and promote each other, realizing joint training of the image classification model and the foot key point detection model.
  • In this way, the ability of the feature extraction network to extract feature information useful for classification and key point detection can be improved, improving the effect of both tasks; because the two training processes complement and promote each other, the efficiency of model training is also improved.
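The joint-training scheme above (two losses back-propagating into one shared feature extractor) can be sketched as follows. This is a minimal numpy illustration, not the application's implementation: a single linear layer stands in for the shared backbone, and squared error stands in for the usual classification/regression losses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared feature extraction network (a single linear layer stands in for a CNN backbone).
W_shared = rng.normal(size=(8, 16)) * 0.1
# Task-specific branches: foot / not-foot classification and first-foot-key-point regression.
W_cls = rng.normal(size=(16, 1)) * 0.1
W_kpt = rng.normal(size=(16, 8)) * 0.1  # 4 key points x (x, y)

def forward(x):
    feat = np.tanh(x @ W_shared)        # features from the shared extractor
    return feat @ W_cls, feat @ W_kpt   # classification branch, key point branch

def joint_loss(x_cls, y_cls, x_kpt, y_kpt):
    logit, _ = forward(x_cls)
    prob = 1.0 / (1.0 + np.exp(-logit))
    first_loss = np.mean((prob - y_cls) ** 2)    # classification loss (stand-in for cross entropy)
    _, kpts = forward(x_kpt)
    second_loss = np.mean((kpts - y_kpt) ** 2)   # key point position loss
    # Both losses depend on W_shared, so adjusting model parameters against their
    # sum trains the shared feature extractor from both tasks at once.
    return first_loss + second_loss

# First image sample (classification labels) and second image sample (key point labels).
x1 = rng.normal(size=(5, 8)); y1 = rng.integers(0, 2, size=(5, 1)).astype(float)
x2 = rng.normal(size=(5, 8)); y2 = rng.normal(size=(5, 8))
total = joint_loss(x1, y1, x2, y2)
```

In a real training loop this combined loss would be minimized iteratively until both models converge, as the surrounding text describes.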
  • virtual shoe fitting may be performed.
  • a three-dimensional virtual model of the shoe material may be obtained first. Then, based on the three-dimensional pose of the foot in the image to be processed, the three-dimensional virtual model is superimposed on the position corresponding to the foot in the image to be processed, so as to obtain the augmented reality effect of virtual shoe fitting.
  • the three-dimensional virtual model can be used to indicate the outline and/or texture color of the shoe material.
  • the 3D virtual model may include the 3D coordinates of each vertex of the shoe material in 3D space, and the pixel value of each vertex.
  • By superimposing the three-dimensional virtual model at the position corresponding to the foot, the shoe material can be displayed in the image to be processed, thereby achieving the augmented reality effect of virtual shoe fitting.
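As described above, the three-dimensional virtual model carries the 3D coordinates of each vertex of the shoe material plus a pixel value per vertex. A minimal container for such a model might look like this (the class and field names are assumptions, not from the application):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ShoeModel3D:
    """Three-dimensional virtual model of a shoe material: per-vertex 3D
    coordinates (outline) and per-vertex pixel values (texture / color)."""
    vertices: np.ndarray   # shape (N, 3): 3D coordinates of each vertex in 3D space
    colors: np.ndarray     # shape (N, 3): RGB pixel value of each vertex

    def __post_init__(self):
        # Every vertex must carry exactly one pixel value.
        assert self.vertices.shape[0] == self.colors.shape[0]

model = ShoeModel3D(
    vertices=np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.2, 0.0]]),
    colors=np.array([[200, 30, 30]] * 3),
)
```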
  • FIG. 5 is a schematic diagram of a virtual shoe fitting process shown in the present application.
  • the method may include S51-S54.
  • S51 acquire the initial pose of the 3D virtual model corresponding to the shoe material in the 3D space.
  • the initial pose is used to indicate the pose information of the shoe material in three-dimensional space.
  • the initial pose of the shoe material in the three-dimensional space may be pre-maintained in the database.
  • When needed, this initial pose can be acquired from the database.
  • the pose transformation information of the initial pose can be determined based on the 3D pose of the foot.
  • the pose transformation information may indicate translation and rotation amounts of the initial pose relative to the X-axis, Y-axis, and Z-axis of the three-dimensional coordinate system.
  • the converted three-dimensional pose may indicate pose information after converting the shoe material to the pose of the foot.
  • the pose transformation information may include translation and rotation information.
  • The pose transformation information may be used to perform translation and rotation operations on the initial pose, to obtain the pose information of the shoe material after it has been transformed to the pose of the foot.
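S52-S53 can be sketched as follows: the pose transformation is expressed as rotation amounts about the X, Y, and Z axes plus a translation, and applying it to the initial pose yields the converted pose. A hedged numpy sketch (the Z-Y-X composition order is an assumption):

```python
import numpy as np

def rotation_xyz(rx, ry, rz):
    """Rotation matrix from rotation amounts (radians) about the X, Y, Z axes."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def transform_pose(initial_R, initial_t, rot_xyz, trans_xyz):
    """Apply the pose transformation information (rotation + translation amounts)
    to the initial pose, producing the converted three-dimensional pose."""
    R = rotation_xyz(*rot_xyz)
    return R @ initial_R, R @ np.asarray(initial_t, float) + np.asarray(trans_xyz, float)
```

For example, rotating an identity initial pose by 90 degrees about Z and translating it moves the shoe material onto the foot's pose in this toy setup.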
  • the three-dimensional coordinates of each vertex in the three-dimensional virtual model may be adjusted based on the converted three-dimensional pose first, to obtain an adjusted three-dimensional virtual model. Then use a projection algorithm to project the three-dimensional coordinates of each vertex in the adjusted three-dimensional virtual model to the two-dimensional plane where the image to be processed is located, to obtain the two-dimensional coordinates of each vertex.
  • The shape of the two-dimensional virtual material corresponding to the shoe material can be determined from the two-dimensional coordinates of the vertices, and its texture and color can be determined from the pixel values of the vertices (obtained from the pixel values of the vertices in the three-dimensional virtual model), so as to obtain the two-dimensional virtual material.
  • the image fusion can be completed by covering each pixel of the two-dimensional virtual material with the pixel at the corresponding position of the foot, or adjusting the transparency of the pixel at the corresponding position of the foot.
  • the fused image is obtained, and the AR effect of virtual shoe fitting can be displayed by outputting the fused image.
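The projection and fusion steps above can be sketched as below, assuming a simple pinhole camera with focal length f and principal point (cx, cy); these parameters and function names are illustrative, not from the application:

```python
import numpy as np

def project_vertices(verts, f, cx, cy):
    """Pinhole projection of (N, 3) camera-space vertices of the adjusted 3D
    virtual model onto the 2D plane of the image to be processed."""
    x, y, z = verts[:, 0], verts[:, 1], verts[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=1)

def fuse(image, material_rgb, material_mask, alpha=1.0):
    """Cover the foot pixels with the 2D virtual material; alpha < 1 keeps part
    of the original pixel visible (the transparency adjustment described above)."""
    out = image.astype(float).copy()
    m = material_mask.astype(bool)
    out[m] = alpha * material_rgb[m].astype(float) + (1.0 - alpha) * out[m]
    return np.clip(out, 0, 255).astype(np.uint8)

pts_2d = project_vertices(np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]), 100.0, 50.0, 50.0)
img = np.zeros((2, 2, 3), dtype=np.uint8)
mat = np.full((2, 2, 3), 255, dtype=np.uint8)
mask = np.array([[True, False], [False, False]])
fused = fuse(img, mat, mask)
```

Outputting the fused image then displays the AR effect of virtual shoe fitting, as the text describes.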
  • Embodiments will be described below in conjunction with a virtual shoe fitting scene.
  • The shoes in three-dimensional space can be fused (for example, by image rendering) with the feet captured in the video stream to complete the virtual shoe fitting.
  • the virtual shoe trial client can be carried in the mobile terminal.
  • the mobile terminal can be equipped with a camera for real-time collection of video streams.
  • The virtual shoe library can be deployed locally on the mobile terminal (hereinafter referred to as the terminal), or on a server corresponding to the virtual shoe-trying client (hereinafter referred to as the client).
  • the virtual shoe library may include 3D virtual models of various shoe materials developed.
  • the virtual shoe library can be any type of database.
  • the user can select shoes to try on from the virtual shoe library in the virtual shoe trial client, and collect foot video streams through the camera.
  • FIG. 6 is a schematic flowchart of a virtual shoe fitting method shown in the present application. As shown in Fig. 6, the method may include S601-S611.
  • S601 may be executed for the first image in the video stream.
  • the first image includes the first frame image in the video stream, or the image after it is determined that the foot tracking fails according to steps S607-S610.
  • S601: Use the pre-trained object detection model to perform object detection on the first image to obtain the foot detection frame in the first image and the type of the foot, that is, whether the foot is a left foot or a right foot. In the following, it is assumed that this foot is the right foot.
  • The first foot key points include key points of at least some of the following foot areas: the tip of the big toe; the inner joint of the forefoot; the inner arch of the foot; the inner side of the rear sole; the rear of the heel; the outer side of the rear sole; the outer joint of the forefoot; the junction of the forefoot and the leg; the medial ankle joint; the rear hamstring; the lateral ankle joint.
  • the outline of the foot can be described in detail at a fine-grained level, including a large amount of three-dimensional information of the foot, and the accuracy of pose estimation can be improved.
  • S606: Obtain the 3D virtual model of the shoe corresponding to the right foot from the virtual shoe library, and fuse the shoe with the foot according to the obtained 3D virtual model and the 3D pose of the foot, to demonstrate the effect of virtual shoe fitting.
  • For details of S606, reference may be made to S51-S54, which will not be described again here.
  • S607 may be executed for the second image in the video stream.
  • the second image may be any non-first frame image after the first frame image in the video stream.
  • S607. Acquire position information of the first key point of the foot in a previous frame image of the second image, and determine a key point frame of the foot based on the position information.
  • the foot tracking model includes a classification branch and a key point detection branch.
  • The model is obtained through joint training based on left foot image samples labeled with position information of the first foot key point and left foot image samples labeled with image classification information.
  • S610: Use the classification branch of the pre-trained foot tracking model to determine whether the second region image is a foot image, and use the key point branch of the model to determine the position information of the first foot key point of the foot.
  • If it is determined, based on the image classification information, that the second region image is a foot image, it can be determined that the foot in the second image is the foot that appeared in the previous frame image, completing the foot tracking; S605 and S606 are then executed to show the effect of virtual shoe fitting.
  • Otherwise, the foot tracking fails, and S601-S606 can be performed using the second image as the first image to obtain the 3D pose of the foot in the subsequent images.
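The key point frame used in S607 (a foot region derived from the previous frame's key point positions) can be sketched as a bounding box around those points, expanded by a margin to absorb small inter-frame motion; the margin value here is an assumption:

```python
import numpy as np

def keypoint_frame(prev_keypoints, margin=0.2, image_size=None):
    """Foot key point frame: bounding box around the previous frame's foot key
    points, expanded by a relative margin because the foot position changes
    little between adjacent frames."""
    pts = np.asarray(prev_keypoints, float)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    mx, my = margin * (x1 - x0), margin * (y1 - y0)
    box = [x0 - mx, y0 - my, x1 + mx, y1 + my]
    if image_size is not None:         # optionally clamp to the image bounds
        w, h = image_size
        box = [max(box[0], 0.0), max(box[1], 0.0), min(box[2], w), min(box[3], h)]
    return box
```

Cropping the second image with this frame yields the second region image fed to the foot tracking model, avoiding a full object-detection pass on every frame.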
  • the key point detection model of the foot can be used to obtain the position information of the key point of the first foot through regression;
  • Then, based on the three-dimensional pose of the foot in the image to be processed, the three-dimensional virtual model corresponding to the shoe can be superimposed on the position corresponding to the foot in the first image or the second image, to obtain the augmented reality effect of virtual shoe fitting.
  • The feature that the position of the foot does not change significantly between adjacent frames can be used to track the foot in the video stream, reducing the amount of computation that foot tracking via the object detection model would require and reducing overhead, improving the efficiency of foot tracking and thereby the real-time performance of the virtual shoe fitting method.
  • the present application proposes an image processing device 70 .
  • FIG. 7 is a schematic structural diagram of an image processing device shown in the present application.
  • the device 70 may include:
  • the first obtaining module 71 is used to obtain the area image corresponding to the foot in the image to be processed
  • the key point detection module 72 is used to use the foot key point detection model to perform key point detection on the region image to obtain the two-dimensional position information of the first foot key point of the foot;
  • a determining module 73 configured to be based on the difference between the preset position information in the three-dimensional space of the second key point of the foot corresponding to the first key point in the three-dimensional model of the foot and the two-dimensional position information The mapping relationship determines the three-dimensional pose of the foot in the three-dimensional space.
  • the first obtaining module 71 is specifically configured to:
  • the object detection result includes a detection frame of the foot in the image to be processed
  • an area image corresponding to the foot is obtained.
  • the object detection result further includes the type of the foot; the type is used to indicate that the foot is a left foot or a right foot;
  • the device 70 also includes:
  • the first inversion module is configured to perform inversion processing on the region images in response to the feet being of a preset type, so that the types of feet in all region images input to the foot key point detection model are consistent.
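The flip (inversion) processing described for this module can be sketched as a horizontal mirror of the region image, with predicted key point x coordinates mirrored back afterwards; a minimal numpy sketch (function names are assumptions):

```python
import numpy as np

def flip_region_image(region_image):
    """Mirror the region image left-right so that a foot of the preset type
    (e.g. a left foot) looks like the other type before key point detection,
    keeping the foot type consistent across all model inputs."""
    return region_image[:, ::-1]

def unflip_keypoint_x(x, width):
    """Map an x coordinate predicted on the flipped image back to the original."""
    return (width - 1) - np.asarray(x)
```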
  • the first acquisition module 71 includes:
  • an area image corresponding to the foot is obtained.
  • the device 70 also includes:
  • the second obtaining module is used to obtain the type of the foot in the stored image of the previous frame
  • The second flip module is used to perform flip processing on the region images in response to the type of the foot in the previous frame image being a preset type, so that the types of feet in all region images input to the foot key point detection model are consistent.
  • The image to be processed is an image in a video stream, and the video stream also includes the previous frame image of the image to be processed; the region image of the image to be processed is determined based on the foot key point frame determined by the first foot key point in the previous frame image.
  • the device 70 also includes:
  • a tracking module, configured to use an image classification model to classify the region image to obtain a classification result of the region image, the classification result being used to indicate whether the region image is a foot image; and, in response to the classification result indicating that the region image is a foot image, to determine that the foot in the region image is the same foot as the foot in the previous frame image, so as to perform foot tracking.
  • the determining module 73 is specifically configured to:
  • the image classification model shares a feature extraction network with the foot key point detection model; the device 70 also includes:
  • the joint training module of the image classification model and the foot key point detection model is used to obtain the first image sample marked with image classification information, and the second image sample marked with the position information of the first foot key point;
  • model parameters of the image classification model and the foot key point detection model are adjusted.
  • the first key points of the foot include a plurality of key points on the contour of the edge of the foot; the number of the first key points of the foot is not less than four.
  • the first foot key points include key points of at least one of the following regions:
  • the device 70 also includes:
  • the virtual shoe-trying module is used to obtain the three-dimensional virtual model of the shoe material
  • the three-dimensional virtual model is superimposed on the position corresponding to the foot in the image to be processed to obtain an augmented reality effect of virtual shoe fitting.
  • the virtual shoe fitting module is specifically used for:
  • Image fusion of the two-dimensional virtual material and the corresponding position of the feet is performed to obtain a fused image for AR effect display of virtual shoe fitting.
  • Embodiments of the image processing apparatus shown in this application can be applied to electronic equipment.
  • the present application discloses an electronic device, and the device may include: a processor.
  • Memory used to store processor-executable instructions.
  • the processor is configured to call executable instructions stored in the memory to implement the image processing method shown in any one of the foregoing embodiments.
  • FIG. 8 is a schematic diagram of a hardware structure of an electronic device shown in the present application.
  • The electronic device may include a processor for executing instructions, a network interface for connecting to a network, a memory for storing operation data for the processor, and a non-volatile memory for storing instructions corresponding to the image processing device.
  • the embodiment of the apparatus may be implemented by software, or by hardware or a combination of software and hardware.
  • Taking software implementation as an example, the device, in a logical sense, is formed by the processor of the electronic device where it is located reading the corresponding computer program instructions from the non-volatile memory into memory for execution.
  • In addition to the processor, memory, network interface, and non-volatile memory, the electronic device where the device in the embodiment is located may also include other hardware according to the actual function of the electronic device, which will not be described in detail here.
  • the device corresponding instructions may also be directly stored in the memory, which is not limited herein.
  • the present application proposes a computer-readable storage medium, the storage medium stores a computer program, and the computer program can be used to cause a processor to execute the image processing method shown in any one of the foregoing embodiments.
  • one or more embodiments of the present application may be provided as a method, system or computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) having computer-usable program code embodied therein.
  • each embodiment in the present application is described in a progressive manner, the same and similar parts of each embodiment can be referred to each other, and each embodiment focuses on the differences from other embodiments.
  • the description is relatively simple, and for relevant parts, please refer to part of the description of the method embodiment.
  • Embodiments of the subject matter and functional operations described in this application can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this application and their structural equivalents, or in a combination of one or more of them.
  • Embodiments of the subject matter described in this application can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus.
  • Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus.
  • a computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
  • Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing system.
  • a central processing system will receive instructions and data from read only memory and/or random access memory.
  • the basic components of a computer include a central processing system for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • Generally, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from them, transfer data to them, or both.
  • a computer is not required to have such a device.
  • a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and memory can be supplemented by, or incorporated in, special purpose logic circuitry.

Abstract

The present application provides an image processing method and apparatus, a device, and a storage medium. The method may comprise: obtaining a region image corresponding to a foot in an image to be processed; performing key point detection on the region image by means of a foot key point detection model to obtain two-dimensional position information of a first foot key point of the foot; and determining a three-dimensional pose of the foot in three-dimensional space on the basis of a mapping relationship between the two-dimensional position information and preset position information, in the three-dimensional space, of a second foot key point, corresponding to the first foot key point, in a preset three-dimensional foot model.

Description

Image Processing

Technical Field

The present application relates to the technical field of computer vision, and in particular to image processing.

Background

Human body key point detection technology refers to the technology of using deep learning to extract features from an input image and using the extracted feature maps to locate key points.

However, human body key point detection can only detect key points of the human body; it cannot detect, for the foot, the two-dimensional position information of foot key points from which pose estimation can be performed, so the three-dimensional pose of the foot cannot be obtained.
发明内容Contents of the invention
有鉴于此,第一方面,本申请公开一种图像处理方法。所述方法可以包括:获取待处理图像中与脚部对应的区域图像;利用脚部关键点检测模型,对所述区域图像进行关键点检测,得到所述脚部的第一脚部关键点的二维位置信息;基于预设脚部三维模型中与所述第一脚部关键点对应的第二脚部关键点在三维空间中的预设位置信息和所述二维位置信息之间的映射关系,确定所述脚部在所述三维空间中的三维位姿。In view of this, in the first aspect, the present application discloses an image processing method. The method may include: acquiring a region image corresponding to the foot in the image to be processed; using a foot key point detection model to perform key point detection on the region image to obtain the first foot key point of the foot Two-dimensional position information; based on the mapping between the preset position information in three-dimensional space and the two-dimensional position information of the second key point of the foot corresponding to the key point of the first foot in the preset three-dimensional model of the foot relationship, and determine the three-dimensional pose of the foot in the three-dimensional space.
第二方面,本申请还提出一种图像处理装置,所述装置包括:第一获取模块,用于获取待处理图像中与脚部对应的区域图像;关键点检测模块,用于利用脚部关键点检测模型,对所述区域图像进行关键点检测,得到所述脚部的第一脚部关键点的二维位置信息;确定模块,用于基于预设脚部三维模型中与所述第一脚部关键点对应的第二脚部关键点在三维空间中的预设位置信息和所述二维位置信息之间的映射关系,确定所述脚部在所述三维空间中的三维位姿。In the second aspect, the present application also proposes an image processing device, which includes: a first acquisition module, used to acquire an area image corresponding to the foot in the image to be processed; a key point detection module, used to use the foot key A point detection model, which detects the key points of the region image to obtain the two-dimensional position information of the first foot key point of the foot; the determination module is used to match the first three-dimensional model based on the foot The mapping relationship between the preset position information of the second key point of the foot corresponding to the key point of the foot in the three-dimensional space and the two-dimensional position information determines the three-dimensional pose of the foot in the three-dimensional space.
第三方面,本申请还提出一种电子设备,包括:处理器;用于存储处理器可执行指令的存储器;其中,所述处理器通过运行所述可执行指令以实现如前述任一实施例示出的图像处理方法。In a third aspect, the present application also proposes an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein, the processor executes the executable instructions to implement the out image processing method.
第四方面,本申请还提出一种计算机可读存储介质,所述存储介质存储有计算机程序,所述计算机程序用于使处理器执行如前述任一实施例示出的图像处理方法。In a fourth aspect, the present application also provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is used to cause a processor to execute the image processing method as shown in any one of the foregoing embodiments.
在前述实施例公开的技术方案中,可以利用脚部关键点检测模型,对脚部对应的脚部区域图像进行关键点检测,得到第一脚部关键点的二维位置信息;然后基于预设脚部三维模型中与所述第一脚部关键点对应的第二脚部关键点在三维空间中的预设位置信息和所述二维位置信息之间的映射关系,即可确定所述脚部在所述三维空间中的三维位姿。与人体关键点检测技术相比,可以利用神经网络回归得到可以进行位姿估计的第一脚部关键点的二维位置信息,从而便于后续基于所述映射关系确定所述脚部在所述三维空间中的三维位姿。In the technical solutions disclosed in the aforementioned embodiments, the key point detection model of the foot can be used to detect the key points of the foot area image corresponding to the foot to obtain the two-dimensional position information of the first key point of the foot; and then based on the preset The mapping relationship between the preset position information of the second key point of the foot corresponding to the first key point of the foot in the three-dimensional space and the two-dimensional position information in the three-dimensional model of the foot can determine the The three-dimensional pose of the head in the three-dimensional space. Compared with the key point detection technology of the human body, neural network regression can be used to obtain the two-dimensional position information of the first key point of the foot for pose estimation, so as to facilitate subsequent determination of the position of the foot in the three-dimensional position based on the mapping relationship. 3D pose in space.
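The mapping step above is, in effect, a 2D-3D correspondence problem: given the preset 3D positions of the second foot key points and the detected 2D positions of the first foot key points, find the pose (R, t) whose reprojection best matches the detections. A perspective-n-point (PnP) style solver is a common way to do this (the application does not name a specific solver); the numpy sketch below shows the reprojection relationship such a solver inverts, under an assumed pinhole camera:

```python
import numpy as np

def reproject(model_pts, R, t, f, cx, cy):
    """Project the preset 3D foot-model key points under pose (R, t) into the image."""
    cam = model_pts @ R.T + t
    return np.stack([f * cam[:, 0] / cam[:, 2] + cx,
                     f * cam[:, 1] / cam[:, 2] + cy], axis=1)

def reprojection_error(pts_2d, model_pts, R, t, f, cx, cy):
    """Mean distance between detected 2D key points and reprojected model points.
    A PnP-style solver searches for the (R, t) minimizing this over the
    available 2D-3D correspondences (typically at least four)."""
    return float(np.linalg.norm(pts_2d - reproject(model_pts, R, t, f, cx, cy), axis=1).mean())

# At the true pose, the detected 2D key points coincide with the reprojections.
model_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
true_t = np.array([0.0, 0.0, 5.0])
pts_2d = reproject(model_pts, np.eye(3), true_t, 100.0, 0.0, 0.0)
err = reprojection_error(pts_2d, model_pts, np.eye(3), true_t, 100.0, 0.0, 0.0)
```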
In addition, in the technical solutions described in some of the foregoing embodiments, based on the three-dimensional pose of the foot in the image to be processed, a three-dimensional virtual model of the shoe material can be superimposed on the position corresponding to the foot in the image to be processed, to obtain an augmented reality effect of virtual shoe fitting.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.
Brief Description of the Drawings

In order to more clearly illustrate the technical solutions in one or more embodiments of the present application or in related technologies, the following briefly introduces the drawings needed in the description of the embodiments or related technologies. Obviously, the drawings in the following description are only some of the embodiments described in one or more embodiments of the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a flowchart of an image processing method shown in the present application;

FIG. 2 is a schematic flowchart of a region image acquisition method shown in the present application;

FIG. 3 is a schematic flowchart of a model training method shown in the present application;

FIG. 4 is a schematic flowchart of a foot tracking method shown in the present application;

FIG. 5 is a schematic diagram of a virtual shoe fitting process shown in the present application;

FIG. 6 is a schematic flowchart of a virtual shoe fitting method shown in the present application;

FIG. 7 is a schematic structural diagram of an image processing device shown in the present application;

FIG. 8 is a schematic diagram of a hardware structure of an electronic device shown in the present application.
Detailed Description
下面将详细地结合附图对示例性实施例进行说明。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的设备和方法的例子。Exemplary embodiments will be described in detail below with reference to the accompanying drawings. When the following description refers to the accompanying drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of devices and methods consistent with aspects of the present application as recited in the appended claims.
在本申请使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。在本申请和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指包含一个或多个相关联的列出项目的任何或所有可能组合。还应当理解,本文中所使用的词语“如果”,取决于语境,可以被解释成为“在……时”或“当……时”或“响应于确定”。The terminology used in this application is for the purpose of describing particular embodiments only, and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the", and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. It should also be understood that the word "if", as used herein, could be interpreted as "at" or "when" or "in response to a determination", depending on the context.
本申请涉及增强现实领域,通过获取现实环境中的目标对象的图像信息,进而借助各类视觉相关算法实现对目标对象的相关特征、状态及属性进行检测或识别处理,从而得到与具体应用匹配的虚拟与现实相结合的AR效果。示例性的,目标对象可涉及与人体相关的脸部、肢体、手势、动作等,或者与物体相关的标识物、标志物,或者与场馆或场所相关的沙盘、展示区域或展示物品等。视觉相关算法可涉及视觉定位、SLAM(Simultaneous Localization and Mapping)、三维重建、图像注册、背景分割、对象的关键点提取及跟踪、对象的位姿或深度检测等。具体应用不仅可以涉及跟真实场景或物品相关的导览、导航、讲解、重建、虚拟效果叠加展示等交互场景,还可以涉及与人相关的特效处理,比如妆容美化、肢体美化、特效展示、虚拟模型展示等交互场景。可通过卷积神经网络,实现对目标对象的相关特征、状态及属性进行检测或识别处理。前述卷积神经网络是基于深度学习框架进行模型训练而得到的网络模型。This application relates to the field of augmented reality. By obtaining the image information of the target object in the real environment, and then using various visual related algorithms to detect or identify the relevant characteristics, states and attributes of the target object, so as to obtain the matching specific application. AR effect combining virtual and reality. Exemplarily, the target object may involve faces, limbs, gestures, actions, etc. related to the human body, or markers and markers related to objects, or sand tables, display areas or display items related to venues or places. Vision-related algorithms may involve visual positioning, SLAM (Simultaneous Localization and Mapping), 3D reconstruction, image registration, background segmentation, object key point extraction and tracking, object pose or depth detection, etc. Specific applications can not only involve interactive scenes such as guided tours, navigation, explanations, reconstructions, virtual effect overlays and display related to real scenes or objects, but also special effects processing related to people, such as makeup beautification, body beautification, special effect display, virtual Interactive scenarios such as model display. The relevant features, states and attributes of the target object can be detected or identified through the convolutional neural network. The aforementioned convolutional neural network is a network model obtained through model training based on a deep learning framework.
This application proposes an image processing method. The method may use a foot key point detection model to perform key point detection on a foot region image corresponding to a foot, obtaining two-dimensional position information of first foot key points; then, based on the mapping relationship between that two-dimensional position information and preset position information, in three-dimensional space, of second foot key points in a preset three-dimensional foot model that correspond to the first foot key points, the three-dimensional pose of the foot in the three-dimensional space can be determined. Compared with general human-body key point detection techniques, neural network regression can be used to obtain two-dimensional position information of first foot key points suitable for pose estimation, which facilitates the subsequent determination of the three-dimensional pose of the foot in the three-dimensional space based on the mapping relationship.
A first foot key point is at least one key point in at least one preset foot region of the foot in the image. The preset foot region may be any region of the foot; for example, it may be the big toe region.
A second foot key point refers to a key point, among the foot key points included in the preset three-dimensional foot model, that lies in the same foot region as a first foot key point.
The method can be applied to an electronic device. The electronic device may execute the method by running a software apparatus corresponding to the image processing method. The electronic device may be a notebook computer, a desktop computer, a server, a mobile phone, a tablet (PAD) terminal, or the like; this application does not specifically limit its type. The electronic device may be a client-side or server-side device, and the server side may be a server or cloud service provided by a server, a server cluster, or a distributed server cluster.
Referring to FIG. 1, FIG. 1 is a flowchart of an image processing method shown in this application. As shown in FIG. 1, the method may include S102-S106.
S102: acquire a region image corresponding to a foot in an image to be processed.
The image to be processed may include a foot. The purpose of this application is to capture the foot in the image to be processed and obtain the three-dimensional pose of that foot.
In some embodiments, the image to be processed may be an image transmitted by a user through a client program, so that images uploaded by users can be processed.
In some embodiments, the image to be processed may also be an image collected by image acquisition hardware, for example a camera mounted on a mobile phone terminal. A user can collect a video stream in real time through the camera. The image to be processed may be a still picture or an image in the video stream. In this application, the foot can be captured in the image to be processed in real time, and three-dimensional pose estimation of the foot can be performed.
The foot may be a foot of a human body, or a foot of another animal, a robot, or the like. It may be a left foot or a right foot.
The region image may be the image corresponding to the foot region within the image to be processed.
In some embodiments, when the image to be processed is a still picture or the first frame of a video stream, the area enclosed by the detection box of the foot in the image to be processed may be determined as the region image. When the image to be processed is a non-first frame of the video stream, the method for acquiring the region image is described in subsequent embodiments and is not detailed here.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a region image acquisition method shown in this application.
As shown in FIG. 2, S21-S22 may be executed when performing S102.
S21: use an object detection model to perform object detection on the image to be processed, obtaining an object detection result that includes the detection box of the foot in the image to be processed.
In some embodiments, when performing S21, the image to be processed may be input into a trained object detection model to obtain the foot detection box corresponding to the foot included in the image to be processed.
The object detection model may be a model built on RCNN (Region-based Convolutional Neural Networks), Fast R-CNN, or Faster R-CNN.
In some embodiments, the object detection model can be trained as follows. A number of training samples annotated with the detection box information corresponding to feet are obtained; the model is then trained under supervision on these samples until it converges.
After training is completed, the object detection model can be used to detect feet and foot detection boxes in images.
S22: obtain the region image corresponding to the foot according to the detection box and the image to be processed.
In some embodiments, the foot detection box obtained by object detection and the image to be processed (or a feature map obtained by performing feature extraction on the image to be processed with a backbone network) may be input into a region feature extraction unit to obtain the region image.
The region feature extraction unit may be an ROI Align (Region of Interest Align) unit or an ROI Pooling (Region of Interest Pooling) unit. This unit crops out, from the image to be processed, the region image enclosed by the foot detection box.
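As a minimal sketch of this cropping step (not the patent's implementation; a plain array slice stands in for the ROI Align/Pooling unit, and the box coordinates are assumed to be pixel indices):

```python
import numpy as np

def crop_region(image: np.ndarray, box: tuple) -> np.ndarray:
    """Crop the region enclosed by a detection box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    # Clamp the box to the image bounds before slicing.
    h, w = image.shape[:2]
    x0, x1 = max(0, x0), min(w, x1)
    y0, y1 = max(0, y0), min(h, y1)
    return image[y0:y1, x0:x1]

image = np.zeros((480, 640, 3), dtype=np.uint8)
region = crop_region(image, (100, 200, 300, 400))
print(region.shape)  # (200, 200, 3)
```

Unlike this slice, ROI Align resamples the box to a fixed output size on the feature map, which is why the real pipeline can feed boxes of varying size into a fixed-input network.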
Through the method of S21-S22, deep learning can be used to accurately crop the image to be processed and obtain the region image corresponding to the foot.
S104: use a foot key point detection model to perform key point detection on the region image, obtaining two-dimensional position information of the first foot key points of the foot.
The foot key point detection model includes a neural network model trained on multiple foot region image samples annotated with the position information of the first foot key points.
A first foot key point is at least one key point in at least one preset foot region of the foot in the image. The preset foot region may be any region of the foot; for example, it may be the big toe region.
The number of first foot key points and the preset foot regions can be set according to business requirements. In some embodiments, the first foot key points may include multiple key points on the edge contour of the foot, so that the foot contour can be well characterized by the first foot key points, improving the pose estimation of the foot.
The two-dimensional position information may indicate the two-dimensional position of a first foot key point in the image to be processed; in some embodiments, it may indicate the two-dimensional coordinates of the first foot key point in the image to be processed.
In some embodiments, the number of first foot key points may be no fewer than four. This increases the granularity of the foot key points and enriches the pose information of the foot, improving the pose estimation. In some embodiments, the number of first foot key points may be any number in the range of four to fifteen; by placing key points at the necessary foot positions, the number of first foot key points is kept within a suitable range, which also facilitates the subsequent three-dimensional foot pose detection.
In some embodiments, the first foot key points include key points of at least one of the following regions:
tip of the big toe; medial forefoot joint; medial arch of the foot; medial rear sole; back of the heel; lateral rear sole; lateral forefoot joint; junction of the instep and the leg; medial ankle joint; Achilles tendon area; lateral ankle joint.
In general, points in protruding and/or concave areas of the foot contour can be used as first foot key points, improving the accuracy with which the foot contour is characterized and thereby the pose estimation of the foot.
The foot key point detection model may be a regression or classification model built on a neural network or deep learning network, used to detect the two-dimensional position information of the first foot key points.
Take, as an example, a foot key point detection model that is a regression model built on a deep learning network. In some embodiments, it can be trained with training samples. Referring to FIG. 3, FIG. 3 is a schematic flowchart of a model training method shown in this application. As shown in FIG. 3, S31-S33 may be executed during model training.
S31: obtain multiple training samples. A training sample may be a foot image including a foot, annotated with the two-dimensional position information of the first foot key points.
S32: input the multiple training samples into the foot key point detection model, obtaining the estimated two-dimensional position information corresponding to each training sample.
S33: obtain loss information from the difference between the estimated two-dimensional position information and the pre-annotated two-dimensional position information of the first foot key points of each training sample, and use the loss information to update the parameters of the foot key point detection model by backpropagation.
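A toy sketch of the S31-S33 loop follows, with a single linear layer standing in for the network; the patent does not fix the architecture or the loss, so mean squared error and plain gradient descent are assumed here:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 64))   # S31: 32 samples, 64 features per region image
Y = rng.normal(size=(32, 8))    # annotations: 4 key points x (x, y) = 8 coordinates
W = np.zeros((64, 8))           # parameters of the stand-in linear "model"

for _ in range(200):
    pred = X @ W                          # S32: estimated 2D positions
    loss = ((pred - Y) ** 2).mean()       # S33: MSE between estimate and annotation
    grad = 2 * X.T @ (pred - Y) / Y.size  # gradient of the loss w.r.t. W
    W -= 0.5 * grad                       # parameter update ("backpropagation"
                                          # through the single layer)

print(loss)  # decreases toward zero as the model fits the annotations
```

In a real implementation the forward pass, loss, and update would be handled by a deep learning framework's autograd and optimizer rather than written by hand.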
After the model training is completed, the model is able to predict the two-dimensional position information of the first foot key points. When performing S104, the region image obtained in S102 can be input into the trained foot key point detection model to obtain the two-dimensional position information of the first foot key points of the foot.
S106: based on the mapping relationship between the two-dimensional position information and the preset position information, in three-dimensional space, of the second foot key points in the preset three-dimensional foot model that correspond to the first foot key points, determine the three-dimensional pose of the foot in the three-dimensional space.
The three-dimensional space is the space into which the foot is to be projected; it is usually set according to business requirements.
The preset three-dimensional foot model may be a three-dimensional model maintained in advance according to business requirements, and may include multiple foot key points.
A second foot key point is at least one key point, among the foot key points included in the preset three-dimensional foot model, in a foot region that is the same as one of the at least one foot region of the first foot key points. The foot region may be any region of the foot, for example the big toe region.
The preset position information may indicate the three-dimensional position of a second foot key point in the three-dimensional space when the preset three-dimensional foot model is in a standard pose. In some embodiments, the preset position information may be the three-dimensional coordinates of the second foot key point in the three-dimensional space in that standard pose.
The standard pose may be preset according to business requirements. In some embodiments, the pose at the origin of the three-dimensional coordinate system corresponding to the three-dimensional space and perpendicular to the plane formed by the X and Y axes of that coordinate system may be taken as the standard pose. The three-dimensional coordinates of the second foot key points when the three-dimensional foot model is in this standard pose are the preset position information.
The three-dimensional pose indicates the posture, in the three-dimensional space, of the foot in the image to be processed. In some embodiments, the three-dimensional pose may include the translation and rotation of the foot along the X, Y, and Z axes of the three-dimensional coordinate system.
When performing S106, a pose estimation algorithm can be applied to the mapping relationship between the two-dimensional position information of the first foot key points obtained in S104 (usually two-dimensional coordinates) and the preset position information, estimating the pose of the foot and obtaining its three-dimensional pose. The mapping algorithm may be PnP (Perspective-n-Point) or a similar algorithm; this application does not specifically limit the algorithm used to solve the mapping relationship.
Take the PnP algorithm as an example. It has two inputs: the two-dimensional coordinates of the first foot key points in the image to be processed, and the three-dimensional coordinates of the second foot key points in the three-dimensional space. Based on the mapping between the two-dimensional coordinates of each first foot key point and the three-dimensional coordinates of the corresponding second foot key point, the algorithm yields the three-dimensional pose of the foot in the three-dimensional space.
After the two-dimensional coordinates of the first foot key points are obtained, the three-dimensional coordinates of the second foot key points in the three-dimensional space can be acquired. The three-dimensional coordinates and the two-dimensional coordinates are then substituted into the solving formula of the PnP algorithm to obtain the three-dimensional pose of the foot in the three-dimensional space.
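To make the 2D-3D correspondence concrete, the sketch below projects model key points under a known pose with a pinhole camera; PnP solves the inverse problem, recovering the rotation and translation from such (2D, 3D) pairs. In practice a library routine such as OpenCV's solvePnP would be used; all numbers here are illustrative:

```python
import numpy as np

# Second foot key points: 3D coordinates in the model's standard pose (illustrative).
model_points = np.array([[0.0, 0.0, 0.0],
                         [0.1, 0.0, 0.0],
                         [0.0, 0.2, 0.0],
                         [0.0, 0.0, 0.3]])

# A known pose: identity rotation plus a translation of 2 units along Z (depth).
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])

# Pinhole intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

cam = (R @ model_points.T).T + t     # model space -> camera space
proj = (K @ cam.T).T
pixels = proj[:, :2] / proj[:, 2:3]  # perspective divide -> 2D pixel coordinates

print(pixels[0])  # the model origin lands at the principal point: [320. 240.]
```

PnP takes `model_points` and `pixels` (plus `K`) and returns `R` and `t`; at least four non-degenerate correspondences are typically needed for a stable solution, which matches the "no fewer than four" key points discussed above.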
In the solution proposed in the foregoing embodiments, a foot key point detection model can be used to perform key point detection on the foot region image corresponding to the foot, obtaining the two-dimensional position information of the first foot key points; then, based on the mapping relationship between that two-dimensional position information and the preset position information, in three-dimensional space, of the second foot key points in the preset three-dimensional foot model that correspond to the first foot key points, the three-dimensional pose of the foot in the three-dimensional space can be determined. Compared with general human-body key point detection techniques, neural network regression can be used to obtain two-dimensional position information of first foot key points suitable for pose estimation, facilitating the subsequent determination of the three-dimensional pose of the foot in the three-dimensional space based on the mapping relationship.
In some embodiments, the image processing method shown in this application can also identify the type of the foot, i.e., whether it is a left foot or a right foot. In this case, the object detection result obtained through S21 includes, in addition to the detection box of the foot in the image to be processed, the type of the foot.
In some embodiments, when performing S21, an object detection model may be used to perform object detection on the image to be processed, obtaining the detection box of the foot included in the image to be processed together with the type of the foot. The object detection model can thus distinguish left and right feet while detecting the foot detection box in the image to be processed.
The object detection model includes a neural network model trained on multiple training samples annotated with the detection boxes and type information corresponding to feet; the type indicates whether the foot is a left foot or a right foot.
The object detection model may be a model built on RCNN (Region-based Convolutional Neural Networks), Fast R-CNN, or Faster R-CNN.
In some embodiments, the object detection model can be trained as follows. A number of training samples annotated with the detection box information and type information (i.e., left foot or right foot) corresponding to feet are obtained; the model is then trained under supervision on these samples until it converges.
After training is completed, the object detection model can be used to detect the foot detection box and the foot type in images.
In some embodiments, after S102, S103 may also be executed: in response to the foot being of a preset type, flip the region image so that the feet in all region images input into the foot key point detection model are of the same type (i.e., all left feet or all right feet), which facilitates processing by the foot key point detection model.
The preset type can be set according to business requirements. In some embodiments, the foot key point detection model may be trained on left-foot samples, in which case the preset type can be set to right foot. When performing S103, if the foot is recognized as a right foot, the region image corresponding to the foot is flipped so that the foot in the region image becomes left-foot type, which facilitates processing by the foot key point detection model.
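The flip in S103 can be a simple horizontal mirror of the region image. A minimal sketch (the patent does not fix the flip operation; the function name and the string-valued foot types are illustrative):

```python
import numpy as np

def normalize_foot(region: np.ndarray, foot_type: str,
                   preset_type: str = "right") -> np.ndarray:
    """Mirror the region image horizontally when the detected foot is the
    preset type, so every image fed to the key point model shows the same
    foot type (here: everything becomes left-foot-like)."""
    if foot_type == preset_type:
        return region[:, ::-1]  # horizontal flip: reverse the column axis
    return region

region = np.arange(6).reshape(2, 3)  # stand-in 2x3 "image"
flipped = normalize_foot(region, "right")
print(flipped.tolist())  # [[2, 1, 0], [5, 4, 3]]
```

Note that after flipping, the predicted key point x-coordinates would need to be mirrored back (x -> width - 1 - x) before the pose estimation of S106 if the pose is to be expressed in the original image frame.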
In some embodiments, image processing needs to be performed on the images of a captured video stream. To capture the same foot across the video stream, foot tracking is required. In some embodiments, an object detection model can be used to perform object detection on each frame of the video stream, obtaining the foot detection boxes in the stream, and the same foot can then be identified from the positions of the detected boxes for tracking.
It is not hard to see that this foot tracking approach requires object detection on every frame. Since the object detection model has a relatively complex structure, a large amount of computation, and high overhead, its tracking efficiency is low, which may make the image processing method of this application perform poorly in real time.
To solve this problem, in some embodiments, the fact that the position of a foot does not change significantly between adjacent frames can be exploited to track the foot in the video stream, reducing the computation brought by tracking with the object detection model, lowering overhead, and improving tracking efficiency, thereby improving the real-time performance of the image processing method.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of a foot tracking method shown in this application. As shown in FIG. 4, S41-S43 may be executed to implement foot tracking.
S41: acquire the region image corresponding to the foot in the image to be processed.
The image to be processed is an image in a video stream; it may be the first frame of the stream or an image after the first frame.
If the image to be processed is the first frame of the video stream, the region image can be acquired through steps S21-S22.
If the image to be processed is an image after the first frame, then when performing S41, the first foot key point position information of the foot in the frame preceding the image to be processed in the video stream may be acquired, and a foot key point box is determined from the acquired position information. Based on the foot key point box and the image to be processed, the region image corresponding to the foot is obtained.
This yields the region of the image to be processed at the same position as the foot in the previous frame.
The foot key point box may be a key point box of any shape. In some embodiments, it may be a rectangular box. When performing S41, the cached two-dimensional position information of the first foot key points in the frame preceding the image to be processed can be acquired. From this two-dimensional position information, the maximum coordinates x0, y0 and the minimum coordinates x1, y1 along the X and Y axes can be determined. The rectangular box with the four vertices (x0, y0), (x0, y1), (x1, y0), and (x1, y1) can then be used as the foot key point box.
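A minimal sketch of deriving the rectangular key point box from the previous frame's key points (coordinate names follow the text above, where (x0, y0) are the per-axis maxima and (x1, y1) the minima; the sample coordinates are illustrative):

```python
import numpy as np

def keypoint_box(points: np.ndarray) -> tuple:
    """Axis-aligned box over 2D key points: returns (x0, y0, x1, y1),
    with (x0, y0) the per-axis maxima and (x1, y1) the minima."""
    x0, y0 = points.max(axis=0)
    x1, y1 = points.min(axis=0)
    return (int(x0), int(y0), int(x1), int(y1))

# First foot key points from the previous frame (illustrative pixel coordinates).
points = np.array([[120, 310], [180, 290], [160, 350], [135, 335]])
print(keypoint_box(points))  # (180, 350, 120, 290)
```

In practice this box would often be expanded by a small margin before cropping, since the foot may have shifted slightly between the two frames.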
After the foot key point box is obtained, the foot key point box and the image to be processed (or a feature map obtained by performing feature extraction on the image to be processed with a backbone network) may be input into the region feature extraction unit to obtain the region image.
The region feature extraction unit may be an ROI Align (Region of Interest Align) unit or an ROI Pooling (Region of Interest Pooling) unit. This unit crops out, from the image to be processed, the region image enclosed by the foot key point box.
In some embodiments, after the region image corresponding to the foot is obtained based on the foot key point box and the image to be processed, the type of the foot in the previous frame, stored when object detection or classification was performed on the region image of that previous frame, may also be acquired; in response to the type of the foot in the previous frame being the preset type, the region image is flipped so that the feet in all region images input into the foot key point detection model are of the same type (i.e., all left feet or all right feet), facilitating processing by the foot key point detection model.
S42: use an image classification model to classify the region image, obtaining a classification result for the region image.
The classification result indicates whether the region image is a foot image.
The image classification model may include a convolutional neural network. To train it, multiple image samples annotated with image classification information can be obtained, where the classification information indicates whether the sample is a foot image; the model can then be trained under supervision on these samples.
After training is completed, the image classification model can be used to classify the region image and obtain the classification result.
S43: in response to the classification result indicating that the region image is a foot image, determine that the foot in the region image and the foot in the previous frame are the same foot, so as to perform foot tracking.
Since the region image obtained in S41 is the region of the image to be processed at the same position as the foot in the previous frame, a classification result indicating a foot image means that a foot is also present in the image to be processed at the same position as in the previous frame. Given that the position of the same foot does not change significantly over a few adjacent frames, it can be determined that the foot in the region image and the foot in the previous frame are the same foot. This realizes foot tracking in the video stream, reduces the computation brought by tracking with the object detection model, lowers overhead, and improves tracking efficiency, thereby improving the real-time performance of the image processing method.
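The S41-S43 control flow can be sketched as follows; all function names and signatures here are hypothetical, and the detector, classifier, and key point model are stubbed out as callables so the saving from skipping per-frame detection is visible:

```python
def track_frame(frame, prev_box, detect, classify, predict_box):
    """One tracking step: reuse the previous frame's key point box when the
    classifier confirms a foot there (S42-S43); otherwise fall back to full
    object detection (S21-S22)."""
    if prev_box is None or not classify(frame, prev_box):
        box = detect(frame)         # first frame, or tracking failure
    else:
        box = prev_box              # same foot: keep tracking, no detection
    return predict_box(frame, box)  # S104 runs on this region; its key points
                                    # yield the box used for the next frame

# Stub models: each "frame" is a dict saying whether a foot is still present
# at the tracked position.
detect_calls = []
detect = lambda f: (detect_calls.append(f) or "detector_box")
classify = lambda f, b: f["foot_at_prev_pos"]
predict_box = lambda f, b: b

frames = [{"foot_at_prev_pos": False},  # first frame -> detector runs
          {"foot_at_prev_pos": True},   # tracked, detector skipped
          {"foot_at_prev_pos": True},   # tracked, detector skipped
          {"foot_at_prev_pos": False}]  # tracking lost -> detector runs again
box = None
for frame in frames:
    box = track_frame(frame, box, detect, classify, predict_box)
print(len(detect_calls))  # 2: the heavy detector ran twice, not once per frame
```

The lightweight classifier thus gates the expensive detector, which is the source of the efficiency gain described above.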
在脚部跟踪的过程中,还可以继续执行S104,利用脚部关键点检测模型,对所述区域图像进行关键点检测,得到所述脚部的第一脚部关键点的二维位置信息。在此不对S104进行详细说明。In the process of foot tracking, S104 may be continued to perform key point detection on the region image by using the foot key point detection model to obtain two-dimensional position information of the first foot key point of the foot. S104 will not be described in detail here.
完成脚部跟踪后,在执行S106时,可以响应于确定所述区域图像中的脚部与所述前一帧图像中的脚部为同一脚部,基于预设脚部三维模型中与所述第一脚部关键点对应的第二脚部关键点在三维空间中的预设位置信息和所述二维位置信息之间的映射关系,确定所述脚部在所述三维空间中的三维位姿。在此不对位姿估计的步骤信息详述。After the foot tracking is completed, when executing S106, it may be determined that the foot in the region image is the same foot as the foot in the previous frame image, based on the preset three-dimensional model of the foot and the foot in the The mapping relationship between the preset position information of the second foot key point corresponding to the first foot key point in the three-dimensional space and the two-dimensional position information determines the three-dimensional position of the foot in the three-dimensional space posture. The step information of pose estimation is not described in detail here.
如果所述分类结果指示所述区域图像不是脚部图像,则可以确定脚部跟踪失败,即可将所述待处理图像作为视频中跟踪失败的图像,利用S21-S22的步骤,获取区域图像,然后通过S104与S106,得到脚部三维位姿。If the classification result indicates that the area image is not a foot image, it can be determined that the foot tracking has failed, that is, the image to be processed can be used as an image of a video tracking failure, and the area image is acquired by using the steps of S21-S22, Then through S104 and S106, the three-dimensional pose of the foot is obtained.
In some embodiments, the image classification model and the foot key point detection model share a feature extraction network, for example a backbone network.
The image classification model and the foot key point detection model can be trained jointly. Joint training improves the feature extraction network's ability to extract features, so that feature information useful for both classification and key point detection is extracted, improving the classification and key point detection results.
In some embodiments, first image samples annotated with image classification information and second image samples annotated with first foot key point position information can be obtained. A first image sample and a second image sample may be the same image. The classification information indicates whether the first image sample is a foot image.
The first image sample can then be input into the image classification model to obtain a classification prediction result, and first loss information is obtained from the classification prediction result and the annotated image classification information. Likewise, the second image sample is input into the foot key point detection model to obtain a first foot key point position prediction result, and second loss information is obtained from the predicted and annotated first foot key point positions.
The model parameters of the image classification model and the foot key point detection model can then be adjusted based on the first loss information and the second loss information. In some embodiments, the foregoing training steps are iterated until both models converge.
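The two-loss scheme above can be sketched as follows. This is an illustrative NumPy toy model, not the patented implementation: a shared layer stands in for the feature extraction network, a classification head and a key point head consume the same features, and the first and second losses are combined (the weights are assumed) so that one parameter update drives the shared network from both tasks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared "feature extraction network" (a single tanh layer here, purely
# for illustration; in the patent this is a neural network backbone).
W_shared = rng.normal(size=(8, 4))
W_cls = rng.normal(size=(4, 1))   # classification branch head
W_kpt = rng.normal(size=(4, 2))   # key point regression branch head

def forward(x):
    feat = np.tanh(x @ W_shared)  # features shared by both branches
    return feat @ W_cls, feat @ W_kpt

def joint_loss(cls_loss, kpt_loss, w_cls=1.0, w_kpt=1.0):
    # First loss (classification) and second loss (key points) combined;
    # a gradient step on this sum updates W_shared through both branches
    # at once. The weights are illustrative assumptions.
    return w_cls * cls_loss + w_kpt * kpt_loss

logit, kpts = forward(rng.normal(size=(3, 8)))
assert logit.shape == (3, 1) and kpts.shape == (3, 2)
```

Because both heads read the same feature tensor, either loss term backpropagates into the shared parameters, which is the mechanism behind the mutual reinforcement described in the next paragraph.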
Because the image classification model and the foot key point detection model share a feature extraction network, training either model affects the training of the other, so the two training processes complement and reinforce each other, realizing joint training of the two models. Through this joint training, on the one hand, the feature extraction network's ability to extract feature information useful for classification and key point detection is improved, improving the classification and key point detection results; on the other hand, the mutual reinforcement between the two training processes improves training efficiency.
In some embodiments, after S106 is completed and the three-dimensional pose of the foot in the image to be processed has been obtained, virtual shoe fitting may be performed.
In some embodiments, a three-dimensional virtual model of the shoe material is obtained first. Then, based on the three-dimensional pose of the foot in the image to be processed, the three-dimensional virtual model is superimposed at the position in the image corresponding to the foot, producing an augmented reality effect of virtual shoe fitting.
The three-dimensional virtual model may be used to indicate the outline and/or texture colors of the shoe material. In some embodiments, the three-dimensional virtual model includes the three-dimensional coordinates, in three-dimensional space, of each vertex of the shoe material, together with the pixel value of each vertex.
By superimposing the three-dimensional virtual model at the position corresponding to the foot, the shoe material can be displayed in the image to be processed, achieving the augmented reality effect of virtual shoe fitting.
Referring to FIG. 5, FIG. 5 is a schematic diagram of a virtual shoe fitting process shown in the present application.
As shown in FIG. 5, the method may include S51-S54.
S51: Obtain the initial pose, in three-dimensional space, of the three-dimensional virtual model corresponding to the shoe material.
The initial pose indicates the attitude information of the shoe material in three-dimensional space. In some embodiments, the initial pose of the shoe material in three-dimensional space may be maintained in advance in a database; performing S51 then amounts to fetching the initial pose from that database.
S52: Based on the three-dimensional pose of the foot, transform the initial pose to match the three-dimensional pose corresponding to the foot, obtaining a transformed three-dimensional pose.
Pose transformation information for the initial pose can be determined from the three-dimensional pose of the foot. The pose transformation information may indicate the amounts of translation and rotation of the initial pose with respect to the X, Y and Z axes of the three-dimensional coordinate system.
The transformed three-dimensional pose may indicate the pose information of the shoe material after it has been transformed to the pose of the foot.
In some embodiments, the pose transformation information includes translation and rotation information. When S52 is performed, the initial pose may be translated and rotated according to the pose transformation information, yielding the pose information of the shoe material after it has been translated and rotated to the pose of the foot.
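As an illustrative sketch of the translate-and-rotate step (the rotation matrix and translation vector below stand in for the pose transformation information; the patent does not fix a particular representation):

```python
import numpy as np

def apply_pose(R, t, vertices):
    # R: 3x3 rotation matrix, t: translation vector, both derived from
    # the pose transformation information; vertices: Nx3 vertex
    # coordinates of the shoe material at its initial pose.
    return vertices @ R.T + t

# Example: a 90-degree rotation about the Z axis, then a translation.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
t = np.array([0.0, 0.0, 2.0])
moved = apply_pose(Rz, t, np.array([[1.0, 0.0, 0.0]]))
assert np.allclose(moved, [[0.0, 1.0, 2.0]])
```

Applying the same rigid transform to every vertex moves the whole shoe model into the pose of the foot.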
S53: Based on the transformed three-dimensional pose, map the three-dimensional virtual model onto the image to be processed to obtain a two-dimensional virtual material corresponding to the shoe material.
When S53 is performed, the three-dimensional coordinates of the vertices of the three-dimensional virtual model may first be adjusted according to the transformed three-dimensional pose, giving an adjusted three-dimensional virtual model. A projection algorithm then projects the three-dimensional coordinates of the adjusted vertices onto the two-dimensional plane in which the image to be processed lies, giving the two-dimensional coordinates of each vertex. The shape of the two-dimensional virtual material corresponding to the shoe material is determined from these two-dimensional coordinates, and its texture and colors are determined from the pixel values of the vertices (taken from the vertices of the three-dimensional virtual model), thereby obtaining the two-dimensional virtual material.
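The projection step can be illustrated with a pinhole camera model, one common choice of projection algorithm (the intrinsic matrix K below is an assumed example, not a value from the patent):

```python
import numpy as np

def project(points_3d, K):
    # Perspective projection of camera-space vertices onto the image
    # plane: multiply by the intrinsics K, then divide by depth.
    p = points_3d @ K.T
    return p[:, :2] / p[:, 2:3]

K = np.array([[500.0,   0.0, 320.0],   # fx, 0, cx  (assumed values)
              [  0.0, 500.0, 240.0],   # 0, fy, cy
              [  0.0,   0.0,   1.0]])
# A vertex on the optical axis projects to the principal point.
uv = project(np.array([[0.0, 0.0, 1.0]]), K)
assert np.allclose(uv, [[320.0, 240.0]])
```

The projected 2D coordinates give the shape of the two-dimensional virtual material, while each vertex carries its pixel value along from the three-dimensional model.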
S54: Fuse the two-dimensional virtual material with the position corresponding to the foot in the image, obtaining a fused image for displaying the AR effect of virtual shoe fitting.
In some embodiments, image fusion can be completed by covering the pixels at the position corresponding to the foot with the pixels of the two-dimensional virtual material, or by adjusting the transparency of the pixels at that position, among other approaches. Outputting the fused image then displays the AR effect of virtual shoe fitting.
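Both fusion strategies mentioned above (covering the foot pixels outright, or adjusting their transparency) are special cases of per-pixel alpha compositing; a minimal sketch with assumed array shapes:

```python
import numpy as np

def blend(background, overlay, alpha):
    # Per-pixel alpha compositing: alpha=1 fully covers the foot pixels
    # with the rendered shoe material; alpha<1 leaves them translucent.
    a = alpha[..., None]  # broadcast (H, W) alpha over the color channels
    return a * overlay + (1.0 - a) * background

bg = np.zeros((2, 2, 3))          # stand-in for the image to be processed
fg = np.full((2, 2, 3), 255.0)    # stand-in for the 2D virtual material
a = np.array([[1.0, 0.0],
              [0.5, 0.5]])
out = blend(bg, fg, a)
assert np.allclose(out[0, 0], 255.0) and np.allclose(out[0, 1], 0.0)
```

With alpha set to 1 wherever the shoe material is opaque, this reduces to the pixel-covering scheme; intermediate alpha values implement the transparency adjustment.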
An embodiment is described below in the context of a virtual shoe fitting scenario.
In a virtual shoe fitting scenario, shoes in three-dimensional space can be fused (for example, by image rendering) with the feet captured in a video stream, completing the virtual shoe fitting.
A virtual shoe fitting client may run on a mobile terminal. The mobile terminal may be equipped with a camera for capturing a video stream in real time.
A virtual shoe library may be installed locally on the mobile terminal (hereinafter, the terminal) or on a server corresponding to the virtual shoe fitting client (hereinafter, the client). The virtual shoe library may include three-dimensional virtual models of various developed shoe materials, and may be any type of database.
In the virtual shoe fitting client, the user can select shoes to try on from the virtual shoe library and capture a foot video stream through the camera.
Referring to FIG. 6, FIG. 6 is a schematic flowchart of a virtual shoe fitting method shown in the present application. As shown in FIG. 6, the method may include S601-S611.
S601 may be performed on a first image in the video stream, where the first image is the first frame of the video stream or an image following a foot tracking failure determined according to steps S607-S610. S601: Use a pre-trained object detection model to perform object detection on the first image, obtaining the foot detection box in the first image and the type of the foot, i.e., whether the foot is a left foot or a right foot. In the following, the foot is assumed to be the right foot.
S602: According to the foot detection box, obtain a first region image of the foot in the first image.
S603: In response to the foot being a right foot, flip the first region image so that it becomes a left-foot image, which is convenient for the foot key point detection model to detect. Here, the foot key point detection model is trained on left-foot image samples annotated with the position information of the first foot key points.
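The flip in S603 is a simple left-right mirroring of the cropped region image; for example:

```python
import numpy as np

def flip_lr(region):
    # Mirror the region image horizontally so that a right-foot crop
    # looks like a left-foot image to the left-foot-trained key point
    # detection model.
    return region[:, ::-1]

img = np.array([[1, 2, 3],
                [4, 5, 6]])
assert flip_lr(img).tolist() == [[3, 2, 1], [6, 5, 4]]
```

In practice the detected key point x-coordinates would then be mirrored back into the original crop's coordinate frame; the patent leaves this detail implicit.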
The first foot key points include key points of at least some of the following foot regions: the tip of the big toe; the medial forefoot joint; the medial arch; the medial rear sole; the back of the heel; the lateral rear sole; the lateral forefoot joint; the junction of the instep and the leg; the medial ankle joint; the Achilles tendon; and the lateral ankle joint. These key points describe the foot contour in fine-grained detail and carry a large amount of three-dimensional information about the foot, improving the accuracy of pose estimation.
S604: Obtain the position information of the first foot key points of the foot using the foot key point detection model.
S605: From the position information of the first foot key points, obtain the three-dimensional pose of the foot in three-dimensional space using the PnP (Perspective-n-Point) algorithm.
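The PnP step in S605 can be illustrated as follows (a NumPy sketch; in practice a solver such as OpenCV's `cv2.solvePnP` performs the minimization). PnP seeks the rotation R and translation t that minimize the reprojection error between the projected model key points (the second foot key points of the preset three-dimensional foot model) and the detected two-dimensional key points; the camera intrinsics K below are assumed:

```python
import numpy as np

def reprojection_residual(R, t, K, pts3d, pts2d):
    # PnP seeks the (R, t) minimizing this residual between the
    # projected 3D foot-model key points and the detected 2D key points.
    cam = pts3d @ R.T + t
    proj = cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3]
    return np.abs(uv - pts2d).max()

K = np.array([[400.0,   0.0, 160.0],
              [  0.0, 400.0, 120.0],
              [  0.0,   0.0,   1.0]])   # assumed intrinsics
R_true, t_true = np.eye(3), np.array([0.0, 0.0, 3.0])
pts3d = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                  [0.0, 0.1, 0.0], [0.1, 0.1, 0.1]])
# Synthesize the 2D observations from the ground-truth pose.
obs = (pts3d @ R_true.T + t_true) @ K.T
pts2d = obs[:, :2] / obs[:, 2:3]
# The ground-truth pose reproduces the observed key points exactly.
assert reprojection_residual(R_true, t_true, K, pts3d, pts2d) < 1e-9
```

This also shows why at least four well-spread key points are required: with fewer correspondences the pose is not uniquely constrained.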
S606: Obtain from the virtual shoe library the three-dimensional virtual model of the shoe corresponding to the right foot and, according to the obtained three-dimensional virtual model and the three-dimensional pose of the foot, fuse the shoe with the foot to display the virtual shoe fitting effect. For details of S606, refer to S51-S54; they are not repeated here.
S607 may be performed on a second image in the video stream, where the second image may be any non-first frame after the first frame of the video stream. S607: Obtain the first foot key point position information of the frame preceding the second image, and determine a foot key point box based on that position information.
S608: According to the foot key point box, crop a second region image out of the second image.
S609: In response to the foot being a right foot, flip the second region image so that it becomes a left-foot image, which is convenient for the foot tracking model to detect. Here, the foot tracking model includes a classification branch and a key point detection branch, and is obtained by joint training on left-foot image samples annotated with first foot key point position information and left-foot image samples annotated with image classification information.
S610: Determine whether the second region image is a foot image according to the classification branch of the pre-trained foot tracking model, and determine the position information of the first foot key points of the foot according to the key point branch of the pre-trained foot tracking model.
If, based on the image classification information, the second region image is judged to be a foot image, the foot in the second image can be determined to be the foot that appeared in the preceding frame, completing foot tracking; S605 and S606 are then performed to display the virtual shoe fitting effect.
If, based on the image classification information, the second region image is judged not to be a foot image, foot tracking has failed; the second image can then be treated as the first image and S601-S606 performed to obtain the three-dimensional pose of the foot in that image.
In the foregoing solution, on the one hand, the foot key point detection model can be used to regress the position information of the first foot key points; the three-dimensional pose of the foot can then be obtained from the mapping of the first foot key points from two-dimensional to three-dimensional space; afterwards, based on the three-dimensional pose of the foot in the image to be processed, the three-dimensional virtual model corresponding to the shoe is superimposed at the position corresponding to the foot in the first or second image, producing the augmented reality effect of virtual shoe fitting.
On the other hand, exploiting the fact that the position of the foot does not change significantly between adjacent frames, the foot can be tracked in the video stream, reducing the computation incurred by tracking with an object detection model, lowering overhead, and improving tracking efficiency, thereby improving the real-time performance of the virtual shoe fitting method.
Corresponding to the foregoing embodiments, the present application proposes an image processing apparatus 70.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of an image processing apparatus shown in the present application.
As shown in FIG. 7, the apparatus 70 may include:
a first obtaining module 71, configured to obtain a region image corresponding to a foot in an image to be processed;
a key point detection module 72, configured to perform key point detection on the region image using a foot key point detection model to obtain two-dimensional position information of first foot key points of the foot; and
a determining module 73, configured to determine the three-dimensional pose of the foot in three-dimensional space based on a mapping between the two-dimensional position information and preset position information, in three-dimensional space, of second foot key points in a preset three-dimensional foot model corresponding to the first foot key points.
In some embodiments, the first obtaining module 71 is specifically configured to:
perform object detection on the image to be processed using an object detection model to obtain an object detection result, the object detection result including a detection box of the foot in the image to be processed; and
obtain the region image corresponding to the foot according to the detection box and the image to be processed.
In some embodiments, the object detection result further includes the type of the foot, the type indicating whether the foot is a left foot or a right foot.
The apparatus 70 further includes:
a first flipping module, configured to flip the region image in response to the foot being of a preset type, so that the type of foot is consistent across all region images input to the foot key point detection model.
In some embodiments, the first obtaining module 71 is configured to:
obtain the first foot key point position information of the foot in the frame of the video stream preceding the image to be processed, and determine a foot key point box according to the obtained position information; and
obtain the region image corresponding to the foot based on the foot key point box and the image to be processed.
In some embodiments, the apparatus 70 further includes:
a second obtaining module, configured to obtain the stored type of the foot in the preceding frame; and
a second flipping module, configured to flip the region image in response to the type of the foot in the preceding frame being a preset type, so that the type of foot is consistent across all region images input to the foot key point detection model.
In some embodiments, the image to be processed is an image in a video stream, the video stream further including the frame preceding the image to be processed; the region image of the image to be processed is determined according to a foot key point box determined from the first foot key points in the preceding frame.
The apparatus 70 further includes:
a tracking module, configured to classify the region image using an image classification model to obtain a classification result of the region image, the classification result indicating whether the region image is a foot image; and
in response to the classification result indicating that the region image is a foot image, determine that the foot in the region image and the foot in the preceding frame are the same foot, so as to perform foot tracking.
In some embodiments, the determining module 73 is specifically configured to:
in response to determining that the foot in the region image and the foot in the preceding frame are the same foot, determine the three-dimensional pose of the foot in three-dimensional space based on the mapping between the two-dimensional position information and the preset position information, in three-dimensional space, of the second foot key points in the preset three-dimensional foot model corresponding to the first foot key points.
In some embodiments, the image classification model and the foot key point detection model share a feature extraction network, and the apparatus 70 further includes:
a joint training module for the image classification model and the foot key point detection model, configured to obtain first image samples annotated with image classification information and second image samples annotated with first foot key point position information;
input the first image sample into the image classification model to obtain a classification prediction result, and obtain first loss information according to the classification prediction result and the annotated image classification information;
input the second image sample into the foot key point detection model to obtain a first foot key point position prediction result, and obtain second loss information according to the first foot key point position prediction result and the annotated first foot key point position information; and
adjust the model parameters of the image classification model and the foot key point detection model based on the first loss information and the second loss information.
In some embodiments, the first foot key points include a plurality of key points on the edge contour of the foot, and the number of first foot key points is no fewer than four.
In some embodiments, the first foot key points include key points of at least one of the following regions:
the tip of the big toe; the medial forefoot joint; the medial arch; the medial rear sole; the back of the heel; the lateral rear sole; the lateral forefoot joint; the junction of the instep and the leg; the medial ankle joint; the Achilles tendon; and the lateral ankle joint.
In some embodiments, the apparatus 70 further includes:
a virtual shoe fitting module, configured to obtain a three-dimensional virtual model of the shoe material; and
superimpose the three-dimensional virtual model at the position corresponding to the foot in the image to be processed, based on the three-dimensional pose of the foot in the image to be processed, obtaining an augmented reality effect of virtual shoe fitting.
In some embodiments, the virtual shoe fitting module is specifically configured to:
obtain the initial pose, in three-dimensional space, of the three-dimensional virtual model corresponding to the shoe material;
transform the initial pose, based on the three-dimensional pose, to match the three-dimensional pose corresponding to the foot, obtaining a transformed three-dimensional pose;
map the three-dimensional virtual model onto the image to be processed based on the transformed three-dimensional pose, obtaining a two-dimensional virtual material corresponding to the shoe material; and
fuse the two-dimensional virtual material with the position corresponding to the foot, obtaining a fused image for displaying the AR effect of virtual shoe fitting.
Embodiments of the image processing apparatus shown in this application can be applied to an electronic device. Accordingly, the present application discloses an electronic device, which may include: a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to call the executable instructions stored in the memory to implement the image processing method shown in any of the foregoing embodiments.
Referring to FIG. 8, FIG. 8 is a schematic diagram of the hardware structure of an electronic device shown in the present application.
As shown in FIG. 8, the electronic device may include a processor for executing instructions, a network interface for network connections, a memory for storing runtime data for the processor, and a non-volatile memory for storing instructions corresponding to the image processing apparatus.
The apparatus embodiments may be implemented in software, in hardware, or in a combination of software and hardware. Taking software implementation as an example, the apparatus, as a logical entity, is formed by the processor of the electronic device in which it resides reading the corresponding computer program instructions from the non-volatile memory into memory and running them. At the hardware level, in addition to the processor, memory, network interface and non-volatile memory shown in FIG. 8, the electronic device in which the apparatus resides may, depending on its actual functions, include other hardware, which is not detailed here.
It can be understood that, to increase processing speed, the instructions corresponding to the apparatus may also be stored directly in memory, which is not limited herein.
The present application proposes a computer-readable storage medium storing a computer program, the computer program being usable to cause a processor to execute the image processing method shown in any of the foregoing embodiments.
Those skilled in the art should understand that one or more embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, optical storage, and the like) containing computer-usable program code.
"And/or" as used in this application means at least one of the two; for example, "A and/or B" covers three cases: A, B, and both A and B.
The embodiments in this application are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the data processing device embodiment is described relatively simply because it is substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment.
Specific embodiments of the present application have been described above. Other implementations are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this application may be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this application and their structural equivalents, or in a combination of one or more of these. Embodiments of the subject matter described in this application may be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical or electromagnetic signal, generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random- or serial-access memory device, or a combination of one or more of these.
The processes and logic flows described in this application may be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows may also be performed by special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus may also be implemented as such circuitry.
适合用于执行计算机程序的计算机包括,例如通用和/或专用微处理器,或任何其他类型的中央处理系统。通常,中央处理系统将从只读存储器和/或随机存取存储器接收指令和数据。计算机的基本组件包括用于实施或执行指令的中央处理系统以及用于存储指令和数据的一个或多个存储器设备。通常,计算机还将包括用于存储数据的一个或多个大容量存储设备,例如磁盘、磁光盘或光盘等,或者计算机将可操作地与此大容量存储设备耦接以从其接收数据或向其传送数据,抑或两种情况兼而有之。然而,计算机不是必须具有这样的设备。此外,计算机可以嵌入在另一设备中,例如移动电话、个人数字助理(PDA)、移动音频或视频播放器、游戏操纵台、全球定位系统(GPS)接收机、或例如通用串行总线(USB)闪存驱动器的便携式存储设备,仅举几例。Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing system. Typically, a central processing system will receive instructions and data from read only memory and/or random access memory. The basic components of a computer include a central processing system for implementing or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic or magneto-optical disks, or optical disks, to receive data therefrom or to It transmits data, or both. However, a computer is not required to have such a device. In addition, a computer may be embedded in another device such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a device such as a Universal Serial Bus (USB) ) portable storage devices like flash drives, to name a few.
适合于存储计算机程序指令和数据的计算机可读介质包括所有形式的非易失性存储器、媒介和存储器设备,例如包括半导体存储器设备(例如EPROM、EEPROM和闪存设备)、磁盘(例如内部硬盘或可移动盘)、磁光盘以及0xCD_00ROM和DVD-ROM盘。处理器和存储器可由专用逻辑电路补充或并入专用逻辑电路中。Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disk), magneto-optical disk, and 0xCD_00ROM and DVD-ROM disks. The processor and memory can be supplemented by, or incorporated in, special purpose logic circuitry.
虽然本申请包含许多具体实施细节,但是这些不应被解释为限制任何公开的范围或所要求保护的范围,而是主要用于描述特定公开的具体实施例的特征。本申请内在多个实施例中描述的某些特征也可以在单个实施例中被组合实施。另一方面,在单个实施例中描述的各种特征也可以在多个实施例中分开实施或以任何合适的子组合来实施。此外,虽然特征可以如上所述在某些组合中起作用并且甚至最初如此要求保护,但是来自所要求保护的组合中的一个或多个特征在一些情况下可以从该组合中去除,并且所要求保护的组合可以指向子组合或子组合的变型。While this application contains many specific implementation details, these should not be construed as limitations on the scope of any disclosure or of what may be claimed, but rather as primarily describing features of particular disclosed embodiments. Certain features that are described in this application in multiple embodiments can also be implemented in combination in a single embodiment. On the other hand, various features that are described in a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may function in certain combinations as described above and even be initially so claimed, one or more features from a claimed combination may in some cases be removed from that combination and the claimed A protected combination can point to a subcombination or a variant of a subcombination.
类似地,虽然在附图中以特定顺序描绘了操作,但是这不应被理解为要求这些操作以所示的特定顺序执行或顺次执行、或者要求所有例示的操作被执行,以实现期望的结果。在某些情况下,多任务和并行处理可能是有利的。此外,所述实施例中的各种系统模块和组件的分离不应被理解为在所有实施例中均需要这样的分离,并且应当理解,所描述的程序组件和系统通常可以一起集成在单个软件产品中,或者封装成多个软件产品。Similarly, while operations are depicted in the figures in a particular order, this should not be construed as requiring that those operations be performed in the particular order shown, or sequentially, or that all illustrated operations be performed, to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. Furthermore, the separation of the various system modules and components in the described embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can often be integrated together in a single software product, or packaged into multiple software products.
由此,主题的特定实施例已被描述。其他实施例在所附权利要求书的范围以内。在某些情况下,权利要求书中记载的动作可以以不同的顺序执行并且仍实现期望的结果。此外,附图中描绘的处理并非必需所示的特定顺序或顺次顺序,以实现期望的结果。在某些实现中,多任务和并行处理可能是有利的。Thus, certain embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
以上所述仅为本申请一个或多个实施例的较佳实施例而已,并不用以限制本申请 一个或多个实施例,凡在本申请一个或多个实施例的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本申请一个或多个实施例保护的范围之内。The above descriptions are only preferred embodiments of one or more embodiments of the present application, and are not intended to limit one or more embodiments of the present application. Within the spirit and principles of one or more embodiments of the present application, Any modification, equivalent replacement, improvement, etc. should be included in the protection scope of one or more embodiments of the present application.

Claims (17)

  1. An image processing method, comprising:
    acquiring a region image corresponding to a foot in an image to be processed;
    performing key point detection on the region image by using a foot key point detection model, to obtain two-dimensional position information of first foot key points of the foot; and
    determining a three-dimensional pose of the foot in a three-dimensional space based on a mapping relationship between the two-dimensional position information and preset position information, in the three-dimensional space, of second foot key points in a preset three-dimensional foot model that correspond to the first foot key points.
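The claim leaves the 2D–3D solver unspecified; in practice the mapping between detected 2-D key points and a preset 3-D model is typically resolved with a PnP solver (for example OpenCV's `solvePnP`). As a self-contained illustration only, the sketch below recovers a rigid pose with the Kabsch algorithm under the simplifying, non-claimed assumption that full 3-D correspondences are available; the model key points and the pose are hypothetical values.

```python
import numpy as np

def rigid_align(model_pts, observed_pts):
    """Recover rotation R and translation t with observed ≈ R @ model + t,
    via the Kabsch algorithm (SVD of the 3x3 cross-covariance)."""
    cm = model_pts.mean(axis=0)
    co = observed_pts.mean(axis=0)
    H = (model_pts - cm).T @ (observed_pts - co)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = co - R @ cm
    return R, t

# Hypothetical "second foot key points" of a preset 3-D foot model.
model = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0],
                  [0, 0, 1], [1, 1, 0.5]], dtype=float)
# Simulate an observed pose: rotate 30 degrees about z, then translate.
a = np.deg2rad(30.0)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
t_true = np.array([0.2, -0.1, 0.4])
observed = model @ R_true.T + t_true

R_est, t_est = rigid_align(model, observed)
```

With noise-free correspondences and non-degenerate points, the estimated pose matches the simulated one exactly up to floating-point error.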
  2. The method according to claim 1, wherein acquiring the region image corresponding to the foot in the image to be processed comprises:
    performing object detection on the image to be processed by using an object detection model, to obtain an object detection result, the object detection result comprising a detection frame of the foot in the image to be processed; and
    obtaining the region image corresponding to the foot according to the detection frame and the image to be processed.
  3. The method according to claim 2, wherein the object detection result further comprises a type of the foot, the type indicating whether the foot is a left foot or a right foot;
    the method further comprises:
    in response to the foot being of a preset type, flipping the region image, so that the types of the feet in all region images input to the foot key point detection model are consistent.
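The flip normalization above can be sketched as a horizontal mirror of the crop; mirroring the x-coordinates of any associated key points (e.g. training labels, or detections mapped back to the original crop) is an additional assumption beyond the claim, shown here for completeness. The toy image and points are hypothetical.

```python
import numpy as np

def normalize_foot(region, keypoints_xy, is_preset_type):
    """If the crop shows the preset foot type, mirror it horizontally so
    every crop fed to the key point model depicts the same foot type.
    keypoints_xy: (N, 2) array of (x, y) pixel coordinates in `region`."""
    if not is_preset_type:
        return region, keypoints_xy
    h, w = region.shape[:2]
    flipped = region[:, ::-1]            # horizontal flip of the crop
    kps = keypoints_xy.copy()
    kps[:, 0] = (w - 1) - kps[:, 0]      # mirror x, keep y unchanged
    return flipped, kps

img = np.arange(12).reshape(3, 4)        # toy 3x4 "image"
kps = np.array([[0.0, 0.0], [3.0, 2.0]])
out, kps2 = normalize_foot(img, kps, True)
```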
  4. The method according to any one of claims 1 to 3, wherein the image to be processed is an image in a video stream, and acquiring the region image corresponding to the foot in the image to be processed comprises:
    acquiring position information of the first foot key points of the foot in a previous frame image of the image to be processed in the video stream, and determining a foot key point frame according to the acquired position information; and
    obtaining the region image corresponding to the foot based on the foot key point frame and the image to be processed.
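One minimal reading of the "foot key point frame" above is an axis-aligned box around the previous frame's key points, expanded by a margin so the foot remains inside despite small inter-frame motion. The padding ratio and the sample points below are hypothetical, not values from the application.

```python
import numpy as np

def keypoint_box(kps, pad_ratio=0.2):
    """Axis-aligned box around last frame's key points, expanded by a
    margin (pad_ratio of the box size) to tolerate inter-frame motion.
    kps: (N, 2) array of (x, y) key point positions."""
    x0, y0 = kps.min(axis=0)
    x1, y1 = kps.max(axis=0)
    pw = (x1 - x0) * pad_ratio
    ph = (y1 - y0) * pad_ratio
    return (x0 - pw, y0 - ph, x1 + pw, y1 + ph)

kps = np.array([[10.0, 20.0], [30.0, 60.0], [20.0, 40.0]])
box = keypoint_box(kps)
```

The resulting box would then be intersected with the image bounds and used to crop the region image for the current frame.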
  5. The method according to claim 4, further comprising:
    acquiring a stored type of the foot in the previous frame image; and
    in response to the type of the foot in the previous frame image being a preset type, flipping the region image, so that the types of the feet in all region images input to the foot key point detection model are consistent.
  6. The method according to any one of claims 1 to 5, wherein the image to be processed is an image in a video stream, the video stream further comprises a previous frame image preceding the image to be processed, and the region image of the image to be processed is determined according to a foot key point frame determined based on the first foot key points in the previous frame image;
    before determining the three-dimensional pose of the foot in the three-dimensional space based on the mapping relationship between the two-dimensional position information and the preset position information, in the three-dimensional space, of the second foot key points in the preset three-dimensional foot model that correspond to the first foot key points, the method further comprises:
    classifying the region image by using an image classification model, to obtain a classification result of the region image, the classification result indicating whether the region image is a foot image; and
    in response to the classification result indicating that the region image is a foot image, determining that the foot in the region image and the foot in the previous frame image are the same foot, so as to perform foot tracking.
  7. The method according to claim 6, wherein determining the three-dimensional pose of the foot in the three-dimensional space comprises:
    in response to determining that the foot in the region image and the foot in the previous frame image are the same foot, determining the three-dimensional pose of the foot in the three-dimensional space based on the mapping relationship between the two-dimensional position information and the preset position information, in the three-dimensional space, of the second foot key points in the preset three-dimensional foot model that correspond to the first foot key points.
  8. The method according to claim 6 or 7, wherein the image classification model and the foot key point detection model share a feature extraction network, and a joint training method of the image classification model and the foot key point detection model comprises:
    acquiring first image samples annotated with image classification information, and second image samples annotated with position information of first foot key points;
    inputting the first image samples into the image classification model to obtain a classification prediction result, and obtaining first loss information according to the classification prediction result and the annotated image classification information;
    inputting the second image samples into the foot key point detection model to obtain a first foot key point position prediction result, and obtaining second loss information according to the first foot key point position prediction result and the annotated position information of the first foot key points; and
    adjusting model parameters of the image classification model and the foot key point detection model based on the first loss information and the second loss information.
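The joint training step above amounts to combining the two loss terms into one objective that updates the shared feature extractor and both heads. The claim does not specify the loss functions or weighting; the numpy sketch below assumes, purely for illustration, binary cross-entropy for the classification head, mean squared error for the key point head, and an equal-weight sum (in a real system a DL framework would backpropagate this total).

```python
import numpy as np

def bce(pred, label, eps=1e-7):
    """Binary cross-entropy; a hypothetical choice for the first loss."""
    p = np.clip(pred, eps, 1.0 - eps)
    return -(label * np.log(p) + (1.0 - label) * np.log(1.0 - p)).mean()

def mse(pred, target):
    """Mean squared error; a hypothetical choice for the second loss."""
    return ((pred - target) ** 2).mean()

# Toy batch: one classification sample and one key point sample, assumed
# to have passed through heads sharing one feature extractor (not modeled).
cls_pred, cls_label = np.array([0.9, 0.2]), np.array([1.0, 0.0])
kpt_pred = np.array([[10.0, 12.0], [30.0, 31.0]])
kpt_label = np.array([[11.0, 12.0], [29.0, 30.0]])

loss_cls = bce(cls_pred, cls_label)   # first loss information
loss_kpt = mse(kpt_pred, kpt_label)   # second loss information
total = loss_cls + 1.0 * loss_kpt     # weighted sum drives one joint update
```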
  9. The method according to any one of claims 1 to 8, wherein the first foot key points comprise a plurality of key points on an edge contour of the foot, and the number of the first foot key points is not less than four.
  10. The method according to claim 9, wherein the first foot key points comprise key points of at least one of the following regions:
    the tip of the big toe; the medial forefoot joint; the medial arch of the foot; the medial rear sole; the back of the heel; the lateral rear sole; the lateral forefoot joint; the junction of the instep and the leg; the medial ankle joint; the Achilles tendon; and the lateral ankle joint.
  11. The method according to any one of claims 1 to 10, further comprising, after determining the three-dimensional pose of the foot in the three-dimensional space:
    acquiring a three-dimensional virtual model of a shoe material; and
    superimposing the three-dimensional virtual model at a position corresponding to the foot in the image to be processed based on the three-dimensional pose of the foot in the image to be processed, to obtain an augmented reality effect of virtual shoe fitting.
  12. The method according to claim 11, wherein superimposing the three-dimensional virtual model at the position corresponding to the foot in the image to be processed based on the three-dimensional pose of the foot in the image to be processed, to obtain the augmented reality effect of virtual shoe fitting, comprises:
    acquiring an initial pose, in the three-dimensional space, of the three-dimensional virtual model corresponding to the shoe material;
    converting the initial pose, based on the three-dimensional pose, to match the three-dimensional pose corresponding to the foot, to obtain a converted three-dimensional pose;
    mapping the three-dimensional virtual model onto the image to be processed based on the converted three-dimensional pose, to obtain a two-dimensional virtual material corresponding to the shoe material; and
    performing image fusion between the two-dimensional virtual material and the position corresponding to the foot, to obtain a fused image for displaying the effect of virtual shoe fitting.
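The "mapping onto the image to be processed" step in claim 12 is, in the standard formulation, a pinhole-camera projection of the shoe model's vertices under the converted pose. The sketch below assumes a hypothetical intrinsic matrix `K` and an identity rotation; a real system would use calibrated intrinsics and rasterize the full mesh (e.g. with a renderer) rather than project isolated points.

```python
import numpy as np

def project_points(pts3d, R, t, K):
    """Pinhole projection: apply the converted pose (R, t) to the 3-D
    vertices, multiply by the intrinsics K, then divide by depth."""
    cam = pts3d @ R.T + t            # world -> camera coordinates
    uv_h = cam @ K.T                 # homogeneous pixel coordinates
    return uv_h[:, :2] / uv_h[:, 2:3]

K = np.array([[500.0, 0.0, 160.0],   # hypothetical intrinsics:
              [0.0, 500.0, 120.0],   # focal length 500 px,
              [0.0, 0.0, 1.0]])      # principal point (160, 120)
R = np.eye(3)                        # identity rotation for the check
t = np.array([0.0, 0.0, 2.0])        # 2 units in front of the camera
shoe_vertices = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
uv = project_points(shoe_vertices, R, t, K)
```

The projected pixels give the footprint of the two-dimensional virtual material, which would then be alpha-blended with the foot region of the image to produce the fused frame.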
  13. An image processing apparatus, comprising:
    a first acquisition module, configured to acquire a region image corresponding to a foot in an image to be processed;
    a key point detection module, configured to perform key point detection on the region image by using a foot key point detection model, to obtain two-dimensional position information of first foot key points of the foot; and
    a determination module, configured to determine a three-dimensional pose of the foot in a three-dimensional space based on a mapping relationship between the two-dimensional position information and preset position information, in the three-dimensional space, of second foot key points in a preset three-dimensional foot model that correspond to the first foot key points.
  14. The apparatus according to claim 13, wherein the image to be processed is an image in a video stream, the video stream further comprises a previous frame image preceding the image to be processed, and the region image of the image to be processed is determined according to a foot key point frame determined based on the first foot key points in the previous frame image;
    the apparatus further comprises:
    a tracking module, configured to classify the region image by using an image classification model, to obtain a classification result of the region image, the classification result indicating whether the region image is a foot image; and
    in response to the classification result indicating that the region image is a foot image, determine that the foot in the region image and the foot in the previous frame image are the same foot, so as to perform foot tracking.
  15. The apparatus according to claim 13 or 14, further comprising:
    a virtual shoe fitting module, configured to acquire a three-dimensional virtual model of a shoe material; and
    based on the three-dimensional pose of the foot in the image to be processed, superimpose the three-dimensional virtual model at a position corresponding to the foot in the image to be processed, to obtain an augmented reality effect of virtual shoe fitting.
  16. An electronic device, comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor implements the image processing method according to any one of claims 1 to 12 by running the executable instructions.
  17. A computer-readable storage medium storing a computer program, wherein the computer program causes a processor to execute the image processing method according to any one of claims 1 to 12.
PCT/CN2022/111023 2021-08-19 2022-08-09 Image processing WO2023020327A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110955120.7A CN113627379A (en) 2021-08-19 2021-08-19 Image processing method, device, equipment and storage medium
CN202110955120.7 2021-08-19

Publications (1)

Publication Number Publication Date
WO2023020327A1

Family

ID=78386695


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627379A (en) * 2021-08-19 2021-11-09 北京市商汤科技开发有限公司 Image processing method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013073324A (en) * 2011-09-27 2013-04-22 Dainippon Printing Co Ltd Image display system
CN110111415A (en) * 2019-04-25 2019-08-09 上海时元互联网科技有限公司 A kind of 3D intelligent virtual of shoes product tries method and system on
CN112257582A (en) * 2020-10-21 2021-01-22 北京字跳网络技术有限公司 Foot posture determination method, device, equipment and computer readable medium
CN112287869A (en) * 2020-11-10 2021-01-29 上海依图网络科技有限公司 Image data detection method and device
CN112926364A (en) * 2019-12-06 2021-06-08 北京四维图新科技股份有限公司 Head posture recognition method and system, automobile data recorder and intelligent cabin
CN113627379A (en) * 2021-08-19 2021-11-09 北京市商汤科技开发有限公司 Image processing method, device, equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558865A (en) * 2019-01-22 2019-04-02 郭道宁 A kind of abnormal state detection method to the special caregiver of need based on human body key point
CN110008835B (en) * 2019-03-05 2021-07-09 成都旷视金智科技有限公司 Sight line prediction method, device, system and readable storage medium
CN111242973A (en) * 2020-01-06 2020-06-05 上海商汤临港智能科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111507806B (en) * 2020-04-23 2023-08-29 北京百度网讯科技有限公司 Virtual shoe test method, device, equipment and storage medium
CN111931720B (en) * 2020-09-23 2021-01-22 深圳佑驾创新科技有限公司 Method, apparatus, computer device and storage medium for tracking image feature points
CN112241731B (en) * 2020-12-03 2021-03-16 北京沃东天骏信息技术有限公司 Attitude determination method, device, equipment and storage medium
CN112614184A (en) * 2020-12-28 2021-04-06 清华大学 Object 6D attitude estimation method and device based on 2D detection and computer equipment
CN113239925A (en) * 2021-05-24 2021-08-10 北京有竹居网络技术有限公司 Text detection model training method, text detection method, device and equipment

Also Published As

Publication number Publication date
CN113627379A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
US11238606B2 (en) Method and system for performing simultaneous localization and mapping using convolutional image transformation
KR102647351B1 (en) Modeling method and modeling apparatus using 3d point cloud
CN108028871B (en) Label-free multi-user multi-object augmented reality on mobile devices
US9495761B2 (en) Environment mapping with automatic motion model selection
Alexiadis et al. An integrated platform for live 3D human reconstruction and motion capturing
WO2019164498A1 (en) Methods, devices and computer program products for global bundle adjustment of 3d images
WO2013029675A1 (en) Method for estimating a camera motion and for determining a three-dimensional model of a real environment
CN111127524A (en) Method, system and device for tracking trajectory and reconstructing three-dimensional image
da Silveira et al. Dense 3D scene reconstruction from multiple spherical images for 3-DoF+ VR applications
Aly et al. Street view goes indoors: Automatic pose estimation from uncalibrated unordered spherical panoramas
WO2023020327A1 (en) Image processing
Vo et al. Spatiotemporal bundle adjustment for dynamic 3d human reconstruction in the wild
Alam et al. Pose estimation algorithm for mobile augmented reality based on inertial sensor fusion.
US10977810B2 (en) Camera motion estimation
Imre et al. Calibration of nodal and free-moving cameras in dynamic scenes for post-production
Price et al. Augmenting crowd-sourced 3d reconstructions using semantic detections
Yang et al. Vision-inertial hybrid tracking for robust and efficient augmented reality on smartphones
JP2023056466A (en) Global positioning device and method for global positioning
Laskar et al. Robust loop closures for scene reconstruction by combining odometry and visual correspondences
CN114445601A (en) Image processing method, device, equipment and storage medium
Garau et al. Unsupervised continuous camera network pose estimation through human mesh recovery
TWI811108B (en) Mixed reality processing system and mixed reality processing method
Masher Accurately scaled 3-D scene reconstruction using a moving monocular camera and a single-point depth sensor
Wang et al. DynOcc: Learning Single-View Depth from Dynamic Occlusion Cues
Almeida et al. Incremental reconstruction approach for telepresence or ar applications

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE