CN113627379A - Image processing method, device, equipment and storage medium


Info

Publication number
CN113627379A
Authority
CN
China
Prior art keywords
foot, image, dimensional, key point, processed
Prior art date
Legal status
Pending
Application number
CN202110955120.7A
Other languages
Chinese (zh)
Inventor
何野
四建楼
王玉峰
杜天元
王明峰
钱晨
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202110955120.7A
Publication of CN113627379A
Priority to PCT/CN2022/111023 (published as WO2023020327A1)
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides an image processing method, apparatus, device, and storage medium. The method may include: acquiring a region image corresponding to a foot in an image to be processed; performing key point detection on the region image using a foot key point detection model to obtain two-dimensional position information of a first foot key point of the foot; and determining the three-dimensional pose of the foot in a three-dimensional space based on the mapping relationship between the two-dimensional position information and the preset position information, in the three-dimensional space, of a second foot key point in a preset three-dimensional foot model that corresponds to the first foot key point.

Description

Image processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to an image processing method, apparatus, device, and storage medium.
Background
Human body key point detection is a technology that uses deep learning to extract features from an input image and locates key points from the extracted feature map.
Existing human body key point detection can only detect body-level key points; it cannot detect, for the feet, the two-dimensional position information of foot key points suitable for pose estimation, so the three-dimensional pose of the feet cannot be obtained.
Disclosure of Invention
In view of the above, the present application discloses an image processing method. The method may include: acquiring a region image corresponding to the foot in the image to be processed; performing key point detection on the region image by using a foot key point detection model to obtain two-dimensional position information of a first foot key point of the foot; and determining the three-dimensional pose of the foot in the three-dimensional space based on the mapping relation between the preset position information and the two-dimensional position information of a second foot key point corresponding to the first foot key point in a preset foot three-dimensional model in the three-dimensional space.
In some embodiments, acquiring the region image corresponding to the foot in the image to be processed includes: performing object detection on the image to be processed using an object detection model to obtain an object detection result, where the object detection result includes a detection frame of the foot in the image to be processed; and obtaining the region image corresponding to the foot according to the detection frame and the image to be processed.
In some embodiments, the object detection result further includes the type of the foot, the type indicating whether the foot is a left foot or a right foot. After obtaining the region image corresponding to the foot according to the detection frame and the image to be processed, the method further includes: in response to the foot being of a preset type, flipping the region image so that the types of the feet in the region images input to the foot key point detection model are consistent.
In some embodiments, the image to be processed is an image in a video stream, and acquiring the region image corresponding to the foot in the image to be processed includes: acquiring the position information of the first foot key point of the foot in the previous frame image of the image to be processed in the video stream, and determining a foot frame according to the acquired position information; and obtaining the region image corresponding to the foot based on the foot frame and the image to be processed.
In some embodiments, after obtaining the region image corresponding to the foot based on the foot frame and the image to be processed, the method further includes: acquiring the stored type of the foot in the previous frame image; and in response to the acquired type being a preset type, flipping the region image so that the types of the feet in the region images input to the foot key point detection model are consistent.
In some embodiments, the image to be processed is an image in a video stream, and a previous frame image before the image to be processed is also included in the video stream; determining the regional image of the image to be processed according to a foot frame surrounded by first foot key points in the previous frame of image; before determining the three-dimensional pose of the foot in the three-dimensional space based on the mapping relationship between the two-dimensional position information and preset position information of a second foot key point in the preset foot three-dimensional model corresponding to the first foot key point in the three-dimensional space, the method further comprises: classifying the region images by using an image classification model to obtain a classification result of the region images; the classification result is used for indicating whether the region image is a foot image; and in response to the classification result indicating that the region image is a foot image, determining that the foot in the region image is the same foot as the foot in the previous frame image for foot tracking.
In some embodiments, the determining the three-dimensional pose of the foot in the three-dimensional space based on a mapping relationship between preset position information and the two-dimensional position information of a second foot key point in a preset three-dimensional foot model corresponding to the first foot key point in the three-dimensional space includes: in response to determining that the foot in the region image is the same foot as the foot in the previous frame image, determining a three-dimensional pose of the foot in a three-dimensional space based on a mapping relation between preset position information and the two-dimensional position information of a second foot key point, corresponding to the first foot key point, in the preset foot three-dimensional model in the three-dimensional space.
In some embodiments, the image classification model shares a feature extraction network with the foot keypoint detection model; the joint training method of the image classification model and the foot key point detection model comprises the following steps: acquiring a first image sample marked with image classification information and a second image sample marked with the position information of a first foot key point; inputting the first image sample into the image classification model to obtain a classification prediction result, and obtaining first loss information according to the classification prediction result and labeled image classification information; inputting the second image sample into the foot key point detection model to obtain a first foot key point position prediction result, and obtaining second loss information according to the first foot key point position prediction result and the marked first foot key point position information; and adjusting model parameters of the image classification model and the foot key point detection model based on the first loss information and the second loss information.
In some embodiments, the first foot keypoints comprise a plurality of keypoints on a foot edge contour; the number of the first foot key points is not less than four.
In some embodiments, the first foot keypoints comprise keypoints of at least one of the following regions: the tip of the big toe; the medial forefoot joint; the medial arch of the foot; the inner side of the rear sole; the rear of the heel; the outer side of the rear sole; the lateral forefoot joint; the junction of the instep and the leg; the medial ankle joint; the rear foot tendon; the lateral ankle joint.
In some embodiments, after determining the three-dimensional pose of the foot in the three-dimensional space, the method further comprises: acquiring a three-dimensional virtual model of a shoe material; and superposing the three-dimensional virtual model at the position corresponding to the foot in the image to be processed based on the three-dimensional pose of the foot in the image to be processed to obtain the augmented reality AR effect of the virtual shoe fitting.
In some embodiments, the superimposing, based on the three-dimensional pose of the foot in the image to be processed, the three-dimensional virtual model at a position corresponding to the foot in the image to be processed to obtain an augmented reality effect of the virtual shoe fitting includes: acquiring an initial pose of a three-dimensional virtual model corresponding to a shoe material in the three-dimensional space; converting the initial pose to be matched with the three-dimensional pose corresponding to the foot based on pose conversion information indicated by the three-dimensional pose to obtain a converted three-dimensional pose; mapping the three-dimensional virtual model to the image to be processed based on the converted three-dimensional pose to obtain a two-dimensional virtual material corresponding to the shoe material; and carrying out image fusion on the two-dimensional virtual material and the position corresponding to the foot part to obtain a fused image so as to display the AR effect of the virtual shoe fitting.
The present application also proposes an image processing apparatus, the apparatus comprising: the first acquisition module is used for acquiring an area image corresponding to the foot in the image to be processed; the key point detection module is used for detecting key points of the area image by using the foot key point detection model to obtain two-dimensional position information of a first foot key point of the foot; the determining module is used for determining the three-dimensional pose of the foot in the three-dimensional space based on the mapping relation between the preset position information and the two-dimensional position information of a second foot key point corresponding to the first foot key point in a preset foot three-dimensional model in the three-dimensional space.
In some embodiments, the image to be processed is an image in a video stream, and a previous frame image before the image to be processed is also included in the video stream; determining the regional image of the image to be processed according to a foot frame surrounded by first foot key points in the previous frame of image; the device further comprises: the tracking module is used for classifying the regional images by using an image classification model to obtain a classification result of the regional images; the classification result is used for indicating whether the region image is a foot image; and in response to the classification result indicating that the region image is a foot image, determining that the foot in the region image is the same foot as the foot in the previous frame image for foot tracking.
In some embodiments, the apparatus further comprises: the virtual shoe fitting module is used for acquiring a three-dimensional virtual model of the shoe material; and superposing the three-dimensional virtual model at the position corresponding to the foot in the image to be processed based on the three-dimensional pose of the foot in the image to be processed to obtain the augmented reality AR effect of the virtual shoe fitting.
The present application further proposes an electronic device, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor executes the executable instructions to implement the image processing method as shown in any one of the foregoing embodiments.
The present application also proposes a computer-readable storage medium storing a computer program for causing a processor to execute an image processing method as shown in any of the preceding embodiments.
In the technical solution disclosed in the foregoing embodiments, a foot key point detection model may be used to perform key point detection on the region image corresponding to a foot, obtaining two-dimensional position information of the first foot key points; then, based on the mapping relationship between this two-dimensional position information and the preset position information, in the three-dimensional space, of the second foot key points in the preset three-dimensional foot model that correspond to the first foot key points, the three-dimensional pose of the foot in the three-dimensional space can be determined. Unlike general human body key point detection, two-dimensional position information of foot key points suitable for pose estimation is obtained by neural network regression, so the three-dimensional pose of the foot can be conveniently determined based on the mapping relationship.
In addition, in the technical solution described in the foregoing embodiments, a three-dimensional virtual model of a shoe material may be superimposed at the position of the image to be processed corresponding to the foot, based on the three-dimensional pose of the foot in the image to be processed, so as to obtain the augmented reality (AR) effect of a virtual shoe fitting.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate one or more embodiments of the present application or the technical solutions in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description show only some of the embodiments described in this application, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of an image processing method shown in the present application;
FIG. 2 is a schematic flowchart of a region image acquisition method shown in the present application;
FIG. 3 is a schematic flowchart of a model training method shown in the present application;
FIG. 4 is a schematic flowchart of a foot tracking method shown in the present application;
FIG. 5 is a schematic view of a virtual shoe fitting process shown in the present application;
FIG. 6 is a schematic flowchart of a virtual shoe fitting method shown in the present application;
FIG. 7 is a schematic structural diagram of an image processing apparatus shown in the present application;
FIG. 8 is a schematic diagram of the hardware structure of an electronic device shown in the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It should also be understood that the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining," depending on the context.
The application relates to the field of augmented reality, and the method and the device realize detection or identification processing on relevant characteristics, states and attributes of a target object by means of various visual correlation algorithms by acquiring image information of the target object in a real environment, so as to obtain an AR effect combining virtual and reality matched with specific application. For example, the target object may relate to a face, a limb, a gesture, an action, etc. associated with a human body, or a marker, a marker associated with an object, or a sand table, a display area, a display item, etc. associated with a venue or a place. The vision-related algorithms may involve visual localization, SLAM, three-dimensional reconstruction, image registration, background segmentation, key point extraction and tracking of objects, pose or depth detection of objects, and the like. The specific application can not only relate to interactive scenes such as navigation, explanation, reconstruction, virtual effect superposition display and the like related to real scenes or articles, but also relate to special effect treatment related to people, such as interactive scenes such as makeup beautification, limb beautification, special effect display, virtual model display and the like. The detection or identification processing of the relevant characteristics, states and attributes of the target object can be realized through the convolutional neural network. The convolutional neural network is a network model obtained by model training based on a deep learning framework.
The application provides an image processing method. The method can use a foot key point detection model to perform key point detection on the region image corresponding to a foot, obtaining two-dimensional position information of the first foot key points; then, based on the mapping relationship between this two-dimensional position information and the preset position information, in the three-dimensional space, of the second foot key points in the preset three-dimensional foot model that correspond to the first foot key points, the three-dimensional pose of the foot in the three-dimensional space can be determined. Unlike general human body key point detection, two-dimensional position information of foot key points suitable for pose estimation is obtained by neural network regression, so the three-dimensional pose of the foot can be conveniently determined based on the mapping relationship.
The first foot keypoint is at least one keypoint in at least one preset foot region of the foot in the image. The preset foot region may be any region of the foot; for example, it may be the big toe region.
The second foot key point is a key point which is in the same foot area as the first foot key point in the foot key points included in the preset foot three-dimensional model.
The method can be applied to electronic equipment. Wherein the electronic device may execute the method by loading a software device corresponding to the image processing method. The electronic equipment can be a notebook computer, a server, a mobile phone, a PAD terminal and the like. The specific type of the electronic device is not particularly limited in this application. The electronic device may be a client-side or server-side device. The server may be a server or a cloud provided by a server, a server cluster, or a distributed server cluster.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method of image processing according to the present application. As shown in fig. 1, the method may include S102-S106.
In step S102, a region image corresponding to the foot in the image to be processed is obtained.
The image to be processed may include a foot. The application aims to capture a foot in an image to be processed and obtain the three-dimensional pose of the foot.
In some embodiments, the image to be processed may be an image uploaded by a user through a client program carried on the device. In this way, images uploaded by the user can be processed.
In some embodiments, the image to be processed may also be an image acquired by image acquisition hardware. For example, the image acquisition hardware may be a camera mounted on a mobile phone terminal. The user can gather the video stream through the camera in real time. The image to be processed may be a picture, or may be an image in the video stream. The method and the device can capture the foot in the image to be processed in real time and estimate the three-dimensional pose of the foot.
The foot may be a human foot, or may be a foot of another animal or robot. The foot may include a left foot or a right foot.
The region image may be an image corresponding to the foot region in the image to be processed.
In some embodiments, the image to be processed is a picture or a first frame image in a video stream, and a region surrounded by the detection frame of the foot in the image to be processed may be determined as the region image. When the image to be processed is a non-first frame image in the video stream, a method for acquiring the area image is described in the following embodiments, which is not described in detail herein.
Referring to fig. 2, fig. 2 is a schematic flow chart of a region image acquiring method according to the present application.
As shown in FIG. 2, S21-S22 may be performed when S102 is performed.
And S21, performing object detection on the image to be processed by using an object detection model to obtain an object detection result, wherein the object detection result comprises a detection frame of a foot in the image to be processed.
In some embodiments, in performing S21, the to-be-processed image may be input into the trained object detection model, and a foot detection frame corresponding to a foot included in the to-be-processed image is obtained.
The object detection model may be a model constructed based on R-CNN (Region-based Convolutional Neural Network), Fast R-CNN, or Faster R-CNN.
In some embodiments, the object detection model may be trained. Specifically, a plurality of training samples labeled with detection frame information corresponding to the foot may be obtained. The model may then be supervised trained based on the training samples until the model converges.
After the training is finished, the detection of the feet and the foot detection frame in the image can be carried out by using the object detection model.
And S22, obtaining the area image corresponding to the foot according to the detection frame and the image to be processed.
In some embodiments, the foot detection frame obtained by object detection and the image to be processed (or a feature map obtained by extracting features from the image to be processed with a backbone network) may be input to a region feature extraction unit to obtain the region image.
The region feature extraction unit may be an ROI Align (Region of Interest Align) unit or an ROI Pooling (Region of Interest Pooling) unit. The unit crops out the region enclosed by the foot detection frame in the image to be processed.
Through S21-S22, the region corresponding to the foot can be accurately cropped from the image to be processed using deep learning, yielding the region image.
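As a concrete illustration of S21-S22, the following is a minimal Python sketch. The detection frame is assumed to be given in pixel coordinates; a real pipeline would apply ROI Align on backbone feature maps, whereas this sketch simply crops and resizes the raw pixels.

```python
import cv2

def crop_foot_region(image, box, output_size=(256, 256)):
    """Crop the region enclosed by a foot detection frame.

    image: H x W x 3 uint8 array (the image to be processed).
    box:   (x0, y0, x1, y1) detection frame in pixel coordinates.
    """
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    x0, y0 = max(x0, 0), max(y0, 0)                           # clamp to image bounds
    x1, y1 = min(x1, image.shape[1]), min(y1, image.shape[0])
    region = image[y0:y1, x0:x1]                              # the framed foot region
    return cv2.resize(region, output_size)                    # fixed size for the model
```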
And S104, performing key point detection on the region image by using the foot key point detection model to obtain two-dimensional position information of the first foot key point of the foot.
The foot key point detection model comprises a neural network model obtained by training on the basis of a plurality of foot area image samples marked with the position information of the first foot key point.
The first foot key point is at least one key point in at least one preset foot region of the foot in the image. The preset foot region may be any region of the foot; for example, it may be the big toe region.
The number and the area of the first foot key points can be set according to business requirements. In some embodiments, the first foot keypoints may comprise a plurality of keypoints on a foot edge contour. Therefore, the outline of the foot can be well represented according to the first foot key point, and the pose estimation effect of the foot can be improved.
The two-dimensional position information may indicate two-dimensional position information of the first foot keypoint in the image to be processed. In some embodiments the two-dimensional position information may indicate two-dimensional coordinates of the first foot keypoint in the image to be processed.
In some embodiments the number of first foot keypoints may be no less than four. Therefore, the fine granularity of the key points of the feet can be improved, the contained pose information of the feet is increased, and the pose estimation effect of the feet can be improved. In some embodiments, the number of the first foot key points may be any number in a range from four to fifteen, and by setting key points at necessary foot positions, the first foot key points may be set in a suitable number range, which also facilitates subsequent foot three-dimensional pose detection.
In some embodiments, the first foot keypoints comprise keypoints of at least one of the following regions:
the tip of the big toe; the medial forefoot joint; the medial arch of the foot; the inner side of the rear sole; the rear of the heel; the outer side of the rear sole; the lateral forefoot joint; the junction of the instep and the leg; the medial ankle joint; the rear foot tendon; the lateral ankle joint.
Therefore, points in the protruding area and/or the recessed area on the foot outline can be used as first foot key points, the accuracy of characterization of the foot outline is improved, and the pose estimation effect of the foot is improved.
The foot key point detection model can be a regression or classification model constructed based on a neural network or a deep learning network. The model is used to detect two-dimensional position information of a first foot keypoint on the foot.
Taking the foot key point detection model as a regression model constructed based on a deep learning network as an example. In some embodiments, it may be trained using training samples. Referring to fig. 3, fig. 3 is a schematic flow chart of a model training method according to the present application. As shown in FIG. 3, in performing model training, S31-S33 may be performed.
In step S31, a plurality of training samples are obtained. The training sample may be a foot image including a foot. The two-dimensional position information of the first foot key point can be marked in the foot image.
And S32, inputting the plurality of training samples into the foot key point detection model to obtain first foot key point two-dimensional position information corresponding to the plurality of training samples respectively.
And S33, obtaining loss information according to the difference between the obtained two-dimensional position information of the first foot key point and the two-dimensional position information of the first foot key point marked in advance by each training sample, and updating the parameters of the foot key point detection model by back propagation by using the loss information.
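A minimal PyTorch-style sketch of the S31-S33 loop is given below. The network `keypoint_model`, the smooth-L1 loss, and the data layout (images paired with K x 2 keypoint coordinates) are illustrative assumptions, not the prescribed implementation.

```python
import torch
import torch.nn as nn

def train_keypoint_model(keypoint_model, loader, epochs=10, lr=1e-4):
    """S31-S33: supervise predicted 2D keypoints against annotations.

    loader yields (images, keypoints), keypoints of shape (batch, K, 2):
    the labeled two-dimensional positions of the first foot key points.
    """
    optimizer = torch.optim.Adam(keypoint_model.parameters(), lr=lr)
    criterion = nn.SmoothL1Loss()                 # loss from the prediction/label difference
    for _ in range(epochs):
        for images, keypoints in loader:
            pred = keypoint_model(images)         # S32: regress (batch, K, 2) positions
            loss = criterion(pred, keypoints)     # S33: loss information
            optimizer.zero_grad()
            loss.backward()                       # back-propagate
            optimizer.step()                      # update model parameters
    return keypoint_model
```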
After the model training is completed, the model has the capability of predicting the two-dimensional position information of the first foot key point. In step S104, the region image obtained in step S102 may be input to the trained foot keypoint detection model to obtain two-dimensional position information of the first foot keypoint of the foot.
S106, determining the three-dimensional pose of the foot in the three-dimensional space based on the mapping relation between the preset position information and the two-dimensional position information of a second foot key point corresponding to the first foot key point in the preset foot three-dimensional model in the three-dimensional space.
The three-dimensional space refers to a three-dimensional space in which the foot needs to be projected. The space is typically set according to business requirements.
The preset foot three-dimensional model can be a three-dimensional model which is maintained in advance according to business requirements. The predetermined three-dimensional model of the foot may include a plurality of foot keypoints.
The second foot key point is the key point, among the foot key points included in the preset three-dimensional foot model, that lies in the same foot region as the first foot key point. The foot region may be any region of the foot; for example, the big toe region.
The preset position information may indicate three-dimensional position information of the second foot key point in the three-dimensional space when the preset foot three-dimensional model is in a standard pose. In some embodiments, the preset position information may be three-dimensional coordinates of the second foot keypoint in the three-dimensional space when the three-dimensional model of the foot is in a standard pose.
The standard pose may be a pose preset according to business requirements. In some embodiments, the standard pose may be defined with the foot model located at the origin of the three-dimensional coordinate system corresponding to the three-dimensional space and aligned with the plane spanned by the X and Y axes of that coordinate system. When the foot three-dimensional model is in the standard pose, the three-dimensional coordinates of the second foot key points are the preset position information.
The three-dimensional pose is used for indicating the posture of the foot in the three-dimensional space in the image to be processed. In some embodiments, the three-dimensional pose may include the amount of translation and rotation of the foot in three directions relative to the X, Y, Z axis in the three-dimensional coordinate system.
In step S106, a pose estimation algorithm may be applied to the foot based on the mapping relationship between the two-dimensional position information (usually two-dimensional coordinates) of the first foot key points obtained in step S104 and the preset position information, so as to obtain the three-dimensional pose of the foot. The three-dimensional mapping algorithm may be, for example, PnP (Perspective-n-Point). The specific algorithm used to solve the mapping relationship is not limited in this application.
Taking the PnP algorithm as an example, it takes two inputs: the two-dimensional coordinates of the first foot key points in the image to be processed, and the three-dimensional coordinates of the corresponding second foot key points in the three-dimensional space. From the mapping between the two-dimensional coordinates of each first foot key point and the three-dimensional coordinates of its corresponding second foot key point, the algorithm recovers the three-dimensional pose of the foot in the three-dimensional space.
After obtaining the two-dimensional coordinates of the first foot keypoint, the three-dimensional coordinates of the second foot keypoint in the three-dimensional space may be obtained. And then inputting the three-dimensional coordinates and the two-dimensional coordinates into a solving formula corresponding to the PNP algorithm to obtain the three-dimensional pose of the foot in the three-dimensional space.
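A sketch of this step using OpenCV's PnP solver follows. The camera intrinsic matrix is an assumption (the application does not say how intrinsics are obtained), and at least four correspondences are needed, which matches the requirement above that the first foot key points number no less than four.

```python
import numpy as np
import cv2

def solve_foot_pose(points_2d, points_3d, camera_matrix):
    """Recover the foot's 3D pose from keypoint correspondences.

    points_2d: (N, 2) first foot keypoints in the image to be processed.
    points_3d: (N, 3) preset coordinates of the corresponding second foot
               keypoints on the standard-pose 3D foot model, N >= 4.
    Returns a rotation vector and a translation vector (the 3D pose).
    """
    ok, rvec, tvec = cv2.solvePnP(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        camera_matrix,
        None,                      # assume an undistorted camera
    )
    if not ok:
        raise RuntimeError("PnP failed to find a pose")
    return rvec, tvec              # rotation/translation relative to the X, Y, Z axes
```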
In the solution provided in the foregoing embodiment, a foot key point detection model may be used to perform key point detection on a foot region image corresponding to a foot, so as to obtain two-dimensional position information of a first foot key point; and then, based on the mapping relationship between the preset position information and the two-dimensional position information of a second foot key point in the three-dimensional space, which corresponds to the first foot key point, in the preset foot three-dimensional model, the three-dimensional pose of the foot in the three-dimensional space can be determined. Compared with the human key point detection technology, the two-dimensional position information of the first foot key point capable of carrying out pose estimation can be obtained by utilizing neural network regression, and therefore the three-dimensional pose of the foot in the three-dimensional space can be conveniently determined based on the mapping relation.
In some embodiments, the image processing method illustrated herein may also identify the type of foot, i.e., whether the foot is a left foot or a right foot. The method obtains the type of the foot in the object detection result through S21, besides the detection frame of the foot in the image to be processed.
In some embodiments, in performing S21, the object detection model may be used to perform object detection on the image to be processed, so as to obtain the detection frame of the foot included in the image to be processed and the type of the foot. In this way, on the basis of detecting the foot detection frame in the image to be processed, the object detection model can also distinguish the left foot from the right foot.
The object detection model comprises a neural network model obtained by training on the basis of a plurality of training samples marked with detection frames and type information corresponding to feet; the type indication foot is either a left foot or a right foot.
The object detection model may be a model constructed based on R-CNN (Region-based Convolutional Neural Network), Fast R-CNN, or Faster R-CNN.
In some embodiments, the object detection model may be trained. Specifically, a plurality of training samples labeled with detection frame information and type information (i.e., left foot or right foot) corresponding to the foot may be obtained. The model may then be supervised trained based on the training samples until the model converges.
After the training is finished, the detection frame of the foot and the detection of the foot type in the image can be carried out by using the object detection model.
In some embodiments, after performing S102, S103 may be further performed, in response to that the foot is of a preset type, performing a flipping process on the region image to make the types of the feet in the region images input to the foot keypoint detection model consistent, so as to ensure that the feet in the region images input to the foot keypoint detection model are of the same type (i.e., left foot or right foot), and facilitate the processing of the foot keypoint detection model.
The preset type can be set according to the service requirement. In some embodiments, a left foot sample may be used for training when training the foot keypoint detection model. The preset type may be set to the right foot at this time. In step S103, if the foot is identified as a right foot, the region image corresponding to the foot is flipped, so that the foot in the region image is changed into a left foot type, which is convenient for the foot key point detection model to process.
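The flip in S103 is a horizontal mirror. Below is a sketch, assuming the model was trained on left-foot samples (so the preset type is the right foot) and that keypoints predicted on a mirrored region must be mirrored back afterwards:

```python
import numpy as np

def normalize_foot_side(region, foot_type, preset_type="right"):
    """Mirror the region image when the foot is of the preset type, so every
    region fed to the foot keypoint detection model looks like a left foot."""
    if foot_type == preset_type:
        return np.fliplr(region).copy(), True     # flipped; un-flip keypoints later
    return region, False

def unflip_keypoints(keypoints, width, flipped):
    """Map keypoints predicted on a mirrored region back to the original image."""
    if flipped:
        keypoints = keypoints.copy()
        keypoints[:, 0] = (width - 1) - keypoints[:, 0]   # mirror the x coordinates
    return keypoints
```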
In some embodiments, the present application entails image processing of images in a captured video stream. To capture the same foot in the video stream, foot tracking is required. In some embodiments, the object detection model may be used to perform object detection on each frame of image in the video stream, obtain a foot in the video stream, and then determine the same foot in the video stream according to the detected foot position for foot tracking.
This tracking approach, however, requires object detection on every frame. Because the object detection model has a complex structure, the computation and overhead are large, so the tracking efficiency is low and the real-time performance of the disclosed image processing method may suffer.
In order to solve the foregoing problems, in some embodiments, the feature that the position of the foot in the adjacent frame images does not change significantly may be utilized to track the foot in the video stream, so as to reduce the amount of computation caused by performing foot tracking through the object detection model, reduce overhead, and improve foot tracking efficiency, thereby improving the real-time performance of the image processing method.
Referring to fig. 4, fig. 4 is a schematic flow chart of a foot tracking method according to the present application. As shown in FIG. 4, in implementing foot tracking, S41-S43 may be performed.
In step S41, an area image corresponding to the foot in the image to be processed is acquired.
The image to be processed is an image in a video stream. The image to be processed may be a first frame image or an image after the first frame image in a certain video stream.
If the image to be processed is the first frame image in the video stream, the region image can be obtained through the steps of S21-S22.
If the image to be processed is an image after the first frame image, in step S41, the first foot key point position information of the foot in the image of the frame before the image to be processed in the video stream may be obtained, and the foot frame may be determined according to the obtained position information. And obtaining a region image corresponding to the foot part based on the foot part frame and the image to be processed.
Therefore, the area image at the same position as the foot position in the previous frame image in the image to be processed can be obtained.
The foot frame may be a frame of any shape. In some embodiments, the foot frame may be a rectangular frame. In performing S41, the cached two-dimensional position information of the first foot key points in the previous frame image may be acquired. From this position information, the minimum coordinates (x0, y0) and maximum coordinates (x1, y1) along the X and Y axes can be determined. A rectangular frame with the four vertices (x0, y0), (x0, y1), (x1, y0), and (x1, y1) may then be used as the foot frame.
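In other words, the foot frame is the axis-aligned bounding box of the cached keypoints. A sketch follows; the small padding margin is an assumption, added so the foot does not drift out of the frame between adjacent frames.

```python
import numpy as np

def foot_frame_from_keypoints(prev_keypoints, pad_ratio=0.2):
    """Build the foot frame for the current frame from the previous frame's
    first foot keypoint coordinates, given as an (N, 2) array."""
    x0, y0 = prev_keypoints.min(axis=0)    # minimum coordinates along X and Y
    x1, y1 = prev_keypoints.max(axis=0)    # maximum coordinates along X and Y
    pad_x = (x1 - x0) * pad_ratio          # margin for inter-frame motion (assumed)
    pad_y = (y1 - y0) * pad_ratio
    return (x0 - pad_x, y0 - pad_y, x1 + pad_x, y1 + pad_y)
```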
After the foot frame is obtained, the foot frame and the image to be processed (or a feature map obtained by extracting features from the image to be processed with a backbone network) may be input to the region feature extraction unit to obtain the region image.
The region feature extraction unit may be an ROI Align (Region of Interest Align) unit or an ROI Pooling (Region of Interest Pooling) unit. The unit crops out the region enclosed by the foot frame in the image to be processed.
In some embodiments, after obtaining the region image corresponding to the foot based on the foot frame and the image to be processed, the stored type of the foot in the previous frame image may also be acquired; and in response to the acquired type being a preset type, the region image is flipped so that the types of the feet in the region images input to the foot key point detection model are consistent. This ensures that the feet in the region images input to the foot key point detection model are of the same type (i.e., left foot or right foot), which facilitates processing by the model.
And S42, classifying the area images by using an image classification model to obtain a classification result of the area images.
And the classification result indicates whether the region image is the foot image.
The image classification model may include a convolutional neural network. When the model is trained, a plurality of image samples labeled with image classification information can be obtained. The image classification information indicates whether the image sample is a foot image. The model may then be supervised trained based on the image samples.
After the training is finished, the image classification model can be used for carrying out image classification on the region images to obtain a classification result.
S43, in response to the classification result indicating that the region image is a foot image, determining that the foot in the region image is the same foot as the foot in the previous frame image for foot tracking.
Since the region image obtained at S41 is a region image in the image to be processed at the same position as the foot position in the previous frame image, if the classification result of the region image indicates that it is a foot image, it can be said that a foot is also present at the same position as the foot position in the previous frame image in the image to be processed. According to the characteristic that the position of the same foot in the adjacent frames of images does not change obviously, the foot in the area image and the foot in the previous frame of image can be determined to be the same foot. Therefore, the foot tracking in the video stream is realized, the calculation amount caused by the foot tracking through the object detection model is reduced, the overhead is reduced, the foot tracking efficiency is improved, and the real-time performance of the image processing method is improved.
In the process of foot tracking, S104 may be further performed, and the key point detection model is used to perform key point detection on the region image, so as to obtain two-dimensional position information of the first foot key point of the foot. S104 will not be described in detail here.
After the foot tracking is completed, in step S106, in response to determining that the foot in the region image is the same foot as the foot in the previous frame image, a three-dimensional pose of the foot in the three-dimensional space may be determined based on a mapping relationship between preset position information and the two-dimensional position information of a second foot key point in a preset foot three-dimensional model, which corresponds to the first foot key point, in the three-dimensional space. The step information of the pose estimation is not described in detail here.
If the classification result indicates that the region image is not a foot image, it may be determined that foot tracking has failed. In that case, the image to be processed is treated as a tracking-failure image in the video: the region image is re-acquired through S21-S22, and the three-dimensional pose of the foot is then obtained through S104 and S106.
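Putting S41-S43 and the fallback together, the per-frame control flow can be sketched as follows. The three callables wrap the image classification model, the foot keypoint detection model, and the S21-S22 detection path; their interfaces are assumptions for illustration.

```python
def track_foot(frame, prev_keypoints, classify, detect_keypoints, detect_and_crop):
    """One tracking step: reuse the previous frame's keypoints when the
    classifier confirms a foot, otherwise fall back to object detection.

    prev_keypoints: (N, 2) keypoints cached from the previous frame, or None.
    Returns the first foot keypoints for the current frame.
    """
    if prev_keypoints is not None:
        box = foot_frame_from_keypoints(prev_keypoints)   # sketched above
        region = crop_foot_region(frame, box)             # same crop as S21-S22
        if classify(region):                  # S42: region is still a foot image
            return detect_keypoints(region)   # S43: same foot, tracking continues
    # first frame, or tracking failure: run the heavier object detector
    return detect_keypoints(detect_and_crop(frame))
```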
In some embodiments, the image classification model shares a feature extraction network with the foot keypoint detection model. Such as a backbone network.
The image classification model and the foot key point detection model can be trained in a joint training mode, so that the capability of extracting features by a feature extraction network can be improved, feature information beneficial to classification and key point detection is extracted, and the classification and key point detection effects are improved.
In some embodiments, a first image sample labeled with image classification information and a second image sample labeled with first foot keypoint location information may be obtained. The first image sample and the second image sample may be the same image.
Then, the first image sample can be input into the image classification model to obtain a classification prediction result, and first loss information is obtained according to the classification prediction result and labeled image classification information; and inputting the second image sample into the foot key point detection model to obtain a first foot key point position prediction result, and obtaining second loss information according to the first foot key point position prediction result and the marked first foot key point position information.
Model parameters of the image classification model and the foot key point detection model can be adjusted based on the first loss information and the second loss information. In some embodiments, the aforementioned training steps may be performed in multiple iterations until the image classification model converges with the foot keypoint detection model.
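A sketch of the shared-backbone architecture and the joint objective follows; the linear heads, the pooled feature vector, and the loss weights are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FootNet(nn.Module):
    """Shared feature extraction network with an image classification head
    (foot / not foot) and a keypoint regression head (K x 2 coordinates)."""
    def __init__(self, backbone, feat_dim, num_keypoints):
        super().__init__()
        self.backbone = backbone                                # shared feature extractor
        self.cls_head = nn.Linear(feat_dim, 2)                  # classification branch
        self.kpt_head = nn.Linear(feat_dim, num_keypoints * 2)  # keypoint branch

    def forward(self, x):
        feat = self.backbone(x)          # assumed to pool to (batch, feat_dim)
        return self.cls_head(feat), self.kpt_head(feat)

def joint_loss(model, cls_batch, kpt_batch, w_cls=1.0, w_kpt=1.0):
    """First loss from the classification samples, second loss from the
    keypoint samples; both gradients reach the shared backbone."""
    imgs_c, labels = cls_batch
    imgs_k, kpts = kpt_batch                       # kpts: (batch, K, 2)
    logits, _ = model(imgs_c)
    _, pred_kpts = model(imgs_k)
    loss_cls = F.cross_entropy(logits, labels)                 # first loss information
    loss_kpt = F.smooth_l1_loss(pred_kpts, kpts.flatten(1))    # second loss information
    return w_cls * loss_cls + w_kpt * loss_kpt
```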
Because the image classification model and the foot key point detection model share the feature extraction network, when any one of the models is trained, the training of the other model is affected, so that the training of the two models can be mutually supplemented and promoted, and the joint training of the image classification model and the foot key point detection model is realized. Through the combined training, on one hand, the classification and key point detection effects can be improved by improving the capability of extracting the feature information beneficial to classification and key point detection of the feature extraction network; on the other hand, the model training efficiency can be improved through mutual supplement and promotion between the two model training.
In some embodiments, after S106 is completed and the three-dimensional pose of the foot in the three-dimensional space in the image to be processed is obtained, virtual shoe fitting may be performed.
In some embodiments, a three-dimensional virtual model of the footwear material may be obtained first. And then, based on the three-dimensional pose of the foot in the image to be processed, superposing the three-dimensional virtual model at the position corresponding to the foot in the image to be processed to obtain the augmented reality AR effect of the virtual shoe fitting.
The three-dimensional virtual model may be used to indicate the outline and/or texture color, etc. of the shoe material. In some embodiments, the three-dimensional virtual model may include three-dimensional coordinates of vertices of the footwear material in three-dimensional space, and pixel values of the vertices.
By superimposing the three-dimensional virtual model on the position corresponding to the foot, the shoe material can be displayed in the image to be processed, achieving the augmented reality AR effect of the virtual shoe fitting.
Please refer to fig. 5, fig. 5 is a schematic view illustrating a virtual shoe fitting process according to the present application.
As shown in FIG. 5, the method may include S51-S54.
And S51, acquiring the initial pose of the three-dimensional virtual model corresponding to the shoe material in the three-dimensional space.
The initial pose indicates the posture information of the shoe material in the three-dimensional space. In some embodiments, the initial pose of the shoe material in the three-dimensional space may be maintained in advance in a database; in performing S51, the initial pose is acquired from the database.
And S52, converting the initial pose to be matched with the three-dimensional pose corresponding to the foot based on the pose conversion information indicated by the three-dimensional pose to obtain the converted three-dimensional pose.
The three-dimensional pose of the foot may be indicative of pose transformation information of the foot in a three-dimensional coordinate system. The pose conversion information may indicate the amount of translation and rotation of the foot with respect to the X, Y, and Z axes of a three-dimensional coordinate system.
The converted three-dimensional pose may indicate pose information after the shoe material is converted to the pose of the foot.
In some embodiments, the pose transformation information may include translation and rotation information. In performing S52, the initial pose may be translated and rotated using the pose transformation information, yielding the pose information of the shoe material after it has been moved into the pose of the foot.
And S53, mapping the three-dimensional virtual model to the image to be processed based on the converted three-dimensional pose to obtain a two-dimensional virtual material corresponding to the shoe material.
In S53, the three-dimensional coordinates of each vertex in the three-dimensional virtual model may be adjusted based on the converted three-dimensional pose, so as to obtain an adjusted three-dimensional virtual model. And then projecting the three-dimensional coordinates of each vertex in the adjusted three-dimensional virtual model to a two-dimensional plane where the image to be processed is located by utilizing a projection algorithm to obtain the two-dimensional coordinates of each vertex. The shape corresponding to the two-dimensional virtual material corresponding to the shoe material can be determined according to the two-dimensional coordinates of each vertex, and the texture color corresponding to the two-dimensional virtual material can be determined according to the pixel value of each vertex (obtained through the pixel value of each vertex in the three-dimensional virtual model), so that the two-dimensional virtual material is obtained.
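S52-S53 can be sketched with OpenCV: applying the foot's rotation/translation to the model vertices converts the initial pose, and projecting the converted vertices gives the shape of the 2D virtual material. The camera intrinsics are again an assumption.

```python
import numpy as np
import cv2

def project_shoe_model(vertices, rvec, tvec, camera_matrix):
    """Map the shoe's 3D virtual model into the image to be processed.

    vertices:   (V, 3) vertex coordinates of the 3D virtual model in its
                initial (standard) pose.
    rvec, tvec: the foot's 3D pose from the PnP step; applying them performs
                the initial-to-converted pose transformation of S52.
    Returns (V, 2) pixel coordinates of the 2D virtual material (S53).
    """
    pts_2d, _ = cv2.projectPoints(
        vertices.astype(np.float64), rvec, tvec, camera_matrix, None)
    return pts_2d.reshape(-1, 2)
```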
And S54, carrying out image fusion on the two-dimensional virtual material and the position corresponding to the foot part to obtain a fused image so as to display the AR effect of the virtual shoe fitting.
In some embodiments, image fusion can be completed by covering the pixels at the position corresponding to the foot with the pixels of the two-dimensional virtual material, or by adjusting the transparency of the pixels at that position, and so on, to obtain the fused image. The AR effect of the virtual shoe fitting can then be displayed by outputting the fused image.
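A sketch of the pixel-level fusion in S54, assuming the 2D virtual material has been rendered into an image-aligned RGB layer with an alpha mask (the transparency-based variant mentioned above):

```python
import numpy as np

def fuse_shoe_material(image, material_rgb, alpha):
    """Overlay the rendered 2D virtual material at the foot's position.

    image:        H x W x 3 uint8 image to be processed.
    material_rgb: H x W x 3 rendered shoe material, aligned with the image.
    alpha:        H x W mask in [0, 1]; 1 means the shoe covers the pixel.
    """
    a = alpha[..., None].astype(np.float32)
    fused = a * material_rgb.astype(np.float32) + (1.0 - a) * image.astype(np.float32)
    return fused.astype(np.uint8)                  # the fused image for AR display
```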
The following examples are described in conjunction with a virtual shoe fitting scenario.
The virtual shoe fitting scene can be completed by fusing (e.g., image rendering) the feet captured by the video stream with shoes in a three-dimensional space.
The virtual shoe fitting client can run on a mobile terminal. The mobile terminal is equipped with an ordinary camera for collecting video streams in real time.
The virtual shoe library can be installed locally on the mobile terminal (hereinafter, the terminal) or on the server corresponding to the virtual shoe fitting client (hereinafter, the client). The virtual shoe library may include the three-dimensional virtual models of the shoe materials. It may be any type of database.
The user can select shoes to try on from the virtual shoe library in the virtual shoe fitting client, and collect the foot video stream through the camera.
Referring to fig. 6, fig. 6 is a schematic flow chart of a virtual shoe fitting method according to the present application. As shown in fig. 6, the method may include S601-S611.
S601 may be performed for a first image in the video stream, where the first image is either the first frame of the video stream or an image for which foot tracking was determined to have failed in steps S607-S610. In S601, object detection is performed on the first image using a pre-trained object detection model, yielding the detection frame of the foot in the first image and the type of the foot, i.e., whether it is a left or right foot. In the following, the foot is assumed to be the right foot.
S602, acquiring a first area image of the foot in the first image according to the foot detection frame.
S603, in response to the foot being the right foot, the first area image is flipped so that it becomes a left-foot image, which facilitates detection by the foot key point detection model. The foot key point detection model is trained on left-foot image samples annotated with the position information of the first foot key points.
The first foot keypoints comprise keypoints of at least part of the following foot regions: the tip of the big toe; the medial forefoot joint; the medial arch of the foot; the inner side of the rear sole; the rear of the heel; the outer side of the rear sole; the lateral forefoot joint; the junction of the instep and the leg; the medial ankle joint; the rear foot tendon; the lateral ankle joint. In this way the foot contour can be described at fine granularity, carrying a large amount of three-dimensional foot information and improving pose estimation accuracy.
S604, obtaining the position information of the first foot key points of the foot using the foot key point detection model.
S605, obtaining the three-dimensional pose of the foot in three-dimensional space from the position information of the first foot key points using a PnP (Perspective-n-Point) algorithm.
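Since a PnP algorithm is named here, S605 could be realized along the lines of OpenCV's solvePnP, as sketched below. The ordering correspondence between the detected 2-D key points and the model's 3-D key points, as well as the camera matrix, are assumptions.

```python
import cv2
import numpy as np

def estimate_foot_pose(keypoints_2d, model_points_3d, camera_matrix):
    """model_points_3d: (N, 3) second foot key points of the preset foot
    three-dimensional model, ordered to match keypoints_2d (N >= 4)."""
    ok, rvec, tvec = cv2.solvePnP(
        model_points_3d.astype(np.float64),
        keypoints_2d.astype(np.float64),
        camera_matrix, np.zeros(5),            # assume no lens distortion
        flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed")
    return rvec, tvec  # rotation (Rodrigues vector) and translation of the foot
```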
S606, acquiring the three-dimensional virtual model of the shoe corresponding to the right foot from the virtual shoe library, and fusing the shoe with the foot according to the acquired three-dimensional virtual model and the three-dimensional pose of the foot, so as to display the virtual shoe fitting effect. For the details of S606, refer to S51-S54, which are not repeated here.
S607 may be performed for a second image in the video stream, where the second image may be any image after the first frame. In S607, the position information of the first foot key points in the frame preceding the second image is acquired, and a foot frame is determined based on that position information.
S608, cropping a second area image from the second image according to the foot frame.
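A sketch of how S607-S608 might derive the foot frame from the previous frame's key points and crop the second area image; the 20% expansion margin is an illustrative assumption rather than a value given in this application.

```python
import numpy as np

def crop_tracking_region(frame, prev_keypoints, margin=0.2):
    """prev_keypoints: (N, 2) first foot key points of the previous frame."""
    x0, y0 = prev_keypoints.min(axis=0)
    x1, y1 = prev_keypoints.max(axis=0)
    dx, dy = (x1 - x0) * margin, (y1 - y0) * margin  # absorb inter-frame motion
    h, w = frame.shape[:2]
    x0, y0 = int(max(0, x0 - dx)), int(max(0, y0 - dy))
    x1, y1 = int(min(w, x1 + dx)), int(min(h, y1 + dy))
    return frame[y0:y1, x0:x1], (x0, y0)  # crop plus its offset in the frame
```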
S609, in response to the foot being a right foot, flipping the second area image so that it becomes a left-foot image, which facilitates detection by the foot tracking model. The foot tracking model includes a classification branch and a key point detection branch, and is obtained by joint training on left-foot image samples labeled with the position information of the first foot key points and left-foot image samples labeled with image classification information.
S610, determining whether the second area image is a foot image using the classification branch of the pre-trained foot tracking model, and determining the position information of the first foot key points of the foot using the key point detection branch of the pre-trained foot tracking model.
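The two-branch structure described in S609-S610 could be realized along the following PyTorch lines; the backbone layers and sizes are illustrative assumptions, with the default of 11 key points matching the regions listed above.

```python
import torch.nn as nn

class FootTracker(nn.Module):
    """Shared feature extractor feeding a classification branch
    (foot / not foot) and a key point detection branch."""
    def __init__(self, num_keypoints=11):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.backbone = nn.Sequential(                    # shared features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.cls_head = nn.Linear(64, 2)                  # classification branch
        self.kpt_head = nn.Linear(64, num_keypoints * 2)  # key point branch

    def forward(self, x):
        feat = self.backbone(x)
        return (self.cls_head(feat),
                self.kpt_head(feat).view(-1, self.num_keypoints, 2))
```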
If the second area image is a foot image, the foot in the second image can be determined to be the same foot that appeared in the previous frame image; foot tracking is thus completed, and S605 and S606 are executed to display the virtual shoe fitting effect.
If the second area image is not a foot image, foot tracking has failed, and S601-S606 may be performed with the second image treated as a first image, so as to re-detect the foot and obtain its three-dimensional pose.
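Putting the branches of fig. 6 together, the per-frame control flow might look like the sketch below; `detect_foot`, `track_foot`, and `render_try_on` are hypothetical helpers standing in for S601-S604, S607-S610, and S605-S606 respectively.

```python
def run_virtual_try_on(video_stream, detect_foot, track_foot, render_try_on):
    """track_foot returns None when the classification branch rejects the crop."""
    prev = None
    for frame in video_stream:
        result = track_foot(frame, prev) if prev is not None else detect_foot(frame)
        if result is None and prev is not None:
            result = detect_foot(frame)   # tracking failed: treat as a first image
        if result is None:
            prev = None                   # no foot found; retry on the next frame
            continue
        prev = result                     # carry key points to the next frame
        render_try_on(frame, result)      # S605-S606: PnP pose + shoe fusion
```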
In one aspect of the foregoing solution, the position information of the first foot key points can be regressed using the foot key point detection model; the three-dimensional pose of the foot can then be obtained from the two-dimensional-to-three-dimensional mapping of the first foot key points; and finally, based on the three-dimensional pose of the foot in the image to be processed, the three-dimensional virtual model corresponding to the shoe can be superimposed at the position corresponding to the foot in the first image or the second image, obtaining the augmented reality (AR) effect of the virtual shoe fitting.
In another aspect, the fact that the position of the foot changes little between adjacent frames can be exploited to track the foot through the video stream. This reduces the computation that tracking via the object detection model would require, lowering overhead and improving foot tracking efficiency, and thereby improving the real-time performance of the virtual shoe fitting method.
In accordance with the foregoing embodiments, the present application proposes an image processing apparatus 70.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an image processing apparatus according to the present application.
As shown in fig. 7, the apparatus 70 may include:
a first obtaining module 71, configured to obtain an area image corresponding to a foot in an image to be processed;
a key point detection module 72, configured to perform key point detection on the region image by using a foot key point detection model, so as to obtain two-dimensional position information of a first foot key point of the foot;
the determining module 73 is configured to determine a three-dimensional pose of the foot in the three-dimensional space based on a mapping relationship between preset position information and the two-dimensional position information of a second foot key point, corresponding to the first foot key point, in a preset foot three-dimensional model in the three-dimensional space.
In some embodiments, the first obtaining module 71 is specifically configured to:
carrying out object detection on the image to be processed by using an object detection model to obtain an object detection result, wherein the object detection result comprises a detection frame of a foot part in the image to be processed;
and obtaining a region image corresponding to the foot according to the detection frame and the image to be processed.
In some embodiments, the object detection result further includes the type of the foot; the type is used for indicating whether the foot is a left foot or a right foot;
the apparatus 70 further comprises:
and the first overturning module is used for responding to the foot as a preset type and overturning the region image so as to enable the types of the foot in the region image input into the foot key point detection model to be consistent.
In some embodiments, the first obtaining module 71 is configured to:
acquiring the position information of a first foot key point of a foot in a previous frame image of the image to be processed in the video stream, and determining a foot frame according to the acquired position information;
and obtaining a region image corresponding to the foot part based on the foot part frame and the image to be processed.
In some embodiments, the apparatus 70 further comprises:
the second acquisition module is used for acquiring the type of the foot in the stored previous frame image;
and the second overturning module is used for responding to the acquired type as a preset type and overturning the region image so as to enable the types of the feet in the region image of the input foot key point detection model to be consistent.
In some embodiments, the image to be processed is an image in a video stream, and the video stream further includes a previous frame image before the image to be processed; the region image of the image to be processed is determined according to a foot frame enclosed by the first foot key points in the previous frame image;
the apparatus 70 further comprises:
the tracking module is used for classifying the regional images by using an image classification model to obtain a classification result of the regional images; the classification result is used for indicating whether the region image is a foot image;
and in response to the classification result indicating that the region image is a foot image, determining that the foot in the region image is the same foot as the foot in the previous frame image for foot tracking.
In some embodiments, the determining module 73 is specifically configured to:
in response to determining that the foot in the region image is the same foot as the foot in the previous frame image, determining a three-dimensional pose of the foot in a three-dimensional space based on a mapping relation between preset position information and the two-dimensional position information of a second foot key point, corresponding to the first foot key point, in the preset foot three-dimensional model in the three-dimensional space.
In some embodiments, the image classification model shares a feature extraction network with the foot keypoint detection model; the apparatus 70 further comprises:
the joint training module of the image classification model and the foot key point detection model is used for acquiring a first image sample marked with image classification information and a second image sample marked with position information of a first foot key point;
inputting the first image sample into the image classification model to obtain a classification prediction result, and obtaining first loss information according to the classification prediction result and labeled image classification information;
inputting the second image sample into the foot key point detection model to obtain a first foot key point position prediction result, and obtaining second loss information according to the first foot key point position prediction result and the marked first foot key point position information;
and adjusting model parameters of the image classification model and the foot key point detection model based on the first loss information and the second loss information.
In some embodiments, the first foot key points comprise a plurality of key points on the foot edge contour, and the number of the first foot key points is not less than four; at least four 2D-3D correspondences are commonly required for an unambiguous PnP solution.
In some embodiments, the first foot keypoints comprise keypoints of at least one of the following regions:
the tip of the big toe; the medial forefoot joint; the medial arch of the foot; the medial side of the rear sole; the rear of the heel; the lateral side of the rear sole; the lateral forefoot joint; the junction of the front instep and the leg; the medial ankle joint; the rear foot tendon; the lateral ankle joint.
In some embodiments, the apparatus 70 further comprises:
the virtual shoe fitting module is used for acquiring a three-dimensional virtual model of the shoe material;
and superposing the three-dimensional virtual model at the position corresponding to the foot in the image to be processed based on the three-dimensional pose of the foot in the image to be processed to obtain the augmented reality AR effect of the virtual shoe fitting.
In some embodiments, the virtual shoe fitting module is specifically configured to:
acquiring an initial pose of a three-dimensional virtual model corresponding to a shoe material in the three-dimensional space;
converting the initial pose to match the three-dimensional pose corresponding to the foot, based on pose conversion information indicated by the three-dimensional pose, to obtain a converted three-dimensional pose (a sketch of this conversion follows this list);
mapping the three-dimensional virtual model to the image to be processed based on the converted three-dimensional pose to obtain a two-dimensional virtual material corresponding to the shoe material;
and carrying out image fusion on the two-dimensional virtual material and the position corresponding to the foot part to obtain a fused image so as to display the AR effect of the virtual shoe fitting.
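As referenced in the list above, the pose-conversion step can be pictured with 4x4 homogeneous transforms, as in the NumPy sketch below; all matrix names are illustrative assumptions.

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def convert_pose(T_initial, T_foot):
    """Pose conversion information: the rigid transform carrying the model
    from its initial pose into the foot's three-dimensional pose."""
    delta = T_foot @ np.linalg.inv(T_initial)
    return delta @ T_initial            # the converted three-dimensional pose

def transform_vertices(vertices, T):
    """Apply a 4x4 transform to an (N, 3) array of model vertices."""
    v_h = np.hstack([vertices, np.ones((len(vertices), 1))])
    return (v_h @ T.T)[:, :3]
```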
The embodiment of the image processing apparatus shown in the present application can be applied to an electronic device. Accordingly, the present application discloses an electronic device, which may comprise: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to call the executable instructions stored in the memory to implement the image processing method shown in any one of the foregoing embodiments.
Referring to fig. 8, fig. 8 is a schematic diagram of a hardware structure of an electronic device shown in the present application.
As shown in fig. 8, the electronic device may include a processor for executing instructions, a network interface for making network connections, a memory for storing operation data for the processor, and a non-volatile memory for storing instructions corresponding to the image processing apparatus.
The embodiments of the apparatus may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the apparatus, as a logical device, is formed by the processor of the electronic device in which it is located reading the corresponding computer program instructions from the non-volatile memory into memory and running them. In terms of hardware, in addition to the processor, memory, network interface, and non-volatile memory shown in fig. 8, the electronic device in which the apparatus is located may also include other hardware according to the actual function of the electronic device, which is not described again here.
It is to be understood that, to increase processing speed, the instructions corresponding to the apparatus may also be stored directly in the memory, which is not limited herein.
The present application proposes a computer-readable storage medium storing a computer program which can be used to cause a processor to execute the image processing method shown in any of the foregoing embodiments.
One skilled in the art will recognize that one or more embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
"and/or" as recited herein means having at least one of two, for example, "a and/or B" includes three scenarios: A. b, and "A and B".
The embodiments in the present application are described in a progressive manner; the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for the image processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple; for the relevant points, reference may be made to the description of the method embodiment.
Specific embodiments of the present application have been described. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this application may be implemented in the following: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this application and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this application can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing system. Generally, a central processing system will receive instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing system for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Although this application contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or of what may be claimed, but rather as merely describing features of particular disclosed embodiments. Certain features that are described in this application in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the described embodiments is not to be understood as requiring such separation in all embodiments, and it is to be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present application and is not intended to limit the present application to the particular embodiments of the present application, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present application and are intended to be included within the scope of the present application.

Claims (17)

1. An image processing method, characterized in that the method comprises:
acquiring a region image corresponding to the foot in the image to be processed;
performing key point detection on the region image by using a foot key point detection model to obtain two-dimensional position information of a first foot key point of the foot;
and determining the three-dimensional pose of the foot in the three-dimensional space based on the mapping relation between the preset position information and the two-dimensional position information of a second foot key point corresponding to the first foot key point in a preset foot three-dimensional model in the three-dimensional space.
2. The method according to claim 1, wherein the acquiring of the region image corresponding to the foot in the image to be processed comprises:
carrying out object detection on the image to be processed by using an object detection model to obtain an object detection result, wherein the object detection result comprises a detection frame of a foot part in the image to be processed;
and obtaining a region image corresponding to the foot according to the detection frame and the image to be processed.
3. The method of claim 2, wherein the object detection result further includes a type of the foot; the type is used for indicating whether the foot is a left foot or a right foot;
after obtaining the region image corresponding to the foot according to the detection frame and the image to be processed, the method further comprises the following steps:
and responding to the foot as a preset type, and turning the region image to enable the types of the foot in the region image input into the foot key point detection model to be consistent.
4. The method according to any one of claims 1 to 3, wherein the image to be processed is an image in a video stream; the acquiring of the region image corresponding to the foot in the image to be processed includes:
acquiring the position information of a first foot key point of a foot in a previous frame image of the image to be processed in the video stream, and determining a foot frame according to the acquired position information;
and obtaining a region image corresponding to the foot part based on the foot part frame and the image to be processed.
5. The method according to claim 4, further comprising, after obtaining the region image corresponding to the foot based on the foot frame and the image to be processed, the step of:
acquiring the type of the foot in the stored previous frame image;
and in response to the fact that the acquired type is a preset type, turning the region image to enable the types of the feet in the region image of the input foot key point detection model to be consistent.
6. The method according to any one of claims 1 to 5, wherein the image to be processed is an image in a video stream, and the video stream further includes a previous frame image before the image to be processed; determining the regional image of the image to be processed according to a foot frame surrounded by first foot key points in the previous frame of image;
before determining the three-dimensional pose of the foot in the three-dimensional space based on the mapping relationship between the preset position information and the two-dimensional position information of a second foot key point corresponding to the first foot key point in a preset foot three-dimensional model in the three-dimensional space, the method further comprises the following steps:
classifying the region images by using an image classification model to obtain a classification result of the region images; the classification result is used for indicating whether the region image is a foot image;
and in response to the classification result indicating that the region image is a foot image, determining that the foot in the region image is the same foot as the foot in the previous frame image for foot tracking.
7. The method according to claim 6, wherein the determining the three-dimensional pose of the foot in the three-dimensional space based on the mapping relationship between the two-dimensional position information and the preset position information of the second foot key point in the preset foot three-dimensional model corresponding to the first foot key point in the three-dimensional space comprises:
in response to determining that the foot in the region image is the same foot as the foot in the previous frame image, determining a three-dimensional pose of the foot in a three-dimensional space based on a mapping relation between preset position information and the two-dimensional position information of a second foot key point, corresponding to the first foot key point, in the preset foot three-dimensional model in the three-dimensional space.
8. The method according to claim 6 or 7, wherein the image classification model shares a feature extraction network with the foot keypoint detection model; the joint training method of the image classification model and the foot key point detection model comprises the following steps:
acquiring a first image sample marked with image classification information and a second image sample marked with the position information of a first foot key point;
inputting the first image sample into the image classification model to obtain a classification prediction result, and obtaining first loss information according to the classification prediction result and labeled image classification information;
inputting the second image sample into the foot key point detection model to obtain a first foot key point position prediction result, and obtaining second loss information according to the first foot key point position prediction result and the marked first foot key point position information;
and adjusting model parameters of the image classification model and the foot key point detection model based on the first loss information and the second loss information.
9. The method of any of claims 1-8, wherein the first foot keypoints comprise a plurality of keypoints on a foot edge contour; the number of the first foot key points is not less than four.
10. The method of claim 9, wherein the first foot keypoints comprise keypoints of at least one of the following regions:
the tip of the big toe; the medial forefoot joint; the medial arch of the foot; the medial side of the rear sole; the rear of the heel; the lateral side of the rear sole; the lateral forefoot joint; the junction of the front instep and the leg; the medial ankle joint; the rear foot tendon; the lateral ankle joint.
11. The method according to any one of claims 1-10, further comprising, after determining the three-dimensional pose of the foot in the three-dimensional space:
acquiring a three-dimensional virtual model of a shoe material;
and superposing the three-dimensional virtual model at the position corresponding to the foot in the image to be processed based on the three-dimensional pose of the foot in the image to be processed to obtain the augmented reality AR effect of the virtual shoe fitting.
12. The method according to claim 11, wherein the overlaying of the three-dimensional virtual model on the position corresponding to the foot in the image to be processed based on the three-dimensional pose of the foot in the image to be processed to obtain the augmented reality effect of the virtual shoe fitting comprises:
acquiring an initial pose of a three-dimensional virtual model corresponding to a shoe material in the three-dimensional space;
converting the initial pose to be matched with the three-dimensional pose corresponding to the foot based on pose conversion information indicated by the three-dimensional pose to obtain a converted three-dimensional pose;
mapping the three-dimensional virtual model to the image to be processed based on the converted three-dimensional pose to obtain a two-dimensional virtual material corresponding to the shoe material;
and carrying out image fusion on the two-dimensional virtual material and the position corresponding to the foot part to obtain a fused image so as to display the AR effect of the virtual shoe fitting.
13. An image processing apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring an area image corresponding to the foot in the image to be processed;
the key point detection module is used for detecting key points of the area image by using the foot key point detection model to obtain two-dimensional position information of a first foot key point of the foot;
the determining module is used for determining the three-dimensional pose of the foot in the three-dimensional space based on the mapping relation between the preset position information and the two-dimensional position information of a second foot key point corresponding to the first foot key point in a preset foot three-dimensional model in the three-dimensional space.
14. The apparatus according to claim 13, wherein the image to be processed is an image in a video stream, and the video stream further includes a previous frame image before the image to be processed; determining the regional image of the image to be processed according to a foot frame surrounded by first foot key points in the previous frame of image;
the device further comprises:
the tracking module is used for classifying the regional images by using an image classification model to obtain a classification result of the regional images; the classification result is used for indicating whether the region image is a foot image;
and in response to the classification result indicating that the region image is a foot image, determining that the foot in the region image is the same foot as the foot in the previous frame image for foot tracking.
15. The apparatus of claim 13 or 14, further comprising:
the virtual shoe fitting module is used for acquiring a three-dimensional virtual model of the shoe material;
and superposing the three-dimensional virtual model at the position corresponding to the foot in the image to be processed based on the three-dimensional pose of the foot in the image to be processed to obtain the augmented reality AR effect of the virtual shoe fitting.
16. An electronic device, characterized in that the device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the image processing method according to any one of claims 1 to 12 by executing the executable instructions.
17. A computer-readable storage medium, characterized in that the storage medium stores a computer program for causing a processor to execute the image processing method according to any one of claims 1 to 12.
CN202110955120.7A 2021-08-19 2021-08-19 Image processing method, device, equipment and storage medium Pending CN113627379A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110955120.7A CN113627379A (en) 2021-08-19 2021-08-19 Image processing method, device, equipment and storage medium
PCT/CN2022/111023 WO2023020327A1 (en) 2021-08-19 2022-08-09 Image processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110955120.7A CN113627379A (en) 2021-08-19 2021-08-19 Image processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113627379A true CN113627379A (en) 2021-11-09

Family

ID=78386695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110955120.7A Pending CN113627379A (en) 2021-08-19 2021-08-19 Image processing method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113627379A (en)
WO (1) WO2023020327A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023020327A1 (en) * 2021-08-19 2023-02-23 上海商汤智能科技有限公司 Image processing

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558865A (en) * 2019-01-22 2019-04-02 郭道宁 A kind of abnormal state detection method to the special caregiver of need based on human body key point
CN110008835A (en) * 2019-03-05 2019-07-12 成都旷视金智科技有限公司 Sight prediction technique, device, system and readable storage medium storing program for executing
CN111242973A (en) * 2020-01-06 2020-06-05 上海商汤临港智能科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111507806A (en) * 2020-04-23 2020-08-07 北京百度网讯科技有限公司 Virtual shoe fitting method, device, equipment and storage medium
CN111931720A (en) * 2020-09-23 2020-11-13 深圳佑驾创新科技有限公司 Method, apparatus, computer device and storage medium for tracking image feature points
CN112241731A (en) * 2020-12-03 2021-01-19 北京沃东天骏信息技术有限公司 Attitude determination method, device, equipment and storage medium
CN112287869A (en) * 2020-11-10 2021-01-29 上海依图网络科技有限公司 Image data detection method and device
CN112614184A (en) * 2020-12-28 2021-04-06 清华大学 Object 6D attitude estimation method and device based on 2D detection and computer equipment
CN112767489A (en) * 2021-01-29 2021-05-07 北京达佳互联信息技术有限公司 Three-dimensional pose determination method and device, electronic equipment and storage medium
CN113239925A (en) * 2021-05-24 2021-08-10 北京有竹居网络技术有限公司 Text detection model training method, text detection method, device and equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5776471B2 (en) * 2011-09-27 2015-09-09 大日本印刷株式会社 Image display system
CN110111415B (en) * 2019-04-25 2023-01-17 上海时元互联网科技有限公司 3D intelligent virtual fitting method and system for shoe product
CN112926364B (en) * 2019-12-06 2024-04-19 北京四维图新科技股份有限公司 Head gesture recognition method and system, automobile data recorder and intelligent cabin
CN112257582A (en) * 2020-10-21 2021-01-22 北京字跳网络技术有限公司 Foot posture determination method, device, equipment and computer readable medium
CN113627379A (en) * 2021-08-19 2021-11-09 北京市商汤科技开发有限公司 Image processing method, device, equipment and storage medium


Also Published As

Publication number Publication date
WO2023020327A1 (en) 2023-02-23

Similar Documents

Publication Publication Date Title
CN109636831B (en) Method for estimating three-dimensional human body posture and hand information
CN108028871B (en) Label-free multi-user multi-object augmented reality on mobile devices
KR102647351B1 (en) Modeling method and modeling apparatus using 3d point cloud
Pradeep et al. MonoFusion: Real-time 3D reconstruction of small scenes with a single web camera
US20180197307A1 (en) Information processing apparatus and method of controlling the same
KR20160098560A (en) Apparatus and methdo for analayzing motion
US11403781B2 (en) Methods and systems for intra-capture camera calibration
WO2023020327A1 (en) Image processing
Bao et al. Robust tightly-coupled visual-inertial odometry with pre-built maps in high latency situations
CN113886510A (en) Terminal interaction method, device, equipment and storage medium
US20230351615A1 (en) Object identifications in images or videos
Guzov et al. Visually plausible human-object interaction capture from wearable sensors
Price et al. Augmenting crowd-sourced 3d reconstructions using semantic detections
Yang et al. Vision-inertial hybrid tracking for robust and efficient augmented reality on smartphones
Englert et al. Enhancing the ar experience with machine learning services
CN114445601A (en) Image processing method, device, equipment and storage medium
JP2023056466A (en) Global positioning device and method for global positioning
Laskar et al. Robust loop closures for scene reconstruction by combining odometry and visual correspondences
US10977810B2 (en) Camera motion estimation
Paudel et al. Localization of 2D cameras in a known environment using direct 2D-3D registration
Garau et al. Unsupervised continuous camera network pose estimation through human mesh recovery
Araujo et al. Life cycle of a slam system: Implementation, evaluation and port to the project tango device
CN115880766A (en) Method and device for training posture migration and posture migration models and storage medium
CN114758016B (en) Camera equipment calibration method, electronic equipment and storage medium
Masher Accurately scaled 3-D scene reconstruction using a moving monocular camera and a single-point depth sensor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40054526
Country of ref document: HK