WO2024060978A1 - Method and apparatus for key point detection model training and virtual character driving - Google Patents

Method and apparatus for key point detection model training and virtual character driving

Info

Publication number
WO2024060978A1
Authority
WO
WIPO (PCT)
Prior art keywords
key point
coordinate information
field
image frame
view
Prior art date
Application number
PCT/CN2023/116711
Other languages
English (en)
French (fr)
Inventor
邓博
王佳卓
Original Assignee
广州市百果园信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市百果园信息技术有限公司
Publication of WO2024060978A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/72 Data preparation, e.g. statistical preprocessing of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • This application relates to the field of image processing technology, and for example to a key point detection model training method, a virtual character driving method, a key point detection model training device, a virtual character driving device, an electronic device, a computer-readable storage medium and a computer program product.
  • With the development of the virtual industry, live broadcast content has taken on digital forms, such as virtual anchors presented entirely by avatars.
  • In one related technology, virtual anchors are realized using technologies such as optical motion capture and inertial motion capture. However, this approach requires the anchor to wear professional equipment for long periods and usually to be connected to multiple cables, resulting in a poor live broadcast experience.
  • In another related technology, 3D human body data in a real environment is generated through end-to-end three-dimensional (3D) pose estimation; the end-to-end 3D pose estimation method can greatly enhance the interactive capabilities of virtual characters.
  • When acquiring 3D human body data, related technologies use an ordinary RGB camera and a deep learning network to predict the 3D joint points of the human body directly from the input RGB video. These schemes generally process the entire human body and place certain requirements on the camera's field of view (FOV) and shooting angle; for example, the field of view needs to cover most of the human body.
  • This application provides a key point detection model training method, a virtual character driving method, a key point detection model training device, and a virtual character driving device, to solve the problems in related technologies that posture points are easily mis-detected and that the output coordinates have poor stability.
  • The present application provides a method for driving a virtual character, the method comprising: acquiring a target image frame, the target image frame comprising an image of a portion of a human body; inputting the target image frame into a pre-trained key point detection model, and obtaining the coordinate information and the field of view probability of the human body key points of the target image frame output by the key point detection model, wherein the field of view probability is the probability that a human body key point appears within the shooting field of view of the target image frame; and driving a corresponding virtual character action according to the coordinate information of the human body key points and the field of view probability.
  • the present application provides a method for training a key point detection model.
  • The method includes: performing key point detection on multiple sample image frames in a sample set to determine the coordinate information of multiple key points included in each sample image frame; determining the field of view label of each key point based on the coordinate information of that key point, the field of view label being used to mark whether the key point is within the shooting field of view of the sample image frame to which it belongs; and using the coordinate information of the multiple key points of the sample image frames and the field of view labels as supervision signals to train a key point detection model.
  • The key point detection model is used to perform key point detection on a target image frame in the model inference stage and to output the coordinate information and field of view probability of the key points of the target image frame.
  • The present application provides a virtual character driving device.
  • The device includes: an image collection module, configured to collect target image frames, where the target image frames include images of part of the human body; a human body key point detection module, configured to input the target image frame into the pre-trained key point detection model and obtain the coordinate information and field of view probability of the human body key points of the target image frame output by the key point detection model, the field of view probability being the probability that a human body key point appears within the shooting field of view of the target image frame; and a virtual character driving module, configured to drive the corresponding virtual character action according to the coordinate information of the human body key points and the field of view probability.
  • the present application provides a device for key point detection model training.
  • The device includes: a key point detection module, configured to perform key point detection on multiple sample image frames in a sample set to determine the coordinate information of the multiple key points included in each sample image frame; a field of view label determination module, configured to determine the field of view label of each key point based on the coordinate information of that key point, the field of view label being used to mark whether the key point is within the shooting field of view of the sample image frame to which it belongs; and a model training module, configured to use the coordinate information of the multiple key points of the multiple sample image frames and the field of view labels as supervision signals to train the key point detection model.
  • The key point detection model is used to perform key point detection on a target image frame in the model inference stage and to output the coordinate information and field of view probability of the key points of the target image frame.
  • The present application provides an electronic device, which includes: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the above-mentioned virtual character driving method or key point detection model training method.
  • A computer-readable storage medium stores computer instructions which, when executed, cause a processor to implement the above-mentioned virtual character driving method or key point detection model training method.
  • A computer program product includes computer-executable instructions which, when executed, are used to implement the above-mentioned virtual character driving method or key point detection model training method.
  • Figure 1 is a flow chart of a key point detection model training method provided in Embodiment 1 of the present application;
  • Figure 2 is a schematic diagram of a sample image frame provided in Embodiment 1 of the present application.
  • Figure 3 is a schematic diagram of a cropped sample image frame provided in Embodiment 1 of the present application.
  • Figure 4 is a flow chart of a virtual character driving method provided in Embodiment 2 of the present application.
  • Figure 5 is a schematic structural diagram of a device for training a key point detection model provided in Embodiment 3 of the present application;
  • Figure 6 is a schematic structural diagram of a virtual character driving device provided in Embodiment 4 of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device provided in Embodiment 5 of the present application.
  • Figure 1 is a flow chart of a method for training a key point detection model provided in Example 1 of the present application.
  • the key point detection model is used to detect human key points in images with local human features (such as half-body images). It is suitable for scenarios of human key point detection, such as live broadcast scenes, in which the actions of virtual characters are driven by detecting human key points.
  • end-to-end 3D pose estimation methods can greatly enhance the interactive capabilities of virtual anchors.
  • virtual anchors generally use mobile phones for live broadcast.
  • This scenario has the following characteristics: terminal performance is limited, the field of view is small, and parts of the human body such as the hands and elbows frequently move out of the field of view.
  • In such a scenario, many open-source 3D pose point detection schemes are also highly error-prone.
  • the field of view refers to the maximum range that the terminal's camera can observe. The larger the field of view, the larger the observation range.
  • On the basis of not adding extra shooting hardware or usage cost for the anchor, this embodiment designs, based on an ordinary RGB camera, a low-cost model training method for training the key point detection model, where the key point detection model can be a neural network model.
  • The key point detection model can include two prediction heads, each of which is equivalent to one layer of neural network. The input of both prediction heads is a feature map: one prediction head is used to predict the positions of the key points based on the feature map, and the other prediction head is used to predict, based on the feature map, whether each key point is within the field of view. The key point detection model can therefore be used to determine whether a key point is within the field of view, which improves the key point detection accuracy of the model for images in near-field scenes; a minimal sketch of such a dual-head structure is given below.
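  • For illustration only, the following is a minimal sketch, not the architecture disclosed in this application, of a keypoint network with two prediction heads sharing one feature map: one head regresses keypoint positions and the other predicts whether each keypoint lies inside the field of view. The backbone, channel sizes and number of keypoints are assumptions.

```python
import torch
import torch.nn as nn

N_KPTS = 10  # e.g. shoulders, elbows, wrists, palms, hip, nose (assumption)

class KeypointNet(nn.Module):
    def __init__(self, feat_ch: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(              # stand-in feature extractor
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        # head 1: keypoint locations (u, v, depth) per keypoint
        self.loc_head = nn.Linear(feat_ch, N_KPTS * 3)
        # head 2: probability that each keypoint is inside the field of view
        self.fov_head = nn.Linear(feat_ch, N_KPTS)

    def forward(self, img: torch.Tensor):
        feat = self.pool(self.backbone(img)).flatten(1)
        coords = self.loc_head(feat).view(-1, N_KPTS, 3)
        fov_prob = torch.sigmoid(self.fov_head(feat))
        return coords, fov_prob

# usage: coords, fov_prob = KeypointNet()(torch.rand(1, 3, 256, 256))
```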
  • this embodiment may include the following steps.
  • Step 101 performing key point detection on a plurality of sample image frames in a sample set to determine coordinate information of a plurality of key points included in each sample image frame.
  • the sample image frames can contain the main postures of the human body, which can make the prediction results of the posture network more accurate.
  • the key points of the sample image frames can be detected through a pre-generated posture detection model.
  • The posture detection model can be a two-dimensional posture detection model or a three-dimensional posture detection model.
  • To improve detection accuracy, the posture detection model can also be a model obtained by combining a two-dimensional posture detection model with a three-dimensional posture detection model.
  • key points may be different according to different business requirements, and this embodiment does not limit this.
  • key points may include but are not limited to: left shoulder point, right shoulder point, left elbow point, right elbow point, left wrist point, right wrist point, left palm point, right palm point, hip joint point, nose point, etc.
  • the coordinate information of the key point can be represented by image coordinates and depth information.
  • the depth information can be obtained using a set depth information calculation algorithm without adding an additional depth sensor to obtain the depth information, thereby saving hardware costs and calibration costs.
  • step 101 may include the following steps.
  • Step 101-1, inputting the multiple sample image frames into a pre-generated two-dimensional posture network, and obtaining the two-dimensional coordinate information of the key points of the multiple sample image frames output by the two-dimensional posture network.
  • The two-dimensional coordinate information is coordinate information in the image coordinate system, including a horizontal coordinate value and a vertical coordinate value, and can be expressed as P^{2D} = {(u_n, v_n)} ∈ R^{n×2}, where (u_n, v_n) is the two-dimensional coordinate information of the key point numbered n in the sample image frame, u_n is its horizontal coordinate value, and v_n is its vertical coordinate value.
  • R^{n×2} denotes the set of n×2-dimensional real matrices, that is, each element of P^{2D} is a 2-dimensional vector whose components are real numbers.
  • Step 101-2, inputting the multiple sample image frames into a pre-generated three-dimensional posture network, and obtaining the three-dimensional coordinate information of the key points of the multiple sample image frames output by the three-dimensional posture network.
  • The three-dimensional coordinate information is coordinate information in the world coordinate system, including an X-axis coordinate value, a Y-axis coordinate value and a Z-axis coordinate value, and can be expressed as P^{3D} = {(x_n, y_n, z_n)} ∈ R^{n×3}, where (x_n, y_n, z_n) is the three-dimensional coordinate information of the key point numbered n in the sample image frame, x_n is its X-axis coordinate value, y_n is its Y-axis coordinate value, and z_n is its Z-axis coordinate value.
  • R^{n×3} denotes the set of n×3-dimensional real matrices, that is, each element of P^{3D} is a 3-dimensional vector whose components are real numbers.
  • It should be noted that both the two-dimensional posture network and the three-dimensional posture network can be existing posture network models. This embodiment assumes that the key points detected by the two networks are consistent and that only the coordinate-system scales of the two sets of key points are inconsistent, which is why the coordinate data need the unified processing in step 101-3.
  • Step 101 - 3 determining the coordinate information of each key point based on the obtained two-dimensional coordinate information and three-dimensional coordinate information of the key point.
  • In this step, the coordinate information of each key point can be the coordinate information obtained by fusing the two-dimensional coordinate information and the three-dimensional coordinate information of the key point, thereby transforming the different outputs of the two-dimensional posture network and the three-dimensional posture network into a unified coordinate system.
  • For example, the coordinate information may include depth information, and the depth information may be calculated from the two-dimensional coordinate information and the three-dimensional coordinate information.
  • step 101-3 may include the following steps.
  • Step 101-3-1 Determine a first stable key point and a second stable key point from a plurality of key points of the plurality of sample image frames.
  • stable key points can be key points that appear relatively stably in the field of view, such as shoulder points, elbow points, nose points, etc.
  • The first stable key point and the second stable key point can appear in the form of a key point pair; for example, the first stable key point is the left shoulder point and the second stable key point is the right shoulder point, or the first stable key point is the left elbow point and the second stable key point is the right elbow point, and so on.
  • developers can pre-configure a whitelist of stable key points under different application scenarios.
  • When the first stable key point and the second stable key point need to be determined, the stable key point whitelist for the corresponding scenario can be found based on the current application scenario (such as a live broadcast scenario), the currently detected key points can be matched against that whitelist, and the matched key points can then be used as the stable key points.
  • There is no restriction on the order of the first stable key point and the second stable key point; the terms are only used to distinguish different stable key points. If a pair of stable key points is found, one of them can be used as the first stable key point and the other as the second stable key point.
  • Step 101-3-2 Determine an adjustment coefficient based on the two-dimensional coordinate information and three-dimensional coordinate information of the first stable key point and the two-dimensional coordinate information and three-dimensional coordinate information of the second stable key point.
  • the adjustment coefficient is a parameter used to reflect the scale difference between the coordinate systems of the two-dimensional coordinate information and the three-dimensional coordinate information. According to the adjustment coefficient, the depth information of multiple key points can be determined.
  • In one embodiment, step 101-3-2 may include the following steps: determining the absolute value of the difference between the two-dimensional coordinate information of the first stable key point and that of the second stable key point as a first difference; determining the absolute value of the difference between the three-dimensional coordinate information of the first stable key point and that of the second stable key point as a second difference; and using the ratio of the first difference to the second difference as the adjustment coefficient.
  • For example, assume that the first stable key point is the left shoulder point, with key point number 0, and the second stable key point is the right shoulder point, with key point number 1; the adjustment coefficient scale is then the ratio of the 2D difference between points 0 and 1 to the 3D difference between points 0 and 1.
  • Step 101-3-3, using the adjustment coefficient to adjust the Z-axis coordinate values of the multiple key points of the multiple sample image frames to obtain the depth value of each key point; in one implementation, the depth value of each key point is depth_n = scale · z_n.
  • Step 101-3-4, using the horizontal coordinate value, the vertical coordinate value and the depth value of each key point as the coordinate information of that key point.
  • After the depth value of each key point is obtained, the image coordinates and the depth value of each key point can be organized into the final coordinate information of that key point, that is, P_n = (u_n, v_n, depth_n).
  • In this embodiment, a two-dimensional posture network is used to obtain the two-dimensional coordinate information of the key points of the sample image frames, a three-dimensional posture network is used to obtain the three-dimensional coordinate information of the key points of the sample image frames, and the two-dimensional coordinate information is then fused with the three-dimensional coordinate information to obtain the final coordinate information of each key point. This yields lower-cost, more accurate and more stable key point coordinate information and improves the detection accuracy of the key points, as sketched below.
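  • The following is a minimal sketch of the 2D/3D fusion described in steps 101-3-1 to 101-3-4. The choice of the shoulder points as the stable pair, the use of a Euclidean norm for the "absolute difference", and the array layout are illustrative assumptions.

```python
import numpy as np

def fuse_2d_3d(kpts_2d: np.ndarray, kpts_3d: np.ndarray,
               stable_a: int = 0, stable_b: int = 1) -> np.ndarray:
    """kpts_2d: (N, 2) image coordinates (u, v); kpts_3d: (N, 3) world coordinates."""
    # first difference: 2D distance between the two stable keypoints (e.g. the shoulders)
    diff_2d = np.linalg.norm(kpts_2d[stable_a] - kpts_2d[stable_b])
    # second difference: 3D distance between the same pair
    diff_3d = np.linalg.norm(kpts_3d[stable_a] - kpts_3d[stable_b])
    scale = diff_2d / diff_3d            # adjustment coefficient
    depth = scale * kpts_3d[:, 2]        # depth_n = scale * z_n
    # final coordinate information: (u_n, v_n, depth_n)
    return np.concatenate([kpts_2d, depth[:, None]], axis=1)
```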
  • In one embodiment, after the coordinate information of the multiple key points included in each sample image frame has been determined, the sample image frames may also be preprocessed, such as by image augmentation, based on that coordinate information, where the image augmentation processing may include at least one of, or a combination of, random perturbation and cropping processing.
  • This embodiment does not limit the implementation of the random perturbation. For example, random perturbation may mean that the pixel value of each pixel is changed randomly according to a preset method.
  • In an exemplary implementation, the preset method is to perturb each value randomly within a range of [-20, 20] around itself; if the RGB pixel value of a pixel is (6, 12, 230), it may become (8, 12, 226) after random perturbation according to this preset method.
  • The range of each color component of a pixel value is [0, 255], that is, the maximum value after perturbation is 255 and the minimum value is 0.
  • In order to simulate a near-field scene in which the subject is close to the shooting device, the multiple sample image frames (including the sample image frames generated after random perturbation) may also be cropped. In one embodiment, the cropping process may include the following:
  • The center position of the cropping frame can be the center position of the human body in the sample image frame and can be calculated from the coordinate information of the detected key points; for example, it can be the mean of the coordinate information of three points: the left shoulder point, the right shoulder point and the hip joint point.
  • After the center position of the cropping frame is obtained, the position of the cropping frame can be determined from that center position and a preset cropping frame size (that is, the width and height of the cropping frame). Once the position of the cropping frame is determined, the RGB values of the pixels located outside the cropping frame can be set to black (that is, an RGB value of 0) to obtain a cropped sample image frame. For example, if the sample image frame is as shown in Figure 2, the cropped sample image frame is as shown in Figure 3.
  • In practice, the size of the cropping frame can also be randomly perturbed to obtain cropped sample image frames with different cropping frame sizes.
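  • A minimal sketch of the two augmentations described above; the keypoint indices used for the crop centre and the crop size are illustrative assumptions.

```python
import numpy as np

def perturb(img: np.ndarray, low: int = -20, high: int = 20) -> np.ndarray:
    # per-pixel random perturbation in [low, high], clipped back to [0, 255]
    noise = np.random.randint(low, high + 1, img.shape)
    return np.clip(img.astype(np.int32) + noise, 0, 255).astype(np.uint8)

def crop_to_black(img: np.ndarray, kpts_2d: np.ndarray,
                  crop_w: int = 256, crop_h: int = 256,
                  centre_ids=(0, 1, 8)) -> np.ndarray:
    # crop centre = mean of left shoulder, right shoulder and hip keypoints (assumed indices)
    cy, cx = kpts_2d[list(centre_ids)].mean(axis=0)[::-1]  # keypoints are (u, v) = (x, y)
    h, w = img.shape[:2]
    x0, x1 = int(max(cx - crop_w / 2, 0)), int(min(cx + crop_w / 2, w))
    y0, y1 = int(max(cy - crop_h / 2, 0)), int(min(cy + crop_h / 2, h))
    out = np.zeros_like(img)              # everything outside the crop box is black (RGB = 0)
    out[y0:y1, x0:x1] = img[y0:y1, x0:x1]
    return out
```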
  • the training data set can be expanded through image augmentation processing, model overfitting can be suppressed, and the model generalization ability can be improved.
  • the target scenario can be effectively simulated through low-cost annotation migration methods and data preprocessing methods.
  • Step 102 Determine the field of view label of each key point based on the coordinate information of each key point.
  • the field of view label is used to mark whether the key point is within the shooting field of view of the corresponding sample image frame.
  • After the key point sets of the multiple sample image frames have been detected, and affected by the accuracy of model detection, some detected key points may not lie within the range of the sample image frame; for example, the palm may not appear in the sample image frame, yet the detected key points include a palm point. Based on this, in this embodiment, for each key point it can also be determined from the coordinate information of the key point whether the key point is within the shooting field of view of the sample image frame to which it belongs, thereby determining the field of view label of that key point. For example, if a key point is within the shooting field of view of its sample image frame, its field of view label is 1; otherwise its field of view label is 0.
  • In implementation, the size of the sample image frame can be compared with the coordinate information of the key point to determine whether the key point is within the shooting field of view of the sample image frame to which it belongs: if, according to its coordinate information, the key point lies within the range of the sample image frame, it is determined to be within the shooting field of view; otherwise it is determined to be outside the shooting field of view.
  • step 102 may include the following steps.
  • The width and height of each sample image frame are obtained. Taking the origin of the image coordinate system as the starting point, the horizontal coordinate range is determined from the width of the sample image frame and the vertical coordinate range is determined from its height. If the horizontal coordinate value of the key point is within the horizontal coordinate range, or the vertical coordinate value of the key point is within the vertical coordinate range, the field of view label of the key point is determined to be an in-field-of-view label; if the horizontal coordinate value of the key point is not within the horizontal coordinate range and the vertical coordinate value of the key point is not within the vertical coordinate range, the field of view label of the key point is determined to be an out-of-field-of-view label.
  • That is, the width of the sample image frame is taken as the length of the horizontal coordinate axis, so the horizontal coordinate range is [0, width], where width is the width; the height of the sample image frame is taken as the length of the vertical coordinate axis, so the vertical coordinate range is [0, height], where height is the height.
  • In practice, the field of view label can be determined by a logical test of the key point coordinates against these ranges, for example as in the sketch below.
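  • A minimal sketch of the labelling rule as worded above (note that the in-field condition uses "or", following the description as written):

```python
def fov_label(u: float, v: float, width: int, height: int) -> int:
    in_horizontal = 0 <= u <= width
    in_vertical = 0 <= v <= height
    if in_horizontal or in_vertical:
        return 1   # in-field-of-view label
    return 0       # out-of-field-of-view label
```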
  • Step 103 Use the coordinate information of multiple key points of multiple sample image frames and the field of view labels as supervision signals to train a key point detection model.
  • the obtained coordinate information of multiple key points and field of view labels of multiple sample image frames can be used as supervision signals to train the key point detection model.
  • The key point detection model is used to perform key point detection on a target image frame during the model inference stage and to output the coordinate information and the field of view probability of the key points of the target image frame; in subsequent applications, the coordinate information and the field of view probability can, for example, be used to drive the corresponding virtual character actions.
  • In one embodiment, the loss functions used when training the key point detection model include a heat map loss Loss_heatmap, a position loss Loss_location and a label loss Loss_label, for example Loss_total = Loss_heatmap + Loss_location + Loss_label.
  • This embodiment does not limit the implementation of these three loss functions; for example, Loss_heatmap can be implemented with an L2 loss, Loss_location with an L1 loss, and Loss_label with a cross-entropy loss, as in the sketch below.
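  • A minimal sketch of the composite loss named above, with an L2 heat-map term, an L1 location term and a cross-entropy label term; the tensor shapes and the use of binary cross-entropy for the field-of-view labels are assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_heatmap, gt_heatmap, pred_coords, gt_coords,
               pred_fov_logits, gt_fov_labels):
    loss_heatmap = F.mse_loss(pred_heatmap, gt_heatmap)                 # L2 heat map loss
    loss_location = F.l1_loss(pred_coords, gt_coords)                   # L1 position loss
    loss_label = F.binary_cross_entropy_with_logits(                    # cross-entropy label loss
        pred_fov_logits, gt_fov_labels.float())
    return loss_heatmap + loss_location + loss_label
```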
  • In this embodiment, in the data preparation stage, the coordinate information and the field of view label of the multiple key points included in each sample image frame can be obtained, where the field of view label is used to mark whether the key point is within the shooting field of view of the corresponding sample image frame. In the model training stage, the coordinate information of the multiple key points of the multiple sample image frames and the field of view labels are used as supervision signals to train the key point detection model.
  • By introducing an additional prediction head, the key point detection model gains the ability to output the field of view label of each key point, so that it can effectively judge whether a key point is within the field of view, which improves the key point detection accuracy of the model for images in near-field scenes. For example, when the user is close to the camera and only part of the human body appears within the camera's shooting field of view, the key point detection model of this embodiment can output, for the multiple human body key points, labels indicating whether they are inside or outside the field of view, which avoids situations in which key points outside the field of view are used in subsequent scene processing and degrade the processing effect.
  • Figure 4 is a flow chart of a virtual character driving method provided in Embodiment 2 of the present application; this embodiment belongs to the model inference stage of the key point detection model of Embodiment 1.
  • this embodiment may include the following steps.
  • Step 201 Collect target image frames, where the target image frames include images of part of the human body.
  • the target image frame can be an image frame collected in real time, for example, in a live broadcast scene, a half-length photo of the anchor collected through a mobile phone or other device.
  • Step 202 input the target image frame into a pre-trained key point detection model, and obtain the coordinate information and field of view probability of the human body key points of the target image frame output by the key point detection model.
  • After the target image frame is collected, it can be input into the key point detection model generated in Embodiment 1, and the key point detection model performs human body key point detection to obtain the coordinate information and the field of view probability of one or more human body key points, where the field of view probability is the probability that a human body key point appears within the shooting field of view of the target image frame.
  • The coordinate information can be expressed as (u_n, v_n, depth_n).
  • The coordinate information output by this key point detection model is cheaper to obtain than the three-dimensional coordinate information output by a three-dimensional posture network. This is because some of the information in the three-dimensional coordinate information is not needed, and in a virtual-driving scene the actual depth values of the real scene are not required, only a relative value; therefore, through the training scheme of Embodiment 1, the three-dimensional coordinate information is converted into pseudo-3D coordinate information at the image coordinate scale.
  • In one embodiment, after the coordinate information and field of view probabilities of the human body key points of the target image frame are obtained, the following steps may also be included: determining a smoothing weight between the current target image frame and the previous target image frame; and using the smoothing weight to smooth the coordinate information and the field of view probabilities. For example, in a scene of virtual live broadcasting with a mobile phone, the human body key points outside the field of view are not of particular interest; it is enough to keep them as stable as possible without jumps and obvious errors, so the coordinate information and the predicted probabilities can be smoothed with a filter.
  • The above-mentioned step of determining the smoothing weight between the current target image frame and the previous target image frame may include the following steps:
  • For each human body key point, determining the distance between the coordinate information of the human body key point in the current target image frame and its smoothed coordinate information in the previous target image frame; comparing the distance with a set distance and determining a distance weight according to the comparison result; and calculating the smoothing weight using the distance weight and the field of view probability of the human body key point in the current target image frame.
  • In one implementation, for a human body key point, after its smoothed coordinate information in the previous target image frame and its coordinate information in the current target image frame have been obtained, a distance calculation formula can be used to compute the distance between the two sets of coordinate information.
  • Here distance_n is the distance between the coordinate information of the n-th human body key point in the current target image frame and the smoothed coordinate information of the n-th human body key point in the previous target image frame.
  • In implementation, comparing the distance obtained above with the set distance can be done by calculating the ratio of the two, so the comparison result is ratio_n = distance_n / threshold, where threshold is the set distance. When distance_n is below the threshold, the cached historical data is given a large weight and the current key point a very small weight; when distance_n exceeds the threshold, the historical data is given a smaller weight and the current key point a larger weight.
  • In one implementation, the distance weight can be calculated from this comparison result, where k denotes a preset smoothing strength: the larger k is, the smaller the smoothing window and the stronger the ability to suppress violent jumps.
  • After the distance weight is obtained, the smoothing weight can be calculated by combining the distance weight with the field of view probability of the human body key point in the current target image frame, where prob_n is the field of view probability of the n-th human body key point.
  • After the smoothing weight is obtained, it can be used to smooth the coordinate information and the field of view probability. In one embodiment, the smoothing process may include the following steps:
  • Based on the smoothing weight, determining a first weight for the previous target image frame and a second weight for the current target image frame; based on the first weight and the second weight, performing a weighted calculation of the coordinate information of the previous target image frame and the coordinate information of the current target image frame to obtain the smoothed coordinate information; and, based on the first weight and the second weight, performing a weighted calculation of the field of view probability of the previous target image frame and the field of view probability of the current target image frame to obtain the smoothed field of view probability.
  • For example, if the first weight is the smoothing weight ration, the second weight is 1 - ration, and the field of view probability is smoothed as Cached_prob_n = ration · Cached_prob_{n-1} + (1 - ration) · prob_n, where Cached_prob_n is the field of view probability after the current smoothing and Cached_prob_{n-1} is the field of view probability after the previous smoothing; the coordinate information is smoothed with the same weighted calculation.
  • The filter introduced in this embodiment smooths the key points with respect to the degree of jumping and whether they are within the field of view: by lowering the weight of the current frame when the key point coordinates jump violently and the key point is not within the field of view, the output human body key point results are kept in a relatively stable and continuous state overall.
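  • A minimal sketch of such a filter. The exact formulas for the distance weight and for its combination with the field-of-view probability are not reproduced in the text above, so the forms used here are assumptions chosen only to match the described behaviour (history dominates for small jumps, the current frame dominates for large in-view jumps, and the current frame's weight is lowered when the point is likely out of view); THRESHOLD and K are illustrative values.

```python
import numpy as np

THRESHOLD = 20.0   # set distance in pixels (assumption)
K = 4.0            # preset smoothing strength (assumption)

class KeypointSmoother:
    def __init__(self, n_kpts: int):
        self.cached_coords = None          # smoothed (u, v, depth) per keypoint
        self.cached_prob = np.zeros(n_kpts)

    def update(self, coords: np.ndarray, prob: np.ndarray):
        if self.cached_coords is None:     # first frame: nothing to smooth against
            self.cached_coords, self.cached_prob = coords.copy(), prob.copy()
            return self.cached_coords, self.cached_prob
        # distance between current coordinates and the previous smoothed coordinates
        distance = np.linalg.norm(coords - self.cached_coords, axis=1)
        ratio = distance / THRESHOLD
        current_weight = (1.0 - np.exp(-K * ratio)) * prob   # assumed functional form
        ration = 1.0 - current_weight                         # weight of the previous frame
        # weighted update of coordinates and field-of-view probability
        self.cached_coords = ration[:, None] * self.cached_coords + (1 - ration[:, None]) * coords
        self.cached_prob = ration * self.cached_prob + (1 - ration) * prob
        return self.cached_coords, self.cached_prob
```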
  • Step 203 Drive the corresponding virtual character action according to the coordinate information of the human body key points and the field of view probability.
  • After the final coordinate information and field of view probabilities of the human body key points of the target image frame are obtained, it can be judged from the field of view probability whether the corresponding human body key point is within the field of view of the target image frame. If it is within the field of view, the corresponding body part of the virtual character can be moved, according to the coordinate information, to the position corresponding to that coordinate information; if it is outside the field of view, the virtual character is not moved for that key point.
  • For example, these human body key points can make the interaction between the anchor and the users richer, such as waving, making a heart shape with the hands, displaying 3D gifts given by users on the arms or wrists of the 3D avatar, and letting the anchor's and users' virtual avatars play interactive games in a virtual 3D space.
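  • A minimal sketch of the driving decision described above; the 0.5 probability threshold and the move_avatar_part callback are hypothetical placeholders, not part of the original description.

```python
FOV_THRESHOLD = 0.5  # assumption: probability above which a keypoint is treated as in view

def drive_avatar(coords, fov_prob, move_avatar_part):
    for idx, (xyz, p) in enumerate(zip(coords, fov_prob)):
        if p >= FOV_THRESHOLD:             # keypoint judged to be within the field of view
            move_avatar_part(idx, xyz)     # move the corresponding body part of the avatar
        # otherwise the avatar part for this keypoint is left unchanged
```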
  • In this embodiment, the key point detection model can be used to detect the coordinate information and the field of view probabilities of the human body key points of the target image frame, where the field of view probability is the probability that a human body key point appears within the shooting field of view of the target image frame.
  • The field of view probability and the coordinate information are then combined to drive the corresponding virtual character actions. This achieves end-to-end signal output of human body key points and guarantees a degree of stability in close-up situations, meeting the requirements for driving virtual characters on mobile phones.
  • Figure 5 is a schematic structural diagram of a key point detection model training device provided in Embodiment 3 of the present application, which may include the following modules: a key point detection module 301, configured to perform key point detection on multiple sample image frames in a sample set to determine the coordinate information of the multiple key points included in each sample image frame; a field of view label determination module 302, configured to determine the field of view label of each key point based on the coordinate information of that key point, the field of view label being used to mark whether the key point is within the shooting field of view of the sample image frame to which it belongs; and a model training module 303, configured to use the coordinate information of the multiple key points of the multiple sample image frames and the field of view labels as supervision signals to train a key point detection model.
  • The key point detection model is used to perform key point detection on a target image frame during the model inference stage and to output the coordinate information and field of view probability of the key points of the target image frame.
  • In one embodiment, the key point detection module 301 may include the following modules: a two-dimensional posture prediction module, configured to input the multiple sample image frames into a pre-generated two-dimensional posture network and obtain the two-dimensional coordinate information of the key points of the multiple sample image frames output by the two-dimensional posture network; a three-dimensional posture prediction module, configured to input the multiple sample image frames into a pre-generated three-dimensional posture network and obtain the three-dimensional coordinate information of the key points of the multiple sample image frames output by the three-dimensional posture network; and a coordinate determination module, configured to determine the coordinate information of each key point based on the obtained two-dimensional coordinate information and three-dimensional coordinate information of the key points.
  • In one embodiment, the two-dimensional coordinate information includes a horizontal coordinate value and a vertical coordinate value, and the three-dimensional coordinate information includes an X-axis coordinate value, a Y-axis coordinate value and a Z-axis coordinate value.
  • The coordinate determination module may include the following modules: a stable key point determination module, configured to determine a first stable key point and a second stable key point from the multiple key points of the multiple sample image frames; an adjustment coefficient determination module, configured to determine an adjustment coefficient according to the two-dimensional coordinate information and three-dimensional coordinate information of the first stable key point and of the second stable key point; an adjustment module, configured to use the adjustment coefficient to adjust the Z-axis coordinate values of the multiple key points of the multiple sample image frames to obtain the depth value of each key point; and a coordinate generation module, configured to use the horizontal coordinate value, the vertical coordinate value and the depth value of each key point as the coordinate information of that key point.
  • In one embodiment, the adjustment coefficient determination module is configured to: determine the absolute value of the difference between the two-dimensional coordinate information of the first stable key point and that of the second stable key point as a first difference; determine the absolute value of the difference between the three-dimensional coordinate information of the first stable key point and that of the second stable key point as a second difference; and use the ratio of the first difference to the second difference as the adjustment coefficient.
  • In one embodiment, the coordinate information includes a horizontal coordinate value and a vertical coordinate value, and the field of view labels include an in-field-of-view label and an out-of-field-of-view label.
  • The field of view label determination module 302 is configured to: obtain the width and height of each sample image frame; taking the origin of the image coordinate system as the starting point, determine the horizontal coordinate range from the width of the sample image frame and the vertical coordinate range from its height; if the horizontal coordinate value of the key point is within the horizontal coordinate range, or the vertical coordinate value of the key point is within the vertical coordinate range, determine that the field of view label of the key point is the in-field-of-view label; and if the horizontal coordinate value of the key point is not within the horizontal coordinate range and the vertical coordinate value of the key point is not within the vertical coordinate range, determine that the field of view label of the key point is the out-of-field-of-view label.
  • In one embodiment, the device may also include an image augmentation module, configured to perform image augmentation processing on each sample image frame, after the coordinate information of the multiple key points included in that sample image frame has been determined, based on that coordinate information, where the image augmentation processing includes at least one of, or a combination of, random perturbation and cropping processing.
  • In one embodiment, the image augmentation module is configured to: determine the center position of the cropping frame based on the coordinate information of the multiple key points included in each sample image frame; determine the position of the cropping frame based on the center position of the cropping frame; and, according to the cropping frame position, set the RGB values of the pixels outside the cropping frame to black.
  • In one embodiment, when training the key point detection model, the loss functions used include a heat map loss function, a position loss function and a label loss function.
  • The key point detection model training device provided in this embodiment of the present application can execute the key point detection model training method provided in Embodiment 1 of the present application and has functional modules corresponding to that method.
  • Figure 6 is a schematic structural diagram of a virtual character driving device provided in Embodiment 4 of the present application, which may include the following modules: an image acquisition module 401, configured to acquire target image frames, where the target image frames include images of part of the human body; a human body key point detection module 402, configured to input the target image frame into the pre-trained key point detection model and obtain the coordinate information and field of view probability of the human body key points of the target image frame output by the key point detection model, the field of view probability being the probability that a human body key point appears within the shooting field of view of the target image frame; and a virtual character driving module 403, configured to drive the corresponding virtual character action according to the coordinate information of the human body key points and the field of view probability.
  • the device may further include the following modules: a smoothing weight determination module configured to determine the smoothing weight between the current target image frame and the previous target image frame; a smoothing processing module configured to use the smoothing Weights are used to smooth the coordinate information and the field of view probability.
  • In one embodiment, the smoothing weight determination module is configured to: for each human body key point, determine the distance between the coordinate information of the human body key point in the current target image frame and its smoothed coordinate information in the previous target image frame; compare the distance with a set distance and determine a distance weight according to the comparison result; and calculate the smoothing weight using the distance weight and the field of view probability of the human body key point in the current target image frame.
  • In one embodiment, the smoothing processing module is configured to: based on the smoothing weight, determine a first weight for the previous target image frame and a second weight for the current target image frame; based on the first weight and the second weight, perform a weighted calculation of the coordinate information of the previous target image frame and the coordinate information of the current target image frame to obtain the smoothed coordinate information; and, based on the first weight and the second weight, perform a weighted calculation of the field of view probability of the previous target image frame and the field of view probability of the current target image frame to obtain the smoothed field of view probability.
  • The virtual character driving device provided in this embodiment of the present application can execute the virtual character driving method provided in Embodiment 2 of the present application and has functional modules corresponding to that method.
  • FIG. 7 shows a schematic structural diagram of an electronic device 10 that can be used to implement method embodiments of the present application.
  • The electronic device 10 can be a server, a mobile phone or other equipment, and includes at least one processor 11 and storage devices communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, where the storage devices store one or more computer programs executable by the at least one processor.
  • The processor 11 can perform various appropriate actions and processes according to the computer program stored in the ROM 12 or a computer program loaded from the storage unit 18 into the RAM 13; the RAM 13 can also store various programs and data required for the operation of the electronic device 10.
  • the method in Embodiment 1 or Embodiment 2 can be implemented as a computer program, which is tangibly included in a computer-readable storage medium, such as the storage unit 18 .
  • part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19 .
  • When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the method in Embodiment 1 or Embodiment 2 described above may be performed.
  • The method in Embodiment 1 or Embodiment 2 can also be implemented as a computer program product; the computer program product includes computer-executable instructions which, when executed, are used to perform one or more steps of the method in Embodiment 1 or Embodiment 2 described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A method for key point detection model training and virtual character driving, and corresponding devices. The virtual character driving method includes: collecting a target image frame, the target image frame including an image of part of a human body (201); inputting the target image frame into a pre-trained key point detection model, and obtaining the coordinate information and field of view probabilities of the human body key points of the target image frame output by the key point detection model, the field of view probability being the probability that a human body key point appears within the shooting field of view of the target image frame (202); and driving a corresponding virtual character action according to the coordinate information of the human body key points and the field of view probability (203).

Description

Method and apparatus for key point detection model training and virtual character driving
This application claims priority to Chinese patent application No. 202211145707.2, filed with the Chinese Patent Office on September 20, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of image processing technology, and for example to a key point detection model training method, a virtual character driving method, a key point detection model training device, a virtual character driving device, an electronic device, a computer-readable storage medium and a computer program product.
Background
With the development of the virtual industry, live broadcast content has taken on digital forms, such as virtual anchors presented entirely by avatars.
In one related technology, virtual anchors are realized using technologies such as optical motion capture and inertial motion capture, but this requires the anchor to wear professional equipment for long periods and usually to be connected to multiple cables, resulting in a poor live broadcast experience.
In another related technology, 3D human body data in a real environment is generated through end-to-end three-dimensional (3 Dimensions, 3D) pose data, and the end-to-end 3D pose estimation method can greatly enhance the interactive capabilities of virtual characters. When acquiring 3D human body data, related technologies use an ordinary RGB camera and a deep learning network to predict the 3D joint points of the human body directly from the input RGB video; such schemes generally process the entire human body and place certain requirements on the camera's field of view (Field Of View, FOV) and shooting angle, for example the field of view needs to cover most of the human body. When the user is close to the camera and only part of the human body appears, because the terminal performance is limited, the field of view is small, and parts such as the hands and elbows frequently move out of the field of view, the posture points are very easily mis-detected. Moreover, owing to the limited resources on mobile terminals, many heatmap-based approaches are constrained by resolution, so the precision of the output coordinates is insufficient and the stability of the output coordinates is also greatly affected.
Summary
This application provides a key point detection model training method, a virtual character driving method, a key point detection model training device and a virtual character driving device, to solve the problems in related technologies that posture points are easily mis-detected and that the output coordinates have poor stability.
This application provides a virtual character driving method, the method including: collecting a target image frame, the target image frame including an image of part of a human body; inputting the target image frame into a pre-trained key point detection model, and obtaining the coordinate information and field of view probabilities of the human body key points of the target image frame output by the key point detection model, the field of view probability being the probability that a human body key point appears within the shooting field of view of the target image frame; and driving a corresponding virtual character action according to the coordinate information of the human body key points and the field of view probability.
This application provides a key point detection model training method, the method including: performing key point detection on multiple sample image frames in a sample set to determine the coordinate information of multiple key points included in each sample image frame; determining the field of view label of each key point based on the coordinate information of that key point, the field of view label being used to mark whether the key point is within the shooting field of view of the sample image frame to which it belongs; and using the coordinate information of the multiple key points of the multiple sample image frames and the field of view labels as supervision signals to train a key point detection model, the key point detection model being used to perform key point detection on a target image frame in the model inference stage and to output the coordinate information and field of view probabilities of the key points of the target image frame.
This application provides a virtual character driving device, the device including: an image collection module, configured to collect target image frames, the target image frames including images of part of a human body; a human body key point detection module, configured to input the target image frame into the pre-trained key point detection model and obtain the coordinate information and field of view probabilities of the human body key points of the target image frame output by the key point detection model, the field of view probability being the probability that a human body key point appears within the shooting field of view of the target image frame; and a virtual character driving module, configured to drive a corresponding virtual character action according to the coordinate information of the human body key points and the field of view probability.
This application provides a key point detection model training device, the device including: a key point detection module, configured to perform key point detection on multiple sample image frames in a sample set to determine the coordinate information of multiple key points included in each sample image frame; a field of view label determination module, configured to determine the field of view label of each key point based on the coordinate information of that key point, the field of view label being used to mark whether the key point is within the shooting field of view of the sample image frame to which it belongs; and a model training module, configured to use the coordinate information of the multiple key points of the multiple sample image frames and the field of view labels as supervision signals to train a key point detection model, the key point detection model being used to perform key point detection on a target image frame in the model inference stage and to output the coordinate information and field of view probabilities of the key points of the target image frame.
This application provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the above-mentioned virtual character driving method or key point detection model training method.
According to a sixth aspect of this application, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions which, when executed, cause a processor to implement the above-mentioned virtual character driving method or key point detection model training method.
According to a seventh aspect of this application, a computer program product is provided, the computer program product including computer-executable instructions which, when executed, are used to implement the above-mentioned virtual character driving method or key point detection model training method.
Brief Description of the Drawings
The drawings required in the description of the embodiments are briefly introduced below.
Figure 1 is a flow chart of a key point detection model training method provided in Embodiment 1 of this application;
Figure 2 is a schematic diagram of a sample image frame provided in Embodiment 1 of this application;
Figure 3 is a schematic diagram of a cropped sample image frame provided in Embodiment 1 of this application;
Figure 4 is a flow chart of a virtual character driving method provided in Embodiment 2 of this application;
Figure 5 is a schematic structural diagram of a key point detection model training device provided in Embodiment 3 of this application;
Figure 6 is a schematic structural diagram of a virtual character driving device provided in Embodiment 4 of this application;
Figure 7 is a schematic structural diagram of an electronic device provided in Embodiment 5 of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the drawings in the embodiments of this application; the described embodiments are only some of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative work shall fall within the protection scope of this application.
It should be noted that the terms "first", "second" and the like in the specification, claims and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of this application described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the listed steps or units, but may include other steps or units that are not listed or that are inherent to the process, method, product or device.
Embodiment 1
Figure 1 is a flow chart of a key point detection model training method provided in Embodiment 1 of this application. The key point detection model is used to detect human body key points in images with local human body features (such as half-body images) and is suitable for human body key point detection scenarios, for example live broadcast scenarios in which the actions of a virtual character are driven by detecting human body key points.
In virtual live broadcast scenarios, end-to-end 3D pose estimation methods can greatly enhance the interactive capabilities of virtual anchors. Investigation and analysis show that virtual anchors generally use mobile phones for live broadcasting. This scenario has the following characteristics: terminal performance is limited, the field of view is small, and parts of the human body such as the hands and elbows frequently move out of the field of view; in this scenario, many open-source 3D pose point detection schemes are also very error-prone. The field of view refers to the maximum range that the terminal's camera can observe; the larger the field of view, the larger the observation range.
On the basis of not adding extra shooting hardware or usage cost for the anchor, this embodiment designs, based on an ordinary RGB camera, a low-cost model training method for training the key point detection model, where the key point detection model can be a neural network model. The key point detection model can include two prediction heads, each of which is equivalent to one layer of neural network; the input of both prediction heads is a feature map, one prediction head is used to predict the positions of the key points based on the feature map, and the other prediction head is used to predict, based on the feature map, whether each key point is within the field of view. The key point detection model can therefore be used to determine whether a key point is within the field of view, which improves the key point detection accuracy of the model for images in near-field scenes.
As shown in Figure 1, this embodiment may include the following steps.
Step 101: perform key point detection on multiple sample image frames in a sample set to determine the coordinate information of multiple key points included in each sample image frame.
The sample image frames can contain the main postures of the human body, which makes the prediction results of the posture network more accurate.
In one embodiment, key point detection can be performed on the sample image frames by a pre-generated posture detection model, which can be a two-dimensional posture detection model or a three-dimensional posture detection model. To improve detection accuracy, the posture detection model can also be a model obtained by combining a two-dimensional posture detection model with a three-dimensional posture detection model.
The key points may differ according to different business requirements, and this embodiment does not limit this. For example, the key points may include, but are not limited to, the left shoulder point, right shoulder point, left elbow point, right elbow point, left wrist point, right wrist point, left palm point, right palm point, hip joint point, nose point, and so on.
For example, the coordinate information of a key point can be represented by image coordinates and depth information. In one implementation, the depth information can be obtained by a set depth-information calculation algorithm, without adding an additional depth sensor, thereby saving hardware cost and calibration cost.
In one embodiment, step 101 may include the following steps.
Step 101-1: input the multiple sample image frames into a pre-generated two-dimensional posture network, and obtain the two-dimensional coordinate information of the key points of the multiple sample image frames output by the two-dimensional posture network.
For example, the two-dimensional coordinate information is coordinate information in the image coordinate system, including a horizontal coordinate value and a vertical coordinate value, and can be expressed as P^{2D} = {(u_n, v_n)} ∈ R^{n×2}, where (u_n, v_n) is the two-dimensional coordinate information of the key point numbered n in the sample image frame, u_n is its horizontal coordinate value and v_n is its vertical coordinate value. R^{n×2} denotes the set of n×2-dimensional real matrices, that is, each element of P^{2D} is a 2-dimensional vector whose components are real numbers.
Step 101-2: input the multiple sample image frames into a pre-generated three-dimensional posture network, and obtain the three-dimensional coordinate information of the key points of the multiple sample image frames output by the three-dimensional posture network.
For example, the three-dimensional coordinate information is coordinate information in the world coordinate system, including an X-axis coordinate value, a Y-axis coordinate value and a Z-axis coordinate value, and can be expressed as P^{3D} = {(x_n, y_n, z_n)} ∈ R^{n×3}, where (x_n, y_n, z_n) is the three-dimensional coordinate information of the key point numbered n in the sample image frame, x_n is its X-axis coordinate value, y_n is its Y-axis coordinate value and z_n is its Z-axis coordinate value. R^{n×3} denotes the set of n×3-dimensional real matrices, that is, each element of P^{3D} is a 3-dimensional vector whose components are real numbers.
It should be noted that both the two-dimensional posture network and the three-dimensional posture network can be existing posture network models. This embodiment assumes that the key points detected by the two networks are consistent and that only the coordinate-system scales of the two sets of key points are inconsistent, which requires the unified processing of the coordinate data in step 101-3.
Step 101-3: determine the coordinate information of each key point based on the obtained two-dimensional coordinate information and three-dimensional coordinate information of the key points.
In this step, the coordinate information of each key point can be the coordinate information obtained by fusing the two-dimensional coordinate information and the three-dimensional coordinate information of the key point, thereby transforming the different outputs of the two-dimensional posture network and the three-dimensional posture network into a unified coordinate system. For example, the coordinate information can include depth information, and the depth information can be calculated from the two-dimensional coordinate information and the three-dimensional coordinate information.
In one embodiment, step 101-3 may include the following steps.
Step 101-3-1: determine a first stable key point and a second stable key point from the multiple key points of the multiple sample image frames.
In practice, the stable key points can be key points that appear relatively stably within the field of view, such as the shoulder points, elbow points and nose point.
The first stable key point and the second stable key point can appear in the form of a key point pair; for example, the first stable key point is the left shoulder point and the second stable key point is the right shoulder point, or the first stable key point is the left elbow point and the second stable key point is the right elbow point, and so on.
In one implementation, developers can pre-configure whitelists of stable key points for different application scenarios. When the first stable key point and the second stable key point need to be determined, the stable key point whitelist for the corresponding scenario can be found according to the current application scenario (such as a live broadcast scenario), the currently detected key points can be matched against that whitelist, and the matched key points can then be used as the stable key points.
There is no restriction on the order of the first stable key point and the second stable key point; the terms are only used to distinguish different stable key points. If a pair of stable key points is found, one of them can be used as the first stable key point and the other as the second stable key point.
Step 101-3-2: determine an adjustment coefficient according to the two-dimensional coordinate information and three-dimensional coordinate information of the first stable key point and the two-dimensional coordinate information and three-dimensional coordinate information of the second stable key point.
The adjustment coefficient is a parameter that reflects the scale difference between the coordinate systems of the two-dimensional coordinate information and the three-dimensional coordinate information; the depth information of the multiple key points can be determined according to the adjustment coefficient.
In one embodiment, step 101-3-2 may include the following steps: determining the absolute value of the difference between the two-dimensional coordinate information of the first stable key point and that of the second stable key point as a first difference; determining the absolute value of the difference between the three-dimensional coordinate information of the first stable key point and that of the second stable key point as a second difference; and using the ratio of the first difference to the second difference as the adjustment coefficient.
For example, assume that the first stable key point is the left shoulder point, with key point number 0, and the second stable key point is the right shoulder point, with key point number 1; the adjustment coefficient scale is then the ratio of the first difference, computed from the two-dimensional coordinates of points 0 and 1, to the second difference, computed from the three-dimensional coordinates of points 0 and 1.
Step 101-3-3: use the adjustment coefficient to adjust the Z-axis coordinate values of the multiple key points of the multiple sample image frames to obtain the depth value of each key point. In one implementation, the depth value of each key point is depth_n = scale · z_n.
Step 101-3-4: use the horizontal coordinate value, the vertical coordinate value and the depth value of each key point as the coordinate information of that key point.
After the depth value of each key point is obtained, the image coordinates and the depth value of each key point can be organized into the final coordinate information of that key point, that is, P_n = (u_n, v_n, depth_n).
In this embodiment, a two-dimensional posture network is used to obtain the two-dimensional coordinate information of the key points of the sample image frames, and a three-dimensional posture network is used to obtain the three-dimensional coordinate information of the key points of the sample image frames; the two-dimensional coordinate information is then fused with the three-dimensional coordinate information to obtain the final coordinate information of each key point, which yields lower-cost, more accurate and more stable key point coordinate information and improves the detection accuracy of the key points.
In one embodiment, after the coordinate information of the multiple key points included in each sample image frame has been determined, the multiple sample image frames may also be preprocessed, such as by image augmentation, based on that coordinate information, where the image augmentation processing may include at least one of, or a combination of, random perturbation and cropping processing.
This embodiment does not limit the implementation of the random perturbation. For example, random perturbation may mean that the pixel value of each pixel is changed randomly according to a preset method. In an exemplary implementation, if the preset method is to perturb each value randomly within a range of [-20, 20] around itself, an RGB pixel value of (6, 12, 230) may become (8, 12, 226) after random perturbation according to this preset method. The range of each color component of a pixel value is [0, 255], that is, the maximum value after perturbation is 255 and the minimum value is 0.
In order to simulate a near-field scene in which the subject is close to the shooting device, the multiple sample image frames (including the sample image frames generated after random perturbation) may also be cropped. In one embodiment, the cropping processing may include the following process:
determining the center position of a cropping frame based on the coordinate information of the multiple key points included in each sample image frame; determining the position of the cropping frame according to the center position of the cropping frame; and setting the RGB values of the pixels outside the cropping frame to black according to the cropping frame position.
In one implementation, the center position of the cropping frame can be the center position of the human body in the sample image frame and can be calculated from the coordinate information of the detected key points; for example, it can be the mean of the coordinate information of three points: the left shoulder point, the right shoulder point and the hip joint point.
After the center position of the cropping frame is obtained, the position of the cropping frame can be determined from that center position and a preset cropping frame size (that is, the width and height of the cropping frame). Once the position of the cropping frame is determined, the RGB values of the pixels located outside the cropping frame can be set to black (that is, an RGB value of 0) to obtain a cropped sample image frame. For example, if the sample image frame is as shown in Figure 2, the cropped sample image frame is as shown in Figure 3.
In practice, the size of the cropping frame can also be randomly perturbed to obtain cropped sample image frames with different cropping frame sizes.
In this embodiment, image augmentation processing can expand the training data set, suppress model overfitting and improve the generalization ability of the model. At the same time, the target scenario can be effectively simulated through low-cost annotation migration methods and data preprocessing methods.
Step 102: determine the field of view label of each key point based on the coordinate information of that key point, the field of view label being used to mark whether the key point is within the shooting field of view of the sample image frame to which it belongs.
After the key point sets of the multiple sample image frames have been detected, and affected by the accuracy of model detection, some detected key points may not lie within the range of the sample image frame; for example, the palm may not appear in the sample image frame, yet the detected key points include a palm point. Based on this, in this embodiment, for each key point it can also be determined from the coordinate information of the key point whether the key point is within the shooting field of view of the sample image frame to which it belongs, thereby determining the field of view label of that key point. For example, if a key point is within the shooting field of view of its sample image frame, its field of view label is 1; if it is not, its field of view label is 0.
In implementation, the size of the sample image frame can be compared with the coordinate information of the key point to determine whether the key point is within the shooting field of view of the sample image frame to which it belongs: if, according to its coordinate information, the key point lies within the range of the sample image frame, it is determined to be within the shooting field of view; if, according to its coordinate information, the key point does not lie within the range of the sample image frame, it is determined to be outside the shooting field of view.
In one embodiment, step 102 may include the following steps.
Obtain the width and height of each sample image frame; taking the origin of the image coordinate system as the starting point, determine the horizontal coordinate range according to the width of the multiple sample image frames and the vertical coordinate range according to their height; if the horizontal coordinate value of the key point is within the horizontal coordinate range, or the vertical coordinate value of the key point is within the vertical coordinate range, determine that the field of view label of the key point is an in-field-of-view label; if the horizontal coordinate value of the key point is not within the horizontal coordinate range and the vertical coordinate value of the key point is not within the vertical coordinate range, determine that the field of view label of the key point is an out-of-field-of-view label.
The width of the sample image frame can be taken as the length of the horizontal coordinate axis with the origin of the image coordinate system as the starting point, that is, the horizontal coordinate range is [0, width], where width is the width; the height of the sample image frame can be taken as the length of the vertical coordinate axis, that is, the vertical coordinate range is [0, height], where height is the height.
In practice, the field of view label can be determined according to this logical judgment of the key point coordinates against the two ranges.
Step 103: use the coordinate information of the multiple key points of the multiple sample image frames and the field of view labels as supervision signals to train the key point detection model.
In this step, the obtained coordinate information of the multiple key points of the multiple sample image frames and the field of view labels can be used as supervision signals to train the key point detection model.
The key point detection model is used to perform key point detection on a target image frame in the model inference stage and to output the coordinate information and field of view probabilities of the key points of the target image frame; in subsequent applications, the coordinate information and the field of view probabilities can, for example, be used to drive the corresponding virtual character actions.
In one embodiment, the loss functions used when training the key point detection model include a heat map loss function Loss_heatmap, a position loss function Loss_location and a label loss function Loss_label, for example: Loss_total = Loss_heatmap + Loss_location + Loss_label.
It should be noted that this embodiment does not limit the implementation of the above three loss functions; for example, Loss_heatmap can be implemented with an L2 loss, Loss_location with an L1 loss, and Loss_label with a cross-entropy loss.
In this embodiment, in the data preparation stage, the coordinate information and field of view labels of the multiple key points included in each of the multiple sample image frames can be obtained, the field of view label being used to mark whether the key point is within the shooting field of view of the sample image frame to which it belongs; in the model training stage, the coordinate information of the multiple key points of the multiple sample image frames and the field of view labels are used as supervision signals to train the key point detection model. By introducing an additional prediction head, the key point detection model gains the ability to output the field of view label of each key point, so that it can effectively judge whether a key point is within the field of view, which improves the key point detection accuracy of the model for images in near-field scenes. For example, when the user is close to the camera and only part of the human body appears within the camera's shooting field of view, the key point detection model of this embodiment can output, for the multiple human body key points, labels indicating whether they are inside or outside the field of view, which avoids situations in which key points outside the field of view are used in subsequent scene processing and degrade the processing effect.
Embodiment 2
Figure 4 is a flow chart of a virtual character driving method provided in Embodiment 2 of this application; this embodiment belongs to the model inference stage of the key point detection model of Embodiment 1.
As shown in Figure 4, this embodiment may include the following steps.
Step 201: collect a target image frame, the target image frame including an image of part of a human body.
The target image frame can be an image frame collected in real time, for example, in a live broadcast scenario, a half-body picture of the anchor collected by a mobile phone or other device.
Step 202: input the target image frame into the pre-trained key point detection model, and obtain the coordinate information and field of view probabilities of the human body key points of the target image frame output by the key point detection model.
After the target image frame is collected, it can be input into the key point detection model generated in Embodiment 1, and the key point detection model performs human body key point detection to obtain the coordinate information and field of view probability of one or more human body key points, where the field of view probability is the probability that a human body key point appears within the shooting field of view of the target image frame.
The coordinate information can be expressed as (u_n, v_n, depth_n). The coordinate information output by this key point detection model is cheaper to obtain than the three-dimensional coordinate information output by a three-dimensional posture network. This is because some of the information in the three-dimensional coordinate information is not needed, and in a virtual-driving scene the actual depth values of the real scene are not required, only a relative value; therefore, through the training scheme of Embodiment 1, the three-dimensional coordinate information is converted into pseudo-3D coordinate information at the image coordinate scale.
In one embodiment, after the coordinate information and field of view probabilities of the human body key points of the target image frame are obtained, the following steps may also be included:
determining a smoothing weight between the current target image frame and the previous target image frame; and using the smoothing weight to smooth the coordinate information and the field of view probabilities.
For example, in a scene of virtual live broadcasting with a mobile phone, the human body key points outside the field of view are not of particular interest; it is enough to keep them as stable as possible without jumps and obvious errors. The coordinate information and the predicted probabilities can therefore be smoothed with a filter.
在一种实施例中,上述确定当前目标图像帧与上一目标图像帧之间的平滑权重的步骤,可以包括如下步骤:
针对每个人体关键点,确定所述人体关键点在当前目标图像帧的坐标信息与在上一目标图像帧的平滑后的坐标信息之间的距离;将所述距离与设定距离进行比较,并根据比较结果确定距离权重;采用所述距离权重以及所述人体关键点在当前目标图像帧的视场概率,计算平滑权重。
In one implementation, for a human-body key point, once its smoothed coordinate information in the previous target image frame and its coordinate information in the current target image frame are available, the distance between the two can be computed with a distance formula, for example the Euclidean distance:
distance_n = ||P_n - Cached_P_{n-1}||
where distance_n is the distance between the coordinate information of the n-th human-body key point in the current target image frame and its smoothed coordinate information in the previous target image frame, P_n is the coordinate information of the n-th human-body key point in the current target image frame, and Cached_P_{n-1} is the smoothed coordinate information of the n-th human-body key point in the previous target image frame.
In one implementation, comparing the distance with the set distance may consist of computing the ratio of the two, the comparison result then being that ratio, i.e. distance_n / threshold, where threshold is the set distance. When distance_n is below the threshold, the cached historical data receive a large weight and the current key point a very small weight; when distance_n is above the threshold, the historical data receive a smaller weight and the current key point a larger weight.
In one implementation, the distance weight can then be computed from this ratio together with a preset parameter k that expresses the smoothing intensity: the larger k is, the smaller the effective window and the stronger the ability to suppress violent jumps.
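The exact weight formula is not reproduced above, so the sketch below only assumes a sigmoid-shaped mapping from the ratio distance_n / threshold to a weight for the current observation, controlled by the preset parameter k; the function name and that particular mapping are assumptions, not the formula of this application.

    import numpy as np

    def distance_weight(curr_coord, prev_smoothed, threshold, k):
        """Weight of the current observation: small when the key point has barely
        moved (the cached history should dominate), larger once it moves well past
        the set distance (the current detection should dominate)."""
        distance = float(np.linalg.norm(np.asarray(curr_coord) - np.asarray(prev_smoothed)))
        ratio = distance / threshold              # comparison result distance_n / threshold
        # Assumed sigmoid-shaped mapping; k controls how sharply the weight switches
        # from favouring the history to favouring the current detection.
        return 1.0 / (1.0 + np.exp(-k * (ratio - 1.0)))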
Once the distance weight has been obtained, the smoothing weight can be computed by combining the distance weight with the field-of-view probability of the human-body key point in the current target image frame, where prob_n denotes the field-of-view probability of the n-th human-body key point.
Once the smoothing weight has been obtained, it can be used to smooth the coordinate information and the field-of-view probability. In one embodiment, the smoothing may include the following steps:
determining, on the basis of the smoothing weight, a first weight for the previous target image frame and a second weight for the current target image frame; computing a weighted combination of the coordinate information of the previous target image frame and of the current target image frame with the first weight and the second weight to obtain smoothed coordinate information; and computing a weighted combination of the field-of-view probability of the previous target image frame and of the current target image frame with the first weight and the second weight to obtain a smoothed field-of-view probability.
For example, if the first weight is the smoothing weight, the second weight is the difference between 1 and the smoothing weight, i.e. first weight = ratio_n and second weight = 1 - ratio_n.
The coordinate information is smoothed as follows:
Cached_P_n = ratio_n * Cached_P_{n-1} + (1 - ratio_n) * P_n
The field-of-view probability is smoothed as follows:
Cached_prob_n = ratio_n * Cached_prob_{n-1} + (1 - ratio_n) * prob_n
Cached_P_n and Cached_prob_n are the smoothed coordinate information and the smoothed field-of-view probability, and Cached_P_{n-1} and Cached_prob_{n-1} are the coordinate information and the field-of-view probability obtained from the previous smoothing.
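A minimal sketch of this weighted update is given below; how the smoothing weight ratio_n is combined from the distance weight and prob_n is left to the caller, since only the weighted-update formulas are reproduced above, and the function name is an illustrative assumption.

    import numpy as np

    def smooth_keypoint(prev_coord, curr_coord, prev_prob, curr_prob, ratio):
        """Apply Cached_P_n = ratio * Cached_P_{n-1} + (1 - ratio) * P_n and
        Cached_prob_n = ratio * Cached_prob_{n-1} + (1 - ratio) * prob_n."""
        smoothed_coord = ratio * np.asarray(prev_coord) + (1.0 - ratio) * np.asarray(curr_coord)
        smoothed_prob = ratio * prev_prob + (1.0 - ratio) * curr_prob
        return smoothed_coord, smoothed_prob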
The filter introduced in this embodiment smooths both the degree to which a key point jumps and whether it is within the field of view: by lowering the weight of the current frame when the key point coordinates jump violently or the key point is outside the field of view, the human-body key point output remains, on the whole, stable and continuous.
Step 203: drive the motion of the corresponding virtual character according to the coordinate information and the field-of-view probabilities of the human-body key points.
After the final coordinate information and the field-of-view probabilities of the human-body key points of the target image frame have been obtained, the field-of-view probability can be used to judge whether the corresponding human-body key point lies within the field of view of the target image frame; if it does, the corresponding body part of the virtual character is moved to the position given by the coordinate information, and if it does not, the virtual character is not moved for that part.
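A hypothetical driving loop is sketched below; the avatar interface, the move_joint method and the 0.5 probability threshold are illustrative assumptions rather than an interface defined by this application.

    FOV_THRESHOLD = 0.5   # assumed probability threshold for "inside the field of view"

    def drive_avatar(avatar, keypoint_names, coords, fov_probs):
        """Move each virtual character body part to its key point position only when
        the key point is judged to be inside the shooting field of view."""
        for name, coord, prob in zip(keypoint_names, coords, fov_probs):
            if prob >= FOV_THRESHOLD:
                avatar.move_joint(name, coord)   # in the field of view: drive this part
            # out of the field of view: leave the corresponding part unchanged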
For example, these human-body key points allow richer interaction between the streamer and the audience, such as waving, making a heart gesture, displaying a 3D gift sent by a user on the arm or wrist of the 3D avatar, or letting the avatars of the streamer and the users play interactive games in a virtual 3D space.
In this embodiment, the key point detection model can detect the coordinate information and the field-of-view probabilities of the human-body key points of the target image frame, the field-of-view probability being the probability that a human-body key point appears within the shooting field of view of the target image frame, and the field-of-view probabilities are then combined with the coordinate information to drive the motion of the corresponding virtual character. This provides an end-to-end signal output for human-body key points, guarantees a degree of stability in near-field situations, and meets the requirements for driving a virtual character on a mobile phone.
Embodiment 3
Fig. 5 is a schematic structural diagram of an apparatus for training a key point detection model provided in Embodiment 3 of the present application, which may include the following modules: a key point detection module 301, configured to perform key point detection on multiple sample image frames in a sample set to determine the coordinate information of the multiple key points included in each sample image frame; a field-of-view label determination module 302, configured to determine a field-of-view label of each key point on the basis of the coordinate information of that key point, the field-of-view label marking whether the key point lies within the shooting field of view of the sample image frame to which it belongs; and a model training module 303, configured to train a key point detection model with the coordinate information and the field-of-view labels of the multiple key points of the multiple sample image frames as supervision signals, the key point detection model being used, in the model inference stage, to perform key point detection on a target image frame and to output the coordinate information and the field-of-view probabilities of the key points of the target image frame.
In one embodiment, the key point detection module 301 may include the following modules: a 2D pose prediction module, configured to input the multiple sample image frames into a pre-generated 2D pose network and obtain the 2D coordinate information of the key points of the multiple sample image frames output by the 2D pose network; a 3D pose prediction module, configured to input the multiple sample image frames into a pre-generated 3D pose network and obtain the 3D coordinate information of the key points of the multiple sample image frames output by the 3D pose network; and a coordinate determination module, configured to determine the coordinate information of each key point on the basis of the obtained 2D coordinate information and 3D coordinate information of the key points.
In one embodiment, the 2D coordinate information includes a horizontal coordinate value and a vertical coordinate value, and the 3D coordinate information includes an X-axis coordinate value, a Y-axis coordinate value and a Z-axis coordinate value; the coordinate determination module may include the following modules: a stable key point determination module, configured to determine a first stable key point and a second stable key point among the multiple key points of the multiple sample image frames; an adjustment coefficient determination module, configured to determine an adjustment coefficient from the 2D coordinate information and the 3D coordinate information of the first stable key point and of the second stable key point; an adjustment module, configured to adjust the Z-axis coordinate values of the multiple key points of the multiple sample image frames with the adjustment coefficient to obtain a depth value for each key point; and a coordinate generation module, configured to take the horizontal coordinate value, the vertical coordinate value and the depth value of each key point as the coordinate information of that key point.
In one embodiment, the adjustment coefficient determination module is configured to: determine the absolute value of the difference between the 2D coordinate information of the first stable key point and that of the second stable key point as a first difference; determine the absolute value of the difference between the 3D coordinate information of the first stable key point and that of the second stable key point as a second difference; and take the ratio of the first difference to the second difference as the adjustment coefficient.
In one embodiment, the coordinate information includes a horizontal coordinate value and a vertical coordinate value, and the field-of-view label includes an in-field-of-view label and an out-of-field-of-view label;
the field-of-view label determination module 302 is configured to: obtain the width and height of each sample image frame; with the origin of the image coordinate system as the starting point, determine a horizontal coordinate range from the width of the sample image frames and a vertical coordinate range from their height; if the horizontal coordinate value of a key point lies within the horizontal coordinate range, or its vertical coordinate value lies within the vertical coordinate range, decide that the field-of-view label of the key point is the in-field-of-view label; and if the horizontal coordinate value of the key point lies outside the horizontal coordinate range and its vertical coordinate value lies outside the vertical coordinate range, decide that the field-of-view label of the key point is the out-of-field-of-view label.
In one embodiment, the apparatus may further include the following module: an image augmentation module, configured to, after the coordinate information of the multiple key points included in each sample image frame has been determined, perform image augmentation on the sample image frames on the basis of the coordinate information of the multiple key points included in each sample image frame, the image augmentation including at least one of, or a combination of, the following: random perturbation and cropping.
In one embodiment, the image augmentation module is configured to: determine a crop box centre position on the basis of the coordinate information of the multiple key points included in each sample image frame; and determine a crop box position from the crop box centre position and, according to the crop box position, set the RGB values of the pixels outside the crop box to black.
In one embodiment, the loss function used when training the key point detection model includes a heat-map loss function, a position loss function and a label loss function.
The apparatus for training a key point detection model provided in this embodiment of the present application can carry out the method for training a key point detection model provided in Embodiment 1 of the present application, and has the functional modules corresponding to carrying out that method.
Embodiment 4
Fig. 6 is a schematic structural diagram of a virtual character driving apparatus provided in Embodiment 4 of the present application, which may include the following modules: an image acquisition module 401, configured to capture a target image frame, the target image frame including an image of part of a human body; a human-body key point detection module 402, configured to input the target image frame into a pre-trained key point detection model and obtain the coordinate information and the field-of-view probabilities of the human-body key points of the target image frame output by the key point detection model, the field-of-view probability being the probability that a human-body key point appears within the shooting field of view of the target image frame; and a virtual character driving module 403, configured to drive the motion of the corresponding virtual character according to the coordinate information and the field-of-view probabilities of the human-body key points.
In one embodiment, the apparatus may further include the following modules: a smoothing weight determination module, configured to determine a smoothing weight between the current target image frame and the previous target image frame; and a smoothing module, configured to smooth the coordinate information and the field-of-view probability with the smoothing weight.
In one embodiment, the smoothing weight determination module is configured to: for each human-body key point, determine the distance between the coordinate information of the key point in the current target image frame and its smoothed coordinate information in the previous target image frame; compare the distance with a set distance and determine a distance weight from the comparison result; and compute the smoothing weight from the distance weight and the field-of-view probability of the key point in the current target image frame.
In one embodiment, the smoothing module is configured to: determine, on the basis of the smoothing weight, a first weight for the previous target image frame and a second weight for the current target image frame; compute a weighted combination of the coordinate information of the previous target image frame and of the current target image frame with the first weight and the second weight to obtain smoothed coordinate information; and compute a weighted combination of the field-of-view probability of the previous target image frame and of the current target image frame with the first weight and the second weight to obtain a smoothed field-of-view probability.
The virtual character driving apparatus provided in this embodiment of the present application can carry out the virtual character driving method provided in Embodiment 2 of the present application, and has the functional modules corresponding to carrying out that method.
Embodiment 5
Fig. 7 shows a schematic structural diagram of an electronic device 10 that can be used to implement the method embodiments of the present application. As shown in Fig. 7, the electronic device 10 may be a server, a mobile phone or a similar device and includes at least one processor 11 and storage devices communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, the storage devices storing one or more computer programs executable by the at least one processor; the processor 11 can perform various appropriate actions and processes according to a computer program stored in the ROM 12 or a computer program loaded from a storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10.
In some embodiments, the method of Embodiment 1 or Embodiment 2 may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or a communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the method of Embodiment 1 or Embodiment 2 described above may be carried out.
In some embodiments, the method of Embodiment 1 or Embodiment 2 may be implemented as a computer program product that includes computer-executable instructions which, when executed, are used to carry out one or more steps of the method of Embodiment 1 or Embodiment 2 described above.

Claims (16)

  1. A virtual character driving method, comprising:
    capturing a target image frame, the target image frame comprising an image of part of a human body;
    inputting the target image frame into a pre-trained key point detection model, and obtaining coordinate information and field-of-view probabilities of human-body key points of the target image frame output by the key point detection model, the field-of-view probability being the probability that a human-body key point appears within the shooting field of view of the target image frame;
    driving a motion of a corresponding virtual character according to the coordinate information and the field-of-view probabilities of the human-body key points.
  2. The method according to claim 1, further comprising, before the driving the motion of the corresponding virtual character according to the coordinate information and the field-of-view probabilities of the human-body key points:
    determining a smoothing weight between a current target image frame and a previous target image frame;
    smoothing the coordinate information and the field-of-view probabilities with the smoothing weight.
  3. The method according to claim 2, wherein the determining the smoothing weight between the current target image frame and the previous target image frame comprises:
    determining, for each human-body key point of a plurality of human-body key points, a distance between the coordinate information of the human-body key point in the current target image frame and the smoothed coordinate information of the human-body key point in the previous target image frame;
    comparing the distance with a set distance, and determining a distance weight according to the comparison result;
    computing the smoothing weight with the distance weights of the plurality of human-body key points and the field-of-view probabilities of the plurality of human-body key points in the current target image frame.
  4. The method according to claim 2 or 3, wherein the smoothing the coordinate information and the field-of-view probabilities with the smoothing weight comprises:
    determining, on the basis of the smoothing weight, a first weight of the previous target image frame and a second weight of the current target image frame;
    computing a weighted combination of the coordinate information of the previous target image frame and the coordinate information of the current target image frame with the first weight and the second weight to obtain smoothed coordinate information;
    computing a weighted combination of the field-of-view probability of the previous target image frame and the field-of-view probability of the current target image frame with the first weight and the second weight to obtain a smoothed field-of-view probability.
  5. A method for training a key point detection model, comprising:
    performing key point detection on multiple sample image frames in a sample set to determine coordinate information of multiple key points comprised in each sample image frame;
    determining a field-of-view label of each key point on the basis of the coordinate information of the key point, the field-of-view label being used to mark whether the key point is within the shooting field of view of the sample image frame to which it belongs;
    training a key point detection model with the coordinate information and the field-of-view labels of the multiple key points of the multiple sample image frames as supervision signals, the key point detection model being used, in a model inference stage, to perform key point detection on a target image frame and to output coordinate information and field-of-view probabilities of key points of the target image frame.
  6. The method according to claim 5, wherein the performing key point detection on the multiple sample image frames in the sample set to determine the coordinate information of the multiple key points comprised in each sample image frame comprises:
    inputting the multiple sample image frames into a pre-generated 2D pose network, and obtaining 2D coordinate information of the key points of the multiple sample image frames output by the 2D pose network;
    inputting the multiple sample image frames into a pre-generated 3D pose network, and obtaining 3D coordinate information of the key points of the multiple sample image frames output by the 3D pose network;
    determining the coordinate information of each key point on the basis of the obtained 2D coordinate information and 3D coordinate information of the key points.
  7. The method according to claim 6, wherein the 2D coordinate information comprises a horizontal coordinate value and a vertical coordinate value, and the 3D coordinate information comprises an X-axis coordinate value, a Y-axis coordinate value and a Z-axis coordinate value;
    the determining the coordinate information of each key point on the basis of the obtained 2D coordinate information and 3D coordinate information of the key points comprises:
    determining a first stable key point and a second stable key point among the multiple key points of the multiple sample image frames;
    determining an adjustment coefficient according to the 2D coordinate information and the 3D coordinate information of the first stable key point and the 2D coordinate information and the 3D coordinate information of the second stable key point;
    adjusting the Z-axis coordinate values of the multiple key points of the multiple sample image frames with the adjustment coefficient to obtain a depth value of each key point;
    taking the horizontal coordinate value, the vertical coordinate value and the depth value of each key point as the coordinate information of the key point.
  8. The method according to claim 7, wherein the determining the adjustment coefficient according to the 2D coordinate information and the 3D coordinate information of the first stable key point and the 2D coordinate information and the 3D coordinate information of the second stable key point comprises:
    determining the absolute value of the difference between the 2D coordinate information of the first stable key point and the 2D coordinate information of the second stable key point as a first difference;
    determining the absolute value of the difference between the 3D coordinate information of the first stable key point and the 3D coordinate information of the second stable key point as a second difference;
    taking the ratio of the first difference to the second difference as the adjustment coefficient.
  9. The method according to any one of claims 5-8, wherein the coordinate information comprises a horizontal coordinate value and a vertical coordinate value, and the field-of-view label comprises an in-field-of-view label and an out-of-field-of-view label;
    the determining the field-of-view label of each key point on the basis of the coordinate information of the key point comprises:
    obtaining the width and height of each sample image frame;
    with the origin of the image coordinate system as the starting point, determining a horizontal coordinate range according to the width of the multiple sample image frames, and determining a vertical coordinate range according to the height of the multiple sample image frames;
    in response to the horizontal coordinate value of a key point being within the horizontal coordinate range, or the vertical coordinate value of the key point being within the vertical coordinate range, deciding that the field-of-view label of the key point is the in-field-of-view label;
    in response to the horizontal coordinate value of the key point not being within the horizontal coordinate range and the vertical coordinate value of the key point not being within the vertical coordinate range, deciding that the field-of-view label of the key point is the out-of-field-of-view label.
  10. The method according to any one of claims 5-8, further comprising, after the determining the coordinate information of the multiple key points comprised in each sample image frame:
    performing image augmentation on the multiple sample image frames on the basis of the coordinate information of the multiple key points comprised in each sample image frame, the image augmentation comprising at least one of the following: random perturbation, cropping.
  11. The method according to claim 10, wherein, in the case where the image augmentation comprises the cropping, the cropping comprises:
    determining a crop box centre position on the basis of the coordinate information of the multiple key points comprised in each sample image frame;
    determining a crop box position according to the crop box centre position, and setting the RGB values of the pixels outside the crop box to black according to the crop box position.
  12. A virtual character driving apparatus, comprising:
    an image acquisition module, configured to capture a target image frame, the target image frame comprising an image of part of a human body;
    a human-body key point detection module, configured to input the target image frame into a pre-trained key point detection model, and obtain coordinate information and field-of-view probabilities of human-body key points of the target image frame output by the key point detection model, the field-of-view probability being the probability that a human-body key point appears within the shooting field of view of the target image frame;
    a virtual character driving module, configured to drive a motion of a corresponding virtual character according to the coordinate information and the field-of-view probabilities of the human-body key points.
  13. An apparatus for training a key point detection model, comprising:
    a key point detection module, configured to perform key point detection on multiple sample image frames in a sample set to determine coordinate information of multiple key points comprised in each sample image frame;
    a field-of-view label determination module, configured to determine a field-of-view label of each key point on the basis of the coordinate information of the key point, the field-of-view label being used to mark whether the key point is within the shooting field of view of the sample image frame to which it belongs;
    a model training module, configured to train a key point detection model with the coordinate information and the field-of-view labels of the multiple key points of the multiple sample image frames as supervision signals, the key point detection model being used, in a model inference stage, to perform key point detection on a target image frame and to output coordinate information and field-of-view probabilities of key points of the target image frame.
  14. An electronic device, comprising:
    at least one processor;
    a storage device configured to store at least one program,
    wherein, when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the method according to any one of claims 1-11.
  15. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1-11.
  16. A computer program product comprising computer-executable instructions which, when executed, are used to implement the method according to any one of claims 1-11.
PCT/CN2023/116711 2022-09-20 2023-09-04 Method and apparatus for training key point detection model and for driving virtual character WO2024060978A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211145707.2A CN115482556A (zh) 2022-09-20 2022-09-20 Method for training key point detection model and driving virtual character, and corresponding apparatus
CN202211145707.2 2022-09-20

Publications (1)

Publication Number Publication Date
WO2024060978A1 true WO2024060978A1 (zh) 2024-03-28

Family

ID=84423403

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/116711 WO2024060978A1 (zh) 2022-09-20 2023-09-04 Method and apparatus for training key point detection model and for driving virtual character

Country Status (2)

Country Link
CN (1) CN115482556A (zh)
WO (1) WO2024060978A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482556A (zh) * 2022-09-20 2022-12-16 百果园技术(新加坡)有限公司 Method for training key point detection model and driving virtual character, and corresponding apparatus
CN115641647B (zh) * 2022-12-23 2023-03-21 海马云(天津)信息技术有限公司 Digital human wrist driving method and apparatus, storage medium and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985259A (zh) * 2018-08-03 2018-12-11 百度在线网络技术(北京)有限公司 Human body action recognition method and apparatus
CN111126272A (zh) * 2019-12-24 2020-05-08 腾讯科技(深圳)有限公司 Posture acquisition method, and training method and apparatus for key point coordinate positioning model
US20210166070A1 (en) * 2019-12-02 2021-06-03 Qualcomm Incorporated Multi-Stage Neural Network Process for Keypoint Detection In An Image
CN114494427A (zh) * 2021-12-17 2022-05-13 山东鲁软数字科技有限公司 Method, system and terminal for detecting the violation of a person standing under a crane boom
CN115482556A (zh) * 2022-09-20 2022-12-16 百果园技术(新加坡)有限公司 Method for training key point detection model and driving virtual character, and corresponding apparatus

Also Published As

Publication number Publication date
CN115482556A (zh) 2022-12-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23867269; Country of ref document: EP; Kind code of ref document: A1