WO2023024442A1 - Detection method, training method, apparatus, device, storage medium and program product - Google Patents

Detection method, training method, apparatus, device, storage medium and program product

Info

Publication number
WO2023024442A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
feature point
image
limb
initial training
Prior art date
Application number
PCT/CN2022/075197
Other languages
English (en)
French (fr)
Inventor
曹国良
邱丰
刘文韬
钱晨
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023024442A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The present disclosure relates to the technical field of computers, and relates to, but is not limited to, a detection method, a training method, an apparatus, a device, a storage medium and a program product.
  • In the related art, an anchor can display special effects and animations on the live broadcast interface by triggering an external device (for example, a mouse or a keyboard). However, the special-effect animations displayed by triggering an external device are usually relatively simple.
  • Embodiments of the present disclosure at least provide a detection method, a training method, an apparatus, a device, a storage medium, and a program product.
  • In a first aspect, an embodiment of the present disclosure provides a pose detection method, including: acquiring a video image frame containing body parts of a target anchor; performing pose detection on the target anchor in the video image frame by a pose detection model to obtain a pose detection result, wherein the pose detection model is obtained by supervised training using a target sample image and corresponding training labels, the target sample image includes a body-part sample to be detected, and the training label is obtained by adjusting the initial training label of a body feature point of the body-part sample based on the supervision type of that feature point; generating, according to the pose detection result, a pose trigger signal of the virtual anchor corresponding to the target anchor; and controlling, according to the pose trigger signal, the virtual anchor to execute the corresponding trigger action.
  • In this way, the pose detection model detects the pose of the target anchor, and the virtual anchor is controlled to execute the corresponding trigger action according to the pose detection result. The pose detection result of the target anchor can thus automatically trigger the corresponding part of the virtual anchor displayed on the live video interface to perform the corresponding trigger operation, that is, trigger the virtual anchor to perform actions associated with the anchor's body movements, thereby improving the actual interaction effect.
  • In addition, using the pose detection results of the body parts to be detected to trigger the virtual anchor also realizes precise triggering of the corresponding trigger parts of the virtual anchor, so as to meet users' rich triggering needs.
  • Moreover, since the training label is obtained by adjusting the initial training label of a body feature point based on the supervision type of that feature point, the pose detection model can be trained by reusing the initial training labels, which saves the cost of data labeling and speeds up data labeling.
  • In an optional implementation, performing pose detection on the target anchor in the video image frame by the pose detection model to obtain a pose detection result includes: detecting a target box in the video image frame, wherein the target box is used to frame the body parts of the target anchor in the video image frame; and performing pose detection, by the pose detection model, on the image within the target box in the video image frame to obtain the pose detection result.
  • In this way, the accuracy of the pose detection result can be improved.
  • In an optional implementation, performing pose detection on the target anchor in the video image frame by the pose detection model to obtain a pose detection result includes: in the case where at least part of the specified body parts of the target anchor is not detected in the video image frame, processing the video image frame to obtain a target image, where the target image includes a region for pose detection of the at least part of the specified body parts; and performing pose detection on the target image by the pose detection model to obtain the pose detection result of the target anchor.
  • Since the target image is obtained by performing edge-filling processing on the video image frame and pose detection is then performed on the target image, pose detection can still be carried out even if the video image frame does not contain a complete specified body part, and an accurate pose detection result can be obtained.
  • In an optional implementation, adjusting the initial training label corresponding to a body feature point of the body-part sample based on the supervision type of that feature point to obtain the training label includes: determining the supervision type of the body feature point; and correcting the initial training label based on the supervision type, so as to determine the training label of the target sample image according to the corrected initial training label.
  • In an optional implementation, determining the supervision type of the body feature point includes: determining feature information of the body feature point, wherein the feature information is used to indicate the positional relationship between the corresponding body feature point and the target sample image, and/or to indicate whether the corresponding body feature point is a feature point of the part to be detected; determining the supervision type of the body feature point based on the feature information; and, in the case where it is determined based on the supervision type that the initial training label of the body feature point satisfies the correction condition, correcting the initial training label based on the supervision type to obtain the training label.
  • In this way, the initial training labels of the body feature points that need to be supervised can be retained, and at the same time the utilization of the initial training labels in the initial training samples can be improved.
  • In a second aspect, an embodiment of the present disclosure provides a training method for a neural network model, including: acquiring an initial training sample, where the initial training sample includes a sample image and initial training labels of body feature points of the body parts of a target object contained in the sample image; determining the part to be detected among the body parts based on the detection task of the network model to be trained, and processing the sample image to obtain a target sample image containing the part to be detected; determining the supervision type of each body feature point, and correcting the initial training labels based on the supervision type, so as to determine the training label of the target sample image according to the corrected initial training labels; and performing supervised training on the network model to be trained using the target sample image and the training label.
  • In this way, by correcting the sample image and the initial training labels in the initial training sample to obtain the target sample image and the training label, the initial training sample can be reused, so that the same initial training samples can serve the training of multiple different network models to be trained at the same time, thereby reducing the training cost of the neural network by saving the cost of data labeling.
  • In an optional implementation, training the network model to be trained using the target sample image and the training label includes: performing image processing on the target sample image by the network model to be trained to obtain position information of each part to be detected in the target sample image; determining, based on the position information of a first part and a second part among the multiple parts to be detected and the training label, the function value of a target loss function used to constrain the position difference between the first part and the second part, wherein the first part and the second part are parts to be detected that have an association relationship; and adjusting the model parameters of the network model to be trained according to the function value of the target loss function, so as to train the network model to be trained with the adjusted model parameters.
  • In this way, constructing the target loss function from the position difference between the position information of the first part and that of the second part, together with the training label, can reduce the occurrence of large position differences between the first part and the second part, thereby improving the processing accuracy of the trained network model.
  • In an optional implementation, determining the supervision type of the body feature point and correcting the initial training label based on the supervision type includes: determining feature information of the body feature point, wherein the feature information is used to indicate the positional relationship between the corresponding body feature point and the target sample image, and/or to indicate whether the corresponding body feature point is a feature point of the part to be detected; determining the supervision type of the body feature point based on the feature information; and, in the case where it is determined based on the supervision type that the initial training label of the body feature point satisfies the correction condition, correcting the initial training label based on the supervision type.
  • In this way, the initial training labels of the body feature points that need to be supervised can be retained, and at the same time the utilization of the initial training labels in the initial training samples can be improved.
  • By correcting the initial training labels of the body feature points according to the supervision type, not only the body feature points located within the target sample image can be supervised, but also the body feature points located outside the target sample image that require supervised learning.
  • When the network model to be trained is trained based on the corrected initial training labels (that is, the training labels) and the target sample image, the network model can predict the body feature points of body parts not contained in the image.
  • Since the corrected initial training labels and target sample images determined in the technical solution of the present disclosure are used to train the network model to be trained, the body feature points of body parts not contained in the image can be predicted, so as to realize pose recognition of the object to be detected. Therefore, the network model trained by the technical solution of the present disclosure is more robust, and the recognition process is more stable.
  • In an optional implementation, correcting the initial training label based on the supervision type includes: in the case where it is determined from the supervision type that the body feature point is located outside the target sample image and is not a feature point of the part to be detected, correcting the initial training label of the body feature point to a first initial training label; the first initial training label is used to characterize the body feature point as an unsupervised feature point located outside the target sample image.
  • In an optional implementation, correcting the initial training label based on the supervision type includes: in the case where it is determined from the supervision type that the body feature point is located outside the target sample image and is a feature point of the part to be detected, correcting the initial training label of the body feature point to a second initial training label; the second initial training label is used to characterize the body feature point as a supervised feature point located outside the target sample image.
  • In this way, the initial training samples can serve multiple different detection tasks at the same time, reducing the training cost of the neural network by saving the cost of data labeling. Furthermore, when the network model to be trained is trained using the corrected initial training labels and target sample images determined in the technical solution of the present disclosure, the body feature points of body parts not contained in the image can be predicted, realizing pose recognition of the target to be detected. Therefore, the network model trained by the technical solution of the present disclosure is more robust, and the recognition process is more stable.
  • In an optional implementation, the sample image is an image determined after a target operation is performed on an original image, where the original image is an image, collected in a scene whose complexity does not meet a preset complexity requirement, containing at least one body part of the target object, and the target operation includes at least one of the following: a background replacement operation, an operation of adjusting image brightness, an operation of adjusting image exposure, and an operation of adding a reflective layer.
  • In this way, a large number of training samples can be obtained in a short period of time at a lower cost.
  • Moreover, the complexity of the background of the sample image can be increased by performing on the original image at least one image enhancement process among adjusting image brightness, adjusting image exposure, and adding a reflective layer.
  • The sample image obtained through at least one of the above target operations can thus be closer to the real application scene, and training the network model to be trained on such sample images can improve its processing accuracy.
  • In an optional implementation, processing the sample image to obtain the target sample image containing the part to be detected includes: performing image occlusion processing on the target detection part in the sample image to obtain the target sample image, where the target detection part is a part of the body parts other than the part to be detected; and/or performing image cropping processing on the image located in the target area of the sample image to obtain a target sample image containing the part to be detected, where the target area is the image area in the sample image that contains the part to be detected.
  • In this way, a sample image matching the detection task can be obtained without re-collecting sample images, thereby realizing reuse of the sample images and reducing the training time and training cost of the neural network.
  • In a third aspect, an embodiment of the present disclosure provides a pose detection device, including: a first acquisition unit configured to acquire a video image frame containing body parts of a target anchor; a pose detection unit configured to perform pose detection on the target anchor in the video image frame by a pose detection model to obtain a pose detection result, wherein the pose detection model is obtained by supervised training using target sample images and corresponding training labels, the target sample images include a body-part sample to be detected, and the training label is obtained by adjusting the initial training label of a body feature point of the body-part sample based on the supervision type of that feature point; a generation unit configured to generate, according to the pose detection result, a pose trigger signal of the virtual anchor corresponding to the target anchor; and a control unit configured to control, according to the pose trigger signal, the virtual anchor to perform the corresponding trigger action.
  • In a fourth aspect, an embodiment of the present disclosure provides a training device for a neural network model, including: a first acquisition unit configured to acquire an initial training sample, where the initial training sample includes a sample image and initial training labels of body feature points of the body parts of a target object contained in the sample image; a correction processing unit configured to determine the part to be detected among the body parts based on the detection task of the network model to be trained, and to process the sample image to obtain a target sample image containing the part to be detected; a determination unit configured to determine the supervision type of each body feature point and correct the initial training labels based on the supervision type, so as to determine the training label of the target sample image according to the corrected initial training labels; and a training unit configured to perform supervised training on the network model to be trained using the target sample image and the training label.
  • In another aspect, an embodiment of the present disclosure further provides a computer device, including a processor, a memory, and a bus, where the memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the above first aspect, or of any possible implementation of the first aspect, are performed.
  • In another aspect, embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the above first aspect, or of any possible implementation of the first aspect, are performed.
  • In another aspect, embodiments of the present disclosure further provide a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program.
  • the computer program product may be a software installation package.
  • FIG. 1 shows a flowchart of a training method for a neural network model provided by an embodiment of the present disclosure;
  • FIG. 2 shows a flowchart of determining the supervision type of body feature points and correcting the initial training labels based on the supervision type in a training method for a neural network model provided by an embodiment of the present disclosure;
  • FIG. 3 shows a flowchart of a pose detection method provided by an embodiment of the present disclosure;
  • FIG. 4 shows a schematic diagram of a pose detection device provided by an embodiment of the present disclosure;
  • FIG. 5 shows a schematic diagram of a training device for a neural network model provided by an embodiment of the present disclosure;
  • FIG. 6 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure;
  • FIG. 7 shows a schematic diagram of another computer device provided by an embodiment of the present disclosure.
  • the present disclosure provides a training method of a neural network model.
  • the detection task of the network model to be trained can be used to determine the part to be detected of the target object.
  • the sample image can be processed based on the part to be detected, so as to obtain a target sample image that meets the training requirements of the network model to be trained, that is, a target sample image including the part to be detected.
  • Further, the training label matching the training process of the network model to be trained can be obtained, and the network model to be trained is then trained according to the matching training label and the target sample image.
  • In this way, by correcting the sample image and the initial training labels in the initial training sample to obtain the target sample image and the training label, reuse of the initial training sample is realized, so that the same initial training samples can serve the training of multiple different network models to be trained at the same time, thereby reducing the training cost of the neural network by saving the cost of data labeling.
  • In addition, the pose detection model detects the pose of the target anchor in the video image frame, and the virtual anchor is controlled to execute the corresponding trigger action according to the pose detection result; the pose detection result of the target anchor can thus automatically trigger the corresponding part of the virtual anchor displayed on the live video interface to perform the corresponding trigger operation, triggering the virtual anchor to perform actions associated with the anchor's body movements and thereby improving the actual interaction effect.
  • To facilitate understanding of this embodiment, the training method for a neural network model disclosed in an embodiment of the present disclosure is first introduced in detail.
  • The execution subject of the training method for a neural network model provided by an embodiment of the present disclosure is generally a computer device with certain computing capability.
  • In some possible implementations, the training method for the neural network model may be implemented by a processor calling computer-readable instructions stored in a memory.
  • As shown in FIG. 1, which is a flowchart of a training method for a neural network model provided by an embodiment of the present disclosure, the method includes steps S101 to S107, wherein:
  • S101: Acquire an initial training sample; the initial training sample includes a sample image and initial training labels of body feature points of a body part of a target object included in the sample image.
  • The target object can be a real human body, a virtual human body, or another object on which body detection can be performed.
  • The body parts of the target object may be the body parts of the whole body of the target object, or some of them (for example, the body parts of the upper body).
  • Generally, the body parts of the target object included in the sample image should be able to satisfy most body detection tasks.
  • For example, the body parts can be the complete upper-body parts of the target object (for example, head, arms, upper torso, and hands), or the complete whole-body parts of the target object (for example, head, arms, upper torso, hands, legs, and feet).
  • The initial training label of a body feature point can be understood as: the position information of the body feature point in the sample image and/or its body category information (for example, a key point belonging to the hand), together with the supervised-learning state of the feature point in each training task (for example, a feature point that requires supervised learning, or one that does not).
  • There may be multiple groups of initial training samples.
  • multiple groups of initial training samples may be determined based on the object type of the target object and/or the body parts of the target object.
  • Each initial training sample contains a corresponding sample label, which can be used to indicate the object type of the corresponding target object and/or the body part of the target object.
  • the target object may be a human being, or other objects capable of body recognition other than human beings, for example, a simulated virtual robot.
  • At least one group of initial training samples can be set for "person"; for example, one group of initial training samples can be set based on the whole-body parts of a person, and another group based on the upper-body parts of a person.
  • In this way, the sample content of the training samples can be enriched, so that the initial training samples can cover richer training scenarios.
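  • As an illustration only, the sketch below shows one way such an initial training sample (sample image plus per-keypoint initial training labels and a sample label) might be represented in Python; the class and field names are hypothetical, not taken from the present disclosure:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np


@dataclass
class KeypointLabel:
    """Initial training label of one body feature point (hypothetical layout)."""
    name: str                       # body category information, e.g. "left_wrist"
    position: Tuple[float, float]   # (x, y) in sample-image pixel coordinates
    supervised: bool = True         # supervised-learning state in the current task


@dataclass
class InitialTrainingSample:
    """One initial training sample: a sample image plus its keypoint labels."""
    image: np.ndarray                       # H x W x 3 sample image
    keypoints: List[KeypointLabel] = field(default_factory=list)
    sample_label: str = "person/full_body"  # indicates object type / body parts


# Example: a full-body sample that can later be reused for several detection tasks.
sample = InitialTrainingSample(
    image=np.zeros((256, 192, 3), dtype=np.uint8),
    keypoints=[
        KeypointLabel("head", (96.0, 30.0)),
        KeypointLabel("left_wrist", (40.0, 150.0)),
        KeypointLabel("left_ankle", (80.0, 245.0)),
    ],
)
```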
  • S103: Determine the part to be detected among the body parts based on the detection task of the network model to be trained; and process the sample image to obtain a target sample image containing the part to be detected.
  • The detection task of the network model to be trained may be used to indicate the body parts that the network model needs to detect (i.e., the above-mentioned parts to be detected).
  • In implementation, the sample images in the initial training samples can be processed according to the part to be detected.
  • For example, the sample image may be cropped to obtain a target sample image containing the part to be detected.
  • In implementation, the object type of the target object is determined based on the detection task, and, based on the sample labels of each group of initial training samples, an initial training sample matching the object type of the target object is determined among the multiple groups of initial training samples.
  • In this way, the initial training samples matching the network model to be trained can be determined from the multiple groups of initial training samples; training the network model on the matching initial training samples improves the efficiency of determining the target sample images and training labels as well as their accuracy, so that a network model with high training accuracy can be obtained.
  • Here, processing the sample image may be understood as: performing occlusion processing on the sample image, and/or performing cropping processing on the sample image.
  • The occlusion processing can be understood as blocking the parts in the sample image other than the part to be detected; the cropping processing can be understood as cutting out the part to be detected from the sample image.
  • S105: Determine the supervision type of the body feature points, and correct the initial training labels based on the supervision type, so as to determine the training label of the target sample image according to the corrected initial training labels.
  • In implementation, the supervision type of each body feature point can be determined based on the detection task of the network model to be trained and/or the part to be detected.
  • The supervision type is used to indicate whether supervised learning of a body feature point is required during the training of the network model to be trained.
  • If a body feature point does not belong to the part to be detected, its supervision type can be determined as an unsupervised feature point; if the body feature point belongs to the part to be detected, its supervision type can be determined as a supervised feature point.
  • After the initial training labels are corrected, the training label of the target sample image used for training the network model to be trained is obtained.
  • S107: Perform supervised training on the network model to be trained using the target sample image and the training label.
  • a new set of initial training samples can also be constructed based on the target training samples and training labels.
  • a sample label may be added to the new initial training sample, for example, the sample label may be used to indicate the object type of the target object corresponding to the new initial training sample, and/or, the target object contained in the new initial training sample Part information of the body part of the object.
  • By constructing new initial training samples, the pool of initial training samples can be expanded so as to meet richer training requirements; at the same time, the matching degree between the initial training samples and the corresponding training tasks can be improved, thereby further speeding up the training of the neural network.
  • In this way, the target sample image and the training label can be obtained and the initial training samples reused, so that the same initial training samples can be used for multiple different detection tasks at the same time, thereby reducing the training cost of the neural network by saving the cost of data labeling.
  • the initial training samples can be obtained in the following manner, including the following process:
  • a sample library is created, which contains multiple sample data sets, and each sample data set contains a corresponding sample label, and the sample label is used to indicate the detection task corresponding to each sample data set.
  • each sample data set in the sample library is each group of initial training samples described in the above process.
  • By indicating the detection task corresponding to each sample data set, the sample label determines the object type of the target object corresponding to that data set and/or the information on the body parts of the target object contained in it.
  • the corresponding sample data set can be found from the sample library as the initial training sample.
  • the sample database may contain sample data set A for body detection.
  • The sample images in sample data set A may contain all the body parts of the target object (the above-mentioned detection parts), with body key points (that is, the aforementioned body feature points) annotated.
  • the sample images in the selected initial training samples can be processed according to the detection task corresponding to the network model to be trained to obtain the target sample image.
  • In addition, the supervision type of each body feature point in the initial training sample selected for the network model to be trained at the current moment can be determined, and the initial training labels of the body feature points are corrected based on the supervision type, so as to obtain the training label of the target training sample.
  • In an embodiment of the present disclosure, the sample images in the above initial training samples are images determined after a target operation is performed on original images, where an original image is an image, collected in a scene whose complexity does not meet a preset complexity requirement, containing at least one body part of the target object, and the target operation includes at least one of the following: a background replacement operation, an operation of adjusting image brightness, an operation of adjusting image exposure, and an operation of adding a reflective layer.
  • In implementation, the application scenario of the network model to be trained can be obtained; then, based on the application scenario, the operation type of the matching target operation can be determined; after that, the corresponding target operation is performed on the original image based on the operation type to obtain a sample image.
  • the sample image in the initial training sample can be closer to the real application scenario of the network model to be trained, thereby improving the processing accuracy of the network model to be trained.
  • For example, the original image may be a raw image, captured against a green-screen background, that includes at least one detection part of the target object.
  • For example, at least one original image including all body parts of the target human body may be acquired.
  • Here, a simple scene may be understood as a green-screen background, or a scene under any background that contains no, or only a few, other objects.
  • In other words, the above simple scene can be understood as a scene in which no body part of the target object is occluded and each body part of the target object can be accurately identified.
  • In implementation, the background image in each original image can be determined, and multiple preset background images can be prepared, so that the background image in each original image can be replaced with the preset background images to obtain the sample images of the initial training samples.
  • each preset background image may be a complex background image containing various elements. Then, the background image in the original image is replaced according to multiple preset background images to obtain the sample image in the initial training sample.
  • the multiple preset background images include multiple types, wherein the type of the preset background image is associated with the application scenario of the network model to be trained.
  • In implementation, multiple preset background images matching the application scenario of the network model may be determined.
  • For different application scenarios, the background elements contained in images may differ. Therefore, in order to improve the training accuracy of the network model to be trained, preset background images matching the application scenario of the network model to be trained can be selected for background replacement of the original images.
  • For example, for a live broadcast scenario, a background image matching the live broadcast scene can be selected from the preset background images for background replacement.
  • the background image matching the live broadcast scene can be understood as a complex background image containing various live broadcast elements constructed by simulating the real live broadcast scene.
  • For example, suppose the application scenario of the network model to be trained is a live broadcast scene, and the detection task of the network model to be trained is to capture the half-body movements of the anchor (the target object).
  • In a real live broadcast scene, the background is often complex and changeable, and a complex background will affect the capture of the anchor's half-body movements: when the anchor's limbs are cut off by the edge of the screen, objects in the background image that are similar to the anchor's clothing color, skin color, or body shape will interfere with capturing the anchor's body movements, making the captured results extremely unstable and seriously affecting the processing accuracy of the network model. In this case, complex background images containing various live broadcast elements can be obtained, and the background images of the original images can be replaced based on these complex background images to obtain sample images.
  • In this way, the sample images in the initial training samples can be closer to the real application scene of the network model to be trained, thereby improving the processing accuracy of the network model to be trained.
  • In an embodiment of the present disclosure, at least one of the following target operations may also be performed on the original image: the operation of adjusting the image brightness, the operation of adjusting the image exposure, and the operation of adding a reflective layer.
  • the matching image enhancement operation can be determined according to the application scenario of the network model to be trained.
  • the sample images in the initial training samples can be closer to the real application scene of the network model to be trained, thereby improving the processing accuracy of the network model to be trained.
  • For example, in a real live broadcast scene, the background is often complex and changeable, with, for instance, a plush toy whose color is similar to the anchor's skin color, a reflective glass wall, or a dim or over-exposed shooting environment, and the captured images will be extremely unstable.
  • In implementation, at least one original image including at least one body part of the target object may be acquired first.
  • For example, at least one original image including all body parts of the target human body may be acquired.
  • At least one of the following image enhancement operations is performed on the collected original image: an operation of adjusting image brightness, an operation of adjusting image exposure, and an operation of adding a reflective layer; after processing, a sample image can be obtained.
  • In this way, the background complexity of the sample image can be increased by performing on the original image at least one image enhancement process among adjusting the image brightness, adjusting the image exposure, and adding a reflective layer; determining the matching image enhancement operation based on the application scenario of the network model to be trained can make the sample image closer to the real application scene, so as to improve the processing accuracy of the network model to be trained.
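  • As a rough illustration (not the implementation of the present disclosure), the four target operations could be sketched in NumPy as follows; the chroma-key threshold, gamma value, and blending weight are illustrative assumptions:

```python
import numpy as np


def replace_background(original: np.ndarray, background: np.ndarray,
                       green_thresh: int = 60) -> np.ndarray:
    """Background replacement: swap a (roughly) green-screen background for a
    preset complex background of the same size, using a crude chroma key."""
    r = original[..., 0].astype(int)
    g = original[..., 1].astype(int)
    b = original[..., 2].astype(int)
    is_bg = (g - r > green_thresh) & (g - b > green_thresh)
    out = original.copy()
    out[is_bg] = background[is_bg]
    return out


def adjust_brightness(img: np.ndarray, delta: int) -> np.ndarray:
    """Brightness adjustment by a constant offset."""
    return np.clip(img.astype(int) + delta, 0, 255).astype(np.uint8)


def adjust_exposure(img: np.ndarray, gamma: float) -> np.ndarray:
    """Exposure-like adjustment via gamma correction (<1 brightens, >1 darkens)."""
    return (255.0 * (img / 255.0) ** gamma).astype(np.uint8)


def add_reflective_layer(img: np.ndarray, highlight: np.ndarray,
                         alpha: float = 0.3) -> np.ndarray:
    """Blend a semi-transparent bright layer to imitate glass reflections."""
    return np.clip((1 - alpha) * img + alpha * highlight, 0, 255).astype(np.uint8)


# Example: brighten an original image and add a synthetic highlight layer.
img = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
out = add_reflective_layer(adjust_brightness(img, 25), np.full_like(img, 255))
```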
  • In an embodiment of the present disclosure, the sample image is processed to obtain the target sample image containing the part to be detected through the following process:
  • Step S1031, performing image occlusion processing on the target detection part in the sample image to obtain the target sample image; where the target detection part is a part of the body parts other than the part to be detected.
  • Step S1032, performing image cropping processing on the image located in the target area of the sample image, so as to obtain by cropping a target sample image containing the part to be detected; the target area is the image area in the sample image that contains the part to be detected.
  • the detection task of the network model to be trained is determined, wherein the detection task is used to indicate the part information of the part to be detected that needs to be detected by the network model to be trained.
  • the part information may be information such as the part name, part code, and number of parts of each part to be detected, and the completeness requirement of each part to be detected.
  • The completeness requirement may include at least one of the following: a requirement that the part be complete, a requirement that the part be incomplete, and, in the case of an incompleteness requirement, the degree of incompleteness of the body part.
  • For example, suppose the detection task is to detect the upper-body parts of the target human body; the parts to be detected then include the following parts: upper torso, arms, head, and hands.
  • image occlusion processing and/or image cropping processing may be performed on the sample image according to the part information.
  • For example, image occlusion processing can be performed on the target detection part with an image of a specified color (for example, a black image) to obtain a target sample image; and/or image cropping processing can be performed to obtain a target sample image containing the part to be detected.
  • Here, detection tasks of the same type can be understood as tasks that all perform body detection; the multiple task requirements of different detection tasks can be understood as body detection performed on different types of body parts.
  • Therefore, the initial training sample should contain all the body parts of the target human body, so that the initial training sample can meet the various task requirements of the same type of detection task.
  • In this way, a sample image matching the detection task can be obtained without re-collecting sample images, thereby realizing reuse of the sample images and reducing the training time and training cost of the neural network.
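  • A minimal sketch of the two processing options, assuming NumPy images and hand-picked part boxes (both of which are assumptions made for illustration):

```python
import numpy as np


def occlude_parts(sample: np.ndarray, boxes: list) -> np.ndarray:
    """Image occlusion: paint the target detection parts (the parts other than
    the part to be detected) with a specified color, here black."""
    out = sample.copy()
    for x0, y0, x1, y1 in boxes:   # one box per part to occlude
        out[y0:y1, x0:x1] = 0
    return out


def crop_target_area(sample: np.ndarray, area: tuple) -> np.ndarray:
    """Image cropping: keep only the target area containing the part to be detected."""
    x0, y0, x1, y1 = area
    return sample[y0:y1, x0:x1].copy()


# Example: from a full-body sample, keep the upper-body area by cropping,
# or black out the legs and feet by occlusion.
full_body = np.random.randint(0, 255, (256, 192, 3), dtype=np.uint8)
upper_body = crop_target_area(full_body, (0, 0, 192, 140))
occluded = occlude_parts(full_body, [(0, 140, 192, 256)])
```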
  • In an embodiment of the present disclosure, step S105, determining the supervision type of the body feature points and correcting the initial training labels based on the supervision type, includes the following process:
  • Step S1051 determining feature information of the body feature point, wherein the feature information is used to indicate the positional relationship between the corresponding body feature point and the target sample image, and/or is used to indicate whether the corresponding body feature point is the feature point of the part to be detected;
  • Step S1052, determining the supervision type of the body feature point based on the feature information;
  • Step S1053, in the case where it is determined based on the supervision type that the initial training label of the body feature point meets the correction condition, correcting the initial training label based on the supervision type.
  • When the sample image is subjected to image occlusion processing and/or image cropping processing in the manner described above to obtain the target sample image, what is actually deleted from the sample image are the body parts that do not need to be detected, that is, the body parts that do not require supervised learning.
  • Therefore, it is also necessary to correct the initial training labels of the body feature points of the body parts that do not require supervised learning for the detection task.
  • In implementation, the feature information of a body feature point can be determined first, and then, according to the feature information, whether the body feature point is one that requires supervised learning can be determined.
  • the supervision types of the body feature points can be determined as the following types:
  • Type 1: the body feature point is a supervised feature point, and the body feature point is located within the target sample image.
  • Type 2: the body feature point is a supervised feature point, and the body feature point is located outside the target sample image.
  • That is, the body feature point is not contained in the target sample image, but is a body feature point that requires supervised learning.
  • For example, suppose the parts to be detected are the upper body of the human body (head, upper torso, two arms, and two hands). When image occlusion processing and/or image cropping processing is performed on the sample image, some parts of the hands may be occluded and/or cropped out; however, the occluded and/or cropped hand parts are still body parts that need to be detected. In this case, the supervision type of the body feature points corresponding to the occluded and/or cropped hand parts is the above Type 2.
  • Type 3: the body feature point is an unsupervised feature point, and the body feature point is located outside the target sample image.
  • After the supervision type of each body feature point is determined based on the above feature information, whether the initial training label of the body feature point satisfies the correction condition can be determined based on the supervision type.
  • If the supervision type of a body feature point is judged to be Type 2 or Type 3, its initial training label satisfies the correction condition and is corrected according to the supervision type; if the supervision type of the body feature point is judged to be Type 1, it is determined that its initial training label does not satisfy the correction condition.
  • It should be noted that the purpose of correcting the initial training labels is to process the initial training labels in the initial training samples so as to obtain the training labels used to train the network model to be trained.
  • In this way, the initial training labels of the body feature points that need to be supervised can be retained, and at the same time the utilization of the initial training labels in the initial training samples can be improved.
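  • The classification step can be pictured with the small sketch below, reconstructed from the three types above; the function and argument names are hypothetical:

```python
def supervision_type(point_xy, image_size, is_part_to_detect: bool) -> int:
    """Classify a body feature point into the three supervision types:
    1 - supervised feature point inside the target sample image;
    2 - supervised feature point outside the image (an occluded/cropped
        part that still needs to be detected);
    3 - unsupervised feature point outside the image."""
    x, y = point_xy
    w, h = image_size
    inside = 0 <= x < w and 0 <= y < h
    if inside:
        return 1          # in-image keypoints of the retained parts stay supervised
    return 2 if is_part_to_detect else 3


# The correction condition then follows directly: only Types 2 and 3 require
# the initial training label to be corrected.
needs_correction = supervision_type((250.0, 40.0), (192, 140), True) != 1
```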
  • Correcting the initial training labels of the body feature points according to the supervision type makes it possible to supervise not only the body feature points located within the target sample image, but also the body feature points located outside the target sample image that require supervised learning.
  • When the network model to be trained is trained based on the corrected initial training labels (that is, the training labels) and the target sample image, the network model can predict the body feature points of body parts not contained in the image.
  • the accuracy of the initial training labels can be improved, thereby further improving the training accuracy of the network model to be trained.
  • In an embodiment of the present disclosure, in step S1053, when it is determined based on the supervision type that the initial training label of the body feature point meets the correction condition, correcting the initial training label based on the supervision type includes the following process:
  • In the case where it is determined from the supervision type that the body feature point is located outside the target sample image and is a feature point of the part to be detected (the above Type 2), the initial training label of the body feature point is corrected to a second initial training label; the second initial training label is used to characterize the body feature point as a supervised feature point located outside the target sample image.
  • an application scenario of the above-mentioned network model to be trained may be to detect an image that does not include a complete detection part, so as to obtain feature points of the complete detection part.
  • For example, the network model to be trained can perform body detection on the parts of the upper body: body detection can be performed on the body parts contained in the image, and also on the body parts that are not contained in the image but need to be detected, so as to obtain body feature points, where the body feature points include the feature points of parts (for example, a hand) not contained in the image.
  • Therefore, after a body feature point is determined to be located outside the target sample image, it is necessary to judge whether it is a feature point of a part to be detected that the network model to be trained needs to process. If so, the initial training label of the body feature point must be corrected to the second initial training label, characterizing the body feature point as a supervised feature point not contained in the target sample image.
  • image occlusion processing or image cropping processing may be performed on the sample images in the initial training samples to obtain sample images that do not include the complete upper body limbs.
  • the occluded or cropped body parts are not included in the processed sample image.
  • the occluded or cropped body part is the body part that needs to be detected by the network model to be trained. Therefore, it is necessary to correct the initial training label of the sample image in the initial training sample.
  • The correction process is: the initial training label of the body feature point corresponding to the occluded or cropped body part is corrected to the second initial training label, so that the second initial training label characterizes that body feature point as a supervised feature point not contained in the target sample image.
  • In this way, the initial training samples can serve multiple different detection tasks at the same time, reducing the training cost of the neural network by saving the cost of data labeling. Furthermore, when the network model to be trained is trained using the corrected initial training labels and target sample images determined in the technical solution of the present disclosure, the body feature points of body parts not contained in the image can be predicted, realizing pose recognition of the target to be detected. Therefore, the network model trained by the technical solution of the present disclosure is more robust, and the recognition process is more stable.
  • In an embodiment of the present disclosure, in step S1053, when it is determined based on the supervision type that the initial training label of the body feature point satisfies the correction condition, correcting the initial training label based on the supervision type includes the following process:
  • In the case where it is determined from the supervision type that the body feature point is located outside the target sample image and is not a feature point of the part to be detected (the above Type 3), the initial training label of the body feature point is corrected to a first initial training label; the first initial training label is used to characterize the body feature point as an unsupervised feature point located outside the target sample image.
  • For example, suppose the network model to be trained is used to perform body detection on upper-body parts, while the sample images of the initial training samples contain the complete human body (i.e., both the upper-body parts and the lower-body parts).
  • In this case, the sample image needs to be cropped to obtain a sample image containing complete or incomplete upper-body parts.
  • After cropping, the body feature points of the lower-body parts are located outside the target sample image and do not belong to the parts to be detected.
  • Therefore, the initial training labels of these body feature points can be set to the first initial training label, which is used to characterize each such body feature point as an unsupervised feature point located outside the target sample image.
  • In implementation, the position information of the body feature point may be modified to (0, 0), where (0, 0) denotes the point at the upper-left corner of the target sample image.
  • Alternatively, the position information of the body feature point can be deleted and corresponding identification information added to the body feature point, so that the identification information indicates that the body feature point is located outside the target sample image.
  • In this way, the initial training samples can be used for multiple different detection tasks at the same time, thereby reducing the training cost of the neural network by saving the cost of data labeling.
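  • Putting the two corrections together, a hedged sketch is given below; the dictionary layout is an illustrative assumption, while the (0, 0) reset for unsupervised points follows the text above:

```python
from typing import Tuple


def correct_initial_label(position: Tuple[float, float], sup_type: int) -> dict:
    """Correct one initial training label according to its supervision type.

    Type 1: inside the image and supervised    -> label kept unchanged.
    Type 2: outside the image but supervised   -> second initial training label.
    Type 3: outside the image, unsupervised    -> first initial training label.
    """
    if sup_type == 1:
        return {"position": position, "supervised": True, "in_image": True}
    if sup_type == 2:
        # Second initial training label: a supervised feature point located
        # outside the target sample image.
        return {"position": position, "supervised": True, "in_image": False}
    # First initial training label: an unsupervised feature point outside the
    # image; the position is reset to (0, 0) (the upper-left corner), though an
    # identification flag could be used instead, as described above.
    return {"position": (0.0, 0.0), "supervised": False, "in_image": False}
```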
  • In an embodiment of the present disclosure, training the network model to be trained using the target sample image and the training label includes the following process:
  • Step S1071, performing image processing on the target sample image by the network model to be trained to obtain position information of each part to be detected in the target sample image;
  • Step S1072, determining, based on the position information of a first part and a second part among the multiple parts to be detected and the training label, the function value of a target loss function used to constrain the position difference between the first part and the second part, where the first part and the second part are parts to be detected that have an association relationship;
  • Step S1073, adjusting the model parameters of the network model to be trained according to the function value of the target loss function, so as to train the network model to be trained with the adjusted model parameters.
  • the image processing of the target training samples can be performed through the network model to be trained to obtain the position information of each part to be detected.
  • the target loss function may also be determined based on the position information of the first part and the second part in the parts to be detected, and the training labels.
  • the objective loss function is used to constrain the position difference between the first part and the second part.
  • the difference between the position information of the first part and the position information of the second part may be calculated, and a target loss function may be constructed based on the difference and the training labels. Then, when the function value of the target loss function satisfies the corresponding constraint conditions, the optimal solution of the model parameters of the network model to be trained is calculated. Furthermore, the network model to be trained is trained according to the optimal solution.
  • the first part and the second part may be detection parts having a linkage relationship.
  • the first part moves under the driving of the second part, or the second part moves under the driving of the first part.
  • For example, the first part may be a hand part and the second part may be a wrist part; alternatively, the first part may be the wrist part and the second part may be the forearm part.
  • different constraint conditions may be set for different first locations and second locations.
  • corresponding constraint conditions may be set according to the part types of the first part and the second part.
  • In this way, constructing the target loss function from the position difference between the position information of the first part and that of the second part can reduce the occurrence of large position differences between the first part and the second part, thereby improving the processing accuracy of the trained network model.
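  • For intuition only, a PyTorch-style sketch of such a loss follows; the regression term, the gap threshold max_gap, and the weight lambda_pair are assumed design choices rather than values from the present disclosure, and in practice unsupervised feature points would additionally be masked out of the regression term:

```python
import torch


def target_loss(pred: torch.Tensor, label: torch.Tensor,
                first_idx: int, second_idx: int,
                max_gap: float, lambda_pair: float = 1.0) -> torch.Tensor:
    """Keypoint regression against the training labels plus a term constraining
    the position difference between two associated parts (e.g. wrist/forearm).

    pred, label: (N, K, 2) tensors of predicted / ground-truth positions.
    """
    # Ordinary supervised regression against the training labels.
    regression = torch.nn.functional.mse_loss(pred, label)
    # Distance between the two linked parts in the prediction.
    gap = (pred[:, first_idx] - pred[:, second_idx]).norm(dim=-1)
    # Penalise only gaps larger than the allowed maximum.
    pair = torch.clamp(gap - max_gap, min=0.0).mean()
    return regression + lambda_pair * pair


# Usage: discourage a wrist prediction from drifting away from the forearm.
pred = torch.rand(8, 17, 2, requires_grad=True)
label = torch.rand(8, 17, 2)
loss = target_loss(pred, label, first_idx=9, second_idx=7, max_gap=0.15)
loss.backward()
```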
  • As shown in FIG. 3, which is a flowchart of a pose detection method provided by an embodiment of the present disclosure, the method includes steps S301 to S307, wherein:
  • S301: Acquire a video image frame containing body parts of a target anchor;
  • S303: Perform pose detection on the target anchor in the video image frame by a pose detection model to obtain a pose detection result, wherein the pose detection model is obtained by supervised training using a target sample image and corresponding training labels, the target sample image includes a body-part sample to be detected, and the training label is obtained by adjusting the initial training label of a body feature point of the body-part sample based on the supervision type of that feature point;
  • S305: Generate a pose trigger signal of the virtual anchor corresponding to the target anchor according to the pose detection result;
  • S307: Control the virtual anchor to execute the corresponding trigger action according to the pose trigger signal.
  • In this way, the pose detection model detects the pose of the target anchor, and the virtual anchor is controlled to execute the corresponding trigger action according to the pose detection result. The pose detection result of the target anchor can thus automatically trigger the corresponding part of the virtual anchor displayed on the live video interface to perform the corresponding trigger operation, triggering the virtual anchor to perform actions associated with the anchor's body movements and thereby improving the actual interaction effect.
  • In addition, using the pose detection results of the body parts to be detected to trigger the virtual anchor also realizes precise triggering of the corresponding trigger parts of the virtual anchor, so as to meet users' rich triggering needs.
  • Moreover, since the training label is obtained by adjusting the initial training label of a body feature point of the body-part sample based on the supervision type of that feature point, the pose detection model can be obtained by training with reused initial training labels, which saves the cost of data labeling and speeds up data labeling.
  • In a real live broadcast scene, the background is often complex and changeable, and a complex background will affect the capture of the anchor's half-body movements: when the body parts of the anchor (that is, the above-mentioned target object) are cut off by the edge of the screen, objects in the background image that are close to the anchor's clothing color or skin color, or similar in body shape, will interfere with capturing the anchor's body movements, making the captured results extremely unstable and seriously affecting the processing accuracy of the network model.
  • Therefore, in the embodiment of the present disclosure, the target training samples and training labels are constructed so as to obtain sample images closer to the real live broadcast scene.
  • When the pose detection model is trained on such target training samples and training labels, a pose detection model with high processing accuracy can be obtained; and when the body parts of the target anchor are detected by this pose detection model, more accurate pose detection results can be obtained.
  • After the pose detection result is obtained, a pose trigger signal may be generated according to the pose detection result, so as to trigger the virtual anchor to perform the corresponding trigger action.
  • In this way, the virtual anchor can be accurately controlled to perform the corresponding trigger action.
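  • As a toy illustration of this mapping (the rule that a hand raised above the head triggers a wave action is hypothetical, not a rule from the present disclosure):

```python
from typing import Dict, Optional, Tuple


def make_trigger_signal(pose_result: Dict[str, Tuple[int, int]]) -> Optional[dict]:
    """Map a pose detection result (keypoint name -> (x, y)) to a pose trigger
    signal for the virtual anchor."""
    head = pose_result.get("head")
    hand = pose_result.get("right_hand")
    if head is None or hand is None:
        return None
    if hand[1] < head[1]:  # image y grows downward: the hand is above the head
        return {"part": "right_arm", "action": "wave"}
    return None


signal = make_trigger_signal({"head": (96, 40), "right_hand": (130, 20)})
if signal is not None:
    print(f"virtual anchor: trigger {signal['action']} on {signal['part']}")
```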
  • In an optional implementation, performing pose detection on the target anchor in the video image frame through the pose detection model to obtain the pose detection result includes the following process:
  • First, a target frame is detected in the video image frame, where the target frame is a bounding box used to frame the body parts of the target object.
  • If the video image frame contains multiple objects, the target object must first be determined among them.
  • In one optional implementation, the object at the forefront of the picture may be determined as the target object.
  • In another optional implementation, a bounding box is determined for each object; among the determined boxes, the one with the largest area is taken as the target frame, and the object it frames is determined as the target object, as sketched below.
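  • A minimal sketch of the largest-area selection rule, assuming axis-aligned boxes; the upstream person detector that produces the candidate boxes is not specified in this disclosure:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def select_target_box(boxes: List[Box]) -> Box:
    """Pick the candidate box with the largest area as the target frame."""
    if not boxes:
        raise ValueError("no candidate boxes detected")

    def area(box: Box) -> float:
        x1, y1, x2, y2 = box
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)

    return max(boxes, key=area)
```

  • The object framed by the returned box is then treated as the target object, and only the image inside the box is passed to the pose detection model.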
  • The pose detection model obtained by the neural network model training method described above can then perform pose detection on the image inside the target frame to obtain the pose detection result.
  • By detecting the target frame containing the target object and performing pose detection only on the image inside it, the accuracy of the pose detection result can be improved.
  • In an optional implementation, the above step of performing pose detection on the target object in the video image frame through the pose detection model to obtain the pose detection result includes the following process:
  • When at least some of the specified body parts of the target object are not detected in the video image frame, the video image frame is processed to obtain a target image; the target image contains a region for performing pose detection on the missing specified body parts.
  • Obtaining the target image may proceed as follows: a sub-image containing the part to be detected is cropped from the video image frame, and edge filling is performed on the sub-image based on the limb type information and/or limb size information of the missing specified body parts, yielding a target image that contains a filled region in which those parts can be detected.
  • In an optional implementation, the pose detection model is able to perform pose detection on incomplete specified body parts.
  • Accordingly, when a complete specified body part is not detected in the video image frame, edge filling may be performed on the video image frame to obtain a target image containing a detection region (i.e., the above filled region) for the missing part.
  • In an optional implementation, the edge filling adds a black area at the edge of the video image frame to obtain the target image, as sketched below.
  • After the target image is obtained, the pose detection model performs pose detection on it to obtain the pose detection result of the target object.
  • In this way, even when the video image frame does not contain a complete specified body part, pose detection can still be performed on the frame, and an accurate pose detection result can be obtained.
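  • A minimal sketch of the black-border edge filling, assuming OpenCV-style image arrays; how much padding to add on each side would in practice be derived from the limb type/size information of the missing parts, which this disclosure does not fix numerically:

```python
import cv2
import numpy as np

def pad_for_missing_parts(frame: np.ndarray,
                          top: int = 0, bottom: int = 0,
                          left: int = 0, right: int = 0) -> np.ndarray:
    """Add a black region at the frame edges so that the pose model has
    room to place predictions for body parts cut off by the screen edge."""
    return cv2.copyMakeBorder(frame, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))

# For example, if the hands are cut off at the bottom of the picture:
# target_image = pad_for_missing_parts(frame, bottom=frame.shape[0] // 4)
```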
  • In an optional implementation, the above step of obtaining the training label based on the supervision type of the body feature points of the body part sample and their corresponding initial training labels includes: determining the supervision type of each body feature point, and correcting the initial training labels based on the supervision type, so that the training label of the target sample image is determined from the corrected initial training labels.
  • Here, the supervision type of each body feature point can be determined based on the pose detection model to be trained.
  • The supervision type indicates whether supervised learning of the body feature point is required while training the pose detection model.
  • If a body feature point does not belong to the part to be detected, its supervision type can be determined as an unsupervised feature point; if it does belong to the part to be detected, its supervision type can be determined as a supervised feature point.
  • After the supervision types are determined, the body feature points that require supervised learning during training can be identified. Once the initial training labels are corrected based on the supervision types, the training labels of the target sample images used for training the pose detection model are obtained.
  • In an optional implementation, determining the supervision type of the body feature points includes the following steps:
  • First, feature information of each body feature point is determined, where the feature information indicates the positional relationship between the body feature point and the target sample image, and/or indicates whether the body feature point is a feature point of the part to be detected.
  • Then, the supervision type of the body feature point is determined based on the feature information; and when the initial training label of the body feature point is determined, based on the supervision type, to meet the correction condition, the initial training label is corrected to obtain the training label.
  • When correcting the initial training labels, the feature information of each body feature point is determined first, and it is then judged from the feature information whether the point requires supervised learning.
  • According to the feature information, the supervision type of a body feature point is one of the following:
  • Type 1: the body feature point is a supervised feature point located inside the target sample image.
  • Type 2: the body feature point is a supervised feature point located outside the target sample image.
  • Type 2 can be understood as a body feature point that is not contained in the target sample image but still requires supervised learning.
  • Type 3: the body feature point is an unsupervised feature point located outside the target sample image.
  • After the supervision type of each body feature point is determined from the feature information, it can be judged from the supervision type whether the initial training label of that point satisfies the correction condition.
  • In practice, if the supervision type of a body feature point is Type 2 or Type 3, its initial training label satisfies the correction condition and is corrected according to the supervision type; if the supervision type is Type 1, the initial training label does not satisfy the correction condition and is kept unchanged.
  • The purpose of correcting the initial training labels is to process the initial training labels in the initial training samples so as to obtain training labels on which the pose detection model can actually be trained; the rules are summarized in the sketch below.
  • In this way, the initial training labels of the feature points that require supervision are retained, and the utilization of the initial training labels in the initial training samples is improved.
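  • The Type 1/2/3 correction rules can be condensed into a short sketch; the data layout is hypothetical, and the (0, 0) reset for out-of-image unsupervised points follows one variant described elsewhere in this disclosure:

```python
from dataclasses import dataclass

@dataclass
class PointLabel:
    x: float
    y: float
    supervised: bool   # does the point take part in the training loss?
    in_image: bool     # does the point lie inside the target sample image?

def correct_label(label: PointLabel, is_part_to_detect: bool,
                  inside_image: bool) -> PointLabel:
    """Type 1 labels are kept unchanged; Type 2 points remain supervised
    even though they fall outside the image; points not belonging to the
    part to be detected are treated here as Type 3 and become unsupervised."""
    if inside_image and is_part_to_detect:   # Type 1: correction not needed
        return label
    if is_part_to_detect:                    # Type 2: supervised, outside image
        return PointLabel(label.x, label.y, supervised=True, in_image=False)
    return PointLabel(0.0, 0.0, supervised=False, in_image=False)  # Type 3
```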
  • The present disclosure relates to the field of augmented reality: by acquiring image information of a target object in a real environment and applying various vision-related algorithms to detect or recognize the relevant features, states and attributes of the target object, an AR effect combining the virtual and the real that matches the actual application is obtained.
  • Exemplarily, the target object may involve the face, limbs, gestures, actions, and so on related to the human body.
  • Practical applications involve not only interactive scenes related to real scenes or objects, such as guided tours, navigation, explanation, reconstruction, and virtual-effect overlay and display, but also special-effect processing related to people, such as makeup beautification, body beautification, special-effect display, and virtual-model display.
  • The relevant features, states and attributes of the target object can be detected or recognized through a convolutional neural network.
  • In some embodiments of the present disclosure, to address the difficulty of half-body key-point prediction and pose estimation in live-broadcast scenes, a dedicated deep neural network is proposed that reasonably predicts the global and local conditions of human limbs under various complex conditions and quantitatively analyzes the different display degrees of half-body limbs.
  • For real scenes where the picture background is of low quality, a data-set construction scheme for general live-broadcast scenes is provided, and a data-augmentation method and a model-training loss function targeting complex backgrounds are designed, thereby improving the generalization ability of the model.
  • The detailed implementation includes:
  • (1) During data collection, instead of a task-specific picture, a more general setup is adopted: the whole body (or the upper half body) of the person is kept within the picture against a green-screen background, and complete key points are annotated, so that the reuse rate of the data set is maximized. A human segmentation model provides pixel-level person information, and the background portion is randomly replaced to generate multiple groups of training data, as sketched below.
  • (2) During model training, based on the augmentation result, it is judged whether a limb point has annotation information and whether the limb point lies outside the image, and the supervision signal and supervision information of the limb point are given according to the judgment.
  • (3) A loss-function constraint is designed to constrain the position information of the wrist point and the hand detection box in the image, reducing large position discrepancies between the wrist and the hand detection box.
  • (4) During live broadcast, the upper-body picture of the anchor is acquired and human detection is performed; the resulting human-detection-box image is then fed to a two-dimensional human-pose-estimation network to predict the 2D coordinates of the required limb points. The model trained through stages (1) to (3) generalizes robustly in complex background scenes such as dim or over-exposed pictures, reflective glass walls, and cluttered backgrounds, and significantly reduces false detections.
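  • A minimal sketch of the random background replacement in item (1); the pixel-level person mask is assumed to come from an unspecified human segmentation model:

```python
import random
import numpy as np

def replace_background(image: np.ndarray, person_mask: np.ndarray,
                       backgrounds: list) -> np.ndarray:
    """Composite the segmented person onto a randomly chosen complex
    background; person_mask is an HxW float mask in [0, 1]."""
    background = random.choice(backgrounds)
    assert background.shape == image.shape, "background must match frame size"
    alpha = person_mask[..., None].astype(np.float32)
    out = (alpha * image.astype(np.float32)
           + (1.0 - alpha) * background.astype(np.float32))
    return out.astype(image.dtype)
```

  • With N green-screen originals and M preset backgrounds this yields N×M training images; the original disclosure gives the example of 10,000 originals and 10 backgrounds producing 100,000 sample images.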
  • Based on the same inventive concept, an embodiment of the present disclosure also provides a pose detection apparatus corresponding to the pose detection method. Since the problem-solving principle of the apparatus in the embodiments of the present disclosure is similar to that of the methods described above, the implementation of the apparatus can refer to the implementation of the method.
  • Referring to FIG. 4, which is a schematic diagram of a pose detection apparatus provided by an embodiment of the present disclosure, the apparatus includes a first acquisition unit 41, a pose detection unit 42, a generation unit 43 and a control unit 44, wherein:
  • the first acquisition unit 41 is configured to acquire a video image frame containing the body parts of a target anchor;
  • the pose detection unit 42 is configured to perform pose detection on the target anchor in the video image frame through a pose detection model to obtain a pose detection result, wherein the pose detection model is obtained by supervised training using target sample images and corresponding training labels, the target sample image contains a body part sample to be detected, and the training label is obtained by adjusting the initial training label corresponding to each body feature point of the body part sample according to the supervision type of that feature point;
  • the generation unit 43 is configured to generate a pose trigger signal of the virtual anchor corresponding to the target anchor according to the pose detection result;
  • the control unit 44 is configured to control the virtual anchor to perform the corresponding trigger action according to the pose trigger signal.
  • In a possible implementation, the pose detection unit 42 is further configured to: detect a target frame in the video image frame, where the target frame is used to frame the body parts of the target anchor in the video image frame; and perform pose detection, through the pose detection model, on the image located inside the target frame to obtain the pose detection result.
  • In a possible implementation, the pose detection unit 42 is further configured to: when at least some specified body parts of the target anchor are not detected in the video image frame, process the video image frame to obtain a target image, where the target image contains a region for performing pose detection on those specified body parts; and perform pose detection on the target image through the pose detection model to obtain the pose detection result of the target anchor.
  • In a possible implementation, the apparatus is further configured to: determine the supervision type of the body feature points; and correct the initial training labels based on the supervision type, so as to determine the training label of the target sample image from the corrected initial training labels.
  • In a possible implementation, the apparatus is further configured to: determine feature information of the body feature points, where the feature information indicates the positional relationship between a body feature point and the target sample image and/or indicates whether the body feature point is a feature point of the part to be detected; determine the supervision type of the body feature points based on the feature information; and, when the initial training label of a body feature point is determined, based on the supervision type, to meet the correction condition, correct the initial training label based on the supervision type to obtain the training label.
  • Based on the same inventive concept, an embodiment of the present disclosure also provides a neural network model training apparatus corresponding to the neural network model training method. Since the problem-solving principle of the apparatus is similar to that of the above training method, the implementation of the apparatus can refer to the implementation of the method.
  • Referring to FIG. 5, which is a schematic diagram of a neural network model training apparatus provided by an embodiment of the present disclosure, the apparatus includes a second acquisition unit 51, a correction processing unit 52, a determination unit 53 and a training unit 54, wherein:
  • the second acquisition unit 51 is configured to acquire initial training samples, where an initial training sample contains a sample image and initial training labels of body feature points of the body parts of the target object contained in the sample image;
  • the correction processing unit 52 is configured to determine, among the body parts, the part to be detected based on the detection task of the network model to be trained, and to process the sample image to obtain a target sample image containing the part to be detected;
  • the determination unit 53 is configured to determine the supervision type of the body feature points and to correct the initial training labels based on the supervision type, so as to determine the training label of the target sample image from the corrected initial training labels;
  • the training unit 54 is configured to perform supervised training on the network model to be trained by using the target sample image and the training label.
  • As described above, after the network model to be trained is determined, correcting the sample images and initial training labels in the initial training samples to obtain the target sample images and training labels realizes reuse of the initial training samples, so that the same initial training samples can be used in the training processes of multiple different network models, thereby reducing the training cost of the neural network by saving data-annotation cost.
  • In a possible implementation, when there are multiple parts to be detected, the training unit is further configured to: perform image processing on the target sample image through the network model to be trained to obtain position information of each part to be detected in the target sample image; determine, based on the position information of a first part and a second part among the multiple parts to be detected and on the training label, the function value of a target loss function used to constrain the position difference between the first part and the second part, where the first part and the second part are parts to be detected that have an association relationship (for example, the wrist and the hand; see the sketch below); and adjust the model parameters of the network model to be trained according to the function value of the target loss function, so as to train the network model with the adjusted model parameters.
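  • A minimal sketch of such a target loss term, here penalizing the distance between predicted wrist points and hand-box centers; the margin and the weighting against the ordinary keypoint loss are assumptions, not values given in this disclosure:

```python
import torch

def part_consistency_loss(first_part_xy: torch.Tensor,
                          second_part_xy: torch.Tensor,
                          margin: float = 0.0) -> torch.Tensor:
    """Penalize large position differences between associated parts,
    e.g. predicted wrist points (B, 2) vs. hand-box centers (B, 2)."""
    diff = torch.linalg.norm(first_part_xy - second_part_xy, dim=-1)
    return torch.clamp(diff - margin, min=0.0).mean()

# total_loss = keypoint_loss + lam * part_consistency_loss(wrist_xy, hand_center_xy)
```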
  • In a possible implementation, the determination unit is further configured to: determine feature information of the body feature points, where the feature information indicates the positional relationship between a body feature point and the target sample image and/or indicates whether the body feature point is a feature point of the part to be detected; determine the supervision type of the body feature points based on the feature information; and, when the initial training label of a body feature point is determined, based on the supervision type, to meet the correction condition, correct the initial training label based on the supervision type.
  • In a possible implementation, the determination unit is further configured to: when it is determined, according to the supervision type, that a body feature point is located outside the target sample image and is not a feature point of the part to be detected, correct the initial training label of that body feature point to a first initial training label, where the first initial training label characterizes the body feature point as an unsupervised feature point located outside the target sample image.
  • In a possible implementation, the determination unit is further configured to: when it is determined, according to the initial training label, that a body feature point is located outside the target sample image and belongs to the feature points of the part to be detected, correct the initial training label of that body feature point to a second initial training label, where the second initial training label characterizes the body feature point as a supervised feature point located outside the target sample image.
  • In a possible implementation, the sample image is an image determined after a target operation is performed on an original image, where the original image is an image containing at least one body part of the target object collected in a scene whose complexity does not meet preset complexity requirements, and the target operation includes at least one of the following: a background replacement operation, an operation of adjusting image brightness, an operation of adjusting image exposure, and an operation of adding a reflective layer. Two of these operations are sketched below.
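  • A minimal sketch of two of these target operations; the scaling factor and the reflection texture are illustrative assumptions:

```python
import numpy as np

def adjust_brightness(image: np.ndarray, factor: float) -> np.ndarray:
    """Scale pixel intensities; factor < 1 darkens, factor > 1 brightens
    (an over-exposed look can be simulated with a large factor)."""
    return np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def add_reflective_layer(image: np.ndarray, highlight: np.ndarray,
                         strength: float = 0.3) -> np.ndarray:
    """Blend a bright 'glass reflection' texture over the image."""
    out = ((1.0 - strength) * image.astype(np.float32)
           + strength * highlight.astype(np.float32))
    return np.clip(out, 0, 255).astype(np.uint8)
```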
  • In a possible implementation, the correction processing unit is further configured to: perform image occlusion processing on the target detection part in the sample image to obtain the target sample image, where the target detection part comprises the body parts other than the part to be detected; and/or perform image cropping processing on the image located in the target region of the sample image, cropping out a target sample image containing the part to be detected, where the target region is the image region of the sample image that contains the part to be detected. Both operations are sketched below.
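  • A minimal sketch of the two operations; the region coordinates are hypothetical, and in practice they come from the annotated positions of the parts that the current detection task excludes:

```python
import numpy as np

def occlude_part(image: np.ndarray, x1: int, y1: int, x2: int, y2: int) -> np.ndarray:
    """Mask out a body part the current task ignores, here with a black
    patch (an image of a 'specified color' per the description)."""
    out = image.copy()
    out[y1:y2, x1:x2] = 0
    return out

def crop_target_region(image: np.ndarray, x1: int, y1: int, x2: int, y2: int) -> np.ndarray:
    """Crop the image region of the sample image that contains the parts
    to be detected."""
    return image[y1:y2, x1:x2].copy()
```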
  • Corresponding to the pose detection method described above, an embodiment of the present disclosure also provides a computer device 600. As shown in FIG. 6, which is a schematic structural diagram of the computer device 600, it includes:
  • a processor 61, a memory 62, and a bus 63. The memory 62 is used to store execution instructions and includes an internal memory 621 and an external memory 622; the internal memory 621 temporarily stores operation data for the processor 61 as well as data exchanged with the external memory 622, such as a hard disk, and the processor 61 exchanges data with the external memory 622 through the internal memory 621. When the computer device 600 runs, the processor 61 communicates with the memory 62 through the bus 63, so that the processor 61 executes the following instructions:
  • acquire a video image frame containing the body parts of a target anchor;
  • perform pose detection on the target anchor in the video image frame through a pose detection model to obtain a pose detection result, wherein the pose detection model is obtained by supervised training using target sample images and corresponding training labels, the target sample image contains a body part sample to be detected, and the training label is obtained by adjusting the initial training label corresponding to each body feature point of the body part sample according to the supervision type of that feature point;
  • generate a pose trigger signal of the virtual anchor corresponding to the target anchor according to the pose detection result;
  • control the virtual anchor to perform the corresponding trigger action according to the pose trigger signal.
  • Corresponding to the neural network model training method in FIG. 1, an embodiment of the present disclosure also provides a computer device 700. As shown in FIG. 7, which is a schematic structural diagram of the computer device 700, it includes:
  • a processor 71, a memory 72, and a bus 73. The memory 72 is used to store execution instructions and includes an internal memory 721 and an external memory 722; the internal memory 721 temporarily stores operation data for the processor 71 as well as data exchanged with the external memory 722, such as a hard disk, and the processor 71 exchanges data with the external memory 722 through the internal memory 721. When the computer device 700 runs, the processor 71 communicates with the memory 72 through the bus 73, so that the processor 71 executes the following instructions:
  • acquire initial training samples, where an initial training sample contains a sample image and initial training labels of body feature points of the body parts of the target object contained in the sample image;
  • determine, among the body parts, the part to be detected based on the detection task of the network model to be trained, and process the sample image to obtain a target sample image containing the part to be detected;
  • determine the supervision type of the body feature points, and correct the initial training labels based on the supervision type, so as to determine the training label of the target sample image from the corrected initial training labels;
  • perform supervised training on the network model to be trained by using the target sample image and the training label.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored. When the computer program is run by a processor, the steps of the neural network model training method and the pose detection method described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure also provides a computer program product. The computer program product carries program code, and the instructions included in the program code can be used to execute the steps of the neural network model training method and the pose detection method described in the above method embodiments; refer to the above method embodiments for details.
  • The above computer program product may be realized by hardware, software, or a combination thereof.
  • In an optional embodiment, the computer program product may be embodied as a computer storage medium; in another optional embodiment, it may be embodied as a software product, such as a software development kit (Software Development Kit, SDK).
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the functions are realized in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • Embodiments of the present disclosure provide a detection method, a training method, an apparatus, a device, a storage medium and a program product, wherein the method includes: acquiring a video image frame containing the body parts of a target anchor; performing pose detection on the target anchor in the video image frame through a pose detection model to obtain a pose detection result, wherein the pose detection model is obtained by supervised training using target sample images and corresponding training labels, the target sample image contains a body part sample to be detected, and the training label is obtained by adjusting the initial training label corresponding to each body feature point of the body part sample according to the supervision type of that feature point; generating a pose trigger signal of the virtual anchor corresponding to the target anchor according to the pose detection result; and controlling the virtual anchor to perform the corresponding trigger action according to the pose trigger signal.
  • After the video image frame is acquired, the pose detection model performs pose detection on the target anchor, and the virtual anchor is then controlled to execute the corresponding trigger action according to the pose detection result; in this way, the pose detection result of the target anchor automatically triggers the corresponding part of the virtual anchor displayed on the live-video interface to perform the corresponding trigger operation, so that the virtual anchor executes actions associated with the anchor's limb movements, improving the actual interaction effect.
  • At the same time, by triggering the virtual anchor according to the pose detection results of the body parts to be detected, the corresponding trigger parts of the virtual anchor can be triggered precisely, satisfying users' rich trigger needs.
  • In addition, because the training label is obtained by adjusting the initial training label of each body feature point according to its supervision type, the pose detection model can be trained by reusing the initial training labels, which saves data-annotation cost and speeds up annotation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A detection method, training method, apparatus, device, storage medium and program product, the method comprising: acquiring a video image frame containing body parts of a target anchor; performing pose detection on the target anchor in the video image frame by means of a pose detection model to obtain a pose detection result, the pose detection model being obtained by supervised training using target sample images and corresponding training labels, the target sample image containing a body part sample to be detected, and the training label being obtained by adjusting the initial training labels corresponding to the body feature points of the body part sample on the basis of the supervision types of those feature points; generating, according to the pose detection result, a pose trigger signal of the virtual anchor corresponding to the target anchor; and controlling, according to the pose trigger signal, the virtual anchor to perform the corresponding trigger action.

Description

检测方法、训练方法、装置、设备、存储介质和程序产品
相关申请的交叉引用
本公开基于申请号为202110993216.2、申请日为2021年08月27日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此以全文引用的方式引入本公开。
技术领域
本公开涉及计算机的技术领域,涉及但不限于一种检测方法、训练方法、装置、设备、存储介质和程序产品。
背景技术
在相关的直播技术中,主播可以通过触发外部设备(例如,鼠标或者键盘)在直播界面展示特效动画。但相关的直播技术中通过触发外部设备展示的特效动画通常比较单一化。
发明内容
本公开实施例至少提供一种检测方法、训练方法、装置、设备、存储介质和程序产品。
第一方面,本公开实施例提供了一种姿态检测方法,包括:获取包含目标主播的肢体部位的视频图像帧;通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,其中,所述姿态检测模型为利用目标样本图像及对应的训练标签进行监督训练得到,所述目标样本图像包含待检测的肢体部位样本,所述训练标签基于所述肢体部位样本的肢体特征点的监督类型及所述肢体特征点对应的初始训练标签进行调整得到;根据所述姿态检测结果生成所述目标主播所对应的虚拟主播的姿态触发信号;根据所述姿态触发信号控制所述虚拟主播执行相应的触发动作。
在本公开实施例中,在获取到视频图像帧之后,通过姿态检测模型对目标主播进行姿态检测,进而根据姿态检测结果控制虚拟主播执行相应触发动作的方式,可以实现自动的通过目标主播的姿态检测结果在视频直播界面上触发展示虚拟主播的相应部位执行相应触发操作,进而实现触发虚拟主播执行与主播的肢体产生动作关联的动作,从而提高实际互动效果。同时,通过确定待检测的肢体部位的姿态检测结果对虚拟主播进行触发,还可以实现虚拟主播中对应触发部位的精准触发,从而满足用户的丰富触发需求。同时,在本公开技术方案中,通过基于肢体部位样本的肢体特征点的监督类型及肢体特征点对应的初始训练标签进行调整得到训练标签,可以通过对初始训练标签的复用来训练得到姿态检测模型,通过该方式可以节省数据标注成本,并加快数据标注的速度。
一种可选的实施方式中,所述通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,包括:在所述视频图像帧中检测目标框体,其中,所述目标框体用于框选所述视频图像帧中所述目标主播的肢体部位;通过所述姿态检测模型对所述视频图像帧中位于所述目标框体内的图像进行姿态检测,得到所述姿态检测结果。
上述实施方式中,通过检测包含目标对象的目标框体,并对位于该目标框体内的图像进行姿态检测,可以提高姿态检测结果的准确性。
一种可选的实施方式中,所述通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,包括:在所述视频图像帧中未检测到所述目标主播的至少 部分指定肢体部位的情况下,对所述视频图像帧进行处理,得到目标图像;所述目标图像中包含用于对该至少部分指定肢体部位进行姿态检测的区域;通过所述姿态检测模型对所述目标图像进行姿态检测,得到所述目标主播的姿态检测结果。
上述实施方式中,在视频图像帧中未检测到目标对象的部分指定肢体部位的情况下,通过对视频图像帧进行边缘填补处理,得到目标图像,以根据该目标图像进行姿态检测的方式,可以实现在视频图像帧中不包含完整的指定肢体部位的情况下,依然可以对该视频图像帧进行姿态检测,得到准确的姿态检测结果。
一种可选的实施方式中,所述基于所述肢体部位样本的肢体特征点的监督类型及所述肢体特征点对应的初始训练标签进行调整得到所述训练标签,包括:确定所述肢体特征点的监督类型;基于所述监督类型修正所述初始训练标签,以根据修正之后的初始训练标签确定所述目标样本图像的训练标签。
一种可选的实施方式中,所述确定所述肢体特征点的监督类型,包括:确定所述肢体特征点的特征信息,其中,所述特征信息用于指示对应肢体特征点和所述目标样本图像之间的位置关系,和/或,用于指示对应肢体特征点是否为待检测部位的特征点;基于所述特征信息,确定所述肢体特征点的监督类型;并在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签,得到所述训练标签。
上述实施方式中,通过确定肢体特征点的监督类型,并基于该监督类型对肢体特征点的初始训练标签进行修正的方式,可以保留需要进行监督的肢体特征点的初始训练标签,同时还可以提高初始训练样本中初始训练标签的利用率。
第二方面,本公开实施例提供了一种神经网络模型的训练方法,包括:获取初始训练样本;所述初始训练样本中包含样本图像和该样本图像中所包含的目标对象的肢体部位的肢体特征点的初始训练标签;基于待训练网络模型的检测任务在所述肢体部位中确定待检测部位;并对所述样本图像进行处理,得到包含所述待检测部位的目标样本图像;确定所述肢体特征点的监督类型,并基于所述监督类型修正所述初始训练标签,以根据修正之后的初始训练标签确定所述目标样本图像的训练标签;通过所述目标样本图像和所述训练标签,对所述待训练网络模型进行有监督训练。
在本公开实施例中,在确定出待训练网络模型之后,通过对初始训练样本中的样本图像和初始训练标签进行修正处理,得到目标样本图像和训练标签的方式,可以实现初始训练样本的复用,以使初始训练样本可以同时用于多个不同的待训练网络模型的训练过程,进而通过节省数据标注成本的方式降低神经网络的训练成本。
一种可选的实施方式中,所述待检测部位为多个,所述通过所述目标样本图像和所述训练标签,对所述待训练网络模型进行训练,包括:通过所述待训练网络模型对所述目标样本图像进行图像处理,得到所述目标样本图像中每个待检测部位的位置信息;基于多个所述待检测部位中第一部位和第二部位的位置信息,以及所述训练标签,确定用于约束所述第一部位和第二部位之间的位置差异的目标损失函数的函数值,其中,所述第一部位和所述第二部位为具有关联关系的待检测部位;根据所述目标损失函数的函数值调整所述待训练网络模型的模型参数,以通过调整之后的模型参数对所述待训练网络模型进行训练。
上述实施方式中,通过第一部位和第二部位的位置信息之间的位置差异和训练标签构建目标损失函数的方式,可以减少第一部位和第二部位之间位置差异较大的现象,从而提高待训练网络模型的处理精度。
一种可选的实施方式中,所述确定所述肢体特征点的监督类型,并基于所述监督类型修正所述初始训练标签,包括:确定所述肢体特征点的特征信息,其中,所述特征信息用于指示对应肢体特征点和所述目标样本图像之间的位置关系,和/或,用于指示对应肢体特征点是否为待检测部位的特征点;基于所述特征信息,确定所述肢体特征点的监督类型;在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签。
上述实施方式中,通过确定肢体特征点的监督类型,并基于该监督类型对肢体特征点的 初始训练标签进行修正的方式,可以保留需要进行监督的肢体特征点的初始训练标签,同时还可以提高初始训练样本中初始训练标签的利用率。
通过监督类型对肢体特征点的初始训练标签进行修正的方式,不仅可以监督学习位于目标样本图像内的肢体特征点,还可以监督学习位于目标样本图像之外且需要进行监督学习的肢体特征点。在基于修正之后的初始训练标签(即,训练标签)和目标样本图像对待训练网络模型进行训练时,就可以实现通过该网络模型对未包含在图像中的部分肢体部位的肢体特征点进行预测。在采用本公开技术方案中确定出的修正之后的初始训练标签和目标样本图像,对待训练网络模型进行训练时,可以对未包含在图像中的部分肢体部位的肢体特征点进行预测,从而实现待检测目标的姿态识别。因此,通过本公开技术方案进行训练之后的网络模型的鲁棒性更强,且识别过程更加稳定。
一种可选的实施方式中,所述在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签,包括:在根据所述监督类型确定出所述肢体特征点位于所述目标样本图像之外,且不是所述待检测部位的特征点情况下,将所述肢体特征点的初始训练标签修正为第一初始训练标签;所述第一初始训练标签用于表征该肢体特征点为位于所述目标样本图像之外的非监督类特征点。
一种可选的实施方式中,所述在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签,包括:在根据所述初始训练标签确定出所述肢体特征点位于所述目标样本图像之外,且属于待检测部位的特征点的情况下,将所述肢体特征点的初始训练标签修正为第二初始训练标签;所述第二初始训练标签用于表征该肢体特征点为位于所述目标样本图像之外的监督类特征点。
上述实施方式中,通过上述所描述的初始训练标签的修正方式,可以使初始训练样本可以同时用于多个不同的检测任务,进而通过节省数据标注成本的方式降低神经网络的训练成本。进一步地,在采用本公开技术方案中确定出的修正之后的初始训练标签和目标样本图像,对待训练网络模型进行训练时,可以对未包含在图像中的部分肢体部位的肢体特征点进行预测,从而实现待检测目标的姿态识别。因此,通过本公开技术方案进行训练之后的网络模型的鲁棒性更强,且识别过程更加稳定。
一种可选的实施方式中,所述样本图像为对原始图像执行目标操作之后确定的图像,其中,所述原始图像为在场景复杂度不满足预设复杂要求的场景下采集到的包含目标对象的至少一个肢体部位的图像,所述目标操作包括以下至少之一:背景替换操作、调整图像亮度的操作、调整图像曝光度的操作、添加反光图层的操作。
在本公开实施例中,通过对原始图像进行背景替换,可以实现通过较低的成本,在较短的时间内获得海量训练样本。通过对原始图像进行调整图像亮度、调整图像曝光度以及添加反光图层中的至少一种图像增强处理,可以提高样本图像的背景复杂度。通过上述至少一种目标操作所得到的样本图像能够更加贴近真实的应用场景,在基于该样本图像对待训练网络模型进行训练时,可以提高待训练网络模型的处理精度。
一种可选的实施方式中,所述对所述样本图像进行处理,得到包含所述待检测部位的目标样本图像,包括:对所述样本图像中的目标检测部位进行图像遮挡处理,得到所述目标样本图像;其中,所述目标检测部位为所述肢体部位中除所述待检测部位之外的其他部位;和/或,对所述样本图像中位于目标区域内的图像进行图像裁剪处理,裁剪得到包含所述待检测部位的目标样本图像;所述目标区域为所述样本图像中包含所述待检测部位的图像区域。
上述实施方式中,通过对样本图像进行图像遮挡处理和/或图像裁剪处理,可以在不重新采集样本图像的基础上,得到与检测任务相匹配的样本图像,从而实现样本图像的复用,以降低神经网络的训练时间和训练成本。
第三方面,本公开实施例提供了一种姿态检测装置,包括:第一获取单元,配置为获取包含目标主播的肢体部位的视频图像帧;姿态检测单元,配置为通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,其中,所述姿态检测模型为利用目标样本图像及对应的训练标签进行监督训练得到,所述目标样本图像包含待检测的肢 体部位样本,所述训练标签基于所述肢体部位样本的肢体特征点的监督类型及所述肢体特征点对应的初始训练标签进行调整得到;生成单元,配置为根据所述姿态检测结果生成所述目标主播所对应的虚拟主播的姿态触发信号;控制单元,配置为根据所述姿态触发信号控制所述虚拟主播执行相应的触发动作。
第四方面,本公开实施例提供了一种神经网络模型的训练装置,包括:第一获取单元,配置为获取初始训练样本;所述初始训练样本中包含样本图像和该样本图像中所包含的目标对象的肢体部位的肢体特征点的初始训练标签;修正处理单元,配置为基于待训练网络模型的检测任务在所述肢体部位中确定待检测部位;并对所述样本图像进行处理,得到包含所述待检测部位的目标样本图像;确定单元,配置为确定所述肢体特征点的监督类型,并基于所述监督类型修正所述初始训练标签,以根据修正之后的初始训练标签确定所述目标样本图像的训练标签;训练单元,配置为通过所述目标样本图像和所述训练标签,对所述待训练网络模型进行有监督训练。
第五方面,本公开实施例还提供一种计算机设备,包括:处理器、存储器和总线,所述存储器存储有所述处理器可执行的机器可读指令,当计算机设备运行时,所述处理器与所述存储器之间通过总线通信,所述机器可读指令被所述处理器执行时执行上述第一方面,或第一方面中任一种可能的实施方式中的步骤。
第六方面,本公开实施例还提供一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行上述第一方面,或第一方面中任一种可能的实施方式中的步骤。
第七方面,本公开实施例还提供一种计算机程序产品,该计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,该计算机程序被计算机读取并执行时,实现上述第一方面,或第一方面中任一种可能的实施方式中的步骤。该计算机程序产品可以为一个软件安装包。
为使本公开的上述目的、特征和优点能更明显易懂,下文特举较佳实施例,并配合所附附图,作详细说明如下。
附图说明
为了更清楚地说明本公开实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,此处的附图被并入说明书中并构成本说明书中的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。应当理解,以下附图仅示出了本公开的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。
图1示出了本公开实施例所提供的一种神经网络模型的训练方法的流程图;
图2示出了本公开实施例所提供的一种神经网络模型的训练方法中,确定所述肢体特征点的监督类型,并基于所述监督类型修正所述初始训练标签的流程图;
图3示出了本公开实施例所提供的一种姿态检测方法的流程图;
图4示出了本公开实施例所提供的一种姿态检测装置的示意图;
图5示出了本公开实施例所提供的一种神经网络模型的训练装置的示意图;
图6示出了本公开实施例所提供的一种计算机设备的示意图;
图7示出了本公开实施例所提供的另一种计算机设备的示意图。
具体实施方式
为使本公开实施例的目的、技术方案和优点更加清楚,下面将结合本公开实施例中附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开 一部分实施例,而不是全部的实施例。通常在此处附图中描述和示出的本公开实施例的组件可以以各种不同的配置来布置和设计。因此,以下对在附图中提供的本公开的实施例的详细描述并非旨在限制要求保护的本公开的范围,而是仅仅表示本公开的选定实施例。基于本公开的实施例,本领域技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例,都属于本公开保护的范围。
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。
本文中术语“和/或”,仅仅是描述一种关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中术语“至少一种”表示多种中的任意一种或多种中的至少两种的任意组合,例如,包括A、B、C中的至少一种,可以表示包括从A、B和C构成的集合中选择的任意一个或多个元素。
经研究发现,一般情况下,需要针对特定场景采集特定的图像,在采集到特定的图像之后,还需要对特定的图像进行标注,从而根据标注之后的特定图像来训练对应的神经网络。然而,针对特定场景确定特定训练样本的方式,浪费了大量的人力进行图像标注,且采用上述方式确定出的训练样本可复用性低。因此,采用上述所描述的神经网络的训练方法增大了网络训练的成本。
基于上述研究,本公开提供了一种神经网络模型的训练方法。在本公开实施例中,在获取到待训练网络模型后,可以通过该待训练网络模型的检测任务确定目标对象的待检测部位。之后,可以基于该待检测部位对样本图像进行处理,从而得到满足该待训练网络模型的训练要求的目标样本图像,即包含待检测部位的目标样本图像。通过确定肢体特征点的监督类型,并根据该监督类型修正初始训练标签,可以得到与该待训练网络模型的训练过程相匹配的训练标签,从而根据该相匹配的训练标签和目标样本图像对该待训练网络模型进行训练。
通过上述描述可知,本公开技术方案在确定出待训练网络模型之后,通过对初始训练样本中的样本图像和初始训练标签进行修正处理,得到目标样本图像和训练标签的方式,可以实现初始训练样本的复用,以使初始训练样本可以同时用于多个不同的待训练网络模型的训练过程,进而通过节省数据标注成本的方式降低神经网络的训练成本。在训练得到姿态检测模型之后,姿态检测模型对视频图像帧中的目标主播进行姿态检测,进而根据姿态检测结果控制虚拟主播执行相应触发动作的方式,可以实现自动的通过目标主播的姿态检测结果在视频直播界面上触发展示虚拟主播的相应部位执行相应触发操作。同时,通过确定待检测的肢体部位的姿态检测结果对虚拟主播进行触发,还可以实现自动的通过目标主播的姿态检测结果在视频直播界面上触发展示虚拟主播的相应部位执行相应触发操作,进而实现触发虚拟主播执行与主播的肢体产生动作关联的动作,从而提高实际互动效果。
为便于对本实施例进行理解,首先对本公开实施例所公开的一种神经网络模型的训练方法进行详细介绍,本公开实施例所提供的神经网络模型的训练方法的执行主体一般为具有一定计算能力的计算机设备。在一些可能的实现方式中,该神经网络模型的训练方法可以通过处理器调用存储器中存储的计算机可读指令的方式来实现。
参见图1所示,为本公开实施例提供的一种神经网络模型的训练方法的流程图,所述方法包括步骤S101~S107,其中:
S101:获取初始训练样本;所述初始训练样本中包含样本图像和该样本图像中所包含的目标对象的肢体部位的肢体特征点的初始训练标签。
目标对象可以为真实人体、虚拟人体、以及其他能够进行肢体检测的对象。目标对象的肢体部位可以为该目标对象的全身肢体部位,或者,部分肢体部位(例如,上半身肢体部位)。
在本公开实施例中,可以设置样本图像中所包含的目标对象的肢体部位能够满足大部分肢体检测任务。假设,目标对象为真实人体,那么肢体部位可以为该目标对象的完整上半身肢体部位(例如,头部,双臂,上肢躯干,手部),还可以为该目标对象的完整全身肢体部位(例如,头部,双臂,上肢躯干,手部,腿部和脚部)。
这里,肢体特征点的初始训练标签可以理解为:肢体特征点在样本图像中的位置信息和 /或肢体类别信息(例如,该肢体关键点属于手部关键点),以及该肢体特征点在每个训练任务中的监督学习状态(例如,需要进行监督学习的特征点,或者,不需要进行监督学习的特征点)。
在本公开实施例中,初始训练样本的数量可以为多组。这里,可以基于目标对象的对象类型,和/或,目标对象的肢体部位确定多组初始训练样本。每个初始训练样本包含对应的样本标签,该样本标签可以用于指示对应目标对象的对象类型和/或目标对象的肢体部位。
例如,目标对象可以为人,除此之外,还可以为除了人之外其他能够进行肢体识别的对象,例如,仿真虚拟机器人。
实际实施时,可以基于“人”设置至少一组初始训练样本,例如,基于“人”的全身肢体部位设置一组初始训练样本,还可以基于“人”的上半身肢体部位设置一组初始训练样本。
在本公开实施方式中,通过设置多组初始训练样本,可以丰富训练样本的样本内容,从而使得该初始训练样本能够满足更加丰富的训练场景。
S103:基于待训练网络模型的检测任务在所述肢体部位中确定待检测部位;并对所述样本图像进行处理,得到包含所述待检测部位的目标样本图像。
这里,待训练网络模型的检测任务可以用于指示该待训练网络模型所需要进行检测的肢体部位(即,上述待检测部位)。之后,就可以根据该待检测部位对初始训练样本中的样本图像进行处理。例如,可以对该样本图像进行截取处理,得到包含待检测部位的目标样本图像。
在初始训练样本的数量为多组的情况下,在对样本图像进行处理之前,可以执行以下步骤:
基于检测任务确定目标对象的对象类型,并基于每组初始训练样本的样本标签,在多组初始训练样本中确定与目标对象的对象类型相匹配的初始训练样本。在确定出相匹配的初始训练样本的数量为多组的情况下,还可以基于检测任务确定目标对象的待检测部位,进而,在相匹配的多组初始训练样本中确定包含该待检测部位的初始训练样本,并对确定出的包含待检测部位的初始训练样本中的样本图像进行处理。
通过上述处理方式,可以从多组初始训练样本中确定出与待训练网络模型相匹配的初始训练样本,在根据该相匹配的初始训练样本进行网络模型的训练时,可以提高目标样本图像和训练标签的确定效率,并提高目标样本图像和训练标签准确率,从而能够得到训练精度较高的网络模型。
步骤S103中,对样本图像进行处理可以理解为:对样本图像进行遮挡处理,和/或,对样本图像进行裁剪处理。其中,遮挡处理可以理解为遮挡样本图像中除待检测部位之外的其他部位;裁剪处理可以理解为对样本图像中的待检测部位进行裁剪。
S105:确定所述肢体特征点的监督类型,并基于所述监督类型修正所述初始训练标签,以根据修正之后的初始训练标签确定所述目标样本图像的训练标签。
这里,可以基于待训练网络模型的检测任务和/或待检测部位确定每个肢体特征点的监督类型。监督类型用于指示在对待训练网络模型进行训练的过程中,是否需要对肢体特征点进行监督学习。
若该肢体特征点不属于待检测部位,那么该肢体特征点的监督类型可以确定为非监督类特征点。若该肢体特征点属于待检测部位,那么该肢体特征点的监督类型可以确定为监督类特征点。
在确定出监督类型之后,就可以确定出在待训练网络模型的训练过程中,需要进行监督学习的肢体特征点。在基于监督类型修正初始训练标签之后,就可以得到用于对待训练网络模型进行训练的目标样本图像的训练标签。
S107:通过所述目标训练样本和训练标签对所述待训练网络模型进行有监督训练。
在一个可选的实施方式中,在通过目标训练样本和训练标签,对待训练网络模型进行有监督训练之后,还可以基于目标训练样本和训练标签构建一组新的初始训练样本,此时,还可以为该新的初始训练样本添加样本标签,例如,该样本标签可以用于指示该新的初始训练 样本所对应目标对象的对象类型,和/或,该新的初始训练样本中所包含的目标对象的肢体部位的部位信息。
通过创建新的初始训练样本,可以对初始训练样本进行扩充,从而使得该初始训练样本能够满足更加丰富的训练需求,同时,能够提高初始训练样本和对应训练任务的匹配度,从而进一步加快神经网络的训练速度。
在本公开实施例中,通过对初始训练样本中的样本图像和初始训练标签进行修正处理,得到目标样本图像和训练标签的方式,可以实现初始训练样本的复用,以使初始训练样本可以同时用于多个不同的检测任务,进而通过节省数据标注成本的方式降低神经网络的训练成本。
针对上述步骤S101,在一个可选的实施方式中,可以通过以下所描述的方式获取初始训练样本,包括如下过程:
首先,创建样本库,在该样本库中包含多个样本数据集,且每个样本数据集包含对应的样本标签,该样本标签用于指示每个样本数据集所对应的检测任务。
这里,样本库中的每个样本数据集即为上述过程中所描述的每组初始训练样本。这里,样本标签通过指示每个样本数据集所对应的检测任务,来确定每个样本数据集所对应目标对象的对象类型,和/或,该样本数据集中所包含的目标对象的肢体部位的部位信息。
之后,可以根据待训练网络模型的检测任务,从样本库中查找相对应的样本数据集作为初始训练样本。
例如,在该样本库中可以包含用于进行肢体检测的样本数据集A。为了提高样本数据集的复用程度,可以设置样本数据集A中的样本图像包含目标对象的全部肢体部位(上述检测部位),以及设置样本数据集A中包含该全部肢体部位的肢体关键点(也即,上述特征点)。
在从样本库中查找到初始训练样本之后,就可以根据待训练网络模型对应的检测任务,对选择出的初始训练样本中的样本图像进行处理,得到目标样本图像。之后,可以确定针对当前时刻待训练网络模型,选择出的初始训练样本中每个肢体特征点的监督类型,并基于该监督类型对肢体特征点的初始训练标签进行修正,从而得到目标训练样本的训练标签。
在一个可选的实施方式中,上述初始训练样本中的样本图像为对原始图像执行目标操作之后确定的图像,其中,原始图像为在场景复杂度不满足预设复杂要求的场景下采集到的包含目标对象的至少一个肢体部位的图像,所述目标操作包括以下至少之一:背景替换操作、调整图像亮度的操作、调整图像曝光度的操作、添加反光图层的操作。
在本公开实施例中,可以获取待训练网络模型的应用场景,进而,基于该应用场景确定与之相匹配的目标操作的操作类型;之后,基于该操作类型对原始图像执行相应的目标操作,得到样本图像。
通过基于应用场景确定与之相匹配的目标操作的操作类型的方式,可以使得初始训练样本中的样本图像能够更加接近待训练网络模型的真实应用场景,从而可以提高待训练网络模型的处理精度。
实际实施时,为了提高样本图像中所包含每个肢体部位的完整度,以及提高样本图像的清晰度,可以设置在简单背景(即,场景复杂度不满足预设复杂要求的场景)下采集原始图像,例如,可以在绿幕背景下采集包含目标对象的至少一个检测部位的原始图像。例如,可以采集至少一张包含目标人体的全部肢体部位的原始图像。
这里,简单场景可以理解为绿幕背景,或者,不包含或者包含数量减少的其他物体的任意一种背景下的场景。简单来说,上述简单场景可以理解为不会对目标对象的任意肢体部位进行遮挡,且能够对目标对象的各个肢体部位进行准确识别的场景。
在采集到原始图像之后,可以确定原始图像中的背景图像,并确定多个预设背景图像,从而通过多个预设背景图像对每张原始图像中的背景图像进行替换,得到初始训练样本中的样本图像。
在对原始图像中的背景图像进行替换时,可以确定多个预设背景图像,其中,每个预设背景图像可以为包含多种元素的复杂背景图像。然后,按照多个预设背景图像对原始图像中 的背景图像进行替换,得到初始训练样本中的样本图像。
假设,原始图像的数量为1万,多个预设背景图像的数量为10个,那么通过10个预设背景图像对每个原始图像进行背景替换,可以得到10万个样本图像。如果多个预设背景图像的数量为100个,那么可以得到100万个样本图像。因此,通过上述处理方式还可以通过较低的成本,在较短的时间内获得海量训练样本,以提升模型鲁棒性。
这里,多个预设背景图像包含多种类型,其中,预设背景图像的类型与待训练网络模型的应用场景相关联。
为了提高待训练网络模型的处理精度,可以根据该网络模型的应用场景,确定与该应用场景相匹配的多个预设背景图像。
考虑到不同应用场景下,图像中所包含的背景元素可能是不相同的,因此,为了提高待训练网络模型的训练精度,可以根据待训练网络模型的应用场景,选择与该应用场景相匹配的预设背景图像进行原始图像的背景替换。
假设,待训练网络模型的应用场景为直播场景,考虑到直播场景的特殊性,可以在预设背景图像中选择与该直播场景相匹配的背景图像进行背景替换。这里,与直播场景相匹配的背景图像可以理解为模拟现实中的直播场景所构建的包含多种直播元素的复杂背景图像。
举例来说,待训练网络模型的应用场景为直播场景,该待训练网络模型的检测任务为对主播(目标对象)的半身肢体动作进行捕捉。
在直播场景中,复杂的背景会影响对主播半身肢体动作的捕捉,因为主播的肢体部位被画面边缘截断时,背景图像中一些跟主播的衣服颜色、肤色相近,或者肢体形状相似的物体会影响对主播肢体动作的捕捉。因此,在真实直播场景中,直播背景往往是复杂多变的,这时捕捉画面将会极度不稳定,从而将严重影响网络模型的处理精度。此时,可以获取包含多种直播元素的复杂背景图像,并基于该复杂背景图像替原始图像的背景图像,从而得到样本图像。
在本公开实施例中,通过对原始图像进行背景替换,可以实现通过较低的成本,在较短的时间内获得海量训练样本,以提升模型鲁棒性。上述实施方式中,通过与待训练网络模型的应用场景相匹配的背景图像与原始图像进行背景替换,可以使得初始训练样本中的样本图像能够更加接近待训练网络模型的真实应用场景,从而可以提高待训练网络模型的处理精度。
在本公开实施例中,除了可以按照上述所描述的方式对原始图像执行背景替换操作之外,还可以对原始图像执行以下至少一种目标操作(也即图像增强操作):调整图像亮度的操作、调整图像曝光度的操作、添加反光图层的操作。
实际实施时,可以根据待训练网络模型的应用场景,确定与之相匹配的图像增强操作。通过该处理方式,可以使得初始训练样本中的样本图像能够更加接近待训练网络模型的真实应用场景,从而可以提高待训练网络模型的处理精度。
举例来说,在直播场景下,除了复杂的背景之外,光照也是影响肢体检测性能的一个重要影响因素。例如,光线强弱,玻璃反光等也会影响到网络模型的检测性能。因此,在真实直播场景中,背景往往是复杂多变的,如背景中有与主播肤色相近的毛绒玩具、反光的玻璃墙面、昏暗或曝光的拍摄环境,这时捕捉到的画面将会极度不稳定。
基于此,在本公开实施例中,可以首先采集至少一张包含目标对象的至少一个肢体部位的原始图像。例如,可以采集至少一张包含目标人体的全部肢体部位的原始图像。
之后,对采集到的原始图像进行以下至少之一图像增强操作:调整图像亮度的操作、调整图像曝光度的操作、添加反光图层的操作;处理之后就可以得到样本图像。
上述实施方式中,通过对原始图像进行调整图像亮度、调整图像曝光度以及添加反光图层中的至少一种图像增强处理,可以提高样本图像的背景复杂度;通过基于待训练网络模型的应用场景确定与之相匹配的图像增强操作,可以使得样本图像能够更加贴近真实的应用场景,以提高待训练网络模型的处理精度。
在一个可选的实施方式中,针对上述步骤S103,对所述样本图像进行处理,得到包含所述待检测部位的目标样本图像,包括如下过程:
步骤S1031,对所述样本图像中的目标检测部位进行图像遮挡处理,得到所述目标样本图像;其中,所述目标检测部位为所述肢体部位中除所述待检测部位之外的其他部位。
和/或
步骤S1032,对所述样本图像中位于目标区域内的图像进行图像裁剪处理,裁剪得到包含所述待检测部位的目标样本图像;所述目标区域为所述样本图像中包含所述待检测部位的图像区域。
在本公开实施例中,首先确定待训练网络模型的检测任务,其中,该检测任务用于指示该待训练网络模型所需要进行检测的待检测部位的部位信息。
这里,部位信息可以为每个待检测部位的部位名称、部位编码、部位数量等信息,以及每个待检测部位的完整度需求。其中,完整度需求可以包含以下至少之一:要求完整、要求不完整、在要求不完整的情况下肢体不完整的程度。
例如,检测任务为对目标人体的上半身肢体进行检测,此时,待检测部位包含以下部位:上身躯干、手臂、头部和手部。
在确定出待检测部位的部位信息之后,可以根据该部位信息对样本图像进行图像遮挡处理和/或图像裁剪处理。
实际实施时,可以通过指定颜色的图像(例如,黑色图像)对目标检测部位进行图像遮挡处理,得到目标样本图像;和/或,对样本图像中位于目标区域内的图像进行图像裁剪处理,裁剪得到包含所述待检测部位的目标样本图像。
应理解的是,针对初始训练样本来说,要求该初始训练样本能够满足相同类型的检测任务的多种任务需求。其中,相同类型的检测任务可以理解为:检测任务均为肢体检测;不同检测任务的多种任务需求可以理解为:对不同的类型的肢体部位进行肢体检测。
假设,检测任务为肢体检测任务,那么该初始训练样本中应当包含目标人体的全部肢体部位,从而使得该初始训练样本能够满足相同类型的检测任务的多种任务需求。
上述实施方式中,通过对样本图像进行图像遮挡处理和/或图像裁剪处理,可以在不重新采集样本图像的基础上,得到与检测任务相匹配的样本图像,从而实现样本图像的复用,以降低神经网络的训练时间和训练成本。
在一个可选的实施方式中,如图2所示,针对上述步骤S105,确定所述肢体特征点的监督类型,并基于所述监督类型修正所述初始训练标签,包括如下过程:
步骤S1051,确定所述肢体特征点的特征信息,其中,所述特征信息用于指示对应肢体特征点和所述目标样本图像之间的位置关系,和/或,用于指示对应肢体特征点是否为待检测部位的特征点;
步骤S1052,基于所述特征信息,确定所述肢体特征点的监督类型;
步骤S1053,在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签。
在本公开实施例中,在按照上述内容所描述的方式对样本图像进行图像遮挡处理和/或图像裁剪处理,得到目标样本图像之后,其实是从样本图像中删除了该检测任务所不需要进行检测的肢体部位,也即,不需要进行监督学习的肢体部位。此时,还需要对该检测任务所不需要进行监督学习的肢体部位的肢体特征点的初始训练标签进行修正。
在对初始训练标签进行修正时,可以确定肢体特征点的特征信息,进而根据该特征信息确定该肢体特征点是否为需要进行监督学习的肢体特征点。
在本公开实施例中,按照肢体特征点的特征信息,可以将肢体特征点的监督类型确定为以下几种类型:
类型一:肢体特征点为监督类特征点,且该肢体特征点位于目标样本图像内。
类型二:肢体特征点为监督类特征点,且该肢体特征点位于目标样本图像外。
针对该类型二,可以理解为该肢体特征点并未包含在目标样本图像内,但是,需要进行监督学习的肢体特征点。
例如,待检测部位为人体的上半身(头部、上身躯干、两个手臂和两个手部)。由于在 对样本图像进行图像遮挡处理和/或图像裁剪处理时,手部的部分部位可能被遮挡和/或裁剪。但是,被遮挡和/或被裁剪的手部部位依然为需要进行检测的肢体部位。此时,被遮挡和/或被裁剪的手部部位所对应的肢体特征点的类型即为上述类型二。
类型三:肢体特征点为非监督类特征点,且该肢体特征点位于目标样本图像外。
在基于上述特征信息确定出每个肢体特征点的监督类型之后,就可以基于监督类型确定肢体特征点的初始训练标签是否满足修正条件。
实际实施时,如果判断出肢体特征点的监督类型为上述类型二或者类型三,则确定肢体特征点的初始训练标签满足修正条件。此时,就可以监督类型修正该肢体特征点的初始训练标签。如果判断出肢体特征点的监督类型为类型一,则确定肢体特征点的初始训练标签不满足修正条件。
在本公开实施例中,在确定出肢体特征点的初始训练标签满足修正条件的情况下,对初始训练标签进行修正的目的是为了对初始训练样本中的初始训练标签进行处理,从而得到能够对待训练网络模型进行训练的训练标签。
上述实施方式中,通过确定肢体特征点的监督类型,并基于该监督类型对肢体特征点的初始训练标签进行修正的方式,可以保留需要进行监督的肢体特征点的初始训练标签,同时还可以提高初始训练样本中初始训练标签的利用率。
在相关的神经网络的训练过程中,在进行数据标注时,只能标注画面中可见的肢体特征点的位置信息。然而,在本公开技术方案中,通过监督类型对肢体特征点的初始训练标签进行修正的方式,不仅可以监督学习位于目标样本图像内的肢体特征点,还可以监督学习位于目标样本图像之外且需要进行监督学习的肢体特征点。在基于修正之后的初始训练标签(即,训练标签)和目标样本图像对待训练网络模型进行训练时,就可以实现通过该网络模型对未包含在图像中的部分肢体部位的肢体特征点进行预测。
在一些姿态识别的场景中,由于待识别目标的部分肢体可能未包含在待识别图像中,此时,通过相关训练方式训练得到的网络模型对该待识别图像中待识别目标进行姿态识别时,可能会出现由于无法识别到完整的肢体特征点,从而导致姿态识别失败的问题。然后,在采用本公开技术方案中确定出的修正之后的初始训练标签和目标样本图像,对待训练网络模型进行训练时,可以对未包含在图像中的部分肢体部位的肢体特征点进行预测,从而实现待检测目标的姿态识别。因此,通过本公开技术方案进行训练之后的网络模型的鲁棒性更强,且识别过程更加稳定。
基于肢体特征点的类型信息,可以确定未包含在处理之后的样本图像中的肢体特征点是否为需要进行检测的特征点。在根据该类型信息对肢体特征点的初始训练标签进行修正时,可以提高初始训练标签的准确性,从而,进而提高了待训练网络模型的训练精度。
下面,将详细介绍对肢体特征点的初始训练标签进行修正的修成过程。
在一个可选的实施方式中,上述步骤S1053,在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签,包括如下过程:
在确定出肢体特征点的监督类型为上述类型二的情况下,将所述肢体特征点的初始训练标签修正为第二初始训练标签。
实际实施时,在根据所述初始训练标签确定出所述肢体特征点位于所述目标样本图像之外,且属于待检测部位的特征点的情况下,将所述肢体特征点的初始训练标签修正为第二初始训练标签;所述第二初始训练标签用于表征该肢体特征点为位于所述目标样本图像之外的监督类特征点。
在本公开实施例中,上述待训练网络模型的一个应用场景可以为对不包含完整检测部位的图像进行检测,从而得到完整检测部位的特征点。
例如,一张图像中包含上半身肢体部位,且在该图像中包含部分手部部位,此时,该待训练网络模型可以对该上半身肢体部位进行肢体检测。例如,可以对包含在该图像中的肢体部位进行肢体检测,还可以对未包含在该图像中且需要进行检测的肢体部位进行肢体检测, 从而得到肢体特征点,其中,该肢体特征点包含未包含在该图像中的手部的特征点。
此时,在对该待训练网络模型进行训练时,需要构建包含不完整检测部位的样本图像,例如,包含不完整上半身肢体部位的样本图像。然而,针对该样本图像,需要设置未包含在样本图像中需要进行检测的肢体部位的肢体特征点的初始训练标签。
在此情况下,可以确定肢体特征点是否为待训练网络模型所需要进行处理的待检测部位的特征点。如果是,则需要将该肢体特征点的初始训练标签修改为第二初始训练标签,表征该肢体特征点为未包含目标样本图像中的监督类特征点。
举例来说,在获取到初始训练样本之后,可以对初始训练样本中的样本图像进行图像遮挡处理或者图像裁剪处理,得到不包含完整上半身肢体的样本图像。此时,被遮挡或者被裁剪的肢体部位未包含在该处理之后的样本图像中。然而,该被遮挡或者被裁剪的肢体部位是待训练网络模型所需要进行检测的肢体部位。因此,需要对初始训练样本中该样本图像的初始训练标签进行修正,修正过程为:将初始训练标签中被遮挡或者被裁剪的肢体部位所对应的肢体特征点(肢体特征点)的初始训练标签修改为第二初始训练标签,以通过该第二初始训练标签表征肢体特征点为未包含目标样本图像中的监督类特征点。
上述实施方式中,通过上述所描述的初始训练标签的修正方式,可以使初始训练样本可以同时用于多个不同的检测任务,进而通过节省数据标注成本的方式降低神经网络的训练成本。进一步地,在采用本公开技术方案中确定出的修正之后的初始训练标签和目标样本图像,对待训练网络模型进行训练时,可以对未包含在图像中的部分肢体部位的肢体特征点进行预测,从而实现待检测目标的姿态识别。因此,通过本公开技术方案进行训练之后的网络模型的鲁棒性更强,且识别过程更加稳定。
在另一个可选的实施方式中,上述步骤S1053,在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签,包括如下过程:
在确定出肢体特征点的监督类型为上述类型三的情况下,将所述肢体特征点的初始训练标签修正为第一初始训练标签。
实际实施时,在根据所述监督类型确定出所述肢体特征点位于所述目标样本图像之外,且不是所述待检测部位的特征点情况下,将所述肢体特征点的初始训练标签修正为第一初始训练标签;所述第一初始训练标签用于表征该肢体特征点为位于所述目标样本图像之外的非监督类特征点。
举例来说,待训练网络模型用于对上半身人体肢体部位进行肢体检测。初始训练样本的样本图像中包含完整人体肢体(即,上半身人体肢体部位和下半身人体肢体部位)。
此时,需要对样本图像进行裁剪,得到包含完整或者不完整上半身人体肢体部位的样本图像。其中,上半身人体肢体部位的肢体特征点即为上述指定特征点。
此时,需要将样本图像中不属于待检测部位的肢体特征点的初始训练标签进行修正,例如,修正下半身人体肢体检测部位的肢体特征点的初始训练标签。
这里,可以将肢体特征点的初始训练标签设置为第一初始训练标签,用于表征该肢体特征点为位于目标样本图像之外的非监督类特征点。
在一个可选的实施方式中,可以将该肢体特征点的位置信息修改为(0,0),其中,(0,0)表示该肢体特征点为目标样本图像左上角的点。除此之外,还可以删除该肢体特征点的位置信息,并为该肢体特征点添加相应的标识信息,以通过该标识信息指示该肢体特征点位于目标样本图像之外。
上述实施方式中,通过上述所描述的初始训练标签的修正方式,可以使初始训练样本可以同时用于多个不同的检测任务,进而通过节省数据标注成本的方式降低神经网络的训练成本。
针对上述步骤S107,在待检测部位为多个的情况下,通过所述目标样本图像和所述训练标签,对所述待训练网络模型进行训练,包括如下过程:
步骤S1051,通过所述待训练网络模型对所述目标样本图像进行图像处理,得到所述目 标样本图像中每个待检测部位的位置信息;
步骤S1052,基于多个所述待检测部位中第一部位和第二部位的位置信息,以及所述训练标签,确定用于约束所述第一部位和第二部位之间的位置差异的目标损失函数的函数值,其中,所述第一部位和所述第二部位为具有关联关系的待检测部位;
步骤S1053,根据所述目标损失函数的函数值调整所述待训练网络模型的模型参数,以通过调整之后的模型参数对所述待训练网络模型进行训练。
在本公开实施例中,在按照上述所描述的方式得到目标训练样本和训练标签之后,就可以通过待训练网络模型对目标训练样本进行图像处理,得到各个待检测部位的位置信息。
为了提高待训练网络模型的处理精度,还可以基于待检测部位中的第一部位和第二部位的位置信息,以及训练标签确定目标损失函数。其中,该目标损失函数用于约束第一部位和第二部位之间的位置差异。
在一些实施例中,可以计算第一部位的位置信息和第二部位的位置信息之间的差值,并基于该差值和训练标签构建目标损失函数。然后,计算目标损失函数的函数值满足对应约束条件时,待训练网络模型的模型参数的最优解。进而,根据该最优解对待训练网络模型进行训练。
这里,第一部位和第二部位可以为具有联动关系的检测部位。例如,第一部位在第二部位的带动下进行移动,或者,第二部位在第一部位的带动下进行移动。假设,第一部位为手部,那么第二部位可以为手腕部位。第一部位可以手腕部位,第二部位可以为小臂部位。
在本公开实施例中,针对不相同的第一部位和第二部位可以设置不同的约束条件。例如,可以根据第一部位和第二部位的部位类型设置对应的约束条件。
上述实施方式中,通过第一部位和第二部位的位置信息之间的位置差异构建目标损失函数的方式,可以减少第一部位和第二部位之间位置差异较大的现象,从而提高待训练网络模型的处理精度。
参见图3所示,为本公开实施例提供的一种姿态检测方法的流程图,所述方法包括步骤S301~S307,其中:
S301:获取包含目标主播的肢体部位的视频图像帧;
S303:通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,其中,所述姿态检测模型为利用目标样本图像及对应的训练标签进行监督训练得到,所述目标样本图像包含待检测的肢体部位样本,所述训练标签基于所述肢体部位样本的肢体特征点的监督类型及所述肢体特征点对应的初始训练标签进行调整得到;
S305:根据所述姿态检测结果生成所述目标主播所对应的虚拟主播的姿态触发信号;
S307:根据所述姿态触发信号控制所述虚拟主播执行相应的触发动作。
在本公开实施例中,在获取到视频图像帧之后,通过姿态检测模型对目标主播进行姿态检测,进而根据姿态检测结果控制虚拟主播执行相应触发动作的方式,可以实现自动的通过目标主播的姿态检测结果在视频直播界面上触发展示虚拟主播的相应部位执行相应触发操作,进而实现触发虚拟主播执行与主播的肢体产生动作关联的动作,从而提高实际互动效果。同时,通过确定待检测的肢体部位的姿态检测结果对虚拟主播进行触发,还可以实现虚拟主播中对应触发部位的精准触发,从而满足用户的丰富触发需求。同时,在本公开技术方案中,通过基于肢体部位样本的肢体特征点的监督类型及肢体特征点对应的初始训练标签进行调整得到训练标签,可以通过对初始训练标签的复用来训练得到姿态检测模型,通过该方式可以节省数据标注成本,并加快数据标注的速度。
在直播场景中的情况下,复杂的背景会影响对主播半身肢体动作的捕捉,因为主播(即,上述目标对象)的肢体部位被画面边缘截断时,背景图像中一些跟主播的衣服颜色、肤色相近,或者肢体形状相似的物体会影响对主播肢体动作的捕捉。因此,在真实直播场景中,直播背景往往是复杂多变的,这时捕捉画面将会极度不稳定,从而将严重影响网络模型的处理精度。
在本公开实施例中,通过上述所描述的神经网络模型的训练方法,构建目标训练样本和 训练标签的方式,可以得到更加接近真实的直播场景的样本图像。在根据该目标训练样本和训练标签训练得到姿态检测模型之后,可以得到处理精度较高的姿态检测模型。在根据该姿态检测模型检测目标主播的肢体部位时,可以得到更加准确的姿态检测结果。
在本公开实施例中,可以根据该姿态检测结果生成触发虚拟主播执行相应触发动作的姿态触发信号,以触发虚拟主播执行相应的触发动作。
由于在根据该姿态检测模型检测目标主播的肢体部位时,可以得到更加准确的检测结果,因此,在根据该姿态检测结果触发虚拟主播执行相应的触发动作时,可以实现准确的控制虚拟主播执行相应的触发动作。
在一个可选的实施方式中,通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,包括如下过程:
(1)、在所述视频图像帧中检测目标框体,其中,所述目标框体用于框选所述视频图像帧中所述目标对象的肢体部位;
(2)、通过所述姿态检测模型对所述视频图像帧中位于所述目标框体内的图像进行姿态检测,得到所述姿态检测结果。
在本公开实施例中,在获取到视频图像帧之后,可以在视频图像帧中检测目标框体。其中,该目标框体为用于框选目标对象的肢体部位的框体。
这里,若视频图像帧中包含多个对象,则需要在多个对象中确定目标对象。在一个可选的实施方式中,可以在检测到视频图像帧中包含多个对象的情况下,将多个对象中位于画面最前方的对象确定为目标对象。在另一个可选的实施方式中,可以确定每个对象的框体,然后,将确定出的多个框体中,面积最大的框体确定为目标框体,并将该目标框体所框选的对象确定为目标对象。
在确定出目标框体之后,就可以通过采用上述所描述的神经网络模型的训练方法训练之后得到的姿态检测模型对位于该目标框体内的图像进行姿态检测,得到姿态检测结果。
上述实施方式中,通过检测包含目标对象的目标框体,并对位于该目标框体内的图像进行姿态检测,可以提高姿态检测结果的准确性。
在一个可选的实施方式中,上述步骤:通过姿态检测模型对所述视频图像帧中的所述目标对象进行姿态检测,得到姿态检测结果,包括如下过程:
(1)、在所述视频图像帧中未检测到所述目标对象的部分指定肢体部位的情况下,对所述视频图像帧进行处理,得到目标图像;所述目标图像中包含用于对该部分指定肢体部位进行姿态检测的区域。
对视频图像帧进行处理,得到目标图像的过程可以描述如下:
在视频图像帧中截取包含待检测部位的子图像,基于部分指定肢体部位的肢体类型信息和/或肢体尺寸信息,对该子图像进行边缘填补,得到包含用于对该部分指定肢体部位进行肢体检测的填补区域的目标图像。
(2)、通过所述姿态检测模型对所述目标图像进行姿态检测,得到所述目标主播的姿态检测结果。
在一个可选的实施方式中,该姿态检测模型能够对包含不完整的指定肢体部位进行姿态检测。
因此,在检测到视频图像帧中未检测到完整的指定肢体部位的情况下,可以确定该视频图像帧中不包含部分指定肢体部位(例如,缺少部分手部)。此时,可以对视频图像帧进行边缘填补处理,得到包含该部分指定部位的检测区域(即,上述填补区域)的目标图像。
在一个可选的实施方式中,可以对该视频图像帧进行边缘填补处理,以在该视频图像帧的边缘增加黑色区域,从而得到目标图像。
在对视频图像帧进行边缘填补处理,得到目标图像之后,就可以通过该姿态检测模型对目标图像进行姿态检测,得到目标对象的姿态检测结果。
上述实施方式中,在视频图像帧中未检测到目标对象的部分指定肢体部位的情况下,通过对视频图像帧进行边缘填补处理,得到目标图像,以根据该目标图像进行姿态检测的方式, 可以实现在视频图像帧中不包含完整的指定肢体部位的情况下,依然可以对该视频图像帧进行姿态检测,得到准确的姿态检测结果。
在一个可选的实施方式中,上述步骤:基于所述肢体部位样本的肢体特征点的监督类型及所述肢体特征点对应的初始训练标签进行调整得到所述训练标签,包括:
(1)、确定所述肢体特征点的监督类型;
(2)、基于所述监督类型修正所述初始训练标签,以根据修正之后的初始训练标签确定所述目标样本图像的训练标签。
这里,可以基于待训练的姿态检测模型确定每个肢体特征点的监督类型。监督类型用于指示在对待训练的姿态检测模型进行训练的过程中,是否需要对肢体特征点进行监督学习。
若该肢体特征点不属于待检测部位,那么该肢体特征点的监督类型可以确定为非监督类特征点。若该肢体特征点属于待检测部位,那么该肢体特征点的监督类型可以确定为监督类特征点。
在确定出监督类型之后,就可以确定出在待训练的姿态检测模型的训练过程中,需要进行监督学习的肢体特征点。在基于监督类型修正初始训练标签之后,就可以得到用于对待训练的姿态检测模型进行训练的目标样本图像的训练标签。
在一个可选的实施方式中,上述步骤:确定所述肢体特征点的监督类型,包括如下步骤:
首先,确定所述肢体特征点的特征信息,其中,所述特征信息用于指示对应肢体特征点和所述目标样本图像之间的位置关系,和/或,用于指示对应肢体特征点是否为待检测部位的特征点;
其次,基于所述特征信息,确定所述肢体特征点的监督类型;并在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签,得到所述训练标签。
在本公开实施例中,在对初始训练标签进行修正时,可以确定肢体特征点的特征信息,进而根据该特征信息确定该肢体特征点是否为需要进行监督学习的肢体特征点。
在本公开实施例中,按照肢体特征点的特征信息,可以将肢体特征点的监督类型确定为以下几种类型:
类型一:肢体特征点为监督类特征点,且该肢体特征点位于目标样本图像内。
类型二:肢体特征点为监督类特征点,且该肢体特征点位于目标样本图像外。
针对该类型二,可以理解为该肢体特征点并未包含在目标样本图像内,但是,需要进行监督学习的肢体特征点。
类型三:肢体特征点为非监督类特征点,且该肢体特征点位于目标样本图像外。
在基于上述特征信息确定出每个肢体特征点的监督类型之后,就可以基于监督类型确定肢体特征点的初始训练标签是否满足修正条件。
实际实施时,如果判断出肢体特征点的监督类型为上述类型二或者类型三,则确定肢体特征点的初始训练标签满足修正条件。此时,就可以监督类型修正该肢体特征点的初始训练标签。如果判断出肢体特征点的监督类型为类型一,则确定肢体特征点的初始训练标签不满足修正条件。
在本公开实施例中,在确定出肢体特征点的初始训练标签满足修正条件的情况下,对初始训练标签进行修正的目的是为了对初始训练样本中的初始训练标签进行处理,从而得到能够对待训练的姿态检测模型进行训练的训练标签。
上述实施方式中,通过确定肢体特征点的监督类型,并基于该监督类型对肢体特征点的初始训练标签进行修正的方式,可以保留需要进行监督的肢体特征点的初始训练标签,同时还可以提高初始训练样本中初始训练标签的利用率。
本公开涉及增强现实领域,通过获取现实环境中的目标对象的图像信息,进而借助各类视觉相关算法实现对目标对象的相关特征、状态及属性进行检测或识别处理,从而得到与实际应用匹配的虚拟与现实相结合的AR效果。示例性的,目标对象可涉及与人体相关的脸部、肢体、手势、动作等。实际应用不仅可以涉及跟真实场景或物品相关的导览、导航、讲解、 重建、虚拟效果叠加展示等交互场景,还可以涉及与人相关的特效处理,比如妆容美化、肢体美化、特效展示、虚拟模型展示等交互场景。可通过卷积神经网络,实现对目标对象的相关特征、状态及属性进行检测或识别处理。
在本公开的一些实施例中,针对直播场景下人体半身肢体关键点预测和姿态估计困难的问题,提出了专用深度神经网络,合理预测出各种复杂条件下的人体肢体全局与局部情况,对半身肢体不同展现程度进行量化分析。针对真实场景下画面背景质量不高的情况,提供一种通用直播场景下的数据集构建方案,设计了针对复杂背景的数据增强方式和模型训练损失函数,从而提高了模型的泛化能力。其中,详细的实施方式包括:
(1)在数据采集时,不限于特定的采集画面,而是采取更通用的情况去采集,即人体全身肢体(或人体上半身肢体)均保证在画面中,且采集环境背景是绿幕背景,并对采集所得数据进行完整关键点的数据标注,保证数据集复用率达到最高。使用人体分割模型获得像素级的人体信息,对背景部分进行随机的背景替换生成多组训练数据。
(2)在模型训练阶段,基于数据增强的结果,判断肢体点是否有标注信息以及能否判断肢体点是否在图外,依据判断结果给出肢体点的监督信号以及监督信息。
(3)设计一种损失函数约束,用于约束图像中人手腕点和手检测框的位置信息,减少手腕和手部检测框位置差异较大的现象。
(4)在直播时,获取主播肢体上半身画面并进行人体检测,得到的人体检测框图片再使用二维人体姿态估计网络预测所需的人肢体点的二维坐标。经过阶段一至三所训得的模型,在诸如画面昏暗或曝光、有反光的玻璃墙面、背景物体杂乱等复杂背景场景中具有鲁棒的泛化表现,明显减少误检问题。
本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的撰写顺序并不意味着严格的执行顺序而对实施过程构成任何限定,各步骤的实际执行顺序应当以其功能和可能的内在逻辑确定。
基于同一发明构思,本公开实施例中还提供了与姿态检测方法对应的姿态检测装置,由于本公开实施例中的装置解决问题的原理与本公开实施例上述神经网络模型的训练方法相似,因此装置的实施可以参见方法的实施。
参照图4所示,为本公开实施例提供的一种姿态检测装置的示意图,所述装置包括:第一获取单元41、姿态检测单元42、生成单元43、控制单元44;其中:
第一获取单元41,配置为获取包含目标主播的肢体部位的视频图像帧;
姿态检测单元42,配置为通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,其中,所述姿态检测模型为利用目标样本图像及对应的训练标签进行监督训练得到,所述目标样本图像包含待检测的肢体部位样本,所述训练标签基于所述肢体部位样本的肢体特征点的监督类型及所述肢体特征点对应的初始训练标签进行调整得到;
生成单元43,配置为根据所述姿态检测结果生成所述目标主播所对应的虚拟主播的姿态触发信号;
控制单元44,配置为根据所述姿态触发信号控制所述虚拟主播执行相应的触发动作。
一种可能的实施方式中,姿态检测单元42,还配置为:在所述视频图像帧中检测目标框体,其中,所述目标框体用于框选所述视频图像帧中所述目标主播的肢体部位;通过所述姿态检测模型对所述视频图像帧中位于所述目标框体内的图像进行姿态检测,得到所述姿态检测结果。
一种可能的实施方式中,姿态检测单元42,还配置为:在所述视频图像帧中未检测到所述目标主播的至少部分指定肢体部位的情况下,对所述视频图像帧进行处理,得到目标图像;所述目标图像中包含用于对该至少部分指定肢体部位进行姿态检测的区域;通过所述姿态检测模型对所述目标图像进行姿态检测,得到所述目标主播的姿态检测结果。
一种可能的实施方式中,该装置还配置为:确定所述肢体特征点的监督类型;基于所述监督类型修正所述初始训练标签,以根据修正之后的初始训练标签确定所述目标样本图像的 训练标签。
一种可能的实施方式中,该装置还配置为:确定所述肢体特征点的特征信息,其中,所述特征信息用于指示对应肢体特征点和所述目标样本图像之间的位置关系,和/或,用于指示对应肢体特征点是否为待检测部位的特征点;基于所述特征信息,确定所述肢体特征点的监督类型;并在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签,得到所述训练标签。
基于同一发明构思,本公开实施例中还提供了与神经网络模型的训练方法对应的神经网络模型的训练装置,由于本公开实施例中的装置解决问题的原理与本公开实施例上述神经网络模型的训练方法相似,因此装置的实施可以参见方法的实施。
参照图5所示,为本公开实施例提供的一种神经网络模型的训练装置的示意图,所述装置包括:第二获取单元51、修正处理单元52、确定单元53、训练单元54;其中,
第一获取单元51,配置为获取初始训练样本;所述初始训练样本中包含样本图像和该样本图像中所包含的目标对象的肢体部位的肢体特征点的初始训练标签;
修正处理单元52,配置为基于待训练网络模型的检测任务在所述肢体部位中确定待检测部位;并对所述样本图像进行处理,得到包含所述待检测部位的目标样本图像;
确定单元53,配置为确定所述肢体特征点的监督类型,并基于所述监督类型修正所述初始训练标签,以根据修正之后的初始训练标签确定所述目标样本图像的训练标签;
训练单元54,配置为通过所述目标样本图像和所述训练标签,对所述待训练网络模型进行有监督训练。
通过上述描述可知,本公开技术方案在确定出待训练网络模型之后,通过对初始训练样本中的样本图像和初始训练标签进行修正处理,得到目标样本图像和训练标签的方式,可以实现初始训练样本的复用,以使初始训练样本可以同时用于多个不同的待训练网络模型的训练过程,进而通过节省数据标注成本的方式降低神经网络的训练成本。
一种可能的实施方式中,在所述待检测部位为多个的情况下,训练单元,还配置为:通过所述待训练网络模型对所述目标样本图像进行图像处理,得到所述目标样本图像中每个待检测部位的位置信息;基于多个所述待检测部位中第一部位和第二部位的位置信息,以及所述训练标签,确定用于约束所述第一部位和第二部位之间的位置差异的目标损失函数的函数值,其中,所述第一部位和所述第二部位为具有关联关系的待检测部位;根据所述目标损失函数的函数值调整所述待训练网络模型的模型参数,以通过调整之后的模型参数对所述待训练网络模型进行训练。
一种可能的实施方式中,确定单元,还配置为:确定所述肢体特征点的特征信息,其中,所述特征信息用于指示对应肢体特征点和所述目标样本图像之间的位置关系,和/或,用于指示对应肢体特征点是否为待检测部位的特征点;基于所述特征信息,确定所述肢体特征点的监督类型;在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签。
一种可能的实施方式中,确定单元,还配置为:在根据所述监督类型确定出所述肢体特征点位于所述目标样本图像之外,且不是所述待检测部位的特征点情况下,将所述肢体特征点的初始训练标签修正为第一初始训练标签;所述第一初始训练标签用于表征该肢体特征点为位于所述目标样本图像之外的非监督类特征点。
一种可能的实施方式中,确定单元,还配置为:在根据所述初始训练标签确定出所述肢体特征点位于所述目标样本图像之外,且属于待检测部位的特征点的情况下,将所述肢体特征点的初始训练标签修正为第二初始训练标签;所述第二初始训练标签用于表征该肢体特征点为位于所述目标样本图像之外的监督类特征点。
一种可能的实施方式中,所述样本图像为对原始图像执行目标操作之后确定的图像,其中,所述原始图像为在场景复杂度不满足预设复杂要求的场景下采集到的包含目标对象的至少一个肢体部位的图像,所述目标操作包括以下至少之一:背景替换操作、调整图像亮度的操作、调整图像曝光度的操作、添加反光图层的操作。
一种可能的实施方式中,修正处理单元,还配置为:对所述样本图像中的目标检测部位进行图像遮挡处理,得到所述目标样本图像;其中,所述目标检测部位为所述肢体部位中除所述待检测部位之外的其他部位;和/或,对所述样本图像中位于目标区域内的图像进行图像裁剪处理,裁剪得到包含所述待检测部位的目标样本图像;所述目标区域为所述样本图像中包含所述待检测部位的图像区域。
关于装置中的各模块的处理流程、以及各模块之间的交互流程的描述可以参照上述方法实施例中的相关说明。
对应于图1中的姿态检测方法,本公开实施例还提供了一种计算机设备600,如图6所示,为本公开实施例提供的计算机设备600结构示意图,包括:
处理器61、存储器62、和总线63;存储器62用于存储执行指令,包括内存621和外部存储器622;这里的内存621也称内存储器,用于暂时存放处理器61中的运算数据,以及与硬盘等外部存储器622交换的数据,处理器61通过内存621与外部存储器622进行数据交换,当所述计算机设备600运行时,所述处理器61与所述存储器62之间通过总线63通信,使得所述处理器61执行以下指令:
获取包含目标主播的肢体部位的视频图像帧;
通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,其中,所述姿态检测模型为利用目标样本图像及对应的训练标签进行监督训练得到,所述目标样本图像包含待检测的肢体部位样本,所述训练标签基于所述肢体部位样本的肢体特征点的监督类型及所述肢体特征点对应的初始训练标签进行调整得到;
根据所述姿态检测结果生成所述目标主播所对应的虚拟主播的姿态触发信号;
根据所述姿态触发信号控制所述虚拟主播执行相应的触发动作。
对应于图1中的神经网络模型的训练方法,本公开实施例还提供了一种计算机设备700,如图7所示,为本公开实施例提供的计算机设备700结构示意图,包括:
处理器71、存储器72、和总线73;存储器72用于存储执行指令,包括内存721和外部存储器722;这里的内存721也称内存储器,用于暂时存放处理器71中的运算数据,以及与硬盘等外部存储器722交换的数据,处理器71通过内存721与外部存储器722进行数据交换,当所述计算机设备700运行时,所述处理器71与所述存储器72之间通过总线73通信,使得所述处理器71执行以下指令:
获取初始训练样本;所述初始训练样本中包含样本图像和该样本图像中所包含的目标对象的肢体部位的肢体特征点的初始训练标签;
基于待训练网络模型的检测任务在所述肢体部位中确定待检测部位;并对所述样本图像进行处理,得到包含所述待检测部位的目标样本图像;
确定所述肢体特征点的监督类型,并基于所述监督类型修正所述初始训练标签,以根据修正之后的初始训练标签确定所述目标样本图像的训练标签;
通过所述目标样本图像和所述训练标签,对所述待训练网络模型进行有监督训练。
本公开实施例还提供一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行上述方法实施例中所述的神经网络模型的训练、姿态检测方法的步骤。其中,该存储介质可以是易失性或非易失的计算机可读取存储介质。
本公开实施例还提供一种计算机程序产品,该计算机程序产品承载有程序代码,所述程序代码包括的指令可用于执行上述方法实施例中所述的神经网络模型的训练、姿态检测方法的步骤,可参见上述方法实施例。
其中,上述计算机程序产品可以通过硬件、软件或其结合的方式实现。在一个可选实施例中,所述计算机程序产品可以体现为计算机存储介质,在另一个可选实施例中,计算机程序产品可以体现为软件产品,例如软件开发包(Software Development Kit,SDK)等等。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统和装置的具体工作过程,可以参考前述方法实施例中的对应过程。在本公开所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。以上所描述的装置实 施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,又例如,多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些通信接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个处理器可执行的非易失的计算机可读取存储介质中。基于这样的理解,本公开的技术方案本质上或者说对相关技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本公开各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
最后应说明的是:以上所述实施例,仅为本公开的具体实施方式,用以说明本公开的技术方案,而非对其限制,本公开的保护范围并不局限于此,尽管参照前述实施例对本公开进行了详细的说明,本领域的普通技术人员应当理解:任何熟悉本技术领域的技术人员在本公开揭露的技术范围内,其依然可以对前述实施例所记载的技术方案进行修改或可轻易想到变化,或者对其中部分技术特征进行等同替换;而这些修改、变化或者替换,并不使相应技术方案的本质脱离本公开实施例技术方案的精神和范围,都应涵盖在本公开的保护范围之内。因此,本公开的保护范围应所述以权利要求的保护范围为准。
工业实用性
本公开实施例提供了一种检测方法、训练方法、装置、设备、存储介质和程序产品,其中,该方法包括:获取包含目标主播的肢体部位的视频图像帧;通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,其中,所述姿态检测模型为利用目标样本图像及对应的训练标签进行监督训练得到,所述目标样本图像包含待检测的肢体部位样本,所述训练标签基于所述肢体部位样本的肢体特征点的监督类型及所述肢体特征点对应的初始训练标签进行调整得到;根据所述姿态检测结果生成所述目标主播所对应的虚拟主播的姿态触发信号;根据所述姿态触发信号控制所述虚拟主播执行相应的触发动作。在本公开实施例中,在获取到视频图像帧之后,通过姿态检测模型对目标主播进行姿态检测,进而根据姿态检测结果控制虚拟主播执行相应触发动作的方式,可以实现自动的通过目标主播的姿态检测结果在视频直播界面上触发展示虚拟主播的相应部位执行相应触发操作,进而实现触发虚拟主播执行与主播的肢体产生动作关联的动作,从而提高实际互动效果。同时,通过确定待检测的肢体部位的姿态检测结果对虚拟主播进行触发,还可以实现虚拟主播中对应触发部位的精准触发,从而满足用户的丰富触发需求。同时,在本公开技术方案中,通过基于肢体部位样本的肢体特征点的监督类型及肢体特征点对应的初始训练标签进行调整得到训练标签,可以通过对初始训练标签的复用来训练得到姿态检测模型,通过该方式可以节省数据标注成本,并加快数据标注的速度。

Claims (17)

  1. 一种姿态检测方法,其中,包括:
    获取包含目标主播的肢体部位的视频图像帧;
    通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,其中,所述姿态检测模型为利用目标样本图像及对应的训练标签进行监督训练得到,所述目标样本图像包含待检测的肢体部位样本,所述训练标签基于所述肢体部位样本的肢体特征点的监督类型及所述肢体特征点对应的初始训练标签进行调整得到;
    根据所述姿态检测结果生成所述目标主播所对应的虚拟主播的姿态触发信号;
    根据所述姿态触发信号控制所述虚拟主播执行相应的触发动作。
  2. 根据权利要求1所述的方法,其中,所述通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,包括:
    在所述视频图像帧中检测目标框体,其中,所述目标框体用于框选所述视频图像帧中所述目标主播的肢体部位;
    通过所述姿态检测模型对所述视频图像帧中位于所述目标框体内的图像进行姿态检测,得到所述姿态检测结果。
  3. 根据权利要求1或2所述的方法,其中,所述通过姿态检测模型对所述视频图像帧中的所述目标主播进行姿态检测,得到姿态检测结果,包括:
    在所述视频图像帧中未检测到所述目标主播的至少部分指定肢体部位的情况下,对所述视频图像帧进行处理,得到目标图像;所述目标图像中包含用于对该至少部分指定肢体部位进行姿态检测的区域;
    通过所述姿态检测模型对所述目标图像进行姿态检测,得到所述目标主播的姿态检测结果。
  4. 根据权利要求1所述的方法,其中,所述基于所述肢体部位样本的肢体特征点的监督类型及所述肢体特征点对应的初始训练标签进行调整得到所述训练标签,包括:
    确定所述肢体特征点的监督类型;
    基于所述监督类型修正所述初始训练标签,以根据修正之后的初始训练标签确定所述目标样本图像的训练标签。
  5. 根据权利要求4所述的方法,其中,所述确定所述肢体特征点的监督类型,包括:
    确定所述肢体特征点的特征信息,其中,所述特征信息用于指示对应肢体特征点和所述目标样本图像之间的位置关系,和/或,用于指示对应肢体特征点是否为待检测部位的特征点;
    基于所述特征信息,确定所述肢体特征点的监督类型;并在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签,得到所述训练标签。
  6. 一种神经网络模型的训练方法,其中,包括:
    获取初始训练样本;所述初始训练样本中包含样本图像和该样本图像中所包含的目标对象的肢体部位的肢体特征点的初始训练标签;
    基于待训练网络模型的检测任务在所述肢体部位中确定待检测部位;并对所述样本图像进行处理,得到包含所述待检测部位的目标样本图像;
    确定所述肢体特征点的监督类型,并基于所述监督类型修正所述初始训练标签,以根据修正之后的初始训练标签确定所述目标样本图像的训练标签;
    通过所述目标样本图像和所述训练标签,对所述待训练网络模型进行有监督训练,得到姿态检测模型。
  7. 根据权利要求6所述的方法,其中,所述待检测部位为多个,所述通过所述目标样本图像和所述训练标签,对所述待训练网络模型进行训练,包括:
    通过所述待训练网络模型对所述目标样本图像进行图像处理,得到所述目标样本图像中每个待检测部位的位置信息;
    基于多个所述待检测部位中第一部位和第二部位的位置信息,以及所述训练标签,确定用于约束所述第一部位和第二部位之间的位置差异的目标损失函数的函数值,其中,所述第一部位和所述第二部位为具有关联关系的待检测部位;
    根据所述目标损失函数的函数值调整所述待训练网络模型的模型参数,以通过调整之后的模型参数对所述待训练网络模型进行训练。
  8. 根据权利要求6或7所述的方法,其中,所述确定所述肢体特征点的监督类型,并基于所述监督类型修正所述初始训练标签,包括:
    确定所述肢体特征点的特征信息,其中,所述特征信息用于指示对应肢体特征点和所述目标样本图像之间的位置关系,和/或,用于指示对应肢体特征点是否为待检测部位的特征点;
    基于所述特征信息,确定所述肢体特征点的监督类型;
    在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签。
  9. 根据权利要求8所述的方法,其中,所述在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签,包括:
    在根据所述监督类型确定出所述肢体特征点位于所述目标样本图像之外,且不是所述待检测部位的特征点情况下,将所述肢体特征点的初始训练标签修正为第一初始训练标签;所述第一初始训练标签用于表征该肢体特征点为位于所述目标样本图像之外的非监督类特征点。
  10. 根据权利要求8所述的方法,其中,所述在基于所述监督类型确定所述肢体特征点的初始训练标签满足修正条件的情况下,基于所述监督类型修正所述初始训练标签,包括:
    在根据所述初始训练标签确定出所述肢体特征点位于所述目标样本图像之外,且属于待检测部位的特征点的情况下,将所述肢体特征点的初始训练标签修正为第二初始训练标签;所述第二初始训练标签用于表征该肢体特征点为位于所述目标样本图像之外的监督类特征点。
  11. 根据权利要求6所述的方法,其中,所述样本图像为对原始图像执行目标操作之后确定的图像,其中,所述原始图像为在场景复杂度不满足预设复杂要求的场景下采集到的包含目标对象的至少一个肢体部位的图像,所述目标操作包括以下至少之一:背景替换操作、调整图像亮度的操作、调整图像曝光度的操作、添加反光图层的操作。
  12. 根据权利要求6所述的方法,其中,所述对所述样本图像进行处理,得到包含所述待检测部位的目标样本图像,包括:
    对所述样本图像中的目标检测部位进行图像遮挡处理,得到所述目标样本图像;其中,所述目标检测部位为所述肢体部位中除所述待检测部位之外的其他部位;
    和/或
    对所述样本图像中位于目标区域内的图像进行图像裁剪处理,裁剪得到包含所述待检测部位的目标样本图像;所述目标区域为所述样本图像中包含所述待检测部位的图像 区域。
  13. A pose detection apparatus, comprising:
    a first acquisition unit configured to acquire a video image frame containing a body part of a target anchor;
    a pose detection unit configured to perform pose detection on the target anchor in the video image frame by means of a pose detection model to obtain a pose detection result, wherein the pose detection model is obtained by supervised training using a target sample image and a corresponding training label, the target sample image contains a body part sample to be detected, and the training label is obtained by adjustment based on a supervision type of limb feature points of the body part sample and initial training labels corresponding to the limb feature points;
    a generation unit configured to generate, according to the pose detection result, a pose trigger signal of a virtual anchor corresponding to the target anchor;
    a control unit configured to control, according to the pose trigger signal, the virtual anchor to perform a corresponding trigger action.
  14. An apparatus for training a neural network model, comprising:
    a second acquisition unit configured to acquire an initial training sample, wherein the initial training sample contains a sample image and initial training labels of limb feature points of body parts of a target object contained in the sample image;
    a correction processing unit configured to determine, from the body parts, a part to be detected based on a detection task of a network model to be trained, and to process the sample image to obtain a target sample image containing the part to be detected;
    a determination unit configured to determine a supervision type of the limb feature points and to correct the initial training labels based on the supervision type, so as to determine a training label of the target sample image according to the corrected initial training labels;
    a training unit configured to perform supervised training on the network model to be trained by using the target sample image and the training label.
  15. A computer device, comprising a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor communicates with the memory via the bus; and the machine-readable instructions, when executed by the processor, perform the steps of the pose detection method according to any one of claims 1 to 5, or the steps of the method for training a neural network model according to any one of claims 6 to 12.
  16. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when run by a processor, performs the steps of the pose detection method according to any one of claims 1 to 5, or the steps of the method for training a neural network model according to any one of claims 6 to 12.
  17. A computer program product, comprising a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when read and executed by a computer, implements the steps of the pose detection method according to any one of claims 1 to 5, or the steps of the method for training a neural network model according to any one of claims 6 to 12.
PCT/CN2022/075197 2021-08-27 2022-01-30 Detection method, training method, apparatus, device, storage medium and program product WO2023024442A1 (zh)
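
As an informal illustration of the training-side claims, the sketch below shows one plausible realization, in Python with PyTorch, of the label correction of claims 8 to 10 (re-encoding initial keypoint labels according to supervision type when a limb feature point falls outside the cropped target sample image) and of the target loss of claim 7 (constraining the position difference between two associated parts). The marker values, tensor layout and mean-squared penalty are assumptions made for this sketch; the claims do not prescribe a concrete implementation.

import torch

# Assumed encodings for the two corrected label types of claims 9 and 10.
OUTSIDE_UNSUPERVISED = -1.0   # "first initial training label"  (claim 9)
OUTSIDE_SUPERVISED = -2.0     # "second initial training label" (claim 10)

def correct_labels(keypoints: torch.Tensor,
                   is_target_part: torch.Tensor,
                   img_w: int, img_h: int) -> torch.Tensor:
    """keypoints: (N, 2) initial labels in pixel coordinates reused from the
    source dataset; is_target_part: (N,) bool, True if the point belongs to a
    part to be detected. Points falling outside the cropped target sample
    image have their labels replaced by a marker encoding their supervision
    type, so the original annotations can be reused without relabeling."""
    labels = keypoints.clone()
    x, y = keypoints[:, 0], keypoints[:, 1]
    outside = (x < 0) | (x >= img_w) | (y < 0) | (y >= img_h)
    labels[outside & ~is_target_part] = OUTSIDE_UNSUPERVISED
    labels[outside & is_target_part] = OUTSIDE_SUPERVISED
    return labels

def associated_part_loss(pred_first: torch.Tensor,
                         pred_second: torch.Tensor,
                         gt_first: torch.Tensor,
                         gt_second: torch.Tensor) -> torch.Tensor:
    """Target loss constraining the position difference between an associated
    first part and second part (for example, wrist and hand): the predicted
    offset between the two parts should match the labeled offset."""
    pred_offset = pred_first - pred_second
    gt_offset = gt_first - gt_second
    return torch.mean((pred_offset - gt_offset) ** 2)

In a training loop, correct_labels would be applied when building each batch, and associated_part_loss would be added to the ordinary keypoint regression loss before the model parameters are updated.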

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110993216.2A CN113435431B (zh) 2021-08-27 2021-08-27 Pose detection method, neural network model training method, apparatus and device
CN202110993216.2 2021-08-27

Publications (1)

Publication Number Publication Date
WO2023024442A1 true WO2023024442A1 (zh) 2023-03-02

Family

ID=77798218

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/075197 WO2023024442A1 (zh) 2021-08-27 2022-01-30 Detection method, training method, apparatus, device, storage medium and program product

Country Status (2)

Country Link
CN (1) CN113435431B (zh)
WO (1) WO2023024442A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435431B (zh) 2021-08-27 2021-12-07 Beijing SenseTime Technology Development Co., Ltd. Pose detection method, neural network model training method, apparatus and device
CN115908665A (zh) 2021-09-30 2023-04-04 Beijing ByteDance Network Technology Co., Ltd. Video processing method, apparatus, device, medium and product
CN115223002B (zh) 2022-05-09 2024-01-09 Guangzhou Automobile Group Co., Ltd. Model training method, door-opening action detection method and apparatus, and computer device
CN115997385A (zh) 2022-10-12 2023-04-21 Guangzhou Kugou Computer Technology Co., Ltd. Augmented-reality-based interface display method, apparatus, device, medium and product

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399406B (zh) * 2018-01-15 2022-02-01 Sun Yat-sen University Weakly supervised salient object detection method and system based on deep learning
CN108712661B (zh) * 2018-05-28 2022-02-25 Guangzhou Huya Information Technology Co., Ltd. Live-streaming video processing method, apparatus, device and storage medium
CN113286186B (zh) * 2018-10-11 2023-07-18 Guangzhou Huya Information Technology Co., Ltd. Image display method and apparatus in live streaming, and storage medium
CN109936774A (zh) * 2019-03-29 2019-06-25 Guangzhou Huya Information Technology Co., Ltd. Virtual avatar control method and apparatus, and electronic device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190279046A1 (en) * 2016-11-01 2019-09-12 Snap Inc. Neural network for object detection in images
CN109922354A (zh) * 2019-03-29 2019-06-21 Guangzhou Huya Information Technology Co., Ltd. Live-streaming interaction method and apparatus, live-streaming system, and electronic device
CN109905724A (zh) * 2019-04-19 2019-06-18 Guangzhou Huya Information Technology Co., Ltd. Live-streaming video processing method and apparatus, electronic device, and readable storage medium
CN110139115A (zh) * 2019-04-30 2019-08-16 Guangzhou Huya Information Technology Co., Ltd. Keypoint-based virtual avatar pose control method and apparatus, and electronic device
CN111353555A (zh) * 2020-05-25 2020-06-30 Tencent Technology (Shenzhen) Co., Ltd. Annotation detection method and apparatus, and computer-readable storage medium
CN113435431A (zh) * 2021-08-27 2021-09-24 Beijing SenseTime Technology Development Co., Ltd. Pose detection method, neural network model training method, apparatus and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116243623A (zh) * 2023-05-10 2023-06-09 Shenzhen Moying Technology Co., Ltd. Robot scene simulation method applied to the digital robotics industry chain
CN116243623B (zh) * 2023-05-10 2023-08-04 Shenzhen Moying Technology Co., Ltd. Robot scene simulation method applied to the digital robotics industry chain
CN116873690A (zh) * 2023-09-06 2023-10-13 Jiangsu Province Special Equipment Safety Supervision and Inspection Institute Elevator safety monitoring data processing system
CN116873690B (zh) * 2023-09-06 2023-11-17 Jiangsu Province Special Equipment Safety Supervision and Inspection Institute Elevator safety monitoring data processing system

Also Published As

Publication number Publication date
CN113435431A (zh) 2021-09-24
CN113435431B (zh) 2021-12-07

Similar Documents

Publication Publication Date Title
WO2023024442A1 (zh) Detection method, training method, apparatus, device, storage medium and program product
US11928592B2 (en) Visual sign language translation training device and method
CN111259751B (zh) Video-based human behavior recognition method, apparatus, device and storage medium
US9912874B2 (en) Real-time visual effects for a live camera view
JP6079832B2 (ja) Human-computer interaction system, hand and hand pointing-point positioning method, and finger gesture determination method
WO2017193906A1 (zh) Image processing method and processing system
CN112884881B (zh) Three-dimensional face model reconstruction method and apparatus, electronic device, and storage medium
CN109635752B (zh) Face keypoint positioning method, face image processing method, and related apparatus
Azcarate et al. Automatic facial emotion recognition
US10990170B2 (en) Eye tracking method, electronic device, and non-transitory computer readable storage medium
KR20120054550A (ko) Method and device for real-time detection and tracking of non-rigid moving objects in a video stream, enabling a user to interact with a computer system
CN113449696B (zh) Pose estimation method and apparatus, computer device, and storage medium
WO2023279713A1 (zh) Special effect display method and apparatus, computer device, storage medium, computer program, and computer program product
CN113160231A (zh) Sample generation method, sample generation apparatus, and electronic device
CN109376618B (zh) Image processing method and apparatus, and electronic device
US20190304152A1 (en) Method and device for processing image
CN109325387B (zh) Image processing method and apparatus, and electronic device
CN113570615A (zh) Deep-learning-based image processing method, electronic device, and storage medium
US11314981B2 (en) Information processing system, information processing method, and program for displaying assistance information for assisting in creation of a marker
Bong et al. Face recognition and detection using haars features with template matching algorithm
CN115061577A (zh) Hand projection interaction method, system, and storage medium
CN114510142B (zh) Gesture recognition method based on two-dimensional images, system thereof, and electronic device
Qian et al. Multi-Scale tiny region gesture recognition towards 3D object manipulation in industrial design
Gevorgyan et al. OpenCV 4 with Python Blueprints: Build creative computer vision projects with the latest version of OpenCV 4 and Python 3
CN115937430B (zh) Method, apparatus, device, and medium for displaying a virtual object

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE