CN111523452A - Method and device for detecting human body position in image - Google Patents
- Publication number
- CN111523452A (application number CN202010321816.XA)
- Authority
- CN
- China
- Prior art keywords
- human body
- regression
- image
- body frame
- position coordinates
- Prior art date
- Legal status: Granted (the status listed is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a method and a device for detecting a human body position in an image, relating to the field of artificial intelligence. The implementation scheme is as follows: determining the position coordinates of a human body frame corresponding to a human body image from an image to be detected that includes the human body image; based on a learned evaluation threshold vector, performing multiple regression calculations on the position coordinates of the human body frame by using a frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector; performing multidimensional comparison analysis on each group of regression offset parameters in the regression offset parameter set to determine an optimal regression offset parameter group; and generating the final position coordinates of the human body frame based on the optimal regression offset parameter group. This scheme improves the accuracy of detecting the position coordinates of the human body frame, so that the detection result of the human body image is more accurate.
Description
Technical Field
Embodiments of the application relate to the field of computer technology, in particular to the field of artificial intelligence, and specifically to a method and a device for detecting a human body position in an image.
Background
With the continuous development of the internet and artificial intelligence, more and more fields involve automatic computation and analysis, and human body detection for security monitoring is one of the important scenarios.
In common security video applications, the human body frames obtained by human body detection sometimes deviate from the actual target: although a detected frame overlaps the target, it may not cover the whole human body. When such an insufficiently accurate detection frame is used for subsequent tasks (such as classification and tracking), additional errors are often introduced. If the detection frame covers the target human body better, the effect of the whole service can therefore be greatly improved.
Disclosure of Invention
The application provides a method, a device, equipment and a storage medium for detecting the position of a human body in an image.
According to a first aspect, the present application provides a method for detecting a position of a human body in an image, the method comprising: determining the position coordinates of a human body frame corresponding to a human body image from an image to be detected including the human body image; based on a learned evaluation threshold vector, performing multiple regression calculations on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector; performing multidimensional comparison analysis on each group of regression offset parameters in the regression offset parameter set to determine an optimal regression offset parameter group; and generating the final position coordinates of the human body frame based on the optimal regression offset parameter group.
In some embodiments, the performing multiple regression calculations on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function based on the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector includes: performing cascade regression calculation on the position coordinates of the human body frame by using the frame regression loss function based on each evaluation threshold in the learned evaluation threshold vector to obtain the regression offset parameter set corresponding to the evaluation threshold vector, wherein the evaluation threshold vector is an ordered set of evaluation thresholds representing the image overlap degree (IOU), and the values in the ordered set increase from front to back.
In some embodiments, performing multiple regression calculations on the position coordinates of a human body frame corresponding to a human body image in an image to be detected by using a frame regression loss function based on the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector, includes: based on the evaluation threshold vector obtained by learning, performing cascade regression calculation on the position coordinates of a human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector; the evaluation threshold vector is used for representing an initial evaluation threshold and a cascade step required by cascade regression calculation, and each regression calculation in the cascade regression calculation is performed based on a regression offset parameter group obtained by the last regression calculation.
In some embodiments, the method further comprises: obtaining the position coordinates of the human body frame corresponding to each group of regression offset parameters based on each group of regression offset parameters in the regression offset parameter set; and comparing the position coordinates of the human body frame corresponding to each group of regression offset parameters with a preset human body frame respectively, and determining the final position coordinates of the human body frame, wherein the final position coordinates of the human body frame are the position coordinates of the human body frame closest to the preset human body frame.
In some embodiments, determining the position coordinates of the human body frame corresponding to the human body image from the image to be detected including the human body image includes: detecting the image to be detected by using a trained deep network model to obtain the position coordinates of the human body frame corresponding to the human body image in the image to be detected, wherein the deep network model is obtained by training while changing the evaluation threshold in the evaluation threshold vector of the frame regression loss function.
In some embodiments, the deep network model is trained based on the following steps: obtaining a training sample set, wherein the training samples in the training sample set include images to be detected of various categories; and, using a deep learning method, taking the images to be detected of various categories included in the training samples of the training sample set as the input of the detection network, outputting the position coordinates of the human body frames of the human body images in the input images to be detected of various categories together with the learned evaluation threshold vector, and training to obtain the deep network model, wherein the learning target of the evaluation threshold vector is to make the output position coordinates of the human body frame approach the position coordinates of the real human body frame.
In a second aspect, the present application provides an apparatus for detecting a position of a human body in an image, the apparatus comprising: a human body position determination unit configured to determine position coordinates of a human body frame corresponding to a human body image from an image to be detected including the human body image; the regression offset calculation unit is configured to perform multiple regression calculation on the position coordinates of a human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function based on the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector; the offset parameter analysis unit is configured to perform multi-dimensional comparison analysis on each set of regression offset parameters in the regression offset parameter set to determine an optimal regression offset parameter set; and a first position generation unit configured to generate final position coordinates of the human body frame based on the optimal regression offset parameter group.
In some embodiments, a regression offset calculation unit, comprises: the first offset calculation module is configured to perform cascade regression calculation on the position coordinates of a human body frame corresponding to a human body image in an image to be detected by using a frame regression loss function based on each evaluation threshold in the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector, wherein the evaluation threshold vector is an ordered numerical value set of the evaluation thresholds representing the image overlapping degree IOU, and numerical values in the ordered numerical value set gradually increase from front to back.
In some embodiments, a regression offset calculation unit, comprises: the second offset calculation module is configured to perform cascade regression calculation on the position coordinates of a human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function based on the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector; the evaluation threshold vector is used for representing an initial evaluation threshold and a cascade step required by cascade regression calculation, and each regression calculation in the cascade regression calculation is performed based on a regression offset parameter group obtained by the last regression calculation.
In some embodiments, the apparatus further comprises: the human body position calculation unit is configured to obtain position coordinates of a human body frame corresponding to each group of regression offset parameters based on each group of regression offset parameters in the regression offset parameter group set; and the second position generation unit is configured to compare the position coordinates of the human body frame corresponding to each set of regression offset parameters with the preset human body frame respectively and determine the final position coordinates of the human body frame, wherein the final position coordinates of the human body frame are the position coordinates representing the closest human body frame to the preset human body frame.
In some embodiments, the human body position determination unit includes: and the human body position detection module is configured to detect the image to be detected by using the trained depth network model to obtain the position coordinates of a human body frame corresponding to the human body image in the image to be detected, wherein the depth network model is obtained by changing the evaluation threshold value in the evaluation threshold value vector in the frame regression loss function through training.
In some embodiments, the deep network model of the human position detection module is trained based on the following modules: a sample acquisition module configured to acquire a training sample set, wherein training samples in the training sample set include: various categories of images to be detected; and the sample training module is configured to utilize a deep learning method to take the images to be detected of various categories included in the training sample set training samples as the input of the detection network, output the position coordinates of the human body frame of the human body images in the input images to be detected of various categories and the learned evaluation threshold vector, and train to obtain a deep network model, wherein the learning target of the evaluation threshold vector is to enable the output position coordinates of the human body frame to approximate to the position coordinates of a real human body frame.
In a third aspect, the present application provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
In a fourth aspect, the present application provides a non-transitory computer readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method as described in any one of the implementation manners of the first aspect.
According to the technology of the application, the position coordinates of a human body frame corresponding to a human body image are determined from an image to be detected that includes the human body image; based on a learned evaluation threshold vector, multiple regression calculations are performed on these position coordinates by using a frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector; multidimensional comparison analysis is performed on each group of regression offset parameters in the set to determine an optimal regression offset parameter group; and the final position coordinates of the human body frame are generated based on the optimal group. This optimizes the human body detection algorithm and solves the problem in the prior art that the position coordinates of the human body frame cannot be predicted accurately because the frame regression loss function is used for only a single regression calculation, thereby improving the detection accuracy of the position coordinates of the human body frame and making the detection result of the human body image more accurate.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application.
FIG. 1 is a schematic diagram of a first embodiment of a method for detecting a position of a human body in an image according to the present application;
FIG. 2 is a scene diagram of a method for detecting a position of a human body in an image, in which an embodiment of the present application may be implemented;
FIG. 3 is a schematic diagram of the foreground interactive interface corresponding to the background execution of the method for detecting a position of a human body in an image of the present application;
FIG. 4 is a schematic diagram of a second embodiment of a method for detecting a position of a human body in an image according to the present application;
FIG. 5 is a schematic block diagram of one embodiment of an apparatus for detecting a position of a human body in an image according to the present application;
FIG. 6 is a block diagram of an electronic device for implementing the method for detecting the position of a human body in an image according to the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. For clarity and conciseness, descriptions of well-known functions and constructions are omitted from the following description.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows a schematic diagram 100 of a first embodiment of a method for detecting a position of a person in an image according to the application. The method for detecting the position of the human body in the image comprises the following steps:
Step 101, determining the position coordinates of a human body frame corresponding to the human body image from an image to be detected including the human body image.
In this embodiment, the executing body determines, for the human body image in the image to be detected, the position coordinates of the human body frame corresponding to the human body image according to a preset frame position algorithm.
Step 102, based on the learned evaluation threshold vector, performing multiple regression calculations on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector.
In this embodiment, the executing body performs regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected in sequence, once for each evaluation threshold in the learned evaluation threshold vector, by using a frame regression loss function, so as to obtain a regression offset parameter set composed of the regression calculation results. The parameters in each regression offset group may include an offset of the coordinates of the center point of the body frame and a scaling ratio of the body frame.
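For illustration only, the following minimal sketch (not part of the original application) shows how a regression offset group of this kind, a center-point shift (dx, dy) plus width/height scaling factors (dw, dh), could be applied to a human body frame given as (x1, y1, x2, y2). The function name and the exact delta parameterization are assumptions modeled on common detector conventions, not the application's own formulation.

```python
import math

def apply_offsets(box, dx, dy, dw, dh):
    """Apply one regression offset group (center shift + log-scale factors) to a box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # shift the center by a fraction of the current size, then rescale width and height
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)

if __name__ == "__main__":
    # refine a hypothetical human body frame by a small offset group
    print(apply_offsets((100.0, 50.0, 180.0, 250.0), 0.05, -0.02, 0.10, 0.00))
```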
Step 103, performing multidimensional comparison analysis on each group of regression offset parameters in the regression offset parameter set to determine the optimal regression offset parameter group.
In this embodiment, multidimensional comparison analysis is performed on each group of regression offset parameters obtained by the regression calculations, and the optimal regression offset parameter group is determined. The multidimensional comparison may include comparing each group of regression offset parameters with the corresponding parameters of a standard frame, or comparing different groups of regression offset parameters with one another.
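As one possible reading (the application does not specify the comparison procedure), the sketch below scores each offset group dimension by dimension against the offsets that would map the initial frame onto a standard frame, and keeps the group with the smallest overall deviation. The squared-deviation measure and the function names are illustrative assumptions.

```python
def select_optimal_offsets(offset_groups, standard_offsets):
    """Pick the offset group whose (dx, dy, dw, dh) deviates least from the standard one."""
    def deviation(group):
        # per-dimension comparison aggregated into a single score
        return sum((g - s) ** 2 for g, s in zip(group, standard_offsets))
    return min(offset_groups, key=deviation)

if __name__ == "__main__":
    groups = [(0.10, -0.05, 0.20, 0.00),
              (0.02, 0.01, 0.05, 0.03),
              (0.30, 0.25, 0.40, 0.10)]
    # a standard frame identical to the initial frame corresponds to all-zero offsets
    print(select_optimal_offsets(groups, standard_offsets=(0.0, 0.0, 0.0, 0.0)))
```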
Step 104, generating the final position coordinates of the human body frame based on the optimal regression offset parameter group.
In this embodiment, the position coordinates of the human frame corresponding to the optimal regression offset parameter set are used as the final position coordinates of the human frame.
It should be noted that the above-mentioned border regression loss function is a well-known technology widely studied and applied at present, and is not described herein again.
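The application does not state which frame regression loss it uses. As background only, the sketch below shows the smooth L1 loss, one widely used bounding-box regression loss in detection networks; it illustrates the kind of function referred to and is not necessarily the loss of this application.

```python
def smooth_l1(pred_deltas, target_deltas, beta=1.0):
    """Smooth L1 loss between predicted and target box offsets (dx, dy, dw, dh)."""
    total = 0.0
    for p, t in zip(pred_deltas, target_deltas):
        diff = abs(p - t)
        # quadratic near zero, linear for large errors
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total

if __name__ == "__main__":
    print(smooth_l1([0.10, -0.05, 0.20, 0.00], [0.12, 0.00, 0.15, 0.02]))
```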
With continued reference to fig. 2, the method 200 for detecting the position of the human body in the image of the present embodiment is executed in the electronic device 201. In the field of security monitoring, the monitoring electronic device 201 determines the position coordinates 202 of a human body frame corresponding to a human body image from an image to be detected that includes the human body image. Based on a learned evaluation threshold vector, it performs multiple regression calculations on these position coordinates by using a frame regression loss function to obtain a regression offset parameter set 203 corresponding to the evaluation threshold vector, and generates the final position coordinates 204 of the human body frame based on the optimal regression offset parameter group. Based on the final position coordinates of the human body frame, it determines whether a person has entered the monitored area 205 and sends the monitoring result to the monitored person by voice or text; the information received by that person is shown in fig. 3.
The method for detecting a human body position in an image according to the embodiments of the present application determines the position coordinates of a human body frame corresponding to a human body image from an image to be detected that includes the human body image; performs multiple regression calculations on these position coordinates by using a frame regression loss function based on a learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector; performs multidimensional comparison analysis on each group of regression offset parameters in the set to determine an optimal regression offset parameter group; and generates the final position coordinates of the human body frame based on the optimal group. This optimizes the human body detection algorithm, solves the problem in the prior art that the position coordinates of the human body frame cannot be predicted accurately because the frame regression loss function is used for only a single regression calculation, improves the accuracy of detecting the position coordinates of the human body frame, and makes the detection result of the human body image more accurate.
With further reference to fig. 4, a schematic diagram 400 of a second embodiment of a method for detecting a position of a person in an image is shown. The process of the method comprises the following steps:
Step 401, detecting the image to be detected by using a trained deep network model to obtain the position coordinates of the human body frame corresponding to the human body image in the image to be detected.
In this embodiment, the image to be detected is input to a detection network based on a deep network model to obtain the position coordinates of the human body frame corresponding to the human body image in the image to be detected; the deep network model is obtained by training while changing the evaluation threshold in the evaluation threshold vector of the frame regression loss function. Detecting the position coordinates of the human body frame with deep learning makes the detection result more accurate.
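Purely as an illustration of this step (the application does not disclose its network architecture), the sketch below uses torchvision's pretrained Faster R-CNN as a stand-in detector to obtain candidate person-frame coordinates from an image; the file name is hypothetical and torchvision 0.13 or later is assumed.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Stand-in for the trained deep network model described in the application.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("to_be_detected.jpg").convert("RGB"))
with torch.no_grad():
    output = model([image])[0]

# Keep boxes predicted as the COCO "person" class (label 1) with reasonable scores.
person_boxes = [
    box.tolist()
    for box, label, score in zip(output["boxes"], output["labels"], output["scores"])
    if label.item() == 1 and score.item() > 0.5
]
print(person_boxes)  # each entry is (x1, y1, x2, y2) in pixel coordinates
```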
In some optional implementations of this embodiment, the deep network model is trained based on the following steps: obtaining a training sample set, wherein the training samples in the training sample set include images to be detected of various categories; and, using a deep learning method, taking the images to be detected of various categories included in the training samples of the training sample set as the input of the detection network, outputting the position coordinates of the human body frames of the human body images in the input images together with the learned evaluation threshold vector, and training to obtain the deep network model, wherein the learning target of the evaluation threshold vector is to make the output position coordinates of the human body frame approach the position coordinates of the real human body frame. Learning the evaluation threshold vector with the deep model avoids judging whether the predicted position coordinates of a pedestrian frame are wrong based only on a fixed evaluation threshold of 0.5 during model training, so the evaluation threshold vector is more accurate and the detection of the position coordinates of the human body frame is correspondingly more accurate.
Step 402, based on each evaluation threshold in the learned evaluation threshold vector, performing cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector.
In this embodiment, the regression offset parameter set corresponding to the evaluation threshold vector is obtained by successively changing the evaluation threshold of the frame regression loss function according to the evaluation threshold vector and performing cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected. The evaluation threshold vector is an ordered set of evaluation thresholds representing the image overlap degree (IOU), and the values in the ordered set increase from front to back. Applying the cascade regression operation yields a regression offset parameter set that is closer to the position coordinates of the real human body frame and improves the efficiency of the regression operation.
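The following sketch illustrates the cascade idea under stated assumptions: an increasing IoU-threshold vector drives successive regression stages, each refining the frame produced by the previous stage and contributing one offset group to the set. The per-stage regressors are placeholders; in a real system each stage would be a regression head trained at its own threshold, which the application does not detail.

```python
def refine_once(box, dx, dy, dw, dh):
    """Apply one offset group to an (x1, y1, x2, y2) frame (simple linear scaling)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w + dx * w, y1 + 0.5 * h + dy * h
    w, h = w * (1.0 + dw), h * (1.0 + dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

def cascade_refine(initial_box, threshold_vector, stage_regressors):
    """Run one regression stage per IoU threshold, feeding each result to the next stage."""
    box, offset_set = initial_box, []
    for iou_threshold, regressor in zip(threshold_vector, stage_regressors):
        offsets = regressor(box, iou_threshold)   # one regression calculation
        offset_set.append(offsets)                # collect the offset group
        box = refine_once(box, *offsets)          # basis for the next calculation
    return box, offset_set

if __name__ == "__main__":
    dummy = lambda box, thr: (0.02, -0.01, 0.05, 0.03)   # placeholder regression head
    final_box, offsets = cascade_refine(
        (100.0, 50.0, 180.0, 250.0),
        threshold_vector=(0.5, 0.6, 0.7),                # ordered, increasing IoU thresholds
        stage_regressors=[dummy] * 3,
    )
    print(final_box, offsets)
```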
In some optional implementation manners of this embodiment, based on the learned evaluation threshold vector, performing multiple regression calculations on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function, to obtain a regression offset parameter set corresponding to the evaluation threshold vector, including: based on the evaluation threshold vector obtained by learning, performing cascade regression calculation on the position coordinates of a human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector; the evaluation threshold vector is used for representing an initial evaluation threshold and a cascade step required by cascade regression calculation, and each regression calculation in the cascade regression calculation is performed based on a regression offset parameter group obtained by the last regression calculation. The regression offset parameter set which is closer to the position coordinate of the real human body frame is obtained by applying the cascade regression operation, and the regression operation efficiency is improved.
Step 403, obtaining the position coordinates of the human body frame corresponding to each group of regression offset parameters based on each group of regression offset parameters in the regression offset parameter set.
In some embodiments, the position coordinates of the human body frame corresponding to each set of regression offset parameters are obtained based on each set of regression offset parameters in the set of regression offset parameters, and the parameters in the set of regression offset parameters include an offset of coordinates of a center point of the human body frame and a scaling ratio of the human body frame.
Step 404, comparing the position coordinates of the human body frame corresponding to each group of regression offset parameters with a preset human body frame respectively, and determining the final position coordinates of the human body frame.
In this embodiment, the position coordinates of the human body frame corresponding to each group of regression offset parameters are compared with the preset human body frame respectively, the final position coordinates of the human body frame are determined, and the cascade regression calculation is thereby verified again. The final position coordinates of the human body frame are the position coordinates of the human body frame closest to the preset human body frame, and the preset human body frame can be set manually based on the position coordinates of the real human body frame. Verifying the cascade regression calculation improves the detection accuracy of the position coordinates of the human body frame.
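As an illustration of this verification step, the sketch below compares each candidate frame with a preset reference frame and keeps the closest one. Using IoU as the closeness measure is an assumption; the application only states that the frame closest to the preset frame is kept.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) frames."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def closest_to_preset(candidate_frames, preset_frame):
    """Return the candidate frame with the highest overlap with the preset frame."""
    return max(candidate_frames, key=lambda frame: iou(frame, preset_frame))

if __name__ == "__main__":
    candidates = [(98, 48, 182, 252), (90, 40, 200, 260), (110, 60, 170, 240)]
    print(closest_to_preset(candidates, preset_frame=(100, 50, 180, 250)))
```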
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 1, the method for detecting a human body position in an image shown in the schematic diagram 400 of this embodiment learns the evaluation threshold vector with a deep model, which avoids judging whether the predicted position coordinates of a pedestrian frame are wrong based only on a fixed evaluation threshold of 0.5 during model training; the evaluation threshold vector is therefore more accurate, and the detection of the position coordinates of the human body frame is correspondingly more accurate. In addition, the cascade regression operation yields a regression offset parameter set closer to the position coordinates of the real human body frame, and the cascade regression result is then verified, which further improves the accuracy of detecting the position coordinates of the human body frame.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for detecting a position of a human body in an image, which corresponds to the embodiment of the method shown in fig. 1, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for detecting a position of a human body in an image according to the present embodiment includes: a human body position determination unit 501, a regression offset calculation unit 502, an offset parameter analysis unit 503, and a first position generation unit 504. The human body position determining unit is configured to determine position coordinates of a human body frame corresponding to a human body image from an image to be detected comprising the human body image; the regression offset calculation unit is configured to perform multiple regression calculation on the position coordinates of a human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function based on the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector; the offset parameter analysis unit is configured to perform multi-dimensional comparison analysis on each set of regression offset parameters in the regression offset parameter set to determine an optimal regression offset parameter set; and a first position generation unit configured to generate final position coordinates of the human body frame based on the optimal regression offset parameter group.
In this embodiment, specific processes of the human body position determining unit 501, the regression offset calculating unit 502, the offset parameter analyzing unit 503 and the first position generating unit 504 of the apparatus 500 for detecting a human body position in an image and technical effects thereof may respectively refer to the related descriptions of step 101 to step 104 in the embodiment corresponding to fig. 1, and are not repeated herein.
In some optional implementations of this embodiment, the regression offset calculation unit includes: the first offset calculation module is configured to perform cascade regression calculation on the position coordinates of a human body frame corresponding to a human body image in an image to be detected by using a frame regression loss function based on each evaluation threshold in the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector, wherein the evaluation threshold vector is an ordered numerical value set of the evaluation thresholds representing the image overlapping degree IOU, and numerical values in the ordered numerical value set gradually increase from front to back.
In some optional implementations of this embodiment, the regression offset calculation unit includes: the second offset calculation module is configured to perform cascade regression calculation on the position coordinates of a human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function based on the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector; the evaluation threshold vector is used for representing an initial evaluation threshold and a cascade step required by cascade regression calculation, and each regression calculation in the cascade regression calculation is performed based on a regression offset parameter group obtained by the last regression calculation.
In some optional implementations of this embodiment, the apparatus further includes: the human body position calculation unit is configured to obtain position coordinates of a human body frame corresponding to each group of regression offset parameters based on each group of regression offset parameters in the regression offset parameter group set; and the second position generation unit is configured to compare the position coordinates of the human body frame corresponding to each set of regression offset parameters with the preset human body frame respectively and determine the final position coordinates of the human body frame, wherein the final position coordinates of the human body frame are the position coordinates representing the closest human body frame to the preset human body frame.
In some optional implementations of this embodiment, the human body position determination unit includes: and the human body position detection module is configured to detect the image to be detected by using the trained depth network model to obtain the position coordinates of a human body frame corresponding to the human body image in the image to be detected, wherein the depth network model is obtained by changing the evaluation threshold value in the evaluation threshold value vector in the frame regression loss function through training.
In some optional implementations of this embodiment, the depth network model of the human body position detection module is trained based on the following modules: a sample acquisition module configured to acquire a training sample set, wherein training samples in the training sample set include: various categories of images to be detected; and the sample training module is configured to utilize a deep learning method to take the images to be detected of various categories included in the training sample set training samples as the input of the detection network, output the position coordinates of the human body frame of the human body images in the input images to be detected of various categories and the learned evaluation threshold vector, and train to obtain a deep network model, wherein the learning target of the evaluation threshold vector is to enable the output position coordinates of the human body frame to approximate to the position coordinates of a real human body frame.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 6, it is a block diagram of an electronic device for detecting a position of a human body in an image according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors 601, memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other implementations, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method for detecting a position of a human body in an image provided herein. A non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method for detecting a position of a human body in an image provided by the present application.
The memory 602, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the method for detecting a human body position in an image in the embodiments of the present application (for example, the human body position determination unit 501, the regression offset calculation unit 502, the offset parameter analysis unit 503, and the first position generation unit 504 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 602, namely, implements the method for detecting the position of the human body in the image in the above method embodiment.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of an electronic device for detecting a position of a human body in an image, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, and these remote memories may be connected over a network to an electronic device for detecting the position of the person in the image. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method of detecting a position of a human body in an image may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus for detecting a position of a human body in an image, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiments of the application, the position coordinates of a human body frame corresponding to a human body image are determined from an image to be detected that includes the human body image; based on a learned evaluation threshold vector, multiple regression calculations are performed on these position coordinates by using a frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector; multidimensional comparison analysis is performed on each group of regression offset parameters in the set to determine an optimal regression offset parameter group; and the final position coordinates of the human body frame are generated based on the optimal group. This optimizes the human body detection algorithm and solves the problem in the prior art that the position coordinates of the human body frame cannot be predicted accurately because the frame regression loss function is used for only a single regression calculation, thereby improving the accuracy of detecting the position coordinates of the human body frame and making the detection result of the human body image more accurate.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, which is not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (14)
1. A method for detecting a position of a human body in an image, the method comprising:
determining the position coordinates of a human body frame corresponding to a human body image from an image to be detected comprising the human body image;
performing multiple regression calculation on the position coordinates of a human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function based on the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector;
performing multidimensional comparison analysis on each group of regression offset parameters in the regression offset parameter set to determine an optimal regression offset parameter group;
and generating the final position coordinates of the human body frame based on the optimal regression offset parameter group.
2. The method according to claim 1, wherein the performing multiple regression calculation on the position coordinates of a human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function based on the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector comprises:
based on each evaluation threshold in the learned evaluation threshold vector, performing cascade regression calculation on the position coordinates of a human body frame corresponding to the human body image in the image to be detected by using the frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector, wherein the evaluation threshold vector is an ordered numerical value set of the evaluation threshold representing the image overlapping degree IOU, and the numerical values in the ordered numerical value set gradually increase from front to back.
3. The method according to claim 1, wherein the performing multiple regression calculations on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function based on the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector comprises:
based on an evaluation threshold vector obtained by learning, performing cascade regression calculation on the position coordinates of a human body frame corresponding to the human body image in the image to be detected by using the frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector;
the evaluation threshold vector is used for representing an initial evaluation threshold and a cascade step required by cascade regression calculation, and each regression calculation in the cascade regression calculation is performed based on a regression offset parameter set obtained by the last regression calculation.
4. The method of claim 1, further comprising:
obtaining the position coordinates of the human body frame corresponding to each group of regression offset parameters based on each group of regression offset parameters in the regression offset parameter group set;
and comparing the position coordinates of the human body frame corresponding to each group of regression offset parameters with a preset human body frame respectively, and determining the final position coordinates of the human body frame, wherein the final position coordinates of the human body frame are the position coordinates of the human body frame closest to the preset human body frame.
5. The method according to claim 1, wherein the determining the position coordinates of the body frame corresponding to the body image from the image to be detected including the body image comprises:
and detecting the image to be detected by using the trained deep network model to obtain the position coordinates of a human body frame corresponding to the human body image in the image to be detected, wherein the deep network model is obtained by training while changing the evaluation threshold in the evaluation threshold vector of the frame regression loss function.
6. The method of claim 5, wherein the deep network model is trained based on:
obtaining a training sample set, wherein training samples in the training sample set comprise: various categories of images to be detected;
and by utilizing a deep learning method, taking the images to be detected of various categories included in the training samples of the training sample set as the input of a detection network, outputting the position coordinates of the human body frames of the human body images in the input images to be detected of various categories and the evaluation threshold vector obtained by learning, and training to obtain a deep network model, wherein the learning target of the evaluation threshold vector is to make the output position coordinates of the human body frame approximate the position coordinates of a real human body frame.
7. An apparatus for detecting a position of a human body in an image, the apparatus comprising:
the human body position determining unit is configured to determine position coordinates of a human body frame corresponding to a human body image from an image to be detected comprising the human body image;
the regression offset calculation unit is configured to perform multiple regression calculation on the position coordinates of a human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function based on the learned evaluation threshold vector to obtain a regression offset parameter set corresponding to the evaluation threshold vector;
an offset parameter analysis unit configured to perform multidimensional comparison analysis on each group of regression offset parameters in the regression offset parameter set to determine an optimal regression offset parameter group;
and a first position generation unit configured to generate final position coordinates of the human body frame based on the optimal regression offset parameter group.
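Claim 7 maps the method onto four cooperating units. The hypothetical Python sketch below shows one way such units could be wired into a single detection pipeline; the protocol names and method signatures are invented for illustration and do not appear in the patent.

```python
from typing import Any, List, Protocol, Sequence, Tuple

Box = Tuple[float, float, float, float]
OffsetGroup = Sequence[float]

class PositionDeterminer(Protocol):
    def determine(self, image: Any) -> Box: ...

class OffsetCalculator(Protocol):
    def calculate(self, image: Any, box: Box) -> List[OffsetGroup]: ...

class OffsetAnalyzer(Protocol):
    def analyze(self, groups: List[OffsetGroup]) -> OffsetGroup: ...

class PositionGenerator(Protocol):
    def generate(self, box: Box, group: OffsetGroup) -> Box: ...

class HumanDetectionApparatus:
    """Wires the four units recited in claim 7 into one detection pipeline."""

    def __init__(self, determiner: PositionDeterminer, calculator: OffsetCalculator,
                 analyzer: OffsetAnalyzer, generator: PositionGenerator) -> None:
        self.determiner = determiner  # human body position determining unit
        self.calculator = calculator  # regression offset calculation unit
        self.analyzer = analyzer      # offset parameter analysis unit
        self.generator = generator    # first position generation unit

    def detect(self, image: Any) -> Box:
        box = self.determiner.determine(image)           # initial frame coordinates
        groups = self.calculator.calculate(image, box)   # multiple regression passes
        best = self.analyzer.analyze(groups)             # multidimensional comparison
        return self.generator.generate(box, best)        # final frame coordinates
```

Any concrete detector that satisfies the four protocols can be dropped into `HumanDetectionApparatus` unchanged, which is the practical appeal of this unit decomposition.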
8. The apparatus of claim 7, wherein the regression offset calculation unit comprises:
the first offset calculation module is configured to perform, based on each evaluation threshold in the learned evaluation threshold vector, cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using the frame regression loss function, so as to obtain the regression offset parameter set corresponding to the evaluation threshold vector, wherein the evaluation threshold vector is an ordered numerical set of evaluation thresholds representing the image overlap degree (intersection over union, IoU), and the numerical values in the ordered set increase from front to back.
9. The apparatus of claim 7, wherein the regression offset calculation unit comprises:
the second offset calculation module is configured to perform, based on the learned evaluation threshold vector, cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using the frame regression loss function, to obtain the regression offset parameter set corresponding to the evaluation threshold vector; wherein the evaluation threshold vector represents an initial evaluation threshold and a cascade step size required for the cascade regression calculation, and each regression calculation in the cascade regression calculation is performed based on the regression offset parameter set obtained from the previous regression calculation.
10. The apparatus of claim 7, wherein the apparatus further comprises:
the human body position calculation unit is configured to obtain position coordinates of a human body frame corresponding to each group of regression offset parameters based on each group of regression offset parameters in the regression offset parameter group set;
and the second position generation unit is configured to compare the position coordinates of the human body frame corresponding to each group of regression offset parameters with a preset human body frame respectively and determine the final position coordinates of the human body frame, wherein the final position coordinates of the human body frame are the position coordinates of the human body frame closest to the preset human body frame.
11. The apparatus of claim 7, wherein the human body position determination unit comprises:
and the human body position detection module is configured to detect the image to be detected by using a trained deep network model to obtain the position coordinates of the human body frame corresponding to the human body image in the image to be detected, wherein the deep network model is obtained through training in which the evaluation thresholds in the evaluation threshold vector of the frame regression loss function are adjusted.
12. The apparatus of claim 11, wherein the deep network model of the human body position detection module is trained by means of:
a sample acquisition module configured to acquire a set of training samples, wherein training samples in the set of training samples comprise: various categories of images to be detected;
and the sample training module is configured to use a deep learning method to take the images to be detected of various categories included in the training samples of the training sample set as the input of the detection network, output the position coordinates of the human body frame of the human body image in each input image to be detected together with the learned evaluation threshold vector, and obtain the deep network model through training, wherein the learning objective of the evaluation threshold vector is to make the output position coordinates of the human body frame approach the position coordinates of the real human body frame.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010321816.XA CN111523452B (en) | 2020-04-22 | 2020-04-22 | Method and device for detecting human body position in image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111523452A (en) | 2020-08-11 |
CN111523452B (en) | 2023-08-25 |
Family
ID=71910458
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010321816.XA Active CN111523452B (en) | 2020-04-22 | 2020-04-22 | Method and device for detecting human body position in image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111523452B (en) |
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150347831A1 (en) * | 2014-05-28 | 2015-12-03 | Denso Corporation | Detection device, detection program, detection method, vehicle equipped with detection device, parameter calculation device, parameter calculating parameters, parameter calculation program, and method of calculating parameters |
US20170344808A1 (en) * | 2016-05-28 | 2017-11-30 | Samsung Electronics Co., Ltd. | System and method for a unified architecture multi-task deep learning machine for object recognition |
CN109003267A (en) * | 2017-08-09 | 2018-12-14 | 深圳科亚医疗科技有限公司 | From the computer implemented method and system of the automatic detected target object of 3D rendering |
CN107492067A (en) * | 2017-09-07 | 2017-12-19 | 维沃移动通信有限公司 | A kind of image beautification method and mobile terminal |
CN109241914A (en) * | 2018-09-11 | 2019-01-18 | 广州广电银通金融电子科技有限公司 | A kind of Small object pedestrian detection method under complex scene |
CN109815814A (en) * | 2018-12-21 | 2019-05-28 | 天津大学 | A kind of method for detecting human face based on convolutional neural networks |
CN109886286A (en) * | 2019-01-03 | 2019-06-14 | 武汉精测电子集团股份有限公司 | Object detection method, target detection model and system based on cascade detectors |
CN109858481A (en) * | 2019-01-09 | 2019-06-07 | 杭州电子科技大学 | A kind of Ship Target Detection method based on the detection of cascade position sensitivity |
CN109961006A (en) * | 2019-01-30 | 2019-07-02 | 东华大学 | A kind of low pixel multiple target Face datection and crucial independent positioning method and alignment schemes |
CN110298226A (en) * | 2019-04-03 | 2019-10-01 | 复旦大学 | A kind of cascade detection method of millimeter-wave image human body belongings |
CN110059667A (en) * | 2019-04-28 | 2019-07-26 | 上海应用技术大学 | Pedestrian counting method |
CN110490052A (en) * | 2019-07-05 | 2019-11-22 | 山东大学 | Face datection and face character analysis method and system based on cascade multi-task learning |
CN110580445A (en) * | 2019-07-12 | 2019-12-17 | 西北工业大学 | Face key point detection method based on GIoU and weighted NMS improvement |
CN110705583A (en) * | 2019-08-15 | 2020-01-17 | 平安科技(深圳)有限公司 | Cell detection model training method and device, computer equipment and storage medium |
CN110992311A (en) * | 2019-11-13 | 2020-04-10 | 华南理工大学 | Convolutional neural network flaw detection method based on feature fusion |
CN111008576A (en) * | 2019-11-22 | 2020-04-14 | 高创安邦(北京)技术有限公司 | Pedestrian detection and model training and updating method, device and readable storage medium thereof |
Non-Patent Citations (6)
Title |
---|
NAOMI LYNN DICKERSON: "Refining Bounding-Box Regression for Object Localization", Dissertations and Theses, 28 September 2017 (2017-09-28), pages 1-56 * |
吴志洋 et al.: "Improved real-time face detection algorithm with multi-target regression" (改进的多目标回归实时人脸检测算法), 《计算机工程与应用》, vol. 54, no. 11, 31 May 2018 (2018-05-31), pages 1-7 * |
徐昕军 et al.: "Intelligent identification method for railway subgrade mud-pumping defects based on Cascade R-CNN" (基于Cascade R-CNN的铁路路基翻浆冒泥病害智能识别方法), 《铁道建设》, vol. 59, no. 12, 31 December 2019 (2019-12-31), pages 99-104 * |
Also Published As
Publication number | Publication date |
---|---|
CN111523452B (en) | 2023-08-25 |
Similar Documents
Publication | Title |
---|---|
CN110659600B (en) | Object detection method, device and equipment |
CN110795569B (en) | Method, device and equipment for generating vector representation of knowledge graph |
CN111539514A (en) | Method and apparatus for generating structure of neural network |
CN112270399B (en) | Operator registration processing method and device based on deep learning and electronic equipment |
CN111259671A (en) | Semantic description processing method, device and equipment for text entity |
CN111968203B (en) | Animation driving method, device, electronic equipment and storage medium |
CN112966742A (en) | Model training method, target detection method and device and electronic equipment |
CN111598164A (en) | Method and device for identifying attribute of target object, electronic equipment and storage medium |
CN111563541B (en) | Training method and device of image detection model |
CN111709288B (en) | Face key point detection method and device and electronic equipment |
CN110717933B (en) | Post-processing method, device, equipment and medium for moving object missed detection |
CN116228867B (en) | Pose determination method, pose determination device, electronic equipment and medium |
CN111783606A (en) | Training method, device, equipment and storage medium of face recognition network |
CN111626263A (en) | Video interesting area detection method, device, equipment and medium |
CN111241838A (en) | Text entity semantic relation processing method, device and equipment |
CN114267375A (en) | Phoneme detection method and device, training method and device, equipment and medium |
CN111666771A (en) | Semantic label extraction device, electronic equipment and readable storage medium of document |
CN113128436B (en) | Method and device for detecting key points |
CN114220163A (en) | Human body posture estimation method and device, electronic equipment and storage medium |
CN112270303A (en) | Image recognition method and device and electronic equipment |
CN112561059A (en) | Method and apparatus for model distillation |
CN111767990A (en) | Neural network processing method and device |
CN111783644A (en) | Detection method, device, equipment and computer storage medium |
CN111680599A (en) | Face recognition model processing method, device, equipment and storage medium |
CN111738325A (en) | Image recognition method, device, equipment and storage medium |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |