CN111523452B

CN111523452B - Method and device for detecting human body position in image

Info

Publication number: CN111523452B
Application number: CN202010321816.XA
Authority: CN
Inventors: 钟东宏; 袁宇辰; 孙昊
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-04-22
Filing date: 2020-04-22
Publication date: 2023-08-25
Anticipated expiration: 2040-04-22
Also published as: CN111523452A

Abstract

The application discloses a method and a device for detecting the position of a human body in an image, relating to the field of artificial intelligence. The method comprises: determining, from an image to be detected that contains a human body image, the position coordinates of the human body frame corresponding to the human body image; based on a learned evaluation threshold vector, performing multiple rounds of regression calculation on those position coordinates using a frame regression loss function, to obtain a set of regression offset parameter groups corresponding to the evaluation threshold vector; performing a multidimensional comparison analysis on each group of regression offset parameters in the set to determine an optimal regression offset parameter group; and generating the final position coordinates of the human body frame based on the optimal regression offset parameter group. The scheme improves the accuracy of the detected human body frame coordinates and, in turn, of the human body detection result.

Description

Method and device for detecting human body position in image

Technical Field

The embodiments of the application relate to the field of computer technology, in particular to artificial intelligence, and more specifically to a method and a device for detecting the position of a human body in an image.

Background

With the continuous development of the internet and artificial intelligence, more and more fields have adopted automated computation and analysis; human body detection for security surveillance is one of the important application scenarios.

In typical security-surveillance video, the human body frames produced by human body detection sometimes deviate from the actual target: although a frame overlaps the target, it does not fully cover the whole body. When such an inaccurate detection frame is used in subsequent tasks (such as classification and tracking), it often introduces further errors. If the detection frame covers the target human body more precisely, the performance of the whole service can therefore be greatly improved.

Disclosure of Invention

The application provides a method, a device, equipment and a storage medium for detecting the position of a human body in an image.

According to a first aspect, the present application provides a method for detecting the position of a human body in an image, the method comprising: determining, from an image to be detected that contains a human body image, the position coordinates of the human body frame corresponding to the human body image; based on a learned evaluation threshold vector, performing multiple rounds of regression calculation on those position coordinates using a frame regression loss function, to obtain a set of regression offset parameter groups corresponding to the evaluation threshold vector; performing a multidimensional comparison analysis on each group of regression offset parameters in the set to determine an optimal regression offset parameter group; and generating the final position coordinates of the human body frame based on the optimal regression offset parameter group.

In some embodiments, performing the multiple rounds of regression calculation based on the learned evaluation threshold vector includes: for each evaluation threshold in the learned evaluation threshold vector, performing cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected using the frame regression loss function, to obtain the set of regression offset parameter groups corresponding to the evaluation threshold vector. The evaluation threshold vector is an ordered set of evaluation thresholds on the intersection over union (IoU), whose values increase from front to back.

In some embodiments, performing the multiple rounds of regression calculation based on the learned evaluation threshold vector includes: performing cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected using the frame regression loss function, to obtain the set of regression offset parameter groups corresponding to the evaluation threshold vector. Here the evaluation threshold vector represents the initial evaluation threshold and the cascade step size required by the cascade regression calculation, and each round of the cascade regression is performed on the basis of the regression offset parameter group obtained in the previous round.

In some embodiments, the method further comprises: obtaining, for each group of regression offset parameters in the set, the position coordinates of the corresponding human body frame; and comparing the position coordinates of each of these human body frames with a preset human body frame to determine the final position coordinates of the human body frame, namely the position coordinates closest to the preset human body frame.

In some embodiments, determining the position coordinates of the human body frame corresponding to the human body image from the image to be detected includes: detecting the image to be detected with a trained deep network model to obtain those position coordinates, where the deep network model is trained by varying the evaluation thresholds of the evaluation threshold vector in the frame regression loss function.

In some embodiments, the deep network model is trained as follows: obtaining a training sample set whose training samples comprise images to be detected of various categories; and, using a deep learning method, taking these images as the input of a detection network and outputting both the position coordinates of the human body frames of the human body images in the input images and the learned evaluation threshold vector, to obtain the deep network model. The learning objective for the evaluation threshold vector is to make the output human body frame coordinates approach the coordinates of the real human body frames.

In a second aspect, the present application provides an apparatus for detecting the position of a human body in an image, the apparatus comprising: a human body position determining unit configured to determine, from an image to be detected that contains a human body image, the position coordinates of the human body frame corresponding to the human body image; a regression offset calculation unit configured to perform, based on the learned evaluation threshold vector, multiple rounds of regression calculation on those position coordinates using the frame regression loss function, to obtain a set of regression offset parameter groups corresponding to the evaluation threshold vector; an offset parameter analysis unit configured to perform a multidimensional comparison analysis on each group of regression offset parameters in the set to determine an optimal regression offset parameter group; and a first position generating unit configured to generate the final position coordinates of the human body frame based on the optimal regression offset parameter group.

In some embodiments, the regression offset calculation unit includes a first offset calculation module configured to perform, for each evaluation threshold in the learned evaluation threshold vector, cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected using the frame regression loss function, to obtain the set of regression offset parameter groups corresponding to the evaluation threshold vector. The evaluation threshold vector is an ordered set of IoU evaluation thresholds whose values increase from front to back.

In some embodiments, the regression offset calculation unit includes a second offset calculation module configured to perform cascade regression calculation, based on the learned evaluation threshold vector, on the position coordinates of the human body frame corresponding to the human body image in the image to be detected using the frame regression loss function, to obtain the set of regression offset parameter groups corresponding to the evaluation threshold vector. The evaluation threshold vector represents the initial evaluation threshold and the cascade step size required by the cascade regression calculation, and each round of the cascade regression is performed on the basis of the regression offset parameter group obtained in the previous round.

In some embodiments, the apparatus further comprises: a human body position calculation unit configured to obtain, for each group of regression offset parameters in the set, the position coordinates of the corresponding human body frame; and a second position generating unit configured to compare the position coordinates of each of these human body frames with a preset human body frame and determine the final position coordinates of the human body frame, namely the position coordinates closest to the preset human body frame.

In some embodiments, the human body position determining unit includes a human body position detection module configured to detect the image to be detected with a trained deep network model to obtain the position coordinates of the human body frame corresponding to the human body image, where the deep network model is trained by varying the evaluation thresholds of the evaluation threshold vector in the frame regression loss function.

In some embodiments, the deep network model of the human body position detection module is trained using the following modules: a sample acquisition module configured to acquire a training sample set whose training samples comprise images to be detected of various categories; and a sample training module configured to use a deep learning method, taking these images as the input of a detection network and outputting the position coordinates of the human body frames of the human body images in the input images together with the learned evaluation threshold vector, to train the deep network model. The learning objective for the evaluation threshold vector is to make the output human body frame coordinates approach the coordinates of the real human body frames.

In a third aspect, the present application provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.

In a fourth aspect, the application provides a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as described in any of the implementations of the first aspect.

With this technology, the position coordinates of the human body frame corresponding to the human body image are first determined from the image to be detected. Based on the learned evaluation threshold vector, multiple rounds of regression calculation are then performed on those coordinates using the frame regression loss function, yielding a set of regression offset parameter groups corresponding to the evaluation threshold vector. A multidimensional comparison analysis of each group in the set determines the optimal regression offset parameter group, from which the final position coordinates of the human body frame are generated. This optimizes the human body detection algorithm and solves the prior-art problem that human body frame coordinates are predicted inaccurately because the frame regression is performed only once, thereby improving the accuracy of the detected human body frame coordinates and of the human body detection result.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.

Drawings

The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application.

FIG. 1 is a schematic diagram of a first embodiment of a method for detecting a position of a person in an image according to the present application;

FIG. 2 is a diagram of a scenario in which the method for detecting the position of a person in an image according to embodiments of the application may be implemented;

FIG. 3 is a schematic diagram of the front-end interactive interface corresponding to the back end that performs the method for detecting the position of a person in an image according to the application;

FIG. 4 is a schematic diagram of a second embodiment of a method for detecting a position of a person in an image according to the present application;

FIG. 5 is a schematic structural view of one embodiment of an apparatus for detecting a position of a human body in an image according to the present application;

FIG. 6 is a block diagram of an electronic device for implementing the method for detecting the position of a person in an image according to an embodiment of the application.

Detailed Description

Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.

Fig. 1 shows a schematic diagram 100 of a first embodiment of a method for detecting a position of a person in an image according to the application. The method for detecting the position of the human body in the image comprises the following steps:

Step 101, determining, from an image to be detected that contains a human body image, the position coordinates of the human body frame corresponding to the human body image.

In this embodiment, the executing body determines the position coordinates of the human body frame corresponding to the human body image in the image to be detected according to a preset frame position algorithm.

Step 102, based on the learned evaluation threshold vector, performing multiple rounds of regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected using the frame regression loss function, to obtain a set of regression offset parameter groups corresponding to the evaluation threshold vector.

In this embodiment, the executing body uses the frame regression loss function to perform regression calculation on those position coordinates once for each evaluation threshold in the learned evaluation threshold vector, in order, and collects the results into a set of regression offset parameter groups. The parameters in each group may include an offset of the center-point coordinates of the human body frame and a scaling ratio of the human body frame.
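The text does not fix how the center-point offset and scaling ratio are encoded. A minimal Python sketch, assuming a center-form frame (cx, cy, w, h), center offsets expressed as fractions of the frame width and height, and plain multiplicative scaling ratios (these layouts and the names `Box`, `RegressionOffsets` and `apply_offsets` are illustrative assumptions), shows how one group of regression offset parameters turns a detected frame into a refined one:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Human body frame in center form: center (cx, cy) and size (w, h), in pixels."""
    cx: float
    cy: float
    w: float
    h: float


@dataclass(frozen=True)
class RegressionOffsets:
    """One group of regression offset parameters (hypothetical layout).

    dx, dy: center-point offsets as fractions of the frame width and height.
    sw, sh: scaling ratios applied to the frame width and height.
    """
    dx: float
    dy: float
    sw: float
    sh: float


def apply_offsets(box: Box, p: RegressionOffsets) -> Box:
    """Shift the center and rescale the size of `box` using one offset group."""
    if p.sw <= 0 or p.sh <= 0:
        raise ValueError("scaling ratios must be positive")
    return Box(
        cx=box.cx + p.dx * box.w,
        cy=box.cy + p.dy * box.h,
        w=box.w * p.sw,
        h=box.h * p.sh,
    )


def to_corners(box: Box) -> tuple[float, float, float, float]:
    """Convert center form to corner form (x1, y1, x2, y2)."""
    return (box.cx - box.w / 2, box.cy - box.h / 2,
            box.cx + box.w / 2, box.cy + box.h / 2)


# Usage: refine a detected 40x100 frame centered at (100, 200).
refined = apply_offsets(Box(100.0, 200.0, 40.0, 100.0),
                        RegressionOffsets(dx=0.1, dy=-0.05, sw=1.2, sh=1.1))
```

Many detectors instead predict log-scale factors and exponentiate them; the multiplicative form is used here only because the text speaks of a scaling ratio.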

Step 103, performing a multidimensional comparison analysis on each group of regression offset parameters in the set to determine the optimal regression offset parameter group.

In this embodiment, a multidimensional comparison analysis is performed on each group of regression offset parameters obtained by the regression calculation to determine the optimal group. The multidimensional comparison may compare each group separately against the corresponding parameters of a standard frame, or compare the different groups with one another.
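One possible reading of the comparison against a standard frame is sketched below in Python. It assumes a hypothetical (dx, dy, sw, sh) layout for each offset group, computes the offsets that would map the detected frame exactly onto the standard frame, and scores each group by its summed per-dimension deviation from those target values. The scoring rule and the names `target_offsets` and `pick_optimal_group` are illustrative assumptions, not the patent's own method:

```python
from typing import Sequence

# Hypothetical layouts: a frame is (cx, cy, w, h); an offset group is
# (dx, dy, sw, sh) with center offsets as fractions of w/h and scaling ratios.
Box = tuple[float, float, float, float]
Offsets = tuple[float, float, float, float]


def target_offsets(initial: Box, standard: Box) -> Offsets:
    """Offsets that would map the initial frame exactly onto the standard frame."""
    cx, cy, w, h = initial
    scx, scy, s_w, s_h = standard
    return ((scx - cx) / w, (scy - cy) / h, s_w / w, s_h / h)


def per_dimension_error(group: Offsets, target: Offsets) -> list[float]:
    """Absolute deviation of each offset parameter from its target value."""
    return [abs(a - b) for a, b in zip(group, target)]


def pick_optimal_group(initial: Box, standard: Box,
                       groups: Sequence[Offsets]) -> int:
    """Index of the group with the smallest summed per-dimension deviation."""
    if not groups:
        raise ValueError("need at least one offset group")
    target = target_offsets(initial, standard)
    scores = [sum(per_dimension_error(g, target)) for g in groups]
    return min(range(len(groups)), key=scores.__getitem__)
```

Keeping the per-dimension errors available (rather than only their sum) would allow a weighted comparison if some dimensions, such as frame height, matter more for the downstream task.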

Step 104, generating the final position coordinates of the human body frame based on the optimal regression offset parameter group.

In this embodiment, the position coordinates of the human body frame corresponding to the optimal regression offset parameter group are taken as the final position coordinates of the human body frame.

It should be noted that the above-mentioned frame regression loss function is a well-known technique widely studied and applied at present, and will not be described here.

With continued reference to FIG. 2, the method 200 for detecting the position of a person in an image of this embodiment runs on an electronic device 201. In the security-surveillance field, the monitoring electronic device 201 determines the position coordinates 202 of the human body frame corresponding to the human body image from an image to be detected that contains the human body image. Based on the learned evaluation threshold vector, it performs multiple rounds of regression calculation on these coordinates using the frame regression loss function to obtain a set 203 of regression offset parameter groups corresponding to the evaluation threshold vector, and generates the final position coordinates 204 of the human body frame based on the optimal regression offset parameter group. Using the final coordinates, it determines whether the person has entered the monitored area 205 and sends the monitoring result to the monitoring personnel by voice or text; the information they receive is shown in FIG. 3.
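The text does not say how entry into the monitored area (item 205) is decided. A minimal Python sketch, assuming a rectangular area and treating the bottom-center of the final human body frame as the person's foot point (a common surveillance heuristic, used here only as an assumption; `entered_area` and `alert_text` are hypothetical names):

```python
from typing import Optional

Corners = tuple[float, float, float, float]  # (x1, y1, x2, y2); y grows downward


def foot_point(box: Corners) -> tuple[float, float]:
    """Bottom-center of the frame, an approximate ground-contact point."""
    x1, _, x2, y2 = box
    return ((x1 + x2) / 2, y2)


def entered_area(box: Corners, area: Corners) -> bool:
    """True if the frame's foot point lies inside the rectangular monitored area."""
    fx, fy = foot_point(box)
    ax1, ay1, ax2, ay2 = area
    return ax1 <= fx <= ax2 and ay1 <= fy <= ay2


def alert_text(box: Corners, area: Corners) -> Optional[str]:
    """Text notice for monitoring personnel, or None when no entry is detected."""
    if not entered_area(box, area):
        return None
    fx, fy = foot_point(box)
    return f"Person entered the monitored area at ({fx:.0f}, {fy:.0f})"
```

Using the foot point rather than any overlap avoids raising an alert when only the upper body of a person standing outside the area appears inside it in the image.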

In the method for detecting the position of a human body in an image provided by this embodiment, the position coordinates of the human body frame corresponding to the human body image are determined from the image to be detected; multiple rounds of regression calculation are performed on those coordinates using the frame regression loss function based on the learned evaluation threshold vector, yielding the set of regression offset parameter groups corresponding to the evaluation threshold vector; a multidimensional comparison analysis of each group determines the optimal regression offset parameter group; and the final position coordinates of the human body frame are generated from that group. This optimizes the human body detection algorithm and solves the prior-art problem that human body frame coordinates are not predicted accurately enough because the frame regression is performed only once, improving the accuracy of the detected human body frame coordinates and of the human body detection result.

With further reference to fig. 4, a schematic diagram 400 of a second embodiment of a method for detecting a position of a person in an image is shown. The flow of the method comprises the following steps:

Step 401, detecting the image to be detected with the trained deep network model to obtain the position coordinates of the human body frame corresponding to the human body image in the image to be detected.

In this embodiment, the image to be detected is input into the detection network of the deep network model to obtain the position coordinates of the human body frame corresponding to the human body image, where the deep network model is trained by varying the evaluation thresholds of the evaluation threshold vector in the frame regression loss function. Detecting the human body frame coordinates with deep learning makes the detection result more accurate.

In some optional implementations of this embodiment, the deep network model is trained as follows: obtaining a training sample set whose training samples comprise images to be detected of various categories; and, using a deep learning method, outputting the position coordinates of the human body frames of the human body images in the input images together with the learned evaluation threshold vector, where the learning objective for the evaluation threshold vector is to make the output human body frame coordinates approach the coordinates of the real human body frames. Learning the evaluation threshold vector with the deep model avoids judging samples against a single fixed evaluation threshold of 0.5 during training, which can lead to wrong predictions of the pedestrian frame coordinates. The evaluation threshold vector is therefore more accurate, and so is the detection of the human body frame coordinates.
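The text does not specify how the evaluation thresholds are learned, but their role during training can be illustrated. The Python sketch below labels candidate frames as positive or negative by their IoU with a ground-truth frame, once per threshold; with a single fixed 0.5 threshold only the first row would exist. The threshold values and the helper names `iou`, `assign_labels` and `labels_per_threshold` are illustrative assumptions:

```python
from typing import Sequence

Corners = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Corners, b: Corners) -> float:
    """Intersection over union of two axis-aligned frames."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_labels(candidates: Sequence[Corners], truth: Corners,
                  threshold: float) -> list[int]:
    """1 marks a positive training sample (IoU >= threshold), 0 a negative one."""
    return [1 if iou(c, truth) >= threshold else 0 for c in candidates]


def labels_per_threshold(candidates: Sequence[Corners], truth: Corners,
                         thresholds: Sequence[float]) -> list[list[int]]:
    """One row of labels for each evaluation threshold in the vector."""
    return [assign_labels(candidates, truth, t) for t in thresholds]
```

Raising the threshold keeps only candidates that already overlap the real frame well, which is why the later stages of a cascade are typically trained with progressively stricter thresholds.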

Step 402, for each evaluation threshold in the learned evaluation threshold vector, performing cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected using the frame regression loss function, to obtain the set of regression offset parameter groups corresponding to the evaluation threshold vector.

In this embodiment, cascade regression calculation is performed on the position coordinates of the human body frame by stepping through the evaluation thresholds of the evaluation threshold vector in the frame regression loss function, yielding the set of regression offset parameter groups corresponding to the evaluation threshold vector. The evaluation threshold vector is an ordered set of IoU (intersection over union) evaluation thresholds whose values increase from front to back. Because each stage of the cascade regression brings the regression offset parameters progressively closer to the coordinates of the real human body frame, the regression is more efficient.
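A minimal sketch of this cascade, in the spirit of Cascade R-CNN-style refinement (a named assumption, since the text does not cite it). Each stage has its own regression head, assumed to have been trained with the matching IoU threshold, and each stage starts from the frame refined by the previous one. The heads below are toy stand-ins that move the frame a fixed fraction toward a known target purely so the example runs; a real head would be a trained network layer. All names and the (dx, dy, sw, sh) layout are hypothetical:

```python
from typing import Callable, Sequence

Box = tuple[float, float, float, float]      # (cx, cy, w, h)
Offsets = tuple[float, float, float, float]  # (dx, dy, sw, sh), hypothetical layout
Head = Callable[[Box], Offsets]              # one stage's regression head


def apply_offsets(box: Box, p: Offsets) -> Box:
    """Shift the center by (dx*w, dy*h) and scale the size by (sw, sh)."""
    cx, cy, w, h = box
    dx, dy, sw, sh = p
    return (cx + dx * w, cy + dy * h, w * sw, h * sh)


def cascade_regress(box: Box, thresholds: Sequence[float],
                    heads: Sequence[Head]) -> list[Offsets]:
    """Run one regression stage per IoU threshold; each stage refines the last result."""
    if len(thresholds) != len(heads):
        raise ValueError("need exactly one regression head per threshold")
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must increase from front to back")
    groups: list[Offsets] = []
    for head in heads:
        p = head(box)                # offsets predicted for the current frame
        groups.append(p)
        box = apply_offsets(box, p)  # the next stage starts from the refined frame
    return groups


def toy_head(target: Box, gain: float) -> Head:
    """Toy stand-in for a trained head: closes `gain` of the gap to `target`."""
    def head(box: Box) -> Offsets:
        cx, cy, w, h = box
        tx, ty, tw, th = target
        return (gain * (tx - cx) / w, gain * (ty - cy) / h,
                1 + gain * (tw / w - 1), 1 + gain * (th / h - 1))
    return head


# Usage: three stages with increasing thresholds, each halving the remaining error.
true_box = (104.0, 195.0, 48.0, 110.0)
start = (100.0, 200.0, 40.0, 100.0)
groups = cascade_regress(start, [0.5, 0.6, 0.7], [toy_head(true_box, 0.5)] * 3)
```

The returned list holds one regression offset parameter group per stage, which is the set that the later comparison step examines.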

In some optional implementations of this embodiment, performing the multiple rounds of regression calculation based on the learned evaluation threshold vector includes: performing cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected using the frame regression loss function, to obtain the set of regression offset parameter groups corresponding to the evaluation threshold vector. Here the evaluation threshold vector represents the initial evaluation threshold and the cascade step size required by the cascade regression calculation, and each round of the cascade regression is performed on the basis of the regression offset parameter group obtained in the previous round. As before, the cascade brings the regression offset parameters progressively closer to the coordinates of the real human body frame, which improves the efficiency of the regression.
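When the evaluation threshold vector is instead described by an initial threshold and a cascade step size, the per-stage thresholds can be expanded from those two numbers. A small Python sketch, assuming a linear schedule and a stage count supplied by the caller (both assumptions; `threshold_schedule` is a hypothetical name):

```python
def threshold_schedule(initial: float, step: float, num_stages: int) -> list[float]:
    """Expand an initial IoU threshold and a cascade step into per-stage thresholds."""
    if not 0.0 < initial < 1.0:
        raise ValueError("initial threshold must lie in (0, 1)")
    if step <= 0.0:
        raise ValueError("cascade step must be positive so thresholds increase")
    if num_stages < 1:
        raise ValueError("need at least one stage")
    # Rounding removes floating-point noise from initial + k * step.
    thresholds = [round(initial + k * step, 6) for k in range(num_stages)]
    if thresholds[-1] >= 1.0:
        raise ValueError("schedule reaches IoU = 1; reduce the step or stage count")
    return thresholds
```

For example, an initial threshold of 0.5 with a step of 0.1 over three stages gives 0.5, 0.6 and 0.7, matching the increasing order required of the evaluation threshold vector.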

Step 403, obtaining, for each group of regression offset parameters in the set, the position coordinates of the corresponding human body frame.

In this embodiment, the position coordinates of the human body frame corresponding to each group of regression offset parameters are obtained from that group, where the parameters in each group include an offset of the center-point coordinates of the human body frame and a scaling ratio of the human body frame.

Step 404, comparing the position coordinates of the human body frame corresponding to each group of regression offset parameters with the preset human body frame, and determining the final position coordinates of the human body frame.

In this embodiment, the position coordinates of the human body frame corresponding to each group of regression offset parameters are compared with the preset human body frame to determine the final position coordinates of the human body frame, which also serves as a further verification of the cascade regression calculation. The final position coordinates are those closest to the preset human body frame, and the preset human body frame may be set manually based on the coordinates of the real human body frame. Verifying the cascade regression calculation in this way improves the precision of the detected human body frame coordinates.
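Taking "closest" to mean highest IoU with the preset human body frame (an assumption; a center-distance or corner-distance criterion would fit the text equally well), the selection step can be sketched in Python as follows. Candidate frames are given in corner form, and `closest_to_preset` is a hypothetical name:

```python
from typing import Sequence

Corners = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Corners, b: Corners) -> float:
    """Intersection over union of two axis-aligned frames."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def closest_to_preset(candidates: Sequence[Corners],
                      preset: Corners) -> tuple[int, float]:
    """Index and IoU of the candidate frame that best matches the preset frame."""
    if not candidates:
        raise ValueError("need at least one candidate frame")
    scores = [iou(c, preset) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Each candidate would be the frame decoded from one regression offset parameter group; the returned index identifies the group whose frame is taken as final.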

As can be seen from FIG. 4, compared with the embodiment corresponding to FIG. 1, the method shown in schematic diagram 400 of this embodiment learns the evaluation threshold vector with the deep model. This avoids judging samples only against a fixed evaluation threshold of 0.5 during model training, which can lead to wrong predictions of the pedestrian frame coordinates, so the evaluation threshold vector is more accurate and the detection of the human body frame coordinates is more accurate. In addition, the cascade regression produces regression offset parameter groups that come progressively closer to the coordinates of the real human body frame, and the cascade regression result is then verified, further improving the accuracy of the detected human body frame coordinates.

With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for detecting a position of a human body in an image, which corresponds to the embodiment of the method shown in fig. 1, and which is particularly applicable to various electronic devices.

As shown in FIG. 5, the apparatus 500 for detecting the position of a human body in an image according to this embodiment includes: a human body position determining unit 501, a regression offset calculating unit 502, an offset parameter analyzing unit 503, and a first position generating unit 504. The human body position determining unit is configured to determine, from the image to be detected that contains the human body image, the position coordinates of the human body frame corresponding to the human body image; the regression offset calculation unit is configured to perform, based on the learned evaluation threshold vector, multiple rounds of regression calculation on those position coordinates using the frame regression loss function, to obtain a set of regression offset parameter groups corresponding to the evaluation threshold vector; the offset parameter analysis unit is configured to perform a multidimensional comparison analysis on each group of regression offset parameters in the set to determine an optimal regression offset parameter group; and the first position generating unit is configured to generate the final position coordinates of the human body frame based on the optimal regression offset parameter group.

In this embodiment, the specific processes and the technical effects of the human body position determining unit 501, the regression offset calculating unit 502, the offset parameter analyzing unit 503 and the first position generating unit 504 of the apparatus 500 for detecting the human body position in the image can refer to the relevant descriptions of the steps 101 to 104 in the corresponding embodiment of fig. 1, and are not repeated here.

In some optional implementations of this embodiment, the regression offset calculation unit includes a first offset calculation module configured to perform, by using the frame regression loss function, cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected based on each evaluation threshold in the learned evaluation threshold vector, so as to obtain a regression offset parameter set corresponding to the evaluation threshold vector. The evaluation threshold vector is an ordered set of evaluation thresholds for the image overlap degree (intersection over union, IoU), and the values in the ordered set increase progressively from front to back.
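The ordered IoU threshold vector resembles the stage-wise thresholds of Cascade R-CNN, where each successive stage treats a candidate frame as a positive regression sample only if its IoU with the ground-truth frame reaches that stage's (higher) threshold. The Python sketch below assumes that reading; it is not taken from the patent. The functions `iou` and `positives_per_stage` and the corner box format are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union (image overlap degree) of two frames."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def positives_per_stage(
    frames: Sequence[Box], gt: Box, thresholds: Sequence[float]
) -> List[List[int]]:
    """For each cascade stage, list the indices of frames whose IoU with the
    ground-truth frame reaches that stage's evaluation threshold."""
    # The threshold vector must increase from front to back.
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("evaluation thresholds must be strictly increasing")
    scores = [iou(f, gt) for f in frames]
    return [[i for i, s in enumerate(scores) if s >= t] for t in thresholds]
```

With frames whose IoU against the ground truth is 1.0, 0.8, 0.5 and about 0.14, the threshold vector `[0.5, 0.7, 0.9]` yields `[[0, 1, 2], [0, 1], [0]]`: later stages train only on frames that are already well aligned, which is the usual motivation for an increasing vector.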

In some optional implementations of this embodiment, the regression offset calculation unit includes a second offset calculation module configured to perform, by using the frame regression loss function, cascade regression calculation based on the learned evaluation threshold vector and the position coordinates of the human body frame corresponding to the human body image in the image to be detected, so as to obtain a regression offset parameter set corresponding to the evaluation threshold vector. Here the evaluation threshold vector represents the initial evaluation threshold and the cascade step length required for the cascade regression calculation, and each regression calculation in the cascade is performed based on the regression offset parameter set obtained by the previous regression calculation.
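When the threshold vector is expressed as an initial threshold plus a cascade step length, the stage thresholds can be expanded arithmetically, and each stage refines the frame produced by the previous stage. The Python sketch below assumes a fixed number of stages, a center-offset/scaling parameterization of the offsets, and a hypothetical `regressor` callable standing in for the trained regression head; none of these names come from the patent.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]      # assumed (x1, y1, x2, y2)
Offsets = Tuple[float, float, float, float]  # assumed (dx, dy, sw, sh)
# Hypothetical stage regressor: (current frame, stage threshold) -> offsets.
Regressor = Callable[[Box, float], Offsets]


def stage_thresholds(initial: float, step: float, num_stages: int) -> List[float]:
    """Expand (initial threshold, cascade step length) into per-stage thresholds."""
    return [initial + k * step for k in range(num_stages)]


def apply_offsets(box: Box, offsets: Offsets) -> Box:
    """Shift the center by (dx*w, dy*h), then scale width/height by (sw, sh)."""
    x1, y1, x2, y2 = box
    dx, dy, sw, sh = offsets
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w + dx * w, y1 + 0.5 * h + dy * h
    w, h = w * sw, h * sh
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def cascade_regression(
    box: Box, regressor: Regressor, initial: float, step: float, num_stages: int
) -> List[Offsets]:
    """Run the cascade; each stage starts from the frame refined by the
    previous stage's offsets. Returns the regression offset parameter set."""
    offset_set: List[Offsets] = []
    current = box
    for threshold in stage_thresholds(initial, step, num_stages):
        offsets = regressor(current, threshold)
        offset_set.append(offsets)
        current = apply_offsets(current, offsets)  # input for the next stage
    return offset_set
```

For example, `stage_thresholds(0.5, 0.125, 3)` gives `[0.5, 0.625, 0.75]`. Storing only the initial value and the step keeps the vector increasing by construction, which matches the ordering required by the first offset calculation module.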

In some optional implementations of this embodiment, the apparatus further includes a human body position calculation unit and a second position generating unit. The human body position calculation unit is configured to obtain, based on each group of regression offset parameters in the regression offset parameter set, the position coordinates of the human body frame corresponding to that group. The second position generating unit is configured to compare the position coordinates of the human body frame corresponding to each group of regression offset parameters with the preset human body frame and to determine the position coordinates of the final human body frame, namely the position coordinates closest to the preset human body frame.
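The second position generating unit keeps the candidate frame closest to the preset human body frame. The text does not say how closeness is measured; the Python sketch below assumes IoU as the measure, and `select_final_frame` is a hypothetical name.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two frames."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_final_frame(candidates: Sequence[Box], preset: Box) -> Tuple[int, Box]:
    """Return the index and coordinates of the candidate frame (one per group of
    regression offset parameters) that overlaps the preset frame the most."""
    if not candidates:
        raise ValueError("no candidate frames")
    best = max(range(len(candidates)), key=lambda i: iou(candidates[i], preset))
    return best, candidates[best]
```

A center-distance or L1 coordinate distance would be an equally plausible closeness measure; IoU is used here only because it is the overlap measure the evaluation thresholds already refer to.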

In some optional implementations of this embodiment, the human body position determining unit includes a human body position detection module configured to detect the image to be detected by using a trained deep network model, so as to obtain the position coordinates of the human body frame corresponding to the human body image in the image to be detected. The deep network model is trained by varying the evaluation thresholds in the evaluation threshold vector of the frame regression loss function.
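One way to read "training by varying the evaluation threshold in the frame regression loss function" is an IoU-gated box regression loss: only candidate frames whose IoU with the real frame reaches the threshold contribute, so changing the threshold changes which samples the regressor learns from. The Python sketch below assumes a smooth L1 penalty (a common choice in detectors such as Fast R-CNN, swapped in here for illustration) averaged over the qualifying candidates; the function names are hypothetical.

```python
from typing import Sequence


def smooth_l1(x: float, beta: float = 1.0) -> float:
    """Smooth L1 penalty: quadratic near zero, linear beyond beta."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def frame_regression_loss(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    ious: Sequence[float],
    threshold: float,
) -> float:
    """Mean smooth L1 loss over candidates whose IoU with the real frame
    reaches the evaluation threshold; 0.0 if no candidate qualifies."""
    total, count = 0.0, 0
    for p, t, overlap in zip(pred, target, ious):
        if overlap >= threshold:
            total += sum(smooth_l1(pi - ti) for pi, ti in zip(p, t))
            count += 1
    return total / count if count else 0.0
```

With predicted offsets `[(0, 0, 0, 0), (2, 0, 0, 0)]`, zero targets and IoUs `[0.9, 0.55]`, a threshold of 0.5 averages both candidates (loss 0.75), while a threshold of 0.7 keeps only the first (loss 0.0).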

In some optional implementations of this embodiment, the deep network model of the human body position detection module is trained by the following modules: a sample acquisition module configured to acquire a training sample set, wherein the training samples in the training sample set include images to be detected of various categories; and a sample training module configured to use a deep learning method, taking the images to be detected of the various categories in the training sample set as the input of the detection network and outputting the position coordinates of the human body frames of the human body images in those images together with the learned evaluation threshold vector, so as to train the deep network model. The learning target of the evaluation threshold vector is to make the position coordinates of the output human body frames approach the position coordinates of the real human body frames.

According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.

Fig. 6 is a block diagram of an electronic device for implementing the method of detecting the position of a person in an image according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.

As shown in fig. 6, the electronic device includes one or more processors 601, a memory 602, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory, to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multiprocessor system). One processor 601 is taken as an example in fig. 6.

The memory 602 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for detecting a position of a person in an image provided by the present application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method for detecting a position of a human body in an image provided by the present application.

As a non-transitory computer readable storage medium, the memory 602 may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the method for detecting the position of a human body in an image in the embodiments of the present application (e.g., the human body position determining unit 501, the regression offset calculating unit 502, the offset parameter analyzing unit 503, and the first position generating unit 504 shown in fig. 5). The processor 601 executes the various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 602, that is, it implements the method for detecting the position of a person in an image in the above method embodiment.

The memory 602 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created through use of the electronic device for detecting the position of a person in an image, and the like. In addition, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, which may be connected via a network to the electronic device for detecting the position of a person in an image. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device for the method of detecting the position of a person in an image may further include an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 6.

The input device 603, such as a touch screen, keypad, mouse, track pad, touch pad, pointing stick, one or more mouse buttons, trackball, or joystick, may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for detecting the position of a person in an image. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

According to the technical solution provided by the embodiments of the present application, the position coordinates of the human body frame corresponding to the human body image are determined from an image to be detected that contains the human body image; based on the learned evaluation threshold vector, multiple regression calculations are performed on those position coordinates by using the frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector; multidimensional comparison analysis is performed on each group of regression offset parameters in the set to determine an optimal regression offset parameter set; and the final position coordinates of the human body frame are generated based on the optimal regression offset parameter set. This optimizes the human body detection algorithm and solves the prior-art problem that the position coordinates of the human body frame are not accurate enough because the frame regression loss function is used in only a single regression calculation. As a result, the accuracy of the detected position coordinates of the human body frame is improved, and the detection result of the human body image is more accurate.

It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.

The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims

1. A method for detecting a position of a human body in an image, the method comprising:

determining the position coordinates of a human body frame corresponding to a human body image from an image to be detected comprising the human body image;

performing, based on the learned evaluation threshold vector, multiple regression calculations on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function, to obtain a regression offset parameter set corresponding to the evaluation threshold vector, wherein the parameters in the regression offset parameter set comprise offsets of the center-point coordinates of the human body frame and scaling ratios of the human body frame;

performing multidimensional comparison analysis on each group of regression offset parameters in the regression offset parameter set to determine an optimal regression offset parameter set, wherein the multidimensional comparison analysis comprises comparing each group of regression offset parameters with the corresponding parameters of a standard frame;

generating final position coordinates of the human body frame based on the optimal regression offset parameter set;

the method further comprises the steps of:

obtaining, based on each group of regression offset parameters in the regression offset parameter set, the position coordinates of the human body frame corresponding to that group of regression offset parameters;

and comparing the position coordinates of the human body frame corresponding to each group of regression offset parameters with a preset human body frame, determining the position coordinates of the final human body frame, and verifying the result of the cascade regression operation, wherein the preset human body frame is set based on the position coordinates of the real human body frame, and the position coordinates of the final human body frame are the position coordinates closest to the preset human body frame.

2. The method of claim 1, wherein performing, based on the learned evaluation threshold vector, multiple regression calculations on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using the frame regression loss function, to obtain the regression offset parameter set corresponding to the evaluation threshold vector, comprises:

performing, by using the frame regression loss function, cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected based on each evaluation threshold in the learned evaluation threshold vector, to obtain the regression offset parameter set corresponding to the evaluation threshold vector, wherein the evaluation threshold vector is an ordered set of evaluation thresholds for the image overlap degree (intersection over union, IoU), and the values in the ordered set increase progressively from front to back.

3. The method of claim 1, wherein performing, based on the learned evaluation threshold vector, multiple regression calculations on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using the frame regression loss function, to obtain the regression offset parameter set corresponding to the evaluation threshold vector, comprises:

based on the evaluation threshold vector obtained by learning, carrying out cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by utilizing the frame regression loss function to obtain a regression offset parameter set corresponding to the evaluation threshold vector;

wherein the evaluation threshold vector represents the initial evaluation threshold and the cascade step length required for the cascade regression calculation, and each regression calculation in the cascade regression calculation is performed based on the regression offset parameter set obtained by the previous regression calculation.

4. The method of claim 1, wherein the determining, from the image to be detected including the human body image, the position coordinates of the human body frame corresponding to the human body image includes:

detecting the image to be detected by using a trained deep network model to obtain the position coordinates of the human body frame corresponding to the human body image in the image to be detected, wherein the deep network model is trained by varying the evaluation thresholds in the evaluation threshold vector of the frame regression loss function.

5. The method of claim 4, wherein the deep network model is trained by:

obtaining a training sample set, wherein training samples in the training sample set comprise: various categories of images to be detected;

and, by using a deep learning method, taking the images to be detected of the various categories in the training sample set as the input of a detection network and outputting the position coordinates of the human body frames of the human body images in those images together with the learned evaluation threshold vector, so as to train the deep network model, wherein the learning target of the evaluation threshold vector is to make the position coordinates of the output human body frames approach the position coordinates of the real human body frames.

6. An apparatus for detecting a position of a human body in an image, the apparatus comprising:

a human body position determining unit configured to determine position coordinates of a human body frame corresponding to a human body image from an image to be detected including the human body image;

a regression offset calculation unit configured to perform, based on the learned evaluation threshold vector, multiple regression calculations on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using a frame regression loss function, to obtain a regression offset parameter set corresponding to the evaluation threshold vector, wherein the parameters in the regression offset parameter set comprise offsets of the center-point coordinates of the human body frame and scaling ratios of the human body frame;

an offset parameter analysis unit configured to perform multidimensional comparison analysis on each group of regression offset parameters in the regression offset parameter set to determine an optimal regression offset parameter set, wherein the multidimensional comparison analysis comprises comparing each group of regression offset parameters with the corresponding parameters of a standard frame;

the first position generating unit is configured to generate the position coordinates of the final human body frame based on the optimal regression offset parameter set;

The apparatus further comprises:

a human body position calculation unit configured to obtain, based on each group of regression offset parameters in the regression offset parameter set, the position coordinates of the human body frame corresponding to that group of regression offset parameters;

a second position generating unit configured to compare the position coordinates of the human body frame corresponding to each group of regression offset parameters with a preset human body frame, determine the position coordinates of the final human body frame, and verify the result of the cascade regression operation, wherein the preset human body frame is set based on the position coordinates of the real human body frame, and the position coordinates of the final human body frame are the position coordinates closest to the preset human body frame.

7. The apparatus of claim 6, wherein the regression offset calculation unit comprises:

a first offset calculation module configured to perform, by using the frame regression loss function, cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected based on each evaluation threshold in the learned evaluation threshold vector, to obtain a regression offset parameter set corresponding to the evaluation threshold vector, wherein the evaluation threshold vector is an ordered set of evaluation thresholds for the image overlap degree (intersection over union, IoU), and the values in the ordered set increase progressively from front to back.

8. The apparatus of claim 6, wherein the regression offset calculation unit comprises:

a second offset calculation module configured to perform, based on the learned evaluation threshold vector, cascade regression calculation on the position coordinates of the human body frame corresponding to the human body image in the image to be detected by using the frame regression loss function, to obtain a regression offset parameter set corresponding to the evaluation threshold vector, wherein the evaluation threshold vector represents the initial evaluation threshold and the cascade step length required for the cascade regression calculation, and each regression calculation in the cascade regression calculation is performed based on the regression offset parameter set obtained by the previous regression calculation.

9. The apparatus of claim 6, wherein the human body position determining unit comprises:

a human body position detection module configured to detect the image to be detected by using a trained deep network model to obtain the position coordinates of the human body frame corresponding to the human body image in the image to be detected, wherein the deep network model is trained by varying the evaluation thresholds in the evaluation threshold vector of the frame regression loss function.

10. The apparatus of claim 9, wherein the deep network model of the human body position detection module is trained by the following modules:

a sample acquisition module configured to acquire a training sample set, wherein training samples in the training sample set comprise: various categories of images to be detected;

a sample training module configured to use a deep learning method, taking the images to be detected of the various categories in the training sample set as the input of a detection network and outputting the position coordinates of the human body frames of the human body images in those images together with the learned evaluation threshold vector, so as to train the deep network model, wherein the learning target of the evaluation threshold vector is to make the position coordinates of the output human body frames approach the position coordinates of the real human body frames.

11. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.

12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-5.