CN117437697B - Training method of prone position human body detection model, prone position human body detection method and system


Info

Publication number
CN117437697B
Authority
CN
China
Prior art keywords
human body
prone
network
detection model
body detection
Prior art date
Legal status
Active
Application number
CN202311754454.3A
Other languages
Chinese (zh)
Other versions
CN117437697A
Inventor
于大洋
周可
王羽嗣
王云忠
刘思德
Current Assignee
Guangzhou Side Medical Technology Co ltd
Original Assignee
Guangzhou Side Medical Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Side Medical Technology Co ltd filed Critical Guangzhou Side Medical Technology Co ltd
Priority to CN202311754454.3A
Publication of CN117437697A
Application granted
Publication of CN117437697B


Classifications

    • G06V 40/23 - Recognition of whole body movements, e.g. for sport training
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G06T 7/0012 - Biomedical image inspection
    • G06V 10/454 - Integrating filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/52 - Scale-space analysis, e.g. wavelet analysis
    • G06V 10/806 - Fusion of extracted features at sensor, preprocessing, feature-extraction or classification level
    • G06V 10/82 - Image or video recognition or understanding using neural networks
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06T 2207/10068 - Endoscopic image

Abstract

The invention relates to the technical field of image detection, and in particular discloses a training method of a prone position human body detection model, together with a prone position human body detection method and system. The training method comprises: extracting images containing a prone human body from video data as training images and labeling them with real human body frames; and constructing a prone position human body detection model comprising a backbone network, a neck network and a head network which are sequentially connected, wherein the backbone network comprises a multi-scale fusion module embedded in an FPN network that performs convolution at least with convolution kernels whose width is greater than their height, and the neck network comprises a context fusion module and an attention module. After the model is trained on the training images, the multi-scale fusion module improves its ability to acquire width information, while the context fusion module and the attention module suppress interference from background regions and highlight human body features. The model is therefore suited to the complex prone-body and examination-table backgrounds encountered in endoscopy, and improves the accuracy of human body detection during endoscopy.

Description

Training method of prone position human body detection model, prone position human body detection method and system
Technical Field
The invention belongs to the technical field of image detection, and particularly relates to a training method of a prone position human body detection model, a prone position human body detection method and a prone position human body detection system.
Background
Endoscopy of the human body is a common medical diagnostic technique, and capsule gastroscopy, as a noninvasive and painless examination method, has attracted wide attention and application.
In portable capsule gastroscopy, a camera outside the human body is required to detect the body posture of the prone patient, in order to support subsequent operations such as judging whether a position change was performed correctly. At present, most machine-learning-based human body detection algorithms target scenes in which the person is standing or the background is simple. They are not suited to detecting a prone human body during endoscopy, where the examination table on which the patient lies carries complex items such as bedding and sterile cloth. As a result, the accuracy of human body detection in endoscopy scenes is low, and accurate position-change data cannot be provided for the examination.
Disclosure of Invention
The embodiments of the invention aim to provide a training method of a prone human body detection model, a prone human body detection method and a prone human body detection system, so as to solve the problem that existing human body detection algorithms cannot be applied to endoscopy scenes and therefore detect the human body there with low accuracy.
In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:
in a first aspect, the present invention provides a training method for a prone human body detection model, specifically including the following steps:
Extracting images including prone human bodies from video data of a plurality of scenes to serve as training images, and marking human bodies in the training images to obtain real human body frames;
Constructing a prone position human body detection model, wherein the prone position human body detection model comprises a backbone network, a neck network and a head network which are sequentially connected, the backbone network comprises an FPN network and a multi-scale fusion module which is embedded in the FPN network and performs convolution operations at least using convolution kernels whose width is greater than their height, and the neck network comprises a context fusion module and an attention module;
and training the prone position human body detection model by adopting the training image.
In a second aspect, the invention provides a prone position human body detection method, which specifically comprises the following steps:
collecting an image of a human body using an endoscope as an image to be detected;
inputting the image to be detected into a pre-trained prone human body detection model to obtain human body detection information;
Wherein the prone human body detection model is trained by the training method of the prone human body detection model of the first aspect.
In a third aspect, the present invention provides a training system for a prone human detection model, specifically including:
the training image acquisition module is used for extracting images including prone human bodies from video data of a plurality of scenes to serve as training images, and labeling human bodies in the training images to obtain real human body frames;
The prone position human body detection model building module is used for building a prone position human body detection model, wherein the model comprises a backbone network, a neck network and a head network which are sequentially connected, the backbone network comprises an FPN network and a multi-scale fusion module which is embedded in the FPN network and performs convolution operations at least using convolution kernels whose width is greater than their height, and the neck network comprises a context fusion module and an attention module;
And the prone position human body detection model training module is used for training the prone position human body detection model by adopting the training image.
In a fourth aspect, the present invention provides a prone position human body detection system, specifically including:
the image acquisition module to be detected is used for acquiring images of a human body using an endoscope to be used as the images to be detected;
the human body detection module is used for inputting the image to be detected into a pre-trained prone human body detection model to obtain human body detection information;
Wherein the prone human body detection model is trained by the training method of the prone human body detection model of the first aspect.
In a fifth aspect, the present invention provides an electronic device, including:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of training the prone human detection model according to the first aspect of the present invention and/or the method of prone human detection according to the second aspect.
In a sixth aspect, the present invention provides a computer readable storage medium storing computer instructions for causing a processor to implement the training method of the prone human detection model according to the first aspect of the present invention and/or the prone human detection method according to the second aspect when executed.
Compared with the prior art, the invention has the beneficial effects that:
The prone position human body detection model trained in the embodiments of the invention comprises a backbone network, a neck network and a head network which are sequentially connected. The backbone network comprises an FPN network and an embedded multi-scale fusion module that performs convolution at least with convolution kernels whose width exceeds their height, and the neck network comprises a context fusion module and an attention module. When the backbone network extracts features from a training image, it can therefore convolve the feature maps with kernels that are wider than they are tall, capturing more width-wise features of the training image and improving the model's ability to acquire width information; this suits scenes in which the width of a prone human body far exceeds its height. In addition, the context fusion module and the attention module process the features extracted by the backbone network, which suppresses interference from background regions and highlights human body features, making the model suitable for the complex scenes in endoscopy where bedding, sterile cloth and the like lie on the examination table. The accuracy of human body detection in endoscopy scenes is thereby greatly improved, and accurate position-change data can be provided for the examination.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will briefly introduce the drawings that are needed in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the present invention.
Fig. 1 shows a flowchart of a training method of a prone position human body detection model according to an embodiment of the present invention;
Fig. 2 shows a schematic diagram of the model structure of a prone position human body detection model in one embodiment;
Fig. 3 shows a schematic diagram of the multi-scale convolution kernels of the multi-scale fusion module;
Fig. 4 shows a network architecture diagram of the neck network;
Fig. 5 shows a flowchart of a prone position human body detection method according to an embodiment of the present invention;
Fig. 6 shows an application architecture diagram of a training system of a prone position human body detection model provided by one embodiment of the present invention;
Fig. 7 shows an application architecture diagram of a prone position human body detection system provided by one embodiment of the present invention;
Fig. 8 shows an application architecture diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Fig. 1 shows a flowchart of the training method of a prone position human body detection model provided by an embodiment of the present invention. As shown in fig. 1, the method specifically includes the following steps:
Step S101, images including prone position human bodies are extracted from video data of a plurality of scenes to serve as training images, and human bodies in the training images are marked to obtain real human body frames.
In one embodiment, video data containing prone human bodies in multiple scenes can be obtained; key frames containing a human body are extracted from the video data, data expansion is performed on the key frames to obtain expanded images, and the key frames and the expanded images are taken as the training images. A label is then added to each image, either manually or with AI (Artificial Intelligence) assistance. The label is a human body bounding box, i.e. a real human body frame enclosing the human body in the image; it has center point coordinates, a width and a height, and carries a classification label of the human body posture, such as prone or lateral.
For example, video data of different human bodies lying prone in different scenes can be collected: video of people of different heights, weights, ages and sexes on an examination table, a hospital bed or other settings, with the subjects changing prone postures or performing various movements during capture. Key frames containing the human body are extracted from the video data; these key frames effectively represent the postures and movements of a prone human body, such as lying on the left side, half-propped on the left, lying on the right side, half-propped on the right, prone and supine. The key frames are then flipped, rotated, rescaled, randomly occluded, randomly blurred and so on to obtain expanded images, the expanded images and the key frames serve as training images, and real human body frames marking the human body boundary are added to the training images. In this way comprehensive and accurate images of prone human bodies are obtained, providing sufficient and rich samples for training the prone position human body detection model.
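To make the expansion step concrete, the following is a minimal sketch of such an augmentation pipeline in PyTorch/torchvision. The patent names flipping, rotation, scale change, random occlusion and random blur but fixes no parameters, so every value below (angle, crop size, probabilities) is an illustrative assumption.

```python
import torch
import torchvision.transforms as T

# One augmented view per key frame; applied to CHW float tensors.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                            # flipping
    T.RandomRotation(degrees=15),                             # rotation
    T.RandomResizedCrop((640, 640), scale=(0.8, 1.0)),        # scale change
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),          # random blur
    T.RandomErasing(p=0.3, scale=(0.02, 0.1)),                # random occlusion
])

frame = torch.rand(3, 720, 1280)     # a key frame sampled from the video
expanded = augment(frame)            # one expanded training image
```

Note that in practice the real human body frames must be transformed consistently with the image (or the expanded images re-annotated), since the geometric augmentations move the box coordinates.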
Step S102, a prone position human body detection model is constructed. The model comprises a backbone network, a neck network and a head network which are sequentially connected; the backbone network adopts an FPN (Feature Pyramid Networks) structure in which a multi-scale fusion module (MM, Multi-scale fusion Module) is embedded, the MM performs several kinds of convolution operations, including convolutions whose kernels are wider than they are tall, and the neck network comprises a context fusion module and an attention module.
As shown in fig. 2, the prone position human body detection model of this embodiment comprises a backbone network, a neck network and a head network. The backbone network receives a training image and extracts basic features (such as semantic information and position information) from it; its output layer is connected to the input layer of the neck network, and the output layer of the neck network is connected to the input layer of the head network. The backbone network extracts target feature maps Pi at multiple levels from the training image, the neck network fuses these multi-level target feature maps Pi into image features Oi of the training image, and the head network outputs a human body detection frame from the image features Oi. The output data specifically includes the posture classification label of the human body (cls in fig. 2), the intersection-over-union between the human body detection frame and the pre-labeled real human body frame (IOU in fig. 2), and the regression information of the human body detection frame (reg in fig. 2); the regression information reg specifically includes the center coordinates, the width and the height of the human body detection frame.
In an alternative embodiment, an FPN (Feature Pyramid Network) structure can be adopted as the backbone network, with the multi-scale fusion module MM embedded in it; the MM performs multi-scale convolution operations on the multi-level initial feature maps Ci output by the backbone. Specifically, as shown in fig. 2, the FPN comprises a bottom-up sub-network that downsamples to extract features (producing the multi-level initial feature maps Ci) and a top-down sub-network, with the MM embedded in its backward pass, that upsamples to extract features (producing the multi-level target feature maps Pi); the MM participates in generating the target feature map Pi of each level. The multi-scale convolution operations are convolutions with kernels of several scales and may include standard convolution, hole (dilated) convolution, asymmetric convolution and asymmetric hole convolution, where an asymmetric convolution is one whose kernel width is greater than its kernel height.
As shown in fig. 3, (a) is a 3×3 standard convolution, in which the kernel width and height are equal; it densely samples a 3×3 local region of the input feature map. (b) is a 3×5 asymmetric convolution, i.e. the height h is 3 and the width w is 5; in this embodiment, the kernel of the asymmetric convolution is wider than it is tall. (c) is a 3×3 standard hole convolution, whose kernel contains holes and whose width equals its height. (d) is a 3×5 asymmetric hole convolution, whose kernel contains holes and is wider than it is tall. A convolution without holes densely samples a local region and yields high-quality local features, while a hole convolution samples sparsely and enlarges the receptive field. In this embodiment, the asymmetric convolutions in the multi-scale fusion module improve feature extraction in the width direction, making the model better suited to detecting a human body whose width far exceeds its height in a prone view; a sketch of the module follows.
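A minimal PyTorch sketch of such a multi-scale fusion module is given below. The patent fixes only the kernel shapes of fig. 3 (3×3, 3×5 and their dilated variants); the dilation rate of 2, the equal per-branch channel counts and the 1×1 fusion convolution are assumptions made here to keep the module shape-preserving.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Sketch of the multi-scale fusion module (MM): four parallel branches
    matching fig. 3, concatenated and fused back to `channels` channels."""
    def __init__(self, channels: int):
        super().__init__()
        c = channels
        self.std   = nn.Conv2d(c, c, kernel_size=3, padding=1)                # (a) 3x3 standard
        self.asym  = nn.Conv2d(c, c, kernel_size=(3, 5), padding=(1, 2))      # (b) 3x5, width > height
        self.dil   = nn.Conv2d(c, c, kernel_size=3, padding=2, dilation=2)    # (c) 3x3 hole convolution
        self.asymd = nn.Conv2d(c, c, kernel_size=(3, 5), padding=(2, 4),
                               dilation=2)                                    # (d) 3x5 asymmetric hole
        self.fuse  = nn.Conv2d(4 * c, c, kernel_size=1)                       # merge branches (assumed)

    def forward(self, x):
        y = torch.cat([self.std(x), self.asym(x), self.dil(x), self.asymd(x)], dim=1)
        return self.fuse(y)

# The paddings keep every branch the same spatial size as the input:
x = torch.rand(1, 256, 40, 40)
assert MultiScaleFusion(256)(x).shape == x.shape
```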
As shown in fig. 4, a context fusion module and an attention module may be constructed as the neck network. The context fusion module fuses several adjacent-level target feature maps (Pi-1, Pi, Pi+1) output by the FPN network into a fusion feature Fi, and the attention module extracts an attention feature Ai from Fi and multiplies Fi and Ai pixel by pixel to obtain the image features Oi of the training image.
As shown in fig. 2, this embodiment is modified from YOLOX, and the head network of YOLOX may be used as the head network of the prone position human body detection model; the head network outputs a predicted human body frame when the image features Oi are input. The overall wiring is sketched below.
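The following skeleton shows one way the three parts could be wired together; the concrete backbone, neck and head modules are placeholders here, and the per-level bookkeeping inside the neck is elided, so this is an illustration rather than the patent's reference implementation.

```python
import torch.nn as nn

class ProneDetector(nn.Module):
    # Illustrative backbone -> neck -> head wiring as described above.
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone   # FPN with embedded MM -> target maps P3..P5
        self.neck = neck           # context fusion + attention -> image features Oi
        self.head = head           # YOLOX-style head -> cls, IOU, reg

    def forward(self, x):
        pyramid = self.backbone(x)
        features = self.neck(pyramid)
        return self.head(features)
```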
Step S103, training a prone position human body detection model by using the training image.
After the prone position human body detection model shown in fig. 2 is constructed, training proceeds as follows. A training image is input into the FPN network, which extracts initial feature maps at multiple levels; the initial feature map of each level is passed to the multi-scale fusion module MM, which applies standard convolution, asymmetric convolution and asymmetric hole convolution to it and concatenates the resulting features to obtain the target feature map of that level, where the kernels of the asymmetric convolution and the asymmetric hole convolution are wider than they are tall.
As shown in fig. 2, after the training image enters the FPN network, the forward (bottom-up) sub-network first downsamples from the bottom layer to the top layer to extract semantic features, yielding the initial feature maps C1, C2, C3, C4 and C5. The backward (top-down) sub-network, with the embedded multi-scale fusion module MM, then extracts fine-grained localization features to obtain the target feature maps P5, P4 and P3. Specifically, the MM performs its multi-scale convolution on the initial feature map C5 to obtain the target feature map P5; P5 is upsampled to an upsampled feature map with the same scale as C4, spliced with C4, and processed by the MM to obtain P4; P4 is in turn upsampled to the scale of C3, spliced with C3, and processed by the MM to obtain P3. P5, P4 and P3 are the multiple levels of target feature maps actually used in the subsequent stages; the top-down pass is sketched below.
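The following sketch of the top-down pass reuses the MultiScaleFusion module from the earlier listing. The 1×1 lateral and reduction convolutions that keep channel counts consistent, the nearest-neighbour upsampling and the channel sizes are assumptions; the patent only specifies upsample, splice, then multi-scale convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Top-down pass of fig. 2: P5 = MM(C5); Pi = MM(reduce(cat(up(Pi+1), Ci)))."""
    def __init__(self, in_chs=(256, 512, 1024), ch=256):
        super().__init__()
        self.lat = nn.ModuleList([nn.Conv2d(c, ch, 1) for c in in_chs])   # align C3..C5 (assumed)
        self.red = nn.ModuleList([nn.Conv2d(2 * ch, ch, 1) for _ in range(2)])
        self.mm = nn.ModuleList([MultiScaleFusion(ch) for _ in range(3)])

    def forward(self, c3, c4, c5):
        p5 = self.mm[2](self.lat[2](c5))                                  # P5 = MM(C5)
        u4 = F.interpolate(p5, scale_factor=2, mode="nearest")            # up to C4's scale
        p4 = self.mm[1](self.red[1](torch.cat([u4, self.lat[1](c4)], 1)))
        u3 = F.interpolate(p4, scale_factor=2, mode="nearest")            # up to C3's scale
        p3 = self.mm[0](self.red[0](torch.cat([u3, self.lat[0](c3)], 1)))
        return p3, p4, p5
```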
After the backbone network extracts the target feature maps Pi of each level, they may be input to the context fusion module of the neck network. The context fusion module fuses the target feature maps of the current level i, the previous level i-1 and the next level i+1 to obtain the fusion feature Fi of the current level i, and Fi is input to the attention module to extract the attention feature Ai, which is multiplied with Fi to obtain the image features of the training image.
Specifically, as shown in fig. 4, let the target feature map of the current level i be Pi, with the target feature map Pi-1 of the previous level and Pi+1 of the next level, where Pi has scale wi×hi×C (wi being the width, hi the height, and C the depth), Pi-1 has scale wi-1×hi-1×C, and Pi+1 has scale wi+1×hi+1×C. Pi-1 and Pi+1 are first resized (the resize operation in fig. 4) to the same scale, i.e. the same resolution, as Pi. Pi-1, Pi and Pi+1 are then spliced, and a 3×3 convolution layer processes the spliced map to obtain the fusion feature Fi. Another 3×3 convolution layer processes Fi to obtain the attention feature Ai of scale wi×hi×2; after Ai is normalized by a softmax function, its first channel is taken as the normalized attention feature of scale wi×hi×1 and multiplied with the fusion feature Fi to obtain the image features Oi of the training image. Oi is input into the head network to obtain the predicted human body frame of the human body in the training image; a loss is computed from the predicted human body frame and the labeled real human body frame, and the model parameters of the prone position human body detection model are adjusted according to the loss until the model converges. A sketch of this neck follows.
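The sketch below implements the neck for a single level i, followed by a small shape check that wires it to the TopDownFPN sketch above. The 3×3 convolutions, the 2-channel attention map, the channel-wise softmax and the pixel-wise multiplication follow the description of fig. 4; the nearest-neighbour resizing, the output channel count of the fusion convolution and the input sizes in the check are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAttentionNeck(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.fuse = nn.Conv2d(3 * ch, ch, kernel_size=3, padding=1)  # spliced Pi-1/Pi/Pi+1 -> Fi
        self.attn = nn.Conv2d(ch, 2, kernel_size=3, padding=1)       # Fi -> 2-channel Ai

    def forward(self, p_prev, p_i, p_next):
        size = p_i.shape[-2:]
        p_prev = F.interpolate(p_prev, size=size, mode="nearest")    # resize to Pi's resolution
        p_next = F.interpolate(p_next, size=size, mode="nearest")
        f_i = self.fuse(torch.cat([p_prev, p_i, p_next], dim=1))     # fusion feature Fi
        a_i = torch.softmax(self.attn(f_i), dim=1)[:, :1]            # first channel, wi x hi x 1
        return f_i * a_i                                             # image features Oi

# Shape check (sizes assume a 640x640 input through a stride-8/16/32
# bottom-up network, which the patent leaves open):
c3, c4, c5 = (torch.rand(1, 256, 80, 80),
              torch.rand(1, 512, 40, 40),
              torch.rand(1, 1024, 20, 20))
p3, p4, p5 = TopDownFPN()(c3, c4, c5)
o4 = ContextAttentionNeck(256)(p3, p4, p5)   # O4 for the middle level
assert o4.shape == p4.shape
```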
In one embodiment, the loss may be calculated by the following formula:
loss = 1 - IOU² + λ(w_gt - w_pred)²
where IOU (Intersection over Union) is the intersection-over-union of the predicted human body frame and the real human body frame, λ is a weighting coefficient, w_gt is the width of the real human body frame, and w_pred is the width of the predicted human body frame.
According to this formula, in addition to the intersection-over-union term, the loss includes the squared difference between the predicted and labeled widths, weighted by the coefficient λ. When model parameters are adjusted by this loss, the model therefore pays more attention to how well the predicted human body frame fits the real human body frame in the width direction; it adapts better to width changes of the human body frame in prone scenes, i.e. to scenes where the human body is much wider than it is tall, which improves the model's ability to detect a prone human body. A one-line implementation follows.
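In code, the loss reduces to a one-liner. The value λ = 1.0 and the averaging over boxes are assumptions, since the patent fixes neither; widths would typically be normalised, e.g. to the image width, before squaring.

```python
import torch

def prone_box_loss(iou: torch.Tensor, w_gt: torch.Tensor,
                   w_pred: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # loss = 1 - IOU^2 + lambda * (w_gt - w_pred)^2, averaged over boxes.
    return (1.0 - iou.pow(2) + lam * (w_gt - w_pred).pow(2)).mean()
```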
The prone position human body detection model trained in this embodiment comprises a backbone network, a neck network and a head network which are sequentially connected. When the backbone network extracts features from a training image, it can perform multi-scale fused convolution with kernels that are wider than they are tall, capturing more width-wise features of the training image and improving the model's ability to acquire width information; this suits scenes in which the width of a prone human body far exceeds its height. In addition, the context fusion module and the attention module process the features extracted by the backbone network, suppressing interference from background regions and highlighting human body features, which suits the complex scenes in endoscopy where bedding, sterile cloth and the like lie on the examination table. The accuracy of human body detection in endoscopy scenes is thereby greatly improved, and accurate position-change data is provided for the examination.
Fig. 5 shows a flowchart of a prone position human body detection method provided by an embodiment of the present invention. Specifically, as shown in fig. 5, the prone position human body detection method in the embodiment of the invention specifically includes the following steps:
s201, an image is acquired for a human body using an endoscope as an image to be detected.
In one embodiment, a camera may be arranged outside the human body of the person undergoing endoscopy to collect images. For example, the patient lies on the examination table of an endoscopy room; after a camera positioned beside the table is installed and calibrated, images or video are collected of the patient on the table before the endoscope enters the body, and the collected images, or video frames sampled from the video, are taken as the images to be detected, so that the position of the patient on the table can be detected.
S202, inputting the image to be detected into a pre-trained prone human body detection model to obtain human body detection information.
The prone position human body detection model is trained by the training method provided in the embodiments of the invention. After the image to be detected is determined, it is input into the prone position human body detection model to obtain human body detection information, which may be, for example, the body position classification of the human body.
The prone position human body detection model used by the detection method of this embodiment comprises a backbone network, a neck network and a head network which are sequentially connected. After the image to be detected enters the backbone network, multi-scale fused convolution can be performed on its feature maps with kernels that are wider than they are tall, so that more width-wise features of the image are captured and the model's ability to acquire width information improves; this suits scenes in which the width of a prone human body far exceeds its height. In addition, the context fusion module and the attention module process the features extracted by the backbone network, suppressing interference from background regions and highlighting human body features, which suits the complex scenes in endoscopy where bedding, sterile cloth and the like lie on the examination table. The accuracy with which the model detects the human body in endoscopy scenes is thereby greatly improved, and accurate position-change data is provided for the examination.
Fig. 6 shows an application architecture diagram of a training system for a prone human body detection model according to an embodiment of the present invention, where the training system for a prone human body detection model according to the embodiment includes:
the training image obtaining module 301 is configured to extract an image including a prone human body from video data of a plurality of scenes as a training image, and annotate a human body in the training image to obtain a real human body frame;
The prone position human body detection model construction module 302 is configured to construct a prone position human body detection model, wherein the model comprises a backbone network, a neck network and a head network that are sequentially connected, the backbone network comprises an FPN network and a multi-scale fusion module that is embedded in the FPN network and performs convolution operations at least using convolution kernels whose width is greater than their height, and the neck network comprises a context fusion module and an attention module;
The prone position human body detection model training module 303 is configured to train a prone position human body detection model by using training images.
In an alternative embodiment, the training image acquisition module 301 specifically includes:
a video data acquisition unit for acquiring video data including prone human bodies of a plurality of scenes;
A key frame extraction unit for extracting a key frame containing a human body from the video data;
The data expansion unit is used for carrying out data expansion processing based on the key frame to obtain an expansion image, and determining the key frame and the expansion image as training images.
In an alternative embodiment, prone human detection model building module 302 specifically includes:
the backbone network construction unit is used for adopting an FPN (Feature Pyramid Network) as the backbone network and embedding the multi-scale fusion module into the FPN network, the multi-scale fusion module being used for performing various convolution operations on the feature maps extracted at each level;
The neck network construction unit is used for constructing a context fusion module and an attention module as the neck network, the context fusion module being used for fusing a plurality of adjacent-level feature maps output by the FPN network to obtain fusion features, and the attention module being used for extracting attention features from the fusion features and multiplying the fusion features and the attention features to obtain the image features of the training images;
The head network construction unit is used for adopting a YOLOX head network as the head network of the prone position human body detection model, the head network being used for outputting a predicted human body frame when the image features are input.
In one embodiment, the prone human detection model training module 303 specifically includes:
the training image input unit is used for inputting the training image into the FPN network, and the FPN network extracts a plurality of levels of initial feature graphs Ci for the training image;
The convolution operation unit is used for carrying out standard convolution, asymmetric convolution and asymmetric hole convolution operation on the initial feature map Cn of the last level n through the multi-scale fusion module to obtain a target feature map Pn of an nth level, wherein n is the number of levels of the FPN network;
The target feature map extraction unit is used for upsampling the target feature map Pi+1 of the (i+1)-th level to obtain an upsampled feature map with the same scale as the initial feature map Ci of the i-th level, splicing the upsampled feature map with the initial feature map Ci of the i-th level, and performing standard convolution, asymmetric convolution and asymmetric hole convolution operations through the multi-scale fusion module to obtain the target feature map Pi of the i-th level, wherein the width of the convolution kernels of the asymmetric convolution and the asymmetric hole convolution is larger than the height, and i is less than or equal to n-1;
The context feature fusion unit is used for inputting the target feature maps Pi of all levels into the context fusion module of the neck network, the context fusion module being used for fusing the target feature maps of the current level i, the previous level i-1 and the next level i+1 to obtain the fusion feature Fi of the current level i;
The image feature calculation unit is used for inputting the fusion feature Fi into the attention module to extract the attention feature Ai and multiplying the attention feature Ai with the fusion feature Fi to obtain the image feature of the training image;
the prediction unit is used for inputting the image characteristics into the head network to obtain a predicted human body frame of a human body in the training image;
the loss calculation unit is used for calculating loss based on the predicted human body frame and the marked real human body frame;
And the model parameter adjusting unit is used for adjusting the model parameters of the prone position human body detection model according to the loss until the prone position human body detection model converges.
In an alternative embodiment, the loss calculation unit specifically includes:
a loss calculation subunit for calculating a loss by the following formula:
loss = 1 - IOU² + λ(w_gt - w_pred)²
where IOU is the intersection-over-union of the predicted human body frame and the real human body frame, λ is a weighting coefficient, w_gt is the labeled width value of the real human body frame, and w_pred is the width value of the predicted human body frame.
The training system of the prone position human body detection model provided by the embodiment of the invention can execute the training method of the prone position human body detection model provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Fig. 7 shows an application architecture diagram of a prone human body detection system provided by an embodiment of the present invention, where the prone human body detection system of the present embodiment includes:
the image to be detected acquisition module 401 is configured to acquire an image of a human body using an endoscope as an image to be detected;
The human body detection module 402 is configured to input an image to be detected into a pre-trained prone human body detection model to obtain human body detection information;
the prone position human body detection model is trained by the training method of the prone position human body detection model provided by the embodiment of the invention.
The prone human body detection system provided by the embodiment of the invention can execute the prone human body detection method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Fig. 8 shows a schematic diagram of an electronic device 500 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 8, the electronic device 500 includes at least one processor 501 and a memory communicatively connected to it, such as a read-only memory (ROM) 502 and a random access memory (RAM) 503. The memory stores a computer program executable by the at least one processor, and the processor 501 can perform various suitable actions and processes according to the computer program stored in the ROM 502 or loaded from the storage unit 508 into the RAM 503. The RAM 503 can also store various programs and data required for the operation of the electronic device 500. The processor 501, the ROM 502 and the RAM 503 are connected to one another by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in electronic device 500 are connected to I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 501 performs the various methods and processes described above, such as the training method of the prone human detection model and/or the prone human detection method.
In some embodiments, the method of training the prone human detection model, and/or the method of prone human detection may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by processor 501, one or more steps of the above-described training method of the prone human detection model, and/or the prone human detection method may be performed. Alternatively, in other embodiments, processor 501 may be configured to perform the training method of the prone human detection model, and/or the prone human detection method, in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (9)

1. A training method of a prone position human body detection model, characterized by comprising the following steps:
Extracting images including prone human bodies from video data of a plurality of scenes to serve as training images, and marking human bodies in the training images to obtain real human body frames;
constructing a prone position human body detection model, wherein the prone position human body detection model comprises a backbone network, a neck network and a head network which are sequentially connected, the backbone network comprises an FPN network and a multi-scale fusion module which is embedded in the FPN network and performs a plurality of convolution operations on the feature maps extracted at each level of the FPN network at least using convolution kernels whose width is greater than their height, and the neck network comprises a context fusion module and an attention module;
Inputting the training image into the FPN network, wherein the FPN network extracts a plurality of levels of initial feature graphs Ci for the training image;
Performing standard convolution, asymmetric convolution and asymmetric hole convolution operation on an initial feature map Cn of a last level n through a multi-scale fusion module to obtain a target feature map Pn of an nth level, wherein n is the number of levels of the FPN network;
for the ith level, up-sampling the target feature map Pi+1 of the (i+1) th level to obtain an up-sampled feature map with the same scale as the initial feature map Ci of the (i) th level, splicing the up-sampled feature map and the initial feature map Ci of the (i) th level, and performing standard convolution, asymmetric convolution and asymmetric hole convolution operations through a multi-scale fusion module to obtain the target feature map Pi of the (i) th level, wherein the width of the convolution kernel of the asymmetric convolution and the asymmetric hole convolution is larger than the height, and i is less than or equal to n-1;
Inputting target feature maps Pi of all levels into a context fusion module of the neck network, wherein the context fusion module is used for fusing target feature maps of a current level i, a previous level i-1 and a next level i+1 to obtain fusion features Fi of the current level i;
Inputting the fusion characteristic Fi into an attention module to extract an attention characteristic Ai and multiplying the attention characteristic Ai with the fusion characteristic Fi to obtain an image characteristic Oi of the training image;
inputting the image characteristics Oi into the head network to obtain a predicted human body frame of a human body in the training image;
calculating a loss based on the predicted human body frame and the real human body frame;
and adjusting model parameters of the prone position human body detection model according to the loss until the prone position human body detection model converges.
2. The training method of the prone human body detection model according to claim 1, wherein an image including a prone human body is extracted from video data of a plurality of scenes as a training image, specifically comprising the steps of:
acquiring video data of a plurality of scenes including prone human bodies;
Extracting key frames containing human bodies from the video data;
and carrying out data expansion processing based on the key frame to obtain an expansion image, and determining the key frame and the expansion image as training images.
3. The method for training the prone human body detection model according to claim 1, wherein the step of constructing the prone human body detection model specifically comprises the following steps:
adopting an FPN network as a backbone network, and embedding a multi-scale fusion module in the FPN network;
a context fusion module and an attention module are constructed to serve as a neck network, the context fusion module is used for fusing a plurality of adjacent-level feature graphs output by the FPN network to obtain fusion features, and the attention module is used for extracting attention features from the fusion features and multiplying the fusion features and the attention features to serve as image features of the training images;
A YOLOX head network is adopted as the head network of the prone human body detection model, and the head network is used for outputting a predicted human body frame when image features are input.
4. The training method of the prone human detection model according to claim 1, wherein the loss is calculated based on the predicted human frame and the real human frame, specifically comprising the steps of:
the loss is calculated by the following formula:
loss = 1 - IOU² + λ(w_gt - w_pred)²
where IOU is the intersection-over-union of the predicted human body frame and the real human body frame, λ is a weighting coefficient, w_gt is the labeled width value of the real human body frame, and w_pred is the width of the predicted human body frame.
5. A prone human detection method, comprising:
collecting an image of a human body using an endoscope as an image to be detected;
inputting the image to be detected into a pre-trained prone human body detection model to obtain human body detection information;
wherein the prone human detection model is trained by the training method of the prone human detection model according to any one of claims 1-4.
6. The training system of the prone position human body detection model is characterized by comprising the following specific components:
the training image acquisition module is used for extracting images including prone human bodies from video data of a plurality of scenes to serve as training images, and labeling human bodies in the training images to obtain real human body frames;
the prone position human body detection model construction module is used for constructing a prone position human body detection model, wherein the model comprises a backbone network, a neck network and a head network which are sequentially connected, the backbone network comprises an FPN network and a multi-scale fusion module which is embedded in the FPN network and performs a plurality of convolution operations on the feature maps extracted at each level of the FPN network at least using convolution kernels whose width is greater than their height, and the neck network comprises a context fusion module and an attention module;
the prone position human body detection model training module is used for training the prone position human body detection model by adopting the training image;
the prone position human body detection model training module specifically comprises the following units:
the training image input unit is used for inputting the training image into the FPN network, the FPN network extracting initial feature maps Ci at a plurality of levels from the training image;
the convolution operation unit is used for performing standard convolution, asymmetric convolution and asymmetric dilated convolution operations on the initial feature map Cn of the last level n through the multi-scale fusion module to obtain the target feature map Pn of the nth level, wherein n is the number of levels of the FPN network;
the target feature map extraction unit is used for up-sampling the target feature map Pi+1 of the (i+1)th level to obtain an up-sampled feature map with the same scale as the initial feature map Ci of the ith level, and, after splicing the up-sampled feature map with the initial feature map Ci of the ith level, performing standard convolution, asymmetric convolution and asymmetric dilated convolution operations through the multi-scale fusion module to obtain the target feature map Pi of the ith level, wherein the width of the convolution kernels of the asymmetric convolution and the asymmetric dilated convolution is larger than their height, and i ≤ n-1;
the context feature fusion unit is used for inputting the target feature maps Pi of all levels into the context fusion module of the neck network, the context fusion module fusing the target feature maps of the current level i, the previous level i-1 and the next level i+1 to obtain the fusion feature Fi of the current level i;
the image feature calculation unit is used for inputting the fusion feature Fi into the attention module to extract the attention feature Ai, and multiplying the attention feature Ai by the fusion feature Fi to obtain the image features of the training image;
the prediction unit is used for inputting the image features into the head network to obtain a predicted human body frame of the human body in the training image;
the loss calculation unit is used for calculating the loss based on the predicted human body frame and the labeled real human body frame;
and the model parameter adjusting unit is used for adjusting the model parameters of the prone position human body detection model according to the loss until the prone position human body detection model converges.
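For illustration, a PyTorch sketch of the multi-scale fusion module and the top-down pass described by the convolution operation unit and the target feature map extraction unit. The 3x3 and 1x3 kernel sizes and the dilation rate are assumptions; the claim requires only that the asymmetric kernels be wider than they are tall:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Standard + asymmetric + asymmetric dilated convolution (illustrative)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.standard = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.asym = nn.Conv2d(in_ch, out_ch, (1, 3), padding=(0, 1))       # width > height
        self.asym_dilated = nn.Conv2d(in_ch, out_ch, (1, 3),
                                      padding=(0, 2), dilation=(1, 2))      # dilated, width > height

    def forward(self, x):
        return self.standard(x) + self.asym(x) + self.asym_dilated(x)

def top_down(c_feats, msf_modules):
    """P_n = MSF_n(C_n); P_i = MSF_i(concat(upsample(P_{i+1}), C_i)) for i <= n-1.

    msf_modules[i] must be constructed with in_ch matching the concatenated
    input (the splice doubles the channel count); that bookkeeping is assumed.
    """
    p = [msf_modules[-1](c_feats[-1])]
    for i in range(len(c_feats) - 2, -1, -1):
        up = F.interpolate(p[0], size=c_feats[i].shape[2:])       # match C_i's scale
        p.insert(0, msf_modules[i](torch.cat([up, c_feats[i]], dim=1)))
    return p
```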
7. A prone position human body detection system, characterized by comprising:
the to-be-detected image acquisition module is used for acquiring an image of a human body undergoing an endoscopic procedure as the image to be detected;
the human body detection module is used for inputting the image to be detected into a pre-trained prone position human body detection model to obtain human body detection information;
wherein the prone position human body detection model is trained by the training method of the prone position human body detection model according to any one of claims 1-4.
8. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the training method of the prone position human body detection model according to any one of claims 1-4 and/or the prone position human body detection method according to claim 5.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed, cause a processor to implement the training method of the prone position human body detection model according to any one of claims 1-4 and/or the prone position human body detection method according to claim 5.
CN202311754454.3A 2023-12-20 2023-12-20 Training method of prone position human body detection model, prone position human body detection method and system Active CN117437697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311754454.3A CN117437697B (en) 2023-12-20 2023-12-20 Training method of prone position human body detection model, prone position human body detection method and system

Publications (2)

Publication Number Publication Date
CN117437697A (en) 2024-01-23
CN117437697B (en) 2024-04-30

Family

ID=89555695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311754454.3A Active CN117437697B (en) 2023-12-20 2023-12-20 Training method of prone position human body detection model, prone position human body detection method and system

Country Status (1)

Country Link
CN (1) CN117437697B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117877071A * 2024-03-08 2024-04-12 Guangzhou Side Medical Technology Co., Ltd. Body position action recognition method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082855A * 2022-06-20 2022-09-20 Anhui Polytechnic University Pedestrian occlusion detection method based on improved YOLOX algorithm
CN116091892A * 2023-02-13 2023-05-09 Nanjing University of Science and Technology Rapid target detection method based on convolutional neural network
CN116630798A * 2023-05-16 2023-08-22 Shanghai Jiao Tong University SAR image aircraft target detection method based on improved YOLOv5
CN117037215A * 2023-08-15 2023-11-10 匀熵智能科技(无锡)有限公司 Human body posture estimation model training method, estimation device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Review on Techniques and Applications of Object Tracking and Gesture Recognition; Deepti Aggarwal et al.; 2022 International Mobile and Embedded Technology Conference (MECON); 2022-04-14; pp. 89-93 *
Research on Object Detection Algorithms Based on Multi-Scale Features in Complex Scenes; Dong Xiang; China Master's Theses Full-text Database, Information Science and Technology; 2023-07-15 (No. 07); pp. I138-221 *

Similar Documents

Publication Publication Date Title
US11410035B2 (en) Real-time target detection method deployed on platform with limited computing resources
WO2018108129A1 (en) Method and apparatus for use in identifying object type, and electronic device
CN117437697B (en) Training method of prone position human body detection model, prone position human body detection method and system
CN111402130B (en) Data processing method and data processing device
CN110969245B (en) Target detection model training method and device for medical image
CN109242863B (en) Ischemic stroke image region segmentation method and device
CN112200041B (en) Video motion recognition method and device, storage medium and electronic equipment
CN110598715A (en) Image recognition method and device, computer equipment and readable storage medium
CN111091536A (en) Medical image processing method, apparatus, device, medium, and endoscope
CN113362314B (en) Medical image recognition method, recognition model training method and device
CN111667459A (en) Medical sign detection method, system, terminal and storage medium based on 3D variable convolution and time sequence feature fusion
CN115170510B (en) Focus detection method and device, electronic equipment and readable storage medium
CN111192320B (en) Position information determining method, device, equipment and storage medium
CN114373226A (en) Human body posture estimation method based on improved HRNet network in operating room scene
CN114820652A (en) Method, device and medium for segmenting local quality abnormal region of mammary X-ray image
KR102413000B1 (en) Image labeling method, apparatus, electronic device, storage medium and computer program
CN112818946A (en) Training of age identification model, age identification method and device and electronic equipment
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN114596533A (en) Fall posture detection method and device, electronic equipment and readable storage medium
CN111080588A (en) Multi-scale neural network-based rapid fetal MR image brain extraction method
CN117094976B (en) Focus missing detection judging method, device and electronic equipment
CN117372261B (en) Resolution reconstruction method, device, equipment and medium based on convolutional neural network
CN113393401B (en) Object detection hardware accelerator, system, method, apparatus and medium
CN117333487B (en) Acne classification method, device, equipment and storage medium
CN113077427B (en) Method and device for generating class prediction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant