CN114913550A - Wounded person identification method and system based on deep learning under wound point gathering scene - Google Patents

Wounded person identification method and system based on deep learning under wound point gathering scene

Info

Publication number
CN114913550A
CN114913550A
Authority
CN
China
Prior art keywords
wounded
layer
picture
size
pictures
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210599623.XA
Other languages
Chinese (zh)
Inventor
楼云江
彭建文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202210599623.XA priority Critical patent/CN114913550A/en
Publication of CN114913550A publication Critical patent/CN114913550A/en
Priority to PCT/CN2022/128628 priority patent/WO2023231290A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes
    • G06T7/62Analysis of geometric attributes of area, perimeter, diameter or volume
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30242Counting objects in image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a wounded person identification method and system based on a deep learning neural network. The method comprises the following steps: S10, capturing at least one wounded picture in the casualty gathering point environment with a depth camera, and collecting the wounded pictures into a live data set of wounded pictures; S20, generating, by data augmentation, an additional wounded picture of smaller size for each original wounded picture taken at close range in the live data set, and storing the additional wounded picture in the live data set in association with the original wounded picture; S30, inputting the live pictures taken by the depth camera into a deep-learning-based neural network to calculate the number of wounded persons in the live pictures, wherein the neural network is trained on a pre-training data set and the live data set. The system includes a depth camera, a memory, and a processor that implements the method when executing instructions stored in the memory.

Description

Wounded person identification method and system based on deep learning under wound point gathering scene
Technical Field
The invention relates to a wounded person identification method and system based on deep learning in a casualty gathering point scene, and belongs to the technical field of computer software, in particular robot visual recognition. The technical scheme of the invention is particularly suitable for emergency rescue applications.
Background
In the event of large-scale casualties in collapse environments caused by earthquakes, fires and accidents, rescue teams generally concentrate the wounded in a certain area. Such an area has the following characteristics: it is open and far from the disaster site; the ground is flat or grassy; and it is convenient for transportation, so that the wounded can subsequently be transferred. Such an area where the wounded are placed together is generally referred to as a casualty gathering point. In order to quickly triage the wounded at a casualty gathering point, a system is required that carries navigation sensors and medical sensors to the side of the wounded for autonomous injury detection. As shown in fig. 1, in a casualty gathering point scene after an earthquake, a quadruped robot dog carrying a mechanical arm, a depth camera, a radar and various medical sensors serves as the hardware system for triage at the gathering point (the dotted-line area in fig. 1). A system is therefore needed whose primary task is to quickly identify, in the gathering point environment, the position of the wounded, the position of the gathering point, and whether any wounded are present.
Disclosure of Invention
The invention provides a wounded person identification method and system based on deep learning in a casualty gathering point scene, and aims to solve at least one of the technical problems in the prior art.
In one aspect, the technical scheme of the invention is a wounded person identification method based on a deep learning neural network, comprising the following steps:
S10, capturing at least one wounded picture in the casualty gathering point environment with a depth camera, and collecting the wounded pictures into a live data set of wounded pictures;
S20, generating, by data augmentation, an additional wounded picture of smaller size for each original wounded picture taken at close range in the live data set, and storing the additional wounded picture in the live data set in association with the original wounded picture;
S30, inputting the live pictures taken by the depth camera into a deep-learning-based neural network to calculate the number of wounded persons in the live pictures, wherein the neural network is trained on a pre-training data set and the live data set.
Further, the step S10 includes:
performing image acquisition from a distance, during the travel of a mobile device to the casualty gathering environment, through a depth camera carried on the mobile device, wherein the mobile device comprises a quadruped robot dog, a mobile robot, a mobile smart vehicle or a flying unmanned aerial vehicle.
Further, the step S20 includes:
S21, determining whether a wounded picture taken at close range is a picture with a low proportion of small targets;
S22, reducing the image size of a picture with a low proportion of small targets to one quarter of the original size, and stitching four reduced pictures into a picture of the same size as the original.
Further, the step S21 includes:
identifying that the area of the picture occupied by a human body outline in the captured picture is smaller than a preset pixel-area threshold, and determining the picture to be a picture containing a small target.
Further, for the step S30, the neural network includes: a base network layer with the VGG16 network architecture; auxiliary convolutional layers for extracting feature maps of different scales; and prediction convolutional layers, including a position prediction convolutional layer and a class prediction convolutional layer.
Further, the neural network is configured to: send the feature maps output by the conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2 layers to the prediction convolutional layers to acquire the position information and classification information of the wounded in the picture; and fuse the feature maps output by the conv4_3, conv7 and conv8_2 layers into a feature map of the same size as the conv4_3 output feature map, which is sent to the prediction convolutional layers in place of the conv4_3 feature map.
Further, the neural network is configured to:
the H1 x W1 x C1 feature map of the conv4_3 layer passes through a 3x3 convolution layer, followed by an L2 norm layer and a ReLU layer, so that a first feature map of the same size is output;
the H2 x W2 x C2 feature map of the conv7 layer passes through a deconvolution layer, followed by an L2 norm layer and a ReLU layer, so that a second feature map of size H1 x W1 x (C1/2) is output;
the H3 x W3 x C3 feature map of the conv8_2 layer passes through a deconvolution layer, followed by an L2 norm layer and a ReLU layer, so that a third feature map of size H1 x W1 x (C1/2) is output;
the output first, second and third feature maps are concatenated to generate a feature map of size H1 x W1 x (C1 x 2), where H, W and C respectively denote the dimensions of a feature map.
Further, the training of the neural network comprises the steps of:
S40, calculating a loss value for each point of the feature map output by the prediction convolutional layers through a loss function, and updating the model parameters of the neural network during training until the sum of the loss values over all points of the feature map is less than a preset threshold, at which point training stops, wherein the calculation of the loss function comprises:
calculating the overlap (IoU) between the prediction box and the real box at each point of the feature map; if the overlap is greater than a set threshold, the prediction box is given the same class label as the real box and is set as a positive class; if the overlap is less than the set threshold, the class of the prediction box is considered to be background and it is set as a negative class;
the loss function of the prediction box and the real box is equal to the sum of the position loss and the classification loss of the prediction box, wherein
the position loss of the prediction box is calculated as
L_loc = (1 / N_p) * Σ Distance(Box_i_pred, Box_i_real), summed over the positive prediction boxes;
the classification loss of the prediction box is calculated as
L_class = (1 / (N_p + N_n)) * Σ CE_loss(predicted class, labelled class), summed over the positive and negative prediction boxes;
where L_loc is the position loss function, L_class is the classification loss function, N_p is the number of positive-class prediction boxes, N_n is the number of negative-class prediction boxes, Box_i_pred is the coordinate information of the prediction box, Box_i_real is the coordinate information of the corresponding real box, Distance() denotes the Euclidean distance between coordinates, and CE_loss denotes the cross-entropy loss function.
Further, the training of the neural network may further include the steps of:
S51, collecting a number of wounded pictures from the Internet and adding them to the pre-training data set;
S52, reducing, by data augmentation, the image size of original wounded pictures taken at close range in the pre-training data set to one quarter of the original size, stitching four reduced pictures into a picture of the same size as the original, and storing the stitched wounded picture in the pre-training data set in association with the original wounded pictures.
Another aspect of the present invention relates to a wounded person identification system, including:
at least one depth camera carried by a mobile device;
a computer device connected to the depth camera, the computer device comprising a computer readable storage medium having stored thereon program instructions that, when executed by a processor, implement the method described above.
The beneficial effects of the invention are as follows.
With the identification method and system of this technical scheme, the neural network, through on-site self-learning, can quickly identify the position of the wounded, the position of the casualty gathering point, whether any wounded are present, and so on while travelling to the gathering point. The time the wounded spend waiting at the gathering point can therefore be greatly reduced, gaining precious time for quickly transferring them to the corresponding treatment points.
Drawings
Fig. 1 is a schematic diagram of an autonomous triage hardware system in a site of injury collection in one example.
Fig. 2 is a basic flow diagram of a victim identification method in an embodiment in accordance with the present invention.
Fig. 3 is a flow chart of the method for augmenting the picture of the small-sized wounded person in a picture augmentation manner according to the embodiment of the present invention.
Fig. 4 is a schematic diagram of an injury identification network infrastructure in an embodiment of the method according to the invention.
FIG. 5 is a schematic diagram of the auxiliary convolutional layers of the wounded identification network in an embodiment of the method according to the present invention.
FIG. 6 is a schematic diagram of the prediction convolutional layers of the wounded identification network in an embodiment of the method according to the present invention.
FIG. 7 is a schematic diagram of the feature-fused prediction convolutional layers of the wounded identification network in an embodiment of the method according to the present invention.
FIG. 8 is a schematic illustration of feature map fusion details in an embodiment of a method according to the invention.
FIG. 9 is a flow chart of training of the triage identification network in an embodiment of a method according to the invention.
Fig. 10 is a schematic illustration of a picture of an injured person in a training data set in an embodiment of the method according to the invention.
Fig. 11 is a diagram showing the effect of a wounded identification experiment in a casualty gathering point scene according to the technical solution of the present invention, in which the simulated wounded lie on a paved road surface.
Fig. 12 is a diagram showing the effect of a wounded identification experiment in a casualty gathering point scene, in which the simulated wounded lie on a flat lawn.
It should be understood that the words used in this specification and the accompanying drawings are words of description rather than of limitation, and may be used in various combinations without departing from the spirit and scope of the invention.
Detailed Description
The conception, the specific structure and the technical effects of the present invention will be clearly and completely described in conjunction with the embodiments and the accompanying drawings to fully understand the objects, the schemes and the effects of the present invention.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it can be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any combination of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element of the same type from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language (e.g., "such as" or "like"), provided herein is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
Referring to fig. 1, in some embodiments, the wounded identification system according to the present invention is generally used to identify the location of a casualty gathering point in a gathering point scene, to identify whether there are wounded at the gathering point and how many, and so on. The wounded identification system can include at least one depth camera carried by a mobile device and a computer device connected to the depth camera. The mobile device may be a quadruped robot dog, a mobile robot, a mobile smart vehicle or a flying unmanned aerial vehicle. Referring to fig. 1, taking a quadruped robot dog as an example, the depth camera 110 (for example, an Intel RealSense D455) carried by the robot dog 100 acquires field image data at the accident scene and generates pictures for the computer device to process and analyse. As shown in fig. 1, in its working task the robot dog 100 needs to search and advance from a distance towards each casualty gathering point (one gathering point is shown as the dotted area in fig. 1; there may be several), detect the wounded, and gradually approach each wounded person from far to near according to the detection results. In this process, an application program running on the computer device implements the wounded identification method according to the present invention: it analyses the images (or video) transmitted by the depth camera 110 in real time, extracts the pictures containing wounded persons, analyses the number and classification of the wounded, and then back-calculates the orientation of the wounded relative to the robot dog 100 from the shooting angle of the wounded pictures. The wounded identification method implemented in the computer device is described in more detail in the embodiments below with reference to figs. 2 to 12.
Referring to fig. 2, in some embodiments, a wounded identification method according to the present invention includes at least the following steps:
S10, capturing at least one wounded picture in the casualty gathering point environment with a depth camera, and collecting the wounded pictures into a live data set of wounded pictures;
S20, generating, by data augmentation, an additional wounded picture of smaller size for each original wounded picture taken at close range in the live data set, and storing the additional wounded picture in the live data set in association with the original wounded picture;
S30, inputting the live pictures taken by the depth camera into the deep-learning-based neural network to calculate the number of wounded persons in the live pictures. If the picture used for detection is a size-modified additional wounded picture of smaller size, the identified number of wounded can be associated with the original wounded picture of the original size.
Further, the neural network is trained on a pre-training data set and the live data set. The method according to the invention may therefore further comprise a training step for the neural network:
S40, calculating a loss value for each point of the feature map output by the prediction convolutional layers through a loss function, and updating the model parameters of the neural network during training until the sum of the loss values over all points of the feature map is less than a preset threshold, at which point training stops.
Detailed description of step S10
During the travel of the mobile device to the casualty gathering environment, image acquisition is performed from a distance by a depth camera mounted on the mobile device (such as a quadruped robot dog). In this process, an application program running on the computer device extracts pictures containing wounded persons by analysing the images (or video) transmitted by the depth camera in real time; these pictures are used to analyse the number and classification of the wounded, and the orientation of the wounded relative to the mobile device is then back-calculated from the shooting angle of the wounded pictures.
In addition, wounded pictures acquired by the onboard depth camera and confirmed in some way (for example by system review, by a superior system, or by other auxiliary sensors) can be used to build a wounded picture data set for continued training of the neural network, so that it learns to adapt to the on-site casualty scene.
Detailed description of step S20
S21, determining whether a wounded picture taken at close range is a picture with a low proportion of small targets; if the area of the picture occupied by a human body outline in the captured picture is identified to be smaller than a preset pixel-area threshold (such as 32 pixels x 32 pixels), the picture is determined to be a picture containing a small target;
S22, reducing the image size of a picture with a low proportion of small targets to one quarter of the original size, and stitching four reduced pictures into a picture of the same size as the original.
Specifically, the depth camera carried by the mobile device (such as a quadruped robot dog) detects the wounded from a distance and gradually approaches them from far to near according to the detection results. Because the field of view of the depth camera is wide, most of the detection performed while the quadruped robot moves towards the side of a wounded person is small-target detection, i.e. the wounded occupy less than 32 pixels x 32 pixels in the camera's field of view.
The sizes of the wounded in the corresponding pictures of the initial data set do not match this characteristic. In order to increase the number of small-target wounded samples in the wounded data set, this section proposes a data enhancement method that trains the target detector in a data-balanced manner: small-target samples are constructed and expanded by oversampling, the image size of pictures with a low proportion of small targets is reduced to 1/4, and 4 reduced images are stitched into a picture of the same size as the original for input to the network. The stitched image inevitably contains smaller target objects, thereby increasing the weight of small-target data, as shown in fig. 3.
With continued reference to fig. 3, preferably, if the size of the wounded in a wounded picture is smaller than a fixed value, the picture is considered to contain small-size wounded and can be input directly to the wounded detection network; otherwise, the picture passes through a preprocessing module, is reduced to 1/4, and four such pictures are combined into a picture of the original size, so that the combined picture contains small-size wounded targets and can be input to the wounded detection network. Of course, the four pictures combined need not be identical, which greatly enlarges the wounded data set and increases the number of small-size wounded targets. It should be noted that, since a picture synthesized by the preprocessing module from multiple pictures makes the wounded appear a multiple number of times (e.g. 4 times), the wounded count obtained by the detection network on the synthesized picture needs to be divided by the synthesis multiple (e.g. 4) to obtain the actual number of wounded.
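The preprocessing described above can be sketched in Python roughly as follows. This is only an illustrative sketch of the 1/4 reduction, the 2x2 stitching and the count correction; the function names (`contains_only_small_targets`, `make_mosaic`, `actual_wounded_count`), the use of PIL, and the per-box form of the small-target criterion are assumptions, not part of the patent text.

```python
from PIL import Image

SMALL_TARGET_AREA = 32 * 32  # preset pixel-area threshold for a "small target"

def contains_only_small_targets(person_boxes, threshold=SMALL_TARGET_AREA):
    """True if every human-body outline box (x1, y1, x2, y2) is below the threshold,
    i.e. the picture already contains small-size wounded and can be fed to the network directly."""
    return all((x2 - x1) * (y2 - y1) < threshold for x1, y1, x2, y2 in person_boxes)

def make_mosaic(pictures):
    """Reduce four (not necessarily identical) pictures to 1/4 of the original area
    (half width, half height) and stitch them into one picture of the original size."""
    assert len(pictures) == 4
    w, h = pictures[0].size
    canvas = Image.new("RGB", (w, h))
    for idx, pic in enumerate(pictures):
        tile = pic.resize((w // 2, h // 2))
        canvas.paste(tile, ((idx % 2) * (w // 2), (idx // 2) * (h // 2)))
    return canvas

def actual_wounded_count(detected_count, synthesis_multiple=4):
    """Divide the count detected on a synthesized picture by the synthesis multiple."""
    return detected_count / synthesis_multiple
```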
Detailed description of step S30
The neural network structure for wounded identification comprises a base network layer, auxiliary convolutional layers and prediction convolutional layers. The base network layer can use the VGG16 network architecture as the base network. The auxiliary convolutional layers are used to extract feature maps of different scales; detecting and locating the wounded with feature maps from several layers of different depths makes it possible to detect wounded of different sizes, and in the wounded identification network the auxiliary convolutional layers output 4 feature maps of different scales. The prediction convolutional layers are divided into a position prediction layer with four pieces of position information and a classification prediction layer with two pieces of information. The base network, the auxiliary convolutional layers and the prediction convolutional layers are shown in figs. 4, 5 and 6, respectively.
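As a purely illustrative sketch of this three-part structure (base network, auxiliary convolutional layers, prediction convolutional layers), the following PyTorch code wires a VGG16 base to a few auxiliary layers and to position/class prediction heads. The specific layer parameters, the use of only four of the feature maps, and the class count of 2 (wounded/background) are assumptions made for the sketch and are not taken from the patent.

```python
import torch
import torch.nn as nn
import torchvision

class WoundedDetectorSketch(nn.Module):
    """Base network (VGG16) + auxiliary convolutional layers + prediction convolutional layers."""

    def __init__(self, num_classes=2):  # assumed: wounded vs. background
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None).features
        self.base_to_conv4_3 = vgg[:23]      # VGG16 layers up to and including conv4_3 + ReLU
        self.base_rest = vgg[23:30]          # pool4 and the conv5 block
        # conv6/conv7 stand in for VGG16's fully connected layers, as in SSD-style detectors
        self.conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
        self.conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
        # auxiliary convolutional layers producing smaller-scale feature maps
        self.conv8_2 = nn.Sequential(
            nn.Conv2d(1024, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.conv9_2 = nn.Sequential(
            nn.Conv2d(512, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # prediction convolutional layers: 4 position values and num_classes scores per point
        channels = [512, 1024, 512, 256]     # channels of conv4_3, conv7, conv8_2, conv9_2
        self.loc_heads = nn.ModuleList([nn.Conv2d(c, 4, 3, padding=1) for c in channels])
        self.cls_heads = nn.ModuleList([nn.Conv2d(c, num_classes, 3, padding=1) for c in channels])

    def forward(self, x):
        f1 = self.base_to_conv4_3(x)             # conv4_3 feature map
        y = self.base_rest(f1)
        y = torch.relu(self.conv6(y))
        f2 = torch.relu(self.conv7(y))           # conv7 feature map
        f3 = self.conv8_2(f2)                    # conv8_2 feature map
        f4 = self.conv9_2(f3)                    # conv9_2 feature map
        feats = [f1, f2, f3, f4]
        locs = [head(f) for head, f in zip(self.loc_heads, feats)]
        classes = [head(f) for head, f in zip(self.cls_heads, feats)]
        return locs, classes
```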
In the gathering point scene, the mobile device carrying the depth camera identifies the wounded from a distance and then slowly approaches them from far to near according to their positions. Because the camera's field of view is wide, the mobile device spends most of this far-to-near identification process recognising small-size wounded. In some preferred embodiments, the wounded identification network may therefore be optimised in the following ways to account for small-size wounded targets.
Preferably, referring to fig. 7, based on the base network and auxiliary convolutional layers of the wounded identification network, the network adopted in the identification method according to the present invention obtains wounded position information and classification information by sending the six feature maps output by the conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2 layers to the prediction convolutional layers. In order to better use context information as additional help for detecting small-size wounded, the three feature maps output by the conv4_3, conv7 and conv8_2 layers are fused into a feature map of the same size as the conv4_3 output feature map, which is input to the prediction convolutional layers in place of the conv4_3 feature map. The conv4_3, conv7 and conv8_2 layers, rather than the last three layers, are selected for feature fusion because small-size wounded occupy few pixels and have low resolution, so larger feature maps are required for fusion. Fig. 7 shows a schematic diagram of the wounded identification network prediction convolutional layers after feature fusion.
Referring to fig. 8, specific feature fusion details are as follows:
(1) the conv4_3 layer feature map H1 x W1 x C1 passes through a convolution layer (convolution size 3x3, padding 1, stride 1), followed by an L2 norm layer and a ReLU layer, and a feature map of the same size is output;
(2) the conv7 layer feature map H2 x W2 x C2 passes through a deconvolution layer, followed by an L2 norm layer and a ReLU layer, and a feature map of size H1 x W1 x (C1/2) is output;
(3) the conv8_2 layer feature map H3 x W3 x C3 passes through a deconvolution layer, followed by an L2 norm layer and a ReLU layer, and a feature map of size H1 x W1 x (C1/2) is output;
(4) the three output feature maps are concatenated to generate a feature map of size H1 x W1 x (C1 x 2).
Here, H1 x W1 x C1 is 38x38x512, H2 x W2 x C2 is 19x19x1024, and H3 x W3 x C3 is 10x10x512.
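A minimal PyTorch sketch of this fusion module, assuming the sizes just stated (conv4_3: 38x38x512, conv7: 19x19x1024, conv8_2: 10x10x512) and a fused output of 38x38x1024, is given below. The deconvolution kernel sizes and the L2Norm implementation are assumptions chosen so that the spatial sizes match; the patent does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Norm(nn.Module):
    """Channel-wise L2 normalisation with a learnable scale (SSD-style)."""
    def __init__(self, channels, scale=20.0):
        super().__init__()
        self.weight = nn.Parameter(torch.full((channels,), scale))

    def forward(self, x):
        x = F.normalize(x, p=2, dim=1)
        return x * self.weight.view(1, -1, 1, 1)

class FeatureFusion(nn.Module):
    """Fuse conv4_3 (38x38x512), conv7 (19x19x1024) and conv8_2 (10x10x512)
    into a single 38x38x1024 feature map that replaces the conv4_3 map."""
    def __init__(self):
        super().__init__()
        # branch 1: 3x3 convolution (padding 1, stride 1) keeps 38x38x512
        self.branch1 = nn.Conv2d(512, 512, 3, padding=1)
        self.norm1 = L2Norm(512)
        # branch 2: deconvolution 19x19x1024 -> 38x38x256 (C1/2 channels)
        self.branch2 = nn.ConvTranspose2d(1024, 256, kernel_size=2, stride=2)
        self.norm2 = L2Norm(256)
        # branch 3: deconvolution 10x10x512 -> 38x38x256 (C1/2 channels)
        self.branch3 = nn.ConvTranspose2d(512, 256, kernel_size=4, stride=4, padding=1)
        self.norm3 = L2Norm(256)

    def forward(self, conv4_3, conv7, conv8_2):
        f1 = F.relu(self.norm1(self.branch1(conv4_3)))   # 38x38x512
        f2 = F.relu(self.norm2(self.branch2(conv7)))     # 38x38x256
        f3 = F.relu(self.norm3(self.branch3(conv8_2)))   # 38x38x256
        return torch.cat([f1, f2, f3], dim=1)            # 38x38x1024
```

For example, calling the module on random tensors of shapes (1, 512, 38, 38), (1, 1024, 19, 19) and (1, 512, 10, 10) returns a (1, 1024, 38, 38) tensor, i.e. twice the channel count of the conv4_3 map.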
Detailed description of step S40
The overall flow of network training is shown in fig. 9: a batch of picture data (from the pre-training data set, the live data set and the initial data set) is input into the neural network model for training, and the loss function is then computed and used to adjust the model parameters. The key to training the network is defining the computation of the loss function. The prediction convolutional layers predict position information and classification information for each point of the feature map, and the loss value of a point is equal to the sum of its position loss and its classification loss. The IoU (overlap) between the prior box and the real box is computed at each point of the feature map; if the IoU is greater than a set fixed threshold, the prior box is given the same class label as the real box and is called a positive class; if it is less than the threshold, the class of the prior box is considered to be background and it is called a negative class. The loss function for the prediction box and the real box, equal to the sum of the position loss and the classification loss of the prediction box, can be expressed as
Loss = L_class + L_loc    (1.1)
where Loss is the overall loss function, L_class is the classification loss function, and L_loc is the position loss function. L_class and L_loc are calculated as in formulas 1.2 and 1.3:
L_class = (1 / (N_p + N_n)) * Σ CE_loss(predicted class, labelled class), summed over the positive and negative prediction boxes    (1.2)
L_loc = (1 / N_p) * Σ Distance(Box_i_pred, Box_i_real), summed over the positive prediction boxes    (1.3)
where N_p is the number of positive-class prediction boxes, N_n is the number of negative-class prediction boxes, Box_i_pred is the coordinate information of a prediction box, Box_i_real is the coordinate information of the corresponding ground-truth box, Distance() denotes the Euclidean distance between coordinates, and CE_loss denotes the cross-entropy loss function.
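The loss computation of formulas 1.1 to 1.3 can be sketched in PyTorch as follows. The argument names (`pred_boxes`, `pred_logits`, `gt_boxes`, `gt_labels`, `iou_matrix`) and the 0.5 IoU threshold are illustrative assumptions; the patent only fixes the structure Loss = L_class + L_loc with a Euclidean-distance position term and a cross-entropy classification term.

```python
import torch
import torch.nn.functional as F

def wounded_detection_loss(pred_boxes, pred_logits, gt_boxes, gt_labels,
                           iou_matrix, iou_threshold=0.5):
    """pred_boxes: (N, 4), pred_logits: (N, 2), gt_boxes: (M, 4), gt_labels: (M,),
    iou_matrix: (N, M) IoU between each prediction/prior box and each real box."""
    best_iou, best_gt = iou_matrix.max(dim=1)        # best-matching real box per prediction box
    positive = best_iou > iou_threshold              # positive class: same label as matched real box
    negative = ~positive                             # negative class: labelled as background (class 0)

    target_labels = torch.zeros_like(best_gt)
    target_labels[positive] = gt_labels[best_gt[positive]]

    n_p = int(positive.sum())
    n_n = int(negative.sum())

    # L_loc (formula 1.3): mean Euclidean distance between positive prediction boxes and their real boxes
    loc_loss = torch.tensor(0.0)
    if n_p > 0:
        diff = pred_boxes[positive] - gt_boxes[best_gt[positive]]
        loc_loss = diff.pow(2).sum(dim=1).sqrt().sum() / n_p

    # L_class (formula 1.2): cross-entropy over positive and negative boxes, normalised by N_p + N_n
    class_loss = F.cross_entropy(pred_logits, target_labels, reduction="sum") / (n_p + n_n)

    return class_loss + loc_loss                     # Loss = L_class + L_loc (formula 1.1)
```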
Regarding the initial training data set for the network model: public data sets are mostly general target detection or pedestrian detection data sets, and there is no dedicated wounded detection data set. Because of the particular lying posture of the wounded, the accuracy of wounded detection is unsatisfactory without a dedicated wounded data set. Therefore, wounded pictures were collected in a simulated gathering point scene using the depth camera carried by the quadruped robot dog, further wounded pictures were retrieved from the Internet, and thousands of wounded pictures were sorted to form the initial pre-training data set. Then, by data augmentation, the image size of original wounded pictures taken at close range in the pre-training data set can be reduced to one quarter of the original size, four reduced pictures can be stitched into a picture of the same size as the original, and the stitched wounded picture can be stored in the pre-training data set in association with the original pictures, so that the pre-training data set contains thousands of wounded targets. These pictures containing the wounded constitute the initial version of the wounded image data set (as shown in fig. 10) used for training the network model.
Experimental validation of the wounded identification method and system according to the present invention
In order to better approximate a casualty gathering scene, the experimental scene was set on a wide cement surface or lawn, four wounded models were laid out on the ground, other pedestrians simulated rescue workers shuttling among the wounded models, and the quadruped robot dog carried the depth camera to identify the wounded from far to near. The experimental results are shown in figs. 11 and 12, where the LYING PERSON label indicates a detected wounded person.
It should be recognized that the method steps in embodiments of the present invention may be embodied or carried out by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The method may use standard programming techniques. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, the operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described herein includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention may also include the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described herein to transform the input data to generate output data that is stored to non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.

Claims (10)

1. A wounded person identification method based on a deep learning neural network, characterized by comprising the following steps:
S10, capturing at least one wounded picture in the casualty gathering point environment with a depth camera, and collecting the wounded pictures into a live data set of wounded pictures;
S20, generating, by data augmentation, an additional wounded picture of smaller size for each original wounded picture taken at close range in the live data set, and storing the additional wounded picture in the live data set in association with the original wounded picture;
S30, inputting the live pictures taken by the depth camera into a deep-learning-based neural network to calculate the number of wounded persons in the live pictures, wherein the neural network is trained on a pre-training data set and the live data set.
2. The method according to claim 1, wherein the step S10 comprises:
performing image acquisition from a distance, during the travel of a mobile device to the casualty gathering environment, through a depth camera carried on the mobile device, wherein the mobile device comprises a quadruped robot dog, a mobile robot, a mobile smart vehicle or a flying unmanned aerial vehicle.
3. The method according to claim 1, wherein the step S20 comprises:
S21, determining whether a wounded picture taken at close range is a picture with a low proportion of small targets;
S22, reducing the image size of a picture with a low proportion of small targets to one quarter of the original size, and stitching four reduced pictures into a picture of the same size as the original.
4. The method according to claim 3, wherein the step S21 comprises:
identifying that the area of the picture occupied by a human body outline in the captured picture is smaller than a preset pixel-area threshold, and determining the picture to be a picture containing a small target.
5. The method according to claim 1, wherein for the step S30, the neural network comprises:
the underlying network layer of the VGG16 network architecture;
an auxiliary convolution layer for extracting feature maps of different scales;
prediction convolutional layers, including a position prediction convolutional layer and a class prediction convolutional layer.
6. The method of claim 5, wherein the neural network is configured to:
sending feature maps output by the conv4_3 layer, the conv7 layer, the conv8_2 layer, the conv9_2 layer, the conv10_2 layer and the conv11_2 layer to a prediction convolutional layer to acquire wounded position information and classification information in a picture;
feature maps output by the conv4_3 layer, the conv7 layer and the conv8_2 layer are fused into a feature map with the same size as the conv4_3 layer output feature map, and are sent to the prediction convolutional layer instead of the conv4_3 layer feature map.
7. The method of claim 6, wherein the neural network is configured to:
the H1 x W1 x C1 feature map of the conv4_3 layer passes through a 3x3 convolution layer, followed by an L2 norm layer and a ReLU layer, so that a first feature map of the same size is output;
the H2 x W2 x C2 feature map of the conv7 layer passes through a deconvolution layer, followed by an L2 norm layer and a ReLU layer, so that a second feature map of size H1 x W1 x (C1/2) is output;
the H3 x W3 x C3 feature map of the conv8_2 layer passes through a deconvolution layer, followed by an L2 norm layer and a ReLU layer, so that a third feature map of size H1 x W1 x (C1/2) is output;
the output first, second and third feature maps are concatenated to generate a feature map of size H1 x W1 x (C1 x 2), where H, W and C respectively denote the dimensions of a feature map.
8. The method of claim 5, wherein the training of the neural network comprises the steps of:
S40, calculating a loss value for each point of the feature map output by the prediction convolutional layers through a loss function, and updating the model parameters of the neural network during training until the sum of the loss values over all points of the feature map is smaller than a preset threshold, at which point training stops, wherein the calculation of the loss function comprises the following steps:
calculating the overlapping degree of the prediction frame and the real frame of each point of the characteristic diagram; if the overlapping degree is larger than the set threshold value, the class marked by the prediction frame and the class marked by the real frame are the same, and the prediction frame is set as a positive class; if the overlapping degree is smaller than the set threshold value, the class marked by the prediction frame is considered as the background and is set as a negative class;
the loss function of the prediction box and the real box is equal to the sum of the position loss and the classification loss of the prediction box, wherein
the position loss of the prediction box is calculated as
L_loc = (1 / N_p) * Σ Distance(Box_i_pred, Box_i_real), summed over the positive prediction boxes;
the classification loss of the prediction box is calculated as
L_class = (1 / (N_p + N_n)) * Σ CE_loss(predicted class, labelled class), summed over the positive and negative prediction boxes;
where L_loc is the position loss function, L_class is the classification loss function, N_p is the number of positive-class prediction boxes, N_n is the number of negative-class prediction boxes, Box_i_pred is the coordinate information of the prediction box, Box_i_real is the coordinate information of the corresponding real box, Distance() denotes the Euclidean distance between coordinates, and CE_loss denotes the cross-entropy loss function.
9. The method of any one of claims 1 to 8, wherein the training of the neural network comprises the steps of:
S51, collecting a number of wounded pictures from the Internet and adding them to the pre-training data set;
S52, reducing, by data augmentation, the image size of original wounded pictures taken at close range in the pre-training data set to one quarter of the original size, stitching four reduced pictures into a picture of the same size as the original, and storing the stitched wounded picture in the pre-training data set in association with the original wounded pictures.
10. A wounded person identification system, comprising:
at least one depth camera carried by a mobile device;
a computer device connected to the depth camera, the computer device comprising a computer readable storage medium having stored thereon program instructions that, when executed by a processor, implement the method of any of claims 1-9.
CN202210599623.XA 2022-05-30 2022-05-30 Wounded person identification method and system based on deep learning under wound point gathering scene Pending CN114913550A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210599623.XA CN114913550A (en) 2022-05-30 2022-05-30 Wounded person identification method and system based on deep learning under wound point gathering scene
PCT/CN2022/128628 WO2023231290A1 (en) 2022-05-30 2022-10-31 Casualty recognition method and system based on deep learning in casualty gathering place scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210599623.XA CN114913550A (en) 2022-05-30 2022-05-30 Wounded person identification method and system based on deep learning under wound point gathering scene

Publications (1)

Publication Number Publication Date
CN114913550A true CN114913550A (en) 2022-08-16

Family

ID=82768950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210599623.XA Pending CN114913550A (en) 2022-05-30 2022-05-30 Wounded person identification method and system based on deep learning under wound point gathering scene

Country Status (2)

Country Link
CN (1) CN114913550A (en)
WO (1) WO2023231290A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023231290A1 (en) * 2022-05-30 2023-12-07 哈尔滨工业大学(深圳) Casualty recognition method and system based on deep learning in casualty gathering place scene

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344690B (en) * 2018-08-09 2022-09-23 上海青识智能科技有限公司 People counting method based on depth camera
CN109886085A (en) * 2019-01-03 2019-06-14 四川弘和通讯有限公司 People counting method based on deep learning target detection
US11443549B2 (en) * 2019-07-09 2022-09-13 Josh Lehman Apparatus, system, and method of providing a facial and biometric recognition system
CN114511710A (en) * 2022-02-10 2022-05-17 北京工业大学 Image target detection method based on convolutional neural network
CN114913550A (en) * 2022-05-30 2022-08-16 哈尔滨工业大学(深圳) Wounded person identification method and system based on deep learning under wound point gathering scene


Also Published As

Publication number Publication date
WO2023231290A1 (en) 2023-12-07

Similar Documents

Publication Publication Date Title
CN112740268B (en) Target detection method and device
US11755918B2 (en) Fast CNN classification of multi-frame semantic signals
US11836884B2 (en) Real-time generation of functional road maps
CN111401517B (en) Method and device for searching perceived network structure
CN113591872A (en) Data processing system, object detection method and device
CN112084835A (en) Generating map features based on aerial data and telemetry data
Sulistijono et al. Implementation of victims detection framework on post disaster scenario
US11900257B2 (en) Method for representing an environment of a mobile platform
CN114913550A (en) Wounded person identification method and system based on deep learning under wound point gathering scene
CN114972182A (en) Object detection method and device
Khalilullah et al. Road area detection method based on DBNN for robot navigation using single camera in outdoor environments
KR101862545B1 (en) Method and system for providing rescue service using robot
Vempati et al. Victim detection from a fixed-wing uav: Experimental results
CN113065637A (en) Perception network and data processing method
Gemerek Active vision and perception
Pathak et al. Mobile Rescue Robot
Xiang et al. A Study of Autonomous Landing of UAV for Mobile Platform
Klette et al. Computer Vision in Vehicles
US20220284623A1 (en) Framework For 3D Object Detection And Depth Prediction From 2D Images
Li et al. Multi-scale Small Target Detection for Indoor Mobile Rescue Vehicles Based on Improved YOLOv5
Das Vision-Based Lane and Vehicle Detection: A First Step Toward Autonomous Unmanned Vehicle
Shen A Novel Three-Dimensional Navigation Method for the Visually Impaired
Zbala et al. Image Classification for Autonomous Vehicles
Watson Improved Ground-Based Monocular Visual Odometry Estimation using Inertially-Aided Convolutional Neural Networks
Aburaya et al. Review of vision-based reinforcement learning for drone navigation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination