WO2021051547A1 - Violent behavior detection method and system - Google Patents

Violent behavior detection method and system

Info

Publication number
WO2021051547A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
size
feature
network
human body
Prior art date
Application number
PCT/CN2019/117407
Other languages
English (en)
French (fr)
Inventor
王健宗
王义文
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051547A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/26Government or public services
    • G06Q50/265Personal security, identity or safety
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]

Definitions

  • the embodiments of the present application relate to the field of big data, and in particular to a violent behavior detection method, a violent behavior detection system, computer equipment, and a readable storage medium.
  • the method of human body pose estimation includes:
  • Structured Feature Learning, which fine-tunes a Convolutional Neural Network (CNN); however, the accuracy of this multi-person pose estimation is not high;
  • DeepCut and DeeperCut, which use a CNN to extract candidate regions of body parts; however, this approach is computationally very complex and slow;
  • the Convolutional Pose Machine (CPM), which uses a sequential convolution architecture to express spatial and texture information; although it has good robustness, the network is relatively complex.
  • an embodiment of the present application provides a method for detecting violent behavior, and the method includes:
  • the human body posture estimation result is matched with the violent behavior human body posture stored in the database to determine whether there is a violent behavior in the scene image according to the matching result, and the violent behavior is classified.
  • an embodiment of the present application also provides a violent behavior detection system, including:
  • the acquisition module is used to acquire scene images through a camera
  • the detection module is used to input the scene image into the feature pyramid network and obtain the target human body from the scene image;
  • the human body pose estimation module is configured to use the cascaded pyramid network to estimate the target human body pose to obtain the human body pose estimation result, wherein the cascaded pyramid network includes the GlobalNet network and the RefineNet network; and
  • the classification module is configured to match the human body posture estimation result with the violent behavior human body posture stored in the database, to determine whether there is a violent behavior in the scene image according to the matching result, and to classify the violent behavior.
  • an embodiment of the present application also provides a computer device, which includes a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor; when the computer-readable instructions are executed by the processor, the following steps are implemented:
  • the human body posture estimation result is matched with the violent behavior human body posture stored in the database to determine whether there is a violent behavior in the scene image according to the matching result, and the violent behavior is classified.
  • the embodiments of the present application also provide a non-volatile computer-readable storage medium storing computer-readable instructions, which may be executed by at least one processor, so that the at least one processor executes the following steps:
  • the human body posture estimation result is matched with the violent behavior human body posture stored in the database to determine whether there is a violent behavior in the scene image according to the matching result, and the violent behavior is classified.
  • the violent behavior detection method, violent behavior detection system, computer device, and readable storage medium provided by the embodiments of this application first pass the acquired scene image through a feature pyramid network to detect the bounding box of the target human body, then use a cascaded pyramid network to estimate the body pose of the target human body, and match the estimation result against the violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior is present and to classify it. Through the embodiments of the present application, each key point of the human body can be successfully located, which greatly improves recognition accuracy and reduces the amount of computation.
  • FIG. 1 is a flow chart of the steps of the method for detecting violence in the first embodiment of the application.
  • FIG. 2 is a schematic diagram of obtaining a dimensional feature image according to an embodiment of the application.
  • FIG. 3 is a schematic diagram of the overall framework of FPN-based fast R-CNN target detection according to an embodiment of the application.
  • Fig. 4 is a structural diagram of a residual block according to an embodiment of the application.
  • FIG. 5 is a schematic diagram of the hardware architecture of the computer device according to the second embodiment of the application.
  • Fig. 6 is a schematic diagram of program modules of the violence detection system according to the third embodiment of the application.
  • FIG. 1 shows a flow chart of the steps of the violent behavior detection method of the first embodiment of the present application. It can be understood that the flowchart in this method embodiment is not used to limit the order in which the steps are executed. It should be noted that, in this embodiment, the computer device 2 is used as the execution subject for exemplary description. The details are as follows:
  • Step S100 Obtain a scene image through a camera.
  • a surveillance camera is installed in a public place, and its real-time video stream is used to transfer event data to the cloud frame by frame for processing.
  • the computer device 2 obtains the captured scene image to detect violent behavior in the scene image.
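  • as a minimal sketch of this acquisition step (assuming OpenCV; the stream source and the downstream detect_violence call are hypothetical placeholders, not part of the disclosure):

```python
import cv2  # OpenCV video capture

def stream_frames(source=0):
    """Yield scene images frame by frame from a camera (device index or stream URL)."""
    cap = cv2.VideoCapture(source)
    try:
        while True:
            ok, frame = cap.read()  # frame is an H x W x 3 BGR image
            if not ok:
                break
            yield frame
    finally:
        cap.release()

# Hypothetical usage: forward each frame to the detection pipeline.
# for frame in stream_frames("rtsp://camera.example/stream"):
#     detect_violence(frame)  # placeholder for the method described below
```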
  • Step S102 Input the scene image into a feature pyramid network, and obtain a target human body from the scene image.
  • the feature pyramid is a basic component of the multi-scale target detection system.
  • the image features are obtained, and the target human body is detected from them, for example, User A exhibiting violent behavior and User B behaving normally.
  • the scene image is first passed through the convolutional network, and the feature image of the highest layer of the convolutional network is extracted to obtain a first-size feature image. Then, the first-size feature image is up-sampled by bilinear interpolation to a first intermediate-size feature image, which is fused with the output image of the first intermediate size in the convolutional network to obtain a first fusion result; the first fusion result is output to obtain a second-size feature image.
  • the size of the acquired scene image is 128*128*3, where 3 denotes the three RGB channels.
  • the scene image is input into the feature pyramid network, and convolution transformations reduce it to a minimum-size first-size feature image of 16*16*128.
  • then, the first-size feature image is up-sampled by bilinear interpolation to obtain a 32*32*128 feature image, which is fused with the 32*32*64 feature image output by the corresponding convolutional layer in the feature pyramid network to obtain a second-size feature image of 32*32*64.
  • Bilinear interpolation is a kind of interpolation algorithm, which is an extension of linear interpolation.
  • the four real pixel values around the target point in the original image are used to jointly determine a pixel value in the target image.
  • the core idea is to perform a linear interpolation in two directions.
  • the main purpose of up-sampling is to enlarge the image; almost all methods use interpolation, that is, on the basis of the original image pixels, a suitable interpolation algorithm is used to insert new elements between the pixel values.
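  • the two-direction linear interpolation described above can be written out directly; a minimal NumPy sketch for a single-channel image, ignoring boundary handling:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Interpolate img (H x W) at a fractional location (x, y) from its four
    surrounding real pixels: linear interpolation in x, then in y."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    dx, dy = x - x0, y - y0
    top = (1 - dx) * img[y0, x0] + dx * img[y0, x1]     # interpolate along x (top row)
    bottom = (1 - dx) * img[y1, x0] + dx * img[y1, x1]  # interpolate along x (bottom row)
    return (1 - dy) * top + dy * bottom                 # interpolate along y

img = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(img, 1.5, 2.25))  # value jointly determined by four neighbors
```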
  • Feature-level image fusion extracts feature information from the source image, i.e., the targets or regions of interest to the observer, such as edges, people, buildings, or vehicles, and then analyzes, processes, and integrates this feature information to obtain the fused image features.
  • the accuracy of target recognition on the fused features is significantly higher than that of the original image.
  • Feature-level fusion compresses the image information before computer analysis and processing; compared with the pixel level, it consumes less memory and time, improving the real-time performance of the required images.
  • Feature-level image fusion places lower demands on image-matching accuracy than the first (pixel) level and computes faster, but because it extracts image features as the fusion information, it loses many detailed features.
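  • a minimal PyTorch sketch of the up-sample-and-fuse step using the sizes from the example above; the 1*1 lateral convolution that matches channel counts (128 to 64) is an assumption borrowed from common FPN practice, since the text does not specify how the 32*32*128 and 32*32*64 maps are aligned:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lateral = nn.Conv2d(128, 64, kernel_size=1)  # assumed channel-matching step

def fuse_level(top, skip):
    """Up-sample the higher-level map by bilinear interpolation and fuse it
    with the same-resolution output of the convolutional network."""
    up = F.interpolate(top, scale_factor=2, mode="bilinear", align_corners=False)
    return lateral(up) + skip  # element-wise fusion

first = torch.randn(1, 128, 16, 16)   # first-size feature image (16*16*128)
skip32 = torch.randn(1, 64, 32, 32)   # convolutional-layer output (32*32*64)
second = fuse_level(first, skip32)    # second-size feature image (32*32*64)
print(second.shape)  # torch.Size([1, 64, 32, 32])
```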
  • after the second-size feature image is acquired, it is up-sampled to a second intermediate-size feature image by the bilinear interpolation method. The second intermediate-size feature image is then fused with the output image of the second intermediate size in the convolutional network to obtain a second fusion result, which is output to obtain a third-size feature image.
  • the first-size feature image, the second-size feature image, and the third-size feature image are input into the RPN network; region-box detection is performed on each of them, and the regions of interest and the region with the highest category score among the regions of interest are obtained according to the detection results, so as to obtain the target human body.
  • the 16*16*128 first-size feature image, the 32*32*64 second-size feature image, and the 64*64*32 third-size feature image are input into the RPN network for target detection; if the detection results are [person; 0.6], [person; 0.65], and [person; 0.8], the region with the detection result [person; 0.8] is taken as the target human body.
  • the feature maps of the second to fifth convolutional layers are obtained and fused to obtain the second- to fifth-layer feature maps P2 to P5; the feature maps are then passed through the ROI Pooling layer, the pooled results are passed through a fully connected layer, a classification result is obtained from the classifier and a bounding-box regression result from the box regression, and the classification and box-regression results are combined to obtain the bounding box of the target human body.
  • the fusion method is the same as the fusion method in other embodiments of this application, so it will not be repeated.
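  • the ROI Pooling step on a fused feature map can be sketched with torchvision's roi_pool; the box coordinates, pooled size, and head dimensions below are illustrative assumptions, not the patent's specification:

```python
import torch
from torchvision.ops import roi_pool

p2 = torch.randn(1, 64, 64, 64)                  # one fused feature map, e.g. P2
# One candidate region per row: (batch_index, x1, y1, x2, y2) in image coordinates.
boxes = torch.tensor([[0., 32., 32., 96., 96.]])
pooled = roi_pool(p2, boxes, output_size=(7, 7), spatial_scale=64 / 128)
flat = pooled.flatten(1)                         # feed the fully connected layers
cls_head = torch.nn.Linear(flat.shape[1], 2)     # person / background scores
box_head = torch.nn.Linear(flat.shape[1], 4)     # bounding-box regression
print(cls_head(flat).shape, box_head(flat).shape)
```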
  • Step S104 Perform body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, where the cascaded pyramid network includes a GlobalNet network and a RefineNet network.
  • simple visible key points are located through the GlobalNet network, and difficult key points are further processed through the RefineNet network to realize the estimation of the human body pose of the target human body.
  • the detection result image is also sent to the GlobalNet network, which contains no pooling layer; the size of the feature map output by each convolutional layer is instead reduced by setting the stride of the convolutions between layers. Then, the first, second, third, and fourth feature maps output by the second to fifth convolutional layers are obtained, and a 3*3 convolution filter is applied to each of them to generate heat maps of simple key points.
  • the network without the pooling layer design can also be referred to as a fully convolutional network, which can improve the accuracy of the GlobalNet network for detecting the human body frame.
  • the first and second feature maps have higher spatial resolution for localization but carry less semantic information, while the third and fourth feature maps carry more semantic information but have lower resolution; feature information with different semantic content and different resolutions is fused to improve accuracy.
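  • a sketch of the two GlobalNet ideas just described: each stage halves its feature map with a stride-2 convolution instead of pooling, and each stage's output gets a 3*3 convolution emitting per-keypoint heat maps. The channel widths and the number of key points (17) are assumptions for illustration:

```python
import torch
import torch.nn as nn

NUM_KEYPOINTS = 17  # assumed, e.g. a COCO-style keypoint set

class GlobalNetSketch(nn.Module):
    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        stages, heads, c_in = [], [], 3
        for c_out in widths:
            # stride-2 convolution halves the map; no pooling layers anywhere
            stages.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            # 3*3 filter turning each stage's features into keypoint heat maps
            heads.append(nn.Conv2d(c_out, NUM_KEYPOINTS, 3, padding=1))
            c_in = c_out
        self.stages, self.heads = nn.ModuleList(stages), nn.ModuleList(heads)

    def forward(self, x):
        heatmaps = []
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            heatmaps.append(head(x))  # one heat-map set per feature level
        return heatmaps

maps = GlobalNetSketch()(torch.randn(1, 3, 128, 128))
print([m.shape[-1] for m in maps])  # [64, 32, 16, 8]: halved at every stage
```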
  • the heat maps are sent to the RefineNet network to be processed separately through residual block structures: the first feature map passes through no residual block structure, the second feature map passes through one residual block structure, the third feature map passes through two residual block structures, and the fourth feature map passes through three residual block structures. The residual block structure includes a 1*1 convolution structure and a 3*3 convolution structure.
  • the processing results are then up-sampled separately: the processing result of the first feature map undergoes no up-sampling, the processing result of the second feature map undergoes two up-sampling operations, the processing result of the third feature map undergoes four, and the processing result of the fourth feature map undergoes eight.
  • the up-sampled results are concatenated for integration, the integrated result is trained using an L2 loss, and difficult key points are selected according to the training result to obtain the human body pose result.
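  • the "train with an L2 loss, then keep the difficult key points" step can be sketched as a per-keypoint mean-squared error in which only the top-k hardest key points contribute to the gradient; k=8 is an illustrative choice, not specified by the text:

```python
import torch

def hard_keypoint_l2_loss(pred, target, k=8):
    """pred, target: (batch, num_keypoints, H, W) heat maps.
    Compute an L2 loss per key point and keep only the k hardest."""
    per_kpt = ((pred - target) ** 2).mean(dim=(2, 3))  # (batch, num_keypoints)
    hard, _ = torch.topk(per_kpt, k, dim=1)            # k largest losses
    return hard.mean()

pred = torch.randn(2, 17, 64, 64, requires_grad=True)
target = torch.randn(2, 17, 64, 64)
loss = hard_keypoint_l2_loss(pred, target)
loss.backward()  # only the difficult key points drive this gradient
```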
  • in the residual block structure, the 1*1 convolution adjusts the size of the heat map and the 3*3 convolution structure extracts feature information. Compared with the existing ResNet bottleneck design, the residual block structure removes one convolutional layer, which greatly reduces the amount of computation without affecting accuracy.
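  • the slimmed-down residual block just described (a 1*1 convolution, a 3*3 convolution, one layer fewer than a standard ResNet bottleneck) might look like the following sketch; the channel counts are assumptions:

```python
import torch
import torch.nn as nn

class SlimResidualBlock(nn.Module):
    """Residual block with only a 1*1 and a 3*3 convolution, i.e. one
    convolutional layer fewer than the classic ResNet bottleneck."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 1)              # 1*1: adjust channels/size
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)  # 3*3: extract features
        self.proj = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + self.proj(x))  # residual connection

x = torch.randn(1, 128, 16, 16)
print(SlimResidualBlock(128, 64)(x).shape)  # torch.Size([1, 64, 16, 16])
```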
  • Step S106 Match the human body posture estimation result with the violent behavior human body posture stored in the database to determine whether there is a violent behavior in the scene image according to the matching result, and classify the violent behavior.
  • the database stores various violent-behavior human poses and their correspondence to behavior names; for example, the behavior pose of a punch corresponds to the behavior name "punch". If the human body pose estimation result is a punch, that pose is matched against the multiple violent-behavior poses stored in the database, it is matched as a punch, and it is judged that violent behavior exists.
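  • one simple way to realize this matching is nearest-neighbor search over flattened key-point coordinates; the database entries, distance threshold, and behavior names below are illustrative assumptions, not the patent's specification:

```python
import numpy as np

# Hypothetical database: behavior name -> reference pose (17 keypoints, flattened)
POSE_DB = {
    "punch": np.random.rand(17 * 2),
    "kick": np.random.rand(17 * 2),
}
THRESHOLD = 0.5  # assumed maximum distance for a valid match

def classify_pose(estimate):
    """Match a pose estimate against stored violent-behavior poses."""
    name, dist = min(
        ((n, np.linalg.norm(estimate - ref)) for n, ref in POSE_DB.items()),
        key=lambda item: item[1])
    return name if dist < THRESHOLD else None  # None: no violent behavior

result = classify_pose(np.random.rand(17 * 2))
print(result or "no violent behavior detected")
```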
  • before the human body pose estimation result is matched against the violent behaviors stored in the database to determine from the matching result whether violent behavior exists and to classify it, multiple sample images are also acquired, where each of the multiple images includes multiple human bodies performing different behaviors, the behaviors including at least one or more of punching, knife stabbing, gun shooting, kicking, and neck choking. The sample images are then labeled according to the behaviors they contain, and the multiple images are trained according to the labeling results to obtain the human pose corresponding to each behavior.
  • a spatial violence personal database is used. It consists of 2000 images, each containing 2-10 people; the entire database contains 10863 people in total, of whom 5124 (that is, 48%) are involved in one or more of the five types of violent behavior.
  • the five types of violent behavior are: punching, knife stabbing, gun shooting, kicking, and neck choking.
  • the accuracy rate is highest when fewer people need to be processed. For example, when there is only one person in an image, the system's accuracy is 94.1%; with 5 people it drops to 84%, and with 10 people to 79.8%.
  • each key point of the human body can be successfully located, which greatly improves the accuracy of recognition and reduces the amount of calculation.
  • FIG. 5 shows a schematic diagram of the hardware architecture of the computer device according to the second embodiment of the present application.
  • the computer device 2 includes, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with each other through a system bus.
  • FIG. 5 only shows the computer device 2 with components 21-23, but it should be understood that not all illustrated components are required; more or fewer components may be implemented instead.
  • the memory 21 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 21 may be an internal storage unit of the computer device 2, for example, a hard disk or a memory of the computer device 2.
  • the memory may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device 2.
  • the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device.
  • the memory 21 is generally used to store the operating system and various application software installed in the computer device 2, for example, the program code of the violent behavior detection system 20.
  • the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 22 is generally used to control the overall operation of the computer device 2.
  • the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the violent behavior detection system 20.
  • the network interface 23 may include a wireless network interface or a wired network interface, and the network interface 23 is generally used to establish a communication connection between the computer device 2 and other electronic devices.
  • the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal.
  • the network may be an intranet, the Internet, a Global System for Mobile Communications (GSM) network, Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
  • FIG. 6 shows a schematic diagram of program modules of a violent behavior detection system according to the third embodiment of the present application.
  • the violent behavior detection system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to complete this application and implement the above violent behavior detection method.
  • the program modules referred to in the embodiments of the present application are a series of computer-readable instruction segments capable of completing specific functions, better suited than the program itself for describing the execution of the violent behavior detection system 20 in the storage medium. The following description specifically introduces the functions of each program module in this embodiment:
  • the obtaining module 201 is used to obtain scene images through a camera.
  • a surveillance camera is installed in a public place, and real-time transmission of video from the surveillance camera in a public place is used to transfer the data of an event to the cloud frame by frame for processing.
  • the acquisition module 201 acquires the captured scene image to detect violent behavior in the scene image.
  • the detection module 202 is configured to input the scene image into a feature pyramid network, and obtain a target human body from the scene image.
  • the feature pyramid is a basic component of the multi-scale target detection system.
  • the detection module 202 detects the target human body from the image features, for example: User A with violent behavior and User B with normal behavior.
  • the detection module 202 first passes the scene image through the convolutional network and extracts the feature image of the highest layer of the convolutional network to obtain a first-size feature image. Then, the first-size feature image is up-sampled by bilinear interpolation to a first intermediate-size feature image, which is fused with the output image of the first intermediate size in the convolutional network to obtain a first fusion result; the first fusion result is output to obtain a second-size feature image.
  • the size of the acquired scene image is 128*128*3, where 3 denotes the three RGB channels.
  • the detection module 202 inputs the scene image into the feature pyramid network and obtains, through convolution transformations, a minimum-size first-size feature image of 16*16*128. The first-size feature image is then up-sampled by bilinear interpolation to obtain a 32*32*128 feature image, which is fused with the 32*32*64 feature image output by the corresponding convolutional layer in the feature pyramid network to obtain a second-size feature image of 32*32*64.
  • Bilinear interpolation is a kind of interpolation algorithm, which is an extension of linear interpolation.
  • the four real pixel values around the target point in the original image are used to jointly determine a pixel value in the target image.
  • the core idea is to perform a linear interpolation in two directions.
  • the main purpose of up-sampling is to enlarge the image; almost all methods use interpolation, that is, on the basis of the original image pixels, a suitable interpolation algorithm is used to insert new elements between the pixel values.
  • Feature-level image fusion extracts feature information from the source image, i.e., the targets or regions of interest to the observer, such as edges, people, buildings, or vehicles, and then analyzes, processes, and integrates this feature information to obtain the fused image features.
  • the accuracy of target recognition on the fused features is significantly higher than that of the original image.
  • Feature-level fusion compresses the image information before computer analysis and processing; compared with the pixel level, it consumes less memory and time, improving the real-time performance of the required images.
  • Feature-level image fusion places lower demands on image-matching accuracy than the first (pixel) level and computes faster, but because it extracts image features as the fusion information, it loses many detailed features.
  • after acquiring the second-size feature image, the detection module 202 further up-samples it to a second intermediate-size feature image through the bilinear interpolation method. The second intermediate-size feature image is then fused with the output image of the second intermediate size in the convolutional network to obtain a second fusion result, which is output to obtain a third-size feature image.
  • the detection module 202 also fuses the 32*32*64 second-size feature image with the 64*64*32 feature image output by the convolutional layer in the feature pyramid network to obtain a third-size feature image of 64*64*32.
  • after acquiring the third-size feature image, the detection module 202 also inputs the first-size, second-size, and third-size feature images into the RPN network, performs region-box detection on each of them, and obtains the regions of interest and the region with the highest category score among the regions of interest according to the detection results, so as to obtain the target human body.
  • the detection module 202 also inputs the 16*16*128 first-size feature image, the 32*32*64 second-size feature image, and the 64*64*32 third-size feature image into the RPN network for target detection; if the detection results are [person; 0.6], [person; 0.65], and [person; 0.8], the region with the detection result [person; 0.8] is taken as the target human body.
  • the detection module 202 also obtains the feature maps of the second to fifth convolutional layers and fuses them to obtain the second- to fifth-layer feature maps P2 to P5; the feature maps are then passed through the ROI Pooling layer, the pooled results are passed through a fully connected layer, a classification result is obtained from the classifier and a bounding-box regression result from the box regression, and the two results are combined to obtain the bounding box of the target human body.
  • the fusion method is the same as the fusion method in other embodiments of this application, so it will not be repeated.
  • the human body pose estimation module 203 is configured to use a cascaded pyramid network to estimate the target human body pose to obtain a human body pose estimation result, where the cascaded pyramid network includes a GlobalNet network and a RefineNet network.
  • the human body posture estimation module 203 locates simple visible key points through the GlobalNet network, and further processes difficult key points through the RefineNet network to realize the human body posture estimation of the target human body.
  • the human body pose estimation module 203 also sends the detection result image to the GlobalNet network, which contains no pooling layer; the size of the feature map output by each convolutional layer is instead reduced by setting the stride of the convolutions between layers. The first, second, third, and fourth feature maps output by the second to fifth convolutional layers are then obtained, and a 3*3 convolution filter is applied to each of them to generate heat maps of simple key points.
  • the network without the pooling layer design can also be referred to as a fully convolutional network, which can improve the accuracy of the GlobalNet network for human frame detection.
  • the first and second feature maps have higher spatial resolution for localization but carry less semantic information, while the third and fourth feature maps carry more semantic information but have lower resolution; feature information with different semantic content and different resolutions is fused to improve accuracy.
  • the human body pose estimation module 203 also sends the heat maps to the RefineNet network to be processed separately through residual block structures: the first feature map passes through no residual block structure, the second feature map passes through one residual block structure, the third feature map passes through two residual block structures, and the fourth feature map passes through three residual block structures. The residual block structure includes a 1*1 convolution structure and a 3*3 convolution structure.
  • the processing results are then up-sampled separately: the processing result of the first feature map undergoes no up-sampling, the processing result of the second feature map undergoes two up-sampling operations, the processing result of the third feature map undergoes four, and the processing result of the fourth feature map undergoes eight.
  • the up-sampled results are concatenated for integration, the integrated result is trained using an L2 loss, and difficult key points are selected according to the training result to obtain the human body pose result.
  • in the residual block structure, the 1*1 convolution adjusts the size of the heat map and the 3*3 convolution structure extracts feature information. Compared with the existing ResNet bottleneck design, the residual block structure removes one convolutional layer, which greatly reduces the amount of computation without affecting accuracy.
  • the classification module 204 is configured to match the human posture estimation result with the violent behavior human posture stored in the database to determine whether there is a violent behavior in the scene image according to the matching result, and classify the violent behavior.
  • the database stores various violent-behavior human poses and their correspondence to behavior names; for example, the behavior pose of a punch corresponds to the behavior name "punch". If the human body pose estimation result is a punch, the classification module 204 matches that pose against the multiple violent-behavior poses stored in the database, matches it as a punch, and judges that violent behavior exists.
  • the violent behavior detection system 20 further includes a human body pose training module 205 for acquiring multiple sample images, where each of the multiple images includes multiple human bodies performing different behaviors, the behaviors including at least one or more of punching, knife stabbing, gun shooting, kicking, and neck choking. The human body pose training module 205 then labels the sample images according to the behaviors they contain and trains the multiple images according to the labeling results to obtain the human pose corresponding to each behavior.
  • a spatial violence personal database is used. It consists of 2000 images, each containing 2-10 people; the entire database contains 10863 people in total, of whom 5124 (that is, 48%) are involved in one or more of the five types of violent behavior.
  • the five types of violent behavior are: punching, knife stabbing, gun shooting, kicking, and neck choking.
  • the accuracy rate is highest when fewer people need to be processed. For example, when there is only one person in an image, the system's accuracy is 94.1%; with 5 people it drops to 84%, and with 10 people to 79.8%.
  • each key point of the human body can be successfully located, which greatly improves the accuracy of recognition and reduces the amount of calculation.
  • This application also provides a computer device capable of executing programs, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server, or a server cluster composed of multiple servers), and so on.
  • the computer device in this embodiment at least includes but is not limited to: a memory, a processor, etc., which can be communicably connected to each other through a system bus.
  • This embodiment also provides a non-volatile computer-readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, servers, app stores, etc., on which computer-readable instructions are stored; the corresponding functions are realized when the program is executed by a processor.
  • the non-volatile computer-readable storage medium of this embodiment is used to store the violent behavior detection system 20, and when executed by a processor, the following steps are implemented:
  • the human body posture estimation result is matched with the violent behavior human body posture stored in the database to determine whether there is a violent behavior in the scene image according to the matching result, and the violent behavior is classified.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Tourism & Hospitality (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Educational Administration (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Economics (AREA)
  • Computer Security & Cryptography (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

A violent behavior detection method, comprising: acquiring a scene image through a camera; inputting the scene image into a feature pyramid network, and acquiring a target human body from the scene image; performing human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network; and matching the human body pose estimation result against violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior is present in the scene image, and classifying the violent behavior. Embodiments of this application also provide a violent behavior detection system, a computer device, and a readable storage medium. Through the embodiments of this application, each key point of the human body can be successfully located, which greatly improves recognition accuracy and reduces the amount of computation.

Description

Violent behavior detection method and system
This application claims priority to the Chinese patent application with application number 201910872172.0, entitled "暴力行为检测方法及系统" (Violent behavior detection method and system), filed with the China Patent Office on September 16, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of this application relate to the field of big data, and in particular to a violent behavior detection method, a violent behavior detection system, a computer device, and a readable storage medium.
Background Art
In recent years, with the increasing frequency of threatening individual activities and terrorist-organization threats, finding new ways to maintain safety and deter such behavior has important practical significance. Surveillance has long been regarded as an effective behavioral deterrent, but when a violent or terrorist incident occurs, people panic and scatter and cannot effectively call the police for help, whereas real-time detection can effectively trigger an alarm immediately. Using the real-time transmission of video from surveillance cameras in public places, the data of an event is sent to the cloud frame by frame for processing; redundant information about non-violent events is processed over time, while the surveillance information of violent events is preserved. When violent human behavior is detected, the law-enforcement department closest to the monitored camera location is alerted, which helps maintain social stability and long-term security. Other application scenarios are also very broad, for example: real-time monitoring of campus violence in public areas and blind corners, violence between patients' families and doctors in hospitals, and violence on buses, subways, and other means of transport.
In the prior art, methods of human body pose estimation include:
1. Structured Feature Learning, which fine-tunes a Convolutional Neural Network (CNN); however, the accuracy of this multi-person pose estimation is not high;
2. DeepCut and DeeperCut, which use a CNN to extract candidate regions of body parts; however, this approach is computationally very complex and slow;
3. the Convolutional Pose Machine (CPM), which uses a sequential convolution architecture to express spatial and texture information; although it has good robustness, the network is relatively complex.
The inventors found that although other human body pose estimation methods perform well in some respects, many problems remain, such as occluded key points, invisible key points, and complex backgrounds, which are not well solved.
Summary of the Invention
In view of this, it is necessary to provide a violent behavior detection method, violent behavior detection system, computer device, and readable storage medium that can successfully locate each key point of the human body, greatly improve recognition accuracy, and reduce the amount of computation.
To achieve the above object, an embodiment of this application provides a violent behavior detection method, the method comprising:
acquiring a scene image through a camera;
inputting the scene image into a feature pyramid network, and acquiring a target human body from the scene image;
performing human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network; and
matching the human body pose estimation result against violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior is present in the scene image, and classifying the violent behavior.
To achieve the above object, an embodiment of this application also provides a violent behavior detection system, comprising:
an acquisition module for acquiring a scene image through a camera;
a detection module for inputting the scene image into a feature pyramid network and acquiring a target human body from the scene image;
a human body pose estimation module for performing human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network; and
a classification module for matching the human body pose estimation result against violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior is present in the scene image, and classifying the violent behavior.
To achieve the above object, an embodiment of this application also provides a computer device comprising a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, the computer-readable instructions implementing the following steps when executed by the processor:
acquiring a scene image through a camera;
inputting the scene image into a feature pyramid network, and acquiring a target human body from the scene image;
performing human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network; and
matching the human body pose estimation result against violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior is present in the scene image, and classifying the violent behavior.
To achieve the above object, an embodiment of this application also provides a non-volatile computer-readable storage medium storing computer-readable instructions executable by at least one processor, so that the at least one processor performs the following steps:
acquiring a scene image through a camera;
inputting the scene image into a feature pyramid network, and acquiring a target human body from the scene image;
performing human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network; and
matching the human body pose estimation result against violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior is present in the scene image, and classifying the violent behavior.
The violent behavior detection method, violent behavior detection system, computer device, and readable storage medium provided by the embodiments of this application first pass the acquired scene image through a feature pyramid network to detect the bounding box of the target human body, then use a cascaded pyramid network to perform human body pose estimation on the target human body according to the detection result, and match the estimation result against the violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior exists and to classify the violent behavior. Through the embodiments of this application, each key point of the human body can be successfully located, which greatly improves recognition accuracy and reduces the amount of computation.
Brief Description of the Drawings
FIG. 1 is a flow chart of the steps of the violent behavior detection method of Embodiment 1 of this application.
FIG. 2 is a schematic diagram of obtaining size feature images according to an embodiment of this application.
FIG. 3 is a schematic diagram of the overall framework of FPN-based Faster R-CNN target detection according to an embodiment of this application.
FIG. 4 is a structural diagram of a residual block according to an embodiment of this application.
FIG. 5 is a schematic diagram of the hardware architecture of the computer device of Embodiment 2 of this application.
FIG. 6 is a schematic diagram of the program modules of the violent behavior detection system of Embodiment 3 of this application.
Detailed Description of Embodiments
To make the objects, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain this application and not to limit it. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of this application.
It should be noted that descriptions involving "first", "second", etc. in this application are for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, but only on the basis that they can be realized by those of ordinary skill in the art; when a combination of technical solutions is contradictory or unrealizable, such a combination should be considered non-existent and outside the scope of protection claimed by this application.
Embodiment 1
Referring to FIG. 1, a flow chart of the steps of the violent behavior detection method of Embodiment 1 of this application is shown. It can be understood that the flowchart in this method embodiment is not used to limit the order in which the steps are executed. It should be noted that this embodiment is described by way of example with the computer device 2 as the execution subject. The details are as follows:
Step S100: acquire a scene image through a camera.
Exemplarily, a surveillance camera is installed in a public place, and its real-time video stream is used to transfer event data to the cloud frame by frame for processing. The computer device 2 acquires the captured scene image so as to perform violent behavior detection on the scene image.
Step S102: input the scene image into a feature pyramid network, and acquire a target human body from the scene image.
It should be noted that the feature pyramid is a basic component of multi-scale target detection systems.
Exemplarily, the acquired scene image of size 128*128*3 is input into the feature pyramid network to obtain image features, and the target human body is detected from those features, for example, User A exhibiting violent behavior and User B behaving normally.
In a preferred embodiment, after the scene image is input into the feature pyramid network, the scene image is first passed through a convolutional network, and the feature image of the highest layer of the convolutional network is extracted to obtain a first-size feature image. Then, the first-size feature image is up-sampled by bilinear interpolation to a first intermediate-size feature image, which is fused with the output image of the first intermediate size in the convolutional network to obtain a first fusion result; the first fusion result is output to obtain a second-size feature image.
Exemplarily, referring to FIG. 2, suppose the acquired scene image has size 128*128*3, where 3 denotes the three RGB channels. The scene image is input into the feature pyramid network, and convolution transformations yield a minimum-size first-size feature image of 16*16*128. The first-size feature image is then up-sampled by bilinear interpolation to obtain a 32*32*128 feature image, which is fused with the 32*32*64 feature image output by the corresponding convolutional layer in the feature pyramid network to obtain a second-size feature image of 32*32*64.
Bilinear interpolation is an interpolation algorithm that extends linear interpolation. The four real pixel values surrounding the target point in the original image jointly determine one pixel value in the target image; the core idea is to perform one linear interpolation in each of two directions.
The main purpose of up-sampling is to enlarge the image; almost all methods use interpolation, that is, on the basis of the original image pixels, a suitable interpolation algorithm is used to insert new elements between the pixel values.
Feature-level image fusion extracts feature information from the source image, i.e., the targets or regions of interest to the observer, such as edges, people, buildings, or vehicles, and then analyzes, processes, and integrates this feature information to obtain the fused image features. The accuracy of target recognition on the fused features is significantly higher than that of the original image. Feature-level fusion compresses the image information before computer analysis and processing; compared with the pixel level, it consumes less memory and time, improving real-time performance. Feature-level image fusion places lower demands on image-matching accuracy than the first (pixel) level and computes faster, but because it extracts image features as the fusion information, it loses many detailed features.
After the second-size feature image is acquired, it is up-sampled to a second intermediate-size feature image by bilinear interpolation. The second intermediate-size feature image is then fused with the output image of the second intermediate size in the convolutional network to obtain a second fusion result, which is output to obtain a third-size feature image.
Exemplarily, continuing to refer to FIG. 2, the 32*32*64 second-size feature image is fused with the 64*64*32 feature image output by the convolutional layer in the feature pyramid network to obtain a third-size feature image of 64*64*32.
After the third-size feature image is acquired, the first-size, second-size, and third-size feature images are input into an RPN network; region-box detection is performed on each of them, and the regions of interest and the region with the highest category score among the regions of interest are obtained according to the detection results, so as to obtain the target human body.
Exemplarily, the 16*16*128 first-size feature image, the 32*32*64 second-size feature image, and the 64*64*32 third-size feature image are input into the RPN network for target detection; if the detection results are [person; 0.6], [person; 0.65], and [person; 0.8], the region with the detection result [person; 0.8] is taken as the target human body.
In another preferred embodiment, referring to FIG. 3, the feature maps of the second to fifth convolutional layers are obtained and fused to obtain the second- to fifth-layer feature maps P2 to P5; the feature maps are then passed through a region-of-interest pooling layer (ROI Pooling), the pooled results are passed through a fully connected layer, a classification result is obtained from the classifier and a bounding-box regression result from the box regression, and the classification and box-regression results are combined to obtain the bounding box of the target human body. In this embodiment, the fusion method is the same as in the other embodiments of this application and is not repeated.
Step S104: perform human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network.
In this embodiment, simple visible key points are located through the GlobalNet network, and difficult key points are further processed through the RefineNet network, so as to realize human body pose estimation of the target human body.
In a preferred embodiment, when the cascaded pyramid network performs human body pose estimation on the target human body according to the detection result, the detection result image is also sent to the GlobalNet network, which contains no pooling layer; the size of the feature map output by each convolutional layer is instead reduced by setting the stride of the convolutions between layers. Then, the first, second, third, and fourth feature maps output by the second to fifth convolutional layers are obtained, and a 3*3 convolution filter is applied to each of them to generate heat maps of simple key points.
It should be noted that because pooling layers cause loss of feature information, all pooling layers of the original GlobalNet network are replaced by convolution strides that reduce the size of the feature maps. For example, with stride = 2 and a 3*3 convolution kernel, the output size becomes half the original. In this embodiment, the network designed without pooling layers can also be called a fully convolutional network, which improves the accuracy of the GlobalNet network for human body frame detection. In addition, the first and second feature maps have higher spatial resolution for localization but carry less semantic information, while the third and fourth feature maps carry more semantic information but have lower resolution; feature information with different semantic content and different resolutions is fused to improve accuracy.
After the 3*3 convolution filters are applied to the first, second, third, and fourth feature maps to generate the heat maps of simple key points, the heat maps are also sent to the RefineNet network to be processed separately through residual block structures: the first feature map passes through no residual block structure, the second feature map passes through one residual block structure, the third feature map passes through two residual block structures, and the fourth feature map passes through three residual block structures; the residual block structure includes a 1*1 convolution structure and a 3*3 convolution structure. The processing results are then up-sampled separately: the processing result of the first feature map undergoes no up-sampling, the processing result of the second feature map undergoes two up-sampling operations, the processing result of the third feature map undergoes four, and the processing result of the fourth feature map undergoes eight. Finally, the up-sampled results are concatenated for integration, the integrated result is trained using an L2 loss, and difficult key points are selected according to the training result to obtain the human body pose result.
Exemplarily, see the residual block structural diagram of FIG. 4; in the residual block structure, the 1*1 convolution adjusts the size of the heat map and the 3*3 convolution structure extracts feature information. Compared with the existing ResNet bottleneck design, the residual block structure removes one convolutional layer, which greatly reduces the amount of computation without affecting accuracy.
Step S106: match the human body pose estimation result against the violent-behavior human poses stored in the database, so as to determine from the matching result whether violent behavior is present in the scene image, and classify the violent behavior.
Exemplarily, the database stores various violent-behavior human poses and their correspondence to behavior names; for example, the behavior pose of a punch corresponds to the behavior name "punch". If the human body pose estimation result is a punch, that pose is matched against the multiple violent-behavior poses stored in the database, it is matched as a punch, and it is judged that violent behavior exists.
In a preferred embodiment, before the human body pose estimation result is matched against the violent behaviors stored in the database to determine from the matching result whether violent behavior exists and to classify it, multiple sample images are also acquired, where each of the multiple images includes multiple human bodies performing different behaviors, the behaviors including at least one or more of punching, knife stabbing, gun shooting, kicking, and neck choking. The sample images are then labeled according to the behaviors they contain, and the multiple images are trained according to the labeling results to obtain the human pose corresponding to each behavior.
Exemplarily, a spatial violence personal database is used, which consists of 2000 images, each containing 2-10 people; the entire database contains 10863 people in total, of whom 5124 (that is, 48%) are involved in one or more of the five types of violent behavior: punching, knife stabbing, gun shooting, kicking, and neck choking. It should be noted that accuracy is highest when fewer people need to be processed. For example, when there is only one person in an image, the system's accuracy is 94.1%; with 5 people it drops to 84%, and with 10 people to 79.8%.
Through this embodiment of the application, each key point of the human body can be successfully located, which greatly improves recognition accuracy and reduces the amount of computation.
Embodiment 2
Referring to FIG. 5, a schematic diagram of the hardware architecture of the computer device of Embodiment 2 of this application is shown. The computer device 2 includes, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with each other through a system bus. FIG. 5 only shows the computer device 2 with components 21-23, but it should be understood that not all illustrated components are required; more or fewer components may be implemented instead.
The memory 21 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as its hard disk or internal memory. In other embodiments, the memory may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device 2. Of course, the memory 21 may also include both the internal storage unit and the external storage device of the computer device 2. In this embodiment, the memory 21 is generally used to store the operating system and various application software installed on the computer device 2, such as the program code of the violent behavior detection system 20. In addition, the memory 21 can also be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may in some embodiments be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is used to run the program code or process the data stored in the memory 21, for example, to run the violent behavior detection system 20.
The network interface 23 may include a wireless or wired network interface and is generally used to establish communication connections between the computer device 2 and other electronic devices. For example, the network interface 23 is used to connect the computer device 2 with an external terminal through a network and to establish data transmission channels and communication connections between them. The network may be an intranet, the Internet, a Global System for Mobile Communications (GSM) network, Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
Embodiment 3
Referring to FIG. 6, a schematic diagram of the program modules of the violent behavior detection system of Embodiment 3 of this application is shown. In this embodiment, the violent behavior detection system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to complete this application and implement the above violent behavior detection method. The program modules referred to in the embodiments of this application are a series of computer-readable instruction segments capable of completing specific functions, better suited than the program itself for describing the execution of the violent behavior detection system 20 in the storage medium. The following description specifically introduces the functions of each program module in this embodiment:
The acquisition module 201 is used to acquire a scene image through a camera.
Exemplarily, a surveillance camera is installed in a public place, and its real-time video stream is used to transfer event data to the cloud frame by frame for processing. The acquisition module 201 acquires the captured scene image so as to perform violent behavior detection on the scene image.
The detection module 202 is used to input the scene image into a feature pyramid network and acquire a target human body from the scene image.
It should be noted that the feature pyramid is a basic component of multi-scale target detection systems.
Exemplarily, the acquired scene image of size 128*128*3 is input into the feature pyramid network to obtain image features, and the detection module 202 detects the target human body from those features, for example, User A exhibiting violent behavior and User B behaving normally.
In a preferred embodiment, after the scene image is input into the feature pyramid network, the detection module 202 first passes the scene image through a convolutional network and extracts the feature image of the highest layer of the convolutional network to obtain a first-size feature image. Then, the first-size feature image is up-sampled by bilinear interpolation to a first intermediate-size feature image, which is fused with the output image of the first intermediate size in the convolutional network to obtain a first fusion result; the first fusion result is output to obtain a second-size feature image.
Exemplarily, referring to FIG. 2, suppose the acquired scene image has size 128*128*3, where 3 denotes the three RGB channels. The detection module 202 inputs the scene image into the feature pyramid network and obtains, through convolution transformations, a minimum-size first-size feature image of 16*16*128. The first-size feature image is then up-sampled by bilinear interpolation to obtain a 32*32*128 feature image, which is fused with the 32*32*64 feature image output by the corresponding convolutional layer in the feature pyramid network to obtain a second-size feature image of 32*32*64.
Bilinear interpolation is an interpolation algorithm that extends linear interpolation. The four real pixel values surrounding the target point in the original image jointly determine one pixel value in the target image; the core idea is to perform one linear interpolation in each of two directions.
The main purpose of up-sampling is to enlarge the image; almost all methods use interpolation, that is, on the basis of the original image pixels, a suitable interpolation algorithm is used to insert new elements between the pixel values.
Feature-level image fusion extracts feature information from the source image, i.e., the targets or regions of interest to the observer, such as edges, people, buildings, or vehicles, and then analyzes, processes, and integrates this feature information to obtain the fused image features. The accuracy of target recognition on the fused features is significantly higher than that of the original image. Feature-level fusion compresses the image information before computer analysis and processing; compared with the pixel level, it consumes less memory and time, improving real-time performance. Feature-level image fusion places lower demands on image-matching accuracy than the first (pixel) level and computes faster, but because it extracts image features as the fusion information, it loses many detailed features.
After the second-size feature image is acquired, the detection module 202 further up-samples it to a second intermediate-size feature image by bilinear interpolation. The second intermediate-size feature image is then fused with the output image of the second intermediate size in the convolutional network to obtain a second fusion result, which is output to obtain a third-size feature image.
Exemplarily, continuing to refer to FIG. 2, the detection module 202 also fuses the 32*32*64 second-size feature image with the 64*64*32 feature image output by the convolutional layer in the feature pyramid network to obtain a third-size feature image of 64*64*32.
After the third-size feature image is acquired, the detection module 202 also inputs the first-size, second-size, and third-size feature images into the RPN network, performs region-box detection on each of them, and obtains the regions of interest and the region with the highest category score among the regions of interest according to the detection results, so as to obtain the target human body.
Exemplarily, the detection module 202 also inputs the 16*16*128 first-size feature image, the 32*32*64 second-size feature image, and the 64*64*32 third-size feature image into the RPN network for target detection; if the detection results are [person; 0.6], [person; 0.65], and [person; 0.8], the region with the detection result [person; 0.8] is taken as the target human body.
In another preferred embodiment, referring to FIG. 3, the detection module 202 also obtains the feature maps of the second to fifth convolutional layers and fuses them to obtain the second- to fifth-layer feature maps P2 to P5; the feature maps are then passed through a region-of-interest pooling layer (ROI Pooling), the pooled results are passed through a fully connected layer, a classification result is obtained from the classifier and a bounding-box regression result from the box regression, and the two results are combined to obtain the bounding box of the target human body. In this embodiment, the fusion method is the same as in the other embodiments of this application and is not repeated.
The human body pose estimation module 203 is used to perform human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network.
In this embodiment, the human body pose estimation module 203 locates simple visible key points through the GlobalNet network and further processes difficult key points through the RefineNet network, so as to realize human body pose estimation of the target human body.
In a preferred embodiment, the human body pose estimation module 203 also sends the detection result image to the GlobalNet network, which contains no pooling layer; the size of the feature map output by each convolutional layer is instead reduced by setting the stride of the convolutions between layers. Then, the first, second, third, and fourth feature maps output by the second to fifth convolutional layers are obtained, and a 3*3 convolution filter is applied to each of them to generate heat maps of simple key points.
It should be noted that because pooling layers cause loss of feature information, all pooling layers of the original GlobalNet network are replaced by convolution strides that reduce the size of the feature maps. For example, with stride = 2 and a 3*3 convolution kernel, the output size becomes half the original. In this embodiment, the network designed without pooling layers can also be called a fully convolutional network, which improves the accuracy of the GlobalNet network for human body frame detection. In addition, the first and second feature maps have higher spatial resolution for localization but carry less semantic information, while the third and fourth feature maps carry more semantic information but have lower resolution; feature information with different semantic content and different resolutions is fused to improve accuracy.
After the 3*3 convolution filters are applied to the first, second, third, and fourth feature maps to generate the heat maps of simple key points, the human body pose estimation module 203 also sends the heat maps to the RefineNet network to be processed separately through residual block structures: the first feature map passes through no residual block structure, the second feature map passes through one residual block structure, the third feature map passes through two residual block structures, and the fourth feature map passes through three residual block structures; the residual block structure includes a 1*1 convolution structure and a 3*3 convolution structure. The processing results are then up-sampled separately: the processing result of the first feature map undergoes no up-sampling, the processing result of the second feature map undergoes two up-sampling operations, the processing result of the third feature map undergoes four, and the processing result of the fourth feature map undergoes eight. Finally, the up-sampled results are concatenated for integration, the integrated result is trained using an L2 loss, and difficult key points are selected according to the training result to obtain the human body pose result.
Exemplarily, see the residual block structural diagram of FIG. 4; in the residual block structure, the 1*1 convolution adjusts the size of the heat map and the 3*3 convolution structure extracts feature information. Compared with the existing ResNet bottleneck design, the residual block structure removes one convolutional layer, which greatly reduces the amount of computation without affecting accuracy.
The classification module 204 is used to match the human body pose estimation result against the violent-behavior human poses stored in the database, so as to determine from the matching result whether violent behavior is present in the scene image, and classify the violent behavior.
Exemplarily, the database stores various violent-behavior human poses and their correspondence to behavior names; for example, the behavior pose of a punch corresponds to the behavior name "punch". If the human body pose estimation result is a punch, the classification module 204 matches that pose against the multiple violent-behavior poses stored in the database, matches it as a punch, and judges that violent behavior exists.
In a preferred embodiment, the violent behavior detection system 20 further includes a human body pose training module 205 for acquiring multiple sample images, where each of the multiple images includes multiple human bodies performing different behaviors, the behaviors including at least one or more of punching, knife stabbing, gun shooting, kicking, and neck choking. The human body pose training module 205 then labels the sample images according to the behaviors they contain and trains the multiple images according to the labeling results to obtain the human pose corresponding to each behavior.
Exemplarily, a spatial violence personal database is used, which consists of 2000 images, each containing 2-10 people; the entire database contains 10863 people in total, of whom 5124 (that is, 48%) are involved in one or more of the five types of violent behavior: punching, knife stabbing, gun shooting, kicking, and neck choking. It should be noted that accuracy is highest when fewer people need to be processed. For example, when there is only one person in an image, the system's accuracy is 94.1%; with 5 people it drops to 84%, and with 10 people to 79.8%.
Through this embodiment of the application, each key point of the human body can be successfully located, which greatly improves recognition accuracy and reduces the amount of computation.
This application also provides a computer device capable of executing programs, such as a smartphone, tablet computer, notebook computer, desktop computer, rack server, blade server, tower server, or cabinet server (including an independent server, or a server cluster composed of multiple servers), and so on. The computer device of this embodiment includes at least, but is not limited to, a memory and a processor that can communicate with each other through a system bus.
This embodiment also provides a non-volatile computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, a server, an app store, etc., on which computer-readable instructions are stored; the corresponding functions are realized when the program is executed by a processor. The non-volatile computer-readable storage medium of this embodiment is used to store the violent behavior detection system 20 and, when executed by a processor, implements the following steps:
acquiring a scene image through a camera;
inputting the scene image into a feature pyramid network, and acquiring a target human body from the scene image;
performing human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network; and
matching the human body pose estimation result against violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior is present in the scene image, and classifying the violent behavior.
The above serial numbers of the embodiments of this application are for description only and do not represent the superiority or inferiority of the embodiments.
Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation.
The above are only preferred embodiments of this application and do not thereby limit the scope of its patent; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims (20)

  1. A violent behavior detection method, comprising:
    acquiring a scene image through a camera;
    inputting the scene image into a feature pyramid network, and acquiring a target human body from the scene image;
    performing human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network; and
    matching the human body pose estimation result against violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior is present in the scene image, and classifying the violent behavior.
  2. The violent behavior detection method of claim 1, wherein the step of inputting the scene image into the feature pyramid network to detect the target human body and acquire the target human body from the scene image further comprises:
    passing the scene image through a convolutional network and extracting the feature image of the highest layer of the convolutional network to obtain a first-size feature image;
    up-sampling the first-size feature image to a first intermediate-size feature image by bilinear interpolation;
    fusing the first intermediate-size feature image with the output image of the first intermediate size in the convolutional network to obtain a first fusion result; and
    outputting the first fusion result to obtain a second-size feature image.
  3. The violent behavior detection method of claim 2, further comprising, after the step of outputting the first fusion result to obtain the second-size feature image:
    up-sampling the second-size feature image to a second intermediate-size feature image by the bilinear interpolation;
    fusing the second intermediate-size feature image with the output image of the second intermediate size in the convolutional network to obtain a second fusion result; and
    outputting the second fusion result to obtain a third-size feature image.
  4. The violent behavior detection method of claim 3, further comprising, after the step of outputting the second fusion result to obtain the third-size feature image:
    inputting the first-size feature image, the second-size feature image, and the third-size feature image into an RPN network;
    performing region-box detection on the first-size feature image, the second-size feature image, and the third-size feature image respectively; and
    obtaining, according to the detection results, the regions of interest and the region with the highest category score among the regions of interest, so as to obtain the target human body.
  5. The violent behavior detection method of claim 1, wherein the step of performing human body pose estimation on the target human body using the cascaded pyramid network further comprises:
    sending the detection result image to the GlobalNet network, wherein the GlobalNet network includes no pooling layer and the size of the feature map output by each convolutional layer is reduced by setting the stride of the convolutions between convolutional layers;
    obtaining the first feature map, second feature map, third feature map, and fourth feature map output by the second to fifth convolutional layers respectively; and
    applying a 3*3 convolution filter to each of the first feature map, the second feature map, the third feature map, and the fourth feature map to generate heat maps of simple key points.
  6. The violent behavior detection method of claim 5, further comprising, after the step of applying a 3*3 convolution filter to each of the first feature map, the second feature map, the third feature map, and the fourth feature map to generate heat maps of simple key points:
    sending the heat maps to the RefineNet network to process them separately through residual block structures, wherein the first feature map passes through no residual block structure, the second feature map passes through one residual block structure, the third feature map passes through two residual block structures, and the fourth feature map passes through three residual block structures, the residual block structure including a 1*1 convolution structure and a 3*3 convolution structure;
    up-sampling the processing results separately, wherein the processing result of the first feature map undergoes no up-sampling, the processing result of the second feature map undergoes two up-sampling operations, the processing result of the third feature map undergoes four up-sampling operations, and the processing result of the fourth feature map undergoes eight up-sampling operations;
    concatenating the sampled results to integrate the up-sampled results; and
    training the integrated result using an L2 loss, and selecting difficult key points according to the training result.
  7. The violent behavior detection method of claim 1, further comprising, before the step of matching the human body pose estimation result against the violent-behavior human poses stored in the database, determining from the matching result whether violent behavior is present in the scene image, and classifying the violent behavior:
    acquiring multiple sample images, wherein each of the multiple images includes multiple human bodies performing different behaviors, the behaviors including at least one or more of punching, knife stabbing, gun shooting, kicking, and neck choking;
    labeling the sample images according to the behaviors in the sample images; and
    training the multiple images according to the labeling results to obtain the human poses corresponding to the behaviors.
  8. A violent behavior detection system, comprising:
    an acquisition module for acquiring a scene image through a camera;
    a detection module for inputting the scene image into a feature pyramid network and acquiring a target human body from the scene image;
    a human body pose estimation module for performing human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network; and
    a classification module for matching the human body pose estimation result against violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior is present in the scene image, and classifying the violent behavior.
  9. The violent behavior detection system of claim 8, wherein the detection module is further configured to:
    pass the scene image through a convolutional network and extract the feature image of the highest layer of the convolutional network to obtain a first-size feature image;
    up-sample the first-size feature image to a first intermediate-size feature image by bilinear interpolation;
    fuse the first intermediate-size feature image with the output image of the first intermediate size in the convolutional network to obtain a first fusion result; and
    output the first fusion result to obtain a second-size feature image.
  10. The violent behavior detection system of claim 9, wherein the detection module is further configured to:
    up-sample the second-size feature image to a second intermediate-size feature image by the bilinear interpolation;
    fuse the second intermediate-size feature image with the output image of the second intermediate size in the convolutional network to obtain a second fusion result; and
    output the second fusion result to obtain a third-size feature image.
  11. The violent behavior detection system of claim 10, wherein the detection module is further configured to:
    input the first-size feature image, the second-size feature image, and the third-size feature image into an RPN network;
    perform region-box detection on the first-size feature image, the second-size feature image, and the third-size feature image respectively; and
    obtain, according to the detection results, the regions of interest and the region with the highest category score among the regions of interest, so as to obtain the target human body.
  12. The violent behavior detection system of claim 8, wherein the human body pose estimation module is further configured to:
    send the detection result image to the GlobalNet network, wherein the GlobalNet network includes no pooling layer and the size of the feature map output by each convolutional layer is reduced by setting the stride of the convolutions between convolutional layers;
    obtain the first feature map, second feature map, third feature map, and fourth feature map output by the second to fifth convolutional layers respectively; and
    apply a 3*3 convolution filter to each of the first feature map, the second feature map, the third feature map, and the fourth feature map to generate heat maps of simple key points.
  13. The violent behavior detection system of claim 12, wherein the human body pose estimation module is further configured to:
    send the heat maps to the RefineNet network to process them separately through residual block structures, wherein the first feature map passes through no residual block structure, the second feature map passes through one residual block structure, the third feature map passes through two residual block structures, and the fourth feature map passes through three residual block structures, the residual block structure including a 1*1 convolution structure and a 3*3 convolution structure;
    up-sample the processing results separately, wherein the processing result of the first feature map undergoes no up-sampling, the processing result of the second feature map undergoes two up-sampling operations, the processing result of the third feature map undergoes four up-sampling operations, and the processing result of the fourth feature map undergoes eight up-sampling operations;
    concatenate the sampled results to integrate the up-sampled results; and
    train the integrated result using an L2 loss, and select difficult key points according to the training result.
  14. The violent behavior detection system of claim 8, further comprising a human body pose training module configured to:
    acquire multiple sample images, wherein each of the multiple images includes multiple human bodies performing different behaviors, the behaviors including at least one or more of punching, knife stabbing, gun shooting, kicking, and neck choking;
    label the sample images according to the behaviors in the sample images; and
    train the multiple images according to the labeling results to obtain the human poses corresponding to the behaviors.
  15. A computer device, comprising a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, the computer-readable instructions implementing the following steps when executed by the processor:
    acquiring a scene image through a camera;
    inputting the scene image into a feature pyramid network, and acquiring a target human body from the scene image;
    performing human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network; and
    matching the human body pose estimation result against violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior is present in the scene image, and classifying the violent behavior.
  16. The computer device of claim 15, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    passing the scene image through a convolutional network and extracting the feature image of the highest layer of the convolutional network to obtain a first-size feature image;
    up-sampling the first-size feature image to a first intermediate-size feature image by bilinear interpolation;
    fusing the first intermediate-size feature image with the output image of the first intermediate size in the convolutional network to obtain a first fusion result; and
    outputting the first fusion result to obtain a second-size feature image.
  17. The computer device of claim 16, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    up-sampling the second-size feature image to a second intermediate-size feature image by the bilinear interpolation;
    fusing the second intermediate-size feature image with the output image of the second intermediate size in the convolutional network to obtain a second fusion result; and
    outputting the second fusion result to obtain a third-size feature image.
  18. The computer device of claim 17, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    inputting the first-size feature image, the second-size feature image, and the third-size feature image into an RPN network;
    performing region-box detection on the first-size feature image, the second-size feature image, and the third-size feature image respectively; and
    obtaining, according to the detection results, the regions of interest and the region with the highest category score among the regions of interest, so as to obtain the target human body.
  19. The computer device of claim 15, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    sending the detection result image to the GlobalNet network, wherein the GlobalNet network includes no pooling layer and the size of the feature map output by each convolutional layer is reduced by setting the stride of the convolutions between convolutional layers;
    obtaining the first feature map, second feature map, third feature map, and fourth feature map output by the second to fifth convolutional layers respectively; and
    applying a 3*3 convolution filter to each of the first feature map, the second feature map, the third feature map, and the fourth feature map to generate heat maps of simple key points.
  20. A non-volatile computer-readable storage medium storing computer-readable instructions executable by at least one processor, so that the at least one processor performs the following steps:
    acquiring a scene image through a camera;
    inputting the scene image into a feature pyramid network, and acquiring a target human body from the scene image;
    performing human body pose estimation on the target human body using a cascaded pyramid network to obtain a human body pose estimation result, wherein the cascaded pyramid network includes a GlobalNet network and a RefineNet network; and
    matching the human body pose estimation result against violent-behavior human poses stored in a database, so as to determine from the matching result whether violent behavior is present in the scene image, and classifying the violent behavior.
PCT/CN2019/117407 2019-09-16 2019-11-12 Violent behavior detection method and system WO2021051547A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910872172.0A CN111104841B (zh) 2019-09-16 2019-09-16 Violent behavior detection method and system
CN201910872172.0 2019-09-16

Publications (1)

Publication Number Publication Date
WO2021051547A1 true WO2021051547A1 (zh) 2021-03-25

Family

ID=70421353

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117407 WO2021051547A1 (zh) 2019-09-16 2019-11-12 Violent behavior detection method and system

Country Status (2)

Country Link
CN (1) CN111104841B (zh)
WO (1) WO2021051547A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609993A (zh) * 2021-08-06 2021-11-05 烟台艾睿光电科技有限公司 Pose estimation method, apparatus, and device, and computer-readable storage medium
CN113610037A (zh) * 2021-08-17 2021-11-05 北京计算机技术及应用研究所 Occluded pedestrian detection method based on head and visible-region cues
CN113989927A (zh) * 2021-10-27 2022-01-28 东北大学 Method and system for recognizing group violent behavior in video based on skeleton data
CN115082836A (zh) * 2022-07-23 2022-09-20 深圳神目信息技术有限公司 Target object detection method and apparatus assisted by behavior recognition
CN115240123A (zh) * 2022-09-23 2022-10-25 南京邮电大学 Method for detecting violent behavior in dark scenes for intelligent surveillance systems
CN117237741A (zh) * 2023-11-08 2023-12-15 烟台持久钟表有限公司 Campus dangerous behavior detection method, system, apparatus, and storage medium
US12106531B2 (en) 2021-07-22 2024-10-01 Microsoft Technology Licensing, Llc Focused computer detection of objects in images

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753643B (zh) * 2020-05-09 2024-05-14 北京迈格威科技有限公司 Person pose recognition method and apparatus, computer device, and storage medium
US20210383534A1 (en) * 2020-06-03 2021-12-09 GE Precision Healthcare LLC System and methods for image segmentation and classification using reduced depth convolutional neural networks
CN114926725A (zh) * 2022-07-18 2022-08-19 中邮消费金融有限公司 Online financial gang fraud identification method based on image analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229445A (zh) * 2018-02-09 2018-06-29 深圳市唯特视科技有限公司 Multi-person pose estimation method based on a cascaded pyramid network
CN108764133A (zh) * 2018-05-25 2018-11-06 北京旷视科技有限公司 Image recognition method, apparatus, and system
CN109614882A (zh) * 2018-11-19 2019-04-12 浙江大学 Violent behavior detection system and method based on human body pose estimation
CN110021031A (zh) * 2019-03-29 2019-07-16 中广核贝谷科技有限公司 X-ray image enhancement method based on image pyramids

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103189897B (zh) * 2011-11-02 2016-06-15 松下电器(美国)知识产权公司 Image recognition device, image recognition method, and integrated circuit

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229445A (zh) * 2018-02-09 2018-06-29 深圳市唯特视科技有限公司 Multi-person pose estimation method based on a cascaded pyramid network
CN108764133A (zh) * 2018-05-25 2018-11-06 北京旷视科技有限公司 Image recognition method, apparatus, and system
CN109614882A (zh) * 2018-11-19 2019-04-12 浙江大学 Violent behavior detection system and method based on human body pose estimation
CN110021031A (zh) * 2019-03-29 2019-07-16 中广核贝谷科技有限公司 X-ray image enhancement method based on image pyramids

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DENG YINONG, LUO JAINXIN , JIN FENGLIN: "Overview of Human Pose Estimation Methods Based on Deep Learning", COMPUTER ENGINEERING AND APPLICATIONS, vol. 55, no. 19, 12 August 2019 (2019-08-12), pages 22 - 42, XP055792554, ISSN: 1002-8331, DOI: 10.3778/j.issn.1002-8331.1906-0113 *
YILUN CHEN, ZHICHENG WANG, YUXIANG PENG, ZHIQIANG ZHANG, GANG YU, JIAN SUN: "Cascaded pyramid network for multi-person pose estimation", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 1 June 2018 (2018-06-01), pages 7103 - 7112, XP033473629 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12106531B2 (en) 2021-07-22 2024-10-01 Microsoft Technology Licensing, Llc Focused computer detection of objects in images
CN113609993A (zh) * 2021-08-06 2021-11-05 烟台艾睿光电科技有限公司 Pose estimation method, apparatus, and device, and computer-readable storage medium
CN113610037A (zh) * 2021-08-17 2021-11-05 北京计算机技术及应用研究所 Occluded pedestrian detection method based on head and visible-region cues
CN113989927A (zh) * 2021-10-27 2022-01-28 东北大学 Method and system for recognizing group violent behavior in video based on skeleton data
CN113989927B (zh) * 2021-10-27 2024-04-26 东北大学 Method and system for recognizing group violent behavior in video based on skeleton data
CN115082836A (zh) * 2022-07-23 2022-09-20 深圳神目信息技术有限公司 Target object detection method and apparatus assisted by behavior recognition
CN115082836B (zh) * 2022-07-23 2022-11-11 深圳神目信息技术有限公司 Target object detection method and apparatus assisted by behavior recognition
CN115240123A (zh) * 2022-09-23 2022-10-25 南京邮电大学 Method for detecting violent behavior in dark scenes for intelligent surveillance systems
CN115240123B (zh) * 2022-09-23 2023-07-14 南京邮电大学 Method for detecting violent behavior in dark scenes for intelligent surveillance systems
CN117237741A (zh) * 2023-11-08 2023-12-15 烟台持久钟表有限公司 Campus dangerous behavior detection method, system, apparatus, and storage medium
CN117237741B (zh) * 2023-11-08 2024-02-13 烟台持久钟表有限公司 Campus dangerous behavior detection method, system, apparatus, and storage medium

Also Published As

Publication number Publication date
CN111104841B (zh) 2024-09-10
CN111104841A (zh) 2020-05-05

Similar Documents

Publication Publication Date Title
WO2021051547A1 (zh) Violent behavior detection method and system
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN107358242B (zh) Target area color recognition method and apparatus, and monitoring terminal
WO2020107847A1 (zh) Fall detection method based on skeleton points and fall detection apparatus thereof
WO2019109526A1 (zh) Age recognition method and apparatus for face images, and storage medium
WO2021169637A1 (zh) Image recognition method and apparatus, computer device, and storage medium
WO2021139324A1 (zh) Image recognition method and apparatus, computer-readable storage medium, and electronic device
WO2019095571A1 (zh) Person emotion analysis method, apparatus, and storage medium
JP7454105B2 (ja) Face image quality evaluation method and apparatus, computer device, and computer program
CN107784282A (zh) Object attribute recognition method, apparatus, and system
CN109299658B (zh) Face detection method, face image rendering method, apparatus, and storage medium
CN111914775A (zh) Liveness detection method and apparatus, electronic device, and storage medium
CN111626163B (zh) Face liveness detection method, apparatus, and computer device
US9633272B2 (en) Real time object scanning using a mobile phone and cloud-based visual search engine
CN111104925B (zh) Image processing method and apparatus, storage medium, and electronic device
JP7419080B2 (ja) Computer system and program
CN109977832B (zh) Image processing method, apparatus, and storage medium
US11113838B2 (en) Deep learning based tattoo detection system with optimized data labeling for offline and real-time processing
CN110941978B (zh) Face clustering method and apparatus for persons of unidentified identity, and storage medium
WO2019033567A1 (zh) Eye movement capture method, apparatus, and storage medium
CN114519877A (zh) Face recognition method, face recognition apparatus, computer device, and storage medium
CN112149570B (zh) Multi-person liveness detection method and apparatus, electronic device, and storage medium
CN114663871A (zh) Image recognition method, training method, apparatus, system, and storage medium
CN111325107A (zh) Detection model training method and apparatus, electronic device, and readable storage medium
WO2023279799A1 (zh) Object recognition method and apparatus, and electronic system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945786

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.07.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19945786

Country of ref document: EP

Kind code of ref document: A1