WO2021018106A1 - Pedestrian detection method and apparatus, computer-readable storage medium, and chip - Google Patents

Pedestrian detection method and apparatus, computer-readable storage medium, and chip

Info

Publication number
WO2021018106A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
candidate frame
feature
pedestrian
map
Prior art date
Application number
PCT/CN2020/105020
Other languages
English (en)
French (fr)
Inventor
叶齐祥
张天亮
刘健庄
张晓鹏
田奇
江立辉
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to JP2022506071A priority Critical patent/JP7305869B2/ja
Priority to EP20848616.7A priority patent/EP4006773A4/en
Publication of WO2021018106A1 publication Critical patent/WO2021018106A1/zh
Priority to US17/586,136 priority patent/US20220148328A1/en

Classifications

    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06V10/255 Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V10/806 Fusion of extracted features
    • G06V10/809 Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Definitions

  • This application relates to the field of computer vision in the field of artificial intelligence, and more specifically, to a pedestrian detection method, device, computer-readable storage medium, and chip.
  • Computer vision is an inseparable part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis, and military. It studies how to use cameras/video cameras and computers to obtain the data and information about a subject that we need. Vividly speaking, it installs eyes (a camera/camcorder) and a brain (algorithms) on a computer to replace the human eye in identifying, tracking, and measuring targets, so that the computer can perceive the environment. Because perception can be seen as extracting information from sensory signals, computer vision can also be regarded as the science of how to make artificial systems "perceive" from images or multi-dimensional data.
  • computer vision uses various imaging systems to replace the visual organs to obtain input information, and then the computer replaces the brain to complete the processing and interpretation of the input information.
  • the ultimate research goal of computer vision is to enable computers to observe and understand the world through vision like humans, and have the ability to adapt to the environment autonomously.
  • In the field of computer vision, pedestrian detection is a very important research direction with important applications in many fields and scenarios. For example, an advanced driving assistance system (ADAS) or an autonomous driving system (ADS) detects and avoids dynamic obstacles such as pedestrians on the road; in safe city and video surveillance scenarios, pedestrians are detected to find criminal suspects or track missing persons; and in smart home systems, pedestrians can be detected to realize robot movement and obstacle avoidance.
  • the detection effect is poor when a pedestrian is occluded, which mainly manifests as missed detections and false detections of occluded pedestrians.
  • the present application provides a pedestrian detection method, device, computer-readable storage medium, and chip to improve the accuracy of pedestrian detection, especially for occluded pedestrians.
  • In a first aspect, a pedestrian detection method is provided, which includes: acquiring an image; extracting features from the image to obtain a basic feature map of the image; determining, according to the basic feature map, candidate frames for areas where pedestrians may exist in the image; processing the basic feature map to obtain an object visibility map of the image; fusing the basic feature map of the image and the object visibility map of the image to obtain an enhanced feature map of the image; determining the features corresponding to the candidate frame according to the candidate frame of the image and the enhanced feature map of the image; and determining, according to the features corresponding to the candidate frame, the bounding box of pedestrians in the image and the confidence of the bounding box of pedestrians in the image.
  • the aforementioned image may be an image containing pedestrians.
  • acquiring the image may include capturing the image with a camera, or the image may be obtained from a memory.
  • to obtain the basic feature map, the image can be convolved (convolution processing), or the result of the convolution operation on the image can be further processed (for example, by summation, weighting, or concatenation operations).
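  • As a concrete illustration of this step, the following minimal sketch extracts a basic feature map from an image with a generic convolutional backbone (a ResNet-50 cut before its classification head); the backbone choice and the input size are assumptions for illustration, not the network described in this application.

```python
import torch
import torchvision

# Minimal sketch: obtaining a "basic feature map" from an input image with a
# generic convolutional backbone. The backbone (ResNet-50 without its
# classification head) and the input size are illustrative assumptions.
backbone = torchvision.models.resnet50()
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 640, 480)            # N x C x H x W input image
basic_feature_map = feature_extractor(image)   # e.g. 1 x 2048 x 20 x 15
```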
  • the aforementioned candidate frame is a bounding frame of an area where pedestrians may exist in the image.
  • the area where the candidate frame is located may be the area enclosed by the candidate frame (the area inside the candidate frame), and the area where the candidate frame is located is the area where pedestrians may exist in the image.
  • the object visibility map of the above image has different responses to different objects.
  • the response to the visible part of the pedestrian is greater than the response to the invisible part of the pedestrian.
  • the features of the pedestrian visible part are more prominent than the pedestrian invisible part.
  • the pixel value of the visible part of the pedestrian is greater than the pixel value of the invisible part of the pedestrian.
  • the characteristics of the visible part of the pedestrian can be more prominently reflected.
  • the pedestrian visible part is the part of the pedestrian that can be seen in the image, and the pedestrian invisible part is the part of the pedestrian that cannot be seen in the image.
  • the feature corresponding to the aforementioned candidate frame may include the area feature of the candidate frame, and the area feature of the candidate frame is the area feature located in the candidate frame in the enhanced feature map.
  • the area feature of the candidate frame may be the feature of the area enclosed by the candidate frame in the enhanced feature map.
  • the regional feature of the candidate frame may be the part of the feature in the region corresponding to the candidate frame in the enhanced feature map.
  • the position of the candidate frame in the enhanced feature map can be determined first, and then the feature of the area enclosed by the candidate frame in the enhanced feature map can be determined as the regional feature of the candidate frame.
  • the features of the region enclosed by the candidate frame in the enhanced feature map may be sampled (specifically, up-sampling or down-sampling), and then the regional features of the candidate frame can be obtained.
  • the bounding box of pedestrians in the image and the confidence of the bounding box may be the detection result of performing pedestrian detection on the image, which may be called the pedestrian detection result of the image.
  • In this application, the enhanced feature map obtained by fusing the basic feature map of the image and the object visibility map of the image highlights the features of the visible parts of pedestrians, which can improve the accuracy of subsequent pedestrian detection based on the enhanced feature map.
  • In particular, the accuracy of pedestrian detection in this application is significantly improved when pedestrians are (more seriously) occluded.
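  • The following is a minimal sketch of one plausible fusion step, assuming the object visibility map is a single-channel map that re-weights the basic feature map (broadcast over channels) and is added back to it; the exact fusion rule (e.g. a weighted summation) is an assumption for illustration, not the definitive implementation.

```python
import torch

def fuse_features(basic_feature_map: torch.Tensor,
                  visibility_map: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of the fusion step: re-weight the basic feature map by a
    single-channel object visibility map and add it back, so features of
    visible pedestrian parts are emphasised. The exact rule is an assumption."""
    # visibility_map: N x 1 x H x W, broadcast over the channel dimension
    weights = torch.sigmoid(visibility_map)
    return basic_feature_map + basic_feature_map * weights

basic = torch.randn(1, 512, 20, 15)
visibility = torch.randn(1, 1, 20, 15)
enhanced = fuse_features(basic, visibility)   # same shape as `basic`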
  • In addition, the amount of annotation of the training data is not increased during the training process: this application only needs to generate the object visibility map during processing and take it into account in subsequent processing. Compared with solutions that improve pedestrian detection accuracy by increasing the amount of data annotation, it saves annotation effort and reduces training complexity.
  • the pedestrian invisible part includes a pedestrian occluded part.
  • Because the visible part of the pedestrian and the occluded part of the pedestrian can be distinguished in the object visibility map, and the pixel values of the visible part are greater than those of the occluded part, the object visibility map can highlight the features of the visible part and weaken the features of the occluded part. This reduces the influence of the occluded part on the detection result in the subsequent pedestrian detection process, makes it easier to highlight the features of the visible part during subsequent pedestrian detection, and improves the pedestrian detection effect.
  • the pedestrian invisible part includes the background part of the image.
  • the background part of the image may refer to other parts of the image except pedestrians, or the background part of the image may also refer to other parts of the image except pedestrians and main objects (for example, cars).
  • In this case, the pedestrian can be distinguished from the background in the object visibility map, which highlights the features of the visible part of the pedestrian and weakens the features of the background, reducing the influence of the background on the detection result in the subsequent pedestrian detection process, making it easier to highlight the features of the visible part of the pedestrian during subsequent pedestrian detection, and improving the pedestrian detection effect.
  • Processing the basic feature map of the image to obtain the object visibility map of the image includes: performing convolution processing on the basic feature map of the image with a first convolutional network to obtain multiple first semantic feature maps, and performing weighted summation on the multiple first semantic feature maps to obtain the object visibility map of the image.
  • Determining the bounding box of pedestrians in the image and the confidence of the bounding box according to the features corresponding to the candidate frame includes: performing convolution processing on the features corresponding to the candidate frame with a second convolutional network to obtain multiple second semantic feature maps; processing the multiple second semantic feature maps with a regressor to determine the position of the bounding box; and processing the multiple second semantic feature maps with a classifier to obtain the confidence of the bounding box of pedestrians in the image.
  • the multiple first semantic feature maps are multiple feature maps with different semantics extracted from the entire basic feature map. Specifically, in the above-mentioned multiple first semantic feature maps, any two first semantic feature maps correspond to different semantics.
  • the multiple second semantic feature maps are feature maps with different semantics extracted from the features corresponding to the candidate frame, and the convolution parameters of the second convolutional network are the same as the convolution parameters of the first convolutional network.
  • any two second semantic feature maps correspond to different semantics.
  • The first convolutional network and the second convolutional network having the same convolution parameters may mean that the two networks have the same convolution kernel parameters; further, it may also mean that the network architecture and convolution kernel parameters of the first and second convolutional networks are exactly the same. When the first and second convolutional networks are used to perform feature extraction on the same image, the same semantic features of the image can be extracted.
  • the weighting coefficients used when performing the weighted summation of the multiple first semantic feature maps are the weighting coefficients used by the classifier to determine the pedestrian score.
  • Because the convolution parameters of the first convolutional network and the second convolutional network are the same, and the weighting coefficients used in the weighted summation of the multiple first semantic feature maps are those used by the classifier to score pedestrians, an object visibility map that highlights the visible parts of pedestrians can be obtained by processing the multiple first semantic feature maps, so that subsequent pedestrian detection can be performed more accurately based on the object visibility map.
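  • The sketch below illustrates this idea under stated assumptions: a first convolutional network produces K semantic feature maps, and the object visibility map is their weighted sum using the classifier weights of the pedestrian class (similar in spirit to a class activation map). The channel count and classifier layout are illustrative, not the networks described in this application.

```python
import torch
import torch.nn as nn

# Hedged sketch of the "self-activation" idea: K semantic feature maps are
# combined with the pedestrian-class classifier weights to form a visibility map.
K = 256
first_conv_net = nn.Conv2d(in_channels=512, out_channels=K, kernel_size=3, padding=1)
pedestrian_classifier = nn.Linear(K, 2)          # scores for {background, pedestrian}

basic_feature_map = torch.randn(1, 512, 20, 15)
semantic_maps = first_conv_net(basic_feature_map)        # 1 x K x H x W

# Weighted sum over the K semantic channels with the pedestrian-class weights
w = pedestrian_classifier.weight[1]                      # shape: (K,)
visibility_map = torch.einsum('nkhw,k->nhw', semantic_maps, w).unsqueeze(1)
```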
  • the feature corresponding to the candidate frame further includes the regional feature of the contour area of the candidate frame, where the contour area of the candidate frame is the area formed between the candidate frame and the reduced candidate frame obtained by reducing the candidate frame according to a first preset ratio.
  • the above-mentioned reduction of the candidate frame according to the first preset ratio may specifically mean that the width and height of the candidate frame are respectively reduced according to a certain ratio, and the ratio when the width and height of the candidate frame are reduced may be the same or different.
  • reducing the candidate frame according to a first preset ratio includes: reducing the width of the candidate frame according to a first reduction ratio, and reducing the height of the candidate frame according to a second reduction ratio.
  • the first preset ratio includes a first reduction ratio and a second reduction ratio, where the first reduction ratio and the second reduction ratio may be the same or different.
  • the first reduction ratio and the second reduction ratio can be set based on experience. For example, appropriate values may be set for the first reduction ratio and the second reduction ratio, so that the contour area of the candidate frame, obtained by reducing the width and height of the candidate frame according to the first reduction ratio and the second reduction ratio, can better extract pedestrian contour features.
  • the first reduction ratio and the second reduction ratio can be set to values that can better extract the pedestrian contour.
  • the first reduction ratio may be 1/1.1, and the second reduction ratio may be 1/1.8.
  • the regional features of the outline area of the candidate frame generally include the outline features of pedestrians, and the outline features of pedestrians also play an important role in pedestrian detection.
  • Therefore, the contour features of the pedestrian can also be taken into consideration when performing pedestrian detection, so that the pedestrian contour features can subsequently be combined to perform pedestrian detection better.
  • the above method further includes: setting to zero the values of the features located in the reduced candidate frame among the regional features of the candidate frame, to obtain the regional feature of the contour area of the candidate frame.
  • When obtaining the regional feature of the contour area of the candidate frame, directly setting to zero the values of the features located in the reduced candidate frame among the regional features of the candidate frame allows the regional feature of the contour area of the candidate frame to be obtained quickly and conveniently.
  • Alternatively, the regional feature of the contour area of the candidate frame may be extracted directly from the enhanced feature map according to the position of the contour area of the candidate frame.
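  • A minimal sketch of the zeroing approach is given below, assuming the candidate frame's regional feature has already been pooled to a fixed C x H x W tensor; the reduction ratios follow the text, while the centred layout and pooled representation are assumptions.

```python
import torch

def contour_region_feature(roi_feature: torch.Tensor,
                           w_ratio: float = 1 / 1.1,
                           h_ratio: float = 1 / 1.8) -> torch.Tensor:
    """Hedged sketch: shrink the candidate frame by the preset ratios and zero
    the features inside the shrunk frame, keeping a ring that mostly covers
    the pedestrian contour."""
    c, h, w = roi_feature.shape
    inner_h, inner_w = int(h * h_ratio), int(w * w_ratio)
    top = (h - inner_h) // 2
    left = (w - inner_w) // 2
    out = roi_feature.clone()
    out[:, top:top + inner_h, left:left + inner_w] = 0.0  # zero the inner area
    return out

contour_feat = contour_region_feature(torch.randn(256, 14, 7))
```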
  • the feature corresponding to the candidate frame also includes the regional feature of the background area of the candidate frame, where the background area of the candidate frame is the area formed between the candidate frame and the expanded candidate frame obtained by expanding the candidate frame according to a second preset ratio.
  • the above-mentioned expansion of the candidate frame according to the second preset ratio may specifically mean that the width and height of the candidate frame are respectively expanded according to a certain ratio, and the ratio when the width and height of the candidate frame are expanded may be the same or different.
  • the expansion of the candidate frame according to a second preset ratio includes: the width of the candidate frame is expanded according to a first expansion ratio, and the height of the candidate frame is expanded according to a second expansion ratio.
  • the second preset ratio includes a first expansion ratio and a second expansion ratio, where the first expansion ratio and the second expansion ratio may be the same or different.
  • the first expansion ratio and the second expansion ratio can be set based on experience. For example, appropriate values can be set for the first expansion ratio and the second expansion ratio, so that the background area of the candidate frame, obtained by expanding the width and height of the candidate frame according to the first expansion ratio and the second expansion ratio, can better extract the background features around the pedestrian.
  • the first expansion ratio may be 1.1, and the second expansion ratio may be 1.8.
  • the regional characteristics of the background area of the candidate frame generally reflect the characteristics of the background area where the pedestrian in the image is located.
  • the characteristics of the background area can be combined with the characteristics of the pedestrian to perform pedestrian detection.
  • Therefore, the regional feature of the background area can also be taken into account when performing pedestrian detection, so that the regional features of the background area can subsequently be combined to perform pedestrian detection better.
  • the above method further includes: acquiring the regional feature of a first region, where the regional feature of the first region is the regional feature located in the region of the expanded candidate frame in the object visibility map; and setting to zero the features located in the candidate frame among the regional features of the first region, to obtain the regional feature of the background area of the candidate frame.
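  • The sketch below mirrors the contour case for the background area: the expanded candidate frame's regional feature is taken and the part corresponding to the original candidate frame is zeroed. The expansion ratios follow the text; the centred layout and pooled representation are assumptions.

```python
import torch

def background_region_feature(expanded_roi_feature: torch.Tensor,
                              w_expand: float = 1.1,
                              h_expand: float = 1.8) -> torch.Tensor:
    """Hedged sketch: zero the part of the expanded candidate frame's regional
    feature that corresponds to the original candidate frame, leaving only the
    surrounding background features."""
    c, h, w = expanded_roi_feature.shape
    inner_h, inner_w = int(h / h_expand), int(w / w_expand)
    top = (h - inner_h) // 2
    left = (w - inner_w) // 2
    out = expanded_roi_feature.clone()
    out[:, top:top + inner_h, left:left + inner_w] = 0.0  # remove the frame interior
    return out

background_feat = background_region_feature(torch.randn(256, 14, 7))
```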
  • the above method is executed by a neural network (model).
  • Specifically, the neural network (model) can be used to process the image and finally determine, according to the features corresponding to the candidate frame, the bounding box of pedestrians in the image and the confidence of the bounding box of pedestrians in the image.
  • In a second aspect, a neural network training method is provided, which includes: acquiring training data, where the training data includes training images and pedestrian annotation results of the training images; and performing the following processing on the training images with the neural network: performing convolution processing on a training image to obtain the basic feature map of the training image; determining the candidate frames of pedestrians in the training image according to the basic feature map; processing the basic feature map of the training image to obtain the object visibility map of the training image; fusing the basic feature map of the training image and the object visibility map of the training image to obtain the enhanced feature map of the training image; determining the features corresponding to the candidate frame according to the candidate frame of the training image and the enhanced feature map of the training image; determining the pedestrian detection result of the training image according to the features corresponding to the candidate frame; determining the loss value of the neural network according to the pedestrian detection result of the training image and the pedestrian annotation result of the training image; and then adjusting the neural network through backpropagation according to the loss value.
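  • The toy-scale sketch below illustrates the training loop described above (forward pass, loss against the annotated pedestrian boxes, then backpropagation). The tiny network, the smooth-L1/cross-entropy loss, and the random stand-in data are assumptions so that the loop runs; they are not the detector or loss of this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDetector(nn.Module):
    """Placeholder detector: one conv backbone, a box head, and a confidence head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.box_head = nn.Linear(16, 4)     # bounding-box regression
        self.cls_head = nn.Linear(16, 2)     # background / pedestrian confidence

    def forward(self, x):
        feat = F.adaptive_avg_pool2d(F.relu(self.backbone(x)), 1).flatten(1)
        return self.box_head(feat), self.cls_head(feat)

model = TinyDetector()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for _ in range(10):                                   # stand-in training loop
    images = torch.randn(4, 3, 128, 64)               # dummy training images
    gt_boxes = torch.rand(4, 4)                       # dummy annotated boxes
    gt_labels = torch.randint(0, 2, (4,))             # dummy pedestrian labels
    pred_boxes, pred_logits = model(images)
    loss = F.smooth_l1_loss(pred_boxes, gt_boxes) + F.cross_entropy(pred_logits, gt_labels)
    optimizer.zero_grad()
    loss.backward()                                   # backpropagation
    optimizer.step()                                  # adjust the network
```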
  • the above candidate frame is a bounding frame of an area where pedestrians may exist in the training image.
  • the area where the candidate frame is located may be the area enclosed by the candidate frame (the area inside the candidate frame), and the area where the candidate frame is located is the area where pedestrians may exist in the training image.
  • the visible part of the pedestrian is the part where the pedestrian image can be seen, and the invisible part of the pedestrian is the part where the pedestrian image cannot be seen.
  • the feature corresponding to the aforementioned candidate frame may include the area feature of the candidate frame, and the area feature of the candidate frame is the feature of the area located in the candidate frame in the enhanced feature map.
  • the pedestrian detection result of the training image may include the bounding box of pedestrians in the training image and the confidence of the bounding box of pedestrians in the training image.
  • the result of pedestrian detection and labeling of the training image includes a bounding box of pedestrians in the training image.
  • the pedestrian detection and labeling results of the training images may be pre-labeled (specifically, they may be manually labeled).
  • the neural network trained by the method of the above second aspect can be used to execute the method of the first aspect of the present application.
  • It should be understood that, in the method of the first aspect, the basic feature map, candidate frame, object visibility map, enhanced feature map, bounding box of pedestrians, and confidence of the bounding box all refer to the acquired image, whereas in the training method of the second aspect they all refer to the training image.
  • In a third aspect, a pedestrian detection device is provided, which includes various modules for executing the method in the first aspect.
  • In a fourth aspect, a neural network training device is provided, which includes various modules for executing the method in the second aspect.
  • In a fifth aspect, a pedestrian detection device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to perform the method in the first aspect.
  • In a sixth aspect, a neural network training device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to perform the method in the second aspect.
  • In a seventh aspect, an electronic device is provided, which includes the pedestrian detection device in the third aspect or the fifth aspect.
  • In an eighth aspect, an electronic device is provided, which includes the neural network training device in the fourth aspect or the sixth aspect.
  • the above-mentioned electronic device may specifically be a mobile terminal (for example, a smart phone), a tablet computer, a notebook computer, an augmented reality/virtual reality device, a vehicle-mounted terminal device, and so on.
  • In a ninth aspect, a computer-readable storage medium is provided, where the computer-readable storage medium stores program code, and the program code includes instructions for executing the steps of the method in the first aspect or the second aspect.
  • In a tenth aspect, a computer program product containing instructions is provided; when the computer program product runs on a computer, the computer is caused to execute the method in the first aspect or the second aspect.
  • In an eleventh aspect, a chip is provided, where the chip includes a processor and a data interface.
  • the processor reads instructions stored in a memory through the data interface and executes the method in the first aspect or the second aspect.
  • the chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory.
  • the processor is used to execute the method in the first aspect.
  • the aforementioned chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
  • the method of the first aspect may specifically refer to the first aspect and a method in any one of the various implementation manners of the first aspect.
  • the method in the second aspect may specifically refer to the method in the second aspect and any one of the various implementation manners in the second aspect.
  • FIG. 1 is a schematic diagram of a process of pedestrian detection in an assisted/automatic driving system provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the process of pedestrian detection in the safe city/video surveillance system provided by an embodiment of the present application;
  • Figure 3 is a schematic structural diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of pedestrian detection using the convolutional neural network model provided by the embodiment of the application.
  • FIG. 5 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • Fig. 6 is a schematic block diagram of a pedestrian detection device according to an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of a pedestrian detection device according to an embodiment of the present application.
  • Figure 8 is a schematic diagram of pedestrian detection using a pedestrian detection device
  • Figure 9 is a schematic diagram of pedestrian detection using a pedestrian detection device
  • FIG. 10 is a schematic flowchart of a pedestrian detection method according to an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a basic feature map obtained by convolution processing on an image
  • FIG. 12 is a schematic diagram of a process of generating a basic feature map of an image
  • FIG. 13 is a schematic diagram of the process of determining an image candidate frame by the RPN module
  • Figure 14 is a schematic diagram of an object visibility map
  • Figure 15 is a schematic diagram of an object visibility map
  • FIG. 16 is a schematic flowchart of a pedestrian detection method according to an embodiment of the present application.
  • FIG. 17 is a schematic diagram of a process in which the self-activation module obtains an object visibility map according to the basic feature map
  • Figure 18 is a schematic diagram of the process of weighted summation of the object visibility map and the basic feature map
  • FIG. 19 is a schematic diagram of a process of obtaining pedestrian detection results based on candidate frames and enhanced feature maps
  • FIG. 20 is a schematic diagram of a process of obtaining a pedestrian detection result according to an object visibility map, a candidate frame, and an enhanced feature map;
  • FIG. 21 is a schematic diagram of determining the contour area of a candidate frame
  • FIG. 22 is a schematic diagram of determining the background area of the candidate frame
  • FIG. 23 is a schematic diagram of a process of obtaining a pedestrian detection result according to an object visibility map, a candidate frame, and an enhanced feature map;
  • Figure 24 is a schematic diagram of the RCNN module processing features corresponding to the candidate frame
  • FIG. 25 is a schematic block diagram of a pedestrian detection device according to an embodiment of the present application.
  • FIG. 26 is a schematic diagram of the hardware structure of a pedestrian detection device according to an embodiment of the present application.
  • FIG. 27 is a schematic diagram of the hardware structure of a neural network training device according to an embodiment of the present application.
  • the solution of the present application can be applied to fields that require pedestrian recognition (also called pedestrian detection), such as assisted driving, autonomous driving, safe cities, and smart terminals.
  • Application Scenario 1: Assisted/Automatic Driving System
  • the road picture acquired by the assisted/automatic driving system undergoes pedestrian detection to obtain a pedestrian detection result.
  • the assisted/automatic driving system can then control the vehicle according to the pedestrian detection result.
  • pedestrian detection can be performed with the pedestrian detection method of the embodiment of the present application. Based on the pedestrian detection, it can be determined whether there is a pedestrian in the road picture and where the pedestrian is located, so that the assisted/automatic driving system can control the vehicle according to the recognition result.
  • the pedestrian detection is performed in real time, and the pedestrian detection results are marked.
  • the pedestrian detection results can then be used by the system's analysis unit to find criminal suspects and missing persons and realize Skynet tracking.
  • the surveillance picture acquired by the safe city/video surveillance system undergoes pedestrian detection to obtain a pedestrian detection result.
  • specific people can be identified and tracked based on the pedestrian detection result.
  • Specifically, pedestrian detection can be performed with the pedestrian detection method of the embodiment of the present application. Based on the pedestrian detection, it can be determined whether there is a pedestrian in the surveillance picture and where the pedestrian is located; when there is a pedestrian in the surveillance picture, it can be identified whether the pedestrian is a specific person (a missing person, a criminal suspect, etc.), and when a specific person is identified, the Skyeye system (which can be regarded as a part of the safe city/video surveillance system) can be activated to track that person.
  • the pedestrian detection method in the embodiment of the present application may be executed by a neural network (model).
  • a neural network can be composed of neural units.
  • a neural unit can refer to an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit can be as shown in formula (1): h_{W,b}(x) = f(W^T x + b) = f(\sum_{s=1}^{n} W_s x_s + b), where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to perform nonlinear transformation on the features in the neural network, thereby converting the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
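  • A minimal sketch of formula (1) in plain Python, assuming the sigmoid activation mentioned above; the numeric values are illustrative.

```python
import math

# Single neural unit: weighted sum of the inputs x_s plus a bias b,
# passed through a sigmoid activation f.
def neural_unit(x, W, b):
    z = sum(w_s * x_s for w_s, x_s in zip(W, x)) + b
    return 1.0 / (1.0 + math.exp(-z))      # f = sigmoid

output = neural_unit(x=[0.5, -1.2, 3.0], W=[0.8, 0.1, -0.4], b=0.2)
```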
  • a neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
  • A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • Dividing the DNN according to the positions of the different layers, the layers inside the DNN can be divided into three categories: the input layer, the hidden layers, and the output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in between are all hidden layers.
  • the layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the i+1th layer.
  • Although the DNN looks complicated, the work of each layer is not complicated. In simple terms, it is the following linear relationship expression: \vec{y} = \alpha(W \cdot \vec{x} + \vec{b}), where \vec{x} is the input vector, \vec{y} is the output vector, \vec{b} is the offset vector, W is the weight matrix (also called coefficients), and \alpha() is the activation function.
  • Each layer simply performs this operation on the input vector \vec{x} to obtain the output vector \vec{y}. Because the DNN has a large number of layers, the number of coefficient matrices W and offset vectors \vec{b} is also relatively large.
  • These parameters are defined in the DNN as follows, taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the fourth neuron in the second layer to the second neuron in the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4.
  • In summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W^L_{jk}.
  • the input layer has no W parameter.
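  • A minimal sketch of this layer-by-layer computation is shown below, with illustrative layer sizes and a sigmoid standing in for the activation \alpha; the entry weights[l][j, k] plays the role of the coefficient from neuron k in one layer to neuron j in the next.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

layer_sizes = [4, 5, 3, 2]                 # input, two hidden layers, output
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.standard_normal(m) for m in layer_sizes[1:]]

x = rng.standard_normal(4)                 # input vector (the input layer has no W)
for W, b in zip(weights, biases):
    x = sigmoid(W @ x + b)                 # y = alpha(W.x + b); y feeds the next layer
```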
  • more hidden layers make the network more capable of portraying complex situations in the real world. Theoretically speaking, a model with more parameters is more complex and has a greater "capacity", which means it can complete more complex learning tasks.
  • Training a deep neural network is also a process of learning a weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (a weight matrix formed by vectors W of many layers).
  • Convolutional neural network (convolutional neuron network, CNN) is a deep neural network with convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolution layer and a sub-sampling layer.
  • the feature extractor can be regarded as a filter.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can be connected to only part of the neighboring neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernels.
  • Weight sharing can be understood as meaning that the way image information is extracted is independent of location.
  • the convolution kernel can be initialized in the form of a matrix of random size. During the training of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the residual network is a deep convolutional network proposed in 2015. Compared with the traditional convolutional neural network, the residual network is easier to optimize and can increase the accuracy by adding considerable depth.
  • the core of the residual network is to solve the side effect (degradation problem) caused by increasing the depth, so that the network performance can be improved by simply increasing the network depth.
  • a residual network generally contains many sub-modules with the same structure.
  • The name of a residual network (ResNet) is usually followed by a number indicating how many such sub-modules it contains; for example, ResNet50 means that there are 50 sub-modules in the residual network.
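  • A minimal sketch of one residual sub-module is shown below: the block's convolutional output is added to its input through a shortcut connection, which is what makes very deep networks easier to optimise. The channel count and two-convolution layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Residual sub-module: output = ReLU(conv(conv(x)) + x)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv2(F.relu(self.conv1(x)))
        return F.relu(out + x)               # add the identity shortcut

y = ResidualBlock()(torch.randn(1, 64, 32, 32))
```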
  • the classifier is generally composed of a fully connected layer and a softmax function (which can be called a normalized exponential function), and can output probabilities of different categories according to the input.
  • the neural network can use the backpropagation (BP) algorithm to modify the parameter values in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forward-propagating the input signal to the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation motion dominated by error loss, and aims to obtain the optimal neural network model parameters, such as the weight matrix.
  • FIG. 3 is a schematic diagram of the system architecture of an embodiment of the present application.
  • the system architecture 100 includes an execution device 110, a training device 120, a database 130, a client device 140, a data storage system 150, and a data collection system 160.
  • the execution device 110 includes a calculation module 111, an I/O interface 112, a preprocessing module 113, and a preprocessing module 114.
  • the calculation module 111 may include the target model/rule 101, and the preprocessing module 113 and the preprocessing module 114 are optional.
  • the data collection device 160 is used to collect training data.
  • the training data may include training images (the training images include pedestrians) and annotation files, where an annotation file gives the coordinates of the bounding boxes of the pedestrians in the training image.
  • the data collection device 160 stores the training data in the database 130, and the training device 120 trains to obtain the target model/rule 101 based on the training data maintained in the database 130.
  • Specifically, the training device 120 performs object detection on the input training image and compares the output pedestrian detection result (the bounding box of pedestrians in the image and the confidence of the bounding box) with the labeling result, until the difference between the pedestrian detection result output by the training device 120 and the pre-labeled result is less than a certain threshold, thereby completing the training of the target model/rule 101.
  • the above-mentioned target model/rule 101 can be used to implement the pedestrian detection method of the embodiment of the present application, that is, input the image to be processed (after relevant preprocessing) into the target model/rule 101 to obtain the pedestrian detection result of the image to be processed .
  • the target model/rule 101 in the embodiment of the present application may specifically be a neural network.
  • the training data maintained in the database 130 may not all come from the collection of the data collection device 160, and may also be received from other devices.
  • the training device 120 does not necessarily perform the training of the target model/rule 101 completely based on the training data maintained by the database 130. It may also obtain training data from the cloud or other places for model training.
  • the above description should not be construed as a limitation on the embodiments of this application.
  • the target model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 3, which can be a terminal, such as a mobile phone terminal, a tablet computer, Notebook computers, augmented reality (AR)/virtual reality (VR), vehicle-mounted terminals, etc., can also be servers or clouds.
  • the execution device 110 is configured with an input/output (input/output, I/O) interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140.
  • the input data in this embodiment of the application may include: the image to be processed input by the client device.
  • the client device 140 here may specifically be a terminal device.
  • the preprocessing module 113 and the preprocessing module 114 are used to preprocess the input data (such as the image to be processed) received by the I/O interface 112.
  • the preprocessing module 113 and the preprocessing module 114 may be omitted, or there may be only one preprocessing module.
  • the calculation module 111 can be directly used to process the input data.
  • the execution device 110 may call data, codes, etc. in the data storage system 150 for corresponding processing .
  • the data, instructions, etc. obtained by corresponding processing may also be stored in the data storage system 150.
  • the I/O interface 112 presents the processing result, such as the pedestrian detection result calculated by the target model/rule 101, to the client device 140 to provide it to the user.
  • the pedestrian detection result obtained by the target model/rule 101 in the calculation module 111 may be processed by the preprocessing module 113 (processing by the preprocessing module 114 may also be added), the processing result is then sent to the I/O interface, and the I/O interface sends the processing result to the client device 140 for display.
  • Alternatively, the calculation module 111 may transmit the processed pedestrian detection result directly to the I/O interface, which then sends the processing result to the client device 140 for display.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete The above tasks provide the user with the desired result.
  • the user can manually set input data, and the manual setting can be operated through the interface provided by the I/O interface 112.
  • the client device 140 can automatically send input data to the I/O interface 112. If the client device 140 is required to automatically send the input data and the user's authorization is required, the user can set the corresponding authority in the client device 140.
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form may be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data, and store it in the database 130 as shown in the figure.
  • Alternatively, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result output by the I/O interface 112 as new sample data in the database 130, as shown in the figure.
  • It is worth noting that FIG. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • the target model/rule 101 obtained by training according to the training device 120 may be the neural network in the embodiment of the present application.
  • the neural network provided in the embodiment of the present application may be a CNN, a deep convolutional neural network (DCNN), or the like.
  • CNN is a very common neural network
  • the structure of CNN will be introduced in detail below in conjunction with Figure 4.
  • a convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture.
  • a deep learning architecture refers to learning at multiple levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network can respond to the input image.
  • a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (the pooling layer is optional), and a fully connected layer 230.
  • the convolutional layer/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 can include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. In the process of performing the convolution operation on the image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix and the depth dimension of the input image are the same.
  • During the convolution operation, the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension; however, in most cases a single weight matrix is not used, but multiple weight matrices of the same size (rows × columns), that is, multiple homogeneous matrices, are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to eliminate unwanted noise in the image.
  • Because the multiple weight matrices have the same size (rows × columns), the convolution feature maps extracted by the multiple weight matrices of the same size also have the same size, and the multiple extracted convolution feature maps of the same size are then combined to form the output of the convolution operation.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications.
  • Each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.
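  • A minimal sketch of this convolution step is shown below: several kernels of the same size slide over the input image and their outputs are stacked to form the depth dimension of the output feature map. The kernel count and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Eight 3x3 kernels applied to a 3-channel image; each kernel produces one
# output channel, so the outputs stack into a feature map of depth 8.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)

image = torch.randn(1, 3, 64, 64)     # N x depth x H x W
feature_map = conv(image)             # 1 x 8 x 64 x 64: one channel per kernel
```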
  • When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (such as 221) often extracts more general features, which can also be called low-level features; as the depth of the convolutional neural network 200 increases, the features extracted by the subsequent convolutional layers (for example, 226) become more and more complex, such as high-level semantic features, and features with higher semantics are more suitable for the problem to be solved.
  • The layers 221-226 illustrated by 220 in Figure 4 can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the only purpose of the pooling layer is to reduce the size of the image space.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling.
  • the operators in the pooling layer should also be related to the image size.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image of the input pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
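  • The following minimal sketch shows the two pooling operators described above reducing the spatial size of a feature map; the kernel size and tensor shape are illustrative.

```python
import torch
import torch.nn.functional as F

# Average pooling outputs the mean of each sub-region, max pooling its maximum;
# both halve the spatial size here.
x = torch.randn(1, 8, 64, 64)
avg_pooled = F.avg_pool2d(x, kernel_size=2)   # 1 x 8 x 32 x 32
max_pooled = F.max_pool2d(x, kernel_size=2)   # 1 x 8 x 32 x 32
```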
  • After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is still not able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. In order to generate the final output information (the required class information or other related information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate the output of one or a group of required classes. Therefore, the fully connected layer 230 can include multiple hidden layers (231, 232 to 23n as shown in FIG. 4) and an output layer 240, and the parameters contained in the multiple hidden layers can be obtained by pre-training based on relevant training data of a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the fully connected layer 230, the final layer of the entire convolutional neural network 200 is the output layer 240.
  • the output layer 240 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
  • the convolutional neural network 200 shown in FIG. 4 is only used as an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.
  • a convolutional neural network (CNN) 200 shown in FIG. 4 may be used to execute the pedestrian detection method of the embodiment of the present application.
  • after the image to be processed passes through the input layer 210 and the convolutional layer/pooling layer 220, the detection result of the image to be processed (the bounding frame of the pedestrian in the image to be processed and the confidence of that bounding frame) can be obtained.
  • FIG. 5 is a hardware structure of a chip provided by an embodiment of the application, and the chip includes a neural network processor 50.
  • the chip can be set in the execution device 110 as shown in FIG. 3 to complete the calculation work of the calculation module 111.
  • the chip may also be set in the training device 120 as shown in FIG. 3 to complete the training work of the training device 120 and output the target model/rule 101.
  • the algorithms of each layer in the convolutional neural network shown in FIG. 4 can be implemented in the chip shown in FIG. 5.
  • a neural network processor (neural-network processing unit, NPU) 50 is mounted as a coprocessor to a main central processing unit (central processing unit, CPU) (host CPU), and the main CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 503.
  • the controller 504 controls the arithmetic circuit 503 to extract data from the memory (weight memory or input memory) and perform calculations.
  • the arithmetic circuit 503 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
  • the arithmetic circuit 503 fetches the data corresponding to matrix B from the weight memory 502 and caches it on each PE in the arithmetic circuit 503.
  • the arithmetic circuit 503 fetches the matrix A data and matrix B from the input memory 501 to perform matrix operations, and the partial result or final result of the obtained matrix is stored in an accumulator 508.
  • the vector calculation unit 507 may perform further processing on the output of the arithmetic circuit 503, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 507 can be used for network calculations in the non-convolutional/non-FC layers of the neural network, such as pooling, batch normalization, local response normalization, and so on.
  • the vector calculation unit 507 can store the processed output vector to the unified buffer 506.
  • the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 507 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 503, for example for use in subsequent layers in a neural network.
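  • The following minimal sketch mimics the dataflow described above (matrix multiplication accumulated into partial results, followed by vector-unit post-processing such as bias addition and a nonlinear activation); all names are hypothetical and the code is a conceptual model of the computation, not of the chip's actual interfaces.

```python
import numpy as np

def npu_like_layer(a, b, bias):
    """Conceptual sketch only: the 'arithmetic circuit' multiplies input matrix A by
    weight matrix B and accumulates partial results; the 'vector unit' then adds a
    bias vector and applies a nonlinearity to produce activation values."""
    acc = np.zeros((a.shape[0], b.shape[1]))   # accumulator for partial results
    for k in range(a.shape[1]):                # accumulate rank-1 partial products
        acc += np.outer(a[:, k], b[k, :])
    out = acc + bias                           # vector addition in the vector unit
    return np.maximum(out, 0.0)                # nonlinear activation (e.g. ReLU)

a = np.random.rand(2, 3)
b = np.random.rand(3, 4)
bias = np.zeros(4)
print(npu_like_layer(a, b, bias).shape)        # (2, 4)
```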
  • the unified memory 506 is used to store input data and output data.
  • the direct memory access controller (DMAC) 505 transfers input data in the external memory to the input memory 501 and/or the unified memory 506, stores weight data in the external memory into the weight memory 502, and stores data in the unified memory 506 into the external memory.
  • the bus interface unit (BIU) 510 is used to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 509 through the bus.
  • An instruction fetch buffer 509 connected to the controller 504 is used to store instructions used by the controller 504;
  • the controller 504 is configured to call the instructions cached in the instruction fetch memory 509 to control the working process of the operation accelerator.
  • the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are all on-chip memories.
  • the external memory is a memory external to the NPU.
  • the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM) or a high bandwidth memory (HBM).
  • each layer in the convolutional neural network shown in FIG. 4 may be executed by the arithmetic circuit 503 or the vector calculation unit 507.
  • Fig. 6 is a schematic diagram of the system architecture of an embodiment of the present application.
  • after the input picture is processed by the pedestrian detection network, the 2D frames surrounding the pedestrians in the input picture can be obtained (these 2D frames can also be called bounding frames), and the 2D frame information is then sent to a subsequent analysis module for processing, for example to the control unit of an automatic driving system for obstacle analysis, or to the analysis module of a safe city system to match missing persons.
  • the pedestrian detection network shown in Figure 6 includes a backbone network module, a region proposal network (RPN) module, a self-activation module, a basic feature weighting module, a regional feature generation module, a region convolutional neural network (RCNN) module, and an output module.
  • the pedestrian detection network in FIG. 6 can execute the pedestrian detection method of the embodiment of the present application. The process of processing the input image by the pedestrian detection network will be roughly introduced below.
  • After obtaining the input image, the backbone network module performs convolution processing on the input image to obtain the basic feature map of the input image; the RPN module processes the basic feature map to obtain the pedestrian candidate frames of the input image; the self-activation module performs further convolution processing and weighted summation processing on the basic feature map to obtain the object visibility map of the input image (the object visibility map can highlight the features of the visible part of the pedestrian in the input image); the basic feature weighting module weights the object visibility map and the basic feature map to obtain an enhanced feature map; the regional feature generation module generates the features corresponding to each candidate frame according to the candidate frame and the object visibility map of the input picture; and the RCNN module processes the features corresponding to the candidate frame to obtain the pedestrian detection result of the input picture, which may be the bounding frame of the pedestrian in the image and the confidence of that bounding frame.
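  • The module chain described above can be summarized by the following structural sketch, in which every callable (backbone, rpn, self_activation, roi_pool, rcnn) is a placeholder for the corresponding module rather than a concrete implementation; the element-wise weighting shown is just one of the fusion options discussed later.

```python
def detect_pedestrians(image, backbone, rpn, self_activation, roi_pool, rcnn):
    """Structural sketch of the pedestrian detection network; all callables are
    placeholders standing in for the modules described in the text."""
    base_feat = backbone(image)                    # backbone: basic feature map
    proposals = rpn(base_feat)                     # RPN: candidate frames
    visibility = self_activation(base_feat)        # self-activation: object visibility map
    enhanced = base_feat * visibility + base_feat  # basic feature weighting (one option)
    results = []
    for box in proposals:
        roi_feat = roi_pool(enhanced, box)         # regional feature of the candidate frame
        refined_box, confidence = rcnn(roi_feat)   # RCNN: final box and confidence
        results.append((refined_box, confidence))
    return results
```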
  • the product forms in which the pedestrian detection device of the embodiment of the application may land include automatic driving systems, terminal equipment, monitoring equipment, and so on.
  • the present invention can be deployed on the computing nodes of such equipment, and through a software upgrade the accuracy of pedestrian recognition can be improved.
  • the pedestrian detection device of the embodiment of the present application may be as shown in FIG. 7; the pedestrian detection device is mainly composed of a backbone network module 101, an RPN module 102, a self-activation module 103, a basic feature weighting module 104, a regional feature generation module 105, an RCNN module 106, and an output module 107.
  • the pedestrian detection device shown in FIG. 7 may or may not include the regional feature generation module 105.
  • when the pedestrian detection device does not include the regional feature generation module 105, the pedestrian detection device and the process of using it to perform pedestrian detection can be as shown in FIG. 8.
  • the RPN module determines the candidate frame of the image according to the basic feature map
  • the self-activation module 103 generates the object visibility map of the image according to the basic feature map
  • the basic feature weighting module 104 performs weighted summation on the object visibility map and the basic feature map of the image to obtain an enhanced feature map.
  • the feature corresponding to the candidate frame can be extracted from the enhanced feature map by region of interest (ROI) pooling, and then the feature corresponding to the candidate frame can be processed by the RCNN module to obtain the pedestrian detection result.
  • when the pedestrian detection device includes the regional feature generation module 105, the pedestrian detection device and the process of using it to perform pedestrian detection can be as shown in FIG. 9.
  • the pedestrian detection device shown in FIG. 9 uses the regional feature generation module 105 to generate the regional feature of the candidate frame, the regional feature of the contour area of the candidate frame, and the regional feature of the background area of the candidate frame; these three features can be merged together into the feature corresponding to the candidate frame.
  • the RCNN module processes the features corresponding to the candidate frame to obtain the pedestrian detection result.
  • Backbone network module 101
  • the backbone network module 101 is used to perform a series of convolution processing on an input image (also referred to as a picture) to obtain a basic feature map (feature map) of the image.
  • the basic feature map of the image provides the basic features of the image when other subsequent modules perform image detection.
  • the backbone network in the backbone network module 101 can be implemented in multiple ways, for example as a VGG network, a deep residual network (ResNet), or an inception network.
  • RPN module 102
  • the RPN module 102 is used to predict, on the basic feature map generated by the backbone network module 101, the areas where pedestrians may exist and to give the locations of these areas; the frames surrounding these areas are called candidate frames (proposals).
  • the position of the candidate frame detected by the RPN module 102 is not very accurate, and the candidate frame may fall on the background image, or may not well surround the pedestrian target in the image.
  • Self-activation module 103
  • the self-activation module uses the shared Conv5 convolution layer in the RCNN module 106 to perform further convolution processing on the basic feature map generated by the backbone network module 101 to obtain a high-level semantic feature map.
  • the high-level semantic feature map is then weighted with the classifier weight coefficients to obtain the object visibility map of the image.
  • the object visibility map of the image has a strong response to the visible part of the pedestrian, but has a weak response to the background and the occluded part.
  • the self-activation module 103 is the core module of the pedestrian detection device of the present application.
  • the basic feature weighting module 104 is configured to perform weighting processing on the basic feature map of the image generated by the backbone network module 101 and the object visibility map of the image generated by the self-activation module 103 to obtain an enhanced feature map of the image.
  • in the enhanced feature map of the image, the features of the visible part of the pedestrian are enhanced, while the features of the background and of the obstructions that occlude the pedestrian are weakened.
  • the regional feature generation module 105 refers to the object visibility map of the image generated by the self-activation module 103, processes each candidate frame generated by the RPN module 102, and generates the contour area image, background area image, and ROI area image corresponding to the current candidate frame; it extracts the features of these three regional images through the ROI pooling module in the regional feature generation module 105, and then merges the three extracted regional features as the regional feature of the current candidate frame.
  • the RCNN module 106 uses the same Conv5 convolutional layer as the self-activation module 103 to perform convolution processing on the regional features of the candidate frame generated by the regional feature generation module 105 to obtain the image features of the candidate frame area; a global average pooling (GAP) module then performs an averaging operation on the image features of the candidate frame area, and finally the averaged image features are sent to the frame regression branch and the classifier respectively to predict the final coordinates and confidence of the candidate frame.
  • the output module 107 is used to perform non-maximum suppression (NMS) processing on all candidate frames output by the RCNN module 106, thereby merging highly overlapping candidate frames and filtering out candidate frames with too low confidence.
  • FIG. 10 is a schematic flowchart of a pedestrian detection method according to an embodiment of the present application.
  • the method shown in FIG. 10 can be executed by the pedestrian detection device in the present application.
  • the method shown in FIG. 10 includes steps 1001 to 1007, and steps 1001 to 1007 are described in detail below.
  • the aforementioned image may be an image containing pedestrians.
  • the above-mentioned images may be various images containing pedestrians, for example, images taken by mobile phones or other smart terminals, road screen images obtained by assisted/automatic driving systems, and monitoring screen images obtained by safe city/video surveillance systems.
  • the above-mentioned images taken by mobile phones or smart terminals, as well as road screen images and monitoring screen images, are generally images containing pedestrians; if such an image does not contain pedestrians, the final recognition result can be empty, that is, the image is correctly recognized as containing no pedestrians and no bounding box surrounding a pedestrian is produced.
  • in step 1001, the image can be obtained by shooting with a camera, or the image can be obtained from a memory; it should be understood that the method shown in FIG. 10 may also directly start from step 1002.
  • the image can be convolved (convolution processing), or the result of the convolution operation on the image can be further processed (for example, by summation, weighting, concatenation, and so on) to obtain the basic feature map.
  • the image may be convolved through the backbone network (module) in the neural network to obtain the basic feature map of the image.
  • the backbone network can adopt a variety of convolutional network architectures, for example, the VGG network (a network proposed by the visual geometry group of the University of Oxford), deep residual network (ResNet), and inception network.
  • the above-mentioned basic feature map may contain multiple channels.
  • the basic feature map when performing feature extraction on the image, can be obtained by convolution processing on the image.
  • the process of obtaining the basic feature map through convolution processing can be as shown in FIG. 11.
  • the basic feature map is a feature map containing multiple channels.
  • assuming that the resolution of the input image is H0*W0*3 (height H0, width W0, and 3 channels, that is, the three RGB channels), the basic feature map can be obtained after convolution processing.
  • different convolutional layers of the residual network ResNet18 may be used to perform convolution operations on the input image.
  • the convolution operation shown in FIG. 12 may specifically include the following processes (1) to (4):
  • ResNet18-Conv1 (the first convolution layer of ResNet18) performs convolution processing on the input image to obtain a feature map C1.
  • the resolution of the feature map C1 obtained after ResNet18-Conv1 convolution processing can be H0/4*W0/4*64.
  • ResNet18-Conv1 can downsample the input image twice (the width and height become half of the original each time the sampling operation is performed), and expand the number of channels from 3 to 64, thereby obtaining the feature map C1.
  • ResNet18-Conv2 (the second convolution layer of ResNet18) continues to perform convolution processing on the feature map C1 to obtain a feature map C2.
  • the resolution of the obtained feature map C2 can be the same as the resolution of the feature map C1, both being H0/4*W0/4*64.
  • ResNet18-Conv3 (the third convolution layer of ResNet18) continues to perform convolution processing on the feature map C2 to obtain a feature map C3.
  • ResNet18-Conv3 can down-sample the feature map C2 again and double the number of channels (expand the number of channels from 64 to 128) to obtain feature map C3.
  • the resolution of feature map C3 is H0/8*W0/8*128.
  • ResNet18-Conv4 (the fourth convolution layer of ResNet18) continues to perform convolution processing on the feature map C3 to obtain a feature map C4 (feature map C4).
  • ResNet18-Conv4 can down-sample the feature map C3 again and double the number of channels (expand the number of channels from 128 to 256) to obtain feature map C4.
  • the resolution of feature map C4 is H0/16*W0/16*256.
  • the convolution process shown in FIG. 12 is only an example, and the network used in the convolution processing, the number of convolution processing, etc., are not limited in the embodiment of the present application.
  • the basic feature map in step 1002 can be either the feature map C4 shown in FIG. 12, or a combination of at least one of the convolution feature maps C1 to C4 shown in FIG. 12.
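  • The resolution and channel bookkeeping of the four ResNet18 stages described above can be checked with the following small sketch (it reproduces only the tensor sizes given in the text, not the convolutions themselves).

```python
def backbone_shapes(h0, w0):
    """Shape bookkeeping for the four ResNet18 stages as described in the text."""
    shapes = {}
    shapes["C1"] = (h0 // 4, w0 // 4, 64)     # Conv1: downsample twice, 3 -> 64 channels
    shapes["C2"] = (h0 // 4, w0 // 4, 64)     # Conv2: same resolution as C1
    shapes["C3"] = (h0 // 8, w0 // 8, 128)    # Conv3: downsample again, 64 -> 128 channels
    shapes["C4"] = (h0 // 16, w0 // 16, 256)  # Conv4: downsample again, 128 -> 256 channels
    return shapes

print(backbone_shapes(512, 1024))
```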
  • the aforementioned candidate frame is a bounding frame of an area where pedestrians may exist in the image.
  • the area where the candidate frame is located may be the area enclosed by the candidate frame (the area inside the candidate frame), and the area where the candidate frame is located is the area where pedestrians may exist in the image.
  • the RPN module can predict areas where pedestrians may exist according to the basic feature map of the image, and frame these areas, and the borders surrounding these areas are candidate frames.
  • the position of the candidate frame determined in step 1003 is not very accurate.
  • the candidate frame may fall on the background image (in which case there may be no pedestrian target in the candidate frame), or it may not surround the pedestrian target well.
  • the RPN module can generally be used to process the basic feature map to obtain a candidate frame.
  • the process of determining the image candidate frame by the RPN module will be described in detail below with reference to FIG. 13.
  • the RPN module can first use a 3 ⁇ 3 convolution kernel to perform convolution processing on the basic feature map to obtain an RPN Hidden feature map (RPN Hidden). Next, two 3 ⁇ 3 convolution kernels are used to perform convolution processing on the RPN hidden feature map, and the position and confidence of each candidate frame of the RPN hidden feature map are predicted. Generally speaking, the higher the confidence of the candidate frame, the greater the probability of pedestrians in the candidate frame.
  • the RPN module will merge the predicted candidate frames.
  • the redundant candidate frames can be removed according to the degree of overlap between the candidate frames, for example by using the NMS algorithm; however, the screening of candidate frames is not limited to the NMS algorithm.
  • the N (N≤J) candidate boxes with the highest scores can be selected from the J candidate boxes as candidate boxes containing pedestrians, where N and J are both positive integers.
  • the position of the candidate frame of the image determined by the RPN module is generally not very accurate; as shown in Figure 13, although there are pedestrians in the two candidate frames, the two candidate frames neither completely nor tightly enclose the pedestrians.
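  • As noted above, candidate frame screening may (but need not) use the NMS algorithm; the following is a minimal NMS sketch that keeps at most the top-N highest-scoring candidate frames whose overlap with an already-kept frame stays below a threshold. The threshold and top-N values are illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5, top_n=100):
    """Minimal NMS sketch: boxes are (x1, y1, x2, y2) rows of a NumPy array."""
    order = np.argsort(scores)[::-1]           # candidate frames sorted by score
    keep = []
    while order.size > 0 and len(keep) < top_n:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection-over-union between box i and the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou < iou_thresh]          # drop highly overlapping candidates
    return keep
```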
  • the object visibility map of the above image has different responses to different objects.
  • in the object visibility map of the image, the response to the visible part of the pedestrian is greater than the response to the invisible part of the pedestrian; that is to say, the object visibility map responds strongly to the visible part of the pedestrian and weakly to the invisible part, so that, compared with the invisible part, the features of the visible part of the pedestrian are more prominent.
  • the pixel value of the visible part of the pedestrian is greater than the pixel value of the invisible part of the pedestrian.
  • the visible part of the human body is the part of the pedestrian that can be seen, that is, the part of the pedestrian that is not obscured by other objects.
  • the invisible part of the pedestrian may include other object parts that obscure the pedestrian and the background part of the image.
  • the visible part of the human body has a brighter color, indicating a greater response, while the color of the invisible part of the human body is darker, indicating a smaller response.
  • the 6 images in the first row are the original images, and the 6 images in the second row are the corresponding object visibility maps.
  • in the object visibility maps shown in the second row, even when the colors of the visible parts of the human body in the original images are relatively dark, or the lower part of the human body is occluded, the brightness of the color corresponding to the visible part of the human body in the generated object visibility map is greater than the brightness of the color corresponding to the occluded part.
  • the characteristics of the visible part of the human body are enhanced, while the characteristics of the occluded part of the human body and the background area are weakened.
  • the features of the visible part of the pedestrian can be more prominently reflected.
  • the visible part of the pedestrian is the part where the pedestrian can be seen in the image, and the invisible part of the pedestrian is the part where the pedestrian cannot be seen.
  • the accuracy of pedestrian detection can be improved by combining the object visibility map of the image when determining the bounding frame of the pedestrian in the image subsequently.
  • in step 1004, in order to obtain the object visibility map of the image, convolution processing can first be performed on the basic feature map of the image, and then weighted summation processing can be performed on the multiple feature maps obtained by the convolution processing to obtain the object visibility map of the image.
  • the enhanced feature map obtained by fusing the basic feature map of the image and the object visibility map of the image highlights the features of the visible part of the pedestrian , which can improve the accuracy of subsequent pedestrian detection based on the enhanced feature map.
  • the pedestrian invisible part includes a pedestrian occlusion part.
  • since the visible part of the pedestrian and the occluded part of the pedestrian can be distinguished in the object visibility map, and the pixel value of the visible part is greater than that of the occluded part, the object visibility map can highlight the features of the visible part of the pedestrian and weaken the features of the occluded part, reducing the influence of the occluded part on the detection result in the subsequent pedestrian detection process and thus improving the effect of pedestrian detection.
  • the pedestrian invisible part includes the background part of the image.
  • the background part of the image may refer to other parts of the image except pedestrians, or the background part of the image may also refer to other parts of the image except pedestrians and main objects (for example, cars).
  • in this case, the pedestrian can be distinguished from the background part in the object visibility map, the features of the visible part of the pedestrian are highlighted, the features of the background part are weakened, and the influence of the background part on the detection result in the subsequent pedestrian detection process is reduced, which facilitates highlighting the features of the visible part of the pedestrian during subsequent pedestrian detection and improves the effect of pedestrian detection.
  • the specific process of determining the object visibility map of the image in step 1004 may include the following steps:
  • the multiple first semantic feature maps are multiple different semantic feature maps extracted from the entire basic feature map. Specifically, in the above-mentioned multiple first semantic feature maps, any two first semantic feature maps correspond to different semantics.
  • for example, the above multiple first semantic feature maps are composed of F1, F2, and F3, where F1 reflects the features of the head, F2 reflects the features of the left hand, and F3 reflects the features of the right hand; F1, F2, and F3 each reflect different semantics.
  • the convolution parameters during the convolution processing in the above step 1004a are the same as the convolution parameters during the convolution processing in the subsequent step 1007a.
  • the weighting coefficient used is the weighting coefficient used to determine the pedestrian score in the classifier.
  • the classifier here refers to the classifier in the RCNN module 106 shown in FIG. 8.
  • the weighting coefficients used in the weighted summation of the multiple first semantic feature maps in step 1004a are the weighting coefficients used in the classifier to determine the pedestrian score; using these weight coefficients makes the response of the object visibility map obtained in step 1004b to the visible part of the pedestrian greater than its response to the invisible part of the pedestrian.
  • in the above steps 1004a and 1004b, the basic feature map may be processed by the self-activation module to obtain the object visibility map of the image.
  • step 1004a and step 1004b will be described in detail below with reference to FIG. 17.
  • the Conv5 convolution layer can be used to perform further convolution processing on the basic feature map U ∈ H*W*K to obtain a high-level semantic feature map F ∈ H*W*K, where H, W, and K are respectively the height, width, and number of channels of the high-level semantic feature map.
  • high-level semantic features can be features that reflect some key parts of a person (for example, head features, left-hand features, right-hand features, etc.).
  • F 1 reflects the characteristics of the head
  • F 2 reflects the characteristics of the left hand
  • F 3 reflects the characteristics of the background object.
  • different high-level semantic features have different importance for pedestrian recognition: some features are very important for pedestrian recognition, while others are less important. For example, head features and facial features are more important for pedestrian recognition, while the features of background objects are of relatively low importance.
  • the convolution parameters of the Conv5 convolutional layer in the self-activation module described above are the same as those of the Conv5 convolutional layer in the subsequent RCNN module shown in FIG. 24; because the convolution parameters of the Conv5 convolution layer in the RCNN module are generally obtained through multiple rounds of training before pedestrian detection, using the same convolution parameters in the self-activation module allows the high-level semantic features of the image to be extracted better.
  • the multiple first feature maps can be weighted and summed according to formula (2) to obtain the visibility map of the image.
  • w_k (a 1*1 scalar) represents the weight coefficient of the first feature map F_k ∈ H*W.
  • the weight coefficients used here (which can also be called visualization weights) are the same as the weight coefficients used when feature maps of different semantics are weighted and summed in the subsequent RCNN module; specifically, the weight coefficient of a first feature map F_k ∈ H*W is the same as the weight coefficient used by the subsequent RCNN module when superimposing feature maps with the same semantics.
  • for example, the weight coefficient of the first feature map that characterizes the human head in formula (2) is the same as the weight coefficient used when the subsequent RCNN module performs a weighted summation over the feature map that characterizes the human head.
  • the weighted summation method in formula (2) can highlight the features that make a large contribution to human detection and suppress the features that make little contribution. It is found through experiments that the object visibility map V of the image calculated by formula (2) can produce a strong response to the visible part of a pedestrian, and a weak response to the background and the occluded part.
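  • The weighted summation described for formula (2), that is, summing the K first semantic feature maps F_k scaled by their classifier weights w_k into a single-channel map V, can be sketched as follows (the array shapes are assumed for illustration).

```python
import numpy as np

def object_visibility_map(f, w):
    """Weighted summation over semantic channels: f has shape (H, W, K), w has shape (K,).
    Each channel F_k is scaled by its weight w_k and the K weighted channels are summed
    into a single-channel visibility map V of shape (H, W)."""
    return np.tensordot(f, w, axes=([2], [0]))

f = np.random.rand(32, 32, 256)   # high-level semantic feature map (H, W, K)
w = np.random.rand(256)           # per-channel classifier weights
v = object_visibility_map(f, w)
print(v.shape)                    # (32, 32)
```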
  • the object visibility map of the image calculated by the above formula (2) can be shown in Figure 14 and Figure 15.
  • in the object visibility map, the color of the person is brighter, indicating a greater response, while the darker colors of other surrounding objects or the background indicate a lower response.
  • the object visibility maps of the images shown in Figure 14 and Figure 15 have a strong inhibitory effect on the background. From a local point of view, it has a strong response to the visible part of the pedestrian and a weak response to the obscured part of the pedestrian.
  • before the basic feature map of the image and the object visibility map of the image are fused, the object visibility map can be dimensionally expanded so that the number of channels of the expanded object visibility map is the same as that of the basic feature map, and the fusion processing is then performed on the basic feature map and the expanded object visibility map.
  • the following three fusion methods can be specifically used in the fusion processing.
  • in the first fusion method, the corresponding elements in the basic feature map can be multiplied by the corresponding elements of the expanded object visibility map, and the resulting products can then be summed with the corresponding elements of the original basic feature map to obtain the values of the corresponding elements in the enhanced feature map.
  • the weighted summation of the feature maps can be performed channel by channel, and finally an enhanced feature map with the same number of channels as the basic feature map is obtained.
  • for example, if the basic feature map U has a total of 64 channels, the feature map of each channel of the basic feature map can be multiplied element-wise by the object visibility map, the result of the multiplication can then be summed with the corresponding element of the basic feature map to obtain the value of the corresponding element of the enhanced feature map, and similar operations are repeated until the values of the elements in each of the 64 channels of the enhanced feature map are obtained.
  • formula (3) can be used to perform weighting processing on the object visibility map V of the image and the basic feature map U of the image to obtain the enhanced feature map of the image
  • the enhanced feature map of the image obtained by the self-activation module 103 can enhance the features of the visible part of the human body in the image, and suppress the features of the background and obstructions, facilitating subsequent high-precision pedestrian detection based on the enhanced image features of the image.
  • in the second fusion method, the corresponding elements in the basic feature map can be multiplied by the corresponding elements of the expanded object visibility map, and the resulting products can then be summed with the corresponding elements of the expanded object visibility map to obtain the values of the corresponding elements in the enhanced feature map.
  • in the third fusion method, the corresponding elements in the basic feature map and the corresponding elements of the expanded object visibility map can be directly weighted and summed, and the value obtained after the weighted summation is the value of the corresponding element of the enhanced feature map.
  • the weighting coefficient of the object visibility map after the dimension expansion can be greater than the weighting coefficient of the basic feature map, so that the enhanced feature map mainly reflects the features of the object visibility map.
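  • The three fusion options described above can be sketched as follows, where the single-channel object visibility map is first dimensionally expanded to match the channel count of the basic feature map; the alpha/beta weights in the third option are illustrative values chosen so that the visibility map dominates, as suggested above.

```python
import numpy as np

def fuse(u, v, mode="mul_add_u", alpha=0.3, beta=0.7):
    """Sketch of the fusion options: u is the basic feature map (H, W, C); v is the
    single-channel object visibility map (H, W), expanded to match u's channels."""
    v = np.repeat(v[:, :, None], u.shape[2], axis=2)   # dimensional expansion of V
    if mode == "mul_add_u":                            # option 1: U * V + U
        return u * v + u
    if mode == "mul_add_v":                            # option 2: U * V + V
        return u * v + v
    return alpha * u + beta * v                        # option 3: weighted sum, beta > alpha

u = np.random.rand(32, 32, 64)
v = np.random.rand(32, 32)
print(fuse(u, v).shape)   # (32, 32, 64): same channel count as the basic feature map
```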
  • the feature corresponding to the candidate frame obtained in the above step 1006 may include the area feature of the candidate frame, and the area feature of the candidate frame is the feature of the area located in the candidate frame in the enhanced feature map.
  • the regional feature of the candidate frame may be the part of the feature in the region corresponding to the candidate frame in the enhanced feature map.
  • the position of the candidate frame in the enhanced feature map can be determined first, and then the feature of the area enclosed by the candidate frame in the enhanced feature map can be determined as the regional feature of the candidate frame.
  • the features of the region enclosed by the candidate frame in the enhanced feature map may be sampled (specifically, up-sampling or down-sampling), and then the regional features of the candidate frame can be obtained.
  • the features corresponding to the aforementioned candidate frame may include, in addition to the regional feature of the candidate frame, the features of other regions outside the candidate frame.
  • the feature located in the candidate frame in the image enhancement feature map can be directly used as the feature corresponding to the candidate frame.
  • the area of the candidate frame can be determined from the object visibility map of the image, and then, according to the area (location) of the candidate frame, the feature located in the candidate frame in the enhanced feature map is determined as the feature corresponding to the candidate frame.
  • the position of the candidate frame can be determined more accurately with the help of the object visibility map, which facilitates subsequent acquisition of more accurate regional features of the candidate frame.
  • after the feature corresponding to the candidate frame is obtained, the RCNN module can be used to continue processing the regional feature of the candidate frame, finally obtaining the bounding frame of the pedestrian in the image and the confidence of that bounding frame.
  • alternatively, the regional features of the area around the candidate frame can be extracted and fused with the regional features of the candidate frame, and the RCNN module can then process the fused features to determine the bounding frame of the pedestrian in the image and the confidence of that bounding frame.
  • because the fused features carry more information, the accuracy of pedestrian detection based on them is higher.
  • optionally, the feature corresponding to the candidate frame further includes the regional feature of the contour area of the candidate frame, where the contour area of the candidate frame is the area formed between the candidate frame and the reduced candidate frame obtained after the candidate frame is reduced according to the first preset ratio.
  • the regional features of the outline area of the candidate frame generally include the outline features of pedestrians, and the outline features of pedestrians also play an important role in pedestrian detection.
  • according to the candidate frame (position) of the image, the area of the candidate frame can first be determined from the object visibility map of the image, and then the frame of the candidate frame is reduced inward according to a certain ratio; the area between the original frame of the candidate frame and the reduced frame is the contour area of the candidate frame.
  • when determining the contour area of the candidate frame, the area of the candidate frame can also be determined from the enhanced feature map, and the frame of the candidate frame is then reduced inward according to a certain ratio; the area between the original frame of the candidate frame and the reduced frame is the contour area of the candidate frame.
  • the above-mentioned reduction of the candidate frame according to the first preset ratio may specifically mean that the width and height of the candidate frame are respectively reduced according to a certain ratio, and the ratio when the width and height of the candidate frame are reduced may be the same or different.
  • the candidate frame is reduced according to a first preset ratio, including: the width of the candidate frame is reduced according to a first reduction ratio, and the height of the candidate frame is reduced according to a second reduction ratio.
  • the first preset ratio includes a first reduction ratio and a second reduction ratio, where the first reduction ratio and the second reduction ratio may be the same or different.
  • the first reduction ratio and the second reduction ratio can be set based on experience; for example, appropriate values may be set for the first reduction ratio and the second reduction ratio, so that the contour area of the candidate frame obtained by reducing the width and height of the candidate frame according to the first and second reduction ratios can better extract the pedestrian contour features.
  • the first reduction ratio and the second reduction ratio can be set to a value that can better extract the pedestrian contour.
  • the first reduction ratio may be 1/1.1, and the second reduction ratio may be 1/1.8.
  • when the frame of the candidate frame is reduced, the center of the candidate frame can be used as the center point and the frame is scaled inward according to a certain ratio; in this case, the center of the reduced candidate frame is consistent with the center point of the original candidate frame.
  • alternatively, the center of the reduced candidate frame can be the maximum value point of the original candidate frame in the object visibility map of the image.
  • in this way, the contour feature of the pedestrian can also be taken into consideration during pedestrian detection, so that the contour feature can subsequently be integrated to perform pedestrian detection better.
  • the method shown in FIG. 10 further includes: setting the value of the feature located in the reduced candidate frame among the regional features of the candidate frame to zero to obtain the regional feature of the contour area of the candidate frame.
  • when obtaining the regional feature of the contour area of the candidate frame, directly setting to zero the values of the features located in the reduced candidate frame among the regional features of the candidate frame allows the regional feature of the contour area of the candidate frame to be obtained quickly and conveniently.
  • alternatively, the regional feature of the contour area of the candidate frame may be directly extracted from the enhanced feature map according to the position of the contour area of the candidate frame.
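  • A minimal sketch of the contour-feature construction described above: the candidate frame is reduced about its center by the example reduction ratios (1/1.1 for width, 1/1.8 for height), and the features falling inside the reduced frame are set to zero so that only the contour band remains. The helper names and the way the reduced frame is mapped onto the pooled feature grid are assumptions made for illustration.

```python
import numpy as np

def shrink_box(box, rw=1 / 1.1, rh=1 / 1.8):
    """Shrink a candidate frame (x1, y1, x2, y2) about its center by the example
    reduction ratios given in the text (width by 1/1.1, height by 1/1.8)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * rw, (y2 - y1) * rh
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def contour_feature(roi_feat, inner_frac_w=1 / 1.1, inner_frac_h=1 / 1.8):
    """Zero out the part of the candidate frame's regional feature (h, w, C) that
    corresponds to the reduced candidate frame, keeping only the contour band."""
    h, w, _ = roi_feat.shape
    ih, iw = int(h * inner_frac_h), int(w * inner_frac_w)
    top, left = (h - ih) // 2, (w - iw) // 2
    out = roi_feat.copy()
    out[top:top + ih, left:left + iw, :] = 0.0   # interior set to zero
    return out
```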
  • optionally, the feature corresponding to the above candidate frame further includes the regional feature of the background area of the candidate frame, where the background area of the candidate frame is the area formed between the candidate frame and the expanded candidate frame obtained after the candidate frame is expanded according to the second preset ratio.
  • the regional characteristics of the background area of the candidate frame generally reflect the characteristics of the background area where the pedestrian in the image is located.
  • the characteristics of the background area can be combined with the characteristics of the pedestrian to perform pedestrian detection.
  • the aforementioned expansion of the candidate frame according to the second preset ratio may specifically mean that the width and the height of the candidate frame are each expanded according to a certain ratio, and the ratios used for the width and the height may be the same or different.
  • optionally, the expansion of the candidate frame according to the second preset ratio includes: the width of the candidate frame is expanded according to a first expansion ratio, and the height of the candidate frame is expanded according to a second expansion ratio.
  • that is, the second preset ratio includes a first expansion ratio and a second expansion ratio, where the first expansion ratio and the second expansion ratio may be the same or different.
  • the first expansion ratio and the second expansion ratio can be set based on experience; for example, appropriate values may be set for the first expansion ratio and the second expansion ratio, so that the background area of the candidate frame obtained by expanding the width and height of the candidate frame according to the first and second expansion ratios can better capture the background features around the pedestrian.
  • for example, the first expansion ratio may be 1.1, and the second expansion ratio may be 1.8.
  • according to the candidate frame (position) of the image, the area of the candidate frame can be determined from the object visibility map of the image, and the border of the candidate frame is then expanded outward according to a certain ratio; the area between the original frame of the candidate frame and the expanded frame is the background area of the candidate frame.
  • the background area of the candidate frame you can first determine the area of the candidate frame from the enhanced feature map of the image according to the candidate frame (position) of the image, and then expand the border of the candidate frame outwards according to a certain ratio , The area between the original frame of the candidate frame and the expanded frame of the candidate frame is the background area of the candidate frame.
  • the expansion ratio can be determined based on experience. For example, when expanding the candidate frame, the width of the candidate frame can be changed to 1.1 times the original size, and the height of the candidate frame can be changed to 1.8 times the original size.
  • when the frame of the candidate frame is expanded, the center of the candidate frame can be used as the center point and the frame is expanded outward according to a certain ratio; in this case, the center of the expanded candidate frame is consistent with the center point of the original candidate frame.
  • alternatively, the center of the expanded candidate frame can be the maximum value point of the original candidate frame in the object visibility map of the image.
  • in this way, the regional feature of the background area can also be taken into account during pedestrian detection, so that it can subsequently be integrated to perform pedestrian detection better.
  • optionally, the method shown in FIG. 10 further includes: acquiring the regional feature of a first region, where the regional feature of the first region is the regional feature located in the area of the expanded candidate frame in the object visibility map; and setting to zero the features located in the candidate frame among the regional features of the first region to obtain the regional feature of the background area of the candidate frame.
  • the enhanced feature map can also be combined to determine the regional feature of the background area of the candidate frame.
  • specifically, the regional feature of a second region can be obtained, where the regional feature of the second region is the regional feature located in the area of the expanded candidate frame in the enhanced feature map; the features located in the candidate frame among the regional features of the second region are then set to zero to obtain the regional feature of the background area of the candidate frame.
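  • Analogously, a sketch of the background-feature construction: the candidate frame is expanded about its center (the text suggests width × 1.1 and height × 1.8 as an example), and the features falling inside the original candidate frame are set to zero so that only the surrounding background band remains; the helper names are illustrative.

```python
import numpy as np

def expand_box(box, sw=1.1, sh=1.8):
    """Expand a candidate frame (x1, y1, x2, y2) about its center by example factors."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * sw, (y2 - y1) * sh
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def background_feature(region_feat, sw=1.1, sh=1.8):
    """region_feat is the regional feature of the expanded candidate frame (h, w, C);
    the part corresponding to the original candidate frame is set to zero, leaving
    only the surrounding background band."""
    h, w, _ = region_feat.shape
    ih, iw = int(h / sh), int(w / sw)          # size of the original frame inside
    top, left = (h - ih) // 2, (w - iw) // 2
    out = region_feat.copy()
    out[top:top + ih, left:left + iw, :] = 0.0
    return out
```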
  • the regional feature of the candidate frame, the contour feature of the candidate frame, and the background feature of the candidate frame may be referred to as features corresponding to the candidate frame. That is to say, in this case, the features corresponding to the candidate frame include not only the regional features within the candidate frame's own region, but also the regional features of other regions (contour regions and background regions) outside the candidate frame's own region.
  • the feature corresponding to the candidate frame may include at least one of the contour feature of the candidate frame and the background feature of the candidate frame in addition to the regional feature of the candidate frame.
  • in this case, the feature corresponding to the candidate frame contains the most information, so that subsequent pedestrian detection performed by the RCNN module according to the feature corresponding to the candidate frame can improve the accuracy of pedestrian detection to a certain extent.
  • the features corresponding to the candidate frame can directly include the three separate features, namely the regional feature of the candidate frame, the contour feature of the candidate frame, and the background feature of the candidate frame, or the three features can be fused, in which case the fused feature is the feature corresponding to the candidate frame.
  • the area of the candidate frame, the contour area of the candidate frame, and the background area of the candidate frame can be determined according to the candidate frame of the image and the object visibility map of the image (see Figure 21 and Figure 22 for the process of determining these areas) .
  • the regional feature of the candidate frame, the regional feature of the contour area of the candidate frame (which can be referred to as the contour feature for short), and the regional feature of the background area of the candidate frame (which can be referred to as the background feature for short) are then extracted from the enhanced feature map of the image; these three features can be fused, and the fused feature is the feature corresponding to the candidate frame.
  • the fusion of these three features can be a linear combination (for example, a weighted summation) or a nonlinear combination.
  • the three features can be adjusted to the same size (for example, 7*7*K, K is the number of channels), and then fused.
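  • One possible reading of this fusion step, as a sketch: the three region features are resized to a common 7×7×K grid (here with a simple nearest-neighbour resize) and combined by a weighted summation, which is one of the linear combinations mentioned above; the fusion weights are illustrative, not values from the embodiment.

```python
import numpy as np

def resize_nn(feat, out_h=7, out_w=7):
    """Nearest-neighbour resize of a (h, w, C) feature map to (out_h, out_w, C)."""
    h, w, _ = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[rows][:, cols]

def fuse_region_features(roi_feat, contour_feat, background_feat,
                         weights=(1.0, 0.5, 0.5)):
    """Resize the three region features to the same 7x7xK size and take their
    weighted sum (the weights here are illustrative only)."""
    parts = [resize_nn(f) for f in (roi_feat, contour_feat, background_feat)]
    return sum(wgt * p for wgt, p in zip(weights, parts))
```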
  • next, the RCNN module can be used to process the feature corresponding to the candidate frame, finally obtaining the bounding frame of the pedestrian in the image and the confidence of that bounding frame.
  • in this way, the bounding frame of the pedestrian in the image and its confidence can be better determined according to the image features of the part corresponding to the candidate frame.
  • the enhanced feature map obtained by fusing the basic feature map of the image and the object visibility map of the image highlights the features of the visible part of the pedestrian , which can improve the accuracy of subsequent pedestrian detection based on the enhanced feature map.
  • the accuracy of pedestrian detection in this application is significantly improved, especially when pedestrians are (more severely) occluded.
  • the pedestrian detection method of the embodiment of the present application can be applied to various scenarios such as an assisted/automatic driving system, a safe city/video surveillance system, and so on.
  • in the assisted/automatic driving scenario, the road screen image is acquired first and processed by the pedestrian detection method of the embodiment of the present application to obtain the bounding frame of the pedestrian in the road screen image and its confidence.
  • the autonomous driving vehicle can be controlled according to the confidence of the pedestrian bounding frame in the road screen image and the pedestrian bounding frame in the road screen image. For example, when the confidence of the pedestrian bounding frame in the road screen image and the pedestrian bounding frame in the road screen image determines that there is a pedestrian in front of the vehicle, you can control the vehicle to slow down and whistle, or you can control the vehicle Bypass pedestrians and so on.
  • the monitoring screen image can be acquired first, and then through the processing of the pedestrian detection method of the embodiment of the application, it is obtained that the surveillance screen image is surrounded by pedestrians The confidence level of the bounding frame of pedestrians in the frame and the monitoring screen image.
  • specific persons can then be identified and tracked according to the bounding frame of the pedestrian in the monitoring screen image and its confidence.
  • for example, once a specific person is identified, the Skyeye system (which can be regarded as part of the safe city/video surveillance system) can be used to track that person.
  • the specific process of determining the confidence of the bounding box with pedestrians in the image and the bounding box with pedestrians in the image in step 1007 may include the following steps:
  • 1007c Use a classifier to process multiple second semantic features, and obtain the confidence that a pedestrian bounding box exists in the image.
  • the aforementioned multiple second semantic feature maps respectively represent multiple feature maps with different semantics extracted from the features corresponding to the candidate frame.
  • the convolution parameter of the second convolution network is the same as the convolution parameter of the first convolution network. Specifically, in the plurality of second semantic feature maps, any two second semantic feature maps correspond to different semantics.
  • for example, the above multiple second semantic feature maps are composed of F1, F2, and F3, where F1 reflects the features of the head, F2 reflects the features of the left hand, and F3 reflects the features of the legs; F1, F2, and F3 each reflect different semantics.
  • note that F1, F2, and F3 in this example are different from the F1, F2, and F3 used above when describing the multiple first semantic feature maps: the F1, F2, and F3 here belong to the second semantic feature maps, whereas those in the earlier example belong to the first semantic feature maps.
  • the position of the bounding frame of the pedestrian in the image and the confidence of the bounding frame of the pedestrian in the image may be the detection result of pedestrian detection on the image, which may be called the pedestrian detection result of the image.
  • the feature corresponding to the candidate frame can be convolved by Conv5 in the RCNN module to obtain multiple second semantic feature maps (C1 to Ck), where each of the multiple second semantic feature maps reflects different features.
  • C 1 reflects the characteristics of the head
  • C 2 reflects the characteristics of the left hand
  • C 3 reflects the characteristics of the background object.
  • the global average pooling (GAP) module performs averaging on each second semantic feature map to obtain feature maps P1 to Pk, where P1 is obtained by averaging feature map C1, and Pk is obtained by averaging feature map Ck.
  • in the classifier, the classification coefficients can be used to perform a weighted summation over the feature maps P1 to Pk to obtain the confidence that a pedestrian exists in the candidate frame.
  • the weighted summation of the feature map P k can be performed according to formula (4) to obtain the confidence of the candidate frame.
  • in formula (4), w_k (a 1*1 scalar) is the classifier coefficient corresponding to P_k, and P_k represents the result of averaging C_k.
  • (xmin, ymin) can be the coordinates of the upper left corner of the frame, and W and H represent the width and height of the frame, respectively.
  • (xmin, ymin) can also be the position of the center point of the frame, or the position of the upper right corner, lower left corner, or lower right corner of the frame.
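  • The confidence computation described around formula (4) (global average pooling of each second semantic feature map C_k into a scalar P_k, followed by a weighted summation with the classifier coefficients w_k) can be sketched as follows; the array shapes are assumed for illustration.

```python
import numpy as np

def candidate_confidence(c, w):
    """c is the stack of second semantic feature maps (h, w, K); global average pooling
    turns each channel C_k into a scalar P_k, and the classifier coefficients w_k give
    the confidence as the weighted sum of the P_k values (formula (4))."""
    p = c.mean(axis=(0, 1))      # global average pooling: one scalar per channel
    return float(np.dot(w, p))   # weighted summation over channels

c = np.random.rand(7, 7, 256)    # features of the candidate frame region
w = np.random.rand(256)          # classifier coefficients
print(candidate_confidence(c, w))
```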
  • Table 1 shows the miss rate of the proposed scheme and the existing scheme for pedestrian detection on the public data set (CityPersons).
  • traditional scheme 1 is the adapted Faster RCNN scheme, which was proposed at the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), where IEEE is the Institute of Electrical and Electronics Engineers.
  • traditional scheme 2 is the occlusion-aware RCNN scheme, which was proposed at the 2018 European Conference on Computer Vision (ECCV).
  • the second column in Table 1 shows the miss rate for all pedestrians in the image (including pedestrians occluded by other objects and pedestrians not occluded) during pedestrian detection, and the third column shows the miss rate for pedestrians with severe occlusion; the lower the miss rate, the better the pedestrian detection performance.
  • the pedestrian detection method of the embodiment of the present application is described in detail above with reference to the accompanying drawings.
  • the pedestrian detection device of the embodiment of the present application is described in detail below in conjunction with the accompanying drawings. It should be understood that the pedestrian detection device described below can execute each step of the pedestrian detection method of the embodiments of the present application; in order to avoid unnecessary repetition, repeated descriptions are appropriately omitted when introducing the pedestrian detection device.
  • FIG. 25 is a schematic block diagram of a pedestrian detection device according to an embodiment of the present application.
  • the pedestrian detection device 3000 shown in FIG. 25 includes an acquisition unit 3001 and a processing unit 3002.
  • the acquiring unit 3001 and the processing unit 3002 may be used to execute the pedestrian detection method of the embodiment of the present application. Specifically, the acquiring unit 3001 may execute the foregoing step 1001, and the processing unit 3002 may execute the foregoing steps 1002 to 1007.
  • the processing unit 3002 described above can be divided into multiple modules according to different processing functions. Specifically, the processing unit 3002 can be equivalent to the backbone network module 101, the RPN module 102, the self-activation module 103, the basic feature weighting module 104, the region feature generation module 105, the RCNN module 106, and the output module 107 in the pedestrian detection device shown in FIG. 7. The processing unit 3002 can implement the functions of each module in the pedestrian detection device shown in FIG. 7.
  • FIG. 26 is a schematic diagram of the hardware structure of a pedestrian detection device according to an embodiment of the present application.
  • the pedestrian detection device 4000 shown in FIG. 26 includes a memory 4001, a processor 4002, a communication interface 4003, and a bus 4004.
  • the memory 4001, the processor 4002, and the communication interface 4003 implement communication connections between each other through the bus 4004.
  • the memory 4001 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 4001 may store a program. When the program stored in the memory 4001 is executed by the processor 4002, the processor 4002 is configured to execute each step of the pedestrian detection method in the embodiment of the present application.
  • the processor 4002 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the pedestrian detection method in the method embodiments of the present application.
  • the processor 4002 may also be an integrated circuit chip with signal processing capability.
  • each step of the pedestrian detection method of the present application can be completed by hardware integrated logic circuits in the processor 4002 or instructions in the form of software.
  • the above-mentioned processor 4002 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • it can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 4001, and the processor 4002 reads the information in the memory 4001, and combines its hardware to complete the functions required by the units included in the pedestrian detection device, or execute the pedestrian detection method in the method embodiment of the present application.
  • the communication interface 4003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 4000 and other devices or a communication network.
  • the image to be processed can be acquired through the communication interface 4003.
  • the bus 4004 may include a path for transferring information between various components of the device 4000 (for example, the memory 4001, the processor 4002, and the communication interface 4003).
  • FIG. 27 is a schematic diagram of the hardware structure of a neural network training device according to an embodiment of the present application. Similar to the above device 4000, the neural network training device 5000 shown in FIG. 27 includes a memory 5001, a processor 5002, a communication interface 5003, and a bus 5004. Among them, the memory 5001, the processor 5002, and the communication interface 5003 implement communication connections between each other through the bus 5004.
  • the memory 5001 may be a ROM, a static storage device, or a RAM.
  • the memory 5001 may store a program. When the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 and the communication interface 5003 are used to execute each step of the neural network training method of the embodiment of the present application.
  • the processor 5002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to implement the functions required by the units in the image processing apparatus of the embodiments of the present application, or to execute the neural network training method of the method embodiments of this application.
  • the processor 5002 may also be an integrated circuit chip with signal processing capabilities.
  • each step of the neural network training method of the embodiment of the present application can be completed by the integrated logic circuit of hardware in the processor 5002 or instructions in the form of software.
  • the foregoing processor 5002 may also be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 5001, and the processor 5002 reads the information in the memory 5001 and completes, in combination with its hardware, the functions required by the units included in the image processing apparatus of the embodiment of the present application, or executes the neural network training method of the method embodiments of the present application.
  • the communication interface 5003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 5000 and other devices or communication networks.
  • the image to be processed can be obtained through the communication interface 5003.
  • the bus 5004 may include a path for transferring information between various components of the device 5000 (for example, the memory 5001, the processor 5002, and the communication interface 5003).
  • although the device 4000 and the device 5000 show only a memory, a processor, and a communication interface, in a specific implementation process those skilled in the art should understand that the device 4000 and the device 5000 may also include other components necessary for normal operation. At the same time, according to specific needs, those skilled in the art should understand that the device 4000 and the device 5000 may also include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the device 4000 and the device 5000 may also include only the components necessary to implement the embodiments of the present application, and need not include all the components shown in FIGS. 26 and 27.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.


Abstract

This application provides a pedestrian detection method and apparatus, a computer-readable storage medium, and a chip, relating to the field of artificial intelligence and, in particular, to computer vision. The method includes: performing feature extraction on an image to obtain a basic feature map of the image; determining, from the basic feature map, candidate frames for regions in which pedestrians may exist; processing the basic feature map of the image to obtain an object visibility map whose response to the visible parts of pedestrians is greater than its response to the occluded parts of pedestrians and to the background; then performing weighted summation on the basic feature map and the object visibility map to obtain an enhanced feature map of the image; and finally determining, from the candidate frames of the image and the enhanced feature map of the image, bounding boxes in which pedestrians exist in the image and the confidence of those bounding boxes. This application can improve the accuracy of pedestrian detection.

Description

行人检测方法、装置、计算机可读存储介质和芯片
本申请要求于2019年07月30日提交中国专利局、申请号为201910697411.3、申请名称为“行人检测方法、装置、计算机可读存储介质和芯片”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域中的计算机视觉领域,并且更具体地,涉及一种行人检测方法、装置、计算机可读存储介质和芯片。
背景技术
计算机视觉是各个应用领域,如制造业、检验、文档分析、医疗诊断,和军事等领域中各种智能/自主系统中不可分割的一部分,它是一门关于如何运用照相机/摄像机和计算机来获取我们所需的,被拍摄对象的数据与信息的学问。形象地说,就是给计算机安装上眼睛(照相机/摄像机)和大脑(算法)用来代替人眼对目标进行识别、跟踪和测量等,从而使计算机能够感知环境。因为感知可以看作是从感官信号中提取信息,所以计算机视觉也可以看作是研究如何使人工系统从图像或多维数据中“感知”的科学。总的来说,计算机视觉就是用各种成像系统代替视觉器官获取输入信息,再由计算机来代替大脑对这些输入信息完成处理和解释。计算机视觉的最终研究目标就是使计算机能像人那样通过视觉观察和理解世界,具有自主适应环境的能力。
在计算机视觉领域中,行人检测是一个很重要的研究方向,行人检测在很多领域和场景中都有重要的应用。例如,高级驾驶辅助系统(advanced driving assistant system,ADAS)和自动驾驶系统(autonomous driving system,ADS)中对路面行人等动态障碍物进行检测和躲避;在平安城市和视频监控中对行人进行检测,找出犯罪嫌疑人或者追踪失踪人口;在智能家居系统中通过检测行人来实现机器人的运动和避障等。
传统方案在进行行人检测时,在行人被遮挡情况下的检测效果表现不好,主要表现在遮挡行人的漏检和错检。
发明内容
本申请提供一种行人检测方法、装置、计算机可读存储介质和芯片,以提高行人(尤其是被遮挡的行人)检测的准确性。
第一方面,提供了一种行人检测方法,该方法包括:获取图像;对图像进行特征提取,得到图像的基础特征图;根据基础特征图确定图像中可能存在行人的候选框;对图像的基础特征图进行处理,得到图像的物体可见度图;对图像的基础特征图和图像的物体可见度图进行融合处理,得到图像的增强特征图;根据图像的候选框和图像的增强特征图,确定候选框对应的特征;根据候选框对应的特征确定图像中存在行人的包围框(bounding box) 和图像中存在行人的包围框的置信度。
上述图像可以是包含行人的图像。
可选地,上述获取图像,包括:通过摄像头拍摄以获取图像。
例如,对于手机来说,可以通过拍摄获取的图像。
可选地,上述获取图像,包括:从存储器中获取图像。
例如,对于手机来说,可以从相册中选取图像。
在得到图像的基础特征图时,具体可以通过对图像进行卷积操作(卷积处理),或者是对图像的卷积操作的结果做进一步的处理(例如,进行求和、加权处理、连接等操作)得到基础特征图。
上述候选框为图像中可能存在行人的区域的包围框。候选框所在的区域可以是候选框所围成的区域(候选框内部的区域),候选框所在的区域也就是图像中可能存在行人的区域。
上述图像的物体可见度图对不同物体的响应程度不同,在该图像的物体可见度图中,对行人可见部分的响应程度大于对行人不可见部分的响应程度。也就是说,在该图像的物体可见度图中,与行人不可见部分相比,行人可见部分的特征更加突出。具体地,在该图像的物体可见度图中,行人可见部分的像素值大于行人不可见部分的像素值。
通过图像的物体可见度图,能够更突出的反映出行人可见部分的特征。
另外,在上述图像的物体可见度图中,行人可见部分是能够看到行人图像的部分,而行人不可见部分是不能看到行人图像的部分。
上述候选框对应的特征可以包括候选框的区域特征,该候选框的区域特征是增强特征图中位于候选框内的区域特征。具体地,候选框的区域特征可以是增强特征图中位于候选框所围成的区域的特征。
上述候选框的区域特征可以是增强特征图中与候选框相对应的区域内的这部分特征。在确定候选框的区域特征时,可以先确定候选框在增强特征图中的位置,接下来,就可以将候选框在增强特征图中所围成的区域的特征确定为候选框的区域特征。
进一步的，还可以对候选框在增强特征图中所围成的区域的特征进行抽样处理（具体可以是上采样或者下采样），然后得到候选框的区域特征。
上述图像中存在行人的包围框和图像中存在行人的包围框的置信度可以是对图像进行行人检测的检测结果,可以称为图像的行人检测结果。
本申请中,由于图像的物体可见度图能够更突出的反映行人可见部分的特征,因此,根据图像的基础特征图和图像的物体可见度图融合得到的增强特征图中突出体现了行人可见部分的特征,能够提高后续根据该增强特征图进行行人检测的准确性。并且,针对行人遮挡(较为严重)的情况,本申请行人检测的准确性有较为明显的提高。
进一步地,本申请中通过结合物体可见度图来提高行人检测准确性的同时,不会在训练过程中增加训练数据的标注量,本申请仅需要在处理的过程中生成物体的可见度图,并在后续处理时综合考虑物体可见度图即可,与通过增加数据标注量从而提升行人检测准确性的方案相比,能够在节约数据的标注量,减小训练的复杂度。
可选地,在上述图像的物体可见度图中,行人不可见部分包括行人被遮挡部分。
当行人不可见部分包括行人被遮挡部分时,能够在物体可见度图中区分开行人可见部 分和行人遮挡部分,并且由于行人可见部分的像素值大于行人遮挡部分的像素值,因此,能够在物体可见度中突出体现行人可见部分的特征,削弱行人遮挡部分的特征,减少行人遮挡部分在后续行人检测过程中对检测结果的影响,便于后续进行行人检测时突出行人可见部分的特征,提高行人检测的效果。
可选地,在上述图像的物体可见度图中,行人不可见部分包括图像的背景部分。
上述图像的背景部分可以是指该图像中除了行人之外的其他部分,或者,上述图像的背景部分还可以是指该图像中除了行人和主要物体(例如,汽车)之外的其他部分。
当行人不可见部分包括图像的背景部分时,能够在物体可见度图中将行人和背景部分区分开,突出行人可见部分的特征,削弱背景部分的特征,减少背景部分在后续行人检测过程中对检测结果的影响,便于后续进行行人检测时突出行人可见部分的特征,提高行人检测的效果。
结合第一方面,在第一方面的某些实现方式中,上述对图像的基础特征图进行处理,得到图像的物体可见度图,包括:采用第一卷积网络对图像的基础特征图进行卷积处理,得到多个第一语义特征图;对多个第一语义特征图进行加权求和处理,得到图像的物体可见度图;上述根据候选框对应的特征确定图像中存在行人的包围框和图像中存在行人的包围框的置信度,包括:采用第二卷积网络对候选框对应的特征进行卷积处理,得到多个第二语义特征图;采用回归器对所述多个第二语义特征进行处理,确定包围框的位置;采用分类器对多个第二语义特征进行处理,得到图像中存在行人的包围框的置信度。
其中,上述多个第一语义特征图是从基础特征图全图中提取的多个不同语义的特征图。具体地,在上述多个第一语义特征图中,任意两个第一语义特征图对应不同的语义。
上述多个第二语义特征图分别表示从候选框对应的特征中提取的多个不同语义的特征图,上述第二卷积网络的卷积参数与所述第一卷积网络的卷积参数相同。具体地,在上述多个第二语义特征图中,任意两个第二语义特征图对应不同的语义。
上述第一卷积网络和第二卷积网络的卷积参数相同可以是指第一卷积网络和第二卷积网络的卷积核参数相同,进一步的,第一卷积网络和第二卷积网络的卷积参数相同也可以是指第一卷积网络和第二卷积网络的网络架构和卷积核参数完全相同,采用第一卷积网络和第二卷积网络对相同的图像进行特征提取时能够提取到相同的图像语义特征。
上述在对该多个第一语义特征进行加权求和处理时的加权系数为分类器中用于确定行人得分的权重系数。
本申请中,由于第一卷积网络与第二卷积网络的卷积参数相同,并且在对多个第一语义特征进行加权求和处理时的加权系数为分类器中用于确定行人得到的权重系数,因此,能够通过对多个第一语义特征的处理,得到上述能够突出体现行人可见部分的物体可见度图,便于后续依据该物体可见度图更准确地进行行人检测。
结合第一方面,在第一方面的某些实现方式中,上述候选框对应的特征还包括候选框的轮廓区域的区域特征,候选框的轮廓区域是候选框按照第一预设比例缩小后得到的缩小候选框与候选框之间形成的区域。
上述候选框按照第一预设比例缩小具体可以是指候选框的宽和高分别按照一定的比例进行缩小,候选框的宽和高进行缩小时的比例可以相同,也可以不同。
可选地,上述候选框按照第一预设比例缩小,包括:上述候选框的宽按照第一缩小比 例进行缩小,候选框的高按照第二缩小比例进行缩小。
第一预设比例包括第一缩小比例和第二缩小比例,其中,第一缩小比例和第二缩小比例既可以相同,也可以不同。
第一缩小比例和第二缩小比例可以根据经验来设置。例如,可以将为第一缩小比例和第二缩小比例设置合适的数值,以使得按照第一缩小比例和第二缩小比例对候选框的宽和高进行缩小后得到的候选框的轮廓区域能够较好的提取到行人的轮廓特征。
在设置第一缩小比例和第二缩小比例可以设置为能够较好的提取行人轮廓的数值。
上述第一缩小比例可以是1/1.1,上述第二缩小比例可以是1/1.8。
上述候选框的轮廓区域的区域特征一般会包含行人的轮廓特征,而行人的轮廓特征在进行行人检测时也会起到很重要的作用。
本申请中,当候选框对应的特征还包括候选框的轮廓区域的区域特征时,能够在进行行人检测时,将行人的轮廓特征也考虑进来,便于后续综合行人的轮廓特征来更好地进行行人检测。
结合第一方面,在第一方面的某些实现方式中,上述方法还包括:将候选框的区域特征中位于缩小候选框内的特征的取值置零,以得到候选框的轮廓区域的区域特征。
本申请中,在获取候选框的轮廓区域的区域特征时,通过直接将候选框的区域特征中位于缩小候选框内的特征的取值置零,能够快速方便的获取到候选框的轮廓区域的区域特征。
应理解,在本申请中,还可以采用其他方式获取候选框的轮廓区域的区域特征,例如,可以按照候选框的轮廓区域的区域位置直接从增强特征图中扫描得到候选框的轮廓区域的区域特征。
结合第一方面,在第一方面的某些实现方式中,上述候选框对应的特征还包括候选框的背景区域的区域特征,候选框的背景区域是候选框按照第二预设比例扩大后得到的扩大候选框与候选框之间形成的区域。
上述候选框按照第二预设比例扩大具体可以是指候选框的宽和高分别按照一定的比例进行扩大,候选框的宽和高进行扩大时的比例可以相同,也可以不同。
可选地,上述候选框按照第二预设比例扩大,包括:上述候选框的宽按照第一扩大比例进行扩大,候选框的高按照第二扩大比例进行扩大。
第二预设比例包括第一扩大比例和第二扩大比例,其中,第一扩大比例和第二扩大比例既可以相同,也可以不同。
第一扩大比例和第二扩大比例可以根据经验来设置。例如,可以将为第一扩大比例和第二扩大比例设置合适的数值,以使得按照第一扩大比例和第二扩大比例对候选框的宽和高进行扩大后得到的候选框的背景区域能够较好的提取行人周围的背景特征。
上述第一扩大比例可以是1.1,上述第二扩大比例可以是1.8。
候选框的背景区域的区域特征一般反映的是图像中的行人所处的背景区域的特征,该背景区域的特征可以结合行人的特征来进行行人检测。
本申请中,当候选框对应的特征还包括候选框的背景区域的区域特征时,能够在进行行人检测时,将背景区域的区域特征也考虑进来,便于后续综合背景区域的区域特征来更好地进行行人检测。
结合第一方面,在第一方面的某些实现方式中,上述方法还包括:获取第一区域的区域特征,第一区域的区域特征为物体可见度图中位于扩展候选框的区域内的区域特征;将第一区域的区域特征中位于候选框内的特征置零,得到候选框的背景区域的区域特征。
本申请中，在获取候选框的背景区域的区域特征时，通过获取候选框的背景区域在增强特征图中对应的区域特征，然后直接将该区域特征中位于候选框内的特征置零，能够快速方便的获取到候选框的背景区域的区域特征。
可选地,上述方法由神经网络(模型)执行。
具体地,在上述方法中,获取到图像之后,可以利用神经网络(模型)对图像进行处理,最终再根据候选框对应的特征确定图像中存在行人的包围框和图像中存在行人的包围框的置信度。
第二方面,提供了一种神经网络的训练方法,该方法包括:获取训练数据,该训练数据包括训练图像以及训练图像的行人标注结果;根据神经网络对训练图像进行以下处理:对训练图像进行卷积处理,得到训练图像的基础特征图;根据基础特征图确定训练图像中可能存在行人的候选框;对训练图像的基础特征图进行处理,得到训练图像的物体可见度图;对训练图像的基础特征图和训练图像的物体可见度图进行融合处理,得到训练图像的增强特征图;根据训练图像的候选框和训练图像的增强特征图,确定候选框对应的特征;根据候选框对应的特征确定训练图像的行人检测结果;根据该训练图像的行人检测结果和该训练图像的行人标注结果,确定该神经网络的损失值,然后根据该损失值对神经网络通过反向传播进行调整。
其中,上述候选框为训练图像中可能存在行人的区域的包围框。候选框所在的区域可以是候选框所围成的区域(候选框内部的区域),候选框所在的区域也就是训练图像中可能存在行人的区域。
另外,在上述训练图像的物体可见度图中,行人可见部分是能够看到行人图像的部分,而行人不可见部分是不能看到行人图像的部分。
上述候选框对应的特征可以包括候选框的区域特征,该候选框的区域特征是增强特征图中位于候选框内的区域的特征。
上述训练图像的行人检测结果可以包括训练图像中存在行人的包围框和图像中存在行人的包围框的置信度。
上述训练图像的行人检测标注结果包括该训练图像中存在行人的包围框。
上述训练图像的行人检测标注结果可以是预先(具体可以是通过人工进行标注)标注好的。
另外,在上述训练的过程中,采用的训练图像一般是多个。
在对上述神经网络进行训练的过程中,可以为神经网络设置一套初始的模型参数,然后根据训练图像的行人检测标注结果与训练图像的行人检测结果的差异来逐渐调整神经网络的模型参数,直到训练图像的行人检测结果与训练图像的行人检测标注结果之间的差异在一定的预设范围内,或者,当训练的次数达到预设次数时,将此时的神经网络的模型参数确定为该神经网络模型的最终的参数,这样就完成了对神经网络的训练了。
应理解,通过上述第二方面的方法训练得到的神经网络能够用于执行本申请第一方面中的方法。
应理解,在本申请中,在描述或者说明本申请实施例的行人检测方法时出现的基础特征图、候选框、物体可见度图、增强特征图、图像中存在行人的包围框和图像中存在行人的包围框的置信度均是指针对获取到的图像而言的,而在本申请实施例的神经网络训练方法中,基础特征图、候选框、物体可见度图、增强特征图、图像中存在行人的包围框和图像中存在行人的包围框的置信度均是指针对训练图像而言的。
第三方面,提供了一种行人检测装置,该行人检测装置包括用于执行上述第一方面中的方法中的各个模块。
第四方面,提供了一种神经网络的训练装置,该装置包括用于执行上述第二方面中的方法中的各个模块。
第五方面,提供了一种行人检测装置,该装置包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行上述第一方面中的方法。
第六方面,提供了一种神经网络的训练装置,该装置包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行上述第二方面中的方法。
第七方面,提供了一种电子设备,该电子设备包括上述第三方面或者第五方面中的行人检测装置。
第八方面,提供了一种电子设备,该电子设备包括上述第四方面或者第六方面中的神经网络的训练装置。
上述电子设备具体可以是移动终端(例如,智能手机),平板电脑,笔记本电脑,增强现实/虚拟现实设备以及车载终端设备等等。
第九方面,提供一种计算机可读存储介质,该计算机可读存储介质存储有程序代码,该程序代码包括用于执行第一方面或者第二方面中的方法中的步骤的指令。
第十方面,提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第一方面或者第二方面中的方法。
第十一方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行上述第一方面或者第二方面中的方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行第一方面中的方法。
上述芯片具体可以是现场可编程门阵列FPGA或者专用集成电路ASIC。
应理解,本申请中,第一方面的方法具体可以是指第一方面以及第一方面中各种实现方式中的任意一种实现方式中的方法。第二方面的方法具体可以是指第二方面以及第二方面中各种实现方式中的任意一种实现方式中的方法。
附图说明
图1是本申请实施例提供的辅助/自动驾驶系统中进行行人检测的过程的示意图;
图2是本申请实施例提供的平安城市/视频监控系统中进行行人检测的过程的示意图;
图3是本申请实施例提供的系统架构的结构示意图;
图4是利用本申请实施例提供的卷积神经网络模型进行行人检测的示意图;
图5是本申请实施例提供的一种芯片硬件结构示意图;
图6是本申请实施例的行人检测装置的示意性框图;
图7是本申请实施例的行人检测装置的示意性框图;
图8是利用行人检测装置进行行人检测的示意图;
图9是利用行人检测装置进行行人检测的示意图;
图10是本申请实施例的行人检测方法的示意性流程图;
图11是对图像进行卷积处理得到基础特征图的示意图;
图12是生成图像的基础特征图的过程的示意图;
图13是RPN模块确定图像候选框的过程的示意图;
图14是物体可见度图的示意图;
图15是物体可见度图的示意图;
图16是本申请实施例的行人检测方法的示意性流程图;
图17是自激活模块根据基础特征图得到物体可见度图的过程的示意图;
图18对物体可见度图和基础特征图进行加权求和的过程的示意图;
图19是根据候选框和增强特征图得到行人检测结果的过程的示意图;
图20是根据物体可见度图、候选框以及增强特征图得到行人检测结果的过程的示意图;
图21是确定候选框的轮廓区域的示意图;
图22是确定候选框的背景区域的示意图;
图23是根据物体可见度图、候选框以及增强特征图得到行人检测结果的过程的示意图;
图24是RCNN模块对候选框对应的特征进行处理的示意图;
图25是本申请实施例的行人检测装置的示意性框图;
图26是本申请实施例的行人检测装置的硬件结构示意图;
图27是本申请实施例的神经网络训练装置的硬件结构示意图。
具体实施方式
本申请的方案可以应用在辅助驾驶、自动驾驶、平安城市、智能终端等需要进行行人识别(也可以称为行人检测)的领域。下面对两种常用的应用场景进行简单的介绍。
应用场景一:辅助/自动驾驶系统
在高级驾驶辅助系统(advanced driving assistant system,ADAS)和自动驾驶系统(autonomous driving system,ADS)中需要对路面上行人等动态障碍物进行检测和躲避,尤其是要避免碰撞行人。在交通道路中,密集的行人经常出现,行人之间或者行人与其他物体之间的遮挡比较严重,这些对行车安全造成了严重的威胁,因此,在严重遮挡的场景下准确地进行行人检测对安全行车具有重要意义。
具体地,如图1所示,辅助/自动驾驶系统获取的道路画面图像经过行人检测得到行人检测结果,接下来,辅助/自动驾驶系统可以再根据行人检测结果对车辆进行控制。
其中,行人检测可以由本申请实施例的行人检测方法来执行,根据行人检测能够确定 道路画面中是否存在行人以及行人所在的位置,便于辅助/自动驾驶系统根据识别结果对车辆进行控制。
应用场景二:平安城市/视频监控系统
在平安城市系统和视频监控系统中通过实时进行行人检测,标出行人检测结果,并将行人检测结果系统的分析单元中,可以用于查找犯罪嫌疑人、失踪人口进而实现天网追踪等。
具体地,如图2所示,平安城市/视频监控系统获取的道路画面图像经过行人检测得到行人检测结果,接下来,可以再根据行人检测结果对特定人员进行识别和追踪。
其中,行人检测可以由本申请实施例的行人检测方法来执行,根据行人检测能够确定监控画面中是否存在行人以及行人所在的位置,当监控画面中存在行人时可以识别该行人是否为特定人员(失踪人口、犯罪嫌疑人等等),当识别出特定人员后可以对启动天眼系统(该系统可以视为平安城市/视频监控系统的一部分)对特定人员进行追踪。
本申请实施例的行人检测方法可以由神经网络(模型)来执行,为了更好地理解本申请实施例的行人检测方法,下面先对神经网络的相关术语和概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的，神经单元可以是指以x_s和截距1为输入的运算单元，该运算单元的输出可以如公式(1)所示：

$h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$  (1)
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),该激活函数用于对神经网络中的特征进行非线性变换,从而将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有多层隐含层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。
虽然DNN看起来很复杂，但是就每一层的工作来说，其实并不复杂，简单来说就是如下线性关系表达式：$\vec{y}=\alpha(W\vec{x}+\vec{b})$，其中，$\vec{x}$是输入向量，$\vec{y}$是输出向量，$\vec{b}$是偏移向量，W是权重矩阵（也称系数），α()是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多，系数W和偏移向量$\vec{b}$的数量也比较多。这些参数在DNN中的定义如下所述：以系数W为例，假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$，上标3代表系数W所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。综上，第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
(3)卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)残差网络
残差网络是在2015年提出的一种深度卷积网络,相比于传统的卷积神经网络,残差网络更容易优化,并且能够通过增加相当的深度来提高准确率。残差网络的核心是解决了增加深度带来的副作用(退化问题),这样能够通过单纯地增加网络深度,来提高网络性能。残差网络一般会包含很多结构相同的子模块,通常会采用残差网络(residual network,ResNet)连接一个数字表示子模块重复的次数,比如ResNet50表示残差网络中有50个子模块。
(6)分类器
很多神经网络结构最后都有一个分类器,用于对图像中的物体进行分类。分类器一般由全连接层(fully connected layer)和softmax函数(可以称为归一化指数函数)组成,能够根据输入而输出不同类别的概率。
(7)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(8)反向传播算法
神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始 的神经网络模型中参数的数值,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如权重矩阵。
以上对神经网络的一些基本内容做了简单介绍,下面针对图像数据处理时可能用到的一些特定神经网络进行介绍。
下面结合图3对本申请实施例的系统架构进行详细的介绍。
图3是本申请实施例的系统架构的示意图。如图3所示,系统架构100包括执行设备110、训练设备120、数据库130、客户设备140、数据存储系统150、以及数据采集系统160。
另外,执行设备110包括计算模块111、I/O接口112、预处理模块113和预处理模块114。其中,计算模块111中可以包括目标模型/规则101,预处理模块113和预处理模块114是可选的。
数据采集设备160用于采集训练数据。针对本申请实施例的行人检测方法来说,训练数据可以包括训练图像(该训练图像中包括行人)以及标注文件,其中,标注文件中给出了训练图片中的存在行人的包围框(bounding box)的坐标。在采集到训练数据之后,数据采集设备160将这些训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。
下面对训练设备120基于训练数据得到目标模型/规则101进行描述,训练设备120对输入的训练图像进行物体检测,将输出的行人检测结果(图像中存在行人的包围框以及图像中存在行人的包围框的置信度)与标注结果进行对比,直到训练设备120输出的物体的行人检测结果与预先标注的结果的差异小于一定的阈值,从而完成目标模型/规则101的训练。
上述目标模型/规则101能够用于实现本申请实施例的行人检测方法,即,将待处理图像(通过相关预处理后)输入该目标模型/规则101,即可得到待处理图像的行人检测结果。本申请实施例中的目标模型/规则101具体可以为神经网络。需要说明的是,在实际应用中,数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图3所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR),车载终端等,还可以是服务器或者云端等。在图3中,执行设备110配置输入/输出(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:客户设备输入的待处理图像。这里的客户设备140具体可以是终端设备。
预处理模块113和预处理模块114用于根据I/O接口112接收到的输入数据(如待处理图像)进行预处理,在本申请实施例中,可以没有预处理模块113和预处理模块114或 者只有的一个预处理模块。当不存在预处理模块113和预处理模块114时,可以直接采用计算模块111对输入数据进行处理。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果,如将目标模型/规则101计算得到的行人检测结果呈现给客户设备140,从而提供给用户。
具体地,经过计算模块111中的目标模型/规则101处理得到的行人检测结果可以通过预处理模块113(也可以再加上预处理模块114的处理)的处理后将处理结果送入到I/O接口,再由I/O接口将处理结果送入到客户设备140中显示。
应理解,当上述系统架构100中不存在预处理模块113和预处理模块114时,计算模块111还可以将处理得到的行人检测结果传输到I/O接口,然后再由I/O接口将处理结果送入到客户设备140中显示。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在图3中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图1仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图1中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图3所示,根据训练设备120训练得到目标模型/规则101,可以是本申请实施例中的神经网络,具体的,本申请实施例提供的神经网络可以是CNN以及深度卷积神经网络(deep convolutional neural networks,DCNN)等等。
由于CNN是一种非常常见的神经网络,下面结合图4重点对CNN的结构进行详细的介绍。如上文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图4所示,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其 中池化层为可选的),以及全连接层(fully connected layer)230。下面对这些层的相关内容做详细介绍。
卷积层/池化层220:
卷积层:
如图4所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的卷积特征图的尺寸也相同,再将提取到的多个尺寸相同的卷积特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图4中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像 素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
全连接层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用全连接层230来生成一个或者一组所需要的类的数量的输出。因此,在全连接层230中可以包括多层隐含层(如图4所示的231、232至23n)以及输出层240,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
在全连接层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图4由210至240方向的传播为前向传播)完成,反向传播(如图4由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图4所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在。
应理解,可以采用图4所示的卷积神经网络(CNN)200执行本申请实施例的行人检测方法,如图4所示,待处理图像经过输入层210、卷积层/池化层220和全连接层230的处理之后可以得到待处理图像的检测结果(待处理图像中的存在行人的包围框以及图像中存在行人的包围框的置信度)。
图5为本申请实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器50。该芯片可以被设置在如图3所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图3所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图4所示的卷积神经网络中各层的算法均可在如图5所示的芯片中得以实现。
神经网络处理器(neural-network processing unit,NPU)50作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路503,控制器504控制运算电路503提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路503内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路503是二维脉动阵列。运算电路503还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路503是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路503从权重存储器502中取矩阵B相应的数据,并缓存在运算电路503中每一个PE上。运算电路503从 输入存储器501中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)508中。
向量计算单元507可以对运算电路503的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元507可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中,向量计算单元能507将经处理的输出的向量存储到统一缓存器506。例如,向量计算单元507可以将非线性函数应用到运算电路503的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元507生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路503的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器506用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器505(direct memory access controller,DMAC)将外部存储器中的输入数据搬运到输入存储器501和/或统一存储器506、将外部存储器中的权重数据存入权重存储器502,以及将统一存储器506中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)510,用于通过总线实现主CPU、DMAC和取指存储器509之间进行交互。
与控制器504连接的取指存储器(instruction fetch buffer)509,用于存储控制器504使用的指令;
控制器504,用于调用指存储器509中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器506,输入存储器501,权重存储器502以及取指存储器509均为片上(on-chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,简称DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
另外,在本申请中,图4所示的卷积神经网络中各层的运算可以由运算电路503或向量计算单元507执行。
图6是本申请实施例的系统架构的示意图。
如图6所示,输入图片经过行人检测网络进行处理,可以得到输入图片中包围行人2D框(这些2D框也可以称为包围框),这些2D框信息将会送入后续的分析模块进行处理,如送入自动驾驶系统中规控单元进行障碍物分析,送入平安城市的分析模块中与失踪人口进行匹配等。
图6所示的行人检测网络包括主干网络(backbone)模块、候选框生产网络(region proposal network,RPN)模块、自激活模块、基础特征加权模块、区域特征生成模块、区域卷积神经网络(region convolutional neural network,RCNN)模块以及输出模块。图6中的行人检测网络可以执行本申请实施例的行人检测方法,下面对行人检测网络对输入图片进行处理的过程进行大致的介绍。
在获取到输入图片后,主干网络模块对输入图片进行卷积处理,得到输入图片的基础 特征图;RPN模块对基础特征图进行处理,得到输入图片的行人候选框;自激活模块对基础特征图做进一步的卷积处理和加权求和处理后得到输入图片的物体可见度图(该物体可见度图能够突出显示输入图片中的行人可见部分的特征);基础特征加权模块用于对物体可见度图和基础特征图进行加权处理,以得到增强特征图;区域特征模块可以根据候选框以及输入图片的物体可见度图生成候选框的对应的特征;RCNN模块对候选框对应的特征进行处理,得到输入图片的行人检测结果,该行人检测结果可以是图像中存在行人的包围框以及图像中存在行人的包围框的置信度。
本申请实施例的行人检测装置的落地产品形态可以是自动驾驶、终端设备、监控设备等,本发明部署在相关设备的计算节点上,通过软件改造,能够提升行人识别的准确率。
本申请实施例的行人检测装置可以如图5所示,该行人检测装置主要由主干网络模块101、RPN模块102、自激活模块103、基础特征加权模块104、区域特征生成模块105、RCNN模块106以及输出模块107组成。
图7所示的行人检测装置既可以包含区域特征生成模块105,也可以不包含区域特征生成模块105。
当行人检测装置不包括区域特征生成模块105时,行人检测装置以及利用行人检测装置进行行人检测的过程可以图8所示。
如图8所示,RPN模块根据基础特征图确定图像的候选框,自激活模块103根据基础特征图生成图像的物体可见度图,基础特征加权模块104对图像的物体可见度图和基础特征图进行加权求和,得到增强特征图。接下来,可以通过感兴趣区域(region of interest,ROI)池化,从增强特征图中提取候选框对应的特征,接下来由RCNN模块对候选框对应的特征进行处理,以得到行人检测结果。应理解,在图8中,候选框对应的特征就是候选框的区域特征。
当行人检测装置包括区域特征生成模块105时,行人检测装置以及利用行人检测装置进行行人检测的过程可以图9所示。
与图8相比,图9所示的行人检测装置会利用区域特征生成模块105生成候选框的区域特征以及候选框的轮廓区域的区域特征和候选框的背景区域的区域特征,这三部分特征可以共同融合成候选框对应的特征。接下来,再由RCNN模块对候选框对应的特征进行处理,以得到行人检测结果。
为了更好地了解本申请实施例的行人检测方法的执行过程,下面先对图7中的各个模块的功能进行简单的描述。
主干网络模块101:
主干网络模块101用于对输入的图像(也可以称为图片)进行一系列的卷积处理,得到图像的基础特征图(feature map)。该图像的基础特征图为后续的其他模块进行图像检测时提供图像的基础特征。
主干网络模块101中的主干网络可以有多种实现方式,例如,VGG网络、深度残差网络(deep residual network,ResNet)和inception网络等。
RPN模块102:
RPN模块102用于在主干网络模块101产生的基础特征图上预测出可能存在行人的区域,并给出这些区域的位置,这些区域的边界可以称为候选框(proposal)。一般来说, RPN模块102检测得到的候选框的位置并不十分准确,候选框可能会落到背景图像上,也可能不会很好的包围住图像中的行人目标。
自激活模块103:
自激活模块采用RCNN模块106中的共享的Conv5卷积层对主干网络模块101产生的基础特征图进行进一步的卷积处理,得到高层语义特征图,接下来,再采用与RCNN模块106中相同的分类器权重系数对高层语义特征图进行加权,以得到图像的物体可见度图(visibility map)。该图像的物体可见度图对行人的可见部分具有强响应,而对背景以及遮挡部分具有弱响应。其中,自激活模块103是本申请的行人检测装置的核心模块。
基础特征加权模块104:
基础特征加权模块104用于对主干网络模块101生成的图像的基础特征图和自激活模块103生成的图像的物体可见度图进行加权处理,得到图像的增强特征图。在图像的增强特征图中,行人的可见部分的特征得以加强,而图像中的背景以及对行人形成遮挡的遮挡物的特征被削弱。
区域特征生成模块105:
区域特征生成模块105参考自激活模块103生成的图像的物体可见度图,对RPN模块102生成的每个候选框进行处理,生成当前候选框对应的行人的轮廓区域图像、背景区域图像以及ROI区域图像,并且通过区域特征生成模块105中的感兴趣网络-池化(pooling)模块抽取这3个区域图像的特征,然后将抽取到的这3个区域图像进行融合,作为当前候选框的区域特征。
RCNN模块106:
RCNN模块106采用与自激活模块103相同的Conv5卷积层对区域特征生成模块105生成的候选框的区域特征进行卷积处理,得到候选框区域的图像特征,然后采用全局平均池化模块(global average pooling,GAP)对候选框区域的图像特征进行平均化操作,最后再将平均化后的图像特征分别送入框回归器和分类器中,从而预测出候选框的最终坐标和置信度。
输出模块107:
输出模块107用于对RCNN模块106输出的所有候选框进行非极大值抑制(non-maximize suppression,NMS)处理,从而把高度重叠的候选框进行合并,并且过滤掉置信度过低的候选框。从而输出反映行人检测结果的2D框(相当于下文中提及的包围框)和2D框的置信度。
下面结合图10对本申请实施例的行人检测方法进行详细描述。
图10是本申请实施例的行人检测方法的示意性流程图。图10所示的方法可以由本申请中的行人检测装置来执行。图10所示的方法包括步骤1001至1007,下面对步骤1001至1007进行详细的介绍。
1001、获取图像。
上述图像可以是包含行人的图像。
具体地,上述图像可以是各种包含行人的图像,例如,通过手机或者其他智能终端拍摄的图像,辅助/自动驾驶系统获取的道路画面图像,平安城市/视频监控系统获取的监控画面图像。
应理解,上述通过手机或者智能终端拍摄的图像,以及道路画面图像和监控画面图像一般是包含行人的图像,如果这些图像中不包括行人的话,那么,最终的识别结果可以是空,也就是对这类不包含行人的图像进行识别,不能识别出包围行人的包围框。
在上述步骤1001中,既可以通过摄像头拍摄来获取图像,也可以从存储器中获取图像。应理解,图1所示的方法也可以直接从步骤1002开始。
1002、对图像进行特征提取,得到图像的基础特征图。
在步骤1002得到图像的基础特征图时,具体可以通过对图像进行卷积操作(卷积处理),或者是对图像的卷积操作结果做进一步的处理(例如,进行求和、加权处理、连接等操作)得到基础特征图。
在上述步骤1002中,可以通过神经网络中的主干网络(模块)对图像进行卷积处理,从而得到图像的基础特征图(feature map)。该主干网络可以采用多种卷积网络架构,例如,VGG网络(牛津大学的视觉几何组(visual geometry group)提出一种网络)、深度残差网络(deep residual network,ResNet)和inception网络等。
上述基础特征图可以包含多个通道,上述步骤1002中在对图像进行特征提取时,具体可以通过对图像进行卷积处理来得到基础特征图,通过卷积处理得到基础特征图的过程可以如图11所示。
如图11所示,基础特征图是一个包含多个通道的特征图。在图11中,假设输入图像的分辨率为H 0*W 0*3(高度H 0,宽度W 0,通道数为3,也就是RBG三个通道),那么经过卷积处理后可以得到基础特征图U∈H*W*K,其中,H和K分别表示基础特征图的高度和宽度,K表示基础特征图的通道数。
下面结合图12对步骤1002生成图像的基础特征图的过程进行详细的介绍。
如图12所示,可以采用残差网络ResNet18的不同卷积层对输入图像进行卷积操作,具体地,图12所示的卷积操作具体可以包括以下过程(1)至(4):
(1)ResNet18-Conv1(ResNet18的第一个卷积层)对输入图像进行卷积处理,得到特征图C1(feature map C1)。
假设上述输入图像的分辨率为H 0*W 0*3(高度H 0,宽度W 0,通道数为3),经过ResNet18-Conv1卷积处理后得到的特征图C1的分辨率可以为H 0/4*W 0/4*64。具体地,ResNet18-Conv1可以对输入图像进行两次下采样(每次进行采样操作时宽和高均变为原来的一半),并将通道数从3扩充到64,从而得到特征图C1。
(2)ResNet18-Conv2(ResNet18的第二个卷积层)对特征图C1继续进行卷积处理,得到特征图C2(feature map C2)。
ResNet18-Conv2继续对特征图C1进行卷积处理,得到的特征图C2的分辨率可以与特征图C1的分辨率相同,都是H 0/4*W 0/4*64。
(3)ResNet18-Conv3(ResNet18的第三个卷积层)对特征图C2继续进行卷积处理,得到特征图C3(feature map C3)。
ResNet18-Conv3可以对特征图C2再进行一次下采样,并将通道数加倍(将通道数由64扩充到128),从而得到特征图C3,特征图C3的分辨率为H 0/8*W 0/8*128。
(4)ResNet18-Conv4(ResNet18的第四个卷积层)对特征图C3继续进行卷积处理,得到特征图C4(feature map C4)。
ResNet18-Conv4可以对特征图C3再进行一次下采样,并将通道数加倍(将通道数由128扩充到256),从而得到特征图C4,特征图C4的分辨率为H 0/16*W 0/16*256。
应理解,上述图12所示的卷积过程仅为示例,本申请实施例中对卷积处理时采用的网络,卷积处理的次数等等不做限定。
以图12所示的卷积处理过程为例,步骤1002中的基础特征图既可以是图12所示的特征图C4,也可以是图12所示的卷积特征图C1至特征图C4中的至少一个特征图。
当基础特征图由图12所示的多个不同分辨率的特征图组成时,在后续进行ROI池化的过程中,可以将这些不同分辨率的特征图调整为分辨率一致的特征图,并输入到后续的RCNN模块继续进行处理。
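As a minimal sketch of the backbone feature extraction described above, the following code reuses the first four convolution stages of torchvision's ResNet-18 to produce feature maps C1 to C4 from an input image; the use of torchvision, the stage grouping, and the 640×640 dummy input are illustrative assumptions rather than the exact network of this application.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

net = resnet18()                                   # randomly initialised ResNet-18
stages = nn.ModuleDict({
    "C1": nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool),  # H0/4  x W0/4  x 64
    "C2": net.layer1,                                                # H0/4  x W0/4  x 64
    "C3": net.layer2,                                                # H0/8  x W0/8  x 128
    "C4": net.layer3,                                                # H0/16 x W0/16 x 256
})

x = torch.randn(1, 3, 640, 640)                    # dummy input image H0 x W0 x 3
feats = {}
for name, stage in stages.items():                 # run the stages in order, keep C1..C4
    x = stage(x)
    feats[name] = x
print({k: tuple(v.shape) for k, v in feats.items()})
```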
1003、根据基础特征图确定图像的候选框。
上述候选框为图像中可能存在行人的区域的包围框。候选框所在的区域可以是候选框所围成的区域(候选框内部的区域),候选框所在的区域也就是图像中可能存在行人的区域。
在步骤1003中,可以根据图像的基础特征图通过RPN模块预测出可能存在行人的区域,并框出这些区域,框住这些区域的边框就是候选框。
一般来说,步骤1003中确定的候选框的位置并不十分准确,候选框可能会落到背景图像上(此时,候选框中可能没有行人目标),也可能无法很好的包围住行人目标。
在步骤1003中,一般可以利用RPN模块对基础特征图进行处理,从而得到候选框。下面结合图13对RPN模块确定图像候选框的过程进行详细描述。
如图13所示,RPN模块可以先采用3×3的卷积核对基础特征图进行卷积处理,得到RPN隐藏特征图(RPN Hidden)。接下来,再分别采用两个3×3的卷积核对RPN隐藏特征图进行卷积处理,预测出RPN隐藏特征图的每个候选框的位置和置信度。一般来说,候选框的置信度越高,说明这个候选框存在行人的概率越大。
接下来,RPN模块会对预测得到的候选框进行合并处理,在进行合并处理时,可以根据候选框之间的重合程度去掉多余的候选框,其中,在去掉多余的候选框的过程中可以采用但不限于NMS算法的方式进行候选框的筛选。
假设RPN模块一共预测出了J个候选框,那么,可以从该J个候选框中挑选出得分最高的N(N<J)个候选框作为包含行人的候选框,其中,N和J均为正整数。
但是经过RPN模块确定的图像的候选框的位置一般也不是十分准确,如图13所示,虽然两个候选框中都有行人,但是这两个候选框并没有把行人完全包括在框内,也没有把行人紧紧包括在框内。
1004、对图像的基础特征图进行处理,得到图像的物体可见度图。
其中,上述图像的物体可见度图对不同物体的响应程度不同,在该图像的物体可见度图中,对行人可见部分的响应程度大于对行人不可见部分的响应程度。也就是说,在该图像的物体可见度图中,上述图像的物体可见度图对行人的可见部分具有强响应,而对行人的不可见部分具有弱响应,与行人不可见部分相比,行人可见部分的特征更加突出。
具体地,在该图像的物体可见度图中,行人可见部分的像素值大于行人不可见部分的像素值。
下面结合图14和图15对物体可见度图的显示效果进行说明。
如图14所示,人体可见部分是行人能够被看到的部分,也就是行人没有被其他物体遮挡的部分,行人不可见部分可以包括其他遮挡住行人的物体部分,以及图像的背景部分。在图14中,人体可见部分的颜色较亮,表示响应较大,而人体不可见部分的颜色较暗,表示响应较小。
如图15所示,第一行的6个图像为原始图像,第二行的6个图像为相应的物体可见度图,在第二行所示的6个图像中,人体可见部分的颜色也比较亮,而人体可见部分的颜色仍然比较暗。在图15所示的第3列图像和第4列图像中,人体的下半部分均被遮挡,生成的物体可见度图中人体可见部分对应的颜色的亮度均大于人体遮挡部分对应的颜色亮度,从而使得人体可见部分的特征得到了加强,而人体遮挡部分的特征以及背景区域的特征被削弱。
通过上述图像的物体可见度图,能够更突出的反映出行人可见部分的特征。
另外，在上述图像的物体可见度图中，行人可见部分是能够看到行人图像的部分，而行人不可见部分是不能看到行人图像的部分。
由于上述图像的物体可见度图对行人的可见部分具有更强的响应,因此,后续在确定图像中包含行人的包围框时通过结合图像的物体可见度图,能够提高行人检测的准确性。
在上述步骤1004中,为了得到图像的物体可见度图,可以先对图像的基础特征图进行卷积处理,然后对卷积处理得到的多个特征图进行加权求和处理,得到图像的物体可见度图。
本申请中,由于图像的物体可见度图能够更突出的反映行人可见部分的特征,因此,根据图像的基础特征图和图像的物体可见度图融合得到的增强特征图中突出体现了行人可见部分的特征,能够提高后续根据该增强特征图进行行人检测的准确性。
可选地,在上述图像的物体可见度图中,行人不可见部分包括行人遮挡部分。
当行人不可见部分包括行人遮挡部分时,能够在物体可见度图中区分开行人可见部分和行人遮挡部分,并且由于行人可见部分的像素值大于行人遮挡部分的像素值,因此,能够在物体可见度中突出体现行人可见部分的特征,削弱行人遮挡部分的特征,减少在后续行人检测过程中,行人遮挡部分对检测结果的影响,便于突出行人可见部分的特征,提高行人检测的效果。
可选地,在上述图像的物体可见度图中,行人不可见部分包括所述图像的背景部分。
上述图像的背景部分可以是指该图像中除了行人之外的其他部分,或者,上述图像的背景部分还可以是指该图像中除了行人和主要物体(例如,汽车)之外的其他部分。
当行人不可见部分包括图像的背景部分时,能够在物体可见度图中将行人和背景部分区分开,突出行人可见部分的特征,削弱背景部分的特征,减少背景部分在后续行人检测过程中对检测结果的影响,便于后续进行行人检测时突出行人可见部分的特征,提高行人检测的效果。
如图16所示,上述步骤1004确定图像的物体可见度图的具体过程可以包括以下步骤:
1004a、采用第一卷积网络对图像的基础特征图进行卷积处理,得到多个第一语义特征图;
1004b、对上述多个第一语义特征图进行加权求和处理,得到图像的物体可见度图。
其中,上述多个第一语义特征图是从基础特征图全图中提取的多个不同语义的特征 图。具体地,在上述多个第一语义特征图中,任意两个第一语义特征图对应不同的语义。
例如,上述多个第一语义特征图由F 1、F 2和F 3组成,其中,F 1反映的是头部特征,F 2反映的是左手的特征,F 3反映的是右手的特征,F 1、F 2和F 3各自反映的语义均不相同。
上述步骤1004a进行卷积处理时的卷积参数与后续步骤1007a中进行卷积处理时的卷积参数是相同的。在上述步骤1004b在对该多个第一语义特征进行加权求和处理时,使用的加权系数为分类器中用于确定行人得分的权重系数,该分类器指的是图8中RCNN模块106中的分类器。
由于上述步骤1004a中的卷积参数与步骤1007a中的卷积参数相同,步骤1004a中在对该多个第一语义特征进行加权求和处理时的加权系数为分类器中用于确定行人得分的权重系数,可以使得步骤1004b得到的物体可见度图对行人的可见部分的响应大于对行人不可见部分的响应。
上述步骤1004a和步骤1004b可以由自激活模块对基础特征图进行处理,得到图像的物体可见度图。
下面结合图17对上述步骤1004a和步骤1004b进行详细的说明。
如图17所示,可以先采用Conv5卷积层对基础特征图U∈H*W*K做进一步的卷积处理,得到高层语义特征图F∈H*W*K,其中,H,W,K分别为高层语义特征图的高度、宽度和通道数。其中,高层语义特征能够是反映人的一些关键部位的特征(例如,头部特征,左手特征,右手特征等)。
接下来,高层语义特征图F进一步可以分为K个第一特征图F k∈H*W,k=1,2,...,K.,其中,每个第一特征图反映不同的特征,例如,F 1反映的是头部特征,F 2反映的是左手的特征,F 3反映的是背景物的特征……对于不同部位的特征来说,在进行行人识别时的重要程度是不同的,有些特征在进行行人识别时的重要性很高,有些特征则在进行行人识别时的重要性较低,例如,头部特征和面部特征在进行行人识别时的重要性比较高,而背景物特征在进行行人识别时的重要性就比较低。
另外,上述自激活模块中的Conv5卷积层的卷积参数与后续图24所示的RCNN模块中的Conv5卷积层的卷积参数相同。由于RCNN模块中的Conv5卷积层的卷积参数一般是在对行人检测之前经过多次训练得到的。因此,自激活模块中的Conv5卷积层采用与用RCNN模块中的Conv5卷积层相同的卷积参数,能够较好地提取出图像的高级语义特征。除此之外,由于分类器的权重建立的是经过Conv5卷积层之后的特征和分类的映射关系,为了复用分类器(图8中RCNN模块106)的权重,我们需要让自激活模块经过相同的编码,所以在此使用Conv5卷积层。
在获取到上述多个第一特征图(F_1至F_k)之后，可以根据公式(2)对该多个第一特征图进行加权求和，以得到图像的可见度图：

$V=\sum_{k=1}^{K}w_{k}F_{k}$  (2)
在上述公式(2)中,w k∈1*1表示第一特征图F k∈H*W权重系数,w k越大,说明F k对行人识别的贡献越大,F k的重要性也就越高。
在根据上述公(2)计算图像的物体可见度图时,采用的权重系数(这里的权重系数也可以称为可视化权重)与后续RCNN模块中对不同语义的特征图进行加权求和时的权 重系数相同,具体地,第一特征图F k∈H*W的权重系数与后续RCNN模块对相同语义的特征图进行叠加时采用的权重系数相同。例如,在利用上述公式(2)中表征人的头部的第一特征图的权重系数与后续RCNN模块对表征人的头部的特征图进行加权求和时的权重系数相同。
通过公式(2)中的加权求和方式可以突出对人体检测贡献大的特征,抑制贡献不大的特征。通过实验发现,通过公式(2)计算得到的图像的物体可见性图V能够对行人的可见部分产生强响应,对背景和遮挡部分产生弱响应。
上述通过公式(2)计算产生的图像的物体可见度图可以如图14和图15所示,在图14和图15所示的图像中,人的颜色较亮,表示响应较大,人周围其他物体或者背景的颜色较暗,表示响应越低。总体来看,图14和图15所示的图像的物体可见度图对背景具有很强的抑制作用。从局部来看,对行人的可见部分具有强响应,对行人的遮挡部分具有弱响应。
本申请中,当采用与RCNN模块进行加权求和处理相同的加权系数对多个第一特征图进行加权求和得到图像的物体可见度图时,能够突出对人体贡献大的特征,抑制贡献不大的特征。
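As a minimal sketch of the self-activation computation above (formula (2)), the following PyTorch code reuses a Conv5 layer and a classifier's weight vector to turn the basic feature map into the object visibility map; the channel sizes are illustrative assumptions, and the Conv5 layer and classifier are assumed to be the modules shared with the RCNN module.

```python
import torch
import torch.nn as nn

def object_visibility_map(base_feat, conv5, classifier):
    # base_feat: (N, C, H, W) basic feature map U
    f = conv5(base_feat)                         # (N, K, H, W) semantic feature maps F_k
    w = classifier.weight.view(1, -1, 1, 1)      # (1, K, 1, 1) classifier weights w_k
    return (w * f).sum(dim=1, keepdim=True)      # V = sum_k w_k * F_k, formula (2)

conv5 = nn.Conv2d(256, 512, kernel_size=3, padding=1)    # shared with the RCNN module
classifier = nn.Linear(512, 1)                           # its weights are w_1..w_K
v = object_visibility_map(torch.randn(1, 256, 40, 40), conv5, classifier)
print(v.shape)                                           # torch.Size([1, 1, 40, 40])
```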
1005、对图像的基础特征图和图像的物体可见度图进行融合处理,得到图像的增强特征图。
在对图像的基础特征图和图像的物体可见度图进行融合处理之前,可以先对图像的物体可见度图进行维度扩展,使得经过维度扩展后的物体可见度图的通道数与基础特征图的通道数相同,然后再对图像的基础特征图和图像的物体可见度图进行融合处理,在进行融合处理时具体可以采用以下三种融合方式。
第一种融合方式:
在第一种融合方式下，可以将基础特征图中的对应元素与维度扩展后的物体可见度图的对应元素相乘，然后将得到的乘积与原来的基础特征图中的相应元素求和，从而得到增强特征图$\tilde{U}$中的相应元素的值。
具体地，如图18所示，物体可见度图V在进行维度扩展之前，一般只有一个通道，通过维度扩展，使得维度扩展后的物体可见度图与基础特征图U的通道数一致，然后将基础特征图U中的第(i,j)个元素与经过维度扩展后的物体可见度图的第(i,j)个元素相乘，并将乘积与基础特征图U中的第(i,j)个元素的取值相加，得到增强特征图$\tilde{U}$中每个元素的取值。
应理解,在图18所示的过程中,可以逐个通道进行特征图的加权求和,最终得到与基础特征图通道数相同的增强特征图。例如,基础特征图U一共有64个通道,那么可以将基础特征图的每个通道的特征图与物体可见度图中的对应元素相乘,然后将相乘的结果与基础特征图中的相应元素进行求和,得到增强特征图对应元素的取值,重复类似的操作,直到获取到增强特征图的64个通道中的每个通道的特征图中的元素的取值。
具体地，在进行加权处理时，可以采用公式(3)对图像的物体可见度图V和图像的基础特征图U进行加权处理，得到图像的增强特征图$\tilde{U}$：

$\tilde{U}=U\odot\tilde{V}+U$  (3)

在上述公式(3)中，$\tilde{V}$是将V在通道的维度上扩展K倍得到的特征图，⊙是对应元素的相乘。
通过自激活模块103得到的图像的增强特征图能够使得图像中的人体的可见部分的特征得以增强,背景和遮挡物的特征得到抑制,便于后续根据图像的增强图像特征进行高精度的行人检测。
第二种融合方式:
在第二种融合方式下，可以将基础特征图中的对应元素与维度扩展后的物体可见度图的对应元素相乘，然后将得到的乘积与维度扩展后的物体可见度图的对应元素求和，从而得到增强特征图$\tilde{U}$中的相应元素的值。
第三种融合方式:
在第三种融合方式下,可以直接将基础特征图中的对应元素与维度扩展后的物体可见度图的对应元素进行加权求和,加权求和后得到的元素的取值为增强特征图的相应元素的取值,在进行加权求和时,维度扩展后的物体可见度图的加权系数可以大于基础特征图的加权系数,从而使得增强特征图中主要体现出物体可见度图的特征。
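The following is a minimal PyTorch sketch of the first fusion method above (formula (3)); the tensor shapes are illustrative assumptions.

```python
import torch

def enhance_feature_map(u, v):
    # u: (N, K, H, W) basic feature map U; v: (N, 1, H, W) object visibility map V
    v_expanded = v.expand_as(u)       # expand V K times along the channel dimension
    return u * v_expanded + u         # enhanced feature map U~ = U ⊙ V~ + U, formula (3)

u = torch.randn(1, 256, 40, 40)
v = torch.rand(1, 1, 40, 40)
print(enhance_feature_map(u, v).shape)    # torch.Size([1, 256, 40, 40])
```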
1006、根据图像的候选框和图像的增强特征图,确定候选框对应的特征。
上述步骤1006中得到的候选框对应的特征可以包括候选框的区域特征,该候选框的区域特征是增强特征图中位于候选框内的区域的特征。
具体地,上述候选框的区域特征可以是增强特征图中与候选框相对应的区域内的这部分特征。在确定候选框的区域特征时,可以先确定候选框在增强特征图中的位置,接下来,就可以将候选框在增强特征图中所围成的区域的特征确定为候选框的区域特征。
进一步的，还可以对候选框在增强特征图中所围成的区域的特征进行抽样处理（具体可以是上采样或者下采样），然后得到候选框的区域特征。
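As a sketch of this step, the code below uses torchvision's roi_align (a common variant of ROI pooling, used here only as a stand-in for the sampling described above) to extract a fixed-size region feature for one candidate frame from the enhanced feature map; the 7×7 output size and the coordinates are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_align

enhanced = torch.randn(1, 256, 40, 40)              # enhanced feature map (batch of one image)
boxes = torch.tensor([[0., 4., 6., 20., 36.]])      # (batch_idx, x1, y1, x2, y2) in feature-map coordinates
region_feat = roi_align(enhanced, boxes, output_size=(7, 7))
print(region_feat.shape)                            # torch.Size([1, 256, 7, 7])
```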
对于上述候选框对应的特征来说,除了包含候选框的区域特征之外,还可以包含候选框之外的其他区域的特征。
如图19所示,可以根据图像的候选框(的位置),直接将图像增强特征图中位于候选框内的特征作为上述候选框对应的特征。
为了使得最终确定的候选框对应的特征更准确,还可以先结合图像的物体可见度图确定候选框的区域,然后再根据图像的候选框的区域和图像的增强特征图,确定候选框对应的特征。
如图20所示,可以根据图像的候选框(的位置),先从图像的物体可见度图中确定出候选框的区域,然后再根据候选框的区域(的位置)将增强特征图中位于候选框内的特征确定为上述候选框对应的特征。
在图20中,由于图像的物体可见度图对人体可见部分的响应更高,因此,借助物体的可见度图能够更准地确定出候选框的位置,便于后续获取更准确的候选框的区域特征。
在图19和图20中,在获取到了候选框的区域特征之后,可以采用RCNN模块对候选框的区域特征继续处理,以最终得到图像中存在行人的包围框和图像中存在行人的包围框的置信度。
本申请中,为了进一步提高最终的行人检测结果的准确性,可以提取候选框周围区域 的区域特征,并将候选框周围区域与候选框的区域特征进行融合,然后采用RCNN模块对融合后的特征进行处理,以确定图像中存在行人的包围框和图像中存在行人的包围框的置信度。本申请中,由于综合采用了候选框的区域特征和候选框周围区域的区域特征来综合确定图像中存在行人的包围框和图像中存在行人的包围框的置信度,在进行行人检测时的准确率更高。
可选地,上述候选框对应的特征还包括候选框的轮廓区域的区域特征,候选框的轮廓区域是候选框按照第一预设比例缩小后得到的缩小候选框与候选框之间形成的区域。
上述候选框的轮廓区域的区域特征一般会包含行人的轮廓特征,而行人的轮廓特征在进行行人检测时也会起到很重要的作用。
如图21所示,先根据图像的候选框(的位置),先从图像的物体可见度图中确定出候选框的区域,然后将候选框的边框按照一定的比例向内缩小,候选框原来的边框与候选框缩小后的边框之间的区域就是候选框的轮廓区域。
或者,在确定候选框的轮廓区域时,还可以从增强特征图中确定出候选框的区域,然后将候选框的边框按照一定的比例向内缩小,候选框原来的边框与候选框缩小后的边框之间的区域就是候选框的轮廓区域。
上述候选框按照第一预设比例缩小具体可以是指候选框的宽和高分别按照一定的比例进行缩小,候选框的宽和高进行缩小时的比例可以相同,也可以不同。
可选地,上述候选框按照第一预设比例缩小,包括:上述候选框的宽按照第一缩小比例进行缩小,候选框的高按照第二缩小比例进行缩小。
第一预设比例包括第一缩小比例和第二缩小比例,其中,第一缩小比例和第二缩小比例既可以相同,也可以不同。
第一缩小比例和第二缩小比例可以根据经验来设置。例如,可以将为第一缩小比例和第二缩小比例设置合适的数值,以使得按照第一缩小比例和第二缩小比例对候选框的宽和高进行缩小后得到的候选框的轮廓区域能够较好的提取到行人的轮廓特征。
在设置第一缩小比例和第二缩小比例可以取能够较好的提取行人轮廓的数值。
上述第一缩小比例可以是1/1.1,上述第二缩小比例可以是1/1.8。
其中,在对候选框的边框进行缩小时,可以是以候选框的中心为中心点,按照一定的比例向内缩小,其中,缩小后的候选框的中心可以是原始候选框在图像的物体可见度图中的最大值点。
可选地,缩小后的候选框的中心和原始候选框的中心点保持一致。
本申请中,当候选框对应的特征还包括候选框的轮廓区域的区域特征时,能够在进行行人检测时,将行人的轮廓特征也考虑进来,便于后续综合行人的轮廓特征来更好地进行行人检测。
可选地,图10所示的方法还包括:将候选框的区域特征中位于缩小候选框内的特征的取值置零,以得到候选框的轮廓区域的区域特征。
本申请中,在获取候选框的轮廓区域的区域特征时,通过直接将候选框的区域特征中位于缩小候选框内的特征的取值置零,能够快速方便的获取到候选框的轮廓区域的区域特征。
应理解,在本申请中,还可以采用其他方式获取候选框的轮廓区域的区域特征,例如, 可以按照候选框的轮廓区域的区域位置直接从增强特征图中扫描得到候选框的轮廓区域的区域特征。
可选地,上述候选框对应的特征还包括候选框的背景区域的区域特征,候选框的背景区域是候选框按照第二预设比例扩大后得到的扩大候选框与候选框之间形成的区域。
候选框的背景区域的区域特征一般反映的是图像中的行人所处的背景区域的特征,该背景区域的特征可以结合行人的特征来进行行人检测。
上述候选框按照第一预设比例扩大具体可以是指候选框的宽和高分别按照一定的比例进行扩大,候选框的宽和高进行扩大时的比例可以相同,也可以不同。
可选地,上述候选框按照第一预设比例扩大,包括:上述候选框的宽按照第一扩大比例进行扩大,候选框的高按照第二扩大比例进行扩大。
第一预设比例包括第一扩大比例和第二扩大比例,其中,第一扩大比例和第二扩大比例既可以相同,也可以不同。
第一扩大比例和第二扩大比例可以根据经验来设置。例如,可以将为第一扩大比例和第二扩大比例设置合适的数值,以使得按照第一扩大比例和第二扩大比例对候选框的宽和高进行扩大后得到的候选框的轮廓区域能够较好的提取到行人的轮廓特征。
在设置第一扩大比例和第二扩大比例可以取能够较好的提取行人轮廓的数值。
上述第一扩大比例可以是1/1.1,上述第二扩大比例可以是1/1.8。
如图22所示,可以根据图像的候选框(的位置),先从图像的物体可见度图中确定出候选框的区域,然后将候选框的边框按照一定的比例向外扩大,候选框原来的边框与候选框扩大后的边框之间的区域就是候选框的背景区域。
或者,在确定候选框的背景区域时还可以根据图像的候选框(的位置),先从图像的增强特征图中确定出候选框的区域,然后将候选框的边框按照一定的比例向外扩大,候选框原来的边框与候选框扩大后的边框之间的区域就是候选框的背景区域。
在对候选框的边框进行扩大时可以根据经验来确定扩大比例。例如,在对候选框进行扩大时,可以将候选框的宽度变为原来的1.1倍,高度变为原来的1.8倍。
其中,在对候选框的边框进行扩大时,可以是以候选框的中心为中心点,按照一定的比例向外扩大,其中,扩大后的候选框的中心可以是原始候选框在图像的物体可见度图中的最大值点。
可选地,扩大后的候选框的中心和原始候选框的中心点保持一致。
本申请中,当候选框对应的特征还包括候选框的背景区域的区域特征时,能够在进行行人检测时,将背景区域的区域特征也考虑进来,便于后续综合背景区域的区域特征来更好地进行行人检测。
可选地,图10所示的方法还包括:获取第一区域的区域特征,第一区域的区域特征为物体可见度图中位于扩展候选框的区域内的区域特征;将第一区域的区域特征中位于候选框内的特征置零,得到候选框的背景区域的区域特征。
可选地,还可以结合增强特征图来确定候选框的背景区域的区域特征。具体地,可以获取第二区域的区域特征,第二区域的区域特征为增强特征图中位于扩展候选框的区域内的区域特征;将第二区域的区域特征中位于候选框内的特征置零,得到候选框的背景区域的区域特征。
本申请中，在获取候选框的背景区域的区域特征时，通过获取候选框的背景区域在增强特征图中对应的区域特征，然后直接将该区域特征中位于候选框内的特征置零，能够快速方便的获取到候选框的背景区域的区域特征。
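The following is a minimal PyTorch sketch of how the contour-region feature and the background-region feature described above could be obtained by shrinking or expanding a candidate frame around its centre and zeroing out the corresponding inner features. The shrink ratios (1/1.1, 1/1.8) and expand ratios (1.1, 1.8) follow the examples in the text; keeping the original centre and the coordinate rounding are assumptions of this sketch.

```python
import torch

def contour_and_background_features(enhanced, box,
                                    shrink=(1 / 1.1, 1 / 1.8), expand=(1.1, 1.8)):
    # enhanced: (C, H, W) enhanced feature map; box = (x1, y1, x2, y2) in feature-map coordinates
    c, h, w = enhanced.shape
    x1, y1, x2, y2 = box
    cx, cy, bw, bh = (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1

    def crop(sx, sy):
        # crop the box scaled by (sx, sy) around the same centre, clamped to the map
        nx1 = int(max(0, cx - bw * sx / 2)); nx2 = int(min(w, cx + bw * sx / 2))
        ny1 = int(max(0, cy - bh * sy / 2)); ny2 = int(min(h, cy + bh * sy / 2))
        return enhanced[:, ny1:ny2, nx1:nx2].clone(), (nx1, ny1)

    # contour feature: the original box region with the shrunken inner box zeroed out
    contour, (ox, oy) = crop(1.0, 1.0)
    inner, (ix, iy) = crop(*shrink)
    contour[:, iy - oy:iy - oy + inner.shape[1], ix - ox:ix - ox + inner.shape[2]] = 0

    # background feature: the expanded box region with the original box zeroed out
    background, (bx, by) = crop(*expand)
    background[:, oy - by:oy - by + contour.shape[1], ox - bx:ox - bx + contour.shape[2]] = 0
    return contour, background

contour, background = contour_and_background_features(torch.randn(256, 40, 40), (10, 5, 20, 35))
print(contour.shape, background.shape)
```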
应理解,这里的候选框的区域特征、候选框的轮廓特征和候选框的背景特征可以称为候选框对应的特征。也就是说,在这种情况下,候选框对应的特征不仅包括候选框自身区域内的区域特征,也包括候选框自身区域之外的其他区域(轮廓区域和背景区域)的区域特征。
具体地,对于上述候选框对应的特征来说,候选框对应的特征除了包含候选框的区域特征之外,还可以包括候选框的轮廓特征和候选框的背景特征中的至少一个。
而当候选框对应的特征包括候选框的区域特征、候选框的轮廓特征和候选框的背景特征这三个特征时,候选框对应的特征包含的信息最多,使得后续根据RCNN模块对候选框对应的特征进行行人检测,可以在一定程度上提高行人检测的准确率。
应理解,上述候选框对应的特征既可以直接包括候选框的区域特征、候选框的轮廓特征和候选框的背景特征这三个单独的特征,也可以包括候选框的区域特征、候选框的轮廓特征和候选框的背景特征这三个特征所融合后的特征。也就是说,可以对候选框的区域特征、候选框的轮廓特征和候选框的背景特征这三个特征进行融合,融合后的特征就是候选框对应的特征。
下面结合图23以候选框对应的特征是候选框的区域特征、候选框的轮廓特征和候选框的背景特征这三个特征融合后的特征为例对得到候选框对应的特征,根据候选框对应的特征进行行人检测的过程进行详细说明。
如图23所示,可以根据图像的候选框和图像的物体可见度图来确定候选框的区域、候选框的轮廓区域和候选框的背景区域(确定这些区域的过程可以参见图21和图22)。
接下来,再根据候选框的区域、候选框的轮廓区域和候选框的背景区域分别从图像的增强特征图中提取出候选框的区域特征、候选框的轮廓区域的区域特征(可以简称为轮廓特征)和候选框的背景区域的区域特征(可以简称为背景特征)。
在获取到候选框的区域特征、候选框的轮廓区域的区域特征和候选框的背景区域的区域特征之后,可以对这三个特征进行融合,得到融合后的特征,融合后的特征就是候选框对应的特征。在对这三个特征进行融合是可以采用线性组合的方式(对这三个特征进行加权求和),也可以采用非线性组合的方式。在对这三个特征进行融合之前,可以先将这三个特征调整到同等大小(例如,7*7*K,K为通道数),然后进行融合。
在得到候选框对应的特征之后,可以利用RCNN模块对候选框对应的特征进行处理,以最终得到图像中存在行人的包围框和图像中存在行人的包围框的置信度。
1007、根据候选框对应的特征确定图像中存在行人的包围框和图像中存在行人的包围框的置信度。
通过对候选框对应的特征进行卷积处理和加权求和,能够根据候选框对应部分的图像特征来更好地确定图像中存在行人的包围框和图像中存在行人的包围框的置信度。
本申请中,由于图像的物体可见度图能够更突出的反映行人可见部分的特征,因此,根据图像的基础特征图和图像的物体可见度图融合得到的增强特征图中突出体现了行人可见部分的特征,能够提高后续根据该增强特征图进行行人检测的准确性。并且,针对行 人遮挡(较为严重)的情况,本申请进行行人检测的准确性有较为明显的提高。
另外,本申请中通过结合物体可见度图来提高行人检测准确性的同时,不会在训练过程中增加训练数据的标注量,本申请仅需要在处理的过程中生成物体的可见度图,并在后续处理时综合考虑物体可见度图即可,与通过增加数据标注量从而提升行人检测准确性的方案相比,能够在节约数据的标注量,减小训练的复杂度。
本申请实施例的行人检测方法可以应用于辅助/自动驾驶系统、平安城市/视频监控系统等多种场景中。
当本申请实施例的行人检测方法应用于辅助/自动驾驶系统时,获取的是道路画面图像,通过本申请实施例的行人检测方法的处理,能够检测出道路画面图像中存在行人的包围框和道路画面图像中存在行人的包围框的置信度。接下来,可以根据道路画面图像中存在行人的包围框和道路画面图像中存在行人的包围框的置信度对自动驾驶车辆进行控制。例如,当通过道路画面图像中存在行人的包围框和道路画面图像中存在行人的包围框的置信度确定车辆正前方很可能存在行人时,可以通过控制车辆减速并进行鸣笛,也可以控制车辆绕过行人等等。
当本申请实施例的行人检测方法应用于平安城市/视频监控系统时,可以先获取的是监控画面图像,然后通过本申请实施例的行人检测方法的处理,得到监控画面图像中存在行人的包围框和监控画面图像中存在行人的包围框的置信度。接下来,可以根据监控画面图像中存在行人的包围框和监控画面图像中存在行人的包围框的置信度对特征人员进行识别和追踪。例如,当根据监控画面图像中存在行人的包围框和监控画面图像中存在行人的包围框的置信度识别出监控画面中存在特定人员(失踪人口或者犯罪嫌疑人)时,可以通过天眼系统(该系统可以视为平安城市/视频监控系统的一部分)对该特定人员进行追踪。
在上述步骤1007中,可以先对候选框对应的特征进行卷积处理,然后再对卷积处理后得到的卷积特征图进行加权求和,然后再根据加权求和得到的特征图确定图像中存在行人的包围框和图像中存在行人的包围框的置信度。
具体地,如图16所示,上述步骤1007中确定图像中存在行人的包围框和图像中存在行人的包围框的置信度的具体过程可以包括以下步骤:
1007a、采用第二卷积网络对候选框对应的特征进行卷积处理,得到多个第二语义特征图;
1007b、采用回归器对多个第二语义特征进行处理,确定图像中存在行人的包围框的位置;
1007c、采用分类器对多个第二语义特征进行处理,得到图像中存在行人的包围框的置信度。
上述多个第二语义特征图分别表示从候选框对应的特征中提取的多个不同语义的特征图。上述第二卷积网络的卷积参数与所述第一卷积网络的卷积参数相同。具体地,在上述多个第二语义特征图中,任意两个第二语义特征图对应不同的语义。
例如,上述多个第二语义特征图由F 1、F 2和F 3组成,其中,F 1反映的是头部特征,F 2反映的是左手的特征,F 3反映的是腿部的特征,F 1、F 2和F 3各自反映的语义均不相同。
应理解,这里例子中的F 1、F 2和F 3与上文中举例介绍多个第一语义特征图时的F 1、 F 2和F 3不同,这里的F 1、F 2和F 3属于第二语义特征图,而上述举例介绍多个第一语义特征图时的F 1、F 2和F 3属于第一语义特征图。
上述图像中存在行人的包围框的位置和图像中存在行人的包围框的置信度可以是对图像进行行人检测的检测结果,可以称为图像的行人检测结果。
下面结合图24对上述步骤1007a至1007c的过程进行详细描述。
如图24所示,可以通过RCNN模块中的Conv5对候选框对应的特征进行卷积处理,得到多个第二语义特征图(C1至Ck),上述多个第二语义特征图中的每个特征图反映不同特征。例如,C 1反映的是头部特征,C 2反映的是左手的特征,C 3反映的是背景物的特征……在得到多个第二语义特征之后,可以采用平均池化模块(global average pooling,GAP)对每个第二语义特征图进行平均处理,得到特征图P 1至P k,其中,P 1是对特征图C 1进行平均处理后得到的,P k是对特征图C k进行平均处理得到的。
在得到特征图P 1至P k之后,可以在分类器中采用分类系数对特征图P 1至P k进行加权求和,得到候选框行的置信度。
具体地，可以根据公式(4)对特征图P_k进行加权求和，进而得到候选框的置信度：

$s=\sum_{k=1}^{K}w_{k}P_{k}$  (4)

其中s为候选框的置信度。
在上述公式(4)中,w k∈1*1为P k对应的分类器系数,在RCNN的分类器中,w k越大,说明P k对人的贡献越大,由于P k是C k的平均,因此,P k代表着C k,所以w k越大,说明C k对识别人的作用也就越大。因此,分类器中的权重系数对C k具有选择作用。正因为如此,在自激活模块中,采用与RCNN共用的分类器系数w k对高层语义特征图进行加权,才能形成物体可见性图。
最后,在框回归器中,采用类似的系数,得到更加准确的框的坐标(xmin,ymin,width,height)。
其中,(xmin,ymin)可以是框的左上角位置的坐标,W和H分别表示框的宽和高。另外,(xmin,ymin)还可以是框的中心点的位置,框的右上角位置/左下角位置/右下角位置。
在得到了候选框的位置和坐标之后,还可以对输出的所有的候选框进行非极大值抑制(non-maximize suppression,NMS)处理,从而把高度重叠的候选框进行合并,并且过滤掉置信度过低的候选框。从而输出反映行人检测结果,即图像中存在行人的包围框和图像中存在行人的包围框的置信度。
下面结合表1对本申请实施例的行人检测方法的效果进行说明。表1示出了本申请方案和现有方案在公开的数据集(CityPersons)上进行行人检测时的丢失率(miss rate),其中,传统方案1为自适应快速RCNN(adapted faster RCNN)方案,传统方案1是在2017年的IEEE国际计算机视觉与模式识别会议(IEEE conference on computer cision and pattern recognition,CVPR)中提出的,其中,IEEE为电气和电子工程师协会(institute of electrical and electronics engineers)。传统方案2为感知遮挡的RCNN(occlusion aware-CNN)方案,传统方案2是在2018年的欧洲计算机视觉国际会议(european conference on computer vision,ECCV)提出的。
另外,表1中的第二列表示的是对图像中的全部行人(包括被其他物体遮挡的行人和没有被其他物体遮挡的行人)进行行人检测时的丢失率,表1的第三列表示的是对图像中存在严重遮挡的行人进行行人检测时的丢失率。丢失率越低表示进行行人检测的性能越 好。
从表1中可以看出,本申请方案在两种场景下的丢失率均低于现有方案,尤其是在严重遮挡的场景下,本申请方案与传统方案相比,能够取得10%左右的性能增益,对效果的提升比较明显。
表1
行人检测方法 丢失率(全部) 丢失率(严重遮挡)
Adapted Faster RCNN(CVPR17) 43.86 50.47
OR-CNN(ECCV18) 40.19 51.43
本申请 39.26 41.14
上文结合附图对本申请实施例的行人检测方法进行了详细描述,下面结合附图对本申请实施例的行人检测装置进行详细的描述,应理解,下面描述的行人检测装置能够执行本申请实施例的行人检测方法的各个步骤,为了避免不必要的重复,下面在介绍本申请实施例的行人检测装置时适当省略重复的描述。
图25是本申请实施例的行人检测装置的示意性框图。图25所示的行人检测装置3000包括获取单元3001和处理单元3002。
获取单元3001和处理单元3002可以用于执行本申请实施例的行人检测方法,具体地,获取单元3001可以执行上述步骤1001,处理单元3002可以执行上述步骤1002至1007。
上述处理单元3002按照处理功能的不同可以分成多个模块,具体地,处理单元3002可以相当于图7所示的行人检测装置中的主干网络模块101、RPN模块102、自激活模块103、基础特征加权模块104、区域特征生成模块105、RCNN模块106以及输出模块107。处理单元3002能够实现图7所示的行人检测装置中的各个模块的功能。
图26是本申请实施例的行人检测装置的硬件结构示意图。图26所示的行人检测装置4000(该装置4000具体可以是一种计算机设备)包括存储器4001、处理器4002、通信接口4003以及总线4004。其中,存储器4001、处理器4002、通信接口4003通过总线4004实现彼此之间的通信连接。
存储器4001可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器4001可以存储程序,当存储器4001中存储的程序被处理器4002执行时,处理器4002用于执行本申请实施例的行人检测方法的各个步骤。
处理器4002可以采用通用的中央处理器(central processing unit,CPU),微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请方法实施例的行人检测方法。
处理器4002还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的行人检测方法的各个步骤可以通过处理器4002中的硬件的集成逻辑电路或者软件形式的指令完成。
上述处理器4002还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执 行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器4001,处理器4002读取存储器4001中的信息,结合其硬件完成本行人检测装置中包括的单元所需执行的功能,或者执行本申请方法实施例的行人检测方法。
通信接口4003使用例如但不限于收发器一类的收发装置,来实现装置4000与其他设备或通信网络之间的通信。例如,可以通过通信接口4003获取待处理图像。
总线4004可包括在装置4000各个部件(例如,存储器4001、处理器4002、通信接口4003)之间传送信息的通路。
图27是本申请实施例的神经网络训练装置的硬件结构示意图。与上述装置4000类似,图27所示的神经网络训练装置5000包括存储器5001、处理器5002、通信接口5003以及总线5004。其中,存储器5001、处理器5002、通信接口5003通过总线5004实现彼此之间的通信连接。
存储器5001可以是ROM,静态存储设备和RAM。存储器5001可以存储程序,当存储器5001中存储的程序被处理器5002执行时,处理器5002和通信接口5003用于执行本申请实施例的神经网络的训练方法的各个步骤。
处理器5002可以采用通用的,CPU,微处理器,ASIC,GPU或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的图像处理装置中的单元所需执行的功能,或者执行本申请方法实施例的神经网络的训练方法。
处理器5002还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请实施例的神经网络的训练方法的各个步骤可以通过处理器5002中的硬件的集成逻辑电路或者软件形式的指令完成。
上述处理器5002还可以是通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器5001,处理器5002读取存储器5001中的信息,结合其硬件完成本申请实施例的图像处理装置中包括的单元所需执行的功能,或者执行本申请方法实施例的神经网络的训练方法。
通信接口5003使用例如但不限于收发器一类的收发装置,来实现装置5000与其他设备或通信网络之间的通信。例如,可以通过通信接口5003获取待处理图像。
总线5004可包括在装置5000各个部件(例如,存储器5001、处理器5002、通信接口5003)之间传送信息的通路。
应注意,尽管上述装置4000和装置5000仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置4000和装置5000还可以包括实 现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置4000和装置5000还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置4000和装置5000也可仅仅包括实现本申请实施例所必须的器件,而不必包括图26和图27中所示的全部器件。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (20)

  1. 一种行人检测方法,其特征在于,包括:
    获取图像;
    对所述图像进行特征提取,得到所述图像的基础特征图;
    根据所述基础特征图确定所述图像的候选框,其中,所述候选框为所述图像中可能存在行人的区域的包围框;
    对所述图像的基础特征图进行处理,得到所述图像的物体可见度图,所述物体可见度图中的行人可见部分的像素值大于行人不可见部分的像素值;
    对所述图像的基础特征图和所述图像的物体可见度图进行融合处理,得到所述图像的增强特征图;
    根据所述图像的候选框和所述图像的增强特征图,确定所述候选框对应的特征,所述候选框对应的特征包括所述候选框的区域特征,所述候选框的区域特征是所述增强特征图中位于所述候选框内的区域的特征;
    根据所述候选框对应的特征确定所述图像中存在行人的包围框和所述图像中存在行人的包围框的置信度。
  2. 如权利要求1所述的方法,其特征在于,在所述物体可见度图中,所述行人不可见部分包括行人被遮挡部分。
  3. 如权利要求1或2所述的方法,其特征在于,在所述物体可见度图中,所述行人不可见部分包括所述图像的背景部分。
  4. 如权利要求1-3中任一项所述的方法,其特征在于,所述对所述图像的基础特征图进行处理,得到所述图像的物体可见度图,包括:
    采用第一卷积网络对所述图像的基础特征图进行卷积处理,得到多个第一语义特征图,所述多个第一语义特征图是从所述基础特征图全图中提取的多个不同语义的特征图;
    对所述多个第一语义特征图进行加权求和处理,得到所述图像的物体可见度图;
    所述根据所述候选框对应的特征确定所述图像中存在行人的包围框和所述图像中存在行人的包围框的置信度,包括:
    采用第二卷积网络对所述候选框对应的特征进行卷积处理,得到多个第二语义特征图,所述多个第二语义特征图分别表示从所述候选框对应的特征中提取的多个不同语义的特征图,所述第二卷积网络的卷积参数与所述第一卷积网络的卷积参数相同;
    采用回归器对所述多个第二语义特征进行处理,确定所述包围框的位置;
    采用分类器对所述多个第二语义特征进行处理,得到所述图像中存在行人的包围框的置信度,其中,在对所述多个第一语义特征进行加权求和处理时的加权系数为所述分类器中用于确定行人得分的权重系数。
  5. 如权利要求1-4中任一项所述的方法,其特征在于,所述候选框对应的特征还包括候选框的轮廓区域的区域特征,所述候选框的轮廓区域是所述候选框按照第一预设比例缩小后得到的缩小候选框与所述候选框之间形成的区域。
  6. 如权利要求5所述的方法,其特征在于,所述方法还包括:
    将所述候选框的区域特征中位于所述缩小候选框内的特征的取值置零,以得到所述候选框的轮廓区域的区域特征。
  7. 如权利要求1-6中任一项所述的方法,其特征在于,所述候选框对应的特征还包括所述候选框的背景区域的区域特征,所述候选框的背景区域是所述候选框按照第二预设比例扩大后得到的扩大候选框与所述候选框之间形成的区域。
  8. 如权利要求7所述的方法,其特征在于,所述方法还包括:
    获取第一区域的区域特征,所述第一区域的区域特征为所述物体可见度图中位于所述扩展候选框的区域内的区域特征;
    将所述第一区域的区域特征中位于所述候选框内的特征置零,得到所述候选框的背景区域的区域特征。
  9. 如权利要求1-8中任一项所述的方法,其特征在于,所述对所述图像进行特征提取,得到所述图像的基础特征图,包括:
    对所述图像进行卷积处理,得到所述图像的基础特征图。
  10. 一种行人检测装置,其特征在于,包括:
    获取单元,用于获取图像;
    处理单元,所述处理单元用于:
    对所述图像进行特征提取,得到所述图像的基础特征图;
    根据所述基础特征图确定所述图像的候选框,其中,所述候选框为所述图像中可能存在行人的区域的包围框;
    对所述图像的基础特征图进行处理,得到所述图像的物体可见度图,所述物体可见度图中的行人可见部分的像素值大于行人不可见部分的像素值;
    对所述图像的基础特征图和所述图像的物体可见度图进行融合处理,得到所述图像的增强特征图;
    根据所述图像的候选框和所述图像的增强特征图,确定所述候选框对应的特征,所述候选框对应的特征包括所述候选框的区域特征,所述候选框的区域特征是所述增强特征图中位于所述候选框内的区域的特征;
    根据所述候选框对应的特征确定所述图像中存在行人的包围框和所述图像中存在行人的包围框的置信度。
  11. 如权利要求10所述的装置,其特征在于,在所述物体可见度图中,所述行人不可见部分包括行人被遮挡部分。
  12. 如权利要求10或11所述的装置,其特征在于,在所述物体可见度图中,所述行人不可见部分包括所述图像的背景部分。
  13. 如权利要求10-12中任一项所述的装置,其特征在于,所述处理单元用于:
    采用第一卷积网络对所述图像的基础特征图进行卷积处理,得到多个第一语义特征图,所述多个第一语义特征图是从所述基础特征图全图中提取的多个不同语义的特征图;
    对所述多个第一语义特征图进行加权求和处理,得到所述图像的物体可见度图;
    所述根据所述候选框对应的特征确定所述图像中存在行人的包围框和所述图像中存在行人的包围框的置信度,包括:
    采用第二卷积网络对所述候选框对应的特征进行卷积处理,得到多个第二语义特征 图,所述多个第二语义特征图分别表示从所述候选框对应的特征中提取的多个不同语义的特征图,所述第二卷积网络的卷积参数与所述第一卷积网络的卷积参数相同;
    采用回归器对所述多个第二语义特征进行处理,确定所述包围框的位置;
    采用分类器对所述多个第二语义特征进行处理,得到所述图像中存在行人的包围框的置信度,其中,在对所述多个第一语义特征进行加权求和处理时的加权系数为所述分类器中用于确定行人得分的权重系数。
  14. 如权利要求10-13中任一项所述的装置,其特征在于,所述候选框对应的特征还包括候选框的轮廓区域的区域特征,所述候选框的轮廓区域是所述候选框按照第一预设比例缩小后得到的缩小候选框与所述候选框之间形成的区域。
  15. 如权利要求14所述的装置,其特征在于,所述处理单元还用于将所述候选框的区域特征中位于所述缩小候选框内的特征的取值置零,以得到所述候选框的轮廓区域的区域特征。
  16. 如权利要求10-15中任一项所述的装置,其特征在于,所述候选框对应的特征还包括所述候选框的背景区域的区域特征,所述候选框的背景区域是所述候选框按照第二预设比例扩大后得到的扩大候选框与所述候选框之间形成的区域。
  17. 如权利要求16所述的装置,其特征在于,所述处理单元还用于:
    获取第一区域的区域特征,所述第一区域的区域特征为所述物体可见度图中位于所述扩展候选框的区域内的区域特征;
    将所述第一区域的区域特征中位于所述候选框内的特征置零,得到所述候选框的背景区域的区域特征。
  18. 如权利要求10-17中任一项所述的装置,其特征在于,所述处理单元用于对所述图像进行卷积处理,得到所述图像的基础特征图。
  19. 一种计算机可读存储介质,其特征在于,所述计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行如权利要求1-9中任一项所述的方法。
  20. 一种芯片,其特征在于,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,以执行如权利要求1-9中任一项所述的方法。
PCT/CN2020/105020 2019-07-30 2020-07-28 行人检测方法、装置、计算机可读存储介质和芯片 WO2021018106A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2022506071A JP7305869B2 (ja) 2019-07-30 2020-07-28 歩行者検出方法及び装置、コンピュータ読み取り可能な記憶媒体並びにチップ
EP20848616.7A EP4006773A4 (en) 2019-07-30 2020-07-28 PEDESTRIAN DETECTION METHOD, APPARATUS, COMPUTER READABLE STORAGE MEDIA, AND CHIP
US17/586,136 US20220148328A1 (en) 2019-07-30 2022-01-27 Pedestrian detection method and apparatus, computer-readable storage medium, and chip

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910697411.3 2019-07-30
CN201910697411.3A CN112307826A (zh) 2019-07-30 2019-07-30 行人检测方法、装置、计算机可读存储介质和芯片

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/586,136 Continuation US20220148328A1 (en) 2019-07-30 2022-01-27 Pedestrian detection method and apparatus, computer-readable storage medium, and chip

Publications (1)

Publication Number Publication Date
WO2021018106A1 true WO2021018106A1 (zh) 2021-02-04

Family

ID=74229436

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105020 WO2021018106A1 (zh) 2019-07-30 2020-07-28 行人检测方法、装置、计算机可读存储介质和芯片

Country Status (5)

Country Link
US (1) US20220148328A1 (zh)
EP (1) EP4006773A4 (zh)
JP (1) JP7305869B2 (zh)
CN (1) CN112307826A (zh)
WO (1) WO2021018106A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469118A (zh) * 2021-07-20 2021-10-01 京东科技控股股份有限公司 多目标行人跟踪方法及装置、电子设备、存储介质
CN114067370A (zh) * 2022-01-17 2022-02-18 北京新氧科技有限公司 一种脖子遮挡检测方法、装置、电子设备及存储介质

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018206848A1 * 2018-05-03 2019-11-07 Robert Bosch Gmbh Method and device for determining a depth information image from an input image
JP7297705B2 (ja) * 2020-03-18 2023-06-26 株式会社東芝 Processing device, processing method, learning device, and program
CN111368937B (zh) * 2020-03-19 2024-05-28 京东方科技集团股份有限公司 Image classification method and apparatus, and training method, apparatus, device and medium therefor
CN115273154B (zh) * 2022-09-26 2023-01-17 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Thermal infrared pedestrian detection method and system based on edge reconstruction, and storage medium
CN116597387A (zh) * 2023-07-17 2023-08-15 建信金融科技有限责任公司 Exception handling method and apparatus, electronic device, and computer-readable medium
CN117079221B (zh) * 2023-10-13 2024-01-30 南方电网调峰调频发电有限公司工程建设管理分公司 Construction safety monitoring method and apparatus for underground works of a pumped-storage power station

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9443320B1 (en) * 2015-05-18 2016-09-13 Xerox Corporation Multi-object tracking with generic object proposals
CN107944369A (zh) * 2017-11-17 2018-04-20 大连大学 Pedestrian detection method based on a cascaded region proposal network and boosted random forest
CN108038409A (zh) * 2017-10-27 2018-05-15 江西高创保安服务技术有限公司 Pedestrian detection method
CN108664838A (zh) * 2017-03-27 2018-10-16 北京中科视维文化科技有限公司 End-to-end pedestrian detection method for surveillance scenes based on an improved RPN deep network

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310213B (zh) * 2012-03-07 2016-05-25 株式会社理光 Vehicle detection method and device
US9904852B2 (en) * 2013-05-23 2018-02-27 Sri International Real-time object detection, tracking and occlusion reasoning
US9965719B2 (en) * 2015-11-04 2018-05-08 Nec Corporation Subcategory-aware convolutional neural networks for object detection
US20170206426A1 (en) * 2016-01-15 2017-07-20 Ford Global Technologies, Llc Pedestrian Detection With Saliency Maps
JP6787196B2 (ja) * 2017-03-09 2020-11-18 コニカミノルタ株式会社 Image recognition device and image recognition method
JP6972757B2 (ja) * 2017-08-10 2021-11-24 富士通株式会社 Control program, control method, and information processing device
US10769500B2 (en) * 2017-08-31 2020-09-08 Mitsubishi Electric Research Laboratories, Inc. Localization-aware active learning for object detection
US9946960B1 (en) * 2017-10-13 2018-04-17 StradVision, Inc. Method for acquiring bounding box corresponding to an object in an image by using convolutional neural network including tracking network and computing device using the same
CN107909027B (zh) * 2017-11-14 2020-08-11 电子科技大学 Fast human target detection method with occlusion handling
CN108898047B (zh) * 2018-04-27 2021-03-19 中国科学院自动化研究所 Pedestrian detection method and system based on block-wise occlusion awareness
CN109753885B (zh) * 2018-12-14 2020-10-16 中国科学院深圳先进技术研究院 Target detection method and apparatus, and pedestrian detection method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4006773A4

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469118A (zh) * 2021-07-20 2021-10-01 京东科技控股股份有限公司 Multi-target pedestrian tracking method and apparatus, electronic device, and storage medium
CN113469118B (zh) * 2021-07-20 2024-05-21 京东科技控股股份有限公司 Multi-target pedestrian tracking method and apparatus, electronic device, and storage medium
CN114067370A (zh) * 2022-01-17 2022-02-18 北京新氧科技有限公司 Neck occlusion detection method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
JP2022542949A (ja) 2022-10-07
JP7305869B2 (ja) 2023-07-10
EP4006773A1 (en) 2022-06-01
CN112307826A (zh) 2021-02-02
US20220148328A1 (en) 2022-05-12
EP4006773A4 (en) 2022-10-05

Similar Documents

Publication Publication Date Title
WO2021018106A1 (zh) Pedestrian detection method and apparatus, computer-readable storage medium, and chip
WO2020253416A1 (zh) Object detection method and apparatus, and computer storage medium
EP3916628A1 (en) Object identification method and device
WO2021043168A1 (zh) Method for training pedestrian re-identification network, and pedestrian re-identification method and apparatus
WO2020177651A1 (zh) Image segmentation method and image processing apparatus
WO2021043112A1 (zh) Image classification method and apparatus
US20210398252A1 (en) Image denoising method and apparatus
US20230214976A1 (en) Image fusion method and apparatus and training method and apparatus for image fusion model
US20220148291A1 (en) Image classification method and apparatus, and image classification model training method and apparatus
WO2021063341A1 (zh) Image enhancement method and apparatus
CN112446834A (zh) Image enhancement method and apparatus
EP4109343A1 (en) Perception network architecture search method and device
CN111914997B (zh) Method for training a neural network, and image processing method and apparatus
CN111310604A (zh) Object detection method and apparatus, and storage medium
CN112529904A (zh) Image semantic segmentation method and apparatus, computer-readable storage medium, and chip
WO2021098441A1 (zh) Hand pose estimation method, apparatus and device, and computer storage medium
CN113066017A (zh) Image enhancement method, model training method, and device
CN113065645A (zh) Siamese attention network, and image processing method and apparatus
CN113011562A (zh) Model training method and apparatus
WO2024002211A1 (zh) Image processing method and related apparatus
CN112464930A (zh) Target detection network construction method, target detection method, apparatus, and storage medium
CN114764856A (zh) Image semantic segmentation method and image semantic segmentation apparatus
CN117157679A (zh) Perception network, perception network training method, object recognition method, and apparatus
CN113449550A (zh) Person re-identification data processing method, and person re-identification method and apparatus
CN112446835A (zh) Image restoration method, image restoration network training method, apparatus, and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20848616; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2022506071; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2020848616; Country of ref document: EP; Effective date: 20220228)