WO2022121075A1 - Positioning method, positioning device and electronic device for the human head-and-shoulders region - Google Patents

Positioning method, positioning device and electronic device for the human head-and-shoulders region Download PDF

Info

Publication number
WO2022121075A1
WO2022121075A1 (PCT/CN2021/070576)
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
head
shoulders
bounding box
fusion
Prior art date
Application number
PCT/CN2021/070576
Other languages
English (en)
French (fr)
Inventor
王金桥 (Wang Jinqiao)
赵朝阳 (Zhao Chaoyang)
赵旭 (Zhao Xu)
Original Assignee
中科视语(北京)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科视语(北京)科技有限公司 filed Critical 中科视语(北京)科技有限公司
Publication of WO2022121075A1 publication Critical patent/WO2022121075A1/zh
Priority to ZA2023/05848A priority Critical patent/ZA202305848B/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]

Definitions

  • the embodiments of the present invention relate to the technical fields of computer vision and pattern recognition, and in particular, to a positioning method, a positioning device and an electronic device for the head and shoulder region of a human body.
  • Human head-and-shoulders region localization, also known as head-and-shoulders detection, locates all human head-and-shoulders parts in an image or video frame in the form of rectangular bounding boxes.
  • Human head-and-shoulders region positioning has a wide range of application scenarios: in crowd counting, the number of head-and-shoulders bounding boxes can be counted to obtain accurate counts and crowd position-density information; in crowd behavior analysis, each head-and-shoulders region can be tracked across consecutive video frames to obtain the movement direction of individual pedestrians; and in monitoring rule violations in scenes such as vehicle cabins and construction sites, the located head-and-shoulders regions can be analyzed to determine whether the corresponding person is smoking, making phone calls, wearing a helmet improperly, and so on.
  • The head-and-shoulders positioning function often needs to be deployed on remote terminal devices with low computing power.
  • The detection method is therefore required to maintain sufficient accuracy while offering high execution efficiency and low resource occupation.
  • To achieve high efficiency, the related technology uses traditional image object detection methods, such as the ACF or DPM algorithm, to locate the head-and-shoulders region, but these perform poorly under occlusion, blur, dim light, and pose changes.
  • Alternatively, two-step object detection methods based on deep learning use a two-stage neural network to locate the head-and-shoulders region from coarse to fine, exploiting the powerful feature extraction ability of deep neural networks in image recognition, but their running efficiency is low and their resource consumption excessive.
  • The purpose of the embodiments of the present invention is to provide a positioning method, a positioning device and an electronic device for the human head-and-shoulders region, so as to meet the demands for high accuracy and low resource occupation that algorithms face when running on low-compute terminal devices.
  • In a first aspect, an embodiment of the present invention provides a method for locating the human head-and-shoulders region, including: convolving a target image with a convolutional neural network to obtain a reduced feature map; convolving the reduced feature map to obtain first, second and third feature maps with mutually different resolutions; performing multi-focal-length context processing on the three feature maps to obtain first, second and third fused feature maps; passing the fused feature maps through prediction convolution layers to obtain a probability and encoded bounding-box output values for each position; decoding the encoded values into bounding-box coordinates, combining them with the classification probability and aggregating all prediction layers to obtain a first positioning result; and filtering bounding boxes to obtain a second positioning result.
  • A final positioning result is obtained by performing non-maximum suppression on the second positioning result.
  • Obtaining the reduced feature map by convolving the target image through a convolutional neural network includes:
  • applying convolutional layers with a stride of 2 to the target image, successively halving the feature map to obtain the reduced feature map.
  • The convolutional layers of the convolutional neural network alternate between sparsely connected convolutional layers and ordinary convolutional layers; the sparsely connected convolutional layers have equal numbers of input and output channels, with connections only between input and output channels of the same index, and their convolution kernel weight matrix has size N × 3 × 3, where N is the number of channels.
  • The activation function of the convolutional neural network is the PQReLU function, whose formula is given as an equation image in the source;
  • x and y represent the input and output feature maps of the activation function, respectively, and p and q are learnable parameters.
  • Performing multi-focal-length context processing on the first, second and third feature maps respectively to obtain the first, second and third fused feature maps includes:
  • inputting the first, second and third feature maps respectively into a sparsely connected convolution structure with three equal-size convolution kernels to extract features of different fields of view, and then fusing those features together to obtain the fused feature maps.
  • In the sparsely connected convolution structure with three equal-size convolution kernels, one kernel forms a short-focal-length branch structure focusing on head-and-shoulders features, and the remaining two kernels form a long-focal-length branch focusing on contextual features around the head-and-shoulders region.
  • Passing the first, second and third fused feature maps through the prediction convolution layers respectively to obtain the probability and encoded bounding-box output values for each position includes:
  • determining the category and bounding-box coordinates of each pixel position on the feature maps output by the network by the following rules: map each pixel on the feature map onto the input image to obtain a coordinate point in the original image coordinate system;
  • if the coordinate point lies inside some head-and-shoulders region, the coordinate point is a positive sample and that region's bounding box is the ground-truth (GT) box matched to the coordinate point; otherwise it is a negative sample and matches no box;
  • x_c, y_c are the horizontal and vertical coordinates of the coordinate point obtained by mapping a pixel on the feature map onto the input image;
  • x_gt, y_gt, h_gt, w_gt are the horizontal and vertical coordinates of the center point of the matched GT box and its width and height values;
  • Δx, Δy, Δh, Δw are the encoded bounding-box coordinates that the network must output.
  • an embodiment of the present invention also provides a positioning device for the head and shoulders region of a human body, including:
  • the acquisition module is used to acquire the target image
  • The control processing module is used to: convolve the target image through a convolutional neural network to obtain a reduced feature map; convolve the reduced feature map to obtain first, second and third feature maps with mutually different resolutions; perform multi-focal-length context processing on the three feature maps respectively to obtain first, second and third fused feature maps; and pass the fused feature maps through the prediction convolution layers to obtain the probability and encoded bounding-box output values for each position.
  • The encoded bounding-box output value at each pixel position is decoded to obtain bounding-box coordinates in the image coordinate system and combined with the classification probability into a region localization vector; the bounding boxes of all pixel positions of all prediction layers are aggregated together to obtain a first positioning result; bounding-box filtering is performed on the first positioning result to obtain a second positioning result; and non-maximum suppression is performed on the second positioning result to obtain the final positioning result.
  • The control processing module is configured to apply convolutional layers with a stride of 2 to the target image, successively halving the feature map to obtain the reduced feature map.
  • The convolutional layers of the convolutional neural network alternate between sparsely connected convolutional layers and ordinary convolutional layers; the sparsely connected convolutional layers have equal numbers of input and output channels, with connections only between input and output channels of the same index, and their convolution kernel weight matrix has size N × 3 × 3, where N is the number of channels.
  • The activation function of the convolutional neural network is the PQReLU function, whose formula is given as an equation image in the source;
  • x and y represent the input and output feature maps of the activation function, respectively, and p and q are learnable parameters.
  • The control processing module is configured to input the first, second and third feature maps respectively into a sparsely connected convolution structure with three equal-size convolution kernels to extract features of different fields of view, and then fuse those features together to obtain the first, second and third fused feature maps;
  • In the sparsely connected convolution structure with three equal-size convolution kernels, one kernel forms a short-focal-length branch structure focusing on head-and-shoulders features, and the remaining two kernels form a long-focal-length branch focusing on contextual features around the head-and-shoulders region.
  • The control processing module is configured to apply two parallel convolution operations to each of the first, second and third fused feature maps, outputting the per-position probability and the encoded bounding-box output values respectively.
  • The category and bounding-box coordinates of each pixel position on the feature maps output by the network are determined by the following rules: map each pixel on the feature map onto the input image to obtain a coordinate point in the original image coordinate system;
  • if the coordinate point lies inside some head-and-shoulders region and that region is closer to the coordinate point than any other head-and-shoulders region, the coordinate point is a positive sample and that region's bounding box is the ground-truth (GT) box matched to the coordinate point; otherwise it is a negative sample and matches no box;
  • x_c, y_c are the horizontal and vertical coordinates of the coordinate point obtained by mapping a pixel on the feature map onto the input image;
  • x_gt, y_gt, h_gt, w_gt are the horizontal and vertical coordinates of the center point of the matched GT box and its width and height values;
  • Δx, Δy, Δh, Δw are the encoded bounding-box coordinates that the network must output.
  • An embodiment of the present invention further provides an electronic device, including: at least one processor and at least one memory; the memory is used to store one or more program instructions; and the processor is used to run the one or more program instructions so as to execute the method for locating the human head-and-shoulders region according to the first aspect.
  • An embodiment of the present invention further provides a computer-readable storage medium comprising one or more program instructions, the one or more program instructions being used to execute the method for locating the human head-and-shoulders region according to the first aspect.
  • The neural network generates the localization result directly from the input image in an end-to-end, single-step manner, which is more efficient than two-stage methods.
  • At the same time, by reasonably designing the structure of the neural network, the present invention makes the network lighter on the one hand and, on the other, enables it to extract head-and-shoulders features accurately and efficiently.
  • FIG. 1 is a flowchart of a method for locating a head and shoulders region of a human body according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a multi-focal-length context information fusion structure in an example of the present invention.
  • FIG. 3 is a structural block diagram of a device for positioning the head and shoulders region of a human body according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of a method for locating the head-and-shoulders region of a human body according to an embodiment of the present invention. As shown in FIG. 1, the method for locating the head-and-shoulders region of a human body according to an embodiment of the present invention includes:
  • S1: Convolve the target image through a convolutional neural network to obtain a reduced feature map.
  • this embodiment uses a small number of network layers at the input end of the network to quickly reduce the resolution of the feature map, so as to reduce the spatial range of the convolution kernel sliding and save the amount of computation.
  • The specific method is to use convolutional layers with a stride of 2 in the first few layers at the input end of the network, successively halving the feature map.
  • This embodiment does not use pooling layers to shrink the feature map, mainly because pooling causes a loss of detail information and is unsuitable for the structure of consecutive feature-map reductions used here.
  • each convolutional layer with stride 2 is followed by several convolutional layers with stride 1 to extract more semantic features for head and shoulders region localization.
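  • As an illustration of this downsampling stem, the following is a minimal sketch in Python/PyTorch; the framework, the channel widths, and the use of plain ReLU in place of the patent's PQReLU are all assumptions for readability, not details from the source:

```python
import torch.nn as nn

def make_stem(in_channels: int = 3, widths=(16, 32, 64)) -> nn.Sequential:
    """Stack stride-2 convolutions that successively halve the feature map,
    each followed by a stride-1 convolution that extracts more semantic
    features at the reduced resolution. No pooling layers are used."""
    layers, c_in = [], in_channels
    for c_out in widths:
        layers += [
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),   # halve H and W
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, kernel_size=3, stride=1, padding=1),  # refine semantics
            nn.ReLU(inplace=True),
        ]
        c_in = c_out
    return nn.Sequential(*layers)
```

With the illustrative widths above, an input of spatial size H × W leaves this stem at H/8 × W/8; the real network's depth and widths are not disclosed in the source.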
  • this embodiment designs a sparsely connected convolutional structure.
  • A conventional convolution connects every output channel to all input channels: if the numbers of input and output channels of the convolution operation are N and M respectively and the kernel size is 3, the kernel parameter matrix has size N × M × 3 × 3.
  • In contrast, in the sparsely connected convolution of this embodiment the numbers of input and output channels are equal and each kernel performs the convolution operation on only a single channel of the feature map, forming a sparsely connected structure.
  • With the channel settings above, the convolution kernel matrix has size N × 3 × 3, where N is the number of channels.
  • In addition, each sparsely connected convolution is followed by an ordinary convolutional layer whose kernel has spatial size 1 × 1 and whose weight matrix is N × M × 1 × 1, where N and M are the numbers of input and output channels respectively; this layer fuses information between different feature channels.
  • The structure of this embodiment can greatly reduce the parameter count and the amount of computation.
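  • A minimal sketch of this sparsely connected convolution followed by its 1 × 1 mixing layer, in PyTorch (an assumed framework); it coincides with what deep-learning libraries call a depthwise convolution plus a pointwise convolution:

```python
import torch.nn as nn

class SparseConvBlock(nn.Module):
    """Sparsely connected 3x3 convolution (each kernel sees a single
    channel, so the weight matrix is N x 3 x 3) followed by an ordinary
    1x1 convolution (weight matrix N x M x 1 x 1) that fuses information
    across channels."""
    def __init__(self, n_in: int, m_out: int):
        super().__init__()
        # groups=n_in restricts each kernel to one input channel
        self.sparse = nn.Conv2d(n_in, n_in, kernel_size=3, padding=1, groups=n_in)
        self.mix = nn.Conv2d(n_in, m_out, kernel_size=1)

    def forward(self, x):
        return self.mix(self.sparse(x))
```

With N = 64 and M = 128, the parameter count drops from N·M·3·3 = 73,728 for a standard 3 × 3 convolution to N·3·3 + N·M = 8,768 (ignoring biases), which illustrates the saving the text describes.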
  • the activation function is an essential element of modern neural networks.
  • Deep convolutional neural network methods generally use the ReLU function to avoid vanishing gradients during training, but ReLU loses information wherever the input is below 0.
  • On larger network models ReLU has a regularizing effect that avoids overfitting, whereas on a lightweight structure it limits model capacity and reduces accuracy.
  • The activation function of this embodiment, named PQReLU, is defined by a formula that appears as an equation image in the source; x and y denote the input and output feature maps of the activation function respectively, and p and q are learnable parameters whose values are determined during training of the convolutional neural network.
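  • The exact PQReLU formula exists only as an equation image in the source, so the following PyTorch sketch is a hypothetical reconstruction: it assumes a PReLU-like form in which the learnable scalars p and q act on the negative part so that no input range is zeroed out, which matches the stated motivation but may not be the patented formula:

```python
import torch
import torch.nn as nn

class PQReLU(nn.Module):
    """Hypothetical PQReLU: identity for positive inputs; for negative
    inputs, a learnable slope p and offset q preserve information that
    plain ReLU would discard. The true formula is not visible in the
    source text."""
    def __init__(self):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(0.25))  # initial values are arbitrary
        self.q = nn.Parameter(torch.tensor(0.0))

    def forward(self, x):
        return torch.where(x > 0, x, self.p * x + self.q)
```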
  • S2: Convolve the reduced feature map to obtain the first, second and third feature maps with mutually different resolutions.
  • In this embodiment, the deepest feature maps of the three groups whose resolutions are downsampled 8×, 16× and 32× relative to the original image are denoted P1, P2 and P3.
  • S3: Perform multi-focal-length context processing on the first feature map, the second feature map, and the third feature map respectively to obtain the first fused feature map, the second fused feature map, and the third fused feature map.
  • micro-sub-networks with different depths are used in the prediction layer to form a variety of feature maps with different receptive field ranges.
  • Through this structure, when making head-and-shoulders localization decisions the network can refer to the human-body region around the head and shoulders and to contextual information about the surrounding environment, yielding more accurate decisions.
  • FIG. 2 is a schematic diagram of the multi-focal-length context information fusion structure in an example of the present invention.
  • As shown in FIG. 2, the structure of this embodiment consists of sparsely connected convolutions with three equal-size kernels.
  • One kernel forms a short-focal-length branch focusing on head-and-shoulders features, and the other two kernels form a long-focal-length branch focusing on contextual features around the head-and-shoulders region.
  • The feature map input to this structure passes through the two branches to extract features of different fields of view, which are then fused together to obtain features that integrate multi-focal-length context information.
  • This embodiment predicts the location results of head and shoulders regions of different sizes on multiple network layers with different resolutions.
  • Therefore, the above context information fusion structure is added before the prediction convolution of each such layer; applying multi-focal-length context processing to P1, P2 and P3 yields the first, second and third fused feature maps, denoted Q1, Q2 and Q3.
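  • A minimal sketch of this multi-focal-length context fusion structure in PyTorch; the fusion operation (element-wise sum) and the use of 3 × 3 kernels are assumptions, since the text specifies only three equal-size sparsely connected kernels split into a short-focal branch and a two-kernel long-focal branch:

```python
import torch.nn as nn

class MultiFocalContextFusion(nn.Module):
    """One sparsely connected (depthwise) kernel forms the short-focal
    branch that focuses on head-and-shoulders features; two stacked
    kernels form the long-focal branch, whose larger receptive field
    captures context around the region. Branch outputs are fused."""
    def __init__(self, channels: int):
        super().__init__()
        def dw():  # sparsely connected 3x3 convolution
            return nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.short_focal = dw()
        self.long_focal = nn.Sequential(dw(), dw())

    def forward(self, x):
        # assumed fusion by element-wise sum; concatenation is also plausible
        return self.short_focal(x) + self.long_focal(x)
```

Applying such a module to P1, P2 and P3 would yield Q1, Q2 and Q3 in the notation used below.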
  • S4: Pass the first fused feature map, the second fused feature map, and the third fused feature map through the prediction convolution layers respectively to obtain the probability and the encoded bounding-box output values for each position.
  • Specifically, after the above context information fusion structure, the present invention uses two parallel convolution operations to output the classification probability values and the bounding-box encoding values respectively.
  • With this strategy the neural network can directly output head-and-shoulders localization results, forming an end-to-end structure in which the algorithm's computation all takes place within the neural network; this reduces algorithmic stages and speeds up operation, and on dedicated neural-network computing chips in particular it reduces memory interaction between different computing elements.
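  • The two parallel prediction convolutions might look as follows (a sketch; the kernel size, a single head-and-shoulders class, and the sigmoid on the classification output are assumptions):

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    """Two parallel convolutions applied to a fused feature map: one
    outputs the per-position classification probability c, the other
    the four encoded bounding-box values (dx, dy, dh, dw)."""
    def __init__(self, channels: int):
        super().__init__()
        self.cls = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.reg = nn.Conv2d(channels, 4, kernel_size=3, padding=1)

    def forward(self, x):
        return self.cls(x).sigmoid(), self.reg(x)
```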
  • To specify the category and bounding-box coordinates of each pixel position on the feature maps output by the network, the present invention designs the following rules:
  • map each pixel on the feature map onto the input image to obtain a coordinate point in the original image coordinate system;
  • if the coordinate point lies inside some head-and-shoulders region, the coordinate point is a positive sample and that region's bounding box is the Ground Truth (GT) box matched to the coordinate point; otherwise it is a negative sample and matches no box;
  • the GT encoding is obtained from the matched GT box; its Δx and Δy formulas are given as equation images in the source, while Δh = h_gt and Δw = w_gt.
  • x_c, y_c are the horizontal and vertical coordinates of the coordinate point obtained by mapping a pixel on the feature map onto the input image;
  • x_gt, y_gt, h_gt, w_gt are the horizontal and vertical coordinates of the center point of the matched GT box and its width and height values;
  • Δx, Δy, Δh, Δw are the encoded bounding-box coordinates that the network must output.
  • An algorithm structure based on neural networks must be trained on a certain number of samples with specific loss functions to produce useful functionality.
  • In the training phase, a sufficient number of images containing heads and shoulders must be collected, with each head-and-shoulders region annotated in the format (x_gt, y_gt, h_gt, w_gt).
  • During training, the classification output is supervised with the cross-entropy loss function and the localization branch is supervised with the Smooth L1 loss function.
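  • The training supervision could be sketched as follows in PyTorch (assumed framework); binary cross-entropy is used because there is a single head-and-shoulders class, and the unweighted sum of the two terms is an assumption:

```python
import torch.nn.functional as F

def detection_loss(cls_pred, cls_target, box_pred, box_target, pos_mask):
    """Cross-entropy supervises the classification output at every
    position; Smooth L1 supervises the encoded box values, evaluated
    only at positive positions (pos_mask)."""
    cls_loss = F.binary_cross_entropy(cls_pred, cls_target)
    box_loss = F.smooth_l1_loss(box_pred[pos_mask], box_target[pos_mask])
    return cls_loss + box_loss
```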
  • Q1, Q2 and Q3 are passed through the prediction convolution layers respectively to obtain the probability c_ij and the encoded bounding-box output values ΔB_ij at each position.
  • The subscript i denotes the index of the prediction layer and the subscript j denotes the pixel-position index.
  • S5: Decode the encoded bounding-box output values at each pixel position to obtain bounding-box coordinates in the image coordinate system, combine them with the classification probability into a (c, x, y, w, h) region localization vector, and aggregate the bounding boxes of all pixel positions of all prediction layers to obtain the first positioning result.
  • S6: Perform bounding-box filtering on the first positioning result, using a preset threshold θ to filter out bounding boxes with c < θ, to obtain the second positioning result.
  • S7: Apply non-maximum suppression, as used in object detection algorithms, to the second positioning result to obtain the final positioning result.
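  • Steps S5-S7 can be sketched end to end as follows (PyTorch and torchvision are assumed; the thresholds are illustrative, and the boxes are taken as already decoded to center-size form in image coordinates):

```python
import torch
from torchvision.ops import nms

def postprocess(scores, boxes_cxcywh, theta: float = 0.5, iou_thr: float = 0.5):
    """scores: (N,) classification probabilities pooled from all
    prediction layers; boxes_cxcywh: (N, 4) decoded boxes. Filters by
    the preset threshold theta (S6), then applies non-maximum
    suppression (S7) to produce the final localization result."""
    keep = scores >= theta
    scores, boxes = scores[keep], boxes_cxcywh[keep]
    cx, cy, w, h = boxes.unbind(dim=1)
    xyxy = torch.stack([cx - w / 2, cy - h / 2,
                        cx + w / 2, cy + h / 2], dim=1)  # corner format for NMS
    final = nms(xyxy, scores, iou_thr)
    return xyxy[final], scores[final]
```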
  • The neural network generates the localization result directly from the input image in an end-to-end, single-step manner, which is more efficient than the two-stage method.
  • At the same time, by reasonably designing the structure of the neural network, the present invention makes it lighter on the one hand and, on the other, enables accurate and efficient extraction of head-and-shoulders features.
  • FIG. 3 is a structural block diagram of a device for positioning the head and shoulders region of a human body according to an embodiment of the present invention.
  • The device for locating the head-and-shoulders region of the human body according to the embodiment of the present invention includes: an acquisition module 100 and a control processing module 200.
  • the acquisition module 100 is used for acquiring the target image.
  • The control processing module 200 is used to: convolve the target image through a convolutional neural network to obtain a reduced feature map; convolve the reduced feature map to obtain first, second and third feature maps with mutually different resolutions; perform multi-focal-length context processing on the first, second and third feature maps respectively to obtain first, second and third fused feature maps; pass the fused feature maps through the prediction convolution layers respectively to obtain the probability and encoded bounding-box output values for each position; decode the encoded bounding-box output value at each pixel position into bounding-box coordinates in the image coordinate system and combine them with the classification probability into a region localization vector, aggregating the bounding boxes of all pixel positions of all prediction layers to obtain a first positioning result; perform bounding-box filtering on the first positioning result to obtain a second positioning result; and perform non-maximum suppression on the second positioning result to obtain the final positioning result.
  • The control processing module 200 is configured to apply convolutional layers with a stride of 2 to the target image, successively halving the feature map to obtain the reduced feature map.
  • The convolutional layers of the convolutional neural network alternate between sparsely connected convolutional layers and ordinary convolutional layers; the sparsely connected convolutional layers have equal numbers of input and output channels, with connections only between input and output channels of the same index, and their convolution kernel weight matrix has size N × 3 × 3, where N is the number of channels.
  • The activation function of the convolutional neural network is the PQReLU function, whose formula is given as an equation image in the source;
  • x and y represent the input and output feature maps of the activation function, respectively, and p and q are learnable parameters.
  • The control processing module 200 is configured to input the first, second and third feature maps respectively into a sparsely connected convolution structure with three equal-size convolution kernels to extract features of different fields of view, which are then fused together to obtain the first, second and third fused feature maps.
  • In the sparsely connected convolution structure with three equal-size convolution kernels, one kernel forms a short-focal-length branch focusing on head-and-shoulders features, and the remaining two kernels form a long-focal-length branch focusing on contextual features around the head-and-shoulders region.
  • The control processing module 200 is configured to apply two parallel convolution operations to the first, second and third fused feature maps, outputting the per-position probability and the encoded bounding-box output values respectively.
  • The category and bounding-box coordinates of each pixel position on the feature maps output by the network are determined by the following rules: map each pixel on the feature map onto the input image to obtain a coordinate point in the original image coordinate system; if the coordinate point lies inside some head-and-shoulders region, it is a positive sample and that region's bounding box is the ground-truth (GT) box matched to the coordinate point, otherwise it is a negative sample and matches no box; the GT encoding is then obtained from the matched GT box, with the Δx and Δy formulas given as equation images in the source and Δh = h_gt, Δw = w_gt.
  • x_c, y_c are the horizontal and vertical coordinates of the coordinate point obtained by mapping a pixel on the feature map onto the input image;
  • x_gt, y_gt, h_gt, w_gt are the horizontal and vertical coordinates of the center point of the matched GT box and its width and height values;
  • Δx, Δy, Δh, Δw are the encoded bounding-box coordinates that the network must output.
  • The specific implementation of the positioning device for the human head-and-shoulders region of the embodiment of the present invention is similar to that of the positioning method for the human head-and-shoulders region of the embodiment of the present invention.
  • Refer to the description of the positioning method; to reduce redundancy, it is not repeated here.
  • An embodiment of the present invention further provides an electronic device, comprising: at least one processor and at least one memory; the memory is used to store one or more program instructions; and the processor is used to run the one or more program instructions so as to execute the method for locating the human head-and-shoulders region according to the first aspect.
  • The embodiments disclosed in the present invention provide a computer-readable storage medium storing computer program instructions that, when run on a computer, cause the computer to execute the above method for locating the human head-and-shoulders region.
  • the processor may be an integrated circuit chip, which has signal processing capability.
  • The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present invention may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the processor reads the information in the storage medium, and completes the steps of the above method in combination with its hardware.
  • the storage medium may be memory, eg, may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
  • The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synch-link DRAM (SLDRAM), and direct Rambus RAM (DRRAM).
  • the storage media described in the embodiments of the present invention are intended to include, but not be limited to, these and any other suitable types of memory.
  • the functions described in the present invention may be implemented by a combination of hardware and software.
  • the corresponding functions may be stored in or transmitted over as one or more instructions or code on a computer-readable medium.
  • Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present invention disclose a positioning method, a positioning device and an electronic device for the human head-and-shoulders region. The positioning method includes: convolving a target image with a convolutional neural network to obtain a reduced feature map; convolving again to obtain a first, second and third feature map; performing multi-focal-length context processing to obtain a first, second and third fused feature map; passing these through prediction convolution layers to obtain a probability and encoded bounding-box output values for each position; decoding the encoded bounding-box output values at each pixel position into bounding-box coordinates in the image coordinate system and combining them with the classification probabilities to obtain a first positioning result; filtering bounding boxes to obtain a second positioning result; and performing non-maximum suppression to obtain the final positioning result. The present invention produces positioning results directly from the input image and is more efficient than two-stage methods; at the same time, the lightweight neural network structure extracts head-and-shoulders features accurately and efficiently.

Description

Positioning method, positioning device and electronic device for the human head-and-shoulders region
This application claims priority to Chinese Patent Application No. 202011432151.6, filed on December 9, 2020 by 中科视语(北京)科技有限公司 and entitled "Positioning method, positioning device and electronic device for the human head-and-shoulders region", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to the technical fields of computer vision and pattern recognition, and in particular to a positioning method, a positioning device and an electronic device for the human head-and-shoulders region.
Background
Human head-and-shoulders region positioning, also known as head-and-shoulders detection, locates all human head-and-shoulders parts in an image or video frame in the form of rectangular bounding boxes. It has a wide range of application scenarios: in crowd counting, the number of head-and-shoulders bounding boxes can be counted to obtain accurate counts and crowd position-density information; in crowd behavior analysis, each head-and-shoulders region can be tracked across consecutive video frames to obtain the movement direction of individual pedestrians; and in monitoring rule violations in scenes such as vehicle cabins and construction sites, the located head-and-shoulders regions can be analyzed to determine whether the corresponding person is smoking, making phone calls, wearing a helmet improperly, and so on. The head-and-shoulders positioning function often needs to be deployed on remote terminal devices with low computing power, which requires the detection method to maintain sufficient accuracy while offering high execution efficiency and low resource occupation.
To achieve high efficiency, the related technology uses traditional image object detection methods, such as the ACF or DPM algorithm, to locate the head-and-shoulders region, but these perform poorly under occlusion, blur, dim light and pose changes. Alternatively, two-step object detection methods based on deep learning use a two-stage neural network to locate the head-and-shoulders region from coarse to fine, exploiting the powerful feature extraction capability of deep neural networks in image recognition, but their running efficiency is low and their resource consumption excessive.
Summary
The purpose of embodiments of the present invention is to provide a positioning method, a positioning device and an electronic device for the human head-and-shoulders region, so as to meet the demands for high accuracy and low resource occupation that algorithms face when running on low-compute terminal devices.
To achieve the above purpose, embodiments of the present invention mainly provide the following technical solutions:
In a first aspect, an embodiment of the present invention provides a positioning method for the human head-and-shoulders region, including:
convolving a target image with a convolutional neural network to obtain a reduced feature map;
convolving the reduced feature map to obtain a first feature map, a second feature map and a third feature map with mutually different resolutions;
performing multi-focal-length context processing on the first, second and third feature maps respectively to obtain a first fused feature map, a second fused feature map and a third fused feature map;
passing the first, second and third fused feature maps through prediction convolution layers respectively to obtain a probability and encoded bounding-box output values for each position;
decoding the encoded bounding-box output values at each pixel position to obtain bounding-box coordinates in the image coordinate system, combining them with the classification probability into a region localization vector, and aggregating the bounding boxes of all pixel positions of all prediction layers to obtain a first positioning result;
filtering the bounding boxes of the first positioning result to obtain a second positioning result;
performing non-maximum suppression on the second positioning result to obtain a final positioning result.
According to an embodiment of the present invention, convolving the target image with a convolutional neural network to obtain the reduced feature map includes:
applying convolutional layers with a stride of 2 to the target image, successively halving the feature map to obtain the reduced feature map.
According to an embodiment of the present invention, the convolutional layers of the convolutional neural network alternate between sparsely connected convolutional layers and ordinary convolutional layers; the sparsely connected convolutional layers have equal numbers of input and output channels, with connections only between input and output channels of the same index, and their convolution kernel weight matrix has size N×3×3, where N is the number of channels.
According to an embodiment of the present invention, the activation function of the convolutional neural network is:
[Equation image PCTCN2021070576-appb-000001: the PQReLU activation function]
where x and y denote the input and output feature maps of the activation function respectively, and p and q are learnable parameters.
According to an embodiment of the present invention, performing multi-focal-length context processing on the first, second and third feature maps respectively to obtain the first, second and third fused feature maps includes:
inputting the first, second and third feature maps respectively into a sparsely connected convolution structure with three equal-size convolution kernels to extract features of different fields of view, and fusing those features together to obtain the first, second and third fused feature maps;
wherein in the sparsely connected convolution structure with three equal-size convolution kernels, one kernel forms a short-focal-length branch structure focusing on head-and-shoulders features, and the remaining two kernels form a long-focal-length branch focusing on contextual features around the head-and-shoulders region.
According to an embodiment of the present invention, passing the first, second and third fused feature maps through prediction convolution layers respectively to obtain the probability and encoded bounding-box output values for each position includes:
applying two parallel convolution operations to the first, second and third fused feature maps to output the per-position probability and the encoded bounding-box output values respectively;
wherein the category and bounding-box coordinates of each pixel position on the feature maps output by the network are specified by the following rules:
map each pixel on the feature map onto the input image to obtain a coordinate point in the original image coordinate system;
if the coordinate point lies inside some head-and-shoulders region, the coordinate point is a positive sample and that region's bounding box is the ground-truth (GT) box matched to the coordinate point; otherwise it is a negative sample and matches no box;
the GT encoding obtainable from the matched GT box is computed as:
[Equation images PCTCN2021070576-appb-000002 and -000003: formulas for Δx and Δy]
Δh = h_gt
Δw = w_gt
where x_c, y_c are the horizontal and vertical coordinates of the coordinate point obtained by mapping a pixel on the feature map onto the input image; x_gt, y_gt, h_gt, w_gt are the horizontal and vertical coordinates of the center point of the matched GT box and its width and height values; and Δx, Δy, Δh, Δw are the encoded bounding-box coordinates that the network must output.
In a second aspect, an embodiment of the present invention further provides a positioning device for the human head-and-shoulders region, including:
an acquisition module for acquiring a target image;
a control processing module for: convolving the target image with a convolutional neural network to obtain a reduced feature map; convolving the reduced feature map to obtain first, second and third feature maps with mutually different resolutions; performing multi-focal-length context processing on the first, second and third feature maps respectively to obtain first, second and third fused feature maps; passing the first, second and third fused feature maps through prediction convolution layers respectively to obtain a probability and encoded bounding-box output values for each position; decoding the encoded bounding-box output values at each pixel position to obtain bounding-box coordinates in the image coordinate system, combining them with the classification probability into a region localization vector, and aggregating the bounding boxes of all pixel positions of all prediction layers to obtain a first positioning result; filtering the bounding boxes of the first positioning result to obtain a second positioning result; and performing non-maximum suppression on the second positioning result to obtain a final positioning result.
According to an embodiment of the present invention, the control processing module is configured to apply convolutional layers with a stride of 2 to the target image, successively halving the feature map to obtain the reduced feature map.
According to an embodiment of the present invention, the convolutional layers of the convolutional neural network alternate between sparsely connected convolutional layers and ordinary convolutional layers; the sparsely connected convolutional layers have equal numbers of input and output channels, with connections only between input and output channels of the same index, and their convolution kernel weight matrix has size N×3×3, where N is the number of channels.
According to an embodiment of the present invention, the activation function of the convolutional neural network is:
[Equation image PCTCN2021070576-appb-000004: the PQReLU activation function]
where x and y denote the input and output feature maps of the activation function respectively, and p and q are learnable parameters.
According to an embodiment of the present invention, the control processing module is configured to input the first, second and third feature maps respectively into a sparsely connected convolution structure with three equal-size convolution kernels to extract features of different fields of view, and to fuse those features together to obtain the first, second and third fused feature maps;
wherein in the sparsely connected convolution structure with three equal-size convolution kernels, one kernel forms a short-focal-length branch structure focusing on head-and-shoulders features, and the remaining two kernels form a long-focal-length branch focusing on contextual features around the head-and-shoulders region.
According to an embodiment of the present invention, the control processing module is configured to apply two parallel convolution operations to the first, second and third fused feature maps to output the per-position probability and the encoded bounding-box output values respectively;
wherein the category and bounding-box coordinates of each pixel position on the feature maps output by the network are determined by the following rules:
map each pixel on the feature map onto the input image to obtain a coordinate point in the original image coordinate system;
if the coordinate point lies inside some head-and-shoulders region and that region is closer to the coordinate point than any other head-and-shoulders region, the coordinate point is a positive sample and that region's bounding box is the ground-truth (GT) box matched to the coordinate point; otherwise it is a negative sample and matches no box;
the GT encoding obtainable from the matched GT box is computed as:
[Equation images PCTCN2021070576-appb-000005 and -000006: formulas for Δx and Δy]
Δh = h_gt
Δw = w_gt
where x_c, y_c are the horizontal and vertical coordinates of the coordinate point obtained by mapping a pixel on the feature map onto the input image; x_gt, y_gt, h_gt, w_gt are the horizontal and vertical coordinates of the center point of the matched GT box and its width and height values; and Δx, Δy, Δh, Δw are the encoded bounding-box coordinates that the network must output.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: at least one processor and at least one memory; the memory is used to store one or more program instructions; and the processor is used to run the one or more program instructions so as to execute the method for locating the human head-and-shoulders region according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium containing one or more program instructions, the one or more program instructions being used to execute the method for locating the human head-and-shoulders region according to the first aspect.
The technical solutions provided by embodiments of the present invention have at least the following advantages:
In the positioning method, positioning device and electronic device for the human head-and-shoulders region provided by embodiments of the present invention, the neural network produces the positioning result directly from the input image in an end-to-end, single-step manner, which is more efficient than two-stage methods. At the same time, by reasonably designing the structure of the neural network, the present invention makes it lighter on the one hand and, on the other, enables it to extract head-and-shoulders features accurately and efficiently.
Brief Description of the Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely exemplary, and those of ordinary skill in the art can derive other implementation drawings from the provided drawings without creative effort.
The structures, proportions, sizes and the like depicted in this specification are used only to accompany the content disclosed in the specification for the understanding and reading of those familiar with this technology, and are not intended to limit the conditions under which the present invention can be implemented; they therefore carry no substantive technical significance. Any structural modification, change of proportional relationship or adjustment of size shall still fall within the scope covered by the technical content disclosed by the present invention, provided it does not affect the effects the present invention can produce and the purposes it can achieve.
FIG. 1 is a flowchart of a positioning method for the human head-and-shoulders region according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a multi-focal-length context information fusion structure in an example of the present invention.
FIG. 3 is a structural block diagram of a positioning device for the human head-and-shoulders region according to an embodiment of the present invention.
Detailed Description
The following specific embodiments illustrate implementations of the present invention; those familiar with this technology can readily understand other advantages and effects of the present invention from the content disclosed in this specification.
In the following description, for purposes of explanation rather than limitation, specific details such as particular system structures, interfaces and technologies are set forth in order to provide a thorough understanding of the present invention. However, it should be clear to those skilled in the art that the present invention can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, circuits and methods are omitted so that unnecessary details do not obscure the description of the present invention.
In the description of the present invention, it should be understood that orientations or positional relationships indicated by terms such as "center", "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings, are intended only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore cannot be understood as limiting the present invention. In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "connected" and "connection" are to be understood broadly; for example, a connection may be direct or indirect through an intermediate medium. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present invention according to the specific circumstances.
FIG. 1 is a flowchart of a positioning method for the human head-and-shoulders region according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
S1: Convolve the target image with a convolutional neural network to obtain a reduced feature map.
Specifically, to reduce the computational complexity of the network, this embodiment uses a small number of layers at the input end of the network to rapidly lower the resolution of the feature map, thereby shrinking the spatial range over which convolution kernels slide and saving computation. The specific approach is to use convolutional layers with a stride of 2 in the first few layers at the input end, successively halving the feature map. Unlike common convolutional neural network structures, this embodiment does not use pooling layers to shrink the feature map, mainly because pooling causes a loss of detail information and is unsuitable for the structure of consecutive feature-map reductions used here. After these consecutive reduction layers, each stride-2 convolutional layer is followed by several stride-1 convolutional layers to extract more semantic features for head-and-shoulders localization.
In the arrangement of convolution connections, to reduce computational complexity this embodiment designs a sparsely connected convolution structure. A conventional convolution applies kernels across all input channels: if the numbers of input and output channels are N and M respectively and the kernel size is 3, the kernel parameter matrix has size N×M×3×3. Unlike conventional convolution, in the sparsely connected convolution of this embodiment the numbers of input and output channels are equal and each kernel performs the convolution operation on only a single channel of the feature map, forming a sparsely connected structure; with the channel and kernel settings above, the kernel matrix has size N×3×3, where N is the number of channels. In addition, each sparsely connected convolution is followed by an ordinary convolutional layer whose kernel has spatial size 1×1 and whose weight matrix is N×M×1×1, where N and M are the numbers of input and output channels, so as to fuse information between different feature channels. The structure of this embodiment can greatly reduce the parameter count and the amount of computation.
The activation function is an essential element of modern neural networks. Deep convolutional neural network methods generally use the ReLU function to avoid vanishing gradients during training, but ReLU loses information wherever the input is below 0. On larger network models ReLU has a regularizing effect that avoids overfitting, whereas on a lightweight structure it limits model capacity and reduces accuracy. The specific formula of the activation function of this embodiment is:
[Equation image PCTCN2021070576-appb-000007: the PQReLU activation function]
where PQReLU is the name of the activation function of this embodiment, x and y denote its input and output feature maps respectively, and p and q are learnable parameters whose values are determined during training of the convolutional neural network.
S2: Convolve the reduced feature map to obtain first, second and third feature maps with mutually different resolutions. In this embodiment, the deepest feature maps of the three groups whose resolutions are downsampled 8×, 16× and 32× relative to the original image are denoted P1, P2 and P3.
S3: Perform multi-focal-length context processing on the first, second and third feature maps respectively to obtain first, second and third fused feature maps.
Specifically, this embodiment uses micro sub-networks of different depths in the prediction layers to form feature maps with a variety of receptive-field ranges. With this structure, when making head-and-shoulders localization decisions the network can refer to the human-body region around the head and shoulders and to contextual information about the surrounding environment, yielding more accurate decisions.
FIG. 2 is a schematic diagram of the multi-focal-length context information fusion structure in an example of the present invention. As shown in FIG. 2, the structure of this embodiment consists of sparsely connected convolutions with three equal-size kernels: one kernel forms a short-focal-length branch focusing on head-and-shoulders features, and the other two kernels form a long-focal-length branch focusing on contextual features around the head-and-shoulders region. The feature map input to this structure passes through the two branches to extract features of different fields of view, which are then fused together to obtain features integrating multi-focal-length context information. This embodiment predicts localization results for head-and-shoulders regions of different sizes on multiple network layers with different resolutions; therefore, the above context information fusion structure is added before the prediction convolution of each such layer, and applying multi-focal-length context processing to P1, P2 and P3 yields the first, second and third fused feature maps, denoted Q1, Q2 and Q3.
S4: Pass the first, second and third fused feature maps through prediction convolution layers respectively to obtain a probability and encoded bounding-box output values for each position.
Specifically, after the above context information fusion structure the present invention uses two parallel convolution operations to output the classification probability values and the bounding-box encoding values respectively. With this strategy the neural network can directly output head-and-shoulders localization results, forming an end-to-end structure in which the algorithm's computation all takes place within the neural network; this reduces algorithmic stages and speeds up operation, and on dedicated neural-network computing chips in particular it reduces memory interaction between different computing elements. To specify the category and bounding-box coordinates of each pixel position on the feature maps output by the network, the present invention designs the following rules:
map each pixel on the feature map onto the input image to obtain a coordinate point in the original image coordinate system;
if the coordinate point lies inside some head-and-shoulders region, the coordinate point is a positive sample and that region's bounding box is the Ground Truth (GT) box matched to the coordinate point; otherwise it is a negative sample and matches no box;
the GT encoding obtainable from the matched GT box is computed as:
[Equation images PCTCN2021070576-appb-000008 and -000009: formulas for Δx and Δy]
Δh = h_gt
Δw = w_gt
where x_c, y_c are the horizontal and vertical coordinates of the coordinate point obtained by mapping a pixel on the feature map onto the input image; x_gt, y_gt, h_gt, w_gt are the horizontal and vertical coordinates of the center point of the matched GT box and its width and height values; and Δx, Δy, Δh, Δw are the encoded bounding-box coordinates that the network must output.
An algorithm structure based on neural networks must be trained on a certain number of samples with specific loss functions to produce useful functionality. In the training phase of this embodiment, a sufficient number of images containing heads and shoulders are collected, and the head-and-shoulders regions are annotated in the format (x_gt, y_gt, h_gt, w_gt). During training, the classification output is supervised with the cross-entropy loss function and the localization branch with the Smooth L1 loss function.
In this embodiment, Q1, Q2 and Q3 are passed through the prediction convolution layers respectively to obtain the probability c_ij and the encoded bounding-box output values ΔB_ij at each position, where subscript i denotes the index of the prediction layer and subscript j denotes the pixel-position index.
S5: Decode the encoded bounding-box output values at each pixel position to obtain bounding-box coordinates in the image coordinate system, combine them with the classification probability into a (c, x, y, w, h) region localization vector, and aggregate the bounding boxes of all pixel positions of all prediction layers to obtain the first positioning result.
S6: Filter the bounding boxes of the first positioning result, removing boxes with c < θ using a preset threshold θ, to obtain the second positioning result.
S7: Apply non-maximum suppression, as used in object detection algorithms, to the second positioning result to obtain the final positioning result.
In the positioning method for the human head-and-shoulders region provided by embodiments of the present invention, the neural network produces the positioning result directly from the input image in an end-to-end, single-step manner, which is more efficient than two-stage methods. At the same time, by reasonably designing the structure of the neural network, the present invention makes it lighter on the one hand and, on the other, enables it to extract head-and-shoulders features accurately and efficiently.
FIG. 3 is a structural block diagram of a positioning device for the human head-and-shoulders region according to an embodiment of the present invention. As shown in FIG. 3, the positioning device of the embodiment of the present invention includes: an acquisition module 100 and a control processing module 200.
The acquisition module 100 is used for acquiring a target image.
The control processing module 200 is used to: convolve the target image with a convolutional neural network to obtain a reduced feature map; convolve the reduced feature map to obtain first, second and third feature maps with mutually different resolutions; perform multi-focal-length context processing on the first, second and third feature maps respectively to obtain first, second and third fused feature maps; pass the fused feature maps through prediction convolution layers respectively to obtain a probability and encoded bounding-box output values for each position; decode the encoded bounding-box output values at each pixel position to obtain bounding-box coordinates in the image coordinate system, combine them with the classification probability into a region localization vector, and aggregate the bounding boxes of all pixel positions of all prediction layers to obtain a first positioning result; filter the bounding boxes of the first positioning result to obtain a second positioning result; and perform non-maximum suppression on the second positioning result to obtain a final positioning result.
In an embodiment of the present invention, the control processing module 200 is configured to apply convolutional layers with a stride of 2 to the target image, successively halving the feature map to obtain the reduced feature map.
In an embodiment of the present invention, the convolutional layers of the convolutional neural network alternate between sparsely connected convolutional layers and ordinary convolutional layers; the sparsely connected convolutional layers have equal numbers of input and output channels, with connections only between input and output channels of the same index, and their convolution kernel weight matrix has size N×3×3, where N is the number of channels.
In an embodiment of the present invention, the activation function of the convolutional neural network is:
[Equation image PCTCN2021070576-appb-000010: the PQReLU activation function]
where x and y denote the input and output feature maps of the activation function respectively, and p and q are learnable parameters.
In an embodiment of the present invention, the control processing module 200 is configured to input the first, second and third feature maps respectively into a sparsely connected convolution structure with three equal-size convolution kernels to extract features of different fields of view, and to fuse those features together to obtain the first, second and third fused feature maps. In this structure, one kernel forms a short-focal-length branch focusing on head-and-shoulders features, and the remaining two kernels form a long-focal-length branch focusing on contextual features around the head-and-shoulders region.
In an embodiment of the present invention, the control processing module 200 is configured to apply two parallel convolution operations to the first, second and third fused feature maps to output the per-position probability and the encoded bounding-box output values respectively. The category and bounding-box coordinates of each pixel position on the feature maps output by the network are determined by the following rules: map each pixel on the feature map onto the input image to obtain a coordinate point in the original image coordinate system; if the coordinate point lies inside some head-and-shoulders region, it is a positive sample and that region's bounding box is the ground-truth (GT) box matched to the coordinate point, otherwise it is a negative sample and matches no box; the GT encoding obtainable from the matched GT box is computed as:
[Equation images PCTCN2021070576-appb-000011 and -000012: formulas for Δx and Δy]
Δh = h_gt
Δw = w_gt
where x_c, y_c are the horizontal and vertical coordinates of the coordinate point obtained by mapping a pixel on the feature map onto the input image; x_gt, y_gt, h_gt, w_gt are the horizontal and vertical coordinates of the center point of the matched GT box and its width and height values; and Δx, Δy, Δh, Δw are the encoded bounding-box coordinates that the network must output.
It should be noted that the specific implementation of the positioning device for the human head-and-shoulders region of the embodiment of the present invention is similar to that of the positioning method for the human head-and-shoulders region of the embodiment of the present invention; refer to the description of the positioning method. To reduce redundancy, it is not repeated here.
In addition, other components and functions of the positioning device for the human head-and-shoulders region of the embodiment of the present invention are known to those skilled in the art; to reduce redundancy, they are not described here.
An embodiment of the present invention further provides an electronic device, including: at least one processor and at least one memory; the memory is used to store one or more program instructions; and the processor is used to run the one or more program instructions so as to execute the positioning method for the human head-and-shoulders region according to the first aspect.
Embodiments disclosed in the present invention provide a computer-readable storage medium storing computer program instructions that, when run on a computer, cause the computer to execute the above positioning method for the human head-and-shoulders region.
In embodiments of the present invention, the processor may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The methods, steps and logical block diagrams disclosed in embodiments of the present invention can thus be implemented or executed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with embodiments of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules within a decoding processor. The software module may reside in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The processor reads the information in the storage medium and completes the steps of the above methods in combination with its hardware.
The storage medium may be a memory, for example a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory.
The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synch-link DRAM (SLDRAM) and direct Rambus RAM (DRRAM).
The storage media described in embodiments of the present invention are intended to include, without being limited to, these and any other suitable types of memory.
Those skilled in the art should realize that, in one or more of the above examples, the functions described in the present invention may be implemented by a combination of hardware and software. When software is applied, the corresponding functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium accessible by a general-purpose or special-purpose computer.
Although the present invention has been described in detail above with general descriptions and specific embodiments, modifications or improvements may be made on this basis, as will be obvious to those skilled in the art. Accordingly, such modifications or improvements made without departing from the spirit of the present invention all fall within the scope of protection claimed by the present invention.

Claims (9)

  1. A method for locating the human head-and-shoulders region, characterized by comprising:
    convolving a target image with a convolutional neural network to obtain a reduced feature map;
    convolving the reduced feature map to obtain a first feature map, a second feature map and a third feature map with mutually different resolutions;
    performing multi-focal-length context processing on the first, second and third feature maps respectively to obtain a first fused feature map, a second fused feature map and a third fused feature map;
    passing the first, second and third fused feature maps through prediction convolution layers respectively to obtain a probability and encoded bounding-box output values for each position;
    decoding the encoded bounding-box output values at each pixel position to obtain bounding-box coordinates in the image coordinate system, combining them with the classification probability into a region localization vector, and aggregating the bounding boxes of all pixel positions of all prediction layers to obtain a first positioning result;
    filtering the bounding boxes of the first positioning result to obtain a second positioning result;
    performing non-maximum suppression on the second positioning result to obtain a final positioning result.
  2. The method for locating the human head-and-shoulders region according to claim 1, characterized in that convolving the target image with a convolutional neural network to obtain the reduced feature map comprises:
    applying convolutional layers with a stride of 2 to the target image, successively halving the feature map to obtain the reduced feature map.
  3. The method for locating the human head-and-shoulders region according to claim 1, characterized in that the convolutional layers of the convolutional neural network alternate between sparsely connected convolutional layers and ordinary convolutional layers; the sparsely connected convolutional layers have equal numbers of input and output channels, with connections only between input and output channels of the same index, and their convolution kernel weight matrix has size N×3×3, where N is the number of channels.
  4. The method for locating the human head-and-shoulders region according to claim 1, characterized in that the activation function of the convolutional neural network is:
    [Equation image PCTCN2021070576-appb-100001: the PQReLU activation function]
    where x and y denote the input and output feature maps of the activation function respectively, and p and q are learnable parameters.
  5. The method for locating the human head-and-shoulders region according to claim 1, characterized in that performing multi-focal-length context processing on the first, second and third feature maps respectively to obtain the first, second and third fused feature maps comprises:
    inputting the first, second and third feature maps respectively into a sparsely connected convolution structure with three equal-size convolution kernels to extract features of different fields of view, and fusing those features together to obtain the first, second and third fused feature maps;
    wherein in the sparsely connected convolution structure with three equal-size convolution kernels, one kernel forms a short-focal-length branch structure focusing on head-and-shoulders features, and the remaining two kernels form a long-focal-length branch focusing on contextual features around the head-and-shoulders region.
  6. The method for locating the human head-and-shoulders region according to claim 1, characterized in that passing the first, second and third fused feature maps through prediction convolution layers respectively to obtain the probability and encoded bounding-box output values for each position comprises:
    applying two parallel convolution operations to the first, second and third fused feature maps to output the per-position probability and the encoded bounding-box output values respectively;
    wherein the category and bounding-box coordinates of each pixel position on the feature maps output by the network are determined by the following rules:
    map each pixel on the feature map onto the input image to obtain a coordinate point in the original image coordinate system;
    if the coordinate point lies inside some head-and-shoulders region and that region is closer to the coordinate point than any other head-and-shoulders region, the coordinate point is a positive sample and that region's bounding box is the ground-truth (GT) box matched to the coordinate point; otherwise it is a negative sample and matches no box;
    the GT encoding obtainable from the matched GT box is computed as:
    [Equation images PCTCN2021070576-appb-100002 and -100003: formulas for Δx and Δy]
    Δh = h_gt
    Δw = w_gt
    where x_c, y_c are the horizontal and vertical coordinates of the coordinate point obtained by mapping a pixel on the feature map onto the input image; x_gt, y_gt, h_gt, w_gt are the horizontal and vertical coordinates of the center point of the matched GT box and its width and height values; and Δx, Δy, Δh, Δw are the encoded bounding-box coordinates that the network must output.
  7. A positioning device for the human head-and-shoulders region, characterized by comprising:
    an acquisition module for acquiring a target image;
    a control processing module for: convolving the target image with a convolutional neural network to obtain a reduced feature map; convolving the reduced feature map to obtain a first feature map, a second feature map and a third feature map with mutually different resolutions; performing multi-focal-length context processing on the first, second and third feature maps respectively to obtain a first fused feature map, a second fused feature map and a third fused feature map; passing the first, second and third fused feature maps through prediction convolution layers respectively to obtain a probability and encoded bounding-box output values for each position; decoding the encoded bounding-box output values at each pixel position to obtain bounding-box coordinates in the image coordinate system, combining them with the classification probability into a region localization vector, and aggregating the bounding boxes of all pixel positions of all prediction layers to obtain a first positioning result; filtering the bounding boxes of the first positioning result to obtain a second positioning result; and performing non-maximum suppression on the second positioning result to obtain a final positioning result.
  8. An electronic device, characterized in that the electronic device comprises: at least one processor and at least one memory;
    the memory is used to store one or more program instructions;
    the processor is used to run the one or more program instructions so as to execute the method for locating the human head-and-shoulders region according to any one of claims 1-6.
  9. A computer-readable storage medium, characterized in that the computer-readable storage medium contains one or more program instructions, the one or more program instructions being used to execute the method for locating the human head-and-shoulders region according to any one of claims 1-6.
PCT/CN2021/070576 2020-12-09 2021-01-07 Positioning method, positioning device and electronic device for the human head-and-shoulders region WO2022121075A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
ZA2023/05848A ZA202305848B (en) 2020-12-09 2023-05-31 Positioning method, positioning apparatus and electronic device for human head and shoulders area

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011432151.6 2020-12-09
CN202011432151.6A CN112507872B (zh) Positioning method, positioning device and electronic device for the human head-and-shoulders region

Publications (1)

Publication Number Publication Date
WO2022121075A1 (zh)

Family

ID=74970266

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/070576 WO2022121075A1 (zh) Positioning method, positioning device and electronic device for the human head-and-shoulders region

Country Status (3)

Country Link
CN (1) CN112507872B (zh)
WO (1) WO2022121075A1 (zh)
ZA (1) ZA202305848B (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139484B * 2021-04-28 2023-07-11 上海商汤科技开发有限公司 Crowd positioning method and apparatus, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150169957A1 (en) * 2013-12-16 2015-06-18 Samsung Electronics Co., Ltd. Method and apparatus for detecting object of image
CN104751491A (zh) * 2015-04-10 2015-07-01 中国科学院宁波材料技术与工程研究所 Crowd tracking and people-flow statistics method and apparatus
CN106845406A (zh) * 2017-01-20 2017-06-13 深圳英飞拓科技股份有限公司 Head-and-shoulders detection method and apparatus based on multi-task cascaded convolutional neural networks
CN110287849A (zh) * 2019-06-20 2019-09-27 北京工业大学 Lightweight deep-network image object detection method suitable for Raspberry Pi

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874855A (zh) * 2017-01-19 2017-06-20 博康智能信息技术有限公司北京海淀分公司 Head-and-shoulders region positioning method and apparatus
CN108416250B (zh) * 2017-02-10 2021-06-22 浙江宇视科技有限公司 People counting method and apparatus
CN110021034A (zh) * 2019-03-20 2019-07-16 华南理工大学 Tracking and recorded-broadcast method and system based on head-and-shoulders detection
CN110729045A (zh) * 2019-10-12 2020-01-24 闽江学院 Tongue image segmentation method based on a context-aware residual network
CN110852270B (zh) * 2019-11-11 2024-03-15 中科视语(北京)科技有限公司 Hybrid-grammar human parsing method and apparatus based on deep learning
CN111598112B (zh) * 2020-05-18 2023-02-24 中科视语(北京)科技有限公司 Multi-task object detection method and apparatus, electronic device and storage medium
CN111612017B (zh) * 2020-07-07 2021-01-29 中国人民解放军国防科技大学 Object detection method based on information enhancement
CN111783754B (zh) * 2020-09-04 2020-12-08 中国科学院自动化研究所 Human attribute image classification method, system and apparatus based on part context
CN112434612A (zh) * 2020-11-25 2021-03-02 创新奇智(上海)科技有限公司 Smoking detection method and apparatus, electronic device and computer-readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150169957A1 (en) * 2013-12-16 2015-06-18 Samsung Electronics Co., Ltd. Method and apparatus for detecting object of image
CN104751491A (zh) * 2015-04-10 2015-07-01 中国科学院宁波材料技术与工程研究所 Crowd tracking and people-flow statistics method and apparatus
CN106845406A (zh) * 2017-01-20 2017-06-13 深圳英飞拓科技股份有限公司 Head-and-shoulders detection method and apparatus based on multi-task cascaded convolutional neural networks
CN110287849A (zh) * 2019-06-20 2019-09-27 北京工业大学 Lightweight deep-network image object detection method suitable for Raspberry Pi

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIN XIN; ZHAO XU; ZHAO CHAOYANG; XU HUAZHONG; WANG JINQIAO: "Real-Time Crowd Counting for Embedded Systems With High Accuracy", Gaojishu Tongxun (High Technology Letters), vol. 30, no. 1, 31 January 2020, pp. 32-40, ISSN: 1002-0470, DOI: 10.3772/j.issn.1002-0470.2020.01.004 *
JIN, XIN: "Research and Achievement of Efficient People Counting Algorithm Based on Deep Learning", Information Science and Technology, Chinese Master's Theses Full-Text Database, no. 7, 1 April 2019, pp. 1-58, DOI: 10.27381/d.cnki.gwlgu.2019.000292 *

Also Published As

Publication number Publication date
CN112507872B (zh) 2021-12-28
CN112507872A (zh) 2021-03-16
ZA202305848B (en) 2023-12-20

Similar Documents

Publication Publication Date Title
CN110222717B (zh) Image processing method and apparatus
CN114202672A (zh) Small object detection method based on an attention mechanism
WO2021018106A1 (zh) Pedestrian detection method and apparatus, computer-readable storage medium and chip
WO2020177607A1 (zh) Image denoising method and apparatus
JP2022515895A (ja) Object recognition method and apparatus
CN110163188B (zh) Method, apparatus and device for video processing and embedding a target object in video
CN110222718B (zh) Image processing method and apparatus
WO2022001372A1 (zh) Neural network training method, image processing method and apparatus
CN113011562A (zh) Model training method and apparatus
CN112529904A (zh) Image semantic segmentation method and apparatus, computer-readable storage medium and chip
CN115239581A (zh) Image processing method and related apparatus
JP2022545962A (ja) Patchy fog recognition method and apparatus, electronic device, storage medium and computer program product
CN110705564B (zh) Image recognition method and apparatus
WO2022121075A1 (zh) Positioning method, positioning device and electronic device for the human head-and-shoulders region
CN116740439A (zh) Crowd counting method based on a cross-scale pyramid Transformer
CN116052026A (zh) UAV aerial image object detection method, system and storage medium
Wang et al. Object counting in video surveillance using multi-scale density map regression
Guo et al. Scale region recognition network for object counting in intelligent transportation system
CN117157679A (zh) Perception network, training method for a perception network, object recognition method and apparatus
CN116229406B (zh) Lane line detection method and system, electronic device and storage medium
Ren et al. A lightweight object detection network in low-light conditions based on depthwise separable pyramid network and attention mechanism on embedded platforms
CN117237867A (zh) Adaptive scene-surveillance video object detection method and system based on feature fusion
TWI826160B (zh) Image encoding and decoding method and apparatus
CN114511798B (zh) Transformer-based driver distraction detection method and apparatus
EP4296896A1 (en) Perception network and data processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901794

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901794

Country of ref document: EP

Kind code of ref document: A1