WO2023207163A1 - Object detection model and method for detecting an object occupying a fire escape route, and use thereof - Google Patents

Object detection model and method for detecting an object occupying a fire escape route, and use thereof

Info

Publication number
WO2023207163A1
Authority
WO
WIPO (PCT)
Prior art keywords
transposed
features
layer
bottleneck residual
scale
Prior art date
Application number
PCT/CN2022/141284
Other languages
English (en)
Chinese (zh)
Inventor
沈瑶
张香伟
毛云青
曹喆
梁艺蕾
Original Assignee
城云科技(中国)有限公司
Priority date
Filing date
Publication date
Application filed by 城云科技(中国)有限公司
Publication of WO2023207163A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Definitions

  • This application relates to the field of target detection, and in particular to target detection models, methods and applications for detecting fire escape occupancy targets.
  • Current detection algorithms are not friendly to small objects, which is reflected in the following four aspects: 1. Excessive down-sampling rate: assume the current small object is 15*15 pixels and the convolutional down-sampling rate of a general object detector is 16; such a large down-sampling rate on the feature map means the small object cannot even occupy a single pixel. 2. Excessively large receptive field: in a convolutional network, the receptive field of a point on the feature map is much larger than the down-sampling rate, so at any single point a small object contributes only a few features while a large amount of surrounding context is mixed in, which degrades its detection result. 3.
  • the current convolutional neural network still faces many problems in actual design and use, mainly reflected in the following aspects:
  • Embodiments of the present application provide a target detection model, method and application for fire channel occupied target detection, which can improve the detection accuracy of small targets and is particularly suitable for specific application scenarios of fire channel occupied target detection.
  • embodiments of the present application provide a method for constructing a target detection model.
  • the method includes:
  • the backbone network, the neck multi-scale feature fusion network and the neural network head are connected in sequence.
  • the backbone network includes a slicing operation, a transposed bottleneck residual module and a 3*3 convolution.
  • the input image undergoes a 3*3 convolution after the slicing operation.
  • the result of the convolution is input into the transposed bottleneck residual module.
  • the features output by the upper-level transposed bottleneck residual module are input to the next-level transposed bottleneck residual module after the slicing operation.
  • the transposed bottleneck residual modules at different levels output scale features of different scales respectively.
  • the neck multi-scale feature fusion network includes the same number of 1*1 convolutions as there are transposed bottleneck residual modules, a jump cross fusion module and a context-aware attention network; the scale features of different scales are separately input into the corresponding 1*1 convolutions for feature fusion and feature channel unification to obtain initial features of different scales.
  • the initial features of different scales are fused with high-level semantic information and low-level spatial features through the jump cross fusion module to obtain jump cross fusion features of different scales.
  • the jump cross fusion features of different scales are passed into the context-aware attention network to obtain prediction features; the neural network head is divided into a classification prediction network and a border prediction network.
  • embodiments of the present application provide a target detection model, which is constructed according to the above construction method.
  • embodiments of the present application provide a target detection method, including the following steps:
  • the backbone network includes independent slicing operations, transposed bottleneck residual modules and a 3*3 convolution. After the slicing operation, the image to be detected passes through the 3*3 convolution and is input into the transposed bottleneck residual module; the features output by an upper-level transposed bottleneck residual module are input, after a slicing operation, to the next-level transposed bottleneck residual module, and the transposed bottleneck residual modules at different levels output scale features of different scales respectively;
  • Scale features of different scales are input into the 1*1 convolution of the corresponding level in the neck multi-scale feature fusion network for feature fusion and feature channel unification to obtain initial features of different levels.
  • the initial features of different levels are fused with high-level semantic information and low-level spatial features through the jump cross fusion module to obtain jump cross fusion features at different levels, and the jump cross fusion features at different levels are respectively input into the context-aware attention network to output prediction features;
  • the predicted features are input into the head of the neural network to obtain the target to be detected.
  • embodiments of the present application provide a method for detecting fire passage occupancy, which includes: obtaining an image to be detected covering the fire passage area, and inputting the image to be detected into a fire passage occupancy target detection model for detection; if the fire passage occupancy target detection model detects an occupying target, it is judged that there is an occupying target in the fire escape.
  • the fire exit occupied target detection model is obtained by training the target detection model by using the image of the fire exit marked with the occupied target as a training sample.
  • embodiments of the present application provide an electronic device, including a memory and a processor, characterized in that a computer program is stored in the memory, and the processor is configured to run the computer program to execute the Target detection method or the fire escape passage occupied target detection method.
  • embodiments of the present application provide a computer program product, including a software code part.
  • the software code part is used to execute the target detection method or the fire exit occupied target detection method.
  • embodiments of the present application provide a readable storage medium in which a computer program is stored.
  • the computer program includes program code for controlling a process to execute a process, where the process includes executing the target detection method or the fire passage occupied target detection method.
  • the backbone network of the target detection model consists of independent slicing operations, transposed bottleneck residual modules and 3*3 convolutions.
  • the transposed bottleneck residual module achieves a better trade-off between floating point operations and accuracy than an ordinary residual module and generalizes better: the transposed bottleneck residual module uses depth-separable convolution, that is, the number of groups equals the number of input channels, so spatial information is mixed and weighted within a single channel.
  • the 7*7 depth-separable convolution is placed at the beginning of the inverted bottleneck so that only information in the spatial dimension is mixed; moving the computationally heavier depth-separable convolution layer forward means the complex module operates on fewer channels while the efficient, dense 1*1 layers operate on more channels; adopting the small-dimension to large-dimension to small-dimension form avoids the information loss caused by compressing dimensions when information is converted between feature spaces of different dimensionality.
  • the neck multi-scale fusion network first uses 1*1 convolutions for feature fusion and feature channel unification, followed by a feature fusion layer with 7-layer-deep jump and cross connections, and finally a context-aware attention network. It contains not only skip-layer connections but also cross-scale connections to overcome multi-scale variation; considering the characteristics of the same layer and adjacent layers, bilinear interpolation and max pooling are used as the upsampling and downsampling functions respectively. The skip-layer and cross-scale connection mechanism requires the target detection model to have sufficient exchange of high- and low-level information.
  • the skip-layer and cross-scale connections are stacked in the form of feature splicing, which effectively addresses the problem of large scale variance: under skip-layer and cross-scale connections, high-level semantic information and low-level spatial information can be fully exchanged.
  • This allows features of different scales to be learned effectively, helps improve the accuracy of target detection, especially the detection of small and large objects, and effectively alleviates the problems caused by large scale variation.
  • the context-aware attention network can efficiently encode the location information and appearance information of local features.
  • the attention network takes the features output by the convolutional network as input and learns to adjust the importance of different areas in the features, thereby obtaining rich appearance features and spatial features of the local area for accurate classification; this brings a considerable improvement in fine-grained classification performance and captures subtle differences between targets or scenes.
  • the attention network comprehensively considers the context information of pixel-level features, small-area features, large-area features and image-level features for classification.
  • Figure 1 is a schematic diagram of the overall framework of a target detection model according to an embodiment of the present application.
  • Figure 2 is a schematic structural diagram of the transposed bottleneck residual submodule according to an embodiment of the present application
  • Figure 3 is a schematic structural diagram of a neck multi-scale fusion network according to an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of a context-aware attention network according to an embodiment of the present application.
  • Figure 5 is a schematic diagram of a long short-term memory network according to an embodiment of the present application.
  • Figure 6 is a schematic framework diagram of a target detection device according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application.
  • the steps of the corresponding method are not necessarily performed in the order shown and described in this specification.
  • methods may include more or fewer steps than described in this specification.
  • a single step described in this specification may be broken down into multiple steps for description in other embodiments, and multiple steps described in this specification may also be combined into a single step for description in other embodiments.
  • embodiments of the present application provide a method for constructing a target detection model, including:
  • the backbone network, the neck multi-scale feature fusion network and the neural network head are connected in sequence.
  • the backbone network includes a slicing operation, a transposed bottleneck residual module and a 3*3 convolution.
  • the input image undergoes a 3*3 convolution after the slicing operation.
  • the result of the convolution is input into the transposed bottleneck residual module.
  • the features output by the upper-level transposed bottleneck residual module are input to the next-level transposed bottleneck residual module after the slicing operation.
  • the transposed bottleneck residual modules at different levels output scale features of different scales respectively.
  • the neck multi-scale feature fusion network includes the same number of 1*1 convolutions as there are transposed bottleneck residual modules, a jump cross fusion module and a context-aware attention network; the scale features of different scales are separately input into the corresponding 1*1 convolutions for feature fusion and feature channel unification to obtain initial features of different scales.
  • the initial features of different scales are fused with high-level semantic information and low-level spatial features through the jump cross fusion module to obtain jump cross fusion features of different scales.
  • the jump cross fusion features of different scales are input into the context-aware attention network to obtain prediction features; the neural network head is divided into a classification prediction network and a border prediction network, and the prediction features are input into the neural network head for target prediction.
  • the backbone network includes four slicing operations, four transposed bottleneck residual modules and a 3*3 convolution to achieve five times of downsampling.
  • the output of each slicing operation corresponds to a transposed bottleneck residual module; only the output of the first slicing operation undergoes the 3*3 convolution before being input to the corresponding transposed bottleneck residual module, and the outputs of the other slicing operations are input directly to their corresponding transposed bottleneck residual modules.
  • the output of each transposed bottleneck residual module is input into the neck multi-scale feature fusion network.
  • the backbone network includes, connected in sequence, the first-level slicing operation, the 3*3 convolution, the first-level transposed bottleneck residual module, the second-level slicing operation, the second-level transposed bottleneck residual module, the third-level slicing operation, the third-level transposed bottleneck residual module, the fourth-level slicing operation and the fourth-level transposed bottleneck residual module.
  • the input image is downsampled in the first-level slicing operation, 3*3 convolution, second-level slicing operation, third-level slicing operation, and fourth-level slicing operation respectively.
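  • The patent does not spell out the slicing operation itself. The sketch below assumes it is a space-to-depth rearrangement (as in Focus-style downsampling layers), which halves the spatial resolution while quadrupling the channel count; the 96-channel stem width echoes the channel count mentioned later for the transposed bottleneck module and is otherwise an assumption.

```python
import torch
import torch.nn as nn

class SliceDownsample(nn.Module):
    """Hypothetical slicing operation: a space-to-depth rearrangement that
    halves H and W and quadruples the number of channels."""
    def forward(self, x):
        # stack the four pixel-offset sub-grids along the channel dimension
        return torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2],
             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)

if __name__ == "__main__":
    img = torch.randn(1, 3, 448, 448)                    # input image
    sliced = SliceDownsample()(img)                      # (1, 12, 224, 224)
    stem = nn.Conv2d(12, 96, kernel_size=3, padding=1)   # the 3*3 convolution
    print(stem(sliced).shape)                            # torch.Size([1, 96, 224, 224])
```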
  • the transposed bottleneck residual module includes at least one group of transposed bottleneck residual sub-modules, and different levels of transposed bottleneck residual modules include different numbers of groups of transposed bottleneck residual sub-modules.
  • the first-level transposed bottleneck residual module includes three groups of series-connected transposed bottleneck residual sub-modules
  • the second-level transposed bottleneck residual module includes three groups of series-connected transposed bottleneck residual sub-modules.
  • the third-level transposed bottleneck residual module includes nine series-connected sets of transposed bottleneck residual sub-modules
  • the fourth-level transposed bottleneck residual module includes three sets of series-connected transposed bottleneck residual sub-modules.
  • FIG. 2 is a schematic structural diagram of the transposed bottleneck residual submodule of this solution.
  • Each group of transposed bottleneck residual submodules includes a 7*7 depth-separable convolution, a first 1*1 convolution, a second 1*1 convolution and Drop_path connected in sequence.
  • An activation layer is used between the two 1*1 convolutions, and the input and output of each group of transposed bottleneck residual submodules are summed element-wise.
  • the transposed bottleneck residual module uses depth-separable convolution and a large convolution kernel, achieving a better trade-off between floating point operations and accuracy and better generalization than a general residual module.
  • the number of groups in the depth-separable convolution is equal to the number of input channels; since each convolution kernel processes one channel separately, spatial information is mixed and weighted within a single channel, that is, only information in the spatial dimension is mixed, which reduces the amount of floating point operations. To compensate for the loss of accuracy, the number of channels is increased from 64 to 96; although the amount of floating point operations increases, the network performance of this solution is enhanced.
  • the 7*7 depth-separable convolution of this scheme is placed at the beginning of the inverted bottleneck of the transposed bottleneck residual sub-module, and the depth-separable convolution with relatively high computational complexity is moved forward, so that the complex module operates on fewer channels while the efficient, dense 1*1 layers operate on more channels; adopting the small-dimension to large-dimension to small-dimension form allows information to be converted between feature spaces of different dimensionality while avoiding the information loss caused by compressing dimensions.
  • the output of the 7*7 depth-separable convolution is normalized and then input to the first 1*1 convolution.
  • An activation layer is used between the first 1*1 convolution and the second 1*1 convolution.
  • the activation layer may be an SMU activation function, and the normalization adopts layer normalization; therefore no normalization layer is used between the two 1*1 convolutional layers, and only a nonlinear projection is performed.
  • the technical improvements of the transposed bottleneck residual module provided by this solution are: the SMU activation function is used; there are fewer activation functions and normalization layers, with the activation function used only between the 1*1 convolutions and a normalization layer used only between the 7*7 convolution and the first 1*1 convolution; and batch normalization is replaced by layer normalization.
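  • A minimal PyTorch sketch of one transposed bottleneck residual sub-module as described above is given below. The expansion ratio of 4 is an assumption (the patent only states small dimension to large dimension to small dimension), and GELU stands in for the SMU activation, which PyTorch does not provide.

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly drops the residual branch per sample."""
    def __init__(self, p: float = 0.0):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        mask = x.new_empty(x.shape[0], 1, 1, 1).bernoulli_(keep)
        return x * mask / keep

class TransposedBottleneckBlock(nn.Module):
    """7*7 depthwise conv -> layer norm -> 1*1 conv (expand) -> activation
    -> 1*1 conv (project) -> DropPath, with an element-wise residual sum."""
    def __init__(self, dim: int, expansion: int = 4, drop_path: float = 0.0):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)               # layer norm instead of batch norm
        self.pwconv1 = nn.Conv2d(dim, dim * expansion, kernel_size=1)
        self.act = nn.GELU()                        # patent specifies SMU; GELU is a stand-in
        self.pwconv2 = nn.Conv2d(dim * expansion, dim, kernel_size=1)
        self.drop_path = DropPath(drop_path)

    def forward(self, x):
        y = self.dwconv(x)                          # mix spatial information per channel
        y = self.norm(y.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        y = self.pwconv2(self.act(self.pwconv1(y))) # small -> large -> small dimensions
        return x + self.drop_path(y)                # element-level summation
```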
  • Figure 3 is a schematic structural diagram of the neck multi-scale fusion network of this solution.
  • the neck multi-scale feature fusion network uses 1*1 convolution to perform feature fusion and unify feature channels on the scale features of different scales output by the backbone network to obtain initial features of different scales.
  • the number of feature channels of the initial features of different scales is the same.
  • the initial features of different layers are skipped and cross-connected in the skip cross fusion module to obtain skip cross fusion features of different scales, and the skip cross fusion features of different scales are respectively input into the context-aware attention network to obtain predicted features.
  • low-scale scale features are input into the corresponding 1*1 convolution of the neck multi-scale feature fusion network to obtain low-scale initial features, and high-scale scale features are input into the corresponding 1*1 convolution of the neck multi-scale feature fusion network to obtain high-scale initial features.
  • the initial feature M2 in Figure 3 is obtained by inputting the scale feature C2 in Figure 2 into a 1*1 convolution.
  • the initial feature M3 is obtained by inputting the scale feature C3 into a 1*1 convolution.
  • the initial feature M4 is obtained by inputting the scale feature C4 into a 1*1 convolution, and the initial feature M5 is obtained by inputting the scale feature C5 into a 1*1 convolution.
  • the initial feature M2 is used to detect small targets
  • the initial features M3 and M4 are used to detect medium targets
  • the initial feature M5 is used to detect large targets.
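  • As a concrete illustration of this step, the sketch below maps backbone outputs C2 to C5 onto initial features M2 to M5 with 1*1 convolutions; the input channel counts, spatial sizes and the unified width of 256 channels are placeholders, not values stated in the patent.

```python
import torch
import torch.nn as nn

in_channels = [96, 192, 384, 768]        # assumed channel counts of C2..C5
sizes = [56, 28, 14, 7]                  # assumed spatial sizes for a 448*448 input
lateral_convs = nn.ModuleList(nn.Conv2d(c, 256, kernel_size=1) for c in in_channels)

c_feats = [torch.randn(1, c, s, s) for c, s in zip(in_channels, sizes)]   # C2..C5
m_feats = [conv(c) for conv, c in zip(lateral_convs, c_feats)]            # M2..M5, all 256 channels
```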
  • the resolution of feature maps in the same layer is the same.
  • for example, the feature resolution within the M5 layer is the same.
  • the same layer only deepens the neural network and enriches the semantic information of the feature map.
  • this solution inputs the initial features of different scales into the jump cross fusion module for fusion processing.
  • the skip cross fusion module of this scheme not only contains skip layer connections, but also cross-scale connections to overcome multi-scale changes. Considering the characteristics of the same layer and adjacent layers, bilinear interpolation and maximum pooling are used as upsampling and downsampling functions respectively.
  • the skip-layer and cross-scale connection mechanism requires the neck multi-scale fusion network to have sufficient exchange of high- and low-level information.
  • skip-layer and cross-scale connections are stacked in the form of feature splicing, effectively addressing the problem of large scale variance; under skip-layer and cross-scale connections, high-level semantic information and low-level spatial information can be fully exchanged.
  • This method allows features of different scales to learn from each other, helping to improve target detection accuracy, especially for small and large objects, and effectively alleviating the problems caused by large scale variation.
  • This structure enables dense information exchange at different spatial scales as well as different levels of latent semantics, and helps the detector process high-level semantic information and low-level spatial information with the same priority in the early stages of the network, making it more effective in detection tasks.
  • correspondingly, the neck multi-scale feature fusion network of this scheme includes four 1*1 convolutions; the outputs of the four 1*1 convolutions are the initial features of the four levels, and the initial features are input into the jump cross fusion module for jump cross fusion.
  • the jump cross fusion module of this solution uses jump and cross connections with a depth of 7 layers.
  • the skip cross fusion module includes multi-level feature fusion layers corresponding to initial features at different levels. The depth of each level of feature fusion layer is 7 layers.
  • Jump layer connections are used within the feature fusion layer at the same level and between feature fusion layers at different levels.
  • bilinear interpolation is used as the upsampling function
  • maximum pooling is used as the downsampling function
  • the skip layer connection and the cross-scale layer connection are stacked in the form of feature splicing.
  • the feature fusion layer at each level includes depth layers of different depths connected in sequence, and the different depth layers within the same-level feature fusion layer are jump-connected; among the feature fusion layers at different levels, odd-numbered depth layers use downsampling and even-numbered depth layers use upsampling, and the same depth layers between feature fusion layers at different levels are connected across scales; the depth layers of the lowest-scale feature fusion layer and the depth layers of the adjacent previous-scale feature fusion layer adopt downsampling cross connections, while the depth layers of the highest-scale feature fusion layer and the depth layers of the adjacent next-scale feature fusion layer adopt upsampling cross connections.
  • the depth layer of the feature fusion layer at the same level and the depth layer at intervals are jump-connected.
  • the odd-numbered depth layers of different levels of feature fusion layers use downsampling, and the even-numbered depth layers use upsampling
  • bilinear interpolation is used as the upsampling function
  • maximum pooling is used as the downsampling function.
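  • A minimal sketch of these primitive operations (bilinear upsampling, max-pooling downsampling and fusion by feature splicing) is shown below; the exact wiring of the 7-layer jump cross fusion module follows the description in the surrounding text and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def upsample(x, ref):
    """Bilinear upsampling to the spatial size of a reference feature map."""
    return F.interpolate(x, size=ref.shape[-2:], mode="bilinear", align_corners=False)

def downsample(x, ref):
    """Max-pooling downsampling to (approximately) the reference resolution."""
    stride = x.shape[-1] // ref.shape[-1]
    return F.max_pool2d(x, kernel_size=stride, stride=stride)

def fuse(*feats):
    """Skip-layer / cross-scale stacking by feature splicing (channel concatenation)."""
    return torch.cat(feats, dim=1)

# e.g. fusing a depth layer with its coarser cross-scale neighbour:
hi = torch.randn(1, 256, 56, 56)     # finer-scale feature
lo = torch.randn(1, 256, 28, 28)     # adjacent coarser-scale feature
fused = fuse(hi, upsample(lo, hi))   # -> (1, 512, 56, 56)
```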
  • the same depth layer of a feature fusion layer and the same depth layer of the feature fusion layer spaced from it are connected across scales, and these cross-scale connections are performed in a downsampling manner.
  • the lowest-depth depth layers of the feature fusion layers at different levels do not perform cross-scale connections.
  • this solution includes a first feature fusion layer corresponding to the low scale, a second feature fusion layer and a third feature fusion layer corresponding to the intermediate scales, and a fourth feature fusion layer corresponding to the high scale.
  • Each feature fusion layer is divided by depth, in sequence, into a first depth layer, a second depth layer, a third depth layer, a fourth depth layer, a fifth depth layer, a sixth depth layer and a seventh depth layer.
  • the first depth layer is jump-connected to the third depth layer, the fifth depth layer and the seventh depth layer, and the second depth layer is jump-connected to the fourth depth layer and the sixth depth layer.
  • the third depth layer is jump-connected to the fifth depth layer and the seventh depth layer
  • the fourth depth layer is jump-connected to the sixth depth layer
  • the fifth depth layer is jump-connected to the seventh depth layer.
  • for the cross-scale connections, the depth layers of the first feature fusion layer are connected with the depth layers of the third feature fusion layer, and the depth layers of the second feature fusion layer are connected with the depth layers of the fourth feature fusion layer; the first depth layers of the first, second, third and fourth feature fusion layers do not participate in the cross-scale connections.
  • a downsampling cross connection is used between the first depth layer of the first feature fusion layer and the second depth layer of the second feature fusion layer, and a downsampling cross connection is used between the second depth layer of the first feature fusion layer and the third depth layer of the second feature fusion layer.
  • a downsampling cross connection is used between the third depth layer of the first feature fusion layer and the fourth depth layer of the second feature fusion layer, and a downsampling cross connection is used between the fourth depth layer of the first feature fusion layer and the fifth depth layer of the second feature fusion layer.
  • the fifth depth layer of the first feature fusion layer and the sixth depth layer of the second feature fusion layer adopt a downsampling cross connection
  • a downsampling cross connection is used between the sixth depth layer of the first feature fusion layer and the seventh depth layer of the second feature fusion layer.
  • an upsampling cross connection is used between the first depth layer of the fourth feature fusion layer and the second depth layer of the third feature fusion layer.
  • an upsampling cross connection is used between the second depth layer of the fourth feature fusion layer and the third depth layer of the third feature fusion layer.
  • an upsampling cross connection is used between the third depth layer of the fourth feature fusion layer and the fourth depth layer of the third feature fusion layer, and an upsampling cross connection is used between the fourth depth layer of the fourth feature fusion layer and the fifth depth layer of the third feature fusion layer.
  • an upsampling cross connection is used between the fifth depth layer of the fourth feature fusion layer and the sixth depth layer of the third feature fusion layer, and an upsampling cross connection is used between the sixth depth layer of the fourth feature fusion layer and the seventh depth layer of the third feature fusion layer.
  • the initial features of this solution obtain jump cross features at four scales after going through the above jump cross fusion module.
  • the jump cross features at four scales are respectively input into the context-aware attention network to obtain more accurate regions of interest.
  • Figure 4 is a structural diagram of the context-aware attention network.
  • the attention network can efficiently encode the position information and appearance information of local features.
  • the attention network takes the jump cross features obtained above as input and learns to adjust the importance of different areas in the features, thereby obtaining rich appearance features and spatial features of the local area for accurate classification; it brings a considerable improvement in fine-grained classification performance and captures the subtle differences between targets or scenes.
  • the attention network provided by this solution comprehensively considers the context information of pixel-level features, small-area features, large-area features, and picture-level features for classification.
  • the context-aware attention network enlarges the width and height of each input jump cross fusion feature to derive a series of candidate regions; together the candidate regions cover all region positions of the jump cross fusion feature.
  • Candidate regions of different sizes are represented as fixed-size features using bilinear interpolation. Similar fixed-size features are weighted to obtain context vectors.
  • the context vector undergoes global average pooling and is converted into a region sequence.
  • the region sequence is input into the long short-term memory network to obtain the corresponding hidden state sequence.
  • the hidden state sequence is used as a prediction feature for subsequent head prediction.
  • taking row i and column j of the jump cross fusion feature as the base position, candidate regions r1 to rn covering areas of different sizes are derived.
  • Each candidate area is converted into a feature fn of uniform size using bilinear interpolation. Different features fn are weighted with each other.
  • a series of context vectors cn are obtained, where each context vector corresponds to each candidate region. Global average pooling of the context vectors is performed to obtain the region sequence sn. Multiple region sequences are input into the long short-term memory network to obtain the corresponding hidden state sequence;
  • the formula for obtaining the context feature vector c from the uniform size feature f is as follows:
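  • The equation is not reproduced here; a plausible form, consistent with the parameter names described below but not confirmed against the original publication, is:

        \alpha_{n,m} = \mathrm{softmax}_m\big( (W_\alpha f_n + b_\alpha)^\top W_{\alpha'} f_m \big), \qquad
        c_n = \sum_m \alpha_{n,m} \, (W_\beta f_m + b_\beta)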
  • the parameter matrices W_α and W_α′ in this formula are used to convert the input features into query terms and key terms.
  • W_β is a nonlinear combination.
  • b_α and b_β are bias terms.
  • the overall learnable parameters are W_α, W_α′, W_β, b_α and b_β.
  • the attention term α represents the similarity between the two features.
  • the context vector c can represent the contextual information contained in the uniform-size feature f of a region; this information is obtained from the degree of correlation between the region and other regions, and the context vector c describes the criticality and characteristics of the region.
  • the jump intersection feature extracted by the neck multi-scale feature fusion network is used as input.
  • the input feature is I, and the width and height are w and h.
  • regions of different granularity levels are defined on the input feature I, where the granularity level is determined by the size of the region; taking row i and column j of the input feature as an example, the minimum region is (Δx, Δy), and a series of regions (candidate regions r1, r2, r3 to rn) can be derived by enlarging the width and height; similar region collections are generated at different positions to obtain the final region collection R.
  • R covers regions of different aspect ratios at all positions, providing comprehensive contextual information and helping to capture subtle features at different levels of the image.
  • The R regions obtained on the feature map range in size from the smallest Δx*Δy*C to the largest W*H*C, and bilinear interpolation is used to represent regions of different sizes as fixed-size features (f1, f2, f3 to fn); the bilinear pooling maps the target coordinates back to the original image, takes the nearest four points, weights them by distance, and finally obtains the fixed-size feature after pooling; weighting the output according to the similarity between fn and the other uniform-size features allows the model to selectively focus on more relevant areas, thereby generating more comprehensive contextual information.
  • the context vector c of each region is converted into a region sequence and input into the recurrent neural network, and the hidden state unit h of the recurrent neural network is used to express the structural features; in order to increase generalization ability and reduce the amount of calculation, the region sequence s is obtained by global average pooling of the context vector c, and the hidden state sequence h corresponding to the region sequence s is finally output and used in the subsequent head prediction module. The information from pixels to targets to scenes is carefully considered to locate the position of local features or targets and to describe their rich and complementary characteristics from multiple dimensions, thereby deriving the content of the complete image or target. The module can efficiently encode the location information and appearance information of local features; it takes the features output by the convolutional network as input and learns to adjust the importance of different areas in the features, thereby obtaining rich appearance features and spatial features of local areas for accurate classification and better localization.
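  • The sketch below illustrates the pipeline just described (fixed-size region pooling by bilinear interpolation, attention weighting into context vectors, global average pooling into a region sequence, and an LSTM over that sequence). It is a simplified interpretation: the projections named q, k and v loosely play the roles of W_α, W_α′ and W_β, the pooled features are averaged spatially before the attention step for brevity, and none of the sizes are taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class ContextAwareAttention(nn.Module):
    def __init__(self, channels=256, pooled=7, hidden=256):
        super().__init__()
        self.q = nn.Linear(channels, channels)   # query projection (roughly W_alpha)
        self.k = nn.Linear(channels, channels)   # key projection   (roughly W_alpha')
        self.v = nn.Linear(channels, channels)   # value projection (roughly W_beta)
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.pooled = pooled

    def forward(self, feat, boxes):
        # feat: (1, C, H, W); boxes: (N, 4) candidate regions in feature-map coordinates
        rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)   # prepend batch index 0
        f = roi_align(feat, rois, output_size=self.pooled)             # bilinear fixed-size features
        f = f.mean(dim=(2, 3))                                         # (N, C), simplified
        attn = F.softmax(self.q(f) @ self.k(f).t() / f.shape[1] ** 0.5, dim=-1)
        c = attn @ self.v(f)                     # one context vector per candidate region
        s = c.unsqueeze(0)                       # region sequence (1, N, C)
        h, _ = self.lstm(s)                      # hidden state sequence -> prediction features
        return h.squeeze(0)
```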
  • Figure 5 is a schematic diagram of the framework of the long short-term memory network of this solution.
  • the region sequence of the current layer, the hidden state sequence output by the previous layer, and the context vector of the previous layer are used as the input of the current long short-term memory network.
  • the hidden state sequence of the current layer is obtained as output.
  • the hidden state sequence output by the previous layer and the region sequence of the current layer are fused and multiplied element-wise with the context vector of the previous layer.
  • the formula for the long short-term memory network is as follows:
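  • The gate-by-gate description that follows matches the standard long short-term memory update; written in the notation used here (stacked input [h_{r-1}, S_r], candidate cell information A_r, cell information C_r, with the weight matrices and biases being the usual LSTM parameters introduced only for notation), it reads:

        f_r = \sigma(W_f \cdot [h_{r-1}, S_r] + b_f)
        i_r = \sigma(W_i \cdot [h_{r-1}, S_r] + b_i)
        A_r = \tanh(W_A \cdot [h_{r-1}, S_r] + b_A)
        C_r = f_r \odot C_{r-1} + i_r \odot A_r
        o_r = \sigma(W_o \cdot [h_{r-1}, S_r] + b_o)
        h_r = o_r \odot \tanh(C_r)

    In the surrounding text the previous sequence context vector c_{r-1} plays the role of the previous cell state C_{r-1}.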
  • [h r-1 ,S r ] represents the feature stacking of the previous hidden state sequence h r-1 and the current region sequence S r .
  • f_t is produced by the σ (sigmoid) unit of the forget gate, which outputs a vector of values between 0 and 1 by looking at the stacked feature information of the two.
  • the values between 0 and 1 in the vector indicate which information in the previous sequence context vector c_{r-1} is retained or discarded, where 0 indicates discard and 1 indicates retain; second, to decide what new information to add to the cell state, the stacked features are first passed through the input gate operation to determine which information to update, and then passed through a tanh layer to obtain the new candidate cell information A_r; third, the old cell information C_{r-1} is updated to the new cell information C_r.
  • the update rule is to forget part of the old cell information through the forget gate selection and add part of the candidate cell information A_r through the input gate selection to obtain the new cell information C_r; fourth, the input is passed through a sigmoid layer called the output gate to obtain the judgment condition, and the cell state is passed through a tanh layer to obtain a vector with values between -1 and 1; this vector is multiplied by the judgment condition obtained from the output gate to obtain the final output.
  • the training configuration is basically the same from the baseline model to the final model.
  • the initial warm-up phase of training sets the learning rate to a very small value; as training proceeds, the learning rate gradually increases and finally reaches the normal training learning rate.
  • the optimizer selected during training is SGD
  • the initial learning rate is 0.01
  • the learning rate change strategy is cosine decaying schedule
  • weight decay is set to 0.05
  • momentum is set to 0.9
  • the batch depends on the hardware device.
  • the input size uniformly transitions from 448 to 832 with a step size of 32; the connection weight w and bias b of each layer are randomly initialized.
  • the SMU activation function is selected, the border loss function is CIOU_Loss, and training runs for the maximum number of iterations under the current data.
  • the deep learning framework used for training in this program is PyTorch.
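  • A sketch of this training configuration in PyTorch is shown below; the warm-up length, total epoch count and the stand-in model are placeholders, while the optimizer, learning rate, momentum, weight decay and cosine schedule follow the values stated above.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)   # stand-in for the detection model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.05)

warmup_epochs, total_epochs = 3, 300   # assumed values, not from the patent
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=warmup_epochs)       # learning-rate warm-up
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_epochs - warmup_epochs)                 # cosine decay schedule
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(total_epochs):
    # ... one epoch over multi-scale inputs (448 to 832, step 32) would go here ...
    optimizer.step()        # placeholder for the per-batch updates
    scheduler.step()
```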
  • this patent uses the explicit regularization method DropBlock and the implicit regularization method of data augmentation to improve the generalization ability of the model.
  • the target detection model for detecting the target to be tested can be trained.
  • the target detection model for detecting different targets can be trained.
  • images of fire lanes marked with occupying objects can be used as training samples.
  • a fire lane occupied target detection model can be trained.
  • after the model is loaded, it predicts targets in the image or video, and the final convolution outputs the results; during inference, the outputs are processed with non-maximum suppression: the final features of the prediction layer are divided into multiple grids, and each feature cell predicts three bounding boxes; next, predictions with low probability are discarded, that is, the model considers that there is nothing in that grid; in the inference stage, when there are multiple detection categories, non-maximum suppression is run separately for each category, and the final predicted bounding boxes are output.
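  • The post-processing just described can be sketched with torchvision's class-wise NMS helper; the score and IoU thresholds below are illustrative defaults, not values from the patent.

```python
import torch
from torchvision.ops import batched_nms

def postprocess(boxes, scores, labels, score_thr=0.25, iou_thr=0.45):
    """Discard low-probability predictions, then run per-class non-maximum suppression."""
    keep = scores > score_thr                              # drop low-confidence grid predictions
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = batched_nms(boxes, scores, labels, iou_thr)     # NMS run separately per category
    return boxes[keep], scores[keep], labels[keep]
```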
  • the target detection model provided by this solution has several major technical improvements:
  • the transposed bottleneck residual module achieves a better trade-off between floating point operations and accuracy than the general residual module;
  • the neck multi-scale fusion network effectively solves the problem of large scale variance: under the skip-layer and cross-scale connections, high-level semantic information and low-level spatial information can be fully exchanged.
  • This method allows features of different scales to learn from each other, which helps improve the accuracy of target detection, especially the detection of small and large objects, and effectively alleviates the problems caused by large scale variation;
  • the context-aware attention network carefully considers information from pixels to targets to scenes, locates local features or target positions, and describes their rich and complementary characteristics from multiple dimensions to derive the content of the complete image or target; the module can efficiently encode the position information and appearance information of local features.
  • the module takes the features output by the convolutional network as input and learns to adjust the importance of different regions in the features, thereby obtaining rich appearance features and spatial features of local areas for accurate classification and better localization.
  • Embodiment 2 This embodiment of the present application provides a target detection method. Specifically, the target detection method uses the trained target detection model described in the first aspect to perform target detection.
  • the target detection model consists of a backbone network, a neck multi-scale feature fusion network and a neural network head connected in sequence, and the method includes:
  • the backbone network includes independent slicing operations, transposed bottleneck residual modules and a 3*3 convolution. After the slicing operation, the image to be detected passes through the 3*3 convolution and is input into the transposed bottleneck residual module; the features output by an upper-level transposed bottleneck residual module are input, after a slicing operation, to the next-level transposed bottleneck residual module, and the transposed bottleneck residual modules at different levels output scale features of different scales respectively;
  • Scale features of different scales are input into the 1*1 convolution of the corresponding level in the neck multi-scale feature fusion network for feature fusion and feature channel unification to obtain initial features of different levels.
  • the initial features of different levels are fused with high-level semantic information and low-level spatial features through the jump cross fusion module to obtain jump cross fusion features at different levels, and the jump cross fusion features at different levels are input into the context-aware attention network to output prediction features;
  • the predicted features are input into the head of the neural network to obtain the target to be detected.
  • the neural network head mentioned in this solution has been trained, so it can predict the target to be detected based on the input jump cross fusion features.
  • the neural network head can be used to predict different targets; for example, if the training samples contain fire escape occupancy targets, this solution can be used to predict fire escape occupancy targets, in which case the target to be detected is the fire escape occupancy target.
  • each slice operation of the backbone network corresponds to a transposed bottleneck residual module.
  • the output of the first slicing operation undergoes the 3*3 convolution and is input to the corresponding transposed bottleneck residual module, while the outputs of the other slicing operations are input directly to their corresponding transposed bottleneck residual modules.
  • the backbone network includes four slicing operations, four transposed bottleneck residual modules, and a 3*3 convolution to achieve five times of downsampling.
  • the transposed bottleneck residual module includes at least one group of transposed bottleneck residual sub-modules, and different levels of transposed bottleneck residual modules include different numbers of groups of transposed bottleneck residual sub-modules.
  • the first-level transposed bottleneck residual module includes three groups of series-connected transposed bottleneck residual sub-modules
  • the second-level transposed bottleneck residual module includes three groups of series-connected transposed bottleneck residual sub-modules.
  • the third-level transposed bottleneck residual module includes nine series-connected sets of transposed bottleneck residual sub-modules
  • the fourth-level transposed bottleneck residual module includes three sets of series-connected transposed bottleneck residual sub-modules.
  • Each group of transposed bottleneck residual sub-modules includes, connected in sequence, a 7*7 depth-separable convolution, a first 1*1 convolution, a second 1*1 convolution and Drop_path; normalization is applied between the 7*7 depth-separable convolution and the first 1*1 convolution, an activation layer is used between the first 1*1 convolution and the second 1*1 convolution, and the input and output of each group of transposed bottleneck residual sub-modules are added element-wise.
  • the output of the 7*7 depth-separable convolution is input to the first 1*1 convolution after normalization, and an activation layer is used between the first 1*1 convolution and the second 1*1 convolution.
  • the activation layer may be an SMU activation function, and the normalization adopts layer normalization; therefore no normalization layer is used between the two 1*1 convolutional layers, and only a nonlinear projection is performed.
  • the neck multi-scale feature fusion network uses 1*1 convolution to perform feature fusion and unify feature channels on the scale features of different scales output by the backbone network to obtain initial features of different scales.
  • the number of feature channels of the initial features of different scales is the same.
  • the initial features of different layers are skipped and cross-connected in the skip cross fusion module to obtain skip cross fusion features of different scales, and the skip cross fusion features of different scales are respectively input into the context-aware attention network to obtain predicted features.
  • the context-aware attention network enlarges the jump cross fusion features in width and height to derive a series of candidate regions, and all candidate regions cover all region positions of the jump cross fusion features.
  • Candidate regions of different sizes are represented as fixed-size features using bilinear interpolation. Similar fixed-size features are weighted to obtain context vectors.
  • the context vectors undergo global average pooling and are converted into region sequences, and the region sequences are input into the long short-term memory network to obtain the corresponding hidden state sequences, which are used as the prediction features.
  • For details of the neck multi-scale feature fusion network and the attention network, refer to Embodiment 1.
  • Embodiment 3 provides a method for detecting fire lane occupancy targets.
  • images of fire lanes marked with occupancy targets are used as training samples to train the target detection model mentioned in the first aspect, and a fire lane occupancy target detection model is obtained. The objects accumulated in a fire escape can be of various sizes, and any non-firefighting items piled in the fire escape can be considered occupying objects; these occupying objects affect the normal use of the fire escape.
  • the target detection model provided by this solution is particularly suitable for detecting targets of different scales, and is especially suitable for detecting occupied targets in fire escapes.
  • this solution can use the fire lane occupancy target detection model, based on the fixed cameras already deployed by urban management, to automatically detect fire lane occupancy problems in the monitoring footage, providing convenient, fast and open information technology support for fire lane occupancy management.
  • fire lane occupancy cases can be handled more accurately and the locations involved can be quickly identified, making urban governance more efficient and effective.
  • the fire exit occupied target detection method includes the following steps:
  • the image to be detected is input into the fire channel occupied target detection model for detection. If the occupied target is detected, it is judged that there is an occupied target on the fire channel.
  • the fire lane occupied target detection model is obtained by training the target detection model as described above by using images of fire lanes marked with occupied targets as training samples.
  • the step of "obtaining images to be detected covering the fire escape area” select the image of the camera monitoring the fire escape as the image to be detected.
  • parameters such as the camera address, algorithm type, callback address, etc. can be set on the system interface.
  • the interface starts a new process and starts to capture image frames from the camera's video stream, store them in redis, and notify the listening program at the same time;
  • the listener program retrieves the image to be tested from redis after receiving the notification.
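  • A minimal sketch of this capture/notify pipeline is shown below, using OpenCV and redis-py; the Redis key, channel name and stream address are placeholders, and error handling is omitted.

```python
import cv2
import redis

r = redis.Redis(host="localhost", port=6379)

def capture_frames(rtsp_url, key="fire_lane:frame", channel="fire_lane:new_frame"):
    """Grab frames from the camera video stream, store them in Redis, notify the listener."""
    cap = cv2.VideoCapture(rtsp_url)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        _, buf = cv2.imencode(".jpg", frame)
        r.set(key, buf.tobytes())      # store the latest image frame
        r.publish(channel, key)        # notify the listening program

def listen(channel="fire_lane:new_frame"):
    """Listener: retrieve the image to be detected from Redis after receiving a notification."""
    sub = r.pubsub()
    sub.subscribe(channel)
    for msg in sub.listen():
        if msg["type"] == "message":
            image_bytes = r.get(msg["data"])
            # ... decode and pass to the fire lane occupancy target detection model ...
```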
  • images of fire lanes marked with occupied targets are selected as training samples.
  • data enhancement can be performed on the training samples. Specifically, the following technical means can be selected:
  • the collected basic data is subjected to data enhancement.
  • the augmentation methods are: 1. color transformation; 2. rotation transformation; 3. adding noise; 4. sharpening and blurring; 5. zoom transformation; 6. translation transformation: moving the image in the up, down, left and right directions; 7. flip transformation; 8. cropping transformation; 9. affine transformation: performing a linear transformation on the image followed by a translation.
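  • An illustrative torchvision pipeline covering most of the listed transformations is sketched below; the magnitudes are placeholders, additive noise and sharpening have no direct counterpart here, and for detection training the bounding boxes would need to be transformed consistently (e.g. with a detection-aware augmentation library), which this image-only sketch does not show.

```python
import torchvision.transforms as T

augment = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),        # 1. color transformation
    T.RandomRotation(degrees=15),                                       # 2. rotation transformation
    T.GaussianBlur(kernel_size=3),                                      # 4. blurring
    T.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.8, 1.2)),  # 5/6/9. zoom, translation, affine
    T.RandomHorizontalFlip(p=0.5),                                      # 7. flip transformation
    T.RandomCrop(448, padding=16),                                      # 8. cropping transformation
    T.ToTensor(),
])
```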
  • the image to be detected is input into the fire lane occupancy target detection model, which outputs the position of the occupancy target bounding box and the confidence of the target.
  • iterative processing can be carried out during the use of the fire passage occupancy target detection model: collect a new batch of data, let the fire passage occupancy target detection model detect this batch of data, and divide the detection results into two major categories: Framed images and frameless images.
  • Framed images are divided into real target images and false positive target images.
  • Frameless images can be divided into images with undetected targets and images with no targets in the image.
  • false-positive target images are used as negative samples, and images that contain fire lane occupancy targets but were not detected are used as training samples.
  • these missed-target images undergo data annotation and data augmentation, and then a new fire lane occupancy target detection model is trained on top of the original model; the model effect is then tested to check whether the accuracy meets the standard.
  • if the new fire channel occupancy target detection model does not meet the standard, new data is added and the network parameters are adjusted for further training; if the model accuracy meets the requirements and is optimal under the current training data, training stops. This step is repeated in a loop until the fire exit occupancy target detection model is well suited to samples from the actual environment.
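  • In pseudocode form, the iteration loop reads roughly as follows; every helper function here is hypothetical and only names the step it stands for.

```python
def iterate_model(model, new_images, accuracy_target):
    """Sketch of the data-iteration loop described above; all helpers are hypothetical."""
    while True:
        results = [model.detect(img) for img in new_images]
        framed = [res for res in results if res.boxes]            # images with predicted boxes
        unframed = [res for res in results if not res.boxes]      # images without boxes
        negatives = [res.image for res in framed if is_false_positive(res)]
        missed = [res.image for res in unframed if contains_occupancy(res.image)]
        train_set = augment_samples(annotate(missed)) + negatives # label and augment new samples
        model = finetune(model, train_set)                        # continue training from the old model
        if evaluate(model) >= accuracy_target:                    # stop once accuracy meets the target
            return model
```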
  • the fire passage occupied target detection method also includes the following steps: when it is detected that the fire passage contains occupied targets, notify the corresponding management department.
  • this application also proposes a target detection device, including:
  • The image acquisition unit 301 is used to acquire an image to be detected that contains the target to be detected.
  • the scale feature acquisition unit 302 is used to process the image to be detected to obtain scale features of different scales.
  • the image to be detected is input, after the slicing operation and 3*3 convolution, into the transposed bottleneck residual module, and the features output by an upper-level transposed bottleneck residual module are input, after a slicing operation, to the next-level transposed bottleneck residual module.
  • Different levels of transposed bottleneck residual modules output scale features of different scales respectively;
  • the prediction feature acquisition unit 303 is used to process scale features to obtain prediction features.
  • Scale features of different scales are input into the 1*1 convolution of the corresponding level in the neck multi-scale feature fusion network for feature fusion and feature channel unification to obtain initial features of different levels.
  • the initial features at different levels are fused with high-level semantic information and low-level spatial features through the jump cross fusion module to obtain jump cross fusion features of different scales.
  • the jump cross fusion features of different scales are input into the context-aware attention network to output prediction features;
  • the prediction unit 304 is used to input prediction features into the neural network head to obtain the target to be detected.
  • This embodiment also provides an electronic device, referring to Figure 7, including a memory 404 and a processor 402.
  • the memory 404 stores a computer program, and the processor 402 is configured to run the computer program to perform the steps of any of the above embodiments of the target detection method or the fire escape occupancy target detection method.
  • the above-mentioned processor 402 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
  • the memory 404 may include mass storage for data or instructions.
  • the memory 404 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, a magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these.
  • Storage 404 may include removable or non-removable (or fixed) media, where appropriate.
  • Memory 404 may be internal or external to the data processing device, where appropriate. In certain embodiments, memory 404 is Non-Volatile memory.
  • the memory 404 includes read-only memory (ROM) and random access memory (RAM).
  • the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), flash memory (FLASH), or a combination of two or more of these.
  • the RAM may be static random access memory (SRAM) or dynamic random access memory (DRAM), where the DRAM may be fast page mode dynamic random access memory (FPM DRAM), extended data out dynamic random access memory (EDO DRAM), synchronous dynamic random access memory (SDRAM), etc.
  • Memory 404 may be used to store or cache various data files required for processing and/or communication, as well as possibly computer program instructions executed by processor 402.
  • the processor 402 reads and executes the computer program instructions stored in the memory 404 to implement any of the target detection methods or fire escape target detection methods in the above embodiments.
  • the above-mentioned electronic device may also include a transmission device 406 and an input-output device 408, wherein the transmission device 406 is connected to the above-mentioned processor 402, and the input-output device 408 is connected to the above-mentioned processor 402.
  • Transmission device 406 may be used to receive or send data over a network.
  • Specific examples of the above-mentioned network may include a wired or wireless network provided by a communication provider of the electronic device.
  • the transmission device includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 406 may be a radio frequency (Radio Frequency, RF for short) module, which is used to communicate with the Internet wirelessly.
  • Input and output devices 408 are used to input or output information.
  • The input information may be a surveillance video of a fire escape, and the output information may be a detected occupation target; a minimal inference sketch is given immediately below.
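As a non-limiting illustration of this input-output relationship, the following minimal Python sketch shows how the processor 402 might run such a detector over a fire-escape surveillance stream. The model file name, input resolution, class list, confidence threshold and output format used here are assumptions made for the sake of the example; they are not part of the disclosed embodiments.

```python
# Illustrative sketch only: read a surveillance video, run a (hypothetical) exported
# detector, and report occupation targets above a confidence threshold.
import cv2
import torch

MODEL_PATH = "fire_escape_detector.ts"       # hypothetical TorchScript export
INPUT_SIZE = (640, 640)                      # assumed network input resolution (W, H)
CLASSES = ["bicycle", "carton", "debris"]    # hypothetical occupation-target classes

model = torch.jit.load(MODEL_PATH).eval()

cap = cv2.VideoCapture("fire_escape_channel.mp4")   # surveillance video (input of device 408)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Preprocess: BGR -> RGB, resize, scale to [0, 1], NCHW tensor.
    rgb = cv2.cvtColor(cv2.resize(frame, INPUT_SIZE), cv2.COLOR_BGR2RGB)
    x = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        boxes, scores, labels = model(x)             # assumed outputs: (N, 4), (N,), (N,)
    for box, score, label in zip(boxes, scores, labels):
        if score > 0.5:                              # confidence threshold (assumption)
            print(f"occupation target {CLASSES[int(label)]} at {box.tolist()}")
cap.release()
```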
  • The above-mentioned processor 402 may be configured to perform the following steps through a computer program:
  • The backbone network includes independent slicing operations, multi-level transposed bottleneck residual modules and 3*3 convolutions; the image to be detected is input, after a slicing operation followed by a 3*3 convolution, into the first-level transposed bottleneck residual module, and the features output by each upper-level transposed bottleneck residual module are input, after a further slicing operation, into the next-level transposed bottleneck residual module;
  • The transposed bottleneck residual modules at different levels respectively output scale features of different scales;
  • The scale features of different scales are input into the 1*1 convolution of the corresponding level in the neck multi-scale feature fusion network for feature fusion and feature-channel unification, obtaining initial features of different levels;
  • The initial features of different levels are processed by the jump cross fusion module, which fuses high-level semantic information with low-level spatial features to obtain jump cross fusion features of different levels; the jump cross fusion features of different levels are respectively input into the context-aware attention network to output prediction features;
  • Step S104: the prediction features are input into the neural network head to obtain the detected target. (A minimal sketch of this forward pass is given immediately after this list.)
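Purely as an illustration of the data flow described in the steps above, the following PyTorch-style sketch wires placeholder modules together in the same order: slicing, transposed bottleneck residual stages, 1*1 channel unification, jump cross fusion, context-aware attention, and a two-branch head. The internals of the transposed bottleneck residual module, the jump cross fusion module and the context-aware attention network are not reproduced here; identity-like stand-ins, assumed channel widths and an assumed number of scales are used instead.

```python
import torch
import torch.nn as nn

class Slice(nn.Module):
    """Space-to-depth slicing: halves H and W and concatenates the four sub-grids on channels."""
    def forward(self, x):
        return torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                          x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)

def stage(c_in, c_out):
    """Stand-in for one transposed bottleneck residual module (true internals not shown here)."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.GELU())

class SketchDetector(nn.Module):
    def __init__(self, widths=(64, 128, 256, 512), neck_ch=128, num_classes=3):
        super().__init__()
        self.slice = Slice()
        self.stem = nn.Conv2d(3 * 4, widths[0], 3, stride=2, padding=1)   # 3*3 conv after first slice
        self.stages = nn.ModuleList(
            [stage(widths[0], widths[0])] +
            [stage(widths[i - 1] * 4, widths[i]) for i in range(1, len(widths))])
        # Neck: 1*1 convolutions unify every scale to the same channel count.
        self.unify = nn.ModuleList(nn.Conv2d(w, neck_ch, 1) for w in widths)
        # Placeholders standing in for jump cross fusion and context-aware attention.
        self.fuse = nn.ModuleList(nn.Identity() for _ in widths)
        self.attend = nn.ModuleList(nn.Identity() for _ in widths)
        # Head: per-scale classification and bounding-box regression branches.
        self.cls_head = nn.ModuleList(nn.Conv2d(neck_ch, num_classes, 3, padding=1) for _ in widths)
        self.box_head = nn.ModuleList(nn.Conv2d(neck_ch, 4, 3, padding=1) for _ in widths)

    def forward(self, img):
        feats, x = [], self.stem(self.slice(img))
        for i, s in enumerate(self.stages):
            x = s(x) if i == 0 else s(self.slice(x))            # slice between backbone stages
            feats.append(x)                                     # multi-scale backbone features
        unified = [u(f) for u, f in zip(self.unify, feats)]     # 1*1 conv: channel unification
        fused = [fz(u) for fz, u in zip(self.fuse, unified)]    # jump cross fusion (placeholder)
        neck = [at(f) for at, f in zip(self.attend, fused)]     # context-aware attention (placeholder)
        cls_out = [c(n) for c, n in zip(self.cls_head, neck)]   # class logits per scale
        box_out = [b(n) for b, n in zip(self.box_head, neck)]   # box regressions per scale
        return cls_out, box_out

cls_out, box_out = SketchDetector()(torch.randn(1, 3, 640, 640))
print([o.shape for o in cls_out])
```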
  • In general, the various embodiments may be implemented in hardware or special purpose circuitry, software, logic, or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software executable by a controller, microprocessor or other computing device, although the invention is not limited thereto. Although various aspects of the invention may be shown and described as block diagrams, flow diagrams, or by some other graphical representation, it is to be understood that, by way of non-limiting example, the blocks, devices, systems, techniques, or methods described herein may be implemented in hardware, software, firmware, special purpose circuits or logic, general purpose hardware, controllers or other computing devices, or some combination thereof.
  • Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware.
  • Computer software or programs (also referred to as program products) may include one or more computer-executable components that are configured to carry out the embodiments when the program is executed.
  • The one or more computer-executable components may be at least one software code or a portion thereof.
  • any block of the logic flow in the figures may represent program steps, or interconnected logic circuits, blocks, and functions, or a combination of program steps and logic circuits, blocks, and functions.
  • Software may be stored on physical media such as memory chips or memory blocks implemented within a processor, magnetic media such as hard disks or floppy disks, and optical media such as DVDs and their data variants, or CDs.
  • Physical media are non-transitory media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a target detection model and method for detecting objects occupying a fire escape route, and a use thereof, relating to the field of object detection. The structure consists of three parts: a backbone network, a neck multi-scale feature fusion network and a neural network head. The backbone network uses independent slicing operations, four transposed bottleneck residual modules and a 3*3 convolution to perform downsampling five times. The neck multi-scale feature fusion network uses 1*1 convolutions to perform feature fusion and feature-channel unification, then uses a jump layer and a cross layer to refine and fuse high-level semantic information and low-level spatial features, and finally uses a context-aware attention network. The neural network head is divided into a classification prediction network and a bounding-box prediction network. The present invention can detect multi-scale objects well and can be used to detect objects occupying a fire escape route.
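The abstract names a "transposed bottleneck residual module" without spelling out its internal structure in this excerpt. Purely as a hypothetical reading, the sketch below implements a generic inverted ("transposed") bottleneck residual block in the spirit of MobileNetV2/ConvNeXt-style designs: depthwise spatial mixing followed by a 1*1 expansion and a 1*1 projection around a residual connection. The kernel size, expansion ratio and normalization choice are assumptions, not the patented structure.

```python
import torch
import torch.nn as nn

class TransposedBottleneckBlock(nn.Module):
    """Inverted (transposed) bottleneck residual block: depthwise mixing -> 1*1 expand -> 1*1 project,
    with a residual connection. Kernel size and expansion ratio are assumptions."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 7, padding=3, groups=channels),  # depthwise spatial mixing
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, hidden, 1),   # 1*1 expansion (the wide "transposed" middle)
            nn.GELU(),
            nn.Conv2d(hidden, channels, 1),   # 1*1 projection back to the input width
        )

    def forward(self, x):
        return x + self.block(x)              # residual connection

y = TransposedBottleneckBlock(64)(torch.randn(1, 64, 80, 80))
print(y.shape)  # torch.Size([1, 64, 80, 80])
```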
PCT/CN2022/141284 2022-04-24 2022-12-23 Modèle de détection d'objet et procédé de détection d'objet occupant un itinéraire d'échappement d'incendie, et utilisation WO2023207163A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210432925.8 2022-04-24
CN202210432925.8A CN114529825B (zh) 2022-04-24 2022-04-24 用于消防通道占用目标检测的目标检测模型、方法及应用

Publications (1)

Publication Number Publication Date
WO2023207163A1 true WO2023207163A1 (fr) 2023-11-02

Family

ID=81628154

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141284 WO2023207163A1 (fr) 2022-04-24 2022-12-23 Modèle de détection d'objet et procédé de détection d'objet occupant un itinéraire d'échappement d'incendie, et utilisation

Country Status (2)

Country Link
CN (1) CN114529825B (fr)
WO (1) WO2023207163A1 (fr)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237746A (zh) * 2023-11-13 2023-12-15 光宇锦业(武汉)智能科技有限公司 基于多交叉边缘融合小目标检测方法、系统及存储介质
CN117590761A (zh) * 2023-12-29 2024-02-23 广东福临门世家智能家居有限公司 用于智能家居的开门状态检测方法及系统
CN117593516A (zh) * 2024-01-18 2024-02-23 苏州元脑智能科技有限公司 一种目标检测方法、装置、设备及存储介质
CN117649609A (zh) * 2024-01-30 2024-03-05 中国人民解放军海军航空大学 面向跨时空尺度域的遥感图像建筑物信息提取方法
CN117739289A (zh) * 2024-02-20 2024-03-22 齐鲁工业大学(山东省科学院) 基于声图融合的泄漏检测方法及系统
CN117830788A (zh) * 2024-03-06 2024-04-05 潍坊科技学院 一种多源信息融合的图像目标检测方法
CN118071745A (zh) * 2024-04-19 2024-05-24 天津师范大学 一种基于深度学习的骨折检测方法和系统

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114529825B (zh) * 2022-04-24 2022-07-22 城云科技(中国)有限公司 用于消防通道占用目标检测的目标检测模型、方法及应用
CN114863368B (zh) * 2022-07-05 2022-09-27 城云科技(中国)有限公司 用于道路破损检测的多尺度目标检测模型、方法
CN115375999B (zh) * 2022-10-25 2023-02-14 城云科技(中国)有限公司 应用于危化品车检测的目标检测模型、方法及装置
CN115546879B (zh) * 2022-11-29 2023-02-17 城云科技(中国)有限公司 用于表情识别的细粒度识别模型及方法
CN115937655B (zh) * 2023-02-24 2023-05-23 城云科技(中国)有限公司 多阶特征交互的目标检测模型及其构建方法、装置及应用
CN116452972B (zh) * 2023-03-17 2024-06-21 兰州交通大学 一种基于Transformer端到端的遥感图像车辆目标检测方法
CN117894002B (zh) * 2024-03-18 2024-06-07 杭州像素元科技有限公司 一种危险物小目标检测模型的构建方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232232A (zh) * 2020-10-20 2021-01-15 城云科技(中国)有限公司 一种目标检测方法
CN113128564A (zh) * 2021-03-23 2021-07-16 武汉泰沃滋信息技术有限公司 一种基于深度学习的复杂背景下典型目标检测方法及系统
US20220019843A1 (en) * 2020-07-14 2022-01-20 Flir Unmanned Aerial Systems Ulc Efficient refinement neural network for real-time generic object-detection systems and methods
CN114118284A (zh) * 2021-11-30 2022-03-01 重庆理工大学 一种基于多尺度特征融合的目标检测方法
CN114529825A (zh) * 2022-04-24 2022-05-24 城云科技(中国)有限公司 用于消防通道占用目标检测的目标检测模型、方法及应用

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647585B (zh) * 2018-04-20 2020-08-14 浙江工商大学 一种基于多尺度循环注意力网络的交通标识符检测方法
CN108805345A (zh) * 2018-06-01 2018-11-13 广西师范学院 一种基于深度卷积神经网络模型的犯罪时空风险预测方法
CN109492830B (zh) * 2018-12-17 2021-08-31 杭州电子科技大学 一种基于时空深度学习的移动污染源排放浓度预测方法
CN110188863B (zh) * 2019-04-30 2021-04-09 杭州电子科技大学 一种适用于资源受限设备的卷积神经网络的卷积核压缩方法
CN110084210B (zh) * 2019-04-30 2022-03-29 电子科技大学 基于注意力金字塔网络的sar图像多尺度舰船检测方法
CN110717420A (zh) * 2019-09-25 2020-01-21 中国科学院深圳先进技术研究院 一种基于遥感图像的耕地提取方法、系统及电子设备
CN110782015A (zh) * 2019-10-25 2020-02-11 腾讯科技(深圳)有限公司 神经网络的网络结构优化器的训练方法、装置及存储介质
KR20210072504A (ko) * 2019-12-09 2021-06-17 삼성전자주식회사 뉴럴 네트워크 시스템 및 이의 동작 방법
CN111178213B (zh) * 2019-12-23 2022-11-18 大连理工大学 一种基于深度学习的航拍车辆检测方法
CN111401201B (zh) * 2020-03-10 2023-06-20 南京信息工程大学 一种基于空间金字塔注意力驱动的航拍图像多尺度目标检测方法
CN111461211B (zh) * 2020-03-31 2023-07-21 中国科学院计算技术研究所 一种用于轻量级目标检测的特征提取方法及相应检测方法
CN111553321A (zh) * 2020-05-18 2020-08-18 城云科技(中国)有限公司 一种流动商贩目标检测模型、检测方法及其管理方法
CN111967305B (zh) * 2020-07-01 2022-03-18 华南理工大学 一种基于轻量级卷积神经网络的实时多尺度目标检测方法
CN111860693A (zh) * 2020-07-31 2020-10-30 元神科技(杭州)有限公司 一种轻量级视觉目标检测方法及系统
CN112016511A (zh) * 2020-09-08 2020-12-01 重庆市地理信息和遥感应用中心 基于大尺度深度卷积神经网络的遥感图像蓝顶房检测方法
CN112686304B (zh) * 2020-12-29 2023-03-24 山东大学 一种基于注意力机制以及多尺度特征融合的目标检测方法、设备及存储介质
CN112686276A (zh) * 2021-01-26 2021-04-20 重庆大学 一种基于改进RetinaNet网络的火焰检测方法
CN112699859B (zh) * 2021-03-24 2021-07-16 华南理工大学 目标检测方法、装置、存储介质及终端
CN113313070A (zh) * 2021-06-24 2021-08-27 华雁智能科技(集团)股份有限公司 架空输电线路缺陷检测方法、装置及电子设备
CN113537013A (zh) * 2021-07-06 2021-10-22 哈尔滨理工大学 一种多尺度自注意力特征融合的行人检测方法
CN113393469A (zh) * 2021-07-09 2021-09-14 浙江工业大学 基于循环残差卷积神经网络的医学图像分割方法和装置
CN113781410B (zh) * 2021-08-25 2023-10-13 南京邮电大学 一种基于MEDU-Net+网络的医学图像分割方法和系统
CN114140786B (zh) * 2021-12-03 2024-05-17 杭州师范大学 基于HRNet编码与双分支解码的场景文本识别方法
CN114170634A (zh) * 2021-12-07 2022-03-11 浙江理工大学 基于DenseNet网络改进的手势图像特征提取方法
CN114092820B (zh) * 2022-01-20 2022-04-22 城云科技(中国)有限公司 目标检测方法及应用其的移动目标跟踪方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220019843A1 (en) * 2020-07-14 2022-01-20 Flir Unmanned Aerial Systems Ulc Efficient refinement neural network for real-time generic object-detection systems and methods
CN112232232A (zh) * 2020-10-20 2021-01-15 城云科技(中国)有限公司 一种目标检测方法
CN113128564A (zh) * 2021-03-23 2021-07-16 武汉泰沃滋信息技术有限公司 一种基于深度学习的复杂背景下典型目标检测方法及系统
CN114118284A (zh) * 2021-11-30 2022-03-01 重庆理工大学 一种基于多尺度特征融合的目标检测方法
CN114529825A (zh) * 2022-04-24 2022-05-24 城云科技(中国)有限公司 用于消防通道占用目标检测的目标检测模型、方法及应用

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237746A (zh) * 2023-11-13 2023-12-15 光宇锦业(武汉)智能科技有限公司 基于多交叉边缘融合小目标检测方法、系统及存储介质
CN117237746B (zh) * 2023-11-13 2024-03-15 光宇锦业(武汉)智能科技有限公司 基于多交叉边缘融合小目标检测方法、系统及存储介质
CN117590761A (zh) * 2023-12-29 2024-02-23 广东福临门世家智能家居有限公司 用于智能家居的开门状态检测方法及系统
CN117590761B (zh) * 2023-12-29 2024-04-19 广东福临门世家智能家居有限公司 用于智能家居的开门状态检测方法及系统
CN117593516B (zh) * 2024-01-18 2024-03-22 苏州元脑智能科技有限公司 一种目标检测方法、装置、设备及存储介质
CN117593516A (zh) * 2024-01-18 2024-02-23 苏州元脑智能科技有限公司 一种目标检测方法、装置、设备及存储介质
CN117649609A (zh) * 2024-01-30 2024-03-05 中国人民解放军海军航空大学 面向跨时空尺度域的遥感图像建筑物信息提取方法
CN117649609B (zh) * 2024-01-30 2024-04-30 中国人民解放军海军航空大学 面向跨时空尺度域的遥感图像建筑物信息提取方法
CN117739289A (zh) * 2024-02-20 2024-03-22 齐鲁工业大学(山东省科学院) 基于声图融合的泄漏检测方法及系统
CN117739289B (zh) * 2024-02-20 2024-04-26 齐鲁工业大学(山东省科学院) 基于声图融合的泄漏检测方法及系统
CN117830788A (zh) * 2024-03-06 2024-04-05 潍坊科技学院 一种多源信息融合的图像目标检测方法
CN117830788B (zh) * 2024-03-06 2024-05-10 潍坊科技学院 一种多源信息融合的图像目标检测方法
CN118071745A (zh) * 2024-04-19 2024-05-24 天津师范大学 一种基于深度学习的骨折检测方法和系统

Also Published As

Publication number Publication date
CN114529825B (zh) 2022-07-22
CN114529825A (zh) 2022-05-24

Similar Documents

Publication Publication Date Title
WO2023207163A1 (fr) Modèle de détection d'objet et procédé de détection d'objet occupant un itinéraire d'échappement d'incendie, et utilisation
CN112232232B (zh) 一种目标检测方法
CN109840531B (zh) 训练多标签分类模型的方法和装置
WO2023138300A1 (fr) Procédé de détection de cible et procédé de suivi de cible mobile l'utilisant
CN114202672A (zh) 一种基于注意力机制的小目标检测方法
US20220230282A1 (en) Image processing method, image processing apparatus, electronic device and computer-readable storage medium
Shen et al. Pcw-net: Pyramid combination and warping cost volume for stereo matching
Li et al. A new method of image detection for small datasets under the framework of YOLO network
JP7096431B2 (ja) ビデオ分析方法及びそれに関連するモデル訓練方法、機器、装置
WO2021218470A1 (fr) Procédé et dispositif d'optimisation de réseau neuronal
Xia et al. A deep Siamese postclassification fusion network for semantic change detection
CN111210446A (zh) 一种视频目标分割方法、装置和设备
CN116310850B (zh) 基于改进型RetinaNet的遥感图像目标检测方法
CN114549913A (zh) 一种语义分割方法、装置、计算机设备和存储介质
CN115187530A (zh) 超声自动乳腺全容积图像的识别方法、装置、终端及介质
Fan et al. A novel sonar target detection and classification algorithm
Ma et al. Cross-scale fusion and domain adversarial network for generalizable rail surface defect segmentation on unseen datasets
Fu et al. A case study of utilizing YOLOT based quantitative detection algorithm for marine benthos
CN112529025A (zh) 一种数据处理方法及装置
CN111914949B (zh) 基于强化学习的零样本学习模型的训练方法及装置
Cao et al. Face detection for rail transit passengers based on single shot detector and active learning
US20230298335A1 (en) Computer-implemented method, data processing apparatus and computer program for object detection
CN109559345B (zh) 一种服装关键点定位系统及其训练、定位方法
Chen et al. Alfpn: adaptive learning feature pyramid network for small object detection
Duong et al. Towards an Error-free Deep Occupancy Detector for Smart Camera Parking System

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22939972

Country of ref document: EP

Kind code of ref document: A1