CN112966697B - Target detection method, device and equipment based on scene semantics and storage medium - Google Patents

Target detection method, device and equipment based on scene semantics and storage medium

Info

Publication number
CN112966697B
CN112966697B (granted from application CN202110286154.1A)
Authority
CN
China
Prior art keywords
scene
inputting
candidate target
feature
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110286154.1A
Other languages
Chinese (zh)
Other versions
CN112966697A (en)
Inventor
谢雪梅 (Xie Xuemei)
刘卓 (Liu Zhuo)
李旭阳 (Li Xuyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Institute of Technology of Xidian University
Original Assignee
Guangzhou Institute of Technology of Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Institute of Technology of Xidian University filed Critical Guangzhou Institute of Technology of Xidian University
Priority to CN202110286154.1A priority Critical patent/CN112966697B/en
Publication of CN112966697A publication Critical patent/CN112966697A/en
Application granted granted Critical
Publication of CN112966697B publication Critical patent/CN112966697B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/048 - Activation functions
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method, apparatus, terminal device and storage medium based on scene semantics. The method comprises: constructing a target detection model; inputting a training image into a feature map extraction network to obtain a multi-scale feature map; inputting the multi-scale feature map into a scene semantic feature extraction network to obtain scene semantic features; calculating the multi-label classification loss of scene prediction from the scene semantic features; inputting the multi-scale feature map into a candidate target feature extraction network to obtain a candidate target feature set; fusing the scene semantic features with the candidate target feature set in a fusion network to obtain a new candidate target feature set; inputting the results into a detection head network for classification and regression operations, and calculating the classification loss and regression loss; training the target detection model by combining the three loss functions; and inputting an image to be detected into the trained target detection model to obtain the detection result. The invention solves the problem that targets with fuzzy appearance are difficult to identify with existing methods.

Description

Target detection method, device and equipment based on scene semantics and storage medium
Technical Field
The invention relates to the field of computer vision image target detection, and in particular to a target detection method and apparatus based on scene semantics, a terminal device and a storage medium.
Background
Most existing target detection methods extract the internal features of the target to be detected with a convolutional neural network and then classify the extracted features to detect the target. However, these methods ignore global scene semantic information during detection; in real scenes the appearance of an object is often related to the scene, so existing target detection methods struggle to identify targets whose appearance is fuzzy or not obvious.
Disclosure of Invention
Embodiments of the invention aim to provide a target detection method, apparatus, terminal device and storage medium based on scene semantics that make full use of scene semantic information and effectively solve the problem that existing detection methods have difficulty identifying targets with fuzzy appearance.
In order to achieve the above object, an embodiment of the present invention provides a target detection method based on scene semantics, including:
constructing a target detection model; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network;
inputting a training image into the feature map extraction network to obtain a multi-scale feature map;
inputting the multi-scale feature map into the scene semantic feature extraction network to obtain scene semantic features;
calculating the multi-label classification loss of scene prediction according to the scene semantic features;
inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set;
inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set;
inputting the new candidate target feature set and the candidate target feature set into the detection head network for classification and regression operation, and calculating corresponding classification loss and regression loss;
training the target detection model by combining the multi-label classification loss, the classification loss and the regression loss to obtain a trained target detection model;
inputting the image to be detected into the trained target detection model to obtain the category information and the position information of the target to be detected in the image to be detected.
Preferably, the calculating the multi-label classification loss of the scene prediction according to the scene semantic features specifically includes:
inputting the scene semantic features into two fully-connected layers and outputting predicted values of scene categories;
according to

L_mll = -Σ_{c=1}^{C} [ y_c·log(ŷ_c) + (1 - y_c)·log(1 - ŷ_c) ]

computing the multi-label classification loss L_mll of scene prediction; where C is the total number of categories of target detection, y_c is the label value of whether or not an object of class c appears in the image, and ŷ_c is the predicted value of whether or not an object of class c appears in the image.
Preferably, the inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set specifically includes:
inputting the multi-scale feature map into an RPN network to generate target proposals, then performing an ROI Align operation, and then obtaining the candidate target feature set through two fully-connected layers;
or inputting the multi-scale feature map into a 4-layer convolutional network to obtain the candidate target feature set.

Preferably, the inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set specifically includes:
embedding any feature vector in the candidate target feature set and the scene semantic features into the same dimension through a full connection layer respectively to obtain a first embedded vector and a second embedded vector correspondingly;
calculating a semantic compatibility score θ_i for the arbitrary feature vector and the scene semantic features according to the formula θ_i = σ(v_ie·s_e); where v_ie is the first embedded vector corresponding to the i-th feature vector, 1 ≤ i ≤ n, n is the total number of feature vectors in the candidate target feature set, and s_e is the second embedded vector corresponding to the scene semantic features;
and updating any feature vector according to the semantic compatibility score to obtain the new candidate target feature set.
Preferably, the updating any feature vector according to the semantic compatibility score to obtain the new candidate target feature set specifically includes:
converting the feature vector and the scene semantic feature each with a fully-connected layer, adding the results and activating with a tanh activation function to obtain the fusion feature s' = tanh(W·v_i + U·s); where W is the weight of the fully-connected layer used to convert the feature vector v_i, and U is the weight of the fully-connected layer used to convert the scene semantic feature s;
performing a weighted calculation on the feature vector according to v_i' = θ_i·v_i + (1 - θ_i)·s' to obtain the updated feature vector v_i'; where θ_i is the semantic compatibility score;
and obtaining the new candidate target feature set according to any updated feature vector.
Preferably, the inputting the new candidate target feature set and the candidate target feature set into the detection head network for classification and regression includes:
inputting any feature vector in the new candidate target feature set into a classification branch in the detection head network for prediction to obtain the category of the training target;
and inputting any feature vector of the candidate target feature set into a regression branch in the detection head network for prediction to obtain a boundary box of the training target.
Another embodiment of the present invention provides a target detection apparatus based on scene semantics, including:
the model construction module is used for constructing a target detection model; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network;
the characteristic diagram extraction module is used for inputting the training image into the characteristic diagram extraction network to obtain a multi-scale characteristic diagram;
the scene semantic feature extraction module is used for inputting the multi-scale feature map into the scene semantic feature extraction network to obtain scene semantic features;
the multi-label calculation module is used for calculating the multi-label classification loss of scene prediction according to the scene semantic features;
the candidate target feature extraction module is used for inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set;
the fusion module is used for inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set;
the classification regression module is used for inputting the new candidate target feature set and the candidate target feature set into the detection head network for classification and regression operation, and calculating corresponding classification loss and regression loss;
the training module is used for training the target detection model by combining the multi-label classification loss, the classification loss and the regression loss to obtain a trained target detection model;
and the test module is used for inputting the image to be tested into the trained target detection model to obtain the category information and the position information of the target to be tested in the image to be tested.
Another embodiment of the present invention provides a terminal device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the processor implements the scene semantic-based object detection method according to any one of the above items.
Another embodiment of the present invention provides a computer-readable storage medium, which includes a stored computer program, where when the computer program runs, a device in which the computer-readable storage medium is located is controlled to execute any one of the above-mentioned methods for detecting an object based on scene semantics.
Compared with the prior art, the target detection method, apparatus, terminal device and storage medium based on scene semantics provided by the embodiments of the invention explicitly model and utilize the semantic information of the scene, and introduce a semantically adaptive way to fuse the scene semantic features with the features of the target itself so as to improve the classification discrimination of the target features. The detection accuracy of weak and small targets can thereby be significantly improved, and the invention effectively addresses the problems that existing target detection methods, lacking flexible and effective use of scene context information, have difficulty detecting targets with fuzzy appearance and easily misclassify targets with ambiguous appearance.
Drawings
Fig. 1 is a schematic flowchart of a target detection method based on scene semantics according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a target detection model according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a testing stage in a target detection method based on scene semantic detection according to an embodiment of the present invention;
FIG. 4 is a comparison of detection results between the method of the present invention and an existing method on the COCO2017-val dataset;
fig. 5 is a schematic structural diagram of an object detection apparatus based on scene semantics according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, which is a schematic flowchart of a target detection method based on scene semantics according to the embodiment of the present invention, the method includes steps S1 to S9:
s1, constructing a target detection model; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network;
s2, inputting the training image into the feature map extraction network to obtain a multi-scale feature map;
s3, inputting the multi-scale feature map into the scene semantic feature extraction network to obtain scene semantic features;
s4, calculating the multi-label classification loss of scene prediction according to the scene semantic features;
s5, inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set;
s6, inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set;
s7, inputting the new candidate target feature set and the candidate target feature set into the detection head network for classification and regression operation, and calculating corresponding classification loss and regression loss;
s8, training the target detection model by combining the multi-label classification loss, the classification loss and the regression loss to obtain a trained target detection model;
and S9, inputting the image to be detected into the trained target detection model to obtain the category information and the position information of the target to be detected in the image to be detected.
Specifically, a target detection model is constructed; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network. Fig. 2 is a schematic structural diagram of a target detection model according to the embodiment of the present invention.
The training image is input into the feature map extraction network to obtain a multi-scale feature map. Generally, the training images need to be labeled before being input; the labels include the positions and classes of the training targets, and the scene classes determined by those target classes. Preferably, the feature map extraction network is a ResNet backbone with an FPN. The multi-scale feature maps are defined as {p2, p3, p4, p5, p6}, with corresponding downsampling rates of 4, 8, 16, 32 and 64, as sketched below.
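By way of illustration, a minimal PyTorch sketch of such a backbone follows; the class name, channel widths and the pooling-based p6 level are assumptions for illustration, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class FPNBackbone(nn.Module):
    """Minimal ResNet-50 + FPN producing {p2, ..., p6} at strides 4, 8, 16, 32, 64."""
    def __init__(self, out_channels=256):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        in_chs = [256, 512, 1024, 2048]  # output widths of C2..C5 in ResNet-50
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_chs])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_chs])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:                 # bottom-up pathway (C2..C5)
            x = stage(x)
            feats.append(x)
        lat = [l(c) for l, c in zip(self.lateral, feats)]
        for i in range(len(lat) - 1, 0, -1):      # top-down pathway
            lat[i - 1] = lat[i - 1] + F.interpolate(lat[i], size=lat[i - 1].shape[-2:])
        p2, p3, p4, p5 = [s(l) for s, l in zip(self.smooth, lat)]
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)  # extra stride-64 level
        return p2, p3, p4, p5, p6

# Quick shape check on a dummy image:
# for p in FPNBackbone()(torch.randn(1, 3, 256, 256)): print(p.shape)
```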
The multi-scale feature map is input into the scene semantic feature extraction network to obtain scene semantic features. Because the downsampling rates of p2-p5 are low, their receptive fields are correspondingly small, while the feature map p6 has the largest receptive field and can represent the global features of the whole image; p6 is therefore used as the input of the scene semantic feature extraction network, which facilitates the subsequent scene semantic feature extraction. Preferably, a 3×3 convolutional layer is applied to the feature map p6, followed by a global pooling operation and one fully-connected layer that outputs the scene semantic feature s, as in the sketch below.
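A minimal sketch of this head, assuming global average pooling and the channel/feature dimensions below (the description specifies only "3×3 convolution, global pooling, fully-connected layer"):

```python
import torch
import torch.nn as nn

class SceneSemanticHead(nn.Module):
    """3x3 conv -> global pooling -> FC on p6, producing the scene feature s."""
    def __init__(self, in_channels=256, feat_dim=1024):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.fc = nn.Linear(in_channels, feat_dim)

    def forward(self, p6):
        x = torch.relu(self.conv(p6))
        x = x.mean(dim=(2, 3))      # global (average) pooling over H, W
        return self.fc(x)           # scene semantic feature s, shape (B, feat_dim)
```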
The multi-label classification loss of scene prediction is calculated according to the scene semantic features. This step is needed only for training the model; it is not needed when testing images after model training is complete.
The multi-scale feature map is input into the candidate target feature extraction network to obtain a candidate target feature set. The candidate target feature set is the set formed by the feature points on the multi-scale feature map; each feature point is a possible training target, i.e. a candidate.
The scene semantic features and the candidate target feature set are input into the fusion network for fusion to obtain fusion features, and each feature vector of the candidate target feature set is updated according to the fusion features to obtain a new candidate target feature set.
The new candidate target feature set and the candidate target feature set are input into the detection head network for classification and regression operations, and the corresponding classification loss and regression loss are calculated. The detection head network comprises two branches: a classification branch and a regression branch.
The target detection model is trained by combining the multi-label classification loss, the classification loss and the regression loss, giving the trained target detection model; the total loss function is Loss = L_mll + L_cls + L_reg, where L_mll is the multi-label classification loss, L_cls is the loss function of the classification branch in the detection head network, and L_reg is the loss function of the regression branch in the detection head network. Preferably, the target detection model is trained with a stochastic gradient descent algorithm; a sketch of one such training step follows.
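Assuming a model whose forward pass returns the three losses (an assumption; the patent does not fix this interface), one SGD training step could look like:

```python
import torch

def train_step(model, optimizer, images, targets):
    """One optimization step over the combined loss Loss = L_mll + L_cls + L_reg."""
    l_mll, l_cls, l_reg = model(images, targets)  # assumed loss-returning forward
    loss = l_mll + l_cls + l_reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```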
The image to be detected is input into the trained target detection model to obtain the category information and position information of the target to be detected in the image. Generally, the position information is given by a bounding box. The testing process is: input a given image; extract its multi-scale features; obtain the scene semantic features and candidate target features; calculate the semantic compatibility scores between them; fuse the candidate target features with the scene semantic features; and input the results into the classification branch and regression branch of the detection head to obtain the detection result. See fig. 3, which is a schematic flowchart of the testing stage in the target detection method based on scene semantics provided by the embodiment of the invention.
The embodiment of the invention thus provides a target detection method based on scene semantics that makes full use of contextual scene semantic information and effectively solves the problem that targets with fuzzy appearance are difficult to identify with existing detection methods.
As an improvement of the above scheme, the calculating a multi-label classification loss of scene prediction according to the scene semantic features specifically includes:
inputting the scene semantic features into two fully-connected layers and outputting predicted values of scene categories;
according to

L_mll = -Σ_{c=1}^{C} [ y_c·log(ŷ_c) + (1 - y_c)·log(1 - ŷ_c) ]

computing the multi-label classification loss L_mll of scene prediction; where C is the total number of categories of target detection, y_c is the label value of whether or not an object of class c appears in the image, and ŷ_c is the predicted value of whether or not an object of class c appears in the image.
Specifically, scene semantic features are input into two fully-connected layers, and predicted values of scene categories are output;
According to

L_mll = -Σ_{c=1}^{C} [ y_c·log(ŷ_c) + (1 - y_c)·log(1 - ŷ_c) ]

the multi-label classification loss L_mll of scene prediction is computed; where C is the total number of categories of target detection and the scene category corresponding to the image is y ∈ R^C. y_c is the label value of whether an object of class c appears in the image: y_c = 1 indicates that an object of category c appears in the image, and y_c = 0 indicates that it does not. ŷ_c is the predicted value of whether an object of class c appears in the image, obtained by applying the sigmoid function σ(x) to the corresponding output of the two fully-connected layers.
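Written out as code, this is the standard per-category sigmoid binary cross-entropy; the batch reduction below is an assumption:

```python
import torch

def scene_multilabel_loss(logits, y):
    """L_mll: multi-label classification loss of scene prediction.

    logits: (B, C) outputs of the two fully-connected layers
    y:      (B, C) binary labels, y[b, c] = 1 iff a class-c object appears
    """
    y_hat = torch.sigmoid(logits)                       # predicted values ŷ_c
    per_class = -(y * torch.log(y_hat + 1e-8)
                  + (1 - y) * torch.log(1 - y_hat + 1e-8))
    return per_class.sum(dim=1).mean()                  # sum over C, mean over batch
    # up to the reduction convention this matches
    # torch.nn.functional.binary_cross_entropy_with_logits(logits, y.float())
```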
As an improvement of the above scheme, the inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set specifically includes:
inputting the multi-scale feature map into an RPN network to generate target proposals, then performing an ROI Align operation, and then obtaining the candidate target feature set through two fully-connected layers;
or inputting the multi-scale feature map into a 4-layer convolutional network to obtain the candidate target feature set.

In particular, when the method is applied to a two-stage detector such as FPN, the multi-scale feature map is passed through the RPN network to generate target proposals, an ROI Align operation is then performed, and the set of all candidate target feature vectors {v_i | 1 ≤ i ≤ n} is obtained through two fully-connected layers. Alternatively, when the method is applied to a single-stage detector such as FCOS, the candidate target feature set is the set formed by all feature points of the feature map obtained by passing the multi-scale feature map through a 4-layer convolutional network.
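For the two-stage path, torchvision's roi_align can realize the ROI Align step; the pooled size, feature dimension and stride handling in this sketch are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class CandidateFeatureHead(nn.Module):
    """ROI Align on RPN proposals followed by two FC layers, giving {v_i}."""
    def __init__(self, in_channels=256, feat_dim=1024, pool_size=7):
        super().__init__()
        self.pool_size = pool_size
        self.fc1 = nn.Linear(in_channels * pool_size * pool_size, feat_dim)
        self.fc2 = nn.Linear(feat_dim, feat_dim)

    def forward(self, fmap, proposals, stride):
        # proposals: list with one (L, 4) tensor of boxes per image, image coords
        rois = roi_align(fmap, proposals, output_size=self.pool_size,
                         spatial_scale=1.0 / stride, aligned=True)
        x = torch.relu(self.fc1(rois.flatten(1)))
        return torch.relu(self.fc2(x))   # candidate target features, (n, feat_dim)
```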
As an improvement of the above scheme, the inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set specifically includes:
embedding any feature vector in the candidate target feature set and the scene semantic features into the same dimension through a full connection layer respectively to obtain a first embedded vector and a second embedded vector correspondingly;
calculating the semantic compatibility score θ_i of the feature vector and the scene semantic features according to the formula θ_i = σ(v_ie·s_e); where v_ie is the first embedded vector corresponding to the i-th feature vector, 1 ≤ i ≤ n, n is the total number of feature vectors in the candidate target feature set, and s_e is the second embedded vector corresponding to the scene semantic features;
and updating any feature vector according to the semantic compatibility score to obtain the new candidate target feature set.
Specifically, any feature vector v_i in the candidate target feature set and the scene semantic feature s are each embedded into the same dimension through a fully-connected layer, giving respectively the first embedded vector v_ie and the second embedded vector s_e. The calculation formulas are:

v_ie = σ(T_v·v_i + b_v)
s_e = σ(T_s·s + b_s)

where σ(x) is the sigmoid function, T_v is the weight of the fully-connected layer corresponding to the feature vector v_i, T_s is the weight of the fully-connected layer corresponding to the scene semantic feature s, and b_v and b_s are bias terms.
The semantic compatibility score θ_i of the feature vector and the scene semantic features is then calculated according to the formula θ_i = σ(v_ie·s_e), where v_ie is the first embedded vector corresponding to the i-th feature vector, 1 ≤ i ≤ n, n is the total number of feature vectors in the candidate target feature set, and s_e is the second embedded vector corresponding to the scene semantic features.
Each feature vector is then updated according to the semantic compatibility score to obtain the new candidate target feature set; that is, the update of the candidate target feature set is guided by the semantic compatibility score.
As an improvement of the above scheme, the updating any feature vector according to the semantic compatibility score to obtain the new candidate target feature set specifically includes:
converting the feature vector and the scene semantic feature each with a fully-connected layer, adding the results and activating with a tanh activation function to obtain the fusion feature s' = tanh(W·v_i + U·s); where W is the weight of the fully-connected layer used to convert the feature vector v_i, and U is the weight of the fully-connected layer used to convert the scene semantic feature s;
performing a weighted calculation on the feature vector according to v_i' = θ_i·v_i + (1 - θ_i)·s' to obtain the updated feature vector v_i'; where θ_i is the semantic compatibility score;
and obtaining the new candidate target feature set according to any updated feature vector.
Specifically, the feature vector and the scene semantic feature are each converted with a fully-connected layer, added, and activated through a tanh activation function to obtain the fusion feature s' = tanh(W·v_i + U·s), where W is the weight of the fully-connected layer used to convert the feature vector v_i and U is the weight of the fully-connected layer used to convert the scene semantic feature s.
The fused feature is weighted against the feature vector using the semantic compatibility score as coefficient, i.e. the updated feature vector v_i' is obtained from the weighted calculation v_i' = θ_i·v_i + (1 - θ_i)·s', where θ_i is the semantic compatibility score.
The new candidate target feature set is obtained from the updated feature vectors.
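Putting the embedding, compatibility scoring and gated update together, a sketch of the fusion network might read as follows; the sigmoid-of-inner-product form of θ_i and all dimensions are assumptions, since the score formula appears only as an image in the original publication:

```python
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    """Fuses candidate features v_i with the scene semantic feature s."""
    def __init__(self, obj_dim=1024, scene_dim=1024, embed_dim=256):
        super().__init__()
        self.T_v = nn.Linear(obj_dim, embed_dim)            # embeds v_i -> v_ie
        self.T_s = nn.Linear(scene_dim, embed_dim)          # embeds s   -> s_e
        self.W = nn.Linear(obj_dim, obj_dim, bias=False)    # converts v_i
        self.U = nn.Linear(scene_dim, obj_dim, bias=False)  # converts s

    def forward(self, v, s):
        # v: (n, obj_dim) candidate features; s: (scene_dim,) scene feature
        v_e = torch.sigmoid(self.T_v(v))            # v_ie = sigmoid(T_v v_i + b_v)
        s_e = torch.sigmoid(self.T_s(s))            # s_e  = sigmoid(T_s s + b_s)
        theta = torch.sigmoid(v_e @ s_e)            # (n,) compatibility scores (assumed form)
        fused = torch.tanh(self.W(v) + self.U(s))   # s' = tanh(W v_i + U s)
        theta = theta.unsqueeze(1)
        return theta * v + (1 - theta) * fused      # v_i' = theta_i v_i + (1 - theta_i) s'
```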
As an improvement of the above scheme, the inputting the new candidate target feature set and the candidate target feature set into the detection head network for classification and regression includes:
inputting any feature vector in the new candidate target feature set into a classification branch in the detection head network for prediction to obtain the category of the training target;
and inputting any feature vector of the candidate target feature set into a regression branch in the detection head network for prediction to obtain a boundary box of the training target.
Specifically, any feature vector in the new candidate target feature set is input into the classification branch of the detection head network for prediction to obtain the category of the training target; and any feature vector of the candidate target feature set is input into the regression branch of the detection head network for prediction to obtain the bounding box of the training target. In the training stage, the category and the bounding box are predicted repeatedly until the training of the target detection model is complete. A minimal sketch of such a two-branch head is given below.
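A minimal two-branch head consistent with this split (classification on the fused features, regression on the original ones); the output dimensions and the background class are assumptions:

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Classifies the fused features v_i' and regresses boxes from the original v_i."""
    def __init__(self, feat_dim=1024, num_classes=80):
        super().__init__()
        self.cls_branch = nn.Linear(feat_dim, num_classes + 1)  # +1 for background
        self.reg_branch = nn.Linear(feat_dim, 4)                # bounding-box deltas

    def forward(self, v_new, v):
        return self.cls_branch(v_new), self.reg_branch(v)
```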
In the testing stage, after the predicted categories and bounding boxes are obtained, the detection accuracy can be further improved by inputting them into a non-maximum suppression (NMS) algorithm to obtain the final detection result: categories and bounding boxes.
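For instance, with torchvision, given per-detection boxes and confidence scores from the head above (names assumed):

```python
from torchvision.ops import nms

# boxes:  (n, 4) predicted bounding boxes in (x1, y1, x2, y2) format
# scores: (n,)   confidence of the predicted category for each box
keep = nms(boxes, scores, iou_threshold=0.5)  # indices of detections to keep
final_boxes, final_scores = boxes[keep], scores[keep]
```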
To demonstrate the effect of the present invention, the method of the present invention was compared with the prior art method to obtain the comparative data of table 1 and the comparative effect plot of fig. 4.
TABLE 1 comparative data on the prior art method of MS-COCO data set and the method of the present invention
(Table 1 is reproduced only as an image in the original publication; its numerical data are not available in the text.)
In Table 1, the bold data are obtained by the method of the present invention and the non-bold data by existing methods. Compared with existing methods, the present invention improves overall accuracy, shows an obvious improvement on small targets, localizes more accurately and classifies more accurately.
In fig. 4, the upper row of images shows the results of the prior-art method and the lower row the results of the method of the invention; the images are from the COCO2017-val dataset. From left to right, the improvements in the detection results brought by the invention are: a falsely detected snowboard is eliminated, a falsely detected repeater is eliminated, and a previously missed surfboard is detected.
Referring to fig. 5, which is a schematic structural diagram of an object detection apparatus based on scene semantics according to the embodiment of the present invention, the apparatus includes:
the model construction module 11 is used for constructing a target detection model; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network;
the feature map extraction module 12 is configured to input a training image into the feature map extraction network to obtain a multi-scale feature map;
a scene semantic feature extraction module 13, configured to input the multi-scale feature map into the scene semantic feature extraction network to obtain a scene semantic feature;
a multi-label calculation module 14, configured to calculate a multi-label classification loss of scene prediction according to the scene semantic features;
a candidate target feature extraction module 15, configured to input the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set;
the fusion module 16 is configured to input the scene semantic features and the candidate target feature set into the fusion network for fusion, so as to obtain a new candidate target feature set;
a classification regression module 17, configured to input the new candidate target feature set and the candidate target feature set into the detection head network to perform classification and regression operations, and calculate corresponding classification loss and regression loss;
a training module 18, configured to train the target detection model in combination with the multi-label classification loss, the classification loss, and the regression loss, so as to obtain a trained target detection model;
and the test module 19 is configured to input the image to be tested into the trained target detection model, so as to obtain category information and position information of the target to be tested in the image to be tested.
The target detection device based on scene semantics provided by the embodiment of the present invention can implement all the processes of the target detection method based on scene semantics described in any one of the embodiments, and the functions and implemented technical effects of each module and unit in the device are respectively the same as those of the target detection method based on scene semantics described in the embodiment and implemented technical effects, and are not described herein again.
Referring to fig. 6, which is a schematic diagram of a terminal device provided in the embodiment of the present invention, the terminal device includes a processor 10, a memory 20, and a computer program stored in the memory 20 and configured to be executed by the processor 10, and when the processor 10 executes the computer program, the object detection method based on scene semantics according to any one of the above embodiments is implemented.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 20 and executed by the processor 10 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the segments being used to describe the execution process of the computer program in the scene-semantics-based target detection apparatus. For example, the computer program may be divided into a model construction module, a feature map extraction module, a scene semantic feature extraction module, a multi-label calculation module, a candidate target feature extraction module, a fusion module, a classification regression module, a training module and a test module, whose specific functions are as follows:
the model construction module 11 is used for constructing a target detection model; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network;
the feature map extraction module 12 is configured to input a training image into the feature map extraction network to obtain a multi-scale feature map;
a scene semantic feature extraction module 13, configured to input the multi-scale feature map into the scene semantic feature extraction network to obtain a scene semantic feature;
a multi-label calculation module 14, configured to calculate a multi-label classification loss of scene prediction according to the scene semantic features;
a candidate target feature extraction module 15, configured to input the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set;
the fusion module 16 is configured to input the scene semantic features and the candidate target feature set into the fusion network for fusion, so as to obtain a new candidate target feature set;
a classification regression module 17, configured to input the new candidate target feature set and the candidate target feature set into the detection head network to perform classification and regression operations, and calculate corresponding classification loss and regression loss;
a training module 18, configured to train the target detection model in combination with the multi-label classification loss, the classification loss, and the regression loss, so as to obtain a trained target detection model;
and the test module 19 is configured to input the image to be tested into the trained target detection model, so as to obtain category information and position information of the target to be tested in the image to be tested.
The terminal device can be a desktop computer, a notebook, a palmtop computer, a cloud server or another computing device. The terminal device may include, but is not limited to, a processor and a memory. It will be understood by those skilled in the art that fig. 6 is merely an example of a terminal device and does not limit the terminal device, which may include more or fewer components than shown, combine certain components, or use different components; for example, the terminal device may further include input-output devices, network access devices, a bus, etc.
The Processor 10 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor, or the processor 10 may be any conventional processor or the like; the processor 10 is the control center of the terminal device and connects the various parts of the whole terminal device with various interfaces and lines.
The memory 20 may be used to store the computer programs and/or modules, and the processor 10 implements various functions of the terminal device by running or executing the computer programs and/or modules stored in the memory 20 and calling data stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.). In addition, the memory 20 may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory card, at least one magnetic disk storage device, a Flash memory device, or another non-volatile solid-state storage device.
If the modules integrated in the terminal device are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be suitably increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, a device where the computer-readable storage medium is located is controlled to execute the target detection method based on scene semantics according to any one of the above embodiments.
To sum up, the target detection method, apparatus, terminal device and storage medium based on scene semantics provided by the embodiments of the present invention explicitly model and utilize the semantic information of the scene, and introduce a semantically adaptive way to fuse the scene semantic features with the features of the target itself so as to improve the classification discrimination of the target features. The detection accuracy of weak and small targets is thereby significantly improved, and the method effectively solves the problems that existing target detection methods, lacking flexible and effective use of scene context information, have difficulty detecting targets with fuzzy appearance and easily misclassify targets with ambiguous appearance.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (8)

1. A target detection method based on scene semantics is characterized by comprising the following steps:
constructing a target detection model; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network;
inputting a training image into the feature map extraction network to obtain a multi-scale feature map;
inputting the multi-scale feature map into the scene semantic feature extraction network to obtain scene semantic features;
inputting the scene semantic features into two fully-connected layers, outputting a predicted value of a scene category, and calculating multi-label classification loss of scene prediction according to the predicted value of the scene category;
inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set;
inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set;
inputting the new candidate target feature set into the detection head network for classification operation, and calculating corresponding classification loss; inputting the candidate target feature set into the detection head network to perform regression operation, and calculating corresponding regression loss;
training the target detection model by combining the multi-label classification loss, the classification loss and the regression loss to obtain a trained target detection model;
inputting an image to be detected into a trained target detection model to obtain category information and position information of a target to be detected in the image to be detected;
inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set, wherein the method specifically comprises the following steps:
embedding any feature vector in the candidate target feature set and the scene semantic features into the same dimension through a full connection layer respectively to obtain a first embedded vector and a second embedded vector correspondingly;
calculating a semantic compatibility score θ_i for the arbitrary feature vector and the scene semantic features according to the formula θ_i = σ(v_ie·s_e); where v_ie is the first embedded vector corresponding to the i-th feature vector, 1 ≤ i ≤ n, n is the total number of feature vectors in the candidate target feature set, and s_e is the second embedded vector corresponding to the scene semantic features;
and updating any feature vector according to the semantic compatibility score to obtain the new candidate target feature set.
2. The target detection method based on scene semantics of claim 1, wherein the inputting the scene semantic features into two fully-connected layers to obtain predicted values of scene categories, and calculating the multi-label classification loss of scene prediction according to the predicted values of the scene categories, specifically comprises:
inputting the scene semantic features into two fully-connected layers and outputting predicted values of scene categories;
according to

L_mll = -Σ_{c=1}^{C} [ y_c·log(ŷ_c) + (1 - y_c)·log(1 - ŷ_c) ]

computing the multi-label classification loss L_mll of scene prediction; where C is the total number of categories of target detection, y_c is the label value of whether or not an object of class c appears in the image, ŷ_c is the predicted value of whether or not a target of category c appears in the image, and σ(x) is the sigmoid function.
3. The target detection method based on scene semantics of claim 1, wherein the inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set specifically comprises:
inputting the multi-scale feature map into an RPN network to generate target proposals, then performing an ROI Align operation, and then obtaining the candidate target feature set through two fully-connected layers;
or inputting the multi-scale feature map into a 4-layer convolutional network to obtain the candidate target feature set.
4. The target detection method based on scene semantics of claim 1, wherein the updating of any one of the feature vectors according to the semantic compatibility score to obtain the new candidate target feature set specifically comprises:
converting the feature vector and the scene semantic feature each with a fully-connected layer, adding the results and activating with a tanh activation function to obtain the fusion feature s' = tanh(W·v_i + U·s); where W is the weight of the fully-connected layer used to convert the feature vector v_i, and U is the weight of the fully-connected layer used to convert the scene semantic feature s;
performing a weighted calculation on the feature vector according to v'_i = θ_i·v_i + (1 - θ_i)·s' to obtain the updated feature vector v'_i; where θ_i is the semantic compatibility score;
and obtaining the new candidate target feature set according to any updated feature vector.
5. The target detection method based on scene semantics as claimed in claim 1, wherein the inputting the new candidate target feature set into the detection head network for classification operation and the inputting the candidate target feature set into the detection head network for regression operation specifically includes:
inputting any feature vector in the new candidate target feature set into a classification branch in the detection head network for prediction to obtain the category of the training target;
and inputting any feature vector of the candidate target feature set into a regression branch in the detection head network for prediction to obtain a boundary box of the training target.
6. An object detection device based on scene semantics, comprising:
the model construction module is used for constructing a target detection model; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network;
the characteristic diagram extraction module is used for inputting the training image into the characteristic diagram extraction network to obtain a multi-scale characteristic diagram;
the scene semantic feature extraction module is used for inputting the multi-scale feature map into the scene semantic feature extraction network to obtain scene semantic features;
the multi-label calculation module is used for inputting the scene semantic features into the two fully-connected layers, outputting the predicted values of the scene categories and calculating the multi-label classification loss of the scene prediction according to the predicted values of the scene categories;
the candidate target feature extraction module is used for inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set;
the fusion module is used for inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set;
the classification regression module is used for inputting the new candidate target feature set into the detection head network for classification operation and calculating corresponding classification loss; inputting the candidate target feature set into the detection head network to perform regression operation, and calculating corresponding regression loss;
the training module is used for training the target detection model by combining the multi-label classification loss, the classification loss and the regression loss to obtain a trained target detection model;
the test module is used for inputting an image to be tested into the trained target detection model to obtain the category information and the position information of the target to be tested in the image to be tested;
inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set, wherein the method specifically comprises the following steps:
embedding any feature vector in the candidate target feature set and the scene semantic features into the same dimension through a full connection layer respectively to obtain a first embedded vector and a second embedded vector correspondingly;
calculating a semantic compatibility score θ_i for the arbitrary feature vector and the scene semantic features according to the formula θ_i = σ(v_ie·s_e); where v_ie is the first embedded vector corresponding to the i-th feature vector, 1 ≤ i ≤ n, n is the total number of feature vectors in the candidate target feature set, and s_e is the second embedded vector corresponding to the scene semantic features;
and updating any feature vector according to the semantic compatibility score to obtain the new candidate target feature set.
7. A terminal device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the scene semantics based object detection method according to any one of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium, comprising a stored computer program, wherein when the computer program runs, the computer-readable storage medium controls a device to execute the target detection method based on scene semantics according to any one of claims 1 to 5.
CN202110286154.1A 2021-03-17 2021-03-17 Target detection method, device and equipment based on scene semantics and storage medium Active CN112966697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110286154.1A CN112966697B (en) 2021-03-17 2021-03-17 Target detection method, device and equipment based on scene semantics and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110286154.1A CN112966697B (en) 2021-03-17 2021-03-17 Target detection method, device and equipment based on scene semantics and storage medium

Publications (2)

Publication Number Publication Date
CN112966697A (en) 2021-06-15
CN112966697B (en) 2022-03-11

Family

ID=76278919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110286154.1A Active CN112966697B (en) 2021-03-17 2021-03-17 Target detection method, device and equipment based on scene semantics and storage medium

Country Status (1)

Country Link
CN (1) CN112966697B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743459B (en) * 2021-07-29 2024-04-02 深圳云天励飞技术股份有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN113591771B (en) * 2021-08-10 2024-03-08 武汉中电智慧科技有限公司 Training method and equipment for object detection model of multi-scene distribution room
CN114445711B (en) * 2022-01-29 2023-04-07 北京百度网讯科技有限公司 Image detection method, image detection device, electronic equipment and storage medium
CN114359594B (en) * 2022-03-17 2022-08-19 杭州觅睿科技股份有限公司 Scene matching method and device, electronic equipment and storage medium
CN114782797B (en) * 2022-06-21 2022-09-20 深圳市万物云科技有限公司 House scene classification method, device and equipment and readable storage medium
CN114972947B (en) * 2022-07-26 2022-12-06 之江实验室 Depth scene text detection method and device based on fuzzy semantic modeling
CN114998357B (en) * 2022-08-08 2022-11-15 长春摩诺维智能光电科技有限公司 Industrial detection method, system, terminal and medium based on multi-information analysis
CN115527070B (en) * 2022-11-01 2023-05-19 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Traffic scene-based target detection method, device, equipment and storage medium
CN115761607B (en) * 2022-11-17 2023-10-10 人工智能与数字经济广东省实验室(深圳) Target identification method, device, terminal equipment and readable storage medium
CN115661584B (en) * 2022-11-18 2023-04-07 浙江莲荷科技有限公司 Model training method, open domain target detection method and related device
CN118015385B (en) * 2024-04-08 2024-07-05 山东浪潮科学研究院有限公司 Long-tail target detection method, device and medium based on multi-mode model

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9916522B2 (en) * 2016-03-11 2018-03-13 Kabushiki Kaisha Toshiba Training constrained deconvolutional networks for road scene semantic segmentation
CN109934163B (en) * 2018-12-27 2022-07-08 北京航空航天大学 Aerial image vehicle detection method based on scene prior and feature re-fusion
CN109919000A (en) * 2019-01-23 2019-06-21 杭州电子科技大学 A kind of Ship Target Detection method based on Multiscale Fusion strategy
CN109902629A (en) * 2019-03-01 2019-06-18 成都康乔电子有限责任公司 A kind of real-time vehicle target detection model under vehicles in complex traffic scene
CN110246141B (en) * 2019-06-13 2022-10-21 大连海事大学 Vehicle image segmentation method based on joint corner pooling under complex traffic scene
CN110334705B (en) * 2019-06-25 2021-08-03 华中科技大学 Language identification method of scene text image combining global and local information
CN111027493B (en) * 2019-12-13 2022-05-20 电子科技大学 Pedestrian detection method based on deep learning multi-network soft fusion
CN111091099A (en) * 2019-12-20 2020-05-01 京东方科技集团股份有限公司 Scene recognition model construction method, scene recognition method and device
CN111275688B (en) * 2020-01-19 2023-12-12 合肥工业大学 Small target detection method based on context feature fusion screening of attention mechanism
CN111476089B (en) * 2020-03-04 2023-06-23 上海交通大学 Pedestrian detection method, system and terminal for multi-mode information fusion in image
CN111680759B (en) * 2020-06-16 2022-05-10 西南交通大学 Power grid inspection insulator detection classification method
CN111598968B (en) * 2020-06-28 2023-10-31 腾讯科技(深圳)有限公司 Image processing method and device, storage medium and electronic equipment
CN111898439B (en) * 2020-06-29 2022-06-07 西安交通大学 Deep learning-based traffic scene joint target detection and semantic segmentation method
CN111783683A (en) * 2020-07-03 2020-10-16 北京视甄智能科技有限公司 Human body detection method based on feature balance and relationship enhancement
CN111899172A (en) * 2020-07-16 2020-11-06 武汉大学 Vehicle target detection method oriented to remote sensing application scene
CN112434618B (en) * 2020-11-26 2023-06-23 西安电子科技大学 Video target detection method, storage medium and device based on sparse foreground priori

Also Published As

Publication number Publication date
CN112966697A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
CN112966697B (en) Target detection method, device and equipment based on scene semantics and storage medium
US11436739B2 (en) Method, apparatus, and storage medium for processing video image
CN110020592B (en) Object detection model training method, device, computer equipment and storage medium
CN111340195B (en) Training method and device for network model, image processing method and storage medium
CN110781784A (en) Face recognition method, device and equipment based on double-path attention mechanism
CN112990432A (en) Target recognition model training method and device and electronic equipment
CN109086811A (en) Multi-tag image classification method, device and electronic equipment
CN111652181B (en) Target tracking method and device and electronic equipment
CN112699832B (en) Target detection method, device, equipment and storage medium
CN113298152A (en) Model training method and device, terminal equipment and computer readable storage medium
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111046949A (en) Image classification method, device and equipment
CN111428567B (en) Pedestrian tracking system and method based on affine multitask regression
CN117710921A (en) Training method, detection method and related device of target detection model
CN117496399A (en) Clustering method, system, equipment and medium for detecting moving target in video
CN116958873A (en) Pedestrian tracking method, device, electronic equipment and readable storage medium
CN111160219B (en) Object integrity evaluation method and device, electronic equipment and storage medium
CN113947771B (en) Image recognition method, apparatus, device, storage medium, and program product
CN113609948B (en) Method, device and equipment for detecting video time sequence action
CN114359572A (en) Training method and device of multi-task detection model and terminal equipment
CN113837236A (en) Method and device for identifying target object in image, terminal equipment and storage medium
CN113705643A (en) Target detection method and device and electronic equipment
CN114155420B (en) Scene recognition model training method, device, equipment and medium
CN116580063B (en) Target tracking method, target tracking device, electronic equipment and storage medium
CN116912920B (en) Expression recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant