CN112966697B

CN112966697B - Target detection method, device and equipment based on scene semantics and storage medium

Info

Publication number: CN112966697B
Application number: CN202110286154.1A
Authority: CN
Inventors: 谢雪梅; 刘卓; 李旭阳
Original assignee: Guangzhou Institute of Technology of Xidian University
Current assignee: Guangzhou Institute of Technology of Xidian University
Priority date: 2021-03-17
Filing date: 2021-03-17
Publication date: 2022-03-11
Anticipated expiration: 2041-03-17
Also published as: CN112966697A

Abstract

The invention discloses a target detection method, a device, terminal equipment and a storage medium based on scene semantics, wherein the method comprises the following steps: constructing a target detection model; inputting the training image into a feature map extraction network to obtain a multi-scale feature map; inputting the multi-scale feature map into a scene semantic feature extraction network to obtain scene semantic features; calculating the multi-label classification loss of scene prediction according to the scene semantic features; inputting the multi-scale feature map into a candidate target feature extraction network to obtain a candidate target feature set; inputting the scene semantic features and the candidate target feature set into a fusion network for fusion to obtain a new candidate target feature set; inputting a detection head network to perform classification and regression operation, and calculating classification loss and regression loss; training a target detection model by combining the three loss functions; and inputting the image to be detected into the trained target detection model to obtain a detection result. The invention can solve the problem that the target with fuzzy appearance is difficult to identify at present.

Description

Target detection method, device and equipment based on scene semantics and storage medium

Technical Field

The invention relates to the field of computer visual image target detection, in particular to a target detection method and device based on scene semantics, terminal equipment and a storage medium.

Background

In the existing target detection method, most of the internal features of the target to be detected are extracted through a convolutional neural network, then the extracted features are classified, and then the detection of the target to be detected is realized. However, these methods ignore global scene semantic information in the detection process, and in a real scene, the appearance of an object is often related to the scene, so that it is difficult for the existing object detection method to identify an object with a fuzzy or unobvious appearance.

Disclosure of Invention

The embodiment of the invention aims to provide a target detection method, a device, a terminal device and a storage medium based on scene semantics, which can make full use of scene semantic information and effectively solve the problem that the target with fuzzy appearance is difficult to identify in the existing detection method.

In order to achieve the above object, an embodiment of the present invention provides a target detection method based on scene semantics, including:

constructing a target detection model; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network;

inputting a training image into the feature map extraction network to obtain a multi-scale feature map;

inputting the multi-scale feature map into the scene semantic feature extraction network to obtain scene semantic features;

calculating the multi-label classification loss of scene prediction according to the scene semantic features;

inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set;

inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set;

inputting the new candidate target feature set and the candidate target feature set into the detection head network for classification and regression operation, and calculating corresponding classification loss and regression loss;

training the target detection model by combining the multi-label classification loss, the classification loss and the regression loss to obtain a trained target detection model;

inputting the image to be detected into the trained target detection model to obtain the category information and the position information of the target to be detected in the image to be detected.

Preferably, the calculating the multi-label classification loss of the scene prediction according to the scene semantic features specifically includes:

inputting the scene semantic features into two fully-connected layers and outputting predicted values of scene categories;

according to

Computing multi-label classification loss L for scene predictions_mll(ii) a Where C is the total number of categories of target detection, y^cIs the label value of whether or not an object of class c appears in the image,

is a predicted value of whether or not an object of category c appears in the image.

Preferably, the inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set specifically includes:

inputting the multi-scale feature map into an RPN network to generate target Propusals, then performing ROI Align operation, and then obtaining the candidate target feature set through two full-connection layers;

or inputting the multi-scale feature map into a 4-layer convolutional network to obtain the candidate target feature set. Preferably, the inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set specifically includes:

embedding any feature vector in the candidate target feature set and the scene semantic features into the same dimension through a full connection layer respectively to obtain a first embedded vector and a second embedded vector correspondingly;

according to the formula

Calculating a semantic compatibility score θ for the arbitrary feature vector and the scene semantic features_i(ii) a Wherein v is_ieI is more than or equal to 1 and less than or equal to n, n is the total number of the feature vectors in the candidate target feature set,_ea second embedding vector corresponding to the scene semantic features;

and updating any feature vector according to the semantic compatibility score to obtain the new candidate target feature set.

Preferably, the updating any feature vector according to the semantic compatibility score to obtain the new candidate target feature set specifically includes:

converting any feature vector and the scene semantic feature by using a full connection layer respectively, adding and activating by using a tanh activation function to obtain a fusion feature of s' ═ tan (Wv)_i+ Us); wherein W is used to convert any of the feature vectors v_iU is the weight of the full link layer used to transform the scene semantic features s;

according to v_i′＝θ_iv_i+(1-θ_i) s' performs weighted calculation on any eigenvector to obtain any updated eigenvector v_i'; wherein, theta_iScoring the semantic compatibility;

and obtaining the new candidate target feature set according to any updated feature vector.

Preferably, the inputting the new candidate target feature set and the candidate target feature set into the detection head network for classification and regression includes:

inputting any feature vector in the new candidate target feature set into a classification branch in the detection head network for prediction to obtain a scene category of a training target;

and inputting any feature vector of the candidate target feature set into a regression branch in the detection head network for prediction to obtain a boundary box of the training target.

Another embodiment of the present invention provides a target detection apparatus based on scene semantics, including:

the model construction module is used for constructing a target detection model; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network;

the characteristic diagram extraction module is used for inputting the training image into the characteristic diagram extraction network to obtain a multi-scale characteristic diagram;

the scene semantic feature extraction module is used for inputting the multi-scale feature map into the scene semantic feature extraction network to obtain scene semantic features;

the multi-label calculation module is used for calculating the multi-label classification loss of scene prediction according to the scene semantic features;

the candidate target feature extraction module is used for inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set;

the fusion module is used for inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set;

the classification regression module is used for inputting the new candidate target feature set and the candidate target feature set into the detection head network for classification and regression operation, and calculating corresponding classification loss and regression loss;

the training module is used for training the target detection model by combining the multi-label classification loss, the classification loss and the regression loss to obtain a trained target detection model;

and the test module is used for inputting the image to be tested into the trained target detection model to obtain the category information and the position information of the target to be tested in the image to be tested.

Another embodiment of the present invention provides a terminal device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the processor implements the scene semantic-based object detection method according to any one of the above items.

Another embodiment of the present invention provides a computer-readable storage medium, which includes a stored computer program, where when the computer program runs, a device in which the computer-readable storage medium is located is controlled to execute any one of the above-mentioned methods for detecting an object based on scene semantics.

Compared with the prior art, the target detection method, the device, the terminal equipment and the storage medium based on the scene semantics, provided by the embodiment of the invention, display and utilize the semantic information of the scene during modeling, and introduce a semantic self-adaptive mode to fuse the scene semantic characteristics and the characteristics of the target so as to improve the classification discrimination capability of the target characteristics, so that the detection precision of weak and small targets can be obviously improved, and the problems that the existing target detection method lacks flexible and effective utilization of the context information of the scene, so that the target with fuzzy appearance is difficult to detect and the target with ambiguous appearance is easy to be misclassified can be effectively solved.

Drawings

Fig. 1 is a schematic flowchart of a target detection method based on scene semantics according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a target detection model according to an embodiment of the present invention;

fig. 3 is a schematic flowchart of a testing stage in a target detection method based on scene semantic detection according to an embodiment of the present invention;

FIG. 4 is a comparison graph of the detection effect of the present invention on the data set COCO2017-val using the present invention;

fig. 5 is a schematic structural diagram of an object detection apparatus based on scene semantics according to an embodiment of the present invention;

fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, which is a schematic flowchart of a target detection method based on scene semantics according to the embodiment of the present invention, the method includes steps S1 to S9:

s1, constructing a target detection model; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network;

s2, inputting the training image into the feature map extraction network to obtain a multi-scale feature map;

s3, inputting the multi-scale feature map into the scene semantic feature extraction network to obtain scene semantic features;

s4, calculating the multi-label classification loss of scene prediction according to the scene semantic features;

s5, inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set;

s6, inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set;

s7, inputting the new candidate target feature set and the candidate target feature set into the detection head network for classification and regression operation, and calculating corresponding classification loss and regression loss;

s8, training the target detection model by combining the multi-label classification loss, the classification loss and the regression loss to obtain a trained target detection model;

and S9, inputting the image to be detected into the trained target detection model to obtain the category information and the position information of the target to be detected in the image to be detected.

Specifically, a target detection model is constructed; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network. Fig. 2 is a schematic structural diagram of a target detection model according to the embodiment of the present invention.

And inputting the training image into a feature map extraction network to obtain a multi-scale feature map. Generally, the training images need to be labeled before being input, and the labels include the positions of the training targets, the classes and scene classes determined by the classes of the training targets. Preferably, the feature map extraction network is a ResNet network + FPN network. The multiscale feature map is defined as { p }₂，p₃，p₄，p₅，p₆And the corresponding down sampling rates are 4, 8, 16, 32 and 64.

And inputting the multi-scale feature map into a scene semantic feature extraction network to obtain scene semantic features. Because the down-sampling rate of p2-p5 is low, the corresponding receptive field is small, and the receptive field corresponding to the feature map of p6 is maximum, the feature map p6 can represent the global feature of the whole image, so that the scene semantic feature extraction network is used for inputting the feature map p6, and the subsequent scene semantic feature extraction is facilitated. Preferably, a 3x3 layer of convolutional layers is used for the feature map p6, then a global pooling operation is used, and then a layer of fully-connected layers is used to output the scene semantic features s.

And calculating the multi-label classification loss of scene prediction according to the scene semantic features. This step is needed for training the model, and is not needed for image testing after the model training is completed.

And inputting the multi-scale feature map into a candidate target feature extraction network to obtain a candidate target feature set. The candidate target feature set is a set formed by each feature point on the multi-scale feature map, and the feature points are all possible training targets, namely the candidate set.

And inputting the scene semantic features and the candidate target feature set into a fusion network for fusion to obtain fusion features, and updating each feature vector of the candidate target feature set according to the fusion features to obtain a new candidate target feature set.

And inputting the new candidate target feature set and the candidate target feature set into a detection head network for classification and regression operation, and calculating corresponding classification loss and regression loss. The detection header network comprises two branches: classification branch and regression branch.

Training the target detection model by combining multi-label classification Loss, classification Loss and regression Loss to obtain the trained target detection model, so that the total Loss function is Loss ═ L_mll+L_cls+L_regWherein L is_mllFor multi-label classification loss, L_clsTo detect loss functions of the classification branches in the head network, L_regTo detect loss functions of regression branches in the head network. Preferably, the target detection model is trained using a stochastic gradient descent algorithm.

And inputting the image to be detected into the trained target detection model to obtain the category information and the position information of the target to be detected in the image to be detected. Generally, the location information is determined by a bounding box. The testing process includes inputting a given image, extracting multi-scale features of the given image, obtaining scene semantic features and candidate target features, calculating semantic compatibility equal parts of the scene semantic features and the candidate target features, fusing the candidate target features and the scene semantic features, inputting the candidate target features into a detection classification branch and a regression branch, and obtaining a detection result, specifically referring to fig. 3, which is a flow diagram of a testing stage in the target detection method based on scene semantic detection provided by the embodiment of the invention.

The embodiment of the invention provides the target detection method based on the scene semantics, which can fully utilize the context scene semantic information and effectively solve the problem that the target with the fuzzy appearance is difficult to identify in the existing detection method.

As an improvement of the above scheme, the calculating a multi-label classification loss of scene prediction according to the scene semantic features specifically includes:

according to

Specifically, scene semantic features are input into two fully-connected layers, and predicted values of scene categories are output;

according to

Computing multi-label classification loss L for scene predictions_mll(ii) a Wherein C is the total number of categories of target detection, and the scene category corresponding to the image is y ∈ R^c，y^cA label value, y, of whether an object of class c appears in the image^c1 indicates that an object of the category i appears in the image, and 0 indicates that an object of the category i does not appear in the image.

Is a predicted value of whether or not an object of class c appears in the picture,

σ (x) is a sigmoid function.

As an improvement of the above scheme, the inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set specifically includes:

or inputting the multi-scale feature map into a 4-layer convolutional network to obtain the candidate target feature set. In particular, when applied to a two-stage detector such as FPN, the multiscale feature map is passed through the RPN network to generate the target Propusals and then ROI Align operation is performed,then, a set of all candidate target characteristic vectors is obtained through two full-connection layers

Or, when the method is applied to a single-stage detector such as FCOS, the candidate target feature set is a set formed by each feature point on the feature map obtained by passing the multi-scale feature map through a 4-layer convolutional network.

As an improvement of the above scheme, the inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set specifically includes:

according to the formula

Calculating a semantic compatibility score θ for the arbitrary feature vector and the scene semantic features_i(ii) a Wherein v is_ieI is more than or equal to 1 and less than or equal to n which is a first embedded vector corresponding to the ith feature vector, n is the total number of the feature vectors in the candidate target feature set, s_eA second embedding vector corresponding to the scene semantic features;

Specifically, any feature vector v in the candidate target feature set is used_iEmbedding the scene semantic features s into the same dimension through a full connection layer respectively to obtain a first embedded vector v correspondingly_ieAnd a second embedding vector s_e. The calculation formula is as follows:

v_ie＝σ(T_vv_i+b_v)

s_e＝σ(T_ss+b_s)

wherein σ (x) is sigmoid functionNumber, T_vFor corresponding feature vector v_iWeight of the full connection layer of (1), T_sAs a weight of a fully connected layer corresponding to said scene semantic feature s, b_vAnd b_sIs a bias factor.

According to the formula

Calculating semantic compatibility score theta of any feature vector and scene semantic features_i(ii) a Wherein v is_ieI is more than or equal to 1 and less than or equal to n which is a first embedded vector corresponding to the ith feature vector, n is the total number of the feature vectors in the candidate target feature set, s_eA second embedded vector corresponding to the scene semantic features;

and updating any feature vector according to the semantic compatibility score to obtain a new candidate target feature set, namely guiding to update the candidate target feature set through the semantic compatibility score.

As an improvement of the above scheme, the updating any feature vector according to the semantic compatibility score to obtain the new candidate target feature set specifically includes:

Specifically, any feature vector and scene semantic feature are converted by using a full connection layer respectively, and are added and activated through a tanh activation function to obtain a fusion feature of s'＝tan(Wv_i+ Us); wherein W is used to convert any feature vector v_iU is the weight of the fully-connected layer used to transform the scene semantic features s.

Weighting the fused features with the feature vector using the semantic compatibility score as a coefficient, i.e. according to v_i′＝θ_iv_i+(1-θ_i) s' performs weighted calculation on any eigenvector to obtain any updated eigenvector v_i'; wherein, theta_iIs a semantic compatibility score.

And obtaining a new candidate target feature set according to any updated feature vector.

As an improvement of the above scheme, the inputting the new candidate target feature set and the candidate target feature set into the detection head network for classification and regression includes:

Specifically, any feature vector in the new candidate target feature set is input into a classification branch in a detection head network for prediction, and a scene category of a training target is obtained; and inputting any feature vector of the candidate target feature set into a regression branch in the detection head network for prediction to obtain a boundary box of the training target. In the training stage, the scene category and the boundary box are predicted for multiple times until the training of the target detection model is completed.

In the testing stage, after the predicted scene category and the boundary box are obtained, in order to further improve the accuracy of the detection, the predicted scene category and the predicted boundary box may be input into a non-maximum suppression algorithm (NMS) to obtain a final detection result: scene categories and bounding boxes.

To demonstrate the effect of the present invention, the method of the present invention was compared with the prior art method to obtain the comparative data of table 1 and the comparative effect plot of fig. 4.

TABLE 1 comparative data on the prior art method of MS-COCO data set and the method of the present invention

In table 1, the bold data is obtained by the method of the present invention, and the non-bold data is obtained by the existing method, so that compared with the existing method, the present invention has the advantages of improved overall accuracy, obvious improvement on small targets, more accurate positioning and more accurate classification.

In fig. 4, the previous line of images is the result according to the prior art method and the next line is the result according to the method of the invention, the images being from the data set COCO 2017-val. From left to right, the invention improves the detection result into that: the false detection snowboard is eliminated, the false detection repeater is eliminated, and the missing detection surfboard is detected.

Referring to fig. 5, which is a schematic structural diagram of an object detection apparatus based on scene semantics according to the embodiment of the present invention, the apparatus includes:

the model construction module 11 is used for constructing a target detection model; the target detection model comprises a feature map extraction network, a scene semantic feature extraction network, a candidate target feature extraction network, a fusion network and a detection head network;

the feature map extraction module 12 is configured to input a training image into the feature map extraction network to obtain a multi-scale feature map;

a scene semantic feature extraction module 13, configured to input the multi-scale feature map into the scene semantic feature extraction network to obtain a scene semantic feature;

a multi-label calculation module 14, configured to calculate a multi-label classification loss of scene prediction according to the scene semantic features;

a candidate target feature extraction module 15, configured to input the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set;

the fusion module 16 is configured to input the scene semantic features and the candidate target feature set into the fusion network for fusion, so as to obtain a new candidate target feature set;

a classification regression module 17, configured to input the new candidate target feature set and the candidate target feature set into the detection head network to perform classification and regression operations, and calculate corresponding classification loss and regression loss;

a training module 18, configured to train the target detection model in combination with the multi-label classification loss, the classification loss, and the regression loss, so as to obtain a trained target detection model;

and the test module 19 is configured to input the image to be tested into the trained target detection model, so as to obtain category information and position information of the target to be tested in the image to be tested.

The target detection device based on scene semantics provided by the embodiment of the present invention can implement all the processes of the target detection method based on scene semantics described in any one of the embodiments, and the functions and implemented technical effects of each module and unit in the device are respectively the same as those of the target detection method based on scene semantics described in the embodiment and implemented technical effects, and are not described herein again.

Referring to fig. 6, which is a schematic diagram of a terminal device provided in the embodiment of the present invention, the terminal device includes a processor 10, a memory 20, and a computer program stored in the memory 20 and configured to be executed by the processor 10, and when the processor 10 executes the computer program, the object detection method based on scene semantics according to any one of the above embodiments is implemented.

Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 20 and executed by the processor 10 to implement the present invention. One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of a computer program in a scene semantics based object detection. For example, the computer program may be divided into a model building module, a feature map extraction module, a scene semantic feature extraction module, a multi-label calculation module, a candidate target feature extraction module, a fusion module, a classification regression module, a training module, and a test module, and each module has the following specific functions:

The terminal device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The terminal device may include, but is not limited to, a processor, a memory. It will be understood by those skilled in the art that the schematic diagram 6 is merely an example of a terminal device, and is not intended to limit the terminal device, and may include more or less components than those shown, or some components may be combined, or different components, for example, the terminal device may further include an input-output device, a network access device, a bus, etc.

The Processor 10 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor or the processor 10 may be any conventional processor or the like, the processor 10 being the control center of the terminal device and connecting the various parts of the whole terminal device with various interfaces and lines.

The memory 20 may be used to store the computer programs and/or modules, and the processor 10 implements various functions of the terminal device by running or executing the computer programs and/or modules stored in the memory 20 and calling data stored in the memory 20. The memory 20 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory 20 may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

Wherein, the module integrated with the terminal device can be stored in a computer readable storage medium if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and can implement the steps of the embodiments of the method when the computer program is executed by a processor. The computer program includes computer program code, and the computer program code may be in a source code form, an object code form, an executable file or some intermediate form. The computer readable medium may include: any entity or device capable of carrying computer program code, recording medium, U.S. disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution media, and the like. It should be noted that the computer readable medium may contain other components which may be suitably increased or decreased as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.

The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, a device where the computer-readable storage medium is located is controlled to execute the target detection method based on scene semantics according to any one of the above embodiments.

To sum up, the scene semantic-based object detection method, device, terminal device and storage medium provided by the embodiments of the present invention display and utilize semantic information of a scene during modeling, and introduce a semantic adaptive manner to fuse scene semantic features and features of an object itself so as to improve classification discrimination capability of object features, thereby significantly improving detection accuracy of weak and small objects, and effectively solving the problems that the existing object detection method lacks flexible and effective utilization of scene context information, thereby resulting in difficulty in detecting objects with fuzzy appearance and easiness in object misclassification with ambiguous appearance.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A target detection method based on scene semantics is characterized by comprising the following steps:

inputting the scene semantic features into two fully-connected layers, outputting a predicted value of a scene category, and calculating multi-label classification loss of scene prediction according to the predicted value of the scene category;

inputting the new candidate target feature set into the detection head network for classification operation, and calculating corresponding classification loss; inputting the candidate target feature set into the detection head network to perform regression operation, and calculating corresponding regression loss;

inputting an image to be detected into a trained target detection model to obtain category information and position information of a target to be detected in the image to be detected;

inputting the scene semantic features and the candidate target feature set into the fusion network for fusion to obtain a new candidate target feature set, wherein the method specifically comprises the following steps:

according to the formula

2. The method for detecting targets based on scene semantics of claim 1, wherein the step of inputting the scene semantics features into two fully-connected layers to obtain predicted values of scene categories, and calculating multi-label classification losses of scene predictions according to the predicted values of the scene categories specifically comprises:

according to

Computing multi-label classification loss for scene predictionsL_mll(ii) a Where C is the total number of categories of target detection, y^cIs the label value of whether or not an object of class c appears in the image,

σ (x) is a sigmoid function, which is a predicted value of whether or not a target of the category c appears in the image.

3. The target detection method based on scene semantics of claim 1, wherein the inputting the multi-scale feature map into the candidate target feature extraction network to obtain a candidate target feature set specifically comprises:

or inputting the multi-scale feature map into a 4-layer convolutional network to obtain the candidate target feature set.

4. The target detection method based on scene semantics of claim 1, wherein the updating of any one of the feature vectors according to the semantic compatibility score to obtain the new candidate target feature set specifically comprises:

according to v'_i＝θ_iv_i+(1-θ_i) s 'carries out weighted calculation on any feature vector to obtain any updated feature vector v'_i(ii) a Wherein, theta_iScoring the semantic compatibility;

5. The target detection method based on scene semantics as claimed in claim 1, wherein the inputting the new candidate target feature set into the detection head network for classification operation and the inputting the candidate target feature set into the detection head network for regression operation specifically includes:

6. An object detection device based on scene semantics, comprising:

the multi-label calculation module is used for inputting the scene semantic features into the two fully-connected layers, outputting the predicted values of the scene categories and calculating the multi-label classification loss of the scene prediction according to the predicted values of the scene categories;

the classification regression module is used for inputting the new candidate target feature set into the detection head network for classification operation and calculating corresponding classification loss; inputting the candidate target feature set into the detection head network to perform regression operation, and calculating corresponding regression loss;

the test module is used for inputting an image to be tested into the trained target detection model to obtain the category information and the position information of the target to be tested in the image to be tested;

according to the formula

7. A terminal device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the scene semantics based object detection method according to any one of claims 1 to 5 when executing the computer program.

8. A computer-readable storage medium, comprising a stored computer program, wherein when the computer program runs, the computer-readable storage medium controls a device to execute the target detection method based on scene semantics according to any one of claims 1 to 5.