CN111310604A - Object detection method and device and storage medium - Google Patents

Object detection method and device and storage medium Download PDF

Info

Publication number
CN111310604A
Authority
CN
China
Prior art keywords
detected
image
objects
different domains
different
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010072238.0A
Other languages
Chinese (zh)
Inventor
徐航
周峰暐
黎嘉伟
梁小丹
李震国
钱莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010072238.0A priority Critical patent/CN111310604A/en
Publication of CN111310604A publication Critical patent/CN111310604A/en
Priority to PCT/CN2020/112796 priority patent/WO2021147325A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/35Categorising the entire scene, e.g. birthday party or wedding scene

Abstract

The application discloses an object detection method and device, relating to the field of artificial intelligence and in particular to the field of computer vision. The method may include: acquiring an image to be detected; determining initial image features of an object to be detected in the image to be detected; determining enhanced image features of the object to be detected according to cross-domain knowledge graph information, where the cross-domain knowledge graph information includes association relationships between object categories corresponding to the objects to be detected in different domains, and the enhanced image features indicate semantic information of the object categories corresponding to other objects associated with the object to be detected in different domains; and determining candidate frames and classifications of the object to be detected according to the initial image features and the enhanced image features of the object to be detected. By constructing a cross-domain knowledge graph, the technical solution provided by the application can capture the internal relationships between different objects to be detected and improve the object detection effect.

Description

Object detection method and device and storage medium
Technical Field
The present application relates to the field of computer vision, and in particular, to an object detection method, device and storage medium.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, AI basic theory, and the like.
Object detection is a basic computer vision task that can identify the location and class of objects in an image. In practical applications, researchers and engineers may create data sets for different specific problems for training highly customized and unique automatic object detectors according to different application scenarios and actual task requirements.
Disclosure of Invention
Object detection across datasets is an efficient way to achieve large-scale object detection. However, in the existing multi-task learning, a plurality of tasks are processed simultaneously only by adding a plurality of branches into a model, interaction between different data sets and different object types cannot be realized, and an internal relationship between objects to be detected in different data sets cannot be captured, so that the effect is poor.
The application provides an object detection method, an object detection device and a computer storage medium, which are used for improving the object detection effect.
A first aspect of the present application provides an object detection method, which may include: and acquiring an image to be detected. And determining the initial image characteristics of the object to be detected in the image to be detected. Determining enhanced image characteristics of the object to be detected according to the cross-domain knowledge graph information, wherein the cross-domain knowledge graph information can comprise an association relation between object categories corresponding to the object to be detected in different domains, and the enhanced image characteristics indicate semantic information of object categories corresponding to other objects associated with the object to be detected in different domains. And determining candidate frames and classification of the object to be detected according to the initial image characteristics of the object to be detected and the enhanced image characteristics of the object to be detected.
The object detection method can be applied to different application scenes, for example, the object detection method can be applied to a scene for identifying all things, and can also be applied to a scene for identifying street view.
When the method is applied to identifying scenes of all things by using the mobile terminal, the image to be detected can be an image shot by the mobile terminal through a camera or an image stored in an album of the mobile terminal.
When the method is applied to a street view recognition scene, the image to be detected can be a street view image shot by a roadside camera.
The greater the probability that two categories appear in the same image at the same time, the more likely the two categories are considered to have an association relationship. For example, the object categories in the first domain (or the first data set) include man, woman, boy, girl, road, and street, and the object categories in the second domain include person, handbag, bag, car, and truck. It may then be considered that the man, woman, boy, and girl in the first domain have an association relationship with the person in the second domain; the woman and girl in the first domain have an association relationship with the handbag in the second domain; and the road and street in the first domain have an association relationship with the car and truck in the second domain.
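As an illustrative sketch only (not part of the claimed method), such co-occurrence-based association relationships could be estimated roughly as follows; the category names, counting rule, and threshold are hypothetical:

```python
import itertools
from collections import Counter

def cooccurrence_weights(image_labels, threshold=0.1):
    """Estimate association weights between categories from how often two
    categories appear together in the same image (hypothetical sketch)."""
    pair_counts = Counter()
    cat_counts = Counter()
    for labels in image_labels:                # labels: set of category names in one image
        cat_counts.update(labels)
        for a, b in itertools.combinations(sorted(labels), 2):
            pair_counts[(a, b)] += 1
    weights = {}
    for (a, b), n_ab in pair_counts.items():
        # symmetrised conditional co-occurrence frequency
        w = 0.5 * (n_ab / cat_counts[a] + n_ab / cat_counts[b])
        if w >= threshold:
            weights[(a, b)] = w
    return weights

# images pooled from two hypothetical data sets ("domains")
images = [{"woman", "handbag"}, {"girl", "handbag"}, {"street", "car"}, {"road", "truck"}]
print(cooccurrence_weights(images, threshold=0.0))
```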
Semantic information may refer to high-level information that can assist image detection. For example, the semantic information may specifically be what the object is and what is around the object (semantic information is generally distinguished from low-level information such as the edges, pixels, and brightness of the image). For example, if the object to be detected is a woman and the other objects associated with the woman in the image to be detected include a handbag, then the enhanced image feature of the object to be detected may indicate semantic information of the handbag.
According to the first aspect, the scheme provided by the application can effectively use a large number of different data sets and different categories of information to train the same network, which greatly improves data utilization and yields higher detection performance.
Optionally, with reference to the first aspect, in a first possible implementation manner, the cross-domain knowledge graph may include nodes and node connecting edges, where the nodes correspond to the objects to be detected, and the node connecting edges correspond to relationships between high-level semantic features of different objects to be detected, and the method may further include: and acquiring classification layer parameters corresponding to different domains. And weighting and fusing classification layer parameters corresponding to different domains according to the classification weights of the initial image features in the different domains on different object classes to obtain the high-level semantic features of the object to be detected. And projecting the relation weight between the object categories corresponding to the objects to be detected in different domains to the node connecting edges of the objects to be detected to obtain the weight of the node connecting edges.
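A minimal sketch of the fusion step described above, assuming that each domain's classification layer is stored as one parameter vector per category (a class center) and that the classification weights are softmax scores of the initial image feature; the names and shapes are assumptions rather than the patent's notation:

```python
import numpy as np

def high_level_semantic_feature(f_init, domain_classifiers):
    """Weight and fuse the classification-layer parameters of several domains
    according to how strongly the initial feature responds to each category."""
    fused = []
    for W in domain_classifiers:          # W: (num_classes_in_domain, feat_dim)
        scores = W @ f_init               # classification scores in this domain
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()              # classification weights over categories
        fused.append(probs @ W)           # weighted sum of per-class parameter vectors
    return np.concatenate(fused)          # high-level semantic feature of the node

feat_dim = 8
f_init = np.random.randn(feat_dim)                      # initial image feature of one object
domains = [np.random.randn(4, feat_dim),                # classification layer of domain 1
           np.random.randn(6, feat_dim)]                # classification layer of domain 2
node_feature = high_level_semantic_feature(f_init, domains)
```

The relation weights between categories would then be projected onto the edges connecting such nodes, giving the edge weights used in the graph convolution below.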
Optionally, with reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, the method may further include: and determining the relation weight according to the distance relation between the object categories corresponding to the objects to be detected in different domains.
Optionally, with reference to the second possible implementation manner of the first aspect, in a third possible implementation manner, the distance relationship between the object categories corresponding to the objects to be detected may include one or more of the following: attribute relationships between the object categories corresponding to the objects to be detected in different domains; positional relationships or subject-object relationships between the object categories corresponding to the objects to be detected in different domains; word-embedding similarity, constructed using linguistic knowledge, between the object categories corresponding to the objects to be detected in different domains; and distance relationships between the object categories corresponding to the objects to be detected in different domains, obtained by training a neural network model on training data.
Optionally, with reference to the first to third possible implementation manners of the first aspect, in a fourth possible implementation manner, determining the enhanced image feature of the object to be detected according to the cross-domain knowledge graph information may include: performing convolution processing on the high-level semantic features according to the weights of the node connecting edges to obtain the enhanced image features of the object to be detected. As can be seen from the fourth possible implementation manner of the first aspect, through graph convolution, relevant semantic information can be fused and propagated across a plurality of different domains, and the internal relationships between different objects in different data sets can be effectively captured, so that the labeling information of different domains or different data sets complements each other.
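A minimal sketch of such a graph convolution, assuming the relation weights have already been projected onto a symmetric edge-weight (adjacency) matrix A; the normalization and activation are common GCN choices and are assumptions here:

```python
import numpy as np

def graph_convolution(H, A, W):
    """Propagate high-level semantic features H over the weighted node connecting
    edges A and transform them with a learnable matrix W (illustrative sketch)."""
    A_hat = A + np.eye(A.shape[0])                       # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    H_new = d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W      # normalized propagation
    return np.maximum(H_new, 0.0)                        # ReLU; the enhanced features

n_nodes, in_dim, out_dim = 5, 16, 16
H = np.random.randn(n_nodes, in_dim)                     # one node per object to be detected
A = np.random.rand(n_nodes, n_nodes); A = (A + A.T) / 2  # projected relation weights
W = 0.1 * np.random.randn(in_dim, out_dim)
H_enhanced = graph_convolution(H, A, W)
```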
A second aspect of the present application provides an image detection apparatus, which may include: and the image acquisition module is used for acquiring an image to be detected. And the characteristic extraction module is used for determining the initial image characteristics of the object to be detected in the image to be detected. The feature extraction module is further configured to determine, according to the cross-domain knowledge graph information, an enhanced image feature of the object to be detected, where the cross-domain knowledge graph information may include an association relationship between object categories corresponding to the object to be detected in different domains, and the enhanced image feature indicates semantic information of object categories corresponding to other objects associated with the object to be detected in different domains. And the detection module is used for determining the candidate frame and the classification of the object to be detected according to the initial image characteristics of the object to be detected and the enhanced image characteristics of the object to be detected.
Optionally, with reference to the second aspect, in a first possible implementation manner, the cross-domain knowledge graph may include nodes and node connecting edges, where the nodes correspond to the object to be detected, and the node connecting edges correspond to relationships between high-level semantic features of different objects to be detected, and the image detection apparatus may further include a parameter acquisition module and a projection module, where the parameter acquisition module is configured to acquire classification layer parameters corresponding to different domains. And the feature extraction module is specifically used for weighting and fusing classification layer parameters corresponding to different domains according to the classification weights of the initial image features in the different domains on different object categories to obtain the high-level semantic features of the object to be detected. And the projection module is used for projecting the relation weight between the object categories corresponding to the objects to be detected in different domains onto the node connecting edges of the objects to be detected to obtain the weight of the node connecting edges.
Optionally, with reference to the first possible implementation manner of the second aspect, in a second possible implementation manner, the method may further include a relationship weight determining module, where the relationship weight determining module is configured to determine a relationship weight according to a distance relationship between object categories corresponding to the objects to be detected in different domains.
Optionally, with reference to the second possible implementation manner of the second aspect, in a third possible implementation manner, the distance relationship between the object categories corresponding to the objects to be detected may include one or more of the following: attribute relationships between the object categories corresponding to the objects to be detected in different domains; positional relationships or subject-object relationships between the object categories corresponding to the objects to be detected in different domains; word-embedding similarity, constructed using linguistic knowledge, between the object categories corresponding to the objects to be detected in different domains; and distance relationships between the object categories corresponding to the objects to be detected in different domains, obtained by training a neural network model on training data.
Optionally, with reference to the first to third possible implementation manners of the second aspect, in a fourth possible implementation manner, the feature extraction module is specifically configured to perform convolution processing on the high-level semantic features according to the weight of the node connecting edges to obtain the enhanced image features of the object to be detected.
A third aspect of the present application provides a method for training a neural network, the method including: acquiring training data, wherein the training data comprises a training image and an object detection labeling result of an object to be detected in the training image; extracting initial image characteristics of the object to be detected in the training image according to the neural network; extracting the enhanced image characteristics of the object to be detected in the training image according to the neural network and the cross-domain knowledge map information; processing the initial image characteristics and the enhanced image characteristics of the object to be detected according to the neural network to obtain an object detection result of the object to be detected; and determining the model parameters of the neural network according to the object detection result of the object to be detected in the training image and the object detection labeling result of the object to be detected in the training image.
The cross-domain knowledge map information may include an association relationship between object categories corresponding to the object to be detected in different domains, and the enhanced image feature indicates semantic information of object categories corresponding to other objects associated with the object to be detected in different domains.
The object detection labeling result of the object to be detected in the training image comprises a labeling candidate frame and a labeling classification result of the object to be detected in the training image.
In the process of training the neural network, a set of initial model parameters may be set for the neural network, and then the model parameters are gradually adjusted according to the difference between the object detection result of the object to be detected in the training image and the object detection labeling result of the object to be detected in the training image, until that difference is within a certain preset range or the number of training iterations reaches a preset number. The model parameters of the neural network at that point are determined as the final parameters of the neural network model, thereby completing the training of the neural network. It will be appreciated that the neural network trained according to the third aspect can be used to perform the method of the first aspect of the present application.
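In pseudo-code form, such a training procedure might look like the following sketch; the detector interface (returning a dictionary of loss terms), the optimizer, and the stopping criterion are assumptions, not details given by this application:

```python
import torch

def train(detector, data_loader, epochs=12, lr=0.01, tol=1e-3):
    """Hypothetical training loop: compare predicted boxes/classes with the
    annotated boxes/classes and adjust the model parameters until the
    difference is within a preset range or the epoch budget is exhausted."""
    optimizer = torch.optim.SGD(detector.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        total = 0.0
        for images, targets in data_loader:        # targets: labeled candidate frames + classes
            losses = detector(images, targets)     # assumed to return a dict of loss terms
            loss = sum(losses.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += float(loss)
        if total / max(len(data_loader), 1) < tol:  # difference within the preset range
            break
    return detector
```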
Optionally, with reference to the third aspect, in a first possible implementation manner, the cross-domain knowledge graph may include nodes and node connecting edges, where the nodes correspond to the objects to be detected, and the node connecting edges correspond to relationships between high-level semantic features of different objects to be detected. Classification layer parameters corresponding to different domains are weighted and fused according to the classification weights of the initial image features on different object categories in the different domains, to obtain the high-level semantic features of the object to be detected, where the classification layer parameters can be understood as the class centers maintained for the categories. The relation weights between the object categories corresponding to the objects to be detected in different domains are projected onto the node connecting edges of the objects to be detected, to obtain the weights of the node connecting edges.
Optionally, with reference to the first possible implementation manner of the third aspect, in a second possible implementation manner, the determining of the relationship weight may further include determining the relationship weight according to a distance relationship between object categories corresponding to the objects to be detected in different domains.
Optionally, with reference to the second possible implementation manner of the third aspect, in a third possible implementation manner, the distance relationship between the object categories corresponding to the objects to be detected may include one or more of the following: attribute relationships between the object categories corresponding to the objects to be detected in different domains; positional relationships or subject-object relationships between the object categories corresponding to the objects to be detected in different domains; word-embedding similarity, constructed using linguistic knowledge, between the object categories corresponding to the objects to be detected in different domains; and distance relationships between the object categories corresponding to the objects to be detected in different domains, obtained by training a neural network model on training data.
Optionally, with reference to the first to third possible implementation manners of the third aspect, in a fourth possible implementation manner, the high-level semantic features are subjected to convolution processing according to the weight of the node connecting edges, so as to obtain enhanced image features of the object to be detected.
In a fourth aspect, there is provided an object detection apparatus comprising means for performing the method of the first aspect.
In a fifth aspect, there is provided a training apparatus for neural networks, the apparatus comprising means for performing the method in the third aspect.
In a sixth aspect, there is provided an object detecting apparatus, comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of the first aspect when the program stored in the memory is executed.
In a seventh aspect, an apparatus for training a neural network is provided, the apparatus including: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of the third aspect when the program stored in the memory is executed.
In an eighth aspect, an electronic device is provided, which includes the object detection apparatus in the fourth aspect or the sixth aspect.
In a ninth aspect, there is provided an electronic apparatus comprising the object detection device in the fifth or seventh aspect.
The electronic device may be a mobile terminal (e.g., a smart phone), a tablet computer, a notebook computer, an augmented reality/virtual reality device, an in-vehicle terminal device, and the like.
A tenth aspect provides a computer storage medium having stored program code comprising instructions for performing the steps of the method of the first or third aspect.
In an eleventh aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first or third aspect.
In a twelfth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the method in the first aspect or the third aspect.
Optionally, as an implementation manner, the chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to perform the method in the first aspect. The chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
It is to be understood that the method of the first aspect described above may particularly refer to the first aspect as well as the method of any of its various implementations in the first aspect. The method of the third aspect may specifically refer to the third aspect and a method in any one of various implementation manners of the third aspect.
Through the technical scheme provided by the application, the cross-domain knowledge graph is constructed, the internal relation between different objects to be detected can be captured, the image characteristics are enhanced to include semantic information of object categories corresponding to other objects related to the objects to be detected in different domains, and therefore the effect of the object detection method can be improved.
Drawings
FIG. 1 is a schematic structural diagram of a system architecture provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of object detection using a convolutional neural network model provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a chip hardware structure according to an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram of an object detection method of an embodiment of the present application;
FIG. 5 is a schematic diagram of an association relationship according to an embodiment of the present application;
FIG. 6 is a flow chart of an object detection method of an embodiment of the present application;
FIG. 7 is a flow chart of an object detection method of an embodiment of the present application;
FIG. 8 is a schematic flow chart diagram of a method of training a neural network of an embodiment of the present application;
FIG. 9 is a schematic block diagram of an object detection apparatus of an embodiment of the present application;
FIG. 10 is a schematic block diagram of an object detection apparatus of an embodiment of the present application;
FIG. 11 is a schematic block diagram of a neural network training device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
The method and the device are mainly applied to large-scale object detection scenes, such as mobile phone face recognition, mobile phone recognition of all objects, a perception system of an unmanned vehicle, a security camera, social network photo object recognition, an intelligent robot and the like. A few exemplary application scenarios will be briefly described below:
identifying all things by the mobile phone:
By using the camera on the mobile phone, pictures containing various things can be taken. After a picture is acquired, object detection is performed on the picture to determine the location and category of each object in it.
The object detection method of this embodiment can be used to detect objects in pictures taken by the mobile phone. Because it incorporates the cross-domain knowledge graph during detection, it achieves a better effect on such pictures (for example, more accurate object positions and classifications).
Street view identification:
Cameras installed on the street can photograph vehicles and pedestrians. After a picture is obtained, it can be uploaded to the control center device, which performs object detection on the picture to obtain an object detection result and raises an alarm when an abnormal object appears.
The method provided by the application is described from the model training side and the model application side as follows:
the training method of the neural network provided by the embodiment of the application relates to computer vision processing, and can particularly be applied to data processing methods such as data training, machine learning, and deep learning, in which symbolized and formalized intelligent information modeling, extraction, preprocessing, and training are carried out on training data (such as the training pictures and the labeling results of the training pictures in this application) to finally obtain the trained neural network.
The object detection method provided by the embodiment of the application can use the trained neural network to input data (such as pictures in the application) into the trained neural network, so as to obtain output data (such as detection results of the pictures in the application). It should be noted that the training method of the neural network provided in the embodiment of the present application and the object detection method provided in the embodiment of the present application are inventions based on the same concept, and may also be understood as two parts in a system or two stages of an overall process: such as a model training phase and a model application phase.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an operation unit that takes xs and an intercept of 1 as inputs, and the output of the operation unit may be:

h_{W,b}(x) = f(W^T x) = f( \sum_{s=1}^{n} W_s x_s + b )

where s = 1, 2, ..., n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce a nonlinear characteristic into the neural network so as to convert an input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many of the above single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be a region composed of several neural units.
(2) Deep neural network
A deep neural network (DNN) can be understood as a neural network with many hidden layers; there is no particular metric for "many" here, and what we usually call a multilayer neural network and a deep neural network are essentially the same thing. From the position of the different layers, the layers of a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Generally the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. Although the DNN looks complicated, the work of each layer is not complicated; it is simply the linear relational expression y = α(Wx + b), where x is the input vector, y is the output vector, b is the bias vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, the number of coefficients W and bias vectors b is large. These parameters are defined in the DNN as follows, taking the coefficient W as an example. In a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W^L_{jk}. Note that the input layer has no W parameter. In deep neural networks, more hidden layers make the network better able to describe complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means that it can accomplish more complex learning tasks.
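As a small, self-contained illustration of the layer-wise operation y = α(Wx + b) described above (ReLU is used for α here purely for illustration):

```python
import numpy as np

def dnn_forward(x, layers):
    """Forward pass of a fully connected DNN: each layer computes y = α(Wx + b)."""
    y = x
    for W, b in layers:                     # layers: list of (weight matrix, bias vector)
        y = np.maximum(W @ y + b, 0.0)      # α chosen as ReLU for this sketch
    return y

layers = [(np.random.randn(16, 8), np.zeros(16)),   # hidden layer: 8 -> 16
          (np.random.randn(4, 16), np.zeros(4))]    # output layer: 16 -> 4
out = dnn_forward(np.random.randn(8), layers)
```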
(3) Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
(4) Classifier
Many neural network architectures end with a classifier for classifying objects in the image. The classifier is generally composed of a fully connected layer and a softmax function, and can output the probability of each class according to its input.
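For illustration, such a classifier head can be sketched as a fully connected layer followed by softmax (dimensions are arbitrary):

```python
import numpy as np

def classify(feature, W_fc, b_fc):
    """Fully connected layer followed by softmax: one probability per class."""
    logits = W_fc @ feature + b_fc
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()

probs = classify(np.random.randn(128), np.random.randn(10, 128), np.zeros(10))
print(probs.sum())                         # ~1.0
```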
(5) Feature Pyramid Network (FPN)
Most original object detection algorithms make predictions only from top-level features. It is known, however, that low-level features carry less semantic information but locate targets accurately, whereas high-level features carry rich semantic information but locate targets only coarsely. The FPN builds on the original detector and makes independent predictions on different feature levels.
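A minimal sketch of a feature pyramid of this kind, with 1x1 lateral convolutions and a top-down pathway; this is a generic FPN layout written for illustration, not necessarily the structure used in this application:

```python
import torch
import torch.nn.functional as F
from torch import nn

class TinyFPN(nn.Module):
    """Builds a small feature pyramid: lateral 1x1 convs + top-down upsampling,
    so predictions can be made independently on each feature level."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                        # feats ordered low level -> high level
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):    # merge top-down
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]
```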
(6) Transfer learning (transfer learning)
The initial goal of transfer learning is to deal with the problem of insufficient training samples: a model can be transferred, through existing source domain data, to related but not identical target domain data, so as to train a model suitable for the target domain. A domain is an abstraction that refers to tasks with similar properties. Specifically, a domain may be a detection task on a specific data set, a detection task for a specific kind of object (such as a human face), and the like. Different domains often differ markedly and are difficult to process uniformly. The universe is the collective term for all domains, including all potential tasks; it is a collection of domains that is commonly used for definitions and conceptual presentation. The core of transfer learning is to extract domain-invariant information by maximizing the similarity measure between specific domains, so that data in different domains can be learned cooperatively to obtain a model suitable for the target domain.
Some basic contents of the neural network are briefly described above, and some specific neural networks that may be used in image data processing are described below.
(7) Graph convolution neural network
A graph (graph) is a data format that can be used to represent a social network, a communication network, a protein molecular network, etc., where nodes in the graph represent individuals in the network and lines represent connection relationships between individuals. Graph structure data is required for many machine learning tasks such as community discovery, link prediction and the like, so the appearance of a graph convolutional neural network (GCN) provides a new idea for solving the problems. Deep learning of the graph data is possible using GCN.
GCN is a natural generalization of convolutional neural networks over graph domains (graph domains). The method can simultaneously carry out end-to-end learning on the node characteristic information and the structural information, and is the best choice for the current graph data learning task. The GCN has wide applicability and is suitable for nodes and graphs with any topological structures.
The system architecture of the embodiment of the present application is described in detail below with reference to fig. 1.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in FIG. 1, the system architecture 100 includes an execution device 110, a training device 120, a database 130, a client device 140, a data storage system 150, and a data collection device 160.
In addition, the execution device 110 includes a calculation module 111, an I/O interface 112, a preprocessing module 113, and a preprocessing module 114. Wherein, the calculation module 111 may include the target model/rule 101, and the pre-processing module 113 and the pre-processing module 114 are optional.
The data acquisition device 160 is used to acquire training data. For the object detection method of the embodiment of the present application, the training data may include training images of different domains or different data sets and labeling results corresponding to the training images. The labeling result of the training image may be a classification result of each object to be detected in the (manually) pre-labeled training image. After the training data is collected, data collection device 160 stores the training data in database 130, and training device 120 trains target model/rule 101 based on the training data maintained in database 130.
The following describes how the training device 120 obtains the target model/rule 101 based on the training data: the training device 120 performs object detection on the input training image and compares the output detection result with the pre-labeled detection result of the object, until the difference between the detection result output by the training device 120 and the pre-labeled detection result is smaller than a certain threshold, thereby completing the training of the target model/rule 101.
The target model/rule 101 can be used for implementing the object detection method of the embodiment of the present application, that is, the image to be detected (after being subjected to the relevant preprocessing) is input into the target model/rule 101, and the detection result of the image to be detected can be obtained. The target model/rule 101 in the embodiment of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 130 may not necessarily all come from the acquisition of the data acquisition device 160, and may also be received from other devices. It should be noted that, the training device 120 does not necessarily perform the training of the target model/rule 101 based on the training data maintained by the database 130, and may also obtain the training data from the cloud or other places for performing the model training.
The target model/rule 101 obtained by training according to the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 1, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR), a vehicle-mounted terminal, or a server or a cloud. In fig. 1, the execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include: the image to be processed is input by the client device. The client device 140 may specifically be a terminal device.
The preprocessing module 113 and the preprocessing module 114 are configured to perform preprocessing according to input data (such as an image to be processed) received by the I/O interface 112, and in this embodiment of the application, the preprocessing module 113 and the preprocessing module 114 may not be provided (or only one of the preprocessing modules may be provided), and the computing module 111 may be directly used to process the input data.
In the process that the execution device 110 preprocesses the input data or in the process that the calculation module 111 of the execution device 110 executes the calculation or other related processes, the execution device 110 may call the data, the code, and the like in the data storage system 150 for corresponding processes, and may store the data, the instruction, and the like obtained by corresponding processes in the data storage system 150.
Finally, the I/O interface 112 presents the processing result, such as the detection result of the object obtained as described above, to the client device 140, thereby providing it to the user.
It should be noted that the training device 120 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data, and the corresponding target models/rules 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 1, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
As shown in fig. 1, the target model/rule 101 is obtained by training with the training device 120; in this embodiment of the application it may be a neural network. Specifically, the neural network provided in the present application may be a CNN, a deep convolutional neural network (DCNN), or the like.
Since CNN is a very common neural network, the structure of CNN will be described in detail below with reference to fig. 2. As described in the introduction of the basic concept above, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
As shown in fig. 2, Convolutional Neural Network (CNN)200 may include an input layer 210, a convolutional/pooling layer 220 (where pooling is optional), and a neural network layer 230. The relevant contents of these layers are described in detail below.
Convolutional layer/pooling layer 220:
Convolutional layer:
the convolutional layer/pooling layer 220 shown in fig. 2 may include layers 221 to 226. For example, in one implementation, 221 is a convolutional layer, 222 is a pooling layer, 223 is a convolutional layer, 224 is a pooling layer, 225 is a convolutional layer, and 226 is a pooling layer; in another implementation, 221 and 222 are convolutional layers, 223 is a pooling layer, 224 and 225 are convolutional layers, and 226 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The inner working principle of a convolutional layer will be described below by taking convolutional layer 221 as an example.
Convolutional layer 221 may include a number of convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is typically moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride) to complete the task of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends over the entire depth of the input image during the convolution operation. Therefore, convolving with a single weight matrix produces a convolution output with a single depth dimension; in most cases, however, a plurality of weight matrices of the same size (rows × columns), i.e., a plurality of matrices of the same type, are applied instead of a single one. The outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "plurality" described above. Different weight matrices may be used to extract different features of the image: for example, one weight matrix extracts image edge information, another extracts a particular color of the image, and yet another blurs unwanted noise in the image. The plurality of weight matrices have the same size (rows × columns), so the convolution feature maps they extract also have the same size, and the extracted convolution feature maps of the same size are combined to form the output of the convolution operation.
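The following loop-based sketch (single-channel input, unoptimized, for illustration only) shows how convolving with several weight matrices of the same size yields the depth dimension of the output described above:

```python
import numpy as np

def conv2d_multi_kernel(image, kernels, stride=1):
    """Convolve one single-channel image with several kernels of the same size;
    stacking the per-kernel outputs gives the depth dimension of the result."""
    k = kernels.shape[-1]
    h = (image.shape[0] - k) // stride + 1
    w = (image.shape[1] - k) // stride + 1
    out = np.zeros((kernels.shape[0], h, w))
    for d, kern in enumerate(kernels):              # one output channel per weight matrix
        for i in range(h):
            for j in range(w):
                patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
                out[d, i, j] = np.sum(patch * kern)
    return out

image = np.random.randn(8, 8)
kernels = np.random.randn(3, 3, 3)                  # three 3x3 kernels
features = conv2d_multi_kernel(image, kernels)      # shape (3, 6, 6): depth 3
```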
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct prediction.
When convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (e.g., 221) tend to extract more general features, which may also be referred to as low-level features. As the depth of convolutional neural network 200 increases, the later convolutional layers (e.g., 226) extract more and more complex features, such as features with high-level semantics; features with higher-level semantics are more suitable for the problem to be solved.
A pooling layer:
since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. For the layers 221-226 illustrated by 220 in fig. 2, this may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image into an image of smaller size. The average pooling operator may compute the pixel values of the image within a certain range to produce an average value as the result of the average pooling. The max pooling operator may take the pixel with the largest value within a particular range as the result of the max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents an average value or a maximum value of the corresponding sub-region of the image input to the pooling layer.
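A small illustration of max and average pooling with non-overlapping windows (the window size and input values are arbitrary):

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Downsample a feature map window by window: max pooling keeps the largest
    value in each window, average pooling keeps the mean."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, 2, "max"))    # 4x4 input -> 2x2 output
print(pool2d(fmap, 2, "avg"))
```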
The neural network layer 230:
after processing by convolutional layer/pooling layer 220, convolutional neural network 200 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (required class information or other relevant information), the convolutional neural network 200 needs to generate one or a set of the required number of classes of output using the neural network layer 230. Accordingly, a plurality of hidden layers (231, 232 to 23n shown in fig. 2) and an output layer 240 may be included in the neural network layer 230, and parameters included in the hidden layers may be pre-trained according to related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the neural network layer 230, the last layer of the whole convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to categorical cross-entropy and is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 200 (propagation in the direction from 210 to 240 in fig. 2) is completed, the backward propagation (propagation in the direction from 240 to 210 in fig. 2) starts to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, that is, the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
It should be understood that the Convolutional Neural Network (CNN)200 shown in fig. 2 may be used to perform the object detection method of the embodiment of the present application, and as shown in fig. 2, the image to be processed may obtain the detection result of the image after being processed by the input layer 210, the convolutional/pooling layer 220, and the neural network layer 230.
Fig. 3 is a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor. The chip may be provided in the execution device 110 as shown in fig. 1 to complete the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 as shown in fig. 1 to complete the training work of the training apparatus 120 and output the target model/rule 101. The algorithms for the various layers in the convolutional neural network shown in fig. 2 can all be implemented in a chip as shown in fig. 3.
The neural network processor NPU is mounted as a coprocessor on a main Central Processing Unit (CPU) (host CPU), and tasks are allocated by the main CPU. The core portion of the NPU is an arithmetic circuit 303, and the controller 304 controls the arithmetic circuit 303 to extract data in a memory (weight memory or input memory) and perform an operation.
In some implementations, the arithmetic circuitry 303 includes a plurality of processing units (PEs) internally. In some implementations, the operational circuitry 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 303 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 303 fetches the data corresponding to the matrix B from the weight memory 302 and buffers the data on each PE in the arithmetic circuit 303. The arithmetic circuit 303 takes the matrix a data from the input memory 301 and performs matrix arithmetic with the matrix B, and a partial result or a final result of the obtained matrix is stored in an accumulator (accumulator) 308.
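As a purely numerical illustration of accumulating partial results (not the NPU's actual dataflow), the matrix product can be built up tile by tile in an accumulator:

```python
import numpy as np

def tiled_matmul(A, B, tile=2):
    """Accumulate partial products of A x B into C, tile by tile along the
    inner dimension, mimicking the accumulator step described in the text."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))                          # the "accumulator"
    for t in range(0, k, tile):
        C += A[:, t:t+tile] @ B[t:t+tile, :]      # partial result accumulated
    return C

A, B = np.random.randn(4, 6), np.random.randn(6, 3)
assert np.allclose(tiled_matmul(A, B), A @ B)
```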
The vector calculation unit 307 may further process the output of the operation circuit 303, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 307 may be used for network calculation of a non-convolution/non-FC layer in a neural network, such as pooling (Pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit 307 can store the processed output vector to the unified buffer 306. For example, the vector calculation unit 307 may apply a non-linear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 307 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 303, for example, for use in subsequent layers in a neural network.
The unified memory 306 is used to store input data as well as output data.
The memory unit access controller (DMAC) 305 is used to carry input data in the external memory to the input memory 301 and/or the unified memory 306, to carry the weight data in the external memory into the weight memory 302, and to carry data in the unified memory 306 into the external memory.
A Bus Interface Unit (BIU) 310, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 309 through a bus.
An instruction fetch buffer 309, coupled to the controller 304, is used to store instructions used by the controller 304.

The controller 304 is configured to call the instructions cached in the instruction fetch memory 309, so as to control the operation process of the operation accelerator.
Generally, the unified memory 306, the input memory 301, the weight memory 302 and the instruction fetch memory 309 are On-Chip (On-Chip) memories, the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM) or other readable and writable memories.
The operation of each layer in the convolutional neural network shown in fig. 2 can be performed by the arithmetic circuit 303 or the vector calculation unit 307.
The execution device 110 in fig. 1 described above can execute the steps of the object detection method in the embodiment of the present application, and the CNN model shown in fig. 2 and the chip shown in fig. 3 can also be used to execute the steps of the object detection method in the embodiment of the present application. The object detection method according to the embodiment of the present application is described in detail below with reference to fig. 4.
401. Acquire an image to be detected.
The method shown in fig. 4 can be applied in different scenarios. Specifically, the method shown in fig. 4 can be applied in scenarios such as "identify everything" and street view identification.

When the method shown in fig. 4 is applied to an "identify everything" scenario on a mobile terminal, the image to be detected in step 401 may be an image shot by the mobile terminal through a camera, or an image already stored in an album of the mobile terminal.

When the method shown in fig. 4 is applied to a street view identification scenario, the image to be detected in step 401 may be a street view image captured by a camera on the roadside.
The method shown in fig. 4 may be performed by a neural network (model), and particularly, the method shown in fig. 4 may be performed by CNN or DNN.
402. Determine the initial image characteristics of the object to be detected in the image to be detected.
In step 402, convolution processing or regularization processing may be performed on the entire image of the image to be detected to obtain image features of the entire image, and then initial image features corresponding to the object to be detected are obtained from the image features of the entire image.
In a specific embodiment, the convolving the image to be detected to obtain the initial image feature of the object to be detected includes: carrying out convolution processing on the whole image of the image to be detected to obtain the complete image characteristic of the image to be detected; and determining the image characteristics corresponding to the object to be detected in the complete image characteristics of the image to be detected as the initial image characteristics of the object to be detected.
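A minimal Python sketch of this specific embodiment is given below; the backbone stride of 16 and the mean pooling used to obtain a fixed-length feature are assumptions made purely for illustration:

import numpy as np

def initial_object_features(feature_map, box, stride=16):
    # feature_map: (C, H, W) convolution features of the entire image to be detected;
    # box: (x1, y1, x2, y2) of one object to be detected in image coordinates;
    # stride: assumed downsampling factor of the convolutional backbone.
    x1, y1, x2, y2 = [int(round(v / stride)) for v in box]
    crop = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    # Pool to a fixed-size vector so every object yields a feature of the same dimension.
    return crop.mean(axis=(1, 2))

features = np.random.rand(256, 38, 50)          # backbone output for a 600 x 800 image
f_i = initial_object_features(features, (120, 80, 360, 420))
print(f_i.shape)                                # (256,)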
In a specific embodiment, the convolving the image to be detected to obtain the initial image feature of the object to be detected includes: acquiring the image characteristics corresponding to each object to be detected independently each time.
403. Determine the enhanced image characteristics of the object to be detected according to the cross-domain knowledge map information.
The cross-domain knowledge map information comprises the incidence relation among the object classes corresponding to the objects to be detected in different domains, and the enhanced image characteristics indicate the semantic information of the object classes corresponding to other objects related to the objects to be detected in different domains.
The greater the probability that two object categories appear in the same image at the same time, the stronger the association relationship between the two categories is considered to be. For example, as shown in fig. 5, the object categories in the first domain (or first data set) include boys, girls, roads, and streets, and the object categories in the second domain include people, handbags, bags, cars, and trucks. It may be considered that men, women, boys, and girls in the first domain have an association relationship with people in the second domain; women and girls in the first domain have an association relationship with handbags in the second domain; boys and girls in the first domain have an association relationship with bags in the second domain; and roads and streets in the first domain have an association relationship with cars and trucks in the second domain.
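One simple way to estimate such an association relationship from annotated data is to count how often two categories appear in the same image. The following Python sketch is illustrative only; the patent does not prescribe this particular statistic, and the toy annotations are assumed:

import numpy as np

def cooccurrence_weights(annotations, num_classes):
    # annotations: list of per-image label lists, e.g. [[0, 2], [1, 2, 3], ...].
    # Returns a matrix whose (i, j) entry grows with the probability that
    # classes i and j appear in the same image.
    counts = np.zeros((num_classes, num_classes))
    for labels in annotations:
        present = sorted(set(labels))
        for i in present:
            for j in present:
                if i != j:
                    counts[i, j] += 1
    images = max(len(annotations), 1)
    return counts / images   # empirical co-occurrence frequency

toy = [[0, 1], [0, 1, 2], [2, 3]]
print(cooccurrence_weights(toy, 4))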
Semantic information may refer to high-level information that can assist in image detection. For example, the semantic information may be specifically what the object is, what is around the object (the semantic information is generally different from the low-level information, such as the edge, pixel point, and brightness of the image).
For example, the object to be detected is a woman, and the other objects associated with the woman in the image to be detected include a handbag, then the enhanced image feature of the object to be detected may indicate semantic information of the handbag.
In a specific embodiment, the cross-domain knowledge graph may include nodes and node connecting edges, where the nodes correspond to the objects to be detected and the node connecting edges correspond to the relationships between the high-level semantic features of different objects to be detected. According to the classification weights of the initial image features on the different object categories in the different domains, the classification layer parameters corresponding to the different domains are weighted and fused to obtain the high-level semantic features of the objects to be detected; the classification layer parameters can be understood as class centers maintained for the categories, where a class center refers to the high-level semantic feature of a category. The relation weights between the object categories corresponding to the objects to be detected in the different domains are then projected onto the node connecting edges of the objects to be detected to obtain the weights of the node connecting edges. This process is explained below. Let ε_{S-P} denote the weights of the node connecting edges, let S and P denote two domains, and let M_S and M_P denote the matrices of classification weights of the objects to be detected f_i on the object categories of the corresponding domain. Taking M_S as an example, the element in the ith row and the jth column of M_S is

[formula image BDA0002377597140000131]

where s_ij is the inner product of the initial image feature of the ith object to be detected and the parameter corresponding to the jth classification category of the classifier. The weight of the connecting edge between the ith node and the jth node of the region graph in the S domain is

[formula image BDA0002377597140000132]

where f_i and f_j are the features of the ith object to be detected and the jth object to be detected, respectively. G_{S-P} is the relation weight between the object categories corresponding to the objects to be detected in the different domains, and G_{S-P} can be regarded as a matrix. Projecting the relation weights between the object categories corresponding to the objects to be detected in the different domains onto the node connecting edges of the objects to be detected to obtain the weights of the node connecting edges can be expressed by the following formula, where T denotes the transpose of a matrix:

ε_{S-P} = M_S G_{S-P} (M_P)^T

The projection process can be regarded as a process of converting the relation weights between object categories into relation weights between objects to be detected, and the relation weights between the objects to be detected are the weights of the node connecting edges.
In a specific embodiment, the enhanced image features of the object to be detected can be obtained by performing convolution processing on the high-level semantic features according to the weight of the node connecting edges.
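The following Python sketch illustrates the projection and the subsequent weighted fusion for two domains S and P. Because the exact formulas are given as images, the softmax normalization of the classification scores and the single linear graph-convolution step used here are assumptions made for illustration, and all matrices are random placeholders for learned quantities:

import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Classification scores of the objects to be detected in each domain
# (rows: objects, columns: object categories of that domain).
S_s = np.random.rand(5, 10)     # 5 objects, 10 categories in domain S
S_p = np.random.rand(7, 20)     # 7 objects, 20 categories in domain P
M_s, M_p = softmax(S_s), softmax(S_p)

G_sp = np.random.rand(10, 20)   # class-level relation weights between the two domains
eps_sp = M_s @ G_sp @ M_p.T     # project to object-level edge weights (5 x 7)

P_p = np.random.rand(20, 64)    # semantic pool of domain P (class centers)
X_p = M_p @ P_p                 # high-level semantic features of the domain-P objects
W = np.random.rand(64, 64)      # assumed learnable graph-convolution weight
enhanced_s = eps_sp @ X_p @ W   # enhanced features for the 5 domain-S objects
print(enhanced_s.shape)         # (5, 64)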
In a specific embodiment, the relationship weight may be determined according to a distance relationship between the object categories corresponding to the objects to be detected in different domains. The distance relationship includes one or more of the following information (an illustrative sketch of items (3) and (4) is given after the list):
(1) attribute relationships of different object classes in different domains.
For example, if the color of apple is red and the color of strawberry is also red, then apple and strawberry have the same attribute in color (or, alternatively, apple and strawberry are closer in color attribute).
(2) The positional relationship or subject-object relationship of different object categories in different domains.

For example, a car is on a street and a woman carries a handbag: the street and the car are located close together, and the woman and the handbag satisfy a subject-object relationship.
(3) Word embedding similarity constructed using linguistic knowledge for different object classes in different domains.
The word embedding similarity constructed by using linguistic knowledge can be understood as the similarity degree between word vectors of different object categories.
(4) The distance relationship between the features of different objects to be detected in different domains, obtained by training the neural network model according to the training data.
For example, for two different domains, the weight of the connecting edge between the ith node in one domain and the jth node in the other domain is

[formula image BDA0002377597140000141]

where f_i and f_j are the features of the ith object to be detected in one domain and the jth object to be detected in the other domain (short for the initial image features of the objects to be detected). It should be noted that in this case, that is, when the neural network model is trained according to the training data to obtain the distance relationship between the features of different objects to be detected in different domains, the relation weights between the objects to be detected are obtained directly, and the projection process is therefore not required.
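As indicated above, items (3) and (4) can be pictured with the following Python sketch. Since the exact edge-weight formula is given only as an image, the cosine similarity for word embeddings and the Gaussian of the feature distance used here are assumptions, and the embeddings are random stand-ins:

import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

# (3) Word-embedding similarity between class names of two domains.
word_vectors = {"person": np.random.rand(50), "woman": np.random.rand(50)}
g_person_woman = cosine_similarity(word_vectors["person"], word_vectors["woman"])

# (4) An assumed feature-distance edge weight between two objects to be detected,
# here a Gaussian of the Euclidean distance between their initial image features.
def feature_edge_weight(f_i, f_j, sigma=1.0):
    return float(np.exp(-np.sum((f_i - f_j) ** 2) / (2 * sigma ** 2)))

f_i, f_j = np.random.rand(256), np.random.rand(256)
print(g_person_woman, feature_edge_weight(f_i, f_j))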
404. Determine the candidate frames and classification of the object to be detected according to the initial image characteristics of the object to be detected and the enhanced image characteristics of the object to be detected.
The candidate frame and the classification of the object to be detected determined in step 404 may be the final candidate frame and the final classification (result) of the object to be detected, respectively.
In step 404, the initial image features of the object to be detected and the enhanced image features of the object to be detected may be combined to obtain the final image features of the object to be detected, and then the candidate frame and the classification of the object to be detected may be determined according to the final image features of the object to be detected.
For example, if the initial image feature of the object to be detected is a convolution feature map with a size of M1 × N1 × C1(M1, N1, and C1 may respectively represent width, height, and number of channels), and the enhanced image feature of the object to be detected is a convolution feature map with a size of M1 × N1 × C2(M1, N1, and C2 respectively represent width, height, and number of channels), then by combining the two convolution feature maps, the final image feature of the object to be detected, which is a convolution feature map with a size of M1 × N1 × (C1+ C2), may be obtained.
It should be understood that the description is given here by taking as an example that the convolution feature map of the initial image feature and the convolution feature map of the enhanced image feature are the same in size (same in width and height), but different in the number of channels. In fact, when the sizes of the convolution feature map of the initial image feature and the convolution feature map of the enhanced image feature are different, the initial image feature and the enhanced image feature may also be combined, and at this time, the sizes of the convolution feature map of the initial image feature and the convolution feature map of the enhanced image feature may be unified (the width and the height are unified), and then the convolution feature map of the initial image feature and the convolution feature map of the enhanced image feature may be combined to obtain the convolution feature map of the final image feature.
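A minimal Python sketch of this combination step is given below; the nearest-neighbour resizing used to unify the spatial sizes is only one possible choice and is an assumption:

import numpy as np

def fuse_features(initial, enhanced):
    # Concatenate the initial and enhanced convolution feature maps along the
    # channel dimension. Both are (C, H, W); if the spatial sizes differ,
    # the enhanced map is resized first (nearest-neighbour, illustrative only).
    if initial.shape[1:] != enhanced.shape[1:]:
        H, W = initial.shape[1:]
        ys = np.arange(H) * enhanced.shape[1] // H
        xs = np.arange(W) * enhanced.shape[2] // W
        enhanced = enhanced[:, ys][:, :, xs]
    return np.concatenate([initial, enhanced], axis=0)   # (C1 + C2, H, W)

final = fuse_features(np.random.rand(256, 7, 7), np.random.rand(64, 7, 7))
print(final.shape)   # (320, 7, 7)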
In the application, when the object detection is carried out on the image to be detected, the detection result of the object to be detected is comprehensively determined through the initial image characteristics and the enhanced image characteristics of the object to be detected, and compared with a mode of only considering the initial image characteristics of the object to be detected to obtain the detection result, a better detection result can be obtained.
Specifically, when the detection result of the object to be detected is determined, not only the initial image characteristics reflecting the characteristics of the object to be detected are considered, but also the semantic information of other objects related to the object to be detected in the image to be detected is considered. By constructing the cross-domain knowledge graph (also referred to as a transferable knowledge graph across multiple scenarios), the present application can capture the internal relations among different objects, and by utilizing the graph convolution network it can fuse a large amount of information from different data sets and different categories, which greatly improves the data utilization rate, ensures higher detection performance, and truly realizes large-scale object detection.
For example, a model trained only on the second domain mentioned above may determine that the detection result is a person and a handbag, whereas a model trained on both the first domain and the second domain according to the scheme provided by the present application may determine that the detection result is a woman carrying a handbag, so that the effect of object detection is finally improved.
In a specific embodiment, after step 401, the method shown in fig. 4 further includes: and determining an initial candidate frame of the object to be detected according to the initial image characteristics of the object to be detected.
In the process of determining the initial candidate frame of the object to be detected, the convolution processing is generally performed on the whole image of the image to be detected to obtain the convolution characteristics of the whole image of the image to be detected, then the image to be detected is divided into different frames according to the fixed size requirement, the characteristics corresponding to the image in each frame are scored, and the frame with higher score is screened out to be used as the initial candidate frame.
For example, the image to be detected is a first image, and in order to obtain an initial candidate frame of the object to be detected in the first image, convolution processing may be performed on the entire image of the first image to obtain convolution characteristics of the entire image of the first image, then the first image is divided into 3 × 3 boxes, and characteristics corresponding to the image of each box are scored. And finally, screening out the boxes A and B with higher scores as initial candidate boxes.
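The screening of higher-scoring boxes can be sketched as follows (Python, illustrative only; the grid layout and the objectness scores are assumed):

import numpy as np

def select_initial_boxes(box_scores, boxes, top_k=2):
    # box_scores: objectness score of each fixed-size box; boxes: the
    # corresponding box coordinates. The boxes with the highest scores are
    # kept as initial candidate frames.
    order = np.argsort(box_scores)[::-1][:top_k]
    return [boxes[i] for i in order]

# 3 x 3 grid of boxes over the first image, with assumed objectness scores.
boxes = [(c * 100, r * 100, c * 100 + 99, r * 100 + 99) for r in range(3) for c in range(3)]
scores = np.random.rand(9)
print(select_initial_boxes(scores, boxes))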
In the step 404, in the process of determining the candidate frame and the classification of the object to be detected according to the initial image feature of the object to be detected and the enhanced image feature of the object to be detected, the initial image feature and the enhanced image feature may be combined to obtain a final image feature, then the initial candidate frame is adjusted according to the final image feature to obtain a candidate frame, and the initial classification result is corrected according to the final image feature to obtain a classification result. Specifically, the adjusting the initial candidate frame according to the final image feature may be adjusting the coordinates around the initial candidate frame according to the final image feature until the candidate frame is obtained, and the adjusting the initial classification result according to the final image feature may be establishing a classifier to perform reclassification to obtain the classification result.
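A rough Python sketch of this adjustment and re-classification is given below; the linear regression and classification heads W_reg and W_cls are hypothetical stand-ins for the layers that would actually be trained:

import numpy as np

def refine(box, final_feature, W_reg, W_cls):
    # The regression head outputs coordinate offsets that adjust the initial
    # candidate frame, and the classification head re-scores the object categories.
    dx1, dy1, dx2, dy2 = final_feature @ W_reg
    x1, y1, x2, y2 = box
    refined_box = (x1 + dx1, y1 + dy1, x2 + dx2, y2 + dy2)
    class_scores = final_feature @ W_cls
    return refined_box, int(np.argmax(class_scores))

feat = np.random.rand(320)
box, cls = refine((100, 80, 300, 400), feat, np.random.rand(320, 4) * 0.01, np.random.rand(320, 5))
print(box, cls)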
In order to better understand the complete flow of the object detection method according to the embodiment of the present application, the object detection method according to the embodiment of the present application is described below with reference to fig. 6.
Fig. 6 is a schematic flowchart of an object detection method according to an embodiment of the present application.
The method illustrated in fig. 6 may be performed by an object detection apparatus, which may be an electronic device having an object detection function. The form of the apparatus embodied by the electronic device may be as described above in connection with the method shown in fig. 4.
The method shown in fig. 6 includes steps 601 to 609, which are described in detail below.
Wherein, steps 602 and 603 may be a refined implementation (or referred to as a detailed implementation) of step 402, and steps 604 to 608 may be a refined implementation (or referred to as a detailed implementation) of step 403.
601. Acquire an image to be detected.
Step 601 can be understood by referring to step 401 in the embodiment corresponding to fig. 4, and is not repeated herein.
602. An initial candidate region is selected.
The image to be detected may be input to a conventional object detector for processing (e.g., fast-RCNN) to obtain initial candidate regions. Since the object detection is performed for a plurality of different domains, each domain has an initial candidate region corresponding to each domain.
Specifically, the convolution processing may be performed on the image to be detected to obtain convolution characteristics of the whole image of the image to be detected, then the image to be detected is divided into different frames according to a certain size requirement, then the characteristics corresponding to the image in each frame are scored according to different domains, and the frame with the higher score is screened out as an initial candidate frame, so that initial candidate frames corresponding to different domains are obtained.
603. Initial image features of the initial candidate region are extracted.
The image features of the initial candidate region may be extracted by CNN. For example, if the first image is an image to be detected, in order to obtain an initial candidate frame of an object to be detected in the first image, the first image may be convolved to obtain a convolution feature of the first image, then the first image is divided into 4 × 4 blocks (or into other number of blocks), a feature corresponding to the image of each block is scored, and a block a and a block B with higher scores are screened out as the initial candidate frame.
Further, after the initial candidate frame is obtained, image features of the entire image of the image to be detected (image features of the entire image of the image to be detected can be obtained by performing convolution processing on the entire image of the image to be detected) may be mapped to the frame a and the frame B, so as to obtain initial image features corresponding to the frame a and initial image features corresponding to the frame B.
604. Extract classification layer parameters.
The classifier corresponding to different domains in the object detector may be used to extract the classification layer parameters, for example, for each domain, the classifier in the object detector (e.g., fast-RCNN) may be used to extract the classification layer parameters, and for each domain, a semantic pool related to the domain is respectively constructed to record the high-level semantic features of each class. The extracted classification layer parameters may be classification layer parameters of all classifications in classifiers corresponding to different domains in an object detector that performs object detection on an object to be detected.
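For illustration, the semantic pools can be pictured as follows (Python; the number of categories, the feature dimension, and the domain names are assumed values):

import numpy as np

# Minimal sketch: for each domain, the weight matrix of that domain's
# classification layer is taken as the semantic pool, one row (class centre)
# per object category.
classifier_weights = {
    "domain_S": np.random.rand(10, 1024),   # 10 categories, feature dimension 1024
    "domain_P": np.random.rand(20, 1024),   # 20 categories
}
semantic_pools = {domain: W.copy() for domain, W in classifier_weights.items()}
# The pools are refreshed whenever the classifiers are updated during training,
# so each pool records the current high-level semantic feature of each class.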
605. Construct an intra-domain region map.
According to the classification weights, given by the detection network, of the initial image features of the objects to be detected on the different object categories, the high-level semantic features in the semantic pool corresponding to the domain are mapped onto the nodes of the intra-domain region map to obtain the high-level semantic representations of the objects to be detected. The weights on the node connecting edges of the intra-domain region map are assigned according to the relation weights between the object categories corresponding to the different objects to be detected in the domain.
Specifically, the semantic pool is P_T ∈ R^(C_T × D), i.e. the parameters of the classifier, where C_T is the number of classes and D is the dimension of the classifier weight corresponding to each class. The high-level semantic representation of the objects to be detected is obtained by mapping X = M_T P_T onto the nodes of the intra-domain region map, where the element in the ith row and the jth column of M_T is

[formula image BDA0002377597140000162]

and s_ij is the inner product of the initial image feature of the ith object to be detected and the parameter corresponding to the jth classification category of the classifier. The weight of the connecting edge between the ith node and the jth node of the intra-domain region map is

[formula image BDA0002377597140000171]

where f_i and f_j are the features of the ith object to be detected and the jth object to be detected, respectively.
For each domain, a domain-within region map can be constructed in the manner described above.
606. Construct an inter-domain regional graph.
The high-level semantic features of the semantic pools are mapped onto the nodes of the inter-domain regional graph according to the classification weights, given by the detection network, of the initial image features of the objects to be detected in their respective domains on the different classes, so as to obtain high-level semantic representations of the objects to be detected. The relation weights between the classes are assigned according to the distances between the classification class features of the objects to be detected in the two different domains, and are projected onto the node connecting edges of the inter-domain regional graph to obtain the weights of the node connecting edges of the inter-domain regional graph. The distance in step 606 can be understood by referring to the explanation of the distance in the embodiment corresponding to fig. 4, and the detailed description is not repeated here.
For two different domains, the features on the nodes of the inter-domain regional graph are constructed in the same manner as for the intra-domain region map, and the weight of the connecting edge between the ith node in one domain and the jth node in the other domain of the inter-domain regional graph is

[formula image BDA0002377597140000172]

where f_i and f_j are the features of the ith object to be detected in one domain and the jth object to be detected in the other domain.
607. Perform intra-domain graph convolution network inference.
Through the constructed intra-domain region map, the intra-domain graph convolution network is used to propagate the high-level semantic representations of the different objects to be detected on the nodes, so as to obtain, after inference, features into which the high-level semantic representations of the other objects to be detected have been fused.
In particular, a graph convolution with a spatial information mechanism may be selected. The relative spatial information between the objects to be detected is used to learn K Gaussian kernels, with the following formula:

[formula image BDA0002377597140000173]

where ω_k is the kth Gaussian kernel, μ_k and Σ_k are a learnable mean vector and covariance vector, and g_ij is the relative spatial relationship between the ith object to be detected and the jth object to be detected, given by:

[formula image BDA0002377597140000174]

where x_i and x_j are the ith and jth rows of X, and w_i, w_j, h_i and h_j are the widths and heights of the candidate frames of the ith and jth objects to be detected. The output of each graph convolution is:

f'_k(i) = Σ_{j ∈ N(i)} ω_k(g_ij) x_j e_ij,

where N(i) denotes the nodes adjacent to node i.
The K features obtained on each node by the intra-domain graph convolution can be fused into the high-level semantic representation of the corresponding object to be detected.
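The following Python sketch illustrates step 607 in simplified form. Because the Gaussian-kernel formula and the relative spatial relationship g_ij are given as images, the diagonal Gaussian and the normalised offset/scale encoding used here are assumptions, and all inputs are random placeholders:

import numpy as np

def gaussian_kernel(g, mu, sigma):
    # omega_k(g) = exp(-0.5 * ||(g - mu) / sigma||^2), an assumed diagonal form
    return np.exp(-0.5 * np.sum(((g - mu) / sigma) ** 2))

def spatial_graph_conv(X, boxes, E, mus, sigmas):
    # X: (N, D) node features; boxes: (N, 4) candidate frames; E: (N, N) edge
    # weights e_ij; mus/sigmas: K assumed learnable kernel parameters.
    # Returns K aggregated features per node, following
    # f'_k(i) = sum over adjacent nodes j of omega_k(g_ij) * x_j * e_ij.
    N, D = X.shape
    K = len(mus)
    out = np.zeros((K, N, D))
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            xi, yi, x2i, y2i = boxes[i]
            xj, yj, x2j, y2j = boxes[j]
            wi, hi = x2i - xi, y2i - yi
            # assumed relative spatial relationship between objects i and j
            g = np.array([(xj - xi) / wi, (yj - yi) / hi,
                          (x2j - xj) / wi, (y2j - yj) / hi])
            for k in range(K):
                out[k, i] += gaussian_kernel(g, mus[k], sigmas[k]) * X[j] * E[i, j]
    return out

X = np.random.rand(3, 8)
boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 20], [30, 30, 50, 60]], dtype=float)
E = np.random.rand(3, 3)
out = spatial_graph_conv(X, boxes, E, mus=[np.zeros(4)] * 2, sigmas=[np.ones(4)] * 2)
print(out.shape)   # (2, 3, 8)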
608. Perform inter-domain graph convolution network inference.
Using the constructed inter-domain regional graph, the inter-domain graph convolution network is used to propagate the high-level semantic representations of the objects to be detected in the different domains on the nodes, so as to obtain, after inference, features into which the high-level semantic representations of the objects to be detected in the different domains have been fused.
609. Determine the candidate frames and classification of the object to be detected according to the initial image characteristics of the object to be detected and the enhanced image characteristics of the object to be detected.
The features obtained by inference from the intra-domain graph convolution and the inter-domain graph convolution are projected into the corresponding high-level semantic representations of the objects to be detected, and the features are then classified and regressed.
Step 609 can be understood by referring to step 404 in the embodiment corresponding to fig. 4, and the detailed description is not repeated here.
In order to better understand the object detection method according to the embodiment of the present application, the object detection method according to the embodiment of the present application is described in detail with reference to a more specific flowchart.
Fig. 7 is a schematic flowchart of an object detection method according to an embodiment of the present application.
The method illustrated in fig. 7 may be performed by an object detection apparatus, which may be an electronic device having an object detection function. The form of the apparatus embodied in the electronic device may be as described above in connection with the method shown in fig. 4.
Step 1: input the picture, and obtain preliminary candidate frames and the features of the objects to be detected through a conventional object detector.

Step 2: extract the classification layer parameters by using the classifiers corresponding to the different domains in the object detector, and construct a domain-related semantic pool for each domain to record the high-level semantic features of each class. This semantic pool is continuously updated during the training process as the classifier is optimized.

Step 3: construct an intra-domain region map: the high-level semantic features of the semantic pool are mapped onto the nodes of the intra-domain region map according to the classification weights, given by the detection network, of the features of the objects to be detected on the different classes, so as to obtain high-level semantic representations of the objects to be detected. The weights on the node connecting edges of the intra-domain region map are assigned according to the relationships between the features of the different objects to be detected.

Step 4: construct an inter-domain regional graph: the high-level semantic features of the semantic pool are mapped onto the nodes of the inter-domain regional graph according to the classification weights, given by the detection network, of the features of the objects to be detected in their respective domains on the different classes, so as to obtain high-level semantic representations of the objects to be detected. The relation weights between the classes are assigned according to the distances between the classification class features of the objects to be detected in the two different domains, and are projected onto the node connecting edges of the inter-domain regional graph to obtain the weights of the node connecting edges of the inter-domain regional graph.

Step 5: intra-domain graph convolution: through the constructed intra-domain region map, the intra-domain graph convolution network is used to propagate the high-level semantic representations of the different objects to be detected on the nodes, so as to obtain, after inference, features into which the high-level semantic representations of the other objects to be detected have been fused. By learning a sparse region map, the high-level semantic representations of different objects to be detected are fused, which enhances the feature expression capability for the different objects to be detected.

Step 6: inter-domain graph convolution: using the constructed inter-domain regional graph, the inter-domain graph convolution network is used to propagate the high-level semantic representations of the objects to be detected in the different domains on the nodes, so as to obtain, after inference, features into which the high-level semantic representations of the objects to be detected in the different domains have been fused.

Step 7: optimize and enhance the candidate region feature layer: the features obtained by inference from the intra-domain graph convolution and the inter-domain graph convolution are projected into the high-level semantic representations of the corresponding objects to be detected, and the features are classified and regressed, so as to improve the large-scale detection performance.
In order to better explain the beneficial effects of the object detection method according to the embodiment of the present application, the following describes in detail the effects of the object detection method according to the embodiment of the present application with respect to the existing object detection method by specific examples with reference to tables 1 and 2.
Taking table 1 as an example and with reference to specific experimental data, the following describes several existing object detection methods and the object detection effect of the method provided in the present application. The first method shown in table 1 is the FPN detection method, and the second method is the multi-branch detection method (Multi Branches). The data sets used for training the models include the MSCOCO data set, the Visual Genome (VG) data set, and the ADE data set; that is, the models are trained jointly on these three data sets, and each data set is tested in the testing stage. The MSCOCO data set has detection labels for 80 general object categories, and includes about 110,000 training images and 5,000 test images. The VG data set is a large-scale general object detection data set with a total of 1,000 categories, a training set of 88,000 pictures, and a test set of 5,000 pictures. The ADE data set is a large-scale general object detection data set with 445 categories, a training set of 20,000 pictures, and a test set of 1,000 pictures.
When the object detection effect is evaluated, the average precision (AP) and the average recall (AR) are mainly used, and the comparison considers the precision under different thresholds as well as the average precision and average recall for objects of different sizes.
As shown in table 1, model training is performed on the MSCOCO dataset, the VG dataset, and the ADE dataset together, and when testing is performed on different datasets, the AP and the AR of the method of the present application are respectively larger than those of the first method and the second method, and the larger the values of the AP and the AR are, the better the effect of object detection is. As can be seen from Table 1, the method of the present application has a significant effect improvement over several existing object detection methods.
Table 1:
[Table 1 image: BDA0002377597140000191]
In table 1, the models are trained by using three data sets. In order to better show the beneficial effects brought by the method provided in the present application, the following describes, with reference to table 2, training the models by using two data sets, and compares the effect of the method provided in the present application with several existing object detection methods. In addition to the above-mentioned object detection methods, when model training is performed using two data sets, other object detection methods can also be compared, such as the third method: the fine-tuning detection method (fine-tuning); the fourth method: the overlap label detection method (overlap labels); and the fifth method: the pseudo label detection method (pseudo labels).
As shown in table 2, any two of the MSCOCO, VG, and ADE data sets are used together for model training, and when testing is performed on the different data sets, the AP and AR of the method of the present application are greater than those of the existing object detection methods described above; the larger the AP and AR values, the better the object detection effect. As can be seen from table 2, the method of the present application achieves a significant improvement over the several existing object detection methods.
Table 2:
[Table 2 image: BDA0002377597140000201]
the method provided by the application has the advantages that the detection effect is obviously improved under the conditions of serious object shielding, fuzzy category, small-scale objects and the like. Compared with other domain migration object detection methods, the method effectively captures the internal relation among different objects by constructing the migratable knowledge-graph under multiple domains, fuses a large amount of information of different data sets and different categories by utilizing the graph convolution network, greatly improves the data utilization rate, enables the detection performance to be higher, and really realizes large-scale object detection.
Fig. 8 is a schematic flow chart of a training method of a neural network according to an embodiment of the present application. The method shown in fig. 8 may be executed by a device with high arithmetic capability, such as a computer device, a server device, or an arithmetic device. As described in detail below.
801. Training data is acquired.
The training data comprises training images of different domains and object detection labeling results of the object to be detected in the training images.
802. Extract the initial image characteristics of the object to be detected in the training image according to the neural network.

803. Extract the enhanced image characteristics of the object to be detected in the training image according to the neural network and the cross-domain knowledge map information.
The cross-domain knowledge map information comprises the incidence relation among the object classes corresponding to the objects to be detected in different domains, and the enhanced image characteristics indicate the semantic information of the object classes corresponding to other objects related to the objects to be detected in different domains.
804. Process the initial image characteristics and the enhanced image characteristics of the object to be detected according to the neural network to obtain an object detection result of the object to be detected.

805. Determine the model parameters of the neural network according to the object detection result of the object to be detected in the training image and the object detection labeling result of the object to be detected in the training image.
Optionally, the object detection labeling result of the object to be detected in the training image includes a labeling candidate frame and a labeling classification result of the object to be detected in the training image.
In addition, in the training process, a plurality of different domains or different data sets can be used, and a plurality of training images are generally used.
In the process of training the neural network, a set of initial model parameters can be set for the neural network, and then the model parameters of the neural network are gradually adjusted according to the difference between the object detection result of the object to be detected in the training image and the object detection labeling result of the object to be detected in the training image, until the difference between the object detection result of the object to be detected in the training image and the object detection labeling result of the object to be detected in the training image is within a certain preset range, or until the number of training iterations reaches a preset number. The model parameters of the neural network at that point are determined as the final parameters of the neural network model, thereby completing the training of the neural network.
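This training loop can be sketched as follows (Python, illustrative only; the model interface with loss, gradient, and params members is hypothetical and stands in for whatever detection network is being trained):

import numpy as np

def train(model, images, annotations, lr=1e-3, max_steps=10000, tol=1e-3):
    # Start from the initial parameters, compare the detection result on a
    # training image with its labeling result, and adjust the parameters
    # until the difference is small enough or the step budget is exhausted.
    for step in range(max_steps):
        idx = np.random.randint(len(images))
        loss = model.loss(images[idx], annotations[idx])
        if loss < tol:
            break
        for name, grad in model.gradient(images[idx], annotations[idx]).items():
            model.params[name] -= lr * grad
    return model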
It should be understood that the neural network trained by the method shown in fig. 8 can be used to perform the object detection method of the embodiments of the present application.
In the application, when the neural network is trained, not only the initial image features of the object to be detected in the training image are extracted, but also the enhanced image features of the object to be detected in the training image are extracted, and the object detection result of the object to be detected is determined comprehensively according to the initial image features and the enhanced image features of the object to be detected. That is to say, the training method of the application extracts more features to perform object detection in the training process, and can train to obtain a neural network with better performance, so that better object detection effect can be obtained by using the neural network to perform object detection.
In a specific embodiment, the cross-domain knowledge graph may include nodes and node connecting edges, where the nodes correspond to the objects to be detected and the node connecting edges correspond to the relationships between the high-level semantic features of different objects to be detected. And weighting and fusing classification layer parameters corresponding to different domains according to the classification weight of the initial image features in the different domains on different object classes to obtain the high-level semantic features of the object to be detected, wherein the classification layer parameters can be understood as a class center for maintaining the classes. And projecting the relation weight between the object categories corresponding to the objects to be detected in different domains to the node connecting edges of the objects to be detected to obtain the weight of the node connecting edges.
In a specific embodiment, the enhanced image features of the object to be detected can be obtained by performing convolution processing on the high-level semantic features according to the weights of the node connecting edges.
In a specific embodiment, the relationship weight may be determined according to a distance relationship between object categories corresponding to the objects to be detected in different domains. The distance relationship includes one or more of the following information:
(1) attribute relationships of different object classes in different domains.
For example, if the color of apple is red and the color of strawberry is also red, then apple and strawberry have the same attribute in color (or, alternatively, apple and strawberry are closer in color attribute).
(2) The positional relationship or subject-object relationship of different object categories in different domains.

For example, a car is on a street and a woman carries a handbag: the street and the car are located close together, and the woman and the handbag satisfy a subject-object relationship.
(3) Word embedding similarity constructed using linguistic knowledge for different object classes in different domains.
The word embedding similarity constructed by using linguistic knowledge can be understood as the similarity degree between word vectors of different object categories.
(4) The distance relationship obtained by training the neural network model according to the training data for different object categories in different domains.
For example, for two different domains, the weight of the connecting edge between the ith node in one domain and the jth node in the other domain is

[formula image BDA0002377597140000221]

where f_i and f_j are the features of the ith object to be detected in one domain and the jth object to be detected in the other domain (short for the initial image features of the objects to be detected).
The object detection method and the neural network training method according to the embodiments of the present application are described in detail above with reference to the accompanying drawings, and the related apparatuses according to the embodiments of the present application are described in detail below with reference to fig. 9 to 11. It should be understood that the object detection device shown in fig. 9 and 10 can perform the respective steps of the object detection method of the embodiment of the present application, and the neural network training device shown in fig. 11 can perform the respective steps of the neural network training method of the embodiment of the present application, and the repetitive description will be appropriately omitted when the devices shown in fig. 9 to 11 are described below.
Fig. 9 is a schematic block diagram of an object detection apparatus according to an embodiment of the present application. The object detection device shown in fig. 9 includes:
an image obtaining module 901, configured to execute step 401 in the embodiment corresponding to fig. 4, and step 601 in the embodiment corresponding to fig. 6.
The feature extraction module 902 is configured to execute step 402 in the embodiment corresponding to fig. 4, step 602 in the embodiment corresponding to fig. 6, step 603 in the embodiment corresponding to fig. 6, step 607 in the embodiment corresponding to fig. 6, and step 608 in the embodiment corresponding to fig. 6.
The detecting module 903 is configured to execute step 404 in the embodiment corresponding to fig. 4, and step 609 in the embodiment corresponding to fig. 6.
A parameter extracting module 904, configured to perform step 403 in the embodiment corresponding to fig. 4 and step 604 in the embodiment corresponding to fig. 6.
The projection module 905 is configured to perform step 605 in the embodiment corresponding to fig. 6 and step 606 in the embodiment corresponding to fig. 6.
A relation weight determining module 906, configured to execute step 605 in the embodiment corresponding to fig. 6 and step 606 in the embodiment corresponding to fig. 6.
In the application, a large amount of information from different data sets and different categories can be effectively utilized to train the same network, so that the data utilization rate is greatly improved and the detection performance is higher. Related semantic information can be merged and transmitted across a plurality of different domains through the graph convolution, and the internal relations between different objects under different data sets can be effectively captured, so that the labeling information of different domains and different data sets can complement each other. The high-level semantic information of the objects to be detected, enhanced by the intra-domain and inter-domain graph convolutions, can be used simultaneously in a plurality of different domains to identify and classify objects, so that the identification accuracy is greatly improved.
When the object detection method according to the embodiment of the present application is executed by the execution device 110 in fig. 1, the image acquisition module 901 in the object detection apparatus may be equivalent to the I/O interface 112 in the execution device 110, and the feature extraction module 902 and the detection module 903 in the object detection apparatus may be equivalent to the calculation module 111 in the execution device 110.
When the object detection method according to the embodiment of the present application is executed by the neural network processor in fig. 3, the image obtaining module 901 in the object detection apparatus may be equivalent to the bus interface unit 310 in the neural network processor, and the feature extraction module 902 and the detection module 903 in the object detection apparatus may be equivalent to the arithmetic circuit 303, or to the combination of the arithmetic circuit 303, the vector calculation unit 307, and the accumulator 308.
Fig. 10 is a schematic block diagram of an object detection apparatus according to an embodiment of the present application. The object detection device module shown in fig. 10 includes a memory 1001, a processor 1002, a communication interface 1003, and a bus 1004. The memory 1001, the processor 1002, and the communication interface 1003 are communicatively connected to each other via a bus 1004.
The communication interface 1003 corresponds to an image acquisition module 901 in the object detection device, and the processor 1002 corresponds to a feature extraction module 902 and a detection module 903 in the object detection device. Each of the object detecting device modules and the modules will be described in detail below.
The memory 1001 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 1001 may store a program, and the processor 1002 and the communication interface 1003 are used to perform the steps of the object detection method of the embodiment of the present application when the program stored in the memory 1001 is executed by the processor 1002. In particular, the communication interface 1003 may retrieve an image to be detected from a memory or other device, and then subject the image to be detected to object detection by the processor 1002.
The processor 1002 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement functions that are required to be executed by modules in the object detection apparatus according to the embodiment of the present disclosure (for example, the processor 1002 may implement the functions that are required to be executed by the feature extraction module 902 and the detection module 903 in the object detection apparatus), or to execute the object detection method according to the embodiment of the present disclosure.
The processor 1002 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the object detection method according to the embodiment of the present application may be implemented by integrated logic circuits of hardware in the processor 1002 or instructions in the form of software.
The processor 1002 may also be a general purpose processor, a Digital Signal Processor (DSP), an ASIC, an FPGA (field programmable gate array) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 1001, and the processor 1002 reads information in the memory 1001, and completes functions required to be executed by modules included in the object detection apparatus according to the embodiment of the present application, or executes the object detection method according to the embodiment of the method of the present application, in combination with hardware thereof.
The communication interface 1003 enables communication between the device module and other equipment or a communication network using a transceiver device such as, but not limited to, a transceiver. For example, the image to be processed may be acquired through the communication interface 1003.
Bus 1004 may include a pathway to transfer information between various components of the device module (e.g., memory 1001, processor 1002, communication interface 1003).
Fig. 11 is a schematic hardware configuration diagram of a neural network training device according to an embodiment of the present application. Similar to the devices described above, the neural network training device shown in FIG. 11 includes a memory 1101, a processor 1102, a communication interface 1103, and a bus 1104. The memory 1101, the processor 1102 and the communication interface 1103 are communicatively connected to each other through a bus 1104.
The memory 1101 may store a program, and when the program stored in the memory 1101 is executed by the processor 1102, the processor 1102 is configured to perform the steps of the training method of the neural network according to the embodiment of the present application.
The processor 1102 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU or one or more integrated circuits, and is configured to execute the relevant programs to implement the neural network training method according to the embodiment of the present application.
The processor 1102 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the neural network training method (such as the method shown in fig. 8) according to the embodiment of the present application may be implemented by hardware integrated logic circuits in the processor 1102 or instructions in the form of software.
It should be understood that, by training the neural network through the neural network training device shown in fig. 11, the trained neural network can be used to perform the object detection method according to the embodiments of the present application (such as the method shown in fig. 4).
Specifically, the apparatus shown in fig. 11 may obtain training data and a neural network to be trained from the outside through the communication interface 1103, and then train the neural network to be trained according to the training data by the processor.
It should be noted that although the above-described apparatus modules and apparatus show only memories, processors, and communication interfaces, in particular implementations, those skilled in the art will appreciate that the apparatus modules and apparatus may also include other devices necessary to achieve proper operation. Also, the device modules and devices may include hardware components to implement other additional functions, as may be appreciated by those skilled in the art, according to particular needs. Furthermore, those skilled in the art will appreciate that the apparatus modules and apparatus may also include only those components necessary to implement the embodiments of the present application, and need not include all of the components shown in FIGS. 10 and 11.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The foregoing descriptions are merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. An object detection method, comprising:
acquiring an image to be detected;
determining the initial image features of the object to be detected in the image to be detected;
determining enhanced image features of the object to be detected according to cross-domain knowledge graph information, wherein the cross-domain knowledge graph information comprises an association relationship between object categories corresponding to the object to be detected in different domains, and the enhanced image features indicate semantic information of object categories corresponding to other objects associated with the object to be detected in the different domains;
and determining candidate boxes and classifications of the object to be detected according to the initial image features of the object to be detected and the enhanced image features of the object to be detected.
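By way of illustration only (not part of the claims), the following Python sketch shows one possible way to combine the initial image features with the knowledge-graph-enhanced image features when predicting candidate boxes and classifications, as recited in claim 1. The fusion by concatenation, the linear prediction heads, and every name used here are assumptions introduced for the example, not details taken from the application.

# Minimal sketch only: an external backbone/RPN is assumed to produce the
# per-object "initial image features", and some graph module the "enhanced
# image features"; this only shows how they might be fused for prediction.
import numpy as np

rng = np.random.default_rng(0)

def detect_objects(initial_feats, enhanced_feats, w_box, w_cls):
    """Fuse initial and enhanced features, then regress boxes and classify.

    initial_feats:  (num_objects, d) features from the detection backbone
    enhanced_feats: (num_objects, d) features carrying cross-domain semantics
    """
    fused = np.concatenate([initial_feats, enhanced_feats], axis=1)  # (N, 2d)
    boxes = fused @ w_box                     # (N, 4) candidate-box offsets
    logits = fused @ w_cls                    # (N, num_classes) class scores
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True) # softmax over categories
    return boxes, probs

# toy usage: 3 objects, 256-d features, 10 categories
d, n, c = 256, 3, 10
initial = rng.normal(size=(n, d))
enhanced = rng.normal(size=(n, d))
boxes, probs = detect_objects(initial, enhanced,
                              rng.normal(size=(2 * d, 4)) * 0.01,
                              rng.normal(size=(2 * d, c)) * 0.01)
print(boxes.shape, probs.shape)               # (3, 4) (3, 10)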
2. The method according to claim 1, wherein the cross-domain knowledge graph comprises nodes and node connecting edges, the nodes correspond to the objects to be detected, and the node connecting edges correspond to relationships between high-level semantic features of different objects to be detected, the method further comprising:
obtaining classification layer parameters corresponding to the different domains;
performing weighted fusion on the classification layer parameters corresponding to the different domains according to the classification weights of the initial image features on different object categories in the different domains, to obtain the high-level semantic features of the object to be detected;
and projecting the relation weight between the object categories corresponding to the objects to be detected in the different domains onto the node connecting edges of the objects to be detected to obtain the weight of the node connecting edges.
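The two steps recited in claim 2 can be illustrated with the hedged sketch below: per-domain classification-layer parameters are fused using the initial image features' classification weights, and category-level relation weights are projected onto the node connecting edges (here, simply row-normalised). The shapes, the averaging over domains, and the normalisation are assumptions made for the example only.

# Illustrative sketch under assumed shapes: classifier_params[k] is domain k's
# classification-layer weight matrix (num_classes_k x d), cls_weights[k] is an
# object's classification weight over domain k's categories, and
# relation_weight[i, j] relates the categories of objects i and j.
import numpy as np

def high_level_semantic_feature(classifier_params, cls_weights):
    """Weighted fusion of per-domain classification-layer parameters."""
    fused = np.zeros(classifier_params[0].shape[1])
    for params_k, weights_k in zip(classifier_params, cls_weights):
        fused += weights_k @ params_k          # weight each class vector, sum
    return fused / len(classifier_params)      # (d,) high-level semantic feature

def edge_weights(relation_weight):
    """Project category-level relation weights onto node connecting edges,
    row-normalised so each node's outgoing weights sum to 1."""
    w = np.asarray(relation_weight, dtype=float)
    return w / np.clip(w.sum(axis=1, keepdims=True), 1e-8, None)

# toy usage: two domains with 5 and 8 categories, 64-d classifier vectors
rng = np.random.default_rng(3)
params = [rng.normal(size=(5, 64)), rng.normal(size=(8, 64))]
weights = [rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(8))]
h = high_level_semantic_feature(params, weights)   # (64,)
e = edge_weights(np.abs(rng.normal(size=(3, 3))))  # (3, 3), rows sum to 1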
3. The method of claim 2, further comprising:
and determining the relation weight according to the distance relation between the object categories corresponding to the objects to be detected in the different domains.
4. The method according to claim 3, wherein the distance relationship between the object categories corresponding to the objects to be detected comprises one or more of the following information:
attribute relations among object categories corresponding to the objects to be detected in different domains;
the positional relation or the subject-object relation among object categories corresponding to the objects to be detected in different domains;
word embedding similarity, constructed using linguistic knowledge, among object categories corresponding to the objects to be detected in different domains;
and distance relations among object categories corresponding to the objects to be detected in different domains, obtained by training a neural network model on training data.
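As one concrete (and purely illustrative) instance of the word embedding similarity listed in claim 4, the sketch below computes cosine similarity between category embedding vectors; in practice these would be pre-trained linguistic embeddings (e.g. word2vec or GloVe vectors for the category names), whereas random placeholders are used here.

# Hedged example: only the similarity computation is shown; the embeddings
# themselves are random stand-ins for real category word embeddings.
import numpy as np

def embedding_similarity(cat_embeddings):
    """Pairwise cosine similarity between category embedding vectors."""
    e = np.asarray(cat_embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    return e @ e.T                               # (num_categories, num_categories)

rng = np.random.default_rng(1)
fake_embeddings = rng.normal(size=(5, 300))      # e.g. 5 categories, 300-d vectors
sim = embedding_similarity(fake_embeddings)
print(sim.shape)                                 # (5, 5), values in [-1, 1]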
5. The method according to any one of claims 2 to 4, wherein determining the enhanced image features of the object to be detected according to the cross-domain knowledge graph information comprises:
and carrying out convolution processing on the high-level semantic features according to the weight of the node connecting edges to obtain the enhanced image features of the object to be detected.
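The convolution processing recited in claim 5 can be read as a single graph-convolution step, sketched below under assumed shapes: each node aggregates its neighbours' high-level semantic features using the weights of the node connecting edges, followed by a learned linear map and a ReLU. The activation and the projection matrix are assumptions added for illustration, not details from the application.

# Illustrative graph-convolution step over the cross-domain knowledge graph.
import numpy as np

def graph_convolve(semantic_feats, edge_w, w):
    """semantic_feats: (N, d) high-level semantic features of the N nodes
    edge_w:            (N, N) weights of the node connecting edges
    w:                 (d, d_out) learnable projection
    returns            (N, d_out) enhanced image features"""
    aggregated = edge_w @ semantic_feats       # propagate along weighted edges
    return np.maximum(aggregated @ w, 0.0)     # linear map + ReLU

rng = np.random.default_rng(2)
feats = rng.normal(size=(4, 256))
edges = np.abs(rng.normal(size=(4, 4)))
edges /= edges.sum(axis=1, keepdims=True)      # row-normalised edge weights
enhanced = graph_convolve(feats, edges, rng.normal(size=(256, 128)) * 0.05)
print(enhanced.shape)                          # (4, 128)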
6. An image detection apparatus, characterized by comprising:
the image acquisition module is used for acquiring an image to be detected;
the feature extraction module is used for determining the initial image features of the object to be detected in the image to be detected;
the feature extraction module is further configured to determine, according to cross-domain knowledge graph information, enhanced image features of the object to be detected, where the cross-domain knowledge graph information includes an association relationship between object categories corresponding to the object to be detected in different domains, and the enhanced image features indicate semantic information of object categories corresponding to other objects associated with the object to be detected in the different domains;
and the detection module is used for determining the candidate box and the classification of the object to be detected according to the initial image features of the object to be detected and the enhanced image features of the object to be detected.
7. The image detection device according to claim 6, wherein the cross-domain knowledge graph comprises nodes and node connecting edges, the nodes correspond to the objects to be detected, and the node connecting edges correspond to relationships between the high-level semantic features of different objects to be detected, and the image detection device further comprises a parameter acquisition module and a projection module, wherein
the parameter acquisition module is used for acquiring classification layer parameters corresponding to the different domains;
the feature extraction module is specifically configured to perform weighted fusion on the classification layer parameters corresponding to the different domains according to the classification weights of the initial image features on different object categories in the different domains, to obtain the high-level semantic features of the object to be detected;
and the projection module is used for projecting the relation weight between the object categories corresponding to the objects to be detected in the different domains onto the node connecting edges of the objects to be detected to obtain the weight of the node connecting edges.
8. The image detection apparatus according to claim 7, further comprising a relationship weight determination module,
and the relation weight determining module is used for determining the relation weight according to the distance relation between the object categories corresponding to the objects to be detected in the different domains.
9. The image detection device according to claim 8, wherein the distance relationship between the object categories corresponding to the objects to be detected includes one or more of the following information:
attribute relations among object categories corresponding to the objects to be detected in different domains;
the positional relation or the subject-object relation among the object categories corresponding to the objects to be detected in different domains;
word embedding similarity, constructed using linguistic knowledge, among object categories corresponding to the objects to be detected in different domains;
and distance relations among object categories corresponding to the objects to be detected in different domains, obtained by training a neural network model on training data.
10. The image detection apparatus according to any one of claims 7 to 9,
the feature extraction module is specifically configured to perform convolution processing on the high-level semantic features according to the weights of the node connecting edges to obtain the enhanced image features of the object to be detected.
11. An object detecting device, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory, wherein, when the program stored in the memory is executed, the processor is configured to perform the method according to any one of claims 1 to 5.
12. A computer storage medium, characterized in that the computer storage medium stores program code, and the program code comprises instructions for performing the steps of the method according to any one of claims 1 to 5.
CN202010072238.0A 2020-01-21 2020-01-21 Object detection method and device and storage medium Pending CN111310604A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010072238.0A CN111310604A (en) 2020-01-21 2020-01-21 Object detection method and device and storage medium
PCT/CN2020/112796 WO2021147325A1 (en) 2020-01-21 2020-09-01 Object detection method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010072238.0A CN111310604A (en) 2020-01-21 2020-01-21 Object detection method and device and storage medium

Publications (1)

Publication Number Publication Date
CN111310604A true CN111310604A (en) 2020-06-19

Family

ID=71161604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010072238.0A Pending CN111310604A (en) 2020-01-21 2020-01-21 Object detection method and device and storage medium

Country Status (2)

Country Link
CN (1) CN111310604A (en)
WO (1) WO2021147325A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783457A (en) * 2020-07-28 2020-10-16 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on multi-modal graph convolutional network
WO2020253416A1 (en) * 2019-06-17 2020-12-24 华为技术有限公司 Object detection method and device, and computer storage medium
CN112925920A (en) * 2021-03-23 2021-06-08 西安电子科技大学昆山创新研究院 Smart community big data knowledge graph network community detection method
WO2021147325A1 (en) * 2020-01-21 2021-07-29 华为技术有限公司 Object detection method and apparatus, and storage medium
CN114627443A (en) * 2022-03-14 2022-06-14 小米汽车科技有限公司 Target detection method and device, storage medium, electronic equipment and vehicle

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579981B (en) * 2022-03-10 2022-11-01 北京国腾创新科技有限公司 Cross-domain vulnerability detection method, system, storage medium and electronic equipment
CN114881329B (en) * 2022-05-09 2023-04-07 山东大学 Tire quality prediction method and system based on guide map convolution neural network
CN116244284B (en) * 2022-12-30 2023-11-14 成都中轨轨道设备有限公司 Big data processing method based on three-dimensional content
CN116665095B (en) * 2023-05-18 2023-12-22 中国科学院空间应用工程与技术中心 Method and system for detecting motion ship, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573711A (en) * 2014-12-22 2015-04-29 上海交通大学 Object and scene image understanding method based on text-object-scene relations
CN110378381A (en) * 2019-06-17 2019-10-25 华为技术有限公司 Object detecting method, device and computer storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7664339B2 (en) * 2004-05-03 2010-02-16 Jacek Turski Image processing method for object recognition and dynamic scene understanding
CN110704626B (en) * 2019-09-30 2022-07-22 北京邮电大学 Short text classification method and device
CN111310604A (en) * 2020-01-21 2020-06-19 华为技术有限公司 Object detection method and device and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573711A (en) * 2014-12-22 2015-04-29 上海交通大学 Object and scene image understanding method based on text-object-scene relations
CN110378381A (en) * 2019-06-17 2019-10-25 华为技术有限公司 Object detecting method, device and computer storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020253416A1 (en) * 2019-06-17 2020-12-24 华为技术有限公司 Object detection method and device, and computer storage medium
WO2021147325A1 (en) * 2020-01-21 2021-07-29 华为技术有限公司 Object detection method and apparatus, and storage medium
CN111783457A (en) * 2020-07-28 2020-10-16 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on multi-modal graph convolutional network
CN111783457B (en) * 2020-07-28 2021-05-11 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on multi-modal graph convolutional network
CN112925920A (en) * 2021-03-23 2021-06-08 西安电子科技大学昆山创新研究院 Smart community big data knowledge graph network community detection method
CN114627443A (en) * 2022-03-14 2022-06-14 小米汽车科技有限公司 Target detection method and device, storage medium, electronic equipment and vehicle
CN114627443B (en) * 2022-03-14 2023-06-09 小米汽车科技有限公司 Target detection method, target detection device, storage medium, electronic equipment and vehicle

Also Published As

Publication number Publication date
WO2021147325A1 (en) 2021-07-29

Similar Documents

Publication Publication Date Title
CN110378381B (en) Object detection method, device and computer storage medium
CN110298262B (en) Object identification method and device
WO2021147325A1 (en) Object detection method and apparatus, and storage medium
CN111291809B (en) Processing device, method and storage medium
WO2021043112A1 (en) Image classification method and apparatus
CN111368972B (en) Convolutional layer quantization method and device
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN110222717B (en) Image processing method and device
CN112446380A (en) Image processing method and device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN111401517B (en) Method and device for searching perceived network structure
CN110222718B (en) Image processing method and device
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
CN114255361A (en) Neural network model training method, image processing method and device
CN111882031A (en) Neural network distillation method and device
WO2022007867A1 (en) Method and device for constructing neural network
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
CN112287954A (en) Image classification method, training method of image classification model and device thereof
CN113570029A (en) Method for obtaining neural network model, image processing method and device
CN111797882A (en) Image classification method and device
CN112529904A (en) Image semantic segmentation method and device, computer readable storage medium and chip
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
CN113011562A (en) Model training method and device
CN110705564B (en) Image recognition method and device
CN115375781A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination