WO2021147325A1

WO2021147325A1 - Object detection method and apparatus, and storage medium

Info

Publication number: WO2021147325A1
Application number: PCT/CN2020/112796
Authority: WO
Inventors: 徐航; 周峰暐; 黎嘉伟; 梁小丹; 李震国; 钱莉
Original assignee: 华为技术有限公司
Priority date: 2020-01-21
Filing date: 2020-09-01
Publication date: 2021-07-29
Also published as: CN111310604A

Abstract

Embodiments of the present application relate to the field of artificial intelligence, and specifically relate to the field of computer vision. Disclosed are an object detection method and apparatus. The method may comprise: obtaining an image to be detected; determining, in the image to be detected, the initial image features of an object to be detected; according to cross-domain knowledge graph information, determining the enhanced image features of the object to be detected, wherein the cross-domain knowledge graph information comprises an association relationship between object categories corresponding to the object to be detected in different domains, and the enhanced image features indicate semantic information of the object categories corresponding to other objects associated with the object to be detected in different domains; and according to the initial image features of the object to be detected and the enhanced image features of the object to be detected, determining the candidate box and category of the object to be detected. By means of the technical solution provided by the present application, a cross-domain knowledge graph is constructed, an internal relationship between different objects to be detected can be obtained, and the effect of object detection is improved.

Description

Object detection method, device and storage medium

This application claims priority to Chinese patent application No. 202010072238.0, filed with the Chinese Patent Office on January 21, 2020 and entitled "An object detection method, device and storage medium", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of computer vision, and in particular to an object detection method, device and storage medium.

Background technique

Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.

Object detection is a basic computer vision task that can identify the location and category of objects in an image. In practical applications, researchers and engineers will create data sets for different specific problems according to the application scenarios and actual task requirements, and use them to train highly customized and unique automatic object detectors.

Summary of the invention

Object detection across data sets is an efficient way to achieve large-scale object detection. However, existing multi-task learning handles multiple tasks at the same time only by adding multiple branches to the model. It cannot realize interaction between different data sets and different object categories, and it cannot capture the internal relationship between the objects to be detected in different data sets, so its effect is poor.

The present application provides an object detection method, device, and computer storage medium to improve the effect of object detection.

The first aspect of the present application provides an object detection method, which may include: acquiring an image to be detected; determining the initial image feature of the object to be detected in the image to be detected; determining the enhanced image feature of the object to be detected according to cross-domain knowledge graph information, where the cross-domain knowledge graph information may include the association relationship between the object categories corresponding to the object to be detected in different domains, and the enhanced image feature indicates semantic information of the object categories corresponding to other objects in different domains that are associated with the object to be detected; and determining the candidate frame and classification of the object to be detected according to the initial image feature of the object to be detected and the enhanced image feature of the object to be detected.

The above object detection method can be applied in different application scenarios. For example, the above object detection method can be applied in the scene of recognizing everything, and it can also be applied in the scene of street view recognition.

When the above method is applied to a scene where a mobile terminal is used to identify everything, the above-mentioned image to be detected may be an image taken by the mobile terminal through a camera, or an image already stored in the mobile terminal's album.

When the above method is applied to a scene of street view recognition, the above-mentioned image to be detected may be a street view image taken by a camera on the roadside.

The greater the probability of two categories appearing in the same image at the same time, the greater the correlation between the two categories. For example, the object categories in the first domain or the first data set include men, women, boys, girls, roads, and streets. Object categories in the second domain include people, handbags, school bags, cars, and trucks. It can be considered that the men, women, boys, and girls in the first domain have an association relationship with the people in the second domain. The women and girls in the first domain have an association with the handbags in the second domain. There is an association between roads and streets in the first domain and cars and trucks in the second domain.
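The co-occurrence idea in the preceding paragraph can be illustrated with a minimal sketch. It assumes that each image is annotated with a set of category names and estimates the association of two categories as the fraction of images in which both appear. The function name `cooccurrence_weights`, the input format, and this particular normalization are hypothetical choices made for illustration; the application does not prescribe them.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(image_labels):
    """Estimate pairwise association as the fraction of images containing both categories.

    image_labels: list of sets of category names, one set per image
    (a hypothetical input format chosen for this sketch).
    """
    pair_counts = Counter()
    for labels in image_labels:
        # Sort so that each unordered pair is always stored under the same key.
        for a, b in combinations(sorted(labels), 2):
            pair_counts[(a, b)] += 1
    total = len(image_labels)
    return {pair: count / total for pair, count in pair_counts.items()}


# Toy annotations mixing categories from two domains.
images = [
    {"woman", "handbag"},
    {"woman", "handbag", "street"},
    {"street", "car"},
    {"girl", "handbag"},
]
weights = cooccurrence_weights(images)
print(weights[("handbag", "woman")])  # 0.5
```

Pairs that never appear together are simply absent from the result, which corresponds to no association between the two categories.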

Semantic information can refer to high-level information that can assist in image detection. For example, the semantic information can specifically be what the object is and what is around the object (semantic information is generally distinguished from low-level information, such as image edges, pixels, and brightness). For example, if the object to be detected is a woman, and the other objects associated with the woman in the image to be detected include a handbag, then the enhanced image feature of the object to be detected may indicate semantic information of the handbag.

From the first aspect, it can be seen that the solution provided by this application can effectively use a large number of different data sets and different types of information to train the same network at the same time, which greatly improves the data utilization rate and makes the detection performance higher.

Optionally, in combination with the above first aspect, in the first possible implementation manner, the cross-domain knowledge graph may include nodes and node edges, where the nodes correspond to the objects to be detected, and the node edges correspond to the relationships between the high-level semantic features of different objects to be detected. The method may also include: obtaining classification layer parameters corresponding to different domains; weighting and fusing the classification layer parameters corresponding to different domains according to the classification weights of the initial image features in different domains on different object categories, to obtain the high-level semantic features of the object to be detected; and projecting the relationship weights between the object categories corresponding to the object to be detected in different domains onto the node edges of the object to be detected, to obtain the weights of the node edges.
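The weighted fusion of classification layer parameters described above can be sketched as follows. This is a minimal illustration, assuming that the classification weights are obtained by applying softmax to the raw class scores of one region and that each category owns one weight vector in the classification layer. The function names and the choice of softmax are assumptions made for this sketch, not details confirmed by the application.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_classifier_params(class_scores, classifier_weights):
    """Weighted sum of the per-category classifier weight vectors.

    class_scores: raw classification scores of one region over all categories.
    classifier_weights: one weight vector per category (rows of the classification layer).
    """
    probs = softmax(class_scores)
    dim = len(classifier_weights[0])
    return [sum(p * w[d] for p, w in zip(probs, classifier_weights)) for d in range(dim)]


# Toy example: three categories pooled from two domains, 2-dimensional weight vectors.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic = fuse_classifier_params([0.0, 0.0, 0.0], weights)
```

With equal scores, every category contributes equally, so `semantic` is the plain average of the three weight vectors.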

Optionally, in combination with the first possible implementation manner of the first aspect described above, in the second possible implementation manner, the method may further include: determining the relationship weights according to the distance relationship between the object categories corresponding to the objects to be detected in different domains.

Optionally, in combination with the second possible implementation manner of the first aspect described above, in the third possible implementation manner, the distance relationship between object categories corresponding to the object to be detected may include one or more of the following information: Attribute relationships between object categories corresponding to objects to be detected in different domains. The positional relationship or active-object relationship between object categories corresponding to objects to be detected in different domains. The similarity of word embeddings between the object categories corresponding to the objects to be detected in different domains is constructed using linguistic knowledge. The distance relationship between the object categories corresponding to the objects to be detected in different domains is obtained by training the neural network model according to the training data.
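One of the options listed above, the similarity of word embeddings between object categories, can be sketched with cosine similarity. The three-dimensional embeddings below are made-up values used only for illustration, and cosine similarity is one common choice of similarity measure assumed here; the application does not state which measure is used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings for categories from two different domains.
embeddings = {
    "woman": [1.0, 0.0, 0.0],
    "person": [1.0, 1.0, 0.0],
    "truck": [0.0, 0.0, 1.0],
}
sim_related = cosine_similarity(embeddings["woman"], embeddings["person"])
sim_unrelated = cosine_similarity(embeddings["woman"], embeddings["truck"])
```

In this toy setting, "woman" and "person" receive a larger relationship weight than "woman" and "truck".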

Optionally, in combination with any one of the first to third possible implementation manners of the first aspect, in the fourth possible implementation manner, determining the enhanced image feature of the object to be detected according to the cross-domain knowledge graph information includes: performing convolution processing on the high-level semantic features according to the weights of the node edges to obtain the enhanced image features of the object to be detected. It can be seen from the fourth possible implementation manner of the first aspect that, through graph convolution, relevant semantic information can be fused and propagated across multiple different domains, and the intrinsic relationship between different objects in different data sets can be effectively captured, so that the annotation information of different domains or different data sets can be complementary.
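The convolution of high-level semantic features according to node edge weights can be sketched as a single graph propagation step. The sketch assumes a common formulation in which the edge-weight matrix is row-normalized, multiplied by the node feature matrix, and then by a linear transform. The normalization, the function names, and the toy values are assumptions for illustration only.

```python
def row_normalize(adj):
    """Scale each row of the edge-weight matrix so that it sums to 1."""
    out = []
    for row in adj:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_convolution(edge_weights, node_features, transform):
    """One propagation step: normalized edge weights x node features x linear transform."""
    return matmul(matmul(row_normalize(edge_weights), node_features), transform)


# Three nodes: the first two are connected to each other, the third only to itself.
edge_weights = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
node_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
enhanced = graph_convolution(edge_weights, node_features, identity)
print(enhanced)  # [[0.5, 0.5], [0.5, 0.5], [2.0, 2.0]]
```

The two connected nodes end up sharing each other's features, while the isolated node keeps its own, which is the sense in which semantic information propagates along the edges.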

A second aspect of the present application provides an image detection device, which may include an image acquisition module, a feature extraction module, and a detection module. The image acquisition module is used to acquire an image to be detected. The feature extraction module is used to determine the initial image feature of the object to be detected in the image to be detected. The feature extraction module is also used to determine the enhanced image feature of the object to be detected according to cross-domain knowledge graph information, where the cross-domain knowledge graph information may include the association relationship between the object categories corresponding to the object to be detected in different domains, and the enhanced image feature indicates semantic information of the object categories corresponding to other objects in different domains that are associated with the object to be detected. The detection module is used to determine the candidate frame and classification of the object to be detected according to the initial image feature of the object to be detected and the enhanced image feature of the object to be detected.

Optionally, in combination with the above second aspect, in the first possible implementation manner, the cross-domain knowledge graph may include nodes and node edges, where the nodes correspond to the objects to be detected, and the node edges correspond to the relationships between the high-level semantic features of different objects to be detected. The image detection device may also include a parameter acquisition module and a projection module. The parameter acquisition module is used to acquire classification layer parameters corresponding to different domains. The feature extraction module is specifically used to weight and fuse the classification layer parameters corresponding to different domains according to the classification weights of the initial image features in different domains on different object categories, to obtain the high-level semantic features of the object to be detected. The projection module is used to project the relationship weights between the object categories corresponding to the objects to be detected in different domains onto the node edges of the objects to be detected, to obtain the weights of the node edges.

Optionally, in combination with the first possible implementation manner of the second aspect described above, in the second possible implementation manner, the image detection device may also include a relationship weight determination module, configured to determine the relationship weights according to the distance relationship between the object categories corresponding to the objects to be detected in different domains.

Optionally, in combination with the second possible implementation manner of the second aspect described above, in the third possible implementation manner, the distance relationship between object categories corresponding to the object to be detected may include one or more of the following information: Attribute relationships between object categories corresponding to objects to be detected in different domains. The positional relationship or active-object relationship between object categories corresponding to objects to be detected in different domains. The similarity of word embeddings between the object categories corresponding to the objects to be detected in different domains is constructed using linguistic knowledge. The distance relationship between the object categories corresponding to the objects to be detected in different domains is obtained by training the neural network model according to the training data.

Optionally, in combination with any one of the first to third possible implementation manners of the second aspect, in the fourth possible implementation manner, the feature extraction module is specifically used to perform convolution processing on the high-level semantic features according to the weights of the node edges, to obtain the enhanced image features of the object to be detected.

The third aspect of the present application provides a neural network training method. The method includes: acquiring training data, the training data including training images and object detection labeling results of the objects to be detected in the training images; extracting the initial image features of the object to be detected in the training image according to the neural network; extracting the enhanced image features of the object to be detected in the training image according to the neural network and the cross-domain knowledge graph information; processing the initial image features and the enhanced image features of the object to be detected according to the neural network to obtain the object detection result of the object to be detected; and determining the model parameters of the neural network according to the object detection result of the object to be detected in the training image and the object detection labeling result of the object to be detected in the training image.

The cross-domain knowledge graph information may include the association relationship between the object categories corresponding to the object to be detected in different domains, and the enhanced image feature indicates semantic information of the object categories corresponding to other objects in different domains that are associated with the object to be detected.

The object detection and labeling result of the object to be detected in the training image includes the labeling candidate frame and labeling classification result of the object to be detected in the training image.

In the process of training the above neural network, a set of initial model parameters can be set for the neural network, and the model parameters are then gradually adjusted based on the object detection result of the object to be detected in the training image and the object detection labeling result of the object to be detected in the training image. When the difference between the object detection result and the object detection labeling result of the object to be detected in the training image is within a certain preset range, or when the number of training iterations reaches a preset number, the model parameters of the neural network at that time are determined as the final parameters of the neural network model, thus completing the training of the neural network. It should be understood that the neural network obtained through training in the third aspect can be used to implement the method in the first aspect of the present application.
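The two stopping conditions described above (the difference falling within a preset range, or the number of training iterations reaching a preset number) can be sketched on a deliberately tiny model. The sketch fits a single parameter `w` in `y = w * x` by gradient descent; the model, the squared-error loss, and the default values of `learning_rate`, `tolerance`, and `max_steps` are all illustrative assumptions and stand in for the detection network and its loss.

```python
def train(samples, learning_rate=0.1, tolerance=1e-6, max_steps=1000):
    """Fit y = w * x by gradient descent; stop on small loss or after max_steps."""
    w = 0.0  # initial model parameter
    for step in range(max_steps):
        # Mean squared difference between predictions and labels.
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if loss < tolerance:  # the difference is within the preset range
            break
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * grad  # gradually adjust the parameter
    return w, step


# Labels generated by y = 2 * x, so training should approach w = 2.
w, steps = train([(1.0, 2.0), (2.0, 4.0)])
```

Here the loss threshold is reached long before the iteration limit, so the loop exits through the first condition; with a stricter tolerance or a smaller learning rate the iteration limit would apply instead.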

Optionally, in combination with the above third aspect, in the first possible implementation manner, the cross-domain knowledge graph may include nodes and node edges, where the nodes correspond to the objects to be detected, and the node edges correspond to the relationships between the high-level semantic features of different objects to be detected. According to the classification weights of the initial image features in different domains on different object categories, the classification layer parameters corresponding to different domains are weighted and fused to obtain the high-level semantic features of the object to be detected. The classification layer parameters can be understood as maintaining a class center for each category. The relationship weights between the object categories corresponding to the object to be detected in different domains are projected onto the node edges of the object to be detected, to obtain the weights of the node edges.

Optionally, in combination with the first possible implementation manner of the third aspect described above, in the second possible implementation manner, it may further include determining the relationship weight according to the distance relationship between the object categories corresponding to the objects to be detected in different domains.

Optionally, in combination with the second possible implementation manner of the third aspect described above, in the third possible implementation manner, the distance relationship between the object categories corresponding to the object to be detected may include one or more of the following information: Attribute relationships between object categories corresponding to objects to be detected in different domains. The positional relationship or active-object relationship between object categories corresponding to objects to be detected in different domains. The similarity of word embeddings between the object categories corresponding to the objects to be detected in different domains is constructed using linguistic knowledge. The distance relationship between the object categories corresponding to the objects to be detected in different domains is obtained by training the neural network model according to the training data.

Optionally, in combination with any one of the first to third possible implementation manners of the third aspect, in the fourth possible implementation manner, convolution processing is performed on the high-level semantic features according to the weights of the node edges to obtain the enhanced image features of the object to be detected.

In a fourth aspect, an object detection device is provided. The object detection device includes modules for executing the method in the first aspect.

In a fifth aspect, a neural network training device is provided, and the device includes various modules for executing the method in the third aspect.

In a sixth aspect, an object detection device is provided. The device includes: a memory for storing a program; and a processor for executing the program stored in the memory. When the program stored in the memory is executed, the processor is used to perform the method in the first aspect described above.

In a seventh aspect, a neural network training device is provided. The device includes: a memory for storing a program; and a processor for executing the program stored in the memory. When the program stored in the memory is executed, the processor is used to execute the method in the third aspect described above.

In an eighth aspect, an electronic device is provided, which includes the object detection device in the fourth aspect or the sixth aspect.

In a ninth aspect, an electronic device is provided, and the electronic device includes the neural network training device in the fifth aspect or the seventh aspect.

The above-mentioned electronic device may specifically be a mobile terminal (for example, a smart phone), a tablet computer, a notebook computer, an augmented reality/virtual reality device, a vehicle-mounted terminal device, and so on.

In a tenth aspect, a computer storage medium is provided, the computer storage medium stores program code, and the program code includes instructions for executing the steps in the method in the first aspect or the third aspect.

In an eleventh aspect, a computer program product containing instructions is provided, when the computer program product runs on a computer, the computer executes the method in the first aspect or the third aspect.

In a twelfth aspect, a chip is provided. The chip includes a processor and a data interface. The processor reads instructions stored in a memory through the data interface and executes the method in the first aspect or the third aspect.

Optionally, as an implementation manner, the chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory. When the instructions are executed, the processor is used to execute the method in the first aspect. The above-mentioned chip may specifically be a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).

It should be understood that the above-mentioned method of the first aspect may specifically refer to the first aspect and a method in any one of the various implementation manners of the first aspect. The foregoing method of the third aspect may specifically refer to the third aspect and a method in any one of the various implementation manners of the third aspect.

Through the technical solution provided by this application, a cross-domain knowledge graph is constructed, which can capture the intrinsic relationship between different objects to be detected. Because the enhanced image features include the semantic information of the object categories corresponding to other objects in different domains that are associated with the object to be detected, this application can improve the effect of object detection.

Description of the drawings

FIG. 1 is a schematic structural diagram of a system architecture provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of object detection using a convolutional neural network model provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of an object detection method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of the association relationship of an embodiment of the present application;

FIG. 6 is a flowchart of an object detection method according to an embodiment of the present application;

FIG. 7 is a flowchart of an object detection method according to an embodiment of the present application;

FIG. 8 is a schematic flowchart of a neural network training method according to an embodiment of the present application;

FIG. 9 is a schematic block diagram of an object detection device according to an embodiment of the present application;

FIG. 10 is a schematic block diagram of an object detection device according to an embodiment of the present application;

FIG. 11 is a schematic block diagram of a neural network training device according to an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments.

The embodiments of this application are mainly applied in scenes of large-scale object detection, such as mobile phone face recognition, mobile phone recognition of everything, the perception system of unmanned vehicles, security cameras, photo object recognition on social networking sites, smart robots, and so on. The following will briefly introduce several typical application scenarios:

Mobile phone recognizes everything:

Using the camera on the mobile phone, you can take pictures that contain various things. After the picture is acquired, object detection is performed on the picture to determine the position and category of each object in the picture.

The object detection method of the embodiments of this application can be used to detect objects in the pictures taken by the mobile phone. Because this method combines the cross-domain knowledge graph when detecting objects, it achieves better object detection on the pictures taken by the mobile phone (for example, the position and the classification of the object are more accurate).

Street view recognition:

Cameras deployed on the street can take pictures of passing vehicles and people. After the pictures are obtained, they can be uploaded to the control center equipment, which performs object detection on the pictures and obtains the object detection results. When an abnormality occurs, for example when an object is missing, the control center can send out an alarm.

The following describes the method provided in this application from the model training side and the model application side:

The neural network training method provided in the embodiments of this application involves computer vision processing, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning. Symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and so on are performed on the training data (such as the training images and labeling results in this application), and a trained neural network is finally obtained.

The object detection method provided by the embodiments of this application can use the above-mentioned trained neural network: input data (such as the picture in this application) is fed into the trained neural network to obtain output data (such as the detection results of the picture in this application). It should be noted that the neural network training method and the object detection method provided in the embodiments of this application are inventions based on the same concept, and can also be understood as two parts of one system, or as two stages of an overall process, such as a model training stage and a model application stage.

Since the embodiments of the present application involve a large number of applications of neural networks, in order to facilitate understanding, the following first introduces related terms, neural networks and other related concepts involved in the embodiments of the present application.

(1) Neural network

A neural network can be composed of neural units. A neural unit can refer to an arithmetic unit that takes $x_s$ and an intercept of 1 as inputs. The output of the arithmetic unit can be:

$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$

Here, s = 1, 2, ..., n, where n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of the activation function can be used as the input of the next convolutional layer. The activation function can be a sigmoid function. A neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field. The local receptive field can be a region composed of several neural units.
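The neural unit described above, a weighted sum of the inputs with weights Ws plus the bias b, passed through the activation function f, can be sketched in a few lines. The sketch uses the sigmoid function that the text names as one possible activation; the function names and the input values are illustrative.

```python
import math


def sigmoid(z):
    """Sigmoid activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(xs, ws, b):
    """Output f(sum_s W_s * x_s + b) with a sigmoid activation f."""
    return sigmoid(sum(w * x for w, x in zip(ws, xs)) + b)


# The weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5.
out = neural_unit([1.0, 2.0], [0.5, -0.25], 0.0)
print(out)  # 0.5
```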

(2) Deep neural network

Deep neural network (DNN) can be understood as a neural network with many hidden layers; there is no special metric for "many" here. The multi-layer neural network and the deep neural network that we often refer to are essentially the same thing. According to the positions of the different layers, the layers inside a DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. Although a DNN looks very complicated, the work of each layer is not complicated. Simply put, it is the following linear relationship expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called the coefficients), and $\alpha()$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large.

So, how are the specific parameters defined in a DNN? First, consider the definition of the coefficient $W$. Take a three-layer DNN as an example: the linear coefficient from the fourth neuron in the second layer to the second neuron in the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient $W$ is located, and the subscript corresponds to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as $W^{L}_{jk}$. Note that the input layer has no $W$ parameter. In deep neural networks, more hidden layers make the network more capable of portraying complex situations in the real world. In theory, a model with more parameters is more complex and has a greater "capacity", which means that it can complete more complex learning tasks.
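The per-layer computation of a fully connected layer, an activation applied to the weight matrix times the input vector plus the offset vector, can be sketched as follows. The entry `W[j][k]` holds the coefficient from input neuron k to output neuron j, following the indexing convention described above. The text does not name a specific activation for this expression, so the use of ReLU here is an assumption, as are the function names and toy values.

```python
def relu(v):
    """Element-wise ReLU, used here only as an example activation."""
    return [max(0.0, t) for t in v]


def layer_forward(W, x, b, activation=relu):
    """Compute activation(W x + b) for one fully connected layer.

    W[j][k] is the coefficient from input neuron k to output neuron j.
    """
    z = [sum(W[j][k] * x[k] for k in range(len(x))) + b[j] for j in range(len(W))]
    return activation(z)


# Pre-activations: 1*2 - 1*1 + 0 = 1.0 and 0.5*2 + 0.5*1 - 2 = -0.5.
y = layer_forward([[1.0, -1.0], [0.5, 0.5]], [2.0, 1.0], [0.0, -2.0])
print(y)  # [1.0, 0.0]
```

Stacking several such calls, with the output of one layer used as the input of the next, gives the forward pass of a multi-layer network.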

(3) Convolutional neural network

Convolutional neural network (convolutional neuron network, CNN) is a deep neural network with a convolutional structure. The convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer. The feature extractor can be regarded as a filter. The convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network. In the convolutional layer of a convolutional neural network, a neuron can be connected to only part of the neighboring neurons. A convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernels. Sharing weight can be understood as the way of extracting image information has nothing to do with location. The convolution kernel can be initialized in the form of a matrix of random size. In the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
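Weight sharing can be made concrete with a small sketch in which one kernel is slid over every position of a two-dimensional input, so the same weights are reused everywhere. The sketch computes a "valid" cross-correlation without padding or stride, which is the operation most deep learning frameworks call convolution; the function name and the toy image and kernel are illustrative assumptions.

```python
def conv2d(image, kernel):
    """Valid cross-correlation: the same kernel weights are applied at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj] for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]  # top-left pixel minus bottom-right pixel of each window
feature_map = conv2d(image, kernel)
print(feature_map)  # [[-4, -4], [-4, -4]]
```

The 2x2 kernel has only four weights regardless of the image size, which illustrates why weight sharing reduces the number of connections between layers.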

(4) Classifier

Many neural network structures have a classifier at the end to classify objects in the image. The classifier is generally composed of a fully connected layer and a softmax function, which can output probabilities of different categories according to the input.
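A classifier of this kind can be sketched as follows. This is a minimal illustration, assuming a single fully connected layer with hand-picked weights; the names `softmax` and `classify` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(feature, weights, biases):
    """Fully connected layer followed by softmax; returns class probabilities."""
    scores = [sum(w * f for w, f in zip(row, feature)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

feature = [1.0, 2.0]                              # input feature vector
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # one row per category
biases = [0.0, 0.0, 0.0]
probs = classify(feature, weights, biases)
predicted = probs.index(max(probs))
```

The probabilities sum to 1, and the category with the largest score receives the largest probability.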

(5) Feature pyramid networks (FPN)

Originally, most object detection algorithms used only top-level features for prediction. However, low-level features carry relatively little semantic information but locate the target accurately, whereas high-level features carry richer semantic information but locate the target only roughly. FPN builds on the original detector and makes independent predictions at different feature layers.

(6) Transfer learning

The original intention of transfer learning is to deal with the problem of insufficient training samples, so that a model can transfer from existing source domain data to related but not identical target domain data, thereby training a model suitable for the target domain. A domain is an abstract concept that refers to tasks with similar properties. Specifically, a domain can be a detection task on a specific data set, or a detection task for a specific kind of object (such as a human face), and so on. There are often obvious differences between different domains, which makes them difficult to handle in a unified manner. The global domain is the collective name for all domains, including all potential tasks. It is the complete set of domains and is generally used for definitions and conceptual expressions. The core idea of transfer learning algorithms is to extract domain-invariant information by maximizing a specific domain similarity measure, so that data in different domains can learn from each other to obtain a model suitable for the target domain.

The above gives a brief introduction to some basic contents of neural networks, and the following introduces some specific neural networks that may be used in image data processing.

(7) Graph Convolutional Neural Network

A graph is a data format that can be used to represent social networks, communication networks, protein molecular networks, etc. The nodes in the graph represent individuals in the network, and the lines represent the connections between individuals. Many machine learning tasks such as community discovery, link prediction, etc. require graph structure data. Therefore, the emergence of graph convolutional neural networks (GCN) provides new ideas for solving these problems. GCN can be used for deep learning of graph data.

GCN is a natural extension of convolutional neural networks to the graph domain. It can learn node feature information and structural information end to end at the same time, and is currently the best choice for graph data learning tasks. GCN is very widely applicable and is suitable for nodes and graphs of any topology.

The system architecture of the embodiment of the present application will be described in detail below in conjunction with FIG. 1.

Fig. 1 is a schematic diagram of the system architecture of an embodiment of the present application. As shown in FIG. 1, the system architecture 100 includes an execution device 110, a training device 120, a database 130, a client device 140, a data storage system 150, and a data collection system 160.

In addition, the execution device 110 includes a calculation module 111, an I/O interface 112, a preprocessing module 113, and a preprocessing module 114. Among them, the calculation module 111 may include the target model/rule 101, and the preprocessing module 113 and the preprocessing module 114 are optional.

The data collection device 160 is used to collect training data. For the object detection method of the embodiment of the present application, the training data may include training images of different domains or different data sets and the annotation results corresponding to the training images. Wherein, the labeling result of the training image may be the (manually) pre-labeled classification result of each object to be detected in the training image. After the training data is collected, the data collection device 160 stores the training data in the database 130, and the training device 120 trains to obtain the target model/rule 101 based on the training data maintained in the database 130.

The following describes how the training device 120 obtains the target model/rule 101 based on the training data. The training device 120 performs object detection on the input training image and compares the output detection result with the pre-labeled detection result of the object, until the difference between the detection result output by the training device 120 and the pre-labeled detection result is less than a certain threshold, thereby completing the training of the target model/rule 101.

The above-mentioned target model/rule 101 can be used to implement the object detection method of the embodiment of the present application, that is, the image to be detected (after relevant preprocessing) is input into the target model/rule 101 to obtain the detection result of the image to be detected. The target model/rule 101 in the embodiment of the present application may specifically be a neural network. It should be noted that in actual applications, the training data maintained in the database 130 may not all be collected by the data collection device 160, and may also be received from other devices. In addition, it should be noted that the training device 120 does not necessarily train the target model/rule 101 entirely on the training data maintained in the database 130. It may also obtain training data from the cloud or elsewhere for model training. The above description should not be construed as a limitation on the embodiments of this application.

The target model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1. The execution device 110 can be a terminal, such as a mobile phone, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, and can also be a server or a cloud. In FIG. 1, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices. The user can input data to the I/O interface 112 through the client device 140. The input data in this embodiment of the present application may include a to-be-processed image input by the client device. The client device 140 here may specifically be a terminal device.

The preprocessing module 113 and the preprocessing module 114 are used to preprocess the input data (such as the image to be processed) received by the I/O interface 112. In the embodiment of the present application, the preprocessing module 113 and the preprocessing module 114 may be omitted (or only one preprocessing module may be provided), and the calculation module 111 may be used directly to process the input data.

When the execution device 110 preprocesses input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processing, the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing. The data, instructions, and the like obtained by the corresponding processing may also be stored in the data storage system 150.

Finally, the I/O interface 112 presents the processing result, such as the detection result of the object obtained above, to the client device 140 to provide it to the user.

It is worth noting that the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.

In the case shown in FIG. 1, the user can manually set input data, and the manual setting can be performed through the interface provided by the I/O interface 112. In another case, the client device 140 can automatically send input data to the I/O interface 112. If automatically sending the input data requires the user's authorization, the user can set the corresponding permission in the client device 140. The user can view the result output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, or the like. The client device 140 can also serve as a data collection terminal that collects the input data fed to the I/O interface 112 and the output results returned by the I/O interface 112, as shown in the figure, as new sample data and stores them in the database 130. Of course, the collection need not go through the client device 140. Instead, the I/O interface 112 may directly store the input data fed to it and the output results it returns, as shown in the figure, in the database 130 as new sample data.

It is worth noting that FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, and the like shown in the figure does not constitute any limitation. For example, in FIG. 1, the data storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 may also be placed in the execution device 110.

As shown in FIG. 1, the target model/rule 101 obtained through training by the training device 120 can be the neural network of the embodiment of this application. Specifically, the neural network provided in the embodiment of this application can be a CNN, a deep convolutional neural network (DCNN), and so on.

Since CNN is a very common neural network, the structure of CNN is introduced in detail below in conjunction with FIG. 2. As mentioned in the introduction to the basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction by means of machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network in which each neuron can respond to the image input into it.

As shown in FIG. 2, a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (the pooling layer is optional), and a neural network layer 230. The following is a detailed introduction to the relevant content of these layers.

Convolutional layer/pooling layer 220:

Convolutional layer:

The convolutional layer/pooling layer 220 shown in FIG. 2 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer. In another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.

The following will take the convolutional layer 221 as an example to introduce the internal working principle of a convolutional layer.

The convolutional layer 221 can include many convolution operators. A convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During convolution on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride) to extract specific features from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image. During the convolution operation, the weight matrix extends across the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension. In most cases, however, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied instead of a single one. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, so this depth is determined by the number of weight matrices applied. Different weight matrices can be used to extract different features from the image. For example, one weight matrix extracts edge information of the image, another extracts a specific color of the image, and yet another blurs unwanted noise in the image. Because the multiple weight matrices have the same size (rows × columns), the convolution feature maps they extract also have the same size, and these feature maps of the same size are combined to form the output of the convolution operation.
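The sliding-window operation described above can be sketched in plain Python. This is a minimal illustration that assumes no padding and nested lists in height × width × channel order; the kernels `edge` and `mean` and the function name `conv2d` are illustrative choices, not part of the application.

```python
def conv2d(image, kernels, stride=1):
    """Slide each kh x kw x C kernel over an H x W x C image (no padding).

    Returns an out_h x out_w x len(kernels) feature map: one output channel per kernel.
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            pixel = []
            for k in kernels:
                acc = 0.0
                for di in range(kh):
                    for dj in range(kw):
                        for ch in range(c):  # the kernel spans the full input depth
                            acc += image[i * stride + di][j * stride + dj][ch] * k[di][dj][ch]
                pixel.append(acc)
            row.append(pixel)
        out.append(row)
    return out

# A 4x4 single-channel image whose pixel value equals its column index.
image = [[[float(x)] for x in range(4)] for _ in range(4)]
# Two 2x2x1 kernels: a horizontal-gradient (edge) filter and an averaging filter.
edge = [[[-1.0], [1.0]], [[-1.0], [1.0]]]
mean = [[[0.25], [0.25]], [[0.25], [0.25]]]
features = conv2d(image, [edge, mean], stride=2)
```

With two kernels the output has depth 2, which mirrors the statement that the output depth is determined by the number of weight matrices.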

In practical applications, the weight values in these weight matrices need to be obtained through extensive training. Each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.

When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (such as 221) often extract more general features, which can also be called low-level features. As the depth of the convolutional neural network 200 increases, the features extracted by the later convolutional layers (for example, 226) become more and more complex, such as high-level semantic features, and features with higher-level semantics are more suitable for the problem to be solved.

Pooling layer:

Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In the layers 221-226 illustrated by 220 in FIG. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image with a smaller size. The average pooling operator computes the average of the pixel values within a specific range of the image as the result of average pooling. The maximum pooling operator takes the pixel with the largest value within a specific range as the result of maximum pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The image output by the pooling layer can be smaller than the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
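The two pooling operators can be sketched as follows. This is a minimal illustration that assumes non-overlapping square windows on a single-channel image; the function name `pool2d` is hypothetical.

```python
def pool2d(image, size, mode="max"):
    """Non-overlapping pooling over size x size windows of a 2-D list."""
    out = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            window = [image[i + di][j + dj] for di in range(size) for dj in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out

image = [[1.0, 2.0, 5.0, 6.0],
         [3.0, 4.0, 7.0, 8.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 4.0, 1.0, 1.0]]
max_pooled = pool2d(image, 2, "max")   # each output pixel is the maximum of a 2x2 sub-region
avg_pooled = pool2d(image, 2, "avg")   # each output pixel is the mean of a 2x2 sub-region
```

A 4×4 input becomes a 2×2 output in both cases, so the spatial size shrinks while each output pixel summarizes one sub-region of the input.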

Neural network layer 230:

After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information. As mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate an output of one class or a group of the required classes. Therefore, the neural network layer 230 can include multiple hidden layers (231, 232 to 23n as shown in FIG. 2) and an output layer 240. The parameters contained in the hidden layers can be obtained through pre-training on training data relevant to a specific task type. For example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.

After the multiple hidden layers in the neural network layer 230, the final layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to categorical cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 is completed (propagation from 210 to 240 in FIG. 2), back propagation (propagation from 240 to 210 in FIG. 2) starts to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, that is, the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
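The interplay of forward propagation, loss calculation, and back propagation can be sketched for a single linear output layer. This is a minimal illustration, assuming softmax with categorical cross entropy and plain gradient descent; the learning rate, the data, and the function names are illustrative assumptions, and a real network would propagate the gradient through all earlier layers as well.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(weights, feature, label, lr=0.5):
    """One forward pass and one backward pass; updates weights in place and returns the loss."""
    # Forward propagation: scores, probabilities, cross-entropy loss.
    scores = [sum(w * f for w, f in zip(row, feature)) for row in weights]
    probs = softmax(scores)
    loss = -math.log(probs[label])
    # Back propagation: d(loss)/d(score_j) = probs[j] - 1 for the true class, probs[j] otherwise.
    for j, row in enumerate(weights):
        grad_score = probs[j] - (1.0 if j == label else 0.0)
        for k in range(len(row)):
            row[k] -= lr * grad_score * feature[k]
    return loss

weights = [[0.0, 0.0], [0.0, 0.0]]   # two categories, two input features
feature, label = [1.0, 2.0], 1
losses = [train_step(weights, feature, label) for _ in range(5)]
```

In this toy setting the loss shrinks with every step, which is the behaviour the weight update is meant to produce.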

It should be noted that the convolutional neural network 200 shown in FIG. 2 is only used as an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.

It should be understood that the convolutional neural network (CNN) 200 shown in FIG. 2 may be used to execute the object detection method of the embodiment of the present application. As shown in FIG. 2, after the image to be processed is processed by the input layer 210, the convolutional layer/pooling layer 220, and the neural network layer 230, the detection result of the image can be obtained.

FIG. 3 is a hardware structure of a chip provided by an embodiment of the application, and the chip includes a neural network processor. The chip may be set in the execution device 110 as shown in FIG. 1 to complete the calculation work of the calculation module 111. The chip can also be set in the training device 120 as shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101. The algorithms of each layer in the convolutional neural network as shown in Figure 2 can be implemented in the chip as shown in Figure 3.

The neural network processor NPU is mounted as a coprocessor to a main central processing unit (central processing unit, CPU) (host CPU), and the main CPU distributes tasks. The core part of the NPU is the arithmetic circuit 303. The controller 304 controls the arithmetic circuit 303 to extract data from the memory (weight memory or input memory) and perform calculations.

In some implementations, the arithmetic circuit 303 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 303 is a general-purpose matrix processor.

For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 303 fetches the data corresponding to matrix B from the weight memory 302 and caches it on each PE in the arithmetic circuit 303. The arithmetic circuit 303 then fetches the data of matrix A from the input memory 301 and performs a matrix operation with matrix B, and the partial results or final result of the resulting matrix are stored in an accumulator 308.
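The role of the accumulator can be illustrated in software. This is only a sketch of accumulating partial products, assuming the inner dimension is processed in fixed-size tiles; it does not model the actual hardware dataflow, and the function name and tile size are hypothetical.

```python
def matmul_accumulate(a, b, tile=2):
    """Compute C = A x B by accumulating partial products over tiles of the inner dimension."""
    n, inner, m = len(a), len(b), len(b[0])
    acc = [[0.0] * m for _ in range(n)]   # plays the role of the accumulator
    for start in range(0, inner, tile):
        stop = min(start + tile, inner)
        for i in range(n):
            for j in range(m):
                # The partial result for this tile is added to the running sum.
                acc[i][j] += sum(a[i][k] * b[k][j] for k in range(start, stop))
    return acc

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # input matrix
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # weight matrix
C = matmul_accumulate(A, B)
```

After the first tile the accumulator holds a partial result, and after the last tile it holds the final matrix C.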

The vector calculation unit 307 can perform further processing on the output of the arithmetic circuit 303, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and size comparison. For example, the vector calculation unit 307 can be used for network calculations in the non-convolutional/non-FC layers of the neural network, such as pooling, batch normalization, and local response normalization.

In some implementations, the vector calculation unit 307 can store the processed output vector in the unified memory 306. For example, the vector calculation unit 307 may apply a nonlinear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 307 generates a normalized value, a combined value, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 303, for example for use in a subsequent layer in a neural network.

The unified memory 306 is used to store input data and output data.

The direct memory access controller (DMAC) 305 transfers the input data in the external memory to the input memory 301 and/or the unified memory 306, stores the weight data in the external memory into the weight memory 302, and stores the data in the unified memory 306 into the external memory.

The bus interface unit (BIU) 310 is used to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 309 through the bus.

An instruction fetch buffer 309 connected to the controller 304 is used to store instructions used by the controller 304.

The controller 304 is used to call the instructions cached in the instruction fetch memory 309 to control the working process of the computing accelerator.

Generally, the unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch memory 309 are all on-chip memories. The external memory is a memory external to the NPU. The external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or other readable and writable memory.

The operations of each layer in the convolutional neural network shown in FIG. 2 can be executed by the arithmetic circuit 303 or the vector calculation unit 307.

The execution device 110 in FIG. 1 introduced above can execute each step of the object detection method in the embodiment of the present application. The CNN model shown in FIG. 2 and the chip shown in FIG. 3 can also be used to execute each step of the object detection method in the embodiment of the present application. The object detection method of the embodiment of the present application will be described in detail below with reference to the accompanying drawings.

The object detection method of the embodiment of the present application will be described in detail below in conjunction with FIG. 4.

401. Acquire an image to be detected.

The method shown in FIG. 4 can be applied in different scenarios. Specifically, the method shown in FIG. 4 can be applied in scenarios such as recognizing everything and street view recognition.

When the method shown in FIG. 4 is applied to a scene where the mobile terminal recognizes everything, the image to be detected in step 401 may be an image taken by the mobile terminal through a camera, or an image already stored in the mobile terminal's album.

When the method shown in FIG. 4 is applied to a scene of street view recognition, the image to be detected in step 401 may be a street view image taken by a camera on the roadside.

The method shown in FIG. 4 may be executed by a neural network (model). Specifically, the method shown in FIG. 4 may be executed by CNN or DNN.

402. Determine the initial image feature of the object to be detected in the image to be detected.

In step 402, convolution processing, regularization processing, or the like may be performed on the entire image to be detected to obtain the image features of the entire image, and then the initial image features corresponding to the object to be detected are obtained from the image features of the entire image.

In a specific embodiment, performing convolution processing on the image to be detected to obtain the initial image feature of the object to be detected includes: performing convolution processing on the entire image to be detected to obtain the complete image features of the image to be detected; and determining, among the complete image features of the image to be detected, the image feature corresponding to the object to be detected as the initial image feature of the object to be detected.

In a specific embodiment, performing convolution processing on the image to be detected to obtain the initial image feature of the object to be detected includes: separately acquiring the image feature corresponding to each object to be detected each time.

403. Determine the enhanced image feature of the object to be detected according to the cross-domain knowledge map information.

The cross-domain knowledge map information includes the association relationship between object categories corresponding to the objects to be detected in different domains, and the enhanced image features indicate semantic information of object categories corresponding to other objects in different domains that are associated with the objects to be detected.

The greater the probability of two categories appearing in the same image at the same time, the greater the correlation between the two categories. For example, as shown in Figure 5, the object categories in the first domain or the first data set include men, women, boys, girls, roads, and streets. Object categories in the second domain include people, handbags, school bags, cars, and trucks. It can be considered that the men, women, boys, and girls in the first domain have an association relationship with the people in the second domain. The women and girls in the first domain have an association with the handbags in the second domain. The boys and girls in the first domain have an association with the school bags in the second domain. There is an association between roads and streets in the first domain and cars and trucks in the second domain.
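The co-occurrence idea can be sketched as follows. This is a minimal illustration, assuming each image is represented as a set of category labels and that the association weight is simply the fraction of images in which two categories appear together; the annotations, the function name `cooccurrence`, and this choice of weight are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

def cooccurrence(images, cats_s, cats_p):
    """Return a len(cats_s) x len(cats_p) matrix of co-occurrence frequencies.

    Entry (i, j) is the fraction of images containing both cats_s[i] and cats_p[j].
    """
    counts = Counter()
    for labels in images:
        for s, p in product(cats_s, cats_p):
            if s in labels and p in labels:
                counts[(s, p)] += 1
    total = len(images)
    return [[counts[(s, p)] / total for p in cats_p] for s in cats_s]

# Hypothetical annotations that carry labels from both domains.
images = [{"woman", "handbag", "person"},
          {"girl", "school bag", "person"},
          {"road", "car"},
          {"woman", "person", "road", "car"}]
cats_s = ["woman", "girl", "road"]        # categories of the first domain
cats_p = ["person", "handbag", "car"]     # categories of the second domain
G = cooccurrence(images, cats_s, cats_p)
```

In this toy data, "woman" is most strongly associated with "person" and "road" with "car", in line with the associations described for Figure 5.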

Semantic information can refer to high-level information that can assist in image detection. For example, the above-mentioned semantic information can specifically be what the object is and what is around the object (semantic information is generally different from low-level information, such as image edges, pixels, brightness, etc.).

For example, if the object to be detected is a woman, and other objects associated with the woman in the image to be detected include a handbag, then the enhanced image feature of the object to be detected may indicate semantic information of the handbag.

In a specific embodiment, the cross-domain knowledge graph may include nodes and node edges, where nodes correspond to objects to be detected, and node edges correspond to relationships between high-level semantic features of different objects to be detected. According to the classification weights of the initial image features over the object categories of different domains, the classification layer parameters corresponding to the different domains are weighted and merged to obtain the high-level semantic features of the object to be detected. The classification layer parameters can be understood as maintaining a class center for each category, where a class center refers to the high-level semantic features of that category. The weight of the relationship between the object categories corresponding to the object to be detected in different domains is projected onto the node edges of the objects to be detected to obtain the weights of the node edges. This process is explained below. Let $\varepsilon_{SP}$ denote the weights of the node edges, where S and P denote two domains, and let $M_S$ and $M_P$ denote the classification weights of the objects to be detected $f_i$ over the object categories of the respective domains. Taking $M_S$ as an example, $M_S$ is the matrix formed by these classification weights, and the element in the i-th row and j-th column of $M_S$ is

$s_{ij}$ is the inner product of the initial image feature of the i-th object to be detected and the parameter corresponding to the j-th classification category of the classifier. The weight of the edge between the i-th node and the j-th node of the region graph in the S domain is

Where $f_i$ and $f_j$ are the feature of the i-th object to be detected in one domain and the feature of the j-th object to be detected in another domain. $G_{SP}$ is the weight of the relationship between the object categories corresponding to the objects to be detected in different domains, and $G_{SP}$ can be regarded as a matrix. Projecting this relationship weight onto the node edges of the objects to be detected yields the weights of the node edges, which can be expressed by the following formula, where T denotes matrix transposition:

The process of projection can be regarded as the process of converting the weight of the relationship between the object categories into the weight of the relationship between the objects to be detected, and the weight of the relationship between the objects to be detected is the weight of the edges of the nodes.
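One plausible reading of the projection can be sketched as follows. The formulas themselves are not reproduced in this text, so the sketch rests on two explicit assumptions: that the classification weights are a row-wise softmax of the scores $s_{ij}$, and that the projection has the form $M_S G_{SP} M_P^T$. All function names and numbers are hypothetical.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax: turns per-object category scores into classification weights."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def project_edges(scores_s, scores_p, g_sp):
    """ASSUMPTION: edge weights = M_S @ G_SP @ M_P^T, with M = row-wise softmax of the scores."""
    m_s = softmax_rows(scores_s)   # objects x categories of domain S
    m_p = softmax_rows(scores_p)   # objects x categories of domain P
    return matmul(matmul(m_s, g_sp), transpose(m_p))

# Two objects scored over 2 categories in S and 3 categories in P (illustrative numbers).
scores_s = [[5.0, 0.0], [0.0, 5.0]]
scores_p = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
g_sp = [[0.9, 0.1, 0.0],     # category-to-category relationship weights
        [0.0, 0.2, 0.8]]
edges = project_edges(scores_s, scores_p, g_sp)
```

Under these assumptions the category-level matrix (2 × 3) becomes an object-level matrix (2 × 2), and objects whose likely categories are strongly related receive a large edge weight.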

In a specific implementation manner, the high-level semantic features are convolved according to the weights of the edges of the nodes, and the enhanced image features of the object to be detected can be obtained.
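The convolution over the graph can be sketched as a weighted aggregation followed by a linear map. This is a minimal illustration, assuming the edge weights are normalized per row and that the linear transform would be learned in practice; the function name `enhance` and all values are hypothetical.

```python
def enhance(edges, semantic, weight):
    """Aggregate high-level semantic features along weighted edges, then apply a linear map.

    edges:    n x n edge weights between objects (ASSUMPTION: rows are normalized to sum to 1)
    semantic: n x d high-level semantic features, one row per object
    weight:   d x e linear transform (would be learned; fixed here for illustration)
    """
    n, d, e = len(edges), len(semantic[0]), len(weight[0])
    out = []
    for i in range(n):
        total = sum(edges[i]) or 1.0   # avoid division by zero for isolated nodes
        agg = [sum(edges[i][k] / total * semantic[k][c] for k in range(n)) for c in range(d)]
        out.append([sum(agg[c] * weight[c][o] for c in range(d)) for o in range(e)])
    return out

edges = [[0.0, 1.0], [3.0, 1.0]]
semantic = [[1.0, 0.0], [0.0, 2.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]    # identity, so the output equals the aggregated features
enhanced = enhance(edges, semantic, weight)
```

Each object's enhanced feature is a mixture of the semantic features of the objects it is connected to, weighted by the edge weights.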

In a specific implementation, the relationship weight may be determined according to the distance relationship between the object categories corresponding to the objects to be detected in different domains. The distance relationship includes one or more of the following information:

(1) The attribute relationship of different object categories in different domains.

For example, the color of an apple is red, and the color of a strawberry is also red. Then, apples and strawberries have the same color attributes (or, it can be said that apples and strawberries are relatively close in color attributes).

(2) The positional relationship or subject-object relationship of different object categories in different domains.

For example, if a car is on a street and a woman is carrying a handbag, then the street and the car are close in position, and the woman and the handbag satisfy a subject-object relationship.

(3) Word embedding similarity constructed with linguistic knowledge in different object categories in different domains.

The similarity of word embedding constructed with linguistic knowledge can be understood as the degree of similarity between word vectors of different object categories.

(4) The distance relationship between the features of different objects to be detected in different domains obtained by training the neural network model according to the training data.

For example, for two different domains, the weight of the edge between the i-th node in one domain and the j-th node in the other domain is

Where f_i and f_j are the feature of the i-th object to be detected in one domain and the feature of the j-th object to be detected in the other domain, respectively (here "feature" is short for the initial image feature of the object to be detected). It should be noted that when the distance relationship between the features of different objects to be detected in different domains is obtained by training the neural network model on the training data, the weight of the relationship between the objects to be detected is obtained directly, so the projection process is no longer needed in this case.
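The word embedding similarity in item (3) above can be illustrated with cosine similarity between category word vectors. This is a minimal sketch: the use of cosine similarity and the toy three-dimensional vectors are assumptions, since the text does not specify the similarity measure or the embedding model.

```python
import numpy as np

def cosine_similarity_matrix(A, B):
    """Return the (len(A), len(B)) matrix of cosine similarities between rows."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_n @ B_n.T

# Toy word vectors for the categories of two domains (hypothetical values).
domain_a = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
domain_b = np.array([[2.0, 0.0, 0.0],
                     [1.0, 1.0, 0.0]])

G = cosine_similarity_matrix(domain_a, domain_b)
print(G)
```

Entry `G[i, j]` can then serve as the relationship weight between category i of one domain and category j of the other.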

404. Determine a candidate frame and classification of the object to be detected according to the initial image feature of the object to be detected and the enhanced image feature of the object to be detected.

The candidate frame and classification of the object to be detected determined in step 404 may be the final candidate frame and the final classification (result) of the object to be detected, respectively.

In step 404, the initial image feature of the object to be detected and the enhanced image feature of the object to be detected can be combined to obtain the final image feature of the object to be detected, and then the candidate frame and classification of the object to be detected can be determined according to the final image feature.

For example, the initial image feature of the object to be detected is a convolution feature map with a size of M1×N1×C1 (M1, N1, and C1 represent width, height, and number of channels, respectively), and the enhanced image feature of the object to be detected is a convolution feature map with a size of M1×N1×C2 (M1, N1, and C2 represent width, height, and number of channels, respectively). Then, by combining these two convolution feature maps, the final image feature of the object to be detected can be obtained; the final image feature is a convolution feature map with a size of M1×N1×(C1+C2).

It should be understood that the description here is based on an example in which the convolution feature map of the initial image feature and the convolution feature map of the enhanced image feature have the same size (same width and height) but different numbers of channels. In fact, the initial image feature and the enhanced image feature can also be combined when their convolution feature maps differ in size: the two feature maps are first unified to the same size (the same width and height), and then combined to obtain the convolution feature map of the final image feature.
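The combination described above can be sketched as a channel-wise concatenation. This is a minimal sketch under assumptions: the nearest-neighbor resizing used to unify sizes is hypothetical, since the text does not say how the sizes are unified, and the function names are illustrative.

```python
import numpy as np

def resize_nearest(fmap, out_rows, out_cols):
    """Nearest-neighbor resize of a (rows, cols, channels) feature map."""
    rows, cols, _ = fmap.shape
    r_idx = np.arange(out_rows) * rows // out_rows
    c_idx = np.arange(out_cols) * cols // out_cols
    return fmap[r_idx][:, c_idx]

def combine_features(initial, enhanced):
    """Unify the spatial size if needed, then concatenate along channels."""
    if initial.shape[:2] != enhanced.shape[:2]:
        enhanced = resize_nearest(enhanced, *initial.shape[:2])
    return np.concatenate([initial, enhanced], axis=-1)

initial = np.zeros((7, 7, 256))    # M1 x N1 x C1
enhanced = np.ones((14, 14, 64))   # different spatial size, C2 channels
final = combine_features(initial, enhanced)
print(final.shape)
```

The result has C1+C2 channels, matching the M1×N1×(C1+C2) size given in the example.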

In this application, when object detection is performed on the image to be detected, the detection result of the object to be detected is determined jointly by the initial image feature and the enhanced image feature of the object to be detected. Compared with obtaining the detection result by considering only the initial image feature of the object to be detected, better detection results can be obtained.

Specifically, when determining the detection result of the object to be detected, this application considers not only the initial image features that reflect the characteristics of the object itself, but also the semantic information of other objects in the image to be detected that are associated with the object to be detected. By constructing a cross-domain knowledge graph (also known as a transferable knowledge graph in multiple scenarios), the present invention can capture the internal relationships between different objects and use graph convolutional networks to fuse information from a large number of different data sets and different categories. As a result, the data utilization rate is greatly improved, the detection performance is higher, and large-scale object detection is realized.

For example, a model trained only on the second domain mentioned above may determine the detection result to be a person and a handbag. With the solution provided in this application, a model trained on both the first domain and the second domain may determine the detection result to be a woman carrying a handbag, which ultimately improves the effect of object detection.

In a specific embodiment, after the above step 401, the method shown in FIG. 4 further includes: determining the initial candidate frame of the object to be detected according to the initial image feature of the object to be detected.

In the process of determining the initial candidate frame of the object to be detected, the entire image to be detected is generally first subjected to convolution processing to obtain the convolution features of the entire image. The image to be detected is then divided into different boxes according to a fixed size requirement, the features corresponding to the image in each box are scored, and the boxes with higher scores are selected as the initial candidate frames.

For example, the image to be detected is the first image. To obtain the initial candidate frames of the objects to be detected in the first image, the entire first image can be convolved to obtain the convolution features of the entire first image. The first image is then divided into 3×3 boxes, and the features corresponding to the image in each box are scored. Finally, box A and box B, which have higher scores, can be selected as the initial candidate frames.
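The grid-and-score procedure in this example can be sketched as follows. The scoring function is an assumption: the text does not say how the features in a box are scored, so the mean activation inside each box is used as a stand-in, and the function names are hypothetical.

```python
import numpy as np

def grid_boxes(height, width, rows, cols):
    """Split an image of the given size into rows x cols boxes (x0, y0, x1, y1)."""
    ys = np.linspace(0, height, rows + 1).astype(int)
    xs = np.linspace(0, width, cols + 1).astype(int)
    return [(int(xs[c]), int(ys[r]), int(xs[c + 1]), int(ys[r + 1]))
            for r in range(rows) for c in range(cols)]

def top_boxes(feature_map, boxes, k):
    """Score every box by its mean activation (assumed) and keep the k best."""
    scores = [feature_map[y0:y1, x0:x1].mean() for x0, y0, x1, y1 in boxes]
    order = np.argsort(scores)[::-1][:k]
    return [boxes[i] for i in order]

# Toy single-channel feature map with two regions of high response.
feature_map = np.zeros((9, 9))
feature_map[0:3, 0:3] = 1.0
feature_map[6:9, 6:9] = 0.5

boxes = grid_boxes(9, 9, rows=3, cols=3)
selected = top_boxes(feature_map, boxes, k=2)
print(selected)
```

Here the two selected boxes play the role of box A and box B in the example above.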

In the above step 404, the process of determining the candidate frame and classification of the object to be detected according to the initial image feature and the enhanced image feature of the object to be detected may be as follows: the initial image feature and the enhanced image feature are first combined to obtain the final image feature; the initial candidate frame is then adjusted according to the final image feature to obtain the candidate frame, and the initial classification result is corrected according to the final image feature to obtain the classification result. Specifically, adjusting the initial candidate frame according to the final image feature may be adjusting the coordinates of the initial candidate frame according to the final image feature until the candidate frame is obtained, and correcting the initial classification result according to the final image feature may be building a classifier to reclassify and thereby obtain the classification result.
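The adjustment and reclassification described above can be sketched as follows. The text does not specify how the coordinates are adjusted, so this sketch borrows the center/size offset parameterization commonly used in R-CNN-style detectors as an assumption; the linear classifier `W`, `b` and both function names are likewise hypothetical.

```python
import numpy as np

def adjust_box(box, deltas):
    """Shift and rescale a box (x0, y0, x1, y1) by offsets (dx, dy, dw, dh).

    dx, dy move the center in units of the box width and height;
    dw, dh rescale the width and height on a log scale (assumed convention).
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    cx, cy = x0 + 0.5 * w, y0 + 0.5 * h
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * np.exp(dw), h * np.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

def reclassify(final_feature, W, b):
    """Classify the final image feature with a linear layer and softmax."""
    logits = final_feature @ W + b
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    return int(np.argmax(probs)), probs

refined = adjust_box((10.0, 10.0, 30.0, 50.0), (0.1, 0.0, 0.0, 0.0))
label, probs = reclassify(np.array([0.1, 2.0, 0.3]), np.eye(3), np.zeros(3))
print(refined, label)
```

In a trained detector the offsets and the classifier parameters would both be predicted or learned from the final image feature; here they are fixed by hand for illustration.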

In order to better understand the complete flow of the object detection method according to the embodiment of the present application, the object detection method according to the embodiment of the present application will be described below with reference to FIG. 6.

Fig. 6 is a schematic flowchart of an object detection method according to an embodiment of the present application.

The method shown in FIG. 6 may be executed by an object detection device, which may be an electronic device with an object detection function. The form of the device specifically included in the electronic device can be as described above in the method shown in FIG. 4.

The method shown in FIG. 6 includes steps 601 to 609, and these steps are described in detail below.

Step 602 and step 603 may be detailed implementations (also referred to as specific implementations) of step 402, and steps 604 to 608 may be detailed implementations (also referred to as specific implementations) of step 403.

601. Acquire an image to be detected.

Step 601 can be understood with reference to step 401 in the embodiment corresponding to FIG. 4, and details are not repeated here.

602. Select an initial candidate area.

The image to be detected can be input into a traditional object detector for processing (such as Faster-RCNN) to obtain the initial candidate area. Since this application performs object detection for multiple different domains, each domain has its own corresponding initial candidate area.

Specifically, the image to be detected can first be convolved to obtain the convolution features of the entire image, and the image to be detected is then divided into different boxes according to certain size requirements. For each domain, the features corresponding to the image in each box are scored, and the boxes with higher scores are selected as the initial candidate frames, thereby obtaining the initial candidate frames corresponding to the different domains.

603. Extract an initial image feature of the initial candidate area.

A CNN can be used to extract the image features of the initial candidate region. For example, if the first image is the image to be detected, then to obtain the initial candidate frames of the objects to be detected in the first image, the first image can be convolved to obtain its convolution features. The first image is then divided into 4×4 boxes (it can also be divided into other numbers of boxes), the features corresponding to the image in each box are scored, and box A and box B, which have higher scores, are selected as the initial candidate frames.

Further, after the initial candidate frames are obtained, the image features of the entire image to be detected (which can be obtained by performing convolution processing on the entire image) are mapped to box A and box B to obtain the initial image feature corresponding to box A and the initial image feature corresponding to box B.
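Mapping the whole-image features to a box can be sketched as cropping the feature map and pooling the crop. This is a simplified stand-in: average pooling over the crop is an assumption made for brevity, not the pooling operator of any particular detector, and the function name is hypothetical.

```python
import numpy as np

def box_feature(feature_map, box):
    """Crop a (rows, cols, channels) feature map to a box and average-pool it.

    box is (x0, y0, x1, y1) in feature-map coordinates.
    Returns one feature vector of length `channels`.
    """
    x0, y0, x1, y1 = box
    region = feature_map[y0:y1, x0:x1]
    return region.mean(axis=(0, 1))

feature_map = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
feat_a = box_feature(feature_map, (0, 0, 2, 2))   # initial image feature of box A
feat_b = box_feature(feature_map, (2, 2, 4, 4))   # initial image feature of box B
print(feat_a, feat_b)
```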

604. Extract classification layer parameters.

The classifiers corresponding to different domains in the object detector can be used to extract the classification layer parameters. For example, for each domain, the classifier in the object detector (such as Faster-RCNN) can be used to extract the classification layer parameters, and a domain-related semantic pool can be constructed for each domain to record the high-level semantic features of each category. During training, the classification layer parameters corresponding to different categories in the classifier may change continuously; in this case, the corresponding classification layer parameters in the semantic pool can be updated accordingly. The extracted classification layer parameters may be the classification layer parameters of all categories in the classifiers, corresponding to the different domains, of the object detector that performs object detection on the object to be detected.
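A per-domain semantic pool of this kind can be sketched as a container that copies the classifier weights of each domain. The class name `SemanticPool`, the domain names, and the dimensions below are all hypothetical and chosen only for illustration.

```python
import numpy as np

class SemanticPool:
    """Stores one D-dimensional semantic vector per category of a domain."""

    def __init__(self, num_categories, dim):
        self.P = np.zeros((num_categories, dim))

    def update(self, classifier_weight):
        # Copy so that later in-place optimizer updates do not alias the pool.
        self.P = np.array(classifier_weight, copy=True)

rng = np.random.default_rng(0)
# Stand-ins for the classification layer weights of two domains.
classifier_weights = {
    "domain_a": rng.normal(size=(80, 16)),
    "domain_b": rng.normal(size=(445, 16)),
}

pools = {name: SemanticPool(*w.shape) for name, w in classifier_weights.items()}
for name, w in classifier_weights.items():
    pools[name].update(w)   # would be called again whenever the classifier changes

print({name: pool.P.shape for name, pool in pools.items()})
```

Calling `update` after each optimization step keeps the pool in sync with the changing classification layer parameters.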

605. Construct an intra-domain region graph.

According to the classification weights, given by the detection network, of the initial image features of the object to be detected over different object categories, the high-level semantic features in the semantic pool corresponding to the domain are mapped to the nodes of the intra-domain region graph to obtain the high-level semantic representation of the object to be detected. The weights on the edges between the nodes of the intra-domain region graph are given according to the weight of the relationship between the object categories corresponding to the different objects to be detected in the domain.

Specifically, the semantic pool is

That is, the parameters of the classifier, where C_T is the number of categories and D is the dimension of the classifier weight corresponding to each category. The semantic pool is mapped to the nodes of the intra-domain region graph through X = M_T P_T to obtain the high-level semantic representation of the object to be detected, where the element in the i-th row and j-th column of M_T is

s _ij is the inner product of the initial image feature of the i-th object to be detected and the parameter corresponding to the j-th classification category of the classifier. The weight of the edge between the i-th node and the j-th node in the region graph is

Where f_i and f_j are the feature of the i-th object to be detected in one domain and the feature of the j-th object to be detected in the other domain.

For each domain, an intra-domain region graph can be constructed separately according to the above method.
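The mapping X = M_T P_T can be sketched as follows. The inner products s_ij follow the description above, but normalizing them with a softmax to obtain M_T is an assumption, as is the requirement that the object features and the classifier weights share the same dimension D.

```python
import numpy as np

def softmax(s, axis=-1):
    """Numerically stable softmax along the given axis."""
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def high_level_semantics(F, P):
    """Map the semantic pool onto graph nodes.

    F: (N, D) initial image features of N objects to be detected.
    P: (C_T, D) semantic pool, one row per category.
    Returns X = M P with shape (N, D), and the weights M with shape (N, C_T).
    """
    S = F @ P.T       # s_ij: inner product of feature i and category j
    M = softmax(S)    # assumed normalization into classification weights
    return M @ P, M

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 16))
P = rng.normal(size=(6, 16))
X, M = high_level_semantics(F, P)
print(X.shape, M.sum(axis=1))
```

Each row of `X` is a weighted average of category semantics, so an object that the classifier is unsure about receives a blend of the candidate categories.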

606. Construct an inter-domain region graph.

According to the classification weights, given by the detection network, of the initial image features of the objects to be detected in their respective domains over different categories, the high-level semantic features of the semantic pool are mapped to the nodes of the inter-domain region graph to obtain the high-level semantic representation of the object to be detected. The weight of the relationship between categories is given according to the distance between the classification features of the objects to be detected in two different domains, and is then projected onto the edges between the nodes of the inter-domain region graph to obtain the edge weights of the inter-domain region graph. The distance in step 606 can be understood with reference to the explanation of the distance in the embodiment corresponding to FIG. 4, and details are not repeated here.

For two different domains, the features on the nodes of the inter-domain region graph are constructed in the same way as for the intra-domain region graph. The weight of the edge between the i-th node in one domain and the j-th node in the other domain of the inter-domain region graph is

607. Intra-domain graph convolutional network inference.

Through the constructed intra-domain region graph, the intra-domain graph convolutional network propagates the high-level semantic representations of different objects to be detected across the nodes. After inference, features that fuse the high-level semantic representations of other objects to be detected are obtained.

Specifically, a graph convolution with a spatial information mechanism can be used. The relative spatial information between the objects to be detected is used to learn K Gaussian kernels. The specific formula is:

Where ω_k is the k-th Gaussian kernel, μ_k and Σ_k are the learnable mean vector and covariance vector, and g_ij represents the relative spatial relationship between the i-th and j-th objects to be detected, given by the following formula:

Where x_i and x_j are the i-th row and the j-th row of X, and w_i, w_j, h_i and h_j are the widths and heights of the candidate frames of the i-th and j-th objects to be detected. The output of each graph convolution is:

$$f'_k(i) = \sum_{j \in \mathcal{N}(i)} \omega_k(g_{ij})\, x_j\, e_{ij}$$

where $\mathcal{N}(i)$ denotes the set of nodes adjacent to node i.

The K features obtained by the intra-domain graph convolution on each node will be fused into the corresponding high-level semantic representation of the object to be detected.
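The Gaussian-kernel graph convolution of step 607 can be sketched as follows. Several details are assumptions because the corresponding formulas are not available in the text: the relative geometry g_ij is taken as the center offset normalized by the width and height of box i, the covariance is treated as a per-dimension scale `sigma`, non-adjacent nodes are encoded by zero edge weights, and the K outputs are fused by concatenation.

```python
import numpy as np

def relative_geometry(boxes):
    """boxes: (N, 4) as (cx, cy, w, h). Returns g with shape (N, N, 2)."""
    cx, cy, w, h = boxes.T
    dx = (cx[None, :] - cx[:, None]) / w[:, None]
    dy = (cy[None, :] - cy[:, None]) / h[:, None]
    return np.stack([dx, dy], axis=-1)

def gaussian_kernels(g, mu, sigma):
    """Evaluate K diagonal Gaussian kernels on g. mu, sigma: (K, 2). Returns (K, N, N)."""
    diff = g[None] - mu[:, None, None, :]
    return np.exp(-0.5 * ((diff / sigma[:, None, None, :]) ** 2).sum(axis=-1))

def gaussian_graph_conv(X, E, boxes, mu, sigma):
    """f'_k(i) = sum_j omega_k(g_ij) * e_ij * x_j, for every kernel k."""
    omega = gaussian_kernels(relative_geometry(boxes), mu, sigma)
    return np.einsum("kij,ij,jd->kid", omega, E, X)

rng = np.random.default_rng(0)
boxes = np.array([[10.0, 10.0, 4.0, 4.0],
                  [12.0, 10.0, 4.0, 4.0],
                  [30.0, 30.0, 8.0, 8.0]])
X = rng.normal(size=(3, 5))            # high-level semantic representations
E = np.ones((3, 3)) - np.eye(3)        # edge weights, no self-loops
mu = np.array([[0.0, 0.0], [1.0, 0.0]])
sigma = np.ones((2, 2))

out = gaussian_graph_conv(X, E, boxes, mu, sigma)   # (K, N, D)
fused = out.transpose(1, 0, 2).reshape(len(X), -1)  # concatenate the K outputs per node
print(out.shape, fused.shape)
```

In a trained network `mu` and `sigma` would be learned, so each kernel specializes in a particular relative placement of neighboring objects.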

608. Inter-domain graph convolutional network inference.

Through the constructed inter-domain region graph, the inter-domain graph convolutional network propagates the high-level semantic representations of the objects to be detected in different domains across the nodes. After inference, features that fuse the high-level semantic representations of the objects to be detected in different domains are obtained.

609. Determine a candidate frame and classification of the object to be detected according to the initial image feature of the object to be detected and the enhanced image feature of the object to be detected.

The features obtained after inference by the intra-domain graph convolution and the inter-domain graph convolution are projected into the corresponding high-level semantic representation of the object to be detected, and classification and regression are then performed.

Step 609 can be understood with reference to step 404 in the embodiment corresponding to FIG. 4, and details are not repeated here.

The object detection method of the embodiment of the present application is described in detail above in combination with the flowchart. For a better understanding, the object detection method of the embodiment of the present application is described in detail below with reference to a more specific flowchart.

FIG. 7 is a schematic flowchart of an object detection method according to an embodiment of the present application.

The method shown in FIG. 7 may be executed by an object detection device, which may be an electronic device with an object detection function. The specific form of the electronic device can be as described above for the method shown in FIG. 4.

Step 1: Input the image into a traditional object detector to obtain preliminary candidate frames and the features of the objects to be detected.

Step 2: Use the classifiers corresponding to different domains in the object detector to extract the classification layer parameters, and construct a domain-related semantic pool for each domain to record the high-level semantic features of each category. This semantic pool is continuously updated as the classifier is optimized during training.

Step 3: Construct the intra-domain region graph: according to the classification weights, given by the detection network, of the features of the objects to be detected over different categories, map the high-level semantic features of the semantic pool to the nodes of the intra-domain region graph to obtain the high-level semantic representation of the objects to be detected. The weights on the edges between the nodes of the region graph are given according to the relationship between the features of different objects to be detected.

Step 4: Construct the inter-domain region graph: according to the classification weights, given by the detection network, of the features of the objects to be detected in their respective domains over different categories, map the high-level semantic features of the semantic pool to the nodes of the inter-domain region graph to obtain the high-level semantic representation of the objects to be detected. The weight of the relationship between categories is given according to the distance between the classification features of the objects to be detected in two different domains, and is then projected onto the edges between the nodes of the inter-domain region graph to obtain the edge weights of the inter-domain region graph.

Step 5: Intra-domain graph convolution: through the constructed intra-domain region graph, the intra-domain graph convolutional network propagates the high-level semantic representations of different objects to be detected across the nodes, and features that fuse the high-level semantic representations of other objects to be detected are obtained after inference. By learning a sparse region graph to fuse the high-level semantic representations of different objects to be detected, the feature expression ability of the objects to be detected is enhanced.

Step 6: Inter-domain graph convolution: through the constructed inter-domain region graph, the inter-domain graph convolutional network propagates the high-level semantic representations of the objects to be detected in different domains across the nodes, and features that fuse the high-level semantic representations of the objects to be detected in different domains are obtained after inference.

Step 7: Optimize and enhance the feature layer of the candidate region: project the features obtained after inference by the intra-domain graph convolution and the inter-domain graph convolution into the corresponding high-level semantic representation of the objects to be detected, and perform classification and regression, so as to improve large-scale detection performance.

The object detection method of the embodiment of the present application is described in detail above in conjunction with the drawings. To better illustrate its beneficial effects, the effect of the object detection method of the embodiment of the present application relative to existing object detection methods is described in detail below with specific examples, in conjunction with Table 1 and Table 2.

In the following, taking Table 1 as an example, the effects of several existing object detection methods and the object detection method provided in this application are compared using specific experimental data. The first method shown in Table 1 is the FPN detection method, and the second method is the multi-branch detection method (Multi Branches). Three data sets are used together to train the model: the MSCOCO data set, the Visual Genome (VG) data set and the ADE data set. In the test phase, each data set is tested separately. The MSCOCO data set contains general object detection annotations for 80 categories, with about 110,000 training images and 5,000 test images. The VG data set is a large-scale general object detection data set with 1,000 categories, 88,000 training images and 5,000 test images. The ADE data set is a large-scale general object detection data set with 445 categories, 20,000 training images and 1,000 test images.

When evaluating the effect of object detection, the average precision (AP) and average recall (AR) are mainly used, and the comparison considers the accuracy under different thresholds as well as the average precision and average recall of the objects.
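Precision and recall under different thresholds can be sketched as follows. This is a simplified illustration, not the evaluation protocol used for the tables: it assumes the thresholds are intersection-over-union (IoU) thresholds, matches predictions to ground-truth boxes greedily, and reports precision and recall at a single operating point rather than the full AP and AR metrics.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(preds, gts, thr):
    """Greedy one-to-one matching; preds are assumed sorted by confidence."""
    matched, tp = set(), 0
    for p in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if j not in matched and iou(p, g) > best:
                best, best_j = iou(p, g), j
        if best_j >= 0 and best >= thr:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 10), (21, 21, 31, 31), (50, 50, 60, 60)]
for thr in (0.5, 0.75):
    print(thr, precision_recall(preds, gts, thr))
```

A stricter threshold rejects the loosely localized second prediction, so both precision and recall drop; averaging over several thresholds rewards detectors that localize accurately.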

As shown in Table 1, when the model is trained on the three data sets (the MSCOCO data set, the VG data set and the ADE data set) and tested on each data set, the AP and AR of the method of this application are greater than the AP and AR of the first method and the second method, respectively. The larger the values of AP and AR, the better the effect of object detection. It can be seen from Table 1 that the method of the present application has a significant improvement in effect compared with several existing object detection methods.

Table 1:

The model in Table 1 is trained with three data sets. To better show the beneficial effects of the method provided in this application, Table 2 compares the method provided in this application with several existing object detection methods when the model is trained with two data sets. In addition to the object detection methods mentioned above, several other object detection methods can also be included when two data sets are used for model training, such as the third method, fine-tuning; the fourth method, the overlap label detection method (overlap labels); and the fifth method, the pseudo label detection method (pseudo labels).

As shown in Table 2, when any two of the three data sets (the MSCOCO data set, the VG data set and the ADE data set) are used for model training and the model is tested on different data sets, the AP and AR of the method of this application are larger than the AP and AR of the first to sixth object detection methods. The larger the values of AP and AR, the better the effect of object detection. It can be seen from Table 2 that the method of the present application has a significant improvement in effect compared with several existing object detection methods.

Table 2:

The method provided in this application significantly improves the detection effect in situations with severe object occlusion, ambiguous categories, and small-scale objects. Compared with other domain-transfer object detection methods, the method of this application effectively captures the internal relationships between different objects by constructing a multi-domain transferable knowledge graph, and uses graph convolutional networks to fuse information from a large number of different data sets and different categories. This greatly improves the data utilization rate, makes the detection performance higher, and realizes large-scale object detection.

Fig. 8 is a schematic flowchart of a neural network training method according to an embodiment of the present application. The method shown in FIG. 8 can be executed by a device with strong computing capabilities such as a computer device, a server device, or a computing device. A detailed introduction is given below.

801. Obtain training data.

The training data includes training images in different domains and object detection and labeling results of the objects to be detected in the training images.

802. Extract the initial image features of the object to be detected in the training image according to the neural network.

803. Extract an enhanced image feature of the object to be detected in the training image according to the neural network and the cross-domain knowledge map information.

804. Process the initial image feature and the enhanced image feature of the object to be detected according to the neural network to obtain an object detection result of the object to be detected.

805. Determine the model parameters of the neural network according to the object detection result of the object to be detected in the training image and the object detection and labeling result of the object to be detected in the training image.

Optionally, the object detection and annotation result of the object to be detected in the training image includes the annotation candidate frame and the annotation classification result of the object to be detected in the training image.

In addition, in the above training process, multiple different domains or different data sets can be used, and generally multiple training images are used.

In the process of training the above neural network, a set of initial model parameters can be set for the neural network, and the model parameters of the neural network are then gradually adjusted according to the difference between the object detection result of the object to be detected in the training image and the object detection annotation result of the object to be detected in the training image. When this difference falls within a certain preset range, or when the number of training iterations reaches a preset number, the model parameters of the neural network at that time are determined as the final parameters of the neural network model, thus completing the training of the neural network.
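The two stopping conditions described above can be sketched with a toy training loop. A linear least-squares model trained by gradient descent stands in for the detection network here; the model, the loss, the learning rate and the function name `train` are all assumptions made only to show the control flow.

```python
import numpy as np

def train(X, y, lr=0.1, tol=1e-4, max_iters=1000):
    """Gradient descent that stops when the loss is within `tol`
    or when `max_iters` iterations have been performed."""
    w = np.zeros(X.shape[1])              # initial model parameters
    for step in range(1, max_iters + 1):
        err = X @ w - y                   # prediction minus annotation
        loss = float(np.mean(err ** 2))
        if loss <= tol:                   # difference within the preset range
            break
        w -= lr * 2.0 * X.T @ err / len(y)
    return w, loss, step                  # loss is the value checked at `step`

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -1.0])             # toy "annotation results"

w, loss, step = train(X, y)
print(w, loss, step)
```

With the default settings the loop exits through the loss condition; passing a small `max_iters` exercises the iteration-count condition instead.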

It should be understood that the neural network trained through the method shown in FIG. 8 can be used to implement the object detection method of the embodiment of the present application.

In this application, when the neural network is trained, not only are the initial image features of the object to be detected in the training image extracted, but the enhanced image features of the object to be detected in the training image are also extracted, and the initial image features and the enhanced image features are used together to determine the object detection result of the object to be detected. That is to say, the training method of the present application extracts more features for object detection during training and can train a neural network with better performance, so that object detection using this neural network achieves better results.

In a specific embodiment, the cross-domain knowledge graph may include nodes and node connection edges, where the nodes correspond to the objects to be detected, and the node connection edges correspond to the relationships between the high-level semantic features of different objects to be detected. According to the classification weights of the initial image features in different domains over different object categories, the classification layer parameters corresponding to the different domains are weighted and fused to obtain the high-level semantic features of the object to be detected. The classification layer parameters can be understood as maintaining a class center for each category. The weight of the relationship between the object categories corresponding to the objects to be detected in different domains is projected onto the node connection edges of the objects to be detected to obtain the weights of the node connection edges.

In a specific implementation, the high-level semantic features are convolved according to the weights of the edges of the nodes, and the enhanced image features of the object to be detected can be obtained.

For example, if a car is on the street and a woman is carrying a handbag, then the street and the car are close in position, and the woman and the handbag satisfy a subject-verb-object relationship.

(4) The distance relationship obtained by training the neural network model according to the training data for different object categories in different domains.

Where f_i and f_j are the feature of the i-th object to be detected in one domain and the feature of the j-th object to be detected in the other domain, respectively (here "feature" is short for the initial image feature of the object to be detected).

The object detection method and the neural network training method of the embodiments of the present application are described in detail above with reference to the accompanying drawings, and the related devices of the embodiments of the present application are described in detail below with reference to FIGS. 9 to 11. It should be understood that the object detection devices shown in FIG. 9 and FIG. 10 can execute each step of the object detection method of the embodiment of the present application, and the neural network training device shown in FIG. 11 can execute each step of the neural network training method of the embodiment of the present application. Repeated descriptions are omitted as appropriate when the devices shown in FIG. 9 to FIG. 11 are introduced below.

Fig. 9 is a schematic block diagram of an object detection device according to an embodiment of the present application. The object detection device 7000 shown in FIG. 9 includes:

The image acquisition module 901 is configured to perform step 401 in the embodiment corresponding to FIG. 4 and step 601 in the embodiment corresponding to FIG. 6.

The feature extraction module 902 is configured to perform step 402 in the embodiment corresponding to FIG. 4, and steps 602, 603, 607 and 608 in the embodiment corresponding to FIG. 6.

The detection module 903 is configured to execute step 404 in the embodiment corresponding to FIG. 4 and step 609 in the embodiment corresponding to FIG. 6.

The parameter extraction module 904 is configured to execute step 403 in the embodiment corresponding to FIG. 4 and step 604 in the embodiment corresponding to FIG. 6.

The projection module 905 is configured to perform step 605 in the embodiment corresponding to FIG. 6 and step 606 in the embodiment corresponding to FIG. 6.

The relationship weight determination module 906 is configured to perform step 605 in the embodiment corresponding to FIG. 6 and step 606 in the embodiment corresponding to FIG. 6.

In this application, a large number of different data sets and different types of information can be effectively used to train the same network at the same time, which greatly improves the data utilization rate and makes the detection performance higher. Through graph convolution, relevant semantic information can be merged and transmitted in multiple different domains, and the intrinsic relationship between different objects in different data sets can be effectively captured, so that the annotation information of different domains and different data sets can be complementary. The high-level semantic information of the object to be detected that has been enhanced by intra-domain and inter-domain graph convolution can be used in multiple different domains at the same time to identify and classify objects, which greatly improves the recognition accuracy.

When the object detection method in the embodiment of the present application is executed by the execution device 110 in FIG. 1, the image acquisition module 901 in the above object detection device may be equivalent to the I/O interface 112 in the execution device 110, and the feature extraction module 902 and the detection module 903 in the object detection device may be equivalent to the calculation module 111 in the execution device 110.

When the object detection method of the embodiment of the present application is executed by the neural network processor in FIG. 3, the image acquisition module 901 in the above object detection device may be equivalent to the bus interface unit 510 in the neural network processor, and the feature extraction module 902 and the detection module 903 in the object detection device may be equivalent to the arithmetic circuit 503 in the execution device 110. Alternatively, the feature extraction module 902 and the detection module 903 in the object detection device may be equivalent to the combination of the arithmetic circuit 303, the vector calculation unit 307, and the accumulator 308 in the execution device 110.

FIG. 10 is a schematic block diagram of an object detection device according to an embodiment of the present application. The object detection device module shown in FIG. 10 includes a memory 1001, a processor 1002, a communication interface 1003, and a bus 1004. The memory 1001, the processor 1002, and the communication interface 1003 are communicatively connected to each other through the bus 1004.

The communication interface 1003 is equivalent to the image acquisition module 901 in the object detection device, and the processor 1002 is equivalent to the feature extraction module 902 and the detection module 903 in the object detection device. Each component of the object detection device module is described in detail below.

The memory 1001 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 1001 may store a program. When the program stored in the memory 1001 is executed by the processor 1002, the processor 1002 and the communication interface 1003 are used to execute each step of the object detection method in the embodiment of the present application. Specifically, the communication interface 1003 may obtain the image to be detected from a memory or other devices, and then the processor 1002 performs object detection on the image to be detected.

The processor 1002 may adopt a general central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits to execute related programs, so as to realize the functions to be executed by the modules in the object detection device of the embodiment of the present application (for example, the processor 1002 can implement the functions to be executed by the feature extraction module 902 and the detection module 903 in the above-mentioned object detection device), or to execute the object detection method of the embodiment of the present application.

The processor 1002 may also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the object detection method in the embodiment of the present application can be completed by an integrated logic circuit of hardware in the processor 1002 or instructions in the form of software.

The above-mentioned processor 1002 may also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The aforementioned general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. The software module can be located in a storage medium that is mature in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1001, and the processor 1002 reads the information in the memory 1001 and, in combination with its hardware, completes the functions required by the modules included in the object detection device of the embodiment of the present application, or performs the object detection method of the method embodiment of the present application.

The communication interface 1003 uses a transceiving device such as but not limited to a transceiver to implement communication between the device module and other devices or a communication network. For example, the image to be processed can be obtained through the communication interface 1003.

The bus 1004 may include a path for transferring information between various components of the device module (for example, the memory 1001, the processor 1002, and the communication interface 1003).

FIG. 11 is a schematic diagram of the hardware structure of a neural network training device according to an embodiment of the present application. Similar to the above device, the neural network training device shown in FIG. 11 includes a memory 1101, a processor 1102, a communication interface 1103, and a bus 1104. The memory 1101, the processor 1102, and the communication interface 1103 are communicatively connected to each other through the bus 1104.

The memory 1101 may store a program. When the program stored in the memory 1101 is executed by the processor 1102, the processor 1102 is configured to execute each step of the neural network training method of the embodiment of the present application.

The processor 1102 may adopt a general CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits for executing related programs to implement the neural network training method of the embodiment of the present application.

The processor 1102 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the neural network training method (the method shown in FIG. 8) of the embodiment of the present application can be completed by the integrated logic circuit of the hardware in the processor 1102 or the instructions in the form of software.

It should be understood that the neural network is trained by the neural network training device shown in FIG. 11, and the trained neural network can be used to execute the object detection method of the embodiment of the present application (the method shown in FIG. 8).

Specifically, the device shown in FIG. 11 can obtain training data and the neural network to be trained from the outside through the communication interface 1103, and then the processor trains the neural network to be trained according to the training data.

It should be noted that although the above device modules and devices only show the memory, processor, and communication interface, those skilled in the art should understand that, in a specific implementation, the device modules and devices may also include other devices necessary for normal operation. At the same time, according to specific needs, those skilled in the art should understand that the device modules and devices may also include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the device module and the device may also include only the devices necessary for implementing the embodiments of the present application, and not necessarily all the devices shown in FIG. 10 and FIG. 11.

A person of ordinary skill in the art may realize that the modules and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. A person skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

Those skilled in the art can clearly understand that, for the convenience and conciseness of description, the specific working process of the system, device and module described above can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method can be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of the modules is only a logical function division, and there may be other divisions in actual implementation. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or modules, and may be in electrical, mechanical or other forms.

The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed to multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.

If the function is implemented in the form of a software function module and sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application essentially, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

The above are only specific implementations of this application, but the protection scope of this application is not limited to this. Any changes or substitutions that a person skilled in the art can easily think of within the technical scope disclosed in this application should be covered within the protection scope of this application. Therefore, the protection scope of this application should be subject to the protection scope of the claims.

Claims

An object detection method, characterized in that it comprises:

Obtaining the image to be detected;

Determining the initial image feature of the object to be detected in the image to be detected;

Determining the enhanced image feature of the object to be detected according to cross-domain knowledge graph information, wherein the cross-domain knowledge graph information includes the association relationship between the object categories corresponding to the object to be detected in different domains, and the enhanced image feature indicates semantic information of object categories corresponding to other objects in the different domains that are associated with the object to be detected;

According to the initial image feature of the object to be detected and the enhanced image feature of the object to be detected, the candidate frame and classification of the object to be detected are determined.
The method according to claim 1, wherein the cross-domain knowledge graph comprises nodes and node edges, the nodes correspond to the object to be detected, and the node edges correspond to the relationships between the high-level semantic features of different objects to be detected, and the method further includes:

Acquiring classification layer parameters corresponding to the different domains;

Performing weighted fusion on the classification layer parameters corresponding to the different domains according to the classification weights of the initial image features in the different domains on different object categories, to obtain the high-level semantic features of the object to be detected;

Projecting the relationship weights between the object categories corresponding to the object to be detected in the different domains onto the node edges of the object to be detected, to obtain the weights of the node edges.
The method according to claim 2, wherein the method further comprises:

The weight of the relationship is determined according to the distance relationship between the object categories corresponding to the object to be detected in the different domains.
The method according to claim 3, wherein the distance relationship between the object categories corresponding to the object to be detected includes one or more of the following information:

The attribute relationship of different object categories among the object categories corresponding to the objects to be detected in different domains;

The positional relationship or active-object relationship between object categories corresponding to the objects to be detected in different domains;

Word embedding similarity constructed by using linguistic knowledge between object categories corresponding to the objects to be detected in different domains;

The distance relationship between the object categories corresponding to the objects to be detected in different domains is obtained by training the neural network model according to the training data.
The method according to any one of claims 2 to 4, wherein the determining the enhanced image feature of the object to be detected according to cross-domain knowledge graph information comprises:

Performing convolution processing on the high-level semantic features according to the weights of the node edges to obtain the enhanced image feature of the object to be detected.
An image detection device, characterized in that it comprises:

The image acquisition module is used to acquire the image to be detected;

The feature extraction module is used to determine the initial image feature of the object to be detected in the image to be detected;

The feature extraction module is further configured to determine the enhanced image feature of the object to be detected according to cross-domain knowledge graph information, where the cross-domain knowledge graph information includes the association relationship between object categories corresponding to the object to be detected in different domains, and the enhanced image feature indicates semantic information of object categories corresponding to other objects in the different domains that are associated with the object to be detected;

The detection module is configured to determine the candidate frame and classification of the object to be detected based on the initial image feature of the object to be detected and the enhanced image feature of the object to be detected.
The image detection device according to claim 6, wherein the cross-domain knowledge graph includes nodes and node edges, the nodes correspond to the object to be detected, and the node edges correspond to the relationships between the high-level semantic features of different objects to be detected, and the image detection device further includes a parameter acquisition module and a projection module,

The parameter acquisition module is configured to acquire classification layer parameters corresponding to the different domains;

The feature extraction module is specifically configured to weight and fuse the classification layer parameters corresponding to the different domains according to the classification weights of the initial image features in the different domains on different object categories, to obtain the high-level semantic features of the object to be detected;

The projection module is configured to project the relationship weights between the object categories corresponding to the objects to be detected in the different domains onto the node edges of the objects to be detected to obtain the weights of the node edges.
The image detection device according to claim 7, further comprising a relationship weight determination module,

The relationship weight determination module is configured to determine the relationship weight according to the distance relationship between the object categories corresponding to the object to be detected in the different domains.
The image detection device according to claim 8, wherein the distance relationship between the object categories corresponding to the object to be detected includes one or more of the following information:

The attribute relationship between object categories corresponding to the objects to be detected in different domains;

The positional relationship or active-object relationship between object categories corresponding to the objects to be detected in different domains;

Word embedding similarity constructed by using linguistic knowledge between object categories corresponding to the objects to be detected in different domains;

The distance relationship between the object categories corresponding to the objects to be detected in different domains is obtained by training the neural network model according to the training data.
The image detection device according to any one of claims 7 to 9, wherein:

The feature extraction module is specifically configured to perform convolution processing on the high-level semantic features according to the weights of the edges of the nodes to obtain the enhanced image features of the object to be detected.
An object detection device, characterized in that it comprises:

Memory, used to store programs;

The processor is configured to execute the program stored in the memory, and when the program stored in the memory is executed, the processor is configured to execute the method according to any one of claims 1-5.
A computer storage medium, wherein the computer storage medium stores program code, and the program code includes instructions for executing the steps in the method according to any one of claims 1-5.
A neural network training method, which is characterized in that it includes:

Acquiring training data, where the training data includes a training image and an object detection label result of the object to be detected in the training image;

Extracting the initial image features of the object to be detected in the training image according to the neural network;

Extracting enhanced image features of the object to be detected in the training image according to the neural network and the cross-domain knowledge graph information;

Processing the initial image feature and the enhanced image feature of the object to be detected according to the neural network to obtain an object detection result of the object to be detected;

The model parameters of the neural network are determined according to the object detection result of the object to be detected in the training image and the object detection and labeling result of the object to be detected in the training image.
A chip system, characterized in that the chip system includes at least one processor and an interface circuit, the interface circuit and the at least one processor are interconnected through a line, and the at least one processor executes instructions to perform the method according to any one of claims 1 to 5.
A processor, characterized by being used to execute the method according to any one of claims 1 to 5.
An object detection device for executing the method described in any one of claims 1 to 5.
A computer program product containing instructions that, when run on a computer, causes the computer to execute the method described in any one of claims 1 to 5.
An electronic device comprising an image detection device, wherein the image detection device is the image detection device described in any one of claims 6 to 10.