CN116993996B - Method and device for detecting object in image - Google Patents

Method and device for detecting object in image

Info

Publication number
CN116993996B
CN116993996B
Authority
CN
China
Prior art keywords
feature
image
query
features
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311152480.9A
Other languages
Chinese (zh)
Other versions
CN116993996A (en)
Inventor
李嘉麟
付威福
林愉欢
刘永
汪铖杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311152480.9A priority Critical patent/CN116993996B/en
Publication of CN116993996A publication Critical patent/CN116993996A/en
Application granted granted Critical
Publication of CN116993996B publication Critical patent/CN116993996B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

Embodiments of the present disclosure provide a method, apparatus, computer program product, and storage medium for detecting objects in images, applicable to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, and driving assistance. The method comprises the following steps: obtaining the features of an image by using an image feature extraction network; acquiring a first query feature; based on the features of the image and the first query feature, acquiring a region of interest in the image and a second query feature by using a global positioning network, wherein the second query feature is the optimized first query feature; and acquiring a detection result of the object by using a local detection network based on the second query feature and the region of interest in the image. The disclosed method detects objects in an image through only two stages, global positioning and local detection, without manually designed modules, effectively simplifying the model structure while balancing the efficiency and accuracy of target detection.

Description

Method and device for detecting object in image
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to methods, apparatuses, computer program products and storage media for detecting objects in images, and methods, apparatuses, computer program products and storage media for training neural network models.
Background
As one of the most important sources from which humans acquire information, images carry great weight among the various forms of information. With the rapid development of computer technology, image processing has been widely applied to many aspects of human social life, such as industrial inspection, medicine, and intelligent robots. Because of their vividness and intuitiveness, images are often used across many fields to describe and express the characteristics and logical relationships of things; their range of application is therefore wide, making the development of image processing technology extremely important for information processing in various fields.
Image processing technology is the technology of processing image information with a computer. It mainly comprises image target detection, image enhancement and restoration, image data coding, image segmentation, image recognition, and the like. Image target detection technology is widely used in application scenarios such as security monitoring, automatic driving, traffic condition monitoring, drone (UAV) scene analysis, and robot vision. With the development of artificial intelligence technology and the improvement of processor computing power, deep learning models have been widely used throughout the field of computer vision, including general-purpose target detection and domain-specific target detection. Most object detectors use a deep learning network as their backbone network and detection network to extract features from the input image (or video) and to further classify and locate objects. Object detection is a computer technology, associated with computer vision and image processing, for detecting instances of semantic object classes (such as people, buildings, or cars) in digital images and videos. Research fields of object detection include multi-category detection, edge detection, salient object detection, gesture detection, scene text detection, face detection, pedestrian detection, and the like. As an important component of scene understanding, object detection is widely used in many areas of modern life, such as security, traffic, medicine, and daily life.
As the application scenarios of target detection become increasingly complex and the volume of target detection data grows ever larger, how to improve the efficiency and the accuracy of target detection at the same time is an urgent problem to be solved.
Disclosure of Invention
In order to ensure the efficiency and accuracy of target detection with a simplified model structure, the present disclosure provides a method for detecting an object in an image, including: obtaining the features of the image by using an image feature extraction network; acquiring a first query feature; based on the features of the image and the first query feature, acquiring a region of interest in the image and a second query feature by using a global positioning network, wherein the second query feature is the optimized first query feature; and acquiring a detection result of the object by using a local detection network based on the second query feature and the region of interest in the image.
Embodiments of the present disclosure also provide a method of training a neural network model, comprising: obtaining the features of an image by using an image feature extraction network; acquiring a first query feature, wherein the query feature is used, together with the features of the image, to determine a region of interest in the image; based on the features of the image and the first query feature, acquiring the area where the object is located and a second query feature by using a global detection network, wherein the second query feature is the optimized first query feature; based on the area where the object is located and the second query feature, acquiring a detection result of the object by using a local detection network; acquiring labels for object detection; and training the neural network model based on the detection result of the object and the object detection labels to update parameters of the neural network model, wherein the neural network model comprises the image feature extraction network, the global detection network, and the local detection network.
The embodiments of the present disclosure also provide an apparatus for detecting an object in an image, including: an image feature acquisition module configured to obtain the features of the image by using an image feature extraction network; a query feature acquisition module configured to acquire a first query feature, wherein the query feature is used, together with the features of the image, to determine a region of interest in the image; a global detection network module configured to acquire the area where the object is located and a second query feature based on the features of the image and the first query feature, wherein the second query feature is the optimized first query feature; and a local detection network module configured to acquire a detection result of the object based on the area where the object is located and the second query feature.
The embodiments of the present disclosure also provide an apparatus for training a neural network model, including: an image feature acquisition module configured to obtain the features of the image by using an image feature extraction network; a query feature acquisition module configured to acquire a first query feature, wherein the query feature is used, together with the features of the image, to determine a region of interest in the image; a global detection network module configured to acquire the area where the object is located and a second query feature by using a global detection network based on the features of the image and the first query feature, wherein the second query feature is the optimized first query feature; a local detection network module configured to acquire a detection result of the object by using a local detection network based on the area where the object is located and the second query feature; a label acquisition module configured to acquire labels for object detection; and a training module configured to train the neural network model based on the detection result of the object and the object detection labels to update parameters of the neural network model, wherein the neural network model comprises the image feature extraction network, the global detection network, and the local detection network.
Embodiments of the present disclosure also provide a computer program product comprising computer software code which, when run by a processor, performs the above method.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, perform the above-described method.
Compared with conventional deep-learning-based target detection methods, the method of the present disclosure requires no manually designed modules and relies entirely on machine learning to optimize model parameters, which effectively saves labor while allowing the target object to be detected more accurately, unconstrained by the limitations of manual design.
In addition, existing methods for detecting target objects based on query features ensure detection accuracy through the processing of a multi-stage detector, whereas the method of the present disclosure detects objects in an image through only two stages, global positioning and local detection, which effectively simplifies the model structure while balancing the efficiency and accuracy of target detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are used in the description of the embodiments will be briefly described below. It should be apparent that the drawings in the following description are only some exemplary embodiments of the present disclosure, and that other drawings may be obtained from these drawings by those of ordinary skill in the art without undue effort.
Here, in the drawings:
FIG. 1 shows a schematic diagram of an application scenario according to an embodiment of the present disclosure;
FIG. 2 is an example schematic diagram illustrating a scenario of object detection and training based on a neural network model for object detection, according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart diagram illustrating a method of detecting an object in an image according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating an image feature acquisition process according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a process of detecting objects in an image according to an embodiment of the present disclosure;
FIG. 6A is a schematic diagram illustrating a global detection network based process according to an embodiment of the present disclosure;
FIG. 6B is a schematic diagram illustrating a local detection network-based process according to an embodiment of the present disclosure;
FIG. 7 is a schematic flow chart diagram illustrating a method of training a neural network model according to an embodiment of the present disclosure;
FIG. 8 is a block diagram illustrating an apparatus for detecting an object in an image according to an embodiment of the present disclosure;
FIG. 9 is a block diagram illustrating an apparatus for training a neural network model according to an embodiment of the present disclosure; and
FIG. 10 is a diagram illustrating an architecture of a computing device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
In addition, in the present specification and the drawings, steps and elements that are substantially the same or similar are denoted by the same or similar reference numerals, and repeated descriptions of these steps and elements are omitted.
Furthermore, in the present specification and drawings, elements are described in the singular or plural form according to the embodiments. However, the singular and plural forms are properly selected for the proposed case only for convenience of explanation and are not intended to limit the present disclosure thereto. Accordingly, the singular may include the plural and the plural may include the singular unless the context clearly indicates otherwise.
Meanwhile, in the description of the present disclosure, the terms "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance or order.
For purposes of describing the present disclosure, the following presents concepts related to the present disclosure.
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence involves robot control, natural language processing, computer vision, decision-making and reasoning, human-machine interaction, information recommendation and search, and so forth.
A neural network (NN) is an important branch of artificial intelligence; it is a network structure that imitates the behavioral characteristics of animal neural networks to process information. A neural network is formed by interconnecting a large number of nodes (or neurons), and achieves the goal of processing information by learning and training on input information based on a specific operational model. A neural network comprises an input layer, hidden layers, and an output layer. The input layer is responsible for receiving input signals, the output layer is responsible for outputting the calculation results of the neural network, and the hidden layers are responsible for calculation processes such as learning and training; the hidden layers are the memory units of the network, their memory function is represented by weight matrices, and each neuron generally corresponds to a weight coefficient.
The DEtection TRansformer (DETR) model is a Transformer-based end-to-end target detection network. The input to the DETR model is an image; image features are encoded by a convolutional neural network (CNN), and these features are then input as a sequence into a Transformer. In the Transformer, DETR employs an encoder-decoder structure, where the encoder semantically encodes the image features and the decoder generates the target detection results. The advantage of DETR is that it enables end-to-end target detection without using traditional candidate-box generation and screening methods, reducing model complexity and computational effort. In addition, DETR can handle the detection of multiple targets in an image and achieves performance on some datasets comparable to conventional target detection models.
Query-based object detector (target detector based on query features): DETR proposes an end-to-end object detection algorithm based on query features (query-based) that abandons the previous practice of always basing object predictions on anchor boxes or anchor points at fixed spatial locations (anchor-based/keypoint-based), relying instead on learnable vectors for prediction. During training, predicted results are matched one-to-one with the actual ground truth, and the matching result determines how the prediction loss is computed. This one-to-one matching effectively prevents the network from producing duplicate, redundant predictions, so post-processing algorithms such as non-maximum suppression are not needed in the inference stage, realizing end-to-end target detection.
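To make the one-to-one matching concrete, here is a minimal, hedged sketch in Python (assuming PyTorch and SciPy; the cost terms are simplified stand-ins, not the exact losses used by DETR or this disclosure):

```python
# One-to-one (Hungarian) matching between predictions and ground truth,
# as used by query-based detectors.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (N, num_classes); pred_boxes: (N, 4);
    gt_labels: (M,) int64; gt_boxes: (M, 4). Returns matched index pairs."""
    prob = pred_logits.softmax(-1)                     # (N, num_classes)
    cls_cost = -prob[:, gt_labels]                     # (N, M); high prob, low cost
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M); L1 box distance
    cost = (cls_cost + box_cost).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)     # one-to-one assignment
    return pred_idx, gt_idx
```

Each prediction left unmatched is treated as background, which is what removes the need for non-maximum suppression at inference time.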
Attention mechanism: the attention mechanism is a computer model or algorithm that simulates the way human attention works; it mimics how humans process specific information by selectively focusing on what is relevant to the task. The attention mechanism generally consists of the following steps. Calculating attention weights: the attention weight of each input is calculated from the input features (query features) and the current state of the model (key features); this may be achieved by computing the similarity or correlation between the input and the current state. Weighted summation: the input representations (values) are multiplied by the corresponding attention weights and summed, yielding a weighted input representation; this focuses attention on task-relevant inputs. Updating the model state: based on the weighted input representation, the state of the model is updated for further processing. The above process may be iterative, repeatedly computing attention weights and updating the model state.
Various neural networks (or neural network models) that may be used in embodiments of the present disclosure below may be artificial intelligence models, and in particular artificial intelligence based neural network models. Typically, artificial intelligence based neural network models are implemented as loop-free graphs, in which neurons are arranged in different layers. Typically, the neural network model includes an input layer and an output layer, which are separated by at least one hidden layer. The hidden layer transforms the input received by the input layer into a representation useful for generating an output in the output layer. The network nodes (i.e., neurons) are all connected to nodes in adjacent layers via edges, and there are no edges between nodes within each layer. Data received at a node of an input layer of the neural network is propagated to a node of an output layer via any one of a hidden layer, an active layer, a pooling layer, a convolutional layer, and the like. The input and output of the neural network model may take various forms, which is not limited by the present disclosure.
In summary, the present disclosure relates to techniques of artificial intelligence, image detection, and the like. Embodiments of the present disclosure will be further described below with reference to the accompanying drawings.
First, an application scenario of a method according to an embodiment of the present disclosure and a corresponding apparatus or the like will be described with reference to fig. 1. Fig. 1 shows a schematic diagram of an application scenario 100, in which a server 110 and a plurality of terminals 120 are schematically shown, according to an embodiment of the present disclosure.
The neural network model for object detection of the embodiments of the present disclosure may be integrated in an apparatus for detecting an object in an image and located in various electronic devices, for example, any of the server 110 and the plurality of terminals 120 in FIG. 1. For example, the neural network model for object detection may be integrated in a terminal 120. The terminals 120 include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart home appliances, and vehicle-mounted terminals. As another example, the neural network model for object detection may also be integrated at the server 110. The server 110 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited by the present disclosure.
It can be understood that the device for detecting the object in the image by applying the embodiment of the present disclosure may be a terminal, a server, or a system composed of the terminal and the server. The method for detecting the object in the image by applying the embodiment of the disclosure can be executed on the terminal, can be executed on the server, and can be executed by the terminal and the server together.
The neural network model for object detection provided by the embodiment of the disclosure can be used for performing various tasks of object detection on objects in images. For example, tumor detection based on medical images, vehicle detection in traffic images, face detection, industrial defect detection, and the like.
The neural network model for object detection provided by the embodiments of the present disclosure may also involve artificial intelligence cloud services in the field of cloud technology. Cloud technology refers to a hosting technology that unifies resources such as hardware, software, and networks in a wide area network or local area network to realize the calculation, storage, processing, and sharing of data. Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, and the like based on the cloud computing business model; it can form a pool of resources that are used on demand, flexibly and conveniently. Cloud computing technology will become an important support: the background services of technical network systems, such as video websites, image websites, and other portals, require large amounts of computing and storage resources. As the internet industry develops, each article may in the future carry its own identification mark that must be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong backing system support, which can only be realized through cloud computing.
Among them, artificial intelligence cloud services are also commonly called AIaaS (AI as a Service). This is currently the mainstream service mode for artificial intelligence platforms. Specifically, an AIaaS platform splits several common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI-themed mall: all developers can access one or more of the platform's artificial intelligence services through application programming interfaces (APIs), and some experienced developers can also deploy, operate, and maintain their own proprietary cloud artificial intelligence services using the AI frameworks and AI infrastructure provided by the platform.
Fig. 2 is an example schematic diagram illustrating a scenario 200 for object detection and training based on a neural network model for object detection, according to an embodiment of the present disclosure.
During the training phase, the server 110 may train a neural network model for object detection based on the image training samples. After training is completed, the server may deploy the trained neural network model for object detection to one or more servers (or cloud services) to provide artificial intelligence services related to detecting objects in the images.
Notably, all images used in this disclosure comply with applicable laws, regulations, and privacy requirements. In particular, the source of all images is legitimate, and clear user consent was obtained during acquisition. Furthermore, all images used in this disclosure adhere to privacy protection guidelines, have been strictly screened and cleaned, and are not disclosed to any unauthorized third party.
In the stage of detecting objects in images based on the neural network model for object detection, it is assumed that a client or application that interacts with the server 110 for object detection (i.e., any of various image detection applications, e.g., for person detection, vehicle detection, etc.) has been installed on the user terminal 120 performing object detection. The user terminal 120 may send a request for detecting an object in an image to the server 110 corresponding to the application through a network, to request the neural network for object detection deployed on the server 110 to detect the object in the image. For example, after the server 110 receives such a request, it detects the object in the image in response to the request using the trained neural network model for object detection and feeds the object detection result back to the user terminal 120. The user terminal 120 may receive the object detection result and may then perform further analysis or processing based on it.
Notably, the image training sample data shown in fig. 2 may also be updated in real-time. For example, the user may score the results of object detection. For example, if the user considers that the rationality and accuracy of the object detection results are high, the user may give a high score to the object detection results, and the server 110 may treat the object detection results as positive samples for training the neural network model for object detection in real time. If the user gives a lower score to the result of the object detection, the server 110 may take the object detection result as a negative sample.
The image training sample set shown in fig. 2 may also be set in advance. For example, referring to fig. 2, the server may obtain training data (e.g., image training samples) from a database and then generate a set of image training samples for the neural network model for object detection. Of course, the present disclosure is not limited thereto.
Fig. 3 is a schematic flow chart illustrating a method 300 of detecting an object in an image according to an embodiment of the disclosure.
In step S310, the features of the image are acquired using the image feature extraction network.
It should be understood that the image feature extraction network of the present disclosure may have various structures, for example, a convolutional neural network (CNN), a residual network (ResNet), an Inception network, a VGG network, a feature pyramid network (FPN), a dilated convolutional network (DilatedNet), and the like. Alternatively, the image feature extraction network of the present disclosure may include one of the above neural network structures, or a combination of several of them, as desired.
It should be appreciated that in order to make the information contained in the features of the image more rich and comprehensive, the image feature extraction network may be utilized to obtain features of different sizes of the image on different feature layers; and fusing the features with different sizes on the different feature layers to obtain the features of the image.
According to an embodiment of the disclosure, each of the features of different sizes may include the same number of sub-features, and the corresponding sub-features on the different feature layers may be fused to obtain the feature of the image.
Optionally, in order to make the features of the image contain as much important information as possible, the features of different sizes may be fused to obtain a first image feature; the first image feature may then be upsampled to obtain a second image feature; and the second image feature may then be downsampled to obtain a third image feature, which is taken as the feature of the image.
It should be appreciated that the third image feature may or may not be the same size as the first image feature. For example, the size of the third image feature may be smaller than the size of the first image feature, thereby reducing the amount of data to be processed by the computer.
In step S320, a first query feature is acquired, wherein the query feature is used to determine a region of interest in the image together with features of the image.
According to embodiments of the present disclosure, the present disclosure improves the manner in which query features are determined, so that the query features, together with the features of the image, can more accurately determine the region of interest in the image (i.e., the region where the object is located).
In particular, a plurality of attribute features may be determined from a plurality of attributes of the object, wherein each attribute feature indicates at least one of a color, a shape, a size, a direction of the object; the plurality of attribute features may then be weighted based on the weights corresponding to the attribute features to obtain the first query feature. In this way, query features are constrained by the properties of the object, such as color, shape, size, orientation, etc., and regions of interest in the image can be more accurately determined in conjunction with the features of the image.
It should be appreciated that an attribute feature may correspond not only to one attribute of an object, but also to a plurality of attributes of an object. For example, an attribute feature may indicate that the object is a yellow circle, or that the object is a small green object, and so on.
In step S330, based on the features of the image and the first query feature, a global detection network is used to obtain a region where the object is located and a second query feature, where the second query feature is the optimized first query feature.
According to an embodiment of the present disclosure, the global detection network may include: a cross-attention network and a first self-attention network. In this case, a third query feature may be first obtained using the cross-attention network based on the feature of the image and the first query feature, wherein the third query feature is the optimized first query feature; acquiring a fourth query feature by using the first self-attention network based on the third query feature, wherein the fourth query feature is the optimized third query feature; then taking the fourth query feature as the second query feature; and acquiring the area where the object is located based on the second query feature.
According to another embodiment of the present disclosure, the global detection network may include: a cross-attention network, a first self-attention network, and a first point feature sampling network. In this case, a third query feature may be first obtained using the cross-attention network based on the feature of the image and the first query feature, wherein the third query feature is the optimized first query feature; acquiring a fourth query feature by using the first self-attention network based on the third query feature, wherein the fourth query feature is the optimized third query feature; then, based on the fourth query feature and the feature of the image, acquiring the second query feature by utilizing the first point feature sampling network, wherein the second query feature is the optimized fourth query feature; and then acquiring the area where the object is located based on the second query feature.
According to an embodiment of the present disclosure, obtaining the second query feature using the first point feature sampling network may include: and for each point in the image, obtaining the sampling characteristic of the point based on the point sampling characteristic of the point on each characteristic layer and the weight corresponding to the point sampling characteristic, obtaining the sampling characteristic of the image based on the sampling characteristic of each point in the image, and taking the sampling characteristic of the image as the second query characteristic. In this way, the first point feature sampling network enables points in different directions of the object to correspond to features on feature maps of different sizes. Therefore, the advantages of different feature layers can be fully utilized, so that the finally obtained second query features better reflect the characteristics of the objects in the image.
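Stated compactly (with assumed notation, since the source does not fix symbols at this point), the sampled feature of a point p is a weighted combination over the feature layers:

$$\tilde{f}(p) = \sum_{j} w_j(p) \, f_j(p)$$

where f_j(p) is the point sampling feature of p on the j-th feature layer and w_j(p) is the weight corresponding to that point sampling feature.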
For example, a feature map of small size can better reflect the abstract semantic features of an image, while a feature map of large size can better reflect its detailed features. When the object to be detected in the image is elongated (i.e., the length L of the object is far greater than its width W), its characteristics along both L and W cannot be reflected well by sampling features from a single feature map.
In step S340, based on the area where the object is located and the second query feature, a detection result of the object is obtained by using a local detection network.
According to an embodiment of the disclosure, the local detection network may include a region feature fusion network and a second self-attention network. In this case, the features of the region where the object is located and the second query feature may be fused by the region feature fusion network to obtain a fifth query feature; based on the fifth query feature, a sixth query feature is acquired using the second self-attention network, wherein the sixth query feature is the optimized fifth query feature; the detection result of the object is then obtained using the sixth query feature and the features of the region where the object is located, wherein the detection result of the object comprises the predicted category of the object and the predicted position of the object.
According to an embodiment of the disclosure, in order to obtain a fifth query feature, the feature of the region where the object is located may be transformed based on the second query feature to obtain a first transformed feature, the feature of the region where the object is located is linearly transformed to obtain a second transformed feature, and the second query feature, the first transformed feature and the second transformed feature are fused to obtain the fifth query feature, where the sizes of the first transformed feature and the second transformed feature are the same as the size of the second query feature.
According to an embodiment of the disclosure, in order to obtain the first transformation feature, the feature of the region where the object is located may be upsampled to obtain a fourth image feature; and then downsampling the fourth image feature to obtain a fifth image feature, and adding the fifth image feature and the second query feature to obtain the first transformation feature, wherein the size of the fifth image feature is the same as the size of the second query feature.
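As a rough illustration of this fusion step, the following sketch (assuming PyTorch; the layer dimensions and the concatenation-based final fusion are assumptions, and the region feature is treated as one flattened vector per query) mirrors the upsample-downsample-add construction of the first transformed feature:

```python
# A sketch of the region feature fusion network: the region feature is
# upsampled (fourth image feature), downsampled (fifth image feature) and
# added to the query (first transformed feature); a plain linear map gives
# the second transformed feature; the three are fused into the fifth query.
import torch
import torch.nn as nn

class RegionFeatureFusion(nn.Module):
    def __init__(self, region_dim=85, mid_dim=256, query_dim=64):
        super().__init__()
        self.up = nn.Linear(region_dim, mid_dim)        # -> fourth image feature
        self.down = nn.Linear(mid_dim, query_dim)       # -> fifth image feature
        self.linear = nn.Linear(region_dim, query_dim)  # second transformed feature
        self.fuse = nn.Linear(3 * query_dim, query_dim)

    def forward(self, region_feat, query_feat):
        # region_feat: (n, region_dim); query_feat: (n, query_dim).
        t1 = self.down(self.up(region_feat)) + query_feat  # first transformed feature
        t2 = self.linear(region_feat)                      # second transformed feature
        return self.fuse(torch.cat([query_feat, t1, t2], dim=-1))  # fifth query
```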
According to another embodiment of the present disclosure, the local detection network may include: a regional feature fusion network, a second self-attention network, and a second point feature sampling network. In this case, the features of the region where the object is located and the second query feature may be fused by using a region feature fusion network based on the features of the region where the object is located and the second query feature, so as to obtain a fifth query feature; acquiring the sixth query feature by using the second self-attention network based on the fifth query feature, wherein the sixth query feature is the optimized fifth query feature; then, based on the sixth query feature and the feature of the area where the object is located, acquiring the seventh query feature by using the second point feature sampling network, wherein the seventh query feature is the optimized sixth query feature; and then obtaining a detection result of the object based on the seventh query feature, wherein the detection result of the object comprises a prediction category of the object and a prediction position of the object (for example, the detection result may be embodied in a box at the position of the object).
According to an embodiment of the present disclosure, obtaining the seventh query feature using the second point feature sampling network may include: and for each point in the area where the object is located, obtaining the sampling characteristic of the point based on the point sampling characteristic of the point on each characteristic layer and the weight corresponding to the point sampling characteristic, obtaining the sampling characteristic of the area where the object is located based on the sampling characteristic of each point in the area where the object is located, and taking the sampling characteristic of the area where the object is located as the seventh query characteristic.
Similar to the function of the first point feature sampling network of the global detection network, the second point feature sampling network can enable points in different directions of the object to correspond to features on feature graphs of different sizes. Therefore, the advantages of different feature layers can be fully utilized, so that the finally obtained seventh query feature better reflects the characteristics of the object in the image, and a more accurate detection result is obtained.
It should be appreciated that the number of local detection networks may be one or more in accordance with embodiments of the present disclosure. The structures of the plurality of local detection networks may be the same or different (e.g., a first local detection network includes a region feature fusion network and a second self-attention network, and a second local detection network includes a region feature fusion network, a second self-attention network, and a second point feature sampling network).
The method 300 of detecting objects in images of the present disclosure may be used to detect a single object in an image as well as to detect multiple objects in an image simultaneously. For example, when the method 300 is used for face recognition, only the human face portion of the image may be detected; when the method 300 is used for recognizing animals, multiple animals in the image, such as dogs, cats, and horses, may be detected at the same time.
As can be seen from the above description of the method 300 for detecting an object in an image, the present disclosure realizes detection of objects in an image through only two stages, global positioning and local detection, which effectively simplifies the model structure.
In addition, compared with conventional deep-learning-based target detection methods, detecting the target object based on query features effectively prevents the network from producing duplicate, redundant predictions, so post-processing algorithms such as non-maximum suppression are not needed in the inference stage, realizing end-to-end target detection.
Fig. 4 is a schematic diagram illustrating an image feature acquisition process according to an embodiment of the present disclosure.
As shown in fig. 4, in order to make the information contained in the features of the image richer and more comprehensive, the features of the image with different sizes on different feature layers can be obtained by using an image feature extraction network; and fusing the features with different sizes on the different feature layers to obtain the features of the image.
For example, a feature pyramid network (FPN) may be used as the image feature extraction network to extract features of different sizes on different feature layers. Alternatively, the FPN may extract features on 5 different feature layers to obtain 5 feature maps of different sizes, from feature map F1 to feature map F5, where the feature map F2 may be 4 times the size of the feature map F1 (i.e., the length of F2 may be 2 times that of F1 and the width of F2 may be 2 times that of F1), the feature map F3 may be 4 times the size of the feature map F2, and so on, up to the feature map F5 being 4 times the size of the feature map F4.
The feature map F2, the feature map F3, the feature map F4, and the feature map F5 may be selected from the feature maps F1 to F5 for further feature fusion. Specifically, each of the feature maps F2 to F5 may be divided into the same number of sub-features. For example, the feature map F2 may include a sub-feature maps A1 of the same size, the feature map F3 may include a sub-feature maps A2 of the same size, the feature map F4 may include a sub-feature maps A3 of the same size, and the feature map F5 may include a sub-feature maps A4 of the same size, where a is a positive integer. As shown by the shaded portion of each feature map in FIG. 4, the sub-features at corresponding positions in the feature maps may be fused; for example, the corresponding sub-feature maps A1, A2, A3, and A4 are fused.
Assume that the feature corresponding to sub-feature map A1 is A1-1, the feature corresponding to sub-feature map A2 is A2-1, the feature corresponding to sub-feature map A3 is A3-1, and the feature corresponding to sub-feature map A4 is A4-1. The size of feature A1-1 can then be expressed as (c, 1), where c is the number of channels (typically 256); the size of feature A2-1 can be expressed as (c, 4); the size of A3-1 can be expressed as (c, 16); and the size of A4-1 can be expressed as (c, 64). Thus, the features A1-1, A2-1, A3-1, and A4-1 are fused (e.g., by a concatenation operation) to obtain a first image feature of size (c, 85), which may be taken as a feature of the image.
In order to make the features of the image contain important information as much as possible, the first image feature may be up-sampled to obtain a second image feature; and then downsampling the second image feature to obtain a third image feature, and taking the third image feature as the feature of the image. The upsampling and downsampling processes may be implemented by a linear transformation layer.
For example, the first image feature of size (c, 85) may first be upsampled through a feed-forward network (FFN) to obtain a second image feature of size (c, 256); the second image feature of size (c, 256) is then downsampled to obtain a third image feature of size (c, k_mff), which is taken as the feature of the image. Here, k_mff represents the number of kernels output at the same location on the feature map; optionally, k_mff may be 64.
Therefore, through feature fusion, a feature of size (h_s, w_s, c, k_mff) is finally obtained for each image, where h_s and w_s are the height and width of the smallest feature map (i.e., feature map F2 here).
Through the above processing, the features of the image contain contextual information from different feature layers, which enhances the ability to distinguish objects when the features of the image are used for object detection.
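The size bookkeeping above can be made concrete with a short sketch (assuming PyTorch; the unfold-based grouping of F3 to F5 cells under each F2 location and the ReLU inside the FFN are illustrative assumptions):

```python
# A sketch of the multi-level feature fusion: per location of the smallest
# selected map F2, the corresponding 1/4/16/64 cells of F2..F5 are
# concatenated to (c, 85), projected up to (c, 256) and back down to
# (c, k_mff = 64) by linear layers acting on the last dimension.
import torch
import torch.nn as nn

c, h_s, w_s, k_mff = 256, 32, 32, 64  # assumed sizes

class MultiLevelFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(85, 256)       # (c, 85)  -> (c, 256)
        self.down = nn.Linear(256, k_mff)  # (c, 256) -> (c, 64)

    def forward(self, f2, f3, f4, f5):
        # fX: (c, H, W), with the side of f3 twice that of f2, and so on.
        def cells(f, n):  # group each n x n block under one F2 location
            return f.unfold(1, n, n).unfold(2, n, n).reshape(c, h_s, w_s, n * n)
        fused = torch.cat([cells(f2, 1), cells(f3, 2),
                           cells(f4, 4), cells(f5, 8)], dim=-1)  # (c, h_s, w_s, 85)
        x = self.down(torch.relu(self.up(fused)))                # (c, h_s, w_s, 64)
        return x.permute(1, 2, 0, 3)                             # (h_s, w_s, c, k_mff)
```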
Fig. 5 is a schematic diagram illustrating a process of detecting an object in an image according to an embodiment of the present disclosure.
As shown in FIG. 5, the initial first query feature is q_g and the feature of the image is f. After processing by the global detection network, the optimized second query feature q_l and the region of interest (RoI) in the image can be obtained. Optionally, the global detection network may include a linear layer and a feed-forward network (FFN): after processing by the linear layer, the global detection network outputs the optimized second query feature q_l, and after processing by the FFN, it outputs the RoI in the image. The RoI in the image may include the location of the region of interest (e.g., represented in the form of a location box) and a score corresponding to the region of interest.
Taking the second query feature q_l and the RoI in the image as inputs to the local detection network, the category of the detected object and the position of the detected object (e.g., represented in the form of a location box) can be obtained after processing by the local detection network.
The query feature reflects a combination of attributes of the object (e.g., basic visual features of the object: color, shape, size, direction, etc.); different objects have different attribute combinations, i.e., different query features. Thus, m learnable attribute features of dimension d (by default, m = 256 and d = 256) can be used to represent the basic visual features of an object. Let Q = {q_1, ..., q_n} denote the query features of all n objects; each query feature q_i can then be expressed as a weighted sum of the m attribute features m_j, as shown in formula (1):

$$q_i = \sum_{j=1}^{m} w_{ij} \, m_j \qquad (1)$$

where w_{ij} is the weight assigned to the j-th attribute feature of the i-th query feature. The attribute features m_j and the weights w_{ij} are learnable during training.
With this way of determining query features, the query features are constrained by attributes of the object such as color, shape, size, and direction, and can, together with the features of the image, determine the region of interest in the image more accurately.
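A minimal sketch of this construction (assuming PyTorch; the random initialization is an assumption):

```python
# Query features as learned weighted sums of shared attribute features,
# per formula (1): q_i = sum_j w_ij * m_j.
import torch
import torch.nn as nn

class AttributeQueries(nn.Module):
    def __init__(self, n=100, m=256, d=256):
        super().__init__()
        self.attributes = nn.Parameter(torch.randn(m, d))  # attribute features m_j
        self.weights = nn.Parameter(torch.randn(n, m))     # weights w_ij

    def forward(self):
        # Returns the (n, d) initial query features, one per object slot.
        return self.weights @ self.attributes
```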
It should be appreciated that fig. 5 illustrates a forward process of detecting an object in an image based on a neural network model (i.e., a process of using the neural network model). In the case of training the neural network model, the tags of the object detection (which may include the position tag of the object and the class tag of the object) may be additionally acquired to perform supervised training on the neural network model.
For example, parameters of the global detection network may be optimized based on the location tag of the object and the location of the object predicted by the global detection network (i.e., the RoI), parameters of the local detection network may be optimized based on the location tag of the object and the location of the object predicted by the local detection network, and parameters of the local detection network may be optimized based on the category tag of the object and the category of the object predicted by the local detection network.
According to embodiments of the present disclosure, the loss function for training the neural network model may be composed of two parts: a loss of object classification and a loss of object location.
The loss of object classification is the cross-entropy loss between the predicted class of the object and the class label of the object (i.e., the real object class). Assuming there are N predicted objects and M real objects, the loss of object classification can be defined as shown in formula (2):

$$\mathcal{L}_{cls} = -\sum_{i=1}^{N} \sum_{j=1}^{M} \mathbb{1}_{ij} \log p_{ij} \qquad (2)$$

where 1_{ij} is the match indicator between real object j and predicted object i, with 1_{ij} = 1 if the two are matched and 0 otherwise, and p_{ij} is the probability that predicted object i belongs to class j.
The loss of object position is the difference between the predicted bounding box (i.e., the box where the object is predicted to be) and the real bounding box (i.e., the position label of the object). Optionally, the loss of object position may be a Generalized Intersection over Union (GIoU) loss. The GIoU loss considers the intersection and union between the predicted bounding box and the real bounding box, as well as their minimum enclosing rectangle. The GIoU loss is calculated as shown in formula (3):

$$\mathcal{L}_{pos} = 1 - \mathrm{GIoU} = 1 - \left( \mathrm{IoU} - \frac{|C| - |U|}{|C|} \right) \qquad (3)$$

where IoU denotes the ratio of the intersection to the union between the predicted bounding box and the real bounding box, C denotes the area of the minimum enclosing rectangle, and U denotes the area of the union of the predicted and real bounding boxes.
Finally, as shown in formula (4), the loss function L is a weighted sum of the loss of object classification and the loss of object position:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda \, \mathcal{L}_{pos} \qquad (4)$$

where λ is a weight coefficient used to balance the contributions of the loss of object classification and the loss of object position.
By minimizing the loss function L during neural network training, the neural network model can effectively detect the class of the object and predict the bounding box of the object.
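Under the assumption that predictions have already been matched one-to-one to ground truth (e.g., by the matching sketch given earlier), the total loss of formula (4) can be sketched in PyTorch with torchvision as follows:

```python
# A hedged sketch of formulas (2)-(4) for matched prediction/ground-truth
# pairs; boxes are assumed to be in xyxy format.
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def detection_loss(pred_logits, pred_boxes, gt_labels, gt_boxes, lam=2.0):
    """pred_logits: (K, num_classes); pred_boxes, gt_boxes: (K, 4);
    gt_labels: (K,). K is the number of matched pairs; lam is an assumed weight."""
    loss_cls = F.cross_entropy(pred_logits, gt_labels)       # formula (2)
    giou = generalized_box_iou(pred_boxes, gt_boxes).diag()  # matched pairs only
    loss_pos = (1.0 - giou).mean()                           # formula (3)
    return loss_cls + lam * loss_pos                         # formula (4)
```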
It should be appreciated that the loss of object classification represented by formula (2) and the loss of object position represented by formula (3) are given here by way of example and not limitation. Optionally, the loss of object classification can also be calculated using formula (5):
(5)
as an example, loss of object positionThe calculation can also be performed using equation (6):
(6)
wherein, the middle partBA prediction bounding box is represented and a prediction bounding box is represented,representing a real bounding box.
For clarity, the processing of the global detection network and the local detection network in FIG. 5 is further described below in connection with FIG. 6A and FIG. 6B.
Fig. 6A is a schematic diagram illustrating a global detection network-based process according to an embodiment of the present disclosure.
FIG. 6A is described using an example in which the global detection network includes a cross-attention network, a first self-attention network, and a first point feature sampling network. Optionally, the global detection network may instead include only a cross-attention network and a first self-attention network.
As shown in fig. 6A, the global detection network may include: a cross-attention network, a first self-attention network, and a first point feature sampling network.
Key features k and value features v of the image can be obtained based on the features of the image, where the key features k and the value features v are based on the same position encoding. Specifically, k and v are obtained by applying a linear transformation to the last dimension of the fused image feature X_mff produced by the process shown in FIG. 4, yielding Y, and then splitting Y along its last dimension into two parts; k and v therefore have the same shape.
The cross-attention network may include a multi-head attention network and an Add&Norm (residual connection and layer normalization) layer. Taking the features of the image (k and v) and the first query feature q_g as input to the cross-attention network, a third query feature q_g3 can be obtained, where the third query feature q_g3 is the first query feature optimized based on the attention mechanism.
The cross-attention operation based on the query feature q and the image features k and v can be implemented based on the following formula (7):

$$\mathrm{Attention}(q, k, v) = \mathrm{softmax}\!\left( \frac{q k^{\top}}{\sqrt{d_k}} \right) v \qquad (7)$$

where d_k is the dimension of k. This attention mechanism generally consists of the following steps. Calculating attention weights: the attention weight of each query feature q is calculated from the query feature q and the key features k of the image; this may be achieved by computing the similarity or correlation between q and k. Weighted summation: the value features v of the image are multiplied by the corresponding attention weights and summed, yielding a weighted input representation; this focuses attention on the inputs relevant to the query feature q. Updating the model state: based on the weighted input representation, the state of the model is updated for further processing. The above process may be iterative, repeatedly computing attention weights and updating the model state.
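For concreteness, a single-head version of formula (7), without learned projections (an assumption made for brevity), can be sketched as:

```python
# A minimal sketch of the cross-attention of formula (7), assuming
# PyTorch tensors: q is (n, d) query features, k and v are (hw, d)
# flattened image features.
import torch

def cross_attention(q, k, v):
    d_k = k.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)  # (n, hw)
    return attn @ v  # (n, d) optimized query features
```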
The structure of the first self-attention network is similar to that of the cross-attention network and may also include a multi-head attention network and an Add&Norm (residual connection and layer normalization) layer. Taking the third query feature q_g3 as input to the first self-attention network, a fourth query feature q_g4 can be obtained, where the fourth query feature q_g4 is the third query feature optimized based on the attention mechanism. Self-attention allows the query features to exchange information with one another, so that each query feature can determine the best match and make classification decisions.
The self-attention operation based on the third query feature q_g3 is similar to equation (7). The difference is that the cross-attention network implements the attention operation between the query features and the image features, which are features of different domains, whereas the first self-attention network implements the attention operation between each query feature and the other query features, which are features of the same domain.
It should be appreciated that the cross-attention network and self-attention network of the present disclosure may be standard attention networks (e.g., based on the Multi-Head Attention (MHA) mechanism), or variants of attention networks (e.g., based on the Multi-Query Attention (MQA) mechanism, the Grouped-Query Attention (GQA) mechanism, etc.).
The integration of cross-attention and self-attention mechanisms provides the query features with the information necessary for localization and classification, enabling them to focus more effectively on relevant regions of the input. To further enhance the ability of the query features to extract and represent information, the present disclosure also designs a point feature sampling network. This allows each query feature to extract features from the entire image and fuse them with its own features, thereby improving the accuracy of localization and of foreground/background discrimination.
The first point feature sampling network may include a point feature sampling layer, a Feed Forward Network (FFN), and an Add&Norm (residual connection and layer normalization) layer. The first point feature sampling network may obtain the second query feature q_l based on the fourth query feature and a feature f_1 of the image (the feature f_1 of the image may be, for example, one of the features shown in fig. 4, or the fused feature X_mff of size h_s × w_s × c_kmff, without limitation herein), wherein the second query feature q_l is the optimized fourth query feature q_g4. Furthermore, the second query feature q_l is processed by two different Feed Forward Networks (FFNs), respectively, to obtain the location of the region of interest RoI in the image (which may be represented, for example, in the form of a location box) and the score corresponding to the region of interest (i.e., the accuracy of the region of interest).
The processing procedure of the first point feature sampling network is as follows: for each point in the image, the sampling feature of the point is obtained based on the point's sampling feature on each feature layer and the weight corresponding to that sampling feature; the sampling feature of the image is then obtained based on the sampling features of the individual points in the image, and the sampling feature of the image is used as the second query feature q_l.
In this way, the first point feature sampling network enables points in different directions of the object to correspond to features on feature maps of different sizes. Therefore, the advantages of different feature layers can be fully utilized, so that the finally obtained second query features better reflect the characteristics of the objects in the image.
Assume that the bounding box of the detected object has width w and height h. If w and h differ significantly in size (for example, when w is much larger than h), then, in order to better reflect the characteristics of the object, features should be selected on different feature maps in the w and h directions. Thus, the present disclosure proposes to sample features in different directions using a point feature sampling network. Specifically, the present disclosure expresses the z-axis coordinate of each feature map in terms of the downsampling stride s_j of the j-th feature map. A linear transformation of the query feature q yields, for each of the u points to be sampled, the plane coordinates of the point together with the point's z-axis coordinates relative to the w and h directions; the j-th feature map likewise has z-axis coordinates relative to the w and h directions, and u is the number of sampled points. The point feature sampling process is shown in the following formula (8):
(8)
wherein the weight assigned to the j-th feature map can be calculated by the following equation (9):
(9)
In equation (8), the term on the left is the weighted sampling feature, the term inside the summation is the feature value at the sampled position in the j-th feature map, and n is the number of feature maps used (e.g., for the example shown in fig. 4, if the four layers from feature map F2 to feature map F5 are used, n = 4). By using this point feature sampling method, the obtained sampling feature is better suited to the query feature q.
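The sketch below illustrates this weighted multi-level point sampling under stated assumptions: the z coordinate of each feature map is taken as the base-2 logarithm of its stride, nearest-neighbour lookup stands in for the sampling operator, and the weights of equation (9) are approximated by a softmax over z-axis distances. The disclosure's exact operators may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_point(feature_maps, strides, x, y, z_w, z_h):
    """Weighted sample of one point across n feature maps (nearest-neighbour)."""
    zs = np.log2(np.array(strides, dtype=float))  # illustrative z per map
    # Illustrative weighting: softmax over negative z-distance (w and h dirs).
    w_j = softmax(-(np.abs(zs - z_w) + np.abs(zs - z_h)))
    feat = 0.0
    for j, (fm, s) in enumerate(zip(feature_maps, strides)):
        i = min(int(y / s), fm.shape[0] - 1)  # map image coords to this level
        k = min(int(x / s), fm.shape[1] - 1)
        feat = feat + w_j[j] * fm[i, k]       # weighted feature value
    return feat

# Four feature maps F2..F5 with strides 4, 8, 16, 32 on a 256x256 image (n = 4).
strides = [4, 8, 16, 32]
maps = [np.random.randn(256 // s, 256 // s, 64) for s in strides]
f = sample_point(maps, strides, x=100.0, y=80.0, z_w=3.0, z_h=4.5)
```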
Fig. 6B is a schematic diagram illustrating a local detection network-based process according to an embodiment of the present disclosure.
The local detection network may include: a regional feature fusion network, a second self-attention network, and a second point feature sampling network.
The regional feature fusion network may comprise a dynamic convolution layer, an Add&Norm (residual connection and layer normalization) layer, a Norm (normalization) layer, and a linear layer. Based on the feature f_l of the region in which the object is located and the second query feature q_l, the regional feature fusion network may fuse the feature f_l of the region in which the object is located and the second query feature q_l to obtain a fifth query feature q_l5.
A dynamic instance interaction representation of q_l and f_l (i.e., the first transformation feature described below) can be obtained using the dynamic convolution layer. The calculation process can be expressed by the following formula (10):
(10)
wherein the feature f_l of the region in which the object is located is upsampled to obtain a fourth image feature; the fourth image feature is downsampled to obtain a fifth image feature, and the fifth image feature is added to the second query feature q_l to obtain the first transformation feature, wherein the size of the fifth image feature is the same as the size of the second query feature.
As shown in formula (11), applying two linear transformations to f_l yields r_l:
(11)
wherein W_1 and W_2 are the two linear transformations, LN represents the normalization layer, and ReLU represents the rectified linear unit.
The fused query feature for the region of interest (i.e., the fifth query feature q_l5) can be calculated by the following formula (12):
(12)
wherein the three features combined in formula (12) have the same size.
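A schematic sketch of the fusion of formulas (10)–(12) is given below. The query-conditioned mixing that stands in for the dynamic convolution, the pooling over RoI positions, and the final additive fusion are all illustrative assumptions rather than the disclosure's exact operators:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

c, p = 256, 49              # channels; 7x7 RoI positions
q_l = np.random.randn(c)    # second query feature
f_l = np.random.randn(p, c) # feature of the region where the object is located

# Formula (10), schematically: up-project f_l, project back down, pool over
# RoI positions, and add to q_l to form the first transformation feature.
# (In the disclosure the projection weights come from a dynamic convolution
# conditioned on q_l; fixed random weights stand in for them here.)
W_up = np.random.randn(c, 4 * c) / np.sqrt(c)
W_dn = np.random.randn(4 * c, c) / np.sqrt(4 * c)
fourth = np.maximum(f_l @ W_up, 0.0)   # fourth image feature (upsampled)
fifth = (fourth @ W_dn).mean(axis=0)   # fifth image feature, same size as q_l
t1 = fifth + q_l                       # first transformation feature

# Formula (11): r_l from two linear transformations with LN and ReLU.
W1 = np.random.randn(c, c) / np.sqrt(c)
W2 = np.random.randn(c, c) / np.sqrt(c)
r_l = layer_norm(np.maximum(layer_norm(f_l.mean(axis=0) @ W1), 0.0) @ W2)

# Formula (12), schematically: fuse the three same-sized features.
q_l5 = layer_norm(q_l + t1 + r_l)      # fifth query feature
```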
The second self-attention network is similar in structure to the cross-attention network of fig. 6A, and may also include a multi-head attention network and an Add&Norm (residual connection and layer normalization) layer. The processing of the second self-attention network is similar to the processing of the first self-attention network. Specifically, the fifth query feature q_l5 is taken as input to the second self-attention network to obtain a sixth query feature q_l6, wherein the sixth query feature q_l6 is the fifth query feature optimized based on the attention mechanism (the process is similar to that of obtaining the fourth query feature q_g4 in fig. 6A and is not described in detail here). Self-attention allows the query features to exchange information with each other, so that each query feature can determine the best match and make classification decisions.
Similar to the first point feature sampling network in fig. 6A, the second point feature sampling network may include a point feature sampling layer, a Feed Forward Network (FFN), and an Add&Norm (residual connection and layer normalization) layer. The processing of the second point feature sampling network is similar to that of the first point feature sampling network. Specifically, the second point feature sampling network may obtain an optimized seventh query feature q_l7 based on the sixth query feature q_l6 and the image feature f_2 of the region of interest RoI (the process is similar to that of obtaining the second query feature q_l in fig. 6A and is not described in detail here). Furthermore, the seventh query feature q_l7 is processed by two different Feed Forward Networks (FFNs), respectively, to obtain the predicted category of the object in the image and the predicted position of the object (which may be represented, for example, by a box at the position of the object).
Similar to the processing of the global detection network, during the processing of the local detection network the query features exchange information through a self-attention mechanism to improve classification performance. The information contained in the query features is then enhanced by the second point feature sampling network. Finally, the predicted category of the object and the predicted position of the object are obtained respectively.
Fig. 7 is a schematic flow chart diagram illustrating a method 700 of training a neural network model according to an embodiment of the present disclosure.
In step S710, features of an image are acquired using an image feature extraction network.
It should be understood that the images herein are image samples for training. The image samples may originate from an existing image database, such as the MS COCO database commonly used in the field of image object detection.
It should be noted that the image feature extraction network may be a neural network model that is optimized through pre-training. In the process of training the neural network model, the network parameters of the image feature extraction network can be further optimized.
In step S720, a first query feature is acquired, the query feature being used to determine a region of interest in the image together with features of the image.
The process of determining the region of interest in the image based on the query features together with the features of the image may be implemented based on an attention mechanism. For example, the attention weight of each input may be calculated from the feature of the input (Query feature) and the current state of the model (Key feature). The input representations (Value) are then multiplied by the corresponding attention weights and summed to obtain a weighted input representation, which is used to focus the neural network's attention on the region of interest in the image.
In step S730, based on the features of the image and the first query feature, a global detection network is used to obtain a region where the object is located and a second query feature, where the second query feature is the optimized first query feature.
Step S730 is used to initially locate the region where the object is located and optimize the query feature.
In step S740, based on the area where the object is located and the second query feature, a detection result of the object is obtained by using a local detection network.
Step S740 is used for performing finer analysis based on step S730, so as to obtain a more accurate detection result.
It should be noted that, the processing procedures of step S710 to step S740 are similar to the processing procedures of step S310 to step S340 in the method 300 for detecting an object in an image, and the descriptions of step S310 to step S340 are also applicable to step S710 to step S740, which are not repeated here.
In step S750, a tag of object detection is acquired.
According to an embodiment of the present disclosure, the detection result of the object may include: the predicted category of the object and the predicted location of the object (e.g., may be embodied in a box at the location of the object). Correspondingly, the tag of the object detection may include: the category label of the object and the location label of the object. The label of the object detection can be manually marked and determined, can be from the existing marked image database, and can also be determined by detecting the object in the image based on other trained neural networks with higher precision.
In the training process, the parameters of the global detection network can be optimized by using the labels detected by the objects and the positions of the objects predicted by the global detection network, and the parameters of the local detection network can be optimized by using the labels detected by the objects and the positions of the objects predicted by the local detection network.
In step S760, the neural network model is trained to update parameters of the neural network model based on the detection result of the object and the label detected by the object, wherein the neural network model includes the image feature extraction network, the global detection network, and the local detection network.
According to an embodiment of the present disclosure, a plurality of attribute features may be determined from a plurality of attributes of the object, wherein each attribute feature indicates at least one of a color, a shape, a size, a direction of the object; and then weighting the attribute features based on the weights corresponding to the attribute features to obtain the first query feature. In this case, each attribute feature and its corresponding weight may also be updated during each training process.
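As a hedged sketch of this initialization (the attribute count, feature dimension, and softmax weighting are illustrative assumptions), the first query feature can be formed as a weighted sum of learnable attribute features:

```python
import numpy as np

n_attr, d = 4, 256
# One embedding per attribute (e.g., color, shape, size, direction).
attr_feats = np.random.randn(n_attr, d)  # learnable attribute features
attr_logits = np.zeros(n_attr)           # learnable weights, updated in training

w = np.exp(attr_logits) / np.exp(attr_logits).sum()  # normalized weights
q_first = w @ attr_feats                 # first query feature as weighted sum
```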
According to embodiments of the present disclosure, the neural network model may be trained based on a loss function during the training process. Optionally, parameters of the neural network model can be optimized and updated based on the loss function, so that when the neural network model is used for object detection, both accuracy of object classification and accuracy of the position of the object are improved. For example, a classification loss function may be determined based on a predicted class of the object and a class label of the object; determining a position loss function based on a predicted position of the object and a position tag of the object; determining a joint loss function based on the classification loss function and the location loss function; the neural network model is then trained using the joint loss function.
It should be understood that the classification loss function and the location loss function can take a variety of forms, which are not limited herein.
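For example, a simple weighted combination can serve as the joint loss; the weighting below is an illustrative choice, not a form required by the disclosure:

```python
def joint_loss(cls_loss, loc_loss, lambda_cls=1.0, lambda_loc=2.0):
    # Joint loss as a weighted sum of the classification and location losses;
    # lambda_cls and lambda_loc are illustrative hyperparameters.
    return lambda_cls * cls_loss + lambda_loc * loc_loss

total = joint_loss(cls_loss=0.42, loc_loss=0.18)  # example values
```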
To verify the detection effect of the method of detecting objects in images of the present disclosure, experiments were conducted based on the MS COCO dataset; all models compared below were trained on the COCO 2017 training split and tested on the COCO 2017 test set. In the training process, a residual network (ResNet) was used as the image feature extraction network.
To ensure a fair comparison between different detectors, in the experiments a model 1 comprising two decoders (i.e., comprising only 1 global detection network and 1 local detection network) was designed using the methods of the present disclosure to compare with existing single-stage or two-stage object detection models, and a model 2 comprising three decoders (i.e., comprising 1 global detection network and 2 local detection networks) was designed using the method of the present disclosure to compare with existing multi-stage models. In the target detection indices described below, AP represents average precision (Average Precision), AP50 refers to the AP calculated using 0.5 as the intersection-over-union (IoU, Intersection over Union) threshold, AP75 refers to the AP calculated using 0.75 as the IoU threshold, APs refers to the AP for small targets, APm refers to the AP for medium targets, APl refers to the AP for large targets, and FPS represents the number of frames per second (Frames Per Second).
Table 1 shows the results of comparing a model designed using the methods of the present disclosure with existing single-stage or two-stage object detection models. As can be seen from table 1 below, model 1 designed according to the concept of the present disclosure achieves both high accuracy and high detection speed compared with the RetinaNet model and the Faster R-CNN model.
Table 2 shows the results of comparing a model designed using the method of the present disclosure with existing multi-stage object detection models. As can be seen from table 2 below, model 2 designed using the concept of the present disclosure has only three stages, yet shows only a slight degradation in performance compared to the Cascade R-CNN model with a four-stage decoder, the DETR model with a six-stage decoder, the Deformable DETR model, the Sparse R-CNN model, and the AdaMixer model.
Table 3 shows the effect of varying the number of query features on the model designed by the method of the present disclosure. The experiments here were performed based on model 1. As can be seen from table 3 below, increasing the number of query features can improve the object detection performance of the neural network model, but increases training time, and once the number of query features reaches a certain level, further increases do not significantly improve model performance. To balance model training speed and performance, a suitable number of query features may be selected, e.g., 300.
Table 4 shows the effect of varying the number of local detection networks on the model of the method design of the present disclosure. As can be seen from table 4 below, good model performance can be obtained by using only one local detection network, and increasing the number of local detection networks can further improve the performance of the model, but the model performance improvement effect is not significant. In case of limited resources, the number of local detection networks can be reduced.
Table 5 shows the experimental results when the number of decoders of each object detection model is limited to 2. As can be seen from table 5 below, model 1 designed with the ideas of the present disclosure has the best model performance when the number of decoder stages is limited to 2.
Based on model 1, table 6 shows the effect on the experimental results of using a cross-attention network (denoted CA in table 6) and of using attribute features to determine the query features (denoted Mata Init in table 6). The numbers in table 6 indicate the presence of the corresponding module. As can be seen from table 6 below, using a cross-attention network and using attribute features to determine the query features both improve the model's object detection performance.
Based on model 1, table 7 shows the effect on the experimental results of using a point feature sampling network (where the point feature sampling network in the global detection network is the first point feature sampling network and the point feature sampling network in the local detection network is the second point feature sampling network). The numbers in table 7 indicate the presence of the corresponding module. As can be seen from table 7 below, using the point feature sampling network in the global detection network and in the local detection network respectively improves the model's object detection performance.
The experimental results of tables 1-7 above show that the method of the present disclosure can detect objects in images through only two stages, global positioning and local detection, effectively simplifying the model structure while taking both the efficiency and the accuracy of target detection into account. Using a cross-attention network, using attribute features to determine the query features, and using point feature sampling networks all improve the model's object detection performance, showing that the design of the present application is efficient and feasible.
Fig. 8 is a composition diagram illustrating an apparatus 800 for detecting an object in an image according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the apparatus 800 for detecting an object in an image may include: an image feature acquisition module 810, a query feature acquisition module 820, a global detection network model 830, and a local detection network model 840.
The apparatus 800 for detecting an object in an image may integrate a neural network model for detecting an object in an image, which may include: the image feature extraction network, the global detection network model and the local detection network model.
Specifically, the image feature acquisition module 810 may be configured to: and acquiring the characteristics of the image by utilizing the image characteristic extraction network.
The image feature extraction network may be an image feature extraction network of various structures. Optionally, an image feature extraction network may be utilized to obtain features of the image of different sizes on different feature layers; and fusing the features with different sizes on the different feature layers to obtain the features of the image.
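One common way to realize such multi-layer fusion is an FPN-style top-down pathway, sketched below under that assumption; the disclosure does not mandate this particular form:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling along the spatial axes.
    return x.repeat(2, axis=0).repeat(2, axis=1)

# Feature maps of different sizes (e.g., F2..F5), all with c channels
# after projection; sizes here are illustrative.
c = 64
feats = [np.random.randn(64 // 2**i, 64 // 2**i, c) for i in range(4)]

# Top-down fusion: upsample the coarser map and add it to the finer one.
fused = [feats[-1]]
for f in reversed(feats[:-1]):
    fused.append(f + upsample2x(fused[-1]))
fused = fused[::-1]  # fused[0] is the finest fused feature of the image
```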
The query feature acquisition module 820 may be configured to: a first query feature is acquired, wherein the query feature is used to determine a region of interest in the image in conjunction with a feature of the image.
Optionally, the global detection network may include: a cross-attention network and a first self-attention network. To improve the accuracy of the detection, the global detection network may include: a cross-attention network, a first self-attention network, and a first point feature sampling network. The first point feature sampling network is used for enabling points in different directions of the object to correspond to features on feature graphs with different sizes.
The global detection network model 830 may be configured to: and acquiring a region where the object is located and a second query feature by using a global detection network based on the feature of the image and the first query feature, wherein the second query feature is the optimized first query feature.
Optionally, the local detection network may include a regional feature fusion network and a second self-attention network. To improve the accuracy of the detection, the local detection network may include: a regional feature fusion network, a second self-attention network, and a second point feature sampling network. The second point feature sampling network is used for enabling points in different directions of the object to correspond to features on feature graphs with different sizes.
The query feature acquisition module 820 and the global detection network model 830 perform object detection in images based on query features, which may be implemented end-to-end. That is, the image is input to the neural network model to directly obtain the category and the position of the target object in the image, rather than first generating a plurality of candidate position boxes and then selecting a better position box from among them.
The local detection network model 840 may be configured to: and acquiring a detection result of the object by using a local detection network based on the area where the object is located and the second query characteristic.
The detection result of the object may include a prediction category of the object and a prediction position box of the object. The result may be output to the user terminal for reference by the user.
The apparatus 800 of detecting objects in images of the present disclosure may be used for various image detection scenarios. Such as tumor detection based on medical images, vehicle detection in traffic images, face detection, industrial defect detection, etc. For example, a picture to be detected is input into the apparatus 800 for detecting an object in an image, thereby outputting a category of the detected target object and a frame representing a position where the target object is located.
It should be appreciated that the apparatus 800 for detecting objects in an image shown in fig. 8 may implement various methods of detecting objects in an image as described with respect to fig. 3. The neural network model used in the embodiment of fig. 8 may be trained by the method 700 as described in fig. 7.
The device 800 for detecting an object in an image may be located on the server 110 shown in fig. 1 or on the terminal 120 shown in fig. 1.
Fig. 9 is a composition diagram illustrating an apparatus 900 for training a neural network model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the apparatus 900 for training a neural network model may include: an image feature acquisition module 910, a query feature acquisition module 920, a global detection network model 930, a local detection network model 940, a tag acquisition module 950, and a training module 960.
Wherein the image feature acquisition module 910 may be configured to: and acquiring the characteristics of the image by utilizing the image characteristic extraction network.
It should be understood that the images herein are image samples for training. The image samples may originate from an existing image database.
The query feature acquisition module 920 may be configured to: a first query feature is acquired, wherein the query feature is used to determine a region of interest in the image in conjunction with a feature of the image.
The process of determining the region of interest in the image based on the query features together with the features of the image may be implemented based on an attention mechanism.
The global detection network model 930 may be configured to: and acquiring a region where the object is located and a second query feature by using a global detection network based on the feature of the image and the first query feature, wherein the second query feature is the optimized first query feature.
The global detection network model 930 may be used to initially locate the region where the object is located and optimize the query characteristics.
The local detection network model 940 may be configured to: and acquiring a detection result of the object by using a local detection network based on the area where the object is located and the second query characteristic.
The local detection network model 940 may be used to perform a more detailed analysis of the region where the initially located object is located, thereby obtaining a more accurate detection result.
It should be understood that the image feature acquiring module 910, the query feature acquiring module 920, the global detection network model 930, and the local detection network model 940 in fig. 9 correspond to the processing procedures of the image feature acquiring module 810, the query feature acquiring module 820, the global detection network model 830, and the local detection network model 840 in fig. 8, respectively, and are not described herein again.
The tag acquisition module 950 may be configured to: and acquiring a label detected by the object.
The detection result of the object may include: the predicted category of the object and the predicted location of the object.
The training module 960 may be configured to: training the neural network model based on the detection result of the object and the label detected by the object to update parameters of the neural network model, wherein the neural network model comprises the image feature extraction network, the global detection network and the local detection network.
The training module 960 may train the neural network model based on the loss function in the training process, so as to optimize and update parameters of the neural network model, thereby improving accuracy of the neural network model when the neural network model is used for object detection.
It should be appreciated that the apparatus 900 for training a neural network model shown in fig. 9 may implement various methods of training a neural network model as described with respect to fig. 7. The neural network in fig. 9 is trained to obtain the neural network model for detecting the object in the image in fig. 8.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
For example, a method or apparatus according to embodiments of the present disclosure may also be implemented by means of the architecture of computing device 3000 shown in fig. 10. As shown in fig. 10, computing device 3000 may include a bus 3010, one or more CPUs 3020, a Read Only Memory (ROM) 3030, a Random Access Memory (RAM) 3040, a communication port 3050 connected to a network, an input/output component 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as a ROM 3030 or hard disk 3070, may store various data or files for processing and/or communication of the methods provided by the present disclosure and program instructions for execution by the CPU. The computing device 3000 may also include a user interface 3080. Of course, the architecture shown in FIG. 10 is merely exemplary, and one or more components of the computing device shown in FIG. 10 may be omitted as may be practical in implementing different devices.
According to yet another aspect of the present disclosure, a computer-readable storage medium is also provided. The computer storage medium has computer readable instructions stored thereon. When the computer readable instructions are executed by the processor, the method according to the embodiments of the present disclosure described with reference to the above figures may be performed. The computer readable storage medium in embodiments of the present disclosure may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), Synchronous Link Dynamic Random Access Memory (SLDRAM), and direct memory bus random access memory (DR RAM). It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer-readable storage medium, the processor executing the computer instructions, causing the computer device to perform a method according to an embodiment of the present disclosure.
In summary, embodiments of the present disclosure provide a method for detecting an object in an image, including: acquiring features of the image by using an image feature extraction network; acquiring a first query feature; acquiring a region of interest in the image and a second query feature by using a global positioning network based on the features of the image and the first query feature, wherein the second query feature is the optimized first query feature; and acquiring a detection result of the object by using a local detection network based on the second query feature and the region of interest in the image.
Compared with object detection methods based on deep learning, the method disclosed herein does not require manually designed modules and relies entirely on machine learning to optimize the model parameters, which effectively saves manpower while allowing the target object to be detected more accurately, free from the limitations of manual design.
In addition, existing methods for detecting target objects based on query features ensure detection accuracy through the processing of multi-stage detectors, whereas the method of the present disclosure can detect objects in images through only two stages, global positioning and local detection, effectively simplifying the model structure while taking both the efficiency and the accuracy of target detection into account.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The present disclosure uses specific words to describe embodiments of the disclosure. Such as "first/second embodiment," "an embodiment," and/or "some embodiments," means a particular feature, structure, or characteristic associated with at least one embodiment of the present disclosure. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present disclosure may be combined as suitable.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the following claims. It is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the claims and their equivalents.

Claims (19)

1. A method of detecting an object in an image, comprising:
acquiring features of the image by using an image feature extraction network;
acquiring a first query feature, wherein a plurality of attribute features are weighted based on weights corresponding to the attribute features to obtain the first query feature, wherein the attribute features indicate attributes of objects to be detected in the image, and the first query feature is used for determining a concerned region in the image together with the features of the image;
based on the features of the image and the first query feature, acquiring an area where the object is located and a second query feature by using a global detection network, wherein the second query feature is the optimized first query feature; and
and acquiring a detection result of the object by using a local detection network based on the area where the object is located and the second query characteristic.
2. The method of claim 1, wherein acquiring features of the image using the image feature extraction network comprises:
acquiring features of different sizes of the image on different feature layers by using an image feature extraction network; and
and fusing the features with different sizes on the different feature layers to obtain the features of the image.
3. The method of claim 2, wherein each of the different sized features comprises a same number of sub-features,
fusing the features of different sizes on the different feature layers includes:
and fusing the corresponding sub-features in the different feature layers.
4. The method of claim 3, wherein fusing the features of different sizes to obtain features of the image further comprises:
fusing the features with different sizes to obtain first image features; and
upsampling the first image feature to obtain a second image feature;
and downsampling the second image feature to obtain a third image feature, and taking the third image feature as the feature of the image.
5. The method of claim 1, wherein,
each of the attribute features indicates at least one of a color, a shape, a size, a direction of the object to be detected.
6. The method of claim 1, wherein the global detection network comprises: a cross-attention network and a first self-attention network,
based on the features of the image and the first query feature, acquiring the region where the object is located and the second query feature using a global detection network includes:
Acquiring a third query feature by using the cross-attention network based on the feature of the image and the first query feature, wherein the third query feature is the optimized first query feature;
acquiring a fourth query feature by using the first self-attention network based on the third query feature, wherein the fourth query feature is the optimized third query feature;
taking the fourth query feature as the second query feature; and
and acquiring the area where the object is located based on the second query feature.
7. The method of claim 2, wherein the global detection network comprises: a cross-attention network, a first self-attention network and a first point feature sampling network,
based on the features of the image and the first query feature, acquiring the region where the object is located and the second query feature using a global detection network includes:
acquiring a third query feature by using the cross-attention network based on the feature of the image and the first query feature, wherein the third query feature is the optimized first query feature;
acquiring a fourth query feature by using the first self-attention network based on the third query feature, wherein the fourth query feature is the optimized third query feature;
Acquiring the second query feature by utilizing the first point feature sampling network based on the fourth query feature and the feature of the image, wherein the second query feature is the optimized fourth query feature; and
and acquiring the area where the object is located based on the second query feature.
8. The method of claim 7, wherein obtaining the second query feature using the first point feature sampling network comprises:
for each point in the image, obtaining the sampling characteristic of the point based on the point sampling characteristic of the point on each characteristic layer and the weight corresponding to the point sampling characteristic,
and obtaining the sampling characteristics of the image based on the sampling characteristics of each point in the image, and taking the sampling characteristics of the image as the second query characteristics.
9. The method of claim 1, wherein the number of local detection networks is one or more, wherein each local detection network comprises: the regional feature fusion network and the second self-attention network,
based on the region where the object is located and the second query feature, obtaining the detection result of the object by using the local detection network includes:
Based on the characteristics of the area where the object is located and the second query characteristics, fusing the characteristics of the area where the object is located and the second query characteristics by using an area characteristic fusion network to obtain fifth query characteristics;
acquiring a sixth query feature with the second self-attention network based on the fifth query feature, wherein the sixth query feature is an optimized fifth query feature;
and obtaining a detection result of the object by utilizing the sixth query characteristic and the characteristic of the area where the object is located, wherein the detection result of the object comprises the prediction category of the object and the prediction position of the object.
10. The method of claim 9, wherein fusing the features of the region in which the object is located and the second query feature using a region feature fusion network to obtain a fifth query feature comprises:
transforming the features of the region in which the object is located based on the second query feature to obtain a first transformed feature,
the characteristics of the area where the object is located are subjected to linear transformation to obtain second transformation characteristics,
and fusing the second query feature, the first transformation feature and the second transformation feature to obtain the fifth query feature, wherein the sizes of the first transformation feature and the second transformation feature are the same as the size of the second query feature.
11. The method of claim 10, wherein transforming the features of the region in which the object is located based on the second query feature to obtain a first transformed feature comprises:
up-sampling the characteristics of the area where the object is located to obtain fourth image characteristics;
and downsampling the fourth image feature to obtain a fifth image feature, and adding the fifth image feature and the second query feature to obtain the first transformation feature, wherein the size of the fifth image feature is the same as the size of the second query feature.
12. The method of claim 2, wherein the local detection network comprises: a regional signature fusion network, a second self-attention network and a second point signature sampling network,
based on the region where the object is located and the second query feature, obtaining the detection result of the object by using the local detection network includes:
based on the characteristics of the area where the object is located and the second query characteristics, fusing the characteristics of the area where the object is located and the second query characteristics by using an area characteristic fusion network to obtain fifth query characteristics;
acquiring a sixth query feature with the second self-attention network based on the fifth query feature, wherein the sixth query feature is an optimized fifth query feature;
Acquiring a seventh query feature by using the second point feature sampling network based on the sixth query feature and the feature of the area where the object is located, wherein the seventh query feature is the optimized sixth query feature; and
and obtaining a detection result of the object based on the seventh query feature, wherein the detection result of the object comprises a prediction category of the object and a prediction position of the object.
13. The method of claim 12, wherein obtaining the seventh query feature using the second point feature sampling network comprises:
for each point in the area where the object is located, obtaining the sampling feature of the point based on the point sampling feature of the point on each feature layer and the weight corresponding to the point sampling feature,
and obtaining sampling characteristics of the area where the object is located based on the sampling characteristics of each point in the area where the object is located, and taking the sampling characteristics of the area where the object is located as the seventh query characteristics.
14. A method of training a neural network model, comprising:
acquiring features of the image by using an image feature extraction network;
acquiring a first query feature, wherein a plurality of attribute features are weighted based on weights corresponding to the attribute features to obtain the first query feature, wherein the attribute features indicate attributes of objects to be detected in the image, and the first query feature is used for determining a concerned region in the image together with the features of the image;
Based on the features of the image and the first query feature, acquiring an area where the object is located and a second query feature by using a global detection network, wherein the second query feature is the optimized first query feature;
based on the area where the object is located and the second query feature, acquiring a detection result of the object by using a local detection network;
acquiring a label of object detection; and
training the neural network model based on the detection result of the object and the label detected by the object to update parameters of the neural network model, wherein the neural network model comprises the image feature extraction network, the global detection network and the local detection network.
15. The method of claim 14, wherein,
each of the attribute features indicates at least one of a color, a shape, a size, a direction of the object to be detected.
16. The method of claim 15, wherein training the neural network model to update parameters of the neural network model based on the detection result of the object and the tag of the object detection further comprises:
and updating each attribute characteristic and the corresponding weight thereof.
17. The method of claim 14, wherein,
the detection result of the object comprises: a predicted category of the object and a predicted location of the object, the tag of the object detection comprising: a category label of the object and a location label of the object,
training the neural network model based on the detection result of the object and the label detected by the object includes:
determining a classification loss function based on a predicted class of the object and a class label of the object;
determining a position loss function based on a predicted position of the object and a position tag of the object;
determining a joint loss function based on the classification loss function and the location loss function; and
and training the neural network model by utilizing the joint loss function.
18. An apparatus for detecting an object in an image, comprising:
an image feature acquisition module configured to: acquire features of the image by using an image feature extraction network;
a query feature acquisition module configured to: acquiring a first query feature, wherein a plurality of attribute features are weighted based on weights corresponding to the attribute features to obtain the first query feature, wherein the attribute features indicate attributes of objects to be detected in the image, and the first query feature is used for determining a concerned region in the image together with the features of the image;
A global detection network model configured to: acquiring an area where the object is located and a second query feature based on the feature of the image and the first query feature, wherein the second query feature is the optimized first query feature; and
a local detection network model configured to: and acquiring a detection result of the object based on the area where the object is located and the second query feature.
19. A computer readable storage medium having stored thereon computer executable instructions for implementing the method of any of claims 1-17 when executed by a processor.
CN202311152480.9A 2023-09-08 2023-09-08 Method and device for detecting object in image Active CN116993996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311152480.9A CN116993996B (en) 2023-09-08 2023-09-08 Method and device for detecting object in image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311152480.9A CN116993996B (en) 2023-09-08 2023-09-08 Method and device for detecting object in image

Publications (2)

Publication Number Publication Date
CN116993996A CN116993996A (en) 2023-11-03
CN116993996B true CN116993996B (en) 2024-01-12

Family

ID=88530361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311152480.9A Active CN116993996B (en) 2023-09-08 2023-09-08 Method and device for detecting object in image

Country Status (1)

Country Link
CN (1) CN116993996B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102102161B1 (en) * 2018-05-18 2020-04-20 오드컨셉 주식회사 Method, apparatus and computer program for extracting representative feature of object in image
CN114332509B (en) * 2021-12-29 2023-03-24 阿波罗智能技术(北京)有限公司 Image processing method, model training method, electronic device and automatic driving vehicle

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766582A (en) * 2017-11-27 2018-03-06 深圳市唯特视科技有限公司 A kind of image search method based on target regional area
CN113902926A (en) * 2021-12-06 2022-01-07 之江实验室 General image target detection method and device based on self-attention mechanism
CN114519807A (en) * 2022-01-17 2022-05-20 天津大学 Global self-attention target detection method combining channel space attention
CN114612458A (en) * 2022-03-21 2022-06-10 广东工业大学 Flexo printing first piece detection method based on electronic sample manuscript
CN115422389A (en) * 2022-11-07 2022-12-02 北京百度网讯科技有限公司 Method for processing text image, neural network and training method thereof
CN115953665A (en) * 2023-03-09 2023-04-11 武汉人工智能研究院 Target detection method, device, equipment and storage medium
CN116403071A (en) * 2023-03-23 2023-07-07 河海大学 Method and device for detecting few-sample concrete defects based on feature reconstruction
CN116385928A (en) * 2023-03-27 2023-07-04 南京大学 Space-time action detection method, equipment and medium based on self-adaptive decoder
CN116109907A (en) * 2023-04-17 2023-05-12 深圳须弥云图空间科技有限公司 Target detection method, target detection device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Salient Object Detection Methods for Images and Videos; Ji Yuzhu; China Doctoral Dissertations Full-text Database, Information Science and Technology Series (No. 1); I138-127 *

Also Published As

Publication number Publication date
CN116993996A (en) 2023-11-03

Similar Documents

Publication Publication Date Title
CN111738231B (en) Target object detection method and device, computer equipment and storage medium
CN111782838B (en) Image question-answering method, device, computer equipment and medium
WO2021147325A1 (en) Object detection method and apparatus, and storage medium
US11106903B1 (en) Object detection in image data
WO2019232772A1 (en) Systems and methods for content identification
CN109783666A (en) A kind of image scene map generation method based on iteration fining
JP7332238B2 (en) Methods and Apparatus for Physics-Guided Deep Multimodal Embedding for Task-Specific Data Utilization
CN117079299B (en) Data processing method, device, electronic equipment and storage medium
US10438088B2 (en) Visual-saliency driven scene description
CN115951883B (en) Service component management system of distributed micro-service architecture and method thereof
Arya et al. Object detection using deep learning: a review
Gupta et al. A novel finetuned YOLOv6 transfer learning model for real-time object detection
WO2022063076A1 (en) Adversarial example identification method and apparatus
Hanni et al. Deep learning framework for scene based indoor location recognition
Singh et al. An enhanced YOLOv5 based on color harmony algorithm for object detection in unmanned aerial vehicle captured images
CN116993996B (en) Method and device for detecting object in image
CN113128285A (en) Method and device for processing video
Tan et al. 3D detection transformer: Set prediction of objects using point clouds
Mokalla et al. On designing MWIR and visible band based deepface detection models
CN112861474B (en) Information labeling method, device, equipment and computer readable storage medium
Jokela Person counter using real-time object detection and a small neural network
Ghali et al. CT-Fire: a CNN-Transformer for wildfire classification on ground and aerial images
CN113591758A (en) Human behavior recognition model training method and device and computer equipment
Kamble et al. Object recognition through smartphone using deep learning techniques
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant