CN117011566A - Target detection method, detection model training method, device and electronic equipment

Target detection method, detection model training method, device and electronic equipment

Info

Publication number
CN117011566A
CN117011566A (application CN202210873136.8A)
Authority
CN
China
Prior art keywords
scale
image
target
context information
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210873136.8A
Other languages
Chinese (zh)
Inventor
Xu Dong (徐东)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority claimed from CN202210873136.8A
Publication of CN117011566A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method, a detection model training method, a device and an electronic device. The method comprises the following steps: extracting initial local region features from an image to be detected, and extracting global context information from the image to be detected; acquiring single-scale context information of the image to be detected at multiple scales according to the global context information; determining single-scale context region features at multiple scales according to each piece of single-scale context information and the initial local region features; connecting the single-scale context region features to obtain a multi-scale context region feature, and connecting the multi-scale context region feature with the initial local region features to obtain target region features; and identifying the target region features to obtain the target category information of each target region in the image to be detected. The method can improve the accuracy of target detection, and in particular the accuracy of small-target detection, and can be applied to computer vision, cloud computing, the Internet of Vehicles and other derived technical fields.

Description

Target detection method, detection model training method, device and electronic equipment
Technical Field
The invention relates to the technical field of computer vision, and in particular to a target detection method, a detection model training method and device, and an electronic device.
Background
Object detection mainly locates the objects present in an image or video and gives their specific categories. In recent years, driven by deep convolutional neural network technology, object detection has made great progress and is widely applied in fields such as intelligent driving, complex scene recognition, intelligent search and intelligent authentication. For example, a smart car needs to detect forward obstacles before decision control; an intelligent interaction system needs to detect the person to interact with before recognizing the related gestures and instructions; and in game testing, each virtual object appearing in a complex interface scene needs to be detected in order to obtain information such as its action instructions or real-time state.
However, current methods generally perform detection based on local visual features of the target image or target video, for example by extracting ROI (Regions of Interest) features. Such methods only work well when the local features are clearly distinguishable from the global background features. When the distinction between the local region and the global background is small, for example in a night scene where the colour and brightness of the target object in each frame of the target video are similar to those of the whole image and its boundary is not obvious, it is difficult to detect the target based on the ROI features of the local region alone, and the detection accuracy is low.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a target detection method, a detection model training method, a device and an electronic device with high detection accuracy, so as to improve the detection accuracy of target objects in some complex scenes.
An aspect of an embodiment of the present invention provides a target detection method, including the steps of:
acquiring an image to be detected;
extracting initial local area characteristics from the image to be detected, and extracting global context information from the image to be detected;
acquiring multi-scale context information of the image to be detected according to the global context information; wherein the multi-scale context information comprises single-scale context information at a plurality of different scales;
determining single-scale context region features under a plurality of different scales according to each single-scale context information and the initial local region features;
connecting each single-scale context region feature to obtain a multi-scale context region feature, and connecting the multi-scale context region feature with the initial local region feature to obtain a target region feature;
and identifying the characteristics of the target area to obtain target category information corresponding to each target area in the image to be detected.
In one possible implementation manner, the extracting global context information from the image to be detected includes:
performing visual feature extraction on the image to be detected through a deep residual network according to a preset downsampling rate;
extracting, according to the visual features, the determined channel dimension, spatial height and spatial width to obtain a convolution feature map output by the deep residual network;
and determining the global context information according to the convolution characteristic diagram.
In one possible implementation manner, the acquiring the multi-scale context information of the image to be detected according to the global context information includes:
configuring the corresponding scale size of each single-scale context information;
according to each scale size, pooling and polymerizing the regions corresponding to the current scale size in the global context information to obtain single-scale context information corresponding to the current scale size;
and determining the multi-scale context information according to the single-scale context information corresponding to all the scales.
In one possible implementation manner, pooling and aggregating, according to each scale size, the region corresponding to the current scale size in the global context information to obtain the single-scale context information corresponding to the current scale size includes:
sequentially taking each scale size as the current scale size;
determining a region to be pooled from the global context information according to the current scale size;
performing pooling aggregation on the region to be pooled through maximum pooling or average pooling to obtain the single-scale context information corresponding to the current scale size;
wherein the information at each position in the single-scale context information represents the information of all positions in the region to be pooled.
In one possible implementation manner, the determining the single-scale context region features at a plurality of different scales according to each of the single-scale context information and the initial local region features includes:
determining a convolution feature map corresponding to each piece of single-scale context information;
calculating the influence value of each position in the convolution characteristic map on the initial local area characteristic;
and performing context aggregation calculation according to the influence value and the characterization vector of each position in the convolution feature map to obtain the single-scale context region feature corresponding to the convolution feature map.
In one possible implementation manner, the calculating the impact value of each position in the convolution feature map on the initial local region feature includes:
performing first dimension reduction processing on the characterization vectors of all positions in the convolution feature map, and performing second dimension reduction processing on the characterization vectors of the initial local area features;
constructing a normalization factor according to the space height and the space width of the convolution feature diagram;
and carrying out normalization processing on the results of the first dimension reduction processing and the results of the second dimension reduction processing according to the normalization factors, and determining the influence value of each position in the convolution characteristic diagram on the initial local region characteristic.
In one possible implementation manner, the performing context aggregation calculation according to the influence value and the characterization vector of each position in the convolution feature map to obtain a single-scale context region feature corresponding to the convolution feature map includes:
when a single initial local region is extracted from the image to be detected, multiplying the influence value by the characterization vector of each position in the convolution feature map to obtain the region feature vector of each position, and combining the region feature vectors of all positions to obtain the single-scale context region feature corresponding to the convolution feature map;
when a plurality of initial local regions are extracted from the image to be detected, multiplying the influence value by the characterization vector of each local region in the convolution feature map to obtain the region feature vector of each local region, and combining the region feature vectors of all the local regions to obtain the single-scale context region features corresponding to the convolution feature map.
On the other hand, the embodiment of the invention also discloses a target detection method, which comprises the following steps:
responding to a detection instruction, acquiring an image to be detected, and sending the image to be detected to a target server so that the target server carries out target detection on the image to be detected, and identifying and obtaining target category information corresponding to each target area in the image to be detected;
receiving target category information identified by the target server, and displaying a target detection result;
wherein the object category information is determined according to the object detection method provided by the embodiment of the first aspect.
On the other hand, the embodiment of the invention also discloses a detection model training method, which comprises the following steps:
acquiring an image training set;
extracting initial local area characteristics from each sample image of the image training set, and extracting global context information from each sample image;
acquiring multi-scale context information of the sample image according to the global context information; wherein the multi-scale context information comprises single-scale context information at a plurality of different scales;
determining single-scale context region features under a plurality of different scales according to each single-scale context information and the initial local region features;
connecting each single-scale context region feature to obtain a multi-scale context region feature, and connecting the multi-scale context region feature with the initial local region feature to obtain a target region feature;
identifying the target region characteristics to obtain prediction category information corresponding to each target region in the sample image;
and calculating a loss value of the prediction type information according to the prediction type information and the correct type information of each target area in the sample image, and correcting parameters of a detection model according to the loss value.
On the other hand, the embodiment of the invention also discloses a target detection device, which comprises:
the first module is used for acquiring an image to be detected;
the second module is used for extracting initial local area characteristics from the image to be detected and extracting global context information from the image to be detected;
a third module, configured to obtain multi-scale context information of the image to be detected according to the global context information; wherein the multi-scale context information comprises single-scale context information at a plurality of different scales;
a fourth module, configured to determine single-scale context region features under a plurality of different scales according to each of the single-scale context information and the initial local region features;
a fifth module, configured to connect each single-scale context region feature to obtain a multi-scale context region feature, and connect the multi-scale context region feature with the initial local region feature to obtain a target region feature;
and a sixth module, configured to identify the target area features, and obtain target category information corresponding to each target area in the image to be detected.
On the other hand, the embodiment of the invention also discloses a target detection device, which comprises:
a seventh module, configured to obtain an image to be detected in response to a detection instruction, and send the image to be detected to a target server, so that the target server performs target detection on the image to be detected, and identifies and obtains target category information corresponding to each target area in the image to be detected;
an eighth module, configured to receive the target category information identified by the target server, and display a target detection result;
wherein the target class information is determined according to the target detection method described above.
On the other hand, the embodiment of the invention also discloses a detection model training device, which comprises:
a ninth module, configured to obtain an image training set;
a tenth module, configured to extract initial local area features from each sample image of the image training set, and extract global context information from each sample image;
an eleventh module, configured to obtain multi-scale context information of the sample image according to the global context information; wherein the multi-scale context information comprises single-scale context information at a plurality of different scales;
a twelfth module, configured to determine single-scale context region features under a plurality of different scales according to each of the single-scale context information and the initial local region features;
a thirteenth module, configured to connect each single-scale context region feature to obtain a multi-scale context region feature, and connect the multi-scale context region feature with the initial local region feature to obtain a target region feature;
a fourteenth module, configured to identify the target region features, and obtain prediction category information corresponding to each target region in the sample image;
and a fifteenth module, configured to calculate a loss value of the prediction type information according to the prediction type information and correct type information of each target area in the sample image, and correct parameters of the detection model according to the loss value.
On the other hand, the embodiment of the invention also discloses electronic equipment, which comprises a processor and a memory; the memory is used for storing programs; the processor executes the program to implement the target detection method as described above or the detection model training method as described above.
Furthermore, embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing target detection method or detection model training method.
According to the embodiments of the invention, initial local region features are extracted from the image to be detected, and global context information is extracted from the image to be detected. Multi-scale context information of the image to be detected is then obtained from the global context information, single-scale context region features at a plurality of different scales are determined from each piece of single-scale context information and the initial local region features, the single-scale context region features are connected to obtain a multi-scale context region feature, and the multi-scale context region feature is connected with the initial local region features to obtain the target region features. The target region features are identified to obtain the target category information corresponding to each target region in the image to be detected. Compared with a target detection process that directly matches global context information with local region features, the embodiments of the invention further obtain single-scale context region features at different scales from the global context information, and can therefore use features of different spatial scales for target detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort by a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment for performing object detection according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a target detection method according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a process for extracting global context information according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a process for acquiring multi-scale context information according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a process for acquiring single-scale context information according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a process for determining single-scale contextual regional features at a plurality of different scales, as provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a process for calculating an influence value according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a process for obtaining a single-scale context region feature according to an impact value according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a complete architecture of object detection in an embodiment of the present application;
FIG. 10 is a schematic diagram of a game interface according to an embodiment of the present application;
FIG. 11 is a flowchart illustrating a training method of a detection model according to an embodiment of the present application;
fig. 12 is an implementation environment schematic diagram of a game test scenario in an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It is to be understood that the terms "first," "second," and the like, as used herein, may be used to describe various concepts, but are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another.
The terms "at least one", "a plurality", "each", "any" and the like as used herein, at least one includes one, two or more, a plurality includes two or more, each means each of the corresponding plurality, and any one means any of the plurality.
Before explaining the embodiments of the present application in detail, technical terms that may be involved in the embodiments of the present application are explained as necessary:
artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning. The solutions of the embodiments of the present application mainly relate to computer vision and machine learning/deep learning, as well as directions such as automatic driving and intelligent transportation.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behaviour to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching learning.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as identifying and measuring targets, and further performs image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image semantic segmentation, image recognition, image semantic understanding, image retrieval, optical character recognition (Optical Character Recognition, OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving and intelligent transportation, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
The self-attention mechanism (Self-Attention Mechanism) is used to better learn the dependency relationships among global features; it obtains the global geometric features of a graph structure in one step by directly calculating the relationship between any two nodes in the graph structure. The attention computation proceeds in three stages: (1) a similarity or correlation score between the query (Query) and each key (Key) is calculated; the most common methods include the vector dot product of the two, their vector similarity, or an additional small neural network; (2) the scores from the first stage are numerically transformed by introducing a Softmax activation function, which on the one hand normalizes them into a probability distribution whose element weights sum to 1, and on the other hand highlights the weights of the important elements through the inherent mechanism of Softmax; (3) the results of the second stage are the corresponding weight coefficients, and a weighted summation over the values then yields the attention output.
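For illustration only, the three-stage computation above can be sketched in PyTorch as follows; the tensor shapes and the variable names (query, keys, values) are assumptions made for the example and are not taken from the patent:

import torch
import torch.nn.functional as F

def attention(query, keys, values):
    # Stage 1: similarity of the query with every key via a vector dot product.
    scores = keys @ query                  # shape (N,)
    # Stage 2: Softmax turns the raw scores into a probability distribution
    # whose element weights sum to 1, highlighting the important elements.
    weights = F.softmax(scores, dim=0)     # shape (N,)
    # Stage 3: weighted summation of the values with the weight coefficients.
    return weights @ values                # shape (D,)

query = torch.randn(64)                    # one query vector
keys = torch.randn(100, 64)                # 100 key vectors
values = torch.randn(100, 64)              # 100 value vectors
print(attention(query, keys, values).shape)    # torch.Size([64])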
On this basis, the feature-related concepts that may be involved in the invention are explained:
Scale space and SIFT (Scale-Invariant Feature Transform): objects in nature appear differently at different observation scales. In a scale space, the degree of blur of the image at each scale gradually increases, which simulates the way a target forms on the human retina as the observer moves from near to far away from it; thus, the larger the scale, the more blurred the image. SIFT features are local features of the image: key points can be detected in the image, and SIFT feature extraction consists of two parts, finding the key points in the image and extracting the neighbourhood information of the key points. Only stable key points and the information near them are considered when extracting features, which makes the features more descriptive.
Multi-scale information is information obtained by sampling a signal at different granularities. Different characteristics can generally be observed at different scales, allowing different tasks to be completed. In general, finer-grained or denser sampling reveals more detail, while coarser-grained or sparser sampling reveals the overall trend.
Semantic/spatial context information captures the interaction information between different objects and between objects and the scene, and uses it as a condition for identifying and processing new targets.
A region of interest (ROI, Region of Interest) is a region to be processed that is outlined from the image, in machine vision and image processing, in the form of a square, circle, ellipse, irregular polygon or the like.
Based on the above theoretical foundations and on the research and progress of artificial intelligence technology, artificial intelligence is being developed and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care and smart customer service. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and will play an increasingly important role.
It can be understood that the target detection method provided by the embodiments of the present invention can be applied to any computer device with data processing and computing capabilities, and the computer device can be a terminal or a server. When the computer device in the embodiment is a server, the server is an independent physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), big data and artificial intelligence platforms. Alternatively, the terminal is a smart phone, a tablet computer, a notebook computer, a desktop computer or the like, but is not limited thereto.
It should be further noted that the terminals involved in the embodiments of the present application include, but are not limited to, smart phones, computers, smart voice interaction devices, smart home appliances, vehicle-mounted terminals, aircraft and the like. The embodiments of the present application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent transportation and assisted driving.
In some possible implementations, a computer program capable of implementing the object detection method or an object detection model training method provided by the embodiments of the present application may be deployed to be executed on one computer device, or on a plurality of computer devices located at one site, or on a plurality of computer devices distributed at a plurality of sites and interconnected by a communication network, where a plurality of computer devices distributed at a plurality of sites and interconnected by a communication network can form a blockchain system.
Taking the target detection process in the game scene as an example, a blockchain system can be formed based on a plurality of computer devices, and the computer devices for realizing the target detection method in the embodiment of the application can be nodes in the blockchain system. The node is stored with a machine learning model, and the machine learning model is used for acquiring single-scale context information of different scales of a game interface screenshot of a target object, so that when acquiring single-scale context region characteristics of different scales, the single-scale context information of different scales is used for providing information which is not contained in original local region characteristics, and determining target region characteristics by combining the original local region characteristics, thereby identifying and obtaining target category information corresponding to each target region on the game interface screenshot. The node or nodes corresponding to other devices in the blockchain can also store game interface shots, target type information, other intermediate feature data obtained in the prediction process, and the like.
FIG. 1 is a schematic diagram of an implementation environment according to an embodiment of the invention. Referring to fig. 1, the implementation environment includes at least one terminal 101 and a server 102. The terminal 101 and the server 102 can be connected through a network in a wireless or wired manner to complete data transmission and exchange. The terminal 101 is, but not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch or the like. An application supporting image display is installed and run on the terminal 101. The server 102 is a background server of the target detection application, or an independent physical server, or a server cluster or distributed system composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), big data and artificial intelligence platforms.
Based on the implementation environment shown in fig. 1, the embodiment of the invention further provides a training scenario for the target detection model. The target detection model improves the accuracy of the detection results through a dual-module global context network (DMGCN, Double Module Global Context Network). The dual-module global context network comprises a spatial context representation module (SCRM, Space Context Representation Module) and at least one spatial context dependent module (SCDM, Space Context Dependent Module). The spatial context representation module is mainly used for constructing the multi-scale context representation: it receives the global context obtained in the preceding processing step and performs average pooling on it according to the principle of pyramid scale transformation to obtain single-scale context representations. It should be noted that the scale transformation in the multi-scale context representation process may use the pyramid scales or any other feasible scaling method, which are not enumerated here. The spatial context dependent module mainly performs a dependency calculation based on the single-scale context representations output by the spatial context representation module and the regions of interest (ROI) obtained by preprocessing, performs context aggregation for the ROIs to generate single-scale context ROI features, and fuses the single-scale context ROI features to obtain multi-scale context ROI features; the multi-scale context ROI features are then connected with the preprocessed ROI features, and classification is performed on the result of the dependency calculation. It should be noted that, in the embodiment of the present invention, the dependency calculation may be an affinity calculation or any other weight calculation that reflects the degree of influence between feature information. The global context received by the spatial context representation module may be a convolution feature map obtained by extracting the visual features of the image through, for example, a convolutional neural network (CNN, Convolutional Neural Networks); this convolution feature map characterizes the context information. After the convolution feature map is obtained, candidate features in it may be identified through a region proposal network (RPN, Region Proposal Network), and the target objects are selected with box annotations, thereby obtaining the regions of interest (ROI) that form one of the inputs of the spatial context dependent module.
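To make the interplay of the two modules concrete, the following PyTorch sketch gives one possible reading of the SCRM/SCDM pipeline described above; the pooling scales, projection dimension, dot-product affinity and layer layout are illustrative assumptions rather than the patent's exact design:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SCRM(nn.Module):
    # Spatial context representation module: pyramid-style average pooling of the global context.
    def __init__(self, scales=(2, 4, 8)):
        super().__init__()
        self.scales = scales

    def forward(self, x):                              # x: (1, C, H, W) global context feature map
        return [F.adaptive_avg_pool2d(x, s) for s in self.scales]

class SCDM(nn.Module):
    # Spatial context dependent module: affinity-weighted aggregation of one pooled scale onto the ROIs.
    def __init__(self, c, d=256):
        super().__init__()
        self.q = nn.Linear(c, d)                       # projects the ROI features
        self.k = nn.Linear(c, d)                       # projects the context positions

    def forward(self, roi_feat, ctx):                  # roi_feat: (R, C), ctx: (1, C, s, s)
        pos = ctx.flatten(2).squeeze(0).t()            # (s*s, C), one vector per context position
        affinity = self.q(roi_feat) @ self.k(pos).t()  # (R, s*s) dependency of each ROI on each position
        weights = F.softmax(affinity / pos.shape[0], dim=-1)   # normalised influence values
        return weights @ pos                           # (R, C) single-scale context ROI features

class DMGCNHead(nn.Module):
    def __init__(self, c, num_classes, scales=(2, 4, 8)):
        super().__init__()
        self.scrm = SCRM(scales)
        self.scdm = nn.ModuleList([SCDM(c) for _ in scales])
        self.fuse = nn.Linear(c * len(scales), c)      # fuses the per-scale context ROI features
        self.cls = nn.Sequential(nn.Linear(2 * c, c), nn.ReLU(), nn.Linear(c, num_classes))

    def forward(self, global_ctx, roi_feat):
        ctxs = self.scrm(global_ctx)
        per_scale = [m(roi_feat, cx) for m, cx in zip(self.scdm, ctxs)]
        multi = self.fuse(torch.cat(per_scale, dim=-1))        # multi-scale context ROI feature
        target = torch.cat([multi, roi_feat], dim=-1)          # connected with the original ROI feature
        return self.cls(target)                                # class scores per region

# Example with assumed sizes: one image's global context and 5 ROI feature vectors.
ctx = torch.randn(1, 256, 32, 32)
rois = torch.randn(5, 256)
print(DMGCNHead(c=256, num_classes=10)(ctx, rois).shape)       # torch.Size([5, 10])

In this reading, each SCDM instance handles one pooled scale, and the classifier sees both the fused multi-scale context ROI feature and the original ROI feature.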
It should be noted that, in this scenario, the server stores a large number of historical game-interface screenshots uploaded by game terminals while running the game. A number of these screenshots form an image training set, which is input into the target detection model. The target detection model first extracts, for each sample image in the image training set, global context information representing the association between each pixel and its surrounding pixels, and extracts the region-of-interest features of each sample image. The global context information is then input into the spatial context representation module, which outputs single-scale context information at a plurality of different scales; all the single-scale context information together with the region-of-interest features are input into the spatial context dependent module, which outputs single-scale context region features at the plurality of different scales. These are connected with the region-of-interest features to obtain the target region features, a category is predicted for the target region features, a loss value is calculated from the predicted category, and the parameters of the target detection model are optimized and adjusted according to the loss value to obtain the trained target detection model.
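A condensed sketch of one training step in this scenario, assuming the model exposes a backbone and the head sketched above, and that cross-entropy is used as the classification loss (the patent only states that a loss value is computed from the predicted and correct categories):

import torch.nn.functional as F

def train_step(model, optimizer, sample_image, roi_features, true_labels):
    # One illustrative parameter update; 'backbone' and 'head' are assumed attributes of the model.
    optimizer.zero_grad()
    global_ctx = model.backbone(sample_image)        # global context information of the sample image
    logits = model.head(global_ctx, roi_features)    # predicted category information per target region
    loss = F.cross_entropy(logits, true_labels)      # loss between predicted and correct categories
    loss.backward()
    optimizer.step()                                 # correct the detection model parameters
    return loss.item()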
Based on the implementation environment shown in fig. 1, the embodiment of the invention further provides a target detection scenario. In this scenario, after the authorization of the game-test player is obtained, the server acquires a screenshot of the game interface while the game terminal runs the game, through data interaction with the terminal device on which the game client is installed. The trained target detection model is stored in the server. The game interface screenshot is input into the target detection model, which extracts the global context information and the multi-scale context region features of the screenshot, so that multi-scale feature information is available when classifying the targets, which improves the accuracy of the classification results.
As shown in fig. 2, a flowchart of steps of a target detection method according to an embodiment of the present invention is provided, where an execution subject of the target detection method may be any of the foregoing computer devices. Referring to fig. 2, the method includes the steps of:
s201, acquiring an image to be detected.
The image to be detected can be a game interface image, a vehicle image, a face image, an animal or plant image, or another type of image. Taking a game test scenario as an example, the tester connects game terminals of different models to the detection terminal through connection cables and installs the game application to be tested on these game terminals through the detection terminal. After the game application is installed, the detection terminal controls the game application installed on each game terminal to start. In response to a game detection instruction, the detection terminal controls the game application to jump to the interface to be detected, and then controls the game terminal to take a screenshot of the interface to be detected to obtain the image to be detected. The game interface image captured by the game terminal may be a single frame or multiple frames. After capturing the image frames, the game terminal can transmit them to the detection terminal wirelessly, and the detection terminal uses the acquired frame images as the images to be detected during target detection.
S202, extracting initial local area characteristics from the image to be detected, and extracting global context information from the image to be detected.
The local region can be understood as a local region of interest (RoI, Region of Interest), or as a proposal region of the original image. Local region features are the image features within the local region of interest. For example, when a game interface screenshot contains multiple game objects such as heroes and soldiers, the region where one or more target game objects are located can be marked with a box (bbox) as a local region of interest, and the local region features are the image features of the region where the target game objects are located. In this embodiment, the image to be detected may be input into a convolutional neural network, which outputs the image features of the image to be detected; a local region of the image to be detected is then determined as the local region of interest through a region proposal network (RPN, Region Proposal Network), and the image features corresponding to the local region of interest are taken as the local region features.
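Extracting fixed-size local-region features from proposal boxes can be sketched with torchvision's ROI-Align operator; the box coordinates, feature dimensions and the 1/16 spatial scale are example assumptions, and in practice the boxes would come from the region proposal network:

import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 64, 64)           # CNN feature map of the image to be detected
# One proposal box per row: (batch_index, x1, y1, x2, y2) in image coordinates.
boxes = torch.tensor([[0., 128., 128., 256., 256.]])
# Pool each proposal region into a fixed 7x7 grid of local-region features.
roi_feat = roi_align(feat, boxes, output_size=(7, 7), spatial_scale=1 / 16)
print(roi_feat.shape)                        # torch.Size([1, 256, 7, 7])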
In the field of computer vision, context information may include semantic context information, spatial context information and scale context information. Both semantic context information and spatial context information can be understood as the interaction information between different objects and between objects and the scene. Illustratively, in a game scenario, for a frame of a game interface screenshot, the global context information includes the interaction information between the different game objects in the screenshot and the interaction information between each game object and the background.
Illustratively, in a game scenario, when the game terminal runs the game, it captures a screenshot of the target interface and sends it to the server. A convolutional neural network (CNN, Convolutional Neural Networks) is stored in the server in advance. The convolutional neural network performs convolution operations on the target game interface screenshot to extract its features: the convolution layers slide over the screenshot to produce the corresponding inner-product results, and a pooling operation then takes the maximum value of each local block of the inner-product results, yielding the features of the different local regions of the target game interface screenshot.
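The sliding convolution followed by block-wise maximum pooling described above can be illustrated as follows, with arbitrary example channel counts and kernel sizes:

import torch
import torch.nn as nn

# A 3x3 convolution slides over the screenshot and produces inner-product responses,
# then 2x2 max pooling keeps the maximum value of each local block.
local_feature_extractor = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
screenshot = torch.randn(1, 3, 224, 224)     # a game interface screenshot (RGB)
features = local_feature_extractor(screenshot)
print(features.shape)                        # torch.Size([1, 64, 112, 112])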
S203, acquiring multi-scale context information of the image to be detected according to the global context information.
The scale of an image refers to the coarseness or fineness of the image content. Scale is used to simulate how far the observer is from the target object: the farther away the target object is, the more likely only its rough outline is seen; the closer it is, the more likely its detailed information is seen. Taking road condition detection as an example, in a driving-obstacle detection scenario the on-board camera acquires the image in front of the vehicle in real time and converts it into a frequency-domain image; the coarseness of the image is reflected in the low-frequency and high-frequency components of the image's frequency-domain information. A coarse image concentrates most of its information in the low-frequency band, with only a small amount of high-frequency information, while a fine image has rich components in both the high- and low-frequency bands. The scale space of an image is the collection of versions of the same image at different scales, and multi-scale refers to different spatial sizes. This embodiment extracts the global context information of the image to be detected at different scales, obtaining the context information at each scale. Taking a vehicle image as an example, if the scales are 1.0, 2.0 and 3.0, the single-scale context information corresponding to the vehicle image is extracted at scale 1.0, at scale 2.0 and at scale 3.0, and the single-scale context information at scales 1.0, 2.0 and 3.0 is combined into the multi-scale context information, so that context information can be extracted at different scales.
S204, determining single-scale context region features under a plurality of different scales according to each single-scale context information and the initial local region features, connecting the single-scale context region features to obtain multi-scale context region features, and connecting the multi-scale context region features with the initial local region features to obtain target region features.
The single-scale context region features can be connected using a concat() function, which joins two or more pieces of information. The concatenation performed by concat() does not change the amount of existing information; it only produces a concatenated copy. After obtaining the multiple pieces of single-scale context information and the initial local region features, this embodiment can generate context information of several different scales through the interrelation between the two. Each single-scale context region feature is then connected by concat() into the multi-scale context region feature, so that the multi-scale context region feature provides features beyond the initial local region features, yielding more accurate target region features. In this embodiment, the multi-scale context region feature has the same channel dimension as the initial local region feature.
And S205, identifying the characteristics of the target areas to obtain target category information corresponding to each target area in the image to be detected.
After the target region features are obtained in step S204, they are input to the fully connected layer, which classifies them. It can be understood that all neurons of the fully connected layer are connected by weights; after the preceding convolution layers have captured enough features, the fully connected layer can classify the target objects to be detected. It should be noted that the fully connected layer of this embodiment includes a linear layer with a ReLU activation function for predicting labels.
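One plausible form of such a classification head, with an assumed hidden size and number of categories:

import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(512, 256),      # target region feature -> hidden representation
    nn.ReLU(),                # ReLU activation, as in the linear layer described above
    nn.Linear(256, 20),       # hidden representation -> scores for 20 assumed categories
)
target_region_features = torch.randn(8, 512)   # 8 target regions in the image to be detected
class_scores = classifier(target_region_features)
print(class_scores.shape)                      # torch.Size([8, 20])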
In order to improve the accuracy of global context information extraction, in some possible embodiments, as shown in fig. 3, the step of extracting global context information from the image to be detected may include steps S2021-S2023:
S2021, performing visual feature extraction on the image to be detected through a deep residual network according to a preset downsampling rate;
where downsampling refers to downsampling or downsampling of the image. The downsampling rate refers to the multiple of the downscaled image. Taking a target person image acquired by a camera as an example, if the size of the target person image is m×n, when the preset downsampling rate is 2, downsampling the target task image to obtain an image with a size of (M/2) ×n/2. In order to obtain global context information, the downsampling rate may be set to be a step between the image to be detected and the global context information, and then visual feature extraction is performed on the image to be detected through a depth residual network (res net, deep Residual Network). It will be appreciated that the depth residual network has many branches bypassing it to connect the input directly to the later layers so that the later layers can learn the residual directly. Illustratively, the visual features may include information on color, contour, texture, spatial relationship, etc. on the image to be detected.
S2022, extracting, according to the visual features, the determined channel dimension, spatial height and spatial width to obtain a convolution feature map output by the deep residual network;
the channel is used for detecting the image characteristics, and the intensity of the channel numerical value can reflect the intensity of the current characteristics. The channel dimension refers to the number of input channels of the convolution kernel in the depth residual network, the space height refers to the height of the convolution kernel in the depth residual network, and the space width refers to the width of the convolution kernel in the depth residual network. The number of input channels is the dimension of the matrix corresponding to the input image. The convolution feature map may be used to represent the relationship between individual pixels in the global context.
S2023, determining global context information according to the convolution characteristic diagram.
In an exemplary scenario of recognising target objects while a vehicle is driving, an image of the surroundings of the target vehicle is acquired by the on-board terminal during driving and transmitted to a server in which a deep residual network is stored in advance. The deep residual network performs visual feature extraction on the surroundings image with a stride D between the surroundings image and the global context, where the stride D can also be understood as the downsampling rate of the deep residual network; the channel dimension of the deep residual network is set to C, the spatial height to H and the spatial width to W according to the visual feature extraction process. After the surroundings image is input into the deep residual network, a convolution feature map X ∈ R^(C×H×W) representing the global context is output. Each position in the convolution feature map X represents a D×D patch of pixels of the surroundings image, and the global context information in the surroundings image is then determined from the pixels each position represents, so that local-region feature information other than the local region features themselves is available for target recognition.
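As a sketch of obtaining the convolution feature map X ∈ R^(C×H×W), a torchvision ResNet-50 truncated before its pooling and classification layers can serve as an assumed stand-in for the deep residual network, giving C = 2048 and an overall stride of D = 32:

import torch
import torch.nn as nn
from torchvision.models import resnet50

# torchvision >= 0.13 API; older versions use pretrained=False instead of weights=None.
backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-2])  # drop avgpool and fc
surroundings = torch.randn(1, 3, 512, 512)   # image of the vehicle's surroundings
with torch.no_grad():
    X = backbone(surroundings)               # convolution feature map of the global context
print(X.shape)                               # torch.Size([1, 2048, 16, 16]), i.e. C x H x W with D = 32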
In order to extract context information corresponding to more scales, in some possible embodiments, as shown in fig. 4, in the step of acquiring multi-scale context information of an image to be detected according to global context information, steps S2031-S2033 may be included:
s2031, configuring the corresponding scale size of each single-scale context information;
the scale size corresponding to each single-scale context information can be represented by the scale size of each layer on the pyramid. The pyramid structure is used for the forward convolution process of the network, and for each feature map with resolution, the feature map with resolution scaled by 2 times is introduced to perform element-by-element bottom-up addition operation, so that the bottom feature map with high resolution and low semantic information in the convolution neural network and the high-level feature map with low resolution and high semantic information can be fused, and the feature map after fusion contains more semantic information. The resolution of an image refers to the quantization level of the image in the horizontal and vertical directions, and can also be understood as the level of detail that the image can exhibit. When the size of the object in the image is small or the contrast is low, the image detail information needs to be observed at a higher resolution.
S2032, pooling and aggregating, according to each scale size, the region corresponding to the current scale size in the global context information to obtain the single-scale context information corresponding to the current scale size;
the pooling aggregation may include two processes of pooling processing and aggregation processing, where the pooling processing is a processing process of taking an average value of a certain region in a (intermediate state) feature image of a current scale as a representation of the region, and the aggregation processing is a processing process of aggregating all content information in the region to one target position in the region after the pooling processing is performed to obtain the average value representation of the region, so as to obtain single-scale context information at the current scale. For example, in a vehicle detection scenario, the scale size determined by the embodiment is 2, global context information obtained by the preamble step is first obtained, where the global context information may include all body information and license plate information of the target vehicle. Based on the predetermined scale information, dividing a feature map containing the global context information into a plurality of image areas with the size of 2 x 2, for example, in an image area containing the local feature of a vehicle body, firstly carrying out area average processing, calculating to obtain an average value of pixels in the area, and taking the average value as a pixel characterization value of the area; meanwhile, in order to keep the content information in the area as much as possible, the body characteristic information in the area is aggregated to a specific position in the area, so that the single-scale context information of the target vehicle with the scale of 2 is obtained.
And S2033, determining multi-scale context information according to the single-scale context information corresponding to all the scales.
In an exemplary vehicle detection scenario, when the scales are 2, 4, and 8, the 2×2, 4×4, and 8×8 regions in the global context information are pooled and aggregated to obtain the pieces of single-scale context information corresponding to the 2×2, 4×4, and 8×8 regions, and these pieces of single-scale context information are connected to obtain the multi-scale context information.
In order to improve the accuracy of target detection and identification, in some possible embodiments, as shown in fig. 5, in the step of pooling and aggregating the region corresponding to the current scale size in the global context information according to each scale size to obtain the single-scale context information corresponding to the current scale size, steps S20321-S20322 may be included:
S20321, sequentially acquiring each scale size as the current scale size, and determining a region to be pooled from the global context information according to the current scale size;
The region to be pooled may refer to target regions of different scales obtained by dividing the feature map according to different scale sizes. In an animal and plant detection scene, animals and plants of different types differ in shape and size within the feature map of the global context information, so the regions to be pooled need to be determined by different scale divisions. Suppose the scale sizes include 3, 6, and 9. With scale size 3 as the current scale size, a 3×3 region can be determined in the global context information as a region to be pooled, and a region at this scale can cover small animals and plants such as mosses, ferns, and insects; similarly, with scale size 6 as the current scale size, a 6×6 region can be determined in the global context information as a region to be pooled; with scale size 9 as the current scale size, a 9×9 region can be determined in the global context information as a region to be pooled, and correspondingly a larger region to be pooled can cover shrubs or animals such as cats and dogs.
S20322, performing pooling aggregation on the region to be pooled through maximum pooling processing or average pooling processing to obtain the single-scale context information corresponding to the current scale size.
The information at each position in the single-scale context information represents the information of all positions in the corresponding region to be pooled. Pooling refers to a downsampling process that maintains invariance to translation, scaling, and similar transformations. Maximum pooling (max-pooling) takes the maximum value of a region as the representation of that region; it suppresses the shift of the estimated mean caused by network parameter errors and better extracts texture information from the image to be detected. Average pooling (mean-pooling) takes the average value of a region as the representation of that region; it suppresses the increase in the variance of the estimate caused by the limited region size and better preserves background information.
In this embodiment, pooling of adjacent overlapping regions and spatial pyramid pooling may both be used, so that the windows overlap each time they slide. Pooling of adjacent overlapping regions means using a step size smaller than the window width, and spatial pyramid pooling is a description based on multi-scale information. Illustratively, pooled outputs over 1×1, 2×2, and 4×4 grids are computed simultaneously, and the results are stitched together as the input to the next network layer.
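The spatial pyramid pooling just mentioned can be sketched as below. This is a minimal illustration assuming PyTorch; pooling the feature map to fixed 1×1, 2×2, and 4×4 grids and concatenating the flattened results follows the classic SPP idea and is not a verbatim description of this embodiment's module.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x: torch.Tensor, grids=(1, 2, 4)) -> torch.Tensor:
    """Pool feature map x of shape (N, C, H, W) to fixed grids and concatenate."""
    pooled = []
    for g in grids:
        # Adaptive max pooling yields a g x g summary regardless of H and W.
        p = F.adaptive_max_pool2d(x, output_size=g)
        pooled.append(p.flatten(start_dim=1))        # (N, C*g*g)
    return torch.cat(pooled, dim=1)                   # fixed-length vector

x = torch.randn(1, 256, 40, 40)
v = spatial_pyramid_pool(x)
print(v.shape)  # torch.Size([1, 5376]) = 256 * (1 + 4 + 16)
```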
In a vehicle detection scenario, for example, after the feature map X containing the global context information of the vehicle image to be detected is acquired, the feature map is processed by a pyramid to obtain a multi-scale context representation. In the scale transformation of the pyramid, s denotes the scale of each layer of the pyramid and can take different values such as 1, 2, 3, and 6. One of these values is selected as the scale size s, each s×s block of positions in the feature map X is taken as a region, and the values in the region are averaged and output, yielding the single-scale feature map X_s. It can be understood that, at scale s, the dimension of the single-scale feature map X_s obtained by average pooling is R^(C×(H/s)×(W/s)); aggregating the content of each region to one position gives the single-scale context representation x̄^s ∈ R^(C×(H/s)×(W/s)). In the single-scale feature map X_s, each position represents a (d×s)×(d×s) region of the vehicle image to be detected. The single-scale context representations corresponding to the other scales are processed in the same way, and the multiple single-scale context representations are connected to obtain the multi-scale context representation, so that visual feature information covering several scale variations can be captured when vehicle target detection is performed.
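A minimal sketch of this scale transformation, assuming PyTorch, is given below; the scale set {1, 2, 3, 6} follows the values mentioned above, while the concrete tensor sizes are illustrative only.

```python
import torch
import torch.nn.functional as F

def single_scale_context(X: torch.Tensor, s: int) -> torch.Tensor:
    """Average each s x s region of X (N, C, H, W), giving X_s of shape (N, C, H/s, W/s)."""
    if s == 1:
        return X
    return F.avg_pool2d(X, kernel_size=s, stride=s)

X = torch.randn(1, 2048, 24, 24)                  # feature map holding the global context
scales = (1, 2, 3, 6)
multi_scale = [single_scale_context(X, s) for s in scales]
for s, Xs in zip(scales, multi_scale):
    # Each position of X_s now summarizes a (d*s) x (d*s) patch of the input image.
    print(s, tuple(Xs.shape))                     # e.g. 2 -> (1, 2048, 1, 12, 12)[1:]
```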
In some possible embodiments, as shown in fig. 6, in the step of determining the single-scale context region feature at a plurality of different scales according to each single-scale context information and the initial local region feature, steps S2041-S2043 may be included:
S2041, determining a convolution feature map corresponding to each single-scale context information;
After the image to be detected is input into the convolutional neural network, a convolution feature map can be output. When the image to be detected is transformed according to each single-scale size and each scale-transformed image is input into the convolutional neural network separately, the convolution feature map corresponding to each piece of single-scale context information can be output.
S2042, calculating influence values of all positions in the convolution characteristic diagram on the initial local area characteristics;
The influence value can be used to characterize the magnitude of the interaction between two pieces of information. For example, in a target object detection scene, after the image to be detected is input into the convolutional neural network, the convolution feature map and the initial local region features can be output. It can be understood that the convolution feature map contains this local region. The degree of mutual influence between the local region features and each position on the convolution feature map is then calculated in turn, so that the influence value of each position of the convolution feature map on the local region features is obtained.
S2043, performing context aggregation calculation according to the influence values and the characterization vectors of all the positions in the convolution feature map, and obtaining single-scale context region features corresponding to the convolution feature map.
The characterization vector is used to represent the context representation of the corresponding position on the convolution feature map. Specifically, in an embodiment, the characterization vector may refer to the vector representation of the information content aggregated at each position of the single-scale feature map X_s after pooling. After the influence values of each position on the convolution feature map with respect to the initial local region features are obtained, the relation between the initial local region features and the convolution feature map can be reassigned through the influence values and the characterization vectors, and context aggregation is computed according to the reassigned relation, yielding the single-scale context region features corresponding to the convolution feature map.
In some possible embodiments, as shown in fig. 7, in the step of calculating the impact value of each position in the convolution feature map on the initial local region feature, steps S20421-S20423 may be included:
S20421, performing first dimension-reduction processing on the characterization vector of each position in the convolution feature map, and performing second dimension-reduction processing on the characterization vector of the initial local region feature;
Dimension-reduction processing refers to projecting data in a high-dimensional space into a low-dimensional space without changing the structure of the high-dimensional data. Processing high-dimensional data after reducing it to a low-dimensional space effectively reduces the computation cost. It can be understood that, after the characterization vector of each position on the convolution feature map is obtained, the process of projecting these characterization vectors from the high-dimensional space into the low-dimensional space is taken as the first dimension-reduction processing; similarly, after the characterization vector of the initial local region feature is obtained, the process of projecting it from the high-dimensional space into the low-dimensional space is taken as the second dimension-reduction processing.
S20422, constructing a normalization factor according to the space height and the space width of the convolution characteristic diagram;
the convolution characteristic diagram can be obtained after the image to be detected is processed through the convolution neural network. Spatial height and spatial width are understood to be the height and width of the input or output channels in the convolutional neural network. Normalization refers to converting values within a range into a target range. The purpose of normalization is to control the numerical range of the input data or the output data. Normalization factors can be understood as the adjustment data used in the normalization process.
S20423, normalizing the first dimension reduction processing result and the second dimension reduction processing result according to the normalization factor, and determining the influence value of each position in the convolution characteristic diagram on the initial local area characteristic.
Illustratively, when affinity is adopted as the influence value, after the normalization factor and the results of the dimension-reduction processing are obtained, the influence value can be calculated in this embodiment by the following formula:

ω_ij = f_θ(r_i)^T · f_φ(x_j^s) / C(X_s)

where r_i denotes the characterization vector of the i-th local region; x_j^s denotes the characterization vector of the j-th position in the convolution feature map X_s; ω_ij denotes the influence value of position j on the i-th local region, it being understood that j is the index of the position, from which its location in the convolution feature map X_s can be confirmed; f_θ(·) denotes the query transformation function and f_φ(·) the key transformation function, both of which can be implemented as 1×1 convolutions; θ and φ denote different dimension-reduction layers; and C(X_s) denotes the normalization factor, whose value can be expressed as h×w, with h the spatial height and w the spatial width. Thus ω_ij, the influence value representing the interaction between the two pieces of information, is computed jointly from the characterization vector of the i-th local region and that of the j-th position in the convolution feature map.
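A minimal sketch of this affinity computation, assuming PyTorch and treating the query/key transforms as channel-wise linear projections (equivalently, 1×1 convolutions), is shown below; the channel sizes and region counts are illustrative only.

```python
import torch
import torch.nn as nn

class InfluenceValue(nn.Module):
    """Compute omega_ij = f_theta(r_i)^T f_phi(x_j^s) / (h*w) for every ROI i and position j."""
    def __init__(self, channels: int = 2048, reduced: int = 256):
        super().__init__()
        self.query = nn.Linear(channels, reduced)                # f_theta, query transform
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)   # f_phi, key transform (1x1 conv)

    def forward(self, roi_feats: torch.Tensor, Xs: torch.Tensor) -> torch.Tensor:
        # roi_feats: (num_rois, C) characterization vectors r_i of the local regions
        # Xs:        (1, C, h, w) single-scale convolution feature map
        _, _, h, w = Xs.shape
        q = self.query(roi_feats)                     # (num_rois, reduced)
        k = self.key(Xs).flatten(2).squeeze(0)        # (reduced, h*w)
        return q @ k / (h * w)                        # (num_rois, h*w), normalized by C(Xs) = h*w

mod = InfluenceValue()
rois = torch.randn(4, 2048)            # 4 regions of interest
Xs = torch.randn(1, 2048, 12, 12)
omega = mod(rois, Xs)                  # (4, 144) influence values
```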
In some possible embodiments, as shown in fig. 8, in the step of performing context aggregation calculation according to the influence value and the characterization vector of each position in the convolution feature map to obtain the single-scale context region feature corresponding to the convolution feature map, steps S20431-S20432 may be included:
S20431, when the initial local region extracted from the image to be detected is one, multiplying the influence value by the characterization vector of each position in the convolution feature map to obtain the region feature vector of each position; combining the region feature vectors of each position to obtain the single-scale context region feature corresponding to the convolution feature map;
The initial local region is a region of interest preliminarily extracted based on the convolution feature map. After the image to be detected is input into the convolutional neural network, a region of interest on the image to be detected can be determined through the region candidate network. The vector representation of the information content aggregated at each position within the single region of interest is computed, the influence value is used as the weight of the characterization vector, the influence value is multiplied by the characterization vector at each position of the local region of the single-scale feature map X_s, and the accumulated result is taken as the single-scale context ROI feature of the i-th ROI. In this embodiment, when one local region is extracted, a single-scale context region feature is calculated from that local region on the convolution feature map of one scale.
S20432, when a plurality of initial local areas are extracted from the image to be detected, multiplying the influence value by the characterization vector of each local area in the convolution feature map to obtain the area feature vector of each local area; and combining the regional feature vectors of each local region to obtain the single-scale context regional features corresponding to the convolution feature map.
In this embodiment, when a plurality of regions of interest are extracted, each region of interest is processed on the convolution feature map of one scale to obtain a corresponding single-scale context region feature. As before, the vector representation of the information content aggregated at each position within a single region of interest is computed, the influence value, acting as the weight of the characterization vector, is multiplied by the characterization vector at each position of the local region of the single-scale feature map X_s, the products over all positions are accumulated, and the accumulated result gives the single-scale context ROI feature of the i-th ROI; the single-scale context ROI features of the multiple ROIs are then composed into a vector as the set of single-scale context ROI features.
Illustratively, after the affinity between each region of interest and each position in the convolution feature map X_s has been calculated, the single-scale context region-of-interest feature is reassigned according to the affinities and the context representation. It can be understood that the calculated influence value ω_ij can be multiplied by the characterization vector at each position of the convolution feature map X_s, and the results accumulated to obtain the single-scale context region-of-interest feature of the i-th region of interest. The single-scale context region-of-interest features of the individual regions of interest are composed into vectors, yielding the set of single-scale context region-of-interest features at the different scales.
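Continuing the previous sketch, the aggregation of a single-scale context ROI feature as an influence-weighted sum of the position characterization vectors can be written as follows; reusing omega and Xs from the earlier sketch is purely illustrative.

```python
import torch

def aggregate_context(omega: torch.Tensor, Xs: torch.Tensor) -> torch.Tensor:
    """For each ROI i, sum_j omega_ij * x_j^s gives its single-scale context ROI feature."""
    positions = Xs.flatten(2).squeeze(0).transpose(0, 1)   # (h*w, C) characterization vectors x_j^s
    return omega @ positions                               # (num_rois, C)

omega = torch.rand(4, 144)                   # influence values for 4 ROIs over 12*12 positions
Xs = torch.randn(1, 2048, 12, 12)
roi_context = aggregate_context(omega, Xs)   # (4, 2048), one context feature per ROI
# Context features from several scales can then be concatenated with the
# initial local region features to form the target region features.
```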
In conjunction with fig. 9 of the accompanying drawings, taking a game test scenario as an example, a complete implementation process of the target detection method in the technical scheme of the present application is described as follows:
Step one, obtaining an image to be detected. It can be understood that, when the game terminal runs the game, the game terminal can be controlled to capture a single-frame game interface image as the current image to be detected. For example, the target detection object can connect game terminals of different models to the server. The target detection object installs the game application to be tested on the game terminals of different models through the server. After the game application is installed, the game application installed on the game terminal is controlled to start. The game terminal responds to the game detection instruction, the server controls the game application to jump to the interface to be detected, and the server then controls the game terminal to take a screenshot of the interface to be detected, obtaining the image to be detected.
Step two, extracting initial local region features from the game interface image to be detected, and extracting global context information from the game interface image to be detected. After the image to be detected is obtained, this embodiment can input the image to be detected into the convolutional neural network and output the global context information and the global image features of the image to be detected. A local region on the image to be detected is then determined as a local region of interest through the region candidate network, and the features belonging to the local region of interest within the global image features are taken as the initial local region features. It can be understood that the region candidate network marks, with boxes, the regions where one or more target game objects are located on the image to be detected as local regions of interest. In an embodiment, when the convolutional neural network extracts the global context information, visual feature extraction can be performed on the image to be detected by the depth residual network inside the convolutional neural network in combination with a preset downsampling rate; the channel dimension, spatial height, and spatial width are determined through the visual extraction process, so that a visual feature map is obtained, and the global context information in the image is then determined from the pixels represented at each position of the visual feature map.
And thirdly, acquiring multi-scale context information of the image to be detected according to the global context information. After global context information is acquired, the global context information can be input into a spatial context representation module, and single-scale context information under a plurality of different scales can be output. It can be understood that the spatial context representation module obtains single-scale context information at a plurality of different scales by obtaining context information in a single-frame image of the game interface using different scale sizes. For example, in the processing procedure of the spatial context representation module, the scale size corresponding to each single-scale context information in the pyramid may be configured first, and then the region of the global context information corresponding to the scale size may be pooled according to each scale size. It is understood that the pooling process in embodiments includes a maximum pooling process or an average pooling process. After the pooling aggregation of each scale size is completed, a plurality of single-scale context information can be obtained, and the plurality of single-scale context information are connected to obtain multi-scale context information.
And fourthly, determining single-scale context region features under a plurality of different scales according to the single-scale context information and the initial local region features, connecting the single-scale context region features to obtain multi-scale context region features, and connecting the multi-scale context region features with the initial local region features to obtain target region features. In an embodiment, the single-scale context information and the initial local region features under a plurality of different scales may be input to the spatial context dependent module, so as to obtain the multi-scale context region features through the output of the spatial context dependent module. The spatial context dependence module calculates influence values of all positions on the convolution feature map and the initial local region features respectively, and determines single-scale context region features under a plurality of different scales by combining characterization vectors of all positions in the convolution feature map. And then connecting the single-scale context region features under a plurality of different scales with the initial local region features through a concat () function so as to provide the features outside the local region through the features of the plurality of different scales, thereby obtaining more accurate target region features. Illustratively, the impact value is calculated as follows:
ω_ij = f_θ(r_i)^T · f_φ(x_j^s) / C(X_s)

where r_i denotes the characterization vector of the i-th local region; x_j^s denotes the characterization vector of the j-th position in the convolution feature map X_s; ω_ij denotes the influence value of position j on the i-th local region; f_θ(·) denotes the query transformation function and f_φ(·) the key transformation function; θ and φ denote different dimension-reduction layers; and C(X_s) denotes the normalization factor, whose value can be expressed as h×w, with h the spatial height and w the spatial width.
And fifthly, identifying the characteristics of the target areas to obtain target category information corresponding to each target area in the image to be detected. After the target area characteristics are obtained, the embodiment performs category identification on the target objects in the target area through the full connection layer. It can be understood that, in the embodiment, after obtaining the category of the target object, the server may control the display end to display the category information.
Illustratively, in a game test scenario, after the game terminal transmits a single frame image of the target game interface shown in fig. 10 to the server, the server can detect the target type by running the target detection method shown in fig. 2. As can be seen from fig. 10, there are several target objects on the single frame image 1010 of the target game interface, where, when there are partially overlapped target objects, the embodiment can also accurately identify the overlapped target objects 1020 according to the multi-scale context area features determined by the global context information, and mark the target objects on the single frame image 1010 of the target game interface through the box 1030.
As shown in fig. 11, a detection model training method includes steps R1101 to R1106:
R1101, obtaining an image training set;
The image training set may include positive example samples and negative example samples. A positive example sample is a sample corresponding to a category that the model needs to correctly predict or classify; a negative example sample may be any sample that does not belong to a positive example category. It can be understood that historical images corresponding to the scene to be detected can form the image training sample set, and in the training sample set the target on each training sample corresponds to its correct category information.
R1102, extracting initial local area characteristics from each sample image of an image training set, and extracting global context information from each sample image;
wherein a local region may be understood as a local region of interest in the sample image. Local region features refer to image features within the local region of interest. The context information may include semantic context information, spatial context information, and scale context information. Both semantic context information and spatial context information can be understood as interaction information between different objects, interaction information between objects and scenes. The global context information includes interaction information between different objects in the sample image and interaction information between the objects and the background.
R1103, acquiring multi-scale context information of the sample image according to the global context information; wherein the multi-scale context information comprises single-scale context information at a plurality of different scales;
The scale of an image refers to the coarseness or fineness of the image content. The scale is used to simulate the distance between the observer and the target object. Multi-scale means different spatial sizes. In this embodiment, the global context information in the sample image is extracted at different scales to obtain the context information at those scales. Taking the detection of pedestrians on a road as an example, suppose the scales are 1.0, 3.0, and 5.0: the single-scale context information of the road sample image is extracted at scale 1.0, at scale 3.0, and at scale 5.0, and the single-scale context information corresponding to the scales 1.0, 3.0, and 5.0 is combined into the multi-scale context information, so that context information is extracted from different scales.
R1104, determining single-scale context region features under a plurality of different scales according to each single-scale context information and the initial local region features; connecting each single-scale context region feature to obtain a multi-scale context region feature, and connecting the multi-scale context region feature with the initial local region feature to obtain a target region feature;
A connection may join two or more pieces of information using a concat() function. Taking a game test scene as an example, after the multiple pieces of single-scale context information and the initial local region features of a game interface sample image are obtained, this embodiment can generate the context region features at a plurality of different scales corresponding to the game interface sample image through the interrelation between the two kinds of information. Each single-scale context region feature is then connected into the multi-scale context region feature through the concat() function, so that features other than the initial local region features are provided by the multi-scale context region features during model training.
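As a small illustration of the connection step, assuming PyTorch tensors, the concatenation described above might look like the sketch below; the feature sizes and the number of scales are arbitrary placeholders.

```python
import torch

single_scale_feats = [torch.randn(4, 256) for _ in range(3)]   # context region features at 3 scales, 4 ROIs
multi_scale_feat = torch.cat(single_scale_feats, dim=1)        # (4, 768) multi-scale context region feature
initial_local_feat = torch.randn(4, 1024)                      # initial local region features
target_region_feat = torch.cat([multi_scale_feat, initial_local_feat], dim=1)  # (4, 1792) target region feature
```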
R1105, identifying characteristics of the target areas to obtain prediction category information corresponding to each target area in the sample image;
in this embodiment, the target area features corresponding to the game interface sample image may be input to the full connection layer, so that the classification prediction of the target area features is obtained through the full connection layer, thereby obtaining prediction category information corresponding to all target areas in the game interface sample image.
R1106, calculating a loss value of the prediction type information according to the prediction type information and the correct type information of each target area in the sample image, and correcting the parameters of the detection model according to the loss value.
In this embodiment, the loss value of the category information can be calculated through Softmax and cross entropy; the loss function is then differentiated and optimization operators such as gradient descent are used to update the model parameters, so as to correct the parameters of the detection model, and the target detection model is finally obtained by training. Illustratively, the parameters of the target detection model corrected in this embodiment may include: the hyper-parameters of the convolutional neural network, the parameters of the spatial context representation module, the parameters of the spatial context dependence module, the hyper-parameters of the attention mechanism, and so on.
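A minimal training-step sketch following this description, assuming PyTorch, a placeholder detection_model that maps sample images to per-region class logits, and plain SGD as the gradient-descent operator, is given below; all names and shapes are illustrative, not part of this embodiment.

```python
import torch
import torch.nn as nn

# Assumed placeholders: detection_model stands in for the full detector, and
# labels holds the correct category information of the target regions.
detection_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))
optimizer = torch.optim.SGD(detection_model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()         # softmax + cross entropy combined

images = torch.randn(8, 3, 64, 64)        # a mini-batch of sample images
labels = torch.randint(0, 10, (8,))       # correct category information

logits = detection_model(images)          # prediction category information
loss = criterion(logits, labels)          # loss value of the prediction category information
optimizer.zero_grad()
loss.backward()                           # differentiate the loss function
optimizer.step()                          # gradient descent corrects the model parameters
```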
In combination with fig. 11 of the specification, taking a game scene as an example, the whole process of the training method of the object detection model in the embodiment of the invention is described as follows:
In the server background of a game, this embodiment first obtains the historical data of the single-frame game interface image samples corresponding to the current game, takes the game interface single-frame image samples containing the target object as positive examples, and takes the game interface single-frame image samples without the target object as negative examples. Then, during target detection training, the global context information of each game interface single-frame image sample provides the model with information outside the local region features, thereby improving the training accuracy of the model; single-scale context information at a plurality of different scales is acquired from the global context information of the game interface single-frame image samples, so that the accuracy of model training on small-target detection is improved. Finally, the loss function is differentiated and optimization operators such as gradient descent are used to compute the model hyper-parameters, so as to better optimize the model.
The embodiment of the invention also discloses another target detection method, which comprises the following steps of T001-T002:
T001, responding to a detection instruction, acquiring an image to be detected, and sending the image to be detected to a target server, so that the target server performs target detection on the image to be detected and identifies the target category information corresponding to each target region in the image to be detected;
In this embodiment, after receiving the detection instruction from the server, the image acquisition terminal may acquire the image to be detected through a screenshot tool or a photographing tool and send the image to be detected to the server by wireless transmission; after completing detection of the target category of the image to be detected, the server returns the detected target category information to the image acquisition terminal.
T002, receiving the target category information identified by the target server and displaying a target detection result;
in this embodiment, after receiving the target class information from the server, the image acquisition terminal may display the target class information on the object interactive interface of the image acquisition terminal.
Taking a road vehicle type detection scene as an example, the determination process of the target class information is as follows:
According to the embodiment, global context information and initial local area characteristics of a road street view image sent by an image acquisition terminal are extracted, single-scale context information of the road street view image in a plurality of different scales is acquired through the global context information, then the single-scale context area characteristics in the plurality of different scales are determined by combining the initial local area characteristics, and further target type identification on the road street view image is carried out according to the single-scale context area characteristics in the plurality of different scales.
Taking a game test scenario as an example, a complete process of the target detection method in the embodiment of the present invention is described as follows in conjunction with fig. 12:
The target object can place mobile game terminals of different models in the test room and connect them, via connecting wires, to a personal computer (PC) terminal with data processing capability in the test room. The target object installs the game applications to be tested on the mobile game terminals of different models through the PC terminal. After installation is completed, the PC terminal controls the game applications installed on the mobile game terminals to start. The mobile game terminal responds to the detection instruction sent by the PC terminal; meanwhile, the PC terminal controls the game application on the mobile game terminal to jump to the interface to be detected and then controls the mobile game terminal to take a screenshot of a single frame of the interface to be detected, obtaining the game interface screenshot corresponding to each frame as an image to be detected. After obtaining the game interface screenshot, the PC terminal identifies the global context information corresponding to the screenshot and obtains the single-scale context region features at a plurality of different scales according to the global context information, so that features at different scales can be provided when target object identification is performed on the game interface screenshot, improving the accuracy of the target object identification result.
It should be noted that any of the target detection methods provided by the technical scheme of the present application can be applied not only in a game scene but also in other technical fields, such as vehicle detection, face detection, and animal and plant detection. The specific implementations in this specification merely illustrate possible ways of carrying out the method and do not limit its application scenario.
The embodiment of the application also discloses a target detection device, which comprises:
the first module is used for acquiring an image to be detected;
the second module is used for extracting initial local area characteristics from the image to be detected and extracting global context information from the image to be detected;
the third module is used for acquiring multi-scale context information of the image to be detected according to the global context information; wherein the multi-scale context information comprises single-scale context information at a plurality of different scales;
a fourth module, configured to determine single-scale context region features under a plurality of different scales according to each single-scale context information and the initial local region feature;
a fifth module, configured to connect each single-scale context region feature to obtain a multi-scale context region feature, and connect the multi-scale context region feature with the initial local region feature to obtain a target region feature;
And a sixth module, configured to identify the features of the target areas, and obtain target category information corresponding to each target area in the image to be detected.
In a game scene, the target detection device of the embodiment extracts global context information and local area characteristics of the screenshot of the game interface to be detected, so that when target detection is performed, the target detection can be performed by providing other global information outside the local area through the global context information, and the accuracy of a target detection result is improved; in addition, the target detection device of the embodiment obtains single-scale context region features under a plurality of different scales according to global context information, and further can obtain target region features by combining the single-scale context region features under the plurality of different scales with initial local region features, so that when target category identification is performed on a target region, target detection can be performed by utilizing the features of different spatial scales, and the accuracy of small target detection is improved.
The embodiment of the invention also discloses another object detection device, which comprises:
a seventh module, configured to obtain an image to be detected in response to the detection instruction, and send the image to be detected to the target server, so that the target server performs target detection on the image to be detected, and identifies and obtains target category information corresponding to each target area in the image to be detected;
An eighth module, configured to receive the target category information identified by the target server, and display a target detection result; wherein the object class information is determined according to the object detection method shown in fig. 2.
In a target object detection scene, the shooting terminal is taken as a seventh module, and the server is taken as an eighth module. According to the method, the device and the system, the image of the target object to be detected, which is obtained through shooting, can be sent to the server through the shooting terminal, global context information and local area characteristics of the image are extracted through the server, single-scale context area characteristics under different scales are obtained according to the global context information, target area characteristics are obtained through the combination of the single-scale context area characteristics under different scales and initial local area characteristics, when target type identification is conducted on the target area, the characteristics of different spatial scales can be utilized to conduct target type detection, accurate target type detection information is obtained, and the target type detection information is returned to the shooting terminal to be displayed.
The embodiment of the invention also discloses a detection model training device, which comprises:
a ninth module, configured to obtain an image training set;
A tenth module, configured to extract initial local area features from each sample image of the image training set, and extract global context information from each sample image;
an eleventh module for acquiring multi-scale context information of the sample image according to the global context information; wherein the multi-scale context information comprises single-scale context information at a plurality of different scales;
a twelfth module, configured to determine single-scale context region features under a plurality of different scales according to each single-scale context information and the initial local region feature;
a thirteenth module, configured to connect each single-scale context region feature to obtain a multi-scale context region feature, and connect the multi-scale context region feature with the initial local region feature to obtain a target region feature;
a fourteenth module, configured to identify features of the target areas, to obtain prediction category information corresponding to each target area in the sample image;
and a fifteenth module for calculating a loss value of the prediction type information according to the prediction type information and the correct type information of each target area in the sample image, and correcting the parameters of the detection model according to the loss value.
In a road vehicle detection scene, this embodiment acquires the historical data of road street view image samples, takes street view image samples containing a target vehicle as positive examples, and takes street view image samples without a target vehicle as negative examples. Then, during target detection training, the global context information of each street view image sample provides the model with information outside the local region features, thereby improving the training accuracy of the model; single-scale context information at a plurality of different scales in the street view image samples is acquired according to the global context information, so that the accuracy of model training on small-target detection is improved. Finally, the loss function is differentiated and optimization operators such as gradient descent are used to compute the model hyper-parameters, so as to better optimize the model.
It should be noted that, the device for detecting any object provided by the technical scheme of the application not only can be applied to a game scene, but also can be applied to other technical fields, such as the field of vehicle detection, the field of face detection, the field of animal and plant detection and other technical fields. The implementation in this specification is merely illustrative of possible manners of the device, and does not limit the application scenario of the device.
The embodiment of the invention also provides electronic equipment, which comprises a processor and a memory; the memory stores a program; the processor executes the program to perform the target detection method described above; the electronic device has a function of carrying and running a software system for service data processing provided by the embodiment of the present invention, for example, a personal computer (Personal Computer, PC), a mobile phone, a smart phone, a personal digital assistant (Personal Digital Assistant, PDA), a wearable device, a palm computer PPC (Pocket PC), a tablet computer, a vehicle-mounted terminal, and the like.
When a game player logs in to the game client through any of the above electronic devices, this embodiment captures the game interface through the electronic device and performs target detection on the game interface screenshot through the processor. It can be understood that, after obtaining the game interface screenshot, the processor extracts the global context information and the local region features of the screenshot, so that other global information outside the local region is provided through the global context information when target detection is performed, improving the accuracy of the target detection result; single-scale context region features at a plurality of different scales are acquired according to the global context information, so that features of different spatial scales can be used when the target region is identified for its target category, improving the accuracy of small-target detection.
The embodiment of the invention also provides a computer readable storage medium, wherein the storage medium stores a program, and the program is executed by a processor to realize the target detection method. At the same time, embodiments of the present invention also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The computer instructions may be read from the computer-readable storage medium by a processor of a computer device and executed by the processor to cause the computer device to perform the target detection method described previously.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the aforementioned method.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
From the embodiments provided in the above description, it can be clearly understood that the technical solution of the present application has at least the following advantages:
According to the technical scheme of the present application, the global context information of the image to be detected can provide the target detection process with information other than the local region features, thereby improving target detection accuracy; and by acquiring single-scale context region features at a plurality of different scales, target identification is performed with features of different spatial scales, improving the accuracy of small-target detection.
Furthermore, while the application is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the application, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments described above, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and these equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims (15)

1. A method of detecting an object, comprising:
acquiring an image to be detected;
extracting initial local area characteristics from the image to be detected, and extracting global context information from the image to be detected;
acquiring multi-scale context information of the image to be detected according to the global context information; wherein the multi-scale context information comprises single-scale context information at a plurality of different scales;
determining single-scale context region features under a plurality of different scales according to each single-scale context information and the initial local region features;
connecting each single-scale context region feature to obtain a multi-scale context region feature, and connecting the multi-scale context region feature with the initial local region feature to obtain a target region feature;
and identifying the characteristics of the target area to obtain target category information corresponding to each target area in the image to be detected.
2. The method according to claim 1, wherein the extracting global context information from the image to be detected comprises:
according to a preset downsampling rate, visual feature extraction is carried out on the image to be detected through a depth residual error network;
obtaining the convolution feature map output by the depth residual network according to the channel dimension, spatial height, and spatial width determined by the visual feature extraction;
and determining the global context information according to the convolution characteristic diagram.
3. The method according to claim 1 or 2, wherein the acquiring the multi-scale context information of the image to be detected according to the global context information comprises:
configuring the corresponding scale size of each single-scale context information;
according to each scale size, pooling and polymerizing the regions corresponding to the current scale size in the global context information to obtain single-scale context information corresponding to the current scale size;
and determining the multi-scale context information according to the single-scale context information corresponding to all the scales.
4. The method for detecting a target according to claim 3, wherein the pooling aggregation is performed on the region corresponding to the current scale in the global context information according to each scale size to obtain single-scale context information corresponding to the current scale, and the method comprises the steps of:
sequentially acquiring the sizes of all the scales as the current scale;
Determining a region to be pooled from the global context information according to the current scale size;
carrying out pooling aggregation on the area to be pooled through maximum pooling treatment or average pooling treatment to obtain single-scale context information corresponding to the current scale;
the information of each position in the single-scale context information represents information of all positions in the region to be pooled.
5. The method of claim 1, wherein determining single-scale context region features at a plurality of different scales based on each of the single-scale context information and the initial local region features comprises:
determining a convolution feature map corresponding to each piece of single-scale context information;
calculating the influence value of each position in the convolution characteristic map on the initial local area characteristic;
and performing context aggregation calculation according to the influence value and the characterization vector of each position in the convolution feature map to obtain the single-scale context region feature corresponding to the convolution feature map.
6. The method of claim 5, wherein calculating the impact value of each position in the convolution signature on the initial local region feature comprises:
Performing first dimension reduction processing on the characterization vectors of all positions in the convolution feature map, and performing second dimension reduction processing on the characterization vectors of the initial local area features;
constructing a normalization factor according to the space height and the space width of the convolution feature diagram;
and carrying out normalization processing on the results of the first dimension reduction processing and the results of the second dimension reduction processing according to the normalization factors, and determining the influence value of each position in the convolution characteristic diagram on the initial local region characteristic.
7. The method for detecting a target according to claim 5, wherein the performing a context aggregation calculation according to the influence value and the characterization vector of each position in the convolution feature map to obtain a single-scale context region feature corresponding to the convolution feature map includes:
when the initial local area extracted from the image to be detected is one, multiplying the influence value by the characterization vector of each position in the convolution feature map to obtain the area feature vector of each position; combining the regional feature vectors of each position to obtain single-scale context regional features corresponding to the convolution feature map;
When the number of the initial local areas extracted from the image to be detected is multiple, multiplying the influence value by the characterization vector of each local area in the convolution feature map to obtain the area feature vector of each local area; and combining the regional feature vectors of each local region to obtain the single-scale context regional features corresponding to the convolution feature map.
8. A target detection method, comprising:
in response to a detection instruction, acquiring an image to be detected and sending the image to be detected to a target server, so that the target server performs target detection on the image to be detected and identifies the target category information corresponding to each target region in the image to be detected;
receiving the target category information identified by the target server and displaying a target detection result;
wherein the target category information is determined according to the target detection method of any one of claims 1 to 7.
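Claim 8 describes a client-side flow: upload the image, let the server run detection, and display the returned category information. A minimal hypothetical client is sketched below; the endpoint URL and response fields are invented for illustration and are not specified by the patent.

```python
# Hypothetical client-side sketch of claim 8 using an assumed HTTP endpoint.
import requests

def detect_on_server(image_path: str, server_url: str = "http://example.com/detect"):
    # Send the image to be detected to the target server.
    with open(image_path, "rb") as f:
        resp = requests.post(server_url, files={"image": f}, timeout=30)
    resp.raise_for_status()
    detections = resp.json()  # assumed form: [{"box": [...], "category": "...", "score": ...}, ...]
    # Display the target detection result (here: print category per target region).
    for det in detections:
        print(det["category"], det.get("score"))
    return detections
```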
9. A detection model training method, comprising:
acquiring an image training set;
extracting initial local region features from each sample image in the image training set, and extracting global context information from each sample image;
acquiring multi-scale context information of the sample image according to the global context information; wherein the multi-scale context information comprises single-scale context information at a plurality of different scales;
determining single-scale context region features under a plurality of different scales according to each single-scale context information and the initial local region features;
connecting each single-scale context region feature to obtain a multi-scale context region feature, and connecting the multi-scale context region feature with the initial local region feature to obtain a target region feature;
identifying the target region characteristics to obtain prediction category information corresponding to each target region in the sample image;
and calculating a loss value according to the prediction category information and the correct category information of each target region in the sample image, and correcting the parameters of the detection model according to the loss value.
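A single training iteration consistent with claim 9 might look as follows, assuming the detection model outputs per-region category scores and that the loss is cross-entropy against the correct category labels; the helper name and optimizer usage are illustrative, not the patented procedure.

```python
# Hedged sketch of one training step: predict categories, compute a loss value,
# and correct the detection model's parameters according to that loss.
import torch
import torch.nn as nn

def train_step(model, optimizer, images, gt_labels):
    # images: batch of sample images; gt_labels: correct category per target region
    logits = model(images)                                   # prediction category scores per target region
    loss = nn.functional.cross_entropy(logits, gt_labels)    # loss value of the predictions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                         # parameter correction according to the loss value
    return loss.item()
```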
10. A target detection apparatus, comprising:
a first module, configured to acquire an image to be detected;
a second module, configured to extract initial local region features from the image to be detected and to extract global context information from the image to be detected;
A third module, configured to obtain multi-scale context information of the image to be detected according to the global context information; wherein the multi-scale context information comprises single-scale context information at a plurality of different scales;
a fourth module, configured to determine single-scale context region features under a plurality of different scales according to each of the single-scale context information and the initial local region features;
a fifth module, configured to connect each single-scale context region feature to obtain a multi-scale context region feature, and connect the multi-scale context region feature with the initial local region feature to obtain a target region feature;
and a sixth module, configured to identify the target region features and obtain the target category information corresponding to each target region in the image to be detected.
11. A target detection apparatus, comprising:
a seventh module, configured to acquire an image to be detected in response to a detection instruction, and send the image to be detected to a target server, so that the target server performs target detection on the image to be detected and identifies the target category information corresponding to each target region in the image to be detected;
an eighth module, configured to receive the target category information identified by the target server and display a target detection result;
wherein the target category information is determined according to the target detection method of any one of claims 1 to 7.
12. A detection model training device, comprising:
a ninth module, configured to obtain an image training set;
a tenth module, configured to extract initial local region features from each sample image of the image training set, and to extract global context information from each sample image;
an eleventh module, configured to obtain multi-scale context information of the sample image according to the global context information; wherein the multi-scale context information comprises single-scale context information at a plurality of different scales;
a twelfth module, configured to determine single-scale context region features under a plurality of different scales according to each of the single-scale context information and the initial local region features;
a thirteenth module, configured to connect each single-scale context region feature to obtain a multi-scale context region feature, and connect the multi-scale context region feature with the initial local region feature to obtain a target region feature;
A fourteenth module, configured to identify the target region features, and obtain prediction category information corresponding to each target region in the sample image;
and a fifteenth module, configured to calculate a loss value according to the prediction category information and the correct category information of each target region in the sample image, and to correct the parameters of the detection model according to the loss value.
13. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the target detection method according to any one of claims 1 to 8 or the detection model training method according to claim 9.
14. A computer-readable storage medium, characterized in that the storage medium stores a program that, when executed by a processor, implements the target detection method according to any one of claims 1 to 8 or the detection model training method according to claim 9.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the target detection method according to any one of claims 1 to 8 or the detection model training method according to claim 9.
CN202210873136.8A 2022-07-22 2022-07-22 Target detection method, detection model training method, device and electronic equipment Pending CN117011566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210873136.8A CN117011566A (en) 2022-07-22 2022-07-22 Target detection method, detection model training method, device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210873136.8A CN117011566A (en) 2022-07-22 2022-07-22 Target detection method, detection model training method, device and electronic equipment

Publications (1)

Publication Number Publication Date
CN117011566A true CN117011566A (en) 2023-11-07

Family

ID=88560639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210873136.8A Pending CN117011566A (en) 2022-07-22 2022-07-22 Target detection method, detection model training method, device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117011566A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690164A (en) * 2024-01-30 2024-03-12 成都欣纳科技有限公司 Airport bird identification and driving method and system based on edge calculation
CN117690164B (en) * 2024-01-30 2024-04-30 成都欣纳科技有限公司 Airport bird identification and driving method and system based on edge calculation

Similar Documents

Publication Publication Date Title
CN110414432B (en) Training method of object recognition model, object recognition method and corresponding device
CN111126258B (en) Image recognition method and related device
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
Tian et al. A dual neural network for object detection in UAV images
CN109960742B (en) Local information searching method and device
CN111291809B (en) Processing device, method and storage medium
CN111709311A (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN107437100A (en) A kind of picture position Forecasting Methodology based on the association study of cross-module state
JP2016062610A (en) Feature model creation method and feature model creation device
CN110879961B (en) Lane detection method and device using lane model
CN110222718B (en) Image processing method and device
CN112232355B (en) Image segmentation network processing method, image segmentation device and computer equipment
CN113807399A (en) Neural network training method, neural network detection method and neural network detection device
CN112801236B (en) Image recognition model migration method, device, equipment and storage medium
Xing et al. Traffic sign recognition using guided image filtering
CN111723660A (en) Detection method for long ground target detection network
CN115375781A (en) Data processing method and device
Fan et al. A novel sonar target detection and classification algorithm
Xing et al. The Improved Framework for Traffic Sign Recognition Using Guided Image Filtering
CN113449548A (en) Method and apparatus for updating object recognition model
CN114168768A (en) Image retrieval method and related equipment
CN117011566A (en) Target detection method, detection model training method, device and electronic equipment
Yang et al. Ai-generated images as data source: The dawn of synthetic era
CN114596515A (en) Target object detection method and device, electronic equipment and storage medium
CN116189130A (en) Lane line segmentation method and device based on image annotation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination