CN110738211A - Object detection method, related device and equipment

Object detection method, related device and equipment

Info

Publication number
CN110738211A
CN110738211A
Authority
CN
China
Prior art keywords
feature
image
trained
depth
object detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910989269.XA
Other languages
Chinese (zh)
Inventor
黄超
张力柯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910989269.XA priority Critical patent/CN110738211A/en
Publication of CN110738211A publication Critical patent/CN110738211A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an object detection method, which includes: obtaining an image set, where the image set includes at least a first image and a second image, and the first image is the previous frame image of the second image; obtaining a depth feature set based on the image set, where the depth feature set includes a first depth feature and a second depth feature; generating a target time sequence feature corresponding to a region to be detected according to the depth feature set; and obtaining an object detection result through a time sequence detection model based on the target time sequence feature, where the object detection result is a detection result of the region to be detected in the second image.

Description

Object detection method, related device and equipment
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an object detection method, a related apparatus, and a device.
Background
With the improvement of living standards, people can experience various games anytime and anywhere through terminal devices such as computers, mobile phones, and tablet computers. Games provide users with a convenient and fast form of entertainment and help relieve stress. In order to better maintain the normal operation of a game, it is often necessary to test the game automatically.
In the process of automated testing, a target object in the user interface (UI) needs to be detected. Currently, a target detection algorithm based on a deep network can be used to detect the position and type of a target object: a depth feature of the target object is extracted from each frame of the game picture, the position of the target object is predicted according to the depth feature, and a corresponding operation is executed based on that position.
However, because the position of the target object usually changes little within a short time, the depth features of adjacent frames detected in the above manner are very similar. As a result, the detection results differ little from one another, the position change of the target object is difficult to recognize, and the detection accuracy is reduced.
Disclosure of Invention
The embodiments of the present application provide an object detection method, a related apparatus, and a device, which can fuse the features of the same region in adjacent images to obtain a target time sequence feature carrying time sequence information. A detection result predicted based on the target time sequence feature is more accurate, so the detection precision is improved.
In view of the above, a first aspect of the present application provides a method for object control, comprising:
acquiring an image set, wherein the image set includes at least a first image and a second image, and the first image is the previous frame image of the second image;
acquiring a depth feature set based on the image set, wherein the depth feature set includes a first depth feature and a second depth feature, the first depth feature is a depth feature of a region to be detected in the first image, and the second depth feature is a depth feature of the region to be detected in the second image;
generating a target time sequence feature corresponding to the region to be detected according to the depth feature set; and
acquiring an object detection result through a time sequence detection model based on the target time sequence feature, wherein the object detection result is a detection result of the region to be detected in the second image.
A second aspect of the present application provides an object detection apparatus, including:
an obtaining module, configured to obtain an image set, where the image set includes at least a first image and a second image, and the first image is the previous frame image of the second image;
the obtaining module is further configured to obtain a depth feature set based on the image set, where the depth feature set includes a first depth feature and a second depth feature, the first depth feature is a depth feature of a region to be detected in the first image, and the second depth feature is a depth feature of the region to be detected in the second image;
a generating module, configured to generate a target time sequence feature corresponding to the region to be detected according to the depth feature set obtained by the obtaining module;
the obtaining module is further configured to obtain an object detection result through a time sequence detection model based on the target time sequence feature generated by the generating module, where the object detection result is a detection result of the region to be detected in the second image.
In one possible design, in a first implementation of the second aspect of the embodiments of the present application,
the obtaining module is specifically configured to obtain the first depth feature through a target detection model based on the first image, where the first depth feature includes first features of the region to be detected at P scales, and P is an integer greater than or equal to 1;
obtain the second depth feature through the target detection model based on the second image, where the second depth feature includes second features of the region to be detected at the P scales; and
generate the depth feature set from the first depth feature and the second depth feature.
In one possible design, in a second implementation of the second aspect of the embodiments of the present application,
the generating module is specifically configured to perform cascade processing on the first depth feature and the second depth feature to obtain the target time sequence feature, where the target time sequence feature is a feature matrix, and the first depth feature and the second depth feature are both feature vectors.
In one possible design, in a third implementation of the second aspect of the embodiments of the present application,
the generating module is specifically configured to perform cascade processing on a first feature of the first depth feature and a second feature of the second depth feature based on a first scale to obtain a first target time sequence feature, where the first target time sequence feature is a feature matrix, the first feature and the second feature are feature vectors, and the first scale is one of the P scales; and
perform cascade processing on a first feature of the first depth feature and a second feature of the second depth feature based on a second scale to obtain a second target time sequence feature, where the second target time sequence feature is a feature matrix, the second scale is another one of the P scales, and the second scale and the first scale are different scales.
In one possible design, in a fourth implementation of the second aspect of the embodiments of the present application,
the obtaining module is specifically configured to obtain an object detection feature through the time sequence detection model based on the target time sequence feature, where the object detection feature is a feature vector; and
generate the object detection result according to the object detection feature, where the object detection result includes the object occurrence probability, category information, and position information in the region to be detected.
In one possible design, in a fifth implementation of the second aspect of the embodiments of the present application,
the obtaining module is specifically configured to obtain a first object detection feature through the time sequence detection model based on the first target time sequence feature, where the first object detection feature is a feature vector;
obtain a second object detection feature through the time sequence detection model based on the second target time sequence feature, where the second object detection feature is a feature vector;
determine a first confidence level according to the first object detection feature;
determine a second confidence level according to the second object detection feature;
if the first confidence level is greater than the second confidence level, generate the object detection result according to the first object detection feature, where the object detection result includes the object occurrence probability, category information, and position information in the region to be detected; and
if the second confidence level is greater than the first confidence level, generate the object detection result according to the second object detection feature.
In one possible design, in a sixth implementation of the second aspect of the embodiments of the present application,
the obtaining module is further configured to, after obtaining the object detection result through the time sequence detection model based on the target time sequence feature, obtain an auxiliary operation result by executing a target operation if the object detection result indicates that the target object is included.
In one possible design, in a seventh implementation of the second aspect of the embodiments of the present application, the object detection apparatus further includes a training module;
the obtaining module is further configured to obtain an image set to be trained, where the image set to be trained includes at least one image to be trained, and the image to be trained carries real annotation information;
the obtaining module is further configured to obtain, based on the image set to be trained, prediction annotation information corresponding to the image to be trained through a target detection model to be trained;
the obtaining module is further configured to calculate a first loss function according to the real annotation information of the image to be trained and the prediction annotation information of the image to be trained; and
the training module is configured to obtain the target detection model by training when the first loss function converges.
In one possible design, in an eighth implementation of the second aspect of the embodiments of the present application,
the obtaining module is specifically configured to determine position information of a prediction bounding box according to the prediction annotation information, where the position information of the prediction bounding box includes a central abscissa value, a central ordinate value, a height value, and a width value of the prediction bounding box;
determine position information of a real bounding box according to the real annotation information, where the position information of the real bounding box includes a central abscissa value, a central ordinate value, a height value, and a width value of the real bounding box;
determine a frame confidence according to the real annotation information and the prediction annotation information;
determine a prediction category according to the prediction annotation information;
determine a real category according to the real annotation information; and
calculate the first loss function based on the position information of the prediction bounding box, the position information of the real bounding box, the frame confidence, the prediction category, and the real category.
In one possible design, in a ninth implementation of the second aspect of the embodiments of the present application,
the obtaining module is specifically configured to obtain a video to be processed, where the video to be processed includes multiple frames of images to be processed; and
perform de-duplication processing on the video to be processed to obtain the image set to be trained.
In one possible design, in a tenth implementation of the second aspect of the embodiments of the present application,
the obtaining module is specifically configured to obtain a video to be processed, where the video to be processed includes multiple frames of images to be processed;
obtain the object size of an object to be trained in the image to be processed; and
if the object size of the object to be trained is greater than or equal to a size threshold, determine the image to be processed as an image to be trained.
In one possible design, in an eleventh implementation of the second aspect of the embodiments of the present application,
the obtaining module is further configured to obtain an image set to be trained, where the image set to be trained includes a plurality of images to be trained, and the images to be trained carry real annotation information;
the generating module is further configured to generate a sample set to be trained according to the image set to be trained, where the sample set to be trained includes at least one sample to be trained, and the sample to be trained includes a plurality of images to be trained;
the obtaining module is further configured to obtain, based on the sample set to be trained generated by the generating module, prediction annotation information corresponding to the sample to be trained through a time sequence detection model to be trained;
the obtaining module is further configured to calculate a second loss function according to the real annotation information of the sample to be trained and the prediction annotation information of the sample to be trained; and
the training module is further configured to obtain the time sequence detection model by training when the second loss function converges.
A third aspect of the present application provides an electronic device, comprising a memory, a transceiver, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory, including performing the method according to any one of the above aspects; and
the bus system is configured to connect the memory and the processor so that the memory and the processor communicate with each other.
A fourth aspect of the present application provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of the above aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
In the embodiments of the present application, an object detection method is provided. An image set is first obtained, where the image set includes at least a first image and a second image. A depth feature set is then obtained based on the image set, and a target time sequence feature corresponding to a region to be detected is generated according to the first depth feature and the second depth feature. Finally, an object detection result is obtained through a time sequence detection model based on the target time sequence feature, where the object detection result is a detection result of the region to be detected in the second image. In this manner, the depth features of multiple adjacent images are extracted, and the features of the same region in adjacent images are fused to obtain a target time sequence feature carrying time sequence information. Because the target time sequence feature uses information from multiple images, the detection result predicted based on the target time sequence feature is more accurate, which improves the detection precision.
Drawings
FIG. 1 is a schematic diagram of an architecture of an object detection system according to an embodiment of the present application;
FIG. 2A is a schematic diagram of a display scale based on a virtual scene in an embodiment of the present application;
FIG. 2B is a schematic diagram of another display scale based on a virtual scene in an embodiment of the present application;
FIG. 3 is a schematic diagram of an embodiment of an object detection method in an embodiment of the present application;
FIG. 4 is a schematic diagram of a structure of a target detection network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a structure of a time sequence detection network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a network structure for target detection based on multi-scale extraction in an embodiment of the present application;
FIG. 7 is a schematic diagram of a network structure of time sequence detection based on three consecutive frames of images in an embodiment of the present application;
FIG. 8 is a schematic diagram of an embodiment of single-scale object detection in an embodiment of the present application;
FIG. 9 is a schematic diagram of an embodiment of multi-scale object detection in an embodiment of the present application;
FIG. 10 is a schematic flow chart of an object detection framework in an embodiment of the present application;
FIG. 11 is a schematic diagram of an embodiment of an object detection apparatus in an embodiment of the present application;
FIG. 12 is a schematic diagram of another embodiment of the object detection apparatus in an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a terminal device in an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a server in an embodiment of the present application.
Detailed Description
The embodiments of the present application provide an object detection method, a related apparatus, and a device, which can fuse the features of the same region in adjacent images to obtain a target time sequence feature carrying time sequence information. A detection result predicted based on the target time sequence feature is more accurate, so the detection precision is improved.
Furthermore, the terms "comprises" and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a series of steps or elements is not necessarily limited to the expressly listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that the method provided by the present application can be implemented based on computer vision (CV), a branch of artificial intelligence (AI). Artificial intelligence is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence.
Computer vision is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking, and measurement on a target, and further performing graphics processing so that the computer produces an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, research on the theories and technologies related to computer vision attempts to build artificial intelligence systems capable of acquiring information from images or multi-dimensional data.
It should be understood that the object detection method provided by the present application can be applied to an automated testing scenario, a man-machine battle scenario, an intelligent teammate assistance scenario, an intelligent demonstration scenario, and the like. Taking the man-machine battle scenario as an example, a machine can detect the positions of different objects in a game picture based on the object detection method, and when the position of a real player is detected, the machine can initiate operations such as an attack against the real player. Taking the intelligent teammate assistance scenario as an example, the machine can detect the positions of different objects in the game picture based on the object detection method, and when the position of an opposing player or a non-player character (NPC) is detected, the machine can initiate operations such as an attack against the opposing player or the NPC, thereby assisting the real player. Taking the intelligent demonstration scenario as an example, the machine can detect the positions of different buttons in an application picture based on the object detection method; when a start button is detected, the machine shows the user a simulated click on the start button, and when a close button is detected, the machine shows the user a simulated click on the close button.
The following description takes an automated testing scenario as an example. Automated testing can be performed on different types of application programs, including but not limited to interactive applications, instant messaging applications, and video applications, where interactive applications include but are not limited to shooting (STG) games, multiplayer online battle arena (MOBA) games, and role-playing (RPG) games. Automated testing can improve the efficiency of application creation, for example by reducing the coding difficulty of rules and behavior trees, testing games, and generating game levels. For the automated testing of games, it is important to identify different objects in the user interface (UI), including the types of the detected objects and the positions where the objects are located, and to execute corresponding action policies based on the detection results, that is, to control the operation of the game through the program interface so as to simulate the behavior of a user.
For ease of understanding, the present application provides an object detection method, which is applied to the object detection system shown in FIG. 1. Please refer to FIG. 1, which is a schematic diagram of an architecture of the object detection system in an embodiment of the present application. As shown in the figure, a detection model is obtained by training, and the detection model includes two parts, namely a target detection model and a time sequence detection model; the detection model is then used to detect and recognize images.
It should be noted that a client is deployed on a terminal device, where the terminal device includes but is not limited to a tablet computer, a notebook computer, a palmtop computer, a mobile phone, a voice interaction device, and a personal computer (PC), and is not limited herein. The voice interaction device includes, but is not limited to, smart speakers and smart home appliances.
For ease of introduction, please refer to FIG. 2A and FIG. 2B. FIG. 2A is a schematic diagram of a display scale based on a virtual scene in an embodiment of the present application, and FIG. 2B is a schematic diagram of another display scale based on a virtual scene in an embodiment of the present application. As shown in the figures, the target object in FIG. 2A (the region indicated by A1) is on the right side of the screen, the target object in FIG. 2B (the region indicated by A2) is on the left side of the screen, and the scale of the target object in FIG. 2A is larger than the scale of the target object in FIG. 2B.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
With reference to FIG. 3, the following describes the object detection method in the present application. An embodiment of the object detection method in the embodiments of the present application includes the following steps.
101. Acquire an image set, where the image set includes at least a first image and a second image, and the first image is the previous frame image of the second image.
In this embodiment, an object detection apparatus first obtains an image set, where the image set includes multiple frames of images, that is, at least a first image and a second image, and the first image and the second image to be predicted are two adjacent frames, with the first image preceding the second image.
It should be noted that the image set may further include 3 frames of images to be predicted, or other numbers of images to be predicted, and a time interval between two adjacent frames of images to be predicted may be 0.2 seconds, or other time intervals may also be set, which is not limited herein.
102. Acquire a depth feature set based on the image set, where the depth feature set includes a first depth feature and a second depth feature, the first depth feature is a depth feature of a region to be detected in the first image, and the second depth feature is a depth feature of the region to be detected in the second image.
Specifically, the first image includes R regions to be detected, the second image also includes R regions to be detected, and R is an integer greater than or equal to 1. The kth region to be detected in the first image is input to the target detection model, and the target detection model outputs the first depth feature. The kth region to be detected in the second image, corresponding to that in the first image, is input to the target detection model, and the target detection model outputs the second depth feature. A depth feature is abstract information extracted from a region to be detected and is expressed as a multi-dimensional feature vector.
For ease of understanding, please refer to FIG. 4, which is a schematic diagram of a structure of a target detection network in an embodiment of the present application. As shown in the figure, taking the You Only Look Once (YOLO) network as an example of the target detection network, the YOLO network can use the Darknet53 deep network to extract depth features. Darknet53 is a deep network containing 53 convolutional layers; it makes good use of the graphics processing unit (GPU) and has a small number of residual layers, so prediction with the Darknet53 deep network is more efficient and faster. The YOLO network rasterizes the input picture to obtain P units, each unit being a detection region, that is, a region to be detected. The YOLO network can output the depth feature of each region to be detected, where the depth feature contains abstract information of that region; in the structure shown in FIG. 4, a repetition factor (for example, 2 times) indicates how many times the corresponding structure is repeated.
It should be noted that the target detection network may also be a single-shot multibox detector (SSD), a region-based convolutional neural network (R-CNN), a Fast R-CNN, or a Faster R-CNN. The YOLO network is used here only as an example, and this should not be construed as limiting the present application.
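As a rough illustration of the idea described above (not the patent's actual implementation), the following Python/PyTorch sketch shows how a convolutional backbone can turn an input frame into a grid of per-region depth feature vectors. The names `TinyBackbone` and `extract_region_features` are hypothetical placeholders rather than parts of YOLO or Darknet53.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """A minimal stand-in for a Darknet53-style feature extractor (hypothetical)."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)  # (B, feat_dim, H', W')

def extract_region_features(image: torch.Tensor, backbone: nn.Module, grid: int) -> torch.Tensor:
    """Rasterize the frame into grid x grid regions and return one depth feature vector per region."""
    fmap = backbone(image.unsqueeze(0))                      # (1, N, H', W')
    pooled = nn.functional.adaptive_avg_pool2d(fmap, grid)   # (1, N, grid, grid)
    # Flatten to (grid*grid, N): row k is the depth feature of the k-th region to be detected.
    return pooled.squeeze(0).permute(1, 2, 0).reshape(grid * grid, -1)

# Example: a 416 x 416 frame split into 3 x 3 regions, each described by a 256-dimensional vector.
frame = torch.rand(3, 416, 416)
features = extract_region_features(frame, TinyBackbone(), grid=3)
print(features.shape)  # torch.Size([9, 256])
```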
103. Generate a target time sequence feature corresponding to the region to be detected according to the depth feature set.
Assuming that the first depth feature is a feature vector of dimension 1 × N and the second depth feature is a feature vector of dimension 1 × N, the target time sequence feature is a feature matrix of dimension 2 × N.
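Under the stated assumption that both depth features are 1 × N vectors, the cascade (concatenation) step can be written in one line; the snippet below is merely an illustrative sketch of this stacking, not code from the patent.

```python
import torch

N = 256
first_depth_feature = torch.rand(1, N)    # depth feature of region k in the first image
second_depth_feature = torch.rand(1, N)   # depth feature of region k in the second image

# Cascade processing: stack the two 1 x N vectors into a 2 x N feature matrix.
target_time_sequence_feature = torch.cat([first_depth_feature, second_depth_feature], dim=0)
print(target_time_sequence_feature.shape)  # torch.Size([2, 256])
```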
104. And acquiring an object detection result through the time sequence detection model based on the target time sequence feature, wherein the object detection result is a detection result of the to-be-detected region in the second image.
In this embodiment, the object detection apparatus inputs the target time sequence feature into the time sequence detection model, and the time sequence detection model outputs the object detection result, where the object detection result includes a detection result corresponding to the kth to-be-detected region in the second image, and the detection result includes, but is not limited to, the probability of occurrence of the target object and the category of the target object.
For ease of introduction, please refer to FIG. 5, which is a schematic diagram of a structure of the time sequence detection network in an embodiment of the present application. As shown in the figure, after the depth feature of the kth region in the nth frame and the depth feature of the kth region in the (n+1)th frame are obtained, cascade processing is performed on the two depth features to obtain the target time sequence feature, and the target time sequence feature is input to the time sequence detection network. The time sequence detection network may include a long short-term memory (LSTM) network and a fully connected (FC) layer. The LSTM network is a type of recurrent neural network, and the FC layer can convert a feature matrix into a feature vector, for example a 1 × M feature vector, from which the detection result corresponding to the kth region in the (n+1)th frame is obtained. Cascade processing means that two or more feature vectors are concatenated: assuming that the first depth feature is a 1 × N feature vector and the second depth feature is also a 1 × N feature vector, a 2 × N feature matrix is obtained after cascade processing. This is only an example and should not be construed as limiting the present application.
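The following PyTorch sketch illustrates one plausible shape for such an LSTM-plus-FC time sequence detection head, under the assumptions that the target time sequence feature is a 2 × N matrix and the output is a 1 × M object detection feature. The class name `TimingDetectionHead` and all sizes are hypothetical, not taken from the patent.

```python
import torch
import torch.nn as nn

class TimingDetectionHead(nn.Module):
    """Hypothetical time sequence detection head: LSTM over the frame axis, then an FC layer."""
    def __init__(self, feat_dim: int = 256, hidden_dim: int = 128, out_dim: int = 6):
        super().__init__()
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, out_dim)

    def forward(self, target_seq_feature: torch.Tensor) -> torch.Tensor:
        # target_seq_feature: (T, N) -- T stacked frames (e.g. 2), N-dimensional depth features.
        out, _ = self.lstm(target_seq_feature.unsqueeze(0))   # (1, T, hidden_dim)
        last_step = out[:, -1, :]                             # hidden state after the latest frame
        return self.fc(last_step)                             # (1, M) object detection feature

head = TimingDetectionHead()
seq = torch.rand(2, 256)            # 2 x N target time sequence feature for region k
detection_feature = head(seq)
print(detection_feature.shape)      # torch.Size([1, 6])
```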
In the embodiments of the present application, an object detection method is provided. An image set is first obtained, where the image set includes at least a first image and a second image. A depth feature set is then obtained based on the image set, and a target time sequence feature corresponding to a region to be detected is generated according to the first depth feature and the second depth feature. Finally, an object detection result is obtained through a time sequence detection model based on the target time sequence feature, where the object detection result is a detection result of the region to be detected in the second image. In this manner, the depth features of multiple adjacent images are extracted, and the features of the same region in adjacent images are fused to obtain a target time sequence feature carrying time sequence information. Because the target time sequence feature uses information from multiple images, the detection result predicted based on the target time sequence feature is more accurate, which improves the detection precision.
Optionally, on the basis of the embodiments corresponding to FIG. 3, in a first optional embodiment of the method for object control provided in the embodiments of the present application, acquiring the depth feature set based on the image set may include:
acquiring the first depth feature through a target detection model based on the first image, where the first depth feature includes first features of the region to be detected at P scales, and P is an integer greater than or equal to 1;
acquiring the second depth feature through the target detection model based on the second image, where the second depth feature includes second features of the region to be detected at the P scales; and
generating the depth feature set according to the first depth feature and the second depth feature.
In this embodiment, a multi-scale depth feature set extraction manner is introduced. The object detection apparatus inputs the first image into the target detection model, and the target detection model outputs the first depth feature, which includes the first features of the region to be detected at P scales. In addition, the object detection apparatus inputs the second image into the target detection model, and the target detection model outputs the second depth feature, which includes the second features of the region to be detected at the P scales.
It can be understood that the multi-scale extraction involves P scales. When P is equal to 1, the original scale of the image is used. When P is greater than 1, the image needs to be divided. Specifically, if P is equal to 2, the image is divided into 2 × 2 regions to be detected; if P is equal to 3, the image is divided into 3 × 3 regions to be detected; and so on. The larger the scale, the greater the number of regions to be detected in the image, and the fewer image features each region to be detected contains. Therefore, multi-scale feature extraction captures both global overall information and local detailed information, so that more comprehensive image information is obtained.
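To make the P-scale splitting concrete, the short sketch below divides one frame into P × P regions for each scale and records the pixel box of every region to be detected. It is an illustrative interpretation of the paragraph above, with hypothetical helper names, not code from the patent.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def split_into_regions(width: int, height: int, scales: List[int]) -> Dict[int, List[Box]]:
    """For each scale P, split the frame into P x P regions to be detected."""
    regions: Dict[int, List[Box]] = {}
    for p in scales:
        cell_w, cell_h = width // p, height // p
        regions[p] = [
            (col * cell_w, row * cell_h, (col + 1) * cell_w, (row + 1) * cell_h)
            for row in range(p) for col in range(p)
        ]
    return regions

# Example: scale 1 keeps the whole frame, scale 2 gives 4 regions, scale 3 gives 9 regions.
grids = split_into_regions(1280, 720, scales=[1, 2, 3])
print({p: len(boxes) for p, boxes in grids.items()})  # {1: 1, 2: 4, 3: 9}
```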
Specifically, for ease of introduction, please refer to FIG. 6, which is a schematic diagram of a structure of a target detection network based on multi-scale extraction in an embodiment of the present application. As shown in the figure, the target detection network adopts a feature pyramid network (FPN) to extract multiple scales, namely scale one, scale two, and scale three; the number of scales in FIG. 6 is only an example and should not be understood as limiting the present application. The FPN is a feature extractor designed according to the feature pyramid concept: a feature map is obtained after the predicted image passes through a series of convolutions, and the feature map is then upsampled and restored so that its size is increased without losing high-level semantic information. The large-size feature map is then used to detect small targets, which alleviates the difficulty of detecting small targets.
Taking the first image as an example, the first image is input to the target detection model shown in FIG. 6, and the target detection model outputs the first feature at the first scale, the first feature at the second scale, and the first feature at the third scale, all of which belong to the first depth feature.
Secondly, in the embodiments of the present application, a multi-scale depth feature set extraction manner is provided: based on the first image, the first features at P scales are obtained through the target detection model; based on the second image, the second features at P scales are obtained through the target detection model; and the depth feature set is generated according to the first features and the second features at the P scales.
Optionally, on the basis of the embodiments corresponding to FIG. 3, in a second optional embodiment of the object control method provided in the embodiments of the present application, generating the target time sequence feature corresponding to the region to be detected according to the depth feature set may include:
performing cascade processing on the first depth feature and the second depth feature to obtain the target time sequence feature, where the target time sequence feature is a feature matrix, and the first depth feature and the second depth feature are both feature vectors.
In this embodiment, a single-scale target time sequence feature generation manner is introduced. After the depth feature set is obtained, cascade processing may be performed on the depth features in the depth feature set, that is, on the first depth feature and the second depth feature, to obtain the target time sequence feature. The first depth feature and the second depth feature are the depth features corresponding to the same region to be detected; both are feature vectors, and the target time sequence feature obtained after cascade processing is a feature matrix. Assuming that the first depth feature is the depth feature of the kth region to be detected in the first image, the second depth feature is the depth feature of the kth region to be detected in the second image, and both are 1 × N feature vectors, a 2 × N target time sequence feature, that is, a 2 × N feature matrix, is obtained after cascade processing.
For ease of understanding, a prediction manner based on three consecutive frames of images to be predicted is described below. Please refer to FIG. 7, which is a schematic diagram of a network structure of time sequence detection based on three consecutive frames of images to be predicted in an embodiment of the present application. As shown in the figure, assume that the three consecutive frames of images to be predicted are the nth frame image, the (n+1)th frame image, and the (n+2)th frame image, and that each of the three frames is divided into R regions to be detected. The depth features of the kth region to be detected in the three frames are subjected to cascade processing to obtain the target time sequence feature corresponding to the kth region to be detected.
After the target time sequence feature passes through the LSTM network in the time sequence detection model, a feature matrix is obtained; the feature matrix is converted into a 1 × M object detection feature through the FC layer, and the detection result of the kth region to be detected in the (n+2)th frame image is obtained based on the object detection feature, that is, the object detection result. It can be understood that R object detection features can be obtained for the R regions to be detected; the object detection feature corresponding to the kth region to be detected is taken as an example for description, which should not be construed as limiting the present application.
Secondly, in the embodiments of the present application, a single-scale target time sequence feature generation manner is provided: cascade processing is performed on the first depth feature and the second depth feature to obtain the target time sequence feature, where the target time sequence feature is a feature matrix and the first depth feature and the second depth feature are feature vectors.
Optionally, on the basis of the embodiments corresponding to FIG. 3, in a third optional embodiment of the object control method provided in the embodiments of the present application, generating the target time sequence feature corresponding to the region to be detected according to the depth feature set may include:
performing cascade processing on the first feature of the first depth feature and the second feature of the second depth feature based on a first scale to obtain a first target time sequence feature, where the first target time sequence feature is a feature matrix, the first feature and the second feature are feature vectors, and the first scale is one of the P scales; and
performing cascade processing on the first feature of the first depth feature and the second feature of the second depth feature based on a second scale to obtain a second target time sequence feature, where the second target time sequence feature is a feature matrix, the second scale is another one of the P scales, and the second scale and the first scale are different scales.
In this embodiment, a multi-scale target time sequence feature generation manner is introduced. After the depth feature set is obtained, cascade processing may be performed on the depth features in the depth feature set, that is, on the first depth feature and the second depth feature, to obtain the target time sequence features, where the first depth feature and the second depth feature both include depth features at different scales.
Similarly, at the second scale, the first image includes R2 regions to be detected, and the second image also includes R2 regions to be detected. The first feature corresponding to the kth region to be detected in the first image and the second feature corresponding to the kth region to be detected in the second image are subjected to cascade processing. Assuming that the first feature is a 1 × N feature vector and the second feature is a 1 × N feature vector, a 2 × N second target time sequence feature, that is, a 2 × N feature matrix, is obtained after cascade processing, where k is an integer greater than or equal to 1 and less than or equal to R2.
Thirdly, in the embodiments of the present application, a multi-scale target time sequence feature generation manner is provided: based on the first scale, the first feature of the first depth feature and the second feature of the second depth feature are cascaded to obtain the first target time sequence feature; based on the second scale, the first feature of the first depth feature and the second feature of the second depth feature are cascaded to obtain the second target time sequence feature. The target time sequence feature obtained by cascade processing is a feature matrix, and the first depth feature and the second depth feature are feature vectors.
Optionally, on the basis of the embodiments corresponding to FIG. 3, in a fourth optional embodiment of the method for object control provided in the embodiments of the present application, obtaining the object detection result through the time sequence detection model may include:
acquiring an object detection feature through the time sequence detection model based on the target time sequence feature, where the object detection feature is a feature vector; and
generating the object detection result according to the object detection feature, where the object detection result includes the object occurrence probability, category information, and position information in the region to be detected.
In this embodiment, an object detection result generation manner based on single-scale detection is introduced. For ease of introduction, please refer to FIG. 8, which is a schematic diagram of an embodiment of single-scale object detection in an embodiment of the present application. As shown in the figure, assume that the first image is image A and the second image is image B, and that image A is divided into 3 × 3 regions to be detected and image B is also divided into 3 × 3 regions to be detected. First, the first depth feature (a 1 × N feature vector) of region 1 in image A is extracted, and the second depth feature (a 1 × N feature vector) of region 1 in image B is extracted. Then, cascade processing is performed on the first depth feature and the second depth feature of region 1 to obtain the target time sequence feature (a 2 × N feature matrix) corresponding to region 1. The target time sequence feature corresponding to region 1 is then input to the time sequence detection model to obtain the object detection feature corresponding to region 1, and the object detection result is determined according to the object detection feature. The object detection result includes the object occurrence probability in region 1, the category information, and the position information of region 1, where the position information may be expressed by a central abscissa value and a central ordinate value, for example an abscissa value of 70 and an ordinate value of 30.
Similarly, the first depth feature (a 1 × N feature vector) of region 2 in image A is extracted, and the second depth feature (a 1 × N feature vector) of region 2 in image B is extracted. Cascade processing is performed on the first depth feature and the second depth feature of region 2 to obtain the target time sequence feature (a 2 × N feature matrix) corresponding to region 2, and the target time sequence feature corresponding to region 2 is input to the time sequence detection model to obtain the object detection feature corresponding to region 2.
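Putting the pieces of this embodiment together, the sketch below loops over all regions of two adjacent frames, cascades their per-region features, runs a placeholder time sequence model, and decodes each 1 × M output into an occurrence probability, a category, and a box. Every function name and the output layout are hypothetical stand-ins for the models described above, not an implementation from the patent.

```python
import torch

def decode_detection_feature(vec: torch.Tensor, num_classes: int = 2):
    """Split a 1 x M object detection feature into probability, category and box (assumed layout)."""
    vec = vec.squeeze(0)
    probability = torch.sigmoid(vec[0]).item()                      # object occurrence probability
    category = int(torch.argmax(vec[1:1 + num_classes]))            # category information
    cx, cy, w, h = vec[1 + num_classes:5 + num_classes].tolist()    # position information
    return probability, category, (cx, cy, w, h)

def detect(first_feats: torch.Tensor, second_feats: torch.Tensor, timing_model):
    """first_feats / second_feats: (R, N) depth features of the R regions in two adjacent frames."""
    results = []
    for k in range(first_feats.shape[0]):
        seq = torch.stack([first_feats[k], second_feats[k]])   # 2 x N target time sequence feature
        detection_feature = timing_model(seq)                  # 1 x M object detection feature
        results.append(decode_detection_feature(detection_feature))
    return results

# Example with a dummy stand-in for the trained time sequence detection model.
dummy_model = lambda seq: torch.rand(1, 7)
out = detect(torch.rand(9, 256), torch.rand(9, 256), dummy_model)
print(len(out), out[0])
```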
Further, in the embodiments of the present application, a single-scale object detection result generation manner is provided: the object detection feature is obtained through the time sequence detection model based on the target time sequence feature, and the object detection result is then generated according to the object detection feature.
Optionally, on the basis of the embodiments corresponding to FIG. 3, in a fifth optional embodiment of the method for object control provided in the embodiments of the present application, obtaining the object detection result through the time sequence detection model based on the target time sequence feature may include:
acquiring a first object detection feature through the time sequence detection model based on the first target time sequence feature, where the first object detection feature is a feature vector;
acquiring a second object detection feature through the time sequence detection model based on the second target time sequence feature, where the second object detection feature is a feature vector;
determining a first confidence level according to the first object detection feature;
determining a second confidence level according to the second object detection feature;
if the first confidence level is greater than the second confidence level, generating the object detection result according to the first object detection feature, where the object detection result includes the object occurrence probability, category information, and position information in the region to be detected; and
if the second confidence level is greater than the first confidence level, generating the object detection result according to the second object detection feature.
In this embodiment, a multi-scale object detection result generation manner is introduced. For ease of introduction, please refer to FIG. 9, which is a schematic diagram of an embodiment of multi-scale object detection in an embodiment of the present application. As shown in the figure, assume that the first image is image A and the second image is image B, and that image A has two scales: at the first scale, image A is divided into 3 × 3 regions to be detected, and at the second scale, image A is divided into 4 × 4 regions to be detected. Image B likewise has two scales: at the first scale, image B is divided into 3 × 3 regions to be detected, and at the second scale, image B is divided into 4 × 4 regions to be detected.
At the first scale, the first feature (a 1 × N feature vector) of region 1 in image A is extracted, and the second feature (a 1 × N feature vector) of region 1 in image B is extracted. Cascade processing is then performed on the first feature and the second feature of region 1 to obtain the first target time sequence feature (a 2 × N feature matrix) corresponding to region 1, and the first target time sequence feature corresponding to region 1 is input to the time sequence detection model to obtain the first object detection feature corresponding to region 1. Similarly, the first feature (a 1 × N feature vector) of region 2 in image A is extracted, and the second feature (a 1 × N feature vector) of region 2 in image B is extracted; cascade processing is performed on the first feature and the second feature of region 2 to obtain the first target time sequence feature (a 2 × N feature matrix) corresponding to region 2, which is input to the time sequence detection model to obtain the first object detection feature corresponding to region 2.
At the second scale, the first feature (a 1 × N feature vector) of region 3 in image A is extracted, and the second feature (a 1 × N feature vector) of region 3 in image B is extracted. Cascade processing is then performed on the first feature and the second feature of region 3 to obtain the second target time sequence feature (a 2 × N feature matrix) corresponding to region 3, and the second target time sequence feature corresponding to region 3 is input to the time sequence detection model to obtain the second object detection feature corresponding to region 3.
After the first object detection feature and the second object detection feature are obtained, non-maximum suppression (NMS) can be used to select among the object detection features and generate the final object detection result. First, a confidence score is generated according to each object detection feature, and the bounding box (BBox) with the highest confidence is selected according to the confidence scores. The BBoxes corresponding to the remaining object detection features are then traversed: if the intersection over union (IoU) between a BBox and the current highest-scoring BBox is greater than a threshold, the BBox corresponding to that object detection feature is deleted. Finally, the highest-scoring BBox is again selected from the unprocessed BBoxes, and the above process is repeated until the BBoxes corresponding to all object detection features have been processed.
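As a minimal sketch of the NMS procedure just described (assuming axis-aligned boxes given as (x1, y1, x2, y2) with one score each; not the patent's code), the following pure-Python function keeps the highest-scoring box, discards any remaining box whose IoU with it exceeds the threshold, and repeats:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes: List[Box], scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of the boxes kept after non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    while order:
        best = order.pop(0)              # current highest-scoring box
        kept.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return kept

print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)], [0.9, 0.8, 0.7]))  # [0, 2]
```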
Taking the first object detection feature and the second object detection feature as an example, the first confidence level of the first object detection feature and the second confidence level of the second object detection feature are obtained, and the object detection result is generated from the object detection feature with the higher confidence level, thereby obtaining the object occurrence probability, category information, and position information in the region to be detected.
Further, in this embodiment, a multi-scale object detection result generation manner is provided: for multi-scale feature extraction, multiple object detection features are obtained, the confidence levels corresponding to the different object detection features are calculated respectively, and the object detection result is finally generated according to the object detection feature with the higher confidence level.
Optionally, on the basis of the embodiments corresponding to FIG. 3, in a sixth optional embodiment of the method for object control provided in the embodiments of the present application, after the object detection result is obtained through the time sequence detection model based on the target time sequence feature, the method may further include:
if the object detection result indicates that the target object is included, obtaining an auxiliary operation result by executing a target operation.
In this embodiment, a method for performing an auxiliary operation in combination with the object detection result is introduced: whether the target object exists is determined according to the object detection result, and if the target object exists, the target operation is performed on the target object according to an association policy to obtain the auxiliary operation result.
For ease of introduction, please refer to FIG. 10, which is a schematic flow chart of an object detection framework in an embodiment of the present application. As shown in the figure, taking the AI operation of a gun-battle game as an example, the process is as follows:
in step S1, a video of the gun-battle game is recorded, and images to be trained are collected from the recorded video;
in step S2, the objects are labeled manually or automatically by machine, where the labeled content includes whether an object to be trained exists in the image to be trained, the position information of the object to be trained, and the like;
in step S3, the YOLO network is trained based on the labeled images to be trained;
in step S4, a feature pyramid of the image to be predicted is extracted by using the trained YOLO network to obtain depth features at P scales;
in step S5, the depth features of consecutive multiple frames are subjected to cascade processing to obtain the time sequence feature, the time sequence feature is input to the LSTM network, and the LSTM network outputs the object detection feature;
in step S6, the target object in the gun-battle game is detected based on the object detection feature;
in step S7, the AI is assisted to perform the corresponding operation according to information such as the type and position of the target object. For ease of description, please refer to Table 1, which shows the correspondence of the association policy between objects and operations.
TABLE 1

Object         | Operation
Guardian       | Converse with the guardian
Lurker         | Attack the lurker using ordinary skills
Small monster  | Attack small monsters using ordinary skills
Big monster    | Attack big monsters using magic skills
Grass          | Hide in the grass
Blacksmith     | Converse with the blacksmith
It can be seen that the target object belongs to the objects listed above, and the target operation is the operation corresponding to that object. It can be understood that the association policy shown in Table 1 is only illustrative and should not be construed as limiting the present application.
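An association policy of this kind can be represented as a simple lookup from detected category to operation, as in the hedged sketch below; the category names and the `choose_operation` helper are illustrative assumptions, not part of the patent.

```python
# Hypothetical association policy mapping detected object categories to operations.
ASSOCIATION_POLICY = {
    "guardian": "converse with the guardian",
    "lurker": "attack with ordinary skills",
    "small_monster": "attack with ordinary skills",
    "big_monster": "attack with magic skills",
    "grass": "hide in the grass",
    "blacksmith": "converse with the blacksmith",
}

def choose_operation(category: str) -> str:
    """Return the target operation for a detected object category."""
    return ASSOCIATION_POLICY.get(category, "no operation")

print(choose_operation("big_monster"))  # attack with magic skills
```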
In the embodiments of the present application, a method for performing an auxiliary operation in combination with the object detection result is provided: if the object detection result indicates that the target object is included, the auxiliary operation result is obtained by executing the target operation.
Optionally, on the basis of the foregoing embodiments corresponding to fig. 3, in a seventh optional embodiment of the method for object control provided in the embodiment of the present application, the method may further include:
acquiring an image set to be trained, wherein the image set to be trained comprises at least one image to be trained, and the image to be trained carries real annotation information;
based on the image set to be trained, obtaining the prediction annotation information corresponding to the image to be trained through a target detection model to be trained;
calculating a first loss function according to the real annotation information of the image to be trained and the prediction annotation information of the image to be trained;
and when the first loss function converges, obtaining a target detection model through training.
In this embodiment, a training method of the target detection model is introduced. After an image set to be trained is obtained, the images to be trained in the image set need to be labeled to obtain the corresponding real annotation information. The real annotation information includes the category label of the object to be trained (such as a guardian label or a latent-one label) and its position information (such as the horizontal coordinate of the upper left corner, the vertical coordinate of the upper left corner, the width of the BBox, and the height of the BBox). After the real annotation information is obtained, the image to be trained is input to the target detection model to be trained, and the target detection model to be trained outputs the prediction annotation information corresponding to the image to be trained. Specifically, taking the target detection model as a YOLO network model as an example, feature extraction is first performed through the darknet53 depth network, and the depth features of regions at different scales are then extracted in a feature pyramid manner, where a common multi-scale feature extraction method fuses bottom-level features with high-level features to improve the discriminative power of the features at different scales. The first loss function is then calculated according to the real annotation information and the prediction annotation information, the model parameters are optimized by minimizing the first loss function, and when the first loss function converges, the target detection model is obtained through training.
Further, in the embodiment of the present application, a training mode of the target detection model is provided: an image set to be trained is obtained first; then, based on the image set to be trained, the prediction annotation information corresponding to the image to be trained is obtained through the target detection model to be trained; next, the first loss function is obtained by calculation according to the real annotation information of the image to be trained and the prediction annotation information of the image to be trained; and when the first loss function converges, the target detection model is obtained by training.
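A minimal PyTorch-style sketch of this training mode is given below; the model, the data pairs and first_loss_fn are placeholders under the scheme described above, not the patent's actual implementation:

    import torch

    def train_target_detection_model(model, data_loader, first_loss_fn, epochs=10, lr=1e-3):
        # model: a detection network (e.g. a YOLO-style network) that outputs prediction
        # annotation information; first_loss_fn compares it with the real annotation information.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for images, real_annotation in data_loader:
                predicted_annotation = model(images)
                loss = first_loss_fn(predicted_annotation, real_annotation)
                optimizer.zero_grad()
                loss.backward()   # gradient backward transfer
                optimizer.step()  # update the model parameters
        return model              # in practice, training stops when the first loss converges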
Optionally, on the basis of the foregoing embodiments corresponding to fig. 3, in an eighth optional embodiment of the method for object control provided in the embodiment of the present application, calculating the first loss function according to the real annotation information of the image to be trained and the prediction annotation information of the image to be trained may include:
determining the position information of the prediction bounding box according to the prediction annotation information, wherein the position information of the prediction bounding box comprises a central abscissa value, a central ordinate value, a height value and a width value of the prediction bounding box;
determining the position information of the real bounding box according to the real annotation information, wherein the position information of the real bounding box comprises a central abscissa value, a central ordinate value, a height value and a width value of the real bounding box;
determining a bounding box confidence according to the real annotation information and the prediction annotation information;
determining a prediction category according to the prediction annotation information;
determining a real category according to the real annotation information;
and calculating the first loss function based on the position information of the prediction bounding box, the position information of the real bounding box, the bounding box confidence, the prediction category, and the real category.
In this embodiment, a calculation method of the first loss function is introduced. After the prediction annotation information and the real annotation information are obtained, the position information of the prediction bounding box (BBox) needs to be obtained according to the prediction annotation information, and the position information of the real bounding box needs to be obtained according to the real annotation information, where the position information includes a central abscissa value, a central ordinate value, a height value, and a width value.
The bounding box confidence is then determined based on the real annotation information and the prediction annotation information. Prior bounding boxes are obtained by dimension clustering, and a mean square error loss function is used during training. The confidence that an object exists is predicted using a logistic regression strategy: when the overlap between a real bounding box and a prior bounding box is larger than that with all other priors, the corresponding bounding box confidence is 1. If a prior is not the best one but its overlap still exceeds a set threshold (e.g., 0.5), its prediction is ignored.
Based on the real category and the predicted category, a binary cross entropy loss is used for category prediction, and multi-label classification is used for each bounding box to predict the categories that the bounding box may contain.
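The prior assignment described above can be sketched as follows (a simplified illustration; IoU is assumed as the overlap measure and the helper name assign_priors is hypothetical):

    def assign_priors(ious, ignore_threshold=0.5):
        # ious: overlap between one real bounding box and each prior bounding box.
        # Returns per-prior objectness targets: 1 for the best prior, None (ignored)
        # for non-best priors whose overlap exceeds the threshold, and 0 otherwise.
        best = max(range(len(ious)), key=lambda i: ious[i])
        targets = []
        for i, iou in enumerate(ious):
            if i == best:
                targets.append(1)       # bounding box confidence target is 1 for the best prior
            elif iou > ignore_threshold:
                targets.append(None)    # the prediction is ignored
            else:
                targets.append(0)
        return targets

    print(assign_priors([0.2, 0.65, 0.9]))  # -> [0, None, 1]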
Wherein λ_coord denotes the first coefficient and λ_noobj denotes the second coefficient; 1_ij^obj indicates whether the jth bounding box in the ith grid corresponds to the target object, taking the value 1 if it does and 0 otherwise; 1_ij^noobj indicates that the jth bounding box in the ith grid does not correspond to the target object. C represents the bounding box confidence, w represents the width value, h represents the height value, x represents the central abscissa value, y represents the central ordinate value, and P represents the category.
It is understood that the target detection network adopted in the present application may be a YOLO V3 network. The YOLO V3 network performs bounding box prediction at 3 different scales, with 3 kinds of bounding boxes per scale, so that 9 cluster centers corresponding to the 3 scales are obtained, namely (10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198), and (373 × 326).
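For illustration, these cluster centers can be grouped per prediction scale as follows; assigning the three smallest anchors to the finest scale follows the usual YOLO V3 convention and is an assumption here rather than a statement of the original text:

    # 9 prior box sizes (width, height) obtained by dimension clustering, 3 per prediction scale.
    YOLO_V3_ANCHORS = {
        "fine_scale":   [(10, 13), (16, 30), (33, 23)],      # small objects
        "medium_scale": [(30, 61), (62, 45), (59, 119)],     # medium objects
        "coarse_scale": [(116, 90), (156, 198), (373, 326)], # large objects
    }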
Further, in this embodiment, a method for calculating the first loss function is provided: the bounding box confidence is determined according to the position information of the prediction bounding box and the position information of the real bounding box, and the first loss function is obtained by further combining the prediction category and the real category.
Optionally, on the basis of the foregoing embodiments corresponding to fig. 3, in a ninth optional embodiment of the method for object control provided in the embodiment of the present application, acquiring the set of images to be trained may include:
acquiring a video to be processed, wherein the video to be processed comprises a plurality of frames of images to be processed;
and carrying out duplication elimination processing on the video to be processed to obtain an image set to be trained.
In this embodiment, a way of acquiring images to be trained is described. Specifically, a recorded video to be processed is acquired first, where the video to be processed may be a video recorded for an interactive application, for example, a video recorded by a player during a gunfight game. The video to be processed includes a plurality of frames of images to be processed; if one image to be processed is collected every 1 second, a 2-minute video to be processed contains 120 frames of images to be processed.
For example, if the similarity between the image A to be processed and the image B to be processed is higher than the similarity threshold, the image A to be processed is removed; the image B to be processed and the image C to be processed are then compared, and if their similarity is lower than or equal to the similarity threshold, the image B to be processed is determined as an image to be trained, and the image C to be processed enters the subsequent similarity comparison process, which is not described herein again.
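A minimal sketch of this de-duplication pass, assuming a generic frame-similarity function supplied by the caller (the similarity measure itself is not specified in this embodiment):

    def deduplicate(frames, similarity, threshold=0.9):
        # Walk through consecutive frames; when two neighbours are too similar,
        # drop the earlier one and continue comparing from the later one.
        kept = []
        for i in range(len(frames) - 1):
            if similarity(frames[i], frames[i + 1]) > threshold:
                continue              # frames[i] is removed as a near-duplicate
            kept.append(frames[i])    # frames[i] becomes an image to be trained
        if frames:
            kept.append(frames[-1])   # the last frame has no successor to compare with
        return kept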
Further, in the embodiment of the present application, a way of obtaining images to be trained is provided: a video to be processed is obtained first, where the video to be processed includes multiple frames of images to be processed, and de-duplication processing is then performed on the video to be processed to obtain the image set to be trained.
Optionally, on the basis of the foregoing embodiments corresponding to fig. 3, in a tenth optional embodiment of the method for object control provided in the embodiment of the present application, acquiring the set of images to be trained may include:
acquiring a video to be processed, wherein the video to be processed comprises a plurality of frames of images to be processed;
acquiring the size of an object to be trained in an image to be processed;
and if the size of the object to be trained is larger than or equal to the size threshold, determining the image to be processed as the image to be trained.
In this embodiment, another way of obtaining images to be trained is introduced. Specifically, a recorded video to be processed may be obtained first, where the video to be processed may be a video recorded for an interactive application. After the acquisition is completed, samples may be screened manually as images to be trained, or the images to be trained may be obtained through automatic screening by a device.
When the device automatically screens the images to be trained, the object size of each object to be trained in the image to be processed may be extracted, where the object size may be expressed in pixels, for example, 10 × 10 or 5 × 50. The extracted object size of each object to be trained is compared with a size threshold: if the object size of the object to be trained is greater than or equal to the size threshold, the frame of the image to be processed is determined as an image to be trained, and if the object size of the object to be trained is smaller than the size threshold, the frame of the image to be processed is rejected. It is understood that the size threshold may be 1/400 of the whole area of the image to be processed; in practical applications, other size thresholds may also be set, which is not limited herein.
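A minimal sketch of this screening rule, assuming the object size is measured in pixels and the threshold defaults to 1/400 of the image area as mentioned above; requiring every labelled object to pass the threshold is one possible reading of the rule, and the helper name is hypothetical:

    def keep_for_training(object_sizes, image_width, image_height, ratio=1 / 400):
        # object_sizes: list of (width, height) in pixels for the objects to be trained in one frame.
        # The frame is kept as an image to be trained only if every object is large enough.
        size_threshold = image_width * image_height * ratio
        return all(w * h >= size_threshold for (w, h) in object_sizes)

    print(keep_for_training([(10, 10), (50, 50)], 640, 360))  # threshold 576 -> False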
Further, in the embodiment of the present application, another way of obtaining images to be trained is provided: a video to be processed is obtained first, the object size of the object to be trained in the image to be processed is then obtained, and if the object size of the object to be trained is greater than or equal to the size threshold, the image to be processed is determined as an image to be trained.
Optionally, on the basis of the foregoing embodiments corresponding to fig. 3, in an eleventh optional embodiment of the method for object control provided in the embodiment of the present application, the method may further include:
acquiring an image set to be trained, wherein the image set to be trained comprises a plurality of images to be trained, and the images to be trained carry real annotation information;
generating a sample set to be trained according to the image set to be trained, wherein the sample set to be trained comprises at least one sample to be trained, and each sample to be trained comprises a plurality of images to be trained;
based on the sample set to be trained, obtaining the prediction annotation information corresponding to the sample to be trained through a time sequence detection model to be trained;
calculating a second loss function according to the real annotation information of the sample to be trained and the prediction annotation information of the image to be trained;
and when the second loss function converges, obtaining the time sequence detection model through training.
In this embodiment, a training method of the time sequence detection model is introduced. After the real annotation information is obtained, continuous multi-frame images to be trained are used as one sample to be trained, and a plurality of samples to be trained form the sample set to be trained. The samples to be trained are then input to the time sequence detection model to be trained. Specifically, the depth features of different regions of each image are extracted through a depth network, the depth features of the corresponding regions of adjacent images are input to the time sequence detection model to be trained (such as an LSTM depth network), and the time sequence detection model to be trained outputs the prediction annotation information, where the prediction annotation information includes the probability that a target object appears in each region, the class information of the target object, and the boundary information of the target object. The second loss function may then be calculated according to the real annotation information and the prediction annotation information.
The model parameters are optimized by minimizing the second loss function, so as to reduce the difference between the real and predicted position information of the target object and to reduce the difference between the real and predicted classes of the target object. The gradient is calculated through the second loss function, the gradients of the model parameters are obtained through a gradient backward transfer method, and the model parameters are then updated. When the second loss function converges, the currently obtained model parameters are adopted, and the time sequence detection model is obtained through training.
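A minimal PyTorch-style sketch of this second training stage; the LSTM dimensions, the output layout (probability + class scores + four BBox values) and the helper names are assumptions for illustration, not the patent's actual configuration:

    import torch
    from torch import nn

    class TimingDetectionModel(nn.Module):
        # Consumes the cascaded depth features of adjacent frames (T x N per region) and
        # predicts the object occurrence probability, class scores and BBox values.
        def __init__(self, feature_dim, num_classes, hidden_dim=128):
            super().__init__()
            self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, 1 + num_classes + 4)

        def forward(self, timing_features):        # shape (batch, T, N)
            _, (h_n, _) = self.lstm(timing_features)
            return self.head(h_n[-1])               # shape (batch, 1 + num_classes + 4)

    def train_timing_model(model, loader, second_loss_fn, epochs=10, lr=1e-3):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for timing_features, real_annotation in loader:
                prediction = model(timing_features)
                loss = second_loss_fn(prediction, real_annotation)
                optimizer.zero_grad()
                loss.backward()   # gradient backward transfer
                optimizer.step()  # update the model parameters
        return model              # in practice, training stops when the second loss converges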
Further, in this embodiment, a training mode of the time sequence detection model is provided: an image set to be trained is obtained, a sample set to be trained is generated according to the image set to be trained, the prediction annotation information corresponding to the sample to be trained is obtained through the time sequence detection model to be trained, the second loss function is calculated according to the real annotation information of the sample to be trained and the prediction annotation information of the image to be trained, and when the second loss function converges, the time sequence detection model is obtained through training.
For convenience of description, the scheme provided by the present application is described below with reference to a specific scene, taking a gunfight game as an example. First, the client acquires a plurality of frames of continuous images; it is assumed that image 1 and image 2 are separated by 0.2 seconds. Next, image 1 and image 2 are input to the target detection model, and the depth feature of image 1 and the depth feature of image 2 are extracted by the target detection model. Assuming that the depth feature of the region to be detected in image 1 is a 1 × N feature vector and the depth feature of the region to be detected in image 2 is also a 1 × N feature vector, the two feature vectors are cascaded to obtain a 2 × N target time sequence feature. The target time sequence feature is then input to the time sequence detection model, which outputs the object detection feature. Assume that the object detection feature is expressed as (0.8, 0.1, 0.2, 0.7, 50, 70, 30, 15), where 0.8 represents the probability that the target object appears; since 0.8 is greater than 0.5, it is determined that the target object exists in image 2. The values 0.1, 0.2 and 0.7 represent the probabilities of the candidate classes, with 0.7 being the largest, and the values 50, 70, 30 and 15 represent the position information of the bounding box.
An object detection result is then obtained based on the object detection feature: a target object exists in image 2, the class information of the target object is "latent one", the "latent one" is represented by a BBox in image 2, the central abscissa value of the BBox is 50 pixels, the central ordinate value of the BBox is 70 pixels, the height value of the BBox is 30, and the width value of the BBox is 15. The target object in the gunfight game is thereby identified.
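The cascade-and-detect step of this example can be sketched as follows; NumPy is assumed for the feature handling, the class list and the 0.5 threshold follow the example above, and timing_model stands for any callable returning the 8-dimensional object detection feature:

    import numpy as np

    CLASSES = ["guardian", "small monster", "latent one"]  # class order is an assumption

    def detect(feature_frame1, feature_frame2, timing_model, threshold=0.5):
        # feature_frame1 / feature_frame2: 1 x N depth features of the same region in two
        # adjacent frames; cascading them yields the 2 x N target time sequence feature.
        timing_feature = np.stack([feature_frame1, feature_frame2], axis=0)  # shape (2, N)
        out = timing_model(timing_feature)  # e.g. (0.8, 0.1, 0.2, 0.7, 50, 70, 30, 15)
        probability, class_scores, bbox = out[0], out[1:4], out[4:]
        if probability <= threshold:
            return None                     # no target object in the second image
        return {
            "class": CLASSES[int(np.argmax(class_scores))],
            "bbox_center_x": bbox[0], "bbox_center_y": bbox[1],
            "bbox_height": bbox[2], "bbox_width": bbox[3],
        }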
Referring to fig. 11, fig. 11 is a schematic diagram of an embodiment of the object detection apparatus 20 according to the embodiment of the present application, which includes:
an obtaining module 201, configured to obtain an image set, where the image set includes at least a first image and a second image, and the first image is a frame image preceding the second image;
the obtaining module 201 is further configured to obtain a depth feature set based on the image set, where the depth feature set includes a first depth feature and a second depth feature, the first depth feature is a depth feature of a region to be detected in the first image, and the second depth feature is a depth feature of the region to be detected in the second image;
a generating module 202, configured to generate a target time sequence feature corresponding to the region to be detected according to the depth feature set obtained by the obtaining module 201;
the obtaining module 201 is further configured to obtain an object detection result through a time sequence detection model based on the target time sequence feature generated by the generating module 202, where the object detection result is a detection result of the region to be detected in the second image.
In the embodiment of the present application, an object detection method is provided. An image set is first obtained, where the image set includes at least a first image and a second image; a depth feature set is then obtained based on the image set; a target time sequence feature corresponding to the region to be detected is generated according to the first depth feature and the second depth feature; and finally an object detection result is obtained through the time sequence detection model based on the target time sequence feature, where the object detection result is the detection result of the region to be detected in the second image. In this way, the depth features of multiple adjacent frames are extracted, and the features of the same region in adjacent images are fused to obtain a target time sequence feature with temporal ordering. Since the target time sequence feature uses information from multiple images, the detection result predicted based on the target time sequence feature is more accurate, and the detection precision is improved.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the object detecting device 20 provided in the embodiment of the present application,
the obtaining module 201 is specifically configured to obtain, based on the first image, the first depth feature through the target detection model, where the first depth feature includes first features of the region to be detected at P scales, and P is an integer greater than or equal to 1;
obtain a second depth feature through the target detection model based on the second image, where the second depth feature includes second features of the region to be detected at the P scales;
and generate the depth feature set according to the first depth feature and the second depth feature.
Secondly, in the embodiment of the present application, a multi-scale depth feature set extraction manner is provided: based on the first image, the first features at P scales are obtained through the target detection model; based on the second image, the second features at the P scales are obtained through the target detection model; and the depth feature set is generated according to the first features and the second features at the P scales.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the object detecting device 20 provided in the embodiment of the present application,
the generating module 202 is specifically configured to cascade the first depth feature and the second depth feature to obtain the target time sequence feature, where the target time sequence feature is a feature matrix, and the first depth feature and the second depth feature are feature vectors.
Secondly, in the embodiment of the present application, a single-scale target time sequence feature generation mode is provided: the first depth feature and the second depth feature are cascaded to obtain the target time sequence feature, where the target time sequence feature is a feature matrix and the first depth feature and the second depth feature are feature vectors.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the object detecting device 20 provided in the embodiment of the present application,
the generating module 202 is specifically configured to cascade, based on a first scale, a first feature among the first depth features and a second feature among the second depth features to obtain a first target time sequence feature, where the first target time sequence feature is a feature matrix, the first feature and the second feature are feature vectors, and the first scale is one of the P scales;
and cascade, based on a second scale, a first feature among the first depth features and a second feature among the second depth features to obtain a second target time sequence feature, where the second target time sequence feature is a feature matrix, the second scale is another one of the P scales, and the second scale and the first scale are different scales.
Thirdly, in the embodiment of the present application, a multi-scale target time sequence feature generation manner is provided: based on the first scale, the first feature among the first depth features and the second feature among the second depth features are cascaded to obtain the first target time sequence feature; based on the second scale, the corresponding first feature and second feature are cascaded to obtain the second target time sequence feature. Each target time sequence feature is a feature matrix, and the first depth feature and the second depth feature are feature vectors.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the object detecting device 20 provided in the embodiment of the present application,
the obtaining module 201 is specifically configured to obtain an object detection feature through the timing detection model based on the target timing feature, where the object detection feature is a feature vector;
and generating the object detection result according to the object detection characteristics, wherein the object detection result comprises the object occurrence probability, the class information and the position information in the region to be detected.
Further, in the embodiment of the present application, a single-scale way of generating the object detection result is provided: based on the target time sequence feature, the object detection feature is obtained through the time sequence detection model, and the object detection result is then generated according to the object detection feature.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the object detecting device 20 provided in the embodiment of the present application,
the obtaining module 201 is specifically configured to obtain a first object detection feature through the time sequence detection model based on the first target time sequence feature, where the first object detection feature is a feature vector;
obtain a second object detection feature through the time sequence detection model based on the second target time sequence feature, where the second object detection feature is a feature vector;
determine a first confidence degree according to the first object detection feature;
determine a second confidence degree according to the second object detection feature;
if the first confidence degree is greater than the second confidence degree, generate the object detection result according to the first object detection feature, where the object detection result includes the object occurrence probability, the class information, and the position information in the region to be detected;
and if the second confidence degree is greater than the first confidence degree, generate the object detection result according to the second object detection feature.
Further, in this embodiment, a multi-scale object detection result generation manner is provided: for the multi-scale feature extraction, a plurality of object detection features are obtained, the confidence degrees corresponding to the different object detection features are calculated respectively, and the object detection result is finally generated according to the object detection feature with the higher confidence degree.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the object detecting device 20 provided in the embodiment of the present application,
the obtaining module 201 is further configured to, after obtaining the object detection result through the time sequence detection model based on the target time sequence feature, acquire an auxiliary operation result according to the executed target operation if the object detection result indicates that the target object is included.
In the embodiment of the present application, a method for performing an auxiliary operation in combination with the object detection result is provided: if the object detection result indicates that the target object is included, the auxiliary operation result is acquired according to the executed target operation.
Optionally, on the basis of the embodiment corresponding to fig. 11, please refer to fig. 12, in another embodiment of the object detecting device 20 provided in the embodiment of the present application, the object detecting device 20 further includes a training module 203;
the obtaining module 201 is further configured to obtain an image set to be trained, where the image set to be trained includes at least one image to be trained, and the image to be trained carries real annotation information;
the obtaining module 201 is further configured to obtain, based on the image set to be trained, prediction annotation information corresponding to the image to be trained through a target detection model to be trained;
the obtaining module 201 is further configured to calculate a first loss function according to the real annotation information of the image to be trained and the prediction annotation information of the image to be trained;
and the training module 203 is configured to train to obtain the target detection model when the first loss function converges.
Further, in the embodiment of the present application, a training mode of the target detection model is provided: an image set to be trained is obtained first; then, based on the image set to be trained, the prediction annotation information corresponding to the image to be trained is obtained through the target detection model to be trained; next, the first loss function is obtained by calculation according to the real annotation information of the image to be trained and the prediction annotation information of the image to be trained; and when the first loss function converges, the target detection model is obtained by training.
Alternatively, on the basis of the embodiments corresponding to fig. 11 or fig. 12, in another embodiment of the object detecting device 20 provided in the embodiment of the present application,
the obtaining module 201 is specifically configured to determine position information of a prediction bounding box according to the prediction annotation information, where the position information of the prediction bounding box includes a central abscissa value, a central ordinate value, a height value, and a width value of the prediction bounding box;
determine position information of a real bounding box according to the real annotation information, where the position information of the real bounding box includes a central abscissa value, a central ordinate value, a height value, and a width value of the real bounding box;
determine a bounding box confidence according to the real annotation information and the prediction annotation information;
determine a prediction category according to the prediction annotation information;
determine a real category according to the real annotation information;
and calculate the first loss function based on the position information of the prediction bounding box, the position information of the real bounding box, the bounding box confidence, the prediction category, and the real category.
Further, in this embodiment, a method for calculating the first loss function is provided: the bounding box confidence is determined according to the position information of the prediction bounding box and the position information of the real bounding box, and the first loss function is obtained by further combining the prediction category and the real category.
Alternatively, on the basis of the embodiments corresponding to fig. 11 or fig. 12, in another embodiment of the object detecting device 20 provided in the embodiment of the present application,
the obtaining module 201 is specifically configured to obtain a video to be processed, where the video to be processed includes multiple frames of images to be processed;
and carrying out duplication elimination processing on the video to be processed to obtain the image set to be trained.
Further, in the embodiment of the present application, a way of obtaining images to be trained is provided: a video to be processed is obtained first, where the video to be processed includes multiple frames of images to be processed, and de-duplication processing is then performed on the video to be processed to obtain the image set to be trained.
Alternatively, on the basis of the embodiments corresponding to fig. 11 or fig. 12, in another embodiment of the object detecting device 20 provided in the embodiment of the present application,
the obtaining module 201 is specifically configured to obtain a video to be processed, where the video to be processed includes multiple frames of images to be processed;
acquiring the size of an object to be trained in the image to be processed;
and if the object size of the object to be trained is larger than or equal to the size threshold, determining the image to be processed as the image to be trained.
Further, in the embodiment of the present application, another way of obtaining images to be trained is provided: a video to be processed is obtained first, the object size of the object to be trained in the image to be processed is then obtained, and if the object size of the object to be trained is greater than or equal to the size threshold, the image to be processed is determined as an image to be trained.
Alternatively, on the basis of the embodiments corresponding to fig. 11 or fig. 12, in another embodiment of the object detecting device 20 provided in the embodiment of the present application,
the obtaining module 201 is further configured to obtain an image set to be trained, where the image set to be trained includes a plurality of images to be trained, and the images to be trained carry real annotation information;
the generating module 202 is further configured to generate a sample set to be trained according to the image set to be trained obtained by the obtaining module 201, where the sample set to be trained includes at least one sample to be trained, and each sample to be trained includes a plurality of images to be trained;
the obtaining module 201 is further configured to obtain, based on the sample set to be trained generated by the generating module 202, prediction annotation information corresponding to the sample to be trained through a time sequence detection model to be trained;
the obtaining module 201 is further configured to calculate a second loss function according to the real annotation information of the sample to be trained and the prediction annotation information of the image to be trained;
and the training module 203 is further configured to train to obtain the time sequence detection model when the second loss function converges.
Further, in this embodiment, a training mode of the time sequence detection model is provided: an image set to be trained is obtained, a sample set to be trained is generated according to the image set to be trained, the prediction annotation information corresponding to the sample to be trained is obtained through the time sequence detection model to be trained, the second loss function is calculated according to the real annotation information of the sample to be trained and the prediction annotation information of the image to be trained, and when the second loss function converges, the time sequence detection model is obtained through training.
An embodiment of the present application further provides a terminal device, as shown in fig. 13. For convenience of description, only the parts related to the embodiment of the present application are shown; for specific technical details that are not disclosed, please refer to the method part of the embodiment of the present application. The terminal device may be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sale (POS) terminal, a vehicle-mounted computer, and the like. The following takes the terminal device being a mobile phone as an example:
fig. 13 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 13, the handset includes: radio Frequency (RF) circuit 310, memory 320, input unit 330, display unit 340, sensor 350, audio circuit 360, wireless fidelity (WiFi) module 370, processor 380, and power supply 390. Those skilled in the art will appreciate that the handset configuration shown in fig. 13 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 13:
the RF circuit 310 may be used for receiving and transmitting signals during a message transmission or call, and in particular, for receiving downlink information of a base station and then processing the received downlink information, and for transmitting design uplink data to the base station. generally, the RF circuit 310 includes, but is not limited to, an antenna, at least amplifiers, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, etc. in addition, the RF circuit 310 may also communicate with a network and other devices via wireless communication.
The memory 320 may be used to store software programs and modules, and the processor 380 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 320. The memory 320 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, and the like), and the storage data area may store data created according to the use of the mobile phone (such as audio data, a phonebook, and the like). Further, the memory 320 may include a high-speed random access memory, and may also include a non-volatile memory, for example, at least one disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 330 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the cellular phone, and particularly, the input unit 330 may include a touch panel 331 and other input devices 332. the touch panel 331, also referred to as a touch screen, may collect touch operations of a user on or near the touch panel 331 (such as operations of a user on or near the touch panel 331 using any suitable object or attachment such as a finger, a stylus, etc.) and drive corresponding connection means according to a preset program.
The display unit 340 may be used to display information input by or provided to the user and various menus of the mobile phone. The display unit 340 may include a display panel 341; optionally, the display panel 341 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 331 may cover the display panel 341; when a touch operation on or near the touch panel 331 is detected, the touch operation is transmitted to the processor 380 to determine the type of the touch event, and the processor 380 then provides a corresponding visual output on the display panel 341 according to the type of the touch event. Although in fig. 13 the touch panel 331 and the display panel 341 are two separate components to implement the input and output functions of the mobile phone, in some embodiments the touch panel 331 may be integrated with the display panel 341 to implement the input and output functions of the mobile phone.
The mobile phone may further include at least one sensor 350, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor may adjust the brightness of the display panel 341 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 341 and/or the backlight when the mobile phone moves to the ear. As one kind of motion sensor, an accelerometer sensor may detect the magnitude of acceleration in various directions (generally three axes), and may detect the magnitude and direction of gravity when the mobile phone is stationary; it may be used for applications that recognize the posture of the mobile phone (such as horizontal and vertical screen switching, related games, and magnetometer posture calibration) and for vibration recognition related functions (such as pedometer and tapping). Other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor may also be configured for the mobile phone, which will not be described herein again.
The audio circuit 360 can convert received audio data into an electrical signal and transmit it to the speaker 361, which converts it into a sound signal for output. On the other hand, the microphone 362 converts a collected sound signal into an electrical signal, which is received by the audio circuit 360 and converted into audio data; the audio data is then processed by the processor 380 and transmitted to another mobile phone via the RF circuit 310, or output to the memory 320 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 370, and provides wireless broadband internet access for the user. Although fig. 13 shows the WiFi module 370, it is understood that it does not belong to the essential constitution of the handset, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 380 is the control center of the handset: it uses various interfaces and lines to connect the various parts of the entire handset, and performs the various functions of the handset and processes data by running or executing the software programs and/or modules stored in the memory 320 and invoking the data stored in the memory 320, thereby monitoring the handset as a whole. Optionally, the processor 380 may include one or more processing units; optionally, the processor 380 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication.
The handset also includes a power supply 390 (e.g., a battery) for powering the various components, optionally, the power supply may be logically connected to the processor 380 through a power management system, so that the power management system may be used to manage charging, discharging, and power consumption.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 380 included in the terminal device further has the following functions:
acquiring an image set, wherein the image set includes at least a first image and a second image, and the first image is a frame image preceding the second image;
acquiring a depth feature set based on the image set, wherein the depth feature set includes a first depth feature and a second depth feature, the first depth feature is a depth feature of a region to be detected in the first image, and the second depth feature is a depth feature of the region to be detected in the second image;
generating a target time sequence feature corresponding to the region to be detected according to the depth feature set;
and acquiring an object detection result through a time sequence detection model based on the target time sequence feature, wherein the object detection result is a detection result of the region to be detected in the second image.
Optionally, the processor 380 is specifically configured to perform the following steps:
acquiring the first depth feature through the target detection model based on the first image, wherein the first depth feature includes first features of the region to be detected at P scales, and P is an integer greater than or equal to 1;
acquiring a second depth feature through the target detection model based on the second image, wherein the second depth feature includes second features of the region to be detected at the P scales;
and generating the depth feature set according to the first depth feature and the second depth feature.
Optionally, the processor 380 is specifically configured to perform the following steps:
and cascading the first depth feature and the second depth feature to obtain the target time sequence feature, wherein the target time sequence feature is a feature matrix, and the first depth feature and the second depth feature are feature vectors.
Optionally, the processor 380 is specifically configured to perform the following steps:
cascading, based on a first scale, a first feature among the first depth features and a second feature among the second depth features to obtain a first target time sequence feature, wherein the first target time sequence feature is a feature matrix, the first feature and the second feature are feature vectors, and the first scale is one of the P scales;
cascading, based on a second scale, a first feature among the first depth features and a second feature among the second depth features to obtain a second target time sequence feature, wherein the second target time sequence feature is a feature matrix, the second scale is another one of the P scales, and the second scale and the first scale are different scales.
Optionally, the processor 380 is specifically configured to perform the following steps:
acquiring object detection features through the time sequence detection model based on the target time sequence features, wherein the object detection features are feature vectors;
and generating the object detection result according to the object detection characteristics, wherein the object detection result comprises the object occurrence probability, the class information and the position information in the region to be detected.
Optionally, the processor 380 is specifically configured to perform the following steps:
acquiring a first object detection feature through the time sequence detection model based on the first target time sequence feature, wherein the first object detection feature is a feature vector;
acquiring a second object detection feature through the time sequence detection model based on the second target time sequence feature, wherein the second object detection feature is a feature vector;
determining a first confidence degree according to the first object detection feature;
determining a second confidence degree according to the second object detection feature;
if the first confidence degree is greater than the second confidence degree, generating the object detection result according to the first object detection feature, wherein the object detection result includes an object occurrence probability, class information, and position information in the region to be detected;
and if the second confidence degree is greater than the first confidence degree, generating the object detection result according to the second object detection feature.
Optionally, the processor 380 is further configured to perform the following steps:
and if the object detection result indicates that the target object is included, acquiring an auxiliary operation result according to the executed target operation.
Fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application. The server 400 may vary greatly due to different configurations or performance, and may include one or more Central Processing Units (CPUs) 422 (e.g., one or more processors), a memory 432, and one or more storage media 430 (e.g., one or more mass storage devices) storing application programs 442 or data 444. The memory 432 and the storage media 430 may be transient storage or persistent storage. The programs stored in the storage media 430 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Further, the central processing unit 422 may be configured to communicate with the storage media 430 and execute, on the server 400, the series of instruction operations in the storage media 430.
The server 400 may also include one or more power supplies 426, one or more wired or wireless network interfaces 450, one or more input/output interfaces 458, and/or one or more operating systems 441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 14.
In the embodiment of the present application, the CPU 422 included in the server further has the following functions:
acquiring an image set, wherein the image set includes at least a first image and a second image, and the first image is a frame image preceding the second image;
acquiring a depth feature set based on the image set, wherein the depth feature set includes a first depth feature and a second depth feature, the first depth feature is a depth feature of a region to be detected in the first image, and the second depth feature is a depth feature of the region to be detected in the second image;
generating a target time sequence feature corresponding to the region to be detected according to the depth feature set;
and acquiring an object detection result through a time sequence detection model based on the target time sequence feature, wherein the object detection result is a detection result of the region to be detected in the second image.
Optionally, the CPU 422 is further configured to perform the following steps:
acquiring an image set to be trained, wherein the image set to be trained includes at least one image to be trained, and the image to be trained carries real annotation information;
based on the image set to be trained, obtaining the prediction annotation information corresponding to the image to be trained through a target detection model to be trained;
calculating a first loss function according to the real annotation information of the image to be trained and the prediction annotation information of the image to be trained;
and when the first loss function converges, obtaining the target detection model through training.
Optionally, the CPU 422 is specifically configured to perform the following steps:
determining position information of a prediction bounding box according to the prediction annotation information, wherein the position information of the prediction bounding box includes a central abscissa value, a central ordinate value, a height value, and a width value of the prediction bounding box;
determining position information of a real bounding box according to the real annotation information, wherein the position information of the real bounding box includes a central abscissa value, a central ordinate value, a height value, and a width value of the real bounding box;
determining a bounding box confidence according to the real annotation information and the prediction annotation information;
determining a prediction category according to the prediction annotation information;
determining a real category according to the real annotation information;
and calculating the first loss function based on the position information of the prediction bounding box, the position information of the real bounding box, the bounding box confidence, the prediction category, and the real category.
Optionally, the CPU 422 is specifically configured to perform the following steps:
acquiring a video to be processed, wherein the video to be processed comprises a plurality of frames of images to be processed;
and carrying out duplication elimination processing on the video to be processed to obtain the image set to be trained.
Optionally, the CPU 422 is specifically configured to perform the following steps:
acquiring a video to be processed, wherein the video to be processed comprises a plurality of frames of images to be processed;
acquiring the size of an object to be trained in the image to be processed;
and if the object size of the object to be trained is larger than or equal to the size threshold, determining the image to be processed as the image to be trained.
Optionally, the CPU 422 is further configured to perform the following steps:
acquiring an image set to be trained, wherein the image set to be trained comprises a plurality of images to be trained, and the images to be trained carry real labeling information;
generating a sample set to be trained according to the image set to be trained, wherein the sample set to be trained includes at least one sample to be trained, and each sample to be trained includes a plurality of images to be trained;
based on the sample set to be trained, obtaining the prediction marking information corresponding to the sample to be trained through a time sequence detection model to be trained;
calculating to obtain a second loss function according to the real annotation information of the sample to be trained and the prediction annotation information of the image to be trained;
and when the second loss function is converged, training to obtain a time sequence detection model.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The above-described apparatus embodiments are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be implemented through some interfaces, and the indirect coupling or communication connection between units or devices may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

  1. A method for detecting objects, comprising:
    acquiring a set of images, wherein the set of images at least comprises a first image and a second image, and the first image is a frame image preceding the second image;
    acquiring a depth feature set based on the set of images, wherein the depth feature set comprises a first depth feature and a second depth feature, the first depth feature is a depth feature of a region to be detected in the first image, and the second depth feature is a depth feature of the region to be detected in the second image;
    generating a target time sequence feature corresponding to the region to be detected according to the depth feature set;
    and acquiring an object detection result through a time sequence detection model based on the target time sequence feature, wherein the object detection result is a detection result of the region to be detected in the second image.
  2. The method of claim 1, wherein acquiring a depth feature set based on the set of images comprises:
    acquiring the first depth feature through a target detection model based on the first image, wherein the first depth feature comprises first features of the region to be detected at P scales, and P is an integer greater than or equal to 1;
    acquiring a second depth feature through the target detection model based on the second image, wherein the second depth feature comprises second features of the region to be detected at the P scales;
    generating the depth feature set according to the first depth feature and the second depth feature.
  3. The method according to claim 1, wherein generating the target time sequence feature corresponding to the region to be detected according to the depth feature set comprises:
    cascading the first depth feature and the second depth feature to obtain the target time sequence feature, wherein the target time sequence feature is a feature matrix, and the first depth feature and the second depth feature are feature vectors.
  4. The method according to claim 2, wherein the generating a target time-series feature corresponding to the region to be detected according to the depth feature set comprises:
    performing cascade processing on a first feature in the first depth feature and a second feature in the second depth feature at a first scale to obtain a first target time-series feature, wherein the first target time-series feature is a feature matrix, the first feature and the second feature are feature vectors, and the first scale is one of the P scales; and
    performing cascade processing on a first feature in the first depth feature and a second feature in the second depth feature at a second scale to obtain a second target time-series feature, wherein the second target time-series feature is a feature matrix, the second scale is another one of the P scales, and the second scale is different from the first scale.
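Purely for illustration (storing the per-scale features as Python lists is an assumption), the scale-wise cascade can be written as one row-stack per scale, giving one 2 x D feature matrix for each of the P scales:

    import numpy as np

    P, D = 3, 256  # hypothetical number of scales and feature length
    first_depth_feature = [np.random.rand(D) for _ in range(P)]   # from the first image
    second_depth_feature = [np.random.rand(D) for _ in range(P)]  # from the second image

    # Cascade the first and second features at each scale separately.
    target_time_series_features = [
        np.stack([first_depth_feature[s], second_depth_feature[s]], axis=0)
        for s in range(P)
    ]
    assert all(m.shape == (2, D) for m in target_time_series_features)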
  5. The method according to claim 3, wherein the acquiring an object detection result through a time-series detection model based on the target time-series feature comprises:
    acquiring an object detection feature through the time-series detection model based on the target time-series feature, wherein the object detection feature is a feature vector; and
    generating the object detection result according to the object detection feature, wherein the object detection result comprises an object occurrence probability, class information, and position information in the region to be detected.
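As a hedged illustration of how an object detection feature vector might be decoded into the result named in this claim (the layout of the vector as [objectness, class scores, box] is an assumption, not something fixed by the claim):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def decode_detection_feature(feature, num_classes):
        # Assumed layout: [objectness, class scores..., cx, cy, w, h].
        feature = np.asarray(feature, dtype=np.float32)
        probability = sigmoid(feature[0])                    # object occurrence probability
        scores = feature[1:1 + num_classes]
        class_probs = np.exp(scores - scores.max())
        class_probs /= class_probs.sum()                     # class information
        box = feature[1 + num_classes:1 + num_classes + 4]   # position information
        return {"probability": float(probability),
                "class_id": int(class_probs.argmax()),
                "box": box.tolist()}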
  6. The method according to claim 4, wherein the acquiring an object detection result through a time-series detection model based on the target time-series feature comprises:
    acquiring a first object detection feature through the time-series detection model based on the first target time-series feature, wherein the first object detection feature is a feature vector;
    acquiring a second object detection feature through the time-series detection model based on the second target time-series feature, wherein the second object detection feature is a feature vector;
    determining a first confidence level according to the first object detection feature;
    determining a second confidence level according to the second object detection feature;
    if the first confidence level is greater than the second confidence level, generating the object detection result according to the first object detection feature, wherein the object detection result comprises an object occurrence probability, class information, and position information in the region to be detected; and
    if the second confidence level is greater than the first confidence level, generating the object detection result according to the second object detection feature.
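The following sketch shows one way the two scale-specific confidences might be compared (the assumption that the first element of each detection feature encodes a confidence logit is illustrative only); the selected feature would then be decoded into the detection result as in the sketch after claim 5:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def select_by_confidence(first_detection_feature, second_detection_feature):
        # Assumed layout: element 0 of each feature vector is the confidence logit.
        first_confidence = sigmoid(first_detection_feature[0])
        second_confidence = sigmoid(second_detection_feature[0])
        # Keep whichever scale's detection feature is more confident.
        if first_confidence > second_confidence:
            return first_detection_feature, first_confidence
        return second_detection_feature, second_confidence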
  7. The method according to claim 1, wherein after the acquiring an object detection result through a time-series detection model based on the target time-series feature, the method further comprises:
    if it is determined from the object detection result that a target object is included, acquiring an auxiliary operation result according to an executed target operation.
  8. The method according to any one of claims 1 to 7, further comprising:
    acquiring an image set to be trained, wherein the image set to be trained comprises at least one image to be trained, and the image to be trained carries real annotation information;
    acquiring, based on the image set to be trained, prediction annotation information corresponding to the image to be trained through a target detection model to be trained;
    calculating a first loss function according to the real annotation information of the image to be trained and the prediction annotation information of the image to be trained; and
    when the first loss function converges, obtaining the target detection model through training.
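For illustration only, a minimal training loop under the assumptions that the target detection model to be trained is a PyTorch module and that convergence is declared when the change of the first loss falls below a tolerance (both assumptions, not requirements of the claim):

    import torch

    def train_target_detection_model(model, data_loader, compute_first_loss,
                                     lr=1e-4, tol=1e-4, max_epochs=100):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        previous_loss = None
        for _ in range(max_epochs):
            epoch_loss = 0.0
            for images, real_annotations in data_loader:
                predicted_annotations = model(images)  # prediction annotation information
                loss = compute_first_loss(real_annotations, predicted_annotations)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()
            # Treat a sufficiently small change of the epoch loss as convergence.
            if previous_loss is not None and abs(previous_loss - epoch_loss) < tol:
                break
            previous_loss = epoch_loss
        return model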
  9. The method according to claim 8, wherein the calculating a first loss function according to the real annotation information of the image to be trained and the prediction annotation information of the image to be trained comprises:
    determining position information of a predicted bounding box according to the prediction annotation information, wherein the position information of the predicted bounding box comprises a center abscissa value, a center ordinate value, a height value, and a width value of the predicted bounding box;
    determining position information of a real bounding box according to the real annotation information, wherein the position information of the real bounding box comprises a center abscissa value, a center ordinate value, a height value, and a width value of the real bounding box;
    determining a bounding-box confidence according to the real annotation information and the prediction annotation information;
    determining a predicted class according to the prediction annotation information;
    determining a real class according to the real annotation information; and
    calculating the first loss function based on the position information of the predicted bounding box, the position information of the real bounding box, the bounding-box confidence, the predicted class, and the real class.
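Below is a hedged, YOLO-style reading of this claim (the squared-error terms and their equal weighting are assumptions; the claim itself only names the quantities that enter the loss):

    import numpy as np

    def first_loss(pred_box, real_box, box_confidence, real_confidence,
                   pred_class_probs, real_class_onehot):
        # pred_box / real_box: (center_x, center_y, height, width).
        coord_loss = float(np.sum((np.asarray(pred_box) - np.asarray(real_box)) ** 2))
        confidence_loss = float((box_confidence - real_confidence) ** 2)
        class_loss = float(np.sum(
            (np.asarray(pred_class_probs) - np.asarray(real_class_onehot)) ** 2))
        return coord_loss + confidence_loss + class_loss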
  10. The method according to claim 8, wherein the acquiring an image set to be trained comprises:
    acquiring a video to be processed, wherein the video to be processed comprises a plurality of frames of images to be processed; and
    performing de-duplication processing on the video to be processed to obtain the image set to be trained.
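One possible de-duplication criterion (an assumption for illustration; the claim does not fix the criterion) is to drop a frame whose mean absolute pixel difference from the last kept frame is below a threshold:

    import numpy as np

    def deduplicate_frames(frames, diff_threshold=2.0):
        # Keep a frame only if it differs enough from the previously kept frame.
        kept, last = [], None
        for frame in frames:
            current = frame.astype(np.float32)
            if last is None or np.abs(current - last).mean() > diff_threshold:
                kept.append(frame)
                last = current
            # otherwise the frame is treated as a near-duplicate and discarded
        return kept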
  11. The method according to claim 8, wherein the acquiring an image set to be trained comprises:
    acquiring a video to be processed, wherein the video to be processed comprises a plurality of frames of images to be processed;
    acquiring an object size of an object to be trained in the image to be processed; and
    if the object size of the object to be trained is greater than or equal to a size threshold, determining the image to be processed as the image to be trained.
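A minimal sketch of the size filter, assuming each annotated object carries a bounding box (x1, y1, x2, y2) and that "object size" means the box area (both assumptions made only for illustration):

    def select_training_images(frames_with_boxes, size_threshold):
        # frames_with_boxes: list of (frame, [(x1, y1, x2, y2), ...]) pairs.
        images_to_train = []
        for frame, boxes in frames_with_boxes:
            sizes = [(x2 - x1) * (y2 - y1) for (x1, y1, x2, y2) in boxes]
            # Keep the frame if any object to be trained reaches the size threshold.
            if any(size >= size_threshold for size in sizes):
                images_to_train.append(frame)
        return images_to_train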
  12. The method according to any one of claims 1 to 7, further comprising:
    acquiring an image set to be trained, wherein the image set to be trained comprises a plurality of images to be trained, and the images to be trained carry real annotation information;
    generating a sample set to be trained according to the image set to be trained, wherein the sample set to be trained comprises at least one sample to be trained, and the sample to be trained comprises a plurality of images to be trained;
    acquiring, based on the sample set to be trained, prediction annotation information corresponding to the sample to be trained through a time-series detection model to be trained;
    calculating a second loss function according to the real annotation information of the sample to be trained and the prediction annotation information of the sample to be trained; and
    when the second loss function converges, obtaining the time-series detection model through training.
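Illustratively (the window length and sliding step are assumptions, not fixed by the claim), samples for the time-series detection model can be formed by grouping consecutive annotated frames:

    def build_training_samples(images_to_train, window=2, step=1):
        # Each sample to be trained is a short run of consecutive images to be trained.
        samples = []
        for start in range(0, len(images_to_train) - window + 1, step):
            samples.append(images_to_train[start:start + window])
        return samples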
  13. An object detection apparatus, comprising:
    an acquiring module, configured to acquire an image set, wherein the image set comprises at least a first image and a second image, and the first image is a frame preceding the second image;
    the acquiring module being further configured to acquire a depth feature set based on the image set, wherein the depth feature set comprises a first depth feature and a second depth feature, the first depth feature is a depth feature of a region to be detected in the first image, and the second depth feature is a depth feature of the region to be detected in the second image;
    a generating module, configured to generate a target time-series feature corresponding to the region to be detected according to the depth feature set acquired by the acquiring module;
    the acquiring module being further configured to acquire an object detection result through a time-series detection model based on the target time-series feature generated by the generating module, wherein the object detection result is a detection result of the region to be detected in the second image.
  14. An electronic device, comprising a memory, a transceiver, a processor, and a bus system;
    wherein the memory is configured to store a program;
    the processor is configured to execute the program in the memory, including performing the method according to any one of claims 1 to 12; and
    the bus system is configured to connect the memory and the processor, so that the memory and the processor communicate with each other.
  15. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 12.
CN201910989269.XA 2019-10-17 2019-10-17 object detection method, related device and equipment Pending CN110738211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910989269.XA CN110738211A (en) 2019-10-17 2019-10-17 object detection method, related device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910989269.XA CN110738211A (en) 2019-10-17 2019-10-17 object detection method, related device and equipment

Publications (1)

Publication Number Publication Date
CN110738211A true CN110738211A (en) 2020-01-31

Family

ID=69270053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910989269.XA Pending CN110738211A (en) 2019-10-17 2019-10-17 object detection method, related device and equipment

Country Status (1)

Country Link
CN (1) CN110738211A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738206A (en) * 2020-07-08 2020-10-02 浙江浙能天然气运行有限公司 Excavator detection method for unmanned aerial vehicle inspection based on CenterNet
CN111813996A (en) * 2020-07-22 2020-10-23 四川长虹电器股份有限公司 Video searching method based on sampling parallelism of single frame and continuous multi-frame
CN112288778A (en) * 2020-10-29 2021-01-29 电子科技大学 Infrared small target detection method based on multi-frame regression depth network
CN112651920A (en) * 2020-07-24 2021-04-13 深圳市唯特视科技有限公司 PCB bare board line flaw detection method and device and electronic equipment
WO2021155679A1 (en) * 2020-02-08 2021-08-12 腾讯科技(深圳)有限公司 Target positioning method, apparatus and system
CN113297570A (en) * 2021-05-21 2021-08-24 浙江工业大学 Convolution neural network-based application program online attack method
CN113822172A (en) * 2021-08-30 2021-12-21 中国科学院上海微系统与信息技术研究所 Video spatiotemporal behavior detection method
TWI766237B (en) * 2020-02-12 2022-06-01 國立臺中科技大學 Object at sea distance measuring system
WO2022142702A1 (en) * 2020-12-31 2022-07-07 北京达佳互联信息技术有限公司 Video image processing method and apparatus
CN115272340A (en) * 2022-09-29 2022-11-01 江苏智云天工科技有限公司 Industrial product defect detection method and device
CN117409285A (en) * 2023-12-14 2024-01-16 先临三维科技股份有限公司 Image detection method and device and electronic equipment

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021155679A1 (en) * 2020-02-08 2021-08-12 腾讯科技(深圳)有限公司 Target positioning method, apparatus and system
TWI766237B (en) * 2020-02-12 2022-06-01 國立臺中科技大學 Object at sea distance measuring system
CN111738206B (en) * 2020-07-08 2020-11-17 浙江浙能天然气运行有限公司 Excavator detection method for unmanned aerial vehicle inspection based on CenterNet
CN111738206A (en) * 2020-07-08 2020-10-02 浙江浙能天然气运行有限公司 Excavator detection method for unmanned aerial vehicle inspection based on CenterNet
CN111813996A (en) * 2020-07-22 2020-10-23 四川长虹电器股份有限公司 Video searching method based on sampling parallelism of single frame and continuous multi-frame
CN112651920A (en) * 2020-07-24 2021-04-13 深圳市唯特视科技有限公司 PCB bare board line flaw detection method and device and electronic equipment
CN112288778A (en) * 2020-10-29 2021-01-29 电子科技大学 Infrared small target detection method based on multi-frame regression depth network
WO2022142702A1 (en) * 2020-12-31 2022-07-07 北京达佳互联信息技术有限公司 Video image processing method and apparatus
CN113297570A (en) * 2021-05-21 2021-08-24 浙江工业大学 Convolution neural network-based application program online attack method
CN113822172A (en) * 2021-08-30 2021-12-21 中国科学院上海微系统与信息技术研究所 Video spatiotemporal behavior detection method
CN115272340A (en) * 2022-09-29 2022-11-01 江苏智云天工科技有限公司 Industrial product defect detection method and device
CN117409285A (en) * 2023-12-14 2024-01-16 先临三维科技股份有限公司 Image detection method and device and electronic equipment
CN117409285B (en) * 2023-12-14 2024-04-05 先临三维科技股份有限公司 Image detection method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN110738211A (en) object detection method, related device and equipment
CN109107161B (en) Game object control method, device, medium and equipment
CN108434740B (en) Method and device for determining policy information and storage medium
CN110909630B (en) Abnormal game video detection method and device
CN109951654A (en) A kind of method of Video Composition, the method for model training and relevant apparatus
CN109918975A (en) A kind of processing method of augmented reality, the method for Object identifying and terminal
CN111813532B (en) Image management method and device based on multitask machine learning model
CN110704661B (en) Image classification method and device
CN110766081B (en) Interface image detection method, model training method and related device
CN111672109B (en) Game map generation method, game testing method and related device
CN109145809B (en) Notation processing method and device and computer readable storage medium
CN110852942B (en) Model training method, and media information synthesis method and device
CN112990390B (en) Training method of image recognition model, and image recognition method and device
CN109993234B (en) Unmanned driving training data classification method and device and electronic equipment
CN110781881A (en) Method, device, equipment and storage medium for identifying match scores in video
CN110555337B (en) Method and device for detecting indication object and related equipment
CN110516113B (en) Video classification method, video classification model training method and device
CN113723378B (en) Model training method and device, computer equipment and storage medium
CN111598169B (en) Model training method, game testing method, simulation operation method and simulation operation device
CN113284142A (en) Image detection method, image detection device, computer-readable storage medium and computer equipment
CN111265881B (en) Model training method, content generation method and related device
CN110448909B (en) Method and device for outputting result of target role in application and medium
CN112270238A (en) Video content identification method and related device
CN116958715A (en) Method and device for detecting hand key points and storage medium
CN112995757B (en) Video clipping method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40020855

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination