CN116994022A - Object detection method, model training method, device, electronic equipment and medium - Google Patents

Object detection method, model training method, device, electronic equipment and medium

Info

Publication number
CN116994022A
CN116994022A
Authority
CN
China
Prior art keywords
image
feature
features
sample
density
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211471828.6A
Other languages
Chinese (zh)
Inventor
徐东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211471828.6A
Publication of CN116994022A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an object detection method, a model training method, an apparatus, electronic equipment and a medium, relates to the technical field of artificial intelligence, and can be applied to scenes such as cloud technology, artificial intelligence, intelligent traffic and assisted driving. The method includes: performing feature extraction on an image to be detected to obtain image features, density features and spatial features of the image to be detected; fusing the image features, the density features and the spatial features to obtain a global fusion feature of the image to be detected; and performing object classification on a plurality of feature points corresponding to the global fusion feature based on the spatial features, the density features and the global fusion feature, to obtain object segmentation results corresponding to at least one target object. The technical solution of the application can improve the accuracy and reliability of object detection.

Description

Object detection method, model training method, device, electronic equipment and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an object detection method, a model training method, an apparatus, an electronic device, and a medium.
Background
Object detection is one of the important applications of computer vision processing technology and can be applied to scenarios such as object localization in a single image and continuous object localization across multiple images or video. Traditional object detection techniques include point, line and edge detection methods, watershed algorithms, frame difference methods and the like. Point, line and edge detection performs poorly on position detection in complex images; for example, it cannot accurately identify game objects that contain a large number of dot and line elements. Watershed algorithms require a large amount of detection computation and are time-consuming, which is unfavorable for continuous object localization. The frame difference method relies on pixel differences between adjacent images to distinguish the background from target objects, so missed detections occur when an object moves slowly. As a result, object detection accuracy and reliability are poor.
Disclosure of Invention
The application provides an object detection method, an object detection apparatus and a storage medium, which can significantly improve the accuracy and reliability of object detection.
In one aspect, the present application provides an object detection method, the method comprising:
acquiring an image to be detected, wherein the image to be detected comprises at least one target object;
extracting features of the image to be detected to obtain image features, density features and space features of the image to be detected;
Performing fusion processing on the image features, the density features and the space features to obtain global fusion features of the image to be detected;
and performing object classification processing on a plurality of feature points corresponding to the global fusion feature based on the spatial feature, the density feature and the global fusion feature to obtain object segmentation results corresponding to the at least one target object.
In another aspect, a method for training an object detection model is provided, the method comprising:
obtaining a sample training set, the sample training set comprising a sample image, the sample image comprising at least one sample object;
taking the sample image as the input of an initial detection model, and extracting the characteristics of the sample image to obtain sample image characteristics, sample density characteristics and sample space characteristics;
performing fusion processing on the sample image features, the sample density features and the sample space features to obtain global fusion features of the sample image;
performing object classification processing on a plurality of feature points corresponding to the global fusion feature based on the sample space feature, the sample density feature and the global fusion feature to obtain object segmentation results corresponding to the at least one sample object;
performing reconstruction loss calculation based on the sample image features and the object segmentation results corresponding to the at least one sample object respectively, to obtain a model loss;
training the initial detection model according to the model loss to obtain an object detection model, wherein the object detection model is applied to the object detection method according to any one of claims 1-8.
Another aspect provides an object detection apparatus, the apparatus comprising:
an image acquisition module, configured to acquire an image to be detected, the image to be detected comprising at least one target object;
a feature extraction module, configured to perform feature extraction on the image to be detected to obtain image features, density features and spatial features of the image to be detected;
a fusion processing module, configured to fuse the image features, the density features and the spatial features to obtain a global fusion feature of the image to be detected;
an object classification module, configured to perform object classification on a plurality of feature points corresponding to the global fusion feature based on the spatial features, the density features and the global fusion feature, to obtain object segmentation results corresponding to the at least one target object.
In another aspect, an object detection model training apparatus is provided, the apparatus comprising:
a sample acquisition module, configured to obtain a sample training set, the sample training set comprising a sample image, the sample image comprising at least one sample object;
a sample feature extraction module, configured to take the sample image as the input of an initial detection model and perform feature extraction on the sample image to obtain sample image features, sample density features and sample spatial features;
a sample fusion processing module, configured to fuse the sample image features, the sample density features and the sample spatial features to obtain a global fusion feature of the sample image;
a sample classification module, configured to perform object classification on a plurality of feature points corresponding to the global fusion feature based on the sample spatial features, the sample density features and the global fusion feature, to obtain object segmentation results corresponding to the at least one sample object;
a loss calculation module, configured to perform reconstruction loss calculation based on the sample image features and the object segmentation results corresponding to the at least one sample object, to obtain a model loss;
a model training module, configured to train the initial detection model according to the model loss to obtain an object detection model, the object detection model being applied to the object detection method described above.
In another aspect, a computer device is provided, the device comprising a processor and a memory, the memory storing at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by the processor to implement an object detection method as described above.
Another aspect provides a computer readable storage medium having stored therein at least one instruction or at least one program loaded and executed by a processor to implement an object detection method or an object detection model training method as described above.
In another aspect, a server is provided, where the server includes a processor and a memory, where at least one instruction or at least one program is stored, where the at least one instruction or the at least one program is loaded and executed by the processor to implement an object detection method or an object detection model training method as described above.
In another aspect, a terminal is provided, where the terminal includes a processor and a memory, where at least one instruction or at least one program is stored, where the at least one instruction or the at least one program is loaded and executed by the processor to implement an object detection method or an object detection model training method as described above.
Another aspect provides a computer program product or computer program comprising computer instructions which, when executed by a processor, implement an object detection method or object detection model training method as described above.
The object detection method, the device, the equipment, the storage medium, the server, the terminal, the computer program and the computer program product provided by the application have the following technical effects:
according to the technical scheme, the image to be detected is obtained, and the image to be detected comprises at least one target object; the method comprises the steps of extracting features of an image to be detected to obtain image features, density features and space features of the image to be detected, carrying out fusion processing on the image features, the density features and the space features to obtain global fusion features of the image to be detected, further forming a fusion feature space containing multi-dimensional semantic information, then carrying out object classification processing on a plurality of feature points corresponding to the global fusion features based on the space features, the density features and the global fusion features to obtain object segmentation results corresponding to at least one target object, so as to realize accurate segmentation and positioning of object detection in a single image by combining multi-dimensional semantic information, avoid noise interference of point line elements and the like in the object, improve the efficiency, accuracy and reliability of object detection, and realize continuous positioning of the object by comparing the object segmentation results between frames without depending on pixel value differences between frames, avoid missing detection, overcome detection obstruction caused by object detection differences caused by decoupling of frames, offset and the like, and optimize detection effect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the application, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application;
fig. 2 is a schematic flow chart of an object detection method according to an embodiment of the present application;
FIG. 3 is a flowchart of another object detection method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a framework of an object detection model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a density feature extraction region provided by an embodiment of the present application;
FIG. 6 is a flowchart of another object detection method according to an embodiment of the present application;
FIG. 7 is a flowchart of another object detection method according to an embodiment of the present application;
FIG. 8 is an exemplary diagram of a set of images to be detected (left), a feature map (middle), and a super-pixel segmented image (right) provided by an embodiment of the present application;
FIG. 9 is a video frame of a game provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a solution framework provided by an embodiment of the present application;
FIG. 11 is a flow chart of an object detection model training method according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a frame of an object detection apparatus according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a training apparatus for object detection models according to an embodiment of the present application;
fig. 14 is a block diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or sub-modules is not necessarily limited to those steps or sub-modules that are expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or sub-modules that are not expressly listed.
Before describing the embodiments of the present application in further detail, the terms involved in the embodiments of the present application are explained so that they can be used in the following description.
Artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see". More specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognizing and measuring targets, and further performs graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multi-dimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
In recent years, with research and progress of artificial intelligence technology, the artificial intelligence technology is widely applied in a plurality of fields, and the scheme provided by the embodiment of the application relates to the technology of artificial intelligence such as machine learning/deep learning, natural language processing and the like, and is specifically described by the following embodiments.
Referring to fig. 1, fig. 1 is a schematic diagram of an application environment provided in an embodiment of the present application, and as shown in fig. 1, the application environment may at least include a terminal 01 and a server 02. In practical applications, the terminal 01 and the server 02 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The server 02 in the embodiment of the present application may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content distribution networks), and basic cloud computing services such as big data and artificial intelligent platforms.
Specifically, cloud technology (Cloud technology) refers to a hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or a local area network to implement the computation, storage, processing and sharing of data. Cloud technology can be applied to many fields, such as medical cloud, cloud internet of things, cloud security, cloud education, cloud conferencing, artificial intelligence cloud services, cloud applications, cloud calling and cloud social networking. It is an application based on the cloud computing business model: cloud technology distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can obtain computing power, storage space and information services on demand. The network that provides the resources is called the "cloud", and to users the resources in the "cloud" appear infinitely expandable and can be acquired on demand, used on demand, expanded on demand and paid for on demand. As a basic capability provider of cloud computing, a cloud computing resource pool (generally called an IaaS (Infrastructure as a Service) platform) is established, and multiple types of virtual resources are deployed in the resource pool for external clients to select and use. The cloud computing resource pool mainly includes computing devices (virtualized machines, including operating systems), storage devices and network devices.
According to logical function division, a PaaS (Platform as a Service) layer can be deployed on the IaaS layer, and a SaaS (Software as a Service) layer can be deployed on the PaaS layer; SaaS can also be deployed directly on IaaS. PaaS is a platform on which software runs, such as a database or a web container. SaaS refers to various kinds of business software, such as web portals and SMS bulk senders. Generally, SaaS and PaaS are upper layers relative to IaaS.
Specifically, the server 02 may include an entity device, may include a network communication sub-module, a processor, a memory, and the like, may include software running in the entity device, and may include an application program and the like.
Specifically, the terminal 01 may include a smart phone, a desktop computer, a tablet computer, a notebook computer, a digital assistant, an augmented reality (augmented reality, AR)/Virtual Reality (VR) device, an intelligent voice interaction device, an intelligent home appliance, an intelligent wearable device, a vehicle-mounted terminal device, and other types of entity devices, and may also include software running in the entity devices, such as an application program, and the like.
In the embodiment of the present application, the terminal 01 may be configured to send an object detection instruction and an image to be detected to the server 02, so that the server 02 performs a corresponding object detection operation. The server 02 may be used to provide an object detection service to obtain an object detection result. In particular, the server 02 may also be used for model training services of an initial detection model to obtain an object detection model, storing a sample training set and model training data, etc.
Further, it should be understood that fig. 1 illustrates an application environment of only one object detection method, and the application environment may include more or fewer nodes, and the present application is not limited herein.
The application environment, or the terminal 01 and the server 02 in the application environment, according to the embodiments of the present application may be a distributed system formed by connecting a client, a plurality of nodes (any form of computing device in an access network, such as a server, a user terminal) through a network communication. The distributed system may be a blockchain system that may provide the object detection service, the data storage service, and the like described above.
The following describes an object detection method based on the above application environment, and the embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent traffic, driving assistance, and the like. Referring to fig. 2, fig. 2 is a flow chart of an object detection method according to an embodiment of the present application, and the present specification provides method operation steps according to an embodiment or the flow chart, but may include more or less operation steps based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. When implemented in a real system or server product, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel processor or multithreaded environment). Specifically, as shown in fig. 2, the method may include the following steps S201-S207.
S201: and acquiring an image to be detected.
Specifically, the image to be detected includes at least one target object, which may be, but is not limited to, a photo, a video frame, a screenshot, etc., and the target object is an instance element in the image to be detected, and may include, but is not limited to, a living object, a virtual object, a road element object, or an object of an article, etc. Taking a game image scene as an example, the image to be detected may be a game video frame of a game video, the target object may be a virtual character object in the game video frame, or the like, and in some cases, the image background area may also be used as a target object. Referring to fig. 9, fig. 9 is a schematic diagram of a video frame of a game, in which a virtual object framed in the video frame is a target object to be detected. Specifically, the image to be detected may be an image obtained by performing uniform-size preprocessing on the original image.
S203: and extracting the characteristics of the image to be detected to obtain the image characteristics, density characteristics and space characteristics of the image to be detected.
Specifically, the image feature may be a reconstructed image corresponding to the image to be detected, where the reconstructed image is a large-pixel image obtained by merging and denoising part of the pixels in the image to be detected. The density feature includes pixel density information of each image feature point in the image feature, and may specifically be a pixel density matrix corresponding to the image feature. The pixel density information represents the degree of pixel change between an image feature point and its neighborhood feature points: the larger the pixel density corresponding to the pixel density information, the greater the degree of pixel change, and conversely the smaller the degree of pixel change. For example, in the background area of an image, the pixel values of adjacent image feature points generally change little, so the pixel density corresponding to the pixel density information of the image feature points in the background area is relatively small. The spatial feature includes the coordinate information of each image feature point in the image feature, and may specifically be a coordinate information matrix corresponding to the image feature. The coordinate information of an image feature point can be determined based on the pixel coordinates in the image to be detected: since the process in which the image to be detected is turned into the image feature through feature extraction involves image transformation and pixel merging, the pixel coordinates in the image to be detected are correspondingly mapped to the image feature to obtain the spatial feature corresponding to the image feature. It will be appreciated that the spatial feature carries low-level semantic information of the image to be detected, while the image feature and the density feature carry high-level semantic information of the image.
In practical applications, referring to fig. 3, S203 may specifically include:
s2031: and carrying out convolution processing and deconvolution processing on the image to be detected to obtain image characteristics, wherein the image characteristics comprise a plurality of image characteristic points.
Specifically, the above convolution processing may be performed on the image to be detected based on the encoder of the object detection model, and the above deconvolution processing may be performed by the decoder of the object detection model.
In some embodiments, referring to fig. 4, the encoder includes a first convolution layer, a pooling layer, and a second convolution layer, and the image to be detected is input into the first convolution layer as an input image, and a first convolution operation is performed to obtain an initial convolution characteristic, where the first convolution operation may be a low-frequency filtering convolution; carrying out pooling treatment on the initial convolution characteristics through a pooling layer to obtain pooled characteristics so as to reduce the calculated amount of characteristic treatment; and then, carrying out a second convolution operation on the pooled features through a second convolution layer to obtain convolution output features, wherein the second convolution operation can be smooth filtering processing, and the convolution output features are frequency domain feature images corresponding to the images to be detected.
In some embodiments, referring to fig. 4, the decoder may include a deconvolution layer, an activation layer, an upsampling layer and an image segmentation layer, where deconvolution and anti-pooling operations are performed on the convolved output features by the deconvolution layer to obtain an initial restored image, and then spatial activation processing is performed by the activation layer, and the obtained result is input to the upsampling layer to perform upsampling processing, so as to obtain the image features.
In some embodiments, the convolution layer includes a plurality of convolution processing branches, and the convolution layer performs a first convolution operation on the image to be detected, and the obtained initial convolution feature includes a plurality of convolution feature graphs, where the convolution feature graphs and the convolution processing branches are in one-to-one correspondence; correspondingly, the deconvolution processing also comprises deconvolution processing branches corresponding to the convolution processing branches one by one, so that the up-sampling layer outputs a plurality of feature maps corresponding to the convolution feature maps one by one, namely, the output restored image corresponding to each convolution feature map. Thus, a plurality of image features are obtained through one-time input, and a plurality of density features are obtained, so that the segmentation information amount in the subsequent segmentation process is increased, and the accuracy of object detection is improved. Meanwhile, model training can be achieved through multi-task learning, training efficiency is improved, the problem that training samples are insufficient can be solved, and the model training method is excellent in task performance optimization scenes of the pre-training model.
In one embodiment, the first convolution layer comprises 5 convolution networks constructed with 3×3 convolution kernels, i.e. 5 convolution processing branches, where the kernels are low-frequency filtering convolution kernels generated based on Gaussian smoothing filtering. The pooling layer performs 2×2 max pooling, i.e. a 2×2 sliding window is applied to the input features, the maximum value of each 2×2 area matrix is taken, and that maximum value represents the corresponding 2×2 area as a whole. The second convolution layer is a convolution network constructed with a 3×3 convolution kernel, which may be a smoothing filter kernel. The encoder thus outputs image features comprising a plurality of feature maps, i.e. a plurality of feature matrices [I_1, I_2, I_3, I_4, I_5], where I denotes a feature map in the image features. Further, the deconvolution layer comprises 5 deconvolution networks corresponding to the convolution networks constructed with the 3×3 convolution kernels, each connected to a 2×2 anti-pooling layer; the activation layer may be constructed based on the ReLU function, and the upsampling layer performs an upsampling operation on the output of the activation layer to obtain the restored image, the above process being equivalent to an image reconstruction process. Referring to fig. 8, fig. 8 is an exemplary diagram of a group consisting of an image to be detected (left), a feature map (middle) and a super-pixel segmented image (right). It can be seen that after the deconvolution, anti-pooling and upsampling processes, denoising is achieved; most image feature points in the feature map correspond to one pixel point of the image to be detected, and some image feature points correspond to a plurality of pixel points, for example about 3.
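For orientation, a minimal PyTorch-style sketch of one such convolution/deconvolution branch is given below. The channel counts, the single-channel input, the placement of the ReLU activation and the final 3×3 output convolution are illustrative assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class EncoderDecoderBranch(nn.Module):
    """One of the 5 convolution/deconvolution processing branches described above.
    Sketch only: 3x3 low-frequency convolution -> 2x2 max pooling -> 3x3 smoothing
    convolution (encoder), then a mirrored 3x3 deconvolution -> 2x2 anti-pooling ->
    ReLU -> output convolution (decoder). Sizes and initialization are assumptions."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)    # first convolution (low-frequency filtering)
        self.pool = nn.MaxPool2d(kernel_size=2, return_indices=True)  # 2x2 max pooling
        self.conv2 = nn.Conv2d(8, 8, kernel_size=3, padding=1)    # second convolution (smoothing filter)
        self.deconv = nn.ConvTranspose2d(8, 8, kernel_size=3, padding=1)  # deconvolution
        self.unpool = nn.MaxUnpool2d(kernel_size=2)                # 2x2 anti-pooling
        self.act = nn.ReLU()
        self.out = nn.Conv2d(8, 1, kernel_size=3, padding=1)       # restored (reconstructed) image

    def forward(self, x):
        f = self.conv1(x)
        p, idx = self.pool(f)
        encoded = self.conv2(p)            # frequency-domain feature expression
        d = self.deconv(encoded)
        u = self.unpool(d, idx)            # back to the input resolution
        restored = self.out(self.act(u))   # one feature map I_t of the image features
        return restored

# Five parallel branches give the feature maps [I_1, ..., I_5] from one input image.
branches = nn.ModuleList(EncoderDecoderBranch() for _ in range(5))
image = torch.rand(1, 1, 256, 256)         # grayscale image to be detected (assumed size)
feature_maps = [b(image) for b in branches]
```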
In this way, frequency-domain feature extraction of the image is realized by the encoder, and the decoder decodes and restores the required information: the encoder compresses the noisy image to obtain a frequency-domain feature expression, and the decoder decompresses it to achieve denoising. Taking the processing of a game video frame as an example, this process can remove the background noise and the noise introduced by edge elements such as dots and lines inside the target object (for example a virtual character object). Combining multi-dimensional feature extraction with denoising improves the accuracy of feature point clustering and reduces the difficulty of the clustering computation.
S2032: and taking each image characteristic point in the plurality of image characteristic points as a sliding window center point, and carrying out pixel difference processing by taking a preset sliding window as a density characteristic extraction area to obtain pixel density information of each image characteristic point.
In some embodiments, the pixel difference processing may be calculating a sum of squares of pixel differences between the sliding window center point and the neighborhood feature point, and accordingly, S2032 may be specifically: and taking the image characteristic points as sliding window center points of the preset sliding windows, determining the square sum of pixel difference values between the image characteristic points and other image characteristic points in a density characteristic processing area set by the preset sliding windows, and obtaining pixel density information of each image characteristic point. Thus, through the square sum of pixel difference values, the pixel information of surrounding feature points is synthesized, the expression accuracy of the pixel density information on the neighborhood difference of the image feature points is improved, and the classification accuracy of the feature points is further improved.
In some embodiments, the size of the preset sliding window is (1+2m) × (1+2n), and the pixel density value corresponding to the pixel density information is calculated from the squared pixel differences within the window, where r(x, y) is the pixel density value of the image feature point (x, y), f(x, y) is the pixel value of the image feature point (x, y), f(x−m, y−n) is the pixel value of the image feature point (x−m, y−n), f(x, y) − f(x−m, y−n) represents the pixel difference between the two image feature points, and C is a normalization constant, for example 256×256. The pixel density value of an image feature point lies in the range [0, 1] and represents the pixel density information of that image feature point. According to this calculation, the larger the pixel differences between the neighborhood feature points and the central image feature point, the larger the pixel density value, indicating a greater degree of pixel change of the image feature point relative to its neighborhood, i.e. a greater density; conversely, the smaller the differences, the smaller the density. Taking the image background area as an example, the pixel density values of the image feature points in that area are small, i.e. the pixel differences are small, so in the object classification process the background area is more likely to be grouped into one class, and the super-pixel region formed by merging pixels is larger; the pixel density information is therefore one of the determining factors of the size of a segmentation region.
In one embodiment, the size of the preset sliding window may be 3*3, and the density feature processing area of the image feature points (x, y) is shown in fig. 5.
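As a concrete illustration of this computation, the sketch below builds such a pixel density map with NumPy for a 3×3 sliding window (m = n = 1). The normalization by C, the clipping to [0, 1] and the edge padding are assumptions made for the example rather than details taken from the patent.

```python
import numpy as np

def pixel_density_map(feature_map: np.ndarray, m: int = 1, n: int = 1,
                      C: float = 256 * 256) -> np.ndarray:
    """Sum of squared pixel differences between each feature point and its
    (1+2m) x (1+2n) neighborhood, scaled by the constant C and clipped to [0, 1].
    Normalization and edge padding are illustrative assumptions."""
    padded = np.pad(feature_map.astype(np.float64), ((m, m), (n, n)), mode="edge")
    h, w = feature_map.shape
    density = np.zeros((h, w), dtype=np.float64)
    for dy in range(-m, m + 1):
        for dx in range(-n, n + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[m + dy:m + dy + h, n + dx:n + dx + w]
            density += (feature_map - neighbor) ** 2
    return np.clip(density / C, 0.0, 1.0)

# Flat background pixels give low density; busy foreground pixels give high density.
demo = np.zeros((6, 6)); demo[2:4, 2:4] = 200
print(pixel_density_map(demo).round(3))
```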
S2033: a density feature is generated based on pixel density information for each image feature point.
Specifically, after obtaining the pixel density value of each image feature point, carrying out pixel density value arrangement based on the space coordinates of the image feature points to obtain density features, namely obtaining a density matrix taking the pixel density values as elements.
S2034: and generating a spatial feature based on the spatial coordinate information of the image to be detected.
Specifically, mapping pixel coordinates of each pixel point in the image to be detected into a feature space of the image feature to obtain coordinate information of each image feature point of the image feature, further obtaining a space feature, namely a two-dimensional coordinate information matrix, and establishing a mapping relation between the pixel points in the image to be detected and the image feature points. The spatial feature can provide spatial marking information for subsequent feature fusion and feature point clustering, and the spatial feature synchronously marks and moves the feature points in the image transformation and feature transformation process, for example, when an object is classified, the feature points belonging to the same cluster category are combined and marked based on the spatial feature.
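For illustration, a minimal sketch of such a coordinate mapping follows. The nearest-pixel mapping from feature points back to image pixels is an assumption for the example, since the patent only specifies that pixel coordinates are mapped into the feature space.

```python
import numpy as np

def spatial_feature(image_h: int, image_w: int, feat_h: int, feat_w: int) -> np.ndarray:
    """Two-channel coordinate matrix for the feature map: each feature point stores
    the (x, y) pixel coordinate it maps back to in the image to be detected.
    Nearest-pixel mapping is an illustrative assumption."""
    ys = np.linspace(0, image_h - 1, feat_h).round().astype(int)
    xs = np.linspace(0, image_w - 1, feat_w).round().astype(int)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([grid_x, grid_y], axis=-1)   # shape (feat_h, feat_w, 2)

coords = spatial_feature(256, 256, 128, 128)
print(coords.shape, coords[0, 0], coords[-1, -1])   # (128, 128, 2) [0 0] [255 255]
```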
In this way, the primary semantic information of the image to be detected is extracted through the spatial features; denoising and semantic feature extraction are performed through the convolution and deconvolution operations using the image frequency-domain information; the depth semantic information of the image to be detected is obtained through density information extraction; and the feature points are synchronously position-marked and moved in combination with the spatial information of the input image, thereby realizing feature point classification based on multi-dimensional semantic information.
it can be understood that, in the case that the image feature includes a plurality of feature images corresponding to the image to be detected, the above-mentioned S2032-S2033 are performed for a plurality of image feature points in each feature image, so as to obtain a density feature corresponding to each feature image. Optionally, the step S2034 may be performed on a plurality of image feature points in each feature map, so as to obtain a spatial feature corresponding to each feature map.
S205: and carrying out fusion processing on the image features, the density features and the space features to obtain global fusion features of the image to be detected.
In the embodiment of the application, the obtained characteristics are fused to obtain the multidimensional information matrix to be used as the expression characteristics of the image to be detected, so that the global fusion characteristics integrate and express the primary and depth semantic information of the image to be detected, the rationality of the subsequent characteristic point distance calculation is improved, the information expression of the characteristic distance result is optimized, and the object detection effect is improved.
As described above, in the case where there are a plurality of processing branches, the image features include a plurality of feature maps corresponding to the image to be detected; accordingly, S205 may include:
s2051: respectively carrying out feature fusion on each feature map, the spatial feature and the density feature corresponding to the feature map in the plurality of feature maps to obtain fusion features corresponding to the feature maps respectively;
s2052: and adding the fusion features corresponding to the feature graphs to obtain global fusion features.
In particular, the summation here may be a simple summation. In one embodiment, the feature space expression of the global fusion feature F(x, y) is given by a formula in which b denotes the number of feature maps (for the feature matrices [I_1, I_2, I_3, I_4, I_5] described above, b is 5), f(x, y)_t denotes the feature map t, (x, y)_t denotes the spatial feature of the feature map t, r(x, y)_t denotes the density feature of the feature map t, K is a constant, and β_t is a model parameter; each pair of convolution and deconvolution processing branches in the encoder and decoder corresponds to one β_t. In some cases, the model parameters of the branch networks are shared, and correspondingly the β_t are shared as well. In some cases, β_t may be used as a coefficient adjusted according to the degree of loss.
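As an illustration of S2051-S2052, the sketch below fuses each feature map with its spatial and density features and then sums the per-branch results. The concrete fusion operator (stacking coordinates, density and feature value per feature point and weighting by β_t) is an assumption for demonstration, since the patent leaves the exact formula to its embodiment.

```python
import numpy as np

def global_fusion(feature_maps, spatial_feats, density_feats, betas=None):
    """Fuse each branch's feature map with its spatial and density features, then
    sum over branches. Stacking (x, y, density, value) per feature point and
    weighting by beta_t is an illustrative choice, not the patented formula."""
    b = len(feature_maps)
    betas = betas if betas is not None else np.ones(b)
    fused_per_branch = []
    for t in range(b):
        f = feature_maps[t][..., None]                       # feature value
        s = spatial_feats[t].astype(float)                   # (x, y) coordinates
        r = density_feats[t][..., None]                      # pixel density
        fused_per_branch.append(betas[t] * np.concatenate([s, r, f], axis=-1))
    return np.sum(fused_per_branch, axis=0)                  # global fusion feature

# Example with b = 5 branches of 128 x 128 feature maps.
maps = [np.random.rand(128, 128) for _ in range(5)]
spatial = [np.stack(np.meshgrid(np.arange(128), np.arange(128), indexing="xy"), axis=-1)] * 5
density = [np.random.rand(128, 128) for _ in range(5)]
F = global_fusion(maps, spatial, density)
print(F.shape)   # (128, 128, 4): x, y, density, feature value per feature point
```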
S207: and performing object classification processing on a plurality of feature points corresponding to the global fusion features based on the spatial features, the density features and the global fusion features to obtain object segmentation results corresponding to at least one target object.
In the embodiment of the application, the object segmentation result includes image segmentation area information of the target object, in some embodiments, the object segmentation result further includes object marking points of the target object, the image segmentation area information characterizes a segmentation position of the target object in the image to be detected, the object marking points are abstract position marks of the target object in the image to be detected, and the abstract position marks can be obtained by extracting pixel centers based on feature points corresponding to the image segmentation area of the target object in the global fusion feature. Specifically, according to the object segmentation result and the image characteristics corresponding to at least one target object, a super-pixel segmentation image corresponding to the image to be detected is further generated, and the super-pixel areas in the super-pixel segmentation image correspond to the image segmentation areas in a one-to-one correspondence manner or in a many-to-one correspondence manner. The same superpixel region may be marked with the same color.
Specifically, S205 and S207 above may be implemented by the image segmentation layer, which outputs the super-pixel segmented image.
In summary, the technical solution obtains low-level and high-level semantic information of the image through the extraction of image features, density features and spatial features, and forms a fused feature space containing multi-dimensional semantic information through the fusion processing. By combining the multi-dimensional semantic information, accurate segmentation and localization for object detection in a single image is achieved, noise interference from dot and line elements inside the object is avoided, and the efficiency, accuracy and reliability of object detection are improved. In a continuous object localization scenario, the method does not rely on pixel value differences between frames: continuous localization of an object can be achieved by comparing object segmentation results between frames, which avoids missed detections and, by decoupling from inter-frame differences, offsets and the like, overcomes object detection obstacles introduced by object occlusion, thereby optimizing the detection effect. In addition, the object mark point of each target object is obtained through the abstract position mark, which provides object matching reference information between adjacent images for continuous object localization in an object detection scenario with continuous images, improving the efficiency and accuracy of continuous object localization.
In practical applications, referring to fig. 6, S207 may specifically include:
s301: and determining a plurality of clustering centers corresponding to the plurality of feature points.
Specifically, the number of cluster centers is equal to or greater than the number of target objects. In the single image detection or in the object detection scene of the first image of the continuous multiple images, the cluster centers can be initialized based on a preset number, so as to obtain multiple cluster centers, wherein the preset number is greater than the upper limit of the number of target objects which can be identified in the single image by the object detection model, for example, the preset number can be 10. If the image to be detected is not the first image of the continuous multiple images, determining the number and the position of initial clustering centers corresponding to the image to be detected based on object mark points in a historical object segmentation result of the historical image or determining based on a target clustering center corresponding to the historical image, wherein the historical image can be a previous frame image of the image to be detected.
In some cases, a large number of objects exist in the continuous images but not all of them require continuous localization. Before the first image is detected, initial marking information of the target object, such as an object point or an object box, can be obtained to determine the target to be detected, and this marking then provides an image-limited region for the initial cluster centers during cluster center initialization. The object point or object box may be generated by an object selection operation on the first image in the display interface. A small sketch of the initialization step is shown below.
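In the sketch, reusing the previous frame's object mark points follows the description above, while the regular-grid placement of the preset number of centers and the helper name are illustrative assumptions.

```python
import numpy as np

def init_cluster_centers(feat_h: int, feat_w: int, prev_marks=None, preset_count: int = 10):
    """Cluster-center initialization sketch: reuse the object mark points of the
    previous frame when available, otherwise spread a preset number of centers on a
    regular grid over the feature map (grid placement is an assumption)."""
    if prev_marks is not None and len(prev_marks) > 0:
        return np.asarray(prev_marks, dtype=float)
    side = int(np.ceil(np.sqrt(preset_count)))
    ys = np.linspace(0, feat_h - 1, side)
    xs = np.linspace(0, feat_w - 1, side)
    grid = np.array([(x, y) for y in ys for x in xs])
    return grid[:preset_count]

centers = init_cluster_centers(128, 128)   # first frame: 10 grid-initialized centers
print(centers.shape)                       # (10, 2)
```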
S303: and carrying out feature distance processing on the plurality of feature points and the plurality of clustering centers based on the spatial features, the density features and the global fusion features to obtain feature distance index information.
Specifically, the characteristic distance index information characterizes the difference degree among discrete characteristic points, and the larger the characteristic distance index value corresponding to the characteristic distance index information is, the larger the difference degree is characterized, and otherwise, the smaller the difference degree is characterized. The feature distance index information combined with the image feature value, the space information and the density information is used as the distance measurement in the clustering process, namely the similarity measurement, so that the accuracy of feature point clustering is improved, and the object segmentation result is optimized.
In some embodiments, S303 may include:
s3031: and respectively determining the characteristic difference information between each characteristic point and each clustering center based on the characteristic values of a plurality of characteristic points in the global fusion characteristic.
In some cases, the clustering center may coincide with the feature points, in other cases, the clustering center is a sampling point between the feature points, then the feature values of the feature points around the clustering center may be sampled numerically to obtain the feature values of the clustering center, and similarly, the pixel density information and the spatial coordinate information of the surrounding feature points may be sampled numerically to obtain the pixel density information and the spatial coordinate information of the clustering center. The characteristic difference information is obtained by carrying out difference processing on the characteristic value of the characteristic point and the characteristic value of the clustering center.
In one embodiment, the feature difference value corresponding to the feature difference information is obtained by a formula in which O denotes a feature point, C denotes a cluster center, d_F(O_i, C_h) is the feature difference value between the feature point i and the cluster center h, x_i, y_i are the spatial coordinates of the feature point i, x_h, y_h are the spatial coordinates of the cluster center h, F(x_i, y_i) is the feature value of the feature point i, F(x_h, y_h) is the feature value of the cluster center h, and N is a constant, for example N is 2.
Specifically, for each feature point, feature difference information between the feature point and each cluster center is calculated as a measurement factor of feature distance.
In one embodiment, the feature points may be pixel points in the global fusion feature, and correspondingly, the feature values may be pixel values.
S3032: based on the spatial features, spatial distance information between each feature point and each clustering center is determined respectively.
In particular, the spatial feature includes the spatial coordinates of each feature point, and the spatial distance information is obtained by taking the difference between the spatial coordinates of the feature point and the spatial coordinates of the cluster center. In one embodiment, the spatial distance information is obtained from a formula in which d_l(O_i, C_h) is the spatial distance between the feature point i and the cluster center h, (x_i − x_h) is the X-axis distance between the feature point i and the cluster center h, and (y_i − y_h) is the Y-axis distance between the feature point i and the cluster center h.
S3033: and based on the density characteristics, respectively determining fusion density information corresponding to each characteristic point and each clustering center.
Specifically, the fusion density information may be obtained by performing fusion processing on pixel density information of the feature points and pixel density information of the clustering center, where the fusion processing may be addition processing or average processing. The fused density information may be a simple addition value, a weighted addition value, a square addition value, or the like of the pixel density information of the feature point and the pixel density information of the cluster center, or may be a simple arithmetic average value, geometric average, square average value, or the like of the two.
In one embodiment, the fusion density information is obtained from a formula in which r(O_i, C_h)' is the fusion density information corresponding to the feature point i and the cluster center h, r(x_i, y_i) is the pixel density value of the feature point i, and r(x_h, y_h) is the pixel density value of the cluster center h.
S3034: and fusing the characteristic difference information, the space distance information and the fusion density information to obtain characteristic distance index information between each characteristic point and each clustering center.
Specifically, the spatial distance information and the feature difference information are each positively correlated with the feature distance index information, while the fusion density information is inversely correlated with it; that is, the larger the spatial distance and the feature difference, the larger the feature distance index corresponding to the feature distance index information, and conversely the smaller the index.
In one embodiment, the feature distance index is obtained from a formula in which d_0(O_i, C_h) is the feature distance index between the feature point i and the cluster center h.
Therefore, on the basis of adopting the feature information of the global fusion features, the feature distance calculation is further carried out by combining the space distance and the fusion density, so that the feature distance index can fully express the space difference, the pixel change degree and the feature difference between two points, and the feature point classification effect is optimized.
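As an illustration only, the following is a minimal numpy sketch of S3031-S3034. The patent's formula images are not reproduced above, so the concrete expressions (an N-th power feature difference, a Euclidean spatial distance, additive density fusion, and a ratio-style combination) are assumptions chosen only to respect the stated relations: the index grows with the spatial distance and the feature difference and shrinks with the fused density. All function and parameter names are illustrative.

```python
import numpy as np

def feature_distance_index(points_xy, points_feat, points_density,
                           centers_xy, centers_feat, centers_density, n=2):
    """Sketch of S3031-S3034: combine feature difference, spatial distance and
    fused density into a feature distance index d_0 between every feature point
    (P of them) and every cluster center (H of them). points_xy: (P, 2),
    points_feat / points_density: (P,), centers_*: analogous with H rows."""
    # S3031: feature difference d_F -- assumed |F_i - F_h|^n, with n = 2 by default
    d_f = np.abs(points_feat[:, None] - centers_feat[None, :]) ** n

    # S3032: spatial distance d_l -- assumed Euclidean distance of the coordinates
    diff_xy = points_xy[:, None, :] - centers_xy[None, :, :]
    d_l = np.sqrt((diff_xy ** 2).sum(axis=-1))

    # S3033: fused density r' -- assumed simple addition of the two pixel densities
    r_fused = points_density[:, None] + centers_density[None, :]

    # S3034: d_0 grows with d_F and d_l and shrinks with r' (eps avoids division by zero)
    return (d_f + d_l) / (r_fused + 1e-8)
```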
S305: and clustering the plurality of feature points based on the feature distance index information to obtain object segmentation results corresponding to at least one target object.
Specifically, using the feature distance index information as the distance metric, the feature points in the neighborhood of each cluster center are classified and labeled; that is, the feature points are assigned to the categories corresponding to the different cluster centers, yielding the image segmentation region corresponding to each target object and realizing super-pixel region division.
It can be appreciated that, in the case where no initial marking information of the target object exists, every object in the image to be detected may be detected: the instance elements in the foreground are foreground target objects, and the image segmentation areas of different foreground target objects may be marked with different colors; the background area may be treated as a background target object and marked with a single color as a whole, or different background blocks, cut apart by the foreground target objects, may be marked with different colors.
In some embodiments, S305 may include:
S3051: based on the characteristic distance index information, determining a neighborhood characteristic point corresponding to each clustering center from a plurality of feature points to obtain a characteristic point classification result.
Specifically, feature distance index information is used as distance measurement, and each feature point is distributed to a corresponding cluster center in the neighborhood of each cluster center, which is equivalent to assigning a specific super-pixel label. The feature point classification result comprises feature point sets corresponding to each cluster center, and comprises neighborhood feature points of the cluster centers, wherein the neighborhood feature points form a cluster image area corresponding to the cluster centers. It can be understood that the smaller the feature distance index information between the feature point and the cluster center, the larger the probability that the feature point belongs to the neighborhood feature point of the cluster center, and conversely, the smaller the probability.
In some embodiments, in the case where no initial object mark exists, the feature distance index information between a single feature point and each of the cluster centers is determined, and the cluster center corresponding to the minimum feature distance index information is determined as the cluster center to which the feature point belongs; or, the classification probability index of the single feature point for each of the cluster centers is calculated based on the feature distance index information, and the cluster center corresponding to the maximum classification probability index is determined as the cluster center to which the feature point belongs.
In other embodiments, in the case where an initial object marker exists, that is, where only part of the objects in the image need to be detected, a distance threshold or a probability threshold may be set: if the feature distance index information is greater than the distance threshold, or if the classification probability index is smaller than the probability threshold, it is determined that the feature point does not belong to the neighborhood feature points of any cluster center; otherwise, the cluster center to which the feature point belongs is determined in the manner described above.
In one embodiment, the classification probability index is calculated using the following formula, where P(O_i, C_h) is the classification probability index that the feature point i belongs to the cluster center h, representing the probability that the feature point i is a neighborhood feature point of the cluster center h, and H is the number of cluster centers.
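The formula itself is not reproduced above; as an assumption consistent with the description (smaller feature distance index, higher probability), the sketch below uses a softmax over the negative distance indices. The function name and signature are illustrative.

```python
import numpy as np

def classification_probability(d0):
    """Sketch of the classification probability index P(O_i, C_h).
    d0 has shape (num_points, num_centers). A softmax over the negative
    distance index is assumed, so a smaller d_0 gives a larger probability,
    and the probabilities over all H centers sum to 1 for each point."""
    logits = -np.asarray(d0, dtype=float)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)
```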
S3052: and updating the cluster centers based on the feature point classification result to obtain a plurality of updated cluster centers.
It can be understood that the initialized cluster centers need to be updated iteratively to determine accurate target cluster centers, and the cluster center update can be realized by adopting an existing clustering algorithm. In one embodiment, for the feature point set corresponding to each cluster center in the feature point classification result, a feature average of the neighborhood feature points in the set may be calculated, where the feature average may be obtained by averaging the feature values of the neighborhood feature points, and the feature values may be the foregoing F(x, y). Then, a preset number of target neighborhood feature points are determined in the feature point set, the target neighborhood feature points being the neighborhood feature points whose feature values are closest to the feature average. Alternatively, the neighborhood feature point whose feature value is closest to the feature average may be determined as an initial target point, and then a preset number of neighborhood feature points whose feature values are closest to the feature average and whose spatial coordinates are closest to the initial target point are determined as the target neighborhood feature points; the preset number may be, for example, 5. Further, the spatial coordinates of the preset number of target neighborhood feature points are averaged to obtain the updated cluster center corresponding to each cluster center.
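The following is a hedged sketch of one cluster-center update as just described: compute the feature average of the neighborhood, select a preset number of points whose feature values are closest to it, and average their spatial coordinates. The default of k = 5 follows the example above; the function name and inputs are illustrative.

```python
import numpy as np

def update_center(neigh_xy, neigh_feat, k=5):
    """Sketch of one cluster-center update (S3052): neigh_xy (M, 2) and
    neigh_feat (M,) are the coordinates and feature values of the
    neighborhood feature points assigned to one cluster center."""
    feat_mean = neigh_feat.mean()
    # k target neighborhood points whose feature values are closest to the mean
    order = np.argsort(np.abs(neigh_feat - feat_mean))
    target = neigh_xy[order[:k]]
    # the updated center is the average of their spatial coordinates
    return target.mean(axis=0)
```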
S3053: and repeatedly executing the steps of feature distance processing, neighborhood feature point determination and cluster center updating according to a plurality of updated cluster centers until the cluster suspension condition is met, and obtaining a target cluster result.
Specifically, the steps of S303 (including S3031-S3034), S3051 and S3052 above are repeatedly performed based on the updated cluster centers until a cluster suspension condition is satisfied, which may be, for example, that the number of clustering iterations is reached, or that the offset between the cluster centers of adjacent iterations and the area difference between the clustered image areas satisfy a minimum error change. Correspondingly, the target clustering result includes a plurality of target cluster centers and the feature point set corresponding to each target cluster center.
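Purely as an illustration of the iteration structure, the sketch below wires assignment and center update into a loop with a suspension condition based on the maximum center offset. `assign_fn` and `update_fn` are assumed callables (for example, built from the sketches above), and the tolerance and iteration cap are arbitrary placeholders.

```python
import numpy as np

def iterative_clustering(points, centers, assign_fn, update_fn, max_iter=10, tol=1e-3):
    """Sketch of the clustering iteration (S3053): assign_fn(points, centers)
    returns, for each center, its set of neighborhood feature points;
    update_fn(group) returns the updated center coordinates for that set."""
    centers = np.asarray(centers, dtype=float)
    groups = assign_fn(points, centers)
    for _ in range(max_iter):
        new_centers = np.array([update_fn(g) for g in groups])
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        groups = assign_fn(points, centers)
        if shift < tol:  # cluster suspension condition: minimal center offset
            break
    return centers, groups
```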
S3054: and generating object segmentation results corresponding to at least one target object respectively based on the target clustering results.
In some embodiments, the cluster centers are in one-to-one correspondence with the target objects; the clustered image area, that is, the super-pixel area, determined from the feature point set corresponding to each cluster center is then taken as the image segmentation area of the corresponding target object. In other embodiments, the number of cluster centers is greater than the number of target objects, and adjacent clustered image areas are merged based on the feature distance index information and a nearest-neighbor principle to obtain the object segmentation result. Specifically, after the clustered image areas are obtained, if the feature distance index information between adjacent target cluster centers is smaller than or equal to a feature distance threshold, the corresponding areas are merged into one region; or, if the feature distance index information between adjacent target cluster centers is smaller than or equal to the feature distance threshold and the feature value difference between the adjacent target cluster centers is smaller than or equal to a preset feature value, the corresponding areas are merged into one region; or, if the feature distance index information between adjacent target cluster centers is smaller than or equal to the feature distance threshold and the number of adjacent feature points between the clustered image areas is greater than an adjacent-point number threshold, the corresponding areas are merged into one region, namely the super-pixel area corresponding to the target object, thereby obtaining the image segmentation area.
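The sketch below illustrates one possible merging step with a union-find structure: adjacent target cluster centers whose feature distance index is at or below a threshold are merged into one region. Which centers count as adjacent, and the threshold value, are assumptions supplied by the caller; names are illustrative.

```python
def merge_regions(center_pairs, d0_between, threshold):
    """Sketch of S3054 region merging: center_pairs is a list of (i, j) index
    pairs of adjacent target cluster centers, d0_between[(i, j)] their feature
    distance index. Centers whose index is <= threshold are merged via union-find;
    the returned dict maps each center appearing in a pair to its merged region id."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for i, j in center_pairs:
        if d0_between[(i, j)] <= threshold:
            union(i, j)
    return {c: find(c) for c in parent}
```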
In the above manner, the neighborhood relations between the feature points and the cluster centers are determined by combining the spatial features, the density features and the global fusion features, so as to determine the feature point set corresponding to each cluster center and further generate the image segmentation areas, thereby realizing the separation of the targets from the background.
In scenarios of object detection and continuous object positioning across consecutive images, for example continuous detection and positioning of a virtual object in a game video, where the virtual object may be an object with motion characteristics such as a virtual game character, the object segmentation results of adjacent images are correlated. Accordingly, after S207, referring to fig. 7, the method further includes:
s401: and obtaining an object segmentation result of the history object corresponding to the history image.
Specifically, the history image is a preceding image temporally adjacent to the image to be detected. Taking video as an example, the history image may be the previous video frame of the image to be detected, or several temporally preceding adjacent video frames, such as the 5 video frames before the image to be detected.
In a continuous image detection scenario, the object segmentation result may also include the object mark point of the target object. In some cases, the object mark point may be the target cluster center. In other cases, the object mark point may be obtained as follows: based on the image segmentation area of the target object, the corresponding target area in the image features (the restored image) is determined; if the image features include multiple feature maps, the target area in each feature map is determined; then a pixel average is calculated over the image feature points in the target area, so as to obtain the pixel average value of each feature map. Next, in each feature map, a preset number of target image feature points corresponding to the respective pixel average value are determined, the target image feature points being the image feature points whose pixel values are closest to the pixel average value. Alternatively, the image feature point whose pixel value is closest to the pixel average value may be determined as an initial point, and then a preset number of image feature points whose pixel values are closest to the pixel average value and whose spatial coordinates are closest to the initial point are determined as the target image feature points; the preset number may be, for example, 5. Further, the spatial coordinates of the preset number of target image feature points are averaged, and the point corresponding to the resulting spatial coordinates is determined as the object mark point of the corresponding target object.
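The following sketch illustrates the mark-point computation for one target object, assuming the image features are given as a list of 2-D feature maps and the segmented area as a boolean mask; the choice of k = 5 follows the example above, and the rest is illustrative.

```python
import numpy as np

def mark_point_from_feature_maps(feature_maps, region_mask, k=5):
    """Sketch of the object-mark-point computation: feature_maps is a list of
    (H, W) arrays, region_mask a boolean (H, W) array selecting the target
    object's segmented area. For each feature map, the k in-region points whose
    pixel values are closest to the region's pixel average are collected; the
    mark point is the average of their coordinates."""
    ys, xs = np.nonzero(region_mask)
    coords = np.stack([xs, ys], axis=1).astype(float)  # (num_region_points, 2)
    picked = []
    for fmap in feature_maps:
        vals = fmap[ys, xs].astype(float)
        order = np.argsort(np.abs(vals - vals.mean()))
        picked.append(coords[order[:k]])
    return np.concatenate(picked, axis=0).mean(axis=0)  # (x, y) mark point
```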
S403: and comparing the object segmentation results corresponding to at least one target object with the object segmentation results of the historical objects to obtain object position comparison results.
S405: and matching the historical object with the target object based on the object position comparison result to obtain an object matching result, wherein the object matching result characterizes the association relationship between the historical object and at least one target object in the historical image.
In some embodiments, the above image position comparison may be implemented by region distance comparison, the object position comparison result being the region comparison result between the image segmentation regions, which may be characterized by a region offset. The specific object matching process may be: for the image segmentation area of each target object in the image to be detected, the history object whose image segmentation area in the object segmentation result of the history image has the smallest region offset is determined as the associated object of the target object, that is, the object homologous to the target object, for example the same instance element such as the same game character.
In other embodiments, the above image position comparison may be implemented based on the distance between object mark points. Specifically, for a target object in the object detection result of the image to be detected, the spatial distance between the object mark point of the target object and the object mark point of each history object in the history image is calculated to obtain the object position comparison result, and the history object with the closest spatial distance is determined as the associated object of the target object. Alternatively, under the global fusion feature, the feature difference between the object mark point of the target object and the object mark point of each history object is calculated; the spatial distance and the feature difference are normalized and added to obtain a mark point distance, which characterizes the object position comparison result, and the history object corresponding to the minimum mark point distance is determined as the associated object of the target object.
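As a hedged sketch of the marker-point matching just described, the code below normalizes the spatial distances and feature differences (min-max normalization is an assumption; the text only says "normalization"), adds them, and picks the history object with the smallest combined distance for each target object.

```python
import numpy as np

def match_to_history(target_pts, target_feats, hist_pts, hist_feats):
    """Sketch of marker-point matching: target_pts (T, 2) / hist_pts (K, 2) are
    object mark points, target_feats (T,) / hist_feats (K,) their feature values
    under the global fusion feature. Returns, for each target object, the index
    of the history object with the smallest combined (normalized) distance."""
    diff_xy = target_pts[:, None, :] - hist_pts[None, :, :]
    spatial = np.sqrt((diff_xy ** 2).sum(axis=-1))              # (T, K)
    feat = np.abs(target_feats[:, None] - hist_feats[None, :])  # (T, K)

    def minmax(m):
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

    marker_dist = minmax(spatial) + minmax(feat)
    return marker_dist.argmin(axis=1)
```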
In other embodiments, the object segmentation result may include the aforementioned multiple target cluster centers, and the spatial distance between each target cluster center and the object mark point of each history object may be calculated, so that the history object with the closest spatial distance is determined as the history object to which the target cluster center belongs, that is, the target cluster center is subjected to object allocation, and the clustered image areas corresponding to the target cluster centers belonging to the same history object are combined, so as to obtain the image segmentation area of the target object corresponding to the history object. Alternatively, similar to the foregoing marker point distance, the center marker distance between the target cluster center and the object marker point of the history object may be calculated, thereby determining the history object to which the plurality of target cluster centers belong.
In this way, object matching is realized by combining the segmentation results of the history images, and synchronous object positioning is achieved without relying on the motion speed of the objects or establishing a similarity search window, so that efficient and accurate synchronous positioning is realized, missed detection is avoided, the problem of positioning loss caused by partial occlusion of objects is effectively alleviated, and the robustness and reliability of continuous object detection are improved.
Based on the above part or all embodiments, each step of the above object detection method may be implemented based on an object detection model, and correspondingly, the object detection method may further include a training method of the object detection model, and specifically includes the following steps:
S501: a sample training set is obtained, the sample training set comprising a sample image, the sample image comprising at least one sample object.
In some embodiments, the sample training set may further include a sample object initial mark corresponding to the sample image, where the sample object initial mark is used to frame a sample object to be detected in the sample image, as shown in fig. 9, the sample object initial mark may be a box in the figure, and the virtual character in the box is the object to be detected. Similar to the foregoing, the sample object initial mark may be used as a basis for cluster center initialization to determine an initialized cluster center at the sample object initial mark.
S503: and taking the sample image as the input of an initial detection model, and extracting the characteristics of the sample image to obtain the characteristics of the sample image, the characteristics of the sample density and the characteristics of the sample space.
In some embodiments, the initial detection model may be a pre-trained image feature extraction model, may be a network model including an encoder and a decoder, and the feature extraction backbone network is a convolutional neural network. In one embodiment, the network structure may be as shown in fig. 4.
S505: and carrying out fusion processing on the sample image features, the sample density features and the sample space features to obtain global fusion features of the sample image.
S507: and performing object classification processing on a plurality of feature points corresponding to the global fusion feature based on the sample space feature, the sample density feature and the global fusion feature to obtain object segmentation results corresponding to at least one sample object.
It is to be understood that S505 and S507 are similar to S205 and S207, respectively, and are not described herein.
S509: and carrying out reconstruction loss calculation based on the sample image characteristics and the object segmentation results corresponding to at least one sample object respectively to obtain model loss.
S511: and training an initial detection model according to the model loss to obtain an object detection model.
Specifically, the sample image features include one or more feature maps. A sample super-pixel segmented image may be determined based on the object segmentation results corresponding to the at least one sample object and the sample image features; reconstruction difference calculation is then performed between the sample super-pixel segmented image and each feature map in the sample image features to obtain the reconstruction difference degree corresponding to each feature map, and the reconstruction difference degrees of the feature maps are summed and averaged to obtain the model loss, where the averaging may include, but is not limited to, simple arithmetic averaging, root-mean-square averaging, or weighted averaging. If the model loss exceeds a preset loss, the model parameters are adjusted and iterative training is performed until an iteration suspension condition is met, for example, the model loss is less than or equal to the preset loss, or a preset number of iterations is reached.
In some embodiments, the model loss may be calculated as follows. For each super-pixel region in the sample super-pixel segmented image, such as a same-color gray-scale block in fig. 8, the corresponding spatial mapping region of the super-pixel region in the feature map, namely the region at the same image position, is determined. The number of difference pixel points is then determined, that is, the number of image feature points in the spatial mapping region whose absolute pixel difference from the corresponding pixel points in the super-pixel region exceeds a preset pixel threshold; for example, if 50 image feature points exceed the preset pixel threshold, the number of difference pixel points is 50. The preset pixel threshold may be set based on actual requirements and is not specifically limited in the present application. Referring to fig. 8, it can be clearly observed that there are many pixel points of different colors, that is, difference pixel points, between the super-pixel region 1 marked by the arrow in the super-pixel segmented image (right image) and the spatial mapping region 2 marked by the arrow in the feature map (middle image).
Then, the numbers of difference pixel points corresponding to the super-pixel regions are added to determine the total number of difference pixel points between the feature map and the super-pixel segmented image, and the pixel point duty ratio of the difference pixel points in all the super-pixel regions is determined; for example, if the total number of difference pixel points is 500 and all the super-pixel regions include 1000 pixel points, the pixel point duty ratio is 50%. If multiple feature maps are included, the pixel point duty ratios of the feature maps are summed and averaged to obtain the target pixel point duty ratio.
If the pixel point duty ratio or the target pixel point duty ratio is greater than or equal to a first duty ratio (e.g., 30%), the loss adjustment coefficient β_t is increased based on a first preset multiple, for example to 1.4 times its original value; if it is greater than or equal to a second duty ratio (e.g., 10%) and less than the first duty ratio, β_t is increased based on a second preset multiple, for example to 1.2 times its original value; if it is below the second duty ratio, β_t remains unchanged. In this way, model parameter adjustment and iterative training of the initial detection model are realized, the feature fusion process is optimized, the distance expression accuracy of the feature distance information is further optimized, and the model effect is improved.
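The two helper functions below sketch, under stated assumptions, the difference-pixel statistics and the β_t adjustment rule: the duty ratio is computed globally over each feature map rather than region by region (equivalent when the super-pixel regions cover the image), and the thresholds and multiples default to the example values quoted above. Names and signatures are illustrative.

```python
import numpy as np

def pixel_duty_ratio(superpixel_img, feature_maps, pixel_threshold):
    """Share of difference pixel points between the super-pixel segmented image
    and the feature maps: a pixel counts as a difference pixel when its absolute
    difference exceeds pixel_threshold; the per-map ratios are averaged to give
    the target pixel point duty ratio."""
    ratios = []
    for fmap in feature_maps:
        diff = np.abs(fmap.astype(float) - superpixel_img.astype(float))
        ratios.append(float((diff > pixel_threshold).mean()))
    return float(np.mean(ratios))


def adjust_loss_coefficient(beta_t, duty_ratio,
                            first_ratio=0.30, second_ratio=0.10,
                            first_multiple=1.4, second_multiple=1.2):
    """Adjustment rule for the loss coefficient beta_t described above; the
    default ratios and multiples mirror the quoted example values."""
    if duty_ratio >= first_ratio:
        return beta_t * first_multiple
    if duty_ratio >= second_ratio:
        return beta_t * second_multiple
    return beta_t
```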
In one embodiment, referring to fig. 10, the scheme flow framework of the present application may specifically include: the method comprises the steps that through a backbone network of an object detection model, multi-task feature extraction is conducted on an input image to be detected by combining with space coordinate information of the input image, image features, space features and density features comprising a plurality of feature images are obtained, and then feature fusion processing is conducted to map to a fusion feature space, so that global fusion features are obtained; determining a clustering center corresponding to the global fusion feature, performing feature distance processing between feature points and the clustering center to obtain feature distance index information, and then performing super-pixel region division to obtain an object segmentation result; and calculating loss based on the object segmentation result and the image characteristics, and realizing model parameter adjustment.
An embodiment of the present application further provides a training method for an object detection model. Please refer to fig. 11, which is a flow chart of the method provided in the embodiment of the present application. The present specification provides the method operation steps of the embodiment or the flow chart, but more or fewer operation steps may be included based on conventional or non-creative labor. The order of steps recited in the embodiments is merely one way of performing the steps and does not represent the only order of execution. When implemented in a real system or server product, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel-processor or multi-threaded environment). Specifically, as shown in fig. 11, the method may include the following steps S601 to S611.
S601: acquiring a sample training set, wherein the sample training set comprises a sample image, and the sample image comprises at least one sample object;
S603: taking a sample image as an input of an initial detection model, and extracting features of the sample image to obtain sample image features, sample density features and sample space features;
S605: carrying out fusion processing on the sample image features, the sample density features and the sample space features to obtain global fusion features of the sample image;
S607: performing object classification processing on a plurality of feature points corresponding to the global fusion features based on the sample space features, the sample density features and the global fusion features to obtain object segmentation results corresponding to at least one sample object;
S609: performing reconstruction loss calculation based on the sample image features and the object segmentation results corresponding to the at least one sample object respectively, to obtain model loss;
S611: and training an initial detection model according to the model loss to obtain an object detection model, wherein the object detection model is applied to the object detection method.
The embodiment of the application also provides an object detection device 700, as shown in fig. 12, fig. 12 shows a schematic structural diagram of the object detection device provided by the embodiment of the application, and the device may include the following modules.
The image acquisition module 11: the method comprises the steps of acquiring an image to be detected, wherein the image to be detected comprises at least one target object;
feature extraction module 12: the method comprises the steps of extracting features of an image to be detected to obtain image features, density features and space features of the image to be detected;
fusion processing module 13: the method comprises the steps of performing fusion processing on image features, density features and space features to obtain global fusion features of an image to be detected;
Object classification module 14: the object classification method is used for carrying out object classification processing on a plurality of feature points corresponding to the global fusion feature based on the spatial feature, the density feature and the global fusion feature to obtain object segmentation results corresponding to at least one target object.
In some embodiments, the object classification module 14 may include:
center determination submodule: the method comprises the steps of determining a plurality of clustering centers corresponding to a plurality of feature points, wherein the number of the clustering centers is greater than or equal to the number of target objects;
and the characteristic distance processing sub-module: the method comprises the steps of performing feature distance processing on a plurality of feature points and a plurality of clustering centers based on spatial features, density features and global fusion features to obtain feature distance index information;
clustering submodule: and the object segmentation method is used for clustering the plurality of characteristic points based on the characteristic distance index information to obtain object segmentation results corresponding to at least one target object.
In some embodiments, the feature distance processing sub-module may include:
a feature difference determination unit: the feature difference information between each feature point and each clustering center is respectively determined based on the feature values of a plurality of feature points in the global fusion feature;
a spatial distance determination unit: the method is used for respectively determining the space distance information between each feature point and each clustering center based on the space features;
Fusion density determination unit: the fusion density information corresponding to each characteristic point and each clustering center is respectively determined based on the density characteristics;
an information fusion unit: the method is used for fusing the characteristic difference information, the space distance information and the fusion density information to obtain characteristic distance index information between each characteristic point and each clustering center, wherein the space distance information and the characteristic difference information are respectively in direct proportion relation with the characteristic distance index information, and the density information is in inverse proportion relation with the characteristic distance index information.
In some embodiments, the clustering sub-module may include:
a neighborhood feature point determining unit: the method comprises the steps of determining neighborhood feature points corresponding to each cluster center from a plurality of feature points based on feature distance index information to obtain feature point classification results;
a center updating unit: the method comprises the steps of updating a cluster center based on a feature point classification result to obtain a plurality of updated cluster centers;
iterative clustering unit: the method comprises the steps of repeatedly executing the characteristic distance processing, the neighborhood characteristic point determining and the cluster center updating according to a plurality of updated cluster centers until a cluster suspension condition is met, and obtaining a target cluster result;
A segmentation result generation unit: and the object segmentation results are used for generating object segmentation results corresponding to at least one target object respectively based on the target clustering results.
In some embodiments, the apparatus may further include:
a historical result acquisition module: the method comprises the steps of performing object classification processing on a plurality of feature points corresponding to global fusion features based on spatial features, density features and global fusion features, obtaining object segmentation results of historical objects corresponding to historical images after obtaining object segmentation results corresponding to at least one target object, wherein the historical images are precursor images adjacent to a time sequence of an image to be detected;
and a position comparison module: the object segmentation method comprises the steps of comparing object segmentation results corresponding to at least one target object with object segmentation results of historical objects to obtain object position comparison results;
and a matching module: and the object matching result is used for matching the historical object with the target object based on the object position comparison result to obtain an object matching result, and the object matching result characterizes the association relationship between the historical object and at least one target object in the historical image.
In some embodiments, the feature extraction module 12 includes:
a convolution sub-module: the method comprises the steps of carrying out convolution processing and deconvolution processing on an image to be detected to obtain image characteristics, wherein the image characteristics comprise a plurality of image characteristic points;
A pixel difference processing sub-module: the method comprises the steps of using each image characteristic point in a plurality of image characteristic points as a sliding window center point, and using a preset sliding window as a density characteristic extraction area to conduct pixel difference value processing to obtain pixel density information of each image characteristic point;
a density characteristic generation sub-module: for generating density features based on pixel density information for each image feature point;
a spatial feature generation sub-module: for generating spatial features based on spatial coordinate information of the image to be detected.
In some embodiments, the pixel difference processing sub-module may be specifically configured to: and taking the image characteristic points as sliding window center points of the preset sliding windows, determining the square sum of pixel difference values between the image characteristic points and other image characteristic points in a density characteristic processing area set by the preset sliding windows, and obtaining pixel density information of each image characteristic point.
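A minimal sketch of this sliding-window density computation is given below: each point of a feature map is taken as the window center, and its pixel density is the sum of squared pixel differences to the other points in the window. Handling edge points with the valid part of the window, and the default window size, are assumptions; names are illustrative.

```python
import numpy as np

def pixel_density_map(feature_map, window=3):
    """Sketch of the sliding-window density feature: for each point of the (H, W)
    feature_map, sum the squared pixel differences to every other point inside a
    window x window neighborhood centered on it."""
    h, w = feature_map.shape
    half = window // 2
    density = np.zeros((h, w), dtype=float)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - half), min(h, y + half + 1)
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            patch = feature_map[y0:y1, x0:x1].astype(float)
            # squared differences to the window center; the center term is zero
            density[y, x] = ((patch - float(feature_map[y, x])) ** 2).sum()
    return density
```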
In some embodiments, the image features include a plurality of feature maps corresponding to the image to be detected; the fusion processing module 13 may include:
feature fusion submodule: the method comprises the steps of respectively carrying out feature fusion on each feature map, space features and density features corresponding to the feature maps in a plurality of feature maps to obtain fusion features corresponding to the feature maps;
And the addition processing submodule: and the method is used for adding and processing the fusion features corresponding to the feature graphs to obtain global fusion features.
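As an illustration of this two-stage fusion, the sketch below fuses each feature map with the spatial and density features by element-wise addition and then sums the per-map fusion features to obtain the global fusion feature. Element-wise addition, and the reduction of the coordinate grid to a single channel, are assumptions, since the per-map fusion operation is not fixed here; names are illustrative.

```python
import numpy as np

def global_fusion_feature(feature_maps, spatial_feat, density_feat):
    """Sketch of the fusion processing: feature_maps is a list of (H, W) arrays,
    density_feat an (H, W) array, spatial_feat either an (H, W) array or an
    (H, W, 2) coordinate grid (summed over its channels as a placeholder)."""
    spatial_channel = spatial_feat.sum(axis=-1) if spatial_feat.ndim == 3 else spatial_feat
    fused = [fmap.astype(float) + spatial_channel + density_feat for fmap in feature_maps]
    return np.sum(fused, axis=0)  # (H, W) global fusion feature
```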
The embodiment of the application also provides an object detection model training device 800, as shown in fig. 13, fig. 13 shows a schematic structural diagram of the object detection model training device provided by the embodiment of the application, and the device may include the following modules.
Sample acquisition module 21: for obtaining a sample training set, the sample training set comprising a sample image, the sample image comprising at least one sample object;
sample feature extraction module 22: the method comprises the steps of taking a sample image as input of an initial detection model, and extracting characteristics of the sample image to obtain sample image characteristics, sample density characteristics and sample space characteristics;
sample fusion processing module 23: the fusion processing method comprises the steps of carrying out fusion processing on sample image features, sample density features and sample space features to obtain global fusion features of a sample image;
sample classification module 24: the method comprises the steps of performing object classification processing on a plurality of feature points corresponding to global fusion features based on sample space features, sample density features and the global fusion features to obtain object segmentation results corresponding to at least one sample object;
Loss calculation module 25: the method comprises the steps of carrying out reconstruction loss calculation based on object segmentation results corresponding to sample image features and at least one sample object respectively to obtain model loss;
model training module 26: the method is used for training an initial detection model according to model loss to obtain an object detection model, and the object detection model is applied to the object detection method.
It should be noted that the above apparatus embodiments and method embodiments are based on the same implementation manner.
The embodiment of the application provides an object detection device, which can be a terminal or a server, and comprises a processor and a memory, wherein at least one instruction or at least one section of program is stored in the memory, and the at least one instruction or the at least one section of program is loaded and executed by the processor to realize the object detection method provided by the embodiment of the method.
The memory may be used to store software programs and modules that the processor executes to perform various functional applications and object detection by running the software programs and modules stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for functions, and the like; the storage data area may store data created according to the use of the device, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide access to the memory by the processor.
The method embodiments provided by the embodiments of the present application can be executed in electronic equipment such as a mobile terminal, a computer terminal, a server, or similar computing devices. Fig. 14 is a block diagram of a hardware structure of an electronic device according to an embodiment of the present application. As shown in fig. 14, the electronic device 900 may vary considerably in configuration or performance, and may include one or more central processing units (Central Processing Units, CPU) 910 (the processor 910 may include, but is not limited to, a microprocessor MCU, a programmable logic device FPGA, or another processing device), a memory 930 for storing data, and one or more storage media 920 (e.g., one or more mass storage devices) for storing applications 923 or data 922. The memory 930 and the storage medium 920 may be transitory or persistent storage. The program stored on the storage medium 920 may include one or more modules, each of which may include a series of instruction operations in the electronic device. Still further, the central processor 910 may be configured to communicate with the storage medium 920 and execute a series of instruction operations in the storage medium 920 on the electronic device 900. The electronic device 900 may also include one or more power supplies 960, one or more wired or wireless network interfaces 950, one or more input/output interfaces 940, and/or one or more operating systems 921, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The input-output interface 940 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communications provider of the electronic device 900. In one example, the input-output interface 940 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices through a base station to communicate with the internet. In one example, the input/output interface 940 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 14 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, electronic device 900 may also include more or fewer components than shown in FIG. 14, or have a different configuration than shown in FIG. 14.
Embodiments of the present application also provide a computer readable storage medium that may be disposed in an electronic device to store at least one instruction or at least one program related to implementing an object detection method in a method embodiment, where the at least one instruction or the at least one program is loaded and executed by the processor to implement the object detection method provided in the method embodiment.
Alternatively, in this embodiment, the storage medium may be located in at least one network server among a plurality of network servers of the computer network. Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in the various alternative implementations described above.
As can be seen from the above embodiments of the object detection method, apparatus, device, server, terminal, storage medium and program product provided by the present application, the technical solution of the present application acquires an image to be detected, the image to be detected including at least one target object; performs feature extraction on the image to be detected to obtain image features, density features and spatial features of the image to be detected; and fuses the image features, the density features and the spatial features to obtain the global fusion feature of the image to be detected, thereby forming a fusion feature space containing multi-dimensional semantic information. Object classification processing is then performed on the feature points corresponding to the global fusion feature based on the spatial features, the density features and the global fusion feature, so as to obtain the object segmentation result corresponding to each of the at least one target object. In this way, accurate segmentation and positioning for object detection in a single image are realized by combining multi-dimensional semantic information, noise interference from point and line elements within an object is avoided, and the efficiency, accuracy and reliability of object detection are improved. Moreover, continuous positioning of objects can be realized by comparing object segmentation results between frames without depending on pixel value differences between frames, missed detection is avoided, detection obstacles caused by inter-frame differences such as occlusion and offset are overcome, and the detection effect is optimized.
It should be noted that: the sequence of the embodiments of the present application is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments of the present application are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for apparatus, devices and storage medium embodiments, the description is relatively simple as it is substantially similar to method embodiments, with reference to the description of method embodiments in part.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of implementing the above embodiments may be implemented by hardware, or may be implemented by a program indicating that the relevant hardware is implemented, and the program may be stored in a computer readable storage medium, where the storage medium may be a read only memory, a magnetic disk or optical disk, etc.
The foregoing is only illustrative of the present application and is not to be construed as limiting thereof, but rather as various modifications, equivalent arrangements, improvements, etc., within the spirit and principles of the present application.

Claims (13)

1. An object detection method, the method comprising:
acquiring an image to be detected, wherein the image to be detected comprises at least one target object;
extracting features of the image to be detected to obtain image features, density features and space features of the image to be detected;
performing fusion processing on the image features, the density features and the space features to obtain global fusion features of the image to be detected;
and performing object classification processing on a plurality of feature points corresponding to the global fusion feature based on the spatial feature, the density feature and the global fusion feature to obtain object segmentation results corresponding to the at least one target object.
2. The method of claim 1, wherein performing object classification processing on a plurality of feature points corresponding to the global fusion feature based on the spatial feature, the density feature and the global fusion feature to obtain object segmentation results corresponding to the at least one target object respectively comprises:
Determining a plurality of clustering centers corresponding to the plurality of feature points, wherein the number of the clustering centers is greater than or equal to the number of the target objects;
based on the spatial features, the density features and the global fusion features, carrying out feature distance processing on the feature points and the clustering centers to obtain feature distance index information;
and clustering the plurality of feature points based on the feature distance index information to obtain object segmentation results corresponding to the at least one target object.
3. The method of claim 2, wherein the performing feature distance processing on the plurality of feature points and the plurality of cluster centers based on the spatial feature, the density feature, and the global fusion feature to obtain feature distance index information includes:
based on the feature values of the feature points in the global fusion feature, determining feature difference information between each feature point and each clustering center;
based on the spatial features, determining spatial distance information between each feature point and each clustering center respectively;
based on the density characteristics, respectively determining fusion density information corresponding to each characteristic point and each clustering center;
And fusing the characteristic difference information, the spatial distance information and the fusion density information to obtain characteristic distance index information between each characteristic point and each clustering center, wherein the spatial distance information and the characteristic difference information are respectively in direct-proportion association with the characteristic distance index information, and the density information is in inverse-proportion association with the characteristic distance index information.
4. The method of claim 3, wherein clustering the plurality of feature points based on the feature distance index information to obtain the object segmentation result corresponding to each of the at least one target object comprises:
based on the characteristic distance index information, determining a neighborhood characteristic point corresponding to each clustering center from the plurality of characteristic points to obtain a characteristic point classification result;
updating the clustering centers based on the feature point classification result to obtain a plurality of updated clustering centers;
repeatedly executing the steps of feature distance processing, neighborhood feature point determination and cluster center updating according to the updated cluster centers until the cluster suspension condition is met, and obtaining a target cluster result;
And generating object segmentation results corresponding to the at least one target object respectively based on the target clustering results.
5. The method according to any one of claims 1-4, wherein after performing object classification processing on a plurality of feature points corresponding to the global fusion feature based on the spatial feature, the density feature and the global fusion feature, to obtain an object segmentation result corresponding to each of the at least one target object, the method further comprises:
obtaining an object segmentation result of a history object corresponding to a history image, wherein the history image is a preamble image adjacent to the image time sequence to be detected;
comparing the object segmentation results corresponding to the at least one target object with the object segmentation results of the historical object to obtain object position comparison results;
and matching the historical object with the target object based on the object position comparison result to obtain an object matching result, wherein the object matching result represents the association relationship between the historical object in the historical image and the at least one target object.
6. The method according to any one of claims 1-4, wherein the performing feature extraction on the image to be detected to obtain image features, density features and spatial features of the image to be detected includes:
Performing convolution processing and deconvolution processing on the image to be detected to obtain the image characteristics, wherein the image characteristics comprise a plurality of image characteristic points;
taking each image characteristic point in the plurality of image characteristic points as a sliding window center point, and taking a preset sliding window as a density characteristic extraction area to perform pixel difference processing to obtain pixel density information of each image characteristic point;
generating the density characteristic based on the pixel density information of each image characteristic point;
and generating the spatial feature based on the spatial coordinate information of the image to be detected.
7. The method of claim 6, wherein the performing pixel difference processing with each image feature point of the plurality of image feature points as a sliding window center point and a preset sliding window as a density feature extraction area to obtain pixel density information of each image feature point comprises:
and taking the image characteristic points as sliding window center points of the preset sliding windows, determining the square sum of pixel difference values between the image characteristic points and other image characteristic points in a density characteristic processing area set by the preset sliding windows, and obtaining the pixel density information of each image characteristic point.
8. The method according to any one of claims 1-4, wherein the image features comprise a plurality of feature maps corresponding to the image to be detected; the fusing the image features, the density features and the space features to obtain global fusion features of the image to be detected comprises the following steps:
respectively carrying out feature fusion on each feature map in the feature maps, the spatial features and density features corresponding to the feature maps to obtain fusion features corresponding to the feature maps;
and adding the fusion features corresponding to the feature graphs to obtain the global fusion feature.
9. A method of training an object detection model, the method comprising:
obtaining a sample training set, the sample training set comprising a sample image, the sample image comprising at least one sample object;
taking the sample image as the input of an initial detection model, and extracting the characteristics of the sample image to obtain sample image characteristics, sample density characteristics and sample space characteristics;
performing fusion processing on the sample image features, the sample density features and the sample space features to obtain global fusion features of the sample image;
Performing object classification processing on a plurality of feature points corresponding to the global fusion feature based on the sample space feature, the sample density feature and the global fusion feature to obtain object segmentation results corresponding to the at least one sample object;
performing reconstruction loss calculation based on the sample image characteristics and object segmentation results corresponding to the at least one sample object respectively to obtain model loss;
training the initial detection model according to the model loss to obtain an object detection model, wherein the object detection model is applied to the object detection method according to any one of claims 1-8.
10. An object detection apparatus, the apparatus comprising:
an image acquisition module: the method comprises the steps of acquiring an image to be detected, wherein the image to be detected comprises at least one target object;
and the feature extraction module is used for: the method comprises the steps of extracting features of an image to be detected to obtain image features, density features and space features of the image to be detected;
and the fusion processing module is used for: the method comprises the steps of performing fusion processing on the image features, the density features and the space features to obtain global fusion features of the image to be detected;
An object classification module: and the object classification processing is used for carrying out object classification processing on a plurality of feature points corresponding to the global fusion feature based on the spatial feature, the density feature and the global fusion feature, so as to obtain object segmentation results corresponding to the at least one target object.
11. An object detection model training apparatus, the apparatus comprising:
sample acquisition module: for obtaining a sample training set, the sample training set comprising a sample image, the sample image comprising at least one sample object;
sample feature extraction module: the sample image detection method comprises the steps of taking the sample image as input of an initial detection model, and extracting features of the sample image to obtain sample image features, sample density features and sample space features;
sample fusion processing module: the fusion processing is used for carrying out fusion processing on the sample image characteristics, the sample density characteristics and the sample space characteristics to obtain global fusion characteristics of the sample image;
sample classification module: the object classification processing is used for carrying out object classification processing on a plurality of feature points corresponding to the global fusion feature based on the sample space feature, the sample density feature and the global fusion feature to obtain object segmentation results corresponding to the at least one sample object respectively;
And a loss calculation module: the reconstruction loss calculation is used for calculating reconstruction loss based on the sample image characteristics and the object segmentation results corresponding to the at least one sample object respectively, so as to obtain model loss;
model training module: for training the initial detection model on the basis of the model loss, resulting in an object detection model, which is applied to the object detection method of any of the claims 1-8.
12. A computer readable storage medium having stored therein at least one instruction or at least one program loaded and executed by a processor to implement the object detection method of any one of claims 1-8 or the object detection model training method of claim 9.
13. A computer device, characterized in that it comprises a processor and a memory, in which at least one instruction or at least one program is stored, which is loaded and executed by the processor to implement the object detection method according to any one of claims 1-8 or the object detection model training method according to claim 9.
CN202211471828.6A 2022-11-23 2022-11-23 Object detection method, model training method, device, electronic equipment and medium Pending CN116994022A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211471828.6A CN116994022A (en) 2022-11-23 2022-11-23 Object detection method, model training method, device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211471828.6A CN116994022A (en) 2022-11-23 2022-11-23 Object detection method, model training method, device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN116994022A true CN116994022A (en) 2023-11-03

Family

ID=88523837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211471828.6A Pending CN116994022A (en) 2022-11-23 2022-11-23 Object detection method, model training method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116994022A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117409285A (en) * 2023-12-14 2024-01-16 先临三维科技股份有限公司 Image detection method and device and electronic equipment
CN117409285B (en) * 2023-12-14 2024-04-05 先临三维科技股份有限公司 Image detection method and device and electronic equipment
CN117456428A (en) * 2023-12-22 2024-01-26 杭州臻善信息技术有限公司 Garbage throwing behavior detection method based on video image feature analysis
CN117456428B (en) * 2023-12-22 2024-03-29 杭州臻善信息技术有限公司 Garbage throwing behavior detection method based on video image feature analysis

Similar Documents

Publication Publication Date Title
CN116994022A (en) Object detection method, model training method, device, electronic equipment and medium
CN113822314B (en) Image data processing method, device, equipment and medium
CN111311647B (en) Global-local and Kalman filtering-based target tracking method and device
CN112215101A (en) Attention mechanism-based three-dimensional target identification method and system
CN104038792B (en) For the video content analysis method and apparatus of IPTV supervision
CN113988147B (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
CN110765882A (en) Video tag determination method, device, server and storage medium
CN111310821A (en) Multi-view feature fusion method, system, computer device and storage medium
Yang et al. Local stereo matching based on support weight with motion flow for dynamic scene
CN110807379A (en) Semantic recognition method and device and computer storage medium
JP2023131117A (en) Joint perception model training, joint perception method, device, and medium
CN112907569A (en) Head image area segmentation method and device, electronic equipment and storage medium
Li et al. Superpixel segmentation based on spatially constrained subspace clustering
CN116310318A (en) Interactive image segmentation method, device, computer equipment and storage medium
CN113592015B (en) Method and device for positioning and training feature matching network
CN115018999A (en) Multi-robot-cooperation dense point cloud map construction method and device
CN112053439A (en) Method, device and equipment for determining instance attribute information in image and storage medium
CN111914809A (en) Target object positioning method, image processing method, device and computer equipment
CN116452715A (en) Dynamic human hand rendering method, device and storage medium
CN116665261A (en) Image processing method, device and equipment
CN115311403A (en) Deep learning network training method, virtual image generation method and device
CN112926596B (en) Real-time superpixel segmentation method and system based on recurrent neural network
CN114723973A (en) Image feature matching method and device for large-scale change robustness
CN115115513A (en) Image processing method, device, equipment and storage medium
Ewerth et al. Estimating relative depth in single images via rankboost

Legal Events

Date Code Title Description
PB01 Publication