CN112819008B - Method, device, medium and electronic equipment for optimizing instance detection network

Publication number
CN112819008B
Authority
CN
China
Prior art keywords
instance
loss function
class
feature
vector
Prior art date
Legal status
Active
Application number
CN202110031402.8A
Other languages
Chinese (zh)
Other versions
CN112819008A (en)
Inventor
单鼎一
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority claimed from CN202110031402.8A
Publication of CN112819008A
Application granted
Publication of CN112819008B
Status: Active

Classifications

    • G06V 10/44 — Image or video recognition or understanding: extraction of image or video features; local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 20/10 — Image or video recognition or understanding: scenes; scene-specific elements; terrestrial scenes

Abstract

The application provides an optimization method for an instance detection network, an optimization apparatus for an instance detection network, a computer-readable storage medium, and an electronic device, and relates to the technical field of artificial intelligence. The method comprises the following steps: extracting semantic feature vectors and instance feature vectors from a target image (such as a map) through an instance detection network, and identifying instance targets in the target image through these vectors; calculating the pixel area ratio of each instance target among all the instance targets and the inter-class vector distances corresponding to the instance targets; calculating an intra-class loss function from the pixel area ratios and the at least two instance targets, an inter-class loss function from the inter-class vector distances and the at least two instance targets, and a semantic loss function from the at least two instance targets; and training the instance detection network according to the semantic loss function, the intra-class loss function, and the inter-class loss function. By implementing the embodiments of the application, this optimization of the loss functions improves the detection accuracy of instance targets (such as buildings) in a map.

Description

Method, device, medium and electronic equipment for optimizing instance detection network
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an optimization method for an instance detection network, an optimization apparatus for an instance detection network, a computer-readable storage medium, and an electronic device.
Background
In the field of image recognition, neural networks are often required to learn image features in order to detect instance targets in an image, such as faces, animals, objects, scenery, and buildings.
For instance target detection, a coarse position detection is usually performed first to determine the features of candidate rectangular boxes; these features serve as the input of a fine detection stage, which determines the specific position of each instance target, classifies its category, and detects the instance targets in the image from the category and the specific position.
However, the candidate boxes in the above approach only roughly frame instance targets. When instance targets (e.g., buildings in satellite top-down images) appear at different densities and sizes across images, it is difficult to frame dense but small instance targets one by one, and equally difficult to accurately frame large instance targets, which easily leads to low detection accuracy for instance targets.
It is noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the application and therefore may include information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
The application aims to provide an optimization method for an instance detection network, an optimization apparatus for an instance detection network, a computer-readable storage medium, and an electronic device, which can optimize the intra-class loss function through the instance pixel area and the inter-class loss function through the inter-class vector distances of instances, thereby improving the detection precision of instance targets.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of the present application, there is provided an optimization method of an instance detection network, including:
extracting semantic feature vectors and instance feature vectors in the target image through an instance detection network, and identifying at least two instance targets in the target image through the semantic feature vectors and the instance feature vectors;
calculating the pixel area ratio of each instance target in the at least two instance targets among all the instance targets, and the inter-class vector distances corresponding to the at least two instance targets;
calculating an intra-class loss function according to the pixel area ratio and the at least two instance targets, calculating an inter-class loss function according to the inter-class vector distances and the at least two instance targets, and calculating a semantic loss function according to the at least two instance targets;
and training the instance detection network according to the semantic loss function, the intra-class loss function and the inter-class loss function.
According to an aspect of the present application, there is provided an optimization apparatus for an instance detection network, including a feature extraction unit, a parameter calculation unit, a loss function calculation unit, and a parameter adjustment unit, wherein:
the feature extraction unit is used for extracting semantic feature vectors and instance feature vectors in the target image through an instance detection network, and identifying at least two instance targets in the target image through the semantic feature vectors and the instance feature vectors;
the parameter calculation unit is used for calculating the pixel area ratio of each instance target in the at least two instance targets among all the instance targets, and the inter-class vector distances corresponding to the at least two instance targets;
the loss function calculation unit is used for calculating an intra-class loss function according to the pixel area ratio and the at least two instance targets, calculating an inter-class loss function according to the inter-class vector distances and the at least two instance targets, and calculating a semantic loss function according to the at least two instance targets;
and the parameter adjustment unit is used for training the instance detection network according to the semantic loss function, the intra-class loss function and the inter-class loss function.
In an exemplary embodiment of the present application, the feature extraction unit extracts a semantic feature vector and an instance feature vector in a target image through an instance detection network, including:
obtaining a shared feature vector corresponding to the target image;
respectively inputting the shared feature vector into a semantic feature extraction sub-network and an instance feature extraction sub-network in the instance detection network;
and extracting semantic feature vectors through the semantic feature extraction sub-network, and extracting instance feature vectors through the instance feature extraction sub-network.
In an exemplary embodiment of the present application, the feature extraction unit acquires the shared feature vector corresponding to the target image by:
extracting a reference feature vector of the target image;
normalizing the reference feature vector to obtain a normalization result conforming to a normal distribution;
and performing nonlinear mapping on the normalization result to obtain the shared feature vector.
In an exemplary embodiment of the present application, the semantic feature extraction sub-network and the instance feature extraction sub-network share the same network architecture with different network parameters; the semantic feature extraction sub-network corresponds to the semantic loss function, and the instance feature extraction sub-network corresponds to the intra-class loss function and the inter-class loss function.
In an exemplary embodiment of the present application, the parameter adjustment unit trains the instance detection network according to the semantic loss function, the intra-class loss function, and the inter-class loss function by:
training the instance feature extraction sub-network in the instance detection network according to the intra-class loss function and the inter-class loss function;
and training the semantic feature extraction sub-network in the instance detection network according to the semantic loss function.
In an exemplary embodiment of the present application, the parameter adjustment unit trains the instance feature extraction sub-network in the instance detection network according to the intra-class loss function and the inter-class loss function by:
adjusting the network parameters of the instance feature extraction sub-network according to the intra-class loss function and the inter-class loss function until both loss functions fall within a preset threshold range.
In an exemplary embodiment of the present application, the feature extraction unit extracts the semantic feature vector through the semantic feature extraction sub-network by:
performing deconvolution on the shared feature vector through a plurality of semantic convolution layers in the semantic feature extraction sub-network to obtain the semantic feature vector, wherein the input of each semantic convolution layer is the output of the previous semantic convolution layer together with the shared feature vector;
and extracts the instance feature vector through the instance feature extraction sub-network by:
performing deconvolution on the shared feature vector through a plurality of instance convolution layers in the instance feature extraction sub-network to obtain the instance feature vector, wherein the input of each instance convolution layer is the output of the previous instance convolution layer together with the shared feature vector.
In an exemplary embodiment of the present application, the feature extraction unit identifies the at least two instance targets in the target image through the semantic feature vector and the instance feature vector by:
fusing the feature vectors corresponding to the same instance target in the semantic feature vectors and the instance feature vectors to obtain a fused feature vector set;
clustering the fused feature vector set to obtain at least two vector clusters;
and determining the at least two instance targets according to the positions the vector clusters represent in the target image.
In an exemplary embodiment of the present application, the feature extraction unit fuses the feature vectors corresponding to the same instance target in the semantic feature vectors and the instance feature vectors to obtain a fused feature vector set by:
predicting at least two first reference instance targets in the target image according to the semantic feature vectors;
predicting at least two second reference instance targets in the target image according to the instance feature vectors, the first reference instance targets and the second reference instance targets being in one-to-one correspondence;
and fusing the feature vectors corresponding to the same instance target in the first reference instance targets and the second reference instance targets to obtain the fused feature vector set.
In an exemplary embodiment of the present application, the parameter calculation unit calculates the inter-class vector distances corresponding to the at least two instance targets by:
calculating the feature vector mean corresponding to each instance target in the at least two instance targets according to the fused feature vector set;
and calculating, from the feature vector means, the inter-class vector distance corresponding to each pair of instance targets among the at least two instance targets.
In an exemplary embodiment of the present application, the loss function calculation unit calculates the intra-class loss function based on the pixel area ratio and the at least two instance targets by:
calculating the intra-class feature means corresponding to the at least two instance targets respectively, the number of intra-class feature means being consistent with the number of the at least two instance targets;
and calculating the intra-class loss function according to a preset intra-class penalty factor, the pixel area ratio, the fused feature vector set, the number of feature vectors of each instance target, the number of feature vectors not satisfying the preset intra-class penalty factor, and the intra-class feature means.
In an exemplary embodiment of the present application, the loss function calculation unit calculates the inter-class loss function according to the inter-class vector distances and the at least two instance targets by:
calculating the inter-class loss function according to a preset inter-class penalty factor, the inter-class vector distances, the fused feature vector set, a vector distance threshold, and the number of instance targets satisfying the vector distance threshold.
According to an aspect of the present application, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any of the above via execution of the executable instructions.
According to an aspect of the application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
The exemplary embodiments of the present application may have some or all of the following advantages:
In the optimization method for an instance detection network provided by an example embodiment of the application, semantic feature vectors and instance feature vectors in a target image can be extracted through an instance detection network, and at least two instance targets in the target image are identified through those vectors; the pixel area ratio of each instance target among all instance targets and the inter-class vector distances corresponding to the at least two instance targets are calculated; an intra-class loss function is calculated from the pixel area ratios and the at least two instance targets, an inter-class loss function from the inter-class vector distances and the at least two instance targets, and a semantic loss function from the at least two instance targets; and the instance detection network is trained according to the three loss functions. On one hand, optimizing the intra-class loss function through the instance pixel area allows the weight to be adjusted adaptively by instance area, improving the detection precision of instance targets. On the other hand, optimizing the inter-class loss function through the inter-class vector distances of instances improves network training efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic diagram of an exemplary system architecture to which the instance detection network optimization method and apparatus of the embodiments of the present application may be applied;
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for implementing the electronic device of the present application;
FIG. 3 schematically illustrates a flow diagram of an optimization method for an instance detection network, according to one embodiment of the present application;
FIG. 4 schematically illustrates an architectural diagram of an instance detection network, according to one embodiment of the present application;
FIG. 5 schematically illustrates an instance recognition effect diagram according to an embodiment of the present application;
FIG. 6 schematically illustrates a flow diagram of an optimization method for an instance detection network, according to one embodiment of the present application;
Fig. 7 schematically shows a block diagram of an optimization apparatus for an instance detection network according to an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present application.
Furthermore, the drawings are merely schematic illustrations of the present application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 is a schematic diagram illustrating the system architecture of an exemplary application environment to which the instance detection network optimization method and the instance detection network optimization apparatus of an embodiment of the present application may be applied.
As shown in fig. 1, system architecture 100 may include one or more of end devices 101, 102, 103, a network 104, and a server cluster 105. The network 104 serves to provide a medium of communication links between the terminal devices 101, 102, 103 and the server cluster 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. The server cluster 105 may be an independent physical server, or may be a server cluster or a distributed system including a plurality of physical servers.
The instance detection network optimization method provided by the embodiment of the present application is generally performed by at least one server in the server cluster 105. Accordingly, the optimization apparatus for the instance detection network is typically disposed in at least one server of the server cluster 105. For example, in an exemplary embodiment, at least one server in the server cluster 105 may extract semantic feature vectors and instance feature vectors in the target image through the instance detection network, and identify at least two instance targets in the target image through those vectors; calculate the pixel area ratio of each instance target among all instance targets and the inter-class vector distances corresponding to the at least two instance targets; calculate an intra-class loss function from the pixel area ratios and the at least two instance targets, an inter-class loss function from the inter-class vector distances and the at least two instance targets, and a semantic loss function from the at least two instance targets; and train the instance detection network according to the semantic loss function, the intra-class loss function, and the inter-class loss function.
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiments of the present application.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU) 201 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data necessary for system operation are also stored. The CPU 201, ROM 202, and RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to bus 204.
The following components are connected to the I/O interface 205: an input portion 206 including a keyboard, a mouse, and the like; an output section 207 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card, a modem, or the like. The communication section 209 performs communication processing via a network such as the internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 210 as necessary, so that a computer program read out therefrom is installed into the storage section 208 as necessary.
In particular, according to embodiments of the present application, the processes described below with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209 and/or installed from the removable medium 211. The computer program, when executed by a Central Processing Unit (CPU) 201, performs various functions defined in the methods and apparatus of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see": using cameras and computers instead of human eyes to identify, track, and measure targets, and further processing the captured images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies the related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
In the field of image recognition, the common industry practice for building detection from satellite imagery is a two-stage object detection algorithm. The first stage is a coarse detection of instance positions, outputting a series of candidate rectangular target boxes whose corresponding image features serve as input to the second stage. There, instance positions are refined through a small position-regression classification network, instance categories are determined by classification, and the result is finally fed to a semantic segmentation network to achieve pixel-level per-instance foreground-background segmentation, i.e., the detection of buildings in the satellite image.
However, in the above approach, a variety of anchor points of different sizes must be designed to match instance targets of different sizes. For detecting small buildings in dense areas, the anchor design is tightly coupled with the design and detection of the first-stage candidate boxes, and anchors suited to dense small buildings are difficult to design, which easily leads to low instance recall and severe missed detections. For large buildings, the first-stage candidate boxes often cannot cover the whole instance, easily leading to incomplete detection and low precision. Meanwhile, the above approach usually treats recognition as its primary objective; after several rounds of downsampling, the outer edges of small objects are often unclear, making foreground-background separation difficult, and dense small objects frequently overlap (cap) one another after vectorization.
Based on the above problems, the present exemplary embodiment provides an optimization method for an instance detection network. Referring to fig. 3, fig. 3 schematically illustrates a flow diagram of an optimization method for an instance detection network according to an embodiment of the present application. As shown in fig. 3, the optimization method for the instance detection network may include the following steps S310 to S340:
step S310: and extracting semantic feature vectors and example feature vectors in the target image through an example detection network, and identifying at least two example targets in the target image through the semantic feature vectors and the example feature vectors.
Step S320: and calculating the pixel area ratio of each example target in the at least two example targets to all the example targets and the inter-class vector distance corresponding to the at least two example targets.
Step S330: the method includes computing an intra-class loss function from a pixel area ratio and at least two instance objectives, computing an inter-class loss function from an inter-class vector distance and at least two instance objectives, and computing a semantic loss function from at least two instance objectives.
Step S340: and training the example detection network according to the semantic loss function, the intra-class loss function and the inter-class loss function.
By implementing the method shown in fig. 3, the intra-class loss function can be optimized through the instance pixel area and the inter-class loss function through the inter-class vector distances of instances, so that weights can be adaptively adjusted according to instance pixel area and the detection accuracy of instance targets improved. In addition, optimizing the inter-class loss function through the inter-class vector distances of instances improves network training efficiency.
The above steps of the present exemplary embodiment will be described in more detail below.
In step S310, the semantic feature vector and the instance feature vector in the target image are extracted through the instance detection network, and at least two instance targets in the target image are identified through the semantic feature vector and the instance feature vector.
The target image may be a satellite image, i.e., a satellite top-down image. The instance detection network may be used to identify instance targets. As a sample image, the target image may include one or more instance targets; when multiple instance targets are included, they may be of the same or different sizes. For example, an instance target may be an object as defined above, such as a building or a green area in a satellite top-down image, and the embodiment of the present application is not limited thereto.
As an alternative embodiment, extracting semantic feature vectors and instance feature vectors in the target image through the instance detection network includes: acquiring a shared feature vector corresponding to the target image; respectively inputting the shared feature vector into a semantic feature extraction sub-network and an instance feature extraction sub-network in the instance detection network; extracting semantic feature vectors through the semantic feature extraction sub-network; and extracting instance feature vectors through the instance feature extraction sub-network.
The semantic feature extraction sub-network and the instance feature extraction sub-network share the same network architecture with different network parameters; the semantic feature extraction sub-network corresponds to the semantic loss function, and the instance feature extraction sub-network corresponds to the intra-class loss function and the inter-class loss function.
Specifically, the semantic feature extraction sub-network and the instance feature extraction sub-network adopt a Feature Pyramid Network (FPN) strategy: each contains a plurality of deconvolution layers, the output of each deconvolution layer serves as the input of the next in order, and each deconvolution layer has its own weights and bias terms.
It can be seen that implementing this alternative embodiment can improve network training efficiency by sharing the shared feature vector between the semantic feature extraction sub-network and the instance feature extraction sub-network.
As an alternative embodiment, extracting the semantic feature vector through the semantic feature extraction sub-network includes: performing deconvolution on the shared feature vector through a plurality of semantic convolution layers in the semantic feature extraction sub-network to obtain the semantic feature vector, wherein the input of each semantic convolution layer is the output of the previous semantic convolution layer together with the shared feature vector;
and extracting the instance feature vector through the instance feature extraction sub-network includes:
performing deconvolution on the shared feature vector through a plurality of instance convolution layers in the instance feature extraction sub-network to obtain the instance feature vector, wherein the input of each instance convolution layer is the output of the previous instance convolution layer together with the shared feature vector.
The deconvolution increases the spatial scale of the feature vectors and fuses in the feature information necessary for the upper-layer upsampling.
Therefore, by implementing this optional embodiment, the accuracy of detecting instance targets can be improved through multi-level feature extraction.
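For illustration, the following is a minimal PyTorch sketch of the two parallel sub-networks described above; the module names, channel sizes, and the use of concatenation to combine each layer's previous output with the (resized) shared feature vector are illustrative assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchDecoder(nn.Module):
    """Shared architecture for the semantic and instance branches (separate weights)."""
    def __init__(self, shared_ch=128, out_ch=8, num_layers=2):
        super().__init__()
        # Each deconvolution layer consumes the previous output plus the shared feature.
        self.deconvs = nn.ModuleList(
            nn.ConvTranspose2d(shared_ch * 2 if i > 0 else shared_ch,
                               shared_ch, kernel_size=2, stride=2)
            for i in range(num_layers)
        )
        self.head = nn.Conv2d(shared_ch, out_ch, kernel_size=1)

    def forward(self, shared):
        x = shared
        for i, deconv in enumerate(self.deconvs):
            if i > 0:
                # Input is the previous layer's output AND the shared feature vector.
                resized = F.interpolate(shared, size=x.shape[-2:],
                                        mode="bilinear", align_corners=False)
                x = torch.cat([x, resized], dim=1)
            x = deconv(x)
        return self.head(x)

shared = torch.randn(1, 128, 32, 32)
semantic_branch, instance_branch = BranchDecoder(), BranchDecoder()
semantic_features = semantic_branch(shared)   # trained with the semantic loss
instance_features = instance_branch(shared)   # trained with intra-/inter-class losses
```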
As an alternative embodiment, acquiring the shared feature vector corresponding to the target image includes: extracting a reference feature vector of the target image; normalizing the reference feature vector to obtain a normalization result conforming to a normal distribution; and performing nonlinear mapping on the normalization result to obtain the shared feature vector.
Extracting the reference feature vector of the target image includes: sequentially extracting features of the target image through a plurality of convolution layers, the output of each convolution layer being the input of the next. Specifically, this sequential feature extraction includes: extracting basic texture features of the target image through first-type convolution layers among the plurality of convolution layers, and extracting the reference feature vector corresponding to the basic texture features through second-type convolution layers; the first-type convolution layers are lower layers, the second-type convolution layers are higher layers, and each type may comprise multiple layers.
Based on this, normalizing the reference feature vector to obtain a normalization result conforming to a normal distribution includes: applying a normalization function to the reference feature vector to obtain a normalization result conforming to a normal distribution. Further, performing nonlinear mapping on the normalization result to obtain the shared feature vector includes: applying a preset nonlinear function to the normalization result to obtain the shared feature vector.
Therefore, by implementing this optional embodiment, normalizing the reference feature vector avoids vanishing gradients and improves network training efficiency, while the nonlinear mapping of the normalization result enhances the generalization capability of the network.
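For illustration, a minimal PyTorch sketch of this shared-feature extraction, assuming BatchNorm as the normalization toward a normal distribution and ReLU as the nonlinear mapping (layer counts and channel sizes are illustrative):

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Illustrative backbone: convolution -> normalization -> nonlinear mapping."""
    def __init__(self, in_channels=3, mid_channels=64, out_channels=128):
        super().__init__()
        # First-type (lower) layers: basic texture features.
        self.low = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(mid_channels),   # normalization toward a normal distribution
            nn.ReLU(inplace=True),          # nonlinear mapping
        )
        # Second-type (higher) layers: reference feature vector.
        self.high = nn.Sequential(
            nn.Conv2d(mid_channels, out_channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, image):
        return self.high(self.low(image))   # shared feature vector

shared = SharedBackbone()(torch.randn(1, 3, 256, 256))  # -> (1, 128, 64, 64)
```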
As an alternative embodiment, identifying the at least two instance targets in the target image through the semantic feature vector and the instance feature vector includes: fusing the feature vectors corresponding to the same instance target in the semantic feature vectors and the instance feature vectors to obtain a fused feature vector set; clustering the fused feature vector set to obtain at least two vector clusters; and determining the at least two instance targets according to the positions the vector clusters represent in the target image.
Fusing the feature vectors corresponding to the same instance target in the semantic feature vectors and the instance feature vectors to obtain a fused feature vector set includes: splicing (concatenating) the feature vectors corresponding to the same instance target in the semantic feature vectors and the instance feature vectors to realize vector fusion, thereby obtaining the fused feature vector set.
For example, suppose the target image includes instance target A and instance target B. Fusing the feature vectors corresponding to the same instance target then includes: determining the feature vector of instance target A from the semantic feature vector and the instance feature vector respectively and performing vector fusion to obtain the fused feature vector corresponding to instance target A; and determining the feature vector of instance target B from the semantic feature vector and the instance feature vector respectively and performing vector fusion to obtain the fused feature vector corresponding to instance target B, thereby obtaining the fused feature vector set, which may include the fused feature vectors of instance targets A and B.
In addition, clustering the fused feature vector set to obtain at least two vector clusters includes: representing each fused feature vector in the set as a coordinate point in the same coordinate system, and clustering according to the distances between the coordinate points to obtain at least two vector clusters.
Therefore, by implementing the optional embodiment, the detection precision of the instance target in the image can be improved by fusing the semantic feature vector and the instance feature vector and clustering the features.
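A sketch of this fuse-then-cluster identification under simple assumptions: per-pixel feature vectors are fused by concatenation and clustered by distance with scikit-learn's MeanShift (the text does not name a clustering algorithm, so MeanShift is merely one workable choice):

```python
import numpy as np
from sklearn.cluster import MeanShift

def identify_instances(semantic_feats, instance_feats, bandwidth=0.5):
    """semantic_feats, instance_feats: (C, H, W) arrays of per-pixel features."""
    c, h, w = instance_feats.shape
    # Fuse by splicing (concatenating) the two vectors at each pixel position.
    fused = np.concatenate([semantic_feats, instance_feats], axis=0)  # (2C, H, W)
    fused_set = fused.reshape(fused.shape[0], -1).T                   # (H*W, 2C)
    # Treat each fused vector as a coordinate point and cluster by distance.
    labels = MeanShift(bandwidth=bandwidth).fit_predict(fused_set)
    # Each vector cluster, mapped back to pixel positions, is one instance target.
    return labels.reshape(h, w)

instance_map = identify_instances(np.random.rand(4, 32, 32),
                                  np.random.rand(4, 32, 32))
```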
As an optional embodiment, fusing the feature vectors corresponding to the same instance target in the semantic feature vectors and the instance feature vectors to obtain a fused feature vector set includes: predicting at least two first reference instance targets in the target image according to the semantic feature vectors; predicting at least two second reference instance targets in the target image according to the instance feature vectors, the first and second reference instance targets being in one-to-one correspondence; and fusing the feature vectors corresponding to the same instance target in the first and second reference instance targets to obtain the fused feature vector set.
Fusing the feature vectors corresponding to the same instance target in the first reference instance targets and the second reference instance targets to obtain a fused feature vector set includes: splicing (concatenating) the feature vectors corresponding to the same instance target in the first and second reference instance targets to obtain the fused feature vector set.
Therefore, by implementing the optional embodiment, the network detection precision can be improved by fusing the semantic feature vector and the instance feature vector.
In step S320, a pixel area ratio of each of the at least two instance targets to all of the instance targets and inter-class vector distances corresponding to the at least two instance targets are calculated.
The pixel area ratio of each instance target among all instance targets is calculated as follows: calculate the sub-pixel area $\mathrm{Area}_c$ corresponding to each of the at least two instance targets, where $c$ denotes an instance target and different instance targets correspond to different $c$; sum all sub-pixel areas to obtain the total instance pixel area $\mathrm{Area}_{all}$; and calculate from $\mathrm{Area}_c$ and $\mathrm{Area}_{all}$ the pixel area ratio of each instance target among all instance targets:

$$\mathrm{ratio}_c = \frac{\mathrm{Area}_c}{\mathrm{Area}_{all}}$$
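For instance, given a per-pixel instance label map, the ratio can be computed with a few lines of NumPy (the 0-as-background encoding is an assumption):

```python
import numpy as np

def pixel_area_ratios(label_map):
    """label_map: (H, W) int array; 0 = background, c > 0 = instance target c."""
    ids = np.unique(label_map)
    ids = ids[ids > 0]
    area_c = {c: int((label_map == c).sum()) for c in ids}   # Area_c per instance
    area_all = sum(area_c.values())                          # Area_all over all instances
    return {c: a / area_all for c, a in area_c.items()}      # ratio_c = Area_c / Area_all

labels = np.zeros((4, 4), dtype=int)
labels[0:2, 0:2] = 1      # instance 1: 4 pixels
labels[3, :] = 2          # instance 2: 4 pixels
print(pixel_area_ratios(labels))   # {1: 0.5, 2: 0.5}
```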
As an alternative embodiment, calculating the inter-class vector distances corresponding to the at least two instance targets includes: calculating the feature vector mean corresponding to each instance target in the at least two instance targets according to the fused feature vector set; and calculating, from the feature vector means, the inter-class vector distance corresponding to each pair of instance targets among the at least two instance targets.
Specifically, calculating the feature vector mean corresponding to each instance target according to the fused feature vector set includes: determining, from the fused feature vector set, the subset of fused feature vectors corresponding to each instance target, where subset $i$ contains all fused feature vectors corresponding to instance target $i$ and none corresponding to other instance targets (such as $i-1$, $i+1$, and so on); the feature vector mean of each subset can then be calculated, e.g. $\mu_{ca}$ and $\mu_{cb}$, the feature vector means corresponding to two different instance targets. The mean of each subset may be computed by element-wise addition of all its fused feature vectors and dividing each element of the sum by the number of fused feature vectors in the subset.
Based on the above, calculating the inter-class vector distance corresponding to each pair of the at least two instance targets according to the feature vector means includes: calculating the inter-class vector distance $\mathrm{Loc}_{ca,cb} = \lVert \mu_{ca} - \mu_{cb} \rVert$ for every two instance targets, where $ca$ and $cb$ denote two different instance targets. In addition, the method may further include: if $\mathrm{Loc}_{ca,cb} < 2\delta_d$ is detected, judging that $\mathrm{Loc}_{ca,cb}$ satisfies the vector distance threshold, where the threshold may be taken as $2\delta_d$, twice the preset inter-class penalty factor.
Therefore, the optional embodiment can be implemented to calculate the inter-class vector distance, and when the inter-class vector distance is applied to the inter-class loss function calculation, the example target which does not meet the inter-class vector distance can be filtered, so that the efficiency of the inter-class loss function calculation is improved.
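A PyTorch sketch of the mean and pairwise-distance computation; the Euclidean distance and the $2\delta_d$ threshold follow the reconstruction above and are assumptions:

```python
import torch

def class_means_and_distances(fused_vectors, labels, delta_d=1.5):
    """fused_vectors: (N, D) fused feature vectors; labels: (N,) instance ids."""
    ids = labels.unique()
    # Feature vector mean per instance: element-wise sum / count.
    means = torch.stack([fused_vectors[labels == c].mean(dim=0) for c in ids])
    loc = torch.cdist(means, means)          # Loc_{ca,cb} for every pair
    satisfies = loc < 2 * delta_d            # pairs still needing the push loss
    return means, loc, satisfies

vecs = torch.randn(100, 8)
labs = torch.randint(0, 3, (100,))
means, loc, mask = class_means_and_distances(vecs, labs)
```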
In step S330, an intra-class loss function is calculated from the pixel area ratio and the at least two instance targets, an inter-class loss function is calculated from the inter-class vector distance and the at least two instance targets, and a semantic loss function is calculated from the at least two instance targets.
If the semantic loss function is a cross-entropy loss function, the expression corresponding to the semantic loss function $L_{semantic}$ may be:

$$L_i = -\left[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\right]$$

$$L_{semantic} = \frac{1}{N} \sum_{i=1}^{N} L_i$$

where $p_i$ denotes the prediction probability and $y_i$ is the category label, $y_i \in \{0, 1\}$. Based on this, calculating the semantic loss function from the at least two instance targets includes: determining the fused feature vector set of each of the at least two instance targets; then, taking one instance target as an example, the semantic loss $L_i$ of each fused feature vector $i$ in that instance target's fused feature vector set can be calculated, and the semantic loss function $L_{semantic}$ corresponding to the semantic feature vector is calculated from the $L_i$ of all fused feature vectors.
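For reference, PyTorch provides this per-element cross-entropy directly; a minimal sketch matching the formulas above (tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

p = torch.sigmoid(torch.randn(16))         # p_i: predicted foreground probabilities
y = torch.randint(0, 2, (16,)).float()     # y_i: 0/1 category labels
l_semantic = F.binary_cross_entropy(p, y)  # mean of -[y*log(p) + (1-y)*log(1-p)]
```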
As an alternative embodiment, calculating the intra-class loss function from the pixel area ratio and the at least two instance targets comprises: calculating the intra-class feature means corresponding to the at least two instance targets respectively, the number of intra-class feature means being consistent with the number of instance targets; and calculating the intra-class loss function according to a preset intra-class penalty factor, the pixel area ratio, the fused feature vector set, the number of feature vectors of each instance target, the number of feature vectors not satisfying the preset intra-class penalty factor, and the intra-class feature means.
Specifically, before calculating the intra-class loss function from the preset intra-class penalty factor, the pixel area ratio, the fused feature vector set, the number of feature vectors of each instance target, the number of feature vectors not satisfying the preset intra-class penalty factor, and the intra-class feature means, the method may further include the following steps: determining the number $N_c$ of feature vectors corresponding to each instance target according to the fused feature vector set; determining, among the $N_c$ feature vectors of each instance target, the number $\tilde{N}_c$ of feature vectors whose distance from the intra-class feature mean is greater than $\delta_v$, i.e. the feature vectors not satisfying the preset intra-class penalty factor $\delta_v$; and determining the total number $CC$ of instance targets according to the fused feature vector set.
Based on this, calculating the intra-class loss function from the preset intra-class penalty factor $\delta_v$, the pixel area ratio $\mathrm{ratio}_c$, $N_c$, $\tilde{N}_c$, $CC$, the intra-class feature mean $\mu_c$, and the pixel feature $x_i$ of the $i$-th element of each instance target's fused feature vector set may, for example, take the following form:

$$L_{intra} = \frac{1}{CC} \sum_{c=1}^{CC} \mathrm{ratio}_c \cdot \frac{1}{N_c} \sum_{i=1}^{N_c} \left[\, \lVert \mu_c - x_i \rVert - \delta_v \,\right]_+^2$$

where $[\cdot]_+ = \max(0, \cdot)$, only the $\tilde{N}_c$ feature vectors violating the margin $\delta_v$ contribute to the inner sum, and $\delta_v$ may be a constant.
Therefore, by implementing this optional embodiment, calculating the intra-class loss function based on the pixel area ratio addresses the smoothing-out of losses at instance target edges and improves the calculation precision of the intra-class loss function. In addition, basing the intra-class loss on the pixel area ratio distributes weight by instance area, so that the weight distribution among different instances is adaptively and dynamically adjusted and the loss gap between hard and easy samples is dynamically balanced: the loss weight of large buildings is increased and that of small buildings reduced, improving the recognition precision of every building in the target image.
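A sketch of this area-weighted intra-class (pull) loss, consistent with the reconstruction above; since the patent's original formula image is not recoverable, the exact normalization is an assumption:

```python
import torch

def intra_class_loss(fused_vectors, labels, ratios, delta_v=0.5):
    """fused_vectors: (N, D); labels: (N,) instance ids; ratios: {id: ratio_c}."""
    ids = labels.unique()
    cc = len(ids)                                   # CC: total number of instances
    loss = fused_vectors.new_zeros(())
    for c in ids:
        x = fused_vectors[labels == c]              # pixel features x_i of instance c
        mu = x.mean(dim=0)                          # intra-class feature mean mu_c
        # Hinged distances: only vectors violating delta_v contribute.
        hinge = (torch.norm(x - mu, dim=1) - delta_v).clamp(min=0)
        loss = loss + ratios[int(c)] * (hinge ** 2).mean()
    return loss / cc

vecs = torch.randn(100, 8)
labs = torch.randint(0, 2, (100,))
loss = intra_class_loss(vecs, labs, ratios={0: 0.3, 1: 0.7})
```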
As an alternative embodiment, calculating the inter-class loss function based on the inter-class vector distances and the at least two instance targets comprises: calculating the inter-class loss function according to a preset inter-class penalty factor, the inter-class vector distances, the fused feature vector set, a vector distance threshold, and the number of instance-target pairs satisfying the vector distance threshold.
Before calculating the inter-class loss function from the preset inter-class penalty factor, the inter-class vector distances, the fused feature vector set, the vector distance threshold, and the number of instance-target pairs satisfying the vector distance threshold, the method may further include: determining the total number $CC$ of instance targets according to the fused feature vector set.
Based on this, calculating the inter-class loss function from $CC$, the preset inter-class penalty factor $\delta_d$, the inter-class vector distances $\mathrm{Loc}_{ca,cb}$, the fused feature vector set, the vector distance threshold $2\delta_d$, and the number of instance-target pairs satisfying the threshold may, for example, take the following form:

$$L_{inter} = \frac{1}{CC(CC-1)} \sum_{ca=1}^{CC} \sum_{\substack{cb=1 \\ cb \neq ca}}^{CC} \left[\, 2\delta_d - \mathrm{Loc}_{ca,cb} \,\right]_+^2$$

where $[\cdot]_+ = \max(0, \cdot)$ and $\delta_d$ may be a constant. The pairs satisfying the vector distance threshold are those whose inter-class vector distance is less than $2\delta_d$, and only these contribute to the loss. That is, two instance targets whose vector distance is greater than $2\delta_d$ are two different buildings for which the loss function need not be recalculated, while two instance targets whose vector distance is less than $2\delta_d$ may be two different buildings or two portions of the same building.
Therefore, by implementing the optional embodiment, the inter-class loss function can be calculated based on the inter-class vector distance, so that the network training difficulty is reduced, the calculation amount of the loss function is reduced, and the function convergence speed is accelerated.
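A matching sketch of the inter-class (push) loss, hinged at the $2\delta_d$ threshold so that pairs already farther apart contribute nothing (again assuming the reconstructed form above):

```python
import torch

def inter_class_loss(means, delta_d=1.5):
    """means: (CC, D) feature vector means, one per instance target."""
    cc = means.shape[0]
    if cc < 2:
        return means.new_zeros(())
    loc = torch.cdist(means, means)                     # Loc_{ca,cb}
    hinge = (2 * delta_d - loc).clamp(min=0) ** 2       # zero for well-separated pairs
    off_diag = ~torch.eye(cc, dtype=torch.bool)         # exclude ca == cb
    return hinge[off_diag].sum() / (cc * (cc - 1))

loss = inter_class_loss(torch.randn(3, 8))
```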
In step S340, the instance detection network is trained according to the semantic loss function, the intra-class loss function, and the inter-class loss function.
The semantic loss function characterizes the loss of the semantic feature extraction process; the intra-class loss function characterizes the intra-cluster cohesion loss and constrains the pixels within the same instance to be as close together as possible; the inter-class loss function characterizes the inter-class discrimination loss and constrains the pixels of different instances to be as distinct as possible. Specifically, the semantic loss function may be a 0-1 loss function, an absolute value loss function, a log loss function, a square loss function, an exponential loss function, a Hinge loss function, a perceptual loss function, or a cross-entropy loss function, among others; the embodiment of the present application is not limited thereto.
In addition, the method may further include: calculating a background loss function according to the at least two instance targets, and training the instance detection network according to the background loss function so as to improve the network's ability to distinguish foreground from background; the instance targets in the image are the foreground, and the remaining parts are the background.
In addition, after the instance detection network is trained according to the semantic loss function, the intra-class loss function and the inter-class loss function, the method may further include: inputting an image to be recognized, received in real time, into the trained instance detection network, and extracting the shared feature vector of the image through the instance detection network; respectively inputting the shared feature vector into the instance feature extraction sub-network and the semantic feature extraction sub-network to obtain an instance feature map and a semantic feature map corresponding to the image to be recognized; fusing the features corresponding to the same instance according to the instance feature map and the semantic feature map to obtain a fused feature set; clustering the fused feature set to obtain at least two vector clusters; and determining at least two instance targets according to the positions the vector clusters represent in the image to be recognized.
Based on this, the following steps can be further included: and generating an example contour, and marking an example target in the image to be recognized according to the example contour so as to highlight the example target in the image to be recognized. Further, the method may further include: displaying the marked image to be identified on a user interface, wherein a mark trace in the image to be identified is in an editable state; furthermore, the mark trace can be adjusted by detecting the operation of the user, so that a secondary editing function can be provided for the user, and the mark accuracy of the example in the image is improved.
Based on this, the following steps may be further included: performing vectorization, capping (overlap) detection, and data standardization on the at least two example targets and outputting the results, so as to improve the usability of the network output; the capping detection is used to detect instance objects that cap (overlap) one another.
As an alternative embodiment, training the example detection network according to the semantic loss function, the intra-class loss function, and the inter-class loss function includes: training the example feature extraction sub-network in the example detection network according to the intra-class loss function and the inter-class loss function; and training the semantic feature extraction sub-network in the example detection network according to the semantic loss function.
Specifically, training the semantic feature extraction sub-network in the example detection network according to the semantic loss function includes: adjusting the network parameters of the semantic feature extraction sub-network according to the semantic loss function until the semantic loss function is within a preset threshold range. The preset threshold range corresponding to the semantic loss function may be the same as or different from the preset threshold ranges corresponding to the intra-class loss function and the inter-class loss function, and a preset threshold range may be formed by an upper limit value and a lower limit value. In addition, the network parameters of the semantic feature extraction sub-network may include network weights and/or bias terms.
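For concreteness, a minimal sketch of this stopping rule follows, assuming a preset threshold range formed by a lower limit and an upper limit; the optimizer, learning rate, and bound values are illustrative assumptions rather than values given in the present application.

```python
# Hedged sketch: adjust network parameters (weights and/or bias terms)
# until the monitored loss falls within the preset threshold range.
import torch

def train_until_in_range(model, loss_fn, data_iter,
                         lower=0.0, upper=0.05, max_steps=10_000):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer
    for _ in range(max_steps):
        x, y = next(data_iter)
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if lower <= loss.item() <= upper:  # loss within [lower, upper]: stop
            break
    return model
```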
It can be seen that by implementing this alternative embodiment, sub-networks serving different extraction purposes can be trained with different loss functions, so as to improve the recognition accuracy for the example target.
As an alternative embodiment, training an instance feature extraction sub-network in an instance detection network according to an intra-class loss function and an inter-class loss function includes: and adjusting network parameters of the example feature extraction sub-network according to the intra-class loss function and the inter-class loss function until the intra-class loss function and the inter-class loss function are both within a preset threshold range.
Adjusting the network parameters of the example feature extraction sub-network according to the intra-class loss function and the inter-class loss function includes: calculating a total loss function L = a × (intra-class loss function) + b × (inter-class loss function) according to the weight values corresponding to the intra-class loss function and the inter-class loss function respectively, and then adjusting the network parameters of the example feature extraction sub-network through the total loss function L; where a is the weight value of the intra-class loss function and b is the weight value of the inter-class loss function. In this way, the weight values of the intra-class loss function and the inter-class loss function are proportionally distributed, further improving the optimization effect of the example detection network. Additionally, the network parameters of the example feature extraction sub-network may include network weights and/or bias terms.
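Expressed as code, the weighted combination reads as follows; the default weight values of 1.0 are placeholders, since the concrete values of a and b are left open above.

```python
# Sketch of the total loss L = a * intra-class loss + b * inter-class loss.
def total_loss(intra_loss, inter_loss, a=1.0, b=1.0):
    # a: weight value of the intra-class loss function
    # b: weight value of the inter-class loss function
    return a * intra_loss + b * inter_loss

# e.g., emphasizing inter-class separation: total_loss(l1, l2, a=1.0, b=2.0)
```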
Specifically, adjusting network parameters of the instance feature extraction sub-network according to the intra-class loss function and the inter-class loss function until the intra-class loss function and the inter-class loss function are both within a preset threshold range includes: and adjusting network parameters of the example feature extraction sub-network according to the intra-class loss function and the inter-class loss function until the intra-class loss function is within a first preset range and the inter-class loss function is within a second preset range. The first preset range and the second preset range both belong to the preset threshold range.
Therefore, by implementing the optional embodiment, parameters can be adjusted according to the inter-class loss function and the intra-class loss function, the feature extraction effect of the example feature extraction sub-network is improved, and the identification precision of the optimized example detection network is improved.
Referring to fig. 4, fig. 4 schematically illustrates an architecture diagram of an example detection network according to an embodiment of the present application. As shown in fig. 4, the example detection network may include a shared feature extraction layer 410, an example feature extraction sub-network 420, and a semantic feature extraction sub-network 430. The example detection network may adopt a U-Net + FPN structure, where U-Net is a feature extraction network capable of quickly and effectively segmenting small-sample data sets.
Specifically, the target image may be input into the shared feature extraction layer 410, so that the shared feature extraction layer 410 extracts a reference feature vector of the target image, normalizes the reference feature vector to obtain a normalization result conforming to a normal distribution, and then performs nonlinear mapping on the normalization result to obtain a shared feature vector; the shared feature vector is shared with the example feature extraction sub-network 420 and the semantic feature extraction sub-network 430. The example feature extraction sub-network 420 and the semantic feature extraction sub-network 430 may then perform example feature extraction and semantic feature extraction on the shared feature vector, respectively, obtaining an example feature vector and a semantic feature vector that may each be represented as a feature image. The example detection network can then fuse the example feature vector and the semantic feature vector to obtain a fused feature vector set, and clustering the fused feature vectors yields at least one cluster, where one cluster is composed of a plurality of feature points that together represent one example (such as a building top view). Finally, the example detection network can determine the number of examples according to the clustering result, thereby identifying the example targets in the target image and obtaining a recognition result comprising at least one example target.
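The topology of fig. 4 can be sketched roughly as follows; plain convolutional blocks stand in for the U-Net + FPN structure, and all layer sizes, channel counts, and names are assumptions made for the illustration.

```python
# Hedged sketch of the fig. 4 topology: one shared feature extraction
# layer feeding an instance head (420) and a semantic head (430).
import torch
import torch.nn as nn

class SharedExtractor(nn.Module):
    def __init__(self, in_ch=3, ch=32):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(ch)  # normalization toward a normal distribution
        self.act = nn.ReLU()          # nonlinear mapping of the normalized result

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class InstanceDetectionNet(nn.Module):
    def __init__(self, embed_dim=8, num_classes=2):
        super().__init__()
        self.shared = SharedExtractor()
        self.instance_head = nn.Conv2d(32, embed_dim, 1)    # example feature map
        self.semantic_head = nn.Conv2d(32, num_classes, 1)  # semantic feature map

    def forward(self, x):
        shared = self.shared(x)  # shared feature vector for both sub-networks
        return self.instance_head(shared), self.semantic_head(shared)

net = InstanceDetectionNet()
inst, sem = net(torch.randn(1, 3, 64, 64))
print(inst.shape, sem.shape)  # (1, 8, 64, 64) and (1, 2, 64, 64)
```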
Referring to fig. 5 on the basis of fig. 4, fig. 5 schematically illustrates an example recognition effect according to an embodiment of the present application. As shown in fig. 5, image 510 is the result of recognizing a satellite image using a prior-art building-complex recognition scheme, and image 520 is the result of recognizing the same satellite image using the example detection network of the embodiment of the present application. The small building groups missed in image 510 are identified in image 520, showing that recognizing building groups in satellite images with the embodiment of the present application improves recognition accuracy. In addition, because the inter-class vector distance is introduced into the inter-class loss function, network training efficiency can be improved, the probability that a single instance target is recognized as multiple objects is reduced, and the recognition accuracy of instance targets in the image is improved. When the instance target is a building, the recognition precision of buildings in satellite aerial views can be improved, the recognition precision of dense buildings is guaranteed, the fineness of segmentation of large buildings is improved, and the cost of manually annotating buildings in images is reduced.
Referring to fig. 6, fig. 6 schematically illustrates a flow diagram of an example optimization method of detecting a network, according to an embodiment of the present application. As shown in fig. 6, the optimization method of the example detection network includes: step S600 to step S622.
Step S600: extracting a reference feature vector of the target image, normalizing the reference feature vector to obtain a normalization result conforming to a normal distribution, and performing nonlinear mapping processing on the normalization result to obtain a shared feature vector.
Step S602: inputting the shared feature vector into the semantic feature extraction sub-network and the example feature extraction sub-network in the example detection network, respectively.
Step S604: performing deconvolution processing on the shared feature vector through a plurality of example convolution layers in the example feature extraction sub-network to obtain an example feature vector, wherein the input of each of the plurality of example convolution layers is the output of the previous example convolution layer together with the shared feature vector. Then, step S608 is executed.
Step S606: performing deconvolution processing on the shared feature vector through a plurality of semantic convolution layers in the semantic feature extraction sub-network to obtain a semantic feature vector, wherein the input of each of the plurality of semantic convolution layers is the output of the previous semantic convolution layer together with the shared feature vector. Then, step S608 is executed.
Step S608: fusing the feature vectors corresponding to the same example target in the semantic feature vector and the example feature vector to obtain a fused feature vector set.
Step S610: clustering the fused feature vector set to obtain at least two vector clusters, and determining at least two example targets according to the expression positions of the vector clusters in the target image.
Step S612: calculating a feature vector mean corresponding to each of the at least two example targets according to the fused feature vector set, and calculating the inter-class vector distance corresponding to each two of the at least two example targets according to the feature vector means.
Step S614: calculating the pixel area ratio of each of the at least two example targets to all the example targets.
Step S616: calculating the intra-class feature means corresponding to the at least two example targets respectively, the number of intra-class feature means being consistent with the number of the at least two example targets.
Step S618: calculating the intra-class loss function according to a preset intra-class penalty factor, the pixel area ratio, the fused feature vector set, the number of feature vectors of each example target, the number of feature vectors that do not satisfy the preset intra-class penalty factor, and the intra-class feature means.
Step S620: calculating the inter-class loss function according to a preset inter-class penalty factor, the inter-class vector distances, the fused feature vector set, a vector distance threshold, and the number of example targets satisfying the vector distance threshold.
Step S622: training the example feature extraction sub-network in the example detection network according to the intra-class loss function and the inter-class loss function, and training the semantic feature extraction sub-network in the example detection network according to the semantic loss function.
It should be noted that steps S600 to S622 correspond to the steps and embodiments shown in fig. 3; for their specific implementations, please refer to the steps and embodiments shown in fig. 3, which are not described again here.
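To make steps S608 to S610 more tangible, the sketch below clusters the fused per-pixel features and reads each resulting cluster back as one example target. The present application does not fix a particular clustering method, so mean-shift is used here purely as a stand-in, and all names and parameters are hypothetical.

```python
# Hedged sketch of steps S608-S610: cluster fused per-pixel feature
# vectors, then map each cluster back to its positions in the image.
import numpy as np
from sklearn.cluster import MeanShift

def pixels_to_instances(fused_features, foreground_mask, bandwidth=1.0):
    # fused_features: (H, W, D) fused feature vector set (step S608)
    # foreground_mask: (H, W) bool, True where a pixel belongs to any instance
    vectors = fused_features[foreground_mask]         # (P, D) foreground features
    labels = MeanShift(bandwidth=bandwidth).fit_predict(vectors)  # step S610
    instance_map = np.full(foreground_mask.shape, -1)  # -1 marks background
    instance_map[foreground_mask] = labels  # expression positions of each cluster
    return instance_map
```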
Further, in the present exemplary embodiment, an optimization apparatus for an instance detection network is also provided. Referring to fig. 7, the optimization apparatus 700 of the example detection network may include: a feature extraction unit 701, a parameter calculation unit 702, a loss function calculation unit 703, and a parameter adjustment unit 704, wherein:
a feature extraction unit 701, configured to extract a semantic feature vector and an instance feature vector in a target image through an instance detection network, and identify at least two instance targets in the target image through the semantic feature vector and the instance feature vector;
a parameter calculating unit 702, configured to calculate a pixel area ratio of each of the at least two instance targets to all of the instance targets and inter-class vector distances corresponding to the at least two instance targets;
a loss function calculation unit 703 for calculating an intra-class loss function according to the pixel area ratio and the at least two instance targets, calculating an inter-class loss function according to the inter-class vector distance and the at least two instance targets, and calculating a semantic loss function according to the at least two instance targets;
and a parameter adjusting unit 704, configured to train the instance detection network according to the semantic loss function, the intra-class loss function, and the inter-class loss function.
Therefore, by implementing the device shown in fig. 7, the intra-class loss function can be optimized through the example pixel area and the inter-class loss function through the inter-class vector distance of the example, with the weights adaptively adjusted according to the example pixel area, so that the detection accuracy of the example target is improved. In addition, optimizing the inter-class loss function through the inter-class vector distance of the example can improve network training efficiency.
In an exemplary embodiment of the present application, the feature extraction unit 701 extracts a semantic feature vector and an instance feature vector in a target image through an instance detection network, including:
acquiring a shared characteristic vector corresponding to a target image, and respectively inputting the shared characteristic vector into a semantic characteristic extraction sub-network and an example characteristic extraction sub-network in an example detection network;
and extracting semantic feature vectors through a semantic feature extraction sub-network, and extracting example feature vectors through an example feature extraction sub-network.
The semantic feature extraction sub-network and the example feature extraction sub-network correspond to different network parameters of the same network architecture, the semantic feature extraction sub-network corresponds to a semantic loss function, and the example feature extraction sub-network corresponds to an intra-class loss function and an inter-class loss function.
It can be seen that implementing this alternative embodiment can improve network training efficiency by sharing shared feature vectors to the semantic feature extraction sub-network and the instance feature extraction sub-network.
In an exemplary embodiment of the present application, the acquiring a shared feature vector corresponding to a target image by a vector acquiring unit includes:
extracting a reference characteristic vector of a target image;
normalizing the reference characteristic vector to obtain a normalization result conforming to normal distribution;
and carrying out nonlinear mapping processing on the normalization result to obtain a shared characteristic vector.
Therefore, by implementing this optional embodiment, normalizing the reference feature vector can avoid vanishing gradients and improve network training efficiency, and the nonlinear mapping of the normalization result can enhance the generalization capability of the network.
In an exemplary embodiment of the present application, the parameter adjusting unit 704 trains the instance detection network according to the semantic loss function, the intra-class loss function, and the inter-class loss function, including:
training the example feature extraction sub-network in the example detection network according to the intra-class loss function and the inter-class loss function;
and training the semantic feature extraction sub-network in the example detection network according to the semantic loss function.
It can be seen that by implementing this alternative embodiment, sub-networks serving different extraction purposes can be trained with different loss functions, so as to improve the recognition accuracy for the example target.
In an exemplary embodiment of the present application, the parameter adjusting unit 704 trains an instance feature extraction sub-network in an instance detection network according to the intra-class loss function and the inter-class loss function, and includes:
and adjusting network parameters of the example feature extraction sub-network according to the intra-class loss function and the inter-class loss function until the intra-class loss function and the inter-class loss function are both within a preset threshold range.
Therefore, by implementing the optional embodiment, parameters can be adjusted according to the inter-class loss function and the intra-class loss function, the feature extraction effect of the example feature extraction sub-network is improved, and the identification precision of the optimized example detection network is improved.
In an exemplary embodiment of the present application, the feature extraction unit 701 extracts a semantic feature vector through a semantic feature extraction subnetwork, which includes:
carrying out deconvolution processing on the shared feature vector through a plurality of semantic convolution layers in the semantic feature extraction sub-network to obtain a semantic feature vector; wherein the input of each semantic convolution layer in the plurality of semantic convolution layers is the output of the previous semantic convolution layer and the shared feature vector;
and, extracting the instance feature vector through the instance feature extraction subnetwork, comprising:
carrying out deconvolution processing on the shared feature vector through a plurality of example convolution layers in the example feature extraction sub-network to obtain an example feature vector; wherein the input of each of the plurality of example convolution layers is the output of the previous example convolution layer and the shared feature vector.
Therefore, by implementing the optional embodiment, the accuracy of detecting the example target can be improved by multi-level feature extraction.
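A rough rendering of this wiring, in which every deconvolution layer after the first receives the previous layer's output together with the shared feature vector (realized here as channel-wise concatenation), might look as follows; the depth, channel counts, and the concatenation itself are assumptions for the sketch.

```python
# Hedged sketch: a head of stacked deconvolution (transposed convolution)
# layers, each later layer consuming the previous output plus the shared
# feature vector via channel concatenation.
import torch
import torch.nn as nn

class DeconvHead(nn.Module):
    def __init__(self, shared_ch=32, out_ch=8, depth=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ConvTranspose2d((out_ch + shared_ch) if i else shared_ch,
                               out_ch, kernel_size=3, padding=1)
            for i in range(depth))

    def forward(self, shared):
        x = shared
        for i, layer in enumerate(self.layers):
            if i:  # later layers see the previous output *and* the shared features
                x = torch.cat([x, shared], dim=1)
            x = layer(x)
        return x

head = DeconvHead()
print(head(torch.randn(1, 32, 64, 64)).shape)  # (1, 8, 64, 64)
```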
In an exemplary embodiment of the present application, the feature extraction unit 701 identifies at least two example objects in the object image by the semantic feature vector and the example feature vector, including:
fusing the semantic feature vectors and the feature vectors corresponding to the same instance target in the instance feature vectors to obtain a fused feature vector set;
clustering the fusion characteristic vector set to obtain at least two vector clusters;
at least two example targets are determined according to the expression positions of the vector clusters in the target image.
Therefore, by implementing the optional embodiment, the detection precision of the instance target in the image can be improved by fusing the semantic feature vector and the instance feature vector and clustering the features.
In an exemplary embodiment of the present application, the feature extraction unit 701 fuses a semantic feature vector and a feature vector corresponding to a same instance target in an instance feature vector to obtain a fused feature vector set, including:
predicting at least two first reference example targets in the target image according to the semantic feature vector;
predicting at least two second reference example targets in the target image according to the example feature vectors; the first reference example target and the second reference example target correspond to each other one by one;
and fusing the feature vectors corresponding to the same example target in the first reference example target and the second reference example target to obtain a fused feature vector set.
Therefore, by implementing the optional embodiment, the network detection precision can be improved by fusing the semantic feature vector and the instance feature vector.
In an exemplary embodiment of the present application, the parameter calculation unit 702 calculates the inter-class vector distances corresponding to the at least two instance targets, including:
calculating a feature vector mean value corresponding to each example target in the at least two example targets according to the fusion feature vector set;
and calculating the inter-class vector distance corresponding to each two example targets in the at least two example targets according to the feature vector mean value.
Therefore, by implementing the optional embodiment, calculation of the inter-class vector distance can be realized, and when the inter-class vector distance is applied to calculation of the inter-class loss function, example targets which do not meet the inter-class vector distance can be filtered, so that the efficiency of calculation of the inter-class loss function is improved.
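Assuming the identified instances are given as a per-pixel id map, the two calculations (the feature vector mean of each instance, then the pairwise distances between those means) can be sketched as follows; the function and variable names are hypothetical.

```python
# Hedged sketch: per-instance feature vector means and the inter-class
# vector distance for every two instance targets.
import torch

def inter_class_distances(fused, instance_map):
    # fused: (H, W, D) fused feature vectors; instance_map: (H, W) long ids,
    # with -1 marking background pixels.
    ids = [i for i in instance_map.unique().tolist() if i >= 0]
    means = torch.stack([fused[instance_map == i].mean(dim=0) for i in ids])
    return torch.cdist(means, means)  # (K, K) distances between instance means
```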
In an exemplary embodiment of the present application, the loss function calculation unit 703 calculates an intra-class loss function according to a pixel area ratio and at least two example targets, including:
calculating the intra-class feature mean values corresponding to the at least two example targets respectively, wherein the number of the intra-class feature mean values is consistent with the number of the at least two example targets;
and calculating an intra-class loss function according to the penalty factors in the preset class, the pixel area ratio, the fusion feature vector set, the feature vector quantity of each instance target, the feature vector quantity which does not meet the penalty factors in the preset class and the intra-class feature mean value.
Therefore, by implementing this optional embodiment, calculating the intra-class loss function based on the pixel area ratio can alleviate the problem of loss smoothing at example target edges and improve the calculation precision of the intra-class loss function. In addition, calculating the intra-class loss function based on the pixel area ratio distributes weights according to example area, so that the weight distribution among different examples can be adaptively and dynamically adjusted and the loss differences between hard and easy samples can be dynamically balanced; the loss weight of a large building is increased while that of a small building is reduced, so the recognition precision of each building in the target image can be improved.
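One plausible reading of this computation, offered strictly as a hedged sketch, is an area-weighted cohesion term: the feature vectors of each instance are pulled toward that instance's intra-class feature mean, only residuals exceeding a preset intra-class penalty factor (a margin, delta_v below) contribute, and each instance's term is weighted by its pixel area ratio. The exact formula of the present application may differ.

```python
# Hedged sketch of an area-weighted intra-class loss; delta_v stands in
# for the preset intra-class penalty factor and is an assumed margin.
import torch

def intra_class_loss(fused, instance_map, delta_v=0.5):
    total_px = (instance_map >= 0).sum().float()     # pixels of all instances
    loss = fused.new_zeros(())
    for i in instance_map.unique().tolist():
        if i < 0:
            continue                                  # skip background
        vecs = fused[instance_map == i]               # feature vectors of instance i
        mean = vecs.mean(dim=0)                       # intra-class feature mean
        resid = (vecs - mean).norm(dim=1) - delta_v   # distance beyond the margin
        hinge = torch.clamp(resid, min=0.0)           # vectors within delta_v add 0
        area_ratio = vecs.shape[0] / total_px         # pixel area ratio as weight
        loss = loss + area_ratio * (hinge ** 2).mean()
    return loss
```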
In an exemplary embodiment of the present application, the loss function calculation unit 703 calculates an inter-class loss function according to the inter-class vector distance and at least two instance targets, including:
and calculating an inter-class loss function according to a preset inter-class penalty factor, an inter-class vector distance, a fusion characteristic vector set, a vector distance threshold and the number of instance targets meeting the vector distance threshold.
Therefore, by implementing the optional embodiment, the inter-class loss function can be calculated based on the inter-class vector distance, so that the network training difficulty is reduced, the loss function calculation amount is reduced, and the function convergence speed is accelerated.
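A matching hedged sketch of the separation term follows: pairs of instance means closer than the vector distance threshold (delta_d below, standing in for the preset inter-class penalty factor and threshold) are penalized, pairs already far enough apart are filtered out, and the sum is averaged over the number of pairs that met the threshold. Again, this is an assumed formulation, not the exact expression of the present application.

```python
# Hedged sketch of an inter-class loss over per-instance feature means.
import torch

def inter_class_loss(means, delta_d=3.0):
    # means: (K, D) feature vector mean of each instance target
    dists = torch.cdist(means, means)               # inter-class vector distances
    k = means.shape[0]
    idx = torch.triu_indices(k, k, offset=1)        # each pair of instances once
    hinge = torch.clamp(delta_d - dists[idx[0], idx[1]], min=0.0)
    active = (hinge > 0).sum().clamp(min=1)         # pairs meeting the threshold
    return (hinge ** 2).sum() / active
```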
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to the embodiments of the present application, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
For details not disclosed in the apparatus embodiments of the present application, please refer to the above-described embodiments of the optimization method of the example detection network.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be disposed in a processor, where the names of the units do not, in some cases, constitute a limitation on the units themselves.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (26)

1. An optimization method for an instance detection network, comprising:
extracting semantic feature vectors and example feature vectors in a target image through an example detection network, and identifying at least two example targets in the target image through the semantic feature vectors and the example feature vectors;
calculating the pixel area ratio of each instance target in the at least two instance targets to all instance targets and the inter-class vector distance corresponding to the at least two instance targets;
calculating an intra-class loss function based on the pixel area ratio and the at least two instance targets, the pixel area ratio being used to adjust weight distribution between the at least two instance targets, and
calculating an inter-class loss function from the instance targets of the at least two instance targets for which the inter-class vector distance satisfies a vector distance threshold, and
calculating a semantic loss function according to the at least two example targets;
and training the example detection network according to the semantic loss function, the intra-class loss function and the inter-class loss function.
2. The method of claim 1, wherein extracting semantic feature vectors and instance feature vectors in the target image through an instance detection network comprises:
acquiring a shared characteristic vector corresponding to the target image;
respectively inputting the shared feature vector into a semantic feature extraction sub-network and an example feature extraction sub-network in the example detection network;
extracting the semantic feature vector through the semantic feature extraction subnetwork;
the instance feature vector is extracted by the instance feature extraction sub-network.
3. The method of claim 2, wherein obtaining the shared feature vector corresponding to the target image comprises:
extracting a reference feature vector of the target image;
normalizing the reference characteristic vector to obtain a normalization result conforming to normal distribution;
and carrying out nonlinear mapping processing on the normalization result to obtain the shared characteristic vector.
4. The method of claim 2, wherein the semantic feature extraction sub-network and the instance feature extraction sub-network correspond to different network parameters of a same network architecture, the semantic feature extraction sub-network corresponds to the semantic loss function, and the instance feature extraction sub-network corresponds to the intra-class loss function and the inter-class loss function.
5. The method of claim 4, wherein training the instance detection network according to the semantic loss function, the intra-class loss function, and the inter-class loss function comprises:
training the instance feature extraction sub-network in the instance detection network according to the intra-class loss function and the inter-class loss function;
training the semantic feature extraction sub-network in the instance detection network according to the semantic loss function.
6. The method of claim 5, wherein training the instance feature extraction sub-network in the instance detection network according to the intra-class loss function and the inter-class loss function comprises:
and adjusting network parameters of the example feature extraction sub-network according to the intra-class loss function and the inter-class loss function until the intra-class loss function and the inter-class loss function are both within a preset threshold range.
7. The method of claim 2, wherein extracting the semantic feature vector through the semantic feature extraction sub-network comprises:
carrying out deconvolution processing on the shared feature vector through a plurality of semantic convolution layers in the semantic feature extraction sub-network to obtain the semantic feature vector; wherein the input of each semantic convolution layer in the plurality of semantic convolution layers is the output of the previous semantic convolution layer and the shared feature vector;
and extracting the instance feature vector through the instance feature extraction subnetwork, comprising:
carrying out deconvolution processing on the shared feature vector through a plurality of example convolution layers in the example feature extraction sub-network to obtain an example feature vector; wherein the input to each of the plurality of example convolutional layers is the output of the previous example convolutional layer and the shared feature vector.
8. The method of claim 1, wherein identifying at least two instance objects in the object image from the semantic feature vector and the instance feature vector comprises:
fusing the semantic feature vectors and feature vectors corresponding to the same instance target in the instance feature vectors to obtain a fused feature vector set;
clustering the fusion characteristic vector set to obtain at least two vector clusters;
and determining the at least two example targets according to the expression positions of the vector clusters in the target image.
9. The method according to claim 8, wherein fusing the semantic feature vector and the feature vector corresponding to the same instance target in the instance feature vector to obtain a fused feature vector set, comprises:
predicting at least two first reference instance targets in the target image according to the semantic feature vector;
predicting at least two second reference example targets in the target image according to the example feature vectors; wherein instance targets in the first reference instance target and the second reference instance target correspond one-to-one;
and fusing the feature vectors corresponding to the same instance target in the first reference instance target and the second reference instance target to obtain the fused feature vector set.
10. The method of claim 8, wherein calculating the inter-class vector distance corresponding to the at least two instance targets comprises:
calculating a feature vector mean value corresponding to each instance target in the at least two instance targets according to the fusion feature vector set;
and calculating the inter-class vector distance corresponding to each two example targets in the at least two example targets according to the feature vector mean value.
11. The method of claim 8, wherein computing an intra-class loss function from the pixel area ratio and the at least two instance targets comprises:
calculating intra-class feature mean values corresponding to the at least two example targets respectively, wherein the number of the intra-class feature mean values is consistent with the number of the at least two example targets;
and calculating the intra-class loss function according to the punishment factors in the preset class, the pixel area ratio, the fusion feature vector set, the feature vector quantity of each instance target, the feature vector quantity which does not meet the punishment factors in the preset class and the intra-class feature mean value.
12. The method of claim 8, wherein computing an inter-class loss function based on the one of the at least two instance targets for which the inter-class vector distance satisfies a vector distance threshold comprises:
and calculating the inter-class loss function according to a preset inter-class penalty factor, the inter-class vector distance, the fusion characteristic vector set, the vector distance threshold and the number of instance targets meeting the vector distance threshold.
13. An apparatus for optimizing an instance detection network, comprising:
the feature extraction unit is used for extracting semantic feature vectors and example feature vectors in a target image through an example detection network, and identifying at least two example targets in the target image through the semantic feature vectors and the example feature vectors;
the parameter calculation unit is used for calculating the pixel area ratio of each instance target in the at least two instance targets to all the instance targets and the inter-class vector distance corresponding to the at least two instance targets;
a loss function calculation unit, configured to calculate an intra-class loss function according to the pixel area ratio and the at least two instance targets, the pixel area ratio being used to adjust weight distribution between the at least two instance targets, to calculate an inter-class loss function according to the instance targets of the at least two instance targets for which the inter-class vector distance satisfies a vector distance threshold, and to calculate a semantic loss function according to the at least two instance targets;
and the parameter adjusting unit is used for training the example detection network according to the semantic loss function, the intra-class loss function and the inter-class loss function.
14. The apparatus of claim 13, wherein the feature extraction unit extracts the semantic feature vector and the instance feature vector in the target image through an instance detection network, and comprises: acquiring a shared characteristic vector corresponding to the target image;
respectively inputting the shared feature vector into a semantic feature extraction sub-network and an instance feature extraction sub-network in the instance detection network;
extracting the semantic feature vector through the semantic feature extraction subnetwork;
extracting the instance feature vector through the instance feature extraction subnetwork.
15. The apparatus according to claim 14, wherein the vector obtaining unit obtains the shared feature vector corresponding to the target image, and comprises:
extracting a reference feature vector of the target image;
normalizing the reference characteristic vector to obtain a normalization result conforming to normal distribution;
and carrying out nonlinear mapping processing on the normalization result to obtain the shared characteristic vector.
16. The apparatus of claim 14, wherein the semantic feature extraction sub-network and the instance feature extraction sub-network correspond to different network parameters of a same network architecture, wherein the semantic feature extraction sub-network corresponds to the semantic loss function, and wherein the instance feature extraction sub-network corresponds to the intra-class loss function and the inter-class loss function.
17. The apparatus of claim 16, wherein the parameter adjustment unit trains the instance detection network according to a semantic loss function, an intra-class loss function, and an inter-class loss function, and comprises:
training the instance feature extraction sub-network in the instance detection network according to the intra-class loss function and the inter-class loss function;
training the semantic feature extraction sub-network in the instance detection network according to the semantic loss function.
18. The apparatus of claim 17, wherein the parameter adjusting unit trains the instance feature extraction sub-network in the instance detection network according to the intra-class loss function and the inter-class loss function, and comprises:
and adjusting network parameters of the example feature extraction sub-network according to the intra-class loss function and the inter-class loss function until the intra-class loss function and the inter-class loss function are both within a preset threshold range.
19. The apparatus of claim 14, wherein the feature extraction unit extracts the semantic feature vector through a semantic feature extraction sub-network, comprising:
carrying out deconvolution processing on the shared feature vector through a plurality of semantic convolution layers in the semantic feature extraction sub-network to obtain the semantic feature vector; wherein the input of each semantic convolution layer in the plurality of semantic convolution layers is the output of a previous semantic convolution layer and the shared feature vector;
and extracting the instance feature vector through the instance feature extraction subnetwork, comprising:
carrying out deconvolution processing on the shared feature vector through a plurality of example convolution layers in the example feature extraction sub-network to obtain an example feature vector; wherein the input to each of the plurality of example convolutional layers is the output of the previous example convolutional layer and the shared feature vector.
20. The apparatus of claim 13, wherein the feature extraction unit identifies at least two example objects in the object image by the semantic feature vector and the example feature vector, and comprises:
fusing the semantic feature vectors and feature vectors corresponding to the same instance target in the instance feature vectors to obtain a fused feature vector set;
clustering the fusion characteristic vector set to obtain at least two vector clusters;
and determining the at least two example targets according to the expression positions of the vector clusters in the target image.
21. The apparatus of claim 20, wherein the feature extraction unit fuses the semantic feature vectors and the feature vectors corresponding to the same instance target in the instance feature vectors to obtain a fused feature vector set, and comprises:
predicting at least two first reference instance targets in the target image according to the semantic feature vector;
predicting at least two second reference example targets in the target image according to the example feature vectors; wherein instance targets in the first reference instance target and the second reference instance target correspond one-to-one;
and fusing the feature vectors corresponding to the same instance target in the first reference instance target and the second reference instance target to obtain the fused feature vector set.
22. The apparatus of claim 20, wherein the parameter calculating unit calculates the inter-class vector distance corresponding to at least two instance targets, comprising:
calculating a feature vector mean value corresponding to each instance target in the at least two instance targets according to the fusion feature vector set;
and calculating the inter-class vector distance corresponding to each two example targets in the at least two example targets according to the feature vector mean value.
23. The apparatus of claim 20, wherein the loss function calculating unit calculates the intra-class loss function according to the pixel area ratio and the at least two instance targets, comprising:
calculating intra-class feature mean values corresponding to the at least two example targets respectively, wherein the number of the intra-class feature mean values is consistent with the number of the at least two example targets;
and calculating the intra-class loss function according to a preset intra-class penalty factor, the pixel area ratio, the fusion feature vector set, the feature vector quantity of each instance target, the feature vector quantity which does not meet the preset intra-class penalty factor and the intra-class feature average value.
24. The apparatus of claim 20, wherein the loss function calculating unit calculates the inter-class loss function according to an instance target of the at least two instance targets for which the inter-class vector distance satisfies a vector distance threshold, and comprises:
and calculating the inter-class loss function according to a preset inter-class penalty factor, the inter-class vector distance, the fusion characteristic vector set, the vector distance threshold and the number of instance targets meeting the vector distance threshold.
25. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 12.
26. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-12 via execution of the executable instructions.
CN202110031402.8A 2021-01-11 2021-01-11 Method, device, medium and electronic equipment for optimizing instance detection network Active CN112819008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110031402.8A CN112819008B (en) 2021-01-11 2021-01-11 Method, device, medium and electronic equipment for optimizing instance detection network

Publications (2)

Publication Number Publication Date
CN112819008A CN112819008A (en) 2021-05-18
CN112819008B (en) 2022-10-28

Family

ID=75869915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110031402.8A Active CN112819008B (en) 2021-01-11 2021-01-11 Method, device, medium and electronic equipment for optimizing instance detection network

Country Status (1)

Country Link
CN (1) CN112819008B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113824989B (en) * 2021-07-13 2024-02-27 腾讯科技(深圳)有限公司 Video processing method, device and computer readable storage medium
CN114037925A (en) * 2021-09-27 2022-02-11 北京百度网讯科技有限公司 Training and detecting method and device of target detection model and electronic equipment
CN115272682A (en) * 2022-07-29 2022-11-01 上海弘玑信息技术有限公司 Target object detection method, target detection model training method and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784424A (en) * 2019-03-26 2019-05-21 腾讯科技(深圳)有限公司 A kind of method of image classification model training, the method and device of image procossing
CN110458203A (en) * 2019-07-19 2019-11-15 北京科技大学 A kind of advertising image material detection method
CN111178245A (en) * 2019-12-27 2020-05-19 深圳佑驾创新科技有限公司 Lane line detection method, lane line detection device, computer device, and storage medium
CN111553428A (en) * 2020-04-30 2020-08-18 北京百度网讯科技有限公司 Method, device, equipment and readable storage medium for training discriminant model

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295678B (en) * 2016-07-27 2020-03-06 北京旷视科技有限公司 Neural network training and constructing method and device and target detection method and device
CN110580428A (en) * 2018-06-08 2019-12-17 Oppo广东移动通信有限公司 image processing method, image processing device, computer-readable storage medium and electronic equipment
CN109784476B (en) * 2019-01-12 2022-08-16 福州大学 Method for improving DSOD network
US10551846B1 (en) * 2019-01-25 2020-02-04 StradVision, Inc. Learning method and learning device for improving segmentation performance to be used for detecting road user events using double embedding configuration in multi-camera system and testing method and testing device using the same
CN111898668A (en) * 2020-07-24 2020-11-06 佛山市南海区广工大数控装备协同创新研究院 Small target object detection method based on deep learning
CN112070729B (en) * 2020-08-26 2023-07-07 西安交通大学 Anchor-free remote sensing image target detection method and system based on scene enhancement
CN112053358A (en) * 2020-09-28 2020-12-08 腾讯科技(深圳)有限公司 Method, device and equipment for determining instance type of pixel in image and storage medium
CN112053439B (en) * 2020-09-28 2022-11-25 腾讯科技(深圳)有限公司 Method, device and equipment for determining instance attribute information in image and storage medium

Also Published As

Publication number Publication date
CN112819008A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN108304835B (en) character detection method and device
CN112819008B (en) Method, device, medium and electronic equipment for optimizing instance detection network
CN109615611B (en) Inspection image-based insulator self-explosion defect detection method
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
CN110379020B (en) Laser point cloud coloring method and device based on generation countermeasure network
CN112052186A (en) Target detection method, device, equipment and storage medium
CN112801047B (en) Defect detection method and device, electronic equipment and readable storage medium
CN112257665A (en) Image content recognition method, image recognition model training method, and medium
CN114279433A (en) Map data automatic production method, related device and computer program product
CN112329851A (en) Icon detection method and device and computer readable storage medium
CN114331946A (en) Image data processing method, device and medium
CN112052730A (en) 3D dynamic portrait recognition monitoring device and method
WO2022222036A1 (en) Method and apparatus for determining parking space
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium
CN112883827A (en) Method and device for identifying designated target in image, electronic equipment and storage medium
KR101357581B1 (en) A Method of Detecting Human Skin Region Utilizing Depth Information
CN115546554A (en) Sensitive image identification method, device, equipment and computer readable storage medium
CN115115699A (en) Attitude estimation method and device, related equipment and computer product
CN115358981A (en) Glue defect determining method, device, equipment and storage medium
CN115019057A (en) Image feature extraction model determining method and device and image identification method and device
CN115457620A (en) User expression recognition method and device, computer equipment and storage medium
CN114494302A (en) Image processing method, device, equipment and storage medium
CN113139540A (en) Backboard detection method and equipment
CN115082873A (en) Image recognition method and device based on path fusion and storage medium
CN111414895A (en) Face recognition method and device and storage equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40043972

Country of ref document: HK

GR01 Patent grant