CN111008631B - Image association method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN111008631B
Authority
CN
China
Prior art keywords
detection
image
frames
detection frame
iou
Prior art date
Legal status
Active
Application number
CN201911329727.3A
Other languages
Chinese (zh)
Other versions
CN111008631A (en)
Inventor
于晋川
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN201911329727.3A priority Critical patent/CN111008631B/en
Publication of CN111008631A publication Critical patent/CN111008631A/en
Application granted granted Critical
Publication of CN111008631B publication Critical patent/CN111008631B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image association method and device, a storage medium and an electronic device. The method comprises the following steps: inputting an image to be processed into a target neural network and obtaining an output result from an output layer of the target neural network; filtering a plurality of detection frames in the output result according to non-maximum suppression (NMS); and determining detection frames having an association relationship from the filtered detection frames according to the overlap degree (IoU) of the detection frames. The method and device solve the problem in the related art that target association based on key-point detection and aggregation takes a long time because the algorithm is complex.

Description

Image association method and device, storage medium and electronic device
Technical Field
The present invention relates to the field of computers, and in particular, to a method and apparatus for associating images, a storage medium, and an electronic apparatus.
Background
In the field of computer vision, target association refers to the association analysis of different targets perceived by a vision algorithm. A common existing target association method in this field is human body pose estimation, such as OpenPose from Carnegie Mellon University: each key point of the human body is first predicted on the heatmap feature maps of a deep network, the key points are then grouped according to embedding vectors learned by the network, and the human body pose is finally estimated. Other approaches, such as one from Google AI, use a box attention mechanism to associate targets, and the network input requires the target's attention template to be fed back continually. There are also methods that establish relationships between targets based on graph theory, such as a method proposed by a university in Hong Kong, which first uses a CNN to extract features and output each detected target, then feeds the feature information of each target into a graph network for association, and finally outputs the association result.
However, methods based on key-point detection and aggregation, such as OpenPose, pass the backbone features through six stages of sub-networks, each composed of two branches, and such a large network consumes considerable computing resources. In many scenarios this computational cost means that high-performance hardware is required to achieve real-time performance, and the method can only estimate the coordinate positions of human body and face key points and cannot produce a regression box for the target. OpenPose first needs to detect the key points of the human body and then introduces intermediate-layer losses during refinement; although this prevents vanishing or exploding gradients in the network, it greatly increases the difficulty of network training and reduces the speed, making real-time requirements hard to meet. Moreover, in practice, key-point labels for the human body are usually harder to obtain than box labels.
Attention-mechanism-based methods require attention-box information to be fed in continually as input stimuli while relationships between targets are established. Although faster than the key-point-based approach, this approach is more complex to implement than most deep learning detection frameworks and more time-consuming than most deep learning detection algorithms. Graph-theory-based methods append a graph network to the output of the deep learning network, and the graph network must compute target correlations from the feature information of the detected targets. Not only is the method complex, but the large graph network is time-consuming to compute, cannot meet real-time requirements, and is not easy to train.
There is currently no effective solution to the above-described problems in the related art.
Disclosure of Invention
The embodiments of the invention provide an image association method and device, a storage medium and an electronic device, which at least solve the problem in the related art that target association based on key-point detection and aggregation takes a long time because the algorithm is complex.
According to an embodiment of the present invention, an image association method is provided, comprising: inputting an image to be processed into a target neural network and obtaining an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids into which the image to be processed is divided, the number of frames predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; filtering the plurality of detection frames in the output result according to non-maximum suppression (NMS); and determining detection frames having the association relationship from the filtered detection frames according to the overlap degree (IoU) of the detection frames.
According to another embodiment of the present invention, an image associating apparatus is provided, comprising: an input module, configured to input an image to be processed into a target neural network and obtain an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids into which the image to be processed is divided, the number of frames predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; a filtering module, configured to filter the plurality of detection frames in the output result according to non-maximum suppression (NMS); and an association module, configured to determine detection frames having the association relationship from the filtered detection frames according to the overlap degree (IoU) of each detection frame.
According to a further embodiment of the invention, there is also provided a storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the invention, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the invention, the output result can be obtained directly from the output layer of the target neural network, which reduces the computation of the target neural network. The number of channels of the output layer is determined by the following parameters: the number of grids into which the image to be processed is divided, the number of frames predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object. Compared with the prior art, the parameters determining the number of channels additionally include the position information of the second object predicted from the first object. This ensures that, after the plurality of detection frames in the output result are filtered according to non-maximum suppression (NMS) and the detection frames having the association relationship are determined from the filtered detection frames according to the overlap degree (IoU) of each detection frame, the association between the first object and the second object is realized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of a method of associating images according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a target neural network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a Yolo algorithm according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a YOLOv3 architecture based on a Resnet infrastructure network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an output layer channel according to an embodiment of the invention;
FIG. 6 is a schematic diagram of coordinate position calculation according to an embodiment of the invention;
FIG. 7 is a schematic diagram of IOU computation according to an embodiment of the invention;
FIG. 8 is a schematic diagram of NMS according to an embodiment of the invention;
FIG. 9 is a schematic diagram of the association of a human body with a human face according to an embodiment of the present invention;
FIG. 10 is a structural block diagram of an image associating apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Example 1
In this embodiment, a method for associating images is provided, fig. 1 is a flowchart of a method for associating images according to an embodiment of the present invention, and as shown in fig. 1, the flowchart includes the following steps:
step S102, inputting an image to be processed into a target neural network and obtaining an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids into which the image to be processed is divided, the number of frames predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object;
step S104, filtering a plurality of detection frames in the output result according to the non-maximum suppression NMS;
step S106, determining a detection frame with an association relationship from the filtered multiple detection frames according to the overlapping degree IoU of each detection frame.
As can be seen from steps S102 to S106 above, the output result can be obtained directly from the output layer of the target neural network, which reduces the computation of the target neural network. The number of channels of the output layer is determined by the following parameters: the number of grids into which the image to be processed is divided, the number of frames predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object. Compared with the prior art, the parameters determining the number of channels additionally include the position information of the second object predicted from the first object. This ensures that, after the plurality of detection frames in the output result are filtered according to non-maximum suppression (NMS) and the detection frames having the association relationship are determined from the filtered detection frames according to the overlap degree (IoU) of each detection frame, the association between the first object and the second object is realized.
Optionally, in step S102 of the present application, inputting the image to be processed into the target neural network and obtaining the output result from the output layer of the target neural network may further include:
step S102-11, inputting the image to be processed into the output layer of the target neural network;
step S102-12, processing the image to be processed according to the channels of the output layer to obtain a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; wherein the detection frames related to the second object comprise: detection frames of the second object that are directly detected, and detection frames of the second object predicted from the plurality of detection frames related to the first object.
It should be noted that the number of output layers in the present application is 3, and the scales of the 3 output layers differ from one another.
In a specific application scenario, the target neural network in the present application may be implemented based on the YOLOv3 detection framework. Further, as shown in fig. 2, in order to ensure real-time performance, the ResNet base network of YOLOv3 is replaced with a MobileNet base network. The MobileNet network uses depthwise separable convolutions, so the MobileNet base network is faster at the same accuracy. The output of the whole network consists of detection output layers at three scales (large, medium and small). Specifically, the 5th module layer of MobileNet is taken as an intermediate layer, up-sampled and concatenated with the 3rd module layer, and the medium-scale 14 × 14 detection output layer is obtained after convolution. The up-sampled result of this intermediate layer at the previous scale is then concatenated with the 2nd module layer, and the final 28 × 28 large-scale detection output layer is obtained through convolution. Generating detection output layers at 3 scales by fusing multi-scale, multi-depth feature information ensures the model's ability to detect targets at multiple scales while reducing the amount of computation.
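The three-scale structure described above can be sketched roughly in PyTorch as follows; the inputs f2, f3, f5, the channel counts and the 7 × 7 small-scale size are assumptions for a 224 × 224 input rather than values specified by this embodiment:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ThreeScaleHead(nn.Module):
        # Rough sketch of the three detection output layers built from MobileNet
        # module layers 2, 3 and 5; channel counts c2, c3, c5 are assumed values.
        def __init__(self, c2=32, c3=64, c5=256, out_ch=3 * (4 + 1 + 2 + 4)):
            super().__init__()
            self.small = nn.Conv2d(c5, out_ch, 1)            # 7 x 7 scale (assumed)
            self.medium = nn.Conv2d(c5 + c3, out_ch, 1)      # 14 x 14 scale
            self.large = nn.Conv2d(c5 + c3 + c2, out_ch, 1)  # 28 x 28 scale

        def forward(self, f2, f3, f5):
            # f2: 28 x 28, f3: 14 x 14, f5: 7 x 7 feature maps from the backbone
            small = self.small(f5)
            # up-sample module layer 5 and splice it with module layer 3
            mid_in = torch.cat([F.interpolate(f5, scale_factor=2.0), f3], dim=1)
            medium = self.medium(mid_in)
            # up-sample the intermediate result and splice it with module layer 2
            large_in = torch.cat([F.interpolate(mid_in, scale_factor=2.0), f2], dim=1)
            large = self.large(large_in)
            return small, medium, large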
It should be noted that the basic idea of the YOLOv3 algorithm is to fuse the classification task and the regression task of target detection, obtaining the position information and the category information of the target with a single network prediction. As shown in fig. 3, YOLOv3 divides the picture into S × S grids, and each grid is responsible for detecting the object categories appearing in that grid. The S × S grids are reflected in the final feature map generated by the network, whose size is S × S × B × (4 + 1 + C), where S × S is the number of grids, B is the number of detection frames to be predicted in each grid, 4 is the position information, 1 is the foreground probability, and C is the number of object categories to be predicted.
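For example, with S = 7 grids per side, B = 3 frames predicted per grid and C = 2 classes, the final feature map would hold 7 × 7 × 3 × (4 + 1 + 2) = 1029 values; these concrete values of S, B and C are only an illustration, not figures fixed by the algorithm.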
FIG. 4 shows a YOLOv3 architecture based on a ResNet base network. As shown in fig. 4, YOLOv3 removes the pooling and fully connected layers and, in the forward pass, outputs feature maps at 3 different scales. Each grid cell of a feature map in YOLOv3 predicts 3 frames, which substantially increases the number of predictions, and detection accuracy is improved by up-sampling and concatenating feature maps of different scales.
It can be seen that the existing YOLOv3 output layer does not contain the prediction information of the last 4 channels. In order to predict the matching face position from the human-body frame, as shown in fig. 5, 4 channels are added to the YOLOv3 output layer of the present application to predict the position of the face relative to the human body. Each detection output layer therefore has B × (4 + 1 + 2 + 4) channels, where B is the number of anchor-based detection frames to be predicted at each grid point, 4 is the x, y, w, h position information of the frame, 1 is the foreground probability, 2 is the class probability of face and human body, and 4 is the x, y, w, h position information of the face frame predicted from the human body. That is, the present application only adds extra channel information to the output layer, which preserves real-time performance and usability.
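As an illustrative sketch, decoding one such output feature map might look as follows; the anchor count B = 3, the 7 × 7 grid size and the slicing order are assumptions that simply follow the channel description above:

    import numpy as np

    def split_output(feature_map, B=3):
        # feature_map: array of shape (B * (4 + 1 + 2 + 4), S, S) for one scale.
        # Returns per-anchor slices following the channel layout described above.
        S = feature_map.shape[-1]
        fm = feature_map.reshape(B, 4 + 1 + 2 + 4, S, S)
        box_xywh = fm[:, 0:4]     # x, y, w, h of the detection frame
        objectness = fm[:, 4:5]   # foreground probability
        class_prob = fm[:, 5:7]   # face / human-body class probabilities
        face_xywh = fm[:, 7:11]   # face position predicted from the human body
        return box_xywh, objectness, class_prob, face_xywh

    # e.g. a 7 x 7 output layer with B = 3 has 3 * (4 + 1 + 2 + 4) = 33 channels
    out = np.random.rand(33, 7, 7).astype(np.float32)
    boxes, obj, cls, face = split_output(out)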
When the detection frame is a human body, the last 4 channels give the position of the face prediction frame relative to the human-body anchor position. When the detection frame is a face, the last 4 channels have no practical meaning and do not participate in gradient calculation during back-propagation.
On this basis, in a specific application scenario, the first object in the present application may be a human body and the second object may be a face. Steps S102 to S106 then amount to sending the picture to be detected as input into the target neural network, whose output layer consists of three feature maps of different scales. Each feature map has B × (4 + 1 + 2 + 4) channels, where B is the number of frames to be predicted at each grid point, 4 is the position information of the frame, 1 is the foreground probability, 2 is the class probability of face and human body, and 4 is the position information of the face predicted from the human body. Redundant detection frames are then filtered by non-maximum suppression (NMS), and finally the face-body association post-processing is performed.
Optionally, the loss function in the target neural network in the present application is a combination of a sum-of-squared-errors (variance) loss function and a cross-entropy loss function.
In YOLOv1, the loss function is the sum-squared-error (variance) loss:

    L = λ_coord Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{obj} [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]
        + λ_coord Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{obj} [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]
        + Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{obj} (C_i − Ĉ_i)²
        + λ_noobj Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
        + Σ_{i=0..S²} 1_{i}^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²

where 1_{ij}^{obj} indicates that the j-th frame of grid i is responsible for an object, C is the confidence, p(c) is the class probability, and hatted symbols denote the true-label values. In YOLOv3 the coordinate terms are still squared errors, while the other terms become cross-entropy losses. The present scheme adds a regression loss for the offset of the face center relative to the human-body anchor; the position calculation of the human body and the face is shown in fig. 6, where the human-body box parameters (x, y, w, h) correspond to the first 4 channels of each anchor group in the detection output feature map and the face box parameters (x^f, y^f, w^f, h^f) correspond to the last 4 channels. The face detection branch does not involve the calculation of fig. 6 and only computes the position of the face itself.

Thus, the regression loss function of the present scheme, summed over all grid cells and anchors matched to a ground-truth human body, is:

    L_reg = λ₁ [ (x − x̂)² + (y − ŷ)² ]
            + λ₂ [ (x^f − x̂^f)² + (y^f − ŷ^f)² ]
            + λ₃ [ (w − ŵ)² + (h − ĥ)² ]
            + λ₄ [ (w^f − ŵ^f)² + (h^f − ĥ^f)² ]

where the first term is the human-body detection loss, x being the abscissa of the pedestrian position predicted by the algorithm, x̂ the true-label value, y the ordinate of the human-body detection and ŷ its true-label value; the second term is the face detection loss, x^f and y^f being the abscissa and ordinate of the face position predicted by the algorithm and x̂^f, ŷ^f the true face labels; w and h are the predicted width and height of the human-body frame and ŵ, ĥ the width and height of the true label; w^f and h^f are the predicted width and height of the face and ŵ^f, ĥ^f the corresponding true-label values; λ₁, λ₂, λ₃ and λ₄ are weight parameters.
Optionally, the determining, according to the overlap degree IoU of each detection frame, of a detection frame having an association relationship from the filtered plurality of detection frames in step S106 of the present application may further include:
step S106-11, determining the IoU between each detected detection frame related to the second object and each detection frame of the second object predicted from the plurality of detection frames related to the first object;
step S106-12, determining that the detected detection frame related to the second object has an association relationship with the predicted detection frame of the second object for which this IoU is largest.
It should be noted that the IoU is simply the common (intersection) area of two rectangular regions divided by their union area, commonly called the intersection-over-union ratio. The present scheme involves several steps that require computing the IoU: during training, the matching between prediction frames and GT (ground-truth label) frames, and between GT frames and prediction frames, must be computed; during actual prediction, the IoU is needed by the NMS (non-maximum suppression); and the IoU must also be computed during post-processing. A schematic of the IoU calculation is shown in fig. 7.
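Expressed in code, the IoU of two rectangles can be computed as below; this is a generic sketch in which boxes are given by their (x1, y1, x2, y2) corners, whereas the patent describes frames by x, y, w, h, so the representation is an assumption:

    def iou(box_a, box_b):
        # Intersection over union of two boxes given as (x1, y1, x2, y2).
        ix1 = max(box_a[0], box_b[0])
        iy1 = max(box_a[1], box_b[1])
        ix2 = min(box_a[2], box_b[2])
        iy2 = min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0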
Non-maximum suppression is typically used to reduce the number of detection frames. In general, detection produces many candidate frames, that is, many candidate frames around a single target; they then need to be filtered by NMS, mainly to remove the redundant frames belonging to the same target.
Therefore, in a specific application scenario, the NMS filtering may proceed as shown in fig. 8. A set of candidate frames containing the target is obtained (5 frames in the figure). They are first sorted by confidence, and the frame with the highest score is taken as the most accurate frame for the target. The remaining frames are then traversed; those whose IoU with the selected frame is greater than a threshold are considered duplicate frames containing the same target and are removed. After all frames have been traversed, the remaining frames are sorted by confidence again, the frame with the next highest confidence is selected as the most accurate frame of the next target, the remaining frames are traversed, and the operation is repeated until the simplified set of detection frames is finally obtained.
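A minimal greedy NMS matching this description (reusing the iou helper sketched earlier; the frame tuple format and the 0.5 threshold are assumptions) could look like:

    def nms(frames, iou_threshold=0.5):
        # frames: list of (x1, y1, x2, y2, confidence) candidate frames.
        frames = sorted(frames, key=lambda f: f[4], reverse=True)
        kept = []
        while frames:
            best = frames.pop(0)   # highest remaining confidence
            kept.append(best)
            # drop frames that overlap the kept frame too much (same target)
            frames = [f for f in frames if iou(best[:4], f[:4]) <= iou_threshold]
        return kept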
Optionally, the filtering of the plurality of detection frames in the output result according to non-maximum suppression (NMS) in step S104 of the present application may further include:
S1, selecting, according to the confidence of each detection frame, a first detection frame with the highest confidence from the plurality of detection frames in the output result;
S2, determining a plurality of IoU values between the first detection frame and the other detection frames in the output result;
S3, determining, from the plurality of IoU values, those greater than a preset threshold, and filtering out, for each such IoU, the detection frame other than the first detection frame;
S4, repeating steps S1 to S3 on the remaining detection frames in descending order of confidence until the detection frame with the lowest confidence is reached.
In a specific application scenario, as shown in fig. 9, two groups of face and human-body frames are detected, where the dark frames visualize the human-body detection results and the light frames visualize the face detection results. The specific post-processing method is: for each face detection frame, traverse the face frames predicted by each human-body detection frame and compute the IoU between the detected face frame and the predicted face frame; the human-body detection frame whose predicted face frame has the largest IoU is considered to come from the same person as the face detection frame, and the association succeeds. This continues until all face detection frames have been traversed and the association of all faces and human bodies is completed.
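A sketch of this post-processing step (again reusing the iou helper above; the input structures are assumptions, with each human-body entry carrying the face frame predicted from its last 4 channels) might be:

    def associate(face_frames, body_frames):
        # face_frames: list of detected face boxes (x1, y1, x2, y2).
        # body_frames: list of (body_box, predicted_face_box) pairs.
        # Returns (face_index, body_index) pairs judged to come from the same person.
        pairs = []
        for fi, face in enumerate(face_frames):
            best_iou, best_body = 0.0, None
            for bi, (_, predicted_face) in enumerate(body_frames):
                overlap = iou(face, predicted_face)
                if overlap > best_iou:
                    best_iou, best_body = overlap, bi
            if best_body is not None:
                pairs.append((fi, best_body))
        return pairs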
In addition, for training the target neural network, 100,000 annotated pictures are first prepared, the labels being human-body frames and the face frames matched with them. During training, generalization is improved through operations such as image flipping, random cropping, color jittering, translation, scaling and noise.
In the present application, the optimization is implemented and trained under the darknet framework; the optimizer is set to Adam, the maximum number of iterations is 400,000, the batch size is 64, and two 1080 Ti GPUs are used. In the training process, the data of each batch are first sent into the network for forward propagation; at the output layer, the coordinate information x and y is activated with the logistic function, and the confidence C and the object class probabilities C1 and C2 are activated at the same time. The loss and gradients are then computed.
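A minimal sketch of the output activation described here (the dictionary keys and raw values are purely illustrative, not part of any real framework):

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def activate_outputs(raw):
        # Apply the logistic activation to x, y, the confidence C and the class
        # probabilities C1, C2, as described above; w and h stay as raw values.
        act = dict(raw)
        for key in ("x", "y", "C", "C1", "C2"):
            act[key] = sigmoid(raw[key])
        return act

    # usage with made-up raw output values
    raw = {"x": 0.3, "y": -1.2, "w": 0.5, "h": 0.1, "C": 2.0, "C1": 1.5, "C2": -0.7}
    print(activate_outputs(raw))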
Next, each anchor frame is matched with the GT frames, the anchor with the largest IoU with a GT frame is selected as its matching object, and the classification loss, position loss and face prediction loss are computed. If the GT frame is a face frame, only the classification loss and position loss are computed. If the GT frame is a human-body frame, the face prediction loss is computed in addition to the classification loss and position loss. The training process is as follows:
Input: m pictures per batch, maximum iteration number MAX, IoU threshold, and parameters
Output: model with appropriate weights
1) Initialize network parameters
2) for iter from 1 to MAX:
       for i = 1 to m:
           a. Forward propagation
           b. Output layer operation:
              for gt in GT:
                  find the anchor at position (i, j) with the maximum IoU with gt
                  compute the classification loss, position loss and face prediction loss respectively
           c. Gradient back-propagation
           d. Optimize and update the weights
3) Stop iterating and save the weight file.
Therefore, with the image association method of the present application, face and human-body association can be achieved simply by modifying the detection output layer of the YOLO single-stage detection framework, with high real-time performance and no loss of accuracy. Furthermore, the present application can be extended to other applications, such as the associated detection of various parts of the human body. In addition, only the number of channels of the detection output layer is increased, which does not occupy much additional space.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the various embodiments of the present invention.
Example 2
This embodiment also provides an image association device, which is used for implementing the above embodiment and its preferred implementations; what has already been described will not be repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
FIG. 10 is a structural block diagram of an image associating apparatus according to an embodiment of the present invention. As shown in fig. 10, the apparatus comprises: an input module 1002, configured to input an image to be processed into a target neural network and obtain an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids into which the image to be processed is divided, the number of frames predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; a filtering module 1004, configured to filter the plurality of detection frames in the output result according to non-maximum suppression (NMS); and an association module 1006, configured to determine detection frames having an association relationship from the filtered plurality of detection frames according to the overlap degree IoU of each detection frame.
Optionally, the input module 1002 comprises: an input unit, configured to input the image to be processed into the output layer of the target neural network; and a processing unit, configured to process the image to be processed according to the channels of the output layer to obtain a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; wherein the detection frames related to the second object comprise: detection frames of the second object that are directly detected, and detection frames of the second object predicted from the plurality of detection frames related to the first object.
Optionally, the association module 1006 comprises: a determining unit, configured to determine the IoU between each detected detection frame related to the second object and each detection frame of the second object predicted from the plurality of detection frames related to the first object; and an association unit, configured to determine that the detected detection frame related to the second object has an association relationship with the predicted detection frame of the second object for which this IoU is largest.
Optionally, the filtering module 1004 is configured to perform the following steps:
S1, selecting, according to the confidence of each detection frame, a first detection frame with the highest confidence from the plurality of detection frames in the output result;
S2, determining a plurality of IoU values between the first detection frame and the other detection frames in the output result;
S3, determining, from the plurality of IoU values, those greater than a preset threshold, and filtering out, for each such IoU, the detection frame other than the first detection frame;
S4, repeating steps S1 to S3 on the remaining detection frames in descending order of confidence until the detection frame with the lowest confidence is reached.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.
Example 3
An embodiment of the invention also provides a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:
S1, inputting an image to be processed into a target neural network and obtaining an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids into which the image to be processed is divided, the number of frames predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object;
S2, filtering the plurality of detection frames in the output result according to non-maximum suppression (NMS);
S3, determining detection frames having the association relationship from the filtered detection frames according to the overlap degree IoU of the detection frames.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a usb disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing a computer program.
An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
S1, inputting an image to be processed into a target neural network and obtaining an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids into which the image to be processed is divided, the number of frames predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object;
S2, filtering the plurality of detection frames in the output result according to non-maximum suppression (NMS);
S3, determining detection frames having the association relationship from the filtered detection frames according to the overlap degree IoU of the detection frames.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a memory device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module for implementation. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method of correlating images, comprising:
inputting an image to be processed into a target neural network, and obtaining an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids into which the image to be processed is divided, the number of frames predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object;
filtering a plurality of detection frames in the output result according to a non-maximum suppression NMS;
determining a detection frame with the association relation according to the overlapping degree IoU of each detection frame from the filtered multiple detection frames;
the method comprises the steps that a first object is a human body, a second object is a human face, a plurality of channel information is additionally added to an output layer of a target neural network to represent position information of a human face detection frame predicted according to the human body, when the detection frame is a human body detection frame, the additionally added channels are human face prediction frame positions relative to the human body position, and when the detection frame is the human face detection frame, the human face prediction frame positions relative to the human body position are not predicted by the aid of the plurality of channel information.
2. The method of claim 1, wherein inputting the image to be processed into the target neural network and obtaining an output result from an output layer of the target neural network comprises:
inputting the image to be processed into an output layer in the target neural network;
processing the image to be processed according to the channels of the output layer to obtain a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; wherein the detection frames related to the second object comprise: detection frames of the second object that are directly detected, and detection frames of the second object predicted from the plurality of detection frames related to the first object.
3. The method of claim 2, wherein determining a detection box having the association relationship from the filtered plurality of detection boxes according to the overlapping degree IoU of each detection box, comprises:
determining the IoU between each detected detection frame related to the second object and each detection frame of the second object predicted from the plurality of detection frames related to the first object;
determining that the detected detection frame related to the second object has the association relationship with the predicted detection frame of the second object for which this IoU is largest.
4. The method of claim 1, wherein the filtering of the plurality of detection frames in the output result according to non-maximum suppression (NMS) comprises:
S1, selecting, according to the confidence of each detection frame, a first detection frame with the highest confidence from the plurality of detection frames in the output result;
S2, determining a plurality of IoU values between the first detection frame and the other detection frames in the output result;
S3, determining, from the plurality of IoU values, those greater than a preset threshold, and filtering out, for each such IoU, the detection frame other than the first detection frame;
S4, repeating steps S1 to S3 on the remaining detection frames in descending order of confidence until the detection frame with the lowest confidence is reached.
5. The method of any one of claims 1 to 4, wherein the loss function in the target neural network is a combination of a sum-of-squared-errors (variance) loss function and a cross-entropy loss function.
6. The method according to any one of claims 1 to 4, wherein the number of output layers is 3, and the scales of the 3 output layers differ from one another.
7. An image associating apparatus, comprising:
the input module is used for inputting an image to be processed into a target neural network and obtaining an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids into which the image to be processed is divided, the number of frames predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object;
the filtering module is used for filtering a plurality of detection frames in the output result according to the non-maximal inhibition NMS;
the association module is used for determining the detection frames with the association relation according to the overlapping degree IoU of each detection frame from the filtered multiple detection frames;
the method comprises the steps that a first object is a human body, a second object is a human face, a plurality of channel information is additionally added to an output layer of a target neural network to represent position information of a human face detection frame predicted according to the human body, when the detection frame is a human body detection frame, the additionally added channels are human face prediction frame positions relative to the human body position, and when the detection frame is the human face detection frame, the human face prediction frame positions relative to the human body position are not predicted by the aid of the plurality of channel information.
8. The apparatus of claim 7, wherein the input module comprises:
the input unit is used for inputting the image to be processed into an output layer in the target neural network;
the processing unit is used for processing the image to be processed according to the channels of the output layer to obtain a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; wherein the detection frames related to the second object comprise: detection frames of the second object that are directly detected, and detection frames of the second object predicted from the plurality of detection frames related to the first object.
9. The apparatus of claim 8, wherein the association module comprises:
a determining unit, configured to determine the IoU between each detected detection frame related to the second object and each detection frame of the second object predicted from the plurality of detection frames related to the first object;
and an association unit, configured to determine that the detected detection frame related to the second object has the association relationship with the predicted detection frame of the second object for which this IoU is largest.
10. The apparatus of claim 7, wherein the filtering module is configured to perform the steps of:
S1, selecting, according to the confidence of each detection frame, a first detection frame with the highest confidence from the plurality of detection frames in the output result;
S2, determining a plurality of IoU values between the first detection frame and the other detection frames in the output result;
S3, determining, from the plurality of IoU values, those greater than a preset threshold, and filtering out, for each such IoU, the detection frame other than the first detection frame;
S4, repeating steps S1 to S3 on the remaining detection frames in descending order of confidence until the detection frame with the lowest confidence is reached.
11. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program, wherein the computer program is arranged to execute the method of any of the claims 1 to 6 when run.
12. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of the claims 1 to 6.
CN201911329727.3A 2019-12-20 2019-12-20 Image association method and device, storage medium and electronic device Active CN111008631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911329727.3A CN111008631B (en) 2019-12-20 2019-12-20 Image association method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911329727.3A CN111008631B (en) 2019-12-20 2019-12-20 Image association method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN111008631A CN111008631A (en) 2020-04-14
CN111008631B true CN111008631B (en) 2023-06-16

Family

ID=70117421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911329727.3A Active CN111008631B (en) 2019-12-20 2019-12-20 Image association method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN111008631B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151167A (en) * 2020-05-14 2020-12-29 余红兵 Intelligent screening method for six-age dental caries of children based on deep learning
CN111797829A (en) * 2020-06-24 2020-10-20 浙江大华技术股份有限公司 License plate detection method and device, electronic equipment and storage medium
AU2020294259B2 (en) 2020-08-01 2022-06-30 Sensetime International Pte. Ltd. Object association method, apparatus and system, electronic device, storage medium and computer program
CN112836745B (en) * 2021-02-02 2022-12-09 歌尔股份有限公司 Target detection method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650699A (en) * 2016-12-30 2017-05-10 中国科学院深圳先进技术研究院 CNN-based face detection method and device
CN108038469A (en) * 2017-12-27 2018-05-15 百度在线网络技术(北京)有限公司 Method and apparatus for detecting human body
WO2018130104A1 (en) * 2017-01-16 2018-07-19 腾讯科技(深圳)有限公司 Human head detection method, electronic device and storage medium
CN110163889A (en) * 2018-10-15 2019-08-23 腾讯科技(深圳)有限公司 Method for tracking target, target tracker, target following equipment
CN110298318A (en) * 2019-07-01 2019-10-01 北京中星微电子有限公司 Number of people human body associated detecting method, device and electronic equipment
WO2019232894A1 (en) * 2018-06-05 2019-12-12 中国石油大学(华东) Complex scene-based human body key point detection system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10078794B2 (en) * 2015-11-30 2018-09-18 Pilot Ai Labs, Inc. System and method for improved general object detection using neural networks

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650699A (en) * 2016-12-30 2017-05-10 中国科学院深圳先进技术研究院 CNN-based face detection method and device
WO2018130104A1 (en) * 2017-01-16 2018-07-19 腾讯科技(深圳)有限公司 Human head detection method, electronic device and storage medium
CN108038469A (en) * 2017-12-27 2018-05-15 百度在线网络技术(北京)有限公司 Method and apparatus for detecting human body
WO2019232894A1 (en) * 2018-06-05 2019-12-12 中国石油大学(华东) Complex scene-based human body key point detection system and method
CN110163889A (en) * 2018-10-15 2019-08-23 腾讯科技(深圳)有限公司 Method for tracking target, target tracker, target following equipment
CN110298318A (en) * 2019-07-01 2019-10-01 北京中星微电子有限公司 Number of people human body associated detecting method, device and electronic equipment

Also Published As

Publication number Publication date
CN111008631A (en) 2020-04-14

Similar Documents

Publication Publication Date Title
CN111008631B (en) Image association method and device, storage medium and electronic device
US10943145B2 (en) Image processing methods and apparatus, and electronic devices
KR102302725B1 (en) Room Layout Estimation Methods and Techniques
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
US9111375B2 (en) Evaluation of three-dimensional scenes using two-dimensional representations
CN111797983A (en) Neural network construction method and device
WO2019213459A1 (en) System and method for generating image landmarks
JP7286013B2 (en) Video content recognition method, apparatus, program and computer device
WO2018068421A1 (en) Method and device for optimizing neural network
CN109272509A (en) A kind of object detection method of consecutive image, device, equipment and storage medium
CN113095254B (en) Method and system for positioning key points of human body part
CN112215332A (en) Searching method of neural network structure, image processing method and device
CN111783996B (en) Data processing method, device and equipment
WO2020216804A1 (en) Convolution neural network based landmark tracker
EP4318313A1 (en) Data processing method, training method for neural network model, and apparatus
CN115512251A (en) Unmanned aerial vehicle low-illumination target tracking method based on double-branch progressive feature enhancement
CN111126249A (en) Pedestrian re-identification method and device combining big data and Bayes
WO2016095068A1 (en) Pedestrian detection apparatus and method
WO2022228142A1 (en) Object density determination method and apparatus, computer device and storage medium
CN109685805A (en) A kind of image partition method and device
CN111626311B (en) Heterogeneous graph data processing method and device
CN108520532B (en) Method and device for identifying motion direction of object in video
US10643092B2 (en) Segmenting irregular shapes in images using deep region growing with an image pyramid
CN114565092A (en) Neural network structure determining method and device
US10776923B2 (en) Segmenting irregular shapes in images using deep region growing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant