CN111008631A - Image association method and device, storage medium and electronic device

Info

Publication number: CN111008631A
Authority: CN (China)
Prior art keywords: detection, image, frames, IoU, detection frame
Legal status: Granted; Active
Application number: CN201911329727.3A
Other languages: Chinese (zh)
Other versions: CN111008631B
Inventor: 于晋川
Current Assignee: Zhejiang Dahua Technology Co Ltd
Original Assignee: Zhejiang Dahua Technology Co Ltd
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN201911329727.3A, granted as CN111008631B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

The invention provides an image association method and device, a storage medium and an electronic device, wherein the method comprises the following steps: inputting an image to be processed into a target neural network, and obtaining an output result from an output layer of the target neural network; filtering a plurality of detection frames in the output result according to non-maximum suppression (NMS); and determining, from the filtered detection frames, the detection frames having an association relationship according to the intersection-over-union (IoU) of the detection frames. The method and the device solve the problem in the related art that target association based on key-point detection and aggregation relies on a complex, time-consuming algorithm and therefore takes a long time.

Description

Image association method and device, storage medium and electronic device
Technical Field
The present invention relates to the field of computers, and in particular, to an image association method and apparatus, a storage medium, and an electronic apparatus.
Background
In the field of computer vision, target association is the association analysis of different targets perceived by a vision algorithm. Existing target association methods in computer vision are commonly based on human body pose estimation, for example OpenPose from Carnegie Mellon University, which predicts each key point of the human body on the heat-map feature maps of a deep network, then aggregates the key points according to embedding vectors learned by the network, and finally estimates the human body pose. Other approaches, such as one from Google AI, employ a box attention mechanism to associate objects, and the network input requires continuous feedback of a target attention template. There are also methods that establish relationships between targets based on graph theory, such as a method proposed by a Hong Kong university, which first uses a CNN to extract features and detect each target, then sends the feature information of each target into a graph network for association, and finally outputs the association result.
However, methods based on key-point detection and aggregation, such as OpenPose, must pass through six stages of sub-networks after the backbone network, each sub-network consisting of two branches, so the resulting huge network consumes a large amount of computing resources. In many scenes, high-performance hardware is therefore needed to achieve real-time performance, and the method can only estimate the coordinates of human body and face key points and cannot produce a regression box for the target. The OpenPose method first needs to detect the key points of the human body and then has to introduce intermediate-layer losses during refinement; although this prevents the gradients from vanishing or exploding, it greatly increases the difficulty of network training, reduces speed and makes real-time requirements hard to meet. Meanwhile, acquiring human key-point labels is in practice often more difficult than acquiring box labels.
The attention-based method needs to keep feeding attention-box information as input excitation while establishing relationships between targets. Although it is faster than the key-point-based method, it is more complex to implement than most deep learning detection frameworks, and its time consumption is higher than that of most deep learning detection algorithms. The graph-theory-based method adds a graph network behind the output of a deep learning network, and the graph network must compute the correlation between targets from the feature information of the detected targets. This method is not only complex, but the large amount of graph-network computation is time-consuming and cannot meet real-time requirements, and the graph network is not easy to train.
In view of the above problems in the related art, no effective solution exists at present.
Disclosure of Invention
The embodiments of the invention provide an image association method and device, a storage medium and an electronic device, which are used to at least solve the problem in the related art that target association based on key-point detection and aggregation relies on a complex, time-consuming algorithm and therefore takes a long time.
According to an embodiment of the present invention, an image association method is provided, comprising: inputting an image to be processed into a target neural network, and obtaining an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids obtained by dividing the image to be processed into a plurality of grids, the number of frames to be predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; filtering the plurality of detection frames in the output result according to non-maximum suppression (NMS); and determining, from the filtered detection frames, the detection frames having the association relationship according to the intersection-over-union (IoU) of each detection frame.
According to another embodiment of the present invention, an apparatus for associating images is provided, comprising: an input module, configured to input an image to be processed into a target neural network and obtain an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids obtained by dividing the image to be processed into a plurality of grids, the number of frames to be predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; a filtering module, configured to filter the plurality of detection frames in the output result according to non-maximum suppression (NMS); and an association module, configured to determine, from the filtered detection frames, the detection frames having the association relationship according to the IoU of each detection frame.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, the output result is obtained directly from the output layer of the target neural network, which reduces the computation performed by the target neural network. The number of channels of the output layer is determined by the following parameters: the number of grids obtained by dividing the image to be processed into a plurality of grids, the number of frames to be predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object. Because the output layer directly predicts the position of the second object from the first object, the detection frames having an association relationship can be determined simply from the IoU between the detection frames, without the complex post-processing of the prior art. This solves the problem in the related art that target association based on key-point detection and aggregation relies on a complex, time-consuming algorithm and therefore takes a long time.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of a method of associating images according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a target neural network architecture according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the YOLO algorithm according to an embodiment of the present invention;
FIG. 4 is a structural diagram of YOLOv3 based on a ResNet backbone according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an output layer channel according to an embodiment of the invention;
FIG. 6 is a schematic illustration of coordinate location calculation according to an embodiment of the invention;
FIG. 7 is a schematic diagram of IOU computation according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an NMS according to an embodiment of the invention;
FIG. 9 is a schematic diagram of human body and human face association according to an embodiment of the invention;
FIG. 10 is a structural block diagram of an apparatus for associating images according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example 1
In the present embodiment, an image association method is provided, and fig. 1 is a flowchart of an image association method according to an embodiment of the present invention, as shown in fig. 1, the flowchart includes the following steps:
step S102, inputting an image to be processed into a target neural network, and obtaining an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relation; the number of channels in the output layer is determined by the following parameters: dividing an image to be processed into the number of grids after a plurality of grids, the number of frames forming each grid, position information of the frames in the image to be processed, confidence, class probability of a first object and a second object, and position information of the second object predicted according to the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object;
step S104, filtering a plurality of detection frames in an output result according to the non-maximum inhibition NMS;
in step S106, a detection frame having an association relationship is identified from the plurality of filtered detection frames based on the overlapping degree IoU of each detection frame.
As can be seen from the above steps S102 to S106, the output result is obtained directly from the output layer of the target neural network, which reduces the computation performed by the target neural network. The number of channels of the output layer is determined by the following parameters: the number of grids obtained by dividing the image to be processed into a plurality of grids, the number of frames to be predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object. Because the output layer directly predicts the position of the second object from the first object, the detection frames having an association relationship can be determined simply from the IoU between the detection frames. This solves the problem in the related art that target association based on key-point detection and aggregation relies on a complex, time-consuming algorithm and therefore takes a long time.
Optionally, the manner of inputting the image to be processed into the target neural network and obtaining the output result from the output layer of the target neural network, which is referred to in step S102 in this application, may further include:
s102-11, inputting an image to be processed into an output layer in a target neural network;
step S102-12, processing the image to be processed according to the channel of the output layer to obtain a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; wherein the detection box associated with the second object comprises: a detected detection frame associated with the second object, a detection frame associated with the second image predicted from the plurality of detection frames associated with the first object.
It should be noted that the number of output layers referred to in this application is 3, and the three output layers have different scales.
In a specific application scenario, the target neural network in the present application may be implemented based on the YOLOv3 detection framework. Further, as shown in Fig. 2, in order to ensure real-time performance, the ResNet-based backbone of the original YOLOv3 is replaced with a MobileNet backbone. MobileNet uses depthwise separable convolutions, so the backbone is faster while maintaining the same accuracy. The output of the whole network consists of detection output layers at three scales (large, medium and small). Specifically, the 5th module layer of MobileNet is used as an intermediate layer; it is upsampled, spliced with the 3rd module layer and convolved to obtain the medium-scale 14 x 14 detection output layer. The 2nd module layer is spliced with the upsampled result of the previous-scale intermediate layer and convolved to obtain the final 28 x 28 large-scale detection output layer. Generating detection output layers at three scales by fusing multi-scale, multi-depth feature information ensures the detection capability of the model for multi-scale targets while reducing the amount of computation.
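Purely as an illustration of this three-scale structure (the module indices, channel counts, 7/14/28 output resolutions and all names below are assumptions based on the description above, not the patented network itself), such a multi-scale detection head could be sketched in PyTorch as follows:

```python
import torch
import torch.nn as nn

class MultiScaleHead(nn.Module):
    """Sketch of a 3-scale YOLO-style detection head fed by three backbone
    feature maps (e.g. MobileNet module outputs at 28x28, 14x14 and 7x7)."""

    def __init__(self, c2, c3, c5, out_channels):
        super().__init__()
        # small-scale (deepest, 7x7) head directly on module-5 features
        self.head_small = nn.Conv2d(c5, out_channels, kernel_size=1)
        # medium scale: upsample module-5 features, concat with module-3 features
        self.up5 = nn.Upsample(scale_factor=2, mode="nearest")
        self.head_medium = nn.Conv2d(c5 + c3, out_channels, kernel_size=1)
        # large scale: upsample the fused medium features, concat with module-2 features
        self.up3 = nn.Upsample(scale_factor=2, mode="nearest")
        self.head_large = nn.Conv2d(c5 + c3 + c2, out_channels, kernel_size=1)

    def forward(self, f2, f3, f5):
        # f2: 28x28, f3: 14x14, f5: 7x7 backbone feature maps
        out_small = self.head_small(f5)                # 7x7 detection output layer
        mid = torch.cat([self.up5(f5), f3], dim=1)     # 14x14 fused features
        out_medium = self.head_medium(mid)             # 14x14 detection output layer
        large = torch.cat([self.up3(mid), f2], dim=1)  # 28x28 fused features
        out_large = self.head_large(large)             # 28x28 detection output layer
        return out_small, out_medium, out_large

# Example: B = 3 anchors per grid cell, 4+1+2+4 = 11 channels per anchor
head = MultiScaleHead(c2=32, c3=64, c5=128, out_channels=3 * 11)
f2 = torch.randn(1, 32, 28, 28)
f3 = torch.randn(1, 64, 14, 14)
f5 = torch.randn(1, 128, 7, 7)
small, medium, large = head(f2, f3, f5)  # (1,33,7,7), (1,33,14,14), (1,33,28,28)
```

Each head would emit B x (4+1+2+4) channels per grid cell, matching the output-layer layout described below.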
It should be noted that the fundamental idea of the YOLOv3 algorithm is to fuse the classification and regression tasks of target detection and obtain the position information and category information of a target through a single network prediction. As shown in Fig. 3, YOLOv3 divides the picture into S x S grids, and each grid is responsible for detecting the objects appearing on that grid. The S x S grids are reflected on the feature maps finally generated by the network, whose size is S x S x B x (4+1+C), where S x S represents the number of grids, B represents the number of detection frames to be predicted in one grid, 4 represents the position information, 1 represents the foreground probability, and C represents the number of object categories to be predicted.
Fig. 4 is a structural diagram of YOLOv3 based on a ResNet backbone according to an embodiment of the present invention. As shown in Fig. 4, YOLOv3 removes the pooling layers and the fully connected layers, and outputs feature maps at 3 different scales during forward propagation. Each grid of a feature map in YOLOv3 predicts 3 boxes, which greatly increases the number of predictions, and detection accuracy is improved by upsampling and splicing feature maps of different scales.
The existing YOLOv3 output layer does not contain the prediction information of the last 4 channels described below. In order to let the human-body frame predict the position of the matching face, as shown in Fig. 5, 4 channels of information are added to the YOLOv3 output layer of the present application to predict the position of the face relative to the human body. Each detection output layer thus has B x (4+1+2+4) channels, where B represents the number of anchor-based detection frames to be predicted at each grid point, 4 represents the x, y, w, h position information of the frame, 1 represents the foreground probability, 2 represents the class probabilities of face and human body, and 4 represents the x, y, w, h position information of the face frame predicted from the human body. That is, the present application only adds extra channel information to the output layer, which keeps the method real-time and easy to use.
When the detection frame is a human body, the last 4 channels are the position of the predicted face frame relative to the human-body anchor position. When the detection frame is a face, the last 4 channels have no practical meaning and are not involved in gradient calculation during back-propagation.
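Purely as an illustration of this channel layout (the grid size, the number of anchors B and all tensor names are assumptions, not part of the patent), one grid cell of an output-layer feature map could be decoded as follows:

```python
import numpy as np

B = 3                          # anchors per grid cell (assumed)
C_PER_ANCHOR = 4 + 1 + 2 + 4   # box xywh + foreground + 2 classes + face xywh

# fake output-layer feature map: (grid_h, grid_w, B * 11)
fmap = np.random.rand(14, 14, B * C_PER_ANCHOR)

def decode_cell(cell, anchor_idx):
    """Split one anchor's channels of a single grid cell into its components."""
    a = cell[anchor_idx * C_PER_ANCHOR:(anchor_idx + 1) * C_PER_ANCHOR]
    return {
        "box_xywh": a[0:4],      # detection frame position (x, y, w, h)
        "foreground": a[4],      # foreground probability
        "class_probs": a[5:7],   # face / human-body class probabilities
        "face_xywh": a[7:11],    # face frame predicted from the human body;
                                 # only meaningful when the box is a human body
    }

pred = decode_cell(fmap[6, 7], anchor_idx=0)
print(pred["box_xywh"], pred["face_xywh"])
```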
Based on this, in a specific application scenario, the first object in the present application may be a human body and the second object may be a face. Steps S102 to S106 may then take the picture to be detected as input and send it to the output layers of the target neural network, where the output consists of three feature maps of different scales. Each feature map has B x (4+1+2+4) channels, where B represents the number of frames to be predicted at each grid point, 4 represents the position information of the frame, 1 represents the foreground probability, 2 represents the class probabilities of face and human body, and 4 represents the face position information predicted from the human body. Redundant detection frames are then filtered by NMS, and finally face-body association post-processing is performed.
Optionally, the loss function in the target neural network in the present application is a combination of a total variance loss function and a cross entropy loss function.
In YOLOv1, the loss function is a sum-of-squares (total variance) loss:

$$\mathrm{loss}_{yolo\_v1}=\mathrm{loss}_{x,y}+\mathrm{loss}_{w,h}+\mathrm{loss}_{C}+\mathrm{loss}_{p}$$

wherein

$$\mathrm{loss}_{x,y}=\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\right]$$

$$\mathrm{loss}_{w,h}=\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}})^{2}+(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}})^{2}\right]$$

$$\mathrm{loss}_{C}=\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_{i}-\hat{C}_{i})^{2}+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_{i}-\hat{C}_{i})^{2}$$

$$\mathrm{loss}_{p}=\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in\mathrm{classes}}(p_{i}(c)-\hat{p}_{i}(c))^{2}$$
In YOLOv3, loss_{w,h} remains a sum-of-squares (total variance) error, while the other terms become cross-entropy losses. The present scheme additionally introduces a regression loss for the offset of the face relative to the center position of the human-body anchor; the position calculation between the human body and the face is shown in Fig. 6. The predicted human-body coordinates correspond to the first 4 channels of each anchor group in the output-layer feature map, and the predicted face coordinates correspond to the last 4 channels. The face-detection branch does not involve the calculation of Fig. 6 and only computes the position of the face.

Therefore, the regression loss of the present solution is the sum of a human-body detection loss and a face prediction loss (the individual loss terms appear as equation images in the original publication; the symbols they use are summarized below):

In the human-body detection loss, x̂^body and ŷ^body denote the abscissa and ordinate of the pedestrian position predicted by the algorithm, and x^body and y^body denote the corresponding real label values; ŵ^body and ĥ^body denote the predicted width and height of the human-body box, and w^body and h^body denote the width and height of the real label.

In the face prediction loss, x̂^face and ŷ^face denote the abscissa and ordinate of the face position predicted by the algorithm, and x^face and y^face denote the real face label; ŵ^face and ĥ^face denote the predicted face width and height, and w^face and h^face denote the corresponding real label values.

The λ coefficients are weight parameters on the individual loss terms.
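As a rough, hedged sketch only — the patent renders the exact loss equations as images and discloses only the symbols above, so the plain sum-of-squares form and the λ weights below are assumptions — the combined regression loss could be computed along these lines:

```python
import numpy as np

def regression_loss(pred, target, lambda_body=1.0, lambda_face=1.0):
    """Sketch of a combined body + face regression loss.

    pred/target are dicts with keys:
      'body_xywh' : predicted / ground-truth human-body box (x, y, w, h)
      'face_xywh' : predicted / ground-truth face box (x, y, w, h),
                    supervised only when the ground-truth box is a human body
      'is_body'   : True if the ground-truth box is a human body
    """
    body_loss = np.sum((pred["body_xywh"] - target["body_xywh"]) ** 2)
    face_loss = 0.0
    if target["is_body"]:
        # face offset is only supervised for human-body boxes; for face boxes
        # the last 4 channels carry no gradient (as stated in the description)
        face_loss = np.sum((pred["face_xywh"] - target["face_xywh"]) ** 2)
    return lambda_body * body_loss + lambda_face * face_loss

# usage
pred = {"body_xywh": np.array([0.5, 0.5, 0.4, 0.8]),
        "face_xywh": np.array([0.5, 0.2, 0.1, 0.1]),
        "is_body": True}
target = {"body_xywh": np.array([0.52, 0.48, 0.42, 0.78]),
          "face_xywh": np.array([0.51, 0.22, 0.1, 0.12]),
          "is_body": True}
print(regression_loss(pred, target))
```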
Optionally, the manner, referred to in step S106 of this application, of determining the detection frames having the association relationship from the filtered detection frames according to the IoU of each detection frame may further include:

Step S106-11, determining the IoU between each detected detection frame associated with the second object and each detection frame associated with the second image predicted from the plurality of detection frames associated with the first object;

Step S106-12, determining that the detected detection frame associated with the second object and the predicted detection frame associated with the second image that have the largest IoU have the association relationship.
It should be noted that the IoU is the area of the common region of two rectangular regions divided by the area of their union, commonly referred to as the intersection-over-union. In this scheme, IoU calculation is needed in several places: during training it is used to match the prediction box with the GT (ground-truth label), to match the GT with the prior (anchor) boxes, and to match the GT with the prediction boxes; during actual prediction it is needed for NMS (non-maximum suppression); and it is needed again during post-processing. The calculation of the IoU is shown schematically in Fig. 7.
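For illustration (the function and variable names are assumptions, and boxes are taken as (x1, y1, x2, y2) corners), the IoU of two axis-aligned rectangles can be computed as:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # union = sum of areas - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```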
Non-maximum suppression is generally used to simplify the set of detection frames. Detection usually produces many candidate frames, that is, many candidate frames exist around one target; they therefore need to be filtered through NMS, mainly to remove the redundant frames belonging to the same target.
Therefore, in a specific application scenario, the NMS filtering of step S104 may proceed as shown in Fig. 8: a group of candidate frames is obtained, the 5 groups of candidates containing the target are sorted by confidence, and the frame with the highest score is taken as the frame that contains the target most accurately. This frame is then compared with each of the remaining frames; when the IoU value is larger than the threshold, the other frame is considered a duplicate frame containing the same target and is removed. After all frames have been traversed, the remaining frames are sorted by confidence again, the frame with the next highest confidence is selected as the most accurate frame of the next target, the remaining frames are traversed, and the operation is repeated until the simplified set of detection frames is finally obtained.
Optionally, the manner of filtering the plurality of detection frames in the output result according to non-maximum suppression (NMS), referred to in step S104 of this application, may further include the following steps (a sketch of this procedure is given after the list):

S1, selecting, according to the confidence of each detection frame, a first detection frame with the highest confidence from the plurality of detection frames in the output result;

S2, determining a plurality of IoU values between the first detection frame and the other detection frames in the output result;

S3, determining, among these IoU values, those greater than a preset threshold, and filtering out the other detection frame (other than the first detection frame) involved in each such IoU;

S4, repeating the above steps S1 to S3 on the remaining detection frames in order of confidence, down to the detection frame with the lowest confidence.
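A minimal sketch of this greedy filtering, reusing the iou helper sketched above (the threshold value and list format are illustrative assumptions):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # S1: highest-confidence remaining box
        keep.append(best)
        # S2/S3: drop every remaining box whose IoU with `best` exceeds the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep              # S4: repeat until no boxes remain

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] — the near-duplicate of box 0 is suppressed
```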
In a specific application scenario, as shown in Fig. 9, two groups of face frames are obtained by detection: the dark frames visualize the results of the face detection branch, and the light frames visualize the face frames predicted by the human-body detection frames. The post-processing works as follows: for each face detection frame, traverse the face frame predicted by each human-body detection frame and calculate the IoU between them; the pair with the largest IoU is considered to come from the same person, and the association succeeds. This continues until all face detection frames have been traversed and the association of all faces and human bodies is complete.
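A minimal sketch of this matching under the description above (a plain argmax over IoU, reusing the iou helper sketched earlier; tie handling and any minimum-IoU requirement are assumptions):

```python
def associate_faces_to_bodies(face_boxes, body_predicted_face_boxes):
    """For each detected face box, find the body whose predicted face box
    overlaps it the most. Returns a list of (face_index, body_index) pairs."""
    pairs = []
    for fi, face in enumerate(face_boxes):
        ious = [iou(face, pred) for pred in body_predicted_face_boxes]
        if ious and max(ious) > 0:
            pairs.append((fi, ious.index(max(ious))))
    return pairs

detected_faces = [(2, 2, 6, 6), (20, 2, 24, 6)]
faces_predicted_by_bodies = [(19, 1, 25, 7), (1, 1, 7, 7)]
print(associate_faces_to_bodies(detected_faces, faces_predicted_by_bodies))
# [(0, 1), (1, 0)] — each face pairs with the body whose predicted face box overlaps it most
```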
In addition, for the training process of the target neural network, 100000 annotated pictures are prepared, the labels being human-body frames and the face frames matched with them. During training, generalization is improved by flipping the pictures, random cropping, color jittering, translation, scale transformation, noise addition and similar augmentations.
In the present application, training is implemented under the darknet framework, the optimizer is set to Adam, the maximum number of iterations is 400000, the batch size is 64, and two 1080 Ti GPUs are used. In each training step, a batch of data is first sent into the network for forward propagation, and a logistic (sigmoid) activation is applied on the output layer to the coordinate information x and y, as well as to the confidence C and the object class probabilities c1 and c2. The loss and gradients are then calculated.
Next, each GT (ground-truth) frame is matched with the anchor frames: the anchor with the largest IoU with the GT frame is selected as the matching object, and the classification loss, position loss and face prediction loss are calculated. If the GT frame is a face frame, only the classification loss and the position loss are calculated. If the GT frame is a human-body frame, the face prediction loss is calculated in addition to the classification loss and the position loss. The training procedure is as follows (a sketch of the GT-anchor matching step is given after the pseudocode):
Input: m batches of picture samples, the maximum number of iterations MAX, the stop-iteration IoU threshold, and the parameters
Output: a model with appropriate weights
1) Initialize the network parameters
2) for iter from 1 to MAX:
       for i = 1 to m:
           a. forward propagation
           b. output-layer operations:
              for gt in GT:
                  find the anchor at the (i, j)-th position with the largest IoU with gt
                  calculate the classification loss, position loss and face prediction loss respectively
           c. gradient back-propagation
           d. optimize and update the weights
3) Stop iterating and save the weight file.
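As a sketch of the GT-anchor matching step referenced above (the anchor representation and box format are illustrative assumptions; the matching reuses the iou helper sketched earlier):

```python
def match_gt_to_anchors(gt_box, anchor_boxes):
    """Return the index of the anchor with the largest IoU with the GT box."""
    ious = [iou(gt_box, a) for a in anchor_boxes]
    return max(range(len(anchor_boxes)), key=lambda i: ious[i])

anchors = [(0, 0, 8, 8), (0, 0, 16, 16), (0, 0, 32, 32)]
gt = (1, 1, 15, 17)
best = match_gt_to_anchors(gt, anchors)
print(best)  # 1 — the medium anchor matches this GT box best
# classification, position and (for human-body GT boxes) face-prediction losses
# would then be computed against this matched anchor's predictions
```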
Therefore, with the above image association method, face and human-body association can be realized based on the YOLO single-stage detection framework merely by changing the detection output layer, and real-time performance is maintained without loss of accuracy. Furthermore, the approach can be extended to other applications, such as detecting associations between various parts of the human body. In addition, only the number of channels of the detection output layer is increased, so no significant additional memory footprint is introduced.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
In this embodiment, an image association apparatus is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, and the description of the apparatus is omitted for brevity. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 10 is a structural block diagram of an apparatus for associating images according to an embodiment of the present invention. As shown in Fig. 10, the apparatus includes: an input module 1002, configured to input an image to be processed into a target neural network and obtain an output result from an output layer of the target neural network, where the image to be processed includes a plurality of target images, each target image includes a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids obtained by dividing the image to be processed into a plurality of grids, the number of frames to be predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result includes a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; a filtering module 1004, configured to filter the plurality of detection frames in the output result according to non-maximum suppression (NMS); and an association module 1006, configured to determine, from the filtered detection frames, the detection frames having the association relationship according to the IoU of each detection frame.
Optionally, the input module 1002 includes: the input unit is used for inputting the image to be processed into an output layer in the target neural network; the processing unit is used for processing the image to be processed according to the channel of the output layer to obtain a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; wherein the detection box associated with the second object comprises: a detected detection frame associated with the second object, a detection frame associated with the second image predicted from the plurality of detection frames associated with the first object.
Optionally, the association module 1006 includes: a determination unit, configured to determine the IoU between each detected detection frame related to the second object and each detection frame related to the second image predicted from the plurality of detection frames related to the first object; and an association unit, configured to determine that the detected detection frame related to the second object and the predicted detection frame related to the second image that have the largest IoU have the association relationship.
Optionally, the filtering module 1004 is configured to perform the following steps:
S1, selecting, according to the confidence of each detection frame, a first detection frame with the highest confidence from the plurality of detection frames in the output result;
S2, determining a plurality of IoU values between the first detection frame and the other detection frames in the output result;
S3, determining, among these IoU values, those greater than a preset threshold, and filtering out the other detection frame (other than the first detection frame) involved in each such IoU;
S4, repeating the above steps S1 to S3 on the remaining detection frames in order of confidence, down to the detection frame with the lowest confidence.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Example 3
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1, inputting the image to be processed into a target neural network, and obtaining an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids obtained by dividing the image to be processed into a plurality of grids, the number of frames to be predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object;
S2, filtering the plurality of detection frames in the output result according to non-maximum suppression (NMS);
S3, determining, from the filtered detection frames, the detection frames having the association relationship according to the IoU of each detection frame.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, inputting the image to be processed into a target neural network, and obtaining an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids obtained by dividing the image to be processed into a plurality of grids, the number of frames to be predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object;
S2, filtering the plurality of detection frames in the output result according to non-maximum suppression (NMS);
S3, determining, from the filtered detection frames, the detection frames having the association relationship according to the IoU of each detection frame.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for associating images, comprising:
inputting an image to be processed into a target neural network, and obtaining an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids obtained by dividing the image to be processed into a plurality of grids, the number of frames to be predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object;
filtering the plurality of detection frames in the output result according to non-maximum suppression (NMS); and
determining, from the filtered detection frames, the detection frames having the association relationship according to the intersection-over-union (IoU) of each detection frame.
2. The method of claim 1, wherein inputting the image to be processed into a target neural network and obtaining an output result from an output layer of the target neural network comprises:
inputting the image to be processed into an output layer in the target neural network;
processing the image to be processed according to the channel of the output layer to obtain a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; wherein the detection box associated with the second object comprises: a detected detection frame associated with a second object, a detection frame associated with a second image predicted from a plurality of detection frames associated with the first object.
3. The method of claim 2, wherein determining, from the filtered detection frames, the detection frames having the association relationship according to the IoU of each detection frame comprises:
determining IoU between each detected detection frame associated with the second object and a detection frame associated with the second image predicted from the plurality of detection frames associated with the first object;
the detected detection frame associated with the second object from which the largest is selected IoU has the association with the detection frame associated with the second image predicted from the plurality of detection frames associated with the first object.
4. The method according to claim 1, wherein filtering the plurality of detection boxes in the output result according to the non-maximum suppression NMS comprises:
S1, selecting, according to the confidence of each detection frame, a first detection frame with the highest confidence from the plurality of detection frames in the output result;
S2, determining a plurality of IoU values between the first detection frame and the other detection frames in the output result;
S3, determining, among these IoU values, those greater than a preset threshold, and filtering out the other detection frame (other than the first detection frame) involved in each such IoU;
S4, repeating the above steps S1 to S3 on the remaining detection frames in order of confidence, down to the detection frame with the lowest confidence.
5. The method of any one of claims 1 to 4, wherein the loss function in the target neural network is a combination of a total variance loss function and a cross entropy loss function.
6. The method according to any one of claims 1 to 4, wherein the number of output layers is 3, and the scales of the 3 output layers differ from each other.
7. An apparatus for associating images, comprising:
an input module, configured to input an image to be processed into a target neural network and obtain an output result from an output layer of the target neural network, wherein the image to be processed comprises a plurality of target images, each target image comprises a first object and a second object, and the first object and the second object have an association relationship; the number of channels of the output layer is determined by the following parameters: the number of grids obtained by dividing the image to be processed into a plurality of grids, the number of frames to be predicted for each grid, the position information of the frames in the image to be processed, the confidence, the class probabilities of the first object and the second object, and the position information of the second object predicted from the first object; the output result comprises a plurality of detection frames related to the first object and a plurality of detection frames related to the second object;
a filtering module, configured to filter the plurality of detection frames in the output result according to non-maximum suppression (NMS); and
an association module, configured to determine, from the filtered detection frames, the detection frames having the association relationship according to the intersection-over-union (IoU) of each detection frame.
8. The apparatus of claim 7, wherein the input module comprises:
the input unit is used for inputting the image to be processed into an output layer in the target neural network;
the processing unit is used for processing the image to be processed according to the channel of the output layer to obtain a plurality of detection frames related to the first object and a plurality of detection frames related to the second object; wherein the detection frame associated with the second object comprises: a detected detection frame associated with the second object, a detection frame associated with a second image predicted from a plurality of detection frames associated with the first object.
9. The apparatus of claim 8, wherein the associating module comprises:
a determination unit configured to determine IoU between each of the detected detection frames related to the second object and a detection frame related to the second image predicted from the plurality of detection frames related to the first object;
and an association unit, configured to determine that the detected detection frame related to the second object and the detection frame related to the second image predicted from the plurality of detection frames related to the first object that have the largest IoU have the association relationship.
10. The apparatus of claim 7, wherein the filtering module is configured to perform the steps of:
S1, selecting, according to the confidence of each detection frame, a first detection frame with the highest confidence from the plurality of detection frames in the output result;
S2, determining a plurality of IoU values between the first detection frame and the other detection frames in the output result;
S3, determining, among these IoU values, those greater than a preset threshold, and filtering out the other detection frame (other than the first detection frame) involved in each such IoU;
S4, repeating the above steps S1 to S3 on the remaining detection frames in order of confidence, down to the detection frame with the lowest confidence.
11. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 6 when executed.
12. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.
CN201911329727.3A 2019-12-20 2019-12-20 Image association method and device, storage medium and electronic device Active CN111008631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911329727.3A CN111008631B (en) 2019-12-20 2019-12-20 Image association method and device, storage medium and electronic device


Publications (2)

Publication Number Publication Date
CN111008631A true CN111008631A (en) 2020-04-14
CN111008631B CN111008631B (en) 2023-06-16

Family

ID=70117421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911329727.3A Active CN111008631B (en) 2019-12-20 2019-12-20 Image association method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN111008631B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170154425A1 (en) * 2015-11-30 2017-06-01 Pilot Al Labs, Inc. System and Method for Improved General Object Detection Using Neural Networks
CN106650699A (en) * 2016-12-30 2017-05-10 中国科学院深圳先进技术研究院 CNN-based face detection method and device
WO2018130104A1 (en) * 2017-01-16 2018-07-19 腾讯科技(深圳)有限公司 Human head detection method, electronic device and storage medium
CN108038469A (en) * 2017-12-27 2018-05-15 百度在线网络技术(北京)有限公司 Method and apparatus for detecting human body
WO2019232894A1 (en) * 2018-06-05 2019-12-12 中国石油大学(华东) Complex scene-based human body key point detection system and method
CN110163889A (en) * 2018-10-15 2019-08-23 腾讯科技(深圳)有限公司 Method for tracking target, target tracker, target following equipment
CN110298318A (en) * 2019-07-01 2019-10-01 北京中星微电子有限公司 Number of people human body associated detecting method, device and electronic equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151167A (en) * 2020-05-14 2020-12-29 余红兵 Intelligent screening method for six-age dental caries of children based on deep learning
CN111797829A (en) * 2020-06-24 2020-10-20 浙江大华技术股份有限公司 License plate detection method and device, electronic equipment and storage medium
CN111814612A (en) * 2020-06-24 2020-10-23 浙江大华技术股份有限公司 Target face detection method and related device thereof
US11605215B2 (en) 2020-08-01 2023-03-14 Sensetime International Pte. Ltd. Object association method, apparatus and system, and storage medium
CN112836745A (en) * 2021-02-02 2021-05-25 歌尔股份有限公司 Target detection method and device

Also Published As

Publication number Publication date
CN111008631B (en) 2023-06-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant