CN115984537A - Image processing method and device and related equipment


Info

Publication number
CN115984537A
Authority
CN
China
Prior art keywords
candidate detection
score
network
image
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111198800.5A
Other languages
Chinese (zh)
Inventor
王昌安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111198800.5A priority Critical patent/CN115984537A/en
Publication of CN115984537A publication Critical patent/CN115984537A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose an image processing method and apparatus and a related device, which can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, and assisted driving. The image processing method includes: acquiring an image to be processed; generating one or more candidate detection frames corresponding to the image to be processed; and calling a target detection model to process the one or more candidate detection frames to obtain a detection result of the image to be processed, where the detection result includes a target detection frame, among the one or more candidate detection frames, corresponding to a target category, and the target detection model is obtained by training a first branch network and a second branch network included in a first scoring network, using a second scoring network, sample images, and the class labels of the sample images. Through the embodiments of the present application, the accuracy of target detection on images can be effectively improved.

Description

Image processing method and device and related equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image processing method and apparatus, and a related device.
Background
In recent years, various computer vision tasks have made tremendous progress thanks to the development of deep learning. Object detection, one of the most fundamental tasks in computer vision, is challenging and is also an indispensable basic technology for many advanced applications (e.g., robot vision, face recognition, automatic driving, etc.).
Object detection aims at determining the class information and position information of specific objects in an image, and involves the classification and localization of object instances in the image. Most existing weakly supervised target detection techniques are based on multi-instance learning. This approach struggles to automatically learn highly accurate candidate detection frames and tends to be limited to candidate detection frames covering only the most discriminative local regions, so the trained detector is not ideal and the target detection result may contain errors.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device and related equipment, which can effectively improve the accuracy of target detection on an image.
An embodiment of the present application provides an image processing method, including:
acquiring an image to be processed;
generating one or more candidate detection frames corresponding to the image to be processed;
and calling a target detection model to process the one or more candidate detection frames to obtain a detection result of the image to be processed, where the detection result includes a target detection frame, among the one or more candidate detection frames, corresponding to a target category, and the target detection model is obtained by training a first branch network and a second branch network included in a first scoring network, using a second scoring network, sample images, and the class labels of the sample images.
An aspect of an embodiment of the present application provides an image processing apparatus, including:
the acquisition module is used for acquiring an image to be processed;
the generating module is used for generating one or more candidate detection frames corresponding to the image to be processed;
and the processing module is used for calling a target detection model to process the one or more candidate detection frames to obtain a detection result of the image to be processed, where the detection result includes one or more target detection frames, among the one or more candidate detection frames, corresponding to the target category, and the target detection model is obtained by training a first branch network and a second branch network included in the first scoring network, using the second scoring network, the sample image, and the class label of the sample image.
An aspect of an embodiment of the present application provides a computer device, including: a processor, a memory, and a network interface; the processor is connected with the memory and the network interface, wherein the network interface is used for providing a network communication function, the memory is used for storing program codes, and the processor is used for calling the program codes to execute the image processing method in the embodiment of the application.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, where the computer program includes program instructions, and when the program instructions are executed by a processor, the image processing method in the embodiments of the present application is performed.
Accordingly, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium; the processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the image processing method provided in an aspect of the embodiment of the present application.
In the embodiments of the present application, candidate detection frames that may contain real targets are generated for the image to be processed, and the candidate detection frames are then processed by calling the target detection model to obtain the detection result. Because the target detection model is obtained by training the first branch network and the second branch network included in the first scoring network, the model can exploit the complementarity of the two branch networks while processing the candidate detection frames to accurately determine the target detection frames corresponding to the target categories, thereby effectively improving the accuracy of target detection on the image.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is an architecture diagram of an image processing system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a visualization of a detection result provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for training a target detection model according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of an exemplary image processing network according to an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart illustrating another method for training a target detection model according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of another exemplary image processing network provided in an embodiment of the present application;
FIG. 8 is a block diagram of yet another exemplary image processing network provided by an embodiment of the present application;
FIG. 9 is a schematic flowchart of another image processing method provided in the embodiments of the present application;
fig. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, spanning both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Computer Vision technology (CV): computer vision is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking, and measurement on targets, and further performs graphic processing so that the computer output becomes an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition. In an embodiment, the image processing scheme provided by this application generates candidate detection frames corresponding to an image to be processed and obtains target detection frames corresponding to target categories, which specifically involves technologies such as image processing and image semantic understanding.
The image processing scheme provided by the embodiments of this application belongs to the Computer Vision (CV) and Machine Learning (ML) technologies in the field of artificial intelligence. The scheme can be applied in various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent transportation, and assisted driving, as described in the following embodiments:
the architecture of the image processing system provided in the embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is an architecture diagram of an image processing system according to an embodiment of the present disclosure. As shown in fig. 1, the architecture diagram may include a database 100 and an image processing apparatus 101, wherein the database 100 and the image processing apparatus 101 may be in communication connection through a wired or wireless manner. The database 100 may be a local database of the image processing apparatus 101, or may be a cloud database or the like accessible by the image processing apparatus 101, and the image processing apparatus 101 may be disposed in a computer device such as a server or a terminal.
The database 100 may include an image library and a detection result library. The image library is used to store various image data and video data, which may include pictures or videos collected in real time, or videos or pictures uploaded by a user through a terminal device; an image to be processed may be obtained from the image data or video data, for example by using a picture directly as the image to be processed or by capturing a certain frame of a video as the image to be processed. The image library may also store sample images, which may come from the training set of a professional sample library or may be custom-made, for example produced from various types of images crawled from the Internet; each sample image carries a real class label, which indicates the category of the target object contained in the sample image. Since target detection identifies the targets of interest and their locations in a given image, and is a basic prerequisite for subsequent higher-order computer vision analysis tasks, a dedicated detection result library may be allocated to store the detection results of processed images, so that other algorithms can directly retrieve a detection result from the library when processing similar or identical images or when a related result is needed. A detection result may include one or more candidate detection frames corresponding to the image to be processed and the target detection frames corresponding to target categories, where the target categories may be one or more predefined categories, each target category may correspond to multiple target detection frames, and the image in the region enclosed by a target detection frame is a target object of that category; in other words, the detection result includes the location information (target detection frames) and the category information (target categories) of the target objects.
The image processing device 101 may automatically acquire the image to be processed from the image library or acquire it in real time through a camera, and may process it using the image processing scheme provided in this application to obtain the detection result. In the embodiments of this application, candidate detection frames corresponding to the image to be processed may be generated first, and then a target detection model is used to perform detection processing on the candidate detection frames. The target detection model may be a trained first scoring network, which includes a first branch network and a second branch network used to determine a first prediction score and a second prediction score for each candidate detection frame, respectively; the prediction scores output by the two branch networks may then be fused to obtain a target prediction score, and the target detection frames corresponding to the target categories are determined from the one or more candidate detection frames according to the target prediction scores.
In addition, training of the target detection model may also be carried out by the image processing device 101. The training process is roughly as follows: first, a candidate detection frame generation method (such as an unsupervised candidate detection frame generation technique) generates, for an input sample image, a series of candidate detection frames that may contain real targets; supervised learning is then performed with image-level labels (i.e., the class labels of the samples) through a multi-instance learning method. After an initial score is obtained for each candidate detection frame, two levels of pseudo labels, a first pseudo label and a second pseudo label, are generated for the candidate detection frames in two different ways; the first and second pseudo labels complement each other through the two branch networks, reducing ambiguity in the model optimization process. Under the supervision of these two-level pseudo labels, the two branches can automatically assign class scores to a candidate detection frame, thereby training the target detection model. Finally, at prediction time, the two scores are fused (for example, by average voting) as the final prediction score of the detection frame, and the final detection result can be output by a method such as non-maximum suppression; a minimal sketch of this fusion step follows. It should be noted that the different terms rendered as "score" in the examples provided in this application have the same meaning unless the distinction is emphasized.
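As an illustration of the inference-time fusion just described, the following is a minimal sketch assuming PyTorch tensors, using torchvision's non-maximum suppression; the averaging rule and the 0.5 IoU threshold are illustrative choices, not fixed by this application.

```python
import torch
import torchvision

def fuse_and_suppress(boxes, score1, score2, class_idx, iou_thr=0.5):
    """boxes: (R, 4) candidate frames; score1, score2: (C, R) scores
    from the two branch networks for one image; class_idx: target class."""
    fused = (score1 + score2) / 2            # average voting over the two branches
    cls_scores = fused[class_idx]            # (R,) fused scores for this class
    keep = torchvision.ops.nms(boxes, cls_scores, iou_thr)  # suppress duplicates
    return boxes[keep], cls_scores[keep]
```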
It can be seen that this is a weakly supervised target detection method: only the target categories contained in an image need to be provided, and accurate target position information is no longer required. This greatly alleviates the problem that a general target detector needs a large amount of annotated data during training, greatly reduces annotation complexity, and makes it possible to use massive pictures obtained from the Internet for model training, thereby enabling large-scale/long-tail target detection. Unlike conventional approaches, this weakly supervised target detection technique based on the dual-branch structure can train a target detector (i.e., the target detection model) given only picture-level labels, achieving efficient training while the respective advantages and disadvantages of the two branch networks complement each other, further improving the accuracy of the model. Because the different branch networks included in the trained first scoring network evaluate candidate detection frames from different angles, a highly accurate score, and thus an accurate detection result, can be obtained.
In this embodiment of the application, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms, which is not limited herein. The terminal may be a smart phone, a tablet computer, a smart wearable device, a smart voice interaction device, a smart home appliance, a personal computer, a vehicle-mounted terminal, or the like, which is also not limited herein.
It should be noted that the image processing scheme of the present application may be applied to various scenes related to robot vision, and the image to be processed may be different for different scenes. For example, in an application scenario of industrial quality inspection, the image to be processed may be an image of various industrial parts collected, such as a PCB circuit board, a steel surface, a solar panel, a metal surface, and so on. For another example, in an application scenario of automatic driving, the image to be processed may be an image acquired in real time or in advance for various roads, and the road scenario may include any one of an expressway, an urban road, a rural road, and the like. For another example, in an application scenario of live-action navigation, the image to be processed may also be an image acquired based on an environment such as a street, a mall, and the like, where no limitation is imposed on the specific application scenario and the content of the image to be processed.
In addition, the image processing scheme provided by the application can be widely applied to the fields of robot navigation, intelligent video monitoring, aerospace, industrial detection and the like. For example, an image processing scheme is deployed on an industrial quality inspection platform for performing surface defect detection processing on input images of various industrial parts and outputting defect positions (detection frames) and defect types of images to be processed. The image processing scheme can also be integrated into an unmanned obstacle avoidance planning algorithm, the images acquired in real time are detected and processed, the type and the position of the target obstacle are identified, and then effective obstacle avoidance and driving route planning of the moving obstacle are realized in automatic driving.
The following describes a specific implementation of the image processing method according to the embodiment of the present application in detail with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic flowchart of an image processing method according to an embodiment of the present disclosure. The method may be performed by a computer device, such as the image processing device 101 shown in fig. 1. The image processing method comprises the following steps:
s201, acquiring an image to be processed.
In an embodiment, the image to be processed may be an image acquired by the image processing device in real time, or an image acquired in advance and stored in the image library; it may be a picture or any frame captured from a video, and the source and acquisition manner of the image to be processed are not limited here. The image to be processed may include different targets and their corresponding instances. For example, an image acquired in an urban road environment may include different targets such as people, traffic lights, cars, and buildings, and different cars in the image are different instances of the target "car".
S202, one or more candidate detection frames corresponding to the image to be processed are generated.
In an embodiment, a computer device (e.g., an image processing device) may automatically generate one or more candidate detection frames corresponding to the image to be processed, among which some may contain a target, where a target refers to a target object of a target category. Each candidate detection frame usually delimits an area of the image with a closed rectangle, so each candidate detection frame corresponds to a different region of the image to be processed; regions may partially overlap, and the images within the regions may belong to the same target category or to different target categories. Generating candidate detection frames gives a preliminary estimate of the positions of the target categories contained in the image to be processed, after which the target detection model can screen out the detection frames most likely to contain real targets as the detection result of the model.
Optionally, the candidate detection frames corresponding to the image to be processed may be generated by an unsupervised detection algorithm, for example Selective Search, by a classical detection algorithm such as Edge Boxes or the sliding-window method, or by a deep learning network, for example using an RPN (Region Proposal Network) to extract candidate detection frames; the specific implementation used to generate the candidate detection frames is not limited here. For convenience of explanation, taking selective search among the unsupervised candidate detection frame extraction algorithms as an example, the candidate detection frame generation process is briefly described as follows: (1) initialize the original regions with an image segmentation method, i.e., segment the image to be processed into many small sub-regions; (2) compute the similarity of every two adjacent sub-regions following a greedy strategy, where the similarity may be a weighted sum of color, texture, size, and shape similarities; (3) merge the two most similar sub-regions into one region; (4) compute the similarity between the merged region and its adjacent sub-regions; (5) repeat steps (3) and (4), iteratively merging regions until the whole image is merged into a single region, i.e., until no new region can be merged according to similarity, so that the finally obtained region granularity takes the region where each target object is located as a unit; (6) output the circumscribed rectangle of the region produced by each merge as a candidate detection frame; since every level of merging produces candidate detection frames, a given target category may have candidate detection frames of different granularities, yielding the one or more candidate detection frames corresponding to the image to be processed. In this application, "candidate frame" and "candidate detection frame" may have the same meaning unless the distinction is emphasized.
The selective search method computes the similarity between superpixels (i.e., the sub-regions obtained by segmenting the image), merges them in a bottom-up manner (which can be understood as merging image regions at granularities from small to large), and outputs the intermediate regions generated along the way as candidate frames that may contain targets; a minimal sketch of this merging loop is shown below. The method evaluates similarity using a series of traditional image features, and combining diverse region-similarity measures helps improve the probability of detecting objects and ensures the accuracy of candidate detection frame extraction. In addition, the method requires no model training, is fast and efficient, and has a high recall rate, providing a good initial detection set for the target detector. It should be noted that other, more advanced unsupervised candidate frame generation methods may also be used; all that is required is to generate candidate frames for images that carry no candidate frame annotations, and the basic requirement is that the recall of the generated candidate frames be as high as possible, so that real targets are covered as fully as possible during training and detection, improving the coverage of target-category object detection.
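The following is a minimal sketch of the greedy merging loop in steps (1)-(6) above. `initial_segments`, `similarity`, `adjacency`, and the region methods are hypothetical stand-ins for the segmentation initialization and the weighted color/texture/size/shape similarity; a production version would use an existing selective-search implementation rather than this sketch.

```python
def selective_search_sketch(initial_segments, similarity, adjacency):
    """Greedily merge the most similar adjacent regions, collecting the
    bounding box of every intermediate region as a candidate detection frame."""
    regions = list(initial_segments)                  # step (1): initial sub-regions
    candidates = [r.bounding_box() for r in regions]
    # step (2): similarity of every pair of adjacent sub-regions
    sims = {(a, b): similarity(a, b) for (a, b) in adjacency}
    while sims:                                       # step (5): iterate to one region
        (a, b) = max(sims, key=sims.get)              # step (3): most similar pair
        merged = a.merge(b)
        # step (4): drop pairs touching a or b, then recompute for the merged region
        sims = {p: s for p, s in sims.items() if a not in p and b not in p}
        for n in merged.neighbours():
            sims[(merged, n)] = similarity(merged, n)
        candidates.append(merged.bounding_box())      # step (6): keep every level
    return candidates
```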
In addition, the sliding-window method can exhaust all areas of the image to be processed by sliding a window from left to right and from top to bottom (a self-contained sketch follows); the Edge Boxes approach can determine candidate detection frames using edge information; and candidate detection frames can also be generated directly by a network obtained through deep learning, such as an RPN. It should be noted that, compared with traditional detection methods, RPN-like approaches can also greatly increase the speed of candidate detection frame generation thanks to learning.
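For comparison, here is a self-contained sketch of the sliding-window enumeration mentioned above; the window sizes and stride are illustrative choices.

```python
def sliding_window_boxes(img_w, img_h, sizes=((64, 64), (128, 128)), stride=32):
    """Enumerate (x1, y1, x2, y2) windows left-to-right, top-to-bottom."""
    boxes = []
    for (w, h) in sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                boxes.append((x, y, x + w, y + h))
    return boxes
```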
S203, calling a target detection model to process the one or more candidate detection frames to obtain a detection result of the image to be processed.
In an embodiment of the application, the detection result includes one or more target detection frames corresponding to a target class in the candidate detection frames, where the target detection model is obtained by training a first branch network and a second branch network included in a first scoring network by using the second scoring network, the sample image, and a class label of the sample image.
Each target category may correspond to one or more target detection frames; different target detection frames may correspond to different instances of the same target category or to instances of different target categories, the image in the region enclosed by each target detection frame is an object of the target category, and the target categories may be one or more predefined categories. For example, if the target categories are "person" and "car", the target category "person" may correspond to 2 target detection frames, indicating that the image includes 2 different persons, and the target category "car" may correspond to 5 target detection frames, indicating that there are 5 different cars in the image. The detection result may not only be stored in a numerical form but may also be displayed in the image to be processed, i.e., by marking the target detection frames and the corresponding target categories; see fig. 3, which is a schematic view of a visualized detection result provided by this application. The detection result shown is the result of applying the image processing scheme to industrial quality inspection: a picture of a single PCB is taken as input, and the output detection result gives each defect position (target detection frame), defect category (target category), and the score of the defect category in the picture, including short defect (short, 1.00), open defect (open, 1.00), edge defect (mousebite, 1.00), middle defect (pin-hole, 1.00), and redundancy (wrapper, 1.00).
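For illustration only, a detection result for the PCB example above might be rendered as the following structure; the field names and identifier are assumptions, not a storage format prescribed by this application.

```python
detection_result = {
    "image_id": "pcb_0001",  # hypothetical identifier
    "detections": [          # one entry per target detection frame
        {"box": (112, 40, 188, 96),   "category": "short",     "score": 1.00},
        {"box": (240, 130, 300, 170), "category": "open",      "score": 1.00},
        {"box": (52, 210, 110, 260),  "category": "mousebite", "score": 1.00},
    ],
}
```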
The first scoring network is trained using the second scoring network, the sample images, and the class labels of the sample images, so that the resulting target detection model has the capability of determining whether the image to be processed contains objects of the target categories. The two branch networks included in the first scoring network are trained under different supervision information. The second scoring network may also include two branch networks, referred to herein as a classification branch network and a detection branch network, which perform different score calculations for the candidate detection frames of a sample image along a horizontal dimension and a vertical dimension respectively, where the horizontal dimension refers to the likelihood that a fixed candidate detection frame belongs to each target category, and the vertical dimension refers to the contribution of each candidate detection frame within a fixed target category. Alternatively, the second scoring network may adopt the classification and detection branches of WSDDN (Weakly Supervised Deep Detection Network).
In the training process, the first scoring network and the second scoring network are used at the same time, the second scoring network can be regarded as an auxiliary training network of the first scoring network, and in specific application, only the trained first scoring network is used, namely, the trained first scoring network is used as a target detection model. The specific process of training the object detection model can be referred to the content of the corresponding embodiment in fig. 4, and will not be described in detail herein.
In one embodiment, the second scoring network includes a classification branch network and a detection branch network. The classification branch network, the detection branch network, the first branch network, and the second branch network may each consist of a fully connected layer and a normalization layer; the difference is that the parameters of the fully connected layer FC and the processing of the normalization layer softmax may differ. For example, the fully connected layers in the classification branch network and the detection branch network output vectors of the same size, but when the normalization layers compute the scores of the candidate detection frames, they compute along different dimensions, considering the association between candidate detection frames and target categories from different perspectives.
Optionally, before the target detection model is called to process one or more candidate detection frames of the image to be processed, the feature extraction network may be used to extract image features of image regions corresponding to the candidate detection frames in the image to be processed, and then the target detection model is called to process the image features, so as to obtain a detection result. The feature extraction network can be regarded as a backbone network, and any network structure can be adopted as long as high-resolution features (namely image features corresponding to candidate detection frames) with strong semantic information can be extracted.
In summary, the embodiments of the present application have at least the following advantages:
candidate detection frames of the images to be processed are automatically extracted in an unsupervised candidate detection frame generation mode, model training is not needed under the condition that the accuracy of extraction of the candidate frames is guaranteed, and the calculation amount brought by the model training can be reduced, so that the early preparation workload of target detection is reduced; the candidate detection frames are processed by adopting the target detection model with the double-branch network structure, and the target detection frames corresponding to the target categories can be accurately selected, so that the classification and positioning accuracy of the target object is improved.
Referring to fig. 4, fig. 4 is a flowchart illustrating a method for training a target detection model according to an embodiment of the present disclosure, where the method may be executed by a computer device (e.g., the image processing device 101 shown in fig. 1). Wherein the method includes, but is not limited to, the steps of:
s401, a training sample set is obtained.
In the embodiment of the present application, the training sample set includes a plurality of sample images and a category label of each sample image. The training sample set may include a large number of sample images carrying real class labels, where the class labels of the sample images belong to image-level labels, and are labels for classes of objects included in the sample images, and the classes indicated by all the class labels in the training sample set may be regarded as target classes.
Because this scheme adopts a novel weakly supervised target detection structure based on a dual-branch design, no supervision at the target-frame level is needed; model training can be carried out given only the categories present in an image, so large-scale Internet picture resources can be well utilized. The sample images may therefore be massive pictures acquired from the Internet, for example pictures given preliminary image-level annotation by combining the search results of a keyword search engine. The sample images in the training set may also differ for different application scenarios; for industrial quality inspection, for example, the sample images may include images with surface defects and images with intact surfaces. Target detection models trained on different training sample sets can thus acquire the capability of recognizing objects of different target categories.
S402, calling a second scoring network to process one or more candidate detection frames corresponding to each sample image to obtain the initial score of each candidate detection frame in each target category and the image score of each sample image in each target category.
In an embodiment, before invoking the second scoring network to process the one or more candidate detection frames corresponding to each sample image, the candidate detection frame of each sample image needs to be generated, and a specific manner may be the same as or different from a manner of generating the one or more candidate detection frames corresponding to the image to be processed in the foregoing embodiment, which is not limited herein. Because the training sample set does not provide a real target detection frame, the scheme can also use an unsupervised candidate detection frame generation method to generate some candidate detection frames which may contain the real target, and then can screen out the detection frames which are most likely to contain the real target by using the model to perform supervised learning.
In an embodiment, after the candidate detection box of each sample image is generated, the step of calling the second scoring network to process the candidate detection box may include: acquiring image characteristics of an image area corresponding to each candidate detection frame in each sample image in one or more candidate detection frames corresponding to each sample image, wherein the image characteristics are obtained by performing characteristic extraction on images in the image area corresponding to the candidate detection frame by using a characteristic extraction network; determining an initial category score and an initial region score of each candidate detection frame based on the image characteristics, wherein the initial category score is used for representing the probability that each candidate detection frame belongs to each target category, and the initial region score is used for representing the probability that each candidate detection frame contributes to each target category; performing fusion processing on the initial category score and the initial region score to obtain an initial score of each candidate detection frame in each target category; and determining the image score of each sample image in each target category based on the initial score of each candidate detection frame corresponding to each sample image.
The feature extraction Network may adopt various pre-training networks with feature extraction functions, for example, a Convolutional Neural Network pre-trained on the ImageNet, specifically, a Convolutional Network model of VGG (Visual Geometry Group) series, or a Convolutional Neural Network (CNN) carrying an SPP (Spatial Pyramid Pooling) layer, and a specific structure of the feature extraction Network is not limited herein. For one or more corresponding candidate detection frames in each sample image, the feature extraction network may be used to extract image features of corresponding image regions of the candidate detection frames in the corresponding sample image.
For the specific feature extraction process, this application takes a deep convolutional neural network carrying an SPP (Spatial Pyramid Pooling) layer as an example: deep image features are extracted from the whole sample image by the deep convolutional network CNN, and then, for each candidate detection frame, the region pooling of the SPP layer extracts the deep features corresponding to that candidate detection frame, i.e., the image features of the candidate detection frame's image region in the sample image, as pooled deep image features. Using the SPP layer for pyramid pooling of the deep image features allows sample images of arbitrary input size to be processed, avoiding the constraint of a fixed input size; meanwhile, extracting spatial feature information at different scales can improve the model's robustness to spatial layout and object deformation. Further, the feature extraction network may also include several fully connected layers; after the image features are obtained, they may be processed by multiple (e.g., 2) fully connected layers, mapping the abstract information of different receptive fields into a larger space to obtain a comprehensive representation of each candidate detection frame's image features, which can increase the representational capacity of the model.
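A minimal sketch of this per-frame feature extraction, assuming PyTorch; torchvision's single-scale RoI pooling stands in for the SPP layer's region pooling (a true SPP layer would pool at several grid sizes), and the backbone and layer sizes are illustrative.

```python
import torch
import torchvision

backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
fc = torch.nn.Sequential(                       # the 2 fully connected layers
    torch.nn.Flatten(),
    torch.nn.Linear(512 * 7 * 7, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
)

def box_features(image, boxes):
    """image: (1, 3, H, W); boxes: (R, 4) as (x1, y1, x2, y2) in image coords."""
    fmap = backbone(image)                      # whole-image deep features
    spatial_scale = fmap.shape[-1] / image.shape[-1]   # ~1/32 for VGG16
    pooled = torchvision.ops.roi_pool(          # region pooling per candidate frame
        fmap, [boxes], output_size=(7, 7), spatial_scale=spatial_scale)
    return fc(pooled)                           # (R, 4096) shared representation
```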
Alternatively, the second scoring network may include a classification branch network and a detection branch network, which give different scores based on the same image features, namely an initial category score and an initial region score; these two scores measure the degree of association between a given detection frame and each target category from different angles. In the specific processing, the image features can be mapped into different feature spaces through fully connected layers to obtain two features used for the class predictions of the two branches: branch one (the classification branch network, or classification channel) predicts the probability distribution of each candidate detection frame over the classes, and branch two (the detection branch network, or detection channel) predicts the probability of the current candidate detection frame's contribution to each class. For each branch, the corresponding feature is then processed along a different dimension by a normalization layer; the softmax calculation of branch one may be as shown in equation (1) below:
$$[\sigma_{\mathrm{class}}(x^{c})]_{ij}=\frac{e^{x^{c}_{ij}}}{\sum_{k=1}^{C}e^{x^{c}_{kj}}}\qquad(1)$$

where $[\sigma_{\mathrm{class}}(x^{c})]_{ij}$ denotes the initial class score, i.e., the probability that the jth candidate detection frame belongs to the ith target class; $x^{c}$ denotes the feature obtained by passing the image features through the fully connected layer of the classification branch network, a matrix of size $C\times|R|$, where $|R|$ is the total number of candidate detection frames, $C$ is the total number of target classes, and $k$ indexes the kth target class among all target classes.
The softmax calculation for branch two can be shown as the following equation (2):
$$[\sigma_{\mathrm{det}}(x^{d})]_{ij}=\frac{e^{x^{d}_{ij}}}{\sum_{k=1}^{|R|}e^{x^{d}_{ik}}}\qquad(2)$$

where $[\sigma_{\mathrm{det}}(x^{d})]_{ij}$ denotes the initial region score, i.e., the informativeness the jth candidate detection frame contributes to the ith target class, or the probability of the jth candidate detection frame's contribution to the ith target class; $x^{d}$ denotes the feature obtained by passing the image features through the fully connected layer of the detection branch network, also a matrix of size $C\times|R|$, where $|R|$ is the total number of candidate detection frames, $C$ is the total number of target classes, and $k$ indexes the kth candidate detection frame among all candidate detection frames.
It can be seen that the initial class score and the initial region score above are computed from matrices of the same size, along the target category dimension and the candidate detection frame dimension respectively; the two scores measure the degree of association between a candidate detection frame and each target category from different angles, and fusing them (for example, by multiplication) yields the initial score of the candidate detection frame, which describes relatively comprehensively the probability that the detection frame belongs to a given category.
Optionally, element-wise multiplication (i.e., the Hadamard product) may be performed on the normalized initial class scores and initial region scores of all candidate detection frames of each sample image, i.e., multiplying the element values at corresponding positions of the initial class score matrix and the initial region score matrix; the resulting data keeps a relatively low dimensionality, which helps improve computation efficiency and speed. A sketch of the two branches and this fusion follows.
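A minimal sketch of the two branches and the fusion in equations (1) and (2), assuming PyTorch; the class count and feature size are illustrative.

```python
import torch

C, FEAT = 10, 4096                     # target classes, shared feature size
cls_head = torch.nn.Linear(FEAT, C)    # classification branch FC
det_head = torch.nn.Linear(FEAT, C)    # detection branch FC

def initial_scores(features):          # features: (R, FEAT) per-frame features
    x_c = cls_head(features).t()       # (C, R), the matrix in eq. (1)
    x_d = det_head(features).t()       # (C, R), the matrix in eq. (2)
    s_class = torch.softmax(x_c, dim=0)   # eq. (1): normalize over classes
    s_det = torch.softmax(x_d, dim=1)     # eq. (2): normalize over frames
    return s_class * s_det             # Hadamard fusion: (C, R) initial scores
```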
Each candidate detection frame of each sample image has an initial score for each target category. The initial scores of all candidate detection frames under a given target category may be fused, for example averaged via instance-level pooling, to obtain the image score of the sample image for that category; repeating this for every target category contained in the sample image gives the image score of each sample image in each target category. The image score is a predicted image-level score and may be combined with the real class labels to determine the loss of the second scoring network. For example, suppose a sample image has 10 class labels (i.e., 10 target categories) and 20 candidate detection frames; then each candidate detection frame has 10 initial scores under the 10 target categories, and the image score of the sample image under a given target category may be obtained by averaging the 20 initial scores under that category.
That is, since there are no class labels for individual candidate detection frames, only the probability distribution of all candidate detection frames over the target categories is available; the probability distribution of the entire sample image can therefore be obtained by averaging the probabilities of all candidate detection frames within the same target category, where this distribution contains the predicted probability that the sample image belongs to each target category. Supervised learning is then performed on this image-level probability distribution, i.e., the loss value is computed from the predicted distribution and the real class labels. In the inference stage, the probability distribution of each candidate detection frame can be used directly as its initial prediction score, i.e., the initial score of that candidate detection frame for each target category. A minimal sketch of this image-level supervision follows.
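Continuing the sketch above, image-level supervision might look as follows; binary cross-entropy is an assumed choice of multi-label loss, as this application does not fix a specific loss function here.

```python
import torch

def image_level_loss(initial, labels):
    """initial: (C, R) fused initial scores; labels: (C,) multi-hot class labels."""
    image_scores = initial.mean(dim=1)            # average over candidate frames
    image_scores = image_scores.clamp(1e-6, 1 - 1e-6)
    return torch.nn.functional.binary_cross_entropy(image_scores, labels.float())
```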
Given the candidate detection frames of each sample image, the processing performed by the second scoring network can be viewed as a multi-instance learning scheme in which supervised learning uses image-level labels. Specifically, all candidate detection frames are treated as a bag, which contains frames enclosing real targets as well as background frames, and the label of the bag is the set of real target categories contained in the sample image. In other words, the candidate detection frames carry the labels of the target categories, but how truly a candidate detection frame belongs to each target category is measured by its initial score, and the initial scores can then be used to further screen the candidate detection frames that contain real target categories.
And S403, calling a first scoring network to process one or more candidate detection frames corresponding to each sample image to obtain a first prediction score and a second prediction score of each candidate detection frame in each target category.
In an embodiment, the first scoring network and the second scoring network share the same image features of the regions corresponding to the candidate detection frames, that is, the image features input into the second scoring network are also input into the first scoring network for processing. Optionally, the step of calling the first scoring network to process the candidate detection frames of the sample images may be: calling the first branch network of the first scoring network to process the one or more candidate detection frames corresponding to each sample image to obtain a first prediction score of each candidate detection frame in each target category; and calling the second branch network of the first scoring network to process the one or more candidate detection frames corresponding to each sample image to obtain a second prediction score of each candidate detection frame in each target category.
Both the first branch network and the second branch network may use a fully connected layer cascaded with a normalization layer, and the processing in the two branches may be similar or completely different. The first and second prediction scores produced by the two branch networks are used, together with the outputs of the second scoring network, to compute the losses of the branch networks. In addition, the supervision information used by the two branch networks differs: the first branch network performs supervised learning with instance-level pseudo labels, while the second branch network performs supervised learning with bag-level pseudo labels, so the two branch networks evaluate the same candidate detection frame along different dimensions; this gives the model instance-level discrimination capability, while the bag-level label supervision avoids ambiguity in model optimization. The specific supervised learning process is described in the following embodiments and is not detailed here.
S404, training the first scoring network and the second scoring network by using the first prediction score, the second prediction score, the initial score, the image score and the category label to obtain a target detection model.
In the embodiment of the application, for each branch network, the loss can be determined by using the corresponding score and label, then the losses of all the branch networks are fused to obtain the total loss, the first scoring network and the second scoring network are trained by using the total loss, and finally the trained first scoring network is used as a target detection model. For a specific training process (i.e., a supervised learning process), reference may be made to the contents of the corresponding embodiment in fig. 6, which is not described in detail herein.
For the model structure in the training phase, see fig. 5 showing a schematic structure diagram of an exemplary image processing network, which includes a feature extraction network, a first branch network and a second branch network of a first scoring network, a detection branch network and a classification branch network of a second scoring network, and results obtained by processing of the networks, where a determination manner of a loss of the first scoring network and a loss of the second scoring network may refer to contents in the embodiment corresponding to fig. 6, and other processing principles are not described herein again.
It can thus be seen that the method outputs the target regions and target categories contained in a given image; its main characteristic is that no real target detection frames need to be provided in the model training stage, only the target categories contained in each image. This is important for training target detection models on large-scale training pictures, because it avoids a large amount of manual annotation. Especially for long-tail categories (i.e., categories with few samples), annotating class labels or candidate detection frames usually requires the assistance of industry experts, which is time-consuming and labor-intensive and makes it difficult to obtain the large amount of training data required by deep learning models.
In summary, the embodiments of the present application have at least the following advantages:
in the training process, a candidate detection frame for manual labeling is not required to be provided, and a target detection model is trained by using a sample image only carrying a class label, so that the cost and the workload of manual labeling are greatly reduced, and the training efficiency is improved; the candidate detection frame of the sample image is subjected to score prediction by using two different network branches of the first scoring network to obtain different prediction scores, and then the two scoring networks are subjected to supervised learning by combining the prediction scores and other related training information, so that the trained first scoring network with high accuracy can be obtained and used as a target detection model, and the accuracy of a detection result can be effectively improved in an application stage.
Referring to fig. 6, fig. 6 is a flowchart illustrating another method for training a target detection model according to an embodiment of the present disclosure, where the method may be executed by a computer device (e.g., the image processing device 101 shown in fig. 1). Wherein the method includes, but is not limited to, the steps of:
s601, obtaining a training sample set.
S602, calling a second scoring network to process one or more candidate detection frames corresponding to each sample image, and obtaining the initial score of each candidate detection frame in each target category and the image score of each sample image in each target category.
S603, calling a first scoring network to process one or more candidate detection frames corresponding to each sample image to obtain a first prediction score and a second prediction score of each candidate detection frame in each target category.
The specific implementation manner of the above steps may refer to the content introduced in steps S401 to S403 in the corresponding embodiment of fig. 4, which is not described herein again.
S604, acquiring a first pseudo label corresponding to each candidate detection frame by using the initial score of each candidate detection frame in each target category, and acquiring a second pseudo label corresponding to the multi-example packet in which each candidate detection frame is located by using the initial score of each candidate detection frame in each target category.
In an embodiment, an optional implementation manner of obtaining the first pseudo label corresponding to each candidate detection frame may be: removing the candidate detection frames whose initial score is smaller than a score threshold from the one or more candidate detection frames corresponding to each sample image to obtain N candidate detection frames, where N is a positive integer; selecting the candidate detection frame with the largest initial score from the N candidate detection frames, taking the candidate detection frame with the largest initial score as a reference detection frame, and determining the corresponding target category as the first pseudo label of the reference detection frame; determining the overlapping degree between the reference detection frame and the candidate detection frames other than the reference detection frame among the N candidate detection frames, and rejecting the candidate detection frames whose overlapping degree is greater than or equal to a first overlapping degree threshold; and selecting the candidate detection frame with the largest initial score from the candidate detection frames after the rejection, and re-executing the step of taking the candidate detection frame with the largest initial score as the reference detection frame, until the N candidate detection frames are traversed.
For the one or more candidate detection frames corresponding to each sample image, the score threshold is used to remove, in each target category, the candidate detection frames whose initial score is smaller than the score threshold. This filters out candidate detection frames that have a low confidence of containing a real target, so that candidate detection frames with a high confidence of containing a real target can be screened out quickly afterwards. Specifically, each target category may be traversed, that is, with the target category as a fixed dimension, all candidate detection frames under that category are sorted by their initial scores, for example from high to low, and the candidate detection frames with smaller initial scores are then removed; for all candidate detection frames in the target category, N candidate detection frames remain after the removal.
Then, the candidate detection frame with the largest initial score among the N candidate detection frames is taken as the reference detection frame, and the corresponding target category may be determined as the first pseudo label of the reference detection frame. It should be noted that the first pseudo label is a label marked on the candidate detection frame; since the object contained in the candidate detection frame can be regarded as an example, this label belongs to an example-level label, and it is called a pseudo label here because whether it is exactly correct is unknown. The reference detection frame serves as the currently determined real target frame and can be used to screen out other redundant candidate detection frames, that is, by calculating the overlapping degree between the reference detection frame and the other candidate detection frames among the N candidate detection frames, the candidate detection frames whose overlapping degree is greater than or equal to the first overlapping degree threshold are rejected. Optionally, the overlapping degree may be calculated as the Intersection over Union (IoU), which takes a value between 0 and 1 and represents the degree of overlap between two candidate detection frames, a higher value indicating a higher overlapping degree. Rejecting candidate detection frames with a high overlapping degree reduces the number of candidate detection frames, which facilitates extracting other real target frames and marking example-level labels on the candidate detection frames.
Then, from the remaining candidate detection frames that were not rejected, the candidate detection frame with the largest initial score is again selected as the reference detection frame and taken as a real target frame, and the steps of calculating the overlapping degree, rejecting the candidate detection frames that do not meet the condition, and screening the reference detection frame are repeated in turn until the N candidate detection frames are traversed, that is, until every candidate detection frame of the target category has either been marked with the target category or been rejected for not meeting the condition. The above process is repeated for all categories, and finally every candidate detection frame can be marked with a label, where the label may be a category label corresponding to a target category or a background label; the first pseudo label here generally refers to a category label corresponding to a target category, and such a label generally belongs to a foreground target label. Secondary supervised learning on the candidate detection frames can be performed by using the first pseudo label, for example, a loss is calculated by using the first pseudo label and the first prediction score, and the loss is propagated backward to adjust the first branch network. Optionally, the candidate detection frames of the target category that remain unmarked after the above process are all marked with the background label; alternatively, the candidate detection frames whose overlapping degree is higher than the first overlapping degree threshold may be ignored. In the latter case, for all candidate detection frames of a certain target category, three types of candidate detection frames are finally obtained: the candidate detection frames marked with the target category, the candidate detection frames that are ignored, and the candidate detection frames whose initial score is smaller than the score threshold and which are therefore not traversed.
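For illustration only, a minimal Python sketch of the greedy labelling procedure described above is given below; this sketch is not part of the claimed embodiments, and the helper name iou, the function name first_pseudo_labels, and the threshold values score_thr and overlap_thr are hypothetical assumptions. Frames left unmarked here (label -1) would subsequently receive the background label or be ignored, as described above.

```python
import numpy as np

def iou(box, boxes):
    # Intersection over Union between one frame and an array of frames, format (x1, y1, x2, y2).
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    areas_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + areas_b - inter + 1e-9)

def first_pseudo_labels(boxes, scores, image_classes, score_thr=0.1, overlap_thr=0.5):
    # Illustrative sketch only; names and thresholds are assumptions, not the patented implementation.
    # boxes: (R, 4) candidate detection frames; scores: (R, C) initial scores;
    # image_classes: the target categories known to be present in the sample image.
    labels = np.full(len(boxes), -1, dtype=np.int64)         # -1 = not yet marked
    for c in image_classes:
        kept = np.where(scores[:, c] >= score_thr)[0]        # drop low-score frames
        alive = list(kept[np.argsort(-scores[kept, c])])     # sort high to low by initial score
        while alive:
            ref = alive.pop(0)                               # reference detection frame
            labels[ref] = c                                  # first pseudo label
            if alive:
                overlaps = iou(boxes[ref], boxes[alive])
                # reject frames that overlap the reference frame too much
                alive = [b for b, o in zip(alive, overlaps) if o < overlap_thr]
    return labels
```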
The above process is also referred to as OICR (Online Instance Classifier Refinement) supervision. This branch supervision belongs to example-level pseudo-label supervision. Because the generated pseudo labels inevitably introduce some noise, directly using example-level supervision can cause learning ambiguity: for example, some background regions with high initial scores may be mistakenly marked as foreground targets, candidate frames covering only part of a target may be marked as positive samples, and other frames with low overlapping degree may be marked as negative samples, so that the model becomes ambiguous during optimization. Nevertheless, this branch applies example-level supervision directly to candidate target frames, and when the pseudo labels are sufficiently accurate, the prediction accuracy of the model can be optimized.
In an embodiment, an optional implementation manner of obtaining the second pseudo label corresponding to the multi-example packet in which each candidate detection frame is located may be: selecting a first candidate detection frame with the largest initial score of each target category from the one or more candidate detection frames corresponding to each sample image, and determining the overlapping degree between the first candidate detection frame and the candidate detection frames other than the first candidate detection frame in the one or more candidate detection frames; combining the first candidate detection frame and the candidate detection frames whose overlapping degree is greater than or equal to a second overlapping degree threshold into a multi-example packet, and determining the corresponding target category as a second pseudo label of the multi-example packet; removing the candidate detection frames in the multi-example packet from the one or more candidate detection frames to obtain P candidate detection frames, where P is a positive integer; and selecting a first candidate detection frame with the largest initial score from the candidate detection frames after the removal, and re-executing the step of determining the overlapping degree between the first candidate detection frame and the candidate detection frames other than the first candidate detection frame in the P candidate detection frames, until each of the one or more candidate detection frames is clustered into a multi-example packet.
Firstly, each target category is traversed: with the target category as a fixed dimension, the initial scores of all candidate detection frames under that category are compared, and the first candidate detection frame with the largest initial score among the candidate detection frames corresponding to the category is selected. Then, the overlapping degree between the first candidate detection frame and the other candidate detection frames is calculated; the calculation of the overlapping degree may be the same as or different from the one used when determining the first pseudo label. Next, the first candidate detection frame and all surrounding candidate detection frames whose overlapping degree is greater than or equal to the second overlapping degree threshold are combined into a multi-example packet, so called because the image in the region corresponding to each candidate detection frame can be regarded as an example (instance), and the second pseudo label of the multi-example packet is the corresponding target category. The second overlapping degree threshold may be the same as or different from the first overlapping degree threshold, and an overlapping degree threshold may be set according to an empirical value or iteratively optimized through training. Then, from the remaining P candidate detection frames of the target category, the candidate detection frames that are not contained in any multi-example packet are determined, and the above operation is repeated until every candidate detection frame is grouped into some multi-example packet. The above process is repeated for all target categories, and supervised learning is then performed on each multi-example packet separately in the multi-example learning manner, while the candidate detection frames of the background class can directly learn how to classify the background in the example-level supervised learning manner.
It should be noted that the candidate detection frames of the background class may be grouped into one multi-example packet whose label is the background label. Alternatively, candidate detection frames whose scores are too low to be grouped into any multi-example packet may be directly marked as the background class. In addition, for the remaining P candidate detection frames of each target category, as well as the above N candidate detection frames, the specific screening process differs across target categories, that is, P and N may vary under different target categories. The second pseudo label of a multi-example packet can represent the pseudo labels of all the candidate detection frames contained in the packet, that is, a batch of candidate detection frames is marked at once, and the second pseudo label is then used to supervise multiple candidate detection frames simultaneously, which can prevent the model from falling into a local optimum.
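As a similarly hedged sketch, reusing the hypothetical iou helper from the previous example, the clustering of candidate detection frames into multi-example packets with their second pseudo labels might look as follows; the function name and the threshold value are again assumptions rather than part of this embodiment.

```python
def build_multi_example_packets(boxes, scores, image_classes, overlap_thr=0.5):
    # Illustrative sketch only; names and threshold are assumptions, not the patented implementation.
    # Returns a list of (member_frame_indices, second_pseudo_label) pairs.
    packets = []
    for c in image_classes:
        remaining = list(np.argsort(-scores[:, c]))          # sort high to low by initial score
        while remaining:
            seed = remaining[0]                              # largest remaining initial score
            others = remaining[1:]
            members = [seed]
            if others:
                overlaps = iou(boxes[seed], boxes[others])
                members += [b for b, o in zip(others, overlaps) if o >= overlap_thr]
            packets.append((members, c))                     # second pseudo label = category c
            remaining = [b for b in remaining if b not in members]
    return packets
```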
The above process is also referred to as PCL (Proposal Cluster Learning) supervision. This supervision manner refines the scores in a multi-example learning manner. Since multiple candidate detection frames in one multi-example packet are supervised simultaneously, the optimal detection frame cannot be made to stand out, which contradicts the requirement in the test stage that the optimal detection frame have the highest score; this inconsistency between the training stage and the test stage can reduce the accuracy of the model. However, because multiple candidate detection frames are supervised simultaneously, the ambiguity defect caused by the example-level supervision of OICR can be overcome, that is, supervision is not applied directly to a single example, so the ambiguity caused by pseudo-label noise can be avoided. By fusing the two manners and letting their advantages and disadvantages complement each other, the image processing scheme provided by the embodiment of the application is beneficial to maximizing the precision of the trained target detection model, and further improves the accuracy of the detection result.
The obtained first pseudo label and second pseudo label may supervise different branch networks of the first scoring network, that is, losses are calculated and the first scoring network and the second scoring network are adjusted according to the total loss, as specifically described in step S605.
S605, determining the total loss corresponding to the first scoring network and the second scoring network by using the first prediction score, the second prediction score, the first pseudo label, the second pseudo label, the image score and the category label.
In an embodiment, the specific implementation manner of this step may be: determining a loss of the second scoring network using the image score and the category label; determining a loss of a first branch network of the first scored network using the first predicted score and the first pseudo tag, and determining a loss of a second branch network of the first scored network using the second predicted score and the second pseudo tag; and determining the total loss corresponding to the first scoring network and the second scoring network by using the loss of the second scoring network, the loss of the first branch network and the loss of the second branch network.
Optionally, the loss of the second scoring network, the loss of the first branch network, and the loss of the second branch network may each be calculated as a cross-entropy loss value; the loss of the first branch network and the loss of the second branch network are fused to obtain the loss of the first scoring network, and the loss of the first scoring network and the loss of the second scoring network are then fused to obtain the total loss corresponding to the first scoring network and the second scoring network, where the losses may be fused by weighted summation.
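Purely as an illustrative sketch of the weighted-summation fusion described above: the PyTorch-style function below, its argument names, the loss weights w1 and w2, and the assumptions that the branch outputs are raw logits and that packet-level predictions have already been aggregated (e.g., averaged) over packet members are all hypothetical, not details fixed by this embodiment.

```python
import torch
import torch.nn.functional as F

def total_loss(image_score, class_label,
               first_pred, first_pseudo,
               packet_pred, second_pseudo,
               w1=1.0, w2=1.0):
    # Illustrative sketch only; names and weights are assumptions, not the patented implementation.
    # Loss of the second scoring network: image score vs. the image-level category label.
    loss_base = F.binary_cross_entropy(image_score.clamp(0, 1), class_label.float())
    # Loss of the first branch network: per-frame logits vs. first pseudo labels
    # (frames marked -1, e.g. ignored frames, do not contribute).
    loss_branch1 = F.cross_entropy(first_pred, first_pseudo, ignore_index=-1)
    # Loss of the second branch network: per-packet logits vs. second pseudo labels.
    loss_branch2 = F.cross_entropy(packet_pred, second_pseudo)
    loss_first = w1 * loss_branch1 + w2 * loss_branch2   # loss of the first scoring network
    return loss_base + loss_first                        # total loss by weighted summation
```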
For the structural diagram of the image processing network in the training phase, the structure can be further detailed on the basis of fig. 5, in combination with the exemplary content of each scoring network in the embodiment corresponding to fig. 4 and the related description in this embodiment; please refer to fig. 7, which shows a schematic structural diagram of another exemplary image processing network. The first pseudo label and the second pseudo label obtained from the initial scores are used in the first branch network and the second branch network of the first scoring network, respectively, to calculate the loss of the first branch network and the loss of the second branch network and thereby obtain the loss of the first scoring network; the loss of the second scoring network is obtained from the image score and the category label, and fusing the losses of the two scoring networks achieves supervised learning of the model.
S606, training the first scoring network and the second scoring network by using the total loss, and taking the trained first scoring network as a target detection model.
In the embodiment of the application, the parameters or structures of the first scoring network and the second scoring network can be adjusted by using the total loss, the candidate detection frames are then processed by the adjusted networks, and the training process is iterated continuously until the total loss converges; the finally trained first scoring network is taken as the target detection model.
In one embodiment, the first scoring network comprises one or more scoring units, each scoring unit comprising a first branch network and a second branch network, the one or more scoring units being connected in parallel; the input of each scoring unit is the image features of the image region corresponding to each candidate detection frame in the sample image, the fusion score output by the previous scoring unit is used for determining the first pseudo label and the second pseudo label used by the next scoring unit, and the fusion score is obtained by fusing the first prediction score output by the first branch network and the second prediction score output by the second branch network.
The above training process may be regarded as the training process when the first scoring network includes one scoring unit; the same processing flow is performed for each scoring unit, and the calculation of the total loss and the final structure of the target detection model are affected by the number of scoring units. When the first scoring network includes multiple scoring units, the scoring units are connected in parallel, and each scoring unit includes a first branch network and a second branch network and shares the same input data, namely the image features of the image region corresponding to each candidate detection frame in the sample image. Each scoring unit outputs one fusion score, obtained by fusing the first prediction score and the second prediction score produced by the two branch networks in that scoring unit; specifically, the fusion score may be obtained by averaging the two prediction scores in an average-voting manner, or by weighted summation of the two prediction scores, and the specific fusion manner is not limited here. According to the fusion score of the previous scoring unit, the first pseudo label and the second pseudo label needed by the next scoring unit can be determined; these two pseudo labels are used to perform supervised learning on the next scoring unit, and the loss corresponding to each scoring unit is determined in combination with its own prediction scores, thereby obtaining the total loss. It should be noted that each scoring unit can be regarded as a refinement classifier (or refinement branch) and the second scoring network can be regarded as a basic classifier: the supervision of the first refinement classifier depends on the output of the basic classifier, and the supervision of the ith refinement classifier depends on the output of the (i-1)th refinement classifier, where i is a positive integer greater than or equal to 2. That is, each refinement branch takes the result of the previous refinement branch as the input for the next iterative optimization.
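For illustration only, one scoring unit with its two parallel branch networks (each a fully-connected layer followed by softmax normalization, as fig. 8 also suggests) might be sketched in PyTorch as below; the class name, the average-vote fusion, and the background-class slot are assumptions of this sketch rather than limitations of the embodiment.

```python
import torch.nn as nn

class ScoringUnit(nn.Module):
    # Illustrative sketch only; structure and names are assumptions, not the patented implementation.
    # One refinement branch: two parallel FC + softmax heads over shared frame features.
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.branch1 = nn.Linear(feat_dim, num_classes + 1)  # +1 slot for the background class
        self.branch2 = nn.Linear(feat_dim, num_classes + 1)

    def forward(self, frame_feats):                    # frame_feats: (R, feat_dim)
        s1 = self.branch1(frame_feats).softmax(dim=1)  # first prediction score
        s2 = self.branch2(frame_feats).softmax(dim=1)  # second prediction score
        fused = (s1 + s2) / 2                          # average-vote fusion score
        return s1, s2, fused
```

The fusion score returned by unit i would then be used to derive the first and second pseudo labels that supervise unit i+1, matching the cascade described above.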
In conjunction with the first scoring network, the image processing network may be further refined as shown in fig. 8. Fig. 8 is a schematic structural diagram of another exemplary image processing network provided in this embodiment of the application, where the first scoring network includes K scoring units, K being a positive integer, each scoring unit includes two branch networks each composed of a fully-connected layer and a normalization layer, and the second scoring network and the other networks follow the content shown in fig. 7. The first pseudo label and the second pseudo label determined by the fusion score of the ith scoring unit among the K scoring units shown in fig. 8 are applied to the first branch network and the second branch network of the (i+1)th scoring unit, respectively, while the first pseudo label and the second pseudo label obtained from the initial score output by the second scoring network are applied to the first branch network and the second branch network of the first scoring unit, respectively. The loss corresponding to each branch network is calculated in combination with its prediction score, and the loss of each scoring unit, the loss of the first scoring network, and the total loss are obtained in turn.
This scheme is based on a multi-example learning framework and improves the step of generating pseudo labels for candidate detection frames as follows: according to the initial scores of individual candidate frames output in the multi-example learning stage, one or more likely real target frames are determined for each target category, and these target frames are then used as pseudo labels to supervise the learning of the model, so that the model acquires example-level discrimination capability. Conventionally, when generating pseudo labels, either the frame with the largest initial score is assumed to be the real target frame, or all frames within a certain range around the maximum-score frame are taken as one packet for another round of multi-example learning. Compared with these, the manner of this scheme avoids two problems: the former treats candidate frames with non-maximum scores as negative samples while ignoring that they may cover part of the real target region, which makes the model ambiguous during training; the latter supervises multiple candidate frames as one packet without applying example-level supervision, which makes the model inconsistent between training and testing, since testing requires the optimal detection frame to have the highest score and the scores of the other frames to be sufficiently low, a property not enforced during training.
In summary, the embodiment of the present application has at least the following advantages:
compared with the existing weakly supervised target detection technology, a dual-branch structure is used in the first scoring network: the first branch network encourages the score of the optimal candidate detection frame to stand out by screening the candidate detection frames that may contain real targets, while the second branch network improves the detection coverage of the target categories through packet-level label supervision, which to a certain extent prevents a target category from being correctly recognized only in partial regions; the initial scores of the candidate detection frames are used to generate different pseudo labels in two different manners for the supervised learning of the different branch networks of the model, and the two branch networks complement each other, that is, the second branch network resolves the ambiguity that the pseudo labels cause in the first branch network during optimization, and the first branch network resolves the inconsistency between the training stage and the test stage in the second branch network, so that the target detection model has example-level discrimination capability, local optima during optimization are avoided, the consistency between the training stage and the test stage is ensured, and the precision of the trained target detection model can be effectively improved.
Referring to fig. 9, fig. 9 is a flowchart illustrating another image processing method according to an embodiment of the present disclosure, where the method may be executed by a computer device (e.g., the image processing apparatus 101 shown in fig. 1). Wherein the method includes, but is not limited to, the steps of:
and S901, acquiring an image to be processed.
And S902, generating one or more candidate detection frames corresponding to the image to be processed.
In an implementation, specific implementation manners of steps S901 to S902 may refer to S201 to S202 in the corresponding embodiment of fig. 2, which are not described herein again. Steps S903 to S906 in this embodiment are further described with respect to step S203 in the corresponding embodiment of fig. 2, and the precondition required for executing the subsequent steps is: the target detection model includes a first branch network and a second branch network of the trained first scoring network.
S903, calling the first branch network to process the one or more candidate detection frames to obtain a third prediction score of each candidate detection frame in each target category, and calling the second branch network to process the one or more candidate detection frames to obtain a fourth prediction score of each candidate detection frame in each target category.
In an embodiment, the target detection model is obtained through the training of the first scoring network in the foregoing embodiments, and different prediction scores of the candidate detection frames can be obtained by processing the candidate detection frames with the first branch network and the second branch network of the trained first scoring network. Before the first branch network is called, features may be extracted, through the feature extraction network, from the image region corresponding to each candidate detection frame in the image to be processed, so as to obtain the image features corresponding to the candidate detection frame, which is not described in detail here. The first branch network specifically processes the image features corresponding to the candidate detection frames; optionally, the image features may pass through the fully-connected layer and the normalization layer of the first branch network in sequence to obtain the third prediction score of each candidate detection frame in each target category. Similar to the first branch network, the second branch network may also pass the image features corresponding to the candidate detection frames through its fully-connected layer and normalization layer in sequence to obtain the fourth prediction score of each candidate detection frame in each target category. Unlike the prediction scores obtained in the training phase, the prediction scores obtained in the application phase are more accurate scores refined by the individual branch networks of the first scoring network.
S904, performing fusion processing on the third prediction score and the fourth prediction score to obtain a target prediction score of each candidate detection frame in each target category.
In an embodiment, the third prediction score and the fourth prediction score of each candidate detection frame in each target category may be fused. Optionally, the fusion may adopt an average-voting manner, that is, the third prediction score and the fourth prediction score are averaged to obtain a fusion score, from which the target prediction score of each candidate detection frame in each target category is determined. Alternatively, a weighted-average manner may be adopted, in which different weight values are assigned to the prediction scores and a fusion score is obtained after weighted summation, from which the target prediction score of each candidate detection frame in each target category is then determined.
Further, the fusion score may be directly used as the target prediction score, or the target prediction score may be obtained by performing corresponding processing on the fusion score, which depends on the number of scoring units included in the trained first scoring network. Optionally, the first scoring network comprises one or more scoring units, each scoring unit comprising a first branch network and a second branch network, the one or more scoring units being connected in parallel; the input of each scoring unit is the image characteristics of the image area corresponding to each candidate detection frame in the image to be processed, and each scoring unit correspondingly outputs the target prediction score of each candidate detection frame in each target category.
If the trained first scoring network comprises one scoring unit, the fusion score obtained by fusing the corresponding third prediction score and fourth prediction score can be directly used as the target prediction score. Conversely, if the trained first scoring network includes at least two scoring units, the scores obtained by each scoring unit processing the candidate detection frame include a third prediction score and a fourth prediction score; the third prediction score and the fourth prediction score obtained by each scoring unit may be fused to obtain the fusion score of that scoring unit for the candidate detection frame, and the fusion scores of all scoring units may then be fused, by averaging or weighted averaging, to obtain the target prediction score of the candidate detection frame in each target category. Illustratively, as shown in fig. 8, with K scoring units, assuming that the fusion score of the ith scoring unit after the average vote of the candidate frames is $\alpha_i$, the target prediction score is

$$\text{score} = \frac{1}{K} \sum_{i=1}^{K} \alpha_i$$
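Under the same hypothetical ScoringUnit sketch given earlier (names and structure assumed, not prescribed by this embodiment), the averaging above could be written as:

```python
def target_prediction_score(units, frame_feats):
    # Illustrative sketch only: average the fusion scores alpha_i of the K trained scoring units.
    fused_scores = [unit(frame_feats)[2] for unit in units]  # alpha_i from each ScoringUnit
    return sum(fused_scores) / len(fused_scores)             # (R, C+1) target prediction scores
```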
S905, determining a target detection frame corresponding to the target category from the one or more candidate detection frames according to the target prediction score, and taking the target detection frame corresponding to the target category as a detection result of the image to be processed.
In an embodiment, since the one or more generated candidate detection frames of the image to be processed may not all contain a target of a target category, and the detection result usually only needs to give the target detection frame for a certain target category, most of the candidate detection frames are redundant. The number of candidate detection frames can be reduced by using the target prediction score and a preset rule, so that the target detection frame corresponding to the target category is determined and taken as the detection result.
Alternatively, a Non-Maximum Suppression (NMS) method may be used to obtain the final detection result. The specific processing flow of NMS is described by way of example below. Assume the target classification task has 3 categories (i.e., 3 target categories) and 1000 candidate detection frames are generated for the image to be processed; the finally output score matrix is then 1000 x 3, where each column corresponds to one target category and each row holds the target prediction scores of one candidate detection frame. Following the NMS algorithm, each column of the 1000 x 3 matrix is sorted from large to small; starting from the candidate detection frame with the largest target prediction score in a column, the overlapping degree with each of the following candidate detection frames is calculated, and the frames whose overlapping degree exceeds a given threshold and whose target prediction score is smaller are removed; the remaining candidate detection frames may correspond to multiple targets of that category in the image. These steps are repeated in turn on the remaining candidate detection frames until all candidate detection frames of the column have been traversed, and then over all columns of the 1000 x 3 matrix, whereby the detection result is obtained.
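A plain per-category NMS consistent with the description above can be sketched as follows, again reusing the hypothetical iou helper from the earlier sketch; the function name and threshold are assumptions, not values fixed by this embodiment.

```python
def nms_per_class(boxes, target_scores, overlap_thr=0.5):
    # Illustrative sketch only; names and threshold are assumptions, not the patented implementation.
    # boxes: (R, 4); target_scores: (R, C) target prediction scores.
    detections = []
    for c in range(target_scores.shape[1]):
        order = list(np.argsort(-target_scores[:, c]))  # sort the column from large to small
        while order:
            best = order.pop(0)                          # keep the highest-scoring frame
            detections.append((boxes[best], c, target_scores[best, c]))
            if order:
                overlaps = iou(boxes[best], boxes[order])
                # drop lower-scoring frames that overlap the kept frame too much
                order = [b for b, o in zip(order, overlaps) if o < overlap_thr]
    return detections
```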
In addition, other NMS-related algorithms may be used to obtain the detection result and find the best location (i.e., the target detection frame) for object detection, such as Soft-NMS, Softer-NMS, and IoU-Guided NMS, which improve upon conventional NMS by applying different screening strategies to the target prediction scores of the candidate detection frames. For example, Soft-NMS does not directly discard every frame whose overlap with a selected frame exceeds a certain threshold; instead, it lowers the score of such a frame according to a certain policy until the score falls below a threshold, so as not to remove too many correctly located frames in crowded scenes.
In summary, the embodiment of the present application has at least the following advantages:
the candidate detection frames of the image to be processed are processed by the first branch network and the second branch network included in the trained first scoring network, and since the accuracy of the model was maximized during training, the prediction scores of the two branch networks are accurate prediction results; fusing the prediction scores of the two branch networks describes the degree of association between the candidate detection frames and the target categories more comprehensively and accurately, so that the target detection frame corresponding to the target category can be determined more accurately according to the target prediction score.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an image processing apparatus according to an exemplary embodiment of the present application. The image processing apparatus may be a computer program (including program code) running on a computer device, for example, the image processing apparatus is an application software; the image processing device can be used for executing corresponding steps in the method provided by the embodiment of the application. As shown in fig. 10, the image processing apparatus 1000 may include: an obtaining module 1001, a generating module 1002, and a processing module 1003, wherein:
an obtaining module 1001 configured to obtain an image to be processed;
a generating module 1002, configured to generate one or more candidate detection frames corresponding to an image to be processed;
the processing module 1003 is configured to invoke a target detection model to process the one or more candidate detection frames to obtain a detection result of the image to be processed, where the detection result includes a target detection frame corresponding to a target category in the one or more candidate detection frames, and the target detection model is obtained by training a first branch network and a second branch network included in a first scoring network by using a second scoring network, the sample image, and a category label of the sample image.
In an embodiment, the image processing apparatus 1000 may further include: a training module 1004, wherein:
the obtaining module 1001 is further configured to obtain a training sample set, where the training sample set includes a plurality of sample images and a category label of each sample image;
the processing module 1003 is further configured to invoke a second scoring network to process one or more candidate detection frames corresponding to each sample image, so as to obtain an initial score of each candidate detection frame in each target category and an image score of each sample image in each target category;
the processing module 1003 is further configured to invoke a first scoring network to process one or more candidate detection frames corresponding to each sample image, so as to obtain a first prediction score and a second prediction score of each candidate detection frame in each target category;
the training module 1004 is configured to train the first scoring network and the second scoring network by using the first prediction score, the second prediction score, the initial score, the image score, and the category label, so as to obtain a target detection model.
In an embodiment, the training module 1004 is specifically configured to: acquiring a first pseudo label corresponding to each candidate detection frame by using the initial score of each candidate detection frame in each target category, and acquiring a second pseudo label corresponding to the multi-instance packet in which each candidate detection frame is positioned by using the initial score of each candidate detection frame in each target category; determining total loss corresponding to the first scoring network and the second scoring network by using the first prediction score, the second prediction score, the first pseudo label, the second pseudo label, the image score and the category label; and training the first scoring network and the second scoring network by using the total loss, and taking the trained first scoring network as a target detection model.
In an embodiment, the training module 1004 is specifically configured to: removing, from the one or more candidate detection frames corresponding to each sample image, the candidate detection frames whose initial score in each target category is smaller than a score threshold, to obtain N candidate detection frames, where N is a positive integer; selecting the candidate detection frame with the largest initial score from the N candidate detection frames, taking the candidate detection frame with the largest initial score as a reference detection frame, and determining the corresponding target category as a first pseudo label of the reference detection frame; determining the overlapping degree between the reference detection frame and the candidate detection frames other than the reference detection frame among the N candidate detection frames, and rejecting the candidate detection frames whose overlapping degree is greater than or equal to a first overlapping degree threshold; and selecting the candidate detection frame with the largest initial score from the candidate detection frames after the rejection, and re-executing the step of taking the candidate detection frame with the largest initial score as the reference detection frame, until the N candidate detection frames are traversed.
In an embodiment, the training module 1004 is specifically configured to: selecting a first candidate detection frame with the largest initial score of each target category from the one or more candidate detection frames corresponding to each sample image, and determining the overlapping degree between the first candidate detection frame and the candidate detection frames other than the first candidate detection frame in the one or more candidate detection frames; combining the first candidate detection frame and the candidate detection frames whose overlapping degree is greater than or equal to a second overlapping degree threshold into a multi-example packet, and determining the corresponding target category as a second pseudo label of the multi-example packet; removing the candidate detection frames in the multi-example packet from the one or more candidate detection frames to obtain P candidate detection frames, where P is a positive integer; and selecting a first candidate detection frame with the largest initial score from the candidate detection frames after the removal, and re-executing the step of determining the overlapping degree between the first candidate detection frame and the candidate detection frames other than the first candidate detection frame in the P candidate detection frames, until each of the one or more candidate detection frames is clustered into a multi-example packet.
In an embodiment, the training module 1004 is specifically configured to: determining a loss of the second scoring network using the image score and the category label; determining a loss of a first branch network of the first scored network using the first predictive score and the first pseudo-tag, and determining a loss of a second branch network of the first scored network using the second predictive score and the second pseudo-tag; and determining the total loss corresponding to the first scoring network and the second scoring network by using the loss of the second scoring network, the loss of the first branch network and the loss of the second branch network.
In an embodiment, the processing module 1003 is specifically configured to: calling a first branch network of a first scoring network to process one or more candidate detection frames corresponding to each sample image to obtain a first prediction score of each candidate detection frame in each target category; and calling the second branch network of the first scoring network to process the one or more candidate detection frames corresponding to each sample image to obtain a second prediction score of each candidate detection frame in each target category.
In an embodiment, the processing module 1003 is specifically configured to: acquiring, for the one or more candidate detection frames corresponding to each sample image, the image features of the image area corresponding to each candidate detection frame in each sample image, wherein the image features are obtained by performing feature extraction on the images in the image areas corresponding to the candidate detection frames by using a feature extraction network; determining an initial category score and an initial region score of each candidate detection frame based on the image features, wherein the initial category score is used for representing the probability that each candidate detection frame belongs to each target category, and the initial region score is used for representing the probability that each candidate detection frame contributes to each target category; performing fusion processing on the initial category score and the initial region score to obtain an initial score of each candidate detection frame in each target category; and determining the image score of each sample image in each target category based on the initial scores of the candidate detection frames corresponding to each sample image.
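As a purely illustrative sketch of such a two-stream basic classifier (the class name, layer choices, and softmax directions follow one common weakly supervised detection design and are assumptions, not limitations of the embodiment): the classification stream normalizes across categories, the detection stream normalizes across frames, their elementwise product gives the initial score, and summing over frames gives the image score.

```python
import torch.nn as nn

class SecondScoringNetwork(nn.Module):
    # Illustrative sketch only; structure and names are assumptions, not the patented implementation.
    # Two-stream basic classifier over per-frame image features.
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, num_classes)   # classification branch
        self.det_head = nn.Linear(feat_dim, num_classes)   # detection branch

    def forward(self, frame_feats):                            # frame_feats: (R, feat_dim)
        cls_score = self.cls_head(frame_feats).softmax(dim=1)  # initial category score: across categories
        det_score = self.det_head(frame_feats).softmax(dim=0)  # initial region score: across frames
        initial_score = cls_score * det_score                  # fused initial score per frame
        image_score = initial_score.sum(dim=0)                 # (C,) image score per target category
        return initial_score, image_score
```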
In one embodiment, the first scoring network comprises one or more scoring units, each scoring unit comprising a first branch network and a second branch network, the one or more scoring units being connected in parallel; the input of each scoring unit is the image features of the image region corresponding to each candidate detection frame in the sample image, the fusion score output by the previous scoring unit is used for determining the first pseudo label and the second pseudo label used by the next scoring unit, and the fusion score is obtained by fusing the first prediction score output by the first branch network and the second prediction score output by the second branch network.
In an embodiment, the target detection model includes a first branch network and a second branch network of the trained first scoring network, and the processing module 1003 is further configured to: calling the first branch network to process one or more candidate detection frames to obtain a third prediction score of each candidate detection frame in each target category; calling a second branch network to process one or more candidate detection frames to obtain a fourth prediction score of each candidate detection frame in each target category; performing fusion processing on the third prediction score and the fourth prediction score to obtain a target prediction score of each candidate detection frame in each target category; and determining a target detection frame corresponding to the target category from the one or more candidate detection frames according to the target prediction score, and taking the target detection frame corresponding to the target category as a detection result of the image to be processed.
It can be understood that the functions of the functional modules of the image processing apparatus described in the embodiment of the present application may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 1100 may comprise a standalone device (e.g., one or more of a server, a node, a terminal, etc.) or may comprise components (e.g., chips, software modules, or hardware modules, etc.) within a standalone device. The computer device 1100 may include at least one processor 1101 and a communication interface 1102, and further optionally, the computer device 1100 may also include at least one memory 1103 and a bus 1104. The processor 1101, the communication interface 1102 and the memory 1103 are connected by a bus 1104.
The processor 1101 is a module that performs arithmetic and/or logical operations, and may specifically be one or a combination of multiple processing modules such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Microprocessor (MPU), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Complex Programmable Logic Device (CPLD), a coprocessor (which assists the central processing unit in completing corresponding processing and applications), and a Micro Control Unit (MCU).
The communication interface 1102 may be used to provide information input or output to the at least one processor. And/or, the communication interface 1102 may be used to receive and/or transmit data externally, and may be a wired link interface such as an ethernet cable, and may also be a wireless link (Wi-Fi, bluetooth, general wireless transmission, vehicle-mounted short-range communication technology, other short-range wireless communication technology, and the like) interface.
The memory 1103 is used to provide a storage space in which data, such as an operating system and computer programs, may be stored. The memory 1103 may be one or a combination of Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or portable read-only memory (CD-ROM), among others.
At least one processor 1101 in the computer device 1100 is configured to call up a computer program stored in at least one memory 1103 for executing the aforementioned image processing method, such as the image processing method described in the embodiments shown in fig. 2, 4, 6 and 9.
In one possible implementation, the processor 1101 in the computer device 1100 is configured to invoke a computer program stored in the at least one memory 1103 for performing the following operations: acquiring an image to be processed through the communication interface 1102; generating one or more candidate detection frames corresponding to the image to be processed; and calling a target detection model to process the one or more candidate detection frames to obtain a detection result of the image to be processed, wherein the detection result comprises a target detection frame corresponding to a target category in the one or more candidate detection frames, and the target detection model is obtained by training a first branch network and a second branch network included in a first scoring network by using a second scoring network, a sample image and a category label of the sample image.
In an embodiment, the processor 1101 is further configured to: acquiring a training sample set, wherein the training sample set comprises a plurality of sample images and a category label of each sample image; calling a second scoring network to process one or more candidate detection frames corresponding to each sample image to obtain an initial score of each candidate detection frame in each target category and an image score of each sample image in each target category; calling a first scoring network to process one or more candidate detection frames corresponding to each sample image to obtain a first prediction score and a second prediction score of each candidate detection frame in each target category; and training the first scoring network and the second scoring network by using the first prediction score, the second prediction score, the initial score, the image score and the class label to obtain a target detection model.
In an embodiment, the processor 1101 trains the first scoring network and the second scoring network by using the first prediction score, the second prediction score, the initial score, the image score and the category label, and when obtaining the target detection model, is specifically configured to: acquiring a first pseudo label corresponding to each candidate detection frame by using the initial score of each candidate detection frame in each target category, and acquiring a second pseudo label corresponding to the multi-example packet in which each candidate detection frame is positioned by using the initial score of each candidate detection frame in each target category; determining total loss corresponding to the first scoring network and the second scoring network by using the first prediction score, the second prediction score, the first pseudo label, the second pseudo label, the image score and the category label; and training the first scoring network and the second scoring network by using the total loss, and taking the trained first scoring network as a target detection model.
In an embodiment, when the processor 1101 obtains the first pseudo label corresponding to each candidate detection frame by using the initial score of each candidate detection frame in each target category, the processor is specifically configured to: removing the candidate detection frames whose initial score is smaller than a score threshold from the one or more candidate detection frames corresponding to each sample image to obtain N candidate detection frames, where N is a positive integer; selecting the candidate detection frame with the largest initial score from the N candidate detection frames, taking the candidate detection frame with the largest initial score as a reference detection frame, and determining the corresponding target category as a first pseudo label of the reference detection frame; determining the overlapping degree between the reference detection frame and the candidate detection frames other than the reference detection frame among the N candidate detection frames, and rejecting the candidate detection frames whose overlapping degree is greater than or equal to a first overlapping degree threshold; and selecting the candidate detection frame with the largest initial score from the candidate detection frames after the rejection, and re-executing the step of taking the candidate detection frame with the largest initial score as the reference detection frame, until the N candidate detection frames are traversed.
In an embodiment, when the processor 1101 obtains, by using the initial score of each candidate detection frame in each target category, the second pseudo label corresponding to the multi-example packet in which each candidate detection frame is located, the processor is specifically configured to: selecting a first candidate detection frame with the largest initial score of each target category from the one or more candidate detection frames corresponding to each sample image, and determining the overlapping degree between the first candidate detection frame and the candidate detection frames other than the first candidate detection frame in the one or more candidate detection frames; combining the first candidate detection frame and the candidate detection frames whose overlapping degree is greater than or equal to a second overlapping degree threshold into a multi-example packet, and determining the corresponding target category as a second pseudo label of the multi-example packet; removing the candidate detection frames in the multi-example packet from the one or more candidate detection frames to obtain P candidate detection frames, where P is a positive integer; and selecting a first candidate detection frame with the largest initial score from the candidate detection frames after the removal, and re-executing the step of determining the overlapping degree between the first candidate detection frame and the candidate detection frames other than the first candidate detection frame in the P candidate detection frames, until each of the one or more candidate detection frames is clustered into a multi-example packet.
In an embodiment, when determining the total loss corresponding to the first scoring network and the second scoring network by using the first prediction score, the second prediction score, the first pseudo tag, the second pseudo tag, the image score, and the category tag, the processor 1101 is specifically configured to: determining a loss of the second scoring network using the image score and the category label; determining a loss of a first branch network of the first scored network using the first predicted score and the first pseudo tag, and determining a loss of a second branch network of the first scored network using the second predicted score and the second pseudo tag; and determining the total loss corresponding to the first scoring network and the second scoring network by using the loss of the second scoring network, the loss of the first branch network and the loss of the second branch network.
In an embodiment, when the processor 1101 invokes the first scoring network to process the one or more candidate detection frames corresponding to each sample image and obtains the first prediction score and the second prediction score of each candidate detection frame in each target category, the processor is specifically configured to: calling a first branch network of the first scoring network to process the one or more candidate detection frames corresponding to each sample image to obtain a first prediction score of each candidate detection frame in each target category; and calling the second branch network of the first scoring network to process the one or more candidate detection frames corresponding to each sample image to obtain a second prediction score of each candidate detection frame in each target category.
In an embodiment, the processor 1101 invokes the second scoring network to process one or more candidate detection frames corresponding to each sample image, and when the initial score of each target category of each candidate detection frame and the image score of each sample image in each target category are obtained, the processor is specifically configured to: acquiring image characteristics of image areas corresponding to the candidate detection frames in each sample image in one or more candidate detection frames corresponding to each sample image, wherein the image characteristics are obtained by utilizing a characteristic extraction network to extract the characteristics of the images in the image areas corresponding to the candidate detection frames; determining an initial category score and an initial region score of each candidate detection frame based on the image features, wherein the initial category score is used for representing the probability that each candidate detection frame belongs to each target category, and the initial region score is used for representing the probability that each candidate detection frame contributes to each target category; performing fusion processing on the initial category score and the initial region score to obtain an initial score of each candidate detection frame in each target category; and determining the image score of each sample image in each target category based on the initial score of each candidate detection frame corresponding to each sample image.
In one embodiment, the first scoring network comprises one or more scoring units, each scoring unit comprising a first branch network and a second branch network, the one or more scoring units being connected in parallel; the input of each scoring unit is the image features of the image region corresponding to each candidate detection frame in the sample image, the fusion score output by the previous scoring unit is used for determining the first pseudo label and the second pseudo label used by the next scoring unit, and the fusion score is obtained by fusing the first prediction score output by the first branch network and the second prediction score output by the second branch network.
In an embodiment, the target detection model includes a first branch network and a second branch network of the trained first scoring network, and when the processor 1101 calls the target detection model to process the one or more candidate detection frames to obtain the detection result of the image to be processed, the processor is specifically configured to: calling the first branch network to process the one or more candidate detection frames to obtain a third prediction score of each candidate detection frame in each target category; calling the second branch network to process the one or more candidate detection frames to obtain a fourth prediction score of each candidate detection frame in each target category; performing fusion processing on the third prediction score and the fourth prediction score to obtain a target prediction score of each candidate detection frame in each target category; and determining a target detection frame corresponding to the target category from the one or more candidate detection frames according to the target prediction score, and taking the target detection frame corresponding to the target category as the detection result of the image to be processed.
It should be understood that the computer device 1100 described in this embodiment of the present application can execute the image processing method described in the corresponding embodiments above, and can also implement the functions of the image processing apparatus 1000 described in the embodiment corresponding to fig. 10; details are not repeated here. Likewise, the beneficial effects of the same method are not described again.
It should further be noted that an exemplary embodiment of the present application also provides a storage medium storing a computer program of the foregoing image processing method. The computer program includes program instructions; when one or more processors load and execute the program instructions, the image processing method described in the embodiments above can be implemented, which is not repeated here, and neither are the beneficial effects of the same method. It will be understood that the program instructions may be deployed to be executed on one computer device, or on multiple computer devices capable of communicating with each other.
The computer-readable storage medium may be an internal storage unit of the image processing apparatus provided in any of the foregoing embodiments or of the computer device, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
In one aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by one aspect of the embodiments of the present application.
In one aspect of the present application, another computer program product is provided, which includes a computer program or computer instructions that, when executed by a processor, implement the steps of the image processing method provided by the embodiments of the present application.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the present application; equivalent variations and modifications made in accordance with the claims of the present application shall still fall within the scope of the present application.

Claims (14)

1. An image processing method, characterized in that the method comprises:
acquiring an image to be processed;
generating one or more candidate detection frames corresponding to the image to be processed;
and calling a target detection model to process the one or more candidate detection frames to obtain a detection result of the image to be processed, wherein the detection result comprises a target detection frame corresponding to a target type in the one or more candidate detection frames, and the target detection model is obtained by training a first branch network and a second branch network included in a first scoring network by using a second scoring network, a sample image and a type label of the sample image.
2. The method of claim 1, wherein the method further comprises:
acquiring a training sample set, wherein the training sample set comprises a plurality of sample images and a category label of each sample image;
calling a second scoring network to process one or more candidate detection frames corresponding to each sample image to obtain an initial score of each candidate detection frame in each target category and an image score of each sample image in each target category;
calling a first scoring network to process one or more candidate detection frames corresponding to each sample image to obtain a first prediction score and a second prediction score of each candidate detection frame in each target category;
and training the first scoring network and the second scoring network by using the first prediction score, the second prediction score, the initial score, the image score and the class label to obtain a target detection model.
3. The method of claim 2, wherein said training the first scoring network and the second scoring network using the first prediction score, the second prediction score, the initial score, the image score, and the category label to obtain an object detection model comprises:
acquiring a first pseudo label corresponding to each candidate detection frame by using the initial score of each candidate detection frame in each target category, and acquiring a second pseudo label corresponding to the multi-instance packet in which each candidate detection frame is located by using the initial score of each candidate detection frame in each target category;
determining a total loss corresponding to the first scoring network and the second scoring network by using the first prediction score, the second prediction score, the first pseudo label, the second pseudo label, the image score, and the category label;
and training the first scoring network and the second scoring network by using the total loss, and taking the trained first scoring network as a target detection model.
4. The method of claim 3, wherein said obtaining the first pseudo label corresponding to each candidate detection box by using the initial score of each candidate detection box in each target category comprises:
eliminating, from the one or more candidate detection frames corresponding to each sample image, candidate detection frames whose initial score in each target category is smaller than a score threshold, to obtain N candidate detection frames, wherein N is a positive integer;
selecting a candidate detection frame with the maximum initial score from the N candidate detection frames, taking the candidate detection frame with the maximum initial score as a reference detection frame, and determining the corresponding target category as a first pseudo label of the reference detection frame;
determining the overlapping degree between the reference detection frame and the candidate detection frames except the reference detection frame in the N candidate detection frames, and rejecting the candidate detection frames of which the overlapping degree is greater than or equal to a first overlapping degree threshold value;
and selecting the candidate detection frame with the maximum initial score from the remaining candidate detection frames, and re-executing the step of taking the candidate detection frame with the maximum initial score as the reference detection frame, until the N candidate detection frames are traversed.
5. The method of claim 3, wherein the obtaining the second pseudo label corresponding to the multi-instance packet in which each candidate detection box is located by using the initial score of each candidate detection box in each target category comprises:
selecting a first candidate detection frame with the maximum initial score of each target category from one or more candidate detection frames corresponding to each sample image, and determining the overlapping degree between the first candidate detection frame and candidate detection frames except the first candidate detection frame in the one or more candidate detection frames;
combining the first candidate detection frame and the candidate detection frames whose overlapping degree is larger than or equal to a second overlapping degree threshold into a multi-instance packet, and determining the corresponding target category as a second pseudo label of the multi-instance packet;
removing the candidate detection frames in the multi-instance packet from the one or more candidate detection frames to obtain P candidate detection frames, wherein P is a positive integer;
and selecting a new first candidate detection frame with the largest initial score from the remaining candidate detection frames, and re-executing the step of determining the overlapping degree between the first candidate detection frame and the candidate detection frames other than the first candidate detection frame in the P candidate detection frames, until the one or more candidate detection frames are all clustered into multi-instance packets.
6. The method of claim 3, wherein said determining a total loss corresponding to the first scoring network and the second scoring network using the first prediction score, the second prediction score, the first pseudo label, the second pseudo label, the image score, and the category label comprises:
determining a loss of the second scoring network using the image score and the category label;
determining a loss of a first branch network of the first scoring network by using the first prediction score and the first pseudo label, and determining a loss of a second branch network of the first scoring network by using the second prediction score and the second pseudo label;
and determining the total loss corresponding to the first scoring network and the second scoring network by using the loss of the second scoring network, the loss of the first branch network and the loss of the second branch network.
7. The method of any one of claims 2 to 6, wherein the invoking the first scoring network to process one or more candidate detection boxes corresponding to each sample image to obtain a first prediction score and a second prediction score of each candidate detection box in each target category comprises:
calling a first branch network of a first scoring network to process one or more candidate detection frames corresponding to each sample image to obtain a first prediction score of each candidate detection frame in each target category;
and calling a second branch network of the first scoring network to process one or more candidate detection frames corresponding to each sample image to obtain a second prediction score of each candidate detection frame in each target category.
8. The method of any one of claims 2 to 6, wherein the invoking of the second scoring network to process one or more candidate detection boxes corresponding to each sample image to obtain an initial score of each candidate detection box in each target category and an image score of each sample image in each target category comprises:
acquiring image characteristics of an image area corresponding to each candidate detection frame in one or more candidate detection frames corresponding to each sample image, wherein the image characteristics are obtained by performing characteristic extraction on images in the image area corresponding to the candidate detection frames by using a characteristic extraction network;
determining an initial category score and an initial region score of each candidate detection frame based on the image features, wherein the initial category score is used for representing the probability that each candidate detection frame belongs to each target category, and the initial region score is used for representing the probability that each candidate detection frame contributes to each target category;
performing fusion processing on the initial category score and the initial region score to obtain an initial score of each candidate detection frame in each target category;
and determining the image score of each sample image in each target category based on the initial score of each candidate detection frame corresponding to each sample image.
9. The method of any one of claims 1-6, wherein the first scoring network comprises one or more scoring units, each scoring unit comprising the first branch network and the second branch network, the one or more scoring units being connected in parallel; the input of each scoring unit is the image features of the image area corresponding to each candidate detection frame in the sample image, the fusion score output by the previous scoring unit is used to determine the first pseudo label and the second pseudo label used by the next scoring unit, and the fusion score is obtained by fusing the first prediction score output by the first branch network and the second prediction score output by the second branch network.
10. The method of claim 1, wherein the target detection model comprises a first branch network and a second branch network of a trained first scoring network, and the invoking the target detection model to process the one or more candidate detection boxes to obtain the detection result of the image to be processed comprises:
calling the first branch network to process the one or more candidate detection frames to obtain a third prediction score of each candidate detection frame in each target category;
calling the second branch network to process the one or more candidate detection frames to obtain a fourth prediction score of each candidate detection frame in each target category;
performing fusion processing on the third prediction score and the fourth prediction score to obtain a target prediction score of each candidate detection frame in each target category;
and determining a target detection frame corresponding to a target category from the one or more candidate detection frames according to the target prediction score, and taking the target detection frame corresponding to the target category as a detection result of the image to be processed.
11. An image processing apparatus characterized by comprising:
the acquisition module is used for acquiring an image to be processed;
the generating module is used for generating one or more candidate detection frames corresponding to the image to be processed;
and the processing module is used for calling a target detection model to process the one or more candidate detection frames to obtain a detection result of the image to be processed, wherein the detection result comprises a target detection frame corresponding to a target type in the one or more candidate detection frames, and the target detection model is obtained by training a first branch network and a second branch network included in a first scoring network by using a second scoring network, a sample image and a type label of the sample image.
12. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected with the memory and the network interface, wherein the network interface is used for providing a network communication function, the memory is used for storing program codes, and the processor is used for calling the program codes to execute the image processing method according to any one of claims 1 to 10.
13. A computer-readable storage medium, characterized in that it stores a computer program comprising program instructions which, when executed by a processor, perform the image processing method of any one of claims 1 to 10.
14. A computer program product, characterized in that it comprises a computer program or computer instructions which, when executed by a processor, implement the steps of the image processing method according to any one of claims 1 to 10.
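Claims 4 and 5 above describe greedy, NMS-like procedures for mining the first pseudo labels and the multi-instance packets. The sketch below is one possible reading of those steps; the (x1, y1, x2, y2) box format, the threshold values, and all function names are assumptions for illustration, not part of the claims.

import torch

def box_iou(boxes1, boxes2):
    # Pairwise overlap (IoU) for boxes given as (x1, y1, x2, y2).
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])
    rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area1[:, None] + area2[None, :] - inter)

def mine_first_pseudo_labels(boxes, initial_scores, score_thresh=0.1, iou_thresh=0.5):
    # Claim-4-style mining: drop low-scoring boxes, then repeatedly take the
    # top-scoring box as the reference, label it with its best category, and
    # discard boxes whose overlap with it reaches the first overlap threshold.
    best_score, best_category = initial_scores.max(dim=1)
    remaining = (best_score >= score_thresh).nonzero(as_tuple=True)[0]
    remaining = remaining[best_score[remaining].argsort(descending=True)]
    first_pseudo_labels = {}
    while remaining.numel() > 0:
        ref = int(remaining[0])
        first_pseudo_labels[ref] = int(best_category[ref])
        ious = box_iou(boxes[ref : ref + 1], boxes[remaining])[0]
        remaining = remaining[ious < iou_thresh]  # also removes ref (IoU = 1)
    return first_pseudo_labels

def mine_multi_instance_packets(boxes, initial_scores, iou_thresh=0.5):
    # Claim-5-style clustering: group the top-scoring box with every box that
    # overlaps it by at least the second overlap threshold into one packet,
    # give the packet that box's best category as its second pseudo label,
    # and repeat on the rest until every box belongs to a packet.
    best_score, best_category = initial_scores.max(dim=1)
    remaining = best_score.argsort(descending=True)
    packets = []
    while remaining.numel() > 0:
        top = int(remaining[0])
        ious = box_iou(boxes[top : top + 1], boxes[remaining])[0]
        in_packet = ious >= iou_thresh  # always includes the top box itself
        packets.append((remaining[in_packet], int(best_category[top])))
        remaining = remaining[~in_packet]
    return packets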
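Claim 6 combines three losses into a total loss. A sketch of one plausible combination is below; the particular loss functions (binary cross-entropy at the image level, negative log-likelihood over the pseudo labels) and the equal weights are assumptions, since the claims do not fix them.

import torch
import torch.nn.functional as F

def total_loss(image_score, image_label, first_scores, first_targets,
               second_scores, second_targets, weights=(1.0, 1.0, 1.0)):
    # Loss of the second scoring network: multi-label image classification
    # loss between the image score and the category label (multi-hot float).
    loss_second_net = F.binary_cross_entropy(image_score, image_label)
    # Loss of the first branch network, supervised by the first pseudo labels
    # (the branch outputs are probabilities here, hence NLL over their log).
    loss_branch1 = F.nll_loss(torch.log(first_scores.clamp_min(1e-8)), first_targets)
    # Loss of the second branch network, supervised by the packet-level
    # second pseudo labels.
    loss_branch2 = F.nll_loss(torch.log(second_scores.clamp_min(1e-8)), second_targets)
    w1, w2, w3 = weights
    return w1 * loss_second_net + w2 * loss_branch1 + w3 * loss_branch2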
CN202111198800.5A 2021-10-14 2021-10-14 Image processing method and device and related equipment Pending CN115984537A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111198800.5A CN115984537A (en) 2021-10-14 2021-10-14 Image processing method and device and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111198800.5A CN115984537A (en) 2021-10-14 2021-10-14 Image processing method and device and related equipment

Publications (1)

Publication Number Publication Date
CN115984537A true CN115984537A (en) 2023-04-18

Family

ID=85974595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111198800.5A Pending CN115984537A (en) 2021-10-14 2021-10-14 Image processing method and device and related equipment

Country Status (1)

Country Link
CN (1) CN115984537A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116978008A (en) * 2023-07-12 2023-10-31 睿尔曼智能科技(北京)有限公司 RGBD-fused semi-supervised target detection method and system
CN116978008B (en) * 2023-07-12 2024-04-26 睿尔曼智能科技(北京)有限公司 RGBD-fused semi-supervised target detection method and system
CN117764908A (en) * 2023-08-17 2024-03-26 上海感图网络科技有限公司 Method, device, equipment and storage medium for displaying defect information of NG image
CN117764908B (en) * 2023-08-17 2024-06-07 上海感图网络科技有限公司 Method, device, equipment and storage medium for displaying defect information of NG image
CN116977905A (en) * 2023-09-22 2023-10-31 杭州爱芯元智科技有限公司 Target tracking method, device, electronic equipment and storage medium
CN116977905B (en) * 2023-09-22 2024-01-30 杭州爱芯元智科技有限公司 Target tracking method, device, electronic equipment and storage medium
CN118334308A (en) * 2024-03-28 2024-07-12 上海商汤信息科技有限公司 Target detection method, device, terminal and computer readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40084968)