CN114419327B - Image detection method and training method and device of image detection model

Info

Publication number
CN114419327B
Authority
CN
China
Prior art keywords
image
features
feature
decoding
importance
Prior art date
Legal status
Active
Application number
CN202210057370.3A
Other languages
Chinese (zh)
Other versions
CN114419327A (en)
Inventor
伍天意
朱欤
郭国栋
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210057370.3A
Publication of CN114419327A
Application granted
Publication of CN114419327B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The disclosure provides an image detection method, a training method and apparatus for an image detection model, an electronic device, and a storage medium, relating to the field of artificial intelligence, and in particular to deep learning and computer vision. A specific implementation of the image detection method is as follows: extract features of an image to be processed to obtain a plurality of image features at a plurality of scales, where each image feature includes at least two pixel-level features; determine, for the plurality of pixel-level features included in the plurality of image features, the respective importance of each pixel-level feature; decode the plurality of image features according to the importance to obtain a plurality of decoding features respectively corresponding to the plurality of image features; and determine a detection result for the image to be processed according to the plurality of decoding features.

Description

Image detection method and training method and device of image detection model
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to deep learning and computer vision, and more specifically to an image detection method and to a training method and apparatus, electronic device, and storage medium for an image detection model.
Background
With the development of computer and network technology, deep learning is widely used in many fields. For example, deep learning can be used to perform semantic recognition on an image so as to complete tasks such as target detection and target segmentation.
Disclosure of Invention
In view of the above, the disclosure provides an image detection method that improves detection accuracy, together with a training method and apparatus for an image detection model, an electronic device, and a storage medium.
According to one aspect of the present disclosure, there is provided an image detection method, including: extracting features of an image to be processed to obtain a plurality of image features at a plurality of scales, wherein each image feature includes at least two pixel-level features; determining, for the plurality of pixel-level features included in the plurality of image features, the respective importance of the plurality of pixel-level features; decoding the plurality of image features according to the importance to obtain a plurality of decoding features respectively corresponding to the plurality of image features; and determining a detection result for the image to be processed according to the plurality of decoding features.
According to another aspect of the present disclosure, there is provided a training method for an image detection model, wherein the image detection model includes a feature extraction network, a prediction network, a decoding network, and a detection network. The training method includes: inputting a sample image into the feature extraction network to obtain a plurality of image features at a plurality of scales, wherein the sample image includes an actual detection result and each image feature includes at least two pixel-level features; inputting the plurality of pixel-level features included in the plurality of image features into the prediction network to obtain the respective importance of the plurality of pixel-level features; inputting the importance and the plurality of image features into the decoding network to obtain a plurality of decoding features respectively corresponding to the plurality of image features; inputting the plurality of decoding features into the detection network to obtain a predicted detection result for the sample image; and training the image detection model according to the predicted detection result and the actual detection result.
According to another aspect of the present disclosure, there is provided an image detection apparatus, including: a feature extraction module for extracting features of an image to be processed to obtain a plurality of image features at a plurality of scales, each image feature including at least two pixel-level features; an importance determination module for determining, for the plurality of pixel-level features included in the plurality of image features, the respective importance of the plurality of pixel-level features; a decoding module for decoding the plurality of image features according to the importance to obtain a plurality of decoding features respectively corresponding to the plurality of image features; and a detection determination module for determining a detection result for the image to be processed according to the plurality of decoding features.
According to another aspect of the present disclosure, there is provided a training apparatus for an image detection model, wherein the image detection model includes a feature extraction network, a prediction network, a decoding network, and a detection network. The training apparatus includes: a feature extraction module for inputting a sample image into the feature extraction network to obtain a plurality of image features, wherein the sample image includes an actual detection result and each image feature includes at least two pixel-level features; an importance determination module for inputting the plurality of pixel-level features included in the plurality of image features into the prediction network to obtain the respective importance of the plurality of pixel-level features; a decoding module for inputting the importance and the plurality of image features into the decoding network to obtain a plurality of decoding features respectively corresponding to the plurality of image features; a detection determination module for inputting the plurality of decoding features into the detection network to obtain a predicted detection result for the sample image; and a model training module for training the image detection model according to the predicted detection result and the actual detection result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image detection method and/or the training method of the image detection model provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the image detection method and/or the training method of the image detection model provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps in the image detection method and/or the training method of the image detection model provided by the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic view of an application scenario of an image detection method and a training method and apparatus of an image detection model according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of an image detection method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of determining the importance of each of a plurality of pixel level features according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an image detection method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of decoding each image feature according to an embodiment of the present disclosure;
FIG. 6 is a flow diagram of a training method of an image detection model according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of decoding a plurality of image features according to an embodiment of the present disclosure;
fig. 8 is a block diagram of a structure of an image detection apparatus according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of a training device of an image detection model according to an embodiment of the present disclosure; and
fig. 10 is a block diagram of an electronic device used to implement the image detection method and/or training method of the image detection model of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides an image detection method including a feature extraction stage, an importance determination stage, a decoding stage, and a detection determination stage. In the feature extraction stage, features of an image to be processed are extracted, and a plurality of image features under a plurality of scales are obtained, wherein each image feature comprises at least two pixel-level features. In the importance determination stage, respective importance of a plurality of pixel level features is determined for a plurality of pixel level features included in a plurality of image features. In the decoding stage, the plurality of image features are decoded according to the importance degree, and a plurality of decoding features corresponding to the plurality of image features are obtained. In the detection determination stage, a detection result for the image to be processed is determined based on the plurality of decoding features.
An application scenario of the method and apparatus provided by the present disclosure will be described below with reference to fig. 1.
Fig. 1 is a schematic view of an application scenario of an image detection method and a training method and apparatus of an image detection model according to an embodiment of the disclosure.
As shown in fig. 1, the application scenario 100 of this embodiment may include an electronic device 110, and the electronic device 110 may be various electronic devices with processing functions, including but not limited to a smart phone, a tablet computer, a laptop computer, a desktop computer, a server, and the like.
The electronic device 110 may detect the input image 120, for example, to obtain a detection result 130. Specifically, the electronic device 110 may perform target detection or target segmentation on the input image 120, so as to obtain a position of a target object in the image 120 and a class of the target object, and take the position and class of the target object as a detection result.
According to embodiments of the present disclosure, the position of the target object may be represented by the position of a bounding box of the target object, for example. The position of the bounding box may include, for example, a coordinate value of a center point of the bounding box in an image coordinate system, a width of the bounding box, and a height of the bounding box.
According to embodiments of the present disclosure, target detection locates target objects by regressing bounding boxes. Target segmentation may use a mask-based regional convolutional neural network (Mask R-CNN) as the backbone network to generate a segmentation map. In one embodiment, the electronic device 110 may employ the image detection model 150 to detect the image 120. For example, the image detection model 150 may be trained by the server 140. The electronic device 110 may be communicatively coupled to the server 140 over a network to send a model acquisition request to the server 140. Accordingly, the server 140 may send the trained image detection model 150 to the electronic device 110 in response to the request. The target detection model may include, for example, a region-based convolutional neural network (Region-CNN, R-CNN) model, a region-based fully convolutional network (R-FCN) model, a You Only Look Once (YOLO) series model, or the like.
In one embodiment, the electronic device 110 may also send the input image 120 to the server 140, and the server 140 may detect the image 120 based on the trained image detection model 150.
It should be noted that, the image detection method provided in the present disclosure may be executed by the electronic device 110 or may be executed by the server 140. Accordingly, the image detection apparatus provided by the present disclosure may be disposed in the electronic device 110 or may be disposed in the server 140. The training method of the image detection model provided by the present disclosure may be performed by the server 140. Accordingly, the training apparatus of the image detection model provided by the present disclosure may be provided in the server 140.
It should be understood that the number and type of electronic devices 110 and servers 140 in fig. 1 are merely illustrative. There may be any number and type of electronic devices 110 and servers 140 as desired for implementation.
The image detection method provided by the present disclosure will be described in detail below with reference to FIGS. 2 to 5, in conjunction with the application scenario of FIG. 1.
As shown in fig. 2, the image detection method 200 of this embodiment may include operations S210 to S240.
In operation S210, features of an image to be processed are extracted, and a plurality of image features at a plurality of scales are obtained.
According to the embodiment of the disclosure, the image to be processed can be input into the feature extraction network, and after processing by the feature extraction network, a plurality of image features respectively corresponding to a plurality of scales are obtained. The feature extraction network may include, for example, a top-down multi-scale convolutional network or a Transformer-based backbone network (Backbone), where the Transformer-based backbone network may include a Swin Transformer network, etc. For example, a plurality of image features at a plurality of scales may be obtained by downsampling the image to be processed step by step, each downsampling step yielding one image feature. It will be appreciated that each image feature includes at least two pixel-level features.
According to the embodiment of the disclosure, the above feature extraction network may also be adopted to extract features at multiple scales from the image to be processed, so as to obtain multiple initial features of successively decreasing size. The features with stronger semantics among the plurality of initial features are then injected into the features with weaker semantics, thereby obtaining a plurality of enhanced features as the plurality of image features.
Illustratively, as the sizes of the plurality of initial features decrease in turn, the semantics they express become increasingly strong. Setting the number of scales to n, the number of obtained initial features is n, and the number of obtained image features is also n. For example, for the i-th initial feature of the n initial features, the i-th initial feature and the (i+1)-th initial feature may be fused to obtain the i-th image feature of the n image features, where the value interval of i is [1, n-1] and n is an integer greater than 1. When fusing the i-th initial feature and the (i+1)-th initial feature, the (i+1)-th initial feature may first be upsampled so that the upsampled feature has the same size as the i-th initial feature. The upsampled feature may then be concatenated with the i-th initial feature using a concat() function or the like to yield the i-th image feature.
Illustratively, the n-th image feature of the n image features may be determined from the n-th initial feature of the n initial features. For example, the n-th initial feature may be taken directly as the n-th image feature. Alternatively, the n-th initial feature may be processed by a convolution layer, and the convolved feature taken as the n-th image feature.
Illustratively, let the image to be processed be I ∈ ℝ^{3×H×W}, where H and W are respectively the pixel height and pixel width of the image. The n initial features obtained by the feature extraction network can be represented by the feature set {C_i}, i = 1, …, n, where C is the channel radix (base channel number). Setting the number of channels of the obtained n image features to the same value D, the n image features can be represented by the set {F_i}, i = 1, …, n, where F_i ∈ ℝ^{D×H_i×W_i}.
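As an illustration only, the top-down fusion described above can be sketched roughly in PyTorch as follows; the function name fuse_top_down, the nearest-neighbor upsampling, and the 1×1 projection convolutions are assumptions for this sketch, not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_top_down(initial_feats, d=256):
    """Fuse the i-th initial feature with the upsampled (i+1)-th one,
    then project every result to the common channel number D."""
    n = len(initial_feats)
    image_feats = []
    for i in range(n - 1):  # i = 1 .. n-1 in the text's 1-based numbering
        up = F.interpolate(initial_feats[i + 1],
                           size=initial_feats[i].shape[-2:], mode="nearest")
        image_feats.append(torch.cat([initial_feats[i], up], dim=1))
    image_feats.append(initial_feats[-1])  # n-th image feature from the n-th initial feature
    # 1x1 projection convolutions, freshly created here only for illustration
    return [nn.Conv2d(f.shape[1], d, kernel_size=1)(f) for f in image_feats]

# toy input: four initial features with decreasing size and growing channel count
initial = [torch.randn(1, 64 * 2 ** i, 56 // 2 ** i, 56 // 2 ** i) for i in range(4)]
feats = fuse_top_down(initial)  # n image features, each with D = 256 channels
```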
In operation S220, respective importance degrees of a plurality of pixel level features are determined for a plurality of pixel level features included in a plurality of image features.
According to an embodiment of the present disclosure, each image feature of the plurality of image features comprises a plurality of pixel-level features, and the number of pixel-level features included in different image features may differ. A pixel-level feature is the feature of one pixel point in the image feature, and the number of pixel points included in an image feature is the product of the width and the height of the image feature. For example, the image feature F_i ∈ ℝ^{D×H_i×W_i} comprises L_i = H_i × W_i pixel-level features.
According to embodiments of the present disclosure, a prediction network may be employed to predict the importance of each of the plurality of pixel-level features. For example, each image feature may be input into the prediction network, which outputs a matrix with the same width and height as that image feature; each element of the matrix represents the importance of the pixel-level feature at the corresponding location in the image feature. Alternatively, each image feature may first undergo a dimension transformation that converts it into a one-dimensional feature; for example, the image feature F_i ∈ ℝ^{D×H_i×W_i} is converted into a one-dimensional feature of shape L_i × D, where L_i = H_i × W_i. The converted one-dimensional feature is then input into the prediction network, which outputs an importance vector comprising L_i elements, each element representing the importance of one pixel-level feature.
Illustratively, the prediction network may be, for example, a Multi-Layer Perceptron (MLP) network, a support vector machine (Support Vector Machine, SVM) network, etc., which is not limited by the present disclosure.
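For illustration, a minimal sketch of the second option above (flatten each image feature to L_i × D, then score each pixel-level feature) might look as follows, assuming a small two-layer MLP with a sigmoid output; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

D = 256
# assumed two-layer MLP; the text only names "MLP" (an SVM would be another option)
predictor = nn.Sequential(
    nn.Linear(D, D // 4), nn.GELU(),
    nn.Linear(D // 4, 1), nn.Sigmoid(),
)

f_i = torch.randn(1, D, 32, 32)              # one image feature, H_i = W_i = 32
one_dim = f_i.flatten(2).transpose(1, 2)     # dimension transformation -> (1, L_i, D)
importance = predictor(one_dim).squeeze(-1)  # (1, L_i): one score per pixel-level feature
```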
In operation S230, the plurality of image features are decoded according to the importance degree, and a plurality of decoded features corresponding to the plurality of image features are obtained.
According to embodiments of the present disclosure, each pixel-level feature may be weighted according to its importance in each image feature, thereby obtaining a weighted image feature. The weighted image features are then input into a decoding network, and the decoding feature for each image feature is output after processing by the decoding network. The decoding network may include, for example, the decoding network included in a Transformer model, and the disclosure is not limited thereto.
In operation S240, a detection result for the image to be processed is determined according to the plurality of decoding features.
According to the embodiment of the disclosure, after obtaining the plurality of decoding features, the embodiment may fuse the plurality of decoding features to first obtain a fusion feature for the image to be processed, and then determine the detection result according to the fusion feature.
For example, the embodiment may first convert the plurality of decoding features into a plurality of features of the same size. These same-size features are then concatenated and input into a detection network, and the detection result is obtained after processing by the detection network. For the semantic segmentation task, the detection network may be, for example, the output network included in a semantic segmentation network. The semantic segmentation network may be a fully convolutional network (FCN), a SegNet network, or a U-Net network, among others. For example, if the semantic segmentation network is an FCN, the detection network may comprise an FCN Head network. It will be appreciated that the detection network may employ different network types for different image detection tasks.
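A minimal sketch of this detection stage, assuming a semantic-segmentation setting with an FCN-style head, might look as follows; the head's layer shapes and the bilinear resizing are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def detect(decoded_feats, num_classes=21, d=256):
    """Resize every decoding feature to the largest scale, concatenate
    along channels, and apply a small FCN-style head (assumed form)."""
    target = decoded_feats[0].shape[-2:]  # the largest scale
    resized = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
               for f in decoded_feats]
    x = torch.cat(resized, dim=1)
    head = nn.Sequential(
        nn.Conv2d(x.shape[1], d, 3, padding=1), nn.ReLU(),
        nn.Conv2d(d, num_classes, 1),
    )
    return head(x)  # per-pixel class logits for the segmentation task

logits = detect([torch.randn(1, 256, 56 // 2 ** i, 56 // 2 ** i) for i in range(4)])
```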
In summary, in the image detection method of the embodiments of the present disclosure, before the multi-scale features are decoded, the importance of each pixel-level feature in the image features at the plurality of scales is determined, and the image features are then decoded according to that importance. This achieves the effect of selecting the important features from the image features at the plurality of scales, so that the decoding features can better express the image to be processed, which is beneficial to improving the accuracy of the obtained detection result.
Fig. 3 is a schematic diagram of determining the importance of each of a plurality of pixel level features according to an embodiment of the present disclosure.
According to the embodiment of the disclosure, when determining the importance of each of the plurality of pixel-level features, the plurality of image features may be subjected to dimension transformation to obtain a plurality of one-dimensional features, which respectively correspond to the plurality of image features. The plurality of one-dimensional features may then be spliced to obtain a cascade feature. Finally, the cascade feature is subjected to nonlinear processing to obtain a plurality of importance vectors respectively corresponding to the plurality of image features.
Each importance vector may include the respective importance of the plurality of pixel-level features included in its corresponding image feature.
In one embodiment, the dimension transformation of the plurality of image features may, for example, rearrange each image feature in pixel units to obtain a one-dimensional feature. For example, for the image feature F_i ∈ ℝ^{D×H_i×W_i}, the converted one-dimensional feature can be expressed as φ(F_i) ∈ ℝ^{L_i×D}, with L_i = H_i × W_i.
In one embodiment, a concat() function may be employed to splice the plurality of one-dimensional features. For example, setting the number of one-dimensional features to n, the cascade feature obtained by the dimension transformation of the image features and the stitching of the one-dimensional features can be represented by the following formula (1):

F = [φ(F_1), …, φ(F_j), …, φ(F_n)]    (1)

where φ() represents the dimension change operation, [] represents the stitching operation performed on the one-dimensional feature sequence, F ∈ ℝ^{L×D} with L the total number of pixel-level features over all scales, and j is any integer greater than 1 and less than n.
In one embodiment, the nonlinear processing of the cascade feature may be handled by the MLP network described previously, or the like. After processing the cascade feature, the MLP network yields a feature matrix. The elements included in each row of the feature matrix form an importance vector, so the feature matrix includes a plurality of importance vectors in total, respectively corresponding to the plurality of image features. Any one of the importance vectors includes: the importance of the plurality of pixel-level features in the cascade feature with respect to the image feature corresponding to that importance vector.
For example, after the feature matrix is obtained by the MLP network, the embodiment may normalize the feature matrix so that the value of each element of the feature matrix is greater than or equal to 0 and less than or equal to 1. For example, for the cascade feature F, the nonlinear processing can be expressed by the following equation (2):

P = Normalize(MLP(F))    (2)

where P is the obtained feature matrix. Setting the number of image features to n (n = 4 in this example), the size of the feature matrix is 4 × L. The element P_i^k in the i-th row and k-th column of the feature matrix represents the importance of the k-th pixel-level feature F^k of the cascade feature with respect to the i-th image feature of the n image features, with P_i^k ∈ [0, 1] and k any integer from 1 to L. P_i denotes the importance vector formed by the elements of the i-th row of the feature matrix; it includes the importance of each of the plurality of pixel-level features in the cascade feature F.
In one embodiment, as shown in FIG. 3, this embodiment 300 may employ a prediction network to determine the importance. The prediction network may include, for example, a transformation sub-network 311 and a prediction sub-network 312. The transformation sub-network 311 is used to perform the dimension transformation operation on the plurality of image features and the stitching operation on the resulting plurality of one-dimensional features. The prediction sub-network 312 is used to perform the nonlinear processing on the cascade feature.
Illustratively, set the n scales to include four. In this embodiment 300, when determining the importance, the image feature 301, image feature 302, image feature 303, and image feature 304, whose scales decrease in turn, may be input into the transformation sub-network 311. The transformation sub-network 311 may, for example, perform the dimension transformation operation on the four input image features in parallel, resulting in four one-dimensional features, and then splice the four one-dimensional features using a concat() function to obtain the cascade feature 305. The cascade feature 305 serves as the input of the prediction sub-network 312, and after processing by the prediction sub-network 312 the importance matrix 306 can be output. Alternatively, the prediction sub-network 312 may output a sequence of importance vectors, where the plurality of importance vectors may be ordered from large to small or from small to large according to the corresponding scale, which is not limited by this disclosure.
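The transformation sub-network and prediction sub-network pipeline just described can be sketched roughly as follows; the MLP shape and the sigmoid normalization are assumptions consistent with the text, not the patent's exact choice.

```python
import torch
import torch.nn as nn

D, n = 256, 4
feats = [torch.randn(1, D, 56 // 2 ** i, 56 // 2 ** i) for i in range(n)]  # F_1 .. F_4

# transformation sub-network: flatten each feature to (L_i, D), then splice -- formula (1)
one_dim = [f.flatten(2).transpose(1, 2) for f in feats]
cascade = torch.cat(one_dim, dim=1)              # (1, L, D), L = sum of the L_i

# prediction sub-network: MLP plus element-wise normalization to [0, 1]
mlp = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, n))
P = torch.sigmoid(mlp(cascade)).transpose(1, 2)  # (1, n, L): row i is the importance vector P_i
```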
In summary, when determining the importance, the embodiments of the present disclosure fuse the features of the plurality of scales and can determine the importance of each fused pixel-level feature with respect to each scale. During decoding, important features are selected from the overall feature formed by the image features of the plurality of scales to decode the multi-scale features, which improves the decoding precision and further improves the fusion capability of the multi-scale features and the image detection accuracy.
Fig. 4 is a schematic diagram of an image detection method according to an embodiment of the present disclosure.
According to the embodiment of the disclosure, after the feature matrix P (i.e. the importance of each pixel-level feature in the cascade feature with respect to the image feature at each scale) is obtained by the method described above, important features may be selected from the cascade feature for each scale, as a basis for decoding the image feature at that scale.
In an embodiment, when decoding the plurality of image features according to the importance, for each image feature of the plurality of image features, a target feature consisting of a target ratio of the cascade feature may first be determined for that image feature according to the importance. A decoding feature for the image feature is then determined from the target feature and the image feature.
The target feature may be the part of the cascade feature having higher importance with respect to each image feature, so that more important features can be provided for decoding, which is advantageous for improving the accuracy of the decoding features. The target ratio may be any value smaller than 1, such as 0.3 or 0.5; it is understood that the target ratio may be set according to actual requirements, which is not limited by this disclosure.
After the target feature is obtained, the target feature and each image feature may, for example, be spliced and then used as the input of a decoding network, which outputs the decoding feature for the image feature. Alternatively, the embodiment may take the target feature as the key feature and the value feature, take each image feature as the query feature, and employ a multi-head cross-attention mechanism to obtain the decoding feature. By decoding with the attention mechanism, the decoding process can pay more attention to the selected target feature, which can further improve the expressiveness and accuracy of the obtained decoding features.
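A rough sketch of this decoding step, assuming top-k selection of the target ratio ρ of the cascade feature and PyTorch's nn.MultiheadAttention as a stand-in for the multi-head cross-attention layer:

```python
import torch
import torch.nn as nn

D, rho = 256, 0.3
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)

def decode_one_scale(f_i, cascade, p_i):
    """f_i: (1, L_i, D) flattened image feature (query);
    cascade: (1, L, D) cascade feature; p_i: (1, L) importance vector."""
    k = max(1, int(rho * cascade.shape[1]))                 # rho*L target features
    idx = torch.topk(p_i, k, dim=1).indices                 # most important positions
    target = cascade.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))  # key = value
    out, _ = attn(query=f_i, key=target, value=target)
    return out                                              # decoding feature, (1, L_i, D)

y = decode_one_scale(torch.randn(1, 49, D), torch.randn(1, 4165, D), torch.rand(1, 4165))
```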
As shown in fig. 4, in the embodiment 400, when the image 401 to be processed is detected, for example, the initial feature 402, the initial feature 403, the initial feature 404, and the initial feature 405, ordered from large scale to small scale, may be extracted using the Backbone network described above. Features of adjacent scales are fused in a top-down structure, so that four image features can be obtained: the initial feature 405 is fused with the initial feature 404 to obtain the image feature F_3; the initial feature 404 is fused with the initial feature 403 to obtain the image feature F_2; and the initial feature 403 is fused with the initial feature 402 to obtain the image feature F_1. The fusion here may, for example, upsample the upper one of two initial features at adjacent scales to the same scale as the lower one, and splice the upsampled feature with the lower feature. The initial feature 405 may be used directly as the image feature F_4, or processed by convolution to obtain the image feature F_4. The resulting four image features may then be input into the prediction network 410 to obtain the importance vectors described above, for example the importance vector sequence {P_i}, i = 1, 2, 3, 4. The cascade feature F described above is also available via the prediction network 410.
Then, after the importance vector sequence {P_i} and the cascade feature F are obtained, the embodiment may input the importance vector P_4, the cascade feature F, and the image feature F_4 into the decoding network 421, which processes them to obtain a decoding feature. At the same time, the importance vector P_3, the cascade feature F, and the image feature F_3 are input into the decoding network 422 and processed to obtain a decoding feature; the importance vector P_2, the cascade feature F, and the image feature F_2 are input into the decoding network 423 and processed to obtain a decoding feature; and the importance vector P_1, the cascade feature F, and the image feature F_1 are input into the decoding network 424 and processed to obtain a decoding feature. Finally, the four obtained decoding features may be spliced and input into the detection network 430, and the detection result 406 is obtained after processing by the detection network.
Fig. 5 is a schematic diagram of decoding each image feature according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, as shown in fig. 5, the decoding network employed when decoding each image feature may be constructed based on a multi-head cross-attention mechanism. The decoding network may be similar to the decoder architecture in the Transformer architecture and may include a plurality of decoding units. Each decoding unit 510 includes an attention layer and a feed-forward layer, and both the attention layer and the feed-forward layer may be provided with normalization layers. The attention layer employs the multi-head cross-attention mechanism to process the input image features.
It should be noted that the first decoding unit in the decoding network may also be provided with a filtering layer. For each image feature F_i, the filtering layer screens the input cascade feature F according to the corresponding importance vector P_i 501, thereby obtaining ρL target features for the image feature F_i, where ρ is the target ratio; the target feature may be denoted F_s ∈ ℝ^{ρL×D}. Subsequently, the image feature F_i may be taken as the query feature and the ρL target features as the value features and key features, and a multi-head cross-attention algorithm may be employed to obtain the output of the attention layer.

Illustratively, for the image feature F_i, the output y_i of the attention layer can be expressed by the following formula (3):

y_i = A · V    (3)

where:

A = Softmax( Q K^T / √D )    (4)

Here, in the attention layer of the first decoding unit, the query Q = φ(F_i), and the key K and the value V are both the target feature F_s ∈ ℝ^{ρL×D}; D is the number of channels. The single-head form is shown; in the multi-head case the computation is applied per head and the results are concatenated.
It will be appreciated that the above expression for the output y_i of the attention layer is merely an example to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
Based on the image detection method provided by the disclosure, the disclosure also provides a training method of the image detection model, and the image detection model obtained by training by the training method can be used for executing the image detection method provided by the disclosure.
Fig. 6 is a flow diagram of a training method of an image detection model according to an embodiment of the present disclosure.
As shown in fig. 6, the training method 600 of the image detection model of this embodiment may include operations S610 to S650. The image detection model comprises a feature extraction network, a prediction network, a decoding network and a detection network.
In operation S610, a sample image is input into a feature extraction network to obtain a plurality of image features at a plurality of scales.
According to an embodiment of the present disclosure, each image feature includes at least two pixel-level features, and the implementation of operation S610 is similar to operation S210 described above, except that, compared with the image to be detected, the sample image includes an actual detection result. The feature extraction network is similar to the feature extraction network described above and will not be described again here. For example, for a target detection scenario, the actual detection result may include the position of a target in the image, which may be represented by the position of an actual bounding box surrounding the target. The actual bounding box may be obtained, for example, by manual labeling.
In operation S620, a plurality of pixel-level features included in the plurality of image features are input into a prediction network, resulting in respective importance levels of the plurality of pixel-level features. It is understood that this operation S620 is similar to the operation S220 described above, and the prediction network is similar to the prediction network described above, and will not be described again.
In operation S630, the importance level and the plurality of image features are input into a decoding network to obtain a plurality of decoding features corresponding to the plurality of image features, respectively. It is understood that the operation S630 is similar to the operation S230 described above, and the decoding network is similar to the decoding network described above, and will not be repeated here.
In operation S640, a plurality of decoding features are input into a detection network to obtain a prediction detection result for a sample image. It is to be understood that the operation S640 is similar to the operation S240 described above, and the detection network is similar to the detection network described above, and will not be repeated here.
In operation S650, the image detection model is trained according to the predicted detection result and the actual detection result.
The embodiment can determine the loss of the image detection model based on the difference between the predicted detection result and the actual detection result. A back-propagation algorithm is then employed to adjust the network parameters in the image detection model, thereby realizing the training of the image detection model. The loss of the image detection model may be determined, for example, using a cross-entropy loss function or the like, which is not limited by the present disclosure.
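For illustration, one training step under these assumptions (cross-entropy as the loss, a generic optimizer, a segmentation-style output) might be sketched as:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample_image, actual_result):
    """One step: forward, loss on predicted vs. actual result, backprop, update."""
    optimizer.zero_grad()
    predicted = model(sample_image)                    # (B, num_classes, H, W) logits
    loss = F.cross_entropy(predicted, actual_result)   # actual_result: (B, H, W) class labels
    loss.backward()                                    # back-propagation algorithm
    optimizer.step()                                   # adjust the network parameters
    return loss.item()
```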
In an embodiment, the prediction network may include a transformation sub-network and a prediction sub-network, as described above. In a specific implementation of operation S620, the plurality of image features may first be input into the transformation sub-network for dimension transformation, and the plurality of one-dimensional features obtained by the dimension transformation are spliced to obtain the cascade feature. The cascade feature is then input into the prediction sub-network for nonlinear processing to obtain an importance matrix. The importance matrix is composed of a plurality of importance vectors respectively corresponding to the plurality of image features, and each of the plurality of importance vectors includes: the importance of the plurality of pixel-level features in the cascade feature with respect to the image feature corresponding to that importance vector. The importance vectors are similar to those described above and are not described again here.
Fig. 7 is a schematic diagram of decoding a plurality of image features according to an embodiment of the present disclosure.
As shown in fig. 7, in an embodiment 700, when training the image detection model, the decoding network may be provided with a decision sub-network 721 and a decoding sub-network 722. For each image feature, the decision sub-network 721 is used to generate binary decisions that decide which pixel-level features in the cascade feature are selected. In this way, an auxiliary effect can be provided, to a certain extent, for the training of the prediction network, improving the accuracy of the importance vectors obtained at test time.
Illustratively, after the plurality of image features (e.g., image feature 701 to image feature 704) are obtained, the embodiment may input the plurality of image features into the transformation sub-network 711 included in the prediction network, resulting in the cascade feature 705. The cascade feature is input into the prediction sub-network 712 described above to obtain the importance matrix, which may be represented by the importance vector sequence {P_i} consisting of a plurality of importance vectors. Setting the number of the plurality of image features to n, then i = 1, 2, 3, 4 when n is 4.
According to embodiments of the present disclosure, after the importance matrix is obtained, it may be input into the decision sub-network 721 and processed to obtain a plurality of decision features respectively corresponding to the plurality of importance vectors; the plurality of decision features may form a decision feature sequence {Q_i}. The decision sub-network 721 may sample from the cascade feature 705 by a Gumbel-Softmax sampling method or the like to obtain a sampling result, and the plurality of decision features may be used to represent the sampling results obtained by sampling the cascade feature 705 according to the plurality of importance vectors. For example, for a certain image feature, if the sampling result is that a certain pixel-level feature in the cascade feature 705 is selected, then in the decision feature for that image feature, the element at the position corresponding to the position of that pixel-level feature in the cascade feature 705 takes the value 1; if the pixel-level feature is not selected, the element at the corresponding position takes the value 0. That is, the decision feature for each image feature may indicate, for that image feature, whether each pixel-level feature included in the cascade feature is selected. It will be appreciated that the above sampling method is merely an example to facilitate understanding of the present disclosure; sampling may also be performed, for example, directly from an importance vector. Each decision feature Q_i includes L elements.
By determining the decision features with the Gumbel-Softmax sampling method, the decision sub-network 721 introduces the Gumbel distribution into the sampling process, thereby introducing randomness into sampling, so that every pixel-level feature in the cascade feature has a certain probability of being selected.
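A sketch of such decision sampling, using torch.nn.functional.gumbel_softmax with a straight-through (hard) sample; building two-class logits from the importance values is an assumed construction, not necessarily the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def sample_decisions(P, tau=1.0, eps=1e-6):
    """P: (n, L) importance matrix with entries in [0, 1].
    Returns Q: (n, L) binary decision features (1 = pixel-level feature selected).
    Two-class Gumbel-Softmax with a hard (straight-through) sample."""
    logits = torch.stack([(P + eps).log(), (1 - P + eps).log()], dim=-1)  # (n, L, 2)
    sample = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
    return sample[..., 0]  # channel 0 = "selected"

Q = sample_decisions(torch.rand(4, 4165))  # decision feature sequence {Q_i}, each of length L
```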
According to embodiments of the present disclosure, the decision sub-network 721 may also, for example, set a threshold for the importance: pixel-level features whose importance is greater than or equal to the threshold are selected, and pixel-level features whose importance is less than the threshold are not selected.
In accordance with an embodiment of the present disclosure, after the decision feature sequence {Q_i} is obtained, the embodiment may, for each image feature, employ the decoding sub-network to obtain the decoding feature corresponding to that image feature according to the corresponding decision feature, the image feature, and the cascade feature.
For example, for any image feature F_i, this embodiment may first weight the features of each channel of the cascade feature by the corresponding decision feature Q_i to obtain a weighted feature. The weighted feature and the image feature are then input into the decoding sub-network 722, which outputs the decoding feature corresponding to the image feature.
According to embodiments of the present disclosure, for each image feature, the embodiments may determine a mask feature for that image feature based on its decision feature. The mask feature, the image feature, and the cascade feature are then input into the decoding sub-network 722, which processes the features using the multi-head cross-attention mechanism to obtain the decoding feature corresponding to the image feature.
For example, the embodiment may replicate the decision feature into several copies to obtain the mask feature. The number of copies may be the same as the number of channels of the cascade feature F; for example, the number of copies may be D.
For example, when the multi-head cross-attention mechanism is employed to obtain the decoding feature corresponding to each image feature, the embodiment may also obtain an initial score feature by taking each image feature as the query feature and the cascade feature as the key feature. Specifically, each image feature is taken as Q in formula (4) described above, the cascade feature F is taken as K, and the A calculated by formula (4) is taken as the initial score feature.
According to embodiments of the present disclosure, while obtaining the initial score feature, the decision feature for each image feature may be replicated into N_i copies, and the N_i copies of the decision feature are arranged to obtain the mask feature M_i. The embodiment may adjust the initial score feature based on the mask feature to obtain an adjusted score feature. Taking the adjusted score feature as A in formula (3) and the cascade feature F as the value V, the feature calculated via formula (3) is taken as the output of the attention layer. The output is processed by the feed-forward layer described above and then passed to the subsequent decoding units, and finally the decoding feature is output by the last decoding unit in the decoding network.
For example, the embodiment may weight the initial scoring feature with the mask feature as the weight of the initial scoring feature, thereby obtaining an adjusted scoring feature.
Illustratively, this embodiment may also employ the following equation (6) to adjust the initial score feature based on the mask feature:

Ã_k = M_ik · exp(A_k) / Σ_{k′=1}^{L} M_ik′ · exp(A_k′)    (6)

where Ã is the adjusted initial score feature, Ã_k is the value of the element of Ã corresponding to the k-th pixel-level feature, M_ik is the value of the element corresponding to the k-th pixel-level feature in the mask feature, and A_k is the value of the element corresponding to the k-th pixel-level feature in the initial score feature A.
By adjusting the initial score feature using equation (6), the resulting adjusted score feature can be made a normalized attention matrix.
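A sketch of this equation-(6)-style renormalization; the max-subtraction is added only for numerical stability and does not change the result.

```python
import torch

def masked_normalize(A, M, eps=1e-6):
    """A: (L_i, L) initial score feature (raw attention scores);
    M: (L,) mask built from the decision feature (1 = selected, 0 = not).
    Unselected pixel-level features receive exactly zero attention weight,
    and each row of the result sums to 1."""
    A = A - A.max(dim=-1, keepdim=True).values  # numerical stability only
    weights = M * A.exp()
    return weights / (weights.sum(dim=-1, keepdim=True) + eps)

A_tilde = masked_normalize(torch.randn(49, 4165), (torch.rand(4165) > 0.7).float())
```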
By providing the mask feature for the decoding process according to the decision feature, the embodiments of the present disclosure can solve the problem that the image detection model cannot be trained efficiently in parallel because the pixel-level features selected for different image features are not aligned. Setting the mask feature therefore improves the training efficiency of the image detection model.
According to the embodiment of the disclosure, when training the image detection model, in addition to the difference between the predicted detection result and the actual detection result, the loss incurred by selecting pixel-level features according to the importance vectors obtained by the prediction network can also be considered. This improves the accuracy of the prediction network in the trained image detection model and thus the accuracy of the detection result.
For example, when determining the loss incurred by selecting pixel-level features according to the importance vectors obtained by the prediction network, the average value of the elements in each of the plurality of importance vectors may be determined, so as to obtain a plurality of average values respectively corresponding to the plurality of importance vectors. The loss is then determined from the difference between the plurality of average values and the target ratio described previously, for example using an L2 loss function or an L1 loss function. Thus, when training the image detection model, a first loss may be determined according to the difference between the predicted detection result and the actual detection result, and a second loss determined according to the difference between the plurality of average values and the target ratio. Finally, the image detection model is trained according to the first loss and the second loss; for example, a weighted sum of the first loss and the second loss may be taken as the total loss of the image detection model.
The difference between the plurality of average values and the target ratio may be represented, for example, by the average of the plurality of differences between the plurality of average values and the target ratio.
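For illustration, the second loss and the weighted total loss might be sketched as follows; the loss weight lam is an assumed hyperparameter.

```python
import torch

def ratio_loss(P, rho=0.3):
    """Second loss: L2 gap between each importance vector's mean and the target ratio."""
    return ((P.mean(dim=1) - rho) ** 2).mean()

def total_loss(first_loss, P, rho=0.3, lam=2.0):
    """Weighted sum of the first and second losses; lam is an assumed weight."""
    return first_loss + lam * ratio_loss(P, rho)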
Based on the image detection method provided by the disclosure, the disclosure also provides an image detection device. The device will be described in detail below in connection with fig. 8.
Fig. 8 is a block diagram of the structure of an image detection apparatus according to an embodiment of the present disclosure.
As shown in fig. 8, the image detection apparatus 800 of this embodiment may include a feature extraction module 810, an importance determination module 820, a decoding module 830, and a detection determination module 840.
The feature extraction module 810 is configured to extract features of an image to be processed, and obtain a plurality of image features at a plurality of scales, where each image feature includes at least two pixel-level features. In an embodiment, the feature extraction module 810 may be configured to perform the operation S210 described above, which is not described herein.
The importance determination module 820 is configured to determine, for a plurality of pixel-level features included in a plurality of image features, respective importance of the plurality of pixel-level features. In an embodiment, the importance determining module 820 may be used to perform the operation S220 described above, which is not described herein.
The decoding module 830 is configured to decode the plurality of image features according to the importance level, so as to obtain a plurality of decoded features corresponding to the plurality of image features respectively. In an embodiment, the decoding module 830 may be configured to perform the operation S230 described above, which is not described herein.
The detection determining module 840 is configured to determine a detection result for the image to be processed according to the plurality of decoding features. In an embodiment, the detection determining module 840 may be configured to perform the operation S240 described above, which is not described herein.
According to an embodiment of the present disclosure, the importance determination module 820 may include a transformation sub-module, a stitching sub-module, and a processing sub-module. The transformation sub-module is used for performing dimension transformation on the plurality of image features to obtain a plurality of one-dimensional features respectively corresponding to the plurality of image features. The stitching sub-module is used for splicing the plurality of one-dimensional features to obtain a cascade feature. The processing sub-module is used for performing nonlinear processing on the cascade feature to obtain a plurality of importance vectors respectively corresponding to the plurality of image features, wherein each of the plurality of importance vectors includes: the importance of the plurality of pixel-level features in the cascade feature with respect to the image feature corresponding to that importance vector.
According to an embodiment of the present disclosure, the decoding module may include a first feature determination sub-module and a decoding sub-module. The first feature determination sub-module is used for determining, for each image feature, a target feature consisting of a target ratio of the cascade feature according to the importance. The decoding sub-module is used for determining a decoding feature for each image feature according to the target feature and the image feature.
According to an embodiment of the present disclosure, the decoding sub-module is configured to obtain the decoding feature for each image feature using a multi-head cross-attention mechanism, taking the target feature as the key feature and the value feature, and each image feature as the query feature.
According to an embodiment of the present disclosure, the detection determination module 840 may include a first fusion sub-module and a detection sub-module. The first fusion sub-module is used for fusing the plurality of decoding features to obtain a fusion feature for the image to be processed. The detection sub-module is used for determining the detection result according to the fusion feature.
According to an embodiment of the present disclosure, the above feature extraction module 810 may include a feature extraction sub-module, a second fusion sub-module, and a second feature determination sub-module, where the number of the plurality of image features is n and n is an integer greater than 1. The feature extraction sub-module is used for extracting features at n scales from the image to be processed to obtain n initial features of successively decreasing size. The second fusion sub-module is used for fusing, for the i-th initial feature of the n initial features, the i-th initial feature and the (i+1)-th initial feature to obtain the i-th image feature of the plurality of image features, where the value interval of i is [1, n-1]. The second feature determination sub-module is used for determining the n-th image feature of the plurality of image features according to the n-th initial feature of the n initial features.
Based on the training method of the image detection model provided by the disclosure, the disclosure also provides a training device of the image detection model. The device will be described in detail below in connection with fig. 9.
Fig. 9 is a block diagram of a training apparatus of an image detection model according to an embodiment of the present disclosure.
As shown in fig. 9, the training apparatus 900 of the image detection model of this embodiment may include a feature extraction module 910, an importance determination module 920, a decoding module 930, a detection determination module 940, and a model training module 950. The image detection model comprises a feature extraction network, a prediction network, a decoding network and a detection network.
The feature extraction module 910 is configured to input a sample image into the feature extraction network to obtain a plurality of image features at a plurality of scales, wherein the sample image comprises an actual detection result and each image feature comprises at least two pixel-level features. In an embodiment, the feature extraction module 910 may be used to perform the operation S610 described above, which will not be repeated here.

The importance determination module 920 is configured to input the plurality of pixel-level features included in the plurality of image features into the prediction network to obtain the importance degree of each of the plurality of pixel-level features. In an embodiment, the importance determination module 920 may be used to perform the operation S620 described above, which will not be repeated here.

The decoding module 930 is configured to input the importance degrees and the plurality of image features into the decoding network to obtain a plurality of decoding features respectively corresponding to the plurality of image features. In an embodiment, the decoding module 930 may be used to perform the operation S630 described above, which will not be repeated here.

The detection determination module 940 is configured to input the plurality of decoding features into the detection network to obtain a predicted detection result for the sample image. In an embodiment, the detection determination module 940 may be used to perform the operation S640 described above, which will not be repeated here.

The model training module 950 is configured to train the image detection model according to the predicted detection result and the actual detection result. In an embodiment, the model training module 950 may be used to perform the operation S650 described above, which will not be repeated here.
According to an embodiment of the present disclosure, the prediction network includes a transformation sub-network and a prediction sub-network, and the importance determination module 920 may include a transformation sub-module and a processing sub-module. The transformation sub-module is used for inputting the plurality of image features into the transformation sub-network for dimension transformation and splicing the resulting one-dimensional features to obtain a cascade feature. The processing sub-module is used for inputting the cascade feature into the prediction sub-network for nonlinear processing to obtain an importance matrix. The importance matrix is composed of a plurality of importance vectors respectively corresponding to the plurality of image features, and each importance vector includes the importance degrees of the pixel-level features in the cascade feature with respect to the image feature corresponding to that importance vector.
According to an embodiment of the present disclosure, the decoding network includes a decision sub-network and a decoding sub-network, and the decoding module 930 may include a decision determination sub-module and a decoding sub-module. The decision determination sub-module is used for inputting the importance matrix into the decision sub-network to obtain a plurality of decision features respectively corresponding to the importance vectors, as the decision features for the respective image features. The decoding sub-module is used for obtaining, for each image feature, the decoding feature corresponding to that image feature by means of the decoding sub-network, according to the decision feature for that image feature, the image feature itself, and the cascade feature. The decision feature for each image feature indicates whether each pixel-level feature included in the cascade feature is selected.
According to an embodiment of the present disclosure, the decoding sub-module may include a mask determination unit and a decoding unit. The mask determination unit is used for determining a mask feature for each image feature according to the decision feature for that image feature. The decoding unit is used for inputting the mask feature, each image feature, and the cascade feature into the decoding sub-network and obtaining the decoding feature corresponding to each image feature by means of a multi-head cross-attention mechanism.
According to an embodiment of the present disclosure, the decoding unit may include a score determination subunit, an adjustment subunit, and a decoding subunit. The score determination subunit is configured to obtain an initial score feature by using each image feature as the query feature and the cascade feature as the key feature. The adjustment subunit is used for adjusting the initial score feature according to the mask feature to obtain an adjusted score feature. The decoding subunit is used for obtaining the decoding feature corresponding to each image feature from the adjusted score feature, with the cascade feature serving as the value feature.
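The score adjustment can be illustrated with the following single-head sketch (the multi-head split is omitted for brevity); the large negative additive mask is an assumption consistent with suppressing unselected cascade positions before the softmax:

```python
# A single-head sketch of masked cross-attention at training time.
import torch

def masked_cross_attention(q: torch.Tensor,        # (B, N, C) image feature
                           cascade: torch.Tensor,  # (B, L, C) key and value
                           decision: torch.Tensor  # (B, L), 1 = selected
                           ) -> torch.Tensor:
    scale = q.size(-1) ** -0.5
    scores = q @ cascade.transpose(1, 2) * scale   # initial score feature
    mask = (1.0 - decision).unsqueeze(1) * -1e9    # (B, 1, L) additive mask
    scores = scores + mask                         # adjusted score feature
    return scores.softmax(dim=-1) @ cascade        # decoded feature (B, N, C)
```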
According to embodiments of the present disclosure, the model training module 950 may include a first loss determination sub-module, an average determination sub-module, a second loss determination sub-module, and a training sub-module. The first loss determination sub-module is used for determining a first loss of the image detection model according to the difference between the predicted detection result and the actual detection result. The average determination sub-module is used for determining, for each of the plurality of importance vectors, the average value of the elements in that vector, thereby obtaining a plurality of average values. The second loss determination sub-module is used for determining a second loss of the image detection model according to the differences between the plurality of average values and a target proportion. The training sub-module is used for training the image detection model according to the first loss and the second loss.
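A hedged sketch of this two-part objective follows; cross-entropy for the first loss and a squared penalty for the second are assumptions, since the disclosure only requires a loss on the prediction error and a loss on the gap between each importance vector's mean and the target proportion:

```python
# A hedged sketch of the two-part training objective.
import torch
import torch.nn.functional as F

def total_loss(pred: torch.Tensor,        # e.g. (B, num_classes, H, W) logits
               target: torch.Tensor,      # e.g. (B, H, W) class indices
               importance: torch.Tensor,  # (B, L, n) importance vectors
               target_ratio: float = 0.5,
               weight: float = 1.0) -> torch.Tensor:
    first = F.cross_entropy(pred, target)          # first loss
    means = importance.mean(dim=1)                 # one mean per vector
    second = ((means - target_ratio) ** 2).mean()  # drive means to the ratio
    return first + weight * second
```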
It should be noted that, in the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information all comply with the relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that may be used to implement the image detection methods and/or training methods of image detection models of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the electronic apparatus 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows electronic device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the respective methods and processes described above, such as an image detection method and/or a training method of an image detection model. For example, in some embodiments, the image detection method and/or the training method of the image detection model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the image detection method and/or the training method of the image detection model described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the image detection method and/or the training method of the image detection model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (26)

1. An image detection method, comprising:
extracting features of an image to be processed, and obtaining a plurality of image features under a plurality of scales, wherein each image feature comprises at least two pixel-level features;
determining respective importance levels of a plurality of pixel level features included in the plurality of image features;
decoding the plurality of image features according to the importance degree to obtain a plurality of decoding features respectively corresponding to the plurality of image features; and
determining a detection result for the image to be processed according to the plurality of decoding features.
2. The method of claim 1, wherein the determining, for a plurality of pixel-level features included in the plurality of image features, respective importance levels of the plurality of pixel-level features comprises:
performing dimension transformation on the plurality of image features to obtain a plurality of one-dimensional features respectively corresponding to the plurality of image features;
splicing the plurality of one-dimensional features to obtain a cascade feature; and

performing nonlinear processing on the cascade feature to obtain a plurality of importance vectors respectively corresponding to the plurality of image features,

wherein each importance vector of the plurality of importance vectors comprises: a plurality of importance degrees of the pixel-level features in the cascade feature with respect to the image feature corresponding to said each importance vector.
3. The method of claim 2, wherein decoding the plurality of image features according to the importance level to obtain a plurality of decoded features respectively corresponding to the plurality of image features comprises:
determining, for each image feature and according to the importance degrees, a target feature accounting for a target proportion of the cascade feature; and

determining a decoding feature for said each image feature according to the target feature and said each image feature.
4. The method of claim 3, wherein the determining a decoding feature for said each image feature according to the target feature and said each image feature comprises:

taking the target feature as both a key feature and a value feature, taking said each image feature as a query feature, and adopting a multi-head cross-attention mechanism to obtain the decoding feature for said each image feature.
5. The method of claim 1, wherein the determining a detection result for the image to be processed from the plurality of decoding features comprises:
fusing the plurality of decoding features to obtain a fusion feature for the image to be processed; and

determining the detection result according to the fusion feature.
6. The method of claim 1, wherein the extracting features of the image to be processed, obtaining a plurality of image features at a plurality of scales, comprises:
extracting features under n scales from the image to be processed to obtain n initial features with sequentially reduced sizes;
fusing, for the ith initial feature among the n initial features, the ith initial feature with the (i+1)th initial feature to obtain the ith image feature in the plurality of image features; and
determining an nth image feature of the plurality of image features according to an nth initial feature of the n initial features,
the value interval of i is [1, n-1], the number of the plurality of image features is n, and n is an integer greater than 1.
7. A training method of an image detection model, wherein the image detection model comprises a feature extraction network, a prediction network, a decoding network and a detection network; the method comprising:
inputting the sample image into the feature extraction network to obtain a plurality of image features under a plurality of scales; wherein the sample image comprises an actual detection result, each image feature comprising at least two pixel-level features;
inputting a plurality of pixel-level features included in the plurality of image features into the prediction network to obtain respective importance degrees of the plurality of pixel-level features;
inputting the importance level and the plurality of image features into the decoding network to obtain a plurality of decoding features respectively corresponding to the plurality of image features;
inputting the plurality of decoding features into the detection network to obtain a predicted detection result for the sample image; and

training the image detection model according to the predicted detection result and the actual detection result.
8. The method of claim 7, wherein the prediction network comprises a transformation sub-network and a prediction sub-network, and the inputting the plurality of pixel-level features included in the plurality of image features into the prediction network to obtain the respective importance of the plurality of pixel-level features comprises:

inputting the plurality of image features into the transformation sub-network for dimension transformation, and splicing a plurality of one-dimensional features obtained by the dimension transformation to obtain a cascade feature; and

inputting the cascade feature into the prediction sub-network for nonlinear processing to obtain an importance matrix,

wherein the importance matrix is composed of a plurality of importance vectors respectively corresponding to the plurality of image features, and each of the plurality of importance vectors comprises: a plurality of importance degrees of the pixel-level features in the cascade feature with respect to the image feature corresponding to said each importance vector.
9. The method of claim 8, wherein the decoding network comprises a decision sub-network and a decoding sub-network, and the inputting the importance degrees and the plurality of image features into the decoding network to obtain a plurality of decoding features respectively corresponding to the plurality of image features comprises:

inputting the importance matrix into the decision sub-network to obtain a plurality of decision features respectively corresponding to the plurality of importance vectors, as decision features respectively for the plurality of image features; and

for each image feature: obtaining, by the decoding sub-network, a decoding feature corresponding to said each image feature according to the decision feature for said each image feature, said each image feature and the cascade feature,
wherein the decision feature for each image feature indicates whether a respective pixel level feature comprised by the cascade feature is selected.
10. The method of claim 9, wherein the obtaining, by the decoding sub-network, a decoding feature corresponding to said each image feature according to the decision feature for said each image feature, said each image feature and the cascade feature comprises:
determining a mask feature for said each image feature according to the decision feature for said each image feature; and

inputting the mask feature, said each image feature and the cascade feature into the decoding sub-network, and obtaining the decoding feature corresponding to said each image feature by adopting a multi-head cross-attention mechanism.
11. The method of claim 10, wherein the obtaining the decoding feature corresponding to said each image feature by adopting a multi-head cross-attention mechanism comprises:
taking each image feature as a query feature and the cascade feature as a key feature to obtain an initial score feature;
adjusting the initial score characteristic according to the mask characteristic to obtain an adjusted score characteristic; and
taking the cascade feature as a value feature, and obtaining the decoding feature corresponding to said each image feature according to the adjusted score feature.
12. The method of claim 8, wherein training the image detection model based on the predicted detection result and the actual detection result comprises:
determining a first loss of the image detection model according to the difference between the predicted detection result and the actual detection result;

determining, for each vector in the plurality of importance vectors, an average value of the elements in said each vector to obtain a plurality of average values;

determining a second loss of the image detection model according to differences between the plurality of average values and a target proportion; and
training the image detection model according to the first loss and the second loss.
13. An image detection apparatus comprising:
the feature extraction module is used for extracting features of an image to be processed and obtaining a plurality of image features under a plurality of scales, wherein each image feature comprises at least two pixel-level features;
an importance determining module, configured to determine, for a plurality of pixel-level features included in the plurality of image features, importance of each of the plurality of pixel-level features;
the decoding module is used for decoding the plurality of image features according to the importance degree to obtain a plurality of decoding features respectively corresponding to the plurality of image features; and
the detection determining module is used for determining a detection result for the image to be processed according to the plurality of decoding features.
14. The apparatus of claim 13, wherein the importance determination module comprises:
the transformation submodule is used for carrying out dimension transformation on the plurality of image features to obtain a plurality of one-dimensional features corresponding to the plurality of image features respectively;
the splicing sub-module is used for splicing the plurality of one-dimensional features to obtain a cascade feature; and

the processing sub-module is used for performing nonlinear processing on the cascade feature to obtain a plurality of importance vectors respectively corresponding to the plurality of image features,

wherein each importance vector of the plurality of importance vectors comprises: a plurality of importance degrees of the pixel-level features in the cascade feature with respect to the image feature corresponding to said each importance vector.
15. The apparatus of claim 14, wherein the decoding module comprises:
a first feature determining sub-module, configured to determine, for each image feature and according to the importance degrees, a target feature accounting for a target proportion of the cascade feature; and

a decoding sub-module, configured to determine a decoding feature for said each image feature according to the target feature and said each image feature.
16. The apparatus of claim 15, wherein the decoding sub-module is configured to:

take the target feature as both a key feature and a value feature, take said each image feature as a query feature, and adopt a multi-head cross-attention mechanism to obtain the decoding feature for said each image feature.
17. The apparatus of claim 13, wherein the detection determination module comprises:
the first fusion sub-module is used for fusing the plurality of decoding features to obtain a fusion feature for the image to be processed; and

the detection sub-module is used for determining the detection result according to the fusion feature.
18. The apparatus of claim 13, wherein the feature extraction module comprises:
the feature extraction submodule is used for extracting features under n scales from the image to be processed to obtain n initial features with sequentially reduced sizes;
the second fusion sub-module is used for fusing, for the ith initial feature among the n initial features, the ith initial feature with the (i+1)th initial feature to obtain the ith image feature in the plurality of image features; and
a second feature determination sub-module for determining an nth image feature of the plurality of image features based on an nth initial feature of the n initial features,
the value interval of i is [1, n-1], the number of the plurality of image features is n, and n is an integer greater than 1.
19. A training device of an image detection model, wherein the image detection model comprises a feature extraction network, a prediction network, a decoding network and a detection network; the device comprises:
the feature extraction module is used for inputting the sample image into the feature extraction network to obtain a plurality of image features under a plurality of scales; wherein the sample image comprises an actual detection result, each image feature comprising at least two pixel-level features;
the importance determining module is used for inputting a plurality of pixel-level features included in the plurality of image features into the prediction network to obtain the importance of each of the plurality of pixel-level features;
the decoding module is used for inputting the importance degree and the plurality of image features into the decoding network to obtain a plurality of decoding features respectively corresponding to the plurality of image features;
the detection determining module is used for inputting the plurality of decoding features into the detection network to obtain a predicted detection result for the sample image; and

the model training module is used for training the image detection model according to the predicted detection result and the actual detection result.
20. The apparatus of claim 19, wherein the prediction network comprises a transformation sub-network and a prediction sub-network; the importance determination module comprises:

a transformation sub-module, used for inputting the plurality of image features into the transformation sub-network for dimension transformation, and splicing a plurality of one-dimensional features obtained by the dimension transformation to obtain a cascade feature; and

a processing sub-module, used for inputting the cascade feature into the prediction sub-network for nonlinear processing to obtain an importance matrix,

wherein the importance matrix is composed of a plurality of importance vectors respectively corresponding to the plurality of image features, and each of the plurality of importance vectors comprises: a plurality of importance degrees of the pixel-level features in the cascade feature with respect to the image feature corresponding to said each importance vector.
21. The apparatus of claim 20, wherein the decoding network comprises a decision sub-network and a decoding sub-network; the decoding module includes:
a decision determining sub-module, used for inputting the importance matrix into the decision sub-network to obtain a plurality of decision features respectively corresponding to the importance vectors, as decision features respectively for the plurality of image features; and

a decoding sub-module, used for obtaining, for each image feature, a decoding feature corresponding to said each image feature by means of the decoding sub-network according to the decision feature for said each image feature, said each image feature and the cascade feature,
wherein the decision feature for each image feature indicates whether a respective pixel level feature comprised by the cascade feature is selected.
22. The apparatus of claim 21, wherein the decoding submodule comprises:
a mask determining unit, used for determining a mask feature for said each image feature according to the decision feature for said each image feature; and

a decoding unit, used for inputting the mask feature, said each image feature and the cascade feature into the decoding sub-network, and obtaining the decoding feature corresponding to said each image feature by adopting a multi-head cross-attention mechanism.
23. The apparatus of claim 22, wherein the decoding unit comprises:
a score determining subunit, configured to obtain an initial score feature by using each image feature as a query feature and using the cascade feature as a key feature;
the adjustment subunit is used for adjusting the initial score characteristic according to the mask characteristic to obtain an adjusted score characteristic; and
a decoding subunit, used for taking the cascade feature as a value feature and obtaining the decoding feature corresponding to said each image feature according to the adjusted score feature.
24. The apparatus of claim 20, wherein the model training module comprises:
a first loss determination sub-module, configured to determine a first loss of the image detection model according to the difference between the predicted detection result and the actual detection result;

an average value determining sub-module, used for determining, for each vector in the plurality of importance vectors, an average value of the elements in said each vector to obtain a plurality of average values;
a second loss determination submodule for determining a second loss of the image detection model according to a difference between the plurality of average values and a target proportion; and
a training sub-module, used for training the image detection model according to the first loss and the second loss.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 12.
26. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method of any one of claims 1-12.
CN202210057370.3A 2022-01-18 2022-01-18 Image detection method and training method and device of image detection model Active CN114419327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210057370.3A CN114419327B (en) 2022-01-18 2022-01-18 Image detection method and training method and device of image detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210057370.3A CN114419327B (en) 2022-01-18 2022-01-18 Image detection method and training method and device of image detection model

Publications (2)

Publication Number Publication Date
CN114419327A CN114419327A (en) 2022-04-29
CN114419327B true CN114419327B (en) 2023-07-28

Family

ID=81273457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210057370.3A Active CN114419327B (en) 2022-01-18 2022-01-18 Image detection method and training method and device of image detection model

Country Status (1)

Country Link
CN (1) CN114419327B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2018202801A1 (en) * 2018-04-23 2019-11-07 Canon Kabushiki Kaisha Method, apparatus and system for producing a foreground map
CN111914698A (en) * 2020-07-16 2020-11-10 北京紫光展锐通信技术有限公司 Method and system for segmenting human body in image, electronic device and storage medium
CN112699937A (en) * 2020-12-29 2021-04-23 江苏大学 Apparatus, method, device, and medium for image classification and segmentation based on feature-guided network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10783640B2 (en) * 2017-10-30 2020-09-22 Beijing Keya Medical Technology Co., Ltd. Systems and methods for image segmentation using a scalable and compact convolutional neural network
CN110276344B (en) * 2019-06-04 2023-11-24 腾讯科技(深圳)有限公司 Image segmentation method, image recognition method and related device
CN111046962B (en) * 2019-12-16 2022-10-04 中国人民解放军战略支援部队信息工程大学 Sparse attention-based feature visualization method and system for convolutional neural network model
CN111598951B (en) * 2020-05-18 2022-09-30 清华大学 Method, device and storage medium for identifying space target
CN112233038B (en) * 2020-10-23 2021-06-01 广东启迪图卫科技股份有限公司 True image denoising method based on multi-scale fusion and edge enhancement
CN112767959B (en) * 2020-12-31 2023-10-17 恒安嘉新(北京)科技股份公司 Voice enhancement method, device, equipment and medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2018202801A1 (en) * 2018-04-23 2019-11-07 Canon Kabushiki Kaisha Method, apparatus and system for producing a foreground map
CN111914698A (en) * 2020-07-16 2020-11-10 北京紫光展锐通信技术有限公司 Method and system for segmenting human body in image, electronic device and storage medium
CN112699937A (en) * 2020-12-29 2021-04-23 江苏大学 Apparatus, method, device, and medium for image classification and segmentation based on feature-guided network

Also Published As

Publication number Publication date
CN114419327A (en) 2022-04-29

Similar Documents

Publication Publication Date Title
CN112966522B (en) Image classification method and device, electronic equipment and storage medium
US20220335711A1 (en) Method for generating pre-trained model, electronic device and storage medium
CN113255694B (en) Training image feature extraction model and method and device for extracting image features
JP2023541532A (en) Text detection model training method and apparatus, text detection method and apparatus, electronic equipment, storage medium, and computer program
JP7403605B2 (en) Multi-target image text matching model training method, image text search method and device
CN114612759B (en) Video processing method, video query method, model training method and model training device
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN112749300B (en) Method, apparatus, device, storage medium and program product for video classification
US20230162477A1 (en) Method for training model based on knowledge distillation, and electronic device
CN115578735B (en) Text detection method and training method and device of text detection model
CN115546488B (en) Information segmentation method, information extraction method and training method of information segmentation model
CN116152833B (en) Training method of form restoration model based on image and form restoration method
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
CN113887615A (en) Image processing method, apparatus, device and medium
CN114429633A (en) Text recognition method, model training method, device, electronic equipment and medium
CN113902010A (en) Training method of classification model, image classification method, device, equipment and medium
CN113902899A (en) Training method, target detection method, device, electronic device and storage medium
CN116343233B (en) Text recognition method and training method and device of text recognition model
CN116246287B (en) Target object recognition method, training device and storage medium
CN114419327B (en) Image detection method and training method and device of image detection model
CN114691918B (en) Radar image retrieval method and device based on artificial intelligence and electronic equipment
CN113642654B (en) Image feature fusion method and device, electronic equipment and storage medium
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN113806541A (en) Emotion classification method and emotion classification model training method and device
CN114445833A (en) Text recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant