WO2023179099A1 - Image detection method, apparatus and device, and readable storage medium - Google Patents

Image detection method, apparatus and device, and readable storage medium

Info

Publication number
WO2023179099A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
image
sample
attention
classification
Prior art date
Application number
PCT/CN2022/137773
Other languages
English (en)
French (fr)
Inventor
项进喜
杨森
张军
蒋冬先
侯英勇
韩骁
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2023179099A1
Priority to US18/378,405 (published as US20240054760A1)

Classifications

    • G06T 7/0012 — Image analysis; inspection of images, e.g. flaw detection; biomedical image inspection
    • G06T 7/194 — Segmentation; edge detection involving foreground-background segmentation
    • G06T 2207/20081 — Indexing scheme for image analysis or enhancement: training; learning
    • G06T 2207/30028 — Subject of image: colon; small intestine
    • G06T 2207/30096 — Subject of image: tumor; lesion
    • G06V 10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/267 — Segmentation of patterns by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/32 — Normalisation of the pattern dimensions
    • G06V 10/40 — Extraction of image or video features
    • G06V 10/762 — Image or video recognition using pattern recognition or machine learning, using clustering, e.g. of similar faces in social networks
    • G06V 10/764 — Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/7715 — Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods
    • G06V 10/7753 — Generating sets of training patterns; incorporation of unlabelled data, e.g. multiple instance learning [MIL]
    • G06V 10/776 — Validation; performance evaluation
    • G06V 10/806 — Fusion, i.e. combining data from various sources, of extracted features
    • G06V 10/82 — Image or video recognition using pattern recognition or machine learning, using neural networks
    • G06F 18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23 — Pattern recognition: clustering techniques
    • G06F 18/24 — Pattern recognition: classification techniques
    • G06F 18/253 — Pattern recognition: fusion techniques of extracted features

Definitions

  • This application relates to the field of computer technology, and in particular to image detection technology.
  • In multiple instance learning (MIL), a self-attention module can be used to mine the information of all instances in existing multi-instance images, find the correlated information between instances, and establish a multi-instance learning model to detect unknown multi-instance images.
  • However, the self-attention module has high computational complexity. When modeling multi-instance images such as digital pathology images, where the number of instances may reach about 10,000, it consumes substantial hardware resources and time, making training difficult. In addition, the supervision information is very weak, and it is difficult to train such a high-complexity self-attention module on a small data set. As a result, it is difficult for the self-attention module to mine effective information, and the module is prone to over-fitting, resulting in low detection accuracy.
  • Embodiments of the present application provide an image detection method, device, equipment and readable storage medium, which can improve image detection speed and detection accuracy.
  • In one aspect, embodiments of the present application provide an image detection method, executed by a computer device, including:
  • obtaining an image to be detected, performing feature extraction processing on the image to be detected, and obtaining a feature representation subset of the image to be detected;
  • the image to be detected includes at least two sub-images;
  • the feature representation subset includes at least two sub-image features, and the at least two sub-image features correspond one-to-one to the at least two sub-images;
  • generating attention weights corresponding to the at least two sub-image features, and performing weighted aggregation processing on the at least two sub-image features based on the attention weights to obtain a first feature vector;
  • performing cluster sampling processing on the at least two sub-image features to obtain at least two classification clusters, where the classification clusters include sampled sub-image features; determining, based on the at least two classification clusters and a block sparse matrix, the block sparse self-attention corresponding to each sampled sub-image feature, and determining a second feature vector based on at least two block sparse self-attentions, where the block sparse self-attention corresponding to a sampled sub-image feature is determined based on the sampled sub-image features in the classification cluster to which it belongs; and
  • determining the classification result of the image to be detected according to the first feature vector and the second feature vector.
  • In another aspect, embodiments of the present application provide an image detection method, executed by a computer device, including:
  • obtaining a sample image, performing feature extraction processing on the sample image, and obtaining a sample feature representation subset of the sample image;
  • the sample image includes at least two sample sub-images;
  • the sample feature representation subset includes at least two sample sub-image features, and the at least two sample sub-image features correspond one-to-one to the at least two sample sub-images;
  • inputting the at least two sample sub-images into an initial image recognition model, generating, through the initial image recognition model, sample attention weights corresponding to the at least two sample sub-image features, and performing weighted aggregation processing on the at least two sample sub-image features according to the sample attention weights to obtain a first sample feature vector;
  • performing cluster sampling processing on the at least two sample sub-image features through the initial image recognition model to obtain at least two sample classification clusters, where the sample classification clusters include sample sampling sub-image features; determining, based on the at least two sample classification clusters and a block sparse matrix, the sample block sparse self-attention corresponding to each sample sampling sub-image feature, and determining a second sample feature vector based on at least two sample block sparse self-attentions, where the sample block sparse self-attention corresponding to a sample sampling sub-image feature is determined based on the sample sampling sub-image features in the sample classification cluster to which it belongs;
  • determining, through the initial image recognition model, the sample classification result of the sample image based on the first sample feature vector and the second sample feature vector; and
  • adjusting the model parameters of the initial image recognition model based on the at least two sample classification clusters, the sample attention weights corresponding to the at least two sample sub-image features, the sample classification result, and the classification label corresponding to the sample image, to obtain an image recognition model for identifying the classification result of the image to be detected.
  • In another aspect, embodiments of the present application provide an image detection device, including:
  • the feature extraction module is used to obtain the image to be detected, perform feature extraction processing on the image to be detected, and obtain a feature representation subset of the image to be detected;
  • the image to be detected includes at least two sub-images;
  • the feature representation subset includes at least two sub-image features, and the at least two sub-image features correspond one-to-one to the at least two sub-images;
  • the first vector generation module is used to generate attention weights corresponding to at least two sub-image features, and perform weighted aggregation processing on at least two sub-image features based on the attention weights to obtain a first feature vector;
  • the second vector generation module is used to perform cluster sampling processing on the at least two sub-image features to obtain at least two classification clusters, where the classification clusters include sampled sub-image features, to determine the block sparse self-attention corresponding to each sampled sub-image feature based on the at least two classification clusters and the block sparse matrix, and to determine the second feature vector based on at least two block sparse self-attentions; the block sparse self-attention corresponding to a sampled sub-image feature is determined based on the sampled sub-image features in the classification cluster to which it belongs;
  • a classification module configured to determine the classification result of the image to be detected based on the first feature vector and the second feature vector.
  • In another aspect, embodiments of the present application provide an image detection device, including:
  • a sample feature extraction module, used to obtain a sample image, perform feature extraction processing on the sample image, and obtain a sample feature representation subset of the sample image; the sample image includes at least two sample sub-images; the sample feature representation subset includes at least two sample sub-image features, and the at least two sample sub-image features correspond one-to-one to the at least two sample sub-images;
  • the first sample vector generation module, used to input the at least two sample sub-images into the initial image recognition model, generate, through the initial image recognition model, sample attention weights corresponding to the at least two sample sub-image features, and perform weighted aggregation processing on the at least two sample sub-image features according to the sample attention weights to obtain the first sample feature vector;
  • the second sample vector generation module, used to perform cluster sampling processing on the at least two sample sub-image features through the initial image recognition model to obtain at least two sample classification clusters, where the sample classification clusters include sample sampling sub-image features, to determine the sample block sparse self-attention corresponding to each sample sampling sub-image feature based on the at least two sample classification clusters and the block sparse matrix, and to determine the second sample feature vector based on at least two sample block sparse self-attentions; the sample block sparse self-attention corresponding to a sample sampling sub-image feature is determined based on the sample sampling sub-image features in the sample classification cluster to which it belongs;
  • a sample classification module used to determine the sample classification result of the sample image based on the first sample feature vector and the second sample feature vector through the initial image recognition model
  • a training module, used to adjust the model parameters of the initial image recognition model based on the at least two sample classification clusters, the sample attention weights corresponding to the at least two sample sub-image features, the sample classification result, and the classification label corresponding to the sample image, to obtain an image recognition model used to identify the classification result of the image to be detected.
  • In another aspect, embodiments of the present application provide a computer device, including: a processor, a memory, and a network interface.
  • The above-mentioned processor is connected to the above-mentioned memory and the above-mentioned network interface, where the network interface is used to provide data communication functions, the memory is used to store a computer program, and the processor is used to call the computer program to execute the method in the embodiments of the present application.
  • In another aspect, embodiments of the present application provide a computer-readable storage medium. The computer-readable storage medium stores a computer program, and the computer program is suitable for being loaded by a processor to execute the method in the embodiments of the present application.
  • In another aspect, embodiments of the present application provide a computer program product or computer program. The computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the method in the embodiments of the present application.
  • feature extraction processing can be performed on an image to be detected including at least two sub-images to obtain a feature representation subset of the image to be detected.
  • the feature representation subset includes sub-image features corresponding to each of the at least two sub-images.
  • Then, there are two ways to mine the information of the sub-images. One is to mine the information of each sub-image independently, that is, to generate attention weights corresponding to the at least two sub-image features, and then perform weighted aggregation processing on the at least two sub-image features based on the attention weights to obtain the first feature vector. The second is to mine the correlated information between sub-images of the same category, that is, to perform cluster sampling processing on the at least two sub-image features to obtain the sampled sub-image features included in each of at least two classification clusters, determine the block sparse self-attention corresponding to each sampled sub-image feature according to the at least two classification clusters and the block sparse matrix, and determine the second feature vector based on at least two block sparse self-attentions. Finally, the classification result of the image to be detected is determined based on the first feature vector and the second feature vector. The first feature vector and the second feature vector obtained through the two information mining methods can complement and constrain each other, so the detection accuracy of the image can be improved. Moreover, using the block sparse matrix to calculate the block sparse self-attention corresponding to a sampled sub-image feature ensures that only the correlations between sampled sub-image features belonging to the same classification cluster are attended to, which reduces the computational complexity and improves the detection speed.
  • Figure 1a is a schematic diagram of a network architecture provided by an embodiment of the present application.
  • Figure 1b is a schematic diagram of an application scenario of an image detection method provided by an embodiment of the present application.
  • Figure 2 is a schematic flow chart of an image detection method provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of a scene of image feature extraction processing provided by an embodiment of the present application.
  • Figure 4 is a schematic flow chart of an image detection method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of the clustering results of a colorectal pathology image provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of the principle of block sparsity constraint on global self-attention provided by the embodiment of the present application.
  • Figure 7 is a schematic structural diagram of an image recognition model provided by an embodiment of the present application.
  • Figure 8 is a schematic flow chart of an initial image recognition model training method provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of an image detection device provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Figure 11 is a schematic structural diagram of another image detection device provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of another computer device provided by an embodiment of the present application.
  • As shown in Figure 1a, the network architecture may include a business server 100 and a terminal device cluster. The terminal device cluster may include a terminal device 10a, a terminal device 10b, a terminal device 10c, ..., and a terminal device 10n, where any terminal device in the terminal device cluster can have a communication connection with the business server 100. For example, there is a communication connection between the terminal device 10a and the business server 100, and there is a communication connection between the terminal device 10b and the business server 100. The communication connection is not limited to a particular connection method: it can be made directly or indirectly through wired communication, directly or indirectly through wireless communication, or through other methods, which this application does not restrict.
  • It can be understood that each terminal device in the terminal device cluster shown in Figure 1a can be installed with an application client. When the application client runs in a terminal device, it can exchange data with the business server 100 shown in Figure 1a, so that the business server 100 can receive service data from each terminal device. The application client can be a game application, video editing application, social application, instant messaging application, live broadcast application, short video application, video application, music application, shopping application, novel application, payment application, browser, or other application that handles images. The application client can be an independent client, or an embedded sub-client integrated in a certain client (such as an instant messaging client, a social client, or a video client), which is not limited here.
  • In the embodiments of the present application, each terminal device in the terminal device cluster can obtain an image to be detected by running the application client and send it to the business server 100 as business data; the business server 100 can then perform image detection on the image to be detected to determine its classification result. The image to be detected can also be called a multi-instance image, that is, it includes at least two sub-images, and one sub-image can be called an instance. As long as one sub-image is abnormal, the multi-instance image can be regarded as an abnormal image; in this case, the classification result of the image to be detected should be abnormal.
  • For example, the image to be detected can be a digital pathology image. The digital pathology image can be obtained in the following manner: scan glass slides through a fully automatic microscope or optical magnification system to collect high-resolution digital images, and then use a computer to automatically and seamlessly stitch the collected high-resolution digital images with high precision across multiple fields of view, to obtain high-quality visualization data, that is, the digital pathology image. Digital pathology images can be zoomed in and out at any position on a computer device without distortion of image information or loss of detail. Compared with observing the original glass slides, this makes it more convenient for doctors to perform pathological diagnosis such as cancer diagnosis, survival prediction, and gene mutation detection. Because the resolution of a digital pathology image is high, its image size is often very large, and it contains many instances (biological tissues such as cells and genes). After the terminal device obtains the digital pathology image, it can send the image to the business server 100 as business data, and the business server 100 can then perform image detection on the digital pathology image and determine its classification result, which can assist doctors in medical diagnosis.
  • Figure 1b is a schematic diagram of an application scenario of an image detection method provided by an embodiment of the present application. For ease of understanding, the image to be detected is still taken to be the digital pathology image in the above embodiment as an example. As shown in Figure 1b, the terminal device 200 (which can be any terminal device in Figure 1a, for example, the terminal device 10a) is installed with the patient management application 300, and object A has an association relationship with the terminal device 200. Assume that object A is the attending physician of object B. Object A can view object B's case information on the patient management application 300, such as the colorectal pathology image 301, and by observing the colorectal pathology image 301, object A can diagnose whether object B suffers from colorectal cancer.
  • Because the image size of the colorectal pathology image 301 is large and there are many cell tissues to observe, manual observation by object A takes a long time. Therefore, object A can initiate an image detection request for the colorectal pathology image 301 to the business server 400 (which can be the business server 100 shown in Figure 1a above), and the business server 400 can then perform image detection on the colorectal pathology image 301 to determine its classification result, that is, to determine whether the colorectal pathology image 301 is a normal image or an abnormal image. Subsequently, the classification result of the colorectal pathology image 301 returned by the business server 400 can assist object A in diagnosing object B's condition.
  • It should be noted that the colorectal pathology image 301 has a large image size and contains numerous cell tissues, so the colorectal pathology image 301 can be considered to include at least two sub-images (that is, at least two sub-images can be obtained by dividing the colorectal pathology image 301); as long as one sub-image is abnormal, the colorectal pathology image 301 is an abnormal image.
  • The business server 400 can first perform feature extraction processing on the colorectal pathology image 301 to obtain the feature representation subset 401 used to represent the colorectal pathology image 301.
  • the feature representation subset 401 includes at least two sub-image features, and one sub-image feature is used to describe information of a sub-image in the colorectal pathology image 301.
  • The business server 400 can detect the colorectal pathology image 301 using the image recognition model 402, which can include a first attention sub-network 4021, a second attention sub-network 4022, and a classification sub-network 4023. The first attention sub-network 4021 treats the sub-images as independent instances, mines the independent representation information of each sub-image according to its sub-image feature, and obtains the first feature vector used to represent the independent representation information. The second attention sub-network 4022 mines the global representation information between all sub-images and obtains the second feature vector used to represent the global representation information. The classification sub-network 4023 classifies the image to be detected based on the first feature vector and the second feature vector and determines its classification result.
  • Specifically, the first attention sub-network 4021 generates attention weights corresponding to the at least two sub-image features and then, based on the attention weights, performs weighted aggregation processing on the at least two sub-image features to obtain the first feature vector 403.
  • The second attention sub-network 4022 performs cluster sampling processing on the at least two sub-image features to obtain the sampled sub-image features included in each of at least two classification clusters, then determines the block sparse self-attention corresponding to each sampled sub-image feature based on the at least two classification clusters and the block sparse matrix, and determines the second feature vector 404 based on the at least two block sparse self-attentions.
  • The classification sub-network 4023 can perform feature fusion processing on the first feature vector 403 and the second feature vector 404 to obtain a fused feature vector, and then perform classification processing on the fused feature vector to obtain the classification result 405 of the colorectal pathology image 301, where the classification result 405 can include the normal probability and the abnormal probability of the colorectal pathology image 301. The normal probability refers to the probability that the colorectal pathology image 301 is a normal image, that is, the probability that object B does not have the disease; the abnormal probability refers to the probability that the colorectal pathology image 301 is an abnormal image, that is, the probability that object B may have colorectal cancer. Finally, the business server 400 returns the classification result 405 to the terminal device 200, and object A can diagnose object B's condition based on the classification result 405.
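  • The two-branch structure described above can be summarized in code. The following is a minimal PyTorch sketch of the image recognition model 402; the layer sizes, the softmax normalization, and the simplified second branch are illustrative assumptions rather than the patent's implementation (the actual sub-networks are detailed in the steps below).

```python
import torch
import torch.nn as nn

class ImageRecognitionModel(nn.Module):
    """Two-branch skeleton: per-instance attention branch, global branch,
    and a classification head over the fused feature vectors."""
    def __init__(self, feat_dim=1024, num_classes=2):
        super().__init__()
        # First attention sub-network: learns one weight per sub-image feature.
        self.attn = nn.Sequential(nn.Linear(feat_dim, 128), nn.Tanh(), nn.Linear(128, 1))
        # Stand-in for the second attention sub-network (cluster sampling and
        # block sparse self-attention are sketched at steps S206-S209 below).
        self.global_branch = nn.Linear(feat_dim, feat_dim)
        # Classification sub-network: fuse the two vectors and classify.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, H):                       # H: (K, feat_dim) sub-image features
        a = torch.softmax(self.attn(H), dim=0)  # attention weight per sub-image
        x1 = (a * H).sum(dim=0)                 # first feature vector
        x2 = self.global_branch(H).mean(dim=0)  # placeholder second feature vector
        return self.classifier(torch.cat([x1, x2]))  # classification logits
```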
  • the terminal device 200 can perform an image detection task locally on the image to be detected, and obtain the classification result of the image to be detected. Since training the image recognition model 402 involves a large amount of offline calculations, the image recognition model local to the terminal device 200 may be sent to the terminal device after training is completed by the business server 400 .
  • The business server 100 in the embodiments of the present application can be a computer device, and a terminal device in the terminal device cluster can also be a computer device, which is not limited here.
  • The above-mentioned server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • The above-mentioned terminal devices include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart home appliances, and vehicle-mounted terminals.
  • the embodiments of this application can be applied to various scenarios, including but not limited to cloud technology, cloud security, blockchain, artificial intelligence, smart transportation, assisted driving, etc.
  • It should be noted that the image to be detected above is described using a colorectal pathology image as an example. The image to be detected can also be a pathology image of another cancer type, or any other multi-instance image including at least two sub-images, which is not limited in this application.
  • FIG 2 is a schematic flow chart of an image detection method provided by an embodiment of the present application.
  • The image detection method can be executed by a computer device, where the computer device can be the business server 100 shown in Figure 1a, or any terminal device in the terminal device cluster shown in Figure 1a, such as the terminal device 10c.
  • the image detection method may at least include the following steps S101 to S104:
  • Step S101: Obtain an image to be detected, perform feature extraction processing on the image to be detected, and obtain a feature representation subset of the image to be detected. The image to be detected includes at least two sub-images; the feature representation subset includes at least two sub-image features, and the at least two sub-image features correspond to the at least two sub-images in a one-to-one manner.
  • Here, the image to be detected is a multi-instance image with weak image annotation, where multiple instances correspond to one label. A multi-instance image, also called a multi-instance bag, includes several instances; an instance can be regarded as a sub-image, but only the bag carries a label, and the instances do not. If a multi-instance bag contains at least one positive instance, the bag is marked as a positive multi-instance bag (positive bag); if all instances of the multi-instance bag are negative instances, the bag is marked as a negative multi-instance bag (negative bag).
  • The image to be detected can be a digital pathology image used in pathological diagnosis such as cancer diagnosis, survival prediction, and gene mutation prediction. After image detection is performed on the digital pathology image, a classification result can be obtained, and this classification result can assist the doctor in determining the corresponding medical diagnosis result. For example, when the digital pathology image is the colorectal pathology image 301 in Figure 1b, after image detection, the classification result 405 is obtained, and this classification result 405 can assist object A in determining whether object B has colorectal cancer.
  • It can be understood that the image to be detected is media data used by humans and lacks information that can be understood directly by a computer device. Therefore, the image to be detected needs to be converted from an unstructured original image into structured information that a computer can recognize and process; that is, the image to be detected is scientifically abstracted and a mathematical model is established to describe and replace it, so that the computer device can recognize the image to be detected through calculations and operations on the mathematical model. The mathematical model can be a vector space model, and the sub-image features corresponding to the sub-images included in the image to be detected can be vectors in the vector space model. Thus, the computer device can describe and use the image to be detected through a feature representation subset composed of sub-image features.
  • A feasible specific process of performing feature extraction processing on the image to be detected and obtaining its feature representation subset is as follows: identify the background area and foreground area in the image to be detected; segment the image to be detected based on the background area and foreground area to obtain the foreground image to be detected; scale the foreground image to be detected according to a zoom ratio to obtain the foreground image to be cropped; crop the foreground image to be cropped according to the preset sub-image length and preset sub-image width to obtain at least two sub-images; and finally perform image feature extraction processing on the at least two sub-images respectively to obtain the sub-image features corresponding to the at least two sub-images, and determine the feature representation subset of the image to be detected based on these sub-image features. The preset sub-image length is smaller than the length of the foreground image to be cropped, and the preset sub-image width is smaller than the width of the foreground image to be cropped.
  • Figure 3 is a schematic diagram of a scene of image feature extraction processing provided by an embodiment of the present application.
  • As shown in Figure 3, the computer device can first perform foreground-background identification on the image 3000 to be detected to determine the foreground area 3001 and the background area 3002 contained in the image 3000, and then segment the image based on these areas to obtain the foreground image 3003 to be detected. Subsequently, the foreground image 3003 to be detected can be cropped according to the preset sub-image length and preset sub-image width, for example 512*512, to obtain at least two sub-images, namely sub-image 3004, sub-image 3005, ..., sub-image 3006.
  • at least two sub-images can be input into the feature extractor 3007, and the sub-image features corresponding to each sub-image are extracted by the feature extractor 3007, thereby obtaining a feature representation subset 3008.
  • The feature extractor 3007 can be implemented using ResNet50 (a 50-layer residual network), or other pre-trained networks can be used, which is not limited in this application.
  • Optionally, before cropping, the foreground image 3003 to be detected can be zoomed at a set magnification; for example, the foreground image 3003 to be detected is enlarged ten times to obtain the foreground image to be cropped. Cropping the enlarged foreground image to be cropped yields more sub-images than cropping the foreground image 3003 to be detected directly, so more sub-image features can be obtained and the image to be detected can be represented more accurately.
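  • As a concrete illustration of the pipeline above, the following is a minimal sketch assuming a PIL foreground image, non-overlapping 512×512 crops, and a pretrained torchvision ResNet-50 as the feature extractor 3007; the crop stride and the preprocessing values are assumptions, not the patent's settings.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def extract_feature_subset(foreground: Image.Image, patch: int = 512) -> torch.Tensor:
    # Crop the foreground image into non-overlapping patch x patch sub-images.
    w, h = foreground.size
    subs = [foreground.crop((x, y, x + patch, y + patch))
            for y in range(0, h - patch + 1, patch)
            for x in range(0, w - patch + 1, patch)]
    # ResNet-50 with its classification head removed serves as the extractor.
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()
    backbone.eval()
    tf = T.Compose([T.Resize(224), T.ToTensor(),
                    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
    with torch.no_grad():
        feats = torch.stack([backbone(tf(s).unsqueeze(0)).squeeze(0) for s in subs])
    return feats  # (K, 2048): one sub-image feature per crop
```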
  • Step S102: Generate attention weights corresponding to the at least two sub-image features, and perform weighted aggregation processing on the at least two sub-image features based on the attention weights to obtain a first feature vector.
  • The attention weight can also be called the attention score; it is used to measure the importance of a sub-image feature. The larger the attention weight of a sub-image feature, the larger the proportion of that sub-image feature in the final first feature vector. The attention weight corresponding to each sub-image feature can be learned through a network that takes the sub-image feature itself as input. After the attention weights are generated, the sub-image features can be weighted and aggregated, that is, weighted and summed according to their attention weights, to obtain the first feature vector.
  • Step S103: Perform cluster sampling processing on the at least two sub-image features to obtain at least two classification clusters, where the classification clusters include sampled sub-image features. Based on the at least two classification clusters and the block sparse matrix, determine the block sparse self-attention corresponding to each sampled sub-image feature, and determine the second feature vector based on at least two of the block sparse self-attentions; the block sparse self-attention corresponding to a sampled sub-image feature is determined based on the sampled sub-image features in the classification cluster to which it belongs.
  • The at least two sub-image features can be clustered first, that is, the sub-image features are divided into at least two classification clusters based on their similarity, and the sub-images corresponding to the sub-image features in one classification cluster belong to the same image category. Then some sub-image features are sampled from each classification cluster as sampled sub-image features.
  • The self-attention of the sampled sub-image features is determined based on a global self-attention weight matrix, where the global self-attention weight matrix is used to characterize the correlations between the sampled sub-image features. Because the sampled sub-image features have already been classified, when determining the self-attention of a certain sampled sub-image feature, the computer device only attends to the sampled sub-image features that belong to the same classification cluster as that feature. After the computer device determines the global self-attention weight matrix based on the sampled sub-image features, it can obtain the block sparse matrix matching the at least two classification clusters, and filter the global self-attention weight matrix according to the block sparse matrix to obtain the block sparse global self-attention weight matrix. This block sparse global self-attention weight matrix characterizes only the correlations between sampled sub-image features of the same classification cluster. Subsequently, the block sparse self-attention of each sampled sub-image feature can be determined based on the block sparse global self-attention weight matrix, and the computer device then performs mean pooling processing on the block sparse self-attentions of all sampled sub-image features to obtain the second feature vector.
  • Step S104: Determine the classification result of the image to be detected according to the first feature vector and the second feature vector.
  • Specifically, the first feature vector and the second feature vector can be fused, and the fused feature vector can then be classified through an MLP (Multilayer Perceptron) to obtain the classification result of the image to be detected.
  • To sum up, in the embodiments of the present application, feature extraction processing is performed on the image to be detected including at least two sub-images to obtain the feature representation subset of the image to be detected, where the feature representation subset includes the sub-image features corresponding to each of the at least two sub-images. There are two ways to mine the information of the sub-images. One is to mine the information of each sub-image independently, that is, to generate attention weights corresponding to the at least two sub-image features and perform weighted aggregation processing on them based on the attention weights to obtain the first feature vector. The second is to mine the correlated information between sub-images of the same category, that is, to perform cluster sampling processing on the at least two sub-image features to obtain at least two classification clusters each including sampled sub-image features, determine the block sparse self-attention corresponding to each sampled sub-image feature based on the at least two classification clusters and the block sparse matrix, and determine the second feature vector based on at least two block sparse self-attentions. Finally, the classification result of the image to be detected is determined based on the first feature vector and the second feature vector. The first feature vector and the second feature vector obtained through the two information mining methods can complement and constrain each other, so the detection accuracy of the image can be improved. In addition, the block sparse matrix ensures that only the correlations between sampled sub-image features in the same classification cluster are calculated, which reduces computational complexity and improves detection speed.
  • FIG. 4 is a schematic flowchart of an image detection method provided by an embodiment of the present application.
  • The image detection method can be executed by a computer device, where the computer device can be the business server 100 shown in Figure 1a, or any terminal device in the terminal device cluster shown in Figure 1a, such as the terminal device 10c.
  • the following description will take the example of the image detection method being executed by a computer device.
  • the image detection method may at least include the following steps S201 to S210:
  • Step S201: Obtain the image to be detected, perform feature extraction processing on the image to be detected, and obtain a feature representation subset of the image to be detected. The image to be detected includes at least two sub-images; the feature representation subset includes at least two sub-image features, and the at least two sub-image features correspond to the at least two sub-images in a one-to-one manner.
  • For step S201, please refer to the detailed description of step S101 in the embodiment corresponding to Figure 2; it will not be described again here.
  • The feature representation subset obtained by the feature extraction processing can be expressed as formula (1):
  • H = F_f(x) = {h_1, h_2, …, h_K}    (1)
  • where x denotes the at least two sub-images of the image to be detected, h_k is the sub-image feature corresponding to the k-th sub-image, and F_f represents the feature extraction process, which is usually determined based on the selected feature extractor.
  • Step S202: Input the at least two sub-image features into a first attention sub-network in the image recognition model; the first attention sub-network includes a weight learning network layer and a weighted aggregation network layer.
  • Step S203: Perform weight fitting processing on the at least two sub-image features through the weight learning network layer to obtain the attention weights corresponding to the at least two sub-image features.
  • The weight learning network layer can use a parameterized neural network to learn the attention weight corresponding to each sub-image feature, where the attention weight a_k corresponding to the sub-image feature h_k in the feature representation subset H obtained by the above formula (1) can be expressed as formula (2):
  • a_k = W · tanh(V · h_k^T)    (2)
  • where W and V are parameter matrices and tanh is a nonlinear function. It can be seen from formula (2) that the attention weight a_k is only related to the sub-image feature h_k and has nothing to do with the other sub-image features; in other words, the weight learning network layer assumes that the feature representation subset H is independently distributed.
  • Step S204: Perform weighting processing on each sub-image feature according to its attention weight through the weighted aggregation network layer to obtain a weighted sub-image feature corresponding to each sub-image feature, and aggregate the weighted sub-image features corresponding to the at least two sub-image features to obtain the first feature vector.
  • Specifically, the sub-image features can be effectively aggregated by adopting a nonlinear attention weighting method, which can be calculated by formula (3):
  • X_1 = Σ_{k=1}^{K} a_k · h_k    (3)
  • where X_1 is the first feature vector and K is the number of sub-image features.
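  • The following is a minimal PyTorch sketch of the weight learning and weighted aggregation network layers, following formulas (2) and (3) as reconstructed above; the hidden size is an assumption.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Formula (2): a_k = W tanh(V h_k^T), where each a_k depends only on h_k.
    Formula (3): X1 = sum_k a_k * h_k."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.V = nn.Linear(feat_dim, hidden, bias=False)  # parameter matrix V
        self.W = nn.Linear(hidden, 1, bias=False)         # parameter matrix W

    def forward(self, H: torch.Tensor) -> torch.Tensor:   # H: (K, feat_dim)
        a = self.W(torch.tanh(self.V(H)))                 # (K, 1) attention weights
        return (a * H).sum(dim=0)                         # first feature vector X1
```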
  • Step S205: Input the at least two sub-image features into the second attention sub-network in the image recognition model. The second attention sub-network includes a cluster sampling network layer, a global self-attention network layer, a self-attention network layer, and a mean pooling network layer. Among them, inputting the at least two sub-image features into the first attention sub-network and inputting them into the second attention sub-network can be performed simultaneously, and the first attention sub-network and the second attention sub-network do not affect each other.
  • Step S206: Perform cluster sampling processing on the at least two sub-image features through the cluster sampling network layer to obtain at least two classification clusters, where the classification clusters include sampled sub-image features. The sum of the numbers of sampled sub-image features included in the at least two classification clusters is N, where N is a positive integer smaller than the number of the at least two sub-image features.
  • Specifically, the computer device can first perform clustering processing on the at least two sub-image features to obtain at least two classification clusters, and then take the k-th classification cluster among the at least two classification clusters, where k is a positive integer and the k-th classification cluster includes at least one cluster sub-image feature. Each classification cluster has a cluster center. The computer device can then obtain the vector distance between each cluster sub-image feature and the cluster center of the k-th classification cluster as a reference distance, select h cluster sub-image features in ascending order of reference distance, and use these h cluster sub-image features as the sampled sub-image features included in the k-th classification cluster, where h is a positive integer and h is less than or equal to the number of cluster sub-image features in the cluster.
  • The above clustering processing can use unsupervised k-means clustering, or other clustering methods, which are not limited here.
  • the sub-images corresponding to the sub-image features included in each classification cluster belong to the same image category.
  • Figure 5 is a schematic diagram of a clustering result of colorectal pathological images provided by an embodiment of the present application. As shown in Figure 5, the sub-image features corresponding to the same group of sub-images belong to the same classification cluster. From Figure 5, it can be found that different pathological images can be roughly divided into categories with similar histological structural features.
  • For example, the sub-images in group 1 contain obvious cancerous tissue; the sub-images in group 2 are all sub-images with abnormal staining or poor imaging quality; the sub-images in group 3 basically contain glandular tissue; the sub-images in group 4 all gather more immune cells; and so on.
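  • The clustering and sampling of step S206 can be sketched as follows, using scikit-learn's k-means; the cluster count and the per-cluster sample size h are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_sample(features: np.ndarray, n_clusters: int = 8, h: int = 16):
    """features: (K, d) sub-image features; returns sampled features per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    clusters = []
    for k in range(n_clusters):
        idx = np.where(km.labels_ == k)[0]
        # Reference distance: vector distance to the k-th cluster center.
        d = np.linalg.norm(features[idx] - km.cluster_centers_[k], axis=1)
        keep = idx[np.argsort(d)[:h]]   # h features closest to the center
        clusters.append(features[keep])
    return clusters  # sampled sub-image features of each classification cluster
```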
  • Step S207: Determine the block sparse global self-attention weight matrix of the N sampled sub-image features through the global self-attention network layer based on the block sparse matrix.
  • the second attention subnetwork includes a query weight matrix and a key weight matrix.
  • The process by which the computer device determines the block sparse global self-attention weight matrix of the N sampled sub-image features through the global self-attention network layer based on the block sparse matrix can be as follows: through the global self-attention network layer, construct a sampled sub-image feature matrix based on the sampled sub-image features included in each of the at least two classification clusters; multiply the sampled sub-image feature matrix by the query weight matrix to obtain the query matrix; multiply the sampled sub-image feature matrix by the key weight matrix to obtain the key matrix; determine the block sparse global correlation matrix according to the query matrix, the transpose of the key matrix, and the block sparse matrix; and normalize the block sparse global correlation matrix to obtain the block sparse global self-attention weight matrix.
  • The sampled sub-image feature matrix can be expressed as formula (4):
  • H_s = [s_1, s_2, …, s_N]^T    (4)
  • where s_i is the i-th sampled sub-image feature in the sampled sub-image feature matrix H_s, i is a positive integer less than or equal to N, and N is 128 as mentioned above.
  • Q = H_s · W_q = [q_1, q_2, …, q_N]^T    (5)
  • where W_q is the query weight matrix, a matrix randomly initialized by the second attention sub-network, Q is the query matrix, and q_i is the query vector associated with the i-th sampled sub-image feature in the sampled sub-image feature matrix.
  • K = H_s · W_k = [k_1, k_2, …, k_N]^T    (6)
  • where W_k is the key weight matrix, also a matrix randomly initialized by the second attention sub-network, K is the key matrix, and k_i is the key vector associated with the i-th sampled sub-image feature in the sampled sub-image feature matrix.
  • A = softmax((Q · K^T ⊙ B) / √d_k)    (7)
  • where Q is the above query matrix, K^T is the transpose of the above key matrix, B is the block sparse matrix related to the at least two classification clusters, d_k is the above N, the softmax function performs normalization, and A is the block sparse global self-attention weight matrix.
  • Figure 6 is a schematic diagram of the principle of block sparsity constraint on global self-attention provided by an embodiment of the present application.
  • As shown in Figure 6, the global self-attention weight matrix 601 is the unconstrained global self-attention matrix; at this point, the information between sampled sub-image features of different categories is also represented. The computer device can obtain the block sparse matrix 602 to constrain it, yielding the constrained block sparse global self-attention weight matrix 603, which represents only the information between sampled sub-image features belonging to the same category.
  • the block sparsity constraint essentially uses the correlation of categories to filter global self-attention, only paying attention to and retaining the attention of the same category, and removing the attention of different categories.
  • In other words, the block sparse global self-attention weight matrix 603 uses the category information of the instances to learn attention and appropriately simplifies the calculation.
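  • The following sketch implements formulas (5)-(7) as reconstructed above: the block sparse matrix B is derived from the cluster labels of the N sampled features, and cross-cluster entries are masked out before the softmax so that they receive zero attention weight; the masking-with-negative-infinity trick is an implementation assumption, not the patent's stated computation.

```python
import torch

def block_sparse_attention_weights(Hs, labels, Wq, Wk):
    """Hs: (N, d) sampled sub-image feature matrix; labels: (N,) cluster ids
    as a LongTensor; Wq, Wk: (d, d_out) query and key weight matrices."""
    Q = Hs @ Wq                                      # formula (5): query matrix
    K = Hs @ Wk                                      # formula (6): key matrix
    B = labels[:, None] == labels[None, :]           # block sparse matrix (N, N)
    N = Hs.shape[0]
    scores = (Q @ K.T) / (N ** 0.5)                  # scaled by sqrt(d_k), d_k = N
    scores = scores.masked_fill(~B, float("-inf"))   # block sparsity constraint
    return torch.softmax(scores, dim=-1)             # formula (7): matrix A
```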
  • Step S208: Determine the block sparse self-attention corresponding to each sampled sub-image feature through the self-attention network layer based on the at least two classification clusters and the block sparse global self-attention weight matrix.
  • Among them, the second attention sub-network also includes a value weight matrix. The above-mentioned N sampled sub-image features include the sampled sub-image feature N_a, where a is a positive integer less than or equal to N.
  • The computer device multiplies the sampled sub-image feature matrix by the value weight matrix through the self-attention network layer to obtain the value matrix; takes the sampled sub-image features in the classification cluster to which the sampled sub-image feature N_a belongs as the target sampled sub-image features; obtains, from the block sparse global self-attention weight matrix, the block sparse global self-attention weights between the sampled sub-image feature N_a and the target sampled sub-image features as the target block sparse global self-attention weights; obtains, from the value matrix, the value vectors corresponding to the target sampled sub-image features as the target value vectors; and determines the block sparse self-attention corresponding to the sampled sub-image feature N_a according to the target value vectors and the target block sparse global self-attention weights.
  • V = H_s · W_v = [v_1, v_2, …, v_N]^T    (8)
  • where W_v is the value weight matrix, a matrix randomly initialized by the second attention sub-network, V is the value matrix, and v_i is the value vector associated with the i-th sampled sub-image feature in the sampled sub-image feature matrix.
  • z_a = Σ_{b=1, c(s_b)=c(s_a)}^{N} a_ab · v_b    (9)
  • where z_a refers to the block sparse self-attention corresponding to the sampled sub-image feature N_a, s_a is the a-th sampled sub-image feature in the sampled sub-image feature matrix (that is, the sampled sub-image feature N_a), and both a and b are positive integers less than or equal to N. c(·) denotes the cluster center of the classification cluster to which a sampled sub-image feature belongs, so the condition c(s_b) = c(s_a) is a constraint: traversing b from 1 to N, the product a_ab · v_b is accumulated only when s_a and s_b belong to the same classification cluster. v_b is the value vector associated with s_b in the value matrix, and a_ab is the block sparse global self-attention weight in row a and column b of the block sparse global self-attention weight matrix, that is, the block sparse global self-attention weight between s_a and s_b.
  • Step S209 Perform mean pooling processing on at least two of the block sparse self-attentions through the mean pooling network layer to obtain a second feature vector.
  • the mean pooling processing refers to adding the at least two block sparse self-attentions and then averaging them; the resulting vector is the second feature vector X_2.
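Continuing the sketch above, formulas (8) and (9) together with the mean pooling of step S209 reduce to a few array operations; here the element-wise product with the block sparse matrix B stands in for the same-cluster constraint $c(\tilde{h}_b) = c(\tilde{h}_a)$, and $W_v$ is again assumed to be a plain array:

```python
import numpy as np

def second_feature_vector(Hs, A, B, Wv):
    """Sketch of formulas (8)-(9) plus step S209. A and B come from the
    block_sparse_attention_weights sketch above; Wv is the value weight
    matrix."""
    V = Hs @ Wv              # value matrix, formula (8)
    Z = (A * B) @ V          # row a holds z_a: same-cluster weighted sum of v_b
    return Z.mean(axis=0)    # mean pooling, giving the second feature vector X_2
```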
  • Step S210 Input the first feature vector and the second feature vector into the classification sub-network of the image recognition model; the classification sub-network includes a feature fusion network layer and a classification network layer; through the feature fusion network layer, the first feature vector and the second feature vector are subjected to feature fusion processing to obtain a fused feature vector; through the classification network layer, the fused feature vector is subjected to classification processing to obtain a classification result of the image to be detected.
  • the first attention sub-network will output the first feature vector X_1, and the second attention sub-network will output the second feature vector X_2.
  • the two parallel feature vectors are feature-fused at the feature fusion network layer, and the classification network layer can use an MLP (Multilayer Perceptron) classifier, so the final output can be expressed as the following formula (10): $y = \mathrm{MLP}(\mathrm{Concate}(X_1, X_2))$
  • Concate represents the feature fusion operation; the commonly used feature fusion methods are feature splicing and weighted summation.
  • the final output is y, which can be the normal prediction probability of the image to be detected; when the normal prediction probability is lower than a certain threshold, the image to be detected can be determined to be an abnormal image.
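As an illustration of formula (10), the sketch below assumes feature splicing for Concate and a single-hidden-layer MLP with a ReLU activation and a softmax output; the layer sizes and the activation are assumptions, since the embodiment leaves the concrete MLP architecture open:

```python
import numpy as np

def classify(X1, X2, W1, b1, W2, b2):
    """Sketch of formula (10): y = MLP(Concate(X1, X2))."""
    x = np.concatenate([X1, X2])            # fused feature vector (splicing)
    hidden = np.maximum(0.0, W1 @ x + b1)   # hidden layer with ReLU
    logits = W2 @ hidden + b2               # one logit per class
    probs = np.exp(logits - logits.max())   # softmax over the classes
    return probs / probs.sum()              # e.g. [normal, abnormal] probabilities
```

With feature splicing, the fused feature vector has the dimensions of the two inputs added together; weighted summation, the other fusion method mentioned above, would instead keep the original dimension.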
  • FIG. 7 is a schematic structural diagram of an image recognition model provided by an embodiment of the present application.
  • the image recognition model 7 includes a first attention sub-network 71 , a second attention sub-network 72 and a classification sub-network 73 .
  • after obtaining the feature representation subset 700 corresponding to the image to be detected, the computer device will input the sub-image features in the feature representation subset 700 into the image recognition model 7.
  • the feature representation subset 700 includes at least two sub-image features.
  • the computer device inputs the feature representation subset 700 into the first attention sub-network 71 and the second attention sub-network 72 respectively.
  • in the first attention sub-network 71, for each sub-image feature in the feature representation subset 700, its corresponding weight will be learned through a parameterized neural network.
  • for example, the sub-image feature 711 can be input into the parameterized neural network 712, which will output the weight of the sub-image feature 711.
  • the computer device can use a first-order non-linear attention weighting method to effectively aggregate all the sub-image features in the feature representation subset 700, finally obtaining the first feature vector, as in the sketch below.
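A minimal NumPy sketch of this branch is given below; the tanh gating and the exponential normalization over all instances are assumptions consistent with the parameterized form of formula (2), and the parameter shapes are illustrative:

```python
import numpy as np

def first_feature_vector(H, V, w):
    """Sketch of formulas (2)-(3). H: (n, d) sub-image feature matrix;
    V: (m, d) and w: (m,) parameterize the weight learning network layer."""
    scores = w @ np.tanh(V @ H.T)      # one unnormalized score per instance
    a = np.exp(scores - scores.max())  # numerically stable exponentiation
    a /= a.sum()                       # attention weights a_k, formula (2)
    return a @ H                       # weighted aggregation, formula (3)
```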
  • in the second attention sub-network 72, the computer device will first perform unsupervised clustering on the feature representation subset 700 to obtain at least two classification clusters, such as the classification cluster 721; the sub-images corresponding to the sub-image features in one classification cluster belong to the same category. The computer device then obtains some of the sub-image features from each classification cluster as sampled sub-image features; for the clustering and sampling processing, please refer to the above-mentioned step S206 and the sketch below.
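The clustering and sampling can be sketched with scikit-learn's KMeans standing in for the unsupervised clustering; the cluster count p and the per-cluster sample count h are illustrative values (the embodiment keeps the features closest to each cluster center, with a default sampling total of N = 128):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_sample(H, p=8, h=16, seed=0):
    """Sketch of step S206: cluster the sub-image features into p
    classification clusters, then keep the h features closest to each
    cluster center, giving N = p * h sampled sub-image features."""
    km = KMeans(n_clusters=p, random_state=seed, n_init=10).fit(H)
    feats, labels = [], []
    for k in range(p):
        idx = np.where(km.labels_ == k)[0]
        dist = np.linalg.norm(H[idx] - km.cluster_centers_[k], axis=1)
        keep = idx[np.argsort(dist)[:h]]   # h nearest to the cluster center
        feats.append(H[keep])
        labels.extend([k] * len(keep))
    return np.vstack(feats), np.asarray(labels)
```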
  • the computer device can perform matrix transformation on the sampled sub-image feature matrix 722 composed of the sampled sub-image features, to obtain the key matrix 723, the query matrix 724 and the value matrix 725.
  • for the matrix transformation, see the above formula (5), formula (6) and formula (8); it can be specifically realized through a convolutional network with a 1×1 convolution kernel.
  • the block sparse global self-attention matrix 726 can be determined according to the transposed matrix of the key matrix 723, the query matrix 724 and the block sparse matrix.
  • the determination process can be referred to the above step S207.
  • the second feature vector 727 is further determined according to the block sparse global self-attention matrix 726 and the value matrix 725.
  • the determination process may refer to the above steps S208 and S209.
  • the computer device will input the first feature vector 713 and the second feature vector 727 into the classification sub-network 73 .
  • in the classification sub-network 73, after feature fusion is performed on the first feature vector 713 and the second feature vector 727, the fused feature vector will be input into the classifier 731, which will then output a classification result 732.
  • the classification result 732 may include an image normal probability and an image abnormal probability.
  • the first attention sub-network and the second attention sub-network in the image recognition model mine the information of the image to be detected in two different ways, and obtain the first feature vector and the second feature vector.
  • the two feature vectors can complement and constrain each other after being fused, and prediction is performed on the fused feature vector, so the accuracy of the obtained classification result is high.
  • FIG. 8 is a schematic flowchart of an initial image recognition model training method provided by an embodiment of the present application.
  • the initial image recognition model training method can be executed by a computer device, where the computer device can be the business server 100 shown in Figure 1a above, or any terminal device in the terminal device cluster shown in Figure 1a above, for example, the terminal device 10c.
  • the initial image recognition model training method may at least include the following steps S301 to S305:
  • Step S301 obtain a sample image, perform feature extraction processing on the sample image, and obtain a sample feature representation subset of the sample image;
  • the sample image includes at least two sample sub-images;
  • the sample feature representation subset includes at least two sample sub-image features, and the at least two sample sub-image features have a one-to-one correspondence with the at least two sample sub-images.
  • for the specific implementation of step S301, please refer to the description of step S101 in the embodiment corresponding to FIG. 2, which will not be repeated here.
  • Step S302 Input the at least two sample sub-images into an initial image recognition model, and use the initial image recognition model to generate sample attention weights corresponding to the features of the at least two sample sub-images.
  • the sample attention weight corresponding to each sample sub-image feature is used to perform a weighted aggregation process on the at least two sample sub-image features to obtain a first sample feature vector.
  • the initial image recognition model may include a first initial attention sub-network
  • the computer device may generate the sample attention weights corresponding to the at least two sample sub-image features through the first initial attention sub-network, and perform weighted aggregation processing on the at least two sample sub-image features based on the sample attention weights corresponding to the sample sub-image features, to obtain the first sample feature vector.
  • Step S303 Perform cluster sampling processing on the at least two sample sub-image features through the initial image recognition model to obtain at least two sample classification clusters.
  • the sample classification clusters include sample sampling sub-image features.
  • the sample block sparse self-attention corresponding to each sample sampling sub-image feature is determined according to the at least two sample classification clusters and the block sparse matrix, and the second sample feature vector is determined based on the at least two sample block sparse self-attentions; the sample block sparse self-attention corresponding to a sample sampling sub-image feature is determined based on the sample sampling sub-image features in the sample classification cluster to which it belongs.
  • the initial image recognition model may also include a second initial attention sub-network, and then perform clustering sampling processing on at least two sample sub-image features through the second initial attention sub-network to obtain at least two sample classification clusters.
  • the sample block sparse self-attention corresponding to each sample sampling sub-image feature is then determined based on the at least two sample classification clusters and the block sparse matrix, and the second sample feature vector is determined based on the at least two sample block sparse self-attentions.
  • Step S304 Determine the sample classification result of the sample image according to the first sample feature vector and the second sample feature vector through the initial image recognition model.
  • the initial image recognition model may also include an initial classification sub-network, and then through the initial classification sub-network, the sample classification result of the sample image is determined based on the first sample feature vector and the second sample feature vector.
  • Step S305 Adjust the model parameters of the initial image recognition model according to the at least two sample classification clusters, the attention weights corresponding to the at least two sample sub-image features, the sample classification results, and the classification labels corresponding to the sample images, to obtain an image recognition model used to identify the classification result of the image to be detected.
  • the attention distribution of the first attention sub-network over the at least two sub-image features and the attention distribution of the second attention sub-network over the at least two sub-image features should be consistent.
  • the computer device can first determine the divergence loss value based on the at least two sample classification clusters and the sample attention weights corresponding to the at least two sample sub-image features; then determine the classification loss value based on the sample classification result and the classification label corresponding to the sample image; finally, perform a weighted summation of the divergence loss value and the classification loss value to obtain the total model loss value, and adjust the model parameters of the initial image recognition model based on the total model loss value to obtain the image recognition model.
  • the divergence loss value is used to ensure that the two network branches of the final trained image recognition model have consistent attention distribution for the same sub-image feature input.
  • the classification loss value is used to ensure that the classification result output by the final trained image recognition model is closer to the real result.
  • the above-mentioned implementation process of determining the divergence loss value based on the at least two sample classification clusters and the sample attention weights corresponding to the at least two sample sub-image features may be: obtain the i-th sample classification cluster among the at least two sample classification clusters, where i is a positive integer and i is less than or equal to the number of the at least two sample classification clusters; use the sample sub-image features included in the i-th sample classification cluster as the target sample sub-image features; determine the category divergence loss value corresponding to the i-th sample classification cluster according to the sample attention weights corresponding to the target sample sub-image features and the number of target sample sub-image features; and accumulate the category divergence loss values corresponding to each sample classification cluster to obtain the divergence loss value.
  • the sample sub-image features included in the sample image are clustered in the second initial attention sub-network, and at least two sample classification clusters are obtained.
  • in the second initial attention sub-network, the sample sub-image features in the same sample classification cluster receive the same degree of attention, so in the first initial attention sub-network, the degree of attention of the sample sub-image features in the same sample classification cluster should also be the same.
  • the sample image includes 6 sample sub-image features, namely B1, B2, B3, B4, B5 and B6.
  • the sample attention weights generated in the first initial attention sub-network are 0.10, 0.22, 0.11, 0.31, 0.22 and 0.12 respectively
  • the sample classification clusters generated in the second initial attention sub-network are: sample classification cluster 1 {B1, B3, B6} and sample classification cluster 2 {B2, B4, B5}
  • the sample attention weights corresponding to B1, B3 and B6 in sample classification cluster 1 are close to one another, which is reasonable; but the sample attention weight corresponding to B4 in sample classification cluster 2 is significantly higher than those of B2 and B5, which is unreasonable, so it needs to be adjusted through the divergence loss value.
  • the attention weights generated by the sample sub-image features in the same sample classification cluster in the first attention sub-network should obey a uniform distribution, so each sample classification cluster can determine a category divergence loss value.
  • the category divergence loss values corresponding to each sample classification cluster are accumulated to obtain the divergence loss value.
  • the implementation process of determining the category divergence loss value corresponding to the i-th sample classification cluster can be: obtain the fitted attention distribution composed of the sample attention weights corresponding to the target sample sub-image features; normalize the fitted attention distribution to obtain the normalized fitted attention distribution; use the uniform attention distribution corresponding to the number of target sample sub-image features as the attention distribution label; and determine the category divergence loss value corresponding to the i-th sample classification cluster based on the normalized fitted attention distribution and the attention distribution label.
  • continuing the above example, for sample classification cluster 1 the resulting fitted attention distribution is [0.10, 0.12, 0.11]; to facilitate the subsequent calculation of the category divergence loss value, the input is required to be a probability distribution, so the fitted attention distribution needs to be normalized so that it sums to 1, and the normalized fitted attention distribution is [0.303, 0.363, 0.333].
  • the number of target sample sub-image features is 3, and the corresponding uniform attention distribution is used as the attention distribution label, which is [1/3, 1/3, 1/3].
  • the process of determining the category divergence loss value corresponding to the i-th sample classification cluster can be expressed by the following formula (11):
  • $D_{KL}(U \parallel D) = \sum_{i=1}^{G} p(x_i) \log \frac{p(x_i)}{d(x_i)}$   formula (11)
  • G is the number of target sample sub-image features
  • p(x_i) is the i-th value in the attention distribution label
  • d(x_i) is the i-th value in the normalized fitted attention distribution
  • D_KL(U‖D) is the category divergence loss value.
  • the category divergence loss values of the sample classification clusters are accumulated to obtain the divergence loss value, as expressed by the following formula (12): $KL = \sum_{i=1}^{c} D_{KL}(U \parallel D_i)$
  • c is the number of sample classification clusters in the at least two sample classification clusters
  • D_KL(U‖D_i) refers to the category divergence loss value of the i-th sample classification cluster in the at least two sample classification clusters
  • KL is the divergence loss value.
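As a sketch of formulas (11) and (12), the function below normalizes the sample attention weights within each sample classification cluster, compares them with the uniform attention distribution label via KL divergence, and accumulates the per-cluster category divergence loss values; the epsilon guard against log(0) is an implementation assumption:

```python
import numpy as np

def divergence_loss(attn, labels, eps=1e-12):
    """Sketch of formulas (11)-(12): KL = sum over clusters of D_KL(U || D_i).
    attn: (N,) sample attention weights; labels: (N,) cluster indices."""
    kl = 0.0
    for k in np.unique(labels):
        d = attn[labels == k]
        d = d / d.sum()                          # normalized fitted distribution
        p = np.full(d.size, 1.0 / d.size)        # uniform attention label
        kl += np.sum(p * np.log(p / (d + eps)))  # category divergence loss value
    return kl
```

Applied to the example above, the cluster {B1, B3, B6} contributes a near-zero value, while the cluster {B2, B4, B5} contributes a clearly positive one, which is the signal used to pull B4's weight back toward those of B2 and B5.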
  • the total model loss value can be determined through the following formula (13): $Loss = L_{cls}(y, y') + \lambda \cdot KL$
  • y represents the classification label corresponding to the sample image
  • y′ represents the sample classification result output by the above-mentioned initial classification sub-network
  • L_cls(y, y′) denotes the classification loss value determined from the sample classification result and the classification label
  • KL is the above-mentioned divergence loss value
  • λ represents the weight of the divergence loss value, and defaults to 0.01.
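A direct sketch of formula (13) follows; the cross-entropy form of the classification loss is an assumption, since this passage only states that the classification loss value is computed from the sample classification result and the classification label:

```python
import numpy as np

def total_loss(y_onehot, y_prob, kl, lam=0.01):
    """Sketch of formula (13): classification loss plus lam * KL, with lam
    defaulting to 0.01 as stated above. Cross-entropy is assumed here."""
    ce = -np.sum(y_onehot * np.log(y_prob + 1e-12))  # classification loss value
    return ce + lam * kl                             # total model loss value
```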
  • when training the initial image recognition model, it can be trained for 100 epochs (training periods).
  • the optimizer defaults to Adam (an optimization algorithm), the initial learning rate is 1e-4, and the cosine annealing strategy is used to adjust the learning rate.
  • the minimum learning rate is 1e-6.
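Under these stated defaults, the optimizer and learning-rate schedule might be set up as in the PyTorch sketch below; the training loop, the loader, and the assumption that the model returns the formula (13) loss for a batch are all illustrative:

```python
import torch

def train(model, loader, epochs=100):
    """Training sketch: Adam with initial learning rate 1e-4 and cosine
    annealing of the learning rate down to the minimum 1e-6 over 100 epochs."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=epochs, eta_min=1e-6)
    for _ in range(epochs):
        for batch in loader:
            opt.zero_grad()
            loss = model(batch)   # assumed to return the formula (13) loss
            loss.backward()
            opt.step()
        sched.step()
```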
  • the K-L divergence loss function ensures that the attention distribution of the two sub-networks is consistent for the same multi-instance input, and the image detection accuracy of the final trained image recognition model is high.
  • FIG. 9 is a schematic structural diagram of an image detection device provided by an embodiment of the present application.
  • the image detection device may be a computer program (including program code) running on a computer device, for example, the image detection device may be an application software; the device may be used to execute corresponding steps in the image detection method provided by the embodiments of the present application.
  • the image detection device 1 may include: a feature extraction module 11 , a first vector generation module 12 , a second vector generation module 13 and a classification module 14 .
  • the feature extraction module 11 is used to obtain the image to be detected, perform feature extraction processing on the image to be detected, and obtain a feature representation subset of the image to be detected;
  • the image to be detected includes at least two sub-images;
  • the feature representation subset includes at least two sub-image features, At least two sub-image features correspond one-to-one to at least two sub-images;
  • the first vector generation module 12 is used to generate attention weights corresponding to at least two sub-image features, and perform weighted aggregation processing on the at least two sub-image features according to the attention weights to obtain a first feature vector;
  • the second vector generation module 13 is used to perform cluster sampling processing on at least two sub-image features to obtain at least two classification clusters.
  • the classification clusters include sampled sub-image features.
  • the second vector generation module 13 is also used to determine the block sparse self-attention corresponding to each sampled sub-image feature according to the at least two classification clusters and the block sparse matrix, and determine the second feature vector based on at least two block sparse self-attentions; the block sparse self-attention corresponding to a sampled sub-image feature is determined based on the sampled sub-image features in the classification cluster to which it belongs;
  • the classification module 14 is used to determine the classification result of the image to be detected according to the first feature vector and the second feature vector.
  • for the specific implementation of the feature extraction module 11, the first vector generation module 12, the second vector generation module 13 and the classification module 14, please refer to the relevant description of the embodiment corresponding to Figure 2 above, which will not be repeated here.
  • the feature extraction module 11 includes: a pre-processing unit 111 and a feature extraction unit 112.
  • the preprocessing unit 111 is used to identify the background area and the foreground area in the image to be detected;
  • the preprocessing unit 111 is also used to segment the image to be detected according to the background area and the foreground area to obtain the foreground image to be detected;
  • the preprocessing unit 111 is also used to perform zoom processing on the foreground image to be detected according to the zoom magnification to obtain the foreground image to be cropped;
  • the preprocessing unit 111 is also configured to perform cropping processing on the foreground image to be cropped according to the preset sub-image length and the preset sub-image width to obtain at least two sub-images; the preset sub-image length is smaller than the length of the foreground image to be cropped, and the preset sub-image width is smaller than the width of the foreground image to be cropped;
  • the feature extraction unit 112 is configured to perform image feature extraction processing on at least two sub-images, obtain sub-image features corresponding to at least two sub-images, and determine the feature representation of the image to be detected based on the sub-image features corresponding to at least two sub-images. Subset.
  • the first vector generation module 12 includes: a first input unit 121, a weight fitting unit 122 and an aggregation unit 123.
  • the first input unit 121 is used to input at least two sub-image features into the first attention sub-network in the image recognition model;
  • the first attention sub-network includes a weight learning network layer and a weighted aggregation network layer;
  • the weight fitting unit 122 is configured to perform weight fitting processing on at least two sub-image features through the weight learning network layer, and obtain the attention weights corresponding to the at least two sub-image features;
  • the aggregation unit 123 is configured to weight each sub-image feature according to the attention weight through the weighted aggregation network layer to obtain the weighted sub-image feature corresponding to each sub-image feature, and to perform aggregation processing on the weighted sub-image features corresponding to the at least two sub-image features to obtain the first feature vector.
  • for the specific implementation of the first input unit 121, the weight fitting unit 122 and the aggregation unit 123, please refer to the relevant description of the embodiment corresponding to Figure 4 above, which will not be repeated here.
  • the second vector generation module 13 includes: a second input unit 131, a cluster sampling unit 132, a global self-attention determination unit 133, a self-attention determination unit 134 and a mean pooling unit 135.
  • the second input unit 131 is used to input at least two sub-image features into the second attention sub-network in the image recognition model;
  • the second attention sub-network includes a cluster sampling network layer, a global self-attention network layer, and a self-attention network layer.
  • the cluster sampling unit 132 is configured to perform cluster sampling processing on at least two sub-image features through the cluster sampling network layer to obtain at least two classification clusters.
  • the classification clusters include sampled sub-image features; the sum of the numbers of sampled sub-image features included in each of the at least two classification clusters is N, and N is a positive integer less than the number of the at least two sub-image features;
  • the global self-attention determination unit 133 is configured to determine the block sparse global self-attention weight matrix of the N sampled sub-image features based on the block sparse matrix through the global self-attention network layer;
  • the self-attention determination unit 134 is configured to determine the block sparse self-attention corresponding to each sampled sub-image feature according to at least two classification clusters and the block sparse global self-attention weight matrix through the self-attention network layer;
  • the mean pooling unit 135 is configured to perform mean pooling processing on at least two blocks of sparse self-attention through the mean pooling network layer to obtain the second feature vector.
  • for the specific implementation of the second input unit 131, the cluster sampling unit 132, the global self-attention determination unit 133, the self-attention determination unit 134 and the mean pooling unit 135, please refer to the relevant description of the embodiment corresponding to Figure 4 above, which will not be repeated here.
  • the clustering sampling unit 132 includes: a clustering subunit 1321 and a sampling subunit 1322.
  • the clustering subunit 1321 is used to perform clustering processing on at least two sub-image features through the cluster sampling network layer to obtain at least two classification clusters;
  • the sampling subunit 1322 is used to obtain the k-th classification cluster among the at least two classification clusters; k is a positive integer; the k-th classification cluster includes at least one clustering sub-image feature;
  • the sampling subunit 1322 is also used to obtain the vector distance between each of at least one clustering sub-image feature and the cluster center of the kth classification cluster as a reference distance;
  • the sampling subunit 1322 is also used to sequentially acquire h clustering sub-image features from the at least one clustering sub-image feature based on the reference distance, and use the h clustering sub-image features as the sampled sub-image features included in the k-th classification cluster; h is a positive integer, and h is less than or equal to the number of the at least one clustering sub-image feature.
  • the second attention sub-network includes the query weight matrix and the key weight matrix
  • the global self-attention determination unit 133 includes: a first initialization subunit 1331, a matrix determination subunit 1332 and a normalization subunit 1333.
  • the first initialization subunit 1331 is used to construct a sampled sub-image feature matrix according to the sampled sub-image features included in each of at least two classification clusters through the global self-attention network layer;
  • the first initialization subunit 1331 is also used to multiply the sampled sub-image feature matrix and the query weight matrix to obtain the query matrix, and multiply the sampled sub-image feature matrix and the key weight matrix to obtain the key matrix;
  • the matrix determination subunit 1332 is used to determine the block sparse global correlation matrix based on the query matrix, the transposed matrix corresponding to the key matrix, and the block sparse matrix;
  • the normalization subunit 1333 is used to normalize the block sparse global correlation matrix to obtain the block sparse global self-attention weight matrix.
  • for the specific implementation of the first initialization subunit 1331, the matrix determination subunit 1332 and the normalization subunit 1333, please refer to the relevant description of the embodiment corresponding to Figure 4 above, which will not be repeated here.
  • the second attention sub-network also includes a value weight matrix;
  • the N sampled sub-image features include sampled sub-image features N j , j is a positive integer less than or equal to N;
  • the self-attention determination unit 134 includes: a target acquisition subunit 1341 and a determination subunit 1342.
  • the target acquisition subunit 1341 is used to multiply the sampled sub-image feature matrix and the value weight matrix through the self-attention network layer to obtain the value matrix;
  • the target acquisition subunit 1341 is also used to use the sampled sub-image features in the classification cluster to which the sampled sub-image feature N j belongs as the target sampled sub-image feature;
  • the target acquisition subunit 1341 is also used to obtain the block sparse global self-attention weight between the sampled sub-image feature N_j and the target sampled sub-image feature from the block sparse global self-attention weight matrix, as the target block sparse global self-attention weight;
  • the target acquisition subunit 1341 is also used to obtain the value vector corresponding to the target sampling sub-image feature from the value matrix as the target value vector;
  • the determination subunit 1342 is used to determine the block sparse self-attention corresponding to the sampled sub-image feature N j according to the target value vector and the target block sparse global self-attention weight.
  • the classification module 14 includes: a third input unit 141, a feature fusion unit 142 and a classification unit 143.
  • the third input unit 141 is used to input the first feature vector and the second feature vector into the classification subnetwork of the image recognition model;
  • the classification subnetwork includes a feature fusion network layer and a classification network layer;
  • the feature fusion unit 142 is configured to perform feature fusion processing on the first feature vector and the second feature vector through the feature fusion network layer to obtain a fused feature vector;
  • the classification unit 143 is used to classify the fused feature vector through the classification network layer to obtain the classification result of the image to be detected.
  • FIG. 10 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the image detection device 1 in the embodiment corresponding to Figure 9 above can be applied to a computer device 1000.
  • the computer device 1000 can include: a processor 1001, a network interface 1004 and a memory 1005.
  • the above computer device 1000 may also include: a user interface 1003 and at least one communication bus 1002.
  • the communication bus 1002 is used to realize connection communication between these components.
  • the user interface 1003 may include a display screen (Display) and a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
  • the memory 1005 may optionally be at least one storage device located remotely from the aforementioned processor 1001. As shown in Figure 10, memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 1004 can provide network communication functions; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 can be used to call the device control application program stored in the memory 1005 to implement the image detection method provided by the embodiments of this application.
  • the computer device 1000 described in the embodiment of the present application can execute the image detection method described in any of the corresponding embodiments of FIG. 2 and FIG. 4, and the details will not be described again. In addition, the description of the beneficial effects of using the same method will not be described again.
  • the embodiments of the present application also provide a computer-readable storage medium, which stores the computer program executed by the aforementioned image detection device 1, and the computer program includes program instructions.
  • when the above-mentioned processor executes the above-mentioned program instructions, it can execute the description of the above-mentioned image detection method in any of the embodiments corresponding to FIG. 2 and FIG. 4, which therefore will not be repeated here.
  • the description of the beneficial effects of using the same method will not be described again.
  • for technical details not disclosed in the computer-readable storage medium embodiments involved in this application, please refer to the description of the method embodiments of this application.
  • FIG. 11 is a schematic structural diagram of another image detection device provided by an embodiment of the present application.
  • the image detection device 2 may be a computer program (including program code) running in a computer device.
  • the image detection device 2 may be an application software; the device may be used to execute corresponding steps in the method provided by the embodiments of the present application.
  • the image detection device 2 may include: a sample feature extraction module 21 , a first sample vector generation module 22 , a second sample vector generation module 23 , a sample classification module 24 and a training module 25 .
  • the sample feature extraction module 21 is used to obtain a sample image, perform feature extraction processing on the sample image, and obtain a sample feature representation subset of the sample image;
  • the sample image includes at least two sample sub-images;
  • the sample feature representation subset includes at least two samples Sub-image features, at least two sample sub-image features correspond one-to-one to at least two sample sub-images;
  • the first sample vector generation module 22 is used to input the at least two sample sub-images into the initial image recognition model, generate the sample attention weights corresponding to the at least two sample sub-image features through the initial image recognition model, and perform weighted aggregation processing on the at least two sample sub-image features according to the sample attention weights corresponding to the at least two sample sub-image features, to obtain the first sample feature vector;
  • the second sample vector generation module 23 is configured to perform cluster sampling processing on at least two sample sub-image features through the initial image recognition model to obtain at least two sample classification clusters.
  • the sample classification clusters include sample sampling sub-image features; the second sample vector generation module 23 is also used to determine the sample block sparse self-attention corresponding to each sample sampling sub-image feature based on the at least two sample classification clusters and the block sparse matrix, and determine the second sample feature vector based on at least two sample block sparse self-attentions; the sample block sparse self-attention corresponding to a sample sampling sub-image feature is determined based on the sample sampling sub-image features in the sample classification cluster to which it belongs;
  • the sample classification module 24 is used to determine the sample classification result of the sample image according to the first sample feature vector and the second sample feature vector through the initial image recognition model;
  • the training module 25 is used to adjust the model parameters of the initial image recognition model according to the at least two sample classification clusters, the attention weights corresponding to the at least two sample sub-image features, the sample classification result, and the classification label corresponding to the sample image, to obtain an image recognition model used to identify the classification result of the image to be detected.
  • for the specific implementation of the sample feature extraction module 21, the first sample vector generation module 22, the second sample vector generation module 23, the sample classification module 24 and the training module 25, please refer to the relevant description of the embodiment corresponding to Figure 8 above, which will not be repeated here.
  • the training module 25 includes: a divergence loss value determination unit 251, a classification loss value determination unit 252, a weighted summation unit 253, and a model adjustment unit 254.
  • the divergence loss value determination unit 251 is configured to determine the divergence loss value based on the sample attention weights corresponding to at least two sample classification clusters and at least two sample sub-image features;
  • the classification loss value determination unit 252 is used to determine the classification loss value based on the sample classification result and the classification label corresponding to the sample image;
  • the weighted summation unit 253 is used to perform a weighted summation of the divergence loss value and the classification loss value to obtain the total loss value of the model;
  • the model adjustment unit 254 is used to adjust model parameters of the initial image recognition model according to the total loss value of the model to obtain an image recognition model.
  • the divergence loss value determination unit 251 includes: an acquisition subunit 2511, a category loss value determination subunit 2512, and a total loss value determination subunit 2513.
  • the acquisition subunit 2511 is used to obtain the i-th sample classification cluster among the at least two sample classification clusters; i is a positive integer, and i is less than or equal to the number of the at least two sample classification clusters;
  • the acquisition subunit 2511 is also used to use the sample sub-image features included in the i-th sample classification cluster as the target sample sub-image features;
  • the category loss value determination subunit 2512 is used to determine the category divergence loss value corresponding to the i-th sample classification cluster according to the sample attention weight corresponding to the target sample sub-image feature and the number of target sample sub-image features;
  • the total loss value determination subunit 2513 is used to accumulate the category divergence loss values corresponding to each sample classification cluster to obtain the divergence loss value.
  • the category loss value determination subunit 2512 is specifically used to: obtain the fitted attention distribution composed of the sample attention weights corresponding to the target sample sub-image features; normalize the fitted attention distribution to obtain the normalized fitted attention distribution; use the uniform attention distribution corresponding to the number of target sample sub-image features as the attention distribution label; and determine the category divergence loss value corresponding to the i-th sample classification cluster based on the normalized fitted attention distribution and the attention distribution label.
  • for the specific implementation of the acquisition subunit 2511, the category loss value determination subunit 2512 and the total loss value determination subunit 2513, please refer to the relevant description of the embodiment corresponding to Figure 8 above, which will not be repeated here.
  • FIG. 12 is a schematic structural diagram of another computer device provided by an embodiment of the present application.
  • the image detection device 2 in the embodiment corresponding to Figure 11 can be applied to a computer device 2000.
  • the computer device 2000 can include: a processor 2001, a network interface 2004 and a memory 2005.
  • the above computer device 2000 may also include: a user interface 2003 and at least one communication bus 2002.
  • the communication bus 2002 is used to realize connection communication between these components.
  • the user interface 2003 may include a display screen (Display) and a keyboard (Keyboard), and the optional user interface 2003 may also include a standard wired interface and a wireless interface.
  • the network interface 2004 may optionally include a standard wired interface or a wireless interface (such as a WI-FI interface).
  • the memory 2005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
  • the memory 2005 may optionally be at least one storage device located remotely from the aforementioned processor 2001. As shown in Figure 12, the memory 2005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 2004 can provide network communication functions; the user interface 2003 is mainly used to provide an input interface for the user; and the processor 2001 can be used to call the device control application stored in the memory 2005 Program to implement the initial image recognition model training method provided by the embodiment of this application.
  • the computer device 2000 described in the embodiments of the present application can execute the description of the initial image recognition model training method in the previous embodiments, and can also execute the description of the image detection device 2 in the previous embodiment corresponding to Figure 11, which will not be repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated either.
  • the embodiment of the present application also provides a computer-readable storage medium, and the computer-readable storage medium stores the computer program executed by the image detection device 2 mentioned above.
  • when the processor loads and executes the above computer program, it can execute the description of the initial image recognition model training method in any of the previous embodiments, so the details will not be repeated here.
  • the description of the beneficial effects of using the same method will not be described again.
  • the computer-readable storage medium may be the image detection device provided in any of the aforementioned embodiments or an internal storage unit of the computer device, such as a hard disk or memory of the computer device.
  • the computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
  • the computer-readable storage medium may also include both an internal storage unit of the computer device and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the computer device.
  • the computer-readable storage medium can also be used to temporarily store data that has been output or is to be output.
  • embodiments of the present application also provide a computer program product or computer program.
  • the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided by any corresponding embodiment mentioned above.

Abstract

The present invention discloses an image detection method, apparatus, device and readable storage medium. The method includes: performing feature extraction processing on an image to be detected to obtain at least two sub-image features; generating attention weights corresponding to the at least two sub-image features, and performing weighted aggregation processing on the at least two sub-image features according to the attention weights to obtain a first feature vector; performing cluster sampling processing on the at least two sub-image features to obtain the sampled sub-image features included in each of at least two classification clusters, determining the block sparse self-attention corresponding to each sampled sub-image feature according to the at least two classification clusters and a block sparse matrix, and determining a second feature vector according to at least two block sparse self-attentions; and determining a classification result of the image to be detected according to the first feature vector and the second feature vector. By adopting the present invention, the detection speed and detection accuracy of images can be improved.

Description

一种图像检测方法、装置、设备及可读存储介质
本申请要求于2022年03月23日提交中国专利局、申请号为2022102886990、申请名称为“一种图像检测方法、装置、设备及可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,尤其涉及图像检测技术。
背景技术
在经典的机器学习图像分类问题中,经常假设一个图像清楚地隶属于某个类别。但是,在实际应用中,在一个图像中会观察到多个实例(instance),而图像的标签仅对其中一个实例的类别有说明。这种问题一般称作多实例学习(multiple instance learning,MIL)。多实例学习的目的是通过对具有分类标签的多实例图像进行学习,建立多实例学习模型,进而将该多实例学习模型应用于未知的多实例图像的检测。
相关技术中,可以通过自注意力模块来挖掘已有的多实例图像中所有实例的信息,并从中找出实例之间的相关信息,进行建立多实例学习模型,从而检测未知的多实例图像。但是,自注意力模块具有高计算复杂度,对于数字化病理图像这种实例数量可能达到10000左右的多实例图像进行建模时,硬件资源和时间消耗都很大,训练困难。而且针对数字化病理图像这种数量少,且每张图像只对应一个全片的标签的多实例图像,监督信息很弱,在小数据集上训练如此高复杂度的自注意力模块,很难保证该自注意力模块能够挖掘出有效的信息,而且该自注意力模块容易产生过拟合的问题,导致检测准确率低。
发明内容
本申请实施例提供了一种图像检测方法、装置、设备及可读存储介质,可以提高图像的检测速度和检测准确率。
本申请实施例一方面提供了一种图像检测方法,由计算机设备执行,包括:
获取待检测图像,对待检测图像进行特征提取处理,得到待检测图像的特征表示子集;待检测图像包括至少两个子图像;特征表示子集包括至少两个子图像特征,至少两个子图像特征与至少两个子图像一一对应;
生成至少两个子图像特征各自对应的注意力权重,根据注意力权重,对至少两个子图像特征进行加权聚合处理,得到第一特征向量;
对至少两个子图像特征进行聚类采样处理,得到至少两个分类簇,分类簇包括采样子图像特征,根据至少两个分类簇和块稀疏矩阵,确定每个采样 子图像特征对应的块稀疏自注意力,根据至少两个块稀疏自注意力确定第二特征向量;采样子图像特征对应的块稀疏自注意力是基于其所属的分类簇中的采样子图像特征确定的;
根据第一特征向量和第二特征向量,确定待检测图像的分类结果。
本申请实施例一方面提供了一种图像检测方法,由计算机设备执行,包括:
获取样本图像,对样本图像进行特征提取处理,得到样本图像的样本特征表示子集;样本图像包括至少两个样本子图像;样本特征表示子集包括至少两个样本子图像特征,至少两个样本子图像特征与至少两个样本子图像一一对应;
将至少两个样本子图像输入初始图像识别模型,通过初始图像识别模型,生成至少两个样本子图像特征各自对应的样本注意力权重,根据至少两个样本子图像特征各自对应的样本注意力权重,对至少两个样本子图像特征进行加权聚合处理,得到第一样本特征向量;
通过初始图像识别模型,对至少两个样本子图像特征进行聚类采样处理,得到至少两个样本分类簇,分类簇包括样本采样子图像特征,根据至少两个样本分类簇和块稀疏矩阵确定每个样本采样子图像特征对应的样本块稀疏自注意力,根据至少两个样本块稀疏自注意力确定第二样本特征向量;样本采样子图像特征对应的样本块稀疏自注意力是基于其所属的样本分类簇中的样本采样子图像特征确定的;
通过初始图像识别模型,根据第一样本特征向量和第二样本特征向量,确定样本图像的样本分类结果;
根据至少两个样本分类簇、至少两个样本子图像特征各自对应的注意力权重、样本分类结果以及样本图像对应的分类标签,对初始图像识别模型进行模型参数调整,得到用于识别待检测图像的分类结果的图像识别模型。
本申请实施例一方面提供了一种图像检测装置,包括:
特征提取模块,用于获取待检测图像,对待检测图像进行特征提取处理,得到待检测图像的特征表示子集;待检测图像包括至少两个子图像;特征表示子集包括至少两个子图像特征,至少两个子图像特征与至少两个子图像一一对应;
第一向量生成模块,用于生成至少两个子图像特征各自对应的注意力权重,根据注意力权重,对至少两个子图像特征进行加权聚合处理,得到第一特征向量;
第二向量生成模块,用于对至少两个子图像特征进行聚类采样处理,得到至少两个分类簇,分类簇包括采样子图像特征,根据至少两个分类簇和块稀疏矩阵,确定每个采样子图像特征对应的块稀疏自注意力,根据至少两个块稀疏自注意力确定第二特征向量;采样子图像特征对应的块稀疏自注意力 是基于其所属的分类簇中的采样子图像特征确定的;
分类模块,用于根据第一特征向量和第二特征向量,确定待检测图像的分类结果。
本申请实施例一方面提供了一种图像检测装置,包括:
样本特征提取模块,用于获取样本图像,对样本图像进行特征提取处理,得到样本图像的样本特征表示子集;样本图像包括至少两个样本子图像;样本特征表示子集包括至少两个样本子图像特征,至少两个样本子图像特征与至少两个样本子图像一一对应;
第一样本向量生成模块,用于将至少两个样本子图像输入初始图像识别模型,通过初始图像识别模型中,生成至少两个样本子图像特征各自对应的样本注意力权重,根据至少两个样本子图像特征各自对应的样本注意力权重,对至少两个样本子图像特征进行加权聚合处理,得到第一样本特征向量;
第二样本向量生成模块,用于通过初始图像识别模型,对至少两个样本子图像特征进行聚类采样处理,得到至少两个样本分类簇,分类簇包括样本采样子图像特征,根据至少两个样本分类簇和块稀疏矩阵确定每个样本采样子图像特征对应的样本块稀疏自注意力,根据至少两个样本块稀疏自注意力确定第二样本特征向量;样本采样子图像特征对应的样本块稀疏自注意力是基于其所属的样本分类簇中的样本采样子图像特征确定的;
样本分类模块,用于通过初始图像识别模型,根据第一样本特征向量和第二样本特征向量,确定样本图像的样本分类结果;
训练模块,用于根据至少两个样本分类簇、至少两个样本子图像特征各自对应的注意力权重、样本分类结果以及样本图像对应的分类标签,对初始图像识别模型进行模型参数调整,得到用于识别待检测图像的分类结果的图像识别模型。
本申请实施例一方面提供了一种计算机设备,包括:处理器、存储器、网络接口;
上述处理器与上述存储器、上述网络接口相连,其中,上述网络接口用于提供数据通信网元,上述存储器用于存储计算机程序,上述处理器用于调用上述计算机程序,以执行本申请实施例中的方法。
本申请实施例一方面提供了一种计算机可读存储介质,上述计算机可读存储介质中存储有计算机程序,上述计算机程序适于由处理器加载并执行本申请实施例中的方法。
本申请实施例一方面提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中,计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行本申请实施例中的方法。
本申请实施例中,可以对包括至少两个子图像的待检测图像进行特征提 取处理,得到该待检测图像的特征表示子集,该特征表示子集包括至少两个子图像各自对应的子图像特征,然后通过两种方式来挖掘子图像的信息,一是独立挖掘各个子图像的信息,即生成至少两个子图像特征各自对应的注意力权重,再根据注意力权重对至少两个子图像特征进行加权聚合处理,得到第一特征向量;二是挖掘同类别的子图像之间的相关信息,即对至少两个子图像特征进行聚类采样处理,得到至少两个分类簇各自包括的采样子图像特征,根据至少两个分类簇和块稀疏矩阵确定每个采样子图像特征对应的块稀疏自注意力,根据至少两个块稀疏自注意力确定第二特征向量;最后根据第一特征向量和第二特征向量,确定待检测图像的分类结果。采用本申请实施例提供的方法,通过两种信息挖掘方式得到的第一特征向量和第二特征向量之间可以相互补充且相互约束,因此可以提高图像的检测准确率,另外,通过块稀疏矩阵来计算采样子图像特征对应的块稀疏自注意力,可以保证只关注和该采样子图像特征属于同一分类簇的采样子图像特征之间的相关性,降低计算复杂度,提高检测速度。
附图说明
图1a是本申请实施例提供的一种网络架构示意图;
图1b是本申请实施例提供的一种图像检测方法的应用场景示意图;
图2是本申请实施例提供的一种图像检测方法的流程示意图;
图3是本申请实施例提供的一种图像特征提取处理的场景示意图;
图4是本申请实施例提供的一种图像检测方法的流程示意图;
图5是本申请实施例提供的一种结直肠病理图像的聚类结果示意图;
图6是本申请实施例提供的一种对全局自注意力进行块稀疏约束的原理示意图;
图7是本申请实施例提供的一种图像识别模型的结构示意图;
图8是本申请实施例提供的一种初始图像识别模型训练方法的流程示意图;
图9是本申请实施例提供的一种图像检测装置的结构示意图;
图10是本申请实施例提供的一种计算机设备的结构示意;
图11是本申请实施例提供的另一种图像检测装置的结构示意图;
图12是本申请实施例提供的另一种计算机设备的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请实施例提供的方案涉及人工智能的计算机视觉技术、机器学习和 深度学习等技术,具体通过如下实施例进行说明。
请参见图1a,图1a是本申请实施例提供的一种网络架构示意图。如图1a所示,该网络架构可以包括业务服务器100以及终端设备集群,该终端设备集群可以包括终端设备10a、终端设备10b、终端设备10c、…、终端设备10n,其中,终端设备集群中的任一终端设备均可以与业务服务器100之间存在通信连接,例如,终端设备10a与业务服务器100之间存在通信连接,终端设备10b与业务服务器100之间存在通信连接,终端设备10c与业务服务器100之间存在通信连接,其中,上述通信连接不限定连接方式,可以通过有线通信方式进行直接或间接地连接,也可以通过无线通信方式进行直接或间接地连接,还可以通过其它方式,本申请在此不做限制。
应该理解,图1a所示的终端集群中的每个终端设备均可以安装有应用客户端,当该应用客户端运行于各终端设备中时,可以分别与上述图1a所示的业务服务器100之间进行数据交互,使得业务服务器100可以接收来自于每个终端设备的业务数据。其中,该应用客户端可以为游戏应用、视频编辑应用、社交应用、即时通信应用、直播应用、短视频应用、视频应用、音乐应用、购物应用、小说应用、支付应用、浏览器等具有相关图像处理功能的应用客户端。其中,该应用客户端可以为独立的客户端,也可以为集成在某客户端(例如即时通信客户端、社交客户端、视频客户端等)中的嵌入式子客户端,在此不做限定。
如图1a所示,终端设备集群中的每个终端设备可以通过运行该应用客户端,获取待检测图像,将其作为业务数据发送给业务服务器100,业务服务器100可以对待检测图像进行图像检测,以确定该待检测图像的分类结果。其中,待检测图像又可以称为多实例图像,即包括有至少两个子图像,一个子图像可以称之为一个实例,对于多实例图像,只要有一个实例异常,则多实例图像即可被视为异常图像,换言之,只要有一个子图像异常,则待检测图像的分类结果应该为异常图像。
在一种可行的实施例中,以待检测图像为数字化病理图像为例,数字化病理图像可以通过以下方式获得:通过全自动显微镜或光学放大系统扫描载波切片采集得到高分辨数字图像,再应用计算机对得到的高分配率数字图像自动进行高精度多视野无缝隙拼接和处理,从而获得优质可视化数据,即获得数字化病理图像。数字化病理图像可以在计算机设备中进行任意位置的放大和缩小,且不存在图像信息失真和细节不清的问题,相比较原始的载波切片观察,更方便医生进行癌症诊断、生存期检测、基因突变检测等病理学诊断。但是数字化病理图像的分辨率高,图像尺寸往往很大,且包含非常多的实例(细胞、基因等生物组织),人工观察数字化病理图像时需要不断调整检测位置以及检测倍数,往往需要耗费大量的时间和精力,因此可以通过上述应用客户端上传数字化病理图像,终端设备获取到数字化病理图像后,可 以将其作为业务数据发送给业务服务器100,进而,业务服务器100可以对该数字化病理图像进行图像检测,确定该数字化病理图像的分类结果,该分类结果可以辅助医生进行医学诊断。
业务服务器100获取到待检测图像后,可以对待检测图像进行图像检测,确定待检测图像的分类结果,该具体实现过程请一并参见图1b,图1b是本申请实施例提供的一种图像检测方法的应用场景示意图。为便于理解,仍然以待检测图像为上述实施方式中的数字化病理图像为例进行说明。如图1b所示,终端设备200(可以为上述图1a中任一终端设备,例如,终端设备10a)安装有患者管理应用300,对象A与终端设备200具有关联关系。假设对象A是对象B的主治医生,对象A在患者管理应用300上可以查看对象B的病例资料,例如结直肠病理图像301,通过观测结直肠病理图像301,对象A可以诊断对象B是否患有结直肠癌症。因为结直肠病理图像301的图像尺寸很大,且需观测的细胞组织很多,对象A人工观测所需时间长,因此对象A可以通过终端设备200上运行的患者管理应用300,向业务服务器400(例如,上述图1a所示的业务服务器100)发起针对该结直肠病理图像301的图像检测请求,然后业务服务器400可以对结直肠病理图像301进行图像检测,确定该结直肠病理图像301的分类结果,即确定该结直肠病理图像301属于正常图像还是异常图像。业务服务器400对结直肠病理图像301的分类结果可以辅助对象A进行对象B的病情诊断。
可以理解的是,结直肠病理图像301的图像尺寸大,包含的细胞组织众多,因此可以认为结直肠病理图像301中包括至少两个子图像(即通过划分结直肠病理图像301可以得到至少两个子图像),只要有一个子图像中存在异常,则结直肠病理图像301就为异常图像。
如图1b所示,终端设备200将结直肠病理图像301发送至业务服务器400后,业务服务器400可以先对结直肠病理图像301进行特征提取处理,得到用于表示该结直肠病理图像301的特征表示子集401,该特征表示子集401中包括至少两个子图像特征,一个子图像特征用于描述结直肠病理图像301中一个子图像的信息。随后,业务服务器400可以使用图像识别模型402检测结直肠病理图像301,该图像识别模型402可以包括第一注意力子网络4021、第二注意力子网络4022以及分类子网络4023,其中,第一注意力子网络4021用于将子图像看作独立的实例,根据每个子图像特征来挖掘每个子图像的独立表示信息,得到用于表示该独立表示信息的第一特征向量;第二注意力子网络4022用于挖掘所有子图像之间的全局表示信息,得到用于表示该全局表示信息的第二特征向量;分类子网络4023用于根据第一特征向量和第二特征向量对待检测图像进行图像分类,确定待检测图像的分类结果。
如图1b所示,业务服务器400将特征表示子集401输入图像识别模型402后,第一注意力子网络4021会生成至少两个子图像特征各自对应的注意 力权重,然后,根据注意力权重对至少两个子图像特征进行加权聚合处理,得到第一特征向量403。同时,第二注意力子网络4022会对至少两个子图像特征进行聚类采样处理,得到至少两个分类簇各自包括的采样子图像特征,然后,根据至少两个分类簇和块稀疏矩阵确定每个采样子图像特征对应的块稀疏自注意力,根据至少两个块稀疏自注意力确定第二特征向量404。在确定了第一特征向量403和第二特征向量404后,分类子网络4023可以对第一特征向量403和第二特征向量404进行特征融合处理,得到融合特征向量,然后对融合特征向量进行分类处理,得到结直肠病理图像301的分类结果405,其中,分类结果405可以包括结直肠病理图像301的正常概率和异常概率,正常概率是指结直肠病理图像301为正常图像的概率,即对象B未患病的概率,异常概率是指结直肠图像301为异常图像的概率,即对象B可能患有结直肠癌的概率。业务服务器400会将分类结果405返回至终端设备200,对象A可以根据分类结果405来诊断对象B的患病情况。
可选的,若终端设备200的本地存储有图像识别模型402,则终端设备200可以在本地对待检测图像做图像检测任务,得到该待检测图像的分类结果。由于训练图像识别模型402涉及大量的离线计算,因此终端设备200本地的图像识别模型可以是由业务服务器400训练完成后发送至终端设备的。
可以理解的是,本申请实施例提供的方法可以由计算机设备执行,计算机设备包括但不限于终端设备或服务器,本申请实施例中的业务服务器100可以为计算机设备,终端设备集群中的终端设备也可以为计算机设备,此处不限定。上述服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、以及大数据和人工智能平台等基础云计算服务的云服务器。上述终端设备包括但不限于手机、电脑、智能语音交互设备、智能家电、车载终端等,但并不局限于此。本申请实施例可应用于各种场景,包括但不限于云技术、云安全、区块链、人工智能、智慧交通、辅助驾驶等。
可以理解的是,当本申请实施例运用到具体产品或技术中时,涉及到的待检测图像等相关的数据,需要获得用户许可或者同意后再获取,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
需要说明的是,在本申请实施例中待检测图像是以结直肠病理图像为例进行描述,但是实际应用场景中,待检测图像还可以为其他癌种的病理图像,或者其他包括至少两个子图像的多实例图像,本申请在此不做限制。
进一步地,请参见图2,图2是本申请实施例提供的一种图像检测方法的流程示意图。其中,该图像检测方法可以由计算机设备执行,其中,计算机设备可以为上述图1a所示的业务服务器100,也可以为上述图1a所示的终端设备集群中的任一终端设备,例如终端设备10c。以下将以本图像检测方法 由计算机设备执行为例进行说明。其中,该图像检测方法至少可以包括以下步骤S101-步骤S104:
步骤S101,获取待检测图像,对所述待检测图像进行特征提取处理,得到所述待检测图像的特征表示子集;所述待检测图像包括至少两个子图像;所述特征表示子集包括至少两个子图像特征,所述至少两个子图像特征与所述至少两个子图像一一对应。
具体的,待检测图像是图像弱标注、多个实例对应一个标签的多实例图像。一个多实例图像,也称作多实例包(bag),包括有若干实例(instance),一个实例可以视为一个子图像,但是只有bag含有标签,实例不含有标签。如果多实例包中至少含有一个正实例(instance),则该包被标记为正类多实例包(正包);如果多实例包的所有实例都是负实例,则该包被标记为负类多实例包(负包)。待检测图像可以是在癌症诊断、生存期预测、基因突变预测等病理学诊断中应用的数字化病理图像,通过对数字化病理图像进行图像检测,得到数字化病理图像的分类结果,该分类结果可以辅助医生确定对应的医学诊断结果,例如,数字化病理图像为上述图1b中的结直肠病理图像301,通过对结直肠病理图像301进行图像检测,得到分类结果405,该分类结果405可以辅助对象A确定对象B是否患有结直肠癌。
具体的,待检测图像是人类所使用的媒体数据,缺乏计算机设备可理解的信息,因此需要将待检测图像从一个无结构的原始图像转化为结构化的、计算机可以识别处理的信息,即对待检测图像进行科学的抽象,建立它的数学模型,用以描述和代替待检测图像,使计算机设备能够通过对该数学模型的计算和操作来实现对待检测图像的识别。数学模型可以是向量空间模型,此时,待检测图像中包括的子图像对应的子图像特征可以是该向量空间模型中的向量,计算机设备可以通过由子图像特征构成的特征表示子集描述和运用待检测图像。
具体的,如果把所有的子图像特征都作为特征项,将导致计算量太大,因此需要在不损伤待检测图像核心信息的情况下尽量减少要处理的子图像,以此来简化计算,提高待检测图像处理的速度和效率。因此,对待检测图像进行特征提取处理,得到待检测图像的特征表示子集的一个可行具体过程为:识别待检测图像中的背景区域和前景区域,然后根据背景区域和前景区域对待检测图像进行图像分割,得到待检测前景图像,随后可以根据缩放倍率对待检测前景图像进行缩放处理,得到待裁剪前景图像;再根据子图像预设长度和子图像预设宽度,对待裁剪前景图像进行裁剪处理,得到至少两个子图像,最后对至少两个子图像分别进行图像特征提取处理,得到至少两个子图像各自对应的子图像特征,根据至少两个子图像各自对应的子图像特征确定待检测图像的特征表示子集。其中,子图像预设长度小于待裁剪前景图像的长度;子图像预设宽度小于待裁剪前景图像的宽度。
为便于理解上述特征提取处理的可行具体过程,请一并参见图3,图3是本申请实施例提供的一种图像特征提取处理的场景示意图。如图3所示,计算机设备获取到待检测图像3000后,可以先对待检测图像3000进行前后景识别,确定待检测图像3000包含的前景区域3001和后景区域3002,然后进行前后景区域的图像分割,得到待检测前景图像3003。然后,可以根据子图像预设长度和子图像预设宽度,例如,512*512,对待检测前景图像3003进行裁剪处理,得到至少两个子图像,即子图像3004、子图像3005、…、子图像3006。然后,可以将至少两个子图像输入特征提取器3007,通过特征提取器3007提取每个子图像对应的子图像特征,从而得到特征表示子集3008。其中,特征提取器3007可以采用特征提取器RestNet50(Residual net 50,一种残差网络)来实现,也可以使用其他的预训练网络来实现,本申请在此不作限制。可选的,可以对待检测前景图像3003进行设定倍率的缩放,比如,将待检测前景图像3003进行十倍的放大,得到待裁剪前景图像,对该待裁剪前景图像进行裁剪处理,可以比对待检测前景图像3003进行裁剪处理得到更多的子图像,因此能得到更多的子图像特征,可以更精细地表示待检测图像。
步骤S102,生成所述至少两个子图像特征各自对应的注意力权重,根据所述注意力权重,对所述至少两个子图像特征进行加权聚合处理,得到第一特征向量。
具体的,注意力权重又可以称为注意力分数,用于衡量子图像特征的重要性,注意力权重越大,代表对应的子图像特征越重要,在最终输出的第一特征向量中,其对应的子图像特征的占比就越大。每个子图像特征对应的注意力权重可以通过一个以其自身为输入的网络学习得到。
具体的,在得到注意力权重后,就可以对子图像特征进行加权聚合处理,即根据注意力权重对子图像特征加权求和,得到第一特征向量。
步骤S103,对所述至少两个子图像特征进行聚类采样处理,得到至少两个分类簇,所述分类簇包括采样子图像特征,根据所述至少两个分类簇和块稀疏矩阵,确定每个所述采样子图像特征对应的块稀疏自注意力,根据至少两个所述块稀疏自注意力确定第二特征向量;所述采样子图像特征对应的块稀疏自注意力是基于其所属的分类簇中的采样子图像特征确定的。
具体的,为了更好地挖掘各子图像特征之间的相关性,又避免过高的计算复杂度,可以先对至少两个子图像特征进行聚类,即根据子图像特征的相似性将至少两个子图像特征划分为至少两个分类簇,一个分类簇中的子图像特征对应的子图像属于相同类别的图像。然后从每个分类簇中采样出部分子图像特征,作为采样子图像特征。
具体的,采样子图像特征的自注意力是基于全局自注意力权重矩阵确定的,其中,全局自注意力权重矩阵用于表征采样子图像特征之间的相关度。因为之前已经对采样子图像特征进行了分类,因此计算机设备在确定某个采 样子图像特征的自注意力时,只关注和该采样子图像特征属于相同分类簇的采样子图像特征即可。计算机设备根据采样子图像特征确定全局自注意力权重矩阵后,可以获取至少两个分类簇各自匹配的块稀疏矩阵,根据该块稀疏矩阵对全局自注意力权重矩阵进行过滤,得到块稀疏全局自注意力权重矩阵,该块稀疏全局自注意力权重矩阵用于表征相同分类簇的采样子图像特征之间的相关度。随后,根据块稀疏全局自注意力权重矩阵就可以确定每个采样子图像特征的块稀疏自注意力,计算机设备再对所有采样子图像特征的块稀疏自注意力进行均值池化处理,就可以得到第二特征向量。
步骤S104,根据所述第一特征向量和所述第二特征向量,确定所述待检测图像的分类结果。
具体的,在得到第一特征向量和第二特征向量后,就可以采用MLP(Multilayer Perceptron,多层感知机)分类器对第一特征向量和第二特征向量进行预测,输出分类结果。
采用本申请实施例提供的方法,对包括至少两个子图像的待检测图像进行特征提取处理,得到该待检测图像的特征表示子集,该特征表示子集包括至少两个子图像各自对应的子图像特征,然后通过两种方式来挖掘子图像的信息,一是独立挖掘各个子图像的信息,即生成至少两个子图像特征各自对应的注意力权重,再根据注意力权重对至少两个子图像特征进行加权聚合处理,得到第一特征向量;二是挖掘同类别的子图像之间的相关信息,即对至少两个子图像特征进行聚类采样处理,得到至少两个分类簇,每个分类簇包括采样子图像特征,根据至少两个分类簇和块稀疏矩阵确定每个采样子图像特征对应的块稀疏自注意力,根据至少两个块稀疏自注意力确定第二特征向量;最后根据第一特征向量和第二特征向量,确定待检测图像的分类结果。通过两种信息挖掘方式得到的第一特征向量和第二特征向量之间可以相互补充且相互约束,因此可以提高图像的检测准确率,另外通过块稀疏矩阵可以保证在计算采样子图像特征对应的块稀疏自注意力时,只关注和该采样子图像特征属于同一分类簇的采样子图像特征之间的相关性,降低计算复杂度,提高了检测速度。
进一步地,请参见图4,图4是本申请实施例提供的一种图像检测方法的流程示意图。其中,该图像检测方法可以由计算机设备执行,其中,计算机设备可以为上述图1a所示的业务服务器100,也可以为上述图1a所示的终端设备集群中的任一终端设备,例如终端设备10c。以下将以本图像检测方法由计算机设备执行为例进行说明。其中,该图像检测方法至少可以包括以下步骤S201-步骤S210:
步骤S201,获取待检测图像,对所述待检测图像进行特征提取处理,得到所述待检测图像的特征表示子集;所述待检测图像包括至少两个子图像;所述特征表示子集包括至少两个子图像特征,所述至少两个子图像特征与所 述至少两个子图像一一对应。
具体的,步骤S201的实现可以参见上述图2所对应实施例中对步骤S101的具体描述,这里不再进行赘述。
具体的,假设待检测图像为X,经过上述组织背景分割和图片剪裁后,得到至少两个子图像,至少两个子图像组成的图像集合可以表示为{x 1,x 2,…,x n},其中每个子图像x i称为一个待检测图像的实例。可以通过下述公式(1)来表示上述对至少两个子图像进行特征提取处理的过程:
H={h 1,h 2,…,h n}=Ff{x 1,x 2,…,x n}     公式(1)
其中,H为特征表示子集,h i∈R 1xd,默认d=1024,i为小于或等于n的正整数。Ff表示特征提取处理,通常基于选用的特征提取器决定。
步骤S202,将所述至少两个子图像特征输入图像识别模型中的第一注意力子网络;所述第一注意力子网络包括权重学习网络层和加权聚合网络层。
步骤S203,通过所述权重学习网络层,对所述至少两个子图像特征分别进行权重拟合处理,得到所述至少两个子图像特征各自对应的注意力权重。
具体的,权重学习网络层可以采用参数化的神经网络,来学习子图像特征对应的注意力权重,其中,上述公式(1)得到的特征表示子集H中的子图像特征h k对应的注意力权重a k可以表示为公式(2):
Figure PCTCN2022137773-appb-000001
其中,W和V为参数矩阵,tanh为非线性函数。由公式(2)可知,注意力权重a k仅与子图像特征h k有关,与其他子图像特征无关,换言之,权重学习网络层对特征表示子集H进行了独立分布的假设。
步骤S204,通过所述加权聚合网络层,根据所述注意力权重对每个子图像特征进行加权处理,得到所述每个子图像特征对应的加权子图像特征,对所述至少两个子图像特征各自对应的所述加权子图像特征进行聚合处理,得到第一特征向量。
具体的,在加权聚合网络层中,可以采样一次项非线性的注意力加权的方式对子图像特征进行有效的聚合,即可以通过公式(3)来计算:
Figure PCTCN2022137773-appb-000002
其中,X 1是第一特征向量,n是特征表示子集H中包含的子图像特征的数量,h k是特征表示子集H中第k个子图像特征,a k是特征表示子集H中第k个子图像特征对应的注意力权重。
步骤S205,将所述至少两个子图像特征输入图像识别模型中的第二注意力子网络;所述第二注意力子网络包括聚类采样网络层、全局自注意力网络层、自注意力网络层以及均值池化网络层。
具体的,将至少两个子图像特征输入第一注意力子网络和将至少两个子 图像特征输入第二注意力网络可以是同时进行的,且第一注意力子网络和第二注意力子网络互不影响。
步骤S206,通过所述聚类采样网络层,对所述至少两个子图像特征进行聚类采样处理,得到至少两个分类簇,所述分类簇包括采样子图像特征;所述至少两个分类簇各自包括的采样子图像特征的数量之和为N,N为小于所述至少两个子图像特征的数量的正整数。
具体的,计算机设备在聚类采样网络层中,可以先对至少两个子图像特征进行聚类处理,得到至少两个分类簇,然后获取至少两个分类簇中的第k个分类簇,其中,k为正整数,第k个分类簇包括至少一个聚类子图像特征;然后计算机设备可以获取至少一个聚类子图像特征各自与第k个分类簇的簇中心之间的向量距离,作为参考距离;根据参考距离,在至少一个聚类子图像特征中按序获取h个聚类子图像特征,将h个聚类子图像特征作为第k个分类簇包括的采样子图像特征,其中,h为正整数,且h小于或等于至少一个聚类子图像特征的数量。
可以理解,假设通过对至少两个子图像特征进行聚类处理得到的分类簇的数量为p,每个分类簇拥有一个簇中心,在对每个分类簇进行采样时,可以采集距离该分类簇的簇中心最近的h个子图像特征,最终可以共得到N个采样子图像特征,N=p×h。为了降低计算的复杂度,一般默认采样总数N=128。经过聚类处理和采样处理后,得到的N个采样子图像特征具有多样性,且可近似作为原始的至少两个子图像特征的有效代表。
具体的，上述聚类处理可以采用k-means（一种无监督聚类方法），或者其他聚类方式，这里不作限定。聚类后得到的至少两个分类簇中，每个分类簇中包括的子图像特征对应的子图像所属的图像类别相同。为便于理解，请一并参见图5，图5是本申请实施例提供的一种结直肠病理图像的聚类结果示意图。如图5所示，同一组的子图像对应的子图像特征属于同一分类簇，从图5中可以发现不同的病理图像大致都可以分成组织学结构特征相似的类别。例如，组1中的子图像包含有明显的癌变组织；组2中的子图像均为染色异常或者成像质量较差的子图像；组3中的子图像包含的基本是腺体组织；组4中的子图像均聚集较多的免疫细胞等等。
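下面给出聚类采样网络层的一段示意性实现（基于scikit-learn的KMeans），其中簇数p=16、每簇采样数h=8均为使N=p×h=128成立的示例取值，并非本申请限定的参数：

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_sample(H, p=16, h=8, seed=0):
    """对子图像特征聚类，并从每个分类簇中采样距簇中心最近的h个特征。
    H: [n, d]的特征数组；返回采样子图像特征及其所属分类簇编号。"""
    km = KMeans(n_clusters=p, n_init=10, random_state=seed).fit(H)
    feats, labels = [], []
    for k in range(p):
        idx = np.where(km.labels_ == k)[0]
        dist = np.linalg.norm(H[idx] - km.cluster_centers_[k], axis=1)  # 参考距离
        keep = idx[np.argsort(dist)[:h]]    # 按参考距离从小到大取前h个
        feats.append(H[keep])
        labels.extend([k] * len(keep))
    return np.concatenate(feats), np.array(labels)
```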
步骤S207,通过所述全局自注意力网络层,基于块稀疏矩阵确定N个采样子图像特征的块稀疏全局自注意力权重矩阵。
具体的，第二注意力子网络包括查询权重矩阵和键权重矩阵。计算机设备通过全局自注意力网络层，基于块稀疏矩阵确定N个采样子图像特征的块稀疏全局自注意力权重矩阵的过程，可以为：通过全局自注意力网络层，根据至少两个分类簇各自包括的采样子图像特征构建采样子图像特征矩阵；将采样子图像特征矩阵与查询权重矩阵相乘，得到查询矩阵，将采样子图像特征矩阵与键权重矩阵相乘，得到键矩阵；根据查询矩阵、键矩阵对应的转置矩阵以及块稀疏矩阵，确定块稀疏全局相关度矩阵；对块稀疏全局相关度矩阵进行归一化处理，得到块稀疏全局自注意力权重矩阵。
具体的，上述采样子图像特征矩阵可以表示为公式(4)：

$\tilde{H} = \{\tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_N\}$     公式(4)

其中，$\tilde{H}$为采样子图像特征矩阵，$\tilde{h}_i$为采样子图像特征矩阵$\tilde{H}$中的第i个采样子图像特征，i为小于或等于N的正整数，N为上述所说的128。
具体的，查询矩阵的计算可以通过公式(5)来表示：

$Q = \tilde{H}W_q = \{q_1, q_2, \ldots, q_N\}$     公式(5)

其中，$W_q$为查询权重矩阵，是由第二注意力子网络随机初始化得到的矩阵，$\tilde{H}$为上述采样子图像特征矩阵，Q即为查询矩阵，$q_i$为与采样子图像特征矩阵$\tilde{H}$中第i个采样子图像特征相关联的查询向量。
同理，键矩阵的计算可以通过公式(6)来表示：

$K = \tilde{H}W_k = \{k_1, k_2, \ldots, k_N\}$     公式(6)

其中，$W_k$为键权重矩阵，也是由第二注意力子网络随机初始化得到的矩阵，$\tilde{H}$为上述采样子图像特征矩阵，K即为键矩阵，$k_i$为与采样子图像特征矩阵$\tilde{H}$中第i个采样子图像特征相关联的键向量。
因此，块稀疏全局自注意力权重矩阵的计算可以通过公式(7)来表示：

$A = \mathrm{softmax}\left(\dfrac{QK^{\top} \odot B}{\sqrt{d_k}}\right)$     公式(7)

其中，Q为上述查询矩阵，$K^{\top}$为上述键矩阵的转置矩阵，B为与至少两个分类簇相关的块稀疏矩阵，⊙表示按元素相乘（即利用B对全局相关度进行过滤），$d_k$为上述所说的N，softmax的作用是归一化，A即为块稀疏全局自注意力权重矩阵。
为便于理解,请一并参见图6,图6是本申请实施例提供的一种对全局自注意力进行块稀疏约束的原理示意图。如图6所示,全局自注意力权重矩阵601是未被约束时的全局自注意力矩阵,此时,不同类别之间的采样子图像特征之间的信息也会被表示,计算机设备可以获取块稀疏矩阵602对其进行约束,就得到约束后的块稀疏全局自注意力权重矩阵603,由图6可知,块稀疏全局自注意力权重矩阵603中只会表示属于相同类别的采样子图像特征之间的信息。块稀疏约束本质上是利用类别的相关性,对全局自注意力进行了筛选,只关注和保留了同类别的注意力,去除了不同类别的注意力。相对于全局自注意力权重矩阵601,块稀疏全局自注意力权重矩阵603利用了实例的类别信息学习注意力,并做了适当的简化计算。
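下面给出公式(5)-(7)的一段示意性PyTorch实现。此处用"同簇为1、异簇为0"的掩码来构造块稀疏矩阵B，并按惯例以键向量维度进行缩放（原文将d_k取为N=128，二者仅影响缩放常数），均为说明用的假设，而非本申请限定的实现：

```python
import torch

def block_sparse_attention_weights(H_s, labels, W_q, W_k):
    """H_s: [N, d]采样子图像特征矩阵；labels: [N]每个特征所属的分类簇编号。"""
    Q = H_s @ W_q                                    # 公式(5)：查询矩阵
    K = H_s @ W_k                                    # 公式(6)：键矩阵
    B = (labels[:, None] == labels[None, :])         # 块稀疏矩阵：同簇位置为True
    scores = (Q @ K.T) / (K.shape[-1] ** 0.5)        # 全局相关度
    scores = scores.masked_fill(~B, float("-inf"))   # 过滤不同分类簇之间的相关度
    return torch.softmax(scores, dim=-1)             # 公式(7)：块稀疏全局自注意力权重矩阵A
```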
步骤S208,通过所述自注意力网络层,根据所述至少两个分类簇和所述块稀疏全局自注意力权重矩阵,确定每个采样子图像特征对应的块稀疏自注意力。
具体的，第二注意力子网络还包括值权重矩阵；上述N个采样子图像特征包括采样子图像特征N_a，a为小于或等于N的正整数。计算机设备通过自注意力网络层，将采样子图像特征矩阵和值权重矩阵相乘，得到值矩阵；将采样子图像特征N_a所属的分类簇中的采样子图像特征，作为目标采样子图像特征；从块稀疏全局自注意力权重矩阵中，获取采样子图像特征N_a和目标采样子图像特征之间的块稀疏全局自注意力权重，作为目标块稀疏全局自注意力权重；从值矩阵中，获取与目标采样子图像特征对应的值向量，作为目标值向量；根据目标值向量和目标块稀疏全局自注意力权重，确定采样子图像特征N_a对应的块稀疏自注意力。其中，采样子图像特征N_a即上述采样子图像特征矩阵$\tilde{H}$中的第a个采样子图像特征。
具体的，值矩阵的计算可以通过公式(8)来表示：

$V = \tilde{H}W_v = \{v_1, v_2, \ldots, v_N\}$     公式(8)

其中，$W_v$为值权重矩阵，是由第二注意力子网络随机初始化得到的矩阵，$\tilde{H}$为上述采样子图像特征矩阵，V即为值矩阵，$v_i$为与采样子图像特征矩阵$\tilde{H}$中第i个采样子图像特征相关联的值向量。
块稀疏自注意力的计算可以通过公式(9)来表示：

$z_a = \sum_{b=1,\; c(\tilde{h}_b)=c(\tilde{h}_a)}^{N} A_{ab}v_b$     公式(9)

其中，$z_a$是指采样子图像特征N_a对应的块稀疏自注意力；$\tilde{h}_a$是采样子图像特征矩阵$\tilde{H}$中的第a个采样子图像特征，也就是采样子图像特征N_a；$\tilde{h}_b$是采样子图像特征矩阵$\tilde{H}$中的第b个采样子图像特征，a和b皆为小于或等于N的正整数。$c(\tilde{h}_b)$是指$\tilde{h}_b$所属的分类簇的簇中心，$c(\tilde{h}_b)=c(\tilde{h}_a)$表示$\tilde{h}_b$与$\tilde{h}_a$属于同一簇中心对应的分类簇。求和下标中的$c(\tilde{h}_b)=c(\tilde{h}_a)$是一个约束条件，即对b从1到N进行遍历，只有当$\tilde{h}_b$和$\tilde{h}_a$属于同一分类簇时，才对$A_{ab}v_b$进行累加。$v_b$是值矩阵中与$\tilde{h}_b$相关联的值向量。$A_{ab}$是块稀疏全局自注意力权重矩阵中第a行第b列的块稀疏全局自注意力权重，即$\tilde{h}_a$和$\tilde{h}_b$之间的块稀疏全局自注意力权重。
步骤S209,通过所述均值池化网络层,对至少两个所述块稀疏自注意力进行均值池化处理,得到第二特征向量。
具体的，均值池化处理是指将至少两个块稀疏自注意力相加，然后求平均，得到的向量即为第二特征向量X_2。
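承接前文的示意代码，公式(8)(9)与均值池化可按如下方式实现：由于权重矩阵A已被块稀疏矩阵约束为"异簇位置为0"，矩阵乘法A@V的每一行恰好等于公式(9)中仅在同簇内累加的z_a。以下仅为一种示意性写法，复用了前文假设的block_sparse_attention_weights函数：

```python
def second_feature_vector(H_s, labels, W_q, W_k, W_v):
    """计算第二特征向量X2的示意实现。"""
    A = block_sparse_attention_weights(H_s, labels, W_q, W_k)
    V = H_s @ W_v          # 公式(8)：值矩阵
    Z = A @ V              # 每行即公式(9)中的块稀疏自注意力z_a
    return Z.mean(dim=0)   # 均值池化：相加求平均，得到第二特征向量X2
```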
步骤S210,将所述第一特征向量和所述第二特征向量输入图像识别模型的分类子网络;所述分类子网络包括特征融合网络层和分类网络层;通过所述特征融合网络层,对所述第一特征向量和所述第二特征向量进行特征融合处理,得到融合特征向量;通过所述分类网络层,对所述融合特征向量进行分类处理,得到所述待检测图像的分类结果。
具体的，第一注意力子网络会输出第一特征向量X_1，第二注意力子网络会输出第二特征向量X_2，两个并行的特征向量会在特征融合网络层进行特征融合，分类网络层可以采用MLP分类器，因此最终的输出可以表示为下述公式(10)：

$y = \mathrm{MLP}(\mathrm{Concat}(X_1, X_2))$     公式(10)

其中，Concat表示特征融合操作，常用的特征融合方式有特征拼接、加权求和等。最终的输出为y，可以是对待检测图像的正常预测概率，当正常预测概率低于某个阈值时，就可以确定待检测图像为异常图像。
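公式(10)的分类子网络可以用如下示意性PyTorch代码表示，其中MLP的层数与隐藏维度512均为说明用的假设：

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """特征融合网络层（拼接）+ 分类网络层（MLP）的示意实现。"""
    def __init__(self, d=1024, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x1, x2):
        fused = torch.cat([x1, x2], dim=-1)            # Concat：特征拼接
        return torch.softmax(self.mlp(fused), dim=-1)  # 输出正常/异常概率
```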
为便于理解上述步骤S202-步骤S210中的图像识别模型的结构，请一并参见图7，图7是本申请实施例提供的一种图像识别模型的结构示意图。如图7所示，该图像识别模型7包括第一注意力子网络71、第二注意力子网络72以及分类子网络73。计算机设备在得到待检测图像对应的特征表示子集700后，会将该特征表示子集700中的子图像特征顺序输入图像识别模型7。其中，特征表示子集700包括至少两个子图像特征。在图像识别模型7中，计算机设备会将特征表示子集700分别输入第一注意力子网络71和第二注意力子网络72。在第一注意力子网络71中，针对特征表示子集700中的每个子图像特征，均会通过一个参数化的神经网络来学习其对应的权重，例如，子图像特征711可以输入参数化的神经网络712，神经网络712会输出子图像特征711的权重，具体实现方式可以参见上述步骤S203。然后在第一注意力子网络71中，计算机设备可以采用一次项非线性的注意力加权的方式，对特征表示子集700中的所有子图像特征进行有效的聚合，最终得到第一特征向量713，其中，聚合的过程可以参见上述步骤S204。在第二注意力子网络72中，计算机设备会先对特征表示子集700进行无监督聚类，得到至少两个分类簇，例如分类簇721，分类簇721中的子图像特征对应的子图像属于相同类别的图像。然后，计算机设备会从每个分类簇中获取部分子图像特征，作为采样子图像特征。其中，聚类与采样的处理可以参见上述步骤S206。然后，在第二注意力子网络72中，计算机设备可以将采样子图像特征构成的采样子图像特征矩阵722经过矩阵变换得到键矩阵723、查询矩阵724以及值矩阵725，其中，矩阵变换可以参见上述公式(5)、公式(6)以及公式(8)，具体可以通过卷积核为1×1的卷积网络实现。然后在第二注意力子网络72中，可以根据键矩阵723的转置矩阵、查询矩阵724以及块稀疏矩阵确定块稀疏全局自注意力权重矩阵726，确定过程可以参见上述步骤S207。进一步根据块稀疏全局自注意力权重矩阵726和值矩阵725确定第二特征向量727，确定过程可以参见上述步骤S208和步骤S209。最后，计算机设备会将第一特征向量713和第二特征向量727输入分类子网络73。在分类子网络73中，对第一特征向量713和第二特征向量727进行特征融合后，会将特征融合后的向量输入分类器731，然后输出分类结果732，该分类结果732可以包括图像正常概率和图像异常概率。
采用本申请实施例提供的方法，图像识别模型中的第一注意力子网络和第二注意力子网络通过两种不同的方式挖掘待检测图像的信息，得到第一特征向量和第二特征向量，两个特征向量融合后可以相互补充、相互约束，对融合后的特征向量进行预测，得到的分类结果的准确率高。
进一步地,请参见图8,图8是本申请实施例提供的一种初始图像识别模型训练方法的流程示意图。其中,该初始图像识别模型训练方法可以由计算机设备执行,其中,计算机设备可以为上述图1a所示的业务服务器100,也可以为上述图1a所示的终端设备集群中的任一终端设备,例如终端设备10c。以下将以本初始图像识别模型训练方法由计算机设备执行为例进行说明。其中,该初始图像识别模型训练方法至少可以包括以下步骤S301-步骤S305:
步骤S301,获取样本图像,对所述样本图像进行特征提取处理,得到所述样本图像的样本特征表示子集;所述样本图像包括至少两个样本子图像;所述样本特征表示子集包括至少两个样本子图像特征,所述至少两个样本子图像特征与所述至少两个样本子图像一一对应。
具体的,步骤S301的实现过程可以参见上述图2所对应实施例中对步骤S101的描述,这里不再进行赘述。
步骤S302,将所述至少两个样本子图像输入初始图像识别模型,通过所述初始图像识别模型,生成所述至少两个样本子图像特征各自对应的样本注意力权重,根据所述至少两个样本子图像特征各自对应的样本注意力权重,对所述至少两个样本子图像特征进行加权聚合处理,得到第一样本特征向量。
具体的,初始图像识别模型可以包括第一初始注意力子网络,计算机设备可以通过该第一初始注意力子网络,生成至少两个样本子图像特征各自对应的样本注意力权重,根据至少两个样本子图像特征各自对应的样本注意力权重,对至少两个样本子图像特征进行加权聚合处理,得到第一样本特征向量,具体实现过程可以参见上述图4所对应实施例中步骤S202-步骤S204的描述,这里不再进行赘述。
步骤S303,通过所述初始图像识别模型,对所述至少两个样本子图像特征进行聚类采样处理,得到至少两个样本分类簇,所述样本分类簇包括样本采样子图像特征,根据所述至少两个样本分类簇和块稀疏矩阵确定每个所述样本采样子图像特征对应的样本块稀疏自注意力,根据至少两个样本块稀疏自注意力确定第二样本特征向量;所述样本采样子图像特征对应的样本块稀疏自注意力是基于其所属的样本分类簇中的样本采样子图像特征确定的。
具体的，初始图像识别模型还可以包括第二初始注意力子网络，然后通过该第二初始注意力子网络，对至少两个样本子图像特征进行聚类采样处理，得到至少两个样本分类簇各自包括的样本采样子图像特征，根据至少两个样本分类簇和块稀疏矩阵确定每个样本采样子图像特征对应的样本块稀疏自注意力，根据至少两个样本块稀疏自注意力确定第二样本特征向量，具体实现过程可以参见上述图4所对应实施例中步骤S205-步骤S209的描述，这里不再进行赘述。
步骤S304,通过所述初始图像识别模型,根据所述第一样本特征向量和所述第二样本特征向量,确定所述样本图像的样本分类结果。
具体的,初始图像识别模型还可以包括初始分类子网络,然后通过该初始分类子网络,根据第一样本特征向量和第二样本特征向量确定样本图像的样本分类结果,具体实现可以参见上述图4所对应实施例中步骤S210的描述,这里不再进行赘述。
步骤S305,根据所述至少两个样本分类簇、所述至少两个样本子图像特征各自对应的注意力权重、所述样本分类结果以及所述样本图像对应的分类标签,对所述初始图像识别模型进行模型参数调整,得到用于识别待检测图像的分类结果的图像识别模型。
具体的,因为最终得到的图像识别模型中第一注意力子网络和第二注意力子网络的输入都是相同的子图像特征,因此第一注意力子网络对至少两个子图像特征的注意力分布和第二注意力子网络对至少两个子图像特征的注意力分布应该是一致的,因此,计算机设备在对初始图像识别模型进行训练的过程中,可以先根据至少两个样本分类簇以及至少两个样本子图像特征各自对应的样本注意力权重,确定散度损失值;然后根据样本分类结果以及样本图像对应的分类标签,确定分类损失值;最后对散度损失值和分类损失值进行加权求和,得到模型总损失值;根据模型总损失值对初始图像识别模型进行模型参数调整,得到图像识别模型。其中,散度损失值用于保证最终训练得到的图像识别模型的两个网络分支对同样的子图像特征输入的注意力分布一致。分类损失值用于保证最终训练得到的图像识别模型输出的分类结果能更接近真实结果。
具体的,上述根据至少两个样本分类簇以及至少两个样本子图像特征各自对应的样本注意力权重,确定散度损失值的实现过程,可以为:获取至少两个样本分类簇中的第i个样本分类簇;i为正整数,且i小于或等于至少两个样本分类簇的数量;将第i个样本分类簇包括的样本子图像特征,作为目标样本子图像特征;根据目标样本子图像特征对应的样本注意力权重和目标样本子图像特征的数量,确定第i个样本分类簇对应的类别散度损失值;将每个样本分类簇对应的类别散度损失值进行累加,得到散度损失值。
因为计算机设备在对样本图像进行图像检测时，在第二初始注意力子网络中对样本图像包括的样本子图像特征进行了聚类，得到了至少两个样本分类簇，同一样本分类簇中的样本子图像特征在第二初始注意力子网络中的关注度是相同的，因此在第一初始注意力子网络中，同一样本分类簇中的样本子图像特征的关注度也应该是相同的。例如，样本图像包括6个样本子图像特征，即B1、B2、B3、B4、B5和B6，在第一初始注意力子网络中生成的样本注意力权重依次为0.10、0.22、0.11、0.31、0.22、0.12，在第二初始注意力子网络中生成的样本分类簇为：样本分类簇1{B1,B3,B6}，样本分类簇2{B2,B4,B5}，可见，样本分类簇1中的B1、B3、B6对应的样本注意力权重接近一致，合理；但是样本分类簇2中的B4对应的样本注意力权重明显高于B2和B5，这是不合理的，因此需要通过散度损失值来进行调整。也就是说，同一样本分类簇中的样本子图像特征在第一注意力子网络中生成的注意力权重应当服从均匀分布，因此每个样本分类簇可以确定出一个类别散度损失值。最后将每个样本分类簇对应的类别散度损失值进行累加，就得到散度损失值。
具体的，根据目标样本子图像特征对应的样本注意力权重和目标样本子图像特征的数量，确定第i个样本分类簇对应的类别散度损失值的实现过程，可以为：获取由目标样本子图像特征对应的样本注意力权重构成的拟合注意力分布；对该拟合注意力分布进行归一化处理，得到归一化拟合注意力分布；将目标样本子图像特征的数量对应的均匀注意力分布，作为注意力分布标签；根据归一化拟合注意力分布和注意力分布标签，确定第i个样本分类簇对应的类别散度损失值。
假设目标样本子图像特征对应的样本注意力权重为0.10、0.12、0.11，则构成的拟合注意力分布即为[0.10,0.12,0.11]。由于后续计算类别散度损失值时要求输入为概率分布，因此需要对该拟合注意力分布进行归一化处理（使各项之和为1），得到归一化拟合注意力分布约为[0.303,0.364,0.333]。目标样本子图像特征的数量为3，则将对应的均匀注意力分布[1/3,1/3,1/3]作为注意力分布标签。
根据归一化拟合注意力分布和注意力分布标签，确定第i个样本分类簇对应的类别散度损失值的过程，可以通过下述公式(11)来表示：

$D_{KL}(P \parallel D) = \sum_{i=1}^{G} p(x_i)\log\dfrac{p(x_i)}{d(x_i)}$     公式(11)

其中，G为目标样本子图像特征的数量，$p(x_i)$为注意力分布标签中的第i个值，$d(x_i)$为归一化拟合注意力分布中的第i个值；$D_{KL}(P \parallel D)$为类别散度损失值。
因此，上述散度损失值的计算可以通过下述公式(12)来实现：

$KL = \sum_{i=1}^{c} D_{KL}(U \parallel D_i)$     公式(12)

其中，c为至少两个样本分类簇中的样本分类簇的数量，$D_{KL}(U \parallel D_i)$指至少两个样本分类簇中的第i个样本分类簇的类别散度损失值（U为均匀注意力分布）；KL为散度损失值。
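公式(11)(12)对应的散度损失值计算可以用如下示意性代码表示，其中attn_weights为第一初始注意力子网络输出的样本注意力权重，labels为聚类得到的样本分类簇编号，均为示意用的变量名：

```python
import torch

def divergence_loss(attn_weights, labels):
    """对每个样本分类簇计算类别散度损失值D_KL(P‖D)并累加，得到散度损失值KL。
    假设attn_weights均为正数，否则需先做非负变换。"""
    kl = attn_weights.new_zeros(())
    for k in labels.unique():
        a = attn_weights[labels == k]
        d = a / a.sum()                          # 归一化拟合注意力分布
        p = torch.full_like(d, 1.0 / d.numel())  # 均匀注意力分布（注意力分布标签）
        kl = kl + (p * (p / d).log()).sum()      # 公式(11)：D_KL(P‖D)
    return kl                                    # 公式(12)：各簇类别散度损失之和
```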
具体的,总损失值的确定就可以通过下述公式(13)来实现:
$Loss = CE(y, y') + \alpha \cdot KL$     公式(13)

其中，y表示样本图像对应的分类标签；y′表示上述初始分类子网络输出的样本分类结果；CE表示用于计算分类损失值的交叉熵损失函数；KL为上述散度损失值；α表示散度损失值的权重，默认为0.01。
具体的，初始图像识别模型训练时，可以训练100个epoch（轮次），优化器默认采用Adam（一种优化算法），初始学习率为1e-4，采用余弦退火策略调整学习率，最小学习率为1e-6。
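上述训练配置可以用如下示意性PyTorch代码构建；其中以PyTorch内置的CosineAnnealingLR作为余弦退火策略的一种可行实现，属于说明用的假设：

```python
import torch

def make_optimizer(model, epochs=100):
    """按"Adam + 初始学习率1e-4 + 余弦退火至最小学习率1e-6"的配置构建优化器与调度器。"""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-6)
    return optimizer, scheduler
```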
采用本申请实施例提供的方法，在对初始图像识别模型进行训练时，对第一初始注意力子网络和第二初始注意力子网络的注意力分布进行了额外约束，即在损失函数中增加K-L散度损失函数，保证两个子网络对同样的多实例输入的注意力分布一致，最终训练得到的图像识别模型的图像检测准确率高。
请参见图9,图9是本申请实施例提供的一种图像检测装置的结构示意图。该图像检测装置可以是运行于计算机设备的一个计算机程序(包括程序代码),例如该图像检测装置为一个应用软件;该装置可以用于执行本申请实施例提供的图像检测方法中的相应步骤。如图9所示,该图像检测装置1可以包括:特征提取模块11、第一向量生成模块12、第二向量生成模块13以及分类模块14。
特征提取模块11,用于获取待检测图像,对待检测图像进行特征提取处理,得到待检测图像的特征表示子集;待检测图像包括至少两个子图像;特征表示子集包括至少两个子图像特征,至少两个子图像特征与至少两个子图像一一对应;
第一向量生成模块12,用于生成至少两个子图像特征各自对应的注意力权重,根据注意力权重对至少两个子图像特征进行加权聚合处理,得到第一特征向量;
第二向量生成模块13,用于对至少两个子图像特征进行聚类采样处理,得到至少两个分类簇,分类簇包括采样子图像特征,根据至少两个分类簇和块稀疏矩阵,确定每个采样子图像特征对应的块稀疏自注意力,根据至少两个块稀疏自注意力确定第二特征向量;采样子图像特征对应的块稀疏自注意力是基于其所属的分类簇中的采样子图像特征确定的;
分类模块14,用于根据第一特征向量和第二特征向量,确定待检测图像的分类结果。
其中,特征提取模块11、第一向量生成模块12、第二向量生成模块13以及分类模块14的具体实现方式,可以参见上述图2所对应实施例的相关描述,这里将不再进行赘述。
其中,特征提取模块11,包括:预处理单元111以及特征提取单元112。
预处理单元111,用于识别待检测图像中的背景区域和前景区域;
预处理单元111,还用于根据背景区域和前景区域,对待检测图像进行图像分割,得到待检测前景图像;
预处理单元111,还用于根据缩放倍率对待检测前景图像进行缩放处理,得到待裁剪前景图像;
预处理单元111,还用于根据子图像预设长度和子图像预设宽度,对待裁剪前景图像进行裁剪处理,得到至少两个子图像;子图像预设长度小于待裁剪前景图像的长度;子图像预设宽度小于待裁剪前景图像的宽度;
特征提取单元112,用于对至少两个子图像分别进行图像特征提取处理,得到至少两个子图像各自对应的子图像特征,根据至少两个子图像各自对应的子图像特征,确定待检测图像的特征表示子集。
其中,预处理单元111以及特征提取单元112的具体实现方式,可以参见上述图2所对应实施例的相关描述,这里将不再进行赘述。
其中,第一向量生成模块12,包括:第一输入单元121、权重拟合单元122以及聚合单元123。
第一输入单元121,用于将至少两个子图像特征输入图像识别模型中的第一注意力子网络;第一注意力子网络包括权重学习网络层和加权聚合网络层;
权重拟合单元122,用于通过权重学习网络层,对至少两个子图像特征分别进行权重拟合处理,得到至少两个子图像特征各自对应的注意力权重;
聚合单元123,用于通过加权聚合网络层,根据注意力权重对每个子图像特征进行加权处理,得到每个子图像特征对应的加权子图像特征,对至少两个子图像特征各自对应的加权子图像特征进行聚合处理,得到第一特征向量。
其中,第一输入单元121、权重拟合单元122以及聚合单元123的具体实现方式,可以参见上述图4所对应实施例的相关描述,这里将不再进行赘述。
其中,第二向量生成模块13,包括:第二输入单元131、聚类采样单元132、全局自注意力确定单元133、自注意力确定单元134以及均值池化单元135。
第二输入单元131,用于将至少两个子图像特征输入图像识别模型中的第二注意力子网络;第二注意力子网络包括聚类采样网络层、全局自注意力网络层、自注意力网络层以及均值池化网络层;
聚类采样单元132,用于通过聚类采样网络层,对至少两个子图像特征进行聚类采样处理,得到至少两个分类簇,分类簇包括采样子图像特征;至少两个分类簇各自包括的采样子图像特征的数量之和为N,N为小于至少两个子图像特征的数量的正整数;
全局自注意力确定单元133,用于通过全局自注意力网络层,基于块稀疏矩阵确定N个采样子图像特征的块稀疏全局自注意力权重矩阵;
自注意力确定单元134,用于通过自注意力网络层,根据至少两个分类簇和块稀疏全局自注意力权重矩阵,确定每个采样子图像特征对应的块稀疏自注意力;
均值池化单元135,用于通过均值池化网络层,对至少两个块稀疏自注意力进行均值池化处理,得到第二特征向量。
其中,第二输入单元131、聚类采样单元132、全局自注意力确定单元133、自注意力确定单元134以及均值池化单元135的具体实现方式,可以参见上述图4所对应实施例的相关描述,这里将不再进行赘述。
其中,聚类采样单元132,包括:聚类子单元1321以及采样子单元1322。
聚类子单元1321,用于通过聚类采样网络层,对至少两个子图像特征进行聚类处理,得到至少两个分类簇;
采样子单元1322,用于获取至少两个分类簇中的第k个分类簇;k为正整数;第k个分类簇包括至少一个聚类子图像特征;
采样子单元1322,还用于获取至少一个聚类子图像特征各自与第k个分类簇的簇中心之间的向量距离,作为参考距离;
采样子单元1322,还用于根据参考距离,在至少一个聚类子图像特征中按序获取h个聚类子图像特征,将h个聚类子图像特征作为第k个分类簇包括的采样子图像特征;h为正整数,且h小于或等于至少一个聚类子图像特征的数量。
其中,聚类子单元1321以及采样子单元1322的具体实现方式,可以参见上述图4所对应实施例的相关描述,这里将不再进行赘述。
其中,第二注意力子网络包括查询权重矩阵和键权重矩阵;
全局自注意力确定单元133,包括:第一初始化子单元1331、矩阵确定子单元1332以及归一化子单元1333。
第一初始化子单元1331,用于通过全局自注意力网络层,根据至少两个分类簇各自包括的采样子图像特征,构建采样子图像特征矩阵;
第一初始化子单元1331,还用于将采样子图像特征矩阵与查询权重矩阵相乘,得到查询矩阵,将采样子图像特征矩阵与键权重矩阵相乘,得到键矩阵;
矩阵确定子单元1332,用于根据查询矩阵、键矩阵对应的转置矩阵以及块稀疏矩阵,确定块稀疏全局相关度矩阵;
归一化子单元1333,用于对块稀疏全局相关度矩阵进行归一化处理,得到块稀疏全局自注意力权重矩阵。
其中,第一初始化子单元1331、矩阵确定子单元1332以及归一化子单元1333的具体实现方式,可以参见上述图4所对应实施例的相关描述,这里将不再进行赘述。
其中，第二注意力子网络还包括值权重矩阵；N个采样子图像特征包括采样子图像特征N_j，j为小于或等于N的正整数；
自注意力确定单元134，包括：目标获取子单元1341以及确定子单元1342。
目标获取子单元1341，用于通过自注意力网络层，将采样子图像特征矩阵和值权重矩阵相乘，得到值矩阵；
目标获取子单元1341，还用于将采样子图像特征N_j所属的分类簇中的采样子图像特征，作为目标采样子图像特征；
目标获取子单元1341，还用于从块稀疏全局自注意力权重矩阵中，获取采样子图像特征N_j和目标采样子图像特征之间的块稀疏全局自注意力权重，作为目标块稀疏全局自注意力权重；
目标获取子单元1341，还用于从值矩阵中，获取与目标采样子图像特征对应的值向量，作为目标值向量；
确定子单元1342，用于根据目标值向量和目标块稀疏全局自注意力权重，确定采样子图像特征N_j对应的块稀疏自注意力。
其中,目标获取子单元1341以及确定子单元1342的具体实现方式,可以参见上述图4所对应实施例的相关描述,这里将不再进行赘述。
其中,分类模块14,包括:第三输入单元141、特征融合单元142以及分类单元143。
第三输入单元141,用于将第一特征向量和第二特征向量输入图像识别模型的分类子网络;分类子网络包括特征融合网络层和分类网络层;
特征融合单元142,用于通过特征融合网络层,对第一特征向量和第二特征向量进行特征融合处理,得到融合特征向量;
分类单元143,用于通过分类网络层,对融合特征向量进行分类处理,得到待检测图像的分类结果。
其中,第三输入单元141、特征融合单元142以及分类单元143的具体实现方式,可以参见上述图4所对应实施例的相关描述,这里将不再进行赘述。
请参见图10，图10是本申请实施例提供的一种计算机设备的结构示意图。如图10所示，上述图9所对应实施例中的图像检测装置1可以应用于计算机设备1000，该计算机设备1000可以包括：处理器1001，网络接口1004和存储器1005，此外，上述计算机设备1000还可以包括：用户接口1003，和至少一个通信总线1002。其中，通信总线1002用于实现这些组件之间的连接通信。其中，用户接口1003可以包括显示屏（Display）、键盘（Keyboard），可选用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可选的可以包括标准的有线接口、无线接口（如WI-FI接口）。存储器1005可以是高速RAM存储器，也可以是非易失性存储器（non-volatile memory），例如至少一个磁盘存储器。存储器1005可选的还可以是至少一个位于远离前述处理器1001的存储装置。如图10所示，作为一种计算机可读存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。
在如图10所示的计算机设备1000中，网络接口1004可提供网络通讯功能；而用户接口1003主要用于为用户提供输入的接口；而处理器1001可以用于调用存储器1005中存储的设备控制应用程序，以实现本申请实施例提供的图像检测方法。
应当理解,本申请实施例中所描述的计算机设备1000可执行前文图2、图4任一个所对应实施例中对该图像检测方法的描述,在此不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
此外,这里需要指出的是:本申请实施例还提供了一种计算机可读存储介质,且上述计算机可读存储介质中存储有前文提及的图像检测装置1所执行的计算机程序,且上述计算机程序包括程序指令,当上述处理器执行上述程序指令时,能够执行前文图2、图4任一个所对应实施例中对上述图像检测方法的描述,因此,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的计算机可读存储介质实施例中未披露的技术细节,请参照本申请方法实施例的描述。
进一步地,请参见图11,图11是本申请实施例提供的另一种图像检测装置的结构示意图。该图像检测装置2可以是运行于计算机设备中的一个计算机程序(包括程序代码),例如该图像检测装置2为一个应用软件;该装置可以用于执行本申请实施例提供的方法中的相应步骤。如图11所示,该图像检测装置2可以包括:样本特征提取模块21、第一样本向量生成模块22、第二样本向量生成模块23、样本分类模块24以及训练模块25。
样本特征提取模块21,用于获取样本图像,对样本图像进行特征提取处理,得到样本图像的样本特征表示子集;样本图像包括至少两个样本子图像;样本特征表示子集包括至少两个样本子图像特征,至少两个样本子图像特征与至少两个样本子图像一一对应;
第一样本向量生成模块22,用于将至少两个样本子图像输入初始图像识别模型,通过初始图像识别模型,生成至少两个样本子图像特征各自对应的样本注意力权重,根据至少两个样本子图像特征各自对应的样本注意力权重,对至少两个样本子图像特征进行加权聚合处理,得到第一样本特征向量;
第二样本向量生成模块23,用于通过初始图像识别模型,对至少两个样本子图像特征进行聚类采样处理,得到至少两个样本分类簇,样本分类簇包括样本采样子图像特征,根据至少两个样本分类簇和块稀疏矩阵,确定每个样本采样子图像特征对应的样本块稀疏自注意力,根据至少两个样本块稀疏自注意力确定第二样本特征向量;样本采样子图像特征对应的样本块稀疏自注意力是基于其所属的样本分类簇中的样本采样子图像特征确定的;
样本分类模块24,用于通过初始图像识别模型,根据第一样本特征向量和第二样本特征向量,确定样本图像的样本分类结果;
训练模块25,用于根据至少两个样本分类簇、至少两个样本子图像特征各自对应的注意力权重、样本分类结果以及样本图像对应的分类标签,对初始图像识别模型进行模型参数调整,得到用于识别待检测图像的分类结果的图像识别模型。
其中,样本特征提取模块21、第一样本向量生成模块22、第二样本向量生成模块23、样本分类模块24以及训练模块25的具体实现方式,可以参见上述图8所对应实施例的相关描述,这里将不再进行赘述。
其中,训练模块25,包括:散度损失值确定单元251、分类损失值确定单元252、加权求和单元253以及模型调整单元254。
散度损失值确定单元251,用于根据至少两个样本分类簇以及至少两个样本子图像特征各自对应的样本注意力权重,确定散度损失值;
分类损失值确定单元252,用于根据样本分类结果以及样本图像对应的分类标签,确定分类损失值;
加权求和单元253,用于对散度损失值和分类损失值进行加权求和,得到模型总损失值;
模型调整单元254,用于根据模型总损失值,对初始图像识别模型进行模型参数调整,得到图像识别模型。
其中,散度损失值确定单元251、分类损失值确定单元252、加权求和单元253以及模型调整单元254的具体实现方式,可以参见上述图8所对应实施例的相关描述,这里将不再进行赘述。
其中,散度损失值确定单元251,包括:获取子单元2511、类别损失值确定子单元2512以及总损失值确定子单元2513。
获取子单元2511,用于获取至少两个样本分类簇中的第i个样本分类簇;i为正整数,且i小于或等于至少两个样本分类簇的数量;
获取子单元2511,还用于将第i个样本分类簇包括的样本子图像特征,作为目标样本子图像特征;
类别损失值确定子单元2512,用于根据目标样本子图像特征对应的样本注意力权重和目标样本子图像特征的数量,确定第i个样本分类簇对应的类别散度损失值;
总损失值确定子单元2513,将每个样本分类簇对应的类别散度损失值进行累加,得到散度损失值。
其中，类别损失值确定子单元2512具体用于获取由目标样本子图像特征对应的样本注意力权重构成的拟合注意力分布；对该拟合注意力分布进行归一化处理，得到归一化拟合注意力分布；将目标样本子图像特征的数量对应的均匀注意力分布，作为注意力分布标签；根据归一化拟合注意力分布和注意力分布标签，确定第i个样本分类簇对应的类别散度损失值。
其中，获取子单元2511、类别损失值确定子单元2512以及总损失值确定子单元2513的具体实现方式，可以参见上述图8所对应实施例的相关描述，这里将不再进行赘述。
进一步地，请参见图12，图12是本申请实施例提供的另一种计算机设备的结构示意图。如图12所示，上述图11所对应实施例中的图像检测装置2可以应用于计算机设备2000，该计算机设备2000可以包括：处理器2001，网络接口2004和存储器2005，此外，上述计算机设备2000还包括：用户接口2003，和至少一个通信总线2002。其中，通信总线2002用于实现这些组件之间的连接通信。其中，用户接口2003可以包括显示屏（Display）、键盘（Keyboard），可选用户接口2003还可以包括标准的有线接口、无线接口。网络接口2004可选的可以包括标准的有线接口、无线接口（如WI-FI接口）。存储器2005可以是高速RAM存储器，也可以是非易失性存储器（non-volatile memory），例如至少一个磁盘存储器。存储器2005可选的还可以是至少一个位于远离前述处理器2001的存储装置。如图12所示，作为一种计算机可读存储介质的存储器2005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。
在图12所示的计算机设备2000中,网络接口2004可提供网络通讯功能;而用户接口2003主要用于为用户提供输入的接口;而处理器2001可以用于调用存储器2005中存储的设备控制应用程序,以实现本申请实施例提供初始图像识别模型训练方法。
应当理解，本申请实施例中所描述的计算机设备2000可执行前文各个实施例中对该初始图像识别模型训练方法的描述，也可执行前文图11所对应实施例中对该图像检测装置2的描述，在此不再赘述。另外，对采用相同方法的有益效果描述，也不再进行赘述。
此外，这里需要指出的是：本申请实施例还提供了一种计算机可读存储介质，且上述计算机可读存储介质中存储有前文提及的图像检测装置2所执行的计算机程序，当上述处理器加载并执行上述计算机程序时，能够执行前文任一实施例对上述初始图像识别模型训练方法的描述，因此，这里将不再进行赘述。另外，对采用相同方法的有益效果描述，也不再进行赘述。对于本申请所涉及的计算机可读存储介质实施例中未披露的技术细节，请参照本申请方法实施例的描述。
上述计算机可读存储介质可以是前述任一实施例提供的图像检测装置或者上述计算机设备的内部存储单元，例如计算机设备的硬盘或内存。该计算机可读存储介质也可以是该计算机设备的外部存储设备，例如该计算机设备上配备的插接式硬盘，智能存储卡（smart media card，SMC），安全数字（secure digital，SD）卡，闪存卡（flash card）等。进一步地，该计算机可读存储介质还可以既包括该计算机设备的内部存储单元也包括外部存储设备。该计算机可读存储介质用于存储该计算机程序以及该计算机设备所需的其他程序和数据。该计算机可读存储介质还可以用于暂时地存储已经输出或者将要输出的数据。
此外,这里需要指出的是:本申请实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行前文任一个所对应实施例提供的方法。
本申请实施例的说明书和权利要求书及附图中的术语"第一"、"第二"等是用于区别不同对象，而非用于描述特定顺序。此外，术语"包括"以及它们任何变形，意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、装置、产品或设备没有限定于已列出的步骤或单元，而是可选地还包括没有列出的步骤或单元，或可选地还包括对于这些过程、方法、装置、产品或设备固有的其他步骤或单元。
本领域普通技术人员可以意识到，结合本文中所公开的实施例描述的各示例的单元及算法步骤，能够以电子硬件、计算机软件或者二者的结合来实现，为了清楚地说明硬件和软件的可互换性，在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行，取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本申请的范围。
以上所揭露的仅为本申请较佳实施例而已,当然不能以此来限定本申请之权利范围,因此依本申请权利要求所作的等同变化,仍属本申请所涵盖的范围。

Claims (17)

  1. 一种图像检测方法,由计算机设备执行,包括:
    获取待检测图像,对所述待检测图像进行特征提取处理,得到所述待检测图像的特征表示子集;所述待检测图像包括至少两个子图像;所述特征表示子集包括至少两个子图像特征,所述至少两个子图像特征与所述至少两个子图像一一对应;
    生成所述至少两个子图像特征各自对应的注意力权重,根据所述注意力权重,对所述至少两个子图像特征进行加权聚合处理,得到第一特征向量;
    对所述至少两个子图像特征进行聚类采样处理,得到至少两个分类簇,所述分类簇包括采样子图像特征,根据所述至少两个分类簇和块稀疏矩阵,确定每个所述采样子图像特征对应的块稀疏自注意力,根据至少两个所述块稀疏自注意力确定第二特征向量;所述采样子图像特征对应的块稀疏自注意力是基于其所属的分类簇中的采样子图像特征确定的;
    根据所述第一特征向量和所述第二特征向量,确定所述待检测图像的分类结果。
  2. 根据权利要求1所述的方法,所述对所述待检测图像进行特征提取处理,得到所述待检测图像的特征表示子集,包括:
    识别所述待检测图像中的背景区域和前景区域;
    根据所述背景区域和所述前景区域,对所述待检测图像进行图像分割,得到待检测前景图像;
    根据缩放倍率对所述待检测前景图像进行缩放处理,得到待裁剪前景图像;
    根据子图像预设长度和子图像预设宽度,对所述待裁剪前景图像进行裁剪处理,得到所述至少两个子图像;所述子图像预设长度小于所述待裁剪前景图像的长度;所述子图像预设宽度小于所述待裁剪前景图像的宽度;
    对所述至少两个子图像分别进行图像特征提取处理,得到所述至少两个子图像各自对应的子图像特征,根据所述至少两个子图像各自对应的子图像特征,确定所述待检测图像的特征表示子集。
  3. 根据权利要求1所述的方法,所述生成所述至少两个子图像特征各自对应的注意力权重,根据所述注意力权重,对所述至少两个子图像特征进行加权聚合处理,得到第一特征向量,包括:
    将所述至少两个子图像特征输入图像识别模型中的第一注意力子网络;所述第一注意力子网络包括权重学习网络层和加权聚合网络层;
    通过所述权重学习网络层,对所述至少两个子图像特征分别进行权重拟合处理,得到所述至少两个子图像特征各自对应的注意力权重;
    通过所述加权聚合网络层，根据所述注意力权重对每个所述子图像特征进行加权处理，得到每个所述子图像特征对应的加权子图像特征，对所述至少两个子图像特征各自对应的所述加权子图像特征进行聚合处理，得到所述第一特征向量。
  4. 根据权利要求1所述的方法,所述对所述至少两个子图像特征进行聚类采样处理,得到至少两个分类簇,所述分类簇包括采样子图像特征,根据所述至少两个分类簇和块稀疏矩阵,确定每个所述采样子图像特征对应的块稀疏自注意力,根据至少两个所述块稀疏自注意力确定第二特征向量,包括:
    将所述至少两个子图像特征输入图像识别模型中的第二注意力子网络;所述第二注意力子网络包括聚类采样网络层、全局自注意力网络层、自注意力网络层以及均值池化网络层;
    通过所述聚类采样网络层,对所述至少两个子图像特征进行聚类采样处理,得到所述至少两个分类簇;所述分类簇包括所述采样子图像特征;所述至少两个分类簇各自包括的采样子图像特征的数量之和为N,N为小于所述至少两个子图像特征的数量的正整数;
    通过所述全局自注意力网络层,基于块稀疏矩阵确定N个采样子图像特征的块稀疏全局自注意力权重矩阵;
    通过所述自注意力网络层,根据所述至少两个分类簇和所述块稀疏全局自注意力权重矩阵,确定每个所述采样子图像特征对应的块稀疏自注意力;
    通过所述均值池化网络层,对至少两个所述块稀疏自注意力进行均值池化处理,得到第二特征向量。
  5. 根据权利要求4所述的方法,所述通过所述聚类采样网络层,对所述至少两个子图像特征进行聚类采样处理,得到所述至少两个分类簇,包括:
    通过所述聚类采样网络层,对所述至少两个子图像特征进行聚类处理,得到所述至少两个分类簇;
    获取所述至少两个分类簇中的第k个分类簇;k为正整数;所述第k个分类簇包括至少一个聚类子图像特征;
    获取所述至少一个聚类子图像特征各自与所述第k个分类簇的簇中心之间的向量距离,作为参考距离;
    根据所述参考距离,在所述至少一个聚类子图像特征中按序获取h个聚类子图像特征,将所述h个聚类子图像特征作为所述第k个分类簇包括的采样子图像特征;h为正整数,且h小于或等于所述至少一个聚类子图像特征的数量。
  6. 根据权利要求4所述的方法,所述第二注意力子网络包括查询权重矩阵和键权重矩阵;所述通过所述全局自注意力网络层,基于块稀疏矩阵确定N个采样子图像特征的块稀疏全局自注意力权重矩阵,包括:
    通过所述全局自注意力网络层,根据所述至少两个分类簇各自包括的采样子图像特征,构建采样子图像特征矩阵;
    将所述采样子图像特征矩阵与所述查询权重矩阵相乘,得到查询矩阵, 将所述采样子图像特征矩阵与所述键权重矩阵相乘,得到键矩阵;
    根据所述查询矩阵、所述键矩阵对应的转置矩阵以及所述块稀疏矩阵,确定块稀疏全局相关度矩阵;
    对所述块稀疏全局相关度矩阵进行归一化处理,得到所述块稀疏全局自注意力权重矩阵。
  7. 根据权利要求6所述的方法，所述第二注意力子网络还包括值权重矩阵；所述N个采样子图像特征包括采样子图像特征N_j，j为小于或等于N的正整数；所述通过所述自注意力网络层，根据所述至少两个分类簇和所述块稀疏全局自注意力权重矩阵，确定每个所述采样子图像特征对应的块稀疏自注意力，包括：
    通过所述自注意力网络层，将所述采样子图像特征矩阵和所述值权重矩阵相乘，得到值矩阵；
    将所述采样子图像特征N_j所属的分类簇中的采样子图像特征，作为目标采样子图像特征；
    从所述块稀疏全局自注意力权重矩阵中，获取所述采样子图像特征N_j和所述目标采样子图像特征之间的块稀疏全局自注意力权重，作为目标块稀疏全局自注意力权重；
    从所述值矩阵中，获取与所述目标采样子图像特征对应的值向量，作为目标值向量；
    根据所述目标值向量和所述目标块稀疏全局自注意力权重，确定所述采样子图像特征N_j对应的块稀疏自注意力。
  8. 根据权利要求1所述的方法,所述根据所述第一特征向量和所述第二特征向量,确定所述待检测图像的分类结果,包括:
    将所述第一特征向量和所述第二特征向量输入图像识别模型的分类子网络;所述分类子网络包括特征融合网络层和分类网络层;
    通过所述特征融合网络层,对所述第一特征向量和所述第二特征向量进行特征融合处理,得到融合特征向量;
    通过所述分类网络层,对所述融合特征向量进行分类处理,得到所述待检测图像的分类结果。
  9. 一种图像检测方法,由计算机设备执行,包括:
    获取样本图像,对所述样本图像进行特征提取处理,得到所述样本图像的样本特征表示子集;所述样本图像包括至少两个样本子图像;所述样本特征表示子集包括至少两个样本子图像特征,所述至少两个样本子图像特征与所述至少两个样本子图像一一对应;
    将所述至少两个样本子图像输入初始图像识别模型,通过所述初始图像识别模型,生成所述至少两个样本子图像特征各自对应的样本注意力权重,根据所述至少两个样本子图像特征各自对应的样本注意力权重,对所述至少两个样本子图像特征进行加权聚合处理,得到第一样本特征向量;
    通过所述初始图像识别模型,对所述至少两个样本子图像特征进行聚类采样处理,得到至少两个样本分类簇,所述样本分类簇包括样本采样子图像特征,根据所述至少两个样本分类簇和块稀疏矩阵确定每个所述样本采样子图像特征对应的样本块稀疏自注意力,根据至少两个所述样本块稀疏自注意力确定第二样本特征向量;所述样本采样子图像特征对应的样本块稀疏自注意力是基于其所属的样本分类簇中的样本采样子图像特征确定的;
    通过所述初始图像识别模型,根据所述第一样本特征向量和所述第二样本特征向量,确定所述样本图像的样本分类结果;
    根据所述至少两个样本分类簇、所述至少两个样本子图像特征各自对应的注意力权重、所述样本分类结果以及所述样本图像对应的分类标签,对所述初始图像识别模型进行模型参数调整,得到用于识别待检测图像的分类结果的图像识别模型。
  10. 根据权利要求9所述的方法,所述根据所述至少两个样本分类簇、所述至少两个样本子图像特征各自对应的注意力权重、所述样本分类结果以及所述样本图像对应的分类标签,对所述初始图像识别模型进行模型参数调整,得到用于识别待检测图像的分类结果的图像识别模型,包括:
    根据所述至少两个样本分类簇以及所述至少两个样本子图像特征各自对应的样本注意力权重,确定散度损失值;
    根据所述样本分类结果以及所述样本图像对应的分类标签,确定分类损失值;
    对所述散度损失值和所述分类损失值进行加权求和,得到模型总损失值;
    根据所述模型总损失值,对所述初始图像识别模型进行模型参数调整,得到所述图像识别模型。
  11. 根据权利要求10所述的方法,所述根据所述至少两个样本分类簇以及所述至少两个样本子图像特征各自对应的样本注意力权重,确定散度损失值,包括:
    获取所述至少两个样本分类簇中的第i个样本分类簇;i为正整数,且i小于或等于所述至少两个样本分类簇的数量;
    将所述第i个样本分类簇包括的样本子图像特征,作为目标样本子图像特征;
    根据所述目标样本子图像特征对应的样本注意力权重和所述目标样本子图像特征的数量,确定第i个样本分类簇对应的类别散度损失值;
    将每个样本分类簇对应的类别散度损失值进行累加,得到所述散度损失值。
  12. 根据权利要求11所述的方法,所述根据所述目标样本子图像特征对应的样本注意力权重和所述目标样本子图像特征的数量,确定第i个样本分类簇对应的类别散度损失值,包括:
    获取由所述目标样本子图像特征对应的样本注意力权重构成的拟合注意力分布;
    对所述拟合注意力分布进行归一化处理，得到归一化拟合注意力分布；
    将所述目标样本子图像特征的数量对应的均匀注意力分布,作为注意力分布标签;
    根据所述归一化拟合注意力分布和所述注意力分布标签,确定所述第i个样本分类簇对应的类别散度损失值。
  13. 一种图像检测装置,包括:
    特征提取模块,用于获取待检测图像,对所述待检测图像进行特征提取处理,得到所述待检测图像的特征表示子集;所述待检测图像包括至少两个子图像;所述特征表示子集包括至少两个子图像特征,所述至少两个子图像特征与所述至少两个子图像一一对应;
    第一向量生成模块,用于生成所述至少两个子图像特征各自对应的注意力权重,根据所述注意力权重,对所述至少两个子图像特征进行加权聚合处理,得到第一特征向量;
    第二向量生成模块,用于对所述至少两个子图像特征进行聚类采样处理,得到至少两个分类簇,所述分类簇包括采样子图像特征,根据所述至少两个分类簇和块稀疏矩阵,确定每个所述采样子图像特征对应的块稀疏自注意力,根据至少两个所述块稀疏自注意力确定第二特征向量;所述采样子图像特征对应的块稀疏自注意力是基于其所属的分类簇中的采样子图像特征确定的;
    分类模块,用于根据所述第一特征向量和所述第二特征向量,确定所述待检测图像的分类结果。
  14. 一种图像检测装置,包括:
    样本特征提取模块,用于获取样本图像,对所述样本图像进行特征提取处理,得到所述样本图像的样本特征表示子集;所述样本图像包括至少两个样本子图像;所述样本特征表示子集包括至少两个样本子图像特征,所述至少两个样本子图像特征与所述至少两个样本子图像一一对应;
    第一样本向量生成模块,用于将所述至少两个样本子图像输入初始图像识别模型,通过所述初始图像识别模型,生成所述至少两个样本子图像特征各自对应的样本注意力权重,根据所述至少两个样本子图像特征各自对应的样本注意力权重,对所述至少两个样本子图像特征进行加权聚合处理,得到第一样本特征向量;
    第二样本向量生成模块,用于通过所述初始图像识别模型,对所述至少两个样本子图像特征进行聚类采样处理,得到至少两个样本分类簇,所述样本分类簇包括样本采样子图像特征,根据所述至少两个样本分类簇和块稀疏矩阵确定每个所述样本采样子图像特征对应的样本块稀疏自注意力,根据至少两个所述样本块稀疏自注意力确定第二样本特征向量;所述样本采样子图像特征对应的样本块稀疏自注意力是基于其所属的样本分类簇中的样本采样子图像特征确定的;
    样本分类模块,用于通过所述初始图像识别模型,根据所述第一样本特征向量和所述第二样本特征向量,确定所述样本图像的样本分类结果;
    训练模块,用于根据所述至少两个样本分类簇、所述至少两个样本子图像特征各自对应的注意力权重、所述样本分类结果以及所述样本图像对应的分类标签,对所述初始图像识别模型进行模型参数调整,得到用于识别待检测图像的分类结果的图像识别模型。
  15. 一种计算机设备,包括:处理器、存储器以及网络接口;
    所述处理器与所述存储器、所述网络接口相连,其中,所述网络接口用于提供数据通信功能,所述存储器用于存储程序代码,所述处理器用于调用所述程序代码,以执行权利要求1-12任一项所述的方法。
  16. 一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,所述计算机程序适于由处理器加载并执行权利要求1-12任一项所述的方法。
  17. 一种计算机程序产品,包括计算机程序/指令,所述计算机程序/指令被处理器执行时,可以执行权利要求1-12任一项所述的方法。
PCT/CN2022/137773 2022-03-23 2022-12-09 一种图像检测方法、装置、设备及可读存储介质 WO2023179099A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/378,405 US20240054760A1 (en) 2022-03-23 2023-10-10 Image detection method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210288699.0A CN114693624A (zh) 2022-03-23 2022-03-23 一种图像检测方法、装置、设备及可读存储介质
CN202210288699.0 2022-03-23

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/378,405 Continuation US20240054760A1 (en) 2022-03-23 2023-10-10 Image detection method and apparatus

Publications (1)

Publication Number Publication Date
WO2023179099A1 true WO2023179099A1 (zh) 2023-09-28

Family

ID=82139163

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/137773 WO2023179099A1 (zh) 2022-03-23 2022-12-09 一种图像检测方法、装置、设备及可读存储介质

Country Status (3)

Country Link
US (1) US20240054760A1 (zh)
CN (1) CN114693624A (zh)
WO (1) WO2023179099A1 (zh)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114693624A (zh) * 2022-03-23 2022-07-01 腾讯科技(深圳)有限公司 一种图像检测方法、装置、设备及可读存储介质
CN117333926B (zh) * 2023-11-30 2024-03-15 深圳须弥云图空间科技有限公司 一种图片聚合方法、装置、电子设备及可读存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046980A (zh) * 2020-03-16 2020-04-21 腾讯科技(深圳)有限公司 一种图像检测方法、装置、设备及计算机可读存储介质
CN111553419A (zh) * 2020-04-28 2020-08-18 腾讯科技(深圳)有限公司 一种图像识别方法、装置、设备以及可读存储介质
WO2021099584A1 (en) * 2019-11-22 2021-05-27 F. Hoffmann-La Roche Ag Multiple instance learner for tissue image classification
CN113688886A (zh) * 2021-08-12 2021-11-23 上海联影智能医疗科技有限公司 图像分类方法、装置及存储介质
US20220058446A1 (en) * 2019-07-12 2022-02-24 Tencent Technology (Shenzhen) Company Limited Image processing method and apparatus, terminal, and storage medium
CN114693624A (zh) * 2022-03-23 2022-07-01 腾讯科技(深圳)有限公司 一种图像检测方法、装置、设备及可读存储介质


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557820A (zh) * 2024-01-08 2024-02-13 浙江锦德光电材料有限公司 一种基于机器视觉的量子点光学膜损伤检测方法及系统
CN117557820B (zh) * 2024-01-08 2024-04-16 浙江锦德光电材料有限公司 一种基于机器视觉的量子点光学膜损伤检测方法及系统

Also Published As

Publication number Publication date
US20240054760A1 (en) 2024-02-15
CN114693624A (zh) 2022-07-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22933150

Country of ref document: EP

Kind code of ref document: A1