WO2024027347A9 - Content recognition method, apparatus, device, storage medium and computer program product - Google Patents

Content recognition method, apparatus, device, storage medium and computer program product

Info

Publication number
WO2024027347A9
WO2024027347A9 (PCT/CN2023/099991, CN2023099991W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature representation
global
feature
content
Prior art date
Application number
PCT/CN2023/099991
Other languages
English (en)
French (fr)
Other versions
WO2024027347A1 (zh)
Inventor
王赟豪
余亭浩
陈少华
刘浩
侯昊迪
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2024027347A1
Publication of WO2024027347A9


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application relates to the field of machine learning, and in particular to a content recognition method, apparatus, device, storage medium and computer program product.
  • the attribute information corresponding to the multimedia content can be determined, so as to better meet the browsing needs of users in different scenarios. For example, in the image search scenario, after the user enters the search keyword, the image whose image content matches the search keyword is selected from the image library as the search result and displayed to the user.
  • a deep learning model is usually used to extract global features corresponding to an image and establish a content search library.
  • image search scenario when a user enters a search keyword, the global features that match the search keyword are determined in the content search library based on the search keyword, so that the image corresponding to the global feature is directly used as a search result and displayed to the user.
  • images matching the search keywords are usually determined only based on the global features of the image. Although the global features corresponding to the image have a high degree of match with the search keywords, there may still be situations where the image does not match the search keywords, resulting in low accuracy in content recognition.
  • the embodiments of the present application provide a content recognition method, apparatus, device, storage medium and computer program product, which can improve the accuracy of content recognition.
  • the technical solution is as follows:
  • a content identification method, comprising: acquiring an image; extracting image key points based on the distribution law of pixel points in the image, and extracting key point feature representations corresponding to the image key points in the image; performing saliency detection on the image to obtain a target area in the image; performing pooling processing on the image feature representation corresponding to the image to obtain a global feature representation, and downsampling the image feature representation based on the target area to obtain a first local feature representation; performing feature splicing on the key point feature representation and the first local feature representation to obtain a second local feature representation; and identifying the category of the target content contained in the target area based on the global feature representation and the second local feature representation.
  • a content identification device comprising:
  • An acquisition module used for acquiring images
  • An extraction module used to extract image key points based on the distribution law of pixel points in the image, and extract key point feature representations corresponding to the image key points in the image; and identify a target area from the image by performing saliency detection on the image;
  • a processing module configured to perform pooling processing on the image feature representation corresponding to the image to obtain a global feature representation, and to downsample the image feature representation based on the target area to obtain a first local feature representation;
  • a splicing module used for performing feature splicing on the key point feature representation and the first local feature representation to obtain a second local feature representation
  • a recognition module is used to recognize the category of the target content contained in the target area based on the global feature representation and the second local feature representation.
  • a computer device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to implement the content identification method described in any of the above embodiments of the present application.
  • a computer-readable storage medium wherein at least one instruction, at least one program, a code set or an instruction set is stored in the storage medium, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by a processor to implement a content identification method as described in any of the above-mentioned embodiments of the present application.
  • a computer program product or a computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium.
  • a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs any of the content identification methods described in the above embodiments.
  • the key points of the image are extracted and the key point feature representations corresponding to the key points of the image are extracted, and the image is saliency detected to obtain the target area in the image;
  • the image feature representation corresponding to the image is pooled to obtain the global feature representation, and the image feature representation is downsampled based on the target area, and the downsampled first local feature representation and the key point feature representation are feature spliced to obtain the second local feature representation, so as to identify the category of the target content in the target area according to the global feature representation and the second local feature representation.
  • the global feature representation representing global information is obtained by performing global pooling processing on the image feature representation;
  • The first local feature representation, which represents local information, is obtained by performing local downsampling processing on the image feature representation based on the target area, so as to effectively extract the effective information about the local features in the image feature representation. Then, when the first local feature representation and the key point feature representation are feature spliced, a more accurate second local feature representation can be obtained by combining the image key points. The category of the target content contained in the target area is identified using the global feature and the second local feature, which can effectively improve the accuracy of the identified content category.
  • FIG. 1 is a schematic diagram of a related technology of a content identification method provided by an exemplary embodiment of the present application.
  • FIG. 2 is a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application.
  • FIG. 3 is a flow chart of a content identification method provided by an exemplary embodiment of the present application.
  • FIG. 4 is a flow chart of a content identification method provided by another exemplary embodiment of the present application.
  • FIG. 5 is a flow chart of a content identification method provided by another exemplary embodiment of the present application.
  • FIG. 6 is a schematic diagram of a target area provided by an exemplary embodiment of the present application.
  • FIG. 7 is a schematic diagram of a target area provided by another exemplary embodiment of the present application.
  • FIG. 8 is a schematic diagram of a saliency detection model provided by an exemplary embodiment of the present application.
  • FIG. 9 is a schematic diagram of a content identification method provided by another exemplary embodiment of the present application.
  • FIG. 10 is a structural block diagram of a content identification device provided by an exemplary embodiment of the present application.
  • FIG. 11 is a structural block diagram of a content recognition device provided by another exemplary embodiment of the present application.
  • FIG. 12 is a schematic diagram of a server structure provided by an exemplary embodiment of the present application.
  • Figure 1 shows a schematic diagram of a content recognition method provided by an exemplary embodiment of the present application.
  • an image 110 is acquired, wherein the image 110 is implemented as a scenic spot image, and image key points corresponding to the image 110 can be extracted through the distribution pattern of pixel points in the image 110 (not shown in the figure), and then the key point feature representations corresponding to the image key points are extracted; in addition, the target area in the image can be obtained by performing saliency detection on the image.
  • the image feature representation corresponding to the image is pooled to obtain a global feature representation 130 representing global information of the image, and the image feature representation 120 is downsampled based on the target area obtained by saliency detection to obtain a first local feature representation 140 representing local information of the target area in the image.
  • the first local feature representation 140 and the key point feature representation are concatenated to obtain a second local feature representation 150 .
  • the category of the target content 111 in the target area of the image 110 is identified based on the global feature representation 130 and the second local feature representation 150 to obtain a content recognition result 160 , wherein the content recognition result 160 is implemented as “A Landscape Building”, which represents the category corresponding to the target content 111 .
  • the implementation environment involved in the embodiments of the present application is explained. For schematic illustration, please refer to FIG. 2 .
  • the implementation environment involves a terminal 210 and a server 220 .
  • the terminal 210 and the server 220 are connected via a communication network 230 .
  • the terminal 210 sends a content recognition request to the server, wherein the content recognition request includes an image, and the image includes the target content.
  • After receiving the content recognition request from the terminal 210, the server 220 performs content recognition on the image and feeds back the obtained content recognition result to the terminal 210, wherein the content recognition result reflects the category corresponding to the target content.
  • the server 220 can obtain the target area in the image by performing saliency detection on the image; in addition, the image feature representation corresponding to the image is pooled to obtain a global feature representation 222, and the image feature representation is downsampled based on the target area to obtain a first local feature representation 223; in addition, the image key points are extracted based on the distribution law of pixel points in the image, and then the key point feature representation 224 corresponding to the image key points in the image is extracted, the key point feature representation 224 and the first local feature representation 223 are feature-concatenated to obtain a second local feature representation 225, and the target content is identified based on the second local feature representation 225 and the global feature representation 222, and the category 226 corresponding to the target content is determined.
  • the terminal 210 may be a mobile phone, a tablet computer, a desktop computer, a portable laptop computer, a smart TV, a smart car-mounted device, or other terminal devices, which is not limited in the embodiments of the present application.
  • server 220 can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), as well as big data and artificial intelligence platforms.
  • cloud technology refers to a hosting technology that unifies hardware, software, network and other resources within a wide area network or local area network to achieve data calculation, storage, processing and sharing.
  • the server 220 may also be implemented as a node in a blockchain system.
  • The information (including but not limited to user device information and user personal information), data (including but not limited to data used for analysis, stored data and displayed data) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data comply with relevant laws, regulations and standards of relevant countries and regions. For example, the images involved in this application were obtained with full authorization.
  • Figure 3 shows a flow chart of the content identification method provided by an exemplary embodiment of the present application.
  • the method can be executed by a terminal, or by a server, or by both a terminal and a server.
  • the method is explained as being executed by a server.
  • the method includes the following steps.
  • Step 310 acquiring an image.
  • the image is obtained from an open source image database; or, the image is acquired by an image acquisition device (such as a camera, terminal, video camera, etc.); or, the image is synthesized by image synthesis software; or, the image is a screenshot from a video, etc.
  • Step 320 extract image key points based on the distribution pattern of pixel points in the image, and extract key point feature representations corresponding to the image key points in the image.
  • a pixel is a unit that characterizes the image resolution.
  • the image can be regarded as content composed of multiple pixels; each pixel corresponds to a pixel point, and according to the image resolution, each pixel point corresponds to a pixel value.
  • the pixel distribution law characterizes the changes in the pixel values corresponding to different pixel points on the image.
  • pixel values corresponding to multiple pixel points in the image are comprehensively analyzed, and the distribution pattern of the pixel points in the image is determined according to the change of the pixel values. Then, image key points corresponding to the image are extracted according to the pixel point distribution pattern, and the image key points include pixel points in the image that symbolize the target content.
  • pixel points whose pixel distribution pattern meets preset conditions are used as the above-mentioned image key points.
  • the preset condition is implemented as the amplitude of the pixel value change exceeding the preset amplitude threshold, for example: the pixel value corresponding to two adjacent pixel points changes greatly (exceeding the preset amplitude threshold), and at least one of the pixel points is used as the above-mentioned image key point; or, the preset condition is implemented as the speed of the pixel value change exceeding the preset speed threshold, for example: the pixel value corresponding to multiple adjacent pixel points changes quickly (exceeding the preset speed threshold), and at least one of the pixel points is used as the above-mentioned image key point, etc.
  • the type of image key points includes at least one of key point types such as corner points, edges or blocks, which is not limited in the embodiments of the present application.
  • the image is annotated with a single image key point; or, the image is annotated with multiple image key points, which is not limited in the embodiments of the present application.
  • the key points of an image are obtained by identifying the content based on the entire image, so they can show the information that the entire image needs to highlight to a certain extent.
  • the target area represents the area where the image is mainly expressed, which contains a lot of information that the image highlights;
  • the background area represents the area where the image is secondarily expressed, which contains a small amount of information that the image highlights.
  • the multiple image key points extracted mainly express the information corresponding to the target area, and also reflect the information outside the target area in the image to a small extent.
  • the multiple image key points extracted include image key points 1 to 10, image key points 1 to 8 are all key points extracted from the target area, and image key points 9 and 10 are image key points extracted from outside the target area in the image.
  • key points of an image are extracted using a preset key point detector, and the results output by the key point detector are used as image key points corresponding to the image.
  • feature extraction is performed on the extracted image key points to obtain feature representations of key points in the image corresponding to the image key points.
  • feature extraction is performed on all image key points to obtain key point feature representations corresponding to all image key points, and the key point feature representations are used for subsequent feature stitching; or, some image key points are selected for feature extraction to obtain key point feature representations corresponding to some image key points, etc., which is not limited to the embodiments of the present application.
  • Step 330 Perform saliency detection on the image to obtain a target area in the image.
  • saliency detection is used to locate the most "salient" area in an image, i.e., the target area, using image processing techniques and computer vision algorithms.
  • saliency detection is used to determine an area containing target content, i.e., the target area.
  • a target area refers to an eye-catching or relatively important area in an image, such as the area that the human eye will first focus on when viewing a picture.
  • the process of automatically locating key content in an image or important areas in a scene is called saliency detection.
  • Saliency detection is widely used in multiple image recognition fields such as target detection and robotics.
  • saliency detection is based on the detection of important content in the image. Therefore, after the target area is obtained by saliency detection, the target area represents the content of a fixed area. Compared with the image key points extracted from the image, the target area focused on by saliency detection is more single and clearer, unlike image key points, which may exist in many areas of the image.
  • In some embodiments, saliency detection is implemented as a threshold-based region segmentation method: the pixel value of each pixel in the image is obtained; when the pixel values corresponding to a plurality of consecutive pixel points each reach a preset pixel threshold, the area composed of these consecutive pixel points is taken as the target area, and the area outside the target area in the image is taken as the background area, thereby realizing the segmentation of the target area and the background area. This process is regarded as performing saliency detection on the image to obtain the target area.
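As a rough illustration of the threshold-based segmentation described above, the following sketch thresholds the grayscale pixel values and keeps connected runs of pixels that reach the threshold as the target area; the threshold value, the minimum-area filter and the use of OpenCV are illustrative assumptions, not part of the application.

```python
import cv2
import numpy as np

def threshold_saliency(image_path: str, pixel_threshold: int = 128):
    """Split an image into a target area and a background area by thresholding.

    Pixels whose grayscale value reaches `pixel_threshold` are candidate target
    pixels; connected groups of such pixels form the target area.
    """
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binary mask: 1 where the pixel value reaches the preset threshold.
    mask = (gray >= pixel_threshold).astype(np.uint8)
    # Keep connected groups of candidate pixels as the target area;
    # everything else is treated as the background area.
    num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    target_mask = np.zeros_like(mask)
    for label in range(1, num_labels):             # label 0 is the background
        if stats[label, cv2.CC_STAT_AREA] >= 50:   # ignore tiny speckles
            target_mask[labels == label] = 1
    background_mask = 1 - target_mask
    return target_mask, background_mask
```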
  • saliency detection is implemented as a region segmentation method based on edge detection.
  • the Fourier transform method can be used to transform the image from the spatial domain to the frequency domain.
  • the edge of the region corresponds to the high-frequency part, so different regions can be segmented more intuitively, thereby better separating the target area and the background area from the image.
  • saliency detection is implemented as a method of image recognition through a pre-trained image segmentation model.
  • a pre-trained image segmentation model for image segmentation is obtained, and the target area and background area in the image are separated by inputting the image into the image segmentation model.
  • The pre-trained model used to obtain the image segmentation model is implemented as a model developed by the Visual Geometry Group (VGG). The image segmentation model is obtained by training the VGG model with a large number of sample images, wherein the sample images carry region labels indicating the target areas in the sample images. The model training process is driven by the difference between the VGG model's prediction results for the sample images and the region labels annotated on the sample images. Finally, the trained image segmentation model is used to recognize the image and obtain the target area in the image, thereby achieving the purpose of saliency detection.
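The following sketch illustrates one way such a VGG-based segmentation model could be set up, using a torchvision VGG-16 backbone and a small per-pixel saliency head; the decoder structure, loss and optimizer are assumptions for illustration, not the application's actual model.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGGSaliencyNet(nn.Module):
    """Illustrative saliency segmentation head on a pre-trained VGG backbone.

    The VGG convolutional features act as the encoder; a small decoder
    upsamples them back to the input resolution and predicts, per pixel,
    whether the pixel belongs to the target area.
    """
    def __init__(self):
        super().__init__()
        self.encoder = vgg16(weights="IMAGENET1K_V1").features  # conv layers only
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, kernel_size=1),   # 1-channel saliency logits
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training sketch: region labels are per-pixel masks of the target area.
model = VGGSaliencyNet()
criterion = nn.BCEWithLogitsLoss()   # difference between prediction and region label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```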
  • the target area is used to indicate an area including target content.
  • an image refers to an image containing target content of unknown category.
  • the target content includes at least one of content types such as people, animals, food, attractions, and landmarks.
  • saliency detection is used to determine a target area corresponding to target content and a background area corresponding to background content in an image, that is, saliency detection is used to divide the image into regions according to content features, thereby separating the target area from the background area.
  • the target area corresponding to the target content and the background area corresponding to the background content can be separated by saliency detection, thereby determining the target area including the target content in the image.
  • the pixel values corresponding to the pixels in the target area are larger, while the pixel values corresponding to the pixels in the background area are smaller.
  • the saliency detection process is realized by setting a pixel threshold, thereby separating the target area from the background area.
  • the saliency detection process is realized by an image sharpening method, that is, the image is sharpened to enhance the edge detail information in the image, thereby realizing the process of enhancing the boundary between the target area and the background area, so as to separate the target area from the background area.
  • the image includes a single target content; or, the image includes multiple target contents, wherein when the image includes multiple target contents, the multiple target contents correspond to different contents, or correspond to the same content, which is not limited here.
  • the target area when the image includes a single target content, the target area is implemented as a single area including the single target content; or, when the image includes multiple target contents, the target area is implemented as a single area including the multiple target contents; or, when the image includes multiple target contents, the target area is implemented as multiple areas, and each target area includes at least one target content, etc.
  • a saliency detection model is preset, an image is input into the saliency detection model, and a recognition saliency map corresponding to the image is output, wherein the recognition saliency map includes a target area corresponding to the target content and a background area corresponding to the background content.
  • the recognition saliency map is implemented as an image after the target area is enhanced.
  • Step 340 performing pooling processing on the image feature representation corresponding to the image to obtain a global feature representation, and downsampling the image feature representation based on the target area to obtain a first local feature representation.
  • pooling refers to downsampling the image feature representation in order to compress it, reducing the number of parameters while maintaining a certain invariance of the image feature representation (such as at least one of rotation invariance, translation invariance and scaling invariance).
  • the pooling process is implemented through a convolution kernel, and a global feature representation of the image after the pooling process is obtained by using the image size corresponding to the image and the convolution kernel size corresponding to the convolution kernel.
  • an image size corresponding to an image is determined; a convolution kernel for performing pooling processing on the image is obtained, and a convolution kernel size of the convolution kernel is obtained; the image is pooled with the convolution kernel to obtain a global feature representation, and the size of the global feature representation is the quotient of the image size and the convolution kernel size.
  • For example, if the image size is 20*20 and the convolution kernel size used is 10*10, the size of the global feature representation obtained after pooling is 2*2; or, if the image size is 20*20 and the convolution kernel size used is 5*5, the size of the global feature representation obtained after pooling is 4*4, etc.
  • the image size of the image and the convolution kernel size of the convolution kernel are only illustrative examples, and the embodiments of the present application do not limit this.
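A minimal sketch of the size relationship above, assuming PyTorch average pooling with the stride equal to the kernel size, so that the output size is the quotient of the input size and the convolution kernel size:

```python
import torch
import torch.nn as nn

# A 20*20 single-channel feature map pooled with a 10*10 kernel (stride equal
# to the kernel size) yields a 2*2 output: 20 / 10 = 2 along each dimension.
feature_map = torch.randn(1, 1, 20, 20)
pool_10 = nn.AvgPool2d(kernel_size=10)
print(pool_10(feature_map).shape)   # torch.Size([1, 1, 2, 2])

# With a 5*5 kernel the output is 4*4: 20 / 5 = 4 along each dimension.
pool_5 = nn.AvgPool2d(kernel_size=5)
print(pool_5(feature_map).shape)    # torch.Size([1, 1, 4, 4])
```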
  • the pooling process performed by the convolution kernel can greatly reduce the analysis complexity when performing feature representation analysis, and can also increase the receptive field of the entire image with a smaller size, thereby improving the expression effect of the global feature representation on the global image.
  • the pooling process includes at least one of the pooling process types of maximum pooling process (Max-Pooling), average pooling process (Mean-Pooling) and generalized mean pooling process (Generalized-Mean Pooling), which is not limited to this.
  • the image feature representation corresponding to the image is downsampled based on the target region.
  • Downsampling, also known as feature downsampling, refers to the process of performing image sampling and image reduction on the image feature representation, thereby obtaining a processed feature vector, which is the first local feature representation.
  • feature downsampling includes sparse sampling.
  • pooling and downsampling are performed simultaneously; or, pooling is first performed on the image feature representation, and then downsampling is performed on the image feature representation, which is not limited.
  • the image is composed of a plurality of image blocks
  • the image feature representation is composed of a plurality of sub-feature representations
  • the plurality of image blocks corresponds one-to-one to the plurality of sub-feature representations.
  • the image is composed of multiple image blocks, and the multiple image blocks include image blocks 1 to image blocks 9, image block 1 corresponds to sub-feature representation a, image block 2 corresponds to sub-feature representation b..., and image block 9 corresponds to sub-feature representation i.
  • a plurality of regional image blocks within the target area are acquired from a plurality of image blocks included in the image.
  • the regional image block is used to represent an image block located in the target region.
  • each image block corresponds to a sub-feature representation
  • multiple image blocks constituting the image correspond to a sub-feature representation respectively, thereby obtaining multiple sub-feature representations, and thus the multiple sub-feature representations corresponding to the image are called image feature representations corresponding to the image.
  • For example, if the image blocks falling within the target area of the image include image block 3, image block 5 and image block 8, then image block 3, image block 5 and image block 8 are used as the regional image blocks corresponding to the target area.
  • sub-feature representations respectively corresponding to a plurality of regional image blocks are obtained as sparse sampling results.
  • sub-feature representations corresponding to the multiple regional image blocks are obtained in the image feature representation, thereby obtaining a partial sub-feature representation corresponding to the target area, and the partial sub-feature representation is used as a sparse sampling result.
  • the sparse sampling results can be expressed as a set of feature vectors; or, the sparse sampling results can be expressed as a feature vector graph (feature matrix), that is, the feature vector graph contains multiple feature blocks (Patch blocks), each Patch block represents a feature vector (sub-feature representation corresponding to the image block), which is not limited to the embodiments of the present application.
  • a content recognition model is set in advance, and after the image is input into the content recognition model, a sparse sampling result corresponding to the target content is directly output; or, a content recognition model is set in advance, and after the image is input into the content recognition model, an image feature representation corresponding to the image is output, and a sparse sampling result corresponding to the target content is selected from the image feature representation based on the target area, and there is no limitation on this.
  • the method for extracting the sparse sampling result includes at least one of the following extraction methods:
  • Use the Swin Transformer model (a Transformer based on shifted windows) for feature extraction: the image is input into the Swin Transformer model, and a feature map corresponding to the target content is output; the feature map is used as the sparse sampling result, and each patch block in the feature map represents a sub-feature representation.
  • Use a deep residual network (ResNet) for feature extraction.
  • Use the Tokens-to-Token Vision Transformer (T2T-ViT) model for feature extraction: the image is input into the T2T-ViT model, and the token sequence corresponding to the target content is output as the sparse sampling result corresponding to the target content.
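Among the extractors listed above, the following sketch shows the general idea with a torchvision ResNet-50 truncated before its classification head so that it outputs a spatial grid of sub-feature representations; the backbone choice, weights and input size are illustrative assumptions rather than the application's specific configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Truncate ResNet-50 before global pooling so it outputs a spatial feature map
# (one sub-feature representation per image block) instead of a single vector.
backbone = resnet50(weights="IMAGENET1K_V2")
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])
feature_extractor.eval()

image = torch.randn(1, 3, 224, 224)          # a single preprocessed image
with torch.no_grad():
    feature_map = feature_extractor(image)   # shape: (1, 2048, 7, 7)

# Flatten the spatial grid into a sequence of patch features (49 x 2048),
# analogous to the per-block sub-feature representations described above.
patch_features = feature_map.flatten(2).transpose(1, 2)
print(patch_features.shape)                  # torch.Size([1, 49, 2048])
```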
  • the sparse sampling result is pooled to obtain a first local feature representation.
  • the sparse sampling result can be pooled by using the above-mentioned pooling method with the help of convolution kernel to obtain the first local feature representation.
  • If the sparse sampling result is implemented as the above-mentioned feature vector set, the first local feature representation obtained after pooling the sparse sampling result is implemented as the result of combining multiple feature vectors; or, if the sparse sampling result is implemented as the above-mentioned feature vector graph, the first local feature representation obtained after pooling the sparse sampling result is implemented as a single feature vector, etc.
  • Step 350 concatenating the key point feature representation and the first local feature representation to obtain a second local feature representation.
  • feature concatenation refers to concatenating feature vectors of the first local feature and the key point feature representation, and using the feature vector obtained after concatenation as the second local feature representation.
  • feature concatenation is implemented through a neural network concatenation layer (Concatenate, Concat), and the function of the Concat layer is to concatenate two or more feature representations in the channel dimension.
  • the key point feature representation and the first local feature representation are concatenated along the channel dimension to obtain the second local feature representation.
  • The key point feature representation and the first local feature representation have the same spatial feature size but different channel counts. Therefore, the spatial feature size of the second local feature representation obtained after feature splicing remains unchanged, while the channel counts are added together.
  • the size of the first local feature is 1*H*W
  • the size of the key point feature representation is C*H*W.
  • 1 in the first local feature is used to indicate the number of channels of the first local feature
  • C in the key point feature representation is used to indicate the number of channels of the key point feature representation
  • H is used to indicate the height of the first local feature or the key point feature representation
  • W is used to indicate the width of the first local feature or the key point feature representation.
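A minimal sketch of the channel-dimension concatenation with the sizes given above; the concrete values of C, H and W are illustrative.

```python
import torch

C, H, W = 128, 14, 14                       # illustrative sizes
first_local = torch.randn(1, H, W)          # first local feature representation: 1*H*W
keypoint_feat = torch.randn(C, H, W)        # key point feature representation: C*H*W

# Concatenate along the channel dimension (dim=0 for an unbatched tensor):
# the spatial size H*W is unchanged and the channel counts add up.
second_local = torch.cat([keypoint_feat, first_local], dim=0)
print(second_local.shape)                   # torch.Size([129, 14, 14]) -> (C+1)*H*W
```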
  • the single key point feature representation and the single first local feature representation are spliced according to the above-mentioned feature splicing method to obtain a second local feature representation.
  • the multiple first local feature representations are subjected to the above-mentioned feature splicing to obtain a feature splicing result, and the feature splicing result is spliced with the single key point feature representation to obtain a second local feature representation; or, the single key point feature representation is spliced with the multiple first local feature representations respectively to obtain feature splicing results corresponding to the multiple first local feature representations, and the multiple feature splicing results are subjected to feature splicing to obtain a second local feature representation.
  • the multiple key point feature representations are subjected to the above-mentioned feature splicing to obtain a feature splicing result, and the feature splicing result is spliced with the single first local feature representation to obtain a second local feature representation; or, the single first local feature representation is spliced with multiple key point feature representations respectively to obtain feature splicing results corresponding to the multiple key point feature representations, and the multiple feature splicing results are subjected to feature splicing to obtain a second local feature representation.
  • feature concatenation includes at least one of the following splicing methods:
  • a single first local feature representation and a single key point feature representation are concatenated to obtain a single second local feature representation, that is, the first local feature representation and the key point feature representation are concatenated one by one, and the second local feature representation includes multiple feature vectors obtained by concatenating the features;
  • first local feature representations are firstly concatenated one by one, and the concatenated results are then concatenated with the key point feature representation in sequence to obtain a final second local feature representation, that is, the second local feature representation contains a single feature vector obtained by concatenation;
  • Step 360 Identify the category of the target content contained in the target area based on the global feature representation and the second local feature representation.
  • the result after recognition is called a content recognition result, that is, the content recognition result is used to characterize the category corresponding to the target content.
  • the content recognition result indicates the category name corresponding to the target content, such as: the content recognition result for target content a is "garden”; or, the content recognition result indicates the category type corresponding to the target content, such as: the content recognition result for target content b is "X garden", without limitation.
  • the content identification result includes a single target content and its corresponding category, such as: target content a, corresponding to category "A Park”; target content b, corresponding to category “B Park”; or, the content identification result includes multiple categories, each category corresponds to at least one target content, such as: category A is "Dolphin", category A includes target content 1 and target content 2 (that is, target content 1 and target content 2 are both “Dolphin"), category B is "Clownfish", category B includes target content 3 (that is, target content 3 is “Clownfish”), and there is no limitation on this.
  • the category corresponding to the target content is implemented as a coarse-grained category, such as: the image includes target content A (first amusement park) and target content B (second amusement park), and in the final content recognition result, the categories corresponding to the target content A and the target content B are both "amusement park”; or, the category corresponding to the target content is implemented as a fine-grained category, such as: the target content A and the target content B both belong to "museum", but the target content A is finally identified as "a museum” and the target content B is "b museum”.
  • the content recognition method extracts key points based on the distribution law of pixel points in the image and extracts the key point feature representation corresponding to the key points of the image, performs saliency detection on the image to obtain the target area in the image; in addition, the image feature representation is pooled to obtain the global feature representation, and the image feature representation is downsampled based on the target area, and the first local feature representation obtained after the downsampling and the key point feature representation are feature spliced to obtain the second local feature representation, so as to identify the category of the target content in the target area according to the global feature representation and the second local feature representation.
  • the process of obtaining the first local feature by downsampling the image feature representation based on the target area can effectively extract the effective information about the local feature in the image feature representation, and then when the first local feature representation and the key point feature representation are feature spliced, the purpose of obtaining a more accurate second local feature representation by combining the image key points can be achieved, and the content recognition of the image can be performed using the global feature and the second local feature, which can effectively improve the accuracy of content recognition.
  • both the first local feature representation and the global feature representation can be obtained through a variety of different pooling processes.
  • FIG. 4 shows a schematic diagram of a content recognition method provided by an exemplary embodiment of the present application.
  • step 340 includes step 341 and step 342.
  • the method includes the following steps:
  • Step 341 performing pooling processing on the image feature representation to obtain a global feature representation, and performing sparse sampling on the image feature representation based on the target area to obtain a sparse sampling result.
  • each image feature representation represents a feature representation of an image patch in the image.
  • some image feature representations are selected from multiple image feature representations corresponding to the image for pooling processing to obtain a global feature representation; or, all image feature representations are pooled to obtain a global feature representation, which is not limited to this.
  • the pooling process includes any one of an average pooling process, a maximum pooling process, and a generalized mean pooling process.
  • Mean-Pooling refers to performing vector averaging on the input image feature representation to obtain the averaged feature vector as the global feature representation.
  • Max-Pooling refers to selecting the feature vector with the largest vector value from the input image feature representation as the global feature representation.
  • Generalized-mean pooling (GeM) computes the pooling result according to Formula 1: f = ((1/K) * Σ_{k=1..K} x_k^p)^(1/p), where x_k is the image feature representation corresponding to the k-th image block and K is the number of image blocks. When p = 1, Formula 1 is equivalent to the average pooling process; when p approaches infinity, Formula 1 is equivalent to taking the maximum value, i.e., the maximum pooling process.
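A sketch of generalized-mean pooling as a PyTorch module implementing Formula 1 over a set of patch features; whether p is fixed or learned, and its default value, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GeMPool(nn.Module):
    """Generalized-mean pooling over patch features x_k (Formula 1).

    With p = 1 this reduces to average pooling; as p grows large it
    approaches max pooling. `p` may be learned or kept fixed.
    """
    def __init__(self, p: float = 3.0, learnable: bool = True, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p)) if learnable else torch.tensor(p)
        self.eps = eps

    def forward(self, x):
        # x: (batch, num_patches, dim) patch features x_k
        x = x.clamp(min=self.eps)                  # keep the power well-defined
        return x.pow(self.p).mean(dim=1).pow(1.0 / self.p)
```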
  • the first one is to obtain global feature representation through a single pooling process.
  • generalized mean pooling is performed on the image feature representation to obtain a global feature representation. That is, by performing generalized mean pooling on the image feature representation, the pooling result obtained is used as the global feature representation.
  • generalized mean pooling is a method that combines generalized pooling and average pooling. It mainly calculates the weighted mean of the area covered by the convolution kernel sliding on the input feature representation (that is, the above-mentioned image feature representation) to obtain each dimension of the output feature representation (that is, the above-mentioned global feature representation).
  • the weighting coefficients involved in the weighted mean can be obtained through previous model learning or can be custom set coefficients.
  • average pooling is performed on the image feature representation to obtain a global feature representation.
  • maximum pooling is performed on the image feature representation to obtain a global feature representation.
  • a global feature representation is obtained, and the global feature representation is used to characterize the pooling result corresponding to a single pooling process.
  • the second method is to obtain global feature representation through a variety of different pooling processes.
  • the image feature representation is subjected to average pooling to obtain a first global feature representation; the image feature representation is subjected to maximum pooling to obtain a second global feature representation; the image feature representation is subjected to generalized mean pooling to obtain a third global feature representation; the first global feature representation, the second global feature representation, and the third global feature representation are feature concatenated to obtain a global feature representation.
  • three different pooling processes are performed on the image feature representation to obtain a first global feature representation, a second global feature representation and a third global feature representation, and feature splicing is performed on them, and the feature splicing result is used as the global feature representation, that is, the global feature representation includes the splicing result of the pooling results corresponding to the three pooling processes.
  • the first global feature representation, the second global feature representation and the third global feature representation are arranged in a fixed order.
  • In some embodiments, the first global feature representation, the second global feature representation and the third global feature representation are feature concatenated in a fixed order (e.g., in the order of the first, second and third global feature representations); or, the first global feature representation, the second global feature representation and the third global feature representation are concatenated in a random order, without limitation.
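A short usage sketch of the second method: the same set of patch features is pooled three ways and the results are spliced in a fixed order into one global feature representation. The shapes and the GeM exponent are illustrative assumptions.

```python
import torch

patch_features = torch.randn(1, 49, 2048)   # illustrative patch features (x_k vectors)

first_global = patch_features.mean(dim=1)              # average pooling
second_global = patch_features.max(dim=1).values       # maximum pooling
p = 3.0                                                 # illustrative GeM exponent
third_global = patch_features.clamp(min=1e-6).pow(p).mean(dim=1).pow(1.0 / p)  # GeM pooling

# Feature splicing in a fixed order: average, maximum, generalized mean.
global_feature = torch.cat([first_global, second_global, third_global], dim=-1)
print(global_feature.shape)                             # torch.Size([1, 6144])
```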
  • sparse sampling refers to sparse processing of the image feature representation to obtain a sparse vector matrix as a sparse sampling result.
  • the image feature representation is a dense vector matrix
  • the sparse sampling result is a sparse vector matrix, that is, the sparse sampling result includes multiple zero elements.
  • the zero elements represent the image feature representations corresponding to the unsampled image blocks, while the non-zero elements represent the image feature representations corresponding to the sampled image blocks.
  • For example, the image feature representation is implemented as a feature map (i.e., feature matrix) of size k × k × 1024.
  • After sparse sampling based on the target area, n × 1024 Token vectors are obtained, and these n × 1024 Token vectors are used as the sparse sampling result.
  • the number of Token vectors is a preset fixed number; or, the number of Token vectors can be freely set according to actual needs, and there is no limitation on this.
  • sparse sampling is performed on the image feature representation based on the target area to obtain a sparse sampling result.
  • In one approach, the image feature representations corresponding to the image blocks inside the target area are retained as non-zero elements, and the image feature representations corresponding to the image blocks outside the target area are set to zero elements, thereby implementing sparse sampling of the image feature representation, so that the sparse sampling result more specifically reflects the local information corresponding to the target area;
  • or, the image feature representations corresponding to the image blocks in most of the target area are retained as non-zero elements, and the image feature representations corresponding to the image blocks outside this part are set to zero elements, thereby implementing sparse sampling of the image feature representation, so that the sparse sampling result more specifically reflects the local information corresponding to the target area;
  • or, the image feature representations corresponding to the image blocks within a certain area that includes the target area are retained as non-zero elements, and the image feature representations corresponding to the image blocks outside that area are set to zero elements, thereby implementing sparse sampling of the image feature representation, so that the sparse sampling result more specifically reflects the local information corresponding to the target area.
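A minimal sketch of target-area-based sparse sampling, assuming a k × k × 1024 feature map and a boolean mask derived from saliency detection; only the sub-feature representations of blocks inside the target area are gathered as the n × 1024 token vectors. The mask construction here is purely illustrative.

```python
import torch

k, dim = 7, 1024
feature_map = torch.randn(k, k, dim)          # k*k*1024 feature map (one vector per image block)
target_mask = torch.zeros(k, k, dtype=torch.bool)
target_mask[2:5, 2:5] = True                  # illustrative target area from saliency detection

# Keep only the sub-feature representations of blocks inside the target area;
# blocks outside the area are dropped (treated as zero elements).
tokens = feature_map[target_mask]             # shape: (n, 1024), n = number of area blocks
print(tokens.shape)                           # torch.Size([9, 1024])
```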
  • the processes of performing pooling processing and downsampling on the image feature representation are performed simultaneously.
  • Step 342 perform pooling processing on the sparse sampling result to obtain a first local feature representation.
  • the pooling process includes at least one of the pooling processes such as average pooling process, maximum pooling process and generalized mean pooling process.
  • the first one is to perform a single pooling process on the sparse sampling results.
  • the sparse sampling result is subjected to maximum pooling processing, and the Token vector with the largest vector value in the sparse sampling result is selected as the first local feature representation.
  • average pooling is performed on the sparse sampling results, and the sparse sampling results are averaged and evaluated, and the obtained average value vector is used as the first local feature representation.
  • the first local feature representation includes a feature vector obtained by a single pooling processing.
  • the second method is to perform various pooling operations on the sparse sampling results.
  • the sparse sampling results are average pooled to obtain a third local feature representation; the sparse sampling results are maximum pooled to obtain a fourth local feature representation; the sparse sampling results are generalized mean pooled to obtain a fifth local feature representation; the third local feature representation, the fourth local feature representation and the fifth local feature representation are feature concatenated to obtain a first local feature representation.
  • the current first local feature representation includes the splicing results corresponding to the feature vectors obtained by multiple different pooling processes.
  • the sparse sampling results are pooled according to a preset processing order of the three pooling processes, which is not limited.
  • the preset processing order is a fixed order set in advance; or, the preset processing order can be freely set according to actual needs.
  • the third local feature representation, the fourth local feature representation and the fifth local feature representation are feature spliced in a fixed arrangement order (such as: feature splicing is performed in the splicing order of the third local feature representation, the fourth local feature representation and the fifth local feature representation); or, the third local feature representation, the fourth local feature representation and the fifth local feature representation are feature spliced in a random arrangement order, without limitation.
  • any of the above-mentioned pooling processes can be selected for combination for image feature representation and sparse sampling results respectively (that is, including four pooling combination methods), such as: using any two of the three pooling processes to process the sparse sampling results and stitch the features to obtain a first local feature representation; or, using any two of the three pooling processes to process the image feature representation and stitch the features to obtain a global feature representation, etc.
  • the embodiments of the present application are not limited to this.
  • the content recognition method extracts image key points based on the distribution law of pixel points in the image and extracts the key point feature representation corresponding to the image key points, performs saliency detection on the image to obtain the target area in the image; in addition, the image feature representation is pooled to obtain the global feature representation, and the image feature representation is downsampled based on the target area, and the first local feature representation and the key point feature representation are feature spliced to obtain the second local feature representation, so as to identify the category of the target content in the target area according to the global feature representation and the second local feature representation, and finally obtain the content recognition result.
  • the effective information about the local feature in the image feature representation can be effectively extracted, and then when the first local feature representation and the key point feature representation are feature spliced, the purpose of combining the image key points to obtain a more accurate second local feature representation can be achieved, and the category of the content of the image can be identified by using the global feature and the second local feature, which can effectively improve the accuracy of content recognition.
  • the complexity of the image feature representation is reduced by sparsely sampling the image feature representation based on the target area, and a sparse sampling result with low complexity is obtained.
  • the sparse sampling result is then pooled to reduce the size of the target content while retaining the image spatial information corresponding to the target area as much as possible, and to extract the first local feature representation containing high-dimensional local feature information. While effectively extracting the corresponding local features in the image feature representation, the efficiency of feature extraction and the utilization rate of the feature representation are improved.
  • the operation form of pooling processing is introduced.
  • pooling processing is performed on sparse sampling results and image feature representations
  • at least one of the above three processing methods can be used to obtain the corresponding feature representation.
  • the operation form of the pooling processing can be simplified to a certain extent; when at least two pooling processing methods are selected, a more appropriate pooling processing method can be determined according to the processing conditions of the sparse sampling results and image feature representations, thereby improving the flexibility of applying the pooling processing.
  • two different pooling processing methods are provided for image feature representation, including a single pooling processing method and a feature splicing method after multiple different pooling processing.
  • the pooling processing methods such as maximum pooling, average pooling and generalized mean pooling
  • the amount of computation of the pooling processing operation can be reduced to a certain extent, and the efficiency of obtaining the global feature representation can be improved
  • the method of splicing the features after multiple different pooling processing is selected, the first global feature representation after the average pooling processing, the second global feature representation after the maximum pooling processing and the third global feature representation after the generalized mean pooling processing are combined to effectively achieve the purpose of more comprehensive analysis of the image feature representation, and while increasing the diversity of pooling processing options, the accuracy of obtaining the global feature representation is improved.
  • two different pooling methods are provided for sparse sampling results, including a single pooling method and a feature concatenation method after multiple different pooling methods.
  • the amount of computational effort of the pooling processing operation can be reduced to a certain extent, and the efficiency of obtaining the first local feature representation can be improved;
  • the method of feature splicing of features after multiple different pooling processing is chosen, the third local feature representation after the average pooling processing, the fourth local feature representation after the maximum pooling processing, and the fifth local feature representation after the generalized mean pooling processing are comprehensively combined to effectively achieve the purpose of more comprehensive analysis of the sparse sampling results, while improving the diversity of pooling processing selections and the accuracy of obtaining the first local feature representation.
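As a minimal sketch of the pooling options discussed above (assuming the image feature representation is a k×k×1024 feature vector map; the function and variable names are illustrative, not part of the method as filed):

    import numpy as np

    def generalized_mean_pool(vectors, p=3.0, eps=1e-6):
        # raise to the p-th power, take the mean, then take the p-th root (GeM pooling)
        return np.power(np.mean(np.power(np.clip(vectors, eps, None), p), axis=0), 1.0 / p)

    def pooled_feature(feature_map, mode="concat", p=3.0):
        # feature_map: (k, k, D) image feature representation from the backbone
        vectors = feature_map.reshape(-1, feature_map.shape[-1])      # (k*k, D)
        avg_feat = vectors.mean(axis=0)                               # average pooling -> first global feature representation
        max_feat = vectors.max(axis=0)                                # maximum pooling -> second global feature representation
        gem_feat = generalized_mean_pool(vectors, p)                  # GeM pooling     -> third global feature representation
        if mode == "concat":                                          # feature splicing of the three pooling results
            return np.concatenate([avg_feat, max_feat, gem_feat])     # (3*D,)
        return {"avg": avg_feat, "max": max_feat, "gem": gem_feat}[mode]

The same helper can also be applied to the sparse sampling result (an n×D matrix of Token vectors) to obtain the first local feature representation under either the single-pooling or the spliced-pooling option.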
  • the key point feature representation is obtained through a key point extraction algorithm
  • the image feature representation is obtained through a content recognition model
  • the category recognition result of the target content in the target area is determined by a content category library.
  • Step 310 acquiring an image.
  • the target content in the image corresponds to the image key points
  • the image key points are key points extracted based on the distribution law of pixel points in the image.
  • a single image is acquired at a time; or, multiple images are acquired at a time, etc., which is not limited in the embodiments of the present application.
  • an image refers to an image containing target content of unknown category, such as: scenic spot images (including images of unknown scenic spot categories), celebrity photos (including photos of unknown celebrities), anime images (including images of unknown anime characters), etc., which are not limited in the embodiments of the present application.
  • the key points of the image are obtained by analyzing the pixels in the image through a feature detector and extracting the key points according to the distribution pattern of the pixels.
  • the image key points are obtained by at least one of the following extraction methods: SIFT feature detection (Scale Invariant Feature Transform), in which the image is input into a SIFT feature detector and the extreme points in the Difference of Gaussian (DOG) pyramid scale space are taken as the image key points; SURF feature detection (Speeded Up Robust Features), in which a SURF feature detector uses the determinant of the Hessian matrix to detect the image key points; or ORB feature detection (Oriented FAST and Rotated BRIEF), in which the image is input into an ORB feature detector to determine the image key points.
  • Step 320 extract image key points based on the distribution pattern of pixel points in the image, and extract key point feature representations corresponding to the image key points in the image.
  • key point feature representations corresponding to image key points are extracted by a key point extraction algorithm.
  • after determining the image key points through a SIFT key point detector, the key point feature representation corresponding to the image key points is extracted (SIFT feature representation); or, after determining the image key points through a SURF key point detector, the key point feature representation corresponding to the image key points is extracted (SURF feature representation); or, after determining the image key points through an ORB key point detector, the key point feature representation corresponding to the image key points is extracted (ORB feature representation), which is not limited in the embodiments of the present application.
  • At least one of the above-mentioned SIFT feature representation, SURF feature representation or ORB feature representation is selected as the key point feature representation.
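A minimal sketch of extracting such key point feature representations with OpenCV (SIFT and ORB are shown; averaging the per-keypoint descriptors into a single vector is an illustrative assumption, since the embodiments do not fix how multiple key point descriptors are aggregated):

    import cv2
    import numpy as np

    def keypoint_feature_representation(image_bgr, detector="sift"):
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        det = cv2.SIFT_create() if detector == "sift" else cv2.ORB_create()
        keypoints, descriptors = det.detectAndCompute(gray, None)     # image key points and per-keypoint descriptors
        if descriptors is None:                                       # no key points were found
            return keypoints, None
        # illustrative aggregation: average the per-keypoint descriptors into a single vector
        return keypoints, descriptors.astype(np.float32).mean(axis=0)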
  • Step 330 Identify the target area from the image by performing saliency detection on the image.
  • saliency detection is performed on the image to identify a target area corresponding to the target content from the image.
  • saliency detection is used to determine the target area corresponding to the target content and the background area corresponding to the background content in the image. That is, saliency detection is used to divide the image into regions according to content features.
  • a saliency detection model is preset, an image is input into the saliency detection model, and a recognition saliency map corresponding to the image is output, wherein the recognition saliency map includes a target area corresponding to the target content and a background area corresponding to the background content.
  • the recognition saliency map is implemented as an image after the target area is enhanced.
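The saliency detection model itself is not reproduced here; the sketch below only illustrates one simple way a recognition saliency map could be turned into a binary target-area mask by thresholding (the threshold value is an assumption):

    import numpy as np

    def target_area_mask(saliency_map, threshold=0.5):
        # saliency_map: (H, W) recognition saliency map; high values = target area, low values = background area
        sal = (saliency_map - saliency_map.min()) / (saliency_map.ptp() + 1e-8)
        return sal >= threshold                                       # boolean mask of the target area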
  • Figure 6 shows a schematic diagram of a target area provided by an exemplary embodiment of the present application.
  • Figure 6 shows a schematic diagram 600 of recognized saliency maps obtained after saliency detection of three different images, including a first image 610 and a first saliency map 611 corresponding to the first image 610, a second image 620 and a second saliency map 621 corresponding to the second image 620, and a third image 630 and a third saliency map 631 corresponding to the third image 630.
  • the first saliency map 611 includes the first target region (white region)
  • the second saliency map 621 includes the second target region (white region)
  • the third saliency map 631 includes the third target region (white region).
  • the target regions in FIG6 are marked with white regions, and the background regions are marked with black regions.
  • the recognition saliency maps shown in FIG. 6 correspond to cases where the main features of the target content are relatively obvious, that is, the target areas in these recognition saliency maps are displayed relatively completely, and the edges of the white areas are relatively clear.
  • Figure 7 shows a schematic diagram of the target area provided by an exemplary embodiment of the present application.
  • Figure 7 shows a schematic diagram 700 of the recognition saliency map obtained after saliency detection of two different images, including a fourth image 710 and a fourth saliency map 711 corresponding to the fourth image 710, and a fifth image 720 and a fifth saliency map 721 corresponding to the fifth image 720.
  • the white areas in the fourth saliency map 711 and the fifth saliency map 721 are the target areas, and the black areas are the background areas.
  • the current fourth saliency map 711 and the fifth saliency map 721 belong to the situation where the main features corresponding to the target content are not obvious, that is, the edges of the areas corresponding to the white areas are blurred.
  • the saliency detection model includes at least one of the models including Visual Saliency Transformer (VST model), Edge Guidance Network for Salient Object Detection (EGNet model), etc., without limitation.
  • VST model is described in detail.
  • FIG. 8 shows a schematic diagram of a saliency detection model provided by an embodiment of the present application.
  • a VST model is currently displayed, wherein the model input of the VST model includes a first image 810 and a second image 820, wherein the first image 810 is the input image (the image is an RGB image, and the color is not shown in FIG. 8);
  • the second image 820 is a grayscale image (RGB-D image) corresponding to the image
  • a first image block 811 corresponding to the first image 810 and a second image block 821 corresponding to the second image 820 are respectively input into a Transformer encoder space 830 (Transformer Encoder), wherein, in the Transformer encoder space 830, a Token-to-Token (T2T) module is used to encode the first image block 811 and the second image block 821 into multi-level Token vectors (such as T1, T2, T3), and the multi-level Token vectors are input into the converter 840 (Convertor).
  • the converter 840 is used to convert the multi-layer Token vectors from the encoder space 830 to the decoder space 850 (Transformer Decoder) for feature decoding, and outputs the recognition saliency map 8111 corresponding to the first image 810 and the recognition boundary map 8221 corresponding to the second image 820.
  • multi-level Token vector fusion is also used, and a new Token vector upsampling method is proposed under the Transformer structure to obtain high-resolution saliency detection results.
  • a multi-task decoder based on Token vectors is also developed, which simultaneously performs saliency detection and boundary detection by introducing task-related Token vectors and a Patch-Task-Attention mechanism.
  • Step 341 input the image into the content recognition model, and output the image feature representation.
  • the content recognition model is used to extract deep features of images.
  • a single image is input into the content recognition model at a time, and the image feature representation corresponding to the single image is output; or, multiple images are input into the content recognition model at a time, and the image feature representations corresponding to the multiple images are output at the same time, without limitation.
  • the image feature representation is implemented as a multi-dimensional feature vector map, wherein the feature vector map includes multiple patches, and each patch represents a feature vector.
  • the content identification model includes at least one of a Swin Transformer model, a ResNet model, or a T2T-ViT model, which is not limited.
  • the Swin Transformer model is used as the content recognition model.
  • the Swin Transformer model is briefly introduced below.
  • the Swin Transformer model introduces two concepts: hierarchical feature mapping process and window attention conversion process.
  • the hierarchical feature mapping process means that, in the Swin Transformer model, the feature representations output by each layer are gradually merged and downsampled to build a hierarchical feature map; this hierarchical feature map enables the Swin Transformer model to be applied well to fine-grained feature prediction tasks (such as semantic segmentation).
  • the module used in the Swin Transformer model is the standard window-based multi-headed self-attention (W-MSA), which only calculates the corresponding attention within each window. This transformation will result in the existence of patches that do not belong to any window, that is, the patch block is isolated, and there are windows (Window) with incomplete patch blocks.
  • the Swin Transformer model applies the "circular shift" technique to move isolated patch blocks to windows with incomplete patch blocks. After this shift, a window will consist of non-adjacent patch blocks in the original feature vector map, so a mask is applied during the calculation process to limit self-attention to adjacent patch blocks.
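A small illustration of the window partition and the "circular shift" described above, using torch.roll; this is only a sketch of the shifting idea, not the full Swin Transformer attention or masking logic:

    import torch

    def window_partition(x, window_size):
        # x: (H, W, C) feature vector map -> (num_windows, window_size*window_size, C)
        H, W, C = x.shape
        x = x.view(H // window_size, window_size, W // window_size, window_size, C)
        return x.permute(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, C)

    def cyclic_shift(x, shift):
        # roll the feature map so that isolated patch blocks are moved into incomplete windows
        return torch.roll(x, shifts=(-shift, -shift), dims=(0, 1))

    feat = torch.randn(14, 14, 1024)                                  # toy feature map of 14x14 patch blocks
    windows = window_partition(cyclic_shift(feat, shift=3), window_size=7)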
  • the image is input into the Swin Transformer model, and a k×k×1024 feature vector map is output at the end of the Swin Transformer model as the image feature representation.
  • Step 342 downsample the image feature representation based on the target area to obtain a first local feature representation.
  • sparse sampling is performed on the image feature representation based on the target area to obtain a sparse sampling result; and pooling is performed on the sparse sampling result to obtain a first local feature representation.
  • sparse sampling is performed on the k×k×1024 feature vector map to obtain n Token vectors of dimension 1024 (an n×1024 matrix), and then these Token vectors are average pooled to obtain the local feature.
  • the sparse sampling result is subjected to any one of maximum pooling, average pooling and generalized mean pooling, and the pooling result is used as the first local feature representation; or, the sparse sampling result is subjected to maximum pooling, average pooling and generalized mean pooling respectively, and the three pooling results are feature spliced to obtain the first local feature representation, which is not limited.
  • average pooling is performed on the sparse sampling result, and the pooling result is used as the first local feature representation, as an example.
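A minimal sketch of this step, assuming a k×k×1024 feature vector map and a k×k boolean mask marking the image blocks inside the target area (names are illustrative):

    import numpy as np

    def first_local_feature(feature_map, block_mask):
        # feature_map: (k, k, 1024); block_mask: (k, k) True for image blocks inside the target area
        tokens = feature_map[block_mask]                              # sparse sampling result: (n, 1024) Token vectors
        if tokens.size == 0:                                          # fall back to all blocks if the mask is empty
            tokens = feature_map.reshape(-1, feature_map.shape[-1])
        return tokens.mean(axis=0)                                    # average pooling -> first local feature representation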
  • Step 343 performing pooling processing on the image feature representation to obtain a global feature representation.
  • any one of maximum pooling, average pooling and generalized mean pooling is performed on the image feature representation, and the pooling result is used as the global feature representation; or, the image feature representation is respectively subjected to maximum pooling, average pooling and generalized mean pooling, and the three pooling results are feature spliced to obtain the global feature representation, which is not limited.
  • generalized mean pooling is performed on the image feature representation, and the pooling result is used as the global feature representation as an example.
  • Step 350 Concatenate the first local feature representation and the key point feature representation to obtain a second local feature representation.
  • the first local feature representation and the key point feature representation are sequentially concatenated, and the feature concatenation result is used as the second local feature representation.
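The feature splicing in this step reduces to vector concatenation, for example:

    import numpy as np

    def second_local_feature(first_local, keypoint_feature):
        # sequential feature splicing of the two 1-D representations
        return np.concatenate([first_local, keypoint_feature], axis=0)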
  • the process of extracting key point feature representation and obtaining the second local feature representation by using the key point extraction algorithm is introduced.
  • the key points corresponding to the image are determined by the key point extraction algorithm and the key point feature representation corresponding to the key points is determined, so that the key points of the image are referred to by the key point feature representation, which facilitates the model to conduct targeted analysis of the key points of the image, reduces the complexity of model recognition, and shortens the model recognition time; and then the key point feature representation is merged with the first local feature representation corresponding to the target area in the image feature representation, while highlighting the image key point information, the overall perception of the target content is increased, the expression of the local feature representation to the target content is enhanced, and the accuracy of the second local feature representation is improved.
  • Step 361 obtaining a content category library, wherein the content category library includes a set of n preset categories, where n is a positive integer.
  • the content category library includes n pre-stored categories, and each category stores a candidate feature representation corresponding to at least one candidate image (i.e., the image feature representation corresponding to the candidate image). That is, the candidate feature representation corresponds to the category.
  • the category "poodle” stores multiple images containing poodles, and each poodle image is annotated with a feature representation corresponding to the poodle as a candidate feature representation.
  • the library of content categories is pre-acquired.
  • Step 362 respectively matching the global feature representation with the n categories in the content category library to obtain k candidate categories in the content category library that match the global feature representation, where 0 < k ≤ n and k is an integer.
  • the global feature representation is matched with n categories in a content category library respectively to obtain global matching scores corresponding to the n categories, and the global matching scores are used to characterize the probability that the target content belongs to a category; the global matching scores corresponding to the n categories are sorted to obtain a matching ranking result; the top k categories in the matching ranking result are used as k candidate categories that match the global feature representation.
  • the global matching score is determined according to the distance between the global feature representation and the candidate feature representations corresponding to the n categories in the vector space. For example, when the distance between the global feature representation and the candidate feature representation in the vector space is smaller, the global matching score is higher; when the distance between the global feature representation and the candidate feature representation in the vector space is larger, the global matching score is lower.
  • the corresponding candidate feature representations under all categories in the content category library are traversed, each candidate feature representation is matched with the global feature representation, and the global matching score corresponding to the category is determined according to the matching of the corresponding candidate feature representation under the category with the current global feature representation, wherein the higher the global matching score of the category, the higher the matching degree between the candidate feature representation under the category and the global feature representation, that is, the higher the probability that the category corresponding to the current target content is this category.
  • the global matching scores are arranged in descending order to obtain the matching ranking result, and the first k categories in the matching ranking result are selected as the k candidate categories that match the global feature representation.
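A sketch of the global matching and ranking in step 362, assuming the content category library is a mapping from each category to an array of candidate feature representations and using the negative Euclidean distance as the global matching score (the smaller the distance in vector space, the higher the score):

    import numpy as np

    def topk_candidate_categories(global_feat, category_library, k=10):
        # category_library: {category: (m, D) array of candidate feature representations}
        scores = {}
        for category, candidates in category_library.items():
            distances = np.linalg.norm(candidates - global_feat, axis=1)
            scores[category] = -float(distances.min())                # best-matching candidate defines the global matching score
        ranking = sorted(scores, key=scores.get, reverse=True)        # matching ranking result (descending)
        return ranking[:k], scores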
  • Step 363 sort the k candidate categories based on the first local feature representation to obtain a category sorting result.
  • the first k candidate categories with the highest content matching scores with the target content are selected from the content category library through the global feature representation, and the k candidate categories are sorted again according to the local feature representation to obtain the category sorting result.
  • the first local feature representation is matched with the candidate feature representations stored under k candidate categories respectively, and the local matching scores corresponding to the k candidate categories are determined according to the matching between the candidate feature representation and the first local feature representation, wherein the local matching score is used to indicate the matching between the current first local feature representation and the candidate feature representation under the category, and the higher the matching degree, the higher the local matching score corresponding to the category, and the k candidate categories are sorted from high to low according to the local matching scores corresponding to the k candidate categories, to obtain the category sorting result.
  • Step 364 Obtain the identification category corresponding to the target content according to the category sorting result.
  • the candidate category with the highest local matching score in the category ranking result (or the several highest-scoring candidate categories) is selected as the recognition category, that is, as the content recognition result.
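And a corresponding sketch of steps 363-364, re-scoring only the k candidate categories with the local feature representation and taking the top-ranked category as the recognition result (the same library layout is assumed, here holding locally pooled candidate features):

    import numpy as np

    def rerank_and_identify(local_feat, candidate_categories, local_library):
        # local_library: {category: (m, D) array of local candidate feature representations}
        local_scores = {}
        for category in candidate_categories:
            distances = np.linalg.norm(local_library[category] - local_feat, axis=1)
            local_scores[category] = -float(distances.min())          # local matching score
        ranking = sorted(candidate_categories, key=local_scores.get, reverse=True)
        return ranking[0], ranking                                    # identification category and category sorting result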
  • the content recognition method extracts image key points based on the distribution law of pixel points in the image and extracts the key point feature representation corresponding to the image key points, performs saliency detection on the image to obtain the target area in the image; performs pooling processing on the image feature representation corresponding to the image to obtain the global feature representation, and downsamples the image feature representation based on the target area, and performs feature splicing on the downsampled first local feature representation and the key point feature representation to obtain the second local feature representation, thereby identifying the category corresponding to the target content in the target image according to the global feature representation and the second local feature representation.
  • the target area of the target content in the image is determined by saliency detection, so that the target area containing the target content is effectively separated from the background area and the distinguishability of the target area is enhanced. Then, when performing regional analysis on the image feature representation through the target area, the interference of the background area in the image can be effectively eliminated, the expression strength of the obtained image feature representation for the target area can be improved, and the background content that does not contain the main features can be filtered out to the greatest extent, thereby improving the accuracy and efficiency of content category recognition of the target content.
  • k matching candidate categories are selected from the content category library including n categories through the global feature representation representing the global information content, and then the k candidate categories are reordered according to the local feature representation representing the targeted local information content, and the category corresponding to the target content is determined according to the category sorting result.
  • the process of identifying the target content from the global to the local is effectively realized, and the standardization of the content identification process is improved by means of a hierarchical analysis method. While improving the accuracy of content category identification, the flexibility of content identification is improved.
  • the selection process of the candidate category is more intuitively realized, which is conducive to obtaining k candidate categories more comprehensively and accurately, and thus is conducive to improving the accuracy of content identification.
  • the application scenario corresponding to the content recognition method provided in the present application is described.
  • Figure 9 shows a schematic diagram of the content recognition method provided by an exemplary embodiment of the present application, and describes the application of the content method to an image search scenario as an example.
  • in this scenario, the user inputs an image, and the image library is searched through the input image to obtain the image with the highest matching degree with the input image as the image search result.
  • an image 910 is acquired, wherein the image 910 is an image input by a user, the image 910 includes a target content 911, the image corresponds to a plurality of image key points (not shown in FIG9 ), and the image key points are feature points detected by at least one of the three key point detectors: a SIFT key point detector, an ORB key point detector, or a SURF key point detector.
  • the image 910 is input into the content recognition model 920, and the image feature representation 930 is output, wherein the content recognition model 920 is implemented as a Swin Transformer model, and the image feature representation 930 is implemented as a feature vector map with a feature size of k×k×1024 output by the last layer of the Swin Transformer model.
  • Saliency detection is performed on the image 910 to extract a target region 912 , which corresponds to the target content 911 .
  • the image feature representation 930 is subjected to generalized mean pooling processing 940 and sparse sampling 950 to obtain a global feature representation 941 and a sparse sampling result 950, respectively.
  • sparse sampling 950 is performed on the image feature representation 930 to obtain a sparse sampling result 950 .
  • sparse sampling 950 is performed on the image feature representation 930 based on the target area 912 .
  • a feature representation corresponding to the target area 912 is extracted from the image feature representation 930 as a sparse sampling result 950 after sparse sampling; or, a feature representation corresponding to a certain area slightly larger than the target area 912 is extracted from the image feature representation 930 as a sparse sampling result after sparse sampling.
  • the sparse sampling result 950 is average pooled 960 to obtain a first local feature representation (not shown in Figure 9), and the first local feature representation and the key point feature representation extracted from the image key points (at least one of SIFT feature representation, SURF feature representation or ORB feature representation) are concatenated to obtain a second local feature representation 951.
  • a feature dimensionality reduction operation is performed on the results obtained by the above pooling process to remove redundant features with high correlation between feature representations.
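The embodiments do not name a specific dimensionality reduction technique; PCA is used below purely as an illustrative choice:

    import numpy as np
    from sklearn.decomposition import PCA

    def reduce_features(feature_matrix, out_dim=256):
        # feature_matrix: (num_samples, D); out_dim must not exceed min(num_samples, D)
        pca = PCA(n_components=out_dim, whiten=True)
        return pca.fit_transform(feature_matrix)                      # drops redundant, highly correlated dimensions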
  • a match is performed in the category library 970 according to the global feature representation 941 to obtain the top k candidate categories (TOP-K) with the highest global matching scores with the global feature representation 941 as k candidate categories 971 .
  • the k candidate categories 971 are matched again according to the local feature library 952 storing the second local feature representation 951 to obtain local matching scores corresponding to the k candidate categories 971, and the k candidate categories are reordered according to the local matching scores. Finally, the category 980 corresponding to the target content is selected as the output, wherein the category 980 corresponding to the target content is implemented as the "Great Wall".
  • the category corresponding to the target content is input into the image library, and the candidate images corresponding to the category corresponding to the target content in the image library are selected for output and displayed to the user.
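Putting the pieces of the Figure 9 scenario together, reusing the illustrative helpers sketched earlier in this section (the backbone and saliency_model arguments are stand-ins for the Swin Transformer model and the saliency detection model; all names are assumptions):

    def recognize_content(image_bgr, backbone, saliency_model, category_library, local_library, k=10):
        feature_map = backbone(image_bgr)                             # (k, k, 1024) image feature representation
        block_mask = target_area_mask(saliency_model(image_bgr))      # assumed to be aligned with the k x k block grid
        global_feat = pooled_feature(feature_map, mode="gem")         # global feature representation (GeM pooling 940)
        _, kp_feat = keypoint_feature_representation(image_bgr)       # key point feature representation
        local_feat = first_local_feature(feature_map, block_mask)     # first local feature representation
        if kp_feat is not None:
            local_feat = second_local_feature(local_feat, kp_feat)    # second local feature representation
        candidates, _ = topk_candidate_categories(global_feat, category_library, k)
        category, _ = rerank_and_identify(local_feat, candidates, local_library)
        return category                                               # e.g. "Great Wall" in the Figure 9 example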
  • the content recognition method extracts image key points based on the distribution law of pixel points in the image and extracts the key point feature representation corresponding to the image key points, performs saliency detection on the image to obtain the target area in the image; performs pooling processing on the image feature representation corresponding to the image to obtain the global feature representation, and downsamples the image feature representation, and performs feature splicing on the first local feature representation obtained by downsampling and the key point feature representation to obtain the second local feature representation, thereby identifying the category of the target content in the image according to the global feature representation and the second local feature representation.
  • the process of downsampling the image feature representation based on the target area to obtain the first local feature representation can effectively extract the effective information about the local features in the image feature representation; then, when the first local feature representation and the key point feature representation are feature spliced, the purpose of combining the image key points to obtain a more accurate second local feature representation can be achieved, and the category of the target content in the image can be identified by using the global feature representation and the second local feature representation, which can effectively improve the accuracy of content recognition.
  • FIG. 10 is a structural block diagram of a content identification device provided by an exemplary embodiment of the present application. As shown in FIG. 10 , the device includes the following parts:
  • An acquisition module 1010 is used to acquire an image
  • the extraction module 1020 is used to extract image key points based on the distribution law of pixel points in the image, and extract key point feature representations corresponding to the image key points in the image; and identify the target area from the image by performing saliency detection on the image;
  • a processing module 1030 is used to perform pooling processing on the image feature representation corresponding to the image to obtain a global feature representation, and to downsample the image feature representation based on the target area to obtain a first local feature representation;
  • a splicing module 1040 is used to perform feature splicing on the key point feature representation and the first local feature representation to obtain a second local feature representation;
  • the identification module 1050 is configured to identify a category of the target content contained in the target area based on the global feature representation and the second local feature representation.
  • the processing module 1030 includes:
  • a sampling unit 1031 is used to perform sparse sampling on the image feature representation based on the target area to obtain a sparse sampling result
  • the processing unit 1032 is used to perform pooling processing on the sparse sampling result to obtain the first local feature representation.
  • the image is composed of a plurality of image blocks
  • the image feature representation is composed of a plurality of sub-feature representations
  • the plurality of image blocks correspond one to one to the plurality of sub-feature representations
  • the processing module 1030 obtains a plurality of regional image blocks within the target area from a plurality of image blocks included in the image; and obtains sub-feature representations respectively corresponding to the plurality of regional image blocks from a plurality of sub-feature representations included in the image feature representation as the sparse sampling result.
  • the pooling process includes any one of average pooling process, maximum pooling process and generalized mean pooling process.
  • the processing unit 1032 is further configured to perform average pooling processing on the sparse sampling result to obtain a third local feature representation; perform maximum pooling processing on the sparse sampling result to obtain a fourth local feature representation; perform generalized mean pooling processing on the sparse sampling result to obtain a fifth local feature representation; and perform feature splicing on the third local feature representation, the fourth local feature representation and the fifth local feature representation to obtain the first local feature representation.
  • the extraction module 1020 is further configured to extract key point feature representations corresponding to the image key points through a key point extraction algorithm.
  • the processing module 1030 is further used to input the image into a content recognition model and output the image feature representation, wherein the content recognition model is used to perform deep feature extraction on the image; and generalized mean pooling is performed on the image feature representation to obtain the global feature representation.
  • the processing module 1030 is also used to perform average pooling processing on the image feature representation to obtain a first global feature representation; perform maximum pooling processing on the image feature representation to obtain a second global feature representation; perform generalized mean pooling processing on the image feature representation to obtain a third global feature representation; and perform feature splicing on the first global feature representation, the second global feature representation and the third global feature representation to obtain the global feature representation.
  • the identification module 1050 is further used to obtain a content category library, which includes a pre-set set of n categories, where n is a positive integer; respectively match the global feature representation with the n categories in the content category library to obtain k candidate categories in the content category library that match the global feature representation, where 0 < k ≤ n and k is an integer; sort the k candidate categories based on the second local feature representation to obtain a category sorting result; and obtain an identification category corresponding to the target content based on the category sorting result.
  • the identification module 1050 is further used to match the global feature representation with the n categories in the content category library respectively to obtain global matching scores corresponding to the n categories, and the global matching scores are used to characterize the probability that the target content belongs to the category; sort the global matching scores corresponding to the n categories to obtain a matching ranking result; and use the top k categories in the matching ranking result as k candidate categories that match the global feature representation.
  • the content recognition device extracts image key points based on the distribution law of pixel points in the image and extracts the key point feature representation corresponding to the image key points, performs saliency detection on the image to obtain the target area in the image; performs pooling processing on the image feature representation corresponding to the image to obtain the global feature representation, and downsamples the image feature representation based on the target area, and performs feature splicing on the first local feature representation obtained by the downsampling and the key point feature representation to obtain the second local feature representation, thereby identifying the category of the target content in the image according to the global feature representation and the second local feature representation.
  • the process of downsampling the image feature representation based on the target area to obtain the first local feature can effectively extract the effective information about the local feature in the image feature representation, and then when the first local feature representation and the key point feature representation are feature spliced, the purpose of combining the image key points to obtain a more accurate second local feature representation can be achieved, and the category of the target content in the image can be identified by using the global feature and the second local feature, which can effectively improve the accuracy of content recognition.
  • when the content recognition device provided in the above embodiment implements its functions, the division into the above functional modules is used only as an example for illustration; in practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the content recognition device provided in the above embodiment and the content recognition method embodiment belong to the same concept. The specific implementation process is detailed in the method embodiment and will not be repeated here.
  • FIG12 shows a schematic diagram of the structure of a server provided by an exemplary embodiment of the present application. Specifically:
  • the server 1200 includes a central processing unit (CPU) 1201, a system memory 1204 including a random access memory (RAM) 1202 and a read-only memory (ROM) 1203, and a system bus 1205 connecting the system memory 1204 and the central processing unit 1201.
  • the server 1200 also includes a mass storage device 1206 for storing an operating system 1213, application programs 1214, and other program modules 1215.
  • the mass storage device 1206 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1206 and its associated computer-readable media provide non-volatile storage for the server 1200.
  • computer readable media may comprise computer storage media and communication media.
  • the server 1200 can be connected to the network 1212 via a network interface unit 1211 connected to the system bus 1205, or the network interface unit 1211 can be used to connect to other types of networks or remote computer systems (not shown).
  • the above-mentioned memory also includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
  • An embodiment of the present application also provides a computer device, which includes a processor and a memory, wherein the memory stores at least one instruction, at least one program, code set or instruction set, and the at least one instruction, at least one program, code set or instruction set is loaded and executed by the processor to implement the content identification method provided by the above-mentioned method embodiments.
  • An embodiment of the present application also provides a computer-readable storage medium, on which is stored at least one instruction, at least one program, code set or instruction set, and the at least one instruction, at least one program, code set or instruction set is loaded and executed by a processor to implement the content identification method provided by the above-mentioned method embodiments.
  • the embodiments of the present application also provide a computer program product or a computer program, which includes a computer instruction stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the content recognition method described in any of the above embodiments.
  • the computer readable storage medium may include: a read-only memory (ROM), a random access memory (RAM), a solid state drive (SSD), or an optical disk.
  • the random access memory may include a resistance random access memory (ReRAM) and a dynamic random access memory (DRAM).

Abstract

A content recognition method, apparatus, device, storage medium and computer program product, relating to the field of machine learning. The method comprises: acquiring an image (310); extracting image key points on the basis of the distribution law of pixel points in the image, and extracting key point feature representations corresponding to the image key points in the image (320); identifying a target area from the image by performing saliency detection on the image (330); performing pooling processing on an image feature representation corresponding to the image to obtain a global feature representation, and downsampling the image feature representation on the basis of the target area to obtain a first local feature representation (340); performing feature splicing on the key point feature representation and the first local feature representation to obtain a second local feature representation (350); and recognizing the category of target content contained in the target area on the basis of the global feature representation and the second local feature representation (360).

Description

内容识别方法、装置、设备、存储介质及计算机程序产品
本申请要求于2022年08月04日提交的申请号为202210934770.8、发明名称为“内容识别方法、装置、设备、存储介质及计算机程序产品”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及机器学习领域,特别涉及一种内容识别方法、装置、设备、存储介质及计算机程序产品。
背景技术
随着互联网技术的不断发展,用户每天会浏览大量的多媒体内容,例如图片、视频、文章等。通过确定多媒体内容中包含的类别信息确定该多媒体内容对应的属性信息,从而能够更好地满足不同场景下的用户浏览需求,如:在图像搜索场景下,用户输入搜索关键词后,从图像库中选择图像内容与搜索关键词匹配的图像作为搜索结果并向用户进行展示。
在相关技术中,通常采用深度学习模型提取图像对应的全局特征并建立内容搜索库,在图像搜索场景下,当用户输入搜索关键词后,根据搜索关键词在内容搜索库中确定与搜索关键词匹配的全局特征,从而将全局特征对应的图像直接作为搜索结果并向用户进行展示。
然而在相关技术中,通常仅根据图像的全局特征确定与搜索关键词匹配的图像,尽管该图像对应的全局特征与搜索关键词匹配度较高,但仍然可能存在图像与搜索关键词并不匹配的情况,导致内容识别的准确度较低。
发明内容
本申请实施例提供了一种内容识别方法、装置、设备、存储介质及计算机程序产品,能够提高内容识别的准确度。所述技术方案如下:
一方面,提供了一种内容识别方法,所述方法包括:
获取图像;
基于所述图像中像素点分布规律提取得到图像关键点,并提取所述图像中与所述图像关键点对应的关键点特征表示;
通过对所述图像进行显著性检测,从所述图像中识别出目标区域;
对所述图像对应的图像特征表示进行池化处理,得到全局特征表示,以及,基于所述目标区域对所述图像特征表示进行下采样,得到第一局部特征表示;
将所述关键点特征表示和所述第一局部特征表示进行特征拼接,得到第二局部特征表示;
基于所述全局特征表示和所述第二局部特征表示对所述目标区域中包含的目标内容的类别进行识别。
另一方面,提供了一种内容识别装置,所述装置包括:
获取模块,用于获取图像;
提取模块,用于基于所述图像中像素点分布规律提取得到图像关键点,并提取所述图像中与所述图像关键点对应的关键点特征表示;通过对所述图像进行显著性检测,从所述图像中识别出目标区域;
处理模块,用于对所述图像对应的图像特征表示进行池化处理,得到全局特征表示,以及,基于所述目标区域对所述图像特征表示进行下采样,得到第一局部特征表示;
拼接模块,用于将所述关键点特征表示和所述第一局部特征表示进行特征拼接,得到第二局部特征表示;
识别模块,用于基于所述全局特征表示和所述第二局部特征表示对所述目标区域中包含的目标内容的类别进行识别。
另一方面,提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一 段程序、所述代码集或指令集由所述处理器加载并执行以实现如上述本申请实施例中任一所述内容识别方法。
另一方面,提供了一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如上述本申请实施例中任一所述的内容识别方法。
另一方面,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述实施例中任一所述的内容识别方法。
本申请实施例提供的技术方案带来的有益效果至少包括:
基于图像中像素点分布规律提取得到图像关键点并提取与图像关键点对应的关键点特征表示,对图像进行显著性检测得到图像内的目标区域;对图像对应的图像特征表示进行池化处理得到全局特征表示,并基于目标区域对图像特征表示进行下采样,将下采样后的第一局部特征表示和关键点特征表示进行特征拼接后得到第二局部特征表示,从而根据全局特征表示和第二局部特征表示对目标区域中目标内容的类别进行识别。也即,通过对图像特征表示进行全局层面的池化处理得到代表全局信息的全局特征表示;通过基于目标区域对图像特征表示进行局部层面的下采样处理得到代表局部信息的第一局部特征表示,从而有效提取图像特征表示中关于局部特征的有效信息,进而在将第一局部特征表示和关键点特征表示进行特征拼接时,能够实现结合图像关键点得到更准确的第二局部特征表示的目的,利用全局特征和第二局部特征对目标区域中包含的目标内容的类别进行识别,能够有效提高识别内容对应类别的准确度。
附图说明
图1是本申请一个示例性实施例提供的内容识别方法相关技术示意图;
图2是本申请一个示例性实施例提供的实施环境示意图;
图3是本申请一个示例性实施例提供的内容识别方法流程图;
图4是本申请另一个示例性实施例提供的内容识别方法流程图;
图5是本申请另一个示例性实施例提供的内容识别方法流程图;
图6是本申请一个示例性实施例提供的目标区域示意图;
图7是本申请另一个示例性实施例提供的目标区域示意图;
图8是本申请一个示例性实施例提供的显著性检测模型示意图;
图9是本申请另一个示例性实施例提供的内容识别方法示意图;
图10是本申请一个示例性实施例提供的内容识别装置结构框图;
图11是本申请另一个示例性实施例提供的内容识别装置结构框图;
图12是本申请一个示例性实施例提供的服务器结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
示意性的,请参考图1,其示出了本申请一个示例性实施例提供的内容识别方法示意图,如图1所示,获取图像110,其中,图像110实现为景点图像,通过图像110中像素点分布规律能够提取得到与图像110对应的图像关键点(图中未示出),进而提取到与图像关键点对应的关键点特征表示;此外,通过对图像进行显著性检测能够得到图像内的目标区域。
对图像对应的图像特征表示进行池化处理,得到指代图像全局信息的全局特征表示130,并且,基于显著性检测得到的目标区域对图像特征表示120进行下采样,得到指代图像中目标区域的局部信息的第一局部特征表示140。
将第一局部特征表示140和关键点特征表示进行特征拼接,得到第二局部特征表示150。 根据全局特征表示130和第二局部特征表示150对图像110中目标区域内目标内容111的类别进行识别,得到内容识别结果160,其中,内容识别结果160实现为“A景观楼”,即表征了目标内容111对应的类别。
对本申请实施例中涉及的实施环境进行说明,示意性的,请参考图2,该实施环境中涉及终端210、服务器220,终端210和服务器220之间通过通信网络230连接。
示意性的,终端210向服务器发送内容识别请求,其中,内容识别请求中包括图像,图像中包括目标内容,服务器220收到来自终端210发送的内容识别请求后,对图像进行内容识别,将识别得到的内容识别结果反馈至终端210,内容识别结果反映了目标内容对应的类别。
其中,服务器220在对图像进行内容识别的过程中,通过对图像进行显著性检测能够得到图像内的目标区域;此外,对图像对应的图像特征表示进行池化处理得到全局特征表示222,以及,基于目标区域对图像特征表示进行下采样,得到第一局部特征表示223;此外,基于图像中像素点分布规律提取得到图像关键点,进而提取图像中与图像关键点对应的关键点特征表示224,将关键点特征表示224和第一局部特征表示223进行特征拼接,得到第二局部特征表示225,根据第二局部特征表示225和全局特征表示222对目标内容进行识别,确定目标内容对应的类别226。
上述终端210可以是手机、平板电脑、台式电脑、便携式笔记本电脑、智能电视、智能车载等多种形式的终端设备,本申请实施例对此不加以限定。
值得注意的是,上述服务器220可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(Content Delivery Network,CDN)、以及大数据和人工智能平台等基础云计算服务的云服务器。
其中,云技术(Cloud Technology)是指在广域网或局域网内将硬件、软件、网络等系列资源统一起来,实现数据的计算、储存、处理和共享的一种托管技术。
在一些实施例中,上述服务器220还可以实现为区块链系统中的节点。
需要说明的是,本申请所涉及的信息(包括但不限于用户设备信息、用户个人信息等)、数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)以及信号,均为经用户授权或者经过各方充分授权的,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。例如,本申请中涉及到的图像是在充分授权的情况下获取的。
示意性的,对本申请提供的内容识别方法进行说明,请参考图3,其示出了本申请一个示例性实施例提供的内容识别方法流程图,该方法可以由终端执行,也可以由服务器执行,或者,也可以由终端和服务器共同执行,本实施例中以该方法由服务器执行进行说明,如图3所示,该方法包括如下步骤。
步骤310,获取图像。
示意性的,图像是从开源图像数据库中获取得到的图像;或者,图像是通过图像采集设备(如:相机、终端、摄影机等设备)采集得到的图像;或者,图像是通过图像合成软件合成得到的图像;或者,图像是从一段视频中截图得到的图像等。
步骤320,基于图像中像素点分布规律提取得到图像关键点,并提取图像中与图像关键点对应的关键点特征表示。
其中,像素点分布规律用于表征不同像素点之间的变化情况。示意性的,像素是表征图像分辨率的单位,可以将图像视为由多个像素组成的内容,每个像素对应一个像素点,且根据图像分辨率的表征内容,每个像素点对应一个像素值,像素点分布规律表征了图像上不同像素点对应的像素值的变化情况。
可选地,综合分析图像中多个像素点分别对应的像素值,根据像素值的变化确定图像中像素点分布规律,进而根据该像素点分布规律提取得到与图像对应的图像关键点,图像关键点中包括图像中具有象征目标内容的像素点。
在一些实施例中,将像素点分布规律符合预设条件的像素点作为上述的图像关键点。
示意性的,预设条件实现为像素值变化幅度超过预设幅度阈值,例如:相邻两个像素点对应的像素值变化较大(超过预设幅度阈值),将其中至少一个像素点作为上述的图像关键点;或者,预设条件实现为像素值变化速度超过预设速度阈值,例如:相邻多个像素点对应的像素值变化较快(超过预设速度阈值),将其中至少一个像素点作为上述的图像关键点等。
可选地,图像关键点的类型包括角点、边缘或者区块等关键点类型中至少一种,本申请实施例对此不加以限定。
可选地,图像对应标注有单个图像关键点;或者,图像对应标注有多个图像关键点,本申请实施例对此不加以限定。
示意性的,图像关键点是基于图像整体进行识别得到的内容,因此能够在一定程度上展现图像整体需要突出表达的信息。基于图像通常由目标区域和背景区域所组成,目标区域代表图像主要表达的区域,其中包含了大量图像突出表达的信息;背景区域代表图像次要表达的区域,其中包含了少量图像突出表达的信息。
在基于图像提取得到多个图像关键点时,多个图像关键点中的大量图像关键点指代的是目标区域所表达的信息,多个图像关键点中的少量图像关键点指代的是背景区域所表达的信息。因此,提取得到的多个图像关键点主要表达的是目标区域对应的信息,也会在小程度上体现图像内目标区域之外的信息。
例如:提取得到的多个图像关键点中包括图像关键点1至图像关键点10,图像关键点1至图像关键点8均是从目标区域内提取得到的关键点,图像关键点9和图像关键点10是从图像中目标区域外提取得到的图像关键点。
在一些实施例中,通过预设的关键点检测器对图像进行关键点提取,并将关键点检测器输出得到的结果作为图像对应的图像关键点。
在一个可选的实施例中,对提取得到的图像关键点进行特征提取,得到图像中与图像关键点对应的关键点特征表示。
可选地,对所有图像关键点进行特征提取,得到所有图像关键点对应的关键点特征表示,关键点特征表示用于后续特征拼接;或者,选择部分图像关键点进行特征提取,得到部分图像关键点对应的关键点特征表示等,本申请实施例对此不加以限定。
步骤330,对图像进行显著性检测得到图像内的目标区域。
示意性的,显著性检测用于使用图像处理技术和计算机视觉算法定位图像中的最"显著"的区域,即为目标区域。可选地,显著性检测用于确定包含目标内容的区域,该区域即为目标区域。目标区域是指图像中引人注目的区域或比较重要的区域,例如:人眼在观看一幅图片时会首先关注的区域。可选地,将自动定位图像中关键内容或场景中重要区域的过程称为显著性检测。显著性检测在目标检测、机器人领域等多个图像识别领域中存在广泛应用。
也即:基于显著性检测是针对图像中重要内容进行的检测,因此在显著性检测得到目标区域后,该目标区域是代表固定区域的内容。相比从图像中提取得到的图像关键点而言,显著性检测关注的目标区域更加的单一,且目标区域的明确性更强,而非像图像关键点一般,在图像中的众多区域均可能存在。
在一些实施例中,显著性检测实现为基于阈值的区域分割方法。
示意性的,获取图像中各个像素点的像素值,当连续多个像素点分别对应的像素值均达到预设像素阈值,将连续多个像素点组成的区域作为上述的目标区域,将图像中目标区域之外的区域作为背景区域,从而实现目标区域和背景区域的分割过程,将该过程视为对图像进行显著性检测得到目标区域的过程。
在一些实施例中,显著性检测实现为基于边缘检测的区域分割方法。
示意性的,考虑到通常不同区域的边界上像素点的灰度值变化比较剧烈,因此可以采用傅里叶变换方法,将图像从空间域通过变换到频率域,区域边缘则对应着高频部分,因此能够较为直观地将不同的区域进行分割,从而较好地从图像中将目标区域和背景区域相分离。
在一些实施例中,显著性检测实现为通过预先训练的图像分割模型进行图像识别的方法。
示意性的,获取预先训练得到的、用于进行图像分割的图像分割模型,通过将图像输入该图像分割模型中,将图像中的目标区域和背景区域相分离。可选地,训练得到图像分割模型的预训练模型实现为视觉几何小组(Visual Geometry Group,VGG)研发得到的模型,通过大量样本图像对VGG模型进行训练得到图像分割模型,其中大量的样本图像中包括指代样本图像中目标区域的区域标签,借助VGG模型对样本图像的预测结果和样本图像本身标注的区域标签之间的差异实现模型训练过程,并最终借助训练后的图像分割模型对图像进行识别,以得到图像中的目标区域,实现显著性检测的目的。
值得注意的是,以上显著性检测方法仅为示意性的举例,各种能将目标区域从图像中分离得到的方法均可以实现上述显著性检测的目的,本申请实施例对此不加以限定。
可选地,目标区域用于表示包括目标内容的区域。
示意性的,图像是指包含未知类别的目标内容的图像。目标内容包括人物、动物、食物、景点、地标等内容类型中至少一种。
可选地,显著性检测用于确定图像中的目标内容对应的目标区域和背景内容对应的背景区域,也即,显著性检测用于根据内容特征对图像进行区域划分,从而将目标区域与背景区域进行分离。
示意性的,基于目标内容代表图像中与其他背景内容存在一定差异的内容,因此通过显著性检测的方式能够将目标内容对应的目标区域与背景内容对应的背景区域相分离,从而确定图像内包括目标内容的目标区域。
例如:目标区域中像素点对应的像素值较大,背景区域中像素点对应的像素值较小,通过设定像素阈值的方式实现显著性检测的过程,从而将目标区域与背景区域进行分离;或者,通过图像锐化方法实现显著性检测过程,即对图像进行锐化处理,增强图像中的边缘细节信息,从而实现增强目标区域和背景区域之间边界的过程,以将目标区域与背景区域相分离。
可选地,图像中包括单个目标内容;或者,图像中包括多个目标内容,其中,当图像中包括多个目标内容的情况下,多个目标内容对应为不同的内容,或者对应相同的内容,此处不加以限定。
示意性的,当图像中包括单个目标内容时,目标区域实现为包括该单个目标内容的单个区域;或者,当图像中包括多个目标内容时,目标区域实现为包括该多个目标内容的单个区域;或者,当图像中包括多个目标内容时,目标区域实现多个区域,且每个目标区域中包括至少一个目标内容等。
在一些实施例中,预设显著性检测模型,将图像输入显著性检测模型,输出得到图像对应的识别显著图,识别显著图中包括目标内容对应的目标区域,以及背景内容对应的背景区域。其中,识别显著图实现为将目标区域进行增强后的图像。
步骤340,对图像对应的图像特征表示进行池化处理,得到全局特征表示,以及,基于目标区域对图像特征表示进行下采样,得到第一局部特征表示。
示意性的,池化处理(Pooling)是指对图像特征表示进行降采样处理,以将图像特征表示进行压缩,在减少参数数量的同时保持图像特征表示某种不变性(如:旋转不变性、平移不变性和伸缩不变性中至少一种)。
可选地,池化处理通过卷积核实现,借助图像对应的图像尺寸以及卷积核对应的卷积核尺寸,得到对图像进行池化处理后的全局特征表示。
示意性的,确定图像对应的图像尺寸;获取用于对图像进行池化处理的卷积核,并获取卷积核的卷积核尺寸;以卷积核对图像进行池化处理,得到全局特征表示,全局特征表示的尺寸是图像尺寸与卷积核尺寸之商。
例如:图像的图像尺寸为20*20,所采用的卷积核的卷积核尺寸为10*10,通过该卷积核对该图像进行池化处理,得到的处理后的全局特征表示的尺寸为2*2;或者,图像的图像尺寸为20*20,所采用的卷积核的卷积核尺寸为5*5,通过该卷积核对该图像进行池化处理,得 到的处理后的全局特征表示的尺寸为4*4等。
其中,图像的图像尺寸以及卷积核的卷积核尺寸仅为示意性的举例,本申请实施例对此不加以限定。借助卷积核执行的池化处理过程,能够大大降低进行特征表示分析时的分析复杂度,还能够借助较小的尺寸增大对图像整体的感受野,提升全局特征表示对图像全局进行表达的表达效果。
可选地,池化处理包括最大池化处理(Max-Pooling)、平均池化处理(Mean-Pooling)和广义均值池化处理(Generalized-Mean Pooling)的池化处理类型中至少一种,对此不加以限定。后续实施例中将针对这三种池化处理进行详细说明,此处暂不做赘述。
在一个可选的实施例中,在对图像进行区域划分得到目标区域后,基于目标区域对图像对应的图像特征表示进行下采样。
示意性的,下采样又称特征下采样,是指对图像特征表示进行图像抽样以及图像缩小过程,从而得到处理后的特征向量,即为第一局部特征表示。
在一些实施例中,特征下采样包括稀疏采样。
可选地,池化处理和下采样是同时进行的;或者,首先针对图像特征表示进行池化处理,再对图像特征表示进行下采样,对此不加以限定。
在一些实施例中,图像由多个图像块组成,图像特征表示由多个子特征表示组成,多个图像块与多个子特征表示一一对应。
示意性的,图像由多个图像块组成,多个图像块包括图像块1至图像块9,图像块1对应子特征表示a,图像块2对应子特征表示b……,图像块9对应子特征表示i。
可选地,从图像包括的多个图像块中获取目标区域内的多个区域图像块。
其中,区域图像块用于表示位于该目标区域内的图像块。
示意性的,每个图像块对应一个子特征表示,组成图像的多个图像块分别对应一个子特征表示,因此得到多个子特征表示,从而将与图像对应的多个子特征表示称为图像对应的图像特征表示。
例如:图像中处于目标区域的部分图像块包括图像块3、图像块5以及图像块8,则将图像块2、图像块5和图像块8作为目标区域对应的区域图像块。
可选地,从图像特征表示包括的多个子特征表示中,获取多个区域图像块分别对应的子特征表示作为稀疏采样结果。
示意性的,在确定位于目标区域内的多个区域图像块后,获取图像特征表示中与多个区域图像块分别对应的子特征表示,从而得到与该目标区域对应的部分子特征表示,将该部分子特征表示作为稀疏采样结果。
例如:在确定目标区域对应的区域图像块为图像块3、图像块5以及图像块8后,从图像特征表示包括的多个子特征表示中,获取与图像块3对应的子特征表示c、与图像块5对应的子特征表示e以及与图像块8对应的子特征表示h,将子特征表示c、子特征表示e以及子特征表示h作为上述的稀疏采样结果。
可选地,稀疏采样结果的表现形式可实现为特征向量集合;或者,稀疏采样结果的表现形式可实现为特征向量图(特征矩阵),也即,特征向量图中包含多个特征块(Patch块),每个Patch块代表一个特征向量(图像块对应的子特征表示),本申请实施例对此不加以限定。
可选地,预先设定内容识别模型,将图像输入内容识别模型后,直接输出得到与目标内容对应的稀疏采样结果;或者,预先设定内容识别模型,将图像输入内容识别模型后,输出得到图像对应的图像特征表示,基于目标区域从图像特征表示中选出与目标内容对应的稀疏采样结果,对此不加以限定。
可选地,稀疏采样结果的提取方式包括如下提取方式中至少一种:
1.采用Swin Transformer模型(基于移动窗口的Transformer)进行特征提取,将图像输入Swin Transformer模型后,输出得到与目标内容对应的特征图(Feature Map),将该特征图作为稀疏采样结果,特征图中每个Patch块代表一个子特征表示;
2.采用深度残差网络(Deep residual network,ResNet)进行特征提取,将图像输入ResNet中,从ResNet每层网络输出的特征图获取与目标内容对应的稀疏采样结果;
3.采用Tokens-to-Tokens Vision Transformer模型(T2T-ViT模型)进行特征提取,将图像输入T2T-ViT模型中,输出得到与目标内容对应的字符串序列(Token序列),作为与目标内容对应的稀疏采样结果。
值得注意的是,上述关于稀疏采样结果的提取方式仅为示意性的举例,本申请实施例对此不加以限定。
在一些实施例中,对稀疏采样结果进行池化处理,得到第一局部特征表示。
示意性的,在得到稀疏采样结果后,可以采用上述借助卷积核进行池化处理的方式对稀疏采样结果进行池化处理,从而得到第一局部特征表示。
示意性的,若稀疏采样结果实现为上述的特征向量集合,对稀疏采样结果进行池化处理后得到的第一局部特征表示实现为多个特征向量组合后的结果;或者,若稀疏采样结果实现为上述的特征向量图,对稀疏采样结果进行池化处理后得到的第一局部特征表示实现为单个特征向量等。
值得注意的是,以上仅为示意性的举例,本申请实施例对此不加以限定。
步骤350,将关键点特征表示和第一局部特征表示进行特征拼接,得到第二局部特征表示。
示意性的,特征拼接是指将第一局部特征和关键点特征表示进行特征向量的拼接,将拼接后得到的特征向量作为第二局部特征表示。
可选地,特征拼接通过神经网络拼接层(Concatenate,Concat)实现,Concat层的作用在于将两个及两个以上的特征表示在通道(channel)维度上进行拼接。
示意性的,沿通道维度对关键点特征表示和第一局部特征表示进行特征拼接,得到第二局部特征表示。
其中,在特征拼接过程中,关键点特征表示和第一局部特征表示的特征尺寸大小相同,而是在通道数上扩展,因此特征拼接后得到的第二局部特征表示的特征尺寸不变,通道数相加。
例如:第一局部特征的大小为:1*H*W,关键点特征表示的大小为C*H*W。其中,第一局部特征中的1用于表示第一局部特征的通道数量,关键点特征表示中的C用于表示关键点特征表示的通道数量,H用于指示第一局部特征或关键点特征表示的高;W用于指示第一局部特征或关键点特征表示的宽。在沿通道维度将分类特征表示与图像特征表示进行通道维度的拼接时,得到第二局部特征表示,第二局部特征表示的大小为(C+1)*H*W。
可选地,考虑到关键点特征表示的数量可能不唯一,以及,考虑到第一局部特征表示的数量可能不唯一,因此综合两方面因素对特征拼接的方式进行说明。
示意性的,当存在单个关键点特征表示和单个第一局部特征表示时,将单个关键点特征表示和单个第一局部特征表示依照上述的特征拼接方法进行拼接,得到第二局部特征表示。
示意性的,当存在单个关键点特征表示和多个第一局部特征表示时,将多个第一局部特征表示进行上述的特征拼接后得到特征拼接结果,将特征拼接结果与单个关键点特征表示进行拼接后,得到第二局部特征表示;或者,将单个关键点特征表示与多个第一局部特征表示分别进行拼接后,得到与多个第一局部特征表示分别对应的特征拼接结果,将多个特征拼接结果进行特征拼接,得到第二局部特征表示。
示意性的,当存在多个关键点特征表示和单个第一局部特征表示时,将多个关键点特征表示进行上述的特征拼接后得到特征拼接结果,将特征拼接结果与单个第一局部特征表示进行拼接后,得到第二局部特征表示;或者,将单个第一局部特征表示与多个关键点特征表示分别进行拼接后,得到与多个关键点特征表示分别对应的特征拼接结果,将多个特征拼接结果进行特征拼接,得到第二局部特征表示。
示意性的,当存在多个关键点特征表示和多个第一局部特征表示时,特征拼接包括如下 几种拼接方式中至少一种:
1.将单个第一局部特征表示和单个关键点特征表示进行拼接,得到单个第二局部特征表示,也即,第一局部特征表示和关键点特征表示是逐个进行特征拼接的,第二局部特征表示包含特征拼接得到的多个特征向量;
2.将多个第一局部特征表示先进行逐个特征拼接,将拼接结果再依次和关键点特征表示进行拼接,得到最终的第二局部特征表示,也即,第二局部特征表示包含拼接得到的单个特征向量;
3.将多个第一局部特征进行逐个特征拼接,得到第一拼接特征表示,并将多个关键点特征表示进行逐个特征拼接,得到第二拼接特征表示,再将第一拼接特征表示和第二特征表示进行特征拼接,得到第二局部特征表示,也即,先将多个第一局部特征和多个关键点特征表示分别进行特征拼接,再将各自拼接得到的特征向量再次进行特征拼接,最终得到的拼接结果作为第二局部特征表示。
值得注意的是,上述关于特征拼接的方式仅为示意性的举例,本申请实施例对此不加以限定。
步骤360,基于全局特征表示和第二局部特征表示对目标区域中包含的目标内容的类别进行识别。
可选地,将识别之后的结果称为内容识别结果,也即:内容识别结果用于表征目标内容对应的类别。
示意性的,内容识别结果表示目标内容对应的类别名称,如:针对目标内容a的内容识别结果为“园林”;或者,内容识别结果表示目标内容对应的类别类型,如:目标内容b的内容识别结果为“X园林”,对此不加以限定。
可选地,内容识别结果中包括单个目标内容与其对应的类别,如:目标内容a,对应类别“A公园”;目标内容b,对应类别“B公园”;或者,内容识别结果中包含多个类别,每个类别下对应至少一个目标内容,如:类别A为“海豚”,类别A中包括目标内容1、目标内容2(也即,目标内容1和目标内容2都为“海豚”),类别B为“小丑鱼”,类别B中包括目标内容3(也即,目标内容3为“小丑鱼”),对此不加以限定。
可选地,目标内容对应的类别实现为粗粒度类别,如:图像中包括目标内容A(第一游乐场)和目标内容B(第二游乐场),最终得到的内容识别结果中,目标内容A和目标内容B对应的类别都为“游乐场”;或者,目标内容对应的类别实现为细粒度类别,如:目标内容A和目标内容B都属于“博物馆”,但最终识别得到目标内容A为“a博物馆”,目标内容B为“b博物馆”。
综上所述,本申请实施例提供的内容识别方法,基于图像中像素点分布规律提取得到关键点并提取与图像关键点对应的关键点特征表示,对图像进行显著性检测得到图像内的目标区域;此外,对图像特征表示进行池化处理得到全局特征表示,并基于目标区域对图像特征表示进行下采样,将下采样后得到的第一局部特征表示和关键点特征表示进行特征拼接后得到第二局部特征表示,从而根据全局特征表示和第二局部特征表示对目标区域中目标内容的类别进行识别。也即,基于目标区域对图像特征表示进行下采样得到第一局部特征的过程,能够有效提取图像特征表示中关于局部特征的有效信息,进而在将第一局部特征表示和关键点特征表示进行特征拼接时,能够实现结合图像关键点得到更准确的第二局部特征表示的目的,利用全局特征和第二局部特征对图像进行内容识别,能够有效提高内容识别的准确度。
在一个可选的实施例中,第一局部特征表示和全局特征表示均可以通过多种不同的池化处理获取,示意性的,请参考图4,其示出了本申请一个示例性实施例提供的内容识别方法示意图,如图4所示,步骤340中包括步骤341和步骤342,该方法包括如下步骤:
步骤341,对图像特征表示进行池化处理,得到全局特征表示,以及,基于目标区域对图像特征表示进行稀疏采样,得到稀疏采样结果。
示意性的,对图像进行特征处理后得到的多个图像特征表示,每个图像特征表示代表图像中一个图像块(patch)的特征表示。
可选地,从图像对应的多个图像特征表示中选择部分图像特征表示进行池化处理,得到全局特征表示;或者,对所有的图像特征表示进行池化处理,得到全局特征表示,对此不加以限定。
首先,针对平均池化处理、最大池化处理以及广义均值池化处理进行详细说明。
在一些实施例中,池化处理包括平均池化处理、最大池化处理和广义均值池化处理中任意一种。
平均池化处理(Mean-Pooling)是指将输入的图像特征表示进行向量平均求值,得到平均求值后的特征向量作为全局特征表示。
最大池化处理(Max-Pooling)是指从输入的图像特征表示中选择向量值最大特征向量作为全局特征表示。
广义均值池化处理(Generalized-Mean Pooling,GeM)是指预设一个可学习参数p,对于输入的图像特征表示首先求p次幂,然后取向量平均值,再进行p次开方,将p次开方得到的结果作为全局特征表示,示意性的,GeM处理是参考公式一:
公式一:$f = \left( \frac{1}{K} \sum_{k=1}^{K} X_k^{\,p} \right)^{\frac{1}{p}}$,其中K为图像块的总数,p为可学习参数。
由公式一可知,Xk为第k个图像块对应的图像特征表示,当p=1时,公式一可实现为平均求值过程,也即,当前公式一等同于平均池化处理;当p趋近于无穷时,公式一可实现为取最大值,也即,当前公式一等同于最大池化处理。
示意性的,当p值越大时,对局部特征的关注度越高。
下面针对全局特征表示的两种获取方式进行说明。
第一种,通过单种池化处理获得全局特征表示。
在一些实施例中,对图像特征表示进行广义均值池化处理,得到全局特征表示。也即,通过对图像特征表示进行广义均值池化处理,将得到的池化处理结果作为全局特征表示。
其中,广义均值池化是将广义池化和平均池化结合起来的方法,主要是通过计算卷积核在输入特征表示(即上述的图像特征表示)上滑动所覆盖区域的加权均值,从而得到输出特征表示(即上述的全局特征表示)的每一个维度,其中,加权均值所涉及的加权系数可以通过之前的模型学习得到,也可以是自定义设置的系数。
当对图像特征表示进行广义均值池化处理时,能够实现对图像特征表示的灵活处理过程,借助广义均值池化处理过程中的加权系数提高池化处理过程的泛化能力,在减小全局特征表示的特征大小的同时,降低池化处理的计算成本,提高池化处理的处理效率。
在一些实施例中,对图像特征表示进行平均池化处理,得到全局特征表示。
在一些实施例中,对图像特征表示进行最大池化处理,得到全局特征表示。
也即,通过对图像特征表示进行最大池化处理、平均池化处理和广义均值池化处理中任意一种池化处理,得到全局特征表示,全局特征表示用于表征单种池化处理对应的池化结果。
第二种,通过多种不同池化处理得到全局特征表示。
在一些实施例中,对图像特征表示进行平均池化处理,得到第一全局特征表示;对图像特征表示进行最大池化处理,得到第二全局特征表示;对图像特征表示进行广义均值池化处理,得到第三全局特征表示;将第一全局特征表示、第二全局特征表示和第三全局特征表示进行特征拼接,得到全局特征表示。
本实施例中,对图像特征表示分别进行三种不同的池化处理,得到第一全局特征表示、第二全局特征表示和第三全局特征表示,并对其进行特征拼接,将特征拼接结果作为全局特征表示,也即,全局特征表示包括三种池化处理对应的池化结果的拼接结果。
可选地,将第一全局特征表示、第二全局特征表示和第三全局特征表示按照固定排列顺 序进行特征拼接(如:按照第一全局特征表示、第二全局特征表示和第三全局特征表示的拼接顺序进行特征拼接);或者,将第一全局特征表示、第二全局特征表示和第三全局特征表示按照随机排列顺序进行特征拼接,对此不加以限定。
首先,针对稀疏采样进行详细说明。
示意性的,稀疏采样是指将图像特征表示进行稀疏化处理,得到稀疏向量矩阵,作为稀疏采样结果。其中,图像特征表示为一个稠密的向量矩阵,而稀疏采样结果为一个稀疏的向量矩阵,也即,稀疏采样结果中包括多个零元素。可选地,零元素代表未被采样的图像块对应的图像特征表示;与零元素对应的元素为一元素,一元素用于表示被采样的图像块对应的图像特征表示。
本实施例中,图像特征表示实现为一个尺寸大小为k×k×1024的特征图(也即,特征矩阵),对图像特征表示进行稀疏采样后,得到n×1024个Token向量,将n×1024个Token向量作为稀疏采样结果。其中,Token向量的个数是预先设置好的固定数量;或者,Token向量的个数可根据实际需要进行自由设定,对此不加以限定。
在一个可选的实施例中,基于目标区域对图像特征表示进行稀疏采样,得到稀疏采样结果。
示意性的,在得到图像中的目标区域后,以目标区域中的图像块对应的图像特征表示实现为一元素,以目标区域外的图像块对应的图像特征表示为零元素,实现对图像特征表示进行的稀疏采样过程,从而使得稀疏采样结果能够更针对性地展现目标区域对应的局部信息;
或者,在得到图像中的目标区域后,将大部分目标区域中的图像块对应的图像特征表示取值为一元素,将小部分目标区域外的图像块对应的图像特征表示取值为零元素,实现对图像特征表示进行的稀疏采样过程,从而使得稀疏采样结果能够较为针对性地展现目标区域对应的局部信息;
或者,在得到图像中的目标区域后,将包括目标区域在内的一定区域内图像块对应的图像特征表示表示取值为一元素,将该一定区域外的图像块对应的图像特征表示取值为零元素,实现对图像特征表示进行的稀疏采样过程,从而使得稀疏采样结果能够较为针对性地展现目标区域对应的局部信息等。
本实施例中,针对图像特征表示分别进行池化处理和下采样的过程是同时进行的。
步骤342,对稀疏采样结果进行池化处理,得到第一局部特征表示。
在一些实施例中,池化处理包括平均池化处理、最大池化处理和广义均值池化处理等池化处理方式中至少一种。
示意性的,针对第一局部特征表示两种获取方式进行详细说明。
第一种,通过对稀疏采样结果进行单个池化处理。
在一个可实现的情况下,对稀疏采样结果进行最大池化处理,选择稀疏采样结果中向量值最大的Token向量作为第一局部特征表示。
在一个可实现的情况下,对稀疏采样结果进行平均池化处理,对稀疏采样结果进行平均求值,将得到的平均值向量作为第一局部特征表示。
在一个可实现的情况下,对稀疏采样结果进行广义均值池化处理,设置一个可学习参数p,通过上述公式一对稀疏采样结果进行池化处理,得到池化处理结果作为第一局部特征表示。
针对上述三种不同的池化处理方式,也即,第一局部特征表示中包括由单种池化处理得到的特征向量。
第二种,通过对稀疏采样结果进行多种不同的池化处理。
在一些实施例中,对稀疏采样结果进行平均池化处理,得到第三局部特征表示;对稀疏采样结果进行最大池化处理,得到第四局部特征表示;对稀疏采样结果进行广义均值池化处理,得到第五局部特征表示;将第三局部特征表示、第四局部特征表示和第五局部特征表示进行特征拼接,得到第一局部特征表示。
本实施例中,针对稀疏采样结果分别进行平均池化处理、最大池化处理和广义均值池化 处理,分别得到第三局部特征表示、第四局部特征表示和第五局部特征表示,并对其进行特征拼接,将拼接得到的结果作为第一局部特征表示。也即,当前第一局部特征表示中包括多种不同池化处理得到的特征向量对应的拼接结果。
可选地,对稀疏采样结果同时进行三种不同的池化处理;或者,按照三种池化处理的预设处理顺序对稀疏采样结果进行池化处理,对此不加以限定。其中,预设处理顺序是预先设置好的固定顺序;或者,预设处理顺序可根据实际需要进行自由设定。
可选地,将第三局部特征表示、第四局部特征表示和第五局部特征表示按照固定排列顺序进行特征拼接(如:按照第三局部特征表示、第四局部特征表示和第五局部特征表示的拼接顺序进行特征拼接);或者,将第三局部特征表示、第四局部特征表示和第五局部特征表示按照随机排列顺序进行特征拼接,对此不加以限定。
值得注意的是,上述针对图像特征表示的两种池化处理(包括单种池化处理和多种池化处理后进行拼接)以及针对稀疏采样结果的两种池化处理(包括单种池化处理和多种池化处理后进行拼接)仅为示意性的举例,在应用过程中针对图像特征表示和稀疏采样结果分别可选择上述任意池化处理进行组合(也即,包括四种池化组合方式),如:采用三种池化处理中的任意两种池化处理方式对稀疏采样结果进行处理及特征拼接后得到第一局部特征表示;或者,采用三种池化处理中的任意两种池化处理方式对图像特征表示进行处理及特征拼接后得到全局特征表示等,本申请实施例对此不加以限定。
综上所述,本申请实施例提供的内容识别方法,基于图像中像素点分布规律提取得到图像关键点并提取与图像关键点对应的关键点特征表示,对图像进行显著性检测得到图像内的目标区域;此外,对图像特征表示进行池化处理得到全局特征表示,并基于目标区域对图像特征表示进行下采样,将得到的第一局部特征表示和关键点特征表示进行特征拼接后得到第二局部特征表示,从而根据全局特征表示和第二局部特征表示对目标区域中目标内容的类别进行识别,最终得到内容识别结果。也即,通过基于目标区域对图像特征表示进行下采样得到第一局部特征的过程,能够有效提取图像特征表示中关于局部特征的有效信息,进而在将第一局部特征表示和关键点特征表示进行特征拼接时,能够实现结合图像关键点得到更准确的第二局部特征表示的目的,利用全局特征和第二局部特征对图像进行内容的类别识别,能够有效提高内容识别的准确度。
本实施例中,通过基于目标区域对图像特征表示进行稀疏采样的方式,降低图像特征表示的复杂性,得到具有低复杂度的稀疏采样结果,进而对稀疏采样结果进行池化处理,以在尽可能保留目标区域对应的图片空间信息的前提下,降低目标内容的尺寸,并提取到包含高维局部特征信息的第一局部特征表示,在有效提取图像特征表示中对应的局部特征的同时,提高特征提取的效率以及特征表示的利用率。
本实施例中,介绍了池化处理的操作形式,当对稀疏采样结果和图像特征表示进行池化处理时,可以采用上述三种处理方式中的至少一种处理方式得到对应的特征表示。当选择采用单种池化处理时,能够一定程度上简化池化处理的操作形式;当选择采用至少两种池化处理时,可以根据对稀疏采样结果和图像特征表示的处理条件确定更合适的池化处理方式,提高应用池化处理的灵活度。
例如:针对图像特征表示提供两种不同的池化处理方式,包括单种池化处理方式和多种不同池化处理后的特征拼接方式。当选择采用诸如最大池化处理、平均池化处理和广义均值池化处理中任意一种池化处理方式时,能够对在一定程度上减少池化处理操作的运算量,提高全局特征表示的获取效率;当选择采用将多种不同的池化处理后的特征进行特征拼接的方式时,综合平均池化处理后的第一全局特征表示、最大池化处理后的第二全局特征表示以及广义均值池化处理后的第三全局特征表示,有效实现对图像特征表示进行更全面分析的目的,在提高池化处理选择多样性的同时,提高了全局特征表示的获取准确度。
又或者,针对稀疏采样结果提供两种不同的池化处理方式,包括单种池化处理方式和多种不同池化处理后的特征拼接方式。当选择采用诸如最大池化处理、平均池化处理和广义均 值池化处理中任意一种池化处理方式时,能够对在一定程度上减少池化处理操作的运算量,提高第一局部特征表示的获取效率;当选择采用将多种不同的池化处理后的特征进行特征拼接的方式时,综合平均池化处理后的第三局部特征表示、最大池化处理后的第四局部特征表示以及广义均值池化处理后的第五局部特征表示,有效实现对稀疏采样结果进行更全面分析的目的,在提高池化处理选择多样性的同时,提高了第一局部特征表示的获取准确度。
在一个可选的实施例中,关键点特征表示通过关键点提取算法获取,图像特征表示通过内容识别模型获取,目标区域中目标内容的类别识别结果由内容类别库确定,示意性的,请参考图5,其示出了本申请一个示例性实施例提供的内容识别方法流程图,也即,步骤340中还可以包括步骤341至步骤343,步骤360中包括步骤361至步骤364,如图5所示,该方法包括如下步骤。
步骤310,获取图像。
其中,目标内容在图像中对应图像关键点,图像关键点是基于图像中像素点分布规律提取得到的关键点。
可选地,单次获取单张图像;或者,单次同时获取多张图像等,本申请实施例对此不加以限定。
示意性的,图像是指包含未知类别的目标内容的图像,如:景点图像(包含未知景点类别的图像)、明星写真(包含未知明星的写真图像)、动漫图像(包含未知动漫角色的图像)等,本申请实施例对此不加以限定。
在一些实施例中,图像关键点是通过特征检测器对图像中的像素点进行分析,并根据像素点分布规律提取得到的关键点。
可选地,图像关键点通过如下提取方式中至少一种得到:
1.通过尺度不变特征变换检测(Scale Invariant Feature Transform,SIFT特征检测)提取图像对应的图像关键点,其中,利用SIFT特征检测的过程中,将图像输入SIFT特征检测器,利用SIFT特征检测器中的高斯拉普拉斯金字塔尺度空间(Difference of Gaussian,DOG)获取图像中的极值点,作为图像关键点;
2.通过SURF特征检测(Speeded Up Robust Features,基于加速版的SIFT特征检测)提取图像对应的图像关键点,其中,SURF特征检测器的过程中,将图像输入SURF特征检测器,SURF特征检测器使用海森(Hesseian)矩阵的行列式值对图像进行关键点检测,确定图像对应的图像关键点;
3.通过ORB特征检测(Oriented FAST and Rotated BRIEF,ORB)提取图像中目标内容对应的图像关键点,将图像输入ORB特征检测器,确定图像对应的图像关键点。
值得注意的是,上述关于图像关键点的提取方式仅为示意性的举例,本申请实施例对此不加以限定。
步骤320,基于图像中像素点分布规律提取得到图像关键点,并提取图像中与图像关键点对应的关键点特征表示。
在一些实施例中,通过关键点提取算法提取与图像关键点对应的关键点特征表示。
可选地,通过SIFT关键点检测器确定图像关键点后提取图像关键点对应的关键点特征表示(SIFT特征表示);或者,通过SURF关键点检测器确定图像关键点后提取图像关键点对应的关键点特征表示(SURF特征表示);或者,通过ORB关键点检测器确定图像关键点后提取图像关键点对应的关键点特征表示(ORB特征表示),本申请实施例对此不加以限定。
可选地,选择上述SIFT特征表示、SURF特征表示或者ORB特征表示中至少一种作为关键点特征表示。
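作为对上述关键点特征表示提取过程的理解辅助，下面给出一个借助OpenCV的示意性代码草图（仅为示例，并非本申请实施例的限定实现）：其中函数extract_keypoint_feature为假设的封装，采用"对多个关键点描述子取均值"得到单一关键点特征表示也仅为一种假设的聚合方式。

```python
import cv2
import numpy as np

def extract_keypoint_feature(image_path: str, detector: str = "SIFT") -> np.ndarray:
    """提取图像关键点并聚合为关键点特征表示(示意)。"""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if detector == "SIFT":
        extractor = cv2.SIFT_create()   # 基于DoG尺度空间检测极值点
    else:
        extractor = cv2.ORB_create()    # Oriented FAST and Rotated BRIEF
    keypoints, descriptors = extractor.detectAndCompute(gray, None)
    if descriptors is None:             # 未检测到关键点时返回零向量(假设的处理方式)
        dim = 128 if detector == "SIFT" else 32
        return np.zeros(dim, dtype=np.float32)
    return descriptors.mean(axis=0).astype(np.float32)  # 将多个描述子聚合为一个关键点特征表示
```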
步骤330,通过对图像进行显著性检测,从图像中识别出目标区域。
在一些实施例中,对图像进行显著性检测,从图像中识别出与目标内容对应的目标区域。
示意性的，显著性检测用于确定图像中的目标内容对应的目标区域和背景内容对应的背景区域，也即，显著性检测用于根据内容特征对图像进行区域划分。
在一些实施例中,预设显著性检测模型,将图像输入显著性检测模型,输出得到图像对应的识别显著图,识别显著图中包括目标内容对应的目标区域,以及背景内容对应的背景区域。其中,识别显著图实现为将目标区域进行增强后的图像。
示意性的,请参考图6,其示出了本申请一个示例性实施例提供的目标区域示意图,如图6所示,图6展示了三种不同图像经过显著性检测后得到的识别显著图示意图600,包括第一图像610和第一图像610对应的第一显著图611、第二图像620和第二图像620对应的第二显著图621、第三图像630和第三图像630对应的第三显著图631。
其中,第一显著图611中包括第一目标区域(白色区域),第二显著图621中包括第二目标区域(白色区域),第三显著图631中包括第三目标区域(白色区域)。图6中的目标区域均以白色区域进行标记显示,背景区域以黑色区域进行标记显示。
本实施例中，图6示出的识别显著图为当前目标内容对应的主体特征较明显的情况，也即，当前的识别显著图中目标区域显示完整度较好，白色区域对应的区域边缘较清晰。
此外,本实施例中还存在识别显著图中目标内容对应的主体特征不明显的情况,示意性的,请参考图7,其示出了本申请一个示例性实施例提供的目标区域示意图,如图7所示,图7展示了两种不同图像经过显著性检测后得到的识别显著图示意图700,包括第四图像710和第四图像710对应的第四显著图711,第五图像720和第五图像720对应的第五显著图721。
其中,第四显著图711和第五显著图721中白色区域即为目标区域,黑色区域即为背景区域,当前第四显著图711和第五显著图721属于目标内容对应的主体特征不明显的情况,也即白色区域对应的区域边缘较模糊。
可选地,显著性检测模型包括Visual Saliency Transformer(VST模型)、Edge Guidance Network for Salient Object Detection(EGNet模型)等模型中至少一种,对此不加以限定。
本实施例中,针对VST模型进行详细说明。
示意性的,请参考图8,其示出了本申请一个实施例提供的显著性检测模型示意图,如图8所示,当前显示VST模型,其中,VST模型的模型输入包括第一图像810和第二图像820,第一图像810为图像(该图像为RGB图像,图8中未显示颜色),第二图像820为图像对应的灰度图像(RGB-D图像),将第一图像810对应的第一图像块811和第二图像820对应的第二图像块821分别输入Transformer编码器空间830(Transformer Encoder),其中,Transformer编码器空间830中利用Token-to-Token(T2T)模块对第一图像块811和第二图像块821分别编码成多级Token向量(如:T1、T2、T3),将多级Token向量输入转换器840(Convertor),转换器840用于将多层Token向量从编码器空间830转换到解码器空间850(Transformer Decoder)进行特征解码,输出得到第一图像810对应的识别显著图8111,以及第二图像820对应的识别边界图8221。
在VST模型中,除了使用Transformer模型结构外,还利用多级Token向量融合,并在Transformer结构下提出一种新的Token向量上采样的方法,以获得高分辨率的显著检测结果。还开发了一个基于Token向量的多任务解码器,通过引入任务相关的Token向量和一个Patch-Task-Attention机制来同时进行显著检测(Saliency)和边缘(Boundary)检测。
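在得到识别显著图后，可将其转换为与特征向量图对齐的目标区域掩码，供后续稀疏采样使用。下面给出一个示意性代码草图（仅为示例，并非本申请实施例的限定实现）：其中识别显著图由任意显著性检测模型(如VST、EGNet的开源实现)输出这一前提、阈值0.5以及按图像块求均值的对齐方式均为假设。

```python
import numpy as np

def saliency_to_region_mask(saliency_map: np.ndarray, patch_grid: int, thr: float = 0.5) -> np.ndarray:
    """将识别显著图下采样为与 k x k 特征向量图对齐的目标区域掩码(示意)。

    saliency_map: 取值范围[0, 1]的单通道识别显著图，接近1的白色区域为目标区域。
    patch_grid:   特征向量图的边长k，即图像被划分为 k x k 个图像块。
    """
    h, w = saliency_map.shape
    ph, pw = h // patch_grid, w // patch_grid
    mask = np.zeros((patch_grid, patch_grid), dtype=np.int32)
    for i in range(patch_grid):
        for j in range(patch_grid):
            patch = saliency_map[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            mask[i, j] = 1 if patch.mean() > thr else 0  # 图像块内显著值均值超过阈值则视为目标区域
    return mask
```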
步骤341,将图像输入内容识别模型,输出得到图像特征表示。
其中,内容识别模型用于对图像进行深层特征提取。
可选地,单次仅将单张图像输入内容识别模型,输出得到单张图像对应的图像特征表示;或者,单次将多张图像同时输入内容识别模型,同时输出多张图像分别对应的图像特征表示,对此不加以限定。
在一些实施例中,图像特征表示实现为一张多维度特征向量图,其中,特征向量图中包括多个Patch块,每个Patch代表一个特征向量。
可选地,内容识别模型包括Swin Transformer模型、ResNet模型或者T2T-ViT模型中至少一种,对此不加以限定。
本实施例中,采用Swin Transformer模型作为内容识别模型,下面,针对Swin Transformer模型进行简单介绍。
Swin Transformer模型引入了层次化特征映射过程和窗口注意力转换过程两个概念。其中,层次化特征映射过程是指Swin Transformer模型中特征表示的映射过程在每一层模型输出后逐步合并,并进行特征下采样,建立具有层次结构的特征映射,该具有层次结构的特征映射使得Swin Transformer模型能够很好地应用于细粒度特征预测的领域(如:语义分割领域)。
Swin Transformer模型中使用的无卷积特征下采样方法称为Patch Merging。其中,“Patch”指的是特征向量图中的最小单位,如:在一个特征尺寸为14x14的特征向量图中,有14x14=196个Patch块,也即,有196个特征块。
Swin Transformer模型中使用的模块为基于窗口的标准多头自注意力(Window Multi-headed Self-attention,W-MSA)，该W-MSA只在每个窗口内计算对应的注意力。在相邻两层之间，Swin Transformer模型还会对窗口划分进行移位(Shifted Window)，这种移位会导致存在不属于任何完整窗口的Patch块，也即，该Patch块处于被孤立状态，以及存在Patch块不完整的窗口(Window)。Swin Transformer模型应用了"循环移位"技术，将被孤立的Patch块移动到存在不完整Patch块的窗口中。通过这次移位之后，一个窗口会由原始特征向量图中不相邻的Patch块组成，因此在计算过程中应用一个Mask，将自注意力限制在相邻的Patch块上。
本实施例中,将图像输入Swin Transformer模型中,在Swin Transformer模型的末端输出k×k×1024的特征向量图,作为图像特征表示。
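作为对上述通过内容识别模型提取图像特征表示过程的理解辅助，下面给出一个示意性代码草图（仅为示例，并非本申请实施例的限定实现）：其中借助timm库加载Swin Transformer骨干网络、模型名称swin_base_patch4_window7_224、输入尺寸224×224以及对不同版本输出排布的处理方式均为假设，实际以所用框架和版本为准。

```python
import timm
import torch

# 假设使用timm库中的Swin Transformer作为内容识别模型(示例中不加载预训练权重)
backbone = timm.create_model("swin_base_patch4_window7_224", pretrained=False, num_classes=0)
backbone.eval()

image = torch.rand(1, 3, 224, 224)            # 假设的输入图像张量(已完成预处理)
with torch.no_grad():
    feats = backbone.forward_features(image)  # 模型末端特征，不同版本可能为(B,H,W,C)或(B,L,C)

feats = feats.squeeze(0)
if feats.dim() == 2:                          # (L, C) -> (k, k, C)
    k = int(feats.shape[0] ** 0.5)
    feats = feats.reshape(k, k, -1)
feature_map = feats                           # 即上文所述 k×k×1024 的图像特征表示(示意)
print(feature_map.shape)
```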
步骤342,基于目标区域对图像特征表示进行下采样,得到第一局部特征表示。
示意性的,基于目标区域对图像特征表示进行稀疏采样,得到稀疏采样结果;对稀疏采样结果进行池化处理,得到第一局部特征表示。
本实施例中，对k×k×1024的特征向量图进行稀疏采样，得到n个1024维的Token向量，再对这n个Token向量进行平均池化，得到局部特征。
可选地，对稀疏采样结果进行最大池化处理、平均池化处理和广义均值池化处理中任意一种，将池化处理结果作为第一局部特征表示；或者，对稀疏采样结果分别进行最大池化处理、平均池化处理和广义均值池化处理，将三种池化处理结果进行特征拼接，得到第一局部特征表示，对此不加以限定。本实施例中，以对稀疏采样结果进行平均池化处理，将池化处理结果作为第一局部特征表示为例。
步骤343,对图像特征表示进行池化处理,得到全局特征表示。
可选地,对图像特征表示进行最大池化处理、平均池化处理和广义均值池化处理中任意一种,将池化处理结果作为全局特征表示;或者,对图像特征表示分别进行最大池化处理、平均池化处理和广义均值池化处理,将三种池化处理结果进行特征拼接,得到全局特征表示,对此不加以限定。本实施例中,以对图像特征表示进行广义均值池化处理,并将池化处理结果作为全局特征表示为例。
步骤350,将第一局部特征表示和关键点特征表示进行特征拼接,得到第二局部特征表示。
示意性的,将第一局部特征表示和关键点特征表示依次进行特征拼接,将特征拼接结果作为第二局部特征表示。
在上述步骤中，介绍了通过关键点提取算法提取关键点特征表示并得到第二局部特征表示的过程。借助关键点提取算法确定图像对应的关键点并确定与关键点对应的关键点特征表示，从而以关键点特征表示指代图像关键点，便于模型对图像关键点进行针对性地分析，减少模型识别的复杂性，缩短模型识别时间；进而将关键点特征表示与图像特征表示中与目标区域对应的第一局部特征表示进行融合，在突出图像关键点信息的同时增大目标内容的整体感知性，提升局部特征表示对目标内容的表达力度，提高第二局部特征表示的准确度。
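下面给出步骤350中特征拼接的一个示意性代码草图（仅为示例，并非本申请实施例的限定实现）：其中第一局部特征表示的维度1024与关键点特征表示的维度128均为假设的示例取值。

```python
import numpy as np

first_local_feature = np.random.randn(1024).astype(np.float32)  # 假设：池化得到的第一局部特征表示
keypoint_feature = np.random.randn(128).astype(np.float32)      # 假设：SIFT描述子聚合得到的关键点特征表示

# 依次进行特征拼接，得到第二局部特征表示
second_local_feature = np.concatenate([first_local_feature, keypoint_feature], axis=0)
print(second_local_feature.shape)  # (1152,)
```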
步骤361,获取内容类别库,内容类别库中包括预先设定的n个类别的集合,n为正整数。
示意性的，内容类别库中包括n个预先存储的类别，每个类别对应存储至少一张候选图像对应的候选特征表示（即：候选图像对应的图像特征表示），也即，候选特征表示与类别对应，如：类别"贵宾犬"下存储有多张包含贵宾犬的图像，每张贵宾犬图像中标注有贵宾犬对应的特征表示，作为候选特征表示。
在一些实施例中,内容类别库是预先获取得到的。
步骤362，将全局特征表示与内容类别库中的n个类别分别进行匹配，得到内容类别库中与全局特征表示匹配的k个候选类别，0<k<n且k为整数。
在一些实施例中,将全局特征表示与内容类别库中的n个类别分别进行匹配,得到n个类别分别对应的全局匹配分数,全局匹配分数用于表征目标内容属于类别的概率;将n个类别分别对应的全局匹配分数进行排序,得到匹配度排序结果;将匹配度排序结果中前k个类别,作为与全局特征表示匹配的k个候选类别。
可选地,根据全局特征表示与n个类别下分别对应的候选特征表示在向量空间中的距离,确定全局匹配分数。例如:当全局特征表示与候选特征表示在向量空间中的距离越小时,全局匹配分数越高;当全局特征表示与候选特征表示在向量空间中的距离越大时,全局匹配分数越低。
示意性的,根据全局特征表示在内容类别库中遍历所有类别下对应的候选特征表示,将每个候选特征表示与全局特征表示进行匹配,根据类别下对应的候选特征表示与当前全局特征表示匹配的情况,确定该类别对应的全局匹配分数,其中,类别的全局匹配分数越高,表明该类别下候选特征表示与全局特征表示的匹配度越高,也即,当前目标内容对应的类别为该类别的概率越高。
根据全局匹配分数按照从高到低的顺序进行排列,得到匹配度排序结果,选择匹配度排序结果中前k个类别,作为和全局特征表示匹配的k个候选类别。
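下面给出"将全局特征表示与内容类别库中n个类别分别匹配并取前k个候选类别"的一个示意性代码草图（仅为示例，并非本申请实施例的限定实现）：其中以候选特征表示与全局特征表示之间最小欧氏距离的负值作为全局匹配分数，仅为一种假设的打分方式。

```python
from typing import Dict, List
import numpy as np

def global_recall(global_feature: np.ndarray,
                  category_library: Dict[str, np.ndarray],
                  k: int) -> List[str]:
    """根据全局匹配分数从内容类别库中召回前k个候选类别(示意)。

    category_library: 类别名称 -> 该类别下候选特征表示矩阵，形状为 (m, C)。
    """
    scores = {}
    for name, candidates in category_library.items():
        dists = np.linalg.norm(candidates - global_feature, axis=1)
        scores[name] = -dists.min()   # 距离越小，全局匹配分数越高(示意)
    ranked = sorted(scores, key=scores.get, reverse=True)   # 匹配度排序结果
    return ranked[:k]                 # 前k个类别作为候选类别
```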
步骤363，基于第二局部特征表示对k个候选类别进行类别排序，得到类别排序结果。
示意性的,通过全局特征表示从内容类别库中选择与目标内容的内容匹配分数最高的前k个候选类别,针对k个候选类别,根据局部特征表示将这k个候选类别再次进行类别排序,得到类别排序结果。
其中，将第二局部特征表示与k个候选类别下存储的候选特征表示分别进行匹配，根据候选特征表示和第二局部特征表示的匹配情况，确定k个候选类别分别对应的局部匹配分数，其中，局部匹配分数用于表示当前第二局部特征表示和该类别下候选特征表示之间的匹配情况，匹配度越高，表明该类别对应的局部匹配分数越高，根据k个候选类别分别对应的局部匹配分数由高到低进行排序，得到类别排序结果。
步骤364,根据类别排序结果,得到目标内容对应的识别类别。
示意性的,选择类别排序结果中局部匹配分数最高(或者较高的几个)的候选类别作为识别类别,作为内容识别结果。
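下面给出基于局部特征表示对k个候选类别进行重新排序的一个示意性代码草图（仅为示例，并非本申请实施例的限定实现）：其中假设局部特征库中按类别存储了候选图像的局部特征表示，并同样以负的最小欧氏距离作为局部匹配分数。

```python
from typing import Dict, List
import numpy as np

def local_rerank(local_feature: np.ndarray,
                 local_library: Dict[str, np.ndarray],
                 candidate_names: List[str]) -> List[str]:
    """对全局召回得到的k个候选类别按局部匹配分数重新排序，排在首位的即识别类别(示意)。"""
    scores = {}
    for name in candidate_names:
        dists = np.linalg.norm(local_library[name] - local_feature, axis=1)
        scores[name] = -dists.min()   # 局部匹配分数(示意)
    return sorted(candidate_names, key=scores.get, reverse=True)
```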
综上所述,本申请实施例提供的内容识别方法,基于图像中像素点分布规律提取得到图像关键点并提取与图像关键点对应的关键点特征表示,对图像进行显著性检测得到图像内的目标区域;对图像对应的图像特征表示进行池化处理得到全局特征表示,并基于目标区域对图像特征表示进行下采样,将下采样后的第一局部特征表示和关键点特征表示进行特征拼接后得到第二局部特征表示,从而根据全局特征表示和第二局部特征表示对目标图像中目标内容对应的类别进行识别。也即,通过对图像特征表示进行下采样得到第一局部特征的过程,能够有效提取图像特征表示中关于局部特征的有效信息,进而在将第一局部特征表示和关键点特征表示进行特征拼接时,能够实现结合图像关键点得到更准确的第二局部特征表示的目的,利用全局特征和第二局部特征对图像中目标内容的类别进行识别,能够有效提高对内容进行类别识别的准确度。
本实施例中，通过显著性检测确定目标内容在图像中的目标区域，从而将图像中的目标区域与背景区域进行有效划分，增强目标区域的区别力度；进而在通过目标区域对图像特征表示进行区域分析时，能够有效排除图像中背景区域的干扰，提高得到的图像特征表示对目标区域的表达强度，尽最大程度过滤不包含主体特征的背景内容，提高目标内容的内容类别识别的准确度和识别效率。
本实施例中，如步骤361至步骤364所示，在基于全局特征表示和第二局部特征表示得到目标内容对应类别的过程中，通过表征全局信息内容的全局特征表示从包括n个类别的内容类别库中选择相匹配的k个候选类别，进而根据表征针对性的局部信息内容的局部特征表示将k个候选类别进行重新排序，并根据类别排序结果确定与目标内容对应的类别。有效实现了从全局至局部对目标内容进行识别的过程，借助层次性分析方式提高了内容识别过程的规范性，在提高内容类别识别准确度的同时，提升了内容识别的灵活性。此外，借助将全局特征表示与内容类别库中n个类别分别进行匹配的全局匹配分数，更直观地实现了候选类别的选择过程，有利于更全面且更准确地获取到k个候选类别，进而有利于提高内容识别的准确性。
在一个可选的实施例中，针对本申请提供的内容识别方法对应的应用场景进行说明，示意性的，请参考图9，其示出了本申请一个示例性实施例提供的内容识别方法示意图，以内容识别方法应用于图像搜索场景为例进行说明。
当前，用户输入一张图像作为待识别的图像，通过该图像在图像库中进行检索，得到与该图像匹配度最高的识别图像，作为图像搜索结果。
如图9所示,获取图像910,其中,图像910为用户输入的图像,图像910中包括目标内容911,图像对应多个图像关键点(图9中未示出),图像关键点是通过SIFT关键点检测器、ORB关键点检测器或者SURF关键点检测器三种关键点检测器中至少一种检测得到的特征点。
将图像910输入内容识别模型920,输出得到图像特征表示930,其中,内容识别模型920实现为Swin Transformer模型,图像特征表示930实现为通过Swin Transformer模型最后一层输出的特征尺寸为k×k×1024的特征向量图。
对图像910进行显著性检测,提取得到目标区域912,该目标区域912与目标内容911相对应。
其中,显著性检测采用VST模型实现。
对图像特征表示930分别进行广义均值池化处理940和稀疏采样950,分别得到全局特征表示941和稀疏采样结果950。
可选地,在对图像特征表示930进行稀疏采样950得到稀疏采样结果950时,基于目标区域912对图像特征表示930进行稀疏采样950。
示意性的,从图像特征表示930中提取得到与目标区域912对应的特征表示作为稀疏采样后的稀疏采样结果950;或者,从图像特征表示930中提取得到略大于目标区域912的一定区域对应的特征表示作为稀疏采样后的稀疏采样结果。
将稀疏采样结果950进行平均池化处理960,得到第一局部特征表示(图9中未示出),将第一局部特征表示和图像关键点提取的关键点特征表示(SIFT特征表示、SURF特征表示或者ORB特征表示中至少一种)进行拼接,得到第二局部特征表示951。
此外,针对上述池化处理得到的结果,进行特征降维操作,去除特征表示之间相关度较高的冗余特征。
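下面给出对池化处理后的特征表示进行特征降维的一个示意性代码草图（仅为示例，并非本申请实施例的限定实现）：其中采用PCA作为降维方式、用于拟合的特征数量与维度以及目标维度256均为假设。

```python
import numpy as np
from sklearn.decomposition import PCA

# 假设已收集一批池化后的特征表示用于拟合降维矩阵(示例数据，维度仅为假设)
pooled_features = np.random.randn(1000, 3072).astype(np.float32)
pca = PCA(n_components=256, whiten=True)
pca.fit(pooled_features)

reduced = pca.transform(pooled_features[:1])  # 对单条特征表示降维，去除相关度较高的冗余维度
print(reduced.shape)  # (1, 256)
```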
根据全局特征表示941在类别库970中进行匹配,得到与全局特征表示941全局匹配分数最高的前k个候选类别(TOP-K),作为k个候选类别971。
根据存储了第二局部特征表示951的局部特征库952对k个候选类别971再次进行匹配，得到k个候选类别971对应的局部匹配分数，根据局部匹配分数对k个候选类别进行重新排序，最终选择局部匹配分数最高的候选类别作为目标内容对应的类别980输出，其中，目标内容对应的类别980实现为"长城"。
将目标内容对应的类别作为检索条件输入图像库，选择图像库中与该类别对应的候选图像进行输出，向用户进行展示。
此外,本申请实施例提供的内容识别方法还可应用于以下场景。
1.应用于帐号推荐。以用户搜索视频帐号为例进行说明，当前视频帐号发布的视频中标注了候选地点，当用户输入地点内容作为搜索内容时，通过上述内容识别方法确定相关的视频帐号，并将该视频帐号中标注有对应地点的视频进行加权，从而提高该视频向用户进行推荐的概率；
2.应用于内容推荐。在向用户进行内容推荐的过程中，通过上述内容识别方法对推荐库中的图像或者视频内容进行识别，在识别得到对应的内容类别后，将该图像或者视频内容向用户进行推荐。
综上所述，本申请实施例提供的内容识别方法，基于图像中像素点分布规律提取得到图像关键点并提取与图像关键点对应的关键点特征表示，对图像进行显著性检测得到图像内的目标区域；对图像对应的图像特征表示进行池化处理得到全局特征表示，并对图像特征表示进行下采样，将下采样得到的第一局部特征表示和关键点特征表示进行特征拼接后得到第二局部特征表示，从而根据全局特征表示和第二局部特征表示对图像中的目标内容的类别进行识别。也即，基于目标区域对图像特征表示进行下采样得到第一局部特征表示的过程，能够有效提取图像特征表示中关于局部特征的有效信息，进而在将第一局部特征表示和关键点特征表示进行特征拼接时，能够实现结合图像关键点得到更准确的第二局部特征表示的目的，利用全局特征和第二局部特征表示对图像中目标内容的类别进行识别，能够有效提高内容识别的准确度。
本申请提供内容识别方法的有益效果包括:
1)构建了以全局特征表示进行内容召回、以第二局部特征表示进行重排序的结构；
2)将第一局部特征表示和关键点特征表示进行特征拼接后得到第二局部特征表示;
3)引入显著性检测来避免背景信息的干扰。
图10是本申请一个示例性实施例提供的内容识别装置的结构框图,如图10所示,该装置包括如下部分:
获取模块1010,用于获取图像;
提取模块1020,用于基于所述图像中像素点分布规律提取得到图像关键点,并提取所述图像中与所述图像关键点对应的关键点特征表示;通过对所述图像进行显著性检测,从所述图像中识别出目标区域;
处理模块1030,用于对所述图像对应的图像特征表示进行池化处理,得到全局特征表示,以及,基于所述目标区域对所述图像特征表示进行下采样,得到第一局部特征表示;
拼接模块1040,用于将所述关键点特征表示和所述第一局部特征表示进行特征拼接,得到第二局部特征表示;
识别模块1050,用于基于所述全局特征表示和所述第二局部特征表示对所述目标区域中包含的目标内容的类别进行识别。
在一个可选的实施例中,如图11所示,所述处理模块1030,包括:
采样单元1031,用于基于所述目标区域对所述图像特征表示进行稀疏采样,得到稀疏采样结果;
处理单元1032,用于对所述稀疏采样结果进行池化处理,得到所述第一局部特征表示。
在一个可选的实施例中,所述图像由多个图像块组成,所述图像特征表示由多个子特征表示组成,所述多个图像块与所述多个子特征表示一一对应;
所述处理模块1030,从所述图像包括的多个图像块中获取所述目标区域内的多个区域图像块;从所述图像特征表示包括的多个子特征表示中,获取所述多个区域图像块分别对应的子特征表示作为所述稀疏采样结果。
在一个可选的实施例中,所述池化处理包括平均池化处理、最大池化处理和广义均值池化处理中任意一种。
在一个可选的实施例中，所述处理单元1032，还用于对所述稀疏采样结果进行平均池化处理，得到第三局部特征表示；对所述稀疏采样结果进行最大池化处理，得到第四局部特征表示；对所述稀疏采样结果进行广义均值池化处理，得到第五局部特征表示；将所述第三局部特征表示、所述第四局部特征表示和所述第五局部特征表示进行特征拼接，得到所述第一局部特征表示。
在一个可选的实施例中,所述提取模块1020,还用于通过关键点提取算法提取与所述图像关键点对应的关键点特征表示。
在一个可选的实施例中,所述处理模块1030,还用于将所述图像输入内容识别模型,输出得到所述图像特征表示,其中,所述内容识别模型用于对所述图像进行深层特征提取;对所述图像特征表示进行广义均值池化处理,得到所述全局特征表示。
在一个可选的实施例中,所述处理模块1030,还用于对所述图像特征表示进行平均池化处理,得到第一全局特征表示;对所述图像特征表示进行最大池化处理,得到第二全局特征表示;对所述图像特征表示进行广义均值池化处理,得到第三全局特征表示;将所述第一全局特征表示、所述第二全局特征表示和所述第三全局特征表示进行特征拼接,得到所述全局特征表示。
在一个可选的实施例中,所述识别模块1050,还用于获取内容类别库,所述内容类别库中包括预先设定的n个类别的集合,n为正整数;将所述全局特征表示与所述内容类别库中的n个类别分别进行匹配,得到所述内容类别库中与所述全局特征表示匹配的k个候选类别,0<k<n且k为整数;基于所述第二局部特征表示对所述k个候选类别进行类别排序,得到类别排序结果;根据所述类别排序结果,得到所述目标内容对应的识别类别。
在一个可选的实施例中,所述识别模块1050,还用于将所述全局特征表示与所述内容类别库中的n个类别分别进行匹配,得到所述n个类别分别对应的全局匹配分数,所述全局匹配分数用于表征所述目标内容属于所述类别的概率;将所述n个类别分别对应的全局匹配分数进行排序,得到匹配度排序结果;将匹配度排序结果中前k个类别,作为与所述全局特征表示匹配的k个候选类别。
综上所述，本申请实施例提供的内容识别装置，基于图像中像素点分布规律提取得到图像关键点并提取与图像关键点对应的关键点特征表示，对图像进行显著性检测得到图像内的目标区域；对图像对应的图像特征表示进行池化处理得到全局特征表示，并基于目标区域对图像特征表示进行下采样，将下采样得到的第一局部特征表示和关键点特征表示进行特征拼接后得到第二局部特征表示，从而根据全局特征表示和第二局部特征表示对图像中的目标内容的类别进行识别。也即，基于目标区域对图像特征表示进行下采样得到第一局部特征的过程，能够有效提取图像特征表示中关于局部特征的有效信息，进而在将第一局部特征表示和关键点特征表示进行特征拼接时，能够实现结合图像关键点得到更准确的第二局部特征表示的目的，利用全局特征和第二局部特征对图像中目标内容的类别进行识别，能够有效提高内容识别的准确度。
需要说明的是:上述实施例提供的内容识别装置,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的内容识别装置与内容识别方法实施例属于同一构思,其具体实现过程详见方法实施例,此处不再赘述。
图12示出了本申请一个示例性实施例提供的服务器的结构示意图。具体来讲:
服务器1200包括中央处理单元(Central Processing Unit,CPU)1201、包括随机存取存储器(Random Access Memory,RAM)1202和只读存储器(Read Only Memory,ROM)1203的系统存储器1204,以及连接系统存储器1204和中央处理单元1201的系统总线1205。服务器1200还包括用于存储操作系统1213、应用程序1214和其他程序模块1215的大容量存储设备1206。
大容量存储设备1206通过连接到系统总线1205的大容量存储控制器（未示出）连接到中央处理单元1201。大容量存储设备1206及其相关联的计算机可读介质为服务器1200提供非易失性存储。
不失一般性,计算机可读介质可以包括计算机存储介质和通信介质。
根据本申请的各种实施例,服务器1200可以通过连接在系统总线1205上的网络接口单元1211连接到网络1212,或者说,也可以使用网络接口单元1211来连接到其他类型的网络或远程计算机系统(未示出)。
上述存储器还包括一个或者一个以上的程序，一个或者一个以上程序存储于存储器中，被配置为由CPU执行。
本申请的实施例还提供了一种计算机设备,该计算机设备包括处理器和存储器,该存储器中存储有至少一条指令、至少一段程序、代码集或指令集,至少一条指令、至少一段程序、代码集或指令集由处理器加载并执行以实现上述各方法实施例提供的内容识别方法。
本申请的实施例还提供了一种计算机可读存储介质,该计算机可读存储介质上存储有至少一条指令、至少一段程序、代码集或指令集,至少一条指令、至少一段程序、代码集或指令集由处理器加载并执行,以实现上述各方法实施例提供的内容识别方法。
本申请的实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述实施例中任一所述的内容识别方法。
可选地,该计算机可读存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、固态硬盘(SSD,Solid State Drives)或光盘等。其中,随机存取记忆体可以包括电阻式随机存取记忆体(ReRAM,Resistance Random Access Memory)和动态随机存取存储器(DRAM,Dynamic Random Access Memory)。上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。

Claims (20)

  1. 一种内容识别方法,由服务器执行,所述方法包括:
    获取图像;
    基于所述图像中像素点分布规律提取得到图像关键点,并提取所述图像中与所述图像关键点对应的关键点特征表示;
    通过对所述图像进行显著性检测,从所述图像中识别出目标区域;
    对所述图像对应的图像特征表示进行池化处理,得到全局特征表示,以及,基于所述目标区域对所述图像特征表示进行下采样,得到第一局部特征表示;
    将所述关键点特征表示和所述第一局部特征表示进行特征拼接,得到第二局部特征表示;
    基于所述全局特征表示和所述第二局部特征表示对所述目标区域中包含的目标内容的类别进行识别。
  2. 根据权利要求1所述的方法,其中,所述基于所述目标区域对所述图像特征表示进行下采样,得到第一局部特征表示,包括:
    基于所述目标区域对所述图像特征表示进行稀疏采样,得到稀疏采样结果;
    对所述稀疏采样结果进行池化处理,得到所述第一局部特征表示。
  3. 根据权利要求2所述的方法,其中,所述图像由多个图像块组成,所述图像特征表示由多个子特征表示组成,所述多个图像块与所述多个子特征表示一一对应;
    所述基于所述目标区域对所述图像特征表示进行稀疏采样,得到稀疏采样结果,包括:
    从所述图像包括的多个图像块中获取所述目标区域内的多个区域图像块;
    从所述图像特征表示包括的多个子特征表示中,获取所述多个区域图像块分别对应的子特征表示作为所述稀疏采样结果。
  4. 根据权利要求2所述的方法,其中,所述池化处理包括平均池化处理、最大池化处理和广义均值池化处理中任意一种。
  5. 根据权利要求2所述的方法,其中,所述对所述稀疏采样结果进行池化处理,得到所述第一局部特征表示,包括:
    对所述稀疏采样结果进行平均池化处理,得到第三局部特征表示;
    对所述稀疏采样结果进行最大池化处理,得到第四局部特征表示;
    对所述稀疏采样结果进行广义均值池化处理,得到第五局部特征表示;
    将所述第三局部特征表示、所述第四局部特征表示和所述第五局部特征表示进行特征拼接,得到所述第一局部特征表示。
  6. 根据权利要求1至5任一所述的方法,其中,所述基于所述图像中像素点分布规律提取得到图像关键点,包括:
    通过关键点提取算法提取与所述图像关键点对应的关键点特征表示。
  7. 根据权利要求1至6任一所述的方法,其中,所述对所述图像对应的图像特征表示进行池化处理,得到全局特征表示,包括:
    将所述图像输入内容识别模型,输出得到所述图像特征表示,其中,所述内容识别模型用于对所述图像进行深层特征提取;
    对所述图像特征表示进行广义均值池化处理,得到所述全局特征表示。
  8. 根据权利要求1至7任一所述的方法,其中,所述对所述图像对应的图像特征表示进行池化处理,得到全局特征表示,包括:
    对所述图像特征表示进行平均池化处理,得到第一全局特征表示;
    对所述图像特征表示进行最大池化处理,得到第二全局特征表示;
    对所述图像特征表示进行广义均值池化处理,得到第三全局特征表示;
    将所述第一全局特征表示、所述第二全局特征表示和所述第三全局特征表示进行特征拼接，得到所述全局特征表示。
  9. 根据权利要求1至8任一所述的方法,其中,所述基于所述全局特征表示和所述第二局部特征表示对所述目标区域中包含的目标内容的类别进行识别,包括:
    获取内容类别库,所述内容类别库中包括预先设定的n个类别的集合,n为正整数;
    将所述全局特征表示与所述内容类别库中的n个类别分别进行匹配,得到所述内容类别库中与所述全局特征表示匹配的k个候选类别,0<k<n且k为整数;
    基于所述第二局部特征表示对所述k个候选类别进行类别排序,得到类别排序结果;
    根据所述类别排序结果,得到所述目标内容对应的识别类别。
  10. 根据权利要求9所述的方法,其中,所述将所述全局特征表示与所述内容类别库中的n个类别分别进行匹配,得到所述内容类别库中与所述全局特征表示匹配的k个候选类别,包括:
    将所述全局特征表示与所述内容类别库中的n个类别分别进行匹配,得到所述n个类别分别对应的全局匹配分数,所述全局匹配分数用于表征所述目标内容属于所述类别的概率;
    将所述n个类别分别对应的全局匹配分数进行排序,得到匹配度排序结果;
    将匹配度排序结果中前k个类别,作为与所述全局特征表示匹配的k个候选类别。
  11. 一种内容识别装置,所述装置包括:
    获取模块,用于获取图像;
    提取模块,用于基于所述图像中像素点分布规律提取得到图像关键点,并提取所述图像中与所述图像关键点对应的关键点特征表示;通过对所述图像进行显著性检测,从所述图像中识别出目标区域;
    处理模块,用于对所述图像对应的图像特征表示进行池化处理,得到全局特征表示,以及,基于所述目标区域对所述图像特征表示进行下采样,得到第一局部特征表示;
    拼接模块,用于将所述关键点特征表示和所述第一局部特征表示进行特征拼接,得到第二局部特征表示;
    识别模块,用于基于所述全局特征表示和所述第二局部特征表示对所述目标区域中包含的目标内容的类别进行识别。
  12. 根据权利要求11所述的装置,其中,
    所述处理模块,还用于基于所述目标区域对所述图像特征表示进行稀疏采样,得到稀疏采样结果;对所述稀疏采样结果进行池化处理,得到所述第一局部特征表示。
  13. 根据权利要求11所述的装置,其中,
    所述处理模块,还用于从所述图像包括的多个图像块中获取所述目标区域内的多个区域图像块;从所述图像特征表示包括的多个子特征表示中,获取所述多个区域图像块分别对应的子特征表示作为所述稀疏采样结果。
  14. 根据权利要求12所述的装置,其中,
    所述池化处理包括平均池化处理、最大池化处理和广义均值池化处理中任意一种。
  15. 根据权利要求12所述的装置,其中,
    所述处理模块,还用于对所述稀疏采样结果进行平均池化处理,得到第三局部特征表示;对所述稀疏采样结果进行最大池化处理,得到第四局部特征表示;对所述稀疏采样结果进行广义均值池化处理,得到第五局部特征表示;将所述第三局部特征表示、所述第四局部特征表示和所述第五局部特征表示进行特征拼接,得到所述第一局部特征表示。
  16. 根据权利要求11至15任一所述的装置,其中,
    所述提取模块,还用于通过关键点提取算法提取与所述图像关键点对应的关键点特征表示。
  17. 根据权利要求11至16任一所述的装置,其中,
    所述处理模块，还用于将所述图像输入内容识别模型，输出得到所述图像特征表示，其中，所述内容识别模型用于对所述图像进行深层特征提取；对所述图像特征表示进行广义均值池化处理，得到所述全局特征表示。
  18. 一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一段程序,所述至少一段程序由所述处理器加载并执行以实现如权利要求1至10任一所述的内容识别方法。
  19. 一种计算机可读存储介质,所述存储介质中存储有至少一段程序,所述至少一段程序由处理器加载并执行以实现如权利要求1至10任一所述的内容识别方法。
  20. 一种计算机程序产品,包括计算机指令,所述计算机指令被处理器执行时实现如权利要求1至10任一所述的内容识别方法。
PCT/CN2023/099991 2022-08-04 2023-06-13 内容识别方法、装置、设备、存储介质及计算机程序产品 WO2024027347A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210934770.8A CN115272768A (zh) 2022-08-04 2022-08-04 内容识别方法、装置、设备、存储介质及计算机程序产品
CN202210934770.8 2022-08-04

Publications (2)

Publication Number Publication Date
WO2024027347A1 WO2024027347A1 (zh) 2024-02-08
WO2024027347A9 true WO2024027347A9 (zh) 2024-04-18

Family

ID=83748492

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/099991 WO2024027347A1 (zh) 2022-08-04 2023-06-13 内容识别方法、装置、设备、存储介质及计算机程序产品

Country Status (2)

Country Link
CN (1) CN115272768A (zh)
WO (1) WO2024027347A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115272768A (zh) * 2022-08-04 2022-11-01 腾讯科技(深圳)有限公司 内容识别方法、装置、设备、存储介质及计算机程序产品

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095827B (zh) * 2014-04-18 2019-05-17 汉王科技股份有限公司 人脸表情识别装置和方法
CN107491726B (zh) * 2017-07-04 2020-08-04 重庆邮电大学 一种基于多通道并行卷积神经网络的实时表情识别方法
CN107844750B (zh) * 2017-10-19 2020-05-19 华中科技大学 一种水面全景图像目标检测识别方法
CN110751218B (zh) * 2019-10-22 2023-01-06 Oppo广东移动通信有限公司 图像分类方法、图像分类装置及终端设备
CN113569616A (zh) * 2021-02-24 2021-10-29 腾讯科技(深圳)有限公司 内容识别方法、装置和存储介质及电子设备
CN115272768A (zh) * 2022-08-04 2022-11-01 腾讯科技(深圳)有限公司 内容识别方法、装置、设备、存储介质及计算机程序产品

Also Published As

Publication number Publication date
CN115272768A (zh) 2022-11-01
WO2024027347A1 (zh) 2024-02-08

Similar Documents

Publication Publication Date Title
CN110866140B (zh) 图像特征提取模型训练方法、图像搜索方法及计算机设备
Leng et al. A survey of open-world person re-identification
Asghar et al. Copy-move and splicing image forgery detection and localization techniques: a review
CN108694225B (zh) 一种图像搜索方法、特征向量的生成方法、装置及电子设备
CN111062871B (zh) 一种图像处理方法、装置、计算机设备及可读存储介质
US20170109615A1 (en) Systems and Methods for Automatically Classifying Businesses from Images
CN109871490B (zh) 媒体资源匹配方法、装置、存储介质和计算机设备
Tu et al. Ultra-deep neural network for face anti-spoofing
CN111027576B (zh) 基于协同显著性生成式对抗网络的协同显著性检测方法
CN114663670A (zh) 一种图像检测方法、装置、电子设备及存储介质
WO2012141655A1 (en) In-video product annotation with web information mining
CN111182364B (zh) 一种短视频版权检测方法及系统
CN115443490A (zh) 影像审核方法及装置、设备、存储介质
CN113569740B (zh) 视频识别模型训练方法与装置、视频识别方法与装置
Maigrot et al. Mediaeval 2016: A multimodal system for the verifying multimedia use task
WO2024027347A9 (zh) 内容识别方法、装置、设备、存储介质及计算机程序产品
Zhang et al. Retargeting semantically-rich photos
Bhattacharjee et al. Query adaptive multiview object instance search and localization using sketches
Kalaiarasi et al. Clustering of near duplicate images using bundled features
CN115687670A (zh) 图像搜索方法、装置、计算机可读存储介质及电子设备
Bahroun et al. KS‐FQA: Keyframe selection based on face quality assessment for efficient face recognition in video
Al-Jubouri et al. A comparative analysis of automatic deep neural networks for image retrieval
Salih et al. Two-layer content-based image retrieval technique for improving effectiveness
Yousaf et al. Patch-CNN: Deep learning for logo detection and brand recognition
Sheikh Fathollahi et al. Gender classification from face images using central difference convolutional networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23849057

Country of ref document: EP

Kind code of ref document: A1