CN117953250A - Image processing method and device


Info

Publication number
CN117953250A
Authority
CN
China
Prior art keywords
image
text
feature
clustering
features
Prior art date
Legal status
Pending
Application number
CN202311502232.2A
Other languages
Chinese (zh)
Inventor
王淳
王钰
Current Assignee
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd
Priority to CN202311502232.2A
Publication of CN117953250A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an image processing method and device. The method includes: acquiring a first clustering result of images to be processed; performing feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performing feature extraction on a target text representing the user's clustering intention according to a first text model to obtain text features; merging the first clustering result according to the first image feature set and the text features to obtain a first interactive image set; performing feature extraction on the first interactive image set according to the first image model to obtain a second image feature set; and splitting the first interactive image set according to the second image feature set and the text features to obtain a second interactive image set. In this way, the clustering result of the images is adjusted through two rounds of interactive clustering with the text representing the user's clustering intention, so that a clustering result meeting the user's clustering requirement can be obtained.

Description

Image processing method and device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method and apparatus.
Background
Unsupervised clustering is a common data analysis method that can automatically extract latent associations or structural regularities from a large number of samples, thereby helping users understand the data. In the field of image processing, unsupervised clustering may be used to classify images and the like. For example, when images of different animals are processed using unsupervised clustering, images containing horses may be classified as one type, images containing lions as another type, and so on.
However, in practical applications, when performing image processing using unsupervised clustering, the clustering result may fail to meet the user's clustering requirement. For example, when images are clustered, images containing a horse may be classified into one type and images containing a horse's whole body into another type, but for the user both types belong to images of horses and should be classified as one type.
Disclosure of Invention
The embodiment of the application provides an image processing method and device, which are used for solving the problem that a clustering result cannot meet the clustering requirement of a user when image processing is carried out based on unsupervised clustering.
In order to solve the technical problems, the embodiment of the application is realized as follows:
in a first aspect, an image processing method is provided, including:
Acquiring a first clustering result of an image to be processed;
Performing feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performing feature extraction on a target text according to a first text model to obtain text features, wherein the target text is used for representing the clustering intention of a user, and the first image model is associated with the first text model;
combining the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
Extracting features of the first interactive image set according to the first image model to obtain a second image feature set;
And splitting the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set.
In a second aspect, an image processing apparatus is provided, comprising:
the acquisition module acquires a first clustering result of the image to be processed;
The first feature extraction module is used for performing feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performing feature extraction on a target text according to a first text model to obtain text features, wherein the target text is used for representing the clustering intention of a user, and the first image model is associated with the first text model;
The merging module is used for merging the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
the second feature extraction module is used for carrying out feature extraction on the first interactive image set according to the first image model to obtain a second image feature set;
And the splitting module is used for splitting the first interactive image set according to the second image feature set and the text features to obtain a second interactive image set.
In a third aspect, an image processing method is provided, including:
Receiving an image search request, wherein the image search request comprises target text, and the target text is used for representing the search intention of a user;
Acquiring a first clustering result of an image to be processed;
Performing feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performing feature extraction on the target text according to a first text model to obtain text features, wherein the first image model is associated with the first text model;
combining the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
Extracting features of the first interactive image set according to the first image model to obtain a second image feature set;
Splitting the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set;
And returning the image set conforming to the description of the target text in the first interactive image set and the second interactive image set as a search result.
In a fourth aspect, an image processing apparatus is provided, comprising:
the receiving module is used for receiving an image search request, wherein the image search request comprises target text, and the target text is used for representing the search intention of a user;
the acquisition module acquires a first clustering result of the image to be processed;
the first feature extraction module is used for performing feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performing feature extraction on the target text according to a first text model to obtain text features, wherein the first image model is associated with the first text model;
The merging module is used for merging the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
the second feature extraction module is used for carrying out feature extraction on the first interactive image set according to the first image model to obtain a second image feature set;
the splitting module is used for splitting the first interactive image set according to the second image feature set and the text features to obtain a second interactive image set;
and the result returning module returns the image set which accords with the description of the target text in the first interactive image set and the second interactive image set as a search result.
In a fifth aspect, the present application provides an electronic device, comprising:
A processor;
A memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement the method according to the first or third aspect.
In a sixth aspect, the application provides a computer readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method according to the first or third aspect.
The above at least one technical scheme adopted by the embodiment of the application can achieve the following beneficial effects:
when images are processed, after a first clustering result of the images is obtained, a first image model can be used to extract image features from the first clustering result, and a first text model associated with the first image model can be used to extract text features from a target text representing the user's clustering intention. The first clustering result is merged according to the extracted image features and text features to obtain a first interactive image set, and the first interactive image set is then split based on the text features and the image features of the first interactive image set to obtain a second interactive image set. In this way, combining the text features of the text representing the user's clustering intention with the image features of the first clustering result makes it possible to merge images in the first clustering result that are dissimilar at the image level but semantically similar, while splitting the first interactive image set using those text features and the image features of the first interactive image set makes it possible to separate images that are similar at the image level but semantically dissimilar. Thus, through two rounds of interactive clustering with the text representing the user's clustering intention, the clustering result of the images is adjusted so that a clustering result meeting the user's clustering requirement can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an image processing method according to an embodiment of the application;
FIG. 2 is a flow chart of an image processing method according to an embodiment of the application;
FIG. 3 is a schematic diagram of a first clustering result according to one embodiment of the application;
FIG. 4 is a schematic representation of a first set of interactive images in accordance with an embodiment of the present application;
FIG. 5 is a schematic representation of a second set of interactive images according to an embodiment of the present application;
FIG. 6 is a flow chart of an image processing method according to an embodiment of the application;
FIG. 7 is a schematic diagram of the architecture of an electronic device according to one embodiment of the application;
FIG. 8 is a schematic diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
Detailed Description
At present, when unsupervised clustering is used for image processing, the clustering result may fail to meet the user's clustering requirement. For example, when images containing animals are clustered, images containing a horse may be classified as one type and images containing a horse's whole body as another type; however, the user would rather have these classified as one type than two, because both belong to images of horses.
In the related art, in order to obtain a clustering result satisfying the user's clustering requirement, the model used for image feature extraction may be selected by the user, the algorithm used for unsupervised clustering may be selected by the user, the parameters of the clustering algorithm may be adjusted by the user, and so on. However, the effect of these approaches is limited, and the problem that the clustering result does not meet the user's clustering requirement remains.
The embodiment of the application provides an image processing method and device in which the user's clustering intention is introduced through text: on the basis of a first clustering result of the images, two rounds of interactive clustering are performed with the text representing the user's clustering intention, so that the clustering result can be adjusted to meet the user's clustering requirement.
In order to make the technical solution of the present application better understood by those skilled in the art, the technical solution of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, shall fall within the scope of the application.
The terms "first", "second" and the like in the description and in the claims are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the present description may be capable of operation in sequences other than those illustrated or described herein. In addition, in the present specification and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The core inventive concept of the technical scheme provided by the embodiment of the application is as follows: when images are clustered, the user's clustering intention is introduced through text; after a first clustering result of the images is obtained, two rounds of interactive clustering can be performed with the target text representing the user's clustering intention, so that the first clustering result can be adjusted and a clustering result meeting the user's clustering requirement obtained.

Specifically, after a first clustering result of the images is obtained, the first image model may be used to extract image features from the first clustering result, while a first text model associated with the first image model is used to extract text features from the target text representing the user's clustering intention. The first clustering result is merged according to the extracted image features and text features to obtain a first interactive image set (the first interactive clustering); this merging may combine images in the first clustering result that are dissimilar at the image level but semantically similar. Then the first image model may be used to extract image features from the first interactive image set, and the first interactive image set may be split according to the extracted image features and the previously extracted text features to obtain a second interactive image set (the second interactive clustering); this splitting may separate images in the first interactive image set that are similar at the image level but semantically dissimilar. Thereby, through two rounds of interactive clustering with the target text representing the user's clustering intention, a clustering result meeting the user's clustering requirement may finally be obtained.

The first image model and the first text model are associated with each other: they are trained on the same image samples and text samples, where the text samples include texts obtained by textually describing the image samples. Therefore, when the clustering result is merged and split according to the image features extracted by the first image model and the text features extracted by the first text model, the features have the ability to match each other across images and text, so that image features and text features can be measured against each other, enabling the merging and splitting of the clustering result.

Optionally, the first clustering result may be obtained by clustering according to an image feature extraction model and a clustering algorithm selected by the user, so that the clustering result better conforms to the user's actual requirement. Optionally, when the first image model is used to extract image features from the first clustering result, a core sample set corresponding to each class in the first clustering result can be determined first, and the image features of the core sample sets extracted, so that the first interactive clustering can be performed based on the image features of the core sample sets, reducing the amount of computation and improving efficiency.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
FIG. 1 is a flow chart of an image processing method according to an embodiment of the application. The image processing method is as follows.
S12: Acquire a first clustering result of the images to be processed.
The images to be processed are images to be clustered; they may be landscape images, person images, animal images, or the like, depending on the actual application scenario, and are not specifically limited herein. The number of images to be processed may be plural.
The first clustering result may be a result obtained after clustering the images to be processed based on unsupervised clustering. In some embodiments, the first clustering result may be an existing clustering result, and when the first clustering result is acquired, the first clustering result may be acquired from a client or a server that stores the clustering result. In other implementations, the first clustering result may not be an existing clustering result, so that when the first clustering result is obtained, unsupervised clustering may be performed on the image to be processed, and a result obtained by clustering is used as the first clustering result.
In the case of obtaining the first clustering result by clustering, obtaining the first clustering result of the image to be processed may include the steps of:
acquiring a second image model and a first clustering algorithm;
Extracting features of the image to be processed according to the second image model to obtain image features corresponding to the image to be processed;
And clustering the image features according to a first clustering algorithm to obtain a first clustering result.
The second image model is used for extracting image features of the images to be processed. Because different image models correspond to different training samples and training targets, the features they extract have different emphases; therefore, when the second image model is acquired, an image model that reflects the user's clustering intention can be acquired. Optionally, in some embodiments, the second image model may be selected by the user, and acquiring the second image model then means acquiring the image model selected by the user. Specifically, when the images to be processed are clustered, a plurality of different types of image models can be provided for the user to select from, the extracted image features of which have different emphases. For example, a first type of image model obtained by autoencoder reconstruction training and a second type of image model obtained by contrastive learning training can be provided: the image features extracted by the first type tend to retain complete image information, while those extracted by the second type tend to capture discriminative image information. Other types of image models can also be provided, which are not enumerated one by one. When selecting, the user can choose a suitable image model according to actual requirements. For example, if the user is more concerned about the image as a whole and wishes the extracted features to preserve complete image information, the first type of image model may be selected; if the user is more concerned about image details and wishes the extracted features to be more discriminative, the second type of image model may be selected; and so on. After the user makes a selection, the second image model may be acquired based on the user's selection.
The first clustering algorithm is used for clustering the images to be processed and may be an unsupervised clustering algorithm. Because different clustering algorithms have different clustering parameters, the clustering results they produce differ; therefore, when the first clustering algorithm is acquired, a clustering algorithm that reflects the user's clustering intention can be acquired. Optionally, in some embodiments, the first clustering algorithm may be selected by the user from a plurality of clustering algorithms with different clustering parameters. The user can choose a suitable clustering algorithm according to actual requirements and, optionally, adjust the parameters of the chosen algorithm.
In a possible implementation manner, when the second image model and the first clustering algorithm are acquired for clustering the images to be processed, a plurality of image models for extracting image features and a plurality of clustering algorithms for clustering are displayed to the user through an interactive interface, together with annotations such as the emphasis of the features extracted by each image model and the clustering effect of each clustering algorithm, so that the user can select an image model and a clustering algorithm according to the displayed information and the user's own clustering requirement. After the user selects, the image model selected by the user may be used as the second image model and the clustering algorithm selected by the user as the first clustering algorithm.
After the second image model and the first clustering algorithm are obtained, the second image model can be used for extracting image features of the image to be processed to obtain image features corresponding to the image to be processed, and then the first clustering algorithm is used for clustering the image features, so that a first clustering result can be obtained. The first clustering result comprises a plurality of classes, each class comprises one or more images, the similarity of the images among different classes is low, and the similarity of the images in the same class is high. The specific implementation manner of using the second image model to perform image feature extraction and using the first clustering algorithm to perform clustering may refer to the specific implementation manner in the related art, and will not be described in detail herein.
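As an illustration, the following is a minimal sketch of this step with stand-in data and a placeholder feature extractor; the embodiment does not prescribe a specific image model or clustering algorithm, and k-means is used here only as one possible first clustering algorithm:

```python
# Minimal sketch of S12 (hypothetical names, stand-in data): extract features
# with a user-selected "second image model", then cluster the features with a
# user-selected unsupervised "first clustering algorithm".
import numpy as np
from sklearn.cluster import KMeans

def extract_image_features(images, image_model):
    # `image_model` is any callable mapping one image to a 1-D feature vector,
    # e.g. an autoencoder encoder or a contrastively trained encoder.
    return np.stack([image_model(img) for img in images])

rng = np.random.default_rng(0)
images = [rng.random((32, 32, 3)) for _ in range(100)]  # stand-in images
toy_model = lambda img: img.reshape(-1)[:64]            # placeholder extractor

features = extract_image_features(images, toy_model)
first_clustering = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
# first_clustering[i] is the class index of images[i] in the first clustering result
```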
The second image model and/or the first clustering algorithm used can be selected by the user when the images to be processed are clustered, so that the obtained first clustering result can be more in line with the clustering requirement of the user, and further, after the first clustering result is subjected to interactive clustering in the follow-up process, the obtained clustering result can be more in line with the clustering requirement of the user.
S14: Perform feature extraction on the first clustering result according to the first image model to obtain a first image feature set, and perform feature extraction on the target text according to the first text model to obtain text features, where the target text is used for representing the clustering intention of the user, and the first image model is associated with the first text model.
After the first clustering result is obtained, the first image model may be used to extract image features from the first clustering result, obtaining an image feature set corresponding to each class in the first clustering result (one class may correspond to one set); for convenience of distinction, the extracted image feature sets corresponding to the classes are collectively represented as the first image feature set. Meanwhile, the first text model may be used to extract text features from the target text representing the user's clustering intention. The first image feature set and the text features may then be used to perform the first interactive clustering on the first clustering result, obtaining the first interactive image set.
The target text may be input by the user according to the actual clustering requirement; specifically, it may be input before S12 is executed, or after the first clustering result is obtained and found not to meet the user's requirement, which is not specifically limited herein. The target text is used to characterize the user's clustering intention, i.e., it describes the samples the user wishes to obtain. For example, the target text may be "an image containing a horse", or "an image containing a dark brown horse", and so on. The user may input a plurality of texts, each corresponding to a desired class; in that case, interactive clustering is performed independently for each text in the same manner, and the embodiment of the application takes interactive clustering with one text as an example.
The first image model and the first text model are associated models; specifically, they are trained on the same image samples and text samples, where the text samples include texts obtained by textually describing the image samples. For example, if an image sample contains an image of a horse, the text samples include the text "image of a horse". When training models on the image samples and text samples, training can be performed by contrastive learning on the image samples and text samples. In addition, the image samples and text samples may include positive and negative samples, so that image and text models with better effect can be trained. A negative sample among the image samples may be an image whose textual description is inconsistent with the text sample, and a negative sample among the text samples may be a text inconsistent with the textual description corresponding to the image sample.
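As an illustration of such joint training, the following is a minimal sketch of a symmetric contrastive loss over a batch of paired image and text features, where the batch pairing itself supplies the positives and negatives; the dimensions and temperature value are assumptions, not taken from the embodiment:

```python
# Sketch of contrastive training for associated image and text models
# (hypothetical dimensions; real encoders would produce these features).
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    # Normalize so that dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0))               # i-th image pairs with i-th text
    # Symmetric cross-entropy pulls matched pairs together and pushes the
    # mismatched (negative) pairings in the batch apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

img = torch.randn(8, 512, requires_grad=True)  # stand-in image-branch outputs
txt = torch.randn(8, 512, requires_grad=True)  # stand-in text-branch outputs
contrastive_loss(img, txt).backward()          # would update both encoders in training
```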
It should be noted that the associated first image model and first text model are used in the embodiment of the application because associated image and text models have the ability to match and measure each other across images and text. Thus, when the clustering result is subsequently merged and split according to the image features extracted by the first image model and the text features extracted by the first text model, the extracted image features and text features can be measured against each other, enabling the merging and splitting of the clustering result. If there were no association between the first image model and the first text model, matching and measurement across images and text could not be achieved; in other words, if the clustering result were merged and split according to image features and text features extracted by unassociated models, the accuracy of the obtained result would be low. Therefore, the associated first image model and first text model must be used to perform the respective feature extractions.
Alternatively, in some implementations, the first image model and the first text model may be the image branch and the text branch of a multimodal model, with the aforementioned association between them. A multimodal model is a model that relates data of two different modalities. The embodiment of the application mainly concerns models of the two modalities of images and text, which can relate image features (or image feature vectors) to the corresponding text features (or text feature vectors). When the image content and the text content correspond, their feature vectors are similar; when they do not correspond, the measured difference between their feature vectors is large. With such a model, content that is difficult for a person to understand, such as image features, can be related to text that is easy to understand. For example, after feature extraction of an image containing a horse, it is difficult for a user to determine from the image features alone which type of image it is (for example, an image of a horse or an image of a lion); but if the image features are related to the text features of the text "horse image", the user can know that the image corresponding to those image features is an image of a horse rather than of another animal.
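As a concrete example, CLIP is one publicly available multimodal model of this kind (the embodiment does not name a specific model); its image branch and text branch produce features that can be measured against each other by cosine similarity:

```python
# Sketch using CLIP (via Hugging Face transformers) as one example of a
# multimodal model with associated image and text branches; the embodiment
# does not prescribe this particular model.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255))  # stand-in image
texts = ["an image of a horse", "an image of a lion"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
similarity = img_feat @ txt_feat.t()  # cosine similarity of the image to each text
print(similarity)  # the larger entry indicates which text describes the image
```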
The first image model and the first text model can be selected by a user, and the first image model and the first text model selected by the user can represent the clustering intention of the user, so that when the image features and the text features are extracted according to the first image model and the first text model, the extracted features can be more in line with the actual requirements of the user. Optionally, in the case that the first image model and the first text model are image branches and text branches in the multimodal model, the multimodal model may also be selected by the user, so that when image features and text features are extracted according to the image branches and the text branches in the multimodal model, the extracted features may better meet the user requirements.
In one possible implementation manner, before feature extraction is performed according to the first image model and the first text model, a plurality of first image models and a plurality of first text models can be displayed to a user through an interactive interface, the plurality of first image models and the plurality of first text models can be displayed in a pairing mode, then the emphasis points of each pair of models in feature extraction are marked, and the user can select the corresponding first image models and first text models according to actual requirements. After the user selects the first image model and the first text model, feature extraction can be performed on the first clustering result according to the first image model selected by the user to obtain a first image feature set, and feature extraction is performed on the target text according to the first text model to obtain text features. Or a plurality of multi-mode models can be displayed to the user through the interactive interface, and the emphasis point of each multi-mode model in the process of feature extraction is marked at the same time, so that the user can select the corresponding multi-mode model according to actual requirements. After the user selects the multimodal model, feature extraction can be performed on the first clustering result according to image branches in the multimodal model to obtain a first image feature set, and feature extraction is performed on the target text according to text branches in the multimodal model to obtain text features.
The first clustering result comprises a plurality of classes, and each class comprises one or more images. When the image feature extraction is performed on the first clustering result according to the first image model, optionally, in some embodiments, feature extraction may be performed on all images in each class, where the obtained first image feature set includes an image feature set corresponding to each class, and each class may correspond to one image feature set, and each image feature set includes image features corresponding to all images in one class. Optionally, in other embodiments, in order to reduce the amount of calculation and improve the subsequent interactive clustering efficiency, when the image feature extraction is performed on the first clustering result, feature extraction may be performed on a part of images in each class, alternatively, feature extraction may be performed on core sample images with representativeness in each class, that is, feature extraction is performed on core sample sets (sets formed by the core sample images) corresponding to each class, an image feature set corresponding to each core sample set is included in the obtained first image feature set, each core sample set may correspond to one image feature set, and each image feature set includes image features corresponding to core sample images in one core sample set.
The core sample set may be obtained by determining according to the first clustering result after the first clustering result is obtained. Specifically, after the first clustering result is obtained, for each class included in the first clustering result, one or more core sample images having representativeness in the class may be determined, and then a set formed by the one or more core sample images is determined as a core sample set corresponding to the class, so that a core sample set corresponding to each class may be obtained, and one class may correspond to one core sample set. In this way, when the first image model is used to perform image feature extraction on the first clustering result, feature extraction can be performed on a plurality of core sample sets.
In determining the core sample set corresponding to each class in the first clustering result, in some implementations, for each class, a corresponding core sample image may be determined by means of a graph structure, thereby obtaining a core sample set, and specifically, the method may include the following steps:
Constructing a first graph structure according to the image characteristics of each image in the class, wherein each node in the first graph structure corresponds to the image characteristics of one image, and the edge between two nodes corresponds to the cosine distance between the image characteristics of two images;
Removing edges, corresponding to cosine distances smaller than a first preset threshold, in the first graph structure;
for each node in the first graph structure, determining the proximity centrality of the node according to the number of neighbor nodes of the node, the cosine distance between the node and the neighbor nodes and the total number of the nodes in the first graph structure;
And determining a set formed by the first images in the class as the core sample set corresponding to the class, wherein a first image is an image whose image feature corresponds to a node in the first graph structure with proximity centrality greater than or equal to a second preset threshold.
Specifically, for a certain class, the cosine distance between the image features of every two images in the class can be calculated from the image features of the images in the class. The image feature of each image in the class may be the image feature used when the class was obtained by clustering, for example the image feature extracted using the second image model in S12 above. The cosine distance may be used to measure the similarity between two images (or image features): the closer the cosine distance is to 1, the more similar the two images; conversely, the smaller the cosine distance, the less similar the two images. The specific implementation of calculating the cosine distance from the image features can be found in the related art and will not be described in detail here.
After the cosine distances are obtained, the image features of the images in the class can be used as nodes and the cosine distance between every two image features as the weight of the corresponding edge; for convenience of distinction, the constructed graph structure is represented as the first graph structure. After the first graph structure is constructed, for each edge in the first graph structure, the cosine distance corresponding to the edge can be compared with a first preset threshold. If the cosine distance is greater than or equal to the first preset threshold, the similarity between the images corresponding to the two nodes connected by the edge is high and the edge can be retained; if the cosine distance is less than the first preset threshold, the similarity between the images corresponding to the two connected nodes is low and the edge can be ignored, i.e., removed (the two nodes it connects are retained). The first preset threshold may be set according to actual requirements and is not specifically limited herein.
After the above determination and processing are performed on each edge in the first graph structure, the proximity centrality of each node may be further calculated. The proximity centrality of a node may represent an average of the minimum distances between the node and other nodes; in general, the greater the proximity centrality of a node, the closer the node is to the central location and the more representative it is. The proximity centrality may be calculated from the number of neighbor nodes of the node, the cosine distances between the node and its neighbor nodes, and the total number of nodes in the first graph structure. Alternatively, in some embodiments, for a node $r_j$ in the first graph structure, the proximity centrality may be calculated by the following formula:

$$C(r_j) = \frac{|R(r_j)|}{N-1} \cdot \frac{|R(r_j)|}{\sum_{r_i \in R(r_j)} d(r_j, r_i)}$$

where $R(r_j)$ is the set of neighbor nodes directly connected to node $r_j$ by edges, $|R(r_j)|$ is the number of those neighbor nodes, $N$ is the total number of nodes in the first graph structure, $d(r_j, r_i)$ is the cosine distance between node $r_j$ and node $r_i$, and $\sum_{r_i \in R(r_j)} d(r_j, r_i)$ is the sum of the cosine distances between node $r_j$ and all of its neighbor nodes $r_i \in R(r_j)$.
After the proximity centrality of each node is obtained, the target nodes can be determined according to the proximity centrality, and the set formed by the first images corresponding to the target nodes is determined as the core sample set corresponding to the class, where the first images are the core sample images in the class. In determining the target nodes, optionally, in some embodiments, the nodes whose proximity centrality is greater than or equal to the second preset threshold may be taken as target nodes; in other embodiments, a preset number of nodes with the highest proximity centrality may be selected as target nodes. The second preset threshold and the preset number may be determined according to actual requirements and are not specifically limited herein.
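Under the assumption that the proximity centrality takes the closeness-centrality form reconstructed above, a minimal sketch of core-sample selection for one class might look as follows (the threshold values are illustrative, not taken from the embodiment):

```python
# Sketch of core sample selection for one class: threshold the cosine
# similarity graph, compute each node's proximity centrality, and keep the
# images whose centrality reaches the second preset threshold.
import numpy as np

def core_sample_indices(feats, edge_thresh=0.8, centrality_thresh=0.5):
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T                 # pairwise cosine distances d(rj, ri)
    np.fill_diagonal(sim, 0.0)
    adj = sim >= edge_thresh              # drop edges below the first preset threshold
    n = len(feats)
    centrality = np.zeros(n)
    for j in range(n):
        neighbors = np.flatnonzero(adj[j])
        if neighbors.size == 0:
            continue                      # isolated node, centrality stays 0
        k = neighbors.size
        # C(rj) = (|R(rj)| / (N - 1)) * (|R(rj)| / sum of d(rj, ri)), as above.
        centrality[j] = (k / (n - 1)) * (k / sim[j, neighbors].sum())
    return np.flatnonzero(centrality >= centrality_thresh)

class_feats = np.random.default_rng(1).random((20, 64))  # stand-in class features
core_idx = core_sample_indices(class_feats)  # indices of the core sample images
```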
The method for determining the core sample set by constructing the graph structure is described in detail above, in other possible implementations, the core sample set may also be determined by other methods, for example, taking one of the classes as an example, for each image in the class, a sum (or average) of the similarity between the image and other images in the class may be calculated, a set formed by one or more images with the largest sum (or average) is determined as the core sample set corresponding to the class, and so on. Other possible implementations are not illustrated here.
S16: Merge the first clustering result according to the first image feature set and the text features to obtain a first interactive image set.
The merging process merges classes in the first clustering result whose images are dissimilar at the image level but semantically similar. The first interactive image set is the image set obtained after the merging; it comprises images with dissimilar image features but similar semantics.
As described in S14, the first image feature set includes a plurality of image feature sets, each image feature set corresponds to one class in the first clustering result, and each image feature set includes image features of all images in one class or image features of a core sample image in one class, so when the first clustering result is combined according to the first image feature set and the text feature, optionally, in some embodiments, the method may include the following steps:
Determining the similarity between each image feature set and the text feature;
determining a target image feature set from a plurality of image feature sets, wherein the similarity between the target image feature set and the text feature is greater than or equal to a preset similarity threshold;
and merging the classes corresponding to the target image feature set to obtain a first interactive image set.
Taking one image feature set as an example, determining the similarity between the image feature set and the text feature refers to determining whether the image corresponding to the image feature set accords with the description of the target text corresponding to the text feature, if the similarity is high, the image is more consistent with the description of the target text, and if the similarity is low, the image is less consistent with the description of the target text.
In determining the similarity between the image feature set and the text feature, optionally, in some embodiments, a cosine distance between the image feature set and the text feature may be determined, where the cosine distance may characterize the similarity between the image feature set and the text feature, and the greater the cosine distance may indicate a higher similarity between the image feature set and the text feature, and conversely, the smaller the cosine distance may indicate a lower similarity between the image feature set and the text feature.
In determining the cosine distance between the image feature set and the text feature, the cosine distance between each image feature and the text feature may be determined for each image feature in the image feature set, and specific implementations may be found in the related art and will not be described in detail herein. After the cosine distance between each image feature and the text feature in the image feature set is obtained, the cosine distance between the image feature set and the text feature can be further determined according to the cosine distance. For example, the sum of cosine distances between each image feature and text feature in the image feature set may be determined as the cosine distance between the image feature set and text feature, or the average of cosine distances between each image feature and text feature in the image feature set may be determined as the cosine distance between the image feature set and text feature, or the like, which is not particularly limited herein. After the cosine distance between the image feature set and the text feature is obtained, the similarity between the image feature set and the text feature is also obtained.
After obtaining the similarity between each image feature set and the text feature, each similarity may be compared with a preset similarity threshold, and image feature sets with similarity greater than or equal to the preset similarity threshold may be determined from a plurality of image feature sets according to the comparison result, and for convenience of distinction, these sets may be represented as target image feature sets. The preset similarity threshold may be set according to actual requirements, which is not specifically limited herein. Alternatively, in some embodiments, if the similarity between the image feature set and the text feature is a cosine distance, when determining the target image feature set from the plurality of image feature sets, a set in which the cosine distance is greater than or equal to a third preset threshold value in the plurality of image feature sets may be determined as the target image feature set. The third preset threshold may be set according to actual requirements, and is not specifically limited herein.
After the target image feature set is obtained, as the similarity between the target image feature set and the text feature is higher, the degree of coincidence between the image in the class corresponding to the target image feature set and the target text representing the clustering intention of the user can be described as higher, that is, the semantic similarity of the image in the class corresponding to the target image feature set is higher, so that the class corresponding to the target image feature set can be combined, and the combined image set is the first interactive image set.
For example, suppose the first clustering result includes four classes: class A, class B, class C and class D. If the similarity between the image feature sets corresponding to class A and class B and the text features of the target text representing the user's clustering intention is greater than the preset similarity threshold, while the similarity for class C and class D is less than the threshold, then the images in class A and class B are dissimilar at the image level (because they were separated into two classes during clustering) but semantically similar (because the similarity between their image features and the text features is high). The images of class A and class B can therefore be merged to obtain the first interactive image set.
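A minimal sketch of this merging step, assuming the set-to-text similarity is the mean cosine distance of a class's image features to the text feature, with an illustrative threshold:

```python
# Sketch of the merging step S16: classes whose image feature sets are
# sufficiently similar to the text feature are merged into the first
# interactive image set (threshold value is illustrative).
import numpy as np

def merge_by_text(feature_sets, text_feat, sim_thresh=0.25):
    # feature_sets: {class_id: (num_images, dim) array}; text_feat: (dim,) array.
    text_feat = text_feat / np.linalg.norm(text_feat)
    target_classes = []
    for cid, feats in feature_sets.items():
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        set_similarity = float((feats @ text_feat).mean())  # set-to-text cosine distance
        if set_similarity >= sim_thresh:
            target_classes.append(cid)  # class of a "target image feature set"
    return target_classes               # classes merged into the first interactive set

rng = np.random.default_rng(2)
sets = {c: rng.normal(size=(10, 64)) for c in ("A", "B", "C", "D")}
merged = merge_by_text(sets, rng.normal(size=64))
```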
Because the classes which are dissimilar in the images and have similar semantics in the first clustering result can be combined based on the target text which characterizes the clustering intention of the user, the separation of the images with similar semantics into two classes can be avoided, and the clustering result which meets the clustering requirement of the user can be obtained.
S18: Perform feature extraction on the first interactive image set according to the first image model to obtain a second image feature set.
The first interactive image set includes a plurality of images, and when the feature extraction is performed on the first interactive image set, feature extraction may be performed on each image in the set to obtain an image feature of each image, where the image features of the images form a second image feature set, that is, the second image feature set may include an image feature of each image in the first interactive image set.
The second image feature set may be used for the subsequent second interactive clustering of the first interactive image set. In the second interactive clustering, the text features used are still those of the target text representing the user's clustering intention from S14. Therefore, to facilitate the second interactive clustering and obtain a clustering result meeting the user's requirement, the feature extraction on the first interactive image set needs to use the first image model from S14, because the first image model is associated with the first text model used for text feature extraction. Only when the image model and the text model are associated do the extracted image features and text features have the ability to match each other across images and text (i.e., to measure distances against each other), so that the second interactive clustering can yield a clustering result meeting the user's requirement. If an image model not associated with the first text model were used, this cross-modal matching ability would be absent, matching measurement between image features and text features could not be performed effectively during the second interactive clustering, and correspondingly a clustering result meeting the user's requirement could not be obtained.
S110: Split the first interactive image set according to the second image feature set and the text features to obtain a second interactive image set.
The splitting process separates images in the first interactive image set that are similar at the image level but semantically dissimilar. The second interactive image set is the image set obtained after this splitting; the images it contains are similar in image features and similar in semantics. Compared with the first interactive image set, the second interactive image set no longer contains the images that do not correspond to the user's clustering intention.
When splitting the first interactive image set according to the second image feature set and the text features, optionally, in some embodiments, the following steps may be included:
determining target image features similar to the text features from the second image feature set according to the second image feature set and the text features;
Determining a second image corresponding to the target image characteristic from the first interactive image set according to the target image characteristic;
A set of second images is determined as a second set of interaction images.
The second images are the images in the first interactive image set that are similar at the image feature level and also similar, at the semantic level, to the text characterizing the user's clustering intention. The other images in the first interactive image set are similar at the image feature level but dissimilar, at the semantic level, to the text characterizing the user's clustering intention; these images do not meet the user's clustering requirement and need to be split away from the second images.
When the first interactive image set is split, the second image may be determined from the first interactive image set. When determining the second image from the first interactive image set, determining a target image feature similar to the text feature from the second image feature set according to the second image feature set obtained by extracting the features of the first interactive image set and the text feature of the text for representing the clustering intention of the user, and then determining the second image according to the target image feature. Wherein in determining target image features from the second set of image features that are similar to the text features, optionally, in some embodiments, this may be achieved by means of a graph structure, in particular comprising the steps of:
Constructing a second graph structure according to the second image feature set and the text features, wherein the image features and the text features in the second image feature set correspond to nodes in the second graph structure, and the weights of edges between the two nodes correspond to cosine distances between the two features;
removing edges with the cosine distance smaller than a fourth preset threshold value from the second graph structure;
Starting from the node corresponding to the text feature, performing depth-first search in the second graph structure, and determining a target node which can be reached by search;
And determining the image characteristics corresponding to the target nodes to obtain target image characteristics similar to the text characteristics.
The second image feature set includes a plurality of image features. When the second graph structure is constructed, these image features and the text feature of the text input by the user can be used as nodes, and the weight of the edge between two nodes is the cosine distance between the two features it connects (text feature and image feature, or image feature and image feature); the second graph structure can thus be constructed. The cosine distance may be used to measure the similarity between two features: the closer the cosine distance is to 1, the more similar the two features; conversely, the smaller the cosine distance, the less similar they are. The specific implementation of calculating the cosine distance can be found in the related art and will not be described in detail here.
After the second graph structure is constructed, for each edge in the second graph structure, the cosine distance corresponding to the edge can be compared with a fourth preset threshold. If the cosine distance is greater than or equal to the fourth preset threshold, it may be indicated that the similarity between the features corresponding to the two nodes connected to the edge is high, the edge may be retained, and if the cosine distance is less than the fourth preset threshold, it may be indicated that the similarity between the features corresponding to the two nodes connected to the edge is low, the edge may be omitted, i.e., the edge is removed (the two nodes connected to the edge are still retained). The fourth preset threshold may be set according to actual requirements, which is not specifically limited herein.
After the above judgment and processing are performed on each edge in the second graph structure, a depth-first search may be performed in the second graph structure starting from the node corresponding to the text feature, and the nodes that can be reached by the search are determined as target nodes; since only edges between sufficiently similar features are retained, the image features corresponding to these target nodes are the target image features similar to the text feature.
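A minimal sketch of this splitting step, with an illustrative edge threshold: the text feature and the image features of the first interactive image set form the nodes of the second graph structure, weak edges are dropped, and a depth-first search from the text node collects the second images:

```python
# Sketch of the splitting step S110: depth-first search from the text-feature
# node over the thresholded second graph structure; reachable image nodes give
# the second interactive image set (threshold value is illustrative).
import numpy as np

def split_by_dfs(image_feats, text_feat, edge_thresh=0.8):
    feats = np.vstack([text_feat, image_feats])  # node 0 is the text feature
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    adj = (feats @ feats.T) >= edge_thresh       # keep only strong edges
    np.fill_diagonal(adj, False)
    reachable, stack = set(), [0]
    while stack:                                 # iterative depth-first search
        node = stack.pop()
        if node in reachable:
            continue
        reachable.add(node)
        stack.extend(int(i) for i in np.flatnonzero(adj[node]))
    # Map graph nodes back to image indices (offset by the text node); these
    # are the "second images" forming the second interactive image set.
    return sorted(i - 1 for i in reachable if i != 0)

rng = np.random.default_rng(3)
second_images = split_by_dfs(rng.normal(size=(30, 64)), rng.normal(size=64))
```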
After the target image features are obtained, when the second image is determined, one or more images corresponding to the target image features can be determined from the first interactive image set according to the corresponding relation between the images in the first interactive image set and the image features (obtained when the features of the first interactive image set are extracted), and the one or more images are the second image.
After the second image is obtained, the second image can be split from the first interactive image set, and a set formed by the second image is determined to be the second interactive image set, wherein the images included in the second interactive image set are similar in image feature level and similar to texts representing the clustering intentions of the users, so that the requirements of the users can be met.
Optionally, in some embodiments, after the first interactive image set and the second interactive image set are obtained, the method further includes:
If the images in the first interactive image set accord with the description of the target text, the first interactive image set is output as a target clustering result; or,
And if the images in the second interactive image set accord with the description of the target text, outputting the second interactive image set as a target clustering result.
In the first interactive image set and the second interactive image set obtained by the embodiment of the application, the images of at least one of the two sets meet the clustering requirement of the user. After the first interactive image set and the second interactive image set are obtained, when a final clustering result is output, it may be judged which of the two sets contains images that accord with the description of the target text. If the images in the first interactive image set accord with the description of the target text, the images in the first interactive image set meet the clustering requirement of the user, and in this case the first interactive image set can be output as the target clustering result. Likewise, if the images in the second interactive image set accord with the description of the target text, the second interactive image set can be output as the target clustering result.
When determining whether the images in the first interactive image set conform to the description of the target text, optionally, it may be determined whether the image features of all the images in the first interactive image set are similar to the text feature of the target text (where the image features and the text feature may be extracted by the first image model and the first text model, or by the image branch and the text branch of a multimodal model). If so, it may be determined that the images in the first interactive image set conform to the description of the target text; if not, that they do not. Optionally, the first interactive image set may also be displayed to the user in a preview manner: if confirmation information from the user is received, it may be determined that the images in the first interactive image set conform to the description of the target text, and if negative information from the user is received, that they do not. Determining whether the images in the second interactive image set conform to the description of the target text may be implemented in the same way, and the description is not repeated here.
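One possible reading of the feature-based conformance check is sketched below, reusing the cosine_distance helper from earlier; requiring every image feature to pass the threshold is an assumption made here for illustration:

```python
def conforms_to_text(image_feats, text_feat, threshold):
    # The set is taken to conform to the target text if every image
    # feature is similar enough to the text feature.
    return all(cosine_distance(f, text_feat) >= threshold
               for f in image_feats)
```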
Optionally, in some embodiments, after the first interactive image set is obtained, it may be determined whether the first interactive image set meets the description of the target text. If so, the first interactive image set is output as the target clustering result, and the second interactive clustering of the first interactive image set need not be performed, that is, S18 and S110 need not be executed; this simplifies the image processing steps and improves image processing efficiency. If not, S18 and S110 may be executed to obtain a second interactive image set meeting the user's clustering requirement.
In the embodiment shown in fig. 1, when the first clustering result is subjected to interactive clustering twice, the first clustering result is first merged and the first interactive image set obtained by merging is then split. In other possible implementations, the first clustering result may be split first (in the same way as the splitting in the embodiment shown in fig. 1) and the image set obtained by splitting may then be merged (in the same way as the merging in the embodiment shown in fig. 1); the clustering effect obtained is the same as that of merging first and splitting afterwards. If splitting is performed before merging, optionally, after the splitting result is obtained, it may be judged whether the splitting result meets the clustering requirement of the user. If so, the splitting result can be output as the target clustering result and the subsequent merging is not needed; if not, merging can be continued so that a clustering result meeting the clustering requirement of the user is obtained. Of course, after both the splitting result and the merging result are obtained, it may also be judged which result meets the clustering requirement of the user, and that result may be output as the target clustering result; this is not limited herein.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, the following may be described by taking a more specific implementation manner shown in fig. 2 as an example, please refer to fig. 2. FIG. 2 is a flowchart of an image processing method according to an embodiment of the present application, which includes the following steps.
S22: and acquiring a second image model, and extracting features of the image to be processed according to the second image model to obtain corresponding image features.
The second image model may be selected by a user and may be denoted as F_1, the images to be processed as S = {s_i}, and the image features extracted using the second image model as R_1 = {r_i}, where r_i = F_1(s_i).
S24: and acquiring a first clustering algorithm, and clustering the image features according to the first clustering algorithm to obtain a first clustering result.
The first clustering algorithm may be selected by a user and may be denoted as C_1. The first clustering result obtained by clustering with the first clustering algorithm includes K classes, where the image feature set corresponding to the k-th class is {r_i}_k and the image set corresponding to the k-th class is {s_i}_k.
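A minimal sketch of S22 and S24 follows, assuming a user-supplied feature extractor standing in for F_1 and scikit-learn KMeans as one possible choice of C_1 (the embodiment does not mandate a particular model or clustering algorithm):

```python
import numpy as np
from sklearn.cluster import KMeans

def first_clustering(images, extract_fn, num_classes):
    """extract_fn stands in for the user-selected second image model F1;
    KMeans stands in for the user-selected first clustering algorithm C1."""
    feats = np.stack([extract_fn(s) for s in images])        # R1 = {r_i}
    labels = KMeans(n_clusters=num_classes).fit_predict(feats)
    clusters = {k: [images[i] for i in np.flatnonzero(labels == k)]
                for k in range(num_classes)}                 # {s_i}_k
    return clusters, feats, labels
```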
S26: a set of core samples corresponding to each class in the first cluster result is determined.
For the kth class, in determining the corresponding set of core samples:
Firstly, a graph structure (graph) is formed from the samples (namely the image features) in the k-th class image feature set {r_i}_k. Each image feature is a node of the graph structure, the connecting line between two nodes forms an edge, and the weight of the edge is the cosine distance d(r_j, r_i) between the two image features; if the cosine distance is lower than a first preset threshold, the edge is removed from the graph.
Second, the approximate centrality of each node is calculated. Assuming that an important node is closer to the other nodes, the index used for this judgment may be based on the average distance of the node to the other nodes, i.e., the approximate centrality. For node r_j, its approximate centrality is:

c(r_j) = ( |R(r_j)| / (N - 1) ) · ( Σ_{r_i ∈ R(r_j)} d(r_j, r_i) / |R(r_j)| )

wherein R(r_j) is the set of neighbor nodes directly connected to node r_j by an edge, |R(r_j)| is the number of neighbor nodes, N is the total number of nodes in the graph corresponding to the k-th image feature set {r_i}_k, d(r_j, r_i) is the cosine distance between node r_j and node r_i, and Σ_{r_i ∈ R(r_j)} d(r_j, r_i) is the sum of the cosine distances between node r_j and all of its neighbor nodes r_i ∈ R(r_j).
Finally, the core sample images are selected. Nodes with high approximate centrality may be preferentially selected as representative nodes. By way of example, the Top-3 may be selected, i.e., the three nodes with the highest approximate centrality are taken as the representative nodes.
The above steps are performed for each of the K classes of the first clustering result, the images corresponding to the representative nodes of each class are found, and these images are recorded as the core sample set {s_j*}_k.
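Putting S26 together, the following is a minimal sketch of the core-sample selection for a single class, reusing the cosine_distance helper from earlier; the Top-3 default mirrors the example above, and the centrality expression follows the formula given above:

```python
def core_samples(class_feats, class_images, tau, top=3):
    """Select the `top` most central images of one class {s_i}_k."""
    n = len(class_feats)
    # Neighbor sets after dropping edges whose cosine distance is
    # below the first preset threshold tau.
    neighbors = [[j for j in range(n)
                  if j != i and cosine_distance(class_feats[i],
                                                class_feats[j]) >= tau]
                 for i in range(n)]

    def centrality(i):
        if n < 2 or not neighbors[i]:
            return 0.0
        sims = [cosine_distance(class_feats[i], class_feats[j])
                for j in neighbors[i]]
        # Fraction of nodes reached, times average similarity to them.
        return (len(neighbors[i]) / (n - 1)) * (sum(sims) / len(sims))

    ranked = sorted(range(n), key=centrality, reverse=True)
    return [class_images[i] for i in ranked[:top]]           # {s_j*}_k
```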
S28: and extracting the characteristics of the core sample set corresponding to each class by using the image branches in the multi-modal model to obtain a first image characteristic set.
The image branch in the multimodal model may be denoted as F_2. The first image feature set includes a plurality of image feature sets, each element of which corresponds to a core sample; the set for the k-th class may be denoted as {q_j*}_k, where q_j* = F_2(s_j*).
S210: and extracting features of the target text for representing the clustering intention of the user by using text branches in the multimodal model to obtain text features.
The user composes a textual description of the desired samples, for example: "an image containing a horse". The user can write multiple sentences, each sentence corresponding to one desired type, and each sentence is processed independently. The text branch in the multimodal model may be denoted as T_1, and the extracted text feature may be denoted as t_i.
In the foregoing steps S28 and S210, the image branch and the text branch of a multimodal model may be used for image feature extraction and text feature extraction. In other implementations, the image feature extraction and the text feature extraction may also be performed using an associated first image model and first text model. The first image model and the first text model being associated means, specifically, that they are trained using the same image samples and text samples, where the text samples include text obtained by textually describing the image samples.
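By way of illustration, CLIP is one well-known multimodal model whose image and text branches are trained on paired image-text samples in the manner just described. The sketch below uses the Hugging Face transformers API; the checkpoint name is an assumption for illustration, not something mandated by this embodiment:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_branch(image: Image.Image) -> torch.Tensor:   # plays the role of F2
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)[0]

def text_branch(sentence: str) -> torch.Tensor:         # plays the role of T1
    inputs = processor(text=[sentence], return_tensors="pt", padding=True)
    with torch.no_grad():
        return model.get_text_features(**inputs)[0]
```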
S212: and combining the first clustering results by taking the text feature vector as a class center to obtain a first interactive image set.
For each of the K classes of the first clustering result, the image feature set {q_j*}_k and the current text feature t_i are taken, and the cosine distances between them are measured. If the largest of these cosine distances, i.e., the cosine distance of the image feature most similar to the text feature, exceeds the second preset threshold, the image set {s_i}_k of the k-th class is considered to be consistent with the text description. All image sets conforming to the text description are merged to obtain the first interactive image set of the text.
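A minimal sketch of this merging step, reusing the cosine_distance helper from earlier; taking the maximum over a class's core-sample features is one reasonable reading of "the image feature most similar to the text feature":

```python
def merge_by_text(core_feats_per_class, clusters, text_feat, threshold):
    """core_feats_per_class: {k: [q_j*]}; clusters: {k: [s_i]_k}.
    Returns the union of all classes consistent with the text."""
    merged = []
    for k, q_feats in core_feats_per_class.items():
        if not q_feats:
            continue
        best = max(cosine_distance(q, text_feat) for q in q_feats)
        if best >= threshold:          # second preset threshold
            merged.extend(clusters[k])
    return merged                      # the first interactive image set
```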
Optionally, after the first interactive image set is obtained, it may be determined whether the first interactive image set meets the description of the target text, if so, the first interactive image set may be returned as the target clustering result, where S214 may not be executed, and if not, S214 may be executed continuously. Here, the description will be given taking the case where S214 needs to be executed as an example.
S214: and taking the first text feature vector as a class center, splitting the first interactive image set to obtain a second interactive image set.
The feature vector of each image in the first interactive image set is extracted using the image branch F_2 of the multimodal model, yielding the second image feature set. A graph structure (graph) is then constructed from the text feature and the second image feature set, where the weight of an edge is the cosine distance d(r_j, r_i) between the two features; if the cosine distance is lower than the fourth preset threshold, the edge is removed. Then, starting from the node corresponding to the text feature, a depth-first search (DFS) is performed on the graph structure, and the set formed by the images corresponding to all reachable nodes is determined as the second interactive image set, which is returned as the result. Compared with the first interactive image set, the second interactive image set contains fewer images that are not similar to the text.
Optionally, after obtaining the second interaction image set, the second interaction image set may be output as the target clustering result.
The specific implementation of S22 to S214 may refer to the specific implementation of the corresponding steps in the embodiment shown in fig. 1, and will not be described in detail here.
In a typical application scenario, assume that the images to be processed include images containing a horse head and images containing the whole body of a horse. When the images to be processed are clustered, the existing scheme considers the similarity between an image containing a horse head and an image containing the whole body of a horse to be low, and therefore classifies the horse-head images into one class and the whole-body images into another; as shown in fig. 3, the two kinds of images are divided into classes (a) and (b). Sometimes the user can accept dividing the horse head and the whole body of the horse into two classes, but sometimes the user feels that both are horses and they should not be split into two classes. It can be seen that the result obtained by existing unsupervised clustering may fail to meet the user's requirement.
Based on the technical scheme provided by the embodiment of the application, after the first clustering result shown in fig. 3 is obtained, interactive clustering can be performed with the text representing the user's clustering intention, so that the first clustering result can be adjusted and a clustering result meeting the user's requirement is obtained. For example, assuming that the clustering intention of the user is "horse image", the final output clustering result may be the result shown in fig. 4, i.e., the two classes of images shown in fig. 3 are merged into one class, class (c); such a clustering result obviously meets the clustering requirement of the user. For another example, assuming that the clustering intention of the user is "dark brown horse image", the final output clustering result may be the result shown in fig. 5, i.e., the image containing the white horse head and the image of the whole body of the white horse in fig. 4 are both rejected. Clearly, the clustering result shown in fig. 5 meets the user's clustering requirement.
According to the technical scheme provided by the embodiment of the application, when an image is processed, after the first clustering result of the image is obtained, the first image model can be used to extract image features from the first clustering result, and the first text model associated with the first image model can be used to extract the text feature of the target text representing the user's clustering intention. The first clustering result is merged according to the extracted image features and text feature to obtain the first interactive image set, and the first interactive image set is then split based on the text feature and the image features of the first interactive image set to obtain the second interactive image set. In this way, by combining the text feature of the text representing the user's clustering intention with the image features of the first clustering result, images in the first clustering result that are visually dissimilar but semantically similar can be merged; and by splitting the first interactive image set with the text feature and the image features of the first interactive image set, images that are visually similar but semantically dissimilar can be separated. Thus, by performing interactive clustering twice with the text representing the user's clustering intention, the clustering result of the images can be adjusted and a clustering result meeting the user's clustering requirement can be obtained.
The technical scheme provided by the embodiment of the application can be applied to various application scenarios in which images are clustered without supervision, such as the construction of a material database, image searching, and the like. The application scenario of image searching is taken as an example below.
FIG. 6 is a flow chart of an image processing method according to an embodiment of the application. The embodiment shown in fig. 6 may include the following steps.
S62: an image search request is received, wherein the image search request comprises target text, and the target text is used for representing the search intention of a user.
In performing an image search, a user may input target text describing a search intention (i.e., a search purpose), such as an image of a horse, an image of a dark brown horse, or the like. After the user enters the target text, an image search request may be received when a search is triggered.
S64: and acquiring a first clustering result of the image to be processed.
The first clustering result may be a clustering result obtained by clustering in advance and stored in a database, or may be obtained by clustering a plurality of images which are not clustered in the database after receiving an image search request of a user, which is not particularly limited herein. Wherein the image model and the clustering algorithm used can be selected by the user when image clustering is performed.
S66: and extracting features of the target text according to the first text model to obtain text features, wherein the first image model is associated with the first text model.
S68: and combining the first clustering result according to the first image feature set and the text feature to obtain a first interactive image set.
S610: and extracting features of the first interactive image set according to the first image model to obtain a second image feature set.
S612: and splitting the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set.
S614: and returning the image set conforming to the description of the target text in the first interactive image set and the second interactive image set as a search result.
The specific implementation of S64 to S614 may refer to the specific implementation of the corresponding steps in the embodiment shown in fig. 1, and will not be described in detail here.
Based on the technical scheme provided by the embodiment of the application, in the application scene of image searching, the images meeting the requirements of users can be searched.
The foregoing describes certain embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Fig. 7 is a schematic structural view of an electronic device according to an embodiment of the present application. Referring to fig. 7, at the hardware level, the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. The memory may include a volatile memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be interconnected by an internal bus, which may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. The bus may be classified into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bi-directional arrow is shown in fig. 7, but this does not mean that there is only one bus or one type of bus.
The memory is used for storing programs. Specifically, a program may include program code comprising computer operating instructions. The memory may include volatile memory and non-volatile storage, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it, forming the image processing apparatus at the logic level. The processor executes the programs stored in the memory and is specifically configured to perform the following operations:
Acquiring a first clustering result of an image to be processed;
Performing feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performing feature extraction on a target text according to a first text model to obtain text features, wherein the target text is used for representing the clustering intention of a user, and the first image model is associated with the first text model;
combining the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
Extracting features of the first interactive image set according to the first image model to obtain a second image feature set;
And splitting the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set.
Or for performing the following operations:
Receiving an image search request, wherein the image search request comprises target text, and the target text is used for representing the search intention of a user;
Acquiring a first clustering result of an image to be processed;
Performing feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performing feature extraction on the target text according to a first text model to obtain text features, wherein the first image model is associated with the first text model;
combining the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
Extracting features of the first interactive image set according to the first image model to obtain a second image feature set;
Splitting the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set;
And returning the image set conforming to the description of the target text in the first interactive image set and the second interactive image set as a search result.
The method performed by the image processing apparatus disclosed in the embodiment of fig. 7 of the present application may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed thereby. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The electronic device may also execute the methods of fig. 1, fig. 2 and fig. 6, and implement the functions of the image processing apparatus in the embodiments shown in fig. 1, fig. 2 and fig. 6, which are not described herein again.
Of course, other implementations, such as a logic device or a combination of hardware and software, are not excluded from the electronic device of the present application, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or a logic device.
Embodiments of the present application also provide a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the methods of the embodiments of fig. 1,2 and 6, and in particular to perform the operations of:
Acquiring a first clustering result of an image to be processed;
Performing feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performing feature extraction on a target text according to a first text model to obtain text features, wherein the target text is used for representing the clustering intention of a user, and the first image model is associated with the first text model;
combining the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
Extracting features of the first interactive image set according to the first image model to obtain a second image feature set;
And splitting the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set.
Or for performing the following operations:
Receiving an image search request, wherein the image search request comprises target text, and the target text is used for representing the search intention of a user;
Acquiring a first clustering result of an image to be processed;
Performing feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performing feature extraction on the target text according to a first text model to obtain text features, wherein the first image model is associated with the first text model;
combining the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
Extracting features of the first interactive image set according to the first image model to obtain a second image feature set;
Splitting the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set;
And returning the image set conforming to the description of the target text in the first interactive image set and the second interactive image set as a search result.
Fig. 8 is a schematic diagram of an image processing apparatus 80 according to an embodiment of the present application. Referring to fig. 8, in a software implementation, the image processing apparatus 80 may include: an acquisition module 81, a first feature extraction module 82, a merging module 83, a second feature extraction module 84, and a splitting module 85, wherein:
the acquisition module 81 acquires a first clustering result of the image to be processed;
The first feature extraction module 82 performs feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performs feature extraction on a target text according to a first text model to obtain text features, wherein the target text is used for representing a clustering intention of a user, and the first image model is associated with the first text model;
the merging module 83 merges the first clustering result according to the first image feature set and the text feature to obtain a first interactive image set;
a second feature extraction module 84, configured to perform feature extraction on the first interaction image set according to the first image model, to obtain a second image feature set;
and the splitting module 85 performs splitting processing on the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set.
Optionally, in some embodiments, the obtaining module 81 obtains a first clustering result of the image to be processed, including:
acquiring a second image model and a first clustering algorithm;
extracting features of the image to be processed according to the second image model to obtain image features corresponding to the image to be processed;
And clustering the image features according to the first clustering algorithm to obtain the first clustering result.
Optionally, in some embodiments, after the obtaining module 81 obtains the first clustering result, the method further includes:
determining a core sample set corresponding to each class in the first clustering result;
the first feature extraction module 82 performs feature extraction on the first clustering result according to the first image model, to obtain a first image feature set, including:
And extracting features of a core sample set corresponding to each class in the first clustering result according to the first image model to obtain a first image feature set.
Optionally, in some embodiments, the obtaining module 81 determines a set of core samples corresponding to each class in the first clustering result, including:
For each class, the following is performed:
According to the image characteristics of a plurality of images in the class, a first graph structure is constructed, each node in the first graph structure corresponds to one image characteristic, and the weight of an edge between two nodes corresponds to the cosine distance between the two image characteristics;
Removing edges, corresponding to cosine distances smaller than a first preset threshold, in the first graph structure;
For each node in the first graph structure, determining the proximity centrality of the node according to the number of neighbor nodes of the node, the cosine distance between the node and the neighbor nodes and the total number of nodes in the first graph structure;
And determining a set formed by the first images in the class as the core sample set corresponding to the class, wherein the approximate centrality of the node corresponding to the image feature of the first image in the first graph structure is greater than or equal to a second preset threshold.
Optionally, in some embodiments, the first image feature set includes a plurality of image feature sets, each image feature set corresponding to a class in the first clustering result; the merging module 83 performs merging processing on the first clustering result according to the first image feature set and the text feature to obtain a first interaction image set, where the merging module includes:
Determining a similarity between each image feature set and the text feature;
determining a target image feature set from the plurality of image feature sets, wherein the similarity between the target image feature set and the text feature is greater than or equal to a preset similarity threshold;
And merging the classes corresponding to the target image feature set to obtain the first interactive image set.
Optionally, in some embodiments, the merging module 83 determines a similarity between each image feature set and the text feature, including:
Determining cosine distances between each image feature set and the text features;
the merging module 83 determines a target image feature set from the plurality of image feature sets, including:
And determining a set, of the plurality of image feature sets, in which the cosine distance is greater than or equal to a third preset threshold value as the target image feature set.
Optionally, in some embodiments, the merging module 83 determines a cosine distance between each image feature set and the text feature, including:
for each image feature set, the following is performed:
For each image feature in the set of image features, determining a cosine distance between the image feature and the text feature;
And determining the cosine distance between the image feature set and the text feature according to the cosine distance between each image feature and the text feature.
Optionally, in some embodiments, the splitting module 85 performs splitting processing on the first interaction image set according to the second image feature set and the text feature to obtain a second interaction image set, including:
Determining target image features similar to the text features from the second image feature set according to the second image feature set and the text features;
Determining a second image corresponding to the target image feature from the first interactive image set according to the target image feature;
and determining the set of the second image as the second interactive image set.
Optionally, in some embodiments, the splitting module 85 determines, from the second image feature set and the text feature, a target image feature similar to the text feature from the second image feature set, including:
Constructing a second graph structure according to the second image feature set and the text feature, wherein the image feature and the text feature in the second image feature set correspond to nodes in the second graph structure, and the weight of an edge between two nodes corresponds to the cosine distance between the two features;
Removing edges with the cosine distance smaller than a fourth preset threshold value from the second graph structure;
starting from the node corresponding to the text feature, performing depth-first search in the second graph structure, and determining a target node which can be reached by search;
and determining the image characteristics corresponding to the target nodes to obtain target image characteristics similar to the text characteristics.
Optionally, in some embodiments, the apparatus further comprises an output module for:
If the images in the first interactive image set accord with the description of the target text, outputting the first interactive image set as a target clustering result; or,
And if the images in the second interactive image set accord with the description of the target text, outputting the second interactive image set as a target clustering result.
The image processing device 80 provided in the embodiment of the present application may also execute the methods of fig. 1 and fig. 2, and implement the functions of the image processing device in the embodiment shown in fig. 1 and fig. 2, which are not described herein again.
Fig. 9 is a schematic diagram of an image processing apparatus 90 according to an embodiment of the present application. Referring to fig. 9, in a software implementation, the image processing apparatus 90 may include: a receiving module 91, an acquiring module 92, a first feature extraction module 93, a merging module 94, a second feature extraction module 95, a splitting module 96 and a result return module 97, wherein:
a receiving module 91, which receives an image search request, wherein the image search request comprises target text, and the target text is used for representing the search intention of a user;
the acquisition module 92 acquires a first clustering result of the image to be processed;
The first feature extraction module 93 performs feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performs feature extraction on the target text according to a first text model to obtain text features, where the first image model is associated with the first text model;
The merging module 94 performs merging processing on the first clustering result according to the first image feature set and the text feature to obtain a first interaction image set;
The second feature extraction module 95 performs feature extraction on the first interaction image set according to the first image model to obtain a second image feature set;
The splitting module 96 is configured to split the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set;
The result returning module 97 returns, as a search result, an image set conforming to the description of the target text in the first interactive image set and the second interactive image set.
The image processing apparatus 90 provided in the embodiment of the present application may also execute the method of fig. 6 and implement the functions of the image processing apparatus in the embodiment shown in fig. 6, which is not described herein again.
In summary, the foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments of the present application are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

Claims (15)

1. An image processing method, comprising:
Acquiring a first clustering result of an image to be processed;
Performing feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performing feature extraction on a target text according to a first text model to obtain text features, wherein the target text is used for representing the clustering intention of a user, and the first image model is associated with the first text model;
combining the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
Extracting features of the first interactive image set according to the first image model to obtain a second image feature set;
And splitting the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set.
2. The method of claim 1, wherein the obtaining a first clustering result of the image to be processed comprises:
acquiring a second image model and a first clustering algorithm;
extracting features of the image to be processed according to the second image model to obtain image features corresponding to the image to be processed;
And clustering the image features according to the first clustering algorithm to obtain the first clustering result.
3. The method of claim 1, wherein after the first clustering result is obtained, the method further comprises:
determining a core sample set corresponding to each class in the first clustering result;
Performing feature extraction on the first clustering result according to the first image model to obtain a first image feature set, including:
And extracting features of a core sample set corresponding to each class in the first clustering result according to the first image model to obtain a first image feature set.
4. The method of claim 3, wherein the determining a set of core samples corresponding to each class in the first cluster result comprises:
For each class, the following is performed:
According to the image characteristics of a plurality of images in the class, a first graph structure is constructed, each node in the first graph structure corresponds to one image characteristic, and the weight of an edge between two nodes corresponds to the cosine distance between the two image characteristics;
Removing edges, corresponding to cosine distances smaller than a first preset threshold, in the first graph structure;
For each node in the first graph structure, determining the proximity centrality of the node according to the number of neighbor nodes of the node, the cosine distance between the node and the neighbor nodes and the total number of nodes in the first graph structure;
And determining a set formed by the first images in the class as the core sample set corresponding to the class, wherein the approximate centrality of the node corresponding to the image feature of the first image in the first graph structure is greater than or equal to a second preset threshold.
5. The method of claim 1, wherein the first set of image features includes a plurality of sets of image features, each set of image features corresponding to a class in the first clustering result; the merging processing is performed on the first clustering result according to the first image feature set and the text feature to obtain a first interactive image set, including:
Determining a similarity between each image feature set and the text feature;
determining a target image feature set from the plurality of image feature sets, wherein the similarity between the target image feature set and the text feature is greater than or equal to a preset similarity threshold;
And merging the classes corresponding to the target image feature set to obtain the first interactive image set.
6. The method of claim 5, wherein said determining a similarity between each image feature set and the text feature comprises:
Determining cosine distances between each image feature set and the text features;
The determining a target image feature set from the plurality of image feature sets includes:
And determining a set, of the plurality of image feature sets, in which the cosine distance is greater than or equal to a third preset threshold value as the target image feature set.
7. The method of claim 6, wherein said determining cosine distances between each set of image features and the text features comprises:
for each image feature set, the following is performed:
For each image feature in the set of image features, determining a cosine distance between the image feature and the text feature;
And determining the cosine distance between the image feature set and the text feature according to the cosine distance between each image feature and the text feature.
8. The method of claim 1, wherein splitting the first set of interaction images according to the second set of image features and the text features to obtain a second set of interaction images comprises:
Determining target image features similar to the text features from the second image feature set according to the second image feature set and the text features;
Determining a second image corresponding to the target image feature from the first interactive image set according to the target image feature;
and determining the set of the second image as the second interactive image set.
9. The method of claim 8, wherein the determining a target image feature from the second set of image features that is similar to the text feature based on the second set of image features and the text feature comprises:
Constructing a second graph structure according to the second image feature set and the text feature, wherein the image feature and the text feature in the second image feature set correspond to nodes in the second graph structure, and the weight of an edge between two nodes corresponds to the cosine distance between the two features;
Removing edges with the cosine distance smaller than a fourth preset threshold value from the second graph structure;
starting from the node corresponding to the text feature, performing depth-first search in the second graph structure, and determining a target node which can be reached by search;
and determining the image characteristics corresponding to the target nodes to obtain target image characteristics similar to the text characteristics.
10. The method of claim 1, wherein the method further comprises:
If the images in the first interactive image set accord with the description of the target text, outputting the first interactive image set as a target clustering result; or,
And if the images in the second interactive image set accord with the description of the target text, outputting the second interactive image set as a target clustering result.
11. An image processing method, comprising:
Receiving an image search request, wherein the image search request comprises target text, and the target text is used for representing the search intention of a user;
Acquiring a first clustering result of an image to be processed;
Performing feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and performing feature extraction on the target text according to a first text model to obtain text features, wherein the first image model is associated with the first text model;
combining the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
Extracting features of the first interactive image set according to the first image model to obtain a second image feature set;
Splitting the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set;
And returning the image set conforming to the description of the target text in the first interactive image set and the second interactive image set as a search result.
12. An image processing apparatus, comprising:
The acquisition module acquires a first clustering result of the image to be processed;
the first feature extraction module is used for carrying out feature extraction on the first clustering result according to a first image model to obtain a first image feature set, and carrying out feature extraction on a target text according to a first text model to obtain text features, wherein the target text is used for representing the clustering intention of a user, and the first image model is associated with the first text model;
The merging module is used for merging the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
the second feature extraction module is used for carrying out feature extraction on the first interactive image set according to the first image model to obtain a second image feature set;
And the splitting module is used for splitting the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set.
13. An image processing apparatus, comprising:
the receiving module is used for receiving an image search request, wherein the image search request comprises target text, and the target text is used for representing the search intention of a user;
the acquisition module acquires a first clustering result of the image to be processed;
the first feature extraction module is used for carrying out feature extraction on the first clustering result according to a first image model to obtain a first image feature set, carrying out feature extraction on the target text according to a first text model to obtain text features, and associating the first image model with the first text model;
The merging module is used for merging the first clustering results according to the first image feature set and the text features to obtain a first interactive image set;
the second feature extraction module is used for carrying out feature extraction on the first interactive image set according to the first image model to obtain a second image feature set;
the splitting module is used for splitting the first interactive image set according to the second image feature set and the text feature to obtain a second interactive image set;
and the result returning module returns the image set which accords with the description of the target text in the first interactive image set and the second interactive image set as a search result.
14. An electronic device, comprising:
A processor;
A memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 11.
15. A computer readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of claims 1 to 11.